
[Core] Optimize HashMap #90082

Closed
Nazarwadim wants to merge 1 commit into godotengine:master from Nazarwadim:hash_map_optimize

Conversation

@Nazarwadim
Contributor

@Nazarwadim Nazarwadim commented Mar 31, 2024

Benchmarks

Compile command: scons -j11 target=template_release scu_build=true

HashMap<int, int>

| Elements | Time find before (nsec) | Time find after (nsec) |
| --- | --- | --- |
| 10 | 685 | 453 |
| 50 | 672 | 707 |
| 250 | 2542 | 2330 |
| 1250 | 13065 | 8843 |
| 6250 | 55934 | 43017 |
| 31250 | 373167 | 293621 |
| 156250 | 1905111 | 1500588 |
| 781250 | 14192621 | 11117365 |
| 3906250 | 82247888 | 66698464 |
| 19531250 | 441888032 | 364882432 |

| Elements | Time insert before (usec) | Time insert after (usec) |
| --- | --- | --- |
| 5 | 2 | 0 |
| 25 | 1 | 0 |
| 125 | 14 | 4 |
| 625 | 34 | 18 |
| 3125 | 162 | 95 |
| 15625 | 808 | 723 |
| 78125 | 3953 | 3477 |
| 390625 | 39874 | 28357 |
| 1953125 | 188333 | 154200 |
| 9765625 | 970730 | 758132 |

Notes: These benchmarks were conducted using the reserve method, since the map's expansion points changed after replacing fastmod with &. Five times more elements were reserved than were inserted.
In the find benchmark, 50% of the looked-up elements are not in the table, which gives a 50% miss rate, and all of these elements are looked up. The time therefore grows in proportion to the number of elements.

Reducing the size of the compiled binary
Comparing the binary generated by CI for this branch against master, the binary is about 0.1 - 1 MB smaller.
Most likely this is due to the _FORCE_INLINE_ that I added for some methods.

Replacing fastmod with &
I completely replaced fastmod with &.
In my measurements, & is 5 times faster than fastmod and 10 times faster than %. Also, keep in mind that fastmod is not supported on 32-bit architectures.

How does & work?
We can instantly compute 141 % 100 by simply discarding everything above the hundreds place: the remainder is 41. The same holds for 1267 % 10 = 7 or 4 % 10000 = 4. Computers can do the same thing in binary, which is what I did.

The necessary condition is that the divisor (the number we take the remainder by) must be a power of two, 2^n.
Let's represent such a number in binary, e.g. 00100.
To find the remainder of, say, a hash of 11110, we subtract 1 from the divisor and then do a bitwise AND:
11110 & 00011 = 00010. And that's it.
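As a minimal self-contained sketch of this trick (the function name is mine, not the PR's):

```cpp
#include <cstdint>

// For a power-of-two capacity, hash % capacity equals hash & (capacity - 1):
// subtracting 1 turns e.g. 00100 into 00011, and AND keeps only the low bits.
inline uint32_t mod_pow2(uint32_t hash, uint32_t capacity) {
	return hash & (capacity - 1);
}
```

The compiler emits a single AND instruction here, versus a multiply-and-shift sequence for fastmod or a division for %.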

How was & implemented?
To reduce the number of operations in the most frequently used parts of the code, I made capacity one less than in the other containers, while the get_capacity method still returns the normal capacity. Next, we need to make sure our number satisfies the 2^n - 1 condition. In the _resize_and_rehash method, I use

capacity = next_power_of_2(capacity) - 1;

to satisfy this condition. Before that, a loop was used (in the reserve method) to fit the number to the hash table array.
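A sketch of how this capacity handling fits together, with next_power_of_2 written out using the standard bit-smearing trick (the struct and member names are illustrative, not Godot's actual code):

```cpp
#include <cstdint>

// Round up to the next power of two (standard bit-smearing trick).
inline uint32_t next_power_of_2(uint32_t x) {
	x--;
	x |= x >> 1;
	x |= x >> 2;
	x |= x >> 4;
	x |= x >> 8;
	x |= x >> 16;
	return x + 1;
}

// Capacity is stored as 2^n - 1, so the hot path can mask directly
// without recomputing capacity - 1 on every probe.
struct CapacitySketch {
	uint32_t capacity = 0; // stored as 2^n - 1

	void resize(uint32_t requested) {
		capacity = next_power_of_2(requested) - 1;
	}
	uint32_t position_for(uint32_t hash) const {
		return hash & capacity; // no "- 1" needed here
	}
	uint32_t get_capacity() const {
		return capacity + 1; // the externally visible capacity
	}
};
```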

Removing unnecessary hash calculations
The insert method previously had 2 _hash calculations and operator[] had 3 in total.
I removed the hash calculation from the _insert method and renamed it to _insert_with_hash. I renamed the other method, which was previously _insert_with_hash, to _insert_with_hash_and_element.
I also added the _look_up_with_hash method, which searches for a position using a previously calculated hash.
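To illustrate the shape of this refactor, here is a toy open-addressing map where operator[] hashes the key exactly once and passes the result to the helpers. The helper names follow the PR description, but the body is a simplified sketch (no resizing, works while the table is not full), not Godot's implementation:

```cpp
#include <cstdint>
#include <vector>

class TinyMap {
	static constexpr uint32_t CAP = 16; // power of two; mask is CAP - 1
	struct Slot {
		bool used = false;
		int key = 0;
		int value = 0;
	};
	std::vector<Slot> slots = std::vector<Slot>(CAP);

	static uint32_t _hash(int key) {
		return static_cast<uint32_t>(key) * 2654435761u;
	}
	int _lookup_with_hash(uint32_t h, int key) const {
		for (uint32_t pos = h & (CAP - 1);; pos = (pos + 1) & (CAP - 1)) {
			if (!slots[pos].used) {
				return -1; // hit an empty slot: key is absent
			}
			if (slots[pos].key == key) {
				return static_cast<int>(pos);
			}
		}
	}
	uint32_t _insert_with_hash(uint32_t h, int key, int value) {
		uint32_t pos = h & (CAP - 1);
		while (slots[pos].used) {
			pos = (pos + 1) & (CAP - 1); // linear probing
		}
		slots[pos] = Slot{true, key, value};
		return pos;
	}

public:
	int &operator[](int key) {
		uint32_t h = _hash(key); // hashed exactly once per call
		int pos = _lookup_with_hash(h, key);
		if (pos < 0) {
			pos = static_cast<int>(_insert_with_hash(h, key, 0));
		}
		return slots[pos].value;
	}
};
```

Without the refactor, operator[] would call _hash, then a lookup that hashes again, then an insert that hashes a third time.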

Improving hash quality for strings
Using hash_fmix32 on the calculated hash for strings greatly reduced the number of collisions. This can be seen by comparing the results of the following code.
GDScript implementation:

var map = {}
var start = Time.get_ticks_usec()
for i in range(10000):
	map[str(i)] = 4
for i in range(10000):
	var a = map[str(i)]
print(Time.get_ticks_usec() - start)

C++ implementation:

HashMap<String, long> map;
uint64_t start = OS::get_singleton()->get_ticks_usec();
for (int i = 0; i < 10000; i++) {
	map[itos(i)] = 4;
}
uint64_t end = OS::get_singleton()->get_ticks_usec();
print_line(end - start);

With a hashmap implementation that differs only in the hasher, this gives a 30% performance improvement in GDScript and a 213% improvement in C++.
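For reference, the finalizer in question is MurmurHash3's fmix32, which (to my understanding) Godot's hash_fmix32 follows; shown here for illustration:

```cpp
#include <cstdint>

// MurmurHash3's fmix32 finalizer: each xor-shift/multiply round diffuses
// low-entropy input bits into all 32 output bits, so sequential keys like
// "1", "2", "3" no longer cluster in nearby buckets.
inline uint32_t fmix32(uint32_t h) {
	h ^= h >> 16;
	h *= 0x85ebca6b;
	h ^= h >> 13;
	h *= 0xc2b2ae35;
	h ^= h >> 16;
	return h;
}
```

Because every step is invertible, fmix32 is a bijection on 32-bit values: it adds no new collisions of its own, it only spreads existing hashes more evenly.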

Improving the Comparator for floating-point numbers
Using a bitwise comparison instead of a normal one, together with a NaN check, significantly reduces comparison time.
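A hedged sketch of the idea (my own illustration, not the PR's code): compare the raw bit patterns with a single integer comparison, plus an explicit NaN check so NaN keys match themselves. Note that a pure bitwise comparison also distinguishes +0.0 from -0.0, which a real implementation has to decide how to handle:

```cpp
#include <cstdint>
#include <cstring>
#include <cmath>

// Bitwise equality for float keys: one integer compare in the common case,
// with a NaN fallback so all NaNs compare equal (dictionary semantics).
inline bool float_keys_equal(float a, float b) {
	uint32_t ua, ub;
	std::memcpy(&ua, &a, sizeof(ua)); // type-pun safely via memcpy
	std::memcpy(&ub, &b, sizeof(ub));
	if (ua == ub) {
		return true;
	}
	return std::isnan(a) && std::isnan(b);
}
```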

Closes: #98598

@Nazarwadim Nazarwadim force-pushed the hash_map_optimize branch 2 times, most recently from 5dec562 to 164768d Compare April 1, 2024 20:19
@Nazarwadim Nazarwadim changed the title [Core] Optimize HashMap operator[] and insert method [Core] Optimize HashMap Apr 1, 2024
@Nazarwadim
Contributor Author

Nazarwadim commented Apr 1, 2024

Now it looks like this.
Testing 100000 rand() insertions:
[benchmark screenshot]
Average before - after: 1779.9
Average % decrease: 16.42%

Edit: The *getptr() test was incorrect, because after testing insertion I immediately deleted the table.

Testing 100000 rand() *getptr():
[benchmark screenshot]
Average before - after: 162.6
Average % decrease: 6.22%

@MewPurPur
Contributor

What is measured in "before" and "after"?

@Nazarwadim
Contributor Author

What is measured in "before" and "after"?

Master branch - before.
My branch - after.

@AThousandShips
Member

No, what is being measured? What are the units?

@Nazarwadim
Contributor Author

No, what is being measured? What are the units?

usec (microseconds)

@Nazarwadim
Contributor Author

Nazarwadim commented Apr 2, 2024

The benchmark code I used:

HashMap1<int, int> before;
HashMap<int, int> after;
int found1;
int found2;
uint32_t count = 100000;
uint32_t find_count = 100000;
void test_hash_map() {
	uint64_t start1 = OS::get_singleton()->get_ticks_usec();
	for (uint32_t i = 0; i < count; i++) {
		before[rand()] = rand();
	}
	uint64_t end1 = OS::get_singleton()->get_ticks_usec();
	uint64_t start2 = OS::get_singleton()->get_ticks_usec();
	for (uint32_t i = 0; i < count; i++) {
		after[rand()] = rand();
	}
	uint64_t end2 = OS::get_singleton()->get_ticks_usec();
	print_line("Time inserting before:", end1 - start1);
	print_line("Time inserting after:", end2 - start2);

	uint64_t start3 = OS::get_singleton()->get_ticks_usec();

	for (uint32_t i = 0; i < find_count; i++) {
		auto e = before.getptr(rand());
		if (e) {
			found1 = *e;
		}
	}

	uint64_t end3 = OS::get_singleton()->get_ticks_usec();
	before.clear();
	uint64_t start4 = OS::get_singleton()->get_ticks_usec();

	for (uint32_t i = 0; i < find_count; i++) {
		auto e = after.getptr(rand());
		if (e) {
			found2 = *e;
		}
	}
	uint64_t end4 = OS::get_singleton()->get_ticks_usec();
	after.clear();
	print_line("Time find before:", end3 - start3);
	print_line("Time find after:", end4 - start4);
}

@kleonc
Member

kleonc commented Apr 2, 2024

benchmark code:
(...)

Each before/after case should be run on the same dataset; otherwise the before/after execution times might differ not only because of the changes made in this PR, but also because of, e.g., different hash collisions for the input datasets.

Like in the example above, before/after should be run with the same RNG seed.

@Nazarwadim
Contributor Author

Nazarwadim commented Apr 2, 2024

Like in the example above, before/after should be run with the same RNG seed.

I don't think it will make much of a difference, but I'm going to try it now.
Each test has its own RNG with a seed of 0.

	RandomPCG random1;
	random1.seed(0);
	RandomPCG random2;
	random2.seed(0);
	RandomPCG random3;
	random3.seed(0);
	RandomPCG random4;
	random4.seed(0);

[two benchmark screenshots]

@AThousandShips
Member

AThousandShips commented Apr 2, 2024

getptr getting worse twice, by a not-insignificant amount, is worrying; to me it implies something about these improvements is broken. There's no reason for it to become slower.

Try restoring fastmod and see what changes. I have a feeling that is the cause; I don't think it can simply be removed.

I don't even see how the replacement code is the same, so I don't think that's a valid change.

@Nazarwadim
Contributor Author

Try restoring fastmod and see what changes, I have a feeling that is the cause, I don't think it can be simply removed

fastmod and my version work exactly the same. I ran benchmarks and my version is 2 times faster. Also, it doesn't degrade by half, but only by about 5%. As you can see, after I started using RandomPCG, insertion degraded by half and lookup by a factor of 3. I think the best thing to do would be to create a random array, insert its elements into the table, and check the results; then iterate over the array looking up those elements in the table.

@AThousandShips
Member

AThousandShips commented Apr 2, 2024

Also, it doesn't degrade by half, but only by about 5%

I know? I didn't say it did? I said it was a reduction, and it doesn't make sense if the input is exactly the same.

So I'd say this either means you're not comparing fairly, or the changes aren't trivial

Also, I'd say getptr should be tested with a reasonable range of values; just using plain random will make it miss almost always. So please show your updated testing code; I think it needs to be changed or it won't give usable results.

@Nazarwadim
Contributor Author

I don't even see how the replacement code is the same, do I don't think that's a valid change

For example

capacity = 10;
hash = 15;
pos = hash % capacity; // previously fastmod; pos = 5

// then in the loop we do something like this

pos = (pos + 1) % capacity; // 6 % 10 = 6
// after a few iterations pos = 9;
pos = (pos + 1) % capacity; // 10 % 10 = 0

@AThousandShips
Member

AThousandShips commented Apr 2, 2024

I'd say the getptr tests should use rand() % (2 * count) or even based on the actual size, or you'll have just a 3125:134217728 chance of a hit (or 0.0023%), so it's essentially all misses, which, granted, gives a reference for those, but the hit case is arguably much more important as it varies much more

@Nazarwadim
Contributor Author

HashMap1<int, int> before;
HashMap<int, int> after;

Vector<int> rand_vec;
int found1;
int found2;
uint32_t count = 1000000;

void insert_random_variables_in_vec() {
	for (uint32_t i = 0; i < 2 * count; i++) { // should be 2 * count to benchmark misses
		rand_vec.push_back(rand());
	}
}

void test_hash_map() {
	for (int j = 0; j < 10; j++) {
		before.clear();
		uint64_t start1 = OS::get_singleton()->get_ticks_usec();
		for (uint32_t i = 0; i < count; i++) {
			before[rand_vec[i]] = rand_vec[i] * rand_vec[i];
		}
		uint64_t end1 = OS::get_singleton()->get_ticks_usec();

		print_line("Time inserting before:", end1 - start1);
	}

	for (int j = 0; j < 10; j++) {
		after.clear();
		uint64_t start2 = OS::get_singleton()->get_ticks_usec();
		for (uint32_t i = 0; i < count; i++) {
			after[rand_vec[i]] = rand_vec[i] * rand_vec[i];
		}
		uint64_t end2 = OS::get_singleton()->get_ticks_usec();
		print_line("Time inserting after:", end2 - start2);
	}

	print_line(" ");

	for (int j = 0; j < 10; j++) {
		uint64_t start3 = OS::get_singleton()->get_ticks_usec();

		for (uint32_t i = 0; i < rand_vec.size(); i++) {
			auto e = before.getptr(rand_vec[i]);
			if (e) {
				found1 = *e;
			}
		}
		uint64_t end3 = OS::get_singleton()->get_ticks_usec();
		print_line("Time find before:", end3 - start3);
	}

	for (int j = 0; j < 10; j++) {
		uint64_t start4 = OS::get_singleton()->get_ticks_usec();

		for (uint32_t i = 0; i < rand_vec.size(); i++) {
			auto e = after.getptr(rand_vec[i]);
			if (e) {
				found2 = *e;
			}
		}
		uint64_t end4 = OS::get_singleton()->get_ticks_usec();
		print_line("Time find after:", end4 - start4);
	}

}

Console output:

Time inserting before: 98877
Time inserting before: 140672
Time inserting before: 137699
Time inserting before: 136475
Time inserting before: 136530
Time inserting before: 142960
Time inserting before: 136449
Time inserting before: 136262
Time inserting before: 137182
Time inserting before: 136673
Time inserting after: 87340
Time inserting after: 130727
Time inserting after: 130751
Time inserting after: 130086
Time inserting after: 129933
Time inserting after: 130118
Time inserting after: 131152
Time inserting after: 129826
Time inserting after: 129644
Time inserting after: 130364
 
Time find before: 66767
Time find before: 64331
Time find before: 74998
Time find before: 63937
Time find before: 64128
Time find before: 64302
Time find before: 65063
Time find before: 64686
Time find before: 64487
Time find before: 65016
Time find after: 64541
Time find after: 63817
Time find after: 63936
Time find after: 64024
Time find after: 64464
Time find after: 64537
Time find after: 64476
Time find after: 64392
Time find after: 64001
Time find after: 64075

@Nazarwadim
Contributor Author

Nazarwadim commented Apr 2, 2024

You're still looking up at an infinitesimal chance of hitting

No, I iterate over the array whose elements I have already inserted.

if (e) { 
	found1 = *e;
	counter++;
}
.......................
print_line(counter);

Output: 10004690

@AThousandShips
Member

A graph would be good to make it manageable :) but this looks much better.

@Nazarwadim
Contributor Author

Nazarwadim commented Apr 2, 2024

[four benchmark graphs]

@AThousandShips
Member

That looks much more reasonable performance-wise.

Member

@RandomShaper RandomShaper left a comment

  1. Now that we're optimizing, wouldn't (un)likely() help in some checks? I'm specifically talking about the various exists/!exists.
  2. Regarding fastmod():
    1. I see not all uses have been replaced. Is that intended?
    2. Is every implementation of fastmod() problematic? Checking its implementation, I wonder if the replacement is better for every compiler/arch. If the replacement algo is superior in some/all cases, I'd rather patch the function and keep the calls.

Member

@hpvb hpvb left a comment

The NaN checks are critical to the semantics of the dictionary.

@btarg

btarg commented Jan 15, 2025

This looks great, hoping to see it in 4.4!

@Repiteo Repiteo modified the milestones: 4.4, 4.x Jan 20, 2025
@Repiteo
Contributor

Repiteo commented Mar 10, 2025

Needs rebase

@Repiteo Repiteo requested a review from hpvb March 10, 2025 14:59
Removed unnecessary calls to the _hash method.
Replaced fastmod with &.
Removed a lot of unnecessary checks.
Optimized _look_up by moving checks if there are no collisions.
Added _FORCE_INLINE_ for is_empty, _lookup, _insert_with_hash_and_element methods.
Added fast 0.75 multiplier.
Used memset.
Delete allocations for elements without iterating on hashes.
Added a check if there are many collisions (DEV).
Fixed a lot of collisions in strings.
Move hash functions to cpp file.
Add hash_mixed method to StringName and NodePath.
@akien-mga akien-mga requested a review from Ivorforce March 28, 2025 12:34
Member

@Ivorforce Ivorforce left a comment

This PR is very promising, and the performance improvements you found are very attractive indeed!

However, I find it hard to review this PR because it's so many changes at once:

  • Using fmix_32 to finalize hashes, for reduced collisions.
  • Using & instead of a modulo function, for faster execution.
  • Explicit instantiation of several HashMap types (this explains your binary size reduction, not _FORCE_INLINE_).
  • Refactor of HashMap to avoid repeated hash() calls.

I think these are good proposals, but all put together, they make for a very complex PR that is hard to estimate repercussions of. This is likely the reason it's been on the road for so long, despite promising such good improvements.
Personally, I'd prefer if you split this into 4 separate PRs (or at least, separate commits), making for incremental improvements we can track. I do not feel comfortable giving this PR an approval if I cannot fully understand it.

}
}

// Used by HashMapHasherDefault. Don`t use it as seed to create other hash.
Member

Suggested change
// Used by HashMapHasherDefault. Don`t use it as seed to create other hash.
// Returns the hash finalized by mixing. To combine this StringName into a hash, use `hash()` instead.

// Used by HashMapHasherDefault. Don`t use it as seed to create other hash.
_FORCE_INLINE_ uint32_t hash_mixed() const {
if (unlikely(_data == nullptr)) {
return get_empty_hash();
Member

Does this not have to be mixed? It should return a plain djb2 hash, no?

Comment on lines +45 to +52
// Explicit instantiation.
template class HashMap<int, int>;
template class HashMap<String, int>;
template class HashMap<StringName, bool>;
template class HashMap<StringName, StringName>;
template class HashMap<StringName, String>;
template class HashMap<StringName, Variant>;
template class HashMap<StringName, int>;
Member

I believe explicit HashMap instantiation is out of scope for this PR.
(even though it may be a good idea)

// Must be a power of two.
static constexpr uint32_t INITIAL_CAPACITY = 16;
static constexpr uint32_t EMPTY_HASH = 0;
static_assert(EMPTY_HASH == 0, "EMPTY_HASH must always be 0 for the memcpy() optimization.");
Member

Suggested change
static_assert(EMPTY_HASH == 0, "EMPTY_HASH must always be 0 for the memcpy() optimization.");
static_assert(EMPTY_HASH == 0, "EMPTY_HASH must always be 0 for the memset() optimization.");

@DeeJayLSP
Contributor

DeeJayLSP commented Apr 17, 2025

I did a test using this project, which has a GDScript stress test.

On a production editor build, after spawning 14000 bunnies I got the following frame rates:

| master | PR |
| --- | --- |
| 77-81 | 87-91 |

Note that GDScript uses a lot of HashMaps, so this gain is significant.

GDScriptFunction::call() is probably one of the most demanding functions in the engine, and this is one of the few optimizations that results in it being a little faster.

It also seems Linux editor binary size has been decreased by 1072 KiB.

Combining this with #94118 bumps the FPS further, up to 92-95, as long as the modified HashMap line is static, but for some reason only if you don't modify the thread_local Vector; otherwise the gains from this change are nullified.

@Ivorforce
Member

Ivorforce commented Apr 24, 2025

@Nazarwadim Are you interested in maintaining this PR further? I'm interested to get these changes merged, but I'm still hesitant about merging them all at once.

@DeeJayLSP
Contributor

DeeJayLSP commented Apr 24, 2025

Needs rebase after #104985 merge.

I also need to point out that, while testing some builds, I got no GDScript performance improvement without the explicit instantiation.

@Nazarwadim
Contributor Author

@Ivorforce I'll try to look at this PR this weekend or next.

I'd also like you to look at #97016, as in my benchmarks the performance gain for HashMap was at the same level as in this PR.

@Ivorforce
Member

Ivorforce commented Apr 24, 2025

Awesome!
That PR wasn't on my radar before. I'll have a look when I find some time!

@Ivorforce
Member

Ivorforce commented May 14, 2025

Hiya, here's an update as I've been testing the features proposed and described here:

Removing unnecessary hash calculations
This led to a measurable improvement of performance. I extracted this change with #106020, which has been merged by now. Thanks!

Replacing fastmod with &
I've tested the effect of fastmod and found it to be a bottleneck (mostly in get code; it's about ~90% of the logic, as long as the RAM is in cache).
With &, we'd be giving up on prime-based hashmap sizes. I've read around a bit on the web, and the consensus seems to be "a paranoid hash map should use prime sizes, but it's not required if your hashes are good". I tested using a prime number of buckets for StringName (~20k entries), which did decrease the standard deviation slightly (good), though it did not measurably affect performance.
I'd argue that the performance difference has to be really good to justify giving up this collision-reducing technique. In practice, this just means we need benchmarks.
In either case, I was able to eliminate some fastmod calls in this PR: #106569

Improving hash quality for string
I tested this by applying fmix32 to hashes in StringName (~20k entries), because StringName uses the same hashing method as HashMapDefaultHasher. This increased the standard deviation slightly, which means an increase in collisions. It also measurably regressed StringName lookup performance (not sure why; fmix32 overhead, perhaps?). I don't think this is a good change (at least for a 2^16 modulo).

Using memset after reallocations
This led to a measurable improvement of performance. I've integrated this change (in a calloc based way, which should be even better) in another PR: #104124

Takeaway

There seem to be a lot of changes in this PR otherwise, so I'm not sure if I missed anything important.
It would be interesting to measure growth strategies for HashMap and find a suitable one, especially considering & vs a lookup table for hashes. Otherwise, I don't think all of the changes proposed here (beyond what I already extracted) are improvements. If you think there's anything else that might be interesting, I think it should be proposed in a new PR.

Edit: However, if you rebase on top of master (and #104124), and can still measure performance improvements, we can reconsider this verdict!

Successfully merging this pull request may close these issues.

Possible memory leak and/or data corruption in hash_map.h