Conversation
Force-pushed 5dec562 to 164768d
HashMap operator[] and insert method
What is measured in "before" and "after"?

Master branch is "before".

No: what is measured, what are the units?

usec (microseconds).
The benchmark code I used:

```cpp
HashMap1<int, int> before;
HashMap<int, int> after;
int found1;
int found2;
uint32_t count = 100000;
uint32_t find_count = 100000;

void test_hash_map() {
	uint64_t start1 = OS::get_singleton()->get_ticks_usec();
	for (uint32_t i = 0; i < count; i++) {
		before[rand()] = rand();
	}
	uint64_t end1 = OS::get_singleton()->get_ticks_usec();

	uint64_t start2 = OS::get_singleton()->get_ticks_usec();
	for (uint32_t i = 0; i < count; i++) {
		after[rand()] = rand();
	}
	uint64_t end2 = OS::get_singleton()->get_ticks_usec();

	print_line("Time inserting before:", end1 - start1);
	print_line("Time inserting after:", end2 - start2);

	uint64_t start3 = OS::get_singleton()->get_ticks_usec();
	for (uint32_t i = 0; i < find_count; i++) {
		auto e = before.getptr(rand());
		if (e) {
			found1 = *e;
		}
	}
	uint64_t end3 = OS::get_singleton()->get_ticks_usec();
	before.clear();

	uint64_t start4 = OS::get_singleton()->get_ticks_usec();
	for (uint32_t i = 0; i < find_count; i++) {
		auto e = after.getptr(rand());
		if (e) {
			found2 = *e;
		}
	}
	uint64_t end4 = OS::get_singleton()->get_ticks_usec();
	after.clear();

	print_line("Time find before:", end3 - start3);
	print_line("Time find after:", end4 - start4);
}
```
Each before/after case should be run on the same dataset; otherwise the before/after execution times might differ not only because of the changes made in this PR but also because of e.g. different hash collisions for the input datasets. In the example above, before/after should be run with the same RNG seed.
I don't think it would have made much of a difference, but I'm going to try it now.

I don't even see how the replacement code is the same, so I don't think that's a valid change.
I know? I didn't say it did? I said it was a reduction, and it doesn't make sense if the input is exactly the same. So I'd say this either means you're not comparing fairly, or the changes aren't trivial. For example:

```cpp
capacity = 10;
hash = 15;
pos = hash % capacity; // first fastmod, now pos = 5
// then in the loop we do something like this
pos = (pos + 1) % capacity; // 6 % 10 = 6
// after a few iterations pos = 9
pos = (pos + 1) % capacity; // 10 % 10 = 0
```
I'd say the getptr tests should use:

```cpp
HashMap1<int, int> before;
HashMap<int, int> after;
Vector<int> rand_vec;
int found1;
int found2;
uint32_t count = 1000000;

void insert_random_variables_in_vec() {
	for (int i = 0; i < 2 * count; i++) { // should be 2 * count to benchmark misses
		rand_vec.push_back(rand());
	}
}

void test_hash_map() {
	for (int j = 0; j < 10; j++) {
		before.clear();
		uint64_t start1 = OS::get_singleton()->get_ticks_usec();
		for (uint32_t i = 0; i < count; i++) {
			before[rand_vec[i]] = rand_vec[i] * rand_vec[i];
		}
		uint64_t end1 = OS::get_singleton()->get_ticks_usec();
		print_line("Time inserting before:", end1 - start1);
	}
	for (int j = 0; j < 10; j++) {
		after.clear();
		uint64_t start2 = OS::get_singleton()->get_ticks_usec();
		for (uint32_t i = 0; i < count; i++) {
			after[rand_vec[i]] = rand_vec[i] * rand_vec[i];
		}
		uint64_t end2 = OS::get_singleton()->get_ticks_usec();
		print_line("Time inserting after:", end2 - start2);
	}
	print_line(" ");
	for (int j = 0; j < 10; j++) {
		uint64_t start3 = OS::get_singleton()->get_ticks_usec();
		for (uint32_t i = 0; i < rand_vec.size(); i++) {
			auto e = before.getptr(rand_vec[i]);
			if (e) {
				found1 = *e;
			}
		}
		uint64_t end3 = OS::get_singleton()->get_ticks_usec();
		print_line("Time find before:", end3 - start3);
	}
	for (int j = 0; j < 10; j++) {
		uint64_t start4 = OS::get_singleton()->get_ticks_usec();
		for (uint32_t i = 0; i < rand_vec.size(); i++) {
			auto e = after.getptr(rand_vec[i]);
			if (e) {
				found2 = *e;
			}
		}
		uint64_t end4 = OS::get_singleton()->get_ticks_usec();
		print_line("Time find after:", end4 - start4);
	}
}
```

Console output:
No, I iterate over the array whose elements I have already inserted.

```cpp
if (e) {
	found1 = *e;
	counter++;
}
// .......................
print_line(counter);
```

Output: 10004690
A graph would be good to make it manageable :) but it looks much better.

That looks much more reasonable performance-wise.
RandomShaper
left a comment
- Now we're optimizing, wouldn't `(un)likely()` help in some checks? I'm specifically talking about the various `exists`/`!exists`.
- Regarding `fastmod()`:
  - I see not all uses have been replaced. Is that intended?
  - Is every implementation of `fastmod()` problematic? Checking its implementation, I wonder if the replacement is better for every compiler/arch. If the replacement algo is superior in some/all cases, I'd rather patch the function and keep the calls.
Force-pushed ec750bf to e7c82db
hpvb
left a comment
The NaN checks are critical to the semantics of the dictionary.
Force-pushed e7c82db to e2deba5
This looks great, hoping to see it in 4.4!

Needs rebase
- Removed unnecessary calls to the `_hash` method.
- Replaced `fastmod` with `&`.
- Removed a lot of unnecessary checks.
- Optimized `_look_up` by moving checks for the case where there are no collisions.
- Added `_FORCE_INLINE_` for the `is_empty`, `_lookup`, and `_insert_with_hash_and_element` methods.
- Added a fast 0.75 multiplier.
- Used `memset`.
- Deleted allocations for elements without iterating over hashes.
- Added a check for whether there are many collisions (DEV).
- Fixed a lot of collisions in strings.
- Moved hash functions to the cpp file.
- Added a `hash_mixed` method to `StringName` and `NodePath`.
Force-pushed e2deba5 to b2f94ee
This PR is very promising, and the performance improvements you found are very attractive indeed!
However, I find it hard to review this PR because it's so many changes at once:
- Using `fmix_32` to finalize hashes, for reduced collisions.
- Using `&` instead of a modulo function, for faster execution.
- Explicit instantiation of several `HashMap` types (this explains your binary size reduction, not `_FORCE_INLINE_`).
- Refactor of `HashMap` to avoid repeated `hash()` calls.
I think these are good proposals, but all put together, they make for a very complex PR whose repercussions are hard to estimate. This is likely the reason it's been on the road for so long, despite promising such good improvements.
Personally, I'd prefer if you split this into 4 separate PRs (or at least, separate commits), making for incremental improvements we can track. I do not feel comfortable giving this PR an approval if I cannot fully understand it.
Suggested change:

```diff
-// Used by HashMapHasherDefault. Don`t use it as seed to create other hash.
+// Returns the hash finalized by mixing. To combine this StringName into a hash, use `hash()` instead.
```

On:

```cpp
// Used by HashMapHasherDefault. Don`t use it as seed to create other hash.
_FORCE_INLINE_ uint32_t hash_mixed() const {
	if (unlikely(_data == nullptr)) {
		return get_empty_hash();
```

Does this not have to be mixed? It should return a plain djb2 hash, no?
```cpp
// Explicit instantiation.
template class HashMap<int, int>;
template class HashMap<String, int>;
template class HashMap<StringName, bool>;
template class HashMap<StringName, StringName>;
template class HashMap<StringName, String>;
template class HashMap<StringName, Variant>;
template class HashMap<StringName, int>;
```

I believe explicit HashMap instantiation is out of scope for this PR.
(even though it may be a good idea)
```cpp
// Must be a power of two.
static constexpr uint32_t INITIAL_CAPACITY = 16;
static constexpr uint32_t EMPTY_HASH = 0;
static_assert(EMPTY_HASH == 0, "EMPTY_HASH must always be 0 for the memcpy() optimization.");
```

Suggested change:

```diff
-static_assert(EMPTY_HASH == 0, "EMPTY_HASH must always be 0 for the memcpy() optimization.");
+static_assert(EMPTY_HASH == 0, "EMPTY_HASH must always be 0 for the memset() optimization.");
```
I did a test using this project, which has a GDScript stress test. On a production editor build, after spawning 14000 bunnies I got the following frame rates:
Note that GDScript uses a lot of HashMaps, so this gain is significant.
It also seems the Linux editor binary size has decreased by 1072 KiB. Combining this with #94118 bumps the FPS further up to 92-95, as long as the modified HashMap line is
@Nazarwadim Are you interested in maintaining this PR further? I'm interested to get these changes merged, but I'm still hesitant about merging them all at once.

Needs rebase after #104985 merge. I also need to point out that, while testing some builds, I got no GDScript performance improvement without the explicit instantiation.

@Ivorforce I'll try to look at this PR this weekend or next. I'd also like you to look at #97016, as in my benchmarks the performance gain for HashMap was at the same level as in this PR.

Awesome!
Hiya, here's an update as I've been testing the features proposed and described here:
- Removing unnecessary hash calculations
- Replacing fastmod with `&`
- Improving hash quality for strings
- Using memset after reallocations

Takeaway
There seem to be a lot of changes in this PR otherwise, so I'm not sure if I missed anything important.
Edit: However, if you rebase on top of master (and #104124), and can still measure performance improvements, we can reconsider this verdict!








Benchmarks
Compile command:
```
scons -j11 target=template_release scu_build=true
```

HashMap<int, int>
Notes: These benchmarks were conducted using the `reserve` method, as there were different map expansion points due to the replacement of `fastmod` with `&`. There were 5 times more elements reserved than were inserted.

In the find benchmark, 50% of the elements are not in the table itself, which gives 50% misses. It also looks up all of these elements, so the increase in time is proportional to the increased number of elements.

Reducing the size of the compiled binary
If you compare the binary generated by CI with mine and with master, you can see that the binary is about 0.1 - 1 MB smaller. Most likely, this is due to the `_FORCE_INLINE_` that I added for some methods.

Replacing fastmod with &
I completely replaced `fastmod` with `&`. As I checked, `&` is 5 times faster than `fastmod` and 10 times faster than `%`. Also, keep in mind that `fastmod` is not supported on 32-bit architectures.

How does & work?

We can instantly calculate `141 % 100` by simply discarding everything except the last two digits; the remainder of the division is `41`. The same is true for `1267 % 10 = 7` or `4 % 10000 = 4`. For computers, the same trick works in binary. The necessary condition is that the number we take the remainder by must be `2^n`.

Let's represent such a number in binary, e.g. `00100`. To find the remainder of, say, a hash of `11110`, we subtract 1 from our number and then do a bitwise AND: `11110 & 00011 = 00010`. And that's it.
How was & implemented?

In order to reduce the number of operations in the most frequently used parts of the code, I made `capacity` one less than the one used in other containers. But the `get_capacity` method will return the normal capacity. Next, we need to make sure that the `2^n - 1` condition holds for our number: in the `_resize_and_rehash` method, I use `capacity = next_power_of_2(capacity) - 1;` to satisfy it. Before that, a loop was used to fit the number for the hash table array (in the `reserve` method).

Removing unnecessary hash calculations
The `insert` method previously had 2 `_hash` calculations and `operator[]` had 3 in total. I removed the hash calculation from the `_insert` method and renamed this method to `_insert_with_hash`. I renamed the other method, which was `_insert_with_hash`, to `_insert_with_hash_and_element`. I also added the `_look_up_with_hash` method, which searches for a position with a previously calculated hash.

Improving hash quality for strings
Using `hash_fmix32` on the calculated hash for strings greatly reduced the number of collisions. This can be seen by comparing the results of this code.
GDScript implementation:
C++ implementation:
There is a 30% performance improvement in GDScript and a 213% improvement in C++ if the hash map implementations differ only by the hasher.
Improving Comparator for floating point numbers
Using a bitwise comparison instead of a normal one, together with a check for NaN, significantly reduces the comparison time.
Closes: #98598