Improve Node::get_children performance. #110571
Conversation
C++:

Control *ctrl = memnew(Control);
add_child(ctrl);
{
const int count = 100000;
const int internal_count = 50000;
for (int i = 0; i < internal_count; i++) {
ctrl->add_child(memnew(Control), false, INTERNAL_MODE_FRONT);
}
for (int i = 0; i < count; i++) {
ctrl->add_child(memnew(Control));
}
for (int i = 0; i < internal_count; i++) {
ctrl->add_child(memnew(Control), false, INTERNAL_MODE_BACK);
}
// Validate cache.
int child_count = ctrl->get_children().size();
print_line(vformat("Children count = %d", child_count));
}
{
int time = Time::get_singleton()->get_ticks_msec();
int count = ctrl->get_children().size();
print_line(vformat("Before: get_children, internal=true: %d ms", Time::get_singleton()->get_ticks_msec() - time));
print_line(vformat("Children count = %d", count));
}
{
int time = Time::get_singleton()->get_ticks_msec();
int count = ctrl->get_children_fast().size();
print_line(vformat("After: get_children, internal=true: %d ms", Time::get_singleton()->get_ticks_msec() - time));
print_line(vformat("Children count = %d", count));
}
{
int time = Time::get_singleton()->get_ticks_msec();
int count = ctrl->get_children(false).size();
print_line(vformat("Before: get_children, internal=false: %d ms", Time::get_singleton()->get_ticks_msec() - time));
print_line(vformat("Children count = %d", count));
}
{
int time = Time::get_singleton()->get_ticks_msec();
int count = ctrl->get_children_fast(false).size();
print_line(vformat("After: get_children, internal=false: %d ms", Time::get_singleton()->get_ticks_msec() - time));
print_line(vformat("Children count = %d", count));
}
remove_child(ctrl);
memdelete(ctrl);

GDScript:

var internal_count: int = 10000
var count: int = 10000
for i in internal_count:
add_child(Control.new(), false, Node.INTERNAL_MODE_FRONT)
for i in count:
add_child(Control.new())
for i in internal_count:
add_child(Control.new(), false, Node.INTERNAL_MODE_BACK)
# Validate cache.
get_children()
var time = Time.get_ticks_msec()
get_children(true)
prints("Before: include internal: ", Time.get_ticks_msec() - time)
time = Time.get_ticks_msec()
get_children_fast(true)
prints("After: include internal: ", Time.get_ticks_msec() - time)
time = Time.get_ticks_msec()
get_children()
prints("Before: exclude internal: ", Time.get_ticks_msec() - time)
time = Time.get_ticks_msec()
get_children_fast()
prints("After: exclude internal: ", Time.get_ticks_msec() - time)
Good find! But I don't think this is the best way to approach optimizing this function.
I'm guessing the bottleneck is the repeated get_child call, both because of the thread guard, but also because it does a lot of unnecessary checks.
I think the best way to optimize this function would be to continue using the children cache, but do so directly with a plain old loop.
First call _update_children_cache.
Then, it should determine start and end index in the children cache based on p_include_internal (either 0 and data.children_cache.size(), or based on data.internal_children_front_count_cache and data.internal_children_back_count_cache).
Finally, it should loop from start to end, appending to the array the child in the cache at the corresponding index.
This should come out cleaner and faster than your current solution.
Let me know if you want to adjust the PR accordingly, or if I should make a superseding PR.
The bottleneck was indeed the repeated get_child call.
Yes, but we expect children to update rarely. Updating the children cache is somewhat costly once, but iterating it afterwards is nearly free compared to iterating the children HashMap on every call.
This follows up on get_child_count:

int Node::get_child_count(bool p_include_internal) const {
ERR_THREAD_GUARD_V(0);
if (p_include_internal) {
return data.children.size();
}
_update_children_cache();
return data.children_cache.size() - data.internal_children_front_count_cache - data.internal_children_back_count_cache;
}
The implementation of _update_children_cache could, however, fill the cache with a plain loop over data.children:
for (const KeyValue<StringName, Node *> &K : data.children) {
data.children_cache[idx] = K.value;
idx++;
}
Yes, this would be faster on first access, with the caveat that it would only help in some scenarios. Like I mentioned above, in general we expect the children to be relatively constant over multiple calls. Even if they change once every few frames, there might be dozens of calls to get_children in between.
A useful use case where it doesn't need to update the children cache:

var children = get_children(true)
# Remove all non control nodes from children.
for child: Node in children:
if child.get_class() != &"Control":
remove_child(child)
child.free()
print_tree_pretty() # Now the cache is updated, but only once instead of twice.
So we can update the cache, if it's dirty, inside the method when p_include_internal is false.
Feel free to benchmark it against my proposed solution! I suspect the second loop isn't a bottleneck, compared to the thread guards and extraneous checks that both my proposed solution and your proposed solution mitigate. After this optimization, I'm guessing the allocation and data copying to be the new bottleneck (and not a secondary loop).
Force-pushed 150958f to 77f6dbf.
Thank you for helping to optimize it; it appeared that we can't rely on my original approach.
Force-pushed 77f6dbf to 99e94fa.
Force-pushed 5b4f827 to 11c650b.
Looks good to me! Thanks for being open about changing the implementation in this way.
For completeness' sake, I'd be interested in a benchmark of the current version against master, just to make sure it's still a substantial improvement. But we can merge without it; I'm pretty confident get_child was the major bottleneck to solve here.
You’re welcome! I always appreciate feedback, even when it’s critical. Thank you for your patience and for helping me improve and deepen my understanding of the source code.
Children count = 200,000 (100,000 internal and 100,000 external)

Before: get_children, internal=true: 73 ms
After: get_children, internal=true: 65 ms
Before: get_children, internal=false: 36 ms
After: get_children, internal=false: 32 ms
Calinou left a comment:
Tested locally, it works as expected. Code looks good to me.
Benchmark
PC specifications
- CPU: AMD Ryzen 9 9950X3D
- GPU: NVIDIA GeForce RTX 5090
- RAM: 64 GB (2×32 GB DDR5-6000 CL30)
- SSD: Solidigm P44 Pro 2 TB
- OS: Linux (Fedora 42)
Using a release export template binary with LTO with the GDScript code from the first comment. 300,000 total iterations on each run (higher values cause a crash due to the message queue running out of memory). Values shown are an average of 5 runs for each.
Include internal
| Before | After |
|---|---|
| 6603 µs | 6529 µs |
Exclude internal
| Before | After |
|---|---|
| 1892 µs | 1855 µs |
The performance improvement is very small (within margin of error possibly, as there's some variance).
I have made another benchmark against the current implementation, and it appears that separating the two loops results in faster execution, since it has fewer calculations and uses constants.

TypedArray<Node> Node::get_children(bool p_include_internal) const {
ERR_THREAD_GUARD_V(TypedArray<Node>());
_update_children_cache();
TypedArray<Node> children;
if (p_include_internal) {
const int size = data.children_cache.size();
children.resize(size);
for (int i = 0; i < size; i++) {
children[i] = data.children_cache[i];
}
} else {
const int size = data.children_cache.size() - data.internal_children_back_count_cache;
children.resize(size - data.internal_children_front_count_cache);
int idx = 0;
for (int i = data.internal_children_front_count_cache; i < size; i++) {
children[idx++] = data.children_cache[i];
}
}
return children;
}

Before: get_children, internal=true: 70 ms
After: get_children, internal=true: 65 ms
Before: get_children, internal=false: 35 ms
After: get_children, internal=false: 31 ms
Force-pushed 11c650b to 0f52684.
I’ve updated the implementation, and I’m curious to see if it now shows a measurable improvement that justifies the effort we invested.
Force-pushed 0f52684 to e6a95bb.
The Array::Iterator approach should eliminate the unnecessary CoW / Array interface checks, so I'm guessing it will speed up the implementation some more. But now I'd like to see a benchmark again.
Anyway, I don't think your new approach (separate loops + iterator for Array) is more complicated than the previous one, so this still works for me either way.
it appears that separating both loops results in faster execution
Note that a 1 ms execution time difference is negligible in CPU terms. We expect fluctuations of up to ~10 ms every run (highly depending on the benchmark, of course). You should increase the number of iterations in your benchmark to reach the upper hundreds of milliseconds, or even the single-digit seconds range. The same holds for @Calinou's benchmark, which appears to be in the millisecond range.
Co-authored-by: Lukas Tenbrink <lukas.tenbrink@gmail.com>
Force-pushed e6a95bb to 7dc6df3.
Ivorforce left a comment:
Re-approving latest version.
Thanks!
This ensures that get_child is never called when using get_children, and that writing to the TypedArray always uses a pointer.