
Improve Node::get_children performance. #110571

Merged
Repiteo merged 1 commit into godotengine:master from mounirtohami:get-children on Sep 22, 2025

Conversation

@WhalesState (Contributor) commented Sep 16, 2025

This ensures that get_child is never called when using get_children, and that writing to the TypedArray always uses a pointer.

@WhalesState WhalesState requested a review from a team as a code owner September 16, 2025 13:24
@WhalesState WhalesState changed the title Improve Node::get_children performance by x2 when including internal chidlren. Improve Node::get_children performance by x2 when including internal children. Sep 16, 2025
@WhalesState WhalesState changed the title Improve Node::get_children performance by x2 when including internal children. Improve Node::get_children performance. Sep 16, 2025
@WhalesState (Contributor, Author) commented Sep 16, 2025

C++:

			Control *ctrl = memnew(Control);
			add_child(ctrl);

			{
				const int count = 100000;
				const int internal_count = 50000;

				for (int i = 0; i < internal_count; i++) {
					ctrl->add_child(memnew(Control), false, INTERNAL_MODE_FRONT);
				}

				for (int i = 0; i < count; i++) {
					ctrl->add_child(memnew(Control));
				}

				for (int i = 0; i < internal_count; i++) {
					ctrl->add_child(memnew(Control), false, INTERNAL_MODE_BACK);
				}
				
				// Validate cache.
				int child_count = ctrl->get_children().size();
				print_line(vformat("Children count = %d", child_count));
			}

			{
				int time = Time::get_singleton()->get_ticks_msec();
				int count = ctrl->get_children().size();
				print_line(vformat("Before: get_children, internal=true: %d ms", Time::get_singleton()->get_ticks_msec() - time));
				print_line(vformat("Children count = %d", count));
			}

			{
				int time = Time::get_singleton()->get_ticks_msec();
				int count = ctrl->get_children_fast().size();
				print_line(vformat("After: get_children, internal=true: %d ms", Time::get_singleton()->get_ticks_msec() - time));
				print_line(vformat("Children count = %d", count));
			}

			{
				int time = Time::get_singleton()->get_ticks_msec();
				int count = ctrl->get_children(false).size();
				print_line(vformat("Before: get_children, internal=false: %d ms", Time::get_singleton()->get_ticks_msec() - time));
				print_line(vformat("Children count = %d", count));
			}

			{
				int time = Time::get_singleton()->get_ticks_msec();
				int count = ctrl->get_children_fast(false).size();
				print_line(vformat("After: get_children, internal=false: %d ms", Time::get_singleton()->get_ticks_msec() - time));
				print_line(vformat("Children count = %d", count));
			}

			remove_child(ctrl);
			memdelete(ctrl);

GDScript:

	var internal_count: int = 10000
	var count: int = 10000

	for i in internal_count:
		add_child(Control.new(), false, Node.INTERNAL_MODE_FRONT)

	for i in count:
		add_child(Control.new())

	for i in internal_count:
		add_child(Control.new(), false, Node.INTERNAL_MODE_BACK)

	# Validate cache.
	get_children()

	var time = Time.get_ticks_msec()
	get_children(true)
	prints("Before: include internal: ", Time.get_ticks_msec() - time)

	time = Time.get_ticks_msec()
	get_children_fast(true)
	prints("After: include internal: ", Time.get_ticks_msec() - time)

	time = Time.get_ticks_msec()
	get_children()
	prints("Before: exclude internal: ", Time.get_ticks_msec() - time)

	time = Time.get_ticks_msec()
	get_children_fast()
	prints("After: exclude internal: ", Time.get_ticks_msec() - time)

@Ivorforce (Member) left a comment


Good find! But I don't think this is the best way to approach optimizing this function.

I'm guessing the bottleneck is the repeated get_child call, both because of the thread guard, but also because it does a lot of unnecessary checks.

I think the best way to optimize this function would be to continue using the children cache, but do so directly with a plain old loop.

First call _update_children_cache.

Then, it should determine start and end index in the children cache based on p_include_internal (either 0 and data.children_cache.size(), or based on data.internal_children_front_count_cache and data.internal_children_back_count_cache).

Finally, it should loop from start to end, appending to the array the child in the cache at the corresponding index.
This should come out cleaner and faster than your current solution.

Let me know if you want to adjust the PR accordingly, or if I should make a superseding PR.

@WhalesState (Contributor, Author) commented

> I'm guessing the bottleneck is the repeated get_child call, both because of the thread guard, but also because it does a lot of unnecessary checks.

The bottleneck was that _update_children_cache() was called even when include_internal == true, which results in two loops over all children instead of the single loop in the current implementation. You can see the difference when include_internal == false: it's only faster by 2 ms, which was the cost of calling get_child.

@Ivorforce (Member) commented Sep 16, 2025

> The bottleneck was that _update_children_cache() was called even when include_internal == true, which results in two loops over all children instead of the single loop in the current implementation. You can see the difference when include_internal == false: it's only faster by 2 ms, which was the cost of calling get_child.

Yes, but we expect children to update rarely. Updating the children cache is somewhat costly once, but iterating it afterwards is nearly free compared to iterating the HashMap. The reason is that the children cache has all elements in a contiguous array, while iterating the children HashMap can result in a lot of cache misses. It is important for the performance of Node to use the children cache wherever possible.

@WhalesState (Contributor, Author) commented

This follows the get_child_count implementation, where we update the cache only when we need to exclude internal children, which results in a faster return when include_internal == true.

int Node::get_child_count(bool p_include_internal) const {
	ERR_THREAD_GUARD_V(0);
	if (p_include_internal) {
		return data.children.size();
	}

	_update_children_cache();
	return data.children_cache.size() - data.internal_children_front_count_cache - data.internal_children_back_count_cache;
}

@Ivorforce (Member) commented

The implementation of get_child_count can exploit the fact that we do not need to iterate the children at all when p_include_internal is true. This makes for an obvious shortcut without trade-off.

However, get_children does need to iterate the children, and therefore it needs to use the children cache. If it weren't using the children cache, we'd be losing most of the performance to cache misses in cold access situations.

@WhalesState (Contributor, Author) commented

> Yes, but we expect children to update rarely. Updating the children cache is somewhat costly once, but iterating it afterwards is nearly free compared to iterating the HashMap. The reason is that the children cache has all elements in a contiguous array, while iterating the children HashMap can result in a lot of cache misses. It is important for the performance of Node to use the children cache wherever possible.

_update_children_cache already loops over the HashMap, and I have just saved that step. A better implementation may be to check data.children_cache_dirty, so that we loop over the HashMap directly instead of looping twice over all children. This is especially useful when we use methods that don't require updating the cache, like remove_child.

	for (const KeyValue<StringName, Node *> &K : data.children) {
		data.children_cache[idx] = K.value;
		idx++;
	} 

@Ivorforce (Member) commented Sep 16, 2025

> a better implementation may be to check data.children_cache_dirty, so that we loop over the HashMap directly instead of looping twice over all children

Yes, this would be faster on first access (with the caveat that it would work for the scenario where p_include_internal is false).
However, we shouldn't be optimizing for the 'first access' case.

Like I mentioned above, in general we expect the children to be relatively constant over multiple calls. Even if they change once every few frames, there might be dozens of calls to get_children afterwards where the children cache can be exploited.
While you have benchmarked the improvement of performance on hot, initial access, your benchmark does not represent the case we expect to occur most often, which is cold, repeated access. For this case, the children cache is the best solution, even if it does incur a performance cost on the first call where the children cache needs to be created.

@WhalesState (Contributor, Author) commented Sep 16, 2025

A useful case where we don't need to update the children cache:

var children = get_children(true)
# Remove all non-Control nodes from children.
for child: Node in children:
    if child.get_class() != &"Control":
        remove_child(child)
        child.free()
print_tree_pretty() # Now the cache is updated, but only once instead of twice.

@WhalesState (Contributor, Author) commented Sep 16, 2025

So inside the method, when get_children(true) is called and the cache is dirty, we can update the cache instead of looping twice over all children. It will remain fast and will still update the cache when it's dirty.

@Ivorforce (Member) commented Sep 16, 2025

> So inside the method, when get_children(true) is called and the cache is dirty, we can update the cache instead of looping twice over all children. It will remain fast and will still update the cache when it's dirty.

Feel free to benchmark it against my proposed solution! I suspect the second loop isn't a bottleneck, compared to the thread guards and extraneous checks that both my proposed solution and yours mitigate. After this optimization, I'm guessing allocation and data copying will be the new bottleneck (and not a secondary loop).

(A comment from @WhalesState was marked as resolved.)

@WhalesState (Contributor, Author) commented Sep 16, 2025

Thank you for helping optimize it. It turned out that we can't rely on get_children(true) without validating the cache, or the order may be incorrect. I have only ensured that get_child is never called, which adds a smaller optimization. I also updated the code I used for testing and made sure to always validate the cache before the benchmarks, which was why the first call was much slower than the others.

@WhalesState force-pushed the get-children branch 3 times, most recently from 5b4f827 to 11c650b on September 16, 2025 22:02
@Ivorforce (Member) left a comment


Looks good to me! Thanks for being open about changing the implementation in this way.

For completeness' sake, I'd be interested in a benchmark of the current version against master, just to make sure it's still a substantial improvement. But we can merge without it; I'm pretty confident get_child was the major bottleneck to solve here.

@Ivorforce Ivorforce modified the milestones: 4.x, 4.6 Sep 16, 2025
@WhalesState (Contributor, Author) commented Sep 16, 2025

> Thanks for being open about changing the implementation in this way.

You’re welcome! I always appreciate feedback, even when it’s critical. Thank you for your patience and for helping me improve and deepen my understanding of the source code.

> For completeness' sake, I'd be interested in a benchmark of the current version against master, just to make sure it's still a substantial improvement. But we can merge without it; I'm pretty confident get_child was the major bottleneck to solve here.

Children count = 200,000 (100,000 internal and 100,000 external)

Before: get_children, internal=true: 73 ms
 After: get_children, internal=true: 65 ms
Before: get_children, internal=false: 36 ms
 After: get_children, internal=false: 32 ms

@Calinou (Member) left a comment


Tested locally, it works as expected. Code looks good to me.

Benchmark

PC specifications
  • CPU: AMD Ryzen 9 9950X3D
  • GPU: NVIDIA GeForce RTX 5090
  • RAM: 64 GB (2×32 GB DDR5-6000 CL30)
  • SSD: Solidigm P44 Pro 2 TB
  • OS: Linux (Fedora 42)

Using a release export template binary built with LTO, with the GDScript code from the first comment. 300,000 total iterations on each run (higher values cause a crash due to the message queue running out of memory). Values shown are an average of 5 runs each.

Include internal

Before: 6603 µs
After:  6529 µs

Exclude internal

Before: 1892 µs
After:  1855 µs

The performance improvement is very small (possibly within the margin of error, as there's some variance).

@WhalesState (Contributor, Author) commented Sep 16, 2025

I have made another benchmark against the current implementation, and it appears that separating the two loops results in faster execution, since it performs fewer calculations and uses constants.

TypedArray<Node> Node::get_children(bool p_include_internal) const {
	ERR_THREAD_GUARD_V(TypedArray<Node>());
	_update_children_cache();

	TypedArray<Node> children;

	if (p_include_internal) {
		const int size = data.children_cache.size();
		children.resize(size);

		for (int i = 0; i < size; i++) {
			children[i] = data.children_cache[i];
		}
	} else {
		const int size = data.children_cache.size() - data.internal_children_back_count_cache;
		children.resize(size - data.internal_children_front_count_cache);

		int idx = 0;
		for (int i = data.internal_children_front_count_cache; i < size; i++) {
			children[idx++] = data.children_cache[i];
		}
	}

	return children;
}

Before: get_children, internal=true: 70 ms
 After: get_children, internal=true: 65 ms

Before: get_children, internal=false: 35 ms
 After: get_children, internal=false: 31 ms

@WhalesState (Contributor, Author) commented

> The performance improvement is very small (possibly within the margin of error, as there's some variance).

I’ve updated the implementation, and I’m curious to see if it now shows a measurable improvement that justifies the effort we invested.

@Ivorforce (Member) left a comment


The Array::Iterator approach should eliminate the unnecessary CoW / Array interface checks, so I'm guessing it will speed up the implementation some more. But now I'd like to see a benchmark again.

Anyway, I don't think your new approach (separate loops + iterator for Array) is more complicated than the previous one, so this still works for me either way.

> it appears that separating both loops results in faster execution

Note that a 1 ms execution time difference is negligible in CPU terms. We expect fluctuations of up to ~10 ms every run (depending heavily on the benchmark, of course). You should increase the number of iterations in your benchmark to reach the upper hundreds of milliseconds, or even the single-digit seconds range. The same holds for @Calinou's benchmark, which appears to be in the millisecond range.

Co-authored-by: Lukas Tenbrink <lukas.tenbrink@gmail.com>
@Ivorforce (Member) left a comment


Re-approving latest version.

@Repiteo Repiteo merged commit 89c51cb into godotengine:master Sep 22, 2025
20 checks passed
@Repiteo (Contributor) commented Sep 22, 2025

Thanks!

@WhalesState WhalesState deleted the get-children branch September 22, 2025 19:13