
Improve Node::get_children performance. #110571

Merged
Repiteo merged 1 commit into godotengine:master from mounirtohami:get-children on Sep 22, 2025

Conversation

@WhalesState (Contributor) commented Sep 16, 2025

This ensures that get_child is never called when using get_children, and that writing to the TypedArray always uses a pointer.

@WhalesState WhalesState requested a review from a team as a code owner September 16, 2025 13:24
@WhalesState WhalesState changed the title Improve Node::get_children performance by x2 when including internal chidlren. Improve Node::get_children performance by x2 when including internal children. Sep 16, 2025
@WhalesState WhalesState changed the title Improve Node::get_children performance by x2 when including internal children. Improve Node::get_children performance. Sep 16, 2025
@WhalesState (Contributor, Author) commented Sep 16, 2025

C++:

			Control *ctrl = memnew(Control);
			add_child(ctrl);

			{
				const int count = 100000;
				const int internal_count = 50000;

				for (int i = 0; i < internal_count; i++) {
					ctrl->add_child(memnew(Control), false, INTERNAL_MODE_FRONT);
				}

				for (int i = 0; i < count; i++) {
					ctrl->add_child(memnew(Control));
				}

				for (int i = 0; i < internal_count; i++) {
					ctrl->add_child(memnew(Control), false, INTERNAL_MODE_BACK);
				}
				
				// Validate cache.
				int child_count = ctrl->get_children().size();
				print_line(vformat("Children count = %d", child_count));
			}

			{
				int time = Time::get_singleton()->get_ticks_msec();
				int count = ctrl->get_children().size();
				print_line(vformat("Before: get_children, internal=true: %d ms", Time::get_singleton()->get_ticks_msec() - time));
				print_line(vformat("Children count = %d", count));
			}

			{
				int time = Time::get_singleton()->get_ticks_msec();
				int count = ctrl->get_children_fast().size();
				print_line(vformat("After: get_children, internal=true: %d ms", Time::get_singleton()->get_ticks_msec() - time));
				print_line(vformat("Children count = %d", count));
			}

			{
				int time = Time::get_singleton()->get_ticks_msec();
				int count = ctrl->get_children(false).size();
				print_line(vformat("Before: get_children, internal=false: %d ms", Time::get_singleton()->get_ticks_msec() - time));
				print_line(vformat("Children count = %d", count));
			}

			{
				int time = Time::get_singleton()->get_ticks_msec();
				int count = ctrl->get_children_fast(false).size();
				print_line(vformat("After: get_children, internal=false: %d ms", Time::get_singleton()->get_ticks_msec() - time));
				print_line(vformat("Children count = %d", count));
			}

			remove_child(ctrl);
			memdelete(ctrl);

GDScript:

	var internal_count: int = 10000
	var count: int = 10000

	for i in internal_count:
		add_child(Control.new(), false, Node.INTERNAL_MODE_FRONT)

	for i in count:
		add_child(Control.new())

	for i in internal_count:
		add_child(Control.new(), false, Node.INTERNAL_MODE_BACK)

	# Validate cache.
	get_children()

	var time = Time.get_ticks_msec()
	get_children(true)
	prints("Before: include internal: ", Time.get_ticks_msec() - time)

	time = Time.get_ticks_msec()
	get_children_fast(true)
	prints("After: include internal: ", Time.get_ticks_msec() - time)

	time = Time.get_ticks_msec()
	get_children()
	prints("Before: exclude internal: ", Time.get_ticks_msec() - time)

	time = Time.get_ticks_msec()
	get_children_fast()
	prints("After: exclude internal: ", Time.get_ticks_msec() - time)

@Ivorforce (Member) left a comment


Good find! But I don't think this is the best way to approach optimizing this function.

I'm guessing the bottleneck is the repeated get_child call, both because of the thread guard, but also because it does a lot of unnecessary checks.

I think the best way to optimize this function would be to continue using the children cache, but do so directly with a plain old loop.

First call _update_children_cache.

Then, it should determine start and end index in the children cache based on p_include_internal (either 0 and data.children_cache.size(), or based on data.internal_children_front_count_cache and data.internal_children_back_count_cache).

Finally, it should loop from start to end, appending to the array the child in the cache at the corresponding index.
This should come out cleaner and faster than your current solution.

Let me know if you want to adjust the PR accordingly, or if I should make a superseding PR.

@WhalesState (Contributor, Author) commented

> I'm guessing the bottleneck is the repeated get_child call, both because of the thread guard, but also because it does a lot of unnecessary checks.

The bottleneck was that _update_children_cache() was called even when include_internal == true, which results in two loops over all children instead of the single loop in the current implementation. You can see the difference when include_internal == false: it's only faster by 2 ms, which was the cost of calling get_child.

@Ivorforce (Member) commented Sep 16, 2025

> The bottleneck was that _update_children_cache() was called even when include_internal == true, which results in two loops over all children instead of the single loop in the current implementation. You can see the difference when include_internal == false: it's only faster by 2 ms, which was the cost of calling get_child.

Yes, but we expect children to update rarely. Updating the children cache is somewhat costly once, but iterating it afterwards is nearly free compared to iterating the HashMap. The reason is that the children cache has all elements in a contiguous array, while iterating the children HashMap can result in a lot of cache misses. It is important for the performance of Node to use the children cache wherever possible.

@WhalesState (Contributor, Author) commented

This follows the get_child_count implementation, where we update the cache only when we need to exclude internal children, which results in a faster return when include_internal == true.

int Node::get_child_count(bool p_include_internal) const {
	ERR_THREAD_GUARD_V(0);
	if (p_include_internal) {
		return data.children.size();
	}

	_update_children_cache();
	return data.children_cache.size() - data.internal_children_front_count_cache - data.internal_children_back_count_cache;
}

@Ivorforce (Member) commented

The implementation of get_child_count can exploit the fact that we do not need to iterate the children at all when p_include_internal is true. This makes for an obvious shortcut without trade-off.

However, get_children does need to iterate the children, and therefore it needs to use the children cache. If it weren't using the children cache, we'd be losing most of the performance to cache misses in cold access situations.

@WhalesState (Contributor, Author) commented

> Yes, but we expect children to update rarely. Updating the children cache is somewhat costly once, but iterating it afterwards is nearly free compared to iterating the HashMap. The reason is that the children cache has all elements in a contiguous array, while iterating the children HashMap can result in a lot of cache misses. It is important for the performance of Node to use the children cache wherever possible.

_update_children_cache already loops over the HashMap, and I have just saved that step. A better implementation may be to check data.children_cache_dirty, so that we loop over the HashMap directly instead of looping twice over all children. This is especially useful when we use methods that don't require updating the cache, like remove_child.

	for (const KeyValue<StringName, Node *> &K : data.children) {
		data.children_cache[idx] = K.value;
		idx++;
	} 

@Ivorforce (Member) commented Sep 16, 2025

> a better implementation may be to check data.children_cache_dirty, so that we loop over the HashMap directly instead of looping twice over all children

Yes, this would be faster on first access (with the caveat that it would work for the scenario where p_include_internal is false).
However, we shouldn't be optimizing for the 'first access' case.

Like I mentioned above, in general we expect the children to be relatively constant over multiple calls. Even if they change once every few frames, there might be dozens of calls to get_children afterwards where the children cache can be exploited.
While you have benchmarked the improvement of performance on hot, initial access, your benchmark does not represent the case we expect to occur most often, which is cold, repeated access. For this case, the children cache is the best solution, even if it does incur a performance cost on the first call where the children cache needs to be created.

@WhalesState (Contributor, Author) commented Sep 16, 2025

A useful case where we don't need to update the children cache:

var children = get_children(true)
# Remove all non-Control nodes from children.
for child: Node in children:
    if child.get_class() != &"Control":
        remove_child(child)
        child.free()
print_tree_pretty() # Now the cache is updated, but only once instead of twice.

@WhalesState (Contributor, Author) commented Sep 16, 2025

So inside the method, when get_children(true) is called and the cache is dirty, we can update the cache instead of looping twice over all children. It will remain fast and will still update the cache when it's dirty.

@Ivorforce (Member) commented Sep 16, 2025

> So inside the method, when get_children(true) is called and the cache is dirty, we can update the cache instead of looping twice over all children. It will remain fast and will still update the cache when it's dirty.

Feel free to benchmark it against my proposed solution! I suspect the second loop isn't a bottleneck, compared to the thread guards and extraneous checks that both my proposed solution and yours mitigate. After this optimization, I'm guessing allocation and data copying will be the new bottleneck (and not a secondary loop).

(A comment from @WhalesState was marked as resolved.)

@WhalesState (Contributor, Author) commented Sep 16, 2025

Thank you for helping optimize it. It turned out that we can't rely on get_children(true) without validating the cache, or the order may be incorrect. I have only ensured that get_child is never called, which adds a smaller optimization. I also updated the code I used for testing and made sure to always validate the cache before the benchmarks, which was why the first call was much slower than the others.

@WhalesState force-pushed the get-children branch 3 times, most recently from 5b4f827 to 11c650b on September 16, 2025 22:02
@Ivorforce (Member) left a comment


Looks good to me! Thanks for being open about changing the implementation in this way.

For completeness' sake, I'd be interested in a benchmark of the current version against master, just to make sure it's still a substantial improvement. But we can merge without it; I'm pretty confident get_child was the major bottleneck to solve here.

@Ivorforce Ivorforce modified the milestones: 4.x, 4.6 Sep 16, 2025
@WhalesState (Contributor, Author) commented Sep 16, 2025

> Thanks for being open about changing the implementation in this way.

You’re welcome! I always appreciate feedback, even when it’s critical. Thank you for your patience and for helping me improve and deepen my understanding of the source code.

> For completeness' sake, I'd be interested in a benchmark of the current version against master, just to make sure it's still a substantial improvement. But we can merge without it; I'm pretty confident get_child was the major bottleneck to solve here.

Children count = 200,000 (100,000 internal and 100,000 external)

Before: get_children, internal=true: 73 ms
 After: get_children, internal=true: 65 ms
Before: get_children, internal=false: 36 ms
 After: get_children, internal=false: 32 ms

@Calinou (Member) left a comment


Tested locally, it works as expected. Code looks good to me.

Benchmark

PC specifications
  • CPU: AMD Ryzen 9 9950X3D
  • GPU: NVIDIA GeForce RTX 5090
  • RAM: 64 GB (2×32 GB DDR5-6000 CL30)
  • SSD: Solidigm P44 Pro 2 TB
  • OS: Linux (Fedora 42)

Using a release export template binary built with LTO, with the GDScript code from the first comment. 300,000 total iterations on each run (higher values cause a crash due to the message queue running out of memory). Values shown are an average of 5 runs each.

Include internal

Before: 6603 µs
After:  6529 µs

Exclude internal

Before: 1892 µs
After:  1855 µs

The performance improvement is very small (possibly within the margin of error, as there's some variance).

@WhalesState (Contributor, Author) commented Sep 16, 2025

I have made another benchmark against the current implementation, and it appears that separating the two loops results in faster execution, since it performs fewer calculations and uses constants.

TypedArray<Node> Node::get_children(bool p_include_internal) const {
	ERR_THREAD_GUARD_V(TypedArray<Node>());
	_update_children_cache();

	TypedArray<Node> children;

	if (p_include_internal) {
		const int size = data.children_cache.size();
		children.resize(size);

		for (int i = 0; i < size; i++) {
			children[i] = data.children_cache[i];
		}
	} else {
		const int size = data.children_cache.size() - data.internal_children_back_count_cache;
		children.resize(size - data.internal_children_front_count_cache);

		int idx = 0;
		for (int i = data.internal_children_front_count_cache; i < size; i++) {
			children[idx++] = data.children_cache[i];
		}
	}

	return children;
}

Before: get_children, internal=true: 70 ms
 After: get_children, internal=true: 65 ms

Before: get_children, internal=false: 35 ms
 After: get_children, internal=false: 31 ms

@WhalesState (Contributor, Author) commented

> The performance improvement is very small (possibly within the margin of error, as there's some variance).

I’ve updated the implementation, and I’m curious to see if it now shows a measurable improvement that justifies the effort we invested.

@Ivorforce (Member) left a comment


The Array::Iterator approach should eliminate the unnecessary CoW / Array interface checks, so I'm guessing it will speed up the implementation some more. But now I'd like to see a benchmark again.

Anyway, I don't think your new approach (separate loops + iterator for Array) is more complicated than the previous one, so this still works for me either way.

> it appears that separating both loops results in faster execution

Note that a 1 ms execution time difference is negligible in CPU terms. We expect fluctuations of up to ~10 ms every run (depending heavily on the benchmark, of course). You should increase the number of iterations in your benchmark to reach the upper hundreds of milliseconds, or even the single-digit seconds range. The same holds for @Calinou's benchmark, which appears to be in the millisecond range.

Co-authored-by: Lukas Tenbrink <lukas.tenbrink@gmail.com>
@Ivorforce (Member) left a comment


Re-approving latest version.

@Repiteo Repiteo merged commit 89c51cb into godotengine:master Sep 22, 2025
20 checks passed
@Repiteo (Contributor) commented Sep 22, 2025

Thanks!

@WhalesState WhalesState deleted the get-children branch September 22, 2025 19:13