X Tutup
Skip to content

Use LocalVector on GDScriptInstance::members#105555

Open
DeeJayLSP wants to merge 1 commit intogodotengine:masterfrom
DeeJayLSP:gds-local-members
Open

Use LocalVector on GDScriptInstance::members#105555
DeeJayLSP wants to merge 1 commit intogodotengine:masterfrom
DeeJayLSP:gds-local-members

Conversation

@DeeJayLSP
Copy link
Contributor

@DeeJayLSP DeeJayLSP commented Apr 19, 2025

Modifies GDScriptInstance::members to use a LocalVector instead of Vector. This should be slightly faster and go easier on memory due to the lack of CoW.

Benchmarks

Updated as of December 3, 2025

Project used for benchmarking: bunnymark_custom.zip (modify the line in _ready() that adds bunnies as intended).

It should start and immediately spawn the desired amount of bunnies. After 30 seconds, prints the FPS amount of the last 15 seconds (the first 15 are discarded as an anti-fluctuation measure) and exits.

The command was something like for i in {1..3}; do gamemoderun ./before.x86_64 --no-header && gamemoderun ./after.x86_64 --no-header; done > bunny_count.txt on a machine with almost zero programs running in the background. More than one run is necessary due to GDScript being too unstable and sensible to fluctuations.

Unlike some previous benchmarks, I opted to keep the numbers as low as possible to keep closer to more common cases. There is virtually no difference if the amount is lower than this.

500 bunnies

master PR
2514 2511
2531 2547
2537 2549

Average increase of 0.31%

1000 bunnies

master PR
1677 1682
1668 1683
1670 1673

Average increase of 0.45%

5000 bunnies

master PR
360 371
364 361
361 367

Average increase of 1.29%

Other considerations

On a very high amount of instances (probably near 50k bunnies), the improvement would certainly reach around 10%, it's just that it's nowhere near a common use case.

While the improvement numbers aren't that high, template release production builds on Linux had a decrease of 8 KiB. When running both before and after cases and looking at a system monitor, most of the times the after one will use a few KBs less memory.

@DeeJayLSP
Copy link
Contributor Author

DeeJayLSP commented Apr 21, 2025

It seems upstream made this slower (probably #100944).

However, changing the members to a TightLocalVector instead made the performance improvement come back. GDScriptInstance::members is resized only once when an instance is created, so maybe nothing is lost by doing this. Force-pushed.

@clayjohn
Copy link
Member

CC @Ivorforce @Nazarwadim

@Ivorforce
Copy link
Member

Some regressions are to be expected from resize changes by pure chance. If this is significantly impacted by it it definitely needs a better solution in general.

Not sure how TighLocalVector could even help unless it's being resized every single frame. It has no effect if there's no resize. Do you have a repro for us to toy with?

@DeeJayLSP
Copy link
Contributor Author

DeeJayLSP commented Apr 21, 2025

Do you have a repro for us to toy with?

I use this stress test for every PR that affects GDScript performance: https://github.com/GaijinEntertainment/godot-das/tree/master/examples/bunnymark

After further testing, I noticed there is a performance improvement only when there are too much instances:

6000 bunnies:

branch FPS
master 122-125
PR 114-116

20000 bunnies:

branch FPS
master 34-36
PR 36-37

Didn't bother measuring with regular LocalVector as all results were lower performance.

It seems a bit situational. I'll report after testing on production builds which matter the most.

@DeeJayLSP
Copy link
Contributor Author

DeeJayLSP commented Apr 21, 2025

I'll report after testing on production builds which matter the most.

The proportions are the same, with or without production=yes

I admit that back then I was pairing the changes of this PR with #90082, which already caused a massive boost to GDScript.

I decided to test everything again with both changes from #90082 and this PR over the current master.

At 5000 bunnies:

branch avg. FPS
#90082 163-166
#90082 + LocalVector members 163-171
#90082 + TightLocalVector members 161-171

At 10000 bunnies:

branch avg. FPS
#90082 79-81
#90082 + LocalVector members 79-82
#90082 + TightLocalVector members 78-81

At 20000 bunnies:

branch FPS
#90082 37-40
#90082 + LocalVector members 39-41
#90082 + TightLocalVector members 38-39

At 50000 bunnies:

branch FPS
#90082 15-16
#90082 + LocalVector members 16-17
#90082 + TightLocalVector members 15-17

The results are too unstable, but I think maybe we should wait for #90082. A regular LocalVector seems more stable than a Tight one.

@DeeJayLSP DeeJayLSP requested a review from a team as a code owner April 23, 2025 18:16
@DeeJayLSP
Copy link
Contributor Author

DeeJayLSP commented Apr 23, 2025

While experimenting how fast it would be with a std::vector, I realized GDScriptInstance::reload_members() would overwrite GDScriptInstance::members by giving it a regular Vector, and casting that to LocalVector has a bit of overhead.

I have modified so it makes and assigns a LocalVector instead.

It seems whatever made the FPS drop is now gone (not sure if it was this missing change as it was fine with TightLocalVector, probably because both it and CowData are tight, or probably because of #104985 being merged), but the FPS gain is minimal (around 1.5%).

@DeeJayLSP DeeJayLSP force-pushed the gds-local-members branch 2 times, most recently from 0ecb418 to 6a11011 Compare April 24, 2025 22:57
Copy link
Member

@Calinou Calinou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally, it works as expected. I'm getting mixed performance results though, so it's hard to say if this is really a win.

Benchmark

PC specifications
  • CPU: Intel Core i9-13900K
  • GPU: NVIDIA GeForce RTX 4090
  • RAM: 64 GB (2×32 GB DDR5-5800 C30)
  • SSD: Solidigm P44 Pro 2 TB
  • OS: Linux (Fedora 42)

Using a release export template binary (production=yes lto=full) on https://github.com/GaijinEntertainment/godot-das/tree/master/examples/bunnymark's GDScript version.

Count master This PR
6,000 bunnies 566 FPS 550 FPS
20,000 bunnies 159 FPS 159 FPS
50,000 bunnies 59 FPS 60 FPS

@DeeJayLSP
Copy link
Contributor Author

DeeJayLSP commented Apr 25, 2025

I'm getting mixed performance results though.

That's why I say we should wait for #90082 or something similar.

Every time some PR touches GDScript or LocalVector it changes the outcome of this one. But in all cases combined with #90082 the performance was better.

Out of curiosity I overlapped this PR over some changes of #90082 in a production build (6000 bunnies as that was the one affected negatively):

#90082 #90082 + PR
216 FPS 224 FPS

The same improvement ratio from above can be seen with 10000 or 50000 bunnies

@DeeJayLSP
Copy link
Contributor Author

DeeJayLSP commented May 13, 2025

Rebased so I could check the benefits from #106020

While the profiler shows that the overhead of a LocalVector is significantly lower than that of CowData, there is a Variant::operator== that skyrockets. This also happens in the screenshots at the top of this PR page.

Before:
image

After:
image

At 10000 bunnies, I can measure a 9% performance improvement with a production template build, but at 1000 bunnies it varies between a 0.3% improvement and a 2.3% decrease. Performance decrease disaappears after 2000 bunnies.

I hope the issue with Variant isn't some LocalVector to Vector conversion.

EDIT: turns out most of CowData::get() overhead is simply being inherited by Variant::operator[].

I have tried to keep a regular Vector, but make GDScriptInstance store members's ptrw() solely on instantiation (GDScriptFunction::call() calls ptrw() on every iteration), but iterating over it revealed where the overhead came from. And was still slower.

Using LocalVector still results in both vector get and Variant::operator== being in their least overhead forms.

@DeeJayLSP
Copy link
Contributor Author

DeeJayLSP commented May 14, 2025

At 1000 bunnies, master vs PR
master-vs-PR

Looks like the problem was never inside Variant::get_named or Variant::set_named, which are actually benefited by this PR. VariantSetGet_Vector2_x/y::validated_get seems to get a massive increase in cycles.

Does anyone know a possible solution for this? Or is it too reliant on CoW?

Copy link
Member

@Ivorforce Ivorforce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I agree that's a massive difference. With such small percentages, i'd be inclined to say you may be measuring CPU fluctuations. Perhaps you can coax more extreme differences with a different, more tailored benchmark?

@DeeJayLSP DeeJayLSP force-pushed the gds-local-members branch from e2bee63 to 1e88bf9 Compare May 14, 2025 18:40
@DeeJayLSP
Copy link
Contributor Author

DeeJayLSP commented May 14, 2025

It seems like whatever overhead in GDScriptInstance::set/get() created by this change can be mitigated by using std::move() when setting/getting the variant.

I don't know if doing this would create unwanted behavior or possibly cause crashes, but so far it recovers some of the performance loss (the increase in Variant::operator= mentioned here goes away and the FPS with 1000 bunnies is higher), tests pass and TPS Demo runs just fine.

I think the lack of CoW is causing Variant constructors to be called in the VM, resulting in overhead. It was fixed on set/get's side by using std::move(), and I don't know how to do it on the VM's side, or if this is even possible or viable at this point.

@DeeJayLSP
Copy link
Contributor Author

DeeJayLSP commented Dec 3, 2025

Rebased after a very long time.

By the time I opened this PR I thought I knew how to read perf outputs, but #112129 taught me a lesson (didn't know callback cycles around the change increasing could potentially indicate a perfomance gain)

Here are a few changes I did:

  • Keeping it as Tight, as the vector never gets resized except when the GDScript instance is made or when reloading the script.
  • All std::move() workarounds removed (except the script reload one) as they turned out to be doing more harm than good.

On very recent tests, when it comes to Bunnymark, there should be no performance difference at 500 bunnies, maybe a 0.06% gain (from many runs with the least amount programs running on background + Feral GameMode, as GDScript is extremely sensible to fluctuations when not performing too heavy tasks).

Lastly, when running before and after at the same time, it's possible to see sometimes that memory usage isn't stable. Sometimes before consumes many more MBs, sometimes after does. But when they match, after is always one MB lower than before. Also the build is 4KiB smaller.

From a recent discussion in the Contributors Chat, the conclusion is that members here shouldn't need CoW at all (members uses a CoW vector since before the engine was open-sourced). Maybe #107002 could be dismissed.

Will update the PR description later.

@DeeJayLSP
Copy link
Contributor Author

PR description updated with newer benchmarks and more considerations that are not the wrong profiler reading.

@DeeJayLSP DeeJayLSP force-pushed the gds-local-members branch from a64422f to f77f6ab Compare March 9, 2026 20:47
@DeeJayLSP
Copy link
Contributor Author

DeeJayLSP commented Mar 9, 2026

Switched back to regular LocalVector. While the performance change is around the same and the vector is never resized, members being a LocalVector decreased build sizes by a total of 8KiB.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants

X Tutup