X Tutup
Skip to content

Use Mat3x4 for model and view transforms to save bandwidth and ALUs#107923

Merged
Repiteo merged 1 commit intogodotengine:masterfrom
clayjohn:RD-mat3x4
Sep 30, 2025
Merged

Use Mat3x4 for model and view transforms to save bandwidth and ALUs#107923
Repiteo merged 1 commit intogodotengine:masterfrom
clayjohn:RD-mat3x4

Conversation

@clayjohn
Copy link
Member

@clayjohn clayjohn commented Jun 24, 2025

This improves performance in situations that are vertex shader bound (i.e. high vertex count). Early tests show that this makes an improvement on my intel integrated GPU quite broadly, but not on my M2 MBP.

I want to test a bit more widely to get a sense of the broad impact.

Checking with the Mali Offline Compiler, this change appears to shave off a few L/S operations and ALUs (about 10%). So I don't expect it to make a huge difference (especially on desktop). But its a free performance boost.

Built on top of #107876 to avoid conflicts

@Saul2022
Copy link

Cant make the artifact work for my s24 ultra , after the wheel spin is over, it gets stuck( downloaded the editor apk), 4.5 beta works fine tho.

@clayjohn
Copy link
Member Author

Cant make the artifact work for my s24 ultra , after the wheel spin is over, it gets stuck( downloaded the editor apk), 4.5 beta works fine tho.

I haven't done the mobile renderer implementation yet!

@Saul2022
Copy link

I haven't done the mobile renderer implementation yet!

ik but even selecting the mobile renderer gives the same result as with forward+

@clayjohn
Copy link
Member Author

I haven't done the mobile renderer implementation yet!

ik but even selecting the mobile renderer gives the same result as with forward+

Oh alright, I'll make sure to test on Android before marking this as ready for review.

@clayjohn clayjohn force-pushed the RD-mat3x4 branch 2 times, most recently from d4e2ca3 to 579ec17 Compare July 1, 2025 19:32
@clayjohn clayjohn marked this pull request as ready for review July 1, 2025 19:42
@clayjohn clayjohn requested a review from a team as a code owner July 1, 2025 19:42
@clayjohn
Copy link
Member Author

clayjohn commented Jul 1, 2025

Tested now on MacOS, PopOS, and Android.

On my M2 macbook, I can't measure a difference.

On my Intel integrated GPU I get a consistent 5%ish improvement in the test scene from #68959. I get similar results on my Pixel 7 (Mali-G710) and pixel 4 (Adreno 640)

On other scenes more broadly I suspect the average performance benefit will be lower than 5% since the bottlenecks are often fragment processing and this does little to help with that. But at any rate, this is a free improvement to performance (in some cases) and battery life in all cases

@clayjohn
Copy link
Member Author

clayjohn commented Jul 1, 2025

Tagging this for 4.6 dev 1. It should be a pretty safe change, but it is a little risky for the Beta cycle

@Gaktan
Copy link
Contributor

Gaktan commented Jul 18, 2025

You can save some padding by using a mat4x3 instead.

The register count should be the same, but the instruction count should be a bit lower (fewer loads).

Matrices need to be transposed when filling the constant buffer on the CPU, and multiplication order needs to be swapped in glsl.

@clayjohn
Copy link
Member Author

clayjohn commented Jul 21, 2025

@Gaktan You have it backwards. mat4x3 stores 4 vec3 so it requires 4 floats of padding (which will be inserted automatically). mat3x4 stores 3 vec4 so there is no padding.

mat3x4 unfortunately thus requires doing a transpose operation in the shader which costs a few extra cycles. But is worth it because we save more cycles on the multiplication and we save the bandwidth as well.

I have tested this code in a compiler and reviewed the disassembly and confirmed that reduces load ops, instructions, and results in a reduced bandwidth usage.

@Gaktan
Copy link
Contributor

Gaktan commented Jul 21, 2025

@Gaktan You have it backwards. mat4x3 stores 4 vec3 so it requires 4 floats of padding (which will be inserted automatically). mat3x4 stores 3 vec4 so there is no padding.

Got it, my bad. I assumed it was the same as HLSL.

The transpose() is not necessary as you can just swap the multiplication order. But either way it should be free, since a transpose is just a register shuffle.

@clayjohn
Copy link
Member Author

The transpose() is not necessary as you can just swap the multiplication order. But either way it should be free, since a transpose is just a register shuffle.

I can't swap the multiplication order in user code 🙃

Copy link
Member

@Calinou Calinou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally, it works as expected. Code looks good to me.

On a Samsung Galaxy S25 Ultra in 1080p, I can notice an improvement. I get 59 FPS before this PR, 64 FPS after.

PC specifications
  • CPU: AMD Ryzen 9 9950X3D
  • GPU: llvmpipe
  • RAM: 64 GB (2×32 GB DDR5-6000 CL30)
  • SSD: Solidigm P44 Pro 2 TB
  • OS: Linux (Fedora 42)

On my PC with llvmpipe and a tiny resolution (64×64), the MRP is very slow, but it is consistently faster after this PR. (I can't spot a difference with the NVIDIA GPU in use, hence llvmpipe.)

master

Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)

This PR

Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)

@clayjohn clayjohn modified the milestones: 4.x, 4.6 Aug 6, 2025
@Repiteo Repiteo merged commit f969403 into godotengine:master Sep 30, 2025
20 checks passed
@Repiteo
Copy link
Contributor

Repiteo commented Sep 30, 2025

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants

X Tutup