Use Mat3x4 for model and view transforms to save bandwidth and ALUs#107923
Use Mat3x4 for model and view transforms to save bandwidth and ALUs#107923Repiteo merged 1 commit intogodotengine:masterfrom
Conversation
|
Cant make the artifact work for my s24 ultra , after the wheel spin is over, it gets stuck( downloaded the editor apk), 4.5 beta works fine tho. |
I haven't done the mobile renderer implementation yet! |
ik but even selecting the mobile renderer gives the same result as with forward+ |
Oh alright, I'll make sure to test on Android before marking this as ready for review. |
d4e2ca3 to
579ec17
Compare
|
Tested now on MacOS, PopOS, and Android. On my M2 macbook, I can't measure a difference. On my Intel integrated GPU I get a consistent 5%ish improvement in the test scene from #68959. I get similar results on my Pixel 7 (Mali-G710) and pixel 4 (Adreno 640) On other scenes more broadly I suspect the average performance benefit will be lower than 5% since the bottlenecks are often fragment processing and this does little to help with that. But at any rate, this is a free improvement to performance (in some cases) and battery life in all cases |
|
Tagging this for 4.6 dev 1. It should be a pretty safe change, but it is a little risky for the Beta cycle |
|
You can save some padding by using a mat4x3 instead. The register count should be the same, but the instruction count should be a bit lower (fewer loads). Matrices need to be transposed when filling the constant buffer on the CPU, and multiplication order needs to be swapped in glsl. |
|
@Gaktan You have it backwards.
I have tested this code in a compiler and reviewed the disassembly and confirmed that reduces load ops, instructions, and results in a reduced bandwidth usage. |
Got it, my bad. I assumed it was the same as HLSL. The |
I can't swap the multiplication order in user code 🙃 |
There was a problem hiding this comment.
Tested locally, it works as expected. Code looks good to me.
On a Samsung Galaxy S25 Ultra in 1080p, I can notice an improvement. I get 59 FPS before this PR, 64 FPS after.
PC specifications
- CPU: AMD Ryzen 9 9950X3D
- GPU: llvmpipe
- RAM: 64 GB (2×32 GB DDR5-6000 CL30)
- SSD: Solidigm P44 Pro 2 TB
- OS: Linux (Fedora 42)
On my PC with llvmpipe and a tiny resolution (64×64), the MRP is very slow, but it is consistently faster after this PR. (I can't spot a difference with the NVIDIA GPU in use, hence llvmpipe.)
master
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
This PR
Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)
servers/rendering/renderer_rd/shaders/forward_mobile/scene_forward_mobile.glsl
Outdated
Show resolved
Hide resolved
…perations and ALUs
|
Thanks! |
This improves performance in situations that are vertex shader bound (i.e. high vertex count). Early tests show that this makes an improvement on my intel integrated GPU quite broadly, but not on my M2 MBP.
I want to test a bit more widely to get a sense of the broad impact.
Checking with the Mali Offline Compiler, this change appears to shave off a few L/S operations and ALUs (about 10%). So I don't expect it to make a huge difference (especially on desktop). But its a free performance boost.
Built on top of #107876 to avoid conflicts