Use Mat3x4 for model and view transforms to save bandwidth and ALUs by clayjohn · Pull Request #107923 · godotengine/godot

clayjohn · 2025-06-24T06:12:05Z

This improves performance in situations that are vertex shader bound (i.e. high vertex count). Early tests show a fairly broad improvement on my Intel integrated GPU, but not on my M2 MBP.

I want to test a bit more widely to get a sense of the broad impact.

Checking with the Mali Offline Compiler, this change appears to shave off a few load/store (L/S) operations and ALU operations (about 10%). So I don't expect it to make a huge difference (especially on desktop), but it's a free performance boost.

Built on top of #107876 to avoid conflicts

Saul2022 · 2025-06-24T08:03:32Z

Can't make the artifact work on my S24 Ultra. After the wheel spin is over, it gets stuck (I downloaded the editor APK). 4.5 beta works fine though.

clayjohn · 2025-06-24T14:07:48Z

> Can't make the artifact work on my S24 Ultra. After the wheel spin is over, it gets stuck (I downloaded the editor APK). 4.5 beta works fine though.

I haven't done the mobile renderer implementation yet!

Saul2022 · 2025-06-24T17:02:11Z

> I haven't done the mobile renderer implementation yet!

I know, but even selecting the Mobile renderer gives the same result as with Forward+.

clayjohn · 2025-06-24T17:07:50Z

> > I haven't done the mobile renderer implementation yet!
>
> I know, but even selecting the Mobile renderer gives the same result as with Forward+.

Oh alright, I'll make sure to test on Android before marking this as ready for review.

clayjohn · 2025-07-01T19:59:15Z

Tested now on macOS, Pop!_OS, and Android.

On my M2 MacBook, I can't measure a difference.

On my Intel integrated GPU I get a consistent improvement of roughly 5% in the test scene from #68959. I get similar results on my Pixel 7 (Mali-G710) and Pixel 4 (Adreno 640).

On other scenes more broadly, I suspect the average performance benefit will be lower than 5%, since the bottleneck is often fragment processing and this does little to help with that. But at any rate, this is a free improvement to performance (in some cases) and to battery life (in all cases).

clayjohn · 2025-07-01T19:59:55Z

Tagging this for 4.6 dev 1. It should be a pretty safe change, but it is a little risky for the beta cycle.

Gaktan · 2025-07-18T19:53:43Z

You can save some padding by using a mat4x3 instead.

The register count should be the same, but the instruction count should be a bit lower (fewer loads).

Matrices need to be transposed when filling the constant buffer on the CPU, and multiplication order needs to be swapped in glsl.

clayjohn · 2025-07-21T08:24:53Z

@Gaktan You have it backwards. mat4x3 stores 4 vec3 so it requires 4 floats of padding (which will be inserted automatically). mat3x4 stores 3 vec4 so there is no padding.

Unfortunately, mat3x4 requires a transpose operation in the shader, which costs a few extra cycles. But it is worth it, because we save more cycles on the multiplication and we save the bandwidth as well.

I have tested this code in a compiler, reviewed the disassembly, and confirmed that it reduces load ops and instructions and results in reduced bandwidth usage.
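The padding argument above can be checked with a little arithmetic. The following is a minimal sketch in Python (not engine code), assuming std140-style layout rules in which every column of a float matrix occupies a full 16-byte vec4 slot. The function names are made up for illustration.

```python
# Sketch of std140 sizing for float matrices (helper names are illustrative).
FLOAT_BYTES = 4
VEC4_BYTES = 16  # assumption: each matrix column is rounded up to a vec4 slot


def std140_matrix_bytes(columns: int, rows: int) -> int:
    """Size of a column-major GLSL mat{columns}x{rows} in a std140 block."""
    if not (2 <= columns <= 4 and 2 <= rows <= 4):
        raise ValueError("GLSL matrices have 2 to 4 columns and rows")
    # Every column takes a full 16-byte slot, however many rows it has.
    return columns * VEC4_BYTES


def padding_floats(columns: int, rows: int) -> int:
    """Number of unused floats inserted by the layout rules."""
    used = columns * rows * FLOAT_BYTES
    return (std140_matrix_bytes(columns, rows) - used) // FLOAT_BYTES


for name, (c, r) in {"mat4": (4, 4), "mat4x3": (4, 3), "mat3x4": (3, 4)}.items():
    print(name, std140_matrix_bytes(c, r), "bytes,", padding_floats(c, r), "padding floats")
```

Under these assumptions mat4x3 is as large as a full mat4 (64 bytes, 4 of its 16 floats being padding), while mat3x4 takes 48 bytes with no padding.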

Gaktan · 2025-07-21T17:06:18Z

> @Gaktan You have it backwards. mat4x3 stores 4 vec3 so it requires 4 floats of padding (which will be inserted automatically). mat3x4 stores 3 vec4 so there is no padding.

Got it, my bad. I assumed it was the same as HLSL.

The transpose() is not necessary as you can just swap the multiplication order. But either way it should be free, since a transpose is just a register shuffle.

clayjohn · 2025-07-23T08:29:31Z

> The transpose() is not necessary as you can just swap the multiplication order. But either way it should be free, since a transpose is just a register shuffle.

I can't swap the multiplication order in user code 🙃
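For readers following the transpose discussion, the equivalence between `transpose(M) * v` and `v * M` can be checked numerically. This is a minimal sketch in plain Python rather than GLSL. The helper names and the example transform are hypothetical. It assumes the affine transform is stored as three vec4 values holding its top three rows.

```python
def pack_rows_as_mat3x4(affine_rows):
    """Keep the top three rows of a 4x4 affine transform as three vec4 'columns'."""
    return [list(row) for row in affine_rows[:3]]


def transpose(columns):
    """Turn N columns of R components into R columns of N components."""
    n_rows = len(columns[0])
    return [[col[r] for col in columns] for r in range(n_rows)]


def mat_times_vec(columns, v):
    """Mimics GLSL `M * v`: a linear combination of the columns of M."""
    n_rows = len(columns[0])
    return [sum(columns[c][r] * v[c] for c in range(len(columns))) for r in range(n_rows)]


def vec_times_mat(v, columns):
    """Mimics GLSL `v * M`: one dot product per column of M."""
    return [sum(a * b for a, b in zip(v, col)) for col in columns]


# Example transform (an assumption for illustration): rotate 90 degrees
# about Z, then translate by (1, 2, 3).
affine = [
    [0.0, -1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
]
m = pack_rows_as_mat3x4(affine)
point = [1.0, 0.0, 0.0, 1.0]

via_transpose = mat_times_vec(transpose(m), point)  # transpose(M) * v
via_swapped = vec_times_mat(point, m)               # v * M
print(via_transpose, via_swapped)
```

Both paths give the same vec3, so the transpose only matters where the multiplication order is fixed, as in the user code mentioned above.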

Calinou

Tested locally, it works as expected. Code looks good to me.

On a Samsung Galaxy S25 Ultra in 1080p, I can notice an improvement. I get 59 FPS before this PR, 64 FPS after.
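As a rough aid to reading these numbers, FPS figures can be converted to frame times and a relative gain. This is a small sketch with hypothetical helper names. It uses only the 59 and 64 FPS figures reported above.

```python
def frame_time_ms(fps: float) -> float:
    """Average time per frame in milliseconds for a given frame rate."""
    return 1000.0 / fps


def fps_gain_percent(before: float, after: float) -> float:
    """Relative increase in frame rate, in percent."""
    return (after - before) / before * 100.0


before_fps, after_fps = 59.0, 64.0
print(round(frame_time_ms(before_fps), 2), "ms ->", round(frame_time_ms(after_fps), 2), "ms")
print(round(fps_gain_percent(before_fps, after_fps), 1), "% higher frame rate")
```

That works out to roughly 16.95 ms per frame before and 15.63 ms after, or about 8.5% more frames per second.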

PC specifications

CPU: AMD Ryzen 9 9950X3D
GPU: llvmpipe
RAM: 64 GB (2×32 GB DDR5-6000 CL30)
SSD: Solidigm P44 Pro 2 TB
OS: Linux (Fedora 42)

On my PC with llvmpipe and a tiny resolution (64×64), the MRP is very slow, but it is consistently faster after this PR. (I can't spot a difference with the NVIDIA GPU in use, hence llvmpipe.)

`master`

Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)

This PR

Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)
Project FPS: 2 (500.00 mspf)
Project FPS: 1 (1000.00 mspf)

Repiteo · 2025-09-30T23:39:51Z

Thanks!

clayjohn added this to the 4.x milestone Jun 24, 2025

clayjohn added topic:rendering topic:3d performance labels Jun 24, 2025

AThousandShips added the enhancement label Jun 24, 2025

clayjohn force-pushed the RD-mat3x4 branch 2 times, most recently from d4e2ca3 to 579ec17 Compare July 1, 2025 19:32

clayjohn marked this pull request as ready for review July 1, 2025 19:42

clayjohn requested a review from a team as a code owner July 1, 2025 19:42

Calinou approved these changes Jul 23, 2025

clayjohn modified the milestones: 4.x, 4.6 Aug 6, 2025

clayjohn commented Aug 13, 2025

servers/rendering/renderer_rd/shaders/forward_mobile/scene_forward_mobile.glsl (outdated)

clayjohn force-pushed the RD-mat3x4 branch from 579ec17 to e7902b4 Compare September 27, 2025 06:15

Optimize vertex shader using mat3x4 to reduce bandwidth, load/store operations and ALUs

14b60f2

clayjohn force-pushed the RD-mat3x4 branch from e7902b4 to 14b60f2 Compare September 27, 2025 06:20

Repiteo merged commit f969403 into godotengine:master Sep 30, 2025
20 checks passed

clayjohn deleted the RD-mat3x4 branch October 2, 2025 20:20

clayjohn mentioned this pull request Oct 3, 2025

Fix scene shader crash due to rename of view matrix and inverse view matrix #111227

Merged

AeioMuch mentioned this pull request Oct 6, 2025

3D rendering broken in double precision build #111308

Closed

HydrogenC mentioned this pull request Nov 6, 2025

Make plugin work for 4.6 (preview) sphynx-owner/godot-motion-blur-addon-simplified#1

Merged
