core: Optimize CPU bitmap blending and copy operations #23006
jarca0123 wants to merge 1 commit into ruffle-rs:master from
// Fast division by 255: (x + 1 + (x >> 8)) >> 8
// Exact for all values in 0..=65025 (255*255).
#[inline(always)]
fn div255(x: u16) -> u8 {
Why do you think this is better? It produces more complicated assembly:
regular_div:
    movzx eax, di
    imul eax, eax, 32897
    shr eax, 23
    ret
div255:
    mov eax, edi
    movzx ecx, ah
    add eax, ecx
    inc eax
    shr eax, 8
    ret
I would imagine the compiler knows how to optimize a division by 255. Why is it an issue? Which architecture are you optimizing for?
I am not optimizing for any specific architecture; I want to take every target into account. However, this truly is an oversight, and I am reverting it.
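For readers following the thread, here is a minimal sketch of the claim in the code comment: that the bit-trick matches a plain division by 255 on the stated range. The function names `div255_trick`, `div255_plain`, and `first_mismatch` are hypothetical and are not taken from the PR.

```rust
/// The bit-trick from the diff comment: (x + 1 + (x >> 8)) >> 8.
/// Hypothetical name; widened to u32 so the additions cannot overflow.
fn div255_trick(x: u16) -> u8 {
    let x = x as u32;
    ((x + 1 + (x >> 8)) >> 8) as u8
}

/// Plain division, which the compiler is free to lower to a multiply and shift.
fn div255_plain(x: u16) -> u8 {
    (x / 255) as u8
}

/// Returns the first input in 0..=65025 where the two versions disagree, if any.
fn first_mismatch() -> Option<u16> {
    (0..=65025u16).find(|&x| div255_trick(x) != div255_plain(x))
}
```

Since both versions agree on that range, the choice between them comes down to generated code and readability, which is the point of the review comment.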
If you really want to extract a function, you can do it in a way that saves us from casting integers on every line, e.g.
let r = source.red() + scale(self.red(), inv_sa);
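One possible shape for such a helper, as a sketch only: the thread shows the call `scale(self.red(), inv_sa)` but not a definition, so the signature below (two `u8` arguments, `u8` result) is an assumption.

```rust
/// Hypothetical helper: scales an 8-bit channel by an 8-bit factor,
/// computing floor(channel * factor / 255). The widening casts live here
/// so that call sites can stay free of `as` conversions.
fn scale(channel: u8, factor: u8) -> u8 {
    // The product is at most 255 * 255 = 65025, which fits in a u16.
    ((channel as u16 * factor as u16) / 255) as u8
}
```

With this shape, a call site reads like the line suggested above, and the division by 255 is left for the compiler to optimize.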
core/src/bitmap/bitmap_data.rs
.wrapping_add(div255(self.red() as u16 * inv_sa));
let g = source
.green()
.wrapping_add(div255(self.green() as u16 * inv_sa));
Why are you changing it to wrapping_add? Are we expecting an overflow? The assembly is identical (regular addition wraps in release mode and only adds overflow checks in debug mode).
Ooh, didn't know that. Reverting this too.
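To illustrate the distinction discussed here, a small sketch with hypothetical helper names. With Cargo's default profiles, plain `+` on integers panics on overflow in debug builds and wraps in release builds (controlled by the `overflow-checks` profile setting), whereas the explicit methods behave the same in every profile.

```rust
/// Wraps on overflow in every build profile (hypothetical helper).
fn add_wrapping(a: u8, b: u8) -> u8 {
    a.wrapping_add(b)
}

/// Reports overflow as `None` instead of wrapping or panicking
/// (hypothetical helper).
fn add_checked(a: u8, b: u8) -> Option<u8> {
    a.checked_add(b)
}
```

Assuming correctly premultiplied colors, each source channel is at most the source alpha and the scaled destination channel is at most `255 - alpha`, so the sum should not exceed 255 and plain `+` keeps the debug-mode check as a safety net.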
1fcddcd to 9d5796f
}
}

/// Blend a single pixel (src-over, premultiplied alpha, two-lane u32 trick).
Is there a difference between this method and Color::blend_over?
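For context on the "two-lane u32 trick" named in the doc comment, here is a sketch of one common formulation. It is not the PR's code: the `0xAARRGGBB` pixel layout, the function names, and the scalar reference are all assumptions made for illustration.

```rust
/// Selects the low byte of each 16-bit lane.
const LANE_MASK: u32 = 0x00FF_00FF;

/// floor(x / 255) applied to two 16-bit lanes at once.
/// Each lane must hold a value in 0..=65025 so no carry crosses lanes.
fn div255_lanes(t: u32) -> u32 {
    ((t + 0x0001_0001 + ((t >> 8) & LANE_MASK)) >> 8) & LANE_MASK
}

/// Src-over for premultiplied pixels, assumed to be packed as 0xAARRGGBB.
/// Assumes valid premultiplied input, so per-channel sums stay <= 255.
fn blend_over(src: u32, dst: u32) -> u32 {
    let inv_sa = 255 - (src >> 24);
    // Red and blue share one u32, alpha and green share another.
    let rb = div255_lanes((dst & LANE_MASK) * inv_sa);
    let ag = div255_lanes(((dst >> 8) & LANE_MASK) * inv_sa);
    src.wrapping_add(rb | (ag << 8))
}

/// Channel-by-channel reference used only to check the packed version.
fn blend_over_scalar(src: u32, dst: u32) -> u32 {
    let inv_sa = 255 - (src >> 24);
    let mut out = 0u32;
    for shift in [0, 8, 16, 24] {
        let s = (src >> shift) & 0xFF;
        let d = (dst >> shift) & 0xFF;
        out |= ((s + d * inv_sa / 255) & 0xFF) << shift;
    }
    out
}
```

The packed version performs two multiplications per pixel instead of four. Whether that beats a per-channel implementation on a given target is something only profiling can show.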
I do want, yes 😄
Well, in order to keep everything in one place, here are the benchmarks:
Also, I found a quicker algorithm, so I'll update the PR.
Per the future CONTRIBUTING.md, I want to state that this code was generated by Claude Code, specifically Claude Opus 4.6. My workflow is to iteratively profile Ruffle with various SWFs, feed the profiler output to the LLM, have it identify hotspots and generate optimizations, then re-profile to verify the improvement. This is one of many PRs I intend to submit that were produced this way.
I have reviewed the code to the best of my ability, though in full transparency, I have not audited every individual line in detail.
EDIT: Here is a description of what it does: This replaces per-pixel get/set_pixel32_raw loops with row-based slice operations: memcpy for copies, and a two-lane u32 blend with div255 bit-trick for alpha compositing.
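As a rough illustration of the row-based copy described above, a minimal sketch under assumed names: `copy_rect`, its parameters, and the `u32` row-major buffer layout are hypothetical and do not come from the PR.

```rust
/// Copies a `width` x `height` rectangle from the top-left corner of `src`
/// to the top-left corner of `dst`. Both buffers are row-major, and each
/// stride is the number of pixels per row (hypothetical signature).
fn copy_rect(
    src: &[u32],
    src_stride: usize,
    dst: &mut [u32],
    dst_stride: usize,
    width: usize,
    height: usize,
) {
    for y in 0..height {
        let src_row = &src[y * src_stride..y * src_stride + width];
        let dst_row = &mut dst[y * dst_stride..y * dst_stride + width];
        // One bulk copy per row instead of one get/set call per pixel.
        dst_row.copy_from_slice(src_row);
    }
}
```

`copy_from_slice` panics if the two slices differ in length, so both row slices are taken with the same `width`.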
Here are some benchmarks:
Excerpt from my "comprehensive" SWF benchmarks (I can share the full SWF if you want):