
core: Optimize CPU bitmap blending and copy operations #23006

Open
jarca0123 wants to merge 1 commit into ruffle-rs:master from jarca0123:optimize-cpu-bitmap-blending

jarca0123 (Contributor) commented Feb 11, 2026

Per future CONTRIBUTING.md, I want to state that this code was generated by Claude Code, specifically Claude Opus 4.6. My workflow is iteratively profiling Ruffle with various SWFs, feeding the profiler output to the LLM, having it identify hotspots and generate optimizations, then re-profiling to verify the improvement. This is one of many PRs I intend to submit that were produced this way.

I have reviewed the code to the best of my ability, though in full transparency, I have not audited every individual line in detail.

EDIT: Here is a description of what it does: it replaces the per-pixel get/set_pixel32_raw loops with row-based slice operations, using memcpy for copies and a two-lane u32 blend with a div255 bit trick for alpha compositing.
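For readers unfamiliar with the technique, here is a minimal sketch of what row-based copying and a two-lane u32 blend can look like. It is not the code from this PR: the function names (`copy_row`, `div255_two_lanes`, `blend_pixel`, `blend_row`) and the 0xAARRGGBB premultiplied layout are assumptions made for illustration.

```rust
/// Copy one row of pixels with a single slice copy (typically a memcpy).
/// Both slices must have the same length, otherwise `copy_from_slice` panics.
fn copy_row(dst: &mut [u32], src: &[u32]) {
    dst.copy_from_slice(src);
}

/// Divide both 16-bit lanes of `x` by 255 at once.
/// Each lane must hold a value in 0..=65025 (255 * 255).
fn div255_two_lanes(x: u32) -> u32 {
    // Per lane: (v + 1 + (v >> 8)) >> 8, with masks keeping the lanes apart.
    let t = x + 0x0001_0001 + ((x >> 8) & 0x00FF_00FF);
    (t >> 8) & 0x00FF_00FF
}

/// Src-over blend of one premultiplied pixel, assumed layout 0xAARRGGBB.
/// Assumes valid premultiplied input (every channel <= alpha), so the
/// per-channel sums in the final addition never exceed 255.
fn blend_pixel(src: u32, dst: u32) -> u32 {
    let inv_sa = 255 - (src >> 24);
    // Two channels per multiplication: red/blue in one word, alpha/green in the other.
    let rb = (dst & 0x00FF_00FF) * inv_sa;
    let ag = ((dst >> 8) & 0x00FF_00FF) * inv_sa;
    let scaled = div255_two_lanes(rb) | (div255_two_lanes(ag) << 8);
    src.wrapping_add(scaled)
}

/// Blend a whole row; pairs beyond the shorter slice are ignored by `zip`.
fn blend_row(dst: &mut [u32], src: &[u32]) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = blend_pixel(s, *d);
    }
}
```

Blending 50% premultiplied red (0x80800000) over opaque black (0xFF000000) with this sketch yields 0xFF800000, which matches a per-channel `src + dst * inv_sa / 255` computation.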

Here are some benchmarks:

| Benchmark | This PR (ms) | Nightly (ms) | Change |
| --- | --- | --- | --- |
| copypixels_merge_alpha | 468 | 2459 | -81.0% |
| copypixels_merge_alpha_small | 33 | 168 | -80.4% |
| colortransform_alpha_tint | 388 | 360 | +7.8% |
| sprite_render_pipeline | 420 | 534 | -21.3% |
| draw_alpha_blend | 1076 | 1694 | -36.5% |
| draw_alpha_matrix_blend | 1499 | 1639 | -8.5% |
| particle_fade_composite | 113 | 131 | -13.7% |
| multi_sprite_composite | 195 | 817 | -76.1% |
| locked_buffer_composite | 32 | 158 | -79.7% |

Excerpt from my "comprehensive" SWF benchmarks (however I can share the full SWF if you want):

  private function benchCopyPixelsMergeAlpha(n:int):void {
      var src:BitmapData = new BitmapData(256, 256, true, 0x80FF0000); // 50% alpha red
      var dst:BitmapData = new BitmapData(256, 256, true, 0xFF000000); // opaque black
      var rect:Rectangle = new Rectangle(0, 0, 256, 256);
      var pt:Point = new Point(0, 0);
      for (var i:int = 0; i < n; i++) {
          dst.copyPixels(src, rect, pt, null, null, true); // mergeAlpha=true
      }
      src.dispose();
      dst.dispose();
  }

// Fast division by 255: (x + 1 + (x >> 8)) >> 8
// Exact for all values in 0..=65025 (255*255).
#[inline(always)]
fn div255(x: u16) -> u8 {

Member

Why do you think this is better? It produces more complicated assembly:

regular_div:
        movzx   eax, di
        imul    eax, eax, 32897
        shr     eax, 23
        ret

div255:
        mov     eax, edi
        movzx   ecx, ah
        add     eax, ecx
        inc     eax
        shr     eax, 8
        ret

I would imagine the compiler knows how to optimize a division by 255, why is it an issue? Which architecture are you optimizing for?
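To make the comparison concrete, here is a small sketch that checks the bit-trick form against plain division over the range stated in the diff comment. The names `div255_trick`, `div255_plain`, and `forms_agree` are illustrative and do not come from the PR.

```rust
/// Bit-trick form quoted in the diff: (x + 1 + (x >> 8)) >> 8.
/// Only valid for x in 0..=65025; larger inputs overflow the u16 sum.
fn div255_trick(x: u16) -> u8 {
    ((x + 1 + (x >> 8)) >> 8) as u8
}

/// Plain division, which the compiler is free to lower to a multiply and a shift.
fn div255_plain(x: u16) -> u8 {
    (x / 255) as u8
}

/// True if both forms agree for every x in 0..=65025 (255 * 255).
fn forms_agree() -> bool {
    (0..=255u16 * 255).all(|x| div255_trick(x) == div255_plain(x))
}
```

Since both forms return the same values on that range, choosing between them is purely a question of generated code, which is the point the review comment raises.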

Contributor Author

I am not optimizing for any specific architecture; I want to take every target into account. However, this really is an oversight, and I am reverting it.

Member

If you really want to extract a function, you can do it in a way that avoids casting integers on every line, e.g.

let r = source.red() + scale(self.red(), inv_sa);
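A helper along these lines might look like the sketch below. The signature of `scale` is an assumption (the comment only shows a call site), and `blend_channel` is a hypothetical caller added for illustration.

```rust
/// Hypothetical helper: floor(channel * factor / 255), with the widening
/// to u16 done in one place so callers do not need any casts.
fn scale(channel: u8, factor: u8) -> u8 {
    (u16::from(channel) * u16::from(factor) / 255) as u8
}

/// Src-over for a single premultiplied channel.
/// Assumes src <= src_alpha, so the sum cannot exceed 255.
fn blend_channel(src: u8, dst: u8, src_alpha: u8) -> u8 {
    let inv_sa = 255 - src_alpha;
    src + scale(dst, inv_sa)
}
```

With the cast hidden inside `scale`, each channel line at the call site reduces to a single addition.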

.wrapping_add(div255(self.red() as u16 * inv_sa));
let g = source
.green()
.wrapping_add(div255(self.green() as u16 * inv_sa));

Member

Why are you changing it to wrapping_add? Are we expecting an overflow? The assembly is identical (regular addition wraps in release mode and only adds overflow checks in debug mode).
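For reference, a short sketch of the explicit overflow-handling methods on `u8`. The wrapper names are illustrative. Plain `+` is left out because its overflow behaviour depends on the build profile: with default settings it panics in debug builds and wraps in release builds.

```rust
/// Always wraps modulo 256, regardless of build profile.
fn add_wrapping(a: u8, b: u8) -> u8 {
    a.wrapping_add(b)
}

/// Returns None on overflow instead of wrapping or panicking.
fn add_checked(a: u8, b: u8) -> Option<u8> {
    a.checked_add(b)
}

/// Clamps the result to 255 on overflow.
fn add_saturating(a: u8, b: u8) -> u8 {
    a.saturating_add(b)
}
```

If overflow is not expected, plain `+` keeps the debug-mode check as a safety net, while `wrapping_add` states that wrapping is intended.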

Contributor Author

Ooh, didn't know that. Reverting this too.

@kjarosh kjarosh added T-perf Type: Performance Improvements A-core Area: Core player, where no other category fits llm The PR contains mostly LLM-generated code waiting-on-author Waiting on the PR author to make the requested changes labels Feb 11, 2026
@jarca0123 jarca0123 force-pushed the optimize-cpu-bitmap-blending branch from 1fcddcd to 9d5796f Compare February 12, 2026 06:27
}
}

/// Blend a single pixel (src-over, premultiplied alpha, two-lane u32 trick).

Collaborator

Is there a difference between this method and Color::blend_over?

n0samu (Member) commented Feb 14, 2026

> (however I can share the full SWF if you want):

I do want, yes 😄

jarca0123 (Contributor Author) commented Feb 15, 2026

> > (however I can share the full SWF if you want):
>
> I do want, yes 😄

Well, in order to keep everything in one place, here are the benchmarks:
test.zip

Also, I found a quicker algorithm, so I'll update the PR.
