Author: | Wojciech Muła |
---|---|
Added on: | 2008-06-03 |
Updates: | 2016-03-03 (results from Core i5) |
Alpha blending refers to many different operations. This note describes results for the over operator that works on RGBA pixels with premultiplied alpha.
Basic formula:
background = (alpha * foreground) + background
where alpha is in the range [0 .. 255], and + denotes addition with saturation.
The reference implementation coded in C:
    Rf =  foreground        & 0xff
    Gf = (foreground >>  8) & 0xff
    Bf = (foreground >> 16) & 0xff
    Af = (foreground >> 24) & 0xff

    Rb =  background        & 0xff
    Gb = (background >>  8) & 0xff
    Bb = (background >> 16) & 0xff

    R = (Rf * Af)/256 + Rb
    G = (Gf * Af)/256 + Gb
    B = (Bf * Af)/256 + Bb

    if (R > 255) R = 255
    if (G > 255) G = 255
    if (B > 255) B = 255

    background = R | (G << 8) | (B << 16)
Note: dividing by 256 never yields a component value of 255 (the largest product, 255 * 255, gives 254). Obtaining the full range requires some additional operations. Probably no one will notice the difference.
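The note above does not say which additional operations are meant. One widely used option, shown here only as a sketch, is a rounded division by 255 built from adds and shifts. The function names `scale_div256` and `scale_div255` are made up for this illustration and do not come from the sample program.

```c
#include <stdint.h>

/* Scaling as in the reference code: truncating division by 256.
   The largest possible result is (255 * 255) >> 8 = 254. */
static inline uint8_t scale_div256(uint8_t x, uint8_t alpha)
{
    return (uint8_t)(((unsigned)x * alpha) >> 8);
}

/* Assumed alternative: x * alpha / 255 rounded to nearest,
   computed without a division instruction. */
static inline uint8_t scale_div255(uint8_t x, uint8_t alpha)
{
    unsigned t = (unsigned)x * alpha + 128;
    return (uint8_t)((t + (t >> 8)) >> 8);
}
```

With these helpers `scale_div256(255, 255)` gives 254, while `scale_div255(255, 255)` gives 255. The price is two extra additions and one extra shift per component.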
Load 4 foreground pixels:
xmm0 = [r0 g0 b0 a0|r1 g1 b1 a1|r2 g2 b2 a2|r3 g3 b3 a3]
Extend the component range from bytes to words. The SSSE3 implementation uses PUNPCKLBW and PUNPCKHBW, the SSE4 implementation uses PMOVZXBW (PSHUFB could also be used, but it needs two additional vectors):

    xmm2 = [r0 __ g0 __|b0 __ a0 __|r1 __ g1 __|b1 __ a1 __]
    xmm3 = [r2 __ g2 __|b2 __ a2 __|r3 __ g3 __|b3 __ a3 __]
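The two ways of widening bytes to words can be sketched with compiler intrinsics. This is an illustration only: the function names `widen_unpack` and `widen_pmovzx` are invented here, and the sample program may well use hand-written assembly instead. The `target` attribute is a GCC/Clang extension that allows building without `-msse4.1`; with that flag on the command line it can be dropped.

```c
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1 header, also pulls in the SSE2 intrinsics */

/* PUNPCKLBW / PUNPCKHBW: interleave the source bytes with zeros. */
static void widen_unpack(const uint8_t src[16], uint16_t dst[16])
{
    const __m128i zero = _mm_setzero_si128();
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst,       _mm_unpacklo_epi8(v, zero));
    _mm_storeu_si128((__m128i *)(dst + 8), _mm_unpackhi_epi8(v, zero));
}

/* PMOVZXBW: zero-extend the low 8 bytes; the high 8 bytes are
   first moved down with a byte shift (PSRLDQ). */
__attribute__((target("sse4.1")))
static void widen_pmovzx(const uint8_t src[16], uint16_t dst[16])
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst,       _mm_cvtepu8_epi16(v));
    _mm_storeu_si128((__m128i *)(dst + 8),
                     _mm_cvtepu8_epi16(_mm_srli_si128(v, 8)));
}
```

Both functions produce the same 16 words. The PMOVZXBW variant does not need a zeroed register, but it handles only the low half of a vector per instruction.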
Populate alpha and multiply by 256 (PSHUFB):
    xmm0 = [__ a0 __ a0|__ a0 __ __|__ a1 __ a1|__ a1 __ __]
    xmm1 = [__ a2 __ a2|__ a2 __ __|__ a3 __ a3|__ a3 __ __]
Multiply the components by alpha: x * (alpha << 8). PMULHUW returns the higher word of each product, so no additional right shift is needed to get back to the range [0..255]:

    xmm0 = [r0*a0 g0*a0|b0*a0 __ __|r1*a1 g1*a1|b1*a1 __ __]
    xmm2 = [r2*a2 g2*a2|b2*a2 __ __|r3*a3 g3*a3|b3*a3 __ __]
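The reason no shift is needed can be checked with a scalar model of a single PMULHUW lane. This is a sketch for illustration; `mulhi_u16` and `scale_component` are names chosen here, not part of the sample program.

```c
#include <stdint.h>

/* One PMULHUW lane: the upper 16 bits of an unsigned 16x16-bit product. */
static inline uint16_t mulhi_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b) >> 16);
}

/* Alpha sits in the upper byte of its word, i.e. it is pre-shifted by 8.
   Taking the high word of x * (alpha << 8) is then the same as
   computing (x * alpha) >> 8. */
static inline uint16_t scale_component(uint8_t x, uint8_t alpha)
{
    uint16_t alpha_shifted = (uint16_t)(alpha << 8);
    return mulhi_u16(x, alpha_shifted);
}
```

For every pair of byte values, `scale_component(x, alpha)` equals `(x * alpha) >> 8`, which matches the `/256` in the reference implementation.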
Since the multiplication results are never greater than 255, the words can be converted back to bytes without loss (PACKUSWB):
xmm0 = [R0 G0 B0 __|R1 G1 B1 __|R2 G2 B2 __|R3 G3 B3 __]
The last step is to load the background pixels, add them with saturation (PADDUSB) to xmm0, and store the result.
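The whole sequence for four pixels can be sketched with SSSE3 intrinsics as follows. This is a reconstruction from the steps described above, not the code of the sample program. The names `blend4_ssse3` and `blend1_scalar` are hypothetical, and the `target` attribute is a GCC/Clang extension that can be replaced by compiling with `-mssse3`. The scalar function is included only for comparison. Unlike the reference code it keeps the background alpha byte, because PADDUSB adds zero to that byte.

```c
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* Scalar model of one pixel, used to check the SIMD version. */
static uint32_t blend1_scalar(uint32_t fg, uint32_t bg)
{
    uint32_t a = fg >> 24;
    uint32_t out = bg & 0xff000000u;   /* background alpha is left as is */
    for (int shift = 0; shift < 24; shift += 8) {
        uint32_t c = ((((fg >> shift) & 0xff) * a) >> 8)
                   + ((bg >> shift) & 0xff);
        if (c > 255) c = 255;          /* saturate */
        out |= c << shift;
    }
    return out;
}

/* Blend 4 foreground pixels into 4 background pixels. */
__attribute__((target("ssse3")))
static void blend4_ssse3(const uint32_t *foreground, uint32_t *background)
{
    const __m128i zero = _mm_setzero_si128();
    /* PSHUFB masks: copy each pixel's alpha into the upper byte of the
       R, G and B words; -1 writes zero. */
    const __m128i alpha01 = _mm_setr_epi8(-1, 3, -1, 3, -1, 3, -1, -1,
                                          -1, 7, -1, 7, -1, 7, -1, -1);
    const __m128i alpha23 = _mm_setr_epi8(-1, 11, -1, 11, -1, 11, -1, -1,
                                          -1, 15, -1, 15, -1, 15, -1, -1);

    __m128i fg  = _mm_loadu_si128((const __m128i *)foreground);
    __m128i c01 = _mm_unpacklo_epi8(fg, zero);    /* PUNPCKLBW */
    __m128i c23 = _mm_unpackhi_epi8(fg, zero);    /* PUNPCKHBW */
    __m128i a01 = _mm_shuffle_epi8(fg, alpha01);  /* alpha << 8 */
    __m128i a23 = _mm_shuffle_epi8(fg, alpha23);

    c01 = _mm_mulhi_epu16(c01, a01);              /* PMULHUW */
    c23 = _mm_mulhi_epu16(c23, a23);

    __m128i scaled = _mm_packus_epi16(c01, c23);  /* PACKUSWB */
    __m128i bg = _mm_loadu_si128((const __m128i *)background);
    _mm_storeu_si128((__m128i *)background,
                     _mm_adds_epu8(bg, scaled));  /* PADDUSB */
}
```

A caller would walk over the image four pixels at a time, for example `blend4_ssse3(fg + i, bg + i)` with `i` advancing by 4, and handle any leftover pixels with the scalar routine.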
While a single SSE instruction can perform 8 word multiplications, the sample program uses just 6 of the results, i.e. only 75% of the available throughput is utilized. At the moment I have no idea how to overcome this at low cost.
The sample program blend_32bpp contains four procedures, listed in the results below as x86, SSSE3, SSE4 and SSE4-2.
The program was compiled with the following options:
gcc -O3 -Wall -pedantic -std=c99 blend_32bpp.c -o test
Two images of 640 x 480 pixels (1.2MB) were blended 1'000 times. Each test was repeated 5 times and the results were averaged.
The test machine was a Core 2 Duo @ 2.6GHz running Linux.
procedure | time [s] | speedup | |
---|---|---|---|
x86 | 1.727 | 100% | ===== |
SSSE3 | 0.238 | 725% | ==================================== |
SSE4 | 0.256 | 675% | ================================== |
SSE4-2 | 0.173 | 995% | ================================================== |
The results are impressive, and I think a well-tuned procedure could bring an even bigger speedup.
Two images of 640 x 480 pixels (1.2MB) were blended 10'000 times.
Machine: Core i5 M540 @ 2.53GHz
procedure | time [s] | speedup | |
---|---|---|---|
x86 | 14.69 | 1.00 | ===== |
SSSE3 | 2.38 | 6.17 | ============================== |
SSE4 | 2.41 | 6.09 | ============================== |
SSE4-2 | 1.96 | 7.49 | ===================================== |