SSSE3/SSE4: alpha blending — operator over

Author: Wojciech Muła
Added on: 2008-06-03
Updates: 2016-03-03 (results from Core i5)



Alpha blending refers to many different operations. This note describes results for the over operator that works on RGBA pixels with premultiplied alpha.

Basic formula:

background = (alpha * foreground) + background

where alpha is in the range [0 .. 255], and + denotes addition with saturation.

The reference implementation coded in C:

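/* foreground and background are 32-bit RGBA pixels (uint32_t) */
uint32_t Rf =  foreground        & 0xff;
uint32_t Gf = (foreground >>  8) & 0xff;
uint32_t Bf = (foreground >> 16) & 0xff;
uint32_t Af = (foreground >> 24) & 0xff;

uint32_t Rb =  background        & 0xff;
uint32_t Gb = (background >>  8) & 0xff;
uint32_t Bb = (background >> 16) & 0xff;

/* scale each component by alpha, then add the background component */
uint32_t R = (Rf * Af)/256 + Rb;
uint32_t G = (Gf * Af)/256 + Gb;
uint32_t B = (Bf * Af)/256 + Bb;

/* saturate to the byte range */
if (R > 255) R = 255;
if (G > 255) G = 255;
if (B > 255) B = 255;

background = R | (G << 8) | (B << 16);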

Note: dividing by 256 instead of 255 means a component never reaches 255 after multiplication (255 * 255 / 256 = 254). Obtaining the exact range needs some additional operations, but the difference is probably not noticeable.
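To give an idea of what such additional operations could look like, here is a minimal sketch that is not part of the sample program. The helper names mul_div256, mul_div255 and count_rounding_errors are made up for illustration. It compares the shortcut with a commonly used shift-and-add trick for a rounded division by 255.

```c
#include <assert.h>
#include <stdint.h>

/* Shortcut used by the reference code: (x * a) / 256, at most 254. */
static uint32_t mul_div256(uint32_t x, uint32_t a)
{
    assert(x <= 255 && a <= 255);
    return (x * a) >> 8;
}

/* Rounded x * a / 255 without a division (hypothetical helper). */
static uint32_t mul_div255(uint32_t x, uint32_t a)
{
    assert(x <= 255 && a <= 255);
    uint32_t t = x * a + 128;
    return (t + (t >> 8)) >> 8;
}

/* Count the inputs for which mul_div255 differs from round(x * a / 255). */
static unsigned count_rounding_errors(void)
{
    unsigned errors = 0;
    for (uint32_t x = 0; x <= 255; x++) {
        for (uint32_t a = 0; a <= 255; a++) {
            uint32_t exact = (2 * x * a + 255) / 510; /* round to nearest */
            if (mul_div255(x, a) != exact)
                errors++;
        }
    }
    return errors;
}
```

With these helpers mul_div256(255, 255) gives 254 while mul_div255(255, 255) gives 255. The price is one extra addition, shift and add per component, which is why the shortcut is attractive.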

SSSE3 and SSE4 algorithm outline

  1. Load 4 foreground pixels:

    xmm0 = [r0 g0 b0 a0|r1 g1 b1 a1|r2 g2 b2 a2|r3 g3 b3 a3]
  2. Extend the components from bytes to words. The SSSE3 implementation uses PUNPCKLBW and PUNPCKHBW, the SSE4 one uses PMOVZXBW (PSHUFB could also be used, but it needs two additional vectors):

    xmm2 = [r0 __ g0 __|b0 __ a0 __|r1 __ g1 __ |b1 __ a1 __]
    xmm3 = [r2 __ g2 __|b2 __ a2 __|r3 __ g3 __ |b3 __ a3 __]
  3. Copy alpha into the higher byte of the R, G and B words, which also multiplies it by 256 (PSHUFB):

    xmm0 = [__ a0 __ a0|__ a0 __ __|__ a1 __ a1 |__ a1 __ __]
    xmm1 = [__ a2 __ a2|__ a2 __ __|__ a3 __ a3 |__ a3 __ __]
  4. Multiply the components by alpha: x * (alpha << 8). PMULHUW returns the higher word of each product, so no additional right shift is needed to get back to the range [0..255]:

    xmm0 = [r0*a0 g0*a0|b0*a0 __ __|r1*a1 g1*a1|b1*a1 __ __]
    xmm2 = [r2*a2 g2*a2|b2*a2 __ __|r3*a3 g3*a3|b3*a3 __ __]
  5. Since the multiplication results are never greater than 255, the words can be converted to bytes without loss (PACKUSWB):

    xmm0 = [R0 G0 B0 __|R1 G1 B1 __|R2 G2 B2 __|R3 G3 B3 __]
  6. The last step is to load the background pixels, add xmm0 to them with saturation (PADDUSB), and store the results back.
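The six steps can be sketched with compiler intrinsics. This is only an illustration under a few assumptions, not the code from the sample program: the function name blend_over_ssse3 is made up, the pixel count must be a multiple of 4, pixels are stored with R in the lowest byte, and the target attribute requires GCC or Clang on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3 intrinsics (pulls in the SSE2 ones too) */

enum { Z = -128 };  /* 0x80 in a PSHUFB mask zeroes the destination byte */

/* Blend n premultiplied RGBA pixels of foreground over background. */
__attribute__((target("ssse3")))
static void blend_over_ssse3(uint32_t *background, const uint32_t *foreground,
                             size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    /* Step 3 masks: alpha goes to the higher byte of the R, G, B words. */
    const __m128i mask01 = _mm_setr_epi8(Z, 3, Z, 3, Z, 3, Z, Z,
                                         Z, 7, Z, 7, Z, 7, Z, Z);
    const __m128i mask23 = _mm_setr_epi8(Z, 11, Z, 11, Z, 11, Z, Z,
                                         Z, 15, Z, 15, Z, 15, Z, Z);

    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        /* 1. load 4 foreground pixels */
        __m128i fg = _mm_loadu_si128((const __m128i *)(foreground + i));
        /* 2. bytes to words (PUNPCKLBW / PUNPCKHBW) */
        __m128i lo = _mm_unpacklo_epi8(fg, zero);
        __m128i hi = _mm_unpackhi_epi8(fg, zero);
        /* 3. alpha << 8 in the R, G, B words, zero in the A word (PSHUFB) */
        __m128i a_lo = _mm_shuffle_epi8(fg, mask01);
        __m128i a_hi = _mm_shuffle_epi8(fg, mask23);
        /* 4. higher word of x * (alpha << 8), i.e. (x * alpha) >> 8 (PMULHUW) */
        lo = _mm_mulhi_epu16(lo, a_lo);
        hi = _mm_mulhi_epu16(hi, a_hi);
        /* 5. words back to bytes (PACKUSWB) */
        __m128i scaled = _mm_packus_epi16(lo, hi);
        /* 6. saturated add to the background (PADDUSB) and store */
        __m128i bg = _mm_loadu_si128((const __m128i *)(background + i));
        _mm_storeu_si128((__m128i *)(background + i),
                         _mm_adds_epu8(bg, scaled));
    }
}
```

In this sketch the alpha words are multiplied by zero, so the alpha byte of the background passes through unchanged, whereas the reference code clears it. An SSE4 variant would presumably differ only in step 2, where _mm_cvtepu8_epi16 (PMOVZXBW) can replace the unpack instructions.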

Possible drawback

While SSE can perform 8 multiplies, the sample program uses just 6 results, i.e. 75% of full power is utilized. At the moment I have no idea how to overcome this at low cost.

SSSE3 and SSE4 implementation

Sample program blend_32bpp contains four procedures, reported in the test results as x86, SSSE3, SSE4 and SSE4-2.

Test results

The program was compiled with the following options:

gcc -O3 -Wall -pedantic -std=c99 blend_32bpp.c -o test


Two images of 640 x 480 pixels (1.2MB each) were blended 1'000 times; each test was repeated 5 times and the results were averaged.

The test machine was a Core 2 Duo @ 2.6GHz running Linux.
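The measurement procedure described above can be sketched as follows. This is a guess at the structure, not the harness from the sample program: the names blend_fn, blend_scalar and benchmark are hypothetical, the pixel data is arbitrary, and clock() (CPU time) is an assumption, since the text does not say how time was measured.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

enum { WIDTH = 640, HEIGHT = 480, PIXELS = WIDTH * HEIGHT };

typedef void (*blend_fn)(uint32_t *background, const uint32_t *foreground,
                         size_t n);

/* Scalar stand-in for the procedure under test (same math as the reference). */
static void blend_scalar(uint32_t *background, const uint32_t *foreground,
                         size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t f = foreground[i], b = background[i], a = f >> 24, out = 0;
        for (int shift = 0; shift < 24; shift += 8) {
            uint32_t c = (((f >> shift) & 0xff) * a) / 256
                       + ((b >> shift) & 0xff);
            out |= (c > 255 ? 255u : c) << shift;
        }
        background[i] = out;
    }
}

/* Average time in seconds of one test (`iterations` blends of two images),
 * taken over `repeats` tests. Returns -1.0 if allocation fails. */
static double benchmark(blend_fn blend, int iterations, int repeats)
{
    uint32_t *fg = malloc(PIXELS * sizeof *fg);
    uint32_t *bg = malloc(PIXELS * sizeof *bg);
    double total = 0.0;

    assert(iterations > 0 && repeats > 0);
    if (fg == NULL || bg == NULL) {
        free(fg);
        free(bg);
        return -1.0;
    }
    for (size_t i = 0; i < (size_t)PIXELS; i++) {  /* arbitrary pixel data */
        fg[i] = (uint32_t)rand() * 2654435761u;
        bg[i] = (uint32_t)rand() * 40503u;
    }
    for (int r = 0; r < repeats; r++) {
        clock_t start = clock();
        for (int i = 0; i < iterations; i++)
            blend(bg, fg, PIXELS);
        total += (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    free(fg);
    free(bg);
    return total / repeats;
}
```

A call such as benchmark(blend_scalar, 1000, 5) would correspond to the setup described above, with the SIMD procedures passed in place of blend_scalar.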

procedure  time [s]  speedup
x86        1.727     100%     =====
SSSE3      0.238     725%     ====================================
SSE4       0.256     675%     ==================================
SSE4-2     0.173     995%     ==================================================

The results are impressive, and I think a well-tuned procedure could bring an even bigger speedup.

Core i5

Two images of 640 x 480 pixels (1.2MB each) were blended 10'000 times.

Machine: Core i5 M540 @ 2.53GHz

procedure  time [s]  speedup
x86        14.69     1.00     =====
SSSE3      2.38      6.17     ==============================
SSE4       2.41      6.09     ==============================
SSE4-2     1.96      7.49     =====================================