SSSE3/SSE4: alpha blending — operator over

Author: Wojciech Muła
Added on: 2008-06-03
Updates: 2016-03-03 (results from Core i5)

Introduction

Alpha blending refers to many different operations. This note describes results for the over operator that works on RGBA pixels with premultiplied alpha.

Basic formula:

background = (alpha * foreground) + background

where alpha is in the range [0 .. 255], and + denotes addition with saturation.

The reference implementation coded in C:

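/* foreground and background are assumed to be 32-bit unsigned pixels */
unsigned Rf =  foreground & 0xff;
unsigned Gf = (foreground >>  8) & 0xff;
unsigned Bf = (foreground >> 16) & 0xff;
unsigned Af = (foreground >> 24) & 0xff;

unsigned Rb =  background & 0xff;
unsigned Gb = (background >>  8) & 0xff;
unsigned Bb = (background >> 16) & 0xff;

/* scale each component by alpha, then add the background component */
unsigned R = (Rf * Af)/256 + Rb;
unsigned G = (Gf * Af)/256 + Gb;
unsigned B = (Bf * Af)/256 + Bb;

/* saturate to the byte range */
if (R > 255) R = 255;
if (G > 255) G = 255;
if (B > 255) B = 255;

background = R | (G << 8) | (B << 16);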

Note: dividing by 256 never yields a component value of 255 (the largest result is 255 * 255 / 256 = 254); obtaining the full range requires some additional operations. The difference is probably not noticeable.
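The range issue from the note above can be shown with a short sketch. The helper names scale_div256 and scale_div255 are hypothetical and are not part of the sample program; the second one uses a well-known shift-based rounding of x*a/255, given here only as one possible form of the "additional operations", not necessarily what the author had in mind.

```c
#include <assert.h>
#include <stdint.h>

/* Scaling as used in the reference implementation: (x * a) / 256.
   The largest possible result is 254, never 255. */
unsigned scale_div256(unsigned x, unsigned a)
{
    assert(x <= 255 && a <= 255);
    return (x * a) / 256;
}

/* Hypothetical alternative: x * a / 255 rounded to nearest,
   computed with two shifts and two additions instead of a division. */
unsigned scale_div255(unsigned x, unsigned a)
{
    assert(x <= 255 && a <= 255);
    uint32_t t = (uint32_t)x * a + 128;
    return (t + (t >> 8)) >> 8;
}
```

With these definitions scale_div256(255, 255) gives 254, while scale_div255(255, 255) gives 255. The price is a few extra operations per component.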

SSSE3 and SSE4 algorithm outline

  1. Load 4 foreground pixels:

    xmm0 = [r0 g0 b0 a0|r1 g1 b1 a1|r2 g2 b2 a2|r3 g3 b3 a3]
    
  2. Extend the components from bytes to words. The SSSE3 implementation uses PUNPCKLBW and PUNPCKHBW, the SSE4 one uses PMOVZXBW (PSHUFB could also be used, but it needs two additional mask vectors):

    xmm2 = [r0 __ g0 __|b0 __ a0 __|r1 __ g1 __ |b1 __ a1 __]
    xmm3 = [r2 __ g2 __|b2 __ a2 __|r3 __ g3 __ |b3 __ a3 __]
    
  3. Copy alpha into the high byte of each color word, which multiplies it by 256 (PSHUFB):

    xmm0 = [__ a0 __ a0|__ a0 __ __|__ a1 __ a1 |__ a1 __ __]
    xmm1 = [__ a2 __ a2|__ a2 __ __|__ a3 __ a3 |__ a3 __ __]
    
  4. Multiply the components by alpha: x * (alpha << 8). PMULHUW returns the higher word of the 32-bit product, so no additional right shift is needed to get back to the range [0..255]:

    xmm0 = [r0*a0 g0*a0|b0*a0 __ __|r1*a1 g1*a1|b1*a1 __ __]
    xmm2 = [r2*a2 g2*a2|b2*a2 __ __|r3*a3 g3*a3|b3*a3 __ __]
    
  5. Since the multiplication results never exceed 255, the words can be converted back to bytes without loss (PACKUSWB):

    xmm0 = [R0 G0 B0 __|R1 G1 B1 __|R2 G2 B2 __|R3 G3 B3 __]
    
  6. The last step is to load the background pixels, add them with saturation (PADDUSB) to xmm0, and store the results back.
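The six steps above can be sketched with compiler intrinsics. This is a minimal illustration, not the code from the sample program: the function name blend_ssse3 is hypothetical, the pixel count is assumed to be a multiple of 4, unaligned loads are used for simplicity, and the target attribute assumes GCC or Clang on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3 intrinsics (_mm_shuffle_epi8) */

/* Hypothetical sketch of the outline; n must be a multiple of 4. */
__attribute__((target("ssse3")))
void blend_ssse3(const uint32_t *fg, uint32_t *bg, size_t n)
{
    assert(n % 4 == 0);
    const __m128i zero = _mm_setzero_si128();
    /* PSHUFB masks: alpha goes to the high byte of the R, G and B words;
       an index of -1 (high bit set) writes a zero byte. */
    const __m128i alpha_lo = _mm_setr_epi8(
        -1, 3, -1, 3, -1, 3, -1, -1,  -1, 7, -1, 7, -1, 7, -1, -1);
    const __m128i alpha_hi = _mm_setr_epi8(
        -1, 11, -1, 11, -1, 11, -1, -1,  -1, 15, -1, 15, -1, 15, -1, -1);

    for (size_t i = 0; i < n; i += 4) {
        /* step 1: load 4 foreground pixels */
        __m128i f = _mm_loadu_si128((const __m128i *)(fg + i));
        /* step 2: bytes -> words (PUNPCKLBW / PUNPCKHBW) */
        __m128i c_lo = _mm_unpacklo_epi8(f, zero);
        __m128i c_hi = _mm_unpackhi_epi8(f, zero);
        /* step 3: alpha << 8 in each color word (PSHUFB) */
        __m128i a_lo = _mm_shuffle_epi8(f, alpha_lo);
        __m128i a_hi = _mm_shuffle_epi8(f, alpha_hi);
        /* step 4: high word of x * (alpha << 8) (PMULHUW) */
        __m128i p_lo = _mm_mulhi_epu16(c_lo, a_lo);
        __m128i p_hi = _mm_mulhi_epu16(c_hi, a_hi);
        /* step 5: words -> bytes (PACKUSWB) */
        __m128i p = _mm_packus_epi16(p_lo, p_hi);
        /* step 6: saturated add to background (PADDUSB) and store */
        __m128i b = _mm_loadu_si128((const __m128i *)(bg + i));
        _mm_storeu_si128((__m128i *)(bg + i), _mm_adds_epu8(b, p));
    }
}
```

One difference from the reference implementation: because the alpha lane is multiplied by zero, this sketch leaves the background alpha byte unchanged, whereas the reference code clears it.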

Possible drawback

While SSE can perform 8 multiplications per instruction, the sample program uses just 6 of the results, so only 75% of the available throughput is used. At the moment I have no idea how to overcome this at low cost.

SSSE3 and SSE4 implementation

Sample program blend_32bpp contains four procedures, reported in the test results as x86, SSSE3, SSE4 and SSE4-2.

Test results

The program was compiled with the following options:

gcc -O3 -Wall -pedantic -std=c99 blend_32bpp.c -o test

Core2

Two images 640 x 480 pixels (1.2MB) were blended 1'000 times; each test was repeated 5 times and the results were averaged.

The test machine was a Core 2 Duo @ 2.6GHz running Linux.
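The measurement procedure described above (many blends per test, several repeats, averaged time) could look roughly like the sketch below. It is an assumption about the method, not the sample program's actual harness: the names time_blend and blend_scalar are hypothetical, and clock() is used only because it is available in standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef void (*blend_fn)(const uint32_t *fg, uint32_t *bg, size_t n);

/* Scalar blend following the reference implementation (alpha / 256). */
void blend_scalar(const uint32_t *fg, uint32_t *bg, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t a = fg[i] >> 24, out = 0;
        for (int shift = 0; shift < 24; shift += 8) {
            uint32_t c = (((fg[i] >> shift) & 0xff) * a) / 256
                       + ((bg[i] >> shift) & 0xff);
            out |= (c > 255 ? 255 : c) << shift;  /* saturate */
        }
        bg[i] = out;
    }
}

/* Average CPU time in seconds over `repeats` tests,
   each test calling `blend` `iterations` times. */
double time_blend(blend_fn blend, const uint32_t *fg, uint32_t *bg,
                  size_t n, int iterations, int repeats)
{
    assert(iterations > 0 && repeats > 0);
    double total = 0.0;
    for (int r = 0; r < repeats; r++) {
        clock_t start = clock();
        for (int i = 0; i < iterations; i++)
            blend(fg, bg, n);
        total += (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    return total / repeats;
}
```

For the setup described here the call would be along the lines of time_blend(blend_scalar, fg, bg, 640 * 480, 1000, 5). Repeated blending quickly saturates the background, which does not matter for timing since the amount of work per pixel stays the same.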

procedure   time [s]   speedup
x86         1.727      100%      =====
SSSE3       0.238      725%      ====================================
SSE4        0.256      675%      ==================================
SSE4-2      0.173      995%      ==================================================

The results are impressive, and I think a well-tuned procedure could bring an even bigger speedup.

Core i5

Two images 640 x 480 pixels (1.2MB) were blended 10'000 times.

Machine: Core i5 M540 @ 2.53GHz

procedure   time [s]   speedup
x86         14.69      1.00      =====
SSSE3        2.38      6.17      ==============================
SSE4         2.41      6.09      ==============================
SSE4-2       1.96      7.49      =====================================