SSSE3/SSE4: alpha blending — operator over

Author:	Wojciech Muła
Added on:	2008-06-03
Updates:	2016-03-03 (results from Core i5)

Contents

Introduction
SSSE3 and SSE4 algorithm outline
Possible drawback
SSSE3 and SSE4 implementation
Test results
- Core2
- Core i5

Introduction

Alpha blending refers to many different operations. This note describes results for the over operator that works on RGBA pixels with premultiplied alpha.

Basic formula:

background = (alpha * foreground) + background

where alpha is in the range [0 .. 255], and + denotes addition with saturation.

The reference implementation coded in C:

Rf =  foreground & 0xff
Gf = (foreground >>  8) & 0xff
Bf = (foreground >> 16) & 0xff
Af = (foreground >> 24) & 0xff

Rb =  background & 0xff
Gb = (background >>  8) & 0xff
Bb = (background >> 16) & 0xff

R = (Rf * Af)/256 + Rb
G = (Gf * Af)/256 + Gb
B = (Bf * Af)/256 + Bb

if (R > 255) R = 255
if (G > 255) G = 255
if (B > 255) B = 255

background = R | (G << 8) | (B << 16)
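
The listing above omits declarations and semicolons. A compilable version could look like the sketch below; the function name blend_pixel and the loop over the three color channels are my own choices for illustration and are not taken from the sample program.

```c
#include <stdint.h>

/* Blend one foreground pixel onto one background pixel.
   Assumed layout: 0xAABBGGRR, i.e. R in the lowest byte, as in the listing above.
   blend_pixel is a hypothetical name, not part of the sample program. */
uint32_t blend_pixel(uint32_t foreground, uint32_t background)
{
    uint32_t alpha  = foreground >> 24;
    uint32_t result = 0;

    for (int shift = 0; shift < 24; shift += 8) {
        uint32_t f = (foreground >> shift) & 0xff;
        uint32_t b = (background >> shift) & 0xff;
        uint32_t c = (f * alpha) / 256 + b;   /* scale by alpha, then add */
        if (c > 255)
            c = 255;                          /* saturate */
        result |= c << shift;
    }
    return result;   /* the alpha byte of the result stays zero, as in the listing */
}
```

For example, blend_pixel(0xff804020, 0x00102030) gives 0x008f5f4f: each foreground component is scaled by 255/256 and added to the background component.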

Note: dividing by 256 never yields the component value 255 (the maximum is 255 * 255 / 256 = 254); obtaining the full range requires some additional operations. The difference is probably not noticeable.
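
The note does not say which additional operations are meant. One commonly used option, shown here only as an assumption, is a rounded division by 255 done with shifts and adds. The names div255_round and scale_exact are hypothetical.

```c
#include <stdint.h>

/* Rounded division by 255 for x in [0, 255*255], without a divide instruction.
   Hypothetical helper, not part of the sample program. */
uint32_t div255_round(uint32_t x)
{
    uint32_t t = x + 128;
    return (t + (t >> 8)) >> 8;
}

/* Scale an 8-bit component by an 8-bit alpha.
   With alpha = 255 the component is returned unchanged. */
uint8_t scale_exact(uint8_t component, uint8_t alpha)
{
    return (uint8_t)div255_round((uint32_t)component * alpha);
}
```

With this variant scale_exact(255, 255) is 255, whereas the division by 256 gives 254. The cost is two extra adds and one extra shift per component.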

SSSE3 and SSE4 algorithm outline

Load 4 foreground pixels:

xmm0 = [r0 g0 b0 a0|r1 g1 b1 a1|r2 g2 b2 a2|r3 g3 b3 a3]

Extend the components from bytes to words. The SSSE3 implementation uses PUNPCKLBW and PUNPCKHBW; the SSE4 implementation uses PMOVZXBW (PSHUFB could also be used, but it needs two additional vectors):
```
xmm2 = [r0 __ g0 __|b0 __ a0 __|r1 __ g1 __ |b1 __ a1 __]
xmm3 = [r2 __ g2 __|b2 __ a2 __|r3 __ g3 __ |b3 __ a3 __]
```
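
The two ways of widening can be compared with intrinsics. This is a sketch under assumptions: the function names are hypothetical, and the sample program may obtain the upper half differently (for example by loading it from memory) rather than with the byte shift used here. The target attribute is GCC/Clang specific and lets the SSE4.1 function compile without extra command-line flags.

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2: _mm_unpacklo_epi8, _mm_unpackhi_epi8 */
#include <smmintrin.h>   /* SSE4.1: _mm_cvtepu8_epi16 */

/* Widen 16 bytes to 2 x 8 words with PUNPCKLBW/PUNPCKHBW against zero. */
void widen_unpack(__m128i pixels, __m128i *lo, __m128i *hi)
{
    const __m128i zero = _mm_setzero_si128();
    *lo = _mm_unpacklo_epi8(pixels, zero);   /* pixels 0 and 1 */
    *hi = _mm_unpackhi_epi8(pixels, zero);   /* pixels 2 and 3 */
}

/* Same result with PMOVZXBW; no zero register is needed.
   The upper 8 bytes are first moved down with a byte shift (PSRLDQ). */
__attribute__((target("sse4.1")))
void widen_pmovzx(__m128i pixels, __m128i *lo, __m128i *hi)
{
    *lo = _mm_cvtepu8_epi16(pixels);
    *hi = _mm_cvtepu8_epi16(_mm_srli_si128(pixels, 8));
}
```

Both functions produce identical vectors; the difference is only in which instructions and helper registers are used.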

Populate alpha and multiply by 256 (a single PSHUFB does both, because it places the alpha byte in the high byte of each color word):

xmm0 = [__ a0 __ a0|__ a0 __ __|__ a1 __ a1 |__ a1 __ __]
xmm1 = [__ a2 __ a2|__ a2 __ __|__ a3 __ a3 |__ a3 __ __]

Multiply the components by alpha: x * (alpha << 8). PMULHUW returns the high word of each 32-bit product, so no additional right shift is needed to get back to the range [0..255]:
```
xmm0 = [r0*a0 g0*a0|b0*a0 __ __|r1*a1 g1*a1|b1*a1 __ __]
xmm2 = [r2*a2 g2*a2|b2*a2 __ __|r3*a3 g3*a3|b3*a3 __ __]
```
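
The arithmetic behind this step can be checked with a scalar model of a single PMULHUW lane. The helper names are hypothetical and serve only to illustrate the identity (x * (alpha << 8)) >> 16 == (x * alpha) >> 8.

```c
#include <stdint.h>

/* Scalar model of one PMULHUW lane:
   the high 16 bits of an unsigned 16 x 16 bit product. */
uint16_t mulhi_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b) >> 16);
}

/* component * (alpha << 8), high word only.
   Equals (component * alpha) / 256 for all 8-bit inputs. */
uint16_t scale_component(uint8_t component, uint8_t alpha)
{
    return mulhi_u16(component, (uint16_t)(alpha << 8));
}
```

For the largest inputs, scale_component(255, 255) is 254, so every result fits in a byte.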
Since the multiplication result is never greater than 255, the words can be converted back to bytes without loss (PACKUSWB):
```
xmm0 = [R0 G0 B0 __|R1 G1 B1 __|R2 G2 B2 __|R3 G3 B3 __]
```
The last step is to load the background pixels, add them to xmm0 with saturation (PADDUSB), and store the result.
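
The whole outline can be sketched with compiler intrinsics. This is not the code of the sample program: the function name blend4_ssse3, the use of unaligned loads, and the shuffle constants are assumptions made for illustration. The target attribute is GCC/Clang specific and allows compiling without -mssse3.

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

/* Blend 4 foreground pixels onto 4 background pixels, in place.
   Hypothetical helper; bytes in memory are assumed to be R, G, B, A. */
__attribute__((target("ssse3")))
void blend4_ssse3(const uint32_t *foreground, uint32_t *background)
{
    const __m128i zero = _mm_setzero_si128();
    /* PSHUFB control: copy the pixel's alpha byte into the high byte of the
       R, G and B words; an index of -128 (0x80) writes a zero byte. */
    const __m128i alpha_lo = _mm_setr_epi8(
        -128, 3, -128, 3, -128, 3, -128, -128,
        -128, 7, -128, 7, -128, 7, -128, -128);
    const __m128i alpha_hi = _mm_setr_epi8(
        -128, 11, -128, 11, -128, 11, -128, -128,
        -128, 15, -128, 15, -128, 15, -128, -128);

    __m128i fg = _mm_loadu_si128((const __m128i *)foreground);
    __m128i bg = _mm_loadu_si128((const __m128i *)background);

    /* bytes -> words (PUNPCKLBW / PUNPCKHBW) */
    __m128i c01 = _mm_unpacklo_epi8(fg, zero);
    __m128i c23 = _mm_unpackhi_epi8(fg, zero);

    /* alpha << 8 in the color lanes, zero in the alpha lane (PSHUFB) */
    __m128i a01 = _mm_shuffle_epi8(fg, alpha_lo);
    __m128i a23 = _mm_shuffle_epi8(fg, alpha_hi);

    /* high word of component * (alpha << 8) (PMULHUW) */
    c01 = _mm_mulhi_epu16(c01, a01);
    c23 = _mm_mulhi_epu16(c23, a23);

    /* words -> bytes (PACKUSWB), then saturated add (PADDUSB) */
    __m128i scaled = _mm_packus_epi16(c01, c23);
    _mm_storeu_si128((__m128i *)background, _mm_adds_epu8(bg, scaled));
}
```

Because the alpha lane of the scaled vector is zero, the alpha byte of each background pixel passes through unchanged in this sketch.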

Possible drawback

While one SSE multiply instruction performs 8 word multiplications, the sample program uses just 6 of the results, i.e. only 75% of the available multiplier throughput is utilized. At the moment I have no idea how to overcome this at low cost.

SSSE3 and SSE4 implementation

The sample program blend_32bpp contains four procedures:

x86 — C implementation (1 pixel/iteration)
SSSE3 — SIMD reference implementation (4 pixels/iteration)
SSE4 — instruction pmovzx used instead of punpckxbw (4 pixels/iteration)
SSE4-2 — unrolled SSE4 variant (8 pixels/iteration)

Test results

The program was compiled with the following options:

gcc -O3 -Wall -pedantic -std=c99 blend_32bpp.c -o test

Core2

Two images of 640 x 480 pixels (1.2MB each) were blended 1'000 times; each test was repeated 5 times and the results were averaged.

The test machine was a Core 2 Duo @ 2.6GHz running Linux.

procedure	time [s]	speedup
x86	1.727	100%	`=====`
SSSE3	0.238	725%	`====================================`
SSE4	0.256	675%	`==================================`
SSE4-2	0.173	995%	`==================================================`

The results are impressive; I think a well-tuned procedure could bring an even bigger speedup.

Core i5

Two images 640 x 480 pixels (1.2MB) were blended 10'000 times.

Machine: Core i5 M540 @ 2.53GHz

procedure	time [s]	speedup
x86	14.69	1.00	`=====`
SSSE3	2.38	6.17	`==============================`
SSE4	2.41	6.09	`==============================`
SSE4-2	1.96	7.49	`=====================================`