# SSSE3/SSE4: alpha blending — operator over

Author: Wojciech Muła, 2008-06-03; updated 2016-03-03 (results from Core i5)

# Introduction

Alpha blending refers to many different operations. This note describes results for the over operator that works on RGBA pixels with premultiplied alpha.

Basic formula:

```
background = (alpha * foreground) + background
```

where alpha is in the range [0 .. 255] and + denotes addition with saturation.

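The reference implementation in C (`foreground` and `background` are 32-bit pixels of type `uint32_t`):

```c
/* unpack the foreground pixel; alpha is stored in the top byte */
uint32_t Rf =  foreground        & 0xff;
uint32_t Gf = (foreground >>  8) & 0xff;
uint32_t Bf = (foreground >> 16) & 0xff;
uint32_t Af = (foreground >> 24) & 0xff;

/* unpack the background pixel */
uint32_t Rb =  background        & 0xff;
uint32_t Gb = (background >>  8) & 0xff;
uint32_t Bb = (background >> 16) & 0xff;

/* scale the foreground by alpha, then add the background */
uint32_t R = (Rf * Af)/256 + Rb;
uint32_t G = (Gf * Af)/256 + Gb;
uint32_t B = (Bf * Af)/256 + Bb;

/* saturate each component */
if (R > 255) R = 255;
if (G > 255) G = 255;
if (B > 255) B = 255;

background = R | (G << 8) | (B << 16);
```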

Note: after dividing by 256 a scaled component can never reach 255 (for example, 255 * 255 / 256 = 254), so some additional operations are needed to obtain the full range. Probably no one would notice the difference.
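To illustrate the note above, here is a minimal sketch that compares the truncating division by 256 with a rounded division by 255. The helper names `scale_div256` and `scale_div255` are hypothetical and not part of the sample program, and rounded division by 255 is only one possible correction, not necessarily the one the author had in mind.

```c
#include <stdint.h>

/* Scaling used by the reference implementation: (x * a) / 256.
   The largest possible result is (255 * 255) / 256 = 254. */
uint8_t scale_div256(uint8_t x, uint8_t a)
{
    return (uint8_t)((x * a) / 256);
}

/* Assumed alternative: division by 255, rounded to nearest.
   This maps 255 * 255 back to 255 at the cost of a real division. */
uint8_t scale_div255(uint8_t x, uint8_t a)
{
    return (uint8_t)((x * a + 127) / 255);
}
```

With full alpha, `scale_div256(255, 255)` yields 254 while `scale_div255(255, 255)` yields 255; for most other inputs the two differ by at most one.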

# SSSE3 and SSE4 algorithm outline

1. Load four foreground pixels:

```
xmm0 = [r0 g0 b0 a0|r1 g1 b1 a1|r2 g2 b2 a2|r3 g3 b3 a3]
```
2. Extend the components from bytes to words. The SSSE3 implementation uses PUNPCKLBW and PUNPCKHBW, the SSE4 implementation uses PMOVZXBW (PSHUFB could also be used, but it needs two additional vectors):

```
xmm2 = [r0 __ g0 __|b0 __ a0 __|r1 __ g1 __ |b1 __ a1 __]
xmm3 = [r2 __ g2 __|b2 __ a2 __|r3 __ g3 __ |b3 __ a3 __]
```
3. Broadcast alpha to the colour words and multiply it by 256 in the same step: PSHUFB places the alpha byte in the high byte of each word:

```
xmm0 = [__ a0 __ a0|__ a0 __ __|__ a1 __ a1 |__ a1 __ __]
xmm1 = [__ a2 __ a2|__ a2 __ __|__ a3 __ a3 |__ a3 __ __]
```
4. Multiply the components by alpha: x * (alpha << 8). PMULHUW returns the high word of each product, so no additional right shift is needed to get back to the range [0..255]:

```
xmm0 = [r0*a0 g0*a0|b0*a0 __ __|r1*a1 g1*a1|b1*a1 __ __]
xmm2 = [r2*a2 g2*a2|b2*a2 __ __|r3*a3 g3*a3|b3*a3 __ __]
```
5. Because no multiplication result is greater than 255, the words can be converted back to bytes without loss (PACKUSWB):

```
xmm0 = [R0 G0 B0 __|R1 G1 B1 __|R2 G2 B2 __|R3 G3 B3 __]
```
6. The last step is to load the background pixels, add them to xmm0 with saturation (PADDUSB), and store the results back.
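The six steps above can be sketched with compiler intrinsics. This is an illustration written for this note, not the code from the sample program: the function name `blend4_ssse3` and the two shuffle masks are assumptions, and the `target` attribute assumes GCC or Clang on x86 (compiling with `-mssse3` is an alternative).

```c
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3 intrinsics, including _mm_shuffle_epi8 */

/* Blend 4 foreground pixels over 4 background pixels, in place.
   Hypothetical helper; pixel layout in memory is R, G, B, A. */
__attribute__((target("ssse3")))
void blend4_ssse3(const uint32_t *fg, uint32_t *bg)
{
    const __m128i zero = _mm_setzero_si128();

    /* PSHUFB masks: copy the alpha byte into the high byte of the R, G
       and B words (alpha * 256); index -128 (0x80) writes a zero byte. */
    const __m128i mask_lo = _mm_setr_epi8(
        -128, 3, -128, 3, -128, 3, -128, -128,
        -128, 7, -128, 7, -128, 7, -128, -128);
    const __m128i mask_hi = _mm_setr_epi8(
        -128, 11, -128, 11, -128, 11, -128, -128,
        -128, 15, -128, 15, -128, 15, -128, -128);

    /* step 1: load four foreground pixels */
    __m128i f = _mm_loadu_si128((const __m128i *)fg);

    /* step 2: bytes to words (PUNPCKLBW / PUNPCKHBW) */
    __m128i c_lo = _mm_unpacklo_epi8(f, zero);
    __m128i c_hi = _mm_unpackhi_epi8(f, zero);

    /* step 3: alpha << 8 in every colour word (PSHUFB) */
    __m128i a_lo = _mm_shuffle_epi8(f, mask_lo);
    __m128i a_hi = _mm_shuffle_epi8(f, mask_hi);

    /* step 4: high word of x * (alpha << 8), i.e. (x * alpha) >> 8 (PMULHUW) */
    __m128i p_lo = _mm_mulhi_epu16(c_lo, a_lo);
    __m128i p_hi = _mm_mulhi_epu16(c_hi, a_hi);

    /* step 5: words back to bytes (PACKUSWB) */
    __m128i p = _mm_packus_epi16(p_lo, p_hi);

    /* step 6: saturated add to the background (PADDUSB) and store */
    __m128i b = _mm_loadu_si128((const __m128i *)bg);
    _mm_storeu_si128((__m128i *)bg, _mm_adds_epu8(b, p));
}
```

In this sketch the alpha word of each product is zero, so the alpha byte of the stored pixel is simply the background's alpha byte. An SSE4 variant could presumably replace the low-half unpack with `_mm_cvtepu8_epi16` (PMOVZXBW) from `smmintrin.h`.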

# Possible drawback

A single SSE multiply instruction performs 8 word multiplications, but the sample program uses only 6 of the results, i.e. 75% of the available power is utilized. At the moment I have no idea how to overcome this at low cost.

# SSSE3 and SSE4 implementation

The sample program blend_32bpp contains four procedures:

• x86: C implementation (1 pixel/iteration)
• SSSE3: SIMD reference implementation (4 pixels/iteration)
• SSE4: uses the instruction PMOVZXBW instead of PUNPCKLBW/PUNPCKHBW (4 pixels/iteration)
• SSE4-2: unrolled SSE4 variant (8 pixels/iteration)

# Test results

The program was compiled with the following options:

```
gcc -O3 -Wall -pedantic -std=c99 blend_32bpp.c -o test
```

## Core2

Two images of 640 x 480 pixels (1.2MB) were blended 1'000 times; each test was repeated 5 times and the results were averaged.

The test machine was a Core 2 Duo @ 2.6GHz running Linux.

| procedure | time [s] | speedup | |
|-----------|----------|---------|---|
| x86 | 1.727 | 100% | `=====` |
| SSSE3 | 0.238 | 725% | `====================================` |
| SSE4 | 0.256 | 675% | `==================================` |
| SSE4-2 | 0.173 | 995% | `==================================================` |

The results are impressive, and I think a well-tuned procedure could bring an even bigger speedup.
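The measurement procedure described at the start of this section (blend many times, repeat the test, average) might be sketched as follows. Everything here is an assumption made for illustration: the `blend_fn` signature, the names `blend_scalar` and `average_time`, and the use of `clock()` are not taken from the sample program, and the note does not say whether the background is reset between runs.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature: blend `count` foreground pixels into `bg`. */
typedef void (*blend_fn)(const uint32_t *fg, uint32_t *bg, size_t count);

/* Scalar procedure following the reference formula, one pixel per iteration. */
void blend_scalar(const uint32_t *fg, uint32_t *bg, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t a = fg[i] >> 24;
        uint32_t out = 0;
        for (int shift = 0; shift < 24; shift += 8) {
            uint32_t c = (((fg[i] >> shift) & 0xff) * a) / 256
                       + ((bg[i] >> shift) & 0xff);
            out |= (c > 255 ? 255 : c) << shift;   /* saturate */
        }
        bg[i] = out;
    }
}

/* Average CPU time in seconds over `repeats` tests,
   each test blending the images `iterations` times. */
double average_time(blend_fn fn, const uint32_t *fg, uint32_t *bg,
                    size_t count, int iterations, int repeats)
{
    double total = 0.0;
    for (int r = 0; r < repeats; r++) {
        clock_t start = clock();
        for (int i = 0; i < iterations; i++)
            fn(fg, bg, count);
        total += (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    return total / repeats;
}
```

The speedup column is then the ratio of the x86 time to the time of a given procedure, for example 1.727 / 0.238 ≈ 7.26 for SSSE3, which the table reports as 725%.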

## Core i5

Two images 640 x 480 pixels (1.2MB) were blended 10'000 times.

Machine: Core i5 M540 @ 2.53GHz

| procedure | time [s] | speedup | |
|-----------|----------|---------|---|
| x86 | 14.69 | 1.00 | `=====` |
| SSSE3 | 2.38 | 6.17 | `==============================` |
| SSE4 | 2.41 | 6.09 | `==============================` |
| SSE4-2 | 1.96 | 7.49 | `=====================================` |