# SSSE3: PMADDUBSW and image crossfading

Author: Wojciech Muła 2008-06-21 2016-05-03 (Dmitry Petrov noticed that alpha can have better resolution; new SWAR code; updated results for Core i5) 2016-03-02 (results from Core i5), 2015-12-26 (instruction pmaddubsw was introduced by SSSE3, not SSE4; thanks a lot Harold!)

# Introduction

Image crossfading is a kind of alpha blending where a final pixel is the result of linear interpolation of pixels from two images:

```result_pixel = pixel1 * alpha + pixel2 * (1 - alpha)
```

where alpha lie in range [0, 1]. Of course when operating on "pixels" color components are considered; components are unsigned bytes.

SSSE3 introduced instruction PMADDUBSW. This instruction multiply a destination vector of unsigned bytes by a source vector of signed bytes — the result is a vector of signed words. Then adjacent words are added with signed saturation (the same operation as PHADDSW).

This is exactly what crossafading needs.

The obvious drawback is that instruction operates on signed values. Because alpha must be positive, this reduces resolution of alpha from 8 to 7 bits. (was: Because multiplication results are signed and then added, the sum must not be greater than 32767 — this requirement reduces resolution by another bit. Finally alpha must lie in range [0..63].) Dmitry Petrov pointed out that alpha can be a 7-bit value, as such value never cause an overflow. Let's assume that both pixel1 and pixel2 have maximum value, and check if following inequality is true:

```(1) 255 * alpha + 255 * (127 - alpha) < 2^15 - 1
(2)                         255 * 127 < 2^15 - 1
(3)                             32385 < 32767
```

Obviously the inequality is true.

# Algorithm outline

1. Prepare constant vector of 127*alpha/127*(1-alpha):

```xmm6 = packed_byte(alpha, 127-alpha, alpha, 127-alpha, ..., alpha, 127-alpha)
```
2. Load 16 components from images X and Y:

```movdqa (%eax), %xmm0    // xmm0 = packed_byte(rX1, gX1, bX1, _, rX2, gX2, bX2, _, ...)
movdqa (%ebx), %xmm1    // xmm1 = packed_byte(rY1, gY1, bY1, _, rY2, gY2, bY2, _, ...)
```
3. Interleave components:

```movdqa    %xmm0, %xmm2
punpcklbw %xmm1, %xmm0  // xmm0 = packed_byte(rX1, rY1, gX1, gY1, bX1, bY2, ...)
punpcklbw %xmm1, %xmm2  // xmm2 = packed_byte(rX8, rY8, gX10, gY10, bX11, bY11, ...)
```
4. Interpolate components with PMADDUBSW:

```pmaddubsw %xmm6, %xmm0  // xmm0 = packed_byte(127*((rX1 * alpha) + rY1*(1 - alpha)), ...)
pmaddubsw %xmm6, %xmm2  // xmm2 = packed_byte(127*((rX8 * alpha) + rY8*(1 - alpha)), ...)
```
5. Divide by 64 — now all words lie in range [0..255]:

```psrlw       \$16, %xmm0  // xmm0 = packed_byte((rX1 * alpha) + rY1*(1 - alpha), ...)
psrlw       \$16, %xmm2  // xmm2 = packed_byte((rX8 * alpha) + rY8*(1 - alpha), ...)
```
6. Pack words to bytes and save result:

```packuswb  %xmm2, %xmm0
movdqa    %xmm0, (%ecx)
```
7. goto 2

# SWAR

Crossfading using SWAR approach is possible and on 64-bit machines is pretty fast. The core SWAR idea is to perform 4 multiplications using single multiplication instruction. When a register has following byte layout:

```[00aa|00bb|00cc|00dd]
```

then multiplication by value in range [0..255] yields a vector of four 16-bit values.

## Algorithm outline

1. Load four pixels:

```A = imageA[i]                           [a1|b1|g1|r1|a0|b0|g0|r0]
B = imageB[i]                           [a3|b3|g3|r3|a2|b2|g2|r2]
```
2. Isolate odd and even components (_ denotes 0):

```A_lo =       A  & 0x00ff00ff00ff00ff    [__|b1|__|r1|__|b0|__|r0]
A_hi = (A >> 8) & 0x00ff00ff00ff00ff    [__|a1|__|g1|__|a0|__|g0]

B_lo =       B  & 0x00ff00ff00ff00ff    [__|b3|__|r3|__|b2|__|r2]
B_hi = (B >> 8) & 0x00ff00ff00ff00ff    [__|a3|__|g3|__|a2|__|g2]
```
3. Multiply by alpha and 255 - alpha (values in range [0..255]):

```A_lo *= alpha                           [b1*a |r1*a |b0*a |r0*a ]
A_hi *= alpha                           [a1*a |g1*a |a0*a |g0*a ]

B_lo *= 255 - alpha                     [b3*a'|r3*a'|b2*a'|r2*a']
B_hi *= 255 - alpha                     [a3*a'|g3*a'|a2*a'|g2*a']
```
4. Add components, no overflow is possible. Now each components is premultiplied by 256:

```R0 = A_lo + B_lo                        [b1+b3|r1+r3|b0+b2|r0+r2]
R1 = A_hi + B_hi                        [a1+a3|g1+bg|a0+a2|g0+g2]
```
5. Divide by 256 and set proper byte position:

```R0 = (R0 >> 8) & 0x00ff00ff00ff00ff     [__|b1|__|r1|__|b0|__|r0]
R1 =        R1 & 0xff00ff00ff00ff00     [a1|__|g1|__|a0|__|g0|__]
```
6. Save the result:

```data[i] = R0 | R1                       [a1|b1|g1|r1|a0|b0|g0|r0]
```

Number of operations per two pixels:

• bit-and: 6,
• bit-or: 1,
• add: 2,
• mul: 4,
• shift: 3.

# Sample program

Sample program contains following procedures:

• x86 — C implementation (blend);
• SSE4 — SSE approach: first components are extended from byte to word, then multiply with PMULUHW and finally packed back (SSE4_blend); I called it SSE4 because for extending instruction PMOVZXBW is used, however this task could be accomplished with an old good PUNPCKxBW;
• SSE4-2 — implementation of above algorithm (SSE42_blend);
• swar — implementation of SWAR;
• SSE(1) — SSE translation of the SWAR algorithm;
• SSE(2) — SSE translation of the SWAR algorithm, with reduced numbers of shifts.

# Test results

Program was compiled with following options:

```gcc -O3 -Wall -pedantic -std=c99 mix_32bpp.c -o mix
```

Image 1024 x 768 pixels were crossfaded 100 times, total time is given

## Core2 (outdated)

Results from Core2 Duo E8200 @ 2.6GHz.

procedure time [us] speedup
x86 745,702 1.00
SSE4 309,393 2.41
SSE4-2 309,167 2.41
• Speedup over the x86 code is around 2.4 times. However comparison shows that speed of both SSE procedures is equal.

## Core i5

Results from Core i5 M 540 @ 2.53GHz

procedure time [us] speedup
SSE(2) 64,221 1.00
SSE(1) 69,228 0.93
SSE4 55,530 1.16
SSE4-2 44,315 1.45
swar 143,059 0.45
x86 206,044 0.31
• The SSE4-2 algorithm is almost 1.5 faster than good implementation of the naive algorithm (SSE(2)).
• The SWAR algorithm is almost 40% faster than the scalar version.

## Core i7

Results from Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

procedure time [us] speedup
SSE(2) 31,727 1.00
SSE(1) 33,684 0.94
SSE4 47,566 0.67
SSE4-2 20,939 1.52
swar 71,826 0.44
x86 70,728 0.45
• As for Core i5, the SSE4-2 algorithm is almost 1.5 faster than SSE(2).
• But the SWAR algorithm is not faster than the scalar.

# Conclusions

• And what is worth to note that gcc invoked with the -O3 switch produced quite fast x86 code. Without any optimization x86 code was almost 10 times slower! I am surprised, and in my private ranking of the best open source application GCC has gone up.