Author: | Wojciech Muła |
---|---|
Added on: | 2008-06-21 |
Updated on: | 2016-05-03 (Dmitry Petrov noticed that alpha can have better resolution; new SWAR code; updated results for Core i5) 2016-03-02 (results from Core i5), 2015-12-26 (instruction pmaddubsw was introduced by SSSE3, not SSE4; thanks a lot Harold!) |
Contents
Image crossfading is a kind of alpha blending where a final pixel is the result of linear interpolation of pixels from two images:
result_pixel = pixel1 * alpha + pixel2 * (1 - alpha)
where alpha lie in range [0, 1]. Of course when operating on "pixels" color components are considered; components are unsigned bytes.
SSSE3 introduced instruction PMADDUBSW. This instruction multiply a destination vector of unsigned bytes by a source vector of signed bytes — the result is a vector of signed words. Then adjacent words are added with signed saturation (the same operation as PHADDSW).
This is exactly what crossafading needs.
The obvious drawback is that instruction operates on signed values. Because alpha must be positive, this reduces resolution of alpha from 8 to 7 bits. (was: Because multiplication results are signed and then added, the sum must not be greater than 32767 — this requirement reduces resolution by another bit. Finally alpha must lie in range [0..63].) Dmitry Petrov pointed out that alpha can be a 7-bit value, as such value never cause an overflow. Let's assume that both pixel1 and pixel2 have maximum value, and check if following inequality is true:
(1) 255 * alpha + 255 * (127 - alpha) < 2^15 - 1 (2) 255 * 127 < 2^15 - 1 (3) 32385 < 32767
Obviously the inequality is true.
Prepare constant vector of 127*alpha/127*(1-alpha):
xmm6 = packed_byte(alpha, 127-alpha, alpha, 127-alpha, ..., alpha, 127-alpha)
Load 16 components from images X and Y:
movdqa (%eax), %xmm0 // xmm0 = packed_byte(rX1, gX1, bX1, _, rX2, gX2, bX2, _, ...) movdqa (%ebx), %xmm1 // xmm1 = packed_byte(rY1, gY1, bY1, _, rY2, gY2, bY2, _, ...)
Interleave components:
movdqa %xmm0, %xmm2 punpcklbw %xmm1, %xmm0 // xmm0 = packed_byte(rX1, rY1, gX1, gY1, bX1, bY2, ...) punpcklbw %xmm1, %xmm2 // xmm2 = packed_byte(rX8, rY8, gX10, gY10, bX11, bY11, ...)
Interpolate components with PMADDUBSW:
pmaddubsw %xmm6, %xmm0 // xmm0 = packed_byte(127*((rX1 * alpha) + rY1*(1 - alpha)), ...) pmaddubsw %xmm6, %xmm2 // xmm2 = packed_byte(127*((rX8 * alpha) + rY8*(1 - alpha)), ...)
Divide by 64 — now all words lie in range [0..255]:
psrlw $16, %xmm0 // xmm0 = packed_byte((rX1 * alpha) + rY1*(1 - alpha), ...) psrlw $16, %xmm2 // xmm2 = packed_byte((rX8 * alpha) + rY8*(1 - alpha), ...)
Pack words to bytes and save result:
packuswb %xmm2, %xmm0 movdqa %xmm0, (%ecx)
goto 2
Crossfading using SWAR approach is possible and on 64-bit machines is pretty fast. The core SWAR idea is to perform 4 multiplications using single multiplication instruction. When a register has following byte layout:
[00aa|00bb|00cc|00dd]
then multiplication by value in range [0..255] yields a vector of four 16-bit values.
Load four pixels:
A = imageA[i] [a1|b1|g1|r1|a0|b0|g0|r0] B = imageB[i] [a3|b3|g3|r3|a2|b2|g2|r2]
Isolate odd and even components (_ denotes 0):
A_lo = A & 0x00ff00ff00ff00ff [__|b1|__|r1|__|b0|__|r0] A_hi = (A >> 8) & 0x00ff00ff00ff00ff [__|a1|__|g1|__|a0|__|g0] B_lo = B & 0x00ff00ff00ff00ff [__|b3|__|r3|__|b2|__|r2] B_hi = (B >> 8) & 0x00ff00ff00ff00ff [__|a3|__|g3|__|a2|__|g2]
Multiply by alpha and 255 - alpha (values in range [0..255]):
A_lo *= alpha [b1*a |r1*a |b0*a |r0*a ] A_hi *= alpha [a1*a |g1*a |a0*a |g0*a ] B_lo *= 255 - alpha [b3*a'|r3*a'|b2*a'|r2*a'] B_hi *= 255 - alpha [a3*a'|g3*a'|a2*a'|g2*a']
Add components, no overflow is possible. Now each components is premultiplied by 256:
R0 = A_lo + B_lo [b1+b3|r1+r3|b0+b2|r0+r2] R1 = A_hi + B_hi [a1+a3|g1+bg|a0+a2|g0+g2]
Divide by 256 and set proper byte position:
R0 = (R0 >> 8) & 0x00ff00ff00ff00ff [__|b1|__|r1|__|b0|__|r0] R1 = R1 & 0xff00ff00ff00ff00 [a1|__|g1|__|a0|__|g0|__]
Save the result:
data[i] = R0 | R1 [a1|b1|g1|r1|a0|b0|g0|r0]
Number of operations per two pixels:
Sample program contains following procedures:
Program was compiled with following options:
gcc -O3 -Wall -pedantic -std=c99 mix_32bpp.c -o mix
Image 1024 x 768 pixels were crossfaded 100 times, total time is given
Results from Core2 Duo E8200 @ 2.6GHz.
procedure | time [us] | speedup |
---|---|---|
x86 | 745,702 | 1.00 |
SSE4 | 309,393 | 2.41 |
SSE4-2 | 309,167 | 2.41 |
Results from Core i5 M 540 @ 2.53GHz
procedure | time [us] | speedup |
---|---|---|
SSE(2) | 64,221 | 1.00 |
SSE(1) | 69,228 | 0.93 |
SSE4 | 55,530 | 1.16 |
SSE4-2 | 44,315 | 1.45 |
swar | 143,059 | 0.45 |
x86 | 206,044 | 0.31 |
Results from Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
procedure | time [us] | speedup |
---|---|---|
SSE(2) | 31,727 | 1.00 |
SSE(1) | 33,684 | 0.94 |
SSE4 | 47,566 | 0.67 |
SSE4-2 | 20,939 | 1.52 |
swar | 71,826 | 0.44 |
x86 | 70,728 | 0.45 |