Author: | Wojciech Muła |
---|---|
Added on: | 2008-06-01 |
Updated on: | 2016-03-04 (+link to github) |
32bpp pixels have four components: red, green, blue and an alpha channel. The same number of lookup tables is needed; each table element is 4 bytes wide, so the looked-up values can be combined with a simple or:

    transformed_pixel := LUT_R[R] or LUT_G[G] or LUT_B[B] or LUT_A[A]

Or without alpha channel:

    transformed_pixel := LUT_R[R] or LUT_G[G] or LUT_B[B]
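As a rough illustration of the formulas above, here is a minimal C sketch. The type `lut32_t`, the function names and the channel-inverting table contents are assumptions made for this example; they are not taken from the sample program. The sketch also assumes that R sits in the lowest byte of the pixel, as in the assembly listings in this article.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: one 256-entry table per channel; every entry is a
   whole 32-bit value, so the four results can be merged with a plain OR. */
typedef struct {
    uint32_t r[256], g[256], b[256], a[256];
} lut32_t;

/* Example contents only (an assumption): invert every channel in place. */
static void lut32_init_invert(lut32_t *lut)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t v = 255u - i;
        lut->r[i] = v;          /* R lives in bits 0..7   */
        lut->g[i] = v << 8;     /* G lives in bits 8..15  */
        lut->b[i] = v << 16;    /* B lives in bits 16..23 */
        lut->a[i] = v << 24;    /* A lives in bits 24..31 */
    }
}

/* transformed_pixel := LUT_R[R] or LUT_G[G] or LUT_B[B] or LUT_A[A] */
static uint32_t lut32_pixel_rgba(const lut32_t *lut, uint32_t p)
{
    return lut->r[p & 0xff]
         | lut->g[(p >> 8) & 0xff]
         | lut->b[(p >> 16) & 0xff]
         | lut->a[p >> 24];
}

/* Variant without the alpha channel: the top byte of the input is ignored. */
static uint32_t lut32_pixel_rgb(const lut32_t *lut, uint32_t p)
{
    return lut->r[p & 0xff]
         | lut->g[(p >> 8) & 0xff]
         | lut->b[(p >> 16) & 0xff];
}

/* Transform n pixels from src to dst. */
static void lut32_transform_rgba(const lut32_t *lut, const uint32_t *src,
                                 uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = lut32_pixel_rgba(lut, src[i]);
}
```

With these example tables, `lut32_pixel_rgba(&lut, 0x11223344)` yields `0xeeddccbb`, the bitwise complement of the input.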
I ran some tests with SSE2 and SSE4 instructions, used to minimize memory references: a single XMM load reads 16 bytes. The main problem is how to extract bytes or double words from a selected position of an XMM register.
The x86 code is the base for further improvements. Once a pixel is loaded into an x86 register, the following code can be used to extract all RGBA components:

    movl   (%%esi), %%eax           ; eax - pixel
    movzbl %%al, %%ebx              ; R
    movzbl %%ah, %%ecx              ; G
    shrl   $16, %%eax
    movzbl %%al, %%edx              ; B
    movzbl %%ah, %%eax              ; A
    movl   LUT_R(,%%ebx,4), %%ebx
    orl    LUT_G(,%%ecx,4), %%ebx
    orl    LUT_B(,%%edx,4), %%ebx
    orl    LUT_A(,%%eax,4), %%ebx   ; ebx - transformed_pixel
    movl   %%ebx, (%%edi)

Code that works with RGB pixels is of course shorter:

    movl   (%%esi), %%eax           ; eax - pixel
    movzbl %%al, %%ebx              ; R
    movzbl %%ah, %%ecx              ; G
    shrl   $16, %%eax
    movzbl %%al, %%edx              ; B
    movl   LUT_R(,%%ebx,4), %%ebx
    orl    LUT_G(,%%ecx,4), %%ebx
    orl    LUT_B(,%%edx,4), %%ebx   ; ebx - transformed_pixel
    movl   %%ebx, (%%edi)

The SSE2 code uses the same scheme as the x86 code, but it fetches 4 pixels at once and loads eax from an XMM register with a MOVD instruction. Since MOVD moves only the lowest dword, the remaining dwords have to be brought to that position one by one; the PSHUFD instruction is used to do this.
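The scheme can be sketched in C with SSE2 intrinsics. This is only an illustration, not the code from the sample program: the `lut32_t` type and the function names are hypothetical, and an unaligned load is used for simplicity. `_mm_cvtsi128_si32` corresponds to MOVD and `_mm_shuffle_epi32` to PSHUFD.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical table layout, one 256-entry table per channel. */
typedef struct {
    uint32_t r[256], g[256], b[256], a[256];
} lut32_t;

/* Scalar lookup of one pixel already held in a general purpose register. */
static uint32_t lookup_pixel(const lut32_t *lut, uint32_t p)
{
    return lut->r[p & 0xff] | lut->g[(p >> 8) & 0xff]
         | lut->b[(p >> 16) & 0xff] | lut->a[p >> 24];
}

/* Transform exactly four pixels: one 16-byte load instead of four 4-byte loads. */
static void transform4_sse2(const lut32_t *lut, const uint32_t *src, uint32_t *dst)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);

    for (int i = 0; i < 4; i++) {
        /* MOVD: copy the lowest dword to a general purpose register. */
        dst[i] = lookup_pixel(lut, (uint32_t)_mm_cvtsi128_si32(v));
        /* PSHUFD: rotate the dwords so the next pixel becomes the lowest one. */
        v = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 3, 2, 1));
    }
}
```

The table lookups themselves stay scalar; only the pixel loads are merged into a single 16-byte read.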
SSE4 (SSE4.1) introduced the instructions PEXTRB, PEXTRD and PEXTRQ. The element's index is hardcoded in the instruction as an immediate, the destination is a register or a memory location, and the extracted byte/dword/qword is zero-extended. The inverse operation is performed by the PINSRx instructions. These instructions seem perfect: they do exactly what an SSE-assisted lookup needs.
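A possible shape of such a lookup, written with SSE4.1 intrinsics, is sketched below. It is an illustration under assumptions rather than the code from the sample program: `lut32_t`, `LOOKUP_PIXEL` and `transform4_sse4` are hypothetical names, and the `target` attribute is a GCC/Clang extension used here only so that the snippet compiles without `-msse4.1`. `_mm_extract_epi8` corresponds to PEXTRB and `_mm_insert_epi32` to PINSRD.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>

/* Hypothetical table layout, one 256-entry table per channel. */
typedef struct {
    uint32_t r[256], g[256], b[256], a[256];
} lut32_t;

/* Look up pixel number k (0..3) of vector v. k must be a compile-time
   constant, because the element index is part of the instruction. */
#define LOOKUP_PIXEL(lut, v, k)                          \
    ((lut)->r[_mm_extract_epi8((v), 4 * (k) + 0)] |      \
     (lut)->g[_mm_extract_epi8((v), 4 * (k) + 1)] |      \
     (lut)->b[_mm_extract_epi8((v), 4 * (k) + 2)] |      \
     (lut)->a[_mm_extract_epi8((v), 4 * (k) + 3)])

/* Transform exactly four pixels; the CPU must support SSE4.1. */
__attribute__((target("sse4.1")))
static void transform4_sse4(const lut32_t *lut, const uint32_t *src, uint32_t *dst)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    __m128i out = _mm_setzero_si128();

    /* PEXTRB pulls each channel out directly; PINSRD rebuilds the vector. */
    out = _mm_insert_epi32(out, (int)LOOKUP_PIXEL(lut, v, 0), 0);
    out = _mm_insert_epi32(out, (int)LOOKUP_PIXEL(lut, v, 1), 1);
    out = _mm_insert_epi32(out, (int)LOOKUP_PIXEL(lut, v, 2), 2);
    out = _mm_insert_epi32(out, (int)LOOKUP_PIXEL(lut, v, 3), 3);

    _mm_storeu_si128((__m128i *)dst, out);
}
```

Compared with the MOVD/PSHUFD approach, no shuffles or scalar shifts are needed, and the four results are written back with a single 16-byte store.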
PEXTRx/PINSRx have a throughput of one cycle, but their latency is very long: five cycles. I think it is possible to hide this latency, but not in 32-bit code, where only 5 registers are free because 3 are taken by two pointers and a loop counter; 64-bit mode gives 8 extra registers.
Tests were run on a Core 2 Duo @ 2.6GHz under Linux. A 1024 x 768 image was transformed 1000 times, and each test was run 10 times.
The sample program is available on GitHub and was compiled with the following options:

    gcc -O3 lookup_32bpp.c -o test_rgb
    gcc -O3 -DRGBA lookup_32bpp.c -o test_rgba

The function naive is a plain C implementation. GCC generated code very similar to the x86 code presented above, but added some extra instructions that slowed the whole procedure down.
The other functions are the ones described earlier.
The best gain is about 1.3 times (SSE2).
function | time [s] | speedup |
---|---|---|
naive | 2.26 | 100% |
x86 | 1.90 | 119% |
SSE2 | 1.76 | 128% |
SSE4 | 1.89 | 120% |
No observable gain.
function | time [s] | speedup |
---|---|---|
naive | 1.55 | 100% |
x86 | 1.57 | 98% |
SSE2 | 1.53 | 101% |
SSE4 | 1.54 | 100% |