Author: | Wojciech Muła |
---|---|
Added on: | 2008-06-01 |
Updates: | 2016-03-06 (link to github, results from Core i5) |
Contents
Basically this kind of conversion needs following steps:
R = (pixel16 and 0x001f) shl 3 G = (pixel16 and 0x07e0) shr 5 B = (pixel16 and 0xf800) shr 11 pixel32 = R or (G shl 8) or (B shl 16)
Since there aren't many pixels (32 or 64 thousand) lookup tables can be used. First approach is to use one big table indexed by pixels treated as natural numbers: this table has size 65536 * 4 bytes = 262144 bytes. Just one memory access is needed to get 32bpp pixel, however the table size is big, and even if it fits in a L2 cache, then the memory latency kill performance.
pixel32 = LUT[pixel16]
Another approach needs two tables indexed by the lower and the higher byte of a pixel, the final pixel is result of bitwise or. These tables has size 2 * 256 * 4 bytes = 2048 bytes — perfectly fit in a L1 cache.
pixel32 = LUT_hi[pixel16 shr 8] or LUT_lo[pixel16 and 0xff]
Sample program is available at github, and includes following procedures:
Program was compiled with following flags:
gcc -O3 pixconv16bpp-32bpp.c -o test gcc -O3 -DNONTEMPORAL pixconv16bpp-32bpp.c -o testnta
Here are timing from my Core 2 Duo E8200 @ 2.6GHz. Each procedure was called 10 times, results are average.
procedure | time [us] | speedup | |
---|---|---|---|
naive | 16625 | 100% | ========== |
lookup8 | 13584 | 120% | ============ |
lookup16 | 24438 | 65% | ====== |
lookup8(2) | 6175 | 270% | =========================== |
MMX | 7862 | 210% | ===================== |
SSE2 | 4103 | 405% | ======================================== |
SSE2(2) | 4604 | 360% | ==================================== |
procedure | time [us] | speedup | |
---|---|---|---|
naive | 49371 | 100% | ========== |
lookup8 | 40177 | 120% | ============ |
lookup16 | 73051 | 65% | ====== |
lookup8(2) | 18574 | 265% | ========================== |
MMX | 23483 | 210% | ========================== |
SSE2 | 12703 | 390% | ======================================= |
SSE2(2) | 13716 | 360% | ==================================== |
procedure | time [us] | speedup | |
---|---|---|---|
naive | 77634 | 100% | ========== |
lookup8 | 62893 | 120% | ============ |
lookup16 | 115156 | 65% | ====== |
lookup8(2) | 28830 | 270% | =========================== |
MMX | 36452 | 210% | ===================== |
SSE2 | 19217 | 400% | ======================================== |
SSE2(2) | 21696 | 360% | ==================================== |
procedure | time [us] | speedup | |
---|---|---|---|
naive | 130867 | 100% | ========== |
lookup8 | 106543 | 120% | ============ |
lookup16 | 205421 | 60% | ====== |
lookup8(2) | 48503 | 270% | =========================== |
MMX | 62737 | 210% | ===================== |
SSE2 | 37881 | 345% | ================================== |
SSE2(2) | 44162 | 295% | ============================= |
speedup | ||||
---|---|---|---|---|
procedure | 320x200 | 640x480 | 800x600 | 1024x768 |
naive | 100% | 100% | 100% | 100% |
lookup8 | 120% | 120% | 120% | 120% |
lookup16 | 65% | 65% | 65% | 60% |
lookup8(2) | 270% | 265% | 270% | 270% |
MMX | 210% | 210% | 210% | 210% |
SSE2 | 405% | 390% | 400% | 345% |
SSE2(2) | 360% | 360% | 360% | 295% |
As we see the worst results have lookup16. Single lookup8 is a bit faster than naive implementation, lookup8(2) that reads 2 pixels in one iteration is almost 2 times faster. I think memory latencies can help to understand these results: L1 latency is small, 3 cycles or less, L2 cache — around 10-15 cycles.
SIMD versions are naturally much faster, however SSE2(2) is a bit slower than basic SSE2.
CPU: Core i5 CPU M 540 @ 2.53GHz
The test program was run 100 times, the minimum measurement were considered.
procedure | time [us] | speedup | |
---|---|---|---|
naive | 11,136 | 1.00 | ========== |
lookup8 | 8,584 | 1.30 | ============= |
lookup16 | 10,433 | 1.06 | ========== |
lookup8(2) | 3,360 | 3.31 | ================================= |
MMX | 3,750 | 2.97 | ================================ |
SSE2 | 2,262 | 4.92 | ================================================= |
SSE2(2) | 2,192 | 5.08 | ================================================== |
procedure | time [us] | speedup | |
---|---|---|---|
naive | 53,748 | 1.00 | ========== |
lookup8 | 41,333 | 1.30 | ============= |
lookup16 | 52,948 | 1.02 | ========== |
lookup8(2) | 15,853 | 3.39 | ================================= |
MMX | 18,114 | 2.97 | ============================== |
SSE2 | 11,813 | 4.55 | ============================================= |
SSE2(2) | 11,411 | 4.71 | =============================================== |
procedure | time [us] | speedup | |
---|---|---|---|
naive | 82,974 | 1.00 | ========== |
lookup8 | 66,451 | 1.25 | ============ |
lookup16 | 98,951 | 0.83 | ======== |
lookup8(2) | 24,834 | 3.34 | ================================= |
MMX | 31,525 | 2.63 | ========================== |
SSE2 | 25,625 | 3.24 | ================================ |
SSE2(2) | 23,893 | 3.47 | =================================== |
procedure | time [us] | speedup | |
---|---|---|---|
naive | 141,784 | 1.00 | ========== |
lookup8 | 112,439 | 1.26 | ============= |
lookup16 | 191,798 | 0.73 | ======= |
lookup8(2) | 45,362 | 3.13 | =============================== |
MMX | 78,091 | 1.82 | ================== |
SSE2 | 75,230 | 1.88 | =================== |
SSE2(2) | 75,403 | 1.88 | =================== |
speedup | ||||
---|---|---|---|---|
procedure | 320x200 | 640x480 | 800x600 | 1024x768 |
naive | 1.00 | 1.00 | 1.00 | 1.00 |
lookup8 | 1.30 | 1.30 | 1.25 | 1.26 |
lookup16 | 1.06 | 1.02 | 0.83 | 0.73 |
lookup8(2) | 3.31 | 3.39 | 3.34 | 3.13 |
MMX | 2.97 | 2.97 | 2.63 | 1.82 |
SSE2 | 4.92 | 4.55 | 3.24 | 1.88 |
SSE2(2) | 5.08 | 4.71 | 3.47 | 1.88 |