16bpp/15bpp to 32bpp pixel conversions — different methods

Author:Wojciech Muła
Added on:2008-06-01
Updates:2016-03-06 (link to github, results from Core i5)

Contents

Introduction

Basically this kind of conversion needs following steps:

R = (pixel16 and 0x001f) shl 3
G = (pixel16 and 0x07e0) shr 5
B = (pixel16 and 0xf800) shr 11

pixel32 = R or (G shl 8) or (B shl 16)

Since there aren't many pixels (32 or 64 thousand) lookup tables can be used. First approach is to use one big table indexed by pixels treated as natural numbers: this table has size 65536 * 4 bytes = 262144 bytes. Just one memory access is needed to get 32bpp pixel, however the table size is big, and even if it fits in a L2 cache, then the memory latency kill performance.

pixel32 = LUT[pixel16]

Another approach needs two tables indexed by the lower and the higher byte of a pixel, the final pixel is result of bitwise or. These tables has size 2 * 256 * 4 bytes = 2048 bytes — perfectly fit in a L1 cache.

pixel32 = LUT_hi[pixel16 shr 8] or LUT_lo[pixel16 and 0xff]

Sample program

Sample program is available at github, and includes following procedures:

Test results

Core 2

Program was compiled with following flags:

gcc -O3 pixconv16bpp-32bpp.c -o test

gcc -O3 -DNONTEMPORAL pixconv16bpp-32bpp.c -o testnta

Here are timing from my Core 2 Duo E8200 @ 2.6GHz. Each procedure was called 10 times, results are average.

Image 320x200

procedure time [us] speedup  
naive 16625 100% ==========
lookup8 13584 120% ============
lookup16 24438 65% ======
lookup8(2) 6175 270% ===========================
MMX 7862 210% =====================
SSE2 4103 405% ========================================
SSE2(2) 4604 360% ====================================

Image 640x480

procedure time [us] speedup  
naive 49371 100% ==========
lookup8 40177 120% ============
lookup16 73051 65% ======
lookup8(2) 18574 265% ==========================
MMX 23483 210% ==========================
SSE2 12703 390% =======================================
SSE2(2) 13716 360% ====================================

Image 800x600

procedure time [us] speedup  
naive 77634 100% ==========
lookup8 62893 120% ============
lookup16 115156 65% ======
lookup8(2) 28830 270% ===========================
MMX 36452 210% =====================
SSE2 19217 400% ========================================
SSE2(2) 21696 360% ====================================

Image 1024x768

procedure time [us] speedup  
naive 130867 100% ==========
lookup8 106543 120% ============
lookup16 205421 60% ======
lookup8(2) 48503 270% ===========================
MMX 62737 210% =====================
SSE2 37881 345% ==================================
SSE2(2) 44162 295% =============================

Comparison

  speedup
procedure 320x200 640x480 800x600 1024x768
naive 100% 100% 100% 100%
lookup8 120% 120% 120% 120%
lookup16 65% 65% 65% 60%
lookup8(2) 270% 265% 270% 270%
MMX 210% 210% 210% 210%
SSE2 405% 390% 400% 345%
SSE2(2) 360% 360% 360% 295%

As we see the worst results have lookup16. Single lookup8 is a bit faster than naive implementation, lookup8(2) that reads 2 pixels in one iteration is almost 2 times faster. I think memory latencies can help to understand these results: L1 latency is small, 3 cycles or less, L2 cache — around 10-15 cycles.

SIMD versions are naturally much faster, however SSE2(2) is a bit slower than basic SSE2.

Core i5

CPU: Core i5 CPU M 540 @ 2.53GHz

The test program was run 100 times, the minimum measurement were considered.

Image 320x200

procedure time [us] speedup  
naive 11,136 1.00 ==========
lookup8 8,584 1.30 =============
lookup16 10,433 1.06 ==========
lookup8(2) 3,360 3.31 =================================
MMX 3,750 2.97 ================================
SSE2 2,262 4.92 =================================================
SSE2(2) 2,192 5.08 ==================================================

Image 640x480

procedure time [us] speedup  
naive 53,748 1.00 ==========
lookup8 41,333 1.30 =============
lookup16 52,948 1.02 ==========
lookup8(2) 15,853 3.39 =================================
MMX 18,114 2.97 ==============================
SSE2 11,813 4.55 =============================================
SSE2(2) 11,411 4.71 ===============================================

Image 800x600

procedure time [us] speedup  
naive 82,974 1.00 ==========
lookup8 66,451 1.25 ============
lookup16 98,951 0.83 ========
lookup8(2) 24,834 3.34 =================================
MMX 31,525 2.63 ==========================
SSE2 25,625 3.24 ================================
SSE2(2) 23,893 3.47 ===================================

Image 1024x768

procedure time [us] speedup  
naive 141,784 1.00 ==========
lookup8 112,439 1.26 =============
lookup16 191,798 0.73 =======
lookup8(2) 45,362 3.13 ===============================
MMX 78,091 1.82 ==================
SSE2 75,230 1.88 ===================
SSE2(2) 75,403 1.88 ===================

Comparison

  speedup
procedure 320x200 640x480 800x600 1024x768
naive 1.00 1.00 1.00 1.00
lookup8 1.30 1.30 1.25 1.26
lookup16 1.06 1.02 0.83 0.73
lookup8(2) 3.31 3.39 3.34 3.13
MMX 2.97 2.97 2.63 1.82
SSE2 4.92 4.55 3.24 1.88
SSE2(2) 5.08 4.71 3.47 1.88
  • The unrolled SSE implementation beats others in most cases.
  • The improved lookup8(2) is surprisingly good. It keeps the speedup at the same level regardless of the image size.