SSSE3: printing hex values

Author: Wojciech Muła
Added on: 2008-04-29
Updated: 2016-03-07 (GitHub repo, results from Core i5), 2008-05-24

SIMD algorithm

The PSHUFB instruction performs a parallel lookup in a 16-byte array stored in an XMM register, which is exactly what binary-to-hex conversion needs: each 4-bit nibble of the input is used as an index into a table of the 16 hex digits.
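To make the lookup semantics concrete, here is a minimal scalar model of PSHUFB for XMM registers, written in C. The function name pshufb_model is hypothetical and used only for illustration; the behavior it models (the low four bits of each index byte select a table byte, and an index with bit 7 set produces zero) follows the documented definition of the instruction.

```c
#include <stdint.h>

/* Scalar model of PSHUFB on 128-bit registers (illustrative name).
   Each index byte selects one byte of the 16-byte table;
   an index with its highest bit set yields zero instead. */
void pshufb_model(const uint8_t table[16], const uint8_t idx[16],
                  uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0f];
}
```

With the table set to the hex digits '0'..'f' and every index holding a single nibble, the output is the hex digit of each nibble, all 16 computed by one instruction in the real SIMD version.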

Code snippet showing the idea:

movdqa    (%eax), %xmm0 ; xmm0 = {0xba, 0xdc, 0xaf, 0xe8, ...}
movdqa     %xmm0, %xmm1
psrlw         $4, %xmm1 ; xmm1 = {0xcb, 0x0d, 0x8a, 0x0e, ...}
                        ;      -- 16-bit words shifted right by 4: bits 4..7 of each byte are now in bits 0..3
punpcklbw  %xmm0, %xmm1 ; xmm1 = {0xcb, 0xba, 0x0d, 0xdc, 0x8a, 0xaf, 0x0e, 0xe8, ...}
                        ; MASK = packed_byte(0x0f)
pand        MASK, %xmm1 ; xmm1 = {0x0b, 0x0a, 0x0d, 0x0c, 0x0a, 0x0f, 0x0e, 0x08, ...}
                        ;      -- bits 0..3 of each byte
movdqa HEXDIGITS, %xmm0 ; HEXDIGITS = {'0', '1', '2', '3', ..., 'a', 'b', 'c', 'd', 'e', 'f'}
pshufb     %xmm1, %xmm0 ; xmm0 = {'b', 'a', 'd', 'c', 'a', 'f', 'e', '8', ...}
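The snippet above uses punpcklbw, so it converts only the lower 8 input bytes; the upper 8 need the same steps with punpckhbw. The same idea can be sketched in C with intrinsics for a full 16-byte block. This is an illustration under stated assumptions, not the code from hexprint.c: the function name hex16_ssse3 is hypothetical, and the target attribute assumes GCC or Clang on an x86 CPU that supports SSSE3.

```c
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 is the PSHUFB intrinsic */

/* Convert 16 input bytes into 32 lowercase hex digits (no terminator).
   Hypothetical helper; assumes GCC/Clang and an SSSE3-capable CPU. */
__attribute__((target("ssse3")))
void hex16_ssse3(const uint8_t in[16], char out[32])
{
    const __m128i mask   = _mm_set1_epi8(0x0f);
    const __m128i digits = _mm_setr_epi8('0', '1', '2', '3', '4', '5', '6', '7',
                                         '8', '9', 'a', 'b', 'c', 'd', 'e', 'f');

    __m128i x  = _mm_loadu_si128((const __m128i *)in);
    __m128i hi = _mm_and_si128(_mm_srli_epi16(x, 4), mask); /* bits 4..7 */
    __m128i lo = _mm_and_si128(x, mask);                    /* bits 0..3 */

    /* Interleave so every input byte becomes the pair (high, low). */
    __m128i first  = _mm_unpacklo_epi8(hi, lo);  /* input bytes 0..7  */
    __m128i second = _mm_unpackhi_epi8(hi, lo);  /* input bytes 8..15 */

    /* One table lookup per nibble, 16 lookups per instruction. */
    _mm_storeu_si128((__m128i *)out,        _mm_shuffle_epi8(digits, first));
    _mm_storeu_si128((__m128i *)(out + 16), _mm_shuffle_epi8(digits, second));
}
```

Unlike the assembly snippet, this sketch masks before interleaving; the result is the same, because the mask is applied to every byte either way.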

Tests

hexprint.c is a test program that compares the speed of the presented method with three other lookup-based methods (listed as std, std2 and std3 in the results below).

In a single iteration 100 x 16 bytes are decoded, and the number of iterations is 100000.
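In total, a run therefore converts 100 x 16 x 100000 = 160,000,000 input bytes. For reference, a scalar lookup-based converter could look like the sketch below. The actual std, std2 and std3 implementations are not reproduced here, so this is only an assumed illustration of the general approach, and the name hex_scalar is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar baseline: one table lookup per nibble.
   Writes 2*n hex digits to out, without a terminating NUL. */
void hex_scalar(const uint8_t *in, size_t n, char *out)
{
    static const char digits[] = "0123456789abcdef";

    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = digits[in[i] >> 4];    /* bits 4..7 */
        out[2 * i + 1] = digits[in[i] & 0x0f];  /* bits 0..3 */
    }
}
```

This version performs two lookups per input byte, one after another, whereas the PSHUFB version performs 16 lookups with a single instruction.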

Core2

Here are the times measured on my Linux box with a Core 2 Duo E8200.

method  user time  speedup
std           780     100%  ==========
std2          640     122%  ============
std3          640     122%  ============
ssse3         580     133%  =============

Core i5

CPU: Core i5 M 540 @ 2.53GHz

method  user time  speedup
std          9.74     1.00  ==========
std2         8.99     1.08  ==========
std3         9.07     1.07  ==========
ssse3        8.35     1.16  ============

Summary:

  • There is no visible improvement on newer CPUs: the speedup of the SSSE3 version over std is 1.16 on the Core i5, versus 133% on the Core 2.