Author: | Wojciech Muła |
---|---|
Added on: | 2008-04-29 |
Updated: | 2016-03-07 (github repo, results from Core i5), 2008-05-24 |
The PSHUFB instruction performs a parallel lookup in a 16-byte array stored in an XMM register, which is exactly what binary-to-hex conversion needs.
Code snippet showing the idea:

    movdqa    (%eax), %xmm0    ; xmm0 = {0xba, 0xdc, 0xaf, 0xe8, ...}
    movdqa     %xmm0, %xmm1
                               ; shift 16-bit words right by 4: bits 4..7 of each byte land in bits 0..3
    psrlw         $4, %xmm1    ; xmm1 = {0xcb, 0x0d, 0x8a, 0x0e, ...}
                               ; interleave: shifted byte first, then the original byte
    punpcklbw  %xmm0, %xmm1    ; xmm1 = {0xcb, 0xba, 0x0d, 0xdc, 0x8a, 0xaf, 0x0e, 0xe8, ...}
                               ; MASK = packed_byte(0x0f), keeps only bits 0..3 of each byte
    pand        MASK, %xmm1    ; xmm1 = {0x0b, 0x0a, 0x0d, 0x0c, 0x0a, 0x0f, 0x0e, 0x08, ...}
    movdqa HEXDIGITS, %xmm0    ; HEXDIGITS = {'0', '1', '2', '3', ..., 'a', 'b', 'c', 'd', 'e', 'f'}
    pshufb     %xmm1, %xmm0    ; xmm0 = {'b', 'a', 'd', 'c', 'a', 'f', 'e', '8', ...}
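The same idea can be sketched in C with compiler intrinsics. This is only an illustration, not the code from hexprint.c: the function name `hex16` is made up here, and the SIMD path is compiled only when the compiler defines `__SSSE3__` (for example with `-mssse3` on GCC or Clang); otherwise a plain scalar loop is used so the sketch still builds. Unlike the snippet above, it also handles the upper 8 input bytes with a second unpack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSSE3__)
#include <tmmintrin.h>  /* SSSE3 intrinsics, including _mm_shuffle_epi8 (PSHUFB) */
#endif

/* Hypothetical helper: converts 16 input bytes into 32 lowercase hex digits.
   No terminating NUL is written. */
void hex16(const uint8_t in[16], char out[32])
{
    assert(in != NULL && out != NULL);
#if defined(__SSSE3__)
    const __m128i mask   = _mm_set1_epi8(0x0f);
    const __m128i digits = _mm_loadu_si128((const __m128i *)"0123456789abcdef");

    const __m128i bytes   = _mm_loadu_si128((const __m128i *)in);
    /* Word-wise shift: bits 4..7 of each byte move to bits 0..3;
       the garbage left in the upper nibble is masked off below. */
    const __m128i shifted = _mm_srli_epi16(bytes, 4);

    /* Interleave shifted/original bytes so the high nibble comes first. */
    const __m128i idx_lo = _mm_and_si128(_mm_unpacklo_epi8(shifted, bytes), mask);
    const __m128i idx_hi = _mm_and_si128(_mm_unpackhi_epi8(shifted, bytes), mask);

    /* PSHUFB: each index byte (0..15) selects one character from digits. */
    _mm_storeu_si128((__m128i *)out,        _mm_shuffle_epi8(digits, idx_lo));
    _mm_storeu_si128((__m128i *)(out + 16), _mm_shuffle_epi8(digits, idx_hi));
#else
    /* Scalar fallback used when SSSE3 is not enabled at compile time. */
    static const char digits[] = "0123456789abcdef";
    for (size_t i = 0; i < 16; i++) {
        out[2 * i]     = digits[in[i] >> 4];
        out[2 * i + 1] = digits[in[i] & 0x0f];
    }
#endif
}
```

Calling `hex16` on a buffer that starts with `0xba, 0xdc, 0xaf, 0xe8` yields output starting with `badcafe8`, matching the register comments in the assembly snippet.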
hexprint.c is a test program that compares the speed of the presented method with three other lookup-based methods, labelled std, std2 and std3 in the results.
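As a point of reference, a plain scalar lookup-based conversion might look like the sketch below. It is an assumption about what such a baseline looks like, not one of the routines from hexprint.c; the name `hex_lookup` is hypothetical. Each byte is split into two nibbles, and each nibble indexes a 16-character table, which is the per-byte work that PSHUFB does for 16 bytes at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar baseline: writes 2*n lowercase hex digits to out.
   No terminating NUL is written. */
void hex_lookup(const uint8_t *in, size_t n, char *out)
{
    static const char digits[] = "0123456789abcdef";

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = digits[in[i] >> 4];    /* bits 4..7 */
        out[2 * i + 1] = digits[in[i] & 0x0f];  /* bits 0..3 */
    }
}
```

The caller has to provide an output buffer of at least `2 * n` bytes, plus one more if a terminating NUL is appended afterwards.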
In a single iteration 100 × 16 bytes are converted, and the number of iterations is 100,000.
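A timing loop with this shape could be sketched as follows. This is not the harness from hexprint.c: `time_conversion` and `convert_block` are hypothetical names, the converter is a simple scalar stand-in, and `clock()` reports total CPU time of the process, which may differ from the user time shown in the tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

enum { BLOCK = 16, BLOCKS_PER_ITER = 100 };

/* Hypothetical stand-in for the routine being measured. */
static void convert_block(const uint8_t *in, char *out)
{
    static const char digits[] = "0123456789abcdef";
    for (size_t i = 0; i < BLOCK; i++) {
        out[2 * i]     = digits[in[i] >> 4];
        out[2 * i + 1] = digits[in[i] & 0x0f];
    }
}

/* Runs `iterations` passes over 100 x 16 bytes and returns CPU seconds.
   A checksum of some output bytes is stored in *checksum (if non-NULL)
   so the compiler cannot discard the conversions as dead code. */
double time_conversion(long iterations, unsigned *checksum)
{
    static uint8_t input[BLOCKS_PER_ITER * BLOCK];
    static char output[2 * BLOCKS_PER_ITER * BLOCK];
    unsigned sum = 0;

    for (size_t i = 0; i < sizeof input; i++)
        input[i] = (uint8_t)(i * 37u);  /* arbitrary test pattern */

    clock_t start = clock();
    for (long it = 0; it < iterations; it++) {
        for (size_t b = 0; b < BLOCKS_PER_ITER; b++)
            convert_block(input + b * BLOCK, output + 2 * b * BLOCK);
        sum += (unsigned char)output[(size_t)it % sizeof output];
    }
    clock_t end = clock();

    if (checksum != NULL)
        *checksum = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_conversion(100000, &sum)` would reproduce the iteration count described above; swapping `convert_block` for another routine allows the two to be compared on the same input.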
Here are the times measured on my Linux box with a Core 2 Duo E8200.
method | user time | speedup | |
---|---|---|---|
std | 780 | 100% | ========== |
std2 | 640 | 122% | ============ |
std3 | 640 | 122% | ============ |
ssse3 | 580 | 133% | ============= |
Times measured on a Core i5 M 540 @ 2.53GHz:
method | user time | speedup | |
---|---|---|---|
std | 9.74 | 1.00 | ========== |
std2 | 8.99 | 1.08 | ========== |
std3 | 9.07 | 1.07 | ========== |
ssse3 | 8.35 | 1.16 | ============ |
Summary: