Author: | Wojciech Muła |
---|---|
Added on: | 2014-09-21 |
Updates: | 2016-03-07 (Core i5 results) |
Contents
For a nibble (0..15) stored in a byte conversion could be coded in this way:
char nibble_to_hex(uint8_t byte) { assert(byte >= 0 && byte <= 15); char c = byte + '0'; if (byte > 9) c += 'a' - '0' - 10; return c; }
If a nibble is greater than 9, then resulting letter have to be ASCII 'a' .. 'f' (or 'A' .. 'F'). It's done with simple correction of the code; value of correction 'a' - 10 - '0' is 39, and for uppercase letters it is 7.
The condition have to be replaced by branchless expression. First code is changed to:
char nibble_to_hex2(uint8_t byte) { assert(byte >= 0 && byte <= 15); const char corr = 'a' - '0' - 10 const char c = byte + '0'; uint8_t mask = (byte > 9) ? 0xff : 0x00; return c + (mask & corr); }
We're bit closer. Now the question is: how to get a mask from condition byte > 9? Let's examine simple addition: 128 - 10 + x. For values x = 0 .. 9 the result is in range 128 - 10 .. 128 - 9 and for values x = 10 .. 15 the result is in range 128 .. 128 + 5.
Observation: for x in range 10 .. 15 the result has the highest bit set, otherwise it's clear. In other words we get 0x80 or 0x00 depending on condition, and now mask could be calculated as:
uint8_t tmp = 128 - 10 + byte; uint8_t msb = tmp & 0x80; uint8_t mask = msb - (msb >> 7) | msb; // 3 operations
Since correction's value is 39 or 7, i.e. is less than 128, the mask could be calculated simpler, yielding values 0x7f/0x00:
uint8_t mask = msb - (msb >> 7); // 2 operations
The final version:
char nibble_to_hex3(uint8_t byte) { assert(byte >= 0 && byte <= 15); const char corr = 'a' - '0' - 10 const char c = byte + '0'; uint8_t tmp = 128 - 10 + byte; uint8_t msb = tmp & 0x80; uint8_t mask = msb - (msb >> 7); // 0x7f or 0x00 return c + (mask & corr); }
The trick presented above could be easily used in SWAR procedure. A procedure converts 4 nibbles (32-bit code) or 8 nibbles (64-bit code) at once.
#define packed(byte) ((uint8_t)(byte) * 0x01010101u) uint32_t nibbles_to_hex_swar(uint32_t nibbles) { assert((nibbles & 0xf0f0f0f0) == 0); const uint32_t ascii09 = nibbles + packed('0'); const uint32_t correction = packed('a' - '0' - 10); const uint32_t tmp = nibbles + packed(128 - 10); const uint32_t msb = tmp & packed(0x80); const uint32_t mask = msb - (msb >> 7); return ascii09 + (mask & correction); }
Number of operations:
A SIMD version is based on the straightforward implementation of procedure nibble_to_hex2, where generating mask is done with single instruction PCMPGTB.
The procedure converts 16 nibbles.
uint8_t nine[9] = packed_byte(9); uint8_t ascii0[9] = packed_byte('0'); uint8_t correction[9] = packed_byte('a' - 10 - '0'); // assembler movdqu nibbles_x_16, %xmm0 movdqa %xmm0, %xmm1 // convert to ASCII paddb ascii0, %xmm1 // make mask pcmpgtb nine, %xmm0 // correct result pand correction, %xmm0 paddb %xmm1, %xmm0 // save result...
So far we silently assumed that nibbles are already placed in separated bytes. Expanding nibbles to bytes could be simply saved as:
uint32_t nibble_expand_naive(uint32_t x) { assert(x <= 0xffff); const uint32_t n0 = x & 0x0f; const uint32_t n1 = (x >> 4) & 0x0f; const uint32_t n2 = (x >> 8) & 0x0f; const uint32_t n3 = (x >> 12) & 0x0f; return n3 | (n2 << 8) | (n1 << 16) | (n0 << 24); }
Important: note that order of nibbles is reversed, because in strings the most significant digit is placed at the index 0.
However the procedure requires many operations, for 16-bit values these numbers are:
Multiplication could be used to express two independent shifts and one bit-or:
uint32_t nibble_expand_mul(uint32_t x) { assert(x <= 0xffff); // 1. isolate nibbles 0 and 2 const uint32_t n02 = x & 0x0f0f; // 2. isolate nibbles 1 and 3 const uint32_t n13 = x & 0xf0f0; // 3. move nibble 2 to 3rd byte // n02 = 0x00000a0b const uint32_t tmp1 = n02 * 0x00000101; // tmp1 = 0x000a???b const uint32_t tmp2 = tmp1 & 0x00ff00ff; // tmp2 = 0x000a000b // 4. move nibble 1 to 1st byte and 3 to the 4th byte const uint32_t tmp3 = n13 * 0x00001010; // tmp3 = 0x0c???f00 const uint32_t tmp4 = tmp3 & 0xff00ff00; // tmp4 = 0x0c000f00 return bswap(tmp2 | tmp4); }
Number of operations decreased to:
Multiplication has bigger latency than shift, but throughput is 1 cycle. Experiments on Penitum M have showed that this approach is not better than naive, but maybe never CPUs will execute this code faster.
Bit Manipulation Instruction is a new Intel's ISA extension, that adds some interesting instructions. One of them is PDEP, parallel deposit. Pseudocode shows how PDEP works (example for 64-bit words):
src1 - mask src2 - input dst - result dst := 0 k := 0 for i := 0 .. 64 loop if src1.bit[i] is set then dst.bit[i] := src2.bit[k] k := k + 1 end if end loop
This is exactly what nibble expansion needs:
int64_t nibble_expand_bmi2(uint64_t x) { assert(x <= 0xffffffff); const uint64_t src1 = 0x0f0f0f0f0f0f0f0f; const uint64_t src2 = x; const uint64_t result = pdep(src1, src2); return bswap(result); }
The function pdep is a wrapper around PDEP instruction.
With SSE it's quite simple (I'll use dot instead of 0 for masked values).
Load data:
movq reg, %xmm0 -- xmm0 = |..|..|..|..|..|..|..|..|fe|dc|ba|98|76|54|32|10| movdqa %xmm0, %xmm0
Isolate even nibbles:
pand mask, %xmm0 -- xmm0 = |..|..|..|..|..|..|..|..|.e|.c|.a|.8|.6|.4|.2|.0|
Isolate odd nibbles:
psrlw 8, %xmm1 -- xmm1 = |..|..|..|..|..|..|..|..|.f|ed|cb|a9|87|65|43|21| pand mask, %xmm1 -- xmm1 = |..|..|..|..|..|..|..|..|.f|.d|.b|.9|.7|.5|.3|.1|
Interleave expanded nibbles:
punpcklbw %xmm1, %xmm0 -- xmm0 = |f.|.e|.d|.c|.b|.a|.9|.8|.7|.6|.5|.4|.3|.2|.1|.0|
Swap bytes
Results produced by program speedup from the samples. All possible combinations of expanding and conversion algorithms were considered.
expand | convert | time [s] | speedup | |
---|---|---|---|---|
naive | naive | 0.4789 | 1.00 | ████████ |
swar | 0.1533 | 3.12 | ████████████████████████ | |
simd | 0.1884 | 2.54 | ████████████████████ | |
mul | naive | 0.4629 | 1.03 | ████████ |
swar | 0.1621 | 2.96 | ███████████████████████ | |
simd | 0.1905 | 2.51 | ████████████████████ | |
simd | naive | 0.4680 | 1.02 | ████████ |
swar | 0.1625 | 2.95 | ███████████████████████ | |
simd | 0.2037 | 2.35 | ██████████████████ |
Notes:
CPU: Core i5 CPU M 540 @ 2.53GHz
expand | convert | time [s] | speedup | |
---|---|---|---|---|
naive | naive | 0.1191 | 1.00 | ██████████ |
swar | 0.0406 | 2.93 | ███████████████████████████████ | |
simd | 0.0434 | 2.74 | █████████████████████████████ | |
mul | naive | 0.1062 | 1.12 | ████████████ |
swar | 0.0405 | 2.94 | ███████████████████████████████ | |
simd | 0.0433 | 2.75 | █████████████████████████████ | |
simd | naive | 0.1061 | 1.12 | ████████████ |
swar | 0.0404 | 2.95 | ████████████████████████████████ | |
simd | 0.0433 | 2.75 | █████████████████████████████ |
Sample programs are available at github.