Author: Wojciech Muła 2014-09-21 2016-03-07 (Core i5 results)

Branchless conversion of a single nibble

For a nibble (0..15) stored in a byte, the conversion could be coded in this way:

```
char nibble_to_hex(uint8_t byte) {
    assert(byte <= 15); // byte >= 0 always holds for uint8_t

    char c = byte + '0';
    if (byte > 9)
        c += 'a' - '0' - 10;

    return c;
}
```

If a nibble is greater than 9, the resulting letter has to be one of ASCII 'a' .. 'f' (or 'A' .. 'F'). This is done with a simple correction of the code: the value of the correction 'a' - '0' - 10 is 39, and for uppercase letters ('A' - '0' - 10) it is 7.

The condition has to be replaced by a branchless expression. First the code is changed to:

```
char nibble_to_hex2(uint8_t byte) {
    assert(byte <= 15);

    const char corr = 'a' - '0' - 10;
    const char c    = byte + '0';

    uint8_t mask    = (byte > 9) ? 0xff : 0x00;

    return c + (mask & corr);
}
```

We're a bit closer. Now the question is: how to get the mask from the condition byte > 9? Let's examine a simple addition: 128 - 10 + x. For values x = 0 .. 9 the result is in range 128 - 10 .. 128 - 1 (118 .. 127), and for values x = 10 .. 15 the result is in range 128 .. 128 + 5 (128 .. 133).

Observation: for x in range 10 .. 15 the result has the highest bit set; otherwise it's clear. In other words, we get 0x80 or 0x00 depending on the condition, and now the mask can be calculated as:

```
uint8_t tmp  = 128 - 10 + byte;
uint8_t msb  = tmp & 0x80;

uint8_t mask = (msb - (msb >> 7)) | msb; // 3 operations, 0xff or 0x00
```

Since the correction's value is 39 or 7, i.e. less than 128, the mask can be calculated more simply, yielding values 0x7f/0x00:

```
uint8_t mask = msb - (msb >> 7); // 2 operations
```

The final version:

```
char nibble_to_hex3(uint8_t byte) {
    assert(byte <= 15);

    const char corr = 'a' - '0' - 10;
    const char c    = byte + '0';

    uint8_t tmp  = 128 - 10 + byte;
    uint8_t msb  = tmp & 0x80;

    uint8_t mask = msb - (msb >> 7); // 0x7f or 0x00

    return c + (mask & corr);
}
```
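The branchless routine can be checked exhaustively against a plain lookup string; this is my test harness, repeating the routine so the snippet is self-contained:

```c
#include <stdint.h>
#include <assert.h>

// Repeated from above so the snippet is self-contained.
char nibble_to_hex3(uint8_t byte) {
    const char corr = 'a' - '0' - 10;
    const char c    = byte + '0';

    uint8_t tmp  = 128 - 10 + byte;
    uint8_t msb  = tmp & 0x80;
    uint8_t mask = msb - (msb >> 7); // 0x7f or 0x00

    return c + (mask & corr);
}
```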

Converting nibbles — SWAR version

The trick presented above can be easily used in a SWAR procedure, which converts 4 nibbles (32-bit code) or 8 nibbles (64-bit code) at once.

```
#define packed(byte) ((uint8_t)(byte) * 0x01010101u)

uint32_t nibbles_to_hex_swar(uint32_t nibbles) {
    assert((nibbles & 0xf0f0f0f0) == 0);

    const uint32_t ascii09    = nibbles + packed('0');
    const uint32_t correction = packed('a' - '0' - 10);

    const uint32_t tmp  = nibbles + packed(128 - 10);
    const uint32_t msb  = tmp & packed(0x80);
    const uint32_t mask = msb - (msb >> 7);

    return ascii09 + (mask & correction);
}
```

Number of operations:

• 3 additions,
• 1 subtraction,
• 2 bit-ands,
• 1 shift.
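A usage sketch (my addition): since the procedure works on values, not memory, the expected result can be checked independently of the host's byte order — on a little-endian machine the bytes of the result appear in memory lowest byte first.

```c
#include <stdint.h>
#include <assert.h>

#define packed(byte) ((uint8_t)(byte) * 0x01010101u)

// Repeated from above so the snippet is self-contained.
uint32_t nibbles_to_hex_swar(uint32_t nibbles) {
    const uint32_t ascii09    = nibbles + packed('0');
    const uint32_t correction = packed('a' - '0' - 10);

    const uint32_t tmp  = nibbles + packed(128 - 10);
    const uint32_t msb  = tmp & packed(0x80);
    const uint32_t mask = msb - (msb >> 7);

    return ascii09 + (mask & correction);
}
```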

Converting nibbles — SIMD version

The SIMD version is based on a straightforward implementation of the procedure nibble_to_hex2, where generating the mask is done with the single instruction PCMPGTB.

The procedure converts 16 nibbles.

```
uint8_t nine[16]       = packed_byte(9);
uint8_t ascii0[16]     = packed_byte('0');
uint8_t correction[16] = packed_byte('a' - 10 - '0');

// assembler

movdqu    nibbles_x_16, %xmm0
movdqa    %xmm0, %xmm1

// build mask: 0xff where nibble > 9

pcmpgtb   nine, %xmm0

// select correction

pand      correction, %xmm0

// convert to ASCII and apply correction

paddb     ascii0, %xmm1
paddb     %xmm1, %xmm0

// save result...
```

Split nibbles

So far we have silently assumed that nibbles are already placed in separate bytes. Expanding nibbles to bytes can simply be written as:

```
uint32_t nibble_expand_naive(uint32_t x) {
    assert(x <= 0xffff);

    const uint32_t n0 = x & 0x0f;
    const uint32_t n1 = (x >>  4) & 0x0f;
    const uint32_t n2 = (x >>  8) & 0x0f;
    const uint32_t n3 = (x >> 12) & 0x0f;

    return n3 | (n2 << 8) | (n1 << 16) | (n0 << 24);
}
```

Important: note that the order of the nibbles is reversed, because in strings the most significant digit is placed at index 0.

However, the procedure requires many operations; for 16-bit values the counts are:

• 6 shifts,
• 4 bit-and,
• 3 bit-or.
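A worked example of the reversal (my addition): the input 0xcafb expands to 0x0b0f0a0c, so the most significant nibble 0xc lands in the lowest byte and, on a little-endian machine, comes first in memory.

```c
#include <stdint.h>
#include <assert.h>

// Repeated from above so the snippet is self-contained.
uint32_t nibble_expand_naive(uint32_t x) {
    const uint32_t n0 = x & 0x0f;
    const uint32_t n1 = (x >>  4) & 0x0f;
    const uint32_t n2 = (x >>  8) & 0x0f;
    const uint32_t n3 = (x >> 12) & 0x0f;

    // most significant nibble ends up in the lowest byte
    return n3 | (n2 << 8) | (n1 << 16) | (n0 << 24);
}
```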

x86 - plain

Multiplication could be used to express two independent shifts and one bit-or:

```
uint32_t nibble_expand_mul(uint32_t x) {
    assert(x <= 0xffff);

    // example values in comments assume x = 0xcafb

    // 1. isolate nibbles 0 and 2
    const uint32_t n02 = x & 0x0f0f;

    // 2. isolate nibbles 1 and 3
    const uint32_t n13 = x & 0xf0f0;

    // 3. move nibble 2 to byte 2                // n02  = 0x00000a0b
    const uint32_t tmp1 = n02  * 0x00000101;     // tmp1 = 0x000a???b
    const uint32_t tmp2 = tmp1 & 0x00ff00ff;     // tmp2 = 0x000a000b

    // 4. move nibble 1 to byte 1 and nibble 3 to byte 3
    const uint32_t tmp3 = n13  * 0x00001010;     // tmp3 = 0x0c???f00
    const uint32_t tmp4 = tmp3 & 0xff00ff00;     // tmp4 = 0x0c000f00

    return bswap(tmp2 | tmp4);
}
```

Number of operations decreased to:

• 2 multiplications,
• 4 bit-and,
• 1 bit-or,
• 1 byte-swap (the function bswap is merely a wrapper around the CPU instruction of the same name).
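Both variants can be checked against each other over all 16-bit inputs. A sketch, assuming the GCC/Clang builtin __builtin_bswap32 as the bswap wrapper (MSVC would use _byteswap_ulong instead):

```c
#include <stdint.h>
#include <assert.h>

// bswap wrapper -- assumes a GCC/Clang builtin.
static uint32_t bswap(uint32_t x) { return __builtin_bswap32(x); }

uint32_t nibble_expand_naive(uint32_t x) {
    const uint32_t n0 = x & 0x0f;
    const uint32_t n1 = (x >>  4) & 0x0f;
    const uint32_t n2 = (x >>  8) & 0x0f;
    const uint32_t n3 = (x >> 12) & 0x0f;
    return n3 | (n2 << 8) | (n1 << 16) | (n0 << 24);
}

uint32_t nibble_expand_mul(uint32_t x) {
    const uint32_t tmp2 = ((x & 0x0f0f) * 0x0101) & 0x00ff00ff; // bytes 0, 2
    const uint32_t tmp4 = ((x & 0xf0f0) * 0x1010) & 0xff00ff00; // bytes 1, 3
    return bswap(tmp2 | tmp4);
}
```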

Multiplication has a bigger latency than a shift, but its throughput is 1 cycle. Experiments on a Pentium M showed that this approach is not better than the naive one, but newer CPUs may execute this code faster.

x86 - BMI2 (Bit Manipulation Instructions)

Bit Manipulation Instructions (BMI) is a new Intel ISA extension that adds some interesting instructions. One of them is PDEP, parallel deposit. The pseudocode shows how PDEP works (example for 64-bit words):

```
src1 - mask
src2 - input
dst  - result

dst := 0
k   := 0
for i := 0 .. 63 loop
    if src1.bit[i] is set then
        dst.bit[i] := src2.bit[k]
        k := k + 1
    end if
end loop
```
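The pseudocode translates directly into a portable scalar emulation (my sketch; the argument order follows the pseudocode — note that the real intrinsic _pdep_u64 from <immintrin.h> takes (source, mask), i.e. the opposite order):

```c
#include <stdint.h>
#include <assert.h>

// Scalar emulation of PDEP, following the pseudocode above:
// deposit consecutive low bits of src2 into the set-bit
// positions of src1.
uint64_t pdep(uint64_t src1, uint64_t src2) {
    uint64_t dst = 0;
    int k = 0;
    for (int i = 0; i < 64; i++) {
        if (src1 & (1ull << i)) {
            if (src2 & (1ull << k))
                dst |= 1ull << i;
            k += 1;
        }
    }
    return dst;
}
```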

This is exactly what nibble expansion needs:

```
uint64_t nibble_expand_bmi2(uint64_t x) {
    assert(x <= 0xffffffff);

    const uint64_t src1 = 0x0f0f0f0f0f0f0f0f; // mask
    const uint64_t src2 = x;                  // input

    const uint64_t result = pdep(src1, src2);

    return bswap(result);
}
```

The function pdep is a wrapper around the PDEP instruction.

SSE

With SSE it's quite simple (I'll use a dot instead of 0 for masked-out values).

1. Load the nibbles and make a copy:

```
movq   reg,   %xmm0      -- xmm0 = |..|..|..|..|..|..|..|..|fe|dc|ba|98|76|54|32|10|
movdqa %xmm0, %xmm1
```
2. Isolate even nibbles:

```
pand   mask,  %xmm0      -- xmm0 = |..|..|..|..|..|..|..|..|.e|.c|.a|.8|.6|.4|.2|.0|
```
3. Isolate odd nibbles:

```
psrlw    $4,  %xmm1      -- xmm1 = |..|..|..|..|..|..|..|..|.f|ed|.b|a9|.7|65|.3|21|
pand   mask,  %xmm1      -- xmm1 = |..|..|..|..|..|..|..|..|.f|.d|.b|.9|.7|.5|.3|.1|
```
4. Interleave expanded nibbles:

```
punpcklbw %xmm1, %xmm0   -- xmm0 = |.f|.e|.d|.c|.b|.a|.9|.8|.7|.6|.5|.4|.3|.2|.1|.0|
```
5. Swap bytes, so the most significant digit comes first (e.g. with PSHUFB).

Speedup comparison

Results were produced by the program speedup from the samples. All possible combinations of the expansion and conversion algorithms were considered.

Results from Pentium M

expand  convert  time [s]  speedup
naive   naive    0.4789    1.00   ████████
naive   swar     0.1533    3.12   ████████████████████████
naive   simd     0.1884    2.54   ████████████████████
mul     naive    0.4629    1.03   ████████
mul     swar     0.1621    2.96   ███████████████████████
mul     simd     0.1905    2.51   ████████████████████
simd    naive    0.4680    1.02   ████████
simd    swar     0.1625    2.95   ███████████████████████
simd    simd     0.2037    2.35   ██████████████████

Notes:

• The expansion procedure does not dominate the running time.
• The naive expansion is not that bad.
• Surprisingly, the fastest conversion code is scalar SWAR, not SSE.

Results from Core i5

CPU: Core i5 CPU M 540 @ 2.53GHz

expand  convert  time [s]  speedup
naive   naive    0.1191    1.00   ██████████
naive   swar     0.0406    2.93   ███████████████████████████████
naive   simd     0.0434    2.74   █████████████████████████████
mul     naive    0.1062    1.12   ████████████
mul     swar     0.0405    2.94   ███████████████████████████████
mul     simd     0.0433    2.75   █████████████████████████████
simd    naive    0.1061    1.12   ████████████
simd    swar     0.0404    2.95   ████████████████████████████████
simd    simd     0.0433    2.75   █████████████████████████████