Conversion numbers to binary representation

Author: Wojciech Muła 2014-09-11 2016-11-20 (SWAR version 3, by Larry Bank, updates results) 2015-04-19 (SWAR version 2, results from Core i5) 2014-09-22 (BMI2 results), 2014-09-16 (+BMI2, lookup, experiments results)

Converting byte — SWAR version 1

Algorithm

1. Populate byte v (bits: abcdefgh) in a 64-bit word:

```const uint64_t r1 = v * 0x0101010101010101
```
```+--------+--------+--------+--------+--------+--------+--------+--------+
|abcdefgh|abcdefgh|abcdefgh|abcdefgh|abcdefgh|abcdefgh|abcdefgh|abcdefgh|
+--------+--------+--------+--------+--------+--------+--------+--------+
```
1. Isolate one bit per byte:

```const uint64_t r2 = r1 & 0x8040201008040201
```
```+--------+--------+--------+--------+--------+--------+--------+--------+
|a.......|.b......|..c.....|...d....|....e...|.....f..|......g.|.......h|
+--------+--------+--------+--------+--------+--------+--------+--------+
```
1. Clone each bit to the highest position:

```const uint64_t r3 = r2 + 0x00406070787c7e7f
```
```     +--------+--------+--------+--------+--------+--------+--------+--------+
r3 = |a.......|.b......|..c.....|...d....|....e...|.....f..|......g.|.......h|
+--------+--------+--------+--------+--------+--------+--------+--------+

+--------+--------+--------+--------+--------+--------+--------+--------+
+ |00000000|01000000|01100000|01110000|01111000|01111100|01111110|01111111|
+--------+--------+--------+--------+--------+--------+--------+--------+

+--------+--------+--------+--------+--------+--------+--------+--------+
r3 = |a.......|bx......|cxx.....|dxxx....|exxxx...|fxxxxx..|gxxxxxx.|hxxxxxxx|
+--------+--------+--------+--------+--------+--------+--------+--------+
```
1. Move 7th bits to the 0th position and mask the lowest bits:

```const uint64_t r4 = (r3 >> 7) & 0x0101010101010101
```
```+--------+--------+--------+--------+--------+--------+--------+--------+
|.......a|.......b|.......c|.......d|.......e|.......f|.......g|.......h|
+--------+--------+--------+--------+--------+--------+--------+--------+
```
1. Finally convert to ASCII:

```const uint64_t result = 0x3030303030303030 + r4  // ord('0') == 0x30
```

Notes

Sample implementation:

```uint64_t convert_swar(uint8_t v) {

const uint64_t r1 =  v * 0x0101010101010101;
const uint64_t r2 = r1 & 0x8040201008040201;
const uint64_t r3 = r2 + 0x00406070787c7e7f;
const uint64_t r4 = (r3 >> 7) & 0x0101010101010101;

return 0x3030303030303030 + r4;  // ord('0') == 0x30
}
```

This algorithm requires:

• 1 multiplication,
• 1 right shift,
• 1 bit-and,
• 1 bit-or,

Multiplication and shift are the slowest operations.

Converting byte — SWAR version 2 — multiplication-based approach

In this approach single multiplication is used to scatter 4 bits on 4 bytes, then a single bit-and is used to isolate bits:

```input   = 0b0000_abcd                                  (8-bit number)
output  = 0b0000_000a bcd0_00ab cd00_0abc d000_abcd    (32-bit number)
output' = 0b0000_000a 0000_000b 0000_000c 0000_000d
```

The magic multiplication constant is build as follows:

1. the bit d must stay at position 0, so the first term of is 1;
2. the bit c must appear at position 8, so we have to shift left by 8 − 1;
3. likewise to move the bit b at position 16 we have to shift left by 16 − 2
4. and finally shift left by 24 − 3 to move the bit a.

The constant is 1 + (1 << (8 - 1)) + (1 << (16 - 2)) + (1 << (24 - 3)) or 0x204081.

Downside of this method is processing only 4 bits per step.

Notes

Sample implementation:

```const uint32_t mul = 1 + (1 << (8 - 1)) + (1 << (16 - 2)) + (1 << (24 - 3));
const uint32_t r1 = (v & 0xf) * mul;            // expand bits 0..3
const uint32_t r2 = (v >> 4) * mul;             // expand bits 4..7

const uint64_t r3 = r1 | (uint64_t(r2) << 32);  // merge 32-bit values
const uint64_t r4 = r3 & 0x0101010101010101;    // isolate bits

return 0x3030303030303030 + r4;                 // to ASCII
```

This algorithm requires:

• 2 32-bit multiplications (32-bit muls are faster than 64-bit multiplication);
• 2 shifts;
• 2 bit-and;
• 1 bit-or;

Converting byte — SWAR version 3 — multiplication-based approach

Author of this method is Larry Bank. We exchanged a few mails and Larry has gave me permission to describe his clever method here.

Similarly to the previous SWAR method, this one also uses a single multiplication. The multiplication scatters 7 lowest bit of an 8-bit number. Thus the method is suitable for 64-bit numbers.

Assuming we have a 7-bit number abcdefg, following shifts are needed (example for a 32-bit number):

```bits: 76543210 76543210 76543210 76543210

t0 =  ........ ........ ........ .abcdefg   (shift by 0*7)
t1 =  ........ ........ ..abcdef g.......   (shift by 1*7)
t2 =  ........ ...abcde fg...... ........   (shift by 2*7)
t3 =  ....abcd efg..... ........ ........   (shift by 3*7)

t4 =  t1 | t2 | t3 | t4
=  ....abcd efgabcde fgabcdef gabcdefg

t5 = t4 & 0x01010101
= .......d .......e .......f .......g
```

To covert all bits of a byte, the specialised path for the MSB is needed.

Notes

Sample implementation:

```// convert lowest 7 bits
const uint64_t mul = 0x02040810204081;
const uint64_t r1  = (v & 0x7f) * mul;
const uint64_t r2  = r1 & 0x0101010101010101;

// move the most significant bit to 56th bit
const uint64_t r3  = uint64_t(v & 0x80) << (56 - 7);
const uint64_t r4  = r2 | r3;

return 0x3030303030303030 + r4;
```

This algorithm requires:

• 1 64-bit multiplication
• 1 shift;
• 3 bit-and;
• 1 bit-or;

Moving the MSB to desired position may be also accomplished with range extending (casting). First we cast unsigned char (v) to signed value, extending this value to 64-bit word, efficiently broadcasting MSB on bits 8 thou 63. Then single bit-and forms proper word:

```const uint64_t r3  = uint64_t(int64_t(int8_t(v))) & 0x0100000000000000;
```

This strange cast chain compiles to a single instruction movsx, which is as cheap as a normal move. The modified implementation doesn't require a bit-shift.

Converting word — SIMD version

The algorithm is similar to SWAR version 1. Instead of cloning bits, shift and mask (points 3 & 4) just a single instruction PCMPEQB is used:

```uint8_t bit_mask[16] = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 /* x 2*/};
uint8_t ascii[16]    = packed_byte('0');

// assember:

movdqa  packed_byte(v), %xmm0

// 2. isolate bits
pand     %xmm7, %xmm0
pcmpeqb  %xmm7, %xmm0     // xmm0 = {0x00 or 0xff}

// 3. convert to ASCII
movdqa   ascii, %xmm7
psub     %xmm7, %xmm0     // xmm7 - result
```

Full implementation is available at github (see Sample Programs below).

Converting byte — BMI2 version

Bit Manipulation Instruction is a new Intel's ISA extension, that add some interesting instructions. One of them is PDEP, parallel deposit, following pseudocode shows how PDEP works (example for 64-bit words):

```src1 - mask
src2 - input
dst  - result

dst := 0
k   := 0
for i := 0 .. 64 loop
if src1.bit[i] is set then
dst.bit[i] := src2.bit[k]
k := k + 1
end if
end loop
```

Equivalent procedure in C (64-bit mode):

```uint64_t pdep(uint64_t src1, uint64_t src2) {

uint64_t result = 0;
int k = 0;
for (int i=0; i < 64; i++) {

const uint64_t mask = (1 << i);

if (src2 & (1 << k)) {
}

k += 1;
}
}

result;
}
```

Binary conversion is trivial, single PDEB realizes steps from 2 through 4 of the algorithm. Procedure bswap performs byte-swap and it's simply a wrapper for BSWAP instruction:

```uint64_t convert_pdep(uint8_t byte) {

const uint64_t expanded = pdep(byte, 0x0101010101010101);
const uint64_t swapped  = bswap(expanded);

return 0x3030303030303030 + swapped;
}
```

Yes, only three instructions!

Converting byte — lookup

Using simple lookup requires 256 * 8 = 2kB, this might be a performance problem.

Speed comparison

Results produced by program speedup from the samples.

Results from Pentium M (outdated)

method time [s] speedup
naive 0.4061 1.00 ██████
lookup 0.0636 6.35 ████████████████████████████████████████
SWAR 0.2226 1.82 ███████████
SIMD 0.1708 2.38 ███████████████

Notes:

• The lookup table size is really big, it would destroy L1 cache.
• The SIMD version is not well written, it just convert 8 bits, while a 16-byte register can hold the result of conversion for two bytes.
• Pentium M is a 32-bit processor, conversion routines use type uint64_t, i.e. most 64-bit operations in the SWAR version were performed by more than one CPU instruction.

Results from Core i5 (Westmere)

method time [s] speedup
naive 2.4826 1.00 ███
lookup 0.1992 12.46 █████████████████████████████████████
SWAR 0.2732 9.09 ███████████████████████████
SWAR2 0.2274 10.92 ████████████████████████████████
SWAR3 0.2743 9.05 ███████████████████████████
SIMD 0.3622 6.85 ████████████████████

Results from Core i7 (Skylake)

method time [s] speedup
naive 1.1796 1.00 ███
lookup 0.0805 14.66 ███████████████████████████████████████████
SWAR 0.3824 3.08 █████████
SWAR2 0.2781 4.24 ████████████
SWAR3 0.4508 2.62 ███████
SIMD 0.7077 1.67 █████
PDEP 0.1067 11.05 █████████████████████████████████

Notes:

• Lookup is the fastest.
• But all register-only implementations aren't very bad, and PDEP version is the best in this group.
• The SIMD version performance is really poor.

Sample program

Sample programs are available at github.