Conversion numbers to binary representation

Author: Wojciech Muła
Added on:2014-09-11
Update:2016-11-20 (SWAR version 3, by Larry Bank, updates results) 2015-04-19 (SWAR version 2, results from Core i5) 2014-09-22 (BMI2 results), 2014-09-16 (+BMI2, lookup, experiments results)

Contents

Converting byte — SWAR version 1

Algorithm

  1. Populate byte v (bits: abcdefgh) in a 64-bit word:

    const uint64_t r1 = v * 0x0101010101010101
    
+--------+--------+--------+--------+--------+--------+--------+--------+
|abcdefgh|abcdefgh|abcdefgh|abcdefgh|abcdefgh|abcdefgh|abcdefgh|abcdefgh|
+--------+--------+--------+--------+--------+--------+--------+--------+
  1. Isolate one bit per byte:

    const uint64_t r2 = r1 & 0x8040201008040201
    
+--------+--------+--------+--------+--------+--------+--------+--------+
|a.......|.b......|..c.....|...d....|....e...|.....f..|......g.|.......h|
+--------+--------+--------+--------+--------+--------+--------+--------+
  1. Clone each bit to the highest position:

    const uint64_t r3 = r2 + 0x00406070787c7e7f
    
     +--------+--------+--------+--------+--------+--------+--------+--------+
r3 = |a.......|.b......|..c.....|...d....|....e...|.....f..|......g.|.......h|
     +--------+--------+--------+--------+--------+--------+--------+--------+

     +--------+--------+--------+--------+--------+--------+--------+--------+
   + |00000000|01000000|01100000|01110000|01111000|01111100|01111110|01111111|
     +--------+--------+--------+--------+--------+--------+--------+--------+

     +--------+--------+--------+--------+--------+--------+--------+--------+
r3 = |a.......|bx......|cxx.....|dxxx....|exxxx...|fxxxxx..|gxxxxxx.|hxxxxxxx|
     +--------+--------+--------+--------+--------+--------+--------+--------+
  1. Move 7th bits to the 0th position and mask the lowest bits:

    const uint64_t r4 = (r3 >> 7) & 0x0101010101010101
    
+--------+--------+--------+--------+--------+--------+--------+--------+
|.......a|.......b|.......c|.......d|.......e|.......f|.......g|.......h|
+--------+--------+--------+--------+--------+--------+--------+--------+
  1. Finally convert to ASCII:

    const uint64_t result = 0x3030303030303030 + r4  // ord('0') == 0x30
    

Notes

Sample implementation:

uint64_t convert_swar(uint8_t v) {

    const uint64_t r1 =  v * 0x0101010101010101;
    const uint64_t r2 = r1 & 0x8040201008040201;
    const uint64_t r3 = r2 + 0x00406070787c7e7f;
    const uint64_t r4 = (r3 >> 7) & 0x0101010101010101;

    return 0x3030303030303030 + r4;  // ord('0') == 0x30
}

This algorithm requires:

  • 1 multiplication,
  • 1 right shift,
  • 1 bit-and,
  • 1 bit-or,
  • 2 additions.

Multiplication and shift are the slowest operations.

Converting byte — SWAR version 2 — multiplication-based approach

In this approach single multiplication is used to scatter 4 bits on 4 bytes, then a single bit-and is used to isolate bits:

input   = 0b0000_abcd                                  (8-bit number)
output  = 0b0000_000a bcd0_00ab cd00_0abc d000_abcd    (32-bit number)
output' = 0b0000_000a 0000_000b 0000_000c 0000_000d

The magic multiplication constant is build as follows:

  1. the bit d must stay at position 0, so the first term of is 1;
  2. the bit c must appear at position 8, so we have to shift left by 8 − 1;
  3. likewise to move the bit b at position 16 we have to shift left by 16 − 2
  4. and finally shift left by 24 − 3 to move the bit a.

The constant is 1 + (1 << (8 - 1)) + (1 << (16 - 2)) + (1 << (24 - 3)) or 0x204081.

Downside of this method is processing only 4 bits per step.

Notes

Sample implementation:

const uint32_t mul = 1 + (1 << (8 - 1)) + (1 << (16 - 2)) + (1 << (24 - 3));
const uint32_t r1 = (v & 0xf) * mul;            // expand bits 0..3
const uint32_t r2 = (v >> 4) * mul;             // expand bits 4..7

const uint64_t r3 = r1 | (uint64_t(r2) << 32);  // merge 32-bit values
const uint64_t r4 = r3 & 0x0101010101010101;    // isolate bits

return 0x3030303030303030 + r4;                 // to ASCII

This algorithm requires:

  • 2 32-bit multiplications (32-bit muls are faster than 64-bit multiplication);
  • 2 shifts;
  • 2 bit-and;
  • 1 bit-or;
  • 1 add.

Converting byte — SWAR version 3 — multiplication-based approach

Author of this method is Larry Bank. We exchanged a few mails and Larry has gave me permission to describe his clever method here.

Similarly to the previous SWAR method, this one also uses a single multiplication. The multiplication scatters 7 lowest bit of an 8-bit number. Thus the method is suitable for 64-bit numbers.

Assuming we have a 7-bit number abcdefg, following shifts are needed (example for a 32-bit number):

bits: 76543210 76543210 76543210 76543210

t0 =  ........ ........ ........ .abcdefg   (shift by 0*7)
t1 =  ........ ........ ..abcdef g.......   (shift by 1*7)
t2 =  ........ ...abcde fg...... ........   (shift by 2*7)
t3 =  ....abcd efg..... ........ ........   (shift by 3*7)

t4 =  t1 | t2 | t3 | t4
   =  ....abcd efgabcde fgabcdef gabcdefg

t5 = t4 & 0x01010101
   = .......d .......e .......f .......g

To covert all bits of a byte, the specialised path for the MSB is needed.

Notes

Sample implementation:

// convert lowest 7 bits
const uint64_t mul = 0x02040810204081;
const uint64_t r1  = (v & 0x7f) * mul;
const uint64_t r2  = r1 & 0x0101010101010101;

// move the most significant bit to 56th bit
const uint64_t r3  = uint64_t(v & 0x80) << (56 - 7);
const uint64_t r4  = r2 | r3;

return 0x3030303030303030 + r4;

This algorithm requires:

  • 1 64-bit multiplication
  • 1 shift;
  • 3 bit-and;
  • 1 bit-or;
  • 1 add.

Moving the MSB to desired position may be also accomplished with range extending (casting). First we cast unsigned char (v) to signed value, extending this value to 64-bit word, efficiently broadcasting MSB on bits 8 thou 63. Then single bit-and forms proper word:

const uint64_t r3  = uint64_t(int64_t(int8_t(v))) & 0x0100000000000000;

This strange cast chain compiles to a single instruction movsx, which is as cheap as a normal move. The modified implementation doesn't require a bit-shift.

Converting word — SIMD version

The algorithm is similar to SWAR version 1. Instead of cloning bits, shift and mask (points 3 & 4) just a single instruction PCMPEQB is used:

uint8_t bit_mask[16] = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 /* x 2*/};
uint8_t ascii[16]    = packed_byte('0');

// assember:

// load input
movdqa  packed_byte(v), %xmm0

// 2. isolate bits
movdqa   bit_mask, %xmm7
pand     %xmm7, %xmm0
pcmpeqb  %xmm7, %xmm0     // xmm0 = {0x00 or 0xff}

// 3. convert to ASCII
movdqa   ascii, %xmm7
psub     %xmm7, %xmm0     // xmm7 - result

Full implementation is available at github (see Sample Programs below).

Converting byte — BMI2 version

Bit Manipulation Instruction is a new Intel's ISA extension, that add some interesting instructions. One of them is PDEP, parallel deposit, following pseudocode shows how PDEP works (example for 64-bit words):

src1 - mask
src2 - input
dst  - result

dst := 0
k   := 0
for i := 0 .. 64 loop
    if src1.bit[i] is set then
        dst.bit[i] := src2.bit[k]
        k := k + 1
    end if
end loop

Equivalent procedure in C (64-bit mode):

uint64_t pdep(uint64_t src1, uint64_t src2) {

    uint64_t result = 0;
    int k = 0;
    for (int i=0; i < 64; i++) {

        const uint64_t mask = (1 << i);

        if ((src1 & mask)
            if (src2 & (1 << k)) {
                result |= mask
            }

            k += 1;
        }
    }

    result;
}

Binary conversion is trivial, single PDEB realizes steps from 2 through 4 of the algorithm. Procedure bswap performs byte-swap and it's simply a wrapper for BSWAP instruction:

uint64_t convert_pdep(uint8_t byte) {

    const uint64_t expanded = pdep(byte, 0x0101010101010101);
    const uint64_t swapped  = bswap(expanded);

    return 0x3030303030303030 + swapped;
}

Yes, only three instructions!

Converting byte — lookup

Using simple lookup requires 256 * 8 = 2kB, this might be a performance problem.

Speed comparison

Results produced by program speedup from the samples.

Results from Pentium M (outdated)

method time [s] speedup  
naive 0.4061 1.00 ██████
lookup 0.0636 6.35 ████████████████████████████████████████
SWAR 0.2226 1.82 ███████████
SIMD 0.1708 2.38 ███████████████

Notes:

  • The lookup table size is really big, it would destroy L1 cache.
  • The SIMD version is not well written, it just convert 8 bits, while a 16-byte register can hold the result of conversion for two bytes.
  • Pentium M is a 32-bit processor, conversion routines use type uint64_t, i.e. most 64-bit operations in the SWAR version were performed by more than one CPU instruction.

Results from Core i5 (Westmere)

method time [s] speedup  
naive 2.4826 1.00 ███
lookup 0.1992 12.46 █████████████████████████████████████
SWAR 0.2732 9.09 ███████████████████████████
SWAR2 0.2274 10.92 ████████████████████████████████
SWAR3 0.2743 9.05 ███████████████████████████
SIMD 0.3622 6.85 ████████████████████

Results from Core i7 (Skylake)

method time [s] speedup  
naive 1.1796 1.00 ███
lookup 0.0805 14.66 ███████████████████████████████████████████
SWAR 0.3824 3.08 █████████
SWAR2 0.2781 4.24 ████████████
SWAR3 0.4508 2.62 ███████
SIMD 0.7077 1.67 █████
PDEP 0.1067 11.05 █████████████████████████████████

Notes:

  • Lookup is the fastest.
  • But all register-only implementations aren't very bad, and PDEP version is the best in this group.
  • The SIMD version performance is really poor.

Sample program

Sample programs are available at github.