Author: | Wojciech Muła |
---|---|
Added on: | 2014-09-11 |
Update: | 2016-11-20 (SWAR version 3, by Larry Bank, updates results) 2015-04-19 (SWAR version 2, results from Core i5) 2014-09-22 (BMI2 results), 2014-09-16 (+BMI2, lookup, experiments results) |
Contents
Populate byte v (bits: abcdefgh) in a 64-bit word:
const uint64_t r1 = v * 0x0101010101010101
+--------+--------+--------+--------+--------+--------+--------+--------+ |abcdefgh|abcdefgh|abcdefgh|abcdefgh|abcdefgh|abcdefgh|abcdefgh|abcdefgh| +--------+--------+--------+--------+--------+--------+--------+--------+
Isolate one bit per byte:
const uint64_t r2 = r1 & 0x8040201008040201
+--------+--------+--------+--------+--------+--------+--------+--------+ |a.......|.b......|..c.....|...d....|....e...|.....f..|......g.|.......h| +--------+--------+--------+--------+--------+--------+--------+--------+
Clone each bit to the highest position:
const uint64_t r3 = r2 + 0x00406070787c7e7f
+--------+--------+--------+--------+--------+--------+--------+--------+ r3 = |a.......|.b......|..c.....|...d....|....e...|.....f..|......g.|.......h| +--------+--------+--------+--------+--------+--------+--------+--------+ +--------+--------+--------+--------+--------+--------+--------+--------+ + |00000000|01000000|01100000|01110000|01111000|01111100|01111110|01111111| +--------+--------+--------+--------+--------+--------+--------+--------+ +--------+--------+--------+--------+--------+--------+--------+--------+ r3 = |a.......|bx......|cxx.....|dxxx....|exxxx...|fxxxxx..|gxxxxxx.|hxxxxxxx| +--------+--------+--------+--------+--------+--------+--------+--------+
Move 7th bits to the 0th position and mask the lowest bits:
const uint64_t r4 = (r3 >> 7) & 0x0101010101010101
+--------+--------+--------+--------+--------+--------+--------+--------+ |.......a|.......b|.......c|.......d|.......e|.......f|.......g|.......h| +--------+--------+--------+--------+--------+--------+--------+--------+
Finally convert to ASCII:
const uint64_t result = 0x3030303030303030 + r4 // ord('0') == 0x30
Sample implementation:
uint64_t convert_swar(uint8_t v) { const uint64_t r1 = v * 0x0101010101010101; const uint64_t r2 = r1 & 0x8040201008040201; const uint64_t r3 = r2 + 0x00406070787c7e7f; const uint64_t r4 = (r3 >> 7) & 0x0101010101010101; return 0x3030303030303030 + r4; // ord('0') == 0x30 }
This algorithm requires:
Multiplication and shift are the slowest operations.
In this approach single multiplication is used to scatter 4 bits on 4 bytes, then a single bit-and is used to isolate bits:
input = 0b0000_abcd (8-bit number) output = 0b0000_000a bcd0_00ab cd00_0abc d000_abcd (32-bit number) output' = 0b0000_000a 0000_000b 0000_000c 0000_000d
The magic multiplication constant is build as follows:
The constant is 1 + (1 << (8 - 1)) + (1 << (16 - 2)) + (1 << (24 - 3)) or 0x204081.
Downside of this method is processing only 4 bits per step.
Sample implementation:
const uint32_t mul = 1 + (1 << (8 - 1)) + (1 << (16 - 2)) + (1 << (24 - 3)); const uint32_t r1 = (v & 0xf) * mul; // expand bits 0..3 const uint32_t r2 = (v >> 4) * mul; // expand bits 4..7 const uint64_t r3 = r1 | (uint64_t(r2) << 32); // merge 32-bit values const uint64_t r4 = r3 & 0x0101010101010101; // isolate bits return 0x3030303030303030 + r4; // to ASCII
This algorithm requires:
Author of this method is Larry Bank. We exchanged a few mails and Larry has gave me permission to describe his clever method here.
Similarly to the previous SWAR method, this one also uses a single multiplication. The multiplication scatters 7 lowest bit of an 8-bit number. Thus the method is suitable for 64-bit numbers.
Assuming we have a 7-bit number abcdefg, following shifts are needed (example for a 32-bit number):
bits: 76543210 76543210 76543210 76543210 t0 = ........ ........ ........ .abcdefg (shift by 0*7) t1 = ........ ........ ..abcdef g....... (shift by 1*7) t2 = ........ ...abcde fg...... ........ (shift by 2*7) t3 = ....abcd efg..... ........ ........ (shift by 3*7) t4 = t1 | t2 | t3 | t4 = ....abcd efgabcde fgabcdef gabcdefg t5 = t4 & 0x01010101 = .......d .......e .......f .......g
To covert all bits of a byte, the specialised path for the MSB is needed.
Sample implementation:
// convert lowest 7 bits const uint64_t mul = 0x02040810204081; const uint64_t r1 = (v & 0x7f) * mul; const uint64_t r2 = r1 & 0x0101010101010101; // move the most significant bit to 56th bit const uint64_t r3 = uint64_t(v & 0x80) << (56 - 7); const uint64_t r4 = r2 | r3; return 0x3030303030303030 + r4;
This algorithm requires:
Moving the MSB to desired position may be also accomplished with range extending (casting). First we cast unsigned char (v) to signed value, extending this value to 64-bit word, efficiently broadcasting MSB on bits 8 thou 63. Then single bit-and forms proper word:
const uint64_t r3 = uint64_t(int64_t(int8_t(v))) & 0x0100000000000000;
This strange cast chain compiles to a single instruction movsx, which is as cheap as a normal move. The modified implementation doesn't require a bit-shift.
The algorithm is similar to SWAR version 1. Instead of cloning bits, shift and mask (points 3 & 4) just a single instruction PCMPEQB is used:
uint8_t bit_mask[16] = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 /* x 2*/}; uint8_t ascii[16] = packed_byte('0'); // assember: // load input movdqa packed_byte(v), %xmm0 // 2. isolate bits movdqa bit_mask, %xmm7 pand %xmm7, %xmm0 pcmpeqb %xmm7, %xmm0 // xmm0 = {0x00 or 0xff} // 3. convert to ASCII movdqa ascii, %xmm7 psub %xmm7, %xmm0 // xmm7 - result
Full implementation is available at github (see Sample Programs below).
Bit Manipulation Instruction is a new Intel's ISA extension, that add some interesting instructions. One of them is PDEP, parallel deposit, following pseudocode shows how PDEP works (example for 64-bit words):
src1 - mask src2 - input dst - result dst := 0 k := 0 for i := 0 .. 64 loop if src1.bit[i] is set then dst.bit[i] := src2.bit[k] k := k + 1 end if end loop
Equivalent procedure in C (64-bit mode):
uint64_t pdep(uint64_t src1, uint64_t src2) { uint64_t result = 0; int k = 0; for (int i=0; i < 64; i++) { const uint64_t mask = (1 << i); if ((src1 & mask) if (src2 & (1 << k)) { result |= mask } k += 1; } } result; }
Binary conversion is trivial, single PDEB realizes steps from 2 through 4 of the algorithm. Procedure bswap performs byte-swap and it's simply a wrapper for BSWAP instruction:
uint64_t convert_pdep(uint8_t byte) { const uint64_t expanded = pdep(byte, 0x0101010101010101); const uint64_t swapped = bswap(expanded); return 0x3030303030303030 + swapped; }
Yes, only three instructions!
Using simple lookup requires 256 * 8 = 2kB, this might be a performance problem.
Results produced by program speedup from the samples.
method | time [s] | speedup | |
---|---|---|---|
naive | 0.4061 | 1.00 | ██████ |
lookup | 0.0636 | 6.35 | ████████████████████████████████████████ |
SWAR | 0.2226 | 1.82 | ███████████ |
SIMD | 0.1708 | 2.38 | ███████████████ |
Notes:
method | time [s] | speedup | |
---|---|---|---|
naive | 2.4826 | 1.00 | ███ |
lookup | 0.1992 | 12.46 | █████████████████████████████████████ |
SWAR | 0.2732 | 9.09 | ███████████████████████████ |
SWAR2 | 0.2274 | 10.92 | ████████████████████████████████ |
SWAR3 | 0.2743 | 9.05 | ███████████████████████████ |
SIMD | 0.3622 | 6.85 | ████████████████████ |
method | time [s] | speedup | |
---|---|---|---|
naive | 1.1796 | 1.00 | ███ |
lookup | 0.0805 | 14.66 | ███████████████████████████████████████████ |
SWAR | 0.3824 | 3.08 | █████████ |
SWAR2 | 0.2781 | 4.24 | ████████████ |
SWAR3 | 0.4508 | 2.62 | ███████ |
SIMD | 0.7077 | 1.67 | █████ |
PDEP | 0.1067 | 11.05 | █████████████████████████████████ |
Notes:
Sample programs are available at github.