Author: | Wojciech Muła |
---|---|
Added on: | 2014-09-23 |
Contents
PDEP is a new instruction from BMI2 (Bit Manipulation Instruction), pseudocode for 32-bit PDEP variant:
src - a source word mask - a mask result := 0 k := 0 for i := 0 .. 31 loop if mask.bit[i] then result.bit[i] := src.bit[k] k := k + 1 end if end loop
Quite complicated, but it is really fast — latency is just 3 cycles and throughput is only one cycle. I've showed how to use this instruction in conversion integers to binary and haxadecimal representation.
I was wondering how this algorithm would execute on an old hardware.
uint32_t naive(uint32_t src, uint32_t mask) { uint32_t result = 0; uint32_t mask_bit = 1; uint32_t src_bit = 1; int i; for (i=0; i < 32; i++, mask_bit <<= 1) { if (mask & mask_bit) { if (src & src_bit) { result |= mask_bit; } src_bit <<= 1; } } return result; }
Shift right by 1 could be expressed as a simple add, on x86 this could also be encoded using LEA (no register flags dependecy).
uint32_t branchless(uint32_t src, uint32_t mask) { uint32_t result = 0; while (1) { // 1. isolate the lowest set bit of mask const uint32_t lowest = (-mask & mask); if (lowest == 0) { break; } // 2. populate LSB from src const uint32_t LSB = (uint32_t)((int32_t)(src << 31) >> 31); // 3. copy bit from mask result |= LSB & lowest; // 4. clear lowest bit mask &= ~lowest; // 5. prepare for next iteration src >>= 1; } return result; }
The branchless version is much more complicated, but there is one clear advantage over naive version — number of iterations is equal to number of bits set in a mask.
This is clear when we look at sample results.
mask | naive [s] | branchless [s] | speedup |
---|---|---|---|
00000000 | 0.1368 | 0.0035 | 38.76 |
00000001 | 0.1388 | 0.0081 | 17.19 |
00000003 | 0.1597 | 0.0127 | 12.56 |
00000007 | 0.1661 | 0.0230 | 7.24 |
0000000f | 0.1691 | 0.0256 | 6.60 |
0000001f | 0.1842 | 0.0328 | 5.61 |
0000003f | 0.2430 | 0.0910 | 2.67 |
0000007f | 0.2904 | 0.0499 | 5.82 |
000000ff | 0.2743 | 0.0481 | 5.71 |
000001ff | 0.2165 | 0.0530 | 4.08 |
000003ff | 0.2139 | 0.0666 | 3.21 |
000007ff | 0.2181 | 0.0764 | 2.85 |
00000fff | 0.2183 | 0.0898 | 2.43 |
00001fff | 0.2159 | 0.0920 | 2.35 |
00003fff | 0.2288 | 0.0950 | 2.41 |
00007fff | 0.2294 | 0.1085 | 2.11 |
0000ffff | 0.2322 | 0.1086 | 2.14 |
0001ffff | 0.2330 | 0.1074 | 2.17 |
0003ffff | 0.3133 | 0.1158 | 2.71 |
0007ffff | 0.2273 | 0.1216 | 1.87 |
000fffff | 0.2396 | 0.1328 | 1.80 |
001fffff | 0.2381 | 0.1308 | 1.82 |
003fffff | 0.2667 | 0.1376 | 1.94 |
007fffff | 0.2559 | 0.1701 | 1.50 |
00ffffff | 0.2637 | 0.1511 | 1.75 |
01ffffff | 0.2721 | 0.1645 | 1.65 |
03ffffff | 0.2868 | 0.1617 | 1.77 |
07ffffff | 0.2895 | 0.1878 | 1.54 |
0fffffff | 0.2816 | 0.1737 | 1.62 |
1fffffff | 0.2652 | 0.1765 | 1.50 |
3fffffff | 0.2742 | 0.1949 | 1.41 |
7fffffff | 0.2369 | 0.1876 | 1.26 |
ffffffff | 0.2388 | 0.1988 | 1.20 |
Sample programs are available at github.