# Software emulation of PDEP

Author: Wojciech Muła 2014-09-23

# Introduction

PDEP is a new instruction from BMI2 (Bit Manipulation Instruction), pseudocode for 32-bit PDEP variant:

```src    - a source word

result := 0
k := 0
for i := 0 .. 31 loop
result.bit[i] := src.bit[k]
k := k + 1
end if
end loop
```

Quite complicated, but it is really fast — latency is just 3 cycles and throughput is only one cycle. I've showed how to use this instruction in conversion integers to binary and haxadecimal representation.

I was wondering how this algorithm would execute on an old hardware.

# Naive implementation

```uint32_t naive(uint32_t src, uint32_t mask) {

uint32_t result   = 0;
uint32_t src_bit  = 1;
int i;

for (i=0; i < 32; i++, mask_bit <<= 1) {
if (src & src_bit) {
}

src_bit <<= 1;
}
}

return result;
}
```

Shift right by 1 could be expressed as a simple add, on x86 this could also be encoded using LEA (no register flags dependecy).

# Branchless implementation

```uint32_t branchless(uint32_t src, uint32_t mask) {

uint32_t result = 0;

while (1) {

// 1. isolate the lowest set bit of mask

if (lowest == 0) {
break;
}

// 2. populate LSB from src
const uint32_t LSB = (uint32_t)((int32_t)(src << 31) >> 31);

// 3. copy bit from mask
result |= LSB & lowest;

// 4. clear lowest bit

// 5. prepare for next iteration
src >>= 1;
}

return result;
}
```

# Comparison results on Pentium M

The branchless version is much more complicated, but there is one clear advantage over naive version — number of iterations is equal to number of bits set in a mask.

This is clear when we look at sample results.

mask naive [s] branchless [s] speedup
00000000 0.1368 0.0035 38.76
00000001 0.1388 0.0081 17.19
00000003 0.1597 0.0127 12.56
00000007 0.1661 0.0230 7.24
0000000f 0.1691 0.0256 6.60
0000001f 0.1842 0.0328 5.61
0000003f 0.2430 0.0910 2.67
0000007f 0.2904 0.0499 5.82
000000ff 0.2743 0.0481 5.71
000001ff 0.2165 0.0530 4.08
000003ff 0.2139 0.0666 3.21
000007ff 0.2181 0.0764 2.85
00000fff 0.2183 0.0898 2.43
00001fff 0.2159 0.0920 2.35
00003fff 0.2288 0.0950 2.41
00007fff 0.2294 0.1085 2.11
0000ffff 0.2322 0.1086 2.14
0001ffff 0.2330 0.1074 2.17
0003ffff 0.3133 0.1158 2.71
0007ffff 0.2273 0.1216 1.87
000fffff 0.2396 0.1328 1.80
001fffff 0.2381 0.1308 1.82
003fffff 0.2667 0.1376 1.94
007fffff 0.2559 0.1701 1.50
00ffffff 0.2637 0.1511 1.75
01ffffff 0.2721 0.1645 1.65
03ffffff 0.2868 0.1617 1.77
07ffffff 0.2895 0.1878 1.54
0fffffff 0.2816 0.1737 1.62
1fffffff 0.2652 0.1765 1.50
3fffffff 0.2742 0.1949 1.41
7fffffff 0.2369 0.1876 1.26
ffffffff 0.2388 0.1988 1.20

## Sample program

Sample programs are available at github.