Software emulation of PDEP

Author:Wojciech Muła
Added on:2014-09-23

Contents

Introduction

PDEP is a new instruction from BMI2 (Bit Manipulation Instruction), pseudocode for 32-bit PDEP variant:

src    - a source word
mask   - a mask

result := 0
k := 0
for i := 0 .. 31 loop
    if mask.bit[i] then
        result.bit[i] := src.bit[k]
        k := k + 1
    end if
end loop

Quite complicated, but it is really fast — latency is just 3 cycles and throughput is only one cycle. I've showed how to use this instruction in conversion integers to binary and haxadecimal representation.

I was wondering how this algorithm would execute on an old hardware.

Naive implementation

uint32_t naive(uint32_t src, uint32_t mask) {

    uint32_t result   = 0;
    uint32_t mask_bit = 1;
    uint32_t src_bit  = 1;
    int i;

    for (i=0; i < 32; i++, mask_bit <<= 1) {
        if (mask & mask_bit) {
            if (src & src_bit) {
                result |= mask_bit;
            }

            src_bit <<= 1;
        }
    }

    return result;
}

Shift right by 1 could be expressed as a simple add, on x86 this could also be encoded using LEA (no register flags dependecy).

Branchless implementation

uint32_t branchless(uint32_t src, uint32_t mask) {

    uint32_t result = 0;

    while (1) {

        // 1. isolate the lowest set bit of mask
        const uint32_t lowest = (-mask & mask);

        if (lowest == 0) {
            break;
        }

        // 2. populate LSB from src
        const uint32_t LSB = (uint32_t)((int32_t)(src << 31) >> 31);

        // 3. copy bit from mask
        result |= LSB & lowest;

        // 4. clear lowest bit
        mask &= ~lowest;

        // 5. prepare for next iteration
        src >>= 1;
    }

    return result;
}

Comparison results on Pentium M

The branchless version is much more complicated, but there is one clear advantage over naive version — number of iterations is equal to number of bits set in a mask.

This is clear when we look at sample results.

mask naive [s] branchless [s] speedup
00000000 0.1368 0.0035 38.76
00000001 0.1388 0.0081 17.19
00000003 0.1597 0.0127 12.56
00000007 0.1661 0.0230 7.24
0000000f 0.1691 0.0256 6.60
0000001f 0.1842 0.0328 5.61
0000003f 0.2430 0.0910 2.67
0000007f 0.2904 0.0499 5.82
000000ff 0.2743 0.0481 5.71
000001ff 0.2165 0.0530 4.08
000003ff 0.2139 0.0666 3.21
000007ff 0.2181 0.0764 2.85
00000fff 0.2183 0.0898 2.43
00001fff 0.2159 0.0920 2.35
00003fff 0.2288 0.0950 2.41
00007fff 0.2294 0.1085 2.11
0000ffff 0.2322 0.1086 2.14
0001ffff 0.2330 0.1074 2.17
0003ffff 0.3133 0.1158 2.71
0007ffff 0.2273 0.1216 1.87
000fffff 0.2396 0.1328 1.80
001fffff 0.2381 0.1308 1.82
003fffff 0.2667 0.1376 1.94
007fffff 0.2559 0.1701 1.50
00ffffff 0.2637 0.1511 1.75
01ffffff 0.2721 0.1645 1.65
03ffffff 0.2868 0.1617 1.77
07ffffff 0.2895 0.1878 1.54
0fffffff 0.2816 0.1737 1.62
1fffffff 0.2652 0.1765 1.50
3fffffff 0.2742 0.1949 1.41
7fffffff 0.2369 0.1876 1.26
ffffffff 0.2388 0.1988 1.20

Sample program

Sample programs are available at github.