SIMD parallel bits deposit/extract

Author: Wojciech Muła
Added on:2025-01-05

Contents

Introduction

The BMI2 extension introduced two complementary instructions: parallel bits deposit (PDEP) and parallel bits extract (PEXT).

The PDEP scatters continuous set of bits to positions denoted by the mask. The PEXT does the opposite: gathers/compresses selected bits into a continuous word.

SIMD instruction sets do not directly support this kind of operations. There is GF2P8AFFINEQB in AVX-512, that allows arbitrary bit shuffling at the byte level (see Use AVX512 Galois field affine transformation for bit shuffling).

In this text we show approaches suitable for implementing PEXT and PDEP for wider element widths on any SIMD ISA.

PEXT

Specification

Pseudocode for PEXT:

N - the number of bits in a word (32 or 64)

k := 0
result := 0
for m := 0 to N do
    if mask[m] == 1 then
        result[k] = data[m]
        k := k + 1
    end
end for
       ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
mask = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 11 │ 0 │ 0 │ 1 │ 0 │ 11 │ 0 │
       └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
       ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
data = │ 10110100101010110 │
       └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
                 |       |               |   |           |       |   │
                 │       │               │   │           │       │   └───┐
                 │       │               │   │           │       └───┐   │
                 │       │               │   │           └───────┐   │   │
                 │       │               │   └───────────────┐   │   │   │
                 │       │               └───────────────┐   │   │   │   │
                 │       └───────────────────────────┐   │   │   │   │   │
                 └───────────────────────────────┐   │   │   │   │   │   │
                                                 │   │   │   │   │   │   │
                                                 ▼   ▼   ▼   ▼   ▼   ▼   ▼
       ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
pext = │ 00000000001010111 │
       └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

Software implementation

The most basic operation we need is copying the m-th bit from data to k-th position in the result word. An important property is that the source index m is greater or equals the destination index k.

We can avoid any explicit shifting of bits. We mask the m-th bit of data: this yields either 1 << m or 0. We have the mask for destination bit, having value 1 << k. Calculating the minimum of these two values yields either 1 << k or 0. Finally, we merge that value with the result word.

The following pseudocode shows the idea:

                              // m-th bit = 1             m-th bit = 0
masked = data & (1 << m);     // masked = 1 << m          masked = 0
k_mask = 1 << k               // k_mask = 1 << k          k_mask = 1 << k
bit_k  = min(masked, k_mask); // bit_k  = 1 << k          bit_k  = 0

To isolate the first bit, we'll use a well known trick of resetting such bit:

x'  = (x - 1) & x;  // reset lowest bit set
fbs = x ^ x';       // isolated lowest bit set

And this is how this works for sample values:

                                                       the first bit set
                                                                │
                                                                ▼
              ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
x           = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 1 │ 0 │ 0 │ 0 │ 0 │
              └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
              ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
x - 1       = │ 00101000110101111 │
              └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
              ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
(x-1) & x   = │ 00101000110100000 │ = x'
              └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

              ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
x' ^ x      = │ 00000000000010000 │ = fbs
              └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

AVX2 implementation

The following C++ template generates an implementation. The most generic one would scan all mask bits (MAX_MASK_BITS=32), but when we know the maximum number of bits set in masks, we may limit the number of iterations accordingly.

The parameter EARLY_EXIT decides whether to check in every iteration if all mask bits were scanned on not. It adds runtime overhead, and in the case we don't know how diverse masks are, this check won't help. If we know that the number of bits in different masks are similar, this check may be a win.

template <int MAX_MASK_BITS, bool EARLY_EXIT>
void avx2_pext_u32_reference(const uint32_t* data_arr, const uint32_t* mask_arr, uint32_t* out_arr, size_t n) {
    static_assert(MAX_MASK_BITS > 0);
    static_assert(MAX_MASK_BITS <= 32);

    const __m256i one  = _mm256_set1_epi32(1);
    const __m256i zero = _mm256_set1_epi32(0);
    for (size_t i=0; i < n; i += 8) {
        __m256i data = _mm256_loadu_si256((const __m256i*)(&data_arr[i]));
        __m256i mask = _mm256_loadu_si256((const __m256i*)(&mask_arr[i]));
        __m256i out  = _mm256_set1_epi32(0);

        __m256i bit = one;

        for (int j=0; j < MAX_MASK_BITS; j++) {
            // 1. isolate the first non-zero bit set of mask
            const __m256i m0 = _mm256_sub_epi32(mask, one);
            const __m256i m1 = _mm256_and_si256(mask, m0);
            const __m256i m2 = _mm256_xor_si256(mask, m1);

            // 2. isolate that bit from data word
            const __m256i d0 = _mm256_and_si256(data, m2);

            // 3. move that bit on the next position in out
            const __m256i o0 = _mm256_min_epu32(d0, bit);
            out = _mm256_or_si256(out, o0);

            // 4. reset selected bit in mask (we already done it)
            mask = m1;

            // 5. advance to the next bit in out
            bit = _mm256_add_epi32(bit, bit);

            // 6. all are zeros?
            if (EARLY_EXIT && _mm256_testc_si256(zero, mask)) {
                break;
            }
        }

        _mm256_storeu_si256((__m256i*)(&out_arr[i]), out);
    }
}

AVX-512 implementation

Solution for AVX-512 can use the above trick, but it is simpler to directly use mask registers. There is instruction VPTESTMD that sets a register mask based on bit-and of two operands. This allows us to produce element-wise masks from m-th bits.

Then conditional VPORD can be used to update the resulting word.

The following fragment from AVX-512 implementation shows these steps.

// 1. isolate the first bit set of mask
//                                                 mask = [0101_1001_1100_0000|...]
const __m512i m0 = _mm512_sub_epi32(mask, one); // m0   = [0101_1001_1011_1111|...]
const __m512i m1 = _mm512_and_si512(mask, m0);  // m1   = [0101_1001_1000_0000|...]
const __m512i m2 = _mm512_xor_si512(mask, m1);  // m2   = [0000_0000_0100_0000|...]

// 2. get m-th data bits in a mask register
const __mmask64 m = _mm512_test_epi32_mask(data, m2);

// 3. set k-th bit when m-th bit is set
out = _mm512_mask_or_epi32(out, m, out, bit);

// 4. reset selected bit in mask
mask = m1;

// 5. the next bit to set
bit = _mm512_add_epi32(bit, bit);

PDEP

Specification

Pseudocode for PDEP:

N - the number of bits in a word (32 or 64)

k := 0
result := 0
for m := 0 to N do
    if mask[m] == 1 then
        result[m] = data[k]
        k := k + 1
    end
end for

Example of PDEP invoked with a 4-bit mask and sample data.

       ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
mask = │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 0 │ 11 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
       └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
       ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
data = │ 10110100101011010 │
       └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
                                                             │   |   |   |
             ┌───────────────────────────────────────────────┘   │   │   │
             │               ┌───────────────────────────────────┘   │   │
             │               │               ┌───────────────────────┘   │
             │               │               │   ┌───────────────────────┘
             │               │               │   │
             ▼               ▼               ▼   ▼
       ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
pdep = │ 01000000010000000 │
       └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

Software implementation

In the case of PDEP we need to copy the k-th from data to the m-th bit of destination. Similarly to PEXT, in this case relation mk is hold.

It is also possible to avoid any bit shifting, but the method is not as elegant as for PEXT.

  1. We start from isolating m-th bit from the mask

                                                                m-th bit
                                                                    │
                                                                    ▼
                  ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
    mask        = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 1 │ 0 │ 0 │ 0 │ 0 │
                  └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
                  ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
    mask - 1    = │ 00101000110101111 │
                  └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
  2. In the word mask - 1 we have a series of ones before the m. Since km, we may now add the isolated k-th bit from data. If that bit it one, it will cause the series of ones to carry up the bit again to m-th position.

                                                                m-th bit
                                                                    │
                                                                    ▼
                  ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
    mask - 1    = │ 00101000110101111 │
                  └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
                  ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
    data_k      = │ 00000000000000010 │
                  └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
                                                                                ▲
                                                                                │
                                                                             k-th bit
    
                  ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
    sum         = │ 00101000110110001 │
                  └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
  3. Than we need isolate the m-th bit, mask that bit in sum and use it to update the resulting word.

    m1 = mask - 1
    m2 = mask & m1
    m3 = mask ^ m2
    
    data_k = data & (1 << k)
    sum    = m1 + data_k
    data_m = m3 & sum
    
    result = result | data_m
    

AVX2 implementation

While the shift-less method is feasible, we found an easier way to update the result word. We use VPCMPEQD that fills a 32-bit word with zeros or ones, depending on comparison result. We isolate the k-th bit of data and compare it with zero.

This yield a negated mask, which we use to filter out m-th bits.

The following snippet from AVX2 implementation shows the order of operations.

// 1. isolate the first non-zoro bit set of mask (at m)
//                                                 mask = [0101_1001_1100_0000|0000_1110_1100_1000|...]
const __m256i m0 = _mm256_sub_epi32(mask, one); // m0   = [0101_1001_1011_1111|0000_1110_1100_0111|...]
const __m256i m1 = _mm256_and_si256(mask, m0);  // m1   = [0101_1001_1000_0000|0000_1110_1100_0000|...]
const __m256i m2 = _mm256_xor_si256(mask, m1);  // m2   = [0000_0000_0100_0000|0000_0000_0000_1000|...]

// 2. isolate k-th bit from data                   data = [1100_0000_1111_1110|0000_0000_1000_0000|...]
//                                                  bit = [0000_0000_0001_0000]0000_0000_0001_0000|...]
const __m256i d0 = _mm256_and_si256(data, bit); //   d0 = [0000_0000_0001_0000|0000_0000_0000_0000|...]
                                                //                      ^                   ^
// 4. fill word with *negation* of data bit
const __m256i d1 = _mm256_cmpeq_epi32(d0, zero);//   d1 = [0000_0000_0000_0000|1111_1111_1111_1111|...]

// 5. keep the mask bit, iff data[k] == 1
const __m256i m3 = _mm256_andnot_si256(d1, m2); //   m3 = [0000_0000_0100_0000|0000_0000_0000_0000|...]

// 6. update the out
out = _mm256_or_si256(out, m3);
mask = m1;

// 7. the next bit in data to check
bit = _mm256_add_epi32(bit, bit);

AVX-512 implementation

AVX-512 code uses the same approach as PEXT implementation. We transfer k-th bits from data word to a mask register and then use masked VPTERNLOGD to conditionally bit-or the m-th bits into the result words.

// 1. get k-th data bits in a mask register
const __mmask64 m = _mm512_test_epi32_mask(data, bit);

// 2. isolate the first non-zoro bit set of mask (at m)

//                                                 mask = [0101_1001_1100_0000|0000_1110_1100_1000|...]
const __m512i m0 = _mm512_sub_epi32(mask, one); // m0   = [0101_1001_1011_1111|0000_1110_1100_0111|...]
const __m512i m1 = _mm512_and_si512(mask, m0);  // m1   = [0101_1001_1000_0000|0000_1110_1100_0000|...]

// 3. out = out | (mask ^ m1)
out = _mm512_mask_ternarylogic_epi32(out, m, mask, m1, 0xf6);

// 4. the next mask value
mask = m1;

// 5. the next bit in data to update
bit = _mm512_add_epi32(bit, bit);

Experiments

The baseline code uses PDEP or PEXT invoked in a simple loop. It's a bit unrealistic solution, but can be seen it as a theoretical limit. More realistic scenario is when we have some computations done on vector registers and we have to switch to scalar code to perform PDEP or PEXT.

Summary:

All test programs were compiled with options -O3 -march=native on each machine running tests.

Tested PEXT procedures
Procedure Comments
pext scalar PEXT baseline
pext AVX2 AVX2 implementation for 32-bit masks
pext AVX2 (early exit) AVX2 implementation for 32-bit masks, with breaking the inner loop when all masks become zero
pext AVX2 (6 bit) AVX2 implementation for 6-bit masks
pext AVX2 (6-bit, early exit) AVX2 implementation for 6-bit masks, with breaking the inner loop when all masks become zero
pext AVX2 (8 bit) AVX2 implementation for 8-bit masks
pext AVX2 (8-bit, early exit) AVX2 implementation for 8-bit masks, with breaking the inner loop when all masks become zero
pext AVX2 (16 bit) AVX2 implementation for 16-bit masks
pext AVX2 (16-bit, early exit) AVX2 implementation for 16-bit masks, with breaking the inner loop when all masks become zero
pext AVX2 (24 bit) AVX2 implementation for 24-bit masks
pext AVX2 (24-bit, early exit) AVX2 implementation for 24-bit masks, with breaking the inner loop when all masks become zero
pext AVX-512 AVX-512 implementation for 32-bit masks
pext AVX-512 (early exit) AVX-512 implementation for 32-bit masks, with breaking the inner loop when all masks become zero
pext AVX-512 (6 bit) AVX-512 implementation for 6-bit masks
pext AVX-512 (6-bit, early exit) AVX-512 implementation for 6-bit masks, with breaking the inner loop when all masks become zero
pext AVX-512 (8 bit) AVX-512 implementation for 8-bit masks
pext AVX-512 (8-bit, early exit) AVX-512 implementation for 8-bit masks, with breaking the inner loop when all masks become zero
pext AVX-512 (16 bit) AVX-512 implementation for 16-bit masks
pext AVX-512 (16-bit, early exit) AVX-512 implementation for 16-bit masks, with breaking the inner loop when all masks become zero
pext AVX-512 (24 bit) AVX-512 implementation for 24-bit masks
pext AVX-512 (24-bit, early exit) AVX-512 implementation for 24-bit masks, with breaking the inner loop when all masks become zero
Tested PDEP procedures
Procedure Comments
pdep scalar PDEP baseline
pdep AVX2 AVX2 implementation for 32-bit masks
pdep AVX2 (early exit) AVX2 implementation for 32-bit masks, with breaking the inner loop when all masks become zero
pdep AVX2 (6 bit) AVX2 implementation for 6-bit masks
pdep AVX2 (6-bit, early exit) AVX2 implementation for 6-bit masks, with breaking the inner loop when all masks become zero
pdep AVX2 (8 bit) AVX2 implementation for 8-bit masks
pdep AVX2 (8-bit, early exit) AVX2 implementation for 8-bit masks, with breaking the inner loop when all masks become zero
pdep AVX2 (16 bit) AVX2 implementation for 16-bit masks
pdep AVX2 (16-bit, early exit) AVX2 implementation for 16-bit masks, with breaking the inner loop when all masks become zero
pdep AVX2 (24 bit) AVX2 implementation for 24-bit masks
pdep AVX2 (24-bit, early exit) AVX2 implementation for 24-bit masks, with breaking the inner loop when all masks become zero
pdep AVX-512 AVX-512 implementation for 32-bit masks
pdep AVX-512 (early exit) AVX-512 implementation for 32-bit masks, with breaking the inner loop when all masks become zero
pdep AVX-512 (6 bit) AVX-512 implementation for 6-bit masks
pdep AVX-512 (6-bit, early exit) AVX-512 implementation for 6-bit masks, with breaking the inner loop when all masks become zero
pdep AVX-512 (8 bit) AVX-512 implementation for 8-bit masks
pdep AVX-512 (8-bit, early exit) AVX-512 implementation for 8-bit masks, with breaking the inner loop when all masks become zero
pdep AVX-512 (16 bit) AVX-512 implementation for 16-bit masks
pdep AVX-512 (16-bit, early exit) AVX-512 implementation for 16-bit masks, with breaking the inner loop when all masks become zero
pdep AVX-512 (24 bit) AVX-512 implementation for 24-bit masks
pdep AVX-512 (24-bit, early exit) AVX-512 implementation for 24-bit masks, with breaking the inner loop when all masks become zero

Ryzen 7

  • Compiler: gcc (Debian 14.1.0-5) 14.1.0
  • CPU: AMD Ryzen 7 7730U with Radeon Graphics
procedure time in cycles per byte speed-up
  average best    
any 32-bit mask
pext scalar 1.012 0.915 1.0 ████████████████████████████████████████
pext AVX2 4.549 4.536 0.2 ████████
pext AVX2 (early exit) 6.174 6.133 0.1 █████▉
mask has no more than 6 bits
pext scalar 0.957 0.918 1.0 ████████████████████████████████████▎
pext AVX2 (6 bit) 0.862 0.834 1.1 ████████████████████████████████████████
pext AVX2 (6 bit, early exit) 0.923 0.896 1.0 █████████████████████████████████████▏
mask has no more than 8 bits
pext scalar 0.938 0.909 1.0 ███████████████████████████████████████▍
pext AVX2 (8 bit) 0.914 0.896 1.0 ████████████████████████████████████████
pext AVX2 (8 bit, early exit) 1.207 1.195 0.8 █████████████████████████████▉
mask has no more than 16 bits
pext scalar 0.938 0.909 1.0 ████████████████████████████████████████
pext AVX2 (16 bit) 2.022 2.005 0.5 ██████████████████▏
pext AVX2 (16 bit, early exit) 4.169 4.160 0.2 ████████▋
mask has no more than 24 bits
pext scalar 0.939 0.911 1.0 ████████████████████████████████████████
pext AVX2 (24 bit) 3.314 3.297 0.3 ███████████
pext AVX2 (24 bit, early exit) 5.322 5.310 0.2 ██████▊
procedure time in cycles per byte speed-up
  average best    
any 32-bit mask
pdep scalar 0.924 0.881 1.0 ████████████████████████████████████████
pdep AVX2 4.409 4.398 0.2 ████████
pdep AVX2 (early exit) 6.230 6.206 0.1 █████▋
mask has no more than 6 bits
pdep scalar 0.906 0.878 1.0 ████████████████████████████████████▎
pdep AVX2 (6 bit) 0.813 0.798 1.1 ████████████████████████████████████████
pdep AVX2 (6 bit, early exit) 0.931 0.918 1.0 ██████████████████████████████████▊
mask has no more than 8 bits
pdep scalar 0.905 0.876 1.0 ████████████████████████████████████████
pdep AVX2 (8 bit) 0.940 0.928 0.9 █████████████████████████████████████▊
pdep AVX2 (8 bit, early exit) 1.251 1.240 0.7 ████████████████████████████▎
mask has no more than 16 bits
pdep scalar 0.897 0.878 1.0 ████████████████████████████████████████
pdep AVX2 (16 bit) 2.070 2.055 0.4 █████████████████
pdep AVX2 (16 bit, early exit) 4.254 4.242 0.2 ████████▎
mask has no more than 24 bits
pdep scalar 0.902 0.876 1.0 ████████████████████████████████████████
pdep AVX2 (24 bit) 3.260 3.244 0.3 ██████████▊
pdep AVX2 (24 bit, early exit) 5.397 5.385 0.2 ██████▌

Alder Lake

  • Compiler: gcc (Debian 13.2.0-25) 13.2.0
  • CPU: 12th Gen Intel(R) Core(TM) i7-1255U
procedure time in cycles per byte speed-up
  average best    
any 32-bit mask
pext scalar 1.527 1.286 1.0 ████████████████████████████████████████
pext AVX2 7.556 6.846 0.2 ███████▌
pext AVX2 (early exit) 8.778 8.001 0.2 ██████▍
mask has no more than 6 bits
pext scalar 1.471 1.264 1.0 ████████████████████████████████████████
pext AVX2 (6 bit) 1.482 1.289 1.0 ███████████████████████████████████████▏
pext AVX2 (6 bit, early exit) 1.673 1.477 0.9 ██████████████████████████████████▏
mask has no more than 8 bits
pext scalar 1.480 1.273 1.0 ████████████████████████████████████████
pext AVX2 (8 bit) 1.654 1.446 0.9 ███████████████████████████████████▏
pext AVX2 (8 bit, early exit) 2.233 2.006 0.6 █████████████████████████▍
mask has no more than 16 bits
pext scalar 1.464 1.268 1.0 ████████████████████████████████████████
pext AVX2 (16 bit) 3.428 3.042 0.4 ████████████████▋
pext AVX2 (16 bit, early exit) 6.045 5.461 0.2 █████████▎
mask has no more than 24 bits
pext scalar 1.491 1.257 1.0 ████████████████████████████████████████
pext AVX2 (24 bit) 5.622 5.025 0.3 ██████████
pext AVX2 (24 bit, early exit) 7.925 7.078 0.2 ███████
procedure time in cycles per byte speed-up
  average best    
any 32-bit mask
pdep scalar 1.512 1.276 1.0 ████████████████████████████████████████
pdep AVX2 7.857 7.091 0.2 ███████▏
pdep AVX2 (early exit) 8.317 7.452 0.2 ██████▊
mask has no more than 6 bits
pdep scalar 1.479 1.246 1.0 ████████████████████████████████████████
pdep AVX2 (6 bit) 1.472 1.318 0.9 █████████████████████████████████████▊
pdep AVX2 (6 bit, early exit) 1.699 1.498 0.8 █████████████████████████████████▎
mask has no more than 8 bits
pdep scalar 1.441 1.250 1.0 ████████████████████████████████████████
pdep AVX2 (8 bit) 1.768 1.517 0.8 ████████████████████████████████▉
pdep AVX2 (8 bit, early exit) 2.282 2.038 0.6 ████████████████████████▌
mask has no more than 16 bits
pdep scalar 1.476 1.270 1.0 ████████████████████████████████████████
pdep AVX2 (16 bit) 3.472 3.130 0.4 ████████████████▏
pdep AVX2 (16 bit, early exit) 6.062 5.497 0.2 █████████▏
mask has no more than 24 bits
pdep scalar 1.484 1.282 1.0 ████████████████████████████████████████
pdep AVX2 (24 bit) 5.795 5.244 0.2 █████████▊
pdep AVX2 (24 bit, early exit) 7.715 6.835 0.2 ███████▌

Ice Lake

  • Compiler: gcc (GCC) 13.3.1 20240611 (Red Hat 13.3.1-2)
  • CPU: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
procedure time in cycles per byte speed-up
  average best    
any 32-bit mask
pext scalar 3.105 2.968 1.0 ████████████████████████████████████████
pext AVX2 18.473 14.244 0.2 ████████▎
pext AVX2 (early exit) 13.330 12.203 0.2 █████████▋
pext AVX512 9.780 9.657 0.3 ████████████▎
pext AVX512 (early exit) 9.098 8.885 0.3 █████████████▎
pext AVX512 v2 9.735 9.610 0.3 ████████████▎
pext AVX512 v2 (early exit) 8.421 8.171 0.4 ██████████████▌
mask has no more than 6 bits
pext scalar 3.160 3.050 1.0 ███████████████▌
pext AVX2 (6 bit) 2.079 1.972 1.5 ████████████████████████▏
pext AVX2 (6 bit, early exit) 2.832 2.700 1.1 █████████████████▋
pext AVX512 (6 bit) 1.360 1.259 2.4 █████████████████████████████████████▊
pext AVX512 (6 bit, early exit) 1.755 1.600 1.9 █████████████████████████████▊
pext AVX512 v2 (6 bit) 1.282 1.191 2.6 ████████████████████████████████████████
pext AVX512 v2 (6 bit, early exit) 1.647 1.510 2.0 ███████████████████████████████▌
mask has no more than 8 bits
pext scalar 3.146 3.022 1.0 ██████████████████▉
pext AVX2 (8 bit) 2.740 2.610 1.2 █████████████████████▉
pext AVX2 (8 bit, early exit) 3.742 3.552 0.9 ████████████████▏
pext AVX512 (8 bit) 1.695 1.541 2.0 █████████████████████████████████████▏
pext AVX512 (8 bit, early exit) 2.294 2.132 1.4 ██████████████████████████▉
pext AVX512 v2 (8 bit) 1.559 1.433 2.1 ████████████████████████████████████████
pext AVX512 v2 (8 bit, early exit) 2.201 2.026 1.5 ████████████████████████████▎
mask has no more than 16 bits
pext scalar 3.084 2.983 1.0 ████████████████████████████████████▊
pext AVX2 (16 bit) 5.406 5.146 0.6 █████████████████████▎
pext AVX2 (16 bit, early exit) 8.798 8.562 0.3 ████████████▊
pext AVX512 (16 bit) 3.192 2.936 1.0 █████████████████████████████████████▍
pext AVX512 (16 bit, early exit) 5.321 5.113 0.6 █████████████████████▍
pext AVX512 v2 (16 bit) 2.983 2.746 1.1 ████████████████████████████████████████
pext AVX512 v2 (16 bit, early exit) 4.665 4.575 0.7 ████████████████████████
mask has no more than 24 bits
pext scalar 3.082 2.992 1.0 ████████████████████████████████████████
pext AVX2 (24 bit) 12.732 11.147 0.3 ██████████▋
pext AVX2 (24 bit, early exit) 13.043 12.588 0.2 █████████▌
pext AVX512 (24 bit) 7.410 7.290 0.4 ████████████████▍
pext AVX512 (24 bit, early exit) 8.349 8.149 0.4 ██████████████▋
pext AVX512 v2 (24 bit) 7.384 7.269 0.4 ████████████████▍
pext AVX512 v2 (24 bit, early exit) 7.397 7.121 0.4 ████████████████▊
procedure time in cycles per byte speed-up
  average best    
any 32-bit mask
pdep scalar 3.117 2.969 1.0 ████████████████████████████████████████
pdep AVX2 19.314 19.046 0.2 ██████▏
pdep AVX2 (early exit) 16.449 16.272 0.2 ███████▎
pdep AVX512 10.165 9.934 0.3 ███████████▉
pdep AVX512 (early exit) 8.663 8.252 0.4 ██████████████▍
pdep AVX512 v2 9.401 8.432 0.4 ██████████████
pdep AVX512 v2 (early exit) 8.040 7.691 0.4 ███████████████▍
mask has no more than 6 bits
pdep scalar 2.839 2.705 1.0 ████████████████▋
pdep AVX2 (6 bit) 2.153 2.006 1.3 ██████████████████████▌
pdep AVX2 (6 bit, early exit) 2.860 2.678 1.0 ████████████████▉
pdep AVX512 (6 bit) 1.620 1.388 1.9 ████████████████████████████████▌
pdep AVX512 (6 bit, early exit) 1.852 1.720 1.6 ██████████████████████████▎
pdep AVX512 v2 (6 bit) 1.189 1.132 2.4 ████████████████████████████████████████
pdep AVX512 v2 (6 bit, early exit) 1.510 1.422 1.9 ███████████████████████████████▊
mask has no more than 8 bits
pdep scalar 3.128 3.035 1.0 ████████████████▌
pdep AVX2 (8 bit) 3.115 2.948 1.0 █████████████████
pdep AVX2 (8 bit, early exit) 4.122 3.870 0.8 ████████████▉
pdep AVX512 (8 bit) 1.769 1.678 1.8 █████████████████████████████▉
pdep AVX512 (8 bit, early exit) 2.220 2.159 1.4 ███████████████████████▏
pdep AVX512 v2 (8 bit) 1.295 1.254 2.4 ████████████████████████████████████████
pdep AVX512 v2 (8 bit, early exit) 1.822 1.695 1.8 █████████████████████████████▌
mask has no more than 16 bits
pdep scalar 3.157 3.042 1.0 ██████████████████████████████▊
pdep AVX2 (16 bit) 6.010 5.802 0.5 ████████████████▏
pdep AVX2 (16 bit, early exit) 9.223 8.982 0.3 ██████████▍
pdep AVX512 (16 bit) 3.616 3.364 0.9 ███████████████████████████▊
pdep AVX512 (16 bit, early exit) 5.407 5.180 0.6 ██████████████████
pdep AVX512 v2 (16 bit) 2.452 2.343 1.3 ████████████████████████████████████████
pdep AVX512 v2 (16 bit, early exit) 4.449 4.290 0.7 █████████████████████▊
mask has no more than 24 bits
pdep scalar 3.131 3.042 1.0 ████████████████████████████████████████
pdep AVX2 (24 bit) 13.123 10.766 0.3 ███████████▎
pdep AVX2 (24 bit, early exit) 15.508 12.937 0.2 █████████▍
pdep AVX512 (24 bit) 7.661 7.438 0.4 ████████████████▎
pdep AVX512 (24 bit, early exit) 8.637 8.416 0.4 ██████████████▍
pdep AVX512 v2 (24 bit) 7.110 6.954 0.4 █████████████████▍
pdep AVX512 v2 (24 bit, early exit) 6.173 5.914 0.5 ████████████████████▌

Skylake-X

  • Compiler: gcc (GCC) 11.2.0
  • CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz
procedure time in cycles per byte speed-up
  average best    
any 32-bit mask
pext scalar 2.943 2.894 1.0 ████████████████████████████████████████
pext AVX2 13.043 13.019 0.2 ████████▉
pext AVX2 (early exit) 12.986 12.952 0.2 ████████▉
pext AVX512 10.296 10.267 0.3 ███████████▎
pext AVX512 (early exit) 11.788 11.756 0.2 █████████▊
pext AVX512 v2 10.263 10.232 0.3 ███████████▎
pext AVX512 v2 (early exit) 10.896 10.853 0.3 ██████████▋
mask has no more than 6 bits
pext scalar 2.986 2.958 1.0 ███████████████████████████████████████▏
pext AVX2 (6 bit) 2.910 2.894 1.0 ████████████████████████████████████████
pext AVX2 (6 bit, early exit) 3.143 3.103 1.0 █████████████████████████████████████▎
pext AVX512 (6 bit) 3.009 2.979 1.0 ██████████████████████████████████████▊
pext AVX512 (6 bit, early exit) 3.339 3.223 0.9 ███████████████████████████████████▉
pext AVX512 v2 (6 bit) 3.109 2.963 1.0 ███████████████████████████████████████
pext AVX512 v2 (6 bit, early exit) 3.172 3.124 0.9 █████████████████████████████████████
mask has no more than 8 bits
pext scalar 2.980 2.957 1.0 ████████████████████████████████████████
pext AVX2 (8 bit) 3.254 3.224 0.9 ████████████████████████████████████▋
pext AVX2 (8 bit, early exit) 3.821 3.791 0.8 ███████████████████████████████▏
pext AVX512 (8 bit) 3.311 3.234 0.9 ████████████████████████████████████▌
pext AVX512 (8 bit, early exit) 3.852 3.799 0.8 ███████████████████████████████▏
pext AVX512 v2 (8 bit) 3.159 3.122 0.9 █████████████████████████████████████▉
pext AVX512 v2 (8 bit, early exit) 3.644 3.616 0.8 ████████████████████████████████▋
mask has no more than 16 bits
pext scalar 2.974 2.948 1.0 ████████████████████████████████████████
pext AVX2 (16 bit) 5.719 5.697 0.5 ████████████████████▋
pext AVX2 (16 bit, early exit) 9.145 9.118 0.3 ████████████▉
pext AVX512 (16 bit) 4.774 4.747 0.6 ████████████████████████▊
pext AVX512 (16 bit, early exit) 7.565 7.541 0.4 ███████████████▋
pext AVX512 v2 (16 bit) 4.572 4.548 0.6 █████████████████████████▉
pext AVX512 v2 (16 bit, early exit) 7.293 7.262 0.4 ████████████████▏
mask has no more than 24 bits
pext scalar 3.037 2.963 1.0 ████████████████████████████████████████
pext AVX2 (24 bit) 9.721 9.704 0.3 ████████████▏
pext AVX2 (24 bit, early exit) 12.069 12.038 0.2 █████████▊
pext AVX512 (24 bit) 7.816 7.790 0.4 ███████████████▏
pext AVX512 (24 bit, early exit) 10.960 10.934 0.3 ██████████▊
pext AVX512 v2 (24 bit) 7.752 7.722 0.4 ███████████████▎
pext AVX512 v2 (24 bit, early exit) 9.947 9.915 0.3 ███████████▉
procedure time in cycles per byte speed-up
  average best    
any 32-bit mask
pdep scalar 3.051 2.878 1.0 ████████████████████████████████████████
pdep AVX2 12.760 12.741 0.2 █████████
pdep AVX2 (early exit) 16.604 16.584 0.2 ██████▉
pdep AVX512 10.707 10.686 0.3 ██████████▊
pdep AVX512 (early exit) 11.476 11.448 0.3 ██████████
pdep AVX512 v2 8.677 8.636 0.3 █████████████▎
pdep AVX512 v2 (early exit) 10.570 10.543 0.3 ██████████▉
mask has no more than 6 bits
pdep scalar 3.012 2.983 1.0 ██████████████████████████████████████
pdep AVX2 (6 bit) 3.012 2.990 1.0 ██████████████████████████████████████
pdep AVX2 (6 bit, early exit) 3.280 3.240 0.9 ███████████████████████████████████
pdep AVX512 (6 bit) 3.162 3.110 1.0 ████████████████████████████████████▌
pdep AVX512 (6 bit, early exit) 3.419 3.399 0.9 █████████████████████████████████▍
pdep AVX512 v2 (6 bit) 2.863 2.841 1.0 ████████████████████████████████████████
pdep AVX512 v2 (6 bit, early exit) 3.106 3.075 1.0 ████████████████████████████████████▉
mask has no more than 8 bits
pdep scalar 2.961 2.937 1.0 ████████████████████████████████████████
pdep AVX2 (8 bit) 3.332 3.295 0.9 ███████████████████████████████████▋
pdep AVX2 (8 bit, early exit) 4.052 4.020 0.7 █████████████████████████████▏
pdep AVX512 (8 bit) 3.431 3.394 0.9 ██████████████████████████████████▌
pdep AVX512 (8 bit, early exit) 3.940 3.915 0.8 ██████████████████████████████
pdep AVX512 v2 (8 bit) 2.976 2.958 1.0 ███████████████████████████████████████▋
pdep AVX512 v2 (8 bit, early exit) 3.497 3.408 0.9 ██████████████████████████████████▍
mask has no more than 16 bits
pdep scalar 3.019 2.981 1.0 ████████████████████████████████████████
pdep AVX2 (16 bit) 5.505 5.481 0.5 █████████████████████▊
pdep AVX2 (16 bit, early exit) 8.897 8.870 0.3 █████████████▍
pdep AVX512 (16 bit) 4.782 4.758 0.6 █████████████████████████
pdep AVX512 (16 bit, early exit) 7.411 7.381 0.4 ████████████████▏
pdep AVX512 v2 (16 bit) 3.978 3.952 0.8 ██████████████████████████████▏
pdep AVX512 v2 (16 bit, early exit) 6.199 6.180 0.5 ███████████████████▎
mask has no more than 24 bits
pdep scalar 2.956 2.935 1.0 ████████████████████████████████████████
pdep AVX2 (24 bit) 9.169 9.155 0.3 ████████████▊
pdep AVX2 (24 bit, early exit) 13.622 13.597 0.2 ████████▋
pdep AVX512 (24 bit) 8.199 8.159 0.4 ██████████████▍
pdep AVX512 (24 bit, early exit) 11.098 11.069 0.3 ██████████▌
pdep AVX512 v2 (24 bit) 6.560 6.515 0.5 ██████████████████
pdep AVX512 v2 (24 bit, early exit) 9.406 9.383 0.3 ████████████▌

Source code

Sample implementation is available at GitHub.