SIMD parallel bits deposit/extract

Author:	Wojciech Muła
Added on:	2025-01-05

Contents

Introduction
PEXT
- Specification
- Software implementation
  - AVX2 implementation
  - AVX-512 implementation
PDEP
- Specification
- Software implementation
  - AVX2 implementation
  - AVX-512 implementation
Experiments
- Ryzen 7
- Alder Lake
- Ice Lake
- Skylake-X
Source code

Introduction

The BMI2 extension introduced two complementary instructions: parallel bits deposit (PDEP) and parallel bits extract (PEXT).

The PDEP scatters continuous set of bits to positions denoted by the mask. The PEXT does the opposite: gathers/compresses selected bits into a continuous word.

SIMD instruction sets do not directly support this kind of operations. There is GF2P8AFFINEQB in AVX-512, that allows arbitrary bit shuffling at the byte level (see Use AVX512 Galois field affine transformation for bit shuffling).

In this text we show approaches suitable for implementing PEXT and PDEP for wider element widths on any SIMD ISA.

PEXT

Specification

Pseudocode for PEXT:

N - the number of bits in a word (32 or 64)

k := 0
result := 0
for m := 0 to N do
    if mask[m] == 1 then
        result[k] = data[m]
        k := k + 1
    end
end for

       ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
mask = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 0 │ 1 │ 0 │ 1 │ 1 │ 0 │
       └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
       ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
data = │ 1 │ 0 │ 1 │ 1 │ 0 │ 1 │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 1 │ 0 │ 1 │ 1 │ 0 │
       └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
                 |       |               |   |           |       |   │
                 │       │               │   │           │       │   └───┐
                 │       │               │   │           │       └───┐   │
                 │       │               │   │           └───────┐   │   │
                 │       │               │   └───────────────┐   │   │   │
                 │       │               └───────────────┐   │   │   │   │
                 │       └───────────────────────────┐   │   │   │   │   │
                 └───────────────────────────────┐   │   │   │   │   │   │
                                                 │   │   │   │   │   │   │
                                                 ▼   ▼   ▼   ▼   ▼   ▼   ▼
       ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
pext = │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 1 │ 1 │ 1 │
       └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

Software implementation

The most basic operation we need is copying the m-th bit from data to k-th position in the result word. An important property is that the source index m is greater or equals the destination index k.

We can avoid any explicit shifting of bits. We mask the m-th bit of data: this yields either 1 << m or 0. We have the mask for destination bit, having value 1 << k. Calculating the minimum of these two values yields either 1 << k or 0. Finally, we merge that value with the result word.

The following pseudocode shows the idea:

                              // m-th bit = 1             m-th bit = 0
masked = data & (1 << m);     // masked = 1 << m          masked = 0
k_mask = 1 << k               // k_mask = 1 << k          k_mask = 1 << k
bit_k  = min(masked, k_mask); // bit_k  = 1 << k          bit_k  = 0

To isolate the first bit, we'll use a well known trick of resetting such bit:

x'  = (x - 1) & x;  // reset lowest bit set
fbs = x ^ x';       // isolated lowest bit set

And this is how this works for sample values:

                                                       the first bit set
                                                                │
                                                                ▼
              ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
x           = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 1 │ 0 │ 0 │ 0 │ 0 │
              └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
              ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
x - 1       = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 0 │ 1 │ 1 │ 1 │ 1 │
              └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
              ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
(x-1) & x   = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 0 │ 0 │ = x'
              └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

              ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
x' ^ x      = │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 0 │ 0 │ = fbs
              └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

AVX2 implementation

The following C++ template generates an implementation. The most generic one would scan all mask bits (MAX_MASK_BITS=32), but when we know the maximum number of bits set in masks, we may limit the number of iterations accordingly.

The parameter EARLY_EXIT decides whether to check in every iteration if all mask bits were scanned on not. It adds runtime overhead, and in the case we don't know how diverse masks are, this check won't help. If we know that the number of bits in different masks are similar, this check may be a win.

template <int MAX_MASK_BITS, bool EARLY_EXIT>
void avx2_pext_u32_reference(const uint32_t* data_arr, const uint32_t* mask_arr, uint32_t* out_arr, size_t n) {
    static_assert(MAX_MASK_BITS > 0);
    static_assert(MAX_MASK_BITS <= 32);

    const __m256i one  = _mm256_set1_epi32(1);
    const __m256i zero = _mm256_set1_epi32(0);
    for (size_t i=0; i < n; i += 8) {
        __m256i data = _mm256_loadu_si256((const __m256i*)(&data_arr[i]));
        __m256i mask = _mm256_loadu_si256((const __m256i*)(&mask_arr[i]));
        __m256i out  = _mm256_set1_epi32(0);

        __m256i bit = one;

        for (int j=0; j < MAX_MASK_BITS; j++) {
            // 1. isolate the first non-zero bit set of mask
            const __m256i m0 = _mm256_sub_epi32(mask, one);
            const __m256i m1 = _mm256_and_si256(mask, m0);
            const __m256i m2 = _mm256_xor_si256(mask, m1);

            // 2. isolate that bit from data word
            const __m256i d0 = _mm256_and_si256(data, m2);

            // 3. move that bit on the next position in out
            const __m256i o0 = _mm256_min_epu32(d0, bit);
            out = _mm256_or_si256(out, o0);

            // 4. reset selected bit in mask (we already done it)
            mask = m1;

            // 5. advance to the next bit in out
            bit = _mm256_add_epi32(bit, bit);

            // 6. all are zeros?
            if (EARLY_EXIT && _mm256_testc_si256(zero, mask)) {
                break;
            }
        }

        _mm256_storeu_si256((__m256i*)(&out_arr[i]), out);
    }
}

AVX-512 implementation

Solution for AVX-512 can use the above trick, but it is simpler to directly use mask registers. There is instruction VPTESTMD that sets a register mask based on bit-and of two operands. This allows us to produce element-wise masks from m-th bits.

Then conditional VPORD can be used to update the resulting word.

The following fragment from AVX-512 implementation shows these steps.

// 1. isolate the first bit set of mask
//                                                 mask = [0101_1001_1100_0000|...]
const __m512i m0 = _mm512_sub_epi32(mask, one); // m0   = [0101_1001_1011_1111|...]
const __m512i m1 = _mm512_and_si512(mask, m0);  // m1   = [0101_1001_1000_0000|...]
const __m512i m2 = _mm512_xor_si512(mask, m1);  // m2   = [0000_0000_0100_0000|...]

// 2. get m-th data bits in a mask register
const __mmask64 m = _mm512_test_epi32_mask(data, m2);

// 3. set k-th bit when m-th bit is set
out = _mm512_mask_or_epi32(out, m, out, bit);

// 4. reset selected bit in mask
mask = m1;

// 5. the next bit to set
bit = _mm512_add_epi32(bit, bit);

PDEP

Specification

Pseudocode for PDEP:

N - the number of bits in a word (32 or 64)

k := 0
result := 0
for m := 0 to N do
    if mask[m] == 1 then
        result[m] = data[k]
        k := k + 1
    end
end for

Example of PDEP invoked with a 4-bit mask and sample data.

       ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
mask = │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
       └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
       ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
data = │ 1 │ 0 │ 1 │ 1 │ 0 │ 1 │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 1 │ 1 │ 0 │ 1 │ 0 │
       └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
                                                             │   |   |   |
             ┌───────────────────────────────────────────────┘   │   │   │
             │               ┌───────────────────────────────────┘   │   │
             │               │               ┌───────────────────────┘   │
             │               │               │   ┌───────────────────────┘
             │               │               │   │
             ▼               ▼               ▼   ▼
       ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
pdep = │ 0 │ 1 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
       └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

Software implementation

In the case of PDEP we need to copy the k-th from data to the m-th bit of destination. Similarly to PEXT, in this case relation m ≥ k is hold.

It is also possible to avoid any bit shifting, but the method is not as elegant as for PEXT.

We start from isolating m-th bit from the mask

                                                            m-th bit
                                                                │
                                                                ▼
              ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
mask        = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 1 │ 0 │ 0 │ 0 │ 0 │
              └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
              ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
mask - 1    = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 0 │ 1 │ 1 │ 1 │ 1 │
              └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

In the word mask - 1 we have a series of ones before the m. Since k ≤ m, we may now add the isolated k-th bit from data. If that bit it one, it will cause the series of ones to carry up the bit again to m-th position.

                                                            m-th bit
                                                                │
                                                                ▼
              ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
mask - 1    = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 0 │ 1 │ 1 │ 1 │ 1 │
              └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
              ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
data_k      = │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 1 │ 0 │
              └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
                                                                            ▲
                                                                            │
                                                                         k-th bit

              ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
sum         = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 1 │ 0 │ 0 │ 0 │ 1 │
              └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

Than we need isolate the m-th bit, mask that bit in sum and use it to update the resulting word.

m1 = mask - 1
m2 = mask & m1
m3 = mask ^ m2

data_k = data & (1 << k)
sum    = m1 + data_k
data_m = m3 & sum

result = result | data_m

AVX2 implementation

While the shift-less method is feasible, we found an easier way to update the result word. We use VPCMPEQD that fills a 32-bit word with zeros or ones, depending on comparison result. We isolate the k-th bit of data and compare it with zero.

This yield a negated mask, which we use to filter out m-th bits.

The following snippet from AVX2 implementation shows the order of operations.

// 1. isolate the first non-zoro bit set of mask (at m)
//                                                 mask = [0101_1001_1100_0000|0000_1110_1100_1000|...]
const __m256i m0 = _mm256_sub_epi32(mask, one); // m0   = [0101_1001_1011_1111|0000_1110_1100_0111|...]
const __m256i m1 = _mm256_and_si256(mask, m0);  // m1   = [0101_1001_1000_0000|0000_1110_1100_0000|...]
const __m256i m2 = _mm256_xor_si256(mask, m1);  // m2   = [0000_0000_0100_0000|0000_0000_0000_1000|...]

// 2. isolate k-th bit from data                   data = [1100_0000_1111_1110|0000_0000_1000_0000|...]
//                                                  bit = [0000_0000_0001_0000]0000_0000_0001_0000|...]
const __m256i d0 = _mm256_and_si256(data, bit); //   d0 = [0000_0000_0001_0000|0000_0000_0000_0000|...]
                                                //                      ^                   ^
// 4. fill word with *negation* of data bit
const __m256i d1 = _mm256_cmpeq_epi32(d0, zero);//   d1 = [0000_0000_0000_0000|1111_1111_1111_1111|...]

// 5. keep the mask bit, iff data[k] == 1
const __m256i m3 = _mm256_andnot_si256(d1, m2); //   m3 = [0000_0000_0100_0000|0000_0000_0000_0000|...]

// 6. update the out
out = _mm256_or_si256(out, m3);
mask = m1;

// 7. the next bit in data to check
bit = _mm256_add_epi32(bit, bit);

AVX-512 implementation

AVX-512 code uses the same approach as PEXT implementation. We transfer k-th bits from data word to a mask register and then use masked VPTERNLOGD to conditionally bit-or the m-th bits into the result words.

// 1. get k-th data bits in a mask register
const __mmask64 m = _mm512_test_epi32_mask(data, bit);

// 2. isolate the first non-zoro bit set of mask (at m)

//                                                 mask = [0101_1001_1100_0000|0000_1110_1100_1000|...]
const __m512i m0 = _mm512_sub_epi32(mask, one); // m0   = [0101_1001_1011_1111|0000_1110_1100_0111|...]
const __m512i m1 = _mm512_and_si512(mask, m0);  // m1   = [0101_1001_1000_0000|0000_1110_1100_0000|...]

// 3. out = out | (mask ^ m1)
out = _mm512_mask_ternarylogic_epi32(out, m, mask, m1, 0xf6);

// 4. the next mask value
mask = m1;

// 5. the next bit in data to update
bit = _mm512_add_epi32(bit, bit);

Experiments

The baseline code uses PDEP or PEXT invoked in a simple loop. It's a bit unrealistic solution, but can be seen it as a theoretical limit. More realistic scenario is when we have some computations done on vector registers and we have to switch to scalar code to perform PDEP or PEXT.

Summary:

The vectorized approaches have complexity proportional to the number of 1 bits in mask values.
AVX2 code is never faster than scalar code.
AVX-512 code is faster only on IceLake, when the mask values have a few bits (6 or 8 in tests).

All test programs were compiled with options -O3 -march=native on each machine running tests.

Tested PEXT procedures
Procedure	Comments
`pext scalar`	`PEXT` baseline
`pext AVX2`	AVX2 implementation for 32-bit masks
`pext AVX2 (early exit)`	AVX2 implementation for 32-bit masks, with breaking the inner loop when all masks become zero
`pext AVX2 (6 bit)`	AVX2 implementation for 6-bit masks
`pext AVX2 (6-bit, early exit)`	AVX2 implementation for 6-bit masks, with breaking the inner loop when all masks become zero
`pext AVX2 (8 bit)`	AVX2 implementation for 8-bit masks
`pext AVX2 (8-bit, early exit)`	AVX2 implementation for 8-bit masks, with breaking the inner loop when all masks become zero
`pext AVX2 (16 bit)`	AVX2 implementation for 16-bit masks
`pext AVX2 (16-bit, early exit)`	AVX2 implementation for 16-bit masks, with breaking the inner loop when all masks become zero
`pext AVX2 (24 bit)`	AVX2 implementation for 24-bit masks
`pext AVX2 (24-bit, early exit)`	AVX2 implementation for 24-bit masks, with breaking the inner loop when all masks become zero
`pext AVX-512`	AVX-512 implementation for 32-bit masks
`pext AVX-512 (early exit)`	AVX-512 implementation for 32-bit masks, with breaking the inner loop when all masks become zero
`pext AVX-512 (6 bit)`	AVX-512 implementation for 6-bit masks
`pext AVX-512 (6-bit, early exit)`	AVX-512 implementation for 6-bit masks, with breaking the inner loop when all masks become zero
`pext AVX-512 (8 bit)`	AVX-512 implementation for 8-bit masks
`pext AVX-512 (8-bit, early exit)`	AVX-512 implementation for 8-bit masks, with breaking the inner loop when all masks become zero
`pext AVX-512 (16 bit)`	AVX-512 implementation for 16-bit masks
`pext AVX-512 (16-bit, early exit)`	AVX-512 implementation for 16-bit masks, with breaking the inner loop when all masks become zero
`pext AVX-512 (24 bit)`	AVX-512 implementation for 24-bit masks
`pext AVX-512 (24-bit, early exit)`	AVX-512 implementation for 24-bit masks, with breaking the inner loop when all masks become zero

Tested PDEP procedures
Procedure	Comments
`pdep scalar`	`PDEP` baseline
`pdep AVX2`	AVX2 implementation for 32-bit masks
`pdep AVX2 (early exit)`	AVX2 implementation for 32-bit masks, with breaking the inner loop when all masks become zero
`pdep AVX2 (6 bit)`	AVX2 implementation for 6-bit masks
`pdep AVX2 (6-bit, early exit)`	AVX2 implementation for 6-bit masks, with breaking the inner loop when all masks become zero
`pdep AVX2 (8 bit)`	AVX2 implementation for 8-bit masks
`pdep AVX2 (8-bit, early exit)`	AVX2 implementation for 8-bit masks, with breaking the inner loop when all masks become zero
`pdep AVX2 (16 bit)`	AVX2 implementation for 16-bit masks
`pdep AVX2 (16-bit, early exit)`	AVX2 implementation for 16-bit masks, with breaking the inner loop when all masks become zero
`pdep AVX2 (24 bit)`	AVX2 implementation for 24-bit masks
`pdep AVX2 (24-bit, early exit)`	AVX2 implementation for 24-bit masks, with breaking the inner loop when all masks become zero
`pdep AVX-512`	AVX-512 implementation for 32-bit masks
`pdep AVX-512 (early exit)`	AVX-512 implementation for 32-bit masks, with breaking the inner loop when all masks become zero
`pdep AVX-512 (6 bit)`	AVX-512 implementation for 6-bit masks
`pdep AVX-512 (6-bit, early exit)`	AVX-512 implementation for 6-bit masks, with breaking the inner loop when all masks become zero
`pdep AVX-512 (8 bit)`	AVX-512 implementation for 8-bit masks
`pdep AVX-512 (8-bit, early exit)`	AVX-512 implementation for 8-bit masks, with breaking the inner loop when all masks become zero
`pdep AVX-512 (16 bit)`	AVX-512 implementation for 16-bit masks
`pdep AVX-512 (16-bit, early exit)`	AVX-512 implementation for 16-bit masks, with breaking the inner loop when all masks become zero
`pdep AVX-512 (24 bit)`	AVX-512 implementation for 24-bit masks
`pdep AVX-512 (24-bit, early exit)`	AVX-512 implementation for 24-bit masks, with breaking the inner loop when all masks become zero

Ryzen 7

Compiler: gcc (Debian 14.1.0-5) 14.1.0
CPU: AMD Ryzen 7 7730U with Radeon Graphics

procedure	time in cycles per byte		speed-up
	average	best
any 32-bit mask
pext scalar	1.012	0.915	1.0	`████████████████████████████████████████`
pext AVX2	4.549	4.536	0.2	`████████`
pext AVX2 (early exit)	6.174	6.133	0.1	`█████▉`
mask has no more than 6 bits
pext scalar	0.957	0.918	1.0	`████████████████████████████████████▎`
pext AVX2 (6 bit)	0.862	0.834	1.1	`████████████████████████████████████████`
pext AVX2 (6 bit, early exit)	0.923	0.896	1.0	`█████████████████████████████████████▏`
mask has no more than 8 bits
pext scalar	0.938	0.909	1.0	`███████████████████████████████████████▍`
pext AVX2 (8 bit)	0.914	0.896	1.0	`████████████████████████████████████████`
pext AVX2 (8 bit, early exit)	1.207	1.195	0.8	`█████████████████████████████▉`
mask has no more than 16 bits
pext scalar	0.938	0.909	1.0	`████████████████████████████████████████`
pext AVX2 (16 bit)	2.022	2.005	0.5	`██████████████████▏`
pext AVX2 (16 bit, early exit)	4.169	4.160	0.2	`████████▋`
mask has no more than 24 bits
pext scalar	0.939	0.911	1.0	`████████████████████████████████████████`
pext AVX2 (24 bit)	3.314	3.297	0.3	`███████████`
pext AVX2 (24 bit, early exit)	5.322	5.310	0.2	`██████▊`

procedure	time in cycles per byte		speed-up
	average	best
any 32-bit mask
pdep scalar	0.924	0.881	1.0	`████████████████████████████████████████`
pdep AVX2	4.409	4.398	0.2	`████████`
pdep AVX2 (early exit)	6.230	6.206	0.1	`█████▋`
mask has no more than 6 bits
pdep scalar	0.906	0.878	1.0	`████████████████████████████████████▎`
pdep AVX2 (6 bit)	0.813	0.798	1.1	`████████████████████████████████████████`
pdep AVX2 (6 bit, early exit)	0.931	0.918	1.0	`██████████████████████████████████▊`
mask has no more than 8 bits
pdep scalar	0.905	0.876	1.0	`████████████████████████████████████████`
pdep AVX2 (8 bit)	0.940	0.928	0.9	`█████████████████████████████████████▊`
pdep AVX2 (8 bit, early exit)	1.251	1.240	0.7	`████████████████████████████▎`
mask has no more than 16 bits
pdep scalar	0.897	0.878	1.0	`████████████████████████████████████████`
pdep AVX2 (16 bit)	2.070	2.055	0.4	`█████████████████`
pdep AVX2 (16 bit, early exit)	4.254	4.242	0.2	`████████▎`
mask has no more than 24 bits
pdep scalar	0.902	0.876	1.0	`████████████████████████████████████████`
pdep AVX2 (24 bit)	3.260	3.244	0.3	`██████████▊`
pdep AVX2 (24 bit, early exit)	5.397	5.385	0.2	`██████▌`

Alder Lake

Compiler: gcc (Debian 13.2.0-25) 13.2.0
CPU: 12th Gen Intel(R) Core(TM) i7-1255U

procedure	time in cycles per byte		speed-up
	average	best
any 32-bit mask
pext scalar	1.527	1.286	1.0	`████████████████████████████████████████`
pext AVX2	7.556	6.846	0.2	`███████▌`
pext AVX2 (early exit)	8.778	8.001	0.2	`██████▍`
mask has no more than 6 bits
pext scalar	1.471	1.264	1.0	`████████████████████████████████████████`
pext AVX2 (6 bit)	1.482	1.289	1.0	`███████████████████████████████████████▏`
pext AVX2 (6 bit, early exit)	1.673	1.477	0.9	`██████████████████████████████████▏`
mask has no more than 8 bits
pext scalar	1.480	1.273	1.0	`████████████████████████████████████████`
pext AVX2 (8 bit)	1.654	1.446	0.9	`███████████████████████████████████▏`
pext AVX2 (8 bit, early exit)	2.233	2.006	0.6	`█████████████████████████▍`
mask has no more than 16 bits
pext scalar	1.464	1.268	1.0	`████████████████████████████████████████`
pext AVX2 (16 bit)	3.428	3.042	0.4	`████████████████▋`
pext AVX2 (16 bit, early exit)	6.045	5.461	0.2	`█████████▎`
mask has no more than 24 bits
pext scalar	1.491	1.257	1.0	`████████████████████████████████████████`
pext AVX2 (24 bit)	5.622	5.025	0.3	`██████████`
pext AVX2 (24 bit, early exit)	7.925	7.078	0.2	`███████`

procedure	time in cycles per byte		speed-up
	average	best
any 32-bit mask
pdep scalar	1.512	1.276	1.0	`████████████████████████████████████████`
pdep AVX2	7.857	7.091	0.2	`███████▏`
pdep AVX2 (early exit)	8.317	7.452	0.2	`██████▊`
mask has no more than 6 bits
pdep scalar	1.479	1.246	1.0	`████████████████████████████████████████`
pdep AVX2 (6 bit)	1.472	1.318	0.9	`█████████████████████████████████████▊`
pdep AVX2 (6 bit, early exit)	1.699	1.498	0.8	`█████████████████████████████████▎`
mask has no more than 8 bits
pdep scalar	1.441	1.250	1.0	`████████████████████████████████████████`
pdep AVX2 (8 bit)	1.768	1.517	0.8	`████████████████████████████████▉`
pdep AVX2 (8 bit, early exit)	2.282	2.038	0.6	`████████████████████████▌`
mask has no more than 16 bits
pdep scalar	1.476	1.270	1.0	`████████████████████████████████████████`
pdep AVX2 (16 bit)	3.472	3.130	0.4	`████████████████▏`
pdep AVX2 (16 bit, early exit)	6.062	5.497	0.2	`█████████▏`
mask has no more than 24 bits
pdep scalar	1.484	1.282	1.0	`████████████████████████████████████████`
pdep AVX2 (24 bit)	5.795	5.244	0.2	`█████████▊`
pdep AVX2 (24 bit, early exit)	7.715	6.835	0.2	`███████▌`

Ice Lake

Compiler: gcc (GCC) 13.3.1 20240611 (Red Hat 13.3.1-2)
CPU: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz

procedure	time in cycles per byte		speed-up
	average	best
any 32-bit mask
pext scalar	3.105	2.968	1.0	`████████████████████████████████████████`
pext AVX2	18.473	14.244	0.2	`████████▎`
pext AVX2 (early exit)	13.330	12.203	0.2	`█████████▋`
pext AVX512	9.780	9.657	0.3	`████████████▎`
pext AVX512 (early exit)	9.098	8.885	0.3	`█████████████▎`
pext AVX512 v2	9.735	9.610	0.3	`████████████▎`
pext AVX512 v2 (early exit)	8.421	8.171	0.4	`██████████████▌`
mask has no more than 6 bits
pext scalar	3.160	3.050	1.0	`███████████████▌`
pext AVX2 (6 bit)	2.079	1.972	1.5	`████████████████████████▏`
pext AVX2 (6 bit, early exit)	2.832	2.700	1.1	`█████████████████▋`
pext AVX512 (6 bit)	1.360	1.259	2.4	`█████████████████████████████████████▊`
pext AVX512 (6 bit, early exit)	1.755	1.600	1.9	`█████████████████████████████▊`
pext AVX512 v2 (6 bit)	1.282	1.191	2.6	`████████████████████████████████████████`
pext AVX512 v2 (6 bit, early exit)	1.647	1.510	2.0	`███████████████████████████████▌`
mask has no more than 8 bits
pext scalar	3.146	3.022	1.0	`██████████████████▉`
pext AVX2 (8 bit)	2.740	2.610	1.2	`█████████████████████▉`
pext AVX2 (8 bit, early exit)	3.742	3.552	0.9	`████████████████▏`
pext AVX512 (8 bit)	1.695	1.541	2.0	`█████████████████████████████████████▏`
pext AVX512 (8 bit, early exit)	2.294	2.132	1.4	`██████████████████████████▉`
pext AVX512 v2 (8 bit)	1.559	1.433	2.1	`████████████████████████████████████████`
pext AVX512 v2 (8 bit, early exit)	2.201	2.026	1.5	`████████████████████████████▎`
mask has no more than 16 bits
pext scalar	3.084	2.983	1.0	`████████████████████████████████████▊`
pext AVX2 (16 bit)	5.406	5.146	0.6	`█████████████████████▎`
pext AVX2 (16 bit, early exit)	8.798	8.562	0.3	`████████████▊`
pext AVX512 (16 bit)	3.192	2.936	1.0	`█████████████████████████████████████▍`
pext AVX512 (16 bit, early exit)	5.321	5.113	0.6	`█████████████████████▍`
pext AVX512 v2 (16 bit)	2.983	2.746	1.1	`████████████████████████████████████████`
pext AVX512 v2 (16 bit, early exit)	4.665	4.575	0.7	`████████████████████████`
mask has no more than 24 bits
pext scalar	3.082	2.992	1.0	`████████████████████████████████████████`
pext AVX2 (24 bit)	12.732	11.147	0.3	`██████████▋`
pext AVX2 (24 bit, early exit)	13.043	12.588	0.2	`█████████▌`
pext AVX512 (24 bit)	7.410	7.290	0.4	`████████████████▍`
pext AVX512 (24 bit, early exit)	8.349	8.149	0.4	`██████████████▋`
pext AVX512 v2 (24 bit)	7.384	7.269	0.4	`████████████████▍`
pext AVX512 v2 (24 bit, early exit)	7.397	7.121	0.4	`████████████████▊`

procedure	time in cycles per byte		speed-up
	average	best
any 32-bit mask
pdep scalar	3.117	2.969	1.0	`████████████████████████████████████████`
pdep AVX2	19.314	19.046	0.2	`██████▏`
pdep AVX2 (early exit)	16.449	16.272	0.2	`███████▎`
pdep AVX512	10.165	9.934	0.3	`███████████▉`
pdep AVX512 (early exit)	8.663	8.252	0.4	`██████████████▍`
pdep AVX512 v2	9.401	8.432	0.4	`██████████████`
pdep AVX512 v2 (early exit)	8.040	7.691	0.4	`███████████████▍`
mask has no more than 6 bits
pdep scalar	2.839	2.705	1.0	`████████████████▋`
pdep AVX2 (6 bit)	2.153	2.006	1.3	`██████████████████████▌`
pdep AVX2 (6 bit, early exit)	2.860	2.678	1.0	`████████████████▉`
pdep AVX512 (6 bit)	1.620	1.388	1.9	`████████████████████████████████▌`
pdep AVX512 (6 bit, early exit)	1.852	1.720	1.6	`██████████████████████████▎`
pdep AVX512 v2 (6 bit)	1.189	1.132	2.4	`████████████████████████████████████████`
pdep AVX512 v2 (6 bit, early exit)	1.510	1.422	1.9	`███████████████████████████████▊`
mask has no more than 8 bits
pdep scalar	3.128	3.035	1.0	`████████████████▌`
pdep AVX2 (8 bit)	3.115	2.948	1.0	`█████████████████`
pdep AVX2 (8 bit, early exit)	4.122	3.870	0.8	`████████████▉`
pdep AVX512 (8 bit)	1.769	1.678	1.8	`█████████████████████████████▉`
pdep AVX512 (8 bit, early exit)	2.220	2.159	1.4	`███████████████████████▏`
pdep AVX512 v2 (8 bit)	1.295	1.254	2.4	`████████████████████████████████████████`
pdep AVX512 v2 (8 bit, early exit)	1.822	1.695	1.8	`█████████████████████████████▌`
mask has no more than 16 bits
pdep scalar	3.157	3.042	1.0	`██████████████████████████████▊`
pdep AVX2 (16 bit)	6.010	5.802	0.5	`████████████████▏`
pdep AVX2 (16 bit, early exit)	9.223	8.982	0.3	`██████████▍`
pdep AVX512 (16 bit)	3.616	3.364	0.9	`███████████████████████████▊`
pdep AVX512 (16 bit, early exit)	5.407	5.180	0.6	`██████████████████`
pdep AVX512 v2 (16 bit)	2.452	2.343	1.3	`████████████████████████████████████████`
pdep AVX512 v2 (16 bit, early exit)	4.449	4.290	0.7	`█████████████████████▊`
mask has no more than 24 bits
pdep scalar	3.131	3.042	1.0	`████████████████████████████████████████`
pdep AVX2 (24 bit)	13.123	10.766	0.3	`███████████▎`
pdep AVX2 (24 bit, early exit)	15.508	12.937	0.2	`█████████▍`
pdep AVX512 (24 bit)	7.661	7.438	0.4	`████████████████▎`
pdep AVX512 (24 bit, early exit)	8.637	8.416	0.4	`██████████████▍`
pdep AVX512 v2 (24 bit)	7.110	6.954	0.4	`█████████████████▍`
pdep AVX512 v2 (24 bit, early exit)	6.173	5.914	0.5	`████████████████████▌`

Skylake-X

Compiler: gcc (GCC) 11.2.0
CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz

procedure	time in cycles per byte		speed-up
	average	best
any 32-bit mask
pext scalar	2.943	2.894	1.0	`████████████████████████████████████████`
pext AVX2	13.043	13.019	0.2	`████████▉`
pext AVX2 (early exit)	12.986	12.952	0.2	`████████▉`
pext AVX512	10.296	10.267	0.3	`███████████▎`
pext AVX512 (early exit)	11.788	11.756	0.2	`█████████▊`
pext AVX512 v2	10.263	10.232	0.3	`███████████▎`
pext AVX512 v2 (early exit)	10.896	10.853	0.3	`██████████▋`
mask has no more than 6 bits
pext scalar	2.986	2.958	1.0	`███████████████████████████████████████▏`
pext AVX2 (6 bit)	2.910	2.894	1.0	`████████████████████████████████████████`
pext AVX2 (6 bit, early exit)	3.143	3.103	1.0	`█████████████████████████████████████▎`
pext AVX512 (6 bit)	3.009	2.979	1.0	`██████████████████████████████████████▊`
pext AVX512 (6 bit, early exit)	3.339	3.223	0.9	`███████████████████████████████████▉`
pext AVX512 v2 (6 bit)	3.109	2.963	1.0	`███████████████████████████████████████`
pext AVX512 v2 (6 bit, early exit)	3.172	3.124	0.9	`█████████████████████████████████████`
mask has no more than 8 bits
pext scalar	2.980	2.957	1.0	`████████████████████████████████████████`
pext AVX2 (8 bit)	3.254	3.224	0.9	`████████████████████████████████████▋`
pext AVX2 (8 bit, early exit)	3.821	3.791	0.8	`███████████████████████████████▏`
pext AVX512 (8 bit)	3.311	3.234	0.9	`████████████████████████████████████▌`
pext AVX512 (8 bit, early exit)	3.852	3.799	0.8	`███████████████████████████████▏`
pext AVX512 v2 (8 bit)	3.159	3.122	0.9	`█████████████████████████████████████▉`
pext AVX512 v2 (8 bit, early exit)	3.644	3.616	0.8	`████████████████████████████████▋`
mask has no more than 16 bits
pext scalar	2.974	2.948	1.0	`████████████████████████████████████████`
pext AVX2 (16 bit)	5.719	5.697	0.5	`████████████████████▋`
pext AVX2 (16 bit, early exit)	9.145	9.118	0.3	`████████████▉`
pext AVX512 (16 bit)	4.774	4.747	0.6	`████████████████████████▊`
pext AVX512 (16 bit, early exit)	7.565	7.541	0.4	`███████████████▋`
pext AVX512 v2 (16 bit)	4.572	4.548	0.6	`█████████████████████████▉`
pext AVX512 v2 (16 bit, early exit)	7.293	7.262	0.4	`████████████████▏`
mask has no more than 24 bits
pext scalar	3.037	2.963	1.0	`████████████████████████████████████████`
pext AVX2 (24 bit)	9.721	9.704	0.3	`████████████▏`
pext AVX2 (24 bit, early exit)	12.069	12.038	0.2	`█████████▊`
pext AVX512 (24 bit)	7.816	7.790	0.4	`███████████████▏`
pext AVX512 (24 bit, early exit)	10.960	10.934	0.3	`██████████▊`
pext AVX512 v2 (24 bit)	7.752	7.722	0.4	`███████████████▎`
pext AVX512 v2 (24 bit, early exit)	9.947	9.915	0.3	`███████████▉`

procedure	time in cycles per byte		speed-up
	average	best
any 32-bit mask
pdep scalar	3.051	2.878	1.0	`████████████████████████████████████████`
pdep AVX2	12.760	12.741	0.2	`█████████`
pdep AVX2 (early exit)	16.604	16.584	0.2	`██████▉`
pdep AVX512	10.707	10.686	0.3	`██████████▊`
pdep AVX512 (early exit)	11.476	11.448	0.3	`██████████`
pdep AVX512 v2	8.677	8.636	0.3	`█████████████▎`
pdep AVX512 v2 (early exit)	10.570	10.543	0.3	`██████████▉`
mask has no more than 6 bits
pdep scalar	3.012	2.983	1.0	`██████████████████████████████████████`
pdep AVX2 (6 bit)	3.012	2.990	1.0	`██████████████████████████████████████`
pdep AVX2 (6 bit, early exit)	3.280	3.240	0.9	`███████████████████████████████████`
pdep AVX512 (6 bit)	3.162	3.110	1.0	`████████████████████████████████████▌`
pdep AVX512 (6 bit, early exit)	3.419	3.399	0.9	`█████████████████████████████████▍`
pdep AVX512 v2 (6 bit)	2.863	2.841	1.0	`████████████████████████████████████████`
pdep AVX512 v2 (6 bit, early exit)	3.106	3.075	1.0	`████████████████████████████████████▉`
mask has no more than 8 bits
pdep scalar	2.961	2.937	1.0	`████████████████████████████████████████`
pdep AVX2 (8 bit)	3.332	3.295	0.9	`███████████████████████████████████▋`
pdep AVX2 (8 bit, early exit)	4.052	4.020	0.7	`█████████████████████████████▏`
pdep AVX512 (8 bit)	3.431	3.394	0.9	`██████████████████████████████████▌`
pdep AVX512 (8 bit, early exit)	3.940	3.915	0.8	`██████████████████████████████`
pdep AVX512 v2 (8 bit)	2.976	2.958	1.0	`███████████████████████████████████████▋`
pdep AVX512 v2 (8 bit, early exit)	3.497	3.408	0.9	`██████████████████████████████████▍`
mask has no more than 16 bits
pdep scalar	3.019	2.981	1.0	`████████████████████████████████████████`
pdep AVX2 (16 bit)	5.505	5.481	0.5	`█████████████████████▊`
pdep AVX2 (16 bit, early exit)	8.897	8.870	0.3	`█████████████▍`
pdep AVX512 (16 bit)	4.782	4.758	0.6	`█████████████████████████`
pdep AVX512 (16 bit, early exit)	7.411	7.381	0.4	`████████████████▏`
pdep AVX512 v2 (16 bit)	3.978	3.952	0.8	`██████████████████████████████▏`
pdep AVX512 v2 (16 bit, early exit)	6.199	6.180	0.5	`███████████████████▎`
mask has no more than 24 bits
pdep scalar	2.956	2.935	1.0	`████████████████████████████████████████`
pdep AVX2 (24 bit)	9.169	9.155	0.3	`████████████▊`
pdep AVX2 (24 bit, early exit)	13.622	13.597	0.2	`████████▋`
pdep AVX512 (24 bit)	8.199	8.159	0.4	`██████████████▍`
pdep AVX512 (24 bit, early exit)	11.098	11.069	0.3	`██████████▌`
pdep AVX512 v2 (24 bit)	6.560	6.515	0.5	`██████████████████`
pdep AVX512 v2 (24 bit, early exit)	9.406	9.383	0.3	`████████████▌`

Source code

Sample implementation is available at GitHub.