Author: | Wojciech Muła |
---|---|
Added on: | 2025-01-05 |
Contents
The BMI2 extension introduced two complementary instructions: parallel bits deposit (PDEP) and parallel bits extract (PEXT).
The PDEP scatters continuous set of bits to positions denoted by the mask. The PEXT does the opposite: gathers/compresses selected bits into a continuous word.
SIMD instruction sets do not directly support this kind of operations. There is GF2P8AFFINEQB in AVX-512, that allows arbitrary bit shuffling at the byte level (see Use AVX512 Galois field affine transformation for bit shuffling).
In this text we show approaches suitable for implementing PEXT and PDEP for wider element widths on any SIMD ISA.
Pseudocode for PEXT:
N - the number of bits in a word (32 or 64) k := 0 result := 0 for m := 0 to N do if mask[m] == 1 then result[k] = data[m] k := k + 1 end end for
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐ mask = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 0 │ 1 │ 0 │ 1 │ 1 │ 0 │ └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘ ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐ data = │ 1 │ 0 │ 1 │ 1 │ 0 │ 1 │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 1 │ 0 │ 1 │ 1 │ 0 │ └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘ | | | | | | │ │ │ │ │ │ │ └───┐ │ │ │ │ │ └───┐ │ │ │ │ │ └───────┐ │ │ │ │ │ └───────────────┐ │ │ │ │ │ └───────────────┐ │ │ │ │ │ └───────────────────────────┐ │ │ │ │ │ └───────────────────────────────┐ │ │ │ │ │ │ │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐ pext = │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 1 │ 1 │ 1 │ └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
The most basic operation we need is copying the m-th bit from data to k-th position in the result word. An important property is that the source index m is greater or equals the destination index k.
We can avoid any explicit shifting of bits. We mask the m-th bit of data: this yields either 1 << m or 0. We have the mask for destination bit, having value 1 << k. Calculating the minimum of these two values yields either 1 << k or 0. Finally, we merge that value with the result word.
The following pseudocode shows the idea:
// m-th bit = 1 m-th bit = 0 masked = data & (1 << m); // masked = 1 << m masked = 0 k_mask = 1 << k // k_mask = 1 << k k_mask = 1 << k bit_k = min(masked, k_mask); // bit_k = 1 << k bit_k = 0
To isolate the first bit, we'll use a well known trick of resetting such bit:
x' = (x - 1) & x; // reset lowest bit set fbs = x ^ x'; // isolated lowest bit set
And this is how this works for sample values:
the first bit set │ ▼ ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐ x = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 1 │ 0 │ 0 │ 0 │ 0 │ └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘ ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐ x - 1 = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 0 │ 1 │ 1 │ 1 │ 1 │ └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘ ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐ (x-1) & x = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 0 │ 0 │ = x' └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘ ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐ x' ^ x = │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 0 │ 0 │ = fbs └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
The following C++ template generates an implementation. The most generic one would scan all mask bits (MAX_MASK_BITS=32), but when we know the maximum number of bits set in masks, we may limit the number of iterations accordingly.
The parameter EARLY_EXIT decides whether to check in every iteration if all mask bits were scanned on not. It adds runtime overhead, and in the case we don't know how diverse masks are, this check won't help. If we know that the number of bits in different masks are similar, this check may be a win.
template <int MAX_MASK_BITS, bool EARLY_EXIT> void avx2_pext_u32_reference(const uint32_t* data_arr, const uint32_t* mask_arr, uint32_t* out_arr, size_t n) { static_assert(MAX_MASK_BITS > 0); static_assert(MAX_MASK_BITS <= 32); const __m256i one = _mm256_set1_epi32(1); const __m256i zero = _mm256_set1_epi32(0); for (size_t i=0; i < n; i += 8) { __m256i data = _mm256_loadu_si256((const __m256i*)(&data_arr[i])); __m256i mask = _mm256_loadu_si256((const __m256i*)(&mask_arr[i])); __m256i out = _mm256_set1_epi32(0); __m256i bit = one; for (int j=0; j < MAX_MASK_BITS; j++) { // 1. isolate the first non-zero bit set of mask const __m256i m0 = _mm256_sub_epi32(mask, one); const __m256i m1 = _mm256_and_si256(mask, m0); const __m256i m2 = _mm256_xor_si256(mask, m1); // 2. isolate that bit from data word const __m256i d0 = _mm256_and_si256(data, m2); // 3. move that bit on the next position in out const __m256i o0 = _mm256_min_epu32(d0, bit); out = _mm256_or_si256(out, o0); // 4. reset selected bit in mask (we already done it) mask = m1; // 5. advance to the next bit in out bit = _mm256_add_epi32(bit, bit); // 6. all are zeros? if (EARLY_EXIT && _mm256_testc_si256(zero, mask)) { break; } } _mm256_storeu_si256((__m256i*)(&out_arr[i]), out); } }
Solution for AVX-512 can use the above trick, but it is simpler to directly use mask registers. There is instruction VPTESTMD that sets a register mask based on bit-and of two operands. This allows us to produce element-wise masks from m-th bits.
Then conditional VPORD can be used to update the resulting word.
The following fragment from AVX-512 implementation shows these steps.
// 1. isolate the first bit set of mask // mask = [0101_1001_1100_0000|...] const __m512i m0 = _mm512_sub_epi32(mask, one); // m0 = [0101_1001_1011_1111|...] const __m512i m1 = _mm512_and_si512(mask, m0); // m1 = [0101_1001_1000_0000|...] const __m512i m2 = _mm512_xor_si512(mask, m1); // m2 = [0000_0000_0100_0000|...] // 2. get m-th data bits in a mask register const __mmask64 m = _mm512_test_epi32_mask(data, m2); // 3. set k-th bit when m-th bit is set out = _mm512_mask_or_epi32(out, m, out, bit); // 4. reset selected bit in mask mask = m1; // 5. the next bit to set bit = _mm512_add_epi32(bit, bit);
Pseudocode for PDEP:
N - the number of bits in a word (32 or 64) k := 0 result := 0 for m := 0 to N do if mask[m] == 1 then result[m] = data[k] k := k + 1 end end for
Example of PDEP invoked with a 4-bit mask and sample data.
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐ mask = │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘ ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐ data = │ 1 │ 0 │ 1 │ 1 │ 0 │ 1 │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 1 │ 1 │ 0 │ 1 │ 0 │ └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘ │ | | | ┌───────────────────────────────────────────────┘ │ │ │ │ ┌───────────────────────────────────┘ │ │ │ │ ┌───────────────────────┘ │ │ │ │ ┌───────────────────────┘ │ │ │ │ ▼ ▼ ▼ ▼ ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐ pdep = │ 0 │ 1 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
In the case of PDEP we need to copy the k-th from data to the m-th bit of destination. Similarly to PEXT, in this case relation m ≥ k is hold.
It is also possible to avoid any bit shifting, but the method is not as elegant as for PEXT.
We start from isolating m-th bit from the mask
m-th bit │ ▼ ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐ mask = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 1 │ 0 │ 0 │ 0 │ 0 │ └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘ ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐ mask - 1 = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 0 │ 1 │ 1 │ 1 │ 1 │ └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
In the word mask - 1 we have a series of ones before the m. Since k ≤ m, we may now add the isolated k-th bit from data. If that bit it one, it will cause the series of ones to carry up the bit again to m-th position.
m-th bit │ ▼ ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐ mask - 1 = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 0 │ 1 │ 1 │ 1 │ 1 │ └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘ ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐ data_k = │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 1 │ 0 │ └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘ ▲ │ k-th bit ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐ sum = │ 0 │ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 1 │ 0 │ 0 │ 0 │ 1 │ └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
Than we need isolate the m-th bit, mask that bit in sum and use it to update the resulting word.
m1 = mask - 1 m2 = mask & m1 m3 = mask ^ m2 data_k = data & (1 << k) sum = m1 + data_k data_m = m3 & sum result = result | data_m
While the shift-less method is feasible, we found an easier way to update the result word. We use VPCMPEQD that fills a 32-bit word with zeros or ones, depending on comparison result. We isolate the k-th bit of data and compare it with zero.
This yield a negated mask, which we use to filter out m-th bits.
The following snippet from AVX2 implementation shows the order of operations.
// 1. isolate the first non-zoro bit set of mask (at m) // mask = [0101_1001_1100_0000|0000_1110_1100_1000|...] const __m256i m0 = _mm256_sub_epi32(mask, one); // m0 = [0101_1001_1011_1111|0000_1110_1100_0111|...] const __m256i m1 = _mm256_and_si256(mask, m0); // m1 = [0101_1001_1000_0000|0000_1110_1100_0000|...] const __m256i m2 = _mm256_xor_si256(mask, m1); // m2 = [0000_0000_0100_0000|0000_0000_0000_1000|...] // 2. isolate k-th bit from data data = [1100_0000_1111_1110|0000_0000_1000_0000|...] // bit = [0000_0000_0001_0000]0000_0000_0001_0000|...] const __m256i d0 = _mm256_and_si256(data, bit); // d0 = [0000_0000_0001_0000|0000_0000_0000_0000|...] // ^ ^ // 4. fill word with *negation* of data bit const __m256i d1 = _mm256_cmpeq_epi32(d0, zero);// d1 = [0000_0000_0000_0000|1111_1111_1111_1111|...] // 5. keep the mask bit, iff data[k] == 1 const __m256i m3 = _mm256_andnot_si256(d1, m2); // m3 = [0000_0000_0100_0000|0000_0000_0000_0000|...] // 6. update the out out = _mm256_or_si256(out, m3); mask = m1; // 7. the next bit in data to check bit = _mm256_add_epi32(bit, bit);
AVX-512 code uses the same approach as PEXT implementation. We transfer k-th bits from data word to a mask register and then use masked VPTERNLOGD to conditionally bit-or the m-th bits into the result words.
// 1. get k-th data bits in a mask register const __mmask64 m = _mm512_test_epi32_mask(data, bit); // 2. isolate the first non-zoro bit set of mask (at m) // mask = [0101_1001_1100_0000|0000_1110_1100_1000|...] const __m512i m0 = _mm512_sub_epi32(mask, one); // m0 = [0101_1001_1011_1111|0000_1110_1100_0111|...] const __m512i m1 = _mm512_and_si512(mask, m0); // m1 = [0101_1001_1000_0000|0000_1110_1100_0000|...] // 3. out = out | (mask ^ m1) out = _mm512_mask_ternarylogic_epi32(out, m, mask, m1, 0xf6); // 4. the next mask value mask = m1; // 5. the next bit in data to update bit = _mm512_add_epi32(bit, bit);
The baseline code uses PDEP or PEXT invoked in a simple loop. It's a bit unrealistic solution, but can be seen it as a theoretical limit. More realistic scenario is when we have some computations done on vector registers and we have to switch to scalar code to perform PDEP or PEXT.
Summary:
All test programs were compiled with options -O3 -march=native on each machine running tests.
Procedure | Comments |
---|---|
pext scalar | PEXT baseline |
pext AVX2 | AVX2 implementation for 32-bit masks |
pext AVX2 (early exit) | AVX2 implementation for 32-bit masks, with breaking the inner loop when all masks become zero |
pext AVX2 (6 bit) | AVX2 implementation for 6-bit masks |
pext AVX2 (6-bit, early exit) | AVX2 implementation for 6-bit masks, with breaking the inner loop when all masks become zero |
pext AVX2 (8 bit) | AVX2 implementation for 8-bit masks |
pext AVX2 (8-bit, early exit) | AVX2 implementation for 8-bit masks, with breaking the inner loop when all masks become zero |
pext AVX2 (16 bit) | AVX2 implementation for 16-bit masks |
pext AVX2 (16-bit, early exit) | AVX2 implementation for 16-bit masks, with breaking the inner loop when all masks become zero |
pext AVX2 (24 bit) | AVX2 implementation for 24-bit masks |
pext AVX2 (24-bit, early exit) | AVX2 implementation for 24-bit masks, with breaking the inner loop when all masks become zero |
pext AVX-512 | AVX-512 implementation for 32-bit masks |
pext AVX-512 (early exit) | AVX-512 implementation for 32-bit masks, with breaking the inner loop when all masks become zero |
pext AVX-512 (6 bit) | AVX-512 implementation for 6-bit masks |
pext AVX-512 (6-bit, early exit) | AVX-512 implementation for 6-bit masks, with breaking the inner loop when all masks become zero |
pext AVX-512 (8 bit) | AVX-512 implementation for 8-bit masks |
pext AVX-512 (8-bit, early exit) | AVX-512 implementation for 8-bit masks, with breaking the inner loop when all masks become zero |
pext AVX-512 (16 bit) | AVX-512 implementation for 16-bit masks |
pext AVX-512 (16-bit, early exit) | AVX-512 implementation for 16-bit masks, with breaking the inner loop when all masks become zero |
pext AVX-512 (24 bit) | AVX-512 implementation for 24-bit masks |
pext AVX-512 (24-bit, early exit) | AVX-512 implementation for 24-bit masks, with breaking the inner loop when all masks become zero |
Procedure | Comments |
---|---|
pdep scalar | PDEP baseline |
pdep AVX2 | AVX2 implementation for 32-bit masks |
pdep AVX2 (early exit) | AVX2 implementation for 32-bit masks, with breaking the inner loop when all masks become zero |
pdep AVX2 (6 bit) | AVX2 implementation for 6-bit masks |
pdep AVX2 (6-bit, early exit) | AVX2 implementation for 6-bit masks, with breaking the inner loop when all masks become zero |
pdep AVX2 (8 bit) | AVX2 implementation for 8-bit masks |
pdep AVX2 (8-bit, early exit) | AVX2 implementation for 8-bit masks, with breaking the inner loop when all masks become zero |
pdep AVX2 (16 bit) | AVX2 implementation for 16-bit masks |
pdep AVX2 (16-bit, early exit) | AVX2 implementation for 16-bit masks, with breaking the inner loop when all masks become zero |
pdep AVX2 (24 bit) | AVX2 implementation for 24-bit masks |
pdep AVX2 (24-bit, early exit) | AVX2 implementation for 24-bit masks, with breaking the inner loop when all masks become zero |
pdep AVX-512 | AVX-512 implementation for 32-bit masks |
pdep AVX-512 (early exit) | AVX-512 implementation for 32-bit masks, with breaking the inner loop when all masks become zero |
pdep AVX-512 (6 bit) | AVX-512 implementation for 6-bit masks |
pdep AVX-512 (6-bit, early exit) | AVX-512 implementation for 6-bit masks, with breaking the inner loop when all masks become zero |
pdep AVX-512 (8 bit) | AVX-512 implementation for 8-bit masks |
pdep AVX-512 (8-bit, early exit) | AVX-512 implementation for 8-bit masks, with breaking the inner loop when all masks become zero |
pdep AVX-512 (16 bit) | AVX-512 implementation for 16-bit masks |
pdep AVX-512 (16-bit, early exit) | AVX-512 implementation for 16-bit masks, with breaking the inner loop when all masks become zero |
pdep AVX-512 (24 bit) | AVX-512 implementation for 24-bit masks |
pdep AVX-512 (24-bit, early exit) | AVX-512 implementation for 24-bit masks, with breaking the inner loop when all masks become zero |
procedure | time in cycles per byte | speed-up | ||
---|---|---|---|---|
average | best | |||
any 32-bit mask | ||||
pext scalar | 1.012 | 0.915 | 1.0 | ████████████████████████████████████████ |
pext AVX2 | 4.549 | 4.536 | 0.2 | ████████ |
pext AVX2 (early exit) | 6.174 | 6.133 | 0.1 | █████▉ |
mask has no more than 6 bits | ||||
pext scalar | 0.957 | 0.918 | 1.0 | ████████████████████████████████████▎ |
pext AVX2 (6 bit) | 0.862 | 0.834 | 1.1 | ████████████████████████████████████████ |
pext AVX2 (6 bit, early exit) | 0.923 | 0.896 | 1.0 | █████████████████████████████████████▏ |
mask has no more than 8 bits | ||||
pext scalar | 0.938 | 0.909 | 1.0 | ███████████████████████████████████████▍ |
pext AVX2 (8 bit) | 0.914 | 0.896 | 1.0 | ████████████████████████████████████████ |
pext AVX2 (8 bit, early exit) | 1.207 | 1.195 | 0.8 | █████████████████████████████▉ |
mask has no more than 16 bits | ||||
pext scalar | 0.938 | 0.909 | 1.0 | ████████████████████████████████████████ |
pext AVX2 (16 bit) | 2.022 | 2.005 | 0.5 | ██████████████████▏ |
pext AVX2 (16 bit, early exit) | 4.169 | 4.160 | 0.2 | ████████▋ |
mask has no more than 24 bits | ||||
pext scalar | 0.939 | 0.911 | 1.0 | ████████████████████████████████████████ |
pext AVX2 (24 bit) | 3.314 | 3.297 | 0.3 | ███████████ |
pext AVX2 (24 bit, early exit) | 5.322 | 5.310 | 0.2 | ██████▊ |
procedure | time in cycles per byte | speed-up | ||
---|---|---|---|---|
average | best | |||
any 32-bit mask | ||||
pdep scalar | 0.924 | 0.881 | 1.0 | ████████████████████████████████████████ |
pdep AVX2 | 4.409 | 4.398 | 0.2 | ████████ |
pdep AVX2 (early exit) | 6.230 | 6.206 | 0.1 | █████▋ |
mask has no more than 6 bits | ||||
pdep scalar | 0.906 | 0.878 | 1.0 | ████████████████████████████████████▎ |
pdep AVX2 (6 bit) | 0.813 | 0.798 | 1.1 | ████████████████████████████████████████ |
pdep AVX2 (6 bit, early exit) | 0.931 | 0.918 | 1.0 | ██████████████████████████████████▊ |
mask has no more than 8 bits | ||||
pdep scalar | 0.905 | 0.876 | 1.0 | ████████████████████████████████████████ |
pdep AVX2 (8 bit) | 0.940 | 0.928 | 0.9 | █████████████████████████████████████▊ |
pdep AVX2 (8 bit, early exit) | 1.251 | 1.240 | 0.7 | ████████████████████████████▎ |
mask has no more than 16 bits | ||||
pdep scalar | 0.897 | 0.878 | 1.0 | ████████████████████████████████████████ |
pdep AVX2 (16 bit) | 2.070 | 2.055 | 0.4 | █████████████████ |
pdep AVX2 (16 bit, early exit) | 4.254 | 4.242 | 0.2 | ████████▎ |
mask has no more than 24 bits | ||||
pdep scalar | 0.902 | 0.876 | 1.0 | ████████████████████████████████████████ |
pdep AVX2 (24 bit) | 3.260 | 3.244 | 0.3 | ██████████▊ |
pdep AVX2 (24 bit, early exit) | 5.397 | 5.385 | 0.2 | ██████▌ |
procedure | time in cycles per byte | speed-up | ||
---|---|---|---|---|
average | best | |||
any 32-bit mask | ||||
pext scalar | 1.527 | 1.286 | 1.0 | ████████████████████████████████████████ |
pext AVX2 | 7.556 | 6.846 | 0.2 | ███████▌ |
pext AVX2 (early exit) | 8.778 | 8.001 | 0.2 | ██████▍ |
mask has no more than 6 bits | ||||
pext scalar | 1.471 | 1.264 | 1.0 | ████████████████████████████████████████ |
pext AVX2 (6 bit) | 1.482 | 1.289 | 1.0 | ███████████████████████████████████████▏ |
pext AVX2 (6 bit, early exit) | 1.673 | 1.477 | 0.9 | ██████████████████████████████████▏ |
mask has no more than 8 bits | ||||
pext scalar | 1.480 | 1.273 | 1.0 | ████████████████████████████████████████ |
pext AVX2 (8 bit) | 1.654 | 1.446 | 0.9 | ███████████████████████████████████▏ |
pext AVX2 (8 bit, early exit) | 2.233 | 2.006 | 0.6 | █████████████████████████▍ |
mask has no more than 16 bits | ||||
pext scalar | 1.464 | 1.268 | 1.0 | ████████████████████████████████████████ |
pext AVX2 (16 bit) | 3.428 | 3.042 | 0.4 | ████████████████▋ |
pext AVX2 (16 bit, early exit) | 6.045 | 5.461 | 0.2 | █████████▎ |
mask has no more than 24 bits | ||||
pext scalar | 1.491 | 1.257 | 1.0 | ████████████████████████████████████████ |
pext AVX2 (24 bit) | 5.622 | 5.025 | 0.3 | ██████████ |
pext AVX2 (24 bit, early exit) | 7.925 | 7.078 | 0.2 | ███████ |
procedure | time in cycles per byte | speed-up | ||
---|---|---|---|---|
average | best | |||
any 32-bit mask | ||||
pdep scalar | 1.512 | 1.276 | 1.0 | ████████████████████████████████████████ |
pdep AVX2 | 7.857 | 7.091 | 0.2 | ███████▏ |
pdep AVX2 (early exit) | 8.317 | 7.452 | 0.2 | ██████▊ |
mask has no more than 6 bits | ||||
pdep scalar | 1.479 | 1.246 | 1.0 | ████████████████████████████████████████ |
pdep AVX2 (6 bit) | 1.472 | 1.318 | 0.9 | █████████████████████████████████████▊ |
pdep AVX2 (6 bit, early exit) | 1.699 | 1.498 | 0.8 | █████████████████████████████████▎ |
mask has no more than 8 bits | ||||
pdep scalar | 1.441 | 1.250 | 1.0 | ████████████████████████████████████████ |
pdep AVX2 (8 bit) | 1.768 | 1.517 | 0.8 | ████████████████████████████████▉ |
pdep AVX2 (8 bit, early exit) | 2.282 | 2.038 | 0.6 | ████████████████████████▌ |
mask has no more than 16 bits | ||||
pdep scalar | 1.476 | 1.270 | 1.0 | ████████████████████████████████████████ |
pdep AVX2 (16 bit) | 3.472 | 3.130 | 0.4 | ████████████████▏ |
pdep AVX2 (16 bit, early exit) | 6.062 | 5.497 | 0.2 | █████████▏ |
mask has no more than 24 bits | ||||
pdep scalar | 1.484 | 1.282 | 1.0 | ████████████████████████████████████████ |
pdep AVX2 (24 bit) | 5.795 | 5.244 | 0.2 | █████████▊ |
pdep AVX2 (24 bit, early exit) | 7.715 | 6.835 | 0.2 | ███████▌ |
procedure | time in cycles per byte | speed-up | ||
---|---|---|---|---|
average | best | |||
any 32-bit mask | ||||
pext scalar | 3.105 | 2.968 | 1.0 | ████████████████████████████████████████ |
pext AVX2 | 18.473 | 14.244 | 0.2 | ████████▎ |
pext AVX2 (early exit) | 13.330 | 12.203 | 0.2 | █████████▋ |
pext AVX512 | 9.780 | 9.657 | 0.3 | ████████████▎ |
pext AVX512 (early exit) | 9.098 | 8.885 | 0.3 | █████████████▎ |
pext AVX512 v2 | 9.735 | 9.610 | 0.3 | ████████████▎ |
pext AVX512 v2 (early exit) | 8.421 | 8.171 | 0.4 | ██████████████▌ |
mask has no more than 6 bits | ||||
pext scalar | 3.160 | 3.050 | 1.0 | ███████████████▌ |
pext AVX2 (6 bit) | 2.079 | 1.972 | 1.5 | ████████████████████████▏ |
pext AVX2 (6 bit, early exit) | 2.832 | 2.700 | 1.1 | █████████████████▋ |
pext AVX512 (6 bit) | 1.360 | 1.259 | 2.4 | █████████████████████████████████████▊ |
pext AVX512 (6 bit, early exit) | 1.755 | 1.600 | 1.9 | █████████████████████████████▊ |
pext AVX512 v2 (6 bit) | 1.282 | 1.191 | 2.6 | ████████████████████████████████████████ |
pext AVX512 v2 (6 bit, early exit) | 1.647 | 1.510 | 2.0 | ███████████████████████████████▌ |
mask has no more than 8 bits | ||||
pext scalar | 3.146 | 3.022 | 1.0 | ██████████████████▉ |
pext AVX2 (8 bit) | 2.740 | 2.610 | 1.2 | █████████████████████▉ |
pext AVX2 (8 bit, early exit) | 3.742 | 3.552 | 0.9 | ████████████████▏ |
pext AVX512 (8 bit) | 1.695 | 1.541 | 2.0 | █████████████████████████████████████▏ |
pext AVX512 (8 bit, early exit) | 2.294 | 2.132 | 1.4 | ██████████████████████████▉ |
pext AVX512 v2 (8 bit) | 1.559 | 1.433 | 2.1 | ████████████████████████████████████████ |
pext AVX512 v2 (8 bit, early exit) | 2.201 | 2.026 | 1.5 | ████████████████████████████▎ |
mask has no more than 16 bits | ||||
pext scalar | 3.084 | 2.983 | 1.0 | ████████████████████████████████████▊ |
pext AVX2 (16 bit) | 5.406 | 5.146 | 0.6 | █████████████████████▎ |
pext AVX2 (16 bit, early exit) | 8.798 | 8.562 | 0.3 | ████████████▊ |
pext AVX512 (16 bit) | 3.192 | 2.936 | 1.0 | █████████████████████████████████████▍ |
pext AVX512 (16 bit, early exit) | 5.321 | 5.113 | 0.6 | █████████████████████▍ |
pext AVX512 v2 (16 bit) | 2.983 | 2.746 | 1.1 | ████████████████████████████████████████ |
pext AVX512 v2 (16 bit, early exit) | 4.665 | 4.575 | 0.7 | ████████████████████████ |
mask has no more than 24 bits | ||||
pext scalar | 3.082 | 2.992 | 1.0 | ████████████████████████████████████████ |
pext AVX2 (24 bit) | 12.732 | 11.147 | 0.3 | ██████████▋ |
pext AVX2 (24 bit, early exit) | 13.043 | 12.588 | 0.2 | █████████▌ |
pext AVX512 (24 bit) | 7.410 | 7.290 | 0.4 | ████████████████▍ |
pext AVX512 (24 bit, early exit) | 8.349 | 8.149 | 0.4 | ██████████████▋ |
pext AVX512 v2 (24 bit) | 7.384 | 7.269 | 0.4 | ████████████████▍ |
pext AVX512 v2 (24 bit, early exit) | 7.397 | 7.121 | 0.4 | ████████████████▊ |
procedure | time in cycles per byte | speed-up | ||
---|---|---|---|---|
average | best | |||
any 32-bit mask | ||||
pdep scalar | 3.117 | 2.969 | 1.0 | ████████████████████████████████████████ |
pdep AVX2 | 19.314 | 19.046 | 0.2 | ██████▏ |
pdep AVX2 (early exit) | 16.449 | 16.272 | 0.2 | ███████▎ |
pdep AVX512 | 10.165 | 9.934 | 0.3 | ███████████▉ |
pdep AVX512 (early exit) | 8.663 | 8.252 | 0.4 | ██████████████▍ |
pdep AVX512 v2 | 9.401 | 8.432 | 0.4 | ██████████████ |
pdep AVX512 v2 (early exit) | 8.040 | 7.691 | 0.4 | ███████████████▍ |
mask has no more than 6 bits | ||||
pdep scalar | 2.839 | 2.705 | 1.0 | ████████████████▋ |
pdep AVX2 (6 bit) | 2.153 | 2.006 | 1.3 | ██████████████████████▌ |
pdep AVX2 (6 bit, early exit) | 2.860 | 2.678 | 1.0 | ████████████████▉ |
pdep AVX512 (6 bit) | 1.620 | 1.388 | 1.9 | ████████████████████████████████▌ |
pdep AVX512 (6 bit, early exit) | 1.852 | 1.720 | 1.6 | ██████████████████████████▎ |
pdep AVX512 v2 (6 bit) | 1.189 | 1.132 | 2.4 | ████████████████████████████████████████ |
pdep AVX512 v2 (6 bit, early exit) | 1.510 | 1.422 | 1.9 | ███████████████████████████████▊ |
mask has no more than 8 bits | ||||
pdep scalar | 3.128 | 3.035 | 1.0 | ████████████████▌ |
pdep AVX2 (8 bit) | 3.115 | 2.948 | 1.0 | █████████████████ |
pdep AVX2 (8 bit, early exit) | 4.122 | 3.870 | 0.8 | ████████████▉ |
pdep AVX512 (8 bit) | 1.769 | 1.678 | 1.8 | █████████████████████████████▉ |
pdep AVX512 (8 bit, early exit) | 2.220 | 2.159 | 1.4 | ███████████████████████▏ |
pdep AVX512 v2 (8 bit) | 1.295 | 1.254 | 2.4 | ████████████████████████████████████████ |
pdep AVX512 v2 (8 bit, early exit) | 1.822 | 1.695 | 1.8 | █████████████████████████████▌ |
mask has no more than 16 bits | ||||
pdep scalar | 3.157 | 3.042 | 1.0 | ██████████████████████████████▊ |
pdep AVX2 (16 bit) | 6.010 | 5.802 | 0.5 | ████████████████▏ |
pdep AVX2 (16 bit, early exit) | 9.223 | 8.982 | 0.3 | ██████████▍ |
pdep AVX512 (16 bit) | 3.616 | 3.364 | 0.9 | ███████████████████████████▊ |
pdep AVX512 (16 bit, early exit) | 5.407 | 5.180 | 0.6 | ██████████████████ |
pdep AVX512 v2 (16 bit) | 2.452 | 2.343 | 1.3 | ████████████████████████████████████████ |
pdep AVX512 v2 (16 bit, early exit) | 4.449 | 4.290 | 0.7 | █████████████████████▊ |
mask has no more than 24 bits | ||||
pdep scalar | 3.131 | 3.042 | 1.0 | ████████████████████████████████████████ |
pdep AVX2 (24 bit) | 13.123 | 10.766 | 0.3 | ███████████▎ |
pdep AVX2 (24 bit, early exit) | 15.508 | 12.937 | 0.2 | █████████▍ |
pdep AVX512 (24 bit) | 7.661 | 7.438 | 0.4 | ████████████████▎ |
pdep AVX512 (24 bit, early exit) | 8.637 | 8.416 | 0.4 | ██████████████▍ |
pdep AVX512 v2 (24 bit) | 7.110 | 6.954 | 0.4 | █████████████████▍ |
pdep AVX512 v2 (24 bit, early exit) | 6.173 | 5.914 | 0.5 | ████████████████████▌ |
procedure | time in cycles per byte | speed-up | ||
---|---|---|---|---|
average | best | |||
any 32-bit mask | ||||
pext scalar | 2.943 | 2.894 | 1.0 | ████████████████████████████████████████ |
pext AVX2 | 13.043 | 13.019 | 0.2 | ████████▉ |
pext AVX2 (early exit) | 12.986 | 12.952 | 0.2 | ████████▉ |
pext AVX512 | 10.296 | 10.267 | 0.3 | ███████████▎ |
pext AVX512 (early exit) | 11.788 | 11.756 | 0.2 | █████████▊ |
pext AVX512 v2 | 10.263 | 10.232 | 0.3 | ███████████▎ |
pext AVX512 v2 (early exit) | 10.896 | 10.853 | 0.3 | ██████████▋ |
mask has no more than 6 bits | ||||
pext scalar | 2.986 | 2.958 | 1.0 | ███████████████████████████████████████▏ |
pext AVX2 (6 bit) | 2.910 | 2.894 | 1.0 | ████████████████████████████████████████ |
pext AVX2 (6 bit, early exit) | 3.143 | 3.103 | 1.0 | █████████████████████████████████████▎ |
pext AVX512 (6 bit) | 3.009 | 2.979 | 1.0 | ██████████████████████████████████████▊ |
pext AVX512 (6 bit, early exit) | 3.339 | 3.223 | 0.9 | ███████████████████████████████████▉ |
pext AVX512 v2 (6 bit) | 3.109 | 2.963 | 1.0 | ███████████████████████████████████████ |
pext AVX512 v2 (6 bit, early exit) | 3.172 | 3.124 | 0.9 | █████████████████████████████████████ |
mask has no more than 8 bits | ||||
pext scalar | 2.980 | 2.957 | 1.0 | ████████████████████████████████████████ |
pext AVX2 (8 bit) | 3.254 | 3.224 | 0.9 | ████████████████████████████████████▋ |
pext AVX2 (8 bit, early exit) | 3.821 | 3.791 | 0.8 | ███████████████████████████████▏ |
pext AVX512 (8 bit) | 3.311 | 3.234 | 0.9 | ████████████████████████████████████▌ |
pext AVX512 (8 bit, early exit) | 3.852 | 3.799 | 0.8 | ███████████████████████████████▏ |
pext AVX512 v2 (8 bit) | 3.159 | 3.122 | 0.9 | █████████████████████████████████████▉ |
pext AVX512 v2 (8 bit, early exit) | 3.644 | 3.616 | 0.8 | ████████████████████████████████▋ |
mask has no more than 16 bits | ||||
pext scalar | 2.974 | 2.948 | 1.0 | ████████████████████████████████████████ |
pext AVX2 (16 bit) | 5.719 | 5.697 | 0.5 | ████████████████████▋ |
pext AVX2 (16 bit, early exit) | 9.145 | 9.118 | 0.3 | ████████████▉ |
pext AVX512 (16 bit) | 4.774 | 4.747 | 0.6 | ████████████████████████▊ |
pext AVX512 (16 bit, early exit) | 7.565 | 7.541 | 0.4 | ███████████████▋ |
pext AVX512 v2 (16 bit) | 4.572 | 4.548 | 0.6 | █████████████████████████▉ |
pext AVX512 v2 (16 bit, early exit) | 7.293 | 7.262 | 0.4 | ████████████████▏ |
mask has no more than 24 bits | ||||
pext scalar | 3.037 | 2.963 | 1.0 | ████████████████████████████████████████ |
pext AVX2 (24 bit) | 9.721 | 9.704 | 0.3 | ████████████▏ |
pext AVX2 (24 bit, early exit) | 12.069 | 12.038 | 0.2 | █████████▊ |
pext AVX512 (24 bit) | 7.816 | 7.790 | 0.4 | ███████████████▏ |
pext AVX512 (24 bit, early exit) | 10.960 | 10.934 | 0.3 | ██████████▊ |
pext AVX512 v2 (24 bit) | 7.752 | 7.722 | 0.4 | ███████████████▎ |
pext AVX512 v2 (24 bit, early exit) | 9.947 | 9.915 | 0.3 | ███████████▉ |
procedure | time in cycles per byte | speed-up | ||
---|---|---|---|---|
average | best | |||
any 32-bit mask | ||||
pdep scalar | 3.051 | 2.878 | 1.0 | ████████████████████████████████████████ |
pdep AVX2 | 12.760 | 12.741 | 0.2 | █████████ |
pdep AVX2 (early exit) | 16.604 | 16.584 | 0.2 | ██████▉ |
pdep AVX512 | 10.707 | 10.686 | 0.3 | ██████████▊ |
pdep AVX512 (early exit) | 11.476 | 11.448 | 0.3 | ██████████ |
pdep AVX512 v2 | 8.677 | 8.636 | 0.3 | █████████████▎ |
pdep AVX512 v2 (early exit) | 10.570 | 10.543 | 0.3 | ██████████▉ |
mask has no more than 6 bits | ||||
pdep scalar | 3.012 | 2.983 | 1.0 | ██████████████████████████████████████ |
pdep AVX2 (6 bit) | 3.012 | 2.990 | 1.0 | ██████████████████████████████████████ |
pdep AVX2 (6 bit, early exit) | 3.280 | 3.240 | 0.9 | ███████████████████████████████████ |
pdep AVX512 (6 bit) | 3.162 | 3.110 | 1.0 | ████████████████████████████████████▌ |
pdep AVX512 (6 bit, early exit) | 3.419 | 3.399 | 0.9 | █████████████████████████████████▍ |
pdep AVX512 v2 (6 bit) | 2.863 | 2.841 | 1.0 | ████████████████████████████████████████ |
pdep AVX512 v2 (6 bit, early exit) | 3.106 | 3.075 | 1.0 | ████████████████████████████████████▉ |
mask has no more than 8 bits | ||||
pdep scalar | 2.961 | 2.937 | 1.0 | ████████████████████████████████████████ |
pdep AVX2 (8 bit) | 3.332 | 3.295 | 0.9 | ███████████████████████████████████▋ |
pdep AVX2 (8 bit, early exit) | 4.052 | 4.020 | 0.7 | █████████████████████████████▏ |
pdep AVX512 (8 bit) | 3.431 | 3.394 | 0.9 | ██████████████████████████████████▌ |
pdep AVX512 (8 bit, early exit) | 3.940 | 3.915 | 0.8 | ██████████████████████████████ |
pdep AVX512 v2 (8 bit) | 2.976 | 2.958 | 1.0 | ███████████████████████████████████████▋ |
pdep AVX512 v2 (8 bit, early exit) | 3.497 | 3.408 | 0.9 | ██████████████████████████████████▍ |
mask has no more than 16 bits | ||||
pdep scalar | 3.019 | 2.981 | 1.0 | ████████████████████████████████████████ |
pdep AVX2 (16 bit) | 5.505 | 5.481 | 0.5 | █████████████████████▊ |
pdep AVX2 (16 bit, early exit) | 8.897 | 8.870 | 0.3 | █████████████▍ |
pdep AVX512 (16 bit) | 4.782 | 4.758 | 0.6 | █████████████████████████ |
pdep AVX512 (16 bit, early exit) | 7.411 | 7.381 | 0.4 | ████████████████▏ |
pdep AVX512 v2 (16 bit) | 3.978 | 3.952 | 0.8 | ██████████████████████████████▏ |
pdep AVX512 v2 (16 bit, early exit) | 6.199 | 6.180 | 0.5 | ███████████████████▎ |
mask has no more than 24 bits | ||||
pdep scalar | 2.956 | 2.935 | 1.0 | ████████████████████████████████████████ |
pdep AVX2 (24 bit) | 9.169 | 9.155 | 0.3 | ████████████▊ |
pdep AVX2 (24 bit, early exit) | 13.622 | 13.597 | 0.2 | ████████▋ |
pdep AVX512 (24 bit) | 8.199 | 8.159 | 0.4 | ██████████████▍ |
pdep AVX512 (24 bit, early exit) | 11.098 | 11.069 | 0.3 | ██████████▌ |
pdep AVX512 v2 (24 bit) | 6.560 | 6.515 | 0.5 | ██████████████████ |
pdep AVX512 v2 (24 bit, early exit) | 9.406 | 9.383 | 0.3 | ████████████▌ |
Sample implementation is available at GitHub.