Author: Wojciech Muła
Added on: 2019-01-29
We want to count how many times a given byte occurs in a byte stream; here is a C program that does this:
```c
#include <stdint.h>
#include <stddef.h>

size_t countbyte(uint8_t* data, size_t size, uint8_t b) {
    size_t result = 0;
    for (size_t i=0; i < size; i++)
        result += (data[i] == b);

    return result;
}
```
The current GCC vectorization algorithm is able to handle the presented procedure, but its output is not optimal. For an AVX2 target, GCC keeps four 64-bit sub-counters that are updated in every iteration and then added together at the end.
In a single iteration 32 bytes are loaded and then compared with a vector filled with the given byte:
```
vpcmpeqb (%rax), %ymm7, %ymm0
```
The result of this operation is a vector of bytes equal to either 0xff or 0x00. The 0xff bytes are then converted into ones by a bit-and operation:
```
vpand %ymm6, %ymm0, %ymm0    ; ymm6 = _mm256_set1_epi8(0x01)
```
The last step of the algorithm is widening these 8-bit numbers to 64-bit values and updating the mentioned counters. The conversion is done gradually: first from 8-bit numbers to 16-bit ones, then from 16-bit to 32-bit, and finally from 32-bit to 64-bit values. This conversion must be done for each 4-byte subarray of the input register, which means the following code is repeated eight times:
```
vpmovzxbw %xmm0, %ymm1        ; 8-bit  -> 16-bit numbers
vpmovzxwd %xmm1, %ymm4        ; 16-bit -> 32-bit numbers
vpmovzxdq %xmm4, %ymm2        ; 32-bit -> 64-bit numbers
vpaddq    %ymm4, %ymm2, %ymm2 ; update the sub-counters
```
Below is the full disassembly from GCC 9 (snapshot from 2019-01-17, options: -O3 -march=cannonlake).
```
vpmovzxbw    %xmm0, %ymm1
vpmovzxwd    %xmm1, %ymm4
vpmovzxdq    %xmm4, %ymm2
vextracti128 $0x1, %ymm1, %xmm1
vextracti128 $0x1, %ymm4, %xmm4
vpmovzxwd    %xmm1, %ymm1
vpmovzxdq    %xmm4, %ymm4
vpaddq       %ymm4, %ymm2, %ymm2
vextracti128 $0x1, %ymm0, %xmm0
vpmovzxdq    %xmm1, %ymm4
vextracti128 $0x1, %ymm1, %xmm1
vpmovzxbw    %xmm0, %ymm0
vpmovzxdq    %xmm1, %ymm1
vpmovzxwd    %xmm0, %ymm3
vpaddq       %ymm1, %ymm4, %ymm1
vpaddq       %ymm1, %ymm2, %ymm1
vextracti128 $0x1, %ymm0, %xmm0
vpmovzxdq    %xmm3, %ymm2
vextracti128 $0x1, %ymm3, %xmm3
vpmovzxwd    %xmm0, %ymm0
vpmovzxdq    %xmm3, %ymm3
vpaddq       %ymm3, %ymm2, %ymm3
vpmovzxdq    %xmm0, %ymm2
vextracti128 $0x1, %ymm0, %xmm0
vpmovzxdq    %xmm0, %ymm0
vpaddq       %ymm3, %ymm1, %ymm1
vpaddq       %ymm0, %ymm2, %ymm0
vpaddq       %ymm0, %ymm1, %ymm0
vpaddq       %ymm0, %ymm5, %ymm5
```
The byte-wise comparison in both AVX2 and SSE yields a vector of 0 or -1 bytes. Accumulation can be done in two steps:

1. update a vector of 8-bit counters by subtracting the comparison result (subtracting -1 adds 1);
2. periodically, before the 8-bit counters can overflow, add these counters to a vector of 64-bit accumulators.
The second step can use the instruction PSADBW (`_mm256_sad_epu8`), which calculates sums of absolute differences between bytes. When one of its arguments is all zeros, the instruction effectively calculates sums of bytes of the other argument. This is the fastest option according to my experiments described in SIMDized sum of all bytes in the array.
Below is the main loop of the AVX2 implementation:
```c
while (ptr + 255*32 < end) {
    local_sum = _mm256_setzero_si256();

    // update 32 x 8-bit counters; 255 iterations at most,
    // as the counters would wrap around after 256 increments
    for (int i=0; i < 255; i++, ptr += 32) {
        const __m256i in = _mm256_loadu_si256((const __m256i*)ptr);
        const __m256i eq = _mm256_cmpeq_epi8(in, v); // 0 or -1

        local_sum = _mm256_sub_epi8(local_sum, eq);
    }

    // update the global accumulator 4 x 64-bit
    const __m256i tmp = _mm256_sad_epu8(local_sum, _mm256_setzero_si256());
    global_sum = _mm256_add_epi64(global_sum, tmp);
}
```
Since in this problem we only want to tell whether a given byte in a register equals the searched value, we might consider encoding this information as a bit-set, and then, once we have a number, counting the ones in it using a population-count instruction.

The obvious solution requires just three instructions:
```c
// compare bytes -- get a byte-mask
const __m256i  eq   = _mm256_cmpeq_epi8(input, _mm256_set1_epi8(byte));

// convert the byte-mask into a bit-mask with VPMOVMSKB
const uint32_t mask = _mm256_movemask_epi8(eq);

// count bits
result += __builtin_popcountll(mask);
```
However, I don't think it's the best solution. A better way is to calculate eight byte-masks, merge them into a 256-bit (AVX2) or 128-bit (SSE) set and then count the bits. Unfortunately, neither SSE nor AVX2 has a bit-level blend, so we have to resort to basic bit operations.
Below is the main loop:
```c
const __m256i eq0 = _mm256_cmpeq_epi8(v, _mm256_loadu_si256((const __m256i*)(ptr + 0*32)));
const __m256i eq1 = _mm256_cmpeq_epi8(v, _mm256_loadu_si256((const __m256i*)(ptr + 1*32)));
const __m256i eq2 = _mm256_cmpeq_epi8(v, _mm256_loadu_si256((const __m256i*)(ptr + 2*32)));
const __m256i eq3 = _mm256_cmpeq_epi8(v, _mm256_loadu_si256((const __m256i*)(ptr + 3*32)));
const __m256i eq4 = _mm256_cmpeq_epi8(v, _mm256_loadu_si256((const __m256i*)(ptr + 4*32)));
const __m256i eq5 = _mm256_cmpeq_epi8(v, _mm256_loadu_si256((const __m256i*)(ptr + 5*32)));
const __m256i eq6 = _mm256_cmpeq_epi8(v, _mm256_loadu_si256((const __m256i*)(ptr + 6*32)));
const __m256i eq7 = _mm256_cmpeq_epi8(v, _mm256_loadu_si256((const __m256i*)(ptr + 7*32)));

const __m256i eq0bit = _mm256_and_si256(eq0, _mm256_set1_epi8(0x01));
const __m256i eq1bit = _mm256_and_si256(eq1, _mm256_set1_epi8(0x02));
const __m256i eq2bit = _mm256_and_si256(eq2, _mm256_set1_epi8(0x04));
const __m256i eq3bit = _mm256_and_si256(eq3, _mm256_set1_epi8(0x08));
const __m256i eq4bit = _mm256_and_si256(eq4, _mm256_set1_epi8(0x10));
const __m256i eq5bit = _mm256_and_si256(eq5, _mm256_set1_epi8(0x20));
const __m256i eq6bit = _mm256_and_si256(eq6, _mm256_set1_epi8(0x40));
const __m256i eq7bit = _mm256_and_si256(eq7, _mm256_set1_epi8(int8_t(0x80)));

const __m256i m01 = _mm256_or_si256(eq0bit, eq1bit);
const __m256i m23 = _mm256_or_si256(eq2bit, eq3bit);
const __m256i m45 = _mm256_or_si256(eq4bit, eq5bit);
const __m256i m67 = _mm256_or_si256(eq6bit, eq7bit);

const __m256i m0123 = _mm256_or_si256(m01, m23);
const __m256i m4567 = _mm256_or_si256(m45, m67);

const __m256i merged = _mm256_or_si256(m0123, m4567);

result += __builtin_popcountll(_mm256_extract_epi64(merged, 0));
result += __builtin_popcountll(_mm256_extract_epi64(merged, 1));
result += __builtin_popcountll(_mm256_extract_epi64(merged, 2));
result += __builtin_popcountll(_mm256_extract_epi64(merged, 3));
```
Since AVX512BW comparisons yield bit-masks directly, we can feed them straight into a population-count instruction. A naive implementation is shown below.
```c
while (ptr + 64 < end) {
    const __m512i in = _mm512_loadu_si512((const __m512i*)ptr);

    sum += __builtin_popcountll(_mm512_cmpeq_epi8_mask(in, v));
    ptr += 64;
}
```
procedure | description
---|---
scalar | in fact, code vectorized by GCC for either AVX2 or AVX512BW
SSE | 16 x 8-bit counters summed up every 255 iterations
SSE (popcount) | population count of 128-bit words built from eight byte-masks
AVX2 | 32 x 8-bit counters summed up every 255 iterations
AVX2 (popcount) | population count of 256-bit words built from eight byte-masks
AVX512BW | the naive procedure shown above
AVX512BW (unrolled) | the above procedure unrolled four times
Each procedure was run three times; the tables below report the average and the best (minimum) measurements.
CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Compiler: g++-8 (Ubuntu 8.1.0-5ubuntu1~16.04) 8.1.0
procedure | average [cycles/byte] | best [cycles/byte] | speed-up | |
---|---|---|---|---|
scalar | 0.613 | 0.610 | 1.0 | ████▉ |
SSE | 0.107 | 0.104 | 5.9 | ████████████████████████████▊ |
SSE (popcount) | 0.100 | 0.098 | 6.2 | ██████████████████████████████▌ |
AVX2 | 0.077 | 0.075 | 8.1 | ████████████████████████████████████████ |
AVX2 (popcount) | 0.107 | 0.104 | 5.9 | ████████████████████████████▊ |
CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz
Compiler: GCC version 8.1.0 (Ubuntu 8.1.0-5ubuntu1~16.04)
procedure | average [cycles/byte] | best [cycles/byte] | speed-up | |
---|---|---|---|---|
scalar | 0.452 | 0.446 | 1.0 | ████▍ |
SSE | 0.118 | 0.115 | 3.9 | █████████████████ |
SSE (popcount) | 0.113 | 0.111 | 4.0 | █████████████████▋ |
AVX2 | 0.094 | 0.090 | 5.0 | █████████████████████▊ |
AVX2 (popcount) | 0.111 | 0.107 | 4.2 | ██████████████████▎ |
AVX512BW | 0.072 | 0.062 | 7.2 | ███████████████████████████████▌ |
AVX512BW (unrolled) | 0.055 | 0.049 | 9.1 | ████████████████████████████████████████ |
CPU: Intel(R) Core(TM) i3-8121U CPU @ 2.20GHz
Compiler: g++ (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
procedure | average [cycles/byte] | best [cycles/byte] | speed-up | |
---|---|---|---|---|
scalar | 0.288 | 0.280 | 1.0 | ██████ |
SSE | 0.099 | 0.093 | 3.0 | ██████████████████ |
SSE (popcount) | 0.091 | 0.085 | 3.3 | ███████████████████▊ |
AVX2 | 0.073 | 0.070 | 4.0 | ████████████████████████ |
AVX2 (popcount) | 0.084 | 0.081 | 3.5 | ████████████████████▋ |
AVX512BW | 0.070 | 0.065 | 4.3 | █████████████████████████▊ |
AVX512BW (unrolled) | 0.049 | 0.042 | 6.7 | ████████████████████████████████████████ |
All implementations are available on GitHub.