SIMDized counting byte in byte stream

Author:	Wojciech Muła
Added on:	2019-01-29

Contents

Introduction
- GCC vectorization
AVX2 and SSE vectorization
- 8-bit counters
- Population count
AVX512BW vectorization
Evaluation
Conclusions
Source code

Introduction

We want to count how many times given byte occurs in a byte stream; here is a C program doing this:

#include <stdint.h>
#include <stddef.h>

size_t countbyte(uint8_t* data, size_t size, uint8_t b) {
    size_t result = 0;
    for (size_t i=0; i < size; i++)
        result += (data[i] == b);

    return result;
}

GCC vectorization

The current GCC vectorization algorithm is able to handle the presented procedure, but its output is not optimal. For an AVX2 target GCC keeps four 64-bit sub-counters which are updated in every iteration and then added in the end.

In a single iteration 32 bytes are loaded and then compared with vector filled with given byte:

vpcmpeqb    (%rax), %ymm7, %ymm0

Result of this operation is vector of bytes filled with either ones (0xff) or zeros. Then 0xff are converted to ones by bit-and operation:

vpand       %ymm6, %ymm0, %ymm0 ; ymm6 = _mm256_set1_epi8(0x01)

The last step of algorithm is casting these 8-bit numbers to 64-bit value and updating the mentioned counters.

The conversion is done gradually: first from 8-bit numbers to 16-bit ones; then from 16-bit to 32-bit and finally from 32-bit to 64-bit values. This conversion must be done for each 4-byte subarray of input register. It means that following code is repeated eight times:

vpmovzxbw       %xmm0, %ymm1        ; 8-bit -> 16-bit numbers
vpmovzxwd       %xmm1, %ymm4        ; 16-bit -> 32-bit numbers
vpmovzxdq       %xmm4, %ymm2        ; 32-bit -> 64-bit numbers
vpaddq          %ymm4, %ymm2, %ymm2 ; update the sub-counters

Below is the full dissasembly from GCC 9 (snapshot from 2019-01-17, options: -O3 -march=cannonlake).

vpmovzxbw       %xmm0, %ymm1
vpmovzxwd       %xmm1, %ymm4
vpmovzxdq       %xmm4, %ymm2
vextracti128    $0x1, %ymm1, %xmm1
vextracti128    $0x1, %ymm4, %xmm4
vpmovzxwd       %xmm1, %ymm1
vpmovzxdq       %xmm4, %ymm4
vpaddq          %ymm4, %ymm2, %ymm2
vextracti128    $0x1, %ymm0, %xmm0
vpmovzxdq       %xmm1, %ymm4
vextracti128    $0x1, %ymm1, %xmm1
vpmovzxbw       %xmm0, %ymm0
vpmovzxdq       %xmm1, %ymm1
vpmovzxwd       %xmm0, %ymm3
vpaddq          %ymm1, %ymm4, %ymm1
vpaddq          %ymm1, %ymm2, %ymm1
vextracti128    $0x1, %ymm0, %xmm0
vpmovzxdq       %xmm3, %ymm2
vextracti128    $0x1, %ymm3, %xmm3
vpmovzxwd       %xmm0, %ymm0
vpmovzxdq       %xmm3, %ymm3
vpaddq          %ymm3, %ymm2, %ymm3
vpmovzxdq       %xmm0, %ymm2
vextracti128    $0x1, %ymm0, %xmm0
vpmovzxdq       %xmm0, %ymm0
vpaddq          %ymm3, %ymm1, %ymm1
vpaddq          %ymm0, %ymm2, %ymm0
vpaddq          %ymm0, %ymm1, %ymm0
vpaddq          %ymm0, %ymm5, %ymm5

AVX2 and SSE vectorization

8-bit counters

The byte-wise comparison in both AVX2 and SSE yields vector of 0 or -1. Accumulation can be done in two steps:

sum up to 255 elements in a vector register — in means we have 32 (for AVX2) or 16 sub-counters (for SSE2), where each byte will hold count in range 0 .. 255;
calculate the sum of 8-bit sub-counters in a wider type.

The second point can use instruction PSADBW (_mm256_sad_epu8), which calculates the sum of absolute values of differences between bytes. When one of arguments is full of zeros, the instruction effectively calculates sum of bytes from the second argument. This is the fastest option according to my experiments described in SIMDized sum of all bytes in the array.

Below is the main loop of AVX2 implementation:

while (ptr + 256*32 < end) {
    local_sum = _mm256_setzero_si256();

    // update 32 x 8-bit counter
    for (int i=0; i < 256; i++, ptr += 32) {
        const __m256i in = _mm256_loadu_si256((const __m256i*)ptr);
        const __m256i eq = _mm256_cmpeq_epi8(in, v); // 0 or -1

        local_sum = _mm256_sub_epi8(local_sum, eq);
    }

    // update the global accumulator 2 x 64-bit
    const __m256i tmp = _mm256_sad_epu8(local_sum, _mm256_setzero_si256());
    global_sum = _mm256_add_epi64(global_sum, tmp);
}

Population count

Since in this problem we just want to tell whether given byte in a registers is equal or not the searched value, we might consider encoding this information as a bit-set. And then, once we have a number, count ones in it, using a population count instruction.

The obvious solution would require just three instructions:

// compare bytes — get a byte-mask
const __m256i  eq   = _mm256_cmpeq_epi8(input, _mm256_set1_epi8(byte));

// convert byte mask into bit-mask with VPMOVMSKB
const uint32_t mask = _mm256_movemask_epi8(eq);

// count bits
result += __builtin_popcountll(mask);

However, I think it's not the best solution. A better way is to calculate eight byte-masks, merge them into 256 (AVX2) or 128 (SSE) bit set and then count bytes. Unfortunately, SSE nor AVX2 has a bit-level blend, so we resort to basic bit operations.

Below is the main loop:

const __m256i eq0 = _mm256_cmpeq_epi8(v, _mm256_loadu_si256((const __m256i*)(ptr + 0*16)));
const __m256i eq1 = _mm256_cmpeq_epi8(v, _mm256_loadu_si256((const __m256i*)(ptr + 1*16)));
const __m256i eq2 = _mm256_cmpeq_epi8(v, _mm256_loadu_si256((const __m256i*)(ptr + 2*16)));
const __m256i eq3 = _mm256_cmpeq_epi8(v, _mm256_loadu_si256((const __m256i*)(ptr + 3*16)));
const __m256i eq4 = _mm256_cmpeq_epi8(v, _mm256_loadu_si256((const __m256i*)(ptr + 4*16)));
const __m256i eq5 = _mm256_cmpeq_epi8(v, _mm256_loadu_si256((const __m256i*)(ptr + 5*16)));
const __m256i eq6 = _mm256_cmpeq_epi8(v, _mm256_loadu_si256((const __m256i*)(ptr + 6*16)));
const __m256i eq7 = _mm256_cmpeq_epi8(v, _mm256_loadu_si256((const __m256i*)(ptr + 7*16)));

const __m256i eq0bit = _mm256_and_si256(eq0, _mm256_set1_epi8(0x01));
const __m256i eq1bit = _mm256_and_si256(eq1, _mm256_set1_epi8(0x02));
const __m256i eq2bit = _mm256_and_si256(eq2, _mm256_set1_epi8(0x04));
const __m256i eq3bit = _mm256_and_si256(eq3, _mm256_set1_epi8(0x08));
const __m256i eq4bit = _mm256_and_si256(eq4, _mm256_set1_epi8(0x10));
const __m256i eq5bit = _mm256_and_si256(eq5, _mm256_set1_epi8(0x20));
const __m256i eq6bit = _mm256_and_si256(eq6, _mm256_set1_epi8(0x40));
const __m256i eq7bit = _mm256_and_si256(eq7, _mm256_set1_epi8(int8_t(0x80)));

const __m256i m01    = _mm256_or_si256(eq0bit, eq1bit);
const __m256i m23    = _mm256_or_si256(eq2bit, eq3bit);
const __m256i m45    = _mm256_or_si256(eq4bit, eq5bit);
const __m256i m67    = _mm256_or_si256(eq6bit, eq7bit);

const __m256i m0123  = _mm256_or_si256(m01, m23);
const __m256i m4567  = _mm256_or_si256(m45, m67);

const __m256i merged = _mm256_or_si256(m0123, m4567);

result += __builtin_popcountll(_mm256_extract_epi64(merged, 0));
result += __builtin_popcountll(_mm256_extract_epi64(merged, 1));
result += __builtin_popcountll(_mm256_extract_epi64(merged, 2));
result += __builtin_popcountll(_mm256_extract_epi64(merged, 3));

AVX512BW vectorization

Since AVX512 comparisons yield bitmasks, we can directly use population count instruction. Naive implementation is shown below.

while (ptr + 64 < end) {
    const __m512i in = _mm512_loadu_si512((const __m512i*)ptr);
    sum += __builtin_popcountll(_mm512_cmpeq_epi8_mask(in, v));

    ptr += 64;
}

Evaluation

procedure	description
scalar	in fact code vectorized by GCC for either AVX2 or AVX512BW
SSE	16 x 8-bit counter summed up every 255 iterations
SSE (popcount)	population count of 128-bit words build from eight byte-masks
AVX2	32 x 8-bit counter summed up every 255 iterations
AVX2 (popcount)	population count of 256-bit words build from eight byte-masks
AVX512BW
AVX512BW (unrolled)	the above procedure unrolled four times

Each procedure was run three times and minimum measurements were noted.

Skylake

CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

Compiler: g++-8 (Ubuntu 8.1.0-5ubuntu1~16.04) 8.1.0

procedure	time in cycles per byte		speed-up
	average	best
scalar	0.613	0.610	1.0	`████▉`
SSE	0.107	0.104	5.9	`████████████████████████████▊`
SSE (popcount)	0.100	0.098	6.2	`██████████████████████████████▌`
AVX2	0.077	0.075	8.1	`████████████████████████████████████████`
AVX2 (popcount)	0.107	0.104	5.9	`████████████████████████████▊`

Skylake-X

CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz

Compiler: GCC version 8.1.0 (Ubuntu 8.1.0-5ubuntu1~16.04)

procedure	time in cycles per byte		speed-up
	average	best
scalar	0.452	0.446	1.0	`████▍`
SSE	0.118	0.115	3.9	`█████████████████`
SSE (popcount)	0.113	0.111	4.0	`█████████████████▋`
AVX2	0.094	0.090	5.0	`█████████████████████▊`
AVX2 (popcount)	0.111	0.107	4.2	`██████████████████▎`
AVX512BW	0.072	0.062	7.2	`███████████████████████████████▌`
AVX512BW (unrolled)	0.055	0.049	9.1	`████████████████████████████████████████`

CannonLake

CPU: Intel(R) Core(TM) i3-8121U CPU @ 2.20GHz

Compiler: g++ (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)

procedure	time in cycles per byte		speed-up
	average	best
scalar	0.288	0.280	1.0	`██████`
SSE	0.099	0.093	3.0	`██████████████████`
SSE (popcount)	0.091	0.085	3.3	`███████████████████▊`
AVX2	0.073	0.070	4.0	`████████████████████████`
AVX2 (popcount)	0.084	0.081	3.5	`████████████████████▋`
AVX512BW	0.070	0.065	4.3	`█████████████████████████▊`
AVX512BW (unrolled)	0.049	0.042	6.7	`████████████████████████████████████████`

Conclusions

In case of SSE & AVX2, the popcount-based procedures are not that bad. I suspected that due to complex procedures that postprocess the comparison results, the performance would be much worse.
On AVX512 machines the AVX512BW code is faster than AVX2 procedures, but unrolling does not help that much. My rough guess is that the culprit is popcnt, as this instruction can be dispatched only on single port and latency is three cycles.

Source code

All implementation are available at github.