SIMDized sum of all bytes in the array — part 2: signed bytes

Author:	Wojciech Muła
Added on:	2018-11-18
Updated on:	2019-02-04 (faster PSADBW approach contributed by Matthieu Darbois, update results from GCC); 2019-02-03 (new method also contributed by Matthieu Darbois)

Contents

Introduction
- Algorithm used by GCC
Explicit casting
Shift-based casting
AVX2-specific instruction VPSADBW — naive approach
AVX2-specific instruction VPSADBW — faster algorithm
AVX2-specific instruction VPMADDUBSW
Experiments
- Haswell
- Skylake
- SkylakeX
- CannonLake
Conclusions
Acknowledgements
Source code

Introduction

This is the second part of SIMDized sum of all bytes in the array. The first part describes summing unsigned bytes, here we're going to experiment with summing of signed bytes.

The baseline C implementation is:

int32_t sumbytes(int8_t* array, size_t size) {

    int32_t result = 0;

    for (size_t i=0; i < size; i++)
        result += int32_t(array[i]);

    return result;
}

And the C++ implementation:

#include <numeric>

int32_t sumbytes(int8_t* array, size_t size) {
    return std::accumulate(array, array + size, int32_t(0));
}

Algorithm used by GCC

Below is the assembly code of the main loop compiled for Skylake by GCC 7.3.0 with flags -O3 -march=skylake:

vpmovsxbw       %xmm1, %ymm2
vextracti128    $0x1, %ymm1, %xmm1
vpmovsxwd       %xmm2, %ymm3
vextracti128    $0x1, %ymm2, %xmm2
vpmovsxbw       %xmm1, %ymm1
vpaddd          %ymm0, %ymm3, %ymm3
vpmovsxwd       %xmm2, %ymm0
vpaddd          %ymm3, %ymm0, %ymm2
vpmovsxwd       %xmm1, %ymm0
vextracti128    $0x1, %ymm1, %xmm1
vpaddd          %ymm2, %ymm0, %ymm0
vpmovsxwd       %xmm1, %ymm1
vpaddd          %ymm0, %ymm1, %ymm0

The approach used here by GCC is exactly the same as for summing unsigned bytes. There are multiple 32-bit sub-accumulators in single register, i.e. eight in case of AVX2 (four in SSE code), which are added together in the end, forming the scalar result.

To get 32-bit values there's two-step casting from int8_t to int32_t:

First extend a vector of int8_t into two vectors of int16_t numbers (VPMOVSXBW).
Then, get four vectors of int32_t from the vectors obtained in the previous step (VPMOVSXWD).

The cast instruction VPMOVSX extends the lower part of a register, in this case the lower half. This is the reason why extractions of helves (VEXTRACTI128) are needed.

Explicit casting

There's instruction VPMOVSXBD which casts directly from int8_t to int32_t. It get lower 64 bits of input register, thus to convert all bytes from an AVX2 register the instruction has to be called 4 times and some shifting is needed.

Below is sample implementation

int32_t avx2_sumsignedbytes(int8_t* array, size_t size) {

    __m256i accumulator = _mm256_setzero_si256();

    for (size_t i=0; i < size; i += 32) {
        const __m256i v = _mm256_loadu_si256((__m256i*)(array + i));

        const __m128i lo = _mm256_extracti128_si256(v, 0);
        const __m128i hi = _mm256_extracti128_si256(v, 1);

        const __m256i t0 = _mm256_cvtepi8_epi32(lo);
        const __m256i t1 = _mm256_cvtepi8_epi32(hi);
        const __m256i t2 = _mm256_cvtepi8_epi32(_mm_bsrli_si128(lo, 8));
        const __m256i t3 = _mm256_cvtepi8_epi32(_mm_bsrli_si128(hi, 8));

        accumulator = _mm256_add_epi32(accumulator, t0);
        accumulator = _mm256_add_epi32(accumulator, t1);
        accumulator = _mm256_add_epi32(accumulator, t2);
        accumulator = _mm256_add_epi32(accumulator, t3);
    }

    return int32_t(_mm256_extract_epi32(accumulator, 0)) +
           int32_t(_mm256_extract_epi32(accumulator, 1)) +
           int32_t(_mm256_extract_epi32(accumulator, 2)) +
           int32_t(_mm256_extract_epi32(accumulator, 3)) +
           int32_t(_mm256_extract_epi32(accumulator, 4)) +
           int32_t(_mm256_extract_epi32(accumulator, 5)) +
           int32_t(_mm256_extract_epi32(accumulator, 6)) +
           int32_t(_mm256_extract_epi32(accumulator, 7));
}

Shift-based casting

In this approach we also cast directly from 8 into 32-bit numbers, but we use 32-bit shifts.

To convert 3rd bytes in all 32-bit words of registers we simply do arithmetic shift right by 24 bits right. This shift repeats the most significant (a sign bit) of our 8-bit value.

However, to do the same for other bytes in a 32-bit we need two shifts. The first one left, which places byte at 3rd position. Then arithmetic shift right is used to extend the type.

Algorithm is:

Load the input vector.

// v   = [  5 | -1 |  2 | -3 |  7 |  1 |  2 |  3 | -6 | -1 | -3 |  8 | -7 | -12|  3 |  2 ]
const __m256i v = _mm256_loadu_si256(ptr);

Extend 3rd bytes.

// v0  = [                 5 |                 7 |                -6 |                -7 ]
const __m256i v0 = _mm256_srai_epi32(v, 3*8);

Extend 2nd bytes.

// v1  = [ -1 |  2 | -3 |  0 |  1 |  2 |  3 |  0 | -1 | -3 |  8 |  0 | -12|  3 |  2 |  0 ] >>> 24
//     = [                -1 |                 1 |                -1 |               -12 ]
const __m256i v1 = _mm256_srai_epi32(_mm256_slli_epi32(v, 1*8), 3*8);

Extend 1st bytes.

// v2  = [  2 | -3 |  0 |  0 |  2 |  3 |  0 |  0 | -3 |  8 |  0 |  0 |  3 |  2 |  0 |  0 ] >>> 24
//     = [                 2 |                 2 |                -3 |                 3 ]
const __m256i v2 = _mm256_srai_epi32(_mm256_slli_epi32(v, 2*8), 3*8);

Extend 0th bytes.

// v3  = [ -3 |  0 |  0 |  0 |  3 |  0 |  0 |  0 |  8 |  0 |  0 |  0 |  2 |  0 |  0 |  0 ] >>> 24
//     = [                -3 |                 3 |                 8 |                 2 ]
const __m256i v3 = _mm256_srai_epi32(_mm256_slli_epi32(v, 3*8), 3*8);

Update the accumulator.

accumulator = _mm256_add_epi32(accumulator, v0);
accumulator = _mm256_add_epi32(accumulator, v1);
accumulator = _mm256_add_epi32(accumulator, v2);
accumulator = _mm256_add_epi32(accumulator, v3);

Sample implementation:

int32_t avx2_sumsignedbytes_variant2(int8_t* array, size_t size) {

    __m256i accumulator = _mm256_setzero_si256();

    for (size_t i=0; i < size; i += 32) {
        const __m256i v = _mm256_loadu_si256((__m256i*)(array + i));
        const __m256i v0 = _mm256_srai_epi32(v, 3*8);
        const __m256i v1 = _mm256_srai_epi32(_mm256_slli_epi32(v, 1*8), 3*8);
        const __m256i v2 = _mm256_srai_epi32(_mm256_slli_epi32(v, 2*8), 3*8);
        const __m256i v3 = _mm256_srai_epi32(_mm256_slli_epi32(v, 3*8), 3*8);

        accumulator = _mm256_add_epi32(accumulator, v0);
        accumulator = _mm256_add_epi32(accumulator, v1);
        accumulator = _mm256_add_epi32(accumulator, v2);
        accumulator = _mm256_add_epi32(accumulator, v3);
    }

    return int32_t(_mm256_extract_epi32(accumulator, 0)) +
           int32_t(_mm256_extract_epi32(accumulator, 1)) +
           int32_t(_mm256_extract_epi32(accumulator, 2)) +
           int32_t(_mm256_extract_epi32(accumulator, 3)) +
           int32_t(_mm256_extract_epi32(accumulator, 4)) +
           int32_t(_mm256_extract_epi32(accumulator, 5)) +
           int32_t(_mm256_extract_epi32(accumulator, 6)) +
           int32_t(_mm256_extract_epi32(accumulator, 7));
}

AVX2-specific instruction VPSADBW — naive approach

AVX2 has got instruction VPSADBW (_mm256_sad_epu8) that calculates Sum Of Absolute Differences (SAD) of unsigned bytes. Single SAD function works on eight-element subvectors (64-bit slices), and stores the results on corresponding 64-bit elements of the result vector; in case of AVX2 the VPSADBW yields four numbers.

And while the instruction is perfect for summing unsigned bytes, in case of signed bytes it not that great. VPSADBW has to be used twice: for positive and negative elements of input vector, as it works only on unsigned bytes.

Below is the outline of algorithm:

Load the input vector.

// v   = [  5 | -1 |  2 | -3 |  7 |  1 |  2 |  3 | -6 | -1 | -3 |  8 | -7 | -12|  3 |  2 ]
const __m256i v = _mm256_loadu_si256(ptr);

Find mask for negative numbers.

// m   = [ 00 | ff | 00 | ff | 00 | 00 | 00 | 00 | ff | ff | ff | 00 | ff | ff | 00 | 00 ]
const __m256i m  = _mm256_cmplt_epi8(v, zero);

Left positive elements.

// pos = [  5 |  0 |  2 |  0 |  7 |  1 |  2 |  3 |  0 |  0 |  0 |  8 |  0 |  0 |  3 |  2 ]
const __m128i pos = _mm256_andnot_si256(m, v);

Sum the positive elements.

// t0  = [                14 |                13 |                 8 |                 5 ]
const __m256i t0 = _mm256_sad_epu8(pos, zero);

Get the absolute value.

// va  = [  5 |  1 |  2 |  3 |  7 |  1 |  2 |  3 |  6 |  1 |  3 |  8 |  7 |  12|  3 |  2 ]
const __m256i va = _mm256_abs_epi8(v);

Left only the absolute values of negative elements.

// neg = [  0 |  1 |  0 |  3 |  0 |  0 |  0 |  0 |  6 |  1 |  3 |  0 |  7 |  0 |  0 |  0 ]
const __m256i neg = _mm256_and_si256(m, va);

Sum the negative elements.

// t1  = [                 4 |                 0 |                10 |                 7 ]
const __m256i t1 = _mm256_sad_epu8(neg, zero);

Update the accumulators.

positive = _mm256_add_epi32(positive, t0);
negative = _mm256_sub_epi32(negative, t1);

Below is the actual implementation.

#define _mm256_cmplt_epi8(a, b) _mm256_cmpgt_epi8(b, a)

int32_t avx2_sadbw_sumsignedbytes(int8_t* array, size_t size) {

    const __m256i zero = _mm256_setzero_si256();
    __m256i positive = zero;
    __m256i negative = zero;

    for (size_t i=0; i < size; i += 32) {
        const __m256i v  = _mm256_loadu_si256((__m256i*)(array + i));
        const __m256i m  = _mm256_cmplt_epi8(v, zero);
        const __m256i va = _mm256_abs_epi8(v);

        // sum just positive numbers
        const __m256i t0 = _mm256_sad_epu8(_mm256_andnot_si256(m, v), zero);

        // sum just negative numbers
        const __m256i t1 = _mm256_sad_epu8(_mm256_and_si256(m, va), zero);

        positive = _mm256_add_epi32(positive, t0);
        negative = _mm256_sub_epi32(negative, t1);
    }

    const __m256i accumulator = _mm256_add_epi32(positive, negative);

    return int32_t(_mm256_extract_epi32(accumulator, 0)) +
           int32_t(_mm256_extract_epi32(accumulator, 2)) +
           int32_t(_mm256_extract_epi32(accumulator, 4)) +
           int32_t(_mm256_extract_epi32(accumulator, 6));
}

AVX2-specific instruction VPSADBW — faster algorithm

Matthieu Darbois came up with a way batter method that utilizing VPSADBW. The input range of bytes is changed from int8_t range, i.e. -128 .. +128, into uint8_t range by adding 128. This addition sets the input range to 0 .. 255, and thanks to that single invocation of VPSADBW is sufficient to add unsigned bytes. The input adjustment is done with a cheap binary operation, thus the procedure is as fast as its unsigned counterpart.

Due to the input adjustments, the final result is increased by 128 times the number of bytes. This value must be extracted to get the real singed sum.

Below is the actual implementation.

int32_t avx2_sadbw_variant_sumsignedbytes(int8_t* array, size_t size) {

    const __m256i zero   = _mm256_setzero_si256();
    const __m256i addend = _mm256_set1_epi8(-128);
    __m256i accumulator = zero;

    for (size_t i=0; i < size; i += 32) {
        const __m256i v  = _mm256_loadu_si256((__m256i*)(array + i));
        const __m256i t0 = _mm256_xor_si256(v, addend);
        const __m256i t1 = _mm256_sad_epu8(t0, zero);

        accumulator = _mm256_add_epi32(accumulator, t1);
    }

    return int32_t(_mm256_extract_epi32(accumulator, 0)) +
           int32_t(_mm256_extract_epi32(accumulator, 2)) +
           int32_t(_mm256_extract_epi32(accumulator, 4)) +
           int32_t(_mm256_extract_epi32(accumulator, 6)) -
           128 * int32_t(size);
}

AVX2-specific instruction VPMADDUBSW

Matthieu Darbois also proposed to use instruction VPMADDUBSW (_mm256_madd_epi16) to perform horizontal additions of bytes.

The instruction itself is more complex, as it multiplies vertically bytes and then adds the pairs of intermediate signed 16-bit results, forming a new vector of 16-bit values. If another vector is filled with 1, then efficiently we add pair of bytes horizontally.

The inner loop of algorithm uses VPMADDUBSW to obtain 16-bit counters which update a 16-bit local accumulator. To avoid overflow, after a fixed number of iterations (128), the local accumulator is extended to 32-bit values and added to the global, 32-bit accumulator.

Below is the actual implementation.

int32_t avx2_maddubs_sumsignedbytes(int8_t* array, size_t size) {

    const __m256i zero   = _mm256_setzero_si256();
    const __m256i one_8  = _mm256_set1_epi8(1);
    const __m256i one_16 = _mm256_set1_epi16(1);
    __m256i accumulator = zero;

    for (size_t j = size / (32 * 128); j != 0; --j) {
        __m256i local_accumulator = zero;
        for (size_t i = 0; i < 128 * 32; i += 32) {
            const __m256i v   = _mm256_loadu_si256((__m256i*)(array + i));
            const __m256i t0 = _mm256_maddubs_epi16(one_8, v);
            local_accumulator = _mm256_add_epi16(local_accumulator, t0);
        }
        const __m256i t1 = _mm256_madd_epi16(one_16, local_accumulator);
        accumulator = _mm256_add_epi32(accumulator, t1);
        array += 128 * 32;
    }

    size_t remainder = size - ((size / (32 * 128)) * (32 * 128));
    if (remainder) {
        __m256i local_accumulator = zero;
        for (size_t i = 0; i < remainder; i += 32) {
            const __m256i v   = _mm256_loadu_si256((__m256i*)(array + i));
            const __m256i t0 = _mm256_maddubs_epi16(one_8, v);
            local_accumulator = _mm256_add_epi16(local_accumulator, t0);
        }
        const __m256i t1 = _mm256_madd_epi16(one_16, local_accumulator);
        accumulator = _mm256_add_epi32(accumulator, t1);
    }

    const __m128i accumulator128 = _mm_add_epi32(_mm256_extracti128_si256(accumulator, 0), _mm256_extracti128_si256(accumulator, 1));

    return int32_t(_mm_extract_epi32(accumulator128, 0)) +
           int32_t(_mm_extract_epi32(accumulator128, 1)) +
           int32_t(_mm_extract_epi32(accumulator128, 2)) +
           int32_t(_mm_extract_epi32(accumulator128, 3));
}

Experiments

Tested procedures
scalar	plain loop
scalar (C++)	std::accumulate
SSE	explicit casting
SSE (v2)	shif-based casting
SSE (sadbw)	two PSADBW instructions
SSE (sadbw, unrolled)	the above procedure unrolled four times
AVX2	explicit casting
AVX2 (v2)	shift-based casting
AVX2 (sadbw)	two VPSADBW instructions
AVX2 (sadbw, unrolled)	the above procedure unrolled four times
AVX2 (sadbw, variant)	single VPSADBW instruction
AVX2 (maddubs)	VPMADDUBSW instruction

The procedures were run three times and minimum values were noted.

Haswell

CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz (Haswell)

Compiler: Apple LLVM version 10.0.0 (clang-1000.11.45.5)

procedure	best	avg.	speedup
	[cycles]	[cycles]
size 4096 items
scalar	0.132	0.153	1.00	`████████▋`
scalar (C++)	0.122	0.131	1.08	`█████████▍`
SSE	0.294	0.300	0.45	`███▉`
SSE (v2)	0.294	0.317	0.45	`███▉`
SSE (sadbw)	0.130	0.141	1.02	`████████▊`
SSE (sadbw, unrolled)	0.130	0.135	1.02	`████████▊`
AVX2	0.153	0.161	0.86	`███████▌`
AVX2 (v2)	0.147	0.150	0.90	`███████▊`
AVX2 (sadbw)	0.070	0.072	1.89	`████████████████▍`
AVX2 (sadbw, unrolled)	0.060	0.067	2.20	`███████████████████▏`
AVX2 (sadbw, variant)	0.043	0.048	3.07	`██████████████████████████▋`
AVX2 (maddubs)	0.023	0.026	5.74	`██████████████████████████████████████████████████`
size 16384 items
scalar	0.118	0.127	1.00	`████████▋`
scalar (C++)	0.121	0.127	0.98	`████████▍`
SSE	0.295	0.298	0.40	`███▍`
SSE (v2)	0.295	0.316	0.40	`███▍`
SSE (sadbw)	0.130	0.136	0.91	`███████▉`
SSE (sadbw, unrolled)	0.131	0.134	0.90	`███████▊`
AVX2	0.154	0.158	0.77	`██████▋`
AVX2 (v2)	0.147	0.149	0.80	`██████▉`
AVX2 (sadbw)	0.068	0.072	1.74	`███████████████`
AVX2 (sadbw, unrolled)	0.065	0.070	1.82	`███████████████▊`
AVX2 (sadbw, variant)	0.039	0.043	3.03	`██████████████████████████▎`
AVX2 (maddubs)	0.021	0.023	5.62	`████████████████████████████████████████████████▉`
size 32768 items
scalar	0.121	0.129	1.00	`████████▋`
scalar (C++)	0.122	0.131	0.99	`████████▋`
SSE	0.296	0.296	0.41	`███▌`
SSE (v2)	0.296	0.297	0.41	`███▌`
SSE (sadbw)	0.132	0.133	0.92	`███████▉`
SSE (sadbw, unrolled)	0.132	0.133	0.92	`███████▉`
AVX2	0.156	0.157	0.78	`██████▊`
AVX2 (v2)	0.148	0.148	0.82	`███████`
AVX2 (sadbw)	0.068	0.069	1.78	`███████████████▌`
AVX2 (sadbw, unrolled)	0.064	0.064	1.89	`████████████████▍`
AVX2 (sadbw, variant)	0.040	0.040	3.02	`██████████████████████████▎`
AVX2 (maddubs)	0.023	0.024	5.26	`█████████████████████████████████████████████▊`

Skylake

CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

GCC: g++-8 (Ubuntu 8.1.0-5ubuntu1~16.04) 8.1.0

procedure	best	avg.	speedup
	[cycles]	[cycles]
size 4096 items
scalar	0.264	0.265	1.00	`███████████▉`
scalar (C++)	0.263	0.265	1.00	`███████████▉`
SSE	0.404	0.407	0.65	`███████▊`
SSE (v2)	0.233	0.235	1.13	`█████████████▌`
SSE (sadbw)	0.191	0.193	1.38	`████████████████▍`
SSE (sadbw, unrolled)	0.191	0.193	1.38	`████████████████▍`
AVX2	0.206	0.208	1.28	`███████████████▎`
AVX2 (v2)	0.128	0.129	2.06	`████████████████████████▌`
AVX2 (sadbw)	0.112	0.114	2.36	`████████████████████████████▏`
AVX2 (sadbw, unrolled)	0.099	0.102	2.67	`███████████████████████████████▊`
AVX2 (sadbw, variant)	0.069	0.071	3.83	`█████████████████████████████████████████████▋`
AVX2 (maddubs)	0.063	0.065	4.19	`██████████████████████████████████████████████████`
size 16384 items
scalar	0.260	0.261	1.00	`███████████▉`
scalar (C++)	0.260	0.261	1.00	`███████████▉`
SSE	0.403	0.405	0.65	`███████▋`
SSE (v2)	0.231	0.232	1.13	`█████████████▍`
SSE (sadbw)	0.188	0.189	1.38	`████████████████▌`
SSE (sadbw, unrolled)	0.188	0.189	1.38	`████████████████▌`
AVX2	0.203	0.204	1.28	`███████████████▎`
AVX2 (v2)	0.125	0.126	2.08	`████████████████████████▊`
AVX2 (sadbw)	0.109	0.110	2.39	`████████████████████████████▍`
AVX2 (sadbw, unrolled)	0.097	0.099	2.68	`███████████████████████████████▉`
AVX2 (sadbw, variant)	0.066	0.066	3.94	`███████████████████████████████████████████████`
AVX2 (maddubs)	0.064	0.064	4.06	`████████████████████████████████████████████████▍`
size 32768 items
scalar	0.260	0.261	1.00	`███████████▉`
scalar (C++)	0.260	0.261	1.00	`███████████▉`
SSE	0.403	0.405	0.65	`███████▋`
SSE (v2)	0.231	0.232	1.13	`█████████████▍`
SSE (sadbw)	0.188	0.189	1.38	`████████████████▌`
SSE (sadbw, unrolled)	0.188	0.189	1.38	`████████████████▌`
AVX2	0.202	0.203	1.29	`███████████████▎`
AVX2 (v2)	0.125	0.126	2.08	`████████████████████████▊`
AVX2 (sadbw)	0.109	0.109	2.39	`████████████████████████████▍`
AVX2 (sadbw, unrolled)	0.098	0.099	2.65	`███████████████████████████████▋`
AVX2 (sadbw, variant)	0.066	0.066	3.94	`███████████████████████████████████████████████`
AVX2 (maddubs)	0.063	0.064	4.13	`█████████████████████████████████████████████████▏`

SkylakeX

CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz

GCC: gcc (Ubuntu 8.1.0-5ubuntu1~16.04) 8.1.0

procedure	best	avg.	speedup
	[cycles]	[cycles]
size 4096 items
scalar	0.284	0.289	1.00	`█████████▎`
scalar (C++)	0.285	0.288	1.00	`█████████▏`
SSE	0.437	0.441	0.65	`██████`
SSE (v2)	0.243	0.246	1.17	`██████████▊`
SSE (sadbw)	0.177	0.186	1.60	`██████████████▊`
SSE (sadbw, unrolled)	0.177	0.185	1.60	`██████████████▊`
AVX2	0.222	0.224	1.28	`███████████▊`
AVX2 (v2)	0.135	0.137	2.10	`███████████████████▍`
AVX2 (sadbw)	0.106	0.109	2.68	`████████████████████████▊`
AVX2 (sadbw, unrolled)	0.104	0.109	2.73	`█████████████████████████▎`
AVX2 (sadbw, variant)	0.055	0.059	5.16	`███████████████████████████████████████████████▊`
AVX2 (maddubs)	0.061	0.066	4.66	`███████████████████████████████████████████`
size 16384 items
scalar	0.281	0.284	1.00	`█████████▎`
scalar (C++)	0.282	0.283	1.00	`█████████▏`
SSE	0.437	0.438	0.64	`█████▉`
SSE (v2)	0.242	0.243	1.16	`██████████▋`
SSE (sadbw)	0.176	0.182	1.60	`██████████████▊`
SSE (sadbw, unrolled)	0.176	0.182	1.60	`██████████████▊`
AVX2	0.219	0.220	1.28	`███████████▊`
AVX2 (v2)	0.132	0.134	2.13	`███████████████████▋`
AVX2 (sadbw)	0.103	0.105	2.73	`█████████████████████████▏`
AVX2 (sadbw, unrolled)	0.102	0.104	2.75	`█████████████████████████▍`
AVX2 (sadbw, variant)	0.052	0.053	5.40	`██████████████████████████████████████████████████`
AVX2 (maddubs)	0.055	0.057	5.11	`███████████████████████████████████████████████▎`
size 32768 items
scalar	0.281	0.283	1.00	`█████████▎`
scalar (C++)	0.281	0.282	1.00	`█████████▎`
SSE	0.437	0.438	0.64	`█████▉`
SSE (v2)	0.242	0.243	1.16	`██████████▋`
SSE (sadbw)	0.177	0.182	1.59	`██████████████▋`
SSE (sadbw, unrolled)	0.177	0.182	1.59	`██████████████▋`
AVX2	0.219	0.220	1.28	`███████████▊`
AVX2 (v2)	0.133	0.134	2.11	`███████████████████▌`
AVX2 (sadbw)	0.103	0.105	2.73	`█████████████████████████▏`
AVX2 (sadbw, unrolled)	0.103	0.105	2.73	`█████████████████████████▏`
AVX2 (sadbw, variant)	0.052	0.054	5.40	`██████████████████████████████████████████████████`
AVX2 (maddubs)	0.057	0.058	4.93	`█████████████████████████████████████████████▌`

CannonLake

CPU: Intel(R) Core(TM) i3-8121U CPU @ 2.20GHz

Compiler: g++ (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)

procedure	best	avg.	speedup
	[cycles]	[cycles]
size 4096 items
scalar	0.203	0.205	1.00	`█████████████▌`
scalar (C++)	0.203	0.205	1.00	`█████████████▌`
SSE	0.303	0.306	0.67	`█████████`
SSE (v2)	0.175	0.177	1.16	`███████████████▋`
SSE (sadbw)	0.154	0.155	1.32	`█████████████████▊`
SSE (sadbw, unrolled)	0.154	0.155	1.32	`█████████████████▊`
AVX2	0.153	0.155	1.33	`█████████████████▉`
AVX2 (v2)	0.102	0.104	1.99	`██████████████████████████▉`
AVX2 (sadbw)	0.089	0.091	2.28	`██████████████████████████████▉`
AVX2 (sadbw, unrolled)	0.072	0.075	2.82	`██████████████████████████████████████▏`
AVX2 (sadbw, variant)	0.061	0.062	3.33	`█████████████████████████████████████████████`
AVX2 (maddubs)	0.055	0.058	3.69	`██████████████████████████████████████████████████`
size 16384 items
scalar	0.196	0.198	1.00	`█████████████▌`
scalar (C++)	0.196	0.197	1.00	`█████████████▌`
SSE	0.302	0.304	0.65	`████████▊`
SSE (v2)	0.173	0.174	1.13	`███████████████▎`
SSE (sadbw)	0.152	0.153	1.29	`█████████████████▍`
SSE (sadbw, unrolled)	0.152	0.153	1.29	`█████████████████▍`
AVX2	0.151	0.153	1.30	`█████████████████▌`
AVX2 (v2)	0.099	0.101	1.98	`██████████████████████████▊`
AVX2 (sadbw)	0.087	0.088	2.25	`██████████████████████████████▌`
AVX2 (sadbw, unrolled)	0.071	0.073	2.76	`█████████████████████████████████████▍`
AVX2 (sadbw, variant)	0.058	0.059	3.38	`█████████████████████████████████████████████▊`
AVX2 (maddubs)	0.054	0.056	3.63	`█████████████████████████████████████████████████▏`
size 32768 items
scalar	0.195	0.197	1.00	`█████████████▌`
scalar (C++)	0.195	0.197	1.00	`█████████████▌`
SSE	0.302	0.305	0.65	`████████▋`
SSE (v2)	0.173	0.174	1.13	`███████████████▎`
SSE (sadbw)	0.151	0.153	1.29	`█████████████████▍`
SSE (sadbw, unrolled)	0.151	0.153	1.29	`█████████████████▍`
AVX2	0.151	0.153	1.29	`█████████████████▍`
AVX2 (v2)	0.099	0.101	1.97	`██████████████████████████▋`
AVX2 (sadbw)	0.087	0.087	2.24	`██████████████████████████████▎`
AVX2 (sadbw, unrolled)	0.071	0.072	2.75	`█████████████████████████████████████▏`
AVX2 (sadbw, variant)	0.058	0.059	3.36	`█████████████████████████████████████████████▌`
AVX2 (maddubs)	0.055	0.057	3.55	`████████████████████████████████████████████████`

Conclusions

On Skylake and SkylakeX architectures VSADBW method is fastest, 2.5 times faster than compiled code.
On Haswell VPMADDUBSW method significantly outperforms other approaches, it's faster 1.7 times than single VPSADBW method. However, on Skylake and newer architectures both methods are comparable.
The results from GCC presented in this article before 2019-02-04 were worse for SkylakeX — to my surprise the flag -O3 doesn't enable -funroll-loops. Thanks to Richard Startin and Travis Downs for pointing this out.

Acknowledgements

Big thanks to Daniel Lemire who provides access to Skylake, SkylakeX and CannonLake machines, where I can run benchmarks.

Source code

Source code is available on github.