SIMDized sum of all bytes in the array

Author: Wojciech Muła
Added on: 2018-10-24

Introduction

I was curious how GCC vectorizes a function that sums the bytes of an array. Below is a loop-based implementation.

uint32_t sumbytes(uint8_t* array, size_t size) {

    uint32_t result = 0;

    for (size_t i=0; i < size; i++)
        result += uint32_t(array[i]);

    return result;
}

The same algorithm can be expressed with the following C++ code.

#include <numeric>

uint32_t sumbytes(uint8_t* array, size_t size) {
    return std::accumulate(array, array + size, uint32_t(0));
}

When I saw the assembly generated by GCC, I was sure it was possible to make it better and faster. This text summarizes my findings.

I focus solely on Skylake performance and AVX2 code. The sources also contain SSE implementations, and the experiments include timings from an older CPU.

Algorithm used by GCC

Below is the assembly code of the main loop compiled for Skylake by GCC 7.3.0 with flags -O3 -march=skylake:

vpmovzxbw    %xmm0, %ymm2
vextracti128 $0x1, %ymm0, %xmm0
vpmovzxwd    %xmm2, %ymm1
vextracti128 $0x1, %ymm2, %xmm2
vpmovzxbw    %xmm0, %ymm0
vpmovzxwd    %xmm2, %ymm2
vpaddd       %ymm2, %ymm1, %ymm1
vpmovzxwd    %xmm0, %ymm2
vextracti128 $0x1, %ymm0, %xmm0
vpaddd       %ymm2, %ymm1, %ymm1
vpmovzxwd    %xmm0, %ymm0
vpaddd       %ymm0, %ymm1, %ymm0
vpaddd       %ymm0, %ymm3, %ymm3

GCC nicely vectorized the algorithm: it keeps multiple 32-bit sub-accumulators in a single register, i.e. eight in the case of AVX2 (four in the SSE code). These 32-bit numbers are added together at the end, forming the scalar result.

Now, let's look at how the type casting is done. Although AVX2 has the instruction VPMOVZXBD, which converts directly from uint8_t to uint32_t (intrinsic _mm256_cvtepu8_epi32), the compiler does the conversion in two steps:

  1. First, it extends a vector of uint8_t into two vectors of uint16_t numbers (VPMOVZXBW).
  2. Then, it gets four vectors of uint32_t numbers from the vectors obtained in the previous step (VPMOVZXWD).

The cast instruction VPMOVZX extends the lower part of a register, in this case the lower half. This is the reason why extractions of halves (VEXTRACTI128) are needed.
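
Expressed with intrinsics, the compiler's two-step widening looks roughly like this (a sketch of the emitted pattern, not GCC's literal output; the helper name is mine):

#include <immintrin.h>

// Widen 32 bytes to uint32_t in two steps and add them to the eight
// 32-bit sub-accumulators kept in `acc`.
__m256i add_32_bytes(__m256i acc, __m256i v) {
    // uint8_t -> uint16_t (VPMOVZXBW) on both 16-byte halves
    const __m256i w0 = _mm256_cvtepu8_epi16(_mm256_castsi256_si128(v));
    const __m256i w1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(v, 1));

    // uint16_t -> uint32_t (VPMOVZXWD), again splitting each vector in half
    const __m256i d0 = _mm256_cvtepu16_epi32(_mm256_castsi256_si128(w0));
    const __m256i d1 = _mm256_cvtepu16_epi32(_mm256_extracti128_si256(w0, 1));
    const __m256i d2 = _mm256_cvtepu16_epi32(_mm256_castsi256_si128(w1));
    const __m256i d3 = _mm256_cvtepu16_epi32(_mm256_extracti128_si256(w1, 1));

    acc = _mm256_add_epi32(acc, d0);
    acc = _mm256_add_epi32(acc, d1);
    acc = _mm256_add_epi32(acc, d2);
    acc = _mm256_add_epi32(acc, d3);
    return acc;
}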

32-bit accumulators

In this approach we cast directly from uint8_t to uint32_t. However, the casting isn't done by the mentioned VPMOVZXBD — we use only bit-ands and shifts within 32-bit lanes, which has some benefits.

  1. Load the vector:

    //       | 15         12 | 11          8 | 7           4 | 3           0 |
    // v   = [ a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p ]
    const __m128i v = _mm_loadu_si128((const __m128i*)ptr);
    
  2. Mask the 0th bytes in each 32-bit lane.

    const __m128i lobyte_mask = _mm_set1_epi32(0x000000ff);
    
    // t0  = [ 0 | 0 | 0 | d | 0 | 0 | 0 | h | 0 | 0 | 0 | l | 0 | 0 | 0 | p ]
    const __m128i t0 = _mm_and_si128(lobyte_mask, v);
    
  3. Move 1st bytes into 0th position in each lane.

    const __m128i v1 = _mm_srli_epi32(v, 1*8);
    
    // t1  = [ 0 | 0 | 0 | c | 0 | 0 | 0 | g | 0 | 0 | 0 | k | 0 | 0 | 0 | o ]
    const __m128i t1 = _mm_and_si128(lobyte_mask, v1);
    
  4. Move 2nd bytes into 0th position in each lane.

    const __m128i v2 = _mm_srli_epi32(v, 2*8);
    
    // t2  = [ 0 | 0 | 0 | b | 0 | 0 | 0 | f | 0 | 0 | 0 | j | 0 | 0 | 0 | n ]
    const __m128i t2 = _mm_and_si128(lobyte_mask, v2);
    
  5. Move 3rd bytes into 0th position in each lane; no masking is needed here, since the shift fills the upper bytes with zeros.

    // t3  = [ 0 | 0 | 0 | a | 0 | 0 | 0 | e | 0 | 0 | 0 | i | 0 | 0 | 0 | m ]
    const __m128i t3 = _mm_srli_epi32(v, 3*8);
    
  6. Update the accumulator.

    accumulator = _mm_add_epi32(accumulator, t0);
    accumulator = _mm_add_epi32(accumulator, t1);
    accumulator = _mm_add_epi32(accumulator, t2);
    accumulator = _mm_add_epi32(accumulator, t3);
    

Here is the actual implementation (the SSE variant):

uint32_t sse_sumbytes_variant2(uint8_t* array, size_t size) {

    const __m128i lobyte_mask = _mm_set1_epi32(0x000000ff);

    __m128i accumulator = _mm_setzero_si128();

    for (size_t i=0; i < size; i += 16) {
        const __m128i v = _mm_loadu_si128((__m128i*)(array + i));
        const __m128i v0 = v;
        const __m128i v1 = _mm_srli_epi32(v, 1*8);
        const __m128i v2 = _mm_srli_epi32(v, 2*8);
        const __m128i v3 = _mm_srli_epi32(v, 3*8);

        const __m128i t0 = _mm_and_si128(lobyte_mask, v0);
        const __m128i t1 = _mm_and_si128(lobyte_mask, v1);
        const __m128i t2 = _mm_and_si128(lobyte_mask, v2);
        const __m128i t3 = v3;

        accumulator = _mm_add_epi32(accumulator, t0);
        accumulator = _mm_add_epi32(accumulator, t1);
        accumulator = _mm_add_epi32(accumulator, t2);
        accumulator = _mm_add_epi32(accumulator, t3);
    }

    return uint32_t(_mm_extract_epi32(accumulator, 0)) +
           uint32_t(_mm_extract_epi32(accumulator, 1)) +
           uint32_t(_mm_extract_epi32(accumulator, 2)) +
           uint32_t(_mm_extract_epi32(accumulator, 3));
}
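
The AVX2 variant from the sources follows exactly the same pattern; a sketch of it (my reconstruction, the repository code may differ in details) could look like:

uint32_t avx2_sumbytes_variant2(uint8_t* array, size_t size) {

    const __m256i lobyte_mask = _mm256_set1_epi32(0x000000ff);

    __m256i accumulator = _mm256_setzero_si256();

    for (size_t i=0; i < size; i += 32) {
        const __m256i v = _mm256_loadu_si256((__m256i*)(array + i));

        // isolate bytes 0..3 of every 32-bit lane, exactly as in the SSE version
        const __m256i t0 = _mm256_and_si256(lobyte_mask, v);
        const __m256i t1 = _mm256_and_si256(lobyte_mask, _mm256_srli_epi32(v, 1*8));
        const __m256i t2 = _mm256_and_si256(lobyte_mask, _mm256_srli_epi32(v, 2*8));
        const __m256i t3 = _mm256_srli_epi32(v, 3*8);

        accumulator = _mm256_add_epi32(accumulator, t0);
        accumulator = _mm256_add_epi32(accumulator, t1);
        accumulator = _mm256_add_epi32(accumulator, t2);
        accumulator = _mm256_add_epi32(accumulator, t3);
    }

    return uint32_t(_mm256_extract_epi32(accumulator, 0)) +
           uint32_t(_mm256_extract_epi32(accumulator, 1)) +
           uint32_t(_mm256_extract_epi32(accumulator, 2)) +
           uint32_t(_mm256_extract_epi32(accumulator, 3)) +
           uint32_t(_mm256_extract_epi32(accumulator, 4)) +
           uint32_t(_mm256_extract_epi32(accumulator, 5)) +
           uint32_t(_mm256_extract_epi32(accumulator, 6)) +
           uint32_t(_mm256_extract_epi32(accumulator, 7));
}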

The assembly code of the main loop of the AVX2 variant is:

vpsrld $0x8, %ymm1, %ymm3
vpsrld $0x10, %ymm1, %ymm2
vpsrld $0x18, %ymm1, %ymm0
vpand  %ymm3, %ymm4, %ymm3
vpaddd %ymm5, %ymm0, %ymm0
vpand  %ymm2, %ymm4, %ymm2
vpand  %ymm1, %ymm4, %ymm1
vpaddd %ymm3, %ymm0, %ymm0
vpaddd %ymm2, %ymm0, %ymm0
vpaddd %ymm1, %ymm0, %ymm0

This approach is two times faster than the compiler's version. There are two reasons for that:

  1. We have 10 instructions instead of 13.
  2. All of these instructions are simple and fast. Let's look at the instructions' parameters for the Skylake architecture, taken from Agner Fog's site.

instruction      | micro-ops | port allocation | latency | reciprocal throughput
-----------------|-----------|-----------------|---------|----------------------
compiler version |           |                 |         |
3 x vextracti128 | 1         | p5              | 3       | 1
2 x vpmovzxbw    | 1         | p5              | 3       | 1
4 x vpmovzxwd    | 1         | p5              | 3       | 1
4 x vpaddd       | 1         | p015            | 1       | 0.33
our approach     |           |                 |         |
3 x vpand        | 1         | p015            | 1       | 0.33
3 x vpsrld       | 1         | p05             | 1       | 0.5
4 x vpaddd       | 1         | p015            | 1       | 0.33

First of all, most instructions emitted by the compiler have a latency of 3 cycles. Latency alone could be compensated for (especially in unrolled loops). Worse is that all instructions except VPADDD are dispatched to the same execution port #5, so congestion on that port is likely.

The instructions used by our approach can be dispatched to different ports (and thus executed in parallel); moreover, they all have a latency of just 1 cycle.

16-bit accumulators

Using 16-bit accumulators instead of 32-bit ones doubles the capacity of a register, so in AVX2 code we update 16 accumulators instead of 8 in a single iteration. However, this approach limits the number of additions that can be performed without overflow to max(uint16_t)/max(uint8_t) = 65535/255 = 257.

To overcome this we need two nested loops. The outer loop keeps 32-bit accumulators, as the initial version does. The inner loop processes 256 * 8 bytes, i.e. it iterates 64 times; since each 16-bit accumulator receives two bytes per iteration, it absorbs at most 2 * 64 = 128 additions, safely below the limit of 257. When the inner loop completes, the 16-bit accumulators are extended to 32 bits and the outer accumulators are updated.

Straightforward approach

Below is a C++ implementation that directly uses vectors of uint16_t in the inner loop.

uint32_t avx2_16bit_sumbytes_variant2(uint8_t* array, size_t size) {

    const __m256i mask_lobytes = _mm256_set1_epi16(0x00ff);

    __m256i accumulator = _mm256_setzero_si256();

#define SIZE (8 * 256)
    for (size_t i=0; i < size; i += SIZE) {

        __m256i accu_16bit = _mm256_setzero_si256();

        for (size_t j=0; j < SIZE; j += 32) {
            const __m256i v = _mm256_loadu_si256((__m256i*)(array + i + j));

            // uint8_t -> uint16_t: even bytes via mask, odd bytes via shift
            const __m256i t0 = _mm256_and_si256(v, mask_lobytes);
            const __m256i t1 = _mm256_srli_epi16(v, 8);

            accu_16bit = _mm256_add_epi16(accu_16bit, t0);
            accu_16bit = _mm256_add_epi16(accu_16bit, t1);
        }

        // uint16_t -> uint32_t: extend the 16-bit sub-sums and update
        // the outer 32-bit accumulators
        const __m256i t0 = _mm256_and_si256(accu_16bit, _mm256_set1_epi32(0x0000ffff));
        const __m256i t1 = _mm256_srli_epi32(accu_16bit, 16);

        accumulator = _mm256_add_epi32(accumulator, t0);
        accumulator = _mm256_add_epi32(accumulator, t1);
    }
#undef SIZE

    return uint32_t(_mm256_extract_epi32(accumulator, 0)) +
           uint32_t(_mm256_extract_epi32(accumulator, 1)) +
           uint32_t(_mm256_extract_epi32(accumulator, 2)) +
           uint32_t(_mm256_extract_epi32(accumulator, 3)) +
           uint32_t(_mm256_extract_epi32(accumulator, 4)) +
           uint32_t(_mm256_extract_epi32(accumulator, 5)) +
           uint32_t(_mm256_extract_epi32(accumulator, 6)) +
           uint32_t(_mm256_extract_epi32(accumulator, 7));
}

8-bit halves

In this approach we also use 16-bit accumulators, but their lower and higher bytes are stored separately in distinct vectors. Thanks to that we use the full capacity of AVX2, i.e. we can perform 32 additions at a time. In each iteration the input vector of bytes is added to the lower part of the accumulator; the higher part is updated in a second step, using the carry from the first one. The following scalar code shows the idea:

acc_lo += input;
if (carry_flag)
    acc_hi += 1;

How do we detect an overflow? In scalar code it's simple, as we can read the carry flag and react accordingly (GCC provides nice wrappers for this, like __builtin_add_overflow). But there is no such thing as a vector of carry flags in either AVX2 or SSE.
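
For the record, the scalar detection with the GCC builtin might look like this (a minimal sketch; the names are mine):

uint8_t acc_lo = 0;
uint8_t acc_hi = 0;

void add_byte(uint8_t input) {
    // __builtin_add_overflow returns true if the 8-bit sum wrapped around
    if (__builtin_add_overflow(acc_lo, input, &acc_lo))
        acc_hi += 1;
}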

We can detect overflow with the help of saturated addition. In saturated arithmetic, if an operation would overflow, the result is the maximum (or minimum) value the target type can hold. Thus, saturated addition might be expressed as:

uint8_t satadd(uint8_t a, uint8_t b) {

    if (uint32_t(a) + uint32_t(b) <= 255) // this check is done *somehow* in the hardware
        return a + b;
    else
        return 255;
}

We perform two operations:

  1. normal addition (in wrap-around arithmetic);
  2. and saturated one.

If the results of both sums are not equal, an overflow occurred. But since AVX2 has no "not-equal" relation (there is only a comparison for "equal"), we need to invert the logic. We initialize the higher part of the accumulator to the highest possible value, i.e. the number of inner loop iterations. Then we decrement the higher part with the "no-overflow" byte-mask. The following scalar code illustrates the idea.

uint8_t input[64]; // 64 bytes feeding one 8-bit accumulator

acc_lo = 0;
acc_hi = 64;

for (size_t i=0; i < 64; i++) {
    const uint8_t tmp = satadd(acc_lo, input[i]);
    acc_lo += input[i];

    if (acc_lo == tmp) {    // sums agree => no overflow happened
        acc_hi -= 1;
    }
}

The actual implementation (here the SSE variant) is shown below.

uint32_t sse_8bit_sumbytes(uint8_t* array, size_t size) {

    __m128i accumulator = _mm_setzero_si128();

    for (size_t i=0; i < size; i += 256) {

        __m128i accu_8bit_lo = _mm_setzero_si128();
        // initialized with the number of inner iterations and decremented on
        // every "no carry" event, so it ends up holding the count of overflows
        __m128i accu_8bit_hi = _mm_set1_epi8(16);

        for (size_t j=0; j < 256; j += 16) {
            const __m128i v = _mm_loadu_si128((__m128i*)(array + i + j));

            const __m128i t0 = _mm_add_epi8(v, accu_8bit_lo);   // wrap-around sum
            const __m128i t1 = _mm_adds_epu8(v, accu_8bit_lo);  // saturated sum

            // 0xff in bytes where both sums agree, i.e. where no overflow happened
            const __m128i not_carry = _mm_cmpeq_epi8(t0, t1);

            accu_8bit_lo = t0;
            accu_8bit_hi = _mm_add_epi8(accu_8bit_hi, not_carry); // adding 0xff = subtracting 1
        }

        const __m128i sum_lo = _mm_sad_epu8(accu_8bit_lo, _mm_setzero_si128());
        // each unit of the higher part is worth 256, hence the shift left by 8
        const __m128i sum_hi = _mm_slli_epi32(_mm_sad_epu8(accu_8bit_hi, _mm_setzero_si128()), 8);

        accumulator = _mm_add_epi32(accumulator, sum_lo);
        accumulator = _mm_add_epi32(accumulator, sum_hi);
    }

    return uint32_t(_mm_extract_epi32(accumulator, 0)) +
           uint32_t(_mm_extract_epi32(accumulator, 1)) +
           uint32_t(_mm_extract_epi32(accumulator, 2)) +
           uint32_t(_mm_extract_epi32(accumulator, 3));
}
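
The AVX2 counterpart measured in the experiments works the same way on 32-byte vectors; here is a sketch (the block size and names are my choices, the repository code may differ). It reuses the VPSADBW reduction described in the next section.

uint32_t avx2_8bit_sumbytes(uint8_t* array, size_t size) {

    __m256i accumulator = _mm256_setzero_si256();

    for (size_t i=0; i < size; i += 256) {

        __m256i accu_8bit_lo = _mm256_setzero_si256();
        // 256 bytes / 32 bytes per vector = 8 inner iterations
        __m256i accu_8bit_hi = _mm256_set1_epi8(8);

        for (size_t j=0; j < 256; j += 32) {
            const __m256i v = _mm256_loadu_si256((__m256i*)(array + i + j));

            const __m256i t0 = _mm256_add_epi8(v, accu_8bit_lo);   // wrap-around sum
            const __m256i t1 = _mm256_adds_epu8(v, accu_8bit_lo);  // saturated sum

            const __m256i not_carry = _mm256_cmpeq_epi8(t0, t1);

            accu_8bit_lo = t0;
            accu_8bit_hi = _mm256_add_epi8(accu_8bit_hi, not_carry);
        }

        const __m256i zero   = _mm256_setzero_si256();
        const __m256i sum_lo = _mm256_sad_epu8(accu_8bit_lo, zero);
        const __m256i sum_hi = _mm256_slli_epi32(_mm256_sad_epu8(accu_8bit_hi, zero), 8);

        accumulator = _mm256_add_epi32(accumulator, sum_lo);
        accumulator = _mm256_add_epi32(accumulator, sum_hi);
    }

    return uint32_t(_mm256_extract_epi32(accumulator, 0)) +
           uint32_t(_mm256_extract_epi32(accumulator, 2)) +
           uint32_t(_mm256_extract_epi32(accumulator, 4)) +
           uint32_t(_mm256_extract_epi32(accumulator, 6));
}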

AVX2-specific instruction VPSADBW

AVX2 defines the instruction VPSADBW (_mm256_sad_epu8) that calculates the Sum of Absolute Differences (SAD) of unsigned bytes. A single SAD works on eight-byte subvectors (64-bit slices) and stores the result in the corresponding 64-bit element of the result vector; in the case of AVX2, VPSADBW yields four numbers.

Below is a sample C++ code that naively implements the SAD function.

uint64_t SAD(uint8_t a[], uint8_t b[]) {

    uint64_t result = 0;

    for (size_t i=0; i < 8; i++) {
        if (a[i] > b[i])
            result += a[i] - b[i];
        else
            result += b[i] - a[i];
    }

    return result;
}

If one of the argument vectors is full of zeros, then SAD calculates the sum of bytes of the other argument — which is exactly what we need. Moreover, since the instruction calculates 64-bit sums, bigger arrays can be processed without overflow.

Below is the simplest, non-unrolled C++ code.

uint32_t avx2_sadbw_sumbytes(uint8_t* array, size_t size) {

    const __m256i zero = _mm256_setzero_si256();
    __m256i accumulator = zero;

    for (size_t i=0; i < size; i += 32) {
        const __m256i v  = _mm256_loadu_si256((__m256i*)(array + i));
        const __m256i t0 = _mm256_sad_epu8(v, zero);

        accumulator = _mm256_add_epi32(accumulator, t0);
    }

    // VPSADBW's 64-bit sums live in the even 32-bit elements; the odd ones are zero
    return uint32_t(_mm256_extract_epi32(accumulator, 0)) +
           uint32_t(_mm256_extract_epi32(accumulator, 2)) +
           uint32_t(_mm256_extract_epi32(accumulator, 4)) +
           uint32_t(_mm256_extract_epi32(accumulator, 6));
}

Although the VPSADBW instruction has a latency of 3 cycles, its reciprocal throughput is just 1 cycle. In an unrolled procedure the latency is almost completely hidden. The performance boost over the compiler's code is almost 6x.
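
Such unrolling simply keeps several independent accumulators; a sketch of a four-times unrolled variant (names are mine) might look like:

uint32_t avx2_sadbw_sumbytes_unrolled(uint8_t* array, size_t size) {

    const __m256i zero = _mm256_setzero_si256();
    __m256i acc0 = zero;
    __m256i acc1 = zero;
    __m256i acc2 = zero;
    __m256i acc3 = zero;

    for (size_t i=0; i < size; i += 4*32) {
        const __m256i v0 = _mm256_loadu_si256((__m256i*)(array + i + 0*32));
        const __m256i v1 = _mm256_loadu_si256((__m256i*)(array + i + 1*32));
        const __m256i v2 = _mm256_loadu_si256((__m256i*)(array + i + 2*32));
        const __m256i v3 = _mm256_loadu_si256((__m256i*)(array + i + 3*32));

        // four independent dependency chains keep VPSADBW busy every cycle
        acc0 = _mm256_add_epi32(acc0, _mm256_sad_epu8(v0, zero));
        acc1 = _mm256_add_epi32(acc1, _mm256_sad_epu8(v1, zero));
        acc2 = _mm256_add_epi32(acc2, _mm256_sad_epu8(v2, zero));
        acc3 = _mm256_add_epi32(acc3, _mm256_sad_epu8(v3, zero));
    }

    const __m256i accumulator = _mm256_add_epi32(_mm256_add_epi32(acc0, acc1),
                                                 _mm256_add_epi32(acc2, acc3));

    return uint32_t(_mm256_extract_epi32(accumulator, 0)) +
           uint32_t(_mm256_extract_epi32(accumulator, 2)) +
           uint32_t(_mm256_extract_epi32(accumulator, 4)) +
           uint32_t(_mm256_extract_epi32(accumulator, 6));
}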

Experiments

Tested procedures

procedure                       | description
--------------------------------|------------------------------------------------------------
scalar                          | plain loop
scalar (C++)                    | std::accumulate
SSE                             | convert uint8_t -> uint32_t with PMOVZXBW/PMOVZXWD
SSE (v2)                        | 32-bit accumulators
SSE (sadbw)                     | PSADBW instruction
SSE (sadbw, unrolled)           | the above procedure unrolled four times
SSE (16bit accu)                | 16-bit accumulators, uint16_t -> uint32_t with PMOVZXWD
SSE (16bit accu, v2)            | 16-bit accumulators, uint16_t -> uint32_t with PAND/PSRLD
SSE (16bit accu, v2, unrolled)  | the above procedure unrolled four times
SSE (8bit accu)                 | 8-bit accumulators
AVX2 (v2)                       | 32-bit accumulators
AVX2 (sadbw)                    | VPSADBW instruction
AVX2 (sadbw, unrolled)          | the above procedure unrolled four times
AVX2 (16bit accu, v2)           | 16-bit accumulators, uint16_t -> uint32_t with VPAND/VPSRLD
AVX2 (16bit accu, v2, unrolled) | the above procedure unrolled four times
AVX2 (8bit accu)                | 8-bit accumulators

The procedures were run three times; the best and average timings were noted.

Skylake

CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

GCC: gcc (GCC) 7.3.0

procedure                          best [cycles]   avg. [cycles]   speedup
size 4096 items
scalar 0.294 0.297 1.00 ████████▍
scalar (C++) 0.295 0.297 1.00 ████████▍
SSE 0.440 0.442 0.67 █████▋
SSE (v2) 0.252 0.254 1.17 █████████▊
SSE (sadbw) 0.129 0.132 2.28 ███████████████████▏
SSE (sadbw, unrolled) 0.129 0.132 2.28 ███████████████████▏
SSE (16bit accu) 0.195 0.197 1.51 ████████████▋
SSE (16bit accu, v2) 0.169 0.176 1.74 ██████████████▋
SSE (16bit accu, v2, unrolled) 0.109 0.111 2.70 ██████████████████████▋
SSE (8bit accu) 0.125 0.130 2.35 ███████████████████▊
AVX2 (v2) 0.138 0.140 2.13 █████████████████▉
AVX2 (sadbw) 0.075 0.077 3.92 █████████████████████████████████
AVX2 (sadbw, unrolled) 0.051 0.052 5.76 ████████████████████████████████████████████████▌
AVX2 (16bit accu, v2) 0.091 0.093 3.23 ███████████████████████████▏
AVX2 (16bit accu, v2, unrolled) 0.073 0.075 4.03 █████████████████████████████████▉
AVX2 (8bit accu) 0.080 0.083 3.67 ██████████████████████████████▉
size 16384 items
scalar 0.285 0.286 1.00 ████████▍
scalar (C++) 0.285 0.286 1.00 ████████▍
SSE 0.439 0.440 0.65 █████▍
SSE (v2) 0.251 0.252 1.14 █████████▌
SSE (sadbw) 0.126 0.127 2.26 ███████████████████
SSE (sadbw, unrolled) 0.126 0.127 2.26 ███████████████████
SSE (16bit accu) 0.191 0.192 1.49 ████████████▌
SSE (16bit accu, v2) 0.173 0.181 1.65 █████████████▊
SSE (16bit accu, v2, unrolled) 0.107 0.108 2.66 ██████████████████████▍
SSE (8bit accu) 0.123 0.126 2.32 ███████████████████▌
AVX2 (v2) 0.134 0.135 2.13 █████████████████▉
AVX2 (sadbw) 0.072 0.072 3.96 █████████████████████████████████▎
AVX2 (sadbw, unrolled) 0.048 0.049 5.94 ██████████████████████████████████████████████████
AVX2 (16bit accu, v2) 0.089 0.090 3.20 ██████████████████████████▉
AVX2 (16bit accu, v2, unrolled) 0.070 0.070 4.07 ██████████████████████████████████▎
AVX2 (8bit accu) 0.076 0.077 3.75 ███████████████████████████████▌
size 32768 items
scalar 0.284 0.285 1.00 ████████▍
scalar (C++) 0.284 0.284 1.00 ████████▍
SSE 0.439 0.440 0.65 █████▍
SSE (v2) 0.251 0.252 1.13 █████████▌
SSE (sadbw) 0.126 0.127 2.25 ██████████████████▉
SSE (sadbw, unrolled) 0.126 0.127 2.25 ██████████████████▉
SSE (16bit accu) 0.191 0.192 1.49 ████████████▌
SSE (16bit accu, v2) 0.169 0.173 1.68 ██████████████▏
SSE (16bit accu, v2, unrolled) 0.109 0.109 2.61 █████████████████████▉
SSE (8bit accu) 0.124 0.127 2.29 ███████████████████▎
AVX2 (v2) 0.134 0.135 2.12 █████████████████▊
AVX2 (sadbw) 0.072 0.073 3.94 █████████████████████████████████▏
AVX2 (sadbw, unrolled) 0.048 0.049 5.92 █████████████████████████████████████████████████▊
AVX2 (16bit accu, v2) 0.089 0.090 3.19 ██████████████████████████▊
AVX2 (16bit accu, v2, unrolled) 0.071 0.072 4.00 █████████████████████████████████▋
AVX2 (8bit accu) 0.076 0.077 3.74 ███████████████████████████████▍

SkylakeX

CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz

GCC: gcc (Ubuntu 8.1.0-5ubuntu1~16.04) 8.1.0

procedure                          best [cycles]   avg. [cycles]   speedup
size 4096 items
scalar 0.278 0.385 1.00 ███████▎
scalar (C++) 0.278 0.281 1.00 ███████▎
SSE 0.431 0.434 0.65 ████▊
SSE (v2) 0.216 0.248 1.29 █████████▍
SSE (sadbw) 0.092 0.095 3.02 ██████████████████████▎
SSE (sadbw, unrolled) 0.092 0.120 3.02 ██████████████████████▎
SSE (16bit accu) 0.188 0.191 1.48 ██████████▉
SSE (16bit accu, v2) 0.134 0.138 2.07 ███████████████▎
SSE (16bit accu, v2, unrolled) 0.120 0.121 2.32 █████████████████
SSE (8bit accu) 0.133 0.140 2.09 ███████████████▍
AVX2 (v2) 0.123 0.125 2.26 ████████████████▋
AVX2 (sadbw) 0.060 0.063 4.63 ██████████████████████████████████▏
AVX2 (sadbw, unrolled) 0.041 0.043 6.78 ██████████████████████████████████████████████████
AVX2 (16bit accu, v2) 0.076 0.078 3.66 ██████████████████████████▉
AVX2 (16bit accu, v2, unrolled) 0.064 0.065 4.34 ████████████████████████████████
AVX2 (8bit accu) 0.070 0.072 3.97 █████████████████████████████▎
size 16384 items
scalar 0.280 0.282 1.00 ███████▎
scalar (C++) 0.280 0.281 1.00 ███████▎
SSE 0.435 0.437 0.64 ████▋
SSE (v2) 0.218 0.249 1.28 █████████▍
SSE (sadbw) 0.093 0.094 3.01 ██████████████████████▏
SSE (sadbw, unrolled) 0.093 0.125 3.01 ██████████████████████▏
SSE (16bit accu) 0.188 0.190 1.49 ██████████▉
SSE (16bit accu, v2) 0.134 0.135 2.09 ███████████████▍
SSE (16bit accu, v2, unrolled) 0.124 0.125 2.26 ████████████████▋
SSE (8bit accu) 0.142 0.145 1.97 ██████████████▌
AVX2 (v2) 0.124 0.125 2.26 ████████████████▋
AVX2 (sadbw) 0.062 0.063 4.52 █████████████████████████████████▎
AVX2 (sadbw, unrolled) 0.044 0.044 6.36 ██████████████████████████████████████████████▉
AVX2 (16bit accu, v2) 0.079 0.080 3.54 ██████████████████████████▏
AVX2 (16bit accu, v2, unrolled) 0.066 0.067 4.24 ███████████████████████████████▎
AVX2 (8bit accu) 0.070 0.071 4.00 █████████████████████████████▍
size 32768 items
scalar 0.281 0.282 1.00 ███████▎
scalar (C++) 0.281 0.281 1.00 ███████▎
SSE 0.436 0.437 0.64 ████▊
SSE (v2) 0.219 0.250 1.28 █████████▍
SSE (sadbw) 0.095 0.097 2.96 █████████████████████▊
SSE (sadbw, unrolled) 0.096 0.125 2.93 █████████████████████▌
SSE (16bit accu) 0.189 0.191 1.49 ██████████▉
SSE (16bit accu, v2) 0.134 0.137 2.10 ███████████████▍
SSE (16bit accu, v2, unrolled) 0.125 0.125 2.25 ████████████████▌
SSE (8bit accu) 0.146 0.149 1.92 ██████████████▏
AVX2 (v2) 0.125 0.126 2.25 ████████████████▌
AVX2 (sadbw) 0.062 0.063 4.53 █████████████████████████████████▍
AVX2 (sadbw, unrolled) 0.045 0.045 6.24 ██████████████████████████████████████████████
AVX2 (16bit accu, v2) 0.079 0.080 3.56 ██████████████████████████▏
AVX2 (16bit accu, v2, unrolled) 0.067 0.068 4.19 ██████████████████████████████▉
AVX2 (8bit accu) 0.071 0.072 3.96 █████████████████████████████▏

Westmere

CPU: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz

GCC: gcc (Debian 7.3.0-16) 7.3.0

procedure                          best [cycles]   avg. [cycles]   speedup
size 4096 items
scalar 0.370 0.475 1.00 ███████████████▌
scalar (C++) 0.371 0.374 1.00 ███████████████▍
SSE 0.320 0.323 1.16 █████████████████▉
SSE (v2) 0.311 0.311 1.19 ██████████████████▍
SSE (sadbw) 0.142 0.143 2.61 ████████████████████████████████████████▍
SSE (sadbw, unrolled) 0.141 0.143 2.62 ████████████████████████████████████████▊
SSE (16bit accu) 0.204 0.207 1.81 ████████████████████████████▏
SSE (16bit accu, v2) 0.170 0.174 2.18 █████████████████████████████████▊
SSE (16bit accu, v2, unrolled) 0.115 0.117 3.22 ██████████████████████████████████████████████████
SSE (8bit accu) 0.149 0.151 2.48 ██████████████████████████████████████▌
size 16384 items
scalar 0.375 0.376 1.00 ███████████████▌
scalar (C++) 0.375 0.377 1.00 ███████████████▌
SSE 0.325 0.327 1.15 █████████████████▉
SSE (v2) 0.309 0.318 1.21 ██████████████████▊
SSE (sadbw) 0.152 0.153 2.47 ██████████████████████████████████████▎
SSE (sadbw, unrolled) 0.152 0.153 2.47 ██████████████████████████████████████▎
SSE (16bit accu) 0.214 0.215 1.75 ███████████████████████████▏
SSE (16bit accu, v2) 0.176 0.177 2.13 █████████████████████████████████
SSE (16bit accu, v2, unrolled) 0.126 0.127 2.98 ██████████████████████████████████████████████▎
SSE (8bit accu) 0.159 0.159 2.36 ████████████████████████████████████▋
size 32768 items
scalar 0.375 0.377 1.00 ███████████████▌
scalar (C++) 0.375 0.377 1.00 ███████████████▌
SSE 0.326 0.328 1.15 █████████████████▉
SSE (v2) 0.292 0.293 1.28 ███████████████████▉
SSE (sadbw) 0.153 0.154 2.45 ██████████████████████████████████████
SSE (sadbw, unrolled) 0.153 0.154 2.45 ██████████████████████████████████████
SSE (16bit accu) 0.215 0.217 1.74 ███████████████████████████
SSE (16bit accu, v2) 0.180 0.181 2.08 ████████████████████████████████▍
SSE (16bit accu, v2, unrolled) 0.140 0.140 2.68 █████████████████████████████████████████▋
SSE (8bit accu) 0.163 0.164 2.30 ███████████████████████████████████▊

See also

There is also a second part, dealing with signed bytes.

Acknowledgements

Big thanks to Daniel Lemire, who provides access to the Skylake/SkylakeX machines where I can run benchmarks.

Source code

Source code is available on GitHub.