SIMDized sum of all bytes in the array — part 2: signed bytes

Author: Wojciech Muła
Added on: 2018-11-18

Introduction

This is the second part of SIMDized sum of all bytes in the array. The first part described summing unsigned bytes; here we're going to experiment with summing signed bytes.

The baseline C implementation is:

int32_t sumbytes(int8_t* array, size_t size) {

    int32_t result = 0;

    for (size_t i=0; i < size; i++)
        result += (int32_t)array[i];

    return result;
}

And the C++ implementation:

#include <numeric>

int32_t sumbytes(int8_t* array, size_t size) {
    return std::accumulate(array, array + size, int32_t(0));
}

Algorithm used by GCC

Below is the assembly code of the main loop compiled for Skylake by GCC 7.3.0 with flags -O3 -march=skylake:

vpmovsxbw       %xmm1, %ymm2
vextracti128    $0x1, %ymm1, %xmm1
vpmovsxwd       %xmm2, %ymm3
vextracti128    $0x1, %ymm2, %xmm2
vpmovsxbw       %xmm1, %ymm1
vpaddd          %ymm0, %ymm3, %ymm3
vpmovsxwd       %xmm2, %ymm0
vpaddd          %ymm3, %ymm0, %ymm2
vpmovsxwd       %xmm1, %ymm0
vextracti128    $0x1, %ymm1, %xmm1
vpaddd          %ymm2, %ymm0, %ymm0
vpmovsxwd       %xmm1, %ymm1
vpaddd          %ymm0, %ymm1, %ymm0

The approach used here by GCC is exactly the same as for summing unsigned bytes. There are multiple 32-bit sub-accumulators in a single register, i.e. eight in the case of AVX2 (four in the SSE code), which are added together at the end, forming the scalar result.

To get 32-bit values there is a two-step cast from int8_t to int32_t:

  1. First extend a vector of int8_t into two vectors of int16_t numbers (VPMOVSXBW).
  2. Then, get four vectors of int32_t from the vectors obtained in the previous step (VPMOVSXWD).

The cast instruction VPMOVSX extends only the lower part of a register, in this case the lower half. This is the reason why extractions of the upper halves (VEXTRACTI128) are needed.
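
For reference, the same two-step widening can be written directly with intrinsics. Below is a minimal sketch that processes a single 32-byte chunk; the helper name sum_one_vector is mine, and the instruction ordering differs from GCC's actual output:

#include <immintrin.h>

// add all 32 signed bytes of 'v' to the eight int32_t sub-accumulators in 'acc'
__m256i sum_one_vector(const __m256i v, __m256i acc) {
    // step 1: int8_t -> int16_t, one VPMOVSXBW per 128-bit half
    const __m256i lo16 = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(v));
    const __m256i hi16 = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(v, 1));

    // step 2: int16_t -> int32_t, again one VPMOVSXWD per 128-bit half
    acc = _mm256_add_epi32(acc, _mm256_cvtepi16_epi32(_mm256_castsi256_si128(lo16)));
    acc = _mm256_add_epi32(acc, _mm256_cvtepi16_epi32(_mm256_extracti128_si256(lo16, 1)));
    acc = _mm256_add_epi32(acc, _mm256_cvtepi16_epi32(_mm256_castsi256_si128(hi16)));
    acc = _mm256_add_epi32(acc, _mm256_cvtepi16_epi32(_mm256_extracti128_si256(hi16, 1)));

    return acc;
}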

Explicit casting

There is an instruction VPMOVSXBD which casts directly from int8_t to int32_t. It takes the lower 64 bits of the input register, thus to convert all bytes of an AVX2 register the instruction has to be invoked four times, and some shifting is needed.

Below is a sample implementation:

int32_t avx2_sumsignedbytes(int8_t* array, size_t size) {

    __m256i accumulator = _mm256_setzero_si256();

    for (size_t i=0; i < size; i += 32) {
        const __m256i v = _mm256_loadu_si256((__m256i*)(array + i));

        // split the 256-bit vector into two 128-bit halves
        const __m128i lo = _mm256_extracti128_si256(v, 0);
        const __m128i hi = _mm256_extracti128_si256(v, 1);

        // VPMOVSXBD widens only the lower 64 bits of its input, so each
        // half is converted twice: directly and after an 8-byte shift
        const __m256i t0 = _mm256_cvtepi8_epi32(lo);
        const __m256i t1 = _mm256_cvtepi8_epi32(hi);
        const __m256i t2 = _mm256_cvtepi8_epi32(_mm_bsrli_si128(lo, 8));
        const __m256i t3 = _mm256_cvtepi8_epi32(_mm_bsrli_si128(hi, 8));

        accumulator = _mm256_add_epi32(accumulator, t0);
        accumulator = _mm256_add_epi32(accumulator, t1);
        accumulator = _mm256_add_epi32(accumulator, t2);
        accumulator = _mm256_add_epi32(accumulator, t3);
    }

    return int32_t(_mm256_extract_epi32(accumulator, 0)) +
           int32_t(_mm256_extract_epi32(accumulator, 1)) +
           int32_t(_mm256_extract_epi32(accumulator, 2)) +
           int32_t(_mm256_extract_epi32(accumulator, 3)) +
           int32_t(_mm256_extract_epi32(accumulator, 4)) +
           int32_t(_mm256_extract_epi32(accumulator, 5)) +
           int32_t(_mm256_extract_epi32(accumulator, 6)) +
           int32_t(_mm256_extract_epi32(accumulator, 7));
}
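
A side note: the final reduction above extracts the eight 32-bit lanes one by one; it can also be done with a few vector additions. A minimal sketch of such a horizontal sum (my addition, not part of the benchmarked code):

#include <immintrin.h>

// sum the eight int32_t elements of an AVX2 register
int32_t hsum_epi32(const __m256i v) {
    // fold the 256-bit register into a 128-bit one
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(v),
                              _mm256_extracti128_si256(v, 1));
    // fold 4 x int32_t down to a single int32_t
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}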

Shift-based casting

In this approach we also cast directly from 8-bit to 32-bit numbers, but using 32-bit shifts.

To convert the 3rd byte of every 32-bit word of a register we simply do an arithmetic shift right by 24 bits. The shift replicates the most significant bit (the sign bit) of our 8-bit value.

However, to do the same for the other bytes of a 32-bit word we need two shifts: first a shift left, which places the byte at the 3rd position, then an arithmetic shift right, which extends the type.
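
The same trick is perhaps easier to see on scalars. A minimal illustration (my addition; it assumes an arithmetic right shift of negative integers, which all mainstream compilers provide):

#include <cstdint>

// extract byte k (0 = least significant) of a 32-bit word as a signed value
int32_t extract_signed_byte(uint32_t w, int k) {
    // move byte k to the most significant position, then let the
    // arithmetic right shift replicate its sign bit
    return int32_t(w << ((3 - k) * 8)) >> 24;
}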

The algorithm is:

  1. Load the input vector.

    // v   = [  5 | -1 |  2 | -3 |  7 |  1 |  2 |  3 | -6 | -1 | -3 |  8 | -7 | -12|  3 |  2 ]
    const __m256i v = _mm256_loadu_si256(ptr);
    
  2. Extend 3rd bytes.

    // v0  = [                 5 |                 7 |                -6 |                -7 ]
    const __m256i v0 = _mm256_srai_epi32(v, 3*8);
    
  3. Extend 2nd bytes.

    // v1  = [ -1 |  2 | -3 |  0 |  1 |  2 |  3 |  0 | -1 | -3 |  8 |  0 | -12|  3 |  2 |  0 ] >>> 24
    //     = [                -1 |                 1 |                -1 |               -12 ]
    const __m256i v1 = _mm256_srai_epi32(_mm256_slli_epi32(v, 1*8), 3*8);
    
  4. Extend 1st bytes.

    // v2  = [  2 | -3 |  0 |  0 |  2 |  3 |  0 |  0 | -3 |  8 |  0 |  0 |  3 |  2 |  0 |  0 ] >>> 24
    //     = [                 2 |                 2 |                -3 |                 3 ]
    const __m256i v2 = _mm256_srai_epi32(_mm256_slli_epi32(v, 2*8), 3*8);
    
  5. Extend 0th bytes.

    // v3  = [ -3 |  0 |  0 |  0 |  3 |  0 |  0 |  0 |  8 |  0 |  0 |  0 |  2 |  0 |  0 |  0 ] >>> 24
    //     = [                -3 |                 3 |                 8 |                 2 ]
    const __m256i v3 = _mm256_srai_epi32(_mm256_slli_epi32(v, 3*8), 3*8);
    
  6. Update the accumulator.

    accumulator = _mm256_add_epi32(accumulator, v0);
    accumulator = _mm256_add_epi32(accumulator, v1);
    accumulator = _mm256_add_epi32(accumulator, v2);
    accumulator = _mm256_add_epi32(accumulator, v3);
    

Sample implementation:

int32_t avx2_sumsignedbytes_variant2(int8_t* array, size_t size) {

    __m256i accumulator = _mm256_setzero_si256();

    for (size_t i=0; i < size; i += 32) {
        const __m256i v = _mm256_loadu_si256((__m256i*)(array + i));
        const __m256i v0 = _mm256_srai_epi32(v, 3*8);
        const __m256i v1 = _mm256_srai_epi32(_mm256_slli_epi32(v, 1*8), 3*8);
        const __m256i v2 = _mm256_srai_epi32(_mm256_slli_epi32(v, 2*8), 3*8);
        const __m256i v3 = _mm256_srai_epi32(_mm256_slli_epi32(v, 3*8), 3*8);

        accumulator = _mm256_add_epi32(accumulator, v0);
        accumulator = _mm256_add_epi32(accumulator, v1);
        accumulator = _mm256_add_epi32(accumulator, v2);
        accumulator = _mm256_add_epi32(accumulator, v3);
    }

    return int32_t(_mm256_extract_epi32(accumulator, 0)) +
           int32_t(_mm256_extract_epi32(accumulator, 1)) +
           int32_t(_mm256_extract_epi32(accumulator, 2)) +
           int32_t(_mm256_extract_epi32(accumulator, 3)) +
           int32_t(_mm256_extract_epi32(accumulator, 4)) +
           int32_t(_mm256_extract_epi32(accumulator, 5)) +
           int32_t(_mm256_extract_epi32(accumulator, 6)) +
           int32_t(_mm256_extract_epi32(accumulator, 7));
}

AVX2-specific instruction VPSADBW

AVX2 provides the instruction VPSADBW (_mm256_sad_epu8) that calculates the Sum of Absolute Differences (SAD) of unsigned bytes. A single SAD operation works on an eight-byte subvector (a 64-bit slice) and stores the result in the corresponding 64-bit element of the result vector; in the case of AVX2, VPSADBW yields four numbers.

And while the instruction is perfect for summing unsigned bytes, in the case of signed bytes it is not that great. Since VPSADBW works only on unsigned bytes, it has to be used twice: once for the positive and once for the negative elements of the input vector. The total is then the sum of the positive elements minus the sum of the absolute values of the negative ones; for example, for [5, -1, 2, -3] we get 7 - 4 = 3.

Below is the outline of algorithm:

  1. Load the input vector.

    // v   = [  5 | -1 |  2 | -3 |  7 |  1 |  2 |  3 | -6 | -1 | -3 |  8 | -7 | -12|  3 |  2 ]
    const __m256i v = _mm256_loadu_si256(ptr);
    
  2. Find mask for negative numbers.

    // m   = [ 00 | ff | 00 | ff | 00 | 00 | 00 | 00 | ff | ff | ff | 00 | ff | ff | 00 | 00 ]
    const __m256i m  = _mm256_cmplt_epi8(v, zero);
    
  3. Keep only the positive elements.

    // pos = [  5 |  0 |  2 |  0 |  7 |  1 |  2 |  3 |  0 |  0 |  0 |  8 |  0 |  0 |  3 |  2 ]
    const __m256i pos = _mm256_andnot_si256(m, v);
    
  4. Sum the positive elements.

    // t0  = [                 7 |                13 |                 8 |                 5 ]
    const __m256i t0 = _mm256_sad_epu8(pos, zero);
    
  5. Get the absolute values.

    // va  = [  5 |  1 |  2 |  3 |  7 |  1 |  2 |  3 |  6 |  1 |  3 |  8 |  7 |  12|  3 |  2 ]
    const __m256i va = _mm256_abs_epi8(v);
    
  6. Keep only the absolute values of the negative elements.

    // neg = [  0 |  1 |  0 |  3 |  0 |  0 |  0 |  0 |  6 |  1 |  3 |  0 |  7 |  0 |  0 |  0 ]
    const __m256i neg = _mm256_and_si256(m, va);
    
  7. Sum the negative elements.

    // t1  = [                 4 |                 0 |                10 |                 7 ]
    const __m256i t1 = _mm256_sad_epu8(neg, zero);
    
  8. Update the accumulators.

    positive = _mm256_add_epi32(positive, t0);
    negative = _mm256_sub_epi32(negative, t1);
    

Below is the actual implementation.

#define _mm256_cmplt_epi8(a, b) _mm256_cmpgt_epi8(b, a)

int32_t avx2_sadbw_sumsignedbytes(int8_t* array, size_t size) {

    const __m256i zero = _mm256_setzero_si256();
    __m256i positive = zero;
    __m256i negative = zero;

    for (size_t i=0; i < size; i += 32) {
        const __m256i v  = _mm256_loadu_si256((__m256i*)(array + i));
        const __m256i m  = _mm256_cmplt_epi8(v, zero);
        const __m256i va = _mm256_abs_epi8(v);

        // sum just positive numbers
        const __m256i t0 = _mm256_sad_epu8(_mm256_andnot_si256(m, v), zero);

        // sum just negative numbers
        const __m256i t1 = _mm256_sad_epu8(_mm256_and_si256(m, va), zero);

        positive = _mm256_add_epi32(positive, t0);
        negative = _mm256_sub_epi32(negative, t1);
    }

    const __m256i accumulator = _mm256_add_epi32(positive, negative);

    // VPSADBW produces 64-bit partial sums, so only the even 32-bit
    // lanes of the accumulator carry data
    return int32_t(_mm256_extract_epi32(accumulator, 0)) +
           int32_t(_mm256_extract_epi32(accumulator, 2)) +
           int32_t(_mm256_extract_epi32(accumulator, 4)) +
           int32_t(_mm256_extract_epi32(accumulator, 6));
}
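
For completeness, a minimal test driver (my addition, not from the article's repository); note that all the procedures above assume that size is a multiple of 32:

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int8_t> data(4096);           // 4096 is a multiple of 32
    for (size_t i = 0; i < data.size(); i++)
        data[i] = int8_t(i * 17);             // arbitrary test pattern

    const int32_t expected = sumbytes(data.data(), data.size());
    const int32_t actual   = avx2_sadbw_sumsignedbytes(data.data(), data.size());

    printf("scalar = %d, AVX2 (sadbw) = %d\n", expected, actual);
    return expected == actual ? 0 : 1;
}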

Experiments

Tested procedures

scalar                  plain loop
scalar (C++)            std::accumulate
SSE                     explicit casting
SSE (v2)                shift-based casting
SSE (sadbw)             PSADBW instruction
SSE (sadbw, unrolled)   the above procedure unrolled four times
AVX2                    explicit casting
AVX2 (v2)               shift-based casting
AVX2 (sadbw)            VPSADBW instruction
AVX2 (sadbw, unrolled)  the above procedure unrolled four times (a sketch follows below)
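
A sketch of what "unrolled four times" means for the SADBW variant; this is my reconstruction for illustration only, the actual benchmarked source is on github:

int32_t avx2_sadbw_sumsignedbytes_unrolled(int8_t* array, size_t size) {

    const __m256i zero = _mm256_setzero_si256();
    __m256i positive = zero;
    __m256i negative = zero;

    // process one 32-byte chunk; the loads and SAD computations of
    // different chunks are independent and can overlap in the pipeline
    auto step = [&](const int8_t* ptr) {
        const __m256i v  = _mm256_loadu_si256((const __m256i*)ptr);
        const __m256i m  = _mm256_cmpgt_epi8(zero, v);   // v < 0
        const __m256i va = _mm256_abs_epi8(v);

        positive = _mm256_add_epi32(positive, _mm256_sad_epu8(_mm256_andnot_si256(m, v), zero));
        negative = _mm256_sub_epi32(negative, _mm256_sad_epu8(_mm256_and_si256(m, va), zero));
    };

    for (size_t i=0; i < size; i += 4*32) {
        step(array + i + 0*32);
        step(array + i + 1*32);
        step(array + i + 2*32);
        step(array + i + 3*32);
    }

    const __m256i accumulator = _mm256_add_epi32(positive, negative);

    return int32_t(_mm256_extract_epi32(accumulator, 0)) +
           int32_t(_mm256_extract_epi32(accumulator, 2)) +
           int32_t(_mm256_extract_epi32(accumulator, 4)) +
           int32_t(_mm256_extract_epi32(accumulator, 6));
}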

The procedures were run three times; the best and the average measurements were noted.

Skylake

CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

GCC: gcc (GCC) 7.3.0

procedure   best [cycles]   avg. [cycles]   speedup
size 4096 items
scalar 0.293 0.296 1.00 █████████████████▉
scalar (C++) 0.294 0.297 1.00 █████████████████▊
SSE 0.440 0.442 0.67 ███████████▉
SSE (v2) 0.254 0.256 1.15 ████████████████████▋
SSE (sadbw) 0.223 0.225 1.31 ███████████████████████▌
SSE (sadbw, unrolled) 0.223 0.224 1.31 ███████████████████████▌
AVX2 0.223 0.225 1.31 ███████████████████████▌
AVX2 (v2) 0.146 0.150 2.01 ███████████████████████████████████▉
AVX2 (sadbw) 0.129 0.131 2.27 ████████████████████████████████████████▋
AVX2 (sadbw, unrolled) 0.105 0.108 2.79 ██████████████████████████████████████████████████
size 16384 items
scalar 0.285 0.286 1.00 █████████████████▉
scalar (C++) 0.285 0.286 1.00 █████████████████▉
SSE 0.439 0.440 0.65 ███████████▋
SSE (v2) 0.251 0.252 1.14 ████████████████████▎
SSE (sadbw) 0.220 0.221 1.30 ███████████████████████▏
SSE (sadbw, unrolled) 0.220 0.221 1.30 ███████████████████████▏
AVX2 0.220 0.221 1.30 ███████████████████████▏
AVX2 (v2) 0.142 0.148 2.01 ███████████████████████████████████▉
AVX2 (sadbw) 0.126 0.127 2.26 ████████████████████████████████████████▌
AVX2 (sadbw, unrolled) 0.103 0.105 2.77 █████████████████████████████████████████████████▌
size 32768 items
scalar 0.284 0.284 1.00 █████████████████▉
scalar (C++) 0.284 0.284 1.00 █████████████████▉
SSE 0.439 0.440 0.65 ███████████▌
SSE (v2) 0.251 0.252 1.13 ████████████████████▎
SSE (sadbw) 0.220 0.221 1.29 ███████████████████████▏
SSE (sadbw, unrolled) 0.220 0.221 1.29 ███████████████████████▏
AVX2 0.220 0.221 1.29 ███████████████████████▏
AVX2 (v2) 0.143 0.150 1.99 ███████████████████████████████████▌
AVX2 (sadbw) 0.126 0.126 2.25 ████████████████████████████████████████▍
AVX2 (sadbw, unrolled) 0.104 0.105 2.73 ████████████████████████████████████████████████▉

SkylakeX

CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz

GCC: gcc (Ubuntu 8.1.0-5ubuntu1~16.04) 8.1.0

procedure   best [cycles]   avg. [cycles]   speedup
size 4096 items
scalar 0.280 0.291 1.00 █████████████████▊
scalar (C++) 0.280 0.282 1.00 █████████████████▊
SSE 0.433 0.437 0.65 ███████████▌
SSE (v2) 0.248 0.250 1.13 ████████████████████
SSE (sadbw) 0.202 0.204 1.39 ████████████████████████▋
SSE (sadbw, unrolled) 0.202 0.204 1.39 ████████████████████████▋
AVX2 0.218 0.220 1.28 ██████████████████████▊
AVX2 (v2) 0.133 0.136 2.11 █████████████████████████████████████▍
AVX2 (sadbw) 0.116 0.119 2.41 ██████████████████████████████████████████▉
AVX2 (sadbw, unrolled) 0.100 0.102 2.80 █████████████████████████████████████████████████▊
size 16384 items
scalar 0.281 0.283 1.00 █████████████████▊
scalar (C++) 0.280 0.281 1.00 █████████████████▊
SSE 0.436 0.437 0.64 ███████████▍
SSE (v2) 0.249 0.250 1.13 ████████████████████
SSE (sadbw) 0.202 0.203 1.39 ████████████████████████▊
SSE (sadbw, unrolled) 0.202 0.203 1.39 ████████████████████████▊
AVX2 0.218 0.219 1.29 ██████████████████████▉
AVX2 (v2) 0.133 0.134 2.11 █████████████████████████████████████▌
AVX2 (sadbw) 0.117 0.118 2.40 ██████████████████████████████████████████▋
AVX2 (sadbw, unrolled) 0.100 0.101 2.81 ██████████████████████████████████████████████████
size 32768 items
scalar 0.281 0.282 1.00 █████████████████▊
scalar (C++) 0.281 0.282 1.00 █████████████████▊
SSE 0.436 0.438 0.64 ███████████▍
SSE (v2) 0.249 0.250 1.13 ████████████████████
SSE (sadbw) 0.203 0.204 1.38 ████████████████████████▋
SSE (sadbw, unrolled) 0.203 0.204 1.38 ████████████████████████▋
AVX2 0.219 0.219 1.28 ██████████████████████▊
AVX2 (v2) 0.134 0.135 2.10 █████████████████████████████████████▎
AVX2 (sadbw) 0.117 0.118 2.40 ██████████████████████████████████████████▋
AVX2 (sadbw, unrolled) 0.101 0.102 2.78 █████████████████████████████████████████████████▌

Conclusions

The PSADBW-based procedures are the clear winners: the unrolled AVX2 variant runs roughly 2.7-2.8 times faster than the scalar code on both tested CPUs. Shift-based casting comes second. Explicit casting does not pay off at all; its SSE variant is even slower than the scalar loop. Note also that the "scalar" baseline is itself auto-vectorized by GCC, as the assembly dump above shows, so it is already a fairly strong reference point.

Acknowledgements

Big thanks to Daniel Lemire, who provides access to the Skylake/SkylakeX machines where I can run benchmarks.

Source code

Source code is available on github.