Author: | Wojciech Muła |
---|---|
Added on: | 2018-11-18 |
Updated on: | 2019-02-04 (faster PSADBW approach contributed by Matthieu Darbois, update results from GCC); 2019-02-03 (new method also contributed by Matthieu Darbois) |
This is the second part of SIMDized sum of all bytes in the array. The first part describes summing unsigned bytes; here we are going to experiment with summing signed bytes.
The baseline C implementation is:
```cpp
int32_t sumbytes(int8_t* array, size_t size) {

    int32_t result = 0;

    for (size_t i=0; i < size; i++)
        result += int32_t(array[i]);

    return result;
}
```
And the C++ implementation:
```cpp
#include <numeric>

int32_t sumbytes(int8_t* array, size_t size) {
    return std::accumulate(array, array + size, int32_t(0));
}
```
Below is the assembly code of the main loop compiled for Skylake by GCC 7.3.0 with flags -O3 -march=skylake:
```
vpmovsxbw    %xmm1, %ymm2
vextracti128 $0x1, %ymm1, %xmm1
vpmovsxwd    %xmm2, %ymm3
vextracti128 $0x1, %ymm2, %xmm2
vpmovsxbw    %xmm1, %ymm1
vpaddd       %ymm0, %ymm3, %ymm3
vpmovsxwd    %xmm2, %ymm0
vpaddd       %ymm3, %ymm0, %ymm2
vpmovsxwd    %xmm1, %ymm0
vextracti128 $0x1, %ymm1, %xmm1
vpaddd       %ymm2, %ymm0, %ymm0
vpmovsxwd    %xmm1, %ymm1
vpaddd       %ymm0, %ymm1, %ymm0
```
The approach GCC uses here is exactly the same as for summing unsigned bytes: a single register holds multiple 32-bit sub-accumulators, eight in the AVX2 code (four in the SSE code), which are added together at the end, forming the scalar result.
To get 32-bit values, the cast from int8_t to int32_t is done in two steps: first to int16_t (VPMOVSXBW), then to int32_t (VPMOVSXWD).
A VPMOVSX cast instruction extends only the lower part of a register, in this case the lower half. This is the reason why extractions of halves (VEXTRACTI128) are needed.
There is also the instruction VPMOVSXBD, which casts directly from int8_t to int32_t. It takes the lower 64 bits of the input register, thus to convert all bytes of an AVX2 register the instruction has to be invoked four times, and some shifting is needed.
Below is a sample implementation:
```cpp
int32_t avx2_sumsignedbytes(int8_t* array, size_t size) {

    __m256i accumulator = _mm256_setzero_si256();

    for (size_t i=0; i < size; i += 32) {
        const __m256i v  = _mm256_loadu_si256((__m256i*)(array + i));
        const __m128i lo = _mm256_extracti128_si256(v, 0);
        const __m128i hi = _mm256_extracti128_si256(v, 1);

        const __m256i t0 = _mm256_cvtepi8_epi32(lo);
        const __m256i t1 = _mm256_cvtepi8_epi32(hi);
        const __m256i t2 = _mm256_cvtepi8_epi32(_mm_bsrli_si128(lo, 8));
        const __m256i t3 = _mm256_cvtepi8_epi32(_mm_bsrli_si128(hi, 8));

        accumulator = _mm256_add_epi32(accumulator, t0);
        accumulator = _mm256_add_epi32(accumulator, t1);
        accumulator = _mm256_add_epi32(accumulator, t2);
        accumulator = _mm256_add_epi32(accumulator, t3);
    }

    return int32_t(_mm256_extract_epi32(accumulator, 0)) +
           int32_t(_mm256_extract_epi32(accumulator, 1)) +
           int32_t(_mm256_extract_epi32(accumulator, 2)) +
           int32_t(_mm256_extract_epi32(accumulator, 3)) +
           int32_t(_mm256_extract_epi32(accumulator, 4)) +
           int32_t(_mm256_extract_epi32(accumulator, 5)) +
           int32_t(_mm256_extract_epi32(accumulator, 6)) +
           int32_t(_mm256_extract_epi32(accumulator, 7));
}
```
In this approach we also cast directly from 8-bit into 32-bit numbers, but we use 32-bit shifts.
To extend the 3rd byte of each 32-bit word of a register, we simply do an arithmetic shift right by 24 bits. This shift replicates the most significant bit (the sign bit) of our 8-bit value.
However, to do the same for the other bytes of a 32-bit word we need two shifts: first a left shift, which places the byte at the 3rd position, and then an arithmetic shift right, which extends the type.
The algorithm is:
Load the input vector.
```cpp
// v = [  5 | -1 |  2 | -3 |  7 |  1 |  2 |  3 | -6 | -1 | -3 |  8 | -7 | -12 |  3 |  2 ]
const __m256i v = _mm256_loadu_si256(ptr);
```
Extend 3rd bytes.
```cpp
// v0 = [  5 |  7 | -6 | -7 ]
const __m256i v0 = _mm256_srai_epi32(v, 3*8);
```
Extend 2nd bytes.
```cpp
// v1 = [ -1 |  2 | -3 |  0 |  1 |  2 |  3 |  0 | -1 | -3 |  8 |  0 | -12 |  3 |  2 |  0 ] >>> 24
//    = [ -1 |  1 | -1 | -12 ]
const __m256i v1 = _mm256_srai_epi32(_mm256_slli_epi32(v, 1*8), 3*8);
```
Extend 1st bytes.
```cpp
// v2 = [  2 | -3 |  0 |  0 |  2 |  3 |  0 |  0 | -3 |  8 |  0 |  0 |  3 |  2 |  0 |  0 ] >>> 24
//    = [  2 |  2 | -3 |  3 ]
const __m256i v2 = _mm256_srai_epi32(_mm256_slli_epi32(v, 2*8), 3*8);
```
Extend 0th bytes.
```cpp
// v3 = [ -3 |  0 |  0 |  0 |  3 |  0 |  0 |  0 |  8 |  0 |  0 |  0 |  2 |  0 |  0 |  0 ] >>> 24
//    = [ -3 |  3 |  8 |  2 ]
const __m256i v3 = _mm256_srai_epi32(_mm256_slli_epi32(v, 3*8), 3*8);
```
Update the accumulator.
```cpp
accumulator = _mm256_add_epi32(accumulator, v0);
accumulator = _mm256_add_epi32(accumulator, v1);
accumulator = _mm256_add_epi32(accumulator, v2);
accumulator = _mm256_add_epi32(accumulator, v3);
```
A sample implementation:
```cpp
int32_t avx2_sumsignedbytes_variant2(int8_t* array, size_t size) {

    __m256i accumulator = _mm256_setzero_si256();

    for (size_t i=0; i < size; i += 32) {
        const __m256i v = _mm256_loadu_si256((__m256i*)(array + i));

        const __m256i v0 = _mm256_srai_epi32(v, 3*8);
        const __m256i v1 = _mm256_srai_epi32(_mm256_slli_epi32(v, 1*8), 3*8);
        const __m256i v2 = _mm256_srai_epi32(_mm256_slli_epi32(v, 2*8), 3*8);
        const __m256i v3 = _mm256_srai_epi32(_mm256_slli_epi32(v, 3*8), 3*8);

        accumulator = _mm256_add_epi32(accumulator, v0);
        accumulator = _mm256_add_epi32(accumulator, v1);
        accumulator = _mm256_add_epi32(accumulator, v2);
        accumulator = _mm256_add_epi32(accumulator, v3);
    }

    return int32_t(_mm256_extract_epi32(accumulator, 0)) +
           int32_t(_mm256_extract_epi32(accumulator, 1)) +
           int32_t(_mm256_extract_epi32(accumulator, 2)) +
           int32_t(_mm256_extract_epi32(accumulator, 3)) +
           int32_t(_mm256_extract_epi32(accumulator, 4)) +
           int32_t(_mm256_extract_epi32(accumulator, 5)) +
           int32_t(_mm256_extract_epi32(accumulator, 6)) +
           int32_t(_mm256_extract_epi32(accumulator, 7));
}
```
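The shift trick can be sanity-checked with plain scalar code. The helper below (`extend_byte`, a name made up for this illustration) models what one 32-bit lane does:

```cpp
#include <cstdint>

// Scalar model of the shift-based extension: extract byte k (0 = least
// significant) of a 32-bit word with sign extension, using one left shift
// followed by one arithmetic right shift. The right shift replicates the
// sign bit of the extracted byte, just as VPSRAD does per 32-bit lane.
// (Arithmetic right shift of a negative int32_t is what all mainstream
// compilers do, although it is implementation-defined before C++20.)
int32_t extend_byte(uint32_t word, int k) {
    return int32_t(word << ((3 - k) * 8)) >> 24;
}
```

For example, with `word = 0x05FF02FD` the extracted values for k = 0..3 are -3, 2, -1 and 5.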
AVX2 has the instruction VPSADBW (_mm256_sad_epu8), which calculates the Sum of Absolute Differences (SAD) of unsigned bytes. A single SAD operation works on eight-byte subvectors (64-bit slices) and stores the result in the corresponding 64-bit element of the result vector; in the case of AVX2, VPSADBW yields four numbers.
While the instruction is perfect for summing unsigned bytes, in the case of signed bytes it is not that great. Since it works only on unsigned bytes, VPSADBW has to be used twice: once for the positive and once for the negative elements of the input vector.
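To make the data flow concrete, here is a scalar model of a single 64-bit lane of VPSADBW (a sketch written for this text, not part of the benchmarked code); with a zero second operand the absolute differences degenerate into a plain sum of unsigned bytes:

```cpp
#include <cstdint>
#include <cstdlib>

// Scalar model of one 64-bit lane of PSADBW/VPSADBW: the sum of absolute
// differences of eight unsigned bytes. With b[i] == 0 it simply sums the
// bytes of a, which is why the instruction is so handy for this problem.
uint64_t sad_lane(const uint8_t a[8], const uint8_t b[8]) {
    uint64_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += uint64_t(std::abs(int(a[i]) - int(b[i])));
    return sum;
}
```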
Below is the outline of the algorithm:
Load the input vector.
```cpp
// v = [  5 | -1 |  2 | -3 |  7 |  1 |  2 |  3 | -6 | -1 | -3 |  8 | -7 | -12 |  3 |  2 ]
const __m256i v = _mm256_loadu_si256(ptr);
```
Find the mask of negative numbers.
```cpp
// m = [ 00 | ff | 00 | ff | 00 | 00 | 00 | 00 | ff | ff | ff | 00 | ff | ff | 00 | 00 ]
const __m256i m = _mm256_cmplt_epi8(v, zero);
```
Keep only the positive elements.
```cpp
// pos = [  5 |  0 |  2 |  0 |  7 |  1 |  2 |  3 |  0 |  0 |  0 |  8 |  0 |  0 |  3 |  2 ]
const __m256i pos = _mm256_andnot_si256(m, v);
```
Sum the positive elements.
```cpp
// t0 = [ 14 | 13 |  8 |  5 ]
const __m256i t0 = _mm256_sad_epu8(pos, zero);
```
Get the absolute values.
```cpp
// va = [  5 |  1 |  2 |  3 |  7 |  1 |  2 |  3 |  6 |  1 |  3 |  8 |  7 | 12 |  3 |  2 ]
const __m256i va = _mm256_abs_epi8(v);
```
Keep only the absolute values of the negative elements.
```cpp
// neg = [  0 |  1 |  0 |  3 |  0 |  0 |  0 |  0 |  6 |  1 |  3 |  0 |  7 |  0 |  0 |  0 ]
const __m256i neg = _mm256_and_si256(m, va);
```
Sum the negative elements.
```cpp
// t1 = [  4 |  0 | 10 |  7 ]
const __m256i t1 = _mm256_sad_epu8(neg, zero);
```
Update the accumulators.
```cpp
positive = _mm256_add_epi32(positive, t0);
negative = _mm256_sub_epi32(negative, t1);
```
Below is the actual implementation.
```cpp
#define _mm256_cmplt_epi8(a, b) _mm256_cmpgt_epi8(b, a)

int32_t avx2_sadbw_sumsignedbytes(int8_t* array, size_t size) {

    const __m256i zero = _mm256_setzero_si256();
    __m256i positive = zero;
    __m256i negative = zero;

    for (size_t i=0; i < size; i += 32) {
        const __m256i v  = _mm256_loadu_si256((__m256i*)(array + i));
        const __m256i m  = _mm256_cmplt_epi8(v, zero);
        const __m256i va = _mm256_abs_epi8(v);

        // sum just positive numbers
        const __m256i t0 = _mm256_sad_epu8(_mm256_andnot_si256(m, v), zero);

        // sum just negative numbers
        const __m256i t1 = _mm256_sad_epu8(_mm256_and_si256(m, va), zero);

        positive = _mm256_add_epi32(positive, t0);
        negative = _mm256_sub_epi32(negative, t1);
    }

    const __m256i accumulator = _mm256_add_epi32(positive, negative);

    return int32_t(_mm256_extract_epi32(accumulator, 0)) +
           int32_t(_mm256_extract_epi32(accumulator, 2)) +
           int32_t(_mm256_extract_epi32(accumulator, 4)) +
           int32_t(_mm256_extract_epi32(accumulator, 6));
}
```
Matthieu Darbois came up with a much better method utilizing VPSADBW. The input bytes are moved from the int8_t range, i.e. -128 .. +127, into the uint8_t range by adding 128. This addition sets the input range to 0 .. 255, and thanks to that a single invocation of VPSADBW is sufficient to add the unsigned bytes. The input adjustment is done with a cheap binary operation (XOR with 0x80), thus the procedure is as fast as its unsigned counterpart.
Due to the input adjustment, the final result is increased by 128 times the number of bytes. This value must be subtracted to get the real signed sum.
Below is the actual implementation.
```cpp
int32_t avx2_sadbw_variant_sumsignedbytes(int8_t* array, size_t size) {

    const __m256i zero   = _mm256_setzero_si256();
    const __m256i addend = _mm256_set1_epi8(-128);
    __m256i accumulator  = zero;

    for (size_t i=0; i < size; i += 32) {
        const __m256i v  = _mm256_loadu_si256((__m256i*)(array + i));
        const __m256i t0 = _mm256_xor_si256(v, addend);
        const __m256i t1 = _mm256_sad_epu8(t0, zero);

        accumulator = _mm256_add_epi32(accumulator, t1);
    }

    return int32_t(_mm256_extract_epi32(accumulator, 0)) +
           int32_t(_mm256_extract_epi32(accumulator, 2)) +
           int32_t(_mm256_extract_epi32(accumulator, 4)) +
           int32_t(_mm256_extract_epi32(accumulator, 6)) -
           128 * int32_t(size);
}
```
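The bias-and-correct scheme is easy to verify with a scalar equivalent (a sketch written for this text; `biased_sum` is a hypothetical helper name):

```cpp
#include <cstdint>
#include <cstddef>

// Scalar equivalent of the range-shift trick: XOR with 0x80 adds 128
// modulo 256, turning each int8_t into a uint8_t in 0..255. Summing the
// biased bytes and subtracting 128*size recovers the true signed sum.
int32_t biased_sum(const int8_t* array, size_t size) {
    int32_t sum = 0;
    for (size_t i = 0; i < size; i++)
        sum += int32_t(uint8_t(array[i]) ^ 0x80);
    return sum - 128 * int32_t(size);
}
```

For instance, for the bytes {-128, 127, -1, 2} the biased sum is 0 + 255 + 127 + 130 = 512, and 512 - 128*4 = 0, the correct signed sum.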
Matthieu Darbois also proposed using the instruction VPMADDUBSW (_mm256_maddubs_epi16) to perform horizontal additions of bytes.
The instruction itself is more complex: it vertically multiplies unsigned bytes from one vector by signed bytes from the other, and then adds pairs of the intermediate signed 16-bit products, forming a new vector of 16-bit values. If the unsigned vector is filled with 1, then effectively we add pairs of signed bytes horizontally.
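In scalar terms, a single 16-bit lane of VPMADDUBSW behaves like the sketch below (a model written for illustration; the instruction's saturation is omitted, since with a multiplier of 1 a pair sum stays within -256 .. 254 and never saturates):

```cpp
#include <cstdint>

// Scalar model of one 16-bit lane of VPMADDUBSW: multiply an unsigned
// byte by a signed byte, twice, and add the two signed 16-bit products.
// With a0 == a1 == 1 this degenerates into b0 + b1, i.e. a horizontal
// addition of two signed bytes.
int16_t maddubs_lane(uint8_t a0, int8_t b0, uint8_t a1, int8_t b1) {
    return int16_t(int16_t(a0) * b0 + int16_t(a1) * b1);
}
```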
The inner loop of the algorithm uses VPMADDUBSW to obtain 16-bit sums which update a 16-bit local accumulator. To avoid overflow, after a fixed number of iterations (128) the local accumulator is widened to 32-bit values and added to the global 32-bit accumulator. (Each 16-bit sum lies in -256 .. 254, so 128 iterations never leave the int16_t range.)
Below is the actual implementation.
```cpp
int32_t avx2_maddubs_sumsignedbytes(int8_t* array, size_t size) {

    const __m256i zero   = _mm256_setzero_si256();
    const __m256i one_8  = _mm256_set1_epi8(1);
    const __m256i one_16 = _mm256_set1_epi16(1);
    __m256i accumulator  = zero;

    for (size_t j = size / (32 * 128); j != 0; --j) {
        __m256i local_accumulator = zero;

        for (size_t i = 0; i < 128 * 32; i += 32) {
            const __m256i v  = _mm256_loadu_si256((__m256i*)(array + i));
            const __m256i t0 = _mm256_maddubs_epi16(one_8, v);

            local_accumulator = _mm256_add_epi16(local_accumulator, t0);
        }

        const __m256i t1 = _mm256_madd_epi16(one_16, local_accumulator);
        accumulator = _mm256_add_epi32(accumulator, t1);

        array += 128 * 32;
    }

    size_t remainder = size - ((size / (32 * 128)) * (32 * 128));
    if (remainder) {
        __m256i local_accumulator = zero;

        for (size_t i = 0; i < remainder; i += 32) {
            const __m256i v  = _mm256_loadu_si256((__m256i*)(array + i));
            const __m256i t0 = _mm256_maddubs_epi16(one_8, v);

            local_accumulator = _mm256_add_epi16(local_accumulator, t0);
        }

        const __m256i t1 = _mm256_madd_epi16(one_16, local_accumulator);
        accumulator = _mm256_add_epi32(accumulator, t1);
    }

    const __m128i accumulator128 = _mm_add_epi32(
        _mm256_extracti128_si256(accumulator, 0),
        _mm256_extracti128_si256(accumulator, 1));

    return int32_t(_mm_extract_epi32(accumulator128, 0)) +
           int32_t(_mm_extract_epi32(accumulator128, 1)) +
           int32_t(_mm_extract_epi32(accumulator128, 2)) +
           int32_t(_mm_extract_epi32(accumulator128, 3));
}
```
procedure | description
---|---
scalar | plain loop
scalar (C++) | std::accumulate
SSE | explicit casting
SSE (v2) | shift-based casting
SSE (sadbw) | two PSADBW instructions
SSE (sadbw, unrolled) | the above procedure unrolled four times
AVX2 | explicit casting
AVX2 (v2) | shift-based casting
AVX2 (sadbw) | two VPSADBW instructions
AVX2 (sadbw, unrolled) | the above procedure unrolled four times
AVX2 (sadbw, variant) | single VPSADBW instruction
AVX2 (maddubs) | VPMADDUBSW instruction
The procedures were run three times and minimum values were noted.
CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz (Haswell)
Compiler: Apple LLVM version 10.0.0 (clang-1000.11.45.5)
procedure | best [cycles] | avg. [cycles] | speedup | |
---|---|---|---|---|
size 4096 items | | | | |
scalar | 0.132 | 0.153 | 1.00 | ████████▋ |
scalar (C++) | 0.122 | 0.131 | 1.08 | █████████▍ |
SSE | 0.294 | 0.300 | 0.45 | ███▉ |
SSE (v2) | 0.294 | 0.317 | 0.45 | ███▉ |
SSE (sadbw) | 0.130 | 0.141 | 1.02 | ████████▊ |
SSE (sadbw, unrolled) | 0.130 | 0.135 | 1.02 | ████████▊ |
AVX2 | 0.153 | 0.161 | 0.86 | ███████▌ |
AVX2 (v2) | 0.147 | 0.150 | 0.90 | ███████▊ |
AVX2 (sadbw) | 0.070 | 0.072 | 1.89 | ████████████████▍ |
AVX2 (sadbw, unrolled) | 0.060 | 0.067 | 2.20 | ███████████████████▏ |
AVX2 (sadbw, variant) | 0.043 | 0.048 | 3.07 | ██████████████████████████▋ |
AVX2 (maddubs) | 0.023 | 0.026 | 5.74 | ██████████████████████████████████████████████████ |
size 16384 items | ||||
scalar | 0.118 | 0.127 | 1.00 | ████████▋ |
scalar (C++) | 0.121 | 0.127 | 0.98 | ████████▍ |
SSE | 0.295 | 0.298 | 0.40 | ███▍ |
SSE (v2) | 0.295 | 0.316 | 0.40 | ███▍ |
SSE (sadbw) | 0.130 | 0.136 | 0.91 | ███████▉ |
SSE (sadbw, unrolled) | 0.131 | 0.134 | 0.90 | ███████▊ |
AVX2 | 0.154 | 0.158 | 0.77 | ██████▋ |
AVX2 (v2) | 0.147 | 0.149 | 0.80 | ██████▉ |
AVX2 (sadbw) | 0.068 | 0.072 | 1.74 | ███████████████ |
AVX2 (sadbw, unrolled) | 0.065 | 0.070 | 1.82 | ███████████████▊ |
AVX2 (sadbw, variant) | 0.039 | 0.043 | 3.03 | ██████████████████████████▎ |
AVX2 (maddubs) | 0.021 | 0.023 | 5.62 | ████████████████████████████████████████████████▉ |
size 32768 items | ||||
scalar | 0.121 | 0.129 | 1.00 | ████████▋ |
scalar (C++) | 0.122 | 0.131 | 0.99 | ████████▋ |
SSE | 0.296 | 0.296 | 0.41 | ███▌ |
SSE (v2) | 0.296 | 0.297 | 0.41 | ███▌ |
SSE (sadbw) | 0.132 | 0.133 | 0.92 | ███████▉ |
SSE (sadbw, unrolled) | 0.132 | 0.133 | 0.92 | ███████▉ |
AVX2 | 0.156 | 0.157 | 0.78 | ██████▊ |
AVX2 (v2) | 0.148 | 0.148 | 0.82 | ███████ |
AVX2 (sadbw) | 0.068 | 0.069 | 1.78 | ███████████████▌ |
AVX2 (sadbw, unrolled) | 0.064 | 0.064 | 1.89 | ████████████████▍ |
AVX2 (sadbw, variant) | 0.040 | 0.040 | 3.02 | ██████████████████████████▎ |
AVX2 (maddubs) | 0.023 | 0.024 | 5.26 | █████████████████████████████████████████████▊ |
CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
GCC: g++-8 (Ubuntu 8.1.0-5ubuntu1~16.04) 8.1.0
procedure | best [cycles] | avg. [cycles] | speedup | |
---|---|---|---|---|
size 4096 items | | | | |
scalar | 0.264 | 0.265 | 1.00 | ███████████▉ |
scalar (C++) | 0.263 | 0.265 | 1.00 | ███████████▉ |
SSE | 0.404 | 0.407 | 0.65 | ███████▊ |
SSE (v2) | 0.233 | 0.235 | 1.13 | █████████████▌ |
SSE (sadbw) | 0.191 | 0.193 | 1.38 | ████████████████▍ |
SSE (sadbw, unrolled) | 0.191 | 0.193 | 1.38 | ████████████████▍ |
AVX2 | 0.206 | 0.208 | 1.28 | ███████████████▎ |
AVX2 (v2) | 0.128 | 0.129 | 2.06 | ████████████████████████▌ |
AVX2 (sadbw) | 0.112 | 0.114 | 2.36 | ████████████████████████████▏ |
AVX2 (sadbw, unrolled) | 0.099 | 0.102 | 2.67 | ███████████████████████████████▊ |
AVX2 (sadbw, variant) | 0.069 | 0.071 | 3.83 | █████████████████████████████████████████████▋ |
AVX2 (maddubs) | 0.063 | 0.065 | 4.19 | ██████████████████████████████████████████████████ |
size 16384 items | ||||
scalar | 0.260 | 0.261 | 1.00 | ███████████▉ |
scalar (C++) | 0.260 | 0.261 | 1.00 | ███████████▉ |
SSE | 0.403 | 0.405 | 0.65 | ███████▋ |
SSE (v2) | 0.231 | 0.232 | 1.13 | █████████████▍ |
SSE (sadbw) | 0.188 | 0.189 | 1.38 | ████████████████▌ |
SSE (sadbw, unrolled) | 0.188 | 0.189 | 1.38 | ████████████████▌ |
AVX2 | 0.203 | 0.204 | 1.28 | ███████████████▎ |
AVX2 (v2) | 0.125 | 0.126 | 2.08 | ████████████████████████▊ |
AVX2 (sadbw) | 0.109 | 0.110 | 2.39 | ████████████████████████████▍ |
AVX2 (sadbw, unrolled) | 0.097 | 0.099 | 2.68 | ███████████████████████████████▉ |
AVX2 (sadbw, variant) | 0.066 | 0.066 | 3.94 | ███████████████████████████████████████████████ |
AVX2 (maddubs) | 0.064 | 0.064 | 4.06 | ████████████████████████████████████████████████▍ |
size 32768 items | ||||
scalar | 0.260 | 0.261 | 1.00 | ███████████▉ |
scalar (C++) | 0.260 | 0.261 | 1.00 | ███████████▉ |
SSE | 0.403 | 0.405 | 0.65 | ███████▋ |
SSE (v2) | 0.231 | 0.232 | 1.13 | █████████████▍ |
SSE (sadbw) | 0.188 | 0.189 | 1.38 | ████████████████▌ |
SSE (sadbw, unrolled) | 0.188 | 0.189 | 1.38 | ████████████████▌ |
AVX2 | 0.202 | 0.203 | 1.29 | ███████████████▎ |
AVX2 (v2) | 0.125 | 0.126 | 2.08 | ████████████████████████▊ |
AVX2 (sadbw) | 0.109 | 0.109 | 2.39 | ████████████████████████████▍ |
AVX2 (sadbw, unrolled) | 0.098 | 0.099 | 2.65 | ███████████████████████████████▋ |
AVX2 (sadbw, variant) | 0.066 | 0.066 | 3.94 | ███████████████████████████████████████████████ |
AVX2 (maddubs) | 0.063 | 0.064 | 4.13 | █████████████████████████████████████████████████▏ |
CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz
GCC: gcc (Ubuntu 8.1.0-5ubuntu1~16.04) 8.1.0
procedure | best [cycles] | avg. [cycles] | speedup | |
---|---|---|---|---|
size 4096 items | | | | |
scalar | 0.284 | 0.289 | 1.00 | █████████▎ |
scalar (C++) | 0.285 | 0.288 | 1.00 | █████████▏ |
SSE | 0.437 | 0.441 | 0.65 | ██████ |
SSE (v2) | 0.243 | 0.246 | 1.17 | ██████████▊ |
SSE (sadbw) | 0.177 | 0.186 | 1.60 | ██████████████▊ |
SSE (sadbw, unrolled) | 0.177 | 0.185 | 1.60 | ██████████████▊ |
AVX2 | 0.222 | 0.224 | 1.28 | ███████████▊ |
AVX2 (v2) | 0.135 | 0.137 | 2.10 | ███████████████████▍ |
AVX2 (sadbw) | 0.106 | 0.109 | 2.68 | ████████████████████████▊ |
AVX2 (sadbw, unrolled) | 0.104 | 0.109 | 2.73 | █████████████████████████▎ |
AVX2 (sadbw, variant) | 0.055 | 0.059 | 5.16 | ███████████████████████████████████████████████▊ |
AVX2 (maddubs) | 0.061 | 0.066 | 4.66 | ███████████████████████████████████████████ |
size 16384 items | ||||
scalar | 0.281 | 0.284 | 1.00 | █████████▎ |
scalar (C++) | 0.282 | 0.283 | 1.00 | █████████▏ |
SSE | 0.437 | 0.438 | 0.64 | █████▉ |
SSE (v2) | 0.242 | 0.243 | 1.16 | ██████████▋ |
SSE (sadbw) | 0.176 | 0.182 | 1.60 | ██████████████▊ |
SSE (sadbw, unrolled) | 0.176 | 0.182 | 1.60 | ██████████████▊ |
AVX2 | 0.219 | 0.220 | 1.28 | ███████████▊ |
AVX2 (v2) | 0.132 | 0.134 | 2.13 | ███████████████████▋ |
AVX2 (sadbw) | 0.103 | 0.105 | 2.73 | █████████████████████████▏ |
AVX2 (sadbw, unrolled) | 0.102 | 0.104 | 2.75 | █████████████████████████▍ |
AVX2 (sadbw, variant) | 0.052 | 0.053 | 5.40 | ██████████████████████████████████████████████████ |
AVX2 (maddubs) | 0.055 | 0.057 | 5.11 | ███████████████████████████████████████████████▎ |
size 32768 items | ||||
scalar | 0.281 | 0.283 | 1.00 | █████████▎ |
scalar (C++) | 0.281 | 0.282 | 1.00 | █████████▎ |
SSE | 0.437 | 0.438 | 0.64 | █████▉ |
SSE (v2) | 0.242 | 0.243 | 1.16 | ██████████▋ |
SSE (sadbw) | 0.177 | 0.182 | 1.59 | ██████████████▋ |
SSE (sadbw, unrolled) | 0.177 | 0.182 | 1.59 | ██████████████▋ |
AVX2 | 0.219 | 0.220 | 1.28 | ███████████▊ |
AVX2 (v2) | 0.133 | 0.134 | 2.11 | ███████████████████▌ |
AVX2 (sadbw) | 0.103 | 0.105 | 2.73 | █████████████████████████▏ |
AVX2 (sadbw, unrolled) | 0.103 | 0.105 | 2.73 | █████████████████████████▏ |
AVX2 (sadbw, variant) | 0.052 | 0.054 | 5.40 | ██████████████████████████████████████████████████ |
AVX2 (maddubs) | 0.057 | 0.058 | 4.93 | █████████████████████████████████████████████▌ |
CPU: Intel(R) Core(TM) i3-8121U CPU @ 2.20GHz
Compiler: g++ (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
procedure | best [cycles] | avg. [cycles] | speedup | |
---|---|---|---|---|
size 4096 items | | | | |
scalar | 0.203 | 0.205 | 1.00 | █████████████▌ |
scalar (C++) | 0.203 | 0.205 | 1.00 | █████████████▌ |
SSE | 0.303 | 0.306 | 0.67 | █████████ |
SSE (v2) | 0.175 | 0.177 | 1.16 | ███████████████▋ |
SSE (sadbw) | 0.154 | 0.155 | 1.32 | █████████████████▊ |
SSE (sadbw, unrolled) | 0.154 | 0.155 | 1.32 | █████████████████▊ |
AVX2 | 0.153 | 0.155 | 1.33 | █████████████████▉ |
AVX2 (v2) | 0.102 | 0.104 | 1.99 | ██████████████████████████▉ |
AVX2 (sadbw) | 0.089 | 0.091 | 2.28 | ██████████████████████████████▉ |
AVX2 (sadbw, unrolled) | 0.072 | 0.075 | 2.82 | ██████████████████████████████████████▏ |
AVX2 (sadbw, variant) | 0.061 | 0.062 | 3.33 | █████████████████████████████████████████████ |
AVX2 (maddubs) | 0.055 | 0.058 | 3.69 | ██████████████████████████████████████████████████ |
size 16384 items | ||||
scalar | 0.196 | 0.198 | 1.00 | █████████████▌ |
scalar (C++) | 0.196 | 0.197 | 1.00 | █████████████▌ |
SSE | 0.302 | 0.304 | 0.65 | ████████▊ |
SSE (v2) | 0.173 | 0.174 | 1.13 | ███████████████▎ |
SSE (sadbw) | 0.152 | 0.153 | 1.29 | █████████████████▍ |
SSE (sadbw, unrolled) | 0.152 | 0.153 | 1.29 | █████████████████▍ |
AVX2 | 0.151 | 0.153 | 1.30 | █████████████████▌ |
AVX2 (v2) | 0.099 | 0.101 | 1.98 | ██████████████████████████▊ |
AVX2 (sadbw) | 0.087 | 0.088 | 2.25 | ██████████████████████████████▌ |
AVX2 (sadbw, unrolled) | 0.071 | 0.073 | 2.76 | █████████████████████████████████████▍ |
AVX2 (sadbw, variant) | 0.058 | 0.059 | 3.38 | █████████████████████████████████████████████▊ |
AVX2 (maddubs) | 0.054 | 0.056 | 3.63 | █████████████████████████████████████████████████▏ |
size 32768 items | ||||
scalar | 0.195 | 0.197 | 1.00 | █████████████▌ |
scalar (C++) | 0.195 | 0.197 | 1.00 | █████████████▌ |
SSE | 0.302 | 0.305 | 0.65 | ████████▋ |
SSE (v2) | 0.173 | 0.174 | 1.13 | ███████████████▎ |
SSE (sadbw) | 0.151 | 0.153 | 1.29 | █████████████████▍ |
SSE (sadbw, unrolled) | 0.151 | 0.153 | 1.29 | █████████████████▍ |
AVX2 | 0.151 | 0.153 | 1.29 | █████████████████▍ |
AVX2 (v2) | 0.099 | 0.101 | 1.97 | ██████████████████████████▋ |
AVX2 (sadbw) | 0.087 | 0.087 | 2.24 | ██████████████████████████████▎ |
AVX2 (sadbw, unrolled) | 0.071 | 0.072 | 2.75 | █████████████████████████████████████▏ |
AVX2 (sadbw, variant) | 0.058 | 0.059 | 3.36 | █████████████████████████████████████████████▌ |
AVX2 (maddubs) | 0.055 | 0.057 | 3.55 | ████████████████████████████████████████████████ |
Big thanks to Daniel Lemire, who provides access to Skylake, SkylakeX and CannonLake machines where I can run benchmarks.
Source code is available on github.