Author: | Wojciech Muła |
---|---|
Added on: | 2018-10-24 |
I was curious how GCC vectorizes a function that sums bytes from an array. Below is a loop-based implementation.
```cpp
uint32_t sumbytes(uint8_t* array, size_t size) {
    uint32_t result = 0;

    for (size_t i=0; i < size; i++)
        result += uint32_t(array[i]);

    return result;
}
```
The same algorithm can be expressed with the following C++ code.
```cpp
#include <numeric>

uint32_t sumbytes(uint8_t* array, size_t size) {
    return std::accumulate(array, array + size, uint32_t(0));
}
```
When I saw the assembly generated by GCC I was sure that it's possible to make it better and faster. This text summarizes my findings.
I focus solely on Skylake performance and AVX2 code. The sources also contain implementations of SSE procedures, and the experiments include timings from other CPUs.
Below is the assembly code of the main loop compiled for Skylake by GCC 7.3.0 with flags -O3 -march=skylake:
```asm
vpmovzxbw    %xmm0, %ymm2
vextracti128 $0x1, %ymm0, %xmm0
vpmovzxwd    %xmm2, %ymm1
vextracti128 $0x1, %ymm2, %xmm2
vpmovzxbw    %xmm0, %ymm0
vpmovzxwd    %xmm2, %ymm2
vpaddd       %ymm2, %ymm1, %ymm1
vpmovzxwd    %xmm0, %ymm2
vextracti128 $0x1, %ymm0, %xmm0
vpaddd       %ymm2, %ymm1, %ymm1
vpmovzxwd    %xmm0, %ymm0
vpaddd       %ymm0, %ymm1, %ymm0
vpaddd       %ymm0, %ymm3, %ymm3
```
GCC nicely vectorized the algorithm: it keeps multiple 32-bit sub-accumulators in a single register, i.e. eight in the AVX2 case (four in SSE code). These 32-bit numbers are added together in the end, forming the scalar result.
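As a sanity check, the sub-accumulator scheme can be modeled in plain scalar C++ (the function name below is mine, for illustration only): lane i sums every eighth byte, and the lanes are added at the end, just as the vector code does.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of GCC's vectorization: eight 32-bit sub-accumulators,
// one per AVX2 lane; lane i receives bytes at indices i, i+8, i+16, ...
// The lanes are added together at the end, forming the scalar result.
uint32_t sumbytes_model(const uint8_t* array, size_t size) {
    uint32_t lanes[8] = {0};
    for (size_t i = 0; i < size; i++)
        lanes[i % 8] += array[i];

    uint32_t result = 0;
    for (int i = 0; i < 8; i++)
        result += lanes[i];   // final horizontal reduction
    return result;
}
```

Reordering the additions is safe because unsigned 32-bit addition is associative and commutative (modulo 2^32).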
Now, let's look at how the type casting is done. Although AVX2 has a variant of the instruction, VPMOVZXBD (intrinsic _mm256_cvtepu8_epi32), that converts directly from uint8_t to uint32_t, the compiler does the conversion in two steps: first uint8_t to uint16_t (VPMOVZXBW), then uint16_t to uint32_t (VPMOVZXWD).
The cast instruction VPMOVZX extends the lower part of a register, in this case the lower half. This is the reason why extractions of halves (VEXTRACTI128) are needed.
In this approach we cast directly from uint8_t to uint32_t. But the casting isn't done by the mentioned VPMOVZXBD — we use only bit-ands and shifts within 32-bit lanes, which has some benefits.
Load the vector:

```cpp
//  byte index: | 15 .. 12 | 11 .. 8  |  7 .. 4  |  3 .. 0  |
// v = [ a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p ]
const __m128i v = _mm_loadu_si128(ptr);
```
Mask the 0th bytes in each 32-bit lane:

```cpp
const __m128i lobyte_mask = _mm_set1_epi32(0x000000ff);

// t0 = [ 0 | 0 | 0 | d | 0 | 0 | 0 | h | 0 | 0 | 0 | l | 0 | 0 | 0 | p ]
const __m128i t0 = _mm_and_si128(lobyte_mask, v);
```
Move the 1st bytes into the 0th position in each lane:

```cpp
const __m128i v1 = _mm_srli_epi32(v, 1*8);

// t1 = [ 0 | 0 | 0 | c | 0 | 0 | 0 | g | 0 | 0 | 0 | k | 0 | 0 | 0 | o ]
const __m128i t1 = _mm_and_si128(lobyte_mask, v1);
```
Move the 2nd bytes into the 0th position in each lane:

```cpp
const __m128i v2 = _mm_srli_epi32(v, 2*8);

// t2 = [ 0 | 0 | 0 | b | 0 | 0 | 0 | f | 0 | 0 | 0 | j | 0 | 0 | 0 | n ]
const __m128i t2 = _mm_and_si128(lobyte_mask, v2);
```
Move the 3rd bytes into the 0th position in each lane; after shifting right by 24 bits no masking is needed:

```cpp
// t3 = [ 0 | 0 | 0 | a | 0 | 0 | 0 | e | 0 | 0 | 0 | i | 0 | 0 | 0 | m ]
const __m128i t3 = _mm_srli_epi32(v, 3*8);
```
Update the accumulator:

```cpp
accumulator = _mm_add_epi32(accumulator, t0);
accumulator = _mm_add_epi32(accumulator, t1);
accumulator = _mm_add_epi32(accumulator, t2);
accumulator = _mm_add_epi32(accumulator, t3);
```
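The steps above collapse to simple shift-and-mask arithmetic on each 32-bit lane; a scalar model of a single lane (the helper name is mine) makes this easy to verify:

```cpp
#include <cstdint>

// One 32-bit lane holds four bytes; shifting and masking with 0x000000ff
// exposes each byte as a uint32_t, exactly what PSRLD/PAND do per lane.
uint32_t sum_lane_bytes(uint32_t lane) {
    const uint32_t lobyte_mask = 0x000000ff;
    const uint32_t t0 =  lane        & lobyte_mask;
    const uint32_t t1 = (lane >>  8) & lobyte_mask;
    const uint32_t t2 = (lane >> 16) & lobyte_mask;
    const uint32_t t3 =  lane >> 24;  // top byte: the shift already zeroed the rest
    return t0 + t1 + t2 + t3;
}
```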
Here is an actual implementation:
```cpp
uint32_t sse_sumbytes_variant2(uint8_t* array, size_t size) {

    const __m128i lobyte_mask = _mm_set1_epi32(0x000000ff);

    __m128i accumulator = _mm_setzero_si128();

    for (size_t i=0; i < size; i += 16) {
        const __m128i v = _mm_loadu_si128((__m128i*)(array + i));

        const __m128i v0 = v;
        const __m128i v1 = _mm_srli_epi32(v, 1*8);
        const __m128i v2 = _mm_srli_epi32(v, 2*8);
        const __m128i v3 = _mm_srli_epi32(v, 3*8);

        const __m128i t0 = _mm_and_si128(lobyte_mask, v0);
        const __m128i t1 = _mm_and_si128(lobyte_mask, v1);
        const __m128i t2 = _mm_and_si128(lobyte_mask, v2);
        const __m128i t3 = v3;

        accumulator = _mm_add_epi32(accumulator, t0);
        accumulator = _mm_add_epi32(accumulator, t1);
        accumulator = _mm_add_epi32(accumulator, t2);
        accumulator = _mm_add_epi32(accumulator, t3);
    }

    return uint32_t(_mm_extract_epi32(accumulator, 0)) +
           uint32_t(_mm_extract_epi32(accumulator, 1)) +
           uint32_t(_mm_extract_epi32(accumulator, 2)) +
           uint32_t(_mm_extract_epi32(accumulator, 3));
}
```
The assembly code of the main loop is:
```asm
vpsrld $0x8,  %ymm1, %ymm3
vpsrld $0x10, %ymm1, %ymm2
vpsrld $0x18, %ymm1, %ymm0
vpand  %ymm3, %ymm4, %ymm3
vpaddd %ymm5, %ymm0, %ymm0
vpand  %ymm2, %ymm4, %ymm2
vpand  %ymm1, %ymm4, %ymm1
vpaddd %ymm3, %ymm0, %ymm0
vpaddd %ymm2, %ymm0, %ymm0
vpaddd %ymm1, %ymm0, %ymm0
```
This approach is two times faster than the compiler's version. There are two reasons for that:
instruction | micro-ops | port allocation | latency | reciprocal throughput |
---|---|---|---|---|
compiler version | ||||
3 x vextracti128 | 1 | p5 | 3 | 1 |
2 x vpmovzxbw | 1 | p5 | 3 | 1 |
4 x vpmovzxwd | 1 | p5 | 3 | 1 |
4 x vpaddd | 1 | p015 | 1 | 0.33 |
our approach | ||||
3 x vpand | 1 | p015 | 1 | 0.33 |
3 x vpsrld | 1 | p05 | 1 | 0.5 |
4 x vpaddd | 1 | p015 | 1 | 0.33 |
First of all, most instructions emitted by the compiler have a latency of 3 cycles. Latency alone could be compensated for (especially in unrolled loops). Worse is that all instructions except VPADDD are dispatched to the same execution port #5, so congestion on that port is likely.
The instructions used in our approach can be dispatched to different ports (and thus executed in parallel); moreover, all of them have a latency of just 1 cycle.
Using 16-bit accumulators instead of 32-bit ones doubles the capacity, so in AVX2 code we update 16 instead of 8 accumulators in a single iteration. But this approach limits the number of additions that can be performed without overflow to max(uint16_t)/max(uint8_t) = 65535/255 = 257.
To overcome this we need two nested loops. The outer loop keeps 32-bit accumulators, as the initial version does. The inner loop processes 256 * 8 bytes, i.e. iterates 64 times. When the inner loop completes, the 16-bit accumulators are extended to 32 bits and the outer accumulators are updated.
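The overflow bound is easy to check with a scalar sketch of the two-level loop (the block size and names below are mine; the real code processes 2048-byte blocks with vectors):

```cpp
#include <cstddef>
#include <cstdint>

// Two-level summation: the inner loop uses a uint16_t partial sum, which
// can absorb at most 65535/255 = 257 maximal bytes, so blocks of 256
// bytes are safe; the outer loop widens each partial sum to 32 bits.
uint32_t sumbytes_two_level(const uint8_t* array, size_t size) {
    const size_t block = 256;   // 256 <= 257 additions: no 16-bit overflow
    uint32_t total = 0;
    for (size_t i = 0; i < size; i += block) {
        uint16_t partial = 0;
        const size_t end = (i + block < size) ? i + block : size;
        for (size_t j = i; j < end; j++)
            partial += array[j];
        total += partial;       // extend to 32 bits and update the outer sum
    }
    return total;
}
```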
Below is a C++ implementation that directly uses vectors of uint16_t in the inner loop.
```cpp
uint32_t avx2_16bit_sumbytes_variant2(uint8_t* array, size_t size) {

    const __m256i mask_lobytes = _mm256_set1_epi16(0x00ff);

    __m256i accumulator = _mm256_setzero_si256();

#define SIZE (8 * 256)
    for (size_t i=0; i < size; i += SIZE) {

        __m256i accu_16bit = _mm256_setzero_si256();

        for (size_t j=0; j < SIZE; j += 32) {
            const __m256i v = _mm256_loadu_si256((__m256i*)(array + i + j));

            const __m256i t0 = _mm256_and_si256(v, mask_lobytes);
            const __m256i t1 = _mm256_srli_epi16(v, 8);

            accu_16bit = _mm256_add_epi16(accu_16bit, t0);
            accu_16bit = _mm256_add_epi16(accu_16bit, t1);
        }

        const __m256i t0 = _mm256_and_si256(accu_16bit, _mm256_set1_epi32(0x0000ffff));
        const __m256i t1 = _mm256_srli_epi32(accu_16bit, 16);

        accumulator = _mm256_add_epi32(accumulator, t0);
        accumulator = _mm256_add_epi32(accumulator, t1);
    }
#undef SIZE

    return uint32_t(_mm256_extract_epi32(accumulator, 0)) +
           uint32_t(_mm256_extract_epi32(accumulator, 1)) +
           uint32_t(_mm256_extract_epi32(accumulator, 2)) +
           uint32_t(_mm256_extract_epi32(accumulator, 3)) +
           uint32_t(_mm256_extract_epi32(accumulator, 4)) +
           uint32_t(_mm256_extract_epi32(accumulator, 5)) +
           uint32_t(_mm256_extract_epi32(accumulator, 6)) +
           uint32_t(_mm256_extract_epi32(accumulator, 7));
}
```
In this approach we also use 16-bit accumulators, but their lower and higher bytes are stored separately in distinct vectors. Thanks to that we use the full capacity of AVX2, i.e. we can perform 32 additions at a time. In each iteration the input vector of bytes is added to the lower part of the accumulator. The higher part is updated in the second step, receiving the carry from the previous step. The following scalar code shows the idea:
```cpp
acc_lo += input;
if (carry_flag)
    acc_hi += 1;
```
How to detect an overflow? In scalar code it's simple, as we can read the carry flag and react accordingly (GCC provides nice wrappers for this, like __builtin_add_overflow). But there is no such thing as a vector of carry flags in either AVX2 or SSE.
We can detect overflow with the help of saturated addition. In saturated arithmetic, if an operation would overflow, the result is the maximum (or minimum) value the target type can hold. Thus, saturated addition might be expressed as:
```cpp
uint8_t satadd(uint8_t a, uint8_t b) {
    if (uint32_t(a) + uint32_t(b) <= 255) // this check is done *somehow* in the hardware
        return a + b;
    else
        return 255;
}
```
We perform two additions: a regular (wrapping) one and a saturated one.
If the results of both sums are not equal, it means there was an overflow. But since AVX2 has no "not-equal" relation (there is only a comparison for "equal"), we need to invert the logic. We initialize the higher part of the accumulator to the highest possible value, i.e. the number of inner loop iterations. Then we decrement the higher part with the "no-overflow" byte-mask. The following scalar code illustrates the idea.
```cpp
acc_lo = 0;
acc_hi = 64;
for (size_t i=0; i < 64; i++) {
    const uint8_t tmp = satadd(acc_lo, input);

    acc_lo += input;
    if (acc_lo == tmp) {
        acc_hi -= 1;
    }
}
```
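To convince ourselves the bookkeeping is right, the trick can be packaged as a self-contained scalar function (the name and signature are mine) and checked: acc_hi ends up equal to the number of carries out of acc_lo, so acc_hi * 256 + acc_lo reconstructs the true sum (for n up to 255):

```cpp
#include <cstdint>

// Scalar model of the 8-bit accumulator trick: acc_hi starts at the
// iteration count and is decremented on every non-overflowing addition,
// so at the end it equals the number of carries out of acc_lo.
uint32_t sum_with_byte_carry(const uint8_t* input, int n) {
    uint8_t acc_lo = 0;
    uint8_t acc_hi = (uint8_t)n;                 // highest possible value
    for (int i = 0; i < n; i++) {
        const uint32_t wide = (uint32_t)acc_lo + input[i];
        const uint8_t  sat  = wide > 255 ? 255 : (uint8_t)wide;  // satadd
        acc_lo = (uint8_t)wide;                  // wrapping addition
        if (acc_lo == sat)                       // the "no-overflow" case
            acc_hi -= 1;
    }
    return (uint32_t)acc_hi * 256 + acc_lo;
}
```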
The actual implementation is shown below.
```cpp
uint32_t sse_8bit_sumbytes(uint8_t* array, size_t size) {

    __m128i accumulator = _mm_setzero_si128();

    for (size_t i=0; i < size; i += 256) {

        __m128i accu_8bit_lo = _mm_setzero_si128();
        __m128i accu_8bit_hi = _mm_set1_epi8(16);

        for (size_t j=0; j < 256; j += 16) {
            const __m128i v  = _mm_loadu_si128((__m128i*)(array + i + j));

            const __m128i t0 = _mm_add_epi8(v, accu_8bit_lo);
            const __m128i t1 = _mm_adds_epu8(v, accu_8bit_lo);

            const __m128i not_carry = _mm_cmpeq_epi8(t0, t1);

            accu_8bit_lo = t0;
            accu_8bit_hi = _mm_add_epi8(accu_8bit_hi, not_carry);
        }

        const __m128i sum_lo = _mm_sad_epu8(accu_8bit_lo, _mm_setzero_si128());
        const __m128i sum_hi = _mm_slli_epi32(_mm_sad_epu8(accu_8bit_hi, _mm_setzero_si128()), 8);

        accumulator = _mm_add_epi32(accumulator, sum_lo);
        accumulator = _mm_add_epi32(accumulator, sum_hi);
    }

    return uint32_t(_mm_extract_epi32(accumulator, 0)) +
           uint32_t(_mm_extract_epi32(accumulator, 1)) +
           uint32_t(_mm_extract_epi32(accumulator, 2)) +
           uint32_t(_mm_extract_epi32(accumulator, 3));
}
```
AVX2 defines the instruction VPSADBW (_mm256_sad_epu8) that calculates the Sum of Absolute Differences (SAD) of unsigned bytes. A single SAD operation works on an eight-element subvector (a 64-bit slice) and stores the result in the corresponding 64-bit element of the result vector; in the case of AVX2, VPSADBW yields four numbers.
Below is sample C++ code that naively implements the SAD function.
```cpp
uint64_t SAD(uint8_t a[], uint8_t b[]) {
    uint64_t result = 0;

    for (int i=0; i < 8; i++) {
        if (a[i] > b[i])
            result += a[i] - b[i];
        else
            result += b[i] - a[i];
    }

    return result;
}
```
If one of the argument vectors is all zeros, then SAD calculates the sum of bytes of the other argument — which is exactly what we need. Moreover, since the instruction calculates 64-bit sums, bigger arrays can be processed without overflow.
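A quick scalar check of this property (the helper name is mine):

```cpp
#include <cstdint>

// SAD over eight unsigned bytes; with b all zeros, |a[i] - 0| = a[i],
// so the result is exactly the sum of the bytes of a.
uint64_t sad8(const uint8_t a[8], const uint8_t b[8]) {
    uint64_t result = 0;
    for (int i = 0; i < 8; i++)
        result += (a[i] > b[i]) ? (a[i] - b[i]) : (b[i] - a[i]);
    return result;
}
```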
Below is the simplest, non-unrolled C++ code.
```cpp
uint32_t avx2_sadbw_sumbytes(uint8_t* array, size_t size) {

    const __m256i zero = _mm256_setzero_si256();
    __m256i accumulator = zero;

    for (size_t i=0; i < size; i += 32) {
        const __m256i v  = _mm256_loadu_si256((__m256i*)(array + i));
        const __m256i t0 = _mm256_sad_epu8(v, zero);

        accumulator = _mm256_add_epi32(accumulator, t0);
    }

    return uint32_t(_mm256_extract_epi32(accumulator, 0)) +
           uint32_t(_mm256_extract_epi32(accumulator, 2)) +
           uint32_t(_mm256_extract_epi32(accumulator, 4)) +
           uint32_t(_mm256_extract_epi32(accumulator, 6));
}
```
Although the VPSADBW instruction has a latency of 3 cycles, its reciprocal throughput is just 1 cycle. In an unrolled procedure the latency is almost entirely hidden. The performance boost over the compiler's code is almost 6x.
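The unrolling remark can be illustrated in scalar terms: keeping several independent accumulators breaks the loop-carried dependency chain, so operations with 3-cycle latency can overlap. A sketch (names are mine; block_sums stands in for the per-iteration VPSADBW results):

```cpp
#include <cstddef>
#include <cstdint>

// Four independent accumulators: each addition depends only on its own
// chain, so up to four long-latency operations can be in flight at once.
// The chains are merged once, after the loop.
uint64_t sum_unrolled4(const uint64_t* block_sums, size_t n) {
    uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (size_t i = 0; i < n; i += 4) {   // assumes n % 4 == 0
        a0 += block_sums[i + 0];
        a1 += block_sums[i + 1];
        a2 += block_sums[i + 2];
        a3 += block_sums[i + 3];
    }
    return a0 + a1 + a2 + a3;
}
```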
procedure | description |
---|---|
scalar | plain loop |
scalar (C++) | std::accumulate |
SSE | convert uint8_t -> uint32_t with PMOVZXBW/PMOVZXWD |
SSE (v2) | 32-bit accumulators |
SSE (sadbw) | PSADBW instruction |
SSE (sadbw, unrolled) | the above procedure unrolled four times |
SSE (16bit accu) | 16-bit accumulators, uint16_t -> uint32_t with PMOVZXWD |
SSE (16bit accu, v2) | 16-bit accumulators, uint16_t -> uint32_t with PAND/PSRLD |
SSE (16bit accu, v2, unrolled) | the above procedure unrolled four times |
SSE (8bit accu) | 8-bit accumulators |
AVX2 (v2) | 32-bit accumulators |
AVX2 (sadbw) | VPSADBW instruction |
AVX2 (sadbw, unrolled) | the above procedure unrolled four times |
AVX2 (16bit accu, v2) | 16-bit accumulators, uint16_t -> uint32_t with VPAND/VPSRLD |
AVX2 (16bit accu, v2, unrolled) | the above procedure unrolled four times |
AVX2 (8bit accu) | 8-bit accumulators |
The procedures were run three times and minimum values were noted.
CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
GCC: gcc (GCC) 7.3.0
procedure | best [cycles] | avg. [cycles] | speedup | |
---|---|---|---|---|
size 4096 items | ||||
scalar | 0.294 | 0.297 | 1.00 | ████████▍ |
scalar (C++) | 0.295 | 0.297 | 1.00 | ████████▍ |
SSE | 0.440 | 0.442 | 0.67 | █████▋ |
SSE (v2) | 0.252 | 0.254 | 1.17 | █████████▊ |
SSE (sadbw) | 0.129 | 0.132 | 2.28 | ███████████████████▏ |
SSE (sadbw, unrolled) | 0.129 | 0.132 | 2.28 | ███████████████████▏ |
SSE (16bit accu) | 0.195 | 0.197 | 1.51 | ████████████▋ |
SSE (16bit accu, v2) | 0.169 | 0.176 | 1.74 | ██████████████▋ |
SSE (16bit accu, v2, unrolled) | 0.109 | 0.111 | 2.70 | ██████████████████████▋ |
SSE (8bit accu) | 0.125 | 0.130 | 2.35 | ███████████████████▊ |
AVX2 (v2) | 0.138 | 0.140 | 2.13 | █████████████████▉ |
AVX2 (sadbw) | 0.075 | 0.077 | 3.92 | █████████████████████████████████ |
AVX2 (sadbw, unrolled) | 0.051 | 0.052 | 5.76 | ████████████████████████████████████████████████▌ |
AVX2 (16bit accu, v2) | 0.091 | 0.093 | 3.23 | ███████████████████████████▏ |
AVX2 (16bit accu, v2, unrolled) | 0.073 | 0.075 | 4.03 | █████████████████████████████████▉ |
AVX2 (8bit accu) | 0.080 | 0.083 | 3.67 | ██████████████████████████████▉ |
size 16384 items | ||||
scalar | 0.285 | 0.286 | 1.00 | ████████▍ |
scalar (C++) | 0.285 | 0.286 | 1.00 | ████████▍ |
SSE | 0.439 | 0.440 | 0.65 | █████▍ |
SSE (v2) | 0.251 | 0.252 | 1.14 | █████████▌ |
SSE (sadbw) | 0.126 | 0.127 | 2.26 | ███████████████████ |
SSE (sadbw, unrolled) | 0.126 | 0.127 | 2.26 | ███████████████████ |
SSE (16bit accu) | 0.191 | 0.192 | 1.49 | ████████████▌ |
SSE (16bit accu, v2) | 0.173 | 0.181 | 1.65 | █████████████▊ |
SSE (16bit accu, v2, unrolled) | 0.107 | 0.108 | 2.66 | ██████████████████████▍ |
SSE (8bit accu) | 0.123 | 0.126 | 2.32 | ███████████████████▌ |
AVX2 (v2) | 0.134 | 0.135 | 2.13 | █████████████████▉ |
AVX2 (sadbw) | 0.072 | 0.072 | 3.96 | █████████████████████████████████▎ |
AVX2 (sadbw, unrolled) | 0.048 | 0.049 | 5.94 | ██████████████████████████████████████████████████ |
AVX2 (16bit accu, v2) | 0.089 | 0.090 | 3.20 | ██████████████████████████▉ |
AVX2 (16bit accu, v2, unrolled) | 0.070 | 0.070 | 4.07 | ██████████████████████████████████▎ |
AVX2 (8bit accu) | 0.076 | 0.077 | 3.75 | ███████████████████████████████▌ |
size 32768 items | ||||
scalar | 0.284 | 0.285 | 1.00 | ████████▍ |
scalar (C++) | 0.284 | 0.284 | 1.00 | ████████▍ |
SSE | 0.439 | 0.440 | 0.65 | █████▍ |
SSE (v2) | 0.251 | 0.252 | 1.13 | █████████▌ |
SSE (sadbw) | 0.126 | 0.127 | 2.25 | ██████████████████▉ |
SSE (sadbw, unrolled) | 0.126 | 0.127 | 2.25 | ██████████████████▉ |
SSE (16bit accu) | 0.191 | 0.192 | 1.49 | ████████████▌ |
SSE (16bit accu, v2) | 0.169 | 0.173 | 1.68 | ██████████████▏ |
SSE (16bit accu, v2, unrolled) | 0.109 | 0.109 | 2.61 | █████████████████████▉ |
SSE (8bit accu) | 0.124 | 0.127 | 2.29 | ███████████████████▎ |
AVX2 (v2) | 0.134 | 0.135 | 2.12 | █████████████████▊ |
AVX2 (sadbw) | 0.072 | 0.073 | 3.94 | █████████████████████████████████▏ |
AVX2 (sadbw, unrolled) | 0.048 | 0.049 | 5.92 | █████████████████████████████████████████████████▊ |
AVX2 (16bit accu, v2) | 0.089 | 0.090 | 3.19 | ██████████████████████████▊ |
AVX2 (16bit accu, v2, unrolled) | 0.071 | 0.072 | 4.00 | █████████████████████████████████▋ |
AVX2 (8bit accu) | 0.076 | 0.077 | 3.74 | ███████████████████████████████▍ |
CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz
GCC: gcc (Ubuntu 8.1.0-5ubuntu1~16.04) 8.1.0
procedure | best [cycles] | avg. [cycles] | speedup | |
---|---|---|---|---|
size 4096 items | ||||
scalar | 0.278 | 0.385 | 1.00 | ███████▎ |
scalar (C++) | 0.278 | 0.281 | 1.00 | ███████▎ |
SSE | 0.431 | 0.434 | 0.65 | ████▊ |
SSE (v2) | 0.216 | 0.248 | 1.29 | █████████▍ |
SSE (sadbw) | 0.092 | 0.095 | 3.02 | ██████████████████████▎ |
SSE (sadbw, unrolled) | 0.092 | 0.120 | 3.02 | ██████████████████████▎ |
SSE (16bit accu) | 0.188 | 0.191 | 1.48 | ██████████▉ |
SSE (16bit accu, v2) | 0.134 | 0.138 | 2.07 | ███████████████▎ |
SSE (16bit accu, v2, unrolled) | 0.120 | 0.121 | 2.32 | █████████████████ |
SSE (8bit accu) | 0.133 | 0.140 | 2.09 | ███████████████▍ |
AVX2 (v2) | 0.123 | 0.125 | 2.26 | ████████████████▋ |
AVX2 (sadbw) | 0.060 | 0.063 | 4.63 | ██████████████████████████████████▏ |
AVX2 (sadbw, unrolled) | 0.041 | 0.043 | 6.78 | ██████████████████████████████████████████████████ |
AVX2 (16bit accu, v2) | 0.076 | 0.078 | 3.66 | ██████████████████████████▉ |
AVX2 (16bit accu, v2, unrolled) | 0.064 | 0.065 | 4.34 | ████████████████████████████████ |
AVX2 (8bit accu) | 0.070 | 0.072 | 3.97 | █████████████████████████████▎ |
size 16384 items | ||||
scalar | 0.280 | 0.282 | 1.00 | ███████▎ |
scalar (C++) | 0.280 | 0.281 | 1.00 | ███████▎ |
SSE | 0.435 | 0.437 | 0.64 | ████▋ |
SSE (v2) | 0.218 | 0.249 | 1.28 | █████████▍ |
SSE (sadbw) | 0.093 | 0.094 | 3.01 | ██████████████████████▏ |
SSE (sadbw, unrolled) | 0.093 | 0.125 | 3.01 | ██████████████████████▏ |
SSE (16bit accu) | 0.188 | 0.190 | 1.49 | ██████████▉ |
SSE (16bit accu, v2) | 0.134 | 0.135 | 2.09 | ███████████████▍ |
SSE (16bit accu, v2, unrolled) | 0.124 | 0.125 | 2.26 | ████████████████▋ |
SSE (8bit accu) | 0.142 | 0.145 | 1.97 | ██████████████▌ |
AVX2 (v2) | 0.124 | 0.125 | 2.26 | ████████████████▋ |
AVX2 (sadbw) | 0.062 | 0.063 | 4.52 | █████████████████████████████████▎ |
AVX2 (sadbw, unrolled) | 0.044 | 0.044 | 6.36 | ██████████████████████████████████████████████▉ |
AVX2 (16bit accu, v2) | 0.079 | 0.080 | 3.54 | ██████████████████████████▏ |
AVX2 (16bit accu, v2, unrolled) | 0.066 | 0.067 | 4.24 | ███████████████████████████████▎ |
AVX2 (8bit accu) | 0.070 | 0.071 | 4.00 | █████████████████████████████▍ |
size 32768 items | ||||
scalar | 0.281 | 0.282 | 1.00 | ███████▎ |
scalar (C++) | 0.281 | 0.281 | 1.00 | ███████▎ |
SSE | 0.436 | 0.437 | 0.64 | ████▊ |
SSE (v2) | 0.219 | 0.250 | 1.28 | █████████▍ |
SSE (sadbw) | 0.095 | 0.097 | 2.96 | █████████████████████▊ |
SSE (sadbw, unrolled) | 0.096 | 0.125 | 2.93 | █████████████████████▌ |
SSE (16bit accu) | 0.189 | 0.191 | 1.49 | ██████████▉ |
SSE (16bit accu, v2) | 0.134 | 0.137 | 2.10 | ███████████████▍ |
SSE (16bit accu, v2, unrolled) | 0.125 | 0.125 | 2.25 | ████████████████▌ |
SSE (8bit accu) | 0.146 | 0.149 | 1.92 | ██████████████▏ |
AVX2 (v2) | 0.125 | 0.126 | 2.25 | ████████████████▌ |
AVX2 (sadbw) | 0.062 | 0.063 | 4.53 | █████████████████████████████████▍ |
AVX2 (sadbw, unrolled) | 0.045 | 0.045 | 6.24 | ██████████████████████████████████████████████ |
AVX2 (16bit accu, v2) | 0.079 | 0.080 | 3.56 | ██████████████████████████▏ |
AVX2 (16bit accu, v2, unrolled) | 0.067 | 0.068 | 4.19 | ██████████████████████████████▉ |
AVX2 (8bit accu) | 0.071 | 0.072 | 3.96 | █████████████████████████████▏ |
CPU: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz
GCC: gcc (Debian 7.3.0-16) 7.3.0
procedure | best [cycles] | avg. [cycles] | speedup | |
---|---|---|---|---|
size 4096 items | ||||
scalar | 0.370 | 0.475 | 1.00 | ███████████████▌ |
scalar (C++) | 0.371 | 0.374 | 1.00 | ███████████████▍ |
SSE | 0.320 | 0.323 | 1.16 | █████████████████▉ |
SSE (v2) | 0.311 | 0.311 | 1.19 | ██████████████████▍ |
SSE (sadbw) | 0.142 | 0.143 | 2.61 | ████████████████████████████████████████▍ |
SSE (sadbw, unrolled) | 0.141 | 0.143 | 2.62 | ████████████████████████████████████████▊ |
SSE (16bit accu) | 0.204 | 0.207 | 1.81 | ████████████████████████████▏ |
SSE (16bit accu, v2) | 0.170 | 0.174 | 2.18 | █████████████████████████████████▊ |
SSE (16bit accu, v2, unrolled) | 0.115 | 0.117 | 3.22 | ██████████████████████████████████████████████████ |
SSE (8bit accu) | 0.149 | 0.151 | 2.48 | ██████████████████████████████████████▌ |
size 16384 items | ||||
scalar | 0.375 | 0.376 | 1.00 | ███████████████▌ |
scalar (C++) | 0.375 | 0.377 | 1.00 | ███████████████▌ |
SSE | 0.325 | 0.327 | 1.15 | █████████████████▉ |
SSE (v2) | 0.309 | 0.318 | 1.21 | ██████████████████▊ |
SSE (sadbw) | 0.152 | 0.153 | 2.47 | ██████████████████████████████████████▎ |
SSE (sadbw, unrolled) | 0.152 | 0.153 | 2.47 | ██████████████████████████████████████▎ |
SSE (16bit accu) | 0.214 | 0.215 | 1.75 | ███████████████████████████▏ |
SSE (16bit accu, v2) | 0.176 | 0.177 | 2.13 | █████████████████████████████████ |
SSE (16bit accu, v2, unrolled) | 0.126 | 0.127 | 2.98 | ██████████████████████████████████████████████▎ |
SSE (8bit accu) | 0.159 | 0.159 | 2.36 | ████████████████████████████████████▋ |
size 32768 items | ||||
scalar | 0.375 | 0.377 | 1.00 | ███████████████▌ |
scalar (C++) | 0.375 | 0.377 | 1.00 | ███████████████▌ |
SSE | 0.326 | 0.328 | 1.15 | █████████████████▉ |
SSE (v2) | 0.292 | 0.293 | 1.28 | ███████████████████▉ |
SSE (sadbw) | 0.153 | 0.154 | 2.45 | ██████████████████████████████████████ |
SSE (sadbw, unrolled) | 0.153 | 0.154 | 2.45 | ██████████████████████████████████████ |
SSE (16bit accu) | 0.215 | 0.217 | 1.74 | ███████████████████████████ |
SSE (16bit accu, v2) | 0.180 | 0.181 | 2.08 | ████████████████████████████████▍ |
SSE (16bit accu, v2, unrolled) | 0.140 | 0.140 | 2.68 | █████████████████████████████████████████▋ |
SSE (8bit accu) | 0.163 | 0.164 | 2.30 | ███████████████████████████████████▊ |
There is also a second part, dealing with signed bytes.
Big thanks to Daniel Lemire, who provides access to Skylake/SkylakeX machines where I can run benchmarks.
Source code is available on github.