Author: | Wojciech Muła |
---|---|
Added on: | 2018-04-11 |
Updated on: | 2018-04-25 (sync source codes with github version), 2018-04-18 (links to HN and reddit discussions, test performance of unrolled SSE and AVX2 procedures, and AVX512 procedures) |
Recently, I came across a function that checks whether an array is sorted, i.e. if there is no element which would be greater than its successor. Below is a sample implementation:
```cpp
bool is_sorted(const int32_t* input, size_t n) {
    if (n < 2) {
        return true;
    }

    for (size_t i=0; i < n - 1; i++) {
        if (input[i] > input[i + 1])
            return false;
    }

    return true;
}
```
I was sure that such a trivial loop would be autovectorized by all decent compilers. I checked this on Compiler Explorer and, to my surprise, none of the compilers does it. This is the state for GCC 7.3 (and the upcoming GCC 8.0), clang 6.0 and ICC 19.
This text explores possible vectorization schemas.
In the following examples vectors of signed 32-bit integers are assumed.
The simplest vectorized solution is suitable for all SIMD flavours. In a loop two vectors are filled: one vector contains the items a = input[0 .. k - 1], the other b = input[1 .. k] (where k is the vector size; for SSE k=4, for AVX2 k=8 and for AVX512 k=16).
Then the greater-than comparison yields a vector of results a[i] > b[i]. If all its elements are zero (false), it means that in the range 0 .. k-1 the order is not violated.
Below is the outline of the algorithm's loop (with k=4):

1. Load the first vector:

        curr = [ a0 | a1 | a2 | a3 ]

2. Load the second vector:

        next = [ a1 | a2 | a3 | a4 ]

3. Compare curr > next:

        mask = [ a0 > a1 | a1 > a2 | a2 > a3 | a3 > a4 ]

4. If any element of mask is not zero, then return false.
5. Otherwise advance the input pointer by k and go back to step 1.
```cpp
bool is_sorted_sse(int32_t* a, size_t n) {
    size_t i = 0;
    if (n > 4) {
        for (/**/; i < n - 4; i += 4) {
            const __m128i curr = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
            const __m128i next = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i + 1));

            const __m128i mask = _mm_cmpgt_epi32(curr, next);
            if (!_mm_test_all_zeros(mask, mask)) {
                return false;
            }
        }
    }

    for (/**/; i + 1 < n; i++) {
        if (a[i] > a[i + 1])
            return false;
    }

    return true;
}
```
The generic approach has one issue: the vector next shares k-1 elements with curr, yet we still read all k of its elements from memory.
With the help of the SSE instruction _mm_alignr_epi8 (palignr) the number of memory accesses can be reduced. The instruction takes two 16-byte vectors, joins them into a 32-byte temporary array and then copies the selected 16-byte subarray into the result vector.
We keep two vectors (chunk0 and chunk1) which together hold an eight-element subarray of the input. In each iteration this subarray advances by one chunk (four elements), and only that new chunk is read, i.e. each iteration is done at the cost of a single memory load.
The algorithm works as follows:
```cpp
// chunk0 = [ a0 | a1 | a2 | a3 ]
__m128i chunk0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(array));

// chunk1 = [ a4 | a5 | a6 | a7 ]
const __m128i chunk1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(array + i + 4));

// curr = [ a0 | a1 | a2 | a3 ]
const __m128i curr = chunk0;

// tmp  = [ a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 ]
//        |      chunk0       |      chunk1       |
// next = [ a1 | a2 | a3 | a4 ]
const __m128i next = _mm_alignr_epi8(chunk1, chunk0, 4);

// mask = [ a0 > a1 | a1 > a2 | a2 > a3 | a3 > a4 ]
const __m128i mask = _mm_cmpgt_epi32(curr, next);

chunk0 = chunk1; // chunk1 will be updated at the loop beginning
```
```cpp
bool is_sorted_sse_2(int32_t* a, size_t n) {
    size_t i = 0;
    if (n >= 8) {
        __m128i chunk0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
        do {
            const __m128i chunk1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i + 4));
            const __m128i curr = chunk0;
            const __m128i next = _mm_alignr_epi8(chunk1, chunk0, 4);

            const __m128i mask = _mm_cmpgt_epi32(curr, next);
            if (!_mm_test_all_zeros(mask, mask)) {
                return false;
            }

            chunk0 = chunk1;
            i += 4;
        } while (i < n - 4);
    }

    for (/**/; i + 1 < n; i++) {
        if (a[i] > a[i + 1])
            return false;
    }

    return true;
}
```
Unfortunately, the AVX2 counterpart _mm256_alignr_epi8 doesn't operate on the whole 32-byte register, but on its 16-byte halves (lanes).
The AVX2 approach therefore uses the instruction _mm256_permutevar8x32_epi32, which moves 32-bit elements across the lanes in a given order.
In a single iteration we read eight elements:
curr = [ a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 ]
Then the vector curr is shifted (permuted) by one element; only the last item (a7) keeps its position:
next = [ a1 | a2 | a3 | a4 | a5 | a6 | a7 | a7 ]
The comparison effectively tests only the first seven pairs, as the last element, a7 > a7, is always false:

    mask = [ a0 > a1 | a1 > a2 | a2 > a3 | a3 > a4 | a4 > a5 | a5 > a6 | a6 > a7 | a7 > a7 ]
```cpp
bool is_sorted_avx2(int32_t* a, size_t n) {
    const __m256i shuffle_pattern = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 7);

    size_t i = 0;
    if (n >= 8) { // guard: n - 8 would underflow for a size_t when n < 8
        while (i < n - 8) {
            // curr = [ a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 ]
            const __m256i curr = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));

            // next = [ a1 | a2 | a3 | a4 | a5 | a6 | a7 | a7 ]
            const __m256i next = _mm256_permutevar8x32_epi32(curr, shuffle_pattern);

            // Note: the last element of both curr and next is a7, thus for this
            // element the comparison result is always zero.
            //
            // In fact, only the first 7 elements are being tested.
            const __m256i mask = _mm256_cmpgt_epi32(curr, next);
            if (!_mm256_testz_si256(mask, mask)) {
                return false;
            }

            i += 7;
        }
    }

    for (/**/; i + 1 < n; i++) {
        if (a[i] > a[i + 1])
            return false;
    }

    return true;
}
```
procedure | description |
---|---|
scalar | is_sorted |
SSE (generic) | the generic SIMD algorithm using SSE instructions |
SSE (generic, unrolled 4 times) | the above procedure unrolled 4 times |
AVX2 (generic) | the generic SIMD algorithm using AVX2 instructions |
AVX2 (generic, unrolled 4 times) | the above procedure unrolled 4 times |
AVX512 (generic) | the generic SIMD algorithm using AVX512F instructions |
SSE | SSE-specific procedure |
SSE (unrolled 4 times) | implementation of the above procedure proposed by HeroicKatora on reddit; thank you! |
AVX2 | AVX2-specific procedure |
AVX2 (unrolled 4 times) | the above procedure unrolled 4 times; processes 4*7 items per iteration |
AVX512 | AVX512 variant of the AVX2 approach |
Compiler: GCC 7.3.0
CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
procedure | time [us] | speed-up |
---|---|---|
element count 128 | ||
scalar | 8080 | 1.00 |
SSE (generic) | 2970 | 2.72 |
SSE (generic, unrolled 4 times) | 1917 | 4.21 |
SSE | 3696 | 2.19 |
SSE (unrolled 4 times) | 1845 | 4.38 |
AVX2 (generic) | 1769 | 4.57 |
AVX2 (generic, unrolled 4 times) | 3216 | 2.51 |
AVX2 | 1945 | 4.15 |
AVX2 (unrolled 4 times) | 1571 | 5.14 |
element count 256 | ||
scalar | 13982 | 1.00 |
SSE (generic) | 5022 | 2.78 |
SSE (generic, unrolled 4 times) | 2865 | 4.88 |
SSE | 7054 | 1.98 |
SSE (unrolled 4 times) | 3139 | 4.45 |
AVX2 (generic) | 4037 | 3.46 |
AVX2 (generic, unrolled 4 times) | 4395 | 3.18 |
AVX2 | 3722 | 3.76 |
AVX2 (unrolled 4 times) | 2263 | 6.18 |
element count 512 | ||
scalar | 25039 | 1.00 |
SSE (generic) | 9222 | 2.72 |
SSE (generic, unrolled 4 times) | 5951 | 4.21 |
SSE | 13525 | 1.85 |
SSE (unrolled 4 times) | 6407 | 3.91 |
AVX2 (generic) | 6806 | 3.68 |
AVX2 (generic, unrolled 4 times) | 6850 | 3.66 |
AVX2 | 6759 | 3.70 |
AVX2 (unrolled 4 times) | 5071 | 4.94 |
element count 1024 | ||
scalar | 46586 | 1.00 |
SSE (generic) | 17792 | 2.62 |
SSE (generic, unrolled 4 times) | 10671 | 4.37 |
SSE | 26160 | 1.78 |
SSE (unrolled 4 times) | 11598 | 4.02 |
AVX2 (generic) | 12483 | 3.73 |
AVX2 (generic, unrolled 4 times) | 11602 | 4.02 |
AVX2 | 12346 | 3.77 |
AVX2 (unrolled 4 times) | 8770 | 5.31 |
element count 2048 | ||
scalar | 86920 | 1.00 |
SSE (generic) | 34936 | 2.49 |
SSE (generic, unrolled 4 times) | 20131 | 4.32 |
SSE | 51254 | 1.70 |
SSE (unrolled 4 times) | 21978 | 3.95 |
AVX2 (generic) | 23776 | 3.66 |
AVX2 (generic, unrolled 4 times) | 21038 | 4.13 |
AVX2 | 23976 | 3.63 |
AVX2 (unrolled 4 times) | 17327 | 5.02 |
element count 4096 | ||
scalar | 171457 | 1.00 |
SSE (generic) | 68884 | 2.49 |
SSE (generic, unrolled 4 times) | 39006 | 4.40 |
SSE | 101823 | 1.68 |
SSE (unrolled 4 times) | 42858 | 4.00 |
AVX2 (generic) | 46424 | 3.69 |
AVX2 (generic, unrolled 4 times) | 39651 | 4.32 |
AVX2 | 47139 | 3.64 |
AVX2 (unrolled 4 times) | 33659 | 5.09 |
element count 16384 | ||
scalar | 671985 | 1.00 |
SSE (generic) | 311616 | 2.16 |
SSE (generic, unrolled 4 times) | 210776 | 3.19 |
SSE | 409935 | 1.64 |
SSE (unrolled 4 times) | 191207 | 3.51 |
AVX2 (generic) | 210032 | 3.20 |
AVX2 (generic, unrolled 4 times) | 207815 | 3.23 |
AVX2 | 219203 | 3.07 |
AVX2 (unrolled 4 times) | 164703 | 4.08 |
element count 65536 | ||
scalar | 2587279 | 1.00 |
SSE (generic) | 1258949 | 2.06 |
SSE (generic, unrolled 4 times) | 870655 | 2.97 |
SSE | 1645715 | 1.57 |
SSE (unrolled 4 times) | 796933 | 3.25 |
AVX2 (generic) | 850206 | 3.04 |
AVX2 (generic, unrolled 4 times) | 860592 | 3.01 |
AVX2 | 899470 | 2.88 |
AVX2 (unrolled 4 times) | 724050 | 3.57 |
Compiler: GCC 5.4.0
CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz
procedure | time [us] | speed-up |
---|---|---|
element count 128 | ||
scalar | 6091 | 1.00 |
SSE (generic) | 2634 | 2.31 |
SSE (generic, unrolled 4 times) | 2084 | 2.92 |
SSE | 3329 | 1.83 |
SSE (unrolled 4 times) | 2327 | 2.62 |
AVX2 (generic) | 1648 | 3.70 |
AVX2 (generic, unrolled 4 times) | 2191 | 2.78 |
AVX2 | 1774 | 3.43 |
AVX2 (unrolled 4 times) | 1657 | 3.68 |
AVX512 (generic) | 1367 | 4.46 |
AVX512 | 1150 | 5.30 |
element count 256 | ||
scalar | 11445 | 1.00 |
SSE (generic) | 5008 | 2.29 |
SSE (generic, unrolled 4 times) | 3363 | 3.40 |
SSE | 6593 | 1.74 |
SSE (unrolled 4 times) | 3976 | 2.88 |
AVX2 (generic) | 3021 | 3.79 |
AVX2 (generic, unrolled 4 times) | 2954 | 3.87 |
AVX2 | 3455 | 3.31 |
AVX2 (unrolled 4 times) | 2061 | 5.55 |
AVX512 (generic) | 1938 | 5.91 |
AVX512 | 2012 | 5.69 |
element count 512 | ||
scalar | 22133 | 1.00 |
SSE (generic) | 12736 | 1.74 |
SSE (generic, unrolled 4 times) | 6163 | 3.59 |
SSE | 14349 | 1.54 |
SSE (unrolled 4 times) | 7268 | 3.05 |
AVX2 (generic) | 5937 | 3.73 |
AVX2 (generic, unrolled 4 times) | 5141 | 4.31 |
AVX2 | 8279 | 2.67 |
AVX2 (unrolled 4 times) | 4053 | 5.46 |
AVX512 (generic) | 3094 | 7.15 |
AVX512 | 3562 | 6.21 |
element count 1024 | ||
scalar | 43534 | 1.00 |
SSE (generic) | 24440 | 1.78 |
SSE (generic, unrolled 4 times) | 12885 | 3.38 |
SSE | 27446 | 1.59 |
SSE (unrolled 4 times) | 13877 | 3.14 |
AVX2 (generic) | 12664 | 3.44 |
AVX2 (generic, unrolled 4 times) | 7503 | 5.80 |
AVX2 | 14728 | 2.96 |
AVX2 (unrolled 4 times) | 8034 | 5.42 |
AVX512 (generic) | 5380 | 8.09 |
AVX512 | 5443 | 8.00 |
element count 2048 | ||
scalar | 86305 | 1.00 |
SSE (generic) | 47782 | 1.81 |
SSE (generic, unrolled 4 times) | 24381 | 3.54 |
SSE | 53797 | 1.60 |
SSE (unrolled 4 times) | 27908 | 3.09 |
AVX2 (generic) | 24044 | 3.59 |
AVX2 (generic, unrolled 4 times) | 13547 | 6.37 |
AVX2 | 28291 | 3.05 |
AVX2 (unrolled 4 times) | 15484 | 5.57 |
AVX512 (generic) | 10813 | 7.98 |
AVX512 | 11200 | 7.71 |
element count 4096 | ||
scalar | 171882 | 1.00 |
SSE (generic) | 94751 | 1.81 |
SSE (generic, unrolled 4 times) | 47284 | 3.64 |
SSE | 106064 | 1.62 |
SSE (unrolled 4 times) | 54288 | 3.17 |
AVX2 (generic) | 47036 | 3.65 |
AVX2 (generic, unrolled 4 times) | 26236 | 6.55 |
AVX2 | 55040 | 3.12 |
AVX2 (unrolled 4 times) | 29973 | 5.73 |
AVX512 (generic) | 19904 | 8.64 |
AVX512 | 21735 | 7.91 |
element count 16384 | ||
scalar | 685498 | 1.00 |
SSE (generic) | 417421 | 1.64 |
SSE (generic, unrolled 4 times) | 266764 | 2.57 |
SSE | 438934 | 1.56 |
SSE (unrolled 4 times) | 239919 | 2.86 |
AVX2 (generic) | 204391 | 3.35 |
AVX2 (generic, unrolled 4 times) | 181020 | 3.79 |
AVX2 | 253740 | 2.70 |
AVX2 (unrolled 4 times) | 165795 | 4.13 |
AVX512 (generic) | 182358 | 3.76 |
AVX512 | 117629 | 5.83 |
element count 65536 | ||
scalar | 2740076 | 1.00 |
SSE (generic) | 1663774 | 1.65 |
SSE (generic, unrolled 4 times) | 1062554 | 2.58 |
SSE | 1756262 | 1.56 |
SSE (unrolled 4 times) | 938962 | 2.92 |
AVX2 (generic) | 973695 | 2.81 |
AVX2 (generic, unrolled 4 times) | 792764 | 3.46 |
AVX2 | 1011261 | 2.71 |
AVX2 (unrolled 4 times) | 660586 | 4.15 |
AVX512 (generic) | 729784 | 3.75 |
AVX512 | 467945 | 5.86 |
All implementations are available on github.