Is sorted using SIMD instructions

Author: Wojciech Muła

Added on 2018-04-11; updated on 2018-04-18 (links to HN and reddit discussions, performance tests of unrolled SSE, AVX2 and AVX512 procedures) and on 2018-04-25 (synced source codes with the github version)

Introduction

Recently, I came across a function that checks whether an array is sorted, i.e. if there is no element which would be greater than its successor. Below is a sample implementation:

```
bool is_sorted(const int32_t* input, size_t n) {
    if (n < 2) {
        return true;
    }

    for (size_t i=0; i < n - 1; i++) {
        if (input[i] > input[i + 1])
            return false;
    }

    return true;
}
```

I was sure that such a trivial loop would be autovectorized by any decent compiler. I checked this on the Compiler Explorer and, to my surprise, none of the compilers does it. This is the state for GCC 7.3 (and the upcoming GCC 8.0), clang 6.0 and ICC 19.

This text explores possible vectorization schemas.

Vectorization

In the following examples vectors of signed 32-bit integers are assumed.

Generic SIMD

The simplest vectorized solution is suitable for all SIMD flavours. In a loop two vectors are filled; one vector contains items a = input[0 .. k - 1], another b = input[1 .. k] (where k is the vector size; for SSE k=4, for AVX2 k=8 and for AVX512 k=16).

Then a greater-than comparison yields the vector of results a[i] > b[i]. If all its elements are zero (false), the order is not violated anywhere in the range 0 .. k-1.

Below is the outline of the algorithm's loop (with k=4):

1. Load the vector curr:

```
curr = [    a0   |    a1   |    a2   |    a3   ]
```

2. Load the vector next, offset by one element:

```
next = [    a1   |    a2   |    a3   |    a4   ]
```

3. Compare curr > next:

```
mask = [ a0 > a1 | a1 > a2 | a2 > a3 | a3 > a4 ]
```

4. If any element of mask is not zero, then return false.

5. Otherwise advance the input pointer by k and go back to step 1.
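The steps above can be modeled in plain scalar C++ (a sketch without intrinsics; `is_sorted_blocked` is a hypothetical name, and the all-zero mask test is replaced by a per-element check):

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of the generic SIMD loop with k = 4: in each step compare the
// block input[i .. i+3] against input[i+1 .. i+4] and fail on any violation.
bool is_sorted_blocked(const int32_t* input, size_t n) {
    size_t i = 0;
    if (n > 4) {
        for (/**/; i < n - 4; i += 4) {
            bool violated = false;                 // plays the role of the mask
            for (size_t j = 0; j < 4; j++) {
                violated |= input[i + j] > input[i + j + 1];
            }
            if (violated) {
                return false;
            }
        }
    }

    // scalar tail for the remaining elements
    for (/**/; i + 1 < n; i++) {
        if (input[i] > input[i + 1])
            return false;
    }

    return true;
}
```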

Sample SSE implementation

```
bool is_sorted_sse(int32_t* a, size_t n) {

    size_t i = 0;
    if (n > 4) {
        for (/**/; i < n - 4; i += 4) {
            const __m128i curr = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
            const __m128i next = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i + 1));

            const __m128i mask = _mm_cmpgt_epi32(curr, next);
            if (_mm_movemask_epi8(mask) != 0) {
                return false;
            }
        }
    }

    for (/**/; i + 1 < n; i++) {
        if (a[i] > a[i + 1])
            return false;
    }

    return true;
}
```

SSE

The generic approach has one issue: the vector next shares k-1 elements with curr, yet we read all k of its elements from memory again.

With the help of the SSSE3 instruction palignr (intrinsic _mm_alignr_epi8) the number of memory accesses can be reduced. The instruction takes two 16-byte vectors, joins them into a 32-byte temporary array and then copies the selected 16-byte subarray into the destination vector.
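The byte-level semantics of the instruction can be modeled in scalar code (a sketch; `alignr_model` is a hypothetical helper mirroring `_mm_alignr_epi8(hi, lo, shift)`):

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Scalar model of _mm_alignr_epi8(hi, lo, shift): concatenate lo (low half)
// and hi (high half) into a 32-byte buffer, then copy the 16 bytes starting
// at byte offset `shift`.
std::array<uint8_t, 16> alignr_model(const std::array<uint8_t, 16>& hi,
                                     const std::array<uint8_t, 16>& lo,
                                     unsigned shift) {
    uint8_t tmp[32];
    std::memcpy(tmp,      lo.data(), 16);
    std::memcpy(tmp + 16, hi.data(), 16);

    std::array<uint8_t, 16> result;
    std::memcpy(result.data(), tmp + shift, 16);
    return result;
}
```

With shift = 4 (the size of one 32-bit element) the result is the 16-byte window moved one int32 to the right, which is exactly how the vector next is built below.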

We keep two vectors (chunk0 and chunk1) which together hold an eight-element subarray of the input. In each iteration this window advances by one chunk, and we read just one new chunk, i.e. at the cost of a single memory load.

The algorithm works as follows:

1. Before the main loop load into vector chunk0 the first portion of the input:

```
// chunk0 = [    a0   |    a1   |    a2   |    a3   ]
```

2. In the loop read the next portion of the array into chunk1:

```
// chunk1 = [    a4   |    a5   |    a6   |    a7   ]
const __m128i chunk1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(array + i + 4));
```

3. Now build the vector curr; it is simply an alias for chunk0:

```
// curr   = [    a0   |    a1   |    a2   |    a3   ]
const __m128i curr = chunk0;
```

4. The vector next is built from the pair chunk1:chunk0 using _mm_alignr_epi8:

```
// temp   = [    a0   |    a1   |    a2   |    a3   |    a4   |    a5   |    a6   |    a7   ]
//          |              chunk0                   |                 chunk1                |

// next   = [    a1   |    a2   |    a3   |    a4   ]
const __m128i next = _mm_alignr_epi8(chunk1, chunk0, 4);
```

5. Finally the vectors are compared:

```
// mask   = [ a0 > a1 | a1 > a2 | a2 > a3 | a3 > a4 ]
const __m128i mask = _mm_cmpgt_epi32(curr, next);
```

6. If any element of mask is not zero, then return false.

7. Shift the window right:

```
chunk0 = chunk1;
// chunk1 will be updated at the loop beginning
```

8. Advance the input index by 4 and go back to step 2.

Sample implementation

```
bool is_sorted_sse_2(int32_t* a, size_t n) {

    size_t i = 0;
    if (n >= 8) {
        __m128i chunk0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
        do {
            const __m128i chunk1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i + 4));
            const __m128i curr = chunk0;
            const __m128i next = _mm_alignr_epi8(chunk1, chunk0, 4);

            const __m128i mask = _mm_cmpgt_epi32(curr, next);
            if (_mm_movemask_epi8(mask) != 0) {
                return false;
            }

            chunk0 = chunk1;
            i += 4;
        } while (i < n - 4);
    }

    for (/**/; i + 1 < n; i++) {
        if (a[i] > a[i + 1])
            return false;
    }

    return true;
}
```

AVX2

Unfortunately the AVX2 version of _mm256_alignr_epi8 doesn't operate on the whole 32-byte register, but on its 16-byte halves (lanes).

The AVX2 approach uses the instruction _mm256_permutevar8x32_epi32, which moves 32-bit elements across the lanes in a given order.

In a single iteration we read eight elements:

```
curr = [ a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 ]
```

Then the vector curr is shifted (permuted) right by one element; only the last item (a7) stays in place:

```
next = [ a1 | a2 | a3 | a4 | a5 | a6 | a7 | a7 ]
```

The comparison effectively tests the first seven elements:

```
mask = [ a0 > a1 | a1 > a2 | a2 > a3 | a3 > a4 | a4 > a5 | a5 > a6 | a6 > a7 | a7 > a7 ]
                                                                               ^^^^^^^
                                                                             always false
```
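The permutation can be modeled in scalar code (a sketch; `permutevar8x32_model` is a hypothetical helper mirroring `_mm256_permutevar8x32_epi32`, which selects `out[i] = in[idx[i]]`):

```cpp
#include <array>
#include <cstdint>

// Scalar model of _mm256_permutevar8x32_epi32: each output element picks an
// input element by index; only the low 3 bits of each index are used.
std::array<int32_t, 8> permutevar8x32_model(const std::array<int32_t, 8>& in,
                                            const std::array<int32_t, 8>& idx) {
    std::array<int32_t, 8> out;
    for (int i = 0; i < 8; i++) {
        out[i] = in[idx[i] & 7];
    }
    return out;
}
```

With the pattern {1, 2, 3, 4, 5, 6, 7, 7} every element moves one position towards the front and the last element is duplicated, which produces the vector next shown above.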

Sample implementation

```
bool is_sorted_avx2(int32_t* a, size_t n) {

    const __m256i shuffle_pattern = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 7);

    size_t i = 0;
    if (n >= 8) {
        while (i < n - 8) {
            // curr = [ a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 ]
            const __m256i curr = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
            // next = [ a1 | a2 | a3 | a4 | a5 | a6 | a7 | a7 ]
            const __m256i next = _mm256_permutevar8x32_epi32(curr, shuffle_pattern);

            // Note: the last element of curr and next is a7, thus for this element
            //       the comparison result is always zero.
            //
            // In fact, only the first 7 elements are being tested.
            const __m256i mask = _mm256_cmpgt_epi32(curr, next);
            if (_mm256_movemask_epi8(mask) != 0) {
                return false;
            }

            i += 7;
        }
    }

    for (/**/; i + 1 < n; i++) {
        if (a[i] > a[i + 1])
            return false;
    }

    return true;
}
```

Performance comparison

The tested procedures:

- scalar: the is_sorted procedure
- SSE (generic): the generic SIMD algorithm using SSE instructions
- SSE (generic, unrolled 4 times): the above procedure unrolled 4 times
- AVX2 (generic): the generic SIMD algorithm using AVX2 instructions
- AVX2 (generic, unrolled 4 times): the above procedure unrolled 4 times
- AVX512 (generic): the generic SIMD algorithm using AVX512F instructions
- SSE: the SSE-specific procedure
- SSE (unrolled 4 times): implementation of the above procedure proposed by HeroicKatora on reddit; thank you!
- AVX2: the AVX2-specific procedure
- AVX2 (unrolled 4 times): the above procedure unrolled 4 times; processes 4*7 items per iteration
- AVX512: the AVX512 variant of the AVX2 approach

Skylake

Compiler: GCC 7.3.0

CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

| procedure | time [us] | speed-up |
|---|---|---|
| **element count 128** | | |
| scalar | 8080 | 1.00 |
| SSE (generic) | 2970 | 2.72 |
| SSE (generic, unrolled 4 times) | 1917 | 4.21 |
| SSE | 3696 | 2.19 |
| SSE (unrolled 4 times) | 1845 | 4.38 |
| AVX2 (generic) | 1769 | 4.57 |
| AVX2 (generic, unrolled 4 times) | 3216 | 2.51 |
| AVX2 | 1945 | 4.15 |
| AVX2 (unrolled 4 times) | 1571 | 5.14 |
| **element count 256** | | |
| scalar | 13982 | 1.00 |
| SSE (generic) | 5022 | 2.78 |
| SSE (generic, unrolled 4 times) | 2865 | 4.88 |
| SSE | 7054 | 1.98 |
| SSE (unrolled 4 times) | 3139 | 4.45 |
| AVX2 (generic) | 4037 | 3.46 |
| AVX2 (generic, unrolled 4 times) | 4395 | 3.18 |
| AVX2 | 3722 | 3.76 |
| AVX2 (unrolled 4 times) | 2263 | 6.18 |
| **element count 512** | | |
| scalar | 25039 | 1.00 |
| SSE (generic) | 9222 | 2.72 |
| SSE (generic, unrolled 4 times) | 5951 | 4.21 |
| SSE | 13525 | 1.85 |
| SSE (unrolled 4 times) | 6407 | 3.91 |
| AVX2 (generic) | 6806 | 3.68 |
| AVX2 (generic, unrolled 4 times) | 6850 | 3.66 |
| AVX2 | 6759 | 3.70 |
| AVX2 (unrolled 4 times) | 5071 | 4.94 |
| **element count 1024** | | |
| scalar | 46586 | 1.00 |
| SSE (generic) | 17792 | 2.62 |
| SSE (generic, unrolled 4 times) | 10671 | 4.37 |
| SSE | 26160 | 1.78 |
| SSE (unrolled 4 times) | 11598 | 4.02 |
| AVX2 (generic) | 12483 | 3.73 |
| AVX2 (generic, unrolled 4 times) | 11602 | 4.02 |
| AVX2 | 12346 | 3.77 |
| AVX2 (unrolled 4 times) | 8770 | 5.31 |
| **element count 2048** | | |
| scalar | 86920 | 1.00 |
| SSE (generic) | 34936 | 2.49 |
| SSE (generic, unrolled 4 times) | 20131 | 4.32 |
| SSE | 51254 | 1.70 |
| SSE (unrolled 4 times) | 21978 | 3.95 |
| AVX2 (generic) | 23776 | 3.66 |
| AVX2 (generic, unrolled 4 times) | 21038 | 4.13 |
| AVX2 | 23976 | 3.63 |
| AVX2 (unrolled 4 times) | 17327 | 5.02 |
| **element count 4096** | | |
| scalar | 171457 | 1.00 |
| SSE (generic) | 68884 | 2.49 |
| SSE (generic, unrolled 4 times) | 39006 | 4.40 |
| SSE | 101823 | 1.68 |
| SSE (unrolled 4 times) | 42858 | 4.00 |
| AVX2 (generic) | 46424 | 3.69 |
| AVX2 (generic, unrolled 4 times) | 39651 | 4.32 |
| AVX2 | 47139 | 3.64 |
| AVX2 (unrolled 4 times) | 33659 | 5.09 |
| **element count 16384** | | |
| scalar | 671985 | 1.00 |
| SSE (generic) | 311616 | 2.16 |
| SSE (generic, unrolled 4 times) | 210776 | 3.19 |
| SSE | 409935 | 1.64 |
| SSE (unrolled 4 times) | 191207 | 3.51 |
| AVX2 (generic) | 210032 | 3.20 |
| AVX2 (generic, unrolled 4 times) | 207815 | 3.23 |
| AVX2 | 219203 | 3.07 |
| AVX2 (unrolled 4 times) | 164703 | 4.08 |
| **element count 65536** | | |
| scalar | 2587279 | 1.00 |
| SSE (generic) | 1258949 | 2.06 |
| SSE (generic, unrolled 4 times) | 870655 | 2.97 |
| SSE | 1645715 | 1.57 |
| SSE (unrolled 4 times) | 796933 | 3.25 |
| AVX2 (generic) | 850206 | 3.04 |
| AVX2 (generic, unrolled 4 times) | 860592 | 3.01 |
| AVX2 | 899470 | 2.88 |
| AVX2 (unrolled 4 times) | 724050 | 3.57 |

Skylake-X

Compiler: GCC 5.4.0

CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz

| procedure | time [us] | speed-up |
|---|---|---|
| **element count 128** | | |
| scalar | 6091 | 1.00 |
| SSE (generic) | 2634 | 2.31 |
| SSE (generic, unrolled 4 times) | 2084 | 2.92 |
| SSE | 3329 | 1.83 |
| SSE (unrolled 4 times) | 2327 | 2.62 |
| AVX2 (generic) | 1648 | 3.70 |
| AVX2 (generic, unrolled 4 times) | 2191 | 2.78 |
| AVX2 | 1774 | 3.43 |
| AVX2 (unrolled 4 times) | 1657 | 3.68 |
| AVX512 (generic) | 1367 | 4.46 |
| AVX512 | 1150 | 5.30 |
| **element count 256** | | |
| scalar | 11445 | 1.00 |
| SSE (generic) | 5008 | 2.29 |
| SSE (generic, unrolled 4 times) | 3363 | 3.40 |
| SSE | 6593 | 1.74 |
| SSE (unrolled 4 times) | 3976 | 2.88 |
| AVX2 (generic) | 3021 | 3.79 |
| AVX2 (generic, unrolled 4 times) | 2954 | 3.87 |
| AVX2 | 3455 | 3.31 |
| AVX2 (unrolled 4 times) | 2061 | 5.55 |
| AVX512 (generic) | 1938 | 5.91 |
| AVX512 | 2012 | 5.69 |
| **element count 512** | | |
| scalar | 22133 | 1.00 |
| SSE (generic) | 12736 | 1.74 |
| SSE (generic, unrolled 4 times) | 6163 | 3.59 |
| SSE | 14349 | 1.54 |
| SSE (unrolled 4 times) | 7268 | 3.05 |
| AVX2 (generic) | 5937 | 3.73 |
| AVX2 (generic, unrolled 4 times) | 5141 | 4.31 |
| AVX2 | 8279 | 2.67 |
| AVX2 (unrolled 4 times) | 4053 | 5.46 |
| AVX512 (generic) | 3094 | 7.15 |
| AVX512 | 3562 | 6.21 |
| **element count 1024** | | |
| scalar | 43534 | 1.00 |
| SSE (generic) | 24440 | 1.78 |
| SSE (generic, unrolled 4 times) | 12885 | 3.38 |
| SSE | 27446 | 1.59 |
| SSE (unrolled 4 times) | 13877 | 3.14 |
| AVX2 (generic) | 12664 | 3.44 |
| AVX2 (generic, unrolled 4 times) | 7503 | 5.80 |
| AVX2 | 14728 | 2.96 |
| AVX2 (unrolled 4 times) | 8034 | 5.42 |
| AVX512 (generic) | 5380 | 8.09 |
| AVX512 | 5443 | 8.00 |
| **element count 2048** | | |
| scalar | 86305 | 1.00 |
| SSE (generic) | 47782 | 1.81 |
| SSE (generic, unrolled 4 times) | 24381 | 3.54 |
| SSE | 53797 | 1.60 |
| SSE (unrolled 4 times) | 27908 | 3.09 |
| AVX2 (generic) | 24044 | 3.59 |
| AVX2 (generic, unrolled 4 times) | 13547 | 6.37 |
| AVX2 | 28291 | 3.05 |
| AVX2 (unrolled 4 times) | 15484 | 5.57 |
| AVX512 (generic) | 10813 | 7.98 |
| AVX512 | 11200 | 7.71 |
| **element count 4096** | | |
| scalar | 171882 | 1.00 |
| SSE (generic) | 94751 | 1.81 |
| SSE (generic, unrolled 4 times) | 47284 | 3.64 |
| SSE | 106064 | 1.62 |
| SSE (unrolled 4 times) | 54288 | 3.17 |
| AVX2 (generic) | 47036 | 3.65 |
| AVX2 (generic, unrolled 4 times) | 26236 | 6.55 |
| AVX2 | 55040 | 3.12 |
| AVX2 (unrolled 4 times) | 29973 | 5.73 |
| AVX512 (generic) | 19904 | 8.64 |
| AVX512 | 21735 | 7.91 |
| **element count 16384** | | |
| scalar | 685498 | 1.00 |
| SSE (generic) | 417421 | 1.64 |
| SSE (generic, unrolled 4 times) | 266764 | 2.57 |
| SSE | 438934 | 1.56 |
| SSE (unrolled 4 times) | 239919 | 2.86 |
| AVX2 (generic) | 204391 | 3.35 |
| AVX2 (generic, unrolled 4 times) | 181020 | 3.79 |
| AVX2 | 253740 | 2.70 |
| AVX2 (unrolled 4 times) | 165795 | 4.13 |
| AVX512 (generic) | 182358 | 3.76 |
| AVX512 | 117629 | 5.83 |
| **element count 65536** | | |
| scalar | 2740076 | 1.00 |
| SSE (generic) | 1663774 | 1.65 |
| SSE (generic, unrolled 4 times) | 1062554 | 2.58 |
| SSE | 1756262 | 1.56 |
| SSE (unrolled 4 times) | 938962 | 2.92 |
| AVX2 (generic) | 973695 | 2.81 |
| AVX2 (generic, unrolled 4 times) | 792764 | 3.46 |
| AVX2 | 1011261 | 2.71 |
| AVX2 (unrolled 4 times) | 660586 | 4.15 |
| AVX512 (generic) | 729784 | 3.75 |
| AVX512 | 467945 | 5.86 |