Is sorted using SIMD instructions

Author:	Wojciech Muła
Added on:	2018-04-11
Updated on:	2018-04-25 (sync source codes with github version), 2018-04-18 (links to HN and reddit discussions, test performance of unrolled SSE and AVX2 procedures, and AVX512 procedures)

Contents

Introduction
Vectorization
Performance comparison
- Skylake
- Skylake-X
See also
Source code

Introduction

Recently, I came across a function that checks whether an array is sorted, i.e. if there is no element which would be greater than its successor. Below is a sample implementation:

bool is_sorted(const int32_t* input, size_t n) {
    if (n < 2) {
        return true;
    }

    for (size_t i=0; i < n - 1; i++) {
        if (input[i] > input[i + 1])
            return false;
    }

    return true;
}

I was sure that such a trivial loop is autovectorized by all decent compilers. I checked this on Compiler Explorer and to my surprise none of compilers does it. This is the state for GCC 7.3 (and upcoming GCC 8.0), clang 6.0 and ICC 19.

This text explores possible vectorization schemas.

Vectorization

In the following examples vectors of signed 32-bit integers are assumed.

Generic SIMD

The simplest vectorized solution is suitable for all SIMD flavours. In a loop two vectors are filled; one vector contains items a = input[0 .. k - 1], another b = input[1 .. k] (where k is the vector size; for SSE k=4, for AVX2 k=8 and for AVX512 k=16).

Then the comparison for greater yields a vector a[i] > b[i]. If all its elements are zero (false) it means that in the range 0 .. k-1 the relation is not violated.

Below is the outline of the algorithm's loop (with k=4):

Load the first vector:

curr = [    a0   |    a1   |    a2   |    a3   ]

Load the second vector:

next = [    a1   |    a2   |    a3   |    a4   ]

Compare curr > next:

mask = [ a0 > a1 | a1 > a2 | a2 > a3 | a3 > a4 ]

If any element of mask is not zero, then return false.
Otherwise advance the input pointer by k and go back to 1.

Sample SSE implementation

bool is_sorted_sse(int32_t* a, size_t n) {

    size_t i = 0;
    if (n > 4) {
        for (/**/; i < n - 4; i += 4) {
            const __m128i curr = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
            const __m128i next = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i + 1));

            const __m128i mask = _mm_cmpgt_epi32(curr, next);
            if (!_mm_test_all_zeros(mask, mask)) {
                return false;
            }
        }
    }

    for (/**/; i + 1 < n; i++) {
        if (a[i] > a[i + 1])
            return false;
    }

    return true;
}

SSE

The generic approach has one issue. The vector next shares k-1 elements with curr, but we anyway read from the memory all k elements.

With help of the SSE instruction _mm_palignr_epi8 (palingr) the number of memory accesses can be reduced. The instruction gets two 16-byte vectors, joins them into a 32-byte temporary array and then copies the selected subarray into a 16-byte vector.

We keep two vectors (chunk0 and chunk1) containing as a whole an eight-element subarray of the input. In each iteration this subarray is shifted right by one element. In each iteration we read just one chunk, i.e. it done at cost of single memory load.

The algorithm works as follows:

Before the main loop load into vector chunk0 the first portion of input:

// chunk0 = [    a0   |    a1   |    a2   |    a3   ]
__m128i chunk0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(array));

In the loop read the next portion of array into chunk1:

// chunk1 = [    a4   |    a5   |    a6   |    a7   ]
const __m128i chunk1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(array + i + 4));

Now build the vector curr — it is simply an alias for chunk0:

// curr   = [    a0   |    a1   |    a2   |    a3   ]
const __m128i curr = chunk0;

The vector next is build from the pair chunk1:chunk0 using _mm_alignr_epi8:

// tem    = [    a0   |    a1   |    a2   |    a3   |    a4   |    a5   |    a6   |    a7   ]
//          |              chunk0                   |                 chunk1                |

// next   = [    a1   |    a2   |    a3   |    a4   ]
const __m128i next = _mm_alignr_epi8(chunk1, chunk0, 4);

Finally the vectors are compared:

// mask   = [ a0 > a1 | a1 > a2 | a2 > a3 | a3 > a4 ]
const __m128i mask = _mm_cmpgt_epi32(curr, next);

If any element of mask is not zero, then return false.
Shift right the input view:

chunk0 = chunk1;
// chunk1 will be updated at the loop beginning

Advance the input pointer by 4 and go to 1.

Sample implementation

bool is_sorted_sse_2(int32_t* a, size_t n) {

    size_t i = 0;
    if (n >= 8) {
        __m128i chunk0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
        do {
            const __m128i chunk1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i + 4));
            const __m128i curr = chunk0;
            const __m128i next = _mm_alignr_epi8(chunk1, chunk0, 4);

            const __m128i mask = _mm_cmpgt_epi32(curr, next);
            if (!_mm_test_all_zeros(mask, mask)) {
                return false;
            }

            chunk0 = chunk1;
            i += 4;
        } while (i < n - 4);
    }

    for (/**/; i + 1 < n; i++) {
        if (a[i] > a[i + 1])
            return false;
    }

    return true;
}

AVX2

Unfortunately the AVX2 version of _mm256_alignr_epi8 doesn't operate on the whole 32-byte register, but on its 16-byte halves (lanes).

The AVX2 approach uses instruction _mm256_permutevar8x32_epi32, which moves 32-bit elements across the lanes in the given order.

In a single iteration we read eight elements:

curr = [ a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 ]

Then the vector curr is shifted (permuted) by one element right, only the last item (a7) is kept on the same position:

next = [ a1 | a2 | a3 | a4 | a5 | a6 | a7 | a7 ]

The comparison efficiently tests the first seven elements:

mask = [ a0 > a1 | a1 > a2 | a2 > a3 | a3 > a4 | a4 > a5 | a5 > a6 | a6 > a7 | a7 > a7 ]
                                                                              always false

Sample implementation

bool is_sorted_avx2(int32_t* a, size_t n) {

    const __m256i shuffle_pattern = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 7);

    size_t i = 0;
    while (i < n - 8) {
        // curr = [ a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 ]
        const __m256i curr = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        // next = [ a1 | a2 | a3 | a4 | a5 | a6 | a7 | a7 ]
        const __m256i next = _mm256_permutevar8x32_epi32(curr, shuffle_pattern);

        // Note: the last element of curr and next is a7, thus for this element
        //       the comparison result is always zero.
        //
        // In fact, the first 7 elements are being tested.
        const __m256i mask = _mm256_cmpgt_epi32(curr, next);
        if (!_mm256_testz_si256(mask, mask)) {
            return false;
        }

        i += 7;
    }

    for (/**/; i + 1 < n; i++) {
        if (a[i] > a[i + 1])
            return false;
    }

    return true;
}

Performance comparison

Tested procedures
scalar	`is_sorted`
SSE (generic)	the generic SIMD algorithm using SSE instructions
SSE (generic, unrolled 4 times)	the above procedure unrolled 4 times
AVX2 (generic)	the generic SIMD algorithm using AVX2 instructions
AVX2 (generic, unrolled 4 times)	the above procedure unrolled 4 times
AVX512 (generic)	the generic SIMD algorithm using AVX512F instructions
SSE	SSE-specific procedure
SSE (unrolled 4 times)	implementation of the above procedure proposed by HeroicKatora on reddit; thank you!
AVX2	AVX2-specific procedure
AVX2 (unrolled 4 times)	the above procedure unrolled 4 times; processes 4*7 items per iteration
AVX512	AVX512 variant of the AVX2 approach

Skylake

Compiler: GCC 7.3.0

CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

procedure	time [us]	speed-up
element count 128
scalar	8080	1.00
SSE (generic)	2970	2.72
SSE (generic, unrolled 4 times)	1917	4.21
SSE	3696	2.19
SSE (unrolled 4 times)	1845	4.38
AVX2 (generic)	1769	4.57
AVX2 (generic, unrolled 4 times)	3216	2.51
AVX2	1945	4.15
AVX2 (unrolled 4 times)	1571	5.14
element count 256
scalar	13982	1.00
SSE (generic)	5022	2.78
SSE (generic, unrolled 4 times)	2865	4.88
SSE	7054	1.98
SSE (unrolled 4 times)	3139	4.45
AVX2 (generic)	4037	3.46
AVX2 (generic, unrolled 4 times)	4395	3.18
AVX2	3722	3.76
AVX2 (unrolled 4 times)	2263	6.18
element count 512
scalar	25039	1.00
SSE (generic)	9222	2.72
SSE (generic, unrolled 4 times)	5951	4.21
SSE	13525	1.85
SSE (unrolled 4 times)	6407	3.91
AVX2 (generic)	6806	3.68
AVX2 (generic, unrolled 4 times)	6850	3.66
AVX2	6759	3.70
AVX2 (unrolled 4 times)	5071	4.94
element count 1024
scalar	46586	1.00
SSE (generic)	17792	2.62
SSE (generic, unrolled 4 times)	10671	4.37
SSE	26160	1.78
SSE (unrolled 4 times)	11598	4.02
AVX2 (generic)	12483	3.73
AVX2 (generic, unrolled 4 times)	11602	4.02
AVX2	12346	3.77
AVX2 (unrolled 4 times)	8770	5.31
element count 2048
scalar	86920	1.00
SSE (generic)	34936	2.49
SSE (generic, unrolled 4 times)	20131	4.32
SSE	51254	1.70
SSE (unrolled 4 times)	21978	3.95
AVX2 (generic)	23776	3.66
AVX2 (generic, unrolled 4 times)	21038	4.13
AVX2	23976	3.63
AVX2 (unrolled 4 times)	17327	5.02
element count 4096
scalar	171457	1.00
SSE (generic)	68884	2.49
SSE (generic, unrolled 4 times)	39006	4.40
SSE	101823	1.68
SSE (unrolled 4 times)	42858	4.00
AVX2 (generic)	46424	3.69
AVX2 (generic, unrolled 4 times)	39651	4.32
AVX2	47139	3.64
AVX2 (unrolled 4 times)	33659	5.09
element count 16384
scalar	671985	1.00
SSE (generic)	311616	2.16
SSE (generic, unrolled 4 times)	210776	3.19
SSE	409935	1.64
SSE (unrolled 4 times)	191207	3.51
AVX2 (generic)	210032	3.20
AVX2 (generic, unrolled 4 times)	207815	3.23
AVX2	219203	3.07
AVX2 (unrolled 4 times)	164703	4.08
element count 65536
scalar	2587279	1.00
SSE (generic)	1258949	2.06
SSE (generic, unrolled 4 times)	870655	2.97
SSE	1645715	1.57
SSE (unrolled 4 times)	796933	3.25
AVX2 (generic)	850206	3.04
AVX2 (generic, unrolled 4 times)	860592	3.01
AVX2	899470	2.88
AVX2 (unrolled 4 times)	724050	3.57

Skylake-X

Compiler: GCC 5.4.0

CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz

procedure	time [us]	speed-up
element count 128
scalar	6091	1.00
SSE (generic)	2634	2.31
SSE (generic, unrolled 4 times)	2084	2.92
SSE	3329	1.83
SSE (unrolled 4 times)	2327	2.62
AVX2 (generic)	1648	3.70
AVX2 (generic, unrolled 4 times)	2191	2.78
AVX2	1774	3.43
AVX2 (unrolled 4 times)	1657	3.68
AVX512 (generic)	1367	4.46
AVX512	1150	5.30
element count 256
scalar	11445	1.00
SSE (generic)	5008	2.29
SSE (generic, unrolled 4 times)	3363	3.40
SSE	6593	1.74
SSE (unrolled 4 times)	3976	2.88
AVX2 (generic)	3021	3.79
AVX2 (generic, unrolled 4 times)	2954	3.87
AVX2	3455	3.31
AVX2 (unrolled 4 times)	2061	5.55
AVX512 (generic)	1938	5.91
AVX512	2012	5.69
element count 512
scalar	22133	1.00
SSE (generic)	12736	1.74
SSE (generic, unrolled 4 times)	6163	3.59
SSE	14349	1.54
SSE (unrolled 4 times)	7268	3.05
AVX2 (generic)	5937	3.73
AVX2 (generic, unrolled 4 times)	5141	4.31
AVX2	8279	2.67
AVX2 (unrolled 4 times)	4053	5.46
AVX512 (generic)	3094	7.15
AVX512	3562	6.21
element count 1024
scalar	43534	1.00
SSE (generic)	24440	1.78
SSE (generic, unrolled 4 times)	12885	3.38
SSE	27446	1.59
SSE (unrolled 4 times)	13877	3.14
AVX2 (generic)	12664	3.44
AVX2 (generic, unrolled 4 times)	7503	5.80
AVX2	14728	2.96
AVX2 (unrolled 4 times)	8034	5.42
AVX512 (generic)	5380	8.09
AVX512	5443	8.00
element count 2048
scalar	86305	1.00
SSE (generic)	47782	1.81
SSE (generic, unrolled 4 times)	24381	3.54
SSE	53797	1.60
SSE (unrolled 4 times)	27908	3.09
AVX2 (generic)	24044	3.59
AVX2 (generic, unrolled 4 times)	13547	6.37
AVX2	28291	3.05
AVX2 (unrolled 4 times)	15484	5.57
AVX512 (generic)	10813	7.98
AVX512	11200	7.71
element count 4096
scalar	171882	1.00
SSE (generic)	94751	1.81
SSE (generic, unrolled 4 times)	47284	3.64
SSE	106064	1.62
SSE (unrolled 4 times)	54288	3.17
AVX2 (generic)	47036	3.65
AVX2 (generic, unrolled 4 times)	26236	6.55
AVX2	55040	3.12
AVX2 (unrolled 4 times)	29973	5.73
AVX512 (generic)	19904	8.64
AVX512	21735	7.91
element count 16384
scalar	685498	1.00
SSE (generic)	417421	1.64
SSE (generic, unrolled 4 times)	266764	2.57
SSE	438934	1.56
SSE (unrolled 4 times)	239919	2.86
AVX2 (generic)	204391	3.35
AVX2 (generic, unrolled 4 times)	181020	3.79
AVX2	253740	2.70
AVX2 (unrolled 4 times)	165795	4.13
AVX512 (generic)	182358	3.76
AVX512	117629	5.83
element count 65536
scalar	2740076	1.00
SSE (generic)	1663774	1.65
SSE (generic, unrolled 4 times)	1062554	2.58
SSE	1756262	1.56
SSE (unrolled 4 times)	938962	2.92
AVX2 (generic)	973695	2.81
AVX2 (generic, unrolled 4 times)	792764	3.46
AVX2	1011261	2.71
AVX2 (unrolled 4 times)	660586	4.15
AVX512 (generic)	729784	3.75
AVX512	467945	5.86

Source code

All implementations are available on github.

Is sorted using SIMD instructions

Introduction

Vectorization

Generic SIMD

Sample SSE implementation

SSE

Sample implementation

AVX2

Sample implementation

Performance comparison

Skylake

Skylake-X

See also

Source code