# Sorting an AVX512 register

Author: Wojciech Muła 2016-10-08

# Introduction

Presented method allows to sort a whole AVX512 register or its subrange, it is a variant of counting sort. The time complexity is linear, moreover method works entirely on registers, no extra memory operations are done. It may also be easily extended to sorting more than one register.

The method is suitable for sorting 32- and 64-bit integers, and also floating point numbers, both single and double precision.

# Algorithm

A single step of the algorithm consist:

1. Broadcast i-th element of the input.
2. Compare for less the broadcasted vector with the input. This yields the number of items less than i-th item.
3. Likewise compare for equality. This yields number of item's repetitions
4. Having these two numbers we can form a mask which can be used to merge the broadacted vector with a result vector, putting i-th number at the correct position.

This is repeated for every item of input. When sorting 32-bit numbers 16 iterations have to be done.

## Example

```result = [   0,   0,   0,   0,   0,   0,   0,   0]
input  = [  10,   1,   5,  20,  10,   8,  60,  99]
brds   = [  10,  10,  10,  10,  10,  10,  10,  10] - broadcast 10

less   = [   0,   1,   1,   0,   0,   1,   0,   0] -> 3 items less than 10
equal  = [   1,   0,   0,   0,   1,   0,   0,   0] -> 10 repeats 2 times

mask   = (1 << (3 + 2)) - (1 << 3) = 0b00011000

result = merge(brds, mask)
[   0,   0,   0,  10,  10,   0,   0,   0]
```

## Implementation

Sample implementation with loop. It is slower than fully unrolled code.

```__m512i avx512_sort_loop_epi32(const __m512i v) {

__m512i result = _mm512_setzero_si512();
__m512i index  = _mm512_setzero_si512();
__m512i incr   = _mm512_set1_epi32(1);

for (int i=0; i < 16; i++) {
const __m512i  b    = _mm512_permutexvar_epi32(index, v);
const uint16_t lt   = _mm_popcnt_u32(_mm512_cmplt_epi32_mask(v, b));
const uint16_t eq   = _mm_popcnt_u32(_mm512_cmpeq_epi32_mask(v, b));
const uint16_t mask = (uint32_t(1) << (lt + eq)) - (uint32_t(1) << lt);
index  = _mm512_add_epi32(index, incr);
}

return result;
}
```

## Performance results

Compiler: GCC 5.3.0 CPU: Knights Landing 7120

Sorting 16 x 32-bit numbers (one AVX512 register)

algorithm time [s]
std::sort 1.53
insertion sort 5.55
AVX512F unrolled 1.77
AVX512F (for loop) 2.04

Sorting 32 x 32-bit numbers (two AVX512 registers)

algorithm time [s]
std::sort 5.24
insertion sort 12.99
AVX512F unrolled 5.78

Speed of vectorized algorithms is comparable to std::sort for C++ library. However, the algorithms are not meant to be used as a replacement of library function, but as a part of other algorithms. For instance the unrolled vectorized algorithm was used in quicksort implementation to sort short ranges, making the whole sorting substantially faster.

## Source code

Github repository contains various variants of sorting and a test program.