Finding index of the minimum value using SIMD instructions

Author: Wojciech Muła 2018-10-03

Introduction

The goal is to find the first index of the minimum value in a non-empty sequence.

Following C code shows the idea.

```size_t min_index_scalar(int32_t* array, size_t size) {

assert(array != nullptr);
assert(size > 0);

size_t minindex = 0;
int32_t minvalue = array[0];

for (size_t i=1; i < size; i++) {
if (array[i] < minvalue) {
minvalue = array[i];
minindex = i;
}
}

return minindex;
}
```

The C++ standard library allows to express the same algorithm in one line.

```#include <algorithm> // for std::min_element
#include <iterator>  // for std::distance

template <typename T>
size_t min_index(const T& v) {

assert(!v.empty());

return std::distance(v.begin(), std::min_element(v.begin(), v.end()));
}
```

The current versions of compilers (GCC 8.2, clang 7.0.0) are not able to autovectorize the code. However, they do autovectorize finding the minimum value, i.e. statement like return *std::min_element(v.begin(), v.end()).

SIMD approach

In a SIMD approach we keep three vectors of: minimum values, corresponding indices and also current scalar indices. When the main loop completes, we select the appropriate single index from these vectors.

The algorithm outline is shown below (for four-element vectors).

```// 1. SIMD part

Vector indices    = [0, 1, 2, 3]                    // basically [i + 0, i + 1, i + 2, i + 4]
Vector increment  = [4, 4, 4, 4]
// sample values
Vector minvalues  = load_vector(input[0])           // [77, 33, 11, 44]
Vector minindices = [0, 1, 2, 3]                    // [ 0,  1,  2,  3]

for (i = 4; i < input_size; i += 4) {

indices += increment;                           // [4, 5, 6, 7]

// compare
Vector values = load_vector(input[i]);          // [55, 44, 22, 22]
Mask   less   = compare(values, minvalues);     // [55 < 77, 44 < 33, 22 < 11, 22 < 44]
// [true, false, false, true]
//  two items will be updated
// update
minvalues =  blend(minvalues, values, less)     // [55, 33, 11, 22]
minindices = blend(minindices, indices, less);  // [ 4,  1,  2,  7]
}

// 2. scalar part

min_value = get_item(min_values, 0);
min_index = get_item(min_indices, 0);
for (int i = 1; i < 4; i++) {
if (get_item(min_values, i) < min_value) {
min_value = get_item(min_values, i);
min_index = get_item(min_indices, i);
} else if (get_item(min_values, i) == min_values)
// if some values are repeated, pick the smaller index
min_index = min(min_index, get_item(min_indices, i));
}

return min_index;
```

Function compare yields a mask, it can be a byte-mask (SSE, AVX2) or a bit-mask (AVX512). In SSE it might be instruction pcmpgtd (_mm_pcmplt_epi32), in AVX512 it might be vpcmpd (_mm512_cmp_epi32_mask).

Function blend is a vector selection operator, i.e. mask[i] ? x[i] : y[i]; it stores items from either vector x or y based on the corresponding mask value. Many SIMD ISAs provides such operation; for instance SSE has the instruction pblendv (_mm_blendv_epi8).

Please note that in case of SSE the blend instruction is relative new, as it was introduced in SSE4.1. For really old CPUs the blend operator has to be expressed with binary operations: (x & mask) | (y & ~mask). Such expression is compiled into a sequence of three instructions: and, and-not, or.

Additionally, since blend instructions tend to be slow, it's better to update minvalues vector using the min operator.

Experiments

The above schema was translated into procedures using SSE (including an unrolled version), AVX2 and AVX512. In this article just SSE and AVX512 procedures are shown, but obviously all implementations are available.

SSE implementation

```size_t min_index_sse(int32_t* array, size_t size) {

common_assertions;

const __m128i increment = _mm_set1_epi32(4);
__m128i indices         = _mm_setr_epi32(0, 1, 2, 3);
__m128i minindices      = indices;

for (size_t i=4; i < size; i += 4) {

const __m128i values        = _mm_loadu_si128((__m128i*)(array + i));
const __m128i lt            = _mm_cmplt_epi32(values, minvalues);
minindices = _mm_blendv_epi8(minindices, indices, lt);
minvalues  = _mm_min_epi32(values, minvalues);
}

// find min index in vector result (in an extremely naive way)
int32_t values_array[4];
uint32_t indices_array[4];

_mm_storeu_si128((__m128i*)values_array, minvalues);
_mm_storeu_si128((__m128i*)indices_array, minindices);

size_t  minindex = indices_array[0];
int32_t minvalue = values_array[0];
for (int i=1; i < 4; i++) {
if (values_array[i] < minvalue) {
minvalue = values_array[i];
minindex = indices_array[i];
} else if (values_array[i] == minvalue) {
minindex = std::min(minindex, size_t(indices_array[i]));
}
}

return minindex;
}
```

AVX512F implementation

```size_t min_index_avx512f(int32_t* array, size_t size) {

common_assertions;

const __m512i increment = _mm512_set1_epi32(16);
__m512i indices         = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
__m512i minindices      = indices;

for (size_t i=16; i < size; i += 16) {

const __m512i values        = _mm512_loadu_si512((__m512i*)(array + i));
minvalues  = _mm512_min_epi32(minvalues, values);
}

// find min index in vector result (in an extremely naive way)
int32_t values_array[16];
uint32_t indices_array[16];

_mm512_storeu_si512((__m512i*)values_array, minvalues);
_mm512_storeu_si512((__m512i*)indices_array, minindices);

size_t  minindex = indices_array[0];
int32_t minvalue = values_array[0];
for (int i=1; i < 16; i++) {
if (values_array[i] < minvalue) {
minvalue = values_array[i];
minindex = indices_array[i];
} else if (values_array[i] == minvalue) {
minindex = std::min(minindex, size_t(indices_array[i]));
}
}

return minindex;
}
```

Performance evaluation

Following procedures were tested:

All procedures work on signed 32-bit integers.

Test programs were run three times and minimum measurements were noted. The unit is CPU cycles per array item.

Skylake

Compiler: GCC 7.3.0

CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

procedure best avg. speedup
[cycles] [cycles]
size 4096 items
scalar 3.006 3.036 1.00 ████▍
SSE 1.823 1.865 1.65 ███████▎
SSE unrolled 0.624 0.632 4.82 █████████████████████▍
AVX2 0.268 0.272 11.22 ██████████████████████████████████████████████████
size 16384 items
scalar 3.010 3.021 1.00 ████▍
SSE 1.829 1.875 1.65 ███████▎
SSE unrolled 0.636 0.643 4.73 █████████████████████
AVX2 0.292 0.298 10.31 █████████████████████████████████████████████▉
size 32768 items
scalar 3.013 3.022 1.00 ████▍
SSE 1.830 1.876 1.65 ███████▎
SSE unrolled 0.640 0.645 4.71 ████████████████████▉
AVX2 0.295 0.300 10.21 █████████████████████████████████████████████▌

Skylake-X

Compiler: GCC 8.1.0

CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz

procedure best avg. speedup
[cycles] [cycles]
size 4096 items
scalar 2.994 3.408 1.00 ██▋
SSE 1.721 1.814 1.74 ████▊
SSE unrolled 0.652 0.656 4.59 ████████████▌
AVX2 0.265 0.273 11.30 ██████████████████████████████▉
AVX512F 0.164 0.173 18.26 ██████████████████████████████████████████████████
size 16384 items
scalar 2.996 3.430 1.00 ██▋
SSE 1.605 1.681 1.87 █████
SSE unrolled 0.564 0.568 5.31 ██████████████▌
AVX2 0.322 0.326 9.30 █████████████████████████▍
AVX512F 0.208 0.213 14.40 ███████████████████████████████████████▍
size 32768 items
scalar 3.000 3.548 1.00 ██▋
SSE 1.622 1.681 1.85 █████
SSE unrolled 0.563 0.567 5.33 ██████████████▌
AVX2 0.321 0.325 9.35 █████████████████████████▌
AVX512F 0.206 0.210 14.56 ███████████████████████████████████████▉

Acknowledgements

I would like to thank my friend Romek, who inspired me to work again on side projects.

Source code

Source code is available on github.