Finding index of the minimum value using SIMD instructions

Author:	Wojciech Muła
Added on:	2018-10-03

Contents

Introduction
SIMD approach
Experiments
Acknowledgements
Source code

Introduction

The goal is to find the first index of the minimum value in a non-empty sequence.

Following C code shows the idea.

size_t min_index_scalar(int32_t* array, size_t size) {

    assert(array != nullptr);
    assert(size > 0);

    size_t minindex = 0;
    int32_t minvalue = array[0];

    for (size_t i=1; i < size; i++) {
        if (array[i] < minvalue) {
            minvalue = array[i];
            minindex = i;
        }
    }

    return minindex;
}

The C++ standard library allows to express the same algorithm in one line.

#include <algorithm> // for std::min_element
#include <iterator>  // for std::distance

template <typename T>
size_t min_index(const T& v) {

    assert(!v.empty());

    return std::distance(v.begin(), std::min_element(v.begin(), v.end()));
}

The current versions of compilers (GCC 8.2, clang 7.0.0) are not able to autovectorize the code. However, they do autovectorize finding the minimum value, i.e. statement like return *std::min_element(v.begin(), v.end()).

SIMD approach

In a SIMD approach we keep three vectors of: minimum values, corresponding indices and also current scalar indices. When the main loop completes, we select the appropriate single index from these vectors.

The algorithm outline is shown below (for four-element vectors).

// 1. SIMD part

Vector indices    = [0, 1, 2, 3]                    // basically [i + 0, i + 1, i + 2, i + 4]
Vector increment  = [4, 4, 4, 4]
                                                    // sample values
Vector minvalues  = load_vector(input[0])           // [77, 33, 11, 44]
Vector minindices = [0, 1, 2, 3]                    // [ 0,  1,  2,  3]

for (i = 4; i < input_size; i += 4) {

    // advance scalar indices
    indices += increment;                           // [4, 5, 6, 7]

    // compare
    Vector values = load_vector(input[i]);          // [55, 44, 22, 22]
    Mask   less   = compare(values, minvalues);     // [55 < 77, 44 < 33, 22 < 11, 22 < 44]
                                                    // [true, false, false, true]
                                                    //  two items will be updated
    // update
    minvalues =  blend(minvalues, values, less)     // [55, 33, 11, 22]
    minindices = blend(minindices, indices, less);  // [ 4,  1,  2,  7]
}

// 2. scalar part

min_value = get_item(min_values, 0);
min_index = get_item(min_indices, 0);
for (int i = 1; i < 4; i++) {
    if (get_item(min_values, i) < min_value) {
        min_value = get_item(min_values, i);
        min_index = get_item(min_indices, i);
    } else if (get_item(min_values, i) == min_values)
        // if some values are repeated, pick the smaller index
        min_index = min(min_index, get_item(min_indices, i));
}

return min_index;

Function compare yields a mask, it can be a byte-mask (SSE, AVX2) or a bit-mask (AVX512). In SSE it might be instruction pcmpgtd (_mm_pcmplt_epi32), in AVX512 it might be vpcmpd (_mm512_cmp_epi32_mask).

Function blend is a vector selection operator, i.e. mask[i] ? x[i] : y[i]; it stores items from either vector x or y based on the corresponding mask value. Many SIMD ISAs provides such operation; for instance SSE has the instruction pblendv (_mm_blendv_epi8).

Please note that in case of SSE the blend instruction is relative new, as it was introduced in SSE4.1. For really old CPUs the blend operator has to be expressed with binary operations: (x & mask) | (y & ~mask). Such expression is compiled into a sequence of three instructions: and, and-not, or.

Additionally, since blend instructions tend to be slow, it's better to update minvalues vector using the min operator.

Experiments

The above schema was translated into procedures using SSE (including an unrolled version), AVX2 and AVX512. In this article just SSE and AVX512 procedures are shown, but obviously all implementations are available.

SSE implementation

size_t min_index_sse(int32_t* array, size_t size) {

    common_assertions;

    const __m128i increment = _mm_set1_epi32(4);
    __m128i indices         = _mm_setr_epi32(0, 1, 2, 3);
    __m128i minindices      = indices;
    __m128i minvalues       = _mm_loadu_si128((__m128i*)array);

    for (size_t i=4; i < size; i += 4) {

        indices = _mm_add_epi32(indices, increment);

        const __m128i values        = _mm_loadu_si128((__m128i*)(array + i));
        const __m128i lt            = _mm_cmplt_epi32(values, minvalues);
        minindices = _mm_blendv_epi8(minindices, indices, lt);
        minvalues  = _mm_min_epi32(values, minvalues);
    }

    // find min index in vector result (in an extremely naive way)
    int32_t values_array[4];
    uint32_t indices_array[4];

    _mm_storeu_si128((__m128i*)values_array, minvalues);
    _mm_storeu_si128((__m128i*)indices_array, minindices);

    size_t  minindex = indices_array[0];
    int32_t minvalue = values_array[0];
    for (int i=1; i < 4; i++) {
        if (values_array[i] < minvalue) {
            minvalue = values_array[i];
            minindex = indices_array[i];
        } else if (values_array[i] == minvalue) {
            minindex = std::min(minindex, size_t(indices_array[i]));
        }
    }

    return minindex;
}

AVX512F implementation

size_t min_index_avx512f(int32_t* array, size_t size) {

    common_assertions;

    const __m512i increment = _mm512_set1_epi32(16);
    __m512i indices         = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
    __m512i minindices      = indices;
    __m512i minvalues       = _mm512_loadu_si512((__m512i*)array);

    for (size_t i=16; i < size; i += 16) {

        indices = _mm512_add_epi32(indices, increment);

        const __m512i values        = _mm512_loadu_si512((__m512i*)(array + i));
        const __mmask16 lt          = _mm512_cmplt_epi32_mask(values, minvalues);
        minindices = _mm512_mask_blend_epi32(lt, minindices, indices);
        minvalues  = _mm512_min_epi32(minvalues, values);
    }

    // find min index in vector result (in an extremely naive way)
    int32_t values_array[16];
    uint32_t indices_array[16];

    _mm512_storeu_si512((__m512i*)values_array, minvalues);
    _mm512_storeu_si512((__m512i*)indices_array, minindices);

    size_t  minindex = indices_array[0];
    int32_t minvalue = values_array[0];
    for (int i=1; i < 16; i++) {
        if (values_array[i] < minvalue) {
            minvalue = values_array[i];
            minindex = indices_array[i];
        } else if (values_array[i] == minvalue) {
            minindex = std::min(minindex, size_t(indices_array[i]));
        }
    }

    return minindex;
}

Performance evaluation

Following procedures were tested:

SSE,
SSE unrolled,
AVX2,
and AVX512F.

All procedures work on signed 32-bit integers.

Test programs were run three times and minimum measurements were noted. The unit is CPU cycles per array item.

Skylake

Compiler: GCC 7.3.0

CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

procedure	best	avg.	speedup
	[cycles]	[cycles]
size 4096 items
scalar	3.006	3.036	1.00	`████▍`
SSE	1.823	1.865	1.65	`███████▎`
SSE unrolled	0.624	0.632	4.82	`█████████████████████▍`
AVX2	0.268	0.272	11.22	`██████████████████████████████████████████████████`
size 16384 items
scalar	3.010	3.021	1.00	`████▍`
SSE	1.829	1.875	1.65	`███████▎`
SSE unrolled	0.636	0.643	4.73	`█████████████████████`
AVX2	0.292	0.298	10.31	`█████████████████████████████████████████████▉`
size 32768 items
scalar	3.013	3.022	1.00	`████▍`
SSE	1.830	1.876	1.65	`███████▎`
SSE unrolled	0.640	0.645	4.71	`████████████████████▉`
AVX2	0.295	0.300	10.21	`█████████████████████████████████████████████▌`

Skylake-X

Compiler: GCC 8.1.0

CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz

procedure	best	avg.	speedup
	[cycles]	[cycles]
size 4096 items
scalar	2.994	3.408	1.00	`██▋`
SSE	1.721	1.814	1.74	`████▊`
SSE unrolled	0.652	0.656	4.59	`████████████▌`
AVX2	0.265	0.273	11.30	`██████████████████████████████▉`
AVX512F	0.164	0.173	18.26	`██████████████████████████████████████████████████`
size 16384 items
scalar	2.996	3.430	1.00	`██▋`
SSE	1.605	1.681	1.87	`█████`
SSE unrolled	0.564	0.568	5.31	`██████████████▌`
AVX2	0.322	0.326	9.30	`█████████████████████████▍`
AVX512F	0.208	0.213	14.40	`███████████████████████████████████████▍`
size 32768 items
scalar	3.000	3.548	1.00	`██▋`
SSE	1.622	1.681	1.85	`█████`
SSE unrolled	0.563	0.567	5.33	`██████████████▌`
AVX2	0.321	0.325	9.35	`█████████████████████████▌`
AVX512F	0.206	0.210	14.56	`███████████████████████████████████████▉`

Acknowledgements

I would like to thank my friend Romek, who inspired me to work again on side projects.

Source code

Source code is available on github.