Author: | Wojciech Muła |
---|---|
Added on: | 2018-10-03 |
Contents
The goal is to find the first index of the minimum value in a non-empty sequence.
Following C code shows the idea.
size_t min_index_scalar(int32_t* array, size_t size) { assert(array != nullptr); assert(size > 0); size_t minindex = 0; int32_t minvalue = array[0]; for (size_t i=1; i < size; i++) { if (array[i] < minvalue) { minvalue = array[i]; minindex = i; } } return minindex; }
The C++ standard library allows to express the same algorithm in one line.
#include <algorithm> // for std::min_element #include <iterator> // for std::distance template <typename T> size_t min_index(const T& v) { assert(!v.empty()); return std::distance(v.begin(), std::min_element(v.begin(), v.end())); }
The current versions of compilers (GCC 8.2, clang 7.0.0) are not able to autovectorize the code. However, they do autovectorize finding the minimum value, i.e. statement like return *std::min_element(v.begin(), v.end()).
In a SIMD approach we keep three vectors of: minimum values, corresponding indices and also current scalar indices. When the main loop completes, we select the appropriate single index from these vectors.
The algorithm outline is shown below (for four-element vectors).
// 1. SIMD part Vector indices = [0, 1, 2, 3] // basically [i + 0, i + 1, i + 2, i + 4] Vector increment = [4, 4, 4, 4] // sample values Vector minvalues = load_vector(input[0]) // [77, 33, 11, 44] Vector minindices = [0, 1, 2, 3] // [ 0, 1, 2, 3] for (i = 4; i < input_size; i += 4) { // advance scalar indices indices += increment; // [4, 5, 6, 7] // compare Vector values = load_vector(input[i]); // [55, 44, 22, 22] Mask less = compare(values, minvalues); // [55 < 77, 44 < 33, 22 < 11, 22 < 44] // [true, false, false, true] // two items will be updated // update minvalues = blend(minvalues, values, less) // [55, 33, 11, 22] minindices = blend(minindices, indices, less); // [ 4, 1, 2, 7] } // 2. scalar part min_value = get_item(min_values, 0); min_index = get_item(min_indices, 0); for (int i = 1; i < 4; i++) { if (get_item(min_values, i) < min_value) { min_value = get_item(min_values, i); min_index = get_item(min_indices, i); } else if (get_item(min_values, i) == min_values) // if some values are repeated, pick the smaller index min_index = min(min_index, get_item(min_indices, i)); } return min_index;
Function compare yields a mask, it can be a byte-mask (SSE, AVX2) or a bit-mask (AVX512). In SSE it might be instruction pcmpgtd (_mm_pcmplt_epi32), in AVX512 it might be vpcmpd (_mm512_cmp_epi32_mask).
Function blend is a vector selection operator, i.e. mask[i] ? x[i] : y[i]; it stores items from either vector x or y based on the corresponding mask value. Many SIMD ISAs provides such operation; for instance SSE has the instruction pblendv (_mm_blendv_epi8).
Please note that in case of SSE the blend instruction is relative new, as it was introduced in SSE4.1. For really old CPUs the blend operator has to be expressed with binary operations: (x & mask) | (y & ~mask). Such expression is compiled into a sequence of three instructions: and, and-not, or.
Additionally, since blend instructions tend to be slow, it's better to update minvalues vector using the min operator.
The above schema was translated into procedures using SSE (including an unrolled version), AVX2 and AVX512. In this article just SSE and AVX512 procedures are shown, but obviously all implementations are available.
size_t min_index_sse(int32_t* array, size_t size) { common_assertions; const __m128i increment = _mm_set1_epi32(4); __m128i indices = _mm_setr_epi32(0, 1, 2, 3); __m128i minindices = indices; __m128i minvalues = _mm_loadu_si128((__m128i*)array); for (size_t i=4; i < size; i += 4) { indices = _mm_add_epi32(indices, increment); const __m128i values = _mm_loadu_si128((__m128i*)(array + i)); const __m128i lt = _mm_cmplt_epi32(values, minvalues); minindices = _mm_blendv_epi8(minindices, indices, lt); minvalues = _mm_min_epi32(values, minvalues); } // find min index in vector result (in an extremely naive way) int32_t values_array[4]; uint32_t indices_array[4]; _mm_storeu_si128((__m128i*)values_array, minvalues); _mm_storeu_si128((__m128i*)indices_array, minindices); size_t minindex = indices_array[0]; int32_t minvalue = values_array[0]; for (int i=1; i < 4; i++) { if (values_array[i] < minvalue) { minvalue = values_array[i]; minindex = indices_array[i]; } else if (values_array[i] == minvalue) { minindex = std::min(minindex, size_t(indices_array[i])); } } return minindex; }
size_t min_index_avx512f(int32_t* array, size_t size) { common_assertions; const __m512i increment = _mm512_set1_epi32(16); __m512i indices = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15); __m512i minindices = indices; __m512i minvalues = _mm512_loadu_si512((__m512i*)array); for (size_t i=16; i < size; i += 16) { indices = _mm512_add_epi32(indices, increment); const __m512i values = _mm512_loadu_si512((__m512i*)(array + i)); const __mmask16 lt = _mm512_cmplt_epi32_mask(values, minvalues); minindices = _mm512_mask_blend_epi32(lt, minindices, indices); minvalues = _mm512_min_epi32(minvalues, values); } // find min index in vector result (in an extremely naive way) int32_t values_array[16]; uint32_t indices_array[16]; _mm512_storeu_si512((__m512i*)values_array, minvalues); _mm512_storeu_si512((__m512i*)indices_array, minindices); size_t minindex = indices_array[0]; int32_t minvalue = values_array[0]; for (int i=1; i < 16; i++) { if (values_array[i] < minvalue) { minvalue = values_array[i]; minindex = indices_array[i]; } else if (values_array[i] == minvalue) { minindex = std::min(minindex, size_t(indices_array[i])); } } return minindex; }
Following procedures were tested:
All procedures work on signed 32-bit integers.
Test programs were run three times and minimum measurements were noted. The unit is CPU cycles per array item.
Compiler: GCC 7.3.0
CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
procedure | best | avg. | speedup | |
---|---|---|---|---|
[cycles] | [cycles] | |||
size 4096 items | ||||
scalar | 3.006 | 3.036 | 1.00 | ████▍ |
SSE | 1.823 | 1.865 | 1.65 | ███████▎ |
SSE unrolled | 0.624 | 0.632 | 4.82 | █████████████████████▍ |
AVX2 | 0.268 | 0.272 | 11.22 | ██████████████████████████████████████████████████ |
size 16384 items | ||||
scalar | 3.010 | 3.021 | 1.00 | ████▍ |
SSE | 1.829 | 1.875 | 1.65 | ███████▎ |
SSE unrolled | 0.636 | 0.643 | 4.73 | █████████████████████ |
AVX2 | 0.292 | 0.298 | 10.31 | █████████████████████████████████████████████▉ |
size 32768 items | ||||
scalar | 3.013 | 3.022 | 1.00 | ████▍ |
SSE | 1.830 | 1.876 | 1.65 | ███████▎ |
SSE unrolled | 0.640 | 0.645 | 4.71 | ████████████████████▉ |
AVX2 | 0.295 | 0.300 | 10.21 | █████████████████████████████████████████████▌ |
Compiler: GCC 8.1.0
CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz
procedure | best | avg. | speedup | |
---|---|---|---|---|
[cycles] | [cycles] | |||
size 4096 items | ||||
scalar | 2.994 | 3.408 | 1.00 | ██▋ |
SSE | 1.721 | 1.814 | 1.74 | ████▊ |
SSE unrolled | 0.652 | 0.656 | 4.59 | ████████████▌ |
AVX2 | 0.265 | 0.273 | 11.30 | ██████████████████████████████▉ |
AVX512F | 0.164 | 0.173 | 18.26 | ██████████████████████████████████████████████████ |
size 16384 items | ||||
scalar | 2.996 | 3.430 | 1.00 | ██▋ |
SSE | 1.605 | 1.681 | 1.87 | █████ |
SSE unrolled | 0.564 | 0.568 | 5.31 | ██████████████▌ |
AVX2 | 0.322 | 0.326 | 9.30 | █████████████████████████▍ |
AVX512F | 0.208 | 0.213 | 14.40 | ███████████████████████████████████████▍ |
size 32768 items | ||||
scalar | 3.000 | 3.548 | 1.00 | ██▋ |
SSE | 1.622 | 1.681 | 1.85 | █████ |
SSE unrolled | 0.563 | 0.567 | 5.33 | ██████████████▌ |
AVX2 | 0.321 | 0.325 | 9.35 | █████████████████████████▌ |
AVX512F | 0.206 | 0.210 | 14.56 | ███████████████████████████████████████▉ |
I would like to thank my friend Romek, who inspired me to work again on side projects.
Source code is available on github.