Author: | Wojciech Muła |
---|---|
Added on: | 2016-12-22 |
There is an array of 64-bit values, we want to find the first bit set; the array is sparse.
The obvious scalar solution scans whole array. For the first non-zero word it calculates position of the first bit within the word (using the instruction bsf or tzcnt) and returns the appropriate value.
uint64_t scalar_bfs(const uint64_t* data, size_t n) { for (size_t i=0; i < n; i++) { if (data[i]) { return i * 64 + bfs(data[i]); } } return -1; }
The AVX512 solution minimizes number of branches by searching non-zero word in an register. When comparison's result is non-zero, we locate which word of the register it is, and process this word in the same way as the scalar version.
uint64_t avx512f_bfs(const uint64_t* data, size_t n) { const size_t chunks = n / 8; const size_t tail = n % 8; for (size_t i=0; i < chunks; i++) { const __m512i v = _mm512_loadu_si512((const __m512i*)(data + i * 8)); const uint32_t m = _mm512_cmpneq_epi64_mask(v, _mm512_setzero_si512()); if (m) { const size_t index = bfs(m) + 8*i; return index * 64 + bfs(data[index]); } } // process tail // ... return -1; }
A test program was compiled by GCC 5.3.0 and run on Knights Landing processor, model 7210.
The program clears the array of 64-bit words and set the last bit in the last word.
size [64-bit words] | cycles per word | speed-up | |
---|---|---|---|
scalar | AVX512 | ||
8 | 4.88 | 3.25 | 1.50 |
16 | 4.06 | 2.44 | 1.66 |
32 | 3.66 | 1.22 | 3.00 |
64 | 3.25 | 1.02 | 3.19 |
128 | 3.15 | 0.91 | 3.46 |
256 | 3.10 | 0.76 | 4.08 |
512 | 3.05 | 0.69 | 4.42 |
1024 | 3.02 | 0.66 | 4.58 |
Test programs are available on github.