# AVX512 — first bit set in a large array

Author: Wojciech Muła 2016-12-22

# Problem

There is an array of 64-bit values, we want to find the first bit set; the array is sparse.

# Scalar version

The obvious scalar solution scans whole array. For the first non-zero word it calculates position of the first bit within the word (using the instruction bsf or tzcnt) and returns the appropriate value.

```uint64_t scalar_bfs(const uint64_t* data, size_t n) {

for (size_t i=0; i < n; i++) {
if (data[i]) {
return i * 64 + bfs(data[i]);
}
}

return -1;
}
```

# AVX512F

The AVX512 solution minimizes number of branches by searching non-zero word in an register. When comparison's result is non-zero, we locate which word of the register it is, and process this word in the same way as the scalar version.

```uint64_t avx512f_bfs(const uint64_t* data, size_t n) {

const size_t chunks = n / 8;
const size_t tail   = n % 8;

for (size_t i=0; i < chunks; i++) {
const __m512i  v = _mm512_loadu_si512((const __m512i*)(data + i * 8));
const uint32_t m = _mm512_cmpneq_epi64_mask(v, _mm512_setzero_si512());
if (m) {
const size_t index = bfs(m) + 8*i;
return index * 64 +  bfs(data[index]);
}
}

// process tail
// ...

return -1;
}
```

# Peformance results

A test program was compiled by GCC 5.3.0 and run on Knights Landing processor, model 7210.

The program clears the array of 64-bit words and set the last bit in the last word.

size [64-bit words] cycles per word speed-up
scalar AVX512
8 4.88 3.25 1.50
16 4.06 2.44 1.66
32 3.66 1.22 3.00
64 3.25 1.02 3.19
128 3.15 0.91 3.46
256 3.10 0.76 4.08
512 3.05 0.69 4.42
1024 3.02 0.66 4.58

# Source code

Test programs are available on github.