AVX512 — first bit set in a large array

Author:Wojciech Muła
Added on:2016-12-22

Problem

There is an array of 64-bit values, we want to find the first bit set; the array is sparse.

Scalar version

The obvious scalar solution scans whole array. For the first non-zero word it calculates position of the first bit within the word (using the instruction bsf or tzcnt) and returns the appropriate value.

uint64_t scalar_bfs(const uint64_t* data, size_t n) {

    for (size_t i=0; i < n; i++) {
        if (data[i]) {
            return i * 64 + bfs(data[i]);
        }
    }

    return -1;
}

AVX512F

The AVX512 solution minimizes number of branches by searching non-zero word in an register. When comparison's result is non-zero, we locate which word of the register it is, and process this word in the same way as the scalar version.

uint64_t avx512f_bfs(const uint64_t* data, size_t n) {

    const size_t chunks = n / 8;
    const size_t tail   = n % 8;

    for (size_t i=0; i < chunks; i++) {
        const __m512i  v = _mm512_loadu_si512((const __m512i*)(data + i * 8));
        const uint32_t m = _mm512_cmpneq_epi64_mask(v, _mm512_setzero_si512());
        if (m) {
            const size_t index = bfs(m) + 8*i;
            return index * 64 +  bfs(data[index]);
        }
    }

    // process tail
    // ...

    return -1;
}

Peformance results

A test program was compiled by GCC 5.3.0 and run on Knights Landing processor, model 7210.

The program clears the array of 64-bit words and set the last bit in the last word.

size [64-bit words] cycles per word speed-up
scalar AVX512
8 4.88 3.25 1.50
16 4.06 2.44 1.66
32 3.66 1.22 3.00
64 3.25 1.02 3.19
128 3.15 0.91 3.46
256 3.10 0.76 4.08
512 3.05 0.69 4.42
1024 3.02 0.66 4.58

Source code

Test programs are available on github.