Building a bitmask

Author:Wojciech Muła
Added on:2016-09-14

Contents

The problem

There is an array of 32-bit integers and a key — a specific value. The result have to be a bit vector with bits set on these position where the key is equal to array items. Pseudocode:

-- n - array size
for i = 0 .. n - 1 loop
    if key = array[i] then
        bitvector[i] = 1
    else
        bitvector[i] = 0
end for

A C++ interface:

void bitmask(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector);

Naive procedure

void bitmask_naive(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector) {

    for (size_t i=0; i < n; i++) {
        const uint8_t val = (array[i] == key); // in C++ bool casted to int has value 0 or 1

        bitvector[i/8] |= val << (i % 8);
    }
}

The main problem with this code is that it involves at least one bit-shift (likely two) and one bit-or. The first bit-shift is used to find index in bitvector, the second bit-shift moves the result to i % 8 bit.

Assembler code produced by GCC 5.3 (with the -O3 flag):

.L5:
    movq    %rax, %r9
    xorl    %r8d, %r8d
    movl    %eax, %ecx
    shrq    $3, %r9                 ; index
    cmpl    %edx, (%rdi,%rax,4)
    sete    %r8b                    ; set val
    andl    $7, %ecx                ; i % 8
    addq    $1, %rax
    sall    %cl, %r8d               ; shift val left by i % 8
    orb %r8b, (%r10,%r9)            ; update byte at the index
    cmpq    %rax, %rsi
    jne .L5

Better procedure (#1)

This code is suitable for 64-bit machines. The only advantage over the naive version is updating a bitvector in 32-bit portions.

void bitmask_better_1(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector) {

    const size_t blocks = n / 32;
    const size_t tail   = n % 32;

    const uint32_t *input = array;
    uint32_t* bitvector32 = reinterpret_cast<uint32_t*>(bitvector);

    for (size_t i=0; i < blocks; i++) {

        uint64_t result = 0xffffffffffffffffllu;
        uint64_t bit    = uint64_t(1) << 32;

        for (int j=0; j < 32; j++) {
            uint64_t low = bit - 1;

            // j-th bit (in the higher 32-bit word) is set if current input is not equal to the key
            uint64_t tmp = (key ^ *input) + low;
            input++;

            result ^= (tmp & bit); // clear j-th if the above condition is true
            bit <<= 1;
        }

        *bitvector32++ = result >> 32; // save 32 bits at once
    }

    // process tail
}

Better procedure (#2)

This solution maps very well on x86 assembly. And unlike the previous solution, any relation can be used in place of equal/not-equal.

void bitmask_better_2(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector) {

    const size_t blocks = n / 32;
    const size_t tail   = n % 32;

    const uint32_t *input = array;
    uint32_t* bitvector32 = reinterpret_cast<uint32_t*>(bitvector);

    for (size_t i=0; i < blocks; i++) {

        uint32_t result = 0;

        for (int j=31; j >= 0; j--) {
            const uint32_t bit = (current[j] == key);   // (1)
            result = 2*result + bit;                    // (2)
        }

        *bitvector32++ = result;
        input += 32;
    }

    // process tail
}

The line number 1 should be translated into two instructions: comparison and then setcc. The line number 2 should be translated into single lea.

Assembler code produced by GCC 5.3 (with the -O3 flag):

xorl    %edi, %edi
cmpl    %edx, (%rax)            ; set value of bit
sete    %dil                    ; ...
subq    $4, %rax
cmpq    %rax, %r8
leal    (%rdi,%rsi,2), %esi     ; update result (note: a register)
jne     .L35
addq    $4, %rcx
subq    $-128, %r9
movl    %esi, -4(%rcx)          ; save 32 bit at once

The big advantage of this method is that instead of equal/not equal relation all other arithmetic relations could be used. It's just a matter of setcc selection.

SIMD procedure

SSE procedure processes data in 16-element blocks (4 x 4 items). Using a single SSE register is possible but not optimal.

Like the previous procedure, this one also allows to use any arithmetic relation.

Algorithm

key = 0xaaaa;

d0                    d1                    d2                    d3
[0000|aaaa|bbbb|0000] [aaaa|cccc|1111|aaaa] [0000|aaaa|dddd|2222] [3333|1111|aaaa|cccc]

A vector comparison yields:

d0                    d1                    d2                    d3
[0000|ffff|0000|0000] [ffff|0000|0000|ffff] [0000|ffff|0000|0000] [0000|0000|ffff|0000]

Now d0 & d1 and d2 & d3 are converted from 32-bit mask to 16-bit using packssdw (note the signed saturation):

d01                       d23
[00|ff|00|00|ff|00|00|ff] [00|ff|00|00|00|00|ff|00]

And then from 16-bit to 8-bit masks using packsswb:

byte-mask
[0f00f00f0f0000f0]

Finally the byte mask is converted into bit-mask with pmovmskb

Sample code

void bitmask_SSE(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector) {

    const size_t blocks = n / 16;
    const size_t tail   = n % 16;

    const __m128i *input = reinterpret_cast<const __m128i*>(array);
    uint16_t* bitvector16 = reinterpret_cast<uint16_t*>(bitvector);

    const __m128i vkey = _mm_set1_epi32(key);

    for (size_t i=0; i < blocks; i++) {

        // packed dword results
        const __m128i d0 = _mm_cmpeq_epi32(vkey, _mm_loadu_si128(input + 4*i + 0));
        const __m128i d1 = _mm_cmpeq_epi32(vkey, _mm_loadu_si128(input + 4*i + 1));
        const __m128i d2 = _mm_cmpeq_epi32(vkey, _mm_loadu_si128(input + 4*i + 2));
        const __m128i d3 = _mm_cmpeq_epi32(vkey, _mm_loadu_si128(input + 4*i + 3));

        // packed word results
        const __m128i d01 = _mm_packs_epi32(d0, d1);
        const __m128i d23 = _mm_packs_epi32(d2, d3);

        // packed byte result
        const __m128i pck = _mm_packs_epi16(d01, d23);

        // convert byte-mask into bitmask
        bitvector16[i] = _mm_movemask_epi8(pck);
    }

    // process tail
}

Performance

Above procedures were tested in a real system. The SIMD and the second scalar version are the fastest on Skylake.

Sample code

Github repository contains algorithms and a test program.