# Building a bitmask

Author: Wojciech Muła 2016-09-14

# The problem

There is an array of 32-bit integers and a key — a specific value. The result have to be a bit vector with bits set on these position where the key is equal to array items. Pseudocode:

```-- n - array size
for i = 0 .. n - 1 loop
if key = array[i] then
bitvector[i] = 1
else
bitvector[i] = 0
end for
```

A C++ interface:

```void bitmask(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector);
```

# Naive procedure

```void bitmask_naive(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector) {

for (size_t i=0; i < n; i++) {
const uint8_t val = (array[i] == key); // in C++ bool casted to int has value 0 or 1

bitvector[i/8] |= val << (i % 8);
}
}
```

The main problem with this code is that it involves at least one bit-shift (likely two) and one bit-or. The first bit-shift is used to find index in bitvector, the second bit-shift moves the result to i % 8 bit.

Assembler code produced by GCC 5.3 (with the -O3 flag):

```.L5:
movq    %rax, %r9
xorl    %r8d, %r8d
movl    %eax, %ecx
shrq    \$3, %r9                 ; index
cmpl    %edx, (%rdi,%rax,4)
sete    %r8b                    ; set val
andl    \$7, %ecx                ; i % 8
sall    %cl, %r8d               ; shift val left by i % 8
orb %r8b, (%r10,%r9)            ; update byte at the index
cmpq    %rax, %rsi
jne .L5
```

# Better procedure (#1)

This code is suitable for 64-bit machines. The only advantage over the naive version is updating a bitvector in 32-bit portions.

```void bitmask_better_1(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector) {

const size_t blocks = n / 32;
const size_t tail   = n % 32;

const uint32_t *input = array;
uint32_t* bitvector32 = reinterpret_cast<uint32_t*>(bitvector);

for (size_t i=0; i < blocks; i++) {

uint64_t result = 0xffffffffffffffffllu;
uint64_t bit    = uint64_t(1) << 32;

for (int j=0; j < 32; j++) {
uint64_t low = bit - 1;

// j-th bit (in the higher 32-bit word) is set if current input is not equal to the key
uint64_t tmp = (key ^ *input) + low;
input++;

result ^= (tmp & bit); // clear j-th if the above condition is true
bit <<= 1;
}

*bitvector32++ = result >> 32; // save 32 bits at once
}

// process tail
}
```

# Better procedure (#2)

This solution maps very well on x86 assembly. And unlike the previous solution, any relation can be used in place of equal/not-equal.

```void bitmask_better_2(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector) {

const size_t blocks = n / 32;
const size_t tail   = n % 32;

const uint32_t *input = array;
uint32_t* bitvector32 = reinterpret_cast<uint32_t*>(bitvector);

for (size_t i=0; i < blocks; i++) {

uint32_t result = 0;

for (int j=31; j >= 0; j--) {
const uint32_t bit = (current[j] == key);   // (1)
result = 2*result + bit;                    // (2)
}

*bitvector32++ = result;
input += 32;
}

// process tail
}
```

The line number 1 should be translated into two instructions: comparison and then setcc. The line number 2 should be translated into single lea.

Assembler code produced by GCC 5.3 (with the -O3 flag):

```xorl    %edi, %edi
cmpl    %edx, (%rax)            ; set value of bit
sete    %dil                    ; ...
subq    \$4, %rax
cmpq    %rax, %r8
leal    (%rdi,%rsi,2), %esi     ; update result (note: a register)
jne     .L35
subq    \$-128, %r9
movl    %esi, -4(%rcx)          ; save 32 bit at once
```

The big advantage of this method is that instead of equal/not equal relation all other arithmetic relations could be used. It's just a matter of setcc selection.

# SIMD procedure

SSE procedure processes data in 16-element blocks (4 x 4 items). Using a single SSE register is possible but not optimal.

Like the previous procedure, this one also allows to use any arithmetic relation.

## Algorithm

```key = 0xaaaa;

d0                    d1                    d2                    d3
[0000|aaaa|bbbb|0000] [aaaa|cccc|1111|aaaa] [0000|aaaa|dddd|2222] [3333|1111|aaaa|cccc]
```

A vector comparison yields:

```d0                    d1                    d2                    d3
[0000|ffff|0000|0000] [ffff|0000|0000|ffff] [0000|ffff|0000|0000] [0000|0000|ffff|0000]
```

Now d0 & d1 and d2 & d3 are converted from 32-bit mask to 16-bit using packssdw (note the signed saturation):

```d01                       d23
[00|ff|00|00|ff|00|00|ff] [00|ff|00|00|00|00|ff|00]
```

And then from 16-bit to 8-bit masks using packsswb:

```byte-mask
[0f00f00f0f0000f0]
```

Finally the byte mask is converted into bit-mask with pmovmskb

## Sample code

```void bitmask_SSE(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector) {

const size_t blocks = n / 16;
const size_t tail   = n % 16;

const __m128i *input = reinterpret_cast<const __m128i*>(array);
uint16_t* bitvector16 = reinterpret_cast<uint16_t*>(bitvector);

const __m128i vkey = _mm_set1_epi32(key);

for (size_t i=0; i < blocks; i++) {

// packed dword results
const __m128i d0 = _mm_cmpeq_epi32(vkey, _mm_loadu_si128(input + 4*i + 0));
const __m128i d1 = _mm_cmpeq_epi32(vkey, _mm_loadu_si128(input + 4*i + 1));
const __m128i d2 = _mm_cmpeq_epi32(vkey, _mm_loadu_si128(input + 4*i + 2));
const __m128i d3 = _mm_cmpeq_epi32(vkey, _mm_loadu_si128(input + 4*i + 3));

// packed word results
const __m128i d01 = _mm_packs_epi32(d0, d1);
const __m128i d23 = _mm_packs_epi32(d2, d3);

// packed byte result
const __m128i pck = _mm_packs_epi16(d01, d23);