# SIMD: detecting a bit pattern

Author: Wojciech Muła 2015-03-22

Contents

# Introduction

The problem: there are 64-bit values with some data bits and some metadata bits; the metadata includes a k-bit field describing a "type" (k >= 0). The type field is located in a lower 32-bits.

Procedure processes two "types", one denoted with the code 3 and another with 5. When all items are of type 3 then we can use a fast AVX2 path. If there are some types 5, we have to call an additional function (a virtual method to be precise). Pseudocode:

```for (size_t i = 0; i < size; i += 8) {

// two AVX registers, or 8 elements, are processed in a single loop

__m256i A_lo = {A64[i + 0], A64[i + 1], A64[i + 2], A64[i + 3]};
__m256i A_hi = {A64[i + 4], A64[i + 5], A64[i + 6], A64[i + 7]};

__m256i B_lo = {B64[i + 0], B64[i + 1], B64[i + 2], B64[i + 3]};
__m256i B_hi = {B64[i + 4], B64[i + 5], B64[i + 6], B64[i + 7]};

if (any element of A or B vector has type 5) { // ***

// slow path
for (int k = 0; k < 4; k++) {
result[i + k]     = function(A_lo[i + k], B_lo[i + k]);
result[i + k + 4] = function(A_hi[i + k], B_hi[i + k]);
}
} else {

// fast path
...
}

// further processing
}
```

We have to fill condition of the if-statement.

# Version 1

```// 1. isolate type fields

// assembler instruction:
auto A_lo_type = A_lo & packed_qword(TYPE_FIELD_MASK);  // PAND
auto A_hi_type = A_hi & packed_qword(TYPE_FIELD_MASK);  // PAND
auto B_lo_type = B_lo & packed_qword(TYPE_FIELD_MASK);  // PAND
auto B_hi_type = B_hi & packed_qword(TYPE_FIELD_MASK);  // PAND

// 2. convert packed_qword to packed_dword (64 => 32 bits)

auto A_type = convert(A_lo_type, A_hi_type);            // PACKSSQD
auto B_type = convert(B_lo_type, B_hi_type);            // PACKSSQD

// 3. compare types with magic 5

auto A_has_5 = compare(A_type, TYPE_FIELD_5);           // PCMPEQD
auto B_has_5 = compare(B_type, TYPE_FIELD_5);           // PCMPEQD

auto any_5 = A_has_5 | B_has_5;                         // POR

// 4. get a bit mask

```

In total ten instructions. It doesn't look well. Let's try another approach.

# Version 2

In this version we change the third step. An addition is used instead of a comparison: A_type + B_type + magic_value — the magic_value causes to set the highest bit when A_type or B_type is 5. How to calculate this value? Let say the code field is located at 4th bit and it has 3 bits width:

```31                         6   3    0 bit
+-+------------------------+---+----+
|X|111111111111111111111111|TYP|    |
+-+------------------------+---+----+
```

All bits after type field and before the most significant bit X are set. If an overflow occur while adding type field, then the carry bit will propagate through ones up to the bit X. In this case magic_value = 0b0111 1111 1111 1111 1111 1111 1000 0000 = 0x7fffff80.

```// 1. isolate type fields
// assembler instruction:
auto A_lo_type = A_lo & packed_qword(TYPE_FIELD_MASK);  // PAND
auto A_hi_type = A_hi & packed_qword(TYPE_FIELD_MASK);  // PAND
auto B_lo_type = B_lo & packed_qword(TYPE_FIELD_MASK);  // PAND
auto B_hi_type = B_hi & packed_qword(TYPE_FIELD_MASK);  // PAND

// 2. convert packed_qword to packed_dword (64 => 32 bits)

auto A_type = convert(A_lo_type, A_hi_type);            // PACKSSQD
auto B_type = convert(B_lo_type, B_hi_type);            // PACKSSQD

// 3. compare types with magic 5

auto any_5 = A_type + B_type + packed_dword(0x7fffff80) // PADDD x 2

// 4. same is version 1

```

In total nine instructions. A bit better and we've also learned a new trick.

# Version 3

Now we stop thinking "in SIMD-way". Simply compare binary images of 3 and 5: 3 = 0b011, 5 = 0b101 — it's clear that 5 has set 3rd bit, so this bit must be somehow gathered by the movemask instruction.

Here is the solution:

```// 1. merge lo and hi parts of inputs

auto T_lo = A_lo | B_lo;                                // POR
auto T_hi = B_hi | B_hi;                                // POR

// 2. isolate merged type fields

auto T_lo_type = T_lo & packed_qword(TYPE_FIELD_MASK);  // PAND
auto T_hi_type = T_hi & packed_qword(TYPE_FIELD_MASK);  // PAND

// 3. convert packed_qword to packed_dword (64 => 32 bits)

auto type1 = convert(T_lo_type, T_hi_type);             // PACKSSQD

// 4. move the highest bit of type bit field to the MSB

// shift depends on position of highest bit, it is 31 - something
auto type2 = type1 << shift;                            // PSLRD

// 4. get a bit mask