SIMD: detecting a bit pattern

Author: Wojciech Muła
Added on:2015-03-22

Contents

Introduction

The problem: there are 64-bit values with some data bits and some metadata bits; the metadata includes a k-bit field describing a "type" (k >= 0). The type field is located in a lower 32-bits.

Procedure processes two "types", one denoted with the code 3 and another with 5. When all items are of type 3 then we can use a fast AVX2 path. If there are some types 5, we have to call an additional function (a virtual method to be precise). Pseudocode:

for (size_t i = 0; i < size; i += 8) {

    // two AVX registers, or 8 elements, are processed in a single loop

    __m256i A_lo = {A64[i + 0], A64[i + 1], A64[i + 2], A64[i + 3]};
    __m256i A_hi = {A64[i + 4], A64[i + 5], A64[i + 6], A64[i + 7]};

    __m256i B_lo = {B64[i + 0], B64[i + 1], B64[i + 2], B64[i + 3]};
    __m256i B_hi = {B64[i + 4], B64[i + 5], B64[i + 6], B64[i + 7]};

    if (any element of A or B vector has type 5) { // ***

        // slow path
        for (int k = 0; k < 4; k++) {
            result[i + k]     = function(A_lo[i + k], B_lo[i + k]);
            result[i + k + 4] = function(A_hi[i + k], B_hi[i + k]);
        }
    } else {

        // fast path
        ...
    }

    // further processing
}

We have to fill condition of the if-statement.

Version 1

// 1. isolate type fields

                                                            // assembler instruction:
    auto A_lo_type = A_lo & packed_qword(TYPE_FIELD_MASK);  // PAND
    auto A_hi_type = A_hi & packed_qword(TYPE_FIELD_MASK);  // PAND
    auto B_lo_type = B_lo & packed_qword(TYPE_FIELD_MASK);  // PAND
    auto B_hi_type = B_hi & packed_qword(TYPE_FIELD_MASK);  // PAND

// 2. convert packed_qword to packed_dword (64 => 32 bits)

    auto A_type = convert(A_lo_type, A_hi_type);            // PACKSSQD
    auto B_type = convert(B_lo_type, B_hi_type);            // PACKSSQD

// 3. compare types with magic 5

    auto A_has_5 = compare(A_type, TYPE_FIELD_5);           // PCMPEQD
    auto B_has_5 = compare(B_type, TYPE_FIELD_5);           // PCMPEQD

    auto any_5 = A_has_5 | B_has_5;                         // POR

// 4. get a bit mask

    auto any_5_bitmask = movemask(any_5);                   // MOVMSK_PS

In total ten instructions. It doesn't look well. Let's try another approach.

Version 2

In this version we change the third step. An addition is used instead of a comparison: A_type + B_type + magic_value — the magic_value causes to set the highest bit when A_type or B_type is 5. How to calculate this value? Let say the code field is located at 4th bit and it has 3 bits width:

31                         6   3    0 bit
+-+------------------------+---+----+
|X|111111111111111111111111|TYP|    |
+-+------------------------+---+----+

All bits after type field and before the most significant bit X are set. If an overflow occur while adding type field, then the carry bit will propagate through ones up to the bit X. In this case magic_value = 0b0111 1111 1111 1111 1111 1111 1000 0000 = 0x7fffff80.

// 1. isolate type fields
                                                            // assembler instruction:
    auto A_lo_type = A_lo & packed_qword(TYPE_FIELD_MASK);  // PAND
    auto A_hi_type = A_hi & packed_qword(TYPE_FIELD_MASK);  // PAND
    auto B_lo_type = B_lo & packed_qword(TYPE_FIELD_MASK);  // PAND
    auto B_hi_type = B_hi & packed_qword(TYPE_FIELD_MASK);  // PAND

// 2. convert packed_qword to packed_dword (64 => 32 bits)

    auto A_type = convert(A_lo_type, A_hi_type);            // PACKSSQD
    auto B_type = convert(B_lo_type, B_hi_type);            // PACKSSQD

// 3. compare types with magic 5

    auto any_5 = A_type + B_type + packed_dword(0x7fffff80) // PADDD x 2

// 4. same is version 1

    auto any_5_bitmask = movemask(any_5);                   // MOVMSK_PS

In total nine instructions. A bit better and we've also learned a new trick.

Version 3

Now we stop thinking "in SIMD-way". Simply compare binary images of 3 and 5: 3 = 0b011, 5 = 0b101 — it's clear that 5 has set 3rd bit, so this bit must be somehow gathered by the movemask instruction.

Here is the solution:

// 1. merge lo and hi parts of inputs

    auto T_lo = A_lo | B_lo;                                // POR
    auto T_hi = B_hi | B_hi;                                // POR

// 2. isolate merged type fields

    auto T_lo_type = T_lo & packed_qword(TYPE_FIELD_MASK);  // PAND
    auto T_hi_type = T_hi & packed_qword(TYPE_FIELD_MASK);  // PAND

// 3. convert packed_qword to packed_dword (64 => 32 bits)

    auto type1 = convert(T_lo_type, T_hi_type);             // PACKSSQD

// 4. move the highest bit of type bit field to the MSB

    // shift depends on position of highest bit, it is 31 - something
    auto type2 = type1 << shift;                            // PSLRD

// 4. get a bit mask

    auto any_5_bitmask = movemask(type2);                   // MOVMSK_PS

Seven instructions. I can't do this better.