Author: | Wojciech Muła |
---|---|
Added on: | 2015-03-22 |
Contents
The problem: there are 64-bit values with some data bits and some metadata bits; the metadata includes a k-bit field describing a "type" (k >= 0). The type field is located in a lower 32-bits.
Procedure processes two "types", one denoted with the code 3 and another with 5. When all items are of type 3 then we can use a fast AVX2 path. If there are some types 5, we have to call an additional function (a virtual method to be precise). Pseudocode:
for (size_t i = 0; i < size; i += 8) { // two AVX registers, or 8 elements, are processed in a single loop __m256i A_lo = {A64[i + 0], A64[i + 1], A64[i + 2], A64[i + 3]}; __m256i A_hi = {A64[i + 4], A64[i + 5], A64[i + 6], A64[i + 7]}; __m256i B_lo = {B64[i + 0], B64[i + 1], B64[i + 2], B64[i + 3]}; __m256i B_hi = {B64[i + 4], B64[i + 5], B64[i + 6], B64[i + 7]}; if (any element of A or B vector has type 5) { // *** // slow path for (int k = 0; k < 4; k++) { result[i + k] = function(A_lo[i + k], B_lo[i + k]); result[i + k + 4] = function(A_hi[i + k], B_hi[i + k]); } } else { // fast path ... } // further processing }
We have to fill condition of the if-statement.
// 1. isolate type fields // assembler instruction: auto A_lo_type = A_lo & packed_qword(TYPE_FIELD_MASK); // PAND auto A_hi_type = A_hi & packed_qword(TYPE_FIELD_MASK); // PAND auto B_lo_type = B_lo & packed_qword(TYPE_FIELD_MASK); // PAND auto B_hi_type = B_hi & packed_qword(TYPE_FIELD_MASK); // PAND // 2. convert packed_qword to packed_dword (64 => 32 bits) auto A_type = convert(A_lo_type, A_hi_type); // PACKSSQD auto B_type = convert(B_lo_type, B_hi_type); // PACKSSQD // 3. compare types with magic 5 auto A_has_5 = compare(A_type, TYPE_FIELD_5); // PCMPEQD auto B_has_5 = compare(B_type, TYPE_FIELD_5); // PCMPEQD auto any_5 = A_has_5 | B_has_5; // POR // 4. get a bit mask auto any_5_bitmask = movemask(any_5); // MOVMSK_PS
In total ten instructions. It doesn't look well. Let's try another approach.
In this version we change the third step. An addition is used instead of a comparison: A_type + B_type + magic_value — the magic_value causes to set the highest bit when A_type or B_type is 5. How to calculate this value? Let say the code field is located at 4th bit and it has 3 bits width:
31 6 3 0 bit +-+------------------------+---+----+ |X|111111111111111111111111|TYP| | +-+------------------------+---+----+
All bits after type field and before the most significant bit X are set. If an overflow occur while adding type field, then the carry bit will propagate through ones up to the bit X. In this case magic_value = 0b0111 1111 1111 1111 1111 1111 1000 0000 = 0x7fffff80.
// 1. isolate type fields // assembler instruction: auto A_lo_type = A_lo & packed_qword(TYPE_FIELD_MASK); // PAND auto A_hi_type = A_hi & packed_qword(TYPE_FIELD_MASK); // PAND auto B_lo_type = B_lo & packed_qword(TYPE_FIELD_MASK); // PAND auto B_hi_type = B_hi & packed_qword(TYPE_FIELD_MASK); // PAND // 2. convert packed_qword to packed_dword (64 => 32 bits) auto A_type = convert(A_lo_type, A_hi_type); // PACKSSQD auto B_type = convert(B_lo_type, B_hi_type); // PACKSSQD // 3. compare types with magic 5 auto any_5 = A_type + B_type + packed_dword(0x7fffff80) // PADDD x 2 // 4. same is version 1 auto any_5_bitmask = movemask(any_5); // MOVMSK_PS
In total nine instructions. A bit better and we've also learned a new trick.
Now we stop thinking "in SIMD-way". Simply compare binary images of 3 and 5: 3 = 0b011, 5 = 0b101 — it's clear that 5 has set 3rd bit, so this bit must be somehow gathered by the movemask instruction.
Here is the solution:
// 1. merge lo and hi parts of inputs auto T_lo = A_lo | B_lo; // POR auto T_hi = B_hi | B_hi; // POR // 2. isolate merged type fields auto T_lo_type = T_lo & packed_qword(TYPE_FIELD_MASK); // PAND auto T_hi_type = T_hi & packed_qword(TYPE_FIELD_MASK); // PAND // 3. convert packed_qword to packed_dword (64 => 32 bits) auto type1 = convert(T_lo_type, T_hi_type); // PACKSSQD // 4. move the highest bit of type bit field to the MSB // shift depends on position of highest bit, it is 31 - something auto type2 = type1 << shift; // PSLRD // 4. get a bit mask auto any_5_bitmask = movemask(type2); // MOVMSK_PS
Seven instructions. I can't do this better.