Author: Wojciech Muła
Added on: 2016-09-14
There is an array of 32-bit integers and a key, i.e. a specific value. The result has to be a bit vector with bits set at those positions where the key equals the array item. Pseudocode:
```
-- n - array size
for i = 0 .. n - 1 loop
    if key = array[i] then
        bitvector[i] = 1
    else
        bitvector[i] = 0
end for
```
A C++ interface:
```cpp
void bitmask(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector);
```
```cpp
void bitmask_naive(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector) {

    for (size_t i=0; i < n; i++) {
        const uint8_t val = (array[i] == key); // in C++ a bool cast to int has value 0 or 1

        bitvector[i/8] |= val << (i % 8);
    }
}
```
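A minimal usage sketch (not part of the original article). Note that the naive procedure only ORs bits into the output, so the buffer has to be zeroed beforehand:

```cpp
#include <cstdint>
#include <cstdio>

// definition as above
void bitmask_naive(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector);

int main() {
    const uint32_t data[] = {1, 42, 3, 42, 42, 6, 7, 8};
    uint8_t bitvector[1] = {0}; // must be zero-initialized: bitmask_naive only ORs bits in

    bitmask_naive(data, 8, 42, bitvector);

    printf("%02x\n", bitvector[0]); // 42 occurs at indices 1, 3 and 4 -> 0b00011010 = 0x1a
}
```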
The main problem with this code is that it involves at least one bit-shift (likely two) and one bit-or per element. The first bit-shift computes the byte index within bitvector, the second moves the comparison result to the (i % 8)-th bit.
Assembler code produced by GCC 5.3 (with the -O3 flag):
```
.L5:
    movq    %rax, %r9
    xorl    %r8d, %r8d
    movl    %eax, %ecx
    shrq    $3, %r9                 ; index
    cmpl    %edx, (%rdi,%rax,4)
    sete    %r8b                    ; set val
    andl    $7, %ecx                ; i % 8
    addq    $1, %rax
    sall    %cl, %r8d               ; shift val left by i % 8
    orb     %r8b, (%r10,%r9)        ; update byte at the index
    cmpq    %rax, %rsi
    jne     .L5
```
The following code is suitable for 64-bit machines, as it relies on 64-bit arithmetic; its only advantage over the naive version is that it updates the bit vector in 32-bit portions. It exploits carry propagation: key XOR input is zero exactly when the two values are equal, and adding any non-zero 32-bit value to low = bit - 1 always carries into the bit position that low just misses.
```cpp
void bitmask_better_1(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector) {

    const size_t blocks = n / 32;
    const size_t tail   = n % 32;

    const uint32_t* input = array;
    uint32_t* bitvector32 = reinterpret_cast<uint32_t*>(bitvector);

    for (size_t i=0; i < blocks; i++) {
        uint64_t result = 0xffffffffffffffffllu;
        uint64_t bit    = uint64_t(1) << 32;

        for (int j=0; j < 32; j++) {
            uint64_t low = bit - 1;

            // j-th bit (in the higher 32-bit word) is set if the current input
            // is not equal to the key
            uint64_t tmp = (key ^ *input) + low;
            input++;

            result ^= (tmp & bit); // clear the j-th bit if the above condition is true

            bit <<= 1;
        }

        *bitvector32++ = result >> 32; // save 32 bits at once
    }

    // process tail
}
```
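A standalone sketch (not from the original article) demonstrating the carry trick for a single element, with bit = 2^32 as in the first inner-loop iteration:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t key = 42;
    const uint64_t bit = uint64_t(1) << 32; // j = 0: the lowest bit of the upper word
    const uint64_t low = bit - 1;

    const uint32_t values[] = {42, 7};

    for (const uint32_t x : values) {
        // key ^ x is zero iff x == key; adding it to low = bit - 1 carries
        // into bit 32 exactly when the xor is non-zero
        const uint64_t tmp = uint64_t(key ^ x) + low;

        printf("x = %2u: bit 32 is %s\n", x, (tmp & bit) ? "set (x != key)" : "clear (x == key)");
    }
}
```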
The next solution maps very well onto x86 assembly and, unlike the previous one, allows any relation to be used in place of equal/not-equal.
```cpp
void bitmask_better_2(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector) {

    const size_t blocks = n / 32;
    const size_t tail   = n % 32;

    const uint32_t* input = array;
    uint32_t* bitvector32 = reinterpret_cast<uint32_t*>(bitvector);

    for (size_t i=0; i < blocks; i++) {
        uint32_t result = 0;

        for (int j=31; j >= 0; j--) {
            const uint32_t bit = (input[j] == key); // (1)

            result = 2*result + bit; // (2)
        }

        *bitvector32++ = result;
        input += 32;
    }

    // process tail
}
```
Line (1) should be translated into two instructions: a comparison followed by setcc. Line (2) should be translated into a single lea. Since j runs from 31 down to 0 and each iteration doubles the partial result, the comparison bit for input[j] ends up at bit position j.
Assembler code produced by GCC 5.3 (with the -O3 flag):
```
.L35:
    xorl    %edi, %edi
    cmpl    %edx, (%rax)            ; set value of bit
    sete    %dil                    ; ...
    subq    $4, %rax
    cmpq    %rax, %r8
    leal    (%rdi,%rsi,2), %esi     ; update result (note: a register)
    jne     .L35

    addq    $4, %rcx
    subq    $-128, %r9
    movl    %esi, -4(%rcx)          ; save 32 bits at once
```
The big advantage of this method is that any arithmetic relation can be used instead of equal/not-equal; it is just a matter of selecting the proper setcc variant.
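For example, a hypothetical unsigned less-than variant of the inner loop (a sketch, not from the article); the compiler should then emit setb instead of sete:

```cpp
// inner loop of bitmask_better_2, with "equal" replaced by unsigned "less than"
for (int j=31; j >= 0; j--) {
    const uint32_t bit = (input[j] < key); // (1) cmp followed by setb

    result = 2*result + bit; // (2) unchanged
}
```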
The SSE procedure processes data in 16-element blocks (4 × 4 items). Using a single SSE register per iteration (4 items) is possible, but not optimal.
Like the previous procedure, this one also allows any arithmetic relation to be used: SSE2 provides pcmpeqd and pcmpgtd for 32-bit integers, and the remaining relations can be derived from them.
```
key = 0xaaaa

d0 = [0000|aaaa|bbbb|0000]
d1 = [aaaa|cccc|1111|aaaa]
d2 = [0000|aaaa|dddd|2222]
d3 = [3333|1111|aaaa|cccc]
```
A vector comparison yields:
```
d0 = [0000|ffff|0000|0000]
d1 = [ffff|0000|0000|ffff]
d2 = [0000|ffff|0000|0000]
d3 = [0000|0000|ffff|0000]
```
Now the pairs d0, d1 and d2, d3 are narrowed from 32-bit masks to 16-bit masks using packssdw (note the signed saturation):
```
d01 = [00|ff|00|00|ff|00|00|ff]
d23 = [00|ff|00|00|00|00|ff|00]
```
And then from 16-bit to 8-bit masks using packsswb:
```
byte-mask = [0f00f00f0f0000f0]
```

Here each digit stands for a whole byte: 0 is 0x00 and f is 0xff.
Finally, the byte mask is converted into a bit mask with pmovmskb. Assuming the leftmost lane in the diagrams above is element 0 (the lowest byte), the running example yields the 16-bit mask 0x4292, i.e. bits 1, 4, 7, 9 and 14 set.
```cpp
void bitmask_SSE(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector) {

    const size_t blocks = n / 16;
    const size_t tail   = n % 16;

    const __m128i* input = reinterpret_cast<const __m128i*>(array);
    uint16_t* bitvector16 = reinterpret_cast<uint16_t*>(bitvector);

    const __m128i vkey = _mm_set1_epi32(key);

    for (size_t i=0; i < blocks; i++) {
        // packed dword results
        const __m128i d0 = _mm_cmpeq_epi32(vkey, _mm_loadu_si128(input + 4*i + 0));
        const __m128i d1 = _mm_cmpeq_epi32(vkey, _mm_loadu_si128(input + 4*i + 1));
        const __m128i d2 = _mm_cmpeq_epi32(vkey, _mm_loadu_si128(input + 4*i + 2));
        const __m128i d3 = _mm_cmpeq_epi32(vkey, _mm_loadu_si128(input + 4*i + 3));

        // packed word results
        const __m128i d01 = _mm_packs_epi32(d0, d1);
        const __m128i d23 = _mm_packs_epi32(d2, d3);

        // packed byte result
        const __m128i pck = _mm_packs_epi16(d01, d23);

        // convert byte-mask into bit-mask
        bitvector16[i] = _mm_movemask_epi8(pck);
    }

    // process tail
}
```
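As a sketch of the relation flexibility mentioned earlier (a hypothetical variant, not from the article): for a signed greater-than mask only the comparisons change, since the packing and movemask steps are relation-agnostic:

```cpp
// hypothetical variant of the block body: signed "array[i] > key" instead of equality
const __m128i d0 = _mm_cmpgt_epi32(_mm_loadu_si128(input + 4*i + 0), vkey);
const __m128i d1 = _mm_cmpgt_epi32(_mm_loadu_si128(input + 4*i + 1), vkey);
const __m128i d2 = _mm_cmpgt_epi32(_mm_loadu_si128(input + 4*i + 2), vkey);
const __m128i d3 = _mm_cmpgt_epi32(_mm_loadu_si128(input + 4*i + 3), vkey);
// ... the packssdw/packsswb/pmovmskb sequence stays exactly the same
```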
The above procedures were tested on a real system; the SIMD version and the second scalar version turned out to be the fastest on Skylake.
The GitHub repository contains all the algorithms and a test program.