Author: | Wojciech Muła |
---|---|
Added on: | 2015-03-21 |
Informal specification:
```
__m128i mask_lower(const unsigned n) {
    assert(n < 16);

    /*
    __m128i result = 0;
    for (unsigned int i = 0; i < n; i++) {
        result.byte[i] = 0xff;
    }
    */

    switch (n) {
        case 0:  return {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
                         0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
        case 1:  return {0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
                         0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
        case 2:  return {0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
                         0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
        // ...
        case 14: return {0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
                         0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00, 0x00};
        case 15: return {0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
                         0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00};
    }
}

__m128i mask_higher(const unsigned n) {
    assert(n < 16);

    // the complement of a lower mask: the n highest bytes are 0xff
    return n == 0 ? mask_lower(0)          // all zeros
                  : ~mask_lower(16 - n);   // the 16 - n lowest bytes cleared
}
```
The obvious approach is to use two lookup tables with 16 entries each; then mask_lower is just a single aligned load, movdqa. The downside of this solution is a quite large lookup: 2 * 16 * 16 bytes = 512 bytes. For AVX/AVX2 it's four times larger (2 KiB).
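For reference, a minimal sketch of this plain-lookup variant (my code, not from the original note; it assumes SSE2 intrinsics and a GCC/Clang alignment attribute so that the load can compile to movdqa):

```c
#include <stdint.h>
#include <emmintrin.h>  // SSE2

// 16 entries x 16 bytes = 256 bytes; an analogous table
// for mask_higher doubles this to 512 bytes
static uint8_t lower_tbl[16][16] __attribute__((aligned(16)));

static void init_lower_tbl(void) {
    for (unsigned n = 0; n < 16; n++)
        for (unsigned i = 0; i < 16; i++)
            lower_tbl[n][i] = (i < n) ? 0xff : 0x00;
}

static __m128i mask_lower_lookup(const unsigned n) {
    return _mm_load_si128((const __m128i*)lower_tbl[n]);  // movdqa
}
```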
This idea comes from my friend Piotr Wyderski, and I have his permission to publish the trick. It's pretty neat. :)
We need just 48 bytes of contiguous memory, i.e. three SSE vectors. The memory is filled with the following pattern:
```
lookup[ 0..15] = 0xff;
lookup[16..31] = 0x00;
lookup[32..47] = 0xff;
```
Then mask_lower(n) is reduced to a single unaligned load, movdqu, from lookup[16 - n]. Likewise mask_higher(n) loads the vector from lookup[16 + n]. It can't be done better.
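A minimal sketch of the whole trick (my code; the function names follow the specification above, and the loads place no alignment requirement on lookup):

```c
#include <stdint.h>
#include <emmintrin.h>  // SSE2

// 48 bytes: 16 x 0xff, 16 x 0x00, 16 x 0xff
static const uint8_t lookup[48] = {
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
};

static __m128i mask_lower(const unsigned n) {
    // the n lowest bytes are 0xff: a single movdqu
    return _mm_loadu_si128((const __m128i*)(lookup + 16 - n));
}

static __m128i mask_higher(const unsigned n) {
    // the n highest bytes are 0xff: a single movdqu
    return _mm_loadu_si128((const __m128i*)(lookup + 16 + n));
}
```

Note that both helpers also extend naturally to n = 16 (all bytes set), since the loads still stay within the 48-byte table.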
Another solution uses the instruction palignr. The instruction takes three arguments: two vectors and an immediate index. It joins the vectors into one 32-byte vector and then returns the 16-byte subvector starting at the index.
See the pseudocode:

```
// A, B   - 16-element vectors
// tmp    - 32-element vector
// index  - immediate value 0..15
__m128i palignr(__m128i A, __m128i B, unsigned index) {
    tmp[0 .. 15]  := A[0 .. 15];
    tmp[16 .. 31] := B[0 .. 15];
    return tmp[index .. index + 15];
}
```
When the vector A is full of ones (zeros) and the vector B is full of zeros (ones), then palignr calculates mask_lower (mask_higher). Filling a vector with ones or zeros is fast: pcmpeqb reg, reg and pxor reg, reg do the job.
Unfortunately palignr doesn't accept variable index values, just an immediate, so for each index value a different instruction has to be invoked. The dispatch costs two jumps, which is quite expensive. But if we need to conserve memory, or the operation isn't on a hot path, then this solution seems to be OK.
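For completeness, here is a hedged sketch of such a dispatch (my code; _mm_alignr_epi8 is the SSSE3 intrinsic behind palignr; note that, unlike the pseudocode above, the intrinsic's first argument lands in the upper half of the 32-byte composite):

```c
#include <tmmintrin.h>  // SSSE3: _mm_alignr_epi8

static __m128i mask_lower_palignr(const unsigned n) {
    const __m128i zeros = _mm_setzero_si128();           // pxor reg, reg
    const __m128i ones  = _mm_cmpeq_epi8(zeros, zeros);  // pcmpeqb reg, reg

    // composite = [zeros : ones]; shifting right by 16 - n bytes
    // leaves n bytes of 0xff in the lower part of the result
    switch (n) {
        case  1: return _mm_alignr_epi8(zeros, ones, 15);
        case  2: return _mm_alignr_epi8(zeros, ones, 14);
        case  3: return _mm_alignr_epi8(zeros, ones, 13);
        case  4: return _mm_alignr_epi8(zeros, ones, 12);
        case  5: return _mm_alignr_epi8(zeros, ones, 11);
        case  6: return _mm_alignr_epi8(zeros, ones, 10);
        case  7: return _mm_alignr_epi8(zeros, ones,  9);
        case  8: return _mm_alignr_epi8(zeros, ones,  8);
        case  9: return _mm_alignr_epi8(zeros, ones,  7);
        case 10: return _mm_alignr_epi8(zeros, ones,  6);
        case 11: return _mm_alignr_epi8(zeros, ones,  5);
        case 12: return _mm_alignr_epi8(zeros, ones,  4);
        case 13: return _mm_alignr_epi8(zeros, ones,  3);
        case 14: return _mm_alignr_epi8(zeros, ones,  2);
        case 15: return _mm_alignr_epi8(zeros, ones,  1);
        default: return zeros;                           // n == 0
    }
}
```

mask_higher is analogous: swap the roles of zeros and ones and shift by n instead of 16 - n. The switch typically compiles to a range check plus an indirect jump, i.e. the two jumps mentioned above.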