Author: | Wojciech Muła |
---|---|
Added on: | 2016-03-13 |
Contents
In articles about base64 encoding and decoding I've showed how to implement SIMD version of a lookup table using basic vector instruction. This text describes another technique which employs my favourite instruction pshufb.
The task is defined as follows:
Now, for an input value a proper subrange is determined, and the value associated to the subrange is returned. Of course, everything is done in parallel.
All subranges span the whole input range (0..255). Subranges are order and each subrange gets its index from 0 to n. Limitation: since the method uses pshufb, the maximum number of subranges is 16.
Algorithm finds an subrange's index for given value by comparing the value with the upper limit of each subrange and accumulating the comparison value. When the index is know, then it is transformed by pshufb into a value associated with a subrange.
Pseudocode:
procedure byte_lookup(x: packed_byte) begin index := packed_byte(0) for i := 0 to n: // cmp[k] = 0xff or 0x00 cmp := packed_compare(x, subrange[i].upper) index := packed_subtract(index, cmp) end for // index[k] = 0..n LUT = packed_byte(const_0, const_1, const_1, ...); return pshufb(LUT, index); end
Number of operations:
Please be aware that both constant vectors used in comparison and a LUT vector could be stored by a compiler in memory and fetched during execution.
Lookup in base64 decoder.
index | input | comment | shift |
---|---|---|---|
0 | 00 .. 42 | invalid | |
1 | 43 | '+' | 19 |
2 | 44 .. 46 | invalid | |
3 | 47 | '/' | 16 |
4 | 48 .. 57 | '0' - '9' | 4 |
5 | 58 .. 64 | invalid | |
6 | 65 .. 90 | 'A' - 'Z' | -65 |
7 | 91 .. 96 | invalid | |
8 | 97 .. 122 | 'a' - 'z' | -71 |
9 | 122 .. ... | invalid |
C++ code:
__m128i index = packed_byte(0); index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(42))); index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(43))); index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(46))); index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(47))); index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(57))); index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(64))); index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(90))); index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(96))); index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(122))); const char invalid = 0; const __m128i LUT = _mm_setr_epi8( /* 0 */ invalid, /* 1 */ char(19), /* 2 */ invalid, /* 3 */ char(16), /* 4 */ char(4), /* 5 */ invalid, /* 6 */ char(-65), /* 7 */ invalid, /* 8 */ char(-71), /* 9 */ invalid, // the rest is undefined, so also marked as invalid invalid, invalid, invalid, invalid, invalid, invalid ); const __m128i shift = _mm_shuffle_epi8(LUT, index);