Author: | Wojciech Muła |
---|---|
Added on: | 2016-04-03 |
Updated on: | 2018-12-08 (performance results from Cannon Lake [AVX512VL & AVX512VBMI]), 2018-11-05 (performance results from SkylakeX [AVX512BW]) |
The SIMD versions of base64 conversion algorithms were described in Base64 encoding with SIMD instructions and Base64 decoding with SIMD instructions. I also described how to realize both encoding and decoding using AVX512F (Foundation) instructions.
AVX512BW (Byte & Word) comes with a great number of new instructions; the following instructions can help with base64-related problems:
The extension AVX512VBMI adds even more powerful instructions:
The extension AVX512VL adds just one instruction, but a really nice one:
2018-04-18: In earlier versions of this text I wrongly assumed that the instructions vpermb and vpermi2b are part of AVX512BW. Sorry for that.
The SIMD encoding consists of the following steps:
In the SSE code, loading data and splitting 24-bit words into 32-bit lanes is done by the pshufb instruction. In AVX2 the instruction operates on 128-bit lanes, which makes this step more complicated.
In the case of AVX512BW, two shuffles are needed.
Initial layout of the register after loading the data. We need only triplets A .. P:
[????|????|????|????|PPPO|OONN|NMMM|LLLK|KKJJ|JIII|HHHG|GGFF|FEEE|DDDC|CCBB|BAAA]
|      lane 3       ╵      lane 2       ╵      lane 1       ╵      lane 0       |
Move 32-bit words across lanes using vpermd:
[0000|PPPO|OONN|NMMM|0000|LLLK|KKJJ|JIII|0000|HHHG|GGFF|FEEE|0000|DDDC|CCBB|BAAA]
|      lane 3       ╵      lane 2       ╵      lane 1       ╵      lane 0       |
Reshuffle bytes within 128-bit lanes to the format required by the AVX512BW/AVX512VL unpack procedure, which is shown in the next section.
The input layout of a single lane (the 0th one):
[00 00 00 00|D2 D1 D0 C2|C1 C0 B2 B1|B0 A2 A1 A0]
|  4 bytes  |
The layout of the lane after shuffling; the order of bytes within a 32-bit word is 1, 2, 0, 1:
[D1 D2 D0 D1|C1 C2 C0 C1|B1 B2 B0 B1|A1 A2 A0 A1]
 10 11  9 10  7  8  6  7  4  5  3  4  1  2  0  1
The first shuffle moves data across the 128-bit lanes; then, within each lane, bytes are shuffled into the desired positions.
const __m512i tmp1 = _mm512_loadu_si512(input + i);
const __m512i tmp2 = _mm512_permutexvar_epi32(
    _mm512_set_epi32(-1, 11, 10, 9, -1, 8, 7, 6, -1, 5, 4, 3, -1, 2, 1, 0),
    tmp1
);
const __m512i tmp3 = _mm512_shuffle_epi8(
    tmp2,
    _mm512_set4_epi32(0x0a0b090a, 0x07080607, 0x04050304, 0x01020001)
);
In the case of AVX512VBMI the problem is solved with a single invocation of vpermb.
// load 48 bytes
// v = [...|DDDC|CCBB|BAAA]
const __m512i v = _mm512_loadu_si512(reinterpret_cast<const __m512i*>(input + i));

// split 24-bit words to 32-bit lanes
// in = [...|0DDD|0CCC|0BBB|0AAA]
const __m512i in = _mm512_permutexvar_epi8(shuffle_input, v);
Each 32-bit input word contains four 6-bit fields a, b, c and d; the expected output from this step:
[00dddddd|00cccccc|00bbbbbb|00aaaaaa]
The input order of the fields is as follows:
[????????|ccdddddd|bbbbcccc|aaaaaabb]
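For reference, here is the same extraction written as scalar code for a single little-endian 32-bit lane (just an illustrative sketch, not code from the repository):

#include <cstdint>

// lane bytes (low to high): [aaaaaabb][bbbbcccc][ccdddddd][????????]
uint32_t unpack_scalar(uint32_t lane) {
    const uint32_t a =   (lane >>  2) & 0x3f;
    const uint32_t b =  ((lane & 0x03) << 4) | ((lane >> 12) & 0x0f);
    const uint32_t c = (((lane >>  8) & 0x0f) << 2) | ((lane >> 22) & 0x03);
    const uint32_t d =   (lane >> 16) & 0x3f;

    // result: [00dddddd|00cccccc|00bbbbbb|00aaaaaa]
    return a | (b << 8) | (c << 16) | (d << 24);
}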
Below is the AVX512F version from the article linked above. The lack of byte-level instructions makes the procedure pretty complicated.
template <int shift, uint32_t mask>
__m512i merge(__m512i target, __m512i src) {
    __m512i shifted;
    if (shift > 0) {
        shifted = _mm512_srli_epi32(src, shift);
    } else {
        shifted = _mm512_slli_epi32(src, -shift);
    }

    return _mm512_ternarylogic_epi32(_mm512_set1_epi32(mask), shifted, target, 0xca);
}

__m512i unpack(const __m512i in) {

    // [00000000|00000000|00000000|00aaaaaa]
    __m512i indices = _mm512_and_si512(_mm512_srli_epi32(in, 2), packed_dword(0x0000003f));

    // [00000000|00000000|00BB0000|00aaaaaa]
    indices = merge<-12, 0x00003000>(indices, in);

    // [00000000|00000000|00BBbbbb|00aaaaaa]
    indices = merge<  4, 0x00000f00>(indices, in);

    // [00000000|00CCCC00|00BBbbbb|00aaaaaa]
    indices = merge<-10, 0x003c0000>(indices, in);

    // [00000000|00CCCCcc|00BBbbbb|00aaaaaa]
    indices = merge<  6, 0x00030000>(indices, in);

    // [00dddddd|00CCCCcc|00BBbbbb|00aaaaaa]
    indices = merge< -8, 0x3f000000>(indices, in);

    return indices;
}
The unpacking can be performed faster with the help of the vpsllvw and vpsrlvw instructions. This requires a change in the 32-bit lane format from:
[????????|ccdddddd|bbbbcccc|aaaaaabb]
           byte 2   byte 1   byte 0
to:
[bbbbcccc|ccdddddd|aaaaaabb|bbbbcccc]
 ^^^^                           ^^^^
 unused bits                    unused bits
  byte 1   byte 2   byte 0   byte 1
It only requires a different vector to be passed to vpshufb (AVX512BW) or vpermb (AVX512VBMI) in the byte-splitting step.
Algorithm:
// t0 = [0000cccc|cc000000|aaaaaa00|00000000]
const __m512i t0 = _mm512_and_si512(in, _mm512_set1_epi32(0x0fc0fc00));

// t1 = [00000000|00cccccc|00000000|00aaaaaa]
const __m512i t1 = _mm512_srlv_epi16(t0, _mm512_set1_epi32(0x0006000a));

// t2 = [ccdddddd|00000000|aabbbbbb|cccc0000]
const __m512i t2 = _mm512_sllv_epi16(in, _mm512_set1_epi32(0x00080004));

//    = [00dddddd|00cccccc|00bbbbbb|00aaaaaa]
const __m512i indices = _mm512_ternarylogic_epi32(_mm512_set1_epi32(0x3f003f00), t2, t1, 0xca);
The procedure costs just two shifts and two bitwise operations.
AVX512VL defines the instruction vpmultishiftqb, which may replace all the variable-shift instructions from the previous point. Please note that the layout of 32-bit lanes requires the same modification as described in the previous point.
The instruction builds a vector of bytes from octets located at any bit position in a quadword. The following pseudocode shows the algorithm:
for i in 0 .. 7 loop
    qword := input.qword[i];
    for j in 0 .. 7 loop
        index := indices.byte[i * 8 + j];
        output.byte[i * 8 + j] := rotate_right(qword, index) and 0xff;
    end loop
end loop
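For clarity, here is the same algorithm as plain C++ (an illustrative model of the instruction's semantics, not code from the repository; the hardware uses the low six bits of each control byte as the rotation amount):

#include <cstdint>

void multishift_model(const uint64_t input[8], const uint8_t indices[64], uint8_t output[64]) {
    for (int i = 0; i < 8; i++) {
        const uint64_t qword = input[i];
        for (int j = 0; j < 8; j++) {
            const unsigned shift = indices[i * 8 + j] & 63;
            // rotate the quadword right by 'shift' bits and keep the lowest byte
            const uint64_t rotated = (qword >> shift) | (qword << ((64 - shift) & 63));
            output[i * 8 + j] = uint8_t(rotated & 0xff);
        }
    }
}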
Although vpmultishiftqb produces a vector of bytes while the encoding needs just the six lower bits, no masking is needed: the instruction vpermb (described above) masks indices internally.
Below is a code snippet that shows the proper parameters for vpmultishiftqb.
// after multishift a single 32-bit lane has the following layout:
// [bbbbcccc|bbcccccc|aabbbbbb|ddaaaaaa],
// i.e.: (a = [10:17], b = [4:11], c = [22:27], d = [16:21])
const __m512i shifts  = packed_qword(0x3036242a1016040alu); // 48, 54, 36, 42, 16, 22, 4, 10
const __m512i indices = _mm512_multishift_epi64_epi8(shifts, in);
The last part of the algorithm is converting, in parallel, all 6-bit indices into ASCII codes.
Although vpshufb works on 128-bit lanes, it is sufficient for this conversion. The fastest SSE procedure can be translated one-to-one:
__m512i lookup_pshufb_improved(const __m512i input) {

    // reduce  0..51 -> 0
    //        52..61 -> 1 .. 10
    //            62 -> 11
    //            63 -> 12
    __m512i result = _mm512_subs_epu8(input, packed_byte(51));

    // distinguish between ranges 0..25 and 26..51:
    //  0 .. 25 -> remains 0
    // 26 .. 51 -> becomes 13
    const __mmask64 less = _mm512_cmpgt_epi8_mask(packed_byte(26), input);
    result = _mm512_mask_mov_epi8(result, less, packed_byte(13));

    /* the SSE lookup is readable
        const __m128i shift_LUT = _mm_setr_epi8(
            'a' - 26, '0' - 52, '0' - 52, '0' - 52,
            '0' - 52, '0' - 52, '0' - 52, '0' - 52,
            '0' - 52, '0' - 52, '0' - 52, '+' - 62,
            '/' - 63, 'A', 0, 0
        );
    */
    const __m512i shift_LUT = _mm512_set4_epi32(
        0x000041f0,
        0xedfcfcfc,
        0xfcfcfcfc,
        0xfcfcfc47
    );

    // read shift
    result = _mm512_shuffle_epi8(shift_LUT, result);

    return _mm512_add_epi8(result, input);
}
The already introduced instruction vpermb performs a lookup in a 64-byte AVX512 register. This is exactly the size of the lookup table needed by the encoding algorithm, thus just a single invocation of the instruction is needed.
const __m512i result = _mm512_permutexvar_epi8(indices, lookup);
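The lookup register simply holds the 64-character base64 alphabet, so that index i maps to the i-th ASCII code. A minimal sketch of how such a constant could be built (the repository may define it differently):

#include <immintrin.h>

__m512i encode_lookup() {
    static const char alphabet[64] = {
        'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P',
        'Q','R','S','T','U','V','W','X','Y','Z','a','b','c','d','e','f',
        'g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v',
        'w','x','y','z','0','1','2','3','4','5','6','7','8','9','+','/'
    };
    return _mm512_loadu_si512(reinterpret_cast<const void*>(alphabet));
}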
The SIMD decoding consists of the following steps:
In the case of decoding the limitations of vpshufb are likewise not a problem, so the fastest SSE decoding solution can be applied. Below is the actual implementation.
__m512i lookup_pshufb_bitmask(const __m512i input) {

    const __m512i higher_nibble = _mm512_and_si512(_mm512_srli_epi32(input, 4), packed_byte(0x0f));
    const __m512i lower_nibble  = _mm512_and_si512(input, packed_byte(0x0f));

    const __m512i shiftLUT = _mm512_set4lanes_epi8(
        0,  0, 19,  4, -65, -65, -71, -71,
        0,  0,  0,  0,   0,   0,   0,   0);

    const __m512i maskLUT = _mm512_set4lanes_epi8(
        /* 0        : 0b1010_1000*/ 0xa8,
        /* 1 .. 9   : 0b1111_1000*/ 0xf8, 0xf8, 0xf8, 0xf8, 0xf8, 0xf8, 0xf8, 0xf8, 0xf8,
        /* 10       : 0b1111_0000*/ 0xf0,
        /* 11       : 0b0101_0100*/ 0x54,
        /* 12 .. 14 : 0b0101_0000*/ 0x50, 0x50, 0x50,
        /* 15       : 0b0101_0100*/ 0x54
    );

    const __m512i bitposLUT = _mm512_set4lanes_epi8(
        0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
    );

    const __m512i   sh    = _mm512_shuffle_epi8(shiftLUT, higher_nibble);
    const __mmask64 eq_2f = _mm512_cmpeq_epi8_mask(input, packed_byte(0x2f));
    const __m512i   shift = _mm512_mask_mov_epi8(sh, eq_2f, packed_byte(16));

    const __m512i M   = _mm512_shuffle_epi8(maskLUT, lower_nibble);
    const __m512i bit = _mm512_shuffle_epi8(bitposLUT, higher_nibble);

    const uint64_t match = _mm512_test_epi8_mask(M, bit);

    if (match != uint64_t(-1)) {
        // report error
    }

    return _mm512_add_epi8(input, shift);
}
This step is a perfect place to utilize the instruction vpermi2b, which requires three registers:
Valid base64 input characters are always standard ASCII, so they never have the most significant bit (MSB) set. Thanks to that, the seven lowest bits of the input can be used directly as indices for vpermi2b. (The instruction simply ignores the MSB, so no masking is required.)
The lookup table has to be precalculated. It translates an ASCII code into either 6-bit data or an invalid-character marker. The marker value is 0x80; thanks to that, both extended ASCII and invalid characters can be easily identified in one step.
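A sketch of how the two halves of such a table could be precalculated (my own illustration; the names lookup_0 and lookup_1 match the snippet below, but the construction is not copied from the repository):

#include <cstdint>
#include <immintrin.h>

void build_decode_tables(__m512i& lookup_0, __m512i& lookup_1) {
    const char* alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                           "abcdefghijklmnopqrstuvwxyz"
                           "0123456789+/";

    uint8_t table[128];
    for (int i = 0; i < 128; i++) table[i] = 0x80;                          // invalid-character marker
    for (int i = 0; i < 64; i++)  table[uint8_t(alphabet[i])] = uint8_t(i); // ASCII -> 6-bit value

    lookup_0 = _mm512_loadu_si512(reinterpret_cast<const void*>(table));      // ASCII codes 0..63
    lookup_1 = _mm512_loadu_si512(reinterpret_cast<const void*>(table + 64)); // ASCII codes 64..127
}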
__m512i lookup(const __m512i input) {

    const __m512i lookup_0 = precalc::lookup_0;
    const __m512i lookup_1 = precalc::lookup_1;

    const __m512i translated = _mm512_permutex2var_epi8(lookup_0, input, lookup_1);

    // convert MSBs to the mask
    const uint64_t mask = _mm512_movepi8_mask(translated | input);
    if (mask) {
        // report error
    }

    return translated;
}
The final stage of decoding is packing all 6-bit fields into a continuous array of 48 bytes. It is done in two steps:
The first step is a direct translation of the SSE code. It uses a multiply-add instruction twice; for the given constants, each multiply-add performs two shifts and a bitwise OR at once.
// input:  [00dddddd|00cccccc|00bbbbbb|00aaaaaa]
// merge:  [0000cccc|ccdddddd|0000aaaa|aabbbbbb]
const __m128i merge_ab_and_bc = _mm_maddubs_epi16(values, packed_dword(0x01400140));

// result: [00000000|aaaaaabb|bbbbcccc|ccdddddd]
return _mm_madd_epi16(merge_ab_and_bc, packed_dword(0x00011000));
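Since AVX512BW provides vpmaddubsw and vpmaddwd on 512-bit registers, the same constants carry over directly. A sketch of the 512-bit variant (assuming values holds the 6-bit fields in the layout above):

// input:  [00dddddd|00cccccc|00bbbbbb|00aaaaaa]
// merged: [0000cccc|ccdddddd|0000aaaa|aabbbbbb]
const __m512i merge_ab_and_bc = _mm512_maddubs_epi16(values, _mm512_set1_epi32(0x01400140));

// packed: [00000000|aaaaaabb|bbbbcccc|ccdddddd]
const __m512i packed = _mm512_madd_epi16(merge_ab_and_bc, _mm512_set1_epi32(0x00011000));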
Initial layout of the register after step A:
[0PPP|0OOO|0NNN|0MMM|0LLL|0KKK|0JJJ|0III|0HHH|0GGG|0FFF|0EEE|0DDD|0CCC|0BBB|0AAA]
|      lane 3       |      lane 2       |      lane 1       |      lane 0       |
Pack bytes within lanes using vpshufb; the 3rd doubleword in each lane is empty:
[0000|PPPO|OONN|NMMM|0000|LLLK|KKJJ|JIII|0000|HHHG|GGFF|FEEE|0000|DDDC|CCBB|BAAA]
|      lane 3       |      lane 2       |      lane 1       |      lane 0       |
Move non-empty 32-bit words across lanes using vpermd, forming a continuous 48-byte array:
[0000|0000|0000|0000|PPPO|OONN|NMMM|LLLK|KKJJ|JIII|HHHG|GGFF|FEEE|DDDC|CCBB|BAAA]
|      lane 3       |      lane 2       |      lane 1       |      lane 0       |
Below is a sample implementation.
const __m512i t1 = _mm512_shuffle_epi8(
    packed,
    _mm512_set4lanes_epi8(2, 1, 0, 6, 5, 4, 10, 9, 8, 14, 13, 12, -1, -1, -1, -1)
);

// shuffle bytes
const __m512i s6 = _mm512_setr_epi32(
     0,  1,  2,
     4,  5,  6,
     8,  9, 10,
    12, 13, 14,
    // unused
     0,  0,  0,  0);

const __m512i t2 = _mm512_permutexvar_epi32(s6, t1);

_mm512_storeu_si512(reinterpret_cast<__m512i*>(out), t2);
In the case of AVX512VBMI this step uses just a single vpermb, as sketched below.
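A sketch of that single shuffle (my own illustration; in real code the index vector would be a precalculated constant). The indices pick bytes 2, 1, 0 of every 32-bit word, so the 48 meaningful bytes land contiguously and the remaining 16 output bytes are don't-care:

#include <cstdint>
#include <immintrin.h>

__m512i pack_with_vpermb(const __m512i packed) {
    uint8_t idx[64];
    for (int i = 0; i < 16; i++) {              // 16 dwords -> 48 output bytes
        idx[3*i + 0] = uint8_t(4*i + 2);
        idx[3*i + 1] = uint8_t(4*i + 1);
        idx[3*i + 2] = uint8_t(4*i + 0);
    }
    for (int i = 48; i < 64; i++) idx[i] = 0;   // don't care

    const __m512i pack_shuffle = _mm512_loadu_si512(reinterpret_cast<const void*>(idx));
    return _mm512_permutexvar_epi8(pack_shuffle, packed);
}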
Daniel Lemire gave me access to machines that support the AVX512BW, AVX512VL and AVX512VBMI instruction sets. Below are performance results for both encoding and decoding.
CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz
Compiler: gcc (Ubuntu 8.1.0-5ubuntu1~16.04) 8.1.0
procedure | description
---|---
scalar (32 bit) | lookup-based method that loads four input bytes at once
scalar (64 bit) | similar to the above one, but reads 8 input bytes at once
AVX2 (lookup: pshufb improved) | code that uses the shuffle-based algorithm
AVX512BW (lookup: pshufb improved) | exactly the same shuffle-based algorithm
procedure | best [cycles] | avg. [cycles] | speedup | |
---|---|---|---|---|
scalar (32 bit) | 5.768 | 6.452 | 1.00 | ███▋ |
scalar (64 bit) | 5.537 | 5.599 | 1.04 | ███▊ |
AVX2 (lookup: pshufb unrolled improved) | 0.681 | 0.698 | 8.47 | ███████████████████████████████▏ |
AVX512BW (lookup: optimized2) | 0.726 | 0.734 | 7.94 | █████████████████████████████▎ |
AVX512BW (lookup: pshufb improved) | 0.425 | 0.436 | 13.57 | ██████████████████████████████████████████████████ |
AVX512BW (lookup: pshufb improved, unrolled x 2) | 0.425 | 0.441 | 13.57 | ██████████████████████████████████████████████████ |
procedure | description
---|---
scalar | naive lookup-based implementation
improved scalar | also lookup-based code, but detection of input errors is done for four input bytes at once
AVX2 (lookup: pshufb bitmask, pack: multiply-add) | uses the same algorithm as the AVX512BW implementation
AVX512BW (lookup: N/A, pack: multiply-add) | implementation of the lookup and pack algorithms described above
procedure | best [cycles] | avg. [cycles] | speedup | |
---|---|---|---|---|
scalar | 1.023 | 1.091 | 1.00 | ██████▉ |
improved scalar | 1.024 | 1.085 | 1.00 | ██████▉ |
AVX2 (lookup: pshufb bitmask, pack: multiply-add) | 0.222 | 0.225 | 4.61 | ███████████████████████████████▊ |
AVX512BW (lookup: N/A, pack: multiply-add) | 0.141 | 0.144 | 7.26 | ██████████████████████████████████████████████████ |
CPU: Intel(R) Core(TM) i3-8121U CPU @ 2.20GHz
Compiler: gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
procedure | description
---|---
scalar (32 bit) | lookup-based method that loads four input bytes at once
scalar (64 bit) | similar to the above one, but reads 8 input bytes at once
AVX2 (lookup: pshufb improved) | code that uses the shuffle-based algorithm
AVX512BW (lookup: pshufb improved) | exactly the same shuffle-based algorithm
AVX512BW (lookup: pshufb improved, unrolled x 2) | the above procedure unrolled two times
AVX512VBMI | steps 1 & 2: _mm512_permutexvar_epi8, step 3: AVX512BW code, step 4: again _mm512_permutexvar_epi8
AVX512VL | steps 1 & 2: _mm512_permutexvar_epi8, step 3: _mm512_multishift_epi64_epi8, step 4: again _mm512_permutexvar_epi8
procedure | best [cycles] | avg. [cycles] | speedup | |
---|---|---|---|---|
scalar (32 bit) | 6.082 | 6.117 | 1.00 | █▌ |
scalar (64 bit) | 6.096 | 6.134 | 1.00 | █▌ |
AVX2 (lookup: pshufb unrolled improved) | 0.752 | 0.772 | 8.09 | ████████████▉ |
AVX512BW (lookup: pshufb improved) | 0.391 | 0.405 | 15.55 | ████████████████████████▉ |
AVX512BW (lookup: pshufb improved, unrolled x 2) | 0.387 | 0.399 | 15.72 | █████████████████████████▏ |
AVX512VBMI | 0.227 | 0.233 | 26.79 | ██████████████████████████████████████████▉ |
AVX512VL | 0.195 | 0.202 | 31.19 | ██████████████████████████████████████████████████ |
procedure | description
---|---
scalar | naive lookup-based implementation
improved scalar | also lookup-based code, but detection of input errors is done for four input bytes at once
AVX2 (lookup: pshufb bitmask, pack: multiply-add) | uses the same algorithm as the AVX512BW implementation
AVX512BW (lookup: N/A, pack: multiply-add) | implementation of the lookup and pack algorithms described above
AVX512VBMI (lookup: N/A, pack: multiply-add) | lookup is done by _mm512_permutexvar_epi8 and pack by _mm512_permutex2var_epi8
procedure | best [cycles] | avg. [cycles] | speedup | |
---|---|---|---|---|
scalar | 1.030 | 1.044 | 1.00 | ████▏ |
improved scalar | 1.031 | 1.043 | 1.00 | ████▏ |
AVX2 (lookup: pshufb bitmask, pack: multiply-add) | 0.232 | 0.234 | 4.44 | ██████████████████▋ |
AVX512BW (lookup: N/A, pack: multiply-add) | 0.128 | 0.131 | 8.05 | █████████████████████████████████▉ |
AVX512VBMI (lookup: N/A, pack: multiply-add) | 0.087 | 0.089 | 11.84 | ██████████████████████████████████████████████████ |
Only the inner loops are shown; in the case of decoding, error handling is omitted. The code was generated by GCC 7.3.0.
# AVX512BW encoding, inner loop
vpermd     (%rsi,%rax,1),%zmm12,%zmm0
vmovdqa64  %zmm7,%zmm26
add        $0x30,%rax
add        $0x40,%rdx
vpshufb    %zmm11,%zmm0,%zmm0
vpandd     %zmm0,%zmm9,%zmm1
vpsllvw    %zmm6,%zmm0,%zmm0
vpsrlvw    %zmm8,%zmm1,%zmm1
vpternlogd $0xca,%zmm1,%zmm0,%zmm26
vpcmpgtb   %zmm26,%zmm4,%k1
vpsubusb   %zmm5,%zmm26,%zmm1
vmovdqu8   %zmm3,%zmm1{%k1}
vpshufb    %zmm1,%zmm10,%zmm1
vpaddb     %zmm26,%zmm1,%zmm0
vmovdqu64  %zmm0,-0x40(%rdx)
# AVX512VBMI encoding, inner loop
vpermb     (%rdi, %rax, 1), %zmm7, %zmm0
vmovdqa64  %zmm2, %zmm16
add        $0x30, %rax
add        $0x40, %rdx
vpandd     %zmm0, %zmm5, %zmm1
vpsllvw    %zmm3, %zmm0, %zmm0
vpsrlvw    %zmm4, %zmm1, %zmm1
vpternlogd $0xca, %zmm1, %zmm0, %zmm16
vpermb     %zmm6, %zmm16, %zmm0
vmovdqu64  %zmm0, -0x40(%rdx)
# AVX512VL encoding (vpmultishiftqb), inner loop
vpermb         (%rsi, %rax, 1), %zmm3, %zmm0
add            $0x30, %rax
add            $0x40, %rdx
vpmultishiftqb %zmm0, %zmm1, %zmm0
vpermb         %zmm2, %zmm0, %zmm0
vmovdqu64      %zmm0, -0x40(%rdx)
# AVX512BW decoding, inner loop
vmovdqu64  (%rbx,%r14,1),%zmm3
vpsrld     $0x4,%zmm3,%zmm1
vpandd     %zmm4,%zmm1,%zmm1
vpcmpeqb   %zmm10,%zmm3,%k1
vpshufb    %zmm1,%zmm11,%zmm0
vpshufb    %zmm1,%zmm7,%zmm1
vmovdqa64  %zmm0,%zmm2
vpandd     %zmm3,%zmm4,%zmm0
vmovdqu8   %zmm9,%zmm2{%k1}
vpshufb    %zmm0,%zmm8,%zmm0
vptestmb   %zmm1,%zmm0,%k2
kmovq      %k2,%rax
cmp        $0xffffffffffffffff,%rax
je         error_handling
vpaddb     %zmm3,%zmm2,%zmm0
vpmaddubsw %zmm6,%zmm0,%zmm0
vpmaddwd   %zmm5,%zmm0,%zmm0
vpshufb    %zmm13,%zmm0,%zmm0
add        $0x40,%r14
add        $0x30,%rcx
vpermd     %zmm0,%zmm12,%zmm0
vmovdqu64  %zmm0,-0x30(%rcx)
# AVX512VBMI decoding, inner loop
vmovdqu64  (%r14, %r13, 1), %zmm0
vmovdqa64  %zmm5, %zmm1
vpermt2b   %zmm6, %zmm0, %zmm1
vporq      %zmm1, %zmm0, %zmm0
vpmovb2m   %zmm0, %k1
kmovq      %k1, %rax
test       %rax, %rax
jne        error_handling
vpmaddubsw %zmm3, %zmm1, %zmm0
vpmaddwd   %zmm2, %zmm0, %zmm0
vpermb     %zmm0, %zmm4, %zmm0
add        $0x40, %r13
add        $0x30, %rdx
vmovdqu64  %zmm0, -0x30(%rdx)
The repository contains implementations of both the encoding and decoding procedures.