Author: | Wojciech Muła |
---|---|
Added on: | 2016-01-17 |
Last update: | 2017-11-29 (spelling) |
Surprisingly good results of base64 encoding with SIMD instructions prompted me to examine the opposite algorithm, i.e. decoding. Decoding is slightly more complicated, as it has to check the input's validity.
A decoder must also handle the character '=' at the end of the input, but since this is done only once per input, I didn't bother with it in the sample code.
2016-12-18 note: in the initial version of this text I wrongly assumed the order of input words; Alfred Klomp noted that the standard imposes a specific order. Today's change fixes this error.
In the basic scalar version the minimum input quantity is 4 characters, which are transformed into 3 bytes. The scalar version uses a lookup table translating an ASCII code into 6-bit data. For non-base64 characters the lookup table contains a special marker. After decoding four characters and checking for errors, the 6-bit words are merged to form a 24-bit word.
The inner loop of the algorithm looks like this.
// 1. translate input
const uint8_t b0 = lookup[input[i + 0]];
const uint8_t b1 = lookup[input[i + 1]];
const uint8_t b2 = lookup[input[i + 2]];
const uint8_t b3 = lookup[input[i + 3]];

if (b0, b1, b2 or b3 invalid) {
    report error
}

// 2. save output bytes
// input:  [00dddddd|00cccccc|00bbbbbb|00aaaaaa]
// output: [00000000|ccdddddd|bbbbcccc|aaaaaabb]
*out++ = (b1 >> 4) | (b0 << 2);
*out++ = (b2 >> 2) | (b1 << 4);
*out++ = b3 | (b2 << 6);
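The pseudocode above doesn't spell out the validity test. Assuming the lookup table marks invalid characters with the value 0xff (an illustrative convention of mine — valid entries lie in the range 0..63), the check could be written as:

// hypothetical check; assumes lookup[] stores 0xff for invalid characters
if ((b0 | b1 | b2 | b3) & 0x80) {
    // report error
}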
Forming the output is expensive: as the code above shows, it requires five bit-shifts, three bit-ors and three byte stores.
It is possible to get rid of the bit-shifts and use four lookup tables, one for each input byte. Each table contains already-shifted 6-bit data fields, so only bit-ors are required. This comes at the cost of bigger memory requirements: the improved version needs 4 kB (4 * 4 * 256), while the basic one needs only 256 bytes.
The following code shows the idea.
// 1. lookup and build the output word at the same time
const uint32_t dword = lookup32[0][input[i + 0]]
                     | lookup32[1][input[i + 1]]
                     | lookup32[2][input[i + 2]]
                     | lookup32[3][input[i + 3]];

if (dword is invalid) {
    report error
}

// 2. save 32 bits
*reinterpret_cast<uint32_t*>(out) = dword;
out += 3;
Another downside is that the method writes 4 bytes instead of 3.
In an SSE version the lookup table must be encoded as a sequence of instructions. Such a program performs a parallel translation of all input bytes.
The following table describes how to map an input character into 6-bit data (i is the character code), according to the base64 standard.
range | expression | after constant folding |
---|---|---|
A-Z | i - ord('A') | i - 65 |
a-z | i - ord('a') + 26 | i - 71 |
0-9 | i - ord('0') + 52 | i + 4 |
'+' | i - ord('+') + 62 | i + 19 |
'/' | i - ord('/') + 63 | i + 16 |
other | error |
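As an illustration (my own sketch, not taken from the repository), the 256-byte table used by the basic scalar version could be filled directly from the expressions above; 0xff is an assumed marker for invalid characters:

uint8_t lookup[256];

void build_lookup() {
    for (int i = 0; i < 256; i++)
        lookup[i] = 0xff;                                  // special marker: invalid character

    for (int i = 'A'; i <= 'Z'; i++) lookup[i] = i - 65;   // 0 .. 25
    for (int i = 'a'; i <= 'z'; i++) lookup[i] = i - 71;   // 26 .. 51
    for (int i = '0'; i <= '9'; i++) lookup[i] = i + 4;    // 52 .. 61
    lookup['+'] = '+' + 19;                                // 62
    lookup['/'] = '/' + 16;                                // 63
}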
The basic step of the vector version produces a vector of constants (from the third column) for characters matching a given range. After checking all five ranges, the vectors are merged. Since none of the constants is zero, a zero byte in the merged vector means that the input contains an invalid character.
If everything is OK, the last step is to simply add the input vector and the merged one. The resulting vector contains 6-bit data in each byte.
The procedure doing the translation:
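In the listings below packed_byte broadcasts a single byte to all 16 lanes of a vector register; a minimal definition (my assumption, as the helper is not shown in the snippets) could be:

// broadcast the byte x into every lane of an XMM register
#define packed_byte(x) _mm_set1_epi8(char(x))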
__m128i lookup_base(const __m128i input) {

    // shift for range 'A' - 'Z'
    const __m128i ge_A = _mm_cmpgt_epi8(input, packed_byte('A' - 1));
    const __m128i le_Z = _mm_cmplt_epi8(input, packed_byte('Z' + 1));
    const __m128i range_AZ = _mm_and_si128(packed_byte(-65), _mm_and_si128(ge_A, le_Z));

    // shift for range 'a' - 'z'
    const __m128i ge_a = _mm_cmpgt_epi8(input, packed_byte('a' - 1));
    const __m128i le_z = _mm_cmplt_epi8(input, packed_byte('z' + 1));
    const __m128i range_az = _mm_and_si128(packed_byte(-71), _mm_and_si128(ge_a, le_z));

    // shift for range '0' - '9'
    const __m128i ge_0 = _mm_cmpgt_epi8(input, packed_byte('0' - 1));
    const __m128i le_9 = _mm_cmplt_epi8(input, packed_byte('9' + 1));
    const __m128i range_09 = _mm_and_si128(packed_byte(4), _mm_and_si128(ge_0, le_9));

    // shift for character '+'
    const __m128i eq_plus = _mm_cmpeq_epi8(input, packed_byte('+'));
    const __m128i char_plus = _mm_and_si128(packed_byte(19), eq_plus);

    // shift for character '/'
    const __m128i eq_slash = _mm_cmpeq_epi8(input, packed_byte('/'));
    const __m128i char_slash = _mm_and_si128(packed_byte(16), eq_slash);

    // merge partial results
    const __m128i shift = _mm_or_si128(range_AZ,
                          _mm_or_si128(range_az,
                          _mm_or_si128(range_09,
                          _mm_or_si128(char_plus, char_slash))));

    const auto mask = _mm_movemask_epi8(_mm_cmpeq_epi8(shift, packed_byte(0)));
    if (mask) {
        // report an error
    }

    return _mm_add_epi8(input, shift);
}
Number of operations: 9 comparisons, 8 bit-ands, 4 bit-ors, 1 add and 1 movemask.
Number of constants: 9.
The total number of operations is 23. This is 1.4375 instructions per character.
Note: the SSE byte comparisons (less than/greater than) operate on signed values; in this case it is not a problem.
Instead of creating partial shifts (constants range_AZ, range_az and so on) and then merging them with a series of bit-ors, the shift vector could be updated using the byte blend instruction pblendvb (_mm_blendv_epi8).
Below is an updated procedure.
__m128i lookup_byte_blend(const __m128i input) {

    __m128i shift;

    // shift for range 'A' - 'Z'
    const __m128i ge_A = _mm_cmpgt_epi8(input, packed_byte('A' - 1));
    const __m128i le_Z = _mm_cmplt_epi8(input, packed_byte('Z' + 1));
    shift = _mm_and_si128(packed_byte(-65), _mm_and_si128(ge_A, le_Z));

    // shift for range 'a' - 'z'
    const __m128i ge_a = _mm_cmpgt_epi8(input, packed_byte('a' - 1));
    const __m128i le_z = _mm_cmplt_epi8(input, packed_byte('z' + 1));
    shift = _mm_blendv_epi8(shift, packed_byte(-71), _mm_and_si128(ge_a, le_z));

    // shift for range '0' - '9'
    const __m128i ge_0 = _mm_cmpgt_epi8(input, packed_byte('0' - 1));
    const __m128i le_9 = _mm_cmplt_epi8(input, packed_byte('9' + 1));
    shift = _mm_blendv_epi8(shift, packed_byte(4), _mm_and_si128(ge_0, le_9));

    // shift for character '+'
    const __m128i eq_plus = _mm_cmpeq_epi8(input, packed_byte('+'));
    shift = _mm_blendv_epi8(shift, packed_byte(19), eq_plus);

    // shift for character '/'
    const __m128i eq_slash = _mm_cmpeq_epi8(input, packed_byte('/'));
    shift = _mm_blendv_epi8(shift, packed_byte(16), eq_slash);

    const auto mask = _mm_movemask_epi8(_mm_cmpeq_epi8(shift, packed_byte(0)));
    if (mask) {
        // report an error
    }

    return _mm_add_epi8(input, shift);
}
Number of operations: 9 comparisons, 4 bit-ands, 4 byte-blends, 1 add and 1 movemask.
Number of constants: 9.
The total number of operations is 19, four fewer than in the base version. This is 1.1875 instructions per character. However, the instruction pblendvb has both latency and throughput of 2 cycles, and that seems to be a problem. As experiments showed, sometimes using the instruction gives a small improvement, and sometimes it makes things slower.
This section describes another lookup procedure that employs the technique described in a separate article. The lookup procedure uses the pshufb instruction. I like this instruction very much, but experiments showed that this version is slower. My heart sank.
Instead of checking all possible ranges as the other methods do, this algorithm determines the possible range for a byte and the associated shift value. Then it validates the byte against that single range and adds the selected shift.
To select both the range and the shift, the higher nibble of a byte is used. Let's look at the rewritten lookup table:
input [hex] | valid input [ASCII] | valid range | shift |
---|---|---|---|
0x | invalid | | |
1x | invalid | | |
2x | '+' | 0x2b .. 0x2b | 0x3e |
2x | '/' | 0x2f .. 0x2f | 0x3f |
3x | '0' .. '9' | 0x30 .. 0x39 | 0x34 |
4x | 'A' .. 'O' | 0x41 .. 0x4f | 0x00 |
5x | 'P' .. 'Z' | 0x50 .. 0x5a | 0x0f |
6x | 'a' .. 'o' | 0x61 .. 0x6f | 0x1a |
7x | 'p' .. 'z' | 0x70 .. 0x7a | 0x29 |
8x .. Fx | invalid |
The outline of the algorithm (for a single byte):
higher_nibble = byte >> 4;

upper_limit = upper_limit_LUT[higher_nibble];
lower_limit = lower_limit_LUT[higher_nibble];

if (byte < lower_limit || byte > upper_limit) {
    report error
}

decoded = byte + shift_LUT[higher_nibble];
For nibbles in the set {0, 1, 8, 9, a, b, c, d, e, f} the LUTs define invalid ranges (the lower bound is greater than the upper bound), so such bytes always yield errors. For nibbles {3, 4, 5, 6, 7} the ranges are well defined.
The only exception is nibble 2, as there are two distinct values associated with these bytes. This special case is solved by choosing the first, one-element range 0x2b .. 0x2b and adding extra code that handles input equal to 0x2f. The extra code doesn't complicate the algorithm too much:
higher_nibble = byte >> 4;

upper_limit = upper_limit_LUT[higher_nibble];
lower_limit = lower_limit_LUT[higher_nibble];

if ((byte < lower_limit || byte > upper_limit) && byte != 0x2f) {
    report error
}

decoded = byte + shift_LUT[higher_nibble];

if (byte == 0x2f) {
    decoded -= 3; // (0x2b - 0x2f) + (0x3f - 0x3e) = -4 + 1 = -3
}
The SIMD (SSE) code implementing the above algorithm:
__m128i lookup_pshufb(const __m128i input) {

    const __m128i higher_nibble = _mm_srli_epi32(input, 4) & packed_byte(0x0f);

    const char linv = 1;
    const char hinv = 0;

    const __m128i lower_bound_LUT = _mm_setr_epi8(
        /* 0 */ linv, /* 1 */ linv, /* 2 */ 0x2b, /* 3 */ 0x30,
        /* 4 */ 0x41, /* 5 */ 0x50, /* 6 */ 0x61, /* 7 */ 0x70,
        /* 8 */ linv, /* 9 */ linv, /* a */ linv, /* b */ linv,
        /* c */ linv, /* d */ linv, /* e */ linv, /* f */ linv
    );

    const __m128i upper_bound_LUT = _mm_setr_epi8(
        /* 0 */ hinv, /* 1 */ hinv, /* 2 */ 0x2b, /* 3 */ 0x39,
        /* 4 */ 0x4f, /* 5 */ 0x5a, /* 6 */ 0x6f, /* 7 */ 0x7a,
        /* 8 */ hinv, /* 9 */ hinv, /* a */ hinv, /* b */ hinv,
        /* c */ hinv, /* d */ hinv, /* e */ hinv, /* f */ hinv
    );

    // the difference between the shift and the lower bound
    const __m128i shift_LUT = _mm_setr_epi8(
        /* 0 */ 0x00,        /* 1 */ 0x00,        /* 2 */ 0x3e - 0x2b, /* 3 */ 0x34 - 0x30,
        /* 4 */ 0x00 - 0x41, /* 5 */ 0x0f - 0x50, /* 6 */ 0x1a - 0x61, /* 7 */ 0x29 - 0x70,
        /* 8 */ 0x00,        /* 9 */ 0x00,        /* a */ 0x00,        /* b */ 0x00,
        /* c */ 0x00,        /* d */ 0x00,        /* e */ 0x00,        /* f */ 0x00
    );

    const __m128i upper_bound = _mm_shuffle_epi8(upper_bound_LUT, higher_nibble);
    const __m128i lower_bound = _mm_shuffle_epi8(lower_bound_LUT, higher_nibble);

    const __m128i below = _mm_cmplt_epi8(input, lower_bound);
    const __m128i above = _mm_cmpgt_epi8(input, upper_bound);
    const __m128i eq_2f = _mm_cmpeq_epi8(input, packed_byte(0x2f));

    // in_range = not (below or above) or eq_2f
    // outside  = not in_range = (below or above) and not eq_2f (from de Morgan's law)
    const __m128i outside = _mm_andnot_si128(eq_2f, above | below);

    const auto mask = _mm_movemask_epi8(outside);
    if (mask) {
        // report error
    }

    const __m128i shift = _mm_shuffle_epi8(shift_LUT, higher_nibble);
    const __m128i t0 = _mm_add_epi8(input, shift);
    const __m128i result = _mm_add_epi8(t0, _mm_and_si128(eq_2f, packed_byte(-3)));

    return result;
}
Number of operations: 3 comparisons, 3 pshufb, 2 bit-ands, 1 bit-andnot, 1 bit-or, 1 shift, 2 adds and 1 movemask.
Number of constants: 6.
The total number of operations is 14. This is 0.875 instructions per character.
This is a slightly improved version of the previous method, where the range checking is done using bit-masks.
First we group all shift values by the lower nibble; a zero shift means invalid input. Note that each column (except the third one) contains at most two different values: zero or one specific value. The third column needs special care, but it is not difficult. In the table below '.' denotes zero.
lower nibble \ higher nibble | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | a | b | c | d | e | f |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | . | . | . | 4 | . | -65 | . | -71 | . | . | . | . | . | . | . | . |
1 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
2 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
3 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
4 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
5 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
6 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
7 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
8 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
9 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
a | . | . | . | . | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
b | . | . | 19 | . | -65 | . | -71 | . | . | . | . | . | . | . | . | . |
c | . | . | . | . | -65 | . | -71 | . | . | . | . | . | . | . | . | . |
d | . | . | . | . | -65 | . | -71 | . | . | . | . | . | . | . | . | . |
e | . | . | . | . | -65 | . | -71 | . | . | . | . | . | . | . | . | . |
f | . | . | 16 | . | -65 | . | -71 | . | . | . | . | . | . | . | . | . |
Using the higher nibble we can pick a shift from the vector of non-zero values from each column: [0, 0, 19, 4, -65, -65, -71, -71, 0, 0, 0, 0, 0, 0, 0, 0]. The lower nibble selects a bit-mask representing the set of valid higher nibbles. A bit-mask is derived directly from a row of the table: each non-zero value sets a bit in the mask. For example, for lower nibble 0 the valid higher nibbles are 3, 5 and 7, so the mask is 0b10101000. There are five different bit-masks.
We need some extra fix-up code to distinguish between the '+' and '/' characters, i.e. to choose either 16 or 19 for the third column.
The core of the algorithm goes as follows:
1. The higher nibble selects (via pshufb) the shift value; the fix-up code replaces the shift with 16 when the input byte is '/'.
2. The lower nibble selects (via pshufb) the bit-mask of valid higher nibbles.
3. Another pshufb converts the higher nibble into a single-bit value (1 << higher_nibble).
4. If the bit-and of the mask and the bit is zero for any byte, the input is invalid.
5. Finally the shift is added to the input, yielding the decoded values.
The SIMD (SSE) code implementing the above algorithm:
__m128i lookup_pshufb_bitmask(const __m128i input) {

    const __m128i higher_nibble = _mm_srli_epi32(input, 4) & packed_byte(0x0f);
    const __m128i lower_nibble  = input & packed_byte(0x0f);

    const __m128i shiftLUT = _mm_setr_epi8(
        0, 0, 19, 4, -65, -65, -71, -71,
        0, 0,  0, 0,   0,   0,   0,   0);

    const __m128i maskLUT = _mm_setr_epi8(
        /* 0        */ char(0b10101000),
        /* 1 .. 9   */ char(0b11111000), char(0b11111000), char(0b11111000),
                       char(0b11111000), char(0b11111000), char(0b11111000),
                       char(0b11111000), char(0b11111000), char(0b11111000),
        /* 10       */ char(0b11110000),
        /* 11       */ char(0b01010100),
        /* 12 .. 14 */ char(0b01010000), char(0b01010000), char(0b01010000),
        /* 15       */ char(0b01010100)
    );

    const __m128i bitposLUT = _mm_setr_epi8(
        0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, char(0x80),
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
    );

    // shift + fix-up for '/' and '+'
    const __m128i sh    = _mm_shuffle_epi8(shiftLUT, higher_nibble);
    const __m128i eq_2f = _mm_cmpeq_epi8(input, packed_byte(0x2f));
    const __m128i shift = _mm_blendv_epi8(sh, packed_byte(16), eq_2f);

    // load bit-masks
    const __m128i M = _mm_shuffle_epi8(maskLUT, lower_nibble);

    // (1 << higher_nibble[i]) & 0xff
    const __m128i bit = _mm_shuffle_epi8(bitposLUT, higher_nibble);

    const __m128i non_match = _mm_cmpeq_epi8(_mm_and_si128(M, bit), _mm_setzero_si128());
    const auto mask = _mm_movemask_epi8(non_match);
    if (mask) {
        // report error
    }

    const __m128i result = _mm_add_epi8(input, shift);

    return result;
}
Number of operations: 3 pshufb, 2 comparisons, 3 bit-ands, 1 shift, 1 byte-blend, 1 add and 1 movemask.
Number of constants: 6.
The total number of operations is 12. This is 0.750 instructions per character.
operation | base | byte-blend | incremental | pshufb | pshufb-bitmask |
---|---|---|---|---|---|
comparison (le/gt/eq) | 9 | 9 | 9 | 3 | 2 |
bit-and | 8 | 4 | 0 | 2 | 3 |
bit-andnot | 0 | 0 | 0 | 1 | 0 |
bit-or | 4 | 0 | 0 | 1 | 0 |
shift | 0 | 0 | 0 | 1 | 1 |
add | 1 | 1 | 10 | 2 | 1 |
sub | 0 | 0 | 0 | 0 | 0 |
byte-blend | 0 | 4 | 0 | 0 | 1
movemask | 1 | 1 | 1 | 1 | 1 |
pshufb | 0 | 0 | 1 | 3 | 3 |
total instructions | 23 | 19 | 21 | 14 | 12 |
instructions/byte | 1.4375 | 1.1875 | 1.3125 | 0.875 | 0.75 |
constants count | 9 | 9 | 10 | 6 | 6 |
The result of the lookup procedure is a vector of 32-bit words having the following bit layout:
[00dddddd|00cccccc|00bbbbbb|00aaaaaa]
Bits a, b, c and d are data. The expected output is:
[00000000|aaaaaabb|bbbbcccc|ccdddddd]
Once the bits are packed into 24-bit words, all bytes are shuffled in order to build a 12-byte result. Please note that the order of bytes within the 24-bit words is also changed in this step, i.e. bytes 0 and 2 are swapped:
[ccdddddd|bbbbcccc|aaaaaabb]
Fortunately, everything is done by a single pshufb invocation.
The naive variant isolates all 6-bit fields, shifts them to the required positions and merges them into one word.
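A possible shuffle pattern (a sketch of mine, not necessarily the repository's exact constant, assuming packed holds the output of the pack step) gathers bytes 2, 1, 0 of each 32-bit word, which both performs the swap and compacts the 12 result bytes at the beginning of the register:

// packed = [00000000|aaaaaabb|bbbbcccc|ccdddddd] in each 32-bit word (see above)
const __m128i shuf = _mm_setr_epi8(
     2,  1,  0,
     6,  5,  4,
    10,  9,  8,
    14, 13, 12,
    -1, -1, -1, -1          // indices with the MSB set produce zero bytes
);
const __m128i bytes = _mm_shuffle_epi8(packed, shuf);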
#define packed_dword(x) _mm_set1_epi32(x)
#define masked(x, mask) _mm_and_si128(x, _mm_set1_epi32(mask))

const __m128i ca = masked(values, 0x003f003f);
const __m128i db = masked(values, 0x3f003f00);

// t0 = [0000cccc|ccdddddd|0000aaaa|aabbbbbb]
const __m128i t0 = _mm_or_si128(
    _mm_srli_epi32(db, 8),
    _mm_slli_epi32(ca, 6)
);

// t1 = [dddd0000|aaaaaabb|bbbbcccc|ccdddddd]
const __m128i t1 = _mm_or_si128(
    _mm_srli_epi32(t0, 16),
    _mm_slli_epi32(t0, 12)
);

// result = [00000000|aaaaaabb|bbbbcccc|ccdddddd]
const __m128i result = masked(t1, 0x00ffffff);
Number of operations: 3 bit-ands, 2 bit-ors and 4 shifts.
The first merge of adjacent bytes is performed by the instruction pmaddubsw. The instruction vertically multiplies unsigned bytes from one operand by signed bytes from the other, yielding 16-bit signed intermediate values, and then adds the adjacent pairs. The input values are 6-bit, so they are obviously non-negative and the instruction can be safely used.
The instruction pmaddubsw is thus used to perform the shift & bit-or in a single step.
Then the merging of 12-bit fields is done by another multiply-add instruction, pmaddwd. This instruction also operates on signed values, but again the inputs are never negative.
// input: [00dddddd|00cccccc|00bbbbbb|00aaaaaa]
// merge: [0000cccc|ccdddddd|0000aaaa|aabbbbbb]
const __m128i merge_ab_and_bc = _mm_maddubs_epi16(values, packed_dword(0x01400140));

// result: [00000000|aaaaaabb|bbbbcccc|ccdddddd]
return _mm_madd_epi16(merge_ab_and_bc, packed_dword(0x00011000));
Number of operations: 2 (one pmaddubsw and one pmaddwd).
The result is stored in an XMM register, however only the 12 lowest bytes are required. My experiments on Core i5 showed that saving the selected bytes using the dedicated instruction maskmovdqu (intrinsic _mm_maskmoveu_si128) is slower than an ordinary 16-byte write, where the 4 extra bytes are simply overwritten by the next write.
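A sketch of that store strategy (assuming the output buffer has at least 4 bytes of slack after the last chunk):

// write the full 16-byte register; the 4 garbage bytes are overwritten
// by the next iteration (or ignored at the very end)
_mm_storeu_si128(reinterpret_cast<__m128i*>(out), result);
out += 12;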
With the help of the BMI2 instruction pext, making a 24-bit word from the 6-bit words is pretty simple. However, bytes 0 and 2 of each word have to be swapped, which costs extra instructions. Additionally, pext works on general-purpose registers, which requires extra instructions to transfer data from the SIMD registers.
Here is a procedure which processes a 64-bit word:
uint64_t pack_bytes(uint64_t v) {

    const uint64_t p = _pext_u64(v, 0x3f3f3f3f3f3f3f3f);

    const uint64_t b0 = p & 0x0000ff0000ff;
    const uint64_t b1 = p & 0x00ff0000ff00;
    const uint64_t b2 = p & 0xff0000ff0000;

    return (b0 << 16) | b1 | (b2 >> 16);
}
It costs 1 pext, 3 bit-ands, 2 shifts and 2 bit-ors, not counting the transfers from the SIMD registers.
An AVX2 version of the lookup procedure is nearly the same as the SSE version. The only minor difference is caused by the lack of the "less than" comparison, thus all range checks have to use the "greater than" relation.
The output from the AVX2 lookup can be processed either with the standard SIMD operations or with BMI2 instructions.
The sample code is available on GitHub; there are three programs.
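For example, a range check written only with the "greater than" relation might look as follows (my illustration, not the article's exact code):

// byte in ['A' .. 'Z']  <=>  (byte > 'A' - 1) and not (byte > 'Z')
const __m256i gt_A  = _mm256_cmpgt_epi8(input, _mm256_set1_epi8('A' - 1));
const __m256i gt_Z  = _mm256_cmpgt_epi8(input, _mm256_set1_epi8('Z'));
const __m256i in_AZ = _mm256_andnot_si256(gt_Z, gt_A);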
The CPU architecture: Westmere i5 M540 @ 2.53GHz
GCC: 6.2.0 (Debian)
procedure | lookup | pack | time [s] | speedup | |
---|---|---|---|---|---|
improved scalar | N/A | N/A | 0.04722 | 1.00 | ██████████ |
scalar | N/A | N/A | 0.06875 | 0.69 | ██████▊ |
SSE | base | naive | 0.02892 | 1.63 | ████████████████▎ |
SSE | byte blend | naive | 0.02970 | 1.59 | ███████████████▉ |
SSE | incremental | naive | 0.03249 | 1.45 | ██████████████▌ |
SSE | pshufb | naive | 0.02753 | 1.72 | █████████████████▏ |
SSE | base | multiply-add | 0.02554 | 1.85 | ██████████████████▍ |
SSE | byte blend | multiply-add | 0.02511 | 1.88 | ██████████████████▊ |
SSE | incremental | multiply-add | 0.02862 | 1.65 | ████████████████▍ |
SSE | pshufb | multiply-add | 0.02170 | 2.18 | █████████████████████▊ |
SSE | pshufb bitmask | multiply-add | 0.02085 | 2.26 | ██████████████████████▋ |
Conclusions:
The CPU architecture: Haswell i7-4770 CPU @ 3.40GHz.
GCC: 5.4.1 (Ubuntu)
procedure | lookup | pack | time [s] | speedup | |
---|---|---|---|---|---|
improved scalar | N/A | N/A | 0.02281 | 1.00 | ██████████ |
scalar | N/A | N/A | 0.04965 | 0.46 | ████▌ |
scalar & BMI2 | N/A | N/A | 0.04485 | 0.51 | █████ |
SSE | base | naive | 0.01683 | 1.36 | █████████████▌ |
SSE | byte blend | naive | 0.01783 | 1.28 | ████████████▊ |
SSE | incremental | naive | 0.01790 | 1.27 | ████████████▋ |
SSE | pshufb | naive | 0.01286 | 1.77 | █████████████████▋ |
SSE | base | multiply-add | 0.01383 | 1.65 | ████████████████▍ |
SSE | byte blend | multiply-add | 0.01365 | 1.67 | ████████████████▋ |
SSE | incremental | multiply-add | 0.01500 | 1.52 | ███████████████▏ |
SSE | pshufb | multiply-add | 0.00967 | 2.36 | ███████████████████████▌ |
SSE | pshufb bitmask | multiply-add | 0.00927 | 2.46 | ████████████████████████▌ |
SSE & BMI2 | base | N/A | 0.02184 | 1.04 | ██████████▍ |
SSE & BMI2 | byte blend | N/A | 0.02288 | 1.00 | █████████▉ |
SSE & BMI2 | incremental | N/A | 0.02190 | 1.04 | ██████████▍ |
AVX2 | base | naive | 0.01108 | 2.06 | ████████████████████▌ |
AVX2 | byte blend | naive | 0.01229 | 1.86 | ██████████████████▌ |
AVX2 | pshufb | naive | 0.00897 | 2.54 | █████████████████████████▍ |
AVX2 | base | multiply-add | 0.00973 | 2.34 | ███████████████████████▍ |
AVX2 | byte blend | multiply-add | 0.00977 | 2.33 | ███████████████████████▎ |
AVX2 | pshufb | multiply-add | 0.00849 | 2.69 | ██████████████████████████▊ |
AVX2 | pshufb bitmask | multiply-add | 0.00848 | 2.69 | ██████████████████████████▉ |
AVX2 & BMI2 | base | N/A | 0.02236 | 1.02 | ██████████▏ |
AVX2 & BMI2 | byte blend | N/A | 0.02371 | 0.96 | █████████▌ |
Conclusions:
The CPU architecture: Skylake i7-6700 CPU @ 3.40GHz
GCC: 5.4.1 (Ubuntu)
procedure | lookup | pack | time [s] | speedup | |
---|---|---|---|---|---|
improved scalar | N/A | N/A | 0.02084 | 1.00 | ██████████ |
scalar | N/A | N/A | 0.04980 | 0.42 | ████▏ |
scalar & BMI2 | N/A | N/A | 0.04494 | 0.46 | ████▋ |
SSE | base | naive | 0.01502 | 1.39 | █████████████▊ |
SSE | byte blend | naive | 0.01645 | 1.27 | ████████████▋ |
SSE | incremental | naive | 0.01684 | 1.24 | ████████████▍ |
SSE | pshufb | naive | 0.01229 | 1.70 | ████████████████▉ |
SSE | base | multiply-add | 0.01216 | 1.71 | █████████████████▏ |
SSE | byte blend | multiply-add | 0.01405 | 1.48 | ██████████████▊ |
SSE | incremental | multiply-add | 0.01198 | 1.74 | █████████████████▍ |
SSE | pshufb | multiply-add | 0.00888 | 2.35 | ███████████████████████▍ |
SSE | pshufb bitmask | multiply-add | 0.00847 | 2.46 | ████████████████████████▌ |
SSE & BMI2 | base | N/A | 0.02090 | 1.00 | █████████▉ |
SSE & BMI2 | byte blend | N/A | 0.02097 | 0.99 | █████████▉ |
SSE & BMI2 | incremental | N/A | 0.02098 | 0.99 | █████████▉ |
AVX2 | base | naive | 0.01006 | 2.07 | ████████████████████▋ |
AVX2 | byte blend | naive | 0.00998 | 2.09 | ████████████████████▉ |
AVX2 | pshufb | naive | 0.00816 | 2.55 | █████████████████████████▌ |
AVX2 | base | multiply-add | 0.00852 | 2.45 | ████████████████████████▍ |
AVX2 | byte blend | multiply-add | 0.00842 | 2.48 | ████████████████████████▊ |
AVX2 | pshufb | multiply-add | 0.00720 | 2.89 | ████████████████████████████▉ |
AVX2 | pshufb bitmask | multiply-add | 0.00705 | 2.96 | █████████████████████████████▌ |
AVX2 & BMI2 | base | N/A | 0.01994 | 1.05 | ██████████▍ |
AVX2 & BMI2 | byte blend | N/A | 0.01982 | 1.05 | ██████████▌ |
The CPU architecture: Bulldozer FX-8150 CPU
GCC: 4.8.4 (Ubuntu)
procedure | lookup | pack | time [s] | speedup | |
---|---|---|---|---|---|
improved scalar | N/A | N/A | 0.03223 | 1.00 | ██████████ |
scalar | N/A | N/A | 0.06352 | 0.51 | █████ |
SSE | base | naive | 0.02387 | 1.35 | █████████████▌ |
SSE | byte blend | naive | 0.02316 | 1.39 | █████████████▉ |
SSE | incremental | naive | 0.02626 | 1.23 | ████████████▎ |
SSE | pshufb | naive | 0.02015 | 1.60 | ███████████████▉ |
SSE | base | multiply-add | 0.02107 | 1.53 | ███████████████▎ |
SSE | byte blend | multiply-add | 0.01971 | 1.64 | ████████████████▎ |
SSE | incremental | multiply-add | 0.02215 | 1.46 | ██████████████▌ |
SSE | pshufb | multiply-add | 0.01660 | 1.94 | ███████████████████▍ |
SSE | pshufb bitmask | multiply-add | 0.01656 | 1.95 | ███████████████████▍ |
The AVX2 version wouldn't be possible without help from Nathan Kurz and Daniel Lemire, who kindly gave me access to test machines with brand new CPUs (Haswell and Skylake, and also an AMD Bulldozer).