Base64 decoding with SIMD instructions

Author: Wojciech Muła
Added on: 2016-01-17
Updates: 2016-03-30 (new lookup algorithm, updated results), 2016-03-19 (new packing methods, updated results), 2016-03-13 (two new lookup algorithms, updated results), 2016-03-10 (AVX2 notes, Skylake results), 2016-01-25 (minor fixes thanks to HN comments), 2016-01-18 (Core i7 results)

Introduction

The surprisingly good results of base64 encoding with SIMD instructions encouraged me to examine the opposite algorithm, i.e. decoding. Decoding is slightly more complicated, as it has to check the input's validity.

A decoder must also handle the character '=' at the end of the input, but since this happens only once per input, I didn't bother with it in the sample code.

Base algorithm

In the basic scalar version the minimum input quantity is 4 characters, which are transformed into 3 bytes. The scalar version uses a lookup table translating from an ASCII code to 6-bit data. For non-base64 characters the lookup table contains a special marker. After decoding 4 characters and checking for errors, the 6-bit words are merged to form a 24-bit word.

The inner loop of the algorithm looks like this:

// 1. translate input
const uint8_t b0 = lookup[input[i + 0]];
const uint8_t b1 = lookup[input[i + 1]];
const uint8_t b2 = lookup[input[i + 2]];
const uint8_t b3 = lookup[input[i + 3]];

if (b0, b1, b2 or b3 invalid) {
    report error
}

// 2. save output bytes
*out++ = b0 | (b1 << 6);
*out++ = (b1 >> 2) | (b2 << 4);
*out++ = (b2 >> 4) | (b3 << 2);

Forming the output is expensive, it requires:

  • 5 bit-shifts,
  • 3 bit-ors.

Improved scalar version

It is possible to get rid of the bit-shifts by using four lookup tables, one for each input byte. Each table contains already-shifted 6-bit data fields, so only bit-ors are required. This comes at the cost of bigger memory requirements: the improved version needs 4 kB (4 tables * 256 entries * 4 bytes), while the basic one needs only 256 bytes.

Following code shows the idea:

// 1. lookup and build output word at the same time
const uint32_t dword = lookup32[0][input[i + 0]]
                     | lookup32[1][input[i + 1]]
                     | lookup32[2][input[i + 2]]
                     | lookup32[3][input[i + 3]];

// 2. save 32 bits
*reinterpret_cast<uint32_t*>(out) = dword;
out += 3;

Another drawback is that the method writes 4 bytes instead of 3, so the output buffer must allow one byte of slack (the extra byte is overwritten by the next write).
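
The article does not show how the four tables are built. Below is a minimal sketch of how they could be initialized, assuming the dword layout implied by the scalar code (b0 | b1 << 6 | b2 << 12 | b3 << 18); the name lookup32 and the invalid marker are illustrative, and error handling is omitted as in the snippet above:

#include <cstdint>

uint32_t lookup32[4][256];

void init_lookup32() {
    // 6-bit value -> ASCII character, i.e. the base64 alphabet
    static const char base64_chars[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // hypothetical marker for non-base64 characters (not used by the snippet above)
    const uint32_t invalid = 0xffffffff;

    for (int k = 0; k < 4; k++)
        for (int ch = 0; ch < 256; ch++)
            lookup32[k][ch] = invalid;

    for (uint32_t value = 0; value < 64; value++) {
        const uint8_t ch = static_cast<uint8_t>(base64_chars[value]);
        for (int k = 0; k < 4; k++)
            lookup32[k][ch] = value << (6 * k);   // 6-bit field already shifted into place
    }
}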

SSE version

Vector lookup (base)

In an SSE version the lookup table must be expressed as a sequence of instructions. Such a program performs a parallel translation of all input bytes.

The following table describes how to map an input character to 6-bit data (i is the character code), according to base64.

range    expression           after constant folding
A-Z      i - ord('A')         i - 65
a-z      i - ord('a') + 26    i - 71
0-9      i - ord('0') + 52    i + 4
'+'      i - ord('+') + 62    i + 19
'/'      i - ord('/') + 63    i + 16
other    error

The basic step of the vector version is producing a vector of constants (from the third column) for the characters matching a given range. After checking all five ranges, the vectors are merged. Since none of the constants is zero, any zero byte in the merged vector means that the input contains an invalid character.

If everything is OK, the last step is simply adding the input vector and the merged one. The result vector contains 6-bit data in each byte.

Procedure doing the translation:

__m128i lookup_base(const __m128i input) {

    // shift for range 'A' - 'Z'
    const __m128i ge_A = _mm_cmpgt_epi8(input, packed_byte('A' - 1));
    const __m128i le_Z = _mm_cmplt_epi8(input, packed_byte('Z' + 1));
    const __m128i range_AZ = _mm_and_si128(packed_byte(-65), _mm_and_si128(ge_A, le_Z));

    // shift for range 'a' - 'z'
    const __m128i ge_a = _mm_cmpgt_epi8(input, packed_byte('a' - 1));
    const __m128i le_z = _mm_cmplt_epi8(input, packed_byte('z' + 1));
    const __m128i range_az = _mm_and_si128(packed_byte(-71), _mm_and_si128(ge_a, le_z));

    // shift for range '0' - '9'
    const __m128i ge_0 = _mm_cmpgt_epi8(input, packed_byte('0' - 1));
    const __m128i le_9 = _mm_cmplt_epi8(input, packed_byte('9' + 1));
    const __m128i range_09 = _mm_and_si128(packed_byte(4), _mm_and_si128(ge_0, le_9));

    // shift for character '+'
    const __m128i eq_plus = _mm_cmpeq_epi8(input, packed_byte('+'));
    const __m128i char_plus = _mm_and_si128(packed_byte(19), eq_plus);

    // shift for character '/'
    const __m128i eq_slash = _mm_cmpeq_epi8(input, packed_byte('/'));
    const __m128i char_slash = _mm_and_si128(packed_byte(16), eq_slash);

    // merge partial results

    const __m128i shift = _mm_or_si128(range_AZ,
                          _mm_or_si128(range_az,
                          _mm_or_si128(range_09,
                          _mm_or_si128(char_plus, char_slash))));

    const auto mask = _mm_movemask_epi8(_mm_cmpeq_epi8(shift, packed_byte(0)));
    if (mask) {
        // report an error
    }

    return _mm_add_epi8(input, shift);
}

Number of operations:

  • comparison (le/gt/eq): 9,
  • bit-and: 8,
  • bit-or: 4,
  • add: 1,
  • movemask: 1.

Number of constants: 9.

Total number of operations is 23. This is 1.4375 instructions per character.

Note: the SSE compares for less/greater work on signed values; in this case it is not a problem, because bytes 0x80 .. 0xff are treated as negative, never match any range and are thus reported as errors.

Vector lookup (byte blend)

Instead of creating partial shifts (the constants range_AZ, range_az and so on) and then merging them with a series of bit-ors, the shift vector can be updated using the byte blend instruction pblendvb (_mm_blendv_epi8).

Procedure:

__m128i lookup_byte_blend(const __m128i input) {

    __m128i shift;

    // shift for range 'A' - 'Z'
    const __m128i ge_A = _mm_cmpgt_epi8(input, packed_byte('A' - 1));
    const __m128i le_Z = _mm_cmplt_epi8(input, packed_byte('Z' + 1));
    shift = _mm_and_si128(packed_byte(-65), _mm_and_si128(ge_A, le_Z));

    // shift for range 'a' - 'z'
    const __m128i ge_a = _mm_cmpgt_epi8(input, packed_byte('a' - 1));
    const __m128i le_z = _mm_cmplt_epi8(input, packed_byte('z' + 1));
    shift = _mm_blendv_epi8(shift, packed_byte(-71), _mm_and_si128(ge_a, le_z));

    // shift for range '0' - '9'
    const __m128i ge_0 = _mm_cmpgt_epi8(input, packed_byte('0' - 1));
    const __m128i le_9 = _mm_cmplt_epi8(input, packed_byte('9' + 1));
    shift = _mm_blendv_epi8(shift, packed_byte(4), _mm_and_si128(ge_0, le_9));

    // shift for character '+'
    const __m128i eq_plus = _mm_cmpeq_epi8(input, packed_byte('+'));
    shift = _mm_blendv_epi8(shift, packed_byte(19), eq_plus);

    // shift for character '/'
    const __m128i eq_slash = _mm_cmpeq_epi8(input, packed_byte('/'));
    shift = _mm_blendv_epi8(shift, packed_byte(16), eq_slash);

    const auto mask = _mm_movemask_epi8(_mm_cmpeq_epi8(shift, packed_byte(0)));
    if (mask) {
        // report an error
    }

    return _mm_add_epi8(input, shift);
}

Number of operations:

  • comparison (le/gt/eq): 9,
  • bit-and: 4,
  • bit-or: 0,
  • byte-blend: 4,
  • add: 1,
  • movemask: 1.

Number of constants: 9.

Total number of operations is 19, four less than the base version. This is 1.1875 instructions per character. However, the instruction pblendvb has both latency and throughput equal to 2 cycles, which seems to be a problem. Experiments have shown that sometimes the instruction gives a small improvement, but sometimes it makes things slower.

Vector lookup (incremental)

A lookup procedure which employs the technique described in a separate article. The lookup procedure uses the pshufb instruction. I like this instruction very much, but experiments have shown that this version is slower. My heart sank.

Vector lookup (pshufb) new

Instead of checking all possible ranges as the other methods do, this algorithm determines the single possible range for a byte and the proper shift value. Then it validates the byte against that single range and adds the selected shift.

To select both the range and the shift, the higher nibble of the byte is used. Let's look at the rewritten lookup table:

input [hex]   valid input [ASCII]   valid range     shift
0x            invalid
1x            invalid
2x            '+'                   0x2b .. 0x2b    0x3e
              '/'                   0x2f .. 0x2f    0x3f
3x            '0' .. '9'            0x30 .. 0x39    0x34
4x            'A' .. 'O'            0x41 .. 0x4f    0x00
5x            'P' .. 'Z'            0x50 .. 0x5a    0x0f
6x            'a' .. 'o'            0x61 .. 0x6f    0x1a
7x            'p' .. 'z'            0x70 .. 0x7a    0x29
8x .. Fx      invalid

The outline of the algorithm (for a single byte):

higher_nibble = byte >> 4;

upper_limit = upper_limit_LUT[higher_nibble];
lower_limit = lower_limit_LUT[higher_nibble];

if (byte < lower_limit || byte > upper_limit) {
    report error
}

decoded = byte - lower_limit + shift_LUT[higher_nibble];

For nibbles in the set {0, 1, 8, 9, a, b, c, d, e, f} the LUTs define invalid ranges, always yielding errors. For nibbles {3, 4, 5, 6, 7} the ranges are well defined.

The only exception is nibble 2, as there are two distinct values associated with these bytes. This special case is solved by choosing the first, 1-element range 0x2b .. 0x2b and introducing additional code handling input equal to 0x2f. The extra code doesn't complicate the algorithm too much:

higher_nibble = byte >> 4;

upper_limit = upper_limit_LUT[higher_nibble];
lower_limit = lower_limit_LUT[higher_nibble];

if ((byte < lower_limit || byte > upper_limit) && byte != 0x2f) {
    report error
}

decoded = byte - lower_limit + shift_LUT[higher_nibble];

if (byte == 0x2f) {
    decoded -= 3; // (0x2b - 0x2f) + (0x3f - 0x3e) = -4 + 1 = -3
}

The SIMD (SSE) code implementing the above algorithm:

__m128i lookup_pshufb(const __m128i input) {

    const __m128i higher_nibble = _mm_and_si128(_mm_srli_epi32(input, 4), packed_byte(0x0f));
    const char linv = 1;
    const char hinv = 0;

    const __m128i lower_bound_LUT = _mm_setr_epi8(
        /* 0 */ linv, /* 1 */ linv, /* 2 */ 0x2b, /* 3 */ 0x30,
        /* 4 */ 0x41, /* 5 */ 0x50, /* 6 */ 0x61, /* 7 */ 0x70,
        /* 8 */ linv, /* 9 */ linv, /* a */ linv, /* b */ linv,
        /* c */ linv, /* d */ linv, /* e */ linv, /* f */ linv
    );

    const __m128i upper_bound_LUT = _mm_setr_epi8(
        /* 0 */ hinv, /* 1 */ hinv, /* 2 */ 0x2b, /* 3 */ 0x39,
        /* 4 */ 0x4f, /* 5 */ 0x5a, /* 6 */ 0x6f, /* 7 */ 0x7a,
        /* 8 */ hinv, /* 9 */ hinv, /* a */ hinv, /* b */ hinv,
        /* c */ hinv, /* d */ hinv, /* e */ hinv, /* f */ hinv
    );

    const __m128i shift_LUT = _mm_setr_epi8(
        /* 0 */ 0x00, /* 1 */ 0x00, /* 2 */ 0x3e, /* 3 */ 0x34,
        /* 4 */ 0x00, /* 5 */ 0x0f, /* 6 */ 0x1a, /* 7 */ 0x29,
        /* 8 */ 0x00, /* 9 */ 0x00, /* a */ 0x00, /* b */ 0x00,
        /* c */ 0x00, /* d */ 0x00, /* e */ 0x00, /* f */ 0x00
    );

    const __m128i upper_bound = _mm_shuffle_epi8(upper_bound_LUT, higher_nibble);
    const __m128i lower_bound = _mm_shuffle_epi8(lower_bound_LUT, higher_nibble);

    const __m128i below = _mm_cmplt_epi8(input, lower_bound);
    const __m128i above = _mm_cmpgt_epi8(input, upper_bound);
    const __m128i eq_2f = _mm_cmpeq_epi8(input, packed_byte(0x2f));

    // in_range = not (below or above) or eq_2f
    // outside  = not in_range = below or above and not eq_2f (from deMorgan law)
    const __m128i outside = _mm_andnot_si128(eq_2f, _mm_or_si128(above, below));

    const auto mask = _mm_movemask_epi8(outside);
    if (mask) {
        // report an error
    }

    __m128i shift  = _mm_shuffle_epi8(shift_LUT, higher_nibble);
    __m128i result = _mm_sub_epi8(input, lower_bound);
    result = _mm_add_epi8(result, shift);
    result = _mm_add_epi8(result, _mm_and_si128(eq_2f, packed_byte(-3)));

    return result;
}

Number of operations:

  • comparison (le/gt/eq): 3,
  • shift: 1,
  • bit-and: 2,
  • bit-andnot: 1,
  • bit-or: 1,
  • add: 2,
  • sub: 1,
  • movemask: 1
  • pshufb: 3.

Number of constants: 6.

Total number of operations is 15. This is 0.9375 instructions per character.

Comparison of vector lookups new

operation               algorithm
                        base      byte-blend  incremental  pshufb
comparison (le/gt/eq)   9         9           9            3
bit-and                 8         4           0            2
bit-andnot              0         0           0            1
bit-or                  4         0           0            1
shift                   0         0           0            1
add                     1         1           10           2
sub                     0         0           0            1
byte-blend              0         4           0            0
movemask                1         1           1            1
pshufb                  0         0           1            3
total instructions      23        19          21           15
instructions/byte       1.4375    1.1875      1.3125       0.9375
constants count         9         9           10           6

Gathering data

The result of the lookup procedure is a vector of 32-bit words, each having the following bit layout:

[00dddddd|00cccccc|00bbbbbb|00aaaaaa]

Bits a, b, c and d are data. The expected output is:

[00000000|ddddddcc|ccccbbbb|bbaaaaaa]

Once the bits are packed into 24-bit words within each dword, the bytes are shuffled in order to build a contiguous 12-byte result. This is done by a single pshufb invocation.
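
A minimal sketch of that shuffle, assuming the little-endian dword layout shown above, where the three packed bytes occupy positions 0 .. 2 of each dword; packed stands for the output of one of the pack variants below, and the actual shuffle constant in the repository code may differ:

const __m128i shuf = _mm_setr_epi8( 0,  1,  2,
                                    4,  5,  6,
                                    8,  9, 10,
                                   12, 13, 14,
                                   -1, -1, -1, -1);   // top 4 bytes are zeroed (and later overwritten)

// packed: four dwords [00000000|ddddddcc|ccccbbbb|bbaaaaaa]
const __m128i bytes = _mm_shuffle_epi8(packed, shuf); // 12 meaningful bytes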

Pack — naive variant

The naive variant isolates all 6-bit fields, shifts them to the required positions and merges them into one word:

const __m128i bits_a = _mm_and_si128(values, packed_dword(0x0000003f));
const __m128i bits_b = _mm_srli_epi32(_mm_and_si128(values, packed_dword(0x00003f00)), 2);
const __m128i bits_c = _mm_srli_epi32(_mm_and_si128(values, packed_dword(0x003f0000)), 4);
const __m128i bits_d = _mm_srli_epi32(_mm_and_si128(values, packed_dword(0x3f000000)), 6);

result = _mm_or_si128(bits_a, _mm_or_si128(bits_b, _mm_or_si128(bits_c, bits_d)));

Number of operations:

  • 4 bit-and,
  • 3 bit-shift,
  • 3 bit-or.

Pack — improved variant

In the improved version two adjacent bytes are merged first, forming 12-bit words in the 16-bit halves of each dword. Then these 12-bit words are merged together:

// bits_ac: packed_dword([00000000|00cccccc|00000000|00aaaaaa] x 4)
// bits_bd: packed_dword([00dddddd|00000000|00bbbbbb|00000000] x 4)
const __m128i bits_ac = _mm_and_si128(values, packed_dword(0x003f003f));
const __m128i bits_bd = _mm_and_si128(values, packed_dword(0x3f003f00));

// tmp:     packed_dword([0000dddd|ddcccccc|0000bbbb|bbaaaaaa] x 4)
const __m128i tmp     = _mm_or_si128(bits_ac, _mm_srli_epi32(bits_bd, 2));

// bits_cd: packed_dword([0000dddd|ddcccccc|00000000|00000000] x 4)
// bits_ab: packed_dword([00000000|00000000|0000bbbb|bbaaaaaa] x 4)
const __m128i bits_cd = _mm_and_si128(tmp, packed_dword(0x0fff0000));
const __m128i bits_ab = _mm_and_si128(tmp, packed_dword(0x00000fff));

//          packed_dword([00000000|ddddddcc|ccccbbbb|bbaaaaaa] x 4)
result = _mm_or_si128(bits_ab, _mm_srli_epi32(bits_cd, 4));

Number of operations:

  • 4 bit-and,
  • 2 bit-shift (1 less),
  • 2 bit-or (1 less).

Pack — multiply-add + improved variant

This is a slightly modified improved variant. The first merge of adjacent bytes is performed by the instruction pmaddubsw. The instruction vertically multiplies unsigned bytes from one operand by signed bytes from the other, yielding signed 16-bit intermediate values, and then adds adjacent pairs of these values. The input fields are 6-bit, so they are non-negative and the instruction can be safely used.

The instruction pmaddubsw thus performs the shift & bit-or in a single step; merging the 12-bit words is done by regular bit operations.

//          packed_dword([0000dddd|ddcccccc|0000bbbb|bbaaaaaa] x 4)
const __m128i merge_ab_and_bc = _mm_maddubs_epi16(values, packed_dword(0x40014001));

// bits_cd: packed_dword([0000dddd|ddcccccc|00000000|00000000] x 4)
// bits_ab: packed_dword([00000000|00000000|0000bbbb|bbaaaaaa] x 4)
const __m128i bits_cd = _mm_and_si128(merge_ab_and_bc, packed_dword(0x0fff0000));
const __m128i bits_ab = _mm_and_si128(merge_ab_and_bc, packed_dword(0x00000fff));

// merged:  packed_dword([00000000|ddddddcc|ccccbbbb|bbaaaaaa] x 4)
result = _mm_or_si128(bits_ab, _mm_srli_epi32(bits_cd, 4));

Number of operations:

  • 2 bit-and,
  • 1 bit-shift,
  • 1 bit-or,
  • 1 multiply-add.

Pack — multiply-add variant

All field merging is done using multiply-add instructions. First, the bytes are merged using pmaddubsw as in the above variant. Then the 12-bit words are merged using a single pmaddwd.

Again, pmaddwd operates on signed values, but the inputs are never negative.

const __m128i merge_ab_and_bc = _mm_maddubs_epi16(values, packed_dword(0x40014001));

return _mm_madd_epi16(merge_ab_and_bc, packed_dword(0x10000001));

Number of operations:

  • 2 multiply-add.

Saving result (SSE variant)

The result is an XMM register, however only the 12 lowest bytes are required. My experiments on Core i5 have shown that saving the selected bytes using the dedicated instruction maskmovdqu (intrinsic _mm_maskmoveu_si128) is slower than an ordinary 16-byte write; the 4 extra bytes are simply overwritten by the next store.
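
A sketch of that plain-store approach (variable names are illustrative, not the repository code):

// bytes: the shuffled result, 12 valid bytes in the low positions
_mm_storeu_si128(reinterpret_cast<__m128i*>(out), bytes);
out += 12;   // the 4 extra bytes will be overwritten by the next store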

BMI2

With the help of the BMI2 instruction pext, this can be simplified to (see the sketch after the list):

  • two extractions of the register's 64-bit halves,
  • two invocations of pext,
  • two writes.
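
A minimal sketch of these steps, assuming the dword layout used throughout the article; the helper names are illustrative and this is not the repository's code:

#include <immintrin.h>
#include <cstdint>
#include <cstring>

// pack eight translated bytes (each holding a 6-bit field) into 6 output bytes
inline void pack_qword_bmi2(uint64_t qword, uint8_t* out) {
    // gather the eight 6-bit fields into the lowest 48 bits
    const uint64_t packed = _pext_u64(qword, 0x3f3f3f3f3f3f3f3full);

    // write 8 bytes; the 2 extra bytes are overwritten by the next call
    std::memcpy(out, &packed, 8);
}

inline void save_bmi2(const __m128i translated, uint8_t* out) {
    pack_qword_bmi2(static_cast<uint64_t>(_mm_cvtsi128_si64(translated)),    out + 0);
    pack_qword_bmi2(static_cast<uint64_t>(_mm_extract_epi64(translated, 1)), out + 6);
}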

AVX2 version

An AVX2 version of the lookup procedure is nearly the same as the SSE version. The only minor difference is that there is no "less than" byte comparison intrinsic for AVX2, so all range checks have to be expressed with the "greater than" relation.
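
For example, the 'A' .. 'Z' check from the SSE lookup could be written as in the sketch below, where "x < K" becomes "K > x" with swapped operands; packed_byte corresponds to _mm256_set1_epi8 here:

// shift for range 'A' - 'Z' with AVX2: only cmpgt is available
const __m256i ge_A = _mm256_cmpgt_epi8(input, _mm256_set1_epi8('A' - 1));
const __m256i le_Z = _mm256_cmpgt_epi8(_mm256_set1_epi8('Z' + 1), input);
const __m256i range_AZ = _mm256_and_si256(_mm256_set1_epi8(-65),
                         _mm256_and_si256(ge_A, le_Z));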

An output from the AVX2 lookup could be either processed using standard SIMD operations or BMI2 instructions.

Sample code

The sample code is available on github; there are three programs:

Experiments

The program speed was run 10 times for each configuration and the minimum times were noted; the GCC version used on each machine is listed below.

Core i5 results (Westmere) update

The CPU architecture: Westmere i5 M540 @ 2.53GHz

GCC: 4.9.2 (Debian)

procedure lookup pack time [s] speedup  
improved scalar N/A N/A 0.04741 1.00 ██████████
scalar N/A N/A 0.06894 0.69 ██████▉
SSE base naive 0.03389 1.40 █████████████▉
SSE byte blend naive 0.03026 1.57 ███████████████▋
SSE incremental naive 0.03407 1.39 █████████████▉
SSE pshufb naive 0.02777 1.71 █████████████████
SSE base improved 0.03189 1.49 ██████████████▊
SSE byte blend improved 0.02793 1.70 ████████████████▉
SSE incremental improved 0.03268 1.45 ██████████████▌
SSE pshufb improved 0.02576 1.84 ██████████████████▍
SSE base improved + mult-add 0.03002 1.58 ███████████████▊
SSE byte blend improved + mult-add 0.02709 1.75 █████████████████▌
SSE incremental improved + mult-add 0.03031 1.56 ███████████████▋
SSE pshufb improved + mult-add 0.02390 1.98 ███████████████████▊
SSE base multiply-add 0.02848 1.66 ████████████████▋
SSE byte blend multiply-add 0.02593 1.83 ██████████████████▎
SSE incremental multiply-add 0.02879 1.65 ████████████████▍
SSE pshufb multiply-add 0.02229 2.13 █████████████████████▎

Conclusions:

  • Eliminating 5 shifts from the scalar version boosted the code by a factor of 1.45. This is impressive.
  • Using the byte-blend instruction gives a significant boost.
  • The packing algorithm using two multiply-add instructions is, surprisingly, the best, even though multiply-add instructions have pretty high latencies of 4-5 cycles.

Core i7 results (Haswell)

The CPU architecture: Haswell i7-4770 CPU @ 3.40GHz.

GCC: 5.3.0 (Ubuntu)

procedure lookup pack time [s] speedup  
improved scalar N/A N/A 0.02268 1.00 ██████████
scalar N/A N/A 0.04949 0.46 ████▌
scalar & BMI2 N/A N/A 0.04463 0.51 █████
SSE base naive 0.01882 1.21 ████████████
SSE byte blend naive 0.01846 1.23 ████████████▎
SSE incremental naive 0.01734 1.31 █████████████
SSE pshufb naive 0.01396 1.62 ████████████████▏
SSE base improved 0.01604 1.41 ██████████████▏
SSE byte blend improved 0.01665 1.36 █████████████▌
SSE incremental improved 0.01651 1.37 █████████████▋
SSE pshufb improved 0.01287 1.76 █████████████████▌
SSE base improved + mult-add 0.01546 1.47 ██████████████▋
SSE byte blend improved + mult-add 0.01551 1.46 ██████████████▌
SSE incremental improved + mult-add 0.01642 1.38 █████████████▊
SSE pshufb improved + mult-add 0.01147 1.98 ███████████████████▊
SSE base multiply-add 0.01372 1.65 ████████████████▌
SSE byte blend multiply-add 0.01354 1.68 ████████████████▊
SSE incremental multiply-add 0.01490 1.52 ███████████████▏
SSE pshufb multiply-add 0.01013 2.24 ██████████████████████▍
SSE & BMI2 base N/A 0.01341 1.69 ████████████████▉
SSE & BMI2 byte blend N/A 0.01412 1.61 ████████████████
SSE & BMI2 incremental N/A 0.01629 1.39 █████████████▉
AVX2 base naive 0.01136 2.00 ███████████████████▉
AVX2 byte blend naive 0.01247 1.82 ██████████████████▏
AVX2 pshufb naive 0.00941 2.41 ████████████████████████
AVX2 base improved 0.01075 2.11 █████████████████████
AVX2 byte blend improved 0.01204 1.88 ██████████████████▊
AVX2 pshufb improved 0.00903 2.51 █████████████████████████
AVX2 base improved + mult-add 0.01027 2.21 ██████████████████████
AVX2 byte blend improved + mult-add 0.01069 2.12 █████████████████████▏
AVX2 pshufb improved + mult-add 0.00872 2.60 ██████████████████████████
AVX2 base multiply-add 0.00977 2.32 ███████████████████████▏
AVX2 byte blend multiply-add 0.00979 2.32 ███████████████████████▏
AVX2 pshufb multiply-add 0.00855 2.65 ██████████████████████████▌
AVX2 & BMI2 base N/A 0.01306 1.74 █████████████████▎
AVX2 & BMI2 byte blend N/A 0.01307 1.74 █████████████████▎

Conclusions:

  • On Core i7 the speedup over the improved scalar version is rather small.
  • BMI2 helps, but again not by much.

Core i7 results (Skylake)

The CPU architecture: Skylake i7-6700 CPU @ 3.40GHz

GCC: 5.3.0 (Ubuntu)

procedure lookup pack time [s] speedup  
improved scalar N/A N/A 0.01903 1.00 ██████████
scalar N/A N/A 0.04551 0.42 ████▏
scalar & BMI2 N/A N/A 0.04107 0.46 ████▋
SSE base naive 0.01537 1.24 ████████████▍
SSE byte blend naive 0.01544 1.23 ████████████▎
SSE incremental naive 0.01488 1.28 ████████████▊
SSE pshufb naive 0.01254 1.52 ███████████████▏
SSE base improved 0.01357 1.40 ██████████████
SSE byte blend improved 0.01401 1.36 █████████████▌
SSE incremental improved 0.01422 1.34 █████████████▍
SSE pshufb improved 0.01126 1.69 ████████████████▉
SSE base improved + mult-add 0.01273 1.49 ██████████████▉
SSE byte blend improved + mult-add 0.01396 1.36 █████████████▋
SSE incremental improved + mult-add 0.01371 1.39 █████████████▉
SSE pshufb improved + mult-add 0.00995 1.91 ███████████████████▏
SSE base multiply-add 0.01098 1.73 █████████████████▎
SSE byte blend multiply-add 0.01280 1.49 ██████████████▊
SSE incremental multiply-add 0.01098 1.73 █████████████████▎
SSE pshufb multiply-add 0.00901 2.11 █████████████████████
SSE & BMI2 base N/A 0.01148 1.66 ████████████████▌
SSE & BMI2 byte blend N/A 0.01284 1.48 ██████████████▊
SSE & BMI2 incremental N/A 0.01193 1.60 ███████████████▉
AVX2 base naive 0.00931 2.04 ████████████████████▍
AVX2 byte blend naive 0.00924 2.06 ████████████████████▌
AVX2 pshufb naive 0.00800 2.38 ███████████████████████▊
AVX2 base improved 0.00916 2.08 ████████████████████▊
AVX2 byte blend improved 0.00900 2.11 █████████████████████▏
AVX2 pshufb improved 0.00765 2.49 ████████████████████████▉
AVX2 base improved + mult-add 0.00853 2.23 ██████████████████████▎
AVX2 byte blend improved + mult-add 0.00847 2.25 ██████████████████████▍
AVX2 pshufb improved + mult-add 0.00734 2.59 █████████████████████████▉
AVX2 base multiply-add 0.00799 2.38 ███████████████████████▊
AVX2 byte blend multiply-add 0.00794 2.40 ███████████████████████▉
AVX2 pshufb multiply-add 0.00705 2.70 ██████████████████████████▉
AVX2 & BMI2 base N/A 0.01094 1.74 █████████████████▍
AVX2 & BMI2 byte blend N/A 0.01053 1.81 ██████████████████

Acknowledgments

The AVX2 version wouldn't be possible without the help of Nathan Kurz and Daniel Lemire, who kindly gave me access to test machines with brand new CPUs (Haswell and Skylake).

See also