Base64 decoding with SIMD instructions

Author: Wojciech Muła
Added on: 2016-01-17
Updates: 2016-03-30 (new lookup algorithm, updated results), 2016-03-19 (new packing methods, updated results), 2016-03-13 (two new lookup algorithms, updated results), 2016-03-10 (AVX2 notes, Skylake results), 2016-01-25 (minor fixes thanks to HN comments), 2016-01-18 (Core i7 results)

Introduction

The surprisingly good results of base64 encoding with SIMD instructions encouraged me to examine the opposite algorithm, i.e. decoding. Decoding is slightly more complicated, as it has to check the input's validity.

A decoder must also handle the character '=' at the end of the input, but since this happens only once per input, I didn't bother with it in the sample code.

Base algorithm

In the basic scalar version the minimum input quantity is 4 characters, which are transformed into 3 bytes. The scalar version uses a lookup table translating from an ASCII code to 6-bit data. For non-base64 characters the lookup table contains a special marker. After decoding 4 characters and checking for errors, the 6-bit words are merged to form a 24-bit word.

The inner loop of the algorithm looks like this:

// 1. translate input
const uint8_t b0 = lookup[input[i + 0]];
const uint8_t b1 = lookup[input[i + 1]];
const uint8_t b2 = lookup[input[i + 2]];
const uint8_t b3 = lookup[input[i + 3]];

if (b0, b1, b2 or b3 invalid) {
    report error
}

// 2. save output bytes
*out++ = b0 | (b1 << 6);
*out++ = (b1 >> 2) | (b2 << 4);
*out++ = (b2 >> 4) | (b3 << 2);

Forming the output is expensive, it requires:

  • 5 bit-shifts,
  • 3 bit-ors.

Improved scalar version

It is possible to get rid of the bit-shifts by using four lookup tables, one for each input byte. Each table contains already-shifted 6-bit data fields, so only bit-ors are required. This comes at the cost of bigger memory requirements: the improved version needs 4 kB (4 tables * 256 entries * 4 bytes), while the basic one needs only 256 bytes.

Following code shows the idea:

// 1. lookup and build output word at the same time
const uint32_t dword = lookup32[0][input[i + 0]]
                     | lookup32[1][input[i + 1]]
                     | lookup32[2][input[i + 2]]
                     | lookup32[3][input[i + 3]];

// 2. save 32 bits
*reinterpret_cast<uint32_t*>(out) = dword;
out += 3;

Another drawback is that the method writes 4 bytes instead of 3, so the output buffer must allow one byte of slack (the extra byte is overwritten by the next write).
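
The article does not show how the four tables are built. Below is a minimal sketch of how they could be initialized, assuming the dword layout implied by the scalar code (b0 | b1 << 6 | b2 << 12 | b3 << 18); the name lookup32 and the invalid marker are illustrative, and error handling is omitted as in the snippet above:

#include <cstdint>

uint32_t lookup32[4][256];

void init_lookup32() {
    // 6-bit value -> ASCII character, i.e. the base64 alphabet
    static const char base64_chars[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // hypothetical marker for non-base64 characters (not used by the snippet above)
    const uint32_t invalid = 0xffffffff;

    for (int k = 0; k < 4; k++)
        for (int ch = 0; ch < 256; ch++)
            lookup32[k][ch] = invalid;

    for (uint32_t value = 0; value < 64; value++) {
        const uint8_t ch = static_cast<uint8_t>(base64_chars[value]);
        for (int k = 0; k < 4; k++)
            lookup32[k][ch] = value << (6 * k);   // 6-bit field already shifted into place
    }
}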

SSE version

Vector lookup (base)

In an SSE version the lookup table must be expressed as a sequence of instructions. Such a program performs a parallel translation of all input bytes.

The following table describes how to map an input character to 6-bit data (i is the character code), according to base64.

range    expression           after constant folding
A-Z      i - ord('A')         i - 65
a-z      i - ord('a') + 26    i - 71
0-9      i - ord('0') + 52    i + 4
'+'      i - ord('+') + 62    i + 19
'/'      i - ord('/') + 63    i + 16
other    error

The basic step of the vector version is producing a vector of constants (from the third column) for the characters matching a given range. After checking all five ranges, the vectors are merged. Since none of the constants is zero, any zero byte in the merged vector means that the input contains an invalid character.

If everything is OK, the last step is simply adding the input vector and the merged one. The result vector contains 6-bit data in each byte.

Procedure doing the translation:

__m128i lookup_base(const __m128i input) {

    // shift for range 'A' - 'Z'
    const __m128i ge_A = _mm_cmpgt_epi8(input, packed_byte('A' - 1));
    const __m128i le_Z = _mm_cmplt_epi8(input, packed_byte('Z' + 1));
    const __m128i range_AZ = _mm_and_si128(packed_byte(-65), _mm_and_si128(ge_A, le_Z));

    // shift for range 'a' - 'z'
    const __m128i ge_a = _mm_cmpgt_epi8(input, packed_byte('a' - 1));
    const __m128i le_z = _mm_cmplt_epi8(input, packed_byte('z' + 1));
    const __m128i range_az = _mm_and_si128(packed_byte(-71), _mm_and_si128(ge_a, le_z));

    // shift for range '0' - '9'
    const __m128i ge_0 = _mm_cmpgt_epi8(input, packed_byte('0' - 1));
    const __m128i le_9 = _mm_cmplt_epi8(input, packed_byte('9' + 1));
    const __m128i range_09 = _mm_and_si128(packed_byte(4), _mm_and_si128(ge_0, le_9));

    // shift for character '+'
    const __m128i eq_plus = _mm_cmpeq_epi8(input, packed_byte('+'));
    const __m128i char_plus = _mm_and_si128(packed_byte(19), eq_plus);

    // shift for character '/'
    const __m128i eq_slash = _mm_cmpeq_epi8(input, packed_byte('/'));
    const __m128i char_slash = _mm_and_si128(packed_byte(16), eq_slash);

    // merge partial results

    const __m128i shift = _mm_or_si128(range_AZ,
                          _mm_or_si128(range_az,
                          _mm_or_si128(range_09,
                          _mm_or_si128(char_plus, char_slash))));

    const auto mask = _mm_movemask_epi8(_mm_cmpeq_epi8(shift, packed_byte(0)));
    if (mask) {
        // report an error
    }

    return _mm_add_epi8(input, shift);
}

Number of operations:

  • comparison (le/gt/eq): 9,
  • bit-and: 8,
  • bit-or: 4,
  • add: 1,
  • movemask: 1.

Number of constants: 9.

Total number of operations is 23. This is 1.4375 instructions per character.

Note: the SSE compares for less/greater work on signed values; in this case it is not a problem, because bytes 0x80 .. 0xff are treated as negative, never match any range and are thus reported as errors.

Vector lookup (byte blend)

Instead of creating partial shifts (the constants range_AZ, range_az and so on) and then merging them with a series of bit-ors, the shift vector can be updated using the byte blend instruction pblendvb (_mm_blendv_epi8).

Procedure:

__m128i lookup_byte_blend(const __m128i input) {

    __m128i shift;

    // shift for range 'A' - 'Z'
    const __m128i ge_A = _mm_cmpgt_epi8(input, packed_byte('A' - 1));
    const __m128i le_Z = _mm_cmplt_epi8(input, packed_byte('Z' + 1));
    shift = _mm_and_si128(packed_byte(-65), _mm_and_si128(ge_A, le_Z));

    // shift for range 'a' - 'z'
    const __m128i ge_a = _mm_cmpgt_epi8(input, packed_byte('a' - 1));
    const __m128i le_z = _mm_cmplt_epi8(input, packed_byte('z' + 1));
    shift = _mm_blendv_epi8(shift, packed_byte(-71), _mm_and_si128(ge_a, le_z));

    // shift for range '0' - '9'
    const __m128i ge_0 = _mm_cmpgt_epi8(input, packed_byte('0' - 1));
    const __m128i le_9 = _mm_cmplt_epi8(input, packed_byte('9' + 1));
    shift = _mm_blendv_epi8(shift, packed_byte(4), _mm_and_si128(ge_0, le_9));

    // shift for character '+'
    const __m128i eq_plus = _mm_cmpeq_epi8(input, packed_byte('+'));
    shift = _mm_blendv_epi8(shift, packed_byte(19), eq_plus);

    // shift for character '/'
    const __m128i eq_slash = _mm_cmpeq_epi8(input, packed_byte('/'));
    shift = _mm_blendv_epi8(shift, packed_byte(16), eq_slash);

    const auto mask = _mm_movemask_epi8(_mm_cmpeq_epi8(shift, packed_byte(0)));
    if (mask) {
        // report an error
    }

    return _mm_add_epi8(input, shift);
}

Number of operations:

  • comparison (le/gt/eq): 9,
  • bit-and: 4,
  • bit-or: 0,
  • byte-blend: 4,
  • add: 1,
  • movemask: 1.

Number of constants: 9.

Total number of operations is 19, four less than the base version. This is 1.1875 instructions per character. However, the instruction pblendvb has both latency and throughput equal to 2 cycles, which seems to be a problem. Experiments have shown that sometimes the instruction gives a small improvement, but sometimes it makes things slower.

Vector lookup (incremental)

A lookup procedure which employs the technique described in a separate article. The lookup procedure uses the pshufb instruction. I like this instruction very much, but experiments have shown that this version is slower. My heart sank.

Vector lookup (pshufb) new

Instead of checking all possible ranges as the other methods do, this algorithm determines the single possible range for a byte and the proper shift value. Then it validates the byte against that single range and adds the selected shift.

To select both the range and the shift, the higher nibble of the byte is used. Let's look at the rewritten lookup table:

input [hex]   valid input [ASCII]   valid range     shift
0x            invalid
1x            invalid
2x            '+'                   0x2b .. 0x2b    0x3e
              '/'                   0x2f .. 0x2f    0x3f
3x            '0' .. '9'            0x30 .. 0x39    0x34
4x            'A' .. 'O'            0x41 .. 0x4f    0x00
5x            'P' .. 'Z'            0x50 .. 0x5a    0x0f
6x            'a' .. 'o'            0x61 .. 0x6f    0x1a
7x            'p' .. 'z'            0x70 .. 0x7a    0x29
8x .. Fx      invalid

The outline of the algorithm (for a single byte):

higher_nibble = byte >> 4;

upper_limit = upper_limit_LUT[higher_nibble];
lower_limit = lower_limit_LUT[higher_nibble];

if (byte < lower_limit || byte > upper_limit) {
    report error
}

decoded = byte - lower_limit + shift_LUT[higher_nibble];

For nibbles in the set {0, 1, 8, 9, a, b, c, d, e, f} the LUTs define invalid ranges, always yielding errors. For nibbles {3, 4, 5, 6, 7} the ranges are well defined.

The only exception is nibble 2, as there are two distinct values associated with these bytes. This special case is solved by choosing the first, 1-element range 0x2b .. 0x2b and introducing additional code handling input equal to 0x2f. The extra code doesn't complicate the algorithm too much:

higher_nibble = byte >> 4;

upper_limit = upper_limit_LUT[higher_nibble];
lower_limit = lower_limit_LUT[higher_nibble];

if ((byte < lower_limit || byte > upper_limit) && byte != 0x2f) {
    report error
}

decoded = byte - lower_limit + shift_LUT[higher_nibble];

if (byte == 0x2f) {
    decoded -= 3; // (0x2b - 0x2f) + (0x3f - 0x3e) = -4 + 1 = -3
}

The SIMD (SSE) code implementing the above algorithm:

__m128i lookup_pshufb(const __m128i input) {

    const __m128i higher_nibble = _mm_and_si128(_mm_srli_epi32(input, 4), packed_byte(0x0f));
    const char linv = 1;
    const char hinv = 0;

    const __m128i lower_bound_LUT = _mm_setr_epi8(
        /* 0 */ linv, /* 1 */ linv, /* 2 */ 0x2b, /* 3 */ 0x30,
        /* 4 */ 0x41, /* 5 */ 0x50, /* 6 */ 0x61, /* 7 */ 0x70,
        /* 8 */ linv, /* 9 */ linv, /* a */ linv, /* b */ linv,
        /* c */ linv, /* d */ linv, /* e */ linv, /* f */ linv
    );

    const __m128i upper_bound_LUT = _mm_setr_epi8(
        /* 0 */ hinv, /* 1 */ hinv, /* 2 */ 0x2b, /* 3 */ 0x39,
        /* 4 */ 0x4f, /* 5 */ 0x5a, /* 6 */ 0x6f, /* 7 */ 0x7a,
        /* 8 */ hinv, /* 9 */ hinv, /* a */ hinv, /* b */ hinv,
        /* c */ hinv, /* d */ hinv, /* e */ hinv, /* f */ hinv
    );

    const __m128i shift_LUT = _mm_setr_epi8(
        /* 0 */ 0x00, /* 1 */ 0x00, /* 2 */ 0x3e, /* 3 */ 0x34,
        /* 4 */ 0x00, /* 5 */ 0x0f, /* 6 */ 0x1a, /* 7 */ 0x29,
        /* 8 */ 0x00, /* 9 */ 0x00, /* a */ 0x00, /* b */ 0x00,
        /* c */ 0x00, /* d */ 0x00, /* e */ 0x00, /* f */ 0x00
    );

    const __m128i upper_bound = _mm_shuffle_epi8(upper_bound_LUT, higher_nibble);
    const __m128i lower_bound = _mm_shuffle_epi8(lower_bound_LUT, higher_nibble);

    const __m128i below = _mm_cmplt_epi8(input, lower_bound);
    const __m128i above = _mm_cmpgt_epi8(input, upper_bound);
    const __m128i eq_2f = _mm_cmpeq_epi8(input, packed_byte(0x2f));

    // in_range = not (below or above) or eq_2f
    // outside  = not in_range = below or above and not eq_2f (from deMorgan law)
    const __m128i outside = _mm_andnot_si128(eq_2f, _mm_or_si128(above, below));

    const auto mask = _mm_movemask_epi8(outside);
    if (mask) {
        // report an error
    }

    __m128i shift  = _mm_shuffle_epi8(shift_LUT, higher_nibble);
    __m128i result = _mm_sub_epi8(input, lower_bound);
    result = _mm_add_epi8(result, shift);
    result = _mm_add_epi8(result, _mm_and_si128(eq_2f, packed_byte(-3)));

    return result;
}

Number of operations:

  • comparison (le/gt/eq): 3,
  • shift: 1,
  • bit-and: 2,
  • bit-andnot: 1,
  • bit-or: 1,
  • add: 2,
  • sub: 1,
  • movemask: 1
  • pshufb: 3.

Number of constants: 6.

Total number of operations is 15. This is 0.9375 instructions per character.

Comparison of vector lookups new

operation               algorithm
                        base      byte-blend  incremental  pshufb
comparison (le/gt/eq)   9         9           9            3
bit-and                 8         4           0            2
bit-andnot              0         0           0            1
bit-or                  4         0           0            1
shift                   0         0           0            1
add                     1         1           10           2
sub                     0         0           0            1
byte-blend              0         4           0            0
movemask                1         1           1            1
pshufb                  0         0           1            3
total instructions      23        19          21           15
instructions/byte       1.4375    1.1875      1.3125       0.9375
constants count         9         9           10           6

Gathering data

The result of the lookup procedure is a vector of 32-bit words, each having the following bit layout:

[00dddddd|00cccccc|00bbbbbb|00aaaaaa]

Bits a, b, c and d are data. The expected output is:

[00000000|ddddddcc|ccccbbbb|bbaaaaaa]

Once the bits are packed into 24-bit words within each dword, the bytes are shuffled in order to build a contiguous 12-byte result. This is done by a single pshufb invocation.
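
A minimal sketch of that shuffle, assuming the little-endian dword layout shown above, where the three packed bytes occupy positions 0 .. 2 of each dword; packed stands for the output of one of the pack variants below, and the actual shuffle constant in the repository code may differ:

const __m128i shuf = _mm_setr_epi8( 0,  1,  2,
                                    4,  5,  6,
                                    8,  9, 10,
                                   12, 13, 14,
                                   -1, -1, -1, -1);   // top 4 bytes are zeroed (and later overwritten)

// packed: four dwords [00000000|ddddddcc|ccccbbbb|bbaaaaaa]
const __m128i bytes = _mm_shuffle_epi8(packed, shuf); // 12 meaningful bytes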

Pack — naive variant

The naive variant isolates all 6-bit fields, shifts them to the required positions and merges them into one word:

const __m128i bits_a = _mm_and_si128(values, packed_dword(0x0000003f));
const __m128i bits_b = _mm_srli_epi32(_mm_and_si128(values, packed_dword(0x00003f00)), 2);
const __m128i bits_c = _mm_srli_epi32(_mm_and_si128(values, packed_dword(0x003f0000)), 4);
const __m128i bits_d = _mm_srli_epi32(_mm_and_si128(values, packed_dword(0x3f000000)), 6);

result = _mm_or_si128(bits_a, _mm_or_si128(bits_b, _mm_or_si128(bits_c, bits_d)));

Number of operations:

  • 4 bit-and,
  • 3 bit-shift,
  • 3 bit-or.

Pack — improved variant

In the improved version two adjacent bytes are merged first, forming 12-bit words in the 16-bit halves of each dword. Then these 12-bit words are merged together:

// bits_ac: packed_dword([00000000|00cccccc|00000000|00aaaaaa] x 4)
// bits_bd: packed_dword([00dddddd|00000000|00bbbbbb|00000000] x 4)
const __m128i bits_ac = _mm_and_si128(values, packed_dword(0x003f003f));
const __m128i bits_bd = _mm_and_si128(values, packed_dword(0x3f003f00));

// tmp:     packed_dword([0000dddd|ddcccccc|0000bbbb|bbaaaaaa] x 4)
const __m128i tmp     = _mm_or_si128(bits_ac, _mm_srli_epi32(bits_bd, 2));

// bits_cd: packed_dword([0000dddd|ddcccccc|00000000|00000000] x 4)
// bits_ab: packed_dword([00000000|00000000|0000bbbb|bbaaaaaa] x 4)
const __m128i bits_cd = _mm_and_si128(tmp, packed_dword(0x0fff0000));
const __m128i bits_ab = _mm_and_si128(tmp, packed_dword(0x00000fff));

//          packed_dword([00000000|ddddddcc|ccccbbbb|bbaaaaaa] x 4)
result = _mm_or_si128(bits_ab, _mm_srli_epi32(bits_cd, 4));

Number of operations:

  • 4 bit-and,
  • 2 bit-shift (1 less),
  • 2 bit-or (1 less).

Pack — multiply-add + improved variant

This is a slightly modified improved variant. The first merge of adjacent bytes is performed by the instruction pmaddubsw. The instruction vertically multiplies unsigned bytes from one operand by signed bytes from the other, yielding signed 16-bit intermediate values, and then adds adjacent pairs of these values. The input fields are 6-bit, so they are non-negative and the instruction can be safely used.

The instruction pmaddubsw thus performs the shift & bit-or in a single step; merging the 12-bit words is done by regular bit operations.

//          packed_dword([0000dddd|ddcccccc|0000bbbb|bbaaaaaa] x 4)
const __m128i merge_ab_and_bc = _mm_maddubs_epi16(values, packed_dword(0x40014001));

// bits_cd: packed_dword([0000dddd|ddcccccc|00000000|00000000] x 4)
// bits_ab: packed_dword([00000000|00000000|0000bbbb|bbaaaaaa] x 4)
const __m128i bits_cd = _mm_and_si128(merge_ab_and_bc, packed_dword(0x0fff0000));
const __m128i bits_ab = _mm_and_si128(merge_ab_and_bc, packed_dword(0x00000fff));

// merged:  packed_dword([00000000|ddddddcc|ccccbbbb|bbaaaaaa] x 4)
result = _mm_or_si128(bits_ab, _mm_srli_epi32(bits_cd, 4));

Number of operations:

  • 2 bit-and,
  • 1 bit-shift,
  • 1 bit-or,
  • 1 multiply-add.

Pack — multiply-add variant

All field merging is done using multiply-add instructions. First, the bytes are merged using pmaddubsw as in the above variant. Then the 12-bit words are merged using a single pmaddwd.

Again, pmaddwd operates on signed values, but the inputs are never negative.

const __m128i merge_ab_and_bc = _mm_maddubs_epi16(values, packed_dword(0x40014001));

return _mm_madd_epi16(merge_ab_and_bc, packed_dword(0x10000001));

Number of operations:

  • 2 multiply-add.

Saving result (SSE variant)

The result is an XMM register, however only the 12 lowest bytes are required. My experiments on Core i5 have shown that saving the selected bytes using the dedicated instruction maskmovdqu (intrinsic _mm_maskmoveu_si128) is slower than an ordinary 16-byte write; the 4 extra bytes are simply overwritten by the next store.
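
A sketch of that plain-store approach (variable names are illustrative, not the repository code):

// bytes: the shuffled result, 12 valid bytes in the low positions
_mm_storeu_si128(reinterpret_cast<__m128i*>(out), bytes);
out += 12;   // the 4 extra bytes will be overwritten by the next store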

BMI2

With the help of the BMI2 instruction pext, this can be simplified to (see the sketch after the list):

  • two extractions of the register's 64-bit halves,
  • two invocations of pext,
  • two writes.
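
A minimal sketch of these steps, assuming the dword layout used throughout the article; the helper names are illustrative and this is not the repository's code:

#include <immintrin.h>
#include <cstdint>
#include <cstring>

// pack eight translated bytes (each holding a 6-bit field) into 6 output bytes
inline void pack_qword_bmi2(uint64_t qword, uint8_t* out) {
    // gather the eight 6-bit fields into the lowest 48 bits
    const uint64_t packed = _pext_u64(qword, 0x3f3f3f3f3f3f3f3full);

    // write 8 bytes; the 2 extra bytes are overwritten by the next call
    std::memcpy(out, &packed, 8);
}

inline void save_bmi2(const __m128i translated, uint8_t* out) {
    pack_qword_bmi2(static_cast<uint64_t>(_mm_cvtsi128_si64(translated)),    out + 0);
    pack_qword_bmi2(static_cast<uint64_t>(_mm_extract_epi64(translated, 1)), out + 6);
}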

AVX2 version

An AVX2 version of the lookup procedure is nearly the same as the SSE version. The only minor difference is that there is no "less than" byte comparison intrinsic for AVX2, so all range checks have to be expressed with the "greater than" relation.
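
For example, the 'A' .. 'Z' check from the SSE lookup could be written as in the sketch below, where "x < K" becomes "K > x" with swapped operands; packed_byte corresponds to _mm256_set1_epi8 here:

// shift for range 'A' - 'Z' with AVX2: only cmpgt is available
const __m256i ge_A = _mm256_cmpgt_epi8(input, _mm256_set1_epi8('A' - 1));
const __m256i le_Z = _mm256_cmpgt_epi8(_mm256_set1_epi8('Z' + 1), input);
const __m256i range_AZ = _mm256_and_si256(_mm256_set1_epi8(-65),
                         _mm256_and_si256(ge_A, le_Z));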

An output from the AVX2 lookup could be either processed using standard SIMD operations or BMI2 instructions.

Sample code

The sample code is available on github; there are three programs:

Experiments

The program speed was run 10 times for each configuration and the minimum times were noted; the GCC version used on each machine is listed below.

Core i5 results (Westmere) update

The CPU architecture: Westmere i5 M540 @ 2.53GHz

GCC: 4.9.2 (Debian)

procedure lookup pack time [s] speedup  
improved scalar N/A N/A 0.04741 1.00 ██████████
scalar N/A N/A 0.06894 0.69 ██████▉
SSE base naive 0.03389 1.40 █████████████▉
SSE byte blend naive 0.03026 1.57 ███████████████▋
SSE incremental naive 0.03407 1.39 █████████████▉
SSE pshufb naive 0.02777 1.71 █████████████████
SSE base improved 0.03189 1.49 ██████████████▊
SSE byte blend improved 0.02793 1.70 ████████████████▉
SSE incremental improved 0.03268 1.45 ██████████████▌
SSE pshufb improved 0.02576 1.84 ██████████████████▍
SSE base improved + mult-add 0.03002 1.58 ███████████████▊
SSE byte blend improved + mult-add 0.02709 1.75 █████████████████▌
SSE incremental improved + mult-add 0.03031 1.56 ███████████████▋
SSE pshufb improved + mult-add 0.02390 1.98 ███████████████████▊
SSE base multiply-add 0.02848 1.66 ████████████████▋
SSE byte blend multiply-add 0.02593 1.83 ██████████████████▎
SSE incremental multiply-add 0.02879 1.65 ████████████████▍
SSE pshufb multiply-add 0.02229 2.13 █████████████████████▎

Conclusions:

  • Eliminating 5 shifts from the scalar version boosted the code by a factor of 1.45. This is impressive.
  • Using the byte-blend instruction gives a significant boost.
  • The packing algorithm using two multiply-add instructions is, surprisingly, the best, even though multiply-add instructions have pretty high latencies of 4-5 cycles.

Core i7 results (Haswell)

The CPU architecture: Haswell i7-4770 CPU @ 3.40GHz.

GCC: 5.3.0 (Ubuntu)

procedure lookup pack time [s] speedup  
improved scalar N/A N/A 0.02268 1.00 ██████████
scalar N/A N/A 0.04949 0.46 ████▌
scalar & BMI2 N/A N/A 0.04463 0.51 █████
SSE base naive 0.01882 1.21 ████████████
SSE byte blend naive 0.01846 1.23 ████████████▎
SSE incremental naive 0.01734 1.31 █████████████
SSE pshufb naive 0.01396 1.62 ████████████████▏
SSE base improved 0.01604 1.41 ██████████████▏
SSE byte blend improved 0.01665 1.36 █████████████▌
SSE incremental improved 0.01651 1.37 █████████████▋
SSE pshufb improved 0.01287 1.76 █████████████████▌
SSE base improved + mult-add 0.01546 1.47 ██████████████▋
SSE byte blend improved + mult-add 0.01551 1.46 ██████████████▌
SSE incremental improved + mult-add 0.01642 1.38 █████████████▊
SSE pshufb improved + mult-add 0.01147 1.98 ███████████████████▊
SSE base multiply-add 0.01372 1.65 ████████████████▌
SSE byte blend multiply-add 0.01354 1.68 ████████████████▊
SSE incremental multiply-add 0.01490 1.52 ███████████████▏
SSE pshufb multiply-add 0.01013 2.24 ██████████████████████▍
SSE & BMI2 base N/A 0.01341 1.69 ████████████████▉
SSE & BMI2 byte blend N/A 0.01412 1.61 ████████████████
SSE & BMI2 incremental N/A 0.01629 1.39 █████████████▉
AVX2 base naive 0.01136 2.00 ███████████████████▉
AVX2 byte blend naive 0.01247 1.82 ██████████████████▏
AVX2 pshufb naive 0.00941 2.41 ████████████████████████
AVX2 base improved 0.01075 2.11 █████████████████████
AVX2 byte blend improved 0.01204 1.88 ██████████████████▊
AVX2 pshufb improved 0.00903 2.51 █████████████████████████
AVX2 base improved + mult-add 0.01027 2.21 ██████████████████████
AVX2 byte blend improved + mult-add 0.01069 2.12 █████████████████████▏
AVX2 pshufb improved + mult-add 0.00872 2.60 ██████████████████████████
AVX2 base multiply-add 0.00977 2.32 ███████████████████████▏
AVX2 byte blend multiply-add 0.00979 2.32 ███████████████████████▏
AVX2 pshufb multiply-add 0.00855 2.65 ██████████████████████████▌
AVX2 & BMI2 base N/A 0.01306 1.74 █████████████████▎
AVX2 & BMI2 byte blend N/A 0.01307 1.74 █████████████████▎

Conclusions:

  • On Core i7 the speedup over the improved scalar version is rather small.
  • BMI2 helps, but again not by much.

Core i7 results (Skylake)

The CPU architecture: Skylake i7-6700 CPU @ 3.40GHz

GCC: 5.3.0 (Ubuntu)

procedure lookup pack time [s] speedup  
improved scalar N/A N/A 0.01903 1.00 ██████████
scalar N/A N/A 0.04551 0.42 ████▏
scalar & BMI2 N/A N/A 0.04107 0.46 ████▋
SSE base naive 0.01537 1.24 ████████████▍
SSE byte blend naive 0.01544 1.23 ████████████▎
SSE incremental naive 0.01488 1.28 ████████████▊
SSE pshufb naive 0.01254 1.52 ███████████████▏
SSE base improved 0.01357 1.40 ██████████████
SSE byte blend improved 0.01401 1.36 █████████████▌
SSE incremental improved 0.01422 1.34 █████████████▍
SSE pshufb improved 0.01126 1.69 ████████████████▉
SSE base improved + mult-add 0.01273 1.49 ██████████████▉
SSE byte blend improved + mult-add 0.01396 1.36 █████████████▋
SSE incremental improved + mult-add 0.01371 1.39 █████████████▉
SSE pshufb improved + mult-add 0.00995 1.91 ███████████████████▏
SSE base multiply-add 0.01098 1.73 █████████████████▎
SSE byte blend multiply-add 0.01280 1.49 ██████████████▊
SSE incremental multiply-add 0.01098 1.73 █████████████████▎
SSE pshufb multiply-add 0.00901 2.11 █████████████████████
SSE & BMI2 base N/A 0.01148 1.66 ████████████████▌
SSE & BMI2 byte blend N/A 0.01284 1.48 ██████████████▊
SSE & BMI2 incremental N/A 0.01193 1.60 ███████████████▉
AVX2 base naive 0.00931 2.04 ████████████████████▍
AVX2 byte blend naive 0.00924 2.06 ████████████████████▌
AVX2 pshufb naive 0.00800 2.38 ███████████████████████▊
AVX2 base improved 0.00916 2.08 ████████████████████▊
AVX2 byte blend improved 0.00900 2.11 █████████████████████▏
AVX2 pshufb improved 0.00765 2.49 ████████████████████████▉
AVX2 base improved + mult-add 0.00853 2.23 ██████████████████████▎
AVX2 byte blend improved + mult-add 0.00847 2.25 ██████████████████████▍
AVX2 pshufb improved + mult-add 0.00734 2.59 █████████████████████████▉
AVX2 base multiply-add 0.00799 2.38 ███████████████████████▊
AVX2 byte blend multiply-add 0.00794 2.40 ███████████████████████▉
AVX2 pshufb multiply-add 0.00705 2.70 ██████████████████████████▉
AVX2 & BMI2 base N/A 0.01094 1.74 █████████████████▍
AVX2 & BMI2 byte blend N/A 0.01053 1.81 ██████████████████

Acknowledgments

The AVX2 version wouldn't be possible without the help of Nathan Kurz and Daniel Lemire, who kindly gave me access to test machines with brand new CPUs (Haswell and Skylake).

See also