Base64 encoding & decoding using AVX512BW instructions

Author: Wojciech Muła
Added on: 2016-04-03
Updated on: 2018-12-08 (performance results from Cannon Lake [AVX512VL & AVX512VBMI]), 2018-11-05 (performance results from SkylakeX [AVX512BW])


Introduction

The SIMD versions of the base64 conversion algorithms were described in Base64 encoding with SIMD instructions and Base64 decoding with SIMD instructions. I have also described an implementation of both encoding and decoding using the AVX512F (Foundation) instructions.

AVX512BW (Byte & Word) comes with a great number of new instructions; the ones that help with base64-related problems, and are used below, are the 512-bit vpshufb, the variable per-word shifts vpsllvw/vpsrlvw, and byte-granularity masked operations.

The extension AVX512VBMI adds even more powerful instructions, notably vpermb and vpermi2b.

The extension AVX512VL adds just one, but really nice instruction: vpmultishiftqb.

2018-04-18: In earlier versions of this text I wrongly assumed that the instructions vpermb and vpermi2b are part of AVX512BW. Sorry for that.

Encoding

The SIMD encoding consists of the following steps; a scalar reference of the underlying transformation is sketched after the list.

  1. In each iteration 48 bytes are loaded (16 × 24-bit words).
  2. Split the 24-bit words into separate 32-bit lanes.
  3. In each 32-bit lane move the 6-bit words to separate bytes.
  4. Convert the 6-bit indices to ASCII according to the base64 lookup table.
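
For reference, here is a minimal scalar sketch of the per-triplet work that the SIMD steps above parallelize; the function name is illustrative, not taken from the repository:

#include <cstdint>

// Encode one 24-bit word (3 input bytes) into 4 base64 characters.
inline void encode_triplet(const uint8_t in[3], char out[4]) {
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    const uint32_t word = (uint32_t(in[0]) << 16) | (uint32_t(in[1]) << 8) | in[2];

    out[0] = alphabet[(word >> 18) & 0x3f];  // field a
    out[1] = alphabet[(word >> 12) & 0x3f];  // field b
    out[2] = alphabet[(word >>  6) & 0x3f];  // field c
    out[3] = alphabet[ word        & 0x3f];  // field d
}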

Steps 1 & 2 — Loading data and splitting bytes

In the SSE code, loading data and splitting 24-bit words into 32-bit lanes is done by the pshufb instruction. In AVX2 the instruction operates on 128-bit lanes, which makes this step more complicated.

AVX512BW

In the case of AVX512BW two shuffles are needed.

  1. Initial layout of the register after loading the data. We need only triplets A .. P:

    [????|????|????|????|PPPO|OONN|NMMM|LLLK|KKJJ|JIII|HHHG|GGFF|FEEE|DDDC|CCBB|BAAA]
    |     lane 3        ╵     lane 2        ╵     lane 1        ╵     lane 0        |
  2. Move 32-bit words across lanes using vpermd:

    [0000|PPPO|OONN|NMMM|0000|LLLK|KKJJ|JIII|0000|HHHG|GGFF|FEEE|0000|DDDC|CCBB|BAAA]
    |     lane 3        ╵     lane 2        ╵     lane 1        ╵     lane 0        |
  3. Reshuffle bytes within 128-bit lanes to the format required by the AVX512BW/AVX512VL unpack procedure, which is shown in the next section.

    The input layout of a single lane (the 0th):

    [00 00 00 00|D2 D1 D0 C2|C1 C0 B2 B1|B0 A2 A1 A0]
    |  4 bytes  |
    

    The layout of the lane after shuffling; the order of bytes within each 32-bit word is 1, 2, 0, 1:

    [D1 D2 D0 D1|C1 C2 C0 C1|B1 B2 B0 B1|A1 A2 A0 A1]
     10 11 9  10 7  8  6  7  4  5  3  4  1  2  0  1
    

The first shuffle (vpermd) moves data across the 128-bit lanes; then, within each lane, vpshufb places the bytes in the desired positions.

// load 64 bytes; only the lower 48 bytes (triplets A .. P) are used
const __m512i tmp1 = _mm512_loadu_si512(input + i);

// move 32-bit words across the 128-bit lanes (vpermd)
const __m512i tmp2 = _mm512_permutexvar_epi32(
    _mm512_set_epi32(-1, 11, 10, 9, -1, 8, 7, 6, -1, 5, 4, 3, -1, 2, 1, 0),
    tmp1
);

// reshuffle bytes within each 128-bit lane into the 1, 2, 0, 1 order (vpshufb)
const __m512i tmp3 = _mm512_shuffle_epi8(
    tmp2,
    _mm512_set4_epi32(0x0a0b090a, 0x07080607, 0x04050304, 0x01020001)
);

AVX512VBMI

In the case of AVX512VBMI the problem is solved with a single invocation of vpermb.

// load 48 bytes
// v = [...|DDDC|CCBB|BAAA]
const __m512i v = _mm512_loadu_si512(reinterpret_cast<const __m512i*>(input + i));

// split 24-bit words to 32-bit lanes
// in = [...|0DDD|0CCC|0BBB|0AAA]
const __m512i in = _mm512_permutexvar_epi8(shuffle_input, v);
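
The constant shuffle_input is not shown above. A possible way to build it is sketched below, assuming the 1, 2, 0, 1 byte order per 32-bit lane required by the unpack step described in the next sections; the helper name is illustrative:

#include <cstdint>
#include <immintrin.h>

// For the k-th 24-bit triplet the destination dword receives the source bytes
// (3k+1, 3k+0, 3k+2, 3k+1), listed here from the lowest to the highest byte.
static __m512i build_shuffle_input() {
    uint8_t idx[64];
    for (int k = 0; k < 16; k++) {
        idx[4*k + 0] = uint8_t(3*k + 1);
        idx[4*k + 1] = uint8_t(3*k + 0);
        idx[4*k + 2] = uint8_t(3*k + 2);
        idx[4*k + 3] = uint8_t(3*k + 1);
    }
    return _mm512_loadu_si512(idx);
}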

Step 3 — moving 6-bit words to separate bytes

Each 32-bit input word contains four 6-bit fields a, b, c and d; the expected output from this step:

[00dddddd|00cccccc|00bbbbbb|00aaaaaa]

AVX512F

The input order of fields is as follows:

[????????|ccdddddd|bbbbcccc|aaaaaabb]

Below is the AVX512F version from the article linked above. The lack of byte-level instructions makes the procedure pretty complicated.

template <int shift, uint32_t mask>
__m512i merge(__m512i target, __m512i src) {
    __m512i shifted;
    if (shift > 0) {
        shifted = _mm512_srli_epi32(src, shift);
    } else {
        shifted = _mm512_slli_epi32(src, -shift);
    }

    // ternary function 0xca acts as a bit-select: mask ? shifted : target,
    // i.e. bits of `shifted` are copied into `target` at positions set in `mask`
    return _mm512_ternarylogic_epi32(_mm512_set1_epi32(mask), shifted, target, 0xca);
}

__m512i unpack(const __m512i in) {
    // [00000000|00000000|00000000|00aaaaaa]
    __m512i indices = _mm512_and_si512(_mm512_srli_epi32(in, 2), packed_dword(0x0000003f));

    // [00000000|00000000|00BB0000|00aaaaaa]
    indices = merge<-12, 0x00003000>(indices, in);

    // [00000000|00000000|00BBbbbb|00aaaaaa]
    indices = merge<  4, 0x00000f00>(indices, in);

    // [00000000|00CCCC00|00BBbbbb|00aaaaaa]
    indices = merge<-10, 0x003c0000>(indices, in);

    // [00000000|00CCCCcc|00BBbbbb|00aaaaaa]
    indices = merge<  6, 0x00030000>(indices, in);

    // [00dddddd|00CCCCcc|00BBbbbb|00aaaaaa]
    indices = merge< -8, 0x3f000000>(indices, in);

    return indices;
}

AVX512BW

The unpacking can be performed faster with the help of the vpsllvw and vpsrlvw instructions. This requires a change of the 32-bit lane format from:

[????????|ccdddddd|bbbbcccc|aaaaaabb]
           byte 2   byte 1   byte 0

to:

[bbbbcccc|ccdddddd|aaaaaabb|bbbbcccc]
 ^^^^                           ^^^^
 unused bits             unused bits

 byte 1    byte 2   byte 0   byte 1

This only requires a different vector to be passed to the vpshufb (AVX512BW) or vpermb (AVX512VBMI) instruction used for splitting bytes.

Algorithm:

  1. Isolate fields a and c.

     // t0    = [0000cccc|cc000000|aaaaaa00|00000000]
     const __m512i t0 = _mm512_and_si512(in, _mm512_set1_epi32(0x0fc0fc00));

  2. Shift the field a right by 10 bits and the field c right by 6 bits.

     // t1    = [00000000|00cccccc|00000000|00aaaaaa]
     const __m512i t1 = _mm512_srlv_epi16(t0, _mm512_set1_epi32(0x0006000a));

  3. Shift the field b left by 4 bits and the field d left by 8 bits (note that no masking is done).

     // t2    = [ccdddddd|00000000|aabbbbbb|cccc0000]
     const __m512i t2 = _mm512_sllv_epi16(in, _mm512_set1_epi32(0x00080004));

  4. Finally, copy the selected bits from t2 to t1.

     // indices = [00dddddd|00cccccc|00bbbbbb|00aaaaaa]
     const __m512i indices = _mm512_ternarylogic_epi32(_mm512_set1_epi32(0x3f003f00), t2, t1, 0xca);

The procedure costs just two shifts and two bitwise operations.

AVX512VL

AVX512VL defines the instruction vpmultishiftqb, which can replace all the variable-shift instructions from the previous point. Please note that the layout of 32-bit lanes requires the same modification as described there.

The instruction builds a vector of bytes from octets located at arbitrary bit positions within a quadword. The following pseudocode shows the algorithm:

for i in 0 .. 7 loop
    qword := input.qword[i];

    for j in 0 .. 7 loop
        index := indices.byte[i * 8 + j];
        output.byte[i * 8 + j] = rotate_right(qword, index) and 0xff;
    end loop
end loop

Although vpmultishiftqb produces a vector of full bytes while the encoding needs just the 6 lower bits, no masking is needed: the instruction vpermb (described above) masks the indices internally.

Below is a code snippet that shows the proper parameters for vpmultishiftqb.

// after the multishift a single 32-bit lane has the following layout:
// [ccdddddd|bbcccccc|aabbbbbb|ddaaaaaa],
// i.e. the 6-bit fields occupy the lane bits: a = [10:15], b = [4:9], c = [22:27], d = [16:21]

const __m512i shifts  = packed_qword(0x3036242a1016040alu); // 48, 54, 36, 42, 16, 22, 4, 10
const __m512i indices = _mm512_multishift_epi64_epi8(shifts, in);
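
Putting the pieces together, below is a sketch of one iteration of the AVX512VL encoder; it corresponds to the three-instruction inner loop visible in the assembly dump at the end of this article. The names shuffle_input and ascii_lookup stand for precalculated constants and are not identifiers from the repository:

// load 64 bytes; only the lower 48 bytes are used
const __m512i v = _mm512_loadu_si512(reinterpret_cast<const __m512i*>(input + i));

// steps 1 & 2: split 24-bit words into 32-bit lanes (1, 2, 0, 1 byte order)
const __m512i in = _mm512_permutexvar_epi8(shuffle_input, v);

// step 3: extract the four 6-bit fields into separate bytes (shifts as above)
const __m512i indices = _mm512_multishift_epi64_epi8(shifts, in);

// step 4: translate the 6-bit indices into ASCII
const __m512i ascii = _mm512_permutexvar_epi8(indices, ascii_lookup);

_mm512_storeu_si512(reinterpret_cast<__m512i*>(output + j), ascii);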

Step 4 — conversion to ASCII

The last part of the algorithm is converting, in parallel, all 6-bit indices into ASCII codes.

AVX512BW

Although vpshufb works on 128-bit lanes, it is sufficient for this conversion. The fastest SSE procedure can be translated one-to-one:

__m512i lookup_pshufb_improved(const __m512i input) {

    // reduce  0..51 -> 0
    //        52..61 -> 1 .. 10
    //            62 -> 11
    //            63 -> 12
    __m512i result = _mm512_subs_epu8(input, packed_byte(51));

    // distinguish between ranges 0..25 and 26..51:
    //         0 .. 25 -> becomes 13
    //        26 .. 51 -> remains 0
    const __mmask64 less = _mm512_cmpgt_epi8_mask(packed_byte(26), input);
    result = _mm512_mask_mov_epi8(result, less, packed_byte(13));

    /* the SSE lookup is readable
        const __m128i shift_LUT = _mm_setr_epi8(
            'a' - 26, '0' - 52, '0' - 52, '0' - 52, '0' - 52, '0' - 52,
            '0' - 52, '0' - 52, '0' - 52, '0' - 52, '0' - 52, '+' - 62,
            '/' - 63, 'A', 0, 0
        );
    */
    const __m512i shift_LUT = _mm512_set4_epi32(
        0x000041f0,
        0xedfcfcfc,
        0xfcfcfcfc,
        0xfcfcfc47
    );

    // read shift
    result = _mm512_shuffle_epi8(shift_LUT, result);

    return _mm512_add_epi8(result, input);
}

AVX512VBMI

The already introduced instruction vpermb performs a lookup in a 64-byte AVX512 register, which is exactly the size of the lookup table needed by the encoding algorithm. Thus a single invocation of the instruction is sufficient.

const __m512i result = _mm512_permutexvar_epi8(indices, lookup);
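
The 64-byte lookup register is simply the base64 alphabet. A possible way to prepare it is shown below; the variable names are illustrative:

// the standard base64 alphabet (exactly 64 characters)
static const char base64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/";

const __m512i lookup = _mm512_loadu_si512(base64_alphabet);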

Decoding

The SIMD decoding consists of the following steps; a scalar sketch of the packing step follows the list.

  1. Translate in parallel from ASCII into 6-bit values stored in separate bytes. At this stage error detection is performed.
  2. Pack the 6-bit data into a contiguous byte stream; the result has 48 bytes.
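
For reference, a minimal scalar sketch of the packing step for a single group of four 6-bit values (the function name is illustrative):

#include <cstdint>

// Pack four 6-bit values a, b, c, d into three output bytes.
inline void pack_quad(uint8_t a, uint8_t b, uint8_t c, uint8_t d, uint8_t out[3]) {
    const uint32_t word = (uint32_t(a) << 18) | (uint32_t(b) << 12) | (uint32_t(c) << 6) | d;
    out[0] = uint8_t(word >> 16);  // aaaaaabb
    out[1] = uint8_t(word >>  8);  // bbbbcccc
    out[2] = uint8_t(word);        // ccdddddd
}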

Step 1 — translation from ASCII

AVX512BW

In the case of decoding, the limitations of vpshufb are again not a problem, so the fastest SSE decoding solution can be applied. Below is the actual implementation.

__m512i lookup_pshufb_bitmask(const __m512i input) {

    const __m512i higher_nibble = _mm512_and_si512(_mm512_srli_epi32(input, 4), packed_byte(0x0f));
    const __m512i lower_nibble  = _mm512_and_si512(input, packed_byte(0x0f));

    const __m512i shiftLUT = _mm512_set4lanes_epi8(
        0,   0,  19,   4, -65, -65, -71, -71,
        0,   0,   0,   0,   0,   0,   0,   0);

    const __m512i maskLUT  = _mm512_set4lanes_epi8(
        /* 0        : 0b1010_1000*/ 0xa8,
        /* 1 .. 9   : 0b1111_1000*/ 0xf8, 0xf8, 0xf8, 0xf8,
                                    0xf8, 0xf8, 0xf8, 0xf8,
                                    0xf8,
        /* 10       : 0b1111_0000*/ 0xf0,
        /* 11       : 0b0101_0100*/ 0x54,
        /* 12 .. 14 : 0b0101_0000*/ 0x50, 0x50, 0x50,
        /* 15       : 0b0101_0100*/ 0x54
    );

    const __m512i bitposLUT = _mm512_set4lanes_epi8(
        0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
    );

    const __m512i   sh      = _mm512_shuffle_epi8(shiftLUT,  higher_nibble);
    const __mmask64 eq_2f   = _mm512_cmpeq_epi8_mask(input, packed_byte(0x2f));
    const __m512i   shift   = _mm512_mask_mov_epi8(sh, eq_2f, packed_byte(16));

    const __m512i M         = _mm512_shuffle_epi8(maskLUT,   lower_nibble);
    const __m512i bit       = _mm512_shuffle_epi8(bitposLUT, higher_nibble);

    const uint64_t match    = _mm512_test_epi8_mask(M, bit);

    if (match != uint64_t(-1)) {
        // report error
    }

    return _mm512_add_epi8(input, shift);
}

AVX512VBMI

This step is a perfect place to utilize the instruction vpermi2b; it requires three registers:

  • indices,
  • the lower & higher halves of a 128-item lookup table.

Valid base64 input characters are always standard ASCII, so they never have the most significant bit (MSB) set. Thanks to that, the seven lowest bits of the input can be used directly as indices for vpermi2b. (The instruction simply ignores the MSB, so no masking is required.)

The lookup table has to be precalculated. It translates an ASCII code into either a 6-bit value or an invalid-character marker; the marker value is 0x80. Thanks to that, both extended ASCII and invalid characters can be identified in a single step.
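
A possible way to precalculate the table is sketched below; the 128 entries are then loaded into two 64-byte registers (lookup_0 holding entries 0..63, lookup_1 entries 64..127, matching the snippet that follows). The helper name is illustrative:

#include <cstdint>

// lut[c] = 6-bit value for a valid base64 character c, 0x80 otherwise
void build_decode_lut(uint8_t lut[128]) {
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    for (int i = 0; i < 128; i++) lut[i] = 0x80;                 // invalid-character marker
    for (int i = 0; i < 64; i++)  lut[uint8_t(alphabet[i])] = uint8_t(i);
}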

__m512i lookup(const __m512i input) {

    const __m512i lookup_0 = precalc::lookup_0;
    const __m512i lookup_1 = precalc::lookup_1;

    const __m512i translated = _mm512_permutex2var_epi8(lookup_0, input, lookup_1);

    // the MSB is set either for an input byte >= 0x80 or for the 0x80 marker
    // produced by the lookup for an invalid character
    const uint64_t mask = _mm512_movepi8_mask(_mm512_or_si512(translated, input));
    if (mask) {
        // report error
    }

    return translated;
}

Step 2 — Packing bit-fields

The final stage of decoding is packing all the 6-bit fields into a contiguous array of 48 bytes. It is done in two steps:

  1. Pack the four fields within each 32-bit word into a 24-bit word.
  2. Move these 3-byte words into a contiguous array.

Step A

The first step is a direct translation of the SSE code, shown below; the AVX512BW procedure uses the _mm512 counterparts of these intrinsics. A multiply-add instruction is used twice; for the given constants it performs two shifts and one bitwise or at once.

// input:  [00dddddd|00cccccc|00bbbbbb|00aaaaaa]

// merge:  [0000cccc|ccdddddd|0000aaaa|aabbbbbb]
const __m128i merge_ab_and_bc = _mm_maddubs_epi16(values, packed_dword(0x01400140));

// result: [00000000|aaaaaabb|bbbbcccc|ccdddddd]
return _mm_madd_epi16(merge_ab_and_bc, packed_dword(0x00011000));
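
For completeness, the same step expressed with 512-bit intrinsics, matching the vpmaddubsw/vpmaddwd zmm instructions visible in the assembly dumps; the result name packed matches the input of the next step:

// merge:  [0000cccc|ccdddddd|0000aaaa|aabbbbbb]
const __m512i merge_ab_and_bc = _mm512_maddubs_epi16(values, _mm512_set1_epi32(0x01400140));

// result: [00000000|aaaaaabb|bbbbcccc|ccdddddd]
const __m512i packed = _mm512_madd_epi16(merge_ab_and_bc, _mm512_set1_epi32(0x00011000));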

Step B — AVX512BW

  1. Initial layout of the register after step A:

    [0PPP|0OOO|0NNN|0MMM|0LLL|0KKK|0JJJ|0III|0HHH|0GGG|0FFF|0EEE|0DDD|0CCC|0BBB|0AAA]
    |     lane 3        |     lane 2        |     lane 1        |     lane 0        |
    
  2. Pack bytes within lanes using vpshufb; the topmost doubleword in each lane is left empty:

    [0000|PPPO|OONN|NMMM|0000|LLLK|KKJJ|JIII|0000|HHHG|GGFF|FEEE|0000|DDDC|CCBB|BAAA]
    |     lane 3        |     lane 2        |     lane 1        |     lane 0        |
    
  3. Move the non-empty 32-bit words across lanes using vpermd to form a contiguous 48-byte array:

    [0000|0000|0000|0000|PPPO|OONN|NMMM|LLLK|KKJJ|JIII|HHHG|GGFF|FEEE|DDDC|CCBB|BAAA]
    |     lane 3        |     lane 2        |     lane 1        |     lane 0        |
    

Below is a sample implementation.

// pack bytes within each 128-bit lane
const __m512i t1 = _mm512_shuffle_epi8(
    packed,
    _mm512_set4lanes_epi8(
         2,  1,  0,
         6,  5,  4,
        10,  9,  8,
        14, 13, 12,
        -1, -1, -1, -1)
);

// move the non-empty 32-bit words across lanes
const __m512i s6 = _mm512_setr_epi32(
     0,  1,  2,
     4,  5,  6,
     8,  9, 10,
    12, 13, 14,
    // unused
     0,  0,  0, 0);

const __m512i t2 = _mm512_permutexvar_epi32(s6, t1);

_mm512_storeu_si512(reinterpret_cast<__m512i*>(out), t2);

Step B — AVX512VBMI

In the case of AVX512VBMI this step is done with a single vpermb, as sketched below.
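
A minimal sketch, assuming a precalculated index vector (named pack_shuffle here; not an identifier from the repository) that gathers bytes 0, 1 and 2 of every doubleword:

// pack_shuffle holds the byte indices 0,1,2, 4,5,6, 8,9,10, ..., 60,61,62
// followed by 16 don't-care entries; after the permutation the 48 payload
// bytes are contiguous at the bottom of the register.
const __m512i t = _mm512_permutexvar_epi8(pack_shuffle, packed);
_mm512_storeu_si512(reinterpret_cast<__m512i*>(out), t);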

Performance evaluation

Daniel Lemire gave me access to machines that support the AVX512BW, AVX512VL and AVX512VBMI instruction sets. Below are performance results for both encoding and decoding.

SkylakeX

CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz

Compiler: gcc (Ubuntu 8.1.0-5ubuntu1~16.04) 8.1.0

Encoding

Tested procedures:

  scalar (32 bit)                     lookup-based method that loads four input bytes at once
  scalar (64 bit)                     similar to the above, but reads 8 input bytes at once
  AVX2 (lookup: pshufb improved)      code that uses the shuffle-based algorithm
  AVX512BW (lookup: pshufb improved)  exactly the same shuffle-based algorithm

procedure                                          best [cycles]  avg. [cycles]  speedup
scalar (32 bit)                                        5.768          6.452        1.00  ███▋
scalar (64 bit)                                        5.537          5.599        1.04  ███▊
AVX2 (lookup: pshufb unrolled improved)                0.681          0.698        8.47  ███████████████████████████████▏
AVX512BW (lookup: optimized2)                          0.726          0.734        7.94  █████████████████████████████▎
AVX512BW (lookup: pshufb improved)                     0.425          0.436       13.57  ██████████████████████████████████████████████████
AVX512BW (lookup: pshufb improved, unrolled x 2)       0.425          0.441       13.57  ██████████████████████████████████████████████████

Decoding

Tested procedures:

  scalar                                             naive lookup-based implementation
  improved scalar                                    also lookup-based code, but input errors are detected for four input bytes at once
  AVX2 (lookup: pshufb bitmask, pack: multiply-add)  uses the same algorithm as the AVX512BW implementation
  AVX512BW (lookup: N/A, pack: multiply-add)         implementation of the lookup and pack algorithms described above

procedure                                          best [cycles]  avg. [cycles]  speedup
scalar                                                 1.023          1.091        1.00  ██████▉
improved scalar                                        1.024          1.085        1.00  ██████▉
AVX2 (lookup: pshufb bitmask, pack: multiply-add)      0.222          0.225        4.61  ███████████████████████████████▊
AVX512BW (lookup: N/A, pack: multiply-add)             0.141          0.144        7.26  ██████████████████████████████████████████████████

Cannon Lake

CPU: Intel(R) Core(TM) i3-8121U CPU @ 2.20GHz

GCC: gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)

Encoding

Tested procedures:

  scalar (32 bit)                                   lookup-based method that loads four input bytes at once
  scalar (64 bit)                                   similar to the above, but reads 8 input bytes at once
  AVX2 (lookup: pshufb improved)                    code that uses the shuffle-based algorithm
  AVX512BW (lookup: pshufb improved)                exactly the same shuffle-based algorithm
  AVX512BW (lookup: pshufb improved, unrolled x 2)  the above procedure unrolled two times
  AVX512VBMI                                        steps 1 & 2: _mm512_permutexvar_epi8, step 3: AVX512BW code, step 4: again _mm512_permutexvar_epi8
  AVX512VL                                          steps 1 & 2: _mm512_permutexvar_epi8, step 3: _mm512_multishift_epi64_epi8, step 4: again _mm512_permutexvar_epi8

procedure                                          best [cycles]  avg. [cycles]  speedup
scalar (32 bit)                                        6.082          6.117        1.00  █▌
scalar (64 bit)                                        6.096          6.134        1.00  █▌
AVX2 (lookup: pshufb unrolled improved)                0.752          0.772        8.09  ████████████▉
AVX512BW (lookup: pshufb improved)                     0.391          0.405       15.55  ████████████████████████▉
AVX512BW (lookup: pshufb improved, unrolled x 2)       0.387          0.399       15.72  █████████████████████████▏
AVX512VBMI                                             0.227          0.233       26.79  ██████████████████████████████████████████▉
AVX512VL                                               0.195          0.202       31.19  ██████████████████████████████████████████████████

Decoding

Tested procedures:

  scalar                                              naive lookup-based implementation
  improved scalar                                     also lookup-based code, but input errors are detected for four input bytes at once
  AVX2 (lookup: pshufb bitmask, pack: multiply-add)   uses the same algorithm as the AVX512BW implementation
  AVX512BW (lookup: N/A, pack: multiply-add)          implementation of the lookup and pack algorithms described above
  AVX512VBMI (lookup: N/A, pack: multiply-add)        lookup is done by _mm512_permutex2var_epi8 and the final byte shuffle by _mm512_permutexvar_epi8

procedure                                          best [cycles]  avg. [cycles]  speedup
scalar                                                 1.030          1.044        1.00  ████▏
improved scalar                                        1.031          1.043        1.00  ████▏
AVX2 (lookup: pshufb bitmask, pack: multiply-add)      0.232          0.234        4.44  ██████████████████▋
AVX512BW (lookup: N/A, pack: multiply-add)             0.128          0.131        8.05  █████████████████████████████████▉
AVX512VBMI (lookup: N/A, pack: multiply-add)           0.087          0.089       11.84  ██████████████████████████████████████████████████

Assembly dumps

Only the inner loops are shown; in the case of decoding the error-handling code was omitted. Code generated by GCC 7.3.0.

Encoding

AVX512BW

vpermd      (%rsi,%rax,1),%zmm12,%zmm0
vmovdqa64   %zmm7,%zmm26
add         $0x30,%rax
add         $0x40,%rdx
vpshufb     %zmm11,%zmm0,%zmm0
vpandd      %zmm0,%zmm9,%zmm1
vpsllvw     %zmm6,%zmm0,%zmm0
vpsrlvw     %zmm8,%zmm1,%zmm1
vpternlogd  $0xca,%zmm1,%zmm0,%zmm26
vpcmpgtb    %zmm26,%zmm4,%k1
vpsubusb    %zmm5,%zmm26,%zmm1
vmovdqu8    %zmm3,%zmm1{%k1}
vpshufb     %zmm1,%zmm10,%zmm1
vpaddb      %zmm26,%zmm1,%zmm0
vmovdqu64   %zmm0,-0x40(%rdx)

AVX512VBMI

vpermb     (%rdi, %rax, 1), %zmm7, %zmm0
vmovdqa64  %zmm2, %zmm16
add        $0x30, %rax
add        $0x40, %rdx
vpandd     %zmm0, %zmm5, %zmm1
vpsllvw    %zmm3, %zmm0, %zmm0
vpsrlvw    %zmm4, %zmm1, %zmm1
vpternlogd $0xca, %zmm1, %zmm0, %zmm16
vpermb     %zmm6, %zmm16, %zmm0
vmovdqu64  %zmm0, -0x40(%rdx)

AVX512VL

vpermb         (%rsi, %rax, 1), %zmm3, %zmm0
add            $0x30, %rax
add            $0x40, %rdx
vpmultishiftqb %zmm0, %zmm1, %zmm0
vpermb         %zmm2, %zmm0, %zmm0
vmovdqu64      %zmm0, -0x40(%rdx)

Decoding

AVX512BW

vmovdqu64   (%rbx,%r14,1),%zmm3
vpsrld      $0x4,%zmm3,%zmm1
vpandd      %zmm4,%zmm1,%zmm1
vpcmpeqb    %zmm10,%zmm3,%k1
vpshufb     %zmm1,%zmm11,%zmm0
vpshufb     %zmm1,%zmm7,%zmm1
vmovdqa64   %zmm0,%zmm2
vpandd      %zmm3,%zmm4,%zmm0
vmovdqu8    %zmm9,%zmm2{%k1}
vpshufb     %zmm0,%zmm8,%zmm0
vptestmb    %zmm1,%zmm0,%k2
kmovq       %k2,%rax
cmp         $0xffffffffffffffff,%rax
je          error_handling
vpaddb      %zmm3,%zmm2,%zmm0
vpmaddubsw  %zmm6,%zmm0,%zmm0
vpmaddwd    %zmm5,%zmm0,%zmm0
vpshufb     %zmm13,%zmm0,%zmm0
add         $0x40,%r14
add         $0x30,%rcx
vpermd      %zmm0,%zmm12,%zmm0
vmovdqu64   %zmm0,-0x30(%rcx)

AVX512VBMI

vmovdqu64   (%r14, %r13, 1), %zmm0
vmovdqa64   %zmm5, %zmm1
vpermt2b    %zmm6, %zmm0, %zmm1
vporq       %zmm1, %zmm0, %zmm0
vpmovb2m    %zmm0, %k1
kmovq       %k1, %rax
test        %rax, %rax
jne         error_handling
vpmaddubsw  %zmm3, %zmm1, %zmm0
vpmaddwd    %zmm2, %zmm0, %zmm0
vpermb      %zmm0, %zmm4, %zmm0
add         $0x40, %r13
add         $0x30, %rdx
vmovdqu64   %zmm0, -0x30(%rdx)

Summary

Sample code

The repository contains implementations of both the encoding and decoding procedures.

Changes