Base64 encoding & decoding using AVX512BW instructions

Author: Wojciech Muła
Added on: 2016-04-03
Updated on: 2018-04-18 (show both AVX512BW and AVX512VBMI solutions)

Introduction update

The SIMD versions of base64 conversion algorithms were described in Base64 encoding with SIMD instructions and Base64 decoding with SIMD instructions. I also described how both encoding and decoding can be realized using AVX512F (Foundation) instructions.

AVX512BW (Byte & Word) comes with a great number of new instructions; the following ones can help in base64-related problems:

  • vpshufb: the byte shuffle known from SSE/AVX2, now available on 512-bit registers (it still works within 128-bit lanes),
  • vpsllvw and vpsrlvw: variable per-word shifts,
  • byte-granularity comparisons and masked operations (vpcmpgtb, vpsubusb, vmovdqu8).

The extension AVX512VBMI adds even more powerful instructions:

  • vpermb: a byte shuffle working across the whole 512-bit register,
  • vpermi2b: a byte shuffle picking bytes from a pair of registers, effectively a 128-byte lookup table,
  • vpmultishiftqb: just one, but a really nice instruction, which gathers arbitrary octets from 64-bit words.

2018-04-18: In earlier versions of this text I wrongly assumed that the instructions vpermb and vpermi2b are part of AVX512BW. Sorry for that.

Encoding

The SIMD encoding consists of the following steps:

  1. In each iteration 48 bytes are loaded (sixteen 24-bit words).
  2. Split the 24-bit words into separate 32-bit lanes.
  3. Within each 32-bit lane move the 6-bit words to separate bytes.
  4. Convert the 6-bit indices to ASCII according to the base64 lookup table (a scalar reference sketch follows the list).
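
A scalar sketch of steps 1 .. 3 for a single 24-bit triple (my own illustration, not part of the SIMD code; step 4 then maps the indices through the lookup table):

// scalar reference: expand one 24-bit triple into four 6-bit indices
void encode_triple(const uint8_t* src, uint8_t* dst) {
    const uint32_t word = (uint32_t(src[0]) << 16) | (uint32_t(src[1]) << 8) | src[2];
    dst[0] = (word >> 18) & 0x3f;   // a
    dst[1] = (word >> 12) & 0x3f;   // b
    dst[2] = (word >>  6) & 0x3f;   // c
    dst[3] =  word        & 0x3f;   // d
}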

Steps 1 & 2 — Loading data and splitting bytes

In the SSE code, loading data and splitting 24-bit words into 32-bit lanes is done by the pshufb instruction. In AVX2 this instruction operates on 128-bit lanes, which makes the step more complicated.

AVX512BW new

In the case of AVX512BW, two shuffles are needed.

  1. Initial layout of the register after loading the data. We need only triplets A .. P:

    [????|????|????|????|PPPO|OONN|NMMM|LLLK|KKJJ|JIII|HHHG|GGFF|FEEE|DDDC|CCBB|BAAA]
    |     lane 3        |     lane 2        |     lane 1        |     lane 0        |
    
  2. Move 32-bit words across lanes using vpermd:

    [0000|PPPO|OONN|NMMM|0000|LLLK|KKJJ|JIII|0000|HHHG|GGFF|FEEE|0000|DDDC|CCBB|BAAA]
    |     lane 3        |     lane 2        |     lane 1        |     lane 0        |
    
  3. Reshuffle bytes within 128-bit lanes to the format required by the AVX512BW/AVX512VBMI unpack procedures, which are shown in the next section.

    The input layout of a single lane (the 0th):

    [00 00 00 00|D2 D1 D0 C2|C1 C0 B2 B1|B0 A2 A1 A0]
    |  4 bytes  |
    

    The layout after shuffling; the order of bytes within a 32-bit word is 1, 2, 0, 1:

    [D1 D2 D0 D1|C1 C2 C0 C1|B1 B2 B0 B1|A1 A2 A0 A1]
     10 11 9  10 7  8  6  7  4  5  3  4  1  2  0  1
    

The first shuffle moves data across the 128-bit lanes, then in each lane bytes are shuffled into the desired positions.

// load the input; only the lower 48 bytes of the register are used
const __m512i tmp1 = _mm512_loadu_si512(input + i);

// move 32-bit words across the 128-bit lanes (vpermd)
const __m512i tmp2 = _mm512_permutexvar_epi32(
    _mm512_set_epi32(-1, 11, 10, 9, -1, 8, 7, 6, -1, 5, 4, 3, -1, 2, 1, 0),
    tmp1
);

// reshuffle bytes within the 128-bit lanes (vpshufb); byte order 1, 2, 0, 1
const __m512i tmp3 = _mm512_shuffle_epi8(
    tmp2,
    _mm512_set4_epi32(0x0a0b090a, 0x07080607, 0x04050304, 0x01020001)
);

AVX512VBMI update

In the case of AVX512VBMI the problem is solved with a single invocation of vpermb.

// load 48 bytes
// v = [...|DDDC|CCBB|BAAA]
const __m512i v = _mm512_loadu_si512(reinterpret_cast<const __m512i*>(input + i));

// split 24-bit words to 32-bit lanes
// in = [...|0DDD|0CCC|0BBB|0AAA]
const __m512i in = _mm512_permutexvar_epi8(shuffle_input, v);
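
The constant shuffle_input is not defined in the snippet. Below is a minimal sketch of how it could be built, assuming it realizes the same 1, 2, 0, 1 byte order as the AVX512BW variant (the zero bytes in the comment above are really don't-care bytes, since vpermb cannot produce zeros):

// hypothetical helper: dword i of the output takes input bytes
// (3i+1, 3i+0, 3i+2, 3i+1), i.e. the 1, 2, 0, 1 order described earlier
__m512i build_shuffle_input() {
    uint8_t idx[64];
    for (int i = 0; i < 16; i++) {
        idx[4*i + 0] = uint8_t(3*i + 1);
        idx[4*i + 1] = uint8_t(3*i + 0);
        idx[4*i + 2] = uint8_t(3*i + 2);
        idx[4*i + 3] = uint8_t(3*i + 1);
    }
    return _mm512_loadu_si512(idx);
}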

Step 3 — moving 6-bit words to separate bytes

Each 32-bit input word contains four 6-bit fields a, b, c and d; the expected output from this step:

[00dddddd|00cccccc|00bbbbbb|00aaaaaa]

AVX512F

The input order of the fields is as follows:

[????????|ccdddddd|bbbbcccc|aaaaaabb]

Below is the AVX512F version from the article linked above. The lack of byte-level instructions makes the procedure pretty complicated.

template <int shift, uint32_t mask>
__m512i merge(__m512i target, __m512i src) {
    // a positive shift value means shifting right, a negative one means shifting left
    __m512i shifted;
    if (shift > 0) {
        shifted = _mm512_srli_epi32(src, shift);
    } else {
        shifted = _mm512_slli_epi32(src, -shift);
    }

    // 0xca selects bits: (mask & shifted) | (~mask & target)
    return _mm512_ternarylogic_epi32(_mm512_set1_epi32(mask), shifted, target, 0xca);
}

__m512i unpack(const __m512i in) {
    // [00000000|00000000|00000000|00aaaaaa]
    __m512i indices = _mm512_and_si512(_mm512_srli_epi32(in, 2), packed_dword(0x0000003f));

    // [00000000|00000000|00BB0000|00aaaaaa]
    indices = merge<-12, 0x00003000>(indices, in);

    // [00000000|00000000|00BBbbbb|00aaaaaa]
    indices = merge<  4, 0x00000f00>(indices, in);

    // [00000000|00CCCC00|00BBbbbb|00aaaaaa]
    indices = merge<-10, 0x003c0000>(indices, in);

    // [00000000|00CCCCcc|00BBbbbb|00aaaaaa]
    indices = merge<  6, 0x00030000>(indices, in);

    // [00dddddd|00CCCCcc|00BBbbbb|00aaaaaa]
    indices = merge< -8, 0x3f000000>(indices, in);

    return indices;
}
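
The helpers packed_byte, packed_dword and packed_qword appear throughout the snippets but are not defined in this text; they are assumed to be simple broadcast wrappers:

// assumed definitions of the broadcast helpers used in the snippets
__m512i packed_byte(uint8_t b)   { return _mm512_set1_epi8(int8_t(b)); }
__m512i packed_dword(uint32_t d) { return _mm512_set1_epi32(int32_t(d)); }
__m512i packed_qword(uint64_t q) { return _mm512_set1_epi64(int64_t(q)); }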

AVX512BW

The unpacking can be performed faster with the help of the vpsllvw and vpsrlvw instructions. This requires a change in the 32-bit lane format from:

[????????|ccdddddd|bbbbcccc|aaaaaabb]
           byte 2   byte 1   byte 0

to:

[bbbbcccc|ccdddddd|aaaaaabb|bbbbcccc]
 ^^^^                           ^^^^
 unused bits             unused bits

 byte 1    byte 2   byte 0   byte 1

This requires only a different vector passed to vpshufb (AVX512BW) or vpermb (AVX512VBMI) when splitting bytes.

Algorithm:

  1. Isolate the fields a and c.

    // t0    = [0000cccc|cc000000|aaaaaa00|00000000]
    const __m512i t0 = _mm512_and_si512(in, _mm512_set1_epi32(0x0fc0fc00));

  2. Shift the field a right by 10 bits and the field c by 6 bits.

    // t1    = [00000000|00cccccc|00000000|00aaaaaa]
    const __m512i t1 = _mm512_srlv_epi16(t0, _mm512_set1_epi32(0x0006000a));

  3. Shift the field b left by 4 bits and the field d by 8 bits (note that no masking is done).

    // t2    = [ccdddddd|00000000|aabbbbbb|cccc0000]
    const __m512i t2 = _mm512_sllv_epi16(in, _mm512_set1_epi32(0x00080004));

  4. Finally copy the selected bits from t2 into t1; the ternary-logic constant 0xca performs a bitwise select: (mask & t2) | (~mask & t1).

    // indices = [00dddddd|00cccccc|00bbbbbb|00aaaaaa]
    const __m512i indices = _mm512_ternarylogic_epi32(_mm512_set1_epi32(0x3f003f00), t2, t1, 0xca);

The procedure costs just two shifts and two bitwise operations.
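
For reference, the four statements gathered into a single function (the name unpack_bw is mine):

__m512i unpack_bw(const __m512i in) {
    // isolate the fields a and c
    const __m512i t0 = _mm512_and_si512(in, _mm512_set1_epi32(0x0fc0fc00));
    // per-word variable shifts: a >> 10, c >> 6
    const __m512i t1 = _mm512_srlv_epi16(t0, _mm512_set1_epi32(0x0006000a));
    // per-word variable shifts: b << 4, d << 8
    const __m512i t2 = _mm512_sllv_epi16(in, _mm512_set1_epi32(0x00080004));
    // bitwise select: (mask & t2) | (~mask & t1)
    return _mm512_ternarylogic_epi32(_mm512_set1_epi32(0x3f003f00), t2, t1, 0xca);
}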

AVX512VBMI

AVX512VBMI also defines the instruction vpmultishiftqb, which may replace all the variable shift instructions from the previous point. Please note that the layout of 32-bit lanes requires the same modification as described in the previous point.

The instruction builds a vector of bytes from octets located at any bit position within a quadword. The following pseudocode shows the algorithm:

for i in 0 .. 7 loop
    qword := input.qword[i];

    for j in 0 .. 7 loop
        index := indices.byte[i * 8 + j];
        output.byte[i * 8 + j] := rotate_right(qword, index) and 0xff;
    end loop
end loop
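
The same semantics expressed in plain C (an illustration of the definition above, not an intrinsic):

// reference model of vpmultishiftqb for a single output byte: take the
// 8 bits of qword starting at bit index (mod 64), wrapping around
uint8_t multishift_byte(uint64_t qword, unsigned index) {
    index &= 63;
    const uint64_t rotated = (qword >> index) | (index ? (qword << (64 - index)) : 0);
    return uint8_t(rotated & 0xff);
}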

Although vpmultishiftqb produces a vector of bytes while the encoding needs just the 6 lower bits, no masking is needed: the instruction vpermb (used in the next step) does the masking internally.

Below is a code snippet that shows the proper parameters for vpmultishiftqb.

// after multishift a single 32-bit lane has following layout:
// [bbbbcccc|bbcccccc|aabbbbbb|ddaaaaaa],
// i.e.: (a = [10:17], b = [4:11], c = [22:27], d = [16:21])

const __m512i shifts  = packed_qword(0x3036242a1016040alu); // 48, 54, 36, 42, 16, 22, 4, 10
const __m512i indices = _mm512_multishift_epi64_epi8(shifts, in);

Step 4 — conversion to ASCII

The last part of the algorithm is converting, in parallel, all 6-bit indices into ASCII codes.

AVX512BW new

Although vpshufb works on 128-bit lanes, it is sufficient for the conversion. The fastest SSE procedure can be translated one-to-one:

__m512i lookup_pshufb_improved(const __m512i input) {

    // reduce  0..51 -> 0
    //        52..61 -> 1 .. 10
    //            62 -> 11
    //            63 -> 12
    __m512i result = _mm512_subs_epu8(input, packed_byte(51));

    // distinguish between ranges 0..25 and 26..51:
    //         0 .. 25 -> remains 0
    //        26 .. 51 -> becomes 13
    const __mmask64 less = _mm512_cmpgt_epi8_mask(packed_byte(26), input);
    result = _mm512_mask_mov_epi8(result, less, packed_byte(13));

    /* the SSE lookup is readable
        const __m128i shift_LUT = _mm_setr_epi8(
            'a' - 26, '0' - 52, '0' - 52, '0' - 52, '0' - 52, '0' - 52,
            '0' - 52, '0' - 52, '0' - 52, '0' - 52, '0' - 52, '+' - 62,
            '/' - 63, 'A', 0, 0
        );
    */
    const __m512i shift_LUT = _mm512_set4_epi32(
        0x000041f0,
        0xedfcfcfc,
        0xfcfcfcfc,
        0xfcfcfc47
    );

    // read shift
    result = _mm512_shuffle_epi8(shift_LUT, result);

    return _mm512_add_epi8(result, input);
}

AVX512VBMI

The already introduced instruction vpermb performs a lookup in a 64-byte AVX512 register, which is exactly the size of the lookup table needed by the encoding algorithm. Thus just a single invocation of the instruction is needed.

const __m512i result = _mm512_permutexvar_epi8(indices, lookup);
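
The lookup register simply holds the 64-character base64 alphabet (a sketch; the array name is mine):

// the base64 alphabet fills exactly one 64-byte register
const char base64_alphabet[64 + 1] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
const __m512i lookup = _mm512_loadu_si512(base64_alphabet);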

Decoding

The SIMD decoding consists of the following steps:

  1. Translate in parallel from ASCII into 6-bit values stored in separate bytes; at this stage the error detection is performed.
  2. Pack the 6-bit data into a continuous bit stream — the result has 48 bytes (a scalar reference sketch follows the list).
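
A scalar sketch of both steps for a single quadruple of characters (an illustration only; here lookup is a plain 256-entry translation table):

// scalar reference: decode four base64 characters into three bytes
void decode_quad(const uint8_t lookup[256], const uint8_t* src, uint8_t* dst) {
    const uint32_t word = (uint32_t(lookup[src[0]]) << 18)
                        | (uint32_t(lookup[src[1]]) << 12)
                        | (uint32_t(lookup[src[2]]) <<  6)
                        |  uint32_t(lookup[src[3]]);
    dst[0] = uint8_t(word >> 16);
    dst[1] = uint8_t(word >>  8);
    dst[2] = uint8_t(word);
}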

Step 1 — translation from ASCII

AVX512BW new

In the case of decoding the limitations of vpshufb are also not a problem, so the fastest SSE decoding solution can be applied. Below is the actual implementation.

__m512i lookup_pshufb_bitmask(const __m512i input) {

    const __m512i higher_nibble = _mm512_and_si512(_mm512_srli_epi32(input, 4), packed_byte(0x0f));
    const __m512i lower_nibble  = _mm512_and_si512(input, packed_byte(0x0f));

    const __m512i shiftLUT = _mm512_set4lanes_epi8(
        0,   0,  19,   4, -65, -65, -71, -71,
        0,   0,   0,   0,   0,   0,   0,   0);

    const __m512i maskLUT  = _mm512_set4lanes_epi8(
        /* 0        : 0b1010_1000*/ 0xa8,
        /* 1 .. 9   : 0b1111_1000*/ 0xf8, 0xf8, 0xf8, 0xf8,
                                    0xf8, 0xf8, 0xf8, 0xf8,
                                    0xf8,
        /* 10       : 0b1111_0000*/ 0xf0,
        /* 11       : 0b0101_0100*/ 0x54,
        /* 12 .. 14 : 0b0101_0000*/ 0x50, 0x50, 0x50,
        /* 15       : 0b0101_0100*/ 0x54
    );

    const __m512i bitposLUT = _mm512_set4lanes_epi8(
        0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
    );

    const __m512i   sh      = _mm512_shuffle_epi8(shiftLUT,  higher_nibble);
    const __mmask64 eq_2f   = _mm512_cmpeq_epi8_mask(input, packed_byte(0x2f));
    const __m512i   shift   = _mm512_mask_mov_epi8(sh, eq_2f, packed_byte(16));

    const __m512i M         = _mm512_shuffle_epi8(maskLUT,   lower_nibble);
    const __m512i bit       = _mm512_shuffle_epi8(bitposLUT, higher_nibble);

    const uint64_t match    = _mm512_test_epi8_mask(M, bit);

    if (match != uint64_t(-1)) {
        // report error
    }

    return _mm512_add_epi8(input, shift);
}

AVX512VBMI new

This step is a perfect place to utilize the instruction vpermi2b; it requires three registers:

  • indices,
  • the lower & higher halves of a 128-item lookup table.

Valid base64 input characters are always plain ASCII, so they never have the most significant bit (MSB) set. Thanks to that the seven lowest bits of the input can be used directly as indices for vpermi2b. (The instruction simply ignores the MSB, so no masking is required.)

The lookup table has to be precalculated. It translates an ASCII code into either 6-bit data or an invalid-character marker, whose value is 0x80. Thanks to the marker both extended ASCII and invalid characters can be identified in one step; a sketch of the table construction is shown below.
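
A sketch of how the 128-entry table might be precalculated; precalc::lookup_0 and precalc::lookup_1 would hold its lower and upper 64 bytes:

// build the ASCII -> 6-bit translation table; 0x80 marks invalid characters
uint8_t table[128];
for (int i = 0; i < 128; i++) table[i] = 0x80;
for (int c = 'A'; c <= 'Z'; c++) table[c] = uint8_t(c - 'A');
for (int c = 'a'; c <= 'z'; c++) table[c] = uint8_t(c - 'a' + 26);
for (int c = '0'; c <= '9'; c++) table[c] = uint8_t(c - '0' + 52);
table['+'] = 62;
table['/'] = 63;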

__m512i lookup(const __m512i input) {

    const __m512i lookup_0 = precalc::lookup_0;
    const __m512i lookup_1 = precalc::lookup_1;

    // translate the 7-bit indices through the 128-byte table (vpermi2b)
    const __m512i translated = _mm512_permutex2var_epi8(lookup_0, input, lookup_1);

    // convert MSBs to a mask: an MSB set in translated marks an invalid
    // character, an MSB set in input marks a non-ASCII byte
    const uint64_t mask = _mm512_movepi8_mask(_mm512_or_si512(translated, input));
    if (mask) {
        // report error
    }

    return translated;
}

Step 2 — Packing bit-fields

The final stage of decoding is packing all the 6-bit fields into a contiguous array of 48 bytes. It is done in two steps:

  1. Pack the four fields within each 32-bit word into a 24-bit word.
  2. Move these 3-byte words into a contiguous array.

Step A

The first step is a direct translation of the SSE code. It uses a multiply-add instruction twice; with the given constants each invocation performs two shifts and one bitwise or at once. (An AVX512BW translation is sketched after the snippet.)

// input:  [00dddddd|00cccccc|00bbbbbb|00aaaaaa]

// merge:  [0000cccc|ccdddddd|0000aaaa|aabbbbbb]
const __m128i merge_ab_and_bc = _mm_maddubs_epi16(values, packed_dword(0x01400140));

// result: [00000000|aaaaaabb|bbbbcccc|ccdddddd]
return _mm_madd_epi16(merge_ab_and_bc, packed_dword(0x00011000));
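
The same step on 512-bit registers is a matter of swapping the prefixes, as AVX512BW provides both multiply-adds (a sketch; values holds the 6-bit fields):

// merge:  [0000cccc|ccdddddd|0000aaaa|aabbbbbb]
const __m512i merge_ab_and_bc = _mm512_maddubs_epi16(values, _mm512_set1_epi32(0x01400140));

// result: [00000000|aaaaaabb|bbbbcccc|ccdddddd]
const __m512i packed = _mm512_madd_epi16(merge_ab_and_bc, _mm512_set1_epi32(0x00011000));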

Step B — AVX512BW new

  1. Initial layout of the register after step A:

    [0PPP|0OOO|0NNN|0MMM|0LLL|0KKK|0JJJ|0III|0HHH|0GGG|0FFF|0EEE|0DDD|0CCC|0BBB|0AAA]
    |     lane 3        |     lane 2        |     lane 1        |     lane 0        |
    
  2. Pack bytes within lanes using vpshufb; the topmost doubleword in each lane is left empty:

    [0000|PPPO|OONN|NMMM|0000|LLLK|KKJJ|JIII|0000|HHHG|GGFF|FEEE|0000|DDDC|CCBB|BAAA]
    |     lane 3        |     lane 2        |     lane 1        |     lane 0        |
    
  3. Move the non-empty 32-bit words across lanes using vpermd to form a contiguous 48-byte array:

    [0000|0000|0000|0000|PPPO|OONN|NMMM|LLLK|KKJJ|JIII|HHHG|GGFF|FEEE|DDDC|CCBB|BAAA]
    |     lane 3        |     lane 2        |     lane 1        |     lane 0        |
    

Below is a sample implementation.

// pack bytes within 128-bit lanes (vpshufb)
const __m512i t1 = _mm512_shuffle_epi8(
    packed,
    _mm512_set4lanes_epi8(
         2,  1,  0,
         6,  5,  4,
        10,  9,  8,
        14, 13, 12,
        -1, -1, -1, -1)
);

// move the non-empty 32-bit words across lanes (vpermd)
const __m512i s6 = _mm512_setr_epi32(
     0,  1,  2,
     4,  5,  6,
     8,  9, 10,
    12, 13, 14,
    // unused
     0,  0,  0, 0);

const __m512i t2 = _mm512_permutexvar_epi32(s6, t1);

_mm512_storeu_si512(reinterpret_cast<__m512i*>(out), t2);

Step B — AVX512VBMI

In the case of AVX512VBMI this step requires just a single vpermb; a sketch is shown below.
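
A minimal sketch, assuming a pack_index constant of my own making that selects the three low bytes of every 32-bit word:

// indices 0, 1, 2, 4, 5, 6, 8, 9, 10, ...; skip every fourth byte
uint8_t idx[64];
for (int i = 0; i < 16; i++) {
    idx[3*i + 0] = uint8_t(4*i + 0);
    idx[3*i + 1] = uint8_t(4*i + 1);
    idx[3*i + 2] = uint8_t(4*i + 2);
}
for (int i = 48; i < 64; i++) idx[i] = 0; // the upper 16 output bytes are unused

const __m512i pack_index = _mm512_loadu_si512(idx);
const __m512i t = _mm512_permutexvar_epi8(pack_index, packed);

// store only the 48 significant bytes
_mm512_mask_storeu_epi8(out, 0x0000ffffffffffffull, t);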

Appendix — AVX512BW assembly dump

Below is the assembly dump of the inner encoding loop. GCC 7.3.0 was able to cache all auxiliary values in registers.

Encoding:

62 f2 1d 48 36 04 06    vpermd (%rsi,%rax,1),%zmm12,%zmm0
62 61 fd 48 6f d7       vmovdqa64 %zmm7,%zmm26
48 83 c0 30             add    $0x30,%rax
48 83 c2 40             add    $0x40,%rdx
62 d2 7d 48 00 c3       vpshufb %zmm11,%zmm0,%zmm0
62 f1 35 48 db c8       vpandd %zmm0,%zmm9,%zmm1
62 f2 fd 48 12 c6       vpsllvw %zmm6,%zmm0,%zmm0
62 d2 f5 48 10 c8       vpsrlvw %zmm8,%zmm1,%zmm1
62 63 7d 48 25 d1 ca    vpternlogd $0xca,%zmm1,%zmm0,%zmm26
62 91 5d 48 64 ca       vpcmpgtb %zmm26,%zmm4,%k1
62 f1 2d 40 d8 cd       vpsubusb %zmm5,%zmm26,%zmm1
62 f1 7f 49 6f cb       vmovdqu8 %zmm3,%zmm1{%k1}
62 f2 2d 48 00 c9       vpshufb %zmm1,%zmm10,%zmm1
62 91 75 48 fc c2       vpaddb %zmm26,%zmm1,%zmm0
62 f1 fe 48 7f 42 ff    vmovdqu64 %zmm0,-0x40(%rdx)

Summary

Sample code

The sample code repository contains implementations of both the encoding and decoding procedures.
