AVX512VBMI2 and packed varuint format

Author:	Wojciech Muła
Added on:	2022-01-24
Last update:	2022-01-25

Contents

Introduction
Base variant
AVX512VBMI2 variant
Benchmarks
Sample code

Introduction

A quite popular varuint format lets to save an arbitrary integer number on a sequence of bytes. Each byte stores seven bits of information, and the most significant bit indicates whether the given byte is the last one.

Decoding such numbers is quite easy, but is not fast. This is the reason why Google came up with their packed varint format, that stores four numbers (from 1 to 4 byte each). In this format control bits and data bits are separated. The control bits are grouped into single byte: four pairs of bits encode lengths of four numbers.

Handling this format is way easier and is vectorizable. The control byte is used to fetch a shuffle pattern, which is then issued to PSHUFB. Then, this single instruction expands 4-16 data bytes into sixteen 32-bit numbers. Details are shown in the next section.

The packed format can be slightly modified to utilize the instruction VPEXPANDB defined in AVX512VBMI2. The instruction expands bytes according to an AVX512 write mask — it's exactly what the packed varint format needs.

Base variant

The byte layout of a single packed varuint:

[L0 L1 L2 L3][1-4 bytes][1-4 bytes][1-4 bytes][1-4 bytes]
|           |                                           |
|  control  |               4-16 data bytes             |
|  byte     |

A pack is saved on 5 to 17 bytes. The control byte is split into four 2-bit sub-words L{0,1,2,3}. A sub-word encodes the length of the corresponding number: 00 — one byte, 01 — two bytes, 10 --- three bytes and 11 — four bytes.

Below is shown a sample SSE implementation. It requires two 256-element auxiliary tables:

16-byte shuffle patterns,
total number of bytes.

void sse_unsafe_unpack(const std::string& buf, size_t count, uint32_t* output) {
    size_t j = 0;
    size_t i = 0;

    while (i < count) {
        const uint8_t code = buf[j];

        const __m128i input    = _mm_loadu_si128((const __m128i*)(&buf[j + 1]));
        const __m128i shuffle  = _mm_loadu_si128((const __m128i*)(&shuffle_patterns[code][0]));
        const __m128i expanded = _mm_shuffle_epi8(input, shuffle);
        j += 1 + input_lengths[code];

        _mm_storeu_si128((__m128i*)(&output[i]), expanded);
        i += 4;
    }
}

An obvious drawback of the above code is that it may touch memory outside buf and output, but the key was simplicity not safety.

As we see, the procedure is really simple. There are three fetches from memory (control byte, shuffle pattern and input length) and a single byte shuffle.

AVX512VBMI2 variant

In the SSE implementation we can handle up to 16 input bytes. The AVX512 gives us 64 bytes, thus a single pack can store sixteen 32-bit numbers.

We keep the base format of control byte: 2-bit sub-words encode lengths. Since we handle 16 numbers, it means we need four control bytes. Thus the memory layout is as follows:

[CTL0][CTL1][CTL2][CTL3][1-4 bytes][1-4 bytes]...[1-4 bytes]
|                      |                                   |
|   4 control bytes    |          16-64 data bytes         |

To simplify encoding of the lengths we interleave pairs of lengths (nibbles) in the following way (Lx is a 2-bit sub-word):

[L0 L1 L8 L9|L2 L3 L10 L11|L4 L5 L12 L13|L6 L7 L14 L15]
|           |             |             |             |
|   CTL0    |     CTL1    |     CTL3    |    CTL4     |

Thanks to that we can easily expand nibbles. Having the nibbles separated, we can then convert them quickly with PSHUFB into a bitmask required by VPEXPANDB.

The AVX512 decoding algorithm:

Load control words.

// [f e 7 6|d c 5 4|b a 3 2|9 8 1 0] <- length sub-words indices
const uint32_t codewords = unsafe_read_4bytes(buf, j);

Expand nibbles into a 64-bit word.

// [0 0 7 6|0 0 5 4|0 0 3 2|0 0 1 0]
//      ---
//      single nibble
const uint64_t lo_nibbles = codewords & 0x0f0f0f0f;

// [f e 0 0|d c 0 0|b a 0 0|9 8 0 0]
const uint64_t hi_nibbles = codewords & 0xf0f0f0f0;

// [0 0 f e|0 0 d c|0 0 b a|0 0 9 8|0 0 7 6|0 0 5 4|0 0 3 2|0 0 1 0]
const uint64_t nibbles = lo_nibbles | (hi_nibbles << 28);

Convert lengths into a four-bit bitmask. For 1-byte numbers the bitmask is 0001, for 2-byte numbers: 0011, for 3-byte numbers: 0111 and for 4-byte numbers: 1111. Since each byte of nibless contains two lengths, we convert both into an 8-bit bitmask with a single PSHUFB.

const __m128i masks = _mm_setr_epi8(
    /* 00_00 */ 0x11,
    /* 00_01 */ 0x13,
    /* 00_10 */ 0x17,
    /* 00_11 */ 0x1f,
    /* 01_00 */ 0x31,
    /* 01_01 */ 0x33,
    /* 01_10 */ 0x37,
    /* 01_11 */ 0x3f,
    /* 10_00 */ 0x71,
    /* 10_01 */ 0x73,
    /* 10_10 */ 0x77,
    /* 10_11 */ 0x7f,
    /* 11_00 */ char(0xf1),
    /* 11_01 */ char(0xf3),
    /* 11_10 */ char(0xf7),
    /* 11_11 */ char(0xff)
);

const __m128i t0       = _mm_cvtsi64x_si128(nibbles);
const __m128i v_mask   = _mm_shuffle_epi8(masks, t0);
const __mmask64 mask   = _mm_cvtsi128_si64x(v_mask);

Load data into AVX512 register.

const __m512i input    = _mm512_loadu_si512((const __m512*)(&buf[j + 4]));

Convert packed bytes in sixteen 32-bit numbers. Yes, this is one instruction.
```
const __m512i expanded = _mm512_maskz_expand_epi8(mask, input);
```
Store the result and advance pointers. Please not the unlike the SSE variant, we obtain the number of consumed byte by a cheap population count on the mask.
```
_mm512_storeu_si512((__m512i*)(&output[i]), expanded);
i += 16;
j += 4 + __builtin_popcountll(mask);
```

Benchmarks

The benchmark utility from repository was used. It encodes the given count of 32-bit words into one of packed varint formats. Then decoding procedure decodes it 100 times into the same output array.

CPU: Ice Lake Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Compiler: GCC 10.3.1 20210422 (Red Hat 10.3.1-1)

procedure	input size [32-bit words]	best cycles/op	avg cycles/op
SSE	100,000	2.785	2.800
AVX512VBMI2	100,000	1.044	1.079
SSE	1,000,000	2.792	2.808
AVX512VBMI2	1,000,000	1.059	1.077
SSE	10,000,000	2.857	2.875
AVX512VBMI2	10,000,000	1.330	1.385

Note that these procedures were not tuned in any way. I am sure there is room for improvement in the both implementations.
The AVX512VBMI2 procedure is more than two times faster than the SSE version. Seems it should be rather close to 4 times, as the AVX512 routine processes four times more data.
The AVX512VBMI2 format requires exactly the same number of bytes as the base variant.

Sample code

Sample code is available.