AVX512: histogram of sixteen nibbles

Author: Wojciech Muła 2023-01-06

Introduction

At Sneller we had the following problem: there are sixteen 4-bit values, we need a histogram for that limited set.

Since the input set has the fixed size of 64 bits, the problem is not that difficult as a generic case.

The basic problem we solve is counting how many nibbles are present in the given 64-bit set. Then, the solution to the initial problem is performing the basic step for each possible nibble value. And it's is done in parallel.

AVX-512 counting nibbles in 64-bit words

We obviously might expand 4-bit words into bytes, and then use the byte-level comparison (like VPCMPEQB). But expansion is not very fast. A better solution works at the nibble level.

1. The input is the following 16-element set x:

```x = |ab|02|bf|3b|aa|54|b2|b0|
^^                   ^^
byte 7           byte 0 in a 64-bit word
```
2. First we populate the wanted nibble in a 64-bit word. Let the nibble is 0xb; it's present in the x word 5 times.

```c = |bb|bb|bb|bb|bb|bb|bb|bb|
```
3. Next, we XOR the input word with constant one:

```t0 = x ^ c
= |ab|02|bf|3b|aa|54|b2|b0| xor
|bb|bb|bb|bb|bb|bb|bb|bb|
= |10|b9|04|80|11|ef|09|0b|
^    ^   ^       ^  ^
nibbles equal to b
```
4. The 0 nibble indicates the nibble is equal to b. Now our goal is to convert 0 nibbles into 1, and non-zero nibbles into 0. The exact procedure is shown in the next subsection.

```t1 = nibble == 0 ? 1 : 0
= |01|00|10|01|00|00|10|10|
```
5. The next step is to add nibbles together. Since we're targeting AVX-512, we may use VPSADBW that adds bytes in 64-bit lanes. Note that bytes in word t1 are only: 0x00, 0x01, 0x10, 0x11. If we add eight bytes having these values, there won't be overflow. We get a byte whose nibbles are sum of nibbles (0 or 1) from the input.:

```t2 = |00|00|00|00|00|00|00|32|
```
6. The final step is to add the nibbles from the lowest byte. Note that the t2 as uint64 is never bigger than 0x88, thus we may freely use 32-bit or 64-bit arithmetic instructions.:

```t3 = (t2 & 0x0f) + (t2 >> 4)
= |00|00|00|00|00|00|00|05|
```

Note: if the extension AVX-512VPOPCNTDQ is available, the steps #4 and #5 fold into a single invocation of VPOPCNTQ.

The full AVX-512 solution uses two pairs of registers:

1. One pair contains the broadcast input word.
2. Another pair contains 64-bit words filled with nibbles from 0 to f

Then the counting algorithm is performed on these registers. The final pair of registers contains the required histogram.

```for i in 0..7 {
inA[i] = uint64(x)
inB[i] = uint64(x)
constA[i] = 0x1111111111111111 * uint64(i)
constB[i] = 0x1111111111111111 * uint64(i + 8)
}

outA = count_nibble(inA, constA)
outB = count_nibble(inB, constB)
```

Converting nibbles

The following algorithm can be applied to bitfields of any size; here we focus on nibbles.

1. The initial nibbles (examples show just two cases: zero and non-zero):

```t0 = |0000|1101|
```
2. Mask out the most significant bits of nibbles:

```t1 = t0 & 0b0111_0111
= |0000|0101|
```
3. Add 0b0111 to each nibble. If any lower bit -- 0, 1 or 2 -- was set, the carry bit from addition will appear the 3rd bit.

```t2 = t1 + 0b0111_0111
= |0000|1010|
```
4. Combine 3rd bits of nibbles from the initial word with the addition result. Now the 3rd bit is set if any bit of the nibble was 1.

```t3 = t2 | t1
= |0000|1110|
```
5. Leave only the 3rd bit and negate it. Note that the AVX-512 ternary logic instruction (VPTERNLOG) is sufficient to evaluate expression t4 = ~(t2 | t1) & 0b1000_1000.

```t4 = (~t3 & 0b1000_1000)
= |1000|0000|
```
6. Finally, we have to convert 0b1000 into 0b1111 and keep 0b0000 intact. Simple shift right by 3 does the job:

```t5 = t4 >> 3
= |0001|0000|
```

Source code

Sample implementation is available at GitHub.