Author: | Wojciech Muła |
---|---|

Date: | 2023-01-06 |

At Sneller we had the following problem: there are sixteen 4-bit values, we need a histogram for that limited set.

Since the input set has the fixed size of 64 bits, the problem is not that difficult as a generic case.

The basic problem we solve is counting how many nibbles are present in the given 64-bit set. Then, the solution to the initial problem is performing the basic step for each possible nibble value. And it's is done in parallel.

We obviously might expand 4-bit words into bytes, and then use the byte-level
comparison (like `VPCMPEQB`). But expansion is not very fast. A better
solution works at the nibble level.

The input is the following 16-element set

`x`:x = |ab|02|bf|3b|aa|54|b2|b0| ^^ ^^ byte 7 byte 0 in a 64-bit word

First we populate the wanted nibble in a 64-bit word. Let the nibble is

`0xb`; it's present in the`x`word 5 times.c = |bb|bb|bb|bb|bb|bb|bb|bb|

Next, we XOR the input word with constant one:

t0 = x ^ c = |ab|02|bf|3b|aa|54|b2|b0| xor |bb|bb|bb|bb|bb|bb|bb|bb| = |10|b9|04|80|11|ef|09|0b| ^ ^ ^ ^ ^ nibbles equal to b

The 0 nibble indicates the nibble is equal to

`b`. Now our goal is to convert 0 nibbles into 1, and non-zero nibbles into 0. The exact procedure is shown in the next subsection.t1 = nibble == 0 ? 1 : 0 = |01|00|10|01|00|00|10|10|

The next step is to add nibbles together. Since we're targeting AVX-512, we may use

`VPSADBW`that adds**bytes**in 64-bit lanes. Note that bytes in word`t1`are only: 0x00, 0x01, 0x10, 0x11. If we add eight bytes having these values, there won't be overflow. We get a byte whose nibbles are sum of nibbles (0 or 1) from the input.:t2 = |00|00|00|00|00|00|00|32|

The final step is to add the nibbles from the lowest byte. Note that the

`t2`as`uint64`is never bigger than`0x88`, thus we may freely use 32-bit or 64-bit arithmetic instructions.:t3 = (t2 & 0x0f) + (t2 >> 4) = |00|00|00|00|00|00|00|05|

Note: if the extension AVX-512VPOPCNTDQ is available, the steps
#4 and #5 fold into a single invocation of `VPOPCNTQ`.

The full AVX-512 solution uses two pairs of registers:

- One pair contains the broadcast input word.
- Another pair contains 64-bit words filled with nibbles
from
`0`to`f`

Then the counting algorithm is performed on these registers. The final pair of registers contains the required histogram.

for i in 0..7 { inA[i] = uint64(x) inB[i] = uint64(x) constA[i] = 0x1111111111111111 * uint64(i) constB[i] = 0x1111111111111111 * uint64(i + 8) } outA = count_nibble(inA, constA) outB = count_nibble(inB, constB)

The following algorithm can be applied to bitfields of any size; here we focus on nibbles.

The initial nibbles (examples show just two cases: zero and non-zero):

t0 = |0000|1101|

Mask out the most significant bits of nibbles:

t1 = t0 & 0b0111_0111 = |0000|0101|

Add

`0b0111`to each nibble. If any lower bit -- 0, 1 or 2 -- was set, the carry bit from addition will appear the 3rd bit.t2 = t1 + 0b0111_0111 = |0000|1010|

Combine 3rd bits of nibbles from the initial word with the addition result. Now the 3rd bit is set if any bit of the nibble was 1.

t3 = t2 | t1 = |0000|1110|

Leave only the 3rd bit and negate it. Note that the AVX-512 ternary logic instruction (

`VPTERNLOG`) is sufficient to evaluate expression`t4 = ~(t2 | t1) & 0b1000_1000`.t4 = (~t3 & 0b1000_1000) = |1000|0000|

Finally, we have to convert

`0b1000`into`0b1111`and keep`0b0000`intact. Simple shift right by 3 does the job:t5 = t4 >> 3 = |0001|0000|

Sample implementation is available at GitHub.