Author: | Wojciech Muła |
---|---|
Added on: | 2023-01-06 |
At Sneller we had the following problem: there are sixteen 4-bit values, we need a histogram for that limited set.
Since the input set has the fixed size of 64 bits, the problem is not that difficult as a generic case.
The basic problem we solve is counting how many nibbles are present in the given 64-bit set. Then, the solution to the initial problem is performing the basic step for each possible nibble value. And it's is done in parallel.
We obviously might expand 4-bit words into bytes, and then use the byte-level comparison (like VPCMPEQB). But expansion is not very fast. A better solution works at the nibble level.
The input is the following 16-element set x:
x = |ab|02|bf|3b|aa|54|b2|b0| ^^ ^^ byte 7 byte 0 in a 64-bit word
First we populate the wanted nibble in a 64-bit word. Let the nibble is 0xb; it's present in the x word 5 times.
c = |bb|bb|bb|bb|bb|bb|bb|bb|
Next, we XOR the input word with constant one:
t0 = x ^ c = |ab|02|bf|3b|aa|54|b2|b0| xor |bb|bb|bb|bb|bb|bb|bb|bb| = |10|b9|04|80|11|ef|09|0b| ^ ^ ^ ^ ^ nibbles equal to b
The 0 nibble indicates the nibble is equal to b. Now our goal is to convert 0 nibbles into 1, and non-zero nibbles into 0. The exact procedure is shown in the next subsection.
t1 = nibble == 0 ? 1 : 0 = |01|00|10|01|00|00|10|10|
The next step is to add nibbles together. Since we're targeting AVX-512, we may use VPSADBW that adds bytes in 64-bit lanes. Note that bytes in word t1 are only: 0x00, 0x01, 0x10, 0x11. If we add eight bytes having these values, there won't be overflow. We get a byte whose nibbles are sum of nibbles (0 or 1) from the input.:
t2 = |00|00|00|00|00|00|00|32|
The final step is to add the nibbles from the lowest byte. Note that the t2 as uint64 is never bigger than 0x88, thus we may freely use 32-bit or 64-bit arithmetic instructions.:
t3 = (t2 & 0x0f) + (t2 >> 4) = |00|00|00|00|00|00|00|05|
Note: if the extension AVX-512VPOPCNTDQ is available, the steps #4 and #5 fold into a single invocation of VPOPCNTQ.
The full AVX-512 solution uses two pairs of registers:
Then the counting algorithm is performed on these registers. The final pair of registers contains the required histogram.
for i in 0..7 { inA[i] = uint64(x) inB[i] = uint64(x) constA[i] = 0x1111111111111111 * uint64(i) constB[i] = 0x1111111111111111 * uint64(i + 8) } outA = count_nibble(inA, constA) outB = count_nibble(inB, constB)
The following algorithm can be applied to bitfields of any size; here we focus on nibbles.
The initial nibbles (examples show just two cases: zero and non-zero):
t0 = |0000|1101|
Mask out the most significant bits of nibbles:
t1 = t0 & 0b0111_0111 = |0000|0101|
Add 0b0111 to each nibble. If any lower bit -- 0, 1 or 2 -- was set, the carry bit from addition will appear the 3rd bit.
t2 = t1 + 0b0111_0111 = |0000|1010|
Combine 3rd bits of nibbles from the initial word with the addition result. Now the 3rd bit is set if any bit of the nibble was 1.
t3 = t2 | t1 = |0000|1110|
Leave only the 3rd bit and negate it. Note that the AVX-512 ternary logic instruction (VPTERNLOG) is sufficient to evaluate expression t4 = ~(t2 | t1) & 0b1000_1000.
t4 = (~t3 & 0b1000_1000) = |1000|0000|
Finally, we have to convert 0b1000 into 0b1111 and keep 0b0000 intact. Simple shift right by 3 does the job:
t5 = t4 >> 3 = |0001|0000|
Sample implementation is available at GitHub.