AVX512: histogram of sixteen nibbles

Author:Wojciech Muła
Date:2023-01-06

Contents

Introduction

At Sneller we had the following problem: there are sixteen 4-bit values, we need a histogram for that limited set.

Since the input set has the fixed size of 64 bits, the problem is not that difficult as a generic case.

The basic problem we solve is counting how many nibbles are present in the given 64-bit set. Then, the solution to the initial problem is performing the basic step for each possible nibble value. And it's is done in parallel.

AVX-512 counting nibbles in 64-bit words

We obviously might expand 4-bit words into bytes, and then use the byte-level comparison (like VPCMPEQB). But expansion is not very fast. A better solution works at the nibble level.

  1. The input is the following 16-element set x:

    x = |ab|02|bf|3b|aa|54|b2|b0|
         ^^                   ^^
         byte 7           byte 0 in a 64-bit word
    
  2. First we populate the wanted nibble in a 64-bit word. Let the nibble is 0xb; it's present in the x word 5 times.

    c = |bb|bb|bb|bb|bb|bb|bb|bb|
    
  3. Next, we XOR the input word with constant one:

    t0 = x ^ c
       = |ab|02|bf|3b|aa|54|b2|b0| xor
         |bb|bb|bb|bb|bb|bb|bb|bb|
       = |10|b9|04|80|11|ef|09|0b|
           ^    ^   ^       ^  ^
             nibbles equal to b
    
  4. The 0 nibble indicates the nibble is equal to b. Now our goal is to convert 0 nibbles into 1, and non-zero nibbles into 0. The exact procedure is shown in the next subsection.

    t1 = nibble == 0 ? 1 : 0
       = |01|00|10|01|00|00|10|10|
    
  5. The next step is to add nibbles together. Since we're targeting AVX-512, we may use VPSADBW that adds bytes in 64-bit lanes. Note that bytes in word t1 are only: 0x00, 0x01, 0x10, 0x11. If we add eight bytes having these values, there won't be overflow. We get a byte whose nibbles are sum of nibbles (0 or 1) from the input.:

    t2 = |00|00|00|00|00|00|00|32|
    
  6. The final step is to add the nibbles from the lowest byte. Note that the t2 as uint64 is never bigger than 0x88, thus we may freely use 32-bit or 64-bit arithmetic instructions.:

    t3 = (t2 & 0x0f) + (t2 >> 4)
       = |00|00|00|00|00|00|00|05|
    

Note: if the extension AVX-512VPOPCNTDQ is available, the steps #4 and #5 fold into a single invocation of VPOPCNTQ.

The full AVX-512 solution uses two pairs of registers:

  1. One pair contains the broadcast input word.
  2. Another pair contains 64-bit words filled with nibbles from 0 to f

Then the counting algorithm is performed on these registers. The final pair of registers contains the required histogram.

for i in 0..7 {
    inA[i] = uint64(x)
    inB[i] = uint64(x)
    constA[i] = 0x1111111111111111 * uint64(i)
    constB[i] = 0x1111111111111111 * uint64(i + 8)
}

outA = count_nibble(inA, constA)
outB = count_nibble(inB, constB)

Converting nibbles

The following algorithm can be applied to bitfields of any size; here we focus on nibbles.

  1. The initial nibbles (examples show just two cases: zero and non-zero):

    t0 = |0000|1101|
    
  2. Mask out the most significant bits of nibbles:

    t1 = t0 & 0b0111_0111
       = |0000|0101|
    
  3. Add 0b0111 to each nibble. If any lower bit -- 0, 1 or 2 -- was set, the carry bit from addition will appear the 3rd bit.

    t2 = t1 + 0b0111_0111
       = |0000|1010|
    
  4. Combine 3rd bits of nibbles from the initial word with the addition result. Now the 3rd bit is set if any bit of the nibble was 1.

    t3 = t2 | t1
       = |0000|1110|
    
  5. Leave only the 3rd bit and negate it. Note that the AVX-512 ternary logic instruction (VPTERNLOG) is sufficient to evaluate expression t4 = ~(t2 | t1) & 0b1000_1000.

    t4 = (~t3 & 0b1000_1000)
       = |1000|0000|
    
  6. Finally, we have to convert 0b1000 into 0b1111 and keep 0b0000 intact. Simple shift right by 3 does the job:

    t5 = t4 >> 3
       = |0001|0000|
    

Source code

Sample implementation is available at GitHub.