AVX512: finding first byte in lanes

Introduction

The problem is defined as follows: we have separate lanes (32-bit or 64-bit) and want to find the position of the first occurance of the given byte in each lane.

For example, when we look for byte 0xaa in 32-bit lanes:

   lane 0        lane 1        lane 2       lane 4  ...
[00|aa|aa|11] [aa|aa|aa|aa] [aa|11|11|22] [11|22|33|44]
       ^^               ^^   ^^
  position 1    position 0   position 3    position 4 (not found)

The result should be a vector of uint32 = {1, 0, 3, 4, ...}.

With AVX512 an obvious solution would be producing a bitmask from byte-level comparison and then doing some permutations to convert parts of bitmasks into 32-bit values.

While it's feasible, I want to show a method that uses trick from my previous article AVX512: count trailing zeros.

Algorithm

  1. In the first step we XOR lanes with lanes having the byte populated. Bytes equal 0 denotes presence of 0xaa, and non-zero bytes any other value.

    [00|aa|aa|11] [aa|aa|aa|aa] [aa|11|11|22] [11|22|33|44] = input
    [aa|aa|aa|aa] [aa|aa|aa|aa] [aa|aa|aa|aa] [aa|aa|aa|aa] = populated
    
    [aa|00|00|bb] [00|00|00|00] [00|bb|bb|88] [bb|88|99|ee] = input xor populated — t1
    
  2. Then, we convert non-zero values into 1, using unsigned min(byte, 1) function; zeros remain intact:

    [aa|00|00|bb] [00|00|00|00] [00|bb|bb|88] [bb|88|99|ee] = t1
    [01|01|01|01] [01|01|01|01] [01|01|01|01] [01|01|01|01] = ones
    
    [01|00|00|01] [00|00|00|00] [00|01|01|01] [01|01|01|01] = min(ones, t1) — t2
    
  3. The next step is simple negation of result. This is achived with bit xor instruction:

    [01|00|00|01] [00|00|00|00] [00|01|01|01] [01|01|01|01] = t2
    [01|01|01|01] [01|01|01|01] [01|01|01|01] [01|01|01|01] = ones
    
    [00|01|01|00] [01|01|01|01] [01|00|00|00] [00|00|00|00] = t2 xor ones — t3
    
  4. Now we can apply the counting trailing zero procedure. First, we subtract uint32(1):

    [00|01|01|00] [01|01|01|01] [01|00|00|00] [00|00|00|00] = t3
    [          1] [          1] [          1] [          1] = uint32(1)
    
    [00|01|00|ff] [01|01|01|00] [01|ff|ff|ff] [ff|ff|ff|ff] = uint32(t3) - uint32(1) — t4
    
  5. Then we can clear upper bits by bit-and the last vector (t4) with the initial one (t3):

    [00|01|00|ff] [01|01|01|00] [01|ff|ff|ff] [ff|ff|ff|ff] = t4
    [ff|fe|fe|ff] [fe|fe|fe|fe] [fe|ff|ff|ff] [ff|ff|ff|ff] = not t3
    
    [00|00|00|ff] [00|00|00|00] [00|ff|ff|ff] [ff|ff|ff|ff] = t4 and not t3 — t5
    
  6. We are almost done. With VPLZCNT we count number of leading zeros in each lane:

    [00|00|00|ff] [00|00|00|00] [00|ff|ff|ff] [ff|ff|ff|ff] = t5
    [         24] [         32]            8] [          0] = VPLZCNT(t5)
    
  7. The last step is to divide the result by 8 (by shifting right) and subtract it from 4: 4 − lzcnt/8.

Source code

Sample source code is available on Github