AVX512: finding first byte in lanes

Author:	Wojciech Muła
Added on:	2023-02-06

Introduction

The problem is defined as follows: we have separate lanes (32-bit or 64-bit) and want to find the position of the first occurrence of the given byte in each lane.

For example, when we look for byte 0xaa in 32-bit lanes:

   lane 0        lane 1        lane 2       lane 4  ...
[00|aa|aa|11] [aa|aa|aa|aa] [aa|11|11|22] [11|22|33|44]
       ^^               ^^   ^^
  position 1    position 0   position 3    position 4 (not found)

The result should be a vector of uint32 = {1, 0, 3, 4, ...}.

With AVX512 an obvious solution would be producing a bitmask from byte-level comparison and then doing some permutations to convert parts of bitmasks into 32-bit values.

While it's feasible, I want to show a method that uses trick from my previous article AVX512: count trailing zeros.

Algorithm

In the first step we XOR lanes with lanes having the byte populated. Bytes equal 0 denotes presence of 0xaa, and non-zero bytes any other value.

[00|aa|aa|11] [aa|aa|aa|aa] [aa|11|11|22] [11|22|33|44] = input
[aa|aa|aa|aa] [aa|aa|aa|aa] [aa|aa|aa|aa] [aa|aa|aa|aa] = populated

[aa|00|00|bb] [00|00|00|00] [00|bb|bb|88] [bb|88|99|ee] = input xor populated — t1

Then, we convert non-zero values into 1, using unsigned min(byte, 1) function; zeros remain intact:

[aa|00|00|bb] [00|00|00|00] [00|bb|bb|88] [bb|88|99|ee] = t1
[01|01|01|01] [01|01|01|01] [01|01|01|01] [01|01|01|01] = ones

[01|00|00|01] [00|00|00|00] [00|01|01|01] [01|01|01|01] = min(ones, t1) — t2

The next step is simple negation of result. This is achieved with bit xor instruction:

[01|00|00|01] [00|00|00|00] [00|01|01|01] [01|01|01|01] = t2
[01|01|01|01] [01|01|01|01] [01|01|01|01] [01|01|01|01] = ones

[00|01|01|00] [01|01|01|01] [01|00|00|00] [00|00|00|00] = t2 xor ones — t3

Now we can apply the counting trailing zero procedure. First, we subtract uint32(1):

[00|01|01|00] [01|01|01|01] [01|00|00|00] [00|00|00|00] = t3
[          1] [          1] [          1] [          1] = uint32(1)

[00|01|00|ff] [01|01|01|00] [01|ff|ff|ff] [ff|ff|ff|ff] = uint32(t3) - uint32(1) — t4

Then we can clear upper bits by bit-and the last vector (t4) with the initial one (t3):

[00|01|00|ff] [01|01|01|00] [01|ff|ff|ff] [ff|ff|ff|ff] = t4
[ff|fe|fe|ff] [fe|fe|fe|fe] [fe|ff|ff|ff] [ff|ff|ff|ff] = not t3

[00|00|00|ff] [00|00|00|00] [00|ff|ff|ff] [ff|ff|ff|ff] = t4 and not t3 — t5

We are almost done. With VPLZCNTD we count number of leading zeros in each lane:

[00|00|00|ff] [00|00|00|00] [00|ff|ff|ff] [ff|ff|ff|ff] = t5
[         24] [         32]            8] [          0] = VPLZCNT(t5)

The last step is to divide the result by 8 (by shifting right) and subtract it from 4: 4 − lzcnt/8.

Source code

Sample source code is available on Github