The problem is defined as follows: we have separate lanes (32-bit or 64-bit) and want to find the position of the first occurance of the given byte in each lane.
For example, when we look for byte 0xaa in 32-bit lanes:
lane 0 lane 1 lane 2 lane 4 ... [00|aa|aa|11] [aa|aa|aa|aa] [aa|11|11|22] [11|22|33|44] ^^ ^^ ^^ position 1 position 0 position 3 position 4 (not found)
The result should be a vector of uint32 = {1, 0, 3, 4, ...}.
With AVX512 an obvious solution would be producing a bitmask from byte-level comparison and then doing some permutations to convert parts of bitmasks into 32-bit values.
While it's feasible, I want to show a method that uses trick from my previous article AVX512: count trailing zeros.
In the first step we XOR lanes with lanes having the byte populated. Bytes equal 0 denotes presence of 0xaa, and non-zero bytes any other value.
[00|aa|aa|11] [aa|aa|aa|aa] [aa|11|11|22] [11|22|33|44] = input [aa|aa|aa|aa] [aa|aa|aa|aa] [aa|aa|aa|aa] [aa|aa|aa|aa] = populated [aa|00|00|bb] [00|00|00|00] [00|bb|bb|88] [bb|88|99|ee] = input xor populated — t1
Then, we convert non-zero values into 1, using unsigned min(byte, 1) function; zeros remain intact:
[aa|00|00|bb] [00|00|00|00] [00|bb|bb|88] [bb|88|99|ee] = t1 [01|01|01|01] [01|01|01|01] [01|01|01|01] [01|01|01|01] = ones [01|00|00|01] [00|00|00|00] [00|01|01|01] [01|01|01|01] = min(ones, t1) — t2
The next step is simple negation of result. This is achived with bit xor instruction:
[01|00|00|01] [00|00|00|00] [00|01|01|01] [01|01|01|01] = t2 [01|01|01|01] [01|01|01|01] [01|01|01|01] [01|01|01|01] = ones [00|01|01|00] [01|01|01|01] [01|00|00|00] [00|00|00|00] = t2 xor ones — t3
Now we can apply the counting trailing zero procedure. First, we subtract uint32(1):
[00|01|01|00] [01|01|01|01] [01|00|00|00] [00|00|00|00] = t3 [ 1] [ 1] [ 1] [ 1] = uint32(1) [00|01|00|ff] [01|01|01|00] [01|ff|ff|ff] [ff|ff|ff|ff] = uint32(t3) - uint32(1) — t4
Then we can clear upper bits by bit-and the last vector (t4) with the initial one (t3):
[00|01|00|ff] [01|01|01|00] [01|ff|ff|ff] [ff|ff|ff|ff] = t4 [ff|fe|fe|ff] [fe|fe|fe|fe] [fe|ff|ff|ff] [ff|ff|ff|ff] = not t3 [00|00|00|ff] [00|00|00|00] [00|ff|ff|ff] [ff|ff|ff|ff] = t4 and not t3 — t5
We are almost done. With VPLZCNT we count number of leading zeros in each lane:
[00|00|00|ff] [00|00|00|00] [00|ff|ff|ff] [ff|ff|ff|ff] = t5 [ 24] [ 32] 8] [ 0] = VPLZCNT(t5)
The last step is to divide the result by 8 (by shifting right) and subtract it from 4: 4 − lzcnt/8.
Sample source code is available on Github