# Introduction

The problem is defined as follows: we have separate lanes (32-bit or 64-bit) and want to find the position of the first occurance of the given byte in each lane.

For example, when we look for byte 0xaa in 32-bit lanes:

```   lane 0        lane 1        lane 2       lane 4  ...
[00|aa|aa|11] [aa|aa|aa|aa] [aa|11|11|22] [11|22|33|44]
^^               ^^   ^^
```

The result should be a vector of uint32 = {1, 0, 3, 4, ...}.

With AVX512 an obvious solution would be producing a bitmask from byte-level comparison and then doing some permutations to convert parts of bitmasks into 32-bit values.

While it's feasible, I want to show a method that uses trick from my previous article AVX512: count trailing zeros.

# Algorithm

1. In the first step we XOR lanes with lanes having the byte populated. Bytes equal 0 denotes presence of 0xaa, and non-zero bytes any other value.

```[00|aa|aa|11] [aa|aa|aa|aa] [aa|11|11|22] [11|22|33|44] = input
[aa|aa|aa|aa] [aa|aa|aa|aa] [aa|aa|aa|aa] [aa|aa|aa|aa] = populated

[aa|00|00|bb] [00|00|00|00] [00|bb|bb|88] [bb|88|99|ee] = input xor populated — t1
```
2. Then, we convert non-zero values into 1, using unsigned min(byte, 1) function; zeros remain intact:

```[aa|00|00|bb] [00|00|00|00] [00|bb|bb|88] [bb|88|99|ee] = t1
[01|01|01|01] [01|01|01|01] [01|01|01|01] [01|01|01|01] = ones

[01|00|00|01] [00|00|00|00] [00|01|01|01] [01|01|01|01] = min(ones, t1) — t2
```
3. The next step is simple negation of result. This is achived with bit xor instruction:

```[01|00|00|01] [00|00|00|00] [00|01|01|01] [01|01|01|01] = t2
[01|01|01|01] [01|01|01|01] [01|01|01|01] [01|01|01|01] = ones

[00|01|01|00] [01|01|01|01] [01|00|00|00] [00|00|00|00] = t2 xor ones — t3
```
4. Now we can apply the counting trailing zero procedure. First, we subtract uint32(1):

```[00|01|01|00] [01|01|01|01] [01|00|00|00] [00|00|00|00] = t3
[          1] [          1] [          1] [          1] = uint32(1)

[00|01|00|ff] [01|01|01|00] [01|ff|ff|ff] [ff|ff|ff|ff] = uint32(t3) - uint32(1) — t4
```
5. Then we can clear upper bits by bit-and the last vector (t4) with the initial one (t3):

```[00|01|00|ff] [01|01|01|00] [01|ff|ff|ff] [ff|ff|ff|ff] = t4
[ff|fe|fe|ff] [fe|fe|fe|fe] [fe|ff|ff|ff] [ff|ff|ff|ff] = not t3

[00|00|00|ff] [00|00|00|00] [00|ff|ff|ff] [ff|ff|ff|ff] = t4 and not t3 — t5
```
6. We are almost done. With VPLZCNT we count number of leading zeros in each lane:

```[00|00|00|ff] [00|00|00|00] [00|ff|ff|ff] [ff|ff|ff|ff] = t5
[         24] [         32]            8] [          0] = VPLZCNT(t5)
```
7. The last step is to divide the result by 8 (by shifting right) and subtract it from 4: 4 − lzcnt/8.

# Source code

Sample source code is available on Github