Author: | Wojciech Muła |
---|---|

Added on: | 2023-01-31 |

Updated on: | 2023-02-02 (better method by Harold Aptroot & InstLatX64) |

Contents

AVX512 lacks of counting **trailing zeros**; it supports counting of leading
zeros via instruction VPLZCNT in 32- and 64-bit words. There is the scalar
instruction BSF (Bit Scan Forward).

To recall how counting the leading zeros is supposed to work, let's see sample 32-bit word:

bit 31 bit 0 | | [0000|0001|0111|1000|0011|1100|0000| ^^ ^^^^ 6 trailing zeros

An obvious solution would be reversing bits in a word and then use the
`VPLZCNT` instruction. Source code for this approach is shown in a following
section. Basically, with `PSHUFB` we can reverse order of bytes. Then, with
two additional invocations of that instruction, we swap order of bits
within bytes.

However, a solution which is really clever uses population count; it is explained in the next section.

The core observation: when we subtract 1 from a non-zero number, then **the
least significant bit** is reset, but all leading zeros are turning into ones.

Let's see an example. That's our initial word:

[0000|0001|0111|1000|0011|1100|0000| ^^ ^^^^ 6 trailing zeros

After subtracting one it becomes:

[0000|0001|0111|1000|0011|1011|1111| ^^ ^^^^

If we count bits in the initial word and in the word after subtraction, we can derive the number of leading zeros. Initial population count is increased by one and by leading bits count.

More formally: *initial popcount + 1 + leading bits count = popcount after
subtraction 1*.

There's one special case when the initial word is zero. `VPLZCNT` defines
that for zero, the result is 32 or 64 (depending on the lane width).
In this approach we simply do not add 1.

The only problem with this approach is that VPOPCNT instruction
is available when a CPU supports **AVX512_BITALG** and **AVX512_VL**
extensions.

Sample implementation in Go asm:

// load constants MOVL $1, BX VPBROADCASTD BX, Z1 // Z1 = 1 // load input MOVL context_in(AX), BX VPBROADCASTD BX, Z0 // 1. fixup c to 0, if input is zero VPMINUD Z0, Z1, Z2 // c = 1 if input != 0, otherwise c = 0 // 2. count initial number of bits VPOPCNTD Z0, Z10 // popcnt0 // 3. reset the lowest bit and set all before it VPSUBD Z1, Z0, Z0 // 4. get popcnt1 VPOPCNTD Z0, Z11 // popcnt1 // 5. calculate bfs VPSUBD Z10, Z11, Z11 // bfs = popcnt1 - popcnt0 VPADDD Z2, Z11, Z0 // bfs += c

A better method by Harold Aptroot && InstLatX64 — we use the following
identity: `~x & (x - 1)`. The `x - 1` term sets to one the bits before
the first bit. But the bit and with negation of word resets the bits above
the first bit.

Initial word:

x = [0000|0001|0111|1000|0011|1100|0000| x - 1 = [0000|0001|0111|1000|0011|1011|1111|

Then `~x & (x - 1)`:

x - 1 = [0000|0001|0111|1000|0011|1011|1111| & ~x & [1111|1110|1000|0111|1100|0011|1111| = [0000|0000|0000|0000|0000|0011|1111|

Now we may use population count, that gives the answer instantly.
Alternatively we may use expression `32 - VPLZCNT(...)`.

Sample implementation in Go asm:

// load constants MOVL $1, BX VPBROADCASTD BX, Z1 // Z1 = 1 // load input MOVL context_in(AX), BX // load const VPBROADCASTD BX, Z0 // 1. calculate ~x & (x - 1) VPSUBD Z1, Z0, Z2 // x - 1 VPANDND Z2, Z0, Z0 // & ~x // 2. calculate bfs VPOPCNTD Z0, Z0

Sample implementation in Go asm:

// load consts MOVL $0x0f0f0f0f, BX VPBROADCASTD BX, Z10 VBROADCASTF32X4 swapbitslo<>(SB), Z11 VBROADCASTF32X4 swapbitshi<>(SB), Z12 VBROADCASTF32X4 bswap32<>(SB), Z13 // load input MOVL context_in(AX), BX // load const VPBROADCASTD BX, Z0 // Z0 = |a1|b2|c3|d4|... // 1. separate lo and hi nibbles VPANDD Z0, Z10, Z1 // Z1 = |01|02|03|04|... -- lo nibbles VPSRLD $4, Z0, Z2 // VPANDD Z2, Z10, Z2 // Z2 = |0a|0b|0c|0d|... -- hi nibbles // 2. swap bits in lo nibbles & hi nibbles VPSHUFB Z1, Z11, Z1 VPSHUFB Z2, Z12, Z2 VPORD Z1, Z2, Z0 // 3. swap bytes VPSHUFB Z13, Z0, Z0 // 4. count leading zeros VPLZCNTD Z0, Z0 // n = lzcnt

Sample source code is available on Github