Author: | Wojciech Muła |
---|---|
Added on: | 2023-01-31 |
Updated on: | 2023-02-02 (better method by Harold Aptroot & InstLatX64) |
Contents
AVX512 lacks of counting trailing zeros; it supports counting of leading zeros via instruction VPLZCNTD in 32- and 64-bit words. There is the scalar instruction BSF (Bit Scan Forward).
To recall how counting the leading zeros is supposed to work, let's see sample 32-bit word:
bit 31 bit 0 | | [0000|0001|0111|1000|0011|1100|0000| ^^ ^^^^ 6 trailing zeros
An obvious solution would be reversing bits in a word and then use the VPLZCNT instruction. Source code for this approach is shown in a following section. Basically, with PSHUFB we can reverse order of bytes. Then, with two additional invocations of that instruction, we swap order of bits within bytes.
However, a solution which is really clever uses population count; it is explained in the next section.
The core observation: when we subtract 1 from a non-zero number, then the least significant bit is reset, but all leading zeros are turning into ones.
Let's see an example. That's our initial word:
[0000|0001|0111|1000|0011|1100|0000| ^^ ^^^^ 6 trailing zeros
After subtracting one it becomes:
[0000|0001|0111|1000|0011|1011|1111| ^^ ^^^^
If we count bits in the initial word and in the word after subtraction, we can derive the number of leading zeros. Initial population count is increased by one and by leading bits count.
More formally: initial popcount + 1 + leading bits count = popcount after subtraction 1.
There's one special case when the initial word is zero. VPLZCNT defines that for zero, the result is 32 or 64 (depending on the lane width). In this approach we simply do not add 1.
The only problem with this approach is that VPOPCNT instruction is available when a CPU supports AVX512_BITALG and AVX512_VL extensions.
Sample implementation in Go asm:
// load constants MOVL $1, BX VPBROADCASTD BX, Z1 // Z1 = 1 // load input MOVL context_in(AX), BX VPBROADCASTD BX, Z0 // 1. fixup c to 0, if input is zero VPMINUD Z0, Z1, Z2 // c = 1 if input != 0, otherwise c = 0 // 2. count initial number of bits VPOPCNTD Z0, Z10 // popcnt0 // 3. reset the lowest bit and set all before it VPSUBD Z1, Z0, Z0 // 4. get popcnt1 VPOPCNTD Z0, Z11 // popcnt1 // 5. calculate bfs VPSUBD Z10, Z11, Z11 // bfs = popcnt1 - popcnt0 VPADDD Z2, Z11, Z0 // bfs += c
A better method by Harold Aptroot && InstLatX64 — we use the following identity: ~x & (x - 1). The x - 1 term sets to one the bits before the first bit. But the bit and with negation of word resets the bits above the first bit.
Initial word:
x = [0000|0001|0111|1000|0011|1100|0000| x - 1 = [0000|0001|0111|1000|0011|1011|1111|
Then ~x & (x - 1):
x - 1 = [0000|0001|0111|1000|0011|1011|1111| & ~x & [1111|1110|1000|0111|1100|0011|1111| = [0000|0000|0000|0000|0000|0011|1111|
Now we may use population count, that gives the answer instantly. Alternatively we may use expression 32 - VPLZCNT(...).
Sample implementation in Go asm:
// load constants MOVL $1, BX VPBROADCASTD BX, Z1 // Z1 = 1 // load input MOVL context_in(AX), BX // load const VPBROADCASTD BX, Z0 // 1. calculate ~x & (x - 1) VPSUBD Z1, Z0, Z2 // x - 1 VPANDND Z2, Z0, Z0 // & ~x // 2. calculate bfs VPOPCNTD Z0, Z0
Sample implementation in Go asm:
// load consts MOVL $0x0f0f0f0f, BX VPBROADCASTD BX, Z10 VBROADCASTF32X4 swapbitslo<>(SB), Z11 VBROADCASTF32X4 swapbitshi<>(SB), Z12 VBROADCASTF32X4 bswap32<>(SB), Z13 // load input MOVL context_in(AX), BX // load const VPBROADCASTD BX, Z0 // Z0 = |a1|b2|c3|d4|... // 1. separate lo and hi nibbles VPANDD Z0, Z10, Z1 // Z1 = |01|02|03|04|... -- lo nibbles VPSRLD $4, Z0, Z2 // VPANDD Z2, Z10, Z2 // Z2 = |0a|0b|0c|0d|... -- hi nibbles // 2. swap bits in lo nibbles & hi nibbles VPSHUFB Z1, Z11, Z1 VPSHUFB Z2, Z12, Z2 VPORD Z1, Z2, Z0 // 3. swap bytes VPSHUFB Z13, Z0, Z0 // 4. count leading zeros VPLZCNTD Z0, Z0 // n = lzcnt
Sample source code is available on Github