AVX512: count trailing zeros

Author:	Wojciech Muła
Added on:	2023-01-31
Updated on:	2023-02-02 (better method by Harold Aptroot & InstLatX64)

Contents

Introduction
Trailing zeros counting with population count
Trailing zeros — simplified
Trailing zeros counting with bits reverse
Source code

Introduction

AVX512 lacks of counting trailing zeros; it supports counting of leading zeros via instruction VPLZCNTD in 32- and 64-bit words. There is the scalar instruction BSF (Bit Scan Forward).

To recall how counting the leading zeros is supposed to work, let's see sample 32-bit word:

 bit 31                       bit 0
 |                                |
[0000|0001|0111|1000|0011|1100|0000|
                            ^^ ^^^^
                       6 trailing zeros

An obvious solution would be reversing bits in a word and then use the VPLZCNT instruction. Source code for this approach is shown in a following section. Basically, with PSHUFB we can reverse order of bytes. Then, with two additional invocations of that instruction, we swap order of bits within bytes.

However, a solution which is really clever uses population count; it is explained in the next section.

Trailing zeros counting with population count

The core observation: when we subtract 1 from a non-zero number, then the least significant bit is reset, but all leading zeros are turning into ones.

Let's see an example. That's our initial word:

[0000|0001|0111|1000|0011|1100|0000|
                            ^^ ^^^^
                       6 trailing zeros

After subtracting one it becomes:

[0000|0001|0111|1000|0011|1011|1111|
                            ^^ ^^^^

If we count bits in the initial word and in the word after subtraction, we can derive the number of leading zeros. Initial population count is increased by one and by leading bits count.

More formally: initial popcount + 1 + leading bits count = popcount after subtraction 1.

There's one special case when the initial word is zero. VPLZCNT defines that for zero, the result is 32 or 64 (depending on the lane width). In this approach we simply do not add 1.

The only problem with this approach is that VPOPCNT instruction is available when a CPU supports AVX512_BITALG and AVX512_VL extensions.

Sample implementation in Go asm:

// load constants
MOVL            $1, BX
VPBROADCASTD    BX, Z1              // Z1 = 1

// load input
MOVL            context_in(AX), BX
VPBROADCASTD    BX, Z0

// 1. fixup c to 0, if input is zero
VPMINUD         Z0, Z1, Z2          // c = 1 if input != 0, otherwise c = 0

// 2. count initial number of bits
VPOPCNTD        Z0, Z10             // popcnt0

// 3. reset the lowest bit and set all before it
VPSUBD          Z1, Z0, Z0

// 4. get popcnt1
VPOPCNTD        Z0, Z11             // popcnt1

// 5. calculate bfs
VPSUBD          Z10, Z11, Z11       // bfs = popcnt1 - popcnt0
VPADDD          Z2,  Z11, Z0        // bfs += c

Trailing zeros — simplified

A better method by Harold Aptroot && InstLatX64 — we use the following identity: ~x & (x - 1). The x - 1 term sets to one the bits before the first bit. But the bit and with negation of word resets the bits above the first bit.

Initial word:

x     = [0000|0001|0111|1000|0011|1100|0000|
x - 1 = [0000|0001|0111|1000|0011|1011|1111|

Then ~x & (x - 1):

x - 1 = [0000|0001|0111|1000|0011|1011|1111|
& ~x  & [1111|1110|1000|0111|1100|0011|1111|
      = [0000|0000|0000|0000|0000|0011|1111|

Now we may use population count, that gives the answer instantly. Alternatively we may use expression 32 - VPLZCNT(...).

Sample implementation in Go asm:

// load constants
MOVL            $1, BX
VPBROADCASTD    BX, Z1              // Z1 = 1

// load input
MOVL            context_in(AX), BX  // load const
VPBROADCASTD    BX, Z0

// 1. calculate ~x & (x - 1)
VPSUBD          Z1, Z0, Z2          // x - 1
VPANDND         Z2, Z0, Z0          // & ~x

// 2. calculate bfs
VPOPCNTD        Z0, Z0

Trailing zeros counting with bits reverse

Sample implementation in Go asm:

// load consts
MOVL            $0x0f0f0f0f, BX
VPBROADCASTD    BX, Z10

VBROADCASTF32X4 swapbitslo<>(SB), Z11
VBROADCASTF32X4 swapbitshi<>(SB), Z12
VBROADCASTF32X4 bswap32<>(SB), Z13

// load input
MOVL            context_in(AX), BX  // load const
VPBROADCASTD    BX, Z0              // Z0 = |a1|b2|c3|d4|...

// 1. separate lo and hi nibbles
VPANDD          Z0, Z10, Z1         // Z1 = |01|02|03|04|... -- lo nibbles
VPSRLD          $4, Z0, Z2          //
VPANDD          Z2, Z10, Z2         // Z2 = |0a|0b|0c|0d|... -- hi nibbles

// 2. swap bits in lo nibbles & hi nibbles
VPSHUFB         Z1, Z11, Z1
VPSHUFB         Z2, Z12, Z2
VPORD           Z1, Z2, Z0

// 3. swap bytes
VPSHUFB         Z13, Z0, Z0

// 4. count leading zeros
VPLZCNTD        Z0, Z0              // n = lzcnt

Source code

Sample source code is available on Github