AVX512: generating constants

Author: Wojciech Muła
Added on:2023-01-19
Updated on:2023-02-02

Contents

Introduction

AVX512 code often needs constants that repeat in 64-, 32-, 16- or 8-bit lanes. Instead of pre-computing such constants as whole 64-byte arrays, we can reduce memory consumption by using explicit or implicit broadcasts.

While broadcasts have high throughput, their latencies are quite high. According to uops.info, latencies are 5 cycles (on Skylake-X, Cannon Lake, Ice Lake, Adler Lake-P) when broadcasting from either a 32-bit or 64-bit register. When the source is memory location, latencies are even higher.

When an AVX512 procedure is quite short, or often loads different constants from memory, broadcast latencies might become visible. To overcome this problem, we might compute some values using few cheap instructions.

We can quickly fill an AVX512 all zero bits (with XOR) or ones (with ternary log instruction) and then use shifts, bit operations and other instructions to construct desired value.

This article show some ways to calculate different simple constants. Examples are focused mostly on 32-bit values, although in most cases we might generalize the methods to other lane widths.

Integer constants

Zero (0x00000000)

A well known zeroing idiom:

VPXORD Z1, Z1, Z1

All ones (0xffffffff)

Ternary logic with immediate 255:

VPTERNLOGD $0xff, Z1, Z1, Z1

All decimal ones (0x00000001, 0x00010001, 0x01010101)

The word 0xffffffff interpreted as a signed number is -1. AVX512 provides instruction to calculate an absolute value, thus getting 1 is straightforward:

VPTERNLOGD $0xff, Z1, Z1, Z1
VPABSD     Z1, Z2               // 0x00000001
VPABSW     Z1, Z3               // 0x00010001
VPABSB     Z1, Z4               // 0x01010101

Powers of two (0x2, 0x4, 0x8, ...)

Starting from 0x00000001, we may use left shift with a constant to produce 1 << 1, 1 << 2, 1 << 3 up to 1 << 31.

Decimal numbers (from 0 to 32)

We start from 0xffffffff and then shift it right by the number n we want to get. The result is a word having n zeros followed by 32-n ones.

Then we may use VPLZCNTD to count the leading zeros of word. In our case it will be exactly n.

A sample program for n=17:

VPTERNLOGD  $0xff, Z1, Z1, Z1
VPSRLD      $17,   Z1, Z1, Z1
VPLZCNTD    Z1, Z1

In a similar way we may utilize VPOPCNTD, but this instruction is not widely available yet.

Bit-mask 0x80808080 or 0x80008000

We start with two words 0xffffffff and 0x00000000. Then we use either the instruction AVGB or AVGW. Both calculate expression (xi + yi + 1)/2, yielding the correct result, even if the sum from numerator would not fit in the destination byte (or word).

When x = 0xff and y = 0x00, then x + y + 1 = 0x100, and finally 0x100 / 2 = 0x80.

Sample program for bytes:

VPTERNLOGD  $0xff, Z0, Z0, Z0   // 0xffffffff
VPXORD      Z7, Z7, Z7          // 0x00000000
VPAVGB      Z0, Z7, Z1          // 0x80808080
VPAVGW      Z0, Z7, Z2          // 0x80008000

As Wunk & Harold Aptroot showed, this constant can be generated by the expression AVG(x, ~x) for any x.

Bit-masks: span of ones (like 0b00111000)

Starting from 0xffffffff, we may obtain any bit-mask in form 00..0011..11 by shifting the word right.

Likewise, a bit-mask in form 11..1100..00 by shifting the word left.

When we need a mask in form 00..0011..1100..00, then two shift are required. The first one produces a mask 00..0011..11 with the wanted number of ones. The second shift left moves the ones to the desired position.

For example a bit-mask with 11 bits that starts at the 3rd bit can be constructed with the following program:

VPTERNLOGD $0xff,    Z1, Z1, Z1 # 0xffffffff
VPSRLD     $(32-11), Z1, Z1     # 0x000007ff
VPSLLD     $3,       Z1, Z1     # 0x00003ff8

Bit-mask: span of zeros (like 0b11000111)

We starting from 0xffffffff, and we will use two instructions: shift left and rotate left

The shift produces as many zeros at the word's beginning as we want. The rotation places the string of zeros at the desired position.

For example a bit-mask with 7 zeros that starts at the 14 bit can be constructed with the following program:

VPTERNLOGD $0xff, Z1, Z1, Z1 # 0xffffffff
VPSLLD        $7, Z1, Z1     # 0xffffff80
VPROLD       $14, Z1, Z1     # 0xffe03fff

Any bit-pattern

Any bit pattern can be generated with a sequence of shifts and rotates. The algorithm to construct a program for the given constant works as follows.

  1. Convert the constant into its binary representation.
  2. Initialize a register with 0xffffffff.
  3. Starting from the most significant bits consider spans of bits of the same value.
  4. When a span contains zeros, then use shift left on the register.
  5. When a span contains ones, then use rotate left on the register.

The number of shifts equals the number of spans, thus this method is suitable for a limited set of constants.

Let's analyse algorithm for constant 0x00ff1f01. Its binary representation is 0b00000000111111110001111100000001.

Our program starts with VPTERNLOGD $0xff, Z1, Z1, Z1.

The first span of zero bits contains 8 bits. Thus the next instruction is shift left by 8. Then there's 8-bit span of ones. Thus we're issuing rotate left by 8. Next we have three zeros -- it's shift left by 3. Then five ones equal to rotate left by 5. Next seven zeros yields a shift left by 7. And finally sole one is generated by rotate by 1 bit.

The full assembly program is shown below:

VPTERNLOGD $0xff, Z1, Z1, Z1 // 11111111111111111111111111111111
VPSSLD $8, Z1, Z1            // 11111111111111111111111100000000
VPROLD $8, Z1, Z1            // 11111111111111110000000011111111
VPSSLD $3, Z1, Z1            // 11111111111110000000011111111000
VPROLD $5, Z1, Z1            // 11111111000000001111111100011111
VPSSLD $7, Z1, Z1            // 10000000011111111000111110000000
VPROLD $1, Z1, Z1            // 00000000111111110001111100000001

Any bit-pattern — byte level

When a 32-bit pattern contains a repeated byte, like 0xdddddddd, we may use the generic method to build the zero-th byte, i.e. 0xffffffdd. Then we populate the byte with PSHUFB, as indices are zeros (and we know how to get them).

A sample program:

// generate 0xffffffdd
VPTERNLOGD $0xff, Z1, Z1, Z1 // 11111111111111111111111111111111
VPSLLD $1, Z1, Z1            // 11111111111111111111111111111110
VPROLD $3, Z1, Z1            // 11111111111111111111111111110111
VPSLLD $1, Z1, Z1            // 11111111111111111111111111101110
VPROLD $1, Z1, Z1            // 11111111111111111111111111011101

// populate 0th byte
VPXORD Z0, Z0, Z0
VPSHUFB Z0, Z1, Z1

Any bit pattern — GF2P8AFFINEQB

InstLatX64 showed in 2020, that the instruction GF2P8AFFINEQB (Galois Field Affine Transformation) can be used to broadcast any compile-time 8-bit constant.

Floating point constants

0.0

Since binary representation of float32(0.0) is 0x00000000, it's sufficient to use VPXOR.

Integer values

Obviously, we may first obtain an integer value, and then convert it into a floating point number, using for instance VCVTDQ2PS.

90.0, 1.0, -1.0, +0.0, -0.0, 1/2, Pi/2, MAX_FLOAT, MIN_FLOAT

Instruction VFIXUPIMMPS can be used to generate the following constants: 90.0, 1.0, -1.0, +0.0, -0.0, 1/2, Pi/2, MAX_FLOAT, MIN_FLOAT. Downside of this instruction is high latency.

The instruction works in two steps. It first classifies the input numbers into eight classes:

  • QNaN/SNaN,
  • zero,
  • one,
  • +/- infinity,
  • positive number,
  • negative number.

Then, the class is used to fetch a single nibble from the second argument; the nibble encodes what to do with given value. The nibbles in range 0..6 transform the input number; for instance negate it, thanks to that we can calculate absolute value.

The nibbles in range 7..15 force writing one of the listed constants:

  • -0.0 — nibble=7:

    f32: -0.000000
    u32: 0x80000000
    
  • +0.0 — nibble=8:

    f32: 0.000000
    u32: 0x00000000
    
  • -1.0 — nibble=9:

    f32: -1.000000
    u32: 0xbf800000
    
  • +1.0 — nibble=0xa:

    f32: 1.000000
    u32: 0x3f800000
    
  • ½ — nibble=0xb:

    f32: 0.500000
    u32: 0x3f000000
    
  • 90.0 — nibble=0xc:

    f32: 90.000000
    u32: 0x42b40000
    
  • pi/2 — nibble=0xd:

    f32: 1.570796
    u32: 0x3fc90fdb
    
  • MAX_FLOAT — nibble=0xe:

    f32: 340282346638528859811704183484516925440.000000
    u32: 0x7f7fffff
    
  • MIN_FLOAT — nibble=0xf:

    f32: -340282346638528859811704183484516925440.000000
    u32: 0xff7fffff
    

Sample code

Sample programs are available on Github.