Author: | Wojciech Muła |
---|---|

Added on: | 2023-01-19 |

Contents

- Introduction
- Integer constants
- Zero (0x00000000)
- All ones (0xffffffff)
- All decimal ones (0x00000001, 0x00010001, 0x01010101)
- Powers of two (0x2, 0x4, 0x8, ...)
- Decimal numbers (from 0 to 32)
- Bit-mask 0x80808080 or 0x80008000
- Bit-masks: span of ones (like 0b00111000)
- Bit-mask: span of zeros (like 0b11000111)
- Any bit-pattern
- Any bit-pattern — byte level

- Floating point constants
- Sample code

AVX512 code often needs constants that repeat in 64-, 32-, 16- or 8-bit lanes. Instead of pre-computing such constants as whole 64-byte arrays, we can reduce memory consumption by using explicit or implicit broadcasts.

While broadcasts have low throughput, their latencies are quite high. According to uops.info, latencies are 5 cycles (on Skylake-X, Cannon Lake, Ice Lake, Adler Lake-P) when broadcasting from either a 32-bit or 64-bit register. When the source is memory location, latencies are even higher.

When an AVX512 procedure is quite short, or often loads different constants
from memory, broadcast latencies might become visible. To overcome this
problem, we might **compute** some values using few cheap instructions.

We can quickly fill an AVX512 all zero bits (with XOR) or ones (with ternary log instruction) and then use shifts, bit operations and other instructions to construct desired value.

This article show some ways to calculate different simple constants. Examples are focused mostly on 32-bit values, although in most cases we might generalize the methods to other lane widths.

A well known zeroing idiom:

VPXORD Z1, Z1, Z1

Ternary logic with immediate 255:

VPTERNLOGD $0xff, Z1, Z1, Z1

The word `0xffffffff` interpreted as a signed number is -1.
AVX512 provides instruction to calculate an absolute value,
thus getting 1 is straightforward:

VPTERNLOGD $0xff, Z1, Z1, Z1 VPABSD Z1, Z2 // 0x00000001 VPABSW Z1, Z3 // 0x00010001 VPABSB Z1, Z4 // 0x01010101

Starting from `0x00000001`, we may use left shift with
a constant to produce `1 << 1`, `1 << 2`, `1 << 3`
up to `1 << 31`.

We start from `0xffffffff` and then shift it right
by the number **n** we want to get. The result
is a word having n zeros followed by 32-n ones.

Then we may use `VPLZCNTD` to count the leading zeros
of word. In our case it will be exactly **n**.

A sample program for n=17:

VPTERNLOGD $0xff, Z1, Z1, Z1 VPSRLD $17, Z1, Z1, Z1 VPLZCNTD Z1, Z1

In a similar way we may utilize `VPOPCNTD`, but
this instruction is not widely available yet.

We start with two words `0xffffffff` and `0x00000000`.
Then we use either the instruction `AVGB` or `AVGW`.
Both calculate expression (*x*_{i} + *y*_{i} + 1)/2,
yielding the correct result, even if the sum from numerator
would not fit in the destination byte (or word).

When x = 0xff and y = 0x00, then `x + y + 1 = 0x100`,
and finally `0x100 / 2 = 0x80`.

Sample program for bytes:

VPTERNLOGD $0xff, Z0, Z0, Z0 // 0xffffffff VPXORD Z7, Z7, Z7 // 0x00000000 VPAVGB Z0, Z7, Z1 // 0x80808080 VPAVGW Z0, Z7, Z2 // 0x80008000

Starting from `0xffffffff`, we may obtain any bit-mask in form
`00..0011..11` by shifting the word right.

Likewise, a bit-mask in form `11..1100..00` by shifting
the word left.

When we need a mask in form `00..0011..1100..00`, then
two shift are required. The first one produces a mask
`00..0011..11` with the wanted number of ones. The second
shift left moves the ones to the desired position.

For example a bit-mask with 11 bits that starts at the 3rd bit can be constructed with the following program:

VPTERNLOGD $0xff, Z1, Z1, Z1 # 0xffffffff VPSRLD $(32-11), Z1, Z1 # 0x000007ff VPSLLD $3, Z1, Z1 # 0x00003ff8

We starting from `0xffffffff`, and we will use two
instructions: shift left and **rotate** left

The shift produces as many zeros at the word's beginning as we want. The rotation places the string of zeros at the desired position.

For example a bit-mask with 7 zeros that starts at the 14 bit can be constructed with the following program:

VPTERNLOGD $0xff, Z1, Z1, Z1 # 0xffffffff VPSLLD $7, Z1, Z1 # 0xffffff80 VPROLD $14, Z1, Z1 # 0xffe03fff

Any bit pattern can be generated with a sequence of shifts and rotates. The algorithm to construct a program for the given constant works as follows.

- Convert the constant into its binary representation.
- Initialize a register with
`0xffffffff`. - Starting from the most significant bits consider spans of bits of the same value.
- When a span contains zeros, then use shift left on the register.
- When a span contains ones, then use rotate left on the register.

The number of shifts equals the number of spans, thus this method is suitable for a limited set of constants.

Let's analyse algorithm for constant 0x00ff1f01. Its binary representation is 0b00000000111111110001111100000001.

Our program starts with `VPTERNLOGD $0xff, Z1, Z1, Z1`.

The first span of zero bits contains 8 bits. Thus the next instruction is shift left by 8. Then there's 8-bit span of ones. Thus we're issuing rotate left by 8. Next we have three zeros -- it's shift left by 3. Then five ones equal to rotate left by 5. Next seven zeros yields a shift left by 7. And finally sole one is generated by rotate by 1 bit.

The full assembly program is shown below:

VPTERNLOGD $0xff, Z1, Z1, Z1 // 11111111111111111111111111111111 VPSSLD $8, Z1, Z1 // 11111111111111111111111100000000 VPROLD $8, Z1, Z1 // 11111111111111110000000011111111 VPSSLD $3, Z1, Z1 // 11111111111110000000011111111000 VPROLD $5, Z1, Z1 // 11111111000000001111111100011111 VPSSLD $7, Z1, Z1 // 10000000011111111000111110000000 VPROLD $1, Z1, Z1 // 00000000111111110001111100000001

When a 32-bit pattern contains a repeated byte, like `0xdddddddd`,
we may use the generic method to build the zero-th byte, i.e.
`0xffffffdd`. Then we populate the byte with `PSHUFB`, as
indices are zeros (and we know how to get them).

A sample program:

// generate 0xffffffdd VPTERNLOGD $0xff, Z1, Z1, Z1 // 11111111111111111111111111111111 VPSLLD $1, Z1, Z1 // 11111111111111111111111111111110 VPROLD $3, Z1, Z1 // 11111111111111111111111111110111 VPSLLD $1, Z1, Z1 // 11111111111111111111111111101110 VPROLD $1, Z1, Z1 // 11111111111111111111111111011101 // populate 0th byte VPXORD Z0, Z0, Z0 VPSHUFB Z0, Z1, Z1

Since binary representation of `float32(0.0)` is `0x00000000`,
it's sufficient to use `VPXOR`.

Obviously, we may first obtain an integer value, and then convert
it into a floating point number, using for instance `VCVTDQ2PS`.

Instruction `VFIXUPIMMPS` can be used to generate the following
constants: 90.0, 1.0, -1.0, +0.0, -0.0, 1/2, Pi/2, `MAX_FLOAT`,
`MIN_FLOAT`. Downside of this instruction is high latency.

The instruction works in two steps. It first classifies the input numbers into eight classes:

- QNaN/SNaN,
- zero,
- one,
- +/- infinity,
- positive number,
- negative number.

Then, the class is used to fetch a single nibble from the second argument; the nibble encodes what to do with given value. The nibbles in range 0..6 transform the input number; for instance negate it, thanks to that we can calculate absolute value.

The nibbles in range 7..15 force writing one of the listed constants:

-0.0 — nibble=7:

f32: -0.000000 u32: 0x80000000

+0.0 — nibble=8:

f32: 0.000000 u32: 0x00000000

-1.0 — nibble=9:

f32: -1.000000 u32: 0xbf800000

+1.0 — nibble=0xa:

f32: 1.000000 u32: 0x3f800000

½ — nibble=0xb:

f32: 0.500000 u32: 0x3f000000

90.0 — nibble=0xc:

f32: 90.000000 u32: 0x42b40000

*pi*/2 — nibble=0xd:f32: 1.570796 u32: 0x3fc90fdb

`MAX_FLOAT`— nibble=0xe:f32: 340282346638528859811704183484516925440.000000 u32: 0x7f7fffff

`MIN_FLOAT`— nibble=0xf:f32: -340282346638528859811704183484516925440.000000 u32: 0xff7fffff

Sample programs are available on Github.