Author: | Wojciech Muła |
---|---|
Added on: | 2023-01-19 |
Updated on: | 2023-02-02 |
Contents
AVX512 code often needs constants that repeat in 64-, 32-, 16- or 8-bit lanes. Instead of pre-computing such constants as whole 64-byte arrays, we can reduce memory consumption by using explicit or implicit broadcasts.
While broadcasts have high throughput, their latencies are quite high. According to uops.info, latencies are 5 cycles (on Skylake-X, Cannon Lake, Ice Lake, Adler Lake-P) when broadcasting from either a 32-bit or 64-bit register. When the source is memory location, latencies are even higher.
When an AVX512 procedure is quite short, or often loads different constants from memory, broadcast latencies might become visible. To overcome this problem, we might compute some values using few cheap instructions.
We can quickly fill an AVX512 all zero bits (with XOR) or ones (with ternary log instruction) and then use shifts, bit operations and other instructions to construct desired value.
This article show some ways to calculate different simple constants. Examples are focused mostly on 32-bit values, although in most cases we might generalize the methods to other lane widths.
A well known zeroing idiom:
VPXORD Z1, Z1, Z1
Ternary logic with immediate 255:
VPTERNLOGD $0xff, Z1, Z1, Z1
The word 0xffffffff interpreted as a signed number is -1. AVX512 provides instruction to calculate an absolute value, thus getting 1 is straightforward:
VPTERNLOGD $0xff, Z1, Z1, Z1 VPABSD Z1, Z2 // 0x00000001 VPABSW Z1, Z3 // 0x00010001 VPABSB Z1, Z4 // 0x01010101
Starting from 0x00000001, we may use left shift with a constant to produce 1 << 1, 1 << 2, 1 << 3 up to 1 << 31.
We start from 0xffffffff and then shift it right by the number n we want to get. The result is a word having n zeros followed by 32-n ones.
Then we may use VPLZCNTD to count the leading zeros of word. In our case it will be exactly n.
A sample program for n=17:
VPTERNLOGD $0xff, Z1, Z1, Z1 VPSRLD $17, Z1, Z1, Z1 VPLZCNTD Z1, Z1
In a similar way we may utilize VPOPCNTD, but this instruction is not widely available yet.
We start with two words 0xffffffff and 0x00000000. Then we use either the instruction AVGB or AVGW. Both calculate expression (xi + yi + 1)/2, yielding the correct result, even if the sum from numerator would not fit in the destination byte (or word).
When x = 0xff and y = 0x00, then x + y + 1 = 0x100, and finally 0x100 / 2 = 0x80.
Sample program for bytes:
VPTERNLOGD $0xff, Z0, Z0, Z0 // 0xffffffff VPXORD Z7, Z7, Z7 // 0x00000000 VPAVGB Z0, Z7, Z1 // 0x80808080 VPAVGW Z0, Z7, Z2 // 0x80008000
As Wunk & Harold Aptroot showed, this constant can be generated by the expression AVG(x, ~x) for any x.
Starting from 0xffffffff, we may obtain any bit-mask in form 00..0011..11 by shifting the word right.
Likewise, a bit-mask in form 11..1100..00 by shifting the word left.
When we need a mask in form 00..0011..1100..00, then two shift are required. The first one produces a mask 00..0011..11 with the wanted number of ones. The second shift left moves the ones to the desired position.
For example a bit-mask with 11 bits that starts at the 3rd bit can be constructed with the following program:
VPTERNLOGD $0xff, Z1, Z1, Z1 # 0xffffffff VPSRLD $(32-11), Z1, Z1 # 0x000007ff VPSLLD $3, Z1, Z1 # 0x00003ff8
We starting from 0xffffffff, and we will use two instructions: shift left and rotate left
The shift produces as many zeros at the word's beginning as we want. The rotation places the string of zeros at the desired position.
For example a bit-mask with 7 zeros that starts at the 14 bit can be constructed with the following program:
VPTERNLOGD $0xff, Z1, Z1, Z1 # 0xffffffff VPSLLD $7, Z1, Z1 # 0xffffff80 VPROLD $14, Z1, Z1 # 0xffe03fff
Any bit pattern can be generated with a sequence of shifts and rotates. The algorithm to construct a program for the given constant works as follows.
The number of shifts equals the number of spans, thus this method is suitable for a limited set of constants.
Let's analyse algorithm for constant 0x00ff1f01. Its binary representation is 0b00000000111111110001111100000001.
Our program starts with VPTERNLOGD $0xff, Z1, Z1, Z1.
The first span of zero bits contains 8 bits. Thus the next instruction is shift left by 8. Then there's 8-bit span of ones. Thus we're issuing rotate left by 8. Next we have three zeros -- it's shift left by 3. Then five ones equal to rotate left by 5. Next seven zeros yields a shift left by 7. And finally sole one is generated by rotate by 1 bit.
The full assembly program is shown below:
VPTERNLOGD $0xff, Z1, Z1, Z1 // 11111111111111111111111111111111 VPSSLD $8, Z1, Z1 // 11111111111111111111111100000000 VPROLD $8, Z1, Z1 // 11111111111111110000000011111111 VPSSLD $3, Z1, Z1 // 11111111111110000000011111111000 VPROLD $5, Z1, Z1 // 11111111000000001111111100011111 VPSSLD $7, Z1, Z1 // 10000000011111111000111110000000 VPROLD $1, Z1, Z1 // 00000000111111110001111100000001
When a 32-bit pattern contains a repeated byte, like 0xdddddddd, we may use the generic method to build the zero-th byte, i.e. 0xffffffdd. Then we populate the byte with PSHUFB, as indices are zeros (and we know how to get them).
A sample program:
// generate 0xffffffdd VPTERNLOGD $0xff, Z1, Z1, Z1 // 11111111111111111111111111111111 VPSLLD $1, Z1, Z1 // 11111111111111111111111111111110 VPROLD $3, Z1, Z1 // 11111111111111111111111111110111 VPSLLD $1, Z1, Z1 // 11111111111111111111111111101110 VPROLD $1, Z1, Z1 // 11111111111111111111111111011101 // populate 0th byte VPXORD Z0, Z0, Z0 VPSHUFB Z0, Z1, Z1
InstLatX64 showed in 2020, that the instruction GF2P8AFFINEQB (Galois Field Affine Transformation) can be used to broadcast any compile-time 8-bit constant.
Since binary representation of float32(0.0) is 0x00000000, it's sufficient to use VPXOR.
Obviously, we may first obtain an integer value, and then convert it into a floating point number, using for instance VCVTDQ2PS.
Instruction VFIXUPIMMPS can be used to generate the following constants: 90.0, 1.0, -1.0, +0.0, -0.0, 1/2, Pi/2, MAX_FLOAT, MIN_FLOAT. Downside of this instruction is high latency.
The instruction works in two steps. It first classifies the input numbers into eight classes:
Then, the class is used to fetch a single nibble from the second argument; the nibble encodes what to do with given value. The nibbles in range 0..6 transform the input number; for instance negate it, thanks to that we can calculate absolute value.
The nibbles in range 7..15 force writing one of the listed constants:
-0.0 — nibble=7:
f32: -0.000000 u32: 0x80000000
+0.0 — nibble=8:
f32: 0.000000 u32: 0x00000000
-1.0 — nibble=9:
f32: -1.000000 u32: 0xbf800000
+1.0 — nibble=0xa:
f32: 1.000000 u32: 0x3f800000
½ — nibble=0xb:
f32: 0.500000 u32: 0x3f000000
90.0 — nibble=0xc:
f32: 90.000000 u32: 0x42b40000
pi/2 — nibble=0xd:
f32: 1.570796 u32: 0x3fc90fdb
MAX_FLOAT — nibble=0xe:
f32: 340282346638528859811704183484516925440.000000 u32: 0x7f7fffff
MIN_FLOAT — nibble=0xf:
f32: -340282346638528859811704183484516925440.000000 u32: 0xff7fffff
Sample programs are available on Github.