Author: | Wojciech Muła |
---|---|
Added on: | 2023-01-21 |
We want to check if a value belongs to a set. More formally, we want to evaluate the following expression: (x == word_0) or (x == word_1) or ... or (x == word_n), where x is a vector of words, and word_i is a constant vector.
For a four-element set, a naive version of AVX512 assembly code:
VPCMPD.BCST $0, (AX), Z1, K1 // K1 = Z1 == word_0 VPCMPD.BCST $0, 4(AX), Z1, K2 // K2 = Z1 == word_1 VPCMPD.BCST $0, 8(AX), Z1, K3 // K3 = Z1 == word_2 VPCMPD.BCST $0, 12(AX), Z1, K4 // K4 = Z1 == word_3 KORW K1, K2, K1 // K1 = K1 | K2 KORW K3, K4, K3 // K3 = K3 | K4 KORW K1, K3, K1 // K1 = K1 | K3
The above code tests a vector register Z1 against const values stored in an array pointed by AX, and sets result in kreg K1.
The tool uICA reports the followig timings (for Skylake-X):
Throughput (in cycles per iteration): 4.00 Bottlenecks: Decoder, Ports ┌───────────────────────┬────────┬───────┬───────────────────────────────────────────────────────────────────────┐ │ MITE MS DSB LSD │ Issued │ Exec. │ Port 0 Port 1 Port 2 Port 3 Port 4 Port 5 Port 6 Port 7 │ ├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┤ │ 2 │ 2 │ 2 │ 1 1 │ vpcmpeqd k1, zmm1, dword ptr [rax]{1to16} │ 2 │ 2 │ 2 │ 1 1 │ vpcmpeqd k2, zmm1, dword ptr [rax+0x4]{1to16} │ 2 │ 2 │ 2 │ 1 1 │ vpcmpeqd k3, zmm1, dword ptr [rax+0x8]{1to16} │ 2 │ 2 │ 2 │ 1 1 │ vpcmpeqd k4, zmm1, dword ptr [rax+0xc]{1to16} │ 1 │ 1 │ 1 │ 1 │ korw k1, k2, k1 │ 1 │ 1 │ 1 │ 1 │ korw k3, k4, k3 │ 1 │ 1 │ 1 │ 1 │ korw k1, k3, k1 ├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┤ │ 11 │ 11 │ 11 │ 3 2 2 4 │ Total └───────────────────────┴────────┴───────┴───────────────────────────────────────────────────────────────────────┘
Using the de Morgan's Law we may rewrite the expression (x == word_0) or (x == word_1) or ... or (x == word_n) into not ((x != word_0) and ... and (x != word_n)). Then, we can use the fact that the AVX512 masked comparison performs operation mask and (comparison result).
Compared with the naive version, we get rid of most explicit k-reg operations. However, the performance of the snippet is not better than the navie version. We only decrease the number of upos by 2, that might have impact on the overall performance.
VPCMPD.BCST $4, (AX), Z1, K1 // K1 = Z1 != word_0 VPCMPD.BCST $4, 4(AX), Z1, K1, K2 // K2 = Z1 != word_1 && K1 VPCMPD.BCST $4, 8(AX), Z1, K2, K3 // K3 = Z1 != word_2 && K2 VPCMPD.BCST $4, 12(AX), Z1, K3, K4 // K4 = Z1 != word_3 && K3 KNOTW K4, K1 // K1 = ~K4
uICA results:
Throughput (in cycles per iteration): 4.00 Bottlenecks: Decoder, Ports ┌───────────────────────┬────────┬───────┬───────────────────────────────────────────────────────────────────────┐ │ MITE MS DSB LSD │ Issued │ Exec. │ Port 0 Port 1 Port 2 Port 3 Port 4 Port 5 Port 6 Port 7 │ ├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┤ │ 2 │ 2 │ 2 │ 1 1 │ vpcmpd k1, zmm1, dword ptr [rax]{1to16}, 0x4 │ 2 │ 2 │ 2 │ 1 1 │ vpcmpd k2{k1}, zmm1, dword ptr [rax+0x4]{1to16}, 0x4 │ 2 │ 2 │ 2 │ 1 1 │ vpcmpd k3{k2}, zmm1, dword ptr [rax+0x8]{1to16}, 0x4 │ 2 │ 2 │ 2 │ 1 1 │ vpcmpd k4{k3}, zmm1, dword ptr [rax+0xc]{1to16}, 0x4 │ 1 │ 1 │ 1 │ 1 │ knotw k1, k4 ├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┤ │ 9 │ 9 │ 9 │ 1 2 2 4 │ Total └───────────────────────┴────────┴───────┴───────────────────────────────────────────────────────────────────────┘
This approach produces the result in a vector register.
We use fact that the bit-xor of two values yields 0 if both values are equal, and a non-zero result otherwise.
Then, we may convert any non-zero value into 1 by using expression min(xor-result, 1). This requires an extra register with preloaded 1.
VPXORD.BCST (AX), Z1, Z2 // Z2 = (Z1 != word_0) ? non-zero : 0 VPXORD.BCST 4(AX), Z1, Z3 // Z3 = (Z1 != word_1) ? non-zero : 0 VPXORD.BCST 8(AX), Z1, Z4 // Z4 = (Z1 != word_2) ? non-zero : 0 VPXORD.BCST 12(AX), Z1, Z5 // Z5 = (Z1 != word_3) ? non-zero : 0 VPMINUD Z2, Z3, Z2 // Z2 = min(Z2, Z3) VPMINUD Z4, Z5, Z4 // Z4 = min(Z4, Z5) VPMINUD Z4, Z2, Z2 // Z2 = min(Z2, Z4) VPTESTNMD Z2, Z2, K1 // K1 = 1 if Z2[i] == 0
uICA results:
Throughput (in cycles per iteration): 4.00 Bottleneck: Ports ┌───────────────────────┬────────┬───────┬───────────────────────────────────────────────────────────────────────┐ │ MITE MS DSB LSD │ Issued │ Exec. │ Port 0 Port 1 Port 2 Port 3 Port 4 Port 5 Port 6 Port 7 │ ├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┤ │ 1 │ 1 │ 2 │ 1 1 │ vpxord zmm2, zmm1, dword ptr [rax]{1to16} │ 1 │ 1 │ 2 │ 1 1 │ vpxord zmm3, zmm1, dword ptr [rax+0x4]{1to16} │ 1 │ 1 │ 2 │ 1 1 │ vpxord zmm4, zmm1, dword ptr [rax+0x8]{1to16} │ 1 │ 1 │ 2 │ 1 1 │ vpxord zmm5, zmm1, dword ptr [rax+0xc]{1to16} │ 1 │ 1 │ 1 │ 1 │ vpminud zmm2, zmm3, zmm2 │ 1 │ 1 │ 1 │ 1 │ vpminud zmm4, zmm5, zmm4 │ 1 │ 1 │ 1 │ 1 │ vpminud zmm2, zmm2, zmm4 │ 1 │ 1 │ 1 │ 1 │ vptestnmd k1, zmm2, zmm2 ├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┤ │ 8 │ 8 │ 12 │ 4 2 2 4 │ Total └───────────────────────┴────────┴───────┴───────────────────────────────────────────────────────────────────────┘