AVX512: check if value belongs to a set

Author:Wojciech Muła
Added on:2023-01-21

Introduction

We want to check if a value belongs to a set. More formally, we want to evaluate the following expression: (x == word_0) or (x == word_1) or ... or (x == word_n), where x is a vector of words, and word_i is a constant vector.

For a four-element set, a naive version of AVX512 assembly code:

VPCMPD.BCST   $0, (AX),   Z1, K1        // K1 = Z1 == word_0
VPCMPD.BCST   $0, 4(AX),  Z1, K2        // K2 = Z1 == word_1
VPCMPD.BCST   $0, 8(AX),  Z1, K3        // K3 = Z1 == word_2
VPCMPD.BCST   $0, 12(AX), Z1, K4        // K4 = Z1 == word_3
KORW          K1, K2, K1                // K1 = K1 | K2
KORW          K3, K4, K3                // K3 = K3 | K4
KORW          K1, K3, K1                // K1 = K1 | K3

The above code tests a vector register Z1 against const values stored in an array pointed by AX, and sets result in kreg K1.

The tool uICA reports the followig timings (for Skylake-X):

Throughput (in cycles per iteration): 4.00
Bottlenecks: Decoder, Ports

┌───────────────────────┬────────┬───────┬───────────────────────────────────────────────────────────────────────┐
│ MITE   MS   DSB   LSD │ Issued │ Exec. │ Port 0   Port 1   Port 2   Port 3   Port 4   Port 5   Port 6   Port 7 │
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┤
│  2                    │   2    │   2   │                     1                          1                      │ vpcmpeqd k1, zmm1, dword ptr [rax]{1to16}
│  2                    │   2    │   2   │                              1                 1                      │ vpcmpeqd k2, zmm1, dword ptr [rax+0x4]{1to16}
│  2                    │   2    │   2   │                     1                          1                      │ vpcmpeqd k3, zmm1, dword ptr [rax+0x8]{1to16}
│  2                    │   2    │   2   │                              1                 1                      │ vpcmpeqd k4, zmm1, dword ptr [rax+0xc]{1to16}
│  1                    │   1    │   1   │   1                                                                   │ korw k1, k2, k1
│  1                    │   1    │   1   │   1                                                                   │ korw k3, k4, k3
│  1                    │   1    │   1   │   1                                                                   │ korw k1, k3, k1
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┤
│  11                   │   11   │  11   │   3                 2        2                 4                      │ Total
└───────────────────────┴────────┴───────┴───────────────────────────────────────────────────────────────────────┘

Another approach

Using the de Morgan's Law we may rewrite the expression (x == word_0) or (x == word_1) or ... or (x == word_n) into not ((x != word_0) and ... and (x != word_n)). Then, we can use the fact that the AVX512 masked comparison performs operation mask and (comparison result).

Compared with the naive version, we get rid of most explicit k-reg operations. However, the performance of the snippet is not better than the navie version. We only decrease the number of upos by 2, that might have impact on the overall performance.

VPCMPD.BCST   $4, (AX),   Z1, K1        // K1 = Z1 != word_0
VPCMPD.BCST   $4, 4(AX),  Z1, K1, K2    // K2 = Z1 != word_1 && K1
VPCMPD.BCST   $4, 8(AX),  Z1, K2, K3    // K3 = Z1 != word_2 && K2
VPCMPD.BCST   $4, 12(AX), Z1, K3, K4    // K4 = Z1 != word_3 && K3
KNOTW         K4, K1                    // K1 = ~K4

uICA results:

Throughput (in cycles per iteration): 4.00
Bottlenecks: Decoder, Ports

┌───────────────────────┬────────┬───────┬───────────────────────────────────────────────────────────────────────┐
│ MITE   MS   DSB   LSD │ Issued │ Exec. │ Port 0   Port 1   Port 2   Port 3   Port 4   Port 5   Port 6   Port 7 │
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┤
│  2                    │   2    │   2   │                     1                          1                      │ vpcmpd k1, zmm1, dword ptr [rax]{1to16}, 0x4
│  2                    │   2    │   2   │                              1                 1                      │ vpcmpd k2{k1}, zmm1, dword ptr [rax+0x4]{1to16}, 0x4
│  2                    │   2    │   2   │                     1                          1                      │ vpcmpd k3{k2}, zmm1, dword ptr [rax+0x8]{1to16}, 0x4
│  2                    │   2    │   2   │                              1                 1                      │ vpcmpd k4{k3}, zmm1, dword ptr [rax+0xc]{1to16}, 0x4
│  1                    │   1    │   1   │   1                                                                   │ knotw k1, k4
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┤
│  9                    │   9    │   9   │   1                 2        2                 4                      │ Total
└───────────────────────┴────────┴───────┴───────────────────────────────────────────────────────────────────────┘

Result in a vector register

This approach produces the result in a vector register.

We use fact that the bit-xor of two values yields 0 if both values are equal, and a non-zero result otherwise.

Then, we may convert any non-zero value into 1 by using expression min(xor-result, 1). This requires an extra register with preloaded 1.

VPXORD.BCST   (AX), Z1, Z2      // Z2 = (Z1 != word_0) ? non-zero : 0
VPXORD.BCST  4(AX), Z1, Z3      // Z3 = (Z1 != word_1) ? non-zero : 0
VPXORD.BCST  8(AX), Z1, Z4      // Z4 = (Z1 != word_2) ? non-zero : 0
VPXORD.BCST 12(AX), Z1, Z5      // Z5 = (Z1 != word_3) ? non-zero : 0
VPMINUD     Z2, Z3, Z2          // Z2 = min(Z2, Z3)
VPMINUD     Z4, Z5, Z4          // Z4 = min(Z4, Z5)
VPMINUD     Z4, Z2, Z2          // Z2 = min(Z2, Z4)
VPTESTNMD   Z2, Z2, K1          // K1 = 1 if Z2[i] == 0

uICA results:

Throughput (in cycles per iteration): 4.00
Bottleneck: Ports

┌───────────────────────┬────────┬───────┬───────────────────────────────────────────────────────────────────────┐
│ MITE   MS   DSB   LSD │ Issued │ Exec. │ Port 0   Port 1   Port 2   Port 3   Port 4   Port 5   Port 6   Port 7 │
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┤
│  1                    │   1    │   2   │   1                 1                                                 │ vpxord zmm2, zmm1, dword ptr [rax]{1to16}
│  1                    │   1    │   2   │                              1                 1                      │ vpxord zmm3, zmm1, dword ptr [rax+0x4]{1to16}
│  1                    │   1    │   2   │                     1                          1                      │ vpxord zmm4, zmm1, dword ptr [rax+0x8]{1to16}
│  1                    │   1    │   2   │                              1                 1                      │ vpxord zmm5, zmm1, dword ptr [rax+0xc]{1to16}
│  1                    │   1    │   1   │   1                                                                   │ vpminud zmm2, zmm3, zmm2
│  1                    │   1    │   1   │   1                                                                   │ vpminud zmm4, zmm5, zmm4
│  1                    │   1    │   1   │   1                                                                   │ vpminud zmm2, zmm2, zmm4
│  1                    │   1    │   1   │                                                1                      │ vptestnmd k1, zmm2, zmm2
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┤
│  8                    │   8    │  12   │   4                 2        2                 4                      │ Total
└───────────────────────┴────────┴───────┴───────────────────────────────────────────────────────────────────────┘