AVX512: ternary functions evaluation

Author:Wojciech Muła
Added on:2015-03-22
Updates:2020-01-11

Contents

Introduction update

Intel's version of SIMD offers following 2-argument (binary) boolean functions: and, or, xor, and not. There isn't a single argument not, this function can be expressed with xor reg, ones, however it requires additional, pre-set register.

AVX512F will come with a very interesting instruction called vpternlog. There are two variants of the instruction operating on a packed 32-bit (vpternlogd) or a 64-bit vector (vpternlogq), however they do exactly the same thing — evaluate a 3-argument (ternary) boolean function on each bit of arguments, the function is given as a truth table.

The pattern of a truth table:

inputs result
A B C
0 0 0 a
0 0 1 b
0 1 0 c
0 1 1 d
1 0 0 e
1 0 1 f
1 1 0 g
1 1 1 h

A programmer supplies only the result column, i.e. defines values of bits a through h, this is a single 8-bit value.

Depending on function complexity, a single vpternlog instruction can replace from one up to eight SIMD instructions.

According to Agner Fog's documentation on SkylakeX vpternlog has 1 cycle latency and 0,5 cycle reciprocal throughput (there are two execution units able to handle the instruction). It's pretty fast, though.

Ternary logic function is available as the intrinsic function _mm512_ternarylogic_epi32(a, b, c, imm8), where the argument a carries most significant bits, and c least significant bits.

Bit select function

A ternary function may be seen as a select function, i.e. the most significant bit A select one of binary functions of B and C. The function for A=0 is described by bits a..d, the function for A=1 by bits e..h.

As @solardiz noted this function is available in OpenCL as bitselect(), AMD XOP has instruction VPCMOV, AltiVec has VSEL, NEON has VBSL, also several GPUs support such instruction.

Example A ? B : C.

inputs result
A B C
0 0 0 0
0 0 1 1
0 1 0 0
0 1 1 1
1 0 0 0
1 0 1 0
1 1 0 1
1 1 1 1
constant 0xca

The assembly code using binary functions:

; A - xmm0 and result
; B - xmm1
; C - xmm2

pand     xmm1, xmm0
pandn    xmm0, xmm2
por      xmm0, xmm1

Other examples

Artifical function

Let see for example function (A or not B) and C. The truth table:

inputs result
A B C
0 0 0 0
0 0 1 1
0 1 0 0
0 1 1 0
1 0 0 0
1 0 1 1
1 1 0 0
1 1 1 1

To express the function using standard SSE instructions we need three operations:

; A - xmm0 and result
; B - xmm1
; C - xmm2

pcmpeqb  xmm4, xmm4     // constant, could exist earlier
pxor     xmm0, xmm4     // A1 := not A
por      xmm0, xmm1     // AB := A or B1
pand     xmm0, xmm2     // result := AB and C

With AVX512 it would be very simple:

; a - zmm0 and result
; b - zmm1
; c - zmm2

vpternlog zmm0, zmm1, zmm2, 0xa2 // 0xa2 = 0b10100010

OR/AND/XOR all

inputs OR AND XOR
A B C
0 0 0 0 0 0
0 0 1 1 0 1
0 1 0 1 0 1
0 1 1 1 0 0
1 0 0 1 0 1
1 0 1 1 0 0
1 1 0 1 0 0
1 1 1 1 1 1
constant 0xfe 0x80 0x96

The assembly code of or all using binary functions:

; A - xmm0 and result
; B - xmm1
; C - xmm2

por      xmm0, xmm1
por      xmm0, xmm2

Exactly one bit is set

Function is true when only one bit is set.

inputs result
C B A
0 0 0 0
0 0 1 1
0 1 0 1
0 1 1 0
1 0 0 1
1 0 1 0
1 1 0 0
1 1 1 0
constant 0x16

The C code using binary functions:

// t1 = ~(A | B) & C
__m512i t1 = _mm512i_andnot_si512(_mm512_or_si512(A, B), C);

// t2 = ~C & (A ^ B)
__m512i t2 = _mm512i_andnot_si512(C, _mm512_xor_si512(A, B));

__m512i result = _mm512_or_si512(t1, t2);

The procedure uses five instructions.

Exactly two bits are set

Function is true exactly two bits are set.

inputs result
C B A
0 0 0 0
0 0 1 0
0 1 0 0
0 1 1 1
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 0
constant 0x68

The C code using binary functions:

// t1 = ~C & (A & B)
__m512i t1 = _mm512i_andnot_si512(C, _mm512_and_si512(A, B));

// t2 = C & (A ^ B)
__m512i t2 = _mm512i_and_si512(C, _mm512_xor_si512(A, B));

__m512i result = _mm512_or_si512(t1, t2);

The procedure uses five instructions.

Real-world examples

MD5, SHA-1 and SHA-2 hashing update

Popular hashing algorithms use several three-argument logic functions in their main loops. Although the functions might seem complicated, they are expressed with single vpternlog instruction.

MD5 hashing defines four state changing functions (the order of variables from the most significant bit: x, y, z).

expression vpternlogd const
(x and y) or (not x and z) 0xca
(x and z) or (y and not z) 0xe4
x xor y xor z 0x96
y xor (x or not z) 0x39

Similarly, SHA-1 uses three ternary logic functions in its main loop (the order of variables from the most significant bit: b, c, d).

expression vpternlogd const
(b and c) or (not b and d) 0xca
b xor c xor d 0x96
(b and c) or (b and d) or (c and d) 0xe8

Also SHA-2 uses ternary logic functions in its main loop. I omitted a three-argument xor, which is shown above (the order of variables from the most significant bit: e, f, g; and a, b, c).

expression vpternlogd const
(e and f) xor (not e and g) 0xca
(a and b) xor (a and c) xor (b and c) 0xe8

Carry-save adder

The crucial function of Harley-Seal population count algorithm is carry-save adder. That function calculates two bits of 3-argument sum, i.e. the digit and the carry flag.

inputs outputs
C B A digit carry
0 0 0 0 0
0 0 1 1 0
0 1 0 1 0
0 1 1 0 1
1 0 0 1 0
1 0 1 0 1
1 1 0 0 1
1 1 1 1 1

The fastest implementation of above functions contains five logic instructions:

tmp   = a ^ b
carry = (a & b) | (tmp & c)
digit = tmp ^ c;

With help of vpternlogd this can be calculated as:

l = _mm512_ternarylogic_epi32(c, b, a, 0x96); // 1001_0110
h = _mm512_ternarylogic_epi32(c, b, a, 0xe8); // 1110_1000

See also

Changes