Author: | Wojciech Muła |
---|---|
Added on: | 2015-03-22 |
Updates: | 2018-11-05 |
Contents
Intel's version of SIMD offers following 2-argument (binary) boolean functions: and, or, xor, and not. There isn't a single argument not, this function can be expressed with xor reg, ones, however it requires additional, pre-set register.
AVX512F will come with a very interesting instruction called vpternlog. There are two variants of the instruction operating on a packed 32-bit (vpternlogd) or a 64-bit vector (vpternlogq), however they do exactly the same thing — evaluate a 3-argument (ternary) boolean function on each bit of arguments, the function is given as a truth table.
The pattern of a truth table:
inputs | result | ||
---|---|---|---|
A | B | C | |
0 | 0 | 0 | a |
0 | 0 | 1 | b |
0 | 1 | 0 | c |
0 | 1 | 1 | d |
1 | 0 | 0 | e |
1 | 0 | 1 | f |
1 | 1 | 0 | g |
1 | 1 | 1 | h |
A programmer supplies only the result column, i.e. defines values of bits a through h, this is a single 8-bit value.
Depending on function complexity, a single vpternlog instruction can replace from one up to eight SIMD instructions.
According to Agner Fog's documentation on SkylakeX vpternlog has 1 cycle latency and 0,5 cycle reciprocal throughput (there are two execution units able to handle the instruction). It's pretty fast, though.
A ternary function may be seen as a select function, i.e. the most significant bit A select one of binary functions of B and C. The function for A=0 is described by bits a..d, the function for A=1 by bits e..h.
As @solardiz noted this function is available in OpenCL as bitselect(), AMD XOP has instruction VPCMOV, AltiVec has VSEL, NEON has VBSL, also several GPUs support such instruction.
Example A ? B : C.
inputs | result | ||
---|---|---|---|
A | B | C | |
0 | 0 | 0 | 0 |
0 | 0 | 1 | 1 |
0 | 1 | 0 | 0 |
0 | 1 | 1 | 1 |
1 | 0 | 0 | 0 |
1 | 0 | 1 | 0 |
1 | 1 | 0 | 1 |
1 | 1 | 1 | 1 |
constant | 0xca |
The assembly code using binary functions:
; A - xmm0 and result ; B - xmm1 ; C - xmm2 pand xmm1, xmm0 pandn xmm0, xmm2 por xmm0, xmm1
Let see for example function (A or not B) and C. The truth table:
inputs | result | ||
---|---|---|---|
A | B | C | |
0 | 0 | 0 | 0 |
0 | 0 | 1 | 1 |
0 | 1 | 0 | 0 |
0 | 1 | 1 | 0 |
1 | 0 | 0 | 0 |
1 | 0 | 1 | 1 |
1 | 1 | 0 | 0 |
1 | 1 | 1 | 1 |
To express the function using standard SSE instructions we need three operations:
; A - xmm0 and result ; B - xmm1 ; C - xmm2 pcmpeqb xmm4, xmm4 // constant, could exist earlier pxor xmm0, xmm4 // A1 := not A por xmm0, xmm1 // AB := A or B1 pand xmm0, xmm2 // result := AB or C
With AVX512 it would be very simple:
; a - zmm0 and result ; b - zmm1 ; c - zmm2 vpternlog zmm0, zmm1, zmm2, 0xa2 // 0xa2 = 0b10100010
inputs | OR | AND | XOR | ||
---|---|---|---|---|---|
A | B | C | |||
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 1 | 0 | 1 |
0 | 1 | 0 | 1 | 0 | 1 |
0 | 1 | 1 | 1 | 0 | 0 |
1 | 0 | 0 | 1 | 0 | 1 |
1 | 0 | 1 | 1 | 0 | 0 |
1 | 1 | 0 | 1 | 0 | 0 |
1 | 1 | 1 | 1 | 1 | 1 |
constant | 0xfe | 0x80 | 0x96 |
The assembly code of or all using binary functions:
; A - xmm0 and result ; B - xmm1 ; C - xmm2 por xmm0, xmm1 por xmm0, xmm2
Function is true when only one bit is set.
inputs | result | ||
---|---|---|---|
C | B | A | |
0 | 0 | 0 | 0 |
0 | 0 | 1 | 1 |
0 | 1 | 0 | 1 |
0 | 1 | 1 | 0 |
1 | 0 | 0 | 1 |
1 | 0 | 1 | 0 |
1 | 1 | 0 | 0 |
1 | 1 | 1 | 0 |
constant | 0x16 |
The C code using binary functions:
// t1 = ~(A | B) & C __m512i t1 = _mm512i_andnot_si512(_mm512_or_si512(A, B), C); // t2 = ~C & (A ^ B) __m512i t2 = _mm512i_andnot_si512(C, _mm512_xor_si512(A, B)); __m512i result = _mm512_or_si512(t1, t2);
The procedure uses five instructions.
Function is true exactly two bits are set.
inputs | result | ||
---|---|---|---|
C | B | A | |
0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 |
0 | 1 | 1 | 1 |
1 | 0 | 0 | 0 |
1 | 0 | 1 | 1 |
1 | 1 | 0 | 1 |
1 | 1 | 1 | 0 |
constant | 0x68 |
The C code using binary functions:
// t1 = ~C & (A & B) __m512i t1 = _mm512i_andnot_si512(C, _mm512_and_si512(A, B)); // t2 = C & (A ^ B) __m512i t2 = _mm512i_and_si512(C, _mm512_xor_si512(A, B)); __m512i result = _mm512_or_si512(t1, t2);
The procedure uses five instructions.
Popular hashing algorithms use several three-argument logic functions in their main loops. Although the functions might seem complicated, they are expressed with single vpternlog instruction.
MD5 hashing defines four state changing functions.
expression | vpternlogd const |
---|---|
(x and y) or (not x and z) | 0x53 |
(x and z) or (y and not z) | 0x27 |
x xor y xor z | 0x69 |
y xor (x or not z) | 0x9c |
Similarly, SHA-1 uses three ternary logic functions in its main loop.
expression | vpternlogd const |
---|---|
(b and c) or (not b and d) | 0x53 |
b xor c xor d | 0x69 |
(b and c) or (b and d) or (c and d) | 0x17 |
Also SHA-2 uses ternary logic functions in its main loop. I omitted a three-argument xor (vternarlog code 0x69).
expression | vpternlogd const |
---|---|
(e and f) xor (not e and g) | 0x53 |
(a and b) xor (a and c) xor (b and c) | 0x17 |
The crucial function of Harley-Seal population count algorithm is carry-save adder. That function calculates two bits of 3-argument sum, i.e. the digit and the carry flag.
inputs | outputs | |||
---|---|---|---|---|
C | B | A | digit | carry |
0 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 1 | 0 |
0 | 1 | 0 | 1 | 0 |
0 | 1 | 1 | 0 | 1 |
1 | 0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 | 1 |
1 | 1 | 0 | 0 | 1 |
1 | 1 | 1 | 1 | 1 |
The fastest implementation of above functions contains five logic instructions:
tmp = a ^ b carry = (a & b) | (tmp & c) digit = tmp ^ c;
With help of vpternlogd (intrinsic _mm512_ternarylogic_epi32) this can be calculated as:
l = _mm512_ternarylogic_epi32(c, b, a, 0x96); // 1001_0110 h = _mm512_ternarylogic_epi32(c, b, a, 0xe8); // 1110_1000