Author: | Wojciech Muła |
---|---|

Added on: | 2015-03-22 |

Updates: | 2018-11-05 |

Contents

Intel's version of SIMD offers following 2-argument (binary) boolean
functions: **and**, **or**, **xor**, **and not**. There isn't a single
argument **not**, this function can be expressed with `xor reg, ones`,
however it requires additional, pre-set register.

**AVX512F** will come with a very interesting instruction called
`vpternlog`. There are two variants of the instruction operating on
a packed 32-bit (`vpternlogd`) or a 64-bit vector (`vpternlogq`),
however they do exactly the same thing — evaluate a 3-argument
(*ternary*) boolean function on each bit of arguments, the function
is given as a truth table.

The pattern of a truth table:

inputs | result | ||
---|---|---|---|

A | B | C | |

0 | 0 | 0 | a |

0 | 0 | 1 | b |

0 | 1 | 0 | c |

0 | 1 | 1 | d |

1 | 0 | 0 | e |

1 | 0 | 1 | f |

1 | 1 | 0 | g |

1 | 1 | 1 | h |

A programmer supplies only the result column, i.e. defines values of bits
**a** through **h**, this is a single 8-bit value.

Depending on function complexity, a single `vpternlog` instruction can
replace from one up to **eight** SIMD instructions.

According to Agner Fog's documentation on SkylakeX `vpternlog` has
1 cycle latency and 0,5 cycle reciprocal throughput (there are two execution
units able to handle the instruction). It's pretty fast, though.

A ternary function may be seen as a select function, i.e. the most
significant bit **A** select one of binary functions of B and C. The
function for A=0 is described by bits a..d, the function
for A=1 by bits e..h.

As @solardiz noted this function is available in OpenCL as
bitselect(), AMD XOP has instruction `VPCMOV`, AltiVec has
`VSEL`, NEON has `VBSL`, also several GPUs support such instruction.

Example **A ? B : C**.

inputs | result | ||
---|---|---|---|

A | B | C | |

0 | 0 | 0 | 0 |

0 | 0 | 1 | 1 |

0 | 1 | 0 | 0 |

0 | 1 | 1 | 1 |

1 | 0 | 0 | 0 |

1 | 0 | 1 | 0 |

1 | 1 | 0 | 1 |

1 | 1 | 1 | 1 |

constant | 0xca |

The assembly code using binary functions:

; A - xmm0 and result ; B - xmm1 ; C - xmm2 pand xmm1, xmm0 pandn xmm0, xmm2 por xmm0, xmm1

Let see for example function **(A or not B) and C**. The truth table:

inputs | result | ||
---|---|---|---|

A | B | C | |

0 | 0 | 0 | 0 |

0 | 0 | 1 | 1 |

0 | 1 | 0 | 0 |

0 | 1 | 1 | 0 |

1 | 0 | 0 | 0 |

1 | 0 | 1 | 1 |

1 | 1 | 0 | 0 |

1 | 1 | 1 | 1 |

To express the function using standard SSE instructions we need three operations:

; A - xmm0 and result ; B - xmm1 ; C - xmm2 pcmpeqb xmm4, xmm4 // constant, could exist earlier pxor xmm0, xmm4 // A1 := not A por xmm0, xmm1 // AB := A or B1 pand xmm0, xmm2 // result := AB or C

With AVX512 it would be very simple:

; a - zmm0 and result ; b - zmm1 ; c - zmm2 vpternlog zmm0, zmm1, zmm2, 0xa2 // 0xa2 = 0b10100010

inputs | OR | AND | XOR | ||
---|---|---|---|---|---|

A | B | C | |||

0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 1 | 1 | 0 | 1 |

0 | 1 | 0 | 1 | 0 | 1 |

0 | 1 | 1 | 1 | 0 | 0 |

1 | 0 | 0 | 1 | 0 | 1 |

1 | 0 | 1 | 1 | 0 | 0 |

1 | 1 | 0 | 1 | 0 | 0 |

1 | 1 | 1 | 1 | 1 | 1 |

constant | 0xfe | 0x80 | 0x96 |

The assembly code of **or all** using binary functions:

; A - xmm0 and result ; B - xmm1 ; C - xmm2 por xmm0, xmm1 por xmm0, xmm2

Function is true when only one bit is set.

inputs | result | ||
---|---|---|---|

C | B | A | |

0 | 0 | 0 | 0 |

0 | 0 | 1 | 1 |

0 | 1 | 0 | 1 |

0 | 1 | 1 | 0 |

1 | 0 | 0 | 1 |

1 | 0 | 1 | 0 |

1 | 1 | 0 | 0 |

1 | 1 | 1 | 0 |

constant | 0x16 |

The C code using binary functions:

// t1 = ~(A | B) & C __m512i t1 = _mm512i_andnot_si512(_mm512_or_si512(A, B), C); // t2 = ~C & (A ^ B) __m512i t2 = _mm512i_andnot_si512(C, _mm512_xor_si512(A, B)); __m512i result = _mm512_or_si512(t1, t2);

The procedure uses five instructions.

Function is true exactly two bits are set.

inputs | result | ||
---|---|---|---|

C | B | A | |

0 | 0 | 0 | 0 |

0 | 0 | 1 | 0 |

0 | 1 | 0 | 0 |

0 | 1 | 1 | 1 |

1 | 0 | 0 | 0 |

1 | 0 | 1 | 1 |

1 | 1 | 0 | 1 |

1 | 1 | 1 | 0 |

constant | 0x68 |

The C code using binary functions:

// t1 = ~C & (A & B) __m512i t1 = _mm512i_andnot_si512(C, _mm512_and_si512(A, B)); // t2 = C & (A ^ B) __m512i t2 = _mm512i_and_si512(C, _mm512_xor_si512(A, B)); __m512i result = _mm512_or_si512(t1, t2);

The procedure uses five instructions.

Popular hashing algorithms use several three-argument logic functions in
their main loops. Although the functions might seem complicated, they
are expressed with single `vpternlog` instruction.

MD5 hashing defines four state changing functions.

expression | vpternlogd const |
---|---|

(x and y) or (not x and z) | 0x53 |

(x and z) or (y and not z) | 0x27 |

x xor y xor z | 0x69 |

y xor (x or not z) | 0x9c |

Similarly, SHA-1 uses three ternary logic functions in its main loop.

expression | vpternlogd const |
---|---|

(b and c) or (not b and d) | 0x53 |

b xor c xor d | 0x69 |

(b and c) or (b and d) or (c and d) | 0x17 |

Also SHA-2 uses ternary logic functions in its main loop. I omitted a
three-argument xor (`vternarlog` code `0x69`).

expression | vpternlogd const |
---|---|

(e and f) xor (not e and g) | 0x53 |

(a and b) xor (a and c) xor (b and c) | 0x17 |

The crucial function of Harley-Seal population count algorithm is carry-save adder. That function calculates two bits of 3-argument sum, i.e. the digit and the carry flag.

inputs | outputs | |||
---|---|---|---|---|

C | B | A | digit | carry |

0 | 0 | 0 | 0 | 0 |

0 | 0 | 1 | 1 | 0 |

0 | 1 | 0 | 1 | 0 |

0 | 1 | 1 | 0 | 1 |

1 | 0 | 0 | 1 | 0 |

1 | 0 | 1 | 0 | 1 |

1 | 1 | 0 | 0 | 1 |

1 | 1 | 1 | 1 | 1 |

The fastest implementation of above functions contains **five logic** instructions:

tmp = a ^ b carry = (a & b) | (tmp & c) digit = tmp ^ c;

With help of `vpternlogd` (intrinsic `_mm512_ternarylogic_epi32`)
this can be calculated as:

l = _mm512_ternarylogic_epi32(c, b, a, 0x96); // 1001_0110 h = _mm512_ternarylogic_epi32(c, b, a, 0xe8); // 1110_1000

- Ternary functions for SSE, AVX2 and XOP — library which allows to use ternary functions in the same way as AVX512, i.e. by passing the function index. The best known instruction sequences are emitted.

- 2018-11-05 —
**Mark Parker**spotted several mistakes; added hashing methods, added latency/throughput parameters of the instruction - 2016-11-07 — more info about bit-select (thank you @solardiz!)
- 2016-09-04 — fixed a wrong claim, some practical example.
- 2016-03-26 — real-world example