Author: | Wojciech Muła |
---|---|

Added on: | 2015-03-22 |

Updates: | 2016-11-07 |

Contents

Intel's version of SIMD offers following 2-argument (binary) boolean
functions: **and**, **or**, **xor**, **and not**. There isn't a single
argument **not**, this function can be expressed with `xor reg, ones`,
however it requires additional, pre-set register.

**AVX512F** will come with a very interesting instruction called
`vpternlog`. There are two variants of the instruction operating on
a packed 32-bit (`vpternlogd`) or a 64-bit vector (`vpternlogq`),
however they do exactly the same thing — evaluate a 3-argument
(*ternary*) boolean function on each bit of arguments, the function
is given as a truth table.

The pattern of a truth table:

inputs | result | ||
---|---|---|---|

A | B | C | |

0 | 0 | 0 | a |

0 | 0 | 1 | b |

0 | 1 | 0 | c |

0 | 1 | 1 | d |

1 | 0 | 0 | e |

1 | 0 | 1 | f |

1 | 1 | 0 | g |

1 | 1 | 1 | h |

A programmer supplies only the result column, i.e. defines values of bits
**a** through **h**, this is a single 8-bit value.

Depending on function complexity, a single `vpternlog` instruction can
replace from one up to **eight** SIMD instructions.

So far there are no processors supporting AVX512, so it's a question how fast this instruction will be. I guess it couldn't be as fast as existing boolean instructions (1 cycle latency, 0.33 cycles throughput), but who knows.

A ternary function may be seen as a select function, i.e. the most
significant bit **A** select one of binary functions of B and C. The
function for A=0 is described by bits a..d, the function
for A=1 by bits e..h.

As @solardiz noted this function is available in OpenCL as
bitselect(), AMD XOP has instruction `VPCMOV`, AltiVec has
`VSEL`, NEON has `VBSL`, also several GPUs support such instruction.

Example **A ? B : C**.

inputs | result | ||
---|---|---|---|

A | B | C | |

0 | 0 | 0 | 0 |

0 | 0 | 1 | 0 |

0 | 1 | 0 | 1 |

0 | 1 | 1 | 1 |

1 | 0 | 0 | 0 |

1 | 0 | 1 | 1 |

1 | 1 | 0 | 0 |

1 | 1 | 1 | 1 |

constant | 0xac |

The assembly code using binary functions:

; A - xmm0 and result ; B - xmm1 ; C - xmm2 pand xmm1, xmm0 pandn xmm0, xmm2 por xmm0, xmm1

Let see for example function **(A or not B) and C**. The truth table:

inputs | result | ||
---|---|---|---|

A | B | C | |

0 | 0 | 0 | 0 |

0 | 0 | 1 | 1 |

0 | 1 | 0 | 0 |

0 | 1 | 1 | 0 |

1 | 0 | 0 | 0 |

1 | 0 | 1 | 1 |

1 | 1 | 0 | 0 |

1 | 1 | 1 | 1 |

To express the function using standard SSE instructions we need three operations:

; A - xmm0 and result ; B - xmm1 ; C - xmm2 pcmpeqb xmm4, xmm4 // constant, could exist earlier pxor xmm0, xmm4 // A1 := not A por xmm0, xmm1 // AB := A or B1 pand xmm0, xmm2 // result := AB or C

With AVX512 it would be very simple:

; a - zmm0 and result ; b - zmm1 ; c - zmm2 vpternlog zmm0, zmm1, zmm2, 0xa2 // 0xa2 = 0b10100010

inputs | OR | AND | XOR | ||
---|---|---|---|---|---|

A | B | C | |||

0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 1 | 1 | 0 | 1 |

0 | 1 | 0 | 1 | 0 | 1 |

0 | 1 | 1 | 1 | 0 | 0 |

1 | 0 | 0 | 1 | 0 | 1 |

1 | 0 | 1 | 1 | 0 | 0 |

1 | 1 | 0 | 1 | 0 | 0 |

1 | 1 | 1 | 1 | 1 | 1 |

constant | 0xfe | 0x80 | 0x96 |

The assembly code of **or all** using binary functions:

; A - xmm0 and result ; B - xmm1 ; C - xmm2 por xmm0, xmm1 por xmm0, xmm2

Function is true when only one bit is set.

inputs | result | ||
---|---|---|---|

C | B | A | |

0 | 0 | 0 | 0 |

0 | 0 | 1 | 1 |

0 | 1 | 0 | 1 |

0 | 1 | 1 | 0 |

1 | 0 | 0 | 1 |

1 | 0 | 1 | 0 |

1 | 1 | 0 | 0 |

1 | 1 | 1 | 0 |

constant | 0x16 |

The C code using binary functions:

// t1 = ~(A | B) & C __m512i t1 = _mm512i_andnot_si512(_mm512_or_si512(A, B), C); // t2 = ~C & (A ^ B) __m512i t2 = _mm512i_andnot_si512(C, _mm512_xor_si512(A, B)); __m512i result = _mm512_or_si512(t1, t2);

The procedure uses five instructions.

Function is true exactly two bits are set.

inputs | result | ||
---|---|---|---|

C | B | A | |

0 | 0 | 0 | 0 |

0 | 0 | 1 | 0 |

0 | 1 | 0 | 0 |

0 | 1 | 1 | 1 |

1 | 0 | 0 | 0 |

1 | 0 | 1 | 1 |

1 | 1 | 0 | 1 |

1 | 1 | 1 | 0 |

constant | 0x68 |

The C code using binary functions:

// t1 = ~C & (A & B) __m512i t1 = _mm512i_andnot_si512(C, _mm512_and_si512(A, B)); // t2 = C & (A ^ B) __m512i t2 = _mm512i_and_si512(C, _mm512_xor_si512(A, B)); __m512i result = _mm512_or_si512(t1, t2);

The procedure uses five instructions.

The crucial function of Harley-Seal population count algorithm is carry-save adder. That function calculates two bits of 3-argument sum, i.e. the digit and the carry flag.

inputs | outputs | |||
---|---|---|---|---|

C | B | A | digit | carry |

0 | 0 | 0 | 0 | 0 |

0 | 0 | 1 | 1 | 0 |

0 | 1 | 0 | 1 | 0 |

0 | 1 | 1 | 0 | 1 |

1 | 0 | 0 | 1 | 0 |

1 | 0 | 1 | 0 | 1 |

1 | 1 | 0 | 0 | 1 |

1 | 1 | 1 | 1 | 0 |

The fastest implementation of above functions contains **five logic** instructions:

tmp = a ^ b carry = (a & b) | (tmp & c) digit = tmp ^ c;

With help of `vpternlogd` (intrinsic `_mm512_ternarylogic_epi32`)
this can be calculated as:

l = _mm512_ternarylogic_epi32(c, b, a, 0x96); // 1001_0110 h = _mm512_ternarylogic_epi32(c, b, a, 0xe8); // 0110_1000

- Ternary functions for SSE, AVX2 and XOP — library which allows to use ternary functions in the same way as AVX512, i.e. by passing the function index. The best known instruction sequences are emitted.

- 2016-11-07 — more info about bit-select (thank you @solardiz!)
- 2016-09-04 — fixed a wrong claim, some practical example.
- 2016-03-26 — real-world example