Population count using XOP instructions

Author:	Wojciech Muła
Added on:	2016-12-16

Introduction

AMD XOP defines instruction VPTERNB which does lookup in a pair of SSE registers. The instruction is similar to PSHUFB, but apart of wider, 5-bit index, it allows to perform several extra operations based on the higher 3-bits.

I showed that PSHUFB can be used to implement population count procedure. With 4-bit indices such procedure is straightforward. We split a byte vector into two halves, and invoke the instruction twice, getting popcount for both nibbles. Next these popcounts are added together, forming 8-bit counters, which are added in the end.

Similar procedure can be build around VPTERNB. However, to fully utilize 5-bit indices a slightly different strategy is needed. We process two vectors in one step treating byte pairs as 16-bit words. We call VPTERNB to calculate popcount of three 5-bit fields, one remaining bit of the 16-bit word is counted separately.

Algorithm

We process two SSE registers (vec0 and vec1) in one step.

vec0 = packed_byte(abcd_efgh) // a .. p are bits */
vec1 = packed_byte(ijkl_mnop)

First we calculate population count of lower five bits of both registers.

// t0 = packed_byte(000d_efgh)
// t1 = packed_byte(000l_mnop)
const __m128i t0      = _mm_and_si128(vec0, packed_byte(0x1f));
const __m128i t1      = _mm_and_si128(vec1, packed_byte(0x1f));

// popcnt for bits d..h
const __m128i popcnt1 = _mm_perm_epi8(lookup_lo, lookup_hi, t0);

// popcnt for bits l..p
const __m128i popcnt2 = _mm_perm_epi8(lookup_lo, lookup_hi, t1);

Then we build a 5-bit word using three remaining bits from vec0 (abc) and vec1 (jk). We merge these bit-fields using XOP-instruction VCMOV, a condition move based on the bit-mask.

// t2 = packed_byte(??ij_klmn)
const __m128i t2 = _mm_srli_epi16(vec1, 2);
// t3 = packed_byte(...._.abc)
const __m128i t3 = _mm_and_si128(_mm_srli_epi16(vec0, 5), _mm_set1_epi8(0x07));
// t4 = packed_byte(...j_kabc)
const __m128i t4 = _mm_cmov_si128(t2, t3, _mm_set1_epi8(0x18));
// popcnt3 = popcount of bits jkabc
const __m128i popcnt3 = _mm_perm_epi8(lookup_lo, lookup_hi, t4);

At this point vectors popcnt1, popcnt2 and popcnt3 have got population count for all bits except the bit i. This bit is present in vector t4 at position 4th. We separately accumulate popcounts from vectors popcnt{1,2,3} and the bit i. The bit i is not shifted, we simply mask it and keep the counter shifted right, that limits its capacity to 7.

// update counter
local_a = _mm_add_epi8(local_a, popcnt1);
local_a = _mm_add_epi8(local_a, popcnt2);
local_a = _mm_add_epi8(local_a, popcnt3);

// update counter for bits i
local_b = _mm_add_epi8(local_b, _mm_and_si128(t2, _mm_set1_epi8(0x20)));

We repeat the procedure six times, and then add counters local_a and local_b. The resulting vector local contains in each byte population count of two bytes from the input vectors.

local_b = _mm_srli_epi16(local_b, 5);
const __m128i local = _mm_add_epi8(local_a, local_b);

Source code

Sample implementation is available in file src/xop_hamming_weight.cpp on github repository accompanying our paper Faster Population Counts using AVX2 Instructions.

Performance results

Following timings were obtained on AMD FX(tm)-8150 Eight-Core Processor with program basic_benchmark from the repository. Comparison includes just three procedures:

PSHUFB-based (sse_bitset64_weight),
SSE Harley-Seal (sse_harley_seal_bitset64_weight),
and VPTERNB-based (xop_bitset64_weight).

The Harley-Seal method is the fastest. However, when compare parallel-lookup methods, we see that one presented in this article can be 1.3 times faster than PSHUFB-based.

64-bit words	cycles per 64-bit word
64-bit words	`PSHUFB`-based	`VPTERNB`-based	Harley-Seal
128	0.35	0.26	0.40
192	0.71	0.59	0.68
256	0.96	0.83	0.81
384	1.11	0.99	0.95
512	1.19	1.12	1.07
768	1.26	1.16	1.13
1024	1.30	1.24	1.15
2048	1.52	1.34	1.26
4096	1.54	1.34	1.26
8192	1.56	1.34	1.26
12288	1.56	1.35	1.26