Author: | Wojciech Muła |
---|---|
Added on: | 2016-12-16 |
AMD XOP defines instruction VPTERNB which does lookup in a pair of SSE registers. The instruction is similar to PSHUFB, but apart of wider, 5-bit index, it allows to perform several extra operations based on the higher 3-bits.
I showed that PSHUFB can be used to implement population count procedure. With 4-bit indices such procedure is straightforward. We split a byte vector into two halves, and invoke the instruction twice, getting popcount for both nibbles. Next these popcounts are added together, forming 8-bit counters, which are added in the end.
Similar procedure can be build around VPTERNB. However, to fully utilize 5-bit indices a slightly different strategy is needed. We process two vectors in one step treating byte pairs as 16-bit words. We call VPTERNB to calculate popcount of three 5-bit fields, one remaining bit of the 16-bit word is counted separately.
We process two SSE registers (vec0 and vec1) in one step.
vec0 = packed_byte(abcd_efgh) // a .. p are bits */ vec1 = packed_byte(ijkl_mnop)
// t0 = packed_byte(000d_efgh) // t1 = packed_byte(000l_mnop) const __m128i t0 = _mm_and_si128(vec0, packed_byte(0x1f)); const __m128i t1 = _mm_and_si128(vec1, packed_byte(0x1f)); // popcnt for bits d..h const __m128i popcnt1 = _mm_perm_epi8(lookup_lo, lookup_hi, t0); // popcnt for bits l..p const __m128i popcnt2 = _mm_perm_epi8(lookup_lo, lookup_hi, t1);
// t2 = packed_byte(??ij_klmn) const __m128i t2 = _mm_srli_epi16(vec1, 2); // t3 = packed_byte(...._.abc) const __m128i t3 = _mm_and_si128(_mm_srli_epi16(vec0, 5), _mm_set1_epi8(0x07)); // t4 = packed_byte(...j_kabc) const __m128i t4 = _mm_cmov_si128(t2, t3, _mm_set1_epi8(0x18)); // popcnt3 = popcount of bits jkabc const __m128i popcnt3 = _mm_perm_epi8(lookup_lo, lookup_hi, t4);
// update counter local_a = _mm_add_epi8(local_a, popcnt1); local_a = _mm_add_epi8(local_a, popcnt2); local_a = _mm_add_epi8(local_a, popcnt3); // update counter for bits i local_b = _mm_add_epi8(local_b, _mm_and_si128(t2, _mm_set1_epi8(0x20)));
We repeat the procedure six times, and then add counters local_a and local_b. The resulting vector local contains in each byte population count of two bytes from the input vectors.
local_b = _mm_srli_epi16(local_b, 5); const __m128i local = _mm_add_epi8(local_a, local_b);
Sample implementation is available in file src/xop_hamming_weight.cpp on github repository accompanying our paper Faster Population Counts using AVX2 Instructions.
Following timings were obtained on AMD FX(tm)-8150 Eight-Core Processor with program basic_benchmark from the repository. Comparison includes just three procedures:
The Harley-Seal method is the fastest. However, when compare parallel-lookup methods, we see that one presented in this article can be 1.3 times faster than PSHUFB-based.
64-bit words | cycles per 64-bit word | ||
---|---|---|---|
PSHUFB-based | VPTERNB-based | Harley-Seal | |
128 | 0.35 | 0.26 | 0.40 |
192 | 0.71 | 0.59 | 0.68 |
256 | 0.96 | 0.83 | 0.81 |
384 | 1.11 | 0.99 | 0.95 |
512 | 1.19 | 1.12 | 1.07 |
768 | 1.26 | 1.16 | 1.13 |
1024 | 1.30 | 1.24 | 1.15 |
2048 | 1.52 | 1.34 | 1.26 |
4096 | 1.54 | 1.34 | 1.26 |
8192 | 1.56 | 1.34 | 1.26 |
12288 | 1.56 | 1.35 | 1.26 |