Author: | Wojciech Muła |
---|---|
Added on: | 2015-04-13 |
This well know method requires logarithmic number of steps in term of the word width. For example the algorithm run on a 64-bit word executes 6 steps:
const uint64_t t1 = fetch data const uint64_t t2 = (t1 & 0x5555555555555555llu) + ((t1 >> 1) & 0x5555555555555555llu); const uint64_t t3 = (t2 & 0x3333333333333333llu) + ((t2 >> 2) & 0x3333333333333333llu); const uint64_t t4 = (t3 & 0x0f0f0f0f0f0f0f0fllu) + ((t3 >> 4) & 0x0f0f0f0f0f0f0f0fllu); const uint64_t t5 = (t4 & 0x00ff00ff00ff00ffllu) + ((t4 >> 8) & 0x00ff00ff00ff00ffllu); const uint64_t t6 = (t5 & 0x0000ffff0000ffffllu) + ((t5 >> 16) & 0x0000ffff0000ffffllu); const uint64_t t7 = (t6 & 0x00000000ffffffffllu) + ((t6 >> 32) & 0x00000000ffffffffllu);
In each step k-bit fields are summed together; we start from 1-bit fields, then 2, 4, 8, 16 and finally 32 bits. Single step requires:
When counting bits in larger data sets it's possible to delay later steps (i.e. operating on wider bit fields) by accumulating partial results from earlier steps as long as a bit field is able to hold the sum value.
For example maximum value of single byte in t4 is 8, thus we can perform up to 255/8 = 31 additions without overflow.
This example shows the idea:
while (input data) { uint64_t partial = 0; // local accumulator for (int k = 0; k < 255/8; k++) { const uint64_t t1 = fetch input data const uint64_t t2 = (t1 & 0x5555555555555555llu) + ((t1 >> 1) & 0x5555555555555555llu); const uint64_t t3 = (t2 & 0x3333333333333333llu) + ((t2 >> 2) & 0x3333333333333333llu); const uint64_t t4 = (t3 & 0x0f0f0f0f0f0f0f0fllu) + ((t3 >> 4) & 0x0f0f0f0f0f0f0f0fllu); partial += t4; } const uint64_t t5 = (partial & 0x00ff00ff00ff00ffllu) + ((partial >> 8) & 0x00ff00ff00ff00ffllu); const uint64_t t6 = (t5 & 0x0000ffff0000ffffllu) + ((t5 >> 16) & 0x0000ffff0000ffffllu); const uint64_t t7 = (t6 & 0x00000000ffffffffllu) + ((t6 >> 32) & 0x00000000ffffffffllu); result += t7; }
Thanks to that number of operations is smaller comparing to naive version — when processing n 64-bit words the naive version executes:
and the modified version:
This is nearly 50% less of bit-ands, 50% less of shifts and 40% less of additions.
The source code is available at github. Sample results from Core i5:
running bit-parallel-optimized time = 0.447463 s running bit-parallel time = 0.683336 s
clearly shows the theoretical calculations, optimized version is 48% faster.