# Speeding up bit-parallel population count

Author: Wojciech Muła 2015-04-13

# Introduction

This well-known method requires a logarithmic number of steps in terms of the word width. For example, the algorithm run on a 64-bit word executes 6 steps:

```
const uint64_t t1 = input; // fetch a 64-bit word of data

const uint64_t t2 = (t1 & 0x5555555555555555llu) + ((t1 >>  1) & 0x5555555555555555llu);
const uint64_t t3 = (t2 & 0x3333333333333333llu) + ((t2 >>  2) & 0x3333333333333333llu);
const uint64_t t4 = (t3 & 0x0f0f0f0f0f0f0f0fllu) + ((t3 >>  4) & 0x0f0f0f0f0f0f0f0fllu);
const uint64_t t5 = (t4 & 0x00ff00ff00ff00ffllu) + ((t4 >>  8) & 0x00ff00ff00ff00ffllu);
const uint64_t t6 = (t5 & 0x0000ffff0000ffffllu) + ((t5 >> 16) & 0x0000ffff0000ffffllu);
const uint64_t t7 = (t6 & 0x00000000ffffffffllu) + ((t6 >> 32) & 0x00000000ffffffffllu);
```

In each step k-bit fields are summed together: we start from 1-bit fields, then 2, 4, 8, 16 and finally 32 bits. A single step requires:

• 2 bit-ands;
• 1 shift right by a constant amount;
• 1 addition.
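Wrapped in a function, the six steps can be tested directly (a sketch; the name `popcount64` is mine, not part of the original listing):

```c
#include <stdint.h>

// Bit-parallel population count of a 64-bit word: six reduction
// steps, doubling the field width at each step (1, 2, 4, 8, 16, 32).
uint64_t popcount64(uint64_t t1) {
    const uint64_t t2 = (t1 & 0x5555555555555555llu) + ((t1 >>  1) & 0x5555555555555555llu);
    const uint64_t t3 = (t2 & 0x3333333333333333llu) + ((t2 >>  2) & 0x3333333333333333llu);
    const uint64_t t4 = (t3 & 0x0f0f0f0f0f0f0f0fllu) + ((t3 >>  4) & 0x0f0f0f0f0f0f0f0fllu);
    const uint64_t t5 = (t4 & 0x00ff00ff00ff00ffllu) + ((t4 >>  8) & 0x00ff00ff00ff00ffllu);
    const uint64_t t6 = (t5 & 0x0000ffff0000ffffllu) + ((t5 >> 16) & 0x0000ffff0000ffffllu);
    const uint64_t t7 = (t6 & 0x00000000ffffffffllu) + ((t6 >> 32) & 0x00000000ffffffffllu);
    return t7;
}
```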

# Modification

When counting bits in larger data sets it's possible to delay the later steps (those operating on wider bit fields) by accumulating partial results from the earlier steps, as long as a bit field is able to hold the sum.

For example, the maximum value of a single byte in t4 is 8, thus we can perform up to 255/8 = 31 additions without overflowing a byte.
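To see the bound concretely: after the first three steps every byte of t4 holds the population count of the corresponding input byte, which is at most 8, so 31 byte-wise additions stay at or below 31 ⋅ 8 = 248 ≤ 255. A sketch of that check (helper name is mine):

```c
#include <stdint.h>

// Worst case per byte after three reduction steps is popcount(0xff) = 8,
// so up to 255/8 = 31 accumulations fit in an 8-bit field.
enum {
    MAX_BYTE_VALUE = 8,
    MAX_ITERATIONS = 255 / MAX_BYTE_VALUE  // = 31
};

// The first three reduction steps: each byte ends up holding
// the popcount of the corresponding input byte.
static uint64_t reduce_to_bytes(uint64_t t1) {
    const uint64_t t2 = (t1 & 0x5555555555555555llu) + ((t1 >> 1) & 0x5555555555555555llu);
    const uint64_t t3 = (t2 & 0x3333333333333333llu) + ((t2 >> 2) & 0x3333333333333333llu);
    return (t3 & 0x0f0f0f0f0f0f0f0fllu) + ((t3 >> 4) & 0x0f0f0f0f0f0f0f0fllu);
}
```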

This example shows the idea:

```
uint64_t result = 0;

while (/* input data left */) {

    uint64_t partial = 0; // local accumulator, 8-bit fields

    for (int k = 0; k < 255/8; k++) {
        const uint64_t t1 = input; // fetch next 64-bit word

        const uint64_t t2 = (t1 & 0x5555555555555555llu) + ((t1 >>  1) & 0x5555555555555555llu);
        const uint64_t t3 = (t2 & 0x3333333333333333llu) + ((t2 >>  2) & 0x3333333333333333llu);
        const uint64_t t4 = (t3 & 0x0f0f0f0f0f0f0f0fllu) + ((t3 >>  4) & 0x0f0f0f0f0f0f0f0fllu);

        partial += t4;
    }

    const uint64_t t5 = (partial & 0x00ff00ff00ff00ffllu) + ((partial >>  8) & 0x00ff00ff00ff00ffllu);
    const uint64_t t6 = (t5 & 0x0000ffff0000ffffllu) + ((t5 >> 16) & 0x0000ffff0000ffffllu);
    const uint64_t t7 = (t6 & 0x00000000ffffffffllu) + ((t6 >> 32) & 0x00000000ffffffffllu);

    result += t7;
}
```
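A complete, testable version of that loop, counting bits over an array of words (a sketch; the function name `popcount_accumulated` and the array interface are mine):

```c
#include <stdint.h>
#include <stddef.h>

// Bit-parallel popcount over n words with the last three steps deferred:
// up to 31 words are accumulated in 8-bit lanes (8 * 31 = 248 fits in a
// byte) before the byte counters are folded into a single sum.
uint64_t popcount_accumulated(const uint64_t* data, size_t n) {
    uint64_t result = 0;
    size_t i = 0;

    while (i < n) {
        uint64_t partial = 0; // local accumulator, 8-bit fields

        for (int k = 0; k < 255/8 && i < n; k++, i++) {
            const uint64_t t1 = data[i];

            const uint64_t t2 = (t1 & 0x5555555555555555llu) + ((t1 >>  1) & 0x5555555555555555llu);
            const uint64_t t3 = (t2 & 0x3333333333333333llu) + ((t2 >>  2) & 0x3333333333333333llu);
            const uint64_t t4 = (t3 & 0x0f0f0f0f0f0f0f0fllu) + ((t3 >>  4) & 0x0f0f0f0f0f0f0f0fllu);

            partial += t4;
        }

        // fold the byte counters into one number
        const uint64_t t5 = (partial & 0x00ff00ff00ff00ffllu) + ((partial >>  8) & 0x00ff00ff00ff00ffllu);
        const uint64_t t6 = (t5 & 0x0000ffff0000ffffllu) + ((t5 >> 16) & 0x0000ffff0000ffffllu);
        const uint64_t t7 = (t6 & 0x00000000ffffffffllu) + ((t6 >> 32) & 0x00000000ffffffffllu);

        result += t7;
    }
    return result;
}
```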

Thanks to that, the number of operations is smaller compared to the naive version. When processing n 64-bit words the naive version executes:

• 2 ⋅ 6 ⋅ n bit-ands;
• 6 ⋅ n shifts;
• 7 ⋅ n additions (six reduction steps plus accumulating the result);

and the modified version:

• 2 ⋅ 3 ⋅ (n + n/31) bit-ands;
• 3 ⋅ (n + n/31) shifts;
• 4 ⋅ (n + n/31) additions;

This is nearly 50% fewer bit-ands, 50% fewer shifts and about 40% fewer additions.
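The savings can be checked numerically; a quick sketch, assuming n is a multiple of 31 so the groups divide evenly (the helper names are mine):

```c
#include <stdint.h>

// Operation counts from the formulas above, for n divisible by 31.
typedef struct { uint64_t bit_ands, shifts; } op_counts;

// naive: 6 steps per word, each with 2 bit-ands and 1 shift
op_counts naive_ops(uint64_t n) {
    return (op_counts){ 2 * 6 * n, 6 * n };
}

// modified: 3 steps per word plus 3 steps per group of 31 words
op_counts modified_ops(uint64_t n) {
    const uint64_t m = n + n / 31;
    return (op_counts){ 2 * 3 * m, 3 * m };
}
```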

# Experiment results and source code

The source code is available on GitHub. Sample results from a Core i5:

```running bit-parallel-optimized        time = 0.447463 s
running bit-parallel                  time = 0.683336 s
```

These timings confirm the theoretical calculations: the optimized version runs in roughly 35% less time, i.e. it is about 1.5 times faster.