Speeding up bit-parallel population count

Author: Wojciech Muła
Added on:2015-04-13

Contents

Introduction

This well know method requires logarithmic number of steps in term of the word width. For example the algorithm run on a 64-bit word executes 6 steps:

const uint64_t t1 = fetch data

const uint64_t t2 = (t1 & 0x5555555555555555llu) + ((t1 >>  1) & 0x5555555555555555llu);
const uint64_t t3 = (t2 & 0x3333333333333333llu) + ((t2 >>  2) & 0x3333333333333333llu);
const uint64_t t4 = (t3 & 0x0f0f0f0f0f0f0f0fllu) + ((t3 >>  4) & 0x0f0f0f0f0f0f0f0fllu);
const uint64_t t5 = (t4 & 0x00ff00ff00ff00ffllu) + ((t4 >>  8) & 0x00ff00ff00ff00ffllu);
const uint64_t t6 = (t5 & 0x0000ffff0000ffffllu) + ((t5 >> 16) & 0x0000ffff0000ffffllu);
const uint64_t t7 = (t6 & 0x00000000ffffffffllu) + ((t6 >> 32) & 0x00000000ffffffffllu);

In each step k-bit fields are summed together; we start from 1-bit fields, then 2, 4, 8, 16 and finally 32 bits. Single step requires:

Modification

When counting bits in larger data sets it's possible to delay later steps (i.e. operating on wider bit fields) by accumulating partial results from earlier steps as long as a bit field is able to hold the sum value.

For example maximum value of single byte in t4 is 8, thus we can perform up to 255/8 = 31 additions without overflow.

This example shows the idea:

while (input data) {

    uint64_t partial = 0; // local accumulator

    for (int k = 0; k < 255/8; k++) {
        const uint64_t t1 = fetch input data

        const uint64_t t2 = (t1 & 0x5555555555555555llu) + ((t1 >>  1) & 0x5555555555555555llu);
        const uint64_t t3 = (t2 & 0x3333333333333333llu) + ((t2 >>  2) & 0x3333333333333333llu);
        const uint64_t t4 = (t3 & 0x0f0f0f0f0f0f0f0fllu) + ((t3 >>  4) & 0x0f0f0f0f0f0f0f0fllu);

        partial += t4;
    }

    const uint64_t t5 = (partial & 0x00ff00ff00ff00ffllu) + ((partial >>  8) & 0x00ff00ff00ff00ffllu);
    const uint64_t t6 = (t5 & 0x0000ffff0000ffffllu) + ((t5 >> 16) & 0x0000ffff0000ffffllu);
    const uint64_t t7 = (t6 & 0x00000000ffffffffllu) + ((t6 >> 32) & 0x00000000ffffffffllu);

    result += t7;
}

Thanks to that number of operations is smaller comparing to naive version — when processing n 64-bit words the naive version executes:

and the modified version:

This is nearly 50% less of bit-ands, 50% less of shifts and 40% less of additions.

Experiments results and source code

The source code is available at github. Sample results from Core i5:

running bit-parallel-optimized        time = 0.447463 s
running bit-parallel                  time = 0.683336 s

clearly shows the theoretical calculations, optimized version is 48% faster.

See also