SIMD-ized searching in unique constant dictionary

Author:Wojciech Muła
Added on:2015-04-08
Updates:2016-05-03 (binary search modifications)

Contents

Introduction

The problem: there is an ordered dictionary containing only unique keys. The dictionary is read only, and keys are 32-bit (SSE) or 64-bit (AVX2).

The obvious solution is to use binary search. Keys can be stored in a contiguous memory thanks to that there is no internal fragmentation, and data has cache locality. And of course indexing the keys is done in constant time (in the terms of computational complexity) or a single memory fetch (hardware).

The time complexity of binary search is O(log2(n)), i.e. for one million elements single lookup takes up to 20 operations. Single operation is fetching a value and comparing with the given key.

Another algorithm is linear search which seems to be suitable for small dictionaries. Linear search could be easily SIMD-ized.

Binary search with SIMD search around pivot point new

Modification of binary search: when the search range is narrowed, then try to find a key near the pivot point. A SIMD equality function is used.

Following code shows the idea (check out the implementation):

key_vector = {key, key, key, key}

while (a <= b) {
    const int c = (a + b)/2;

    if (data[c] == key) {
        return c;
    }

    if (key < data[c]) {
        b = c - 1;

        if (b >= 4) {
            SIMD_equal(data[b - 4 .. b], key_vector) => index
            return index if found
        }
    } else {
        a = c + 1;

        if (a + 4 < limit) {
            SIMD_equal(data[a .. b + 4], key_vector) => index
            return index if found
        }
    }
}

Binary search with fallback to linear search new

For short arrays linear search is faster than binary search. In this modification binary search algoithm ends when the search range is small, then linear search is performed on this range.

Threshold size shouldn't be too large. I would say that the size of cache line (or two lines) is a good starting point.

A full C++ implementation:

int binsearch(uint32_t key, int a, int b) const {
    while (b - a > 64/4) {
        const int c = (a + b)/2;

        if (data[c] == key) {
            return c;
        }

        if (key < data[c]) {
            b = c - 1;
        } else {
            a = c + 1;
        }
    }

    for (int i=a; i <= b; i++) {
        if (data[i] == key) {
            return i;
        }
    }

    return -1;
}

Comparison new

Results from Core i5 (Westmere). The unit is number of queries per second.

size [bytes] binary search linear search
(1) default (2) linear fallback (3) SIMD subrange select (4) SIMD search around pivot (5) default (6) SIMD boosted
4 44,277,175 46,899,696 33,078,081 40,488,781 64,679,288 40,058,806
8 16,391,723 18,863,227 16,241,043 15,295,572 22,813,654 27,207,701
16 3,617,123 4,131,163 6,905,724 6,622,157 3,266,917 10,469,538
32 1,444,317 1,784,175 1,926,374 2,437,853 1,268,942 4,130,405
64 586,026 669,683 677,376 822,112 456,542 1,504,841
128 259,413 279,583 269,133 337,367 139,831 442,554
256 114,956 123,168 117,368 134,691 40,075 132,586
512 49,428 56,470 53,368 59,880 10,767 38,690
1024 23,527 25,907 25,402 27,442 2,788 10,141
2048 10,873 12,048 11,193 12,640 707 2,631
4096 5,520 5,632 5,091 5,999 178 692
8192 2,738 2,656 2,531 2,852 45 174

Summary

Source code

All sources are available at github.