Author: | Wojciech Muła |
---|---|
Added on: | 2015-04-08 |
Updates: | 2016-05-03 (binary search modifications) |
Contents
The problem: there is an ordered dictionary containing only unique keys. The dictionary is read only, and keys are 32-bit (SSE) or 64-bit (AVX2).
The obvious solution is to use binary search. Keys can be stored in a contiguous memory thanks to that there is no internal fragmentation, and data has cache locality. And of course indexing the keys is done in constant time (in the terms of computational complexity) or a single memory fetch (hardware).
The time complexity of binary search is O(log2(n)), i.e. for one million elements single lookup takes up to 20 operations. Single operation is fetching a value and comparing with the given key.
Another algorithm is linear search which seems to be suitable for small dictionaries. Linear search could be easily SIMD-ized.
I propose following modification: the key space is split to five independent subspaces and the lookup starts with choosing the proper subspace, then standard binary is used to search in the subspace. The computational complexity remains O(log2n), but the exact number of operations is smaller.
┌──────────┬────┬──────────┬────┬──────────┬────┬──────────┬────┬──────────┐ │ │ k1 │ │ k2 │ │ k3 │ │ k4 │ │ └──────────┴────┼──────────┴────┼──────────┴────┼──────────┴────┼──────────┘ subspace 0 │ subspace 1 │ subspace 2 │ subspace 3 │ subspace 4
The trick is to use SIMD operations to choose a subspace. We need four keys k1, k2, k3 and k4 which define subspace boundaries, then a single vector compare for greater yields a mask that defines a subspace.
Following pseudo code show the idea:
function search(key) is begin -- SIMD part key_vector = {key, key, key, key}; -- 1 boundaries = {k1, k2, k3, k4}; dword_mask = PCMPGT(key_vector, boundaries); -- 2 bit_mask = MOVMSKPS(dword_mask); -- 3 -- plain binsearch if bit_mask == 0b0000 then -- key <= k1 search in subspace 0 elif bit_mask == 0b0001 then -- key > k1 and key <= k2 search in subspace 1 elif bit_mask == 0b0011 then -- key > k2 and key <= k3 search in subspace 2 elif bit_mask == 0b0111 then -- key > k3 and key <= k4 search in subspace 3 elif bit_mask == 0b1111 then -- key > k4 search in subspace 4 end if end
SIMD part requires just three instructions:
This is done in constant time and binary search is executed on subarray of size 20% of input data.
Modification of binary search: when the search range is narrowed, then try to find a key near the pivot point. A SIMD equality function is used.
Following code shows the idea (check out the implementation):
key_vector = {key, key, key, key} while (a <= b) { const int c = (a + b)/2; if (data[c] == key) { return c; } if (key < data[c]) { b = c - 1; if (b >= 4) { SIMD_equal(data[b - 4 .. b], key_vector) => index return index if found } } else { a = c + 1; if (a + 4 < limit) { SIMD_equal(data[a .. b + 4], key_vector) => index return index if found } } }
For short arrays linear search is faster than binary search. In this modification binary search algorithm ends when the search range is small, then linear search is performed on this range.
Threshold size shouldn't be too large. I would say that the size of cache line (or two lines) is a good starting point.
A full C++ implementation:
int binsearch(uint32_t key, int a, int b) const { while (b - a > 64/4) { const int c = (a + b)/2; if (data[c] == key) { return c; } if (key < data[c]) { b = c - 1; } else { a = c + 1; } } for (int i=a; i <= b; i++) { if (data[i] == key) { return i; } } return -1; }
Results from Core i5 (Westmere). The unit is number of queries per second.
size [bytes] | binary search | linear search | ||||
---|---|---|---|---|---|---|
(1) default | (2) linear fallback | (3) SIMD subrange select | (4) SIMD search around pivot | (5) default | (6) SIMD boosted | |
4 | 44,277,175 | 46,899,696 | 33,078,081 | 40,488,781 | 64,679,288 | 40,058,806 |
8 | 16,391,723 | 18,863,227 | 16,241,043 | 15,295,572 | 22,813,654 | 27,207,701 |
16 | 3,617,123 | 4,131,163 | 6,905,724 | 6,622,157 | 3,266,917 | 10,469,538 |
32 | 1,444,317 | 1,784,175 | 1,926,374 | 2,437,853 | 1,268,942 | 4,130,405 |
64 | 586,026 | 669,683 | 677,376 | 822,112 | 456,542 | 1,504,841 |
128 | 259,413 | 279,583 | 269,133 | 337,367 | 139,831 | 442,554 |
256 | 114,956 | 123,168 | 117,368 | 134,691 | 40,075 | 132,586 |
512 | 49,428 | 56,470 | 53,368 | 59,880 | 10,767 | 38,690 |
1024 | 23,527 | 25,907 | 25,402 | 27,442 | 2,788 | 10,141 |
2048 | 10,873 | 12,048 | 11,193 | 12,640 | 707 | 2,631 |
4096 | 5,520 | 5,632 | 5,091 | 5,999 | 178 | 692 |
8192 | 2,738 | 2,656 | 2,531 | 2,852 | 45 | 174 |
All sources are available at github.