# Intersection of ordered sets

Author: Wojciech Muła 2018-03-14 2018-03-16

# Introduction

The intersection of two sets represented by sorted collections (like lists or arrays) can be done in linear time. If we label with k the size of one collection, and with n the size of another collection, then the complexity of intersection is O(n + k).

Below is a naive C++ implementation; the C++ standard library comes with std::set_intersection.

```template <typename INSERTER>
void custom_set_intersection(const vec& A, const vec& B, INSERTER output) {

size_t i = 0;
size_t j = 0;

while (i < A.size() && j < B.size()) {
if (A[i] < B[j]) {
i += 1;
} else if (B[j] < A[i]) {
j += 1;
} else {
// A[i] == B[j]
*output++ = A[i];

i += 1;
j += 1;
}
}
}
```

If there is a huge difference between sizes of collections, then we might use two different approaches described below. I use terms "smaller" (k items) and "larger" collection/set (n items).

# SIMD algorithm

A SIMD approach is vectorized linear algorithm in which a single value from the smaller set is compared with vectors read from the larger set. Thanks to that the number of comparisons is smaller and we can faster skip parts of the larger collection.

In the example below we assume that the first element of the smaller collection is 9, and the head of the larger collection contains values [-5, 1, 2, 5, 7, 8, 9, 12].

• Fill the vector A with the first item from the smaller collection:

```+-----+-----+-----+-----+
|  9  |  9  |  9  |  9  |
+-----+-----+-----+-----+
```
• Load into another vector B a part of the larger collection:

```+-----+-----+-----+-----+
| -5  |  1  |  2  |  5  |
+-----+-----+-----+-----+
```
• Compare B < A:

```+-----+-----+-----+-----+
| -1  | -1  | -1  | -1  |
+-----+-----+-----+-----+
```
• If all items in B are smaller (all bits of the comparison vector are set), then load the next chunk into B:

```+-----+-----+-----+-----+
|  7  |  8  |  9  | 12  |
+-----+-----+-----+-----+
```
• And keep comparing with A; now B < A vector is equal:

```+-----+-----+-----+-----+
| -1  | -1  |  0  |  0  |
+-----+-----+-----+-----+
```
• The result vector contains some zero items. It means that B may contain the value present in A. If it's true (in the example it is), save this value in the output collection. In either case fill A with the next value from the smaller collection and again compare with B.

Below is a sample implementation that uses SSE instructions.

```template <typename INSERTER>
void sse_set_intersection(const vec& A, const vec& B, INSERTER output) {

__m128i a_rep;
__m128i b;

size_t ai = 0;
size_t bi = 0;

a_rep = _mm_set1_epi32(A[ai]);

while (ai < A.size() && bi < B.size()) {
const __m128i lt = _mm_cmplt_epi32(b, a_rep);
// all elements in b are smaller, fetch the next chunk from B
bi += 4;
} else {
// there might be element equal to A[ai]
// a simple linear search, as there're only 4 elements to search in
for (size_t i = 0; i < 4; i++) {
if (B[bi + i] == A[ai]) {
output = A[ai];
break;
}
}

// fetch the next value from A
ai += 1;
a_rep = _mm_set1_epi32(A[ai]);
}
}
}
```

# Evaluation

Experiment:

• The larger collection is a C++ std::vector<int32_t> that has got 1024 * 1024 random elements.
• The size of smaller table varies. It is always a subset of the larger collection.
• Intersection is done 1000 times and the minimum time is noted.

Full source code is available.

## Westmere M540 (Core i5)

size A size B size ratio std [us] SSE [us] binsearch [us]
128 1048576 8192.00 1802 666 16
x 2.71 x 112.62
256 1048576 4096.00 1803 670 37
x 2.69 x 48.73
384 1048576 2730.67 1807 672 53
x 2.69 x 34.09
512 1048576 2048.00 1807 674 69
x 2.68 x 26.19
640 1048576 1638.40 1816 684 84
x 2.65 x 21.62
768 1048576 1365.33 1817 685 97
x 2.65 x 18.73
896 1048576 1170.29 1820 690 112
x 2.64 x 16.25
1024 1048576 1024.00 1823 698 127
x 2.61 x 14.35
1152 1048576 910.22 1824 695 146
x 2.62 x 12.49
1280 1048576 819.20 1825 705 159
x 2.59 x 11.48
2560 1048576 409.60 1844 738 433
x 2.50 x 4.26
6400 1048576 163.84 1889 819 1009
x 2.31 x 1.87
2048 1048576 512.00 1837 724 319
x 2.54 x 5.76
3072 1048576 341.33 1851 750 549
x 2.47 x 3.37
4096 1048576 256.00 1863 770 719
x 2.42 x 2.59
5120 1048576 204.80 1875 796 863
x 2.36 x 2.17
6144 1048576 170.67 1888 817 982
x 2.31 x 1.92
7168 1048576 146.29 1900 836 1092
x 2.27 x 1.74
8192 1048576 128.00 1913 857 1180
x 2.23 x 1.62
9216 1048576 113.78 1927 876 1274
x 2.20 x 1.51
10240 1048576 102.40 1938 893 1349
x 2.17 x 1.44
20480 1048576 51.20 2059 1100 1892
x 1.87 x 1.09
51200 1048576 20.48 2388 1741 2878
x 1.37 x 0.83

## Skylake 6700 (Core i7)

size A size B size ratio std [us] SSE [us] AVX2 [us] binsearch [us]
128 1048576 8192.00 771 313 190 2
x 2.46 x 4.06 x 385.50
256 1048576 4096.00 772 316 192 8
x 2.44 x 4.02 x 96.50
384 1048576 2730.67 774 318 193 24
x 2.43 x 4.01 x 32.25
512 1048576 2048.00 775 320 195 38
x 2.42 x 3.97 x 20.39
640 1048576 1638.40 780 324 197 50
x 2.41 x 3.96 x 15.60
768 1048576 1365.33 780 326 199 59
x 2.39 x 3.92 x 13.22
896 1048576 1170.29 782 329 201 70
x 2.38 x 3.89 x 11.17
1024 1048576 1024.00 784 331 203 81
x 2.37 x 3.86 x 9.68
1152 1048576 910.22 785 334 204 90
x 2.35 x 3.85 x 8.72
1280 1048576 819.20 786 336 206 98
x 2.34 x 3.82 x 8.02
2560 1048576 409.60 801 361 224 178
x 2.22 x 3.58 x 4.50
6400 1048576 163.84 844 438 274 364
x 1.93 x 3.08 x 2.32
2048 1048576 512.00 796 351 217 148
x 2.27 x 3.67 x 5.38
3072 1048576 341.33 807 372 230 206
x 2.17 x 3.51 x 3.92
4096 1048576 256.00 818 391 244 258
x 2.09 x 3.35 x 3.17
5120 1048576 204.80 830 412 259 305
x 2.01 x 3.20 x 2.72
6144 1048576 170.67 841 431 271 351
x 1.95 x 3.10 x 2.40
7168 1048576 146.29 853 451 285 396
x 1.89 x 2.99 x 2.15
8192 1048576 128.00 864 471 299 438
x 1.83 x 2.89 x 1.97
9216 1048576 113.78 875 490 312 480
x 1.79 x 2.80 x 1.82
10240 1048576 102.40 886 510 325 520
x 1.74 x 2.73 x 1.70
20480 1048576 51.20 995 704 459 880
x 1.41 x 2.17 x 1.13
51200 1048576 20.48 1302 1292 881 1811
x 1.01 x 1.48 x 0.72

# Conculsions

• When differences in size of sets is huge the presented approaches might be significantly faster.
• The algorithm that use binary search is faster, but the size ratio is strictly limited.
• Both SSE and AVX2 implementations perform better than linear search in circumstances when binary search become slower.