Author: | Wojciech Muła |
---|---|

Added on: | 2018-03-14 |

Updated on: | 2018-03-16 |

Contents

The intersection of two sets represented by sorted collections (like lists or
arrays) can be done in linear time. If we label with *k* the size of one
collection, and with *n* the size of another collection, then the
complexity of intersection is O(*n* + *k*).

Below is a naive C++ implementation; the C++ standard library comes with std::set_intersection.

template <typename INSERTER> void custom_set_intersection(const vec& A, const vec& B, INSERTER output) { size_t i = 0; size_t j = 0; while (i < A.size() && j < B.size()) { if (A[i] < B[j]) { i += 1; } else if (B[j] < A[i]) { j += 1; } else { // A[i] == B[j] *output++ = A[i]; i += 1; j += 1; } } }

If there is a huge difference between sizes of collections, then we might
use two different approaches described below. I use terms "smaller"
(*k* items) and "larger" collection/set (*n* items).

In this approach we iterate over the smaller collection. For each item the
binary search on the larger collection is performed. The complexity of this
solution is O(*k* ⋅ log*n*).

A clear disadvantage is that at least the larger collection must support random access. It must be an array, a vector or something similar.

To make a practical implementation faster, we utilize the fact that both collections are sorted. In each iteration we are not looking for an equal element but for the first element which is greater or equal to the searched one. Thanks to that the searched range is narrowed after each iteration: the part of the larger collection containing elements smaller than the elements already tested is never touched again. It doesn't change the asymptotic complexity, though.

In C++ the function std::lower_bound implements this look up; it's a modification of binary search algorithm.

Below is a C++ implementation.

template <typename INSERTER> void binsearch_set_intersection(const vec& A, const vec& B, INSERTER output) { auto it = B.begin(); for (const auto& a: A) { it = std::lower_bound(it, B.end(), a); if (it == B.end()) { // since there are no values greater or equal the current, // there also won't be any for subsequent value (all greater than current) return; } if (*it == a) { output = a; ++it; } } }

Lets assume that the number of operations in the linear algorithm is exactly
*n* + *k*. Likewise, for binary search we may assume that on average
there are *k* ⋅ log_{2}*n* operations. It is of course a simplification,
because we are unable to derive the exact numbers of operations, as they
depend on input values and their distribution.

We start with an obvious inequality:

k+n<k⋅ log_{2}(n)

n<k(log_{2}(n) − 1)

n/k< log_{2}(n) − 1

If the proportion *n*/*k* is smaller than log_{2}(*n*) − 1 then
**the linear algorithm will perform better**. After a simple transformation
we get the threshold for size of the smaller set:

k>n/(log_{2}(n) − 1)

For example, if *n* = 1.000.000 then *k* ≈ 50.000;
the proportion of sizes is 18-19. In practise, we may approximate
function log_{2} with the position of the first bit set in the
number *k*.

A SIMD approach is vectorized linear algorithm in which a single value from the smaller set is compared with vectors read from the larger set. Thanks to that the number of comparisons is smaller and we can faster skip parts of the larger collection.

In the example below we assume that the first element of the smaller collection is 9, and the head of the larger collection contains values [-5, 1, 2, 5, 7, 8, 9, 12].

Fill the vector

**A**with the first item from the smaller collection:+-----+-----+-----+-----+ | 9 | 9 | 9 | 9 | +-----+-----+-----+-----+

Load into another vector

**B**a part of the larger collection:+-----+-----+-----+-----+ | -5 | 1 | 2 | 5 | +-----+-----+-----+-----+

Compare

**B < A**:+-----+-----+-----+-----+ | -1 | -1 | -1 | -1 | +-----+-----+-----+-----+

If all items in

**B**are smaller (all bits of the comparison vector are set), then load the next chunk into**B**:+-----+-----+-----+-----+ | 7 | 8 | 9 | 12 | +-----+-----+-----+-----+

And keep comparing with

**A**; now**B < A**vector is equal:+-----+-----+-----+-----+ | -1 | -1 | 0 | 0 | +-----+-----+-----+-----+

The result vector contains some zero items. It means that

**B**may contain the value present in**A**. If it's true (in the example it is), save this value in the output collection. In either case fill**A**with the next value from the smaller collection and again compare with**B**.

Below is a sample implementation that uses SSE instructions.

template <typename INSERTER> void sse_set_intersection(const vec& A, const vec& B, INSERTER output) { __m128i a_rep; __m128i b; size_t ai = 0; size_t bi = 0; a_rep = _mm_set1_epi32(A[ai]); b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&B[bi])); while (ai < A.size() && bi < B.size()) { const __m128i lt = _mm_cmplt_epi32(b, a_rep); const uint16_t mask = _mm_movemask_epi8(lt); if (mask == 0xffff) { // all elements in b are smaller, fetch the next chunk from B bi += 4; b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&B[bi])); } else { // there might be element equal to A[ai] // a simple linear search, as there're only 4 elements to search in for (size_t i = 0; i < 4; i++) { if (B[bi + i] == A[ai]) { output = A[ai]; break; } } // fetch the next value from A ai += 1; a_rep = _mm_set1_epi32(A[ai]); } } }

Experiment:

- The larger collection is a C++
`std::vector<int32_t>`that has got 1024 * 1024 random elements. - The size of smaller table varies. It is always a subset of the larger collection.
- Intersection is done 1000 times and the
**minimum time**is noted.

Full source code is available.

size A | size B | size ratio | std [us] | SSE [us] | binsearch [us] |
---|---|---|---|---|---|

128 | 1048576 | 8192.00 | 1802 | 666 | 16 |

x 2.71 | x 112.62 | ||||

256 | 1048576 | 4096.00 | 1803 | 670 | 37 |

x 2.69 | x 48.73 | ||||

384 | 1048576 | 2730.67 | 1807 | 672 | 53 |

x 2.69 | x 34.09 | ||||

512 | 1048576 | 2048.00 | 1807 | 674 | 69 |

x 2.68 | x 26.19 | ||||

640 | 1048576 | 1638.40 | 1816 | 684 | 84 |

x 2.65 | x 21.62 | ||||

768 | 1048576 | 1365.33 | 1817 | 685 | 97 |

x 2.65 | x 18.73 | ||||

896 | 1048576 | 1170.29 | 1820 | 690 | 112 |

x 2.64 | x 16.25 | ||||

1024 | 1048576 | 1024.00 | 1823 | 698 | 127 |

x 2.61 | x 14.35 | ||||

1152 | 1048576 | 910.22 | 1824 | 695 | 146 |

x 2.62 | x 12.49 | ||||

1280 | 1048576 | 819.20 | 1825 | 705 | 159 |

x 2.59 | x 11.48 | ||||

2560 | 1048576 | 409.60 | 1844 | 738 | 433 |

x 2.50 | x 4.26 | ||||

6400 | 1048576 | 163.84 | 1889 | 819 | 1009 |

x 2.31 | x 1.87 | ||||

2048 | 1048576 | 512.00 | 1837 | 724 | 319 |

x 2.54 | x 5.76 | ||||

3072 | 1048576 | 341.33 | 1851 | 750 | 549 |

x 2.47 | x 3.37 | ||||

4096 | 1048576 | 256.00 | 1863 | 770 | 719 |

x 2.42 | x 2.59 | ||||

5120 | 1048576 | 204.80 | 1875 | 796 | 863 |

x 2.36 | x 2.17 | ||||

6144 | 1048576 | 170.67 | 1888 | 817 | 982 |

x 2.31 | x 1.92 | ||||

7168 | 1048576 | 146.29 | 1900 | 836 | 1092 |

x 2.27 | x 1.74 | ||||

8192 | 1048576 | 128.00 | 1913 | 857 | 1180 |

x 2.23 | x 1.62 | ||||

9216 | 1048576 | 113.78 | 1927 | 876 | 1274 |

x 2.20 | x 1.51 | ||||

10240 | 1048576 | 102.40 | 1938 | 893 | 1349 |

x 2.17 | x 1.44 | ||||

20480 | 1048576 | 51.20 | 2059 | 1100 | 1892 |

x 1.87 | x 1.09 | ||||

51200 | 1048576 | 20.48 | 2388 | 1741 | 2878 |

x 1.37 | x 0.83 |

size A | size B | size ratio | std [us] | SSE [us] | AVX2 [us] | binsearch [us] |
---|---|---|---|---|---|---|

128 | 1048576 | 8192.00 | 771 | 313 | 190 | 2 |

x 2.46 | x 4.06 | x 385.50 | ||||

256 | 1048576 | 4096.00 | 772 | 316 | 192 | 8 |

x 2.44 | x 4.02 | x 96.50 | ||||

384 | 1048576 | 2730.67 | 774 | 318 | 193 | 24 |

x 2.43 | x 4.01 | x 32.25 | ||||

512 | 1048576 | 2048.00 | 775 | 320 | 195 | 38 |

x 2.42 | x 3.97 | x 20.39 | ||||

640 | 1048576 | 1638.40 | 780 | 324 | 197 | 50 |

x 2.41 | x 3.96 | x 15.60 | ||||

768 | 1048576 | 1365.33 | 780 | 326 | 199 | 59 |

x 2.39 | x 3.92 | x 13.22 | ||||

896 | 1048576 | 1170.29 | 782 | 329 | 201 | 70 |

x 2.38 | x 3.89 | x 11.17 | ||||

1024 | 1048576 | 1024.00 | 784 | 331 | 203 | 81 |

x 2.37 | x 3.86 | x 9.68 | ||||

1152 | 1048576 | 910.22 | 785 | 334 | 204 | 90 |

x 2.35 | x 3.85 | x 8.72 | ||||

1280 | 1048576 | 819.20 | 786 | 336 | 206 | 98 |

x 2.34 | x 3.82 | x 8.02 | ||||

2560 | 1048576 | 409.60 | 801 | 361 | 224 | 178 |

x 2.22 | x 3.58 | x 4.50 | ||||

6400 | 1048576 | 163.84 | 844 | 438 | 274 | 364 |

x 1.93 | x 3.08 | x 2.32 | ||||

2048 | 1048576 | 512.00 | 796 | 351 | 217 | 148 |

x 2.27 | x 3.67 | x 5.38 | ||||

3072 | 1048576 | 341.33 | 807 | 372 | 230 | 206 |

x 2.17 | x 3.51 | x 3.92 | ||||

4096 | 1048576 | 256.00 | 818 | 391 | 244 | 258 |

x 2.09 | x 3.35 | x 3.17 | ||||

5120 | 1048576 | 204.80 | 830 | 412 | 259 | 305 |

x 2.01 | x 3.20 | x 2.72 | ||||

6144 | 1048576 | 170.67 | 841 | 431 | 271 | 351 |

x 1.95 | x 3.10 | x 2.40 | ||||

7168 | 1048576 | 146.29 | 853 | 451 | 285 | 396 |

x 1.89 | x 2.99 | x 2.15 | ||||

8192 | 1048576 | 128.00 | 864 | 471 | 299 | 438 |

x 1.83 | x 2.89 | x 1.97 | ||||

9216 | 1048576 | 113.78 | 875 | 490 | 312 | 480 |

x 1.79 | x 2.80 | x 1.82 | ||||

10240 | 1048576 | 102.40 | 886 | 510 | 325 | 520 |

x 1.74 | x 2.73 | x 1.70 | ||||

20480 | 1048576 | 51.20 | 995 | 704 | 459 | 880 |

x 1.41 | x 2.17 | x 1.13 | ||||

51200 | 1048576 | 20.48 | 1302 | 1292 | 881 | 1811 |

x 1.01 | x 1.48 | x 0.72 |

- When differences in size of sets is huge the presented approaches might be significantly faster.
- The algorithm that use binary search is faster, but the size ratio is strictly limited.
- Both SSE and AVX2 implementations perform better than linear search in circumstances when binary search become slower.