Author: | Wojciech Muła |
---|---|
Added on: | 2018-03-14 |
Updated on: | 2018-03-16 |
Contents
The intersection of two sets represented by sorted collections (like lists or arrays) can be done in linear time. If we label with k the size of one collection, and with n the size of another collection, then the complexity of intersection is O(n + k).
Below is a naive C++ implementation; the C++ standard library comes with std::set_intersection.
template <typename INSERTER> void custom_set_intersection(const vec& A, const vec& B, INSERTER output) { size_t i = 0; size_t j = 0; while (i < A.size() && j < B.size()) { if (A[i] < B[j]) { i += 1; } else if (B[j] < A[i]) { j += 1; } else { // A[i] == B[j] *output++ = A[i]; i += 1; j += 1; } } }
If there is a huge difference between sizes of collections, then we might use two different approaches described below. I use terms "smaller" (k items) and "larger" collection/set (n items).
In this approach we iterate over the smaller collection. For each item the binary search on the larger collection is performed. The complexity of this solution is O(k ⋅ logn).
A clear disadvantage is that at least the larger collection must support random access. It must be an array, a vector or something similar.
To make a practical implementation faster, we utilize the fact that both collections are sorted. In each iteration we are not looking for an equal element but for the first element which is greater or equal to the searched one. Thanks to that the searched range is narrowed after each iteration: the part of the larger collection containing elements smaller than the elements already tested is never touched again. It doesn't change the asymptotic complexity, though.
In C++ the function std::lower_bound implements this look up; it's a modification of the binary search algorithm.
Below is a C++ implementation.
template <typename INSERTER> void binsearch_set_intersection(const vec& A, const vec& B, INSERTER output) { auto it = B.begin(); for (const auto& a: A) { it = std::lower_bound(it, B.end(), a); if (it == B.end()) { // since there are no values greater or equal the current, // there also won't be any for subsequent value (all greater than current) return; } if (*it == a) { output = a; ++it; } } }
Lets assume that the number of operations in the linear algorithm is exactly n + k. Likewise, for binary search we may assume that on average there are k ⋅ log2n operations. It is of course a simplification, because we are unable to derive the exact numbers of operations, as they depend on input values and their distribution.
We start with an obvious inequality:
k + n < k ⋅ log2(n)
n < k( log2(n) − 1)
n/k < log2(n) − 1
If the proportion n/k is smaller than