| Author: | Wojciech Muła |
|---|---|
| Added on: | 2016-11-28 |
| Updated on: | 2018-02-14 (spelling), 2017-04-29 (ARMv8 results) |
Popular programming languages provide methods or functions which locate a substring in a given string. In C it is the function strstr, the C++ class std::string has the method find, Python's string has the methods find and index, and so on. All these APIs were designed for one-shot searches. Over the past decades several algorithms solving this problem were designed; an excellent page by Christian Charras and Thierry Lecroq lists most of them (if not all). Basically these algorithms fall into two major categories: (1) those based on a Deterministic Finite Automaton, like Knuth-Morris-Pratt, Boyer-Moore, etc., and (2) those based on simple comparisons, like the Karp-Rabin algorithm.
The main problem with these classic algorithms is the silent assumption that comparing a pair of characters, looking up a value in an extra table and evaluating conditions are cheap, while comparing two substrings is expensive.
But current desktop CPUs do not meet this assumption; in particular, branches and extra table lookups are relatively costly, while comparing whole blocks of characters is cheap thanks to SIMD instructions.
This article shows two approaches utilizing SIMD instructions which I've already described in SIMD-friendly Rabin-Karp modification and SSE4 string search — modification of Karp-Rabin algorithm. I merged those articles, compared the two methods and extended the material. The article also shows performance results for various implementations, ranging from SWAR to AVX512F.
The Karp-Rabin algorithm does the exact substring comparison only when weak hashes are equal. One hash is calculated just once, for the searched substring; the other is calculated for the current portion of the string and is updated at a small cost in every iteration. The following pseudocode shows the idea:
```
k  := length(substring)
h1 := hash(substring)
h2 := hash(string[0 .. k])

for i in 0 .. n - k loop
    if h1 == h2 then
        if substring == string[i .. i + k] then
            return i
        end if
    end if

    h2 := next_hash(h2, ...)   # this is meant to be cheap
end loop
```
SIMD solutions replace the hash predicate with a vector predicate, which is calculated in parallel and, hopefully, quickly. For each "true" element of the predicate vector an exact comparison of substrings is performed.
This is one source of improvement; another is a careful implementation. A generic implementation calls a function like memcmp to compare substrings. But since we know the length of the searched substring, we may provide specialisations for certain lengths, where the subprocedure call is replaced by a few CPU instructions, sometimes even just one. Thanks to that, the cost of calling the procedure and all the internal costs of memcmp simply disappear. A minimal sketch of such a specialisation is shown below.
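As an illustration only (the helper name and the fixed needle length are my assumptions, not part of the original code), a comparison specialised for a 4-character needle might look like this:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical specialisation for a needle of length 4: the predicate already
// guarantees that characters 0 and 3 match, so only characters 1 and 2 are
// compared -- one 16-bit load and compare instead of a call to memcmp.
inline bool cmp_middle2(const char* s, const char* needle) {
    uint16_t a, b;
    std::memcpy(&a, s + 1, 2);       // string[1..2], memcpy avoids alignment issues
    std::memcpy(&b, needle + 1, 2);  // needle[1..2]
    return a == b;
}
```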
This algorithm is suitable for all SIMD instruction sets and also for the SWAR approach. As the predicate it uses equality of the first and the last characters of the substring.
These two characters are populated in two registers, F and L respectively. Then, in each iteration, two chunks of the string are loaded. The first chunk (A) is read from offset i (where i is the current offset) and the second chunk (B) is read from offset i + k - 1, where k is the substring's length.
Then we compute the vector expression (F == A) and (L == B). This step yields a byte vector (or a bit mask), where "true" values denote positions of potential substring occurrences. Finally, exact comparisons of substrings are performed only at these positions.
Let's assume 8-byte registers. We're searching for the word "cat", thus:
```
F = [ c | c | c | c | c | c | c | c ]
L = [ t | t | t | t | t | t | t | t ]
```
We're searching in the string "a_cat_tries". In the first iteration the register A gets data from offset 0, B from offset 2:
```
A = [ a | _ | c | a | t | _ | t | r ]
B = [ c | a | t | _ | t | r | i | e ]
```
Now we compare:
```
AF = (A == F) = [ 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 ]
BL = (B == L) = [ 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 ]
```
After merging the comparison results, i.e. AF & BL, we get the following mask:
```
mask = [ 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 ]
```
Since the mask is non-zero, there are possible substring occurrences. As we can see, there is only one non-zero element, at index 2, so only one substring comparison needs to be performed.
Choosing the first and the last character of the substring is not always a wise decision. Consider the following scenario: a string contains mostly 'A' characters, and a user wants to find "AjohndoeA" — in such a situation the number of character-wise comparisons would be large.
In order to prevent such situations an implementation can pick as the "last" character the farthest character that is not equal to the first one. If there is no such character, then all characters in the substring are the same (for example "AAAAA"), and a specialised procedure may be used to handle such patterns. A possible way of picking that character is sketched below.
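A minimal sketch of such a selection (the helper name is my own, not taken from the original code):

```cpp
#include <cstddef>

// Hypothetical helper: return the offset of the farthest character of the
// needle that differs from the first character.  A return value of 0 means
// all characters are equal (e.g. "AAAAA") and a dedicated routine should be
// used instead.
inline size_t pick_last_offset(const char* needle, size_t k) {
    for (size_t i = k - 1; i > 0; i--) {
        if (needle[i] != needle[0]) {
            return i;
        }
    }
    return 0; // all characters equal
}
```

With this offset the second chunk would be loaded from i plus that offset instead of i + k - 1, while the exact comparison still has to check all remaining characters.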
Both the SSE and AVX2 versions are practically the same, and both use the minimum number of instructions. Below is the generic AVX2 version.
It's worth noting that since we already know that the first and the last characters match, we don't need to compare them again with memcmp.
```cpp
size_t avx2_strstr_anysize(const char* s, size_t n, const char* needle, size_t k) {

    const __m256i first = _mm256_set1_epi8(needle[0]);
    const __m256i last  = _mm256_set1_epi8(needle[k - 1]);

    for (size_t i = 0; i < n; i += 32) {

        const __m256i block_first = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s + i));
        const __m256i block_last  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s + i + k - 1));

        const __m256i eq_first = _mm256_cmpeq_epi8(first, block_first);
        const __m256i eq_last  = _mm256_cmpeq_epi8(last, block_last);

        uint32_t mask = _mm256_movemask_epi8(_mm256_and_si256(eq_first, eq_last));

        while (mask != 0) {

            const auto bitpos = bits::get_first_bit_set(mask);

            if (memcmp(s + i + bitpos + 1, needle + 1, k - 2) == 0) {
                return i + bitpos;
            }

            mask = bits::clear_leftmost_set(mask);
        }
    }

    return std::string::npos;
}
```
In the SWAR approach, the comparison for equality uses the bitwise xor operation, which yields zero when two bytes are equal. Therefore, instead of and-ing the partial results, a bitwise or is used. Clearly, this part of the algorithm has the same complexity as the SSE/AVX2 code.
However, SWAR requires more effort to locate the zero bytes. The following snippet calculates an exact byte mask, in which the most significant bit of a byte is set when the corresponding byte of x is zero.
```cpp
// 7th bit set if the lower 7 bits are zero
const uint64_t t0 = (~x & 0x7f7f7f7f7f7f7f7fllu) + 0x0101010101010101llu;
// 7th bit set if the 7th bit is zero
const uint64_t t1 = (~x & 0x8080808080808080llu);
uint64_t zeros = t0 & t1;
```
Below is the C++ implementation for 64-bit vectors. The while loop contains an additional condition which might not look optimal, but searching for the first set bit and then clearing it (as the SSE version does) turned out to be slower.
```cpp
size_t swar64_strstr_anysize(const char* s, size_t n, const char* needle, size_t k) {

    const uint64_t first = 0x0101010101010101llu * static_cast<uint8_t>(needle[0]);
    const uint64_t last  = 0x0101010101010101llu * static_cast<uint8_t>(needle[k - 1]);

    uint64_t* block_first = reinterpret_cast<uint64_t*>(const_cast<char*>(s));
    uint64_t* block_last  = reinterpret_cast<uint64_t*>(const_cast<char*>(s + k - 1));

    for (auto i = 0u; i < n; i += 8, block_first++, block_last++) {

        const uint64_t eq = (*block_first ^ first) | (*block_last ^ last);

        const uint64_t t0 = (~eq & 0x7f7f7f7f7f7f7f7fllu) + 0x0101010101010101llu;
        const uint64_t t1 = (~eq & 0x8080808080808080llu);
        uint64_t zeros = t0 & t1;
        size_t j = 0;

        while (zeros) {
            if (zeros & 0x80) {
                const char* substr = reinterpret_cast<char*>(block_first) + j + 1;
                if (memcmp(substr, needle + 1, k - 2) == 0) {
                    return i + j;
                }
            }

            zeros >>= 8;
            j += 1;
        }
    }

    return std::string::npos;
}
```
AVX512F lacks operations on bytes; the smallest vector element is a 32-bit word. This limitation forces us to use SWAR techniques within the 32-bit lanes.
Unlike the SWAR procedure, where we need a precise mask of zero bytes, the AVX512F procedure requires only the information "this 32-bit word contains a zero byte". A simpler algorithm, described in Bit Twiddling Hacks, is used; below is its C++ implementation.
```cpp
__mmask16 zero_byte_mask(const __m512i v) {

    const __m512i v01  = _mm512_set1_epi32(0x01010101u);
    const __m512i v80  = _mm512_set1_epi32(0x80808080u);

    const __m512i v1   = _mm512_sub_epi32(v, v01);
    // tmp1 = (v - 0x01010101) & ~v & 0x80808080
    const __m512i tmp1 = _mm512_ternarylogic_epi32(v1, v, v80, 0x20);

    return _mm512_test_epi32_mask(tmp1, tmp1);
}
```
Generic C++ implementation.
```cpp
#define _mm512_set1_epu8(c) _mm512_set1_epi32(uint32_t(c) * 0x01010101u)

size_t avx512f_strstr_v2_anysize(const char* string, size_t n, const char* needle, size_t k) {

    assert(n > 0);
    assert(k > 0);

    const __m512i first = _mm512_set1_epu8(needle[0]);
    const __m512i last  = _mm512_set1_epu8(needle[k - 1]);

    char* haystack = const_cast<char*>(string);
    char* end      = haystack + n;

    for (/**/; haystack < end; haystack += 64) {

        const __m512i block_first = _mm512_loadu_si512(haystack + 0);
        const __m512i block_last  = _mm512_loadu_si512(haystack + k - 1);

        const __m512i first_zeros = _mm512_xor_si512(block_first, first);
        // zeros = first_zeros | (block_last ^ last)
        const __m512i zeros = _mm512_ternarylogic_epi32(first_zeros, block_last, last, 0xf6);

        uint32_t mask = zero_byte_mask(zeros);

        while (mask) {
            const uint64_t p = __builtin_ctz(mask);

            if (memcmp(haystack + 4*p + 0, needle, k) == 0) {
                return (haystack - string) + 4*p + 0;
            }

            if (memcmp(haystack + 4*p + 1, needle, k) == 0) {
                return (haystack - string) + 4*p + 1;
            }

            if (memcmp(haystack + 4*p + 2, needle, k) == 0) {
                return (haystack - string) + 4*p + 2;
            }

            if (memcmp(haystack + 4*p + 3, needle, k) == 0) {
                return (haystack - string) + 4*p + 3;
            }

            mask = bits::clear_leftmost_set(mask);
        }
    }

    return size_t(-1);
}
```
The algorithm can also be easily realised with ARM Neon instructions, which provide 128-bit SIMD registers. The only problem is the long round trip of data from the Neon unit back to the main CPU registers.
This was solved by storing the comparison result back to memory as a 64-bit word: the lower nibbles come from the lower half of the SIMD register, and likewise the higher nibbles come from the higher half of the register.
Comparison is done in two loops, separately for lower and higher nibbles. This split is required to detect substring occurrences in the correct order.
Below is a sample implementation.
```cpp
size_t FORCE_INLINE neon_strstr_anysize(const char* s, size_t n, const char* needle, size_t k) {

    assert(k > 0);
    assert(n > 0);

    const uint8x16_t first = vdupq_n_u8(needle[0]);
    const uint8x16_t last  = vdupq_n_u8(needle[k - 1]);
    const uint8x8_t  half  = vdup_n_u8(0x0f);

    const uint8_t* ptr = reinterpret_cast<const uint8_t*>(s);

    union {
        uint8_t  tmp[8];
        uint32_t word[2];
    };

    for (size_t i = 0; i < n; i += 16) {

        const uint8x16_t block_first = vld1q_u8(ptr + i);
        const uint8x16_t block_last  = vld1q_u8(ptr + i + k - 1);

        const uint8x16_t eq_first = vceqq_u8(first, block_first);
        const uint8x16_t eq_last  = vceqq_u8(last, block_last);
        const uint8x16_t pred_16  = vandq_u8(eq_first, eq_last);
        const uint8x8_t  pred_8   = vbsl_u8(half, vget_low_u8(pred_16), vget_high_u8(pred_16));

        vst1_u8(tmp, pred_8);

        if ((word[0] | word[1]) == 0) {
            continue;
        }

        for (int j = 0; j < 8; j++) {
            if (tmp[j] & 0x0f) {
                if (memcmp(s + i + j + 1, needle + 1, k - 2) == 0) {
                    return i + j;
                }
            }
        }

        for (int j = 0; j < 8; j++) {
            if (tmp[j] & 0xf0) {
                if (memcmp(s + i + j + 1 + 8, needle + 1, k - 2) == 0) {
                    return i + j + 8;
                }
            }
        }
    }

    return std::string::npos;
}
```
It turned out that unrolling the two inner loops brought a speedup of about 1.2 times.
The AArch64 code is almost an exact copy of the above ARM Neon procedure. The only exception is reading SIMD register lanes directly, as the architecture made this operation fast. A sketch of that difference is shown below.
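For illustration only, a minimal sketch of how the direct lane reads might look (the helper name is mine, not from the original code); it assumes the same eq_first/eq_last vectors as in the Neon procedure above:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Hypothetical AArch64 helper: instead of storing the predicate to memory,
// read the two 64-bit lanes of the comparison result straight from the SIMD
// register, which is cheap on this architecture.
static inline bool any_candidate(uint8x16_t eq_first, uint8x16_t eq_last,
                                 uint64_t& lo, uint64_t& hi) {
    const uint8x16_t pred = vandq_u8(eq_first, eq_last);
    lo = vgetq_lane_u64(vreinterpretq_u64_u8(pred), 0);  // bytes 0..7
    hi = vgetq_lane_u64(vreinterpretq_u64_u8(pred), 1);  // bytes 8..15
    return (lo | hi) != 0;                               // any candidate position?
}
```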
SSE4.1 and AVX2 provide the instruction MPSADBW, which calculates eight Manhattan (L1) distances between a given 4-byte sub-vector of one register and eight consecutive 4-byte sub-vectors of the second register. The instruction returns a vector of eight 16-bit words.
When two sub-vectors are equal, the L1 distance is 0, and we may use this property to locate possible substring occurrences. In other words, equality of the four leading characters is used as the predicate.
Although this seems to be a stronger predicate than matching the first and the last characters, quadratic worst-case complexity is unavoidable. For example, when the searched string consists solely of the letter "a" and we're looking for "aaaabcde", the predicate will obviously be true at every input position.
As if that were not enough, the approach has further practical problems.
The generic, simplest implementation.
```cpp
size_t sse4_strstr_anysize(const char* s, size_t n, const char* needle, size_t needle_size) {

    const __m128i prefix = _mm_loadu_si128(reinterpret_cast<const __m128i*>(needle));
    const __m128i zeros  = _mm_setzero_si128();

    for (size_t i = 0; i < n; i += 8) {

        const __m128i data   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + i));
        const __m128i result = _mm_mpsadbw_epu8(data, prefix, 0);

        const __m128i cmp = _mm_cmpeq_epi16(result, zeros);

        unsigned mask = _mm_movemask_epi8(cmp) & 0x5555;

        while (mask != 0) {

            const auto bitpos = bits::get_first_bit_set(mask)/2;

            if (memcmp(s + i + bitpos + 4, needle + 4, needle_size - 4) == 0) {
                return i + bitpos;
            }

            mask = bits::clear_leftmost_set(mask);
        }
    }

    return std::string::npos;
}
```
Although AVX512F doesn't support MPSADBW (AVX512BW defines it), we can still use 4-byte prefix equality as a predicate, utilizing the fact that 32-bit elements are natively supported.
In each iteration we generate four AVX512 vectors containing all possible 4-byte prefixes. Example:
string = "the-cat-tries-to-eat..." vec0 = [ t | h | e | - ][ c | a | t | - ][ t | r | i | e ][ s | - | t | o ][ ... ] vec1 = [ h | e | - | c ][ a | t | - | t ][ r | i | e | s ][ - | t | o | - ][ ... ] vec2 = [ e | - | c | a ][ t | - | t | r ][ i | e | s | - ][ t | o | - | e ][ ... ] vec3 = [ - | c | a | t ][ - | t | r | i ][ e | s | - | t ][ o | - ] e | a ][ ... ]
Vector vec0 contains the prefixes at positions 0, 4, 8, 12, ...; vec1 at 1, 5, 9, 13, ...; vec2 at 2, 6, 10, 14, ...; and vec3 at 3, 7, 11, 15, etc.
Building each vector requires two shifts and one bitwise or. In each iteration four vector comparisons are performed and then four bitmasks are examined. This makes the loop which compares substrings quite complicated.
Moreover, to properly fill the last elements of the vectors we need four bytes lying beyond the current 64-byte block. This is accomplished by keeping two adjacent vectors per iteration (only one new load per iteration is needed, though). Finally, the instruction VALIGND (intrinsic _mm512_alignr_epi32) is used to extract the required data.
```cpp
size_t avx512f_strstr_long(const char* string, size_t n, const char* needle, size_t k) {

    __m512i curr;
    __m512i next;
    __m512i v0, v1, v2, v3;

    char* haystack = const_cast<char*>(string);
    char* last     = haystack + n;

    const uint32_t prf = *(uint32_t*)needle; // the first 4 bytes of needle
    const __m512i prefix = _mm512_set1_epi32(prf);

    next = _mm512_loadu_si512(haystack);

    for (/**/; haystack < last; haystack += 64) {

        curr = next;
        next = _mm512_loadu_si512(haystack + 64);
        const __m512i shft = _mm512_alignr_epi32(next, curr, 1);

        v0 = curr;

        {
            const __m512i t1 = _mm512_srli_epi32(curr, 8);
            const __m512i t2 = _mm512_slli_epi32(shft, 24);
            v1 = _mm512_or_si512(t1, t2);
        }

        {
            const __m512i t1 = _mm512_srli_epi32(curr, 16);
            const __m512i t2 = _mm512_slli_epi32(shft, 16);
            v2 = _mm512_or_si512(t1, t2);
        }

        {
            const __m512i t1 = _mm512_srli_epi32(curr, 24);
            const __m512i t2 = _mm512_slli_epi32(shft, 8);
            v3 = _mm512_or_si512(t1, t2);
        }

        uint16_t m0 = _mm512_cmpeq_epi32_mask(v0, prefix);
        uint16_t m1 = _mm512_cmpeq_epi32_mask(v1, prefix);
        uint16_t m2 = _mm512_cmpeq_epi32_mask(v2, prefix);
        uint16_t m3 = _mm512_cmpeq_epi32_mask(v3, prefix);

        int index = 64;
        while (m0 | m1 | m2 | m3) {
            if (m0) {
                int pos = __builtin_ctz(m0) * 4 + 0;
                m0 = m0 & (m0 - 1);

                if (pos < index && memcmp(haystack + pos + 4, needle + 4, k - 4) == 0) {
                    index = pos;
                }
            }

            if (m1) {
                int pos = __builtin_ctz(m1) * 4 + 1;
                m1 = m1 & (m1 - 1);

                if (pos < index && memcmp(haystack + pos + 4, needle + 4, k - 4) == 0) {
                    index = pos;
                }
            }

            if (m2) {
                int pos = __builtin_ctz(m2) * 4 + 2;
                m2 = m2 & (m2 - 1);

                if (pos < index && memcmp(haystack + pos + 4, needle + 4, k - 4) == 0) {
                    index = pos;
                }
            }

            if (m3) {
                int pos = __builtin_ctz(m3) * 4 + 3;
                m3 = m3 & (m3 - 1);

                if (pos < index && memcmp(haystack + pos + 4, needle + 4, k - 4) == 0) {
                    index = pos;
                }
            }
        }

        if (index < 64) {
            return (haystack - string) + index;
        }
    }

    return size_t(-1);
}
```
SSE4.2 introduced the String and Text New Instructions (STNI), a set of very complex instructions meant to be a building block for string operations. Unfortunately, Intel has practically abandoned STNI: no AVX2 counterparts were introduced, and in newer processors the instructions are extremely slow (a latency of 11 cycles is unacceptable).
Basically, the PCMPxSTRx instruction exists in four variants, which differ only in:

- how the lengths of the strings are given: explicitly in registers ("E" variants) or implicitly, by the null terminator ("I" variants);
- the form of the result: an index of the match ("I" suffix) or a mask ("M" suffix).
An additional argument of the instruction (an immediate constant) defines several aspects of its execution, in particular the comparison algorithm. There are four different algorithms available; the one we're using is called equal ordered. Despite the name, this algorithm locates a substring, or its prefix if the substring extends beyond the register width.
For example, when we're searching for "ABCD" in "ABCD_ABC_ABCD_AB", the instruction returns the bitmask 0b0100001000000001, treating the suffix "AB" as a match. Thus we can only safely assume that the first character at a reported position matches, as the tail might or might not be present in the register. (Of course this could be determined, but it requires additional calculations, which is not very handy.)
Below is a code snippet which does the above operation.
```cpp
const char* s1 = "ABCD_ABC_ABCD_AB";
const char* s2 = "ABCD____________";

const int mode = _SIDD_UBYTE_OPS
               | _SIDD_CMP_EQUAL_ORDERED
               | _SIDD_BIT_MASK;

const __m128i N   = _mm_loadu_si128((__m128i*)(s2));
const __m128i D   = _mm_loadu_si128((__m128i*)(s1));
const __m128i res = _mm_cmpestrm(N, 4, D, 16, mode);

uint64_t mask = _mm_cvtsi128_si64(res); // = 0b0100001000000001
```
```cpp
size_t sse42_strstr_anysize(const char* s, size_t n, const char* needle, size_t k) {

    const __m128i N = _mm_loadu_si128((__m128i*)needle);

    for (size_t i = 0; i < n; i += 16) {

        const int mode = _SIDD_UBYTE_OPS
                       | _SIDD_CMP_EQUAL_ORDERED
                       | _SIDD_BIT_MASK;

        const __m128i D   = _mm_loadu_si128((__m128i*)(s + i));
        const __m128i res = _mm_cmpestrm(N, k, D, n - i, mode);

        uint64_t mask = _mm_cvtsi128_si64(res);

        while (mask != 0) {

            const auto bitpos = bits::get_first_bit_set(mask);

            // we know that at least the first character of needle matches
            if (memcmp(s + i + bitpos + 1, needle + 1, k - 1) == 0) {
                return i + bitpos;
            }

            mask = bits::clear_leftmost_set(mask);
        }
    }

    return std::string::npos;
}
```
The performance of various SIMD implementations was measured. The test programs also contain specialisations for short substrings, selected at run time. The performance of the C strstr is included for comparison. I omitted C++ string::find due to a performance bug in GNU libstdc++ which makes the method 10 times slower than strstr.
The test programs were run three times. Computers running either Debian or Ubuntu were tested; the tables below identify them by microarchitecture.
Time in seconds:

| procedure | Westmere | Bulldozer | Haswell | Skylake | KNL |
|---|---|---|---|---|---|
| std::strstr | 0.82246 | 9.37792 | 0.52786 | 0.66148 | 4.94606 |
| std::string::find | --- | --- | --- | --- | --- |
| SWAR 64-bit (generic) | 2.49859 | 2.93836 | 1.57715 | 1.40404 | 8.17075 |
| SSE2 (generic) | 0.74589 | 0.78871 | 0.55435 | 0.48863 | 6.10786 |
| SSE4.1 (MPSADBW) | 1.45040 | 1.98863 | 0.89775 | 0.63875 | 18.71666 |
| SSE4.1 (MPSADBW unrolled) | 1.23849 | 2.06008 | 0.99647 | 0.87919 | 13.72486 |
| SSE4.2 (PCMPESTRM) | 1.69968 | 2.00681 | 1.55992 | 1.39063 | 6.28869 |
| AVX2 (MPSADBW) | --- | --- | 0.61578 | 0.56981 | 13.15136 |
| AVX2 (generic) | --- | --- | 0.38653 | 0.36309 | 4.09478 |
| AVX512F (MPSADBW-like) | --- | --- | --- | --- | 2.32616 |
| AVX512F (generic) | --- | --- | --- | --- | 1.14057 |
| biggest speed-up over strstr | 1.10 | --- | 1.37 | 1.82 | 4.33 |
Time in seconds:

| procedure | ARMv7 | ARMv8 |
|---|---|---|
| std::strstr | 7.30405 | 3.37546 |
| std::string::find | 4.17131 | 1.81368 |
| SWAR 64-bit (generic) | 36.65012 | 0.46269 |
| SWAR 32-bit (generic) | 2.45058 | 0.81075 |
| ARM Neon (32-bit, generic) | 1.29861 | 0.40699 |
| AArch64 (64-bit, generic) | --- | 0.27897 |
| biggest speed-up over std::string::find | 3.1 | 6.5 |
Daniel Lemire gave me access to the Haswell, Skylake, KNL, Bulldozer and ARMv8 machines, where I compiled and ran the test programs. Thank you!
All implementations and test programs are available on GitHub.