Author: | Wojciech Muła |
---|---|
Added on: | 2016-10-08 |
Updated on: | 2016-11-27 (tidy), 2016-10-17 (link to GCC's bugtrack) |
Today's weather was a really good excuse for staying at home and hacking. I come back to my old idea of speeding up substring searching exploiting SSE4.1 instruction MPSADBW. Finally I translated my old SSE code into AVX2. Then I transplanted my recent AVX512F code which employs similar techniques. In the end I coded AVX512F and SSE2 versions of SWAR-friendly technique which I developed some time ago. (Github repository contains all the stuff, if you wish to take a look.)
But let's put aside vector algorithms performance. Experiments have revelead that method find from C++ std::string is incredibly slow. The method can be slower an order of magnitude than strstr, and it doesn't get better with a newer stdlib — GCC versions varies from 4.9.2 (Debian) to 5.4.0 (Ubuntu).
Here is the table I took from the repository's README. The slowdown factors, when compared with strstr, are: 12.9, 12.6, 9.9, 7.4.
architecture | procedure time in seconds | |
---|---|---|
strstr | string::find | |
Westemere | 0.83 | 10.71 |
Haswell | 0.48 | 6.06 |
Skylake | 0.66 | 6.54 |
KNL | 4.96 | 36.94 |
I looked into headers, and I suppose located the culprit in bits/basic_string.tcc. Maybe I'm wrong, but for me it's a linear search, isn't it?
find(const _CharT* __s, size_type __pos, size_type __n) const { __glibcxx_requires_string_len(__s, __n); const size_type __size = this->size(); const _CharT* __data = _M_data(); if (__n == 0) return __pos <= __size ? __pos : npos; if (__n <= __size) { for (; __pos <= __size - __n; ++__pos) if (traits_type::eq(__data[__pos], __s[0]) && traits_type::compare(__data + __pos + 1, __s + 1, __n - 1) == 0) return __pos; } return npos; }
The assembly dump taken from perf annotate:
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::find 78:┌─→add $0x1,%rbx │ add $0x1,%rbp │ cmp %r15,%rax │↓ ja b8 85:│ cmp -0x1(%rbp),%r14b │ lea -0x1(%rbx),%r13 │ mov %rbx,%rax │↑ jne 78 │ test %r12,%r12 │↑ je 25 │ mov 0x8(%rsp),%rsi │ mov %r12,%rdx │ mov %rbp,%rdi │→ callq memcmp@plt │ test %eax,%eax │↑ je 25 │ mov %rbx,%rax └──jmp 78
It must be slow. And I think C++ users should be worried a little bit.
The problem has been already reported (bug 66414).