Author: | Wojciech Muła |
---|---|
Added on: | 2023-03-06 |
Contents
When I was browsing the source code of project Ada (WHATWG-compliant and fast URL parser written in modern C++) the following procedure caught my attention:
ada_really_inline size_t find_authority_delimiter_special(std::string_view view) noexcept { auto has_zero_byte = [](uint64_t v) { return ((v - 0x0101010101010101) & ~(v)&0x8080808080808080); }; auto index_of_first_set_byte = [](uint64_t v) { return ((((v - 1) & 0x101010101010101) * 0x101010101010101) >> 56) - 1;, }; auto broadcast = [](uint8_t v) -> uint64_t { return 0x101010101010101 * v; }; size_t i = 0; uint64_t mask1 = broadcast('@'); uint64_t mask2 = broadcast('/'); uint64_t mask3 = broadcast('?'); uint64_t mask4 = broadcast('\\'); for (; i + 7 < view.size(); i += 8) { uint64_t word{}; memcpy(&word, view.data() + i, sizeof(word)); word = swap_bytes_if_big_endian(word); uint64_t xor1 = word ^ mask1; uint64_t xor2 = word ^ mask2; uint64_t xor3 = word ^ mask3; uint64_t xor4 = word ^ mask4; uint64_t is_match = has_zero_byte(xor1) | has_zero_byte(xor2) | has_zero_byte(xor3) | has_zero_byte(xor4); if (is_match) { return i + index_of_first_set_byte(is_match); } } if (i < view.size()) { uint64_t word{}; memcpy(&word, view.data() + i, view.size() - i); word = swap_bytes_if_big_endian(word); uint64_t xor1 = word ^ mask1; uint64_t xor2 = word ^ mask2; uint64_t xor3 = word ^ mask3; uint64_t xor4 = word ^ mask4; uint64_t is_match = has_zero_byte(xor1) | has_zero_byte(xor2) | has_zero_byte(xor3) | has_zero_byte(xor4); if (is_match) { return i + index_of_first_set_byte(is_match); } } return view.size(); }
The above procedure finds the position of the first occurrence of a char from the set @, /, ? and /. It returns the length of input string if nothing was found.
The procedure uses SWAR techniques: it processes several bytes at once, taking advantage on the current CPUs architecture that process 64-bit values. The procedure implementation comes from src/helpers.cpp, and more function from that file follow exactly the same SWAR approach.
These two functions are crucial:
The pattern used is quite straightforward. If we bit-xor input bytes with a word filled with one of bytes from set, then the result has zero byte if the byte was there. We check then if it least one result of bit-xor has zero-byte and if it is true, we're looking for its position.
While the production code processes multi-word inputs, let's focus on a basic building block that processes a single 64-bit word.
int find_authority_delimiter_special_reference(uint64_t word) noexcept { auto has_zero_byte = [](uint64_t v) { return ((v - 0x0101010101010101) & ~(v)&0x8080808080808080); }; auto index_of_first_set_byte = [](uint64_t v) { return ((((v - 1) & 0x101010101010101) * 0x101010101010101) >> 56) - 1; }; auto broadcast = [](uint8_t v) -> uint64_t { return 0x101010101010101 * v; }; uint64_t mask1 = broadcast('@'); uint64_t mask2 = broadcast('/'); uint64_t mask3 = broadcast('?'); uint64_t mask4 = broadcast('\\'); uint64_t xor1 = word ^ mask1; uint64_t xor2 = word ^ mask2; uint64_t xor3 = word ^ mask3; uint64_t xor4 = word ^ mask4; uint64_t is_match = has_zero_byte(xor1) | has_zero_byte(xor2) | has_zero_byte(xor3) | has_zero_byte(xor4); if (is_match) { return index_of_first_set_byte(is_match); } return -1; }
The following assembly is produced by GCC 12.2.0 from Debian for the IceLake Server architecture (gcc -O3 -march=icelake-server).
movabs $0x2f2f2f2f2f2f2f2f,%rax movabs $0xfefefefefefefeff,%rsi xor %rdi,%rax movabs $0xd0d0d0d0d0d0d0d0,%rcx xor %rdi,%rcx add %rsi,%rax and %rcx,%rax movabs $0x4040404040404040,%rcx mov %rdi,%rdx xor %rdi,%rcx movabs $0xbfbfbfbfbfbfbfbf,%rdi xor %rdx,%rdi add %rsi,%rcx and %rdi,%rcx or %rcx,%rax movabs $0x3f3f3f3f3f3f3f3f,%rcx xor %rdx,%rcx movabs $0xc0c0c0c0c0c0c0c0,%rdi add %rsi,%rcx xor %rdx,%rdi and %rdi,%rcx or %rcx,%rax movabs $0x5c5c5c5c5c5c5c5c,%rcx xor %rdx,%rcx add %rsi,%rcx movabs $0xa3a3a3a3a3a3a3a3,%rsi xor %rsi,%rdx and %rcx,%rdx or %rdx,%rax movabs $0x8080808080808080,%rdx and %rdx,%rax je <_Z42find_authority_delimiter_special_referencem+0xc0> movabs $0x101010101010101,%rdx dec %rax and %rdx,%rax imul %rdx,%rax shr $0x38,%rax dec %eax ret mov $0xffffffff,%eax ret
The assembly contains:
While the approach used by Ada library is universal and can be used for locating any byte, there's faster method that can be used with the set contains only ASCII chars, i.e. bytes that do not have set the 7th bit. The set @, /, ? and \\ falls into this category.
In this approach we pretend that the input is 7-bit. If the input has any 8-bit byte (for instance UTF-8 characters) we'll fix up wrongly detected bytes at the end.
We also use bit-xor to produce zero bytes for matching bytes. However, we then convert the result into a bit. The result of bit-xor can be zero or non zero (x - bit 0 or 1, but at least one is 1):
┌─────────┐ ┌─────────┐ │0000 0000│ │0xxx xxxx│ └─────────┘ └─────────┘
If we add 0x7f = 0b0111_1111 then the result is:
┌─────────┐ ┌─────────┐ │0111 1111│ │1yyy yyyy│ └─────────┘ └─────────┘
We can see that the most significant bit expresses relation "byte is not zero". Let's review the basic algorithm steps, that use the above observation.
Input 8-byte string is a\\xc0ą@b?c:
'a'=0x61 \xc0 'ą'=<0xc4, 0x85> '@'=0x40 'b'=0x62 '?'=0x3f 'c'=0x63 ┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐ │0110 0001│1100 0000│1100 0100│1000 0101│0100 0000│0110 0010│0011 1111│0110 0011│ = input └─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
And we're looking for '@' = 0x40:
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐ │0100 0000│0100 0000│0100 0000│0100 0000│0100 0000│0100 0000│0100 0000│0100 0000│ = byte └─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
Xor input and mask:
t0 = word ^ 0x4040404040404040 ┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐ │0010 0001│1000 0000│1000 0100│1100 0101│0000 0000│0010 0010│0111 1111│0010 0011│ = t0 └─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘ 21 80 84 c5 00 22 7f 23
Mask MSB from input:
t1 = t0 & 0x7f7f7f7f7f7f7f7f ┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐ │0010 0001│0000 0000│0000 0100│0100 0101│0000 0000│0010 0010│0111 1111│0010 0011│ = t1 └─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘ 21 00 04 45 00 22 7f 23
Add 0x7f to locate non-zero bytes. Note that two bytes were found, but the first one is not valid.:
t2 = t1 + 0x7f7f7f7f7f7f7f7f ┌─────────┲━━━━━━━━━┱─────────┬─────────┲━━━━━━━━━┱─────────┬─────────┬─────────┐ │1010 0000┃0111 1111┃1000 0011│1100 0100┃0111 1111┃1010 0001│1111 1110│1010 0010│ = t2 └─────────┺━━━━━━━━━┹─────────┴─────────┺━━━━━━━━━┹─────────┴─────────┴─────────┘ a0 7f 83 c4 7f a1 fe a2
At this point, MSBs equal zero denote byte position. To fix up the result we bit-or the t2 with input. If input byte has MSB set, it will discard wrong results.:
t3 = t2 | input ┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐ │1110 0001│1111 1111│1100 0111│1100 0101│0111 1111│1110 0011│1111 1111│1110 0011│ = t3 └─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘ e1 ff c7 c5 7f e3 ff e3
Now, we keep only MSBs:
t4 = t3 & 0x8080808080808080 ┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐ │1000 0000│1000 0000│1000 0000│1000 0000│0000 0000│1000 0000│1000 0000│1000 0000│ = t4 └─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘ 80 80 80 80 00 80 80 80
Negate the MSBs — ones would point found positions. As we can see, now we have exactly one non-zero byte:
t5 = t4 ^ 0x8080808080808080 ┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐ │0000 0000│0000 0000│0000 0000│0000 0000│1000 0000│0000 0000│0000 0000│0000 0000│ = t5 └─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘ 00 00 00 00 80 00 00 00
To find the position of byte we simply looking for first bit set and divide result by 8.
The above algorithm was used to code the fully functional procedure.
long int find_authority_delimiter_special_better(uint64_t word) noexcept { auto broadcast = [](uint8_t v) -> uint64_t { return 0x101010101010101 * v; }; uint64_t mask = broadcast(0x7f); uint64_t lo7bits = word & mask; uint64_t x0 = (lo7bits ^ broadcast('@')) + mask; uint64_t x1 = (lo7bits ^ broadcast('/')) + mask; uint64_t x2 = (lo7bits ^ broadcast('?')) + mask; uint64_t x3 = (lo7bits ^ broadcast('\\')) + mask; uint64_t t0 = ((x0 & x1 & x2 & x3) | word); uint64_t t1 = t0 & broadcast(0x80); uint64_t t2 = t1 ^ broadcast(0x80); if (t2 != 0) { return __builtin_ctzl(t2) / 8; } return -1; }
The following assembly is produced by GCC 12.2.0 from Debian for the IceLake Server architecture (gcc -O3 -march=icelake-server).
movabs $0x7f7f7f7f7f7f7f7f,%rcx mov %rdi,%rax and %rcx,%rax movabs $0x4040404040404040,%rdx movabs $0x2f2f2f2f2f2f2f2f,%rsi xor %rax,%rdx xor %rax,%rsi add %rcx,%rsi add %rcx,%rdx and %rsi,%rdx movabs $0x3f3f3f3f3f3f3f3f,%rsi xor %rax,%rsi add %rcx,%rsi and %rsi,%rdx movabs $0x5c5c5c5c5c5c5c5c,%rsi xor %rsi,%rax add %rcx,%rax and %rax,%rdx or %rdi,%rdx movabs $0x8080808080808080,%rax andn %rax,%rdx,%rdx xor %eax,%eax tzcnt %rdx,%rax shr $0x3,%rax test %rdx,%rdx mov $0xffffffffffffffff,%rdx cmove %rdx,%rax ret
The assembly contains:
Since the reference assembly contains the jump instruction, I rewrote it to:
movabs $0x2f2f2f2f2f2f2f2f,%rax movabs $0xfefefefefefefeff,%rsi xor %rdi,%rax movabs $0xd0d0d0d0d0d0d0d0,%rcx xor %rdi,%rcx add %rsi,%rax and %rcx,%rax movabs $0x4040404040404040,%rcx mov %rdi,%rdx xor %rdi,%rcx movabs $0xbfbfbfbfbfbfbfbf,%rdi xor %rdx,%rdi add %rsi,%rcx and %rdi,%rcx or %rcx,%rax movabs $0x3f3f3f3f3f3f3f3f,%rcx xor %rdx,%rcx movabs $0xc0c0c0c0c0c0c0c0,%rdi add %rsi,%rcx xor %rdx,%rdi and %rdi,%rcx or %rcx,%rax movabs $0x5c5c5c5c5c5c5c5c,%rcx xor %rdx,%rcx add %rsi,%rcx movabs $0xa3a3a3a3a3a3a3a3,%rsi xor %rsi,%rdx and %rcx,%rdx or %rdx,%rax movabs $0x8080808080808080,%rdx and %rdx,%rax sete %cl // cl = 1 or 0 movabs $0x101010101010101,%rdx dec %rax and %rdx,%rax imul %rdx,%rax shr $0x38,%rax dec %eax testb %cl,%cl // reapply `and` result mov $0xffffffff,%ebx // cmove %rbx,%rax // and conditionally overwrite rax ret
For the above procedure uICA give the following analysis.
Throughput (in cycles per iteration): 13.18 Bottleneck: Predecoder The following throughputs could be achieved if the given property were the only bottleneck: - Predecoder: 13.19 - Decoder: 11.00 - Issue: 8.20 - Ports: 10.25 - Dependencies: 2.00 ┌───────────────────────┬────────┬───────┬─────────────────────────────────────────────────────────────────────────────────────────┐ │ MITE MS DSB LSD │ Issued │ Exec. │ Port 0 Port 1 Port 2 Port 3 Port 4 Port 5 Port 6 Port 7 Port 8 Port 9 │ ├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤ │ 1 │ 1 │ 1 │ 0.11 0.11 0.56 0.22 │ mov rax, 0x2f2f2f2f2f2f2f2f │ 1 │ 1 │ 1 │ 0.11 0.39 0.33 0.17 │ mov rsi, 0xfefefefefefefeff │ 1 │ 1 │ 1 │ 0.28 0.33 0.17 0.22 │ xor rax, rdi │ 1 │ 1 │ 1 │ 0.11 0.17 0.56 0.17 │ mov rcx, 0xd0d0d0d0d0d0d0d0 │ 1 │ 1 │ 1 │ 0.17 0.44 0.28 0.11 │ xor rcx, rdi │ 1 │ 1 │ 1 │ 0.22 0.44 0.28 0.06 │ add rax, rsi │ 1 │ 1 │ 1 │ 0.17 0.28 0.28 0.28 │ and rax, rcx │ 1 │ 1 │ 1 │ 0.17 0.39 0.39 0.06 │ mov rcx, 0x4040404040404040 │ 1 │ 1 │ 1 │ 0.22 0.39 0.11 0.28 │ mov rdx, rdi │ 1 │ 1 │ 1 │ 0.56 0.11 0.17 0.17 │ xor rcx, rdi │ 1 │ 1 │ 1 │ 0.06 0.22 0.22 0.5 │ mov rdi, 0xbfbfbfbfbfbfbfbf │ 1 │ 1 │ 1 │ 0.11 0.33 0.44 0.11 │ xor rdi, rdx │ 1 │ 1 │ 1 │ 0.33 0.33 0.28 0.06 │ add rcx, rsi │ 1 │ 1 │ 1 │ 0.33 0.28 0.06 0.33 │ and rcx, rdi │ 1 │ 1 │ 1 │ 0.17 0.22 0.11 0.5 │ or rax, rcx │ 1 │ 1 │ 1 │ 0.17 0.22 0.33 0.28 │ mov rcx, 0x3f3f3f3f3f3f3f3f │ 1 │ 1 │ 1 │ 0.28 0.28 0.33 0.11 │ xor rcx, rdx │ 1 │ 1 │ 1 │ 0.17 0.44 0.22 0.17 │ mov rdi, 0xc0c0c0c0c0c0c0c0 │ 1 │ 1 │ 1 │ 0.44 0.22 0.22 0.11 │ add rcx, rsi │ 1 │ 1 │ 1 │ 0.11 0.17 0.28 0.44 │ xor rdi, rdx │ 1 │ 1 │ 1 │ 0.39 0.22 0.22 0.17 │ and rcx, rdi │ 1 │ 1 │ 1 │ 0.17 0.22 0.39 0.22 │ or rax, rcx │ 1 │ 1 │ 1 │ 0.28 0.33 0.11 0.28 │ mov rcx, 0x5c5c5c5c5c5c5c5c │ 1 │ 1 │ 1 │ 0.33 0.22 0.11 0.33 │ xor rcx, rdx │ 1 │ 1 │ 1 │ 0.22 0.11 0.28 0.39 │ add rcx, rsi │ 1 │ 1 │ 1 │ 0.22 0.11 0.33 0.33 │ mov rsi, 0xa3a3a3a3a3a3a3a3 │ 1 │ 1 │ 1 │ 0.11 0.22 0.39 0.28 │ xor rdx, rsi │ 1 │ 1 │ 1 │ 0.39 0.28 0.11 0.22 │ and rdx, rcx │ 1 │ 1 │ 1 │ 0.17 0.06 0.5 0.28 │ or rax, rdx │ 1 │ 1 │ 1 │ 0.22 0.28 0.33 0.17 │ mov rdx, 0x8080808080808080 │ 1 │ 1 │ 1 │ 0.28 0.11 0.17 0.44 │ and rax, rdx │ 1 │ 1 │ 1 │ 0.39 0.61 │ setz cl │ 1 │ 1 │ 1 │ 0.22 0.17 0.39 0.22 │ mov rdx, 0x101010101010101 │ 1 │ 1 │ 1 │ 0.11 0.39 0.5 │ dec rax │ 1 │ 1 │ 1 │ 0.44 0.28 0.22 0.06 │ and rax, rdx │ 1 │ 1 │ 1 │ 1 │ imul rax, rdx │ 1 │ 1 │ 1 │ 0.44 0.56 │ shr rax, 0x38 │ 1 │ 1 │ 1 │ 0.17 0.17 0.39 0.28 │ dec eax │ 1 │ 1 │ 1 │ 0.11 0.17 0.44 0.28 │ mov ebx, 0xffffffff │ 1 │ 1 │ 1 │ 0.17 0.5 0.33 │ test cl, cl │ 1 │ 1 │ 1 │ 0.56 0.44 │ cmovnz rax, rbx ├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤ │ 41 │ 41 │ 41 │ 9.56 10.33 10.72 10.39 │ Total └───────────────────────┴────────┴───────┴─────────────────────────────────────────────────────────────────────────────────────────┘
uICA output:
Throughput (in cycles per iteration): 8.19 Bottleneck: Predecoder The following throughputs could be achieved if the given property were the only bottleneck: - Predecoder: 8.25 - Decoder: 7.00 - Issue: 5.40 - Ports: 6.50 ┌───────────────────────┬────────┬───────┬─────────────────────────────────────────────────────────────────────────────────────────┐ │ MITE MS DSB LSD │ Issued │ Exec. │ Port 0 Port 1 Port 2 Port 3 Port 4 Port 5 Port 6 Port 7 Port 8 Port 9 │ ├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤ │ 1 │ 1 │ 1 │ 0.29 0.14 0.25 0.32 │ mov rcx, 0x7f7f7f7f7f7f7f7f │ 1 │ 1 │ 1 │ 0.11 0.14 0.39 0.36 │ mov rax, rdi │ 1 │ 1 │ 1 │ 0.36 0.07 0.25 0.32 │ and rax, rcx │ 1 │ 1 │ 1 │ 0.18 0.11 0.46 0.25 │ mov rdx, 0x4040404040404040 │ 1 │ 1 │ 1 │ 0.11 0.36 0.36 0.18 │ mov rsi, 0x2f2f2f2f2f2f2f2f │ 1 │ 1 │ 1 │ 0.32 0.18 0.32 0.18 │ xor rdx, rax │ 1 │ 1 │ 1 │ 0.29 0.29 0.21 0.21 │ xor rsi, rax │ 1 │ 1 │ 1 │ 0.11 0.14 0.32 0.43 │ add rsi, rcx │ 1 │ 1 │ 1 │ 0.32 0.39 0.21 0.07 │ add rdx, rcx │ 1 │ 1 │ 1 │ 0.18 0.11 0.36 0.36 │ and rdx, rsi │ 1 │ 1 │ 1 │ 0.29 0.11 0.36 0.25 │ mov rsi, 0x3f3f3f3f3f3f3f3f │ 1 │ 1 │ 1 │ 0.21 0.39 0.14 0.25 │ xor rsi, rax │ 1 │ 1 │ 1 │ 0.25 0.36 0.18 0.21 │ add rsi, rcx │ 1 │ 1 │ 1 │ 0.36 0.11 0.29 0.25 │ and rdx, rsi │ 1 │ 1 │ 1 │ 0.18 0.36 0.25 0.21 │ mov rsi, 0x5c5c5c5c5c5c5c5c │ 1 │ 1 │ 1 │ 0.36 0.29 0.25 0.11 │ xor rax, rsi │ 1 │ 1 │ 1 │ 0.29 0.11 0.29 0.32 │ add rax, rcx │ 1 │ 1 │ 1 │ 0.11 0.29 0.29 0.32 │ and rdx, rax │ 1 │ 1 │ 1 │ 0.29 0.29 0.25 0.18 │ or rdx, rdi │ 1 │ 1 │ 1 │ 0.25 0.32 0.18 0.25 │ mov rax, 0x8080808080808080 │ 1 │ 1 │ 1 │ 0.36 0.64 │ andn rdx, rdx, rax │ 1 │ 1 │ │ │ xor eax, eax │ 1 │ 1 │ 1 │ 1 │ tzcnt rax, rdx │ 1 │ 1 │ 1 │ 0.54 0.46 │ shr rax, 0x3 │ 1 │ 1 │ 1 │ 0.21 0.21 0.32 0.25 │ test rdx, rdx │ 1 │ 1 │ 1 │ 0.21 0.36 0.29 0.14 │ mov rdx, 0xffffffffffffffff │ 1 │ 1 │ 1 │ 0.39 0.61 │ cmovz rax, rdx ├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤ │ 27 │ 27 │ 26 │ 6.18 6.46 6.86 6.5 │ Total └───────────────────────┴────────┴───────┴─────────────────────────────────────────────────────────────────────────────────────────┘
Microbenchmarks run on an IceLake machine:
$ cat /proc/cpuinfo | grep "model name" | head -n 1 model name : Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz $ g++ -Wall -Wextra -std=c++11 -O3 -march=native benchmark.cpp reference.o better.o -o benchmark rdtsc_overhead set to 32 reference : 16.000 cycle/op (best) 21.878 cycle/op (avg) better : 14.000 cycle/op (best) 20.360 cycle/op (avg)
The throughput reported by uICA for the faster procedure is 8.25 cycles, for original code 13.19 cycles. In theory, we may expected 1.5x speedup of the procedure, but in practice it's not that significant.
Sample source code is available on Github