SWAR find any byte from set

Author:	Wojciech Muła
Added on:	2023-03-06

Contents

Introduction
Faster SWAR procedure
Comparison
Source code

Introduction

When I was browsing the source code of project Ada (WHATWG-compliant and fast URL parser written in modern C++) the following procedure caught my attention:

ada_really_inline size_t find_authority_delimiter_special(std::string_view view) noexcept {
  auto has_zero_byte = [](uint64_t v) {
    return ((v - 0x0101010101010101) & ~(v)&0x8080808080808080);
  };
  auto index_of_first_set_byte = [](uint64_t v) {
    return ((((v - 1) & 0x101010101010101) * 0x101010101010101) >> 56) - 1;,
  };
  auto broadcast = [](uint8_t v) -> uint64_t { return 0x101010101010101 * v; };
  size_t i = 0;
  uint64_t mask1 = broadcast('@');
  uint64_t mask2 = broadcast('/');
  uint64_t mask3 = broadcast('?');
  uint64_t mask4 = broadcast('\\');

  for (; i + 7 < view.size(); i += 8) {
    uint64_t word{};
    memcpy(&word, view.data() + i, sizeof(word));
    word = swap_bytes_if_big_endian(word);
    uint64_t xor1 = word ^ mask1;
    uint64_t xor2 = word ^ mask2;
    uint64_t xor3 = word ^ mask3;
    uint64_t xor4 = word ^ mask4;
    uint64_t is_match = has_zero_byte(xor1) | has_zero_byte(xor2) | has_zero_byte(xor3) | has_zero_byte(xor4);
    if (is_match) {
      return i + index_of_first_set_byte(is_match);
    }
  }

  if (i < view.size()) {
    uint64_t word{};
    memcpy(&word, view.data() + i, view.size() - i);
    word = swap_bytes_if_big_endian(word);
    uint64_t xor1 = word ^ mask1;
    uint64_t xor2 = word ^ mask2;
    uint64_t xor3 = word ^ mask3;
    uint64_t xor4 = word ^ mask4;
    uint64_t is_match = has_zero_byte(xor1) | has_zero_byte(xor2) | has_zero_byte(xor3) | has_zero_byte(xor4);
    if (is_match) {
      return i + index_of_first_set_byte(is_match);
    }
  }

  return view.size();
}

The above procedure finds the position of the first occurrence of a char from the set @, /, ? and /. It returns the length of input string if nothing was found.

The procedure uses SWAR techniques: it processes several bytes at once, taking advantage on the current CPUs architecture that process 64-bit values. The procedure implementation comes from src/helpers.cpp, and more function from that file follow exactly the same SWAR approach.

These two functions are crucial:

has_zero_byte is non-zero if a multi-byte word has at least one zero byte; note that the procedure also keeps only the most significant bits.
index_of_first_set_byte returns the index of first non-zero byte; it uses the fact it is called on word formed with bytes 0x00 and 0x80.

The pattern used is quite straightforward. If we bit-xor input bytes with a word filled with one of bytes from set, then the result has zero byte if the byte was there. We check then if it least one result of bit-xor has zero-byte and if it is true, we're looking for its position.

While the production code processes multi-word inputs, let's focus on a basic building block that processes a single 64-bit word.

int find_authority_delimiter_special_reference(uint64_t word) noexcept {
  auto has_zero_byte = [](uint64_t v) {
    return ((v - 0x0101010101010101) & ~(v)&0x8080808080808080);
  };
  auto index_of_first_set_byte = [](uint64_t v) {
    return ((((v - 1) & 0x101010101010101) * 0x101010101010101) >> 56) - 1;
  };
  auto broadcast = [](uint8_t v) -> uint64_t { return 0x101010101010101 * v; };
  uint64_t mask1 = broadcast('@');
  uint64_t mask2 = broadcast('/');
  uint64_t mask3 = broadcast('?');
  uint64_t mask4 = broadcast('\\');

  uint64_t xor1 = word ^ mask1;
  uint64_t xor2 = word ^ mask2;
  uint64_t xor3 = word ^ mask3;
  uint64_t xor4 = word ^ mask4;
  uint64_t is_match = has_zero_byte(xor1) | has_zero_byte(xor2) | has_zero_byte(xor3) | has_zero_byte(xor4);
  if (is_match) {
      return index_of_first_set_byte(is_match);
  }

  return -1;
}

The following assembly is produced by GCC 12.2.0 from Debian for the IceLake Server architecture (gcc -O3 -march=icelake-server).

movabs $0x2f2f2f2f2f2f2f2f,%rax
movabs $0xfefefefefefefeff,%rsi
xor    %rdi,%rax
movabs $0xd0d0d0d0d0d0d0d0,%rcx
xor    %rdi,%rcx
add    %rsi,%rax
and    %rcx,%rax
movabs $0x4040404040404040,%rcx
mov    %rdi,%rdx
xor    %rdi,%rcx
movabs $0xbfbfbfbfbfbfbfbf,%rdi
xor    %rdx,%rdi
add    %rsi,%rcx
and    %rdi,%rcx
or     %rcx,%rax
movabs $0x3f3f3f3f3f3f3f3f,%rcx
xor    %rdx,%rcx
movabs $0xc0c0c0c0c0c0c0c0,%rdi
add    %rsi,%rcx
xor    %rdx,%rdi
and    %rdi,%rcx
or     %rcx,%rax
movabs $0x5c5c5c5c5c5c5c5c,%rcx
xor    %rdx,%rcx
add    %rsi,%rcx
movabs $0xa3a3a3a3a3a3a3a3,%rsi
xor    %rsi,%rdx
and    %rcx,%rdx
or     %rdx,%rax
movabs $0x8080808080808080,%rdx
and    %rdx,%rax
je     <_Z42find_authority_delimiter_special_referencem+0xc0>
movabs $0x101010101010101,%rdx
dec    %rax
and    %rdx,%rax
imul   %rdx,%rax
shr    $0x38,%rax
dec    %eax
ret
mov    $0xffffffff,%eax
ret

The assembly contains:

11 x constants,
6 x xor,
6 x and,
4 x add,
3 x or,
1 x multiplication (imul),
1 x shift right,
1 x branch.

Faster SWAR procedure

While the approach used by Ada library is universal and can be used for locating any byte, there's faster method that can be used with the set contains only ASCII chars, i.e. bytes that do not have set the 7th bit. The set @, /, ? and \\ falls into this category.

In this approach we pretend that the input is 7-bit. If the input has any 8-bit byte (for instance UTF-8 characters) we'll fix up wrongly detected bytes at the end.

We also use bit-xor to produce zero bytes for matching bytes. However, we then convert the result into a bit. The result of bit-xor can be zero or non zero (x - bit 0 or 1, but at least one is 1):

┌─────────┐ ┌─────────┐
│0000 0000│ │0xxx xxxx│
└─────────┘ └─────────┘

If we add 0x7f = 0b0111_1111 then the result is:

┌─────────┐ ┌─────────┐
│0111 1111│ │1yyy yyyy│
└─────────┘ └─────────┘

We can see that the most significant bit expresses relation "byte is not zero". Let's review the basic algorithm steps, that use the above observation.

Input 8-byte string is a\\xc0ą@b?c:

 'a'=0x61   \xc0     'ą'=<0xc4, 0x85>    '@'=0x40  'b'=0x62  '?'=0x3f  'c'=0x63
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│0110 0001│1100 0000│1100 0100│1000 0101│0100 0000│0110 0010│0011 1111│0110 0011│ = input
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

And we're looking for '@' = 0x40:

┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│0100 0000│0100 0000│0100 0000│0100 0000│0100 0000│0100 0000│0100 0000│0100 0000│ = byte
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Xor input and mask:

t0 = word ^ 0x4040404040404040

┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│0010 0001│1000 0000│1000 0100│1100 0101│0000 0000│0010 0010│0111 1111│0010 0011│ = t0
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
     21        80        84        c5        00        22        7f        23

Mask MSB from input:

t1 = t0 & 0x7f7f7f7f7f7f7f7f

┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│0010 0001│0000 0000│0000 0100│0100 0101│0000 0000│0010 0010│0111 1111│0010 0011│ = t1
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
     21        00        04        45        00        22        7f        23

Add 0x7f to locate non-zero bytes. Note that two bytes were found, but the first one is not valid.:

t2 = t1 + 0x7f7f7f7f7f7f7f7f

┌─────────┲━━━━━━━━━┱─────────┬─────────┲━━━━━━━━━┱─────────┬─────────┬─────────┐
│1010 0000┃0111 1111┃1000 0011│1100 0100┃0111 1111┃1010 0001│1111 1110│1010 0010│ = t2
└─────────┺━━━━━━━━━┹─────────┴─────────┺━━━━━━━━━┹─────────┴─────────┴─────────┘
     a0        7f        83        c4        7f        a1        fe        a2

At this point, MSBs equal zero denote byte position. To fix up the result we bit-or the t2 with input. If input byte has MSB set, it will discard wrong results.:

t3 = t2 | input

┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│1110 0001│1111 1111│1100 0111│1100 0101│0111 1111│1110 0011│1111 1111│1110 0011│ = t3
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
     e1        ff        c7        c5        7f        e3        ff        e3

Now, we keep only MSBs:

t4 = t3 & 0x8080808080808080

┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│1000 0000│1000 0000│1000 0000│1000 0000│0000 0000│1000 0000│1000 0000│1000 0000│ = t4
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
     80        80        80        80        00        80        80        80

Negate the MSBs — ones would point found positions. As we can see, now we have exactly one non-zero byte:

t5 = t4 ^ 0x8080808080808080

┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│0000 0000│0000 0000│0000 0000│0000 0000│1000 0000│0000 0000│0000 0000│0000 0000│ = t5
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
     00        00        00        00        80        00        00        00

To find the position of byte we simply looking for first bit set and divide result by 8.

The above algorithm was used to code the fully functional procedure.

long int find_authority_delimiter_special_better(uint64_t word) noexcept {
  auto broadcast = [](uint8_t v) -> uint64_t { return 0x101010101010101 * v; };

  uint64_t mask = broadcast(0x7f);
  uint64_t lo7bits = word & mask;

  uint64_t x0 = (lo7bits ^ broadcast('@')) + mask;
  uint64_t x1 = (lo7bits ^ broadcast('/')) + mask;
  uint64_t x2 = (lo7bits ^ broadcast('?')) + mask;
  uint64_t x3 = (lo7bits ^ broadcast('\\')) + mask;

  uint64_t t0 = ((x0 & x1 & x2 & x3) | word);
  uint64_t t1 = t0 & broadcast(0x80);
  uint64_t t2 = t1 ^ broadcast(0x80);

  if (t2 != 0) {
    return __builtin_ctzl(t2) / 8;
  }

  return -1;
}

The following assembly is produced by GCC 12.2.0 from Debian for the IceLake Server architecture (gcc -O3 -march=icelake-server).

movabs $0x7f7f7f7f7f7f7f7f,%rcx
mov    %rdi,%rax
and    %rcx,%rax
movabs $0x4040404040404040,%rdx
movabs $0x2f2f2f2f2f2f2f2f,%rsi
xor    %rax,%rdx
xor    %rax,%rsi
add    %rcx,%rsi
add    %rcx,%rdx
and    %rsi,%rdx
movabs $0x3f3f3f3f3f3f3f3f,%rsi
xor    %rax,%rsi
add    %rcx,%rsi
and    %rsi,%rdx
movabs $0x5c5c5c5c5c5c5c5c,%rsi
xor    %rsi,%rax
add    %rcx,%rax
and    %rax,%rdx
or     %rdi,%rdx
movabs $0x8080808080808080,%rax
andn   %rax,%rdx,%rdx
xor    %eax,%eax
tzcnt  %rdx,%rax
shr    $0x3,%rax
test   %rdx,%rdx
mov    $0xffffffffffffffff,%rdx
cmove  %rdx,%rax
ret

The assembly contains:

6 x constants,
5 x xor,
5 x and,
4 x add,
1 x or,
1 x count trailing zeros,
1 x shift right,
1 x condition move.

Comparison

Reference implementation

Since the reference assembly contains the jump instruction, I rewrote it to:

    movabs $0x2f2f2f2f2f2f2f2f,%rax
    movabs $0xfefefefefefefeff,%rsi
    xor    %rdi,%rax
    movabs $0xd0d0d0d0d0d0d0d0,%rcx
    xor    %rdi,%rcx
    add    %rsi,%rax
    and    %rcx,%rax
    movabs $0x4040404040404040,%rcx
    mov    %rdi,%rdx
    xor    %rdi,%rcx
    movabs $0xbfbfbfbfbfbfbfbf,%rdi
    xor    %rdx,%rdi
    add    %rsi,%rcx
    and    %rdi,%rcx
    or     %rcx,%rax
    movabs $0x3f3f3f3f3f3f3f3f,%rcx
    xor    %rdx,%rcx
    movabs $0xc0c0c0c0c0c0c0c0,%rdi
    add    %rsi,%rcx
    xor    %rdx,%rdi
    and    %rdi,%rcx
    or     %rcx,%rax
    movabs $0x5c5c5c5c5c5c5c5c,%rcx
    xor    %rdx,%rcx
    add    %rsi,%rcx
    movabs $0xa3a3a3a3a3a3a3a3,%rsi
    xor    %rsi,%rdx
    and    %rcx,%rdx
    or     %rdx,%rax
    movabs $0x8080808080808080,%rdx
    and    %rdx,%rax
sete   %cl                      // cl = 1 or 0
    movabs $0x101010101010101,%rdx
    dec    %rax
    and    %rdx,%rax
    imul   %rdx,%rax
    shr    $0x38,%rax
    dec    %eax
testb  %cl,%cl                  // reapply `and` result
    mov    $0xffffffff,%ebx         //
cmove  %rbx,%rax                // and conditionally overwrite rax
    ret

For the above procedure uICA give the following analysis.

Throughput (in cycles per iteration): 13.18
Bottleneck: Predecoder

The following throughputs could be achieved if the given property were the only bottleneck:

  - Predecoder: 13.19
  - Decoder: 11.00
  - Issue: 8.20
  - Ports: 10.25
  - Dependencies: 2.00

┌───────────────────────┬────────┬───────┬─────────────────────────────────────────────────────────────────────────────────────────┐
│ MITE   MS   DSB   LSD │ Issued │ Exec. │ Port 0   Port 1   Port 2   Port 3   Port 4   Port 5   Port 6   Port 7   Port 8   Port 9 │
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│  1                    │   1    │   1   │  0.11     0.11                                0.56     0.22                             │ mov rax, 0x2f2f2f2f2f2f2f2f
│  1                    │   1    │   1   │  0.11     0.39                                0.33     0.17                             │ mov rsi, 0xfefefefefefefeff
│  1                    │   1    │   1   │  0.28     0.33                                0.17     0.22                             │ xor rax, rdi
│  1                    │   1    │   1   │  0.11     0.17                                0.56     0.17                             │ mov rcx, 0xd0d0d0d0d0d0d0d0
│  1                    │   1    │   1   │  0.17     0.44                                0.28     0.11                             │ xor rcx, rdi
│  1                    │   1    │   1   │  0.22     0.44                                0.28     0.06                             │ add rax, rsi
│  1                    │   1    │   1   │  0.17     0.28                                0.28     0.28                             │ and rax, rcx
│  1                    │   1    │   1   │  0.17     0.39                                0.39     0.06                             │ mov rcx, 0x4040404040404040
│  1                    │   1    │   1   │  0.22     0.39                                0.11     0.28                             │ mov rdx, rdi
│  1                    │   1    │   1   │  0.56     0.11                                0.17     0.17                             │ xor rcx, rdi
│  1                    │   1    │   1   │  0.06     0.22                                0.22     0.5                              │ mov rdi, 0xbfbfbfbfbfbfbfbf
│  1                    │   1    │   1   │  0.11     0.33                                0.44     0.11                             │ xor rdi, rdx
│  1                    │   1    │   1   │  0.33     0.33                                0.28     0.06                             │ add rcx, rsi
│  1                    │   1    │   1   │  0.33     0.28                                0.06     0.33                             │ and rcx, rdi
│  1                    │   1    │   1   │  0.17     0.22                                0.11     0.5                              │ or rax, rcx
│  1                    │   1    │   1   │  0.17     0.22                                0.33     0.28                             │ mov rcx, 0x3f3f3f3f3f3f3f3f
│  1                    │   1    │   1   │  0.28     0.28                                0.33     0.11                             │ xor rcx, rdx
│  1                    │   1    │   1   │  0.17     0.44                                0.22     0.17                             │ mov rdi, 0xc0c0c0c0c0c0c0c0
│  1                    │   1    │   1   │  0.44     0.22                                0.22     0.11                             │ add rcx, rsi
│  1                    │   1    │   1   │  0.11     0.17                                0.28     0.44                             │ xor rdi, rdx
│  1                    │   1    │   1   │  0.39     0.22                                0.22     0.17                             │ and rcx, rdi
│  1                    │   1    │   1   │  0.17     0.22                                0.39     0.22                             │ or rax, rcx
│  1                    │   1    │   1   │  0.28     0.33                                0.11     0.28                             │ mov rcx, 0x5c5c5c5c5c5c5c5c
│  1                    │   1    │   1   │  0.33     0.22                                0.11     0.33                             │ xor rcx, rdx
│  1                    │   1    │   1   │  0.22     0.11                                0.28     0.39                             │ add rcx, rsi
│  1                    │   1    │   1   │  0.22     0.11                                0.33     0.33                             │ mov rsi, 0xa3a3a3a3a3a3a3a3
│  1                    │   1    │   1   │  0.11     0.22                                0.39     0.28                             │ xor rdx, rsi
│  1                    │   1    │   1   │  0.39     0.28                                0.11     0.22                             │ and rdx, rcx
│  1                    │   1    │   1   │  0.17     0.06                                0.5      0.28                             │ or rax, rdx
│  1                    │   1    │   1   │  0.22     0.28                                0.33     0.17                             │ mov rdx, 0x8080808080808080
│  1                    │   1    │   1   │  0.28     0.11                                0.17     0.44                             │ and rax, rdx
│  1                    │   1    │   1   │  0.39                                                  0.61                             │ setz cl
│  1                    │   1    │   1   │  0.22     0.17                                0.39     0.22                             │ mov rdx, 0x101010101010101
│  1                    │   1    │   1   │           0.11                                0.39     0.5                              │ dec rax
│  1                    │   1    │   1   │  0.44     0.28                                0.22     0.06                             │ and rax, rdx
│  1                    │   1    │   1   │            1                                                                            │ imul rax, rdx
│  1                    │   1    │   1   │  0.44                                                  0.56                             │ shr rax, 0x38
│  1                    │   1    │   1   │  0.17     0.17                                0.39     0.28                             │ dec eax
│  1                    │   1    │   1   │  0.11     0.17                                0.44     0.28                             │ mov ebx, 0xffffffff
│  1                    │   1    │   1   │  0.17     0.5                                 0.33                                      │ test cl, cl
│  1                    │   1    │   1   │  0.56                                                  0.44                             │ cmovnz rax, rbx
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│  41                   │   41   │  41   │  9.56    10.33                               10.72    10.39                             │ Total
└───────────────────────┴────────┴───────┴─────────────────────────────────────────────────────────────────────────────────────────┘

Faster SWAR procedure

uICA output:

Throughput (in cycles per iteration): 8.19
Bottleneck: Predecoder

The following throughputs could be achieved if the given property were the only bottleneck:

  - Predecoder: 8.25
  - Decoder: 7.00
  - Issue: 5.40
  - Ports: 6.50

┌───────────────────────┬────────┬───────┬─────────────────────────────────────────────────────────────────────────────────────────┐
│ MITE   MS   DSB   LSD │ Issued │ Exec. │ Port 0   Port 1   Port 2   Port 3   Port 4   Port 5   Port 6   Port 7   Port 8   Port 9 │
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│  1                    │   1    │   1   │  0.29     0.14                                0.25     0.32                             │ mov rcx, 0x7f7f7f7f7f7f7f7f
│  1                    │   1    │   1   │  0.11     0.14                                0.39     0.36                             │ mov rax, rdi
│  1                    │   1    │   1   │  0.36     0.07                                0.25     0.32                             │ and rax, rcx
│  1                    │   1    │   1   │  0.18     0.11                                0.46     0.25                             │ mov rdx, 0x4040404040404040
│  1                    │   1    │   1   │  0.11     0.36                                0.36     0.18                             │ mov rsi, 0x2f2f2f2f2f2f2f2f
│  1                    │   1    │   1   │  0.32     0.18                                0.32     0.18                             │ xor rdx, rax
│  1                    │   1    │   1   │  0.29     0.29                                0.21     0.21                             │ xor rsi, rax
│  1                    │   1    │   1   │  0.11     0.14                                0.32     0.43                             │ add rsi, rcx
│  1                    │   1    │   1   │  0.32     0.39                                0.21     0.07                             │ add rdx, rcx
│  1                    │   1    │   1   │  0.18     0.11                                0.36     0.36                             │ and rdx, rsi
│  1                    │   1    │   1   │  0.29     0.11                                0.36     0.25                             │ mov rsi, 0x3f3f3f3f3f3f3f3f
│  1                    │   1    │   1   │  0.21     0.39                                0.14     0.25                             │ xor rsi, rax
│  1                    │   1    │   1   │  0.25     0.36                                0.18     0.21                             │ add rsi, rcx
│  1                    │   1    │   1   │  0.36     0.11                                0.29     0.25                             │ and rdx, rsi
│  1                    │   1    │   1   │  0.18     0.36                                0.25     0.21                             │ mov rsi, 0x5c5c5c5c5c5c5c5c
│  1                    │   1    │   1   │  0.36     0.29                                0.25     0.11                             │ xor rax, rsi
│  1                    │   1    │   1   │  0.29     0.11                                0.29     0.32                             │ add rax, rcx
│  1                    │   1    │   1   │  0.11     0.29                                0.29     0.32                             │ and rdx, rax
│  1                    │   1    │   1   │  0.29     0.29                                0.25     0.18                             │ or rdx, rdi
│  1                    │   1    │   1   │  0.25     0.32                                0.18     0.25                             │ mov rax, 0x8080808080808080
│  1                    │   1    │   1   │           0.36                                0.64                                      │ andn rdx, rdx, rax
│  1                    │   1    │       │                                                                                         │ xor eax, eax
│  1                    │   1    │   1   │            1                                                                            │ tzcnt rax, rdx
│  1                    │   1    │   1   │  0.54                                                  0.46                             │ shr rax, 0x3
│  1                    │   1    │   1   │  0.21     0.21                                0.32     0.25                             │ test rdx, rdx
│  1                    │   1    │   1   │  0.21     0.36                                0.29     0.14                             │ mov rdx, 0xffffffffffffffff
│  1                    │   1    │   1   │  0.39                                                  0.61                             │ cmovz rax, rdx
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│  27                   │   27   │  26   │  6.18     6.46                                6.86     6.5                              │ Total
└───────────────────────┴────────┴───────┴─────────────────────────────────────────────────────────────────────────────────────────┘

Benchmark results

Microbenchmarks run on an IceLake machine:

$ cat /proc/cpuinfo | grep "model name" | head -n 1
model name  : Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz

$ g++ -Wall -Wextra -std=c++11 -O3 -march=native benchmark.cpp reference.o better.o -o benchmark
rdtsc_overhead set to 32
reference                           :    16.000 cycle/op (best)   21.878 cycle/op (avg)
better                              :    14.000 cycle/op (best)   20.360 cycle/op (avg)

Summary

The throughput reported by uICA for the faster procedure is 8.25 cycles, for original code 13.19 cycles. In theory, we may expected 1.5x speedup of the procedure, but in practice it's not that significant.

Source code

Sample source code is available on Github