Bit test and reset vs compilers

Author: Wojciech Muła 2021-12-22

Problem

The problem: there is a bitmask (16, 32, or 64 bits). We need to scan it backward, starting from the most significant bit, and finish at the last (lowest) set bit. For instance, in the 16-bit mask 0b1100'1000'0000'0000 we scan bits from 15 down to 11. Depending on each bit's value we perform different tasks.

Since x86 has the instruction BTR, it was obvious to me that I should use the bit-test-and-reset idiom. Thus my initial code was straightforward.

```
void loop_v1(uint64_t mask) {
    for (int i=63; i >= 0 && mask != 0; i--) {
        if (test_and_clear_bit(mask, i))
            func_true(i);
        else
            func_false(i);
    }
}
```

Function test_and_clear_bit wraps the BTR instruction. Below is an example of how this function behaves.

```
uint16_t w = 0b0000'0010'1110'0100;
bool b;

// bit #1 is zero:
// - b == false
// - w == 0b0000'0010'1110'0100 (unchanged)
b = test_and_clear_bit(w, 1);

// bit #2 is set:
// - b == true
// - w == 0b0000'0010'1110'0000
b = test_and_clear_bit(w, 2);

// bit #9 is set:
// - b == true
// - w == 0b0000'0000'1110'0000
b = test_and_clear_bit(w, 9);
```

Bit-test-and-reset

Naive

As I didn't want to use inline asm, I wrote a plain C++ function, assuming that an instruction introduced in 1986 with the 80386 CPU would already be well known to compilers.

```
bool test_and_clear_bit(uint64_t& val, int bitpos) {
    const uint64_t bitmask = uint64_t(1) << bitpos;

    const bool ret = (val & bitmask);

    val &= ~bitmask;

    return ret;
}
```

To my surprise GCC compiled the procedure into:

```
 0:   48 8b 07                mov    (%rdi),%rax
3:   41 89 f0                mov    %esi,%r8d
6:   be 01 00 00 00          mov    $0x1,%esi
b:   c4 e2 b9 f7 f6          shlx   %r8,%rsi,%rsi
10:   c4 e2 c8 f2 d0          andn   %rax,%rsi,%rdx
15:   48 85 f0                test   %rsi,%rax
18:   48 89 17                mov    %rdx,(%rdi)
1b:   0f 95 c0                setne  %al
1e:   c3                      retq
```

Apart from some ABI-related stuff, the code is an almost one-to-one translation of the C++ code. The compiler emitted the ANDN instruction; this optimization is enabled because the target CPU supports the BMI extension. There's no BTR instruction, though.

Clang did it a bit better:

```
 0:   48 8b 0f                mov    (%rdi),%rcx
3:   48 0f a3 f1             bt     %rsi,%rcx
7:   0f 92 c0                setb   %al
a:   48 0f b3 f1             btr    %rsi,%rcx
e:   48 89 0f                mov    %rcx,(%rdi)
11:   c3                      retq
```

However, BTR is used only to reset the bit. To read the bit's value the compiler emitted a separate instruction BT.

Alternative

Bit-test-and-reset can also be solved in the following way: reset the i-th bit; if the new value, compared as unsigned, is smaller than the input value, it means the i-th bit was 1.

```
bool test_and_clear_bit_v2(uint64_t& val, int bitpos) {
    const uint64_t bitmask = uint64_t(1) << bitpos;
    const uint64_t old = val;

    val &= ~bitmask;

    return val < old;
}
```

Now GCC and clang compile the function into nearly identical code:

```
# GCC
20:       48 8b 07                mov    (%rdi),%rax
23:       48 89 c2                mov    %rax,%rdx
26:       48 0f b3 f2             btr    %rsi,%rdx
2a:       48 39 c2                cmp    %rax,%rdx
2d:       48 89 17                mov    %rdx,(%rdi)
30:       0f 92 c0                setb   %al
33:       c3                      retq
```
```
# clang
20:       48 8b 07                mov    (%rdi),%rax
23:       48 89 c1                mov    %rax,%rcx
26:       48 0f b3 f1             btr    %rsi,%rcx
2a:       48 89 0f                mov    %rcx,(%rdi)
2d:       48 39 c1                cmp    %rax,%rcx
30:       0f 92 c0                setb   %al
33:       c3                      retq
```

However, it's still not perfect. BTR is used only to reset the bit; its other outcome, setting the carry flag, is ignored. Thus we have an explicit comparison (CMP).

Inline asm

For the sake of completeness, here is a version that uses inline asm. As a separate function it's not that short (due to the ABI), but when it gets inlined by the compiler, it's reduced to 2-3 instructions.

```
bool test_and_clear_bit(uint32_t& val, int bitpos) {
    uint32_t flag = 0;

    asm (
        "btr  %[bitpos], %[val]    \n"
        "setc %b[flag]             \n"
        : [val] "=r" (val),
          [flag] "=r" (flag)
        : [bitpos] "r" (bitpos),
          "0" (val),
          "1" (flag)
        : "cc"
    );

    return flag;
}
```

Looping

Testing MSB

Since we iterate over a sequence of bits, I tried another approach: test the most significant bit and shift the mask left in every iteration. Testing the MSB on x86 is done by checking whether a signed number is negative. Shifting left by 1 bit is performed by a cheap addition.

The second version of the algorithm is shown below. I didn't want to put actual code in the if-branches; I think that calls to external functions are a fairer benchmark.

```
extern void func_true(int);
extern void func_false(int);

void loop_v2(uint64_t mask) {
    for (int i=63; i >= 0 && mask != 0; i--) {
        if ((int64_t)mask < 0) // test the MSB
            func_true(i);
        else
            func_false(i);

        mask <<= 1;
    }
}
```

Let's examine output from GCC:

```
70:   48 85 ff                test   %rdi,%rdi
73:   74 43                   je     b8
75:   55                      push   %rbp
76:   bd 3f 00 00 00          mov    $0x3f,%ebp
7b:   53                      push   %rbx
7c:   48 89 fb                mov    %rdi,%rbx
7f:   48 83 ec 08             sub    $0x8,%rsp
83:   eb 14                   jmp    99
85:   0f 1f 00                nopl   (%rax)
88:   e8 00 00 00 00          callq  8d
8d:   48 01 db                add    %rbx,%rbx
90:   ff cd                   dec    %ebp
92:   78 18                   js     ac
94:   48 85 db                test   %rbx,%rbx
97:   74 13                   je     ac
99:   89 ef                   mov    %ebp,%edi
9b:   48 85 db                test   %rbx,%rbx
9e:   79 e8                   jns    88
a0:   e8 00 00 00 00          callq  a5
a5:   48 01 db                add    %rbx,%rbx
a8:   ff cd                   dec    %ebp
aa:   79 e8                   jns    94
ac:   48 83 c4 08             add    $0x8,%rsp
b0:   5b                      pop    %rbx
b1:   5d                      pop    %rbp
b2:   c3                      retq
b3:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
b8:   c3                      retq
```

That looks quite complicated. Is clang better? The conditional jump instructions are ordered differently, yet the code complexity is similar to GCC's output:

```
90:   41 56                   push   %r14
92:   53                      push   %rbx
93:   50                      push   %rax
94:   48 85 ff                test   %rdi,%rdi
97:   0f 84 49 00 00 00       je     e6
9d:   48 89 fb                mov    %rdi,%rbx
a0:   41 be 3f 00 00 00       mov    $0x3f,%r14d
a6:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
b0:   44 89 f7                mov    %r14d,%edi
b3:   48 85 db                test   %rbx,%rbx
b6:   0f 88 14 00 00 00       js     d0
bc:   e8 00 00 00 00          callq  c1
c1:   48 01 db                add    %rbx,%rbx
c4:   0f 85 10 00 00 00       jne    da
ca:   e9 17 00 00 00          jmpq   e6
cf:   90                      nop
d0:   e8 00 00 00 00          callq  d5
d5:   48 01 db                add    %rbx,%rbx
d8:   74 0c                   je     e6
da:   41 8d 46 ff             lea    -0x1(%r14),%eax
de:   45 85 f6                test   %r14d,%r14d
e1:   41 89 c6                mov    %eax,%r14d
e4:   7f ca                   jg     b0
e6:   48 83 c4 08             add    $0x8,%rsp
ea:   5b                      pop    %rbx
eb:   41 5e                   pop    %r14
ed:   c3                      retq
```

Avoiding jumps

Since I decided to experiment with calls to external functions, it's worth showing a solution which avoids a few conditional jumps. It uses a two-element table of function pointers.

```
void loop_v3(uint64_t mask) {
    using function = void (*)(int);
    function func[2] = {func_false, func_true};

    for (int i=63; i >= 0 && mask != 0; i--) {
        const int idx = ((int64_t)mask < 0); // 0 or 1
        func[idx](i);
        mask <<= 1;
    }
}
```

Below is the GCC output:

```
 c0:   55                      push   %rbp
c1:   53                      push   %rbx
c2:   48 83 ec 18             sub    $0x18,%rsp
c6:   c5 fa 7e 0d 00 00 00    vmovq  0x0(%rip),%xmm1
cd:   00
ce:   c4 e3 f1 22 05 00 00    vpinsrq $0x1,0x0(%rip),%xmm1,%xmm0
d5:   00 00 01
d8:   c5 f9 7f 04 24          vmovdqa %xmm0,(%rsp)
dd:   48 85 ff                test   %rdi,%rdi
e0:   74 26                   je     108
e2:   48 89 fb                mov    %rdi,%rbx
e5:   bd 3f 00 00 00          mov    $0x3f,%ebp
ea:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
f0:   48 89 d8                mov    %rbx,%rax
f3:   48 c1 e8 3f             shr    \$0x3f,%rax
f7:   89 ef                   mov    %ebp,%edi
f9:   ff 14 c4                callq  *(%rsp,%rax,8)
fc:   48 01 db                add    %rbx,%rbx
ff:   ff cd                   dec    %ebp
101:   78 05                   js     108
103:   48 85 db                test   %rbx,%rbx
106:   75 e8                   jne    f0
108:   48 83 c4 18             add    $0x18,%rsp
10c:   5b                      pop    %rbx
10d:   5d                      pop    %rbp
10e:   c3                      retq
```

And clang:

```
 f0:   55                      push   %rbp
f1:   53                      push   %rbx
f2:   50                      push   %rax
f3:   48 85 ff                test   %rdi,%rdi
f6:   74 26                   je     11e
f8:   48 89 fb                mov    %rdi,%rbx
fb:   bd 3f 00 00 00          mov    $0x3f,%ebp
100:   48 89 d8                mov    %rbx,%rax
103:   48 c1 e8 3f             shr    \$0x3f,%rax
107:   89 ef                   mov    %ebp,%edi
109:   ff 14 c5 00 00 00 00    callq  *0x0(,%rax,8)
110:   85 ed                   test   %ebp,%ebp
112:   7e 0a                   jle    11e
114:   48 01 db                add    %rbx,%rbx
117:   ff cd                   dec    %ebp
119:   48 85 db                test   %rbx,%rbx
11c:   75 e2                   jne    100
11e:   48 83 c4 08             add    $0x8,%rsp
122:   5b                      pop    %rbx
123:   5d                      pop    %rbp
124:   c3                      retq
```

Both compilers convert the MSB into 0 or 1 (shr $0x3f, %reg) and use it as an index in the indirect call instruction.

Compilers:

• GCC: gcc version 10.2.1 20210108 (Debian 10.2.1-5)
• clang: Debian clang version 11.0.1-2

Compiler flags:

• -O3, -march=skylake

Source code is available.