Bit test and reset vs compilers

Author: Wojciech Muła
Added on: 2021-12-22

Problem

The problem: there is a bitmask (16-, 32- or 64-bit). We need to scan it backward, starting from the most significant bit, and finish at the last set bit. For instance, we scan bits 15 down to 11 in the 16-bit mask 0b1100'1000'0000'0000. Depending on the bit's value we perform different tasks.
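As a concrete sketch of the scanning order, the helper below (an ad-hoc name, used only for illustration) collects the bit indices visited for a 16-bit mask; it clears each examined bit, so the loop stops right after the lowest set bit:

```cpp
#include <cstdint>
#include <vector>

// Return the indices visited when scanning `mask` from bit 15 down,
// stopping once no set bits remain.
std::vector<int> visited_bits(uint16_t mask) {
    std::vector<int> out;
    for (int i = 15; i >= 0 && mask != 0; i--) {
        out.push_back(i);
        // clear the examined bit, so the mask eventually becomes zero
        mask &= static_cast<uint16_t>(~(uint16_t(1) << i));
    }
    return out;
}
```

For the example mask 0b1100'1000'0000'0000 (bits 15, 14 and 11 set) the visited indices are 15, 14, 13, 12, 11.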

Since x86 has the BTR instruction, it was obvious to me that I should use the bit-test-and-reset idiom. Thus my initial code was straightforward.

void loop_v1(uint64_t mask) {
    for (int i=63; i >= 0 && mask != 0; i--) {
        if (test_and_clear_bit(mask, i))
            func_true(i);
        else
            func_false(i);
    }
}

Function test_and_clear_bit wraps the BTR instruction. Below is an example of how this function behaves.

uint16_t w = 0b0000'0010'1110'0100;
bool b;

// bit #1 is zero:
// - b == false
// - w == 0b0000'0010'1110'0100 — unchanged
b = test_and_clear_bit(w, 1);

// bit #2 is set:
// - b == true
// - w == 0b0000'0010'1110'0000
b = test_and_clear_bit(w, 2);

// bit #9 is set:
// - b == true
// - w == 0b0000'0000'1110'0000
b = test_and_clear_bit(w, 9);

Bit-test-and-reset

Naive

As I didn't want to use inline asm, I wrote a plain C++ function, assuming that an instruction introduced in 1986 with the 80386 CPU is already well known to compilers.

bool test_and_clear_bit(uint64_t& val, int bitpos) {
    const uint64_t bitmask = uint64_t(1) << bitpos;

    const bool ret = (val & bitmask);

    val &= ~bitmask;

    return ret;
}
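A quick sanity check of this version against the example word used earlier (the function body is repeated here, under a hypothetical name, so the snippet stands alone):

```cpp
#include <cstdint>

// Same logic as test_and_clear_bit above; repeated for self-containment.
bool test_and_clear_bit_demo(uint64_t& val, int bitpos) {
    const uint64_t bitmask = uint64_t(1) << bitpos;
    const bool ret = (val & bitmask) != 0;  // was the bit set?
    val &= ~bitmask;                        // clear it unconditionally
    return ret;
}
```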

To my surprise GCC compiled the procedure into:

 0:   48 8b 07                mov    (%rdi),%rax
 3:   41 89 f0                mov    %esi,%r8d
 6:   be 01 00 00 00          mov    $0x1,%esi
 b:   c4 e2 b9 f7 f6          shlx   %r8,%rsi,%rsi
10:   c4 e2 c8 f2 d0          andn   %rax,%rsi,%rdx
15:   48 85 f0                test   %rsi,%rax
18:   48 89 17                mov    %rdx,(%rdi)
1b:   0f 95 c0                setne  %al
1e:   c3                      retq

Apart from some ABI-related stuff, the code is an almost one-to-one translation of the C++ code. The compiler emitted the ANDN instruction, an optimization enabled when the target CPU supports the BMI extension. There's no BTR instruction.

Clang did it a bit better:

 0:   48 8b 0f                mov    (%rdi),%rcx
 3:   48 0f a3 f1             bt     %rsi,%rcx
 7:   0f 92 c0                setb   %al
 a:   48 0f b3 f1             btr    %rsi,%rcx
 e:   48 89 0f                mov    %rcx,(%rdi)
11:   c3                      retq

However, BTR is used only to reset the bit. To read the bit's value the compiler emitted a separate BT instruction.

Alternative

Bit-test-and-reset can also be implemented in the following way: reset the i-th bit; if the new unsigned value is smaller than the input value, it means the i-th bit was 1.

bool test_and_clear_bit_v2(uint64_t& val, int bitpos) {
    const uint64_t bitmask = uint64_t(1) << bitpos;
    const uint64_t old = val;

    val &= ~bitmask;

    return val < old;
}
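Both variants should agree for every input. An exhaustive check over all 16-bit masks (both function bodies repeated, with ad-hoc names, to keep the snippet self-contained):

```cpp
#include <cstdint>

// Variant 1: test the bit, then clear it.
bool tcb_v1(uint64_t& val, int bitpos) {
    const uint64_t bitmask = uint64_t(1) << bitpos;
    const bool ret = (val & bitmask) != 0;
    val &= ~bitmask;
    return ret;
}

// Variant 2: clear the bit, then compare with the old value.
bool tcb_v2(uint64_t& val, int bitpos) {
    const uint64_t bitmask = uint64_t(1) << bitpos;
    const uint64_t old = val;
    val &= ~bitmask;
    return val < old;
}

// True if both variants produce the same result and the same final
// value for all 16-bit masks and bit positions 0..15.
bool variants_agree() {
    for (uint64_t m = 0; m < 0x10000; m++) {
        for (int i = 0; i < 16; i++) {
            uint64_t a = m, b = m;
            if (tcb_v1(a, i) != tcb_v2(b, i) || a != b)
                return false;
        }
    }
    return true;
}
```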

Now GCC and clang compiled the function into nearly identical code:

# GCC
  20:       48 8b 07                mov    (%rdi),%rax
  23:       48 89 c2                mov    %rax,%rdx
  26:       48 0f b3 f2             btr    %rsi,%rdx
  2a:       48 39 c2                cmp    %rax,%rdx
  2d:       48 89 17                mov    %rdx,(%rdi)
  30:       0f 92 c0                setb   %al
  33:       c3                      retq
# clang
  20:       48 8b 07                mov    (%rdi),%rax
  23:       48 89 c1                mov    %rax,%rcx
  26:       48 0f b3 f1             btr    %rsi,%rcx
  2a:       48 89 0f                mov    %rcx,(%rdi)
  2d:       48 39 c1                cmp    %rax,%rcx
  30:       0f 92 c0                setb   %al
  33:       c3                      retq

However, it's still not perfect. BTR is used only to reset the bit; its other outcome, setting the carry flag, is ignored. Thus we have an explicit compare (CMP).

Inline asm

For the sake of completeness, here's a version that uses inline asm. As a separate function it's not that short (due to the ABI), but when inlined by the compiler it reduces to 2-3 instructions.

bool test_and_clear_bit(uint32_t& val, int bitpos) {
    uint32_t flag = 0;

    asm (
        "btr  %[bitpos], %[val]    \n"
        "setc %b[flag]             \n"
        : [val] "=r" (val),
          [flag] "=r" (flag)
        : [bitpos] "r" (bitpos),
          "0" (val),
          "1" (flag)
        : "cc"
    );

    return flag;
}

Looping

Testing MSB

Since we iterate over a sequence of bits, I tried another approach: test the most significant bit and shift the mask left in every iteration. Testing the MSB on x86 is done by checking whether a signed number is negative. Shifting left by 1 bit can be performed by a cheap addition (adding the value to itself).

The second version of the algorithm is shown below. I didn't want to put actual code in the if-branches; calls to external functions seem a fairer benchmark.

extern void func_true(int);
extern void func_false(int);

void loop_v2(uint64_t mask) {
    for (int i=63; i >= 0 && mask != 0; i--) {
        if ((int64_t)mask < 0) // test the MSB
            func_true(i);
        else
            func_false(i);

        mask <<= 1;
    }
}
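To convince ourselves that loop_v2 visits exactly the same bits as loop_v1, here is a sketch that records the (index, bit value) pairs each strategy produces instead of calling the external functions (the record_* names are ad hoc):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

using trace = std::vector<std::pair<int, bool>>;

// Strategy of loop_v1: test bit i, then clear it; stop when mask is empty.
trace record_v1(uint64_t mask) {
    trace t;
    for (int i = 63; i >= 0 && mask != 0; i--) {
        const uint64_t bit = uint64_t(1) << i;
        t.emplace_back(i, (mask & bit) != 0);
        mask &= ~bit;
    }
    return t;
}

// Strategy of loop_v2: test the MSB, then shift the mask left.
trace record_v2(uint64_t mask) {
    trace t;
    for (int i = 63; i >= 0 && mask != 0; i--) {
        t.emplace_back(i, (int64_t)mask < 0);
        mask <<= 1;
    }
    return t;
}
```

For any mask the two traces coincide: both run from bit 63 down to the lowest set bit, reporting the same values.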

Let's examine output from GCC:

70:   48 85 ff                test   %rdi,%rdi
73:   74 43                   je     b8
75:   55                      push   %rbp
76:   bd 3f 00 00 00          mov    $0x3f,%ebp
7b:   53                      push   %rbx
7c:   48 89 fb                mov    %rdi,%rbx
7f:   48 83 ec 08             sub    $0x8,%rsp
83:   eb 14                   jmp    99
85:   0f 1f 00                nopl   (%rax)
88:   e8 00 00 00 00          callq  8d
8d:   48 01 db                add    %rbx,%rbx
90:   ff cd                   dec    %ebp
92:   78 18                   js     ac
94:   48 85 db                test   %rbx,%rbx
97:   74 13                   je     ac
99:   89 ef                   mov    %ebp,%edi
9b:   48 85 db                test   %rbx,%rbx
9e:   79 e8                   jns    88
a0:   e8 00 00 00 00          callq  a5
a5:   48 01 db                add    %rbx,%rbx
a8:   ff cd                   dec    %ebp
aa:   79 e8                   jns    94
ac:   48 83 c4 08             add    $0x8,%rsp
b0:   5b                      pop    %rbx
b1:   5d                      pop    %rbp
b2:   c3                      retq
b3:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
b8:   c3                      retq

That looks quite complicated. Is clang better? The conditional jump instructions are ordered differently, yet the code complexity is similar to the GCC output:

90:   41 56                   push   %r14
92:   53                      push   %rbx
93:   50                      push   %rax
94:   48 85 ff                test   %rdi,%rdi
97:   0f 84 49 00 00 00       je     e6
9d:   48 89 fb                mov    %rdi,%rbx
a0:   41 be 3f 00 00 00       mov    $0x3f,%r14d
a6:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
ad:   00 00 00
b0:   44 89 f7                mov    %r14d,%edi
b3:   48 85 db                test   %rbx,%rbx
b6:   0f 88 14 00 00 00       js     d0
bc:   e8 00 00 00 00          callq  c1
c1:   48 01 db                add    %rbx,%rbx
c4:   0f 85 10 00 00 00       jne    da
ca:   e9 17 00 00 00          jmpq   e6
cf:   90                      nop
d0:   e8 00 00 00 00          callq  d5
d5:   48 01 db                add    %rbx,%rbx
d8:   74 0c                   je     e6
da:   41 8d 46 ff             lea    -0x1(%r14),%eax
de:   45 85 f6                test   %r14d,%r14d
e1:   41 89 c6                mov    %eax,%r14d
e4:   7f ca                   jg     b0
e6:   48 83 c4 08             add    $0x8,%rsp
ea:   5b                      pop    %rbx
eb:   41 5e                   pop    %r14
ed:   c3                      retq

Avoiding jumps

Since I decided to experiment with calls to external functions, it's worth showing a solution that avoids a few conditional jumps. The solution uses a two-element table of function pointers.

void loop_v3(uint64_t mask) {
    using function = void (*)(int);
    function func[2] = {func_false, func_true};

    for (int i=63; i >= 0 && mask != 0; i--) {
        const int idx = ((int64_t)mask < 0); // 0 or 1
        func[idx](i);
        mask <<= 1;
    }
}

Below is the GCC output:

 c0:   55                      push   %rbp
 c1:   53                      push   %rbx
 c2:   48 83 ec 18             sub    $0x18,%rsp
 c6:   c5 fa 7e 0d 00 00 00    vmovq  0x0(%rip),%xmm1
 cd:   00
 ce:   c4 e3 f1 22 05 00 00    vpinsrq $0x1,0x0(%rip),%xmm1,%xmm0
 d5:   00 00 01
 d8:   c5 f9 7f 04 24          vmovdqa %xmm0,(%rsp)
 dd:   48 85 ff                test   %rdi,%rdi
 e0:   74 26                   je     108
 e2:   48 89 fb                mov    %rdi,%rbx
 e5:   bd 3f 00 00 00          mov    $0x3f,%ebp
 ea:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
 f0:   48 89 d8                mov    %rbx,%rax
 f3:   48 c1 e8 3f             shr    $0x3f,%rax
 f7:   89 ef                   mov    %ebp,%edi
 f9:   ff 14 c4                callq  *(%rsp,%rax,8)
 fc:   48 01 db                add    %rbx,%rbx
 ff:   ff cd                   dec    %ebp
101:   78 05                   js     108
103:   48 85 db                test   %rbx,%rbx
106:   75 e8                   jne    f0
108:   48 83 c4 18             add    $0x18,%rsp
10c:   5b                      pop    %rbx
10d:   5d                      pop    %rbp
10e:   c3                      retq

And clang:

 f0:   55                      push   %rbp
 f1:   53                      push   %rbx
 f2:   50                      push   %rax
 f3:   48 85 ff                test   %rdi,%rdi
 f6:   74 26                   je     11e
 f8:   48 89 fb                mov    %rdi,%rbx
 fb:   bd 3f 00 00 00          mov    $0x3f,%ebp
100:   48 89 d8                mov    %rbx,%rax
103:   48 c1 e8 3f             shr    $0x3f,%rax
107:   89 ef                   mov    %ebp,%edi
109:   ff 14 c5 00 00 00 00    callq  *0x0(,%rax,8)
110:   85 ed                   test   %ebp,%ebp
112:   7e 0a                   jle    11e
114:   48 01 db                add    %rbx,%rbx
117:   ff cd                   dec    %ebp
119:   48 85 db                test   %rbx,%rbx
11c:   75 e2                   jne    100
11e:   48 83 c4 08             add    $0x8,%rsp
122:   5b                      pop    %rbx
123:   5d                      pop    %rbp
124:   c3                      retq

Both compilers convert the MSB into 0 or 1 (shr $0x3f, %reg) and use it as an index for the indirect call instruction.
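The key trick in loop_v3 is turning the sign bit into an array index. A minimal illustration of the two equivalent forms, the signed comparison written in the source and the logical shift the compilers actually emit (function names are hypothetical):

```cpp
#include <cstdint>

// 0 or 1 depending on the most significant bit, via signed comparison;
// compilers lower this to a logical shift right by 63.
int msb_index_cmp(uint64_t mask)   { return (int64_t)mask < 0; }

// The same value computed explicitly, matching the emitted `shr $0x3f`.
int msb_index_shift(uint64_t mask) { return (int)(mask >> 63); }
```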

Additional info

Compilers:

Compiler flags:

Source code is available.