Author: Wojciech Muła
Added on: 2021-12-22
The problem: there is a bitmask (16-, 32-, or 64-bit). We need to scan it backward, starting from the most significant bit, and we finish at the lowest set bit. For instance, in the 16-bit mask 0b1100'1000'0000'0000 we scan bits from 15 down to 11. Depending on each bit's value we perform different tasks.
Since x86 has the BTR instruction, it was obvious to me that I should use the bit-test-and-reset idiom. Thus my initial code was straightforward.
```cpp
void loop_v1(uint64_t mask) {
    for (int i = 63; i >= 0 && mask != 0; i--) {
        if (test_and_clear_bit(mask, i))
            func_true(i);
        else
            func_false(i);
    }
}
```
Function test_and_clear_bit wraps the BTR instruction. Below is an example of how this function behaves.
```cpp
uint16_t w = 0b0000'0010'1110'0100;
bool b;

// bit #1 is zero:
// - b == false
// - w == 0b0000'0010'1110'0100 (unchanged)
b = test_and_clear_bit(w, 1);

// bit #2 is set:
// - b == true
// - w == 0b0000'0010'1110'0000
b = test_and_clear_bit(w, 2);

// bit #9 is set:
// - b == true
// - w == 0b0000'0000'1110'0000
b = test_and_clear_bit(w, 9);
```
As I didn't want to use inline asm, I wrote a plain C++ function, assuming that an instruction introduced back in 1986 with the 80386 CPU is well known to compilers.
```cpp
bool test_and_clear_bit(uint64_t& val, int bitpos) {
    const uint64_t bitmask = uint64_t(1) << bitpos;
    const bool     ret     = (val & bitmask);

    val &= ~bitmask;

    return ret;
}
```
To my surprise, GCC compiled the procedure into:
```
 0: 48 8b 07             mov    (%rdi),%rax
 3: 41 89 f0             mov    %esi,%r8d
 6: be 01 00 00 00       mov    $0x1,%esi
 b: c4 e2 b9 f7 f6       shlx   %r8,%rsi,%rsi
10: c4 e2 c8 f2 d0       andn   %rax,%rsi,%rdx
15: 48 85 f0             test   %rsi,%rax
18: 48 89 17             mov    %rdx,(%rdi)
1b: 0f 95 c0             setne  %al
1e: c3                   retq
```
Apart from some ABI-related stuff, the code is an almost one-to-one translation of the C++ code. The compiler emitted the ANDN instruction; this optimization is possible because the target CPU supports the BMI extension. There is no BTR instruction.
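For reference, ANDN (from BMI1) computes ~x & y in a single instruction, which matches the val & ~bitmask expression directly. A rough C++ rendering of GCC's output could look like the sketch below; the function and variable names are mine, not the compiler's:

```cpp
// A sketch of what GCC's output computes, step by step.
bool test_and_clear_bit_as_compiled(uint64_t& val, int bitpos) {
    const uint64_t bitmask = uint64_t(1) << bitpos;  // shlx
    const uint64_t cleared = ~bitmask & val;         // andn
    const bool     ret     = (val & bitmask) != 0;   // test + setne
    val = cleared;                                   // store back
    return ret;
}
```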
Clang did it a bit better:
```
 0: 48 8b 0f             mov    (%rdi),%rcx
 3: 48 0f a3 f1          bt     %rsi,%rcx
 7: 0f 92 c0             setb   %al
 a: 48 0f b3 f1          btr    %rsi,%rcx
 e: 48 89 0f             mov    %rcx,(%rdi)
11: c3                   retq
```
However, BTR is used only to reset the bit; to read the bit's value the compiler emitted a separate BT instruction.
Bit-test-and-reset can also be implemented in the following way: reset the i-th bit; if the new value, compared as an unsigned number, is smaller than the input value, then the i-th bit was 1.
```cpp
bool test_and_clear_bit_v2(uint64_t& val, int bitpos) {
    const uint64_t bitmask = uint64_t(1) << bitpos;
    const uint64_t old     = val;

    val &= ~bitmask;

    return val < old;
}
```
Now GCC and clang compile the code into essentially the same machine code:
```
# GCC
20: 48 8b 07             mov    (%rdi),%rax
23: 48 89 c2             mov    %rax,%rdx
26: 48 0f b3 f2          btr    %rsi,%rdx
2a: 48 39 c2             cmp    %rax,%rdx
2d: 48 89 17             mov    %rdx,(%rdi)
30: 0f 92 c0             setb   %al
33: c3                   retq
```
```
# clang
20: 48 8b 07             mov    (%rdi),%rax
23: 48 89 c1             mov    %rax,%rcx
26: 48 0f b3 f1          btr    %rsi,%rcx
2a: 48 89 0f             mov    %rcx,(%rdi)
2d: 48 39 c1             cmp    %rax,%rcx
30: 0f 92 c0             setb   %al
33: c3                   retq
```
However, it's still not perfect. BTR is used only to reset the bit; its side effect of setting the carry flag is ignored. Thus we get an explicit compare (CMP).
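Before moving on, it's worth checking that both variants agree; below is a minimal test sketch (my addition, not part of the original sources), using the two functions defined above:

```cpp
#include <cassert>
#include <cstdint>

void check_variants() {
    for (int bitpos = 0; bitpos < 64; bitpos++) {
        uint64_t a = 0x0123'4567'89ab'cdefull;
        uint64_t b = a;
        const bool r1 = test_and_clear_bit(a, bitpos);     // mask & ~bitmask version
        const bool r2 = test_and_clear_bit_v2(b, bitpos);  // compare-with-old version
        assert(r1 == r2);
        assert(a == b);
    }
}
```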
For the sake of completeness, here is a version that uses inline asm. As a standalone function it's not that short (due to the ABI), but when inlined by a compiler it reduces to 2-3 instructions.
```cpp
bool test_and_clear_bit(uint32_t& val, int bitpos) {
    uint32_t flag = 0;
    asm (
        "btr  %[bitpos], %[val] \n"
        "setc %b[flag]          \n"
        : [val]    "=r" (val),
          [flag]   "=r" (flag)
        : [bitpos] "r"  (bitpos),
          "0" (val),
          "1" (flag)
        : "cc"
    );

    return flag;
}
```
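Note that this wrapper operates on a 32-bit value, while loop_v1 works on a 64-bit mask. A 64-bit overload is a mechanical change; the following is a sketch of my own, not taken from the benchmarked sources:

```cpp
bool test_and_clear_bit(uint64_t& val, int bitpos) {
    uint64_t flag = 0;
    asm (
        "btr  %[bitpos], %[val] \n"  // reset the bit; its old value lands in CF
        "setc %b[flag]          \n"  // materialize the carry flag as 0 or 1
        : [val] "+r" (val), [flag] "+r" (flag)
        : [bitpos] "r" (uint64_t(bitpos))
        : "cc"
    );
    return flag;
}
```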
Since we iterate over a sequence of bits, I tried another approach: in every iteration we test the most significant bit and shift the mask left by one. Testing the MSB on x86 boils down to checking whether a signed number is negative. Shifting left by 1 bit can be done with a cheap addition, since x + x == x << 1.
The second version of the algorithm is shown below. I didn't want to put actual code in the if-branches, and I think calls to external functions make the comparison fairer.
```cpp
extern void func_true(int);
extern void func_false(int);

void loop_v2(uint64_t mask) {
    for (int i = 63; i >= 0 && mask != 0; i--) {
        if ((int64_t)mask < 0) // test the MSB
            func_true(i);
        else
            func_false(i);

        mask <<= 1;
    }
}
```
Let's examine the output from GCC:
```
70: 48 85 ff             test   %rdi,%rdi
73: 74 43                je     b8
75: 55                   push   %rbp
76: bd 3f 00 00 00       mov    $0x3f,%ebp
7b: 53                   push   %rbx
7c: 48 89 fb             mov    %rdi,%rbx
7f: 48 83 ec 08          sub    $0x8,%rsp
83: eb 14                jmp    99
85: 0f 1f 00             nopl   (%rax)
88: e8 00 00 00 00       callq  8d
8d: 48 01 db             add    %rbx,%rbx
90: ff cd                dec    %ebp
92: 78 18                js     ac
94: 48 85 db             test   %rbx,%rbx
97: 74 13                je     ac
99: 89 ef                mov    %ebp,%edi
9b: 48 85 db             test   %rbx,%rbx
9e: 79 e8                jns    88
a0: e8 00 00 00 00       callq  a5
a5: 48 01 db             add    %rbx,%rbx
a8: ff cd                dec    %ebp
aa: 79 e8                jns    94
ac: 48 83 c4 08          add    $0x8,%rsp
b0: 5b                   pop    %rbx
b1: 5d                   pop    %rbp
b2: c3                   retq
b3: 0f 1f 44 00 00       nopl   0x0(%rax,%rax,1)
b8: c3                   retq
```
That looks quite complicated. Is clang better? The conditional jump instructions are ordered differently, yet the code complexity is similar to the GCC output:
```
90: 41 56                push   %r14
92: 53                   push   %rbx
93: 50                   push   %rax
94: 48 85 ff             test   %rdi,%rdi
97: 0f 84 49 00 00 00    je     e6
9d: 48 89 fb             mov    %rdi,%rbx
a0: 41 be 3f 00 00 00    mov    $0x3f,%r14d
a6: 66 2e 0f 1f 84 00 00 nopw   %cs:0x0(%rax,%rax,1)
ad: 00 00 00
b0: 44 89 f7             mov    %r14d,%edi
b3: 48 85 db             test   %rbx,%rbx
b6: 0f 88 14 00 00 00    js     d0
bc: e8 00 00 00 00       callq  c1
c1: 48 01 db             add    %rbx,%rbx
c4: 0f 85 10 00 00 00    jne    da
ca: e9 17 00 00 00       jmpq   e6
cf: 90                   nop
d0: e8 00 00 00 00       callq  d5
d5: 48 01 db             add    %rbx,%rbx
d8: 74 0c                je     e6
da: 41 8d 46 ff          lea    -0x1(%r14),%eax
de: 45 85 f6             test   %r14d,%r14d
e1: 41 89 c6             mov    %eax,%r14d
e4: 7f ca                jg     b0
e6: 48 83 c4 08          add    $0x8,%rsp
ea: 5b                   pop    %rbx
eb: 41 5e                pop    %r14
ed: c3                   retq
```
Since I decided to experiment with calls to external functions, it's worth showing a solution that avoids a few conditional jumps. The solution uses a two-element table of function pointers.
```cpp
void loop_v3(uint64_t mask) {
    using function = void (*)(int);
    function func[2] = {func_false, func_true};

    for (int i = 63; i >= 0 && mask != 0; i--) {
        const int idx = ((int64_t)mask < 0); // 0 or 1
        func[idx](i);
        mask <<= 1;
    }
}
```
Below is the GCC output:
```
 c0: 55                   push   %rbp
 c1: 53                   push   %rbx
 c2: 48 83 ec 18          sub    $0x18,%rsp
 c6: c5 fa 7e 0d 00 00 00 vmovq  0x0(%rip),%xmm1
 cd: 00
 ce: c4 e3 f1 22 05 00 00 vpinsrq $0x1,0x0(%rip),%xmm1,%xmm0
 d5: 00 00 01
 d8: c5 f9 7f 04 24       vmovdqa %xmm0,(%rsp)
 dd: 48 85 ff             test   %rdi,%rdi
 e0: 74 26                je     108
 e2: 48 89 fb             mov    %rdi,%rbx
 e5: bd 3f 00 00 00       mov    $0x3f,%ebp
 ea: 66 0f 1f 44 00 00    nopw   0x0(%rax,%rax,1)
 f0: 48 89 d8             mov    %rbx,%rax
 f3: 48 c1 e8 3f          shr    $0x3f,%rax
 f7: 89 ef                mov    %ebp,%edi
 f9: ff 14 c4             callq  *(%rsp,%rax,8)
 fc: 48 01 db             add    %rbx,%rbx
 ff: ff cd                dec    %ebp
101: 78 05                js     108
103: 48 85 db             test   %rbx,%rbx
106: 75 e8                jne    f0
108: 48 83 c4 18          add    $0x18,%rsp
10c: 5b                   pop    %rbx
10d: 5d                   pop    %rbp
10e: c3                   retq
```
And clang:
```
 f0: 55                   push   %rbp
 f1: 53                   push   %rbx
 f2: 50                   push   %rax
 f3: 48 85 ff             test   %rdi,%rdi
 f6: 74 26                je     11e
 f8: 48 89 fb             mov    %rdi,%rbx
 fb: bd 3f 00 00 00       mov    $0x3f,%ebp
100: 48 89 d8             mov    %rbx,%rax
103: 48 c1 e8 3f          shr    $0x3f,%rax
107: 89 ef                mov    %ebp,%edi
109: ff 14 c5 00 00 00 00 callq  *0x0(,%rax,8)
110: 85 ed                test   %ebp,%ebp
112: 7e 0a                jle    11e
114: 48 01 db             add    %rbx,%rbx
117: ff cd                dec    %ebp
119: 48 85 db             test   %rbx,%rbx
11c: 75 e2                jne    100
11e: 48 83 c4 08          add    $0x8,%rsp
122: 5b                   pop    %rbx
123: 5d                   pop    %rbp
124: c3                   retq
```
Both compilers convert the MSB into 0 or 1 (shr $0x3f,%reg) and use it as the index in an indirect call instruction.
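The same 0-or-1 index can be spelled out at the source level with an unsigned shift; the variant below is my sketch (not part of the original experiment) and should compile to the same pattern:

```cpp
void loop_v3_shift(uint64_t mask) {
    using function = void (*)(int);
    function func[2] = {func_false, func_true};

    for (int i = 63; i >= 0 && mask != 0; i--) {
        const size_t idx = size_t(mask >> 63);  // MSB as 0 or 1, like shr $0x3f
        func[idx](i);
        mask <<= 1;
    }
}
```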
Compilers:
Compiler flags:
Source code is available.