Author: Wojciech Muła
Added on: 2022-01-31
The other day I came across the following line:
len * ("11124811248484"[type < 14 ? type:0]-'0') > 4;
My first reaction was "WTF", but later I realized that such hackish code has to be a response to poor compiler optimizations. And taking into account that the code might be quite old, it is a perfectly good solution, only a bit unreadable.
In fact we have something like len * coefficient(type) > 4, where coefficient is a value from the set {1, 2, 4, 8}.
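For reference, here is a minimal, readable sketch of what the one-liner computes; the function name readable and the explicit table are my own reconstruction, not code from the original project.

#include <stdbool.h>

/* The string literal is just a compact lookup table: decoded, it maps
   type to a coefficient from the set {1, 2, 4, 8}. */
static const unsigned char coefficient[14] = {
    1, 1, 1, 2, 4, 8, 1, 1, 2, 4, 8, 4, 8, 4
};

bool readable(unsigned int len, unsigned int type) {
    const unsigned int coef = (type < 14) ? coefficient[type] : 1;
    return len * coef > 4;
}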
I want to show that this problem can be solved not only cleaner but also faster.
First, let's examine the compiler output for the original expression. We assume both variables are unsigned integers.
bool fun1(unsigned int len, unsigned int type) {
    return len * ("11124811248484"[type < 14 ? type : 0] - '0') > 4;
}
And the output from gcc -O3 -march=skylake -S (GCC version 10.2.1):
        cmpl    $13, %esi
        ja      .L2
        movl    %esi, %esi
        leaq    .LC0(%rip), %rax        # coef = "11124811248484"[type] if type < 14
        movsbl  (%rax,%rsi), %eax
        subl    $48, %eax               # coef -= '0'
        imull   %eax, %edi              # edi = len * coef
.L2:
        cmpl    $4, %edi                # len * coef > 4
        seta    %al
        ret
It's worth noting that the compiler knows that for type >= 14 we always fetch the value "11124811248484"[0] - '0', which equals one. Thus, it jumps straight to the final comparison, skipping the multiplication.
Optimization #1
We may omit the subtraction of the constant '0' if, instead of ASCII digits, we store the raw values with hex escape sequences. Perhaps for an older compiler we would need octal escapes. Either way, we avoid a subtraction.
bool fun2(unsigned int len, unsigned int type) {
    if (type >= 14) {
        type = 0;
    }

    const unsigned coef = "\x01\x01\x01\x02\x04\x08\x01\x01\x02\x04\x08\x04\x08\x04"[type];

    return (len * coef) > 4;
}
The assembly code:
        cmpl    $13, %esi
        ja      .L5
        movl    %esi, %esi
        leaq    .LC1(%rip), %rax
        movsbl  (%rax,%rsi), %eax
        imull   %eax, %edi
.L5:
        cmpl    $4, %edi
        seta    %al
        ret
Optimization #2
We removed one instruction. Can we do better? The comparison type < 14 cannot be avoided, but the constants are {1, 2, 4, 8}. All are powers of two, thus we can replace the multiplication with a left shift just by adjusting the constants, storing the exponents instead of the values.
bool fun3(unsigned int len, unsigned int type) {
    if (type >= 14) {
        type = 0;
    }

    const unsigned shift = "\x00\x00\x00\x01\x02\x03\x00\x00\x01\x02\x03\x02\x03\x02"[type];

    return (len << shift) > 4;
}
The assembly code:
        cmpl    $13, %esi
        ja      .L7
        movl    %esi, %esi
        leaq    .LC2(%rip), %rax
        movsbl  (%rax,%rsi), %eax
        shlx    %eax, %edi, %edi
.L7:
        cmpl    $4, %edi
        seta    %al
        ret
Since we are targeting Skylake, it was possible to use the SHLX instruction from the BMI2 extension (enabled implicitly by -march=skylake). The instruction performs a left shift but does not alter the CPU flags register; by not touching the flags, it does not create any indirect dependencies between instructions.
Optimization #3
Does it seem we have reached the end? Not really. We are evaluating only four distinct cases:
len * 1 > 4
len * 2 > 4
len * 4 > 4
len * 8 > 4
Remembering that len is an unsigned integer, we may rewrite the expressions:
len >= 5    # 5 * 1 > 4
len >= 3    # 3 * 2 > 4
len >= 2    # 2 * 4 > 4
len >= 1    # 1 * 8 > 4
We have just figured out, for each coefficient, the minimum value of len for which the expression len * coefficient > 4 is true; in general, that threshold is 4 / coefficient + 1 (with integer division). As a result, we may get rid of the multiplication/shift altogether.
bool fun4(unsigned int len, unsigned int type) {
    if (type >= 14) {
        type = 0;
    }

    const unsigned bound = "\x05\x05\x05\x03\x02\x01\x05\x05\x03\x02\x01\x02\x01\x02"[type];

    return len >= bound;
}
The compiler output:
        movl    $5, %eax                # bound = lookup[0]
        cmpl    $13, %esi               # type >= 14
        ja      .L9
        movl    %esi, %esi
        leaq    .LC3(%rip), %rax
        movsbl  (%rax,%rsi), %eax       # bound = lookup[type]
.L9:
        cmpl    %eax, %edi              # len >= bound
        setnb   %al
        ret
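As a sanity check, here is a small brute-force test, my own addition rather than part of the original note, that compares local copies of the original expression (fun1) and the final version (fun4) over all type values and a range of small len values, where the multiplication cannot wrap around.

#include <stdbool.h>
#include <stdio.h>

/* Local copy of the original expression. */
static bool fun1(unsigned int len, unsigned int type) {
    return len * ("11124811248484"[type < 14 ? type : 0] - '0') > 4;
}

/* Local copy of the final, comparison-only version. */
static bool fun4(unsigned int len, unsigned int type) {
    if (type >= 14) {
        type = 0;
    }

    const unsigned bound = "\x05\x05\x05\x03\x02\x01\x05\x05\x03\x02\x01\x02\x01\x02"[type];

    return len >= bound;
}

int main(void) {
    for (unsigned int type = 0; type < 32; type++) {
        for (unsigned int len = 0; len < 1000; len++) {
            if (fun1(len, type) != fun4(len, type)) {
                printf("mismatch: len=%u, type=%u\n", len, type);
                return 1;
            }
        }
    }

    puts("all ok");
    return 0;
}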
Conclusions: