Author: Wojciech Muła
Added on: 2022-01-31
The other day I came across the following line:
len * ("11124811248484"[type < 14 ? type:0]-'0') > 4;
My first reaction was "WTF", but later I realized that such hackish code has to be a response to poor compiler optimizations. And taking into account that the code might be quite old, it is a perfectly good solution, only a bit unreadable.
In fact we have something like len * coefficient(type) > 4, where coefficient is a value from the set {1, 2, 4, 8}.
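For reference, here is a minimal, readable sketch of what the one-liner computes; the function name readable and the explicit table are my own reconstruction, not code from the original project.

#include <stdbool.h>

/* The string literal is just a compact lookup table: decoded, it maps
   type to a coefficient from the set {1, 2, 4, 8}. */
static const unsigned char coefficient[14] = {
    1, 1, 1, 2, 4, 8, 1, 1, 2, 4, 8, 4, 8, 4
};

bool readable(unsigned int len, unsigned int type) {
    const unsigned int coef = (type < 14) ? coefficient[type] : 1;
    return len * coef > 4;
}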
I want to show that this problem can be solved not only cleaner but also faster.
First, let's examine the compiler output for the original expression. We assume both variables are unsigned integers.
bool fun1(unsigned int len, unsigned int type) {
    return len * ("11124811248484"[type < 14 ? type : 0] - '0') > 4;
}
And the output from gcc -O3 -march=skylake -S (GCC version 10.2.1):
        cmpl    $13, %esi
        ja      .L2
        movl    %esi, %esi
        leaq    .LC0(%rip), %rax        # coef = "11124811248484"[type] if type < 14
        movsbl  (%rax,%rsi), %eax
        subl    $48, %eax               # coef -= '0'
        imull   %eax, %edi              # edi = len * coef
.L2:
        cmpl    $4, %edi                # len * coef > 4
        seta    %al
        ret
It's worth noting that the compiler knows that for type >= 14 we always fetch the value "11124811248484"[0] - '0', which equals one. Thus, it jumps straight to the final comparison, skipping the multiplication.
Optimization #1
We may omit the subtraction of the constant '0' if, instead of ASCII digits, we store the raw values with hex escape sequences. Perhaps for an older compiler we would need octal escapes. Either way, we avoid a subtraction.
bool fun2(unsigned int len, unsigned int type) {
    if (type >= 14) {
        type = 0;
    }

    const unsigned coef = "\x01\x01\x01\x02\x04\x08\x01\x01\x02\x04\x08\x04\x08\x04"[type];

    return (len * coef) > 4;
}
The assembly code:
        cmpl    $13, %esi
        ja      .L5
        movl    %esi, %esi
        leaq    .LC1(%rip), %rax
        movsbl  (%rax,%rsi), %eax
        imull   %eax, %edi
.L5:
        cmpl    $4, %edi
        seta    %al
        ret
Optimization #2
We removed one instruction. Can we do better? The comparison type < 14 cannot be avoided, but the constants are {1, 2, 4, 8}. All are powers of two, thus we can replace the multiplication with a left shift just by adjusting the constants, storing the exponents instead of the values.
bool fun3(unsigned int len, unsigned int type) {
    if (type >= 14) {
        type = 0;
    }

    const unsigned shift = "\x00\x00\x00\x01\x02\x03\x00\x00\x01\x02\x03\x02\x03\x02"[type];

    return (len << shift) > 4;
}
The assembly code:
        cmpl    $13, %esi
        ja      .L7
        movl    %esi, %esi
        leaq    .LC2(%rip), %rax
        movsbl  (%rax,%rsi), %eax
        shlx    %eax, %edi, %edi
.L7:
        cmpl    $4, %edi
        seta    %al
        ret
Since we are targeting Skylake, it was possible to use the SHLX instruction from the BMI2 extension (enabled implicitly by -march=skylake). The instruction performs a left shift but does not alter the CPU flags register; by not touching the flags, it does not create any indirect dependencies between instructions.
Optimization #3
Does it seem we have reached the end? Not really. We are evaluating only four distinct cases:
len * 1 > 4
len * 2 > 4
len * 4 > 4
len * 8 > 4
Remembering that len is an unsigned integer, we may rewrite the expressions:
len >= 5    # 5 * 1 > 4
len >= 3    # 3 * 2 > 4
len >= 2    # 2 * 4 > 4
len >= 1    # 1 * 8 > 4
We have just figured out, for each coefficient, the minimum value of len for which the expression len * coefficient > 4 is true; in general, that threshold is 4 / coefficient + 1 (with integer division). As a result, we may get rid of the multiplication/shift altogether.
bool fun4(unsigned int len, unsigned int type) {
    if (type >= 14) {
        type = 0;
    }

    const unsigned bound = "\x05\x05\x05\x03\x02\x01\x05\x05\x03\x02\x01\x02\x01\x02"[type];

    return len >= bound;
}
The compiler output:
        movl    $5, %eax                # bound = lookup[0]
        cmpl    $13, %esi               # type >= 14
        ja      .L9
        movl    %esi, %esi
        leaq    .LC3(%rip), %rax
        movsbl  (%rax,%rsi), %eax       # bound = lookup[type]
.L9:
        cmpl    %eax, %edi              # len >= bound
        setnb   %al
        ret
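As a sanity check, here is a small brute-force test, my own addition rather than part of the original note, that compares local copies of the original expression (fun1) and the final version (fun4) over all type values and a range of small len values, where the multiplication cannot wrap around.

#include <stdbool.h>
#include <stdio.h>

/* Local copy of the original expression. */
static bool fun1(unsigned int len, unsigned int type) {
    return len * ("11124811248484"[type < 14 ? type : 0] - '0') > 4;
}

/* Local copy of the final, comparison-only version. */
static bool fun4(unsigned int len, unsigned int type) {
    if (type >= 14) {
        type = 0;
    }

    const unsigned bound = "\x05\x05\x05\x03\x02\x01\x05\x05\x03\x02\x01\x02\x01\x02"[type];

    return len >= bound;
}

int main(void) {
    for (unsigned int type = 0; type < 32; type++) {
        for (unsigned int len = 0; len < 1000; len++) {
            if (fun1(len, type) != fun4(len, type)) {
                printf("mismatch: len=%u, type=%u\n", len, type);
                return 1;
            }
        }
    }

    puts("all ok");
    return 0;
}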
Conclusions: