Scalar version of SSE move mask instruction

Author: Wojciech Muła
Added on:2014-03-16
Updated on:2015-12-28 (64-bit code)

SSE instruction PMOVMSKB gathers all most significant bits from bytes and stores them as a single 16-bit value; similar action is performed by MOVMSKPD and MOVMSKPS.

Such operation could be easily done using scalar multiplication. But first see the example (we use 32-bit number, so result has 4 bits):

input := A.......B.......C.......D.......

Bits A, B, C and D are important, dots denote zero. The bit A have to be moved to the position 3, the bit B - 2, the bit C — 1 and the bit D — 0.

tmp1 := input <<       4
tmp2 := input << ( 8 + 3)
tmp3 := input << (16 + 2)
tmp4 := input << (24 + 1)

The layout of tmpX:

input:  ................................A.......B.......C.......D.......
        |      higher 32 bits          ||       lower 32 bits          |

tmp1:   ............................A.......B.......C.......D...........
tmp2:   .....................A.......B.......C.......D..................
tmp3:   ..............A.......B.......C.......D.........................
tmp4:   .......A.......B.......C.......D................................

Now tmp1..4 have to be ored together:

tmp5:   .......A......AB.....ABC....ABCD....BCD.....CD......D...........
        |      higher 32 bits          ||       lower 32 bits          |
                                    ^^^^
                                    3210 - bit position within byte

The result bits are located at the lowest byte of the higher 32-bits word, so just single shift & bit-and are required to isolate them.

Fact: none of intermediate values tmp1..4 has set bit on the same position. Thanks to that a single multiplication can be used to perform the all shifts and bit-or — while adding intermediate values never appear a carry bit, that would "disturb" result.

The constant for multiplication is derived directly from the shift amounts:

// 0x02040810
uint32_t movmask = (1u << 4) |
                   (1u << (8 + 3)) |
                   (1u << (16 + 2)) |
                   (1u << (24 + 1));

Sample implementation

C function for 32-bit numbers:

#include <stdint.h>

uint32_t movmask(const uint32_t input) {

    assert((input & 0x7f7f7f7f) == 0);

    const uint32_t mult = 0x02040810;

    const uint64_t result = (uint64_t)input * mult;

    return (result >> 32) & 0xff;
}

GCC generated the best possible code:

Disassembly of section .text:

00000000 :
   0: b8 10 08 04 02        mov    $0x2040810,%eax
   5: f7 64 24 04           mull   0x4(%esp)
   9: 0f b6 c2              movzbl %dl,%eax
   c: c3                    ret

C function for 64-bit numbers (the type __int128 is the GCC extension):

uint64_t movmask64_unsafe(const uint64_t input) {

    assert((input & 0x7f7f7f7f7f7f7f7flu) == 0);

    const uint64_t mult = (1lu << (1*8 - 0))
                        | (1lu << (2*8 - 1))
                        | (1lu << (3*8 - 2))
                        | (1lu << (4*8 - 3))
                        | (1lu << (5*8 - 4))
                        | (1lu << (6*8 - 5))
                        | (1lu << (7*8 - 6))
                        | (1lu << (8*8 - 7))
                        ;

    const unsigned __int128 result = (unsigned __int128)mult * (unsigned __int128)input;

    return (result >> 64) & 0xff;
}

And disassembly, GCC generated also the shortest possible code:

48 bf 00 81 40 20 10 08 04 02    movabs $0x204081020408100,%rdi
48 f7 e7                         mul    %rdi
0f b6 c2                         movzbl %dl,%eax

Full source code is available, including the proof written in Python.