Author: | Wojciech Muła |
---|---|

Added on: | 2014-03-16 |

Updated on: | 2015-12-28 (64-bit code) |

SSE instruction **PMOVMSKB** gathers all most significant bits from
bytes and stores them as a single 16-bit value; similar action is
performed by **MOVMSKPD** and **MOVMSKPS**.

Such operation could be easily done using scalar multiplication. But first see the example (we use 32-bit number, so result has 4 bits):

input := A.......B.......C.......D.......

Bits A, B, C and D are important, dots denote zero. The bit **A** have to be
moved to the position 3, the bit **B** - 2, the bit **C** — 1 and the bit
**D** — 0.

tmp1 := input << 4 tmp2 := input << ( 8 + 3) tmp3 := input << (16 + 2) tmp4 := input << (24 + 1)

The layout of tmpX:

input: ................................A.......B.......C.......D....... | higher 32 bits || lower 32 bits | tmp1: ............................A.......B.......C.......D........... tmp2: .....................A.......B.......C.......D.................. tmp3: ..............A.......B.......C.......D......................... tmp4: .......A.......B.......C.......D................................

Now `tmp1..4` have to be ored together:

tmp5: .......A......AB.....ABC....ABCD....BCD.....CD......D........... | higher 32 bits || lower 32 bits | ^^^^ 3210 - bit position within byte

The result bits are located at the lowest byte of the higher 32-bits word, so just single shift & bit-and are required to isolate them.

**Fact**: none of intermediate values `tmp1..4` has set bit on the same
position. Thanks to that a single multiplication can be used to perform the
**all** shifts and bit-or — while adding intermediate values never appear
a carry bit, that would "disturb" result.

The constant for multiplication is derived directly from the shift amounts:

// 0x02040810 uint32_t movmask = (1u << 4) | (1u << (8 + 3)) | (1u << (16 + 2)) | (1u << (24 + 1));

C function for 32-bit numbers:

#include <stdint.h> uint32_t movmask(const uint32_t input) { assert((input & 0x7f7f7f7f) == 0); const uint32_t mult = 0x02040810; const uint64_t result = (uint64_t)input * mult; return (result >> 32) & 0xff; }

GCC generated the best possible code:

Disassembly of section .text:

00000000 : 0: b8 10 08 04 02 mov $0x2040810,%eax 5: f7 64 24 04 mull 0x4(%esp) 9: 0f b6 c2 movzbl %dl,%eax c: c3 ret

C function for 64-bit numbers (the type `__int128` is the GCC extension):

uint64_t movmask64_unsafe(const uint64_t input) { assert((input & 0x7f7f7f7f7f7f7f7flu) == 0); const uint64_t mult = (1lu << (1*8 - 0)) | (1lu << (2*8 - 1)) | (1lu << (3*8 - 2)) | (1lu << (4*8 - 3)) | (1lu << (5*8 - 4)) | (1lu << (6*8 - 5)) | (1lu << (7*8 - 6)) | (1lu << (8*8 - 7)) ; const unsigned __int128 result = (unsigned __int128)mult * (unsigned __int128)input; return (result >> 64) & 0xff; }

And disassembly, GCC generated also the shortest possible code:

48 bf 00 81 40 20 10 08 04 02 movabs $0x204081020408100,%rdi 48 f7 e7 mul %rdi 0f b6 c2 movzbl %dl,%eax

Full source code is available, including the **proof** written in Python.