# Scalar version of SSE move mask instruction

Author: Wojciech Muła 2014-03-16 2015-12-28 (64-bit code)

SSE instruction PMOVMSKB gathers all most significant bits from bytes and stores them as a single 16-bit value; similar action is performed by MOVMSKPD and MOVMSKPS.

Such operation could be easily done using scalar multiplication. But first see the example (we use 32-bit number, so result has 4 bits):

```input := A.......B.......C.......D.......
```

Bits A, B, C and D are important, dots denote zero. The bit A have to be moved to the position 3, the bit B - 2, the bit C — 1 and the bit D — 0.

```tmp1 := input <<       4
tmp2 := input << ( 8 + 3)
tmp3 := input << (16 + 2)
tmp4 := input << (24 + 1)
```

The layout of tmpX:

```input:  ................................A.......B.......C.......D.......
|      higher 32 bits          ||       lower 32 bits          |

tmp1:   ............................A.......B.......C.......D...........
tmp2:   .....................A.......B.......C.......D..................
tmp3:   ..............A.......B.......C.......D.........................
tmp4:   .......A.......B.......C.......D................................
```

Now tmp1..4 have to be ored together:

```tmp5:   .......A......AB.....ABC....ABCD....BCD.....CD......D...........
|      higher 32 bits          ||       lower 32 bits          |
^^^^
3210 - bit position within byte
```

The result bits are located at the lowest byte of the higher 32-bits word, so just single shift & bit-and are required to isolate them.

Fact: none of intermediate values tmp1..4 has set bit on the same position. Thanks to that a single multiplication can be used to perform the all shifts and bit-or — while adding intermediate values never appear a carry bit, that would "disturb" result.

The constant for multiplication is derived directly from the shift amounts:

```// 0x02040810
uint32_t movmask = (1u << 4) |
(1u << (8 + 3)) |
(1u << (16 + 2)) |
(1u << (24 + 1));
```

# Sample implementation

C function for 32-bit numbers:

```#include <stdint.h>

assert((input & 0x7f7f7f7f) == 0);

const uint32_t mult = 0x02040810;

const uint64_t result = (uint64_t)input * mult;

return (result >> 32) & 0xff;
}
```

GCC generated the best possible code:

Disassembly of section .text:

```00000000 :
0: b8 10 08 04 02        mov    \$0x2040810,%eax
5: f7 64 24 04           mull   0x4(%esp)
9: 0f b6 c2              movzbl %dl,%eax
c: c3                    ret
```

C function for 64-bit numbers (the type __int128 is the GCC extension):

```uint64_t movmask64_unsafe(const uint64_t input) {

assert((input & 0x7f7f7f7f7f7f7f7flu) == 0);

const uint64_t mult = (1lu << (1*8 - 0))
| (1lu << (2*8 - 1))
| (1lu << (3*8 - 2))
| (1lu << (4*8 - 3))
| (1lu << (5*8 - 4))
| (1lu << (6*8 - 5))
| (1lu << (7*8 - 6))
| (1lu << (8*8 - 7))
;

const unsigned __int128 result = (unsigned __int128)mult * (unsigned __int128)input;

return (result >> 64) & 0xff;
}
```

And disassembly, GCC generated also the shortest possible code:

```48 bf 00 81 40 20 10 08 04 02    movabs \$0x204081020408100,%rdi
48 f7 e7                         mul    %rdi
0f b6 c2                         movzbl %dl,%eax
```

Full source code is available, including the proof written in Python.