Author: | Wojciech Muła |
---|---|
Added on: | 2014-03-16 |
Updated on: | 2015-12-28 (64-bit code) |
SSE instruction PMOVMSKB gathers all most significant bits from bytes and stores them as a single 16-bit value; similar action is performed by MOVMSKPD and MOVMSKPS.
Such operation could be easily done using scalar multiplication. But first see the example (we use 32-bit number, so result has 4 bits):
input := A.......B.......C.......D.......
Bits A, B, C and D are important, dots denote zero. The bit A have to be moved to the position 3, the bit B - 2, the bit C — 1 and the bit D — 0.
tmp1 := input << 4 tmp2 := input << ( 8 + 3) tmp3 := input << (16 + 2) tmp4 := input << (24 + 1)
The layout of tmpX:
input: ................................A.......B.......C.......D....... | higher 32 bits || lower 32 bits | tmp1: ............................A.......B.......C.......D........... tmp2: .....................A.......B.......C.......D.................. tmp3: ..............A.......B.......C.......D......................... tmp4: .......A.......B.......C.......D................................
Now tmp1..4 have to be ored together:
tmp5: .......A......AB.....ABC....ABCD....BCD.....CD......D........... | higher 32 bits || lower 32 bits | ^^^^ 3210 - bit position within byte
The result bits are located at the lowest byte of the higher 32-bits word, so just single shift & bit-and are required to isolate them.
Fact: none of intermediate values tmp1..4 has set bit on the same position. Thanks to that a single multiplication can be used to perform the all shifts and bit-or — while adding intermediate values never appear a carry bit, that would "disturb" result.
The constant for multiplication is derived directly from the shift amounts:
// 0x02040810 uint32_t movmask = (1u << 4) | (1u << (8 + 3)) | (1u << (16 + 2)) | (1u << (24 + 1));
C function for 32-bit numbers:
#include <stdint.h> uint32_t movmask(const uint32_t input) { assert((input & 0x7f7f7f7f) == 0); const uint32_t mult = 0x02040810; const uint64_t result = (uint64_t)input * mult; return (result >> 32) & 0xff; }
GCC generated the best possible code:
Disassembly of section .text:
00000000 : 0: b8 10 08 04 02 mov $0x2040810,%eax 5: f7 64 24 04 mull 0x4(%esp) 9: 0f b6 c2 movzbl %dl,%eax c: c3 ret
C function for 64-bit numbers (the type __int128 is the GCC extension):
uint64_t movmask64_unsafe(const uint64_t input) { assert((input & 0x7f7f7f7f7f7f7f7flu) == 0); const uint64_t mult = (1lu << (1*8 - 0)) | (1lu << (2*8 - 1)) | (1lu << (3*8 - 2)) | (1lu << (4*8 - 3)) | (1lu << (5*8 - 4)) | (1lu << (6*8 - 5)) | (1lu << (7*8 - 6)) | (1lu << (8*8 - 7)) ; const unsigned __int128 result = (unsigned __int128)mult * (unsigned __int128)input; return (result >> 64) & 0xff; }
And disassembly, GCC generated also the shortest possible code:
48 bf 00 81 40 20 10 08 04 02 movabs $0x204081020408100,%rdi 48 f7 e7 mul %rdi 0f b6 c2 movzbl %dl,%eax
Full source code is available, including the proof written in Python.