Transpose bits in byte using SIMD instructions

Author:	Wojciech Muła
Added on:	2010-03-31

Method presented here allows to get any bit permutation, transposition is just one of possible operations. Lookup-based approach would be faster, but algorithm is worth to (re)show.

Algorithm outline for 8-byte vector (with SSE instruction it is possible to get 2 operations in parallel):

fill vector with given byte:

[11010001|11010001|11010001|11010001|11010001|11010001|11010001|11010001]
    ▲        ▲        ▲        ▲        ▲        ▲        ▲        ▲
    │        │        │        │        │        │        │        │
[11010001] ╶─┴────────┴────────┴────────┴────────┴────────┴────────┘

leave one bit per byte:

[10000000|01000000|00000000|00010000|00000000|00000000|00000000|00000001]

perform desired transposition ("move" bits around):

[00000001|00000010|00000000|00001000|00000000|00000000|00000000|10000000]

perform horizontal OR of all bytes:
```
[10001011]
```

Here is my old MMX code (polish text); below SSE/SSE5 implementation details.

Ad 1. Series of punpcklbw/punpcklwb/shufps or pshufb if CPU supports SSSE3.

# 1.1
movd       %eax, %xmm0
shufps    $0x00, %xmm0, %xmm0
punpcklbw %xmm0, %xmm0
punpcklwd %xmm0, %xmm0

# 1.2
pxor      %xmm1, %xmm1
movd       %eax, %xmm0
pshufb    %xmm1, %xmm0

Ad 2. Simple pand with mask packed_qword(0x8040201008040201).

pand  MASK1, %xmm0

Ad 3. If plain SSE instructions are supported this step requires some work. First, each bit is populated to fill the whole byte (using pcmpeq — we get negated result), then mask bits on desired positons.

SSE5 has powerful instruction protb that can do perform rotation of each byte with independent amount — so in this case just one instruction is needed.

# SSE
pcmpeq  %xmm1, %xmm0
pandn   MASK2, %xmm0    # pandn - to negate

# SSE5
protb    ROT, %xmm0, %xmm0

Ad 4. Since bits are placed on distinct positions, we can use instruction psadbw, that calculate horizontal sums of bytewide differences from two registers (separately for low and high registers halves). If one register is full of zeros, we get sum of bytes from other register.

psadbw  %xmm1, %xmm0
movd    %xmm0, %eax

Depending on instruction set, three (SSE) or two (SSE5) additional tables are needed.