SSE/AVX2: Generating mask where n leading (trailing) bytes are set

Author: Wojciech Muła
Added on: 2015-03-21

Introduction

Informal specification:

__m128i mask_lower(const unsigned n) {

    assert(n < 16);

    /*
        __m128i result = 0;
        for (unsigned int i=0; i < n; i++) {
            result.byte[i] = 0xff;
        }
    */

    switch (n) {
        case 0: return {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
        case 1: return {0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
        case 2: return {0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
        // ...
        case 14: return {0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00, 0x00};
        case 15: return {0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00};
    }
}

__m128i mask_higher(const unsigned n) {

    assert(n < 16);

    // complement of the lower mask with 16 - n bytes set;
    // for n == 0 the result is simply the all-zero vector
    return (n == 0) ? mask_lower(0) : ~mask_lower(16 - n);
}

Big lookup table

The obvious approach is to use two lookup tables with 16 entries each; then mask_lower (and likewise mask_higher) is a single movdqa load from the entry indexed by n. The downside is the size of the tables: 2*16*16 bytes = 512 bytes. For AVX/AVX2 the tables are four times larger (2 KiB).
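
A minimal sketch of this variant (the helper init_lookups, the table names and the startup initialization are my additions, not part of the original note):

#include <stdint.h>
#include <string.h>
#include <emmintrin.h>   // SSE2: _mm_load_si128

// two 16-entry tables, 16 bytes per entry: 2*16*16 bytes = 512 bytes;
// __m128i storage guarantees 16-byte alignment, so each lookup is
// a single aligned load (movdqa)
static __m128i lower_lookup[16];
static __m128i higher_lookup[16];

static void init_lookups(void) {
    for (int n = 0; n < 16; n++) {
        uint8_t lo[16], hi[16];

        memset(lo, 0x00, 16);
        memset(lo, 0xff, n);            // n lowest bytes set
        memset(hi, 0x00, 16);
        memset(hi + 16 - n, 0xff, n);   // n highest bytes set

        memcpy(&lower_lookup[n],  lo, 16);
        memcpy(&higher_lookup[n], hi, 16);
    }
}

__m128i mask_lower(const unsigned n) {
    return _mm_load_si128(&lower_lookup[n]);
}

__m128i mask_higher(const unsigned n) {
    return _mm_load_si128(&higher_lookup[n]);
}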

Small lookup table

This idea comes from my friend Piotr Wyderski, and I have his permission to publish the trick. It's pretty neat. :)

We need just 48 bytes of contiguous memory, i.e. three SSE vectors. The memory is filled with the following pattern:

lookup[ 0..15] = 0xff;
lookup[16..31] = 0x00;
lookup[32..47] = 0xff;

Then mask_lower(n) is reduced to a single unaligned load (movdqu) from the address lookup + 16 - n; likewise mask_higher(n) loads a vector from lookup + 16 + n. It can't be done better.
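
A minimal sketch of the trick, assuming the offsets 16 - n and 16 + n derived from the pattern above (the name lookup follows the note, the rest is my arrangement):

#include <stdint.h>
#include <emmintrin.h>   // SSE2: _mm_loadu_si128

// 48 bytes of contiguous memory: 16 x 0xff, 16 x 0x00, 16 x 0xff
static const uint8_t lookup[48] = {
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
};

__m128i mask_lower(const unsigned n) {
    // loads bytes lookup[16 - n .. 31 - n]: n times 0xff, then zeros
    return _mm_loadu_si128((const __m128i*)(lookup + 16 - n));
}

__m128i mask_higher(const unsigned n) {
    // loads bytes lookup[16 + n .. 31 + n]: zeros, then n times 0xff
    return _mm_loadu_si128((const __m128i*)(lookup + 16 + n));
}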

No lookups

This solution uses the palignr instruction. The instruction takes three arguments: two vectors and an immediate index; it joins the vectors into one 32-byte vector and returns the 16-byte subvector starting at the index.

See the pseudocode:

// A, B  - 16-element byte vectors
// tmp   - 32-element byte vector
// index - immediate value 0..15
__m128i palignr(__m128i A, __m128i B, unsigned index) {

    tmp[0 .. 15]  := A[0 .. 15];
    tmp[16 .. 31] := B[0 .. 15];

    return tmp[index .. index + 15];
}

When the vector A is all ones (all zeros) and B is all zeros (all ones), palignr yields mask_lower (mask_higher). Filling a register with ones or zeros is fast: pcmpeqb reg, reg and pxor reg, reg do the job.

Unfortunately palignr doesn't accept a variable index, only an immediate, so each value of n requires a different instruction, selected for example with a switch. This costs two jumps, which is quite expensive. But if we need to conserve memory, or the operation isn't on a hot path, this solution is fine; a sketch follows below.
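
A sketch of this variant built on the _mm_alignr_epi8 intrinsic (note that in the real instruction the first operand forms the upper half of the 32-byte temporary, the reverse of the pseudocode convention above); the switch layout and the handling of out-of-range n are my choices:

#include <tmmintrin.h>   // SSSE3: _mm_alignr_epi8

__m128i mask_lower(const unsigned n) {
    const __m128i zeros = _mm_setzero_si128();            // pxor    reg, reg
    const __m128i ones  = _mm_cmpeq_epi8(zeros, zeros);   // pcmpeqb reg, reg

    // _mm_alignr_epi8(zeros, ones, 16 - n) selects bytes 16-n .. 31-n of
    // the concatenation ones:zeros, i.e. n bytes of 0xff followed by zeros
    switch (n) {
        case  0: return zeros;
        case  1: return _mm_alignr_epi8(zeros, ones, 15);
        case  2: return _mm_alignr_epi8(zeros, ones, 14);
        case  3: return _mm_alignr_epi8(zeros, ones, 13);
        case  4: return _mm_alignr_epi8(zeros, ones, 12);
        case  5: return _mm_alignr_epi8(zeros, ones, 11);
        case  6: return _mm_alignr_epi8(zeros, ones, 10);
        case  7: return _mm_alignr_epi8(zeros, ones,  9);
        case  8: return _mm_alignr_epi8(zeros, ones,  8);
        case  9: return _mm_alignr_epi8(zeros, ones,  7);
        case 10: return _mm_alignr_epi8(zeros, ones,  6);
        case 11: return _mm_alignr_epi8(zeros, ones,  5);
        case 12: return _mm_alignr_epi8(zeros, ones,  4);
        case 13: return _mm_alignr_epi8(zeros, ones,  3);
        case 14: return _mm_alignr_epi8(zeros, ones,  2);
        case 15: return _mm_alignr_epi8(zeros, ones,  1);
        default: return ones;   // out-of-range n treated as a full mask (my choice)
    }
}

mask_higher is analogous: swap the roles of ones and zeros and use the index n instead of 16 - n, e.g. _mm_alignr_epi8(ones, zeros, 3) yields the mask with the three highest bytes set.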