SIMDized check which bytes are in a set

Author:Wojciech Muła
Added on:2018-10-18
Updated:2019-03-27 (mistake pointed by Marcus Klaas) 2018-10-28 (add Geoff's algorithm, fixes/updates pointed by Michael)

Contents

Introduction

The problem is defined as follows: there's a stream of bytes and we want to get a byte-mask (or a bit-mask) that indicates which bytes are in the predefined set.

Thanks to SIMD instructions this task can be performed faster than scalar code. Jobs like input validation or parsing (for instance CSV files), might benefit from a vectorized approach.

In this text I show several SIMD methods:

SIMD algorithms

The main ingredient of the techniques shown below is instruction pshufb (_mm_shuffle_epi8), which is present in SSE, AVX2 and also AVX512BW. The instruction does parallel byte lookup in a 16-byte register (or lane, in AVX2 and AVX512 variants) using 4-bit indices from another vector.

A C-like code shows the pshufb algorithm:

for (int i=0; i < VECTOR_SIZE; i++) {
    uint8_t index = indices_vector[i];
    if (index & 0x80)
        result[i] = 0x00;
    else
        result[i] = lookup_vector[index & 0x0f];
}

Universal algorithm

In this algorithm the set is represented as a bitmap of size 16 x 16 bits, where a bit of value 1 indicates that given element is in the set. The bitmap is addressed with nibbles, i.e. 4-bit halves of a byte.

The lower nibble of each input byte selects the bitmap's row, i.e. a 16-bit value. The higher nibble selects the column on the bitmap — once we fetched a row row, the higher nibble points at specific bit.

Scalar code which deals with such representation is quite simple:

bool in_set(uint16_t bitmap[16], uint8_t byte) {

    const uint8_t lo_nibble = byte & 0xf;
    const uint8_t hi_nibble = byte >> 4;

    const uint16_t bitset  = bitmap[lo_nibble];
    const uint16_t bitmask = uint16_t(1) << hi_nibble;

    return (bitset & bitmask) != 0;
}

Example

Let's consider following set; for better visibility zeroes are shown with dots, and ones with the 'x':

lo / hi nibble
  +--------------------------------
  | 0 1 2 3 4 5 6 7 8 9 a b c d e f
--+--------------------------------
0 | x x . . . . x . . . x . . x . .
1 | x x x x . x x . . . . . x x . x
2 | . x . . x . x . . . x . . x . .
3 | . x x . . . . x . . x . x . x .
4 | . . . . . . . . . . . . x x x x
5 | x x . . x . x x x . x . . . x x
6 | x . . . . x . x . . x . x . . .
7 | . . x . . . . . . . . x . . x .
8 | . . x x . . . . . . . . . . . x
9 | . . x x x . . x . . x . . . . .
a | . . . . . . x . . . x . . . . x
b | . . . x . . x . . . . . . . . .
c | x . . . x . . . . . . . . . x x
d | . . . x x x . x . . x x . . . .
e | x . x . . . . x . x . x . . . .
f | x x . . . . x . . . . . x x x .

We want to check if byte 0xa5 is in the set. Its lower nibble is 5, so we select the highlighted row, getting 16-bit value 1100_0101_1101_0011. The higher nibble is 10 (0xa), thus we're testing this bit:

1100_0101_1101_0011
      ^

It's set and that means 0xa5 belongs to the set.

Implementation

Since the instruction pshufb maps bytes into bytes, it necessary to store two halves of a bitmap, i.e. 16 x 8 bit arrays. These sub-bitmaps are highlighted with blue and magenta below.

lo / hi nibble
  +--------------------------------
  | 0 1 2 3 4 5 6 7 8 9 a b c d e f
--+--------------------------------
0 | x x . . . . x . . . x . . x . .
1 | x x x x . x x . . . . . x x . x
2 | . x . . x . x . . . x . . x . .
3 | . x x . . . . x . . x . x . x .
4 | . . . . . . . . . . . . x x x x
5 | x x . . x . x x x . x . . . x x
6 | x . . . . x . x . . x . x . . .
7 | . . x . . . . . . . . x . . x .
8 | . . x x . . . . . . . . . . . x
9 | . . x x x . . x . . x . . . . .
a | . . . . . . x . . . x . . . . x
b | . . . x . . x . . . . . . . . .
c | x . . . x . . . . . . . . . x x
d | . . . x x x . x . . x x . . . .
e | x . x . . . . x . x . x . . . .
f | x x . . . . x . . . . . x x x .

Fetching a row from the bitmap requires two lookups: to get "blue" and "magenta" parts of the row. Then, depending on higher nibble, we test a bit in "blue" (for higher nibble in range 0..7) or "magenta" part (for higher nibble in range 8..15).

A note on calculating bitmask. In the scalar code it's just 1 << hi_nibble, here we must evaluate (1 << higher_nibble) % 8. Fortunately, this is possible with single pshufb invocation.

Below is a scalar code that show the algorithm:

/*   */ const uint8_t lo_nibble  = input & 0x0f;
/*   */ const uint8_t hi_nibble = input >> 4;

/* 1 */ const uint8_t bitset_0_7  = bitmap_0_7[lo_nibble];
/* 2 */ const uint8_t bitset_8_15 = bitmap_8_15[lo_nibble];

/* 3 */ const uint8_t bitmask = (1 << (hi_nibble & 0x7))

        uint8_t bitset;
/* 4 */ if (hi_nibble < 8)
            bitset = row_0_7;
        else
            bitset = row_8_15;

/* 5 */ return (bitset & bitmask) != 0;

Instructions 1, 2 and 3 are pshufb. The fourth expression might be expressed by vector blend instruction (x[i] ? t[i] : f[i]). The last, 5th instruction, is simple bit-and followed by conversion into a bytemask. Following code shows a detailed SSE implementation of the above steps.

  1. Load 16 input bytes; hex digits are shown.

    // example values in set: {0x10, 0x21, 0xbd}
    //            not in set: {0x36, 0x91, 0xed}
    
    // input          = [36|10|91|21|10|ed|ed|21|36|bd|36|21|91|91|ed|10]
    //                      ^^    ^^ ^^       ^^    ^^    ^^          ^^
    const __m128i input = _mm_loadu_si128(ptr);
    
  2. Extract lower nibbles. Simple bit-and is needed.

    // lower_nibbles  = [06|00|01|01|00|0d|0d|01|06|0d|06|01|01|01|0d|00]
    const __m128i lower_nibbles = _mm_and_si128(input, _mm_set1_epi8(0x0f));
    
  3. Extract higher nibbles, this requires also a bit shift.

    // higher_nibbles = [03|01|09|02|01|0e|0e|02|03|0b|03|02|09|09|0e|01]
    const __m128i higher_nibbles = _mm_and_si128(_mm_srli_epi16(input, 4), _mm_set1_epi8(0x0f));
    
  4. Pick rows for lower nibble (0..7).

    static const __m128i bitmap_0_7 = _mm_setr_epi8(
        /* 0 */ 0x43, /* 01000011 */
        /* 1 */ 0x6f, /* 01101111 */
        /* 2 */ 0x52, /* 01010010 */
        /* 3 */ 0x86, /* 10000110 */
        /* 4 */ 0x00, /* 00000000 */
        /* 5 */ 0xd3, /* 11010011 */
        /* 6 */ 0xa1, /* 10100001 */
        /* 7 */ 0x04, /* 00000100 */
        /* 8 */ 0x0c, /* 00001100 */
        /* 9 */ 0x9c, /* 10011100 */
        /* a */ 0x40, /* 01000000 */
        /* b */ 0x48, /* 01001000 */
        /* c */ 0x11, /* 00010001 */
        /* d */ 0xb8, /* 10111000 */
        /* e */ 0x85, /* 10000101 */
        /* f */ 0x43  /* 01000011 */
    );
    
    // row_0_7        = [a1|43|6f|6f|43|b8|b8|6f|a1|b8|a1|6f|6f|6f|b8|43]
    const __m128i row_0_7 = _mm_shuffle_epi8(bitmap_0_7, lower_nibbles);
    
  5. Likewise pick rows for higher nibble (8..15).

    static const __m128i bitmap_8_15 = _mm_setr_epi8(
        /* 0 */ 0x24, /* 00100100 */
        /* 1 */ 0xb0, /* 10110000 */
        /* 2 */ 0x24, /* 00100100 */
        /* 3 */ 0x54, /* 01010100 */
        /* 4 */ 0xf0, /* 11110000 */
        /* 5 */ 0xc5, /* 11000101 */
        /* 6 */ 0x14, /* 00010100 */
        /* 7 */ 0x48, /* 01001000 */
        /* 8 */ 0x80, /* 10000000 */
        /* 9 */ 0x04, /* 00000100 */
        /* a */ 0x84, /* 10000100 */
        /* b */ 0x00, /* 00000000 */
        /* c */ 0xc0, /* 11000000 */
        /* d */ 0x0c, /* 00001100 */
        /* e */ 0x0a, /* 00001010 */
        /* f */ 0x70  /* 01110000 */
    );
    
    // row_8_15       = [14|24|b0|b0|24|0c|0c|b0|14|0c|14|b0|b0|b0|0c|24]
    const __m128i row_8_15 = _mm_shuffle_epi8(bitmap_8_15, lower_nibbles);
    
  6. Calculate a bitmask, i.e. (1 << hi_nibble % 8).

    static const __m128i bitmask_lookup = _mm_setr_epi8(
            1, 2, 4, 8, 16, 32, 64, -128,
            1, 2, 4, 8, 16, 32, 64, -128);
    
    // bitmasks       = [08|02|02|02|02|40|40|02|08|08|08|04|02|02|40|02]
    const __m128i bitmask = _mm_shuffle_epi8(bitmask_lookup, higher_nibbles);
    
  7. Choose rows halves depending on higher nibbles.

    // mask           = [ff|ff|00|ff|ff|00|00|ff|ff|00|ff|ff|00|00|00|ff]
    const __m128i mask    = _mm_cmplt_epi8(higher_nibbles, _mm_set1_epi8(8));
    
    // bitsets        = [ff|ff|00|ff|ff|00|00|ff|ff|00|ff|ff|00|00|00|ff]
    //                ? [a1|43|..|6f|43|..|..|6f|a1|..|a1|6f|..|..|..|43]
    //                : [..|..|b0|..|..|0c|0c|..|..|0c|..|..|b0|b0|0c|..]
    //
    //                = [a1|43|b0|6f|43|0c|0c|6f|a1|0c|a1|6f|b0|b0|0c|43]
    
    const __m128i bitsets = _mm_blendv_epi8(row_0_7, row_8_15, mask);
    
  8. Finally check which bytes belong to the set.

    // tmp            = [a1|43|b0|6f|43|0c|0c|6f|a1|0c|a1|6f|b0|b0|0c|43]
    //                & [08|02|02|02|02|40|40|02|08|08|08|04|02|02|40|02]
    
    //                = [00|02|00|02|02|00|00|02|00|08|00|04|00|00|00|02]
    //                      ^^    ^^ ^^       ^^    ^^    ^^          ^^
    const __m128i tmp    = _mm_and_si128(bitsets, bitmask);
    
    // result         = [00|ff|00|ff|ff|00|00|ff|00|ff|00|ff|00|00|00|ff]
    const __m128i result = _mm_cmpeq_epi8(t0, bitmask);
    

Number of instructions:

  • 3 x bit-and;
  • 1 x shift;
  • 3 x shuffle bytes;
  • 2 x comparison;
  • 1 x blend.

Alternative implementation new

Geoff Langdale pointed out that we might get rid of the blend instruction, which is not the fastest one. Instead, we use pshufb, that zeros destination bytes when the highest bit of index is 1.

This requires different indices when fetching row_0_7 and row_8_15. The indices for bitmap_0_7 keep bits 0..3 and also the most significant, i.e. 7th bit of the input. Similarly the indices for bitmap_8_15 have also bits 0..3, but the bit 7th is negated.

When the most significant bit is set, it means that the higher nibble is greater or equal 8 — in such situation bitset_0_7 is zeroed and bitset_8_15 kept.

The modified algorithm:

  1. Load 16 input bytes.

    const __m128i input = _mm_loadu_si128(ptr);
    
  2. Extract indices for row_0_7.

    const __m128i indices_0_7  = _mm_and_si128(input, _mm_set1_epi8(0x8f)); // 0b1000_1111
    
  3. Extract indices for row_8_15.

    const __m128i msb          = _mm_and_si128(input, _mm_set1_epi(0x80));
    const __m128i indices_8_15 = _mm_xor_si128(indices_0_7, msb);
    
  4. Fetch row_0_7 and row_8_15:

    const __m128i row_0_7  = _mm_shuffle_epi8(bitmap_0_7, indices_0_7);
    const __m128i row_8_15 = _mm_shuffle_epi8(bitmap_8_15, indices_8_15);
    
  5. Calculate a bitmask, i.e. (1 << hi_nibble % 8).

    const __m128i bitmask = _mm_shuffle_epi8(bitmask_lookup, higher_nibbles);
    
  6. Choose rows halves depending on higher nibbles.

    const __m128i bitsets = _mm_or_si128(row_0_7, row_8_15);
    
  7. Finally check which bytes belong to the set.

    const __m128i tmp    = _mm_and_si128(bitsets, bitmask);
    const __m128i result = _mm_cmpeq_epi8(t0, bitmask);
    

Number of instructions:

  • 3 x bit-and,
  • 1 x bit-or,
  • 1 x bit-xor,
  • 3 x shuffle,
  • 1 x compare.

Special case 1 — small sets

This method handles up to eight distinct elements.

Let's consider this sample set: {0x01, 0x31, 0xc1, 0x35, 0x65, 0x77, 0x8b, 0x3e}; elements are labelled with letters K .. R.

lo / hi nibble
  +--------------------------------
  | 0 1 2 3 4 5 6 7 8 9 a b c d e f
--+--------------------------------
0 | . . . . . . . . . . . . . . . .
1 | K . . L . . . . . . . . M . . .
2 | . . . . . . . . . . . . . . . .
3 | . . . . . . . . . . . . . . . .
4 | . . . . . . . . . . . . . . . .
5 | . . . N . . O . . . . . . . . .
6 | . . . . . . . . . . . . . . . .
7 | . . . . . . . P . . . . . . . .
8 | . . . . . . . . . . . . . . . .
9 | . . . . . . . . . . . . . . . .
a | . . . . . . . . . . . . . . . .
b | . . . . . . . . Q . . . . . . .
c | . . . . . . . . . . . . . . . .
d | . . . . . . . . . . . . . . . .
e | . . . R . . . . . . . . . . . .
f | . . . . . . . . . . . . . . . .

We separately consider lower and higher nibble, i.e. rows and columns in the diagram above. We associate a subset with each nibble, for lower nibbles there are following subsets:

For higher nibbles we have:

These subsets are encoded using bit sets, i.e. each element has assigned a bit in a byte (this is the source of size limit). For the example data we might have following mapping:

const uint8_t empty = 0x00;
const uint8_t K = (1 << 0); // 0x01
const uint8_t L = (1 << 1); // 0x02
const uint8_t M = (1 << 2); // 0x04
const uint8_t N = (1 << 3); // 0x08
const uint8_t O = (1 << 4); // 0x10
const uint8_t P = (1 << 5); // 0x20
const uint8_t Q = (1 << 6); // 0x40
const uint8_t R = (1 << 7); // 0x80

And then lookup tables are:

const uint8_t lo_nibbles_lookup[16] = {
    /* 0 */ 0x00,
    /* 1 */ 0x07, // K | L | M
    /* 2 */ 0x00,
    /* 3 */ 0x00,
    /* 4 */ 0x00,
    /* 5 */ 0x18, // N | O
    /* 6 */ 0x00,
    /* 7 */ 0x20, // P
    /* 8 */ 0x00,
    /* 9 */ 0x00,
    /* a */ 0x00,
    /* b */ 0x40, // Q
    /* c */ 0x00,
    /* d */ 0x00,
    /* e */ 0x80, // R
    /* f */ 0x00
};

const uint8_t hi_nibbles_lookup[16] = {
    /* 0 */ 0x01, // K
    /* 1 */ 0x00,
    /* 2 */ 0x00,
    /* 3 */ 0x8a, // L | N | R
    /* 4 */ 0x00,
    /* 5 */ 0x00,
    /* 6 */ 0x10, // O
    /* 7 */ 0x20, // P
    /* 8 */ 0x40, // Q
    /* 9 */ 0x00,
    /* a */ 0x00,
    /* b */ 0x00,
    /* c */ 0x04, // M
    /* d */ 0x00,
    /* e */ 0x00,
    /* f */ 0x00
};

Algorithm consist following steps (SSE code is shown in example):

  1. Load 16 input bytes.

    // input          = [11|31|11|35|8b|ff|ee|77|11|c1|11|8b|11|11|ff|01]
    //                      ^^    ^^ ^^       ^^    ^^    ^^          ^^
    const __m128i input = _mm_loadu_si128(ptr);
    
  2. Extract lower nibbles.

    // lower_nibbles  = [01|01|01|05|0b|0f|0e|07|01|01|01|0b|01|01|0f|01]
    const __m128i lower_nibbles = _mm_and_si128(input, _mm_set1_epi8(0x0f));
    
  3. Extract higher nibbles.

    // higher_nibbles = [01|03|01|03|08|0f|0e|07|01|0c|01|08|01|01|0f|00]
    const __m128i higher_nibbles = _mm_and_si128(_mm_srli_epi16(input, 4), _mm_set1_epi8(0x0f));
    
  4. Translate the lower and higher nibbles (the lookup tables are defined above).

    // lo_translated  = [07|07|07|18|40|00|80|20|07|07|07|40|07|07|00|07]
    const __m128i lo_translated = _mm_shuffle_epi8(
                                    _mm_load_si128(lo_nibbles_lookup), lower_nibbles);
    
    // hi_translated  = [00|8a|00|8a|40|00|00|20|00|04|00|40|00|00|00|01]
    const __m128i hi_translated = _mm_shuffle_epi8(
                                    _mm_load_si128(hi_nibbles_lookup), higher_nibbles);
    
  5. Test bitsets for intersection.

    // lo_translated  = [07|07|07|18|40|00|80|20|07|07|07|40|07|07|00|07]
    // hi_translated  = [00|8a|00|8a|40|00|00|20|00|04|00|40|00|00|00|01]
    // intersection   = [00|02|00|08|40|00|00|20|00|04|00|40|00|00|00|01]
    //                      ^^    ^^ ^^       ^^    ^^    ^^          ^^
    const __m128i intersection = _mm_and_si128(lo_translated, hi_translated);
    
  6. Fix up if necessary. Convert all non-zero elements into 0xff.

    // t0             = [ff|00|ff|00|00|ff|ff|00|ff|00|ff|00|ff|ff|ff|00]
    const __m128i t0 = _mm_cmpeq_epi8(intersection, _mm_setzero_si128());
    const __m128i t1 = _mm_andnot_si128(t0, _mm_setzero_si128());
    

Number of instructions:

Special case 2 — constant nibble

When either higher or lower nibbles of set values are constant, a single invocation of pshufb is sufficient to classify up to 16-element set.

The variable nibble addresses a lookup vector which contains elements from the set; nibbles not present in the set are marked with some other value. The translated vector is finally compared for equal with the input vector, yielding zeros for bytes that are not in the set.

Let's consider set {0x10, 0x12, 0x14, 0x15, 0x17, 0x18, 0x1a, 0x1f}.

lo / hi nibble
  +--------------------------------
  | 0 1 2 3 4 5 6 7 8 9 a b c d e f
--+--------------------------------
0 | . x . . . . . . . . . . . . . .
1 | . . . . . . . . . . . . . . . .
2 | . . . . . . . . . . . . . . . .
3 | . . . . . . . . . . . . . . . .
4 | . x . . . . . . . . . . . . . .
5 | . x . . . . . . . . . . . . . .
6 | . . . . . . . . . . . . . . . .
7 | . x . . . . . . . . . . . . . .
8 | . x . . . . . . . . . . . . . .
9 | . . . . . . . . . . . . . . . .
a | . x . . . . . . . . . . . . . .
b | . . . . . . . . . . . . . . . .
c | . . . . . . . . . . . . . . . .
d | . . . . . . . . . . . . . . . .
e | . . . . . . . . . . . . . . . .
f | . x . . . . . . . . . . . . . .

The higher nibble is constant, equal to 0x10; we will map lower, variable nibbles as shown below.

const uint8_t lookup[16] = {
    /* 0 */ 0x10,
    /* 1 */ 0x00, // not in the set
    /* 2 */ 0x12,
    /* 3 */ 0x00, // not in the set
    /* 4 */ 0x14,
    /* 5 */ 0x15,
    /* 6 */ 0x00, // not in the set
    /* 7 */ 0x17,
    /* 8 */ 0x18,
    /* 9 */ 0x00, // not in the set
    /* a */ 0x1a,
    /* b */ 0x00, // not in the set
    /* c */ 0x00, // not in the set
    /* d */ 0x00, // not in the set
    /* e */ 0x00, // not in the set
    /* f */ 0x1f,
};

Algorithm consist following steps:

  1. Load 16 input bytes.

    // input = [21|12|13|15|14|fa|ca|17|55|aa|2a|1a|3a|ff|af|1f]
    //             ^^    ^^ ^^       ^^          ^^          ^^
    const __m128i input = _mm_loadu_si128(ptr);
    
  2. Extract the variable nibble. If it's lower one, then simple bit-and is needed; for higher nibble also a shift right is required.

    // lower_nibbles  = [01|02|03|05|04|0a|0a|07|05|0a|0a|0a|0a|0f|0f|0f]
    const __m128i lower_nibbles = _mm_and_si128(input, _mm_set1_epi8(0x0f));
    
    // higher_nibbles = [02|01|01|01|01|0f|0c|01|05|0a|02|01|03|0f|0a|01]
    const __m128i higher_nibbles = _mm_and_si128(_mm_srli_epi16(input, 4), _mm_set1_epi8(0x0f));
    
    // in our example we use lower nibbles
    // nibbles        = [01|02|03|05|04|0a|0a|07|05|0a|0a|0a|0a|0f|0f|0f]
    const __m128i nibbles = lower_nibbles;
    
  3. Translate the nibbles using lookup table.

    // translated     = [10|12|10|15|14|1a|1a|17|15|1a|1a|1a|1a|1f|1f|1f]
    const __m128i translated = _mm_shuffle_epi8(_mm_load_si128(lookup), nibbles)
    
  4. Compare the translated vector with the input one.

    // input          = [21|12|13|15|14|fa|ca|17|55|aa|2a|1a|3a|ff|af|1f]
    //                      ^^    ^^ ^^       ^^          ^^          ^^
    // translated     = [10|12|10|15|14|1a|1a|17|15|1a|1a|1a|1a|1f|1f|1f]
    // eq             = [00|ff|00|ff|ff|00|00|ff|00|00|00|ff|00|00|00|ff]
    const __m128i eq = _mm_cmpeq_epi8(input, translated);
    

Number of instructions:

Special case 3 — unique lower and higher nibbles

When none of lower and higher nibbles repeats then just two invocations of pshufb are needed to classify a vector. In this case up to 16-elements sets are supported.

Let's consider following 11-element set {0x20, 0x31, 0x42, 0x53, 0x64, 0x75, 0x86, 0x97, 0xa8, 0xb9, 0xca}. Lower nibbles are {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a}, higher nibbles are {2, 3, 4, 5, 6, 7, 8, 9, a, b, c} — as we see, the both sets contain unique values.

lo / hi nibble
  +--------------------------------
  | 0 1 2 3 4 5 6 7 8 9 a b c d e f
--+--------------------------------
0 | . . K . . . . . . . . . . . . .
1 | . . . L . . . . . . . . . . . .
2 | . . . . M . . . . . . . . . . .
3 | . . . . . N . . . . . . . . . .
4 | . . . . . . O . . . . . . . . .
5 | . . . . . . . P . . . . . . . .
6 | . . . . . . . . Q . . . . . . .
7 | . . . . . . . . . R . . . . . .
8 | . . . . . . . . . . S . . . . .
9 | . . . . . . . . . . . T . . . .
a | . . . . . . . . . . . . U . . .
b | . . . . . . . . . . . . . . . .
c | . . . . . . . . . . . . . . . .
d | . . . . . . . . . . . . . . . .
e | . . . . . . . . . . . . . . . .
f | . . . . . . . . . . . . . . . .

The idea is to label each set element with unique index and then map lower and higher nibbles to that index. If the indices are equal, it means that both nibbles form set's element. Mapping for nibbles outside the valid set must yield different values, so comparison is always false.

For instance, we might have following mapping:

const uint8_t K = 0;
const uint8_t L = 1;
const uint8_t M = 2;
const uint8_t N = 3;
const uint8_t O = 4;
const uint8_t P = 5;
const uint8_t Q = 6;
const uint8_t R = 7;
const uint8_t S = 8;
const uint8_t T = 9;
const uint8_t U = 10;

const uint8_t lo_outside = 0x10;
const uint8_t hi_outside = 0x20; // lo_outside != hi_outside

Algorithm:

  1. Load 16 bytes.

    // in   = [20|21|ca|cb|aa|a8|86|42|43|12|44|75|86|8f|fa|97]
    //         ^^    ^^       ^^ ^^ ^^          ^^ ^^       ^^
    const __m128 in = _mm_loadu_si128(ptr);
    
  2. Extract lower nibbles.

    // lo   = [ 0| 1| a| b| a| 8| 6| 2| 3| 2| 4| 5| 6| f| a| 7]
    const __m128i lo = _mm_and_si128(in, _mm_set1_epi8(0x0f));
    
  3. Extract higher nibbles.

    // hi   = [ 2| 2| c| c| a| a| 8| 4| 4| 1| 4| 7| 8| 8| f| 9]
    __m128i hi = _mm_srli_epi16(in, 4);
    hi = _mm_and_si128(hi, _mm_set1_epi8(0x0f));
    
  4. Translate lower nibbles into indices.

    const __m128i lo_lookup = _mm_setr_epi8(
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0xa,
        0x20, 0x20, 0x20, 0x20, 0x20
    );
    
    // lo     = [ 0| 1| a| b| a| 8| 6| 2| 3| 2| 4| 5| 6| f| a| 7]
    // lo_idx = [ 0| 1| a|20| a| 8| 6| 2| 3| 2| 4| 5| 6|20| a| 7]
    const __m128i lo_idx = _mm_shuffle_epi8(lo, lo_lookup)
    
  5. Translate higher nibbles into indices.

    const __m128i hi_lookup = _mm_setr_epi8(
        0x10, 0x10,
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0xa,
        0x10, 0x10, 0x10
    );
    
    // hi     = [ 2| 2| c| c| a| a| 8| 4| 4| 1| 4| 7| 8| 8| f| 9]
    // hi_idx = [ 0| 0| a| a| 8| 8| 6| 2| 2|10| 2| 5| 6| 6|10| 7]
    const __m128i hi_idx = _mm_shuffle_epi8(hi, hi_lookup)
    
  6. Compare translated nibbles — if they match, it means that bytes also match.

    // in     = [20|21|ca|cb|aa|a8|86|42|43|12|44|75|86|8f|fa|97]
    //           ^^    ^^       ^^ ^^ ^^          ^^ ^^       ^^
    // lo_idx = [ 0| 1| a|20| a| 8| 6| 2| 3| 2| 4| 5| 6|20| a| 7]
    // hi_idx = [ 0| 0| a| a| 8| 8| 6| 2| 2|10| 2| 5| 6| 6|10| 7]
    // result = [ff|00|ff|00|00|ff|ff|ff|00|00|00|ff|ff|00|00|ff]
    const __m128i result = _mm_cmpeq_epi8(lo_idx, hi_idx)
    

Number of instructions:

Basic SIMD methods update

Just to make this text complete, we'll enumerate naive SIMD methods:

  1. Tiny sets might be classified by compare each element of set with the input vector and merge partial results using bit-or. For instance, if the set is {' ', 't', 'n'}, then following code matches it.

    const __m128i input = _mm_loadu_si128(data);
    const __m128i eq_32 = _mm_cmpeq_epi8(input, _mm_set1_epi8(' '));
    const __m128i eq_13 = _mm_cmpeq_epi8(input, _mm_set1_epi8('\n'));
    const __m128i eq_09 = _mm_cmpeq_epi8(input, _mm_set1_epi8('\t'));
    
    // any_eq = eq_32 | eq_13 | eq_09
    const __m128i any_eq = _mm_or_si128(eq_32, _mm_or_si128(eq_13, eq_03));
    
  2. If a set is described by ranges, like 0-9, A-Z and a-z, it's sufficient just to compare input vector with ranges' boundaries and bit-or partial result. Please be aware that both SSE and AVX2 do not support unsigned byte comparisons. Thus only when boundaries are non-negative, code shown below works correctly.

    const __m128i input = _mm_loadu_si128(data);
    
    // check range 0-9 [ASCII: 48, 57]
    const __m128i lt_48 = _mm_cmplt_epi8(input, '0');
    const __m128i lt_58 = _mm_cmplt_epi8(input, '9' + 1);
    const __m128i t0    = _mm_andnot_si128(lt_48, lt_58); // not (x < 48) and (x < 58) =
                                                          //     (x >= 48) and (x < 58)
    
    // check range A-Z [ASCII: 65, 90]
    const __m128i lt_65 = _mm_cmplt_epi8(input, 'A');
    const __m128i lt_91 = _mm_cmplt_epi8(input, 'Z' + 1);
    const __m128i t1    = _mm_andnot_si128(lt_65, lt_91);
    
    // check range a-z [ASCII: 97, 122]
    const __m128i lt_97  = _mm_cmplt_epi8(input, 'a');
    const __m128i lt_123 = _mm_cmplt_epi8(input, 'z' + 1);
    const __m128i t2     = _mm_andnot_si128(lt_97, lt_123);
    
    const __m128i result = _mm_or_si128(t0, _mm_or_si128(t1, t2));
    
  3. Range with holes. For instance, we want to match range A-Z except K. This can be expressed as sum of two ranges A-J and L-Z. Then we have: four comparisons, two bit-andnot operations, and one bit-or. However, it's easier to check the whole range and then check for excluded element(s). Then, there are three comparisons and two bit-andnot operations.

    // check range A-Z [ASCII: 65, 90]
    const __m128i lt_65 = _mm_cmplt_epi8(input, 'A');
    const __m128i lt_91 = _mm_cmplt_epi8(input, 'Z' + 1);
    const __m128i t0    = _mm_andnot_si128(lt_65, lt_91);
    
    // exclude K
    const __m128i eq_K  = _mm_cmpeq_epi8('K');
    const __m128i result = _mm_andnot_si128(eq_K, t0);
    

Acknowledgements

Thanks to Geoff Langdale and Michael Howard for their comments and fixes.

See also new

Source code

Source code is available on github.