# Implementing byte-wise lookup table with PSHUFB

Author: Wojciech Muła 2016-03-13

Contents

# Introduction

In articles about base64 encoding and decoding I've showed how to implement SIMD version of a lookup table using basic vector instruction. This text describes another technique which employs my favourite instruction pshufb.

The task is defined as follows:

• input range (0..255) is split into several subranges;
• for each subrange a predefined value is assigned.

Now, for an input value a proper subrange is determined, and the value associated to the subrange is returned. Of course, everything is done in parallel.

# Algorithm

All subranges span the whole input range (0..255). Subranges are order and each subrange gets its index from 0 to n. Limitation: since the method uses pshufb, the maximum number of subranges is 16.

Algorithm finds an subrange's index for given value by comparing the value with the upper limit of each subrange and accumulating the comparison value. When the index is know, then it is transformed by pshufb into a value associated with a subrange.

Pseudocode:

```procedure byte_lookup(x: packed_byte)
begin
index := packed_byte(0)

for i := 0 to n:
// cmp[k] = 0xff or 0x00
cmp := packed_compare(x, subrange[i].upper)
index := packed_subtract(index, cmp)
end for

// index[k] = 0..n

LUT = packed_byte(const_0, const_1, const_1, ...);

return pshufb(LUT, index);
end
```

Number of operations:

• comparison: n,
• subtracts: n,
• zeroing: 1,
• pshufb: 1.

Please be aware that both constant vectors used in comparison and a LUT vector could be stored by a compiler in memory and fetched during execution.

# Real world example

Lookup in base64 decoder.

index input comment shift
0 00 .. 42   invalid
1 43 '+' 19
2 44 .. 46   invalid
3 47 '/' 16
4 48 .. 57 '0' - '9' 4
5 58 .. 64   invalid
6 65 .. 90 'A' - 'Z' -65
7 91 .. 96   invalid
8 97 .. 122 'a' - 'z' -71
9 122 .. ...   invalid

C++ code:

```__m128i index = packed_byte(0);

index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(42)));
index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(43)));
index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(46)));
index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(47)));
index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(57)));
index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(64)));
index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(90)));
index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(96)));
index = _mm_sub_epi8(index, _mm_cmpgt_epi8(input, packed_byte(122)));

const char invalid  = 0;
const __m128i LUT = _mm_setr_epi8(
/* 0 */ invalid,
/* 1 */ char(19),
/* 2 */ invalid,
/* 3 */ char(16),
/* 4 */ char(4),
/* 5 */ invalid,
/* 6 */ char(-65),
/* 7 */ invalid,
/* 8 */ char(-71),
/* 9 */ invalid,
// the rest is undefined, so also marked as invalid
invalid, invalid, invalid, invalid, invalid, invalid
);

const __m128i shift = _mm_shuffle_epi8(LUT, index);
```