SSE: trie lookup speedup

Author: Wojciech Muła 2017-02-14 (link to github, syntax highlighting), 2013-09-30

Introduction

Trie is a multiway tree where each edge is labelled by a letter; such trees are used as dictionaries. Lookup takes O(k) time, where k is a string length.

Trie node is defined in C:

```typedef struct TrieNode {
bool        eow;        // end of word marker
int         n;          // children count
char*       letter;     // list of edge labels
TrieNode**  children;   // pointers to children nodes
}
```

Lookup procedure is simple.

```bool trie_lookup(TrieNode* root, const char* word, const int size) {
TrieNode* node = root;

for (int i=0; i < size; i++) {
// go through edge labelled by i-th letter
node = trie_next(node, word[i]);
if (node == NULL) {
return false;
}
}

// we visited 'size' nodes, check if last
// visited node is located at end-of-word

return node ? node->eow : false;
}
```

Function trie_next returns a child node labelled with given letter.

```TrieNode trie_next(TrieNode node, char letter);
```

The implementation of this procedure determines overall searching and inserting time.

Analysis

Following algorithms used by trie_next has been studied:

1. linear search
2. linear search with move-to-front strategy
3. linear search with incremental move-to-front strategy
4. linear search with minimized memory reads
5. linear search optimized in SSE
6. binary search

Linear search with move-to-front strategy

Implementation of linear-mtf:

```TrieNode* trie_next(TrieNode* node, const char letter) {
for (size_t i = 0; i < node->n; i++) {
if (node->chars[i] == letter) {
if (i == 0) {
return node->next[0];
}

char      c0 = node->chars[0];
TrieNode* n0 = node->next[0];

node->chars[0] = letter;
node->next[0]  = node->next[i];

node->chars[i] = c0;
node->next[i]  = n0;

return node->next[0];
}
}

return NULL;
}
```

Thanks to MTF item located at index 0 is at the fast path of procedure, hit rate was about 50% in my tests.

This procedure does 3 additional reads 4 additional writes on hit item with index greater than 0. These writes gives some penalty, but are amorized by fast-path execution.

Linear search with incremental move-to-front strategy

Implementation of linear-mtf-incr:

```TrieNode* trie_next(TrieNode* node, const char letter) {
for (size_t i = 0; i < node->n; i++) {
if (node->chars[i] == letter) {
if (i == 0) {
return node->next[0];
}

const int prev = i-1;
char      c0 = node->chars[prev];
TrieNode* n0 = node->next[prev];

node->chars[prev] = letter;
node->next[prev]  = node->next[i];

node->chars[i] = c0;
node->next[i]  = n0;

return node->next[prev];
}
}

return NULL;
}
```

First variant of MTF allows to random fluctuating element located at the fast-path, this variant — by moving item one position backward --- keep at index 0 most frequently used items.

Linear search with minimized memory reads

Implementation of linear-unrolled:

```#define ROUND_UP(size) (4*(((size) + 3)/4))

TrieNode* trie_next(TrieNode* node, const char letter) {
const uint32_t n = node->n;
if (n == 0) {
return NULL;
}

const uint32_t packed_byte = (letter & 0xff) * 0x01010101;

for (size_t i = 0; i < n; i+=4) {
const uint32_t dword = *((uint32_t*)&node->chars[i]);
const uint32_t v = dword ^ packed_byte;

if ((v & 0x000000ff) == 0) {
return node->next[i + 0];
}
if ((v & 0x0000ff00) == 0) {
return node->next[i + 1];
}
if ((v & 0x00ff0000) == 0) {
return node->next[i + 2];
}
if ((v & 0xff000000) == 0) {
return node->next[i + 3];
}
}

return NULL;
}
```

This procedure reads 4 bytes in one iteration (when run on 32-bit machine), and also some parallelism is involved — comparision is done for 4 bytes at once.

However, this procedure require array of length rounded up to 4 bytes, that costs additional memory.

Linear search optimized in SSE

Implementation of sse:

```TrieNode* trie_next(TrieNode* node, const char letter) {
const int n = node->n;

if (n == 0) {
return NULL;
}

if (n == 1) {
if (node->chars[0] == letter) {
return node->next[0];
} else {
return NULL;
}
}

int dummy;
__asm__ __volatile__ (
"movzbl %%al, %%eax                 \n"
"imul   \$0x01010101, %%eax, %%eax   \n"
"movd   %%eax, %%xmm0               \n"
"pshufd \$0x00, %%xmm0, %%xmm0       \n"
: "=a" (dummy)
: "a" (letter)
);

size_t i, j;
for (i = 0; i < n; i+=16) {
__asm__ __volatile__ (
"movdqa (%%eax), %%xmm1         \n"
"pcmpeqb %%xmm0, %%xmm1         \n"
"pmovmskb %%xmm1, %%eax         \n"
"bsf %%eax, %%eax               \n"
"setz %%ah                      \n"
: "=a" (j)
: "a"  (&(node->chars[i]))
);
if (j < 256) {
return node->next[i+j];
}
}

return NULL;
}
```

This is quite naive implementation, that uses obvious SSE instruction like pcmpeq and pmovmskb. In one iteration 16 bytes are processed in parallel.

The downside of this method is high setup time, that is really visible when arrays are short.

Experiments

Test programs use two lists:

• dictionary — list of 99171 English words, /usr/share/dict/american-english from Debian;
• input words — list of 133861 English words, splitted text of "The Oddysey" from Project Gutenberg.

Each test program reads dictionary and build trie, then loads input words into simple array. Finally for each input word call trie_lookup, and this is repeated 100 times.

Tests were run on my old laptop with Penium M. Tests on recent CPU are coming.

Statistics

There are 227979 nodes in trie.

Word length

```len  |  count  |
0 |       0 |
1 |      52 |
2 |     182 |
3 |     845 | ===
4 |    3346 | ==============
5 |    6788 | ==============================
6 |   11278 | ==================================================
7 |   14787 | ==================================================================
8 |   15674 | ======================================================================
9 |   14262 | ===============================================================
10 |   11546 | ===================================================
11 |    8415 | =====================================
12 |    5508 | ========================
13 |    3236 | ==============
14 |    1679 | =======
15 |     893 | ===
16 |     382 | =
17 |     176 |
18 |      72 |
19 |      31 |
20 |      10 |
21 |       3 |
22 |       5 |
23 |       1 |
```

Node degree

```len  |  count  |
0 |   67480 | ======================================
1 |  123210 | ======================================================================
2 |   24418 | =============
3 |    6432 | ===
4 |    3104 | =
5 |    1407 |
6 |     646 |
7 |     359 |
8 |     235 |
9 |     145 |
10 |     124 |
11 |      83 |
12 |      58 |
13 |      48 |
14 |      48 |
15 |      29 |
16 |      25 |
17 |      24 |
18 |      19 |
19 |      22 |
20 |      16 |
21 |      14 |
22 |      11 |
23 |       3 |
24 |       5 |
25 |       4 |
26 |       2 |
27 |       0 |
28 |       1 |
29 |       1 |
30 |       3 |
31 |       1 |
32 |       0 |
33 |       0 |
34 |       0 |
35 |       1 |
36 |       0 |
37 |       0 |
38 |       0 |
39 |       0 |
40 |       0 |
41 |       0 |
42 |       0 |
43 |       0 |
44 |       0 |
45 |       0 |
46 |       0 |
47 |       0 |
48 |       0 |
49 |       0 |
50 |       0 |
51 |       0 |
52 |       0 |
53 |       1 |
```

Results [Pentium M, clocked at 1,5GHz]

procedure min [ms] avg [ms] max [ms]
linear 4501 4517 4543
linear-mtf 4386 4408 4445
linear-mtf-incr 3923 3941 3963
linear-unrolled 4111 4127 4151
sse 3951 3970 4005
binary 4400 4424 4441

As we see linear search is the worst, and binary search is not better. Unrolling loop as always helps, SSE instructions helps too.

But the most surprising is that simple linear search with "incremental" move-to-front strategy beats all methods. As Rob Pike said "Fancy algorithms are slow when n is small, and n is usually small." Proved!

Sources

Sample programs are available at github.