SSE: trie lookup speedup

Author:	Wojciech Muła
Last update:	2017-02-14 (link to github, syntax highlighting), 2013-09-30

Introduction

Trie is a multiway tree where each edge is labelled by a letter; such trees are used as dictionaries. Lookup takes O(k) time, where k is a string length.

Trie node is defined in C:

typedef struct TrieNode {
    bool        eow;        // end of word marker
    int         n;          // children count
    char*       letter;     // list of edge labels
    TrieNode**  children;   // pointers to children nodes
}

Lookup procedure is simple.

bool trie_lookup(TrieNode* root, const char* word, const int size) {
    TrieNode* node = root;

    for (int i=0; i < size; i++) {
        // go through edge labelled by i-th letter
        node = trie_next(node, word[i]);
        if (node == NULL) {
            return false;
        }
    }

    // we visited 'size' nodes, check if last
    // visited node is located at end-of-word

    return node ? node->eow : false;
}

Function trie_next returns a child node labelled with given letter.

TrieNode trie_next(TrieNode node, char letter);

The implementation of this procedure determines overall searching and inserting time.

Analysis

Following algorithms used by trie_next has been studied:

linear search
linear search with move-to-front strategy
linear search with incremental move-to-front strategy
linear search with minimized memory reads
linear search optimized in SSE
binary search

Linear search

Implementation of linear:

TrieNode* trie_next(TrieNode* node, const char letter) {
        size_t i;
        for (i = 0; i < node->n; i++) {
                if (node->chars[i] == letter) {
                        return node->next[i];
                }
        }

        return NULL;
}

This procedure performs at most n reads from letters array, and one from children if letters has been found. Since reading letters is done sequentially CPU caching helps.

Linear search with move-to-front strategy

Implementation of linear-mtf:

TrieNode* trie_next(TrieNode* node, const char letter) {
        for (size_t i = 0; i < node->n; i++) {
                if (node->chars[i] == letter) {
                        if (i == 0) {
                                return node->next[0];
                        }

                        char      c0 = node->chars[0];
                        TrieNode* n0 = node->next[0];

                        node->chars[0] = letter;
                        node->next[0]  = node->next[i];

                        node->chars[i] = c0;
                        node->next[i]  = n0;

                        return node->next[0];
                }
        }

        return NULL;
}

Thanks to MTF item located at index 0 is at the fast path of procedure, hit rate was about 50% in my tests.

This procedure does 3 additional reads 4 additional writes on hit item with index greater than 0. These writes gives some penalty, but are amorized by fast-path execution.

Linear search with incremental move-to-front strategy

Implementation of linear-mtf-incr:

TrieNode* trie_next(TrieNode* node, const char letter) {
        for (size_t i = 0; i < node->n; i++) {
                if (node->chars[i] == letter) {
                        if (i == 0) {
                                return node->next[0];
                        }

                        const int prev = i-1;
                        char      c0 = node->chars[prev];
                        TrieNode* n0 = node->next[prev];

                        node->chars[prev] = letter;
                        node->next[prev]  = node->next[i];

                        node->chars[i] = c0;
                        node->next[i]  = n0;

                        return node->next[prev];
                }
        }

        return NULL;
}

First variant of MTF allows to random fluctuating element located at the fast-path, this variant — by moving item one position backward --- keep at index 0 most frequently used items.

Linear search with minimized memory reads

Implementation of linear-unrolled:

#define ROUND_UP(size) (4*(((size) + 3)/4))

TrieNode* trie_next(TrieNode* node, const char letter) {
        const uint32_t n = node->n;
        if (n == 0) {
                return NULL;
        }

        const uint32_t packed_byte = (letter & 0xff) * 0x01010101;

        for (size_t i = 0; i < n; i+=4) {
                const uint32_t dword = *((uint32_t*)&node->chars[i]);
                const uint32_t v = dword ^ packed_byte;

                if ((v & 0x000000ff) == 0) {
                        return node->next[i + 0];
                }
                if ((v & 0x0000ff00) == 0) {
                        return node->next[i + 1];
                }
                if ((v & 0x00ff0000) == 0) {
                        return node->next[i + 2];
                }
                if ((v & 0xff000000) == 0) {
                        return node->next[i + 3];
                }
        }

        return NULL;
}

This procedure reads 4 bytes in one iteration (when run on 32-bit machine), and also some parallelism is involved — comparision is done for 4 bytes at once.

However, this procedure require array of length rounded up to 4 bytes, that costs additional memory.

Linear search optimized in SSE

Implementation of sse:

TrieNode* trie_next(TrieNode* node, const char letter) {
        const int n = node->n;

        if (n == 0) {
                return NULL;
        }

        if (n == 1) {
                if (node->chars[0] == letter) {
                        return node->next[0];
                } else {
                        return NULL;
                }
        }

        int dummy;
        __asm__ __volatile__ (
                "movzbl %%al, %%eax                 \n"
                "imul   $0x01010101, %%eax, %%eax   \n"
                "movd   %%eax, %%xmm0               \n"
                "pshufd $0x00, %%xmm0, %%xmm0       \n"
                : "=a" (dummy)
                : "a" (letter)
        );

        size_t i, j;
        for (i = 0; i < n; i+=16) {
                __asm__ __volatile__ (
                        "movdqa (%%eax), %%xmm1         \n"
                        "pcmpeqb %%xmm0, %%xmm1         \n"
                        "pmovmskb %%xmm1, %%eax         \n"
                        "bsf %%eax, %%eax               \n"
                        "setz %%ah                      \n"
                        : "=a" (j)
                        : "a"  (&(node->chars[i]))
                );
                if (j < 256) {
                        return node->next[i+j];
                }
        }

        return NULL;
}

This is quite naive implementation, that uses obvious SSE instruction like pcmpeq and pmovmskb. In one iteration 16 bytes are processed in parallel.

The downside of this method is high setup time, that is really visible when arrays are short.

Binary search

Implementation of binary:

TrieNode* trie_next(TrieNode* node, const char letter) {
        int a = 0;
        int b = node->n - 1;

        while (a <= b) {
                const int c = (a + b)/2;
                if (node->chars[c] == letter) {
                        return node->next[c];
                }

                if (letter < (int)node->chars[c]) {
                        b = c - 1;
                } else {
                        a = c + 1;
                }
        }

        return NULL;
}

Procedure use straightforward binary search implementation (yes, it has integer overflow bug).

Experiments

Test programs use two lists:

dictionary — list of 99171 English words, /usr/share/dict/american-english from Debian;
input words — list of 133861 English words, splitted text of "The Oddysey" from Project Gutenberg.

Each test program reads dictionary and build trie, then loads input words into simple array. Finally for each input word call trie_lookup, and this is repeated 100 times.

Tests were run on my old laptop with Penium M. Tests on recent CPU are coming.

Statistics

There are 227979 nodes in trie.

Word length

len  |  count  |
   0 |       0 |
   1 |      52 |
   2 |     182 |
   3 |     845 | ===
   4 |    3346 | ==============
   5 |    6788 | ==============================
   6 |   11278 | ==================================================
   7 |   14787 | ==================================================================
   8 |   15674 | ======================================================================
   9 |   14262 | ===============================================================
  10 |   11546 | ===================================================
  11 |    8415 | =====================================
  12 |    5508 | ========================
  13 |    3236 | ==============
  14 |    1679 | =======
  15 |     893 | ===
  16 |     382 | =
  17 |     176 |
  18 |      72 |
  19 |      31 |
  20 |      10 |
  21 |       3 |
  22 |       5 |
  23 |       1 |

Node degree

len  |  count  |
   0 |   67480 | ======================================
   1 |  123210 | ======================================================================
   2 |   24418 | =============
   3 |    6432 | ===
   4 |    3104 | =
   5 |    1407 |
   6 |     646 |
   7 |     359 |
   8 |     235 |
   9 |     145 |
  10 |     124 |
  11 |      83 |
  12 |      58 |
  13 |      48 |
  14 |      48 |
  15 |      29 |
  16 |      25 |
  17 |      24 |
  18 |      19 |
  19 |      22 |
  20 |      16 |
  21 |      14 |
  22 |      11 |
  23 |       3 |
  24 |       5 |
  25 |       4 |
  26 |       2 |
  27 |       0 |
  28 |       1 |
  29 |       1 |
  30 |       3 |
  31 |       1 |
  32 |       0 |
  33 |       0 |
  34 |       0 |
  35 |       1 |
  36 |       0 |
  37 |       0 |
  38 |       0 |
  39 |       0 |
  40 |       0 |
  41 |       0 |
  42 |       0 |
  43 |       0 |
  44 |       0 |
  45 |       0 |
  46 |       0 |
  47 |       0 |
  48 |       0 |
  49 |       0 |
  50 |       0 |
  51 |       0 |
  52 |       0 |
  53 |       1 |

Results [Pentium M, clocked at 1,5GHz]

procedure	min [ms]	avg [ms]	max [ms]
linear	4501	4517	4543
linear-mtf	4386	4408	4445
linear-mtf-incr	3923	3941	3963
linear-unrolled	4111	4127	4151
sse	3951	3970	4005
binary	4400	4424	4441

As we see linear search is the worst, and binary search is not better. Unrolling loop as always helps, SSE instructions helps too.

But the most surprising is that simple linear search with "incremental" move-to-front strategy beats all methods. As Rob Pike said "Fancy algorithms are slow when n is small, and n is usually small." Proved!

Sources

Sample programs are available at github.