Author: | Wojciech Muła |
---|---|
Added on: | 2013-09-30 |
Last update: | 2017-02-14 (link to github, syntax highlighting) |
Trie is a multiway tree where each edge is labelled by a letter; such trees are used as dictionaries. Lookup takes O(k) time, where k is a string length.
Trie node is defined in C:
typedef struct TrieNode { bool eow; // end of word marker int n; // children count char* letter; // list of edge labels TrieNode** children; // pointers to children nodes }
Lookup procedure is simple.
bool trie_lookup(TrieNode* root, const char* word, const int size) { TrieNode* node = root; for (int i=0; i < size; i++) { // go through edge labelled by i-th letter node = trie_next(node, word[i]); if (node == NULL) { return false; } } // we visited 'size' nodes, check if last // visited node is located at end-of-word return node ? node->eow : false; }
Function trie_next returns a child node labelled with given letter.
TrieNode trie_next(TrieNode node, char letter);
The implementation of this procedure determines overall searching and inserting time.
Following algorithms used by trie_next has been studied:
Implementation of linear:
TrieNode* trie_next(TrieNode* node, const char letter) { size_t i; for (i = 0; i < node->n; i++) { if (node->chars[i] == letter) { return node->next[i]; } } return NULL; }
This procedure performs at most n reads from letters array, and one from children if letters has been found. Since reading letters is done sequentially CPU caching helps.
Implementation of linear-mtf:
TrieNode* trie_next(TrieNode* node, const char letter) { for (size_t i = 0; i < node->n; i++) { if (node->chars[i] == letter) { if (i == 0) { return node->next[0]; } char c0 = node->chars[0]; TrieNode* n0 = node->next[0]; node->chars[0] = letter; node->next[0] = node->next[i]; node->chars[i] = c0; node->next[i] = n0; return node->next[0]; } } return NULL; }
Thanks to MTF item located at index 0 is at the fast path of procedure, hit rate was about 50% in my tests.
This procedure does 3 additional reads 4 additional writes on hit item with index greater than 0. These writes gives some penalty, but are amorized by fast-path execution.
Implementation of linear-mtf-incr:
TrieNode* trie_next(TrieNode* node, const char letter) { for (size_t i = 0; i < node->n; i++) { if (node->chars[i] == letter) { if (i == 0) { return node->next[0]; } const int prev = i-1; char c0 = node->chars[prev]; TrieNode* n0 = node->next[prev]; node->chars[prev] = letter; node->next[prev] = node->next[i]; node->chars[i] = c0; node->next[i] = n0; return node->next[prev]; } } return NULL; }
First variant of MTF allows to random fluctuating element located at the fast-path, this variant — by moving item one position backward --- keep at index 0 most frequently used items.
Implementation of linear-unrolled:
#define ROUND_UP(size) (4*(((size) + 3)/4)) TrieNode* trie_next(TrieNode* node, const char letter) { const uint32_t n = node->n; if (n == 0) { return NULL; } const uint32_t packed_byte = (letter & 0xff) * 0x01010101; for (size_t i = 0; i < n; i+=4) { const uint32_t dword = *((uint32_t*)&node->chars[i]); const uint32_t v = dword ^ packed_byte; if ((v & 0x000000ff) == 0) { return node->next[i + 0]; } if ((v & 0x0000ff00) == 0) { return node->next[i + 1]; } if ((v & 0x00ff0000) == 0) { return node->next[i + 2]; } if ((v & 0xff000000) == 0) { return node->next[i + 3]; } } return NULL; }
This procedure reads 4 bytes in one iteration (when run on 32-bit machine), and also some parallelism is involved — comparision is done for 4 bytes at once.
However, this procedure require array of length rounded up to 4 bytes, that costs additional memory.
Implementation of sse:
TrieNode* trie_next(TrieNode* node, const char letter) { const int n = node->n; if (n == 0) { return NULL; } if (n == 1) { if (node->chars[0] == letter) { return node->next[0]; } else { return NULL; } } int dummy; __asm__ __volatile__ ( "movzbl %%al, %%eax \n" "imul $0x01010101, %%eax, %%eax \n" "movd %%eax, %%xmm0 \n" "pshufd $0x00, %%xmm0, %%xmm0 \n" : "=a" (dummy) : "a" (letter) ); size_t i, j; for (i = 0; i < n; i+=16) { __asm__ __volatile__ ( "movdqa (%%eax), %%xmm1 \n" "pcmpeqb %%xmm0, %%xmm1 \n" "pmovmskb %%xmm1, %%eax \n" "bsf %%eax, %%eax \n" "setz %%ah \n" : "=a" (j) : "a" (&(node->chars[i])) ); if (j < 256) { return node->next[i+j]; } } return NULL; }
This is quite naive implementation, that uses obvious SSE instruction like pcmpeq and pmovmskb. In one iteration 16 bytes are processed in parallel.
The downside of this method is high setup time, that is really visible when arrays are short.
Implementation of binary:
TrieNode* trie_next(TrieNode* node, const char letter) { int a = 0; int b = node->n - 1; while (a <= b) { const int c = (a + b)/2; if (node->chars[c] == letter) { return node->next[c]; } if (letter < (int)node->chars[c]) { b = c - 1; } else { a = c + 1; } } return NULL; }
Procedure use straightforward binary search implementation (yes, it has integer overflow bug).
Test programs use two lists:
Each test program reads dictionary and build trie, then loads input words into simple array. Finally for each input word call trie_lookup, and this is repeated 100 times.
Tests were run on my old laptop with Penium M. Tests on recent CPU are coming.
There are 227979 nodes in trie.
len | count | 0 | 0 | 1 | 52 | 2 | 182 | 3 | 845 | === 4 | 3346 | ============== 5 | 6788 | ============================== 6 | 11278 | ================================================== 7 | 14787 | ================================================================== 8 | 15674 | ====================================================================== 9 | 14262 | =============================================================== 10 | 11546 | =================================================== 11 | 8415 | ===================================== 12 | 5508 | ======================== 13 | 3236 | ============== 14 | 1679 | ======= 15 | 893 | === 16 | 382 | = 17 | 176 | 18 | 72 | 19 | 31 | 20 | 10 | 21 | 3 | 22 | 5 | 23 | 1 |
len | count | 0 | 67480 | ====================================== 1 | 123210 | ====================================================================== 2 | 24418 | ============= 3 | 6432 | === 4 | 3104 | = 5 | 1407 | 6 | 646 | 7 | 359 | 8 | 235 | 9 | 145 | 10 | 124 | 11 | 83 | 12 | 58 | 13 | 48 | 14 | 48 | 15 | 29 | 16 | 25 | 17 | 24 | 18 | 19 | 19 | 22 | 20 | 16 | 21 | 14 | 22 | 11 | 23 | 3 | 24 | 5 | 25 | 4 | 26 | 2 | 27 | 0 | 28 | 1 | 29 | 1 | 30 | 3 | 31 | 1 | 32 | 0 | 33 | 0 | 34 | 0 | 35 | 1 | 36 | 0 | 37 | 0 | 38 | 0 | 39 | 0 | 40 | 0 | 41 | 0 | 42 | 0 | 43 | 0 | 44 | 0 | 45 | 0 | 46 | 0 | 47 | 0 | 48 | 0 | 49 | 0 | 50 | 0 | 51 | 0 | 52 | 0 | 53 | 1 |
procedure | min [ms] | avg [ms] | max [ms] |
---|---|---|---|
linear | 4501 | 4517 | 4543 |
linear-mtf | 4386 | 4408 | 4445 |
linear-mtf-incr | 3923 | 3941 | 3963 |
linear-unrolled | 4111 | 4127 | 4151 |
sse | 3951 | 3970 | 4005 |
binary | 4400 | 4424 | 4441 |
As we see linear search is the worst, and binary search is not better. Unrolling loop as always helps, SSE instructions helps too.
But the most surprising is that simple linear search with "incremental" move-to-front strategy beats all methods. As Rob Pike said "Fancy algorithms are slow when n is small, and n is usually small." Proved!
Sample programs are available at github.