Author: | Wojciech Muła |
---|---|
Last update: | 2011-03-31 |
Added on: | 2011-03-26 |
Trie is a multi-way tree, where each edge is labelled with a single letter. In this text several trie representations are shown.
The article is the side effect of work on the pyahocorasic module.
Contents
Each node of a tree store two kinds of data: user data and trie-related data. Let say an alphabet has at most 256 letters, thus a letter could be saved on a byte.
During maintaining tree structure following issues appear:
In experiments Polish, English, French and German words lists were used. User data has got two pointers, i.e. additional 8 bytes per node.
list | words | trie nodes |
---|---|---|
english | 138 622 | 312 855 |
german | 162 032 | 610 470 |
french | 629 420 | 1 297 080 |
polish | 3 588 729 | 5 933 658 |
The sample program were linked against GNU libc, and procedure malloc_stats from malloc.h was used to obtain statistics about the real memory usage.
The simplest solution is to store all possible pointers to children nodes in array of constant size. C-structure:
struct TrieNode { /* user data */ struct TrieNode* children[256]; }
Procedure that returns child node or NULL depending on a letter works in constant time:
TrieNode* NEXT(TrieNode* node, uint8_t letter) { return node->children[letter] }
language | node size [bytes] | trie size [bytes] | bytes/node | allocated [bytes] | wasted [bytes] | wasted [%] |
---|---|---|---|---|---|---|
english | 1,032 | 322,866,360 | 1032.00 | 325,369,216 | 2,502,856 | 0.77 |
german | 1,032 | 630,005,040 | 1032.00 | 634,888,816 | 4,883,776 | 0.77 |
french | 1,032 | 1,338,586,560 | 1032.00 | 1,348,963,216 | 10,376,656 | 0.77 |
polish | n/a | n/a | n/a | n/a | n/a | n/a |
Instead of a static array of pointers allocate such array only when needed, i.e. leaf nodes don't occupy much memory. This cost additional test when accessing children.
Node structure:
struct TrieNode { /* user data */ struct TrieNode** children; }
Next function:
TrieNode* NEXT(TrieNode* node, uint8_t letter) { if (node->children) return node->children[letter]; else return NULL; }
language | node size [bytes] | trie size [bytes] | bytes/node | allocated [bytes] | wasted [bytes] | wasted [%] |
---|---|---|---|---|---|---|
english | 12 | 232,249,620 | 742.36 | 235,286,176 | 3,036,556 | 1.29 |
german | 12 | 495,588,296 | 811.81 | 501,844,744 | 6,256,448 | 1.25 |
french | 12 | 894,746,784 | 689.82 | 906,803,728 | 12,056,944 | 1.33 |
polish | n/a | n/a | n/a | n/a | n/a | n/a |
Do not store the array of all possible pointers, but only a continuous range between first and last non-null pointer.
Node structure:
struct TrieNode { /* user data */ uint8_t lo; uint8_t hi; struct TrieNode** children; }
The array has size hi - lo + 1.
Next function:
TrieNode* NEXT(TrieNode* node, uint8_t letter) { if (node->children) { if (letter >= node->lo and letter <= node->hi) return node->children[letter - node->lo]; else return NULL; } else return NULL; }
language | node size [bytes] | trie size [bytes] | bytes/node | allocated [bytes] | wasted [bytes] | wasted [%] |
---|---|---|---|---|---|---|
english | 16 | 12,688,512 | 40.56 | 17,552,000 | 4,863,488 | 27.71 |
german | 16 | 18,303,588 | 29.98 | 28,461,208 | 10,157,620 | 35.69 |
french | 16 | 44,657,336 | 34.43 | 63,965,320 | 19,307,984 | 30.19 |
polish | 16 | 312,236,524 | 52.62 | 393,812,632 | 81,576,108 | 20.71 |
This method extends 1b approach. The size of chars set stored in entire trie may be smaller than 256, for example words in english dictionary contains small and big letters and apostrophe, i.e. about 53 chars. Thus the size of array in each node could be much smaller, with cost of additional translation table(s).
Such translation tables can be applied for all nodes or be specific for nodes at the same depth. Sample functions retrieving a child node:
TrieNode* NEXT(TrieNode* node, uint8_t letter) { const uint16_t L = trie->translate[letter]; if (L == trie->INVALID_LETTER) return NULL; else return node->children[L] } TrieNode* NEXT(TrieNode* node, uint8_t letter) { /* depth could be parameter of function! */ const size_t depth = get_depth(node); const uint16_t L = trie->translate[depth][letter]; if (L == trie->INVALID_LETTER) return NULL; else return node->children[L] }
Global translation table has 54 elements, while size of translation tables created for each depth vary from 1 to 53:
Results for tries with additional lookups.
method | node struct size | total size | bytes/node |
---|---|---|---|
All possible edges | 1036 | 324,117,780 | 1036.00 |
Global translation table | 16 | 72,582,360 | 232.00 |
Depth-local translation table | 16 | 41,797,316 | 133.60 |
Dynamic table of pointers — store only non-null pointers. This require store also size of a table and a edge label.
struct TrieNode { /* user data */ uint8_t letter; // edge label uint16_t n; // array size TrieNode** children; // array }
Procedure returns next node:
TrieNode* g(TrieNode* node, uint8_t letter) { int i; const int n = node->n; for (i=0; i < n; i++) if (node->child[i]->letter == letter) return node->child[i]; return NULL; }
struct TrieNode { /* user data */ uint8_t letter; uint16_t n; union { TrieNode* single_child; // valid when n = 1 TrieNode** ptr_array; } }
Version 2:
language | node size [bytes] | trie size [bytes] | bytes/node | allocated [bytes] | wasted [bytes] | wasted [%] |
---|---|---|---|---|---|---|
english | 16 | 6,257,096 | 20.00 | 11,193,376 | 4,936,280 | 44.10 |
german | 16 | 12,209,396 | 20.00 | 22,448,216 | 10,238,820 | 45.61 |
french | 16 | 25,941,596 | 20.00 | 45,420,392 | 19,478,796 | 42.89 |
polish | 16 | 118,673,156 | 20.00 | 201,448,152 | 82,774,996 | 41.09 |
Version 2a:
language | node size [bytes] | trie size [bytes] | bytes/node | allocated [bytes] | wasted [bytes] | wasted [%] |
---|---|---|---|---|---|---|
english | 16 | 5,563,904 | 17.78 | 8,419,680 | 2,855,776 | 33.92 |
german | 16 | 10,607,980 | 17.38 | 16,040,816 | 5,432,836 | 33.87 |
french | 16 | 23,480,024 | 18.10 | 35,573,008 | 12,092,984 | 33.99 |
polish | 16 | 109,473,048 | 18.45 | 164,644,264 | 55,171,216 | 33.51 |
Use single-linked list instead of dynamic table. This requires also store a letter (label's edge) on a list.
struct TrieNode { /* user data */ ListItem* head; }; ListItem { uint8_t letter; TrieNode* child; ListItem* next; };
Procedure returns next node:
TrieNode* g(TrieNode* node, uint8_t letter) { ListItem* item = node->head; while (item) { if (item->letter == letter) return item->child; else item = item->next; } return NULL }
language | node size [bytes] | trie size [bytes] | bytes/node | allocated [bytes] | wasted [bytes] | wasted [%] |
---|---|---|---|---|---|---|
english | 12 | 7,508,508 | 24.00 | 10,011,360 | 2,502,852 | 25.00 |
german | 12 | 14,651,268 | 24.00 | 19,535,040 | 4,883,772 | 25.00 |
french | 12 | 31,129,908 | 24.00 | 41,506,560 | 10,376,652 | 25.00 |
polish | 12 | 142,407,780 | 24.00 | 189,877,056 | 47,469,276 | 25.00 |
The dynamic table require two memory reads: the first gets a pointer to child, then get a letter from a child. In this representation letters are separated from the node structure and saved in front of the table of pointers. Sample memory layout for node with 3 children:
| letters | pointers | +---+---+---+---+-------+-------+-------+ | a | b | c | _ | ptr_0 | ptr_1 | ptr_2 | +---+---+---+---+-------+-------+-------+ 0 1 2 x 0 1 2 <- index
Letters area has size rounded to alignment, in this case 4 bytes.
Structure of node is simple:
struct TrieNode { /* user data */ uint16_t n; void* children; }
Procedure returns next node:
TrieNode* NEXT_v3(TrieNode* node, uint8_t letter) { const int n = node->n; const letters_size = round_up(n, aligment); char* letters = (char*)(node->children); TrieNode* pointers = (TrieNode*)(node->children + letters_size); /* find index */ for (i=0; i < n; i++) if (chars[i] == letter) return pointers[i]; return NULL; }
struct TrieNode { /* user data */ uint16_t n; uint8_t byte0; // n == 1 or n == 2 uint8_t byte1; // n == 2 void* children; }
Procedure returns next node:
TrieNode* NEXT(TrieNode* node, uint8_t letter) { const int n = node->n; switch (n) { case 0: return NULL; case 1: if (node->byte0 == letter) return (TrieNode**)(node->children)[0]; else return NULL; case 2: if (node->byte0 == letter) return (TrieNode**)(node->children)[0]; else if (node->byte1 == letter) return (TrieNode**)(node->children)[1]; else return NULL; default: return NEXT_v3(node, letter); } }
Version 3:
language | node size [bytes] | trie size [bytes] | bytes/node | allocated [bytes] | wasted [bytes] | wasted [%] |
---|---|---|---|---|---|---|
english | 16 | 7,588,476 | 24.26 | 11,349,160 | 3,760,684 | 33.14 |
german | 16 | 18,070,716 | 24.17 | 28,197,544 | 10,126,828 | 35.91 |
french | 16 | 62,388,760 | 24.20 | 97,167,832 | 34,779,072 | 35.79 |
polish | 16 | 310,656,956 | 24.24 | 480,136,984 | 169,480,028 | 35.30 |
Version 3a:
language | node size [bytes] | trie size [bytes] | bytes/node | allocated [bytes] | wasted [bytes] | wasted [%] |
---|---|---|---|---|---|---|
english | 16 | 6,406,060 | 20.48 | 11,349,160 | 4,943,100 | 43.55 |
german | 16 | 15,098,296 | 20.19 | 28,122,600 | 13,024,304 | 46.31 |
french | 16 | 52,107,676 | 20.21 | 96,885,336 | 44,777,660 | 46.22 |
polish | 16 | 259,933,364 | 20.28 | 478,677,552 | 218,744,188 | 45.70 |
A modification of the previous scheme — move counter n to a dynamic array. Also type of n could be change to uint8_t. If node has no children, then pointer is NULL, and n never get value 0, so we have values ouside range 1..256.
Structure of node is simple:
struct TrieNode { /* user data */ void* children; }
Procedure returns next node:
TrieNode* NEXT(TrieNode* node, uint8_t letter) { const int n = (uint_8)(node->children[0]); const letters_size = round_up(n, aligment); char* letters = (char*)(node->children + 1); TrieNode* pointers = (TrieNode*)(node->children + letters_size + 1); /* find index */ for (i=0; i < n; i++) if (chars[i] == letter) return pointers[i]; return NULL; }
Sample memory layout for node with 3 children:
| n | letters | pointers | +---+---+---+---+-------+-------+-------+ | 3 | a | b | c | ptr_0 | ptr_1 | ptr_2 | +---+---+---+---+-------+-------+-------+ 0 1 2 0 1 2 <- index
see previous section
language | node size [bytes] | trie size [bytes] | bytes/node | allocated [bytes] | wasted [bytes] | wasted [%] |
---|---|---|---|---|---|---|
english | 12 | 6,297,676 | 20.13 | 8,841,424 | 2,543,748 | 28.77 |
german | 12 | 12,267,360 | 20.09 | 17,782,664 | 5,515,304 | 31.02 |
french | 12 | 26,152,132 | 20.16 | 35,824,896 | 9,672,764 | 27.00 |
polish | 12 | 120,114,864 | 20.24 | 158,666,648 | 38,551,784 | 24.30 |
Instead of storing pointers to children it's possible to store all children nodes in a single array and save only one pointer to that array.
Structure of node:
struct TrieNode { /* user data */ uint16_t n uint8_t letter; TrieNode* array; }
If a trie is not modified, then memory could be even more compacted, but following constraints are set:
Structure of node:
struct TrieNode { /* user data */ uint32_t letter:8; // 8-bit for letter uint32_t eol:1; // 1-bit for End-of-list marker uint32_t index:23; // index of subarray of children }
Procedure thar returns child node:
TrieNode nodes[k]; TrieNode* NEXT(TrieNode* node, const uint8_t letter) { if (node->index == 0) return NULL; TrieNode* children = &nodes[index]; size_t i = 0; while (1) { if (children[i].letter == letter) return &children[i]; else { if (children[i].eol) // last element of array return NULL; else i += 1; } } }
Such trie takes exactly nodes count * sizeof(TrieNode).
language | node size [bytes] | trie size [bytes] | bytes/node | allocated [bytes] | wasted [bytes] | wasted [%] |
---|---|---|---|---|---|---|
english | 16 | 5,005,680 | 16.00 | 6,814,952 | 1,809,272 | 26.55 |
german | 16 | 9,767,520 | 16.00 | 13,623,080 | 3,855,560 | 28.30 |
french | 16 | 20,753,280 | 16.00 | 27,680,536 | 6,927,256 | 25.03 |
polish | 16 | 94,938,528 | 16.00 | 122,940,568 | 28,002,040 | 22.78 |
This is merge of version 4 and version 3b — children nodes are saved in an array (v4), a children counter and edge lables are also stored in the table (v3b).
Structure of node:
struct TrieNode { /* user data */ uint8_t* array; }
(Pseudo)structure od array:
pseudostruct array { uint8_t n; uint8_t letters[n]; uint8_t __alignment__[?]; TrieNode next[n]; }
Procedure returns next node:
TrieNode* NEXT(TrieNode* node, uint8_t letter) { const int n = (int)(node->array[0]); const letters_size = round_up(n + 1, aligment); char* letters = (char*)(node->array + 1); TrieNode* next = (TrieNode*)(node->array + letters_size); /* find index */ for (i=0; i < n; i++) if (letters[i] == letter) return &next[i]; return NULL; }
same as for version 4
language | node size [bytes] | trie size [bytes] | bytes/node | allocated [bytes] | wasted [bytes] | wasted [%] |
---|---|---|---|---|---|---|
english | 12 | 4'687'400 | 14.98 | 6'340'624 | 1'653'224 | 26.07 |
german | 12 | 9'290'880 | 15.22 | 12'901'616 | 3'610'736 | 28.00 |
french | 12 | 19'209'800 | 14.81 | 25'460'408 | 6'250'608 | 25.55 |
polish | 12 | 86'460'392 | 14.57 | 111'249'824 | 24'789'432 | 22.28 |
English words:
layout | node size [bytes] | trie size [bytes] | bytes/node | allocated [bytes] | wasted [bytes] | wasted [%] |
---|---|---|---|---|---|---|
version 1 | 1,032 | 322,866,360 | 1032.00 | 325,369,216 | 2,502,856 | 0.77 |
version 1a | 12 | 232,249,620 | 742.36 | 235,286,176 | 3,036,556 | 1.29 |
version 1b | 16 | 12,688,512 | 40.56 | 17,552,000 | 4,863,488 | 27.71 |
version 2 | 16 | 6,257,096 | 20.00 | 11,193,376 | 4,936,280 | 44.10 |
version 2a | 16 | 5,563,904 | 17.78 | 8,419,680 | 2,855,776 | 33.92 |
version 2b | 12 | 7,508,508 | 24.00 | 10,011,360 | 2,502,852 | 25.00 |
version 3 | 16 | 7,588,476 | 24.26 | 11,349,160 | 3,760,684 | 33.14 |
version 3a | 16 | 6,406,060 | 20.48 | 11,349,160 | 4,943,100 | 43.55 |
version 3b | 12 | 6,297,676 | 20.13 | 8,841,424 | 2,543,748 | 28.77 |
version 4 | 16 | 5,005,680 | 16.00 | 6,814,952 | 1,809,272 | 26.55 |
version 4a | 12 | 4'687'400 | 14.98 | 6'340'624 | 1'653'224 | 26.07 |
German words:
layout | node size [bytes] | trie size [bytes] | bytes/node | allocated [bytes] | wasted [bytes] | wasted [%] |
---|---|---|---|---|---|---|
version 1 | 1,032 | 630,005,040 | 1032.00 | 634,888,816 | 4,883,776 | 0.77 |
version 1a | 12 | 495,588,296 | 811.81 | 501,844,744 | 6,256,448 | 1.25 |
version 1b | 16 | 18,303,588 | 29.98 | 28,461,208 | 10,157,620 | 35.69 |
version 2 | 16 | 12,209,396 | 20.00 | 22,448,216 | 10,238,820 | 45.61 |
version 2a | 16 | 10,607,980 | 17.38 | 16,040,816 | 5,432,836 | 33.87 |
version 2b | 12 | 14,651,268 | 24.00 | 19,535,040 | 4,883,772 | 25.00 |
version 3 | 16 | 18,070,716 | 24.17 | 28,197,544 | 10,126,828 | 35.91 |
version 3a | 16 | 15,098,296 | 20.19 | 28,122,600 | 13,024,304 | 46.31 |
version 3b | 12 | 12,267,360 | 20.09 | 17,782,664 | 5,515,304 | 31.02 |
version 4 | 16 | 9,767,520 | 16.00 | 13,623,080 | 3,855,560 | 28.30 |
version 4a | 12 | 9'290'880 | 15.22 | 12'901'616 | 3'610'736 | 28.00 |
French words:
layout | node size [bytes] | trie size [bytes] | bytes/node | allocated [bytes] | wasted [bytes] | wasted [%] |
---|---|---|---|---|---|---|
version 1 | 1,032 | 1,338,586,560 | 1032.00 | 1,348,963,216 | 10,376,656 | 0.77 |
version 1a | 12 | 894,746,784 | 689.82 | 906,803,728 | 12,056,944 | 1.33 |
version 1b | 16 | 44,657,336 | 34.43 | 63,965,320 | 19,307,984 | 30.19 |
version 2 | 16 | 25,941,596 | 20.00 | 45,420,392 | 19,478,796 | 42.89 |
version 2a | 16 | 23,480,024 | 18.10 | 35,573,008 | 12,092,984 | 33.99 |
version 2b | 12 | 31,129,908 | 24.00 | 41,506,560 | 10,376,652 | 25.00 |
version 3 | 16 | 62,388,760 | 24.20 | 97,167,832 | 34,779,072 | 35.79 |
version 3a | 16 | 52,107,676 | 20.21 | 96,885,336 | 44,777,660 | 46.22 |
version 3b | 12 | 26,152,132 | 20.16 | 35,824,896 | 9,672,764 | 27.00 |
version 4 | 16 | 20,753,280 | 16.00 | 27,680,536 | 6,927,256 | 25.03 |
version 4a | 12 | 19'209'800 | 14.81 | 25'460'408 | 6'250'608 | 25.55 |
Polish words:
layout | node size [bytes] | trie size [bytes] | bytes/node | allocated [bytes] | wasted [bytes] | wasted [%] |
---|---|---|---|---|---|---|
version 1 | n/a | n/a | n/a | n/a | n/a | n/a |
version 1a | n/a | n/a | n/a | n/a | n/a | n/a |
version 1b | 16 | 312,236,524 | 52.62 | 393,812,632 | 81,576,108 | 20.71 |
version 2 | 16 | 118,673,156 | 20.00 | 201,448,152 | 82,774,996 | 41.09 |
version 2a | 16 | 109,473,048 | 18.45 | 164,644,264 | 55,171,216 | 33.51 |
version 2b | 12 | 142,407,780 | 24.00 | 189,877,056 | 47,469,276 | 25.00 |
version 3 | 16 | 310,656,956 | 24.24 | 480,136,984 | 169,480,028 | 35.30 |
version 3a | 16 | 259,933,364 | 20.28 | 478,677,552 | 218,744,188 | 45.70 |
version 3b | 12 | 120,114,864 | 20.24 | 158,666,648 | 38,551,784 | 24.30 |
version 4 | 16 | 94,938,528 | 16.00 | 122,940,568 | 28,002,040 | 22.78 |
version 4a | 12 | 86'460'392 | 14.57 | 111'249'824 | 24'789'432 | 22.28 |
Procedure tests if word is present in trie:
bool exists(Trie* trie, char* string) { char* c = string; TrieNode* node = trie->root; while (*c and node) node = NEXT(node, *c++); return (node && node->end_of_word) }
All representations has been tested using similar procedure, and there are two conclusion:
Experiments clearly shows that internal memory fragmentation is significant, at least when GNU libc is used. The answer to this problem are specialized allocators.
All representations require allocate (or free) single nodes, thus memory pools could be used. For example GNU libc provides obstacks, but such stack-based memory pool is quite simple to write.
Representations using dynamic arrays require more complex memory pool, thus I wrote simple bitmap-based allocator. There is a linked list of arenas:
typedef struct MemPool { MemPool* next_arena; size_t free_count; uint8_t bitmap[MAX_SIZE/8]; } MemPool;
The size of allocated array isn't stored in arena structure, because it's already saved in nodes. Thanks to that memory required by allocator is much smaller, but API is slightly different and isn't safe:
void* mempool_alloc(MemPool** pool, size_t n); void mempool_free(MemPool* pool, size_t n);
Two representations were tested with the English dictionary: