Author: | Wojciech Muła |
---|---|
Added on: | 2011-04-09 |
If you read the Wikipedia entry about DAWG, then you find following sentence:
Because the terminal nodes of a DAWG can be reached by multiple paths, a DAWG is not suitable for storing auxiliary information relating to each path, e.g. a word's frequency in the English language. A trie would be more useful in such a case.
This isn't true!
There is a quite simple algorithm, that allow to perform two-way minimal perfect hashing (MPH), i.e. convert any path representing a word to a unique number, or back — a number to a path (word). Values lie in the range 1 .. n, where n is the number of distinct words saved in a DAWG.
The algorithm is described in Applications of Finite Automata Representing Large Vocabularies, by Claudio Lucchiesi and Tomasz Kowaltowski (preprint is freely available somewhere online).
The main part of the algorithm is assigning to each node the number of reachable words from a node; this can be easily done in one pass. Then these numbers are used to perform perfect hashing. Hashing algorithm is fast and simple, translation from pseudocode presented in the paper is straightforward.
Algorithm requires additional memory for numbers in each node and a table of size n to implement dictionary lookups.
I've updated pyDAWG to support MPH.