Author: | Wojciech Muła |
---|---|
Date: | 2013-11-23 |
Update: | 2013-11-24 (minor updates), 2013-12-01 (pair encoding) |
Russ Cox wrote a very interesting article about the algorithms behind Google Code Search. In short: files are indexed with trigrams, a query string is split into trigrams, and then the index is used to limit the number of files searched.
Code Search maintains an inverted index: there is a file list, and for each trigram there is a list of file ids. I was interested in how a list of file ids is stored on disc. In Russ's implementation such a list is built as follows:
A variable-length integer (varint) is a byte-oriented encoding, similar to UTF-8, where each byte holds 7 bits of data and 1 bit (the MSB) is used as an end-of-word marker. Thus smaller values require fewer bytes: integers less than 128 occupy 1 byte, integers less than 16384 occupy 2 bytes, and so on.
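The scheme can be sketched in a few lines of Python. This is a minimal illustration, not Russ's code; in particular the marker convention (MSB set on the final byte) follows the description above and the real implementation may use the opposite, continuation-bit convention:

```python
def varint_encode(value):
    # Encode a non-negative integer: each byte carries 7 bits of data,
    # the MSB set on a byte marks it as the last byte of the value.
    result = bytearray()
    while value >= 0x80:
        result.append(value & 0x7f)
        value >>= 7
    result.append(value | 0x80)  # final byte: set the end-of-word marker
    return bytes(result)

def varint_decode(data):
    # Decode a single varint from the beginning of `data`;
    # returns (value, number of bytes consumed).
    value, shift = 0, 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7f) << shift
        if byte & 0x80:  # end-of-word marker
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")
```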
Important: the lists of file ids are encoded independently; because of that it is not possible to exploit any attributes shared between lists to improve compression.
The downside of a byte-oriented encoding is that bits are wasted when values are small. For example, the list of ids:
[10000, 10001, 10003, 10004, 10006, 10007, 10009, 10010, 10017, 11500]
after calculating differences becomes:
[10000, 1, 2, 1, 2, 1, 2, 1, 7, 1483]
The first and the last value can be saved on 2 bytes each, the rest on 1 byte; 12 bytes in total. The small values are stored on one byte but in fact require only 1, 2 or 3 bits, so 5 to 7 bits of each byte are unused.
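The 12-byte figure can be verified with a short snippet (the helper names are mine; `varint_size` counts one byte per started group of 7 bits):

```python
def deltas(values):
    # Replace each element (except the first) by the difference
    # from its predecessor.
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def varint_size(value):
    # Number of bytes occupied by the varint encoding of `value`:
    # one byte per started group of 7 bits.
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size

ids = [10000, 10001, 10003, 10004, 10006, 10007, 10009, 10010, 10017, 11500]
print(sum(varint_size(x) for x in deltas(ids)))  # 12
```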
It's possible to efficiently encode such a subset of small values in a bitset of constant size: the difference between each value and the first value (the "head") is encoded as a bit position. For the subset:
[10000, 10001, 10003, 10004, 10006, 10007, 10009, 10010, 10017]
we have:
Instead of 12 bytes we get 2 + 4 + 2 = 8 bytes. Of course, the results are better for longer subsets; in the best case we can encode one value using just one bit. However, for subsets smaller than 6 elements this encoding does not pay off.
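The idea can be sketched in Python. The exact layout is my assumption (the head stored separately, plus a 32-bit mask where bit k set means the value head + k + 1 is present); the article's on-disk format may differ in details:

```python
def encode_bitset(subset):
    # Sketch of the constant-size bitset idea: the first value ("head")
    # is kept as-is (it would be varint-encoded on disc), the remaining
    # values become set bits at positions (value - head - 1) in a
    # 32-bit mask.  ASSUMPTION: whether the head itself gets a bit is
    # not specified in the article.
    head = subset[0]
    mask = 0
    for v in subset[1:]:
        offset = v - head - 1
        assert 0 <= offset < 32, "value does not fit in the bitset"
        mask |= 1 << offset
    return head, mask

def decode_bitset(head, mask):
    # Recover the subset: the head, then one value per set bit.
    values = [head]
    for bit in range(32):
        if mask & (1 << bit):
            values.append(head + bit + 1)
    return values
```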
Note: there is an encoding scheme called Simple9 (see the first position in the bibliography) that can achieve a similar compression ratio, but Simple9 requires a 4-bit header per 28-bit data word, so in this scenario its compression ratio would be lower.
I've checked two methods:
Although first match seems "too easy", this method gives a better compression ratio than greedy. Moreover, it is very fast and simple to implement. Greedy is very, very slow; perhaps some kind of dynamic programming would help, but this wasn't the main subject of the research, so I left it as is.
All results presented here use first-match searching.
Varnibble encoding is similar to varint, but operates on 4-bit words instead of 8-bit bytes: each word contains 3 bits of data and 1 bit used as the end-of-value marker. This encoding requires some bit operations (shift, and, or), but the shift amounts and masks are constant, so it isn't as expensive as other bit-level encoding schemes.
Varnibble reduces the bit-wasting described earlier for small values.
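A possible varnibble encoder and decoder, as a sketch; the nibble order within a byte and the marker convention (top bit set on the final nibble) are my assumptions:

```python
def varnibble_encode(values):
    # Each 4-bit word holds 3 data bits; the top bit of a nibble set
    # means "last nibble of this value".  Nibbles are packed two per
    # byte, high nibble first (an odd count is padded with zero).
    nibbles = []
    for value in values:
        while value >= 8:
            nibbles.append(value & 0x7)
            value >>= 3
        nibbles.append(value | 0x8)  # final nibble: set the marker
    if len(nibbles) % 2:
        nibbles.append(0)
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))

def varnibble_decode(data, count):
    # Decode `count` values from varnibble-encoded bytes.
    nibbles = []
    for byte in data:
        nibbles.extend((byte >> 4, byte & 0xf))
    values, value, shift = [], 0, 0
    for nib in nibbles:
        value |= (nib & 0x7) << shift
        shift += 3
        if nib & 0x8:  # end-of-value marker
            values.append(value)
            value, shift = 0, 0
            if len(values) == count:
                break
    return values
```

A value below 8 fits in a single nibble, so two such values occupy one byte, versus two bytes with varint.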
This is a generalization of varint and varnibble: the number of data bits per word can be arbitrary. The number of bits is selected once for the whole sequence and saved on one byte, then all elements are encoded.
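A minimal sketch of the generalized scheme, assuming little-endian bit packing (the article does not specify the bit order); with nbits = 7 it degenerates to varint, with nbits = 3 to varnibble:

```python
def varbits_encode(values, nbits):
    # One header byte stores nbits; then every value is emitted as a
    # sequence of (nbits + 1)-bit words: nbits data bits plus one
    # end-of-value flag.  The bitstream is accumulated in a Python int
    # for brevity; a real encoder would write bytes incrementally.
    stream, pos = 0, 0
    for value in values:
        while True:
            word = value & ((1 << nbits) - 1)
            value >>= nbits
            last = 1 if value == 0 else 0
            stream |= ((last << nbits) | word) << pos
            pos += nbits + 1
            if last:
                break
    return bytes([nbits]) + stream.to_bytes((pos + 7) // 8, "little")

def varbits_decode(data, count):
    # Decode `count` values; the word width is read from the header byte.
    nbits = data[0]
    stream = int.from_bytes(data[1:], "little")
    pos, values, value, shift = 0, [], 0, 0
    while len(values) < count:
        word = (stream >> pos) & ((1 << (nbits + 1)) - 1)
        pos += nbits + 1
        value |= (word & ((1 << nbits) - 1)) << shift
        shift += nbits
        if word >> nbits:  # end-of-value flag
            values.append(value)
            value, shift = 0, 0
    return values
```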
There is a compression method called "byte pair encoding"; in this pair encoding, 'values' take the role of bytes. Algorithm:
A set of trigrams and the associated file ids was collected from a subtree of the Linux kernel sources. Each list was saved in a separate file, then I picked some of them. Here are some statistics about the lists:
The plot shows the sizes of lists encoded as differences with varint. Most of them are quite short; only a few are long.
The total size of all lists:
It's clear that using differences can save up to 50% of space, so all the other methods encode differences rather than plain values. The results for differences encoded with varint (varint diff for short) are used as the reference.
Comparisons with the other methods are done for the following properties:
This algorithm finds the minimum number of bits required to store the largest value of a single list. Then each value is saved in a bitfield of the calculated width.
Algorithm: