Author: | Wojciech Muła |
---|---|
Date: | 2012-05-25 |
Update: | 2017-03-12 (spelling), 2013-10-20, 2013-11-03 (FST results from PC, better scripts) |
Contents
PostgreSQL executing queries with construct like '%text%' can't use any system index (I guess none of existing SQL servers have such feature).
But it's possible to speed up such queries with an own index using n-grams. N-gram is a subword consists N successive characters. For example all 3-grams (trigrams) for word "bicycle" are: "bic", "icy", "cyc", "cle". Likewise all 4-grams for this word are: "bicy", "icyc", "ycle".
This article shows two approaches:
Each row of an index contains a pair: an n-gram and a list of ids; an id points to the row in a data table where text contains given n-gram.
When search is performed then the searched text is split to n-grams, and we choose the n-gram with the least number of assigned phrases. The chosen n-gram is used to limit the number of phrases, and finally a regular search is performed on such subset to filter out false-positives.
Pseudocode:
ngrams = split(string) best_ngram = find_best(ngrams) if best_ngram is null then return NULL; else return SELECT * FROM data WHERE data.id IN (SELECT ids FROM index WHERE ngram = best_ngram) AND data.text LIKE '%' || string || '%';
Polish aspell dictionary containing about 3.6 million words was used in experiments:
# select count(*), avg(length(text)), max(length(text)) from data; count | avg | max ---------+---------------------+----- 3669299 | 12.0597729975126039 | 39
The definition of table data is:
CREATE TABLE data ( id integer primary key, text varchar(64) );
Default B-tree index on id has been created. The size of relation is about 166MB, the size of index 63MB.
Two n-grams indexes were created for 3-letter and 4-letter grams:
CREATE TABLE index_3grams ( ngram char(3) PRIMARY KEY, count integer NOT NULL, ids integer[] -- array of indexes to data(id) ); CREATE TABLE index_4grams ( ngram char(4) PRIMARY KEY, count integer NOT NULL, ids integer[] -- array of indexes to data(id) );
A B-tree index on column ngram exists on both tables.
Some statistics:
count | min(count) | avg(count) | max(count) | |
---|---|---|---|---|
3-grams | 18107 | 1 | 2038.57 | 1054928 |
4-grams | 99424 | 1 | 334.36 | 202237 |
The physical size of tables:
main table [MB] | TOAST table [MB] | sum [MB] | |
---|---|---|---|
3-grams | 9 | 110 | 119 |
4-grams | 28 | 143 | 171 |
Check this simple query:
# SELECT * FROM data WHERE text LIKE '%domek%'; id | text ---------+----------- 367564 | dodomek 381960 | domek 2085698 | podomek 2416867 | przydomek 3389073 | zadomek (5 rows) Time: 1615.348 ms
Regardless of a searched term time is similar, about 1.6 seconds.
Now we try to repeat the sample search using these 3-grams index; 3-grams for word "domek" are: "dom", "ome", "mek":
# SELECT ngram, count FROM index_3grams WHERE ngram IN ('dom', 'ome', 'mek') ORDER BY count; ngram | count -------+------- mek | 571 dom | 9910 ome | 15604 (3 rows)
The best candidate is "mek":
# SELECT * FROM data WHERE text LIKE '%domek%' AND id=ANY((SELECT ids FROM index_3grams WHERE ngram='mek')::integer[]); id | text ---------+----------- 367564 | dodomek 381960 | domek 2085698 | podomek 2416867 | przydomek 3389073 | zadomek (5 rows) Time: 3.685 ms
3.6 milliseconds. It's not mistake, typo nor random data pasted from a console.
Now examine 4-grams index; 4-grams for "domek" are: "dome", "omek":
# SELECT ngram, count FROM index_4grams WHERE ngram IN ('dome', 'omek') ORDER BY count; ngram | count -------+------- omek | 64 dome | 639 (2 rows)
The choice is simple:
# SELECT * from data WHERE text LIKE '%domek%' AND id=ANY((SELECT ids FROM index_4grams WHERE ngram='omek')::integer[]); id | text ---------+----------- 367564 | dodomek 381960 | domek 2085698 | podomek 2416867 | przydomek 3389073 | zadomek (5 rows) Time: 1.137 ms
1.1 millisecond.
The speedup is impressive, but the method has some weakness:
Creating indexes for many n-grams, for example five, six or more characters, could help a lot, because longer n-grams are connected with relative small number of phrases (see the first table in this text for comparison).
Let define a simple function that returns an array of all trigrams:
CREATE OR REPLACE FUNCTION trigrams_array(word text) RETURNS text[] IMMUTABLE STRICT LANGUAGE "plpgsql" AS $$ DECLARE result text[]; BEGIN FOR i IN 1 .. length(word) - 2 LOOP result := result || quote_literal(substr(lower(word), i, 3)); END LOOP; RETURN result; END; $$;
Now we are able to create a lexemes list (tsvector):
CREATE OR REPLACE FUNCTION trigrams_vector(text) RETURNS tsvector IMMUTABLE STRICT LANGUAGE "SQL" AS $$ SELECT array_to_string(trigrams_array($1), ' ')::tsvector; $$;
A full text search query is represented by an object of type tsquery. The query is a set of lexemes and boolean operators (&, | and !). In n-grams searches it's required that all grams are present, so query has form 'bic' & 'icy' & 'cyc' & 'cle'.
The function that creates a query:
CREATE OR REPLACE FUNCTION trigrams_query(text) RETURNS tsquery IMMUTABLE STRICT LANGUAGE "SQL" AS $$ SELECT array_to_string(trigrams_array($1), ' & ')::tsquery; $$;
An index can be create as a functional one using function trigrams_vector, or can be set on the dedicated table's column. The functional index requires the same function that is used in a condition. An extra column takes additional disc space, but when different full search queries are performed on the same table this may be the only solution.
Sample data is a set of ~43000 titles of books from The Gutenberg Project, the postprocessed list is available for download. The table has size 3.4MB, here are some statistics:
SELECT COUNT(*), MIN(l), MAX(l), AVG(l) FROM (SELECT length(title) AS l FROM titles) AS foo; count | min | max | avg -------+-----+-----+--------------------- 43839 | 1 | 580 | 43.1125253769474669
Simple query, i.e. performing sequence scan, takes 230 ms:
SELECT count(*) FROM titles WHERE title ILIKE '%adventures%';
Functional index using function trigrams_vector is defined as follows:
CREATE INDEX titles_trigrams_gin_idx ON titles USING GIN(trigrams_vector(title)) ;
The index size is 18MB.
A SQL query have to use trigrams_vector and trigrams_query:
SELECT COUNT(*) FROM titles WHERE trigrams_vector(title) @@ trigrams_query('adventures');
The query plan:
Aggregate (cost=104.52..104.53 rows=1 width=0) -> Bitmap Heap Scan on titles (cost=100.25..104.51 rows=1 width=0) Recheck Cond: (trigrams_vector(title) @@ ”'adv” & ”dve” & ”ven” & ”ent” & ”ntu” & ”tur” & ”ure” & ”res”'::tsquery) -> Bitmap Index Scan on titles_trigrams_gin_idx (cost=0.00..100.25 rows=1 width=0) Index Cond: (trigrams_vector(title) @@ ”'adv” & ”dve” & ”ven” & ”ent” & ”ntu” & ”tur” & ”ure” & ”res”'::tsquery)
The downside of GIN index is obligatory rechecking a query, because this index may produce false-positive results.
GIST index:
CREATE INDEX titles_trigrams_gist_idx ON titles USING GIST(trigrams_vector(title)) ;
The index size is 9.4MB.
The plan for the same SQL query as used in the GIN example:
Aggregate (cost=8.57..8.58 rows=1 width=0) -> Index Scan using titles_trigrams_gist_idx on titles (cost=0.25..8.56 rows=1 width=0) Index Cond: (trigrams_vector(title) @@ ”'adv” & ”dve” & ”ven” & ”ent” & ”ntu” & ”tur” & ”ure” & ”res”'::tsquery)
There are 25 different search phrases (picked from the titles). Searching has been performed 5 times, minimum running times were considered. The script fst_test contains all data; functions and queries is available for download.
Experiments were made on my old laptop (Pentium M @ 1.5GHz with 1.2GB RAM) using PosgreSQL 9.1 from Debian.
phrase | GIN [ms] | GIST [ms] |
---|---|---|
alice in wonderland | 50.30 | 9.12 |
begin | 0.83 | 280.83 |
build | 0.79 | 90.42 |
cats | 0.60 | 49.41 |
dog | 0.44 | 327.40 |
eagle | 0.46 | 44.51 |
europe | 1.35 | 754.64 |
meaning of life | 38.82 | 8.57 |
moon | 0.51 | 94.09 |
old man | 6.67 | 46.55 |
other stories | 44.84 | 156.64 |
poland | 7.85 | 51.66 |
razor | 0.26 | 0.80 |
scientific | 7.98 | 146.38 |
scientific american | 29.89 | 80.84 |
seven | 1.71 | 213.18 |
seven periods of english | 69.81 | 3.12 |
Shakespear | 4.02 | 61.99 |
sonnets | 2.42 | 23.19 |
the odyssey | 32.15 | 3.78 |
trees | 1.23 | 81.14 |
whatever | 4.61 | 14.36 |
wonder | 3.17 | 145.17 |
world war | 5.33 | 41.88 |
writer | 4.20 | 91.95 |
Experiments made on my PC (Core2 @ 2.6GHz with 2GB RAM) using PosgreSQL 9.1 from Debian.
phrase | GIN [ms] | GIST [ms] |
---|---|---|
alice in wonderland | 23.90 | 4.23 |
begin | 0.36 | 111.20 |
build | 0.35 | 37.87 |
cats | 0.27 | 19.65 |
dog | 0.19 | 128.52 |
eagle | 0.20 | 18.94 |
europe | 0.61 | 293.58 |
meaning of life | 17.20 | 4.40 |
moon | 0.22 | 37.34 |
old man | 3.13 | 19.83 |
other stories | 19.50 | 67.57 |
poland | 3.42 | 20.94 |
razor | 0.11 | 0.54 |
scientific | 3.71 | 59.76 |
scientific american | 13.22 | 34.19 |
seven | 0.76 | 86.62 |
seven periods of english | 31.58 | 2.23 |
Shakespear | 1.88 | 25.70 |
sonnets | 1.12 | 9.69 |
the odyssey | 13.96 | 1.90 |
trees | 0.57 | 31.89 |
whatever | 2.11 | 5.80 |
wonder | 1.45 | 60.38 |
world war | 2.52 | 16.59 |
writer | 1.93 | 36.96 |
(Times are given from the first results table.)