Speeding up LIKE '%text%' queries (at least in PostgeSQL)

Author: Wojciech Muła
Added on:2012-05-25
Update:2017-03-12 (spelling), 2013-10-20, 2013-11-03 (FST results from PC, better scripts)

Contents

Introduction

PostgreSQL executing queries with construct like '%text%' can't use any system index (I guess none of existing SQL servers have such feature).

But it's possible to speed up such queries with an own index using n-grams. N-gram is a subword consists N successive characters. For example all 3-grams (trigrams) for word "bicycle" are: "bic", "icy", "cyc", "cle". Likewise all 4-grams for this word are: "bicy", "icyc", "ycle".

This article shows two approaches:

Index as a table

Each row of an index contains a pair: an n-gram and a list of ids; an id points to the row in a data table where text contains given n-gram.

When search is performed then the searched text is split to n-grams, and we choose the n-gram with the least number of assigned phrases. The chosen n-gram is used to limit the number of phrases, and finally a regular search is performed on such subset to filter out false-positives.

Pseudocode:

ngrams = split(string)
best_ngram = find_best(ngrams)
if best_ngram is null then
        return NULL;
else
        return
                SELECT * FROM data
                 WHERE data.id IN (SELECT ids FROM index WHERE ngram = best_ngram)
                   AND data.text LIKE '%' || string || '%';

PostgreSQL experiments

Data definition

Polish aspell dictionary containing about 3.6 million words was used in experiments:

# select count(*), avg(length(text)), max(length(text)) from data;
  count  |         avg         | max
---------+---------------------+-----
 3669299 | 12.0597729975126039 |  39

The definition of table data is:

CREATE TABLE data (
        id integer primary key,
        text varchar(64)
);

Default B-tree index on id has been created. The size of relation is about 166MB, the size of index 63MB.

N-grams indexes

Two n-grams indexes were created for 3-letter and 4-letter grams:

CREATE TABLE index_3grams (
        ngram char(3) PRIMARY KEY,
        count integer NOT NULL,
        ids   integer[] -- array of indexes to data(id)
);

CREATE TABLE index_4grams (
        ngram char(4) PRIMARY KEY,
        count integer NOT NULL,
        ids   integer[] -- array of indexes to data(id)
);

A B-tree index on column ngram exists on both tables.

Some statistics:

  count min(count) avg(count) max(count)
3-grams 18107 1 2038.57 1054928
4-grams 99424 1 334.36 202237

The physical size of tables:

  main table [MB] TOAST table [MB] sum [MB]
3-grams 9 110 119
4-grams 28 143 171

Demonstration

Check this simple query:

# SELECT * FROM data WHERE text LIKE '%domek%';
   id    |   text
---------+-----------
  367564 | dodomek
  381960 | domek
 2085698 | podomek
 2416867 | przydomek
 3389073 | zadomek
(5 rows)

Time: 1615.348 ms

Regardless of a searched term time is similar, about 1.6 seconds.

Now we try to repeat the sample search using these 3-grams index; 3-grams for word "domek" are: "dom", "ome", "mek":

# SELECT ngram, count FROM index_3grams WHERE ngram IN ('dom', 'ome', 'mek') ORDER BY count;
 ngram | count
-------+-------
 mek   |   571
 dom   |  9910
 ome   | 15604
(3 rows)

The best candidate is "mek":

# SELECT * FROM data
  WHERE text LIKE '%domek%'
    AND id=ANY((SELECT ids FROM index_3grams WHERE ngram='mek')::integer[]);
   id    |   text
---------+-----------
  367564 | dodomek
  381960 | domek
 2085698 | podomek
 2416867 | przydomek
 3389073 | zadomek
(5 rows)

Time: 3.685 ms

3.6 milliseconds. It's not mistake, typo nor random data pasted from a console.

Now examine 4-grams index; 4-grams for "domek" are: "dome", "omek":

# SELECT ngram, count FROM index_4grams WHERE ngram IN ('dome', 'omek') ORDER BY count;
 ngram | count
-------+-------
 omek  |    64
 dome  |   639
(2 rows)

The choice is simple:

# SELECT * from data
  WHERE text LIKE '%domek%'
    AND id=ANY((SELECT ids FROM index_4grams WHERE ngram='omek')::integer[]);
   id    |   text
---------+-----------
  367564 | dodomek
  381960 | domek
 2085698 | podomek
 2416867 | przydomek
 3389073 | zadomek
(5 rows)

Time: 1.137 ms

1.1 millisecond.

Summary

The speedup is impressive, but the method has some weakness:

  • The speedup highly depends on n-grams. If a searched text contains n-grams connected with large number of records it may be slow. For example in Polish the n-gram 'nie' (means 'no', but often is a prefix in many words) exist in 1 million words.
  • Searching texts shorter than the shortest n-gram index requires sequence scan.
  • Maintaining indexes needs additional work (for example in triggers).

Creating indexes for many n-grams, for example five, six or more characters, could help a lot, because longer n-grams are connected with relative small number of phrases (see the first table in this text for comparison).

Download

  • functions.sql — data definition and sample PL/pgSQL procedures; some SQL or PL/pgSQL constructs may seem non-optimal, but were carefully chosen after many experiments with explain analyze

Full Search Text indexes

Let define a simple function that returns an array of all trigrams:

CREATE OR REPLACE FUNCTION trigrams_array(word text)
        RETURNS text[]
        IMMUTABLE STRICT
        LANGUAGE "plpgsql"
AS $$
        DECLARE
                result text[];
        BEGIN
                FOR i IN 1 .. length(word) - 2 LOOP
                        result := result || quote_literal(substr(lower(word), i, 3));
                END LOOP;

                RETURN result;
        END;
$$;

Now we are able to create a lexemes list (tsvector):

CREATE OR REPLACE FUNCTION trigrams_vector(text)
        RETURNS tsvector
        IMMUTABLE STRICT
        LANGUAGE "SQL"
AS $$
        SELECT array_to_string(trigrams_array($1), ' ')::tsvector;
$$;

A full text search query is represented by an object of type tsquery. The query is a set of lexemes and boolean operators (&, | and !). In n-grams searches it's required that all grams are present, so query has form 'bic' & 'icy' & 'cyc' & 'cle'.

The function that creates a query:

CREATE OR REPLACE FUNCTION trigrams_query(text)
        RETURNS tsquery
        IMMUTABLE STRICT
        LANGUAGE "SQL"
AS $$
        SELECT array_to_string(trigrams_array($1), ' & ')::tsquery;
$$;

An index can be create as a functional one using function trigrams_vector, or can be set on the dedicated table's column. The functional index requires the same function that is used in a condition. An extra column takes additional disc space, but when different full search queries are performed on the same table this may be the only solution.

PostgreSQL experiments

Sample data is a set of ~43000 titles of books from The Gutenberg Project, the postprocessed list is available for download. The table has size 3.4MB, here are some statistics:

SELECT COUNT(*), MIN(l), MAX(l), AVG(l) FROM (SELECT length(title) AS l FROM titles) AS foo;

 count | min | max |         avg
-------+-----+-----+---------------------
 43839 |   1 | 580 | 43.1125253769474669

Simple query, i.e. performing sequence scan, takes 230 ms:

SELECT count(*) FROM titles WHERE title ILIKE '%adventures%';

GIN index

Functional index using function trigrams_vector is defined as follows:

CREATE INDEX titles_trigrams_gin_idx ON titles
        USING GIN(trigrams_vector(title))
;

The index size is 18MB.

A SQL query have to use trigrams_vector and trigrams_query:

SELECT COUNT(*) FROM titles WHERE trigrams_vector(title) @@ trigrams_query('adventures');

The query plan:

Aggregate  (cost=104.52..104.53 rows=1 width=0)
  ->  Bitmap Heap Scan on titles  (cost=100.25..104.51 rows=1 width=0)
        Recheck Cond: (trigrams_vector(title) @@ ”'adv” & ”dve” & ”ven” & ”ent” & ”ntu” & ”tur” & ”ure” & ”res”'::tsquery)
        ->  Bitmap Index Scan on titles_trigrams_gin_idx  (cost=0.00..100.25 rows=1 width=0)
              Index Cond: (trigrams_vector(title) @@ ”'adv” & ”dve” & ”ven” & ”ent” & ”ntu” & ”tur” & ”ure” & ”res”'::tsquery)

The downside of GIN index is obligatory rechecking a query, because this index may produce false-positive results.

GIST index

GIST index:

CREATE INDEX titles_trigrams_gist_idx ON titles
        USING GIST(trigrams_vector(title))
;

The index size is 9.4MB.

The plan for the same SQL query as used in the GIN example:

Aggregate  (cost=8.57..8.58 rows=1 width=0)
  ->  Index Scan using titles_trigrams_gist_idx on titles  (cost=0.25..8.56 rows=1 width=0)
        Index Cond: (trigrams_vector(title) @@ ”'adv” & ”dve” & ”ven” & ”ent” & ”ntu” & ”tur” & ”ure” & ”res”'::tsquery)

Results

There are 25 different search phrases (picked from the titles). Searching has been performed 5 times, minimum running times were considered. The script fst_test contains all data; functions and queries is available for download.

Experiments were made on my old laptop (Pentium M @ 1.5GHz with 1.2GB RAM) using PosgreSQL 9.1 from Debian.

phrase GIN [ms] GIST [ms]
alice in wonderland 50.30 9.12
begin 0.83 280.83
build 0.79 90.42
cats 0.60 49.41
dog 0.44 327.40
eagle 0.46 44.51
europe 1.35 754.64
meaning of life 38.82 8.57
moon 0.51 94.09
old man 6.67 46.55
other stories 44.84 156.64
poland 7.85 51.66
razor 0.26 0.80
scientific 7.98 146.38
scientific american 29.89 80.84
seven 1.71 213.18
seven periods of english 69.81 3.12
Shakespear 4.02 61.99
sonnets 2.42 23.19
the odyssey 32.15 3.78
trees 1.23 81.14
whatever 4.61 14.36
wonder 3.17 145.17
world war 5.33 41.88
writer 4.20 91.95

Experiments made on my PC (Core2 @ 2.6GHz with 2GB RAM) using PosgreSQL 9.1 from Debian.

phrase GIN [ms] GIST [ms]
alice in wonderland 23.90 4.23
begin 0.36 111.20
build 0.35 37.87
cats 0.27 19.65
dog 0.19 128.52
eagle 0.20 18.94
europe 0.61 293.58
meaning of life 17.20 4.40
moon 0.22 37.34
old man 3.13 19.83
other stories 19.50 67.57
poland 3.42 20.94
razor 0.11 0.54
scientific 3.71 59.76
scientific american 13.22 34.19
seven 0.76 86.62
seven periods of english 31.58 2.23
Shakespear 1.88 25.70
sonnets 1.12 9.69
the odyssey 13.96 1.90
trees 0.57 31.89
whatever 2.11 5.80
wonder 1.45 60.38
world war 2.52 16.59
writer 1.93 36.96

Summary

(Times are given from the first results table.)

  • GIN search is much faster than GIST in most cases.
  • GIST is generally slow, average time is 40 ms (times larger than 100 ms rejected). GIN average time is 13 ms (calculated for all data samples).
  • However GIST search is faster than GIN when number of trigrams is larger.
  • In few cases GIST search is slower than sequence scan time (280/325/750 vs 230 ms).
  • GIN index search never reached sequence scan time, the worst time is 70 ms.
  • Size of GIN index is two times larger than GIST (18MB vs 9.4MB).

Download