PostgreSQL — faster reads from static tables

Author: Wojciech Muła
Date: 2013-11-02

Introduction

"Static table" means that data is changed rarely, in my case it's a cache for some reporting system. Table has no any foreign keys, constraints etc. This cache is feed by night cron by a query looking like this:

TRUNCATE reports_cache;
INSERT INTO reports_cache
        (... list of columns ...)
        (SELECT ... FROM reports_cache_data(... some parameters ...));

reports_cache_data is a stored procedure that performs all the boring transformations: it contains many joins, filters out certain rows, etc. The cache table contains a lot of records.

The main column used by all reports is the "creation date"; range filtering on this column appears in virtually all queries. For example, an aggregating query looks like this:

SELECT
        idx1,
        count(counter1) as quantity1_count
FROM
        reports_cache
WHERE
        date BETWEEN '2013-01-01' AND '2013-01-31'
GROUP BY
        idx1;

Of course there is an index on the date field.
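
For reference, such an index could be created like this (the index name is just an example, not taken from the real schema):

CREATE INDEX reports_cache_date_idx ON reports_cache (date);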

Problem and solution

If a range spans a short period of time, the query runs fast, but if it spans a few months, it can be quite slow. The reason for the low speed is the amount of data retrieved from the disk. The presence of the index doesn't change much, because the database engine has to read all the matching data anyway to feed the aggregator; reading the data cannot be avoided. (Of course, such a big table could be partitioned and some aggregations could be precalculated.)

At a lower level this means many disk reads, and if the rows are scattered over the disk, the reads cost additional seek time. In PostgreSQL each row has an implicit system column called ctid (of type tid) containing the physical position of the row: the block number and the row index within the block.

Let's examine the ctid values from the sample table:

SELECT ctid FROM reports_cache ORDER BY date LIMIT 10;

  ctid
---------
 (6,60)
 (10,91)
 (11,72)
 (20,8)
 (36,52)
 (42,12)
 (49,78)
 (52,37)
 (54,48)
 (59,73)

As we can see, these rows live in different blocks.

When querying a range of dates, an index scan is usually used. A B+-tree index keeps values ordered, so in terms of block reads the index scan acts similarly to the ORDER BY clause used in the example above.
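
Which plan the planner actually chooses for a given range can be checked with EXPLAIN, for example:

EXPLAIN
SELECT * FROM reports_cache
 WHERE date BETWEEN '2013-01-01' AND '2013-01-31';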

If the rows were physically ordered by date, we would expect shorter read times thanks to minimized seek time and better use of the on-disk cache and probably the operating system cache at the I/O level.

PostgreSQL supports the CLUSTER command, which reorders tuples according to a given index, for example:

CLUSTER my_table USING date_index

But this doubles the work: first we have to fill the table, then cluster the data. Since the table is built from scratch anyway, we can achieve the same effect with an ORDER BY in the query that builds the data:

CREATE TABLE reports_cache_ordered (LIKE reports_cache);

INSERT INTO reports_cache_ordered (SELECT * FROM reports_cache ORDER BY date);

Let's examine the ctid values for this table:

SELECT ctid FROM reports_cache_ordered ORDER BY date LIMIT 10;

  ctid
--------
 (0,1)
 (0,2)
 (0,3)
 (0,4)
 (0,5)
 (0,6)
 (0,7)
 (0,8)
 (0,9)
 (0,10)

Now the rows are stored on disk in date order.
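
As a side check (not used in the measurements below), the pg_stats view exposes a correlation statistic which, after ANALYZE, should be close to 1 for the date column of the ordered table:

ANALYZE reports_cache_ordered;

SELECT attname, correlation
  FROM pg_stats
 WHERE tablename = 'reports_cache_ordered' AND attname = 'date';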

Experiments

There are two tables: reports_cache, filled as described in the introduction, and reports_cache_ordered, which contains the same data physically ordered by date.

Simple reads

The following query was examined, where table is one of the two tables:

SELECT * FROM table
 ORDER BY date
 LIMIT parameter1
OFFSET parameter2

Each query was executed 5 times and the best running time was taken; the speedup column shows how much faster the ordered table was than the original cache table.
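
For example, with reports_cache substituted for table, the combination LIMIT 100 / OFFSET 100000 becomes:

SELECT * FROM reports_cache
 ORDER BY date
 LIMIT 100
OFFSET 100000;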

 LIMIT   OFFSET   cache table [ms]   ordered table [ms]   speedup [%]
------   ------   ----------------   ------------------   -----------
   100        0              0.194                0.154         25.97
  1000        0              0.696                0.394         76.65
 10000        0             66.027                3.691       1688.86
 20000        0            126.459                7.578       1568.76
 50000        0            290.584               21.259       1266.88
   100   100000            539.982               28.405       1801.01
  1000   100000            546.095               22.566       2319.99
 10000   100000            598.351               27.031       2113.57
 20000   100000            654.423               30.171       2069.05
 50000   100000            824.982               43.018       1817.76
   100   200000           1069.146               45.186       2266.10
  1000   200000           1074.740               45.423       2266.07
 10000   200000           1129.873               48.953       2208.08
 20000   200000           1183.656               52.939       2135.89
 50000   200000           1350.786               64.667       1988.83

Aggregation

In this experiment a more complicated query was examined:

SELECT
        idx1,
        COUNT(count1),
        COUNT(count2),
        COUNT(count3),
        COUNT(count4)
 FROM
        table
 WHERE
        date BETWEEN start_date AND end_date
 GROUP BY idx1

Each query was executed 5 times and the best running time was taken.

start date   end date     days   cache table [ms]   ordered table [ms]   speedup [%]
----------   ----------   ----   ----------------   ------------------   -----------
2011-01-01   2011-02-01     31            105.981               24.267        336.73
2011-01-01   2011-03-01     59            127.806               46.871        172.68
2011-01-01   2011-06-01    151            188.580              113.715         65.84
2011-01-01   2011-09-01    243            252.568              180.529         39.90
2011-01-01   2012-01-01    365            340.549              275.296         23.70
2011-01-01   2013-01-01    731            629.485              624.822          0.75
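
The shrinking speedup agrees with the block-read explanation: the widest range covers nearly the whole table, so both variants read almost every block anyway. The number of blocks actually touched can be inspected with EXPLAIN and the BUFFERS option (a verification sketch, not part of the measurements above):

EXPLAIN (ANALYZE, BUFFERS)
SELECT
        idx1,
        COUNT(count1),
        COUNT(count2),
        COUNT(count3),
        COUNT(count4)
 FROM
        reports_cache
 WHERE
        date BETWEEN '2011-01-01' AND '2011-02-01'
 GROUP BY idx1;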

Summary

So, is it worth playing with? Definitely yes in the scenario described at the beginning: it costs just one ORDER BY clause in a query executed once per day.
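
In practice this means the nightly query from the introduction only needs the ordering appended:

TRUNCATE reports_cache;
INSERT INTO reports_cache
        (... list of columns ...)
        (SELECT ... FROM reports_cache_data(... some parameters ...) ORDER BY date);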
