Author: Wojciech Muła
Date: 2013-11-02
"Static table" means the data changes rarely; in my case it's a cache for a reporting system. The table has no foreign keys, constraints, etc. The cache is fed by a nightly cron job with a query looking like this:
```sql
TRUNCATE reports_cache;

INSERT INTO reports_cache (... list of columns ...)
    (SELECT ... FROM reports_cache_data(... some parameters ...));
```
reports_cache_data is a stored procedure that performs all the boring transformations; it contains many joins, filters out certain rows, etc. The cache table contains a lot of records.
The main column used by all reports is the "creation date"; range filtering on this column appears in virtually all queries. For example, an aggregating query looks like this:
```sql
SELECT idx1, count(counter1) AS quantity1_count
FROM reports_cache
WHERE date BETWEEN '2013-01-01' AND '2013-01-31'
GROUP BY idx1;
```
Of course there is an index on the date column.
If the range spans a short period of time, the query runs fast, but if it spans a few months, it can be quite slow. The reason for the low speed is the amount of data retrieved from disk. The presence of the index doesn't help here, because the database engine has to read all the matching data anyway to feed the sum aggregator; reading the data cannot be avoided. (Of course, such a big table can be partitioned, and some aggregations could be precalculated.)
At a lower level this means many disk reads, and if the rows are scattered over the disk, the reads cost additional seek time. In PostgreSQL each row has an implicit column called ctid (of type tid) containing the physical position of the row: the block number and the row index within the block.
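A quick way to see the scatter is to count how many distinct heap blocks a given date range touches, by extracting the block number from ctid. This is just a diagnostic sketch, not part of the original setup; the cast chain `ctid::text::point` is a common trick for pulling the block number out of a tid, and the column names follow the examples in this article:

```sql
-- How many distinct heap blocks hold January's rows?
-- (ctid::text::point)[0] extracts the block number from a tid like '(6,60)'.
SELECT count(DISTINCT ((ctid::text::point)[0])::bigint) AS blocks_touched,
       count(*)                                         AS rows_matched
FROM reports_cache
WHERE date BETWEEN '2013-01-01' AND '2013-01-31';
```

If blocks_touched is close to rows_matched, virtually every row costs a separate block read.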
Let's examine ctid values from the sample table:
```
SELECT ctid FROM reports_cache ORDER BY date LIMIT 10;

  ctid
---------
 (6,60)
 (10,91)
 (11,72)
 (20,8)
 (36,52)
 (42,12)
 (49,78)
 (52,37)
 (54,48)
 (59,73)
```
As we can see, these rows live in different blocks.
When querying a range of dates, an index scan is usually used. A B+-tree index keeps values ordered, so in terms of block reads the index scan acts similarly to the ORDER BY clause used in the example above.
If the rows were physically ordered by date, we would expect shorter read times thanks to minimized seek time and better use of the on-disk cache and probably the operating system cache at the I/O level.
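This can be observed directly with EXPLAIN; the BUFFERS option (available since PostgreSQL 9.0) reports how many blocks were hit in or read into shared buffers. A sketch, reusing the aggregating query from above:

```sql
-- Compare the "Buffers: shared hit=... read=..." lines
-- between the scattered and the ordered table.
EXPLAIN (ANALYZE, BUFFERS)
SELECT idx1, count(counter1) AS quantity1_count
FROM reports_cache
WHERE date BETWEEN '2013-01-01' AND '2013-01-31'
GROUP BY idx1;
```

On a physically ordered table the read counts for the same date range should be noticeably lower.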
PostgreSQL supports the CLUSTER command, which reorders tuples according to a given index, for example:
```sql
CLUSTER my_table USING date_index;
```
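Note that CLUSTER is a one-time rewrite: rows inserted afterwards are not kept in order, so for a table that is refilled periodically the command has to be re-run. PostgreSQL remembers which index was used, so later invocations are shorter (a sketch; my_table and date_index are the placeholder names from above):

```sql
-- first run: choose the clustering index
CLUSTER my_table USING date_index;

-- later runs reuse the remembered index
CLUSTER my_table;

-- refresh planner statistics after the rewrite
ANALYZE my_table;
```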
But this doubles the work: first we have to fill the table, then cluster the data. When the table is built from scratch, we can achieve the same effect by using ORDER BY in the query building the data:
```sql
CREATE TABLE reports_cache_ordered (LIKE reports_cache);

INSERT INTO reports_cache_ordered
    (SELECT * FROM reports_cache ORDER BY date);
```
Let's examine ctid for this table:
```
SELECT ctid FROM reports_cache_ordered ORDER BY date LIMIT 10;

  ctid
--------
 (0,1)
 (0,2)
 (0,3)
 (0,4)
 (0,5)
 (0,6)
 (0,7)
 (0,8)
 (0,9)
 (0,10)
```
Now the rows are stored on disk in date order.
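For the cache described at the beginning there is no need for a separate ordered table at all; it should be enough to append ORDER BY to the nightly query itself (a sketch; the elided column lists and parameters are as in the original query):

```sql
TRUNCATE reports_cache;

INSERT INTO reports_cache (... list of columns ...)
    (SELECT ... FROM reports_cache_data(... some parameters ...)
     ORDER BY date);
```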
Two tables were compared: the original reports_cache and the ordered reports_cache_ordered. The following query was examined:
```sql
SELECT * FROM table ORDER BY date LIMIT parameter1 OFFSET parameter2;
```
Each query was executed 5 times and the best running time was taken.
LIMIT | OFFSET | cache table time [ms] | ordered table time [ms] | ordered table speedup [%] |
---|---|---|---|---|
100 | 0 | 0.194 | 0.154 | 25.97 |
1000 | 0 | 0.696 | 0.394 | 76.65 |
10000 | 0 | 66.027 | 3.691 | 1688.86 |
20000 | 0 | 126.459 | 7.578 | 1568.76 |
50000 | 0 | 290.584 | 21.259 | 1266.88 |
100 | 100000 | 539.982 | 28.405 | 1801.01 |
1000 | 100000 | 546.095 | 22.566 | 2319.99 |
10000 | 100000 | 598.351 | 27.031 | 2113.57 |
20000 | 100000 | 654.423 | 30.171 | 2069.05 |
50000 | 100000 | 824.982 | 43.018 | 1817.76 |
100 | 200000 | 1069.146 | 45.186 | 2266.10 |
1000 | 200000 | 1074.740 | 45.423 | 2266.07 |
10000 | 200000 | 1129.873 | 48.953 | 2208.08 |
20000 | 200000 | 1183.656 | 52.939 | 2135.89 |
50000 | 200000 | 1350.786 | 64.667 | 1988.83 |
In this experiment a more complicated query was examined:
```sql
SELECT idx1, COUNT(count1), COUNT(count2), COUNT(count3), COUNT(count4)
FROM table
WHERE date BETWEEN start_date AND end_date
GROUP BY idx1;
```
Again, each query was executed 5 times and the best running time was taken.
start date | end date | days | cache table time [ms] | ordered table time [ms] | ordered table speedup [%] |
---|---|---|---|---|---|
2011-01-01 | 2011-02-01 | 31 | 105.981 | 24.267 | 336.73 |
2011-01-01 | 2011-03-01 | 59 | 127.806 | 46.871 | 172.68 |
2011-01-01 | 2011-06-01 | 151 | 188.580 | 113.715 | 65.84 |
2011-01-01 | 2011-09-01 | 243 | 252.568 | 180.529 | 39.90 |
2011-01-01 | 2012-01-01 | 365 | 340.549 | 275.296 | 23.70 |
2011-01-01 | 2013-01-01 | 731 | 629.485 | 624.822 | 0.75 |
So, is it worth the effort? In the scenario described at the beginning, definitely yes: it costs just one ORDER BY clause in a query executed once per day.