Working with time series data and psql

This is a collection of useful commands to deal with timeseries reports.

First lets create a table on a test database and populate id with some random data.

CREATE DATABASE time_testing;


CREATE TABLE time_data AS
SELECT 
    i as id, 
    now() - '6 months'::INTERVAL * random() AS timestamp,
    floor(random()*(10-1+1))+1 AS event_count
FROM generate_series(1,100000) i;

Now we have some data to work with. Since I want to test some queries I will focus on a time range for which there is not too much data and it can be manually checked. On my case, the firsts days of the time series do not have much data:

SELECT *
FROM time_data 
WHERE timestamp < '2023-12-31' 
    AND timestamp > '2023-12-30';

When I work with sql queries I usually start generating some simple queries which results I can easily test. Then I add clauses one by one while testing each step.

Count all events for a given day

SELECT 
    sum(event_count),
    (date_trunc('day', timestamp)) AS time_span
FROM time_data 
WHERE 
    timestamp < '2023-12-31' 
    AND timestamp > '2023-12-30'
GROUP BY time_span;


 sum |       time_span        
-----+------------------------
 212 | 2023-12-30 00:00:00+01

Group events per hour

SELECT 
    sum(event_count),
    (date_trunc('hour', timestamp)) AS time_span
FROM time_data 
WHERE 
    timestamp < '2023-12-31' 
    AND timestamp > '2023-12-30'
GROUP BY time_span;

 sum |       time_span        
-----+------------------------
  43 | 2023-12-30 21:00:00+01
  72 | 2023-12-30 22:00:00+01
  97 | 2023-12-30 23:00:00+01
(3 rows)

Here it is easy to see that the sum of all hours amounts to the total count above.

Get the average per hour for a given day

SELECT
    avg(ecs),
    (date_trunc('day', hourly_span)) AS daily_span
FROM
(
    SELECT 
        sum(event_count) AS ecs,
        (date_trunc('hour', timestamp)) AS hourly_span
    FROM time_data 
    WHERE 
        timestamp < '2023-12-31' 
        AND timestamp > '2023-12-30'
    GROUP BY hourly_span
)
GROUP BY daily_span;


        avg        |       daily_span       
-------------------+------------------------
 70.66666666666667 | 2023-12-30 00:00:00+01
(1 row)

Again, since we have little data, we can quickly make some numbers and check the results.

Get the hourly average, min and max for each day

SELECT
    (date_trunc('day', hourly_span)) AS daily_span,
    max(ecs) AS max,
    avg(ecs) AS hourly_avg,
    min(ecs) AS min
FROM
(
    SELECT 
        sum(event_count) AS ecs,
        (date_trunc('hour', timestamp)) AS hourly_span
    FROM time_data 
    WHERE 
        timestamp < '2024-01-01' 
    GROUP BY hourly_span
)
GROUP BY daily_span
ORDER BY daily_span DESC;


       daily_span       | max |    hourly_avg     | min 
------------------------+-----+-------------------+-----
 2023-12-31 00:00:00+01 | 182 |            119.25 |  59
 2023-12-30 00:00:00+01 |  97 | 70.66666666666667 |  43
(2 rows)

Again here we can see that the hourly average matches the calculated above, so we are on the right path.

Working with 10s intervals

So far the intervals used can be handled by date_truc, but it can not deal with things such as X * interval. The valid intervals can be found on the docs.

For dealing with the 10s interval we can use date_bin.

Lets start with 2h interval so the data can be verified:

SELECT 
    sum(event_count),
    (date_bin(
        '2 hours'::interval,
        timestamp,
        '2023-12-01'
    )) AS time_span
FROM time_data 
WHERE 
    timestamp < '2023-12-31' 
    AND timestamp > '2023-12-30'
GROUP BY time_span
ORDER BY time_span DESC;


 sum |       time_span        
-----+------------------------
 169 | 2023-12-30 22:00:00+01
  43 | 2023-12-30 20:00:00+01
(2 rows)

Seen that once again the total numbers sum up, we can move forward to the 10s interval

SELECT 
    sum(event_count),
    (date_bin(
        '10 seconds'::interval,
        timestamp,
        '2023-12-01'
    )) AS time_span
FROM time_data 
WHERE 
    timestamp < '2023-12-31' 
    AND timestamp > '2023-12-30'
GROUP BY time_span
ORDER BY time_span DESC;

 sum |       time_span        
-----+------------------------
   3 | 2023-12-30 23:56:30+01
   2 | 2023-12-30 23:54:10+01
   9 | 2023-12-30 23:51:50+01
   2 | 2023-12-30 23:51:20+01
   5 | 2023-12-30 23:48:50+01
...
(43 rows)

As we can see here there are some timestamps missing. Lets check if there is data in some of those ranges

SELECT *
FROM time_data 
WHERE timestamp < '2023-12-30 23:59:00' 
    AND timestamp > '2023-12-30 23:54:00';

  id   |           timestamp           | event_count 
-------+-------------------------------+-------------
 12300 | 2023-12-30 23:54:15.559432+01 |           2
 27226 | 2023-12-30 23:56:34.404232+01 |           3
(2 rows)

There are intervals for which there is no data. Here we are looking into random fake data, but this could be the case for a production system. There are many reasons why data might be missing, batch jobs only running on out of office schedules, systems been shut down at out of office hours, data comming at random intervals, performance issues, etc.

However, we might need to generate a report and need to fill those missing timestamps with a default value.

Obtaining a continuous timeseries

Here the trick is to first generate the continuous timeseries and then join it to our table.

Here is how the timeseries can be generated

SELECT generate_series(
  timestamp without time zone '2023-12-30',
  timestamp without time zone '2023-12-31',
  '10 seconds'
);

And here is the query to obtain all data

WITH table_cte AS (
    SELECT 
        sum(event_count) AS evc,
        to_char(date_bin(
            '10 seconds'::interval,
            timestamp,
            '2023-12-01'
        ), 'YYYY-MM-DD HH24:MI:SS') AS time_span
    FROM time_data 
    WHERE 
        timestamp < '2023-12-31' 
        AND timestamp > '2023-12-30'
    GROUP BY time_span
)


SELECT 
    coalesce(table_cte.evc, 0),
    table_cte.time_span,
    timeseries
FROM generate_series(
    timestamp without time zone '2023-12-30',
    timestamp without time zone '2023-12-31',
    '10 seconds'
) AS timeseries
LEFT JOIN table_cte ON table_cte.time_span = to_char(timeseries,  'YYYY-MM-DD HH24:MI:SS')
WHERE timeseries < '2023-12-30 23:57:00' 
    AND timeseries > '2023-12-30 23:54:00'
ORDER BY timeseries DESC;

 coalesce |      time_span      |     timeseries      
----------+---------------------+---------------------
        0 |                     | 2023-12-30 23:56:50
        0 |                     | 2023-12-30 23:56:40
        3 | 2023-12-30 23:56:30 | 2023-12-30 23:56:30
        0 |                     | 2023-12-30 23:56:20
        0 |                     | 2023-12-30 23:56:10
        0 |                     | 2023-12-30 23:56:00
        0 |                     | 2023-12-30 23:55:50
        0 |                     | 2023-12-30 23:55:40
        0 |                     | 2023-12-30 23:55:30
        0 |                     | 2023-12-30 23:55:20
        0 |                     | 2023-12-30 23:55:10
        0 |                     | 2023-12-30 23:55:00
        0 |                     | 2023-12-30 23:54:50
        0 |                     | 2023-12-30 23:54:40
        0 |                     | 2023-12-30 23:54:30
        0 |                     | 2023-12-30 23:54:20
        2 | 2023-12-30 23:54:10 | 2023-12-30 23:54:10
(17 rows)

Great, we can once again see that the query is working for us. The time_span column is the time interval from our exsisting table and the timeseries is the continuous serie generated for the report. So every row with a mising time_span value means that our table does not contain data for that range.