Working With Time Series Using SQL
This article is an overview of using SQL to manipulate time series data.
By Michael Grogan, Data Science Consultant
Source: Photo by Tumisu from Pixabay
Tools such as Python or R are most often used to conduct deep time series analysis.
However, knowledge of how to work with time series data using SQL is essential, particularly when working with very large datasets or data that is constantly being updated.
Here are some useful commands that can be invoked in SQL to better work with time series data within the data table itself.
Background
In this example, we are going to work with weather data collected across a range of different times and locations.
The data types in the table of the PostgreSQL database are as below:
weather=# SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'weatherdata';
column_name | data_type
-------------+-----------------------------
date | timestamp without time zone
mbhpa | integer
temp | numeric
humidity | integer
place | character varying
country | character varying
realfeel | integer
timezone | integer
(8 rows)
As you can see, date is defined as a timestamp data type without time zone (which we will also look at in this article).
The variable of interest is temp (temperature) — we are going to look at ways to analyse this variable more intuitively using SQL.
Calculating Moving Averages
Here is a snippet of some of the columns in the data table:
date | mbhpa | temp | humidity
---------------------+-------+-------+----------
2020-10-12 18:33:24 | 1010 | 13.30 | 74
2020-10-15 02:12:54 | 1017 | 7.70 | 75
2020-10-14 23:53:42 | 1016 | 8.80 | 75
2020-10-15 11:03:25 | 1016 | 9.70 | 75
2020-10-15 13:05:23 | 1017 | 12.30 | 74
2020-10-15 18:47:25 | 1015 | 12.10 | 74
2020-10-16 23:23:23 | 1011 | 9.10 | 75
2020-10-20 10:25:15 | 967 | 13.80 | 83
2020-10-27 16:30:30 | 980 | 12.00 | 75
2020-10-29 15:12:07 | 988 | 11.70 | 75
2020-10-28 18:42:52 | 990 | 8.80 | 77
Suppose we wish to calculate a moving average of temperature across different time periods.
To do this, we firstly need to make sure that the data is ordered by date, and decide on how many periods should be included in the averaging window.
To start with, a 7-period moving average is used, with all temperature values ordered by date.
>>> select date, avg(temp) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) FROM weatherdata where place='Place Name'; date | avg
---------------------+---------------------
2020-11-12 16:36:40 | 8.8285714285714286
2020-11-14 15:45:08 | 9.8142857142857143
2020-11-15 08:53:26 | 10.3857142857142857
2020-11-17 10:50:32 | 11.2285714285714286
2020-11-18 14:18:58 | 11.8000000000000000
2020-11-25 14:54:11 | 11.6285714285714286
2020-11-25 19:00:21 | 10.9142857142857143
2020-11-25 19:05:31 | 10.2000000000000000
2020-11-25 23:41:34 | 9.2857142857142857
2020-11-26 15:03:10 | 9.4857142857142857
2020-11-26 17:18:33 | 8.3428571428571429
2020-11-26 21:30:39 | 7.9142857142857143
2020-11-26 22:29:17 | 7.6142857142857143
Now, let’s add a 30 and 60-period moving average. We will store these averages along with the 7-period moving average in the one table.
>>> select date, temp, avg(temp) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), avg(temp) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), avg(temp) OVER (ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW), avg(temp) OVER (ORDER BY date ROWS BETWEEN 59 PRECEDING AND CURRENT ROW) FROM weatherdata where place='Place Name';
Source: Output Created by Author
More information on how to calculate moving averages within SQL can be found at the following resource by sqltrainingonline.com.
Working with time zones
You will notice that the timestamp contains a date and time. While this is fine when storing just one location in the table, things can get quite tricky when working with locations across multiple time zones.
Note that an integer variable named timezone was created within the table.
Suppose that we are analysing weather patterns across different places with a range of time zones ahead of the inputted time — in this case, all data points were inputted at GMT time.
date | timezone
---------------------+----------
2020-05-09 15:29:00 | 11
2020-05-09 17:05:00 | 11
2020-05-09 17:24:00 | 11
2020-05-10 13:02:00 | 11
2020-05-13 19:13:00 | 11
2020-05-10 13:04:00 | 11
2020-05-10 15:47:00 | 11
2020-05-13 19:10:00 | 11
2020-05-14 17:17:00 | 11
2020-05-09 15:20:00 | 5
2020-05-09 17:04:00 | 5
2020-05-09 17:25:00 | 5
2020-05-09 18:12:00 | 5
2020-05-10 13:02:00 | 5
2020-05-10 15:50:00 | 5
2020-05-10 20:32:00 | 5
2020-05-11 17:31:00 | 5
2020-05-13 19:11:00 | 5
2020-05-17 21:41:00 | 11
2020-05-15 14:08:00 | 11
2020-05-14 16:55:00 | 5
2020-05-15 14:10:00 | 5
(22 rows)
The new times can be calculated as follows:
weather=# select date + interval '1h' * timezone from weatherdata;
?column?
---------------------
2020-05-10 02:29:00
2020-05-10 04:05:00
2020-05-10 04:24:00
2020-05-11 00:02:00
2020-05-14 06:13:00
2020-05-11 00:04:00
2020-05-11 02:47:00
2020-05-14 06:10:00
2020-05-15 04:17:00
2020-05-09 20:20:00
2020-05-09 22:04:00
2020-05-09 22:25:00
2020-05-09 23:12:00
2020-05-10 18:02:00
2020-05-10 20:50:00
2020-05-11 01:32:00
2020-05-11 22:31:00
2020-05-14 00:11:00
2020-05-18 08:41:00
2020-05-16 01:08:00
2020-05-14 21:55:00
2020-05-15 19:10:00
(22 rows)
We can now store the new times as an updated variable, which we will name as newdate.
>>> select date + interval '1h' * (timezone+1) as newdate, temp, mbhpa from weatherdata; newdate | temp | mbhpa
--------------------+------+-------
2020-05-10 03:29:00 | 4.2 | 1010
2020-05-10 05:05:00 | 4.1 | 1009
2020-05-10 05:24:00 | 3.8 | 1009
This clause allows us to generate the updated times (which would reflect the actual time in each specific place when variables such as temperature, barometric pressure, etc, were recorded.
Inner Join and Having Clauses
You will notice in the above table that temperature values are included across a range of places.
Suppose that wind speed is also calculated for each place in a separate table.
In this case, we wish to calculate the average temperature across each listed place where the wind speed is higher than 20.
This can be accomplished using the inner join and having clauses as follows:
>>> select t1.place, avg(t1.temp), avg(t2.gust) from weatherdata as t1 inner join wind as t2 on t1.place=t2.place group by t1.place having avg(t2.gust)>'20'; place | avg | avg
-----------------+----------------------+----------------------
Place 1 | 17.3 | 22.4
Place 2 | 14.3 | 26.8
Place 3 | 7.1 | 27.1
Conclusion
In this article, you have been introduced to some introductory examples of using SQL to work with time series data.
In particular, you saw how to:
- Calculate moving averages
- Work with different time zones
- Calculate averages across different subsets of data
Many thanks for your time and any questions or feedback are greatly appreciated.
Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice. The findings and interpretations in this article are those of the author and are not endorsed by or affiliated with any third-party mentioned in this article.
Bio: Michael Grogan is a Data Science Consultant. He posesses expertise in time series analysis, statistics, Bayesian modeling, and machine learning with TensorFlow.
Original. Reposted with permission.
Related:
- Multidimensional multi-sensor time-series data analysis framework
- Rejection Sampling with Python
- Deep Learning Is Becoming Overused