OLAP queries in SQL: A Refresher
Based on the recent book - Principles of Database Management - The Practical Guide to Storing, Managing and Analyzing Big and Small Data - this post examines how OLAP queries can be implemented in SQL.
By Wilfried Lemahieu, Seppe vanden Broucke, Bart Baesens
The below article is based on our recent book: Principles of Database Management - The Practical Guide to Storing, Managing and Analyzing Big and Small Data (See www.pdbmbook.com).
In this article, we zoom in on how OLAP queries can be implemented in SQL. To facilitate the execution of OLAP queries and data aggregation, SQL-99 introduced three extensions to the GROUP BY statement: the CUBE, ROLLUP and GROUPING SETS operator.
The CUBE operator computes a union of GROUP BY’s on every subset of the specified attribute types. Its result set represents a multidimensional cube based upon the source table. Consider the following SALES TABLE.
PRODUCT | QUARTER | REGION | SALES |
A | Q1 | Europe | 10 |
A | Q1 | America | 20 |
A | Q2 | Europe | 20 |
A | Q2 | America | 50 |
A | Q3 | America | 20 |
A | Q4 | Europe | 10 |
A | Q4 | America | 30 |
B | Q1 | Europe | 40 |
B | Q1 | America | 60 |
B | Q2 | Europe | 20 |
B | Q2 | America | 10 |
B | Q3 | America | 20 |
B | Q4 | Europe | 10 |
B | Q4 | America | 40 |
Examples SALESTABLE.
We can now formulate the following SQL query:
SELECT QUARTER, REGION, SUM(SALES) FROM SALESTABLE GROUP BY CUBE (QUARTER, REGION)
Basically, this query computes the union of 2² = 4 groupings of the SALESTABLE being: {(quarter,region), (quarter), (region), ()}, where () denotes an empty group list representing the total aggregate across the entire SALESTABLE. In other words, since quarter has 4 values and region 2 values, the resulting multiset will have 4*2+4*1+1*2+1 or 15 tuples as you can see illustrated in Table 1. NULL values have been added in the dimension columns Quarter and Region to indicate the aggregation that took place. They can be easily replaced by the more meaningful ‘ALL’ if desired. More specifically, we can add 2 CASE clauses as follows:
SELECT CASE WHEN grouping(QUARTER) = 1 THEN 'All' ELSE QUARTER END ASÂ QUARTER, CASE WHEN grouping(REGION) = 1 THEN 'All' ELSE REGION END AS REGION, SUM(SALES) FROM SALESTABLE GROUP BY CUBE (QUARTER, REGION)
The grouping() function returns 1 in case a NULL value is generated during the aggregation and 0 otherwise. This distinguishes the generated NULLs and the possible real NULLs stemming from the data. We will not add this to the subsequent OLAP queries so as to not unnecessarily complicate them.
Also, observe the NULL value for Sales in the fifth row. This represents an attribute combination which is not present in the original SALESTABLE since apparently no products were sold in Q3 in Europe. Remark that besides SUM() also other SQL aggregator functions such as MIN(), MAX(), COUNT() and AVG() can be used in the SELECT statement.
QUARTER | REGION | SALES |
Q1 | Europe | 50 |
Q1 | America | 80 |
Q2 | Europe | 40 |
Q2 | America | 60 |
Q3 | Europe | NULL |
Q3 | America | 40 |
Q4 | Europe | 20 |
Q4 | America | 80 |
Q1 | NULL | 130 |
Q2 | NULL | 100 |
Q3 | NULL | 40 |
Q4 | NULL | 90 |
NULL | Europe | 110 |
NULL | America | 250 |
NULL | NULL | 360 |
Table 1: Result from SQL query with Cube operator.
The ROLLUP operator computes the union on every prefix of the list of specified attribute types, from the most detailed up to the grand total. It is especially useful to generate reports containing both subtotals and totals. The key difference between the ROLLUP and CUBE operator is that the former generates a result set showing the aggregates for a hierarchy of values of the specified attribute types, whereas the latter generates a result set showing the aggregates for all combinations of values of the selected attribute types. Hence, the order in which the attribute types are mentioned is important for the ROLLUP but not for the CUBE operator. Consider the following query:
SELECT QUARTER, REGION, SUM(SALES) FROM SALESTABLE GROUP BY ROLLUP (QUARTER, REGION)
This query generates the union of three groupings {(quarter,region), (quarter}, ()} where () again represents the full aggregation. The resulting multiset will thus have 4*2+4+1 or 13 rows and is displayed in Table 2. You can see that the region dimension is first rolled up followed by the quarter dimension. Note the two rows which have been left out when compared to the result of the CUBE operator in Table 1.
QUARTER | REGION | SALES |
Q1 | Europe | 50 |
Q1 | America | 80 |
Q2 | Europe | 40 |
Q2 | America | 60 |
Q3 | Europe | NULL |
Q3 | America | 40 |
Q4 | Europe | 20 |
Q4 | America | 80 |
Q1 | NULL | 130 |
Q2 | NULL | 100 |
Q3 | NULL | 40 |
Q4 | NULL | 90 |
NULL | NULL | 360 |
Table 2: Result from SQL query with ROLLUP operator.
Whereas the previous example applied the GROUP BY ROLLUP construct to two completely independent dimensions, it can also be applied to attribute types that represent different aggregation levels (and hence different levels of detail) along the same dimension. For example, suppose the SALESTABLE tuples represented more detailed sales data at the individual city level and that the table contained three location related columns: City, Country and Region. We could then formulate the following ROLLUP query, yielding sales totals respectively per city, per country, per region and the grand total:
SELECT REGION, COUNTRY, CITY, SUM(SALES) FROM SALESTABLE GROUP BY ROLLUP (REGION, COUNTRY, CITY)
Note that in that case the SALESTABLE would include the attribute types City, Country and Region in a single table. Since the three attribute types represent different levels of detail in the same dimension, they are transitively dependent on one another, illustrating the fact that these data warehouse data are indeed denormalized.
The GROUPING SETS operator generates a result set equivalent to that generated by a UNION ALL of multiple simple GROUP BY clauses. Consider the following example:
SELECT QUARTER, REGION, SUM(SALES) FROM SALESTABLE GROUP BY GROUPING SETS ((QUARTER), (REGION))
This query is equivalent to:
SELECT QUARTER, NULL, SUM(SALES) FROM SALESTABLE GROUP BY QUARTER UNION ALL SELECT NULL, REGION, SUM(SALES) FROM SALESTABLE GROUP BY REGION
The result is given in Table 3.
QUARTER | REGION | SALES |
Q1 | NULL | 130 |
Q2 | NULL | 100 |
Q3 | NULL | 40 |
Q4 | NULL | 90 |
NULL | Europe | 110 |
NULL | America | 250 |
Table 3: Result from SQL query with GROUPING SETS operator
Multiple CUBE, ROLLUP and GROUPING SETS statements can be used in a single SQL query. Different combinations of CUBE, ROLLUP and GROUPING SETS can generate equivalent result sets. Consider the following query:
SELECT QUARTER, REGION, SUM(SALES) FROM SALESTABLE GROUP BY CUBE (QUARTER, REGION)
This query is equivalent to:
SELECT QUARTER, REGION, SUM(SALES) FROM SALESTABLE GROUP BY GROUPING SETS ((QUARTER, REGION), (QUARTER), (REGION), ())
Likewise, the following query:
SELECT QUARTER, REGION, SUM(SALES) FROM SALESTABLE GROUP BY ROLLUP (QUARTER, REGION)
is identical to:
SELECT QUARTER, REGION, SUM(SALES) FROM SALESTABLE GROUP BY GROUPING SETS ((QUARTER, REGION), (QUARTER),())
Given the amount of data to be aggregated and retrieved, OLAP SQL queries may get very time consuming. One way to speed up performance is by turning some of these OLAP queries into materialized views. For example, an SQL query with a CUBE operator can be used to precompute aggregations on a selection of dimensions of which the results can then be stored as a materialized view. A disadvantage of view materialization is that extra efforts are needed to regularly refresh these materialized views, although it can be noted that usually companies are fine with a close to current version of the data such that the synchronization can be done overnight or at fixed time intervals.
For more information, we are happy to refer to our recent book: Principles of Database Management - The Practical Guide to Storing, Managing and Analyzing Big and Small Data.
Bio: Wilfried Lemahieu is a professor at KU Leuven (Belgium) who's areas of research include Big data storage, integration and analytics, Data quality and Business process management and service orientation.
Seppe vanden Broucke received a PhD in Applied Economics at KU Leuven, Belgium in 2014. Currently, Seppe is working as an assistant professor at the department of Decision Sciences and Information Management at KU Leuven. Seppe's research interests include business data mining and analytics, machine learning, process management, process mining. His work has been published in well-known international journals and presented at top conferences.
Bart Baesens is an associate professor at KU Leuven, and a lecturer at the University of Southampton (United Kingdom). He has done extensive research on analytics, customer relationship management, web analytics, fraud detection, and credit risk management. His findings have been published in well-known international journals (e.g. Machine Learning, Management Science, IEEE Transactions on Neural Networks, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Evolutionary Computation, Journal of Machine Learning Research, …) and presented at international top conferences.
Related:
- YouTube videos on database management, SQL, Datawarehousing, Business Intelligence, OLAP, Big Data, NoSQL databases, data quality, data governance and Analytics – free
- Remote Data Science: How to Send R and Python Execution to SQL Server from Jupyter Notebooks
- SQL Cheat Sheet