Follow Gregory Piatetsky, No. 1 on LinkedIn Top Voices in Data Science & Analytics

KDnuggets Home » News » 2018 » Sep » Tutorials, Overviews » OLAP queries in SQL: A Refresher ( 18:n33 )

OLAP queries in SQL: A Refresher


Based on the recent book - Principles of Database Management - The Practical Guide to Storing, Managing and Analyzing Big and Small Data - this post examines how OLAP queries can be implemented in SQL.



By Wilfried Lemahieu, Seppe vanden Broucke, Bart Baesens

Principles of database management

The below article is based on our recent book: Principles of Database Management - The Practical Guide to Storing, Managing and Analyzing Big and Small Data (See www.pdbmbook.com).

In this article, we zoom in on how OLAP queries can be implemented in SQL. To facilitate the execution of OLAP queries and data aggregation, SQL-99 introduced three extensions to the GROUP BY statement: the CUBE, ROLLUP and GROUPING SETS operator.

The CUBE operator computes a union of GROUP BY’s on every subset of the specified attribute types.  Its result set represents a multidimensional cube based upon the source table.  Consider the following SALES TABLE.

PRODUCT QUARTER REGION SALES
A Q1 Europe 10
A Q1 America 20
A Q2 Europe 20
A Q2 America 50
A Q3 America 20
A Q4 Europe 10
A Q4 America 30
B Q1 Europe 40
B Q1 America 60
B Q2 Europe 20
B Q2 America 10
B Q3 America 20
B Q4 Europe 10
B Q4 America 40

Examples SALESTABLE.

We can now formulate the following SQL query:

 
SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY CUBE (QUARTER, REGION)

Basically, this query computes the union of 2² = 4 groupings of the SALESTABLE being: {(quarter,region), (quarter), (region), ()}, where () denotes an empty group list representing the total aggregate across the entire SALESTABLE.  In other words, since quarter has 4 values and region 2 values, the resulting multiset will have 4*2+4*1+1*2+1 or 15 tuples as you can see illustrated in Table 1.  NULL values have been added in the dimension columns Quarter and Region to indicate the aggregation that took place.  They can be easily replaced by the more meaningful ‘ALL’ if desired.  More specifically, we can add 2 CASE clauses as follows:

 
SELECT CASE WHEN grouping(QUARTER) = 1 THEN 'All' ELSE QUARTER END AS QUARTER, CASE WHEN grouping(REGION) = 1 THEN 'All' ELSE REGION END AS REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY CUBE (QUARTER, REGION)

The grouping() function returns 1 in case a NULL value is generated during the aggregation and 0 otherwise.  This distinguishes the generated NULLs and the possible real NULLs stemming from the data.  We will not add this to the subsequent OLAP queries so as to not unnecessarily complicate them.

Also, observe the NULL value for Sales in the fifth row.  This represents an attribute combination which is not present in the original SALESTABLE since apparently no products were sold in Q3 in Europe.  Remark that besides SUM() also other SQL aggregator functions such as MIN(), MAX(), COUNT() and AVG() can be used in the SELECT statement.

QUARTER REGION SALES
Q1 Europe 50
Q1 America 80
Q2 Europe 40
Q2 America 60
Q3 Europe NULL
Q3 America 40
Q4 Europe 20
Q4 America 80
Q1 NULL 130
Q2 NULL 100
Q3 NULL 40
Q4 NULL 90
NULL Europe 110
NULL America 250
NULL NULL 360

Table 1: Result from SQL query with Cube operator.

The ROLLUP operator computes the union on every prefix of the list of specified attribute types, from the most detailed up to the grand total.  It is especially useful to generate reports containing both subtotals and totals.  The key difference between the ROLLUP and CUBE operator is that the former generates a result set showing the aggregates for a hierarchy of values of the specified attribute types, whereas the latter generates a result set showing the aggregates for all combinations of values of the selected attribute types.  Hence, the order in which the attribute types are mentioned is important for the ROLLUP but not for the CUBE operator.  Consider the following query:

 
SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY ROLLUP (QUARTER, REGION)

This query generates the union of three groupings {(quarter,region), (quarter}, ()} where () again represents the full aggregation.  The resulting multiset will thus have 4*2+4+1 or 13 rows and is displayed in Table 2.  You can see that the region dimension is first rolled up followed by the quarter dimension.  Note the two rows which have been left out when compared to the result of the CUBE operator in Table 1.

QUARTER REGION SALES
Q1 Europe 50
Q1 America 80
Q2 Europe 40
Q2 America 60
Q3 Europe NULL
Q3 America 40
Q4 Europe 20
Q4 America 80
Q1 NULL 130
Q2 NULL 100
Q3 NULL 40
Q4 NULL 90
NULL NULL 360

Table 2: Result from SQL query with ROLLUP operator.

Whereas the previous example applied the GROUP BY ROLLUP construct to two completely independent dimensions, it can also be applied to attribute types that represent different aggregation levels (and hence different levels of detail) along the same dimension.  For example, suppose the SALESTABLE tuples represented more detailed sales data at the individual city level and that the table contained three location related columns: City, Country and Region.  We could then formulate the following ROLLUP query, yielding sales totals respectively per city, per country, per region and the grand total:

 
SELECT REGION, COUNTRY, CITY, SUM(SALES)
FROM SALESTABLE
GROUP BY ROLLUP (REGION, COUNTRY, CITY)

Note that in that case the SALESTABLE would include the attribute types City, Country and Region in a single table.  Since the three attribute types represent different levels of detail in the same dimension, they are transitively dependent on one another, illustrating the fact that these data warehouse data are indeed denormalized.

The GROUPING SETS operator generates a result set equivalent to that generated by a UNION ALL of multiple simple GROUP BY clauses.  Consider the following example:

 
SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY GROUPING SETS ((QUARTER), (REGION))

This query is equivalent to:

 
SELECT QUARTER, NULL, SUM(SALES)
FROM SALESTABLE
GROUP BY QUARTER
UNION ALL
SELECT NULL, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY REGION

The result is given in Table 3.

QUARTER REGION SALES
Q1 NULL 130
Q2 NULL 100
Q3 NULL 40
Q4 NULL 90
NULL Europe 110
NULL America 250

Table 3: Result from SQL query with GROUPING SETS operator

Multiple CUBE, ROLLUP and GROUPING SETS statements can be used in a single SQL query.  Different combinations of CUBE, ROLLUP and GROUPING SETS can generate equivalent result sets.  Consider the following query:

 
SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY CUBE (QUARTER, REGION)

This query is equivalent to:

 
SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY GROUPING SETS ((QUARTER, REGION), (QUARTER), (REGION), ())

Likewise, the following query:

 
SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY ROLLUP (QUARTER, REGION)

is identical to:

 
SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY GROUPING SETS ((QUARTER, REGION), (QUARTER),())

Given the amount of data to be aggregated and retrieved, OLAP SQL queries may get very time consuming.  One way to speed up performance is by turning some of these OLAP queries into materialized views.  For example, an SQL query with a CUBE operator can be used to precompute aggregations on a selection of dimensions of which the results can then be stored as a materialized view.  A disadvantage of view materialization is that extra efforts are needed to regularly refresh these materialized views, although it can be noted that usually companies are fine with a close to current version of the data such that the synchronization can be done overnight or at fixed time intervals.

For more information, we are happy to refer to our recent book: Principles of Database Management - The Practical Guide to Storing, Managing and Analyzing Big and Small Data.

Bio: Wilfried Lemahieu is a professor at KU Leuven (Belgium) who's areas of research include Big data storage, integration and analytics, Data quality and Business process management and service orientation.

Seppe vanden Broucke received a PhD in Applied Economics at KU Leuven, Belgium in 2014. Currently, Seppe is working as an assistant professor at the department of Decision Sciences and Information Management at KU Leuven. Seppe's research interests include business data mining and analytics, machine learning, process management, process mining. His work has been published in well-known international journals and presented at top conferences.

Bart Baesens is an associate professor at KU Leuven, and a lecturer at the University of Southampton (United Kingdom). He has done extensive research on analytics, customer relationship management, web analytics, fraud detection, and credit risk management. His findings have been published in well-known international journals (e.g. Machine Learning, Management Science, IEEE Transactions on Neural Networks, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Evolutionary Computation, Journal of Machine Learning Research, …) and presented at international top conferences.

Related:


Sign Up