Doing Statistics with SQL
This post covers how to perform some basic in-database statistical analysis using SQL.
By Jean-Nicholas Hould, JeanNicholasHould.com
I recently wrote an article on why you should learn SQL for data analysis. I received lots of feedback from this post. People wanted to see some concrete examples of using SQL for data analysis. I decided to apply some of my own advice and load up one of these awesome datasets to do some basic data exploration queries on it.
For this post, I am using an open data set. The million songs data sets is a «freely-available collection of audio features and metadata for a million contemporary popular music tracks».
If you want to follow along, here are the steps:
- Download the Million Song Data Set. I used the 10K songs subset.
- Download a SQLite Client
subset_track_metadata.dbfile with SQLite and start exploring
Why Descriptive Statistics?
When I start the analysis of a new data set, I run some basic queries to get a sense of how the data is organized, how the values are distributed. I am trying to understand what I’m dealing with. For numerical data, I will often run some descriptive statistics on the data set: I will measure the central tendency (mean, median, mode) and measure the level of variability of the data. Those measurements are generally a good start to data exploration. They will often lead to new questions that fuel my analysis.
The mean is the number you obtain when you sum up a given set of numbers and then divide this sum by the total number in the set. The mean is very sensible to outliers. It can be drastically affected by values that are much higher or lower compared to the rest of the data set.
CAST: Run-time data type conversion between compatible data types. In this case, I’m converting a float to an integer for rounding purpose.
AVG: Aggregation function that returns the mean of the input expression value.
as avg_year: Temporarily renames a column heading - This is only for readbility purpose, to make the code more human-friendly. I will use this aliasing throughout the post.
The median is the number separating the higher half of a ordered data set from the lower half. The median is sometimes a better measure of a mid point because each data point is weighted equally.
ORDER BY: Sorts the resulting data set by one or more column. The ordering can be ascending
ASC(default) or descending
COUNT: Aggregation function that returns the # of rows that matches a criteria.
LIMIT: Specifies the maximum number of rows that can be returned from the resulting data set.
OFFSET: Skip X rows before beginning to return rows. In this specific example, we are skipping 5000 rows (which is equal to the total row count
COUNT(*)divided by 2).
The mode is the value that appears most often in a set of data.
GROUP BY: Used with aggregation functions such as
AVG, etc. Groups the resulting data set by one or more column.
Minimum/Maximum value in a data set.
MIN: Aggregation function that returns the smallest value in a data set.
MAX: Aggregation function that returns the largest value in a data set.
Distribution of songs per year
Count of songs released in each years
The SQL queries in this post are fairly simple. They are not the result of technical gymmstastics. They are just simple measurements helping us understand the data set and that’s what’s great about them.
In this specific data set, we noticed that for more than half of the data set, the year of the song is equal to
0. This means we are either looking at a data set of very old songs or that we are dealing with missing values. The latter is more realistic. If we filter out songs from year
0, our data makes more sense. The songs are ranging from
2010 and the median is the year
With the data partially cleaned up, we can start exploring other columns of our data set and asking ourselves more questions: How many unique artists composed songs per year? How has that evolved through time? Are songs shorter nowadays than before? What’s great about those simple measurements is that they can be reused as our queries and filters get more and more complex. By applying the simple descriptive statistics measurements, we can have a good grasps of the data and we can keep exploring deeper and deeper.
Bio: Jean-Nicholas Hould is a Data Scientist from Montreal, Canada. Author at JeanNicholasHould.com.
Original. Reposted with permission.