Doing Statistics with SQL

This post covers how to perform some basic in-database statistical analysis using SQL.



By Jean-Nicholas Hould, JeanNicholasHould.com

I recently wrote an article on why you should learn SQL for data analysis. I received lots of feedback from this post. People wanted to see some concrete examples of using SQL for data analysis. I decided to apply some of my own advice and load up one of these awesome datasets to do some basic data exploration queries on it.

SQL

For this post, I am using an open data set. The million songs data sets is a «freely-available collection of audio features and metadata for a million contemporary popular music tracks».

If you want to follow along, here are the steps:

  1. Download the Million Song Data Set. I used the 10K songs subset.
  2. Download a SQLite Client
  3. Open subset_track_metadata.db file with SQLite and start exploring

Why Descriptive Statistics?

 
When I start the analysis of a new data set, I run some basic queries to get a sense of how the data is organized, how the values are distributed. I am trying to understand what I’m dealing with. For numerical data, I will often run some descriptive statistics on the data set: I will measure the central tendency (mean, median, mode) and measure the level of variability of the data. Those measurements are generally a good start to data exploration. They will often lead to new questions that fuel my analysis.

Central Tendency

 
Mean

The mean is the number you obtain when you sum up a given set of numbers and then divide this sum by the total number in the set. The mean is very sensible to outliers. It can be drastically affected by values that are much higher or lower compared to the rest of the data set.

SELECT CAST(AVG(songs.year) as int) as avg_year FROM songs

-- | avg_year |
-- |----------|
-- | 934      |


  • CAST: Run-time data type conversion between compatible data types. In this case, I’m converting a float to an integer for rounding purpose.
  • AVG: Aggregation function that returns the mean of the input expression value.
  • as avg_year: Temporarily renames a column heading - This is only for readbility purpose, to make the code more human-friendly. I will use this aliasing throughout the post.

Median

The median is the number separating the higher half of a ordered data set from the lower half. The median is sometimes a better measure of a mid point because each data point is weighted equally.

SELECT songs.year as median_year
FROM songs 
ORDER BY songs.year 
LIMIT 1 
OFFSET (SELECT COUNT(*) FROM songs) / 2

-- | median_year |
-- |-------------|
-- | 0           |


  • ORDER BY: Sorts the resulting data set by one or more column. The ordering can be ascending ASC (default) or descending DESC.
  • COUNT: Aggregation function that returns the # of rows that matches a criteria.
  • LIMIT: Specifies the maximum number of rows that can be returned from the resulting data set.
  • OFFSET: Skip X rows before beginning to return rows. In this specific example, we are skipping 5000 rows (which is equal to the total row countCOUNT(*) divided by 2).

Mode

The mode is the value that appears most often in a set of data.

SELECT 
    songs.year,
    COUNT(*) as count
FROM songs
GROUP BY songs.year
ORDER BY COUNT(*) DESC
LIMIT 1

-- | year | count |
-- |------|-------|
-- | 0    | 5320  |


  • GROUP BY: Used with aggregation functions such as COUNT, AVG, etc. Groups the resulting data set by one or more column.

Variability

 
Min/Max Values

Minimum/Maximum value in a data set.

SELECT 
    MIN(songs.year) as min_year,
    MAX(songs.year) as max_year
FROM
    songs

-- | min_year | max_year |
-- |----------|----------|
-- | 0        | 2010     |


  • MIN: Aggregation function that returns the smallest value in a data set.
  • MAX: Aggregation function that returns the largest value in a data set.

Distribution of songs per year

Count of songs released in each years

SELECT 
    songs.year,
    COUNT(*) songs_count
FROM songs
GROUP BY songs.year
ORDER BY songs.year ASC

-- | year | song_count |
-- |------|------------|
-- | 0    | 5320       |
-- | 1926 | 2          |
-- | 1927 | 3          |
-- | ...  | ...        |
-- | 2009 | 250        |
-- | 2010 | 64         |


Next Steps

 
The SQL queries in this post are fairly simple. They are not the result of technical gymmstastics. They are just simple measurements helping us understand the data set and that’s what’s great about them.

In this specific data set, we noticed that for more than half of the data set, the year of the song is equal to 0. This means we are either looking at a data set of very old songs or that we are dealing with missing values. The latter is more realistic. If we filter out songs from year 0, our data makes more sense. The songs are ranging from 1926 to 2010 and the median is the year 2001.

With the data partially cleaned up, we can start exploring other columns of our data set and asking ourselves more questions: How many unique artists composed songs per year? How has that evolved through time? Are songs shorter nowadays than before? What’s great about those simple measurements is that they can be reused as our queries and filters get more and more complex. By applying the simple descriptive statistics measurements, we can have a good grasps of the data and we can keep exploring deeper and deeper.

Bio: Jean-Nicholas Hould is a Data Scientist from Montreal, Canada. Author at JeanNicholasHould.com.

Original. Reposted with permission.

Related: