Analyzing Diversity & Inclusion with SQL

The most underrated SQL function for analyzing diversity.



Analyzing Diversity & Inclusion with SQL
Image by Editor

 

Introduction

 

There has been a well-documented surge in positions related to Diversity, Equity, and Inclusion over the past 3-5 years. DEI Analysts may spend their time tracking, analyzing, and answering questions such as,

  • How do salaries compare across gender?
  • How do our departments rank in terms of diversity of race? 
  • Which positions and titles are the least diverse?

While DEI Analysts focus on answering different types of questions than Business Analysts, they still use the same technical skills and techniques.

Protected classes are typically categorical:  Gender, Race, Ethnicity, and Age (usually Age is broken down into categories)

Numeric data, such as salary, can be aggregated across protected classes with

  • Average
  • Median
  • Minimum
  • Maximum

When you analyze the combination of a Categorical and a Numeric variables, SQL makes it quite easy:

SELECT 
  ethnicity, 
  AVG(salary) as AVG_SALARY, 
  MEDIAN(salary) as MEDIAN_SALARY 
FROM 
  HRDATA 
GROUP BY 
  ethnicity

 

Ethnicity AVG_SALARY MEDIAN_SALARY
White $68,513 $60,050
African American $67,691 $55,114
Asian $68,842 $65,632

 

But what methods exist to analyze Categorical and Categorical variables together? The standard choices are quite limited:

  • Mode (most common)
  • Count Distinct
SELECT 
  department, 
  COUNT(1) AS employees, 
  COUNT(DISTINCT ethnicity) AS DISTINCT_ETHNICITY, 
  MODE(ethnicity) AS MOST_COMMON_ETHNICITY 
FROM 
  HRDATA 
GROUP BY 
  ethnicity

 

Department Employees Distinct Genders Most Common Gender
Sales 100 2 Male
IT 100 2 Male

 

At first glance, the departments look to be very similar. But how would you tell the difference between:

  • Sales has 99 male employees and 1 female employee
  • IT has 51 male employees and 49 female employees

Surely, we would consider the latter to be more diverse, but how would we know that quickly using SQL?

I am here to teach you about an underrated aggregate function called Entropy, which will help us quantify exactly how diverse each department is. 

 

Department Employees Distinct Genders Most Common Gender Entropy
Sales 100 2 Male 0.08
IT 100 2 Male 0.99

 

Unfortunately it isn’t as easy as simply doing SELECT Department, ENTROPY(ethnicity), but I will teach you the SQL logic, as well as add it into the open-source SQL Generator 5000, so that you can generate this SQL anytime you need it.

 

Some example HR data

 

Dr. Rich Huebner provides some sample HR data on Kaggle.com that we can use to explore some of the ways to analyze Diversity.

Let’s start by querying the data to compare the Position with the Race. We’ll begin with the basics: Count, Count Distinct, and Mode.

SELECT 
  POSITION, 
  COUNT(1) AS employees, 
  COUNT(DISTINCT RACEDESC) AS DISTINCT_RACE, 
  MODE(RACEDESC) AS MOST_COMMON_RACE 
FROM HR_DATA
 WHERE DATEOFTERMINATION IS NULL /*active employees*/
GROUP BY 
  POSITION
ORDER BY 2 DESC

Looking at the results, the 3 most popular positions appear to be very similar in diversity:


 

Entropy to the rescue

 

So, how would we rank these 3 departments in terms of diversity? This is where Entropy comes in.

 

What is Entropy?

 

Before we continue, let’s take a minute to understand what Entropy is, and how we can interpret it. The concept of Entropy is rooted deep in the study of Information Theory, and has many different applications including machine learning, thermodynamics, and cryptography. Therefore, if you look up the definition, it can be confusing.

Yet the most simple definition of entropy is something along the lines of this: Entropy is a numeric measure to describe how diverse something is.

Consider a bag of marbles that only have two colors: red, and blue.

Now, imagine we counted up the marbles in the bag and discovered that there were 99 blue and only 1 red marble. This situation is not very diverse, so the entropy of the bag is low. 

Next, imagine a bag that has 50 blue and 50 red marbles. This bag is very diverse, in fact it can’t get any more diverse. A bag with 51 blue and 49 red marbles is slightly less diverse. Therefore, this bag has high entropy.

Thus,

  • A bag with 100 blue marbles and 0 blue marbles is the least diverse:  Entropy = 0
  • A bag with 50 blue marbles and 50 red marbles is the most diverse:  Entropy = 1

Therefore, entropy reaches a maximum of 1 at 50/50. Here is a common plot of how Entropy changes with the Blue % of marbles:

 

Analyzing Diversity & Inclusion with SQL
Credit: https://commons.wikimedia.org/wiki/File:Binary_entropy_plot.png

 

To calculate entropy, we calculate the percentage for each color, and remember the formula:

 

Analyzing Diversity & Inclusion with SQL

 

Then, for a bag that is 70% blue marbles, we construct the formula like this:

 

Analyzing Diversity & Inclusion with SQL

 

When you extend this to more than 2 choices, you simply change the log’s base to match the number of possibilities.

 

Using SQL

 

This is a rather simple and efficient operation for SQL to handle.

The final result shows us that while neither position is perfectly diverse, Area Sales Manager is more diverse than Production Technician. 

We can confirm this visually by plotting this on a graph.

 

Analyzing Diversity & Inclusion with SQL

 

Conclusion

 

Entropy is a helpful method for describing diversity. It allows you to sort or rank departments, positions, or companies by combining those categories with a protected class such as race or gender. Even though the function doesn’t exist directly in most RDMBS, we can easily construct the SQL to calculate it. I believe it is important for anyone who works in Diversity, Equity, and Inclusion to use these calculations when looking at their oganizations workforce. Additionally, having worked in data for most of my career it’s great to see that the power of SQL works across all data informed teams, from traditional data analysts all the way to DEI analysts.
 
 
Josh Berry (@Twitter) leads Customer Facing Data Science at Rasgo and has been in the data and analytics profession since 2008. Josh spent 10 years at Comcast where he built the data science team and was a key owner of the internally developed Comcast feature store - one of the first feature stores to hit the market. Following Comcast, Josh was a critical leader in building out Customer Facing Data Science at DataRobot. In his spare time Josh performs complex analysis on interesting topics such as baseball, F1 racing, housing market predictions, and more.