Understanding Central Tendency

Learn the basics of important metrics used for measuring central tendency.



Understanding Central Tendency
Image by Editor

 

Central tendency is the property of data to be distributed about a characteristic value. In data science and statistics, the two most important measures of central tendency are the mean and median.

 

Mean

 

For a dataset with N observations, the mean value is computed by adding all the data values and dividing by N. The mean value is easy to compute, but is highly susceptible to the presence of outliers in the dataset.

 

Median

 

The median is an important measure of central tendency that is less susceptible to the presence of outliers. The median value for a dataset can be determined by sorting the dataset and then determining the middle value such that 50% of the dataset values are less than the median value, and 50% are greater than the median value.

 

Calculating Mean and Median for a Dataset

 

To illustrate the concept of central tendency, we calculate the mean and median for two datasets. The first dataset is a sample dataset with no outliers, and the second dataset is a sample dataset with outliers.
 

import numpy as np
import matplotlib.pyplot as plt

# generate some random data
np.random.seed(1)
data1 = np.random.uniform(0,10, 1000)
data2 = np.append(data1, np.linspace(150,200,100))
data2 = np.append(data2, np.linspace(15,25,10))
data = list([data1, data2])
fig, ax = plt.subplots()

# build a box plot
ax.boxplot(data)
ax.set_ylim(0,25)
xticklabels=['sample data', 'sample data with outliers']
ax.set_xticklabels(xticklabels)

# add horizontal grid lines
ax.yaxis.grid(True)

# show the plot
plt.show()

 

Understanding Central Tendency

 

Box plot showing sample data with and without outliers. The small open circles represent the outliers. Image by Author.

# mean and median of sample data with no outliers
np.mean(data1)

>>> 5.006045994559051

np.median(data1)

>>> 5.075008116147119

# mean and median of sample data with outliers
np.mean(data2)

>>> 20.455897292395537

np.median(data2)

>>> 5.565300519330409

 

We observe that the presence of outliers in the second dataset led to an increase in the mean value from 5.006 to 20.45, while the change in the median value from 5.075 to 5.565 was very small compared to the change in the mean value. This shows that the median value is a robust measure of central tendency as it is less susceptible to the presence of outliers in the dataset.

 

Summary

 

In summary, we have reviewed the two most important metrics for calculating central tendency. The mean value is easy to compute, but is highly susceptible to the presence of outliers in the dataset. The median is a robust measure of central tendency, and is less susceptible to the presence of outliers.
 
 
Benjamin O. Tayo is a Physicist, Data Science Educator, and Writer, as well as the Owner of DataScienceHub. Previously, Benjamin was teaching Engineering and Physics at U. of Central Oklahoma, Grand Canyon U., and Pittsburgh State U.