Descriptive Statistics Key Terms, Explained
This is a collection of 15 basic descriptive statistics key terms, explained in easy to understand language, along with an example and some Python code for computing simple descriptive statistics.
Statistics, though a central set of tools for data science, are often overlooked in favor of more solidly technical skills like programming. Even machine learning learning algorithms, with their reliance on mathematical concepts such as algebra and calculus -- not to mention statistics! -- are often treated at a higher level than is required to appreciate the underlying math, leading, perhaps, to "data scientists" who lack a fundamental understanding of one of the key aspects of their profession.
This post won't resolve the discrepancy between knowing and not knowing the absolute basics of statistics. However, if you are unable to fully understand the basic descriptive statistics terminology included herein, you are definitely lacking foundational knowledge that is needed to build a whole series of much more robust and useful professional concepts on top of.
So here is a collection of 15 basic descriptive statistics key terms, explained in easy to understand language. A comprehensive example follows, which includes a bit of Python code. Note that, as a basic introduction, mathematical representations and descriptions of the terms herein have been intentionally omitted.
1. Descriptive Statistics
Descriptive statistics are a collection of statistical tools which are used to quantitatively describe or summarize a collection of data. Descriptive statistics aim to summarize, and as such can be distinguished from inferential statistics, which are more predictive in nature.
A population is a selected individual or group representing the full set of members of a certain group of interest.
A sample is a subset drawn from a larger population. If this drawing is accomplished in such a manner that each member of the population has an equitable chance of selection, the result is referred to as a random sample.
A parameter is a value which is generated from a population. If I had all the data of all humans on Earth and generated the mean age, this value would be a parameter.
A statistic is a value which is generated from a sample. If I calculated the mean age of a subset of humans on planet Earth (much more feasible), this value would be a statistic. Hence, the discipline of statistics.
Generalizability refers tot he ability to draw conclusions about the characteristics of the population as a whole based on the results of data collected from a sample. This is ability is not a given, and depends heavily on the nature of sample collection, sample size, and various other factors.
A distribution is the arrangement of data by the values of one variable in order, from low to high. This arrangement, and its characteristics such as shape and spread, provide information about the underlying sample.
Mean, along with median and mode, is one of the 3 major measures of central tendency, which collectively evaluate an important and basic aspect of a distribution. The simple arithmetic average of a distribution of variable values (or scores), the mean provides a single, concise numerical summary of a distribution. The mean is also likely the most common statistics encountered in general research. Population mean is denoted μ, while sample mean is denoted x̄.
The median is the score of a distribution residing at the 50th percentile, separating the top and bottom 50 percent of scores. The median is useful for both splitting a set of distribution scores in half and helping to identify the skew of a distribution.
The mode is simply the score which appears most frequently in the distribution. Multimodal refers to a distribution with more than one mode; bimodal refers to a distribution with 2 modes.
When there are more scores toward one end of the distribution than the other, this results in skew. When the scores of a distribution are more clustered at the high end, the relatively fewer number of scores on the low end result in a tail, with the scenario being referred to as negative skew. Positive skew is when a distribution shows a tail at its high end.
In general, in a negatively skewed distribution, we would expect the mean to be less than the median, while in a positively skewed distribution, we would expect the mean to be greater than the median.
One of the most important measures of dispersion, the range is the difference between the maximum and minimum values of a distribution.
Variance is the statistical average of the dispersion of scores in a distribution. Variance is not often used on its own, but can be a useful calculation on the way to a more descriptive statistical measurement, such as standard deviation.
14. Standard Deviation
The standard deviation of a distribution is the average deviation between individual distribution scores and the distribution's mean. Individually, the standard deviation provides a good measure of how spread out a disquisitions scores are. When considered alongside the mean, these 2 measures provides a good overview of the distribution of scores.
15. Interquartile Range (IQR)
The IQR is the difference between the score delineating the 75th percentile and the 25th percentile, the third and first quartiles, respectively.
Below is a simple Python script for computing much of the descriptive statistics discussed above, followed by an example.
import numpy as np import matplotlib.pyplot as plt import scipy.stats dist = np.array([ 1, 4, 5, 6, 8, 8, 9, 10, 10, 11, 11, 13, 13, 13, 14, 14, 15, 15, 15, 15 ]) print('Descriptive statistics for distribution:\n', dist) print('Number of scores:', len(dist)) print('Number of unique scores:', len(np.unique(dist)) print('Sum:', sum(dist)) print('Min:', min(dist)) print('Max:', max(dist)) print('Range:', max(dist)-min(dist)) print('Mean:', np.mean(dist, axis=0)) print('Median:', np.median(dist, axis=0)) print('Mode:', scipy.stats.mode(dist)) print('Variance:', np.var(dist, axis=0)) print('Standard deviation:', np.std(dist, axis=0)) print('1st quartile:', np.percentile(dist, 25)) print('3rd quartile:', np.percentile(dist, 75)) print('Distribution skew:', scipy.stats.skew(dist)) plt.hist(dist, bins=len(dist)) plt.yticks(np.arange(0, 6, 1.0)) plt.title('Histogram of distribution scores') plt.show()
Descriptive statistics for distribution: [ 1 4 5 6 8 8 9 10 10 11 11 13 13 13 14 14 15 15 15 15] Number of scores: 20 Number of unique scores: 11 Sum: 210 Min: 1 Max: 15 Range: 14 Mean: 10.5 Median: 11.0 Mode: 15 Variance: 16.15 Standard deviation: 4.01870625948 1st quartile: 8.0 3rd quartile: 14.0 Distribution skew: -0.714152479663
Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.