What is Normal?
I saw an article recently that referred to the normal curve as the data scientist's best friend. We examine myths around the normal curve, including the belief that most data are normally distributed.
Is the Normal Curve Normal?
I saw an article recently that referred to the normal curve as the data scientist's best friend, and it is certainly true that the normal distribution is ubiquitous in classical statistical theory. Still, it's overrated.
Myth #1: Most data are normally distributed
A moment's reflection will show that this is not the case. First of all, many data are in binary form - survive/die, click/don't click, buy/don't buy, fraud/no fraud, etc. Even data that are frequently cited as examples of normal data are not - people's heights, for example. A histogram of heights will show a bump to the left of center - children. Taking away children, you still have different means for men and women, and people in different countries. Only when you define the group strictly and homogeneously - adult males in Guatemala, for example - does the distribution become normal (and that process of definition is guided by identifying what is not normal in the larger distribution).
IQ scores illustrate this myth well. There is no bell-shaped normal distribution more iconic than IQ scores. People tend to think of IQ as being normally distributed, but it is really just IQ scores that are normally distributed, IQ itself being a somewhat nebulous concept, existing in concrete form only in its metric.
And how do IQ scores get to be normally-distributed? The questions on the IQ tests get tweaked, added, and dropped so that the scores do not bunch too much at the low or high end, but are nicely distributed in a bell-shaped normal distribution.
(See the historical note below on the "error distribution" for an important distinction between the distribution of the original data, and the distribution of residuals.)
Myth #2: The normal distribution is central to statistical theory
It would be more accurate to say that, in classical statistics (that is to say, pre-computer statistics), the normal distribution and its cousin, the t-distribution, were essential approximations. In 1908, William Gosset ("Student") published his seminal Biometrika paper ("The Probable Error of a Mean") introducing the t-distribution.
It is worth reading Gosset's justification for using the normal distribution as the basis for approximating true distributions: convenience. Or, as Efron and Hastie put it (in Computer Age Statistical Inference), "mathematical tractability."
Gosset was interested in how different one sample might be from another, when drawn from the same population. He started by noting on cards the middle finger lengths of 3000 prison inmates - the data were available because, at the time, scientists were very interested in correlating physical traits with mental traits and criminal tendencies. Continuing, he drew out succeeding samples of 4, noting the mean. He finished by tabulating a frequency histogram, shown below.
Nowadays, though, resampling methods (permutation procedures, and the bootstrap) do a good job of approximating true sampling distributions, and without relying on assumptions of normality. The task of drawing thousands of samples and working with the results is now trivial.
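As a rough illustration of the resampling idea, here is a minimal bootstrap sketch in Python using only the standard library. The data are synthetic (a made-up, right-skewed exponential sample), and the function names, the number of resamples, and the 90% level are all choices made for this example, not anything prescribed above.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=2000, seed=0):
    """Return the means of n_resamples bootstrap resamples of data."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Draw n values from the data WITH replacement.
        resample = rng.choices(data, k=n)
        means.append(statistics.fmean(resample))
    return means


def percentile_interval(values, level=0.90):
    """Percentile interval: trim (1 - level) / 2 of the values from each tail."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    lo = ordered[int(tail * len(ordered))]
    hi = ordered[int((1 - tail) * len(ordered)) - 1]
    return lo, hi


# Synthetic, right-skewed data, invented purely for illustration.
data_rng = random.Random(42)
sample = [data_rng.expovariate(1.0) for _ in range(50)]

means = bootstrap_means(sample)
low, high = percentile_interval(means)

print(f"sample mean: {statistics.fmean(sample):.3f}")
print(f"90% bootstrap interval for the mean: ({low:.3f}, {high:.3f})")
```

Nothing in this procedure assumes the data or the sample mean follow a normal distribution; the interval comes directly from the spread of the resampled means.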
In fact, the advent of computational power has greatly extended the realm of statistics so far beyond the normal-theory based inference procedures that normal approximations are now just a useful, though hardly central, tool in an ever-expanding toolbox.
Myth #3: Normalizing data renders it normally-distributed
Normalizing or standardizing data is often used in analytical procedures, so that the scale on which the data are measured does not affect the results. If we are attempting to find clusters in data, for example, the analysis uses "distance between records" as a key metric. We usually would not want our results to differ depending on which unit (e.g. meters or kilometers) was used, but that will happen if we use the raw data. There are several different ways to put data on the same scale, and one common way is to subtract the mean and divide by the standard deviation. The result is called a z-score, and it allows you to compare the data to a standard normal distribution.
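The unit problem can be seen in a small sketch. The three records below, their values, and the helper names are hypothetical, made up for this example; the point is only that the nearest neighbor of record A changes when a length is expressed in kilometers instead of meters, and stops changing once each column is converted to z-scores.

```python
import math
import statistics


def standardize(column):
    """Convert a column to z-scores: subtract the mean, divide by the std dev."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(v - mean) / sd for v in column]


def nearest_to_first(names, rows):
    """Name of the record with the smallest Euclidean distance to the first record."""
    first = rows[0]
    distances = {
        name: math.dist(first, row) for name, row in zip(names[1:], rows[1:])
    }
    return min(distances, key=distances.get)


# Hypothetical records: a length and an age for each of A, B, C.
names = ["A", "B", "C"]
length_m = [1000.0, 1200.0, 1010.0]
age_years = [30.0, 31.0, 40.0]
length_km = [m / 1000 for m in length_m]

# Raw data: the answer depends on the unit chosen for length.
nearest_m = nearest_to_first(names, list(zip(length_m, age_years)))
nearest_km = nearest_to_first(names, list(zip(length_km, age_years)))

# Standardized data: the unit no longer matters.
nearest_std_m = nearest_to_first(
    names, list(zip(standardize(length_m), standardize(age_years)))
)
nearest_std_km = nearest_to_first(
    names, list(zip(standardize(length_km), standardize(age_years)))
)

print("raw, meters:", nearest_m)
print("raw, kilometers:", nearest_km)
print("standardized:", nearest_std_m, nearest_std_km)
```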
Normalizing the data in this way will not, however, make the data normally distributed. Subtracting a constant and dividing by a constant only shifts and rescales the values, so the data retain whatever general shape they had before the adjustment.
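A quick way to check this is to compare a shape measure before and after standardizing. The sketch below uses synthetic, right-skewed data (an exponential sample invented for the example) and a simple population skewness, the mean of the cubed z-scores, which is 0 for perfectly symmetric data; both the data and the helper names are assumptions made for illustration.

```python
import random
import statistics


def z_scores(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    return statistics.fmean([z ** 3 for z in z_scores(values)])


# Synthetic, strongly right-skewed data, made up for illustration.
rng = random.Random(7)
raw = [rng.expovariate(1.0) for _ in range(5000)]
standardized = z_scores(raw)

skew_raw = skewness(raw)
skew_std = skewness(standardized)

print(f"mean after standardizing:    {statistics.fmean(standardized):.6f}")
print(f"std dev after standardizing: {statistics.pstdev(standardized):.6f}")
print(f"skewness before: {skew_raw:.4f}")
print(f"skewness after:  {skew_std:.4f}")
```

The standardized values have mean 0 and standard deviation 1, but the skewness is unchanged (up to floating-point rounding), so the distribution is no closer to a bell curve than it was before.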
Historical note: The "Error Distribution"
The normal distribution was originally called the "error distribution," and applied to deviations from the mean in astronomical observations. And it was indeed this concept of the normal distribution of errors (residuals), rather than the original data, that drove the original wide applicability of normal theory in statistics.