Is the Normal Curve Normal?
I saw an article recently that referred to the normal curve as the data scientist's best friend, and it is certainly true that the normal distribution is ubiquitous in classical statistical theory. Still, it's overrated.
Myth #1: Most data are normally distributed
A moment's reflection will show that this is not the case. First of all, many data are in binary form: survive/die, click/don't click, buy/don't buy, fraud/no fraud, and so on. Even data that are frequently cited as examples of normal data are not; people's heights, for example. A histogram of heights will show a bump to the left of center: children. Even excluding children, you still have different means for men and women, and for people in different countries. Only when you define the group strictly and homogeneously (adult males in Guatemala, for example) does the distribution become normal, and that process of definition is guided by identifying what is not normal in the larger distribution.
IQ scores illustrate this myth well. There is no bell-shaped normal distribution more iconic than IQ scores. People tend to think of IQ as being normally distributed, but it is really only IQ scores that are normally distributed; IQ itself is a somewhat nebulous concept, existing in concrete form only in its metric.
And how do IQ scores get to be normally distributed? The questions on the IQ tests get tweaked, added, and dropped so that the scores do not bunch too much at the low or high end, but are nicely distributed in a bell-shaped normal distribution.
(See the historical note below on the "error distribution" for an important distinction between the distribution of the original data, and the distribution of residuals.)
Myth #2: The normal distribution is central to statistical theory
It would be more accurate to say that, in classical statistics (that is to say, pre-computer statistics), the normal distribution, and its cousin, the t-distribution, were essential approximations. In 1908, William Gosset ("Student") published his seminal Biometrika paper ("The Probable Error of the Mean") introducing the t-distribution.
It is worth reading Gosset's justification for using the normal distribution as the basis for approximating true distributions: convenience. Or, as Efron and Hastie put it (in Computer Age Statistical Inference), "mathematical tractability."
Gosset was interested in how different one sample might be from another when drawn from the same population. He started by noting on cards the middle-finger lengths of 3,000 prison inmates; the data were available because, at the time, scientists were very interested in correlating physical traits with mental traits and criminal tendencies. He then drew successive samples of 4, noting the mean of each. He finished by tabulating the sample means in a frequency histogram.
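Gosset's card-shuffling experiment is easy to replay in a few lines of code. The sketch below uses a simulated stand-in for his finger-length data (his actual measurements are not reproduced here); the procedure, repeatedly drawing small samples and recording their means, is the same.

```python
import random
import statistics

random.seed(1)

# Hypothetical stand-in for Gosset's 3,000 finger-length measurements
# (values in cm; the real data are not reproduced here).
population = [random.gauss(11.5, 0.6) for _ in range(3000)]

# Repeat Gosset's experiment: draw many samples of 4 and record
# each sample's mean.
sample_means = [
    statistics.mean(random.sample(population, 4)) for _ in range(750)
]

# A frequency histogram of sample_means approximates the sampling
# distribution of the mean for samples of size 4.
print(statistics.mean(sample_means), statistics.stdev(sample_means))
```

What took Gosset weeks of hand tabulation now runs in a fraction of a second, which is the point of the next paragraph.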
Nowadays, though, resampling methods (permutation procedures, and the bootstrap) do a good job of approximating true sampling distributions, and without relying on assumptions of normality. The task of drawing thousands of samples and working with the results is now trivial.
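As a concrete illustration of that point, here is a minimal bootstrap sketch, using simulated skewed data (any raw sample would do): resample with replacement, record the mean each time, and read a confidence interval straight off the resulting distribution, with no normality assumption anywhere.

```python
import random
import statistics

random.seed(0)

# Hypothetical skewed sample (e.g., exponential-like waiting times).
data = [random.expovariate(1.0) for _ in range(200)]

# Bootstrap: resample with replacement many times, recording the mean
# each time, to approximate the sampling distribution of the mean.
boot_means = sorted(
    statistics.mean(random.choices(data, k=len(data)))
    for _ in range(5000)
)

# An approximate 95% interval: the 2.5th and 97.5th percentiles
# of the bootstrap means.
lo, hi = boot_means[124], boot_means[4874]
print(f"bootstrap 95% interval for the mean: ({lo:.3f}, {hi:.3f})")
```

A permutation test follows the same pattern: instead of resampling one group, you repeatedly shuffle the group labels and recompute the test statistic.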
In fact, the advent of computational power has extended the realm of statistics so far beyond normal-theory-based inference procedures that normal approximations are now just a useful, though hardly central, tool in an ever-expanding toolbox.
Myth #3: Normalizing data renders it normally distributed
Normalizing or standardizing data is often used in analytical procedures so that the scale on which the data are measured does not affect the results. If we are attempting to find clusters in data, for example, the analysis uses "distance between records" as a key metric. We usually would not want our results to differ depending on which units (e.g., meters or kilometers) were used, but that will happen if we use the raw data. There are several ways to put data on the same scale, and one common way is to subtract the mean and divide by the standard deviation. The result is also called a z-score, and it allows you to compare the data to a standard normal distribution.
Normalizing the data in this way will not, however, make the data normally distributed. The data will retain whatever general shape they had before the adjustment.
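This is easy to verify directly. In the sketch below (using made-up right-skewed values), standardizing shifts the mean to 0 and the standard deviation to 1, yet the skewness, a simple measure of shape, is unchanged.

```python
import statistics

# Hypothetical right-skewed data (e.g., incomes in $1,000s).
data = [12, 15, 18, 20, 22, 25, 30, 45, 80, 150]

mean = statistics.mean(data)
sd = statistics.stdev(data)

# Standardize to z-scores: subtract the mean, divide by the
# standard deviation.
z = [(x - mean) / sd for x in data]

def skewness(xs):
    """Sample skewness (adjusted Fisher-Pearson estimate)."""
    m, s, n = statistics.mean(xs), statistics.stdev(xs), len(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in xs)

# The z-scores have mean 0 and standard deviation 1, but the skewness
# (i.e., the shape) is the same as in the raw data.
print(skewness(data), skewness(z))
```

Standardizing is a linear transformation, and no linear transformation can turn a skewed distribution into a symmetric one.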
Historical note: The "Error Distribution"
The normal distribution was originally called the "error distribution," and was applied to deviations from the mean in astronomical observations. And it was indeed this concept of a normal distribution of errors (residuals), rather than of the original data, that drove the original wide applicability of normal theory in statistics.