Removing Outliers Using Standard Deviation in Python
Standard Deviation is one of the most underrated statistical tools out there. It’s an extremely useful metric that most people know how to calculate but very few know how to use effectively.
Standard Deviation: A Quick Recap
Standard deviation is a measure of spread, i.e. how far the individual data points lie from the mean.
For example, consider the two data sets:
27 23 25 22 23 20 20 25 29 29
12 31 31 16 28 47 9 5 40 47
Both have similar means: 24.3 for the first and 26.6 for the second. However, the first dataset has values sitting close to its mean, while the second dataset has values that are much more spread out.
To be more precise, the population standard deviation of the first dataset is 3.13, and of the second it is 14.68.
However, it's not easy to wrap your head around numbers like 3.13 or 14.68 on their own. Right now, all we know is that the second dataset is more “spread out” than the first one.
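You can verify these figures with Python's built-in statistics module (pstdev computes the population standard deviation, which is what the numbers above reflect):

```python
from statistics import pstdev

first = [27, 23, 25, 22, 23, 20, 20, 25, 29, 29]
second = [12, 31, 31, 16, 28, 47, 9, 5, 40, 47]

# Population standard deviation of each dataset
print(round(pstdev(first), 2))   # 3.13
print(round(pstdev(second), 2))  # 14.68
```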
Let’s put this to a more practical use.
What Is the Normal Distribution?
When we perform analytics, we often come across data that follows a pattern: values cluster around a mean, with roughly equal numbers of results below and above it, e.g.
- heights of people
- blood pressure readings
- test marks
Such values follow a normal distribution.
According to the Wikipedia article on normal distribution, about 68% of values drawn from a normal distribution are within one standard deviation σ away from the mean; about 95% of the values lie within two standard deviations; and about 99.7% are within three standard deviations.
This fact is known as the 68-95-99.7 (empirical) rule, or the 3-sigma rule.
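You can sanity-check the empirical rule by sampling from a normal distribution; the sample size and seed below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

# Fraction of samples within k standard deviations of the mean
fractions = {k: float(np.mean(np.abs(samples) <= k)) for k in (1, 2, 3)}
for k, frac in fractions.items():
    print(f"within {k} SD: {frac:.3f}")
```

The printed fractions land very close to 0.68, 0.95, and 0.997, matching the rule.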
Remove Outliers Using Normal Distribution and Standard Deviation
I applied this rule successfully when I had to clean up data generated by millions of IoT devices in heating equipment. Each data point contained the electricity usage at a point in time.
However, sometimes the devices weren’t 100% accurate and would give very high or very low values.
We needed to remove these outlier values because they were making the scales on our graph unrealistic. The challenge was that the number of these outlier values was never fixed. Sometimes we would get all valid values and sometimes these erroneous readings would cover as much as 10% of the data points.
Our approach was to remove the outlier points by eliminating any points that were above (Mean + 2*SD) and any points below (Mean - 2*SD) before plotting the frequencies.
You don’t have to use 2, though; you can tweak the multiplier to get an outlier-detection formula that works better for your data.
Here’s an example using Python. The dataset is a classic normal distribution, but it also contains a few values like 10 and 20 that would disturb our analysis and ruin the scales on our graphs.
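A minimal sketch of the approach looks like this; the synthetic dataset, its parameters, and the random seed are assumptions for illustration:

```python
import numpy as np

# A mostly normal dataset (mean ~500, SD ~100) with a few
# erroneous readings (10, 20, 1500) mixed in.
rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(500, 100, 90), [10, 20, 1500]])

mean = data.mean()
sd = data.std()

# Keep only the points within two standard deviations of the mean
filtered = [x for x in data if (mean - 2 * sd) < x < (mean + 2 * sd)]

print(filtered)
```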
After filtering, the dataset looks like this:
[386, 479, 627, 523, 482, 483, 542, 699, 535, 617, 577, 471, 615, 583, 441, 562, 563, 527, 453, 530, 433, 541, 585, 704, 443, 569, 430, 637, 331, 511, 552, 496, 484, 566, 554, 472, 335, 440, 579, 341, 545, 615, 548, 604, 439, 556, 442, 461, 624, 611, 444, 578, 405, 487, 490, 496, 398, 512, 422, 455, 449, 432, 607, 679, 434, 597, 639, 565, 415, 486, 668, 414, 665, 557, 304, 404, 454, 689, 610, 483, 441, 657, 590, 492, 476, 437, 483, 529, 363, 711, 543]
As you can see, the outlier values are gone, and a plot of this dataset will look much better. I wouldn’t recommend this method for every statistical analysis, though: outliers have an important function in statistics, and they are there for a reason!
But in our case, the outliers were clearly caused by errors in the data, and the data followed a normal distribution, so filtering by standard deviation made sense.
Punit Jajodia is an entrepreneur and software developer from Kathmandu, Nepal. Versatility is his biggest strength, as he has worked on a variety of projects from real-time 3D simulations on the browser and big data analytics to Windows application development. He's also the co-founder of Programiz.com, one of the largest tutorial websites on Python and R.