Descriptive Statistics: The Mighty Dwarf of Data Science
No other means of data description is more comprehensive than descriptive statistics, and with ever-increasing volumes of data and the growing need for low-latency decision making, its relevance will only continue to increase.
Nowadays a fair part of the community (often under pressure from the business) tends to apply complex and computationally expensive algorithms to problems that, in the past, would have been handled easily by much simpler (hence faster) and much more interpretable (hence of greater business value) techniques. In the series of texts to come I will try to introduce the power and beauty of descriptive statistics as an approach for quantitatively describing the nature of data and creating solid foundations for any subsequent data investigations. In this post I will introduce one of the weapons of the mighty dwarf: the “kurtosis”.
Consider a case where a monitoring system is to detect anomalies within the data. Typically, one may turn to the classic means of outlier analysis, such as DBSCAN-based approaches or LOF. There is nothing wrong with these; they may perfectly well point to where the outliers are. However, these techniques may require substantial computational resources to complete the task on high volumes of data in a reasonable amount of time. A much faster alternative may come from treating the case as a time series analysis problem. Data coming from a system operating in ‘healthy’ conditions would have a typical, acceptable amplitude distribution and, in such a scenario, any deviation from the expected shape may be considered a potential threat worth detecting.
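A very fast descriptive statistic aimed at summarizing the shape of the distribution of the signal is called the ‘kurtosis’. In mathematical terms it can be defined as below (Eq. 1):
$$k = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^4}{\left(\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2\right)^2} \tag{1}$$
where $n$ is the total number of samples in the data, $x_i$ is the $i$-th sample within the data and $\bar{x}$ is the sample mean of the data.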
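As a minimal sketch of this definition, kurtosis can be computed as the fourth central moment divided by the square of the second central moment. The helper name `kurtosis_eq1`, the random seed and the test data below are assumptions made for illustration; the result is compared against SciPy's `kurtosis` with `fisher=False`.

```python
import numpy as np
from scipy.stats import kurtosis


def kurtosis_eq1(x):
    """Fourth central moment divided by the squared second central moment."""
    x = np.asarray(x, dtype=float)
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (1/n variance)
    m4 = np.mean(deviations ** 4)  # fourth central moment
    return m4 / m2 ** 2


# Arbitrary illustrative data; the seed is only there for repeatability.
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)

manual = kurtosis_eq1(sample)
reference = kurtosis(sample, fisher=False)  # SciPy's non-excess kurtosis
print(manual, reference)
```

Both numbers should agree to floating-point precision, since SciPy's default (biased) estimator uses the same 1/n moments.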
Kurtosis is often thought of as a measure of the “peakedness” of the amplitude distribution of a signal. What does that mean? We are all well accustomed to the bell shape of the Gaussian distribution. Should a Gaussian signal pick up some additional samples of significantly greater amplitude (impulses), the tails of its distribution will widen. As a result, the entire distribution will look sharper, i.e. more peaked, than that of the pure Gaussian signal. I will present this concept in the following case study.
Consider the case study as follows: a shop recording the number of sold goods as a function of time, trying to automatically detect the presence of any abnormal demand.
Let’s synthesize the data as a Gaussian-distributed signal covering one thousand days, with the mean value centred at 10 (thousands of goods sold).
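One way to generate such data with NumPy is sketched below. The mean of 10 and the length of 1000 days come from the text; the standard deviation of 1 and the random seed are assumptions made only so that the example is repeatable.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed, for repeatability only

n_days = 1000      # one thousand days of records
mean_sales = 10.0  # thousands of goods sold per day
std_sales = 1.0    # assumed spread; not specified in the text

# One Gaussian-distributed sales figure per day
sales = rng.normal(loc=mean_sales, scale=std_sales, size=n_days)

print(sales[:5])
print(sales.mean(), sales.std())
```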
The data is shown in Figure 1.
For any normally distributed signal, the value of kurtosis fluctuates around 3. In the case of our shop:
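A sketch of this check with SciPy follows. The `sales` array is regenerated here with an assumed standard deviation and seed, so the exact figure will differ from the author's; `fisher=False` selects the convention in which a Gaussian signal gives a value near 3.

```python
import numpy as np
from scipy.stats import kurtosis

# Synthetic shop data: assumed seed and standard deviation
rng = np.random.default_rng(42)
sales = rng.normal(loc=10.0, scale=1.0, size=1000)

# fisher=False returns the raw kurtosis rather than the excess kurtosis
k_sales = kurtosis(sales, fisher=False)
print(f"kurtosis of sales: {k_sales:.3f}")
```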
Note: Some implementations of kurtosis use the raw output of Eq. 1, but often a value of 3 is subtracted from it, so that the kurtosis of a Gaussian signal is around 0. This variant is usually referred to as the “excess kurtosis”. The latter version is used in SciPy by default. Since I’m in favour of the classic “output of 3 for the Gaussian”, I am setting the “fisher” parameter of the kurtosis function to False.
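The difference between the two conventions can be seen in a short sketch; the data, seed and variable names are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import kurtosis

# Illustrative Gaussian data with an assumed seed
rng = np.random.default_rng(42)
signal = rng.normal(loc=10.0, scale=1.0, size=1000)

excess = kurtosis(signal)             # SciPy default: fisher=True (excess kurtosis)
raw = kurtosis(signal, fisher=False)  # raw kurtosis, about 3 for a Gaussian

print(f"excess kurtosis: {excess:.3f}")
print(f"raw kurtosis:    {raw:.3f}")
print(f"difference:      {raw - excess:.3f}")
```

The two values differ by the constant 3, so either convention can be used for detection as long as the threshold is chosen to match.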
The power and beauty of kurtosis, apart from its lightning-fast computation, lies in the normalization of the output: regardless of the absolute amplitude of the signal, i.e. the number of sold goods, the kurtosis value will remain around the expected 3 for a Gaussian signal. If we multiply the amplitude of our signal by 100, the kurtosis is still within the expected range:
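A quick sketch of this scale invariance, again on regenerated data with an assumed seed and standard deviation:

```python
import numpy as np
from scipy.stats import kurtosis

# Synthetic shop data: assumed seed and standard deviation
rng = np.random.default_rng(42)
sales = rng.normal(loc=10.0, scale=1.0, size=1000)

k_original = kurtosis(sales, fisher=False)
k_scaled = kurtosis(sales * 100, fisher=False)  # amplitude multiplied by 100

print(f"original: {k_original:.3f}")
print(f"scaled:   {k_scaled:.3f}")
```

The scaling factor cancels between the numerator and the denominator of the ratio, so the two values match up to floating-point rounding.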
Coming back to the case study: suppose that due to some seasonal variations in the demand, we are expecting to see some outlier counts in our sales. Suppose that on day 200 we saw a massive spike in the sold goods, followed by a very big drop on day 201 (Figure 2).
We would like to be able to detect such events automatically. This is where the mighty dwarf of data science comes in and says: “I’m happy to detect such an anomaly in no time! I’ll deploy my magic ‘kurtosis’ axe.”
Let’s create the data for the discussed situation:
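The sketch below builds such a signal and compares its kurtosis with that of the clean one. The spike and drop values (20 and 0), the use of array indices 200 and 201 for the two days, the standard deviation and the seed are all assumptions chosen for illustration; the name `sales_spike` follows the one used later in the text.

```python
import numpy as np
from scipy.stats import kurtosis

# Clean synthetic shop data: assumed seed and standard deviation
rng = np.random.default_rng(42)
sales = rng.normal(loc=10.0, scale=1.0, size=1000)

# Copy the clean signal and inject the two impulses (assumed values)
sales_spike = sales.copy()
sales_spike[200] = 20.0  # massive spike in sold goods
sales_spike[201] = 0.0   # very big drop the day after

k_sales = kurtosis(sales, fisher=False)
k_spike = kurtosis(sales_spike, fisher=False)

print(f"kurtosis without impulses: {k_sales:.3f}")
print(f"kurtosis with impulses:    {k_spike:.3f}")
```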
A very clear increase. A very clear change. The value of kurtosis grows with the amplitude of the impulses (steeply, rather than in strict proportion), which constitutes another very helpful property of this metric.
Mission accomplished with no head (CPU) losses: very quick detection.
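How kurtosis responds to the impulse amplitude can be checked empirically. In the sketch below the impulse sizes, the seed and the standard deviation are assumptions; in this setup the relationship comes out monotonic and noticeably steeper than linear.

```python
import numpy as np
from scipy.stats import kurtosis

# Clean synthetic shop data: assumed seed and standard deviation
rng = np.random.default_rng(42)
sales = rng.normal(loc=10.0, scale=1.0, size=1000)

amplitudes = [3.0, 5.0, 8.0, 10.0]  # assumed impulse sizes around the mean
results = []
for amp in amplitudes:
    spiked = sales.copy()
    spiked[200] = 10.0 + amp  # spike above the mean
    spiked[201] = 10.0 - amp  # drop below the mean
    results.append(kurtosis(spiked, fisher=False))

for amp, k in zip(amplitudes, results):
    print(f"impulse amplitude {amp:5.1f} -> kurtosis {k:.2f}")
```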
Let us come back to the informal definition of kurtosis as a measure of the peakedness of the distribution. Compare the histograms of ‘sales’ and ‘sales_spike’ shown in Figures 3 and 4 respectively. Even though the two histograms are generated from virtually identical signals, the only difference being the two impulses in ‘sales_spike’, the amplitude distributions look very different, and one of them (the one with the higher kurtosis value) clearly appears sharper than the other.
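The visual effect can be reproduced numerically. The sketch below bins both signals into 30 equal-width bins over their own ranges, as a default histogram plot would; the bin count, the impulse values, the seed and the standard deviation are assumptions.

```python
import numpy as np

# Synthetic shop data: assumed seed, spread and impulse values
rng = np.random.default_rng(42)
sales = rng.normal(loc=10.0, scale=1.0, size=1000)
sales_spike = sales.copy()
sales_spike[200] = 20.0
sales_spike[201] = 0.0

# Each histogram spans the full range of its own signal
counts_sales, edges_sales = np.histogram(sales, bins=30)
counts_spike, edges_spike = np.histogram(sales_spike, bins=30)

print("bin width:     ", edges_sales[1] - edges_sales[0], edges_spike[1] - edges_spike[0])
print("tallest bin:   ", counts_sales.max(), counts_spike.max())
print("non-empty bins:", (counts_sales > 0).sum(), (counts_spike > 0).sum())
```

The two impulses stretch the range of ‘sales_spike’, so the bulk of the data is squeezed into fewer, taller bins, which is what makes that histogram look sharper.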
Kurtosis is extremely efficient at detecting impulsive content within the data; it is normalized for differences in the amplitude of the data and is very fast to compute. However, it does suffer from some drawbacks, which I will cover in a later post.
Bio: Pawel Rzeszucinski received an MSc in Computer Science from Cranfield University and an MSc in Electronics from Wroclaw University of Technology. He subsequently moved to The University of Manchester, where he obtained a PhD on a project sponsored by QinetiQ related to data analytics for helicopter gearbox diagnostics. Upon returning to Poland he worked as a Senior Scientist at ABB’s Corporate Research Center and as a Senior Risk Modeler in Strategic Analytics at HSBC. Currently he is a Data Scientist at Codewise.