Descriptive Statistics: The Mighty Dwarf of Data Science – Crest Factor
No other means of data description is more comprehensive than descriptive statistics, and with ever-increasing volumes of data and the growing need for low-latency decision making, its relevance will only continue to increase.
By Pawel Rzeszucinski, Codewise.com
Nowadays a fair part of the community (often under pressure from the business) tends to apply complex and computationally expensive algorithms to problems that could easily have been handled in the past by much simpler (hence faster) and much more interpretable (hence of greater business value) techniques. In this series of texts I am introducing the power and beauty of descriptive statistics as an approach for quantitatively describing the nature of data and creating solid foundations for any subsequent data investigations. In this post I will introduce another weapon of the mighty dwarf, similar in purpose to kurtosis, the technique from my previous post, but with a slightly different power of statistical destruction: the “crest factor”. You will also get to meet the scary Dragon of Demand.
In the previous post we used kurtosis to detect impulsive content within the data, caused by seasonal variations in the demand for the goods offered by an imaginary shop. We said that kurtosis is often thought of as a measure of the “peakedness” of the amplitude distribution of a signal, and explained what is really meant by this statement, which some peers find controversial. After some hands-on simulations we concluded that kurtosis is great at detecting single impulses. Today I want to consider what happens when the data contains more than one impulse, and whether kurtosis still behaves in a desirable way.
Let us remind ourselves that the case study was as follows: a shop recording the number of sold goods as a function of time tries to automatically detect the presence of any abnormal demand.
In the previous year (previous post), kurtosis was used to detect the impulse with great success. Figure 1 shows the data and the corresponding value of kurtosis. The metric is 6.227, clearly above 3, the value expected for Gaussian noise. Impulsive content was detected, success! So this year the encouraged management used the same metric. After the analysis, the value of kurtosis came out even greater: 6.920. “Surely we must have had even larger spikes this year. Splendid!” But further investigation revealed something rather different, as can be seen in Figure 2. It turned out that apart from exactly the same impulse at day 200 as in the previous year, another impulse occurred at day 400, but it was of smaller amplitude. The difference between the two signals is indicated on a histogram by two arrows (Figure 3).
Based on the above result, the intuition about the behaviour of kurtosis should be two-fold: yes, it increases in the presence of an impulse, and yes, it grows with the magnitude of that impulse, but it can also increase in the presence of additional impulses, even ones of smaller amplitude. Kurtosis therefore summarizes the combined effect of all the peaks in the data, and an increased value of the metric does not always mean larger peaks.
This is the time when the Dragon of (business) Demand asks the Mighty Dwarf for help with alternatives: “Oi, I don’t want my parameter to go up when the maximal value in the data does not go up. What if I am only interested in tracking the behaviour of the maximal peaks, but in a normalized way, so I get roughly the same output no matter the absolute scale of my data?” The Dwarf is likely to reply: “I will not solve your problem entirely, but I have something that might help. Try the power of the ‘crest factor’.”
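This behaviour is easy to reproduce on synthetic data. The sketch below does not use the shop's dataset: it assumes 500 days of Gaussian demand (mean 100 and standard deviation 10, both invented for illustration), plants a spike of 170 at day 200 and then a smaller one of 150 at day 400. The `kurtosis` helper is written here as the plain fourth standardized moment, so Gaussian noise scores about 3.

```python
import numpy as np

def kurtosis(x):
    """Fourth standardized moment (non-excess): about 3 for Gaussian data."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered**4) / np.mean(centered**2) ** 2

# Hypothetical demand data: every number below is an assumption for illustration.
rng = np.random.default_rng(0)
baseline = rng.normal(loc=100.0, scale=10.0, size=500)

one_spike = baseline.copy()
one_spike[200] = 170.0           # single large impulse

two_spikes = one_spike.copy()
two_spikes[400] = 150.0          # additional impulse of smaller amplitude

for name, data in [("baseline", baseline),
                   ("one spike", one_spike),
                   ("two spikes", two_spikes)]:
    print(f"{name:>10}: kurtosis = {kurtosis(data):.3f}")
```

With these assumed numbers the largest value stays at 170 in both spiked series, yet kurtosis rises again once the smaller impulse is added.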
So what is crest factor? Theoretically speaking, it is the ratio of the absolute value of the peak in the data to the Root Mean Square (RMS) value of the same data:

$$CF = \frac{|x_{peak}|}{x_{RMS}}$$

where $|x_{peak}|$ is the absolute value of the peak (positive or negative) in the data, and $x_{RMS}$ is the Root Mean Square (RMS) of the data, further defined as:

$$x_{RMS} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}$$

where $n$ is the total number of samples in the data and $x_i$ is the $i$th sample within the data.
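As a quick sanity check of the two definitions, here is a tiny worked example on four made-up samples, written in plain Python so that every term is visible. It also illustrates that multiplying the data by a constant leaves CF unchanged, because the peak and the RMS scale together. The sample values and the scaling constant are arbitrary choices for illustration.

```python
import math

x = [1.0, -2.0, 3.0, -4.0]                        # made-up samples, n = 4

peak = max(abs(v) for v in x)                      # |x_peak| = 4
rms = math.sqrt(sum(v * v for v in x) / len(x))    # sqrt((1 + 4 + 9 + 16) / 4) = sqrt(7.5)
cf = peak / rms

# Rescale the data by an arbitrary constant and recompute CF.
scaled = [1000.0 * v for v in x]
scaled_peak = max(abs(v) for v in scaled)
scaled_rms = math.sqrt(sum(v * v for v in scaled) / len(scaled))
cf_scaled = scaled_peak / scaled_rms

print(f"peak = {peak}, RMS = {rms:.4f}, CF = {cf:.4f}")
print(f"CF after rescaling = {cf_scaled:.4f}")
```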
OK, but what does this really mean? Let us break it down a bit. The peak value is the maximum of the absolute values of all the samples in the data, and the RMS is a measure of the total “weight” of the data, or the amount of energy contained in it. Unlike kurtosis, the numerator of crest factor (CF) does not include any summation over all the points in the data; it only looks at the maximal peak. The denominator, in turn, is the quadratic mean of all the samples in the data. Having such a lightweight numerator (just one sample) and a heavyweight denominator (all the samples) suggests that the effect of additional impulses in the signal (apart from the already present maximal peak) will be largely offset by all the other samples in the data, so it may pass almost unnoticed. The effect? Far lower sensitivity to the presence of new impulses compared to kurtosis.
Let's define CF and run some tests to verify our assumptions.
import numpy as np

def crest_factor(x):
    # Largest absolute sample divided by the RMS of the data
    return np.max(np.abs(x)) / np.sqrt(np.mean(np.square(x)))
NOTE: the RMS in the denominator is the Root Mean Square of the data. Look how literally it can be coded with the numpy library, as sqrt(mean(square(x))), which makes it very easy to remember.
Having defined the metric, let's apply it to our two datasets of interest:
crest_factor(sales_spike)
1.6949964209239023

crest_factor(sales_two_spikes)
1.693766572123957
Indeed, the change is almost unnoticeable. The table below compares the values of kurtosis and crest factor applied to both datasets. In addition, percentage changes and the baseline, impulse-free signal (sales from my previous post) are included for ease of comparison.
Looks like the Dwarf was right: CF holds its ground far more stably.
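The percentage changes between the two spiked datasets can be recomputed directly from the values quoted in the text. This is a minimal sketch: `percent_change` is a helper name introduced here for illustration, and the baseline signal is left out because its values are not quoted in this post.

```python
# Values quoted in the text for the one-spike and two-spike datasets.
kurtosis_one_spike, kurtosis_two_spikes = 6.227, 6.920
cf_one_spike, cf_two_spikes = 1.6949964209239023, 1.693766572123957

def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old

kurtosis_change = percent_change(kurtosis_one_spike, kurtosis_two_spikes)
cf_change = percent_change(cf_one_spike, cf_two_spikes)

print(f"kurtosis:     {kurtosis_change:+.2f}%")
print(f"crest factor: {cf_change:+.3f}%")
```

Kurtosis rises by roughly 11%, while crest factor moves by less than a tenth of a percent, and downwards.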
If you are a careful analyst, you surely noticed that CF reacted with a very slight, yet interesting decrease in its value. Why? Well, the numerator did not change – the absolute peak stayed the same – but the additional impulse caused the RMS of the signal to increase a tiny bit. Hence, the ratio went down.
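The same reasoning can be checked by printing the numerator and the denominator separately. The sketch below again uses invented data, not the shop's: Gaussian demand with mean 100 and standard deviation 10, a spike of 170 at day 200 and a second spike of 150 at day 400. `peak_and_rms` is a hypothetical helper that simply splits the crest factor into its two parts.

```python
import numpy as np

# Hypothetical demand data: every number here is an assumption for illustration.
rng = np.random.default_rng(0)
one_spike = rng.normal(loc=100.0, scale=10.0, size=500)
one_spike[200] = 170.0           # the original impulse

two_spikes = one_spike.copy()
two_spikes[400] = 150.0          # extra impulse of smaller amplitude

def peak_and_rms(x):
    """Return the numerator and the denominator of the crest factor."""
    peak = np.max(np.abs(x))
    rms = np.sqrt(np.mean(np.square(x)))
    return peak, rms

for name, data in [("one spike", one_spike), ("two spikes", two_spikes)]:
    peak, rms = peak_and_rms(data)
    print(f"{name:>10}: peak = {peak:.1f}, RMS = {rms:.3f}, CF = {peak / rms:.4f}")
```

The peak is identical in both series, the RMS grows by a fraction of a percent, and so the ratio drops slightly.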
Crest factor can be thought of as an alternative to kurtosis, characterized by greater focus on the influence of the largest impulses in the data and, as such, neglect (to some extent) of the importance of other, lower amplitude peaks.
 Clarence W. de Silva, Vibration and Shock Handbook, CRC Press, 2005
 Stan Tempelaars, Signal Processing, Speech and Music, Routledge, 2014
Bio: Pawel Rzeszucinski received an MSc in Computer Science from Cranfield University and an MSc in Electronics from Wroclaw University of Technology. He subsequently moved to The University of Manchester, where he obtained a PhD on a project sponsored by QinetiQ related to data analytics for helicopter gearbox diagnostics. Upon returning to Poland he worked as a Senior Scientist at ABB’s Corporate Research Center and as a Senior Risk Modeler in Strategic Analytics at HSBC. Currently he is a Data Scientist at Codewise.