Commonly Misunderstood Analytics Terms
Unable to follow your analyst’s language during presentations? Understand what the common terms in data science actually mean.
By Jeffrey Strickland, Ph.D., CMSP.
Have you ever sat in a briefing with an analyst as they describe the results of their analytics? You probably heard some of the terms described below. You may have had a statistics class in your MBA course work ten years ago, and you vaguely remember hearing the same terms there. If you are like me, you probably can spell them correctly 60% of the time, but their actual meaning escapes you. So, let’s look at some commonly misunderstood terms.
“At Least”
My favorite one. It is a lower bound: the smallest in size, amount, degree, etc. “The customer made at least one transaction” (they made no fewer than one).
“At Most”
It is an upper bound: the greatest in size, amount, degree, etc. Not more than. “The customer had at most 12 internet sessions” (they had no more than 12).
“Sampling”
First, sampling is not a bad thing. I recently heard a corporate executive say that when you have all of the data you do not need to sample. Of course, “all of the data” is relative. If I have 15 million members in my credit union and complete data on every member, then I have “all the data”. But when I build a model, I will take a random sample comprising, say, 5% of the members (the population). Random samples are representative of the population. When I test the performance of my model, I will do it using another random sample from the membership.
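As a rough sketch of the idea (using made-up member data, since the actual membership file is hypothetical), drawing a 5% simple random sample in Python might look like this:

```python
import random

random.seed(42)

# Hypothetical population: 100,000 member transaction counts,
# a stand-in for the credit-union membership described above.
population = [random.randint(0, 50) for _ in range(100_000)]

# A 5% simple random sample, drawn without replacement.
sample = random.sample(population, k=len(population) // 20)

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"population mean: {pop_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")  # close to the population mean
```

The point of the sketch is the last two lines: the sample mean tracks the population mean closely, which is what "representative of the population" means in practice.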
“95 Percent Confidence Level”
It does not mean 95 percent accurate! Consider a poll or survey based on a sample of the population. A 95 percent confidence level says that if the poll or survey were repeated over and over again, about 95 percent of the resulting estimates would capture the true population value. In the picture below, x̅ represents the arithmetic mean, or average value. The subscripts of x̅ (1, 2, …, 7) indicate the mean values of seven different samples from the same population.
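The repeated-polling idea can be simulated directly. The sketch below (with made-up population parameters) draws many samples, builds a normal-approximation 95% interval around each sample mean, and counts how often the interval captures the true mean:

```python
import random
import statistics

random.seed(0)

# Illustrative population parameters (made up for the simulation).
TRUE_MEAN, TRUE_SD, N, TRIALS = 100.0, 15.0, 50, 2000

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    lo, hi = m - 1.96 * se, m + 1.96 * se  # normal-approximation interval
    if lo <= TRUE_MEAN <= hi:
        covered += 1

print(f"coverage: {covered / TRIALS:.1%}")  # close to 95%
```

Roughly 95% of the intervals cover the true mean; no single poll is "95 percent accurate", but the procedure succeeds about 95 percent of the time.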
“Variance”
Variance is not “a measure of how spread out the numbers are,” as Wikipedia describes it; that description fits the standard deviation. Variance is a measure of dispersion, but not in the sense of the Wikipedia definition: you can say that an observation is one standard deviation from the mean, but you cannot say that an observation is one variance from the mean. Variance is the average of the squared differences from the mean, σ² = (1/N) Σ (xᵢ − μ)². We use it primarily to derive the standard deviation, the square root of the variance.
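The relationship between the two is easy to see with a toy data set (the numbers below are illustrative only):

```python
import statistics

# Toy data set, chosen so the arithmetic works out cleanly.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.fmean(data)
# Population variance: the average of the squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)
# The standard deviation is the square root of the variance,
# putting dispersion back in the same units as the data.
std_dev = variance ** 0.5

print(mean, variance, std_dev)  # 5.0 4.0 2.0
```

An observation at 7 is "one standard deviation (2.0) from the mean"; saying it is "one variance (4.0) from the mean" would be meaningless, because variance is in squared units.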
“Skewness”
Skewness is a measure of the lack of symmetry (asymmetry) of a probability distribution. The direction of Skewness is usually the part that is misunderstood, for it is slightly counterintuitive. When we speak of asymmetry we say positive Skewness or negative Skewness. In this case a picture is worth a thousand words.
So, Skewness describes the long tail of the distribution. A distribution with a long tail toward the right has positive Skewness. Another way to describe it is left-modal (the mode sits toward the left), which makes it all the more counterintuitive. To make matters worse, in a positively skewed distribution the mean is typically larger than the median and the mode, because the long right tail pulls the mean up. To keep things straight, just focus on the direction of the longer tail.
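A quick sketch makes the direction concrete. The exponential distribution has a long right tail, so its (moment coefficient of) skewness is positive, and its mean sits above its median:

```python
import random
import statistics

random.seed(1)

# The exponential distribution has a long right tail (positive skew).
data = [random.expovariate(1.0) for _ in range(100_000)]

mean = statistics.fmean(data)
sd = statistics.pstdev(data)
# Moment coefficient of skewness: average cubed deviation, standardized.
skewness = sum((x - mean) ** 3 for x in data) / (len(data) * sd ** 3)

median = statistics.median(data)
print(f"skewness: {skewness:.2f}")               # positive (about 2 for an exponential)
print(f"mean {mean:.2f} > median {median:.2f}")  # the long right tail pulls the mean up
```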
“Central Limit Theorem” (CLT)
I know this one keeps you awake at night. The CLT states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.
In English, the CLT tells us that if we take the means of many samples of size N from a population and plot the frequencies of those means, we get a distribution that is approximately normal, regardless of the underlying distribution. The condition on N, the size of each sample, is that it be sufficiently large. So how large is “large”? There are two conditions.
- Requirements for accuracy. The more closely the sampling distribution needs to resemble a normal distribution, the more sample points will be required.
- The shape of the underlying population. The more closely the original population resembles a normal distribution, the fewer sample points will be required.
In everyday use, some researchers say that a sample size of 30 is large enough when the population distribution is roughly bell-shaped. Others recommend a sample size of at least 40. But if the original population is distinctly not normal (e.g., is badly skewed, has multiple peaks, and/or has outliers), researchers like the sample size to be even larger.
One final point: the CLT does not say that the distribution of the means forms a normal distribution. It says they are “approximately” normally distributed. Below is a picture of the concept. The solid lines represent the shape of the distribution of the means of samples from a population that is not normal, and the dashed lines represent a normal distribution.
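The whole theorem can be watched in action with a short simulation. The sketch below draws many samples of size 40 from a clearly non-normal (exponential, positively skewed) population and looks at the distribution of the sample means:

```python
import random
import statistics

random.seed(7)

# Draw many samples of size N from an exponential population
# (right-skewed, true mean 1.0) and record each sample's mean.
N, NUM_SAMPLES = 40, 5000
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(N))
    for _ in range(NUM_SAMPLES)
]

# The means cluster around the true population mean (1.0), and their
# spread shrinks like sigma / sqrt(N), even though no individual
# sample comes from a normal distribution.
print(f"mean of sample means: {statistics.fmean(sample_means):.3f}")
print(f"sd of sample means:   {statistics.stdev(sample_means):.3f}")
```

Plotting a histogram of `sample_means` would show the familiar approximately bell-shaped curve, despite the badly skewed source population.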
Now you can sleep better at night, and when your analyst tells you that your confidence level is 95% due to the Central Limit Theorem, using samples of at least 40 observations from a positively skewed population, you will know what she is talking about, within a few standard deviations.
Bio: Jeffrey Strickland, Ph.D., CMSP is an International Expert; Predictive Analytics Consultant; Data Science Guru; Author; Speaker, and LION.
- Key Bioinformatics Terms for Data Scientists
- Insights from Data Science Handbook
- Data Mining and Predictive Analytics Glossary