A Better Stats 101


Judging from "Stats 101" courses in which we plug numbers into simple formulas, or seminars given by software vendors where we click-and-drag or point-and-click for two days, it would be easy to get the impression that there really isn't much to statistics. Of course, there are dweeby types whose worlds revolve around linear algebra, calculus and programming. In a different way, they also succeed in giving us a misleading notion of what stats is really about.

So, what is it really all about? This snapshot should give you a rough idea.

Karl Pearson, regarded by many as the father of modern statistics, dubbed this new field "the grammar of science." Simplifying greatly, statistics is used in designing research and in collecting, analyzing and interpreting data. How does it differ from data science? Paraphrasing former Royal Statistical Society President Peter Diggle, data science seeks to maximize the utility of data, whereas statistics seeks to minimize the uncertainty that is associated with the interpretation of data. This distinction reflects very different mentalities, and when I read on LinkedIn that someone is "passionate about data", without reading further, I know he or she is not a statistician.

Statistics is also a way of thinking. It teaches us to conceptualize probabilistically - remember the word uncertainty in Professor Diggle's comment. It also encourages us to think systemically and recognize that variables normally do not operate in isolation, and that an effect usually has multiple causes. Some call this multivariate thinking. Statistics is particularly useful for uncovering the Why.

I elaborate a bit more on the thinking aspect of stats in "Statistical Thinking and the Art of Lawnmower Maintenance", but here are a few questions statisticians should bear in mind:

  • Is my hypothesis internally consistent?
  • Do I have empirical evidence to support my hypothesis? Have I looked at all the relevant empirical evidence? Are there rival explanations I haven't considered?
  • Are patterns I’ve observed in the data likely to be real, or merely due to chance? What might have caused these patterns, if real?
  • Are there unobserved variables or other confounders I haven’t accounted for that may have caused these patterns? Am I confusing cause with effect, or correlation with causation? Am I drawing conclusions about fruit based only on apples?
  • Am I asking the right questions? Are there other questions I should be asking?
  • Have I become too ego-involved with the subject I’m researching and compromised my objectivity?
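
The question of whether a pattern is "real, or merely due to chance" can be made concrete with a permutation test. What follows is only a minimal sketch, not a recommended workflow: the function name `permutation_p_value` and the two sets of scores are hypothetical, invented purely for illustration. The idea is to shuffle the group labels many times and count how often a difference at least as large as the observed one turns up by chance alone.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the share of random relabelings whose absolute mean
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break any real link between label and score
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # The add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical purchase-interest scores for two versions of an ad.
ad_version_1 = [6, 7, 5, 8, 7, 6, 9, 7]
ad_version_2 = [5, 6, 5, 7, 4, 6, 5, 6]
print(permutation_p_value(ad_version_1, ad_version_2))
```

A small value suggests the gap would rarely arise from shuffling alone. It says nothing about confounders, sampling problems or whether the right question was asked, which is why the other items in the list still matter.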

As the snapshot mentioned above demonstrates, there is a gigantic number of statistical methods designed for a gigantic number of purposes. It is a king-sized discipline, with many specializations and sub-specializations. My own work draws heavily on psychometrics, econometrics, biostatistics and epidemiology, though, when needed, I also utilize machine learners designed primarily for predictive analytics.

Experience, judgement and subject matter knowledge, as well as understanding decision-makers' expectations and priorities, are critical. The importance of communication and interpersonal skills is hard to overstate. Programming and computer science skills are increasingly demanded in many occupations and more so in statistics than most.

There is fundamental knowledge to acquire and, since statistics is a STEM field, fundamental doesn't necessarily mean easy. With that qualification, here, loosely organized, are some of the important planks in a contemporary statistical education:

  • Probability
  • Sampling
  • Descriptive statistics
  • Inferential statistics
  • Design and analysis of experiments and quasi-experiments
  • The Generalized Linear Model (e.g., OLS regression, negative binomial regression)
  • Time-to-event analysis (e.g., Kaplan-Meier, Cox Regression)
  • Unsupervised methods (e.g., correspondence analysis, clustering)
  • Machine learning tools (e.g., boosting, bagging, neural nets)
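
To give a flavor of one item on this list, here is a minimal sketch of the Kaplan-Meier estimator used in time-to-event analysis. It is written from the textbook product-limit formula rather than taken from any particular package; the function name `kaplan_meier` and the customer data are hypothetical. An event flag of 1 means the event (say, a customer cancelling) was observed, and 0 means the record was censored, i.e., the customer was still active when observation ended.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  -- observed durations
    events -- 1 if the event occurred at that time, 0 if censored
    Returns a list of (time, survival probability) pairs, one for
    each distinct time at which at least one event occurred.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)  # subjects still under observation
    survival = 1.0
    curve = []

    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Gather everyone whose duration equals t (events and censored).
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Subjects censored at t count as at risk at t, then drop out.
        at_risk -= n_leaving
    return curve


# Hypothetical months-to-cancellation data for five customers.
months = [1, 2, 2, 3, 4]
cancelled = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(months, cancelled):
    print(f"month {t}: estimated share still active = {s:.2f}")
```

The point of the estimator is that censored records are not thrown away: they stay in the at-risk count until they leave observation, which is what distinguishes time-to-event methods from simply averaging the observed durations.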

Data science and analytics programs are popping up all over the world the way MBA programs did a generation ago. Their quality appears even more varied, though, and there is little consistency in their curricula. Many of the topics I've listed above are skipped or crammed into the curricula along with IT and programming courses. Caveat emptor.

Formal university-level coursework in statistics is still the best way to learn about statistics. Let's not forget real experience analyzing real data, though - there is the classroom and there is the real world. I’ve been using stats in my work for more than 30 years now and still learn something new every day.

So how is statistics actually used? Here are some examples from marketing research and data science:

  • In Consumer Segmentation, consumers or customers are either 1) statistically profiled with respect to a key target variable such as purchase frequency or 2) divided into clusters (segments) based on how similar they are to each other in terms of wants and needs and/or behaviors. Segmentation can be used either for targeting purposes or simply to learn more about a product or service category. Data can include consumer surveys, customer records or both. Segmentation is a core part of data mining and predictive analytics and, thus, data science.
  • Marketing Mix Modeling (aka Market Response Modeling) is quite complex, but in a nutshell, it tries to find out how much bang a client is getting for their marketing buck and attempts to identify the optimal marketing mix. It can also be extended to demand forecasting, in which future sales or market share are forecast under various marketing and/or economic scenarios. Methods developed in econometrics are used most often.
  • Pricing Analysis is especially useful for products that are not yet on the market or for which historical data are lacking. Surveys are the main data source in these cases. It can also be conducted using existing price and sales data, and price is typically a variable in marketing mix modeling. Conjoint (see below) is a popular method in pricing research.
  • Choice Modeling (aka "conjoint") is a versatile tool often used in pricing research and new product development. For example, respondents in a consumer survey might be shown a series of choice tasks with product descriptions and asked which product, if any, they would choose. Their responses are statistically analyzed with latent class or multinomial logit models and the importance ("utility") of specific features is estimated. "What if?" simulations can be conducted to see which combinations of features - i.e., which hypothetical new product - would garner the highest preference share. The utilities can also be used as input into segmentation.
  • Key Driver Analysis is similar to choice modeling in that one application of it is to learn more about consumer priorities. It serves many kinds of objectives and there are many ways to conduct it but, in essence, we are trying to unravel the "causes" of one or more target variables such as purchase interest in a new product, liking for a TV ad, customer satisfaction or brand equity. It also plays an important role in UX and CX studies, and sensory research. Many analytic methods are used, often from the generalized linear model family (which is huge).
  • In Image and Positioning research, brand or user image data from consumer surveys are mapped with one of several methods, such as correspondence analysis, to help us understand how brands or users of brands are perceived by consumers. Mapping is frequently combined with key driver analysis to see which image attributes and image dimensions clients should focus on most.
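
The "What if?" simulations mentioned under choice modeling can be sketched in a few lines. This is a simplified illustration that assumes the utilities have already been estimated and that a single multinomial logit share rule applies to the whole market. Every number, attribute and product name below is hypothetical, and real studies usually simulate at the respondent or segment level.

```python
import math

# Hypothetical utilities ("part-worths") for each feature level.
PART_WORTHS = {
    "brand": {"A": 0.5, "B": 0.0},
    "price": {"$10": 0.8, "$15": 0.0},
    "size": {"small": 0.0, "large": 0.4},
}
NONE_UTILITY = 0.0  # assumed utility of choosing none of the products


def total_utility(product):
    """Sum the part-worths of the levels that describe one product."""
    return sum(PART_WORTHS[attr][level] for attr, level in product.items())


def preference_shares(products, include_none=True):
    """Multinomial logit shares: exp(utility) divided by the sum of all."""
    utilities = {name: total_utility(p) for name, p in products.items()}
    if include_none:
        utilities["none"] = NONE_UTILITY
    top = max(utilities.values())  # subtract the max for numerical stability
    weights = {name: math.exp(u - top) for name, u in utilities.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# A hypothetical scenario: two candidate products plus the "none" option.
scenario = {
    "value_pack": {"brand": "B", "price": "$10", "size": "large"},
    "premium": {"brand": "A", "price": "$15", "size": "small"},
}
for name, share in preference_shares(scenario).items():
    print(f"{name}: {share:.1%}")
```

Changing one feature level in `scenario` and rerunning shows how the estimated shares shift, which is the essence of the simulation step.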

Throughout the years, statistics has arguably been one of the most misunderstood and poorly taught subjects of all. After languishing in the background all these years, it is emerging from the fog and its importance is increasingly recognized in what I call the Data Age.

This has not really been a replacement for Stats 101, but I hope you’ve found it interesting and helpful!

Bio: Kevin Gray is President of Cannon Gray, a marketing science and analytics consultancy. He has more than 30 years’ experience in marketing research with Nielsen, Kantar, McCann and TIAA-CREF. Kevin also co-hosts the audio podcast series MR Realities.

Original. Reposted with permission.