Statistical Modeling: A Primer
"Model" means different things to different people and different things at different times.
As I briefly explain in A Model's Many Faces, I often find it helpful to classify models as conceptual, operational or statistical. In this post we'll have a closer look at the last of these, statistical models. First, it's critical to understand that statistical models are simplified representations of reality and, to paraphrase the famous words of statistician George Box, they're all wrong but some of them are useful. So why do we use statistical models? We use them because we need to better understand something we don't understand very well or because we wish to predict something - sales, for instance.
There is also an important distinction between deterministic and stochastic models I should mention. Put very simply, with a deterministic model we can calculate the answer from one or more equations. A stochastic model, on the other hand, possesses some inherent randomness and we can only estimate the answer. Our estimates may be quite close, or they may be way off. In a field such as marketing research, we often don't know because we lack the data needed to make this assessment. Sometimes, though, we are able to compare model predictions with real data - predicted sales versus actual sales, for example.
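To make the distinction concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the function names, the "true" intercept of 50 and slope of 3, and the noise level are assumptions, not figures from any real study. The deterministic function returns the same answer every time, whereas the stochastic one can only be estimated from data, and the estimate will be close to, but not exactly, the values used to simulate it.

```python
import random

def revenue(price, units):
    """Deterministic: the same inputs always give exactly the same answer."""
    return price * units

def simulate_sales(ad_spend, rng, intercept=50.0, slope=3.0, noise_sd=2.0):
    """Stochastic: sales depend on ad spend plus random noise (made-up values)."""
    return intercept + slope * ad_spend + rng.gauss(0.0, noise_sd)

def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

rng = random.Random(42)  # fixed seed so the run is repeatable
ad_spend = [rng.uniform(0.0, 10.0) for _ in range(500)]
sales = [simulate_sales(x, rng) for x in ad_spend]
est_intercept, est_slope = fit_line(ad_spend, sales)

print(revenue(20.0, 150))        # calculated exactly
print(est_intercept, est_slope)  # estimated; near 50 and 3, but not equal to them
```

In this toy case we know the values that generated the data, so we can see how far off the estimates are. With real marketing data we usually cannot.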
Statistical models are stochastic, and they are what we normally use in marketing research. To crib from Wikipedia: "A statistical model is a class of mathematical model, which embodies a set of assumptions concerning the generation of some sample data, and similar data from a larger population. A statistical model represents, often in considerably idealized form, the data-generating process." A word of caution: because of their user-friendly interfaces, What If? simulation tools based on statistical models are sometimes mistaken for deterministic models by naive users.
Another useful distinction is between dependence and interdependence methods. Regression, in which we have both a dependent variable and one or more independent (predictor) variables, is an example of the former. Note that we can have more than one dependent variable, as we often do in Structural Equation Modeling. Cluster analysis and factor analysis are examples of interdependence methods, which do not distinguish between dependent and independent variables. They are frequently used for brand mapping in marketing research in addition to segmentation.
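A rough sketch of the two approaches, using simulated data and only the Python standard library. The two respondent groups, their ratings and the tiny two-cluster k-means routine are all hypothetical stand-ins; real work would use a proper statistical package. The regression singles out one variable as the outcome, while the clustering treats both variables alike.

```python
import random

rng = random.Random(7)

# Made-up survey data: two groups of respondents rated on two attributes.
group_a = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(50)]
group_b = [(rng.gauss(10.0, 1.0), rng.gauss(10.0, 1.0)) for _ in range(50)]
points = group_a + group_b

# Dependence method: the second attribute is the dependent variable,
# the first is the predictor (simple least-squares slope).
xs = [p[0] for p in points]
ys = [p[1] for p in points]
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))

# Interdependence method: no dependent variable; both attributes are equal partners.
def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def two_means(data, iterations=20):
    """Tiny k-means with k=2, started from the first point and the point farthest from it."""
    centers = [data[0], max(data, key=lambda p: dist2(p, data[0]))]
    for _ in range(iterations):
        clusters = [[], []]
        for p in data:
            nearest = 0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
            clusters[nearest].append(p)
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centers)

centers = two_means(points)
print(slope)    # how the outcome moves with the predictor
print(centers)  # two group centers, found without naming any outcome
```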
Some models are purely predictive - they are only concerned with predicting something that hasn't happened yet. An example would be predicting future sales from past sales alone. Recommender systems are another type of predictive model now widely used in marketing. Amazon presumably doesn't care why you like novels featuring attorneys but knows that people who buy John Grisham's books also frequently buy Scott Turow's. (I plead guilty on both counts.)
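As one illustration of predicting sales from past sales alone, here is a sketch of simple exponential smoothing. The sales figures and the smoothing weight of 0.5 are invented for the example, and this is just one of many forecasting techniques. Note that nothing in it explains why sales move; it only extrapolates.

```python
def exponential_smoothing_forecast(history, alpha=0.5):
    """One-step-ahead forecast: a weighted blend of each new value and the running level."""
    level = history[0]
    for observed in history[1:]:
        level = alpha * observed + (1 - alpha) * level
    return level

past_sales = [100.0, 110.0, 120.0, 130.0]  # hypothetical monthly sales
forecast = exponential_smoothing_forecast(past_sales)
print(forecast)  # 121.25
```

Because the method smooths rather than models the upward trend, the forecast lags behind the latest figure, which is a reminder that even a purely predictive model carries assumptions.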
A causal model, on the other hand, seeks explanations. This is particularly important in marketing research when simply predicting how a customer will behave is not enough and we need to know why some consumers behave as they do in order to formulate and implement marketing activities. There is an erroneous notion among some marketing researchers that quantitative research is for getting the numbers and qualitative research is for understanding the why underlying the numbers. (I address this rather alarming misconception in Combining Smart Design with Smart Analytics.) Note that a causal model can also be used for prediction and how well it predicts is often (but not always) a criterion for judging how good the model is, so this dichotomy is somewhat blurry.
There are other important categorizations as well, for instance between time-series or longitudinal modeling, in which our data span two or more points in time, and cross-sectional modeling, in which we only have data for one slice in time. Marketing mix modeling uses time-series data whereas most marketing research surveys are cross-sectional. Tracking studies are exceptions to this rule. Some multi-level models straddle these categories by combining cross-sectional data with time-series or longitudinal data in one model. Though complex, models for spatial and spatiotemporal data are relevant to specialized corners of marketing research.
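The difference is easiest to see in the shape of the data. In this small sketch the stores, weeks and sales figures are hypothetical: a cross-section looks across units at one point in time, and a time series follows one unit across time.

```python
# Hypothetical records: (store, week, sales)
records = [
    ("north", 1, 120), ("north", 2, 125), ("north", 3, 131),
    ("south", 1, 90), ("south", 2, 94), ("south", 3, 99),
]

# Cross-sectional slice: every store, one point in time (week 3).
cross_section = {store: sales for store, week, sales in records if week == 3}

# Time series: one store, every point in time.
time_series = [sales for store, week, sales in records if store == "north"]

print(cross_section)  # {'north': 131, 'south': 99}
print(time_series)    # [120, 125, 131]
```

Keeping all six records together, stores by weeks, gives the kind of combined data that some multi-level models work with.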
Frequentist versus Bayesian statistics...at times this resembles the academic equivalent of a religious war. The linked post is an interview with noted Bayesian statistician Andrew Gelman who, fortunately, is the peaceful sort as well as being an outstanding educator. Most of the time either approach will work for marketing research though, generally speaking, Bayesian methods are more complex and there are fewer people skilled at them. Another conflict zone for some is between statistics and machine learning, but the two terms are increasingly used synonymously. There are also nonparametric and semiparametric models and some disagreement among statisticians regarding when these are better suited than more familiar parametric statistics.
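For a feel of how the two schools can differ in practice, consider estimating a conversion rate. This is a minimal sketch with made-up numbers (12 conversions in 40 trials), and the Beta(2, 2) prior is an assumption chosen purely for illustration. The frequentist estimate here is the sample proportion; the Bayesian estimate is the posterior mean, which pulls the proportion slightly toward the prior.

```python
def frequentist_estimate(successes, trials):
    """Maximum-likelihood estimate of a proportion: the observed rate."""
    return successes / trials

def bayesian_posterior_mean(successes, trials, prior_a=2.0, prior_b=2.0):
    """Posterior mean under a Beta(prior_a, prior_b) prior and binomial data."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

conversions, visitors = 12, 40  # hypothetical campaign results
mle = frequentist_estimate(conversions, visitors)
posterior_mean = bayesian_posterior_mean(conversions, visitors)
print(mle)             # 0.3
print(posterior_mean)  # 14/44, roughly 0.318
```

With this little data the prior still has some say. As the sample grows the two estimates converge, which is one reason either approach usually works.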
I haven't even mentioned mixture modeling! This is particularly useful when you suspect more than one process gave rise to your data, segmented driver analysis being one example.
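Here is a bare-bones sketch of the idea: a two-component Gaussian mixture fitted to one variable with the EM algorithm. The data are simulated from two processes with means 0 and 6, the function names and starting values are my own assumptions, and this is far simpler than a segmented driver analysis, which would fit a separate regression within each segment.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_two_gaussians(data, iterations=100):
    """EM for a two-component 1-D Gaussian mixture; returns (weights, means, sds)."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / n)
    # Crude starting values (an assumption): one component at each extreme.
    means = [min(data), max(data)]
    sds = [overall_sd, overall_sd]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each point came from component 0.
        resp = []
        for x in data:
            d0 = weights[0] * normal_pdf(x, means[0], sds[0])
            d1 = weights[1] * normal_pdf(x, means[1], sds[1])
            resp.append(d0 / (d0 + d1))
        # M-step: re-estimate each component from its weighted points.
        for k in (0, 1):
            r = resp if k == 0 else [1.0 - p for p in resp]
            total = sum(r)
            weights[k] = total / n
            means[k] = sum(ri * x for ri, x in zip(r, data)) / total
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, data)) / total
            sds[k] = max(math.sqrt(var), 1e-3)  # floor to avoid a collapsed component
    return weights, means, sds

rng = random.Random(1)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)]
        + [rng.gauss(6.0, 1.0) for _ in range(200)])
weights, means, sds = fit_two_gaussians(data)
print(weights, means, sds)
```

The model is never told which observation came from which process; it recovers the two groups from the shape of the data alone.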
Suffice it to say that statisticians now have an immense tool kit, and An Analytics Toolbox gives you a peek inside of it. Despite what some have claimed over the years, we're still nowhere near the point where Artificial Intelligence or some other form of automation can replace a competent statistician or marketing science person. The growing complexity of statistical science is actually making this goal more elusive.
How these tools are used by human experts matters a great deal and will for the foreseeable future - see What Makes a Good Analyst? for some thoughts on what to look for in an analyst. Technical competence, of course, is a must since it's very easy for someone untutored in statistics to point and click themselves and their clients into a heap of trouble. However, in my experience, it's even more critical to understand who will be using the results and, to the extent possible, how they will be used.
It all begins with the brief.
I hope you've found this interesting and helpful!
Bio: Kevin Gray is president of Cannon Gray, a marketing science and analytics consultancy.
Original. Reposted with permission.