Optimizing Web sites: Advances thanks to Machine Learning

Machine learning has revitalized a nearly dormant method, leading to a powerful approach for optimizing Web pages, finding the best of thousands of alternatives.



By Steven Struhl, ConvergeAnalytic.

You can now efficiently determine the appeal of many thousands of alternative Web site configurations, thanks to the new life machine learning methods have breathed into a mostly dormant analytical approach. Before we get into how this works, let’s look at what we can do. For instance, here is a slightly disguised Web page. There are many elements that could be varied; the simplified example in Figure 1 uses five.

Figure 1: Elements to be varied and evaluated in a Web page

Two of these elements are varied in five ways and the remaining three in four ways. This gives us a total of 5 x 5 x 4 x 4 x 4 possible combinations, or 1,600 possible configurations.

Clearly, we cannot test all of these. But using the approach we will discuss, we could estimate the value of every possible combination using just 20 precisely designed variants. We will show a summary of results later. This method far surpasses traditional A/B testing, where perhaps four variations would be the maximum.
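The arithmetic above can be checked in a few lines. The sketch below also counts how many quantities must be estimated if each element is assumed to contribute independently (a main-effects model, which is an assumption of this illustration and not something stated about the actual study). The element names are hypothetical placeholders.

```python
from itertools import product
from math import prod

# Number of versions per element in the simplified example: two elements
# with five versions each, three elements with four versions each.
# The element names are hypothetical placeholders.
levels = {"headline": 5, "hero_image": 5, "button": 4, "layout": 4, "offer": 4}

# Full factorial: every version of every element crossed with all the others.
n_configurations = prod(levels.values())

# Enumerating the combinations confirms the count (fine at this small size).
n_enumerated = sum(1 for _ in product(*(range(k) for k in levels.values())))

# Under a main-effects-only model (assumed here), an element with k versions
# needs k - 1 parameters, plus one parameter for the overall baseline.
n_parameters = 1 + sum(k - 1 for k in levels.values())

print(n_configurations)  # 1600
print(n_parameters)      # 18
```

Under that assumption, 18 quantities describe all 1,600 configurations, so 20 well-chosen variants are numerically enough to estimate them. If elements interact strongly, more variants would be needed.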

How this works

This approach rests on using experimental designs. These designs allow us to vary many factors simultaneously and then untangle the effects of each change accurately. The august (but much faded) underlying method is full-profile conjoint analysis. This approach was hailed as a tremendous advance in the 1970s, because it could accurately measure the impact of changing products’ features and prices. It had far better predictive accuracy than other approaches, such as scaled ratings, and led to many successful product and pricing changes.[1]
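To make the idea of untangling effects concrete, here is a minimal sketch using a classic 9-run orthogonal design for four factors with three levels each (81 possible combinations). It is a smaller design than the one discussed in this article, the "true" effects and noise level are invented for illustration, and it assumes effects simply add up with no interactions.

```python
import random
from itertools import combinations

# A 9-run orthogonal array for four 3-level factors (often called L9),
# built from two base columns a, b and two mod-3 combinations of them.
design = [(a, b, (a + b) % 3, (a + 2 * b) % 3) for a in range(3) for b in range(3)]

def is_orthogonal(rows):
    """True if every pair of columns shows all 9 level pairs (once each in 9 rows)."""
    for i, j in combinations(range(len(rows[0])), 2):
        if len({(r[i], r[j]) for r in rows}) != 9:
            return False
    return True

# Hypothetical "true" effects of each level of each factor (made-up numbers,
# centered so each factor's effects sum to zero).
true_effects = [
    [0.0, 1.0, -1.0],
    [0.5, -0.5, 0.0],
    [2.0, -1.0, -1.0],
    [0.0, 0.0, 0.0],
]
baseline = 10.0

rng = random.Random(0)
# Simulated response for each run: baseline + additive effects + small noise.
y = [
    baseline + sum(true_effects[f][lvl] for f, lvl in enumerate(row)) + rng.gauss(0, 0.01)
    for row in design
]
grand_mean = sum(y) / len(y)

def estimate(factor):
    """Estimate a factor's level effects as level means minus the grand mean."""
    effects = []
    for lvl in range(3):
        vals = [yi for row, yi in zip(design, y) if row[factor] == lvl]
        effects.append(sum(vals) / len(vals) - grand_mean)
    return effects

estimates = [estimate(f) for f in range(4)]
for f, est in enumerate(estimates):
    print(f, [round(e, 2) for e in est])
```

Because every level of one factor appears equally often with every level of each other factor, the other factors' effects cancel out of each level average. That balance is what lets nine runs stand in for all 81 combinations.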

However, conjoint was later largely abandoned, in part because it could measure relatively few features or changes in features. The size of the experimental design must expand as we measure more features and more variations in them. Conjoint also required each person to do a complete experiment. That is, each study participant would see as many varied product offerings (or product profiles) as the experiment required, and would evaluate each.

For instance, seven features or attributes, each varied in three ways, would require 18 runs in the experiment, or 18 product profiles. This would allow us to measure 3^7, or 2,187, possible variations. Most people will not sit still to evaluate many more product profiles than this. But often this was not enough for complex products or services.

Also, another modeling method, discrete choice modeling, supplanted conjoint. Discrete choice modeling allowed the evaluation of products and services in the context of competitive offerings. This is far more realistic than evaluating single product profiles in the absence of competitors, and so conjoint analysis was largely shelved.

Conjoint analysis never gained much traction in improving communications or service offerings, perhaps because it was so strongly associated with optimizing products. Yet it made sense to use it for these purposes. The goal in working with messages, for instance, could simply be finding the best combination of elements so that a message best appeals to the intended audience.

The problem of experiment size was the last impediment. This was overcome with the use of a machine learning method, hierarchical Bayesian analysis. This method provides highly accurate estimates by learning from the rest of the data where an individual’s responses are spotty or missing.

The details are mind-bending in their complexity, but here is an overview. In each spot where an individual has missing or scant information, the method draws a sample from others. This sample may also be compared to the individual for similarity. This estimate is stored, then another sample is drawn and the original value is updated. This is repeated for each item that needs to be estimated for each person, with the older estimates getting progressively less weight. After tens of thousands of runs, these estimates will settle into fixed values. Each item being estimated will have a value for each person, based on what the procedure has learned from the rest of the data.
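The borrowing idea can be sketched with a deliberately simplified model: one number per person instead of a full set of utilities, and with the spread of responses (`sigma`) and the spread between people (`tau`) treated as known. A real hierarchical Bayesian conjoint analysis estimates many values per person and learns those spreads too, so this is an illustration of the principle under stated assumptions, not the procedure used in the study. The function name and parameters are hypothetical.

```python
import random
from statistics import mean

def gibbs_partial_pooling(data, sigma=1.0, tau=1.0, n_iter=5000, burn_in=1000, seed=0):
    """Gibbs sampler for y_ij ~ N(theta_i, sigma^2), theta_i ~ N(mu, tau^2).

    data: one list of observations per person (a list may be empty).
    Assumes sigma and tau are known and a flat prior on mu.
    Returns (posterior mean of each theta_i, posterior mean of mu).
    """
    rng = random.Random(seed)
    n_people = len(data)
    theta = [mean(obs) if obs else 0.0 for obs in data]  # starting values
    theta_sums = [0.0] * n_people
    mu_sum = 0.0
    kept = 0
    for it in range(n_iter):
        # Draw the population mean given everyone's current values.
        mu = rng.gauss(mean(theta), tau / n_people ** 0.5)
        # Draw each person's value: a precision-weighted blend of that
        # person's own data and the population mean.
        for i, obs in enumerate(data):
            precision = len(obs) / sigma ** 2 + 1 / tau ** 2
            center = (sum(obs) / sigma ** 2 + mu / tau ** 2) / precision
            theta[i] = rng.gauss(center, precision ** -0.5)
        if it >= burn_in:  # discard early draws, then average the rest
            kept += 1
            mu_sum += mu
            for i in range(n_people):
                theta_sums[i] += theta[i]
    return [s / kept for s in theta_sums], mu_sum / kept

# Made-up responses: person 1 has scant data, person 2 has none.
data = [
    [4.1, 3.9, 4.0, 4.2, 3.8, 4.0],
    [6.0],
    [],
    [5.1, 4.9, 5.0, 5.2],
    [5.9, 6.1, 6.0, 5.8],
]
theta_hat, mu_hat = gibbs_partial_pooling(data)
print([round(t, 2) for t in theta_hat], round(mu_hat, 2))
```

The person with no data ends up near the population average, the person with one response lands between that response and the average, and the person with six responses stays close to their own mean. The less an individual tells us, the more the estimate leans on everyone else.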

This may seem almost magical, as if it cannot possibly work, but it has been tested in real-world applications for over 20 years. It does involve many millions of calculations and can make even the speediest computer work nonstop for hours. Still, if done correctly, hierarchical Bayesian analysis is at least 95% accurate compared to what would emerge with complete data, even with each person doing only half an experiment.[2] Since experiments can be split among people, highly complex problems can be analyzed, such as a Web site where 12 or more elements are each varied in several ways.

Figure 2 gives summary results for the Web site. The most appealing combination can be seen easily. The units are utilities, which are abstract measurements. They are compared at the ratio level.
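As a sketch of how utilities point to the most appealing combination, assume each configuration's appeal is the sum of the utilities of its chosen versions (an additive model, assumed here for illustration). The element names and numbers below are hypothetical and are not the values from Figure 2.

```python
from itertools import product

# Hypothetical utilities per element version (NOT the values from Figure 2).
utilities = {
    "headline": {"A": 0.30, "B": -0.10, "C": -0.20},
    "image": {"photo": 0.15, "drawing": -0.15},
    "button": {"green": -0.05, "orange": 0.25, "blue": -0.20},
}

def total_utility(config):
    """Sum the utilities of the chosen version of each element."""
    return sum(utilities[element][version] for element, version in config.items())

# With additive utilities, the best page takes the top version of each element.
best = {element: max(versions, key=versions.get) for element, versions in utilities.items()}

# Brute-force check over every possible configuration.
elements = list(utilities)
all_configs = [
    dict(zip(elements, combo))
    for combo in product(*(utilities[e] for e in elements))
]
brute_best = max(all_configs, key=total_utility)

print(best, round(total_utility(best), 2))
```

The shortcut of picking the top version of each element only holds when utilities add up independently. If elements interact, every configuration would need to be scored, as the brute-force check does.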

In the real analysis, there were more attributes. Their values were derived from actual behavior, with the baseline measurement being the stickiness of the pages as people encountered them. Stickiness is the length of time until people clicked off the page. This must be measured carefully to weed out people at the very low end who might have visited by mistake, and those at the very high end, who might have stopped (for instance) to answer the phone or put out a small kitchen fire.
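A minimal sketch of that weeding-out step follows. The cutoffs of 2 and 600 seconds are hypothetical and chosen only for illustration; the article does not state the thresholds actually used, and the visit times are made up.

```python
def trim_dwell_times(seconds, low=2.0, high=600.0):
    """Keep only visits whose time on page falls within [low, high] seconds.

    The default cutoffs are illustrative assumptions, not values from the study.
    """
    return [s for s in seconds if low <= s <= high]

# Made-up time-on-page values in seconds: two likely accidental visits
# at the low end and one abandoned browser tab at the high end.
visits = [0.4, 1.1, 8.0, 35.5, 62.0, 140.0, 3600.0]

kept = trim_dwell_times(visits)
mean_stickiness = sum(kept) / len(kept)

print(kept)             # [8.0, 35.5, 62.0, 140.0]
print(mean_stickiness)  # 61.375
```

Without trimming, the single hour-long visit would dominate the average and make the page look far stickier than it is.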