Big Data Lessons from Microsoft “how-old” Experiment

Salil Mehta examines Microsoft’s viral “How old do I look?” site, the limits of its age recognition, possible algorithms, and implications for Big Data analysis.

To appreciate the conditional life population likelihoods, look at either of the two charts below to get a sense of the distribution layout, by gender (male or female) and by children/youth versus mature/older (0 versus 1). For ease of demonstration, the small and censored fraction of the global population aged over 100 (<.05%, nearly all of them women) has been truncated for this exercise. The treatment of censored data through statistical modeling will be dealt with in a later blog article. For now, we can also see the dearth of births during the recent global financial crisis. The complete chart, with all four mutually exclusive and collectively exhaustive population segments added to one another, is shown on the lower right.

Here is the distribution below for just a fraction of the typical partitions, based on facial bone changes, that Microsoft’s application would be making. They are not nicely mound shaped, and they have these typical ages (with, just as important, the conditional standard deviation in parentheses):

older male         – 33%, age 50 (19)

older female      –  6%,  age 75 (10)

younger male    – 16%, age 18 (13)

younger female – 45%, age 38 (21)
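To see what these four segments imply when pooled back together, here is a minimal sketch that combines them into a single mean and standard deviation. It assumes that the percentages act as mixture weights and that each "typical age" can be treated as a segment mean; neither assumption is stated by Microsoft, and the names `SEGMENTS` and `mixture_mean_sd` are ours.

```python
import math

# (weight, typical age, conditional SD), copied from the list above.
# Assumption: percentages are mixture weights, typical age is the segment mean.
SEGMENTS = {
    "older male": (0.33, 50, 19),
    "older female": (0.06, 75, 10),
    "younger male": (0.16, 18, 13),
    "younger female": (0.45, 38, 21),
}

def mixture_mean_sd(segments):
    """Pool weighted segments into one overall mean and standard deviation."""
    total_weight = sum(w for w, _, _ in segments.values())
    mean = sum(w * mu for w, mu, _ in segments.values()) / total_weight
    # E[X^2] of a mixture is the weighted sum of (sd^2 + mu^2) per segment.
    second_moment = sum(
        w * (sd ** 2 + mu ** 2) for w, mu, sd in segments.values()
    ) / total_weight
    return mean, math.sqrt(second_moment - mean ** 2)

overall_mean, overall_sd = mixture_mean_sd(SEGMENTS)
print(f"pooled mean age: {overall_mean:.1f}, pooled SD: {overall_sd:.1f}")
# pooled mean age: 41.0, pooled SD: 23.1
```

Under these assumptions, a guess made with no partitioning at all would face a spread of roughly 23 years, wider than any single segment's conditional standard deviation.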

Clearly the computer is doing something, and not blindly making bogus guesses from the entire distribution with each face it recognizes. But this is a weaker form of adding value, similar to how central bankers purport to “add value” despite their inability to ever guess the critical turns in the economy, and to how machine learning attempts to carve out basic partitions in the data. There are a couple of other portions of the data distribution that we described earlier, beyond the evolutionary facial bone changes we all experience. For example, the computer application can attempt to understand other major facial characteristics: for a younger person it might look at the amount of hair and the size of the nose, while for older people it might focus more on the suppleness of the eyes and on skin wrinkles. And as we’ll see, the computer is easily deceived about the meaning of all of these. Incorporating this greater number of parameters, we can expect the final standard errors in Microsoft’s product to come in at about 1/3 of what is shown above. Yes, there is speed, but there is also high error: it’s like having a supercomputer that’s twice as fast as your current handheld calculator, though it suddenly creates a stream of noisy errors that can’t ever be corrected. Also note that this shows that for younger females, the typical errors could be larger than for typical men. Nonetheless, a guess that is off by a decade would be greater than a typical random error (e.g., from a monkey throwing darts) in every selected segment.
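As a rough check on that last claim, the sketch below shrinks each segment's conditional standard deviation by the "about 1/3" factor and compares the result with a ten-year miss. The factor is the approximate one quoted above, not a measured property of Microsoft's product, and the names `SEGMENT_SD`, `SHRINK`, and `refined_errors` are ours.

```python
# Conditional standard deviations per segment, copied from the list above.
SEGMENT_SD = {
    "older male": 19,
    "older female": 10,
    "younger male": 13,
    "younger female": 21,
}
SHRINK = 1 / 3  # the rough "about 1/3" factor from the text (an approximation)
MISS = 10       # a guess that is off by a decade

def refined_errors(segment_sd, shrink=SHRINK):
    """Scale each segment's standard deviation by the shrink factor."""
    return {name: sd * shrink for name, sd in segment_sd.items()}

refined = refined_errors(SEGMENT_SD)
for name, se in refined.items():
    print(f"{name:15s} refined error ~{se:4.1f} years; "
          f"a {MISS}-year miss is {MISS / se:.1f}x that")
```

Every refined error lands below ten years (about 6.3, 3.3, 4.3, and 7.0), with younger females the widest, so a decade-sized miss exceeds one typical error in each segment.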

Technically speaking, a dart-throwing monkey, or anyone, would not simply guess anywhere along the entire population age distribution, but would rather focus on the component, or zone, sharing the same basic statistical characteristics as the “matching” face. One might ask whether, say, a trained monkey throwing a dart wouldn’t just aim for the center of such a large and unwieldy distribution shape (nothing like a smooth normal distribution). That would be an incorrect interpretation of our probability analogy. Readers who think that should instead picture the entire (sub)distribution partitioned into 20 or 100 equal-sized spokes, each of which is then used to complete a dartboard design with the same number of spokes. The monkey would then blindly aim for the dartboard (the board could be spun about its center if that’s a concern).
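The dartboard picture can be sketched in a few lines. The example below cuts a sample of ages into equal-probability spokes and lets the "monkey" hit a spoke uniformly at random. The gamma-distributed sample is purely hypothetical, chosen only because it is skewed rather than mound shaped, and `build_dartboard` is our own illustrative helper.

```python
import random
import statistics

def build_dartboard(ages, n_spokes=20):
    """Cut a sample into n_spokes equal-probability bins.

    Returns one representative age (the bin median) per spoke.
    """
    if len(ages) < n_spokes:
        raise ValueError("need at least one observation per spoke")
    ordered = sorted(ages)
    width = len(ordered) / n_spokes
    spokes = []
    for i in range(n_spokes):
        chunk = ordered[int(i * width):int((i + 1) * width)]
        spokes.append(statistics.median(chunk))
    return spokes

rng = random.Random(0)
# Hypothetical skewed segment, truncated at 100 as in the article.
ages = [min(100.0, rng.gammavariate(2.0, 10.0)) for _ in range(2000)]

spokes = build_dartboard(ages, n_spokes=20)
# A blind throw is a uniform pick of one spoke.
throws = [rng.choice(spokes) for _ in range(10_000)]

print(f"sample mean {statistics.mean(ages):.1f}, "
      f"dart mean {statistics.mean(throws):.1f}, "
      f"dart spread {statistics.pstdev(throws):.1f}")
```

Because every spoke holds the same share of the segment, blind throws reproduce the segment's own spread instead of clustering at its center, which is the point of the analogy.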

Of course, we know the subdistributions of the population (e.g., each colored segment above) are unequally weighted to begin with. This is just to be analytically complete in describing the random process the computer goes through. The computer does not simply guess at random and model off of the entire distribution. This is similar to, say, FICO credit scores, where the computer first seeks to isolate the user into demographic buckets (legally these must be very rudimentary), and then more finely guesses the parametric characteristics for each group and how they vary versus the overall population. In the end, we hope the entire “model” works, but the proof is only seen in better and more consistent output. Without Microsoft providing an actual confidence level, we collect and demonstrate here, on our own, their flawed out-of-sample results that give users a false sense of accuracy.
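How much does the bucketing step buy on its own? One way to gauge it is the law of total variance, which splits the overall age variance into a within-segment part (what remains after bucketing) and a between-segment part (what bucketing removes). This is a sketch that treats the sampled segment percentages, typical ages, and standard deviations as weights, means, and SDs; that reading is our assumption, as are the names `SEGMENTS` and `variance_split`.

```python
import math

# (weight, typical age, conditional SD) for the sampled segments.
# Assumption: percentages are weights and typical ages are segment means.
SEGMENTS = [
    (0.33, 50, 19),  # older male
    (0.06, 75, 10),  # older female
    (0.16, 18, 13),  # younger male
    (0.45, 38, 21),  # younger female
]

def variance_split(segments):
    """Return (within, between) variance via the law of total variance."""
    total_weight = sum(w for w, _, _ in segments)
    mean = sum(w * mu for w, mu, _ in segments) / total_weight
    within = sum(w * sd ** 2 for w, _, sd in segments) / total_weight
    between = sum(w * (mu - mean) ** 2 for w, mu, _ in segments) / total_weight
    return within, between

within, between = variance_split(SEGMENTS)
share = between / (within + between)
print(f"typical error without buckets: {math.sqrt(within + between):.1f} years")
print(f"typical error after bucketing: {math.sqrt(within):.1f} years")
print(f"variance removed by bucketing: {share:.0%}")
# typical error without buckets: 23.1 years
# typical error after bucketing: 18.7 years
# variance removed by bucketing: 35%
```

On these figures, bucketing removes only about a third of the variance, so the finer per-group modeling has to do most of the work.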

The errors caused by this tool, in failing to produce an unbiased guess as to one’s age, show up right away, as we have seen when it was applied above to a single person. Where appropriate, this can also be shown by looking at one’s own pictures from, say, two decades ago, and noticing how Microsoft’s guessed age overestimates them towards a sticky value of just less than 30, as if pivoting about one of the final segment ages above (e.g., see the sampled ones above). Even for the three other self-portraits further above, we can see how contortions to the picture that alter the face shape produce a narrow age bias that is also further in the wrong direction. This of course is not an issue only for the author, but across the entire human population, where even a correct answer could have been a false negative (e.g., a lucky guess).

Let’s look at other important contortions, on different types of people. We’ll look at Hollywood actresses, a group that is selection biased towards a segment whose very career survival depends on “being young”. Bing’s product, if anything, would always want to err on the side of looking younger when guessing at this cohort. But there too, it instead fails spectacularly at times (nor can it account for pervasive cosmetic surgery and other artificial deformities).

But look at this Andy Warhol impression of his most illustrious muse, Marilyn Monroe.  What would be your age guess on this?

HowOldRobot doesn’t even recognize a face in the upper left. But then, working clockwise, it concludes these ages for the other three panels: 53, 66, 72. Ouch, though this is also a reflection of the high and tight upward bias that we saw above is possible for women. All of these guesses are horrible and wrong in the same direction, and the worst offender was a guess of 72 (twice the age at which she died).
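The size of that one-directional miss is easy to put a number on. The sketch below uses 36, the age at death implied by the "twice" remark, as an upper bound on her true age in any photograph, so the computed errors are lower bounds on the overestimates. The variable names are ours.

```python
import statistics

GUESSES = [53, 66, 72]  # ages reported for the three recognized panels
AGE_AT_DEATH = 36       # implied by the text: 72 is twice this value

# Her age in any photograph cannot exceed her age at death,
# so each difference is a lower bound on the overestimate.
errors = [guess - AGE_AT_DEATH for guess in GUESSES]
bias = statistics.mean(errors)

print(errors)                                # [17, 30, 36]
print(round(bias, 1))                        # 27.7
print(all(error > 0 for error in errors))    # True
```

A mean error of nearly 28 years with every error positive is the signature of bias, not of noise scattered around the right answer.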