Trump, Failure of Prediction, and Lessons for Data Scientists
Donald Trump's shocking and unexpected win of the presidency of the United States has once again shown the limits of Data Science and prediction when dealing with human behavior.
Accurate prediction rests on statistics, and statistics needs a history of similar events, plus assumptions such as independence, to work correctly.
If we toss 100 million fair coins, we can predict the number of heads and tails quite accurately. But using polling to predict the votes of 100 million people is much more difficult. Pollsters need to get a representative sample, estimate the likelihood of a person actually voting, make many justified and unjustified assumptions, and avoid following their conscious and unconscious biases.
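The contrast between coins and polls can be made concrete with a little arithmetic. The sketch below is only an illustration: the function names, the sample size of 1,000, and the 2-point systematic bias are assumptions chosen for the example, not figures from any actual 2016 poll.

```python
import math

def binomial_std(n, p=0.5):
    """Standard deviation of the number of heads in n independent tosses."""
    return math.sqrt(n * p * (1 - p))

def poll_margin_of_error(sample_size, p=0.5, z=1.96):
    """Approximate 95% margin of error for a simple random sample."""
    return z * math.sqrt(p * (1 - p) / sample_size)

def poll_rmse(bias, sample_size, p=0.5):
    """Root-mean-square error of a poll with a fixed systematic bias.

    The bias term (hypothetical here) does not shrink as the sample grows;
    only the sampling term does.
    """
    return math.sqrt(bias ** 2 + p * (1 - p) / sample_size)

# Coins: 100 million independent fair tosses.
n_coins = 100_000_000
std_heads = binomial_std(n_coins)        # 5000 heads
relative_spread = std_heads / n_coins    # 0.00005, i.e. 0.005% of all tosses

# Polls: a sample of 1,000 has a margin of error of about 3 points,
# and that is before any systematic error is considered.
moe = poll_margin_of_error(1000)

# With an assumed 2-point bias, a bigger sample barely helps.
small_sample_error = poll_rmse(bias=0.02, sample_size=1_000)
large_sample_error = poll_rmse(bias=0.02, sample_size=1_000_000)

print(f"std of heads: {std_heads:.0f} ({relative_spread:.3%} of tosses)")
print(f"poll margin of error (n=1000): {moe:.1%}")
print(f"RMSE with 2-point bias: n=1e3 {small_sample_error:.2%}, "
      f"n=1e6 {large_sample_error:.2%}")
```

The coin count is pinned down to within a tiny fraction of a percent because the tosses are independent and unbiased. A poll loses both properties: its sampling error is far larger, and any shared bias stays in place no matter how many people are surveyed.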
In the case of the US presidential election, correct prediction is even more difficult because of our system, in which each state (except for Maine and Nebraska) awards all of its electoral college votes to the statewide winner, so results must be predicted state by state.
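A toy example shows why winner-take-all allocation makes state-level accuracy matter so much. The three states, their electoral vote counts, and their vote totals below are entirely hypothetical, as are the candidate labels "A" and "B"; the sketch also ignores the Maine and Nebraska exceptions.

```python
# Hypothetical data: (state name, electoral votes, votes for A, votes for B).
HYPOTHETICAL_STATES = [
    ("State X", 10, 800_000, 200_000),  # A wins by a landslide
    ("State Y", 8, 480_000, 520_000),   # B wins narrowly
    ("State Z", 8, 490_000, 510_000),   # B wins narrowly
]

def tally(states):
    """Return (popular vote totals, electoral vote totals) per candidate.

    Each state's electoral votes all go to its statewide winner.
    For simplicity, an exact tie is awarded to B.
    """
    popular = {"A": 0, "B": 0}
    electoral = {"A": 0, "B": 0}
    for _name, electoral_votes, votes_a, votes_b in states:
        popular["A"] += votes_a
        popular["B"] += votes_b
        winner = "A" if votes_a > votes_b else "B"
        electoral[winner] += electoral_votes
    return popular, electoral

popular, electoral = tally(HYPOTHETICAL_STATES)
print("popular:", popular)      # A leads the popular vote
print("electoral:", electoral)  # B leads the electoral vote
```

In this made-up scenario A wins the popular vote by a wide margin yet loses the electoral count 10 to 16, because B's two wins are narrow. A polling error of only a couple of points in the two close states would flip the overall outcome.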
The chart below shows that pollsters were off the mark in many states, mostly underestimating the Trump vote.
Source: @NateSilver538 tweet, Nov 9, 2016
To be fair, some statisticians, such as Salil Mehta (@salilstatistics), were warning about the unreliability of polls, and David Wasserman of 538 described exactly this scenario in September 2016 in "How Trump Could Win The White House While Losing The Popular Vote." Most pollsters, however, were way off.
So a good lesson for Data Scientists is to question their assumptions and to be especially skeptical when predicting a rare event that has limited history and depends on human behavior.
Other important lessons are:
- Examine data quality: in this election, polls were not reaching all likely voters.
- Beware of your own biases: many pollsters were likely Clinton supporters and did not want to question the results that favored their candidate. For example, the Huffington Post had forecast a 98% chance of a Clinton victory.
Other analyses of polling failures:
- Wired: Trump’s Win Isn’t the Death of Data—It Was Flawed All Along
- NYTimes: How Data Failed Us in Calling an Election
- Datanami: Six Data Science Lessons from the Epic Polling Failure
- InformationWeek: Trump's Election: Poll Failures Hold Data Lessons For IT
- Why I Had to Eat a Bug on CNN, by Sam Wang of Princeton, whose Princeton Election Consortium gave Trump a 15% chance to win