Analytics, Trump, Clinton and the Polls: Sports Analytics Series Part 5.1

Recent presidential elections (especially 2008 and 2012) have featured heavy use of analytics by candidates and pundits.  The Obama campaigns were credited with using microtargeting and advanced analytics to win elections. Analysts like Nate Silver were hailed as statistical gurus who could use polling data to predict outcomes.  In the lead-up to this year's contest we heard a lot about the Clinton campaign's analytical advantages, and the election forecasters became regular parts of election coverage.

Then Tuesday night happened.  The polls were wrong (by a little) and the advanced microtargeting techniques didn't pay off (enough).

Why did the analytics fail?

First, the polls and the election forecasts (I'll get to the value of campaign analytics next week).  As background, commentators tend not to truly understand polls, and this creates confusion because they frequently over-interpret and misinterpret what polls are saying.  For example, whenever "margin of error" is mentioned they tend to get things wrong.  A poll's margin of error is driven by its sample size.  The common journalistic error is applying a single poll's margin of error (3% or 4%) to an average of many polls.  When looking at an average of many polls, the effective "margin of error" is much smaller because the "poll of polls" has a much larger combined sample size.  This is a key point: when even the combined polls miss the result, it is all the more clear that something systematic went wrong in 2016.
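To make the sample-size point concrete, here is a minimal sketch using the standard margin-of-error formula for a proportion. The poll sizes (a single 1,000-person poll versus ten such polls pooled) are illustrative, and the calculation assumes the pooled polls behave like one simple random sample, which ignores house effects and correlated polling errors.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# A single poll of 1,000 respondents
single = margin_of_error(1000)        # ~0.031 -> roughly +/- 3 points

# Ten such polls pooled together (10,000 effective respondents)
pooled = margin_of_error(10 * 1000)   # ~0.010 -> roughly +/- 1 point

print(f"single poll: +/- {single:.1%}, poll of polls: +/- {pooled:.1%}")
```

Under these assumptions, averaging ten comparable polls cuts the margin of error from about three points to about one point, which is why a miss by the polling average is harder to explain away as random noise.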

Diagnosing what went wrong is complicated by two factors.  First, because every pollster does things differently, we can't make blanket statements or talk in absolutes.  Second, diagnosing the problem requires a deep understanding of the statistics and assumptions involved in polling.

In the 2016 election, my suspicion is that two things went wrong.  As a starting point, we need to realize that polls embed strong implicit assumptions about the nature of the underlying population and about voter passion (rather than just preference).  When these assumptions don't hold, the polls will fail systematically.

First, most polls start with assumptions about the nature of the electorate.  In particular, there are assumptions about the base levels of Democrats, Republicans and Independents in the population.  Very often the differences between polls come down to these assumptions (the LA Times poll versus the ABC News poll, for example).

The problem with assumptions about party affiliation in an election like 2016 is that the underlying coalitions of the two parties are in transition.  When I was growing up, the conventional wisdom was that the Republicans were the party of the wealthy, the suburban professionals, and the free-trading capitalists, while the Democrats were the party of the working man and the unions.  Obviously these coalitions have changed.  My conjecture is that pollsters didn't sufficiently rebalance their samples.  In the current environment it might make sense to place greater emphasis on demographics (race and income) when designing sampling segments, as in the sketch below.
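The sketch below shows one simple way a pollster might rebalance: post-stratification weights that scale each demographic cell up or down until the sample matches assumed population targets. The race-by-income cells, respondent counts, and population shares are all hypothetical numbers chosen for illustration, not figures from any actual 2016 poll.

```python
# Hypothetical post-stratification sketch: reweight respondents so the sample
# matches assumed electorate shares in each demographic cell (race x income).

raw_counts = {            # respondents per cell in a made-up 1,000-person sample
    ("white", "lower income"): 180, ("white", "higher income"): 320,
    ("nonwhite", "lower income"): 150, ("nonwhite", "higher income"): 350,
}
population_share = {      # assumed share of the actual electorate in each cell
    ("white", "lower income"): 0.30, ("white", "higher income"): 0.25,
    ("nonwhite", "lower income"): 0.20, ("nonwhite", "higher income"): 0.25,
}

n = sum(raw_counts.values())
weights = {cell: population_share[cell] / (raw_counts[cell] / n)
           for cell in raw_counts}

for cell, w in sorted(weights.items()):
    print(cell, round(w, 2))   # weight > 1 means the cell was under-sampled
```

If the assumed population shares are wrong, the weights are wrong, and every headline number built on them inherits the error; that is exactly the kind of systematic miss an out-of-date view of the party coalitions can produce.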

The other issue is that more attention needs to be paid to avidity / engagement / passion (choose your own marketing buzzword).  Polls often differentiate between likely and registered voters, but this may have been insufficient in this election.  If Clinton's likely voters were 80% likely to show up and Trump's were 95% likely, then a small percentage lead in a preference poll isn't going to hold up in an election.
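Here is the arithmetic behind that scenario, using the hypothetical turnout probabilities above and an assumed 52-48 preference split (the split is my own illustrative number, not from the text or any poll).

```python
# Hypothetical numbers: turnout propensity can erase a preference lead.
clinton_pref, trump_pref = 0.52, 0.48        # preference among likely voters
clinton_turnout, trump_turnout = 0.80, 0.95  # probability each group actually votes

clinton_votes = clinton_pref * clinton_turnout   # 0.416 of the likely-voter pool
trump_votes = trump_pref * trump_turnout         # 0.456 of the likely-voter pool

total = clinton_votes + trump_votes
print(f"Clinton {clinton_votes / total:.1%} vs Trump {trump_votes / total:.1%} of actual voters")
# The 4-point preference lead flips to a deficit among those who actually vote.
```

A poll that measures only preference, without a good handle on how motivated each candidate's supporters are, can be "right" about what people want and still miss what they do.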

The story of the 2016 election is something every analytics professional should understand.  From the polling side, the lesson is that we need to understand and question the underlying assumptions of our models and data.  As the world changes, do our assumptions still hold?  Is our data still measuring what we hope it does?  Is a single dependent measure (preference versus avidity, in this case) enough?