Analytics vs Intuition in Decision-Making Part IV: Outliers

We have been talking about developing predictive models for tasks like evaluating draft prospects.  Last time we focused on the question of what to predict.  For drafting college prospects, this amounts to predicting things like rookie year performance measures.  In statistical parlance, these are the dependent, or Y, variables.  We did this in the context of basketball and talked broadly about linear models that deliver point estimates and probability models that give the likelihood of various categories of outcomes.

Before we move to the other side of the equation and talk about the “what” and the “how” of working with the explanatory or X variables, we wanted to take a quick diversion and discuss predicting draft outliers.  By outliers we mean players who significantly over- or under-perform relative to their draft position.  In the NFL, we can think of this as the problem of how to avoid taking Ryan Leaf with the second overall pick and how to grab Tom Brady before the sixth round.

In our last installment, we focused on predicting performance regardless of when a player is picked.  In some ways, this is a major omission.  All the teams in a draft are trying to make the right choices.  This means that what we are really trying to do is to exploit the biases of our competitors to get more value with our picks.

There are a variety of ways to address this problem, but for today we will focus on a relatively simple two-step approach.  The key to this approach is to create a dependent variable that indicates whether a player over-performs relative to their draft position, and then to try to understand whether there is data that is systematically related to these over- and under-performing picks.

For illustrative purposes, let us assume that our key performance metric is rookie year player efficiency (PER(R)).  If teams draft rationally and efficiently (and PER is the right metric), then there should be a strong linkage between rookie year PER and draft position in the historical record.  Perhaps we estimate the following equation:

$$\mathrm{PER}(R) = B_0 + B_{DP} \cdot \mathrm{DraftPosition} + \dots$$

where PER(R) is rookie year efficiency and draft position is the order in which the player is selected.  In this “model” we expect the estimated coefficient on draft position (BDP) to be negative, since as draft position increases (later picks) we would expect lower rookie year performance.  As always in these simple illustrations, the proposed model is too simple.  Maybe we need a quadratic term or some other nonlinear transformation of the explanatory variable (draft position).  But we are keeping it simple to focus on the ideas.

The second step would then be to calculate how specific players deviate from their predicted performance based on draft position.  A measure of over or under performance could then be computed by taking the difference between the player’s actual PER(R) and the predicted PER(R) based on draft position.

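$$\mathrm{DraftPremium} = \mathrm{PER}(R) - \widehat{\mathrm{PER}}(R)$$

where the hat denotes the PER(R) predicted from draft position.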
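The two steps can be sketched in a few lines of Python. This is only an illustration: the draft positions and PER values below are invented for the example, and the one-variable least squares fit stands in for whatever richer specification a real analysis would use.

```python
# Hypothetical (draft position, rookie-year PER) pairs; not real player data.
picks = [1, 2, 3, 5, 8, 12, 20, 30]
per_r = [18.0, 16.5, 17.0, 14.0, 15.5, 11.0, 12.5, 9.0]

n = len(picks)
mean_x = sum(picks) / n
mean_y = sum(per_r) / n

# Step 1: ordinary least squares for PER(R) = B0 + BDP * DraftPosition.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(picks, per_r))
var_x = sum((x - mean_x) ** 2 for x in picks)
b_dp = cov_xy / var_x
b0 = mean_y - b_dp * mean_x

# Step 2: DraftPremium = actual PER(R) minus the PER(R) predicted from draft position.
predicted = [b0 + b_dp * x for x in picks]
draft_premium = [y - p for y, p in zip(per_r, predicted)]

for x, y, d in zip(picks, per_r, draft_premium):
    print(f"pick {x:2d}: actual PER {y:5.1f}, draft premium {d:+.2f}")
```

A positive premium marks a player who outperformed what his draft slot would predict; a negative one marks an under-performer.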

Draft Premium (or deficit) would then be the dependent variable in an additional analysis.  For example, we might theorize that teams overweight the value of the most recent season.  In this case the analyst might specify the following equation.

$$\mathrm{DraftPremium} = B_0 + B_P \cdot \mathrm{PER}(4) + B_{DIFF} \cdot \big(\mathrm{PER}(4) - \mathrm{PER}(3)\big) + \dots$$

This expression explains the over (or under) performance (DraftPremium) based on PER in the player’s senior season (PER(4)) and the change in PER between the 3rd and 4th seasons.  If the statistical model yielded a negative value for BDIFF it would suggest that players with dramatic senior-year improvements tend to under-perform their draft position, meaning that the improvement tended to be a bit of a fluke.  We might also include physical traits or level of play (Europe versus the ACC?).  Again, we will call these empirical questions that must be answered by spending (a lot of) time with the data.
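As a rough sketch of this second-stage regression, the snippet below fits DraftPremium on PER(4) and the PER(4) – PER(3) change. The data are made up, and the small `ols` helper is a hypothetical stand-in (normal equations solved by Gauss-Jordan elimination) for a proper statistics package.

```python
def ols(X, y):
    """Least-squares coefficients for rows X (each starting with 1.0) and targets y."""
    k = len(X[0])
    # Augmented normal equations: (X'X | X'y).
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical players: senior-year PER, junior-year PER, first-stage draft premium.
per4 = [24.0, 21.0, 26.0, 19.0, 22.0, 25.0, 20.0, 23.0]
per3 = [22.0, 20.5, 19.0, 18.0, 21.5, 18.5, 19.5, 22.0]
premium = [1.5, 0.8, -2.0, 0.2, 1.0, -1.5, 0.4, 1.2]

# Columns: intercept, PER(4), PER(4) - PER(3).
X = [[1.0, p4, p4 - p3] for p4, p3 in zip(per4, per3)]
b0, b_p, b_diff = ols(X, premium)
residuals = [t - (b0 + b_p * r[1] + b_diff * r[2]) for r, t in zip(X, premium)]

print(f"B0 = {b0:.3f}, BP = {b_p:.3f}, BDIFF = {b_diff:.3f}")
```

With real data, the sign and size of the estimated BDIFF would be the quantity of interest; nothing should be read into the values produced by these invented numbers.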

We could also define “booms” or “busts” based on the degree of deviation from the predicted PER.  For example, we might label players in the top 15% of over performers to be “booms” and players in the bottom 15% to be “busts”.  We could then use a probability model like a binary probit to predict the likelihood of boom or bust.
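The labeling step might look something like the sketch below. The draft premiums are invented, and `label_boom_bust` is a hypothetical helper; the resulting boom (or bust) indicator is what would then be passed to a probability model such as a probit, which is not fit here.

```python
def label_boom_bust(premiums, share=0.15):
    """Label the top `share` of draft premiums 'boom' and the bottom `share` 'bust'."""
    n = len(premiums)
    k = round(n * share)  # number of players in each tail
    order = sorted(range(n), key=lambda i: premiums[i])  # indices, worst to best
    labels = ["neither"] * n
    for i in order[:k]:
        labels[i] = "bust"
    for i in order[n - k:]:
        labels[i] = "boom"
    return labels

# Hypothetical draft premiums for 20 players.
premiums = [2.1, -0.4, 0.3, -3.2, 1.0, 0.0, -1.1, 4.5, 0.7, -0.2,
            -2.6, 0.9, 1.4, -0.8, 3.3, 0.2, -1.9, 0.5, -0.6, 1.8]

labels = label_boom_bust(premiums)
# Binary dependent variable for a boom model: 1 if boom, 0 otherwise.
is_boom = [1 if lab == "boom" else 0 for lab in labels]
print(list(zip(premiums, labels)))
```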

Boom / Bust methodologies can be an important and specialized tool.  For instance, a team drafting in the top five might want to statistically assess the risk of taking a player with a minimal track record (1 year wonders, high school preps, European players, etc…).   Alternatively, when drafting in late rounds maybe it’s worth it to pick high risk players with high upsides.  The key point about using statistical models is that words like risk and upside can now be quantified.

For those following the entire series it is worth noting that we are doing something very different in this “outlier” analysis compared to the previous “predictive” analyses.  Before, we wanted to “predict” the future based on currently available data.  Today we have shifted to trying to find “value” by identifying the biases of other decision makers.

Mike Lewis & Manish Tripathi, Emory University 2015.

For Part 1 Click Here

For Part 2 Click Here

For Part 3 Click Here

Analytics vs Intuition in Decision-Making Part III: Building Predictive Models of Performance

So far in our series on draft analytics, we have discussed the relative strengths and weaknesses of statistical models relative to human experts, and we have talked about some of the challenges that occur when building databases.  We now turn to questions and issues related to building predictive models of athlete performance.

“What should we predict?” is a deceptively simple question that needs to be answered early and potentially often throughout the modeling process.  Early – because we need to have some idea of what we want to predict before the database can be fully assembled.  Often – because frequently it will be the case that no single performance metric will be ideal.

There is also the question of what “type” of thing should be predicted.  It can be a continuous variable, like how much of something.  Yards gained in football, batting average in baseball or points scored in basketball would be examples.  It can also be categorical (e.g. is the player an all-star or not).

A Simple Example

So what to predict?  For now, we will focus on basketball with a few comments directed towards other sports.  We have options.  We can start with something simple like points or rebounds (note that these are continuous quantities – things like points that vary from zero to the high twenties rather than categories like whether a player is a starter or not).  We don’t think these are bad metrics but they do have limitations.  The standard complaint is that these single statistics are too one dimensional.  This is true (by definition, in this case) but there may be occasions when this is a useful analysis.

First, maybe the team seeks a one dimensional player.  The predicted quantity doesn’t need to be points.  Perhaps, there is a desperate need for rebounding or assists.  It’s a team game, and it is legitimate to try and fill a specialist role.  A single measure like points might also be useful because it could be correlated with other good “things” that are of interest to the team.

For a moment, let us assume that we select points per game as the measure to be predicted, and we predict this using all sorts of collegiate statistics (the question of the measures we should use to predict is for next time).   In the equation below, we write what might be the beginning of a forecasting equation.  In this expression, points scored during the rookie season (Points(R)) is to be predicted using points scored in college (Points(C)), collegiate strength of schedule (SOS), an interaction of points scored and strength of schedule (Points(C) X SOS) and potentially other factors.

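$$\mathrm{Points}(R) = \beta_0 + \beta_P\,\mathrm{Points}(C) + \beta_{SOS}\,\mathrm{SOS} + \beta_{PS}\,\mathrm{Points}(C) \times \mathrm{SOS} + \cdots$$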

The logic of this equation is that points scored rookie year is predictable from college points, level of competition and an adjustment for whether the college points were scored against high level competition.  When we take this model to the data via a linear regression procedure we get numerical values for the beta terms.  This gives us a formula that we can use to “score” or predict the performance of a set of prospects.
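To make the “scoring” step concrete, here is a minimal sketch. The beta values are placeholders picked for the example (in practice they come out of the regression), and the prospects and their numbers are hypothetical.

```python
# Placeholder coefficients; real values would be estimated from data on existing pros.
BETA_0, BETA_P, BETA_SOS, BETA_PS = 1.0, 0.25, 0.5, 0.02

def predict_rookie_points(points_c, sos):
    """Score a prospect from college points per game and strength of schedule."""
    return BETA_0 + BETA_P * points_c + BETA_SOS * sos + BETA_PS * points_c * sos

def marginal_effect_of_college_points(sos):
    """With the interaction term, the value of one more college point depends on SOS."""
    return BETA_P + BETA_PS * sos

# Hypothetical prospects: (college points per game, strength of schedule).
prospects = {"Prospect A": (18.0, 5.0), "Prospect B": (22.0, 1.0)}
for name, (pts, sos) in prospects.items():
    print(f"{name}: predicted rookie points {predict_rookie_points(pts, sos):.1f}")
```

The interaction term is what lets the same college scoring average count for more when it was earned against a tougher schedule.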

The preceding is a “toy” specification in that a serious analysis would likely use a greatly expanded specification.  In the next part of our series we will focus on the right side of the equation: what should be used as explanatory variables and what form these variables should take.

Some questions naturally arise from this discussion…

  • What pro statistics are predictable based on college performance? Maybe scoring doesn’t translate but steals do?
  • Is predicting rookie year scoring appropriate? Should we predict 3rd year scoring to get a better sense of what the player will eventually become?
  • Should the model vary based on position? Are the variables that predict something like scoring or rebounding the same for guards and forwards?

Most of these questions are things that should be addressed by further analysis.  One thing that the non-statistically inclined tend not to get is that there is value in looking at multiple models.  It is seldom clear-cut what the model should look like, and it’s rare that one size fits all (same model for point guards and centers?).  And maybe models only work sometimes.  Maybe we can predict pro steals but not points.  One reason why the human experts need to become at least statistically literate is that if they aren’t, either the results from the analytics guys need to be overly simplified or the expert will tend to reject the analytics because the multitude of models is just too complex.

A simple metric like points (or rebounds, or steals, etc…) is inherently limited.  There are a variety of other statistics that could be predicted that better capture the all-round performance of a player or the player’s impact on the team.  But the basic modeling procedure is the same.  We use data on existing pros to estimate a statistical model that predicts the focal metric based on data available about college prospects.

Some other examples of continuous variables we might want to predict…

  1. Player Efficiency

How about something that includes a whole spectrum of player statistics like John Hollinger’s Player Efficiency Rating (PER)?  PER involves a formula that weights points, steals, rebounds, assists and other measures by fixed weights (not weights estimated from data as above).  For instance, points are multiplied by 1 while defensive rebounds are worth .3.

There are some issues with PER, such as the formula being structured so that even low percentage shooters can increase their efficiency ratings by taking more shots.  But the use of multiple types of statistics does provide a more holistic measurement.  In our project with the Dream we used a form of PER adapted to account for some of the data limitations.  In this project some questions were raised about whether PER was an appropriate metric for the women’s game or if the weights should be different.

  2. Plus/Minus

Plus/Minus rates are a currently popular metric.  Plus/Minus stats basically measure how a player’s team performs when he or she is on the court.  Plus/Minus is great because it captures the fact that teams play better or worse when a given player is on the court.  But Plus/Minus can also be argued against if substitution patterns are highly correlated.  In our project with the Dream Plus/Minus wasn’t considered simply because we did not have a source.

  3. Minutes played

One metric that we like is simply minutes played.  While this may seem like a primitive metric, it has some nice properties.  The biggest plus is that it reflects the coach’s (a human expert) judgment.  Assuming that the human decision is influenced by production (points, rebounds, etc…) this metric is more of an intuition / analysis hybrid.  On the downside, minutes played are obviously a function of the other players on the team and injuries.

Categories of Success & Probability Models

As noted, the preceding discussion revolves around predicting numerical quantities.  There is also a tradition of placing players into broad categories.  A player that starts for a decade is probably viewed as a great draft pick while someone that doesn’t make a roster is a disaster.  Our goal with “categories” is to predict the probability that each outcome occurs.

This type of approach likely calls for a different class of models.  Rather than use linear regression we would use a probability model.  For example, there is something called an ordered logistic regression model that we can use to predict the probability of “ordered” career outcomes.  For instance, we could predict the probabilities of a player becoming an all-star, a long-term starter, an occasional starter, a career backup or a non-contributor with this type of model.  Again, we can make this prediction as a function of the player’s college performance and other available data.

Below we write an equation that captures this.

Pr(Category=j)=f(college stats,physical attributes,etc…)

This equation says that the probability that a player becomes some category “j” is some function of a bunch of observable traits.  We are going to skip the math but these types of models do require a bit “more” than linear regression models (specialized software mostly) and are more complicated to interpret.

A nice feature of probability models is that the predictions are useful for risk assessment.  For example, an ordered logistic model would provide probability estimates for the range of player categories.  A given prospect might have a 5% chance of becoming an all-star, a 60% chance of becoming a starter and a 35% chance of being a career backup.  In contrast, the linear regression models described previously will only produce a “point” estimate, something along the lines of a given prospect being predicted to score 6.5 points per game or to grab 4 rebounds per game as a pro.
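For readers curious about the mechanics we skipped, the sketch below shows how an ordered logistic model turns a single prospect “score” into category probabilities. The three categories, the score of zero and the cutpoints are all assumptions, chosen only so that the output lands close to the 35% / 60% / 5% split in the example; estimating the score and cutpoints from data is the job of the specialized software.

```python
import math

def ordered_logit_probs(score, cutpoints):
    """Category probabilities, lowest to highest category, for an ordered logit.

    P(category <= j) = logistic(cutpoint_j - score); the probability of each
    category is the difference between adjacent cumulative probabilities.
    """
    cdf = [1.0 / (1.0 + math.exp(-(c - score))) for c in cutpoints] + [1.0]
    return [cdf[0]] + [cdf[j] - cdf[j - 1] for j in range(1, len(cdf))]

categories = ["career backup", "starter", "all-star"]
cutpoints = [-0.6, 2.9]   # hypothetical thresholds between adjacent categories
score = 0.0               # hypothetical value of f(college stats, physical attributes, ...)

probs = ordered_logit_probs(score, cutpoints)
for name, p in zip(categories, probs):
    print(f"{name}: {p:.1%}")
```

A prospect with a higher score shifts probability away from the lower categories and toward the higher ones, which is how this kind of model expresses both upside and risk.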

This is probably a good place to break.  There is much more to come.  Next time we will talk about predicting outliers and then spend some time on the explanatory variables (what we use to predict).  On a side note – this series is going to form the foundation for several sessions of our sports analytics course.  So, if there are any questions we would love to hear them (Tweet us @sportsmktprof).

Click here for Part I

Click here for Part II 

Mike Lewis & Manish Tripathi, Emory University 2015.

Analytics vs Intuition in Decision-Making Part II: Too Much and Too Little Data

The use of analytics in sports personnel decisions such as drafting and free agency signings is a topic with obvious popular appeal. Sports personnel decisions are fundamentally about how people will perform in the future. These are also tough, complex, high-risk decisions that are the fodder for talk radio and second guessing from just about everyone.

So how can we make these decisions? As we noted in our last post, the choice between using analytics versus using the “gut” is probably a decision that doesn’t need to be made. Analytics and data should have a role. The question is how much emphasis should be placed on the “models” and how much on the intuition of the “experts.”

In this second installment of the series, we begin the process of going deeper into the mechanics and challenges involved in leveraging data and building models to support personnel decisions. As a backdrop for this discussion, we are going to tell the story of a project we helped a group of Emory students complete for the WNBA’s Atlanta Dream. Going into detail about this story / process should illuminate a couple of things. First, there is logic to how these types of analyses can best be structured. Second, a careful and systematic discussion of a project may clarify both the weaknesses and strengths of “Moneyball” type approaches to decision making.

To begin, we want to thank the Dream. This was a great project that the students loved, and it gave us an opportunity to think about the challenges in modeling draft prospects in a whole new arena. An early step in any analytics project is the building of the data infrastructure. For the WNBA, this was a challenge. Storehouses of sports data come from all sorts of places but they often start out as projects driven more by fan passion than any formal effort from an established organization. Baseball is probably the gold standard for information with detailed data going back a century. In contrast, for women’s professional and college basketball the information is comparatively sparse. There’s not a lot and it doesn’t go back very far.

After some searching (with a lot of great assistance from the Dream) we were able to identify information sources for both professional and collegiate stats. As we started to assemble databases a few things became apparent:

  • First, the data available was nowhere near as detailed as what could be found for the men’s game. We were limited to season level stats at both the pro and college level. Furthermore, all we had were the basics – the data in box scores. This is good information, but it does leave the analyst wanting more.
  • Second, the data fields on professional performance were not identical to the data on collegiate performance. For example, the pro level data breaks rebounds down into offensive and defensive boards. Maybe this is a big deal and maybe not. It does make it difficult to use established metrics that place different value on the two types of rebounds.
  • Third, there was a LOT of missing data, and multiple types of missing data. In terms of player statistics, information on turnovers was at best scarce. Again, this makes it difficult to use established metrics like PER. The other thing that was missing is players themselves. We never were able to create a repository of data on international players that didn’t participate in NCAA basketball. As a side note, even if we had found international data it would be hard to interpret. How would we judge the importance of a rebound in Europe versus a rebound in South America? This isn’t just a problem for women’s basketball as this is also an issue in any global sport.

There were also a lot of things that we would have liked to have had. Some of this may have been available, and maybe we did not look hard enough. But we always need to ask the question of the incremental value versus the required effort. For example, information on players’ physical traits was very limited. We could obtain height but even basics like weight were difficult to find. And as far as we know – there is no equivalent to the NFL combine.

While these might seem like severe limitations, we think it’s really just par for the course in this type of research. Especially in the first go around! In analytics, you often work with what you have and you try to be clever in order to get the most from the data. We will get to how to approach this type of problem soon. But even with the limitations, we actually have a LOT of data. At the college level we have 4 years of data on games played, field goals made, field goals attempted, rebounds, steals, 3 pointers, etc… If we have 15 data fields for 4 years we have 60 statistics per player. Add in data on height, strength of schedule and assorted miscellaneous fields and we have maybe 70 pieces of data per player. And maybe we want to do things like combine pieces of information; things like multiplying points per game by strength of schedule to get a measure that accounts for the greater difficulty of scoring in the ACC versus a lower tier conference. So maybe we end up with 100 variables we want to investigate.

Why are we discussing how many fields we have per prospect? Because it brings us to our next problem – the relatively small number of observations in most sports contexts. Remember the basic game in this analysis is to understand “what” predicts a successful pro career. This means that we need observations on successful and less successful pro careers.

The WNBA consists of twelve teams with rosters of twelve players. This means if we go back and collect a few years of data we are looking at just a couple hundred players with meaningful professional careers. While this may seem like a sizeable amount of data, to the data scientist this is almost nothing. Our starting point is trying to relate professional career performance to college data, which in this case means maybe two hundred pro careers to be explained by potentially about a hundred explanatory variables.
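The back-of-the-envelope arithmetic behind this comparison can be laid out as follows. The rounded totals (70 fields, 100 variables, 200 careers) are the rough figures quoted in the text, not exact counts.

```python
# Explanatory side: how many variables per prospect?
fields_per_season = 15
college_seasons = 4
season_stats = fields_per_season * college_seasons   # 15 x 4 = 60 statistics
base_fields = 70            # rough total after adding height, strength of schedule, etc.
candidate_variables = 100   # rough total once combined (interaction) measures are added

# Observation side: how many pro careers?
roster_spots = 12 * 12      # twelve teams x twelve players = 144 spots in a season
pro_careers = 200           # "maybe two hundred" careers over a few years of data

observations_per_variable = pro_careers / candidate_variables
print(f"{season_stats} season stats, {roster_spots} roster spots, "
      f"{observations_per_variable:.0f} observations per candidate variable")
```

Roughly two observations per candidate variable is why the pile of explanatory fields cannot simply be thrown into one regression.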

It really is a weird starting point. We have serious limitations on the explanatory data available, but we also wish the ratio of observations (players) to explanatory data fields was higher. In our next installment, we will start to talk about what we are trying to predict (measures of pro career success). Following that, we will talk about how to best use our collection of explanatory variables (college stats).

Mike Lewis & Manish Tripathi, Emory University 2015.

WNBA Social Media Equity Rankings

We begin our summer of fan base rankings with a project done by one of our favorite Emory students – Ilene Tsao.  Ilene presents a multi-dimensional analysis of the WNBA across Facebook, Twitter and Instagram.  The first set of rankings speaks to the current state of affairs.  Seattle leads the way followed by LA and Atlanta.  In the second analysis, Ilene takes a look at what is possible in each market (by controlling for time in market and championships).  In this analysis the Atlanta Dream lead the way followed by Minnesota and Chicago.

The teams in the WNBA are constantly looking for ways to improve their brand and continue to expand their fan base. Social media provides a way to measure fan loyalty and support. In order to calculate WNBA teams’ social media equity, we collected data on each team’s followers across the three main social media platforms of Facebook, Twitter, and Instagram. We then ran a regression model to help predict followers for each platform as a function of factors such as metropolitan populations, number of professional teams, team winning percentages, and playoff achievements. After creating this model, we used the predicted number of followers and compared it to each team’s actual number of social media followers.  Our goal is to see who “over” or “under” achieves based on social media followers on average. We then ranked the WNBA teams based on the results.
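The general shape of this calculation is sketched below for a single platform. Everything in it is an assumption made for illustration: the team names and numbers are invented, only two explanatory factors are used, the gap is measured as actual minus predicted followers, and the `ols` helper is a hypothetical stand-in for a statistics package.

```python
def ols(X, y):
    """Least-squares coefficients for rows X (each starting with 1.0) and targets y."""
    k = len(X[0])
    # Augmented normal equations: (X'X | X'y).
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Invented data: metro population (millions), winning percentage, followers (thousands).
teams = ["Team A", "Team B", "Team C", "Team D", "Team E", "Team F"]
population = [3.7, 13.2, 5.9, 3.6, 9.5, 20.1]
win_pct = [0.55, 0.60, 0.48, 0.70, 0.45, 0.40]
followers = [95.0, 160.0, 110.0, 105.0, 90.0, 150.0]

X = [[1.0, p, w] for p, w in zip(population, win_pct)]
coefs = ols(X, followers)
predicted = [sum(b * x for b, x in zip(coefs, row)) for row in X]

# Teams with more followers than the model predicts "over" achieve.
gap = {t: a - p for t, a, p in zip(teams, followers, predicted)}
ranking = sorted(teams, key=lambda t: gap[t], reverse=True)
for rank, team in enumerate(ranking, start=1):
    print(f"{rank}. {team}: actual minus predicted = {gap[team]:+.1f}k")
```

Repeating this for each platform and averaging the resulting ranks would give an overall ordering of the kind reported here.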

The first model only used the metropolitan population and winning percentage of each team. After taking the average of the Facebook, Twitter, and Instagram rankings, we found the Seattle Storm had the best performance. The Connecticut Sun and Washington Mystics consistently ranked as the bottom two teams across all three platforms, but teams like the Los Angeles Sparks and Atlanta Dream had more variation. The Dream ranked 6th for Twitter, but 1st for Instagram while the Sparks ranked 1st for Twitter and 6th for Instagram. This could be because Instagram and the Dream are relatively recent arrivals (to the social media world and to the WNBA, respectively), while Twitter and the Sparks have been around for longer. Based on raw numbers, the New York Liberty has high performance in terms of social media followers, but when we adjust for market size and winning percentage, the team does poorly.

Rankings for Facebook, Twitter, and Instagram based on the metropolitan population and the teams’ winning percentages:

WNBA Social Media 1

The second model extended the previous analysis by incorporating the number of other professional teams in the area and number of WNBA championships won into the regression analysis. This model seemed to be a better fit for our data and resulted in small adjustments in the rankings. After taking the average of all three rankings with the new factors, the Atlanta Dream ranked first, passing the Seattle Storm and Los Angeles Sparks. The Mystics were no longer consistently the worst team, but were still in the bottom half of the rankings.

Rankings based on metropolitan population, winning percentage, number of other professional teams, and number of WNBA championships:

WNBA Social Media 2

Ilene Tsao, Emory University, 2015.