So far in our series on draft analytics, we have discussed the strengths and weaknesses of statistical models relative to human experts, and we have talked about some of the challenges that occur when building databases. We now turn to questions and issues related to building predictive models of athlete performance.

“What should we predict?” is a deceptively simple question that needs to be answered early and potentially often throughout the modeling process. Early – because we need to have some idea of what we want to predict before the database can be fully assembled. Often – because it will frequently be the case that no single performance metric is ideal.

There is also the question of what “type” of thing should be predicted. It can be a continuous variable that measures how much of something a player produces. Yards gained in football, batting average in baseball or points scored in basketball would be examples. It can also be categorical (e.g. is the player an all-star or not).

A Simple Example

So what to predict? For now, we will focus on basketball with a few comments directed towards other sports. We have options. We can start with something simple like points or rebounds (note that these are continuous quantities – things like points that vary from zero to the high twenties rather than categories like whether a player is a starter or not). We don’t think these are bad metrics but they do have limitations. The standard complaint is that these single statistics are too one dimensional. This is true (by definition, in this case) but there may be occasions when this is a useful analysis.

First, maybe the team seeks a one dimensional player. The predicted quantity doesn’t need to be points. Perhaps, there is a desperate need for rebounding or assists. It’s a team game, and it is legitimate to try and fill a specialist role. A single measure like points might also be useful because it could be correlated with other good “things” that are of interest to the team.

For a moment, let us assume that we select points per game as the measure to be predicted, and we predict this using all sorts of collegiate statistics (the question of the measures we should use to predict is for next time). In the equation below, we write what might be the beginning of a forecasting equation. In this expression, points scored during the rookie season (Points(R)) is to be predicted using points scored in college (Points(C)), collegiate strength of schedule (SOS), an interaction of points scored and strength of schedule (Points(C) X SOS) and potentially other factors.

*Points(R) = β_{0} + β_{P} Points(C) + β_{SOS} SOS + β_{PS} Points(C)×SOS + ⋯*

The logic of this equation is that points scored in the rookie year are predictable from college points, the level of competition and an adjustment for whether the college points were scored against high-level competition. When we take this model to the data via a linear regression procedure, we get numerical values for the beta terms. This gives us a formula that we can use to “score” or predict the performance of a set of prospects.
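As a rough illustration of what “taking the model to the data” might look like, the sketch below fits the four-term equation by ordinary least squares in plain Python. Everything in it is an assumption made for the example: the six prospects are invented, SOS is treated as a standardized score, the `ASSUMED_BETAS` values are made up, and the rookie numbers are generated from those values without noise so that the fit can be checked. A real analysis would use a statistics package and actual player data.

```python
# Toy data for illustration only: none of these numbers describe real players.
# The rookie values are generated from a made-up rule so the fit can be checked.
ASSUMED_BETAS = [1.0, 0.2, 2.0, 0.1]  # beta_0, beta_P, beta_SOS, beta_PS (hypothetical)

college = [  # (Points(C), SOS) for six imaginary prospects
    (10.0, 0.5), (14.0, -0.3), (18.0, 1.0),
    (22.0, 0.1), (16.0, -0.8), (25.0, 0.6),
]


def design_row(points_c, sos):
    """Return [1, Points(C), SOS, Points(C) x SOS] for one prospect."""
    return [1.0, points_c, sos, points_c * sos]


def fit_ols(rows, targets):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: move the largest remaining entry onto the diagonal
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * p for x, p in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def predict(betas, points_c, sos):
    """Score a prospect with the fitted equation."""
    return sum(b * x for b, x in zip(betas, design_row(points_c, sos)))


rows = [design_row(p, s) for p, s in college]
# Noise-free toy targets built from the assumed betas
rookie_points = [predict(ASSUMED_BETAS, p, s) for p, s in college]

betas = fit_ols(rows, rookie_points)
print([round(b, 3) for b in betas])          # estimated beta terms
print(round(predict(betas, 20.0, 0.5), 2))   # predicted rookie points for a new prospect
```

Because the toy targets contain no noise, the estimated betas simply reproduce the assumed ones. With real data the estimates would carry sampling error, and the interaction term would tell us whether college scoring against tougher schedules translates differently.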

The preceding is a “toy” specification in that a serious analysis would likely use a greatly expanded specification. In the next part of our series we will focus on the right side of the equation: *what should be used as explanatory variables and what form these variables should take.*

Some questions naturally arise from this discussion…

- What pro statistics are predictable based on college performance? Maybe scoring doesn’t translate but steals do?
- Is predicting rookie year scoring appropriate? Should we predict third-year scoring to get a better sense of what the player will eventually become?
- Should the model vary based on position? Are the variables that predict something like scoring or rebounding the same for guards versus forwards?

Most of these questions are things that should be addressed by further analysis. One thing that the non-statistically inclined tend not to get is that there is value in looking at multiple models. It is seldom clear-cut what the model should look like, and it’s rare that one size fits all (same model for point guards and centers?). And maybe models only work sometimes. Maybe we can predict pro steals but not points. One reason why the human experts need to become at least statistically literate is that if they aren’t, either the results from the analytics guys need to be overly simplified or the expert will tend to reject the analytics because the multitude of models is just too complex.

A simple metric like points (or rebounds, or steals, etc…) is inherently limited. There are a variety of other statistics that could be predicted that better capture the all-round performance of a player or the player’s impact on the team. But the basic modeling procedure is the same. We use data on existing pros to estimate a statistical model that predicts the focal metric based on data available about college prospects.

Some other examples of continuous variables we might want to predict…

- Player Efficiency

How about something that includes a whole spectrum of player statistics like John Hollinger’s Player Efficiency Rating (PER)? PER involves a formula that weights points, steals, rebounds, assists and other measures by fixed weights (not weights estimated from data as above). For instance, points are multiplied by 1 while defensive rebounds are worth 0.3.

There are some issues with PER, such as the formula being structured so that even low percentage shooters can increase their efficiency ratings by taking more shots. But the use of multiple types of statistics does provide a more holistic measurement. In our project with the Dream, we used a form of PER adapted to account for some of the data limitations. In this project, some questions were raised about whether PER was an appropriate metric for the women’s game or whether the weights should be different.
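To make the shot-volume issue concrete, here is a deliberately simplified fixed-weight rating. It is not the actual PER formula. The weights of 1 for points and 0.3 for defensive rebounds come from the discussion above, while `MISSED_SHOT_PENALTY` and the function names are hypothetical values chosen only to show the mechanism: whenever the reward for a make outweighs the penalty for a miss, a shooter above a fairly low break-even percentage raises the rating simply by shooting more.

```python
POINT_WEIGHT = 1.0          # from the text: points are multiplied by 1
DEF_REBOUND_WEIGHT = 0.3    # from the text: defensive rebounds are worth 0.3
MISSED_SHOT_PENALTY = 0.7   # hypothetical penalty per missed shot, for illustration only


def simple_rating(points, def_rebounds, missed_shots):
    """A fixed-weight sum of box-score stats (a stand-in, not real PER)."""
    return (POINT_WEIGHT * points
            + DEF_REBOUND_WEIGHT * def_rebounds
            - MISSED_SHOT_PENALTY * missed_shots)


def rating_from_two_point_attempts(attempts, make_rate):
    """Rating contribution from two-point shooting alone."""
    makes = attempts * make_rate
    misses = attempts - makes
    return simple_rating(points=2 * makes, def_rebounds=0, missed_shots=misses)


# Make rate at which one extra two-point attempt adds exactly zero:
# 2 * POINT_WEIGHT * p - MISSED_SHOT_PENALTY * (1 - p) = 0
break_even = MISSED_SHOT_PENALTY / (2 * POINT_WEIGHT + MISSED_SHOT_PENALTY)

print(round(break_even, 3))
print(rating_from_two_point_attempts(10, 0.30))
print(rating_from_two_point_attempts(20, 0.30))
```

Under these assumed weights, a 30% two-point shooter sits above the break-even rate, so doubling the attempts doubles the shooting contribution to the rating even though the shooting itself is poor.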

- Plus/Minus

Plus/Minus rates are a currently popular metric. Plus/Minus stats basically measure how a player’s team performs when he or she is on the court. Plus/Minus is great because it captures the fact that teams play better or worse when a given player is on the court. But Plus/Minus can be hard to interpret when substitution patterns are highly correlated, because players who usually share the court end up with similar numbers regardless of their individual contributions. In our project with the Dream, Plus/Minus wasn’t considered simply because we did not have a source.
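A minimal sketch of a raw Plus/Minus calculation is shown below, assuming we have stint-level data (who was on the court and the score during that stretch). The stints, player labels and the `plus_minus` helper are all invented for the example. Players A and B are always on the court together here, which illustrates the correlated-substitution problem.

```python
from collections import defaultdict

# Hypothetical stints: (players on court for our team, team points, opponent points)
stints = [
    ({"A", "B", "C"}, 12, 8),
    ({"A", "B", "D"}, 5, 9),
    ({"C", "D", "E"}, 7, 7),
    ({"A", "B", "E"}, 10, 4),
]


def plus_minus(stints):
    """Sum the team's point differential over every stint a player was on the court."""
    totals = defaultdict(int)
    for on_court, team_pts, opp_pts in stints:
        for player in on_court:
            totals[player] += team_pts - opp_pts
    return dict(totals)


pm = plus_minus(stints)
for player in sorted(pm):
    print(player, pm[player])
```

Since A and B never play apart in this toy data, their Plus/Minus totals are identical, and nothing in the raw numbers can tell us which of the two is driving the team’s performance.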

- Minutes played

One metric that we like is simply minutes played. While this may seem like a primitive metric, it has some nice properties. The biggest plus is that it reflects the coach’s (a human expert) judgment. Assuming that the human decision is influenced by production (points, rebounds, etc…) this metric is more of an intuition / analysis hybrid. On the downside, minutes played are obviously a function of the other players on the team and injuries.

Categories of Success & Probability Models

As noted, the preceding discussion revolves around predicting numerical quantities. There is also a tradition of placing players into broad categories. A player that starts for a decade is probably viewed as a great draft pick while someone that doesn’t make a roster is a disaster. Our goal with “categories” is to predict the probability that each outcome occurs.

This type of approach likely calls for a different class of models. Rather than use linear regression, we would use a probability model. For example, an ordered logistic regression model can be used to predict the probability of “ordered” career outcomes. With this type of model we could predict the probabilities of a player becoming an all-star, a long-term starter, an occasional starter, a career backup or a non-contributor. Again, we can make this prediction as a function of the player’s college performance and other available data.

Below we write an equation that captures this.

*Pr(Category=j)=f(college stats,physical attributes,etc…)*

This equation says that the probability that a player becomes some category “j” is some function of a bunch of observable traits. We are going to skip the math but these types of models do require a bit “more” than linear regression models (specialized software mostly) and are more complicated to interpret.
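For readers who want a peek at the skipped math, one common form of the function f is the ordered logistic (proportional odds) model, in which the probability of landing in category j or lower is a logistic function of a cutpoint minus a prospect “score”. The sketch below assumes that form. The cutpoints and the example scores are hypothetical, and in practice the score would be a weighted sum of college stats and other traits, with the weights and cutpoints estimated by specialized software.

```python
import math

# Career outcomes ordered from worst to best
CATEGORIES = [
    "non-contributor",
    "career backup",
    "occasional starter",
    "long-term starter",
    "all-star",
]
CUTPOINTS = [-1.0, 0.5, 2.0, 3.5]  # hypothetical, must be increasing


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(score, cutpoints=CUTPOINTS):
    """Pr(Category = j) from cumulative probabilities Pr(Category <= j) = logistic(cutpoint_j - score)."""
    cumulative = [logistic(c - score) for c in cutpoints] + [1.0]
    probs = []
    previous = 0.0
    for cum in cumulative:
        probs.append(cum - previous)  # difference of adjacent cumulative probabilities
        previous = cum
    return dict(zip(CATEGORIES, probs))


# Two imaginary prospects with different scores
for score in (0.0, 3.0):
    probs = category_probabilities(score)
    print(score, {name: round(p, 3) for name, p in probs.items()})
```

A higher score shifts probability away from the bottom categories and toward the top ones, while the probabilities for any one prospect always sum to one.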

*A nice feature of probability models is that the predictions are useful for risk assessment.* For example, an ordered logistic model would provide probability estimates for the range of player categories. A given prospect might have a 5% chance of becoming an all-star, a 60% chance of becoming a starter and a 35% chance of being a career backup. In contrast, the linear regression models described previously will only produce a “point” estimate, something along the lines of a given prospect being predicted to score 6.5 points per game or to grab 4 rebounds per game as a pro.

This is probably a good place to break. There is much more to come. Next time we will talk about predicting outliers and then spend some time on the explanatory variables (what we use to predict). On a side note – this series is going to form the foundation for several sessions of our sports analytics course. So, if there are any questions we would love to hear them (Tweet us @sportsmktprof).

*Mike Lewis & Manish Tripathi, Emory University 2015.*