In Part 4 of the series we started talking about what should be in the analyst’s tool kit. I advocated for linear regression to be the primary tool. Linear regression is (relatively) easy to implement and produces equations that are (relatively) easy to understand. I also made the point that linear regression is best suited for predicting continuous measures and used the example of predicting the number of touchdown passes thrown by a rookie QB.
But not everything we want to predict is going to be a continuous variable. Since we are talking about predicting quarterback performance, maybe we prefer a discrete metric, such as whether a player becomes a starter. Can we still use linear regression? Maybe.
Let’s return to the example from last time. The task was to predict professional (rookie year) success based on college-level data. We assumed that general managers can obtain data on the number of games won as a college player, whether the player graduated (or will graduate) and the player’s height.
Our initial measure of pro success was touchdown passes. We then specified a regression model using the following equation.
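Written out with the three predictors named above, the model has the following general form. The coefficient symbols are shorthand assumed here, not estimated values.

$$\text{TD passes} = \beta_0 + \beta_1\,\text{Wins} + \beta_2\,\text{Graduated} + \beta_3\,\text{Height} + \varepsilon$$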
But let’s say that we don’t like the TD passes metric. Maybe we think TD passes are more related to wide receiver talent than to the quality of the QB. Rather than use TDs as our dependent variable, we want to use whether a player becomes a starter. This is an interesting metric because it captures whether the player was selected by a coaching staff to be the primary quarterback, so it builds in some measure of human expertise. I’ll leave criticism of this metric to the reader as an exercise.
This leads us to the following equation:
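With the same three predictors and only the dependent variable swapped, the model takes this general form. The coefficient symbols are shorthand assumed here, not estimated values, and Starter is the yes/no outcome.

$$\text{Starter} = \beta_0 + \beta_1\,\text{Wins} + \beta_2\,\text{Graduated} + \beta_3\,\text{Height} + \varepsilon$$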
One issue we have to address before we estimate this model is how we define the term starter. In a statistical model we need to convert the word or category “starter” into a number. In this case, the easy solution is to code players who became starters as 1 and players who did not as 0. As a second exercise: what would we do if we had three categories (did not play, reserve, starter)?
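The 0/1 coding can be sketched in a few lines of Python. The player names and outcomes below are made up purely for illustration.

```python
# Hypothetical rookies and whether each became a starter (illustrative data only).
became_starter = {
    "Player A": True,
    "Player B": False,
    "Player C": True,
}

# Binary (dummy) coding: 1 for starters, 0 for everyone else.
starter_code = {name: int(flag) for name, flag in became_starter.items()}

print(starter_code)
```

The resulting column of 1s and 0s is what goes on the left-hand side of the regression.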
Let’s pretend we estimated the preceding model and obtained the following equation:
We can use the equation to “score” or “rate” our imaginary prospects from last time (Lewis Michaels and Manny Trips). In terms of the input data, Lewis won 40 college games, graduated and is 5′10″ tall. Plugging Michaels’ data into the equation gives us a score of 0.22. This kind of analysis is commonly termed a linear probability model. A simple interpretation of this result is that the expected probability of Michaels (or, better said, a prospect with Michaels’ statistics) becoming a starter is 22%.
So far so good.
Our second prospect is Manny Trips out of Stanford. Manny won 10 games, failed to graduate and is 6′ tall. For Manny the prediction would be -12.8%. This is the big problem with using linear regression to predict binary (yes/no) outcomes: nothing constrains the fitted values to lie between 0 and 1. How do we interpret a negative probability? Or a probability that is greater than 1?
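To make the arithmetic concrete, here is a minimal sketch of scoring both prospects with a linear probability model. The coefficients are hypothetical: they are not the estimated equation, just one set of values (with height measured in inches, also an assumption) chosen so that the two scores quoted above come out.

```python
# Hypothetical coefficients, NOT the estimated model.
# Picked only so the arithmetic reproduces the 0.22 and -0.128 scores.
INTERCEPT = -1.668
B_WINS = 0.010        # per college game won
B_GRADUATED = 0.088   # 1 if graduated, 0 otherwise
B_HEIGHT_IN = 0.020   # per inch of height (assumed unit)


def lpm_score(wins, graduated, height_in):
    """Linear probability model: a plain weighted sum of the inputs."""
    return (INTERCEPT
            + B_WINS * wins
            + B_GRADUATED * graduated
            + B_HEIGHT_IN * height_in)


# (wins, graduated, height in inches) for the two imaginary prospects.
prospects = {
    "Lewis Michaels": (40, 1, 70),
    "Manny Trips": (10, 0, 72),
}

for name, inputs in prospects.items():
    score = lpm_score(*inputs)
    is_valid_probability = 0.0 <= score <= 1.0
    print(f"{name}: score = {score:.3f}, valid probability? {is_valid_probability}")
```

Nothing in the weighted sum knows that the outcome is a probability, which is why Manny's score can fall below zero.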
So what do we do next? I think we have two options. The first is to ignore the problem. If the goal is just to rank prospects, then we care about the relative scores, not the actual predictions, and a negative value does little harm. If we are just using analytics to screen QB prospects or to provide another data point, then maybe our model is good enough. The level of investment in a modeling project should be based on how the model is going to be used. In many or most sports applications I would lean toward simpler models.
Our second option is to move to a more complicated model. There are a host of models available for categorical data. We can use a binary logit or probit model for a binary outcome like the one above. If the categories have a natural ordering to them (never played, reserve, starter), then we can use an ordered logit. If there is no order to the categories, then we can use a multinomial logit. I’m still debating how much attention I should pay to these models. Having a tool to deal with categorical variables can be invaluable, but there is a cost. The mathematics become more complex, estimation of the model requires specialized software and interpretation of the model becomes less intuitive.
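As a rough preview of why a binary logit avoids the negative-probability problem: it passes the linear score through the logistic function, which always returns a value strictly between 0 and 1. The sketch below uses arbitrary example scores, not output from any fitted model.

```python
import math


def logistic(z):
    """Map any real-valued linear score z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Arbitrary example scores (not from a fitted model), including a negative one.
for z in (-3.0, 0.0, 3.0):
    print(f"linear score {z:+.1f} -> probability {logistic(z):.3f}")
```

A negative linear score now simply means a probability below 50% rather than an impossible value.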
I think I will discuss the binary logit next time.