Today’s post introduces the topic of statistical modeling. This is, maybe, the trickiest part of the series to write. The problem is that mastering the technical side of statistical analysis usually takes years of education. And, more critically, developing the wisdom and intuition to use statistical tools effectively and creatively takes years of practice. The goal of this segment is to point people in the right direction, more than to provide detailed instruction. That said – I can adjust if there is a call for more technical material. (If you want to start from the beginning, parts 1, 2 and 3 are a click away.)
Let’s start with a simple point. The primary tool for every analytics professional (sports or otherwise) should be linear regression. Linear regression allows the analyst to quantify the relationship between some focal variable of interest (dependent measure or DV) and a set of variables that we think drive that variable (independent variables). In other words, regression is a tool that can produce an equation that shows how some inputs produce an outcome of interest. In the case of player analytics, this might be a prediction of future performance based on a player’s past statistics or physical attributes.
To make this more concrete, let’s say we want to do an analysis of rookie quarterback performance (we’ve been talking a bit about QB metrics so far in the series). Selecting QBs involves significant uncertainty. The transition from the college game to the pro game requires the QB to be able to deal with more complex offensive systems, more sophisticated defenses and more talented opposing players. The task of the general manager is to identify prospects that can successfully make the transition.
Data and statistical analysis can potentially play a part in this type of decision. The starting point would be the idea that observable data on college prospects can help predict rookie year performance. To begin, let’s assume that general managers can obtain data on the number of games won as a college player, whether the player graduated (or will graduate) and the player’s height. (We just might be foreshadowing a famous set of rules for drafting quarterbacks.)
The other key decision for a statistical analysis of rookie QB performance versus college career and physical data is the choice of performance metric. We could use the NFL passer rating formula that we have been discussing. Or we could use something else, for example the number of TD passes thrown as a rookie. This metric is interesting as it captures something about both playing time and the ability to create scores.
Touchdowns are also a metric that “fits” linear regression. Linear regression is best suited to the analysis of quantitative variables that vary continuously. The number of touchdowns we observe in data will range from zero to whatever the rookie TD record is. In contrast, other metrics such as whether the player becomes a starter or a pro bowler are categorical variables. There are other techniques that are better for analyzing categorical variables. (If you are a stats jockey and are objecting to the last couple of statements, please see the note below.)*
The purpose of regression analysis is to create an equation of the following form:

TD passes = β0 + β1 × (college wins) + β2 × (graduated) + β3 × (height)
This equation says that TD passes are a function of college wins, graduation and height. The βs are the weights that are determined by the linear regression analysis. Specifically, linear regression determines the βs that best fit the data. This is the important point. The weights or βs are determined from the data. To illustrate how the equation works, let’s imagine that we estimated the regression model and obtained the following equation:

TD passes = 1 + 0.1 × (college wins) + 5 × (graduated) + 0 × (height)
This equation says that we can predict rookie TD passes by plugging in each player’s data related to college wins, graduation and height. It also says that a history of winning is positively related to TDs and that graduation is also a positive. The coefficient for height is zero. This indicates that height is not a predictor of rookie TDs (I’m making these numbers up – height probably matters). One benefit of developing a model is that we let the data speak. Our “expert” judgment might be that height matters for quarterbacks. The regression results can help identify decision biases if the coefficients don’t match the experts’ predictions. I am neglecting the issue of significance for now – just to keep the focus on intuition.
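For readers who want to see what “the weights are determined from the data” looks like in practice, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the eight prospects are invented, and their rookie TD totals are generated from the made-up equation above, so we know in advance which weights the regression should hand back. The helper names (made_up_tds, solve, fit_least_squares) are hypothetical, and the fit uses the textbook normal-equations approach rather than any particular stats package.

```python
# Invented prospects: (college wins, graduated as 1/0, height in inches).
# These rows are hypothetical and do not describe real players.
prospects = [
    (40, 1, 70), (10, 0, 72), (25, 1, 75), (30, 0, 74),
    (18, 1, 77), (35, 1, 73), (22, 0, 76), (28, 0, 71),
]


def made_up_tds(wins, grad, height):
    """Rookie TDs from the post's made-up equation (no noise added)."""
    return 1 + 0.1 * wins + 5 * grad + 0 * height


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_least_squares(rows, y):
    """Return [b0, b1, ...] minimizing squared prediction error."""
    X = [[1.0, *row] for row in rows]  # leading 1.0 gives the intercept
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


tds = [made_up_tds(*p) for p in prospects]
betas = fit_least_squares(prospects, tds)
print([round(b, 3) for b in betas])  # intercept, wins, graduated, height
```

Because the TD totals here were generated without any noise, the fitted weights should land essentially on 1, 0.1, 5 and 0. Real rookie data would be noisy, and the estimated βs would only approximate whatever relationship actually exists.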
Let’s say we have two prospects. The first is Lewis Michaels out of the University of Illinois, who won 40 college games (hypothetical and unrealistic), graduated (in engineering) and is 5’10” (a Flutiesque prospect). Our second prospect is Manny Trips out of Duke. Manny won 10 games, failed to graduate and is 6’ tall. Michaels would seem to be the better prospect based on the available data. The statistical model allows us to predict how much better.
We make our predictions by simply plugging our player level data into the equation, coding graduation as 1 for yes and 0 for no and measuring height in inches. We would predict Lewis would throw 10 TDs in his rookie year (1 + 0.1*40 + 5*1 + 0*70). For Manny the prediction would be 2 TDs (1 + 0.1*10 + 5*0 + 0*72). For now, I am just making up the coefficients (βs). In a later entry I will estimate the model using some data on actual NFL rookie QB performance.
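The plug-in arithmetic can also be written as a few lines of Python. This is only a sketch: the coefficients are the made-up ones from this post, and the function and variable names (predict_rookie_tds, BETAS) are my own, not part of any library.

```python
# Made-up coefficients from the post; they are NOT estimated from real data.
BETAS = {"intercept": 1.0, "college_wins": 0.1, "graduated": 5.0, "height_in": 0.0}


def predict_rookie_tds(college_wins, graduated, height_in):
    """Predicted rookie TD passes under the post's illustrative equation."""
    return (
        BETAS["intercept"]
        + BETAS["college_wins"] * college_wins
        + BETAS["graduated"] * (1 if graduated else 0)  # yes/no coded as 1/0
        + BETAS["height_in"] * height_in                # height in inches
    )


# The two hypothetical prospects from the post.
lewis = predict_rookie_tds(college_wins=40, graduated=True, height_in=70)
manny = predict_rookie_tds(college_wins=10, graduated=False, height_in=72)

print(f"Lewis Michaels: {lewis:.1f} predicted TDs")
print(f"Manny Trips:    {manny:.1f} predicted TDs")
print(f"Difference:     {lewis - manny:.1f} TDs")
```

The gap between the two predictions is the “how much better” that the model gives us. Swapping in coefficients estimated from actual data would only require changing the values in BETAS.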
Regression has its shortcomings and many analysts love to object to regression analyses. But for the most part, linear regression is a solid tool for analyzing patterns in data. It’s also relatively easy to implement. We can run regressions in Excel! We shouldn’t underestimate how important it is to be able to do our analyses in standard tools like Excel.
I will extend our tool kit in a future entry. I briefly mentioned categorical variables such as whether or not a player is a starter. For these types of Yes/No outcomes (starter or not a starter), there is a tool called logistic regression that should be in our repertoire.
*One reason this note is tricky is that I’m trying to get the right balance and tone. I can already hear the objections. Let’s save these for now. For example, readers do not need to alert me to the fact that TDs are censored at zero. Or that there is a mass point at zero because many rookies don’t play. Or that TDs are counted in discrete units so maybe a Poisson model is more appropriate. You get the idea. There are many ways to object to any statistical model. The real question isn’t whether a model is perfect. The real question should be whether the model provides value.