Player Analytics Fundamentals: Part 5 – Modeling 102

In Part 4 of the series we started talking about what should be in the analyst’s tool kit.  I advocated for linear regression to be the primary tool.  Linear regression is (relatively) easy to implement and produces equations that are (relatively) easy to understand.  I also made the point that linear regression is best suited for predicting continuous measures and used the example of predicting the number of touchdown passes thrown by a rookie QB.

But not everything we want to predict is going to be a continuous variable.  Since we are talking about predicting quarterback performance, maybe we prefer a metric that is more discrete such as whether a player becomes a starter.  Can we still use linear regression?  Maybe.

Let’s return to the example from last time.  The task was to predict professional (rookie year) success based on college level data.  We assumed that general managers can obtain data on the number of games won as a college player, whether the player graduated (or will graduate) and the player’s height.

Our initial measure of pro success was touchdown passes.  We then specified a regression model using the following equation:

TD Passes = β0 + β1 × (College Wins) + β2 × (Graduated) + β3 × (Height)

But let’s say that we don’t like the TD passes metric.  Maybe we don’t like it because we think TD passes are more related to wide receiver talent than to the quality of the QB.  Rather than use TDs as our dependent variable we want to use whether a player becomes a starter.  This is also an interesting metric as it captures whether the player was selected by a coaching staff to be the primary quarterback.  This is a nice feature as the metric includes some measure of human expertise.  I’ll leave criticism to the readers as an exercise.

This leads us to the following equation:

Starter = β0 + β1 × (College Wins) + β2 × (Graduated) + β3 × (Height)

One issue we have to address before we estimate this model is how we define the term starter.  In a statistical model we need to convert the word or category of “starter” into a number.  In this case, the easy solution is to treat players that became starters as 1’s and players that did not as 0’s.  As a second exercise – what would we do if we had three categories (did not play, reserve, starter)?
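
As a quick illustration, the 0/1 conversion is a one-liner in a tool like Python's pandas (the library choice and the prospect data below are my own, made up just to show the coding step):

```python
import pandas as pd

# Made-up prospect outcomes, purely to illustrate the 0/1 coding step.
prospects = pd.DataFrame({
    "player": ["Lewis Michaels", "Manny Trips"],
    "outcome": ["starter", "reserve"],
})

# Treat players that became starters as 1's and players that did not as 0's.
prospects["starter"] = (prospects["outcome"] == "starter").astype(int)
print(prospects)
```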

Let’s pretend we estimated the preceding model and obtained the following equation:

Starter = -0.3 + 0.01 × (College Wins) + 0.05 × (Graduated) + 0.001 × (Height)

We can use the equation to “score” or “rate” our imaginary prospects from last time (Lewis Michaels and Manny Trips).  In terms of the input data, Lewis won 40 college games, graduated and is 5’10”.  Plugging Michaels’s data into the equation gives us a score of .22.  The analysis we have performed is commonly termed a linear probability model.  A simple interpretation of this result is that the expected probability of Michaels (or, better said, of a prospect with Michaels’s statistics) becoming a starter is 22%.

So far so good.

Our second prospect is Manny Trips out of Duke.  Manny won 10 games, failed to graduate and is 6’ tall.  For Manny the prediction would be -12.8%.  This is the big problem with using linear regression to predict binary (Yes/No) outcomes.  How do we interpret a negative probability?  Or a probability that is greater than 1?
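
To make the problem concrete, here is a small Python sketch that scores both prospects.  The coefficients are the made-up values from the equation above, chosen only so that they reproduce the .22 and -12.8% figures in the text:

```python
# Linear probability model with the made-up coefficients from the equation above.
# Height is measured in inches (5'10" = 70, 6'0" = 72).
def lpm_score(wins, graduated, height):
    return -0.3 + 0.01 * wins + 0.05 * graduated + 0.001 * height

print(round(lpm_score(40, 1, 70), 3))  # Lewis Michaels:  0.22
print(round(lpm_score(10, 0, 72), 3))  # Manny Trips:   -0.128, a "negative probability"
```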

So what do we do next?  I think we have two options.  The first is to ignore the problem.  If the goal is just to rank prospects then maybe we don’t care very much.  In this case, we care about the relative scores, not the actual predictions.  If we are just using analytics to screen QB prospects or to provide another data point then maybe our model is good enough.  The level of investment in a modeling project should be based on how the model is going to be used.  In many or most sports applications I would lean toward simpler models.

Our second option is to move to a more complicated model.  There are a host of models available for categorical data.  We can use a binary logit or probit model for a binary outcome like the one above.  If the categories have a natural ordering to them (never played, reserve, starter) then we can use an ordered logit.  If there is no ordering to the categories, then we can use a multinomial logit.  I’m still debating how much attention I should pay to these models.  Having a tool to deal with categorical variables can be invaluable but there is a cost: the mathematics become more complex, estimating the model requires specialized software and interpreting the results becomes less intuitive.
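
To give a flavor of what this looks like in practice, here is a sketch of a binary logit fit using Python's statsmodels.  The tool choice is mine, and the data are simulated stand-ins rather than real QB histories:

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in data: one row per historical QB prospect.
rng = np.random.default_rng(0)
wins = rng.integers(5, 45, size=200)
grad = rng.integers(0, 2, size=200)
height = rng.integers(68, 78, size=200)  # inches

# Fabricate starter outcomes so the example runs; real work would use actual careers.
true_prob = 1 / (1 + np.exp(-(-6 + 0.1 * wins + 0.5 * grad + 0.04 * height)))
starter = rng.binomial(1, true_prob)

# The logit maps any score into a probability between 0 and 1.
X = sm.add_constant(np.column_stack([wins, grad, height]))
fit = sm.Logit(starter, X).fit(disp=0)
print(fit.params)          # estimated coefficients
print(fit.predict(X[:5]))  # predicted probabilities, always inside [0, 1]
```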

I think I will discuss the binary logit next time.

Player Analytics Fundamentals: Part 4 – Statistical Models

Today’s post introduces the topic of statistical modeling.  This is, maybe, the trickiest part of the series to write.  The problem is that mastering the technical side of statistical analysis usually takes years of education.  And, more critically, developing the wisdom and intuition to use statistical tools effectively and creatively takes years of practice.  The goal of this segment is to point people in the right direction, more than to provide detailed instruction.  That said – I can adjust if there is a call for more technical material.  (If you want to start from the beginning parts 1, 2 and 3 are a click away.)

Let’s start with a simple point.  The primary tool for every analytics professional (sports or otherwise) should be linear regression.  Linear regression allows the analyst to quantify the relationship between some focal variable of interest (the dependent measure or DV) and a set of variables that we think drive it (the independent variables).  In other words, regression is a tool that can produce an equation that shows how some inputs produce an outcome of interest.  In the case of player analytics, this might be a prediction of future performance based on a player’s past statistics or physical attributes.

To make this more concrete, let’s say we want to do an analysis of rookie quarterback performance (we’ve been talking a bit about QB metrics so far in the series).  Selecting QBs involves significant uncertainty.  The transition from the college game to the pro game requires the QB to be able to deal with more complex offensive systems, more sophisticated defenses and more talented opposing players.  The task of the general manager is to identify prospects that can successfully make the transition.

Data and statistical analysis can potentially play a part in this type of decision.  The starting point would be the idea that observable data on college prospects can help predict rookie year performance.  Let’s assume that general managers can obtain data on the number of games won as a college player, whether the player graduated (or will graduate) and the player’s height.  (We just might be foreshadowing a famous set of rules for drafting quarterbacks.)

The other key decision for a statistical analysis of rookie QB performance is the choice of a performance metric.  We could use the NFL passer rating formula that we have been discussing.  Or we could use something else.  For example, maybe the number of TD passes thrown as a rookie.  This metric is interesting as it captures something about playing time and the ability to create scores.

Touchdowns are also a metric that “fits” linear regression.  Linear regression is best suited to the analysis of quantitative variables that vary continuously.  The number of touchdowns we observe in data will range from zero up to whatever the rookie TD record is.  In contrast, other metrics such as whether the player becomes a starter or a Pro Bowler are categorical variables.  There are other techniques that are better for analyzing categorical variables.  (If you are a stats jockey and are objecting to the last couple of statements, please see the note below.)

The purpose of regression analysis is to create an equation of the following form:

TD Passes = β0 + β1 × (College Wins) + β2 × (Graduated) + β3 × (Height)

This equation says that TD passes are a function of college wins, graduation and height.  The βs are the weights that are determined by the linear regression analysis.  Specifically, linear regression determines the βs that best fit the data.  This is the important point: the weights or βs are determined from the data.  To illustrate how the equation works, let’s imagine that we estimated the regression model and obtained the following equation:

TD Passes = 1 + 0.1 × (College Wins) + 5 × (Graduated) + 0 × (Height)

This equation says that we can predict rookie TD passes by plugging in each player’s data related to college wins, graduation and height.  It also says that a history of winning is positively related to TDs and that graduation is also a positive.  The coefficient for height is zero.  This indicates that height is not a predictor of rookie TDs (I’m making these numbers up – height probably matters).  One benefit of developing a model is that we let the data speak.  Our “expert” judgment might be that height matters for quarterbacks.  The regression results can help identify decision biases if the coefficients don’t match the expert’s predictions.  I am neglecting the issue of significance for now – just to keep the focus on intuition.
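
If you want to see “the data speak” for yourself, here is a small Python sketch: it simulates college histories, generates rookie TDs with known weights, and checks that the regression recovers them.  Everything here is fabricated for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Fabricated college histories for 100 past QBs.
rng = np.random.default_rng(1)
wins = rng.integers(5, 45, size=100)
grad = rng.integers(0, 2, size=100)
height = rng.integers(68, 78, size=100)  # inches

# Generate rookie TDs using the made-up weights from the text, plus noise.
tds = 1 + 0.1 * wins + 5 * grad + 0 * height + rng.normal(0, 1, size=100)

# Linear regression picks the betas that best fit these data.
X = sm.add_constant(np.column_stack([wins, grad, height]))
fit = sm.OLS(tds, X).fit()
print(fit.params)  # estimates should land close to (1, 0.1, 5, 0)
```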

Let’s say we have two prospects.  Lewis Michaels out of the University of Illinois who won 40 college games (hypothetical and unrealistic), graduated (in engineering) and is 5’10” (a Flutiesque prospect).  Our second prospect is Manny Trips out of Duke.  Manny won 10 games, failed to graduate and is 6’ tall.  Michaels would seem to be the better prospect based on the available data.  The statistical model allows us to predict how much better.

We make our predictions by simply plugging our player-level data into the equation.  We would predict that Lewis would throw 10 TDs in his rookie year (1 + .1×40 + 5×1 + 0×70, with height measured in inches).  For Manny the prediction would be 2 TDs.  For now, I am just making up the coefficients (βs).  In a later entry I will estimate the model using some data on actual NFL rookie QB performance.
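
In code, the prediction step is nothing more than evaluating the equation (the coefficients are the made-up values above):

```python
# Rookie TD prediction using the made-up equation from the text (height in inches).
def predict_tds(wins, graduated, height):
    return 1 + 0.1 * wins + 5 * graduated + 0 * height

print(predict_tds(40, 1, 70))  # Lewis Michaels: 10.0
print(predict_tds(10, 0, 72))  # Manny Trips:    2.0
```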

Regression has its shortcomings and many analysts love to object to regression analyses.  But for the most part, linear regression is a solid tool for analyzing patterns in data.  It’s also relatively easy to implement.  We can run regressions in Excel!  We shouldn’t underestimate how important it is to be able to do our analyses in standard tools like Excel.

I will extend our tool kit in a future entry.  I briefly mentioned categorical variables such as whether or not a player is a starter.  For these types of Yes/No outcomes (starter or not a starter) there is a tool called logistic regression that should be in our repertoire.

*One reason this note is tricky is that I’m trying to get the right balance and tone.  I can already hear the objections.  Let’s save these for now.  For example, readers do not need to alert me to the fact that TDs are censored at zero.  Or that there is a mass point at zero because many rookies don’t play.  Or that TDs are counted in discrete units so maybe a Poisson model is more appropriate.  You get the idea.  There are many ways to object to any statistical model.  The real question isn’t whether a model is perfect.  The real question should be whether the model provides value.

Player Analytics Fundamentals: Part 3 – Metrics, Experts and Models

Last time I introduced the topic of player “metrics.” (If you want to get caught up you can start with Part 1 and Part 2 of the series.)  As I noted, determining the right metric is perhaps the most important task in player analytics.  It’s almost too obvious of a point to make – but the starting point for any analytics project should be deciding what to measure or manage.  It’s a non-trivial task because while the end goal (profit, wins) might be obvious, how this goal relates to an individual player (or strategy) may not be.

However, before I get too deep into metric development, I want to take a small detour and talk briefly about statistical models.  We won’t get to modeling in this entry – the goal is to motivate the need for statistical models!  If we are doing player analytics we need some type of tool kit to move us from mere opinion to fact-based arguments.

To illustrate what I mean by “opinion” let’s consider the example of rating quarterbacks.  In the previous entry, I presented the Passer Rating Formula used to rate NFL quarterbacks.  As a quick refresher, let’s look at this beast one more time:

Rating = 100 × (a + b + c + d) / 6, where
a = 5 × (COMP/ATT - 0.3)
b = 0.25 × (YARDS/ATT - 3)
c = 20 × (TD/ATT)
d = 2.375 - 25 × (INT/ATT)
(each of a, b, c and d is capped between 0 and 2.375)

The formula includes completion percentage (accuracy), yards per attempt (magnitude), touchdowns (ultimate success) and interceptions (failures).  Let’s pretend for a second that the formula only contained touchdowns and interceptions (just to make it simple).  The question then becomes how much should we weight touchdowns per attempt relative to interceptions per attempt?  The actual formula is hopelessly complex in some ways – we have fractional weights and statistics in different units – so let’s take a step back from the actual formula.

Imagine we have two experts proposing Passer Rating statistics that are based on touchdowns and interceptions only.  One expert might say that touchdowns per attempt are twice as important as interceptions per attempt.  We will label this “expert” created formula ePR1, for expert 1 Passer Rating.  The formula would be:

ePR1 = 2 × (TD/ATT) - (INT/ATT)

Maybe this judgment would be accompanied by some logic along the lines of “touchdowns are twice as important because the opposing team doesn’t always score as the result of an interception.”

However, the second expert suggests that touchdowns and interceptions should be weighted equally.  Maybe the logic of the second expert is that interceptions have both direct negative consequences (loss of possession) and negative psychological effects (loss of momentum), and should therefore be weighted more heavily than expert 1 allows.  The formula for expert 2 can be written as:

ePR2 = (TD/ATT) - (INT/ATT)
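
To see that the choice of weights matters, here is a quick Python sketch comparing the two formulas on two made-up stat lines.  Notice that the experts disagree about which QB is better:

```python
# The two hypothetical expert formulas from the text.
def epr1(td, ints, att):
    return 2 * (td / att) - (ints / att)  # TDs per attempt weighted twice as heavily

def epr2(td, ints, att):
    return (td / att) - (ints / att)      # equal weights

# Two made-up stat lines: (TDs, INTs, attempts).
qb_a = (30, 15, 500)  # more TDs, but also more INTs
qb_b = (25, 8, 500)   # fewer TDs, far fewer INTs

print(round(epr1(*qb_a), 3), round(epr1(*qb_b), 3))  # 0.09  0.084 -> expert 1 prefers QB A
print(round(epr2(*qb_a), 3), round(epr2(*qb_b), 3))  # 0.03  0.034 -> expert 2 prefers QB B
```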

I suspect that many readers (or a high percentage of a few readers) are objecting to developing metrics using this approach.  The approach probably seems arbitrary.  It is.  I’ve intentionally presented things in a manner that highlights the subjective nature of the process.  I’ve reduced things down to just 2 stats and I’ve chosen very simple weights.  But the reality is that this is the basic process through which novices tend to develop “new” or “advanced” statistics.  In fact, it is still very much a standard practice.  The decision maker or supporting analysts gather multiple pieces of information and then use a system of weights to determine a final “grade” or evaluation.

The question then becomes which formula do we use?  Both formulas include multiple pieces of data and are based on a combination of logic and experience.  I am ignoring (for the moment) a critical element of this topic – the issue of decision biases.  In subsequent entries, I’m going to advocate for an approach that is based on data and statistical models.  Next time, we will start to talk more about statistical tools.

Player Analytics Fundamentals: Part 2 – Performance Metrics

I want to start the series with the topic of “Metric Development.”  I’m going to use the term “metric” but I could have just as easily used words like stats, measures or KPIs.  Metrics are the key to sports and other analytics functions since we need to be sure that we have the right performance standards in place before we try to optimize.  Let me say that one more time – METRIC DEVELOPMENT IS THE KEY.

The history of sports statistics has focused on so-called “box score” statistics such as hits, runs or RBIs in baseball.  These simple statistics have utility but also significant limitations.  For example, in baseball a key statistic is batting average.  Batting average is intuitively useful as it shows a player’s ability to get on base and to move other runners forward.  However, batting average is also limited as it neglects the difference between types of hits.  In a batting average calculation, a double or home run is of no greater value than a single.  It also neglects the value of walks.

These shortcomings motivated the development of statistics like OPS (on-base plus slugging).  Measures like OPS that are constructed from multiple statistics are appealing because they begin to capture the multiple contributions made by a player.  On the downside, these types of constructed statistics often have an arbitrary nature in terms of how the component statistics are weighted.

The complexity of player contributions and the “arbitrary nature” of how simple statistics are weighted is illustrated by the formula for the NFL quarterback rating:

Rating = 100 × (a + b + c + d) / 6, where
a = 5 × (COMP/ATT - 0.3)
b = 0.25 × (YARDS/ATT - 3)
c = 20 × (TD/ATT)
d = 2.375 - 25 × (INT/ATT)
(each of a, b, c and d is capped between 0 and 2.375)

This equation combines completion percentage (COMP/ATT), yards per attempt (YARDS/ATT), touchdown rate (TD/ATT) and interception rate (INT/ATT) to arrive at a single statistic for a quarterback.  On the plus side, the metric includes data related to “accuracy” (completion percentage), to “scale” (yards per attempt), to “conversion” (TDs), and to “failures” (interceptions).  We can debate whether this is a sufficiently complete look at QBs (should we include sacks?) but it does cover multiple aspects of passing performance.  However, a common reaction to the formula is a question about where the weights come from.  Why is completion rate multiplied by 5 and touchdown rate multiplied by 20?
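
The formula is easy to turn into code, which also makes the capping of each component explicit.  Here is a sketch in Python; the 2004 Peyton Manning stat line serves as a real-world check:

```python
def passer_rating(comp, att, yards, td, ints):
    """NFL passer rating: four per-attempt components, each capped at [0, 2.375]."""
    cap = lambda x: max(0.0, min(x, 2.375))
    a = cap(5 * (comp / att - 0.3))      # accuracy
    b = cap(0.25 * (yards / att - 3))    # scale
    c = cap(20 * (td / att))             # conversion
    d = cap(2.375 - 25 * (ints / att))   # failures
    return 100 * (a + b + c + d) / 6

# Peyton Manning, 2004: 336-of-497 for 4557 yards, 49 TD, 10 INT.
print(round(passer_rating(336, 497, 4557, 49, 10), 1))  # 121.1
```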

Is it a great statistic?  One way to evaluate it is via a quick check of the historical record.  Does the historical ranking jibe with our intuition?  Here is a link to historical rankings.

Every sport has examples of these kinds of “multi-attribute” constructed statistics.  Basketball has player efficiency metrics that involve weighting a player’s good events (points, rebounds, steals) and negative outcomes (turnovers, fouls, etc.).  The OPS metric involves an implicit assumption that “on base percentage” and “slugging” are of equal value.
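
That implicit assumption is easy to make visible in code: OPS is just a weighted sum with both weights set to one, and nothing stops us from trying other weights.  The 1.8 multiplier below is only an illustration of a reweighted variant, not a recommendation:

```python
# OPS as an explicitly weighted sum of its components.
def weighted_ops(obp, slg, w_obp=1.0, w_slg=1.0):
    return w_obp * obp + w_slg * slg

print(weighted_ops(0.350, 0.450))             # standard OPS: 0.800
print(weighted_ops(0.350, 0.450, w_obp=1.8))  # a reweighted variant: 1.080
```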

One area I want to explore is how we should construct these types of performance metrics.  This is a discussion that involves some philosophy and some statistics.  We will take this piece by piece and also show a couple of applications along the way.

Player Analytics Fundamentals: Part 1

Each spring I teach courses on Sports Analytics.  These courses include both Marketing Analytics and On-Field Analytics.  The “Blog” has tended to focus on the Marketing or Fan side.  Moving forward, I think the balance is going to change a bit.  My plan is to re-balance the blog to include more of the on-field topics.

Last year I published a series of posts related to the fundamentals of sports analytics.  This material is relevant to both the marketing and the team performance sides of sports analytics.  This series featured comments on organizational design and decision theory.

This series is going to be a bit different than the team and player “analytics” that we see on the web.  Rather than present specific studies, I am going to begin with some fundamental principles and talk about a “general” approach to player analytics.  There is a lot of material on the web related to very specific sports analytics questions.  Analytics can be applied to baseball, football, soccer and every other sport.  And within each of these games there are countless questions to be addressed.

Rather than contribute to the littered landscape, I want to talk about how I approach sports analytics questions.  In some ways, this series is the blueprint I use for thinking about sports analytics in the classroom.  My starting point is that I want to provide skills and insights that can be applied to any sport.  So we start with the fundamentals and we think a lot about how to structure problems.  I want to supply grounded general principles that can be applied to any player analytics problem.

So what’s the plan?  At a high level, sports analytics are about prediction.  We will start with a discussion about what we should be predicting.  This is a surprisingly complex issue.  From there we will talk a little bit about different statistical models.  This won’t be too bad, because I’m a firm believer in using the simplest possible models.  The second half of the series will focus on different types of prediction problems.  These will range from predicting booms and busts, to a look at how to do “comparables” in a better fashion.  In terms of the data, I think it will be a mix of football and the other kind of football.