Player Analytics Fundamentals: Part 5 – Modeling 102

In Part 4 of the series we started talking about what should be in the analyst’s tool kit.  I advocated for linear regression to be the primary tool.  Linear regression is (relatively) easy to implement and produces equations that are (relatively) easy to understand.  I also made the point that linear regression is best suited for predicting continuous measures and used the example of predicting the number of touchdown passes thrown by a rookie QB.

But not everything we want to predict is going to be a continuous variable.  Since we are talking about predicting quarterback performance, maybe we prefer a metric that is more discrete such as whether a player becomes a starter.  Can we still use linear regression?  Maybe.

Let’s return to the example from last time.  The task was to predict professional (rookie year) success based on college level data.  We assumed that general managers can obtain data on the number of games won as a college player, whether the player graduated (or will graduate) and the player’s height.

Our initial measure of pro success was touchdown passes.  We then specified a regression model using the following equation.

TD Passes = β0 + β1 × College Wins + β2 × Graduated + β3 × Height

But let’s say that we don’t like the TD passes metric.  Maybe we don’t like it because we think TD passes are more related to wide receiver talent than to the quality of the QB.  Rather than use TDs as our dependent variable we want to use whether a player becomes a starter.  This is also an interesting metric as it captures whether the player was selected by a coaching staff to be the primary quarterback.  This is a nice feature as the metric includes some measure of human expertise.  I’ll leave criticism to the readers as an exercise.

This leads us to the following equation:

Starter = β0 + β1 × College Wins + β2 × Graduated + β3 × Height

One issue we have to address before we estimate this model is how we define the term starter.  In a statistical model we need to convert the word or category of “starter” into a number.  In this case, the easy solution is to treat players that became starters as 1’s and players that did not as 0’s.  As a second exercise – what would we do if we had three categories (did not play, reserve, starter)?
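As a minimal sketch of the 0/1 coding described above, the snippet below turns a text label into a number.  The label strings, the sample list and the function name are all hypothetical and are used only for illustration.

```python
# Hypothetical role labels for a handful of rookie QBs (made-up data).
roles = ["starter", "reserve", "starter", "did not play"]


def encode_starter(role):
    """Return 1 if the player became a starter, otherwise 0."""
    return 1 if role == "starter" else 0


# Apply the coding to every player in the list.
starter_flags = [encode_starter(role) for role in roles]
print(starter_flags)
```

The resulting column of 1's and 0's is what would be used as the dependent variable in the regression.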

Let’s pretend we estimated the preceding model and obtained the following equation:

We can use the equation to “score” or “rate” our imaginary prospects from last time (Lewis Michaels and Manny Trips).  In terms of the input data, Lewis won 40 college games, graduated and is 5′ 10”.  Plugging Michaels’ data into the equation gives us a score of .22.  The analysis that we have performed is commonly termed a linear probability model.  A simple interpretation of this result is that the expected probability of Michaels (or better said, a prospect with Michaels’ statistics) becoming a starter is 22%.

So far so good.

Our second prospect is Manny Trips out of Stanford.  Manny won 10 games, failed to graduate and is 6’ tall.  For Manny the prediction would be -12.8%.  This is the big problem with using linear regression to predict binary (Yes/No) outcomes.  How do we interpret a negative probability?  Or a probability that is greater than 1?
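To make the problem concrete, here is a small sketch of scoring prospects with a linear probability model.  The coefficients below are hypothetical: they were picked only because they happen to reproduce the two scores quoted in the text (.22 and -.128), and many other sets of coefficients would do the same.  Height is assumed to be entered in inches.

```python
# Hypothetical coefficients (an assumption for illustration, not estimates).
COEFS = {"intercept": -0.25, "wins": 0.005, "graduated": 0.2, "height": 0.001}


def lpm_score(wins, graduated, height_in):
    """Linear probability model score: a weighted sum of the inputs."""
    return (COEFS["intercept"]
            + COEFS["wins"] * wins
            + COEFS["graduated"] * graduated
            + COEFS["height"] * height_in)


lewis = lpm_score(wins=40, graduated=1, height_in=70)
manny = lpm_score(wins=10, graduated=0, height_in=72)

for name, score in (("Lewis", lewis), ("Manny", manny)):
    # Nothing in a linear equation keeps the score between 0 and 1.
    flag = "" if 0.0 <= score <= 1.0 else "  <- not a valid probability"
    print(f"{name}: {score:.3f}{flag}")
```

Because the score is just a weighted sum, a prospect with weak enough inputs falls below zero and a prospect with strong enough inputs can exceed one.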

So what do we do next?  I think we have two options.  We can ignore the problem.  If the goal is just to rank prospects then maybe we don’t care very much.  In this case, we just care about the relative scores, not the actual prediction.  If we are just using analytics to screen QB prospects or to provide another data point then maybe our model is good enough.  The level of investment in a modeling project should be based on how the model is going to be used.  In many or most sports applications I would lean toward simpler, less complicated models.

Our second option is to move to a more complicated model.  There are a host of models available for categorical data.  We can use a binary logit or probit model for a binary outcome like the one above.  If the categories have a natural ordering to them (never played, reserve, starter) then we can use an ordered logit.  If there is no order to the categories, then we can use a multinomial logit.  I’m still debating how much attention I should pay to these models.  Having a tool to deal with categorical variables can be invaluable but there is a cost.  The mathematics become more complex, estimation of the model requires specialized software and interpretation of the model becomes less intuitive.
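As a rough preview of why a binary logit avoids impossible probabilities, the sketch below shows the logistic function that such a model applies to a weighted sum of the inputs.  The index values are arbitrary numbers chosen for illustration.  In an actual logit model the weights are estimated by maximum likelihood, so they would not be the same as linear regression coefficients.

```python
import math


def logistic(z):
    """Map any real-valued index z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Arbitrary index values (assumptions), from strongly negative to strongly positive.
for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"index {z:+.1f} -> probability {logistic(z):.3f}")
```

However negative or positive the index becomes, the output stays between 0 and 1, which is the property the linear probability model lacks.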

I think I will discuss the binary logit next time.

Player Analytics Fundamentals: Part 4 – Statistical Models

Today’s post introduces the topic of statistical modeling.  This is, maybe, the trickiest part of the series to write.  The problem is that mastering the technical side of statistical analysis usually takes years of education.  And, more critically, developing the wisdom and intuition to use statistical tools effectively and creatively takes years of practice.  The goal of this segment is to point people in the right direction, more than to provide detailed instruction.  That said – I can adjust if there is a call for more technical material.  (If you want to start from the beginning parts 1, 2 and 3 are a click away.)

Let’s start with a simple point.  The primary tool for every analytics professional (sports or otherwise) should be linear regression.  Linear regression allows the analyst to quantify the relationship between some focal variable of interest (dependent measure or DV) and a set of variables that we think drive that variable (independent variables).  In other words, regression is a tool that can produce an equation that shows how some inputs produce an outcome of interest.  In the case of player analytics, this might be a prediction of future performance based on a player’s past statistics or physical attributes.

To make this more concrete, let’s say we want to do an analysis of rookie quarterback performance (we’ve been talking a bit about QB metrics so far in the series).  Selecting QBs involves significant uncertainty.  The transition from the college game to the pro game requires the QB to be able to deal with more complex offensive systems, more sophisticated defenses and more talented opposing players.  The task of the general manager is to identify prospects that can successfully make the transition.

Data and statistical analysis can potentially play a part in this type of decision.  The starting point would be the idea that observable data on college prospects can help predict rookie year performance.  As a starting point let’s assume that general managers can obtain data on the number of games won as a college player, whether the player graduated (or will graduate) and the player’s height.  (We just might be foreshadowing a famous set of rules for drafting quarterbacks).

The other key decision for a statistical analysis of rookie QB performance versus college career and physical data is a performance metric.  We could use the NFL passer rating formula that we have been discussing.  Or we could use something else.  For example, maybe the number of TD passes thrown as a rookie.  This metric is interesting as it captures something about playing time and ability to create scores.

Touchdowns are also a metric that “fits” linear regression.  Linear regression is best suited to the analysis of quantitative variables that vary continuously.  The number of touchdowns we observe in data will range from zero to whatever the rookie TD record is.  In contrast, other metrics such as whether the player becomes a starter or a pro bowler are categorical variables.  There are other techniques that are better for analyzing categorical variables.  (If you are a stats jockey and are objecting to the last couple of statements, please see the note below.)

The purpose of regression analysis is to create an equation of the following form:

TD Passes = β0 + β1 × College Wins + β2 × Graduated + β3 × Height

This equation says that TD passes are a function of college wins, graduation and height.  The βs are the weights that are determined by the linear regression analysis.  Specifically, linear regression determines the βs that best fit the data.  This is the important point.  The weights or βs are determined from the data.  To illustrate how the equation works, let’s imagine that we estimated the regression model and obtained the following equation.

TD Passes = 1 + 0.1 × College Wins + 5 × Graduated + 0 × Height

This equation says that we can predict rookie TD passes by plugging in each player’s data related to college wins, graduation and height.  It also says that a history of winning is positively related to TDs and graduation also is a positive.  The coefficient for height is zero.  This indicates that height is not a predictor of rookie TDs (I’m making these numbers up – height probably matters).  One benefit of developing a model is that we let the data speak.  Our “expert” judgment might be that height matters for quarterbacks.  The regression results can help identify decision biases if the coefficients don’t match the experts’ predictions.  I am neglecting the issue of significance for now – just to keep the focus on intuition.
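To show what “determined from the data” means, here is a minimal sketch of least squares with a single predictor.  The four data points are made up, and the function name is hypothetical.  A real analysis would use all three inputs and a statistics package (or Excel), but the idea is the same.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: college wins and rookie TD passes for four imaginary QBs.
college_wins = [10, 20, 30, 40]
rookie_tds = [2, 3, 4, 5]

b0, b1 = fit_simple_ols(college_wins, rookie_tds)
print(f"TD Passes = {b0:.2f} + {b1:.2f} x College Wins")
```

The weights come out of the data rather than out of anyone’s opinion, which is the point of the exercise.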

Let’s say we have two prospects.  Lewis Michaels out of the University of Illinois who won 40 college games (hypothetical and unrealistic), graduated (in engineering) and is 5’10” (a Flutiesque prospect).  Our second prospect is Manny Trips out of Duke.  Manny won 10 games, failed to graduate and is 6’ tall.  Michaels would seem to be the better prospect based on the available data.  The statistical model allows us to predict how much better.

We make our predictions by simply plugging our player level data into the equation.  We would predict Lewis would throw 10 TDs in his rookie year (1+.1*40+5*1+0*70).  For Manny the prediction would be 2 TDs.  For now, I am just making up the coefficients (βs).  In a later entry I will estimate the model using some data on actual NFL rookie QB performance.
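The same arithmetic can be written as a short function.  This sketch uses the made-up coefficients from the text (1, .1, 5 and 0) and assumes that height is entered in inches; the function name is hypothetical.

```python
def predict_rookie_tds(college_wins, graduated, height_in):
    """Plug player data into the made-up equation from the text."""
    return 1 + 0.1 * college_wins + 5 * graduated + 0 * height_in


# Lewis Michaels: 40 wins, graduated, 5'10" (70 inches).
lewis_tds = predict_rookie_tds(college_wins=40, graduated=1, height_in=70)
# Manny Trips: 10 wins, did not graduate, 6' (72 inches).
manny_tds = predict_rookie_tds(college_wins=10, graduated=0, height_in=72)

print(f"Lewis: {lewis_tds:.0f} TDs, Manny: {manny_tds:.0f} TDs")
```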

Regression has its shortcomings and many analysts love to object to regression analyses.  But for the most part, linear regression is a solid tool for analyzing patterns in data.  It’s also relatively easy to implement.  We can run regressions in Excel!  We shouldn’t underestimate how important it is to be able to do our analyses in standard tools like Excel.

I will extend our tool kit in a future entry.  I briefly mentioned categorical variables such as whether or not a player is a starter.  For these types of Yes/No outcomes (starter or not a starter) there is a tool called logistic regression that should be in our repertoire.

*One reason this note is tricky is that I’m trying to get the right balance and tone.  I can already hear the objections.  Let’s save these for now.  For example, readers do not need to alert me to the fact that TDs are censored at zero.  Or that there is a mass point at zero because many rookies don’t play.  Or that TDs are counted in discrete units so maybe a Poisson model is more appropriate.  You get the idea.  There are many ways to object to any statistical model.  The real question isn’t whether a model is perfect.  The real question should be whether the model provides value.

Player Analytics Fundamentals: Part 3 – Metrics, Experts and Models

Last time I introduced the topic of player “metrics.” (If you want to get caught up you can start with Part 1 and Part 2 of the series.)  As I noted, determining the right metric is perhaps the most important task in player analytics.  It’s almost too obvious of a point to make – but the starting point for any analytics project should be deciding what to measure or manage.  It’s a non-trivial task because while the end goal (profit, wins) might be obvious, how this goal relates to an individual player (or strategy) may not be.

However, before I get too deep into metric development, I want to take a small detour and talk briefly about statistical models.  We won’t get to modeling in this entry – the goal is to motivate the need for statistical models!  If we are doing player analytics we need some type of tool kit to move us from mere opinion to fact based arguments.

To illustrate what I mean by “opinion” let’s consider the example of rating quarterbacks.  In the previous entry, I presented the Passer Rating Formula used to rate NFL quarterbacks.  As a quick refresher let’s look at this beast one more time.  The formula includes completion percentage (accuracy), yards per attempt (magnitude), touchdowns (ultimate success) and interceptions (failures).  Let’s pretend for a second that the formula only contained touchdowns and interceptions (just to make it simple).  The question then becomes how much should we weight touchdowns per attempt relative to interceptions per attempt?  The actual formula is hopelessly complex in some ways – we have fractional weights and statistics in different units – so let’s take a step back from the actual formula.

Imagine we have two experts proposing Passer Rating statistics that are based on touchdowns and interceptions only.  One expert might say that touchdowns per attempt are twice as important as interceptions per attempt.  We will label this “expert” created formula as ePR1 for expert 1 Passer Rating.  The formula would be:

ePR1 = 2 × (TD/ATT) − (INT/ATT)

Maybe this judgment would be accompanied by some logic along the lines of “touchdowns are twice as important because the opposing team doesn’t always score as the result of an interception.”

However, the second expert suggests that touchdowns and interceptions should be weighted equally.  Maybe the logic of the second expert is that interceptions have both direct negative consequences (loss of possession) and also negative psychological effects (loss of momentum), and should therefore carry as much weight as touchdowns.  The formula for expert 2 can be written as:

ePR2 = (TD/ATT) − (INT/ATT)

I suspect that many readers (or a high percentage of a few readers) are objecting to developing metrics using this approach.  The approach probably seems arbitrary.  It is.  I’ve intentionally presented things in a manner that highlights the subjective nature of the process.  I’ve reduced things down to just 2 stats and I’ve chosen very simple weights.  But the reality is that this is the basic process through which novices tend to develop “new” or “advanced” statistics.  In fact, it is still very much a standard practice.  The decision maker or supporting analysts gather multiple pieces of information and then use a system of weights to determine a final “grade” or evaluation.
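A quick sketch shows why the choice of weights matters.  It assumes the two expert formulas take the form ePR1 = 2 × TD/ATT − INT/ATT and ePR2 = TD/ATT − INT/ATT, and the two stat lines are invented for illustration.

```python
def epr1(td, ints, att):
    """Expert 1: touchdowns per attempt count twice as much as interceptions."""
    return 2 * td / att - ints / att


def epr2(td, ints, att):
    """Expert 2: touchdowns and interceptions are weighted equally."""
    return td / att - ints / att


# Invented season totals: (touchdowns, interceptions, attempts).
qb_a = (60, 40, 1000)  # more touchdowns, more interceptions
qb_b = (50, 25, 1000)  # fewer touchdowns, fewer interceptions

for label, formula in (("ePR1", epr1), ("ePR2", epr2)):
    a, b = formula(*qb_a), formula(*qb_b)
    better = "QB A" if a > b else "QB B"
    print(f"{label}: QB A = {a:.3f}, QB B = {b:.3f} -> {better} rates higher")
```

With these numbers the two experts disagree about which quarterback is better, even though they are looking at exactly the same statistics.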

The question then becomes which formula do we use?  Both formulas include multiple pieces of data and are based on a combination of logic and experience.  I am ignoring (for the moment) a critical element of this topic – the issue of decision biases.  In subsequent entries, I’m going to advocate for an approach that is based on data and statistical models.  Next time, we will start to talk more about statistical tools.

Player Analytics Fundamentals: Part 2 – Performance Metrics

I want to start the series with the topic of “Metric Development.”  I’m going to use the term “metric” but I could have just as easily used words like stats, measures or KPIs.  Metrics are the key to sports and other analytics functions since we need to be sure that we have the right performance standards in place before we try and optimize.  Let me say that one more time – METRIC DEVELOPMENT IS THE KEY.

The history of sports statistics has focused on so called “box score” statistics such as hits, runs or RBIs in baseball.  These simple statistics have utility but also significant limitations.  For example, in baseball a key statistic is batting average.  Batting average is intuitively useful as it shows a player’s ability to get on base and to move other runners forward.  However, batting average is also limited as it neglects the difference between types of hits.  In a batting average calculation, a double or home run is of no greater value than a single.  It also neglects the value of walks.

These short-comings motivated the development of statistics like OBPS (on base plus slugging).  Measures like OBPS that are constructed from multiple statistics are appealing because they begin to capture the multiple contributions made by a player.  On the downside these types of constructed statistics often have an arbitrary nature in terms of how component statistics are weighted.
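The difference between batting average and on base plus slugging (usually abbreviated OPS) is easy to see with a small calculation.  The sketch below uses the standard definitions of on-base percentage and slugging percentage; the two hitters and their stat lines are invented.

```python
def batting_average(p):
    return p["hits"] / p["at_bats"]


def on_base_pct(p):
    """(H + BB + HBP) / (AB + BB + HBP + SF)."""
    reached = p["hits"] + p["walks"] + p["hbp"]
    chances = p["at_bats"] + p["walks"] + p["hbp"] + p["sac_flies"]
    return reached / chances


def slugging_pct(p):
    """Total bases divided by at bats."""
    singles = p["hits"] - p["doubles"] - p["triples"] - p["home_runs"]
    total_bases = (singles + 2 * p["doubles"]
                   + 3 * p["triples"] + 4 * p["home_runs"])
    return total_bases / p["at_bats"]


def ops(p):
    # Simple sum: on-base and slugging get equal weight by construction.
    return on_base_pct(p) + slugging_pct(p)


# Invented hitters with identical batting averages.
singles_hitter = {"at_bats": 400, "hits": 100, "doubles": 0, "triples": 0,
                  "home_runs": 0, "walks": 0, "hbp": 0, "sac_flies": 0}
power_hitter = {"at_bats": 400, "hits": 100, "doubles": 20, "triples": 0,
                "home_runs": 20, "walks": 50, "hbp": 0, "sac_flies": 0}

for name, p in (("Singles hitter", singles_hitter), ("Power hitter", power_hitter)):
    print(f"{name}: AVG {batting_average(p):.3f}, OPS {ops(p):.3f}")
```

Both hitters bat .250, but the constructed statistic separates them because it credits walks and extra-base hits.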

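The complexity of player contributions and the “arbitrary nature” of how simple statistics are weighted are illustrated by the formula for the NFL quarterback rating.

Passer Rating = [ (COMP/ATT − 0.3) × 5 + (YARDS/ATT − 3) × 0.25 + (TD/ATT) × 20 + 2.375 − (INT/ATT) × 25 ] / 6 × 100

Each of the four components is limited to a range of 0 to 2.375 before the components are added together.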

This equation combines completion percentage (COMP/ATT), yards per attempt (YARDS/ATT), touchdown rate (TD/ATT) and interception rate (INT/ATT) to arrive at a single statistic for a quarterback.  On the plus side the metric includes data related to “accuracy” (completion percentage), to “scale” (yards per attempt), to “conversion” (TDs), and to “failures” (interceptions).  We can debate if this is a sufficiently complete look at QBs (should we include sacks?) but it does cover multiple aspects of passing performance.   However, a common reaction to the formula is a question about where the weights come from.  Why is completion rate multiplied by 5 and touchdown rate multiplied by 20?
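For readers who want to see the weights at work, here is a sketch of the standard NFL passer rating calculation.  Each of the four components is capped between 0 and 2.375 before they are combined.  The stat line in the example is invented, and the function names are hypothetical.

```python
def clamp(value, low=0.0, high=2.375):
    """Keep a component inside the allowed 0 to 2.375 range."""
    return max(low, min(high, value))


def passer_rating(completions, attempts, yards, touchdowns, interceptions):
    """NFL passer rating built from four capped components."""
    a = clamp((completions / attempts - 0.3) * 5)        # accuracy
    b = clamp((yards / attempts - 3) * 0.25)             # yards per attempt
    c = clamp((touchdowns / attempts) * 20)              # touchdown rate
    d = clamp(2.375 - (interceptions / attempts) * 25)   # interception rate
    return (a + b + c + d) / 6 * 100


# Invented game: 20 of 30 passing for 300 yards, 3 TDs and 1 interception.
rating = passer_rating(completions=20, attempts=30, yards=300,
                       touchdowns=3, interceptions=1)
print(f"Passer rating: {rating:.2f}")
```

Changing any of the multipliers (5, 0.25, 20 or 25) changes how quarterbacks are ranked, which is why the question about where the weights come from matters.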

Is it a great statistic?  One way to evaluate is via a quick check of the historical record.  Does the historical ranking jibe with our intuition?  Here is a link to historical rankings.

Every sport has examples of these kinds of “multi-attribute” constructed statistics.  Basketball has player efficiency metrics that involve weighting a player’s good events (points, rebounds, steals) and negative outcomes (turnovers, fouls, etc…).  The OBPS metric involves an implicit assumption that “on base percentage” and “slugging” are of equal value.

One area I want to explore is how we should construct these types of performance metrics.  This is a discussion that involves some philosophy and some statistics.  We will take this piece by piece and also show a couple of applications along the way.

Player Analytics Fundamentals: Part 1

Each Spring I teach courses on Sports Analytics.  These courses include both Marketing Analytics and On-Field Analytics.  The “Blog” has tended to focus on the Marketing or Fan side.  Moving forward, I think the balance is going to change a bit.  My plan is to re-balance the blog to include more of the on-field topics.

Last year I published a series of posts related to the fundamentals of sports analytics.  This material is relevant to both the marketing and the team performance sides of sports analytics.  This series featured comments on organizational design and decision theory.

This series is going to be a bit different than the team and player “analytics” that we see on the web.  Rather than present specific studies, I am going to begin with some fundamental principles and talk about a “general” approach to player analytics.  There is a lot of material on the web related to very specific sports analytics questions.  Analytics can be applied to baseball, football, soccer and every other sport.  And within each of these games there are countless questions to be addressed.

Rather than contribute to the littered landscape, I want to talk about how I approach sports analytics questions.  In some ways, this series is the blueprint I use for thinking about sports analytics in the classroom.  My starting point is that I want to provide skills and insights that can be applied to any sport.  So we start with the fundamentals and we think a lot about how to structure problems.  I want to supply grounded general principles that can be applied to any player analytics problem.

So what’s the plan?  At a high level, sports analytics are about prediction.  We will start with a discussion about what we should be predicting.  This is a surprisingly complex issue.  From there we will talk a little bit about different statistical models.  This won’t be too bad, because I’m a firm believer in using the simplest possible models.  The second half of the series will focus on different types of prediction problems.  These will range from predicting booms and busts, to a look at how to do “comparables” in a better fashion.  In terms of the data, I think it will be a mix of football and the other kind of football.

 

NFL Fan Base and Brand Rankings 2017

NFL Fandom Report 2017: The “Best” NFL Fans

Who has the best fans in the NFL?  What are the best brands in the NFL? These are simple questions without simple answers.  First we have to decide what we mean by “best”.  What makes for a great fan or brand?  Fans that show up even when the team is losing?  Fans that are willing to pay the highest prices?  Fans that are willing to follow a team on the road or social media?

Even after we agree on the question, answering it is also a challenge.  How do we adjust for the fact that one team might have gone on a miraculous run that filled the stadium?  Or perhaps another team suffered a slew of injuries?  How do we compare fan behavior in a market like New York with fans in a place like Green Bay?

My approach to evaluating fan bases is to use data to develop statistical models of fan interest (more details here).  The key is that these models are used to determine which city’s fans are more willing to spend or follow their teams after controlling for factors like market size and short-term changes in winning and losing.

In past years, two measures of engagement have been featured: Fan Equity and Social Media Equity.  Fan Equity focuses on home box office revenues (support via opening the wallet) and Social Media Equity focuses on fan willingness to engage as part of a team’s community (support exhibited by joining social media communities).  This year I am adding a third measure Road Equity.  Road Equity focuses on how teams draw on the road after adjusting for team performance.   These metrics provide a balance – a measure of willingness to spend, a measure unconstrained by stadium size and a measure of national appeal.

To get at an overall ranking, I’m going to use the simplest possible method.  We are just going to average across the three metrics.  (Similar analyses are available for the NBA and MLB.)
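The averaging step can be sketched in a few lines.  The team names and the ranks below are placeholders, not the actual results, and a lower average means a better overall rank.

```python
# Placeholder ranks on each metric: (Fan Equity, Social Media Equity, Road Equity).
metric_ranks = {
    "Team A": (1, 2, 1),
    "Team B": (2, 1, 3),
    "Team C": (3, 3, 2),
}

# Average the three ranks for each team.
average_rank = {team: sum(ranks) / len(ranks)
                for team, ranks in metric_ranks.items()}

# Sort from best (lowest average) to worst.
overall_order = sorted(average_rank, key=average_rank.get)

for position, team in enumerate(overall_order, start=1):
    print(f"{position}. {team} (average rank {average_rank[team]:.2f})")
```

A simple average treats the three measures as equally important, which is itself a weighting choice.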

The Winners

The top five fan bases (team brands if you prefer) are the Cowboys, Patriots, Eagles, Giants and Steelers.  The Cowboys excel on all the metrics.  They win on Fan Equity (a revenue premium measure of brand strength) and Road Equity, and they finish second in social media.  The underlying data (I will spare everybody the statistical models) reveals why Dallas does so well.  The Cowboys’ average home attendance (reported by ESPN) is more than 10,000 higher than the next team.  The Cowboys’ average ticket price is also well above average and they have the second most Twitter followers after the Patriots.  The other thing to note is that the Cowboys achieve these results year in and year out, even in years when the team is not great.

There are likely some objections to the list.  Patriot fans are bandwagon fans!  The Steelers are too low!  The Eagles above the Packers or Bears?!   Way too much to get into in a short blog post but a couple of comments.

First, Patriot fans may be bandwagon fans.  But at this point it is tough to tell.  The team has been excellent and the fans have been supportive for a long time.  And even when things tend to go wrong for the Patriots they come out ahead.  I believe that the Deflategate controversy had a significant positive impact on the Patriots’ social media following.

The Steelers are low in Fan Equity and higher on the other metrics.  We can trace this to the Steelers’ pricing.  The Steelers seem to price on the low side of what is possible.

The Eagles do surprise me.  They do get a bump from playing in the NFC East in terms of the Road Equity metric.  The NFC East is a strong collection of brands that benefit each other.  It is not easy to disentangle these effects.  And perhaps we shouldn’t, since we can make a case that the rivalries that benefit these teams exist because of the interest in the individual brands.

The Losers

At the other extreme we have the Bengals, Jaguars, Titans, Rams and Chiefs.  Some of these are no surprises.  At the top of the list we have the NFL’s royalty.  No one has ever placed the Bengals, Jaguars or Titans in that category.

The teams at the bottom of the rankings all suffer from relatively low attendance, have below average pricing power and have limited social followings.  The Rams are a special case.  While not a great brand in past years, the move to LA tends to punish the Rams because their results have not kept pace with the higher income and population levels in LA.

The Chiefs are the tough one on this list.  The Chiefs fill their stadium but at relatively low prices.  Keep in mind that the analysis includes factors such as population and median income.  In addition, Kansas City was ranked 29th in terms of Road Attendance last year and the social media following (Twitter) is middle of the road.  The fundamental issue is that the Chiefs produce these below average fan-based results while performing well above average on the field.

The Complete List

The complete list follows.  In addition to the overall ranking of fan bases, I also report rankings on the social and road measures.  Following the table, I provide a bit more detail regarding each of the metrics.

Three metrics are used to get a complete picture of fans.  But there are other ways to look at fan behavior and brand strength.  For example, we could look at pricing power (which teams are able to extract significant price premiums) or bandwagon fan behavior (which fans are most sensitive to winning).  I’m happy to provide these additional rankings if there is interest.

Fan Equity

Winners: Cowboys, Patriots and 49ers

Losers: Rams, Raiders, Jags

Fan Equity looks at home revenues relative to expected revenue based on team performance and market characteristics.  The goal of the metric is to measure over or under performance relative to other teams in the league.  In other words, statistical models are used to create an apples-to-apples type comparison to avoid distortions due to long-term differences in market size or short-term differences in winning rates.
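The logic of comparing actual revenue to expected revenue can be sketched with a toy model.  Everything here is an assumption made for illustration: the data are invented, and the expected revenue is based on market size alone, whereas the models behind the rankings also control for team performance and other market characteristics.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Invented data: market size (millions of people) and home revenue ($ millions).
teams = ["Team A", "Team B", "Team C", "Team D"]
market_size = [1, 2, 3, 4]
revenue = [12, 18, 33, 37]

intercept, slope = fit_simple_ols(market_size, revenue)

# "Equity" here is actual revenue minus the revenue the model expects.
equity = {}
for team, size, actual in zip(teams, market_size, revenue):
    expected = intercept + slope * size
    equity[team] = actual - expected

for team in sorted(equity, key=equity.get, reverse=True):
    print(f"{team}: {equity[team]:+.1f}")
```

In this toy example the team from the largest market has the highest revenue but not the highest equity, because its revenue falls short of what its market size would predict.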

The 49ers are the interesting winner on this metric.  After the last couple of years, it is doubtful that people are thinking about the 49ers having a rabid fan base.  However, the 49ers are a great example of how the approach works.  On the field the 49ers have been terrible.  But despite the on-field struggles the 49ers still pack in the fans and charge high prices.  This is evidence of a very strong brand because even while the team is losing, 49ers fans still attend and spend.  In terms of the overall rankings the 49ers don’t do all that well because the team does not perform as well as a road or social media draw.

In terms of business concepts, this “Fan Equity” measure is similar to a “revenue premium” measure of brand equity.  It captures the differentials in fans’ willingness to financially support teams of similar quality.  From a business or marketing perspective this is a gold standard of metrics as it directly relates to how a strong brand translates to revenues and profits.

One important thing to note is that some teams may not be trying to maximize revenues.  Perhaps the team is trying to build a fan base by keeping prices low.  Or a team may price on the low side based on some notion of loyalty to its community.   In these cases the Fan Equity metric may understate the engagement of fans.

Social Media Equity

Winners: Patriots, Cowboys and Broncos

Losers: Chiefs, Rams and Cardinals

Social Media Equity is also an example of a “premium” based measure of brand equity.  It differs from the Fan Equity in that it focuses on how many fans a team has online rather than fans’ willingness to pay higher prices.  Similar to Fan Equity, Social Media Equity is also constructed using statistical models that control for performance and market differences.

In terms of business application, the social media metric has several implications both on its own merits and in conjunction with the Fan Equity measure.  For example, the lack of local constraints means that the Social Media Equity measure is more of a national level measure.  So while the Fan Equity metric focuses on local box office revenues, the social metric provides insight into how a team’s fandom extends beyond a metro area.

Social Media Equity may also serve as a leading indicator of a team’s future fortunes.  For a team to grow revenues it is often necessary to implement controversial price increases.  Convincing fans to sign expensive contracts to buy season tickets can also be a challenge.  Increasing prices and acquiring season ticket holders can therefore take time, while social media communities can grow quickly.  Some preliminary analysis suggests that vibrant social communities are positively correlated with future revenue growth.

A comparison of Fan Equity and Social Media Equity can also be useful.  If Social Media Equity exceeds Fan Equity it is evidence that the team has some marketing potential that is not being exploited.  For example, one issue that is common in sports is that it is difficult to estimate the price elasticity of demand because demand is often highest for the best teams and best seats.  The unconstrained nature of social media can provide an important data point for assessing whether a team has additional pricing flexibility.

Road Equity

Winners: Cowboys, Eagles and Raiders

Losers: Texans, Titans and Seahawks

Another way to look at fan quality is to look at how a team draws on the road.  In the NBA these effects are pronounced.  LeBron or a retiring Kobe coming to town can often lead to sellouts.  At the college level some teams are known to travel very well.  A fan base that travels is almost by definition incredibly passionate.

This one has a bit of a muddled interpretation.  If a team has great road attendance is it because the fans are following the team or because they have a national following?  In other words, do the local fans travel or does a team with high road attendance have a national following?  When the Steelers turned the Georgia Dome Yellow and Black was it because Steelers fans came down from Pittsburgh or because the Steelers have fans everywhere?

Furthermore, if it is a national following is it because the team is popular across the country or because a lot of folks have moved from places like Pittsburgh or Buffalo to the Sun Belt?  A national following is a great characteristic that might suggest that a team’s brand is on an upswing.  Or it might be that the city itself is on a downward trajectory.

 

 

MLB Fan Base and Brand Rankings 2017

MLB Fandom Report 2017: The “Best” Fans in Baseball – Rough Draft

Who has the best fans in Major League Baseball?  What are the best brands in MLB? These are simple questions without simple answers.  What makes for a great fan or brand?  Fans that show up even when the team is losing?  Fans that are willing to pay the most?  Fans that are willing to follow a team on the road or social media?

Even after we agree on the question(s), answering it is also a challenge.  How do we adjust for the fact that one team might have gone on a miraculous run that filled the stadium?  Or perhaps another team suffered a slew of injuries?  How do we compare fan behavior in a market like New York with fans in a place like Milwaukee?  What if a team just opened a new stadium?

My approach to evaluating fan bases is to use data to develop statistical models of fan interest (more details here).  The key is that these models are used to determine which city’s fans are more willing to spend or follow their teams after controlling for factors like market size and short-term variations in performance.

This year’s overall rankings are based on three sub-rankings.  In past years, two measures of engagement have been featured: Fan Equity and Social Media Equity.  Fan Equity focuses on home box office revenues (support via opening the wallet) and Social Media Equity focuses on fan willingness to engage as part of a team’s community (support exhibited by joining social media communities).  This year I am adding a third measure – Road Equity.  Road Equity focuses on how teams draw on the road after adjusting for team performance.   These metrics provide a balance – a measure of willingness to spend, a measure unconstrained by stadium size and a measure of national appeal.

To get at an overall ranking, I’m going to use the simplest method possible.  We are just going to average across the three metrics.
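The averaging step can be sketched in a few lines of Python.  This is only an illustration: the team labels and rank values are made-up placeholders rather than the actual 2017 results, and the function name and the alphabetical tie-break are my own assumptions.

```python
from statistics import mean

def overall_ranking(sub_ranks):
    """Average each team's sub-rankings and sort from best (lowest) to worst."""
    averages = {team: mean(ranks.values()) for team, ranks in sub_ranks.items()}
    # Lower average rank is better; ties are broken alphabetically (an assumption).
    return sorted(averages.items(), key=lambda item: (item[1], item[0]))

# Hypothetical placeholder ranks, NOT the real 2017 numbers.
example = {
    "Team A": {"fan": 1, "social": 3, "road": 2},
    "Team B": {"fan": 2, "social": 1, "road": 1},
    "Team C": {"fan": 3, "social": 2, "road": 3},
}

for team, avg in overall_ranking(example):
    print(f"{team}: average rank {avg:.2f}")
```

A simple average weights the three measures equally, which is the point of keeping the method as simple as possible.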

Today’s post is focused on MLB but if you are interested you can see last year’s NBA fan rankings here and this year’s NFL rankings will be posted soon.

The Winners

Overall, the group of clubs that comprise the Top 5 contains little in the way of surprises.  The Yankees rank number one and are followed by the Cubs, Red Sox, Giants and Dodgers.  The Yankees “win” because they draw fans (usually top 5) and charge high prices even when on-field results dip.  The Yankees are also a great attraction on the road and have an enormous social media following.

In general, the clubs at the top of the list share these same traits.  They are all able to motivate fans to attend and spend as they all possess great attendance numbers and relatively high prices.  More to the point, these teams are even able to draw well and command price premiums when they are not winning.  The Cubs are the best example of this.

The list of winners probably raises an issue of “large” market bias.  However, keep in mind that the methodology is designed to control for home market effects.  The method is explicitly designed to control for differences in market demographics (and team performance).  While the “winners” tend to come from the bigger and more lucrative markets, other major market teams do not fare particularly well (see below).

The Laggards

The bottom of the list features the Marlins, Indians, Athletics, Angels and White Sox.  It is interesting that the bottom also includes teams from major markets such as LA, Chicago and Miami.

The Marlins’ finish is a reflection of how the team struggles on multiple dimensions.  Attendance is often in the bottom 5 of the league despite the team being located in a major metro area.  Pricing is also below average for MLB.  Cleveland also struggles on these metrics but given the advantages of the Miami market, the Marlins’ relative performance is just a bit worse.

From a branding perspective it is not surprising that we see one dominant brand in the cities with two clubs.  Being a sports fan is about being part of a community.  Many fans are drawn to the bigger and more dominant community – Yankees, Cubs or Dodgers rather than the Mets, White Sox or Angels.  The A’s probably suffer from a similar set of problems as they compete against the Giants in the Bay Area.

The Complete List

The complete list follows.  In addition to the overall ranking of fan bases, I also report rankings on the social and road measures.  Following the table, I provide a bit more detail regarding each of the metrics.

The Details

Fan Equity

The Winners: Red Sox, Yankees and Cardinals

The Losers: Mets, Indians and Marlins

Fan Equity looks at home revenues relative to expected revenue based on team performance and market characteristics.  The goal of the metric is to measure over (or under) performance relative to other teams in the league.  In other words, statistical models are used to create an apples-to-apples type comparison to avoid distortions due to long-term differences in market size or short-term differences in winning rates.

In terms of business concepts, this measure is similar to a “revenue premium” measure of brand equity.  It captures the differentials in fans’ willingness to financially support teams of similar quality.  From a business or marketing perspective this is a gold standard of metrics as it directly relates to how a strong brand translates to revenues and profits.
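To make the “revenue premium” idea concrete, here is a minimal sketch.  It assumes a plain linear regression of home revenue on winning percentage and market size, with the residual treated as the premium.  The actual models behind the rankings are not spelled out in this post, so the specification, the helper names and all of the numbers below are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols(rows, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data: (team, win pct, metro population in millions, home revenue in $M).
teams = [
    ("A", 0.55, 8.0, 260.0),
    ("B", 0.60, 2.8, 210.0),
    ("C", 0.45, 6.0, 150.0),
    ("D", 0.50, 1.6, 120.0),
    ("E", 0.40, 4.5, 140.0),
    ("F", 0.58, 3.2, 170.0),
]

X = [[1.0, win, pop] for _, win, pop, _ in teams]  # intercept, performance, market size
revenue = [rev for *_, rev in teams]
beta = ols(X, revenue)

# The premium is actual revenue minus what the controls alone would predict.
premium = {
    name: rev - sum(b * x for b, x in zip(beta, row))
    for (name, _, _, rev), row in zip(teams, X)
}
for name, value in sorted(premium.items(), key=lambda kv: kv[1], reverse=True):
    print(f"Team {name}: revenue premium {value:+.1f}")
```

Sorting the residuals from largest to smallest gives a ranking in the spirit of Fan Equity: a team near the top earns more than its performance and market would predict.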

However, the context is sports, and that does make things different.  At a basic level sports organizations have dual objectives.  They care about winning and profit.  That is important because some teams may not be trying to maximize revenues.  Perhaps the team is trying to build a fan base by keeping prices low.  If this is the case, the Fan Equity metric understates the engagement of fans.

The Cardinals are the big story in terms of fan equity.  St. Louis is a unique baseball town, with amazingly supportive fans for a market of its size.  The Cardinals just fall short on the other more national metrics.

Social Media Equity

Winners: Blue Jays, Braves, and Yankees

Losers: Mariners, A’s and Nationals

Social Media Equity is also an example of a “premium” based measure of brand equity.  It differs from Fan Equity in that it focuses on how many fans a team has online rather than fans’ willingness to pay higher prices.  Similar to the Fan Equity metric, Social Media Equity is also constructed using statistical models that control for performance and market differences.  Social Media Equity is more about potential.  I think that social equity is an indicator of what can be built, but teams still have to win to make the conversion.

In terms of business application, the social media metric has several implications both on its own merits and in conjunction with the Fan Equity measure.  For example, the lack of local constraints means that the Social Equity measure is more of a national level measure.  The Fan Equity metric focuses on local box office revenues.  In contrast, the social metric provides insight into how a team’s fandom extends beyond a metro area.

Social Media Equity may also serve as a leading indicator of a team’s future fortunes.  For a team to grow revenues it is often necessary to implement controversial price increases.  Convincing fans to sign expensive contracts to buy season tickets can also be a challenge.  Increasing prices and acquiring season ticket holders can take time while social media communities can grow quickly.  Social community size has been found to be positively correlated with future revenue growth.

A comparison of Fan Equity and Social Media Equity can be useful.  If Social Media Equity exceeds Fan Equity it is evidence that the team has some marketing potential that is not being exploited.  For example, one issue that is common in sports is that it is difficult to estimate the price elasticity of demand because demand is often highest for the best teams and best seats.  The unconstrained nature of social media can provide an important data point for assessing whether teams have additional pricing flexibility.

This is an interesting list of winners.  My guess is that the Braves and Blue Jays are on the upswing as brands.  For the teams at the bottom – it’s a concerning situation.  These teams don’t seem to be capturing the next generation.

Road Equity

Winners: Yankees, Dodgers and Cubs

Losers: Marlins, White Sox and Indians

This is a new metric for the blog.  One way to look at fan quality is to look at how a team draws on the road.  In the NBA these effects are pronounced.  LeBron or a retiring Kobe coming to town can often lead to sellouts.  At the college level some teams are known to travel very well.  A fan base that travels is almost by definition incredibly passionate.

This one has a bit of a muddled interpretation.  If a team has great road attendance, is it because the fans are following the team or because the team has a national following?  If the Yankees play the Rays and attendance spikes, is it because Yankees fans travel or because Tampa residents come out to see the Yankees?

The winners on this list are no surprise.  One reason I like this metric is that it is consistent with the conventional wisdom.  It has tons of face validity.

At the bottom of the rankings we have the Marlins, Indians and White Sox.  These seem to be struggling brands that lack local and national appeal.

 

 

NBA Fan Rankings: 2016 Edition

On an (almost) annual basis I present rankings of fan bases across major professional and collegiate leagues.  Today it is time for the NBA.   First, the winners and losers in this year’s rankings.  At the top of the list we have the Knicks, Lakers and Bulls. This may be the trifecta of who the league would love to have playing at Christmas and in the Finals.  At the bottom we have the Grizzlies, Nets and Hornets.

Before I get into the details it may be helpful to briefly mention what differentiates these rankings from other analyses of teams and fans.  My rankings are driven by statistical models of how teams perform on a variety of marketing metrics.  The key insight is that these models allow us to control for short-run variation in team performance and permanent differences in market potential.  In other words – the analysis uses data to identify engagement or passion (based on attend and spend) beyond what is expected based on how a team is performing and where the team is located.  More details on the methodology can be found here.

The Winners

This year’s list contains no real surprises.  The top five teams are all major market teams with storied traditions.  The top fan base belongs to the Knicks.  The Lakers, Bulls, Heat and Celtics follow.  The Knicks highlight how the model works.  While the Knicks might not be winning, Knicks fans still attend and spend.

The number two team on the list (the Lakers) is in much the same situation: a dominant brand with a struggling on-court product.  The Lakers and Clippers are an interesting comparison.  Last season, the Clippers did just a bit better in terms of attendance (100.7% versus 99.7%).  But the Lakers filled their seats with an average ticket price that was substantially higher.  The power of the Laker brand is shown in this comparison because these outcomes occurred in a season where the Clippers won many more games.

Why are the Lakers still the bigger draw?  Is this a star (Kobe) effect?  Probably in part, but fan loyalty is something that evolves over time.  The Lakers have the championships, tradition and therefore the brand loyalty.  It will be interesting to see how much equity is retained long-term if the team is unable to quickly reload.  The shared market makes this an interesting story to watch. I suspect that the Lakers will continue to be the stronger brand for quite a while.

The Losers

At the bottom of the list we have Memphis, Brooklyn and Charlotte.  The interesting one in this group is Brooklyn.  Why do the Nets rank poorly?  It ends up being driven by the relative success of the Knicks versus the Nets.  The Knicks have much more pricing power while the teams operate in basically the same market (we can debate this point).  According to ESPN, the Knicks drew 19,812 fans (100% of capacity) while the Nets filled 83.6% of their building.  The Knicks also command much higher ticket prices.  And while the Nets were worse (21 victories) the Knicks were far from special (32 wins).

What can the teams at the bottom of the list do?  When you go into the data and analyze what drives brand equity the results are intuitive.  Championships, deep playoff runs and consistent playoff appearances are the key to building equity.  Easy to understand but tough to accomplish.

And a Draw

An interesting aside in all this is what it means for the league.  The NBA has long been a star and franchise driven league.  In the 1980s it was about the Lakers (Magic) and Celtics (Bird).  In the 1990s it was Michael Jordan and the Bulls.  From there we shifted into Kobe and LeBron.

On one hand, the league might be (even) stronger if the top teams were the Bulls, Knicks and Lakers.  On the other hand, the emergence of Steph Curry and Golden State has the potential to help build another powerful brand.

Some more thoughts…

The Fan Equity metric is just one possible means for assessing fan bases.  In this year’s NFL rankings I reported several more analyses that focus on different market outcomes.  These were social media following, road attendance and win sensitivity (bandwagon fans).  Looking at social following tells us something about the future of the brand as it (broadly) captures fan interest of a younger demographic.  Road Attendance tells us something about national rather than local following.  These analyses also use statistical models to control for market and team performance effects.

Social Equity

Top Social Equity Team: The Lakers

Bottom Social Equity: The Nets

Comment: The Lakers are an immensely strong brand on many dimensions.  The Nets are a mid-range brand when you look at raw numbers.  But they suffer when we account for them operating in the NY market.

Road Equity

Top Road Equity: The Lakers

Bottom Road Equity: Portland

Comment: The Lakers dominate.  And as this analysis was done looking at fixed effects across 15 years it is not solely due to Kobe Bryant.  Portland does well locally but is not of much interest nationally.
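For readers curious what “looking at fixed effects” can mean in practice, here is a minimal sketch.  It assumes a linear model with one performance control, road attendance = team effect + beta × winning percentage, estimated by demeaning within each team.  The real analysis likely includes more controls.  The function name and the noise-free data are hypothetical and exist only so the result is easy to check.

```python
from collections import defaultdict
from statistics import mean

def team_fixed_effects(records):
    """Estimate y = alpha_team + beta * x with the within (demeaning) estimator.

    records: iterable of (team, x, y) tuples, one per team-season.
    Returns (beta, {team: alpha}).
    """
    by_team = defaultdict(list)
    for team, x, y in records:
        by_team[team].append((x, y))

    # Demean x and y within each team so the team intercepts drop out.
    num = den = 0.0
    means = {}
    for team, obs in by_team.items():
        x_bar = mean(x for x, _ in obs)
        y_bar = mean(y for _, y in obs)
        means[team] = (x_bar, y_bar)
        for x, y in obs:
            num += (x - x_bar) * (y - y_bar)
            den += (x - x_bar) ** 2
    beta = num / den

    # Each team's fixed effect is its average outcome net of the performance effect.
    alphas = {team: y_bar - beta * x_bar for team, (x_bar, y_bar) in means.items()}
    return beta, alphas

# Hypothetical team-seasons: (team, win pct, average road attendance in thousands).
# Built without noise from alpha = {X: 18, Y: 16} and beta = 4.
records = [
    ("X", 0.30, 19.2), ("X", 0.50, 20.0), ("X", 0.70, 20.8),
    ("Y", 0.40, 17.6), ("Y", 0.60, 18.4), ("Y", 0.80, 19.2),
]
beta, alphas = team_fixed_effects(records)
print(f"win effect: {beta:.2f}")
for team, alpha in sorted(alphas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"team {team}: fixed effect {alpha:.2f}")
```

Because the team effect is estimated across all of a team’s seasons, it reflects a persistent draw rather than a single star or a single good year.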

It is possible to do even more.  We can even look at factors such as win or price sensitivity. Win sensitivity (or bandwagon behavior) tells us whose fans only show up when a team is winning and price sensitivity tells us if a fan base is willing to show up when prices go up.  I’m skipping these latter two analyses today just to avoid overkill (available upon request).  The big message is that we can potentially construct a collection of metrics that provide a fairly comprehensive and deep understanding of each team’s fan base and brand.

Note: I have left one team off the list.  I have decided to stop reporting the local teams (Emory is in Atlanta).  The local teams have all been great to both me and the Emory community.  This is just a small effort to eliminate some headaches for myself.

Finally… The complete list

City             Fan Equity Rank
Boston           5
Charlotte        27
Chicago          3
Cleveland        20
Dallas           15
Denver           11
Detroit          25
Golden State     16
Houston          7
Indiana          21
LA Clippers      17
LA Lakers        2
Memphis          29
Miami            4
Milwaukee        14
Minnesota        22
Brooklyn         28
New Orleans      24
NY Knicks        1
Oklahoma City    13
Orlando          19
Philadelphia     26
Phoenix          9
Portland         6
Sacramento       10
San Antonio      12
Toronto          18
Utah             8
Washington       23
 

Analytics, Trump, Clinton and the Polls: Sports Analytics Series Part 5.1

Recent presidential elections (especially 2008 and 2012) have featured heavy use of analytics by candidates and pundits.  The Obama campaigns were credited with using micro targeting and advanced analytics to win elections. Analysts like Nate Silver were hailed as statistical gurus who could use polling data to predict outcomes.  In the lead up to this year’s contest we heard a lot about the Clinton campaign’s analytical advantages and the election forecasters became regular parts of election coverage.

Then Tuesday night happened.  The polls were wrong (by a little) and the advanced micro targeting techniques didn’t pay off (enough).

Why did the analytics fail?

First, the polls and the election forecasts (I’ll get to the value of analytics next week).  As background, commentators tend to not truly understand polls.  This creates confusion because commentators frequently over- and misinterpret what polls are saying.  For example, whenever “margin of error” is mentioned they tend to get things wrong.  A poll’s margin of error is based on sample size.  The common journalistic error is to apply the 3% or 4% margin of error of an individual poll to a collection of polls.  When looking at an average of many polls the “margin of error” is much smaller because the “poll of polls” has a much larger sample size.  This is a key point because when we think about the combined polls it is even more clear that something went wrong in 2016.
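The sample size point is easy to check with the textbook formula for the margin of error of a proportion.  The sketch below assumes simple random samples, a 95% confidence level and ten polls of 1,000 respondents each.  Those poll counts and sizes are illustrative assumptions, not figures from any actual 2016 polls.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)

single_poll = margin_of_error(1_000)
# Pooling assumes the polls are independent simple random samples of the same
# population, which real polls (different methods, shared biases) only approximate.
poll_of_polls = margin_of_error(10 * 1_000)

print(f"one poll of 1,000: +/- {single_poll:.1%}")
print(f"ten pooled polls:  +/- {poll_of_polls:.1%}")
```

Under these assumptions the margin falls from roughly three points to about one.  Sampling error shrinks with pooling, but any bias shared across pollsters does not, which is why a miss by the combined polls points to a systematic problem.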

Diagnosing what went wrong is complicated by two factors.  First, it should be noted that because every pollster does things differently we can’t make blanket statements or talk in absolutes.  Second, diagnosing the problem requires a deep understanding of the statistics and assumptions involved in polling.

In the 2016 election my suspicion is that two things went wrong.  As a starting point – we need to realize that polls include strong implicit assumptions about the nature of the underlying population and about voter passion (rather than preference).  When these assumptions don’t hold the polls will systematically fail.

First, most polls start with assumptions about the nature of the electorate.  In particular, there are assumptions about the base levels of Democrats, Republicans and Independents in the population.  Very often the difference between polls relates to these assumptions (LA Times versus ABC News).

The problem with assumptions about party affiliation in an election like 2016 is that the underlying coalitions of the two parties are in transition.  When I grew up the conventional wisdom was that the Republicans were the wealthy, the suburban professionals, and the free trading capitalists while the Democrats were the party of the working man and unions.  Obviously these coalitions have changed.  My conjecture is that pollsters didn’t sufficiently re-balance.  In the current environment it might make sense to place greater emphasis on demographics (race and income) when designing sampling segments.

The other issue is that more attention needs to be paid towards avidity / engagement / passion (choose your own marketing buzz word).  Polls often differentiate between likely and registered voters.  This may have been insufficient in this election.  If Clinton’s likely voters were 80% likely to show up and Trump’s were 95% likely then having a small percentage lead in a preference poll isn’t going to hold up in an election.
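A quick worked example shows how a preference lead can flip once turnout is taken into account.  The 80% and 95% turnout rates are the ones used above.  The 48% to 45% preference split and the function name are hypothetical choices for illustration, not actual poll results.

```python
def expected_vote_share(preference, turnout):
    """Convert poll preference shares into shares of votes actually cast."""
    votes = {c: preference[c] * turnout[c] for c in preference}
    total = sum(votes.values())
    return {c: v / total for c, v in votes.items()}

# Turnout rates come from the text; the 48/45 preference split is a
# hypothetical illustration, not an actual 2016 poll result.
preference = {"Clinton": 0.48, "Trump": 0.45}
turnout = {"Clinton": 0.80, "Trump": 0.95}

shares = expected_vote_share(preference, turnout)
for candidate, share in shares.items():
    print(f"{candidate}: {share:.1%} of votes cast")
```

With these inputs a three point lead in preference becomes a deficit among the votes actually cast, which is the sense in which preference alone is an incomplete dependent measure.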

The story of the 2016 election should be something every analytics professional understands.  From the polling side the lesson is that we need to understand and question the underlying assumptions of our model and data.  As the world changes do our assumptions still hold?  Is our data still measuring what we hope it does?  Is a single dependent measure (preference versus avidity in this case) enough?

Moving towards Modeling & Lessons from Other Arenas: Sports Analytics Series Part 5

The material in this series is derived from a combination of my experiences in sports applications and my experiences in customer analysis and database marketing.  In many respects, the development of an analytics function is similar across categories and contexts.  For instance, a key issue in any analytics function is the design and creation of an appropriate data structure.  Creating or acquiring the right kinds of analytics capabilities (statistical skills) is also a common need across industries.

A need to understand managerial decision making styles is also common across categories.  It’s necessary to understand both the level of interest in using analytics and also the “technical level” of the decision makers.  Less experienced data scientists and statisticians have a tendency to use methods that are too complicated.  This can be a killer.  If the models are too complex they won’t be understood and then they won’t be used.  Linear regression with perhaps a few extensions (fixed effects, linear probability models) is usually the way to go.  Because sports organizations have less history in terms of using analytics the issue of balancing complexity can be especially challenging.

A key distinction between many sports and marketing applications is the number of variables versus the number of observations.  This is an important point of distinction between sports and non-sports industries and it is also an important issue for when we shift to discussing modeling in a couple of weeks.  When I use the term variables I am referencing individual elements of data.  For example, an element of data could be many different things such as a player’s weight or the number of shots taken or the minutes played.  We might also break variables into the categories of dependent variables (things to explain) versus independent variables (things to explain with).  When I use the term observations I am talking about “units of analysis” like players or games.

In many (most) business contexts we have many observations.  A large company may have millions of customer accounts.  There may, however, be relatively few explanatory variables.  The firm may have only transaction history variables and limited demographics.  Even in sports marketing a team interested in modeling season ticket retention may only have information such as the number of tickets previously purchased, prices paid and a few other data points.  In this same example the team may have tens of thousands of season ticket holders.  If we think of this “information” as a database we would have a row for every customer account (thousands of rows) and perhaps ten or twenty columns of variables related to each customer (past purchases and marketing activities).

One trend is that the number of explanatory variables is expanding in just about every category. In marketing applications we have much more purchase detail and often expanded demographics and psychographics.  However, the ratio of observations to columns usually still favors the observations.

In sports we (increasingly) face a very different data environment, especially in player selection tasks like drafting or free agent signings.  The issue in player selection applications is that there are relatively few player level observations.  In particular, when we drill down into specific positions we often find ourselves having only tens or hundreds of player histories (depending on how far back we want to go with the data).  In contrast, we may have an enormous number of variables per player.

We have historically had many different types of “box score” type stats but now we have entered into the era of player tracking and biometrics.  Now we can generate player stats related to second-by-second movement or even detailed physiological data.  In sports ranging from MMA to soccer to basketball the number of variables has exploded.
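As one illustration of why the ratio of observations to variables matters, the sketch below fits a linear model with as many coefficients as players to data that is pure random noise.  Everything here is made up for the demonstration (eight players, an intercept plus seven meaningless “stats”, a fixed random seed).  It is a sketch of the problem rather than of any solution.

```python
import random

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

rng = random.Random(0)
n_players = 8      # few observations, as in a position-specific draft sample
n_variables = 8    # as many coefficients as players (intercept + 7 made-up stats)

def random_player():
    return [1.0] + [rng.gauss(0, 1) for _ in range(n_variables - 1)]

# Both the "stats" and the "outcomes" are pure noise: there is nothing real to learn.
X = [random_player() for _ in range(n_players)]
y = [rng.gauss(0, 1) for _ in range(n_players)]

coefs = solve(X, y)
fitted = [sum(c * v for c, v in zip(coefs, row)) for row in X]
worst_in_sample = max(abs(f - t) for f, t in zip(fitted, y))
print(f"largest in-sample error: {worst_in_sample:.2e}")
```

The model reproduces every outcome essentially exactly even though the inputs carry no information.  A perfect in-sample fit tells us nothing once the number of variables catches up with the number of players.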

A big question as we move forward into more modeling oriented topics is how do we deal with this situation?