The material in this series is derived from a combination of my experiences in sports applications and my experiences in customer analysis and database marketing. In many respects, the development of an analytics function is similar across categories and contexts. For instance, a key issue in any analytics function is the design and creation of an appropriate data structure. Creating or acquiring the right kinds of analytics capabilities (statistical skills) is also a common need across industries.
A need to understand managerial decision making styles is also common across categories. It’s necessary to understand both the decision makers’ level of interest in using analytics and their “technical level.” Less experienced data scientists and statisticians have a tendency to use overly complicated methods. This can be a killer. If the models are too complex they won’t be understood, and then they won’t be used. Linear regression, perhaps with a few extensions (fixed effects, linear probability models), is usually the way to go. Because sports organizations have less history of using analytics, the issue of balancing complexity can be especially challenging.
A key distinction between many sports and marketing applications is the number of variables versus the number of observations. This will also be an important issue when we shift to discussing modeling in a couple of weeks. When I use the term variables I am referring to individual elements of data. For example, a variable could be a player’s weight, the number of shots taken or the minutes played. We might also break variables into the categories of dependent variables (things to explain) and independent variables (things to explain with). When I use the term observations I am talking about “units of analysis” like players or games.
In many (most) business contexts we have many observations. A large company may have millions of customer accounts. There may, however, be relatively few explanatory variables. The firm may have only transaction history variables and limited demographics. Even in sports marketing, a team interested in modeling season ticket retention may only have information such as the number of tickets previously purchased, prices paid and a few other data points. In this same example the team may have tens of thousands of season ticket holders. If we think of this “information” as a database, we would have a row for every customer account (tens of thousands of rows) and perhaps ten or twenty columns of variables related to each customer (past purchases and marketing activities).
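To make the row and column picture concrete, here is a minimal sketch of what such a table might look like in Python. The column names and the three rows are invented purely for illustration (they are assumptions, not data from any team); a real season ticket file would have thousands of rows and perhaps ten or twenty columns.

```python
from dataclasses import dataclass, fields


@dataclass
class SeasonTicketAccount:
    # Hypothetical columns, loosely following the examples in the text.
    account_id: int
    tickets_last_season: int
    avg_price_paid: float
    years_as_holder: int
    renewed: bool  # dependent variable: the thing to explain


# Three invented rows for illustration only.
accounts = [
    SeasonTicketAccount(1, 2, 55.0, 3, True),
    SeasonTicketAccount(2, 4, 80.0, 1, False),
    SeasonTicketAccount(3, 2, 42.5, 7, True),
]

# Each row is an observation; each column other than the ID and the
# outcome is an independent (explanatory) variable.
column_names = [f.name for f in fields(SeasonTicketAccount)]
independent = [c for c in column_names if c not in ("account_id", "renewed")]

n_observations = len(accounts)
n_variables = len(independent)
print(n_observations, "observations x", n_variables, "independent variables")
```

The point of the layout is simply that the observation count is the number of rows and the variable count is the number of columns, so the two can be compared directly.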
One trend is that the number of explanatory variables is expanding in just about every category. In marketing applications we have much more purchase detail and often expanded demographics and psychographics. However, the ratio of observations to variables usually still favors the observations.
In sports we (increasingly) face a very different data environment, especially in player selection tasks like drafting or free agent signings. The issue in player selection applications is that there are relatively few player level observations. In particular, when we drill down into specific positions we often find ourselves having only tens or hundreds of player histories (depending on how far back we want to go with the data). In contrast, we may have an enormous number of variables per player.
We have historically had many different types of “box score” stats, but now we have entered the era of player tracking and biometrics. Now we can generate player stats related to second-by-second movement or even detailed physiological data. In sports ranging from MMA to soccer to basketball the number of variables has exploded.
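A small simulation can show why the balance between variables and observations matters. The sketch below is my own illustration under stated assumptions, not a method from this series: it uses eight hypothetical "players" and eight hypothetical "tracking stats," all filled with pure random noise, and solves for the coefficients that link the stats to an equally random outcome. The helper `solve_square` is a plain Gaussian elimination routine written for this example.

```python
import random


def solve_square(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Pick the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


rng = random.Random(0)
n_players = 8    # hypothetical: few player histories at one position
n_variables = 8  # hypothetical: as many tracking stats as players

# Every number here is random noise; no variable has any real link
# to the outcome.
X = [[rng.gauss(0, 1) for _ in range(n_variables)] for _ in range(n_players)]
y = [rng.gauss(0, 1) for _ in range(n_players)]

coefs = solve_square(X, y)
fitted = [sum(c * v for c, v in zip(coefs, row)) for row in X]
max_abs_residual = max(abs(f - t) for f, t in zip(fitted, y))
print(f"largest in-sample error: {max_abs_residual:.2e}")
```

Once there are as many variables as observations, a linear model can reproduce the observed outcomes essentially exactly even though the inputs are meaningless. A perfect in-sample fit in this setting therefore tells us nothing about how well the model would evaluate the next player.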
A big question as we move forward into more modeling-oriented topics is how to deal with this situation.