The use of analytics in sports personnel decisions such as drafting and free agency signings is a topic with obvious popular appeal. Sports personnel decisions are fundamentally about how people will perform in the future. These are also tough, complex, high-risk decisions that are fodder for talk radio and second-guessing from just about everyone.
So how can we make these decisions? As we noted in our last post, the choice between using analytics versus using the “gut” is probably a decision that doesn’t need to be made. Analytics and data should have a role. The question is how much emphasis should be placed on the “models” and how much on the intuition of the “experts.”
In this second installment of the series, we begin the process of going deeper into the mechanics and challenges involved in leveraging data and building models to support personnel decisions. As a backdrop for this discussion, we are going to tell the story of a project we helped a group of Emory students complete for the WNBA’s Atlanta Dream. Going into detail about this process should illuminate a couple of things. First, there is logic to how these types of analyses can best be structured. Second, a careful and systematic discussion of a project may clarify both the weaknesses and strengths of “Moneyball” type approaches to decision making.
To begin, we want to thank the Dream. This was a great project that the students loved, and it gave us an opportunity to think about the challenges in modeling draft prospects in a whole new arena. An early step in any analytics project is the building of the data infrastructure. For the WNBA, this was a challenge. Storehouses of sports data come from all sorts of places but they often start out as projects driven more by fan passion than any formal effort from an established organization. Baseball is probably the gold standard for information with detailed data going back a century. In contrast, for women’s professional and college basketball the information is comparatively sparse. There’s not a lot and it doesn’t go back very far.
After some searching (with a lot of great assistance from the Dream) we were able to identify information sources for both professional and collegiate stats. As we started to assemble databases a few things became apparent:
- First, the data available was nowhere near as detailed as what could be found for the men’s game. We were limited to season level stats at both the pro and college level. Furthermore, all we had were the basics – the data in box scores. This is good information, but it does leave the analyst wanting more.
- Second, the data fields on professional performance were not identical to the data on collegiate performance. For example, the pro level data breaks rebounds down into offensive and defensive boards. Maybe this is a big deal and maybe not. It does make it difficult to use established metrics that place different value on the two types of rebounds.
- Third, there was a LOT of missing data, and multiple types of missing data. In terms of player statistics, information on turnovers was at best scarce. Again, this makes it difficult to use established metrics like PER. The other thing that was missing is players themselves. We never were able to create a repository of data on international players that didn’t participate in NCAA basketball. As a side note, even if we had found international data it would be hard to interpret. How would we judge the importance of a rebound in Europe versus a rebound in South America? This isn’t just a problem for women’s basketball as this is also an issue in any global sport.
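To make the second and third points concrete, here is a minimal sketch of the kind of cleanup involved. It is not the code used in the project. The field names (`oreb`, `dreb`, `reb`, `tov`), the sample rows, and the assumption that the college data reports only total rebounds are all hypothetical and chosen for illustration.

```python
# Hypothetical season-level rows; the players and numbers are made up.
pro_rows = [
    {"player": "A", "oreb": 40, "dreb": 110, "tov": 55},
    {"player": "B", "oreb": 12, "dreb": 60, "tov": None},
]
college_rows = [
    {"player": "A", "reb": 210, "tov": None},
    {"player": "C", "reb": 95},  # turnover field absent entirely
]


def harmonize_rebounds(row):
    """Return a copy of a row that carries a single total-rebounds field."""
    out = dict(row)
    if "reb" not in out:
        oreb = out.pop("oreb", None)
        dreb = out.pop("dreb", None)
        # Leave the total missing rather than guessing when a piece is absent.
        out["reb"] = None if oreb is None or dreb is None else oreb + dreb
    return out


def missing_share(rows, field):
    """Fraction of rows where the field is absent or None."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(field) is None)
    return missing / len(rows)


# Put both levels on the same schema, then audit how sparse turnovers are.
combined = [harmonize_rebounds(r) for r in pro_rows + college_rows]
tov_missing = missing_share(combined, "tov")
```

Collapsing to total rebounds keeps the two levels comparable, but it throws away the offensive and defensive split, which is exactly why metrics that weight the two differently become hard to use.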
There were also a lot of things that we would have liked to have had. Some of this may have been available, and maybe we did not look hard enough. But we always need to ask the question of the incremental value versus the required effort. For example, information on players’ physical traits was very limited. We could obtain height but even basics like weight were difficult to find. And as far as we know – there is no equivalent to the NFL combine.
While these might seem like severe limitations, we think it’s really just par for the course in this type of research. Especially in the first go around! In analytics, you often work with what you have and you try to be clever in order to get the most from the data. We will get to how to approach this type of problem soon. But even with the limitations, we actually have a LOT of data. At the college level we have 4 years of data on games played, field goals made, field goals attempted, rebounds, steals, 3 pointers, etc. If we have 15 data fields for 4 years we have 60 statistics per player. Add in data on height, strength of schedule and assorted miscellaneous fields and we have maybe 70 pieces of data per player. And maybe we want to combine pieces of information, for example multiplying points per game by strength of schedule to get a measure that accounts for the greater difficulty of scoring in the ACC versus a lower tier conference. So maybe we end up with 100 variables we want to investigate.
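The back-of-the-envelope count above can be written out as a short sketch. The 15 fields and 4 seasons come from the discussion. The figure of 10 extra fields is our reading of "maybe 70", and the function name, the strength-of-schedule scale, and the example values are assumptions made up for illustration.

```python
FIELDS_PER_SEASON = 15   # box score fields per college season (from the text)
COLLEGE_SEASONS = 4      # seasons of college data per player (from the text)
EXTRA_FIELDS = 10        # assumed: height, strength of schedule, misc.

season_stats = FIELDS_PER_SEASON * COLLEGE_SEASONS   # 60 statistics
raw_total = season_stats + EXTRA_FIELDS              # roughly 70 pieces of data


def schedule_adjusted(points_per_game, strength_of_schedule):
    """Scale scoring by schedule difficulty (a simple product of the two)."""
    return points_per_game * strength_of_schedule


# Hypothetical prospect: 18.0 points per game against a schedule rated 1.10,
# where 1.00 would stand for an average schedule (an assumed scale).
adjusted_ppg = schedule_adjusted(18.0, 1.10)
```

Every combined measure like this adds another candidate variable, which is how 70 raw fields can grow toward 100 fairly quickly.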
Why are we discussing how many fields we have per prospect? Because it brings us to our next problem – the relatively small number of observations in most sports contexts. Remember the basic game in this analysis is to understand “what” predicts a successful pro career. This means that we need observations on successful and less successful pro careers.
The WNBA consists of twelve teams with rosters of twelve players. This means if we go back and collect a few years of data we are looking at just a couple hundred players with meaningful professional careers. While this may seem like a sizeable amount of data, to the data scientist this is almost nothing. Our starting point is trying to relate professional career performance to college data, which in this case means maybe two hundred pro careers to be explained by potentially about a hundred explanatory variables.
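A rough sketch of why this ratio is worrying, using the round numbers above. The second part relies on a standard textbook result rather than anything from our project: if a least squares regression with an intercept is fit to predictors that are pure noise (with normally distributed errors), the expected in-sample R-squared is p / (n - 1). The function name is ours.

```python
TEAMS = 12
ROSTER_SIZE = 12
roster_spots = TEAMS * ROSTER_SIZE          # roster spots in a single season

n_careers = 200      # approximate pro careers available (from the text)
n_variables = 100    # approximate candidate college variables (from the text)

careers_per_variable = n_careers / n_variables


def expected_noise_r2(n_obs, n_predictors):
    """Expected in-sample R-squared when every predictor is pure noise.

    Assumes ordinary least squares with an intercept and normal errors.
    Once predictors are at least n_obs - 1, the fit is perfect.
    """
    if n_predictors >= n_obs - 1:
        return 1.0
    return n_predictors / (n_obs - 1)


# With 200 careers and 100 meaningless variables, a regression would still
# appear to "explain" about half of the variation in career outcomes.
noise_r2 = expected_noise_r2(n_careers, n_variables)
```

In other words, throwing all hundred variables into one model would produce an impressive looking fit even if college stats told us nothing, so the variables have to be used more carefully than that.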
It really is a weird starting point. We have serious limitations on the explanatory data available, but we also wish the ratio of observations (players) to explanatory data fields was higher. In our next installment, we will start to talk about what we are trying to predict (measures of pro career success). Following that, we will talk about how to best use our collection of explanatory variables (college stats).
Mike Lewis & Manish Tripathi, Emory University 2015.