You Still Play The Games. Sports, Math, and the Politics of Machine Learning.
Aside from College Football Playoff rankings time, there is no better time to cram your RAM with sports than March Madness. Generally, there are two ways to do this: the easy, fun way (true seeds) and the hard way (machine learning). If you want to get ahead this year, I suggest you use a few simple mathematical tricks to produce your bracket. Let’s dive into the tricks…
On the ground level, the bracket is a snake, moving counter-clockwise for odd seeds and clockwise for even seeds. Thus the top bracket should include teams 1, 8, 9, 16, 17, and so on, while the bracket directly below should include 4, 5, 12, 13, etc. Of course, you are now dismissing me because you know that’s not how it works, and that is the point of this method. We know that teams get moved around. Most years this is a big deal, but as the NCAA has noted, because the tournament is in Indiana, they will stay closer to the S. Generally, they avoid moving a team more than a line or two to accommodate teams. They also protect the top four seeds in each region from playing in a hostile building, and they try to avoid rematches and early same-conference match-ups.
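For those playing along at home, the pure snake (before the committee moves anyone) can be sketched in a few lines of Python. This is a minimal sketch under my own assumptions: four regions, the function name `snake_region` is made up for illustration, and region 0 stands in for the "top bracket."

```python
def snake_region(overall_rank, n_regions=4):
    """Return (seed_line, region_index) for a 1-based overall rank.

    Odd seed lines fill regions 0..3 in order; even seed lines fill
    them in reverse, which is what makes the list snake back and forth.
    """
    line = (overall_rank - 1) // n_regions + 1
    pos = (overall_rank - 1) % n_regions
    region = pos if line % 2 == 1 else n_regions - 1 - pos
    return line, region


# Place the first 20 overall ranks into the four regions.
regions = {r: [] for r in range(4)}
for rank in range(1, 21):
    _, region = snake_region(rank)
    regions[region].append(rank)

print(regions[0])  # [1, 8, 9, 16, 17]
print(regions[3])  # [4, 5, 12, 13, 20]
```

Region 0 collects 1, 8, 9, 16, 17 and region 3 collects 4, 5, 12, 13, matching the two brackets described above.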
The magic of true seeds is that we know the committee has moved teams around for a variety of reasons, many legitimate and some that border on favoritism. A true seeds method uses some external ranking system to establish what the snake should have been and compares it to what the snake is. It is then a simple matter to subtract the actual from the projected. If a team is a 7 seed but by RPI (or what have you) should have been an 11, and the 10 should have been an 8, you have found an inversion. Those are a key source of upsets. For our purposes today, I will go ahead and use Jeff Sagarin’s Recent Rankings. You should read his rankings and other things on his site; he has a lot to say about a variety of statistical and sporting things, and he writes well.
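The subtraction and the inversion check are small enough to sketch. The names here (`Team`, `seed_gap`, `is_inversion`) are hypothetical, and the numbers are the 7-versus-10 example from the paragraph above, not real teams.

```python
from collections import namedtuple

# actual = seed the committee gave; projected = seed the external ranking implies
Team = namedtuple("Team", ["name", "actual", "projected"])


def seed_gap(team):
    """Projected minus actual: positive means over-seeded, negative under-seeded."""
    return team.projected - team.actual


def is_inversion(favorite, underdog):
    """True when the team seeded worse on paper is projected better than its opponent."""
    return favorite.actual < underdog.actual and underdog.projected < favorite.projected


seven = Team("the 7 seed", actual=7, projected=11)
ten = Team("the 10 seed", actual=10, projected=8)

print(seed_gap(seven))           # 4  (over-seeded by four lines)
print(seed_gap(ten))             # -2 (under-seeded by two lines)
print(is_inversion(seven, ten))  # True
```

The sign convention follows the text (actual subtracted from projected). Flip it if you prefer under-seeded teams to show up as positive.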
You need to think carefully about what you choose. Sagarin Recent is strong because it doesn’t act like a football model: who cares what happened during an early-season tournament when your team, like, didn’t win afterward? Recent models are also powerful for understanding the 5–12 and 6–11 upset zones. Typically, a 12 seed is a hot, rising team, while 5 seeds are failed or cooling top-liners. Recent models can really capture the convergence between these dynamics.
The spreadsheet should have teams, their ranking based on your preferred external ranking, the snake based on that, and then their actual seeds. Remember there are six eleven seeds and six sixteen seeds. Take your time here; in any data science project, 90% of the time is cleaning and preparation of the data. You will find that some popular sources of sports data use different team names in their library data versus their schedule data, which can really throw you for a loop when UNC isn’t in the library but North Carolina is. I like to continue my snake past the 68, with teams ranked 69–100 assigned seed 20 and all others through team 357 assigned seed 25. This can help you see when a 15 or 16 is in the top 100, which can be a really fun time to think about a big upset pick.
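Here is one way the extended snake column and the name clean-up might look in Python. It is a sketch under assumptions: I am assuming the six eleven seeds are ranks 41–46 and the six sixteen seeds are ranks 63–68, the function names are made up, and the alias table holds only the one example mentioned above.

```python
# Seed-line sizes: four teams per line, except six each on lines 11 and 16.
LINE_SIZES = {line: 4 for line in range(1, 17)}
LINE_SIZES[11] = 6
LINE_SIZES[16] = 6

# Map schedule-data names onto library-data names (extend as you find mismatches).
ALIASES = {"UNC": "North Carolina"}


def canonical_name(name):
    """Return the library spelling for a team name, trimming stray whitespace."""
    cleaned = name.strip()
    return ALIASES.get(cleaned, cleaned)


def ideal_snake_seed(rank):
    """Seed a team 'should' have, given its 1-based rank in the external ranking."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    if rank > 100:
        return 25  # everyone outside the top 100
    if rank > 68:
        return 20  # ranks 69-100: good enough to notice, not in the ideal field
    cutoff = 0
    for line in range(1, 17):
        cutoff += LINE_SIZES[line]
        if rank <= cutoff:
            return line


print(ideal_snake_seed(41), ideal_snake_seed(69), ideal_snake_seed(357))  # 11 20 25
print(canonical_name(" UNC "))  # North Carolina
```

A 15 or 16 seed whose `ideal_snake_seed` comes back as 20 or lower is exactly the "top 100" flag described above.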
The bright yellow is hard to read; it also means right on the money. The bright red and orange teams are red hot and substantially under-seeded, and the blueish teams are over-seeded. If colors don’t work for you, draw the line x = y: the teams on one side you don’t pick, the teams on the other side you do. In the heart of the distribution, you can see that Oklahoma State is under-seeded and West Virginia is over-seeded. But after the Sweet Sixteen you will be working more with instincts than numbers.
What does this approach get you? Multiple 5–12 upsets. This model tells you that Virginia-Ohio is a tight 4–13, that Oregon State and Georgetown are through, and that Villanova is over-seeded. Winthrop is a true 12, though, so that game is not as much of a lock, but also not the crush you might expect. This next graphic is the entire field.
Teams in light colors are likely to pull an upset; teams in black are the ones who are going to lose. The teams on idea.snake 25 are outside the top 100; as you can see, it’s 8 out of 68. There are a number of teams in the 13/14 zone that sit clearly within the snake zone. That is your signal to take a closer look. That zone includes Colgate, Ohio, and Morehead State. It is strange to say that Michigan State is an upset threat, but they are distinctly under-seeded.
If broader trends are your thing, a quick summary of tournament conferences with their mean seeds reveals that the SEC and Big12 are over-seeded by at least 2, and the ACC and Pac12 are also over-seeded but not as severely. The Big10 and the American are right on the money, and for the most part the Big East and several mid-majors are under-seeded. Without getting too far into the next section of the article, it is clear that the machine learning models are saying some of the teams that suffered in this model look like they are Elite 8 bound…
Without getting too far into the weeds, to do this you need a big list of games that have taken place (college sports reference is wonderful for this) and their outcomes. This is really important because these machine learning models are supervised, meaning we feed them very specific inputs along with the known outcomes, and the model inductively learns the relationship and then uses it to predict new values. For my personal ML bracket this year, I am using all tournament games from 2015 to 2019, mirrored. What does mirrored mean? I have each game twice. To get the data into the process, I feed it each game as a team on the LEFT and a team on the RIGHT, with a variety of specific stats for each game, and then again with the two teams swapped. The outcome for each game is then coded as left or right, based on who won and lost. The computer then trains on the stats about the teams and which stats were associated with outcome right or left. Predicting the current tournament involves feeding it a similarly structured list with this year’s teams listed.
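The mirroring step might look something like this in Python. Treat it as a sketch: the dictionary layout, the `mirror_game` name, and the convention that a `.1` suffix marks the RIGHT team's stats are my assumptions for illustration, and the two teams and their numbers are invented.

```python
def mirror_game(team_a, team_b, winner):
    """Turn one game into two training rows, once with each team on the LEFT.

    team_a / team_b: dicts with a 'name' and a 'stats' dict of rate stats.
    winner: the name of the team that won.
    """
    rows = []
    for left, right in ((team_a, team_b), (team_b, team_a)):
        row = {"left": left["name"], "right": right["name"]}
        for stat, value in left["stats"].items():
            row[stat] = value            # LEFT team's stat keeps its plain name
        for stat, value in right["stats"].items():
            row[stat + ".1"] = value     # RIGHT team's stat gets the .1 suffix
        row["Outcome"] = "left" if winner == left["name"] else "right"
        rows.append(row)
    return rows


# Invented teams and numbers, purely for illustration.
alpha = {"name": "Alpha State", "stats": {"TOV": 0.15, "TRB": 0.52}}
beta = {"name": "Beta Tech", "stats": {"TOV": 0.18, "TRB": 0.49}}

for row in mirror_game(alpha, beta, winner="Alpha State"):
    print(row)
```

Mirroring keeps the model from learning that "LEFT usually wins," since every game appears once with each label.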
Typically, I would actually do the modeling with either a random forest (implemented with a Tidymodels-like method in R) or TensorFlow. The drawback to TensorFlow is that the data set is small and it takes a while. These models are fun because they allow you to really dig in and tell the system which basketball factors you care about. I am playing with this:
Outcome ~ TOV + TOV.1 + Opp_TOV + Opp_TOV.1 + TrueShooting + Opp_TS + TrueShooting.1 + Opp_TS.1 + TRB + TRB.1 + Opp_eFG.1 + Opp_eFG + Opp.FT.FGA.1 + FT.FGA.1 + Opp.FT.FGA + FT.FGA
Which means the outcome is a function of how well each team avoids turnovers and causes turnovers, each team’s true shooting percentage and how their opponents shot, rebound rates, free throws per field goal attempt and the opponent stats there, and each team’s strength of schedule. These are RATES, not raw numbers. Many years the raw numbers are useful despite risking multi-collinearity. For those of you playing along at home, multi-collinearity is what happens when two of your variables are highly connected. Total points, for example, will be higher both because you have a good offense and because you win, so if you are trying to correct for total wins in your model, raw points can easily thwart you. This model predicts past games 65% accurately.
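If you only have raw box-score totals, the rate versions are easy to compute. This sketch uses the commonly published definitions (the 0.44 free-throw weight is the usual convention); whether the data behind the model above was computed exactly this way is an assumption on my part, and the sample totals are invented.

```python
def true_shooting(pts, fga, fta):
    """Points per scoring attempt, halved: PTS / (2 * (FGA + 0.44 * FTA))."""
    return pts / (2 * (fga + 0.44 * fta))


def turnover_rate(tov, fga, fta):
    """Share of possessions-used that end in a turnover."""
    return tov / (fga + 0.44 * fta + tov)


def effective_fg(fg, three_pointers, fga):
    """Field goal percentage with made threes counted as 1.5 makes."""
    return (fg + 0.5 * three_pointers) / fga


def ft_per_fga(ft_made, fga):
    """Free throws made per field goal attempt."""
    return ft_made / fga


# Invented single-game totals, purely for illustration.
print(round(true_shooting(pts=75, fga=60, fta=20), 3))
print(round(turnover_rate(tov=12, fga=60, fta=20), 3))
print(round(effective_fg(fg=28, three_pointers=7, fga=60), 3))
print(round(ft_per_fga(ft_made=15, fga=60), 3))
```

Because each stat is divided by attempts or possessions, a team that simply plays faster (or wins more) does not automatically look better, which is the multi-collinearity worry with raw totals.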
The big clusters in the upper right and lower left quadrants are where the model predicted correctly; the sparse regions are where it whiffed. The triangles in the upper right are correctly called upsets, and the dots in the lower left are correctly called wins for higher seeds. The fun of using machine learning to build your own tournament model is that you can actually select which basketball variables you want, like rebound rates or pace of play or even real live defensive stats. We have tested some models in the past that don’t even look at basketball stuff, opting instead for SOS and win numbers. These are promising too.
The question is: how accurate do you want to be? I can get this model, with a few adjustments, to call games 85% accurately, but the predictions it spits out will be incredibly bland. Really it is a question of what you want in your bracket: do you want upsets, or do you want to minimize risk? Do you want to use your model in the later rounds, or is this just about some quick upsets on the first weekend? Last year we had a model that kept on predicting a 1–16 upset based on Joe Lunardi’s bracket from the first week of April.
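To make the accuracy-versus-upsets trade concrete, it helps to score a set of picks on both counts at once. This is a toy sketch: the `summarize_picks` helper and the ten games below are made up for illustration, not results from any of my models.

```python
def summarize_picks(higher_seed_won, picked_higher_seed):
    """Score picks on overall accuracy and on how many upsets were picked and hit."""
    pairs = list(zip(higher_seed_won, picked_higher_seed))
    correct = sum(won == picked for won, picked in pairs)
    upsets_picked = sum(not picked for _, picked in pairs)
    upsets_hit = sum((not won) and (not picked) for won, picked in pairs)
    return {
        "accuracy": correct / len(pairs),
        "upsets_picked": upsets_picked,
        "upsets_hit": upsets_hit,
    }


# Ten made-up games: True means the higher seed won; three were upsets.
results = [True, True, True, True, True, True, True, False, False, False]

chalk = [True] * 10  # always take the higher seed
bold = [True, True, True, True, True, False, False, False, True, True]

print(summarize_picks(results, chalk))
print(summarize_picks(results, bold))
```

In this toy set the chalk picks score higher on accuracy but can never hit an upset, while the bolder picks give up some accuracy to land one. Which of those you want is the question posed above.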
The model above picks these first-round upsets: Norfolk, Wichita, Mount St. Mary’s, and Michigan State. Now if we take out the SOS element, it picks Norfolk, Drake, Mount St. Mary’s, and UCLA. The models really do turn on a dime. I tend to see SOS as autocorrelation (which is like multi-collinearity, except a signal is correlated with its own earlier values), where you are really detecting a delayed version of the wins signal as SOS. I will post an edit on this article on Wednesday with a full ML bracket…
The Politics of Machine Learning
The March Madness bracket approaches presented here are an allegory for so much of contemporary research. The first model created a normal distribution and looked for divergence. In this case the method is legitimate because we know the underlying statistics are parametric. Direct access to the data is the promise of the second model: no need to play with parametric assumptions about the bracket or the snake, we just dive headlong into the actual inductive system of basketball assumptions.
At the same time, the model in the second case is really a reflection of your own aesthetic and political preferences. Sure, the machine learns from hundreds of prior games, but you teach it how to think, which is even more profound than contouring an existing political ranking system.
The model I posted above should cook you a fine bracket that keeps you in the game, but it isn’t my best model. I had one in an early draft of this article that was rolling at 72.5%, and I can heat it up to 85%. But I find multi-collinearity and autocorrelation to be ugly and distinctively un-fun. Rethink the prior pages in that context: my math and code are sound, but my choices are an artistic project. Lower accuracy models pick upsets more effectively. Upsets get you clicks and columns.
I want you to use these approaches to win your bracket pool. Really, I do. Machine learning is sweeping through, promised as a truth machine, but it is really something else: an extension of you and your intuition. Win your pool, reverse engineer the machines.