This was prepared by a team including Mariah Samano, Haley Daarstad, Angel Le, Quinn Downey, Simon Hutton, and me, Dan.
A few months ago, we set out on the journey of predicting the results of March Madness. Then the apocalypse arrived and everything was cancelled. It turns out the end of the world is boring, so we continued the journey by creating the brackets that would have been if current events had not intervened.
So, for those of you playing along at home: we project Michigan to defeat Wisconsin in the national final, with those teams beating Gonzaga and Villanova, respectively, in the Final Four. This is bracket 4, the Orange Bracket. Go Beavs!
The bigger question for us is how we got here, not so much the picks themselves. What we thought would be an objective process turned on human decisions, intuition, and political choices. Each bracket presented here was produced with a machine learning model called a random forest (10,000 trees per run, which is just an arbitrarily large, round number). We used the Lunardi bracketology from April 14 to set the field (he continues to update it; we could just as easily have used any version). Each accuracy rating is derived from an 80/20 train/validate split. Of course, the games did not take place; these are merely projections.
The data and code needed to cook your own brackets can be found in this GitHub repo: https://github.com/dcfaltesek/team/tree/master/basketball%20replication%20code
Note: you can't replicate these brackets exactly, since you won't draw the same 80/20 split that informed our runs, but you can produce many very similar brackets.
Here is a good starting point: four brackets, four methods. Which one do you think would be best for winning big money, and which one is the most plausible?
You can think of our bracket options like “Goldilocks and the Three Bears”: we have a piping hot bracket, a hot bracket, a too-cold bracket, and a bracket that is just right.
The first bracket (Figure 1) was built from raw season totals rather than any basketball analytics. This produced a wild bracket, with multiple upsets and an underdog in the final.
The inputs: total rebounds, total points scored for the season, total points against for the season, turnovers for the season, and losses for the season. No fancy stats, weights, z-scores, or normalizations; just let the machine learn from ten years of games. Internal validation accuracy: 76.2%.
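If you want a sense of what that pipeline looks like in code, here is a minimal sketch in Python with scikit-learn. It is not the exact code from the repo, and the file and column names are placeholders, but it shows the shape of the thing: raw totals in, a 10,000-tree forest, an 80/20 split for validation.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical input: ten seasons of games, one row per matchup.
games = pd.read_csv("historical_games.csv")

# Raw season totals, no z-scores, weights, or normalization (placeholder column names).
features = ["total_rebounds", "total_points_for", "total_points_against",
            "turnovers", "losses"]
X = games[features]
y = games["winner"]  # binary class: which side of the matchup won

# 80/20 train/validate split; the random draw here is why no two runs match exactly.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

# 10,000 trees per run, the arbitrarily large, round number mentioned above.
model = RandomForestClassifier(n_estimators=10_000, n_jobs=-1)
model.fit(X_train, y_train)

print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```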
The second bracket (Figure 2) uses basketball data, but it fails to produce any outliers within the bracket.
This bracket is based on an extremely smooth model built on lots of z-scores of Oliver's four factors. It was 69.8% accurate against the last three years of NCAA tournament games.
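For the curious, the smoothing here is ordinary z-scoring of Dean Oliver's four factors (shooting, turnovers, offensive rebounding, and free throw rate) before they reach the forest. A minimal sketch, with placeholder file and column names:

```python
import pandas as pd

# Hypothetical per-team season stats; column names are placeholders.
teams = pd.read_csv("team_season_stats.csv")

# Dean Oliver's four factors: shooting, turnovers, offensive rebounding, free throw rate.
four_factors = ["efg_pct", "turnover_rate", "off_rebound_pct", "ft_rate"]

# z-score each factor across all teams: (value - league mean) / league standard deviation.
# This is the smoothing discussed above: raw magnitudes get flattened away.
for col in four_factors:
    teams[col + "_z"] = (teams[col] - teams[col].mean()) / teams[col].std()
```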
The third bracket (Figure 3) produces a decent Final Four, but it produces almost no upsets, which is not how the tournament realistically plays out.
This bracket is based on a hyper-smooth, Oliver-four-factors-style model: z-score everything. About 72% accurate.
The fourth and final bracket (Figure 4) is the interesting one. It produced upsets, but not too many, and a realistic and interesting Final Four.
Bracket D has no basketball data. It is just wins, losses, and strength of schedule. About 72% accurate.
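A sketch of bracket D's inputs, using the same pipeline as the first sketch above; again, the file and column names are placeholders rather than the repo's actual ones.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same hypothetical matchup file as before; column names are placeholders.
games = pd.read_csv("historical_games.csv")

# No box-score data at all: just wins, losses, and strength of schedule.
X = games[["wins", "losses", "strength_of_schedule"]]
y = games["winner"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier(n_estimators=10_000, n_jobs=-1).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```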
Figure 1. Green Final Four — As you can see, this bracket is wild: the Final Four is a 1, a 16, a 2, and a 15.
Figure 2. Blue Final Four — (1 Gonzaga def 2 Kansas; 2 Duke def 1 Baylor). Final, Gonzaga def Duke. Aside from a 1:16 upset, a very conservative bracket ending in a Final Four of nothing but 1 and 2 seeds.
Figure 3. Red Final Four — 2 Kansas def 1 Gonzaga; 1 Baylor def 1 Villanova. Final, Kansas def Baylor. While this bracket doesn't include any huge swings, it does include an early, though not first-round, exit for Virginia.
Figure 4. Orange Final Four — This bracket has no basketball information, just wins/losses and SOS for each team. This is a satisfying bracket, with plenty of well-chosen upsets and an interesting, but not overly provocative final four, despite calling a 1:16 upset.
A few interesting questions:
1. How do you know if the industry standard is coherent?
The sampling process produces large swings from individual random selections. There was one simulation, once, where Oregon lost in a 5–12 upset. Much like the Virginia problem, it is clear that in a world with major outliers and only about 300 base data points, the selection of training material should be regarded as political and important.
The 80/20 standard is used regularly in this field, but there isn't a great reason why it shouldn't be 70/30, or even 50/50 to really validate. Eighty is round, and large enough that we can hope to predict the remaining fifth of the cases. We generally treated a 70%+ success rate at predicting classes for the held-out 20% as acceptable: if you train on 16 games and then pick 4 more, going three for four on those next games seems good. This is a binary classification problem known to produce odd results (a bunch of upsets every year), and overfitting lurks in the shadows if you fit much tighter.
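To make the point concrete, here is a small, self-contained experiment on synthetic data (not our basketball data): the same forest, scored on twenty different random 80/20 splits of roughly 300 cases, produces a visible spread of validation accuracies, the same kind of wobble we describe above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: roughly 300 cases, about the size of the tournament history above.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

scores = []
for seed in range(20):
    # A different random 80/20 split on every pass.
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = RandomForestClassifier(n_estimators=500).fit(X_tr, y_tr)  # fewer trees, for speed
    scores.append(model.score(X_va, y_va))

print("validation accuracy min / mean / max:",
      round(min(scores), 3), round(float(np.mean(scores)), 3), round(max(scores), 3))
```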
2. Are the more conservative brackets better?
Generally, professional pickers end up with a consistent pattern: two low seed numbers and two 1 seeds in the Final Four. Better-seeded teams do tend to advance, but it is not common for two or more 1 seeds to make it that far. This does not fit with reality.
The most recent 11 Final Four seeding patterns
These “good” brackets built around multiple 1 seeds have little basis in reality. In the last eleven tournaments we have seen two 1 seeds in the Final Four once, three once, and zero once; your best bet is exactly one of them. Brackets that include multiple 1 seeds aren't just boring, they are wrong.
People would reject the green bracket (Figure 1) out of hand, but many of our models made it clear that Virginia was going to lose that game. SFA was a perfect trap for Baylor in Dallas, as some brackets also predicted, and Winthrop was the same for Wisconsin. We notice that the conservative outcome is itself a fiction. Or, as legendary coach Herm Edwards once argued in a press conference: this is why we play the games.
- Formulas matter; this isn't a black box. The representation of machine learning as a black box that produces answers is itself political. We pulled back the curtain and showed that the outcomes are fairly easily manipulated. Deep learning can be pretty shallow.
- More data isn't always better; it can just mean more extreme outliers, and definitely more noise.
- Why are we super-smoothing a rough reality? Normalizing all of the data led to a worse result. Maybe hitting twice as many three-point shots over the season really should appear as a true doubling in the model. We could come back later and bolt on extra features to model the importance of the three-point shot, but that seems hard, and it misses the point of a machine learning process that looks at the data and inductively finds the classes.
- Beware of sacrificing reality for consistency in the model validation phase. This is the story of the Texas Marksman, as told in Mayo's Statistical Inference as Severe Testing: the Texas Marksman goes out to the barn and unloads a revolver into it, and after the shooting is done, paints a bullseye around wherever the bullets actually went. This goes two ways: re-running the random 80/20 sample until you get what you want, and changing the model, by continuing to smooth, upsample, or normalize factors, when the results are politically or aesthetically unpalatable. Just as you shouldn't hypothesize after the fact, and pre-registering hypothesis tests is a good idea, you shouldn't rewrite your interpretive framework to match whatever your model produced.
- We decided to risk embarrassing ourselves with the Virginia upset pick; how do we find the courage to do good data science when we are weak? The most appealing model is often not the most honest.