Last offseason we tried to recreate our success with Justin Jefferson by using a random forest model and a rule extraction algorithm to find the most valuable WR prospect heuristics. That is, we wanted to know which boxes need to be checked.
While last year’s class didn’t include anyone who checked the most important boxes according to the algorithm, it gave us a handful of names who were close. Rashod Bateman and Tylan Wallace both passed some of the most demanding heuristics, and were among our favorites outside of the very top of the class. The Ravens put their RotoViz subscription to good use, which was in some ways unfortunate for both receivers, but particularly for Wallace. (Wallace, it’s worth noting, is the only player in this group who spent four years in college.)
Rondale Moore and Elijah Moore also met key thresholds identified by the algorithm, and both Moores and Bateman are now trendy best ball picks in the single-digit rounds.
The 2022 class does not appear as deep as the 2021 class, but it has some intriguing potential stars. Yet how many of them check all the boxes? And which boxes do we want them to check in the first place? Here’s last year’s introduction to the exercise, which explains what we’re up to:
The best predictive models meet three notable criteria — they’re stable, accurate, and interpretable. Not all kinds of models meet all three. A decision tree is both accurate and interpretable, but small changes in the data can have drastic impacts on its results, which makes a decision tree inherently unstable.
A linear regression does not suffer from quite the same instability, and it is also interpretable. But because it treats every variable as if its relation to the dependent variable can be expressed in a linear equation, it’s not always the most accurate, at least not for predicting how college prospects will perform in the NFL.
A random forest solves the stability problem of a decision tree by growing hundreds of decision trees using random slices of the data. However, the sheer number of trees and nodes in a random forest model makes it a “black box model” — that is, it’s not easily interpretable.
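To make that trade-off concrete, here’s a minimal from-scratch sketch of the bagging idea behind a random forest. It uses one-split “stumps” instead of full trees, and the prospect numbers (breakout age, dominator rating, draft pick) are invented purely for illustration, not real data:

```python
import random
from collections import Counter

random.seed(0)

# Toy prospect rows (all values invented for illustration): features + hit label.
ROWS = [
    ({"breakout_age": 19.0, "dominator": 0.38, "draft_pick": 22}, 1),
    ({"breakout_age": 19.5, "dominator": 0.33, "draft_pick": 5}, 1),
    ({"breakout_age": 21.0, "dominator": 0.21, "draft_pick": 140}, 0),
    ({"breakout_age": 20.0, "dominator": 0.29, "draft_pick": 60}, 1),
    ({"breakout_age": 22.0, "dominator": 0.18, "draft_pick": 190}, 0),
    ({"breakout_age": 21.5, "dominator": 0.25, "draft_pick": 75}, 0),
]

def best_stump(rows, features):
    """Find the single (feature, threshold, direction) split with the fewest errors."""
    best = None
    for f in features:
        for thresh in sorted({x[f] for x, _ in rows}):
            for direction in ("<=", ">"):
                preds = [(x[f] <= thresh) == (direction == "<=") for x, _ in rows]
                errors = sum(int(p) != y for p, (_, y) in zip(preds, rows))
                if best is None or errors < best[0]:
                    best = (errors, f, thresh, direction)
    return best[1:]  # (feature, threshold, direction)

def grow_forest(rows, n_trees=50):
    """Bagging: train each stump on a bootstrap resample with a random feature subset."""
    all_feats = list(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [random.choice(rows) for _ in rows]  # bootstrap resample
        feats = random.sample(all_feats, 2)           # random feature subset
        forest.append(best_stump(sample, feats))
    return forest

def predict(forest, x):
    """Majority vote across all stumps in the forest."""
    votes = Counter(int((x[f] <= t) == (d == "<=")) for f, t, d in forest)
    return votes.most_common(1)[0][0]

forest = grow_forest(ROWS)
print(predict(forest, {"breakout_age": 19.2, "dominator": 0.36, "draft_pick": 15}))
```

Each stump on its own is a perfectly readable rule. It’s the vote across 50 resampled stumps that stabilizes the prediction — and that same pile of rules is what makes the ensemble hard to read.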
Making Random Forests Interpretable
However, we can solve that last problem using a technique called rule extraction. The basic idea is this: Each node in a decision tree can be thought of as a simple rule or heuristic. These heuristics prove vital in prospect evaluation. Checking all the boxes is even more important than you might have guessed.
Rule extraction algorithms pull those heuristics out of a random forest based on frequency: the nodes that appear most often across the forest’s trees are taken to be the most important rules.
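As a sketch of what that extraction step can look like, suppose we’ve already walked a fitted forest and collected every split node as a (metric, comparison, threshold) triple. The node values below are invented for illustration; ranking them is then just a frequency count:

```python
from collections import Counter

# Hypothetical split nodes collected from a fitted forest's trees (values invented).
NODES = [
    ("breakout_age", "<=", 20.0),
    ("breakout_age", "<=", 20.0),
    ("dominator", ">", 0.30),
    ("breakout_age", "<=", 20.0),
    ("draft_pick", "<=", 64),
    ("dominator", ">", 0.30),
    ("draft_pick", "<=", 64),
    ("breakout_age", "<=", 21.0),
]

def extract_rules(nodes, top_n=3):
    """Rank candidate heuristics by how often they appear across the forest."""
    return Counter(nodes).most_common(top_n)

for (metric, op, thresh), n in extract_rules(NODES):
    print(f"{metric} {op} {thresh}: appears {n} times")
```

In a real forest the thresholds rarely repeat exactly, so in practice you’d bucket them (say, rounding breakout age to the nearest half year) before counting.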
The resulting list of rules tells us what nodes most consistently lead to success. The table below lists the rules along with some evaluative metrics that help us understand how accurate and effective each rule is at finding hits and avoiding misses. For our purposes, if a player averages at least 12.5 PPR points per game over his first three seasons, I count that as a hit. (That’s a 200-point pace over 16 games — slightly more over 17 games.) Where multiple thresholds for a single metric exist, I’ve included only the most demanding threshold for each metric. I’ll note some interesting lower thresholds below. I’ll also note some interesting differences in the 2022 set of rules compared to last season.
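For illustration, here’s one way the hit definition and a rule’s evaluative metrics might be computed. The player rows are invented, and “breakout age of 20 or younger” stands in for whichever rule is being graded; precision and recall are my framing of “finding hits and avoiding misses”:

```python
HIT_PPG = 12.5  # per-game hit bar: a 200-point pace over a 16-game season

assert HIT_PPG * 16 == 200

# Invented evaluation rows: (breakout_age, PPR points per game over first three seasons).
PLAYERS = [
    (19.0, 15.1), (20.0, 13.0), (21.0, 8.2), (19.5, 6.0),
    (22.0, 4.5), (20.5, 14.2), (21.5, 9.9), (19.0, 12.6),
]

def evaluate_rule(players, threshold=20.0):
    """Grade the hypothetical rule 'breakout_age <= threshold' against the hit bar."""
    passed = [p for p in players if p[0] <= threshold]
    hits = [p for p in players if p[1] >= HIT_PPG]
    passed_hits = [p for p in passed if p[1] >= HIT_PPG]
    precision = len(passed_hits) / len(passed)  # share of qualifiers who hit
    recall = len(passed_hits) / len(hits)       # share of all hits the rule catches
    return precision, recall

precision, recall = evaluate_rule(PLAYERS)
print(f"precision={precision:.0%}, recall={recall:.0%}")  # → precision=75%, recall=75%
```

A demanding threshold typically buys precision at the cost of recall, which is why the table keeps only the most demanding threshold per metric while the lower ones still merit a mention.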