Blair Andrews builds on the random forest model described in the Wrong Read No. 68 by turning it into actionable rules for choosing rookie wide receivers. Last year Blair told you that Justin Jefferson was the only WR prospect to check all the boxes. Today, he provides the most similar receivers from the 2021 class.
The best predictive models have three notable criteria — they’re stable, accurate, and interpretable. Not all kinds of models meet all three criteria. A decision tree is both accurate and interpretable, but small changes in the data can have drastic impacts on the model results, which makes a decision tree inherently unstable.
A linear regression does not suffer from quite the same instability, and it is also interpretable. But because it treats every variable as if its relation to the dependent variable can be expressed in a linear equation, it’s not always the most accurate, at least not for predicting how college prospects will perform in the NFL.
A random forest solves the stability problem of a decision tree by growing hundreds of decision trees using random slices of the data. However, the sheer number of trees and nodes in a random forest model makes it a “black box” model; that is, it’s not easily interpretable.
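To make the stability point concrete, here is a minimal sketch in Python. It is not the model from this series: the data is invented, each "tree" is a one-split stump on a single variable, and the names `fit_stump`, `fit_forest`, and `forest_predict` are hypothetical. A real random forest also samples a random subset of variables at each split, which this sketch leaves out.

```python
import random

# Invented toy data: (feature value, 1 = hit / 0 = miss). Not real prospect data.
DATA = [(0.10, 0), (0.15, 0), (0.18, 0), (0.22, 1), (0.24, 0),
        (0.27, 1), (0.30, 1), (0.33, 0), (0.36, 1), (0.40, 1)]

def fit_stump(rows):
    """Return the threshold t for which the rule 'value >= t' classifies the most rows correctly."""
    best_t, best_correct = None, -1
    for t in sorted({x for x, _ in rows}):
        correct = sum((x >= t) == bool(y) for x, y in rows)
        if correct > best_correct:  # ties keep the lowest threshold
            best_t, best_correct = t, correct
    return best_t

def fit_forest(rows, n_trees=200, seed=0):
    """Fit one stump per bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]

def forest_predict(thresholds, x):
    """Majority vote across all stumps: 1 if more than half say 'hit'."""
    votes = sum(x >= t for t in thresholds)
    return int(votes * 2 > len(thresholds))

forest = fit_forest(DATA)
# A single stump's threshold depends on which rows it happened to see;
# the vote across many stumps smooths that out.
print(sorted(set(forest)))
print(forest_predict(forest, 0.40), forest_predict(forest, 0.05))
```

The individual thresholds shift from one resample to the next, which is the instability of a single tree. The vote across all of them is what a forest reports, and that is also why no single split tells the whole story.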
Making Random Forests Interpretable
We can solve that last problem using a technique called rule extraction. The basic idea is this: Each split node in a decision tree can be thought of as a simple rule or heuristic, such as whether a prospect clears a certain threshold on one variable. These heuristics prove vital in prospect evaluation. Checking all the boxes is even more important than you might have guessed.
Rule extraction algorithms pull those heuristics out of random forest models based on their frequency. In other words, the nodes that appear most frequently across the trees are taken to be the most important rules.
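The frequency idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the extraction algorithm used for this article: the trees, the feature names (`draft_pick`, `breakout_age`, `market_share_pct`), and the bin widths are all hypothetical, and the sketch ignores which direction each split points. Thresholds are rounded into bins so that near-identical splits, like pick 38 and pick 42, count as the same rule.

```python
from collections import Counter

# Hypothetical forest: each tree is a list of split nodes (feature, threshold).
# Feature names and thresholds are invented for illustration.
TREES = [
    [("draft_pick", 40), ("breakout_age", 20)],
    [("draft_pick", 42), ("market_share_pct", 30)],
    [("breakout_age", 20), ("draft_pick", 38)],
    [("market_share_pct", 31), ("draft_pick", 41)],
]

# Assumed bin widths used to group similar thresholds together.
BIN_WIDTH = {"draft_pick": 10, "breakout_age": 1, "market_share_pct": 5}

def normalize(node):
    """Snap a node's threshold to the nearest bin so similar splits match."""
    feature, threshold = node
    width = BIN_WIDTH[feature]
    return feature, round(threshold / width) * width

def extract_rules(trees, top_n=3):
    """Count how often each normalized split appears and keep the most common."""
    counts = Counter(normalize(node) for tree in trees for node in tree)
    return counts.most_common(top_n)

rules = extract_rules(TREES)
for (feature, threshold), count in rules:
    print(feature, threshold, count)
```

In this toy forest the draft pick split shows up in every tree, so it would rank as the most important rule.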
We end up with a short list of the most important rules extracted from those hundreds of trees. Some of them are created by combining two nodes that frequently appear together. The table below lists those rules, along with some evaluative statistics showing how good each heuristic is at telling the hits from the misses. For our purposes, if a player averages at least 12.5 PPR points per game over his first three seasons (a 200-point season pace), I count that as a hit.
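Here is a rough sketch of how a rule could be scored against that hit definition. The prospects, the rule thresholds, and the choice of precision and recall as the statistics are my assumptions for illustration; they are not the numbers or necessarily the metrics in the table. The 200-point pace assumes a 16-game season (12.5 × 16 = 200).

```python
HIT_THRESHOLD = 12.5  # PPR points per game over the first three seasons

# Hypothetical prospects: every value here is invented for illustration.
PROSPECTS = [
    {"name": "A", "draft_pick": 12, "breakout_age": 19, "ppg": 15.1},
    {"name": "B", "draft_pick": 30, "breakout_age": 21, "ppg": 9.4},
    {"name": "C", "draft_pick": 55, "breakout_age": 19, "ppg": 12.5},
    {"name": "D", "draft_pick": 20, "breakout_age": 20, "ppg": 8.0},
    {"name": "E", "draft_pick": 90, "breakout_age": 22, "ppg": 6.2},
    {"name": "F", "draft_pick": 8, "breakout_age": 18, "ppg": 13.9},
]

def is_hit(prospect):
    """A hit averages at least 12.5 PPR points per game."""
    return prospect["ppg"] >= HIT_THRESHOLD

def evaluate(rule, prospects):
    """Score a rule: how many players it flags, and how well the flags match hits."""
    flagged = [p for p in prospects if rule(p)]
    total_hits = sum(is_hit(p) for p in prospects)
    true_pos = sum(is_hit(p) for p in flagged)
    return {
        "flagged": len(flagged),
        "precision": true_pos / len(flagged) if flagged else 0.0,  # share of flagged who hit
        "recall": true_pos / total_hits if total_hits else 0.0,    # share of hits flagged
    }

# A single-node rule and a combined two-node rule (thresholds are assumptions).
def early_pick(p):
    return p["draft_pick"] <= 40

def early_pick_and_young(p):
    return p["draft_pick"] <= 40 and p["breakout_age"] <= 20

single = evaluate(early_pick, PROSPECTS)
combined = evaluate(early_pick_and_young, PROSPECTS)
print(single)
print(combined)
```

In this made-up sample, adding the second condition flags fewer players without losing any hits, so the combined rule is more precise. That is the sense in which combining two nodes can produce a better heuristic than either one alone.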