NBA Player Metric Comparison
Summary
The latest all-in-one NBA player metrics were compared in terms of accuracy, with a focus on the modern era. The often-used retrodiction test was performed to make the comparison. Several metrics were included in such an analysis for the first time, namely Real Plus-Minus (RPM), RAPTOR, Box Plus-Minus 2.0 (BPM), and Estimated Plus-Minus (EPM). Metric accuracy was compared overall and in the context of changing rosters. EPM and RPM, the only metrics that use RAPM directly with a Bayesian prior, consistently performed the best among all metrics, with EPM taking the lead overall. RAPTOR was the clear third-place metric, with the revamped BPM applying pressure in fourth place. New player metrics built on the latest methodologies and data are better suited to today's game.
Introduction
The goal of this project was to compare the latest all-in-one NBA player metrics on their accuracy. Similar to studies performed by Dan Rosenbaum, Alex at Sport Skeptic, Neil Paine, and most recently Ben Taylor, a retrodiction test was performed in which player metric values were used to predict team ratings in the following season. The idea is that the lower the prediction error, the more accurately the metric is assigning credit to players. As will be discussed later, this is especially true in the context of changing rosters.
The analysis is called a retrodiction test because the metric values were weighted by actual minutes played in the season that is being projected to. The playing-time weights are unknown when making real-world predictions, so while this methodology is sufficient for comparing metrics, the predicted numbers themselves are unrealistic.
This study included several new player metrics that have not been included in a retrodiction analysis to date: Real Plus-Minus (RPM), RAPTOR, Box Plus-Minus 2.0 (BPM), and Estimated Plus-Minus (EPM). These metrics were built in the modern era (since 2013-14), which is where this analysis is focused. RAPTOR and EPM were the first public metrics to use player-tracking data, which allows us to look at how these data may impact modern metrics. Also, metrics that use RAPM directly with a Bayesian prior (i.e. EPM and RPM) were tested for the first time in a while. I'm the creator of EPM but worked hard to be objective and scientific in this analysis.
Player Metrics & Data Sources
The metrics below were compared in the analysis. The data for each metric were obtained from the linked source on March 13th, 2020, two days after the NBA season was suspended indefinitely due to the COVID-19 pandemic.
- Estimated Plus-Minus (EPM)
- Real Plus-Minus (RPM)
- Regularized Adjusted Plus-Minus (RAPM)
- Player Impact Plus-Minus (PIPM)
- RAPTOR/PREDATOR (modern)
- Box Plus-Minus 2.0 (BPM)
- Player Efficiency Rating (PER)
- Win Shares per 48 (WS48)
Data & Methodology Detail
This section explains in detail how the data were organized and calculated. If you're eager for results, skip to the "Overall Results" section below and come back to this section later.
To compare the metrics in the player-tracking era, only data from the 2013-14 season to present were used. The resulting dataset included six season-to-season pairs, or 180 team-seasons.
Some metrics only provided values that included the playoffs (RPM, PIPM, RAPTOR/PREDATOR), while others only provided regular-season values (EPM, BPM, WS48, PER, RAPM). RAPTOR/PREDATOR also provided values split by season type, but only at the player-team-season level; this project used player-season values only.
Handling Missing, Low-Minutes, and Rookie Players
Data were successfully matched across datasets, resulting in no missing players with the exception of RPM. There were 63 player-seasons (2.1%) missing from the RPM data from 2014 through 2019, with some of those players having played up to 1,200 minutes in a season. Most of the missing players were lower-end players and were thus given replacement-level values. 81% of the missing RPM player-seasons also played fewer than 250 minutes and would have received replacement-level values anyway, as described in the next few paragraphs.
Some metrics had extreme values in small sample sizes, which could be an issue when applied to the following season's minutes weights. For some metrics this is by design: they aim to explain exactly what happened, even if a small sample does not represent a player's value moving forward (e.g. RAPTOR, BPM). To mitigate this issue, this study assigned the same value to all players below a certain minutes-played threshold.
All players with fewer than 250 minutes played (in the season from which values were taken) received replacement-level values for each metric. Neil Paine used 250 minutes as a threshold in his study, and it also seemed to work well in this analysis. Several thresholds were tried, from 35 minutes (5.5% of player-seasons) to 350 minutes (27%), with little relative change in prediction errors except for RAPTOR, which benefited significantly from higher thresholds (up to about 250 minutes). The results of these iterations, along with some commentary, can be found in Appendix A. There were 673 player-seasons (22.2%) from 2014 to 2019 with fewer than 250 total minutes played.
Rookies, who have no previous-season values, were given replacement-level values. The analysis was also repeated with rookies given average values and again with current-season values. The relative prediction errors between metrics did not change much when using average rather than replacement-level values. There was, however, some movement between metrics when rookies were given current-season values. The results of the different methods are shown in Appendix B. The reason for not choosing current-season values is explained below.
Since retrodiction testing uses values from the previous season, and rookies have no previous-season values, it's important to treat rookies the same way across metrics to make a fair comparison. The idea behind giving them current-season values is to remove them from the equation: each metric would be predicting to the same season, effectively giving every metric a free pass on rookies. However, metrics that force-fit to team ratings would likely get a slightly better free pass. How a metric fits its own season shouldn't factor into the comparison, and since any metric can be force-fit but not all are, replacement-level values were used for rookies.
A replacement-level value of -2.0 was used for all plus-minus based metrics, 11 for PER, and 0.045 for WS48.
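For illustration, here is a minimal sketch of how the replacement-level assignment might be implemented, assuming a pandas table with one row per player-season; the column names and data layout are hypothetical, not the exact code used in this study.

```python
import pandas as pd

# Replacement-level values stated above: -2.0 for plus-minus style metrics,
# 11 for PER, and 0.045 for WS48.
REPLACEMENT = {
    "EPM": -2.0, "RPM": -2.0, "RAPM": -2.0, "PIPM": -2.0,
    "RAPTOR": -2.0, "PREDATOR": -2.0, "BPM": -2.0,
    "PER": 11.0, "WS48": 0.045,
}

MINUTES_THRESHOLD = 250  # players below this in the prior season get replacement values


def apply_replacement_levels(players: pd.DataFrame) -> pd.DataFrame:
    """Assign replacement-level values to low-minutes players and rookies.

    Assumes `players` has one row per player-season with a `prior_minutes`
    column (minutes in the season the values come from; NaN for rookies)
    and one column per metric.
    """
    out = players.copy()
    # Rookies have no prior minutes, so they fall below the threshold as well.
    low_minutes = out["prior_minutes"].fillna(0) < MINUTES_THRESHOLD
    for metric, value in REPLACEMENT.items():
        out.loc[low_minutes, metric] = value
        # Any remaining missing values (e.g. the missing RPM player-seasons)
        # also get the replacement level.
        out[metric] = out[metric].fillna(value)
    return out
```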
Team Ratings and Continuity Calculations
Team Adjusted Net Ratings, which account for strength of schedule (home-court advantage and the strength of opponents faced), were the prediction target in the analysis. These ratings were predicted from player metric values from the previous season, and prediction errors were assessed.
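As a rough sketch of the retrodiction step itself, the snippet below weights each player's prior-season value by his share of the team's minutes in the season being projected, scales by the five players on the floor to get a predicted net rating, and scores predictions with RMSE. The five-player scaling and the column layout are assumptions for illustration; the study's exact aggregation is not spelled out above.

```python
import numpy as np
import pandas as pd


def predict_team_rating(values: pd.Series, next_season_minutes: pd.Series) -> float:
    """Minutes-weighted retrodiction of one team's net rating.

    `values` holds each rostered player's metric value from the previous
    season (per 100 possessions for the plus-minus style metrics), and
    `next_season_minutes` holds the minutes each player actually played for
    the team in the season being projected. Scaling by 5 (players on the
    floor) is an assumption, not something stated in the write-up.
    """
    weights = next_season_minutes / next_season_minutes.sum()
    return 5.0 * float((values * weights).sum())


def rmse(predicted, actual) -> float:
    """Root mean squared error between predicted and actual team ratings."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))
```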
A measure of roster continuity was used in further analysis to compare metrics in the context of changing rosters for a more accurate assessment. Continuity was calculated using the method outlined by Ken Pomeroy: for each player on a team, take the minimum of his percentage of team minutes played across the two seasons, then sum those minimums for the team. Using percentages controls for differing total minutes across team-seasons due to overtime periods and truncated seasons (lockout and postponed seasons).
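A minimal sketch of the Pomeroy-style continuity calculation described above, assuming we have each player's minutes for the same franchise in back-to-back seasons (hypothetical column names):

```python
import pandas as pd


def roster_continuity(prev: pd.DataFrame, curr: pd.DataFrame) -> float:
    """Pomeroy-style roster continuity for one team across two seasons.

    Each input frame has columns `player` and `minutes` for that team-season.
    For every player, take the smaller of his two minutes shares, then sum.
    Using shares rather than raw minutes controls for overtime periods and
    shortened seasons.
    """
    prev_share = prev.set_index("player")["minutes"] / prev["minutes"].sum()
    curr_share = curr.set_index("player")["minutes"] / curr["minutes"].sum()
    both = pd.concat([prev_share, curr_share], axis=1, keys=["prev", "curr"]).fillna(0)
    return float(both.min(axis=1).sum())
```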
Overall Results
The first analysis was run using the entire dataset and with rookies given replacement-level values. The table below shows the overall prediction error for each metric (in terms of Root Mean Squared Error, or RMSE). The lower the error, the better the metric predicted the following season’s team ratings.
Metric | Error (RMSE) |
---|---|
EPM | 2.48 |
RPM | 2.60 |
RAPTOR | 2.63 |
BPM | 2.71 |
PREDATOR | 2.73 |
PIPM | 2.78 |
RAPM | 2.80 |
WS48 | 2.85 |
aPER | 3.12 |
PER | 3.20 |
As can be seen in the table above, EPM had the lowest prediction error when not accounting for roster continuity, followed by RPM (the two Bayesian-RAPM metrics). This was the case regardless of how rookies were treated and which minutes-played threshold was used when assigning replacement-level values (see Appendices A and B).
RAPTOR followed closely behind RPM but was the only metric that was significantly sensitive to changes in the minutes-played threshold (which determined the players who received replacement-level values). When using a lower minutes-played threshold, RAPTOR’s prediction error worsened to be comparable to BPM (results and commentary in Appendix A).
In the middle tier were the remaining indirect-RAPM metrics, with BPM leading PREDATOR and PIPM. The older metrics WS48 and PER had the highest prediction errors, with PER struggling mightily (although not as badly when given an after-market team adjustment at the suggestion of Steve Ilardi, which I've labeled "aPER"; the adjustment formula was provided by Nathan Walker, who adapted BPM's formula).
Controlling for Roster Continuity
A greater test of metric efficacy is observing how well metric values predict team net rating when rosters change. Roster continuity in and of itself is predictive of team ratings, meaning teams that have little roster turnover from one year to the next tend to perform better. This is likely due to good teams keeping good players with secondary effects from team chemistry and club strength (coaching, player development, training staff, management, etc.), but a good player metric minimizes any team effect that could be present. So how much is a particular metric dependent on teams sticking together?
One way to answer this is to separate the sample into roster-continuity sub-groups (or bins), similar to Ben Taylor's study. Since this project focused on the smaller sample that is the modern era, splitting the data into smaller groups raised potential sample-size issues. However, the order of metrics by prediction error did not change much even with small sub-groups. Bin analysis was performed and the results can be seen in Appendix C.
Another way to answer this question is to see how much of a metric's predictive strength remains after holding continuity constant. This was done by fitting a multiple regression model for each metric with two standardized inputs: 1) the weighted sum of the metric's player values, and 2) the measure of roster continuity described in the Data & Methodology Detail section.
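A sketch of that per-metric regression is shown below, using statsmodels and expressing relative importance as each standardized coefficient's share of the two coefficients combined, which lines up closely with the splits reported in the table (e.g. 3.76 / (3.76 + 0.23) ≈ 94.2% for EPM). The data layout and the importance definition are assumptions for illustration, not necessarily the exact calculation used here.

```python
import pandas as pd
import statsmodels.api as sm


def metric_vs_continuity(df: pd.DataFrame, metric_col: str) -> dict:
    """Regress next-season team rating on a standardized metric prediction
    and standardized roster continuity, then report coefficient shares.

    Assumes `df` has one row per team-season with columns `net_rating`
    (the target), `continuity`, and `metric_col` (the minutes-weighted sum
    of that metric's player values).
    """
    # Standardize both predictors so the coefficients are comparable.
    X = df[[metric_col, "continuity"]].apply(lambda s: (s - s.mean()) / s.std())
    X = sm.add_constant(X)
    fit = sm.OLS(df["net_rating"], X).fit()

    b_metric = fit.params[metric_col]
    b_cont = fit.params["continuity"]
    total = abs(b_metric) + abs(b_cont)
    return {
        "metric_strength": b_metric,
        "continuity_strength": b_cont,
        "metric_importance": abs(b_metric) / total,
        "continuity_importance": abs(b_cont) / total,
    }
```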
In the table below, the "Metric Strength" values show how well each metric predicted team ratings after holding continuity equal (the higher the better). "Continuity Strength" shows how much of a role roster continuity played in the prediction. The "Importance" columns show the relative importance of the metric variable vs. the continuity variable.
Metric | Metric Strength (Coefficient) | Continuity Strength (Coefficient) | Metric Importance | Continuity Importance
---|---|---|---|---
EPM | 3.76 | 0.23 | 94.2% | 5.8% |
RPM | 3.58 | 0.43 | 89.3% | 10.7% |
RAPTOR | 3.51 | 0.55 | 86.5% | 13.5% |
BPM | 3.46 | 0.53 | 86.7% | 13.3% |
RAPM | 3.42 | 0.44 | 88.6% | 11.4% |
PREDATOR | 3.42 | 0.60 | 85.2% | 14.8% |
PIPM | 3.39 | 0.54 | 86.2% | 13.8% |
WS48 | 3.33 | 0.55 | 85.9% | 14.1% |
aPER | 2.96 | 0.98 | 75.1% | 24.9% |
PER | 2.87 | 0.96 | 74.9% | 25.1% |
As can be seen in the table above, EPM was the most predictive of team ratings after controlling for roster continuity (coefficient: 3.76); it was also roughly half as dependent on rosters staying together as the second most predictive metric (continuity coefficient of 0.23 vs. 0.43). RPM was the clear second-place metric, having gained separation from RAPTOR after controlling for roster continuity.
RAPTOR, while still in third, lost some ground to BPM in the roster continuity analysis. Similar to the overall analysis, RAPTOR’s predictive strength worsened when using a lower minutes-played threshold.
Following RAPTOR, the remaining metrics that use RAPM indirectly (and RAPM itself, used directly with no prior) performed similarly after controlling for continuity, namely BPM, PREDATOR, and PIPM, with BPM leading this group. The older metric WS48 held up surprisingly well, while PER struggled again.
Conclusion
This was the first time Bayesian-RAPM based metrics were used in a retrodiction comparison study since early versions of Jerry Engelmann's xRAPM were used in older studies. These two metrics, EPM and RPM, were shown to be leading the way in assigning credit to players in the modern era so far. EPM, which also uses player-tracking data, came out on top. It's important to note that RPM has been around for several years but its authors changed in 2020; values from the new version were not used in this study.
Was it the use of direct-RAPM that landed EPM and RPM at the top, or was it the accuracy of their priors? These metrics don't report the prior separately, but as the creator of EPM I was able to compare it to its prior, as well as to a team-adjusted prior (force-fitting to team ratings). In both cases the EPM prior was competitive with second-tier metrics, but it was the use of direct-RAPM that significantly improved prediction and pushed it to the top.
This was also the first time metrics that use player-tracking data were compared in such a test, RAPTOR/PREDATOR being the other metric besides EPM to do so (RAPTOR also uses a new technique to estimate RAPM). RAPTOR appeared to be a clear third-place metric when given the benefit of the doubt on its accuracy with lower-minutes players (commentary in Appendix A).
Given the performance of the two metrics that use player-tracking data (RAPTOR, EPM), there is some early evidence that these data will be helpful moving forward in estimating player values. Performing retrodiction tests for defense separately may provide more insight on this point.
The fourth-place BPM (v2.0) gained some ground on RAPTOR after controlling for roster continuity. The recently revamped BPM performed very well in the analysis and is especially valuable in general given that it uses the same formula for historical calculations (unlike EPM and RAPTOR, which depend on player-tracking data only available since 2013-14) and provides game-level values.
PIPM, which uses innovative luck-adjusted techniques, followed closely behind the revamped BPM and the modern RAPTOR metrics. PIPM was also trained on luck-adjusted team ratings; using those as the target in a retrodiction test might have resulted in slightly stronger performance (regular-season-only values might have helped as well). Similar to BPM, PIPM is not dependent on modern data and can be used for historical valuations.
This study focused on the modern era of NBA basketball (since 2013-14). While new metrics explain this time period relatively well, they were also trained specifically to do so. I tried some out-of-sample testing with EPM, removing a season from the modeling that created EPM, but the differences in prediction errors for the omitted season were very small. The effect is likely similarly small for other metrics. It is still uncertain how these new metrics might fare as we move into the future and the game continues to evolve.
This analysis did not compare metrics in terms of their ability to predict offense and defense separately, something future studies could add. Also, partial season analysis would be interesting to see how metrics compare in their quickness to stabilize (although it would be difficult to assemble historical data of partial seasons for all metrics).
Appendix A
The table below shows prediction error (RMSE) by the minutes-played threshold below which players were given replacement-level values. For example, the first column, "<35", shows the errors when players with fewer than 35 minutes were given replacement-level values (roster continuity is not accounted for).
Overall RMSE by Minutes-Played Threshold

Metric | <35 | <100 | <200 | <250 | <350
---|---|---|---|---|---
EPM | 2.49 | 2.50 | 2.48 | 2.48 | 2.50 |
RPM | 2.60 | 2.60 | 2.60 | 2.60 | 2.63 |
RAPTOR | 2.72 | 2.69 | 2.66 | 2.63 | 2.61 |
BPM | 2.69 | 2.72 | 2.73 | 2.71 | 2.71 |
PREDATOR | 2.75 | 2.75 | 2.74 | 2.73 | 2.73 |
PIPM | 2.82 | 2.80 | 2.78 | 2.78 | 2.79 |
RAPM | 2.86 | 2.83 | 2.81 | 2.80 | 2.83 |
WS48 | 2.88 | 2.87 | 2.87 | 2.85 | 2.84 |
aPER | 3.08 | 3.09 | 3.12 | 3.12 | 3.10 |
PER | 3.19 | 3.19 | 3.19 | 3.20 | 3.18 |
As mentioned, RAPTOR suffered more than other metrics when a lower minutes threshold was used for replacement-level values. This may indicate that RAPTOR is less accurate for lower-end players. RAPTOR and BPM were designed to explain what has happened more than what is realistic moving forward, but regardless of the minutes-played threshold, RAPTOR outperformed its cousin PREDATOR even though the latter was specifically designed to be more predictive. While RAPTOR looks very competitive with RPM in the table above, this was shown to be less true when controlling for roster continuity.
The <250 minutes-played threshold was chosen for the main analysis because most metrics tended to stabilize by that point, including RAPTOR, and because it was the number used in Neil Paine's study.
Appendix B
The table below shows overall prediction errors (RMSE) for each metric by how rookie values were handled (roster continuity is not accounted for).
Metric | Averages | Replacement | Current-Season
---|---|---|---
EPM | 2.62 | 2.48 | 2.32 |
RPM | 2.74 | 2.60 | 2.46 |
RAPTOR | 2.75 | 2.63 | 2.47 |
BPM | 2.84 | 2.71 | 2.51 |
PREDATOR | 2.85 | 2.73 | 2.55 |
PIPM | 2.94 | 2.78 | 2.68 |
RAPM | 2.93 | 2.80 | 2.76 |
WS48 | 3.01 | 2.85 | 2.62 |
aPER | 3.43 | 3.12 | 2.81 |
PER | 3.46 | 3.20 | 3.17 |
The potential unfair advantage gained by metrics that force-fit to current-season team ratings when rookies are given current-season values is perhaps illustrated by how PER changes when it is force-fit (aPER). The improvement aPER makes (relative to other metrics) when going from replacement-level to current-season rookie values is much greater than the improvement for vanilla PER, even though the only difference between the two is a team adjustment.
Appendix C
A comparison of metric prediction error by roster-continuity bin is shown in the table below. The Low Continuity bin contained teams with continuity below 0.54 and the High Continuity bin those above it. There were 99 team-seasons in the Low Continuity bin and 81 in the High Continuity bin. There was some noise affecting metrics in general when the continuity cutoff used to split the sample was adjusted, so results should be interpreted with care.
Metric | Low Continuity | High Continuity |
---|---|---|
EPM | 2.55 | 2.36 |
RPM | 2.64 | 2.48 |
RAPTOR | 2.65 | 2.47 |
BPM | 2.79 | 2.52 |
PREDATOR | 2.72 | 2.60 |
PIPM | 2.88 | 2.57 |
RAPM | 2.86 | 2.65 |
WS48 | 2.97 | 2.61 |
aPER | 3.15 | 2.84 |
PER | 3.10 | 3.11 |
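For reference, here is a quick sketch of how the sample might be split into the two-bin grouping above and the three equal-sized bins discussed next; the column name and the pandas approach are assumptions for illustration.

```python
import pandas as pd


def add_continuity_bins(teams: pd.DataFrame) -> pd.DataFrame:
    """Label each team-season with a two-bin and a three-bin continuity group.

    Assumes `teams` has a `continuity` column on the 0-1 scale described in
    the Data & Methodology Detail section.
    """
    out = teams.copy()
    # Two bins split at the 0.54 cutoff used above.
    out["bin2"] = (out["continuity"] >= 0.54).map({False: "Low", True: "High"})
    # Three equal-sized bins (roughly 60 team-seasons each for this sample).
    out["bin3"] = pd.qcut(out["continuity"], 3, labels=["Low", "Medium", "High"])
    return out
```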
When splitting into three equal bins of 60 team-seasons, the noise can be seen in the table below, with the medium-continuity bin generally having higher prediction error than the low-continuity bin. There happened to be a handful of team-seasons in the medium-continuity bin that all metrics missed badly on, causing these counterintuitive results.
METRIC | LOW CONTINUITY | MEDIUM CONTINUITY | HIGH CONTINUITY |
---|---|---|---|
EPM | 2.30 | 2.75 | 2.26 |
RPM | 2.52 | 2.73 | 2.41 |
RAPTOR | 2.61 | 2.68 | 2.43 |
BPM | 2.70 | 2.75 | 2.53 |
PREDATOR | 2.65 | 2.80 | 2.49 |
PIPM | 2.83 | 2.82 | 2.55 |
RAPM | 2.79 | 2.99 | 2.47 |
WS48 | 2.87 | 2.88 | 2.62 |
aPER | 3.26 | 2.91 | 2.81 |
PER | 3.15 | 2.97 | 3.10 |