Bryan C Luu, Audrey L Wright, Heather S Haeberle, Jaret M Karnuta, Mark S Schickendantz, Eric C Makhni, Benedict U Nwachukwu, Riley J Williams, Prem N Ramkumar.
Abstract
BACKGROUND: The opportunity to quantitatively predict next-season injury risk in the National Hockey League (NHL) has become a reality with the advent of advanced computational processors and machine learning (ML) architecture. Unlike static regression analyses that provide a momentary prediction, ML algorithms are dynamic in that they can readily absorb historical data to build a framework that improves as data are added.
Keywords: NHL; injury prediction; machine learning; regression
Year: 2020 PMID: 33029545 PMCID: PMC7522848 DOI: 10.1177/2325967120953404
Source DB: PubMed Journal: Orthop J Sports Med ISSN: 2325-9671
Definitions of Machine Learning Terms and Concepts
| Term | Definition |
|---|---|
| Variance inflation factor (VIF) | A measure of multicollinearity in a regression analysis. A higher VIF indicates that predictors are highly correlated with each other, generally indicating a less reliable result. |
| Python statsmodels package | A Python module that provides resources for conducting statistical analysis in Python. |
| K Nearest Neighbors | A pattern recognition algorithm used for both classification and regression. This algorithm classifies a case based on the classification of a majority of its neighbors. |
| Naïve Bayes | An algorithm that classifies cases based on the application of Bayes’ theorem with the assumption of conditional independence. |
| XGBoost | A machine learning algorithm that uses a gradient boosting framework to solve prediction problems. |
| Top 3 Ensemble | An ensemble algorithm that incorporates multiple machine learning algorithms (top 3) to augment predictive performance. |
| Broyden-Fletcher-Goldfarb-Shanno optimizer | An iterative algorithm that allows for the solving of unconstrained optimization problems. |
| Brier score loss | The mean squared error between predicted probabilities and observed outcomes. A lower Brier score indicates better predictions. |
| Area under the curve (AUC) of the receiver operating characteristic curve | An aggregate measure of a model’s classification performance. AUC ranges in value from 0 to 1.0, with an AUC of 1 meaning that a model is capable of distinguishing between classes 100% of the time. |
| Shapley Additive Explanations (SHAP) score | A measure of feature importance in predictive modeling. A higher SHAP value indicates a factor that predicts higher injury probability, whereas a lower SHAP value indicates a factor that predicts a lower injury probability. |
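
As a rough illustration of two of the metrics defined above, the sketch below computes a Brier score and an AUC from scratch in plain Python. The function names and the four-player toy data are hypothetical and are not taken from the study; the AUC is computed by the pairwise-ranking definition (the share of positive/negative pairs in which the positive case receives the higher score, with ties counted as half).

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = injured next season, 0 = not injured.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

print("Brier score:", brier_score(y_true, y_prob))
print("AUC:", roc_auc(y_true, y_prob))
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75; a model that ranked every injured player above every uninjured player would reach 1.0.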
Figure 1. Schematic describing the predictive injury model for National Hockey League players.
Figure 2. (A) Calibration curve for position players. The x-axis depicts the fraction of positive values at the designated probability. As an example, assume a subcohort of 100 players with a predicted probability of 30% of being injured in the future. A perfectly calibrated classifier will correctly classify 30 of these 100 players as having a future injury. A perfectly calibrated classifier will also behave similarly across all player subcohorts with differing probabilities of being injured. Thus, a theoretical perfectly calibrated classifier will have a diagonal line in a calibration curve (dashed line). The bottom panel of the calibration curve shows the count of predicted probabilities across each predicted probability. For position players, logistic regression (blue line) is the best calibrated, as this line most nearly matches the 45° diagonal in the top plot, along with K Nearest Neighbors (green line). Random forest (orange line), XGBoost (purple line), and the Top 3 Ensemble (brown line) are the next best calibrated, with curves appearing in a sigmoid shape. Naïve Bayes (red line) is poorly calibrated. (B) Calibration curve for goalies. Logistic regression (blue line) is the best calibrated curve, followed by random forest (orange line). The remaining curves are more poorly calibrated. This is likely a consequence of fewer data points in the goalie cohort. The mean predicted value range for both nongoalies and goalies is from 0 to 1, representing the spectrum of predicted results for player injury between 0 (not injured) and 1 (injured). In both the goalie and the nongoalie cohort, a bimodal distribution can be seen for most models at 0 and 1.
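
The 100-player example in the caption can be reproduced with a minimal binning sketch. The function `calibration_curve` below is a hypothetical plain-Python helper written for illustration (it is not the plotting code used in the study): it groups predictions into equal-width probability bins and reports, for each non-empty bin, the mean predicted probability, the observed fraction of injured players, and the bin count.

```python
def calibration_curve(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, fraction of positives, count)
    for each non-empty equal-width probability bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    curve = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            frac_pos = sum(y for y, _ in members) / len(members)
            curve.append((mean_pred, frac_pos, len(members)))
    return curve


# Hypothetical subcohort: 100 players, each given a 30% predicted injury
# probability, of whom 30 are in fact injured the following season.
y_prob = [0.3] * 100
y_true = [1] * 30 + [0] * 70

for mean_pred, frac_pos, count in calibration_curve(y_true, y_prob):
    print(f"predicted {mean_pred:.2f}, observed {frac_pos:.2f}, n={count}")
```

For a well-calibrated model the predicted and observed values agree in every bin, which is what the dashed diagonal in the figure represents.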
Most Common Injury Types Seen in the Data for Position Players
| Injury Type | n (% of total) |
|---|---|
| Lower extremity | 1925 (29) |
| Upper extremity | 1805 (27) |
| Systemic illness | 795 (12) |
| Concussion | 450 (7) |
| Day-to-day designation | 5673 (85) |
| Total injuries | 6673 |
Most Common Injury Types Seen in the Data for Goalies
| Injury Type | n (% of total) |
|---|---|
| Lower extremity | 171 (34) |
| Systemic illness | 82 (16) |
| Sports hernia | 67 (13) |
| Upper extremity | 56 (11) |
| Day-to-day designation | 426 (84) |
| Total injuries | 309 |
Accuracy and Area Under the Receiver Operating Characteristic Curve (AUC) for Predicting Next-Season Injury Risk for Position Players and Goalies
| Model | Accuracy | AUC | F1 Score | Brier Score Loss |
|---|---|---|---|---|
| Position players | | | | |
| Logistic regression | 0.946 ± 0.005 | 0.937 ± 0.011 | 0.898 ± 0.016 | 0.050 ± 0.004 |
| Random forest | 0.946 ± 0.005 | 0.936 ± 0.012 | 0.898 ± 0.016 | 0.053 ± 0.004 |
| K Nearest Neighbors | 0.700 ± 0.020 | 0.752 ± 0.028 | 0.577 ± 0.036 | 0.223 ± 0.014 |
| Naïve Bayes | 0.854 ± 0.027 | 0.917 ± 0.015 | 0.775 ± 0.035 | 0.126 ± 0.023 |
| XGBoost | 0.946 ± 0.005 | 0.948 ± 0.010 | 0.898 ± 0.016 | 0.048 ± 0.004 |
| Top 3 Ensemble | 0.946 ± 0.005 | 0.946 ± 0.010 | 0.898 ± 0.016 | 0.049 ± 0.004 |
| Goalies | | | | |
| Logistic regression | 0.968 ± 0.015 | 0.947 ± 0.045 | 0.920 ± 0.045 | 0.033 ± 0.015 |
| Random forest | 0.967 ± 0.013 | 0.937 ± 0.033 | 0.917 ± 0.040 | 0.036 ± 0.012 |
| K Nearest Neighbors | 0.808 ± 0.041 | 0.816 ± 0.076 | 0.618 ± 0.105 | 0.147 ± 0.030 |
| Naïve Bayes | 0.943 ± 0.023 | 0.936 ± 0.031 | 0.869 ± 0.054 | 0.053 ± 0.023 |
| XGBoost | 0.967 ± 0.013 | 0.956 ± 0.026 | 0.917 ± 0.040 | 0.030 ± 0.011 |
| Top 3 Ensemble | 0.968 ± 0.015 | 0.952 ± 0.029 | 0.920 ± 0.045 | 0.032 ± 0.013 |
Values are reported as mean ± SD across 10 K-folds.
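
A minimal sketch of how a "mean ± SD across folds" summary might be produced is shown below. The helper names, the contiguous fold split, the use of the sample standard deviation, and the per-fold accuracies are all assumptions made for illustration; the source does not state how the folds were formed or which SD estimator was used.

```python
import statistics


def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def summarize(scores):
    """Format per-fold scores as 'mean ± SD' (sample SD, an assumption)."""
    return f"{statistics.mean(scores):.3f} ± {statistics.stdev(scores):.3f}"


# Each fold serves once as the held-out test set.
folds = k_fold_indices(23, k=10)
print([len(fold) for fold in folds])

# Hypothetical per-fold accuracies, NOT values from the study.
fold_accuracy = [0.90, 0.92, 0.91, 0.93, 0.92, 0.90, 0.91, 0.92, 0.93, 0.91]
print(summarize(fold_accuracy))
```

Reporting the spread alongside the mean makes it easier to judge whether small differences between models, such as those in the AUC column, exceed fold-to-fold variation.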
Figure 3. A summary Shapley Additive Explanations (SHAP) plot for National Hockey League goalies (A) and position players (B). The top 14 most important factors for model output are on the y-axis. Factor impact on the model is on the x-axis. For each factor, the distribution of values is displayed. A higher SHAP value indicates a factor that predicts higher injury probability, whereas a lower SHAP value indicates a factor that predicts a lower injury probability. Each datapoint is colored by the feature value. For example, age is colored blue for lower age values and red for higher age values.
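
To make the idea behind a SHAP value concrete, the sketch below computes exact Shapley values for a single prediction by brute force, averaging each feature's marginal contribution over every feature ordering. This is an illustration of the underlying concept only: the toy risk function, the baseline player, and the feature names are hypothetical, and the SHAP software used in practice relies on faster model-specific or approximate algorithms rather than full enumeration.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction. Features not yet 'added'
    in an ordering are held at their baseline value."""
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        prev = predict(current)
        for name in order:
            current[name] = x[name]
            new = predict(current)
            totals[name] += new - prev  # marginal contribution of this feature
            prev = new
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_risk(f):
    """Hypothetical linear risk score, NOT the study's model."""
    return 0.02 * f["age"] + 0.1 * f["prior_injuries"] + 0.001 * f["games_played"]


baseline = {"age": 26, "prior_injuries": 1, "games_played": 60}
player = {"age": 31, "prior_injuries": 3, "games_played": 80}

phi = shapley_values(toy_risk, player, baseline)
for name, value in phi.items():
    print(f"{name}: {value:+.3f}")
```

The contributions sum to the difference between the player's score and the baseline score. A positive value pushes the prediction toward injury and a negative value pushes it away, which is how the horizontal axis of the summary plot is read.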
Nongoalie Cohort Characteristics, Including Sabermetric Measures of Performance and Prior and Future Injury
| Variable Name | Feature |
|---|---|
| Age | Player age |
| +/- | Plus/minus (scoring) |
| PIM | Penalties in minutes (scoring) |
| EV | Even strength goals |
| PP.Special Teams | Power play goals (special teams) |
| SH.Special Teams | Short-handed goals (special teams) |
| GW | Game-winning goals |
| PP.Assists | Power play assists |
| SH.Assists | Short-handed assists |
| S% | Shooting percentage |
| BLK | Blocks at even strength |
| HIT | Hits at even strength |
| FOW | Faceoff wins at even strength |
| FO% | Faceoff win percentage at even strength |
| FF% rel | Relative Fenwick for percentage at even strength |
| oiSH% | Team on-ice shooting percentage at even strength |
| Shift | Average shift length per game |
| GP | Games played |
| oZS% | Offensive zone start percentage at even strength |
| TK | Takeaways |
| GV | Giveaways |
| E +/- | Expected +/- (given where shots came from, for and against, while this player was on the ice at even strength) |
| ATOI.ES | Average time on ice per game while at even strength |
| CF% Rel.ES | Relative Corsi for percentage while at even strength |
| GA/60.ES | On-ice goals against per 60 minutes while at even strength |
| ATOI.PP | Average time on ice per game while on the power play |
| CF% Rel.PP | Relative Corsi for percentage while on the power play |
| GF/60.PP | On-ice goals for per 60 minutes while on the power play |
| GA/60.PP | On-ice goals against per 60 minutes while on the power play |
| ATOI.SH | Average time on ice per game while short-handed |
| CF% Rel.SH | Relative Corsi for percentage while short-handed |
| GF/60.SH | On-ice goals for per 60 minutes while short-handed |
| GA/60.SH | On-ice goals against per 60 minutes while short-handed |
| TOI.Total | Total time on ice per season |
| Prior injury count | Number of prior injuries, counted at the end of a season |
Variable Name refers to the coded name for the variable as used in the Python program. Feature is a description of the variable.
Goalie Cohort Characteristics, Including Sabermetric Measures of Performance and Prior and Future Injury
| Variable Name | Feature |
|---|---|
| Age | Goalie age |
| GAA | Goals against average |
| QS% | Quality start percentage |
| GSAA | Goals saved above average |
| PIM | Penalties in minutes |
| GS | Games started |
| L | Losses |
| T/O | Ties plus overtime/shootout losses |
| SO | Shutouts |
| GA%- | Goals allowed percentage relative to league goals allowed percentage |
| A | Assists |
| GP | Games played |
| MIN | Minutes played, in season |
| Prior injury count | Number of prior injuries, counted at the end of a season |
Variable Name refers to the coded name for the variable as used in the Python program. Feature is a description of the variable.