Bram Janssens, Matthias Bogaert, Mathijs Maton.
Abstract
The importance of young athletes in the field of professional cycling has sky-rocketed during the past years. Nevertheless, the early talent identification of these riders largely remains a subjective assessment. Therefore, an analytical system which automatically detects talented riders based on their freely available youth results should be installed. However, such a system cannot be copied directly from related fields, as large distinctions are observed between cycling and other sports. The aim of this paper is to develop such a data analytical system, which leverages the unique features of each race and thereby focusses on feature engineering, data quality, and visualization. To facilitate the deployment of prediction algorithms in situations without complete cases, we propose an adaptation to the k-nearest neighbours imputation algorithm which uses expert knowledge. Overall, our proposed method correlates strongly with eventual rider performance and can aid scouts in targeting young talents. On top of that, we introduce several model interpretation tools to give insight into which current starting professional riders are expected to perform well and why.
Keywords: Interpretable machine learning; Missing value imputation; Predictive modelling; Scouting analytics; Sports analytics
Year: 2022 PMID: 35068645 PMCID: PMC8765833 DOI: 10.1007/s10479-021-04476-4
Source DB: PubMed Journal: Ann Oper Res ISSN: 0254-5330 Impact factor: 4.854
Selected youth races
| Race | Category | Country |
|---|---|---|
| Gent-Wevelgem Kattenkoers | Cobbles | Belgium |
| Ronde van Vlaanderen Beloften | Cobbles | Belgium |
| Liège - Bastogne - Liège U23 | Hilly U23 | Belgium |
| Tour de l'Avenir | Big Tour | France |
| Coppa della Pace - Trofeo F.lli Anelli | Hilly U23 | Italy |
| G.P. Palio del Recioto | Hilly U23 | Italy |
| Giro Ciclistico d'Italia | Big Tour | Italy |
| Giro Ciclistico della Valle d'Aosta - Mont Blanc | Big Tour | Italy |
| Gp Capodarco Comunita Di Capodarco | Hilly U23 | Italy |
| Gran Premio della Liberazione | Hilly U23 | Italy |
| Gran Premio Industrie del Marmo | Hilly U23 | Italy |
| Gran Premio Sportivi di Poggiana | Hilly U23 | Italy |
| Il Piccolo Lombardia | Hilly U23 | Italy |
| Le Triptyque des Monts et Châteaux | Rest | Belgium |
| Paris-Roubaix Espoirs | Cobbles | France |
| Ronde de l'Isard | Stage Race Climb | France |
| Ruota d'Oro - GP Festa del Perdono | Hilly U23 | Italy |
| Tr. Città di S. Vendemiano | Hilly U23 | Italy |
| Trofeo Piva | Hilly U23 | Italy |
| Circuito Belvedere | Hilly U23 | Italy |
| Eschborn Frankfurt U23 | Rest | Germany |
| European Championships U23 ITT | ITT | Varying |
| European Continental Championships U23 | Hilly U23 | Varying |
| World Championships ITT U23 | ITT | Varying |
| World Championships U23 | Hilly U23 | Varying |
| Omloop der Vlaamse Gewesten | Cobbles | Belgium |
| Tour du Valromey | Stage Race Junior | France |
| Liège la Gleize | Stage Race Junior | Belgium |
| Bernaudeau Junior | One Day Junior | France |
| Course de la Paix Junior | Stage Race Junior | Czech Republic |
| GP General Patton | One Day Junior | Luxembourg |
| Grand Prix Rüebliland | Stage Race Junior | Switzerland |
| GP dell Arno | One Day Junior | Italy |
| Keizer der Juniores | Stage Race Junior | Belgium |
| La Coupe du President de la ville Grudziadz | Stage Race Junior | Poland |
| Trofeo Karlsberg | Stage Race Junior | Germany |
| Paris Roubaix Juniors | Cobbles | France |
| Ronde van Vlaanderen voor junioren | Cobbles | Belgium |
| Sint-Martinusprijs Kontich | Stage Race Junior | Belgium |
| Driedaagse van Axel | Stage Race Junior | Netherlands |
| Tour de l’Abitibi | Stage Race Junior | Canada |
| Tour du Pays de Vaud | Stage Race Junior | Switzerland |
| Trofeo Buffoni | One Day Junior | Italy |
| Trofeo Comune di Vertova | One Day Junior | Italy |
| Trofeo Emilio Paganesi | One Day Junior | Italy |
| Le Trophee Centre Morbihan | One Day Junior | France |
| Chrono des Nations Junior | ITT | France |
| Giro Internazionale della Lunigiana | Stage Race Junior | Italy |
| Internationales Junioren Rundfahrt Niedersachsen | Stage Race Junior | Germany |
| European Championships Junior | One Day Junior | Varying |
| UCI World Championships ITT Junior | ITT | Varying |
| UCI World Championships Junior | One Day Junior | Varying |
| Olympia Tour | Rest | Netherlands |
| Tour d’Alsace | Stage Race Climb | France |
| Tour de Normandie | Rest | France |
| Ster ZLM Tour | Rest | Netherlands |
| Paris-Arras | Rest | France |
| Paris-Tours U23 | Rest | France |
| Tour de Berlin | Rest | Germany |
| Tour des Pays de Savoie | Big Tour | France |
Aggregate features: feature names and description
| Feature name | Description |
|---|---|
| # Results | Total number of scraped youth results |
| # Abandons | Total number of abandons among results |
| Abandon ratio | Ratio results/abandons |
| Victory ratio Junior | Number of victories in Junior category divided by number of Junior results |
| Podium ratio Junior | Number of podiums (places 2–3) in Junior category divided by number of Junior results |
| Top 5 ratio Junior | Number of top-5’s (places 4–5) in Junior category divided by number of Junior results |
| Victories Junior | Absolute number of victories as Junior |
| Victory ratio U23 | Number of victories in U23 category divided by number of U23 results |
| Podium ratio U23 | Number of podiums (places 2–3) in U23 category divided by number of U23 results |
| Top 5 ratio U23 | Number of top-5’s (places 4–5) in U23 category divided by number of U23 results |
| Victories U23 | Absolute number of victories as U23 |
| Evolution wins | Number of victories in U23 category divided by wins in Junior category (0 if divided by 0) |
| Evolution podium ratio | Podium ratio U23 category divided by Podium ratio Junior category (0 if divided by 0) |
| Evolution top 5 ratio | Top 5 ratio U23 category divided by Top 5 ratio Junior category (0 if divided by 0) |
Features are calculated across all races
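The ratio and evolution features in the table above can be illustrated with a minimal sketch. It assumes a rider's results are available as two lists of finishing places (abandons excluded); the function and key names are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def safe_div(numerator: float, denominator: float) -> float:
    """Divide, returning 0 when the denominator is 0 (as the table prescribes)."""
    return numerator / denominator if denominator else 0.0


def category_ratios(places: List[int]) -> Dict[str, float]:
    """Victories plus victory, podium (places 2-3) and top-5 (places 4-5) ratios."""
    n = len(places)
    victories = sum(p == 1 for p in places)
    return {
        "victories": victories,
        "victory_ratio": safe_div(victories, n),
        "podium_ratio": safe_div(sum(p in (2, 3) for p in places), n),
        "top5_ratio": safe_div(sum(p in (4, 5) for p in places), n),
    }


def evolution_features(junior: List[int], u23: List[int]) -> Dict[str, float]:
    """U23 values divided by the matching Junior values, 0 on division by 0."""
    jr, u = category_ratios(junior), category_ratios(u23)
    return {
        "evolution_wins": safe_div(u["victories"], jr["victories"]),
        "evolution_podium_ratio": safe_div(u["podium_ratio"], jr["podium_ratio"]),
        "evolution_top5_ratio": safe_div(u["top5_ratio"], jr["top5_ratio"]),
    }


# Hypothetical finishing places, for illustration only.
junior_places = [1, 3, 4, 12]
u23_places = [1, 1, 2, 5, 20]
junior_stats = category_ratios(junior_places)
evolution = evolution_features(junior_places, u23_places)
no_junior_wins = evolution_features([7, 9], [1, 2])
```

Note that the podium and top-5 ratios count disjoint place ranges, so a victory is not counted again as a podium.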
Non-aggregate (single race-based) features by group: feature names and description
| Feature group | Type of race | Number of features | Example features |
|---|---|---|---|
| Best Result | Stage | 24 | |
| Best Result | One day | 36 | |
| Participation | Stage | 24 | |
| Participation | One day | 36 | |
| Minimum Time Difference | Stage | 24 | |
| Minimum Time Difference | One day | 36 | |
| Stage Wins | Stage | 24 | |
| Stage Best Results | Stage | 24 | |
Features are calculated per race. Stage races have five features, one day races three
Fig. 1 Results from one rider in the sample, used for an example calculation. The rider would have one stage win and a best result of 5th. (Color figure online)
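The five per-race features for a stage race can be sketched as follows. The data layout (one dictionary per edition started, with a general classification place, a time gap in seconds, and a list of stage places) is an assumption made for illustration, as are the choices to count participations and to leave result features missing for riders who never started.

```python
from typing import Dict, List, Optional

# Hypothetical results of one rider in one stage race, two editions.
editions = [
    {"gc_place": 5, "gap_seconds": 95.0, "stage_places": [1, 14, 9]},
    {"gc_place": 18, "gap_seconds": 410.0, "stage_places": [7, 22, 31]},
]


def stage_race_features(editions: List[dict]) -> Dict[str, Optional[float]]:
    """Five features for one stage race; result features are None if never started."""
    if not editions:
        # No start in this race: the result features stay missing for later imputation.
        return {
            "participation": 0,
            "best_result": None,
            "min_time_difference": None,
            "stage_wins": None,
            "stage_best_result": None,
        }
    stage_places = [place for e in editions for place in e["stage_places"]]
    return {
        "participation": len(editions),
        "best_result": min(e["gc_place"] for e in editions),
        "min_time_difference": min(e["gap_seconds"] for e in editions),
        "stage_wins": sum(place == 1 for place in stage_places),
        "stage_best_result": min(stage_places),
    }


features = stage_race_features(editions)
never_started = stage_race_features([])
```

A one-day race would keep only the first three features, which matches the three-versus-five split stated in the table note.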
Fig. 2 Imputation using feature categorization. Missing features are imputed per race category: in this example cobble races versus big tours. E.g., Paris-Roubaix results do not influence Tour de l’Avenir imputations. (Color figure online)
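The idea of imputing per race category can be sketched in plain Python. This is not the authors' implementation: the distance rescaling, the choice of k, the feature names, and the data are all assumptions. The point it illustrates is that neighbours for a missing feature are searched using only features from the same race category.

```python
import math
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def distance(a: Row, b: Row, features: List[str]) -> float:
    """Euclidean distance over features observed in both rows, rescaled for gaps."""
    shared = [f for f in features if a[f] is not None and b[f] is not None]
    if not shared:
        return math.inf
    squared = sum((a[f] - b[f]) ** 2 for f in shared)
    return math.sqrt(squared * len(features) / len(shared))


def grouped_knn_impute(rows: List[Row], groups: Dict[str, List[str]], k: int = 2) -> List[Row]:
    """Fill each missing value with the mean of its k nearest donors within the group."""
    imputed = [dict(row) for row in rows]
    for features in groups.values():
        for i, row in enumerate(rows):
            for target in features:
                if row[target] is not None:
                    continue
                # Donors must have the target observed; distances use this group only.
                donors = [
                    (distance(row, other, features), other[target])
                    for j, other in enumerate(rows)
                    if j != i and other[target] is not None
                ]
                donors = sorted((d for d in donors if d[0] < math.inf), key=lambda d: d[0])
                nearest = donors[:k]
                if nearest:
                    imputed[i][target] = sum(value for _, value in nearest) / len(nearest)
    return imputed


# Hypothetical best-result features, grouped by race category.
groups = {"cobbles": ["roubaix", "flanders"], "big_tour": ["avenir", "giro"]}
rows = [
    {"roubaix": 3, "flanders": 5, "avenir": 40, "giro": 35},
    {"roubaix": 4, "flanders": 6, "avenir": 2, "giro": 3},
    {"roubaix": 50, "flanders": 45, "avenir": 3, "giro": 4},
    {"roubaix": 5, "flanders": None, "avenir": 2, "giro": None},
]
imputed = grouped_knn_impute(rows, groups, k=2)
```

In this toy data the last rider's missing big-tour feature is filled from the two riders with similar big-tour results, even though the first rider is the closer match on cobbles.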
Candidate search grids
| Algorithm | Parameter | Candidate settings |
|---|---|---|
| Linear regression | / | / |
| Decision tree | | [0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2] |
| Random forest | / | / |
| XGBoost | Learning rate | [0.1, 0.3, 0.5, 0.7, 0.9] |
| XGBoost | Boosting round | [100, 200, 300, 400] |
| XGBoost | | [0, 4, 8, 12, 16, 20] |
| XGBoost | Maximal tree depth | [6, 8, 10, 12] |
| XGBoost | Objective function | [Squared error, Tweedie] |
| Perceptron | Decay | [0.001, 0.01, 0.1] |
| Perceptron | Hidden layer size | [1, 2, 3, …, 20] |
Hyperparameters were optimized per algorithm through an exhaustive grid search. Values display validated parameters with their candidate settings
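An exhaustive grid search simply evaluates every combination of candidate settings. The sketch below enumerates the XGBoost and perceptron grids from the table; the dictionary keys are hypothetical labels (one XGBoost parameter has no name in the table, so it is called `unnamed_parameter` here), and the error function is a toy placeholder standing in for a real validation error.

```python
from itertools import product
from typing import Callable, Dict, List

xgboost_grid = {
    "learning_rate": [0.1, 0.3, 0.5, 0.7, 0.9],
    "boosting_rounds": [100, 200, 300, 400],
    "unnamed_parameter": [0, 4, 8, 12, 16, 20],  # parameter name missing in the table
    "max_tree_depth": [6, 8, 10, 12],
    "objective": ["squared_error", "tweedie"],
}
perceptron_grid = {
    "decay": [0.001, 0.01, 0.1],
    "hidden_layer_size": list(range(1, 21)),
}


def expand_grid(grid: Dict[str, list]) -> List[dict]:
    """Return one dict per combination of candidate settings."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


def grid_search(grid: Dict[str, list], validation_error: Callable[[dict], float]) -> dict:
    """Pick the candidate with the lowest validation error."""
    return min(expand_grid(grid), key=validation_error)


def toy_error(params: dict) -> float:
    """Placeholder for a real validation metric such as RMSE on the validation year."""
    return abs(params["decay"] - 0.01) + abs(params["hidden_layer_size"] - 7)


xgboost_candidates = expand_grid(xgboost_grid)
perceptron_candidates = expand_grid(perceptron_grid)
best_perceptron = grid_search(perceptron_grid, toy_error)
print(len(xgboost_candidates), len(perceptron_candidates))
```

The XGBoost grid is by far the largest (5 × 4 × 6 × 4 × 2 combinations), which is worth keeping in mind when the search is repeated for every fold and imputation method.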
Fig. 3 Rolling cross-validation time window. Green stands for training, orange for validation, and red for testing. (Color figure online)
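A rolling time window of this kind can be sketched as below. The exact window layout is not given in this excerpt, so the sketch assumes an expanding training window, a validation year directly before the test year, and a hypothetical first data year of 2010; only the fold years 2015 to 2019 come from the results tables.

```python
from typing import Dict, List, Union

Fold = Dict[str, Union[int, List[int]]]


def rolling_folds(first_year: int, test_years: List[int]) -> List[Fold]:
    """Build train/validation/test splits that never look into the future."""
    folds = []
    for test_year in test_years:
        validation_year = test_year - 1
        folds.append(
            {
                "train": list(range(first_year, validation_year)),
                "validation": validation_year,
                "test": test_year,
            }
        )
    return folds


# first_year=2010 is an assumption; the fold years follow the results tables.
folds = rolling_folds(first_year=2010, test_years=[2015, 2016, 2017, 2018, 2019])
for fold in folds:
    print(fold["train"][0], "-", fold["train"][-1], "|", fold["validation"], "|", fold["test"])
```

Each fold trains only on years that precede its validation year, so hyperparameters are tuned and models are tested on riders the model has not yet seen.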
Overview configurations
| Model name | Imputation method | Algorithm |
|---|---|---|
| BASELINE 1 | None | XGBoost |
| BASELINE 2 | None | Aggregation Heuristic |
| linreg_knn | Grouped KNN | Linear Regression |
| dt_knn | Grouped KNN | Decision Tree |
| xgb_knn | Grouped KNN | XGBoost |
| rf_knn | Grouped KNN | Random Forest |
| mlp_knn | Grouped KNN | Perceptron |
| linreg_mean | Mean Imputation | Linear Regression |
| dt_mean | Mean Imputation | Decision Tree |
| xgb_mean | Mean Imputation | XGBoost |
| rf_mean | Mean Imputation | Random Forest |
| mlp_mean | Mean Imputation | Perceptron |
| linreg_regression | Chained Equation Regression | Linear Regression |
| dt_regression | Chained Equation Regression | Decision Tree |
| xgb_regression | Chained Equation Regression | XGBoost |
| rf_regression | Chained Equation Regression | Random Forest |
| mlp_regression | Chained Equation Regression | Perceptron |
Each configuration was evaluated through five-fold rolling cross-validation. Two baselines were considered alongside 5 algorithms × 3 imputation techniques
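The configuration list is the cross product of algorithms and imputation methods plus the two baselines. A short sketch that rebuilds the model names used in the overview table:

```python
from itertools import product

algorithms = ["linreg", "dt", "xgb", "rf", "mlp"]
imputations = ["knn", "mean", "regression"]

# Two baselines without imputation, then every algorithm paired with every imputation.
configurations = ["BASELINE 1", "BASELINE 2"] + [
    f"{algorithm}_{imputation}" for imputation, algorithm in product(imputations, algorithms)
]
print(len(configurations))
```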
Median rolling cross-validated results across period 2015–2019
| Model | RMSE | Spearman | Accuracy within one | Lift |
|---|---|---|---|---|
| BASELINE 1 (XGB no impute) | 272.13 | 0.4658 | 0.8304 | 3.3333 |
| BASELINE 2 (Heuristic) | / | 0.4015 | 0.8218 | 2.7272 |
| linreg_knn | | | | |
| dt_knn | 360.97 | 0.2590 | 0.7692 | 1.6667 |
| xgb_knn | | | | 2.7273 |
| rf_knn | | | 0.8273 | 3.3333 |
| nn_knn | 338.75 | 0.2756 | 0.7778 | 0.8333 |
| linreg_mean | | | | |
| dt_mean | 344.82 | 0.2696 | 0.7818 | 2.5000 |
| xgb_mean | 273.67 | 0.4200 | 0.8273 | 2.5000 |
| rf_mean | 271.25 | | | 2.7273 |
| nn_mean | 328.96 | 0.0498 | 0.7321 | 0.0000 |
| linreg_regression | | | | |
| dt_regression | 395.92 | 0.3168 | 0.7818 | 2.1429 |
| xgb_regression | 286.06 | 0.4128 | 0.8034 | 2.5000 |
| rf_regression | | | | 2.5000 |
| nn_regression | 313.99 | 0.2652 | 0.7768 | 1.8182 |
Underlined values indicate a better performance than the baselines; the top performer is indicated in bold. Grouped KNN imputation before random forest performs best with regard to ranking riders. Chained equation regression imputation before linear regression performs best with regard to identifying top performers
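Two of the reported metrics, RMSE and the Spearman rank correlation, have standard definitions and can be sketched in plain Python. The definitions of "accuracy within one" and "lift" are not given in this excerpt, so they are left out. The scores below are hypothetical.

```python
import math
from typing import List


def rmse(actual: List[float], predicted: List[float]) -> float:
    """Root mean squared error between actual and predicted scores."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return covariance / (spread_x * spread_y)


# Hypothetical performance scores for four riders.
actual = [500.0, 300.0, 100.0, 0.0]
predicted = [400.0, 350.0, 50.0, 20.0]
print(rmse(actual, predicted), spearman(actual, predicted))
```

RMSE penalizes errors in the predicted score itself, while Spearman only looks at whether riders are ordered correctly, which is why a model can do well on one and poorly on the other.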
Results riders turning professional in 2019: underlined if better than baselines; top performer in bold
| Model | RMSE | Spearman | Accuracy within one | Lift |
|---|---|---|---|---|
| BASELINE 1 (XGB no impute) | 458.93 | 0.4931 | 0.7890 | 0.9090 |
| BASELINE 2 (Heuristic) | / | 0.4553 | 0.8315 | 3.3333 |
| linreg_knn | | 0.4718 | | |
| dt_knn | 493.96 | 0.2590 | 0.7523 | 0.9091 |
| xgb_knn | | 0.4271 | | 2.7273 |
| rf_knn | | | 0.8257 | 2.7273 |
| nn_knn | 479.37 | 0.2756 | 0.7982 | |
| linreg_mean | | 0.4396 | | |
| dt_mean | 501.09 | 0.2685 | 0.7798 | 1.6667 |
| xgb_mean | 489.48 | 0.3748 | 0.8257 | 1.8182 |
| rf_mean | | 0.4833 | | 2.7273 |
| nn_mean | 494.54 | −0.2409 | 0.6330 | 0.0000 |
| linreg_regression | | 0.4810 | | |
| dt_regression | 521.40 | 0.2590 | 0.7982 | 0.9091 |
| xgb_regression | | 0.4081 | 0.8257 | |
| rf_regression | | | | |
| nn_regression | 466.07 | 0.3714 | 0.7890 | 1.8182 |
Reported results are based on a single measurement from the last fold. Results are biased by the COVID-19-influenced 2020 season. Nonetheless, the methods perform adequately apart from an increase in RMSE
Computation time imputation methods (in seconds)
| Fold | KNN imputation | Mean imputation | Regression imputation |
|---|---|---|---|
| 2015 | 1.92 | 0.06 | 13,025.15 |
| 2016 | 1.59 | 0.03 | 15,035.95 |
| 2017 | 2.46 | 0.06 | 20,488.19 |
| 2018 | 2.75 | 0.03 | 26,363.01 |
| 2019 | 3.46 | 0.03 | 32,997.33 |
Regression imputation is computationally much more expensive than KNN imputation or mean imputation. KNN imputation gives a good trade-off between predictive performance and computation time
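To put the gap in perspective, the timings in the table can be converted to hours and to a slowdown factor relative to KNN imputation. The numbers are copied from the table; only the variable names are made up.

```python
# Seconds per fold from the computation-time table: (KNN, mean, regression).
timings = {
    2015: (1.92, 0.06, 13025.15),
    2016: (1.59, 0.03, 15035.95),
    2017: (2.46, 0.06, 20488.19),
    2018: (2.75, 0.03, 26363.01),
    2019: (3.46, 0.03, 32997.33),
}

summary = {
    fold: {
        "regression_hours": regression / 3600,
        "regression_vs_knn": regression / knn,
    }
    for fold, (knn, _mean, regression) in timings.items()
}

for fold, row in summary.items():
    print(f"{fold}: {row['regression_hours']:.2f} h, {row['regression_vs_knn']:.0f}x slower than KNN")
```

For the 2019 fold this comes to roughly nine hours of regression imputation against a few seconds for KNN.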
Coefficients final regression imputation—linear regression model
| Variable | Coefficient Value |
|---|---|
| Tour des Flandres U23 best result | −0.001 |
| Tour du Pays de Vaud GC best result | −0.061 |
| Podium ratio U23 | 1391.267 |
| Driedaagse van Axel stage best result | −0.123 |
| Tour des Flandres U23 participation | −2.267 |
| Top 5 ratio Junior | 1190.966 |
| Trofeo Comune di Vertova participation | −0.106 |
Model highly depends on consistency, as measured through podium ratio U23 and top 5 ratio Junior
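The dominance of the ratio features becomes visible when coefficients are multiplied by plausible feature values. The coefficients below are copied from the table; the rider's feature values are hypothetical, and the intercept and any other terms of the model are not shown in this excerpt, so the result is a set of partial contributions rather than a full prediction.

```python
# Coefficients copied from the table above.
coefficients = {
    "Podium ratio U23": 1391.267,
    "Top 5 ratio Junior": 1190.966,
    "Tour du Pays de Vaud GC best result": -0.061,
    "Tour des Flandres U23 participation": -2.267,
}

# Hypothetical rider: 10% podiums as U23, 5% top-5 places as Junior,
# 100th in the Tour du Pays de Vaud GC, one Tour des Flandres U23 start.
rider = {
    "Podium ratio U23": 0.10,
    "Top 5 ratio Junior": 0.05,
    "Tour du Pays de Vaud GC best result": 100,
    "Tour des Flandres U23 participation": 1,
}

# In a linear model each feature contributes coefficient * value to the prediction.
contributions = {name: coefficients[name] * rider[name] for name in coefficients}
for name, value in sorted(contributions.items(), key=lambda item: -abs(item[1])):
    print(f"{name}: {value:+.2f}")
```

Even a modest podium ratio outweighs a poor finishing place in a single race by more than an order of magnitude under these coefficients.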
Fig. 4 SHAP-based variable importances of KNN imputation—random forest regression model. Model shows a large dependence on consistency, but is also influenced by key races. (Color figure online)
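SHAP values are Shapley values applied to feature attribution. For a handful of features they can be computed exactly by enumerating feature subsets, which the sketch below does against a single reference point. This is an illustration of the concept, not the procedure used for the figure: the toy model, the feature order, and the reference rider are assumptions, and practical SHAP tooling relies on faster model-specific algorithms.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List


def shapley_values(
    predict: Callable[[List[float]], float],
    instance: List[float],
    baseline: List[float],
) -> List[float]:
    """Exact Shapley values: features outside a subset take the baseline value."""
    n = len(instance)

    def value(subset: set) -> float:
        mixed = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                chosen = set(subset)
                phi[i] += weight * (value(chosen | {i}) - value(chosen))
    return phi


def toy_model(x: List[float]) -> float:
    """Hypothetical scorer: podium ratio, top-5 ratio, best place in a key race."""
    return 1000 * x[0] + 500 * x[1] - 2 * x[2]


rider = [0.20, 0.10, 5]        # hypothetical rider
reference = [0.05, 0.05, 20]   # hypothetical average rider
phi = shapley_values(toy_model, rider, reference)
print(phi)
```

The attributions sum to the difference between the rider's prediction and the reference prediction, which is the property that makes per-rider plots of positive and negative drivers add up.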
Top 10 predicted top performers according to linear regression model
| Rider | Professional since |
|---|---|
| Repa Vojtech | 2021 |
| Larsen Niklas | 2020 |
| Pidcock Thomas | 2021 |
| Colleoni Kevin | 2021 |
| Stewart Jake | 2020 |
| Meeus Jordi | 2021 |
| Hailemichael Mulu Kinfe | 2020 |
| Van Gils Maxim | 2021 |
| Rodenberg Frederik | 2020 |
| Van Wilder Ilan | 2020 |
Several riders have already exhibited good performance at the professional level
Fig. 5 SHAP values Ärm Rait: Positive drivers (red) and negative drivers (blue). Rider is selected due to large number of podiums and European Championship result. (Color figure online)
Fig. 6 SHAP values Colnaghi Luca: Positive drivers (red) and negative drivers (blue). World Championship driver is imputed: KNN imputation method is capable of imputing the Hilly U23 capabilities into one single feature. (Color figure online)
Fig. 7 Spider plot Rait Ärm. Four principal components are visualized which represent all terrain (PC1), sprinters (PC2), flat terrain (PC3), and uphill (PC4). Compared with four archetypical professionals. Most similar to sprinter Caleb Ewan. (Color figure online)