Juan S Angarita-Zapata, Gina Maestre-Gongora, Jenny Fajardo Calderín.
Abstract
Traffic accidents are a worldwide concern, as they are one of the leading causes of death. One policy designed to cope with them is the design and deployment of road safety systems. These aim to predict crashes based on historical records, provided by new Internet of Things (IoT) technologies, to enhance traffic flow management and promote safer roads. Increasing data availability has helped machine learning (ML) to address the prediction of crashes and their severity. The literature reports numerous contributions regarding survey papers, experimental comparisons of various techniques, and the design of new methods at the point where crash severity prediction (CSP) and ML converge. Despite such progress, and as far as we know, there are no comprehensive research articles that theoretically and practically approach the model selection problem (MSP) in CSP. Thus, this paper introduces a bibliometric analysis and experimental benchmark of ML and automated machine learning (AutoML) as a suitable approach to automatically address the MSP in CSP. Firstly, 2318 bibliographic references were consulted to identify relevant authors, trending topics, keyword evolution, and the most common ML methods used in related case studies, which revealed an opportunity for the use of AutoML in the transportation field. Then, we compared AutoML (AutoGluon, Auto-sklearn, TPOT) and ML (CatBoost, Decision Tree, Extra Trees, Gradient Boosting, Gaussian Naive Bayes, Light Gradient Boosting Machine, Random Forest) methods in three case studies using open data portals belonging to the cities of Medellín, Bogotá, and Bucaramanga in Colombia. Our experimentation reveals that AutoGluon and CatBoost are competitive and robust ML approaches to deal with various CSP problems. In addition, we concluded that general-purpose AutoML effectively supports the MSP in CSP without developing domain-focused AutoML methods for this supervised learning problem.
Finally, based on the results obtained, we introduce challenges and research opportunities that the community should explore to enhance the contributions that ML and AutoML can bring to CSP and other transportation areas.
Keywords: Internet of Things; automated machine learning; crash severity prediction; intelligent transportation systems; machine learning; supervised learning
Year: 2021 PMID: 34960494 PMCID: PMC8708527 DOI: 10.3390/s21248401
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Protocol followed for the bibliometric analysis carried out in this study.
Figure 2. Criteria considered for the inclusion and exclusion of literature related to CSP, ML, and AutoML.
Bibliographic information of the studies reviewed for Road Accident, Crash Severity, and AutoML.
| | Road Accident | Crash Severity | AutoML |
|---|---|---|---|
| Timespan | 2010:2021 | 2010:2021 | 2010:2021 |
| Sources (Journals, Books, etc.) | 310 | 53 | 333 |
| Documents | 452 | 67 | 462 |
| Average years from publication | 3.21 | 2.25 | 2.55 |
| Average citations per document | 9.668 | 18.51 | 10.31 |
| Average citations per year per doc | 2.134 | 6.368 | 2.654 |
| References | 10,278 | 1870 | 16,000 |
| Article | 194 | 37 | 232 |
| Book chapter | 3 | 28 | 4 |
| Conference paper | 245 | 1 | 219 |
| Review | 10 | 1 | 5 |
| Keywords Plus (ID) | 2486 | 435 | 3650 |
| Author’s Keywords (DE) | 1243 | 201 | 1129 |
| AUTHORS | | | |
| Authors | 1512 | 185 | 2150 |
| Author Appearances | 1679 | 199 | 2457 |
| Authors of single-authored documents | 12 | 24 | 24 |
| Authors of multi-authored documents | 1500 | 161 | 2126 |
| Documents per Author | 0.299 | 0.362 | 0.215 |
| Authors per Document | 3.35 | 2.76 | 4.65 |
| Co-Authors per Document | 3.71 | 2.97 | 5.32 |
| Collaboration Index | 3.42 | 3.74 | 4.89 |
Figure 3. Year-wise distribution of research in the fields of Road Accident, Crash Severity, and AutoML.
Most cited journals in the fields of road accidents, crash severity, and AutoML.
| Most Cited Sources | Articles |
|---|---|
| Accident Analysis and Prevention | 1463 |
| IEEE Transactions on Intelligent Transportation Systems | 239 |
| Transportation Research Record | 108 |
| Safety Science | 87 |
| Machine Learning | 62 |
| IEEE Access | 57 |
| Journal of Safety Research | 50 |
| Analytic Methods in Accident Research | 45 |
| Traffic Injury Prevention | 28 |
Most cited articles.
| Papers | Year | Citations |
|---|---|---|
| [ | 2000 | 483 |
| [ | 2003 | 323 |
| [ | 2002 | 395 |
| [ | 2010 | 248 |
| [ | 2018 | 152 |
| [ | 2011 | 137 |
| [ | 2016 | 135 |
| [ | 2019 | 132 |
| [ | 2015 | 104 |
| [ | 2018 | 101 |
| [ | 2017 | 92 |
| [ | 2014 | 83 |
| [ | 2020 | 67 |
Figure 4. Keywords related to road accidents approached from an ML perspective.
Machine learning methods commonly used in CSP.
| Reference | Year | Methods |
|---|---|---|
| [ | 2020 | Multi-layer perceptron (MLP), rule induction (PART), and classification and regression trees (SimpleCart) |
| [ | 2020 | Random forest (RF) and Bayesian additive regression trees (BART) |
| [ | 2020 | Feed-forward neural networks (FNN), support vector machine (SVM), fuzzy c-means clustering based feed-forward neural network (FNN-FCM), and fuzzy c-means based support vector machine (SVM-FCM) |
| [ | 2020 | Naïve Bayes (NB), decision tree (DT), logistic regression (LR), Light-GBM, and random forest (RF) |
| [ | 2020 | Multinomial logit, mixed multinomial logit, and support vector machine (SVM) |
| [ | 2020 | Random forest (RF), artificial neural network, and decision tree (DT) |
| [ | 2020 | Multi-layer perceptron (MLP), decision tree (DT), random forest (RF), and naïve Bayes (NB) |
| [ | 2019 | Random forest (RF), AdaBoost with decision tree, gradient boosting decision tree (GBDT), and extreme gradient boosting decision tree (XGBoost) |
| [ | 2019 | Decision tree (DT), k-nearest neighbors (KNN), naïve Bayes (NB), and AdaBoost |
| [ | 2018 | K-nearest neighbors (KNN), decision tree (DT), random forest (RF), and support vector machine (SVM) |
| [ | 2017 | Multinomial logit (MNL), nearest neighbor classification (NNC), support vector machines (SVM), and random forests (RF) |
| [ | 2016 | Decision trees (DT), artificial neural networks, Bayesian networks, support vector machines (SVM), and regression models |
Figure 5. Countries with the highest scientific production at the point where CSP and ML converge.
Raw Data Information.
| City | Records | Year | Data Source |
|---|---|---|---|
| Bogotá | 66,329 | 2015–2019 | |
| Medellín | 150,646 | 2014–2018 | |
| Bucaramanga | 32,857 | 2012–2020 | |
Medellín case study: Description of Binary datasets.
| Dataset | Instances | People Injured (A) | Only Material Damages (B) | Imbalance Ratio (A/B) |
|---|---|---|---|---|
| Med2014 | 41,776 | 23,198 | 18,578 | 1.25 |
| Med2015 | 42,427 | 23,550 | 18,877 | 1.25 |
| Med2016 | 46,838 | 26,594 | 20,244 | 1.31 |
| Med2017 | 42,443 | 22,917 | 19,526 | 1.17 |
| Med2018 | 46,655 | 24,247 | 22,408 | 1.08 |
Bogotá (Bog) and Bucaramanga (Buc) case studies: Description of Multiclass datasets.
| Dataset | Instances | People Injured (A) | Casualties (B) | Only Material Damages (C) | Imbalance Ratio |
|---|---|---|---|---|---|
| Bog2015 | 31,341 | 10,738 | 529 | 20,074 | C/A = 1.87 |
| Bog2016 | 34,988 | 10,578 | 567 | 23,843 | C/A = 2.25 |
| Bog2017 | 35,171 | 10,381 | 538 | 24,252 | C/A = 2.34 |
| Bog2018 | 36,953 | 12,609 | 500 | 23,844 | C/A = 1.89 |
| Bog2019 | 34,990 | 12,371 | 492 | 22,127 | C/A = 1.79 |
| Buc2012 | 4343 | 1587 | 64 | 2692 | C/A = 1.70 |
| Buc2013 | 4055 | 1519 | 67 | 2469 | C/A = 1.63 |
| Buc2014 | 3723 | 1617 | 37 | 2069 | C/A = 1.28 |
| Buc2015 | 3765 | 1705 | 47 | 2013 | C/A = 1.18 |
| Buc2016 | 3733 | 1705 | 64 | 1964 | C/A = 1.15 |
| Buc2017 | 3807 | 1903 | 39 | 1865 | A/C = 1.02 |
| Buc2018 | 3910 | 2100 | 40 | 1770 | A/C = 1.19 |
| Buc2019 | 3724 | 1993 | 42 | 1689 | A/C = 1.18 |
| Buc2020 | 1797 | 1000 | 38 | 759 | A/C = 1.32 |
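The imbalance ratios reported for the binary and multiclass datasets are consistent with dividing the count of the largest class by the count of the second largest. A minimal Python sketch of that arithmetic follows; the helper name `imbalance_ratio` and its return format are illustrative assumptions, not something taken from the study, and the counts are copied from the tables above.

```python
def imbalance_ratio(class_counts):
    """Return (label, ratio) for the two most frequent classes.

    `class_counts` maps a class label to its number of instances.
    Hypothetical helper: name and return format are assumptions.
    """
    # Sort labels by descending count and keep the two largest classes.
    (top, n_top), (second, n_second) = sorted(
        class_counts.items(), key=lambda kv: kv[1], reverse=True
    )[:2]
    return f"{top}/{second}", round(n_top / n_second, 2)


# Class counts copied from the dataset description tables.
med2014 = {"A": 23198, "B": 18578}
bog2015 = {"A": 10738, "B": 529, "C": 20074}
buc2017 = {"A": 1903, "B": 39, "C": 1865}

print(imbalance_ratio(med2014))  # ('A/B', 1.25)
print(imbalance_ratio(bog2015))  # ('C/A', 1.87)
print(imbalance_ratio(buc2017))  # ('A/C', 1.02)
```

Note that the Casualties class (B) never enters these ratios because it is always by far the smallest class; for Buc2017, for instance, the two largest classes are A and C.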
Figure 6. Aggregated ROC_AUC results of ML and AutoML methods in binary datasets from Medellín case study.
Figure 7. Aggregated log_loss results of ML and AutoML methods in the Bogotá and Bucaramanga multiclass datasets.
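The two figures report ROC_AUC for the binary datasets and log_loss for the multiclass ones. As a reminder of what these metrics measure, here is a minimal pure-Python sketch. The function names and the toy inputs are illustrative assumptions and are not taken from the study, which does not state in this record how the metrics were computed.

```python
import math


def roc_auc(y_true, scores):
    """Area under the ROC curve for binary labels (1 = positive).

    Computed as the fraction of (positive, negative) pairs in which the
    positive instance receives the higher score; ties count as half.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def log_loss(y_true, probas, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    `probas[i]` maps each class label to its predicted probability.
    """
    total = 0.0
    for label, p in zip(y_true, probas):
        # Clip to avoid log(0) when a model is overconfident.
        total -= math.log(min(max(p[label], eps), 1 - eps))
    return total / len(y_true)


# Toy inputs (hypothetical, for illustration only).
auc = roc_auc([1, 0, 1, 0], [0.9, 0.4, 0.35, 0.2])
loss = log_loss(
    ["A", "C"],
    [{"A": 0.5, "B": 0.1, "C": 0.4}, {"A": 0.25, "B": 0.25, "C": 0.5}],
)
print(auc, loss)
```

Higher ROC_AUC is better (1.0 means every positive is scored above every negative), whereas lower log_loss is better (0 means full probability on the true class every time).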
Friedman’s average ranking and p-values obtained via the Holm post-hoc test using CatB and Ag60m as control methods in binary and multiclass datasets, respectively.
| Binary Problems | | | Multiclass Problems | | |
|---|---|---|---|---|---|
| Methods | Av. Ranking | p-value | Methods | Av. Ranking | p-value |
| CatB | | - | Ag60m | | - |
| LGBM | 4.2 | 1 | Ag150m | 3.8929 | 1 |
| Ag150m | 4.3 | 1 | Ag15m | 4.0714 | 1 |
| Ag15m | 4.7 | 1 | As150m | 5 | 1 |
| Ag60m | 4.8 | 1 | CatB | 5.2143 | 1 |
| GB | 5.2 | 1 | GB | 5.2143 | 1 |
| Tp | 5.2 | 1 | Tp | 5.4286 | 1 |
| As150m | 7.8 | 1 | As60m | 6.1429 | 1 |
| Tuned_RF | 8.4 | 0.958359 | | 8.7143 | |
| As60m | 9.2 | 0.593928 | | 9.4286 | |
| RF | 10.6 | 0.196244 | | 9.4286 | |
| As15m | 11 | 0.146612 | | 11.7857 | |
| ExtraT | 11.6 | 0.086515 | | 13.4286 | |
| | 14 | | | 14.1786 | |
| | 15 | | | 14.3929 | |
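For readers unfamiliar with how a table of Friedman average rankings and Holm p-values is typically produced, the following is a minimal pure-Python sketch of the textbook procedure: rank the methods on each dataset, average the ranks, compare every method with the control through a normal approximation, and apply Holm's step-down correction. It is an assumption that the study followed exactly these formulas; the function names and the toy scores are hypothetical and are not results from the paper.

```python
import math


def average_ranks(scores, higher_is_better=True):
    """Friedman average rank per method (1 = best); ties share the mean rank.

    `scores[d][m]` is the score of method `m` on dataset `d`.
    """
    methods = list(scores[0])
    totals = dict.fromkeys(methods, 0.0)
    for row in scores:
        ordered = sorted(methods, key=lambda m: row[m], reverse=higher_is_better)
        i = 0
        while i < len(ordered):
            # Find the block of methods tied with ordered[i].
            j = i
            while j + 1 < len(ordered) and row[ordered[j + 1]] == row[ordered[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
            for m in ordered[i:j + 1]:
                totals[m] += mean_rank
            i = j + 1
    return {m: totals[m] / len(scores) for m in methods}


def holm_vs_control(ranks, control, n_datasets):
    """Holm-adjusted p-values comparing every method with `control`."""
    k = len(ranks)
    se = math.sqrt(k * (k + 1) / (6 * n_datasets))
    raw = {
        m: math.erfc(abs(r - ranks[control]) / se / math.sqrt(2))  # two-sided
        for m, r in ranks.items() if m != control
    }
    adjusted, running_max = {}, 0.0
    for step, (m, p) in enumerate(sorted(raw.items(), key=lambda kv: kv[1])):
        # Step-down multiplier: number of hypotheses not yet processed.
        running_max = max(running_max, min(1.0, (len(raw) - step) * p))
        adjusted[m] = running_max
    return adjusted


# Hypothetical ROC_AUC scores of three methods on four datasets.
toy = [
    {"CatB": 0.80, "RF": 0.70, "DT": 0.60},
    {"CatB": 0.82, "RF": 0.75, "DT": 0.65},
    {"CatB": 0.78, "RF": 0.78, "DT": 0.60},
    {"CatB": 0.81, "RF": 0.72, "DT": 0.70},
]
ranks = average_ranks(toy)
adjusted = holm_vs_control(ranks, control="CatB", n_datasets=len(toy))
print(ranks, adjusted)
```

For metrics where lower is better, such as log_loss, call `average_ranks(..., higher_is_better=False)`. Because Holm-adjusted p-values are capped at 1, methods whose rank is close to the control's commonly show a p-value of exactly 1.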