Sheng Dong, Afaq Khattak, Irfan Ullah, Jibiao Zhou, Arshad Hussain.
Abstract
Road traffic accidents are among the world's most serious problems, causing numerous fatalities and injuries as well as economic losses each year. Assessing the factors that contribute to the severity of road traffic injuries has proven insightful, and the findings may contribute to a better understanding and potential mitigation of the risk of serious injuries associated with crashes. While ensemble learning approaches can establish complex, non-linear relationships between input risk variables and outcomes for injury severity prediction and classification, most of them share a critical limitation: their "black-box" nature. To develop interpretable predictive models for road traffic injury severity, this paper proposes four boosting-based ensemble learning models, namely a novel Natural Gradient Boosting (NGBoost), Adaptive Gradient Boosting (AdaBoost), Categorical Gradient Boosting (CatBoost), and Light Gradient Boosting Machine (LightGBM), and uses the recently developed SHapley Additive exPlanations (SHAP) analysis to rank the risk variables and explain the optimal model. Among the four models, LightGBM achieved the highest classification accuracy (73.63%), precision (72.61%), recall (70.09%), F1-score (70.81%), and AUC (0.71) when tested on 2015–2019 accident data from Pakistan's National Highway N-5 (Peshawar to Rahim Yar Khan section). By incorporating the SHAP approach, we were able to interpret the model's estimation results from both global and local perspectives. The interpretation showed that Month_of_Year, Cause_of_Accident, Driver_Age, and Collision_Type all played a significant role in the estimation process. According to the analysis, young drivers and pedestrians struck by a trailer have a higher risk of suffering fatal injuries. The combination of trailers and passenger vehicles, as well as driver at-fault causes, collisions with pedestrians, and rear-end collisions, significantly increases the risk of fatal injuries. This study suggests that combining LightGBM and SHAP has the potential to yield an interpretable model for predicting road traffic injury severity.
Keywords: SHapley Additive exPlanations; boosting-based ensemble models; road traffic injuries; traffic safety
Year: 2022 PMID: 35270617 PMCID: PMC8910532 DOI: 10.3390/ijerph19052925
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 3.390
International comparison of risk variables recorded in the databases and guidelines of various countries with those recorded in Pakistan.
| Variables | EU Directive | USA | Australia | New Zealand | Pakistan |
|---|---|---|---|---|---|
| | Precise as possible location | Road name, GPS coordinates | Road name, reference point, distance, direction | Road name, GPS coordinates | District and kilometer marker no. starting from Karachi city (00) |
| | No | No | Yes | No | No |
| | No | No | Yes, restricted access | Yes | No |
| | Yes | Recorded in the traffic units | Yes | Yes | Yes |
| | Yes | 8 descriptors | Yes | Yes | Yes |
| | No | Environmental circumstances | Yes | Yes | Yes |
| | Yes | 10 descriptors | Yes | 5 descriptors | 3 descriptors |
| | Yes | 7 descriptors | Yes | 7 descriptors | 3 descriptors |
| | Not | All severities | All injury severities | All severities | All severities |
| | Severe and non-severe injuries | Suspected serious injury, suspected minor injury, possible injury | Injured, admitted to hospital injured, required medical treatment | Major and minor | Fatal injury, major injury, minor injury, and no injury |
| | No | 11 descriptors | No | Numerous causes | Yes |
| | Yes | Yes | Yes | Yes | No |
| | Yes | 10 descriptors | Yes | 3 descriptors | Yes |
| | No | Yes | Yes | 4 descriptors | Yes |
| | No | Yes | No | No | No |
| | Yes | Date of birth | Yes | Yes | Yes |
| | Yes | Yes | Yes | Yes | Yes |
| | Yes | No | Foreign drivers’ identification | Foreign drivers’ identification | No |
| | Yes | Yes | Yes | Yes | No |
| | No | Yes | Yes | Yes | No |
| | Yes | Yes | Yes | Yes | No |
| | No | Yes | Yes | Yes | No |
| | No | Yes | Yes | Yes | No |
Figure 1. Framework of boosting-based ensemble learning models and SHAP analysis for model interpretation.
Figure 2. National Highway N-5 in Pakistan (Peshawar to Rahim Yar Khan Section).
Description of variables in the National Highway N-5 accidents dataset.
| Type of Variable | Variable | Description | Marginal Frequency |
|---|---|---|---|
| | Injury Severity Category | Fatal/Non-Fatal | 38.09/61.91 |
| | Type_of_Vehicle | Rickshaw/Motorcycle/Bicycle/ | 5.34/6.78/10.77/ |
| | Vehicle_Age | 0–10/11–20/21–30/31–40/41+ | 32.01/36.49/15.30/9.30/6.90 |
| | Number_of_Vehicles | Single/Multiple | 33.46/66.54 |
| | Driver Gender | Female/Male | 0.001/99.99 |
| | Driver_Age | 18–25/26–30/31–35/36–40/41–45/46–50/51–55/55+ | 18.18/16.83/14.90/14.58/ |
| | Driving_License | No/Yes | 46.52/53.48 |
| | Lighting_Condition | Daylight/Night with Road Lights/Night without Road Lights | 69.11/5.33/25.56 |
| | Weather_Condition | Sunny/Cloudy/Rainy | 89.85/3.59/6.56 |
| | Visibility_Condition | Clear/Fog/Smog | 96.41/3.08/0.50 |
| | Month_of_Year | January/February/March/ | 5.65/6.29/10.08/8.73/5.27/ |
| | Day_of_Week | Monday/Tuesday/Wednesday/ | 10.76/12.67/14.52/13.51/16.87/ |
| | Time_of_Day | 12:00:00 a.m.–3:59:59 a.m./4:00:00 a.m.–7:59:59 a.m./8:00:00 a.m.–11:59:59 a.m./12:00:00 p.m.–3:59:59 p.m./4:00:00 p.m.–7:59:59 p.m./8:00:00 p.m.–11:59:59 p.m. | 8.97/14.41/23.09/21.13/ |
| | Type_of_Day | Weekday/Weekend | 68.22/31.78 |
| | Alignment | Straight/Horizontal Curve/Vertical Curve/Both Horizontal and Vertical Curves | 84.36/5.66/4.43/5.55 |
| | Presence_of_Shoulder | No/Yes | 2.63/97.37 |
| | Surface_Condition | Dry/Wet | 92.49/7.51 |
| | Pavement_Roughness | Smooth/Rough/Potholes | 94.23/2.52/3.25 |
| | Road_Type | Urban/Rural | 52.86/47.14 |
| | Presence_of_Median | No/Yes | 3.64/96.36 |
| | Work_Zone | No/Yes | 98.64/1.35 |
| | Collision_Type | Head on Collision/Rear End Collision/Side Collision/Rollover/ | 5.21/43.55/19.17/12.44/ |
| | Cause_of_Accident | Bicycle Rider at-Fault/Wrong Side Overtaking/Pedestrian at-Fault/Pavement Distress/Driver at-Fault/Dozing at the Wheel/Over Speeding/Motorcycle Rider at-Fault/Low Visibility/Mechanical Fault of Vehicle/Sight Obstruction/Slippery Road/Vehicle out of Control/Other | 0.56/1.46/7.29/1.51/56.33/1.40/3.87/3.14/0.39/7.74/ |
Figure 3. Block diagram of NGBoost.
Figure 4. Tree expansion in LightGBM.
Figure 5. Confusion matrix plot.
Hyperparameter tuning of the boosting-based ensemble models.
| Algorithm | Evaluation Metric | Hyperparameters | Range | Optimal Values |
|---|---|---|---|---|
| CatBoost | Classification accuracy | | (100, 5000) | 11,600 |
| | | max_depth | (0, 10) | 5 |
| | | learning_rate | (0.001, 0.5) | 0.002 |
| LightGBM | Classification accuracy | | (100, 5000) | 3300 |
| | | learning_rate | (0.001, 0.5) | 0.042 |
| | | max_depth | (0, 10) | 6 |
| | | lambda_l1 | (1 × 10⁻⁸, 10) | 0.52 |
| | | lambda_l2 | (1 × 10⁻⁸, 10) | 0.2 |
| NGBoost | Classification accuracy | learning_rate | (0.001, 0.50) | 0.01 |
| | | | (100, 5000) | 600 |
| | | max_depth | (0, 10) | 4 |
| AdaBoost | Classification accuracy | | (100, 5000) | 800 |
| | | learning_rate | (0.01, 1) | 0.5 |
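To make the tuning setup concrete, the following is a minimal sketch of a search over the LightGBM ranges listed in the table. It is an illustration under stated assumptions, not the authors' procedure: the record does not say which search strategy was used, so plain random search stands in for it; `n_trees` is a hypothetical label for the (100, 5000) parameter, which the table leaves unnamed; and `toy_score` is a placeholder for the validation classification accuracy that a real run would compute by training a model.

```python
import math
import random

# Ranges copied from the LightGBM rows of the tuning table.
# "n_trees" is a hypothetical label: the table does not name this parameter.
SEARCH_SPACE = {
    "n_trees": ("int", 100, 5000),
    "learning_rate": ("log", 0.001, 0.5),
    "max_depth": ("int", 0, 10),
    "lambda_l1": ("log", 1e-8, 10.0),
    "lambda_l2": ("log", 1e-8, 10.0),
}


def sample(space, rng):
    """Draw one candidate configuration from the search space."""
    params = {}
    for name, (kind, low, high) in space.items():
        if kind == "int":
            params[name] = rng.randint(low, high)  # inclusive on both ends
        else:
            # Log-uniform draw, suited to ranges spanning several orders of magnitude.
            params[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
    return params


def random_search(score, space, n_trials=50, seed=0):
    """Return the best-scoring configuration among n_trials random draws."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample(space, rng)
        value = score(params)
        if value > best_score:
            best_params, best_score = params, value
    return best_params, best_score


def toy_score(params):
    """Placeholder objective (assumption): peaks when learning_rate is near 0.042."""
    return -abs(math.log10(params["learning_rate"]) - math.log10(0.042))


best_params, best_score = random_search(toy_score, SEARCH_SPACE)
```

In practice `toy_score` would be replaced by a function that trains the model with `params` and returns its accuracy on held-out data, which is the evaluation metric the table names.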
Figure 6. Confusion matrices and ROC curves of the four boosting-based models on the testing data: (a) confusion matrix of the CatBoost model; (b) ROC curve of the CatBoost model; (c) confusion matrix of the NGBoost model; (d) ROC curve of the NGBoost model; (e) confusion matrix of the LightGBM model; (f) ROC curve of the LightGBM model; (g) confusion matrix of the AdaBoost model; (h) ROC curve of the AdaBoost model.
Comparison of prediction performance of the proposed boosting-based ensemble models (CatBoost, LightGBM, NGBoost, AdaBoost) and existing models for road traffic injury severity (ANN, Logit model).
| Performance Metrics | CatBoost | LightGBM | NGBoost | AdaBoost | ANN | Logit Model [ |
|---|---|---|---|---|---|---|
| Accuracy (%) | 67.34 | 73.63 | 61.13 | 66.87 | 62.17 | 60.47 |
| Precision (%) | 67.32 | 72.61 | 61.33 | 61.21 | 62.37 | 88.71 |
| Recall (%) | 55.87 | 70.09 | 54.71 | 59.17 | 60.60 | 63.10 |
| F1-score (%) | 51.28 | 70.81 | 49.04 | 60.11 | 50.81 | 50.81 |
| AUC | 0.684 | 0.713 | 0.588 | 0.619 | 0.601 | 0.533 |
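The accuracy, precision, recall, and F1-score reported for each model are all derived from a confusion matrix. The sketch below shows the standard definitions for a binary fatal/non-fatal problem with "fatal" as the positive class. The counts are hypothetical and are not taken from the paper, and this record does not state whether the reported figures are per-class or averaged across classes, so the code should be read as an illustration of the formulas only.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class.

    tp/fp/fn/tn are the four cells of a 2x2 confusion matrix.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
example = binary_metrics(tp=60, fp=20, fn=40, tn=130)
```

With these counts, accuracy is 190/250 = 0.76, precision is 60/80 = 0.75, recall is 60/100 = 0.60, and F1 is about 0.667.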
Figure 7. Importance plot: (a) variable importance based on LightGBM split criteria; (b) SHAP summary plot for variable importance.
Figure 8. SHAP explanatory force plot: (a) plot for an instance value less than the base value; (b) plot for an instance value greater than the base value.
Figure 9. SHAP interaction plots: (a) impact of Collision_Type and Month_of_Year on model output; (b) impact of Driver_Age and Collision_Type on model output; (c) impact of Vehicle_Type and Collision_Type on model output; (d) impact of Driver_Age and Vehicle_Type on model output; (e) impact of Month_of_Year and Vehicle_Type on model output; (f) impact of Vehicle_Type and Cause_of_Accident on model output.
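The SHAP force and interaction plots rest on Shapley values, which split the gap between a prediction and a base value among the input variables. The following is a minimal sketch that computes exact Shapley values by brute force for a toy model. Everything in it is an assumption made for illustration: `toy_model`, its coefficients, and the two binary features are hypothetical, and a single all-zeros reference input stands in for the base value, whereas SHAP's tree explainers use an expectation over data and a far more efficient algorithm.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every ordering in which features switch from baseline to instance."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)       # start from the reference input
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # reveal one feature's actual value
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(x):
    """Hypothetical risk score with an interaction term (not from the paper)."""
    return (0.2
            + 0.3 * x["young_driver"]
            + 0.1 * x["night"]
            + 0.2 * x["young_driver"] * x["night"])


instance = {"young_driver": 1, "night": 1}
baseline = {"young_driver": 0, "night": 0}

phi = shapley_values(toy_model, instance, baseline)
base_value = toy_model(baseline)
```

Here the contributions come out to roughly 0.4 for `young_driver` and 0.2 for `night`: the 0.2 interaction effect is shared equally between the two features. Added to the base value of 0.2, they recover the prediction of 0.8, which is the additivity property a force plot visualizes.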