Literature DB >> 35270617

Predicting and Analyzing Road Traffic Injury Severity Using Boosting-Based Ensemble Learning Models with SHAPley Additive exPlanations.

Sheng Dong¹, Afaq Khattak², Irfan Ullah³, Jibiao Zhou⁴, Arshad Hussain⁵.

Abstract

Road traffic accidents are one of the world's most serious problems, as they result in numerous fatalities and injuries, as well as economic losses each year. Assessing the factors that contribute to the severity of road traffic injuries has proven to be insightful. The findings may contribute to a better understanding of and potential mitigation of the risk of serious injuries associated with crashes. While ensemble learning approaches are capable of establishing complex and non-linear relationships between input risk variables and outcomes for the purpose of injury severity prediction and classification, most of them share a critical limitation: their "black-box" nature. To develop interpretable predictive models for road traffic injury severity, this paper proposes four boosting-based ensemble learning models, namely a novel Natural Gradient Boosting, Adaptive Gradient Boosting, Categorical Gradient Boosting, and Light Gradient Boosting Machine, and uses a recently developed SHapley Additive exPlanations analysis to rank the risk variables and explain the optimal model. Among four models, LightGBM achieved the highest classification accuracy (73.63%), precision (72.61%), and recall (70.09%), F1-scores (70.81%), and AUC (0.71) when tested on 2015-2019 Pakistan's National Highway N-5 (Peshawar to Rahim Yar Khan Section) accident data. By incorporating the SHapley Additive exPlanations approach, we were able to interpret the model's estimation results from both global and local perspectives. Following interpretation, it was determined that the Month_of_Year, Cause_of_Accident, Driver_Age and Collision_Type all played a significant role in the estimation process. According to the analysis, young drivers and pedestrians struck by a trailer have a higher risk of suffering fatal injuries. The combination of trailers and passenger vehicles, as well as driver at-fault, hitting pedestrians and rear-end collisions, significantly increases the risk of fatal injuries. This study suggests that combining LightGBM and SHAP has the potential to develop an interpretable model for predicting road traffic injury severity.

Entities: Chemical

Keywords: SHapley Additive exPlanations; boosting-based ensemble models; road traffic injuries; traffic safety

Mesh：

Year: 2022 PMID： 35270617 PMCID： PMC8910532 DOI： 10.3390/ijerph19052925

Source DB: PubMed Journal: Int J Environ Res Public Health ISSN： 1660-4601 Impact factor: 3.390

1. Introduction

Road traffic injuries (RTIs) are a leading cause of morbidity and death on a global scale. Understanding the sequence of events preceding a road traffic injury is critical for enhancing road safety and developing interventions that reduce the prevalence and severity of road trauma. By 2030, RTIs are expected to overtake cancer as the sixth leading cause of death worldwide [1]. Despite significant advances in traffic and transportation systems, road safety continues to be a concern, with over 1.35 million people killed and another 50 million injured in road collisions each year. Automobile accidents killed 22,441 people and injured 2.18 million people in 2015, according to the National Highway Traffic Safety Administration [2], which collects road data for all 50 states in the United States. The total number of fatal crashes in the United States increased to approximately 35,000 in 2016. Additionally, according to the Washington Annual Collision Summary (WACS), the region experienced 117,053 crashes, including 499 fatal collisions, 36,531 injury collisions, and 77,358 property-damage-only crashes, implying that a crash occurs every 4.5 min and someone dies in a crash every 16 h [3]. Traffic accidents have resulted in enormous economic, physical, and emotional distress for the families involved, as well as productivity losses for society as a whole. The World Health Organization (WHO) [4] estimates that 74% of worldwide road traffic accidents occur in low- and middle-income countries combined, making up 90% of the total global number. In addition, a large proportion of the world’s population lives in countries with a low or middle income. There are still 54% of the world’s total registered vehicles in these countries and a large proportion in relation to the number of vehicles involved in traffic accidents. The fact that death rates fell in high-income countries between 2000 and 2015, but rose in low-income countries, shows just how serious traffic issues are in developing countries. Rapid population growth in Asia’s developing countries, particularly in South Asia, increased mobility, and increased motorized traffic on high-speed roads, all contribute to the development of RTIs. Almost no day passes without a road traffic accident on their national highways or motorways, resulting in an increasing number of injuries and fatalities and significant economic losses [5,6,7,8,9]. As is the case in other developing countries, Pakistan is increasingly confronted with serious road safety issues. Pakistan’s vehicle ownership increased by 18.3% over the last two decades, owing to a 2% annual population increase and a 3.3% increase in rapid urbanization [10]. In addition, air travel is out of reach for the majority of the population, and the potential for inland water transportation has not been fully realized. The excessive reliance on the roadway network has placed an undue strain on the National Highways. The occurrence of road traffic accidents has a substantial psychological and economic impact on both individuals and economies. It is very important to figure out and measure the factors that cause accidents in order to improve proactive highway design and accident frequency reduction. Understanding the causes of road traffic fatalities and injuries has remained a priority, and road safety analysis has advanced significantly over the years, particularly in developing methodologies for modelling the relationship between injury severity and risk factors, gaining knowledge about the mechanism of accident occurrence, and developing safety policies and countermeasures. Over the past several decades, efforts have been made to comprehend the intrinsic relationship between collision frequency and risk variables, such as road geometry, vehicle type, collision type, seasonal effect, traffic regulation, time of day, driver characteristics, and environmental conditions. Numerous statistical models have been used to estimate the severity of traffic accident injuries. In the past, injury severity analysis and prediction have been dominated by statistical methods, such as the linear, nonlinear, generalised linear model (GLM), Poisson regression model (PRM), ordered probit model (OPM), mixed logit model (MLM), Bayesian ordered probit model (BOPM), random parameters ordered probit model (RPOPM), and cellular automata (CA) model, which were regarded as reasonable attempts at thoroughly formulating the relationship between the number of predicting variables [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]. Modelling overall accidents may not be as valuable as one might think in terms of creating safety countermeasures, as various types of accidents are frequently linked to distinct sets of primary variables. By combining all types of crashes, it may become more difficult to identify and precisely estimate the impact of a risk factor on various types of crashes, as well as to devise suitable and cost-effective safety mitigation strategies. Thus, one of the goals of this study is to determine how the varying effects of risk variables affect the severity of road traffic injuries using advanced boosting-based ensemble models. Traditional statistical approaches, while having rigor and well-defined functional forms, frequently necessitate strict model assumptions about the relationship between independent and dependent variables [26]. The model estimation outputs may be skewed or incorrect if the assumptions are violated. Similarly, combining modern data sources frequently produces complex data sets with a large number of dimensions that are difficult to model using traditional statistical techniques. Machine learning methods, on the other hand, are highly adaptable, require few or no prior assumptions about the data, and can handle missing values, noise, and outliers [27]. In the field of traffic safety research, machine learning techniques have received a lot of attention. It has become one of the most widely used and fascinating tools for estimating the severity of injuries in road traffic accidents [28,29,30,31,32,33,34,35,36,37,38,39,40,41]. Tree-based ensemble learning models have been used in a variety of fields, including landslide susceptibility mapping [42], software bug prediction [43], and bioinformatics [44]. While high predictive performance is desirable for machine learning models, it is more critical to understand the underlying mechanism by which accidents occur, especially for the purpose of developing accident prevention countermeasures. As a result, in a number of machine learning-based traffic safety studies, sensitivity analysis was used to quantify the effects of risk factors on the severity of injuries [45,46]. To summarise, two types of sensitivity analysis exist: partial dependence plots (PDP) and local sensitivity analysis (LSA) [47]. However, both the LSA and the PDP are concerned about the assumption of independence. In practice, the assumption of independent risk variables may fail to hold true, resulting in erroneous estimates of the safety effect. To address this issue, several researchers used the Shapley Additive exPlanations (SHAP) approach [48], which estimates the variable’ total, main, and interaction effects. Hu et al. [49], for example, utilised SHAP to assess the outcomes of a convolutional neural network (CNN) model in order to assess critical variables that affect the likelihood of crashes at road intersections. The SHAP analysis was used to identify critical factors in E. coli concentration data that could be used to predict beach water quality [50]. Parsa et al. [51] used SHAP to investigate the effect of various variables on the occurrence of highway accidents. Additionally, SHAP analysis is used to rank different variables in order to identify failure modes in reinforced concrete columns [52]. In terms of data, the database’s quality and breakdown dictate the statistical and machine learning approaches used and the reliability of the findings, which can assist and guide policy makers in developing strategic plans and implementing initiatives to strengthen traffic safety and thus make the roads safer. It should be noted that any approach used is constrained by the database’s limitations. Nonetheless, there is a disparity among countries, and even among local jurisdictions, in terms of data uniformity. For instance, Casado-Sanz et al. [53] conducted an international comparison of various risk variables compiled from international databases. Table 1 demonstrates an international comparison of risk variables in major databases and guidelines of various countries (USA, Australia, New Zealand, and databases, and EU directive requirements) and situates Pakistan within the context of road traffic safety.

Table 1

International comparison of risk variables from databases and guidelines of various countries with Pakistan.

Variables	EU Directive	USA	Australia	New Zealand	Pakistan
Crash Location	Precise as possible location	Road name, GPS coordinates	Road name, reference point, distance, direction	Road name, GPS coordinates	District and kilometer marker no. starting from Karachi city (00)
Crash Narrative	No	No	Yes	No	No
Crash Sketch	No	No	Yes, restricted access	Yes	No
Crash Type	Yes	Recorded in the traffic unitsSection	Yes	Yes	Yes
Collision Type	Yes	8 descriptors	Yes	Yes	Yes
Contributing Circumstances	No	Environmental circumstances	Yes	Yes	Yes
Weather Conditions	Yes	10 descriptors	Yes	5 descriptors	3 descriptors
Lighting Condition	Yes	7 descriptors	Yes	7 descriptors	3 descriptors
Reported Crashes	Notspecified	All severities	All injury severities	All severities	All severities
Definition Non-fatal Injury Levels	Severe and non-severe injuries	Suspected serious injury, suspected minor injury, possible injury	Injured, admitted to hospital injured, required medical treatment	Major and minorinjuries	Fatal injury, major injury, minor injury, and no injury
Contributing Circumstances	No	11 descriptors	No	Numerous causes	Yes
Speed Limit	Yes	Yes	Yes	Yes	No
Surface Conditions	Yes	10 descriptors	Yes	3 descriptors	Yes
Road Curves	No	Yes	Yes	4 descriptors	Yes
Gradient	No	Yes	No	No	No
Age	Yes	Date of birth	Yes	Yes	Yes
Gender	Yes	Yes	Yes	Yes	Yes
Nationality	Yes	No	Foreign drivers’ identification	Foreign drivers’ identification	No
Alcohol Level	Yes	Yes	Yes	Yes	No
Drug Test Results	No	Yes	Yes	Yes	No
Safety Equipment	Yes	Yes	Yes	Yes	No
Curve Radius	No	Yes	Yes	Yes	No
Curve Length	No	Yes	Yes	Yes	No

In this study, we aim to address the underappreciated issue of model interpretability for predicting injury severity by developing predictive and transparent booting-based ensemble learning models. To accomplish this, we used and compared four boosting-based ensemble learning models: Novel Natural Gradient Boosting (NGBoost) [54], Categorical Boosting (CatBoost) [55], Light Gradient Boosting Machine (LightGBM) [56], and Adaptive Boosting (AdaBoost) [57]. Additionally, we broadened our comparison by including commonly used machine learning models in studies of traffic safety. We then used SHAP to conduct variable importance analyses on the optimal model in order to predict the most important risk variables for injury severity, assess their robustness, and dissect their interactions with other risk variables. The SHAP yields significant findings that can be used to guide the implementation of cost-effective traffic safety countermeasures. The rest of this paper is organized as follows: The following section summarizes the study’s methodology and data, followed by descriptions of the modelling processes, namely NGBoost, CatBoost, LightGBM, and AdaBoost, as well as Shapley Additive exPlanations. Section 3 discusses the findings and the optimal model interpretation. Finally, Section 4 summarizes the findings and makes additional research recommendations.

2. Methodology

The entire operational framework proposed in this study is depicted in Figure 1. To begin with, the original accident dataset is preprocessed by eliminating redundant and erroneous information. The ensemble learning models based on boosting are initially developed on the basis of data partitioning into training, validation, and testing data sets. The training data set is used to build the classification model, the validation data set is used to fine-tune the hyperparameters of the models, and the test data set is used to evaluate the models’ performance. Once the optimal model with the best performance is identified, SHAP approach is utilized to establish additive attributes that are then employed to determine the importance of variables for injury severity and the contributions of various risk variables to each severity mode.

Figure 1

Framework of boosting-based ensemble learning models and SHAP analysis for model interpretation.

2.1. Study Route

The data for this study come from the records of traffic accidents that occurred on urban and rural segments of National Highway N-5 (Peshawar to Rahim Yar Khan) between 2015 and 2019. The N-5 is a two-lane divided two-way highway that connects Torkham in Pakistan’s Khyber Pakhtunkhwa province to Karachi in Sindh province, connecting major cities along its alignment as illustrated in Figure 2. It is Pakistan’s longest highway, measuring 1819 km (1310 miles) in length, and passes through three provinces: Sindh, Punjab, and Khyber Pakhtunkhwa. It is a major arterial connecting the north and south of the country and carries the most of the country’s traffic. Most heavy vehicles use this route to transport freight from Karachi’s seaport to upcountry cities. The maximum speed limit for light transport vehicles (LTV), which includes passenger cars, pickup trucks, and vans, ais 100 km per hour. The maximum speed limit for heavy transport vehicles, which includes buses, trucks, and trailers, is 90 km per hour.

Figure 2

National Highway N-5 in Pakistan (Peshawar to Rahim Yar Khan Section).

2.2. Data Description

Pakistan’s National Highways and Motorway Police (NH & MP) are tasked with ensuring the safety and security of the country’s highways and motorways. It also keeps track of road traffic crashes by filling out a crash investigation form at the crash scene as soon as one occurs. The NH & MP utilizes a four-page crash form called “Microcomputer Accident Analysis Proforma” (MAAP) for the purpose of recording crash minutiae. The MAAP tracks each accident on National Highway N-5 using a set of twenty-two distinct variables pertaining to the environment, the crash, the roadway, the vehicle, and the drivers. Similarly, MAAP depicts injury severity levels according to four categories: fatal injury, major injury, minor injury, and no injury (property damage only (PDO)). In our study, we classified PDO, minor, and major injuries as “non-fatal injuries”, and the rest as “fatal injuries” as part of a binary classification problem. Table 2 below summarizes these variables with their description and occurrence frequencies.

Table 2

Description of variables in the National Highway N-5 accidents dataset.

Type of Variable	Variable	Description	Marginal Frequency(%)
Injury Severity	Injury Severity Category	Fatal/Non-Fatal	38.09/61.91
Vehicle Specific	Type_of_Vehicle	Rickshaw/Motorcycle/Bicycle/Car/Pickup/Minibus/Bus/Truck/Dumper/Trailer/Tractor	5.34/6.78/10.77/12.69/3.50/9.19/8.48/20.08/4.43/16.44/2.30
	Vehicle_Age(years)	0–10/11–20/21–30/31–40/41+	32.01/36.49/15.30/9.30/6.90
	Number_of_Vehicles(Number of vehicles in crash)	Single/Multiple	33.46/66.54
Driver Specific	Driver Gender	Female/Male	0.001/99.99
	Driver_Age(Years)	18–25/26–30/31–35/36–40/41–45/46–50/51–55/55+	18.18/16.83/14.90/14.58/13.62/10.92/5.84/5.14
	Driving_License	No/Yes	46.52/53.48
Environment Specific	Lighting_Condition	Daylight/Night with Road Lights/Night without Road Lights	69.11/5.33/25.56
	Weather_Condition	Sunny/Cloudy/Rainy	89.85/3.59/6.56
	Visibility_Condition	Clear/Fog/Smog	96.41/3.08/0.50
Temporal Specific	Month_of_Year	January/February/March/April/May/June/July/August/September/October/November/December	5.65/6.29/10.08/8.73/5.27/6.87/14.96/10.08/13.17/6.10/7.32/5.46
	Day_of_Week	Monday/Tuesday/Wednesday/Thursday/Friday/Saturday/Sunday	10.76/12.67/14.52/13.51/16.87/16.82/14.85
	Time_of_Day	12:00:00 a.m.–3:59:59 a.m./4:00:00 a.m.–7:59:59 a.m./8:00:00 a.m.–11:59:59 p.m./12:00:00 p.m.–3:59:59 p.m./4:00:00 p.m.–7:59:59 p.m./8:00:00 p.m.–11:59:59 p.m.	8.97/14.41/23.09/21.13/21.02/11.38
	Type_of_Day	Weekday/Weekend	68.22/31.78
Roadway Specific	Alignment	Straight/Horizontal Curve/Vertical Curve/Both Horizontal and Vertical Curves	84.36/5.66/4.43/5.55
	Presence_of_Shoulder	No/Yes	2.63/97.37
	Surface_Condition	Dry/Wet	92.49/7.51
	Pavement_Roughness	Smooth/Rough/Potholes	94.23/2.52/3.25
	Road_Type	Urban/Rural	52.86/47.14
	Presence_of_Median	No/Yes	3.64/96.36
	Work_Zone	No/Yes	98.64/1.35
Crash Specific	Collision_Type	Head on Collision/Rear End Collision/Side Collision/Rollover/Skidding/Hit Obstacle on Road/Hit Pedestrian/Hit Animal/Run off Roadway/Hitting Nearby Trees/Fell off Bridge	5.21/43.55/19.17/12.44/3.08/4.88/10.31/0.28/0.78/0.11/0.17
Crash Specific	Cause_of_Accident	Bicycle Rider at-Fault/Wrong Side Overtaking/Pedestrian at-Fault/Pavement Distress/Driver at-Fault/Dozing at the Wheel/Over Speeding/Motorcycle Rider at-Fault/Low Visibility/Mechanical Fault of Vehicle/Sight Obstruction/Slippery Road/Vehicle out of Control/Other	0.56/1.46/7.29/1.51/56.331.40/3.87/3.14/0.39/7.74/1.79/2.35/0.90/2.35/8.91

2.3. Boosting-Based Ensemble Learning Classification Models for Injury Severity

In this research, we have employed four advanced boosting-based ensemble learning models, that is, novel Natural Gradient Boosting (NGBoost), Categorical Boosting (CatBoost), Light Gradient Boosting Machine (LightGBM) and Adaptive Boosting (AdaBoost) to accurately predict injury severity using various risk variables. Below, we explain these four boosting-based ensemble models.

2.3.1. Natural Gradient Boosting (NGBoost)

NGBoost is a supervised natural gradient descent (NGD)-based boosting algorithm that was recently developed [54]. It can be used for classification as well as probabilistic regression [58]. It is composed of three fundamental components: a base learner, a parametric probability distribution, and a scoring rule. The conceptual framework of NGBoost is illustrated in Figure 3.

Figure 3

Block diagram of NGBoost.

In Figure 3, denotes the given input risk variables, is the base learner (Decision Tree), is the boosting iteration, is the predicted target, and denotes the target distribution parameters. The NGBoost model creates a conditional probability distribution function of each predicted output in the range of 0 to 1. The higher the value of the aforementioned function, the more likely it is to correctly predict the data class, and vice versa. The framework made use of boosting to construct a series of decision trees (DTs) with minimal loss during model training. In other words, each DT learned from the preceding tree and improved the next tree in order to enhance the performance of the model. Additionally, it should be noted that this is the first paper, to our knowledge, that discusses the use of the NGBoost model for predicting the severity of road traffic injuries.

2.3.2. Categorical Boosting (CatBoost)

The CatBoost is a novel version of the gradient-boosting decision tree algorithm. It has strong learning capabilities for dealing with extremely nonlinear data [55]. The Gradient Boosting Decision Tree (GBDT) algorithm combines numerous decision trees to create a high-accuracy model, and the progress can be expressed as Equation (1): where denotes the variable vector, denotes the number of trees, denotes a learned parameter, and denotes the learned decision trees that are learned. Given a set of training samples , where denotes the total number of samples in training data, is the sample data points, and indicates true sample label. In order to learn the model in Equation (1), the Equation (2) objective function is required to be minimized: where denotes the predicted sample label, represent the loss function, which is actually the difference between and , and represent the regular function, which is employed to penalize the complexity of . It is defined as Equation (3): where α denotes a penalty parameter, which controls the number of leaf nodes , β represent the regularization parameter, and represents the weight coefficient. Let represent the loss function negative gradient, then the objective function is minimized in the direction of is given by Equation (4): By and large, conventional GBDT algorithms exhibit prediction offset, impairing the model’s generalization ability. CatBoost was proposed with two major enhancements to address this shortcoming [59]: (1) the ordered boosting technique was used to obtain an unbiased gradient estimation and minimise the prediction offset; (2) the oblivious tree technique has been used to improve the model’s reliability and prediction speed. Additionally, to improve the strategy’s handling of categorical variables, the greedy target-based statistics strategy was strengthened by incorporating prior terms into the CatBoost algorithm, which is composed of three major steps: (1) all sample datasets are ordered randomly; (2) similar samples are chosen and the average label for similar samples is calculated; and (3) the variables in each sample are digitised by adding the prior term and its associated weight coefficients. The strategy for optimising greedy target-based statistics is expressed in Equation (5): where denotes the sample’s category variable, denotes the corresponding variable, denotes the increased prior value, and denotes the weight coefficient . Prior values can be used to effectively reduce noise introduced by low-frequency variables and avoid the over-fitting phenomenon.

2.3.3. Light Gradient Boosting Machine (LightGBM)

LightGBM is a variation of the gradient boosting decision tree (GBDT) created by Microsoft and Peking University researchers [56] to overcome the efficiency and scalability challenges associated with GBDT when used to solve problems with high-dimensional input variables and huge data sets. Because it is based on decision tree algorithms, it splits the tree leaf-wise, whereas other boosting methods divide the tree level-wise. When growing on the same leaf, the leaf-wise method reduces loss more than the level-wise strategy, resulting in much higher classification accuracy than any of the known boosting algorithms. Tree development is shown in Figure 4 for both LightGBM and severe gradient boosting. LightGBM employs two novel techniques: exclusive feature bundling (EFB) and gradient-based one-side sampling (GOSS).

Figure 4

Tree expansion in LightGBM.

Suppose a dataset with and as independent and dependent variables, respectively. The sum of the outputs of a set of decision tree models is the predicted value of GBDT , as given by Equation (6): The number of decision trees is . Finding an approximation function that minimizes the loss function is required to fit a GBDT model, as shown in Equation (7): Rather than using information gain to split the internal nodes of each tree as the traditional GBDT does, LightGBM splits the internal nodes using the GOSS method. Specifically, samples with higher absolute gradient values (i.e., top %) are chosen as subset A, whereas samples with lower absolute gradient values are chosen at random to form subset B (i.e., ). As a result, the samples are divided based on the variance gain on as Equation (8): where,, , , . In each iteration, illustrates the negative gradient of the loss function for light GBM outputs. Apart from GOSS sampling, LightGBM employs the EFB technique to expedite the training process without compromising classification accuracy. Numerous applications incorporate mutually exclusive features of high-dimensional and sparse input (i.e., these features cannot be non-zero). EFB has the ability to combine these features into a single feature bundle. This algorithm can be used to make feature histograms from both these feature bundles and individual features. In summary, LightGBM is a novel approach to ensemble learning that utilises GOSS to partition internal nodes based on variance gain and EFB to reduce the dimension of input variables. Additionally, as a decision tree-based model, LightGBM is resistant to multicollinearity. As a result, it is easy to add correlated independent variables to the LightGBM model.

2.3.4. Adaptive Boosting (AdaBoost)

The AdaBoost algorithm’s basic idea is to make classifications by combining a series of weak learners using a weighted majority vote (or sum). It updates the data on a regular basis, taking into account the misclassifications of previous weak learners. This algorithm’s basic steps can be summarised as follows [57]. Given a set of training data , where denotes the total number of data points in training data, is the data point, and indicates true data point label. A strong classifier is generated by the following steps: Initially, all the data points are assigned some equal weights i.e.,: For iterations , Train weak learner using distribution , Get weak hypothesis and Select with low weighted error i.e., . Choose . Update, to obtain as Equation (9): where is a normalization factor (chosen such that will be a distribution). After learning process and weight optimization, the final strong classifier is obtained (Equation (10)), which is based on a linear combination of all the weak classifiers:

2.4. Hyperparameter Tuning

Tuning hyperparameters is an essential step in the training of boosting-based ensemble learning models. This helps to improve the model’s generalization performance, avoid over-fitting, and reduce the complexity of the models. In this study, the classification accuracy of the model is used as a performance metric to perform tuning of hyperparameters. There are several hyperparameter tuning techniques, such as GridSearch, Random Search, and Bayesian Optimization [60]. The GridSearch and Random Search CV techniques iteratively traverse the entire space of available hyperparameter values without regard to previous results and thus become time-consuming for large parameter spaces. In contrast, when deciding which hyperparameter set to evaluate next, Bayesian optimization considers previous evaluations. By making informed parameter combinations, it is able to focus on those areas of the parameter space that it believes will yield the most promising validation scores. This method usually requires fewer iterations to arrive at the best set of hyperparameter values [61,62]. The HyperOpt, which is an open-source Python library for Bayesian Optimization, was used in this research to tune the hyperparameters of four boosting-based ensemble learning models.

2.5. Model Evaluation

The confusion matrix (contingency table) and its associated parameters (classification accuracy, precision, recall, and F1-score), as well as the area under the receiver operating characteristic (ROC) curve, are used in this study to evaluate and select the optimal ensemble model. The confusion matrix enables us to assess the performance of an ensemble model. In the confusion matrix, the predicted class instances are represented by a column, the actual class instances by a row, and the accurate prediction by the diagonal [63]. Figure 5 illustrates the confusion matrix, which can be used to calculate a variety of metrics.

Figure 5

Confusion matrix plot.

The true positives (TP) and true negatives (TN) are correctly classified. A false positive (FP) occurs when an outcome is incorrectly classified as yes or positive when it is actually no or negative. When a positive result is incorrectly classified as negative, it is referred to as a false negative (FN). The TPR quantifies the proportion of positives correctly identified, whereas the FPR quantifies the proportion of negatives incorrectly classified as positives. Precision is a metric that indicates the accuracy of a classification algorithm. The precision is low, indicating a high number of FP. Recall is used to determine the completeness of a classification algorithm. A low recall indicates the presence of a large number of FN. Equations (11)–(15) show the expression for calculating performance metrics. The expression for calculating TPR, on the other hand, is the same as for recall in Equation (13): To gain a better understanding of how well a model performed, an area under the receiver operating characteristic (AUC-ROC) curve is also used [64]. Although the confusion matrix provides a detailed analysis of the model’s performance based on predictions for each category, the AUC is sometimes preferred because it presents a comparison based on a single value. The TPR is plotted against the FPT to create the ROC. AUC values range from 0 (completely incorrect) to 1 (perfectly correct).

2.6. Model Interpretation

Lundberg and Lee [48] proposed SHAP (SHapley Additive exPlanations) as a new method for interpreting machine learning black-box models. This is a framework for estimating each variable’s contribution. It is based on local explanations and game theory [65]. If is the variable of the sample, is the model’s predicted contribution value for the sample, and is the model’s baseline, i.e., the average value of the target variable for all samples in the model, then the SHAP value satisfies Equation (16): where is the SHAP value of and is the dimension of the variables, and can be thought of as the value of the variable in the sample’s contribution to the final predicted value . When is greater than zero, it means that this variable has a positive impact on the predicted value. If is true, the variable has a negative impact on the predicted value. The SHAP value has the advantage of reflecting not only the influence of the variable, but also their positive and negative effects. The SHAP value possesses three properties: local accuracy, absence, and consistency. Local accuracy is defined as the sum of all variable contributions equal to the model’s output. This property fulfills a fundamental requirement of the additive explanatory framework, given by Equation (17): where represents the SHAP value corresponding to variable , is an indicator function that takes the value 1 when the variable appears, and 0 otherwise. Absence i.e., the contribution of missing variable is zero. Not that a characteristic value in the structured data is empty, but that a characteristic is not observed in the sample, that is, Consistency, i.e., if the model structure alters but the degree to which a particular variable influences the output increases or remains constant, the contribution of that variable to the whole will also enhance or remain constant. Finally, the model mapped by the sigmoid function with predicts the probability of the sample is given by Equation (18): where , then the sample classification is equal to 0, otherwise it is 1. Based on this SHAP framework, this paper explains optimal models among the four boosting-based ensemble models in terms of injury severity.

3. Results

In this study, four boosting-based ensemble learning models are used to predict road traffic accident injury severity. Python 3.6, a free and open-source programming language, was used for this purpose. To train the models, we used the Python packages NGBoost, CatBoost, LightGBM, and AdaBoost, as well as the Scikit-learn library. To begin with, National Highway N-5 accident data were obtained from the National Highway and Motorway Police. For model development, we partitioned the National Highway N-5 accident data into the following three subsets: 20% of the data was used for testing and withheld for model performance evaluation, while the remaining 80% was divided into training and validation subsets. The training data set was used to develop the models, and the validation data set was used to estimate the model’s performance while tuning its hyperparameters. Table 3 shows the optimal hyperparameters for four boosting-based models, obtained via Bayesian Optimization.

Table 3

Hyperparameters tuning of boosting-based ensemble models.

Algorithm	Evaluation Metric	Hyperparameters	Range	Optimal Values
CatBoost	Classification accuracy	n_estimators	(100, 5000)	11,600
		max_depth	(0, 10)	5
		learning_rate	(0.001, 0.5)	0.002
LightGBM	Classification accuracy	n_estimators	(100, 5000)	3300
		learning_rate	(0.001, 0.5)	0.042
		max_depth	(0, 10)	6
		lambda_l1	(1 × 10⁻⁸, 10)	0.52
		lambda_l2	(1 × 10⁻⁸, 10)	0.2
NGBoost	Classification accuracy	learning_rate	(0.001, 0.50)	0.01
		n_estimators	(100, 5000)	600
		Max_depth	(0, 10)	4
AdaBoost	Classification accuracy	n_estimators	(100–5000)	800
AdaBoost	Classification accuracy	Learning_rate	(0.01, 1)	0.5

3.1. Performance Assessment

Four performance indicators are used to assess the developed boosting-based ensemble models: classification accuracy, recall, precision, and F1-score. These indicators are deduced from a confusion matrix generated for binary classification problems (Figure 6). Additionally, we focused on the area under the receiver operating characteristic curve as an evaluation metric, as it is an aggregate measure. We conducted a performance comparison of the estimation results for testing dataset using the confusion matrix (Table 4) and the area under the receiver operating characteristic curve. The results of boosting-based ensemble learning models are also compared with existing models (Artificial Neural Network [66] and Logit Model [67]) that have been widely used in the analysis of road traffic injury severity.

Figure 6

Confusion matrix of four boosting-based models using testing data: (a) confusion matrix by CatBoost model; (b) ROC curve output of CatBoost model; (c) confusion matrix by NGBoost model; (d) ROC curve output of NGBoost model; (e) confusion matrix by LightGBM model; (f) ROC curve output of LightGBM model; (g) confusion matrix by AdaBoost model; (h) ROC curve output of AdaBoost model.

Table 4

Comparison of prediction performance of various models.

Performance Metrics	Proposed Boosting-Based Ensemble Models				Existing Models for Road Traffic Injury Severity
Performance Metrics	CatBoost	LightGBM	NGBoost	AdaBoost	ANN[66]	Logit Model [67]
Accuracy (%)	67.34	73.63	61.13	66.87	62.17	60.47
Precision (%)	67.32	72.61	61.33	61.21	62.37	88.71
Recall (%)	55.87	70.09	54.71	59.17	60.60	63.10
F1—Score (%)	51.28	70.81	49.04	60.11	50.81	50.81
AUC	0.684	0.713	0.588	0.619	0.601	0.533

Classification accuracy, precision, recall, F1-score, and AUC-ROC values for the CatBoost model were 67.34%, 67.32%, 55.87%, 51.28%, and 0.683, respectively, using the testing dataset. The classification accuracy, precision, recall, F1-score, and AUC-ROC values for the LightGBM model were 73.63%, 72.61%, 70.09%, 70.81%, and 0.713, respectively, using the testing dataset. The novel NGBoost model, which was used for the first time in the prediction and classification of road traffic injury severity, achieved classification accuracy (61.37%), precision (61.33%), recall (54.71%), F1-score (49.04%), and AUC-ROC (0.588), respectively. Similarly, AdaBoost model achieved classification accuracy (66.87%), precision (61.21%), recall (59.17%), F1-score (60.11%), and AUC-ROC (0.619), respectively. Comparing the performance of the models based on the confusion matrix with their corresponding matrices in Table 4 and AUC-ROC revealed that the Light GBM method provided the optimal estimation performance.

3.2. The Framework of Model Interpretation for Variable

3.2.1. Global Variable Interpretation

Numerous methods exist for determining the relative importance of variables. Certain tree-based models, such as random forest, assign variable importance automatically, and the assignment scheme has effects on the results [68]. However, variable significance is not synonymous with variable contribution. Variable importance indicates which variable has the most significant impact on a model’s performance. Beyond identifying influential variables, the variable contributions provide an intuitive explanation for the considered output (fatal or non-fatal injuries). Two analyses are conducted in this study to determine the importance of each variable and its contribution to model estimation: the model-based variable importance approach and the SHAP summary plot approach. The significance of each variable in the LightGBM model was first assessed. The LightGBM model’s trees are built following the steps outlined in Section 2.3.3. Let us consider the variable set . The variable importance score is then calculated as Equation (19) based on the number of times each variable is used to split the training data across all trees: where represents the weight of each variable, and represents the variable in variable set. Figure 7a shows the best variable importance score of the National Highway N-5 accidents data variable used by the LightGBM algorithm. However, the result does not indicate how much each variable contributes to the likelihood of fatal injuries.

Figure 7

Importance plot: (a) variable importance based on LightGBM split criteria; (b) SHAP summary plot for variable importance.

The SHAP summary assessment was implemented to allow for a more comprehensive model analysis. From the SHAP summary plot, we estimated a quantitative value that aggregated the Shapely values and expressed the model contributions of variables (see Figure 7b). The vertical axis is used to arrange the input variables in ascending order of their influence, beginning with the most influential variable. The horizontal axis represents the SHAP value, and the color scale indicates the variable’s significance level; blue to pinkish-red indicates low to high significance. The greater the number of data points within a given range of SHAP values, the stronger the correlation between the input variable and injury severity. As per SHAP analysis, the most important input variable in determining the injury severity is Month_of_Year, which is ranked first in the Summary Plots, followed by Cause_of_Accident, Driver_Age and Collision_Type. The significant variables are almost consistent with the variable importance obtained by the LightGBM-based model importance approach. However, the order is different due to a difference in the evaluation procedure. It is revealed that data from older drivers have negative SHAP values, which become more negative as the driver’s age increases. It confirms that drivers under the age of 30 are more likely to encounter fatal injuries. The results are consistent with past literature [69,70,71]. When comparing the months of winter to the months of spring and summer (March, June, and July), the variable Month_of_Year reveals that the majority of fatal injuries happen in the spring and summer (March, June, and July). The Cause_of_Accident, such as road surface distress, low visibility, and wrong side overtaking, are less likely to cause fatal injuries compared to bicycle rider at-fault and driver at-fault. The head-on collisions, hitting animal on road result in non-fatal injuries with SHAP values near zero or negative compared to rear-end collision, rollover and hitting obstacle on road that results in fatal injuries. Similarly, trailers and passenger cars are more prone to serious accidents.

3.2.2. Local Variable Interpretation

The SHAP explanatory force plot is shown in Figure 8 for two cases chosen at random from the actual estimate findings. On the graph, the base value (0.5181) represents the mean of the optimal LightGBM model estimations for the training data set. If the model’s output value is greater than the base value, fatal injuries occur (i.e., a lower value than the base value). When the output of the model is greater than the base value, non-fatal injuries occur (that is, a higher value than the base value). The blue arrows indicate the magnitude of the effect of input variables that increase the likelihood of fatal injuries (increased possibility of death in accidents). The effect of input variables on the occurrence of fatal injuries is indicated by red arrows (increased possibility of no death in accidents). The area occupied by variables in each arrow indicates the extent to which that variable has an effect. Consider two examples of estimated values from the training dataset that the LightGBM model correctly classified. One instance has an estimated value that is greater than the base value, while another has an estimated value that is less than the base value.

Figure 8

SHAP explanatory force plot: (a) plot for an instance value less than the base value; (b) plot for an instance value greater than the base value.

Figure 8a depicts a situation in which the estimated value (0.020) is less than the base value (0.5181). In this instance, the three variables that could potentially contribute to the fatal injuries are Driver_Age and Collision_Type. In particular, the significantly higher Driver_Age = 7 index compared to other low indexes reflects the higher expected occurrence of non-fatal injuries. Similarly, the Collision_Type = 7 index (Hit Pedestrian) reflects a higher probability of fatal injuries. Similarly, the SHAP explanatory force plot is shown in Figure 8b for another randomly selected and correctly classified instance in which the estimated value (0.800) is greater than the base value (0.5181). It demonstrates that both the month of the year and the cause of the accident played a role in predicting non-fatal injuries. To be more specific, the Month_of_Year = 9 index (September) and the Cause_of_Accident = 3 index (Tire burst) reflect a lower number of fatal injuries. However, Collision_Type = 4 index (rollover) are more prone to the occurrence of fatal injuries.

3.2.3. Variable Interaction Analysis

The SHAP interaction plots were analysed to determine how the input variables used to estimate the optimal LightGBM model interacted in terms of their contributions (see Figure 9). The interaction analysis examined trends in Month_of_Year, Cause_of_Accident, Driver_Age, Collision_Type and Vehicle_Type. Nonetheless, other variable interactions could be evaluated as well. In Figure 9a, the scatter plots of the red and blue points illustrate the fluctuation in the Collision_Type and Collision_Type SHAP values. The SHAP value for Collision_Type is greater when Collision_Type = Hit Pedestrian, and the pattern is consistent throughout the year. This is self-evident, as the majority of locations along National Highway N-5 lack designated pedestrian crossings, increasing the likelihood that pedestrians will be struck by vehicles when they recklessly cross the road. In the majority of cases, jaywalking results in deadly injuries.

Figure 9

SHAP interaction plots: (a) impact of Collision_Type and Month_of_Year on model output; (b) Driver_Age and Collision_Type on model output; (c) impact of Vehicle_Type and Collision_Type on model output; (d) impact of Driver_Age and Vehicle_Type on model output; (e) impact of Month_of_Year and Vehicle_Type on model output; (f) impact of Vehicle_Type and Cause_of_Accident on model output.

Young drivers are more likely to sustain fatal injuries than older drivers. However, drivers over the age of 50 are more likely to be involved in non-fatal crashes (Figure 9b). Young drivers are more prone to be distracted by various activities, such as ringing cell phones, texting, using a GPS device and listening to music, and these activities are connected with decreased driving performance, increased driver reaction times, decreased vehicle control, and an increased chance of a crash. Additionally, older drivers have more experience and are more likely to have non-fatal injuries. There is a possibility that young drivers are more prone to fatal injuries because they engage in risky behaviour, such as over speeding and not obeying traffic signals and signs. Sometimes, they cannot perceive the potential hazards in the surroundings and choose incorrect behaviour. The scatter plot of Vehicle_Type and the SHAP value for Vehicle_Type are shown in Figure 9c. It demonstrates that the majority of Vehicle_Type = Trailers and Passenger Cars are more likely to result in fatal injuries as a result of rear-end crashes, rollovers, or hitting pedestrians. The trailer’s involvement in the majority of fatal injuries could be explained by the fact that National Highway N-5 is used for long-haul journeys by large trailers transporting goods from Karachi’s seaport to the countryside. Speed limit violations, braking failure, overloading, and driver distraction due to inattention or dozing can all contribute to trailer rollover accidents. Surprisingly, as indicated by the lower SHAP value, motorcycle accidents due to skidding result in non-fatal injuries. Surprisingly, the age group 26–30 is more likely to suffer fatal injuries while driving heavy vehicles compared to driving light vehicles (Figure 9d), consistent with a previous study [72]. The SHAP value for Month_of_Year = July is relatively high in the graph, showing that the majority of serious accidents with fatal injuries for heavier vehicles, such as trucks, trailers, and tractors, occur during July (Figure 9e). This result is also in line with past literature [73,74], which also highlighted that summer is more prone to crashes due to more vehicles traveling. However, in Pakistan this could be because National Highway N-5 is a major arterial that traverses through the Punjab region of Pakistan, which is known for its foggy winters, heavy monsoon precipitation, and sweltering summers. July has a maximum temperature of about 45 °C and sometimes reaches 50 °C. Increased tire pressures may result in a tire burst during July’s severe heat. The monsoon season’s torrential downpours in July may contribute to severe injuries due to wet pavements that reduce a heavy vehicle’s traction and manoeuvrability. Similarly, with Month_of_Year = September, the SHAP value is relatively low. Additionally, one can also note that truck, trailer, and tractor accidents on National Highway N-5 are nearly evenly distributed throughout the year in compared other Vehicle_Type. Additionally, Figure 9f provides valuable insight into the involvement of Vehicle_Type in accidents and the causes of those accidents. When Bicycle Rider at-Fault and Driver at-Fault (such as wrong side overtake, traffic signal violation etc.) of trailers and passenger cars are more likely to be involved in fatal collisions The reason for this might be that bicycle riders in Pakistan do not wear helmets and often ride in the wrong direction due to the absence of bicycle-specific traffic control regulations.

4. Conclusions

In this research, boosting-based ensemble learning models were used in conjunction with SHAP analysis to identify critical risk variables and quantify their effects on the severity of road traffic accident injuries using the National Highway N-5 accident dataset (2015–2019). Accurate models typically capture a complete picture of the underlying relationship between injury severity and risk factors. In this research, LightGBM outperforms CatBoost, novel NGBoost, AdaBoost as well as widely used ANN and logit model in terms of predictive classification accuracy, precision, recall, F1-Score, and AUC. The newly introduced LightGBM model provides another viable option for modelling injury severity. The lack of transparency and interpretability of machine learning models is frequently chastised. This has an effect on the widespread acceptance of models for modelling in traffic and transportation safety, although these models are more flexible and frequently more accurate than traditional predictive methods. To address the interpretability issue associated with LightGBM, the SHAP analysis was used to estimate its output in order to identify significant risk variables and quantify their impact on injury severity. The SHAP analysis’s results can be used to rank a risk variable’s overall significance. More importantly, they can be used to investigate both the individual effects of risk variables (e.g., how specific impacts may fluctuate in response to changes in the risk variable’s value) and their interaction effects. The analysis revealed that the top four important variables that are more likely to affect injury severity are Month_of_Year, Cause_of_Accident, Driver_Age and Collision_Type. However, Type_of_Day and Work_Zone variables have the most negligible impact on the injury severity. Young drivers are more likely than older drivers to sustain fatal injuries. Improved young driver education programs, stricter driving requirements, stricter driving tests, and equipping parents with sufficient knowledge to train and educate drivers could all contribute to a reduction in the overall fatal crash rate. The month of July accounts for a large number of fatal injuries, while September accounts for a greater number of non-fatal injuries. The rainy season in the Punjab region, in addition to the high temperatures of July, may account for tire bursts and skidding, probably due to a reduction in tire-to-wet pavement friction. Extra care in driving is required during the rainy days of the monsoon (July to September) with low vehicle speed and care for runoff. It has also been observed that weekend crashes are more likely to cause fatal injuries. This finding may be due to the high traffic volume on National Highway N-5. People mostly travel to their nearby hometowns or villages on weekends, causing heavy congestion. In addition, people also travel to Sunday markets on weekends. In some cases, heavy vehicles, such as trailers, are involved in fatal crashes when drivers are drowsy at the wheel or, in some cases, crashes with bicycles. There is a need to provide bicycle-specific traffic regulations on National Highway N-5. Moreover, helmet usage should be made compulsory. The strategy outlined in this paper can be used to conduct a large-scale analysis of traffic accidents and serves as a useful tool for policymakers and researchers engaged in traffic safety. This paper discussed only injury severity as determined by boosting ensemble in conjunction with SHAP analysis. Additional research could be conducted by using various other machine learning techniques with various additional risk factors. Additionally, future research could expand the dataset to increase its accuracy, and data-balancing techniques, such as SMOTE and ADAYSN, could be introduced to treat imbalanced data.

25 in total

1. A multivariate Poisson-lognormal regression model for prediction of crash counts by severity, using Bayesian methods.

Authors: Jianming Ma; Kara M Kockelman; Paul Damien
Journal: Accid Anal Prev Date: 2007-12-18

2. Associations of distraction involvement and age with driver injury severities.

Authors: Birsen Donmez; Zishu Liu
Journal: J Safety Res Date: 2014-12-24

3. A machine-learning method for improving crash injury severity analysis: a case study of work zone crashes in Cairo, Egypt.

Authors: Mahama Yahaya; Wenbo Fan; Chuanyun Fu; Xiang Li; Yue Su; Xinguo Jiang
Journal: Int J Inj Contr Saf Promot Date: 2020-04-01