
Mortality Prediction from Hospital-Acquired Infections in Trauma Patients Using an Unbalanced Dataset.

Mehrdad Karajizadeh, Mahdi Nasiri, Mahnaz Yadollahi, Amir Hussain Zolfaghari, Ali Pakdam.

Abstract

OBJECTIVES: Machine learning has been widely used to predict diseases and to derive useful knowledge in the healthcare domain. Our objective was to predict in-hospital mortality from hospital-acquired infections in trauma patients on an unbalanced dataset.
METHODS: Our study was a cross-sectional analysis of trauma patients with hospital-acquired infections who were admitted to Shiraz Trauma Hospital from March 20, 2017, to March 21, 2018. The study data were obtained from the surveillance hospital infection database. The data included sex, age, mechanism of injury, body region injured, severity score, type of intervention, infection day after admission, and microorganism causes of infections. We developed our mortality prediction models by random under-sampling, random over-sampling, clustering (k-means)-C5.0, SMOTE-C5.0, ADASYN-C5.0, SMOTE-SVM, ADASYN-SVM, SMOTE-ANN, and ADASYN-ANN among hospital-acquired infections in trauma patients. All mortality predictions were conducted in IBM SPSS Modeler 18.
RESULTS: We studied 549 individuals with hospital-acquired infections in a trauma hospital in Shiraz during 2017 and 2018. Prediction accuracy before balancing of the dataset was 86.16%. In contrast, the prediction accuracy for the balanced dataset achieved by random under-sampling, random over-sampling, clustering (k-means)-C5.0, SMOTE-C5.0, ADASYN-C5.0, and SMOTE-SVM was 70.69%, 94.74%, 93.02%, 93.66%, 90.93%, and 100%, respectively.
CONCLUSIONS: Our findings demonstrate that cleaning an unbalanced dataset increases the accuracy of the classification model. Also, predicting mortality by a clustered under-sampling approach was more precise in comparison to random under-sampling and random over-sampling methods.

Keywords:  C5.0; Data Mining; Decision Tree; Healthcare Associated Infections; Injuries; Machine Learning; Mortality

Year:  2020        PMID: 33190462      PMCID: PMC7674815          DOI: 10.4258/hir.2020.26.4.284

Source DB:  PubMed          Journal:  Healthc Inform Res        ISSN: 2093-3681


I. Introduction

Healthcare data mining has been widely used to help predict diseases and extract useful knowledge [1], and it is commonly applied to detect the early progression of diseases. These techniques can be applied to detect cancer, Alzheimer disease, transient ischemic attacks, lung nodules, coating on the tongue, diabetes, hepatitis, traumatic events, polyps, acute pediatric conditions, and Parkinson’s disease [2]. Typically, the target variable is unbalanced, which means that one class does not have as many records as the other. The larger class is called the majority, and the smaller class is called the minority [3]. Building prediction models on unbalanced data is difficult because standard classifiers, such as logistic regression, decision trees, support vector machines (SVM), neural networks, and deep learning, require balanced training sets. On unbalanced data, models often under-predict the rare class, and the two classes may overlap. There are many methods to deal with unbalanced learning, including data-level, algorithm-level, and hybrid methods. In data-level methods, researchers modify the training dataset to make it appropriate for a classifier algorithm. To balance the class distribution, they might generate new objects for the minority group (over-sampling) or remove instances from the majority group (under-sampling). In algorithm-level methods, they tune existing learners to decrease their bias toward the majority group; the cost-sensitive approach is the most commonly used algorithm-level method [4]. Our aim is to predict death by applying various balancing methods to data on hospital-acquired infection among trauma patients. In medical datasets, records in minority classes are often more important than those of the control class. Hence, it is critical to handle unbalanced data to improve recognition rates, and it is worth noting that the appropriate balancing method depends on the context.
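The two data-level strategies described above can be illustrated with a minimal Python sketch. This is not the authors' implementation (they worked in IBM SPSS Modeler); the function names, the toy record layout, and the fixed random seed are assumptions made for illustration. Only the class sizes (464 survivors, 85 non-survivors) come from the study.

```python
import random
from collections import Counter


def _group_by_class(records, label_of):
    """Group records into a dict keyed by class label."""
    groups = {}
    for record in records:
        groups.setdefault(label_of(record), []).append(record)
    return groups


def random_under_sample(records, label_of, seed=0):
    """Randomly drop records so every class shrinks to the smallest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(records, label_of)
    target = min(len(rows) for rows in groups.values())
    balanced = []
    for rows in groups.values():
        balanced.extend(rng.sample(rows, target))
    return balanced


def random_over_sample(records, label_of, seed=0):
    """Randomly duplicate records so every class grows to the largest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(records, label_of)
    target = max(len(rows) for rows in groups.values())
    balanced = []
    for rows in groups.values():
        balanced.extend(rows)
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced


# Hypothetical toy records (patient_id, survival_status) with the study's class sizes:
# 464 survivors (1) and 85 non-survivors (0).
records = [(i, 1) for i in range(464)] + [(464 + i, 0) for i in range(85)]
label_of = lambda record: record[1]

under = random_under_sample(records, label_of)
over = random_over_sample(records, label_of)
print(Counter(label_of(r) for r in under))
print(Counter(label_of(r) for r in over))
```

Under-sampling discards most of the majority class, whereas over-sampling repeats minority records verbatim, so each approach trades lost information against duplicated information.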
Trauma is a leading cause of death worldwide, and injured patients often acquire infections during hospitalization [5]. These infections are the principal cause of mortality and extended hospitalization for trauma patients [6]. Moreover, these types of mortality are among the top five causes of death throughout the world [7]. Trauma patients with hospital-acquired infections have a significantly increased risk of mortality, longer stays in the hospital, and increased cost of equipment or services [8,9], and nosocomial causes account for 80% of in-hospital mortality [10]. Although numerous studies have been done on balancing, there has been little research on the prediction of mortality from hospital-acquired infections in trauma patients using a balanced dataset. On the other hand, context, environment, and predictor variables (such as injury severity score and injured body region) affect the prognostic model. A previous study in Shiraz Trauma Center showed that the accuracy of the traditional scoring system for predicting mortality in trauma patients is under 91% [11]. This research is one of the first works on this topic that handles unbalanced data. We compared various methods of data balancing to predict death related to hospital-acquired infections in trauma patients based on a real dataset gathered in a tertiary-care teaching trauma hospital in Shiraz, Iran. This study tries to determine the best method to precisely predict death from hospital-acquired infections in trauma patients. Accurate prediction models can provide useful information for decision making to manage hospital-acquired infections as a priority in terms of patient treatment.
The objectives of this study were the following: predicting death from hospital-acquired infections in trauma patients in the absence of a balanced dataset (C5.0 and CHAID); predicting death from hospital-acquired infections in trauma patients using a dataset balanced by sampling methods (reduced dataset) (C5.0 and CHAID); clustering hospital-acquired infections in trauma patients by the k-means algorithm; predicting death from hospital-acquired infections in trauma patients in each cluster (C5.0 and CHAID); predicting death from hospital-acquired infections in trauma patients with SMOTE-C5.0 and ADASYN-C5.0; and predicting death from hospital-acquired infections in trauma patients with SMOTE-SVM, ADASYN-SVM, SMOTE-ANN, and ADASYN-ANN. Many previous studies have attempted to handle unbalanced data [12-14] by adopting various approaches, such as using the right evaluation metrics, resampling the training set (under-sampling and over-sampling), using K-fold cross-validation appropriately, ensembling different resampled datasets, resampling with different ratios, and clustering the frequent class. However, no best model for these problems has been identified, because the outcome depends strongly on the techniques, models, and subjects used [2]. In 2013, Roumani et al. [15] indicated that the C5 and SVM algorithms have the highest recall and specificity, respectively, to predict death in an extremely unbalanced ICU dataset. In 2017, Gu et al. [2] reviewed class-unbalanced data and provided techniques to balance data, such as data preprocessing, classification algorithms, and model evaluation. In 2016, Krawczyk [4] reviewed learning methods for unbalanced data and studied various aspects of unbalanced learning, such as classification, clustering, regression, data stream mining, and big data analytics. Further, that review gave directions for handling unbalanced data across domains.
Additionally, in 2011, Paoin [16] observed that the accuracy of the C5.0 and naive Bayes algorithms for predicting death is under 40%.

II. Methods

This study was a cross-sectional analysis of trauma patients with hospital-acquired infections who were admitted to Shiraz Trauma Hospital from March 20, 2017, to March 21, 2018. We aimed to classify unbalanced death records from hospital-acquired infections in trauma patients. For this purpose, we used the cross-industry standard process for data mining (CRISP-DM) to classify highly unbalanced data. CRISP-DM consists of six steps, namely, identifying the problem, understanding the data, preparing the data, modeling, evaluation, and deployment, and it can be a cyclical process [17]. Shiraz Trauma Hospital is affiliated with Shiraz University of Medical Sciences, a national university, which collects hospital-acquired infection data for surveillance and prevention of infections. This reporting aims to reduce hospital-acquired infections. First, the hospital-acquired infection records were extracted from the mortality infection management database. Next, descriptive statistics were computed for all features of the hospital-acquired infection data: frequency and mean ± standard deviation (SD). Bivariate analysis was performed, and a p-value under 0.05 was considered significant. Further, data preprocessing was done to enhance the data mining process in three stages: data selection, cleaning, and transformation. We set the following inclusion criteria. We included all trauma patients above 15 years old with hospital-acquired infections who had been injured in road traffic accidents (car, motorcycle, and pedestrian accidents), falls, assaults, or gunshots, or had been struck by an object. We excluded admissions for elective surgical procedures, complications of previous trauma surgeries, burns, foreign body injuries, suicides, and sports injuries, as well as patients who were referred to another hospital in Shiraz. Note that patients younger than 15 years old were excluded because they were referred to another hospital in Shiraz.
Finally, records of a total of 549 trauma patients with hospital-acquired infections were selected. The variables (sex, age, mechanism of injury, body region injured, severity score, type of intervention, infection day after admission, microorganism causes of infections, and outcome) were chosen from this hospital-acquired infection management database. Such substantial clinical databases tend to be incomplete, dirty, inaccurate, and inconsistent. Hence, for the preparation step, we removed duplicate records, identified missing values, eliminated outliers, and corrected inconsistencies in the database. We randomly split the data into training (70%), testing (20%), and validation (10%) sets. Moreover, when building the decision tree model (CHAID), we stopped splitting when the minimum records in the parent and child branches became 2% and 1%, respectively. In the CHAID algorithm, a p-value of less than 0.05 was considered significant. All data were transformed to an appropriate format for the IBM SPSS Modeler software (IBM, Armonk, NY, USA). Some new features were also derived using other fields. For example, age was calculated by the expiring date and the birthdate. Next, we divided the participants into three age groups based on a previous study: between 15 and 45, between 46 and 64, and above 65 years [18]. Table 1 presents other categorized variables used.
Table 1

Detailed information about dataset used in this study

| No. | Data variable name | Measurement | Data variable categories or values | Role | Definition of the data variable |
|---|---|---|---|---|---|
| 1 | Sex | Nominal | 0 = Female; 1 = Male | Input | The patient’s gender |
| 2 | Age category | Ordinal | 1 = “15–45”; 2 = “46–64”; 3 = “>=65” | Input | The patient’s age at the time of injury |
| 3 | Mechanism of injury | Nominal | 1 = Car accident; 2 = Motorcycle accident; 3 = Pedestrian; 4 = Assault; 5 = Falling; 6 = Struck by objects | Input | The mechanism (or multiple injury factor) that caused the injury event |
| 4 | Injured body region | Nominal | 1 = Head and neck; 2 = Face; 3 = Thorax; 4 = Abdomen; 5 = Extremities; 6 = Multiple injuries | Input | ISS body region |
| 5 | Injury Severity Score (ISS) category | Ordinal | 1 = “1–8”; 2 = “9–15”; 3 = “>=16” | Input | ISS was calculated based on the Baker formula. The ISS severity score reflects the patient’s injuries. |
| 6 | Ward | Nominal | 1 = ICU; 2 = General or surgical ward | Input | Ward where the nosocomial infection was detected |
| 7 | Type of invasive intervention | Nominal | 1 = Catheter vein; 2 = Urinary catheter; 3 = Medical ventilator; 4 = Tracheostomy; 5 = Trachea intubation; 6 = Arterial line; 7 = Surgery | Input | Type of invasive intervention performed |
| 8 | Infected day | Nominal | 1 = Infection in less than 21 days; 2 = Infection in more than 22 days | Input | Infection detection date minus admission date |
| 9 | Hospital-acquired infected | Nominal | 1 = Upper respiratory infection; 2 = Urinary tract infection - other UTI; 3 = Surgical site infection - SKIN; 4 = Bloodstream infection; 5 = Pneumonia; 6 = Upper respiratory infection - symptomatic UTI; 7 = Central nervous system - meningitis; 8 = Surgical site infection - surgery took place | Input | Type of hospital-acquired infections |
| 10 | Survival status | Nominal | 0 = Non-survivors; 1 = Survivors | Target | Survival status at patient discharge |

ICU: intensive care unit, UTI: urinary tract infection.
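The age coding of Table 1 and the 70%/20%/10% random partition described in the Methods can be sketched as follows. This is a minimal illustration, not the SPSS Modeler stream used in the study: the function names and the seed are hypothetical, the sketch assumes ages in whole years with 15 as the lowest admissible age (following the "15–45" category), and it assumes a simple unstratified shuffle because the paper does not say whether the split was stratified.

```python
import random


def age_category(age_years):
    """Map age in whole years to the ordinal codes of Table 1 (1: 15-45, 2: 46-64, 3: >=65)."""
    if age_years < 15:
        raise ValueError("patients younger than 15 were excluded from the study")
    if age_years <= 45:
        return 1
    if age_years <= 64:
        return 2
    return 3


def split_70_20_10(records, seed=0):
    """Shuffle the records and cut them into training, testing, and validation parts."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = round(0.7 * len(shuffled))
    n_test = round(0.2 * len(shuffled))
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],  # remainder becomes the validation set
    )


patient_ids = list(range(549))  # one hypothetical ID per study record
train, test, validation = split_70_20_10(patient_ids)
print(len(train), len(test), len(validation))  # 384 110 55
```

With 85 non-survivors in total, an unstratified 10% validation set contains only a handful of deaths on average, which is one reason per-class metrics on that partition can fluctuate.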

Furthermore, we applied a decision-tree model for classification considering the study of Alonso et al. [19], which showed that decision-tree models are the conventional techniques in mental health. Hence, the C5.0 and CHAID algorithms were applied for classification. For the CHAID algorithm, we also used a chi-square test to decide the condition for splitting [20]. The following objectives were carried out by using the C5.0 and CHAID algorithms: to predict the death rate from hospital-acquired infections in trauma patients in the absence of a balanced dataset (using C5.0 and CHAID); to predict the death rate from hospital-acquired infections in trauma patients using a dataset balanced by sampling methods (reduced dataset, C5.0, and CHAID); to cluster hospital-acquired infections in trauma patients by the k-means algorithm; to predict the death rate from hospital-acquired infections in trauma patients in each cluster (C5.0 and CHAID); to predict death from hospital-acquired infections in trauma patients by using SMOTE-C5.0 and ADASYN-C5.0; and to predict death from hospital-acquired infections in trauma patients by using SMOTE-SVM, ADASYN-SVM, SMOTE-ANN, and ADASYN-ANN. The following tools were used in this study: IBM SPSS Modeler, MS Excel, SPSS, and Python (for running SMOTE and ADASYN). We calculated the accuracy, precision, and recall for each classifier algorithm to evaluate each model separately. Previous studies found that these metrics were commonly used to assess the performance of prognostic models [21,22]. In addition, the receiver operating characteristic (ROC) curve is a standard technique for evaluating classifier performance, and the area under the curve (AUC) is a typical summary metric for a ROC curve. Hence, we measured the AUC in this study [21].
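As a reminder of how the reported evaluation metrics are defined, the following minimal Python sketch computes accuracy, per-class precision and recall, and a rank-based AUC. The function names and the six-patient toy labels and scores are hypothetical; the study's own figures were produced by IBM SPSS Modeler, whose exact AUC computation is not described in the paper.

```python
def class_metrics(y_true, y_pred, positive):
    """Accuracy plus precision and recall for the class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def auc(y_true, scores, positive):
    """AUC as the probability that a positive case outranks a negative one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: 0 = non-survivor, 1 = survivor.
y_true = [0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 0]
death_scores = [0.9, 0.4, 0.2, 0.1, 0.3, 0.6]  # predicted probability of death

metrics = class_metrics(y_true, y_pred, positive=0)
death_auc = auc(y_true, death_scores, positive=0)
print(metrics, death_auc)
```

Reporting precision and recall separately for survivors and non-survivors, as the tables in this paper do, shows how a model treats the minority class, which a single accuracy figure can hide.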

III. Results

There were 549 individuals who acquired hospital infections in this trauma hospital during the study period from March 2017 to March 2018. In the studied population, 82.1% were male, and 17.9% were female; 64.5% were aged between 15 to 45 years. The total number of patients with hospital-acquired infections who passed away in the hospital was 85 (15.5%), while the remaining 464 (84.5%) survived. Table 2 shows the demographic characteristic of the studied individuals.
Table 2

Bivariate analysis of mortality predictors

| Variable | Survivors (n = 464) | Non-survivors (n = 85) | Total (n = 549) | p-value |
|---|---|---|---|---|
| Sex | | | | 0.137 |
| Male | 386 (85.6) | 65 (14.4) | 451 (100) | |
| Female | 78 (79.6) | 20 (20.4) | 98 (100) | |
| Age (yr) | | | | <0.05 |
| 15–45 | 318 (89.8) | 36 (10.2) | 354 (100) | |
| 46–64 | 84 (81.6) | 19 (18.4) | 103 (100) | |
| >65 | 62 (67.4) | 30 (32.6) | 92 (100) | |
| Mechanism of injury | | | | <0.05 |
| Car accident | 188 (86.2) | 30 (13.8) | 218 (100) | |
| Motorcycle accident | 117 (88.6) | 15 (11.4) | 132 (100) | |
| Pedestrian | 61 (82.4) | 13 (17.6) | 74 (100) | |
| Gunshot | 8 (66.7) | 4 (33.3) | 12 (100) | |
| Falling | 65 (74.7) | 22 (25.3) | 87 (100) | |
| Assault | 13 (100) | 0 (0) | 13 (100) | |
| Struck by objects | 13 (100) | 0 (0) | 13 (100) | |
| Injured body region | | | | 0.38 |
| Head and neck | 183 (84.7) | 33 (15.3) | 216 (100) | |
| Face | 17 (81) | 4 (19) | 21 (100) | |
| Thorax | 54 (84.4) | 10 (15.6) | 64 (100) | |
| Abdomen | 16 (94.1) | 1 (5.9) | 17 (100) | |
| Extremities | 107 (88.4) | 14 (11.6) | 121 (100) | |
| Multiple injuries | 87 (79.1) | 23 (20.9) | 110 (100) | |
| Injury Severity Score (n = 492) | | | | 0.18 |
| 1–8 | 157 (89.2) | 19 (10.8) | 176 (100) | |
| 9–15 | 170 (82.5) | 36 (17.5) | 206 (100) | |
| ≥16 | 94 (85.5) | 16 (14.5) | 110 (100) | |
| Ward | | | | <0.05 |
| ICU | 312 (80.4) | 76 (19.6) | 388 (100) | |
| General or surgical ward | 152 (94.4) | 9 (5.6) | 161 (100) | |
| Type of invasive intervention | | | | |
| Catheter vein (yes) | 86 (89.6) | 10 (10.4) | 96 (100) | 0.13 |
| Urinary catheter (yes) | 113 (90.4) | 12 (9.6) | 125 (100) | <0.05 |
| Medical ventilator (yes) | 102 (75) | 34 (25) | 136 (100) | <0.05 |
| Tracheostomy (yes) | 74 (87.1) | 11 (12.9) | 85 (100) | 0.48 |
| Trachea intubation (yes) | 14 (70) | 6 (30) | 20 (100) | 0.06 |
| Arterial line (yes) | 2 (100) | 0 (0) | 2 (100) | 0.54 |
| Surgery (yes) | 74 (88.1) | 10 (11.9) | 84 (100) | 0.32 |
| Infected day | | | | 0.51 |
| Infected in less than 21 days after admission | 415 (84.9) | 74 (15.1) | 489 (100) | |
| Infected in more than 22 days after admission | 49 (81.7) | 11 (18.3) | 60 (100) | |
| Hospital-acquired infected | | | | |
| Upper respiratory infection (yes) | 252 (83.7) | 49 (16.3) | 301 (100) | 0.57 |
| Urinary tract infection - other UTI (yes) | 90 (85.7) | 15 (14.3) | 105 (100) | 0.70 |
| Surgical site infection - SKIN (yes) | 92 (85.2) | 16 (14.8) | 108 (100) | 0.83 |
| Bloodstream infection (yes) | 82 (80.4) | 20 (19.6) | 102 (100) | 0.20 |
| Pneumonia (yes) | 34 (85) | 6 (15) | 40 (100) | 0.93 |
| Upper respiratory infection - symptomatic UTI (yes) | 14 (87.5) | 2 (12.5) | 16 (100) | 0.73 |
| Central nervous system - meningitis (yes) | 17 (70.8) | 7 (29.2) | 24 (100) | <0.05 |
| Surgical site infection - surgery took place (yes) | 1 (50) | 1 (50) | 2 (100) | 0.17 |

Values are presented as number (%).

ICU: intensive care unit, UTI: urinary tract infection.
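To make the bivariate comparison concrete, the sketch below runs a Pearson chi-square test on the ward counts from Table 2 (ICU: 312 survivors, 76 non-survivors; general or surgical ward: 152 survivors, 9 non-survivors). The paper does not state which bivariate test produced the Table 2 p-values, so the choice of an uncorrected Pearson chi-square test is an assumption, as is the function name; the same statistic is the one CHAID is described as using for its splits.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) and p-value for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Rows: ICU, general or surgical ward. Columns: survivors, non-survivors (Table 2).
ward_table = [[312, 76], [152, 9]]
statistic, p_value = chi_square_2x2(ward_table)
print(f"chi-square = {statistic:.2f}, p = {p_value:.2g}")
```

Under these assumptions the p-value falls below 0.05, in line with the "<0.05" reported for ward in Table 2.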

In this study, a death prediction model was applied to unbalanced hospital-acquired infection datasets. Mortality was significantly associated with age, mechanism of injury, ward, urinary catheter (yes), medical ventilator (yes), and central nervous system - meningitis (yes) (all p < 0.05). Table 2 depicts the detailed bivariate analysis of mortality predictors of the studied individuals. We predicted death related to hospital-acquired infections for trauma patients based on unbalanced data by using the C5.0 and CHAID algorithms. The prediction accuracy of C5.0 was higher (86.16% vs. 85.16%). As listed in Table 3, the CHAID precision was 17.64% for the death class and 90.27% for the survival class. Table 3 displays more details for accuracy, recall, and precision in predicting the possibility of death from these hospital-acquired infections.
Table 3

Performance evaluation of death models

| Model | Description | AUC | Accuracy (%) | Class | Precision (%) | Recall (%) |
|---|---|---|---|---|---|---|
| CHAID tree | Classification without the balanced data set | 0.781 | 85.16 | Survivors | 90.27 | 86.66 |
| | | | | Non-survivors | 17.64 | 62.50 |
| C5.0 tree | Classification without the balanced data set | 0.619 | 86.16 | Survivors | 99.13 | 86.46 |
| | | | | Non-survivors | 15.29 | 76.47 |

AUC: area under the curve.
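One way to read the unbalanced-data accuracies in Table 3 is to compare them with a trivial rule that always predicts the majority class. The short sketch below does this arithmetic from the class counts in Table 2; it assumes the comparison is made against the class distribution of the full dataset, whereas the reported accuracies may have been computed on a partition with a slightly different class mix.

```python
survivors, non_survivors = 464, 85
total = survivors + non_survivors

# Accuracy of a rule that labels every patient as a survivor.
baseline_accuracy = 100 * survivors / total

reported = {"C5.0": 86.16, "CHAID": 85.16}  # accuracies from Table 3
for model, accuracy in reported.items():
    gain = accuracy - baseline_accuracy
    print(f"{model}: {accuracy:.2f}% vs. baseline {baseline_accuracy:.2f}% ({gain:+.2f} points)")
```

Because the baseline is already about 84.5%, accuracy alone says little about how well deaths are detected, which is why the low precision for the non-survivor class in Table 3 matters.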

On the other hand, for the balanced dataset, we predicted mortality by random under-sampling using the C5.0 and CHAID algorithms. The accuracy for C5.0 was 70.69%, and that for the CHAID algorithm was 61.24%, as shown in Table 4. After we boosted the dataset by over-sampling, the accuracy reached 94.74% for C5.0; however, it remained relatively low at 79.47% for CHAID (Table 5).
Table 4

Performance evaluation of death models (random under-sampling)

| Model | Description | AUC | Accuracy (%) | Class | Precision (%) | Recall (%) |
|---|---|---|---|---|---|---|
| CHAID tree | Classification using the balanced data set (random under-sampling) | 0.709 | 61.24 | Survivors | 28.76 | 80.76 |
| | | | | Non-survivors | 94.11 | 70.79 |
| C5.0 tree | Classification using the balanced data set (random under-sampling) | 0.797 | 70.69 | Survivors | 61.79 | 76.38 |
| | | | | Non-survivors | 80.00 | 66.66 |

AUC: area under the curve.

Table 5

Performance evaluation of death models (random over-sampling)

| Model | Description | AUC | Accuracy (%) | Class | Precision (%) | Recall (%) |
|---|---|---|---|---|---|---|
| CHAID tree | Classification with the balanced data set (boost) | 0.883 | 79.47 | Survivors | 74.35 | 82.53 |
| | | | | Non-survivors | 69.70 | 76.98 |
| C5.0 tree | Classification with the balanced data set (boost) | 0.974 | 94.74 | Survivors | 92.02 | 97.26 |
| | | | | Non-survivors | 97.88 | 92.58 |

AUC: area under the curve.

In terms of clustering, we first applied the k-means algorithm with k set to 5. We set the number of clusters (i.e., k = 5) equal to the number of principal infection diagnoses for the majority class (survivor class). Then mortality was predicted separately for each cluster. Overall, the mortality prediction accuracy of this model on the clustered data was higher than that of the previously assessed methods in this study. Table 6 presents the findings in detail.
Table 6

Performance evaluation for death models on the clustered dataset

| Model | Cluster number | AUC | Accuracy (%) | Class | Precision (%) | Recall (%) |
|---|---|---|---|---|---|---|
| CHAID tree | Cluster 1 with alive data and dead data set | 0.862 | 79.19 | Survivors | 96.40 | 74.19 |
| | | | | Non-survivors | 57.25 | 92.59 |
| | Cluster 2 with alive data and dead data set | 0.961 | 89.34 | Survivors | 100 | 82.64 |
| | | | | Non-survivors | 78.35 | 100 |
| | Cluster 3 with alive data and dead data set | 0.987 | 94.74 | Survivors | 94.66 | 94.66 |
| | | | | Non-survivors | 95.87 | 95.87 |
| | Cluster 4 with alive data and dead data set | 0.993 | 97.60 | Survivors | 97.06 | 94.28 |
| | | | | Non-survivors | 97.89 | 98.88 |
| | Cluster 5 with alive data and dead data set | 0.982 | 95.05 | Survivors | 96.59 | 93.40 |
| | | | | Non-survivors | 93.62 | 96.70 |
| | Overall | 0.962 | 91.30 | Survivors | 96.98 | 83.35 |
| | | | | Non-survivors | 82.56 | 96.78 |
| C5.0 tree | Cluster 1 with alive data and dead data set | 0.899 | 87.25 | Survivors | 95.80 | 83.77 |
| | | | | Non-survivors | 76.34 | 93.46 |
| | Cluster 2 with alive data and dead data set | 0.944 | 92.89 | Survivors | 96.00 | 90.57 |
| | | | | Non-survivors | 89.69 | 95.60 |
| | Cluster 3 with alive data and dead data set | 0.962 | 94.77 | Survivors | 96.00 | 91.14 |
| | | | | Non-survivors | 93.81 | 96.81 |
| | Cluster 4 with alive data and dead data set | 0.981 | 97.60 | Survivors | 91.18 | 100 |
| | | | | Non-survivors | 100 | 96.80 |
| | Cluster 5 with alive data and dead data set | 0.999 | 97.80 | Survivors | 97.72 | 97.72 |
| | | | | Non-survivors | 97.87 | 97.87 |
| | Overall | 0.965 | 93.02 | Survivors | 93.88 | 88.29 |
| | | | | Non-survivors | 90.39 | 96.04 |

AUC: area under the curve.
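The clustering approach behind Table 6 can be sketched as follows, under one reading of the row labels ("Cluster i with alive data and dead data set"): the survivor records are partitioned into k = 5 clusters with k-means, and each cluster of survivors is combined with the non-survivor records to form a less unbalanced training set. That pairing is an assumption, as are the hand-written k-means, the function names, and the random two-dimensional toy features (the study used encoded clinical variables in IBM SPSS Modeler).

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return [nearest(p, centroids) for p in points]


def cluster_training_sets(majority, minority, k=5, seed=0):
    """Pair each majority-class cluster with all minority records (assumed design)."""
    labels = kmeans(majority, k, seed=seed)
    subsets = []
    for c in range(k):
        cluster_rows = [(p, 1) for p, label in zip(majority, labels) if label == c]
        subsets.append(cluster_rows + [(p, 0) for p in minority])
    return subsets


# Hypothetical numeric features; only the class sizes come from the study.
rng = random.Random(42)
survivors = [(rng.random(), rng.random()) for _ in range(464)]
non_survivors = [(rng.random(), rng.random()) for _ in range(85)]

subsets = cluster_training_sets(survivors, non_survivors, k=5)
for i, subset in enumerate(subsets, start=1):
    n_alive = sum(1 for _, y in subset if y == 1)
    print(f"cluster {i}: {n_alive} survivors + {len(subset) - n_alive} non-survivors")
```

Each subset then trains its own classifier, so every model sees all the deaths but only a fraction of the survivors, which reduces the imbalance without discarding survivor records at random.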

Further, we applied SMOTE-C5.0, ADASYN-C5.0, SMOTE-SVM, ADASYN-SVM, SMOTE-ANN, and ADASYN-ANN. The AUC for death classification was 1.00 using SMOTE-SVM and 0.99 using ADASYN-SVM. Table 7 presents the details, and the calibration of the SVM and ANN algorithms is shown in Supplementary Table S1.
Table 7

Performance evaluation for death models with SMOTE-C5.0 and ADASYN-C5.0

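| Model | AUC | Accuracy (%) | Class | Precision (%) | Recall (%) |
|---|---|---|---|---|---|
| SMOTE-C5.0 | 0.97 | 93.66 | Survivors | 96.35 | 90.95 |
| | | | Non-survivors | 91.15 | 96.43 |
| ADASYN-C5.0 | 0.95 | 90.93 | Survivors | 89.60 | 92.89 |
| | | | Non-survivors | 92.40 | 88.91 |
| SMOTE-SVM | 1.00 | 100 | Survivors | 100 | 100 |
| | | | Non-survivors | 100 | 100 |
| ADASYN-SVM | 0.99 | 98.57 | Survivors | 98.74 | 98.39 |
| | | | Non-survivors | 98.43 | 98.71 |
| SMOTE-ANN | 0.92 | 91.48 | Survivors | 86.54 | 95.74 |
| | | | Non-survivors | 96.27 | 98.41 |
| ADASYN-ANN | 0.97 | 97.46 | Survivors | 96.86 | 98.09 |
| | | | Non-survivors | 98.08 | 96.83 |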

SVM: support vector machine, ANN: artificial neural network, AUC: area under the curve.
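For readers unfamiliar with SMOTE, the core idea is to create synthetic minority records by interpolating between a minority record and one of its nearest minority-class neighbours. The sketch below is a simplified, hand-written illustration of that idea only; the paper states that SMOTE and ADASYN were run in Python but does not name the library or parameters, so the function name, k = 5, the seed, and the toy features are all assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen record (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical numeric features for the 85 non-survivors.
rng = random.Random(7)
non_survivors = [(rng.random(), rng.random()) for _ in range(85)]

# 464 - 85 = 379 synthetic records would equalise the two class sizes.
synthetic = smote_sketch(non_survivors, n_new=464 - 85)
print(len(non_survivors) + len(synthetic))
```

Because synthetic records are built from existing minority records, generating them before the train/test split can place near-copies of test records in the training data; the paper does not say at which stage the over-sampling was applied.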

To validate the results, we split the data into training (70%), testing (20%), and validation (10%) sets. Table 8 shows the details for the AUC and the accuracy of each approach. The highest validation accuracy was obtained by the k-means algorithm in the clustering approach, followed by the C5.0 algorithm in classification.
Table 8

Evaluation metrics in training, testing, and validation sets

| Model | Evaluation metric | Training | Testing | Validation |
|---|---|---|---|---|
| Classification without the balanced data set (with CHAID) | AUC | 0.77 | 0.81 | 0.76 |
| | Accuracy (%) | 82.34 | 85.57 | 92.54 |
| Classification without the balanced data set (with C5.0) | AUC | 0.59 | 0.75 | 0.60 |
| | Accuracy (%) | 84.68 | 88.66 | 91.04 |
| Classification with the balanced data set (boost) with CHAID | AUC | 0.89 | 0.87 | 0.88 |
| | Accuracy (%) | 79.11 | 76.72 | 82.42 |
| Classification with the balanced data set (boost) with C5.0 | AUC | 0.97 | 0.97 | 0.97 |
| | Accuracy (%) | 92.65 | 94.71 | 91.21 |
| Classification with the balanced data set (random under-sampling) with CHAID | AUC | 0.64 | 0.53 | 0.74 |
| | Accuracy (%) | 59.50 | 48.28 | 53.57 |
| Classification with the balanced data set (random under-sampling) with C5.0 | AUC | 0.78 | 0.80 | 0.84 |
| | Accuracy (%) | 72.07 | 76.92 | 73.08 |
| Cluster 1 with alive data and dead data set and classification with C5.0 | AUC | 0.91 | 0.82 | 0.91 |
| | Accuracy (%) | 88.29 | 81.82 | 87.76 |
| Cluster 2 with alive data and dead data set and classification with C5.0 | AUC | 0.95 | 0.91 | 0.96 |
| | Accuracy (%) | 93.94 | 90.91 | 90.62 |
| Cluster 3 with alive data and dead data set and classification with C5.0 | AUC | 0.96 | 0.95 | 0.96 |
| | Accuracy (%) | 95.76 | 96.30 | 88.46 |
| Cluster 4 with alive data and dead data set and classification with C5.0 | AUC | 0.98 | 0.98 | 1.00 |
| | Accuracy (%) | 98.86 | 94.74 | 94.44 |
| Cluster 5 with alive data and dead data set and classification with C5.0 | AUC | 0.99 | 0.99 | 1.00 |
| | Accuracy (%) | 97.54 | 98.88 | 100 |
| Cluster 1 with alive data and dead data set and classification with CHAID | AUC | 0.88 | 0.759 | 0.872 |
| | Accuracy (%) | 81.46 | 72.73 | 75.51 |
| Cluster 2 with alive data and dead data set and classification with CHAID | AUC | 0.955 | 0.981 | 0.954 |
| | Accuracy (%) | 89.39 | 93.94 | 84.38 |
| Cluster 3 with alive data and dead data set and classification with CHAID | AUC | 0.982 | 1.00 | 0.99 |
| | Accuracy (%) | 94.07 | 96.30 | 96.15 |
| Cluster 4 with alive data and dead data set and classification with CHAID | AUC | 0.99 | 1.0 | 1.0 |
| | Accuracy (%) | 96.59 | 100 | 100 |
| Cluster 5 with alive data and dead data set and classification with CHAID | AUC | 0.99 | 0.95 | 0.95 |
| | Accuracy (%) | 98.36 | 87.50 | 89.66 |
| SMOTE-C5.0 | AUC | 0.98 | 0.84 | 0.89 |
| | Accuracy (%) | 93.69 | 79.69 | 86.52 |
| ADASYN-C5.0 | AUC | 0.90 | 0.77 | 0.69 |
| | Accuracy (%) | 86.37 | 77.16 | 75.86 |
| SMOTE-SVM | AUC | 1.00 | 0.989 | 0.98 |
| | Accuracy (%) | 100 | 92.71 | 94.38 |
| ADASYN-SVM | AUC | 0.99 | 0.89 | 0.87 |
| | Accuracy (%) | 98.57 | 81.73 | 80.46 |
| SMOTE-ANN | AUC | 0.92 | 0.87 | 0.86 |
| | Accuracy (%) | 91.48 | 82.29 | 79.78 |
| ADASYN-ANN | AUC | 0.97 | 0.76 | 0.61 |
| | Accuracy (%) | 97.46 | 72.59 | 62.07 |

AUC: area under the curve.

IV. Discussion

This research developed models to predict mortality (dead vs. survived) from a hospital-acquired infection dataset using various balancing methods, namely over-sampling, under-sampling, and clustering the dataset with k-means. Death was then predicted by the CHAID, C5.0, SMOTE-C5.0, ADASYN-C5.0, SMOTE-SVM, ADASYN-SVM, SMOTE-ANN, and ADASYN-ANN algorithms, each run separately. Overall, prediction by the clustering method on the imbalanced hospital-acquired infection data was better than prediction by the under-sampling and over-sampling methods. As a part of this study, the best prediction accuracy for mortality from hospital-acquired infection based on an unbalanced dataset was achieved by using the cluster-based algorithm. In line with our research on cluster-based under-sampling methods, Yen and Lee [23] found that k-means reduces the imbalance of the class distribution, and Rahman and Davis [24] noted its significantly better performance on unbalanced cardiovascular data. Likewise, Onan [25] reported more reliable predictive performance for clustering-based under-sampling methods. Additionally, our results showed that random over-sampling led to significantly better prediction performance. These results are similar to the findings of Chawla et al. [21], which showed accuracy improvement after the application of a random over-sampling approach to classify a minority class. Nevertheless, random over-sampling approaches are sometimes inefficient because it can take a long time to prepare unbalanced data [26]. Notably, we compared these three methods for unbalanced data on a hospital-acquired infection dataset; applying the same methods to different healthcare data in future studies would be valuable. We were interested in doing this comparison; however, the time and resources of the project were limited. Further, external validation using an alternative dataset could improve confidence in the model; hence, we consider its absence a limitation of our study.
Original datasets are unclean and sparse; therefore, the preparation steps for healthcare data take a long time. A further subject to study could be a systematic review of the handling of unbalanced data in healthcare, which is imperative to provide evidence-based approaches. This study examined two aspects of unbalanced data in detail: the prognosis of patients with hospital-acquired infection and the need for pre-processing these types of data. Various balancing approaches were applied to handle the imbalance issue for hospital-acquired infection data in the trauma hospital. What stands out in these types of data is that clustered under-sampling performed better than random over-sampling and under-sampling. Overall, the issue of unbalanced data in healthcare persists from prevention to prognosis and follow-up. Hence, we suggest methods for handling unbalanced data in the healthcare domain.
  11 in total

1.  Infection control - a problem for patient safety.

Authors:  John P Burke
Journal:  N Engl J Med       Date:  2003-02-13       Impact factor: 91.245

2.  Increases in mortality, length of stay, and cost associated with hospital-acquired infections in trauma patients.

Authors:  Laurent G Glance; Pat W Stone; Dana B Mukamel; Andrew W Dick
Journal:  Arch Surg       Date:  2011-03-21

3.  Lessons learned from data mining of WHO mortality database.

Authors:  W Paoin
Journal:  Methods Inf Med       Date:  2011-06-21       Impact factor: 2.176

4.  Predicting hospital associated disability from imbalanced data using supervised learning.

Authors:  Mirka Saarela; Olli-Pekka Ryynänen; Sami Äyrämö
Journal:  Artif Intell Med       Date:  2018-10-03       Impact factor: 5.326

5.  Classifying highly imbalanced ICU data.

Authors:  Yazan F Roumani; Jerrold H May; David P Strum; Luis G Vargas
Journal:  Health Care Manag Sci       Date:  2012-11-07

6.  Risk factors affecting in-hospital mortality in patients with nosocomial infections.

Authors:  Wang-Huei Sheng; Jann-Tay Wang; Mei-Shin Lin; Shan-Chwen Chang
Journal:  J Formos Med Assoc       Date:  2007-02       Impact factor: 3.282

7.  Nosocomial infections in the surgical intensive care unit: a difference between trauma and surgical patients.

Authors:  W C Wallace; M Cinat; W B Gornick; M E Lekawa; S E Wilson
Journal:  Am Surg       Date:  1999-10       Impact factor: 0.688

8.  Discovering medical knowledge using association rule mining in young adults with acute myocardial infarction.

Authors:  Dong Gyu Lee; Kwang Sun Ryu; Mohamed Bashir; Jang-Whan Bae; Keun Ho Ryu
Journal:  J Med Syst       Date:  2013-01-15       Impact factor: 4.460

9.  Late outcomes of trauma patients with infections during index hospitalization.

Authors:  Angela S Czaja; Frederick P Rivara; Jin Wang; Thomas Koepsell; Avery B Nathens; Gregory J Jurkovich; Ellen Mackenzie
Journal:  J Trauma       Date:  2009-10

10.  Injury patterns among various age and gender groups of trauma patients in southern Iran: A cross-sectional study.

Authors:  Shahram Bolandparvaz; Mahnaz Yadollahi; Hamid Reza Abbasi; Mehrdad Anvar
Journal:  Medicine (Baltimore)       Date:  2017-10       Impact factor: 1.817
