Shivani Batra1, Rohan Khurana1, Mohammad Zubair Khan2, Wadii Boulila3, Anis Koubaa3, Prakash Srivastava4.
Abstract
Clean and trustworthy data are required for efficient computer modelling in medical decision-making, yet data in medical care are frequently missing. Missing values may occur not just in training data but also in testing data, which might contain a single undiagnosed episode or participant. This study evaluates different imputation and regression procedures, selected on the basis of regressor performance and computational expense, to address missing values in both training and testing datasets. Several procedures for dealing with missing values have been introduced in healthcare, but it is still debated which imputation strategies work better in specific cases. This research proposes an ensemble imputation model that is trained to use a combination of simple mean imputation, k-nearest neighbour imputation, and iterative imputation, and then selects the ideal imputation strategy among them based on attribute correlations on the features with missing values. We introduce a unique Ensemble Strategy for Missing Value to analyse healthcare data with considerable missing values and to support unbiased and accurate statistical prediction modelling. The performance metrics have been generated using the eXtreme gradient boosting regressor, random forest regressor, and support vector regressor. Experiments on real-world healthcare data, together with simulations of data with varying feature-wise missing frequencies, indicate that the proposed technique surpasses standard missing value imputation approaches, as well as the approach of dropping records holding missing values, in terms of accuracy.
Keywords: ensemble learning; health data; imputation methods; missing values; regression algorithms
Year: 2022 PMID: 35455196 PMCID: PMC9030272 DOI: 10.3390/e24040533
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.738
Figure 1. Snapshot of sample real-world data explored for experimentation.
Figure 2. Conceptual schema of proposed ensemble approach based on stacking mechanism.
Figure 3. Data pre-processing phase.
Figure 4. Model training phase.
Figure 5. Imputation phase.
Configurations of regressors and imputation techniques.
| Regressor/Imputation Methods | Configurations |
|---|---|
| XGB Regressor | max_depth = 10 |
| Support Vector Regressor | Kernel = rbf, C = 1.5 |
| Random Forest | max_depth = 5 |
| K Nearest Neighbour Imputation | K = 5 |
| Multiple Imputation | max_itr = 5 |
| Simple Imputation | strategy = ‘mean’ |
| Proposed Ensemble Model Imputation | NA |
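The configurations listed above could be set up roughly as follows. This is only a sketch: it assumes scikit-learn (and, optionally, the `xgboost` package), which the source does not name, and it maps "Multiple Imputation, max_itr = 5" onto scikit-learn's `IterativeImputer(max_iter=5)`, which is also an assumption. The toy data, the 5% missing rate, the random seeds, and the dictionary keys are illustrative and not taken from the study.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to import IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Imputers configured as in the table (library mapping is an assumption).
imputers = {
    "simple_mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(max_iter=5, random_state=0),
}

# Regressors configured as in the table.
regressors = {
    "rfr": RandomForestRegressor(max_depth=5, random_state=0),
    "svr": SVR(kernel="rbf", C=1.5),
}
try:
    # xgboost is an optional third-party dependency; skipped if not installed.
    from xgboost import XGBRegressor
    regressors["xgb"] = XGBRegressor(max_depth=10)
except ImportError:
    pass

# Hypothetical toy data with about 5% of the cells removed at random.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0])
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.05] = np.nan

# Fill the missing cells with each baseline imputer.
filled = {name: imp.fit_transform(X_missing) for name, imp in imputers.items()}

# Fit one regressor on one of the imputed matrices.
model = regressors["rfr"].fit(filled["simple_mean"], y)
```

Each entry of `filled` is a complete matrix of the same shape as `X_missing`, so any regressor in `regressors` can be fitted on it and compared across imputers.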
Instances holding one or more missing values in test dataset.
| Test Dataset Size | Number of Instances Holding One or More Missing Values | Frequency of Non-Missing Values | Frequency of Missing Values |
|---|---|---|---|
| 5000 | 3458 | 279,877 | 10,123 |
| 10,000 | 6961 | 559,955 | 20,045 |
| 20,000 | 13,857 | 1,120,278 | 39,722 |
Attribute-wise missing values for varying size test datasets.
| Attributes Name | 5K | 10K | 20K |
|---|---|---|---|
| social_distancing_total_grade | 868 | 1682 | 3315 |
| social_distancing_visitation_grade | 2176 | 4369 | 8681 |
| social_distancing_encounters_grade | 870 | 1688 | 3315 |
| social_distancing_travel_distance_grade | 860 | 1682 | 3310 |
| daily_state_test | 905 | 1791 | 3572 |
| precipitation | 1727 | 3456 | 6836 |
| temperature | 2368 | 4704 | 9330 |
| ventilator_capacity_ratio | 102 | 201 | 400 |
| icu_beds_ratio | 100 | 200 | 401 |
| Religious_congregation_ratio | 3 | 7 | 13 |
| percent_insured | 1 | 3 | 6 |
| deaths_per_100000 | 143 | 262 | 543 |
Figure 6. Graphically presented attribute-wise missing values for varying size test datasets.
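The two tables of missing-value counts can be cross-checked against each other with a few lines of arithmetic. The sketch below copies the figures from the tables; the variable names are illustrative. It checks that the attribute-wise counts add up to the reported frequency of missing values for each test set, and derives the number of cells per record and the overall share of missing cells (roughly 3.4 to 3.5% in every case).

```python
# (records, non-missing cells, missing cells) per test dataset, from the table above.
totals = {
    "5K": (5_000, 279_877, 10_123),
    "10K": (10_000, 559_955, 20_045),
    "20K": (20_000, 1_120_278, 39_722),
}

# Attribute-wise missing counts in the order (5K, 10K, 20K).
per_attribute = {
    "social_distancing_total_grade": (868, 1682, 3315),
    "social_distancing_visitation_grade": (2176, 4369, 8681),
    "social_distancing_encounters_grade": (870, 1688, 3315),
    "social_distancing_travel_distance_grade": (860, 1682, 3310),
    "daily_state_test": (905, 1791, 3572),
    "precipitation": (1727, 3456, 6836),
    "temperature": (2368, 4704, 9330),
    "ventilator_capacity_ratio": (102, 201, 400),
    "icu_beds_ratio": (100, 200, 401),
    "Religious_congregation_ratio": (3, 7, 13),
    "percent_insured": (1, 3, 6),
    "deaths_per_100000": (143, 262, 543),
}

sizes = list(totals)  # ["5K", "10K", "20K"]

# Sum of attribute-wise missing counts for each test set.
column_sums = {
    size: sum(counts[i] for counts in per_attribute.values())
    for i, size in enumerate(sizes)
}

# Cells per record and overall share of missing cells.
cells_per_record = {s: (nm + m) // n for s, (n, nm, m) in totals.items()}
missing_share = {s: m / (nm + m) for s, (n, nm, m) in totals.items()}

for s in sizes:
    print(s, column_sums[s], cells_per_record[s], f"{missing_share[s]:.2%}")
```

The attribute-wise sums equal the reported totals (10,123, 20,045 and 39,722), and each record contributes 58 cells, so the twelve listed attributes account for every missing cell in the test sets.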
Results obtained for varying size test dataset.
| Test Dataset Size | Imputation Method | MAE (XGB) | MAE (SVR) | MAE (RFR) | MSE (XGB) | MSE (SVR) | MSE (RFR) |
|---|---|---|---|---|---|---|---|
| 5000 Records | Proposed | 60.81 | 202.01 | 112.8 | 8266.08 | 69,611.7 | 23,966 |
| 5000 Records | Iterative | 78.48 | 200.03 | 147.63 | 12,261.7 | 68,882.8 | 38,878.3 |
| 5000 Records | KNN | 82.3 | 201.91 | 147.15 | 12,972.8 | 69,768.5 | 37,811.4 |
| 5000 Records | Simple Mean | 79.78 | 197.48 | 146.88 | 12,197.3 | 68,160.8 | 37,889.9 |
| 5000 Records | Dropping | 68.08 | 197.37 | 145.84 | 8406.14 | 64,981.4 | 35,744.9 |
| 10,000 Records | Proposed | 54.06 | 194.73 | 115.98 | 6046.26 | 63,853.1 | 23,256.3 |
| 10,000 Records | Iterative | 72.84 | 196.45 | 145.58 | 10,194 | 66,607.9 | 37,104.7 |
| 10,000 Records | KNN | 75.58 | 198.2 | 148.12 | 11,154 | 67,537.5 | 38,554.6 |
| 10,000 Records | Simple Mean | 73.36 | 192.69 | 146.96 | 10,372.3 | 65,122.9 | 38,134.9 |
| 10,000 Records | Dropping | 68.08 | 197.37 | 146 | 8406.14 | 64,981.4 | 35,805.3 |
| 20,000 Records | Proposed | 49.38 | 188.31 | 113.57 | 4473.7 | 59,422.4 | 23,298.4 |
| 20,000 Records | Iterative | 72.69 | 192.51 | 145.98 | 9462.76 | 63,737.1 | 37,942.4 |
| 20,000 Records | KNN | 75.01 | 193.38 | 145.21 | 9881.5 | 63,836.4 | 37,135.2 |
| 20,000 Records | Simple Mean | 74.07 | 189.46 | 146.65 | 9695.8 | 62,288.6 | 37,528.1 |
| 20,000 Records | Dropping | 68.08 | 197.37 | 146.02 | 8406.14 | 64,981.4 | 35,825.6 |

MAE: mean absolute error; MSE: mean squared error.
Figure 7. Graphically presented results obtained for varying size test datasets.
Normalised results obtained for varying size test dataset.
| Test Dataset Size | Imputation Method | MAE (XGB) | MAE (SVR) | MAE (RFR) | MSE (XGB) | MSE (SVR) | MSE (RFR) |
|---|---|---|---|---|---|---|---|
| 5000 | Iterative | 0.775 | 1.010 | 0.764 | 0.674 | 1.011 | 0.616 |
| 5000 | KNN | 0.739 | 1 | 0.767 | 0.637 | 0.998 | 0.634 |
| 5000 | Simple Mean | 0.762 | 1.023 | 0.768 | 0.678 | 1.021 | 0.633 |
| 5000 | Dropping | 0.893 | 1.024 | 0.773 | 0.983 | 1.071 | 0.67 |
| 10,000 | Iterative | 0.742 | 0.991 | 0.797 | 0.593 | 0.959 | 0.627 |
| 10,000 | KNN | 0.715 | 0.982 | 0.783 | 0.542 | 0.945 | 0.603 |
| 10,000 | Simple Mean | 0.737 | 1.011 | 0.789 | 0.583 | 0.981 | 0.610 |
| 10,000 | Dropping | 0.794 | 0.987 | 0.794 | 0.719 | 0.983 | 0.650 |
| 20,000 | Iterative | 0.679 | 0.978 | 0.778 | 0.473 | 0.932 | 0.614 |
| 20,000 | KNN | 0.658 | 0.974 | 0.782 | 0.453 | 0.931 | 0.627 |
| 20,000 | Simple Mean | 0.667 | 0.994 | 0.774 | 0.461 | 0.954 | 0.621 |
| 20,000 | Dropping | 0.725 | 0.954 | 0.778 | 0.532 | 0.914 | 0.650 |

MAE: mean absolute error; MSE: mean squared error.
Figure 8. Graphically presented normalised results obtained for varying size test datasets.
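The normalisation formula is not stated in this excerpt, but the tabulated values are consistent with dividing the proposed method's error by each baseline's error, so that values below 1 favour the proposed method. The sketch below checks this reading on the XGB mean absolute errors for the 5000-record test set; the function name `normalise` is illustrative.

```python
# XGB mean absolute errors for the 5000-record test set, from the results table.
mae_xgb_5000 = {
    "Proposed": 60.81,
    "Iterative": 78.48,
    "KNN": 82.3,
    "Simple Mean": 79.78,
    "Dropping": 68.08,
}


def normalise(errors, reference="Proposed"):
    """Return reference error divided by each other method's error, to 3 decimals."""
    ref = errors[reference]
    return {
        method: round(ref / err, 3)
        for method, err in errors.items()
        if method != reference
    }


ratios = normalise(mae_xgb_5000)
print(ratios)
# {'Iterative': 0.775, 'KNN': 0.739, 'Simple Mean': 0.762, 'Dropping': 0.893}
```

These ratios match the first column of the normalised table for the 5000-record test set. Under the same reading, the SVR entries close to or above 1 correspond to cases where the proposed method gives little or no improvement over the baseline.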