Shahidul Islam Khan, Abu Sayed Md Latiful Hoque.
Abstract
In data analytics, missing data degrades performance, and incorrect imputation of missing values can lead to wrong predictions. In the era of big data, when massive volumes of data are generated every second and making use of these data is a major concern for stakeholders, handling missing values efficiently becomes even more important. In this paper, we propose a new technique for missing data imputation that is a hybrid of single and multiple imputation techniques. It extends the popular Multivariate Imputation by Chained Equations (MICE) algorithm in two variations to impute categorical and numeric data. We also implemented twelve existing algorithms to impute binary, ordinal, and numeric missing values. We collected sixty-five thousand real health records from hospitals and diagnostic centers in Bangladesh while maintaining data privacy, along with three public datasets from the UCI Machine Learning Repository, ETH Zurich, and Kaggle. We compared the performance of the proposed algorithms with the existing algorithms on these datasets. Experimental results show that the proposed algorithm achieves a 20% higher F-measure for binary data imputation and 11% less error for numeric data imputation than its competitors, with similar execution time.
Keywords: Data Analytics; MICE; Missing Data Imputation; Multiple Imputation; Single Imputation
Year: 2020 PMID: 32547903 PMCID: PMC7291187 DOI: 10.1186/s40537-020-00313-w
Source DB: PubMed Journal: J Big Data ISSN: 2196-1115
A dataset with missing values
| Serial | Gender | Income |
|---|---|---|
| 1 | Female | 100 |
| 2 | Female | NA |
| 3 | Male | 100 |
| 4 | Female | 300 |
| 5 | Male | NA |
| 6 | Male | 200 |
| 7 | Female | 200 |
Imputing missing values using single imputation method
| Serial | Gender | Income |
|---|---|---|
| 1 | Female | 100 |
| 2 | Female | 180 |
| 3 | Male | 100 |
| 4 | Female | 300 |
| 5 | Male | 180 |
| 6 | Male | 200 |
| 7 | Female | 200 |
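
The imputed value of 180 in the table above is consistent with mean imputation, where every gap is filled with the average of the observed incomes. The following minimal sketch reproduces that arithmetic with the standard library only; the variable names are illustrative and not taken from the paper.

```python
from statistics import mean

# Income column from the table above; None marks a missing value.
income = [100, None, 100, 300, None, 200, 200]

# Mean of the observed entries: (100 + 100 + 300 + 200 + 200) / 5 = 180
observed = [v for v in income if v is not None]
fill_value = mean(observed)

# Every gap receives the same value, which is what makes this "single" imputation.
imputed = [fill_value if v is None else v for v in income]
print(imputed)
```

Because both gaps receive the identical value, the completed column is less spread out than the true incomes are likely to be, which is one common criticism of single imputation.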
Analysis of bias for single imputation method
| Serial | Age | Death reason |
|---|---|---|
| 1 | 60 | Covid-19 |
| 2 | 64 | NA |
| 3 | 42 | Heart attack |
| 4 | 67 | Covid-19 |
| 5 | 80 | NA |
| 6 | 32 | Cancer |
| 7 | 35 | Cancer |
| 8 | 45 | Cancer |
| 9 | 88 | NA |
| 10 | 33 | Heart attack |
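
The caption does not spell out the bias mechanism. One plausible reading, assumed here, is that filling a categorical gap with the most frequent observed category ignores the other attributes. The sketch below applies mode imputation to the rows of the table above and lists the ages involved; it is an illustration under that assumption, not the authors' analysis.

```python
from collections import Counter

# (age, death reason) pairs from the table above; None marks a missing reason.
records = [
    (60, "Covid-19"), (64, None), (42, "Heart attack"), (67, "Covid-19"),
    (80, None), (32, "Cancer"), (35, "Cancer"), (45, "Cancer"),
    (88, None), (33, "Heart attack"),
]

# Most frequent observed category: Cancer (3) vs Covid-19 (2) vs Heart attack (2).
observed = [reason for _, reason in records if reason is not None]
mode_value, mode_count = Counter(observed).most_common(1)[0]

# Single imputation: every missing reason is replaced by the mode.
imputed = [(age, reason if reason is not None else mode_value) for age, reason in records]

# Ages of the filled-in records, next to the ages already labelled with the mode.
filled_ages = [age for age, reason in records if reason is None]
mode_ages = [age for age, reason in records if reason == mode_value]
print(mode_value, filled_ages, mode_ages)
```

In this table the records with a missing reason are aged 64, 80, and 88, whereas the observed "Cancer" records are aged 32 to 45, so the mode is assigned to an age group in which it was never observed.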
Example of library fine data (1000 records) with missing values
| Serial | Distance from library | Fine amount |
|---|---|---|
| 1 | 1.7 mi | $11 |
| 2 | 2.1 mi | $10 |
| 3 | 8.6 mi | NA |
| 4 | 0.2 mi | $3 |
| 5 | 6.1 mi | NA |
| ... | ... | ... |
| 1000 | 5.3 mi | $10 |
Fig. 1 Regression lines from two random samples of 100 records drawn from the 1000 library fine records
Multiple imputation for Table 4
| Serial | Distance from library | Fine amount [1st Imputation] | Fine amount [2nd Imputation] | Fine amount [3rd Imputation] |
|---|---|---|---|---|
| 1 | 1.7 mi | $11 | $11 | $11 |
| 2 | 2.1 mi | $10 | $10 | $10 |
| 3 | 8.6 mi | | | |
| 4 | 0.2 mi | $3 | $3 | $3 |
| 5 | 6.1 mi | | | |
| ... | ... | ... | ... | ... |
| 1000 | 5.3 mi | $10 | $10 | $10 |
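
The idea behind the three "Fine amount" columns can be sketched as repeated stochastic regression imputation: fit a line to the observed records, then fill each gap with the prediction plus a random residual, once per imputation. This is a simplified stand-in for multiple imputation (a full procedure would also draw the regression parameters), it uses only the four observed rows visible in the table, and the function names and seeds are illustrative assumptions.

```python
import random
from statistics import mean

# Visible rows of the library fine table only (distance in miles, fine in dollars).
# The elided rows are not reconstructed here.
observed = [(1.7, 11.0), (2.1, 10.0), (0.2, 3.0), (5.3, 10.0)]
missing_distances = [8.6, 6.1]

def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar = mean(x for x, _ in points)
    y_bar = mean(y for _, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residual_sd(points, intercept, slope):
    """Residual standard deviation with n - 2 degrees of freedom."""
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    return (sse / (len(points) - 2)) ** 0.5

def impute_once(points, xs, rng):
    """One set of imputed values: regression prediction plus a random residual draw."""
    intercept, slope = fit_line(points)
    sd = residual_sd(points, intercept, slope)
    return [intercept + slope * x + rng.gauss(0.0, sd) for x in xs]

# Three imputations, mirroring the three "Fine amount" columns; seeds are arbitrary.
imputations = [
    impute_once(observed, missing_distances, random.Random(seed)) for seed in (1, 2, 3)
]
for m, values in enumerate(imputations, start=1):
    print(m, [round(v, 2) for v in values])
```

Each run yields different plausible values for serials 3 and 5, so the spread across the imputations reflects the uncertainty about the missing fines rather than hiding it behind a single number.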
Fig. 2 MICE flowchart
Fig. 3 Flowchart of SICE
Fig. 4 Block diagram of the system
List of existing algorithms implemented for comparison
| Binary attributes | Ordinal attributes | Numeric attributes |
|---|---|---|
| Logistic regression | Polytomous logistic regression (POLYREG) | Amelia |
| Predictive mean matching (PMM) | Predictive mean matching (PMM) | k-nearest neighbors (kNN) |
| Fuzzy unordered rule induction algorithm (FURIA) | Linear discriminant analysis (LDA) | Predictive mean matching (PMM) |
| Support vector machine (SVM) | Classification and regression tree (CART) | Bayesian linear regression (BLR) |
Datasets used for imputation of binary attribute
| Dataset name | Targeted attribute name |
|---|---|
| HairEyeColor | Gender |
| Local health dataset | Gender |
| Local health dataset | Age (Binary) |
Results for binary dataset “gender”
| Algorithm | Accuracy | Sensitivity | Precision | Specificity | F-measure |
|---|---|---|---|---|---|
| MICE (PMM) | 0.546 | 0.546 | 0.546 | 0.547 | 0.546 |
| FURIA | 0.558 | 0.558 | 0.597 | 0.128 | 0.468 |
| SVM | 0.517 | 0.188 | 0.522 | 0.847 | 0.276 |
| SICE (PMM) | 0.576 | 0.656 | 0.656 | 0.499 | 0.656 |
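
As a quick check on how the columns relate, the F-measure is conventionally the harmonic mean of precision and sensitivity. The sketch below applies that standard formula to the SVM row and computes the relative F-measure gain of SICE (PMM) over MICE (PMM). Whether the authors used exactly this formula for every row is an assumption; the FURIA row, for instance, would give about 0.577 rather than the reported 0.468, which may indicate a different averaging scheme.

```python
def f_measure(precision: float, sensitivity: float) -> float:
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)

# SVM row of the table: precision 0.522, sensitivity 0.188, reported F-measure 0.276.
svm_f = f_measure(0.522, 0.188)

# Relative gain of SICE (PMM) over MICE (PMM) in reported F-measure.
relative_gain = (0.656 - 0.546) / 0.546

print(round(svm_f, 3))          # 0.276
print(round(relative_gain, 3))  # 0.201
```

The relative gain of roughly 20% matches the improvement in F-measure for binary data stated in the abstract.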
Fig. 5 Accuracy and F-measure for four algorithms to impute gender attribute
Fig. 6 Performance comparison of MICE and SICE for additional binary datasets
Performance of MICE and SICE for ordinal attribute using local health dataset
| Algorithm | MICE accuracy | MICE F-measure | SICE accuracy | SICE F-measure |
|---|---|---|---|---|
| PMM | 0.503 | 0.246 | 0.505 | 0.238 |
| POLYREG | 0.531 | 0.303 | 0.532 | 0.312 |
| CART | 0.537 | 0.318 | 0.536 | 0.283 |
| LDA | 0.562 | 0.353 | 0.561 | 0.341 |
Fig. 7 Performance of MICE and SICE for ordinal data using PMM and POLYREG
Performance of MICE and SICE for ordinal attribute using UCI car dataset
| Algorithm | Accuracy (%), MICE | Accuracy (%), SICE | F-measure (%), MICE | F-measure (%), SICE |
|---|---|---|---|---|
| PMM | 62.42 | 74.56 | 23.41 | 29.51 |
| POLYREG | 83.81 | 89.59 | 72.35 | 76.29 |
| CART | 89.01 | 93.06 | 76.88 | 81.83 |
| LDA | 80.92 | 80.92 | 60.63 | 64.92 |
Fig. 8 Comparison of execution time of MICE and SICE to impute UCI car dataset
Performance of the algorithms for numeric attribute of local health dataset
| Algorithm | RMSE score | Execution time (s) |
|---|---|---|
| SICE (BLR) | 19 | |
| MICE (PMM) | 21.85 | 282 |
| MICE (BLR) | 24.47 | 18 |
| Amelia | 25.6 | |
| kNN | 25.25 | 154 |
Fig. 9 Performance of algorithms to predict house prices