Min-Wei Huang, Wei-Chao Lin, Chih-Fong Tsai.
Abstract
Many real-world medical datasets contain some proportion of missing (attribute) values. In general, missing value imputation can be performed to solve this problem: the missing values are estimated by a reasoning process based on the (complete) observed data. However, if the observed data contain noisy information or outliers, the estimates of the missing values may be unreliable or may even differ considerably from the real values. The aim of this paper is to examine whether combining instance selection from the observed data with missing value imputation offers better performance than performing missing value imputation alone. In particular, three instance selection algorithms (DROP3, GA, and IB3) and three imputation algorithms (KNNI, MLP, and SVM) are used in order to find the best combination. The experimental results show that performing instance selection can have a positive impact on missing value imputation for numerical medical datasets, and that specific combinations of instance selection and imputation methods can improve the imputation results for mixed medical datasets. However, instance selection does not have a definitely positive impact on the imputation result for categorical medical datasets.
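The idea of filtering the observed data before imputing can be illustrated with a minimal sketch. It is not the paper's implementation: Wilson's edited nearest neighbour rule stands in for DROP3, GA, and IB3, a plain k-nearest-neighbour mean stands in for KNNI, and the function names (`enn_keep_mask`, `knn_impute`) and the toy data are made up for illustration.

```python
import numpy as np

def enn_keep_mask(X, y, k=3):
    """Edited nearest neighbour filter (a stand-in for DROP3/GA/IB3).

    Returns a boolean mask that keeps an instance only if its label matches
    the majority label of its k nearest neighbours.
    """
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # an instance is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    majority = np.array([np.bincount(y[idx]).argmax() for idx in neighbours])
    return majority == y

def knn_impute(row, X_ref, k=3):
    """Fill NaNs in `row` with the mean of its k nearest reference instances.

    Distances use only the attributes that are observed in `row`.
    """
    missing = np.isnan(row)
    dist = np.linalg.norm(X_ref[:, ~missing] - row[~missing], axis=1)
    nearest = np.argsort(dist)[:k]
    filled = row.copy()
    filled[missing] = X_ref[nearest][:, missing].mean(axis=0)
    return filled

# Made-up toy data: two clean clusters plus one noisy instance (last row).
X_obs = np.array([[1.0, 1.0], [1.4, 1.2], [0.5, 0.8],
                  [5.0, 5.0], [5.4, 5.2], [4.6, 4.8],
                  [1.1, 5.1]])
y_obs = np.array([0, 0, 0, 1, 1, 1, 0])
incomplete = np.array([1.0, np.nan])  # second attribute is missing

keep = enn_keep_mask(X_obs, y_obs)
imputed_plain = knn_impute(incomplete, X_obs)           # imputation alone
imputed_selected = knn_impute(incomplete, X_obs[keep])  # selection, then imputation

print("kept instances:", keep)
print("imputation alone:", imputed_plain)
print("selection + imputation:", imputed_selected)
```

On this toy data the filter discards only the noisy last instance, and the estimate for the missing attribute moves from about 2.43 to 1.0, which matches the clean cluster the incomplete instance belongs to. Whether such a filter helps on real data depends on the dataset, as the results summarized here indicate.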
Year: 2018 PMID: 29599943 PMCID: PMC5823414 DOI: 10.1155/2018/1817479
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 2.682
Dataset information.
| Dataset | Number of instances | Number of attributes | Number of classes |
|---|---|---|---|
| Categorical datasets | | | |
| Lymphography | 148 | 18 | 4 |
| Nursery | 12960 | 8 | 11 |
| Promoters | 106 | 58 | 2 |
| SPECT | 267 | 22 | 2 |
| Numerical datasets | | | |
| Blood | 748 | 5 | 2 |
| Breast cancer | 286 | 9 | 2 |
| | 336 | 8 | 8 |
| Pima | 768 | 8 | 2 |
| Yeast | 1484 | 8 | 10 |
| Mixed datasets | | | |
| Abalone | 4177 | 8 | 29 |
| Acute | 120 | 6 | 2 |
| Contraceptive | 1473 | 9 | 3 |
| Liver_disorders | 345 | 7 | 2 |
| Statlog | 270 | 13 | 2 |
| Statlog_German | 1000 | 20 | 2 |
Figure 1: Classification results of imputation and instance selection combined with imputation over the categorical medical datasets.
Figure 2: Classification results of imputation and instance selection combined with imputation over the numerical medical datasets.
Figure 3: Classification results of imputation and instance selection combined with imputation over the mixed medical datasets.
The best imputation process over each dataset.
| Dataset | 10% missing | 20% missing | 30% missing | 40% missing | 50% missing |
|---|---|---|---|---|---|
| Categorical datasets | | | | | |
| Lymphography | GA + SVM | IB3 + KNNI | DROP3 + SVM | IB3 + KNNI | DROP3 + KNNI |
| Nursery | KNNI | GA + MLP | IB3 + MLP | MLP | MLP |
| Promoters | DROP3 + MLP | IB3/DROP3 + KNNI | IB3/DROP3 + MLP | IB3/DROP3 + SVM | IB3/DROP3 + SVM |
| SPECT | KNNI | MLP | MLP | MLP | KNNI |
| Numerical datasets | | | | | |
| Blood | GA + KNNI | GA + MLP | GA + MLP | GA + KNNI | DROP3 + MLP |
| Breast cancer | IB3 + SVM | IB3 + MLP | IB3 + SVM | IB3 + SVM | IB3 + SVM |
| | IB3 + KNNI | IB3 + KNNI | IB3 + KNNI | IB3 + KNNI | IB3 + KNNI |
| Pima | IB3/DROP3/GA + KNNI/MLP/SVM | IB3 + MLP | IB3 + KNNI/MLP | IB3 + KNNI/MLP | IB3 + MLP |
| Yeast | IB3 + SVM | IB3 + KNNI | IB3 + SVM | IB3 + SVM | GA + SVM |
| Mixed datasets | | | | | |
| Abalone | IB3 + SVM | GA + MLP | GA + MLP | IB3 + SVM | GA + MLP |
| Acute | GA + SVM | MLP/SVM | KNNI/SVM | SVM | MLP |
| Contraceptive | KNNI | SVM | SVM | IB3 + SVM | MLP |
| Liver_disorders | IB3 + KNNI | IB3 + KNNI | IB3 + KNNI | IB3/GA + SVM | IB3 + KNNI |
| Statlog | IB3 + KNNI/MLP | IB3 + KNNI/MLP/SVM | GA + SVM | IB3 + SVM | IB3 + MLP |
| Statlog_German | IB3/DROP3/GA + KNNI/MLP/SVM | IB3/DROP3/GA + KNNI/MLP/SVM | IB3/DROP3/GA + KNNI/MLP/SVM | IB3/DROP3/GA + KNNI/MLP/SVM | IB3/DROP3/GA + KNNI/MLP/SVM |
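As a rough way to read the table above, the entries can be tallied by whether the best process includes an instance selection step, that is, whether the cell has the form "selection + imputation". The sketch below copies the cells from the table; it assumes that the three row groups correspond, in order, to the categorical, numerical, and mixed datasets of Figures 1 to 3, and the names `best` and `uses_selection` are made up for illustration.

```python
# Best process per dataset at missing rates 10% to 50%, copied from the table.
# The group names are an assumption; rows are left unnamed because one dataset
# name is missing in the source.
ALL = "IB3/DROP3/GA + KNNI/MLP/SVM"
best = {
    "categorical": [
        ["GA + SVM", "IB3 + KNNI", "DROP3 + SVM", "IB3 + KNNI", "DROP3 + KNNI"],
        ["KNNI", "GA + MLP", "IB3 + MLP", "MLP", "MLP"],
        ["DROP3 + MLP", "IB3/DROP3 + KNNI", "IB3/DROP3 + MLP",
         "IB3/DROP3 + SVM", "IB3/DROP3 + SVM"],
        ["KNNI", "MLP", "MLP", "MLP", "KNNI"],
    ],
    "numerical": [
        ["GA + KNNI", "GA + MLP", "GA + MLP", "GA + KNNI", "DROP3 + MLP"],
        ["IB3 + SVM", "IB3 + MLP", "IB3 + SVM", "IB3 + SVM", "IB3 + SVM"],
        ["IB3 + KNNI"] * 5,
        [ALL, "IB3 + MLP", "IB3 + KNNI/MLP", "IB3 + KNNI/MLP", "IB3 + MLP"],
        ["IB3 + SVM", "IB3 + KNNI", "IB3 + SVM", "IB3 + SVM", "GA + SVM"],
    ],
    "mixed": [
        ["IB3 + SVM", "GA + MLP", "GA + MLP", "IB3 + SVM", "GA + MLP"],
        ["GA + SVM", "MLP/SVM", "KNNI/SVM", "SVM", "MLP"],
        ["KNNI", "SVM", "SVM", "IB3 + SVM", "MLP"],
        ["IB3 + KNNI", "IB3 + KNNI", "IB3 + KNNI", "IB3/GA + SVM", "IB3 + KNNI"],
        ["IB3 + KNNI/MLP", "IB3 + KNNI/MLP/SVM", "GA + SVM", "IB3 + SVM", "IB3 + MLP"],
        [ALL] * 5,
    ],
}

def uses_selection(cell):
    """A cell names an instance selection step when it has the form 'X + Y'."""
    return "+" in cell

# Per group: (cells whose best process uses instance selection, total cells).
tally = {}
for group, rows in best.items():
    cells = [cell for row in rows for cell in row]
    tally[group] = (sum(uses_selection(c) for c in cells), len(cells))

for group, (with_selection, total) in tally.items():
    print(f"{group}: {with_selection}/{total} best results use instance selection")
```

With the entries as listed, instance selection appears in 12 of 20 best results for the first group, 25 of 25 for the second, and 22 of 30 for the third. This is consistent with the abstract's summary that the benefit is clearest for numerical data and least definite for categorical data. The tally counts table cells only and says nothing about the size of the improvement.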