Kamran Mehrabani-Zeinabad, Marziyeh Doostfatemeh, Seyyed Mohammad Taghi Ayatollahi.
Abstract
Missing data is one of the most important causes of reduced classification accuracy. Many real datasets suffer from missing values, especially in the medical sciences. Imputation is a common way to deal with incomplete datasets. Various imputation methods can be applied, and the choice of the best method depends on dataset conditions such as sample size, missing percentage, and missing mechanism. A better solution is therefore to classify incomplete datasets without imputation and without any loss of information. The structure of the "Bayesian additive regression trees" (BART) model has been extended with the "Missingness Incorporated in Attributes" (MIA) approach to address its inefficiency in handling missingness. The implementation of MIA-within-BART is named "BART.m". As the abilities of BART.m have not been investigated in the classification of incomplete datasets, this simulation-based study aimed to provide such a resource. The results indicate that BART.m can be used even for datasets with a missing percentage of 90% and, more importantly, that it identifies irrelevant variables and removes them on its own. BART.m outperforms common models for classification with incomplete data in terms of accuracy and computational time. Based on these properties, BART.m is a high-accuracy model for the classification of incomplete datasets that avoids any assumptions and preprocessing steps.
Year: 2020 PMID: 33299878 PMCID: PMC7710403 DOI: 10.1155/2020/8810143
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. Splitting rule choices of nodes in MIA.
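Figure 1 refers to the splitting-rule choices that MIA makes available at a node. As a minimal sketch, assuming the usual three-way formulation of MIA (missing values routed with the low side, routed with the high side, or separated from all observed values), the routing could look like the following. The function name `mia_routes` and the use of `None` to mark a missing entry are illustrative assumptions, not part of BART.m.

```python
def mia_routes(values, threshold):
    """Return the three MIA left/right assignments for one variable.

    values: list of floats, with None marking a missing entry (assumption).
    Each result is a list of booleans, True meaning "goes to the left child".
    """
    missing = [v is None for v in values]
    below = [(v is not None) and (v <= threshold) for v in values]
    return {
        # Choice 1: observed values <= threshold go left, missing values go left too.
        "missing_left": [b or m for b, m in zip(below, missing)],
        # Choice 2: observed values <= threshold go left, missing values go right.
        "missing_right": below,
        # Choice 3: split on missingness itself (missing left, observed right).
        "missing_only": missing,
    }


routes = mia_routes([1.0, None, 3.0, None, 5.0], threshold=3.0)
for name, goes_left in routes.items():
    print(name, goes_left)
```

The sketch only shows how observations would be routed under each choice. How a tree model chooses among these rules (for example, by sampling in a Bayesian sum-of-trees model) is outside its scope.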
Specifications of real-world datasets.
| Dataset name | Sample size | Number of variables | Number of discrete variables | Missing proportion (%) | Imbalance (%) |
|---|---|---|---|---|---|
| Breast Cancer Wisconsin | 699 | 10 | 0 | 2.29 | 65.5 |
| Chronic kidney disease | 400 | 24 | 13 | 60.5 | 62.5 |
| Congressional voting records | 435 | 16 | 16 | 46.67 | 61.4 |
| Credit approval | 690 | 15 | 9 | 5.36 | 55.5 |
| Cylinder bands | 540 | 39 | 19 | 48.7 | 57.8 |
| Heart disease—Hungarian | 294 | 13 | 7 | 99.66 | 63.9 |
| Hepatitis | 155 | 19 | 13 | 48.39 | 79.4 |
| Horse colic | 368 | 23 | 15 | 98.1 | 63 |
| Mammographic mass | 961 | 5 | 2 | 13.63 | 53.7 |
| Ozone level detection | 2536 | 73 | 0 | 27.13 | 97.1 |
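The table does not define its last two columns. The sketch below assumes that the missing proportion is the percentage of rows with at least one missing value and that imbalance is the percentage of samples in the most frequent class. Both definitions are assumptions, although they agree with the table's arithmetic: 2.29% of 699 rows is about 16 rows, and 99.66% of 294 rows is 293 rows. The helper names are hypothetical.

```python
from collections import Counter


def missing_row_percent(rows):
    """Percent of rows that contain at least one missing (None) entry."""
    incomplete = sum(any(v is None for v in row) for row in rows)
    return 100.0 * incomplete / len(rows)


def majority_class_percent(labels):
    """Percent of samples that belong to the most frequent class."""
    top_count = Counter(labels).most_common(1)[0][1]
    return 100.0 * top_count / len(labels)


# Toy data (not from the paper): 2 of 4 rows are incomplete, 3 of 4 labels are "a".
toy_rows = [[1, None], [2, 3], [None, None], [4, 5]]
toy_labels = ["a", "a", "b", "a"]
print(missing_row_percent(toy_rows), majority_class_percent(toy_labels))
```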
Accuracies achieved by simulation under MCAR missing mechanism.
| Variable | Model | Complete variable | 10% missing | 20% missing | 30% missing | 40% missing | 50% missing | 60% missing | 70% missing | 80% missing | 90% missing | Exclude variable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x1 | BART.m | 66.56 | 66.27 | 65.97 | 65.66 | 65.34 | 65.05 | 64.80 | 64.50 | 64.22 | 63.97 | 63.65 |
| x1 | BART.i | 66.56 | 66.04 | 65.44 | 64.75 | 63.99 | 63.29 | 62.56 | 61.80 | 61.24 | 60.86 | 63.65 |
| x1 | RF.i | 63.63 | 63.32 | 62.94 | 62.36 | 61.80 | 61.22 | 60.60 | 60.03 | 59.49 | 59.15 | 60.74 |
| x2 | BART.m | 66.56 | 66.55 | 66.56 | 66.50 | 66.52 | 66.54 | 66.56 | 66.57 | 66.57 | 66.65 | 66.67 |
| x2 | BART.i | 66.56 | 66.53 | 66.42 | 66.32 | 66.20 | 65.89 | 65.61 | 65.11 | 64.38 | 62.69 | 66.67 |
| x2 | RF.i | 63.65 | 63.61 | 63.47 | 63.30 | 63.14 | 62.82 | 62.60 | 62.23 | 61.62 | 60.26 | 65.91 |
| x3 | BART.m | 66.56 | 66.14 | 65.71 | 65.30 | 64.89 | 64.56 | 64.21 | 63.77 | 63.44 | 63.11 | 62.62 |
| x3 | BART.i | 66.56 | 65.91 | 64.97 | 63.90 | 62.85 | 61.64 | 60.64 | 59.48 | 58.51 | 57.16 | 62.62 |
| x3 | RF.i | 63.63 | 63.23 | 62.56 | 61.79 | 61.01 | 60.16 | 59.30 | 58.34 | 57.47 | 56.16 | 63.03 |
| x4 | BART.m | 66.56 | 65.91 | 65.21 | 64.53 | 63.88 | 63.30 | 62.65 | 62.01 | 61.42 | 60.83 | 60.14 |
| x4 | BART.i | 66.56 | 65.67 | 64.49 | 63.21 | 61.86 | 60.52 | 59.28 | 57.91 | 56.61 | 55.15 | 60.14 |
| x4 | RF.i | 63.63 | 63.08 | 62.26 | 61.28 | 60.28 | 59.24 | 58.17 | 56.97 | 55.85 | 54.47 | 59.60 |
Figure 2. Mean and standard deviation of accuracies achieved by simulation under the MCAR missing mechanism.
Accuracies achieved by simulation under MAR missing mechanism.
| Variable | Model | Complete variable | 10% missing | 20% missing | 30% missing | 40% missing | 50% missing | 60% missing | 70% missing | 80% missing | 90% missing | Exclude variable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x1 | BART.m | 66.56 | 66.26 | 65.96 | 65.63 | 65.37 | 65.08 | 64.79 | 64.52 | 64.24 | 64.01 | 63.65 |
| x1 | BART.i | 66.56 | 66.04 | 65.47 | 64.76 | 64.00 | 63.28 | 62.42 | 61.69 | 61.17 | 60.64 | 63.65 |
| x1 | RF.i | 63.63 | 63.25 | 62.81 | 62.31 | 61.68 | 61.11 | 60.47 | 59.91 | 59.40 | 59.07 | 60.74 |
| x2 | BART.m | 66.56 | 66.53 | 66.51 | 66.54 | 66.54 | 66.57 | 66.60 | 66.61 | 66.65 | 66.67 | 66.67 |
| x2 | BART.i | 66.56 | 66.50 | 66.45 | 66.31 | 66.14 | 65.81 | 65.42 | 64.78 | 63.88 | 62.17 | 66.67 |
| x2 | RF.i | 63.65 | 63.55 | 63.48 | 63.29 | 63.17 | 62.90 | 62.51 | 61.96 | 61.34 | 59.87 | 65.91 |
| x3 | BART.m | 66.56 | 66.22 | 65.87 | 65.48 | 65.08 | 64.68 | 64.30 | 63.87 | 63.48 | 63.13 | 62.62 |
| x3 | BART.i | 66.56 | 65.98 | 64.90 | 63.74 | 62.64 | 61.52 | 60.34 | 59.31 | 58.27 | 56.98 | 62.62 |
| x3 | RF.i | 63.63 | 63.17 | 62.36 | 61.59 | 60.81 | 59.95 | 59.01 | 58.13 | 57.25 | 55.97 | 63.03 |
| x4 | BART.m | 66.56 | 66.30 | 66.08 | 65.71 | 65.27 | 64.76 | 63.99 | 63.25 | 62.28 | 61.32 | 60.14 |
| x4 | BART.i | 66.56 | 66.14 | 65.50 | 64.55 | 63.42 | 62.06 | 60.39 | 58.71 | 56.77 | 54.81 | 60.14 |
| x4 | RF.i | 63.63 | 63.35 | 62.90 | 62.27 | 61.49 | 60.63 | 59.32 | 57.95 | 56.21 | 54.46 | 59.60 |
Accuracies achieved by simulation under MNAR missing mechanism.
| Variable | Model | Complete variable | 10% missing | 20% missing | 30% missing | 40% missing | 50% missing | 60% missing | 70% missing | 80% missing | 90% missing | Exclude variable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x1 | BART.m | 66.56 | 66.35 | 66.16 | 65.87 | 65.51 | 65.10 | 64.76 | 64.44 | 64.16 | 63.92 | 63.65 |
| x1 | BART.i | 66.56 | 66.04 | 65.43 | 64.70 | 64.15 | 63.70 | 63.47 | 63.39 | 63.33 | 63.11 | 63.65 |
| x1 | RF.i | 63.63 | 63.27 | 62.86 | 62.31 | 61.75 | 61.40 | 60.99 | 60.73 | 60.57 | 60.27 | 60.74 |
| x2 | BART.m | 66.56 | 66.57 | 66.52 | 66.55 | 66.55 | 66.54 | 66.55 | 66.59 | 66.58 | 66.65 | 66.67 |
| x2 | BART.i | 66.56 | 66.51 | 66.46 | 66.36 | 66.20 | 65.99 | 65.57 | 65.10 | 64.27 | 62.93 | 66.67 |
| x2 | RF.i | 63.65 | 63.62 | 63.47 | 63.35 | 63.14 | 62.94 | 62.58 | 62.19 | 61.60 | 60.51 | 65.91 |
| x3 | BART.m | 66.56 | 66.07 | 65.52 | 65.04 | 64.61 | 64.20 | 63.85 | 63.53 | 63.31 | 63.05 | 62.62 |
| x3 | BART.i | 66.56 | 65.51 | 64.38 | 63.30 | 62.25 | 61.30 | 60.38 | 59.47 | 58.70 | 57.45 | 62.62 |
| x3 | RF.i | 63.63 | 62.91 | 62.11 | 61.36 | 60.58 | 59.83 | 59.09 | 58.35 | 57.62 | 56.53 | 63.03 |
| x4 | BART.m | 66.56 | 66.07 | 65.44 | 64.87 | 64.21 | 63.50 | 62.90 | 62.20 | 61.55 | 60.92 | 60.14 |
| x4 | BART.i | 66.56 | 65.32 | 63.89 | 62.61 | 61.24 | 59.94 | 58.82 | 57.58 | 56.49 | 55.28 | 60.14 |
| x4 | RF.i | 63.63 | 62.73 | 61.66 | 60.72 | 59.64 | 58.59 | 57.63 | 56.69 | 55.63 | 54.47 | 59.60 |
Figure 3. Mean and standard deviation of accuracies achieved by simulation under the MAR missing mechanism.
Figure 4. Mean and standard deviation of accuracies achieved by simulation under the MNAR missing mechanism.
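The simulation tables vary the missing mechanism (MCAR, MAR, MNAR) and the missing percentage of one variable at a time. This excerpt does not describe how the missingness was generated, so the following is only a generic sketch of the three mechanisms under their standard definitions. The function `make_mask`, the ranking-with-noise scheme, and the choice of a single driver variable are all assumptions for illustration.

```python
import random


def make_mask(x_target, x_other, rate, mechanism, seed=0):
    """Return a list of booleans, True meaning "set x_target[i] to missing".

    Exactly round(rate * n) entries are masked. Which entries are chosen
    depends on the mechanism:
      MCAR: uniformly at random, independent of any data
      MAR:  entries with a large value of the observed x_other are preferred
      MNAR: entries with a large value of x_target itself are preferred
    """
    rng = random.Random(seed)
    n = len(x_target)
    k = round(rate * n)
    if mechanism == "MCAR":
        score = [rng.random() for _ in range(n)]
    elif mechanism == "MAR":
        score = [x_other[i] + rng.gauss(0, 1) for i in range(n)]
    elif mechanism == "MNAR":
        score = [x_target[i] + rng.gauss(0, 1) for i in range(n)]
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    # Mask the k entries with the highest scores.
    order = sorted(range(n), key=lambda i: score[i], reverse=True)
    chosen = set(order[:k])
    return [i in chosen for i in range(n)]


# Toy usage (synthetic data, not the paper's simulation design).
data_rng = random.Random(1)
x1 = [data_rng.gauss(0, 1) for _ in range(1000)]
x2 = [data_rng.gauss(0, 1) for _ in range(1000)]
for mech in ("MCAR", "MAR", "MNAR"):
    mask = make_mask(x1, x2, rate=0.3, mechanism=mech)
    print(mech, sum(mask))
```

Ranking by a noisy score fixes the missing percentage exactly, which makes it easy to sweep from 10% to 90% as in the tables. A probabilistic (for example, logistic) missingness model would be an equally valid design.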
Mean ± standard deviation of classification accuracies of real-world datasets.
| Dataset name | BART.m | BART.i | RF.i |
|---|---|---|---|
| Breast Cancer Wisconsin | 96.74 ± 0.20 | 96.44 ± 0.18 | 97.06 ± 0.21 |
| Chronic kidney disease | 99.76 ± 0.21 | 97.32 ± 0.43 | 99.51 ± 0.15 |
| Congressional voting records | 95.86 ± 0.30 | 95.90 ± 0.27 | 95.92 ± 0.30 |
| Credit approval | 86.40 ± 0.40 | 86.50 ± 0.37 | 86.85 ± 0.49 |
| Cylinder bands | 78.83 ± 0.85 | 79.08 ± 0.83 | 84.46 ± 0.76 |
| Heart disease—Hungarian | 83.95 ± 0.54 | 78.47 ± 1.78 | 78.76 ± 1.63 |
| Hepatitis | 83.53 ± 1.11 | 86.22 ± 0.97 | 87.00 ± 1.20 |
| Horse colic | 84.69 ± 0.57 | 83.50 ± 0.78 | 83.15 ± 0.98 |
| Mammographic mass | 83.40 ± 0.30 | 82.82 ± 0.34 | 81.48 ± 0.52 |
| Ozone level detection | 97.10 ± 0.02 | 97.10 ± 0.05 | 96.97 ± 0.04 |
Run times of the different methods on the real-world datasets (min:s; h:min:s where three fields are shown).
| Dataset name | BART.m | missForest | BART | RF |
|---|---|---|---|---|
| Breast Cancer Wisconsin | 0:02.44 | 0:05.92 | 0:02.29 | 0:00.04 |
| Chronic kidney disease | 0:04.71 | 2:16.30 | 0:01.48 | 0:00.05 |
| Congressional voting records | 0:03.85 | 0:20.17 | 0:01.59 | 0:00.07 |
| Credit approval | 0:05.99 | 1:21.50 | 0:02.46 | 0:00.11 |
| Cylinder bands | 0:08.23 | 6:04.30 | 0:02.20 | 0:00.10 |
| Heart disease—Hungarian | 0:01.98 | 0:17.38 | 0:01.13 | 0:00.05 |
| Hepatitis | 0:01.45 | 0:19.29 | 0:00.74 | 0:00.05 |
| Horse colic | 0:05.10 | 1:37.10 | 0:01.44 | 0:00.08 |
| Mammographic mass | 0:04.58 | 0:40.35 | 0:03.20 | 0:00.12 |
| Ozone level detection | 0:58.98 | 4:39:36.98 | 0:07.81 | 0:00.11 |
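To compare the run times numerically, the colon-separated time strings can be converted to seconds. The sketch below assumes the fields are minutes and seconds, or hours, minutes, and seconds when three fields are present. It also assumes that the cost of an imputation-based pipeline is the missForest time plus the classifier time; neither assumption is stated in the table, and `to_seconds` is a hypothetical helper.

```python
def to_seconds(stamp):
    """Convert 'm:ss.ss' or 'h:mm:ss.ss' (spaces allowed) to seconds."""
    seconds = 0.0
    for part in stamp.replace(" ", "").split(":"):
        # Each further field is 60 times finer than the previous one.
        seconds = seconds * 60 + float(part)
    return seconds


# Ozone level detection row of the run time table.
bart_m = to_seconds("0 : 58.98")
impute_then_bart = to_seconds("4 : 39 : 36.98") + to_seconds("0 : 07.81")
print(round(impute_then_bart / bart_m))  # roughly 285 times slower
```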
Figure 5. Means and standard deviations of the classification accuracy of the BART.m, BART.i, and RF.i models (top plot) and the corresponding run times in seconds (bottom plot) on ten real-world datasets. The dataset names are abbreviated on the horizontal axis.