Yuzhe Liu, Vanathi Gopalakrishnan.
Abstract
Many clinical research datasets have a large percentage of missing values, which directly reduces their usefulness for training high-accuracy classifiers in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine whether increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models versus unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.
Keywords: decision tree imputation; k-nearest neighbors imputation; machine learning; missing value imputation; self-organizing map imputation
Year: 2017 PMID: 28243594 PMCID: PMC5325161 DOI: 10.3390/data2010008
Source DB: PubMed Journal: Data (Basel) ISSN: 2306-5729
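As a rough illustration of two of the four imputation methods named in the abstract, the sketch below implements mean imputation and a simple k-nearest-neighbors imputation in plain Python. The toy rows, the use of `None` for missing values, the distance definition, and the function names are assumptions made for illustration; the authors' actual implementation and parameters are not given in this record.

```python
import math


def mean_impute(rows):
    """Replace each missing value (None) with the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def _distance(a, b):
    """Root-mean-square difference over the columns observed in both rows (assumed metric)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k=2):
    """Replace each missing value with the mean of that column over the k nearest rows.

    Only rows that observed the column are eligible neighbors; assumes at least one exists.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = [
                (_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Hypothetical toy data: two numeric variables, one missing entry.
toy_rows = [
    [1.0, 10.0],
    [2.0, None],
    [3.0, 30.0],
    [10.0, 100.0],
]
print(mean_impute(toy_rows)[1])
print(knn_impute(toy_rows, k=2)[1])
```

On this toy data the two methods give different fills because mean imputation is pulled toward the outlying fourth row, while k-NN only averages the two most similar rows.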
Variable definitions for the 14-variable and 27-variable sets, and the percentage of each variable that was missing in the positive (+) versus the non-positive (−) MRI group.
| Variables in 14 Variable Set | Definition | Percentage Missing (+) | Percentage Missing (−) |
|---|---|---|---|
| BSA | Body Surface Area | 3.2% | 8.8% |
| EDV index | End diastolic volume index | 38.7% | 38.6% |
| ESV index | End systolic volume index | 38.7% | 38.6% |
| SV index | Stroke volume index | 38.7% | 38.6% |
| FS | Fractional shortening | 3.2% | 5.3% |
| EF | Ejection fraction | 32.3% | 35.1% |
| Ao V2 max | Aortic V2 max | 3.2% | 1.8% |
| Ao max PG | Aortic max pressure gradient | 3.2% | 1.8% |
| MV E/A | Mitral valve E/A ratio | 16.1% | 1.7% |
| IVSd z-score | Interventricular septum thickness measured in diastole, z-score | 3.2% | 10.5% |
| LVIDd z-score | Left ventricular internal dimension measured in diastole, z-score | 3.2% | 10.5% |
| LVIDs z-score | Left ventricular internal dimension measured in systole, z-score | 3.2% | 12.3% |
| LVPWd z-score | Left ventricular posterior wall thickness measured in diastole, z-score | 3.2% | 10.5% |
| LV mass z-score | Left ventricular mass measured in diastole, z-score | 3.2% | 10.5% |
| Additional Variables in 27 Variable Set | Definition | Percentage Missing (+) | Percentage Missing (−) |
| Age | Age at scan | 0% | 0% |
| Height | Height at scan | 3.2% | 8.8% |
| Weight | Weight at scan | 0% | 1.8% |
| Ao root diam | Aortic root diameter | 35.5% | 22.8% |
| MV A max | Mitral valve A wave max (max atrial filling velocity) | 35.5% | 24.6% |
| MV E max | Mitral valve E wave max (max early filling velocity) | 38.7% | 22.8% |
| PA V2 max | Pulmonary artery V2 max | 12.9% | 3.5% |
| PA max PG | Pulmonary artery max pressure gradient | 12.9% | 3.5% |
| TR max PG | Tricuspid regurgitation max pressure gradient | 35.5% | 50.9% |
| TR max vel | Tricuspid regurgitation max velocity | 38.7% | 52.6% |
| TV A max | Tricuspid valve A wave max (max atrial filling velocity) | 25.8% | 14.0% |
| TV E max | Tricuspid valve E wave max (max early filling velocity) | 19.4% | 7.0% |
| TV E/A | Tricuspid valve E/A ratio | 64.5% | 64.9% |
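The per-group missingness percentages in the table above can be computed with a short routine like the following. This is a minimal sketch: the record layout (a list of dictionaries with `None` for missing values), the `mri` label key, and the toy values are all hypothetical and are not taken from the study data.

```python
def percent_missing_by_class(records, label_key, variables):
    """Return {variable: {label: percent of records in that class with the variable missing}}."""
    labels = sorted({r[label_key] for r in records})
    result = {}
    for var in variables:
        result[var] = {}
        for label in labels:
            group = [r for r in records if r[label_key] == label]
            missing = sum(1 for r in group if r.get(var) is None)
            result[var][label] = round(100.0 * missing / len(group), 1)
    return result


# Hypothetical toy records, not study data.
toy_records = [
    {"mri": "+", "EF": 55.0, "BSA": 1.2},
    {"mri": "+", "EF": None, "BSA": 1.4},
    {"mri": "-", "EF": 60.0, "BSA": None},
    {"mri": "-", "EF": 62.0, "BSA": 1.1},
    {"mri": "-", "EF": None, "BSA": 1.3},
    {"mri": "-", "EF": 58.0, "BSA": 1.0},
]
print(percent_missing_by_class(toy_records, "mri", ["EF", "BSA"]))
```

Comparing the two columns per variable is a quick way to spot variables whose missingness differs between classes, which matters because class-dependent missingness can itself carry signal.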
Figure 1. Example ruleset generated using Bayesian Rule Learning (BRL).
Figure 2. Workflow diagram of evaluation of decision tree imputation.
Figure 3. Workflow diagram of evaluation of imputation-augmented models.
Figure 4. Accuracy and agreement of decision tree imputed values in 10-fold cross validation. Accuracy was calculated from the imputed values for the samples that had values, while agreement was calculated from the imputed values for the samples that did not have values.
Figure 5. Imputed values for four representative variables: (a) ejection fraction (EF), (b) interventricular septum thickness z-score (IVSdZScore), (c) tricuspid regurgitation max pressure gradient (TR Max PG), and (d) tricuspid regurgitation max velocity (TR Max vel). Observed values for the positive class are shown as black circles and observed values for the negative class are shown as black X’s. Imputed values for mean, k-NN, and SOM imputation are shown as red, green, and blue dots, respectively. Because decision tree (DT) imputation requires discretized values, imputed values are reported as a discretized range.
Sensitivity, specificity, accuracy, and AUC of BRL rules learned on 14 variables using imputation-augmented data versus unaugmented data evaluated on complete data only, averaged over five 10-fold cross-validations. Performance metrics are tested against the performance of the unaugmented model. After Bonferroni correction for multiple comparisons, α = 0.0125 is the significance threshold (significant values denoted by *).
| Method | Sensitivity | Specificity | Accuracy | AUC |
|---|---|---|---|---|
| Unaugmented model | 44.7 ± 4.7 | 88.7 ± 5.1 | 73.5 ± 3.4 | 59.2 ± 6.5 |
| Mean imputation | 38.9 ± 6.1 | 87.5 ± 4.0 | 70.0 ± 1.8 | 60.8 ± 2.9 |
| Decision tree imputation | 42.2 ± 7.5 | 81.9 ± 5.4 | 67.6 ± 4.5 | 65.8 ± 4.8 |
| k-NN imputation | 35.6 ± 5.7 | 86.9 ± 5.0 | 68.4 ± 3.25 | 57.6 ± 1.2 |
| SOM imputation | 38.9 ± 3.5 | 85.6 ± 4.7 | 68.8 ± 3.3 | 57.8 ± 1.9 |
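For readers who want to reproduce this kind of comparison, the sketch below shows how sensitivity, specificity, and accuracy follow from predicted and true labels, and how a Bonferroni-corrected threshold is obtained. The family-wise α of 0.05 and the count of four comparisons are assumptions: they are consistent with the stated threshold of 0.0125 and with four imputation methods each being tested against the unaugmented model, but the record does not state them explicitly. The labels are hypothetical toy values.

```python
def confusion_metrics(y_true, y_pred, positive=1):
    """Compute sensitivity, specificity, and accuracy from paired true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / len(pairs),
    }


def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons


# Hypothetical labels, not study data.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
print(confusion_metrics(y_true, y_pred))

# Assumed family-wise alpha of 0.05 split across four comparisons.
print(bonferroni_threshold(0.05, 4))
```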
Figure 6. Performance of imputation-augmented rulesets compared to unaugmented rulesets: (a) sensitivity vs. specificity of 14-variable models evaluated on complete vs. imputed data; and (b) average receiver operating characteristic (ROC) curves of 14-variable models evaluated on complete data.
Figure 7. Performance of 27-variable rulesets compared to 14-variable rulesets: (a) sensitivity vs. specificity of 27-variable models compared to 14-variable models evaluated on complete data vs. imputed data; and (b) average ROC curves of 27-variable models evaluated on complete data.
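The area under an ROC curve such as those in Figures 6b and 7b equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes AUC directly from that definition. The scores and labels are hypothetical, and this is not necessarily how the authors computed their reported AUC values.

```python
def roc_auc(scores, labels, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores and true labels, not study data.
toy_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
toy_labels = [1, 0, 1, 0, 0]
print(roc_auc(toy_scores, toy_labels))
```

The pairwise form is quadratic in the sample size, which is acceptable for a dataset of this scale; a rank-based formulation would be preferable for large datasets.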
Rules learned on the whole training set using 14 variables.
| Method | Number of Rules Learned | Variables Used |
|---|---|---|
| Unaugmented model (14 variables) | 7 | EF, IVSd z-score (2) |
| Mean imputation (14 variables) | 15 | IVSd z-score, LVIDd z-score, LV mass z-score, BSA (4) |
| Decision tree imputation (14 variables) | 183 | IVSd z-score, LVIDd z-score, LVIDs z-score, LV mass z-score, EF, EDV index, SV index, MV E/A, Ao max PG (9) |
| k-NN imputation (14 variables) | 15 | IVSd z-score, LVIDd z-score, LV mass z-score, BSA (4) |
| SOM imputation (14 variables) | 15 | IVSd z-score, LVIDd z-score, LV mass z-score, BSA (4) |
| Mean imputation (27 variables) | 43 | IVSd z-score, LVPWd z-score, LVIDs z-score, MV A max, LV mass z-score, SV index, FS, TV A max, TV E max, height (10) |
| Decision tree imputation (27 variables) | 255 | Ao V2 max, EF, EDV index, FS, MV A max, PA V2 max, TR max vel, TV E/A, SV index, IVSd z-score, height, weight (12) |
| k-NN imputation (27 variables) | 35 | IVSd z-score, LV mass z-score, SV index, LVIDs z-score, MV |
| SOM imputation (27 variables) | 27 | IVSd z-score, LV mass z-score, SV index, Ao root diam, LVIDs z-score, TV A max, TV E max, height (8) |