Anoop D Shah, Jonathan W Bartlett, James Carpenter, Owen Nicholas, Harry Hemingway.
Abstract
Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001–2010) with complete data on all covariates. Variables were artificially made "missing at random," and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.
Keywords: angina, stable; imputation; missing data; missingness at random; regression trees; simulation; survival
Year: 2014 PMID: 24589914 PMCID: PMC3939843 DOI: 10.1093/aje/kwt312
Source DB: PubMed Journal: Am J Epidemiol ISSN: 0002-9262 Impact factor: 4.897
Figure 1. Generation of data sets with artificial missingness from a population of patients with stable angina in the CALIBER database, 2001–2010. Data sets D1, D2, …, D1,000 are samples of 2,000 patients with replacement from data set C. CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; MAR, missing at random; MCAR, missing completely at random; MICE, multivariate imputation by chained equations.
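The sampling and artificial-missingness steps described in Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population, the variable names (`age`, `hemoglobin`), and the logistic missingness model with its `intercept` and `slope` values are hypothetical and are not the mechanism used in the study.

```python
import math
import random


def sample_with_replacement(population, n, rng):
    """Draw n records from the population with replacement (one data set D_i)."""
    return [dict(rng.choice(population)) for _ in range(n)]


def make_missing_at_random(records, target, driver, intercept, slope, rng):
    """Set `target` to None with a probability that depends only on `driver`.

    `driver` stays fully observed, so the mechanism is missing at random (MAR).
    With slope = 0 the probability is constant, which gives MCAR instead.
    """
    out = []
    for rec in records:
        logit = intercept + slope * rec[driver]
        p_missing = 1.0 / (1.0 + math.exp(-logit))
        new_rec = dict(rec)
        if rng.random() < p_missing:
            new_rec[target] = None
        out.append(new_rec)
    return out


rng = random.Random(2014)

# Hypothetical complete-record population standing in for data set C.
population = [
    {"age": rng.gauss(65, 10), "hemoglobin": rng.gauss(14, 1.5)}
    for _ in range(10128)
]

sample = sample_with_replacement(population, 2000, rng)
mar_sample = make_missing_at_random(
    sample, target="hemoglobin", driver="age",
    intercept=-6.5, slope=0.1, rng=rng,
)
n_missing = sum(rec["hemoglobin"] is None for rec in mar_sample)
```

Because the true values are known before they are blanked out, estimates from each imputation method can later be compared against the full-data analysis.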
Figure 2. Cumulative incidence of myocardial infarction or death (Kaplan-Meier failure curve) for patients with stable angina in the CALIBER database, by complete record status, 2001–2010. The solid line represents patients in data set A but not data set B (those with missing data; n = 39,268 at the start, dropping to 17,588 in year 6), and the dashed line represents patients in data set B (those with complete records; n = 13,308 at the start, dropping to 2,594 in year 6). CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records.
Factors Associated With Having a Complete Record in a Study of Patients Diagnosed With Stable Angina (Logistic Regression Model), CALIBER Database, 2001–2010
| Variable | Odds Ratio | 95% Confidence Interval | P Value |
|---|---|---|---|
| Age, per 10 years | 4.81 | 4.02, 5.75 | <0.001 |
| Age squared, per 10 years squared | 0.89 | 0.87, 0.90 | <0.001 |
| Female sex | 1.08 | 1.03, 1.12 | 0.002 |
| Diabetes mellitus | 1.74 | 1.64, 1.84 | <0.001 |
| Peripheral arterial disease | 1.24 | 1.15, 1.35 | <0.001 |
| Previous stroke | 1.26 | 1.17, 1.36 | <0.001 |
| Heart failure | 0.96 | 0.89, 1.04 | 0.333 |
| Previous myocardial infarction | 1.03 | 0.98, 1.09 | 0.228 |
| Electronic laboratory results<sup>a</sup> | 4.74 | 4.48, 5.01 | <0.001 |
| Endpoint of fatal coronary heart disease | 0.40 | 0.36, 0.45 | <0.001 |
| Endpoint of nonfatal myocardial infarction | 0.29 | 0.26, 0.34 | <0.001 |
| Endpoint of noncoronary death | 0.42 | 0.39, 0.45 | <0.001 |
| Cumulative hazard | 0.03 | 0.02, 0.03 | <0.001 |
Abbreviation: CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records.
a Whether a medical practice was receiving electronic laboratory results.
Comparisons Between Methods of Handling Missing Data in 1,000 Samples With Continuous Variables Missing at Random in a Pattern Similar to That of the Original Data Set (Missingness Mechanism 1), CALIBER Database, 2001–2010
| Variable and Method | Bias<sup>a</sup> of Log HR | z Score for Bias<sup>b</sup> | SD of Estimated Log HR | Mean Length of 95% CI | Coverage of 95% CI, % | Between-Imputation Variance |
|---|---|---|---|---|---|---|
| Neutrophils (10<sup>9</sup> cells/L), per doubling | | | | | | |
| Full data | 0.002 | 0.43 | 0.158 | 0.564 | 92.2 | |
| Complete record<sup>c</sup> | −0.045 | −2.67 | 0.533 | 1.677 | 90.1 | |
| MICE normal | −0.038 | −5.15 | 0.232 | 0.883 | 93.4 | 0.0243 |
| MICE PMM | −0.042 | −5.68 | 0.230 | 0.889 | 93.4 | 0.0245 |
| missForest | −0.266 | 27.72 | 0.303 | 0.781 | 63.2 | 0.0014 |
| MICE RF 10 trees | −0.024 | −4.55 | 0.165 | 0.798 | 97.9 | 0.0143 |
| Lymphocytes (10<sup>9</sup> cells/L), per doubling | | | | | | |
| Full data | −0.007 | −1.23 | 0.155 | 0.526 | 91.6 | |
| Complete record<sup>c</sup> | −0.087 | −5.87 | 0.464 | 1.544 | 89.8 | |
| MICE normal | 0.001 | 0.13 | 0.202 | 0.759 | 93.2 | 0.0157 |
| MICE PMM | 0.006 | 0.99 | 0.205 | 0.768 | 92.4 | 0.0162 |
| missForest | −0.190 | −22.21 | 0.270 | 0.724 | 72.5 | 0.0011 |
| MICE RF 10 trees | 0.003 | 0.56 | 0.156 | 0.727 | 97.8 | 0.0109 |
| Hemoglobin, per g/dL | | | | | | |
| Full data | −0.004 | −1.99 | 0.057 | 0.202 | 91.6 | |
| Complete record<sup>c</sup> | −0.022 | −3.91 | 0.180 | 0.593 | 90.8 | |
| MICE normal | −0.007 | −2.73 | 0.076 | 0.279 | 92.6 | 0.0019 |
| MICE PMM | −0.004 | −1.47 | 0.077 | 0.279 | 92.7 | 0.0019 |
| missForest | −0.056 | −19.96 | 0.089 | 0.255 | 77.3 | 0.0001 |
| MICE RF 10 trees | −0.010 | −5.61 | 0.059 | 0.261 | 97.2 | 0.0012 |
Abbreviations: CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; CI, confidence interval; HR, hazard ratio; MICE, multivariate imputation by chained equations; PMM, predictive mean matching; RF 10 trees, random forest with 10 trees; SD, standard deviation.
a Bias was measured relative to estimates from analysis of the full data set (data set C) (Web Table 2).
b The z score is defined as the mean bias of the estimate divided by the empirical standard error from simulations, and it should lie approximately within the interval (−2, +2).
c Results for complete records were based on the 986 samples for which it was possible to estimate hazard ratios for all parameters.
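The between-imputation variance column feeds into the pooled variance of each estimate through Rubin's rules. The sketch below shows that pooling step for one coefficient; the function name `pool_rubin` and the input numbers are illustrative assumptions, not values or code from the study.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool m imputed-data estimates of one parameter using Rubin's rules.

    estimates: point estimates (e.g. log hazard ratios), one per imputed data set
    variances: squared standard errors of those estimates
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance (m - 1 divisor)
    total = within + (1 + 1 / m) * between      # total variance of the pooled estimate
    return pooled, within, between, total


# Made-up log hazard ratios from m = 5 imputed data sets.
estimates = [0.10, 0.20, 0.30, 0.20, 0.20]
variances = [0.04] * 5
pooled, within, between, total = pool_rubin(estimates, variances)
```

A method whose imputations barely differ between data sets yields a small between-imputation variance and therefore a narrower interval. In the table above, missForest combines the smallest between-imputation variance with the lowest coverage, which is consistent with its intervals being too narrow.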
Comparisons Between Methods of Handling Missing Data in 1,000 Samples With Categorical Variables Missing Completely at Random (Missingness Mechanism 2), CALIBER Database, 2001–2010
| Variable and Method | Bias<sup>a</sup> of Log HR | z Score for Bias<sup>b</sup> | SD of Estimated Log HR | Mean Length of 95% CI | Coverage of 95% CI, % | % Falsely Classified<sup>c</sup> |
|---|---|---|---|---|---|---|
| Previous myocardial infarction | | | | | | |
| Full data | 0.006 | 1.22 | 0.154 | 0.587 | 94.2 | 0 |
| MICE logistic | −0.013 | −2.46 | 0.168 | 0.682 | 95.5 | 29.6 |
| missForest | 0.002 | 0.27 | 0.179 | 0.625 | 91.8 | 17.3 |
| MICE RF 10 trees | −0.020 | −4.21 | 0.149 | 0.662 | 97.3 | 28.5 |
| Diabetes mellitus | | | | | | |
| Full data | 0.010 | 2.30 | 0.156 | 0.592 | 93.7 | 0 |
| MICE logistic | 0.016 | 3.21 | 0.171 | 0.685 | 95.7 | 32.0 |
| missForest | 0.014 | 2.73 | 0.182 | 0.627 | 90.8 | 19.7 |
| MICE RF 10 trees | −0.021 | −4.25 | 0.149 | 0.668 | 97.5 | 30.7 |
| Previous stroke | | | | | | |
| Full data | 0.005 | 0.86 | 0.198 | 0.707 | 94.0 | 0 |
| MICE logistic | −0.005 | −0.58 | 0.207 | 0.828 | 95.5 | 17.9 |
| missForest | 0.004 | 0.65 | 0.211 | 0.763 | 92.9 | 8.4 |
| MICE RF 10 trees | −0.011 | −1.79 | 0.183 | 0.808 | 97.9 | 16.7 |
| Peripheral arterial disease | | | | | | |
| Full data | 0.016 | 2.59 | 0.199 | 0.730 | 93.6 | 0 |
| MICE logistic | −0.002 | −0.21 | 0.218 | 0.858 | 94.8 | 15.5 |
| missForest | 0.028 | 4.18 | 0.223 | 0.788 | 91.9 | 7.0 |
| MICE RF 10 trees | 0.005 | 0.94 | 0.192 | 0.834 | 97.1 | 14.5 |
| Heart failure | | | | | | |
| Full data | 0.015 | 2.47 | 0.191 | 0.653 | 91.7 | 0 |
| MICE logistic | 0.015 | 2.22 | 0.207 | 0.759 | 93.8 | 14.6 |
| missForest | 0.001 | 0.08 | 0.216 | 0.696 | 89.4 | 7.2 |
| MICE RF 10 trees | −0.034 | −5.78 | 0.190 | 0.746 | 95.5 | 13.7 |
| Smoking status: current vs. never | | | | | | |
| Full data | 0.019 | 2.62 | 0.264 | 0.969 | 93.9 | 0 |
| MICE logistic | 0.023 | 2.65 | 0.292 | 1.092 | 94.0 | 52.4 |
| missForest | −0.036 | −3.56 | 0.308 | 1.062 | 91.5 | 35.0 |
| MICE RF 10 trees | −0.098 | −12.92 | 0.237 | 1.072 | 95.5 | 50.0 |
| Smoking status: former vs. never | | | | | | |
| Full data | 0.011 | 1.66 | 0.247 | 0.908 | 93.6 | 0 |
| MICE logistic | −0.008 | −0.82 | 0.266 | 1.022 | 94.1 | 52.4 |
| missForest | 0.045 | 5.34 | 0.270 | 0.980 | 93.2 | 35.0 |
| MICE RF 10 trees | −0.060 | −8.81 | 0.212 | 1.000 | 97.1 | 50.0 |
Abbreviations: CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; CI, confidence interval; HR, hazard ratio; MICE, multivariate imputation by chained equations; RF 10 trees, random forest with 10 trees; SD, standard deviation.
a Bias was measured relative to estimates from analysis of the full data set (data set C) (Web Table 2).
b The z score is defined as the mean bias of the estimate divided by the empirical standard error from simulations, and it should lie approximately within the interval (−2, +2).
c Percentage of imputed values that were different from the “true” (observed) missing value.
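The z score in footnote b can be recomputed from the tabulated bias and standard deviation. The sketch below assumes that the "empirical standard error from simulations" is the standard deviation of the estimates divided by the square root of the number of simulated samples; that reading is an assumption, although it reproduces the tabulated values to within rounding.

```python
import math


def z_score_for_bias(mean_bias, sd_of_estimates, n_simulations):
    """Mean bias divided by its empirical standard error across simulations."""
    empirical_se = sd_of_estimates / math.sqrt(n_simulations)
    return mean_bias / empirical_se


# MICE logistic, previous myocardial infarction: bias -0.013, SD 0.168, 1,000 samples.
z = z_score_for_bias(-0.013, 0.168, 1000)
```

For this row the result is about −2.4, close to the tabulated −2.46; the small gap is what one would expect from the rounded inputs.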
Figure 3. Bias in estimates of log hazard ratios for partially observed variables with data missing at random (missingness mechanism 1) in 1,000 samples of patients with stable angina in the CALIBER database, 2001–2010. A) log neutrophil count (10<sup>9</sup> cells/L); B) log lymphocyte count (10<sup>9</sup> cells/L); C) hemoglobin concentration (g/dL). The solid horizontal line is the “true” log hazard ratio from the full data set (data set C); the dashed lines show ±1 empirical standard error. The boxes span the interquartile range (25th–75th percentiles), and the whiskers extend to the most extreme data point, which is no more than 1.5 times the interquartile range from the box. Circles represent outliers. The light gray boxes show results from simulations with 50% complete records, and the dark gray boxes show results from simulations with 25% complete records. CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; MICE, multivariate imputation by chained equations; PMM, predictive mean matching; RF, random forest.
Comparisons Between Methods of Handling Missing Data in a Survival Analysis of 1,000 Simulated Data Sets With a Predictor Variable Missing at Random That Is Associated With Fully Observed Predictors in a Nonlinear Way
| Method | Bias of Log HR<sup>a</sup> | SE of Bias | z Score for Bias | SD of Estimate | Mean Length of 95% CI | Coverage of 95% CI, % |
|---|---|---|---|---|---|---|
| Full data | −0.0002 | 0.001 | −0.1 | 0.037 | 0.148 | 96.1 |
| rfImpute | 0.119 | 0.001 | 86.5 | 0.044 | 0.154 | 17.4 |
| missForest | 0.079 | 0.002 | 53.7 | 0.046 | 0.158 | 50.9 |
| MICE RF with 5 trees | −0.021 | 0.001 | −17.1 | 0.038 | 0.172 | 95.3 |
| MICE RF with 10 trees | −0.005 | 0.001 | −3.9 | 0.039 | 0.170 | 97.3 |
| MICE RF with 20 trees | 0.006 | 0.001 | 4.5 | 0.039 | 0.168 | 96.3 |
| MICE RF with 50 trees | 0.011 | 0.001 | 9.0 | 0.040 | 0.167 | 95.8 |
| MICE RF with 100 trees | 0.013 | 0.001 | 10.5 | 0.040 | 0.167 | 94.7 |
| Parametric MICE | −0.055 | 0.001 | −44.8 | 0.039 | 0.178 | 79.8 |
Abbreviations: CI, confidence interval; HR, hazard ratio; MICE, multivariate imputation by chained equations; RF, random forest; SD, standard deviation; SE, standard error.
a The true log hazard ratio was set at 0.5.
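The performance measures in this table (bias, its standard error, the standard deviation of the estimates, mean interval length, and coverage) can all be derived from the per-simulation estimates and their standard errors. The following sketch assumes simple Wald intervals (estimate ± 1.96 × SE) and uses synthetic inputs; the function name `summarize_simulation` and the simulated numbers are hypothetical, and the study's own intervals may have been constructed differently.

```python
import math
import random
import statistics


def summarize_simulation(estimates, standard_errors, true_value, z_crit=1.96):
    """Summarize bias, precision, and CI coverage across simulated data sets."""
    n = len(estimates)
    bias = statistics.fmean(estimates) - true_value
    sd = statistics.stdev(estimates)
    se_of_bias = sd / math.sqrt(n)
    # Count the intervals that contain the true parameter value.
    covered = sum(
        est - z_crit * se <= true_value <= est + z_crit * se
        for est, se in zip(estimates, standard_errors)
    )
    return {
        "bias": bias,
        "se_of_bias": se_of_bias,
        "z_score": bias / se_of_bias,
        "sd_of_estimate": sd,
        "mean_ci_length": statistics.fmean(2 * z_crit * se for se in standard_errors),
        "coverage_pct": 100.0 * covered / n,
    }


# Synthetic example: 1,000 unbiased estimates of a true log hazard ratio of 0.5.
rng = random.Random(0)
estimates = [rng.gauss(0.5, 0.04) for _ in range(1000)]
standard_errors = [0.04] * 1000
summary = summarize_simulation(estimates, standard_errors, true_value=0.5)
```

With unbiased estimates and correctly sized standard errors, coverage should sit near the nominal 95%. A biased method, or one with understated standard errors, shows up as coverage well below that level, as in the rfImpute and missForest rows.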