Firouzeh Noghrehchi, Jakub Stoklosa, Spiridon Penev, David I. Warton.
Abstract
Multiple imputation and maximum likelihood estimation (via the expectation-maximization algorithm) are two well-known methods readily used for analyzing data with missing values. While these two methods are often considered distinct from one another, multiple imputation (when using improper imputation) is actually equivalent to a stochastic expectation-maximization approximation to the likelihood. In this article, we exploit this key result to show that familiar likelihood-based approaches to model selection, such as Akaike's information criterion (AIC) and the Bayesian information criterion (BIC), can be used to choose the imputation model that best fits the observed data. Poor choice of imputation model is known to bias inference, and while sensitivity analysis has often been used to explore the implications of different imputation models, we show that the data can be used to choose an appropriate imputation model via conventional model selection tools. We show that BIC can be consistent for selecting the correct imputation model in the presence of missing data. We verify these results empirically through simulation studies, and demonstrate their practicality on two classical missing data examples. An interesting result we saw in simulations was that parameter estimates can be biased not only by misspecifying the imputation model but also by overfitting it. This emphasizes the importance of using model selection not just to choose the appropriate type of imputation model, but also to decide on the appropriate level of imputation model complexity.
Keywords: imputation model selection; information criteria; missing data analysis; stochastic EM algorithm
Year: 2021 PMID: 33629367 PMCID: PMC8248419 DOI: 10.1002/sim.8915
Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.373
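The abstract's central idea, treating the imputation model itself as something to be selected with likelihood-based criteria fitted to the observed data, can be illustrated with a minimal sketch. Here we assume a covariate with MCAR missingness and two candidate Gaussian imputation models: one with the correct mean structure and one with a misspecified (constant) mean, mirroring the paper's "Missp. mean" scenario. The simulation setup and all names below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
x = 1.0 + 1.0 * z + rng.normal(scale=0.5, size=n)  # x depends on z
miss = rng.random(n) < 0.25                        # ~25% MCAR missingness in x
obs = ~miss

def gaussian_aic_bic(resid, k, n_obs):
    """AIC/BIC for a Gaussian model with k mean parameters (+1 for sigma)."""
    sigma2 = np.mean(resid**2)                     # MLE of the error variance
    ll = -0.5 * n_obs * (np.log(2 * np.pi * sigma2) + 1)
    p = k + 1
    return -2 * ll + 2 * p, -2 * ll + p * np.log(n_obs)

xo, zo = x[obs], z[obs]
n_obs = int(obs.sum())

# Candidate 1 (correct mean model): x ~ a + b*z, fit on the observed rows
X = np.column_stack([np.ones(n_obs), zo])
beta, *_ = np.linalg.lstsq(X, xo, rcond=None)
aic1, bic1 = gaussian_aic_bic(xo - X @ beta, k=2, n_obs=n_obs)

# Candidate 2 (misspecified mean): x ~ a only
aic2, bic2 = gaussian_aic_bic(xo - xo.mean(), k=1, n_obs=n_obs)

print(aic1 < aic2, bic1 < bic2)  # prints: True True
```

Both criteria prefer the candidate whose observed-data fit is better, which is the mechanism the paper exploits for imputation model selection.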
Proportion of times (%) the information criterion chooses the fitted imputation model for different sample sizes in 500 simulated datasets

| Imputation model | Criterion | 10%, n=50 | 10%, n=100 | 10%, n=1000 | 25%, n=50 | 25%, n=100 | 25%, n=1000 | 40%, n=50 | 40%, n=100 | 40%, n=1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Correct | AIC | | | | | | | | | |
| | BIC | | | | | | | | | |
| Overpar. mild | AIC | 3 | 5 | 4 | 2 | 5 | 4 | 2 | 4 | 4 |
| | BIC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Overpar. strong | AIC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | BIC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Missp. mean | AIC | 3 | 0 | 0 | 5 | 0 | 0 | 12 | 2 | 0 |
| | BIC | 4 | 0 | 0 | 5 | 1 | 0 | 12 | 4 | 0 |
| Missp. dist. | AIC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | BIC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Note: The correct model was selected most of the time in each simulation setting, and the proportion of times it was chosen by BIC converged to one as the sample size increased, as expected under Theorem 1.
FIGURE 1. Boxplots of the relative bias of the slope estimates for different methods. Each method is compared with the original dataset (Complete) based on the accuracy of its estimators as the missing proportion and the sample size increase from the top-left corner to the bottom-right. AIC/BIC refer to the model chosen by AIC/BIC, respectively, among the competing imputation models.
RMSE (×1000) of the slope estimates for different imputation methods across 500 simulations, compared with the original dataset (Complete)

| Method | 10%, n=50 | 10%, n=100 | 10%, n=1000 | 25%, n=50 | 25%, n=100 | 25%, n=1000 | 40%, n=50 | 40%, n=100 | 40%, n=1000 |
|---|---|---|---|---|---|---|---|---|---|
| Complete | 146 | 103 | 32 | 146 | 103 | 32 | 146 | 103 | 32 |
| Correct | 152 | 107 | 32 | 159 | 111 | 34 | 171 | 115 | 36 |
| Overpar. mild | 152 | 106 | 32 | 159 | 111 | 34 | 171 | 115 | 36 |
| Overpar. strong | 152 | 108 | 33 | 168 | 114 | 34 | 195 | 122 | 36 |
| Missp. mean | 152 | 114 | 50 | 174 | 135 | 81 | 195 | 153 | 98 |
| Missp. dist. | 202 | 172 | 136 | 305 | 268 | 248 | 373 | 339 | 310 |
| AIC | 152 | 107 | 32 | 161 | 111 | 34 | 174 | 115 | 36 |
| BIC | 152 | 107 | 32 | 161 | 111 | 34 | 174 | 115 | 36 |
Note: The true value of the slope coefficient is 1.
Proportion of times (%) the information criterion chooses the corresponding model across 500 simulated datasets

| | Model 1 | Model 2 |
|---|---|---|
| AIC | | 0.6 |
| BIC | | 4.8 |
Note: Data were generated under Model 1, which assumes MNAR, whereas Model 2 assumed data were MAR. Note that both approaches were able to recover the correct model with high probability.
Parameter estimates and mean squared error (in parentheses) for different imputation models across 500 simulations

| | | |
|---|---|---|
| Model 1 | 1.008 (0.027) | 1.014 (0.083) |
| Model 2 | 1.521 (0.342) | 1.288 (0.177) |
Survival of infants data, taken directly from Example 9.8 of Little and Rubin
| Clinic (C) | Prenatal care (P) | Died (S) | Survived (S) | |
|---|---|---|---|---|
| A | Less | 3 | 176 | |
| | More | 4 | 293 | |
| B | Less | 17 | 197 | |
| | More | 2 | 23 | complete = 715 |
| ? | Less | 10 | 150 | |
| | More | 5 | 90 | partial = 255 |
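The maximum likelihood treatment of this table must handle the 255 partially classified infants (clinic unknown), which is classically done with EM: allocate the partial counts across clinics in proportion to the current cell estimates, then re-estimate. The sketch below does this under a saturated multinomial model; it is an illustration of the mechanism only, since the paper's candidate Models 1 to 5 impose particular log-linear structures not reproduced here.

```python
import numpy as np

# Fully classified cases: axes are (clinic, prenatal, survival), clinic = A, B
complete = np.array([[[3., 176.], [4., 293.]],
                     [[17., 197.], [2., 23.]]])
# Partially classified cases (clinic unknown): axes (prenatal, survival)
partial = np.array([[10., 150.], [5., 90.]])

probs = complete / complete.sum()  # initialize from the complete cases
for _ in range(100):
    # E-step: split each partial count across clinics using the current
    # conditional P(clinic | prenatal, survival)
    w = probs / probs.sum(axis=0, keepdims=True)
    filled = complete + w * partial            # expected full-data counts
    # M-step for the saturated multinomial: renormalize the counts
    probs = filled / filled.sum()
```

After convergence, `filled` contains the expected full-data counts (summing to 970, the 715 complete plus 255 partial cases) and `probs` the fitted cell probabilities.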
Information criteria for each candidate imputation model for the survival of infants example
| | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|---|---|---|---|---|---|
| AIC | 27.02 | | 33.29 | 156.71 | 168.28 |
| BIC | 63.60 | | 65.30 | 188.71 | 191.14 |
Note: Both AIC and BIC favored Model 2, in line with previous analyses.
Estimated cell probabilities (%) for candidate imputation models in the survival of infants data
| | Clinic (C) | Prenatal care (P) | Died (S) | Survived (S) |
|---|---|---|---|---|
| Model 1 | A | Less | 0.45 | 25.35 |
| | | More | 0.79 | 38.78 |
| | B | Less | 2.64 | 28.57 |
| | | More | 0.34 | 3.07 |
| Model 2 | A | Less | 0.49 | 25.27 |
| | | More | 0.76 | 38.79 |
| | B | Less | 2.68 | 28.57 |
| | | More | 0.30 | 3.15 |
| Model 3 | A | Less | 0.82 | 36.66 |
| | | More | 0.30 | 28.46 |
| | B | Less | 2.27 | 17.26 |
| | | More | 0.83 | 13.40 |
| Model 4 | A | Less | 1.41 | 24.64 |
| | | More | 1.05 | 38.59 |
| | B | Less | 1.68 | 29.27 |
| | | More | 0.09 | 3.23 |
| Model 5 | A | Less | 1.60 | 36.34 |
| | | More | 1.21 | 27.41 |
| | B | Less | 0.81 | 18.26 |
| | | More | 0.09 | 3.27 |
Note: These estimates differ under each model. Although the cell probabilities estimated under the overfitted model are similar to those under the correct model, there is a clear difference between the estimates under Models 3, 4, and 5 and those under the correct imputation model.
Information criteria for different imputation models fitted to the Pima Indian women data

| | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| AIC | | 7660.071 | 10544.15 |
| BIC | | 8026.931 | 10818.13 |
Note: The data strongly favored Model 1, suggesting a non‐ignorable missing data mechanism.
Estimates of regression parameters and their standard errors (in parentheses) for different imputation models in the Pima Indian women example
| | Intercept | Pregnancy | Glucose | Pressure | Triceps | Insulin | BMI | Pedigree | Age |
|---|---|---|---|---|---|---|---|---|---|
| Model 1 | | 0.282 (0.113) | 0.918 (0.143) | | 0.163 (0.152) | 0.321 (0.208) | 0.503 (0.139) | 0.328 (0.098) | 0.284 (0.120) |
| Model 2 | | 0.290 (0.113) | 1.080 (0.185) | | 0.088 (0.199) | 0.022 (0.267) | 0.606 (0.179) | 0.322 (0.098) | 0.315 (0.119) |
| Model 3 | | 0.076 (0.187) | 1.055 (0.178) | | 0.114 (0.180) | 0.084 (0.181) | 0.449 (0.190) | 0.407 (0.147) | 0.628 (0.211) |
Note: Pregnancy, insulin, pedigree, and age are log‐transformed.
FIGURE 2. Density curve plots of (A) observed triceps skin fold thickness (blue) and multiply imputed values of triceps skin fold thickness (red), and (B) observed 2-hour serum insulin (blue) and multiply imputed values of 2-hour serum insulin (red), for Model 1 in the Pima Indian women example. The similarity between the observed and imputed curves in (A) suggests that triceps skin fold thickness may be missing at random. The difference between the observed and imputed curves in (B) suggests that 2-hour serum insulin is missing not at random.
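The diagnostic behind Figure 2 rests on a simple fact: under MNAR, the distribution of the unobserved values differs systematically from that of the observed ones, so imputations drawn under a MAR-type model cannot match it. A small synthetic illustration (all quantities simulated; this is not the Pima data):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=5000)

# MNAR mechanism: larger values of y are more likely to go missing
p_miss = 1.0 / (1.0 + np.exp(-2.0 * y))
miss = rng.random(5000) < p_miss

# The observed values are shifted down relative to the missing ones, so
# MAR-based imputations (which mimic the observed distribution) would
# systematically understate the missing part of the distribution.
print(y[~miss].mean() < y[miss].mean())  # prints: True
```

A visible gap between observed and imputed density curves, as in panel (B), is therefore evidence against the MAR assumption for that variable.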
Box 1. MI (proper) and StEM (improper) algorithms; note that the two differ only at step 2

| MI (proper) | StEM (improper) |
|---|---|
| 0. Fix initial parameter values | 0. Fix initial parameter values |
| 1. | 1. |
| 2. | 2. |
| 3. Repeat steps 1 and 2 until convergence | 3. Repeat steps 1 and 2 until convergence |
| 4. Repeat steps 1 and 2 for | 4. Repeat steps 1 and 2 for |
| 5. | 5. |
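The contrast in Box 1 can be sketched concretely for a toy univariate normal model with MCAR missingness: a proper (MI-style) step perturbs the parameters before imputing, here via a bootstrap draw standing in for a posterior draw, while an improper (StEM-style) step plugs in the current maximum likelihood estimates. The function names and the bootstrap approximation are illustrative, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

def impute_proper(y_obs, n_miss):
    """MI-style step: draw parameters (bootstrap stand-in for a posterior
    draw), then impute the missing values from the drawn parameters."""
    boot = rng.choice(y_obs, size=y_obs.size, replace=True)
    mu, sd = boot.mean(), boot.std()
    return rng.normal(mu, sd, size=n_miss)

def impute_improper(y_obs, n_miss):
    """StEM-style step: plug in the MLE of the parameters, then impute.
    The two steps differ only in whether the parameters are drawn."""
    mu, sd = y_obs.mean(), y_obs.std()
    return rng.normal(mu, sd, size=n_miss)

y = rng.normal(loc=5.0, size=400)
miss = rng.random(400) < 0.3
y_obs = y[~miss]

filled_mi = y.copy()
filled_mi[miss] = impute_proper(y_obs, int(miss.sum()))
filled_st = y.copy()
filled_st[miss] = impute_improper(y_obs, int(miss.sum()))
```

The extra parameter draw is what makes proper MI propagate parameter uncertainty into the imputations; omitting it is exactly what makes improper imputation equivalent to a stochastic EM approximation to the likelihood.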
RMSE (×1000) of the slope estimates for different imputation methods across 500 simulations, compared with the original dataset (Complete)

| Method | 10%, n=50 | 10%, n=100 | 10%, n=1000 | 25%, n=50 | 25%, n=100 | 25%, n=1000 | 40%, n=50 | 40%, n=100 | 40%, n=1000 |
|---|---|---|---|---|---|---|---|---|---|
| Complete | 137 | 103 | 31 | 137 | 103 | 31 | 137 | 103 | 31 |
| Correct | 143 | 106 | 32 | 150 | 113 | 33 | 167 | 120 | 36 |
| Overpar. mild | 143 | 106 | 32 | 151 | 112 | 33 | 167 | 120 | 36 |
| Overpar. strong | 144 | 106 | 32 | 152 | 113 | 33 | 167 | 119 | 35 |
| Missp. mean | 149 | 112 | 46 | 170 | 136 | 82 | 206 | 162 | 124 |
| Missp. dist. | 162 | 130 | 75 | 202 | 174 | 137 | 232 | 213 | 174 |
| AIC | 143 | 106 | 32 | 150 | 113 | 33 | 168 | 120 | 36 |
| BIC | 143 | 106 | 32 | 151 | 113 | 33 | 168 | 120 | 36 |
Note: The true value of the slope coefficient is 1.
RMSE (×1000) of the slope estimates for different imputation methods across 500 simulations, compared with the original dataset (Complete)

| Method | 10%, n=50 | 10%, n=100 | 10%, n=1000 | 25%, n=50 | 25%, n=100 | 25%, n=1000 | 40%, n=50 | 40%, n=100 | 40%, n=1000 |
|---|---|---|---|---|---|---|---|---|---|
| Complete | 147 | 110 | 32 | 147 | 110 | 32 | 147 | 110 | 32 |
| Correct | 157 | 119 | 34 | 170 | 124 | 36 | 186 | 134 | 38 |
| Overpar. mild | 157 | 118 | 34 | 170 | 124 | 36 | 185 | 133 | 38 |
| Overpar. strong | 158 | 119 | 34 | 166 | 123 | 36 | 185 | 133 | 38 |
| Missp. mean | 169 | 131 | 68 | 197 | 169 | 122 | 236 | 203 | 168 |
| Missp. dist. | 170 | 132 | 64 | 204 | 161 | 105 | 226 | 187 | 129 |
| AIC | 157 | 119 | 34 | 173 | 124 | 36 | 190 | 133 | 38 |
| BIC | 157 | 119 | 34 | 173 | 125 | 36 | 190 | 134 | 38 |
Note: The true value of the slope coefficient is 1.
RMSE (×1000) of the coefficient estimates for the Amelia and MICE correct, overparametrized (mild/strong), and misspecified-mean models across 500 simulations, compared with the original dataset (Complete)

| Method | 10%, n=50 | 10%, n=100 | 10%, n=1000 | 25%, n=50 | 25%, n=100 | 25%, n=1000 | 40%, n=50 | 40%, n=100 | 40%, n=1000 |
|---|---|---|---|---|---|---|---|---|---|
| Complete | 137 | 103 | 31 | 137 | 103 | 31 | 137 | 103 | 31 |
| Amelia correct | 144 | 106 | 32 | 151 | 113 | 33 | 168 | 121 | 36 |
| MICE correct | 143 | 106 | 32 | 149 | 112 | 33 | 166 | 120 | 36 |
| Amelia mild | 143 | 106 | 32 | 150 | 112 | 33 | 167 | 120 | 35 |
| MICE mild | 143 | 106 | 32 | 150 | 112 | 33 | 167 | 119 | 36 |
| Amelia strong | 144 | 107 | 32 | 152 | 113 | 33 | 167 | 121 | 36 |
| MICE strong | 145 | 107 | 32 | 152 | 113 | 33 | 167 | 120 | 35 |
| Amelia missp. | 149 | 112 | 45 | 171 | 137 | 82 | 207 | 164 | 124 |
| MICE missp. | 149 | 112 | 45 | 170 | 136 | 82 | 205 | 162 | 124 |
Note: The true value of the coefficients is 1.
RMSE (×1000) of the coefficient estimates for the Amelia and MICE correct, overparametrized (mild/strong), and misspecified-mean models across 500 simulations, compared with the original dataset (Complete)

| Method | 10%, n=50 | 10%, n=100 | 10%, n=1000 | 25%, n=50 | 25%, n=100 | 25%, n=1000 | 40%, n=50 | 40%, n=100 | 40%, n=1000 |
|---|---|---|---|---|---|---|---|---|---|
| Complete | 146 | 103 | 32 | 146 | 103 | 32 | 146 | 103 | 32 |
| Amelia correct | 153 | 107 | 32 | 161 | 111 | 34 | 177 | 117 | 36 |
| MICE correct | 153 | 106 | 32 | 159 | 111 | 34 | 172 | 115 | 36 |
| Amelia mild | 153 | 106 | 32 | 161 | 111 | 34 | 176 | 116 | 36 |
| MICE mild | 153 | 107 | 32 | 159 | 111 | 34 | 172 | 115 | 36 |
| Amelia strong | 152 | 107 | 32 | 161 | 111 | 34 | 181 | 116 | 36 |
| MICE strong | 152 | 107 | 32 | 161 | 111 | 34 | 179 | 117 | 36 |
| Amelia missp. | 152 | 113 | 50 | 165 | 129 | 80 | 182 | 142 | 96 |
| MICE missp. | 152 | 114 | 50 | 170 | 133 | 81 | 189 | 149 | 98 |
Note: The true value of the coefficients is 1.
RMSE (×1000) of the coefficient estimates for the Amelia and MICE correct, overparametrized (mild/strong), and misspecified-mean models across 500 simulations, compared with the original dataset (Complete)

| Method | 10%, n=50 | 10%, n=100 | 10%, n=1000 | 25%, n=50 | 25%, n=100 | 25%, n=1000 | 40%, n=50 | 40%, n=100 | 40%, n=1000 |
|---|---|---|---|---|---|---|---|---|---|
| Complete | 147 | 110 | 32 | 147 | 110 | 32 | 147 | 110 | 32 |
| Amelia correct | 157 | 119 | 34 | 170 | 124 | 36 | 188 | 134 | 38 |
| MICE correct | 157 | 119 | 34 | 170 | 124 | 36 | 185 | 133 | 38 |
| Amelia mild | 157 | 119 | 34 | 169 | 124 | 36 | 188 | 133 | 38 |
| MICE mild | 157 | 119 | 34 | 170 | 124 | 36 | 186 | 133 | 38 |
| Amelia strong | 158 | 119 | 34 | 167 | 124 | 36 | 187 | 135 | 38 |
| MICE strong | 158 | 119 | 34 | 167 | 123 | 36 | 189 | 134 | 38 |
| Amelia missp. | 169 | 131 | 68 | 198 | 170 | 123 | 241 | 206 | 168 |
| MICE missp. | 169 | 131 | 68 | 197 | 170 | 123 | 238 | 205 | 168 |
Note: The true value of the coefficients is 1.
Proportion of times (%) the information criterion chooses the fitted imputation model for different sample sizes in 500 simulated datasets

| Sample size | | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|---|
| | AIC | 0.0 | 3.0 | | 16.0 |
| | BIC | 0.0 | 10.0 | | 6.2 |
| | AIC | 0.0 | 0.0 | | 14.8 |
| | BIC | 0.0 | 0.2 | | 2.8 |
| | AIC | 0.0 | 0.0 | | 15.6 |
| | BIC | 0.0 | 0.0 | | 1.0 |
Note: Model 3 is the correct model, and it was selected with the highest proportion even at the smallest sample size. The proportion of times it was chosen approached one as the sample size increased.
Parameter estimates and mean squared error (in parentheses) for different imputation models across 500 simulations for different sample sizes

| Sample size | Model | | | | |
|---|---|---|---|---|---|
| | Model 1 | 1.171 (0.09) | 1.423 (0.47) | 0.719 (0.13) | 0.627 (0.42) |
| | Model 2 | 0.819 (0.10) | 1.284 (0.18) | 1.187 (0.11) | 0.713 (0.19) |
| | Model 3 | 0.938 (0.07) | 1.095 (0.14) | 1.033 (0.09) | 0.945 (0.16) |
| | Model 4 | 0.937 (0.07) | 1.095 (0.14) | 1.034 (0.09) | 0.945 (0.16) |
| | Model 1 | 1.193 (0.07) | 1.438 (0.33) | 0.708 (0.11) | 0.601 (0.29) |
| | Model 2 | 0.862 (0.05) | 1.228 (0.10) | 1.158 (0.06) | 0.754 (0.11) |
| | Model 3 | 0.982 (0.04) | 1.029 (0.06) | 0.999 (0.05) | 0.998 (0.07) |
| | Model 4 | 0.981 (0.04) | 1.030 (0.06) | 0.999 (0.04) | 0.997 (0.07) |
| | Model 1 | 1.211 (0.05) | 1.397 (0.17) | 0.709 (0.09) | 0.618 (0.16) |
| | Model 2 | 0.878 (0.02) | 1.192 (0.04) | 1.161 (0.03) | 0.764 (0.06) |
| | Model 3 | 1.001 (0.00) | 0.999 (0.01) | 1.000 (0.00) | 1.000 (0.01) |
| | Model 4 | 1.001 (0.00) | 0.999 (0.01) | 1.000 (0.00) | 1.000 (0.01) |