
Selecting the model for multiple imputation of missing data: Just use an IC!

Firouzeh Noghrehchi, Jakub Stoklosa, Spiridon Penev, David I Warton

Abstract

Multiple imputation and maximum likelihood estimation (via the expectation-maximization algorithm) are two well-known methods readily used for analyzing data with missing values. While these two methods are often considered as being distinct from one another, multiple imputation (when using improper imputation) is actually equivalent to a stochastic expectation-maximization approximation to the likelihood. In this article, we exploit this key result to show that familiar likelihood-based approaches to model selection, such as Akaike's information criterion (AIC) and the Bayesian information criterion (BIC), can be used to choose the imputation model that best fits the observed data. Poor choice of imputation model is known to bias inference, and while sensitivity analysis has often been used to explore the implications of different imputation models, we show that the data can be used to choose an appropriate imputation model via conventional model selection tools. We show that BIC can be consistent for selecting the correct imputation model in the presence of missing data. We verify these results empirically through simulation studies, and demonstrate their practicality on two classical missing data examples. An interesting result we saw in simulations was that not only can parameter estimates be biased by misspecifying the imputation model, but also by overfitting the imputation model. This emphasizes the importance of using model selection not just to choose the appropriate type of imputation model, but also to decide on the appropriate level of imputation model complexity.
© 2021 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.


Keywords:  imputation model selection; information criteria; missing data analysis; stochastic EM algorithm


Year:  2021        PMID: 33629367      PMCID: PMC8248419          DOI: 10.1002/sim.8915

Source DB:  PubMed          Journal:  Stat Med        ISSN: 0277-6715            Impact factor:   2.373


Introduction

Missing data commonly arise in many health research datasets, for example in clinical trials, in biomarker data, in covariates when fitting Cox models, or in longitudinal studies. Multiple imputation (MI) and maximum likelihood estimation (MLE) via the expectation‐maximization (EM) algorithm are two well‐known methods used to account for missing values in partially observed datasets. MI is a Monte Carlo approach in which missing values are imputed to create a complete dataset, whereas the EM algorithm is an iterative method developed for finding the MLE in the presence of missing data. Stochastic versions of EM, in particular Monte Carlo EM (MCEM) and stochastic EM (StEM), are numerical variants of EM that compute a Monte Carlo approximation to the MLE. A little‐appreciated idea is that MI and MLE are closely connected, apart from some exceptions when comparing their performance in numerical results, for example, see Honaker and King. Specifically, StEM can be understood as a type of multiple imputation where data are imputed from a model that fixes parameter estimates at the maximizer of the imputation model likelihood (a type of "improper MI," see Rubin). This result has been noted in References 15 and 16, and importantly, it permits us to think of the multiple imputation procedure in a likelihood framework, motivating the application of standard likelihood machinery in multiple imputation. While this opportunity offers considerable potential, little work along these lines has been done to date. In particular, the literature on imputation model selection within the MI framework is surprisingly sparse given its potential application range. With the exception of some pragmatic suggestions for model selection, no generally valid framework has been provided for either model selection or imputation model selection when using MI.
In this article, we focus on the question of selecting an imputation model to use to impute missing values with MI, and show that standard likelihood‐based model selection tools are applicable to this problem. We leave the development of a valid framework for model selection for predictors with MI for future work. When using MI, careful consideration of which imputation model to select is needed, because using the wrong imputation model can result in incorrect inferences and misleading conclusions. There are guidelines for specification of the imputation model, but these tend to be ad hoc and based on heuristic arguments rather than providing an overall valid theoretical framework for imputation model selection. Other methods include assessing the sensitivity of MI results using a weighting approach or the pattern‐mixture approach, and using variable selection procedures to select the best imputation model based on the fraction of missing information estimated from an appropriate proxy pattern‐mixture model. A key focus of the current study is to provide a flexible, data‐driven approach for choosing the imputation model. In this article, we exploit the equivalence between MI and MLE to propose imputation model selection using likelihood‐based information criteria. In principle, any likelihood‐based information criterion with desirable properties could be used, and we focus in particular on the Akaike information criterion (AIC) and Bayesian information criterion (BIC), computed using the observed data likelihood. We show that BIC preserves its usual property of consistency in model selection when selecting an imputation model, and illustrate this via simulation. In Sections 2.1 and 2.2, we give an overview of the connection between MI and StEM methods. In Section 3, we develop AIC and BIC for choosing the best imputation model and study some of their properties.
Three simulation studies are given in Section 4 and we present some real‐data examples in health research in Section 5. Finally, we provide a summary discussion in Section 6.

Missing data analysis methods

Suppose y is the observed data, z is the missing data, and r is the missingness indicator vector, with components r_i = 1 if the ith data point is missing and r_i = 0 if it is observed. Note that y can contain the vector of observed response variables, denoted Y, as well as the vector of observed predictors. Also, denote by θ the vector of unknown model parameters. For now we assume that these data are missing not at random (MNAR), but note that the methods presented below are also applicable for missing at random (MAR) data, the special case of MNAR where missingness is ignorable. These data are considered to be MAR when p(r | y, z; θ) = p(r | y; θ), and MNAR when the missingness depends on z, so that p(r | y, z; θ) ≠ p(r | y; θ). For more details on the definitions of MAR and MNAR, see Seaman et al, Mealli et al, and Doretti et al. To estimate θ, we maximize the observed data likelihood, L(θ | y, r), which, in the presence of missing data, is obtained by integrating out the missing data from the complete‐data likelihood, L(θ | y, z, r):

L(θ | y, r) = ∫ f(y, z, r | θ) dz.   (1)

When data are MAR and missingness is ignorable, the missingness mechanism does not depend on the missing data, that is, p(r | y, z; θ) = p(r | y; θ). Therefore, the observed likelihood ignoring the missingness mechanism is defined to be L(θ | y) = ∫ f(y, z | θ) dz, which does not involve the model for r. Likelihood (1) can be difficult to maximize for complex models or when the data are of high dimension. In the missing data context, a common approach is to impute (fill in) the missing data, z, with some plausible values that are a good summary of z, from which we obtain a (pseudo‐) complete dataset. Two well‐known methods that use imputation procedures are multiple imputation (MI) and stochastic EM (StEM). These two methods turn out to have a close resemblance, although they were developed to serve different purposes. MI was primarily developed in the context of missing data analysis to reflect uncertainty in the imputed values. Stochastic EM was developed more broadly as a computational device when using an EM algorithm.
Specifically, StEM was designed to overcome computational difficulties such as intractable E‐steps. The objective of both methods is to achieve valid statistical inference, rather than optimal prediction of the missing data. First, we give a brief overview of both methods, then show their equivalence.
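To make equation (1) concrete, here is a minimal stdlib-only Python check for the simplest setting: a bivariate normal pair (x, y) in which x is missing. Numerically integrating the joint density over the missing coordinate recovers the marginal density of the observed y, which is the observed-data likelihood contribution under ignorability. The correlation value and integration grid are arbitrary choices for illustration.

```python
import math

# Joint density of a standard bivariate normal with correlation rho,
# and the implied marginal density of y. Integrating the missing
# coordinate x out of the joint recovers the marginal, a numerical
# check of the observed-data likelihood in equation (1).

def joint_density(x, y, rho):
    norm = 1.0 / (2 * math.pi * math.sqrt(1 - rho**2))
    quad = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return norm * math.exp(-0.5 * quad)

def marginal_density(y):
    return math.exp(-0.5 * y**2) / math.sqrt(2 * math.pi)

def integrate_out_x(y, rho, lo=-8.0, hi=8.0, steps=4000):
    # Trapezoidal integration of f(x, y) over the missing coordinate x.
    h = (hi - lo) / steps
    total = 0.5 * (joint_density(lo, y, rho) + joint_density(hi, y, rho))
    for i in range(1, steps):
        total += joint_density(lo + i * h, y, rho)
    return total * h

y, rho = 0.7, 0.5
print(integrate_out_x(y, rho))   # matches marginal_density(y)
print(marginal_density(y))
```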

Multiple imputation

Multiple imputation is a Monte Carlo method originally proposed by Rubin, where every missing observation in the dataset is imputed with a simulated value to create a complete dataset. The imputation step is repeated M times, such that M completed datasets are generated. That is, we impute z with a set of plausible values z^(1), ..., z^(M), for some M > 1. Each (pseudo‐) complete dataset is then separately analyzed by standard complete‐data analysis methods. In order to achieve asymptotically valid statistical inference, the resulting estimates from the M completed datasets need to be pooled according to Rubin's rules; see Appendix B for further details. MI is sometimes performed in an iterative manner to approximate the observed posterior of θ, as summarized in Box 1 (left). Consider a complete‐data model f(y, z | θ) and the imputation model f(z | y, θ). Multiple imputations are repeated random draws from the posterior predictive distribution of the missing data given the observed data and the current estimate of θ. The Monte Carlo average over the current multiply imputed datasets gives a current approximation to the observed posterior. Since

p(z | y) = ∫ p(z | y, θ) p(θ | y) dθ,

these two steps of imputation (I‐step) and posterior re‐estimation (P‐step) are iterated a large number of times to eventually produce a draw of (z, θ) from their joint observed posterior. Finally, the next M imputations are used as multiple imputations in Rubin's combination rules to make inference about the dataset. The algorithm in Box 1 (left) imputes missing values from an imputation model whose parameters were sampled from the posterior distribution for θ. This is referred to by Rubin as "proper MI." Other methods of updating θ are possible, but unless they satisfy Rubin's rules they are labeled "improper" and their statistical properties are less well studied.
Assuming that a correct model is specified for imputation, Rubin provided rules on how to estimate parameters, construct confidence intervals, calculate P‐values, and carry out hypothesis tests based on Wald's method when using MI. We extend this list by developing imputation model selection criteria within the MI framework; see Section 3.
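Rubin's pooling rules mentioned above are simple enough to sketch directly: the pooled point estimate is the mean of the per-imputation estimates, and the total variance combines the within-imputation variance W with the between-imputation variance B inflated by (1 + 1/M). The estimates and variances below are hypothetical numbers, not taken from the paper.

```python
import statistics

def rubins_rules(estimates, variances):
    # Pool M complete-data analyses under Rubin's rules.
    M = len(estimates)
    qbar = statistics.fmean(estimates)    # pooled point estimate
    W = statistics.fmean(variances)       # within-imputation variance
    B = statistics.variance(estimates)    # between-imputation variance
    T = W + (1 + 1 / M) * B               # total variance
    return qbar, T

# Hypothetical slope estimates from M = 5 imputed datasets.
est = [1.02, 0.97, 1.05, 1.00, 0.96]
var = [0.010, 0.011, 0.009, 0.010, 0.012]
qbar, T = rubins_rules(est, var)
print(qbar, T)
```

The between-imputation term is what distinguishes MI from naive single imputation: it propagates the uncertainty about the missing values into the final standard error.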

MLE via stochastic EM

The EM algorithm was designed to find the MLE of the parameters of a parametric model in an iterative manner when the observed data are incomplete. This procedure makes use of Fisher's identity, by which the maximization of the unknown observed log‐likelihood is replaced with the maximization of the conditional expectation of an associated complete‐data log‐likelihood. An EM iteration consists of two steps. The E‐step computes the expectation of the complete‐data log‐likelihood given the observed data (with respect to the imputation model at the current parameter estimate θ^(t)),

Q(θ | θ^(t)) = E[ log L(θ | y, z) | y; θ^(t) ].

The M‐step updates the parameter estimates by maximizing the expectation function computed in the E‐step,

θ^(t+1) = argmax_θ Q(θ | θ^(t)).

In situations where Q(θ | θ^(t)) is either analytically intractable or computationally intensive, it is possible to replace analytical computation of Q by a suitable approximation to this function, commonly via simulation methods such as MCMC. In this paradigm, stochastic versions of the EM algorithm were designed to compute Q(θ | θ^(t)) numerically by Monte Carlo approximation. As such, the E‐step in the EM algorithm simplifies to the computation of the imputation model, f(z | y; θ^(t)), and simulation of the missing data z^(t,1), ..., z^(t,M) from it. In other words, the E‐step turns into an imputation step (I‐step), where we approximate Q as a Monte Carlo average,

Q̂(θ | θ^(t)) = (1/M) Σ_{m=1}^{M} log L(θ | y, z^(t,m)).

A special case is the stochastic EM (StEM) algorithm, summarized in Box 1 (right). StEM, without aiming to produce an accurate approximation of Q, imputes the missing data only once in each I‐step until the algorithm converges to its stationary distribution, whose mean is close to the MLE. The random sequence of estimates generated by StEM does not converge pointwise to the MLE but, under mild conditions, does converge in distribution. After the algorithm converges, multiple imputations are generated by running M extra iterations to sample from the stationary distribution, and the sample mean gives an approximate MLE of the observed likelihood.
The StEM estimator has been shown to be an asymptotically normal, unbiased, and consistent estimator of θ for models from the exponential family. Asymptotic properties of the StEM estimator are given in Wang and Robins and in Nielsen.
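As a toy illustration of the I-step/P-step cycle, here is a stdlib-only StEM sketch for a univariate normal model with values missing completely at random. The sample size, missingness fraction, starting values, and iteration counts are arbitrary choices for illustration, not the paper's settings.

```python
import random
import statistics

# Minimal StEM sketch: I-step imputes each missing value once from the
# current N(mu, sigma^2); P-step re-fits the complete-data MLE on the
# pseudo-complete dataset; the final estimate averages the last M draws
# from the chain's stationary distribution.

random.seed(1)
truth_mu, truth_sigma = 2.0, 1.5
data = [random.gauss(truth_mu, truth_sigma) for _ in range(300)]
observed = data[:210]            # last 90 values treated as missing (MCAR)
n_missing = 90

mu, sigma = 0.0, 1.0             # crude starting values
burn_in, M = 200, 50
kept = []
for t in range(burn_in + M):
    # I-step: impute missing values from the current imputation model.
    imputed = [random.gauss(mu, sigma) for _ in range(n_missing)]
    pseudo = observed + imputed
    # P-step: complete-data MLE on the pseudo-complete dataset.
    mu = statistics.fmean(pseudo)
    sigma = statistics.pstdev(pseudo)
    if t >= burn_in:
        kept.append(mu)

mu_hat = statistics.fmean(kept)
print(mu_hat)                    # close to the observed-data MLE
```

As the text notes, the chain does not converge pointwise; averaging the post-convergence draws is what produces the approximate MLE.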

MI as stochastic EM

Studying Box 1, it is evident that the MI and StEM algorithms described above are almost identical: both iterate between imputation from the current model (I‐step) and updating the imputation model (P‐step), then, following convergence, both involve averaging estimates obtained from a set of M ensuing iterations. The only difference between the two algorithms is in how θ is updated at Step 2: using random draws from the current posterior, or using the maximizer of the current estimate of the likelihood function. While the former is proper MI, the latter is not, since equation (B2) (see Appendix B), and therefore (B1), will no longer be satisfied.

Box 1. MI (proper) and StEM (improper) algorithms; note that they differ only at Step 2.

MI views θ as random (with a prior distribution) and samples values from the observed posterior. These samples from the posterior are used to impute the missing values z, and are eventually averaged to approximate the mean of the posterior distribution of θ. StEM views θ as fixed, and at each step uses an estimate of it in the model that imputes the missing values z, eventually averaging these for a final estimate of θ. From the MI point of view, this can be understood as approximating the mode of the posterior distribution with flat priors, rather than the mean. However, both algorithms assume their estimators are asymptotically normal, which implies that the mean and mode coincide asymptotically. While Rubin's combination rules enable calculation of approximate standard errors when using proper MI, these are not available in the improper case. StEM nevertheless comes with recommended approaches to estimate standard errors via Louis' method, which, interestingly, can be shown to have a similar form to Rubin's rule (see Appendix B).

Imputation model selection using information criteria

The equivalence between StEM and improper MI allows standard likelihood machinery to be used to improve MI's performance. This connection changes our view of MI and provides the possibility of accessing a range of likelihood‐based tools in situations where MI's performance could be improved in a likelihood‐based framework. One important gain, for example, is that a maximum likelihood framework can provide the analyst with model selection tools for choosing the best imputation model, and can enable further insights into the consequences of imputation model misspecification. In the MI literature, there are no diagnostic criteria for choosing between imputation models. In a heuristic manner, it is recommended to either specify a multivariate normal distribution as a joint model for missing data or specify a univariate conditional imputation model for each missing variable. It is also recommended to add as many variables to the imputation model as possible, with at least as many variables as appear in the substantive analysis model of interest. Furthermore, it is not clear how to choose between different missing data mechanisms. These are often considered untestable assumptions, which need to be determined from prior knowledge. The literature suggests performing a sensitivity analysis to explore the impact of different assumptions about the missingness mechanism. In this article, we show that available and well‐known information‐based criteria in the maximum likelihood literature, which enjoy good statistical properties, can be used to select an imputation model among a set of candidate imputation models. For illustration, we will use AIC and BIC for imputation model selection, study their properties by simulation, and apply them to two real‐data examples. Denote by p the number of model parameters and by θ̂ the maximizer of the observed likelihood. We write

AIC = -2 log L(θ̂ | y, r) + 2p,   and   BIC = -2 log L(θ̂ | y, r) + p log n.

BIC has been shown to be consistent under a broad range of settings, and we add multiple imputation models to that list in Theorem 1 below.
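The resulting selection rule is easy to sketch in code: compute AIC and BIC from each candidate imputation model's observed-data log-likelihood and parameter count, and keep the minimizer. The log-likelihood values and parameter counts below are hypothetical.

```python
import math

def aic(loglik, p):
    # Akaike information criterion from an observed-data log-likelihood.
    return -2 * loglik + 2 * p

def bic(loglik, p, n):
    # Bayesian information criterion from an observed-data log-likelihood.
    return -2 * loglik + p * math.log(n)

n = 200
# candidate imputation model -> (observed log-likelihood, #parameters)
candidates = {
    "correct":          (-310.2, 3),
    "overparametrized": (-309.8, 11),
    "misspecified":     (-342.6, 3),
}

best_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_bic = min(candidates, key=lambda m: bic(candidates[m][0], candidates[m][1], n))
print(best_aic, best_bic)
```

In this made-up configuration the overparametrized model buys a trivial log-likelihood gain at the cost of eight extra parameters, so both criteria reject it; BIC's log n penalty rejects it by a wider margin.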
The equivalence of StEM and MI could in principle be used to motivate the application of any likelihood‐based information criterion to imputation model selection. For example, the small‐sample correction to AIC developed by Hurvich and Tsai could also be applied. Cavanaugh and Shumway and Ibrahim developed AIC‐type criteria based on the expected complete‐data log‐likelihood (the Q function above), which would be straightforward to approximate from imputed data. However, note that the expected complete‐data log‐likelihood differs from the observed log‐likelihood by an entropy term. In mixture models, it has been shown that if competing models differ in their entropies, complete‐likelihood criteria can have undesirable properties, which follow from the omission of entropy from these criteria. A similar problem might arise in missing data analysis when comparing imputation models of differing complexity. For the formulation of the next theorem, we need the notion of the most parsimonious correct model. We note that in many situations there may be multiple correct imputation models, for example, when dealing with a sequence of nested models that includes the true model. In these situations, there exist multiple correct imputation models with different levels of complexity. Some of these correct imputation models may be too complex, including additional parameters with zero effect, and we define a most parsimonious correct model as a correct model whose number of parameters is no larger than that of any other correct model. The most parsimonious correct model is typically unique, but in the below we do not require it to be. In Appendix A, we give a proof of the following theorem, which shows that BIC is consistent for imputation model selection: Suppose the imputation model chosen by BIC is denoted M_BIC, and the set of most parsimonious correct models, assumed finite, is denoted M*. If the assumptions listed in Appendix A are satisfied, then the probability that M_BIC lies in M* converges to one as the sample size increases. We can explain how the observed log‐likelihood is informative about the imputation model using an approximation technique that, interestingly, is equivalent to that used in variational approximation of likelihood functions.
Variational approximation is a method for optimizing a log‐likelihood function while enhancing its tractability, trading some accuracy for computational convenience when making approximate inference about the parameters of a model. In the missing data context, an incorrect imputation model can be understood as a variational approximation to the observed log‐likelihood. The quality of the approximation is indicated by the Kullback‐Leibler divergence of the specified imputation model from the true imputation model. Suppose q(z) is a specified imputation model and p(z | y) is the true imputation model. Misspecification of the imputation model implies that q diverges from p. The divergence of q from p is assessed by the Kullback‐Leibler divergence, denoted KL(q ‖ p), which is always nonnegative. A zero value occurs only when the imputation model is not misspecified, whereas a positive value implies that the imputation model is misspecified; the further q is from p, the greater KL(q ‖ p) is. Following a similar approach to that of Ormerod and Wand, we have

log L(θ | y) = ∫ q(z) log{ f(y, z | θ) / q(z) } dz + KL(q ‖ p) ≥ ∫ q(z) log{ f(y, z | θ) / q(z) } dz.   (2)

The maximum value of the lower bound in (2) over q is obtained when q(z) = p(z | y) (that is, when the Kullback‐Leibler divergence of q from p is at its minimum, KL(q ‖ p) = 0), since log L(θ | y) exceeds the lower bound by exactly KL(q ‖ p). Therefore, the better the specified imputation model, the lower KL(q ‖ p), and hence the higher the attained value of the bound in (2). Thus, AIC and BIC based on observed log‐likelihoods are able to reflect the goodness‐of‐fit of the chosen imputation model.
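The decomposition behind (2) can be verified numerically for a discrete latent variable: the gap between the observed log-likelihood and the lower bound equals the Kullback-Leibler divergence of q from the true conditional, and vanishes exactly when q equals that conditional. The joint probabilities below are arbitrary illustrative values.

```python
import math

# f(y, z) at the observed y, for a binary latent z in {0, 1}.
joint = {0: 0.12, 1: 0.28}
evidence = sum(joint.values())                     # p(y)
log_evidence = math.log(evidence)
posterior = {z: f / evidence for z, f in joint.items()}   # p(z | y)

def elbo(q):
    # Lower bound in (2): E_q[ log f(y, z) - log q(z) ].
    return sum(q[z] * math.log(joint[z] / q[z]) for z in q)

def kl(q, p):
    # Kullback-Leibler divergence KL(q || p) for discrete distributions.
    return sum(q[z] * math.log(q[z] / p[z]) for z in q)

for q in ({0: 0.5, 1: 0.5}, {0: 0.4, 1: 0.6}, posterior):
    gap = log_evidence - elbo(q)
    print(round(gap, 6), round(kl(q, posterior), 6))   # the two columns agree
```

The last line of output, where q is the true conditional, shows a gap of zero: the bound is tight exactly when the imputation model is correct.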

Simulation study

We present two categories of simulation studies to investigate the performance of AIC and BIC as imputation model selection criteria where we have: (I) a univariate missing variable and (II) multivariate missing variables.

Univariate missing variable

We consider two simulation scenarios with a univariate missing variable. In scenario (a), we fit a linear regression model with missing values in the predictor variable; and in scenario (b), we fit a log‐linear (Poisson) generalized linear model (GLM) with missing values in the (count) response variable. We present scenario (b) in Appendix C2.

Univariate missing predictor

We assume the linear regression model y_i = β0 + β1 x_{1i} + β2 x_{2i} + e_i, where y_i is the response variable and the errors e_i are independent normal. Suppose that both predictors are independent and standard normal, but x1 is partially observed. We imposed missingness on x1 as a regression function of x2. Specifically, suppose that r_i is the missingness indicator of x_{1i}, with r_i = 1 if x_{1i} is missing and r_i = 0 if it is observed, and let π_i be the probability that r_i = 1, which satisfies a logistic regression on x_{2i}. Also, to investigate the consequences of overfitting an imputation model, suppose there are eight fully observed and independent auxiliary variables for x1. We imposed 10%, 25%, and 40% missingness in x1 by varying the parameters of the missingness model. Because the missingness in x1 depends only on x2, the data are MAR, and thus the missingness mechanism is ignorable. We ran 500 simulations. We used the squared distance between the parameters at the (t-1)th and tth iterations as a convergence criterion for the StEM algorithm, with a small convergence threshold, and a fixed number of multiple imputations M. To investigate the performance of information criteria for imputation model selection, missing values were imputed under two overparametrized models (a mild one with one extra auxiliary predictor, and a strong one with all eight extra auxiliary predictors), as well as under the correct model and two wrong models: a misspecified‐mean model and a misspecified‐distribution model. Note that while x1 and x2 are marginally independent, x1 is conditionally dependent on x2 given y, because y is a function of both x1 and x2. Therefore, the correct imputation model for x1 is a linear function of x2 as well as y. We used these five models to look at the ability of AIC and BIC to identify overfitted imputation models as well as imputation models misspecified in two different ways: underfitted, or with an incorrect distributional assumption.
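A stdlib-only sketch of this data-generating design, with assumed coefficient and missingness-model values (illustrative only, not the paper's exact settings), looks like:

```python
import math
import random

# Scenario (a) sketch: y depends on two standard-normal predictors, and
# x1 is made missing at random with probability driven by the fully
# observed x2 through a logistic model (an MAR mechanism).

random.seed(42)
n = 1000
beta0, beta1, beta2 = 0.0, 1.0, 1.0      # assumed true coefficients

x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [beta0 + beta1 * a + beta2 * b + random.gauss(0, 1)
     for a, b in zip(x1, x2)]

def miss_prob(b, gamma0=-1.0, gamma1=1.5):
    # MAR: the probability that x1 is missing depends only on observed x2.
    return 1.0 / (1.0 + math.exp(-(gamma0 + gamma1 * b)))

x1_obs = [a if random.random() > miss_prob(b) else None
          for a, b in zip(x1, x2)]
frac_missing = sum(v is None for v in x1_obs) / n
print(frac_missing)
```

Shifting the hypothetical intercept gamma0 up or down is what moves the expected missingness fraction between the 10%, 25%, and 40% settings studied here.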
Results in Table 1 show the proportion of times each model was chosen by each information criterion, for sample sizes n = 50, 100, and 1000 and missing proportions of 10%, 25%, and 40%. These results indicate that even for a small sample size of 50 and 40% missingness, both AIC and BIC chose the correct model at least 86% of the time, with misspecified models selected rarely, and at a rate that went to zero as sample size increased. The mildly overparametrized model (Overpar. mild) was chosen by AIC 2% to 5% of the time, irrespective of sample size. For BIC, the rate at which the correct model was chosen converged to one as the sample size increased, as expected from Theorem 1. These results align with classical results for complete‐data cases, where several studies have shown that AIC overfits (asymptotically), while in contrast BIC can be consistent for model selection. A possible strategy to correct for bias in small samples when using AIC is to use the corrected version given in Hurvich and Tsai, but that would not alter the overfitting property of AIC in large samples.
TABLE 1

Proportion of times (%) the information criterion chooses the fitted imputation model for different sample sizes in 500 simulated datasets

                              10%                25%                40%
                      n:  50  100  1000      50  100  1000      50  100  1000
Correct          AIC      94   95    96      93   95    96      86   94    96
                 BIC      96  100   100      95   99   100      87   96   100
Overpar. mild    AIC       3    5     4       2    5     4       2    4     4
                 BIC       0    0     0       0    0     0       0    0     0
Overpar. strong  AIC       0    0     0       0    0     0       0    0     0
                 BIC       0    0     0       0    0     0       0    0     0
Missp. mean      AIC       3    0     0       5    0     0      12    2     0
                 BIC       4    0     0       5    1     0      12    4     0
Missp. dist.     AIC       0    0     0       0    0     0       0    0     0
                 BIC       0    0     0       0    0     0       0    0     0

Note: The correct model (in bold) was selected most of the time in each simulation. Note that the proportion of times this model was chosen by BIC converged to one as sample size increased, as expected under Theorem 1.

The importance of choosing the correct imputation model in practice is perhaps better demonstrated by investigating its impact on post‐selection inference from data. We therefore compared all methods based on the magnitude of bias and root mean square error (RMSE) of the regression coefficient estimates. Figure 1 shows parallel boxplots of the relative bias of the estimated slope β̂1 (its deviation from the true β1 relative to the true value), obtained from each fitted model, against increasing missing proportion and increasing sample size. Unsurprisingly, this figure shows that the misspecified models underperformed relative to the correct model in terms of bias as the missing proportion increased (top to bottom) and as the sample size increased (left to right). In some cases, the relative bias was as large as 40% for wrong imputation models. More interestingly, the strongly overparameterized model was also downward biased (Figure 1) when the missingness rate was not small (40%) and the sample size was not large. When the sample size is small, a strongly overparameterized imputation model can lead to overfitting, and when the missingness rate is large, this can have appreciable consequences in terms of bias. This result is interesting, as we expected no bias but high variance from overfitting, on account of the commonly known bias-variance tradeoff. A possible explanation for this bias comes from the measurement error modeling literature: we postulate that overfitting an imputation model introduces extra variability into the imputed predictor, which can be understood as a form of measurement error. When there is unaccounted-for measurement error in a predictor, linear model parameter estimates tend to be biased. In our simulation, this would bias estimates of β1 toward zero, an effect known as regression attenuation, and this is what we observed in our simulations.
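The attenuation argument can be checked with a quick stdlib-only simulation: regressing on an error-contaminated copy of a predictor shrinks the fitted slope by roughly var(x)/(var(x) + var(u)). All values below are illustrative and not tied to the paper's simulation settings.

```python
import random
import statistics

# Regression attenuation: unmodeled noise in a predictor (here, a stand-in
# for the extra variability a poorly chosen imputation can inject) biases
# the fitted slope toward zero.

random.seed(0)
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 0.5) for xi in x]    # true slope = 1
w = [xi + random.gauss(0, 1) for xi in x]      # predictor plus noise, var(u) = 1

def ols_slope(pred, resp):
    # Simple least-squares slope of resp on pred.
    mx, my = statistics.fmean(pred), statistics.fmean(resp)
    sxy = sum((a - mx) * (b - my) for a, b in zip(pred, resp))
    sxx = sum((a - mx) ** 2 for a in pred)
    return sxy / sxx

print(ols_slope(x, y))   # close to 1.0
print(ols_slope(w, y))   # close to 0.5: attenuated by var(x)/(var(x)+var(u))
```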
FIGURE 1

Boxplot of the relative bias of β̂1 for different methods. Each method is compared with the original dataset (Complete) based on the accuracy of its estimator as the missing proportion and the sample size increase from the top‐left corner to the bottom‐right. Note that AIC/BIC refer to the model chosen by AIC/BIC, respectively, among the competing imputation models.

Table 2 shows the root mean square error (RMSE) of β̂1 for the competing models across 500 simulations. As before, there was poor performance not only for the misspecified imputation models but also for the strongly overparameterized model when the missingness rate was not small and the sample size not large (Table 2, fourth row). The mildly overparameterized model showed no detectable loss of performance.
TABLE 2

RMSE (×1000) of β̂1 for different imputation methods across 500 simulations, compared with the original dataset (Complete)

                          10%                    25%                    40%
                  n=50  n=100  n=1000    n=50  n=100  n=1000    n=50  n=100  n=1000
Complete           146    103      32     146    103      32     146    103      32
Correct            152    107      32     159    111      34     171    115      36
Overpar. mild      152    106      32     159    111      34     171    115      36
Overpar. strong    152    108      33     168    114      34     195    122      36
Missp. mean        152    114      50     174    135      81     195    153      98
Missp. dist.       202    172     136     305    268     248     373    339     310
AIC                152    107      32     161    111      34     174    115      36
BIC                152    107      32     161    111      34     174    115      36

Note: The true value of the slope coefficient is 1.

Estimation of the other regression parameters (Appendix C1) showed a similar cost of imputation model misspecification, but little impact of overfitting. We also looked at coverage probability and found that this was generally at or close to nominal levels, except in cases where the estimator was biased due to misspecification (cf. Figure 1). We then investigated how well these issues can be avoided using information criteria, by examining the impact on post‐selection inference of choosing the imputation model using AIC/BIC. When making inferences about β1 after imputation model selection using AIC and BIC, we found negligible bias (Figure 1) and low RMSE (Table 2) in all scenarios, hence providing some protection against issues due to misspecification or overfitting of the imputation models. Finally, while we used purpose‐written StEM code to fit these models, we would expect similar results from any available multiple imputation software used to fit the same imputation models. To confirm this, we repeated the simulations using the Amelia and mice R packages and compared the results to those seen here. Amelia is a bootstrapped version of the EM algorithm that assumes the data are multivariate normal and imputes the missing data by drawing from their posterior distribution given the estimated parameters. The mice package with the mice.impute.norm.nob() function was used to impute using improper imputation by linear regression (MICE). We saw similar patterns for Amelia and MICE to those presented here (see Appendix C1).

Multivariate missing variable

Next, we investigated the case where missingness occurred in multiple variables and the response was non-normal (here we considered binary data). The purpose of this simulation study is to investigate the performance of information criteria for choosing between MAR and MNAR models with MI. Following example 4.1.2 from Ibrahim, we now suppose that the response variables are independent, fully observed Bernoulli random variables and that we are interested in the analysis of a logistic regression (GLM) model. Suppose that the two predictors are partially observed variables from a bivariate normal distribution, and that each observation carries a two-dimensional missingness indicator recording which of the two predictors is missing. We are interested in the comparison of the following two imputation models: Model 1 assumes that the data are MNAR and missingness is nonignorable, whereas Model 2 assumes that the data are MAR and ignores the missingness. We ran 500 simulations in which, on average, we obtained about 19 observations with only the first predictor missing, 25 with only the second missing, and 6 with both missing. To draw imputations at each iteration of StEM, we used Metropolis-Hastings within a Gibbs sampler to generate samples of the missing predictors from their conditional distribution given the observed data (and, for Model 1, the missingness indicators). Once again, we used the squared distance between the parameters at the (t−1)th and tth iterations as the convergence criterion and set the convergence threshold to 10, with the number of multiple imputations set to M. Table 3 shows the proportion of times each model was chosen based on the corresponding information criterion: AIC and BIC chose the correct model (Model 1) 99.4% and 95.2% of the time, respectively. These results are consistent with the previous simulation results (cf. Section 4.1.1) as well as with our theoretical result (cf. Section 3), indicating that in situations where AIC and BIC are applicable, they perform satisfactorily for imputation model selection.
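As a schematic of how such criteria can be compared after StEM/improper MI, the sketch below uses hypothetical log-likelihood values and, as a simplification of the Section 3 criteria, treats the average of the complete-data log-likelihoods across the M imputations as a stand-in for the observed-data log-likelihood:

```python
import numpy as np

def ic_from_imputations(logliks, n_params, n_obs):
    """AIC and BIC computed from the complete-data log-likelihoods of M fitted
    imputations, averaging them as a stand-in for the observed-data
    log-likelihood (a simplification of the Section 3 criteria)."""
    llbar = float(np.mean(logliks))
    return -2 * llbar + 2 * n_params, -2 * llbar + np.log(n_obs) * n_params

# Hypothetical log-likelihoods from M = 10 imputations under two candidate models:
ll_m1 = [-120.4, -119.8, -121.0, -120.1, -120.6,
         -119.9, -120.3, -120.8, -120.2, -120.5]
ll_m2 = [-131.2, -130.7, -131.5, -130.9, -131.1,
         -130.8, -131.3, -131.0, -130.6, -131.4]
aic1, bic1 = ic_from_imputations(ll_m1, n_params=5, n_obs=100)
aic2, bic2 = ic_from_imputations(ll_m2, n_params=3, n_obs=100)
print(aic1 < aic2, bic1 < bic2)  # here Model 1 wins on both criteria
```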
TABLE 3

Proportion of times (%) the information criterion chooses the corresponding model across 500 simulated datasets

        Model 1   Model 2
AIC     99.4      0.6
BIC     95.2      4.8

Note: Data were generated under Model 1, which assumes MNAR, whereas Model 2 assumed the data were MAR. Both criteria recovered the correct model with high probability.

Once again, we investigated the impact of imputation model selection, here between MNAR and MAR models, on post-selection inference. Table 4 shows the sample average of the parameter estimates under each imputation model across the 500 simulations, together with the mean squared error (squared bias plus simulated variance). Parameter estimates can be heavily biased if the wrong imputation model is used (Model 2), with much larger mean squared errors.
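The mean squared error reported in such simulation tables decomposes as squared bias plus variance of the estimates; a minimal helper, applied to hypothetical simulation output:

```python
import numpy as np

def simulation_mse(estimates, true_value):
    """Empirical MSE over simulation replicates: squared bias plus variance."""
    e = np.asarray(estimates, dtype=float)
    bias = e.mean() - true_value
    return bias**2 + e.var()

# 500 hypothetical estimates of a parameter whose true value is 1,
# from an estimator that is biased upward by about 0.2:
rng = np.random.default_rng(1)
est = rng.normal(loc=1.2, scale=0.3, size=500)
mse = simulation_mse(est, 1.0)
print(round(float(mse), 3))
```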
TABLE 4

Average parameter estimates and mean squared error (in parentheses) for different imputation models across 500 simulations

          β0 = 1          β1 = 1          β2 = 1
Model 1   1.008 (0.027)   1.014 (0.083)   1.018 (0.084)
Model 2   1.521 (0.342)   1.288 (0.177)   1.295 (0.186)

Examples

We give two real‐data examples in health research studies where missing values arise. The aim in both examples is to select the most appropriate imputation model, using the tools presented in Section 3.

Survival of infants

The following example is taken from example 9.8 of Little and Rubin, where a 2×2×2 contingency table on the survival of infants is analyzed in the presence of missing data. We have three dichotomous variables: Prenatal care (P), Survival (S), and Clinic (C). Table 5 is (artificially) partially classified for about 26% of the data (255 cases out of 970), for which only Prenatal care and Survival were observed but not Clinic; the remainder of the data (715 cases out of 970) are completely observed. The response variable is the vector of cell counts, with each cell having an associated classification probability. Subject to missingness, these cell counts can be modeled by assuming different associations between the three factors: Survival (S), Prenatal care (P), and Clinic (C).
TABLE 5

Survival of infants taken directly from example 9.8 of Little and Rubin

                                 Survival (S)
Clinic (C)   Prenatal care (P)   Died   Survived
A            Less                3      176
             More                4      293
B            Less                17     197
             More                2      23       (complete = 715)
?            Less                10     150
             More                5      90       (partial = 255)
Assuming that these data are MAR and the missingness is ignorable, Little and Rubin applied the EM algorithm to fit various log-linear models, where, for example, {SC, P} denotes a model with all the main effects of Survival, Prenatal care, and Clinic plus the interaction effect between Survival and Clinic. The goodness of fit was then assessed for each candidate model using likelihood-ratio tests. Meng and Rubin applied Bayesian MI with a full (saturated) model, testing the null models {SC, PC} and {S, P, C} (a main-effects-only model) against the full model using their pooled likelihood-ratio test developed for MI. Both approaches concluded that Survival is related to Clinic but that, conditional on Clinic, Survival and Prenatal care are independent, indicating that {SC, PC} is the best parsimonious fitted model. This example can also be viewed as a problem of imputation model selection, where the response variable is the missing variable: the model that we choose to fit to the data can be used as the imputation model in the MI algorithm. We fitted five competing imputation models: Model 1: {SC, SP, PC}, Model 2: {SC, PC}, Model 3: {SP, PC}, Model 4: {SP, SC}, and Model 5: {S, P, C}. To demonstrate the model fitting procedure, the StEM/improper MI algorithm can be applied to this problem by the following iterative steps: (1) estimate an initial value of the cell probabilities from the observed counts; (2) at the tth iteration, simulate the unobserved classifications of the partially classified counts from their conditional distribution given the observed data and the current parameter estimates (for instance, under Model 2, each partially classified count is allocated to clinics in proportion to the current fitted cell probabilities); (3) re-estimate the parameters by maximizing the complete-data log-likelihood of a multinomial model; (4) calculate the AIC/BIC for the fitted model using the criteria presented in Section 3. Table 6 shows the AIC and BIC for the above imputation models based on multiple imputations; the convergence threshold for the algorithm was set to 10.
AIC and BIC both favored Model 2, in line with the results of Little and Rubin and Meng and Rubin.
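The iterative steps above can be sketched in code for a simple case. This toy implementation assumes a saturated imputation model (not one of the five candidates) and omits the final averaging over M retained iterations: it imputes the unobserved clinic for the partially classified counts of Table 5 by binomial draws, then re-estimates the cell probabilities from the completed table:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fully classified counts from Table 5, indexed [clinic, prenatal, survival]
# with clinic in {A, B}, prenatal in {Less, More}, survival in {Died, Survived}.
full = np.array([[[3.0, 176.0], [4.0, 293.0]],
                 [[17.0, 197.0], [2.0, 23.0]]])
# Partially classified counts (clinic unobserved), indexed [prenatal, survival].
partial = np.array([[10.0, 150.0], [5.0, 90.0]])
n = full.sum() + partial.sum()  # 970

pi = np.full((2, 2, 2), 1 / 8)  # initial cell probabilities
for _ in range(200):
    # Stochastic imputation step: allocate each partial count between clinics
    # A and B using the current conditional probability of clinic A.
    imputed = np.zeros_like(full)
    for p in range(2):
        for s in range(2):
            prob_a = pi[0, p, s] / (pi[0, p, s] + pi[1, p, s])
            a = rng.binomial(int(partial[p, s]), prob_a)
            imputed[0, p, s] = a
            imputed[1, p, s] = partial[p, s] - a
    # M-step: multinomial MLE from the completed table.
    pi = (full + imputed) / n

print(np.round(100 * pi, 2))  # estimated cell probabilities, in percent
```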
TABLE 6

Information criteria for each candidate imputation model for the survival of infants example

        Model 1   Model 2   Model 3   Model 4   Model 5
AIC     27.02     25.12     33.29     156.71    168.28
BIC     63.60     57.13     65.30     188.71    191.14

Note: Both AIC and BIC favored Model 2, {SC, PC}, in line with previous analyses.

Table 7 shows the estimated cell probabilities under each of the above imputation models. These estimates differ across imputation models. For example, the percentage of infants dying with less prenatal care in clinic A is estimated at 0.49 under the correct imputation model {SC, PC} and 0.45 under the overfitted imputation model {SC, SP, PC}. However, this percentage is estimated to be much higher under Model 3 (at 0.82), and higher still under Model 4 (at 1.41) and Model 5 (at 1.60). Thus the cell probabilities are estimated similarly by the correct model {SC, PC} and the overfitted model {SC, SP, PC}, but there is a clear difference between the estimates under Models 3, 4, and 5 and those under the correct imputation model. The estimated cell probabilities in Table 7 can be compared with their ML estimates via the EM algorithm obtained in Little and Rubin, table 9.9.
TABLE 7

Estimated cell probabilities for candidate imputation models in Survival of infants data

                                                     Survival (S)
                        Clinic (C)   Prenatal care (P)   Died   Survived
Model 1: {SC, SP, PC}   A            Less                0.45   25.35
                        A            More                0.79   38.78
                        B            Less                2.64   28.57
                        B            More                0.34   3.07
Model 2: {SC, PC}       A            Less                0.49   25.27
                        A            More                0.76   38.79
                        B            Less                2.68   28.57
                        B            More                0.30   3.15
Model 3: {SP, PC}       A            Less                0.82   36.66
                        A            More                0.30   28.46
                        B            Less                2.27   17.26
                        B            More                0.83   13.40
Model 4: {SP, SC}       A            Less                1.41   24.64
                        A            More                1.05   38.59
                        B            Less                1.68   29.27
                        B            More                0.09   3.23
Model 5: {S, P, C}      A            Less                1.60   36.34
                        A            More                1.21   27.41
                        B            Less                0.81   18.26
                        B            More                0.09   3.27

Note: These estimates differ under each model. Although cell probabilities are estimated more similarly under the overfitted model {SC, SP, PC} and the correct model {SC, PC}, there is a clear difference between the estimated cell probabilities under Models 3, 4, and 5 and those under the correct imputation model {SC, PC}.


Pima Indian women

In our next example, we analyzed data collected on 768 Pima Indian women of at least 21 years of age living near Phoenix, Arizona, who were tested for diabetes according to World Health Organization criteria. These data were collected by the U.S. National Institute of Diabetes and Digestive and Kidney Diseases, and are available in the mlbench R-package. These data were analyzed in Miller et al and further in Smith et al and Ripley. The binary response variable indicates whether or not diabetes was diagnosed within 5 years of the examination (positive/negative). The explanatory variables are number of pregnancies, plasma glucose concentration at 2 hours in an oral glucose tolerance test, diastolic blood pressure (mm Hg), triceps skin fold thickness (mm), 2-hour serum insulin (μU/ml), body mass index (kg/m²), diabetes pedigree function, and age (years). There are five values missing (0.6%) for plasma glucose concentration, 11 (1.4%) for body mass index, 35 (4.6%) for diastolic blood pressure, 227 (30%) for triceps skin fold thickness, and 374 (49%) for 2-hour serum insulin. We fitted a logistic regression generalized linear model assuming three different imputation models: Model 1, the missingness mechanism is nonignorable (MNAR); Model 2, the missingness mechanism is ignorable (MAR); Model 3, complete-case estimation (MCAR). Using StEM/improper MI via Metropolis-Hastings within a Gibbs sampler, similar to Section 4.2, the joint missingness indicator of all partially observed predictors is modeled so that under Model 1 missingness may depend on the unobserved covariate values, whereas under Model 2 it depends only on observed quantities. Under the complete-case model, missingness is assumed independent of both the response and the explanatory variables. Table 8 shows the AIC and BIC values for each fitted model; both information criteria choose Model 1 over the other models. This strongly suggests that the missingness mechanism is nonignorable.
Also, Table 9 shows the estimates of the regression parameters for each imputation model. Note that the results in this table are based on log-transformations of insulin, pregnancy, pedigree, and age, as well as standardization of all predictors, so they are used here only for comparison between the three imputation models. Quite different results are obtained for these three models. For example, the regression parameter for 2-hour serum insulin is estimated at 0.321 for Model 1, 0.022 for Model 2, and 0.084 for Model 3. This illustrates how different the conclusions can be depending on whether the missingness mechanism is included in the model.
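The Metropolis-Hastings-within-Gibbs imputation step can be illustrated in isolation. In this sketch all coefficients and distributions are hypothetical (not the fitted Pima model): one missing normal covariate is drawn given a binary response under a logistic regression with known coefficients:

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mh_impute(y_j, x_cur, b0, b1, mu, sd, rng, step=1.0):
    """One Metropolis-Hastings update for a missing covariate x_j, targeting
    p(x | y) proportional to N(x; mu, sd^2) * Bernoulli(y; logit^-1(b0 + b1*x))."""
    def log_post(x):
        p = sigmoid(b0 + b1 * x)
        return y_j * np.log(p) + (1 - y_j) * np.log1p(-p) - 0.5 * ((x - mu) / sd) ** 2
    prop = x_cur + rng.normal(scale=step)  # random-walk proposal
    if np.log(rng.random()) < log_post(prop) - log_post(x_cur):
        return prop
    return x_cur

# Chain of imputations for one missing value whose response is y_j = 1:
x, draws = 0.0, []
for _ in range(2000):
    x = mh_impute(1, x, b0=-1.0, b1=2.0, mu=0.0, sd=1.0, rng=rng)
    draws.append(x)
mean_draw = float(np.mean(draws[500:]))  # discard burn-in
print(round(mean_draw, 2))  # pulled above the prior mean of 0, since y_j = 1
```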
TABLE 8

Information criterion for different imputation models fitted to the Pima Indian women data

        Model 1    Model 2    Model 3
AIC     7231.142   7660.071   10544.15
BIC     7714.096   8026.931   10818.13

Note: The data strongly favored Model 1, suggesting a non‐ignorable missing data mechanism.

TABLE 9

Estimates of regression parameters and their standard errors (in parentheses) for different imputation models in the Pima Indian women example

          Intercept       Pregnancy       Glucose         Pressure        Triceps         Insulin         BMI             Pedigree        Age
Model 1   0.912 (0.102)   0.282 (0.113)   0.918 (0.143)   0.034 (0.107)   0.163 (0.152)   0.321 (0.208)   0.503 (0.139)   0.328 (0.098)   0.284 (0.120)
Model 2   0.890 (0.100)   0.290 (0.113)   1.080 (0.185)   0.124 (0.110)   0.088 (0.199)   0.022 (0.267)   0.606 (0.179)   0.322 (0.098)   0.315 (0.119)
Model 3   1.055 (0.154)   0.076 (0.187)   1.055 (0.178)   0.025 (0.146)   0.114 (0.180)   0.084 (0.181)   0.449 (0.190)   0.407 (0.147)   0.628 (0.211)

Note: Pregnancy, insulin, pedigree, and age are log‐transformed.

Note that the nonignorability of the missingness mechanism necessarily relies on parametric assumptions. In this example, we have assumed that the missingness indicator follows a multivariate Bernoulli distribution with a logit link function, and for Model 1 we assume nonignorability of the missingness mechanism for all the missing variables. A further investigation of different parametric assumptions for the missingness model (or of whether the nonignorability assumption can be relaxed for some of the missing variables) may be carried out if one has a reasonable argument for considering these in the study. For example, a visual inspection of the density plots of the observed and imputed data for triceps skin fold thickness and 2-hour serum insulin under Model 1 in Figure 2 might suggest further investigation into whether the missingness mechanism for triceps skin fold thickness can be ignored.
FIGURE 2

Density curve plots of, A, observed triceps skin fold thickness (blue) and multiple imputed values of triceps skin fold thickness (red) and, B, observed 2‐hour serum insulin (blue) and multiple imputed values of 2‐hour serum insulin (red) for Model 1 in Pima Indian Women example. Similarity between the observed curve and imputed curves in (A) suggests that triceps skin fold thickness might be missing at random. Difference between the observed curve and imputed curves in (B) suggests that 2‐hour serum insulin is missing not at random [Colour figure can be viewed at wileyonlinelibrary.com]

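The visual comparison in Figure 2 can be complemented by an informal numerical check. For example (hypothetical data, loosely inspired by the triceps variable; using SciPy's two-sample Kolmogorov-Smirnov test), a large distributional difference between observed and imputed values shows up as a tiny p-value:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)

# Hypothetical observed values and multiply-imputed values of one variable
# (stand-ins only; not the real Pima data).
observed = rng.normal(loc=29.0, scale=10.0, size=500)
imputed_similar = rng.normal(loc=29.0, scale=10.0, size=300)  # MAR-like pattern
imputed_shifted = rng.normal(loc=40.0, scale=10.0, size=300)  # MNAR-like pattern

p_similar = ks_2samp(observed, imputed_similar).pvalue
p_shifted = ks_2samp(observed, imputed_shifted).pvalue
print(round(p_similar, 3), p_shifted)
```

As with the density plots themselves, this is only a descriptive diagnostic: a distributional difference between observed and imputed values is expected under MNAR, but does not by itself prove nonignorability.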

Discussion

By exploiting the connection between MLE and MI, our findings show that we can diagnose imputation models for misspecification using standard likelihood-based information criteria such as AIC and BIC. Moreover, for the first time, we have provided insights into imputation model misspecification via variational approximation. Furthermore, we examined some theoretical properties of BIC and showed that BIC is consistent for selecting the correct imputation model, a result that also holds when the correct model is missing not at random. We analyzed two health research datasets to demonstrate the method's flexibility for imputation model selection in the presence of missing data and the use of multiple imputation; for these examples, we evaluated the post-imputation effect of imputation model selection on parameter estimates. We conducted three simulation studies and showed that imputation model misspecification and overfitting can have substantial costs in terms of bias and efficiency of parameter estimation, a problem that can often be avoided using model selection. These results are not surprising, since the costs of misspecifying and overfitting models are well known, but it is perhaps surprising that this has been little considered previously in the context of choosing an imputation model, where the ideas apply equally well, as does the notion of using model selection to guard against such problems. A simulation result that we did not anticipate is that overfitting the imputation model had costs not just in loss of efficiency, but also in bias. This is unusual because overfitting a model is conventionally thought not to bias predictions but rather to increase their variance (the "bias-variance tradeoff"). We speculate that bias arose in our simulations via a mechanism similar to regression attenuation in measurement error modeling.
Specifically, overfitting the imputation model would lead to larger variance in the imputed missing values of predictors, which can be interpreted as a form of measurement error on the predictors. Unaccounted-for measurement error in predictors tends to bias coefficient estimators, and in our simulation this would bias the slope estimator toward zero, as we observed. While our simulations suggested negligible loss of efficiency when mildly overparametrizing the imputation model (with just one unneeded predictor included), this need not be true in general. It became clear in our simulation work that sensitivity to the choice of imputation model varied considerably with context, and some situations are much more sensitive to the imputation method than others. The fitted models used in this article were quite simple, so it would be of interest to see how the criteria perform in more complicated scenarios, such as high-dimensional settings where the number of variables or components available for the analysis is larger than the number of observations, or when including random effects in the model. That said, our approach is very general and applies to any scenario where MI is used on a regression model that is fitted by ML for complete data, and thus we do not anticipate any difficulties extending the simulation studies to other settings such as mixed models, beyond the consideration of which information criterion to use. The application of our findings may be extended to other areas where MI's performance could be improved in a likelihood-based framework. For example, suppose that we would like to predict myocardial infarction in patients with observed blood pressure, body mass index, age, and gender, but cholesterol level is subject to missingness. MI based on Rubin's rules does not provide any guidance on how to approach prediction in the presence of missing data, and most of the methods suggested in the MI literature are ad hoc.
However, there is clear guidance on how to carry out prediction in a likelihood-based framework, and the close connection between MI and StEM could provide clarity in this field. Another example of where likelihood-based inference might improve MI's performance is in hypothesis testing. In the MI literature, hypothesis testing based on multiply imputed datasets was proposed via a modified Wald test statistic. Subsequent work by Meng and Rubin developed a pooling procedure for a likelihood-ratio test with MI for nested models, using the asymptotic relationship between the Wald and likelihood-ratio test statistics. Although this approach works well, it can be quite cumbersome to implement in practice. Using connections with maximum likelihood, a likelihood-ratio statistic could be directly constructed for MI; our preliminary work suggests this has improved performance over other statistics in the MI literature. We primarily focused on the StEM algorithm in this article; however, an alternative Monte Carlo approximation to the EM algorithm, known as MCEM, could similarly be used. This algorithm obtains an (approximate) MLE by averaging likelihood estimates across the multiple imputations and then maximizing, as opposed to StEM, which maximizes on each imputation and then averages the estimates. The MCEM algorithm is more efficient than the StEM algorithm for finite sample sizes and a finite number of imputations, because StEM loses some efficiency through its maximize-then-average strategy. Therefore, in situations where there is little concern for computational efficiency, it would be beneficial to apply MCEM instead of StEM without compromising any of the results discussed in this article. Finally, in this article we focused on using a fixed value of the parameter θ rather than a random θ.
As such, there is the question of whether there are any implications from using improper rather than proper MI, where we optimize θ rather than sampling it from a distribution (see Box 1, Step 2). While this was done in order to make a connection to maximum likelihood, we anticipate few issues in extending our results to the random-θ scenario of proper MI. The reason is that the inference machinery behind likelihood-based inference uses asymptotic theory as the sample size grows, and in this setting the posterior distribution for θ will typically converge in probability to its optimized value; hence we anticipate a negligible effect of sampling θ from its posterior rather than plugging in its optimized value. One possible adjustment to the proposed approaches when θ is random is to instead use Bayesian information criteria, such as the deviance information criterion. Future work could study whether similar arguments apply for random parameters.
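The attenuation mechanism speculated about earlier in this discussion is easy to demonstrate in a toy simulation: adding independent noise to a predictor, as the excess variability injected by an overfitted imputation model would, shrinks the fitted slope toward zero (synthetic data, NumPy only):

```python
import numpy as np

rng = np.random.default_rng(2)

# True model: y = x + noise, so the true slope is 1.
n = 100_000
x = rng.normal(size=n)
y = x + rng.normal(scale=0.5, size=n)

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    return float(np.cov(x, y)[0, 1] / np.var(x, ddof=1))

b_clean = ols_slope(x, y)
# Contaminate x with independent noise, mimicking the extra variability that
# an overfitted imputation model injects into imputed predictor values:
x_noisy = x + rng.normal(scale=0.7, size=n)
b_noisy = ols_slope(x_noisy, y)
print(round(b_clean, 2), round(b_noisy, 2))  # the noisy slope is attenuated
```

With noise variance 0.49 added to a unit-variance predictor, the classical attenuation factor is 1/(1 + 0.49) ≈ 0.67.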

Box 1. MI (proper) and StEM (improper) algorithms, note that they only differ at step 2

MI ("proper"):
0. Fix θ(0) in Θ
1. Draw z(t+1) ~ p(z | y, r, θ(t))
2. Draw θ(t+1) ~ p(θ | y, r, z(t+1))
3. Repeat steps 1 and 2 until convergence (at t = T)
4. Repeat steps 1 and 2 for t = T + 1, …, T + M
5. θ̂ = (1/M) Σ_{i=T+1}^{T+M} θ(i)

StEM ("improper"):
0. Fix θ(0) in Θ
1. Draw z(t+1) ~ p(z | y, r, θ(t))
2. Set θ(t+1) = argmax_θ p(y, r, z(t+1) | θ)
3. Repeat steps 1 and 2 until convergence (at t = T)
4. Repeat steps 1 and 2 for t = T + 1, …, T + M
5. θ̂ = (1/M) Σ_{i=T+1}^{T+M} θ(i)
TABLE C1

RMSE (×1000) of the slope estimate for different imputation methods across 500 simulations, compared with the original dataset (Complete)

                  10%                     25%                     40%
                  n=50   n=100  n=1000    n=50   n=100  n=1000    n=50   n=100  n=1000
Complete          137    103    31        137    103    31        137    103    31
Correct           143    106    32        150    113    33        167    120    36
Overpar. mild     143    106    32        151    112    33        167    120    36
Overpar. strong   144    106    32        152    113    33        167    119    35
Missp. mean       149    112    46        170    136    82        206    162    124
Missp. dist.      162    130    75        202    174    137       232    213    174
AIC               143    106    32        150    113    33        168    120    36
BIC               143    106    32        151    113    33        168    120    36

Note: The true value of the slope coefficient is 1.

TABLE C2

RMSE (×1000) of the slope estimate for different imputation methods across 500 simulations, compared with the original dataset (Complete)

                  10%                     25%                     40%
                  n=50   n=100  n=1000    n=50   n=100  n=1000    n=50   n=100  n=1000
Complete          147    110    32        147    110    32        147    110    32
Correct           157    119    34        170    124    36        186    134    38
Overpar. mild     157    118    34        170    124    36        185    133    38
Overpar. strong   158    119    34        166    123    36        185    133    38
Missp. mean       169    131    68        197    169    122       236    203    168
Missp. dist.      170    132    64        204    161    105       226    187    129
AIC               157    119    34        173    124    36        190    133    38
BIC               157    119    34        173    125    36        190    134    38

Note: The true value of the slope coefficient is 1.

TABLE C3

RMSE (×1000) of the parameter estimates for Amelia and MICE correct, overparametrized mild/strong, and misspecified mean models across 500 simulations, compared with the original dataset (Complete)

                  10%                     25%                     40%
                  n=50   n=100  n=1000    n=50   n=100  n=1000    n=50   n=100  n=1000
Complete          137    103    31        137    103    31        137    103    31
Amelia correct    144    106    32        151    113    33        168    121    36
MICE correct      143    106    32        149    112    33        166    120    36
Amelia mild       143    106    32        150    112    33        167    120    35
MICE mild         143    106    32        150    112    33        167    119    36
Amelia strong     144    107    32        152    113    33        167    121    36
MICE strong       145    107    32        152    113    33        167    120    35
Amelia missp.     149    112    45        171    137    82        207    164    124
MICE missp.       149    112    45        170    136    82        205    162    124

Note: The true value of the coefficients is 1.

TABLE C4

RMSE (×1000) of the parameter estimates for Amelia and MICE correct, overparametrized mild/strong, and misspecified mean models across 500 simulations, compared with the original dataset (Complete)

                  10%                     25%                     40%
                  n=50   n=100  n=1000    n=50   n=100  n=1000    n=50   n=100  n=1000
Complete          146    103    32        146    103    32        146    103    32
Amelia correct    153    107    32        161    111    34        177    117    36
MICE correct      153    106    32        159    111    34        172    115    36
Amelia mild       153    106    32        161    111    34        176    116    36
MICE mild         153    107    32        159    111    34        172    115    36
Amelia strong     152    107    32        161    111    34        181    116    36
MICE strong       152    107    32        161    111    34        179    117    36
Amelia missp.     152    113    50        165    129    80        182    142    96
MICE missp.       152    114    50        170    133    81        189    149    98

Note: The true value of the coefficients is 1.

TABLE C5

RMSE (×1000) of the parameter estimates for Amelia and MICE correct, overparametrized mild/strong, and misspecified mean models across 500 simulations, compared with the original dataset (Complete)

                  10%                     25%                     40%
                  n=50   n=100  n=1000    n=50   n=100  n=1000    n=50   n=100  n=1000
Complete          147    110    32        147    110    32        147    110    32
Amelia correct    157    119    34        170    124    36        188    134    38
MICE correct      157    119    34        170    124    36        185    133    38
Amelia mild       157    119    34        169    124    36        188    133    38
MICE mild         157    119    34        170    124    36        186    133    38
Amelia strong     158    119    34        167    124    36        187    135    38
MICE strong       158    119    34        167    123    36        189    134    38
Amelia missp.     169    131    68        198    170    123       241    206    168
MICE missp.       169    131    68        197    170    123       238    205    168

Note: The true value of the coefficients is 1.

TABLE C6

Proportion of times (%) the information criterion chooses the fitted imputation model for different sample sizes in 500 simulated datasets

               Model 1   Model 2   Model 3   Model 4
n=50     AIC   0.0       3.0       81.0      16.0
         BIC   0.0       10.0      83.8      6.2
n=100    AIC   0.0       0.0       85.2      14.8
         BIC   0.0       0.2       97.0      2.8
n=1000   AIC   0.0       0.0       84.4      15.6
         BIC   0.0       0.0       99.0      1.0

Note: Model 3 is the correct model, which was selected with the highest proportion even for a small sample size of n = 50. The proportion of times this model was chosen approached one as the sample size increased.

TABLE C7

Average parameter estimates and mean squared error (in parentheses) for different imputation models across 500 simulations, for different sample sizes

                     β0 = 1         β1 = 1         β2 = 1         β12 = 1
n = 50    Model 1    1.171 (0.09)   1.423 (0.47)   0.719 (0.13)   0.627 (0.42)
          Model 2    0.819 (0.10)   1.284 (0.18)   1.187 (0.11)   0.713 (0.19)
          Model 3    0.938 (0.07)   1.095 (0.14)   1.033 (0.09)   0.945 (0.16)
          Model 4    0.937 (0.07)   1.095 (0.14)   1.034 (0.09)   0.945 (0.16)
n = 100   Model 1    1.193 (0.07)   1.438 (0.33)   0.708 (0.11)   0.601 (0.29)
          Model 2    0.862 (0.05)   1.228 (0.10)   1.158 (0.06)   0.754 (0.11)
          Model 3    0.982 (0.04)   1.029 (0.06)   0.999 (0.05)   0.998 (0.07)
          Model 4    0.981 (0.04)   1.030 (0.06)   0.999 (0.04)   0.997 (0.07)
n = 1000  Model 1    1.211 (0.05)   1.397 (0.17)   0.709 (0.09)   0.618 (0.16)
          Model 2    0.878 (0.02)   1.192 (0.04)   1.161 (0.03)   0.764 (0.06)
          Model 3    1.001 (0.00)   0.999 (0.01)   1.000 (0.00)   1.000 (0.01)
          Model 4    1.001 (0.00)   0.999 (0.01)   1.000 (0.00)   1.000 (0.01)
References (14 in total; 10 shown)

1. Collins LM, Schafer JL, Kam CM. A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol Methods. 2001.
2. Graham JW. Missing data analysis: making it work in the real world. Annu Rev Psychol. 2009.
3. Laird NM. Missing data in longitudinal studies. Stat Med. 1988.
4. Carpenter J, Pocock S, Lamm CJ. Coping with missing data in clinical trials: a model-based approach applied to asthma trials. Stat Med. 2002.
5. Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009.
6. Ibrahim JG, Zhu H, Tang N. Model selection criteria for missing-data problems using the EM algorithm. J Am Stat Assoc. 2008.
7. Carpenter JR, Kenward MG, White IR. Sensitivity analysis after multiple imputation under missing at random: a weighting approach. Stat Methods Med Res. 2007.
8. Wood AM, Royston P, White IR. The estimation and use of predictions for the assessment of model performance using large samples with multiply imputed data. Biom J. 2015.
9. White IR, Royston P. Imputing missing covariate values for the Cox model. Stat Med. 2009.
10. Stirrup OT, Babiker AG, Carpenter JR, Copas AJ. Fractional Brownian motion and multivariate-t models for longitudinal biomedical data, with application to CD4 counts in HIV-positive patients. Stat Med. 2015.