Cattram D Nguyen, John B Carlin, Katherine J Lee.
Abstract
BACKGROUND: Multiple imputation has become very popular as a general-purpose method for handling missing data. The validity of multiple-imputation-based analyses relies on the use of an appropriate model to impute the missing values. Despite the widespread use of multiple imputation, there are few guidelines available for checking imputation models. ANALYSIS: In this paper, we provide an overview of currently available methods for checking imputation models. These include graphical checks and numerical summaries, as well as simulation-based methods such as posterior predictive checking. These model checking techniques are illustrated using an analysis affected by missing data from the Longitudinal Study of Australian Children.
Keywords: Cross-validation; Diagnostics; Missing data; Model checking; Multiple imputation; Posterior predictive checking
Year: 2017 PMID: 28852415 PMCID: PMC5569512 DOI: 10.1186/s12982-017-0062-6
Source DB: PubMed Journal: Emerg Themes Epidemiol ISSN: 1742-7622
Missing data patterns for variables in the logistic regression analysis model (n = 5107)
| Number of participants | Percent | Conduct problems | Harsh discipline | SEP | Hardship | Psychological distress |
|---|---|---|---|---|---|---|
| 3163 | 62 | + | + | + | + | + |
| 733 | 14 | + | − | + | + | + |
| 352 | 7 | − | − | − | − | − |
| 255 | 5 | − | + | + | + | + |
| 234 | 5 | − | − | + | + | + |
| 149 | 3 | + | − | − | − | − |
| 82 | 2 | + | − | + | + | − |
| 55 | 1 | + | + | + | + | − |
| 41 | 1 | − | − | + | + | − |
| 22 | 0.4 | + | + | + | − | + |
| 7 | 0.1 | − | + | + | + | − |
| 5 | 0.1 | + | − | + | − | + |
| 3 | 0.1 | − | − | + | − | + |
| 2 | 0.04 | − | + | − | + | + |
| 1 | 0.02 | − | − | + | − | − |
| 1 | 0.02 | − | + | + | − | + |
| 1 | 0.02 | + | − | − | + | − |
| 1 | 0.02 | + | + | − | + | + |
Nb. + indicates value is present and − indicates value is missing. The sex variable was not included in the missing data patterns, because it was completely observed
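A missing-data pattern table like the one above can be produced by tabulating, for each participant, which analysis variables are observed. A minimal sketch (the function name `missing_patterns` is hypothetical, and missing values are assumed to be coded as `None`):

```python
from collections import Counter

def missing_patterns(records, variables):
    """Tabulate missing-data patterns as in the table above:
    '+' = value present, '-' = value missing (None)."""
    counts = Counter(
        tuple('+' if rec.get(var) is not None else '-' for var in variables)
        for rec in records
    )
    total = len(records)
    # Most common pattern first, with the percentage of participants
    return [(pattern, n, round(100.0 * n / total, 1))
            for pattern, n in counts.most_common()]
```

Sorting by frequency puts the complete-case pattern (usually the most common) at the top, mirroring the layout of the table.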
Baseline characteristics of participants with complete and incomplete data for the variables in the analysis model
| Variable | Complete cases (n = 3163) | Incomplete cases (n = 1944) |
|---|---|---|
| Mother’s age (at baseline), mean (SD) | 31.8 (4.9) | 29.6 (6.0) |
| Socioeconomic Z-score, mean (SD) | 0.19 (1.0) | −0.31 (1.0) |
| Child sex (male), fraction (%) | 1625/3163 (51.4) | 983/1944 (50.6) |
| Indigenous status, fraction (%) | 73/3163 (2.3) | 157/1944 (8.1) |
| Mother’s main language is not English, fraction (%) | 2825/3126 (90.4) | 1539/1877 (82.0) |
| Sole parent family, fraction (%) | 183/3163 (5.8) | 294/1944 (15.1) |
| Child has a sibling, fraction (%) | 1895/3163 (59.9) | 1193/1944 (61.4) |
| Mother completed high school, fraction (%) | 2350/3161 (74.3) | 1060/1937 (54.7) |
Nb. The denominators in the fractions are the numbers of participants for whom the measure was available
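The comparison above requires splitting participants into complete and incomplete cases with respect to the analysis-model variables, then summarising each group. A minimal sketch (helper names are hypothetical; `None` is assumed to mark a missing value):

```python
from statistics import fmean

def completeness_split(records, analysis_vars):
    """Split records into complete and incomplete cases with respect
    to the analysis-model variables (None marks a missing value)."""
    complete, incomplete = [], []
    for rec in records:
        target = complete if all(
            rec.get(v) is not None for v in analysis_vars) else incomplete
        target.append(rec)
    return complete, incomplete

def group_mean(records, var):
    """Mean of a baseline variable among records where it is observed,
    matching the available-case denominators used in the table."""
    values = [rec[var] for rec in records if rec.get(var) is not None]
    return fmean(values) if values else None
```

Large differences between the two groups (as seen here for maternal age, socioeconomic position and education) suggest the data are unlikely to be missing completely at random.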
Fig. 1 Graphs comparing the distributions of the observed (n = 3506) and imputed (n = 1601) harsh discipline scores. a Kernel density plot of the observed (solid line) and imputed (dashed line) harsh discipline scores, b histogram of the observed (transparent bars) and imputed (grey bars) harsh discipline scores, c plot of the quantiles of the imputed harsh discipline scores against quantiles of the observed scores (quantile–quantile plot), and d cumulative distribution plots of the observed (solid line) and the imputed (dashed line) harsh discipline scores. Data from a single imputed dataset have been presented
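The quantile–quantile comparison in panel (c) can also be screened numerically: pair corresponding quantiles of the observed and imputed values and look at the largest gap. A minimal sketch (function names are hypothetical):

```python
from statistics import quantiles

def qq_pairs(observed, imputed, n=10):
    """Pair corresponding quantiles (here deciles) of the observed and
    imputed values -- a numeric version of a quantile-quantile plot.
    Pairs far from the y = x line flag distributional mismatch."""
    return list(zip(quantiles(observed, n=n), quantiles(imputed, n=n)))

def max_qq_gap(observed, imputed, n=10):
    """Largest absolute quantile gap: a one-number screening summary."""
    return max(abs(o - i) for o, i in qq_pairs(observed, imputed, n=n))
```

As the authors note, some discrepancy between observed and imputed distributions is expected under MAR; a large gap is a prompt to investigate, not automatic evidence of a faulty model.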
Fig. 2 Boxplots of the observed (labelled 0) and imputed harsh discipline scores (labelled 1–20). Data are shown for the first 20 imputed datasets
Summary statistics of the observed and imputed data for the incomplete variables in the analysis model
| Variable | Observed N | Observed mean | Observed SD | Observed min | Observed max | Imputed N | Imputed mean | Imputed SD | Imputed min | Imputed max |
|---|---|---|---|---|---|---|---|---|---|---|
| Harsh discipline | 3506 | 3.36 | 1.44 | 1 | 10 | 1601 | 3.44 | 1.47 | −2.73 | 9.94 |
| Socioeconomic position | 4602 | 0.00 | 1.00 | −4.90 | 3.03 | 505 | −0.51 | 1.03 | −5.24 | 3.20 |
| Financial hardship | 4574 | 0.29 | 0.71 | 0 | 6 | 533 | 0.46 | 0.77 | −2.36 | 3.94 |
| Psychological distress | 4419 | 2.93 | 3.24 | 0 | 24 | 688 | 3.70 | 3.49 | −9.01 | 19.87 |
| Conduct problems | 4211 | 21.5%a | | | | 896 | 20.1%a | | | |
The summary statistics of the imputed data were calculated using pooled data over 40 imputations
SD standard deviation, Min minimum, Max maximum
aPercent with characteristic
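The observed-versus-imputed summaries above pool the imputed values for originally missing entries across all imputations. A minimal sketch of that calculation (function names are hypothetical; `None` is assumed to mark a missing value):

```python
from statistics import fmean, stdev

def summarise(values):
    """N, mean, SD, min, max for a list of values."""
    return {'n': len(values), 'mean': round(fmean(values), 2),
            'sd': round(stdev(values), 2),
            'min': min(values), 'max': max(values)}

def observed_vs_imputed(observed, completed_datasets):
    """observed: one value per case, with None at missing entries.
    completed_datasets: the same list after each imputation (no Nones).
    Imputed values are pooled across all imputations."""
    missing_idx = [i for i, v in enumerate(observed) if v is None]
    obs_values = [v for v in observed if v is not None]
    pooled = [ds[i] for ds in completed_datasets for i in missing_idx]
    return summarise(obs_values), summarise(pooled)
```

A comparison like this is what flags the out-of-range imputations in the table (e.g. negative harsh discipline scores from a linear imputation model for a bounded variable).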
Fig. 3 Scatterplot of the harsh discipline scores against the estimated probabilities of response with lowess curves. Data shown are observed values (black) and imputed values (red) for one imputed dataset only
Fig. 4 Plot of the residuals against the predicted values for the proposed imputation model fitted to the observed data for harsh discipline
Fig. 5 Leave-one-out cross-validation plot for the harsh discipline scores. The median imputed values across 20 imputations (black markers) have been plotted against the observed value. The error bars are the intervals between the 5th and 95th percentiles
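The leave-one-out idea is: for each observed case, pretend its value is missing, impute it repeatedly from a model fitted to the remaining cases, and compare the spread of imputations to the known value. A toy sketch with a single-predictor normal imputation model (function names are hypothetical; this is a simplification of the procedure used in the paper, not the authors' implementation):

```python
import random
from statistics import fmean, median

def fit_line(xs, ys):
    """Least-squares slope, intercept and residual SD for y ~ x."""
    n = len(xs)
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / max(n - 2, 1)) ** 0.5
    return slope, intercept, sd

def loo_imputation_check(xs, ys, n_imp=20, seed=0):
    """For each observed case, refit the model without that case, draw
    n_imp imputations from the fitted normal model, and record the
    observed value, the median draw and a 5th-95th percentile band."""
    rng = random.Random(seed)
    results = []
    for i in range(len(ys)):
        x_tr, y_tr = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        slope, intercept, sd = fit_line(x_tr, y_tr)
        draws = sorted(rng.gauss(intercept + slope * xs[i], sd)
                       for _ in range(n_imp))
        lo, hi = draws[int(0.05 * n_imp)], draws[int(0.95 * n_imp) - 1]
        results.append((ys[i], median(draws), lo, hi))
    return results
```

Observed values falling far outside their percentile bands (as in Fig. 5) point to cases the imputation model handles poorly.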
Results of posterior predictive checking for the logistic regression coefficients
| Test quantity (regression coefficient) | Completed-data mean (initial modela) | Replicated-data mean (initial modela) | PPP (initial modela) | Completed-data mean (updated modelb) | Replicated-data mean (updated modelb) | PPP (updated modelb) |
|---|---|---|---|---|---|---|
| Harsh discipline | 0.31 | 0.26 | 0.026 | 0.31 | 0.33 | 0.71 |
| Sex | 0.39 | 0.38 | 0.45 | 0.39 | 0.37 | 0.44 |
| Socioeconomic position | −0.31 | −0.3 | 0.63 | −0.34 | −0.34 | 0.53 |
| Financial hardship | 0.08 | 0.1 | 0.63 | 0.09 | 0.11 | 0.63 |
| Psychological distress | 0.04 | 0.06 | 0.94 | 0.04 | 0.05 | 0.62 |
Posterior predictive p values (PPP) are shown along with the means of the test quantities (regression coefficients) estimated in the completed datasets and in the replicated datasets. Results are based on 2000 replications
aThe initial imputation model included the outcome variable as a continuous variable
bThe updated imputation model included the binary version of the outcome variable that was also used in the analysis
Fig. 6 Posterior predictive checks of the coefficient for harsh discipline from the logistic regression analysis model. Estimates of the regression coefficient for harsh discipline from the replicated data are plotted against the estimates from the completed data (based on 2000 replications). The proportion of markers above the y = x line represents the posterior predictive p value (PPP = 0.026)
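Given paired estimates of a test quantity from the completed and replicated datasets, the posterior predictive p value itself is a simple proportion. A minimal sketch (the function name is hypothetical):

```python
def posterior_predictive_p(theta_completed, theta_replicated):
    """Posterior predictive p value: the proportion of replications in
    which the replicated-data estimate is at least as large as the
    completed-data estimate (markers above the y = x line in a plot
    like Fig. 6). Values near 0 or 1 signal poor fit."""
    pairs = list(zip(theta_completed, theta_replicated))
    return sum(rep >= com for com, rep in pairs) / len(pairs)
```

In the table above, the extreme PPP of 0.026 for harsh discipline under the initial imputation model flags the misspecification that the updated model (PPP = 0.71) resolves.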
Overview of approaches to model checking in multiple imputation
- Explore imputed values using descriptive statistics and graphical displays
- Use subject-matter knowledge to judge the plausibility of imputed values, but remember that imputed values do not necessarily have to resemble observed data: the goal of MI is not to predict the missing values but to produce valid inference in the presence of missing data
- Compare the imputed data with the observed data to assess plausibility and identify major problems with the imputation model
- Comparisons can be made using summary statistics and graphical methods
- Discrepancies between observed and imputed data do not necessarily signal a problem under MAR, but should be judged for their plausibility under likely missingness processes
- Consider the target analysis when making judgements about model adequacy. If interest lies in characteristics of the marginal distributions (e.g. percentiles), it may be important that these features are preserved in the imputed data; this is less critical when primary interest lies in relationships between variables
- Posterior predictive checking can be used to check the adequacy of imputation models with respect to quantities of substantive interest. Model fit can be explored using graphical or numerical summaries (e.g. posterior predictive p values), but there are no hard-and-fast rules for determining adequacy of model specification
- Use a range of diagnostics to check imputation models: descriptive statistics can check the quality of the imputed values themselves, while methods such as posterior predictive checking can assess the imputation model with respect to the target analysis