Angela M Wood, Patrick Royston, Ian R White.
Abstract
Multiple imputation can be used as a tool in the process of constructing prediction models in medical and epidemiological studies with missing covariate values. Such models can be used to make predictions for model performance assessment, but the task is made more complicated by the multiple imputation structure. We summarize various predictions constructed from covariates, including multiply imputed covariates, and either the set of imputation-specific prediction model coefficients or the pooled prediction model coefficients. We further describe approaches for using the predictions to assess model performance. We distinguish between ideal model performance and pragmatic model performance, where the former refers to the model's performance in an ideal clinical setting where all individuals have fully observed predictors and the latter refers to the model's performance in a real-world clinical setting where some individuals have missing predictors. The approaches are compared through an extensive simulation study based on the UK700 trial. We determine that measures of ideal model performance can be estimated within imputed datasets and subsequently pooled to give an overall measure of model performance. Alternative methods to evaluate pragmatic model performance are required, and we propose constructing predictions either from a second set of covariate imputations which make no use of observed outcomes, or from a set of partial prediction models constructed for each potential pattern of observed covariates. Pragmatic model performance is generally lower than ideal model performance. We focus on model performance within the derivation data, but describe how to extend all the methods to a validation dataset.
Keywords: Measures of model performance; Missing data; Model validation; Multiple imputation; Prediction models; Rubin's rules
Year: 2015 PMID: 25630926 PMCID: PMC4515100 DOI: 10.1002/bimj.201400004
Source DB: PubMed Journal: Biom J ISSN: 0323-3847 Impact factor: 2.207
A summary of possible imputation-specific and pooled predictions from a model fitted to multiply imputed data with imputation-specific regression coefficients. For a linear model, where h(.) is the identity function, P2 = P3, P5 = P6, and P8 = P9. Alternative predictions use the pooled regression coefficients in place of the imputation-specific regression coefficients.
| Covariate values | Imputation-specific predictions | Pooled predictions |
|---|---|---|
| Individuals with fully observed covariates | (P1) | (P2) |
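| Individuals with any imputed covariates, imputed using the imputation models used in deriving the prediction model | (P4) | (P5) |
| Individuals with any imputed covariates, imputed using a second set of imputation models that exclude the outcome | (P7) | (P8) |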
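
As a rough illustration of the table above, the sketch below computes imputation-specific predictions and two pooled variants for one group of individuals. It assumes that one pooled variant averages h(.) of the linear predictors over imputations and the other applies h(.) to the averaged linear predictor, which is consistent with the two coinciding when h(.) is the identity. The names `X_imp`, `betas`, and `summarise_predictions` are hypothetical and are not taken from the paper, and a logistic inverse link is assumed as the default h(.).

```python
import numpy as np

def expit(z):
    """Inverse logit, used here as an assumed default for h(.)."""
    return 1.0 / (1.0 + np.exp(-z))

def summarise_predictions(X_imp, betas, h=expit):
    """Return imputation-specific and pooled predictions.

    X_imp : array of shape (M, n, p), one completed covariate matrix per imputation
    betas : array of shape (M, p), imputation-specific regression coefficients
    h     : inverse link function applied to the linear predictor
    """
    # Linear predictor for every imputation m and individual i.
    linear_predictor = np.einsum("mnp,mp->mn", X_imp, betas)

    # One prediction per imputation and individual.
    imputation_specific = h(linear_predictor)

    # Pool on the prediction scale: average h(.) over imputations.
    pooled_on_prediction_scale = imputation_specific.mean(axis=0)

    # Pool on the linear-predictor scale: average first, then apply h(.).
    pooled_on_linear_scale = h(linear_predictor.mean(axis=0))

    return imputation_specific, pooled_on_prediction_scale, pooled_on_linear_scale
```

Passing `h=lambda z: z` gives the linear-model case, in which the two pooled variants are numerically identical.
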
Figure 1. Illustration to show how measures of model performance can be estimated from multiply imputed data.
Summary of variables in the UK700 trial
| Variable | Code | Number of individuals with observed values | Mean (SD) or n (%) |
|---|---|---|---|
| Outcome variables recorded at two years | | | |
| Comprehensive psychopathological rating scale | cprs | 595 | 2.65 (0.87) |
| (Dis)satisfaction with case management | sat | 490 | 16.90 (4.78) |
| Baseline variables used as covariates in simulated prediction model | | | |
| Comprehensive psychopathological rating scale | cprs0 | 705 | 2.73 (0.82) |
| Centre | centre | 708 | |
| St George's | | | 196 (28%) |
| Manchester | | | 158 (22%) |
| St Mary's | | | 201 (28%) |
| King's | | | 153 (22%) |
| Total disability score | distot | 659 | −0.07 (0.81) |
| Time from onset of psychosis to study entry (months) | onset | 705 | 4.62 (0.98) |
| Age (years) | age | 708 | 38.29 (11.64) |
| Sex | sex | 708 | |
| female | | | 304 (43%) |
| male | | | 404 (57%) |
| Outpatient status at recruitment | status | 707 | |
| in hospital | | | 289 (41%) |
| outpatient | | | 418 (59%) |
| Other baseline variables | | | |
| Missing father's occupation at birth | occgp | 708 | |
| observed | | | 576 (81%) |
| missing | | | 132 (19%) |
| (Dis)satisfaction with case management | sat0 | 410 | 18.86 (4.83) |
Mean (SD) of the log-transformed values.
Mean squared prediction errors (×100) (Monte Carlo errors) from simulated linear model. Predictions from multiply imputed data are evaluated using imputation-specific regression coefficients (M = M2 = 5)
| Missing data pattern | | Monotone MAR | Monotone MAR | Monotone MAR | Monotone MAR | Monotone MCAR | Independent MCAR |
|---|---|---|---|---|---|---|---|
| % missing cprs0 | | 30% | 30% | 60% | 60% | 30% | 30% |
| Coefficient for log(cprs0+1) | | 0.34 | 0.68 | 0.34 | 0.68 | 0.68 | 0.68 |
| Model evaluated | Method of evaluation | | | | | | |
| | | 59.8 (0.1) | 59.8 (0.1) | 59.9 (0.1) | 59.8 (0.1) | 59.7 (0.1) | 59.8 (0.1) |
| | | 57.3 (0.1) | 57.8 (0.1) | 55.3 (0.2) | 55.4 (0.2) | 59.4 (0.1) | 59.4 (0.1) |
| Model fitted to MI data | | | | | | | |
| | Pooled performance P1 | 58.8 (0.1) | 58.9 (0.1) | 59.4 (0.2) | 59.6 (0.2) | 59.7 (0.1) | 60.0 (0.1) |
| | Pooled prediction P2 = P3 | 58.7 (0.1) | 58.8 (0.1) | 59.1 (0.2) | 59.2 (0.2) | 59.6 (0.1) | 59.9 (0.1) |
| | Pooled performance P4 | 59.6 (0.1) | 59.5 (0.1) | 60.0 (0.1) | 60.7 (0.1) | 59.7 (0.1) | 60.3 (0.1) |
| | Pooled prediction P5 = P6 | 58.0 (0.1) | 55.7 (0.1) | 56.5 (0.1) | 52.1 (0.1) | 55.0 (0.1) | 56.2 (0.1) |
| Model fitted to MI data | | | | | | | |
| | Pooled performance P7 | 65.9 (0.1) | 78.3 (0.1) | 72.5 (0.1) | 99.9 (0.3) | 77.6 (0.1) | 75.3 (0.1) |
| | Pooled prediction P8 = P9 | 64.0 (0.1) | 72.2 (0.1) | 68.0 (0.1) | 85.5 (0.2) | 70.3 (0.1) | 69.4 (0.1) |
| Partial prediction models fitted to MI data | | | | | | | |
| | Pooled performance P1 | 62.4 (0.1) | 67.8 (0.1) | 64.9 (0.1) | 75.5 (0.1) | 68.4 (0.1) | 67.5 (0.1) |
| | Pooled prediction P2 = P3 | 62.4 (0.1) | 67.8 (0.1) | 64.7 (0.1) | 75.3 (0.1) | 68.3 (0.1) | 67.4 (0.1) |
For predictions P4 to P6, imputed covariates are drawn from the set of imputation models used in deriving the prediction model.
For predictions P7 to P9, imputed covariates are drawn from a second set of imputation models which exclude the outcome variable.
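
The rows labelled "pooled performance" and "pooled prediction" in the table above can be read as two different orders of operations, sketched below under that assumption. "Pooled performance" is taken to mean computing the mean squared prediction error (MSPE) within each imputed dataset and then averaging, while "pooled prediction" is taken to mean averaging each individual's predictions over imputations before computing a single MSPE. The function names are hypothetical. Because squared error is convex, the second quantity can never exceed the first when both use the same imputation-specific predictions.

```python
import numpy as np

def pooled_performance_mspe(y, preds):
    """Average of the imputation-specific MSPEs.

    y     : array of shape (n,), observed outcomes
    preds : array of shape (M, n), one row of predictions per imputation
    """
    per_imputation_mspe = np.mean((preds - y) ** 2, axis=1)
    return per_imputation_mspe.mean()

def mspe_of_pooled_prediction(y, preds):
    """MSPE of the predictions averaged over imputations."""
    pooled_prediction = preds.mean(axis=0)
    return np.mean((pooled_prediction - y) ** 2)

# Toy example: two imputations whose predictions err in opposite directions.
y = np.array([0.0, 0.0])
preds = np.array([[1.0, 1.0],
                  [-1.0, -1.0]])
print(pooled_performance_mspe(y, preds))    # 1.0
print(mspe_of_pooled_prediction(y, preds))  # 0.0
```

The toy example is deliberately extreme: averaging cancels the between-imputation noise entirely, which is why a measure based on pooled predictions can look more favourable than the pooled imputation-specific measure.
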
Mean squared prediction errors (×100) (Monte Carlo errors) from simulated logistic model. Predictions from multiply imputed data are evaluated using imputation-specific regression coefficients (M = M2 = 5)
| Missing data pattern | | Monotone MAR | Monotone MAR | Monotone MAR | Monotone MAR | Monotone MCAR | Independent MCAR |
|---|---|---|---|---|---|---|---|
| % missing cprs0 | | 30% | 30% | 60% | 60% | 30% | 30% |
| Prevalence of outcome | | 25% | 8% | 25% | 8% | 8% | 8% |
| Model evaluated | Method of evaluation | | | | | | |
| | | 16.6 (0.02) | 6.82 (0.02) | 16.6 (0.02) | 6.80 (0.02) | 6.82 (0.02) | 6.79 (0.02) |
| | | 14.7 (0.06) | 6.46 (0.05) | 13.5 (0.04) | 5.28 (0.04) | 6.86 (0.03) | 6.90 (0.03) |
| Model fitted to MI data | | | | | | | |
| | Pooled performance P1 | 16.2 (0.07) | 6.48 (0.06) | 14.0 (0.04) | 4.59 (0.03) | 6.82 (0.03) | 6.82 (0.03) |
| | Pooled probability P2 | 16.2 (0.07) | 6.47 (0.06) | 13.9 (0.04) | 4.54 (0.03) | 6.80 (0.03) | 6.81 (0.03) |
| | Pooled linear predictor P3 | 16.2 (0.07) | 6.46 (0.06) | 13.9 (0.04) | 4.54 (0.03) | 6.80 (0.03) | 6.81 (0.03) |
| | Pooled performance P4 | 16.6 (0.02) | 6.83 (0.02) | 16.7 (0.02) | 6.78 (0.02) | 6.83 (0.02) | 6.83 (0.02) |
| | Pooled probability P5 | 16.4 (0.03) | 6.69 (0.02) | 16.2 (0.03) | 6.55 (0.03) | 6.68 (0.02) | 6.74 (0.02) |
| | Pooled linear predictor P6 | 16.3 (0.03) | 6.65 (0.02) | 16.2 (0.03) | 6.53 (0.03) | 6.68 (0.02) | 6.73 (0.02) |
| Model fitted to MI data | | | | | | | |
| | Pooled performance P7 | 17.6 (0.02) | 7.19 (0.02) | 18.1 (0.02) | 7.54 (0.03) | 7.19 (0.02) | 6.94 (0.02) |
| | Pooled probability P8 | 17.3 (0.02) | 7.04 (0.03) | 17.5 (0.02) | 7.25 (0.02) | 7.01 (0.02) | 6.85 (0.02) |
| | Pooled linear predictor P9 | 17.3 (0.03) | 6.99 (0.03) | 17.6 (0.02) | 7.30 (0.02) | 7.01 (0.02) | 6.94 (0.02) |
| Partial prediction models fitted to MI data | | | | | | | |
| | Pooled performance P1 | 17.0 (0.02) | 6.96 (0.03) | 17.4 (0.02) | 7.21 (0.02) | 6.97 (0.02) | 6.90 (0.02) |
| | Pooled probability P2 | 17.0 (0.02) | 6.94 (0.03) | 17.4 (0.02) | 7.19 (0.02) | 6.97 (0.02) | 6.98 (0.02) |
| | Pooled linear predictor P3 | 17.0 (0.02) | 6.94 (0.03) | 17.4 (0.02) | 7.19 (0.02) | 6.97 (0.02) | 6.97 (0.02) |
For predictions P4 to P6, missing covariates are imputed from the set of imputation models used in deriving the prediction model.
For predictions P7 to P9, missing covariates are imputed from a second set of imputation models which exclude the outcome variable.
AUROC (Monte Carlo errors) from simulated logistic model. Predictions from multiply imputed data are evaluated using imputation-specific regression coefficients (M = M2 = 5)
| Missing data pattern | | Monotone MAR | Monotone MAR | Monotone MAR | Monotone MAR | Monotone MCAR | Independent MCAR |
|---|---|---|---|---|---|---|---|
| % missing cprs0 | | 30% | 30% | 60% | 60% | 30% | 30% |
| Prevalence of outcome | | 25% | 8% | 25% | 8% | 8% | 8% |
| Model evaluated | Method of evaluation | | | | | | |
| | | 0.744 (0.001) | 0.825 (0.001) | 0.744 (0.001) | 0.826 (0.001) | 0.825 (0.001) | 0.825 (0.001) |
| | | 0.789 (0.001) | 0.869 (0.001) | 0.773 (0.001) | 0.842 (0.001) | 0.827 (0.001) | 0.830 (0.001) |
| Model fitted to MI data | | | | | | | |
| | Pooled performance P1 | 0.760 (0.001) | 0.842 (0.001) | 0.753 (0.001) | 0.827 (0.001) | 0.824 (0.001) | 0.825 (0.001) |
| | Pooled probability P2 | 0.761 (0.001) | 0.844 (0.001) | 0.756 (0.001) | 0.834 (0.001) | 0.825 (0.001) | 0.826 (0.001) |
| | Pooled linear predictor P3 | 0.761 (0.001) | 0.845 (0.001) | 0.757 (0.001) | 0.835 (0.001) | 0.825 (0.001) | 0.826 (0.001) |
| | Pooled performance P4 | 0.742 (0.001) | 0.822 (0.001) | 0.739 (0.001) | 0.818 (0.001) | 0.821 (0.001) | 0.820 (0.001) |
| | Pooled probability P5 | 0.752 (0.001) | 0.829 (0.001) | 0.762 (0.001) | 0.843 (0.001) | 0.835 (0.001) | 0.828 (0.001) |
| | Pooled linear predictor P6 | 0.754 (0.001) | 0.835 (0.001) | 0.764 (0.001) | 0.848 (0.001) | 0.837 (0.001) | 0.830 (0.001) |
| Model fitted to MI data | | | | | | | |
| | Pooled performance P7 | 0.706 (0.001) | 0.794 (0.001) | 0.686 (0.001) | 0.752 (0.001) | 0.791 (0.001) | 0.800 (0.001) |
| | Pooled probability P8 | 0.714 (0.001) | 0.800 (0.001) | 0.705 (0.001) | 0.781 (0.001) | 0.805 (0.001) | 0.810 (0.001) |
| | Pooled linear predictor P9 | 0.714 (0.001) | 0.804 (0.001) | 0.704 (0.001) | 0.776 (0.001) | 0.804 (0.001) | 0.809 (0.001) |
| Partial prediction models fitted to MI data | | | | | | | |
| | Pooled performance P1 | 0.727 (0.001) | 0.808 (0.001) | 0.709 (0.001) | 0.785 (0.001) | 0.809 (0.001) | 0.805 (0.001) |
| | Pooled probability P2 | 0.728 (0.001) | 0.809 (0.001) | 0.711 (0.001) | 0.788 (0.001) | 0.810 (0.001) | 0.806 (0.001) |
| | Pooled linear predictor P3 | 0.728 (0.001) | 0.810 (0.001) | 0.711 (0.001) | 0.788 (0.001) | 0.810 (0.001) | 0.806 (0.001) |
For predictions P4 to P6, missing covariates are imputed from the set of imputation models used in deriving the prediction model.
For predictions P7 to P9, missing covariates are imputed from a second set of imputation models which exclude the outcome variable.
Calibration slopes estimated from complete-case analysis and a multiple imputation approach
| Analysis approach | Individuals with complete data | Individuals with imputed covariates | All individuals |
|---|---|---|---|
| Complete-case analysis | 1.00 (0.15) | NA | NA |
| Multiple imputation analysis using imputation-specific regression coefficients | |||
| Imputation-specific calibration slopes | |||
| Imputation 1 | 1.03 (0.17) | 0.68 (0.37) | 1.00 (0.15) |
| Imputation 2 | 1.03 (0.17) | 0.86 (0.35) | 1.00 (0.15) |
| Imputation 3 | 1.05 (0.17) | 0.80 (0.35) | 1.00 (0.15) |
| Imputation 4 | 1.14 (0.18) | 0.50 (0.34) | 1.00 (0.16) |
| Imputation 5 | 1.07 (0.16) | 0.52 (0.37) | 1.00 (0.16) |
| Pooled calibration slope | 1.07 (0.18) | 0.67 (0.40) | 1.00 (0.15) |
| Calibration slope using pooled predictions | 1.07 (0.17) | 2.73 (0.67) | 1.16 (0.16) |
| Multiple imputation analysis using pooled regression coefficients | |||
| Imputation-specific calibration slopes | |||
| Imputation 1 | 1.06 (0.17) | 0.70 (0.38) | 1.03 (0.15) |
| Imputation 2 | 1.06 (0.17) | 0.88 (0.36) | 1.03 (0.16) |
| Imputation 3 | 1.06 (0.17) | 0.82 (0.36) | 1.01 (0.15) |
| Imputation 4 | 1.06 (0.17) | 0.47 (0.32) | 0.94 (0.15) |
| Imputation 5 | 1.06 (0.17) | 0.52 (0.36) | 0.99 (0.15) |
| Pooled calibration slope | 1.02 (0.18) | 0.68 (0.41) | 1.03 (0.15) |
| Calibration slope using pooled predictions | 1.06 (0.17) | 2.66 (0.66) | 1.15 (0.16) |
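
The "pooled calibration slope" rows can be related to the imputation-specific rows with Rubin's rules, which the keywords name. The sketch below is a generic implementation, not the authors' code, and `rubins_rules` is a hypothetical helper name. It is applied to the five rounded imputation-specific slopes and standard errors reported in the first multiple imputation block for individuals with imputed covariates; it assumes the values in parentheses are standard errors.

```python
import numpy as np

def rubins_rules(estimates, std_errors):
    """Pool M estimates and their standard errors with Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    M = len(estimates)

    pooled_estimate = estimates.mean()
    within_variance = np.mean(std_errors ** 2)   # average sampling variance
    between_variance = estimates.var(ddof=1)     # variance across imputations
    total_variance = within_variance + (1 + 1 / M) * between_variance

    return pooled_estimate, np.sqrt(total_variance)

# Imputations 1 to 5, individuals with imputed covariates (first MI block).
slopes = [0.68, 0.86, 0.80, 0.50, 0.52]
ses = [0.37, 0.35, 0.35, 0.34, 0.37]

est, se = rubins_rules(slopes, ses)
print(f"{est:.2f} ({se:.2f})")  # 0.67 (0.40)
```

The result agrees with the 0.67 (0.40) reported in the corresponding pooled row, up to the rounding of the inputs.
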
A summary of multiple imputation approaches for derivation and validation datasets
| Multiple imputation approach | Detail for derivation dataset | Detail for validation dataset |
|---|---|---|
| V1. Impute simultaneously in combined dataset | Draw multiple imputations from observed distributions of covariates and outcomes in the combined derivation and validation datasets | Draw multiple imputations from observed distributions of covariates and outcomes in the combined derivation and validation datasets |
| V2. Impute based on derivation dataset | Draw multiple imputations from observed distributions of covariates and outcomes in the derivation dataset | Draw multiple imputations from observed distributions of covariates and outcomes in the derivation dataset |
| V3. Impute separately | Draw multiple imputations from observed distributions of covariates and outcomes in the derivation dataset | Draw multiple imputations from observed distributions of covariates and outcomes in the validation dataset |
| V4. Impute separately and exclude outcome from imputation models for validation dataset | Draw multiple imputations from observed distributions of covariates and outcomes in the derivation dataset | Draw multiple imputations from observed distributions of covariates in the validation dataset |
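
To make approach V4 concrete, the sketch below imputes one incomplete covariate in a validation dataset using only the other covariates, so the outcome never enters the imputation model. It is a deliberately simplified stand-in: it uses stochastic linear regression imputation with fixed fitted parameters, whereas a proper multiple imputation procedure would also draw the imputation-model parameters for each imputation. The function name `impute_without_outcome` and its arguments are hypothetical.

```python
import numpy as np

def impute_without_outcome(x_miss, Z, M=5, seed=0):
    """Draw M stochastic regression imputations of one covariate.

    x_miss : array of shape (n,), covariate with np.nan marking missing values
    Z      : array of shape (n, q), fully observed covariates; the outcome
             is intentionally NOT part of this matrix
    Returns an array of shape (M, n) of completed covariate vectors.
    """
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(x_miss)
    design = np.column_stack([np.ones(len(x_miss)), Z])

    # Fit the imputation model on individuals with the covariate observed.
    coef, *_ = np.linalg.lstsq(design[observed], x_miss[observed], rcond=None)
    residuals = x_miss[observed] - design[observed] @ coef
    dof = max(int(observed.sum()) - design.shape[1], 1)
    sigma = np.sqrt(residuals @ residuals / dof)

    # Each imputation adds fresh residual noise to the fitted values.
    completed = np.tile(x_miss, (M, 1))
    n_missing = int((~observed).sum())
    for m in range(M):
        noise = rng.normal(0.0, sigma, n_missing)
        completed[m, ~observed] = design[~observed] @ coef + noise
    return completed
```

Each row of the returned array can then be combined with the fitted prediction model to give one set of predictions per imputation, which are evaluated against the observed validation outcomes.
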