Jeroen Hoogland, Marit van Barreveld, Thomas P A Debray, Johannes B Reitsma, Tom E Verstraelen, Marcel G W Dijkgraaf, Aeilko H Zwinderman.
Abstract
Missing data present challenges for the development and real-world application of clinical prediction models. While these challenges have received considerable attention in the development setting, there is only sparse research on the handling of missing data in applied settings. The main unique feature of handling missing data in these settings is that missing data methods have to be performed for a single new individual, precluding direct application of mainstay methods used during model development. Correspondingly, we propose that it is desirable to perform model validation using missing data methods that transfer to practice in single new patients. This article compares existing and new methods to account for missing data for a new individual in the context of prediction. These methods rely on (i) submodels based on observed data only, (ii) marginalization over the missing variables, or (iii) imputation based on fully conditional specification (also known as chained equations). They were compared in an internal validation setting to highlight the use of missing data methods that transfer to practice while validating a model. As a reference, they were compared to the use of multiple imputation by chained equations in a set of test patients, because this has been used in validation studies in the past. The methods were evaluated in a simulation study where performance was measured by means of the optimism-corrected C-statistic and the mean squared prediction error. Furthermore, they were applied to data from a large Dutch cohort of prophylactic implantable cardioverter defibrillator patients.
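Option (ii) in the abstract, marginalization over the missing variables, can be illustrated for a single new patient with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the article: a logistic prediction model with two covariates, hypothetical coefficients, and a bivariate normal distribution for the covariates. The prediction for a patient with `x2` missing is the average of the complete-data prediction over draws from the distribution of `x2` given the observed `x1`.

```python
import math
import random


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical model coefficients (illustration only, not from the article).
INTERCEPT, B1, B2 = -1.0, 0.8, 1.2
# Hypothetical covariate distribution: bivariate normal, as if estimated
# in the development data.
MU1, MU2, SD1, SD2, RHO = 0.0, 0.0, 1.0, 1.0, 0.5


def predict_complete(x1, x2):
    """Predicted risk when both covariates are observed."""
    return expit(INTERCEPT + B1 * x1 + B2 * x2)


def conditional_x2(x1):
    """Mean and SD of x2 given x1 under the bivariate normal assumption."""
    mean = MU2 + RHO * SD2 / SD1 * (x1 - MU1)
    sd = SD2 * math.sqrt(1.0 - RHO ** 2)
    return mean, sd


def predict_marginalized(x1, n_draws=20000, seed=1):
    """Average the predicted risk over draws of the missing x2 given x1."""
    rng = random.Random(seed)
    mean, sd = conditional_x2(x1)
    total = 0.0
    for _ in range(n_draws):
        total += predict_complete(x1, rng.gauss(mean, sd))
    return total / n_draws


def predict_mean_imputed(x1):
    """Plug in the conditional mean of x2 (single imputation), for contrast."""
    mean, _ = conditional_x2(x1)
    return predict_complete(x1, mean)


p_marg = predict_marginalized(1.0)
p_imp = predict_mean_imputed(1.0)
print(f"marginalized: {p_marg:.3f}, mean-imputed: {p_imp:.3f}")
```

Because the logistic function is nonlinear, averaging predictions over the conditional distribution generally gives a different risk than plugging in the conditional mean. Both functions need only the new patient's observed values and quantities stored from model development.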
Keywords: clinical prediction modeling; missing data; real-world application; validation
Year: 2020 PMID: 32687233 PMCID: PMC7586995 DOI: 10.1002/sim.8682
Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.373
FIGURE 1 The flow of both the simulation study and the applied example is shown. Parts relating only to the simulation study are shown with dashed lines. The applied example included 100 bootstrap sample evaluations. *Note that within each simulation iteration these are the same cases as the out‐of‐bag sample with missing data, but with fully observed information.
Missingness models to create missing values in the simulated data
| Scenario | Missing Data (%) | Which Covariates | Missingness Generating Mechanism | Missingness Model |
|---|---|---|---|---|
| 1 MCAR | 5 | all | | |
| 2 MCAR | 20 | all | | |
| 3 MCAR* | 20‐50 | x1 | | |
| 4 MCAR | 50 | all x | | |
| 5 MAR | 5 | all x | | |
| 6 MAR | 20 | all x | | |
| 7 MAR* | 20‐50 | x1 | | log( |
| 8 MAR | 50 | all x | | |
Abbreviations: MAR, missing at random; MCAR, missing completely at random.
Notes: The missingness indicator denotes that the value of a given covariate in a given person is missing; the accompanying vector is that person's covariate vector excluding the covariate in question. *) Scenarios 3 and 7 start from scenarios 2 and 6, respectively, as implemented for all variables but x1, and then add the process for scenarios 4 and 8, respectively, to create missing data in x1.
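The distinction between the MCAR and MAR scenarios in the table can be sketched in code. This is an illustration under stated assumptions, not the article's missingness models: the function names, the logistic form of the MAR model, and the parameter values (`intercept=-1.5`, `slope=1.0`) are all hypothetical. Under MCAR every value is deleted with the same fixed probability; under MAR the probability that `x1` is missing depends on an observed covariate (`x2` here).

```python
import math
import random


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def make_mcar(rows, p_missing, rng):
    """Set each value to None with the same probability (MCAR)."""
    return [[None if rng.random() < p_missing else v for v in row]
            for row in rows]


def make_mar_x1(rows, intercept, slope, rng):
    """Set x1 (column 0) to None with a probability that depends on the
    always-observed x2 (column 1). Hypothetical logistic missingness model."""
    out = []
    for row in rows:
        p = expit(intercept + slope * row[1])
        x1 = None if rng.random() < p else row[0]
        out.append([x1] + list(row[1:]))
    return out


def fraction_missing(rows, column=None):
    """Share of missing cells, overall or within one column."""
    cells = [v for row in rows for k, v in enumerate(row)
             if column is None or k == column]
    return sum(v is None for v in cells) / len(cells)


rng = random.Random(2020)
# Simulated complete data: two independent standard normal covariates.
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20000)]

mcar = make_mcar(data, 0.20, rng)
mar = make_mar_x1(data, intercept=-1.5, slope=1.0, rng=rng)

print("MCAR overall missing fraction:", round(fraction_missing(mcar), 3))
print("MAR missing fraction in x1:", round(fraction_missing(mar, 0), 3))
```

In the MAR output, `x1` is missing more often for cases with a high `x2` than for cases with a low `x2`, which is the defining feature of that mechanism; in the MCAR output the missing fraction does not depend on any covariate value.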
FIGURE 2 Boxplots for the difference between the estimated out‐of‐bag C‐statistic and reference C‐statistic (as derived under complete out‐of‐bag data) are shown per missing data method and missing data setting. Each simulation iteration renders an observation.
FIGURE 3 Boxplots for the average root mean squared prediction error (rMSPE) per missing data method and missing data setting. Each simulation iteration renders an observation.
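The two performance measures named in the figure captions can be computed as in the following minimal sketch. The function names are hypothetical, and which quantity serves as the reference in `rmspe` is left to the caller, since the record above does not spell this out. The C-statistic is computed as the proportion of concordant event/non-event pairs, with ties counted as one half.

```python
import math


def c_statistic(predictions, outcomes):
    """Concordance between predicted risks and binary outcomes (0/1).

    Every pair of one event and one non-event is compared; a pair scores 1
    if the event has the higher prediction and 0.5 if the two are tied.
    """
    events = [p for p, y in zip(predictions, outcomes) if y == 1]
    non_events = [p for p, y in zip(predictions, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for pe in events:
        for pn in non_events:
            if pe > pn:
                score += 1.0
            elif pe == pn:
                score += 0.5
    return score / (len(events) * len(non_events))


def rmspe(predictions, reference):
    """Root mean squared difference between predictions and reference values."""
    if len(predictions) != len(reference):
        raise ValueError("inputs must have equal length")
    squared = [(p - r) ** 2 for p, r in zip(predictions, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Toy out-of-bag sample (made-up numbers, for illustration only).
pred_missing = [0.10, 0.40, 0.35, 0.80]   # predictions under a missing data method
pred_complete = [0.12, 0.38, 0.45, 0.78]  # predictions with fully observed data
y = [0, 0, 1, 1]

# Difference in C-statistic relative to the complete-data reference.
c_diff = c_statistic(pred_missing, y) - c_statistic(pred_complete, y)
print("C-statistic difference:", c_diff)
print("rMSPE:", rmspe(pred_missing, pred_complete))
```

The pairwise loop is quadratic in the sample size, which is fine for a sketch; a rank-based formula would be preferable for large samples.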
Prediction performance statistics for the applied example
| Method | Mean (OOB) C (SD) in the test sets |
|---|---|
| 2 | 0.747 (0.034) |
| One‐step‐sweep submodel | 0.736 (0.041) |
| Marginalization over missing | 0.747 (0.034) |
| Marginalization over missing | 0.747 (0.034) |
| Stacked multiple imputation | 0.747 (0.034) |
| Stacked multiple imputation with | 0.764 (0.033) |
| Fixed chained equations | 0.748 (0.033) |
| Independent multiple imputation | 0.746 (0.034) |
| Independent multiple imputation with | 0.756 (0.034) |
Mean over 100 out‐of‐bag (OOB) samples.