Le Thi Phuong Thao, Ronald Geskus.
Abstract
Many approaches to variable selection with multiply imputed (MI) data in the development of a prognostic model have been proposed, but no method prevails as uniformly best. We conducted a simulation study with a binary outcome and a logistic regression model to compare two classes of variable selection methods in the presence of MI data: (I) model selection on bootstrap data, using backward elimination based on AIC or lasso, with the final model fitted on the most frequently selected variables (e.g. ≥ 50%) over all MI and bootstrap data sets; (II) model selection on the original MI data, using lasso. In class II, the final model is obtained by (i) averaging estimates of variables that were selected in any MI data set, (ii) averaging estimates of variables that were selected in 50% of the MI data sets, (iii) performing lasso on the stacked MI data, or (iv) as in (iii) but using individual weights determined by the fraction of missingness. In all lasso models, we used both the optimal penalty and the 1-se rule. We considered recalibrating models to correct for overshrinkage due to the suboptimal penalty by refitting the linear predictor or all individual variables. We applied the methods to a real dataset of 951 adult patients with tuberculous meningitis to predict mortality within nine months. Overall, lasso selection with the 1-se penalty shows the best performance in both approach I and approach II. Stacking MI data is an attractive approach because it does not require choosing a selection threshold when combining results from separate MI data sets.
Keywords: lasso; multiply imputed data; prediction; stacked data; variable selection
Year: 2018 PMID: 30353591 PMCID: PMC6492211 DOI: 10.1002/bimj.201700232
Source DB: PubMed Journal: Biom J ISSN: 0323-3847 Impact factor: 2.207
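The abstract describes combining lasso selections from separate MI data sets either by keeping variables selected in any data set or by applying a selection threshold such as 50%. A minimal sketch of that combination step follows, assuming each imputed data set has already produced a list of selected variable names. The function names and the example selections are hypothetical and only illustrate the thresholding logic, not the authors' implementation.

```python
from collections import Counter


def selection_frequency(selections):
    """Fraction of imputed data sets in which each variable was selected."""
    counts = Counter(v for sel in selections for v in set(sel))
    n_sets = len(selections)
    return {v: c / n_sets for v, c in counts.items()}


def combine_selections(selections, threshold):
    """Keep variables whose selection frequency is at least `threshold`."""
    freq = selection_frequency(selections)
    return sorted(v for v, f in freq.items() if f >= threshold)


# Hypothetical selections from four imputed data sets.
selections = [
    ["age", "weight", "sex"],
    ["age", "weight"],
    ["age"],
    ["age"],
]

# "Selected in any MI data set": frequency of at least 1 / (number of data sets).
selected_any = combine_selections(selections, threshold=1 / len(selections))

# "Selected in 50% of the MI data": frequency of at least 0.5.
selected_half = combine_selections(selections, threshold=0.5)
```

The choice of threshold changes the final variable set here, which is the tuning decision that stacking avoids.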
Summary of considered model selection methods
| Method | Description |
|---|---|
| FULL | |
| TrueC | Model with all |
| Model selection on bootstrap data | |
| BBeF | |
| BLaF | |
| Lasso on original data | |
| SepAv | Lasso selection on each original MI dataset |
| SepAvF | Lasso selection on each original MI dataset |
| Stack | Lasso selection on the |
| StackW | Lasso selection on the |
We add the letter “o” after the method abbreviation for models obtained with the optimal λ.
We additionally consider recalibration by score and by selected variables.
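The distinction between the optimal λ and the 1-se rule can be illustrated with a small sketch. It assumes the usual convention: the optimal λ minimizes the mean cross-validated error, and the 1-se λ is the largest penalty whose mean error lies within one standard error of that minimum. The function name and the cross-validation numbers are hypothetical.

```python
def choose_lambda(lambdas, cv_mean, cv_se):
    """Return (optimal lambda, 1-se lambda) from cross-validation summaries.

    lambdas : candidate penalty values
    cv_mean : mean cross-validated error for each penalty
    cv_se   : standard error of that mean for each penalty
    """
    # Index of the penalty with the smallest mean CV error.
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    # Largest penalty whose error is within one SE of the minimum.
    cutoff = cv_mean[i_min] + cv_se[i_min]
    lam_1se = max(lam for lam, err in zip(lambdas, cv_mean) if err <= cutoff)
    return lambdas[i_min], lam_1se


# Hypothetical cross-validation results.
lambdas = [0.001, 0.01, 0.05, 0.1, 0.5]
cv_mean = [0.55, 0.50, 0.51, 0.53, 0.70]
cv_se = [0.02, 0.02, 0.02, 0.02, 0.03]

lam_opt, lam_1se = choose_lambda(lambdas, cv_mean, cv_se)
```

Because the 1-se penalty is never smaller than the optimal one, it shrinks coefficients more and tends to select fewer variables, which is why recalibration for overshrinkage is considered.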
Simulation scenarios
| Parameter | First data generating mechanism | Second data generating mechanism |
|---|---|---|
| Number of covariates | 15 | 25 |
| Proportion of missing values per variable | 0.1, 0.2, 0.3, 0.4, 0.5 | 0.1, 0.2, 0.3, 0.4, 0.5 |
| The corresponding: | | |
| ‐ Percentage (%) of missing values in the data | 4, 8, 12, 16, 20 | 4, 8, 12, 16, 20 |
| ‐ Percentage (%) of complete cases | 65, 44, 30, 20, 12 | 65, 47, 34, 24, 16 |
| Sample size n (events per variable, EPV) | 200 (5), 400 (10), 600 (15) | 200 (3), 400 (7), 600 (10) |
| Number of imputed datasets | 10, 20, 30 | 10, 20, 30 |
| Number of true/noise covariates | 7/8 | 7/18 |
Average value
Figure 1. First data generating mechanism (Section 3.1.a): Mean Brier score (top figure) and AUC (bottom figure) over 500 generated datasets for the two most extreme cases of missing values (4% and 20%) with 10 imputed datasets
Figure 2. First data generating mechanism (Section 3.1.a): Number of selected variables for the two most extreme cases of missing values (4% and 20%) with 10 imputed datasets
Figure 3. Second data generating mechanism (Section 3.1.b): Mean Brier score (top figure) and AUC (bottom figure) over 500 generated datasets for the two most extreme cases of missing values (4% and 20%) with 10 imputed datasets. The performance measures of BBeF in data with an EPV of three are not available due to convergence problems in many generated datasets
Figure 4. TBM data: Corrected AUC and Brier score via internal validation
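The figures report the Brier score and the AUC as performance measures. As a reminder of what these quantify, the sketch below computes both for a binary outcome from predicted probabilities, using the standard definitions (mean squared difference between prediction and outcome, and the proportion of event/non-event pairs ranked correctly, with ties counted as one half). The data are hypothetical, and the optimism correction used for Figure 4 is not shown.

```python
def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def auc(y, p):
    """Probability that a random event is ranked above a random non-event."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    non_events = [pi for yi, pi in zip(y, p) if yi == 0]
    # Each correctly ordered pair counts 1, each tie counts 0.5.
    score = sum((a > b) + 0.5 * (a == b) for a in events for b in non_events)
    return score / (len(events) * len(non_events))


# Hypothetical outcomes (1 = event) and predicted probabilities.
y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.4, 0.6]

brier = brier_score(y, p)
discrimination = auc(y, p)
```

A lower Brier score and a higher AUC indicate better predictive performance.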
Variable selection in the TBM dataset
| Variable | BBeF | BLaF | SepAv | SepAvF | Stack | StackW | BLaFo | SepAvo | SepAvFo | Stacko | StackWo | FULL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | x | x | x | x | x | x | x | x | x | x | x | x |
| MRC grade | x | x | x | x | x | x | x | x | x | x | x | x |
| CSF lymphocyte count | x | x | x | x | x | x | x | x | x | x | x | x |
| Presence of focal neurological signs | x | x | x | x | x | x | x | x | x | x | x | x |
| Cohort | x | x | x | x | x | x | x | x | x | x | x | x |
| Received dexamethasone treatment | x | x | x | x | x | x | x | x | x | x | x | |
| Previous TB | x | x | x | x | x | x | x | x | x | x | x | |
| Weight | x | x | x | x | x | x | x | x | x | x | x | |
| Ratio of CSF to blood glucose | x | x | x | x | x | x | ||||||
| Illness duration at entry | x | x | x | x | x | x | ||||||
| CSF protein | x | x | x | x | x | |||||||
| Occurrence of seizures | x | x | x | |||||||||
| Plasma sodium | x | x | x | |||||||||
| Miliary tuberculosis present on chest radiograph | x | x | ||||||||||
| Sex | x | x | ||||||||||
| CSF glucose | x | |||||||||||
| Body temperature | x | |||||||||||
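The Stack and StackW methods apply lasso to the stacked MI data, with StackW using individual weights determined by the fraction of missingness. The sketch below shows only the stacking and weighting step. The specific weight formula, (1 − f_i) / M for a subject with fraction f_i of missing covariates and M imputed data sets, is an assumption made for illustration and is not stated in the text above; the function name and data are likewise hypothetical.

```python
def stack_with_weights(imputed, missing_fraction):
    """Stack M imputed data sets and attach one weight per stacked row.

    imputed          : list of M data sets, each a list of rows, with the
                       same subjects in the same order in every data set
    missing_fraction : fraction of missing covariates f_i for each subject

    Assumed weight (illustrative, not confirmed by the source):
    w_i = (1 - f_i) / M.
    """
    m = len(imputed)
    stacked, weights = [], []
    for dataset in imputed:
        for row, f in zip(dataset, missing_fraction):
            stacked.append(row)
            weights.append((1.0 - f) / m)
    return stacked, weights


# Hypothetical example: 3 subjects, 2 covariates, M = 2 imputations.
imputed = [
    [[50, 1.2], [61, 0.8], [47, 1.0]],
    [[50, 1.2], [61, 0.9], [47, 1.1]],
]
missing_fraction = [0.0, 0.5, 0.25]

stacked, weights = stack_with_weights(imputed, missing_fraction)
```

The stacked rows and weights would then be passed to a single weighted lasso fit, so one set of selected variables is obtained without a selection threshold.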