Richard D Riley, Kym IE Snell, Joie Ensor, Danielle L Burke, Frank E Harrell, Karel GM Moons, Gary S Collins.
Abstract
When designing a study to develop a new prediction model with binary or time-to-event outcomes, researchers should ensure their sample size is adequate in terms of the number of participants (n) and outcome events (E) relative to the number of predictor parameters (p) considered for inclusion. We propose that the minimum values of n and E (and subsequently the minimum number of events per predictor parameter, EPP) should be calculated to meet the following three criteria: (i) small optimism in predictor effect estimates, as defined by a global shrinkage factor of ≥0.9, (ii) a small absolute difference of ≤0.05 between the model's apparent and adjusted Nagelkerke's R2, and (iii) precise estimation of the overall risk in the population. Criteria (i) and (ii) aim to reduce overfitting conditional on a chosen p, and require prespecification of the model's anticipated Cox-Snell R2, which we show can be obtained from previous studies. The values of n and E that meet all three criteria provide the minimum sample size required for model development. Upon application of our approach, a new diagnostic model for Chagas disease requires an EPP of at least 4.8 and a new prognostic model for recurrent venous thromboembolism requires an EPP of at least 23. This reinforces why rules of thumb (eg, 10 EPP) should be avoided. Researchers might additionally ensure the sample size gives precise estimates of key predictor effects; this is especially important when key categorical predictors have few events in some categories, as this may substantially increase the numbers required.
Keywords: binary and time-to-event outcomes; logistic and Cox regression; multivariable prediction model; pseudo R-squared; sample size; shrinkage
Year: 2018 PMID: 30357870 PMCID: PMC6519266 DOI: 10.1002/sim.7992
Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.373
Example of global shrinkage applied to a prognostic model for 1‐year mortality risk in patients with diabetes starting dialysis [29]
| Predictor | Developed (unpenalised) model | Final (penalised) model adjusted for overfitting |
|---|---|---|
| Intercept | 1.962 | 1.427 |
| Age (years) | 0.047 | 0.042 |
| Smoking | 0.631 | 0.570 |
| Macrovascular complications | 1.195 | 1.078 |
| Duration of diabetes mellitus (years) | 0.026 | 0.023 |
| Karnofsky scale | −0.043 | −0.039 |
| Haemoglobin level (g/dl) | −0.186 | −0.168 |
| Albumin level (g/l) | −0.060 | −0.054 |
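
As a rough check on the table above, the ratio of each final coefficient to its developed counterpart shows how uniform the shrinkage is. The sketch below is illustrative only: the values are transcribed from the table (already rounded to three decimals), and the dictionary and function names are hypothetical.

```python
# (developed, final) coefficient pairs transcribed from the table above.
# Rounding in the published values means the ratios are only approximate.
coefficients = {
    "Age (years)": (0.047, 0.042),
    "Smoking": (0.631, 0.570),
    "Macrovascular complications": (1.195, 1.078),
    "Duration of diabetes mellitus (years)": (0.026, 0.023),
    "Karnofsky scale": (-0.043, -0.039),
    "Haemoglobin level (g/dl)": (-0.186, -0.168),
    "Albumin level (g/l)": (-0.060, -0.054),
}


def implied_shrinkage(developed: float, final: float) -> float:
    """Return the multiplicative factor that maps developed to final."""
    return final / developed


# One implied shrinkage factor per predictor.
ratios = {
    name: implied_shrinkage(developed, final)
    for name, (developed, final) in coefficients.items()
}

for name, ratio in ratios.items():
    print(f"{name:40s} {ratio:.3f}")
```

With these rounded values, the ratios for the seven predictors all fall close to 0.9, in line with a single global shrinkage factor. The pair 1.962 and 1.427 is left out: its ratio is about 0.73, which would be consistent with an intercept that is re-estimated rather than scaled, although that reading is an assumption.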
Predicted values of the D statistic and R²_D from Equation (23) for selected values of the C statistic (values taken from table 1 in the work of Jinks et al [41])
| C | D | R²_D | C | D | R²_D |
|---|---|---|---|---|---|
| 0.50 | 0 | 0 | 0.72 | 1.319 | 0.294 |
| 0.52 | 0.11 | 0.003 | 0.74 | 1.462 | 0.338 |
| 0.54 | 0.221 | 0.011 | 0.76 | 1.61 | 0.382 |
| 0.56 | 0.332 | 0.026 | 0.78 | 1.765 | 0.427 |
| 0.58 | 0.445 | 0.045 | 0.80 | 1.927 | 0.470 |
| 0.60 | 0.560 | 0.070 | 0.82 | 2.096 | 0.512 |
| 0.62 | 0.678 | 0.099 | 0.84 | 2.273 | 0.552 |
| 0.64 | 0.798 | 0.132 | 0.86 | 2.459 | 0.591 |
| 0.66 | 0.922 | 0.169 | 0.88 | 2.652 | 0.627 |
| 0.68 | 1.05 | 0.208 | 0.90 | 2.857 | 0.661 |
| 0.70 | 1.182 | 0.25 | 0.92 | 3.070 | 0.692 |
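
The tabulated values can be reproduced, to the precision shown, with two closed-form relationships. Both are assumptions here, since neither formula appears in this excerpt: a cubic approximation of D from C, D = 5.50(C − 0.5) + 10.26(C − 0.5)³, and the explained-variation measure R²_D = (π/8)D² / (π²/6 + (π/8)D²). The function names are hypothetical, and the sketch does not claim which of the two is the paper's Equation (23).

```python
import math


def d_from_c(c_statistic: float) -> float:
    """Approximate the D statistic from the C statistic (assumed cubic form)."""
    centred = c_statistic - 0.5
    return 5.50 * centred + 10.26 * centred ** 3


def r2_d_from_d(d_statistic: float) -> float:
    """Convert a D statistic to R^2_D (assumed form of the relationship)."""
    scaled = (math.pi / 8.0) * d_statistic ** 2
    return scaled / (math.pi ** 2 / 6.0 + scaled)


# Rebuild the rows of the table for C = 0.50, 0.52, ..., 0.92.
for step in range(22):
    c = 0.50 + 0.02 * step
    d = d_from_c(c)
    print(f"C = {c:.2f}  D = {d:.3f}  R2_D = {r2_d_from_d(d):.3f}")
```

For instance, C = 0.72 gives D of about 1.319 and R²_D of about 0.294, matching the corresponding table row.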
Figure 1. Summary of the steps involved in calculating the minimum sample size required for developing a multivariable prediction model for binary or time‐to‐event outcomes
Figure 2. Events per predictor parameter required to achieve various expected shrinkage values for a new prediction model of venous thromboembolism recurrence risk with an assumed Cox-Snell R2 of 0.051
Figure 3. Sample size required (based on Equation (11)) for a particular number of predictor parameters (p) to achieve a particular value of expected shrinkage, for a new prediction model of venous thromboembolism recurrence risk with an assumed Cox-Snell R2 of 0.051
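
The relationship plotted in Figures 2 and 3 can be sketched numerically. Equation (11) is not reproduced in this excerpt, so the sketch assumes the form n = p / ((S − 1) ln(1 − R²_CS / S)), where S is the target global shrinkage factor and R²_CS is the anticipated Cox-Snell R2 (taken here to be the assumed value of 0.051). The function name and the choice of p = 25 are hypothetical and purely illustrative.

```python
import math


def min_n_for_shrinkage(p: int, r2_cs: float, shrinkage: float = 0.9) -> int:
    """Smallest n giving the target expected shrinkage (assumed formula)."""
    if not 0 < shrinkage < 1:
        raise ValueError("shrinkage must lie strictly between 0 and 1")
    if not 0 < r2_cs < shrinkage:
        raise ValueError("r2_cs must lie strictly between 0 and shrinkage")
    n = p / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage))
    return math.ceil(n)  # round up to a whole participant


R2_CS = 0.051        # assumed value quoted in the figure captions
P_HYPOTHETICAL = 25  # illustrative number of predictor parameters

# Required sample size grows quickly as the target shrinkage approaches 1.
for target in (0.80, 0.85, 0.90, 0.95):
    n_required = min_n_for_shrinkage(P_HYPOTHETICAL, R2_CS, target)
    print(f"S = {target:.2f}: n >= {n_required}")
```

Under these assumptions, a target shrinkage of 0.9 with 25 parameters gives a minimum of 4286 participants. Because n is proportional to p for fixed S and R²_CS, the required number of participants per parameter does not depend on p. Converting n to events per predictor parameter would also need the anticipated event rate, which is not given in this excerpt.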