
Sample size for binary logistic prediction models: Beyond events per variable criteria.

Maarten van Smeden1, Karel GM Moons1, Joris AH de Groot1, Gary S Collins2, Douglas G Altman2, Marinus JC Eijkemans1, Johannes B Reitsma1.

Abstract

Binary logistic regression is one of the most frequently applied statistical approaches for developing clinical prediction models. Developers of such models often rely on an Events Per Variable criterion (EPV), notably EPV ≥10, to determine the minimal sample size required and the maximum number of candidate predictors that can be examined. We present an extensive simulation study in which we studied the influence of EPV, events fraction, number of candidate predictors, the correlations and distributions of candidate predictor variables, area under the ROC curve, and predictor effects on out-of-sample predictive performance of prediction models. The out-of-sample performance (calibration, discrimination and probability prediction error) of developed prediction models was studied before and after regression shrinkage and variable selection. The results indicate that EPV does not have a strong relation with metrics of predictive performance, and is not an appropriate criterion for (binary) prediction model development studies. We show that out-of-sample predictive performance can better be approximated by considering the number of predictors, the total sample size and the events fraction. We propose that the development of new sample size criteria for prediction models should be based on these three parameters, and provide suggestions for improving sample size determination.

Keywords:  EPV; Logistic regression; prediction models; predictive performance; sample size; simulations

Year: 2018 | PMID: 29966490 | PMCID: PMC6710621 | DOI: 10.1177/0962280218784726

Source DB: PubMed | Journal: Stat Methods Med Res | ISSN: 0962-2802 | Impact factor: 3.021


1 Introduction

Binary logistic regression modeling is among the most frequently used approaches for developing multivariable clinical prediction models for binary outcomes.[1,2] Two major categories are: diagnostic prediction models that estimate the probability of a target disease being currently present versus not present; and prognostic prediction models that predict the probability of developing a certain health state or disease outcome over a certain time period.[3] These models are developed to estimate probabilities for new individuals, i.e. individuals that were not part of the data used for developing the model,[3-5] which need to be accurate and estimated with sufficient precision to correctly guide patient management and treatment decisions. One key contributing factor to obtain robust predictive performance of prediction models is the size of the data set used for development of the prediction model relative to the number of predictors (variables) considered for inclusion in the model (hereinafter referred to as candidate predictors).[4,6-10] For logistic regression analysis, sample size is typically expressed in terms of events per variable (EPV), defined by the ratio of the number of events, i.e. number of observations in the smaller of the two outcome groups, relative to the number of degrees of freedom (parameters) required to represent the predictors considered in developing the prediction model. 
Lower EPV values in prediction model development have frequently been associated with poorer predictive performance upon validation.[6,7,9,11-13] In the medical literature, an EPV of 10 is widely used as the lower limit for developing prediction models that predict a binary outcome.[14,15] This minimal sample size criterion has also generally been accepted as a methodological quality item in appraising published prediction modeling studies.[2,14,16] However, some authors have expressed concerns that the EPV ≥10 rule-of-thumb is not based on convincing scientific reasoning,[17] and the rule did not perform well in large-scale simulation studies.[18-20] Indeed, EPV ≥10 has been found too lenient when default stepwise predictor selection strategies are used for development of the prediction model,[11,13] and a substantially higher EPV may be needed when stepwise predictor selection with a conventional type I error is applied.[11] Conversely, more recent work suggests that the EPV criterion may be too strict in particular settings, showing several examples where prediction models developed with modern regression shrinkage techniques showed good out-of-sample predictive performance in settings with EPV ≪ 10.[15,21] Despite these concerns and controversy, surprisingly few alternatives to EPV criteria for considering sample size for logistic regression analysis have been proposed, except those that have focused on significance testing of logistic regression coefficients.[22] Sample size calculations for testing single coefficients are of little interest when developing a prediction model to be used for new individuals, where the predictive performance of the model as a whole is of primary concern. Our work is motivated by the lack of sample size guidance and uncertainty about the factors driving the predictive performance of clinical prediction models that are developed using binary logistic regression.
We report an extensive simulation study to evaluate out-of-sample predictive performance (hereafter shortened to predictive performance) of developed prediction models, applying several methods for model development. We examine the predictive performance of logistic regression-based prediction models developed using conventional Maximum Likelihood (ML), Ridge regression,[23] Least absolute shrinkage and selection operator (Lasso),[24] Firth’s correction[25] and heuristic shrinkage after ML estimation.[26] Backward elimination predictor selection using the conventional p = .05 and p = .157 (= AIC) stopping rules is also evaluated. Using a full-factorial approach, we varied EPV, the events fraction, number of candidate predictors, area under the ROC curve (model discrimination), distribution of predictor variables and type of predictor variable effects. The simulation results are summarized using metamodels.[27,28] This paper is structured as follows. In section 2 we present models and notation. The design of the simulation study is presented in section 3, and the results are described in section 4. A discussion of our findings and their implications for sample size considerations for logistic regression is presented in section 5.

2 Developing a prediction model using logistic regression

2.1 General notation

We define a logistic regression model for estimating the probability of an event occurring (Y = 1) versus not occurring (Y = 0) given values of (a subset of) P candidate predictors, x = (x1, …, xP)′. For an individual i (i = 1, …, N), let πi = Pr(Yi = 1 | xi). The logistic model assumes that πi is an inverse logistic function of the linear predictor: πi = 1/{1 + exp(−(α + xi′β))}, where α is the intercept (a scalar) and the vector β = (β1, …, βP)′ contains the regression coefficients corresponding to the log odds ratios for a 1-unit increase in the corresponding predictor (hereinafter referred to as predictor effects), assuming a linear effect for each candidate predictor. At different steps in the prediction model development, the number of predictor effects estimated may be smaller than the number of candidate predictors (P) due to predictor selection.
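As a concrete illustration (ours, not the paper's), the probability estimate for a new individual follows directly from the intercept and predictor effects; the coefficient values below are hypothetical:

```python
import math

def predict_prob(intercept, coefs, x):
    """Estimated event probability under a logistic model:
    pi = 1 / (1 + exp(-(alpha + x'beta))), assuming linear predictor effects."""
    linear_predictor = intercept + sum(b * xj for b, xj in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# A linear predictor of 0 corresponds to a probability of 0.5.
p = predict_prob(-1.0, [0.8, 0.5], [1.0, 0.4])  # hypothetical coefficients and values
```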

2.2 Maximum likelihood estimation and known finite sample properties

Conventionally, the (P + 1)-dimensional parameter vector of the logistic model is estimated by ML estimation, which maximizes the log-likelihood function[29] ℓ(α, β) = Σi {yi log(πi) + (1 − yi) log(1 − πi)}; the estimates are usually derived by iteratively solving the scoring equation Σi (yi − πi)xi = 0 (with xi including a constant for the intercept). While ML logistic regression remains a popular approach to developing prediction models, ML is also known to possess several finite sample properties that can cause problems when applying the technique in small or sparse data. These properties can be classified into the following five separate, though not mutually exclusive, issues.

Issue 1: ML estimators are not optimal for making model predictions of the expected probability (risk) in new individuals. In most circumstances, shrinkage estimators can be defined that have lower expected error for estimating probabilities in new individuals than ML estimators.[30,31] The benefit of the shrinkage estimators over the corresponding ML estimators decreases with increasing EPV.[9]

Issue 2: the predictor effects are finite sample biased.[32,33] The regression coefficients from ML logistic regression models estimate the (multivariable) log odds ratios for the individual predictors, which are biased towards more extreme effects, i.e. creating optimistic estimates of predictor effects for individual predictors in smaller data sets. This bias reduces with increasing EPV,[34-36] but may not completely disappear even in large samples.[19]

Issue 3: model estimation becomes unstable when predictor effects are large or data are sparse (i.e. separation).[37,38] The estimated predictor effects tend to become infinitely large in value when a linear combination of predictors can be defined that perfectly discriminates between events and non-events. An undesirable consequence is extreme probability estimates close to their natural boundaries of 0 or 1. Separation becomes less likely with increasing EPV.[19]

Issue 4: model estimation becomes unstable when predictors are strongly correlated (i.e. collinearity).[9,39] If correlations between predictors are very strong, the standard errors of the predictor effects become inflated, reflecting uncertainty about the effects of the individual predictors, although this has limited effect on the predictive performance of the model as a whole.[8] With increasing EPV, spurious predictor collinearity becomes less likely.

Issue 5: commonly used automated predictor selection strategies (e.g. stepwise selection using p-values to decide on predictor inclusion[40]) cause distortions when applied in smaller data sets. In small data sets, predictor selection is known to: (i) lead to unstable models, where small changes in the data – deletion or addition of individuals – can result in different predictors being selected[7,8,41,42]; (ii) cause bias in the predictor effects towards more extreme values[10,11]; and (iii) reduce a model’s predictive performance when applied in new individuals, due to omission of important predictors (underfitting) or inclusion of many unimportant predictors (overfitting).[9,11] The distortions due to predictor selection typically decrease with increasing EPV.

As these small and sparse data effects can affect the performance of a developed ML prediction model, and thus impact the required sample size for prediction model development studies, we additionally focus on four commonly applied shrinkage estimators for logistic regression. Each of these methods aims to reduce at least one of the aforementioned issues.
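Issue 3 (separation) is easy to reproduce. The sketch below (ours, not the authors' code) fits a one-parameter logistic model by gradient ascent on perfectly separated toy data; the slope estimate keeps growing with the number of iterations instead of converging:

```python
import math

def fit_slope(data, steps, lr=0.5):
    """Fit an intercept-free logistic model pi = 1/(1+exp(-b*x)) by gradient
    ascent on the log-likelihood; returns the slope estimate b."""
    b = 0.0
    for _ in range(steps):
        grad = sum(x * (y - 1.0 / (1.0 + math.exp(-b * x))) for x, y in data)
        b += lr * grad
    return b

# x < 0 is never an event, x > 0 always is: complete separation.
separated = [(-1, 0), (-2, 0), (1, 1), (2, 1)]
b_short, b_long = fit_slope(separated, 200), fit_slope(separated, 2000)
# b_long exceeds b_short: the ML estimate diverges rather than converging.
```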

2.3 Regression shrinkage

2.3.1 Heuristic shrinkage logistic regression

Van Houwelingen and Le Cessie[43] proposed a heuristic shrinkage (HS) factor to be applied uniformly to all predictor effects. The shrinkage factor is calculated as ĉ = (G² − P)/G² (equation (1)), where G² is the ML logistic regression’s likelihood ratio statistic, G² = −2{log L(0) − log L(α̂, β̂)}, with L(0) denoting the likelihood under the intercept-only ML logistic model.[8] The predictor effects of the ML regression are subsequently multiplied by ĉ to obtain shrunken predictor effect estimates. After shrinkage, the intercept is recalculated by refitting the ML model taking the shrunken regression coefficients as fixed (i.e. as offset terms). The HS estimator was developed to improve a model’s predictive performance over the ML estimator in smaller data sets (issue 1).[43] However, in cases of weak predictor effects the HS estimator can perform poorly: as can be seen from equation (1), the shrinkage factor takes on a negative value if G² < P, in which case each of the predictor effects switches sign and a different modeling strategy is recommended. As HS relies on estimating the ML model to calculate the shrinkage factor and intercept, it may be sensitive to ML estimation instability (issues 3 and 4).[44]
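Reading the shrinkage factor as ĉ = (G² − P)/G², the computation and its failure mode can be sketched as follows (the G² values are hypothetical):

```python
def heuristic_shrinkage(g2, n_effects):
    """Heuristic shrinkage factor c = (G^2 - P) / G^2, where G^2 is the
    likelihood ratio statistic and P the number of estimated predictor effects."""
    return (g2 - n_effects) / g2

def shrink(coefs, g2):
    """Multiply all ML predictor effects by the heuristic shrinkage factor."""
    c = heuristic_shrinkage(g2, len(coefs))
    return [c * b for b in coefs]

# Strong overall fit (G^2 = 40, 8 predictors): coefficients shrunk by 20%.
shrunken = shrink([0.5, -0.3] + [0.1] * 6, 40.0)
# Weak fit (G^2 = 4 with 8 predictors): the factor turns negative and HS breaks down.
c_neg = heuristic_shrinkage(4.0, 8)
```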

2.3.2 Firth logistic regression

Firth’s penalized likelihood logistic regression model[25,38,45] penalizes the likelihood by the factor |I(α, β)|^(1/2), where I(α, β) denotes Fisher’s information matrix evaluated at (α, β). Firth’s logistic regression model is estimated by solving the modified scoring equation Σi {yi − πi + hi(1/2 − πi)}xi = 0, where hi is the i-th diagonal element of the ‘hat’ matrix (cf.[38]). The Firth estimator was initially developed to remove the first-order finite sample bias (issue 2) in ML estimators of logistic regression coefficients and other exponential family models with canonical links.[25] As a consequence of its penalty function, the regression coefficients remain finite in situations of separation (issue 3).[38] More recently, Puhr and colleagues[21] evaluated Firth’s estimator for improving predictive performance (issue 1), warning that it introduces bias in predicted probabilities toward 1/2, a consequence of using Fisher’s information matrix for penalization, which is maximized at π = 1/2.[21]

2.3.3 Ridge logistic regression

Ridge regression penalizes the likelihood proportionally to the sum of squared predictor effects: λ1 Σj βj². For estimation, the predictors are standardized to have mean zero and unit variance.[46] Then, for a particular value of the tuning parameter λ1, the regression coefficients are estimated using a coordinate descent algorithm that minimizes a penalized weighted least-squares approximation of the negative log-likelihood, with the working weights and responses evaluated at the current estimates (cf.[47]). The optimal value of the tuning parameter λ1 can be approximated using cross-validation optimized for a particular performance criterion. In this paper, we apply the commonly used 10-fold cross-validation (whenever possible) with minimal deviance as the performance criterion. The Ridge estimator was originally developed to deal with collinearity (issue 4).[23,39] Due to its penalty function it can also deal with separation (issue 3). Moreover, the Ridge estimator has been shown to improve predictive performance in smaller data sets that do not suffer from collinearity or separation (issue 1), although in some circumstances it showed signs of underfitting.[15,48]
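The shrinkage behaviour of the Ridge penalty can be illustrated with a toy one-predictor fit by gradient ascent on the penalized log-likelihood (a simplified stand-in for the coordinate descent algorithm; data and λ values are made up):

```python
import math

def fit_ridge_slope(data, lam, steps=3000, lr=0.1):
    """One-predictor logistic fit maximizing log-likelihood - lam * b**2."""
    b = 0.0
    for _ in range(steps):
        grad = sum(x * (y - 1.0 / (1.0 + math.exp(-b * x))) for x, y in data)
        b += lr * (grad - 2.0 * lam * b)  # the penalty gradient pulls b toward 0
    return b

# Toy data with overlapping outcome groups (no separation).
data = [(-1, 0), (-0.5, 1), (0.5, 0), (1, 1), (2, 1), (-2, 0)]
b_light = fit_ridge_slope(data, 0.01)  # nearly unpenalized slope
b_heavy = fit_ridge_slope(data, 1.0)   # heavily shrunken slope
```

A larger λ pulls the slope toward zero; the same penalty keeps the estimate finite even under separation.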

2.3.4 Least absolute shrinkage and selection operator (Lasso) regression

Lasso regression penalizes the likelihood proportionally to the sum of the absolute values of the predictor effects: λ2 Σj |βj|. Estimating Lasso regression can be done using the same coordinate descent procedure as Ridge regression, with the Ridge penalty in the weighted least-squares objective (equation (4)) replaced by the Lasso penalty. As for Ridge regression, we use 10-fold cross-validation with minimal deviance as the performance criterion to define the tuning parameter. Lasso regression is attractive for developing prediction models as it simultaneously performs regression shrinkage (addressing issue 1) and predictor selection (by shrinking some coefficients to exactly zero), while avoiding some of the adverse effects of regular automated predictor selection strategies (issue 5). It is also suited to handle separation (issue 3), but in the context of highly correlated predictors (issue 4), the Lasso has been reported to perform less well.[15,49]
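The selection behaviour of the Lasso comes from the soft-thresholding step inside coordinate descent, which maps sufficiently small coefficient updates exactly to zero. A textbook sketch (not the glmnet implementation):

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator used in Lasso coordinate descent:
    S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0  # coefficient set exactly to zero: predictor dropped

# Updates whose magnitude falls below the threshold are zeroed out.
updates = [soft_threshold(z, 0.5) for z in (2.0, 0.3, -0.1, -1.2)]
```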

3 Methods

This simulation study was set up to evaluate the predictive performance of various prediction modeling strategies in relation to characteristics of the development data. Our primary interest was in the size of the development data set relative to other data characteristics, such as the number of candidate predictors and the events fraction (i.e. Pr(Y = 1)). The various modeling strategies we considered are described in section 3.1, and the variations in data characteristics are described in section 3.2. A description of the predictive performance metrics and metamodels is given in sections 3.3.1 and 3.3.2, respectively. Software and error handling are described in section 3.4.

3.1 Modeling strategies

The predictive performance of the various logistic regression models described in section 2 was evaluated on large sample validation data sets. These regressions correspond to different ways of applying regression shrinkage (ML regression applies none). For future reference, we collectively call these approaches “regression shrinkage strategies”. We also evaluated predictive performance after backward elimination predictor selection.[40] This procedure starts by estimating a model with all P candidate predictor variables and considering the p-values associated with the predictor effects. For some pre-specified threshold value, the variable with the highest p-value exceeding the threshold is dropped. The model is then re-estimated without the omitted variable. This process continues until all the p-values associated with the effects of the predictors in the model are below the threshold. In this paper, we consider two commonly used threshold p-values for the ML and Firth regressions: the conventional threshold p = 0.050 and the more conservative threshold p = 0.157, the latter being equivalent to the AIC criterion for selection of predictors. We collectively call the backward elimination predictor selection approaches and the Lasso (which performs predictor selection by means of shrinkage) “predictor selection strategies”.
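The elimination loop itself is model-agnostic. A sketch with a hypothetical `fit` callback that refits the model on the remaining predictors and returns their p-values (the predictor names and p-values below are invented):

```python
def backward_eliminate(candidates, fit, threshold=0.157):
    """Drop the predictor with the largest p-value above `threshold`,
    refit, and repeat until all remaining p-values are below it.
    `fit` maps a list of predictor names to a {name: p-value} dict."""
    kept = list(candidates)
    while kept:
        pvalues = fit(kept)
        worst = max(kept, key=lambda name: pvalues[name])
        if pvalues[worst] <= threshold:
            break
        kept.remove(worst)
    return kept

# Toy 'fit' with precomputed p-values per candidate model (illustrative numbers).
tables = {
    ("age", "sex", "bmi"): {"age": 0.01, "sex": 0.60, "bmi": 0.20},
    ("age", "bmi"): {"age": 0.01, "bmi": 0.12},
}
selected = backward_eliminate(["age", "sex", "bmi"], lambda ks: tables[tuple(ks)])
```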

3.2 Design and procedure

We conducted a full factorial simulation study, examining six design factors: (1) EPV, ranging from 3 to 50; (2) events fraction (Pr(Y = 1)), ranging from 1/2 (50%) to 1/16 (6.25%); (3) number of candidate predictors (P), ranging from 4 to 12; (4) model discrimination, defined by the area under the ROC curve (AUC[50]), ranging from 0.65 to 0.85; (5) distribution of the predictor variables, independent Bernoulli or multivariate normal with equal pairwise correlations ranging from 0 to 0.5; and (6) type of predictor effect, ranging from equal effects for all candidate predictors to half of the candidate predictors being noise variables. All factor levels are described in Table 1.
Table 1.

Design of the factorial simulation study.

Simulation factor                          Factor levels
1. Events per variable (EPV)               3, 5, 10, 15, 20, 30, 50
2. Events fraction                         1/2, 1/4, 1/8, 1/16
3. Number of candidate predictors (P)      4, 8, 12
4. Model discrimination (AUC)              0.65, 0.75, 0.85
5. Distribution of predictor variables     B(0.5): independent Bernoulli with success probability 0.5
                                           MVN(0.0): normal (means = 0, variances = 1, covariances = 0.0)
                                           MVN(0.3): normal (means = 0, variances = 1, covariances = 0.3)
                                           MVN(0.5): normal (means = 0, variances = 1, covariances = 0.5)
6. Predictor effects                       Equal effect: β1 = … = βP
                                           1 strong: 3β1 = β2 = … = βP
                                           1 noise: β1 = 0, β2 = … = βP
                                           1/2 noise: β1 = … = βP/2 = 0, βP/2+1 = … = βP

In total, 4032 unique simulation scenarios were investigated. For each of these scenarios, 5000 simulation runs were executed using the following steps. Step 1: a development data set was generated satisfying the simulation conditions (Table 1). For each of N hypothetical individuals, a predictor variable vector xi was drawn, and a binary outcome Yi was generated as Bernoulli(πi) (i.e. the outcome was drawn conditional on the true risk for each individual, which depends on the true predictor effects and the individual’s predictor values). Step 2: nine binary logistic prediction models with different regression shrinkage and predictor selection strategies were estimated on the development data generated at step 1. These approaches are described in Table 2.
Table 2.

Prediction models: parameter shrinkage and variable selection strategies.

Model                                       Parameter shrinkage   Variable selection   Abbreviation
Maximum likelihood (full model)             No                    No                   ML
Maximum likelihood (backward 1)             No                    Yes, p < 0.050       MLp
Maximum likelihood (backward 2)             No                    Yes, p < 0.157       MLAIC
Heuristic shrinkage                         Yes                   No                   HS
Firth’s penalized likelihood (full model)   Yes                   No                   Firth
Firth’s penalized likelihood (backward 1)   Yes                   Yes, p < 0.050       Firthp
Firth’s penalized likelihood (backward 2)   Yes                   Yes, p < 0.157       FirthAIC
Ridge penalized likelihood                  Yes                   No                   Ridge
Lasso penalized likelihood                  Yes                   Yes                  Lasso

Step 3: a large validation data set was generated, with a sample size chosen to yield 5000 expected events (25 times larger than the recommended minimum sample size for validation studies[51]), using the sampling approach of step 1. Step 4: the performance of the prediction models developed in step 2 was evaluated on the validation data generated in step 3. The measures of performance are detailed in section 3.3. More details about the development of the simulation scenarios appear in Web Appendix A.
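Steps 1 and 3 can be sketched as follows. This toy generator assumes standard-normal predictors and made-up true coefficients; the paper's actual generator (including the calibration of effects to the target AUCs) is detailed in its Web Appendix A:

```python
import math
import random

def generate_dataset(n, intercept, coefs, rng):
    """Draw standard-normal predictors, compute the true risk pi_i under the
    logistic model, and draw the binary outcome Y_i ~ Bernoulli(pi_i)."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in coefs]
        lp = intercept + sum(b * xj for b, xj in zip(coefs, x))
        pi = 1.0 / (1.0 + math.exp(-lp))
        y = 1 if rng.random() < pi else 0
        data.append((x, y, pi))
    return data

rng = random.Random(2018)
# Negative intercept gives an events fraction well below 1/2 (values are illustrative).
dev = generate_dataset(400, -2.0, [0.5, 0.5, 0.5, 0.5], rng)
```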

3.3 Simulation outcomes

3.3.1 Predictive performance metrics

Model discrimination was evaluated by the average (taken over all validation simulation samples) loss in the area under the ROC curve (ΔAUC). ΔAUC was defined as the average difference between the AUCs estimated on the generated data and the AUC of the data generating model (the AUC defined by simulation factor 4, Table 1). ΔAUCs were expected to be negative, with higher values (closer to zero) indicating better discriminative performance. Model calibration was evaluated by the median of the calibration slopes (CS) and the average calibration-in-the-large (CIL). CS values closer to 1 and CIL values closer to 0 indicate better performance. CS was estimated using standard procedures.[52-54] Due to the expected skewness of the slope distributions for smaller development data sets, medians rather than means and interquartile ranges rather than standard deviations were calculated. CS < 1 indicates model overfitting, CS > 1 indicates underfitting. CIL was calculated as the average difference between the generated events fraction and the average estimated probabilities. Values of CIL < 0 indicate systematic underestimation of the estimated probabilities, CIL > 0 indicates systematic overestimation. Prediction error was evaluated by the average of the Brier scores[55] (Brier), the square root of the mean squared prediction error (rMSPE) and the mean absolute prediction error (MAPE). The rMSPE and MAPE are based on the distance between the estimated probabilities and the true probabilities (π, which can be calculated under the data generating model using the generated predictor variable vector), as the square root of the average squared distance and the average absolute distance, respectively. Lower values for Brier, rMSPE and MAPE indicate better performance.
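The three prediction-error metrics follow directly from their definitions; a minimal sketch (ours, not the paper's code):

```python
import math

def brier(p_hat, y):
    """Mean squared distance between estimated probabilities and observed outcomes."""
    return sum((p - yi) ** 2 for p, yi in zip(p_hat, y)) / len(y)

def rmspe(p_hat, p_true):
    """Root mean squared distance between estimated and true probabilities."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(p_hat, p_true)) / len(p_true))

def mape(p_hat, p_true):
    """Mean absolute distance between estimated and true probabilities."""
    return sum(abs(p - t) for p, t in zip(p_hat, p_true)) / len(p_true)

# Tiny worked example with two individuals (numbers are hypothetical).
p_hat, p_true, y = [0.8, 0.3], [0.7, 0.4], [1, 0]
```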

3.3.2 Metamodels

Variation in simulation results across simulation conditions was studied using metamodels.[27,28] The metamodels were used to quantify the relative impact of the various development data characteristics (taken as covariates in the metamodels) on a particular predictive performance simulation outcome (the outcome variable in the metamodel). We considered the following covariates: development sample size (N), events fraction, number of candidate predictors (P), true area under the ROC curve (AUC), binary predictor variables (Bin, coding: 0 = no, 1 = yes), predictor pairwise correlations (Cor), and noise variables (Noise, coding: 0 = no, 1 = yes). Metamodels were developed for the following outcomes: natural log transformed MSPE (= rMSPE²), natural log transformed MAPE, natural log transformed Brier, ΔAUC (×100 for notational convenience) and CS. These models were developed separately for each of the shrinkage and predictor selection strategies. To facilitate interpretation, three separate metamodels were considered: (i) a full model with all metamodel covariates; (ii) a simplified model with only the development data size, events fraction and number of candidate predictors; and, for comparison, (iii) a model with only development data EPV as a covariate. Metamodel (ii) was conceptualized before the start of the simulation study, based on the notion that it would incorporate the same type of information as needed for estimating EPV before data collection, i.e. information available at the design phase of a prediction model development study (before the actual number of events is known). The metamodels were estimated using linear regression with a Ridge penalty (i.e. Gaussian Ridge regression), specifying only linear main effects of the metamodel covariates. While more complex models (e.g. with interactions and non-linear effects) are possible, we found linear main effects to be sufficient for constructing the metamodels. The Ridge metamodel tuning parameter was chosen based on 10-fold cross-validation minimizing mean squared error.
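For a single covariate, Gaussian ridge regression has a closed form that makes the role of the penalty explicit. This is a deliberately simplified sketch of the metamodel machinery; the actual metamodels use several covariates and a cross-validated λ:

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge slope for one centered covariate: b = S_xy / (S_xx + lambda)."""
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / (sxx + lam)

xs, ys = [0.0, 1.0, 2.0, 3.0], [0.1, 1.1, 1.9, 3.1]  # toy covariate/outcome pairs
b_ols = ridge_slope(xs, ys, 0.0)  # lambda = 0 recovers the least-squares slope
b_pen = ridge_slope(xs, ys, 5.0)  # a positive lambda shrinks the slope toward zero
```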

3.4 Software and estimation error handling

All simulations were performed in R (version 3.2.2)[56] executed on a central high-performance computing facility running the CentOS Linux operating system. We used the CRAN packages glmnet[47] (version 2.0-5, with the default grid of tuning parameters expanded by 100 additional λ values smaller than the lowest default value) for estimating the Ridge and Lasso regression models, logistf (version 1.21) for estimating the ML, HS and Firth models and to perform backward selection, and MASS (version 7.3-45)[57] for generating predictor data. Estimation errors were closely monitored (details in Web Appendix B). A summary of the estimation errors and their handling is given in Table 3. The Web Appendix also presents detailed simulation results focusing only on the ML model (Web Appendix C) and the relative rankings of the various model strategies with respect to the observed predictive performance (Web Appendix D).
Table 3.

Simulation estimation errors and consequences.

                                                       No. (%)               Consequence
Development datasets generated                         20,160,000 (100%)
Simulation conditions                                  4,032 (100%)
Separation detected                                    90,846 (0.45%)        Separated cases are left in (to avoid selective missing data).
Degenerate distributions
  <3 events or <3 non-events generated                 211 (0.001%)          Data sets are treated as missing data sets.
  <8 events or <8 non-events generated                 68,048 (0.34%)        Leave-one-out cross-validation is used for estimating Lasso and Ridge tuning parameters.
  Degenerate predictor variable generated              0 (0%)
Heuristic shrinkage factor inestimable                 2,470,118 (12.25%)    For HS, results are replaced by ML results.
Degenerated linear predictor (no variables selected)
  MLp                                                  650,133 (3.22%)
  MLAIC                                                179,638 (0.89%)
  Firthp                                               718,194 (3.56%)
  FirthAIC                                             204,617 (1.01%)
  Lasso                                                744,575 (3.69%)

4 Results

4.1 Predictive performance by relative size of development data

Figure 1 shows the average predictive performance of the various prediction models as a function of EPV and the events fraction. The impact of EPV and events fraction was consistent across the prediction models. There was improved predictive performance (i.e. reduction in average value for rMSPE and MAPE; ΔAUC closer to zero) when EPV increased (while keeping events fraction constant), and when the events fraction decreased (while keeping EPV constant). Differences between events fraction conditions decreased when EPV increased. Brier consistently improved (i.e. reduction in average value) with decreasing events fractions across prediction models, but showed little association with EPV beyond an EPV of 20.
Figure 1.

Marginal out-of-sample predictive performance.

Close to perfect average values (a value of 0) were observed for CIL for all models across all studied conditions (Figure 1), except for the Firth regressions with and without predictor selection. This miscalibration-in-the-large occurred in lower EPV settings, and did not occur in the conditions where the events fraction was 1/2. On average, all models except Ridge regression showed signs of overfitting (CS values below 1); Ridge regression consistently showed signs of underfitting (CS values above 1). For all models, improved CS values (i.e. average values closer to 1) were observed when EPV increased (while keeping the events fraction constant) and when the events fraction decreased (while keeping EPV constant).

4.1.1 Performance of regression shrinkage strategies by relative size of development data

Unsurprisingly, the impact of shrinkage lessened with increasing EPV, as depicted in Figure 2. The active regression shrinkage strategies (Ridge, Lasso, HS, Firth) showed lower median rMSPE and MAPE values than the non-shrunken ML regression at EPV = 5 and EPV = 10. In those settings, Ridge, Lasso and HS regression showed more variability between simulation scenarios than Firth and ML. For simulation scenarios at EPV = 50, the differences between shrinkage strategies were smaller.
Figure 2.

Boxplot distribution of out-of-sample predictive performance outcomes (restricted to conditions with events fraction = 1/2).

In Figure 2, the Brier and CIL outcomes showed little variation between shrinkage strategies. Note that for this figure the events fraction was kept constant at 1/2; miscalibration-in-the-large was therefore not observed for the Firth regression. Poor CIL and rMSPE for the HS model were observed in some conditions with a high rate of separation (results not shown). Only Ridge regression showed superior performance on the outcome ΔAUC, with little difference between HS, Firth and ML, and slightly less favorable and more variable performance of the Lasso regression at EPV = 5 and EPV = 10. The Lasso regression yielded CS closest to optimal (a value of 1).

4.1.2 Performance of predictor selection strategies by relative size of development data

Backward elimination (MLp, MLAIC, Firthp and FirthAIC) produced higher median rMSPE and MAPE than the ML and Firth regressions that did not perform predictor selection (Figure 2). Median rMSPE and MAPE were more favorable for MLAIC and FirthAIC than for MLp and Firthp. Backward elimination also showed more variable MAPE and rMSPE values across the different simulation scenarios. These patterns were most noticeable in the EPV = 5 and EPV = 10 conditions but did not completely disappear even at EPV = 50. Lasso regression had lower MAPE and rMSPE than the backward elimination strategies and less variable results between conditions over the whole considered range of EPV. Brier and CIL showed little variation between predictor selection strategies (Figure 2). For the predictor selection strategies, median ΔAUC was least favorable and most variable for MLp and Firthp, followed by MLAIC and FirthAIC, followed by Lasso. Lasso also yielded CS closest to optimal, with little difference observed between the backward elimination strategies. These patterns were observed consistently across the considered EPV range.

4.2 Predictive performance by other development data characteristics

Figure 3 describes the average performances of the prediction models. We left out Brier (noticeable changes occurred only when varying the AUC of the data generating mechanism) and CIL (close to optimal for all but the Firth regressions) from this presentation.
Figure 3.

Average relative out-of-sample performances of modeling strategies per simulation factor level.

Lower AUC of the data generating mechanism was associated with poorer CS and ΔAUC outcomes. In conditions with AUC = 0.65, Ridge regression was superior in terms of rMSPE, MAPE and ΔAUC, while HS was superior in terms of CS. We also observed improved predictive performance as the number of candidate predictors increased. This is partly due to a doubling of the development data size in our simulations when going from 4 to 8 predictors, and a three-fold increase in sample size when going from 4 to 12 predictors, a direct consequence of EPV being one of the chosen simulation factors. With respect to the individual effects of the predictor variables (Figure 3), the average predictive performance of the variable selection strategies was best in conditions with one strong predictor. Effects of noise variables on the performances were negligible. Higher pairwise correlations between the predictors improved rMSPE, MAPE and ΔAUC for Ridge and Lasso, and CS for Lasso. Higher correlations also increased the signs of underfitting of the Ridge regression (CS > 1).

4.3 Metamodels results

Table 4 presents the results of the fitted metamodels (linear regressions subject to a Ridge penalty).
Table 4.

Results of simulation meta models: Outcome: ln(MSPE).

(Covariates EPV, N, events fraction and P entered on the natural log scale; AUC, Cor, Bin and Noise on the original scale. "." indicates covariate not included.)

Meta model | Int | EPV | N | Events fraction | P | AUC | Cor | Bin | Noise | R²
Full ML | −0.55 | . | −1.06 | 0.36 | 0.94 | 0.40 | 0.00 | 0.05 | 0.00 | 0.993
Simplified ML | −0.59 | . | −1.06 | 0.36 | 0.94 | . | . | . | . | 0.992
EPV only ML | −3.29 | −1.06 | . | . | . | . | . | . | . | 0.432
Full Firth | −0.84 | . | −1.03 | 0.33 | 0.93 | 0.31 | 0.00 | 0.04 | 0.00 | 0.993
Simplified Firth | −0.86 | . | −1.03 | 0.33 | 0.93 | . | . | . | . | 0.992
EPV only Firth | −3.42 | −1.03 | . | . | . | . | . | . | . | 0.438
Full HS | −0.39 | . | −0.97 | 0.44 | 0.74 | 1.17 | 0.00 | −0.01 | 0.00 | 0.985
Simplified HS | −0.75 | . | −0.97 | 0.44 | 0.74 | . | . | . | . | 0.977
EPV only HS | −3.64 | −0.97 | . | . | . | . | . | . | . | 0.385
Full Lasso | −0.59 | . | −0.93 | 0.46 | 0.68 | 0.97 | −0.48 | 0.04 | 0.03 | 0.983
Simplified Lasso | −0.86 | . | −0.93 | 0.46 | 0.68 | . | . | . | . | 0.973
EPV only Lasso | −3.78 | −0.93 | . | . | . | . | . | . | . | 0.371
Full Ridge | −0.39 | . | −0.88 | 0.50 | 0.49 | 1.33 | −0.85 | 0.03 | −0.02 | 0.979
Simplified Ridge | −0.93 | . | −0.88 | 0.50 | 0.49 | . | . | . | . | 0.952
EPV only Ridge | −4.08 | −0.88 | . | . | . | . | . | . | . | 0.337
Full ML(p) | −0.85 | . | −1.02 | 0.40 | 0.95 | 0.34 | 0.03 | 0.07 | 0.17 | 0.943
Simplified ML(p) | −0.57 | . | −1.03 | 0.40 | 0.96 | . | . | . | . | 0.939
EPV only ML(p) | −3.18 | −1.03 | . | . | . | . | . | . | . | 0.393
Full ML(AIC) | −0.74 | . | −1.05 | 0.38 | 0.95 | 0.35 | 0.00 | 0.06 | 0.10 | 0.977
Simplified ML(AIC) | −0.59 | . | −1.05 | 0.38 | 0.95 | . | . | . | . | 0.975
EPV only ML(AIC) | −3.25 | −1.05 | . | . | . | . | . | . | . | 0.417
Full Firth(p) | −0.94 | . | −1.01 | 0.39 | 0.95 | 0.34 | 0.02 | 0.07 | 0.17 | 0.939
Simplified Firth(p) | −0.66 | . | −1.01 | 0.39 | 0.95 | . | . | . | . | 0.935
EPV only Firth(p) | −3.22 | −1.01 | . | . | . | . | . | . | . | 0.392
Full Firth(AIC) | −0.90 | . | −1.03 | 0.37 | 0.95 | 0.32 | 0.00 | 0.06 | 0.10 | 0.975
Simplified Firth(AIC) | −0.74 | . | −1.03 | 0.37 | 0.95 | . | . | . | . | 0.973
EPV only Firth(AIC) | −3.32 | −1.03 | . | . | . | . | . | . | . | 0.418

Full: metamodel with all eight metamodel covariates; Simplified: metamodel with covariates N, events fraction and P; EPV only: metamodel with EPV as the only covariate. Int: intercept; EPV: events per variable; N: sample size; P: number of candidate predictors; AUC: area under the ROC curve; Cor: predictor pairwise correlations.

The metamodels showed similar results for the natural-log-transformed outcomes MSPE (ln(MSPE)) and MAPE (ln(MAPE)) (Tables 4 and 5). For the metamodels that included all eight covariates as linear main effects, the percentage of explained variance (R2) was 99.3% for ln(MSPE) and 99.6% for ln(MAPE) under ML regression. Using the simplified metamodel with three covariates, R2 dropped to between 93.5% and 99.2%, indicating that these three factors (N, events fraction and P) explained a sizable amount of the variance between simulation conditions. R2 was similar for ML and Firth regression, but lower for Ridge, Lasso, HS and after backwards elimination. Using only EPV as covariate in the metamodel yielded R2 between 28.5% and 43.8% for the ln(MSPE) and ln(MAPE) outcomes.
Table 5.

Results of simulation meta models: Outcome: ln(MAPE).

(Covariates EPV, N, events fraction and P entered on the natural log scale; AUC, Cor, Bin and Noise on the original scale. "." indicates covariate not included.)

Meta model | Int | EPV | N | Events fraction | P | AUC | Cor | Bin | Noise | R²
Full ML | −0.60 | . | −0.53 | 0.31 | 0.48 | −0.50 | 0.00 | −0.01 | 0.00 | 0.996
Simplified ML | −0.48 | . | −0.53 | 0.31 | 0.48 | . | . | . | . | 0.992
EPV only ML | −2.03 | −0.53 | . | . | . | . | . | . | . | 0.355
Full Firth | −0.74 | . | −0.51 | 0.29 | 0.47 | −0.51 | 0.00 | −0.01 | 0.00 | 0.996
Simplified Firth | −0.61 | . | −0.51 | 0.29 | 0.47 | . | . | . | . | 0.991
EPV only Firth | −2.10 | −0.51 | . | . | . | . | . | . | . | 0.357
Full HS | −0.55 | . | −0.49 | 0.33 | 0.39 | −0.15 | 0.00 | −0.03 | 0.00 | 0.991
Simplified HS | −0.56 | . | −0.49 | 0.33 | 0.39 | . | . | . | . | 0.991
EPV only HS | −2.19 | −0.49 | . | . | . | . | . | . | . | 0.326
Full Lasso | −0.59 | . | −0.48 | 0.34 | 0.35 | −0.19 | −0.24 | −0.01 | 0.01 | 0.989
Simplified Lasso | −0.59 | . | −0.48 | 0.34 | 0.35 | . | . | . | . | 0.983
EPV only Lasso | −2.24 | −0.48 | . | . | . | . | . | . | . | 0.314
Full Ridge | −0.48 | . | −0.45 | 0.36 | 0.26 | 0.03 | −0.43 | −0.02 | −0.01 | 0.986
Simplified Ridge | −0.61 | . | −0.45 | 0.36 | 0.26 | . | . | . | . | 0.970
EPV only Ridge | −2.39 | −0.45 | . | . | . | . | . | . | . | 0.285
Full ML(p) | −0.75 | . | −0.52 | 0.31 | 0.49 | −0.58 | 0.03 | −0.01 | 0.09 | 0.951
Simplified ML(p) | −0.45 | . | −0.52 | 0.31 | 0.49 | . | . | . | . | 0.942
EPV only ML(p) | −1.95 | −0.52 | . | . | . | . | . | . | . | 0.334
Full ML(AIC) | −0.70 | . | −0.53 | 0.31 | 0.49 | −0.55 | 0.01 | −0.01 | 0.06 | 0.982
Simplified ML(AIC) | −0.48 | . | −0.53 | 0.31 | 0.49 | . | . | . | . | 0.975
EPV only ML(AIC) | −2.00 | −0.53 | . | . | . | . | . | . | . | 0.348
Full Firth(p) | −0.79 | . | −0.52 | 0.30 | 0.50 | −0.56 | 0.02 | −0.01 | 0.09 | 0.947
Simplified Firth(p) | −0.49 | . | −0.52 | 0.30 | 0.50 | . | . | . | . | 0.938
EPV only Firth(p) | −1.96 | −0.52 | . | . | . | . | . | . | . | 0.335
Full Firth(AIC) | −0.78 | . | −0.52 | 0.30 | 0.49 | −0.55 | 0.01 | −0.01 | 0.06 | 0.979
Simplified Firth(AIC) | −0.55 | . | −0.52 | 0.30 | 0.49 | . | . | . | . | 0.973
EPV only Firth(AIC) | −2.03 | −0.52 | . | . | . | . | . | . | . | 0.348

Full, Simplified and EPV only metamodels and abbreviations as defined in the footnote to Table 4.

As expected, MSPE and MAPE were negatively related to N and positively related to P. The positive relation between the events fraction and MSPE/MAPE can be explained by a shift of the average estimated probability towards zero as the events fraction decreases (assuming the model is appropriately calibrated): lower probabilities have lower expected variance, since the variance of a Bernoulli variable with success probability π, π(1 − π), approaches zero as π approaches zero. Similar findings were observed for the outcome ln(Brier) (Table 6). There was a strong relation between the simplified metamodel covariates and ln(Brier) (R2 between 92.3% and 92.9%). Little variation in the fitted metamodel coefficients and R2 was observed between the different regression models. For all models, N was negatively related to ln(Brier), while the events fraction and P were positively related to ln(Brier). In contrast, EPV had only a weak relation to ln(Brier), with R2 below 1%.
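This variance argument can be checked numerically: for a perfectly calibrated model that predicts probability π for a Bernoulli(π) outcome, the expected Brier score equals π(1 − π), which shrinks towards zero as the events fraction falls. A minimal illustration (not the paper's actual simulation code):

```python
import random

random.seed(1)

def expected_brier(pi, n=200_000):
    """Monte Carlo estimate of the Brier score of a perfectly
    calibrated model that predicts pi for Bernoulli(pi) outcomes."""
    return sum((pi - (random.random() < pi)) ** 2 for _ in range(n)) / n

for pi in (0.5, 0.25, 0.1, 0.01):
    # the Monte Carlo estimate tracks the theoretical value pi * (1 - pi)
    print(pi, round(expected_brier(pi), 4), round(pi * (1 - pi), 4))
```

Lower events fractions therefore mechanically lower Brier-type error measures even when nothing about the model improves, which is consistent with the positive metamodel coefficients for the events fraction.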
Table 6.

Results of simulation meta models: Outcome: ln(Brier).

(Covariates EPV, N, events fraction and P entered on the natural log scale; AUC, Cor, Bin and Noise on the original scale. "." indicates covariate not included.)

Meta model | Int | EPV | N | Events fraction | P | AUC | Cor | Bin | Noise | R²
Full ML | −1.23 | . | −0.04 | 0.62 | 0.04 | −1.02 | 0.00 | 0.01 | 0.00 | 0.969
Simplified ML | −0.91 | . | −0.04 | 0.62 | 0.04 | . | . | . | . | 0.925
EPV only ML | −2.06 | −0.04 | . | . | . | . | . | . | . | 0.005
Full Firth | −1.27 | . | −0.03 | 0.62 | 0.03 | −1.02 | 0.00 | 0.01 | 0.00 | 0.969
Simplified Firth | −0.95 | . | −0.03 | 0.62 | 0.03 | . | . | . | . | 0.923
EPV only Firth | −2.08 | −0.03 | . | . | . | . | . | . | . | 0.003
Full HS | −1.23 | . | −0.03 | 0.62 | 0.02 | −0.98 | 0.00 | 0.00 | 0.00 | 0.969
Simplified HS | −0.93 | . | −0.03 | 0.62 | 0.02 | . | . | . | . | 0.927
EPV only HS | −2.08 | −0.03 | . | . | . | . | . | . | . | 0.003
Full Lasso | −1.27 | . | −0.03 | 0.62 | 0.02 | −1.00 | −0.02 | 0.01 | 0.00 | 0.969
Simplified Lasso | −0.96 | . | −0.03 | 0.62 | 0.02 | . | . | . | . | 0.925
EPV only Lasso | −2.10 | −0.03 | . | . | . | . | . | . | . | 0.002
Full Ridge | −1.29 | . | −0.02 | 0.62 | 0.01 | −1.00 | −0.02 | 0.01 | 0.00 | 0.968
Simplified Ridge | −0.98 | . | −0.02 | 0.62 | 0.01 | . | . | . | . | 0.924
EPV only Ridge | −2.12 | −0.02 | . | . | . | . | . | . | . | 0.002
Full ML(p) | −1.19 | . | −0.04 | 0.62 | 0.04 | −0.96 | −0.02 | 0.01 | 0.01 | 0.969
Simplified ML(p) | −0.89 | . | −0.04 | 0.62 | 0.04 | . | . | . | . | 0.929
EPV only ML(p) | −2.04 | −0.04 | . | . | . | . | . | . | . | 0.006
Full ML(AIC) | −1.22 | . | −0.04 | 0.62 | 0.04 | −0.99 | −0.01 | 0.01 | 0.00 | 0.969
Simplified ML(AIC) | −0.91 | . | −0.04 | 0.62 | 0.04 | . | . | . | . | 0.927
EPV only ML(AIC) | −2.05 | −0.04 | . | . | . | . | . | . | . | 0.005
Full Firth(p) | −1.20 | . | −0.04 | 0.62 | 0.04 | −0.96 | −0.02 | 0.01 | 0.01 | 0.968
Simplified Firth(p) | −0.90 | . | −0.04 | 0.62 | 0.04 | . | . | . | . | 0.929
EPV only Firth(p) | −2.05 | −0.04 | . | . | . | . | . | . | . | 0.005
Full Firth(AIC) | −1.24 | . | −0.04 | 0.62 | 0.04 | −1.00 | −0.01 | 0.01 | 0.00 | 0.969
Simplified Firth(AIC) | −0.93 | . | −0.04 | 0.62 | 0.04 | . | . | . | . | 0.926
EPV only Firth(AIC) | −2.06 | −0.04 | . | . | . | . | . | . | . | 0.004

Full, Simplified and EPV only metamodels and abbreviations as defined in the footnote to Table 4.

The outcomes ΔAUC (Table 7) and CS (Table 8) were less well predicted by the eight-covariate metamodels, and results varied considerably between the prediction models (R2 between 49.6% and 84.8%). Similarly, for the simplified metamodel with three covariates, R2 was between 19.0% and 70.0%. R2 dropped even further for metamodels with EPV as the only covariate, to between 18.0% and 63.3%. The largest R2 was observed for ML regression. As in the other metamodels, ΔAUC and CS improved with higher N and lower P. The effect of the events fraction runs in the opposite direction to that for MSPE, MAPE and Brier: performance decreased with decreasing events fraction.
Table 7.

Results of simulation meta models: Outcome: ΔAUC × 100.

(Covariates EPV, N, events fraction and P entered on the natural log scale; AUC, Cor, Bin and Noise on the original scale. "." indicates covariate not included.)

Meta model | Int | EPV | N | Events fraction | P | AUC | Cor | Bin | Noise | R²
Full ML | −3.63 | . | 1.47 | 0.92 | −1.66 | 5.02 | −0.03 | −0.46 | −0.06 | 0.821
Simplified ML | −6.01 | . | 1.47 | 0.92 | −1.66 | . | . | . | . | 0.700
EPV only ML | −5.44 | 1.47 | . | . | . | . | . | . | . | 0.633
Full Firth | −3.56 | . | 1.46 | 0.92 | −1.66 | 5.05 | −0.03 | −0.47 | −0.06 | 0.822
Simplified Firth | −5.97 | . | 1.46 | 0.92 | −1.66 | . | . | . | . | 0.698
EPV only Firth | −5.42 | 1.46 | . | . | . | . | . | . | . | 0.632
Full HS | −5.60 | . | 1.63 | 1.11 | −1.53 | 4.05 | −0.03 | −0.08 | −0.07 | 0.665
Simplified HS | −7.05 | . | 1.63 | 1.11 | −1.53 | . | . | . | . | 0.614
EPV only HS | −5.96 | 1.63 | . | . | . | . | . | . | . | 0.571
Full Lasso | −6.11 | . | 1.93 | 1.24 | −1.63 | 7.30 | 2.11 | −0.45 | −0.08 | 0.713
Simplified Lasso | −8.73 | . | 1.93 | 1.24 | −1.63 | . | . | . | . | 0.580
EPV only Lasso | −6.95 | 1.93 | . | . | . | . | . | . | . | 0.528
Full Ridge | −3.14 | . | 0.98 | 0.62 | −0.91 | 3.38 | 2.18 | −0.42 | −0.03 | 0.684
Simplified Ridge | −4.47 | . | 0.98 | 0.62 | −0.91 | . | . | . | . | 0.515
EPV only Ridge | −3.70 | 0.98 | . | . | . | . | . | . | . | 0.468
Full ML(p) | −9.03 | . | 2.62 | 1.75 | −2.29 | 8.89 | 2.18 | −0.23 | −0.23 | 0.764
Simplified ML(p) | −11.94 | . | 2.62 | 1.75 | −2.29 | . | . | . | . | 0.645
EPV only ML(p) | −9.80 | 2.62 | . | . | . | . | . | . | . | 0.597
Full ML(AIC) | −5.79 | . | 1.91 | 1.25 | −1.87 | 6.64 | 0.99 | −0.33 | −0.17 | 0.797
Simplified ML(AIC) | −8.37 | . | 1.91 | 1.25 | −1.87 | . | . | . | . | 0.680
EPV only ML(AIC) | −7.13 | 1.92 | . | . | . | . | . | . | . | 0.626
Full Firth(p) | −9.92 | . | 2.75 | 1.81 | −2.33 | 8.72 | 2.36 | −0.22 | −0.22 | 0.751
Simplified Firth(p) | −12.72 | . | 2.75 | 1.81 | −2.33 | . | . | . | . | 0.646
EPV only Firth(p) | −10.25 | 2.75 | . | . | . | . | . | . | . | 0.592
Full Firth(AIC) | −6.18 | . | 1.98 | 1.28 | −1.89 | 6.61 | 1.10 | −0.33 | −0.17 | 0.785
Simplified Firth(AIC) | −8.73 | . | 1.98 | 1.28 | −1.89 | . | . | . | . | 0.677
EPV only Firth(AIC) | −7.34 | 1.98 | . | . | . | . | . | . | . | 0.621

Full, Simplified and EPV only metamodels and abbreviations as defined in the footnote to Table 4.

Table 8.

Results of simulation meta models: Outcome: CS.

(Covariates EPV, N, events fraction and P entered on the natural log scale; AUC, Cor, Bin and Noise on the original scale. "." indicates covariate not included.)

Meta model | Int | EPV | N | Events fraction | P | AUC | Cor | Bin | Noise | R²
Full ML | 0.50 | . | 0.15 | 0.09 | −0.16 | 0.65 | 0.00 | 0.00 | 0.00 | 0.848
Simplified ML | 0.31 | . | 0.15 | 0.09 | −0.16 | . | . | . | . | 0.689
EPV only ML | 0.40 | 0.15 | . | . | . | . | . | . | . | 0.616
Full Firth | 0.73 | . | 0.12 | 0.08 | −0.15 | 0.77 | 0.00 | −0.01 | 0.00 | 0.835
Simplified Firth | 0.50 | . | 0.12 | 0.08 | −0.15 | . | . | . | . | 0.556
EPV only Firth | 0.52 | 0.12 | . | . | . | . | . | . | . | 0.505
Full HS | 0.73 | . | 0.07 | 0.02 | −0.08 | 0.03 | 0.00 | −0.01 | 0.00 | 0.496
Simplified HS | 0.71 | . | 0.07 | 0.02 | −0.08 | . | . | . | . | 0.495
EPV only HS | 0.77 | 0.07 | . | . | . | . | . | . | . | 0.368
Full Lasso | 0.98 | . | 0.04 | 0.03 | −0.05 | 0.43 | 0.12 | 0.00 | −0.01 | 0.513
Simplified Lasso | 0.85 | . | 0.04 | 0.03 | −0.05 | . | . | . | . | 0.190
EPV only Lasso | 0.85 | 0.04 | . | . | . | . | . | . | . | 0.180
Full Ridge | 1.19 | . | −0.05 | −0.03 | 0.03 | −0.25 | 0.14 | 0.01 | 0.00 | 0.823
Simplified Ridge | 1.31 | . | −0.05 | −0.03 | 0.03 | . | . | . | . | 0.488
EPV only Ridge | 1.23 | −0.05 | . | . | . | . | . | . | . | 0.418
Full ML(p) | 0.51 | . | 0.15 | 0.09 | −0.15 | 0.65 | 0.09 | 0.00 | −0.01 | 0.832
Simplified ML(p) | 0.33 | . | 0.15 | 0.09 | −0.15 | . | . | . | . | 0.652
EPV only ML(p) | 0.42 | 0.15 | . | . | . | . | . | . | . | 0.588
Full ML(AIC) | 0.52 | . | 0.15 | 0.09 | −0.15 | 0.63 | 0.06 | 0.00 | −0.01 | 0.848
Simplified ML(AIC) | 0.33 | . | 0.15 | 0.09 | −0.15 | . | . | . | . | 0.682
EPV only ML(AIC) | 0.42 | 0.15 | . | . | . | . | . | . | . | 0.611
Full Firth(p) | 0.67 | . | 0.13 | 0.08 | −0.15 | 0.74 | 0.09 | −0.01 | −0.01 | 0.826
Simplified Firth(p) | 0.44 | . | 0.13 | 0.08 | −0.15 | . | . | . | . | 0.575
EPV only Firth(p) | 0.49 | 0.13 | . | . | . | . | . | . | . | 0.522
Full Firth(AIC) | 0.69 | . | 0.13 | 0.08 | −0.15 | 0.73 | 0.05 | −0.01 | −0.01 | 0.846
Simplified Firth(AIC) | 0.46 | . | 0.13 | 0.08 | −0.15 | . | . | . | . | 0.595
EPV only Firth(AIC) | 0.50 | 0.13 | . | . | . | . | . | . | . | 0.537

Full, Simplified and EPV only metamodels and abbreviations as defined in the footnote to Table 4.


5 Discussion

This paper has investigated the impact of EPV and other development data characteristics, in relation to modelling strategies, on the out-of-sample predictive performance of prediction models developed with logistic regression. We showed that EPV does not have a strong relation with metrics of predictive performance across modelling strategies. Given our findings, it is clear that EPV is not an appropriate sample size criterion for binary prediction model development studies. Below we discuss our simulation results, followed by a discussion of the implications for sample size determination for prediction model development. A new strategy for such sample size considerations is proposed.

5.1 Simulation findings

Our study confirms previous findings that predictive performance can be poor for prediction models developed using conventional maximum likelihood binary logistic regression in data with a small number of subjects relative to the number of predictors. As expected, predictive performance generally improved when regression shrinkage strategies were applied, while backwards elimination predictor selection strategies generally worsened the predictive accuracy of the prediction model. These tendencies were observed consistently for discrimination (ΔAUC), calibration slope (CS) and prediction error (rMSPE, MAPE and Brier) outcomes. Calibration in the large was near ideal for all models in all simulation settings, except for Firth regression, which showed upward biased estimation of probabilities (towards 1/2). Some more recent refinements of Firth's correction have shown promising results in circumventing these issues with calibration in the large.[21,48,58,59]

With larger sample sizes, the benefits (in terms of predictive performance) of the regression shrinkage strategies gradually declined, but predictive performance after shrinkage remained slightly superior or equivalent to ML regression even for larger sample sizes. Among the regression shrinkage strategies, Ridge regression showed the best discrimination (lowest average ΔAUC) and the lowest prediction error (lowest average rMSPE, MAPE and Brier) compared with Firth, Lasso and HS. Median CS of the HS and Lasso regressions was closer to optimal than that of Ridge regression, the latter showing signs of underfitting. The observed tendency of Ridge regression to underfit is consistent with other recent simulation studies.[16,48] In smaller samples, backwards elimination with the conventional p = 0.050 and AIC criteria generally performed worse than an equivalent regression without predictor selection or Lasso, even when only half of the candidate predictors were truly associated with the outcome.
For conditions with EPV as large as 50, backwards elimination was found to yield higher rMSPE and MAPE than the equivalent model with all variables left in. Between the backwards elimination criteria, the more conservative AIC criterion produced better average predictive performance than p = 0.050, in accordance with earlier work.[9,11] The metamodels fitted on the simulation results revealed that between-simulation variation of rMSPE, MAPE and Brier could largely be explained by a linear model with three covariates: sample size, events fraction and the number of candidate predictors. The joint effect of these three covariates on prediction error tended to become slightly weaker when regression shrinkage or variable selection strategies were applied. ΔAUC and CS were less well predicted by the metamodel regressions: they were particularly sensitive to the prediction model development strategy employed (e.g. whether regression shrinkage or predictor selection was used) and, importantly, dependent on the AUC of the data generating mechanism.

Some limitations apply to our study. The broad setup of our simulations, with over 4000 unique scenarios, allows generalization of the findings to a large variety of prediction modeling settings. However, as with any simulation study, the number of investigated scenarios was finite, and extrapolation of our findings far beyond the investigated regions is not advised. A total of nine prediction modeling strategies were investigated. In practice, we expect that approaches to regression shrinkage and predictor selection other than those we considered may sometimes be preferable (e.g. Elastic Net,[49] non-negative Garrotte,[60] random forest[61]). Finding optimal strategies for developing clinical prediction models in small or sparse data was not the main objective of the current study but is a worthwhile topic for future research.

5.2 Implications for sample size considerations

There is general consensus on the importance of having data of adequate size when developing a prediction model.[2] However, consensus is lacking on the criteria to determine what size counts as adequate. Our results showed that the recommended minimal EPV criteria for prediction model development, notably the EPV ≥ 10 rule,[34] fall short of providing appropriate sample size guidance. Earlier critiques of EPV as a sample size criterion have identified its weak theoretical and empirical underpinning,[17-20] and have shown that the EPV ≥ 10 rule can be too lenient[11,13] or too strict,[15,21] depending on the modelling approach taken. The current study also showed that EPV fails the minimal requirement of a strong relation to (at least one aspect of) predictive performance. By itself, EPV was found to have only a weak relation with outcomes of prediction error and a mediocre relation with calibration and discrimination. The EPV ≥ 10 rule also does not adequately account for changes in the events fraction. The relation implied by the EPV ≥ 10 rule between required sample size (N) and the events fraction is described by the function N = 10 × P / (events fraction), where the events fraction ≤ 1/2 (trivially, recoding of Y can ensure that the events fraction does not exceed 1/2). This relation is depicted in Figure 4. The figure shows that the relation between the events fraction and required N is in the same direction, but much steeper, for EPV than the relation between the events fraction and the sample size required to keep expected CS and ΔAUC constant. For the prediction error measures, the relation runs in the opposite direction.
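The sample size implied by an EPV rule is easy to compute directly; a small sketch (the helper name is ours, not from the paper):

```python
import math

def n_required_epv(p, events_fraction, epv=10):
    """Minimum N implied by an EPV rule: at least `epv` events per
    candidate predictor, where expected events = N * events_fraction."""
    return math.ceil(epv * p / events_fraction)

# With P = 8 candidate predictors, the required N rises steeply
# as the events fraction drops:
for ef in (0.5, 0.25, 0.1, 0.01):
    print(ef, n_required_epv(8, ef))  # 160, 320, 800, 8000
```

This 1/(events fraction) growth is the steep EPV curve contrasted with the flatter CS and ΔAUC curves in Figure 4.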
Figure 4.

Relation between required sample size and events fraction. Calculations are based on metamodels with criterion values kept constant. For illustration purposes, the criterion values were chosen such that the curves intersect at events fraction = 1/2.

The search for new minimal sample size criteria inherently calls for abandoning EPV as the sole sample size criterion. Alternatives for sample size must have a predictable relationship with future predictive performance and be on a scale that is interpretable to users. In our view, general single threshold values should be avoided. Instead, sample size determination should be based on threshold values, on an interpretable scale, that ensure predictive performance that is fit for purpose. What counts as fit for purpose varies from application to application (e.g. the requirements for clinical prediction models informing short-term high-risk treatment decisions may differ from those for long-term low-risk decisions). It is the duty of the researcher to define what constitutes fit for purpose in context and to explain how the sample size was arrived at (see also: the TRIPOD statement[2,3]).

5.3 New sample size criteria

Out-of-sample rMSPE and MAPE are natural metrics to determine sample size adequacy of prediction models, as they quantify the expected distance (squared or absolute, respectively) between the estimated probabilities for new individuals and their unobservable "true" values. Because clinical prediction models are primarily used to estimate probabilities for new individuals,[3-5] rMSPE and MAPE have direct relevance when developing a prediction model. The out-of-sample rMSPE and MAPE can be approximated via simulations, as we have done in this paper. Our simulation code is available via GitHub (https://github.com/MvanSmeden/Beyond-EPV). Alternatively, rMSPE and MAPE may be approximated via the results of our metamodels (Tables 4 and 5). For instance, at a sample size of N = 400, with P = 8 candidate predictors and an expected events fraction of 1/4, the predicted out-of-sample rMSPE is 0.065 when the ML model (without variable selection) is applied and 0.053 for Ridge regression; MAPE is 0.045 for the ML model and 0.038 for Ridge regression. Whether these expected "average" prediction errors on the probability scale are acceptable depends on the intended use of the prediction model (i.e. N = 400 may not be sufficient for accurate estimation of probabilities for high-risk treatment decisions, even though this example satisfies EPV ≥ 10, with EPV = 12.5). We warn readers that these out-of-sample performance predictions from the simulation metamodels have not been externally validated and that the approximations may not work well far outside the range of investigated simulation settings. In particular, using these approximations for sample size calculations with very low events fractions may yield unacceptably poor discrimination and calibration performance (see Figure 4).
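For readers who want to reproduce the worked example, the simplified metamodels can be evaluated directly from the coefficients in Tables 4 and 5 (intercept plus coefficients for ln N, ln events fraction and ln P; the ln(MSPE) prediction is back-transformed with exp and a square root to obtain rMSPE). A sketch using the printed (rounded) coefficients, subject to the same caveat about extrapolation:

```python
import math

# Simplified metamodel coefficients from Tables 4 (ln(MSPE)) and 5 (ln(MAPE)):
# (intercept, coef ln N, coef ln events fraction, coef ln P)
COEFS = {
    ("rMSPE", "ML"):    (-0.59, -1.06, 0.36, 0.94),
    ("rMSPE", "Ridge"): (-0.93, -0.88, 0.50, 0.49),
    ("MAPE",  "ML"):    (-0.48, -0.53, 0.31, 0.48),
    ("MAPE",  "Ridge"): (-0.61, -0.45, 0.36, 0.26),
}

def predict_error(metric, model, n, events_fraction, p):
    """Predicted out-of-sample error from a simplified metamodel."""
    b0, b_n, b_ef, b_p = COEFS[(metric, model)]
    ln_out = (b0 + b_n * math.log(n)
              + b_ef * math.log(events_fraction) + b_p * math.log(p))
    out = math.exp(ln_out)                     # back to the MSPE or MAPE scale
    return math.sqrt(out) if metric == "rMSPE" else out

# Worked example from the text: N = 400, P = 8, events fraction 1/4
for metric in ("rMSPE", "MAPE"):
    for model in ("ML", "Ridge"):
        print(metric, model, round(predict_error(metric, model, 400, 0.25, 8), 3))
```

With the rounded coefficients this reproduces the worked numbers to within about 0.001 (e.g. 0.064 vs. 0.065 for the ML rMSPE).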

6 Conclusion

The currently recommended sample size criteria for developing prediction models, notably the EPV ≥ 10 rule of thumb, are insufficient to warrant appropriate sample size decisions. EPV criteria fail to take into account the intended use of the prediction model and have only a weak relation to the out-of-sample predictive performance of the prediction model. Instead, sample size should be determined based on a meaningful out-of-sample predictive performance scale, such as rMSPE and MAPE. The results of our study can be used to inform sample size considerations when developing a binary prediction model, given the required predictive performance in new individuals.