How to develop a more accurate risk prediction model when there are few events.

Menelaos Pavlou¹, Gareth Ambler², Shaun R Seaman³, Oliver Guttmann⁴, Perry Elliott⁵, Michael King⁶, Rumana Z Omar².

Year: 2015 PMID: 26264962 PMCID: PMC4531311 DOI: 10.1136/bmj.h3868

Source DB: PubMed Journal: BMJ ISSN: 0959-8138

Risk prediction models are used in clinical decision making and to help patients make an informed choice about their treatment.
Model overfitting could arise when the number of events is small compared with the number of predictors in the risk model.
In an overfitted model, the probability of an event tends to be underestimated in low risk patients and overestimated in high risk patients.
In datasets with few events, penalised regression methods can provide better predictions than standard regression.

Risk prediction models that typically use a number of predictors based on patient characteristics to predict health outcomes are a cornerstone of modern clinical medicine.1 Models developed using data with few events compared with the number of predictors often underperform when applied to new patient cohorts.2 A key statistical reason for this is “model overfitting.” Overfitted models tend to underestimate the probability of an event in low risk patients and overestimate it in high risk patients, which could affect clinical decision making. In this paper, we discuss the potential of penalised regression methods to alleviate this problem and thus develop more accurate prediction models.

Statistical models are often used to predict the probability that an individual with a given set of risk factors will experience a health outcome, usually termed an “event.” These risk prediction models can help in clinical decision making and help patients make an informed choice regarding their treatment.3 4 5 6 Risk models are developed using several risk factors typically based on patient characteristics that are thought to be associated with the health event of interest (box 1). These predictors are usually selected on the basis of clinical experience and following a literature review. Given patient characteristics, the risk model can calculate the probability of a patient having the event.
However, before a risk model is used in clinical practice, the predictive ability of the model should be evaluated. This process is known as model validation, and involves an assessment of calibration (the agreement between the observed outcomes and predictions) and discrimination (the model’s ability to discriminate between low and high risk patients).2 Typically, the model is validated internally (for example, using bootstrapping7 in box 2) or externally using patient data not used for model development (box 1).

Box 1: Development and validation of a risk model with a binary outcome

Development

This stage is when information on a binary outcome and predictor variables in a patient cohort is obtained, and a risk model is constructed. For illustration, we consider here an example outcome and set of predictor variables.

Outcome: mechanical failure of artificial heart valve (yes v no)
Predictor variables: sex (score of 1=female), age (years), body surface area (BSA; m²), and whether a replacement valve came from a batch with fractures (score of 1=valve came from batch with fractures)

A risk model relates the risk of a patient experiencing an event to a set of predictors. A common choice is the logistic regression model, which takes the form:

Patient’s risk of valve failure = exp(patient’s risk score) ÷ (1+exp(patient’s risk score))

where patient’s risk score = intercept + (b_sex×sex) + (b_age×age) + (b_BSA×BSA) + (b_fracture×fracture), and b_sex, b_age, b_BSA, and b_fracture are regression coefficients that describe how a patient’s values of the predictor variables affect risk.

The regression coefficients are estimated as those values that optimise the ability of the model to predict the outcomes in the patient cohort. This is called “fitting the risk model,” and can be achieved using various methods, such as standard logistic regression, ridge, or lasso.

Prediction

To predict risk, the fitted risk model is used to calculate a risk score for each patient. For example, if the estimated regression coefficients are as follows:

b_sex = −0.193
b_age = −0.0497
b_BSA = 1.344
b_fracture = 1.261
Intercept = −4.25

The risk score for a 40 year old female patient with a body surface area of 1.7 m² and an artificial valve from a batch with fractures would then be calculated as:

−4.25 + (−0.193×1 (female sex)) + (−0.0497×40 (age in years)) + (1.344×1.7 (BSA in m²)) + (1.261×1 (fracture present in batch)) = −2.89

Therefore, her predicted risk would be:

exp(−2.89) ÷ (1+exp(−2.89)) = 5.3%
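The calculation in this box can be reproduced in a few lines of code. The following is a minimal sketch using the coefficients quoted above; the function and variable names are illustrative and are not part of the original article.

```python
import math

# Coefficients and intercept quoted in the worked example of box 1.
COEFFICIENTS = {"sex": -0.193, "age": -0.0497, "bsa": 1.344, "fracture": 1.261}
INTERCEPT = -4.25


def risk_score(patient):
    """Intercept plus the sum of coefficient x predictor value."""
    return INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in patient.items())


def predicted_risk(score):
    """Logistic transformation of a risk score into a probability."""
    return math.exp(score) / (1.0 + math.exp(score))


# 40 year old woman, BSA 1.7 m2, valve from a batch with fractures.
patient = {"sex": 1, "age": 40, "bsa": 1.7, "fracture": 1}
score = risk_score(patient)
risk = predicted_risk(score)
print(f"risk score = {score:.2f}, predicted risk = {risk:.1%}")
# prints: risk score = -2.89, predicted risk = 5.3%
```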

External validation

For external validation, a completely new cohort of patients with information on the same outcome and predictors is studied. The estimated regression coefficients (from the development phase) are used to predict the risks for patients in the new cohort. The agreement between the predicted risks and observed outcomes is assessed—that is, the model is validated by evaluating performance measures that assess, for example, calibration and discrimination.

Box 2: Bootstrap validation

Bootstrap validation may be used when no external cohort of patients is available. The aim is to estimate how good the performance of the prediction model developed on the development set (the original dataset) would be on a hypothetical set of new patients. A bootstrap dataset is an imitation of the original dataset and is constructed by the random sampling of patients “with replacement” (that is, a patient can be selected more than once) from the original dataset. Typically, a large number of bootstrap datasets (for example, 200) is created. Each dataset acts as a development dataset.

In the simplest form of internal validation for the performance measure of a calibration slope:

1. The model is fitted to each bootstrap dataset.
2. The estimated coefficients are used to obtain predictions for the patients in the original dataset.
3. These predictions are used to calculate the calibration slope for the fitted model.

The 200 estimates (one estimate for each bootstrap dataset) of the calibration slope are then averaged. For other performance measures, for example the area under the receiver operating characteristic (ROC) curve, optimism adjusted measures can be obtained using a similar procedure.

In practice, datasets used in risk model development often contain few events compared with the number of candidate predictors, particularly when the event of interest is rare. Examples include structural failure of mechanical heart valves8 and sudden cardiac death in patients with hypertrophic cardiomyopathy.6 In such situations, use of standard regression methods to develop risk models could accurately predict outcomes for patients in the dataset used to develop the model, but may often perform less well in a new patient group. This difference is because the fitted model captures not only the underlying clinical associations between the outcome and predictors, but also the random variation (noise) present in the development dataset.
This problem is called “model overfitting.” An overfitted model typically underestimates the probability of an event in low risk patients and overestimates it in high risk patients.2 This is known as poor calibration and has important consequences for clinical decision making. For example, overestimation of sudden cardiac death risk could lead to the unnecessary recommendation of implantable cardioverter defibrillators, exposing patients to surgical complications and wasting resources.6 This article focuses on ridge and lasso, two popular regression methods that can be used to alleviate the problem of model overfitting and are recommended in the TRIPOD checklist for developing and validating prediction models.9 Their ability to provide more accurate predictions than standard methods when there are few events is illustrated in two clinical examples.
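The simple bootstrap procedure described in box 2 can be sketched in code. The example below is an illustration under several assumptions that are not in the original article: the data are simulated, the model has a single predictor so that the logistic fit can be written out by hand, and 100 rather than 200 bootstrap datasets are used to keep the run short. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(x, y, max_iter=25, tol=1e-8):
    """Maximum likelihood fit of logit(p) = a + b*x by Newton-Raphson."""
    a = b = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(a + b * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient, intercept
            g1 += (yi - p) * xi     # gradient, slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            break
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b


def bootstrap_calibration_slope(x, y, n_boot=100, seed=1):
    """Average calibration slope over bootstrap datasets (box 2, steps 1 to 3)."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        a, b = fit_logistic([x[i] for i in idx], [y[i] for i in idx])  # step 1
        scores = [a + b * xi for xi in x]           # step 2: predict original patients
        _, slope = fit_logistic(scores, y)          # step 3: outcome regressed on score
        slopes.append(slope)
    return sum(slopes) / len(slopes)


# Simulated development data (an assumption made purely for illustration).
data_rng = random.Random(0)
x = [data_rng.gauss(0.0, 1.0) for _ in range(400)]
y = [1 if data_rng.random() < sigmoid(-1.0 + 1.2 * xi) else 0 for xi in x]

slope = bootstrap_calibration_slope(x, y)
print(f"bootstrap estimate of the calibration slope: {slope:.2f}")
```

With one predictor and several hundred patients there is little scope for overfitting, so the estimate should lie near 1. The same loop applies to a model with many predictors, where a multivariable fitting routine would replace `fit_logistic` in step 1.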

Sample size calculation for developing risk prediction models

When developing a risk model, a rule of thumb based on the events per variable (EPV) ratio is often used to determine the sample size. The EPV is the number of events in the data divided by the number of regression coefficients in the risk model. (Note that if variable selection is performed, the number of regression coefficients refers to the initial set of predictors, before variable selection.) It has been suggested that an EPV of 10 or more is needed to avoid the problem of overfitting.7 10 For example, a dataset should contain at least 60 events to fit a risk model with six regression coefficients. When the EPV is smaller than 10, the effect of overfitting is pronounced.11

The development of a risk model often begins with a systematic review of the literature and consultation with clinical experts to identify a set of candidate predictors. However, even when this procedure is followed, an EPV of 10 may be difficult to achieve in studies involving few events, and therefore researchers often consider ways to reduce the number of predictors before developing the model. There are two common strategies. The first is univariable screening, where each predictor’s relation with the outcome is examined individually and only statistically significant predictors are included in the risk model. The second strategy is stepwise model selection (for example, backwards elimination), where predictors that are not statistically significant at a prespecified P value are removed in a stepwise manner from a model that initially includes all candidate predictors. However, both approaches have serious drawbacks: for example, the predictor selection process may not be stable (small changes in the data or in the predictor selection process could lead to different predictors being included in the final model).7 11 12 13
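The EPV rule of thumb described in this section is simple arithmetic. A minimal sketch follows; the function names are illustrative, and the default target of 10 is the rule of thumb quoted above rather than a fixed requirement.

```python
def events_per_variable(n_events, n_coefficients):
    """EPV: number of events divided by number of regression coefficients."""
    if n_coefficients <= 0:
        raise ValueError("the model must have at least one regression coefficient")
    return n_events / n_coefficients


def minimum_events(n_coefficients, target_epv=10):
    """Events needed to reach the target EPV for a model of a given size."""
    return target_epv * n_coefficients


print(minimum_events(6))            # 60 events for six coefficients
print(events_per_variable(60, 6))   # 10.0
print(events_per_variable(56, 10))  # 5.6, as in the heart valve example
```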

Shrinkage methods

Another way to alleviate the problem of model overfitting is to use methods that tend to shrink the regression coefficients (towards zero). Shrinking the regression coefficients has the effect of moving poorly calibrated predicted risks towards the average risk, and could assist in making more accurate predictions when the model is applied in new patients.11 14 The simplest method is to shrink the regression coefficients by a common factor (for example, 20%) after they have been estimated by standard regression. This factor can be chosen using bootstrapping.7 15 However, this approach does not perform well if the EPV is very low,14 and we do not discuss it further. An alternative approach, which is the focus of this paper, is to incorporate shrinkage as part of the model fitting procedure.

Penalised regression

Penalised regression is a flexible shrinkage approach that is effective when the EPV is low (<10). It aims to fit the same statistical model as standard regression but uses a different estimation procedure. The process of fitting a penalised regression model is as follows. Firstly, the form of the risk model (for example, logistic or Cox regression for binary and survival data, respectively) is specified using all candidate predictors. Next, the model is fitted to the data to estimate the regression coefficients. In standard logistic or Cox regression, the coefficients are estimated without imposing any constraints on their values. In datasets with few events, the range of the predicted risks is too wide as a result of overfitting, but this range can be reduced by shrinking the regression coefficients towards zero. Penalised regression achieves this by placing a constraint on the values of the regression coefficients. The penalised regression coefficient estimates are typically smaller than those from standard regression. Several penalised methods that use different constraints have been proposed.13 16 17 We focus on ridge and lasso,14 arguably the two most popular shrinkage methods.

Ridge regression

Ridge fits the risk model under the constraint that the sum of the squared regression coefficients does not exceed a particular threshold.17 18 The threshold is chosen to maximise the model’s predictive ability, using cross validation. In cross validation, the dataset is split into k groups. The model is fitted to k−1 of the groups and validated on the omitted group. This procedure is repeated k times, each time omitting a different group.
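The cross validation loop described above can be sketched as follows. This is an outline under stated assumptions rather than the procedure used by any particular software package: `fit` and `score` are hypothetical functions supplied by the analyst (for example, a ridge fit for a given threshold and a measure of predictive ability on the omitted group), and all names are illustrative.

```python
import random


def k_fold_splits(n_patients, k, seed=0):
    """Yield (training, validation) index lists, omitting one group each time."""
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k groups of near-equal size
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]


def choose_by_cross_validation(candidates, fit, score, n_patients, k=10):
    """Return the candidate tuning value with the best mean validation score.

    fit(training_indices, value) returns a fitted model;
    score(model, validation_indices) returns a number, higher meaning better.
    """
    splits = list(k_fold_splits(n_patients, k))  # same groups for every candidate

    def mean_score(value):
        return sum(score(fit(train, value), valid) for train, valid in splits) / k

    return max(candidates, key=mean_score)


# Each patient is omitted exactly once across the k repetitions.
for training, validation in k_fold_splits(n_patients=10, k=5):
    print(sorted(validation), "omitted;", len(training), "patients used for fitting")
```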

Lasso regression

Lasso is similar to ridge, but constrains the sum of the absolute values of the regression coefficients.16 Unlike ridge, lasso can effectively exclude predictors from the final model by shrinking their coefficients to exactly zero.

Both ridge and lasso regression are readily available in software such as R (for example, package “penalized”) and SPSS. In health research, where a set of prespecified predictors is often available, ridge regression is usually the preferred option.14 However, lasso might be preferred if a simpler model with fewer predictors (without affecting the predictive ability of the model) is desired, for example, to save time or resources by collecting less information on patients.
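The two constraints differ only in the quantity that is bounded. As an illustration, the sketch below computes both quantities for the coefficient estimates reported in the table in the section on the application of penalised regression. The intercept is left out on the assumption that it is not penalised, and because software may apply the penalty after rescaling the predictors, the sums on the original scale are indicative only. The function names are illustrative.

```python
# Coefficient estimates from the table (intercept excluded), in table order:
# sex, age, BSA, five valve size or position indicators, fracture, date.
standard = [-0.24, -0.052, 1.98, 1.43, 1.30, 1.95, 2.62, 2.58, 0.59, 1.38]
ridge = [-0.14, -0.047, 1.52, 0.36, 0.22, 0.80, 1.38, 1.41, 0.69, 1.02]
lasso = [-0.16, -0.050, 1.75, 0.61, 0.43, 1.13, 1.77, 1.73, 0.64, 1.22]


def sum_of_squares(coefficients):
    """Quantity bounded by the ridge constraint."""
    return sum(b * b for b in coefficients)


def sum_of_absolute_values(coefficients):
    """Quantity bounded by the lasso constraint."""
    return sum(abs(b) for b in coefficients)


for name, coefs in [("standard", standard), ("ridge", ridge), ("lasso", lasso)]:
    print(f"{name:8s} sum of squares = {sum_of_squares(coefs):6.2f}, "
          f"sum of absolute values = {sum_of_absolute_values(coefs):6.2f}")
```

Both sums are smaller for the penalised estimates than for the standard estimates, which is the sense in which the coefficients have been shrunk.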

How to detect model overfitting

An overfitted model could be detected through an assessment of model calibration using either an internal validation technique or external validation.7 This may be done by dividing the patients into risk groups according to their predicted risk, and comparing the proportion of patients who experienced the event in each group with the average predicted risk in that group, using a graph (calibration plot2) or table (which leads to the Hosmer-Lemeshow test19). Alternatively, the degree of overfitting may be quantified using a simple regression model. For binary outcomes, the outcomes in the validation data are regressed using logistic regression on their predicted risk scores (box 1). If the model is well calibrated, the estimated slope (or calibration slope) should be close to 1, whereas an overfitted model would have a slope much less than 1, indicating that low risks are underestimated and high risks are overestimated.2
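The grouped comparison described above, which underlies a calibration plot or table, can be sketched as follows. The example data are invented for illustration, the choice of equally sized groups is an assumption, and the function name is hypothetical.

```python
def calibration_table(predicted_risks, outcomes, n_groups=4):
    """Group patients by predicted risk; compare mean prediction with observed proportion.

    Returns one (group size, mean predicted risk, observed proportion) tuple
    per risk group, ordered from lowest to highest predicted risk.
    """
    pairs = sorted(zip(predicted_risks, outcomes))
    n = len(pairs)
    rows = []
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        mean_predicted = sum(p for p, _ in group) / len(group)
        observed = sum(o for _, o in group) / len(group)
        rows.append((len(group), mean_predicted, observed))
    return rows


# Invented example: eight patients, outcome 1 = event occurred.
risks = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
events = [0, 0, 0, 1, 0, 1, 1, 1]
for size, mean_predicted, observed in calibration_table(risks, events, n_groups=2):
    print(f"n = {size}, mean predicted risk = {mean_predicted:.2f}, observed = {observed:.2f}")
```

In a well calibrated model the two columns agree in every group; in an overfitted model the mean predicted risk exceeds the observed proportion in the highest risk group and falls below it in the lowest.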

Application of penalised regression

The use of ridge and lasso methods can be illustrated by using data for 3118 patients with mechanical heart valves.8 The event of interest was the mechanical failure of the artificial valve, which occurred in only 56 individuals. The candidate predictors in this analysis were patient age, sex, BSA, fractures in the batch of the valve (no v yes), year of valve manufacture (before 1981 v after 1981), and valve size or position modelled using six clinically meaningful combinations constructed according to their expected levels of risk. A logistic regression model was used for illustrative purposes, with 10 coefficients. The EPV is 56/10=5.6, well below the recommended minimum of 10. Standard, ridge, and lasso regression were used to estimate the regression coefficients shown in the table. We also used backwards elimination (with a 15% significance level14), which excluded the variable sex from the model (coefficients not shown).

Estimates of regression coefficients, calculated by standard regression and penalised methods

Predictors	Descriptive statistics†	Standard regression coefficient*	Ridge regression coefficient (% shrinkage)*	Lasso regression coefficient (% shrinkage)*
Intercept	—	−7.80	−5.97 (23)	−6.65 (15)
Sex (female)	1337 (43)	−0.24	−0.14 (41)	−0.16 (34)
Age (years)	54.1 (10.8)	−0.052	−0.047 (11)	−0.050 (4)
Body surface area (m²)	1.6 (0.3)	1.98	1.52 (24)	1.75 (12)
Aortic size 23, 27, 29, 31 mm	692 (22)	1.43	0.36 (75)	0.61 (68)
Mitral size 23-27 mm	369 (12)	1.30	0.22 (84)	0.43 (67)
Mitral size 29 mm	611 (20)	1.95	0.80 (59)	1.13 (42)
Mitral size 31 mm	656 (21)	2.62	1.38 (47)	1.77 (33)
Mitral size 33 mm	104 (3)	2.58	1.41 (45)	1.73 (33)
Fracture in batch (yes)	1108 (35)	0.59	0.69 (−17)	0.64 (−9)
Date of manufacture (after 1981)	2363 (76)	1.38	1.02 (26)	1.22 (12)

*For ridge and lasso methods, numbers in brackets are percentages and represent the shrinkage compared with the standard regression estimates.

†For descriptive statistics, data are mean (standard deviation) for continuous predictors (age and body surface area), and number (percentages) for binary predictors.
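The bracketed percentages in the table express each penalised estimate relative to the standard estimate. A minimal sketch of that calculation follows, using three rows of the table; the function name is illustrative. Values recomputed from the rounded coefficients can differ by a point or so from the published percentages, presumably because those were derived from unrounded estimates.

```python
def percent_shrinkage(standard, penalised):
    """Percentage by which a penalised estimate is smaller than the standard one.

    Negative values mean the penalised estimate is larger in magnitude.
    """
    return 100.0 * (1.0 - penalised / standard)


# (standard, ridge, lasso) estimates copied from three rows of the table.
rows = {
    "Intercept": (-7.80, -5.97, -6.65),
    "Body surface area": (1.98, 1.52, 1.75),
    "Fracture in batch": (0.59, 0.69, 0.64),
}
for predictor, (standard, ridge, lasso) in rows.items():
    print(f"{predictor}: ridge {percent_shrinkage(standard, ridge):.0f}%, "
          f"lasso {percent_shrinkage(standard, lasso):.0f}%")
```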

The ridge and lasso coefficients were reduced compared with those from the standard regression model, with the greatest shrinkage applied to the valve size and position predictors (45-84% shrinkage for ridge and 33-68% for lasso). The shrinkage is reflected by the predicted risks, especially for high risk patients. Consider, for example, a female patient aged 20.5 years and with 1.7 m² BSA, who had a 31 mm mitral valve manufactured after 1981 from a batch without fractured implants. Using the estimated coefficients from standard regression (table), the risk score for this patient is calculated by the following formula:

Risk score = −7.8 (intercept) + (−0.24×1(female sex)) + (−0.052×20.5(age; years)) + (1.98×1.7(BSA; m²)) + (2.62×1(mitral size 31 mm)) + (0.589×0(no fracture)) + (1.38×1(date of manufacture after 1981)) = −1.714.

Therefore, the predicted risk of mechanical failure is:

exp(−1.714) ÷ (1+exp(−1.714)) = 18% (average risk is 1.8%).

When the estimated coefficients from ridge and lasso are used instead, the predicted risks are less extreme: 12% and 15%, respectively. Figure 1 confirms that there are fewer extreme risk scores after applying shrinkage.

Fig 1 Distribution of predicted risk scores estimated using standard, ridge, and lasso regression

The predictive performances of the risk models (developed using standard regression, backwards elimination, ridge, and lasso) were assessed using bootstrap validation (box 2).7 Calibration was assessed using the calibration slope and a calibration plot. Discrimination was measured by the commonly used area under the ROC curve measure, where a value of 1 suggests perfect discrimination and a value of 0.5 suggests no discrimination.

Standard regression produced an overfitted model (calibration slope 0.76 (95% confidence interval 0.65 to 0.99)), whereas the models from ridge and lasso demonstrated far better calibration (calibration slopes of 1.01 and 0.94, respectively). The calibration plot in figure 2 shows the observed proportion of patients who experienced the event and the average of their predicted risks in each of the four groups. Clearly, the standard risk model severely overestimates the risk of valve fracture for patients at the highest risk, which in practice might lead to patients undergoing unnecessary valve explant surgery. All three risk models (from standard, ridge, and lasso regression) demonstrated similar discrimination (all ROC areas 0.80 (95% confidence interval 0.78 to 0.82)). Backwards elimination also produced an overfitted model (calibration slope 0.77) with similar discrimination (ROC area 0.795).

A second example illustrates the external validation of risk models (based on Cox regression) for sudden cardiac death in patients with hypertrophic cardiomyopathy (web appendix).

Fig 2 Observed proportions versus average predicted risk of the event (using standard, ridge and lasso regression). Overestimation of risk for high risk patients can be seen when standard regression is used
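The area under the ROC curve used above to measure discrimination can be computed directly from predicted risks and outcomes. It equals the proportion of all pairs (one patient with the event, one without) in which the patient with the event has the higher predicted risk, with ties counted as one half. The sketch below uses invented data and an illustrative function name; it is written for clarity rather than speed.

```python
def roc_area(predicted_risks, outcomes):
    """Area under the ROC curve, computed by comparing all event/non-event pairs."""
    with_event = [p for p, o in zip(predicted_risks, outcomes) if o == 1]
    without_event = [p for p, o in zip(predicted_risks, outcomes) if o == 0]
    if not with_event or not without_event:
        raise ValueError("both events and non-events are needed")
    concordant = 0.0
    for p_event in with_event:
        for p_none in without_event:
            if p_event > p_none:
                concordant += 1.0   # correctly ordered pair
            elif p_event == p_none:
                concordant += 0.5   # tie
    return concordant / (len(with_event) * len(without_event))


print(roc_area([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))    # 1.0, perfect discrimination
print(roc_area([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1]))    # 0.5, no discrimination
print(roc_area([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))   # 0.75
```

Because the measure depends only on the ordering of the predicted risks, shrinking all predictions towards the average risk need not change it, which is consistent with the similar ROC areas reported for the standard, ridge, and lasso models.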

Conclusion

When the number of events is low relative to the number of predictors in the risk model, standard regression may produce overfitted risk models that make inaccurate predictions. Common approaches to reduce the number of predictors in a risk model, such as stepwise selection or univariable screening, are problematic and should be avoided.7 14 Often the EPV can still be small (<10) even after existing knowledge has been used to eliminate some of the initial candidate predictors. In such cases, it is recommended that the use of penalised regression methods be explored. Risk models produced using penalised regression generally show improved calibration, and could also show improved discrimination.14

Other methods could be more appropriate in some situations.13 Notably, there may be scenarios where existing evidence (from published risk models, meta-analysis, and expert opinion) can be incorporated in the estimation procedure. These contributions could lead to better predictions than those obtained from ridge and lasso.20 21 In this paper, we focused on the issue of model overfitting, but small datasets and datasets with few events are also susceptible to other problems, especially when binary predictors with a very low (or high) prevalence are present; in such scenarios other methods may be more suitable than ridge and lasso.22

15 in total

1. Prognostic modeling with logistic regression analysis: in search of a sensible strategy in small data sets.

Authors: E W Steyerberg; M J Eijkemans; F E Harrell; J D Habbema
Journal: Med Decis Making Date: 2001 Jan-Feb Impact factor: 2.583

2. Outlet strut fracture of Björk-Shiley convexo concave heart valves: the UK cohort study.

Authors: R Z Omar; L S Morton; D A Halliday; E M Danns; M T Beirne; W J Blot; K M Taylor
Journal: Heart Date: 2001-07 Impact factor: 5.994

3. A solution to the problem of separation in logistic regression.

Authors: Georg Heinze; Michael Schemper
Journal: Stat Med Date: 2002-08-30 Impact factor: 2.373

4. An evaluation of penalised survival methods for developing prognostic models with rare events.

Authors: G Ambler; S Seaman; R Z Omar
Journal: Stat Med Date: 2011-10-14 Impact factor: 2.373

5. Generic, simple risk stratification model for heart valve surgery.

Authors: Gareth Ambler; Rumana Z Omar; Patrick Royston; Robin Kinsman; Bruce E Keogh; Kenneth M Taylor
Journal: Circulation Date: 2005-07-05 Impact factor: 29.690

6. Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates.

Authors: P Peduzzi; J Concato; A R Feinstein; T R Holford
Journal: J Clin Epidemiol Date: 1995-12 Impact factor: 6.437

7. Penalized likelihood in Cox regression.

Authors: P J Verweij; H C Van Houwelingen
Journal: Stat Med Date: 1994 Dec 15-30 Impact factor: 2.373

8. Assessing the performance of prediction models: a framework for traditional and novel measures.

Authors: Ewout W Steyerberg; Andrew J Vickers; Nancy R Cook; Thomas Gerds; Mithat Gonen; Nancy Obuchowski; Michael J Pencina; Michael W Kattan
Journal: Epidemiology Date: 2010-01 Impact factor: 4.822

9. General cardiovascular risk profile for use in primary care: the Framingham Heart Study.

Authors: Ralph B D'Agostino; Ramachandran S Vasan; Michael J Pencina; Philip A Wolf; Mark Cobain; Joseph M Massaro; William B Kannel
Journal: Circulation Date: 2008-01-22 Impact factor: 29.690

10. Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study.

Authors: Julia Hippisley-Cox; Carol Coupland; Yana Vinogradova; John Robson; Margaret May; Peter Brindle
Journal: BMJ Date: 2007-07-05
