
Assessment of heterogeneity in an individual participant data meta-analysis of prediction models: An overview and illustration.

Ewout W Steyerberg, Daan Nieboer, Thomas P A Debray, Hans C van Houwelingen.

Abstract

Clinical prediction models aim to provide estimates of absolute risk for a diagnostic or prognostic endpoint. Such models may be derived from data from various studies in the context of a meta-analysis. We describe and propose approaches for assessing heterogeneity in predictor effects and predictions arising from models based on data from different sources. These methods are illustrated in a case study with patients suffering from traumatic brain injury, where we aim to predict 6-month mortality based on individual patient data using meta-analytic techniques (15 studies, n = 11 022 patients). The insights into various aspects of heterogeneity are important to develop better models and understand problems with the transportability of absolute risk predictions.


Keywords: heterogeneity; meta-analysis; prediction; regression modeling

Year: 2019; PMID: 31373722; PMCID: PMC6772012; DOI: 10.1002/sim.8296

Source DB: PubMed; Journal: Stat Med; ISSN: 0277-6715; Impact factor: 2.373

INTRODUCTION

Clinical prediction models aim to provide estimates of absolute risk of an endpoint. Common endpoints are the presence of a disease (establishing a diagnosis according to a reference standard) and the occurrence of a future event (prognosis, eg, mortality within 30 days, within 6 months, or longer follow‐up).1 Prediction models are increasingly common in the medical literature, and multiple models may be available for the same type of patients for similar endpoints.2 Published prediction models often use different predictors to derive predictions for individual patients.3 Moreover, many prediction models are developed in relatively small samples from a specific setting, eg, a single hospital.4 Prediction models that are developed from small samples are prone to statistical overfitting, and may therefore have poor accuracy when applied to new patients.5 Applying penalization or shrinkage techniques may limit such problems,6, 7 but better prediction models can be derived with larger numbers of patients. If these larger numbers of individual patient data (IPD) come from different sources, we may aim to develop a global prediction model, with improved validity across multiple settings or populations. A global model can, for instance, be derived by merging all IPD sets and estimating a common baseline risk and set of predictor effects. This strategy clearly obfuscates possible differences between studies and ignores clustering of patients within studies. Several more advanced strategies have recently been proposed. Access to data on large numbers of patients from different settings allows us to assess between‐setting heterogeneity, following principles from meta‐analysis (MA).8, 9, 10, 11 In the current paper, we aim to describe and propose approaches for assessing heterogeneity in predictor effects and predictions arising from prediction models based on data from different studies. 
We consider between‐study heterogeneity with respect to missing values, covariate and endpoint distribution, and model performance. Such assessment of heterogeneity may serve two purposes: to support or refute the idea of a global prediction model; to appropriately indicate the uncertainty when applying the global model across different populations. This paper starts with an overview of some key characteristics of commonly used regression models to estimate absolute risk, and some background on a case study where we develop a global model based on IPD from 15 studies to predict 6‐month mortality after traumatic brain injury (TBI).12 Section 3 considers characteristics of the included studies and differences in study design and included patients and differences in case‐mix, while Section 4 focuses on dealing with missing values. Heterogeneity in predictor effects and predictions is discussed in Sections 5 and 6. Section 7 is a general discussion, where we not only consider the situation of having access to IPD from each study but also variants such as having access to only one IPD data set, or no IPD at all. We end with some reflections on the impact of heterogeneity on model performance and model applicability.

PREDICTION MODELS AND MA

Common types of regression models

The most common prediction problems in medicine concern binary endpoints, where logistic regression models are often used to estimate the probability that a certain endpoint Y is present or will occur conditional on the 1 × p row vector of predictors $x$, where p is the number of predictors, ie,

$$P(Y = 1 \mid x) = \frac{1}{1 + \exp\{-(\alpha + x\beta)\}}.$$

Here, $\alpha$ is the model intercept, and $\beta$ represents a p × 1 vector of regression coefficients reflecting the relative effects of the predictor values $x$. We refer to $x\beta$ as the linear predictor or prognostic index, which summarizes the effects of the predictors $x$.5 The intercept is kept separate from the linear predictor. In terms of odds of the endpoint, we can also write

$$\frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)} = \exp(\alpha)\,\exp(x\beta).$$

For time‐to‐event endpoints, such as survival, the baseline risk is dependent on time and can therefore no longer be summarized by a single constant. For this reason, time‐to‐event endpoints are commonly modeled with Cox regression

$$h(t \mid x) = h_0(t)\,\exp(x\beta).$$

We notice that the logistic regression model contains a constant $\alpha$, while the Cox model contains a nonparametric baseline hazard $h_0(t)$ that plays the role of a generalized constant. Both reflect the baseline risk in a prediction model. We can make the baseline risk more interpretable by subtracting the mean value for the predictors $x$, as is common for the Cox regression model. Moreover, we might smooth the baseline hazard to facilitate calculation of absolute risks.13 Here, we focus on IPD‐MA using logistic regression models. Extensions of these methods to Cox regression models are discussed in Appendix A, along with situations where one IPD is available and where no IPD is available.

Meta‐analysis for prediction

Similar to the MA of randomized trials, it is readily possible to summarize parameter estimates from multiple studies by calculating a weighted average for the intercept term and regression coefficients.14 Issues of interest are summary estimates of baseline risk and predictor effects, as well as corresponding estimates of between‐study heterogeneity. Even more important is the heterogeneity in the linear predictor and absolute risk predictions. These predictions depend on the joint effects of all predictors and baseline risk and need to be reasonably similar across studies for a prediction model to be labeled “generalizable.” We hence focus on three aspects of heterogeneity for predictions, namely, in baseline risk, predictor effects, and the linear predictor (which is directly linked to absolute risk predictions).

Case study

For illustrative purposes, we analyze 15 studies of patients suffering from TBI, including IPD from 11 randomized controlled trials and four observational studies. These studies were part of the IMPACT project, where a total of 25 prognostic factors were considered for prediction of 6‐month mortality.15 Mortality occurred in 20% to 40% of the patients, while follow‐up was nearly complete for the 6‐months status (Table 1). Three different models were developed of increasing complexity.12 The core model used three key predictors, ie, age, the assessment of Glasgow Coma Scale motor score at admission, and pupillary reactivity at admission. The CT model contained the predictors of the core model, secondary insults (hypoxia and hypotension) and results from a CT scan (Marshall CT classification system, traumatic subarachnoid hemorrhage and epidural hematoma). The most elaborate model contained the predictors of the CT model and results from lab tests (glucose and hemoglobin levels).12 For our case study, we focus on the global CT model. This model was fitted with study as a main effect, and common effects for the predictors. R code to perform the analyses is available from the authors (R version 3.5.0, The R Project for Statistical Computing). The data that support the findings of this study are available from the corresponding author upon reasonable request.

Table 1

Description of 15 IMPACT data sets of 11 022 patients with traumatic brain injury (TBI)

Nr.	Name	Enrollment period	Type	n
1	TINT	1991–1994	RCT	1118
2	TIUS	1991–1994	RCT	1041
3	SLIN	1994–1996	RCT	409
4	SAP	1995–1997	RCT	919
5	PEG	1993–1995	RCT	1510
6	HIT I	1987–1989	RCT	350
7	UK4	1986–1988	OBS	791
8	TCDB	1984–1987	OBS	603
9	SKB	1996–1996	RCT	126
10	EBIC	1995–1995	OBS	822
11	HIT II	1989–1991	RCT	819
12	NABIS	1994–1998	RCT	385
13	CSTAT	1996–1997	RCT	517
14	PHARMOS	2001–2004	RCT	856
15	APOE	1996–1999	OBS	756

Type of study, RCT: randomized controlled trial, OBS: observational cohort


STUDY CHARACTERISTICS

Heterogeneity in study design

Meta‐analysis requires a reasonable degree of similarity between studies to provide meaningful summary estimates. It is therefore important to consider whether the included studies have major differences in their design, selection of subjects, and setting, as this may affect the baseline risk and/or predictor‐endpoint relations. For the 15 TBI studies, four were performed in relatively unselected populations (“surveys,” observational cohorts), with broad inclusion criteria. Inclusion criteria were stricter for 11 randomized controlled trials. The studies also varied in the calendar time of enrollment of patients, and one study was rather small (#9, SKB, Table 1).

Heterogeneity in case‐mix

Between‐study heterogeneity in case‐mix is a common source of heterogeneity in baseline risk and predictor‐endpoint associations. Briefly, heterogeneity in case‐mix occurs when the distribution of patient characteristics varies across studies. In the TBI case study, there appear to be systematic differences between observational studies and RCTs in terms of observed mortality and patient characteristics (ie, the case‐mix distribution) (Table 2). Case‐mix variability was particularly high in the 4 observational studies, which we quantified by the standard deviation of the linear predictor of the global model fitted using all studies. We might also quantify this heterogeneity by study‐specific models, but the standard deviation of the linear predictor would then depend on both case‐mix and estimated coefficients.
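As a rough illustration of the case‐mix summary described above, the following sketch computes the mean and standard deviation of the linear predictor per study from one common set of coefficients. The coefficients, predictor values, and study labels are hypothetical placeholders chosen for illustration; they are not taken from the case study.

```python
from statistics import mean, stdev

def linear_predictor(x, beta):
    """Linear predictor x*beta for one patient (intercept kept separate)."""
    return sum(xi * bi for xi, bi in zip(x, beta))

def case_mix_summary(patients_by_study, beta):
    """Mean and SD of the linear predictor per study, using common coefficients."""
    summary = {}
    for study, rows in patients_by_study.items():
        lps = [linear_predictor(x, beta) for x in rows]
        summary[study] = (mean(lps), stdev(lps))
    return summary

# Hypothetical common coefficients for two predictors (illustration only)
beta = [0.3, 0.6]
# Hypothetical predictor values: one study with a narrow and one with a broad case-mix
patients_by_study = {
    "trial": [[5.0, 0.0], [5.5, 0.0], [6.0, 1.0], [5.2, 0.0]],
    "cohort": [[2.0, 0.0], [8.0, 1.0], [4.0, 0.0], [9.0, 2.0]],
}
summary = case_mix_summary(patients_by_study, beta)
for study, (m, s) in summary.items():
    print(f"{study}: mean lp = {m:.2f}, SD lp = {s:.2f}")
```

Because the coefficients are held fixed across studies, a larger SD here reflects a broader distribution of patient characteristics only, which is the reason given above for preferring the global model over study‐specific models for this purpose.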

Table 2

Six‐month mortality, case‐mix distribution, and discriminative ability of the membership model in identifying membership of a specific study

Nr.	Name	6‐month mortality	Mean lp	SD lp	Membership c‐statistic
1	TINT	25%	−1.42	1.23	0.62
2	TIUS	22%	−1.60	1.13	0.65
3	SLIN	23%	−1.42	0.99	0.76
4	SAP	23%	−1.44	1.02	0.60
5	PEG	24%	−1.51	1.26	0.67
6	HIT I	28%	−1.23	1.35	0.68
7	UK4	45%	−0.27	1.77	0.64
8	TCDB	44%	−0.36	1.74	0.67
9	SKB	27%	−1.19	0.99	0.75
10	EBIC	34%	−0.98	1.81	0.63
11	HIT II	23%	−1.49	1.10	0.63
12	NABIS	26%	−1.27	1.08	0.65
13	CSTAT	22%	−1.57	1.16	0.61
14	PHARMOS	17%	−1.78	0.79	0.68
15	APOE	15%	−2.45	1.65	0.73

lp: linear predictor, based on a common prediction model and study‐specific predictor values; membership c statistic: discriminative ability to separate a specific study from all other studies, where a high c‐statistic reflects substantial differences in baseline characteristics and outcome.

In the TBI case study, we further note substantial differences in the incidence of the end point (6‐month mortality): 17% in the most recent trial (study #14) versus 42% in a survey (study #7). The difference in mortality rate may partly be explained by study design, since RCTs only included patients who survived long enough to be included in the trial, while the observational studies also included patients who died shortly after arriving at a hospital. It is also possible to inspect the distributions of individual predictors. These are rather different between studies (Figure 1). We note that this inspection does not take into account the possible correlation between predictors.

Figure 1

Distribution of patient characteristics in 15 studies with 11 022 traumatic brain injury patients, after single imputation of missing values [Colour figure can be viewed at wileyonlinelibrary.com]

Finally, a summary measure of case‐mix similarity between studies can be obtained using a membership model, where we quantify how well we can separate patients from different studies from each other (using the c‐statistic).16 We therefore developed a membership model using multinomial logistic regression (Table 2), where study membership was the outcome. We included all predictor variables of the CT model and 6‐month mortality as covariates. A variant of this membership model might include only predictor variables. The c‐statistic of the membership model can be calculated by comparing the predicted probabilities for patients from one study with the predicted probabilities of patients not included in that study. We found that studies 3, 9, and 15 were somewhat different from the other studies, with c‐statistics above 0.70, based on the distribution of patient characteristics and mortality.
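The comparison of predicted membership probabilities can be sketched as follows. This is a minimal illustration that assumes the multinomial membership model has already been fitted elsewhere. The probabilities below are made up, and `c_statistic` is a hypothetical helper name.

```python
def c_statistic(scores_members, scores_others):
    """Probability that a randomly chosen member of the study has a higher
    predicted membership probability than a randomly chosen non-member
    (ties count as 1/2)."""
    concordant = 0.0
    for m in scores_members:
        for o in scores_others:
            if m > o:
                concordant += 1.0
            elif m == o:
                concordant += 0.5
    return concordant / (len(scores_members) * len(scores_others))

# Hypothetical predicted probabilities of belonging to one specific study
p_members = [0.30, 0.22, 0.15, 0.41]        # patients who are in that study
p_others = [0.05, 0.12, 0.22, 0.08, 0.18]   # patients from all other studies
c = c_statistic(p_members, p_others)
print(f"membership c-statistic = {c:.3f}")
```

A value near 0.5 would indicate that patients of the study cannot be told apart from the other patients, while values approaching 1 indicate a distinct case‐mix. The pairwise loop is quadratic in the number of patients, so a rank‐based calculation would be preferable for data sets of the size of the case study.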

MISSING PREDICTOR VALUES AND MA

Missing values and imputation

Imputation of missing values poses specific challenges in the context of development or validation of a global prediction model. Advanced imputation approaches may be required to fully address between‐study heterogeneity in the correlation structure between predictors and endpoint. Ignoring such heterogeneity in the imputation procedure may lead to bias in the estimated coefficients and their associated standard errors, to bias in estimates of between‐study heterogeneity, and to model validation results that are too optimistic.11, 17 More advanced multilevel imputation methods have recently been proposed to impute missing values in large, clustered data sets such as IPD‐MA and can also be used to impute covariates that are systematically missing for one or more studies.18, 19, 20, 21, 22, 23 In our case study, we applied the simpler approach used previously by Steyerberg et al,12 where the authors imputed missing predictors using the study as a fixed effect in the imputation model. This imputation model admittedly only adjusts for heterogeneity in the levels and prevalence of missing predictors.

HETEROGENEITY IN COMBINATIONS OF PREDICTOR EFFECTS

Estimating stratified predictor effects

After considering general between‐study heterogeneity and imputing missing values, an important step is to estimate predictor‐endpoint associations across the available data sets. Ideally, the global prediction model is prespecified. Prior knowledge and/or clinical expertise may have guided the selection of predictors as well as the choice of linear or nonlinear forms in case of continuous predictors.6, 24 In practice, some form of selection may be based on predictor‐endpoint associations observed across the set of studies considered in the MA. Such a selection will cause only limited bias if sample sizes are large and the number of candidate predictors small. We first discuss full stratification by study. Subsequently, we discuss several simplifications. The presence of heterogeneity between J studies may be considered as follows. First, we consider stratified estimation of the model intercept and regression coefficients for each study j (Table 3)

Table 3

Study	Intercept	Age	Motor score	Pupillary reactivity	Hypoxia	Hypotension	CT class	tSAH
1	−1.22 (0.09)	0.20 (0.05)	−0.39 (0.08)	0.41 (0.11)	0.36 (0.20)	1.03 (0.21)	0.56 (0.10)	1.01 (0.17)
2	−1.40 (0.10)	0.21 (0.07)	−0.40 (0.08)	0.36 (0.11)	0.46 (0.18)	0.75 (0.19)	0.34 (0.10)	0.74 (0.17)
3	−1.35 (0.22)	0.28 (0.09)	−0.28 (0.12)	0.71 (0.23)	−0.36 (0.58)	0.97 (0.35)	0.47 (0.15)	0.70 (0.37)
4	−1.34 (0.09)	0.20 (0.06)	−0.14 (0.07)	0.74 (0.11)	0.68 (0.24)	0.22 (0.23)	0.33 (0.10)	0.82 (0.18)
5	−1.73 (0.10)	0.21 (0.05)	−0.52 (0.06)	0.52 (0.08)	0.33 (0.16)	0.77 (0.17)	0.38 (0.08)	0.54 (0.14)
6	−1.41 (0.19)	0.30 (0.09)	−0.45 (0.13)	0.82 (0.17)	0.00 (0.38)	−0.60 (0.63)	0.38 (0.08)	0.95 (0.29)
7	−0.93 (0.11)	0.43 (0.05)	−0.30 (0.09)	1.01 (0.12)	0.07 (0.21)	1.21 (0.22)	0.36 (0.11)	0.70 (0.19)
8	−0.73 (0.12)	0.47 (0.07)	−0.42 (0.10)	0.57 (0.12)	0.36 (0.27)	1.31 (0.25)	0.43 (0.12)	0.63 (0.21)
9	−1.28 (0.35)	0.38 (0.16)	−0.23 (0.22)	0.34 (0.26)	−0.40 (0.54)	0.71 (0.59)	0.64 (0.28)	0.72 (0.63)
10	−1.41 (0.12)	0.40 (0.05)	−0.45 (0.09)	0.80 (0.12)	0.54 (0.23)	0.73 (0.24)	0.31 (0.11)	0.81 (0.19)
11	−1.44 (0.11)	0.22 (0.06)	−0.40 (0.09)	0.43 (0.11)	0.21 (0.23)	0.34 (0.30)	0.38 (0.11)	0.97 (0.19)
12	−1.49 (0.17)	0.24 (0.10)	−0.39 (0.11)	0.68 (0.14)	0.35 (0.28)	0.83 (0.34)	0.36 (0.21)	0.60 (0.26)
13	−1.43 (0.14)	0.22 (0.09)	−0.42 (0.11)	0.68 (0.14)	−0.04 (0.34)	0.26 (0.30)	0.52 (0.14)	0.76 (0.24)
14	−1.61 (0.11)	0.17 (0.07)	−0.34 (0.09)	0.29 (0.16)	0.06 (0.23)	0.46 (0.26)	0.53 (0.11)	0.42 (0.21)
15	−2.07 (0.18)	0.52 (0.07)	−0.59 (0.15)	0.91 (0.16)	0.33 (0.28)	0.54 (0.37)	0.29 (0.14)	0.47 (0.26)
Pooled	−1.35 (0.07)	0.28 (0.03)	−0.38 (0.03)	0.61 (0.06)	0.27 (0.07)	0.71 (0.10)	0.40 (0.03)	0.72 (0.06)
Estimated τ	0.25	0.09	0.07	0.17	0.08	0.27	0.06	0.08
95% Prediction interval	[−1.92, −0.78]	[0.08, 0.48]	[−0.55, −0.20]	[0.21, 1.01]	[0.05, 0.50]	[0.08, 1.34]	[0.25, 0.55]	[0.50, 0.93]
I²	84%	67%	35%	65%	0%	49%	0%	2%

Age was analyzed as a continuous predictor, per 10 years; Motor score, pupillary reactivity, and CT class were analyzed as continuous predictors, coded as in Figure 1. Hypoxia, hypotension, and tSAH were binary predictors. For interpretation of the baseline risk (the intercept $\alpha$), we standardized predictors by subtracting the overall means of predictor values.

Multivariable logistic regression models to predict mortality 6 months after traumatic brain injury fitted separately in each of the 15 studies. We show the estimated regression coefficients with associated standard errors for the 15 studies. A two‐stage multivariate meta‐analysis provided pooled estimates, estimates of the between‐study variance parameter τ, and prediction intervals for the regression coefficients. The between‐ versus within‐study heterogeneity is summarized in I² estimates.

For descriptive purposes, we propose forest plots for visualization of the heterogeneity in predictor effects (Figure 2). Additionally, pooled estimates with associated (approximate) prediction intervals and I² estimates based on a multivariate MA can provide further insight into the extent of between‐study heterogeneity.14
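To make the pooled estimate, τ, and I² more concrete, the sketch below pools the age coefficients of Table 3 with a univariate DerSimonian‐Laird random‐effects meta‐analysis. This is a deliberate simplification and an assumption on our part: the results in Table 3 come from a multivariate MA that also uses the within‐study covariances, so the numbers from this sketch need not match the table exactly.

```python
import math

def dersimonian_laird(estimates, std_errors):
    """Univariate random-effects pooling (DerSimonian-Laird moment estimator).

    Returns the pooled estimate, its SE, the between-study SD (tau), and I^2."""
    k = len(estimates)
    w = [1.0 / se ** 2 for se in std_errors]                  # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                        # between-study variance
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]      # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0
    return pooled, se_pooled, math.sqrt(tau2), i2

# Age coefficients (per 10 years) and standard errors of studies 1 to 15, from Table 3
age = [0.20, 0.21, 0.28, 0.20, 0.21, 0.30, 0.43, 0.47,
       0.38, 0.40, 0.22, 0.24, 0.22, 0.17, 0.52]
age_se = [0.05, 0.07, 0.09, 0.06, 0.05, 0.09, 0.05, 0.07,
          0.16, 0.05, 0.06, 0.10, 0.09, 0.07, 0.07]

pooled, se_pooled, tau, i2 = dersimonian_laird(age, age_se)
print(f"pooled = {pooled:.2f} (SE {se_pooled:.2f}), tau = {tau:.2f}, I2 = {100 * i2:.0f}%")
```

The same function could be applied column by column to the other coefficients, at the price of ignoring the correlation between the estimates within each study.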

Figure 2

Forest plots showing estimated multivariable logistic regression coefficients and associated 95% confidence interval per study. The largest heterogeneity was noted for pupillary reactivity (τ = 0.17) and hypotension (τ = 0.27)

Pooling with full stratification

For pooled analysis, we consider $\theta_j = (\alpha_j, \beta_j)$, the (P + 1)‐vector of regression coefficients of study $j$, with associated within‐study covariance matrix $S_j$ of its estimate $\hat{\theta}_j$. More specifically, the study‐specific model is given by

$$\mathrm{logit}\,P(Y_{ij} = 1 \mid x_{ij}) = \alpha_j + x_{ij}\beta_j.$$

The distribution of $\theta_j$ over the population is assumed to be a multivariate normal distribution

$$\theta_j \sim N(\theta, \Sigma).$$

To avoid identifiability problems, we assume that the number of studies exceeds the number of study‐specific parameters, ie, J > P + 1. Simplifications can be made by adopting an autocorrelation structure or specifying a diagonal matrix for $\Sigma$. Insight into the heterogeneity of the model predictors is gained by first fitting the model for each study separately, yielding study‐specific estimates $\hat{\theta}_j$ (Table 3 and Figure 2). The IPD are needed to obtain the full within‐study covariance matrix $S_j$. These covariance matrices are typically not available from published studies, and are the key benefit of having access to IPD rather than published results only. The parameters of the global prediction model can be estimated from the model $\hat{\theta}_j \sim N(\theta, \Sigma + S_j)$, where $\hat{\theta}_j$ contains the estimated study‐specific intercept term $\hat{\alpha}_j$ and regression coefficients $\hat{\beta}_j$, and $S_j$ represents the corresponding within‐study covariance matrix. The between‐study covariance matrix of the pooled intercept term and predictor effects is given by $\Sigma$. Fitting this model also yields the covariance matrix of the estimate of $\theta$. This should not be confused with the covariance matrix $\Sigma$ of the random effects.

The aforementioned approach to an IPD‐MA is a two‐stage approach because studies are first analyzed individually and corresponding results are then combined in a second multivariate step. A one‐stage approach would fit a logistic regression model with random effects. A full specification of this model for patient i in study j may be as follows:

$$\mathrm{logit}\,P(Y_{ij} = 1 \mid x_{ij}) = \alpha_j + x_{ij}\beta_j, \qquad (\alpha_j, \beta_j) \sim N(\theta, \Sigma).$$

The one‐stage approach is computationally more demanding than the two‐stage approach, and is expected to give similar results with similar model specification and reasonable sample size.8, 25 In the case study, differences between the one‐ and two‐stage pooling were negligible. We present the two‐stage results in Table 3. The between‐study variance parameter was relatively large compared to the pooled SE for three predictors (age, pupillary reactivity, and hypotension), with relatively large prediction intervals (Table 3).

Heterogeneity in predictions from different cohorts

Predictor effects show substantial heterogeneity across cohorts (Table 3). On the other hand, because of the correlation between predictors within the studies, the predicted probabilities from different studies for a patient with the same characteristics may still be quite close. We propose to further consider differences between predictions for all covariate patterns that occur across the studies. To this end, we construct scatter plots of predicted probabilities according to models fitted in each of the individual studies (Figure 3). We label this approach a 1‐to‐1 comparison of study‐specific model predictions, since the predictions from the model of each study are compared with those from the model of every other study. For the case study, each comparison includes the 11 022 patients in the IPD data set. We then note that some studies provide very similar predictions, eg, studies 1 and 2. This fits with the fact that these are similar trials, one recruiting patients in the US and the other internationally. Other studies provide somewhat different predictions compared to the other studies, eg, studies 6, 8, 9, and 15. This is partly attributable to differences in baseline risk, reflected in lines below or above the line of identity in Figure 3.
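The 1‐to‐1 comparison can be sketched by applying two study‐specific models to the same covariate patterns. The coefficients below are those of studies 1 and 7 in Table 3. The two covariate patterns are hypothetical and are expressed as deviations from the overall predictor means, which is an assumption based on the centering described in the note to Table 3.

```python
import math

def predicted_risk(intercept, beta, x):
    """Predicted probability from a logistic model: 1 / (1 + exp(-(alpha + x*beta)))."""
    lp = sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-(intercept + lp)))

# Intercept and coefficients of the models fitted in studies 1 and 7 (Table 3);
# order: age, motor score, pupillary reactivity, hypoxia, hypotension, CT class, tSAH
models = {
    1: (-1.22, [0.20, -0.39, 0.41, 0.36, 1.03, 0.56, 1.01]),
    7: (-0.93, [0.43, -0.30, 1.01, 0.07, 1.21, 0.36, 0.70]),
}

# Hypothetical covariate patterns, as deviations from the overall means
patients = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],    # patient with average predictor values
    [2.0, -1.0, 1.0, 0.8, 0.8, 1.0, 0.5],   # made-up patient with unfavorable findings
]

# One point of the scatter plot per patient: (risk under model 1, risk under model 7)
pairs = [(predicted_risk(*models[1], x), predicted_risk(*models[7], x)) for x in patients]
for r1, r7 in pairs:
    print(f"model 1: {r1:.3f}   model 7: {r7:.3f}")
```

Repeating this for every pair of studies and every patient in the IPD set would give the panels of the scatter plot matrix; points away from the line of identity mark covariate patterns for which the two models disagree.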

Figure 3

Correlation between predictions of study‐specific models in a pairwise comparison between studies: 1‐to‐1 comparisons of predictions for all patients in the individual patient data set (n = 11 022)

The effect of differences in predictor strengths on the predictions can be seen from the “veins” in the plot. For instance, when comparing the prediction based on the model developed in study 6 with the predictions made by the models developed in the other studies, the plot typically shows two lines around which the predictions are clustered. This reflects that the regression coefficient of the predictor “hypoxia” is close to zero in study 6 and is different from zero in the other studies (Table 3). In other comparisons multiple “veins” are visible, attributable to differences in predictor strength of predictors with multiple categories, such as motor score and pupillary reactivity.

Simplifications with respect to heterogeneity

Several simplifications are possible when the extent of heterogeneity across studies is limited. The strongest simplification is to ignore any heterogeneity and thus to assume that $\alpha_j = \alpha$ and $\beta_j = \beta$ for all studies $j$, ie, that the between‐study covariance matrix $\Sigma$ is zero. This implies that all studies agree on the baseline risk and predictor effects in the global model, and that differences in study‐specific estimates only appear due to sampling error. This simplification will not be realistic for most applications.

A less drastic simplification is to assume that the intercepts may vary between studies but that the predictor effects are common.8, 9 We label this a common effect approach with respect to $\beta$. Several estimation procedures may be followed to account for differences in baseline risk. If the number of studies is limited, say fewer than 5 studies, it is usually possible to condition on the “study” variable by estimating a separate intercept for each study. Alternatively, when more studies are available, it is reasonable to assume a Normal random effects distribution for the intercept terms.14 This second approach allows for a simple summary estimate of the between‐study heterogeneity in the intercept as $\tau_\alpha$. The corresponding logistic regression model is specified as follows:

$$\mathrm{logit}\,P(Y_{ij} = 1 \mid x_{ij}) = \alpha_j + x_{ij}\beta, \qquad \alpha_j \sim N(\alpha, \tau_\alpha^2).$$

If $\tau_\alpha$ is substantial, this implies that adjustments for the intercept need to be considered when applying the global model in a local setting (see discussion). The heterogeneity between studies can well be summarized in measures such as the median odds ratio (MOR).26

The assumption of common predictor effects can be weakened by specifying that the linear predictors share a common direction in covariate space, but that the size of their effects might be systematically different. This can be modeled by a rank = 1 model and hence, $\beta_j = \gamma_j\beta$. In this model, the random variation between studies is described by the correlated pair $(\alpha_j, \gamma_j)$. The study‐specific relative effects are then allowed to vary in a proportional way.27 Estimation of this model is possible with a nonlinear mixed effect model, such as available with PROC NLMIXED in SAS software. We can approximate this model by first fitting a model that ignores between‐study heterogeneity on all available patients to derive a linear predictor for each patient, and subsequently fitting a random effect model with a random intercept and a random slope for the linear predictor. Both approaches gave very similar results in our case study, and hence, we present the results from the approximation using two steps.

A further weakening of the restrictiveness is obtained by allowing models of higher rank, such as a rank = 2 model. Again, this model can be estimated with a nonlinear mixed model, and approximated in a two‐step approach by first estimating the linear predictor globally and subsequently fitting a random effect model with the linear predictors as covariates with random slopes. In these restricted models of rank 1 or rank 2, the covariance matrix has a simpler structure compared to the fully stratified model with study‐specific predictor effects.
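For the random intercept model, the MOR mentioned above can be computed directly from the between‐study standard deviation of the intercept. The sketch below uses the commonly cited formula MOR = exp(√(2τ²) · z₀.₇₅), where z₀.₇₅ is the 75th percentile of the standard Normal distribution; we assume this is the definition intended in reference 26. The input value 0.29 is the intercept SD reported for the common effect model in Table 4.

```python
import math
from statistics import NormalDist

def median_odds_ratio(tau):
    """MOR = exp(sqrt(2 * tau^2) * z_0.75), with tau the SD of the random intercept."""
    z75 = NormalDist().inv_cdf(0.75)   # approximately 0.6745
    return math.exp(math.sqrt(2.0 * tau ** 2) * z75)

# Between-study SD of the intercept for the common effect model (Table 4)
mor = median_odds_ratio(0.29)
print(f"MOR = {mor:.2f}")
```

The MOR can be read as the median odds ratio between two patients with identical predictor values who are drawn from two randomly chosen studies, with the higher‐risk study in the numerator; a value of 1 corresponds to no heterogeneity in baseline risk.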

Heterogeneity in specific predictions of the prediction model

In the fully stratified model, the baseline risks and predictor effects show considerable variability over the studies (Table 3). The resulting predicted probabilities also show quite some variability (Figure 3). If we consider a fixed value $x$ of the covariate vector, we can compute an approximate 95% prediction interval for new studies of the linear predictor $\alpha_{new} + x\beta_{new}$. Write $\tilde{x} = (1, x)$, let $\theta = (\alpha, \beta)$ denote the pooled parameter vector, and let $\Sigma$ denote the between‐study covariance matrix. Since $\alpha_{new} + x\beta_{new} = \tilde{x}\theta_{new}$ with $\theta_{new} \sim N(\theta, \Sigma)$, we can estimate this interval using

$$\tilde{x}\hat{\theta} \pm 1.96\sqrt{\tilde{x}\,\{\hat{\Sigma} + \widehat{\mathrm{var}}(\hat{\theta})\}\,\tilde{x}^{\top}}.$$

This is an approximate approach since it ignores the uncertainty introduced by estimating the within‐ and between‐study covariance matrices. For this reason, a Student t, rather than a Normal distribution, is often used to calculate confidence intervals. Similarly, approximate 95% prediction intervals for new studies can be constructed for the rank = 1 model and random intercept model.

By keeping all but one predictor fixed, we can then investigate the heterogeneity of predictions across different values of this predictor. The values of the other predictors need to be chosen, for example as representative values, ie, the median for continuous predictors, and the most common category for categorical predictors. We illustrate this approach for the age effect for otherwise average risk patients in the TBI case study. Obviously, the 95% prediction interval is smallest with a naïve single fit model, where we ignore any between‐study heterogeneity. The common effect model assumes common predictor effects across studies but leaves the baseline risk free. The uncertainty is larger when we allow for between‐study differences in the predictor effects with a rank = 1 model or a fully stratified model (Figure 4). Note that the risk predictions according to age and their uncertainty depend on the choice of values for the other predictors in the model.
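A minimal sketch of such a prediction interval on the risk scale is given below for the simplest case, a patient with average predictor values, for whom the linear predictor equals the intercept when predictors are mean‐centered. It uses the pooled intercept, its standard error, and τ from Table 3, and a Normal quantile of 1.96. The function name and its arguments are hypothetical, and for other covariate patterns the two variance terms would have to be computed from the full covariance matrices.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def prediction_interval_risk(lp, se_lp, tau, quantile=1.96):
    """Approximate prediction interval for the risk in a new study.

    lp:     pooled linear predictor (including intercept) for a fixed covariate pattern
    se_lp:  standard error of that pooled linear predictor
    tau:    between-study SD of the linear predictor for this covariate pattern
    """
    half_width = quantile * math.sqrt(tau ** 2 + se_lp ** 2)
    return expit(lp - half_width), expit(lp), expit(lp + half_width)

# Pooled intercept, its SE, and tau from Table 3 (patient with average predictor values)
low, point, high = prediction_interval_risk(lp=-1.35, se_lp=0.07, tau=0.25)
print(f"risk = {point:.2f}, approximate 95% prediction interval {low:.2f} to {high:.2f}")
```

Passing a Student t quantile instead of 1.96 through the `quantile` argument would widen the interval somewhat, in line with the remark above about the uncertainty in the estimated covariance matrices.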

Figure 4

Prediction intervals for new studies assuming a fixed effects model, random intercept model, rank = 1 model, or fully stratified model

Model selection

We thus far described models that allow for different degrees of heterogeneity in predictor effects and/or baseline risk. Selecting the most appropriate model for the data at hand can be based on information criteria such as AIC/BIC or on formal statistical tests comparing the different models. We propose to apply the following test procedure. A similar type of test was recently proposed for selecting an update method for existing prediction models,28 and longer ago for selection of functional forms of covariates in fractional polynomials.29 We propose to perform a series of likelihood ratio tests that start with the fully stratified model and systematically consider simplifications of it. These tests require assessing the significance of the variance components of random effect models. Since these parameters are at the boundary of the parameter space under the null when testing if the variance is different from zero, a mixture of $\chi^2$‐distributions is required to obtain a p‐value.30 Hence, we compare the fully stratified model successively against simpler models. In our case study, the closed‐test procedure selects the rank = 1 model as the most appropriate model (Table 4).
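The p‐value under a 50:50 mixture of two $\chi^2$‐distributions can be sketched with the standard library alone. The example call below is hypothetical: it uses a made‐up likelihood ratio statistic and degrees of freedom 0 and 1, which is the textbook case of testing a single variance component. The degrees of freedom appropriate for the comparisons in the case study have to be supplied by the user.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability P(X > x) of a chi-square distribution, integer df >= 0."""
    if df == 0:                       # chi-square with 0 df is a point mass at zero
        return 1.0 if x <= 0 else 0.0
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Start at df = 1 (odd df) or df = 2 (even df), then apply the recurrence
    # Q(a + 1, h) = Q(a, h) + h^a * exp(-h) / Gamma(a + 1)
    # of the regularized upper incomplete gamma function, with a = df / 2.
    a = 0.5 if df % 2 else 1.0
    q = math.erfc(math.sqrt(h)) if df % 2 else math.exp(-h)
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q

def mixture_p_value(lr_statistic, df_low, df_high):
    """P-value under a 50:50 mixture of two chi-square distributions."""
    return 0.5 * chi2_sf(lr_statistic, df_low) + 0.5 * chi2_sf(lr_statistic, df_high)

# Hypothetical likelihood ratio statistic (difference in -2 log-likelihood)
lr = 3.8
p = mixture_p_value(lr, 0, 1)
print(f"mixture p-value = {p:.3f}, naive chi-square(1) p-value = {chi2_sf(lr, 1):.3f}")
```

With degrees of freedom 0 and 1, the mixture p‐value is half the naive $\chi^2_1$ p‐value, which shows why ignoring the boundary problem makes the test for a variance component conservative.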

Table 4

Comparison of variants of a global model in the TBI case study with 15 studies to predict 6‐month mortality

Model variant	Baseline risk	Predictor effects	Case study	−2 log‐likelihood	p‐value of fully stratified fit against other model
Fully stratified	Per study	Per study	See Table 3	9750
Single fit	Common	Common	‐	9922	p < 0.0001
Common effect	Per study	Common	τ_α = 0.29	9810	p < 0.0001
Rank 1	Per study	Proportional per study	τ_α = 0.24; τ_γ = 0.12	9791	p = 0.16

The likelihood ratio tests are performed in the following order:

(1) The fully stratified model against a model without any heterogeneity (simple logistic regression ignoring any clustering of patients). The distribution used to calculate the p‐value is a 50:50 mixture of two $\chi^2$‐distributions, with degrees of freedom determined by p, the number of regression coefficients included in the model; if the test is not significant, select the model without any heterogeneity; otherwise, continue.

(2) The fully stratified model against a model with heterogeneity in baseline risk. Here, a 50:50 mixture of two $\chi^2$‐distributions is used as well; if nonsignificant, select the model with heterogeneity in baseline risk; otherwise, continue.

(3) The fully stratified model against the rank = 1 model. Here, the variance components tested are not on the boundary of the parameter space; a single $\chi^2$‐distribution is used; if nonsignificant, select the rank = 1 model; otherwise, select the fully stratified model.

Observed heterogeneity in predictions

When substantial heterogeneity is observed in predictions across included studies, study‐level covariates should be considered in the prediction model. These may explain heterogeneity in predictions across studies. In our case study, we considered the following study‐level covariates in the rank = 1 model: year of start of study and study type (RCT or observational study). We found that the heterogeneity in slope could be explained partly by whether a study was an observational study or an RCT. After accounting for study type, statistically significant heterogeneity in baseline risk remained. We hence conclude that the appropriate model for the case study would be a global model, fitted separately for observational studies and RCTs, with a random effect for study to allow for heterogeneity in baseline risk. Most predictor effects appear to be larger in observational studies than in RCTs (Table 5). The heterogeneity in baseline risk seems higher for observational studies, reflected in the higher standard deviation of the model intercept.

Table 5

Logistic regression coefficients of models stratified by observational studies and RCTs

	Model based on
	Observational studies	RCT
Intercept	−1.27	−1.41
Age	0.43	0.22
Motor score	−0.42	−0.37
Pupillary reactivity	0.82	0.52
Hypoxia	0.28	0.27
Hypotension	1.03	0.62
CT class	0.34	0.42
tSAH	0.68	0.73
τ_α	0.44	0.14
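
To compare the two columns of Table 5 on a more familiar scale, the coefficients can be exponentiated to odds ratios. The sketch below only transforms the values in the table; the variable names are illustrative, and the odds ratios apply per unit of each predictor as coded in the case study, which is not specified in this section.

```python
import math

# Logistic regression coefficients from Table 5:
# (observational studies, RCTs).
coefficients = {
    "Age": (0.43, 0.22),
    "Motor score": (-0.42, -0.37),
    "Pupillary reactivity": (0.82, 0.52),
    "Hypoxia": (0.28, 0.27),
    "Hypotension": (1.03, 0.62),
    "CT class": (0.34, 0.42),
    "tSAH": (0.68, 0.73),
}

# Odds ratio = exp(log odds ratio), per unit of the predictor as coded.
odds_ratios = {
    name: (math.exp(obs), math.exp(rct))
    for name, (obs, rct) in coefficients.items()
}

print(f"{'Predictor':22s} {'Observational':>13s} {'RCT':>6s}")
for name, (or_obs, or_rct) in odds_ratios.items():
    print(f"{name:22s} {or_obs:13.2f} {or_rct:6.2f}")
```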


VALIDATION OF MODELS DEVELOPED IN AN IPD MA

Heterogeneity in calibration of predictions

Calibration is well assessed graphically, supplemented by summary statistics such as the calibration intercept and calibration slope.31, 32, 33, 34 A natural approach is to consider the cross‐validated performance, where models are developed on all studies minus one, in what has been labeled “internal‐external validation.”9, 35, 36 We leave one study out at a time, which leads to 15 validations for the case study. Here, we validate the rank = 1 model, including the study‐level covariate “observational study versus RCT.” We note some miscalibration in the observational studies, with higher mortality than predicted in #7 and #8, while #15 showed less mortality than predicted (Figure 5). The higher degree of miscalibration for the observational studies is in line with the larger estimated heterogeneity of the random intercept. Overall, patterns of calibration were reasonable for most trials, although a recent trial (#14) had somewhat lower than expected mortality (Figure 6).
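
As a rough sketch of the two summary statistics, the calibration slope can be estimated by a logistic regression of the outcome on the linear predictor of the model, and calibration‐in‐the‐large by the intercept of the same regression with the slope fixed at 1. The code below is an assumption‐laden illustration: the data are simulated (not from the case study), the function names are hypothetical, and the fit uses a plain Newton‐Raphson iteration rather than any particular statistical package.

```python
import math
import random


def expit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def calibration_slope(lp, y, n_iter=25):
    """Fit logit P(y = 1) = a + b * lp by Newton-Raphson; return (a, b)."""
    a, b = 0.0, 1.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, yi in zip(lp, y):
            p = expit(a + b * x)
            w = p * (1.0 - p)
            g0 += yi - p          # score for the intercept
            g1 += (yi - p) * x    # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = score.
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b


def calibration_in_the_large(lp, y, n_iter=25):
    """Intercept of a logistic model with the linear predictor as offset."""
    a = 0.0
    for _ in range(n_iter):
        g = h = 0.0
        for x, yi in zip(lp, y):
            p = expit(a + x)
            g += yi - p
            h += p * (1.0 - p)
        step = g / h
        a += step
        if abs(step) < 1e-10:
            break
    return a


# Simulated validation study (hypothetical): the true risk follows
# logit = 0.3 + 0.8 * lp, so the model predictions are too extreme
# (slope < 1) and mortality is higher than predicted.
random.seed(2019)
lp = [random.gauss(-1.0, 1.0) for _ in range(20000)]
y = [1 if random.random() < expit(0.3 + 0.8 * x) else 0 for x in lp]

intercept, slope = calibration_slope(lp, y)
citl = calibration_in_the_large(lp, y)
```

In an internal‐external validation, `lp` would be the linear predictor from the model developed on the remaining studies, evaluated in the patients of the study that was left out.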

Figure 5

Calibration plots of model developed in observational studies in a leave‐one‐study‐out cross‐validation

Figure 6

Calibration plots of model developed in RCTs in a leave‐one‐study‐out cross‐validation


Heterogeneity in discrimination of predictions

Discrimination of prediction models is commonly assessed by a concordance statistic (c).6 We assessed the discriminative ability of the proposed prediction model in a similar fashion as the calibration, again using “internal‐external” cross‐validation stratified for observational studies and RCTs. The common estimate was c = 0.86, with a 95% confidence interval from 0.84 to 0.87 for observational studies (Figure 7). The approximate 95% prediction interval was identical to the 95% confidence interval as there was no evidence of heterogeneity in observed c‐statistics across the observational studies. For the RCTs, the pooled estimate was c = 0.76 with a 95% confidence interval from 0.74 to 0.78 and a 95% prediction interval from 0.71 to 0.81. This reflects the heterogeneity across studies in predictor effects, but also in case‐mix, since the c‐statistic depends on the combination of predictor effects and case‐mix.37, 38

Figure 7

c‐statistics leave‐one‐study‐out cross‐validation

The 95% prediction intervals may be interpreted as plausible ranges for the c‐statistics of the proposed model in a new observational setting or in a new RCT. The observed c‐statistic for the observational studies was higher, in line with the larger case‐mix heterogeneity in patients as compared to the RCTs (Table 2).
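
For a binary endpoint, the c‐statistic is the proportion of all pairs of one patient with and one patient without the event in which the patient with the event has the higher predicted risk, with ties counted as one half. A minimal sketch, with a hypothetical function name and toy data that are not from the case study:

```python
def c_statistic(risk, y):
    """Concordance statistic for predicted risks and a binary outcome (0/1)."""
    events = [r for r, yi in zip(risk, y) if yi == 1]
    nonevents = [r for r, yi in zip(risk, y) if yi == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one nonevent")
    concordant = 0.0
    for r_event in events:
        for r_nonevent in nonevents:
            if r_event > r_nonevent:
                concordant += 1.0
            elif r_event == r_nonevent:
                concordant += 0.5  # ties count as half
    return concordant / (len(events) * len(nonevents))


# Toy example: 3 of the 4 event/nonevent pairs are ordered correctly.
toy_c = c_statistic([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

In the leave‐one‐study‐out setting, this function would be applied to the predictions in each study that was left out, and the resulting study‐specific c‐statistics pooled with standard meta‐analytic methods. The pairwise loop is quadratic in the number of patients; a rank‐based formula is preferable for large studies.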

CASE STUDY SUMMARIZED

In our case study, we found substantial heterogeneity in baseline risk and predictor effects of a fully stratified model. Using the approximate closed‐test procedure, we found that the rank = 1 model was the most appropriate model, rather than the initially proposed global model with common effects for the regression coefficients.12 The observed heterogeneity in slope of the rank = 1 model could largely be explained by whether a study was a trial or an observational study. Substantial heterogeneity in the baseline risk remained present after adjustment for study type. This was especially pronounced for the observational studies. These results argue for two separate models, ie, one for trials and one for observational studies, each with further local adjustment for the baseline risk (the model intercept). The heterogeneity in baseline risk was clear in the internal‐external validation approach for calibration for the observational studies, while the calibration was less heterogeneous across trials. The c‐statistic was higher for observational studies, reflecting the larger case‐mix heterogeneity and stronger predictor effects in these studies as compared to trials. The steps taken in this assessment of the case study are summarized in Figure 8.

Figure 8

Schematic representation of the research questions to be answered for the development and validation of a prediction model in an individual patient data meta‐analysis

In conclusion, these results suggest that two separate models are needed for different purposes. If the goal is to use the model in general care, the model based on the observational studies may be the better choice. In contrast, the model based on the RCTs may be the better choice if the model is used for risk stratification of patients participating in a randomized trial.

DISCUSSION

In this overview, we considered the assessment of various aspects of heterogeneity in the context of a MA for prediction models. The assessment of heterogeneity in baseline risk, and each of the predictor effects per se, is rather similar to standard MA, except for the multivariate nature of MA of multivariable regression coefficients from multiple studies. More challenging is the assessment of heterogeneity in combinations of baseline risk and predictor effects (Section 5), and heterogeneity in absolute risk and model performance (Section 6).

Heterogeneity in IPD MA

We first discussed the assessment of between‐study heterogeneity in design, predictors, and endpoints (Section 3). Specifically, a membership model may be valuable to assess between‐study heterogeneity in case‐mix distributions, and to identify outlier studies.16 In the case study, however, there was only a weak relation between study characteristics and deviating baseline risk or deviating predictor effects. Some trials had a lower mortality, which may be related to the selection of somewhat more favorable patients, while this selection was not fully captured in the observed covariates. In other cases, the baseline risk may be related even more directly to study characteristics. As an example from a very different clinical field, we found a substantially higher probability of indolent prostate cancers among screened patients compared to clinically presenting patients.39

Next, methods to deal with missing values are essential in the context of multivariable prediction modeling, since a few missing values over different predictors may cause a substantial loss in efficiency in a complete case analysis (Section 4). Moreover, some variables may be systematically missing in some of the studies. Several methods to deal with sporadically and systematically missing values in an IPD‐MA have been proposed that aim to maximize congeniality between the imputation and the analysis model. Although it has been demonstrated that multilevel imputation models with a random slope are uncongenial, the resulting bias is often negligible. In the case study of TBI patients, three sets with increasingly complex prediction models were originally proposed, with exclusion of studies for more complex prediction models if predictors were systematically missing.12 The problem of systematically missing variables was also noted in most of 15 IPD meta‐analyses reviewed recently.11 If imputation is attempted, allowing for between‐study heterogeneity is important, arguing for a more refined procedure than applied in our case study.18, 19, 20, 21

Heterogeneity in predictor effects is straightforward to visualize with forest plots and other standard meta‐analytical approaches (Section 5). We may debate whether the choice of predictors for the prediction models should be influenced by the amount of heterogeneity. If the effect is relatively strong but heterogeneous, the first check should be whether there are explanations for the heterogeneity, such as issues in the operationalization of the predictor. If not, we propose that such a strong predictor be kept in the model. The heterogeneity will be reflected in extra uncertainty in prediction intervals (Section 6). Additionally, apparent heterogeneity may be explained by missed nonlinear associations between predictors and outcome, or by relevant interactions not being included in the global model.

Heterogeneity in absolute risk is what matters ultimately for prediction models (Section 6). This heterogeneity may be driven by heterogeneity in baseline risk in many cases. It is the focus of quality of care research, where specialized centers may claim better results than others, eg, a lower surgical mortality, also related to the volume of care delivered.40 For comparison of predicted absolute risks, the 1‐to‐1 comparisons capture all information (Figure 3). In a variant of this plot, we might standardize the baseline risk, such that the focus shifts to heterogeneity in predictor effects. Figure 3 shows the predicted risks for all available predictor combinations, which is related to attempting to assess the strongest type of calibration, ie, the comparison of observed to predicted risk for individual covariate patterns.34 Validation by predicted risk (rather than specific covariate patterns) is standard for prediction models, and, as illustrated, is readily possible in the context of MA with an internal‐external cross‐validation procedure.16, 35, 36, 41 Both calibration and discrimination can hence be assessed across studies (Section 6). It should also be noted that internal‐external validation requires relatively large studies to obtain reliable estimates of study‐specific model performance. Recent studies suggest at least 100 events and 100 nonevents for binary outcomes.42

If there is substantial heterogeneity in baseline risk, the median odds ratio (MOR) can be a helpful measure to quantify the magnitude of this heterogeneity.43 The MOR is defined as the median value of the odds ratio between the study at higher risk and the study at lower risk when picking two studies at random, or in other words, the median increase in the odds of the outcome when a patient is included in a study with a higher risk compared to a study with a lower risk.

We recognize that our review and proposals have some limitations. Although the test procedure proposed in the manuscript allows for a formal approach to test whether a fully stratified model is needed, similar procedures to choose between fixed and random effects meta‐analyses have been criticized previously. Therefore, it is advised to not only apply the test procedure but also assess the heterogeneity between studies and consider the practical implications of the between‐study heterogeneity. We focused on prediction of a binary endpoint. Extensions to survival models require further study (Appendix A).44 More case studies are required to evaluate the practical usefulness of different options to both methodological and clinical researchers. We considered the combination of multiple predictors, while much current research aims to quantify the value of new predictors, such as new markers to predict cardiovascular disease, beyond what is possible with traditional predictors.45 The incremental value of a marker may then be studied per study, with MA approaches applied to performance measures such as the increment in discriminative ability (c statistic, possibly with some transformation).9, 46 Moreover, we considered an ideal modeling situation, where we have access to IPD and have the most relevant predictors available in each study. We expand on situations with aggregated data in Appendix B.
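
The MOR discussed above has a simple closed form when the study‐specific intercepts are assumed to follow a normal distribution on the log‐odds scale with standard deviation τ: MOR = exp(√2 · τ · Φ⁻¹(0.75)) ≈ exp(0.954 τ). The sketch below applies this formula to the τ_α values reported in Table 5; the function name is hypothetical, and the normality assumption is ours for the purpose of the illustration.

```python
import math
from statistics import NormalDist


def median_odds_ratio(tau: float) -> float:
    """MOR for a normally distributed random intercept with SD tau (log-odds scale)."""
    return math.exp(math.sqrt(2.0) * tau * NormalDist().inv_cdf(0.75))


# Standard deviations of the random intercept from Table 5.
mor_observational = median_odds_ratio(0.44)  # approximately 1.52
mor_rct = median_odds_ratio(0.14)            # approximately 1.14
```

Under these assumptions, moving a patient from a lower‐risk to a higher‐risk observational study would increase the odds of 6‐month mortality by about 50% in median, against about 14% for the trials.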

Heterogeneity and global model performance

If a global model is considered reasonable to propose, we are interested in its performance across different settings and populations.41 It is helpful to distinguish reproducibility and transportability.47 A prediction model is reproducible when it performs sufficiently accurately across new patients from the same underlying population. This property can be directly assessed in a single development data set by applying internal validation techniques such as bootstrapping, with random resampling.6, 48 Model transportability, however, requires the model to perform well across samples from “different but related populations.”47 Transportability can be assessed by performing external validation studies, or by adopting nonrandom resampling methods such as internal‐external cross‐validation.36 When the model is reproducible and shows good transportability across several complementary settings, it can be concluded that its generalizability is adequate.49 This claim is stronger if the included studies are more different from the development study. Therefore, the larger the differences between studies, the stronger the test of generalizability. If development and validation samples are quite similar, reproducibility rather than transportability is assessed.16, 47
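
The nonrandom resampling behind internal‐external cross‐validation amounts to splitting the pooled data by study rather than at random. A minimal sketch of such a splitter is given below; the function name and the toy study labels are hypothetical.

```python
from collections import defaultdict


def leave_one_study_out(study_ids):
    """Yield (held_out_study, development_indices, validation_indices) per study."""
    by_study = defaultdict(list)
    for index, study in enumerate(study_ids):
        by_study[study].append(index)
    for held_out, validation in by_study.items():
        # Develop the model on all patients from the other studies.
        development = [
            index
            for study, indices in by_study.items()
            if study != held_out
            for index in indices
        ]
        yield held_out, development, validation


# Toy data: six patients from three studies.
study_ids = ["A", "A", "B", "B", "B", "C"]
splits = list(leave_one_study_out(study_ids))
```

Each split would be used to refit the model on the development indices and to assess calibration and discrimination on the validation indices, giving one validation per study.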

Heterogeneity and local performance

Once a global model from an IPD‐MA is developed and validated across multiple studies, we may consider its practical application in a specific local setting.17 If heterogeneity in predictions between settings is substantial, setting‐specific covariates may be included to explain part of the observed heterogeneity, as was illustrated in our case study. If substantial heterogeneity remains, a global prediction model is suboptimal in some settings. For settings that were included in the IPD MA, either study‐specific estimates of coefficients may be used, or a suboptimal performance may be acceptable. If, however, the model is applied in a new setting, not included in the MA, several other options may be considered:

1. Use the global model with the common estimates of baseline risk and common estimates for the predictor effects;
2. Pick baseline risk and/or coefficients from a setting considered most similar to the new setting, based on setting and context specific characteristics;
3. Use the global model as prior information and update the model for the new setting.

Option 3 requires IPD from the local setting. If heterogeneity appears to be substantial, there is an urgent need to gather such local data to update the model. Furthermore, the usefulness of the global model as prior information will be limited as the information in newly collected data will outweigh the information in the global model. Updating methods may start with simple adjustment of the baseline risk to guarantee calibration‐in‐the‐large in the local setting. More extensive updating approaches can be considered, guided by the available sample size.28, 50
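
The simplest updating step, adjusting the baseline risk so that the mean predicted risk matches the observed event rate in the local setting, can be sketched as follows. This is an illustration under stated assumptions: the function name and the toy linear predictors are hypothetical, the predictor effects of the global model are kept fixed, and the intercept shift is found by bisection.

```python
import math


def expit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def intercept_shift(lp, observed_rate, lower=-20.0, upper=20.0, n_iter=100):
    """Find delta such that mean(expit(lp + delta)) equals the observed event rate."""
    if not 0.0 < observed_rate < 1.0:
        raise ValueError("observed_rate must lie strictly between 0 and 1")
    for _ in range(n_iter):
        mid = (lower + upper) / 2.0
        mean_risk = sum(expit(x + mid) for x in lp) / len(lp)
        # The mean predicted risk increases monotonically with the shift.
        if mean_risk < observed_rate:
            lower = mid
        else:
            upper = mid
    return (lower + upper) / 2.0


# Toy local setting: linear predictors from the global model for four
# patients, and a local event rate of 50%.
local_lp = [-2.0, -1.0, 0.0, 1.0]
delta = intercept_shift(local_lp, observed_rate=0.5)
updated_risks = [expit(x + delta) for x in local_lp]
```

In this toy example the shift is 0.5, which centers the updated linear predictors around zero. With individual outcomes available, the same shift is obtained as the intercept of a logistic regression with the linear predictor as offset.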

CONCLUSIONS

Meta‐analytical techniques have a key role in the development of a global prediction model across multiple studies. As illustrated, IPD from multiple studies give excellent opportunities to investigate, quantify, and report any heterogeneity in the baseline risks, predictor effects, predictions, and performance of a prediction model. We expect that such IPD meta‐analyses will increasingly be performed and advocate the presented approaches to assessing heterogeneity for model selection and presentation of uncertainty in the chosen model.

