Literature DB >> 32022311

On the aggregation of published prognostic scores for causal inference in observational studies.

Tri-Long Nguyen^1,2,3, Gary S Collins⁴, Fabio Pellegrini⁵, Karel G M Moons^2,6, Thomas P A Debray^2,6.

Abstract

As real world evidence on drug efficacy involves nonrandomized studies, statistical methods adjusting for confounding are needed. In this context, prognostic score (PGS) analysis has recently been proposed as a method for causal inference. It aims to restore balance across the different treatment groups by identifying subjects with a similar prognosis for a given reference exposure ("control"). This requires the development of a multivariable prognostic model in the control arm of the study sample, which is then extrapolated to the different treatment arms. Unfortunately, large cohorts for developing prognostic models are not always available. Prognostic models are therefore subject to a dilemma between overfitting and parsimony; the latter being prone to a violation of the assumption of no unmeasured confounders when important covariates are ignored. Although it is possible to limit overfitting by using penalization strategies, an alternative approach is to adopt evidence synthesis. Aggregating previously published prognostic models may improve the generalizability of PGS, while taking account of a large set of covariates-even when limited individual participant data are available. In this article, we extend a method for prediction model aggregation to PGS analysis in nonrandomized studies. We conduct extensive simulations to assess the validity of model aggregation, compared with other methods of PGS analysis for estimating marginal treatment effects. We show that aggregating existing PGS into a "meta-score" is robust to misspecification, even when elementary scores wrongfully omit confounders or focus on different outcomes. We illustrate our methods in a setting of treatments for asthma.

Entities: Chemical Disease Gene Species

Keywords: aggregation; causal inference; observational study; prognostic score; regression modelling

Mesh：

Year: 2020 PMID： 32022311 PMCID： PMC7187258 DOI： 10.1002/sim.8489

Source DB: PubMed Journal: Stat Med ISSN： 0277-6715 Impact factor: 2.373

INTRODUCTION

Nonrandomized studies are an important source of evidence for causal inference, and often used to assess the safety and effectiveness of certain treatments.1 Because nonrandomized studies are observational by nature, differences in individual outcomes cannot directly be attributed to the treatment exposure, and the estimation of comparative treatment effects is therefore prone to confounding bias.2 Over the past few decades, several methods for addressing confounding in nonrandomized studies have been proposed. Amongst these, prognostic score (PGS) analysis (also referred to as “disease‐risk score analysis”) has recently gained popularity in the medical literature.3, 4, 5, 6, 7, 8, 9 Initially formalized by Hansen9 in the case of binary treatments, the theory of the PGS has since been extended to the case of general treatment regimes.6 This approach can be seen as the “prognostic analogue of the propensity score” and is aimed at achieving a prognostic balance across the different treatment arms or exposure groups.6, 9 Briefly, PGS analysis involves the development of a multivariable model predicting an efficacy or safety endpoint (eg, mortality) in patients under a specific exposure (often the control exposure). This model is then applied to the patients of the remaining treatment exposure groups, to estimate their most likely outcome if they had received the control exposure.6, 9 In other words, the PGS analysis develops a prediction model in a set of control individuals, and then extrapolates its predictions to individuals of other exposure groups. Hansen9 (in binary exposures), then Nguyen and Debray6 (in general treatment regimes), showed that conditioning on this PGS allows to estimate treatment effects that are free from confounding bias. Yet, a key requirement for PGS analysis is that the developed prognostic model provides sufficiently accurate predictions of the (potential) outcome in the control and treated populations. This implies that the prognostic model should properly account for covariates that are associated with the outcome under the control exposure, particularly if their distribution varies across the treatment groups.6 Unfortunately, developing accurate prognostic models is subject to dilemmas when few control individuals are available. On the one hand, considering too numerous covariates is likely to result in overfitting, which impedes the generalizability of the prognostic model to other treatment groups. On the other hand, many parsimonious modelling strategies such as penalization and variable selection (eg, least absolute shrinkage and selection operator [LASSO] regression) typically lead to biased outcome predictions, which in turn might introduce (unmeasured) confounding in PGS analysis. To avoid this problem, it has been suggested to use prognostic models developed from large external cohorts of control individuals.3, 4, 5, 8 However, there is a growing evidence that published prognostic models tend to have limited transportability across different settings and populations, especially when studies adopt different selection criteria, different variable and outcome definitions, or different measurement methods. For this reason, it is recommended to tailor or update existing models before implementing them in new populations.10, 11 Rather than focusing on a single external prognostic model, a much more effective strategy might be to utilize and combine multiple previously published prognostic models. In particular, by combining multiple prediction models it becomes possible to account for a large number of covariates without having to estimate their individual effects in the data at hand—which may limit both the risk of overfitting and that of unmeasured confounding in PGS analysis. Debray et al12 previously showed that aggregating multiple published prognostic models, rather than developing or tailoring/updating a single model, helps improve predictive accuracy in new settings (ie, transportability), particularly when model development samples are relatively small. In this article, we extend this aggregation approach to PGS analysis for causal inference purposes in nonrandomized studies. The article is structured as follows: Section 2 presents an overview of the framework of PGS analysis; Section 3 describes a method for aggregating published prognostic models within the context of PGS analysis; Section 4 displays an illustrative study; Section 5 reports two series of simulations; finally, Section 6 opens a discussion on our proposed approach to PGS analysis.

PGS ANALYSIS FOR CAUSAL INFERENCE: AN OVERVIEW

Causal estimands

Following Rubin's counterfactual model,2 for individual , let be the treatment status (we consider a binary treatment exposure in this article: , treated individual; , control individual), and the potential outcomes if were set to be equal to 1 and 0, respectively, and a set of measurable prognostic covariates. Under the “stable unit treatment value assumption”, individual potential responses are not influenced by other units , ∀.13 Hence, for individual the effect caused by the treatment, say “individual treatment effect” (ITE), is defined as: In practice, ITE can never be measured—which is referred to as the “fundamental problem of causal inference”14—as only or only can be observed, but never both. Under the “consistency rule”, the observed outcome is usually defined as: .15 An quantity of interest to be estimated is the Average Treatment effect in the Entire population (ATE); that is, the average difference in outcome that would be caused if all individuals were to receive the treatment vs if all individuals were not to receive the treatment (but the control): (Index disappears since potential outcomes and ITEs are averaged across individuals.) Another quantity of interest is the Average Treatment effect in the Treated population (ATT); that is, the average difference in outcome that would be caused if the treated population were treated (“factual” condition) vs if the treated population were not treated but under the control exposure (“counterfactual” condition): As opposed to the ATE, here the expectation is iterated over the treated population instead of the entire population. The ATT is often an estimand of primary interest since it refers to the causal effect of the treatment in the specific population that it is intended for.16 Hence, we focus on the estimation of this estimand in the current article. In nonrandomized studies, treatment allocation (eg, drug prescription) is often related to several (un)measured covariates. As a consequence, no causal effect can be directly inferred from the average difference in the observed outcomes across the treatment arms.2 (In other words, there is no reason for to coincide with the ATT or the ATE.) In response to this issue, multivariable scoring methods for addressing confounding have been proposed, such as the propensity score analysis and its prognostic analogue.6, 9, 17, 18

Prognostic scores

Briefly, PGS analysis aims to achieve a prognostic balance between different exposure groups. It thereby approximates a design in which treated and control individuals would have a similar distribution of prognosis if they were left untreated (ie, under the control exposure).9 Note, nonrandomized clinical datasets generally include more control than treated units, which is the reason why the PGS is often defined with respect to the control exposure; in contrary case (ie, less control than treated units), “treated” can be relabeled as “control” and vice‐versa. Although PGS analysis originally focused on causal inference with binary exposures, we recently discussed how to investigate general treatment regimes.6 For the sake of simplicity, we here focus on binary treatment exposures. Let be a PGS such that .9 Hansen9 showed that, under the assumption of no “hidden bias” given (ie, ; that is, captures all confounders), conditioning on leads to . From this, under the rule of consistency (ie, ) and the assumption of “positivity” conditional on the PGS (ie, 0 < Pr( with certainty), the ATT can be identified as follows. (For mathematical proofs, see Hansen.9) denotes the total expectation over all in the treated population. Thus, conditioning on the PGS via subclassification or matching allows an unbiased estimation of the ATT.6, 9 (Note, this identification of the ATT holds regardless of the presence/absence of effect modifiers, which is not the case for the ATE.6, 9). The main advantages of PGS analysis are the following. First, as opposed to the propensity score (defined as ), the PGS does not require to predict the treatment allocation, and is therefore particularly useful when exposures are rare or have a large number of categories.6, 19, 20 Second, PGS analysis does not require the strong positivity assumption needed in propensity score analysis, Pr{0 < Pr(,18 but instead the relaxed assumption Pr{0 < Pr(.9 Rather than requiring a stochastic treatment allocation across all covariate values, PGS analysis only requires a stochastic treatment allocation across all values of the PGS.

Practical issues in PGS analysis

In practice, neither the propensity score, , nor the PGS, , is known; they are both estimated using the available nonrandomized data. Whilst the propensity score model, , is fitted to the entire sample (ie, including both treated and control individuals), the PGS model, , is fitted only to the control arm, as a regression model fitted to both arms would estimate a confused mixture of and .9 (Though it may be possible to use both arms by stratifying all unknown PGS parameters across the treatment groups, for instance by means of interaction terms, such an approach is likely to be at risk of model‐dependence. For instance, see Groenwold et al21 or van Klaveren et al.22) Two practical issues should be considered in PGS analysis: After fitting to the control arm, , the estimated PGS model is to be extrapolated to the treated arm, . Contrary to propensity score analysis where the balance property of can be evaluated in the entire sample, the balance property of is assessable only in the control arm, since is not observable in the treated arm (ie, “fundamental problem of causal inference”). Practical issue (A) arises from the PGS model being estimated only in the control patients. This implies that the sample size for estimating a PGS model is smaller than for propensity score modelling. The estimated PGS model is then applied to all patients of the study to predict their potential outcome . Since treated patients do not contribute to the estimation of the prognostic model, it is possible that model predictions do not transport well to the treated arm. This may affect the validity of the subsequent treatment effect estimation. Practical issue (B) arises from the difficulty to evaluate the accuracy of the PGS model, because is not observed in treated patients. As a result, this accuracy can be assessed only in the control patients. In this sense, (A) is a matter of modelling (ie, to be ensured, the generalizability of requires an appropriate modelling strategy); whilst (B) is a matter of diagnostic (ie, to be ensured, the balance property of needs proper diagnostic methods). Some authors have proposed the use of a “dry‐run” analysis in response to (B)23; however, there remains, to our best knowledge, a paucity of the literature addressing (A). It is well‐recognized that a prognostic model derived in one particular sample may not generalize well to other samples from the same (control) or different (treated) populations. Overfitting is likely to happen when the development sample is relatively small (compared to the number of candidate predictors); and transportability issues are likely to occur when the case‐mix of the development sample is too narrow.24, 25, 26, 27 In PGS analysis, it is of particular importance that properly accounts for covariates that substantially affect prognosis and may have different distributions across the treatment exposure groups. Although can be derived in the same cohort containing the treated and untreated individuals, it has been shown through simulations that obtained from a separate sample of control individuals should be more efficient.9 Hereupon, follows the discussion that “these difficulties [related to same‐sample estimation] may be substantially avoided if an alternate sample of controls, perhaps historical controls, are available for the determination of the PGS.”9 The use of external, historical scores has therefore gained increased consideration, as they often improve the performance of PGS analysis.3, 4, 5, 8 Unfortunately historical controls are not always easy to obtain, and models derived from historical controls may not necessarily calibrate well across different studies or treatment groups (for instance, due to differences in variable definitions or measurement methods across cohort studies). Furthermore, the use of a single, historical PGS model may not always be straightforward, particularly when multiple competing models exist. Hence, instead of considering one external, historical PGS model at face validity, we introduce the idea of aggregating multiple published PGS models, and tailoring them to the control arm of the nonrandomized treatment study.

PROPOSED METHOD: AGGREGATION OF PUBLISHED PGSs

Debray et al12 previously proposed aggregating published prediction models to help improve their accuracy and generalizability across different settings and populations. The approach bears some similarities with penalization during prediction model development, as it loosely imposes (in this case, previously estimated) predictive associations to the data at hand. The authors showed through clinical datasets and simulations that the aggregation of multiple models into one “meta‐model” tends to outperform redevelopment or updating of a single prognostic model, particularly when development datasets are relatively small compared to the number of predictors being studied.12 We focus on an approach called “stacked regressions”,12 that is, a linear combination of published models—that we extend to the PGS analysis. Let be the set of published PGS models, , wherein, using generalized linear models with link function: Stacked regressions incorporate each prediction as a predictor in a generalized linear regression model that is to be estimated in the control arm of the nonrandomized study sample. For instance, for samples with binary outcomes, the meta‐model is estimated by optimizing the following log‐likelihood function in the N control individuals with Z =0: In this expression, the constraint was proposed to allow the omission of literature models with strong collinearity.12 Further, is defined as the set of all covariates across the published PGS models. The estimation of the parameters leads to the meta‐model : Since originate from different studies and settings, they are likely to include many covariates. This implies that the stacked model, which only involves one unknown parameter for each literature model (rather than for each covariate), allows one to adjust for many confounders even when very few data are at hand, thereby minimizing both the risk of unmeasured confounders and that of overfitting. The PGS model derived from the meta‐model, , is then used for conditioning (eg, matching or subclassification). Replacing by in Equation (1), one can estimate the ATT as: The next section presents an illustrative case in which we estimate the ATT, using a “meta‐score”, external (ie, previously published) PGS, or same‐sample PGS.

ILLUSTRATIVE CASE STUDY

Methods

To illustrate our approach, we performed a subgroup analysis of the ACCURATE trial.28 (The data that support the findings of the current study are available from the investigators of this trial upon reasonable request.) Briefly, this pragmatic trial investigated three asthma interventions to reduce the 1‐year risk of exacerbation in asthmatic patients (see Data S1, Section 1.1). The severity of clinical manifestations of asthma was classified into controlled asthma (Ca), partly Ca (PCa), and uncontrolled asthma categories. Here, we defined Ca as an Asthma Control Questionnaire (ACQ) score equal or below 0.75, and PCa as 0.75 < ACQ ≤1.50.28 We considered PGS analysis in the ACCURATE trial to estimate the comparative efficacy of PCa (“treatment”) vs Ca (“control exposure”) in a subgroup of female asthmatic patients. A naïve estimation of the causal effect would likely suffer from confounding bias given the cluster‐randomization,29, 30 subgroup analysis,31, 32 and exclusion of some randomized patients (see Data S1, Section 1.1). We conducted three strategies of PGS analysis. First, we derived three same‐sample (logistic) PGS models using maximum likelihood (ML) estimation, ridge regression, and LASSO regression. These scores were developed from the “control” female patients allocated to Ca, and included seven covariates: ACQ score; current smoking status (yes/no); chronic sinusitis complaints (yes/no); hospitalization for asthma ever (yes/no); steroid courses for asthma within last year (yes/no); predicted prebronchodilator forced expiratory volume in 1 second (FEV1); fraction of exhaled nitric oxide. (We chose these covariates as they had been previously identified as prognostic variables,33 and thus, potential confounders.) Second, we used three external, published asthma prediction models as PGSs, which predicted different but related outcomes (Data S1, Section 1.2). The model reported by Schatz et al34 predicted the risk of asthma hospitalization at 1 year, given three covariates: hospitalization for asthma ever (yes/no); steroid courses for asthma within last year (yes/no); income. The model presented by Eisner et al. estimated the risk of unscheduled care visit at 1 year, given two covariates: severity of asthma score and asthma control test.35 The TENOR model, reported by Miller et al,36 predicted the risk of emergency department visit or hospitalization at 6 months, given 10 covariates: younger age; female sex; nonwhite race; body mass index > or = 35 kg m−2; postbronchodilator percent predicted forced vital capacity <70%; history of pneumonia; diabetes; cataracts; intubation for asthma; and three or more steroid bursts in the prior 3 months. Finally, using the proposed method of stacked regressions, we aggregated the three external scores (Schatz score, Eisner score, and TENOR score) into a “meta‐score” in the “control” arm of study data, and compared this meta‐score with the same‐sample PGS. All covariate sets and outcomes considered in each of the PGS are reported in Table 1.

Table 1

Covariates and outcomes considered in different PGSs

	Same‐sample score	Schatz score	Eisner score	TENOR score	Meta‐score
Covariates	7	3	2	10	15
ACQ score	X
Currently smoking	X
Chronic sinusitis complaint	X
Hospitalization for asthma	X	X			X
Steroid course for asthma within last year	X	X			X
Predicted FEV1	X
Fraction of exhaled nitric oxide	X
Income		X			X
Severity of asthma score			X		X
Asthma control test			X		X
Age				X	X
Sex				X	X
Race				X	X
Body mass index				X	X
Predicted FVC				X	X
History of pneumonia				X	X
Diabetes				X	X
Cataracts				X	X
Intubation for asthma				X	X
Steroid bursts in the prior 3 months				X	X
Outcomes
Exacerbation at 1 year	X				X
Hospitalization at 1 year		X
Unscheduled care visit at 1 year			X
Emergency department visit or hospitalization at 6 months				X

Abbreviations: ACQ, Asthma Control Questionnaire; FEV1, prebronchodilator forced expiratory volume in 1 s; FVC; postbronchodilator forced capacity volume; PGS, prognostic score.

Covariates and outcomes considered in different PGSs Abbreviations: ACQ, Asthma Control Questionnaire; FEV1, prebronchodilator forced expiratory volume in 1 s; FVC; postbronchodilator forced capacity volume; PGS, prognostic score. We performed a full matching analysis on the PGSs to estimate the ATT.6 To address the issue of missing data (which concerned only the same‐sample PGSs), we used multiple imputation as described by Leyrat et al37 in propensity score analysis, who recommend to apply Rubin's rules on the estimated treatment effects across the imputed datasets, and not over the scores. We therefore created 10 imputed datasets. In each imputed dataset, we fit three PGSs (based on ML, ridge, and LASSO) to the control arm, matched on each on these three scores (separately), and estimated the treatment effect. We then averaged the treatment effect estimates obtained from the 10 imputed datasets. We computed standard errors of treatment effect estimates (and 95% confidence intervals [CI]) by bootstrapping (1000 iterations): we resampled the initial dataset with replacement and performed in each bootstrap sample all aforementioned analyses, including multiple imputations.38 We estimated the standard errors as the standard deviations of the estimates across the 1000 iterations.

Results

In the subgroup of 281 female patients, the active “treatment” (ie, strategy aiming at PCa) was allocated to 134 individuals (47.7%). Table 2 summarizes baseline characteristics and predicted outcome risks across the two groups. Twenty‐three (15.6%) and 21 (15.7%) patients experienced an exacerbation at 1 year in the “control” and in the “treatment” group, respectively. Thus, a naïve approach ignoring the potential for confounding bias would estimate the treatment effect to be equal to 0.001, on an absolute risk difference scale. Note that given the few outcomes observed in the control arm, the same‐sample PGSs—in particular, the one estimated by ML—may have suffered more from overfitting issues (ratio of events per variable [EPV] = 3.3) than did the “meta‐score” (EPV = 7.7).39, 40

Table 2

Baseline characteristics and predicted outcome risks in the subgroup of female patients (N = 281)

	Ca	PCa
	n = 147 (42.7%)	n = 134 (37.7%)
ACQ score	1.22 (SD: 1)	1.02 (SD: 0.96)
Currently smoking	21 (14.5%)	22 (16.9%)
Chronic sinusitis complaint	20 (13.7%)	17 (13.2%)
Hospitalization for asthma	14 (9.5%)	15 (11.2%)
Steroid course for asthma within last year	39 (26.5%)	27 (20.1%)
Predicted FEV1 (%)	92.29 (SD: 14.33)	92.35 (SD: 16.75)
Fraction of exhaled nitric oxide	26.40 (SD: 30.96)	21.83 (SD: 21.18)
Schatz score (logit scale)	−3.32 (SD: 0.61)	−3.30 (SD: 0.56)
Eisner score (logit scale)	−1.06 (SD: 0.48)	−1.11 (SD: 0.45)
TENOR score (logit scale)	−3.75 (SD: 0.87)	−3.81 (SD: 0.89)
Exacerbation at 1 year	23 (15.6%)	21 (15.7%)

Note: Mean and SD are reported for continuous variables; frequency and percentage for binary variables.

Abbreviations: ACQ, Asthma Control Questionnaire; Ca, aiming at controlled asthma; FEV1, prebronchodilator forced expiratory volume in 1 s; PCa, aiming at partially controlled asthma.

Baseline characteristics and predicted outcome risks in the subgroup of female patients (N = 281) Note: Mean and SD are reported for continuous variables; frequency and percentage for binary variables. Abbreviations: ACQ, Asthma Control Questionnaire; Ca, aiming at controlled asthma; FEV1, prebronchodilator forced expiratory volume in 1 s; PCa, aiming at partially controlled asthma. After full matching on the same‐sample PGSs derived by ML, ridge, and LASSO regressions, the ATT was estimated to be equal to 0.021 (95% CI: −0.054 to 0.097), 0.024 (95% CI: −0.052 to 0.099), and 0.022 (95% CI: −0.052 to 0.096), respectively. After full matching on the (historical) Schatz, Eisner, and TENOR scores, the estimated ATT was equal to 0.018 (95% CI: −0.069 to 0.105), 0.027 (95% CI: −0.054 to 0.108), and − 0.002 (95% CI: −0.088 to 0.085), respectively. After full matching on the meta‐score (ie, aggregation and simultaneous updating of the Schatz, Eisner, and TENOR scores), the estimated ATT was equal to 0.019 (95% CI, −0.060; 0.098). In summary, the evaluated approaches yielded an absolute treatment effect ranging from 0 to 3%, with considerably large CI. Given these differences in estimated treatment effects, we conducted two simulation studies to better inform the performance of each method.

SIMULATION STUDIES

Simulation study 1

Data generation

We conducted a series of simulations in which we estimated the ATT of a (fictive) treatment . We considered nine covariates and a binary outcome (see Box 1). To generate data with realistic distributions and correlation structures, we took the ACCURATE study as a reference,28 and generated new, simulated, datasets by random sampling from its posterior distribution (Data S1, Sections 1.3 and 1.4). Briefly, we estimated the conditional distributions of in the ACCURATE study. The conditional distributions were then combined into a Markov chain where Monte Carlo sampling (using 1000 iterations) was used to generate new individuals that resemble those observed in the ACCURATE cohort (see Box 1). We defined four populations A, B, C, and D with different patient characteristics and focusing on different but related outcomes. Population D (and its corresponding outcome definition) was of primary interest for estimation of the ATT of treatment such that . Samples of population D were, however, small and restricted to . Populations A, B, and C represented external historical cohorts, from which previous prognostic models could have been derived and published in the literature. These external cohorts had larger sample sizes ) but adopted outcome definitions that did not necessarily correspond to the outcome of primary interest (, , ). Furthermore, since in clinical practice inclusion criteria are likely to differ across studies, we defined bounds (ie, truncation) to impose slightly different covariate distributions across the four populations. In samples of population D, we generated the potential outcome of primary interest that would be observed were individuals to receive the fictive treatment (see Box 1). We defined using a nonlinear nonmonotonic function of (see Box 1). In samples of population D, we generated the treatment status according to a logistic model mimicking a clinical decision (ie, true propensity model; see Box 1). This model set the treatment prevalence at 35 to 40%. Finally, we generated the observed outcome using the rule of consistency. For each population (A, B, C, and D), we used this procedure to generate samples. Across all simulated samples of population D, the true ATT corresponded, on an absolute risk difference scale, to the empirical expectation:

Statistical analysis

In each simulation , we compared three strategies of PGS analysis for estimating the ATT. First, we fitted a logistic regression model to the control arm of sample to predict the potential outcome in this same sample (ie, same‐sample PGS). We included all covariates ( to ) into the regression model, such that no hidden bias was introduced (ie, “utopian” case). Second, we used the score derived from sample and applied it as a PGS in sample (ie, external PGS). Finally, we performed the proposed method of stacked regressions to aggregate external PGSs obtained from samples , , and into a meta‐score which predicts in sample (ie, aggregated external PGSs). To assess the robustness of the use of external PGSs (either individually or after aggregation), we considered different scenarios (see Box 2). Regardless of whether the PGSs were externally or same‐sample derived, we considered three approaches to regression: ML, ridge, and LASSO regressions. To evaluate the impact of the sample size of historical cohorts from which external scores were derived, we defined samples of populations A, B, and C as including either 500, 1000, or 2000 subjects. Therefore, at each iteration we had: three same‐sample PGSs derived from ; 36 external PGSs derived from sample ; 36 meta‐PGSs aggregating models derived from samples , , and . We performed full matching analysis on the PGSs.6 As opposed to classical matching methods, full matching allows the entire sample to be preserved.41, 42 It creates strata within which control and treated units are paired in an optimal way such that the overall distance between them is minimized. Since each stratum (or “pair”) can include unequal numbers of treated and controls, individuals are to be weighted.41, 42 Let denotes the stratum in which individual falls; to estimate the ATT, the individual weight is: We estimated the ATT as the weighted average difference in outcome across the two treatment arms. We compared the performance of the three PGS analyses, by measuring the bias and mean‐squared error (MSE) of estimates.

Results

Across all simulations, the mean treatment prevalence in population D was equal to 37.1%. The mean outcome prevalence was equal to 15.3% in control individuals and 16.9% in treated individuals, thereby leading to a mean naïve absolute risk difference of 0.016 (empirical SD: 0.077). In populations A, B, and C (ie, historical control cohorts), the mean outcome prevalence was equal to 10.9%, 12.3%, and 10.6%, respectively. Using the different methods of PGS, the performances (bias, SD, and MSE) of ATT estimates across all scenarios were reported in Table 3.

Table 3

Performance of different PGS analyses

PGS	Outcome of external score	Covariates of external score	Modelling method	NkA,B,C= 500		NkA,B,C= 1000		NkA,B,C= 2000
PGS	Outcome of external score	Covariates of external score	Modelling method	Relative bias	100× MSE	Relative bias	100× MSE	Relative bias	100× MSE
Same‐sample PGS			ML	−14.9%	0.6509	−14.9%	0.6509	−14.9%	0.6509
			Ridge	−2.3%	0.6070	−2.3%	0.6070	−2.3%	0.6070
			LASSO	−14.9%	0.6692	−14.9%	0.6692	−14.9%	0.6692
External PGS	Identical	Identical	ML	−13.8%	0.6235	−10.3%	0.6043	−8.0%	0.5987
			Ridge	−11.5%	0.6247	−8.0%	0.5928	−6.9%	0.5920
			LASSO	−12.6%	0.6197	−10.3%	0.5956	−8.0%	0.6047
		Different	ML	−41.4%	0.8078	−39.1%	0.7728	−39.1%	0.7769
			Ridge	−41.4%	0.8050	−39.1%	0.7820	−37.9%	0.7640
			LASSO	−42.5%	0.8040	−39.1%	0.7801	−39.1%	0.7686
	Different	Identical	ML	−48.3%	0.9012	−39.1%	0.8083	−26.4%	0.7057
			Ridge	−43.7%	0.8815	−33.3%	0.7707	−21.8%	0.6958
			LASSO	−64.4%	1.0745	−52.9%	0.9468	−33.3%	0.7664
		Different	ML	−62.1%	1.0218	−58.6%	0.9860	−51.7%	0.9065
			Ridge	−63.2%	1.0452	−58.6%	0.9799	−51.7%	0.9055
			LASSO	−74.7%	1.1631	−70.1%	1.1072	−60.9%	0.9829
Aggregated external PGSs	Identical	Identical	ML	−6.9%	0.5975	−5.7%	0.5851	−5.7%	0.5729
			Ridge	−3.4%	0.5759	−3.4%	0.5861	−3.4%	0.5675
			LASSO	−4.6%	0.5775	−4.6%	0.5911	−4.6%	0.5713
		Different	ML	−5.7%	0.5898	−4.6%	0.5767	−3.4%	0.5869
			Ridge	−4.6%	0.5859	−3.4%	0.5848	−3.4%	0.5805
			LASSO	−5.7%	0.5931	−3.4%	0.5706	−3.4%	0.5835
	Different	Identical	ML	−13.8%	0.6337	−9.2%	0.6146	−5.7%	0.5956
			Ridge	−9.2%	0.6235	−4.6%	0.5854	−2.3%	0.5884
			LASSO	−17.2%	0.6644	−9.2%	0.6118	−5.7%	0.6020
		Different	ML	−12.6%	0.6308	−9.2%	0.6111	−5.7%	0.6024
			Ridge	−11.5%	0.6228	−8.0%	0.6068	−5.7%	0.6003
			LASSO	−19.5%	0.6585	−11.5%	0.6202	−8.0%	0.6202

Abbreviations: LASSO, least absolute shrinkage and selection operator; ML, maximum likelihood; MSE, mean‐squared error; , sample size for external prognostic score derivation; PGS, prognostic score.

Performance of different PGS analyses Abbreviations: LASSO, least absolute shrinkage and selection operator; ML, maximum likelihood; MSE, mean‐squared error; , sample size for external prognostic score derivation; PGS, prognostic score. Though it was considered under “utopian” conditions (ie, no misspecification nor hidden bias), we found that the use of same‐sample PGSs could lead to a biased estimation of the ATT, in particular when the PGS was derived by ML or LASSO regression (Table 3). This bias may have arisen from overfitting when using ML, whilst it might be due to residual confounding given variable selection when using LASSO regression. In comparison, same‐sample PGSs derived by ridge regression performed better (Table 3). The use of external PGSs (derived from samples of population A) that included all covariates and predicted the outcome of primary interest yielded a performance comparable to that of same‐sample PGSs, when the sample size for external PGS derivation was relatively small, 500 (Figure 1). This performance in estimating the ATT improved when the size of samples was larger, (Figures 2 and 3). We showed that external PGSs including a limited set of covariates—thus, potentially omitting important confounders—resulted in biased estimates, regardless of whether the outcome of sample was identical to the primary outcome of sample (Figures 1, 2, 3). In addition, considering a different but related secondary outcome ( instead of ) also led to bias, even when no confounder was missing in the external PGSs (Figures 1, 2, 3). The use of external PGSs derived by ridge regression generally outperformed those obtained by ML and LASSO regression across the different scenarios (Figures 1, 2, 3).

Figure 1

Figure 2

Figure 3

Performance of different approaches to PGS analysis (sample size for external score derivation, ). The dotted line refers to the “true” ATT. AES, aggregated external prognostic scores; ES, external prognostic scores; LASSO, least absolute shrinkage and selection operator; ML, maximum likelihood; PGS, prognostic score; SS, same‐sample prognostic scores [Colour figure can be viewed at http://wileyonlinelibrary.com] Performance of different approaches to PGS analysis (sample size for external score derivation, ). The dotted line refers to the “true” ATT. AES, aggregated external prognostic scores; ES, external prognostic scores; LASSO, least absolute shrinkage and selection operator; ML, maximum likelihood; PGS, prognostic score; SS, same‐sample prognostic scores [Colour figure can be viewed at http://wileyonlinelibrary.com] Performance of different approaches to PGS analysis (sample size for external score derivation, ). The dotted line refers to the “true” ATT. AES, aggregated external prognostic scores; ES, external prognostic scores; LASSO, least absolute shrinkage and selection operator; ML, maximum likelihood; PGS, prognostic score; SS, same‐sample prognostic scores [Colour figure can be viewed at http://wileyonlinelibrary.com] As shown in Figures 1, 2, 3, the use of aggregated external PGSs after stacked regressions led to unbiased estimates of the ATT. This result appeared consistent across all (unfavorable) scenarios. When all three external PGSs (derived from samples of populations A, B, and C) included all covariates and predicted the outcome of primary interest, the use of a meta‐PGS outperformed all other methods in terms of bias, variance, and mean square error (Table 3). We observed that the aggregation of external PGSs derived by ridge regression led to the best performance (Table 3). In the most unfavorable scenario, in which the three external PGSs included different covariate sets and predicted secondary outcomes (, , and instead of ), the aggregation of PGSs still allowed an unbiased estimation of the ATT, especially when the historical cohorts were large (Figures 2 and 3). Even in this worst‐case scenario, the aggregation of external PGSs yielded a performance comparable to that of “utopian” same‐sample PGSs derived by ridge regression (Table 3). Additional results from simulations on larger sample sizes () are reported in Data S1, Section 1.5. Results were more modest, but comparable to those described above.

Simulation study 2

We conducted a second series of simulations which favored the traditional same‐sample PGS over stacked regressions of external PGSs (for details, see Data S1, Section 2). We considered a binary outcome and 20 independent covariates (16 of these affected the outcome). Further, we defined the treatment under investigation as ineffective, that is, null treatment effect. We considered three scenarios to evaluate the performance of stacked regressions. For all scenarios, we generated five historical cohorts with 50 events (ie, rare outcome). Scenario A: The historical cohorts varied systematically with respect to their baseline risk. More specifically, there was between‐study heterogeneity in the intercept term of the data generation model from each historical cohort. Scenario B: The historical cohorts varied systematically with respect to their baseline risk and the overall magnitude of their predictor effects. More specifically, there was between‐study heterogeneity in the intercept term and the overall slope of the data generation model from each historical cohort. Scenario C: The historical cohorts varied systematically with respect to their baseline risk and their individual predictor effects. More specifically, there was between‐study heterogeneity in the intercept term and the coefficients of the data generation model from each historical cohort. The historical cohorts were used to derive the “previously published” PGSs following a backward selection procedure (ie, risk of omitting important confounders). Finally, we generated a nonrandomized study where the covariate distribution was unbalanced between treated and control patients (ie, confounding). We used this sample to estimate the ATT. Results demonstrated that stacked regressions helped reduce bias and stabilize treatment effect estimates (see Data S1, Section 2.3). Stacked regressions outperformed traditional PGS analysis even when historical cohorts differed with respect to baseline risk and relative strength of predictor effects (scenarios A and B). When historical cohorts were less related to the study population (scenario C), the aggregation of published PGSs showed a benefit over the traditional estimator only when the study sample at hand was relatively small.

DISCUSSION

We have proposed and assessed an extension of stacked regressions to PGS analysis, when multiple historical scores are available. We demonstrated through extensive simulations that this approach facilitated an unbiased estimation of causal treatment effects in nonrandomized studies, even if the historical PGSs omitted important covariates and/or predicted different (but related) outcomes. Since the treatment assignment mechanism in observational studies is often based on clinical decisions (which are likely to vary across practitioners), to have an exhaustive set of confounders at disposal would suppose (i) knowing, comprehensively, the underlying mechanism of treatment assignment and (ii) recording and accessing to all recognized confounders—a scenario which could be deemed “utopian”. A major advantage of PGS analysis (over propensity score analysis) is that it no longer requires to predict the treatment allocation and adopts a less restrictive positivity assumption. However, PGS analysis (propensity score analysis alike) assumes the absence of unmeasured confounding (“hidden bias”), and therefore requires to adjust for as many (possible) confounders as possible.6, 9 This may become problematic when relatively few patient‐level data are available. Although we show that ridge regression may improve the estimation of PGSs in small samples, further improvement is possible by considering evidence synthesis strategies. In particular, aggregation of multiple published PGSs allows to account for a wide set of confounding covariates (at a limited expense in terms of required degrees of freedom), and thus diminishes the potential for hidden bias in nonrandomized studies. Across our simulations, we show that aggregation outperforms traditional penalization methods and enables an unbiased treatment effect estimation, even when the PGSs include different covariates or predict different outcomes from that being investigated. Aggregation is most advantageous when the sample available for treatment effect estimation is small, likely because same‐sample PGS is prone to overfitting (when using ML regression) or residual confounding (when selecting covariates by LASSO regression). Interestingly, prognostic research is increasing, and prediction models are becoming abundant in the medical literature.27, 43 Nowadays, clinical observational datasets often include prognostic models aiming at predicting patient outcomes that are used in daily practice routine. For instance, in intensive care units some prognostic models (eg, simplified acute physiology score II44) are routinely recorded (therefore available in observational datasets), whilst the individuals model variables are not necessarily reported (eg, natremia level or patient temperature). Stacked regressions require no raw (individual participant) data from the historical cohorts; they only need access to the previously published models. These may be available as part of the study results and be presented as estimated regression coefficients, or (adjusted) relative risks or odds ratios. It is also possible to combine literature models, which are presented as score charts or risk scores (eg, TENOR risk score in our illustrative example), or which have been implemented in a web app. To critically appraise previously published prognostic models that may be relevant for aggregation, we recommend the use of PROBAST and CHARMS tools.45, 46, 47 Both were developed to help researchers assess the risk of bias and applicability of published prognostic models, and thus infer whether their inclusion would be beneficial. If considered at low risk of bias and applicable to the question at hand, published prediction models can be externally validated in the available data. Because stacked regressions involve a default update of the common intercept and slope, the inclusion of miscalibrated models may not necessarily be problematic. An important result from our simulation study is that using historical PGSs without considering any form of updating might be treacherous and lead to substantial bias. This practice has, however, often been recommended in the literature where it is common to assume that historical PGSs developed from very large samples (N ≥ 10 000) remain perfectly valid when applied to new patients.3, 4, 5, 8 It is, however, likely that historical scores are derived in different (but related) settings, under suboptimal conditions, which may include different covariate sets, consider different outcomes, or adopt different variable definitions. As a result, a single external PGS can be miscalibrated in the nonrandomized data at hand and result in biased treatment effect estimates. We therefore recommend to update published PGSs before their implementation, and propose to borrow strength across multiple PGSs by combining and updating them in the control arm of the study sample through stacked regressions. Recent extensions of stacked regressions allow to recalibrate multiple prediction models while concurrently revising specific covariates and adopting a penalized likelihood.48 As raised by an anonymous reviewer, stacked regressions bear some similarities with Super Learning.49, 50, 51 Both approaches create a weighted sum of candidate models (possibly derived using various modelling methods, such as generalized linear regression, neural networks, or random forests). However, rather than combining multiple models that are developed in a single dataset, we propose to combine published models previously developed in different (and possibly larger) datasets. As a result, the candidate models for stacked regressions may exhibit more diversity and provide new information that cannot be derived from the data at hand. Because these candidate models are more likely to be miscalibrated, stacked regressions simultaneously update their predictions to the study population and estimate their optimal combination. In a way akin to Super Learning, the performance of stacked regressions could further be improved by expanding the set of candidate models with additional models derived from the nonrandomized data at hand. Our study has to be considered in light of some limitations. First, we restricted our simulations to scenarios of small to medium sample sizes, and did not explore the benefit of aggregating PGSs over direct application of external PGSs derived from very large settings. The major interest of the latter resides in addressing the overfitting issues to which same‐sample approaches may be prone. Given that aggregating multiple PGSs enables to gather a large amount of information from the medical literature at the expense of but a very few parameters to be estimated, we hypothesize that integrating PGSs originating from large historical cohorts into meta‐scores may all the more contribute to the robustness of our approach. Second, we did not investigate the scenario in which meta‐scores did not cover all true confounders. If the analyst recognizes that confounders are left aside, one strategy to adopt may be to include those confounders into the meta‐score, along with the external PGSs. Finally, methods for estimating standard errors of treatment effect estimates obtained from our approach are needed; one of which might be the adoption of bootstrapping as we conducted in our illustrative study. We did not investigate this in the current simulation study as it requires substantial computational power. In conclusion, this article proposes a method for aggregating PGSs for causal inference. Through extensive simulations, we demonstrate the robustness of this approach, in particular, to misspecification. We show and discuss how our method could, interestingly, limit both the bias due to unmeasured confounders and that related to overfitting in PGS analysis.

CONFLICT OF INTEREST

The authors declare no conflict of interest. Data S1: Supporting Information Click here for additional data file.

41 in total

Review 1. Risk prediction models: II. External validation, model updating, and impact assessment.

Authors: Karel G M Moons; Andre Pascal Kengne; Diederick E Grobbee; Patrick Royston; Yvonne Vergouwe; Douglas G Altman; Mark Woodward
Journal: Heart Date: 2012-03-07 Impact factor: 5.994

2. On the consistency rule in causal inference: axiom, definition, assumption, or theorem?

Authors: Judea Pearl
Journal: Epidemiology Date: 2010-11 Impact factor: 4.822

3. Prognosis and prognostic research: application and impact of prognostic models in clinical practice.

Authors: Karel G M Moons; Douglas G Altman; Yvonne Vergouwe; Patrick Royston
Journal: BMJ Date: 2009-06-04

4. Propensity score estimators for the average treatment effect and the average treatment effect on the treated may yield very different estimates.

Authors: R Pirracchio; M Carone; M Resche Rigon; E Caruana; A Mebazaa; S Chevret
Journal: Stat Methods Med Res Date: 2013-11-06 Impact factor: 3.021

2. Establishment of a Nomogram-Based Prognostic Model (LASSO-COX Regression) for Predicting Progression-Free Survival of Primary Non-Small Cell Lung Cancer Patients Treated with Adjuvant Chinese Herbal Medicines Therapy: A Retrospective Study of Case Series.

Authors: Bin Luo; Ming Yang; Zixin Han; Zujun Que; Tianle Luo; Jianhui Tian
Journal: Front Oncol Date: 2022-07-08 Impact factor: 5.738

3. EA3: A softmax algorithm for evidence appraisal aggregation.

Authors: Francesco De Pretis; Jürgen Landes
Journal: PLoS One Date: 2021-06-17 Impact factor: 3.240

3 in total

On the aggregation of published prognostic scores for causal inference in observational studies.

INTRODUCTION

PGS ANALYSIS FOR CAUSAL INFERENCE: AN OVERVIEW

Causal estimands

Prognostic scores

Practical issues in PGS analysis

PROPOSED METHOD: AGGREGATION OF PUBLISHED PGSs

ILLUSTRATIVE CASE STUDY

Methods

Results

SIMULATION STUDIES

Simulation study 1

Data generation

Statistical analysis

Results

Simulation study 2

DISCUSSION

CONFLICT OF INTEREST

Review 1. Risk prediction models: II. External validation, model updating, and impact assessment.

2. On the consistency rule in causal inference: axiom, definition, assumption, or theorem?

3. Prognosis and prognostic research: application and impact of prognostic models in clinical practice.

4. Propensity score estimators for the average treatment effect and the average treatment effect on the treated may yield very different estimates.

5. A simulation study of the number of events per variable in logistic regression analysis.

6. Comparison of high-dimensional confounder summary scores in comparative studies of newly marketed medications.

7. Role of disease risk scores in comparative effectiveness research with emerging therapies.

8. A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study.

9. PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies.

10. Dimension reduction and shrinkage methods for high dimensional disease risk scores in historical data.

1. On the aggregation of published prognostic scores for causal inference in observational studies.

2. Establishment of a Nomogram-Based Prognostic Model (LASSO-COX Regression) for Predicting Progression-Free Survival of Primary Non-Small Cell Lung Cancer Patients Treated with Adjuvant Chinese Herbal Medicines Therapy: A Retrospective Study of Case Series.

3. EA3: A softmax algorithm for evidence appraisal aggregation.