H-G Eichler, B Bloechl-Daum, P Bauer, F Bretz, J Brown, L V Hampson, P Honig, M Krams, H Leufkens, R Lim, M M Lumpkin, M J Murphy, F Pignatti, M Posch, S Schneeweiss, M Trusheim, F Koenig.
Abstract
A central question in the assessment of benefit/harm of new treatments is: how does the average outcome on the new treatment (the factual) compare to the average outcome had patients received no treatment or a different treatment known to be effective (the counterfactual)? Randomized controlled trials (RCTs) are the standard for comparing the factual with the counterfactual. Recent developments necessitate and enable a new way of determining the counterfactual for some new medicines. For select situations, we propose a new framework for evidence generation, which we call "threshold-crossing." This framework leverages the wealth of information that is becoming available from completed RCTs and from real world data sources. Relying on formalized procedures, information gleaned from these data is used to estimate the counterfactual, enabling efficacy assessment of new drugs. We propose future (research) activities to enable "threshold-crossing" for carefully selected products and indications in which RCTs are not feasible.
Year: 2016 PMID: 27650716 PMCID: PMC5114686 DOI: 10.1002/cpt.515
Source DB: PubMed Journal: Clin Pharmacol Ther ISSN: 0009-9236 Impact factor: 6.875
Summary of key components relevant to the concept of the counterfactual and how it underpins the definition of a causal treatment effect
| Term/concept | Description |
|---|---|
| Counterfactual | Suppose a patient may receive one of two treatments: an experimental drug E or a control C. Patient i then has two potential outcomes: their response if they receive treatment E, denoted by Y_i(T_i = E), and their response if they receive treatment C, Y_i(T_i = C). Only the outcome corresponding to the treatment actually received will be observed and thus be factual; the other will remain counterfactual. |
| Causal effect | We often ask: what is the effect on outcome Y of taking drug E (as opposed to drug C), keeping all other things equal? |
| Individual causal effect | The causal effect of drug E on individual i is measured by Y_i(T_i = E) − Y_i(T_i = C). However, individual causal effects are not identifiable because only one of a patient's potential outcomes can ever be observed (the factual but not the counterfactual). |
| Average causal effect | The average causal effect of drug E on outcome Y can be expressed as the difference between the mean counterfactual outcome that would be observed if all population members received treatment E and the mean outcome that would be observed if everyone received C. |
| RCT estimating causal effects | Under certain assumptions, average causal effects can be estimated from RCTs. Randomly assigning treatments to patients ensures that (at least under repeated sampling) treatment groups will be exchangeable in the sense that the pairs of potential outcomes (Y_i(T_i = E), Y_i(T_i = C)) will be distributed in the same way across patients in groups E and C. This implies that the average observed outcome in group C will equal the average counterfactual outcome that would be observed if all patients in group E had received drug C instead, and vice versa. Thus, the average causal effect of drug E can be estimated by comparing average outcomes across treatment arms. |
| Causation vs. association | Association is the phenomenon whereby two occurrences tend to be seen together, for example, higher response rates among contemporary patients on drug E than among historical controls. However, the risk of confounding means that association does not imply causation. Instead, the higher response rates on drug E may be attributable to a common cause linking both the treatment received and the response. Examples include an imbalance in the baseline prognostic characteristics of patients, which may result from a drift in disease detection rates, improvements in patient management, or fundamentally different patient populations being included in the studies. |
| Bias | The estimate of the average causal effect of drug E obtained from a direct comparison of average outcomes among contemporary experimental patients and historical controls will be biased if these groups are not exchangeable. Departures from exchangeability may arise for myriad reasons, including shifts over time in disease detection, patient management, or the patient populations enrolled. |
| Historical controls for estimating average causal effects | By comparing the outcome of a single‐arm trial with historical controls, we can estimate the average causal effect among contemporary patients of receiving drug E as opposed to drug C. Below, we outline a selection of techniques that could be used to control for confounding when comparing outcomes from a single‐arm trial with historical controls. |
| Multivariable regression | Obtain an estimate of the causal effect by fitting a regression model to the historical and contemporary data adjusting for all confounders. |
| Inverse probability of treatment weighting | Create exchangeable treatment groups by weighting each individual by their fitted probability of receiving the treatment they actually received given their baseline covariates. Estimate causal effects by fitting models using weighted least squares. |
| Propensity scores | The conditional probability of an individual receiving drug E given their baseline covariates. Stratifying patients on the propensity score yields approximately exchangeable strata within which outcomes can be compared. |
| Instrumental variables | Causal effects are identifiable if one can find a strong “instrument” associated with the outcome only through its association with the treatment received. |
RCTs, randomized controlled trials; RWD, real world data.
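As a concrete illustration of the definitions in the table, the sketch below simulates both potential outcomes for a large population, recovers the average causal effect from a randomized comparison, and shows how a non-exchangeable historical control group biases the naive estimate. The outcome model, the treatment benefit of 0.2, and the prognostic shift of −0.3 in the historical cohort are all invented for illustration; this is not the paper's simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # large samples so estimates sit close to population values

# Potential outcomes: each patient i has Y_i(T_i = C) and Y_i(T_i = E).
# Assumed model (illustrative): a prognostic score drives the outcome,
# and drug E adds a constant benefit of 0.2.
prognosis = rng.normal(0.0, 1.0, n)
y_c = prognosis + rng.normal(0.0, 1.0, n)   # outcome if given control C
y_e = y_c + 0.2                              # outcome if given drug E
ace_true = (y_e - y_c).mean()                # average causal effect = 0.2

# RCT: randomization reveals only one potential outcome per patient, but
# the two arms are exchangeable, so the arm difference estimates the ACE.
in_e = rng.random(n) < 0.5
ace_rct = y_e[in_e].mean() - y_c[~in_e].mean()

# Naive historical comparison: suppose historical control patients had
# systematically worse prognosis (shifted by -0.3). The groups are no
# longer exchangeable and the estimate is biased upward by about 0.3.
hist_prognosis = rng.normal(-0.3, 1.0, n)
y_hist = hist_prognosis + rng.normal(0.0, 1.0, n)
ace_naive = y_e.mean() - y_hist.mean()

print(f"true ACE {ace_true:.2f}, RCT {ace_rct:.2f}, naive historical {ace_naive:.2f}")
```

The confounding here is exactly the "imbalance in baseline prognostic characteristics" described in the table: the historical group differs in prognosis, not only in treatment received.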
Figure 1. Flow diagram of a threshold-crossing trial. The top panel shows the initial, linear sequence of steps, and the bottom panel describes the adaptive follow-up after completion of the initial single-arm trial. RCT, randomized controlled trial.
Figure 2. We performed clinical trial simulations to evaluate the operating characteristics of threshold-crossing trials when frequentist hypothesis tests and corresponding sample size calculations for single-arm trials are naively applied.
To demonstrate the efficacy of a new drug, the most common approach is to conduct a parallel-group trial showing superiority of the new treatment over control, i.e., testing the null hypothesis H0: µ_E ≤ µ_C versus the alternative H1: µ_E > µ_C at a one-sided significance level of 2.5%, where µ_E and µ_C denote the expected response in the new and control treatment arms, respectively. For the results presented we assume a normally distributed endpoint with σ = 1. For example, if such a trial were powered at 80% to detect a standardized effect difference of Δ = (µ_E − µ_C)/σ = 0.2 between the new and the control treatment, a sample size of around 400 patients per group would be required, resulting in a total trial sample size of 800 (red horizontal line in panel a). Alternatively, one may apply a threshold-crossing single-arm trial testing H0*: µ_E ≤ t versus H1*: µ_E > t using a one-sample test at one-sided level 2.5%, where t is the a priori fixed threshold determined from historical controls. What is the impact on the error rates if one naively takes a rejection of H0* as a rejection of H0? Assume trialists naively use the observed mean of the historical controls as threshold t. A conventional sample size calculation for a single-arm trial then yields a trial sample size of about 200 for a standardized effect of Δ = 0.2. Hence, in a best-case scenario with no uncertainty about the effect size in the control arm, the sample size can be reduced to a quarter of that of the parallel-group design. However, due to sampling variability, the observed mean in the historical controls typically does not coincide with the true population mean µ_C (even assuming µ_C were identical for historical and concurrent controls). As a consequence, the power to reject H0* decreases with decreasing sample size of the historical controls, owing to the increasing variability of the historical estimate (blue line, panel b). In addition, the type I error rate for erroneously rejecting H0 can be substantially inflated for small sample sizes of historical controls (blue line, panel c).
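The sample sizes quoted above follow from standard normal-approximation formulas, and the error-rate distortion of a naively estimated threshold can be reproduced by direct simulation. This is a sketch under the legend's stated assumptions (σ = 1, Δ = 0.2, one-sided α = 2.5%); the historical sample size of 25 is an invented small value chosen to make the inflation visible, not a figure from the paper.

```python
import numpy as np
from statistics import NormalDist

alpha, power, delta, sigma = 0.025, 0.80, 0.2, 1.0
z_a = NormalDist().inv_cdf(1 - alpha)   # about 1.96
z_b = NormalDist().inv_cdf(power)       # about 0.84

# Normal-approximation sample sizes for a one-sided level-2.5% test.
n_parallel = 2 * ((z_a + z_b) * sigma / delta) ** 2  # per arm (~400); total is twice this
n_single = ((z_a + z_b) * sigma / delta) ** 2        # one-sample test against a fixed t (~200)

# Monte Carlo: naive threshold t = observed mean of a small historical sample.
rng = np.random.default_rng(1)
reps, n_hist, n_trial = 50_000, 25, round(n_single)
t_hat = rng.normal(0.0, sigma / np.sqrt(n_hist), reps)  # one estimated threshold per replicate
arm = rng.normal(0.0, sigma / np.sqrt(n_trial), reps)   # trial-arm mean when mu_E = mu_C (H0 true)
type1 = np.mean((arm - t_hat) * np.sqrt(n_trial) / sigma > z_a)

print(round(n_parallel), round(n_single), round(float(type1), 3))
```

Because the randomness of the estimated threshold is ignored by the one-sample test, the realized type I error rate with respect to H0 lands far above the nominal 2.5%, matching the qualitative behavior of the blue line in panel c.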
In contrast, both the type I error rate and the power (if the true standardized effect is indeed Δ = 0.2) of the parallel-group design with concurrent controls do not depend on the historical data (red lines in panels b and c). The uncertainty due to sampling variability when estimating the historical response could be addressed by a more cautious choice of the threshold t, e.g., taking the upper boundary of a two-sided 95% confidence interval for µ_C computed from the historical controls. A conventional sample size calculation for a single-arm trial accounting for this higher threshold (i.e., reducing the standardized effect of 0.2 by the half-width of the confidence interval) yields a sample size of about 400, i.e., half of that of the parallel-group design, if about 1,000 historical controls are available (black line in panel a). The more historical data are available, the smaller the resulting sample size for the new threshold-crossing trial. Assuming µ_C is identical for historical and concurrent controls, the type I error rate is controlled (black line, panel c); however, a loss of power is observed if the historical control database is small (black line, panel b). Furthermore, if µ_C differs between historical and concurrent controls, e.g., if the mean response under control treatment increases over time, there may be an inflation of the type I error rate with the thresholding single-arm design (black line, panel d), but not with the traditional two-arm parallel-group design (with concurrent controls). To address such biases, one may apply even more conservative (larger) thresholds t, for example by adding a percentage of the assumed standardized effect to the upper boundary of the historical 95% confidence interval (e.g., adding 0.1Δ, 0.2Δ, and 0.3Δ for the yellow, green, and gray lines in the panels).
This comes at the cost of larger sample sizes (panel a), but with sufficiently conservative (large) thresholds, inflation of the type I error rate for erroneously rejecting H0 can be avoided (green and gray lines in panel d). For simplicity we have assumed that all historical controls come from one data source, e.g., a single clinical trial or a registry. If several sources are to be used, one also has to account for between-trial variability, e.g., by replacing the sample-mean estimate of µ_C with a meta-analytic estimate obtained from a fixed- or random-effects meta-analysis of the historical controls. Panel a: sample sizes for a parallel-group design and for single-arm threshold designs applying different thresholds, with the sample size of the historical controls shown on the x-axis. The operating characteristics of the designs shown in panels b, c, and d are based on the sample sizes shown in panel a (which depend on the size of the historical controls and the assumed thresholds).
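The conservative-threshold sample size calculation can be sketched as a small function. The function name `n_threshold` and its `margin` parameter are hypothetical; the margin mirrors the 0.1Δ-0.3Δ additions described in the legend, and the formula simply shrinks the detectable effect by the 95% CI half-width of the historical mean before applying the one-sample sample size formula.

```python
from math import ceil, sqrt
from statistics import NormalDist

delta, sigma = 0.2, 1.0
z_a = NormalDist().inv_cdf(0.975)  # 95% CI and one-sided 2.5% test quantile
z_b = NormalDist().inv_cdf(0.80)   # 80% power quantile

def n_threshold(n_hist, margin=0.0):
    """Single-arm sample size when t = historical mean + 95% CI half-width
    (+ an optional extra margin such as 0.1*delta for added conservatism)."""
    half_width = z_a * sigma / sqrt(n_hist)
    eff = delta - half_width - margin  # effect left to detect beyond threshold t
    if eff <= 0:
        return None  # historical sample too small for this design to work
    return ceil(((z_a + z_b) * sigma / eff) ** 2)

print(n_threshold(1000))               # roughly 400: half the parallel-group total of 800
print(n_threshold(1000, 0.1 * delta))  # larger, for the more conservative threshold
```

As the legend states, the required sample size falls as the historical database grows (the half-width shrinks), and rises as more conservative margins are added; with too few historical controls the CI-adjusted effect vanishes and the design becomes infeasible.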