Approaches to Selecting "Time Zero" in External Control Arms with Multiple Potential Entry Points: A Simulation Study of 8 Approaches.

Anthony J Hatswell¹,², Kevin Deighton¹, Julia Thornton Snider³, M Alan Brookhart⁴, Imi Faghmous³, Anik R Patel³.

Abstract

BACKGROUND: When including data from an external control arm to estimate comparative effectiveness, there is a methodological choice of when to set "time zero," the point at which a patient would be eligible/enrolled in a contemporary study. Where patients receive multiple lines of eligible therapy and thus alternative points could be selected, this issue is complex.
METHODS: A simulation study was conducted in which patients received multiple prior lines of therapy before entering either cohort. The results from the control and intervention data sets are compared using 8 methods for selecting time zero. The base-case comparison was set up to be biased against the intervention (which is generally received later), with methods compared in their ability to estimate the true intervention effectiveness. We further investigate the impact of key study attributes (such as sample size) and degree of overlap in time-varying covariates (such as prior lines of therapy) on study results.
RESULTS: Of the 8 methods, 5 (all lines, random line, systematically selecting groups based on mean absolute error, root mean square error, or propensity scores) showed good performance in accounting for differences between the line at which patients were included. The first eligible line can be statistically inefficient in some situations. All lines (with censoring) cannot be used for survival outcomes. The last eligible line cannot be recommended.
CONCLUSIONS: Multiple methods are available for selecting the most appropriate time zero from an external control arm. Based on the simulation, we demonstrate that some methods frequently perform poorly, with several viable methods remaining. In selecting between the viable methods, analysts should consider the context of their analysis and justify the approach selected.
HIGHLIGHTS: There are multiple methods available from which an analyst may select "time zero" in an external control cohort. This simulation study demonstrates that some methods perform poorly but most are viable options, depending on context and the degree of overlap in time zero across cohorts. Careful thought and clear justification should be used when selecting the strategy for a study.

Keywords: anchor date; big data; index date; real world data; target trial; time zero

Year: 2022 PMID: 35514320 PMCID: PMC9459359 DOI: 10.1177/0272989X221096070

Source DB: PubMed Journal: Med Decis Making ISSN: 0272-989X Impact factor: 2.749

Introduction

The use of external cohorts in regulatory or health technology assessment submissions is becoming increasingly common in the United States and Europe. A recent systematic review identified 43 occasions in which nonrandomized study designs using external controls were included in applications to US or EU regulatory authorities, most of which were met with approval. Data from external comparator groups can come from a variety of sources and be used to augment clinical trials, particularly uncontrolled studies (which are often single-arm trials) when a comparison group is not feasible. When data are taken from an external source for comparison with a clinical trial, the aim is to replicate the conditions of a randomized study as closely as possible. Systematically attempting to do so is known as the “target trial” approach and involves selecting patients at the point where treatment decisions are being made, including enrolling them in the trial, should they have been available. The classical example of this design is the comparative new user design, which emulates a parallel-group randomized trial. A complexity exists, however, in which patients could be included at various points, that is, they would have been eligible for entry to the trial at several discrete time points (as opposed to only a single point). This is known in the literature as defining “time zero,” the “index date,” or “anchor date” for patients[1,3,6,7]; here, we suggest the term “time zero” as the most appropriate and widely used term. An example of the problem is given in Hernán and Robins with the case of women older than 50 years of age receiving hormone therapy, who would be eligible at age 50, 51, 52, and so forth. 
This issue is of particular importance in situations in which outcome risk changes depending on which potential time zero is selected; for example, in cancer patients, prognosis generally deteriorates by treatment line.[8-12] Consequently, any imbalance in the number of prior lines of therapy in a cross-trial comparison is likely to induce bias. Selection of eligible intervals should also be done to avoid immortal person time and other kinds of selection bias, where enrollment and assignment to different treatment groups depend on events that occur after the start of follow-up. For example, immortal person time occurs when patients have to survive some initial period of follow-up before they can be classified as being treated.[13,15] In oncology, the absence of later lines of therapy is likely indicative of poor outcomes (e.g., death), with outcomes (such as response rates) likely correlated within patients.

Selecting an appropriate time zero is an issue we recently faced when determining the comparative effectiveness of the chimeric antigen receptor T-cell (CAR T) therapy axicabtagene ciloleucel (Yescarta, Kite, a Gilead Company) in follicular lymphoma (FL) and was the motivation for completing this study. FL is an incurable disease that can recur many times within a patient’s lifetime, with the prognosis worsening with each passing line of therapy. For the pivotal single-arm ZUMA-5 study, the experimental nature of the intervention meant that patients had to have received a minimum of 2 prior lines of therapy (the final sample had a mean of 3.6 and a maximum of 9 prior lines). To estimate comparative outcomes, an external control arm was constructed based on electronic medical records pooled with historical clinical trial data.
Because of the length of time over which patients could have been included in this study, a higher proportion of earlier lines was observed, and patients generally had multiple candidate time zeros at which they would have fulfilled the entry criteria for the axicabtagene ciloleucel study. An example is shown in Figure 1.

Figure 1

Stylized diagram of line selection options.

Abbreviations: LoT, Line of Therapy; FEL, First Eligible Line; LEL, Last Eligible Line.

In this article, we present a simulation study to add to the existing conceptual discussions for defining time zero. First, we lay out the design of the simulation, which is intended to capture the salient characteristics of the occasions when external controls are frequently used, namely, uncontrolled studies in oncology. We then present the various approaches available to the analyst for selecting time zero, including established and novel methods, before presenting the results of the study comparing the performance of the different methods under a range of scenarios when coupled with methods to account for confounding (i.e., propensity scoring).

Methods

Simulation Study Design

The setup of the simulation study was designed to mimic an external control following patients through multiple lines of therapy and an intervention study conducted at later lines of therapy on average. Patient characteristics are sampled at the incidence of a patient’s disease (i.e., line 1) and deteriorate with each line of therapy received, with each successive line also assumed to have reduced effectiveness ceteris paribus, as seen in many cancers. To implement the study design, patients were deemed to have 8 characteristics (6 observable characteristics and 2 unobservable characteristics) affecting outcomes, each sampled from (independent) normal distributions, with the resulting value used to sample time-to-event outcomes from exponential distributions for time to progression and overall survival (OS), the shorter of these times then being used for progression-free survival (PFS). If a patient’s first outcome was death, this time was recorded, whereas if it was progression, they moved to the next treatment line with a further set of time-to-event outcomes sampled. Patient characteristics were set to, on average, deteriorate each treatment line according to sampling from normal distributions (Figure 1). If a treatment line was deemed to be the intervention, exponential distributions with longer time-to-event outcomes were used for that line, after which they would revert to having outcomes sampled as in the external control arm. Outcomes were also set to worsen, ceteris paribus, as the number of prior lines a patient had received increased (Figure 2).

Figure 2

Diagrammatical representation of the data-generation process for patient outcomes.

Starting treatment lines were drawn from a binomial distribution with 6 trials and a probability of 1/3 in the external control and 2/3 in the intervention arm (and thus a resulting mean difference of 2 lines). This difference in starting line leads to a bias against the intervention, as patients will begin treatment later in the pathway. Implicitly, this means that in a naïve comparison, intervention patients will have less favorable characteristics and thus a worse prognosis. Within the data set, patients are then assumed to be followed up for 60 mo (external control) or 37 mo (intervention) before administrative censoring occurs. All censoring is assumed to occur at the same point so as not to introduce randomness into estimates of restricted mean survival time (RMST), which was estimated at 36 mo. A diagrammatical representation of the study is provided in Figure 2, with inputs presented mathematically in Table 1.

Table 1

Parameters Used for the Implementation of the Simulation Study

Parameter	Base-Case Value
Number of patients sampled	External control: 500; Intervention: 750
Starting line of therapy	Control: Binomial(probability = 1/3, size = 6); Intervention: Binomial(probability = 2/3, size = 6)
Patient characteristics (n = 8) at line 1	Both arms: Truncated normal(n = 8, mean = 140, s.d. = 20, lower = 100)
Change in patient characteristics each line	Characteristics 1–3: Normal(mean = −6, s.d. = 5); Characteristics 4–8: Normal(mean = 0, s.d. = 5)
Deterioration applied by line	Applied to characteristic 8 (unobserved): Normal(mean = −9, s.d. = 5)
Time to progression	Control: $\mathrm{Exponential}\left(3 + \frac{1}{4}\sum_{i=1}^{8} x_i\right)$; Intervention: $\mathrm{Exponential}\left(3 + \frac{1}{2}\sum_{i=1}^{8} x_i\right)$, where $x_i$ is the $i$th characteristic of patient $j$
Overall survival	Control: $\mathrm{Exponential}\left(3 + \frac{1}{3}\sum_{i=1}^{8} x_i\right)$; Intervention: $\mathrm{Exponential}\left(3 + \frac{3}{4}\sum_{i=1}^{8} x_i\right)$, where $x_i$ is the $i$th characteristic of patient $j$
Administrative censoring	External control: 60 mo; Intervention: 37 mo

s.d., standard deviation.

In the simulation, 3 data sets were constructed: the external control data set, the intervention data set, and a “true control” data set. The true control data set was a facsimile of the intervention data set up to the point at which patients received the intervention, at which point they instead received the control outcomes. This allows the calculation of outcomes in a set of identical patients (i.e., a true counterfactual of the same patients). Methods are then applied with the aim of estimating the effect of the intervention by comparing outcomes observed in the external control and intervention samples.
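The data-generating process set out in Table 1 can be sketched in a few lines. This is a minimal Python illustration, not the authors' R code, and it rests on several assumptions that the text does not settle: the binomial draw is treated as the number of prior lines, the coefficients in Table 1 are taken to be 1/4 and 1/2 (time to progression) and 1/3 and 3/4 (overall survival), the argument of each exponential distribution is read as its mean in unspecified time units, and the extra deterioration for characteristic 8 is applied on top of its ordinary per-line change. All function names are hypothetical.

```python
import random

def truncated_normal(rng, mean, sd, lower):
    """Rejection-sample a normal draw until it is at or above `lower`."""
    while True:
        draw = rng.gauss(mean, sd)
        if draw >= lower:
            return draw

def sample_prior_lines(rng, prob, size=6):
    """Binomial(size, prob) draw, assumed here to be the number of prior lines."""
    return sum(rng.random() < prob for _ in range(size))

def deteriorate(rng, x):
    """Apply one line's change to the 8 characteristics (values from Table 1)."""
    out = []
    for i, value in enumerate(x, start=1):
        mean_change = -6.0 if i <= 3 else 0.0
        value += rng.gauss(mean_change, 5.0)
        if i == 8:  # assumed: additional deterioration of the unobserved characteristic
            value += rng.gauss(-9.0, 5.0)
        out.append(value)
    return out

def sample_line_outcomes(rng, x, intervention):
    """Sample time to progression and OS; PFS is the shorter of the two."""
    total = sum(x)
    # Assumed: the Table 1 expression is the MEAN of the exponential distribution.
    ttp_mean = 3 + (0.5 if intervention else 0.25) * total
    os_mean = 3 + (0.75 if intervention else 1 / 3) * total
    ttp = rng.expovariate(1 / ttp_mean)
    os_time = rng.expovariate(1 / os_mean)
    return {"ttp": ttp, "os": os_time, "pfs": min(ttp, os_time),
            "progressed_first": ttp < os_time}

def simulate_entry(rng, intervention):
    """Simulate one patient at the line where they enter their cohort."""
    prior = sample_prior_lines(rng, 2 / 3 if intervention else 1 / 3)
    x = [truncated_normal(rng, 140, 20, 100) for _ in range(8)]
    for _ in range(prior):
        x = deteriorate(rng, x)
    return prior, x, sample_line_outcomes(rng, x, intervention)
```

The sketch stops at the entry line. Following a control patient through later lines would repeat `deteriorate` and `sample_line_outcomes` each time progression occurs before death.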

Methods for Comparison

In the simulation, 8 methods for setting time zero were implemented and used to estimate the intervention effectiveness in each run. This estimated value was then compared with the results when using the values derived from the true control (i.e., those observed when calculating the effect size in the group of identical patients). Each of the methods investigated in the study for defining time zero in the external control arm is outlined below.

First eligible line

In accordance with guidance from various sources, this approach includes patient records at the first new line of therapy after meeting the eligibility criteria for the intervention study. For a study using retrospective electronic medical record data with passive qualification of patients into the study, this will generally lead to an overrepresentation of earlier lines of treatment, whereas for a prospective study, this may not be the case because of more intentional decisions on patient qualification for the study, including number of treatments failed. In the case of limited overlap in the number of prior lines of therapy between groups, we anticipate this approach to be statistically inefficient, although it should be noted that it has been used previously.[19,20]

Last eligible line

This approach includes patient records at their last eligible line of recorded therapy. This approach has been used in empirical work, but concerns have been raised about the selection bias that it may induce in the external control group, since records are selected with the knowledge that they are the last lines of therapy received by patients and therefore more likely to end with a poor outcome.

Inclusion of all lines (cloning patients who have multiple lines of treatment)

This approach is discussed by Hernán and Robins and involves setting the unit of analysis to individual lines of treatment, rather than individual patients, resulting in patients with multiple lines being included multiple times in the analysis. Although this approach is likely to increase statistical power, a group-based robust variance estimator must be used to estimate standard errors to address within-patient correlation of outcomes (for example, response rates).

Inclusion of all lines but censoring survival after progression

This approach is a modification of the above approach, whereby OS is censored at the point of progression between treatment lines. This modification changes the estimand but avoids having deaths attributable to multiple treatment lines observed on the same patient, an issue that superficially appears problematic.

Use of a random line of treatment

This approach is suggested by Hernán and Robins and involves randomly selecting 1 line per patient in situations in which multiple eligible lines are available. The target population will reflect patients being treated in later lines than under a “first eligible line” approach. This approach was discussed at length by Backenroth, and published examples are also available.[23,24]
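The four simplest rules described so far (first eligible line, last eligible line, all lines, and a random line) differ only in which of a patient's eligible records are kept. The following is a minimal Python sketch under assumed inputs, not the authors' implementation: the record layout (`patient_id`, `line`, `eligible`) and the function names are hypothetical, and the variant that censors survival at progression is not shown.

```python
import random
from collections import defaultdict

def group_eligible_lines(records):
    """Group eligible records by patient, ordered by line of therapy."""
    by_patient = defaultdict(list)
    for rec in records:
        if rec["eligible"]:
            by_patient[rec["patient_id"]].append(rec)
    for lines in by_patient.values():
        lines.sort(key=lambda r: r["line"])
    return by_patient

def select_time_zero(records, method, rng=None):
    """Return the records used as time zero under the chosen rule."""
    by_patient = group_eligible_lines(records)
    selected = []
    for pid in sorted(by_patient):
        lines = by_patient[pid]
        if method == "first":
            selected.append(lines[0])
        elif method == "last":
            selected.append(lines[-1])
        elif method == "all":
            # One row per eligible line: the unit of analysis becomes the line,
            # so standard errors need a patient-level robust variance estimator.
            selected.extend(lines)
        elif method == "random":
            selected.append((rng or random.Random()).choice(lines))
        else:
            raise ValueError(f"unknown method: {method}")
    return selected
```

Under the "all" rule a patient with three eligible lines contributes three rows, which is why the within-patient correlation noted above has to be handled when estimating standard errors.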

Rebalancing the external control arm to minimize the mean absolute error in the number of prior lines of therapy between data sets

This is a modification of the random line approach, whereby 1 line is selected per patient but with the objective of minimizing the difference in the number of prior lines between the external control and intervention data sets. This was implemented by taking 30 samples of random lines, calculating the mean absolute error in the percentage of patients at each line between the intervention and sampled external control lines, then choosing the data set with the lowest value.

Rebalancing the external control arm to minimize the root mean squared error in the number of prior lines of therapy between data sets

This is a modification of the above approach using root mean squared error (RMSE), which penalizes large mismatches in line distribution more harshly.
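The two rebalancing approaches can be illustrated as a search over repeated random draws. This Python sketch assumes each control patient is represented by a list of eligible line numbers and that the error is averaged over every line observed in either data set; both are assumptions about details the text leaves open, and the function names are hypothetical.

```python
import math
import random
from collections import Counter

def line_proportions(lines, levels):
    """Proportion of patients at each line of therapy in `levels`."""
    counts = Counter(lines)
    return [counts[level] / len(lines) for level in levels]

def imbalance(control_lines, intervention_lines, metric="mae"):
    """MAE or RMSE between the two line-of-therapy distributions."""
    levels = sorted(set(control_lines) | set(intervention_lines))
    p_control = line_proportions(control_lines, levels)
    p_trial = line_proportions(intervention_lines, levels)
    diffs = [a - b for a, b in zip(p_control, p_trial)]
    if metric == "mae":
        return sum(abs(d) for d in diffs) / len(diffs)
    if metric == "rmse":
        return math.sqrt(sum(d * d for d in diffs) / len(diffs))
    raise ValueError(f"unknown metric: {metric}")

def rebalance(eligible_by_patient, intervention_lines, metric="mae",
              n_samples=30, rng=None):
    """Draw one random line per patient `n_samples` times; keep the best draw."""
    if rng is None:
        rng = random.Random()
    best_draw, best_score = None, math.inf
    for _ in range(n_samples):
        draw = {pid: rng.choice(lines)
                for pid, lines in eligible_by_patient.items()}
        score = imbalance(list(draw.values()), intervention_lines, metric)
        if score < best_score:
            best_draw, best_score = draw, score
    return best_draw, best_score
```

With `n_samples=1` this reduces to the plain random line approach, so the extra draws can only leave the line distribution as close to the intervention data set or bring it closer.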

Using propensity score matching to identify the best matching line for each treated patient (allowing patients to be matched only once)

Propensity scores are widely used in medicine to control confounding. Here, the propensity score is the probability that a given patient would be in the trial versus the external control population. The propensity score is used to match eligible lines from the external control data to the nearest treated patient line using custom R code. The custom code uses a loop to select the nearest score (from all patients and lines) to a randomly selected treated patient. After each match, that control patient (and all of their unmatched lines) is removed from the matchable pool, meaning that each treated patient receives a 1:1 match but no control patient is matched more than once.

Each method was applied with and without standardized mortality ratio (SMR) propensity score weighting. The weights were calculated using a propensity score estimated from the 6 observable patient characteristics and the line of therapy as covariates. SMR weights were then applied whereby treated patients were given a weight of 1, whereas weights for external control patients were defined as the ratio of the estimated propensity score to 1 minus the estimated propensity score. By using SMR weighting, we explicitly targeted the effectiveness of the intervention in the population represented in the intervention study. This provided a common reference population for all analyses. The provision of both unadjusted and adjusted results, however, does allow for understanding whether the results of an accurate comparison are due to the method used for setting time zero.
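The matching loop and the SMR weights described above can be sketched as follows. This is an illustrative Python version, not the authors' custom R code: propensity scores are assumed to have been estimated already (for example, by logistic regression), and the input layouts and function names are hypothetical.

```python
import random

def smr_weight(propensity, treated):
    """SMR weight: 1 for treated patients, p / (1 - p) for external controls."""
    return 1.0 if treated else propensity / (1.0 - propensity)

def greedy_match(treated_scores, control_line_scores, rng=None):
    """Match each treated patient to the nearest-scoring control line.

    treated_scores: {treated_id: propensity score}
    control_line_scores: list of (patient_id, line, propensity score),
        with one entry per eligible line of each control patient.
    """
    if rng is None:
        rng = random.Random()
    pool = list(control_line_scores)
    order = list(treated_scores)
    rng.shuffle(order)  # treated patients are processed in random order
    matches = {}
    for treated_id in order:
        if not pool:
            break  # no control patients left to match
        target = treated_scores[treated_id]
        best = min(pool, key=lambda rec: abs(rec[2] - target))
        matches[treated_id] = best
        # Remove every remaining line of the matched control patient.
        pool = [rec for rec in pool if rec[0] != best[0]]
    return matches
```

Because the pool shrinks after every match, the result depends on the order in which treated patients are processed, which is why the order is randomized rather than fixed.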

Scenario Analyses

To ensure that the results of the study are generalizable, a large number of scenario analyses was conducted, which included varying the patient numbers, simulation setup, outcomes observed, and the effect of subsequent lines of therapy. The changes made in each scenario analysis are presented in Table 2.

Table 2

Scenario Analyses and Resulting Findings

Number	Scenario Setup	Findings
1	Number of patients sampled doubled in both arms	Results are consistent with the base case, although errors and coverage probabilities improve for all viable methods
2	Number of patients sampled halved in both arms	Errors are generally increased; however, no method appears disproportionately affected
3	Number of active patients halved	Errors are generally increased; however, no method appears disproportionately affected
4	Administrative censoring time halved to 18 mo	No meaningful changes in results
5	Starting health of patients increased in both arms (+12.5% at baseline)	No meaningful changes in results
6	Starting health of patients increased in the intervention only; +12.5% at baseline	Due to the (biased) comparison, naïve results are generally worse; however, post-SMR results are similar to the base case
7	More effective intervention; sampled times multiplied by 1.25	Slight improvements in coverage probabilities
8	More effective control; sum of patient characteristics multiplied by 1.25 before sampling	Slight increases in errors
9	Longer OS for both control and intervention, i.e., effect of OS reduced; sum of patient characteristics divided by 0.75 before sampling, i.e., a condition in which death is less common	OS results more uncertain for all comparisons
10	Different survival model for all time-to-event outcomes; Weibull with shape 1.25	No meaningful changes, estimates generally slightly worse
11	Different survival model for the intervention time to event estimates; Weibull with shape 1.25	Slight improvements in estimation of PFS, worsening of OS, likely driven by fewer observed events
12	Disease with a low death rate simulated; risk of the control set to that of the intervention, with sampled overall survival time multiplied by 10	PFS estimates improved for all estimates
13	Effect of health loss by treatment line doubled	Naïve estimates more inaccurate, with no meaningful changes after SMR weighting
14	Only 2 potential lines of treatment	Errors reduced, first eligible line in particular benefitting
15	Bigger imbalance between starting treatment lines; probability increased from 2/3 to 9/10 for intervention	Relative worsening of naïve errors, as well as post-SMR weighting errors for first eligible line. Relative improvements for random and rebalance approaches
16	No imbalance in treatment lines; control rate probability set equal to that of the intervention (2/3)	Relative improvement of naïve comparisons, and first eligible line; more uncertain estimates of overall survival differences
17	Intervention has no effect; all time to events set equal to that of control	All viable methods demonstrating low levels of error, with the bias present in last eligible line and all lines with censoring clear
18	Unbiased comparison; all starting lines and effectiveness calculations for the intervention set equal to the control	All viable methods demonstrating low levels of error, with the bias present in last eligible line particularly apparent

OS, overall survival; PFS, progression-free survival; SMR, standardized mortality ratio.

Outcomes Presented

The aim of the methods used is to retrieve the true intervention effectiveness when presented with an external control data set that has treatment lines biased toward earlier lines of therapy, with correspondingly better patient characteristics and outcomes. To do this, multiple estimates of effectiveness were calculated, including the ratio of RMST at 3 y and a hazard ratio (HR) estimated from a Cox model. The RMST is particularly useful in the presence of nonproportional hazards, whereas the Cox model is frequently used in clinical studies. For each outcome, the RMSE is presented, along with the bias, for which the Monte Carlo standard error (MCSE) is also presented as a measure of variance within the results. In addition, 2 further metrics are presented for the Cox model: the coverage probability (the percentage of simulation runs in which the confidence interval contains the true value) and the error at the 95th percentile (as a measure of the likely maximum error). All results are presented for both PFS and OS, before and after SMR weighting is applied.
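The performance measures named above can be computed from the per-run estimates as in the following Python sketch. Several choices here are assumptions rather than details given in the text: bias is the mean estimate minus the true value, the MCSE of the bias is the standard deviation of the estimates divided by the square root of the number of runs, and the 95th percentile error is taken over absolute errors. The function name and input layout are hypothetical.

```python
import math
import statistics

def performance(estimates, truth, ci_bounds=None):
    """Summarize simulation estimates against the known true value.

    estimates: one point estimate per simulation run
    ci_bounds: optional list of (lower, upper) interval limits per run
    """
    n = len(estimates)
    errors = [est - truth for est in estimates]
    abs_sorted = sorted(abs(err) for err in errors)
    result = {
        "bias": sum(errors) / n,
        "rmse": math.sqrt(sum(err * err for err in errors) / n),
        # Monte Carlo standard error of the bias estimate
        "mcse_bias": statistics.stdev(estimates) / math.sqrt(n),
        # Assumed definition: 95th percentile of the absolute error
        "error_p95": abs_sorted[min(n - 1, math.ceil(0.95 * n) - 1)],
    }
    if ci_bounds is not None:
        covered = sum(lower <= truth <= upper for lower, upper in ci_bounds)
        result["coverage_pct"] = 100 * covered / n
    return result
```

The same function applies to either the ratio of RMST or the Cox HR; only the `truth` value changes.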

Implementation and Software

To understand the performance of each method, a large number of patients (n = 50,000) was simulated for each data set (external control, intervention, and then the true control) to establish the “true” results against which simulations would be judged. In each run, a sample of patients was then taken from the external control and intervention data sets (1000 and 750 in the base case), to which all methods of selecting time zero were applied. This process was repeated 5000 times per scenario, in line with the approach of Morris et al. All analyses were performed in R version 4.1.2.

Results

Simulation and Base-Case Results

In the simulation, external control patients entered the study at an earlier line of therapy than intervention patients did, which created an inherent bias in patient characteristics in favor of the control, thereby reducing the observed benefit of the intervention. This is shown visually in the panels included in Figure 3 for PFS and OS across all methods for run 5000 of the base case, with base-case results provided in Table 3. Looking at PFS outcomes without the application of SMR weighting demonstrates substantial bias for all methods except for the use of propensity scoring to match similar patients. This underlines that statistical methods to account for confounding are required, regardless of the approach taken toward defining time zero.

Figure 3

Example of each method applied to a single run of the simulation for progression-free survival and overall survival, with and without standardized mortality ratio weighting.

Table 3

Base-Case Results for Progression-Free Survival, Naïve, and Standardized Mortality Weighting Comparisons

Naïve Comparison
Method	Progression-Free Survival								Overall Survival
	Ratio of RMST			Cox PH Model					Ratio of RMST			Cox PH Model
	Mean Value	RMSE	Bias (MCSE)	Mean HR	RMSE	Bias (MCSE)	Error 95th Percentile	Coverage Probability (%)	Mean Value	RMSE	Bias (MCSE)	Mean HR	RMSE	Bias (MCSE)	Error 95th Percentile	Coverage Probability (%)
True	2.376			0.396					1.827			0.468
First eligible line	1.698	0.683	−0.679 (0.001)	0.54	0.147	0.144 (0)	0.191	0	1.092	0.735	−0.734 (0.001)	0.854	0.389	0.387 (0.001)	0.462	0
Last eligible line	2.326	0.122	−0.051 (0.002)	0.42	0.032	0.024 (0)	0.06	79.8	2.836	1.017	1.009 (0.002)	0.318	0.15	−0.149 (0)	0.178	0
All lines	2.055	0.329	−0.321 (0.001)	0.469	0.075	0.073 (0)	0.106	2.5	1.347	0.482	−0.48 (0.001)	0.681	0.216	0.214 (0)	0.264	0
All lines (censoring)	2.055	0.329	−0.321 (0.001)	0.469	0.075	0.073 (0)	0.106	2.5	1.347	0.482	−0.48 (0.001)	0.874	0.408	0.406 (0.001)	0.473	0
Random	1.866	0.517	−0.51 (0.001)	0.502	0.109	0.106 (0)	0.15	0.5	1.324	0.505	−0.503 (0.001)	0.683	0.219	0.216 (0)	0.273	0
Rebalanced MAE	1.882	0.502	−0.495 (0.001)	0.498	0.105	0.102 (0)	0.146	0.6	1.359	0.47	−0.468 (0.001)	0.663	0.198	0.195 (0)	0.25	0
Rebalanced MSE	1.883	0.5	−0.493 (0.001)	0.498	0.105	0.102 (0)	0.145	0.6	1.367	0.462	−0.46 (0.001)	0.658	0.193	0.191 (0)	0.245	0
Propensity scored	2.299	0.15	−0.077 (0.002)	0.41	0.029	0.014 (0)	0.057	88.4	1.796	0.096	−0.031 (0.001)	0.48	0.033	0.013 (0)	0.065	89.4
Standardized Mortality Ratio Weighted
Method	Progression-Free Survival								Overall Survival
	Ratio of RMST			Cox PH Model					Ratio of RMST			Cox PH Model
	Mean Value	RMSE	Bias (MCSE)	Mean HR	RMSE	Bias (MCSE)	Error 95th Percentile	Coverage Probability (%)	Mean Value	RMSE	Bias (MCSE)	Mean HR	RMSE	Bias (MCSE)	Error 95th Percentile	Coverage Probability (%)
True	2.376			0.396					1.827			0.468
First eligible line	2.315	0.177	−0.061 (0.002)	0.409	0.031	0.013 (0)	0.061	96	1.773	0.114	−0.054 (0.001)	0.484	0.034	0.017 (0)	0.067	96.5
Last eligible line	2.045	0.393	−0.332 (0.003)	0.49	0.111	0.094 (0.001)	0.172	77.9	2.478	0.697	0.651 (0.004)	0.376	0.104	−0.091 (0.001)	0.182	71.7
All lines	2.359	0.091	−0.017 (0.001)	0.401	0.018	0.005 (0)	0.037	94.5	1.779	0.077	−0.048 (0.001)	0.482	0.026	0.014 (0)	0.05	91.2
All lines (censoring)	2.359	0.091	−0.017 (0.001)	0.401	0.018	0.005 (0)	0.037	94.5	1.173	0.655	−0.654 (0.001)	0.728	0.264	0.261 (0.001)	0.325	0
Random	2.344	0.125	−0.033 (0.002)	0.405	0.024	0.009 (0)	0.048	93.9	1.948	0.145	0.121 (0.001)	0.44	0.035	−0.027 (0)	0.063	81.5
Rebalanced MAE	2.341	0.124	−0.035 (0.002)	0.405	0.024	0.009 (0)	0.047	94.1	1.966	0.161	0.139 (0.001)	0.436	0.038	−0.031 (0)	0.066	76.1
Rebalanced MSE	2.34	0.124	−0.036 (0.002)	0.405	0.024	0.009 (0)	0.047	94.2	1.966	0.161	0.14 (0.001)	0.436	0.038	−0.032 (0)	0.066	75.7
Propensity scored	2.356	0.119	−0.02 (0.002)	0.402	0.023	0.006 (0)	0.046	94.8	1.843	0.084	0.016 (0.001)	0.469	0.027	0.002 (0)	0.053	94.8

HR, hazard ratio; MAE, mean absolute error; MCSE, Monte Carlo standard error; MSE, mean squared error; PH, proportional hazards; RMSE, root mean squared error; RMST, restricted mean survival time.

Although the application of SMR weighting improved estimates, with most demonstrating good performance, it is immediately apparent that 2 methods, last eligible line and all lines (censoring), are biased: last eligible line in all outcomes and all lines (censoring) in OS outcomes. This finding was consistent in all simulations and scenario analyses and thus renders these methods nonviable. Given their poor performance and bias, the results of these 2 methods in simulations are not discussed further. In terms of the remaining methods, the larger bias in the unweighted first eligible line approach was largely ameliorated by the application of SMR weighting. Although there were differences in the point estimates of the mean error and bias that exceeded the MCSE (i.e., differences beyond those seen in variability between samples), these differences did not appear meaningful between methods, given the simulated nature of the data. For example, when estimating PFS with the application of SMR weights, the bias in the Cox HR ranged from 0.005 to 0.013 across all methods, with coverage probabilities of 94.1% to 96.0%. Although these findings of similarity between methods hold for both PFS and OS, there are differences that should be discussed, namely, that the first eligible line shows the potential for higher levels of error, as shown by its 95th percentile error being the highest of all viable methods for both PFS and OS.
The other finding worth noting is that the coverage probability for OS is notably lower for most methods. This is likely a result of fewer observed events but should be considered when interpreting OS estimates.

Scenario Analysis Results

Scenario analysis setups and main findings are presented in Table 2, with distributions of the error in the ratio of RMST for OS presented in Figure 4. Full tabulated results are available as supplementary material. The results show that as patient numbers are varied in scenario analyses (scenarios 1–3), similar results are seen to those of the base case. As would be expected, simulations with increased patient numbers perform better but without any method deviating from this pattern. This is similar with a shorter follow-up time (scenario 4), where differences are seen due to less data being available, but no method appeared better (or worse) under such circumstances.

Figure 4

Density plot of error in the ratio of restricted mean survival time (RMST) for overall survival (OS), all scenario analyses.

Changing the setup of the simulation study regarding patient characteristics (scenarios 5 and 6) and the relationship between characteristics and outcomes (scenarios 7–11) again resulted in differences in the magnitude of findings, without affecting the findings themselves. Notably, all viable methods were able to account for bias in observed patient characteristics (scenario 6) after the application of SMR weighting. Findings were also not dependent on the survival modeling approach, with Weibull models used in scenarios 10 and 11 not affecting the results. Changing the simulation to accommodate different types of disease (scenarios 12–14) resulted in only minor changes to results. Scenario 12 was designed to mimic a disease having little/no mortality impact, allowing all methods still to be used for estimation of PFS. Similarly, scenarios 13 and 14 explored the impact of treatment lines, with the main finding being that with only 2 lines, first eligible line generally improved, whereas with a large gap between overlapping treatment lines, first eligible line performed poorly, with relative gains for methods that explicitly aim to rebalance. Structural tests of methods were performed in scenarios 16–18 and demonstrated methods to be unbiased. These tests included no difference in treatment lines in scenario 16, an inert intervention in scenario 17, and a fully unbiased scenario in scenario 18. Overall, these scenarios demonstrated that after SMR weighting, treatment effects (or the lack thereof) were correctly identified, without bias being introduced that might lead to type I errors (incorrectly rejecting a true null hypothesis).

Discussion

Main Findings

This simulation study explored several approaches for selecting time zero when comparing a single-arm trial to an external control cohort with an imbalance in the number of prior therapies between data sets. The number of prior therapies could predict both prognosis and the severity of covariates that deteriorate over time, which makes appropriate balance of this variable between cohorts essential when conducting comparative effectiveness studies. Our main findings demonstrate that several methods for generating estimates in time-to-event outcomes have similar (low) levels of bias and precision, with limited evidence for a single superior approach. However, both approaches of last eligible line and the selection of all lines with censoring of OS on progression resulted in distributions of estimates that substantially deviated from our target. A key finding was that application of SMR weighting was essential in supporting the methods used for the selection of time zero, which highlights the importance of this additional consideration, namely, the use of a method to balance between groups. Following this step, multiple choices of selecting time zero could be supported, which would otherwise not be the case.

In making a selection between these methods, extensive scenario analyses indicate that the first eligible line approach may be suitable only in situations in which the external control and intervention data sets are similar or the sample sizes are large or similarly matched, due to lower statistical efficiency. Using multiple records per patient, in which death could be attributed to multiple treatment lines in the same patient, did not result in bias, and indeed frequently produced numerically superior estimates. Selecting a single random eligible line for the external control cohort also compared well with no clear bias or inaccuracy. This finding was consistent regardless of whether this process was repeated to minimize mean absolute error (MAE)/mean squared error (MSE) or performed only once. Nevertheless, repeated random sampling to minimize MAE/MSE may offer a degree of reassurance against a “bad draw” in the selection of a single random line, as can be seen in the values of 95th percentile of error.

That multiple methods showed similar outcomes means that a case could be made depending on the context in which they are to be used. This includes the broader considerations of ease of explanation, perceived differences between studies, or compatibility with other methods that are also required for analysis (such as multiple imputation). The contribution of this study is therefore to present the methods that we know to be available and to identify those that should be seen as an option set from which the analyst may select. We do this based not only on theoretical advantages but also simulated data.

The bias that we found in the last eligible line approach is in agreement with the results of Suissa, who found inflated mortality rates in the control group in such a design, suggesting strong selection bias.

When we censored OS on disease progression, we observed estimates that deviated substantially from our benchmark. By artificially censoring patients on progression, we essentially changed the target of estimation. In this analysis, follow-up continues only on patients who have not progressed, and they are increasingly up-weighted over time to stand in for the patients who have progressed. As such, we were implicitly estimating the effectiveness of the intervention in a population for which progression could not occur; by artificially censoring patients on progression, we likely introduce bias from dependent censoring, because those who progress are different from those who do not. This estimator is also not of clinical relevance because progression cannot be universally prevented. An alternative approach would be to treat progression as a competing event, which would estimate the effectiveness of the intervention on death before progression. Where this issue did not occur was in scenario 12, in which survival was exogenous to the disease process, as might be seen in conditions such as migraine, psoriasis, and constipation. In this case, the censoring was not linked to outcome, with the method performing similarly to others. As we seldom fully understand disease processes, however, we would still recommend caution if suggesting this approach.
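The repeated random line selection discussed in this section can be sketched as follows. One eligible line is drawn at random for each external control patient, the draw is repeated, and the draw whose distribution of treatment lines is closest to that of the intervention cohort is retained. The error measure used here (mean absolute difference in the proportion of patients at each line) is an assumption made for illustration, as are the function names and the data layout; the study's exact criterion may differ.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def line_proportions(lines: Sequence[int], all_lines: Sequence[int]) -> List[float]:
    """Proportion of records at each treatment line, in the order given by all_lines."""
    counts = Counter(lines)
    return [counts[line] / len(lines) for line in all_lines]


def mean_absolute_error(p: Sequence[float], q: Sequence[float]) -> float:
    """Mean absolute difference between two aligned lists of proportions."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


def select_random_lines(
    eligible: Dict[str, List[int]],   # control patient id -> eligible line numbers
    intervention_lines: Sequence[int],  # line at entry for each intervention patient
    n_draws: int = 1000,
    seed: int = 0,
) -> Tuple[Dict[str, int], float]:
    """Keep the random draw (one line per patient) with the lowest MAE against the target."""
    rng = random.Random(seed)
    all_lines = sorted(set(intervention_lines) | {l for ls in eligible.values() for l in ls})
    target = line_proportions(intervention_lines, all_lines)
    best_draw: Dict[str, int] = {}
    best_err = float("inf")
    for _ in range(n_draws):
        draw = {pid: rng.choice(lines) for pid, lines in eligible.items()}
        err = mean_absolute_error(line_proportions(list(draw.values()), all_lines), target)
        if err < best_err:
            best_draw, best_err = draw, err
    return best_draw, best_err
```

Setting `n_draws=1` corresponds to the single random line approach, while larger values trade computation for protection against an unrepresentative draw. Replacing `mean_absolute_error` with a squared-error measure would give the MSE-based variant.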

Limitations

The findings of the study were robust to extensive scenario analyses that varied parameters individually and jointly around the themes of differences in simulation setup, type of survival model used for simulation, and degree of bias (including no bias) in comparisons. A limitation, however, remains that there are further scenarios (and potentially even methods) that could be included. There are also many different decisions that could have been (legitimately) made regarding the setup of the study that may have affected the findings. The simulation has also revolved mainly around oncology products, whereas the methods available have wider applicability. Other simulation setups for different settings may therefore also be valuable. The main limitation of the work, however, is the reliance on simulated data. Although unavoidable (in a need to have a known “truth” to compare against), this does mean that we would caution against overinterpretation of absolute results, for example, concluding one method to be superior due to lower RMSE or bias, as it is entirely possible these small differences are an artifact of the simulation process. For these reasons, we would encourage further work, ideally in real data sets such as large randomized controlled trials from which samples could be taken and the methods compared.

Conclusions

Given the results of the simulation presented, analysts may wish to consider a variety of factors (including available sample size and degree of imbalance) in choosing an appropriate method for selecting time zero in external cohorts. We would suggest that these include interpretability (where random line is perhaps the easiest to understand and propensity score matching or including all lines the most complex), statistical power, and interoperability with other techniques that may be required. Beyond which method is used to set time zero, further justification should also be provided for any individual analysis, including (but not limited to) a demonstration of why each characteristic is selected for balancing, the degree of overlap between studies, histograms of weights, and effective sample sizes. Ultimately, this study highlights a subset of methodologies with acceptable bias in the estimation of time-to-event outcomes that may be used for the selection of time zero in external control cohorts.

Supplemental material, sj-xlsx-1-mdm-10.1177_0272989X221096070 for Approaches to Selecting “Time Zero” in External Control Arms with Multiple Potential Entry Points: A Simulation Study of 8 Approaches by Anthony J. Hatswell, Kevin Deighton, Julia Thornton Snider, M. Alan Brookhart, Imi Faghmous and Anik R. Patel in Medical Decision Making.

26 in total

1. Problem of immortal time bias in cohort studies: example using statins for preventing progression of diabetes.

Authors: Linda E Lévesque; James A Hanley; Abbas Kezouh; Samy Suissa
Journal: BMJ Date: 2010-03-12

Review 2. The combination of randomized and historical controls in clinical trials.

Authors: S J Pocock
Journal: J Chronic Dis Date: 1976-03

3. Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available.

Authors: Miguel A Hernán; James M Robins
Journal: Am J Epidemiol Date: 2016-03-18 Impact factor: 4.897

Review 4. Specifying a target trial prevents immortal time bias and other self-inflicted injuries in observational analyses.

Authors: Miguel A Hernán; Brian C Sauer; Sonia Hernández-Díaz; Robert Platt; Ian Shrier
Journal: J Clin Epidemiol Date: 2016-05-27 Impact factor: 6.437

Review 5. Some statistical considerations in the clinical development of cancer immunotherapies.

Authors: Bo Huang
Journal: Pharm Stat Date: 2017-11-02 Impact factor: 1.894

6. Using simulation studies to evaluate statistical methods.

Authors: Tim P Morris; Ian R White; Michael J Crowther
Journal: Stat Med Date: 2019-01-16 Impact factor: 2.497

7. Characteristics of non-randomised studies using comparisons with external controls submitted for regulatory approval in the USA and Europe: a systematic review.

Authors: Sarah Goring; Aliki Taylor; Kerstin Müller; Tina Jun Jian Li; Ellen E Korol; Adrian R Levy; Nick Freemantle
Journal: BMJ Open Date: 2019-02-27 Impact factor: 2.692

8. Ibrutinib versus previous standard of care: an adjusted comparison in patients with relapsed/refractory chronic lymphocytic leukaemia.

Authors: Lotta Hansson; Anna Asklid; Joris Diels; Sandra Eketorp-Sylvan; Johanna Repits; Frans Søltoft; Ulrich Jäger; Anders Österborg
Journal: Ann Hematol Date: 2017-07-31 Impact factor: 3.673

9. Follicular lymphoma in the modern era: survival, treatment outcomes, and identification of high-risk subgroups.

Authors: Connie L Batlevi; Fushen Sha; Anna Alperovich; Ai Ni; Katy Smith; Zhitao Ying; Jacob D Soumerai; Philip C Caron; Lorenzo Falchi; Audrey Hamilton; Paul A Hamlin; Steven M Horwitz; Erel Joffe; Anita Kumar; Matthew J Matasar; Alison J Moskowitz; Craig H Moskowitz; Ariela Noy; Colette Owens; Lia M Palomba; David Straus; Gottfried von Keudell; Andrew D Zelenetz; Venkatraman E Seshan; Anas Younes
Journal: Blood Cancer J Date: 2020-07-17 Impact factor: 11.037

10. Single-agent ibrutinib in RESONATE-2™ and RESONATE™ versus treatments in the real-world PHEDRA databases for patients with chronic lymphocytic leukemia.

Authors: Gilles Salles; Emmanuel Bachy; Lukas Smolej; Martin Simkovic; Lucile Baseggio; Anna Panovska; Hervé Besson; Nollaig Healy; Jamie Garside; Wafae Iraqi; Joris Diels; Corinna Pick-Lauer; Martin Spacek; Renata Urbanova; Daniel Lysak; Ruben Hermans; Jessica Lundbom; Evelyne Callet-Bauchu; Michael Doubek
Journal: Ann Hematol Date: 2019-11-19 Impact factor: 3.673