Elias Laurin Meyer, Peter Mesenbrink, Cornelia Dunger-Baldauf, Ekkehard Glimm, Yuhan Li, Franz König.
Abstract
Platform trials have become increasingly popular in drug development programs, attracting interest from statisticians, clinicians and regulatory agencies. Many statistical questions related to designing platform trials, such as the impact of decision rules, of sharing information across cohorts, and of allocation ratios on operating characteristics and error rates, remain unanswered. In many platform trials, the definition of error rates is not straightforward, as classical error rate concepts are not applicable. For an open-entry, exploratory platform trial design comparing combination therapies to the respective monotherapies and standard-of-care, we define a set of error rates and operating characteristics and then use these to compare a set of design parameters under a range of simulation assumptions. When setting up the simulations, we aimed for realistic trial trajectories; for example, the exact number of treatments included over time in a specific simulation run is not known a priori, as it follows a stochastic mechanism. Our results indicate that the method of data sharing, the exact specification of the decision rules, and the a priori assumptions regarding treatment efficacy all strongly influence the operating characteristics of the platform trial. Furthermore, different operating characteristics may matter to different stakeholders. Together with the flexibility and complexity of a platform trial, which also affect the achieved operating characteristics (for example, through the efficiency of data sharing), this implies that utmost care needs to be given to the evaluation of different assumptions and design parameters at the design stage.
Keywords: adaptive trials; clinical trial simulation; combination therapy; platform trials
Year: 2022 PMID: 35102685 PMCID: PMC9304586 DOI: 10.1002/pst.2194
Source DB: PubMed Journal: Pharm Stat ISSN: 1539-1604 Impact factor: 1.234
FIGURE 1 Schematic overview of the proposed platform trial design. New cohorts consisting of a combination therapy arm, a monotherapy arm using the same compound in every cohort (referred to as "backbone monotherapy"), an add-on monotherapy arm which is different in every cohort and a SoC arm enter the platform over time. While the add-on monotherapy and therefore the combination therapy is different in every cohort (as indicated by the differently shaded colors), the backbone monotherapy and SoC are the same in every cohort (as indicated by the same colors). Each cohort has an interim analysis after about half of the initially planned sample size, after which the cohort can be stopped for early efficacy or futility. The red brackets indicate the testing strategy within each cohort, that is, comparison of the combination therapy against both monotherapies and of both monotherapies against SoC. We differentiate between per-cohort and per-platform operating characteristics (OCs).
FIGURE 2 Schematic overview of the different levels of sharing. No sharing happens if only "cohort" data are used. If sharing "all" data, whenever in any cohort an interim or final analysis is performed, all SoC and backbone monotherapy data available from all cohorts are used. If sharing only "concurrent" data, whenever in any cohort an interim or final analysis is performed, all SoC and backbone monotherapy data that were collected during the active enrollment time of the cohort under investigation are used. If sharing "dynamically," whenever in any cohort an interim or final analysis is performed, the degree of sharing of SoC and backbone monotherapy data from other cohorts increases with the homogeneity of the observed response rates of the respective arms. A solid fill represents using data 1-to-1, while a dashed fill represents using discounted data (for more information see Appendix A.1). If at any given time there are k active cohorts, the allocation ratio is 1:1:1:1 in case of no data sharing and k:k:1:1 otherwise (combination : add-on monotherapy : backbone monotherapy : SoC). This allocation ratio is updated for all active cohorts every time the number of active cohorts k changes due to dropping a cohort or adding a new one.
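The allocation rule in the caption above can be stated compactly in code. This is a minimal sketch under the assumptions of the figure: the function name and the string codes for the sharing methods are illustrative, not from the paper.

```python
def allocation_ratio(n_active_cohorts: int, data_sharing: str) -> tuple:
    """Within-cohort allocation ratio
    (combination : add-on monotherapy : backbone monotherapy : SoC).

    With no data sharing ("cohort"), each cohort uses a balanced 1:1:1:1
    ratio. With any sharing ("all", "concurrent", "dynamic"), backbone
    monotherapy and SoC data are reused across the k active cohorts, so
    proportionally fewer patients are randomized to those arms: k:k:1:1.
    """
    k = n_active_cohorts
    if data_sharing == "cohort":  # no sharing -> balanced
        return (1, 1, 1, 1)
    return (k, k, 1, 1)
```

Whenever a cohort is added or dropped, the ratio would be recomputed with the new k for all active cohorts, mirroring the update rule described in the caption.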
Operating characteristics used in this paper and their definitions
| Name | Definition |
|---|---|
| PCP | “Per‐Cohort‐Power,” the ratio of the sum of true positives to the sum of truly efficacious cohorts (i.e., the sum of true positives and false negatives) across all platform trial simulations, that is, the probability of a true positive decision for any new cohort entering the trial. A low value indicates that the trial is wasteful with superior therapies. |
| PCT1ER | “Per‐Cohort‐Type‐1‐Error,” the ratio of the sum of false positives to the sum of all truly not efficacious cohorts across all platform trial simulations, that is, the probability of a false positive decision for any new cohort entering the trial. A low value indicates that the trial reliably screens out futile therapies. |
| FWER | “Family‐wise type 1 Error Rate”, the proportion of platform trials in which at least one false positive decision was made (i.e., the probability of at least one false positive decision across all cohorts), considering only trials that contain at least one cohort that is in truth futile. |
| FWER BA | “Family‐wise type 1 Error Rate, Bayesian Average”, the proportion of platform trials in which at least one false positive decision was made, regardless of whether any truly futile cohorts exist in these trials. |
| Disj Power | “Disjunctive Power”, the proportion of platform trials in which at least one correct positive decision was made (i.e., the probability of at least one true positive decision across all cohorts), considering only trials that contain at least one cohort that is in truth superior. |
| Disj Power BA | “Disjunctive Power, Bayesian Average”, the proportion of platform trials in which at least one correct positive decision was made, regardless of whether any truly superior cohorts exist in these trials. |
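The definitions in the table above are all simple ratios over simulated trials, so they can be estimated directly from per-cohort truth/decision pairs. The sketch below assumes a simple data layout (each simulated trial is a list of `(truly_efficacious, declared_efficacious)` pairs); names and structure are illustrative, not the paper's implementation.

```python
from typing import List, Tuple

# One platform trial = list of cohorts; each cohort is a pair
# (truly_efficacious: bool, declared_efficacious: bool).
Trial = List[Tuple[bool, bool]]

def operating_characteristics(trials: List[Trial]) -> dict:
    """Estimate per-cohort and per-platform operating characteristics."""
    tp = fn = fp = tn = 0            # pooled over all cohorts of all trials
    fwer_hits = fwer_eligible = 0    # trials with >= 1 truly futile cohort
    fwer_ba_hits = 0                 # all trials ("Bayesian average")
    disj_hits = disj_eligible = 0    # trials with >= 1 truly superior cohort
    disj_ba_hits = 0
    for trial in trials:
        any_fp = any(decl and not truth for truth, decl in trial)
        any_tp = any(decl and truth for truth, decl in trial)
        fwer_ba_hits += any_fp
        disj_ba_hits += any_tp
        if any(not truth for truth, _ in trial):   # has a futile cohort
            fwer_eligible += 1
            fwer_hits += any_fp
        if any(truth for truth, _ in trial):       # has a superior cohort
            disj_eligible += 1
            disj_hits += any_tp
        for truth, decl in trial:
            tp += truth and decl
            fn += truth and not decl
            fp += (not truth) and decl
            tn += (not truth) and not decl
    n = len(trials)
    return {
        "PCP": tp / (tp + fn) if tp + fn else float("nan"),
        "PCT1ER": fp / (fp + tn) if fp + tn else float("nan"),
        "FWER": fwer_hits / fwer_eligible if fwer_eligible else float("nan"),
        "FWER_BA": fwer_ba_hits / n,
        "DisjPower": disj_hits / disj_eligible if disj_eligible else float("nan"),
        "DisjPower_BA": disj_ba_hits / n,
    }
```

The conditional ("eligible") and unconditional ("BA") variants differ only in the denominator: the former counts only trials that contain at least one truly futile (respectively superior) cohort, the latter counts all simulated trials.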
Simulation Setup Overview. For the different simulation parameters, we differentiate between parameters that are considered a design choice and parameters that are considered an assumption regarding the future course of the platform or the treatment effects. For the investigated values, we state in bold which value was considered the default, that is, unless stated otherwise in a particular figure, the parameter was set to this value. For some parameters there is no default (e.g., when shown as a simulation dimension in every figure or if chosen as a fixed design parameter in section 2)
| Name | Type | Investigated values | Description |
|---|---|---|---|
| Maximum number of cohorts | Assumption | 3, 5, | Assumed maximum number of cohorts per platform (can be less in individual simulations) |
| Cohort inclusion rate | Assumption | 0.01, | Probability to include a new cohort in the ongoing platform trial after every simulated patient (unless maximum number of cohorts is reached). The default value leads to reaching the assumed maximum number of cohorts in nearly every simulation for nearly every sample size. |
| Treatment efficacy setting | Assumption | | Assumed treatment effects of the different study arms. For more information, see Table |
| Final cohort sample size | Design choice | 100, 200, 300, 400, | Number of patients after which final analysis in a cohort is conducted (interim analysis always after half the final sample size) |
| Data sharing | Design choice | All, concurrent, dynamic, cohort | Different methods of data sharing used at analyses, ranging from full pooling to not sharing at all. For more information, see section |
| Allocation ratios | Design choice | Balanced or unbalanced (depends on data sharing) | Allocation ratio to treatment arms within a cohort. Depending on the method of data sharing, the allocation ratio is set to either balanced or unbalanced (i.e., randomizing more patients to combination and add‐on monotherapy). For more information, see section |
| Bayesian decision rule | Design choice | | Thresholds used in the Bayesian decision making at interim and final analyses. Values other than the default values are used only in Figure |
| Interim analysis | Design choice | | Binding rules on whether or not a cohort can be stopped at interim for early futility (note: cohorts can always stop for early efficacy). For more information, see section |
Overview of different treatment effect assumptions. The priors for the effect sizes γ of the backbone monotherapy, add-on monotherapy and combination arms, as described in section A.3, are all pointwise with a support of 1, 2 or 3 different points (each with probability p) and result in effective response rates πSoC, πBack, πAdd and πComb. Only results of setting 1 are shown in the main text; the rest are in the supplements
| Setting | πSoC | πBack | πAdd | πComb | Description |
|---|---|---|---|---|---|
| 1 | 0.10 | 0.20 (2) | 0.10 (1) with p = 0.5; 0.20 (2) with p = 0.5 | 0.20 (1) if πAdd = 0.10; 0.40 (1) if πAdd = 0.20 | Backbone monotherapy superior to SoC; add-on monotherapy has a 50:50 chance to be superior to SoC. If the add-on monotherapy is not superior to SoC, the combination therapy is as effective as the backbone monotherapy; otherwise the combination therapy is markedly better than both monotherapies |
| 2 | 0.10 | 0.20 (2) | 0.10 (1) | 0.20 (1) | Backbone monotherapy superior to SoC, but add‐on monotherapy not superior to SoC and combination therapy not better than backbone monotherapy |
| 3 | 0.10 | 0.20 (2) | 0.10 (1) | 0.30 (1.5) | Backbone monotherapy superior to SoC and combination therapy superior to backbone monotherapy, but add-on monotherapy not superior to SoC |
| 4 | 0.10 | 0.20 (2) | 0.10 (1) | 0.40 (2) | Backbone monotherapy superior to SoC and combination therapy superior to backbone monotherapy (increased combination treatment effect compared to setting 3), but add-on monotherapy not superior to SoC |
| 5 | 0.10 | 0.20 (2) | 0.20 (2) | 0.20 (0.5) | Both monotherapies are superior to SoC, but combination therapy is not better than monotherapies |
| 6 | 0.10 | 0.20 (2) | 0.20 (2) | 0.30 (0.75) | Both monotherapies are superior to SoC and combination therapy is better than monotherapies |
| 7 | 0.10 | 0.20 (2) | 0.20 (2) | 0.40 (1) | Both monotherapies are superior to SoC and combination therapy is superior to the monotherapies (increased combination treatment effect compared to setting 6) |
| 8 | 0.10 | 0.10 (1) | 0.10 (1) | 0.10 (1) | Global null hypothesis |
| 9 | 0.20 | 0.20 (1) | 0.20 (1) | 0.20 (1) | Global null hypothesis with higher response rates |
| 10 | 0.10 | 0.20 (2) | 0.10 (1) with p = 0.5; 0.20 (2) with p = 0.5 | 0.20* 0.20* 0.20* | Backbone monotherapy superior to SoC; add-on monotherapy has a 50:50 chance to be superior to SoC. The combination therapy interaction effect can be antagonistic/non-existent, additive or synergistic (with equal probabilities) |
| 11 | 0.10 + 0.03*(c‐1) | 0.10 + 0.03*(c‐1) (1) | 0.10 + 0.03*(c‐1) (1) | 0.10 + 0.03*(c‐1) (1) | Time-trend null scenario; every new cohort enters with response rates increased by 0.03 (cohort index c, with c = 1 for the first cohort) |
| 12 | 0.10 + 0.03*(c‐1) | 0.20 + 0.03*(c‐1) (2) | 0.20 + 0.03*(c‐1) (2) | 0.40 + 0.03*(c‐1) (1) | Time-trend scenario with monotherapies superior to SoC and combination therapy superior to the monotherapies; every new cohort enters with response rates increased by 0.03 (cohort index c, with c = 1 for the first cohort) |
| 13 | 0.20 | 0.30 (1.5) | 0.30 (1.5) | 0.40 | Analogous to setting 6, but with a SoC response rate of 20% |
| 14 | 0.20 | 0.30 (1.5) | 0.30 (1.5) | 0.50 | Analogous to setting 7, but with a SoC response rate of 20% |
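The time-trend settings (11 and 12) apply the same linear drift to every arm, so later cohorts face systematically higher response rates. A one-line helper makes the arithmetic explicit (the function name and signature are illustrative):

```python
def response_rate(base: float, cohort_index: int, drift: float = 0.03) -> float:
    """Response rate under the time-trend settings: the first cohort
    (c = 1) uses the base rate, and every later cohort's rate is raised
    by `drift` per cohort, i.e. base + drift * (c - 1)."""
    return base + drift * (cohort_index - 1)
```

For instance, under setting 11 the SoC rate of the fourth cohort would be 0.10 + 0.03 * 3 = 0.19, which is why naive pooling of non-concurrent control data across cohorts can inflate error rates in these scenarios.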
FIGURE 3 Impact of the data sharing (linetype and point shape) and maximum number of cohorts per platform (x-axis) on the per-cohort and per-platform type 1 error (Figure 3A) and power (Figure 3B) in treatment efficacy setting 1. Please note that different scaling of the y-axis is used in the two subfigures. With an increasing number of cohorts in the platform, the chance of making at least one positive decision, whether false (per-platform type 1 error) or correct (disjunctive power), increases. When data is shared, the per-cohort power increases, while it stays constant when no data sharing is planned
FIGURE 4 Per-cohort and per-platform power with respect to data sharing (linetype and point shape), assumptions regarding the maximum number of cohorts (rows), cohort inclusion rate (columns) and final cohort sample size (x-axis) in treatment efficacy setting 1. Generally, both types of power increase with increasing final cohort sample size. In case of no data sharing, the per-cohort power is independent of assumptions regarding the expected platform size and cohort inclusion rate
FIGURE 5 Impact of SoC response rate (columns; treatment efficacy settings 7 vs. 14), final cohort sample size (x-axis) and data sharing (linetypes and point shapes) on per-cohort and per-platform power. We observed consistently lower power for the increased SoC response rate. We also observed that while for smaller sample sizes the per-platform power increases with the amount of data sharing, this no longer holds for larger sample sizes
FIGURE 6 Impact of the Bayesian GO decision rules on (A) per-cohort and per-platform type 1 error rates and (B) per-cohort and per-platform power in treatment efficacy setting 1. The black dot corresponds to the decision rules chosen in section 2.2. For every error rate, we set the data sharing (rows) to either full (“all”) or none (“cohort”). The x-axis shows the required confidence (“gamma”, γ) and the y-axis the required superiority (“delta”, δ) used in the Bayesian decision making in section 2.2. By choice of δ and γ, a wide range of different error rates can be obtained
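A generic rule of the kind varied in Figure 6 declares GO when the posterior probability that one arm beats another by at least a margin δ exceeds a confidence threshold γ. The Monte Carlo sketch below illustrates this idea for binary endpoints; the flat Beta(1, 1) priors, the default δ and γ, and the function signature are illustrative assumptions, not the paper's exact rule.

```python
import random

def go_decision(resp_a: int, n_a: int, resp_b: int, n_b: int,
                delta: float = 0.10, gamma: float = 0.80,
                n_draws: int = 20000, seed: int = 1) -> bool:
    """Bayesian superiority rule: GO if P(pi_a > pi_b + delta) > gamma.

    Independent Beta(1, 1) priors (an assumption of this sketch) are
    updated with the observed responders out of the per-arm sample sizes;
    the posterior probability is estimated by Monte Carlo sampling.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        pi_a = rng.betavariate(1 + resp_a, 1 + n_a - resp_a)
        pi_b = rng.betavariate(1 + resp_b, 1 + n_b - resp_b)
        hits += pi_a > pi_b + delta
    return hits / n_draws > gamma
```

Raising γ (more required confidence) or δ (more required superiority) makes the rule more conservative, trading type 1 error against power, which is exactly the trade-off swept over the two axes of Figure 6.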