
Adaptive multiarm multistage clinical trials.

Pranab Ghosh¹, Lingyun Liu², Cyrus Mehta^2,3.

Abstract

Two methods for designing adaptive multiarm multistage (MAMS) clinical trials, originating from conceptually different group sequential frameworks are presented, and their operating characteristics are compared. In both methods pairwise comparisons are made, stage-by-stage, between each treatment arm and a common control arm with the goal of identifying active treatments and dropping inactive ones. At any stage one may alter the future course of the trial through adaptive changes to the prespecified decision rules for treatment selection and sample size reestimation, and notwithstanding such changes, both methods guarantee strong control of the family-wise error rate. The stage-wise MAMS approach was historically the first to be developed and remains the standard method for designing inferentially seamless phase 2-3 clinical trials. In this approach, at each stage, the data from each treatment comparison are summarized by a single multiplicity adjusted P-value. These stage-wise P-values are combined by a prespecified combination function and the resultant test statistic is monitored with respect to the classical two-arm group sequential efficacy boundaries. The cumulative MAMS approach is a more recent development in which a separate test statistic is constructed for each treatment comparison from the cumulative data at each stage. These statistics are then monitored with respect to multiplicity adjusted group sequential efficacy boundaries. We compared the powers of the two methods for designs with two and three active treatment arms, under commonly utilized decision rules for treatment selection, sample size reestimation and early stopping. 
In our investigations, which were carried out over a reasonably exhaustive exploration of the parameter space, the cumulative MAMS designs were more powerful than the stage-wise MAMS designs, except for the homogeneous case of equal treatment effects, where a small power advantage was discernable for the stage-wise MAMS designs.


Keywords: P-value combination; Dunnett; FWER; MAMS; adaptive Dunnett; adaptive MAMS; closed testing; cumulative MAMS; early stopping; multistage design; pairwise comparison; sample size reestimation; seamless phase 2-3; treatment selection; two-stage design


Year: 2020 PMID: 32048313 PMCID: PMC7065228 DOI: 10.1002/sim.8464

Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.373

INTRODUCTION

Adaptive multiarm multistage (MAMS) clinical trials compare multiple treatment arms in pairwise fashion to a common control arm over two or more stages. These trials are characterized by interim looks at the accumulating data in order to either stop the trial early for overwhelming efficacy, stop the trial early for futility, or make mid‐course adaptive changes such as dropping ineffective treatment arms or changing the sample size, the error spending function, and the number of future looks. Two approaches, originating from different conceptual frameworks, have evolved for constructing adaptive MAMS designs in a statistically valid manner. We refer to them, respectively, as stage‐wise MAMS and cumulative MAMS, because of the manner in which the test statistic is constructed by each method. Although both methods may be viewed as multivariate extensions of the classical two‐arm group sequential design, they differ in how they control the multiplicity inherent in an adaptive MAMS design. The stage‐wise MAMS approach combines independent multiplicity adjusted P‐values from the different stages of the trial in accordance with a prespecified combination function and utilizes closed testing1 to ensure strong control of the family‐wise error rate (FWER). It provides full flexibility, at the end of each stage, to make data‐dependent adaptive changes, such as selecting a subset of the initial treatments or reestimating the sample size, for the remainder of the trial. Critical values for early efficacy stopping are obtained by applying the methods developed for classical two‐arm group sequential designs.2 Bauer and Köhne3 introduced this idea for two‐stage designs with multiple arms and Bauer and Kieser4 elaborated it further to include treatment selection at the end of stage 1. 
Posch et al5 introduced a larger family of multiplicity adjusted P‐values for the two stages, proposed the inverse normal combination function for combining them, and discussed parameter estimation at the end of the trial. One can directly extend this approach to J>2 stages, as was performed by Lehmacher and Wassmer6 for the special case of two‐arm trials and by Magirr, Stallard, and Jaki7 (Section 3.1) for multiarm trials. The cumulative MAMS approach extends the usual two‐arm group‐sequential efficacy boundaries2 to the multiarm setting. A separate cumulative test statistic having an independent increments structure is obtained for the pairwise comparison of each treatment arm to a common control arm, and is monitored stage by stage. Efficacy can be claimed for any treatment arm whose statistic crosses an efficacy boundary. These efficacy boundaries are derived from the distribution of the maximum of the test statistics under the global null hypothesis that all treatment arms are ineffective. They provide strong control of the FWER. Magirr, Jaki and Whitehead8 generated these boundaries for the maximum of the Wald statistics. Ghosh et al9 reduced the computational complexity of this approach by using the maximum score statistic, in place of the maximum Wald statistic. In both these approaches, a futility boundary could be included for dropping nonperforming treatment arms at one or more stages. However, neither Reference 8 nor Reference 9 can allow for data‐dependent adaptive changes such as treatment selection or sample size reestimation. To obtain this flexibility it is necessary to incorporate both closed testing1 and conditional error rate methodology,10, 11 into the testing framework as was done by Koenig et al12 for two‐stage designs with no early stopping and by Magirr, Stallard and Jaki7 (Section 3.2) more generally. This paper has two objectives. 
First, we show how to extend the cumulative MAMS approach of Ghosh et al9 to permit adaptive dose selection and sample size reestimation by use of closed testing and preservation of conditional error rates. Our approach is similar to that of References 12 and 7, but presented within the group sequential framework of Reference 2. For completeness we also present the stage‐wise MAMS approach within the group sequential framework of Reference 2, pointing out how it differs with respect to test statistics and group sequential boundaries from the cumulative MAMS approach. Second, we compare the operating characteristics of the cumulative MAMS and stage‐wise MAMS approaches, both analytically and empirically, in several settings. It is seen that the cumulative MAMS designs outperform the stage‐wise MAMS designs with respect to power in every setting but one, where there is a small, practically negligible, power advantage for the stage‐wise MAMS design. While two‐stage designs are by far the most common application of adaptive designs we have also included results for three‐stage designs. These results were previously unavailable due to the heavy computational burden they impose. The computational methods developed by Ghosh et al9 were essential for simulating the three‐stage cumulative MAMS designs in a realistic amount of time and thereby evaluating their operating characteristics. In Section 2 we introduce the cumulative MAMS approach, explain how the group sequential boundaries are obtained from the distribution of the maximum score statistic, and show how to incorporate adaptive treatment selection and sample size reestimation into the design. In Section 3 we review the stage‐wise MAMS approach for making adaptive changes to an ongoing study. For ease of exposition we confine our discussion in these sections to two‐stage designs, as this suffices to explain the main principles of cumulative MAMS and stage‐wise MAMS adaptation. 
The more general case of J>2 stages is discussed in the Appendix. In Section 4 we compare the power of the cumulative and stage‐wise MAMS approaches, analytically for two active doses vs placebo and by simulation for three active doses vs placebo. A more general simulation‐based comparison that incorporates treatment selection, early stopping, and sample size reestimation is presented in Section 5 for a recently completed cardiovascular trial.13 We summarize our findings in Section 6 along with some recommendations for choosing between the two approaches.

THE CUMULATIVE MAMS APPROACH

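Consider a trial in which D treatment arms, indexed by i=1,2,…,D, are each compared to a common control arm indexed by i=0. Patients are randomized to either treatment arm i or to the control arm in accordance with a prespecified allocation ratio $\lambda_i$. We assume that a patient's response on arm i is normal with mean $\mu_i$ and variance $\sigma_i^2$. Let $\delta_i=\mu_i-\mu_0$, i=1,2,…,D, represent the mean effect of treatment arm i relative to the control arm. Let $H_0^{(i)}:\delta_i=0$ denote the null hypothesis for treatment arm i and let $H_0=\cap_{i=1}^{D}H_0^{(i)}$ denote the global null hypothesis. In this section we will develop the cumulative MAMS approach for a two‐stage adaptive design to test $H_0$ against the one‐sided alternative that $\delta_i>0$ for at least one i. The generalization to J>2 stages is presented in Appendix A1. Let j=1,2 denote the first and second stages, respectively, and let $n_{ij}$ be the cumulative sample size of arm i at stage j. Define the score statistic $W_{ij}=\hat\delta_{ij}\mathcal{I}_{ij}$, where $\hat\delta_{ij}$ is the maximum likelihood estimate of $\delta_i$ and $\mathcal{I}_{ij}=(\sigma_i^2/n_{ij}+\sigma_0^2/n_{0j})^{-1}$ is its Fisher information from data up to and including stage j. Then $\{(W_{1j},\ldots,W_{Dj}):j=1,2\}$ is a multivariate Brownian process with $E(W_{ij})=\delta_i\mathcal{I}_{ij}$, $\mathrm{Var}(W_{ij})=\mathcal{I}_{ij}$, $\mathrm{Cov}(W_{i1},W_{i2})=\mathcal{I}_{i1}$, and $\mathrm{Cov}(W_{ij},W_{i'j})=\mathcal{I}_{ij}\mathcal{I}_{i'j}\sigma_0^2/n_{0j}$ for $i\neq i'$. These results hold exactly if the patient level data are normally distributed and asymptotically otherwise.14 Let $\mathbf{W}_j=(W_{1j},\ldots,W_{Dj})$ and $\boldsymbol{\delta}=(\delta_1,\ldots,\delta_D)$. For future reference let $W_{i(2)}=\hat\delta_{i(2)}\mathcal{I}_{i(2)}$ be the score statistic for the incremental data accumulated between stage 1 and stage 2, with Fisher information $\mathcal{I}_{i(2)}=(\sigma_i^2/n_{i(2)}+\sigma_0^2/n_{0(2)})^{-1}$, where $n_{i(2)}=n_{i2}-n_{i1}$ and $n_{0(2)}=n_{02}-n_{01}$. Then $\mathbf{W}_{(2)}=(W_{1(2)},\ldots,W_{D(2)})$ is independent of $\mathbf{W}_1$ and has a multivariate normal distribution with $E(W_{i(2)})=\delta_i\mathcal{I}_{i(2)}$, $\mathrm{Var}(W_{i(2)})=\mathcal{I}_{i(2)}$, and $\mathrm{Cov}(W_{i(2)},W_{i'(2)})=\mathcal{I}_{i(2)}\mathcal{I}_{i'(2)}\sigma_0^2/n_{0(2)}$ for $i\neq i'$. In practice, when evaluating these distributions, we will replace the unknown Fisher information quantities $\mathcal{I}_{ij}$ and $\mathcal{I}_{i(2)}$ by corresponding estimates from the data. (See, for example, Equation (9).) The simulation results in Table 1 of Section 5 demonstrate that this second‐order approximation preserves type‐1 error even for relatively small sample sizes. Using computational methods discussed in Ghosh et al9 for multivariate Brownian processes we can obtain level‐α group sequential boundaries $(b_1,b_2)$ such that

$$P_0\left(\max_{i}W_{i1}\ge b_1\right)=\alpha_1 \quad\text{and}\quad P_0\left(\max_{i}W_{i1}<b_1,\ \max_{i}W_{i2}\ge b_2\right)=\alpha-\alpha_1, \tag{1}$$

where $P_0$ denotes probability under $H_0$ and $\alpha_1$ is the portion of the prespecified allowable type‐1 error that is spent at stage 1.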
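The boundary calculation described above can be illustrated by brute-force simulation. The sketch below is not the computational method of Ghosh et al; it is a Monte Carlo approximation under simplifying assumptions that are ours: equal allocation to all arms, known common variance (so that the standardized statistics of any two arms have correlation 0.5), Wald-scale boundaries, and Lan-DeMets O'Brien-Fleming-type error spending. The function names `ld_obf_alpha_spent` and `mc_multiarm_boundaries` are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def ld_obf_alpha_spent(t, alpha):
    """Lan-DeMets O'Brien-Fleming-type alpha spent at information fraction t (one-sided)."""
    nd = NormalDist()
    return 2.0 * (1.0 - nd.cdf(nd.inv_cdf(1.0 - alpha / 2.0) / np.sqrt(t)))


def mc_multiarm_boundaries(n_arms, alpha=0.025, info_frac=0.5, n_sim=400_000, seed=2020):
    """Monte Carlo Wald-scale boundaries (b1, b2) for the maximum statistic under H0.

    Assumes equal allocation and known variance (illustrative assumptions).
    """
    rng = np.random.default_rng(seed)

    def increments():
        # Standardized stage-wise statistics; the shared control arm
        # induces a correlation of 0.5 between any two treatment arms.
        ctrl = rng.standard_normal((n_sim, 1))
        trt = rng.standard_normal((n_sim, n_arms))
        return (trt - ctrl) / np.sqrt(2.0)

    x1, x2 = increments(), increments()
    z1 = x1                                                       # stage 1 statistics
    z2 = np.sqrt(info_frac) * x1 + np.sqrt(1.0 - info_frac) * x2  # cumulative statistics
    max1, max2 = z1.max(axis=1), z2.max(axis=1)

    alpha1 = ld_obf_alpha_spent(info_frac, alpha)
    b1 = np.quantile(max1, 1.0 - alpha1)          # P0(max Z_i1 >= b1) = alpha1

    # Choose b2 so that P0(max Z_i1 < b1, max Z_i2 >= b2) = alpha - alpha1.
    continuing = np.sort(max2[max1 < b1])
    k = int(round((alpha - alpha1) * n_sim))      # number of allowed stage 2 rejections
    b2 = continuing[-k]
    return alpha1, b1, b2


if __name__ == "__main__":
    print(mc_multiarm_boundaries(n_arms=4))
```

With `n_arms=1` the same routine approximates ordinary two-arm group sequential boundaries, which gives a quick check on the simulation; the estimates carry Monte Carlo error, particularly `b1`, which sits far in the tail.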

Table 1

Power comparisons of single stage, stage‐wise multiarm multistage (MAMS) and cumulative MAMS designs

(A) Two‐stage SOCRATES design (10 000 simulated trials). Entries are power (standard error).
δ = (δ1, δ2, δ3) (with σ=0.52)	Single‐stage Dunnett	Adaptive stage‐wise MAMS, Bonferroni	Adaptive stage‐wise MAMS, Simes	Adaptive stage‐wise MAMS, Dunnett	Adaptive cumulative MAMS
(0.187, 0.187, 0.187)	0.804 (.004)	0.728 (.004)	0.785 (.004)	0.786 (.004)	0.805 (.004)
(0, 0.187, 0.187)	0.731 (.004)	0.667 (.005)	0.713 (.004)	0.734 (.004)	0.768 (.004)
(0, 0, 0.187)	0.591 (.005)	0.521 (.005)	0.527 (.005)	0.597 (.005)	0.657 (.005)
(0, 0, 0)	0.025 (.002)	0.018 (.001)	0.020 (.001)	0.021 (.001)	0.023 (.001)
Drop any treatment i at stage 1 if the corresponding δ̂_i1 < 0

(B) Three‐stage SOCRATES design (10 000 simulated trials). Entries are power (standard error).
δ = (δ1, δ2, δ3) (with σ=0.52)	Single‐stage Dunnett	Adaptive stage‐wise MAMS, Bonferroni	Adaptive stage‐wise MAMS, Simes	Adaptive stage‐wise MAMS, Dunnett	Adaptive cumulative MAMS
(0.187, 0.187, 0.187)	0.804 (.004)	0.678 (.005)	0.778 (.004)	0.787 (.004)	0.806 (.004)
(0, 0.187, 0.187)	0.731 (.004)	0.610 (.005)	0.691 (.005)	0.725 (.004)	0.773 (.004)
(0, 0, 0.187)	0.591 (.005)	0.445 (.005)	0.494 (.005)	0.592 (.005)	0.647 (.005)
(0, 0, 0)	0.025 (.002)	0.017 (.001)	0.018 (.001)	0.022 (.001)	0.023 (.001)
Drop any treatment i at stage 1 if the corresponding δ̂_i1 < 0

We shall, throughout, denote observed values of random variables by lowercase letters. Thus $w_{ij}$ denotes the observed value of $W_{ij}$. We may reject any hypothesis $H_0^{(i)}$ for which the corresponding $w_{i1}\ge b_1$. The trial is then terminated for efficacy. If, however, $\max_i w_{i1}<b_1$, the trial continues to stage 2, where again any hypothesis $H_0^{(i)}$ is rejected for which the corresponding $w_{i2}\ge b_2$. Due to the use of the statistic $\max_i W_{ij}$, this hypothesis testing procedure maintains strong control of the FWER.8 It is important to recognize that the efficacy boundaries for a multiarm group sequential design must be stricter than the corresponding efficacy boundaries for a two‐arm group sequential design, since the former have to adjust for the multiplicity due to testing more than one hypothesis at each look. For example, if D=4 and the boundaries are derived from the Lan and DeMets15 error spending function for a one‐sided test at α=0.025 with an interim look at 50% of the total information, the multiarm group sequential boundaries for each treatment i exceed the corresponding two‐arm group sequential efficacy boundaries at both looks. We consider two possible adaptations at the end of stage 1. (a) Permit one or more treatment arms to be dropped. (b) Alter the sample size of each treatment arm i that will be proceeding to stage 2, while maintaining its allocation ratio $\lambda_i$. Strong control of FWER can be maintained without any adjustment to the group sequential design if (a) is the only adaptation. We can, optionally, improve the efficiency of the design by recomputing the stage 2 boundary in conjunction with closed testing. If, on the other hand, the adaptation includes (b) then it is essential to recompute the stage 2 boundary in conjunction with closed testing in order to maintain strong control of FWER. We next discuss how this is accomplished. Let $S\subseteq\{1,2,\ldots,D\}$ denote the indices of the treatments selected for stage 2. At stage 2 we are interested in testing $H_0^{(i)}$ for all i∈S while maintaining strong control of the FWER at level α. To achieve this control, each $H_0^{(i)}$ must be tested by a closed level‐α test. That is, $H_0^{(i)}$ may only be rejected if, for all $I\subseteq\{1,2,\ldots,D\}$ such that i∈I, the intersection hypothesis $H_0^{I}=\cap_{i\in I}H_0^{(i)}$ is rejected with a valid local level‐α test.1 The valid local level‐α test of $H_0^{I}$ is constructed in two steps.

Step 1. Compute two‐stage group sequential level‐α boundaries $(b_{I,1},b_{I,2})$ for making |I| comparisons to a common control. These boundaries must satisfy

$$P_0\left(W_{I,1}\ge b_{I,1}\right)=\alpha_1 \quad\text{and}\quad P_0\left(W_{I,1}<b_{I,1},\ W_{I,2}\ge b_{I,2}\right)=\alpha-\alpha_1,$$

where $W_{I,j}=\max_{i\in I}W_{ij}$, j=1,2, and $P_0$ denotes probability under $H_0^{I}$. If $w_{I,1}\ge b_{I,1}$, $H_0^{I}$ is rejected. Otherwise we proceed to Step 2.

Step 2. After examining the stage 1 data a subset S consisting of |S| treatments is selected for testing at stage 2. Suppose that the incremental stage 2 sample size of the control arm is altered from $n_{0(2)}$ to $n_{0(2)}^{*}$, and suppose that the incremental stage 2 sample sizes of the |S| treatment arms are correspondingly increased so as to preserve their respective allocation ratios relative to the control arm. Let $I_S=I\cap S$. In order to preserve the type‐1 error of the trial we must replace the stage 2 boundary $b_{I,2}$ with $b_{I_S,2}^{*}$ such that

$$P_0\left(\max_{i\in I_S}\left(w_{i1}+W_{i(2)}^{*}\right)\ge b_{I_S,2}^{*}\,\middle|\,\mathbf{w}_1\right)=P_0\left(\max_{i\in I}\left(w_{i1}+W_{i(2)}\right)\ge b_{I,2}\,\middle|\,\mathbf{w}_1\right), \tag{2}$$

where $\mathbf{w}_1$ is the observed vector of stage 1 score statistics and the “∗” indicates that the sample size of the stage 2 statistic has been altered from $n_{i(2)}$ to $n_{i(2)}^{*}$. We reject $H_0^{I}$ if $\max_{i\in I_S}(w_{i1}+w_{i(2)}^{*})\ge b_{I_S,2}^{*}$. Equation (2) is a consequence of the conditional error rate principle11 which states that in order to preserve the overall type‐1 error of the trial its conditional type‐1 error after adaptation should not exceed the conditional type‐1 error of the original trial, given the stage 1 data. Thereby $H_0^{I}$ is rejected by a valid level‐α test.

Finally, rejection of $H_0^{(i)}$ requires that $H_0^{I}$ be rejected in the above manner for all possible subsets $I\subseteq\{1,2,\ldots,D\}$ that contain i. This will ensure that the test of $H_0^{(i)}$ is closed and will thereby guarantee strong control of FWER.
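The closed testing logic described above can be sketched in a few lines. The enumeration of index sets follows the description in the text, but the local test used here is a stand-in of ours: a Bonferroni test on hypothetical final P-values in which dropped arms count as P=1, rather than the group sequential and conditional error rate test of Steps 1 and 2. All function names and numbers are illustrative.

```python
from itertools import combinations


def subsets_containing(i, all_arms):
    """Yield every index set I (as a frozenset) drawn from all_arms that contains arm i."""
    others = [a for a in all_arms if a != i]
    for r in range(len(others) + 1):
        for rest in combinations(others, r):
            yield frozenset((i, *rest))


def closed_test(selected, all_arms, local_test):
    """Return the selected arms whose elementary null is rejected by the closed test.

    An arm is rejected only if local_test(I) rejects for every I that contains it.
    """
    return [i for i in selected
            if all(local_test(I) for I in subsets_containing(i, all_arms))]


def bonferroni_local_test(p_values, alpha):
    """Stand-in local level-alpha test (an assumption, not the paper's test).

    Arms absent from p_values (dropped arms) are treated as having P = 1.
    """
    def test(I):
        return min(p_values.get(i, 1.0) for i in I) <= alpha / len(I)
    return test


all_arms = [1, 2, 3]
p_final = {2: 0.02, 3: 0.004}   # hypothetical P-values; arm 1 was dropped at stage 1
rejected = closed_test([2, 3], all_arms, bonferroni_local_test(p_final, alpha=0.025))
print(rejected)                 # prints [3]
```

Arm 2 is not rejected because the intersection hypothesis for the set {1, 2} fails its local test, even though arm 2 alone would pass at level 0.025. Replacing `bonferroni_local_test` with a function implementing Steps 1 and 2 would give the procedure described in this section.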

THE STAGE‐WISE MAMS APPROACH

We recapitulate the two‐stage method described by Reference 5, but present it in the classical group sequential framework of Reference 2, which facilitates generalization to J>2 stages as given in Appendix A2. Recall from Section 2 that we can reject any elementary hypothesis $H_0^{(i)}$ only if the intersection hypothesis $H_0^{I}$ is rejected by a valid local level‐α test for all subsets $I\subseteq\{1,2,\ldots,D\}$ that contain i. In stage‐wise MAMS the test of $H_0^{I}$ utilizes multiplicity adjusted P‐values computed from the incremental data at stages 1 and 2. Any valid multiplicity adjusted P‐values may be utilized for this purpose. Popular candidates include the t‐test based P‐values adjusted for multiplicity by the nonparametric Bonferroni and Simes procedures for which the appropriate formulae are given in Reference 5. However, in order to make a meaningful comparison between the cumulative and stage‐wise MAMS approaches, we will utilize P‐values that are derived from the maximum score statistic. In that case the multiplicity adjusted P‐value for testing $H_0^{I}$ at stage j is the single‐stage Dunnett P‐value16

$$p_{I,j}=P_0\left(\max_{i\in I}W_{i(j)}\ge\max_{i\in I}w_{i(j)}\right), \tag{3}$$

where $W_{i(1)}=W_{i1}$ and $W_{i(2)}$ are the score statistics based on the incremental data at stages 1 and 2, respectively. To evaluate Equation (3) exactly we define, for all i∈I, $t_{i(j)}=w_{i(j)}/\sqrt{\hat{\mathcal{I}}_{i(j)}}$, where $\hat{\mathcal{I}}_{i(j)}$ is the estimated Fisher information from the incremental data of stage j. Define $t_{I(j)}=\max_{i\in I}t_{i(j)}$. Then the multiplicity adjusted Dunnett P‐value can be computed exactly as $p_{I,j}=P(\max_{i\in I}T_i\ge t_{I(j)})$, where $(T_i: i\in I)$ has a multivariate‐T distribution with mean zero, degrees of freedom determined by the stage j sample sizes, and a known covariance matrix that depends on the allocation ratios of the treatment arms to the control arm. A two‐stage level‐α test of $H_0^{I}$ can now be constructed as follows. Define the test statistic for stage 1 as $Z_{I,1}=\Phi^{-1}(1-p_{I,1})$. We will use the same type‐1 error, α1, for stage 1 as was used in the cumulative MAMS approach. Thus for any $I\subseteq\{1,2,\ldots,D\}$, $H_0^{I}$ is rejected by a valid level‐α1 test if $Z_{I,1}\ge c_1$, where $c_1=\Phi^{-1}(1-\alpha_1)$. The trial terminates for efficacy at stage 1 if there exists at least one i such that for all I that contain i, $Z_{I,1}\ge c_1$, for then $H_0^{(i)}$ can be rejected by a level‐α1 closed test. If the trial does not terminate at stage 1 let $S\subseteq\{1,2,\ldots,D\}$ be the set of treatment indexes selected for stage 2 and $I_S=I\cap S$ be the set of treatments from I that are carried forward to stage 2. Let $w_{I_S(2)}=\max_{i\in I_S}w_{i(2)}$ denote the maximum incremental score statistic in the set $I_S$. Then the second‐stage P‐value for testing $H_0^{I}$ is $p_{I_S,2}=P_0(\max_{i\in I_S}W_{i(2)}\ge w_{I_S(2)})$. We now compute the test statistic for stage 2 as a weighted sum of inverse normal components, $Z_{I,2}=h_1\Phi^{-1}(1-p_{I,1})+h_2\Phi^{-1}(1-p_{I_S,2})$, where $h_1$ and $h_2$ are prespecified weights whose sum of squares is 1. The statistics $Z_{I,1}$ and $Z_{I,2}$ are N(0,1) under $H_0^{I}$ and $Z_{I,2}-h_1Z_{I,1}$ is independent of $Z_{I,1}$. Thus one can readily obtain the efficacy boundary $c_2$ such that $P_0(Z_{I,1}<c_1,\ Z_{I,2}\ge c_2)=\alpha-\alpha_1$ by the usual methods for two‐arm group sequential designs.2 We reject $H_0^{(i)}$ with strong control of FWER if $Z_{I,2}\ge c_2$ for all possible $I\subseteq\{1,2,\ldots,D\}$ with i∈I. The generalization to J>2 stages is given in Appendix A2. Note that the efficacy boundaries $(c_1,c_2)$ only protect against the multiplicity induced by testing the same hypothesis over two stages. In particular, they do not adjust for the multiplicity due to testing multiple treatment arms against a common control arm. The latter multiplicity adjustment is applied through the Dunnett P‐values. In contrast the cumulative MAMS approach applies the adjustments for both sources of multiplicity directly through the efficacy boundaries. For example, if D=4 the Lan‐DeMets15 efficacy boundaries for the stage‐wise MAMS design are $c_1=2.9626$ and $c_2=1.9868$. These are the efficacy boundaries for comparing a single treatment arm to a control arm even though in fact four treatments are being compared to the same control. For the cumulative MAMS design, however, the Wald‐scale boundaries for comparing four treatments to a common control would be larger, since they also carry the adjustment for multiple arms.
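The stage 1 boundary and the inverse normal combination described above can be worked through numerically. The sketch below assumes a one-sided α=0.025, an interim look at 50% of the information, and the Lan-DeMets O'Brien-Fleming-type spending function; the weights are taken as the square roots of the information fractions, and the two stage-wise P-values are hypothetical inputs (in the method itself they would be the multiplicity adjusted P-values of the intersection hypothesis). Function names are ours.

```python
from math import sqrt
from statistics import NormalDist

ND = NormalDist()


def ld_obf_alpha_spent(t, alpha):
    """Lan-DeMets O'Brien-Fleming-type alpha spent at information fraction t (one-sided)."""
    return 2.0 * (1.0 - ND.cdf(ND.inv_cdf(1.0 - alpha / 2.0) / sqrt(t)))


def inverse_normal_z(p1, p2, h1, h2):
    """Weighted inverse normal combination of two stage-wise P-values."""
    if abs(h1 * h1 + h2 * h2 - 1.0) > 1e-9:
        raise ValueError("weights must satisfy h1^2 + h2^2 = 1")
    return h1 * ND.inv_cdf(1.0 - p1) + h2 * ND.inv_cdf(1.0 - p2)


alpha, t1 = 0.025, 0.5                  # assumed design parameters
alpha1 = ld_obf_alpha_spent(t1, alpha)  # type-1 error spent at stage 1
c1 = ND.inv_cdf(1.0 - alpha1)           # stage 1 boundary, about 2.96

h1, h2 = sqrt(t1), sqrt(1.0 - t1)       # prespecified weights (assumed choice)
p1, p2 = 0.04, 0.01                     # hypothetical adjusted stage-wise P-values

z1 = ND.inv_cdf(1.0 - p1)               # stage 1 test statistic
z2 = inverse_normal_z(p1, p2, h1, h2)   # stage 2 test statistic
stop_at_stage_1 = z1 >= c1

print(f"alpha1={alpha1:.6f}  c1={c1:.4f}  z1={z1:.4f}  z2={z2:.4f}  stop={stop_at_stage_1}")
```

The second boundary is not computed here; it requires the joint distribution of the two test statistics and is obtained with standard two-arm group sequential software.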

CUMULATIVE MAMS VS STAGE‐WISE MAMS

Our goal is to compare the cumulative and stage‐wise MAMS approaches with respect to global power, defined here as the probability of rejecting $H_0^{(i)}$ for any treatment i, i=1,2,…,D. We will first make these comparisons for the special case of two active doses, no early stopping and no dose selection. In this ideal setting it is possible to make the comparisons analytically and thereby gain a deeper insight into the conditions under which one method has greater power than the other. We will then extend these comparisons to more general settings by simulation.

Analytical Comparison with Two Active Doses and Two Stages

Patients are randomized equally between the three arms of the study and each patient's response is normally distributed with σ²=1. The control arm has a mean of zero and treatment i has mean $\delta_i$, i=1,2. The null hypothesis corresponding to treatment i is $H_0^{(i)}:\delta_i=0$. We will test the global null hypothesis $H_0=H_0^{(1)}\cap H_0^{(2)}$ against the one‐sided alternative that $\delta_i>0$ for at least one i=1,2. Under the assumption of no early stopping, no dropping of treatments and no adaptive sample size reestimation, one can derive analytical power functions for the cumulative and stage‐wise MAMS designs. Let $f_1(w_{11},w_{21})$ be the probability density function of $(W_{11},W_{21})$, the stage 1 score statistics. Let $f_{(2)}(w_{1(2)},w_{2(2)})$ be the probability density function of $(W_{1(2)},W_{2(2)})$, the incremental stage 2 score statistics. (For notational convenience we have suppressed the dependence of these densities on $\boldsymbol{\delta}=(\delta_1,\delta_2)$.) Let $b_2$ denote the critical value for declaring statistical significance at the end of stage 2. Then we have shown in Appendix A1 that P(CUMUL) and P(STAGE), the respective cumulative and stage‐wise MAMS probabilities of rejecting $H_0$ when the true treatment effect is $\boldsymbol{\delta}$, are given by

$$P(\mathrm{CUMUL})=1-\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f_1(w_{11},w_{21})\left[\int_{-\infty}^{b_2-w_{11}}\int_{-\infty}^{b_2-w_{21}}f_{(2)}(w_{1(2)},w_{2(2)})\,dw_{2(2)}\,dw_{1(2)}\right]dw_{21}\,dw_{11} \tag{6}$$

and

$$P(\mathrm{STAGE})=1-\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f_1(w_{11},w_{21})\left[\int_{-\infty}^{k(p_1)}\int_{-\infty}^{k(p_1)}f_{(2)}(w_{1(2)},w_{2(2)})\,dw_{2(2)}\,dw_{1(2)}\right]dw_{21}\,dw_{11}, \tag{7}$$

where $p_1$ and $p_2$ are the multiplicity‐adjusted P‐values for the two stages, and $k(p_1)$, the stage 2 critical value for $\max(w_{1(2)},w_{2(2)})$ implied by the combination test, is a function of the maximum of $(w_{11},w_{21})$ through $p_1$. It is instructive to compare the two power functions (6) and (7). They differ only in the upper limits of the inner (or stage 2) integrals. In P(CUMUL) the stage 2 score statistics $(w_{1(2)},w_{2(2)})$ are confined to the region $(-\infty,b_2-w_{11})\times(-\infty,b_2-w_{21})$. Notice that this is the acceptance region for a test that rejects $H_0$ if either $w_{11}+w_{1(2)}\ge b_2$ or $w_{21}+w_{2(2)}\ge b_2$. Thus P(CUMUL) is derived from a test that is based on sufficient statistics. In contrast the stage 2 score statistics $(w_{1(2)},w_{2(2)})$ in the expression for P(STAGE) are confined to the region $(-\infty,k(p_1))\times(-\infty,k(p_1))$. This is the acceptance region for a test that rejects $H_0$ if $\max(w_{1(2)},w_{2(2)})\ge k(p_1)$. Clearly this test is not based on sufficient statistics. The impact on global power of nonadherence to the sufficiency principle is shown in Figure 1, where the two test methods are compared for δ1 and δ2 in the range 0 to 3, and in Figure 2, where the two test methods are compared with equal δ values over the range δ1=δ2=0 to δ1=δ2=3. We have chosen α=0.05 for both test methods, with the total statistical information for evaluating P(CUMUL) and the stage‐wise statistical information for evaluating P(STAGE) set so that both designs achieve 0.95 power at δ1=δ2=3 and FWER equal to 0.05 at δ1=δ2=0. The following conclusions may be drawn:

Figure 1

Analytical power comparisons: Stage‐wise vs cumulative multiarm multistage

Figure 2

Detailed analytical power comparisons at δ1=δ2

(a) Except for a small region near δ1=δ2=1.5, P(CUMUL) exceeds P(STAGE) everywhere, with absolute power gains between 0% and 5%.
(b) When δ1=δ2=1.5 there is a tiny power loss, P(CUMUL)−P(STAGE)=−0.2%, which disappears rapidly as soon as δ2 moves away from δ1.
(c) The power gain for P(CUMUL) is maximum when the two δ values differ by the greatest amount: δ1=0, δ2=3 or δ1=3, δ2=0.
(d) The slight loss in power at δ1=δ2=1.5 shown in Figure 1 suggests that similar losses might also occur at other values of δ1=δ2. This is confirmed by an examination of Figure 2 where P(CUMUL)−P(STAGE) is plotted over the range δ1=δ2=0 to δ1=δ2=3. The power loss is zero at δ1=δ2=0, increases gradually to its maximum, P(CUMUL)−P(STAGE)=−0.002, at δ1=δ2=1.5 and then declines, reaching zero once again at δ1=δ2=3.

It is worth noting that, in this setting, the cumulative MAMS design has the property of consonance. When $H_0$ is rejected by the cumulative MAMS method we can, in addition to rejecting $H_0$, also reject either $H_0^{(1)}$ or $H_0^{(2)}$ or both of them, depending on which component(s) of $(w_{12},w_{22})$ crossed the efficacy boundary. For the P‐value combination test, however, rejecting $H_0$ does not provide any additional information about the status of $H_0^{(1)}$ or $H_0^{(2)}$ individually. We need to further reject either $H_0^{(1)}$ or $H_0^{(2)}$ or both by local level‐α tests before we can make an efficacy claim for these dose groups. These additional tests have not been factored into the analytical power calculations for the P‐value combination approach. Therefore we can conclude that the actual power of the P‐value combination approach to identify efficacious doses is even less than P(STAGE).
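The comparison in this subsection can also be approximated by simulation rather than by numerical integration. The sketch below is an illustration under assumptions of ours, not the calculation behind Figures 1 and 2: two arms with equal allocation and known variance, two equally sized stages, no early stopping, equal combination weights, Dunnett-type P-values taken from an empirical null distribution of the maximum statistic, and treatment effects expressed as drifts `theta` (the means of the final Wald-scale statistics) rather than as δ values. The function name `simulate_global_power` is hypothetical.

```python
import numpy as np
from statistics import NormalDist


def simulate_global_power(theta, alpha=0.05, n_sim=100_000, seed=7):
    """Return (P(CUMUL), P(STAGE)) estimated by simulation for two arms and two stages."""
    rng = np.random.default_rng(seed)
    nd = NormalDist()

    def stage_z(drift):
        # Standardized stage-wise statistics; the shared control gives correlation 0.5.
        ctrl = rng.standard_normal((n_sim, 1))
        trt = rng.standard_normal((n_sim, 2))
        return (trt - ctrl) / np.sqrt(2.0) + drift

    # Empirical null distribution of max(Z1, Z2): supplies both the cumulative
    # critical value b2 and the Dunnett-type stage-wise P-values.
    null_max = np.sort(stage_z(0.0).max(axis=1))
    b2 = np.quantile(null_max, 1.0 - alpha)

    # Each stage carries half of the total information.
    stage_drift = np.asarray(theta, dtype=float) / np.sqrt(2.0)
    x1, x2 = stage_z(stage_drift), stage_z(stage_drift)

    # Cumulative test: maximum over arms of the pooled statistic.
    reject_cumul = ((x1 + x2) / np.sqrt(2.0)).max(axis=1) >= b2

    def z_from_max(m):
        # P-value of the stage maximum under the null, then inverse normal transform.
        p = 1.0 - np.searchsorted(null_max, m, side="left") / n_sim
        p = np.clip(p, 0.5 / n_sim, 1.0 - 0.5 / n_sim)
        return np.array([nd.inv_cdf(1.0 - v) for v in p])

    # Stage-wise test: inverse normal combination of the two stage P-values.
    combo = (z_from_max(x1.max(axis=1)) + z_from_max(x2.max(axis=1))) / np.sqrt(2.0)
    reject_stage = combo >= nd.inv_cdf(1.0 - alpha)
    return reject_cumul.mean(), reject_stage.mean()


if __name__ == "__main__":
    for theta in [(0.0, 0.0), (0.0, 3.0), (3.0, 3.0)]:
        print(theta, simulate_global_power(theta))
```

Under the global null both rejection rates should be close to α, and with one effective and one ineffective arm the cumulative test should show the higher rejection rate, in line with conclusion (c). The results carry Monte Carlo error and depend on the assumed drifts.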

Simulation‐based comparison with three active doses and selection

The analytical expressions in Equations (6) and (7) were derived in the idealized setting of two active doses, no early stopping and no dropping of treatment arms at the end of stage 1. We now consider the more realistic setting of three active doses in which nonperforming doses are dropped at the end of stage 1. Figure 3 is a three‐dimensional (3D) plot showing the absolute power gain, P(CUMUL)−P(STAGE), when δ3=0.3, (δ1,δ2)=0,0.05,…,0.3, σ²=1, and treatment i is dropped at the end of stage 1 if $\hat\delta_{i1}<-0.1$. Figure 4 is a similar 3D plot with the same σ² and range of values for the δ's, but with a stricter criterion for dropping doses; here treatment i is dropped if $\hat\delta_{i1}<-0.3$. Both plots are based on 10 000 simulated trials. By examining these plots one may draw three important conclusions about the power differential between the cumulative MAMS and stage‐wise MAMS designs.

Figure 3

P(CUMUL)−P(STAGE): δ3=0.3; (δ1,δ2)=0(0.05)0.3; drop dose if δ̂_i1<−0.1

Figure 4

P(CUMUL)−P(STAGE): δ3=0.3; (δ1,δ2)=0(0.05)0.3; drop dose if δ̂_i1<−0.3

(a) P(CUMUL) exceeds P(STAGE) with absolute power gains up to 9% when the cut‐off for dropping doses is −0.1 and up to 11% when the cut‐off for dropping doses is −0.3.
(b) The gain in power of P(CUMUL) over P(STAGE) appears to depend on the degree of heterogeneity among the δ values. The greater the heterogeneity, the greater the power gain. To see this note the following. The gain in power of P(CUMUL) over P(STAGE) is maximum when δ1=δ2=0 and δ3=0.3. The gain is zero when δ1=δ2=δ3=0.3. At δ3=0.3 and any fixed value for δ1, the gain increases as δ2 decreases from 0.3 to 0. At δ3=0.3 and any fixed value for δ2, the gain increases as δ1 decreases from 0.3 to 0.
(c) The gain in power of P(CUMUL) over P(STAGE) is larger in Figure 4 than in Figure 3 for every (δ1,δ2,δ3) combination. As the only difference between the two figures is the value of $\hat\delta_{i1}$ below which doses are dropped, it would appear that the stricter the criterion for dropping doses at the end of stage 1, the greater the power differential. We will revisit this conjecture in Section 5 in the context of an actual clinical trial.

Figures 3 and 4 display results only for the portion of the parameter space where δ3=0.3 and (δ1,δ2)≤δ3. For completeness, additional simulations were also carried out in the region of the parameter space where δ1 and δ2 exceed δ3=0.3. Here too P(CUMUL) exceeded P(STAGE) everywhere. The power gains were, however, small (about 0.5% on average), because in this region of the parameter space both P(CUMUL) and P(STAGE) had very large absolute powers, 93% to 99%.

THE SOCRATES‐REDUCED TRIAL

SOCRATES‐REDUCED was a multicenter, randomized, placebo‐controlled trial that enrolled patients with worsening chronic heart failure after clinical stabilization.13 Patients were randomized to three different dose groups (2.5, 5, and 10 mg) of oral vericiguat or placebo. The primary endpoint of the trial was change from baseline to week 12 in log‐transformed N‐terminal pro‐B‐type natriuretic peptide (NT‐proBNP). The statistical analysis plan specified that for the analysis of the primary endpoint the patients from the three dose groups would be pooled and compared to the placebo arm. The trial was designed for 80% power to detect a difference of δ=0.187 between the pooled dose group and placebo, at one‐sided α=0.025. In order to meet these design requirements, and assuming that σ=0.52, a total of 260 patients (65/arm) were randomized to the study. This trial, however, failed to show statistical significance. The observed treatment effect for the pooled dose group relative to placebo was only 0.122 (P‐value = .075, one‐sided). The data from the trial showed a dose‐response relationship with an observed difference from placebo of 0.248 for the 10‐mg dose group (P = .024), 0.073 for the 5‐mg dose group (P = .15), and 0.04 for the 2.5‐mg dose group (P = .19). Pooling the three dose groups for the final analysis caused a dilution of the observed treatment effect and resulted in a failed trial even though the 10‐mg dose appears to be clearly effective. We will use this example to display the operating characteristics of alternative cumulative and stage‐wise MAMS designs that might have been used for identifying effective doses in a multiarm setting. A single‐stage four‐arm design based on Dunnett's test in which σ=0.52 and δ=0.187 for each dose vs placebo requires 388 patients (97/arm) for 80% power at one‐sided α=0.025. Here power is defined as the probability that the null hypothesis will be rejected for at least one dose group. 
In Table 1 we compare the operating characteristics of this single‐stage Dunnett design with corresponding operating characteristics of stage‐wise MAMS designs that utilize three different multiplicity‐adjusted P‐values (Bonferroni, Simes, or Dunnett), and with the cumulative MAMS design, under a range of treatment differences from placebo for the three dose groups. These adaptive designs are conducted over two equally spaced stages in Table 1A and over three equally spaced stages in Table 1B. The adaptation occurs at the end of stage 1 and consists of early stopping if any dose group crosses an efficacy boundary, or dropping any dose group having an observed treatment effect that is worse than placebo. When doses are dropped their remaining sample sizes are reallocated in equal proportion to the remaining doses or placebo. The Bonferroni, Simes, and Dunnett stage‐wise MAMS procedures combine multiplicity‐adjusted P‐values derived from the Student's t distribution in accordance with Equation (A8) of Appendix A2. All table entries are based on 10 000 simulated trials. The value of α spent at each stage j to obtain the efficacy stopping boundaries is derived from the Lan and DeMets, O'Brien‐Fleming type, error spending function.15 For the stage‐wise MAMS designs these are the usual two‐arm group sequential boundaries, obtained as solutions to Equations (A11) and (A12) of Appendix A3. For the cumulative MAMS design, these are multiplicity adjusted multiarm group sequential boundaries, derived as shown in equations (A5) and (A6) of Appendix A2. However, as recommended by Wason et al,17 these multiarm boundaries, b , are further transformed by the formula to adjust for possible biases in small samples due to estimating the unknown for each treatment i in the compuation of the test statistic. 
Here is the estimated Fisher information about δ at stage j, is the estimated variance of the response to treatment i, based on cumulative data up to and including stage j, and is the inverse of the Student's t distribution with degrees of freedom d =n 0+n −1. This adjustment to the boundaries allows us to use estimated Fisher information in place of the unknown actual Fisher information without inflating the type‐1 error. The last rows of Table 1 show that this adjustment preserves the FWER, albeit slightly conservatively. We have verified that if the simulations are performed with the actual Fisher information, the FWER is exactly 0.025, thereby demonstrating that, in the absence of any large sample approximations, the adaptive cumulative MAMS design exhausts the entire α. For the scenarios considered here, the adaptive cumulative MAMS design dominates the other designs with respect to power. Furthermore among the three stage‐wise MAMS methods displayed in Table 1, the methods that utilize the Bonferroni or Simes adjustments have considerably lower power than the method that utilizes the Dunnett adjustment. The power gains of the cumulative MAMS design over the other designs are more pronounced for heterogeneous treatment effects compared to homogeneous treatment effects. For example, it is seen from Table 1A for two‐stage designs where , that the cumulative MAMS design produces 6% more power than the stage‐wise MAMS design using Dunnett P‐values, 13% more power than the stage‐wise MAMS design using Simes P‐values, 14% more power than the stage‐wise MAMS design using Bonferroni P‐values, and 7% more power than the single‐stage Dunnett design. It is interesting to observe that even in the homogeneous case where the stage‐wise MAMS design using Dunnett P‐values has 2% less power than the cumulative MAMS design. 
This would appear to contradict the results of Section 4 where there is essentially no difference in power between stage‐wise and cumulative MAMS designs when the δ values are all equal. The explanation is that the designs in Section 4, unlike the SOCRATES‐REDUCED designs, do not include early stopping. The presence of early stopping boundaries causes a loss of power for stage‐wise MAMS relative to cumulative MAMS. Table 1B displays similar results for three‐stage designs. Three‐stage designs, however, have the additional advantage of lower average sample sizes due to the possibility of early stopping. This is seen in Table 2.

Table 2

Two‐stage vs three‐stage comparisons for cumulative multiarm multistage (MAMS)

	Power (SE)		Average Sample Size
δ (with σ=0.52)	Two‐Stage	Three‐Stage	Two‐Stage	Three‐Stage
(0.187, 0.187, 0.187)	0.805 (.004)	0.806 (.004)	360	336
(0, 0.187, 0.187)	0.768 (.004)	0.773 (.004)	366	343
(0, 0, 0.187)	0.657 (.005)	0.647 (.005)	370	343
(0, 0, 0)	0.023 (.001)	0.023 (.001)	339	323
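
As an illustration of how the α spent at each stage might be obtained, the following minimal Python sketch evaluates the Lan and DeMets O'Brien‐Fleming type spending function at equally spaced looks. The function names, the one‐sided level α = 0.025 (taken from the FWER quoted in the text) and the use of information fractions j/J are assumptions made for this sketch. It computes only the cumulative and incremental error spent, not the stopping boundaries themselves, which require the numerical integration referred to in the appendices.

```python
from statistics import NormalDist


def obf_spending(t, alpha=0.025):
    """Cumulative one-sided alpha spent at information fraction t (0 < t <= 1).

    Lan-DeMets O'Brien-Fleming type function:
        alpha*(t) = 2 * (1 - Phi(z_{1 - alpha/2} / sqrt(t)))
    """
    if not 0 < t <= 1:
        raise ValueError("information fraction must lie in (0, 1]")
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)
    return 2 * (1 - std_normal.cdf(z / t ** 0.5))


def stagewise_alpha(num_stages, alpha=0.025):
    """Cumulative and incremental alpha for equally spaced stages."""
    cumulative = [obf_spending(j / num_stages, alpha)
                  for j in range(1, num_stages + 1)]
    increments = [cumulative[0]] + [
        later - earlier for earlier, later in zip(cumulative, cumulative[1:])
    ]
    return cumulative, increments


if __name__ == "__main__":
    for stages in (2, 3):
        cum, inc = stagewise_alpha(stages)
        print(stages, [round(a, 5) for a in cum], [round(a, 5) for a in inc])
```

Very little α is spent at the early looks, which is why designs of this type give up little power relative to a single‐stage design while still permitting early stopping.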

We noted at the end of Section 4.2 that the stricter the criterion for dropping doses at the end of stage 1, the greater the gain in power for cumulative MAMS over stage‐wise MAMS designs. It would be interesting to determine whether this result holds also for the SOCRATES‐REDUCED designs. In Table 3 we explore this conjecture for two‐stage designs with three different configurations for (δ₁,δ₂,δ₃). In Table 3A, (δ₁,δ₂,δ₃)=(0.187,0.187,0.187). In Table 3B, (δ₁,δ₂,δ₃)=(0,0.187,0.187). In Table 3C, (δ₁,δ₂,δ₃)=(0,0,0.187). In each table we use three progressively stricter criteria for dropping treatments: δ̂ᵢ₁<0 in row 1, δ̂ᵢ₁<−σ in row 2, and δ̂ᵢ₁<−2σ in row 3.

Table 3

Power gains for adaptive cumulative multiarm multistage (MAMS) over adaptive stage‐wise MAMS

(A) P(CUMUL)−P(STAGE): (δ₁,δ₂,δ₃)=(0.187,0.187,0.187) and σ=0.52
	Multiplicity‐adjusted P‐values for stage‐wise MAMS
Dose‐Dropping Criterion	Bonferroni	Simes	Dunnett
Any δ̂ᵢ₁<0	7.7%	1.8%	2.1%
Any δ̂ᵢ₁<−σ	8.5%	1.9%	2.2%
Any δ̂ᵢ₁<−2σ	7.3%	2.1%	1.5%
(B) P(CUMUL)−P(STAGE): (δ₁,δ₂,δ₃)=(0,0.187,0.187) and σ=0.52
	Multiplicity‐adjusted P‐values for stage‐wise MAMS
Dose‐Dropping Criterion	Bonferroni	Simes	Dunnett
Any δ̂ᵢ₁<0	10.1%	5.5%	3.9%
Any δ̂ᵢ₁<−σ	15.7%	12.7%	9.2%
Any δ̂ᵢ₁<−2σ	15.3%	10.7%	7.9%
(C) P(CUMUL)−P(STAGE): (δ₁,δ₂,δ₃)=(0,0,0.187) and σ=0.52
	Multiplicity‐adjusted P‐values for stage‐wise MAMS
Dose‐Dropping Criterion	Bonferroni	Simes	Dunnett
Any δ̂ᵢ₁<0	13.6%	13.1%	6.1%
Any δ̂ᵢ₁<−σ	21.0%	20.3%	14.3%
Any δ̂ᵢ₁<−2σ	17.7%	16.5%	11.5%

In each table, for each design, a pattern emerges whereby P(CUMUL)−P(STAGE) increases in moving from row 1 to row 2 and then decreases in moving from row 2 to row 3. A similar pattern was observed for the three‐stage designs. We are unable to find an explanation for this behavior. It is noteworthy, however, that the gains in power increase substantially with increasing heterogeneity of the δ values. For example, in Table 3C the value of P(CUMUL)−P(STAGE) can be as high as 21% for Bonferroni, 20.3% for Simes and 14.3% for Dunnett.
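
To make the stage‐wise construction compared in Table 3 more concrete, the sketch below computes Bonferroni‐ and Simes‐adjusted P‐values for the global intersection hypothesis at each stage and combines them with the inverse normal combination function. The function names, the example P‐values and the equal weights are hypothetical and chosen only for illustration. The Dunnett adjustment is omitted because it requires multivariate probabilities that are not available in the Python standard library, and a full analysis would repeat the calculation for every intersection hypothesis in the closed test.

```python
from statistics import NormalDist


def bonferroni_p(pvalues):
    """Bonferroni-adjusted P-value for the intersection of m hypotheses."""
    return min(1.0, len(pvalues) * min(pvalues))


def simes_p(pvalues):
    """Simes-adjusted P-value: min over k of m * p_(k) / k."""
    m = len(pvalues)
    ordered = sorted(pvalues)
    return min(1.0, min(m * p / k for k, p in enumerate(ordered, start=1)))


def inverse_normal_combination(stage_pvalues, weights):
    """Combine stage-wise P-values as sum_j w_j * Phi^{-1}(1 - p_j).

    The weights must be prespecified and satisfy sum_j w_j**2 == 1.
    """
    if abs(sum(w * w for w in weights) - 1.0) > 1e-9:
        raise ValueError("squared weights must sum to 1")
    std_normal = NormalDist()
    eps = 1e-12  # keep P-values strictly inside (0, 1) for inv_cdf
    return sum(
        w * std_normal.inv_cdf(1 - min(max(p, eps), 1 - eps))
        for w, p in zip(weights, stage_pvalues)
    )


if __name__ == "__main__":
    # Hypothetical one-sided P-values for three doses at two stages.
    stage1 = [0.020, 0.025, 0.030]
    stage2 = [0.020, 0.025, 0.030]
    weights = [0.5 ** 0.5, 0.5 ** 0.5]
    for adjust in (bonferroni_p, simes_p):
        z = inverse_normal_combination([adjust(stage1), adjust(stage2)], weights)
        print(adjust.__name__, round(z, 3))
```

The resulting statistic would then be compared with the usual two‐arm group sequential boundary for the corresponding stage. Because the Simes P‐value is never larger than the Bonferroni P‐value, the Simes version of the combined statistic is never smaller.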

DISCUSSION

The usual practice in clinical drug development has been to first run a phase 2 trial with multiple doses, and then run a separate two‐arm phase 3 trial in which the best dose from phase 2 is compared to a control arm. Adaptive designs combine phase 2 and phase 3 into a single integrated trial and thereby utilize fewer patient resources and shorten the time required to identify and market efficacious medical products. To be acceptable for regulatory submissions such designs must have strong control of FWER. Both the stage‐wise MAMS and the cumulative MAMS designs have this property. In stage‐wise MAMS designs, FWER control is achieved by constructing the test statistic as a weighted combination of inverse normal multiplicity‐adjusted P‐values from the incremental data at each stage, and monitoring this statistic with respect to the classical two‐arm group sequential boundaries. Since the weights are prespecified, this test statistic has the canonical distribution of the usual two‐sample Wald or score statistic under the global null hypothesis, even if the sample size is reestimated in the course of the trial. Additionally, closed testing is implemented to identify the active treatment arms. In cumulative MAMS designs, strong FWER control is achieved by constructing a separate cumulative Wald or score statistic for each pairwise comparison and monitoring it with respect to group sequential boundaries that are adjusted for testing multiple treatment arms. Although these boundaries provide strong control of the FWER in the presence of arbitrary or unplanned treatment selection, they can be sharpened through step‐down closed testing and preservation of conditional error rates as described in Section 2 and Appendix A2. The sharpened boundaries provide additional flexibility to alter the sample size. Thus the stage‐wise and cumulative MAMS designs provide the same degree of flexibility to make adaptive changes to an ongoing design. 
There is, however, a fundamental difference in the handling of multiplicity by the two methods. In stage‐wise MAMS the multiplicity is incorporated into the adjusted P‐values whereas in cumulative MAMS it is incorporated into the group sequential boundaries. We have compared the stage‐wise MAMS and cumulative MAMS approaches in a systematic manner under different configurations of the treatment effects and decision rules for dropping arms. Our first investigation, in Section 4.1, was for two treatment arms vs a common control arm with no treatment selection and no early stopping. In this simple setting it was possible to compare the two designs analytically and thus determine with great accuracy that only in the homogeneous case where δ₁=δ₂ does the stage‐wise MAMS design have greater power than the cumulative MAMS design. Moreover the power differential for this configuration of δ is at most 0.2%. For all other configurations the cumulative MAMS design has greater power, with the power differential increasing as the δ values separate, and reaching 5% when the δ values are farthest apart. Next, in Section 4.2, we investigated the case of three treatment arms vs a common control arm, with treatment selection at the end of stage one but no early stopping. This investigation was by simulation and demonstrated greater power gains, up to 11%, for cumulative MAMS designs over stage‐wise MAMS designs. As before, the power gains increased with greater heterogeneity among the δ values. Finally, in Section 5 we simulated two‐ and three‐stage designs with dose selection as well as sample size reestimation for the SOCRATES‐REDUCED clinical trial. Here too the cumulative MAMS designs had greater power than the stage‐wise MAMS designs, with power gains that increased substantially with greater heterogeneity among the δ values. 
For example, for (δ₁,δ₂,δ₃)=(0,0,0.187) one could obtain a 14.3% power gain for cumulative MAMS over stage‐wise MAMS with Dunnett‐adjusted P‐values, a 20.3% power gain over stage‐wise MAMS with Simes‐adjusted P‐values and a 21% power gain over stage‐wise MAMS with Bonferroni‐adjusted P‐values. While the large power gains for cumulative MAMS designs over stage‐wise MAMS designs shown here have not been shown previously, they are consistent with results published in Koenig et al,12 Friede and Stallard18 and Magirr et al.7 Koenig et al12 and Friede and Stallard18 showed a benefit for the adaptive Dunnett test over the P‐value combination test for two‐stage designs with treatment selection but no early stopping or sample size reestimation. Magirr et al7 investigated two‐ and three‐stage designs with treatment selection, early stopping and sample size reestimation, and showed a benefit for the “CE‐SB” and “CE‐AP” designs that utilize cumulative statistics and recompute multiplicity adjusted stopping boundaries through use of conditional error rates to control the FWER, over the “PC‐SB” designs that control the FWER through inverse normal combination of adjusted P‐values. Even small gains in power can translate into huge sample size savings for cumulative MAMS designs over stage‐wise MAMS designs. For example, it is seen from Table 1B that, for a sample size of 388, if (δ₁,δ₂,δ₃)=(0,0,0.187), the cumulative MAMS design has 64.7% power while the stage‐wise MAMS design has 59.2% power. In order for the stage‐wise MAMS design to also have 64.7% power, 448 subjects would be needed. Furthermore, as can be seen from Table 2, the average sample size of the cumulative MAMS design in this three‐stage early‐stopping setting is 343 subjects. We have determined in a separate simulation that the corresponding average sample size of the stage‐wise MAMS design is 424 subjects. 
It was conjectured by a reviewer that the power advantage of the cumulative MAMS design over the stage‐wise MAMS design in Section 5 might be due to the specific sample‐size increase rule utilized in our simulations. This rule, which might be termed “proportional upscaling,” requires that the initially specified total sample size not be reduced when arms are dropped at an interim analysis. Instead the sample size that would have been assigned to the dropped arms is reallocated to continuing arms, in proportion to the original allocation ratios. To check the validity of this conjecture we resimulated the designs in Table 1A without proportional upscaling. In Table 4 we display power and sample size comparisons for the two‐stage SOCRATES design in which the unallocated sample sizes of the dropped arm are not reassigned to the arms that continue. As can be seen, these results are qualitatively similar to those of Table 1A. Thus the power advantage of the cumulative MAMS design appears to hold with or without proportional upscaling.
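
The proportional upscaling rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulation code used for the tables: the function name, the arm labels and the per‐arm sample sizes are hypothetical, the control arm is assumed never to be dropped, and rounding to whole subjects is left out.

```python
def reallocate_dropped_arms(remaining_planned, dropped, control="placebo"):
    """Redistribute the unused sample size of dropped arms.

    remaining_planned maps each arm to the sample size still to be enrolled
    under the original plan. The freed sample size is shared among the
    continuing arms in proportion to their original allocation, so the
    total planned sample size is unchanged.
    """
    if control in dropped:
        raise ValueError("the control arm is assumed never to be dropped")
    continuing = {arm: n for arm, n in remaining_planned.items()
                  if arm not in dropped}
    freed = sum(n for arm, n in remaining_planned.items() if arm in dropped)
    base = sum(continuing.values())
    if base <= 0:
        raise ValueError("no continuing arm has remaining sample size")
    return {arm: n + freed * n / base for arm, n in continuing.items()}


if __name__ == "__main__":
    # Hypothetical stage 2 plan with equal allocation to four arms.
    plan = {"placebo": 50, "dose1": 50, "dose2": 50, "dose3": 50}
    print(reallocate_dropped_arms(plan, dropped={"dose1"}))
```

Without proportional upscaling, the function would simply return the continuing arms with their originally planned sizes, and the total sample size would shrink by the amount freed.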

Table 4

Power comparisons without proportional upscaling (10 000 simulated trials)

	Power (SE)
δ (with σ=0.52)	Single‐Stage Dunnett	Adaptive Stage‐Wise MAMS, Bonferroni	Adaptive Stage‐Wise MAMS, Simes	Adaptive Stage‐Wise MAMS, Dunnett	Adaptive Cumulative MAMS
(0.187, 0.187, 0.187)	0.804 (.004)	0.714 (.005)	0.775 (.004)	0.771 (.004)	0.789 (.004)
(0, 0.187, 0.187)	0.731 (.004)	0.584 (.005)	0.629 (.005)	0.656 (.005)	0.692 (.005)
(0, 0, 0.187)	0.591 (.005)	0.380 (.005)	0.398 (.005)	0.453 (.005)	0.502 (.005)
(0, 0, 0)	0.025 (.002)	0.012 (.001)	0.015 (.001)	0.017 (.001)	0.024 (.002)
Drop any treatment i at stage 1 if the corresponding δ̂ᵢ₁<0
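
The standard errors reported with the simulated powers follow from the binomial variance of a proportion estimated from 10 000 trials. The short sketch below reproduces them and also gives the standard error of a difference between two simulated powers, which indicates how small a power difference such a simulation can resolve. The function names are hypothetical, and treating the two simulated powers as independent is an assumption of this sketch.

```python
from math import sqrt


def monte_carlo_se(power, num_trials=10_000):
    """Standard error of a simulated power: sqrt(p * (1 - p) / N)."""
    if not 0 <= power <= 1 or num_trials <= 0:
        raise ValueError("power must lie in [0, 1] and num_trials must be positive")
    return sqrt(power * (1 - power) / num_trials)


def se_of_difference(power_a, power_b, num_trials=10_000):
    """Standard error of power_a - power_b, assuming independent simulations."""
    return sqrt(monte_carlo_se(power_a, num_trials) ** 2
                + monte_carlo_se(power_b, num_trials) ** 2)


if __name__ == "__main__":
    print(round(monte_carlo_se(0.805), 3))           # two-stage power, Table 2
    print(round(monte_carlo_se(0.024), 3))           # FWER entry, Table 4
    print(round(se_of_difference(0.789, 0.771), 4))  # homogeneous row, Table 4
```

With a standard error of roughly 0.006 for a difference in powers, a true difference of the order of 0.2%, as found analytically in Section 4.1, would not be distinguishable from simulation noise.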

The conclusions we draw from the results presented in this paper are as follows. Cumulative MAMS designs appear to be more powerful than stage‐wise MAMS designs except in the homogeneous case where all the δ values are the same. For the special case of two active treatments, with no treatment selection or sample size increase, analytical comparisons were possible. They revealed that when δ₁=δ₂ there is a small advantage for the stage‐wise MAMS design over the cumulative MAMS design, but it disappears as the two δs begin to diverge. It is thus entirely plausible that the same effect is present in the more complex setting of multiple doses, multiple looks and sample size reestimation considered in Sections 4.2 and 5. If present, however, the effect is too small to be detected in an experiment involving 10 000 simulated trials. The magnitude of the power gain of cumulative MAMS designs over stage‐wise MAMS designs can be substantial and increases with increasing heterogeneity of the δ values. Our results are based on a reasonably exhaustive exploration of the parameter space for three active treatment arms under specific decision rules for treatment selection, sample size reestimation and early stopping. We cannot claim that they hold for all possible adaptive designs. Nevertheless the designs that we have considered here are ones that are likely to be adopted in practice. For other designs it is recommended to explore the operating characteristics of the two approaches by simulation using the tools we have discussed here. We tried to ascertain why the cumulative MAMS approach was more powerful than the stage‐wise MAMS approach. We have three conjectures. 
First, for the special case of two active doses with no early stopping or dropping of doses we were able to obtain explicit power functions for the two methods in Section 4.1 and thereby demonstrate that the cumulative MAMS test, unlike the stage‐wise MAMS test, is based on sufficient statistics. Second, when there is no sample size reestimation the multiplicity‐adjusted cumulative MAMS boundaries are consonant. That is, although these boundaries have been constructed under the global null hypothesis H₀, any elementary hypothesis for which w ≥ b can be rejected without loss of FWER control. In contrast, in order to reject an elementary hypothesis in the stage‐wise MAMS approach, one must always go through the entire closed testing procedure. Third, if treatments are dropped at an interim look in the cumulative MAMS design it is possible to gain efficiency through boundary recomputation in conjunction with closed testing. Specifically, in the two‐stage cumulative MAMS design, the final critical value is adjusted from b to a recomputed value by imposing the Müller and Schäfer condition11 through Equation (2). Although not shown here, we have verified that the recomputed value does not exceed b, so that this adjustment confers an advantage on the group sequential approach that is not available to the P‐value combination approach. We have not been able to explain why P(CUMUL)−P(STAGE) increases with increasing heterogeneity of the δ values. We are also unable to explain why P(CUMUL)−P(STAGE) first increases with increasing conservatism of the rule for dropping arms and then decreases. This phenomenon is manifest in every column of Table 3. We believe that this behavior is worth further investigation. Throughout this paper we have utilized score statistics for monitoring the data and performing the hypothesis tests. We assumed in Section 2 that the scores are normally distributed with independent increments. 
These distributional properties hold exactly for normal data with known variance and asymptotically for all other settings in which the variance is estimated by maximum likelihood methods.14 We showed in Section 5, Equations (8) and (9), how one might use the t‐distribution to transform the cumulative MAMS boundaries and thereby obtain type‐1 error control for the case of normal data with unknown variance. We did not examine the accuracy of the asymptotic distributions when the underlying data are binomial or have time‐to‐event endpoints. In this regard the stage‐wise MAMS approach, though not as powerful as the cumulative MAMS approach, might be more robust since one can combine P‐values that are adjusted for multiplicity by nonparametric methods like the Bonferroni and Simes methods rather than resort to normal approximations. On the other hand, if convergence of the score statistics to asymptotic normality with independent increments were in doubt, one could set the nominal type‐1 error of the cumulative MAMS design to be smaller than the desired α, say α/2, so as to ensure that the actual type‐1 error would be controlled at level α. The huge power advantage that the cumulative MAMS design enjoys over stage‐wise MAMS designs that utilize multiplicity adjusted nonparametric P‐values, as evidenced by Table 3 of Section 5, would probably not be offset even by extreme conservatism in the choice of the nominal α. This reasoning would not, however, be applicable if we were interested in testing multiple endpoints rather than testing multiple treatment arms. The multiarm problem is amenable to cumulative MAMS designs because the interarm correlation structure can be determined exactly from the treatment to control allocation ratio. The correlations between multiple endpoints must be estimated from the data and hence are subject to sampling error. 
Thus for multiple endpoint problems the stage‐wise MAMS methods that utilize the nonparametric Simes or Bonferroni adjustments to control the multiplicity might have an advantage over the cumulative MAMS methods that rely on large‐sample approximations. This is a topic for further investigation. Another topic for further investigation is parameter estimation at the end of the trial. Bias reduction methods were investigated by Posch et al5 for stage‐wise MAMS designs with dose selection but no sample size adaptation. For two‐arm group sequential designs with adaptive sample size reestimation, methods have been developed by Gao et al,19 Brannath et al,20 and Mehta et al.21 There has been some recent work on unbiased point estimates in phase 2‐3 trials by Bowden and Glimm,22 Robertson et al,23 and Stallard and Kimani.24 Magirr et al25 have proposed simultaneous confidence intervals that are compatible with closed testing in adaptive designs. Further study is needed to understand how these methods may be incorporated into the general framework presented here.
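
The remark that the interarm correlation is fixed by the allocation ratio can be illustrated with a small calculation. For normally distributed responses with a common variance, the statistics comparing arms i and k with a shared control have correlation sqrt{[nᵢ/(nᵢ+n₀)]·[nₖ/(nₖ+n₀)]}, which does not depend on the data. The sketch below evaluates this standard result; the function name and the sample sizes are hypothetical, and the common‐variance assumption is made for this illustration only.

```python
from math import sqrt


def interarm_correlation(n_i, n_k, n_control):
    """Correlation between two treatment-vs-control test statistics.

    Assumes normal responses with a common variance, so the shared control
    mean is the only source of dependence between the two comparisons.
    """
    if min(n_i, n_k, n_control) <= 0:
        raise ValueError("all sample sizes must be positive")
    return sqrt((n_i / (n_i + n_control)) * (n_k / (n_k + n_control)))


if __name__ == "__main__":
    print(interarm_correlation(100, 100, 100))  # equal allocation
    print(interarm_correlation(100, 100, 200))  # twice as many controls
```

Under equal allocation the correlation is 0.5, and it falls as relatively more subjects are assigned to the control arm. No analogous closed form exists for the correlation between multiple endpoints, which is the point made in the text.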

19 in total

1. Adaptive group sequential designs for clinical trials: combining the advantages of adaptive and of classical group sequential approaches.

Authors: H H Müller; H Schäfer
Journal: Biometrics Date: 2001-09 Impact factor: 2.571

2. Adaptive sample size calculations in group sequential trials.

Authors: W Lehmacher; G Wassmer
Journal: Biometrics Date: 1999-12 Impact factor: 2.571

3. Testing and estimation in flexible group sequential designs with adaptive treatment selection.

Authors: Martin Posch; Franz Koenig; Michael Branson; Werner Brannath; Cornelia Dunger-Baldauf; Peter Bauer
Journal: Stat Med Date: 2005-12-30 Impact factor: 2.373

4. Exact confidence bounds following adaptive group sequential tests.

Authors: Werner Brannath; Cyrus R Mehta; Martin Posch
Journal: Biometrics Date: 2009-06 Impact factor: 2.571

5. Multiple testing to establish superiority/equivalence of a new treatment compared with kappa standard treatments.

Authors: C W Dunnett; A C Tamhane
Journal: Stat Med Date: 1997-11-15 Impact factor: 2.373

6. Flexible sequential designs for multi-arm clinical trials.

Authors: D Magirr; N Stallard; T Jaki
Journal: Stat Med Date: 2014-05-13 Impact factor: 2.373

7. Unbiased estimation in seamless phase II/III trials with unequal treatment effect variances and hypothesis-driven selection rules.

Authors: David S Robertson; A Toby Prevost; Jack Bowden
Journal: Stat Med Date: 2016-04-21 Impact factor: 2.373