Alessandra Serra1, Pavel Mozgunov1, Thomas Jaki1,2. 1. MRC Biostatistics Unit, University of Cambridge, Cambridge, UK. 2. Department of Mathematics and Statistics, Lancaster University, Lancaster, UK.
Abstract
One family of designs that can noticeably improve efficiency in later stages of drug development are multi-arm multi-stage (MAMS) designs. They allow several arms to be studied concurrently and gain efficiency by dropping poorly performing treatment arms during the trial as well as by allowing to stop early for benefit. Conventional MAMS designs were developed for the setting, in which treatment arms are independent and hence can be inefficient when an order in the effects of the arms can be assumed (eg, when considering different treatment durations or different doses). In this work, we extend the MAMS framework to incorporate the order of treatment effects when no parametric dose-response or duration-response model is assumed. The design can identify all promising treatments with high probability. We show that the design provides strong control of the family-wise error rate and illustrate the design in a study of symptomatic asthma. Via simulations we show that the inclusion of the ordering information leads to better decision-making compared to a fixed sample and a MAMS design. Specifically, in the considered settings, reductions in sample size of around 15% were achieved in comparison to a conventional MAMS design.
One family of designs that can noticeably improve efficiency in later stages of drug development are multi-arm multi-stage (MAMS) designs. They allow several arms to be studied concurrently and gain efficiency by dropping poorly performing treatment arms during the trial as well as by allowing to stop early for benefit. Conventional MAMS designs were developed for the setting, in which treatment arms are independent and hence can be inefficient when an order in the effects of the arms can be assumed (eg, when considering different treatment durations or different doses). In this work, we extend the MAMS framework to incorporate the order of treatment effects when no parametric dose-response or duration-response model is assumed. The design can identify all promising treatments with high probability. We show that the design provides strong control of the family-wise error rate and illustrate the design in a study of symptomatic asthma. Via simulations we show that the inclusion of the ordering information leads to better decision-making compared to a fixed sample and a MAMS design. Specifically, in the considered settings, reductions in sample size of around 15% were achieved in comparison to a conventional MAMS design.
Drug development is costly and time consuming.
One family of clinical trial designs that can improve the development process are multi-arm multi-stage designs (MAMS).
In aMAMS trial, insufficiently promising treatments can be dropped or the trial can be stopped due to overwhelming benefit at a series of interim analyses.To date these designs have focused on the setting of independent treatment arms and have been argued to be a highly efficient approach to clinical trials.
They could, however, be suboptimal if an “order” (ie, a monotonic relationship) among the treatment effects can be assumed. Such an order can occur naturally, for example, when multiple doses or administration schedules of the same treatment are tested or when nested combinations of treatments are investigated. Another area where an order can often be assumed is when considering different treatment durations. In infectious diseases such as Tuberculosis (TB) and Hepatitis B (HBV), the treatment duration with current standard regimes is lengthy
which results in a large burden on the patients, potentially high costs, increased risk of non-compliance and side effects.
In TB and HBV, for example, treatment periods of 6 and 12 months are typical.
Novel treatments or combinations of treatments in these areas offer the opportunity for both higher efficacy and shorter treatment periods.
In the setting of multiple treatment durations Quartagno et al
have proposed to model the duration-response curve.While this is an efficient way to understand the duration-effect relationship, it is less clear how to definitively conclude whether a duration is “better” than the current standard.In this work, we extend the MAMS framework and propose a design that incorporates the order of treatment effects in the decision-making when no parametric dose-response or duration-response model is assumed. The objective of the design is to identify all promising arms (eg, treatment durations, doses, or combination of treatments), including the one associated with the smallest relevant treatment effect.The rest of the manuscript continues as follows. A case study is introduced in Section 2 before a detailed description of the 3-arm and 2-stage design is provided in Section 3. Section 4 then generalizes the proposed design to an arbitrary number of arms and stages and provides some theoretical results. Section 5 revisits the case study before the design is evaluated via simulations in Section 6. In Section 7, the effect of various critical bounds on the operating characteristics of the proposed design is explored.We conclude with a discussion.
Case Study Setting
The Tiotropium add-on therapy in adolescents with moderate asthma: A 1-year randomized controlled trial (NCT01257230)
is a Phase III study that assessed the efficacy and safety of once-daily tiotropium via Respimat added to inhaled corticosteroid (ICS) with or without a leukotriene receptor antagonist in adolescent patients with moderate symptomatic asthma. Patients were randomized with equal probability to receive 5 μg (2 puffs of 2.5 μg) or 2.5 μg (2 puffs of 1.25 μg) of once-daily tiotropium or placebo (2 puffs). The primary outcome was change from baseline in peak FEV1 within 3 h after dosing (peak FEV1[0−3) measured after 24 weeks of treatment. The null hypotheses were tested in a stepwise manner to control the type I error starting from the highest dose suggesting that a monotonic dose-response relationship can be assumed.
A 3-Arm 2-Stage Order Restricted Design
In this section, we develop an order restricted design (ORD) for the setting of the case study. We denote the highest dose (5μg) by T
1 and the lower dose (2.5μg) by T
2. The generalization to an arbitrary number of arms and stages is given in Section 4.Assume that a patien’s response follows a normal distribution with known common variance, σ
2. An alternative approach is outlined in Section 8 for the case of unknown variance. Let
be the observation of the ith patient on treatment k (the control arm is denoted by 0) and
be the number of patients on arm k up to stage j. Let
be the true treatment effect of active arm k ∈ {1,2} compared to the control. We denote the vector of treatment effects by
Consider the following order relationship:
implying that the treatment effect of the second treatment is at most as large as the treatment effect for the first treatment. Let
be the ratio between the number of subjects allocated to treatment k ∈ {0,1,2} and control at each stage j with
Let
be the test statistic
at stage j for comparing arm k ∈ {1,2} to control, where
and
with k ∈ {0,1,2} and n is the sample size in the control group at the first stage. The vector of test statistics follows a multivariate normal distribution
with
and the covariances between Z-statistics are
with
with
with
withWe test the null hypotheses:
with the global null hypothesis denoted by
Let
and
be the critical values at stage j for T
1 and T
2, respectively, used to test the hypotheses, withThe proposed design then takes into account the order among the treatment effects when making the decisions at the first stage and the final analysis and a set of decision rules consistent with this order is given in Table 1. For example, if both Z-statistics cross the upper bounds at the interim analysis, the trial is stopped for efficacy (as in a conventional MAMS design). In contrast to the traditional MAMS design, the trial continues if there is contradicting evidence with respect to the order, for example, if
crosses the upper bound, but there is not enough evidence to claim superiority of T
1 to control, then both arms are continued to the next stage.
Table 1
Combination of the decision rules in the 3-arm 2-stage trial with θ(1) ≥ θ(2)
Zj(1)≥uj(1)
lj(1)<Zj(1)<uj(1)
Zj(1)≤lj(1)
Zj(2)≥uj(2)
Stop: select T1,T2
Proceed with T1, T2
Proceed with T1, T2
lj(2)<Zj(2)<uj(2)
Proceed with T2
Proceed with T1, T2
Drop both arms
Zj(2)≤lj(2)
Stop: select T1
Proceed with T1
Drop both arms
Note: Cells colored in red correspond to contradicting evidence.
The idea behind these decision rules is that at any stage the effectiveness of T
2 can be claimed only if T
1 can be declared superior to the control. Therefore, T
1 can be regarded as a gatekeeper.
Following this procedure, depending on the context, alternative decisions could be considered for the cells colored in red in Table 1 (see Section 2 of the Supporting Information for more discussion).
Family-wise error rate
For confirmatory clinical trials, control of the family wise error rate (FWER) in the strong sense at level α, that is the probability to reject at least one true null hypothesis, is often required.16 Using the rules described in Table 1, the FWER for the 3-arm 2-stage ORD can be written asEquation (1) shows that the events used for the computation of the type I error under the global null hypothesis ({Reject H
01 and H
02}, {Reject H
01 not H are a subset of the events ({Reject H ∪ H
02}) used in the MAMS design of Magirr et al.
Thus, the probability of rejecting at least one hypothesis under the global null will be smaller for the ORD compared to the MAMS design, while the probability of rejecting neither hypothesis will be smaller for MAMS if the same bounds are used. It is worth noting that overall, the critical bounds, if these are the same for all active treatments
for the 3-arm 2-stage ORD are smaller in each stage compared to the MAMS design of Magirr et al
(see Section 7 of the Supporting Information). Consequently the ORD design is strictly more powerful than the MAMS design under these assumptions.The critical bounds for the given treatment arm can be defined as function of a (possibly arm-specific) parameter, that is,
which can be searched over a grid of values for a in order to strongly control the FWER at level α. If a
(1) = a
(2) = a, then a unique solution can be found restricting the search over α such that the expression given in Equation (1) is below α under the global null hypothesis. In this case, the solution is unique either when the search is based on different boundary shapes
or when the same boundary shapes are used for all experimental arms—
for all k∈ {1,2}. If a
( are not the same for each arm, additional constraints are required for the uniqueness of the solution and to maintain the strong control of the FWER at level α, such as the control under the partial null hypotheses. In the 3-arm setting, for example, this is (θ
(1), 0). However, different values of θ
(1) can provide different boundaries, and so the solution is unique for the specific value of θ
(1) (see Section 7 for more details).Theorem 1 below then shows that, if the same bounds,
are used for each arm, and the same allocation ratios (with respect to the control) are used for all active treatments, the FWER is maximized under the global null hypothesis and hence the above ensures strong control of the FWER.Theorem 1.
Consider a 3-arm 2-stage ORD design and denote the global null hypothesis byLet
0
be a vector where at least one treatment effect is less or equal to 0. Then,The proofs of all theorems are given in Section 1 of the Supporting Information.
Power requirement
To power the study, we consider the configuration
where
and δ
0 is the minimum clinically relevant difference. The ORD is then powered at (1 - β) to reject both hypotheses under
when Equation (2) is satisfied:Theoretical considerations and numerical evaluations have shown that, if Pocock boundaries
for both treatments are used,
and critical values found such that Equation (1) is below α, the power of the ORD design is, practically, no smaller than a fixed balanced sample design with the same sample size. Furthermore, for a number of treatment effects, it was found to be strictly positive. Full details of these considerations are given in Section 3 of the Supporting Information.
Generalization Of The Ord To K-Arm J-Stage
Consider a clinical trial with K − 1 active treatment arms,
against a control treatment and J stages and denote the treatment effect comparing treatment k against control by θ We denote the vector of treatment effects by
The null hypotheses of interest are
Let
denote the test statistic based on all data up to stage j for comparison
as before and assume that the following order relationship holds:
Let
be the critical values at stage j withThe decision rules at the interim analyses follow the same principle as for the 3-arm 2-stage design defined above. The decisions are made in order to be able to select all promising treatment arms at the end of the trial and H
0 can only be rejected if all
have been rejected. Once H
0 has been rejected, the recruitment to arms
is stopped, where ℒ
j is the lowest index of a treatment arms remaining in the trial at stage j. If there is contradicting evidence with respect to the order at stage j, that is when
and there is at least one
such that
then recruitment to these arms continue. As for the 3-arm 2-stage design, if there is sufficient evidence to drop arm k, that is when
and if there is any contradicting evidence for
then the recruitment to arms
is stopped, where
is the highest index on the treatment arms remaining in the trial at stage j. A general algorithm for the decision-making in this setting is given in Algorithm 1.Let ℒ
and
be the lowest and highest indices on the active treatment arms remaining in the trial at stagej, respectively. At stage j, compute
forStop recruitment to arm k at stage j for efficacy if for allStop recruitment to arm k at stage j for futility ifand for all
orand for at least one
and for noStop the trial when the recruitment to all arms is stopped.Let M be a random variable representing the number of arms (including the control) at stage j when H
01 failed to be rejected at stage j − 1. Because of the hierarchy in testing the hypotheses, the FWER for an K-arm J-stage design can be written as
where P (M) is the probability that at the previous stage H failed to be rejected and the number of arms were at least m. One can show that the following iterative equality holds
with at the first stage
and 0 otherwise. The set
defines the event that H
01 failed to be rejected at stage j − 1 when the number of treatment arms (including the control arm) in the trial were c. This set is formally defined in Table 2. In the definition of
the superscript (c − 1) indicates the number of active treatment arms that are still in the trial at stage j − 1, while the subscript c − m + 1 refers to the number of active treatment arms (that is equal to c − m) that have been dropped before reaching the stage j.
Table 2
Sets of events for K-arm J-stage design with
Set
Definition
Cj(k)
{lj−1(k)<Zj−1(k)<uj−1(k)}
Sj(k)
{Zj−1(k)≤lj−1(k)}
Ej(k)
{((Cj(1)∪Sj(1))∩(Cj(k)∪Sj(k)))∪(Ej(k−1)∩Cj(k))}
Ej(0)
Ω
Aj,1(k),k>0
Ej(k)
Aj,2(k),k>1
Ej(k−1)∩Sj(k)
Aj,k+1−s(k),k>2,s=1:k−2
Ej(s)∩Sj(s+1)∩t=s+2k(Cj(t)∪Sj(t))
Note: Ω is the whole sample space.
While the expression for the FWER in the general case is cumbersome, for a fixed number of stages (arms), the FWER for K-arm (J-stage) can be found iteratively—an example for 2-stage trials is given in Section 4 of the Supporting Information.The critical bounds for a K-arm J-stage ORD can be, again, defined as functions of parameters a
( such that
Thus, as for the 3-arm 2-stage setting, under the constraint of
a unique solution can be found to control the FWER at level α in the strong sense. Theorem 2 below then shows that, if the same bounds,
are used for each arm and the same allocation ratios (with respect to the control) are used for all active treatments, the FWER is maximized under the global null hypothesis
and hence the above ensures strong control of the FWER.Theorem 2.
Consider a K-arm J-stage ORD design and denote the global null hypothesis byThen,
In the next session, a simulation study will be described in order to apply the proposed design in the context of the asthma trial.
Case Study
Setting
We revisit the results of the clinical trial of Tiotropium add-on therapy in adolescents with moderate asthma: A 1-year randomized controlled trial (NCT01257230).
Patients were randomized in a 1:1:1 ratio to receive 5μg or 2.5μg of once-daily tiotropium or placebo. The null hypotheses were tested in a stepwise manner to control the type I error at level α = 0.025. The study was powered at 80% to detect a difference of 120 mL between treatments in the change from baseline of peak FEV1[0-3 assuming a common SD of 340 mL. It was found that 127 patients per group were needed, resulting in a maximum sample size of 381 patients. The trial is revisited using the ORD, which can be applied assuming a monotonic dose-response relationship.In line with the original trial we assume that the change from baseline of peak FEV1[0-3 is normally distributed with SD σ = 340, k ∈ {0,1, 2}, and common baseline mean FEV1 of μ
(0)= 2747. As in the original study, we consider an improvement of FEV1 of 120 of interest and hence consider the following values for
and
= (120 120). The design is powered at 80% to reject all hypotheses or at least one hypothesis (in order to compare the sample sizes between the ORD and the original trial design) when all doses have the same effect compared to the placebo. Additionally to the achieved power, the efficiency of the proposed design is measured by its expected sample size (ESS), that is the mean number of patients recruited to the trial before it is terminated.We consider one- and two-stage ORD designs and note that the one-stage ORD design corresponds to the hierarchical testing strategy used in the original design. For the two-stage design the interim analysis takes place after half of the total sample size has been observed and triangular critical bounds
are used. The numerical results found using R19 and 106 replicate simulations.
Numerical results
Consider the 3-arm 1-stage ORD using the maximum total sample size of 381 patients that corresponds to the same maximum total sample size originally planned for the study. In this setting, the critical bound at the final analysis is
Table 3 describes the results of the simulation.
Table 3
Results of the simulations that revisit the NCT01257230 trial using the ORD
Design powered to reject all hypotheses
θ(1)
θ(2)
Max. SS
Stages
Reject all
Reject H01 not H02
Reject at least one H0k
ESS
0
0
474
1
0.005
0.020
0.025
474.00
534
2
0.004
0.021
0.025
316.39
120
0
474
1
0.025
0.856
0.881
474.00
534
2
0.025
0.854
0.879
371.83
120
120
474
1
0.803
0.078
0.881
474.00
534
2
0.802
0.081
0.883
399.81
Design powered to reject at least one hypothesis
θ(1)
θ(2)
Max. SS
Stages
Reject all
Reject H01 not H02
Reject at least one H0k
ESS
0
0
381
1
0.005
0.021
0.025
381.00
426
2
0.004
0.021
0.025
252.43
120
0
381
1
0.025
0.779
0.803
381.00
426
2
0.024
0.774
0.798
304.67
120
120
381
1
0.691
0.112
0.803
381.00
426
2
0.684
0.117
0.802
331.89
Note: For the two-stage design, the triangular bounds are used. Proportions refer to 106 replications and values of interest are in bold. Abbreviations: ESS, expected sample size; Max. SS, maximum sample size.
It can be seen that the FWER is controlled at level α = 0.025 under all considered null scenarios. For the scenarios where there is at least one dose that is superior to control, the probability to reject at least one hypothesis is 80% as required in the original study. Note that 80% is the probability of rejecting at least one dose considering any rejections and not only correct rejections. Therefore, when no interim analyses are planned, the ORD requires the same maximum sample size as in the original study if it is powered to reject at least one hypothesis. It is worth noting that the probability of rejecting all hypotheses, if all of them are true, is around 69%.The distinguishing feature of the proposed ORD is that it allows to include interim analyses during the trial. Table 3 shows the operating characteristics of the design when an interim analysis takes place after observing half of the maximum sample sizes. If the study is powered to reject at least one hypothesis at 80%, a maximum sample size of 426 patients is needed—45 more patients than for the 1-stage design. The gain from using a two-stage design arises in terms of the expected sample size—on average the number of patients is expected to be below 332 under each scenario. At the same time, under this power configuration, the probability to identify the smallest promising dose is at 68% similar to the single-stage design.Furthermore, it is argued by the construction of the ORD proposed in Section 3 that if it is of interest to identify the lowest dose with the promising treatment effect, the trial should be powered to reject all hypotheses. In order to identify the lowest effective dose with the desirable 80% probability, a single-stage trial would require 474 patients and the two-stage design 534 patients. As before, the inclusion of an interim analysis and the use of triangular bounds lead to the reduction of the expected number of patients. Specifically, on average the number of patients is expected to be below 400 under each scenario.Overall, the ORD was found to reproduce the sample size calculation of the original study when no interim analyses were planned and is powered to reject at least one hypothesis. At the same time, the ORD allows the inclusion of interim analyses during the trial. When an interim analysis is planned during the trial, the ORD has shown to be an efficient design as it allows to stop the trial earlier or to drop unpromising doses with high probability before the end of the study.The results discussed in this section use triangular bounds. Qualitatively similar results using Pocock
and O’Brien and Fleming
boundaries are provided in Section 6 of the Supporting Information. To further investigate the design characteristics of the proposed ORD, an extensive simulation study is conducted in Section 6.
Numerical Evaluations
As in the motivating example, consider a clinical trial setting with 3 treatments arms, 2 experimental and a control with 1:1:1 allocation ratio. Consider a single-stage design (that is a fixed sample design (FSD) with hierarchical testing, FSD(h)) and a two-stage design with one pre-planned interim analysis at the middle of the trial. The objective of the trial is to find all promising treatment arms. The FWER is to be controlled below α = 0.05 and the power of trial is to be at least 80% to reject both hypotheses when both treatment arms have the same effect compared to the control. The patients’ responses on treatment arm k are assumed to have normal distribution with mean μ
( and SD σ = 1. For the control group μ
(0) = 1, while for the treatment arms
We fix the clinically relevant difference to be 0.5. Therefore, the scenario under which the ORD is powered is
= (0.5, 0.5). We evaluate the performance of the design under various treatment effects configurations when the treatment effect on the first arm is fixed to be 0.5 and the treatment effect on the second arm is varied:
and
For the 3-arm 2-stage ORD, the bounds are found under the conditions a
( = a (ie, the same for each treatment arm) and they are found using a grid search—based on the triangular boundary shape
for the all experimental treatments—over one single parameter. Thus, the solution found is unique.
Competing approaches
The proposed ORD is compared to the FSD,
in which the total sample size is specified at the design stage of the trial and it is not subject to adaptations during the process of the trial. The hypotheses are tested at the end of the trial only and any hypothesis can be rejected independently of the other.The second comparator is the MAMS design by Magirr et al.
However, the conventional MAMS design is not an appropriate comparator as it does stop as soon as at least one hypothesis is rejected. Therefore, the following modification of the MAMS design is considered for the comparison. At the interim analysis, if a Z-statistic corresponding to one arm crosses the upper or lower bound while another does not, the trial will still continue with the treatment that did not cross a bound. The design is referred to as the MAMS(m).The FWER expression for MAMS(m) is the same as derived by Magirr et al3 but the power expression changes. To power a 3-arm 2-stage design, for a given configuration of
and a pre-specified level β, we search for the sample size that satisfiesTwo main simulation studies are conducted to compare the three competing approaches. In the first one, each design is constructed such that it yields 80% power to reject both null hypotheses under the alternative hypothesis. The second one compares the power between approaches based on the common sample size.
Same power requirement for all designs
In this subsection, the results when all design are powered at 80% to reject both null hypotheses are provided. The resulting design specifications and operating characteristics under the global null hypothesis are provided in Table 4.
Table 4
FWER, maximum sample sizes (Max. SS), expected sample sizes (ESS), and critical bounds under
= (0, 0) for the 3-arm 2-stage ORD, 3-arm 1-stage ORD (FSD(h)), FSD, and MAMS(m) designs when all designs are powered at 80% to reject both hypotheses under
= (0.5,0.5)
Design
u1,u2,l1
Max. SS
ESS
Reject at least one H0k
FSD
1.917, -, 1.917
231
231.0
0.05
FSD(h)
1.644, -, 1.644
192
192.0
0.05
ORD
1.898, 1.789, 0.633
222
134.4
0.05
MAMS(m)
2.179, 2.055, 0.726
264
166.6
0.05
Note: 3-arm 2-stage ORD and MAMS(m) use triangular bounds. Results are provided using 106 replications.
Under
= (0,0), all designs control the type I error at level α = 0.05 as expected. The total maximum sample sizes necessary to reach a power of 80% is 231 for the FSD, 192 for the FSD with a hierarchical test (FSD(h)) which is akin to a single-stage ORD design, 222 for the 3-arm 2-stage ORD and 264 for the MAMS(m) designs. The maximum sample size (to achieve the same power to reject both hypotheses) for the FSD is greater compared to FSD(h) and the two-stage ORD as it does not account for the hierarchy in testing.The designs’ performances under the configuration
and
are presented in Figure 1. When the second treatment arm is no different to control, θ
(2) = 0, the probability of rejecting the null hypothesis for this arm is controlled at α = 0.05 for all designs and when
=(0.5,0.5) all the designs satisfy the power requirement at 80%. However, under all other considered non-zero values of θ
(2), the probability of rejecting both hypotheses is higher for the approaches accounting for the hierarchy, FSD(h) and ORD, than for other competing designs. The gain from using a two-stage ORD design compared to FSD(h) arises in terms of expected sample sizes which is strictly lower for the two-stage design. The 3-arm 2-stage ORD has noticeably lower expected sample size (ESS) compared to all other designs ranging from 13% to 35% depending on the simulation scenario. The largest difference in power (an increase of around 5%) between the 3-arm 2-stage ORD and the MAMS(m) design is achieved under θ
(2) = 0.2.
Figure 1
Power and expected sample sizes (ESS) under θ = (0.5, θ
(2)) and θ
(2) ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5} for the FSD, 3-arm 1-stage ORD (FSD(h)), 3-arm 2-stage ORD, and MAMS(m) designs when all designs are powered at 80% to reject both hypotheses under θ = (0.5, 0.5). 3-arm 2-stage ORD and MAMS(m) use triangular bounds. Results are provided using 106 replications
Common sample size for all designs
In this subsection, the three designs are compared with a common maximum sample size of 222 patients, which is the maximum sample size necessary for the 2-stage ORD in order to reject both hypotheses at 80% under
= (0.5, 0.5).The design specifications and operating characteristics of the designs under the global null hypothesis are provided in Table 5, while the designs’ performances under the configuration
and
are presented in Figure 2.
Table 5
FWER, maximum sample sizes (Max. SS), expected sample sizes (ESS), and critical bounds under
= (0,0) for the 3-arm 2-stage ORD, 3-arm 1-stage ORD (FSD(h)), FSD, and MAMS(m) designs when all designs have the same common total sample size equal to 222 patients
Design
u1,u2,l1
Max. SS
ESS
Reject at least one H0k
FSD
1.917, -, 1.917
222
222.0
0.05
FSD(h)
1.644, -, 1.644
222
222.0
0.05
ORD
1.898, 1.789, 0.633
222
134.4
0.05
MAMS(m)
2.179, 2.055, 0.726
222
140.1
0.05
Note: 3-arm 2-stage ORD, and MAMS(m) use triangular bounds. Results are provided using 106 replications.
Figure 2
Power and expected sample sizes (ESS) under θ = (0.5, θ
(2)) and θ
(2) ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5} for the FSD, 3-arm 1-stage ORD (FSD(h)), 3-arm 2-stage ORD, and MAMS(m) designs when all designs have the same common total sample size equal to 222 patients. 3-arm 2-stage ORD and MAMS(m) use triangular bounds. Results are provided using 106 replications
As expected all designs control the FWER under the global null. Under the scenario when the second treatment arm is no different to control, θ
(2) = 0, the probability of rejecting the null hypothesis for this arm is controlled at α = 0.05. Under all other considered non-zero values of θ
(2), the approaches accounting for the hierarchy, FSD(h) and ORD, result in higher power compared to the other competing designs. The gain from using a two-stage ORD design compared to FSD(h) can once more be seen in terms of expected sample sizes which is strictly lower for the two-stage design. The 2-stage ORD has lower ESS compared to the other designs with reductions between 3% and 32%. The largest difference in power—around 9.8%—between the ORD and the MAMS(m) design is achieved under θ
(2) = 0.3.Overall, the ORD results in noticeable gains across all considered scenarios both in terms of power and expected sample size. Therefore, the inclusion of the order restriction into the decision rules for the decision-making can provide advantages in power and/or expected sample size compared to standard approach to multi-arm trials, specifically, the FSD and the MAMS(m).
Different Bounds For Each Treatment Arm
In the results above the same bounds are used for both treatments. However, the proposed design allows for different boundary shapes to be used for different treatments which could lead to potential benefit in terms of power. In this section, we explore the effect of various boundary shapes on the operating characteristics of the designs.We consider the setting as in Section 6.1 and let
be the upper and lower bounds for T
1 at the stage j, and
be the boundaries for T
2 at the stage j, being functions of a
(1) and a
(2), respectively. The critical bounds and the sample size could be searched over a grid of values for a
(1) and a
(2) in order to strongly control the FWER at level α and to satisfy the power requirements in Equation (2). To maintain strong control of the FWER at level α it is necessary to control the type I error when
0 = (θ
(1), 0). Indeed, as shown in Section 1 of the Supporting Information, under
0 = (θ
(1),0) it holds
where
tends toThus, the bounds for the second arm are searched over a grid of values of a
(2) as for a 2-arm design with 2 stages
and then the bounds for T
1 are searched in order to satisfy Equation (1) under the null hypothesis. Finally, the sample size is searched to satisfy the power requirements, assuming equal allocation to all arms.Several combinations of boundary shapes are compared considering all possibilities with constant POC,
O’Brien and Fleming OBF
and triangular TRIAN
bounds. Among all these nine combinations of bounds, a subset of six is selected. For each shape of the bounds for T
1, two combinations are selected. These are those that provide the smallest ESS and the highest power compared to the other ones. The combinations that are excluded are the ones that use constant bounds for T
1 and T
2, the combination with O’Brien and Fleming and constant bounds for T
1 and T
2 respectively and the combination with triangular and constant bounds for T
1 and T
2 respectively. The complete set of results is provided in Section 5 of the Supporting Information.The summary of operating characteristics, probability to reject both and probability to reject only one, and the expected sample size, for the proposed design using the six remaining combinations of boundary shapes is provided in Figure 3.
Figure 3
Probability of rejecting both hypotheses (left) and probability of rejecting the first but not the second hypothesis (right) under θ = (0.5, θ
(2)) and θ
(2) ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5} for the 3-arm 2-stage ORD design when it is powered at 80% to reject both hypotheses under θ = (0.5, 0.5). ORD uses the selected combination of bounds which control the type I error under θ = (∞, 0). Results are provided using 106 replications
The design resulting in the highest power in rejecting both hypotheses among the subset of selected bounds is the one that uses triangular bounds for the treatment associated to the highest effect and O’Brien and Fleming bounds for the arm associated to the lowest treatment effect. Indeed, in the way that O’Brien and Fleming bounds are constructed, if
then the trial tends to stop later and the final decision on T
2 is based on more data. Therefore, the test becomes more powerful compared to the test that tends to make a final decision on T
2 earlier. This selection of bounds also corresponds to the smallest probability of rejecting the first hypothesis and not the second one (that is the probability of making an error when θ
(2) < 0) compared to the other combinations.The combination that uses O’Brien and Fleming bounds for both treatment arms, the combination with Pocock bound for T
1 and O’Brien and Fleming bounds for T
2, and the combination of triangular and O’Brien and Fleming bounds result in similar power to reject both hypotheses but the first two combinations require lower maximum sample size compared to the latter—198 and 204 patients, respectively against 210 patients. At the same time, the three combinations differ in terms of ESS and in terms of probability of rejecting only the first hypothesis. Indeed, among these three combinations, the one with triangular bound for T
1 and O’Brien and Fleming bound for T
2 is the one with the smallest ESS and presents the smallest probability of rejecting only the first hypothesis.The combination that uses O’Brien and Fleming bound for T
1 and triangular bound for T
2 is the one with smallest power and highest probability of rejecting only the first hypothesis. While the combination that uses triangular bounds for both treatment arms is the one with the smallest ESS (indeed triangular bounds are constructed to minimize the ESS
) for each configuration of
compared to the other ones, even though it is one of the combinations with the highest maximum sample size—222 patients. Nevertheless, this combination has slightly smaller power of rejecting both hypotheses and slightly smaller probability of rejecting only the first hypothesis compared to the combination with Pocock bounds for T
1 and triangular bound for T
2.Overall, the results suggest that there is a benefit, in terms of power and ESS, in using different bounds for each treatment arm. However, the final choice of the bounds will depend on the specific objectives of the trial. For example, in order to minimize the ESS it is recommended the use of triangular bounds for both treatment arms, whereas in order to maximize the power of rejecting all hypotheses it is recommended the use of triangular bounds for T
1 and O’Brien and Fleming bounds for T
2.
Discussion
The aim of the current study was to explore MAMS designs that could select the most promising arm associated to the minimum treatment effect. An approach that takes the order relationship among the treatment effects (when no parametric dose-response or duration-response model is assumed) into account has been proposed. In the proposed approach we claim the effectiveness of the arm associated to the minimum treatment effect compared to the control only if we claim that the treatment associated to the maximum effect is efficacious. Through theoretical arguments and extensive numerical evaluation we show that the proposed design can provide noticeable advantages in power and/or expected sample sizes required in the trial compared to the alternatives.The proposed design can be applied to a wide range of clinical trial settings where it could be assumed an order among the treatment effects. For example, in clinical trial designs applied to infectious diseases, such as TB and HBV, where it can be assumed that longer treatment durations correspond to a higher efficacy. In this case, the focus translates into the problem of selecting the shortest promising treatment duration. The proposed design can also be applied to clinical trial settings where nested combinations of treatments are tested against a common control arm.The proposed design is closely linked to the hierarchical procedures described in the literature for example, by Glimm et al,
Tamhane et al.
In particular, in the literature, the hierarchical procedure is applied for testing multiple end-points in a specific order. In some applications, the interest is to test the secondary endpoint just if the primary endpoint has been rejected. Therefore, the endpoints could be tested on hierarchically order and various strategies can be adopted to test the hypotheses
depending on the study objectives. It can be noted that when non-binding futility boundaries
are used in the 3-arm 2-stage ordered restricted design, the overall testing procedure
coincides with the proposed method.In this study, it has been assumed that the common variance is known. However, the effect of this assumption is not negligible especially with small sample sizes. In this case, a possible approach would be to transform the individual test statistics using the function
where T denotes the t distribution function with d degrees of freedom and Φ is the standard cumulative density function. More details of this approach are given in Jennison and Turnbull.Several avenues of future research present itself. First, focus has been given to superiority tests in this work. In certain diseases, such as in TB, non-inferiority designs are the norm and hence further research on non-inferiority hypothesis tests is of interest. Second, we assume that information time is the same for all treatments. When considering different durations of treatment, however, this information accumulates a different times and hence further work will consider optimal designs in this setting. Finally, we assume in this work that there is no uncertainty about the order of the treatment effects. Using a Bayesian framework, however, would naturally allow for uncertainty in that assumption to be considered.
Authors: Eckard Hamelmann; Eric D Bateman; Christian Vogelberg; Stanley J Szefler; Mark Vandewalker; Petra Moroni-Zentgraf; Mandy Avis; Anna Unseld; Michael Engel; Attilio L Boner Journal: J Allergy Clin Immunol Date: 2016-03-05 Impact factor: 10.793
Authors: Philip Pallmann; Alun W Bedding; Babak Choodari-Oskooei; Munyaradzi Dimairo; Laura Flight; Lisa V Hampson; Jane Holmes; Adrian P Mander; Lang'o Odondi; Matthew R Sydes; Sofía S Villar; James M S Wason; Christopher J Weir; Graham M Wheeler; Christina Yap; Thomas Jaki Journal: BMC Med Date: 2018-02-28 Impact factor: 8.775
Authors: Matteo Quartagno; A Sarah Walker; James R Carpenter; Patrick Pj Phillips; Mahesh Kb Parmar Journal: Clin Trials Date: 2018-06-05 Impact factor: 2.486
Authors: Christian Lienhardt; Andrew Nunn; Richard Chaisson; Andrew A Vernon; Matteo Zignol; Payam Nahid; Eric Delaporte; Tereza Kasaeva Journal: PLoS Med Date: 2020-02-27 Impact factor: 11.069
Authors: Thomas Burnett; Pavel Mozgunov; Philip Pallmann; Sofia S Villar; Graham M Wheeler; Thomas Jaki Journal: BMC Med Date: 2020-11-19 Impact factor: 8.775