Literature DB >> 30714095

Familywise error control in multi-armed response-adaptive trials.

Abstract

Response-adaptive designs allow the randomization probabilities to change during the course of a trial based on cumulated response data so that a greater proportion of patients can be allocated to the better performing treatments. A major concern over the use of response-adaptive designs in practice, particularly from a regulatory viewpoint, is controlling the type I error rate. In particular, we show that the naïve z-test can have an inflated type I error rate even after applying a Bonferroni correction. Simulation studies have often been used to demonstrate error control but do not provide a guarantee. In this article, we present adaptive testing procedures for normally distributed outcomes that ensure strong familywise error control by iteratively applying the conditional invariance principle. Our approach can be used for fully sequential and block randomized trials and for a large class of adaptive randomization rules found in the literature. We show there is a high price to pay in terms of power to guarantee familywise error control for randomization schemes with extreme allocation probabilities. However, for proposed Bayesian adaptive randomization schemes in the literature, our adaptive tests maintain or increase the power of the trial compared to the z-test. We illustrate our method using a three-armed trial in primary hypercholesterolemia.

Entities: Chemical

Keywords: Bayesian methods; closed testing; multiple comparisons; response-adaptive randomization; type I error

Mesh：

Year: 2019 PMID： 30714095 PMCID： PMC6739232 DOI： 10.1111/biom.13042

Source DB: PubMed Journal: Biometrics ISSN： 0006-341X Impact factor: 2.571

INTRODUCTION

Clinical trials typically randomize patients using a fixed randomization scheme, where the probabilities of assigning patients to the experimental treatments and control are pre‐specified and constant. A common method is to simply use equal randomization to the different arms of the trial. However, such randomization schemes can mean that a substantial proportion of the trial participants will continue to be allocated to treatments that are not the best available, even if interim data indicates that one treatment is likely to be superior. Response‐adaptive trials address this concern by adaptively changing the randomization probabilities, so that a greater proportion of patients are allocated to the treatment arm which has a better performance based on the cumulated response data. Many classes of response‐adaptive randomization (RAR) schemes have been proposed for binary outcomes, and there is also a growing interest in RAR for continuous responses. In Web Appendix A, we present an overview of multi‐arm RAR schemes described in the literature. Our focus in this article is on normally distributed outcomes, which are encountered in a number of clinical trials. Indeed, 23 out of the 59 trials identified in a review of multi‐arm trials by Wason et al. (2014) had a continuous outcome. Despite the extensive literature on RAR, relatively few clinical trials have actually used such schemes in practice. A current example is the ongoing I‐SPY 2 trial (Park et al., 2016), which screens drugs in neoadjuvant breast cancer and uses RAR as part of its design. One of the key concerns over using RAR, particularly from a regulatory perspective, is ensuring that the type I error rate is controlled. In a multi‐arm trial, multiple hypotheses are tested simultaneously by design, which leads to a multiple testing problem. To account for this, testing procedures are used that guarantee strong control of the familywise error rate (FWER), which ensures the maximum probability of making at least one type I error is controlled. For confirmatory trials in particular, demonstrating strong control of the FWER is often required by regulators (European Medicines Agency, 2002; Food and Drug Administration, 2018). For response‐adaptive trials, a rigorous proof of FWER control for a particular design is difficult given the complexities of the treatment allocation process. Hence error control has typically either been demonstrated through simulation studies, or by exploiting the asymptotic structure of the adaptive randomization procedure (Hu and Rosenberger, 2006; Zhu and Hu, 2010). However, neither method provides a guarantee of FWER control, particularly with small sample sizes. Another possibility is the use of (re)'randomization tests to preserve type I error (Simon and Simon, 2011), which we return to in Section 6. Gutjahr et al. (2011) showed how to achieve strong control of the FWER for normally distributed outcomes in a two‐stage design incorporating RAR in the first stage. However, our focus is on general response‐adaptive trials, without the necessity of restricting to two stages or having a final stage of equal randomization. In this article, we show how to guarantee strong control of the FWER for both fully sequential and block randomized response‐adaptive trials, for a large class of adaptive randomization rules. Our proposed procedure works by reweighting the usual z‐statistic through an iterative application of the conditional invariance principle. The resulting test statistic can then be used to test the null hypothesis that a treatment is superior to the control. The rest of the article is organised as follows. In Section 2, we describe the proposed method for fully sequential response‐adaptive trials with a fixed allocation to the control. This method is then modified for block randomized response‐adaptive trials in Section 3, for both a fixed or adaptive control allocation. Simulation studies for the proposed methods are presented in Section 4, and Section 5 gives a case study based on a trial in primary hypercholesterolemia. We conclude with a discussion in Section 6. All proof details can be found online in the Supporting Information section.

FULLY SEQUENTIAL RESPONSE‐ADAPTIVE TRIALS

Trial setting

Suppose a trial is conducted to test experimental treatments against a common control, using the following design. A total of n patients are allocated to the experimental treatments, and patients are allocated to the control, where and n are fixed in advance. Patients are allocated to the different experimental treatments using RAR, where we assume that the randomization rule does not depend on the control information. We also assume the allocation to the control is fixed; that is, the probability of assigning a patient to the control is pre‐specified and constant. Maintaining allocation to the control is recommended by the Food and Drug Administration (2018), since it best maintains the power of the trial, and helps address the concern about changing patient characteristics over the course of the trial. The RAR for the experimental treatments starts with a burn‐in period B, which uses fixed randomization to allocate patients to the ith treatment , with the again fixed in advance. Hence a total of patients are allocated to the experimental treatments during the burn‐in period. Let denote the treatment allocation for the kth experimental patient (), where if the kth patient is allocated to the ith treatment. The allocation can depend on the data (i.e., the allocations and outcomes) observed up to patient , as well as any external information available at that time. Also let denote the efficacy outcome for the kth patient, while denotes the efficacy outcomes for the jth patient on the control () where The variance is assumed known and, without loss of generality, we set . Here represents the incremental benefit of treatment i compared to the control, and is the parameter of interest. Finally, let denote the total number of allocations to the ith experimental treatment (including the burn‐in period) and .

Hypothesis testing

The elementary null hypotheses are against the one‐sided alternatives . We discuss the case when at the end of Section 2.5. One general method to control for multiple testing is to use the closure principle (Marcus et al., 1976) and consider all intersection hypotheses , where . To strongly control the FWER, we reject an elementary null hypothesis if we also reject every with using a local level’ test. Hence we need to define a valid level’ test for all the intersection hypotheses . The naïve z‐test for , which does not take into account the RAR used in the trial, rejects if the test statistic is greater than , where is the standard normal quantile. As an alternative to using the closure principle with the test statistic above, one could simply use the Bonferroni correction, or a step‐up/step‐down procedure such as the Holm procedure. These would only involve calculating test statistics for the t elementary null hypotheses, i.e., calculating for (). Hence we present the methodology assuming the closure principle will be used, with the Bonferroni and Holm procedures considered as special cases. We return to this issue in Section 4.

Inflation of the familywise error rate

Since the z‐test ignores the adaptive randomization used, it is possible to inflate the FWER. As an example, consider the following adaptive randomization scheme for treatments: where . This can be viewed as implementing early stopping for efficacy for treatment 1, which is not taken into account using the naïve z‐test. We ran a simulation study to calculate the type I error rate using the above randomization scheme. We set , , and the true treatment means , . The type I error rate as averaged over simulations is , more than double the nominal level. We subsequently refer to allocation rules of this type as ‘type I error inflator’ rules (which clearly would never be used in practice).

Auxiliary design

Working with the actual design of the trial is difficult because RAR affects the distribution of the usual z‐test statistics. Hence for each we introduce a simpler design, called the auxiliary design, for which we do know the distribution. The actual trial design can then be viewed as a series of data‐dependent modifications of the auxiliary design, where we account for the modifications using the conditional invariance principle. The auxiliary designs are purely hypothetical, and are only used to construct the modified tests for the actual design. As well, the allocations in the auxiliary designs are fixed before the start of the actual trial. The auxiliary design for hypothesis is as follows. As in the actual design, a total of n patients are allocated to the experimental treatments, and patients are allocated to control. The allocations and responses to the control treatment are the same as the actual design. For the patients allocated to the experimental treatments, the auxiliary design starts with a burn‐in period B with r patients that is identical to the actual design. The subsequent allocations are given by a fixed sequence , which can be chosen arbitrarily. The final allocation must be to one of the treatments in I. These allocations can be considered as an approximation, made before the trial begins, to what we could reasonably expect the actual allocations to be. Assuming clinical equipoise, there is no way of knowing a priori that the probability of assignment to a treatment will be higher or lower than that for any other treatment (given some adaptive randomization scheme). Hence randomizing equal numbers of patients to each treatment for the auxiliary design is the natural choice for trials where there is genuine uncertainty over the effectiveness of the different treatments. We return to these issues in Section 6. For the auxiliary design, let denote the efficacy outcome for the kth patient under the auxiliary design (), where for by design. Also let denote the total number of allocations to the ith experimental treatment, and denote the total number of allocations to the ith treatment for patients . We then define and . Under the auxiliary design, is fixed for all i, and hence under , the usual z‐statistic is normally distributed with mean zero and variance . Hence we reject if is greater than .

Adaptive test statistic

Adaptive designs, such as the trial being considered, follow a common conditional invariance principle in order to control the type I error rate (Brannath et al., 2007). For our response‐adaptive trial in question, we apply the conditional invariance principle sequentially, where each step considers the next patient recruited into the trial. Below we give the test statistic for testing hypothesis under the actual design, given that the allocation is fully sequential. Under , the following test statistic is normally distributed with mean 0 and variance The weights and for the efficacy outcomes are calculated recursively based on the number of allocations to the experimental treatments and the control, with full details and the proof of Theorem 1 provided in Web Appendix B. Using Theorem 1, we reject if is greater than . In practice, to keep the weights as close to the natural weight for as many of the control observations as possible, we recommend setting and , as used for the simulation studies in Section 4.1. As a simple illustration of how the weights change over the course of a trial, consider testing experimental treatments. We set , , and . Suppose we have no a priori reason to favour one treatment over the other, and so we simply choose the auxiliary design to be an equal randomization of the two treatments: Here the vertical line indicates where the burn‐in period ends, and the * represents the allocation for , which by design must satisfy . We set and , so that . Table 1 below shows how the weights change over the course of a trial for an actual allocation a that is similar to the auxiliary design b. Further examples for a variety of actual allocations a can be found in Web Appendix C.

Table 1

An actual allocation a that is similar to the auxiliary design b. The weights that would be used in the naïve z‐test are , and

a =	1	2	2	2	1	2	2	1	2	1	2
b =	1	2	2	1	2	2	1	1	2	1	*
w(1) =	6	6	6	5.16	6	6	4.94	4.94	4.94	4.94	–	w1(0)=9.74,w10(0)=−5.38
w(2) =	6	6	6	7.01	5.74	5.74	7.63	7.63	7.63	7.63	7.63	w1(0)=w10(0)=9.58

An actual allocation a that is similar to the auxiliary design b. The weights that would be used in the naïve z‐test are , and In all of the scenarios that we have investigated, the weights for the experimental treatments have been positive (although we cannot rule out the possibility of having negative or imaginary weights). Hence in these cases, the test procedure also controls the FWER for the composite null hypotheses . To see this, suppose the elementary null hypotheses are . Under , we can rewrite the distribution of the responses as , where . Hence under where and are the adaptive test statistics for and respectively.

BLOCK RANDOMIZED RESPONSE‐ADAPTIVE TRIALS

It may not be feasible or desirable to randomize patients one‐by‐one in a fully sequential manner. Instead one can use block randomization, where after the burn‐in period B, patients are adaptively randomized to the experimental treatments in blocks of size over J stages, with . The randomization of the jth block depends on the data up to block , as well as any external information available at the time. The allocation to the control is again assumed to be fixed throughout the trial. Due to the block structure of the trial, we can relax the assumption that the randomization rule used for the experimental treatments does not depend on the control information. This is achieved by splitting up the patients allocated to the control into blocks. More explicitly, suppose that during the burn‐in period, patients are allocated to the control, where is fixed in advance. Subsequently, in the jth block, patients are allocated to the control, where . We assume that for the final block . The response‐adaptive randomization at block l may now depend on the control information available at the end of block ; that is, the outcome data available from the first patients allocated to the control. To control the FWER, we can modify the approach in Section 2 to account for the block structure. As before, we have an auxiliary design for the patients on the experimental treatments, but now in step l of the process the actual design is a data‐dependent modification of all the allocations for the patients in block l. Hence the weights for the observations in each block will be the same, and are updated block‐by‐block. Full details of the auxiliary design and the resulting adaptive test statistic can be found in Web Appendix D.

Extension for adaptive control allocations

Thus far, we have assumed that the allocations to the control follow some fixed scheme. We now relax this assumption in the block‐randomized setting. The formula for the adaptive test statistic can be found in Web Appendix E. Note that it is possible the procedure will fail to give a valid test statistic in this setting, as shown in Web Appendix F.3.

SIMULATION STUDIES

As we have already seen in Section 2.3, using the closure principle with the usual z‐test does not strongly control the FWER. An alternative method is to use the Bonferroni correction on the elementary null hypotheses . We also consider the Holm procedure, which is a step‐down procedure that is uniformly more powerful than Bonferroni (Holm, 1979). An advantage of both these procedures is that only t test statistics are calculated, rather than test statistics when using the closure principle. This motivates also applying the Holm procedure to the p‐values derived from the adaptive test statistics for . More precisely, we use the adjusted p‐values , instead of the usual p‐values derived from the z‐test. To distinguish between the methods, we call our proposed procedure that uses the closure principle the ‘adaptive closed test’. Similarly, applying the closure principle to the usual z‐test gives the ‘closed z‐test’. Applying the Holm procedure to our adjusted p‐values gives the ‘Holm adaptive test’, while applying the Holm procedure to the usual p‐values gives the ‘Holm z‐test’. In our simulation studies, we compare the different methods by looking at the FWER and the power of the different tests. To keep the comparisons simple, and as a similar measure to the FWER, we present results for the disjunctive power, which is the probability of rejecting at least one false null hypothesis. In order to see how the adaptive randomization procedures affect power, we provide comparisons with using equal randomization in Web Appendix F.5.

Fully sequential randomization

We first consider a fully sequential response‐adaptive trial, as presented in Section 2, with patients allocated to the experimental treatments after the burn‐in and patients allocated to the control. In the burn‐in period, five patients are allocated to each of the experimental treatments. We set and the true control mean for simplicity. We compare the methods under two randomization schemes, the first being the Type I error inflator and the second being a Bayesian Adaptive Randomization (BAR) scheme, with full details given in Web Appendix F.1. Simulation results: Table 2 gives the results for the type I error inflator randomization scheme, while Table 3 gives the results for BAR. The auxiliary designs in all scenarios were simply random draws from a discrete uniform distribution on .

Table 2

Familywise error rate and disjunctive power for the type I error inflator in the fully sequential setting. There were simulated trials for each set of parameter values

		Adaptive closed test		Adaptive test (Holm)		Closed z‐test		z‐test (Holm)		z‐test (Bonferroni)
	Parameter values	Error	Power	Error	Power	Error	Power	Error	Power	Error	Power
1.	δ1=δ2=0	3.3	–	4.7	–	4.7	–	7.0	–	7.0	–
2.	δ1=0, δ2=1	4.8	21.7	3.7	27.5	10.3	26.5	9.9	63.6	5.0	63.5
3.	δ1=δ2=0.5	–	62.4	–	52.4	–	69.9	–	61.6	–	61.6
4.	δ1=δ2=δ3=0	2.8	–	3.8	–	4.1	–	5.9	–	5.9
5.	δ1=δ2=0, δ3=1	3.2	13.1	4.2	24.2	5.1	17.2	6.4	54.2	4.5	54.1
6.	δ1=0, δ2=δ3=1	4.6	22.2	3.2	28.0	9.7	27.0	9.0	75.4	3.2	75.4
7.	δ1=0, δ2=0.5, δ3=1	4.0	19.1	2.6	24.5	9.1	23.9	7.4	58.5	3.2	58.4
8.	δ1=δ2=δ3=0.5	–	51.3	–	41.7	–	57.8	–	49.7	–	49.7

Table 3

Familywise error rate and disjunctive power for BAR in the fully sequential setting. There were simulated trials for each set of parameter values

		Adaptive closed test		Adaptive test (Holm)		Closed z‐test		z‐test (Holm)		z‐test (Bonferroni)
	Parameter values	Error	Power	Error	Power	Error	Power	Error	Power	Error	Power
1.	δ1=δ2=0	4.7	–	4.5	–	4.8	–	4.1	–	4.1	–
2.	δ1=0, δ2=0.5	4.6	46.4	4.4	52.4	3.9	46.7	3.6	53.6	1.9	53.5
3.	δ1=δ2=0.5	–	70.8	–	66.4	–	71.2	–	65.9	–	65.9
4.	δ1=δ2=δ3=0	3.8	–	4.1	–	4.0	–	3.8	–	3.8
5.	δ1=δ2=0, δ3=1	4.4	59.9	4.2	88.7	4.3	60.1	3.8	90.6	2.6	90.6
6.	δ1=0, δ2=δ3=1	4.8	89.8	4.7	95.1	4.0	90.1	3.9	96.0	1.3	96.0
7.	δ1=0, δ2=0.5, δ3=1	4.3	74.8	3.9	88.2	3.9	75.7	3.4	90.0	1.4	90.0
8.	δ1=δ2=δ3=0.5	–	56.5	–	51.8	–	57.9	–	52.7	–	52.7

Familywise error rate and disjunctive power for the type I error inflator in the fully sequential setting. There were simulated trials for each set of parameter values Familywise error rate and disjunctive power for BAR in the fully sequential setting. There were simulated trials for each set of parameter values Looking first at the results for the type I error inflator in Table 2, the closed z‐test does not control the FWER in the scenarios where at least one null hypothesis is false, with an error rate as high as in scenario 2. Applying the Holm procedure to the z‐test does not control the FWER, and actually increases the error rate in some scenarios (1 and 4). Applying Bonferroni to the z‐test also does not control the FWER, as can be seen in scenarios where all null hypotheses are true. In contrast, both the adaptive closed test and the Holm adaptive test strongly control the FWER, although they tend to be rather conservative. The fact that the Bonferroni correction, the use of the closure principle and the Holm procedure all do not control the FWER may appear surprising at first. The inflation occurs because when using RAR, the usual z‐test will no longer be normally distributed under the null. Therefore the p‐values derived from the z‐statistics are not guaranteed to be stochastically larger than or equal to the uniform distribution under the null, which is required for these methods to work. As for the power of the different methods, when at least one of the null hypotheses is true (as in scenarios 2, 5, 6 and 7), the Holm z‐test has substantially higher power than the closed z‐test. Indeed, the power more than doubles in all four scenarios, and even more than triples in scenario 5. This dramatic increase in power demonstrates that in these scenarios, the closed z‐test is not very sensitive. This is because the test statistic for will be ‘diluted’ by the contribution from responses belonging to the null hypotheses that are true. It is only when all of the null hypotheses are false, as in scenarios 3 and 8, that the power of the closed z‐test is reasonable, with a slightly higher power than the Holm z‐test. As for the adaptive tests, the adaptive closed test has a slightly lower power than the closed z‐test for all scenarios, with an absolute decrease of between in scenario 5 and in scenario 3. However, the Holm adaptive test has a substantially lower power than the Holm z‐test, with the latter having more than double the power. This demonstrates the high cost in terms of power that controlling the FWER can incur for this randomization scheme. We return to this issue in Section 4.3. Turning to the BAR scheme in Table 3, this time all of the methods strongly control the FWER. All methods are slightly conservative, with the adaptive closed test being generally the closest to the nominal level. The Bonferroni‐corrected z‐test is noticeably more conservative than all the other methods, particularly when there are three treatments. In terms of disjunctive power, if at least one of the null hypotheses are true, we again see that the closed tests suffer from reduced power compared to the Holm versions. However, with BAR the loss of power is less dramatic, with a maximum of a relative decrease in power in scenario 5, but with much smaller decreases in scenarios 2 and 7 for example. This time, the adaptive closed test has almost the same power as the closed z‐test, losing a maximum of only in scenario 8. In addition, the Holm adaptive test and Holm z‐test now have comparable power, with a maximum loss of only in scenarios 5 and 7. This indicates that for BAR schemes, the adaptive tests do not lose out very much in terms of power.

Block randomization with a fixed control allocation

We now consider block randomized trials with a fixed control allocation, as presented in Section 3.1. We use the setup of a trial with blocks, with sizes (40, 40, 40) for the experimental treatments and (20, 20, 20) for the control. In the burn‐in period, five patients are allocated to each of the treatments including the control. We set the true control mean , and . We compare the methods under a type I error inflator scheme and a BAR scheme, with full details given in Web Appendix F.2. Simulation results: Table 4 gives the results for the type I error inflator randomization scheme, while Table 5 gives the results for BAR. The auxiliary designs in all scenarios were simply random draws from a discrete uniform distribution on .

Table 4

Familywise error rate and disjunctive power for the type I error inflator, for block randomization with a fixed control allocation. There were simulated trials for each set of parameter values

		Adaptive closed test		Adaptive test (Holm)		Closed z‐test		z‐test (Holm)		z‐test (Bonferroni)
	Parameter values	Error	Power	Error	Power	Error	Power	Error	Power	Error	Power
1.	δ1=δ2=0	3.8	–	4.8	–	4.6	–	6.5	–	6.5	–
2.	δ1=0, δ2=1	4.8	22.0	3.6	26.9	8.3	25.6	7.8	61.1	4.3	61.0
3.	δ1=δ2=0.5	–	92.7	–	87.9	–	94.6	–	91.7	–	91.7
4.	δ1=δ2=δ3=0	3.2	–	4.1	–	4.1	–	6.1	–	6.1
5.	δ1=δ2=0, δ3=1	3.7	14.2	4.4	23.4	4.7	18.1	6.2	61.2	4.5	61.1
6.	δ1=0, δ2=δ3=1	4.9	20.1	3.2	26.1	8.1	23.0	7.3	78.5	3.2	78.4
7.	δ1=0, δ2=0.5, δ3=1	4.7	17.7	3.0	23.8	8.0	21.1	6.7	66.2	2.8	66.2
8.	δ1=δ2=δ3=0.5	–	91.3	–	83.4	–	94.0	–	89.7	–	89.7

Table 5

Familywise error rate and disjunctive power for BAR, for block randomization with a fixed control allocation. There were simulated trials for each set of parameter values

		Adaptive closed test		Adaptive test (Holm)		Closed z‐test		z‐test (Holm)		z‐test (Bonferroni)
	Parameter values	Error	Power	Error	Power	Error	Power	Error	Power	Error	Power
1.	δ1=δ2=0	4.8	–	4.6	–	4.8	–	4.5	–	4.5	–
2.	δ1=0, δ2=0.5	5.0	61.2	4.9	82.7	4.9	61.2	4.8	82.9	2.5	82.8
3.	δ1=δ2=0.5	–	94.5	–	92.3	–	94.5	–	92.2	–	92.2
4.	δ1=δ2=δ3=0	3.7	–	4.5	–	3.7	–	4.2	–	4.2
5.	δ1=δ2=0, δ3=0.5	4.4	36.1	4.6	71.8	4.3	36.0	4.4	71.8	3.0	71.7
6.	δ1=0, δ2=δ3=0.5	5.0	67.3	4.6	85.6	4.8	66.8	4.4	85.4	1.6	85.4
7.	δ1=0, δ2=0.25, δ3=0.5	4.6	51.1	3.7	73.0	4.4	50.9	3.5	72.6	1.6	72.6
8.	δ1=δ2=δ3=0.5	–	93.5	–	90.7	–	93.4	–	90.4	–	90.4

Familywise error rate and disjunctive power for the type I error inflator, for block randomization with a fixed control allocation. There were simulated trials for each set of parameter values Familywise error rate and disjunctive power for BAR, for block randomization with a fixed control allocation. There were simulated trials for each set of parameter values The results are broadly similar to those for the fully sequential setting presented in Section 4.1. For the type I error inflator, we see that the closed z‐test does not control the FWER in general (as seen in scenarios 2, 6 and 7), and neither does applying the Holm procedure to the z‐test. The Bonferroni‐corrected z‐test has an inflated FWER when all null hypotheses are true, as in scenarios 1 and 4. In contrast, the adaptive tests strongly control the FWER in all scenarios. However, again this comes at the cost of reduced power. There is a slight reduction in power between the closed z‐test and the closed adaptive test, of between 3–4% in absolute terms. In scenarios where at least one null hypothesis is true, the Holm z‐test has a much higher power than the Holm adaptive test, with the power more than doubling in these scenarios, and actually tripling in scenario 6. As for the BAR scheme, all of the methods strongly control the FWER. This time, for some scenarios the adaptive closed test basically achieves the nominal level, as in scenarios 2 and 6. When there are three treatments, the Bonferroni‐corrected z‐test can again be overly conservative, as in scenarios 6 and 7. In contrast to the fully sequential setting, with block randomization we see that the adaptive tests actually have the highest power out of all the methods in all scenarios except scenario 2. When at least one null hypothesis is true, the Holm adaptive test has the highest power, while when all null hypotheses are false the adaptive closed test has the highest power. The power gains are small, but demonstrate that we do not always lose out in terms of power when using the proposed adaptive tests. Block randomization with an adaptive control allocation: In Web Appendix F.3, we present a simulation study considering block randomization with an adaptive control allocation, as presented in Section 3.2. The results are broadly similar to those presented above.

Summary

In summary, the simulation results show that in the randomization settings considered, our proposed adaptive tests strongly control the FWER, as would be expected from theory. In contrast, the various z‐tests can all fail to control the error rate, as seen in the results for the type I error inflator. However, given a more realistic randomization scheme, such as the BAR schemes we considered, the z‐tests achieve strong familywise error control. As for disjunctive power, we see that when at least one null hypothesis is true, the closed tests suffer a very large drop in power compared to the Holm versions. This is because of the ‘dilution’ of the test statistic as mentioned in Section 4.1. However, when all the null hypotheses are true, then the closed test has the higher power, although the gains are at most modest. The adaptive tests can pay a large price in terms of power when compared with the z‐tests, as seen in the results for the type I error inflator. In Web Appendix F.4, we give an additional simulation study with two treatments, where the randomization scheme used is simply a fixed allocation to the experimental treatments but with unequal randomization probabilities. We show that when the probability of assignment to treatment 2 is low (i.e., less than 0.2), there is a large drop in the power of the adaptive tests for testing . This explains what is happening with the type I error inflator when , where in the majority of trial realisations, apart from the unlikely event that treatment 1 stops early for ‘efficacy’, the probability of assignment to treatment 2 is zero by design. Hence, the type I inflator is in fact close to a worst‐case scenario for the adaptive tests. However, for adaptive randomization schemes that implement a restriction on the probability of assignment so that it is above a minimum threshold (or equivalently a minimal proportion of patients in each treatment group, we would not expect there to be a substantial loss of power when using the Holm adaptive test compared with the Holm z‐test, particularly in the block randomized setting.

CASE STUDY

Finally, we illustrate our proposed methodology using an example based on a phase II placebo‐controlled trial in primary hypercholesterolemia (Roth et al., 2012). The purpose of the study was to compare the effects of using the SAR236553 antibody with high‐dose or lose‐dose atorvastatin, as compared with high‐dose atorvastatin alone. The primary outcome was the least‐squares mean percent reduction from baseline of low‐density lipoprotein cholesterol (LDL‐C). Patients were randomly assigned, in a 1:1:1 ratio, to receive 80 mg of atorvastatin plus placebo, 10 mg of atorvastatin plus SAR236553, or 80 mg of atorvastatin plus SAR236553. For convenience, we label these different interventions as the ‘control’, ‘low dose’ and ‘high dose’ respectively. In the trial, the observed least‐squares mean SE percent reduction from baseline in LDL‐C was for the control, for the low dose and for the high dose. There were patients on the control, patients on the low dose and patients on the high dose, giving a total of patients on the two experimental doses. We use the observed values from the trial and assume that the distribution of the least‐squares standardized mean percent reduction from baseline of LDL‐C is for the control, for the low dose, and for the high dose. Now suppose that the trial was carried out as an adaptive block randomized trial with a fixed control allocation, as described in Section 3.1. Let the trial have blocks, with block sizes (15, 15, 15) for the experimental treatments and (8, 8, 8) for the placebo. In the burn‐in period, 7 patients are allocated to the control and 8 patients are allocated to each of the experimental doses. Hence, a total of 31 patients are on the control and 61 on the experimental treatments, as in the original trial. We use the BAR scheme of Section 4.2, with priors and (), while . Table 6 shows the results for a simulated trial with the above parameters, where the BAR scheme allocated 13 patients to the low dose and 32 patients to the high dose after the burn‐in period. This yields the natural weights used in the naïve z‐test of for the low dose and for the high dose. The natural weight for the control is by design. The auxiliary design randomly assigned 44 patients to the low or high dose in a 1:1 ratio, and allocated 21 patients to the low dose and 23 patients to the high dose.

Table 6

Test statistics, p‐values and weights for a simulated block randomized trial using a BAR scheme

	Low dose	High dose
z‐test statistic	13.76 (p<0.001)	15.50 (p<0.001)
Adaptive test statistic	12.21 (p<0.001)	16.22 (p<0.001)
Natural weights	n1′=21, n0=31	n2′=40, n0=31
Adaptive weights	w(1)=(30,28.05,21.49,16.43)	w(2)=(32,34.09,42.68,46.08)
	w(0)=(31,31.73,35.91,42.76)	w(0)=(31,30.43,28.86,28.41)

Test statistics, p‐values and weights for a simulated block randomized trial using a BAR scheme The adaptive test statistic is slightly smaller than the z‐test statistic for the low dose, while the converse is true for the test statistics for the high dose. Looking at the adaptive weights for the burn‐in period and the three blocks, we see that for the low dose, the weights for the low dose decrease for each block while the control weights increase. This pattern is reversed for the high dose. Given that all the p‐values are less than , using either the z‐test or the adaptive test we would conclude that adding the SAR236553 antibody to high‐dose or low‐dose atorvastatin leads to a statistically significant reduction in LDL‐C levels.

DISCUSSION

A major regulatory concern over the use of response‐adaptive trials in clinical practice has been ensuring control of the type I error rate. We have proposed procedures that guarantee strong familywise error control in the following multi‐armed trial settings: Fully sequential response‐adaptive trials with a fixed control allocation (where the randomization rule does not depend on the control information); Block‐randomized response‐adaptive trials with a fixed control allocation; Block‐randomized response‐adaptive trials including an adaptive control allocation. These procedures are applicable to a large class of RAR rules for settings (2) and (3), with only some restrictions in setting (1). Hence many Bayesian and ‘optimal’ RAR schemes proposed in the literature can be used without adjustment, with only the final test statistic having to be modified. Since our proposed adaptive testing procedures are based on the conditional invariance principle, they have the additional important flexibility of being valid when the allocation is changed due to external information (i.e., information other than the previously observed treatment assignments and patient responses). Such changes might occur due to safety data, non‐compliance or even trial protocol violations. In practice, to control the FWER we would recommend using the Holm adaptive test. Importantly, it has a much higher power than the adaptive closed test when at least one of the null hypotheses are true. As well, it only requires t hypothesis tests as compared with hypothesis tests for the adaptive closed test. Our adaptive tests lead to unequal weightings of patients, which may be controversial. Indeed, this is a more general issue that can affect the analysis of any flexible adaptive design that uses that combination test approach (Burman and Sonesson, 2006). One solution is to use the so‐called ‘dual test’, and reject a hypothesis only if both the adaptive test and the naïve z‐test rejects (Posch et al., 2003), but this can come at the cost of reduced power. A related problem in practice is explaining the unequal weighting of patients to clinicians, which would be a fruitful area of further investigation. As mentioned in Section 2.4, the choice of the allocations in the auxiliary design is meant to be an approximation of the actual allocations. This choice, while not affecting the control of the FWER, can affect the power of the resulting adaptive test procedure. If information is available before the trial begins of a likely ordering of the treatment effects (such as in a trial testing different doses of a drug), then it might be possible to design ‘optimal’ auxiliary designs that maximise the expected power of the resulting adaptive test procedure. We have assumed that the variances of the control and experimental treatments are known. Fully accounting for unknown variances would add considerable complexity to our approach. In Web Appendices F.6 and F.7, we show empirically that estimating the variance from the data (with or without assuming a common variance) does not inflate the FWER when using the Holm adaptive test, for the simulation scenarios considered in this article. Our proposed procedures are designed for normally‐distributed outcomes but can be implemented for other types of outcomes through the use of asymptotics. For example, with survival data one could use the asymptotically standard normal form for the logrank test statistic. For binary data, a starting point would be to use the asymptotically normal test statistic for contrasting each treatment arm with the control (Jennison and Turnbull, 2000; Wason and Trippa, 2014), particularly in the block randomised setting. However, it would be useful to extend our procedures to work directly with other types of outcomes. This would be a fruitful yet challenging area for future research, since applying the conditional invariance principle in these settings without appealing to asymptotics is likely to be complex, especially given the form of the usual test statistics (such as Fisher's exact test). As mentioned in Section 1, another possible method of controlling the FWER is through the use of (re)'randomization tests (Simon and Simon, 2011). These tests have the advantage of being valid under unknown variances, non‐normally distributed outcomes and even time trends associated with the outcome. However, our proposed adaptive testing procedures have the advantage of being valid under the composite null hypothesis (as long as the weights for the experimental treatments are positive). Also, although it would be interesting to compare randomization tests with our proposed testing procedures, calculating the randomization p‐values using Monte Carlo methods for trial simulations (and repeating this for each set of parameter values) would be computationally challenging. Finally, although we did not explicitly consider it in this article, the adaptive randomization procedures used could also incorporate covariate information, so that the allocation probabilities vary across patients with different covariates. These covariate‐adjusted RAR schemes are particularly useful when certain characteristics of the patients may be correlated with the primary outcome (Hu and Rosenberger, 2006). A related setting would be biomaker‐guided response‐adaptive trials, such as I‐SPY 2. Additional supporting information may be found online in the Supporting Information section at the end of the article. Supplementary Materials. Click here for additional data file.

10 in total

1. Issues in designing flexible trials.

Authors: Martin Posch; Peter Bauer; Werner Brannath
Journal: Stat Med Date: 2003-03-30 Impact factor: 2.373

2. Are flexible designs sound?

Authors: Carl-Fredrik Burman; Christian Sonesson
Journal: Biometrics Date: 2006-09 Impact factor: 2.571

3. Multiplicity and flexibility in clinical trials.

Authors: Werner Brannath; Franz Koenig; Peter Bauer
Journal: Pharm Stat Date: 2007 Jul-Sep Impact factor: 1.894

4. Familywise error control in multi-armed response-adaptive two-stage designs.

Authors: Georg Gutjahr; Martin Posch; Werner Brannath
Journal: J Biopharm Stat Date: 2011-07 Impact factor: 1.051

5. Adaptive Randomization of Neratinib in Early Breast Cancer.

Authors: John W Park; Minetta C Liu; Douglas Yee; Christina Yau; Laura J van 't Veer; W Fraser Symmans; Melissa Paoloni; Jane Perlmutter; Nola M Hylton; Michael Hogarth; Angela DeMichele; Meredith B Buxton; A Jo Chien; Anne M Wallace; Judy C Boughey; Tufia C Haddad; Stephen Y Chui; Kathleen A Kemmer; Henry G Kaplan; Claudine Isaacs; Rita Nanda; Debasish Tripathy; Kathy S Albain; Kirsten K Edmiston; Anthony D Elias; Donald W Northfelt; Lajos Pusztai; Stacy L Moulder; Julie E Lang; Rebecca K Viscusi; David M Euhus; Barbara B Haley; Qamar J Khan; William C Wood; Michelle Melisko; Richard Schwab; Teresa Helsten; Julia Lyandres; Sarah E Davis; Gillian L Hirst; Ashish Sanil; Laura J Esserman; Donald A Berry
Journal: N Engl J Med Date: 2016-07-07 Impact factor: 91.245

6. Using Randomization Tests to Preserve Type I Error With Response-Adaptive and Covariate-Adaptive Randomization.

Authors: Richard Simon; Noah Robin Simon
Journal: Stat Probab Lett Date: 2011-07 Impact factor: 0.870

7. Atorvastatin with or without an antibody to PCSK9 in primary hypercholesterolemia.

Authors: Eli M Roth; James M McKenney; Corinne Hanotin; Gaelle Asset; Evan A Stein
Journal: N Engl J Med Date: 2012-10-31 Impact factor: 91.245

Review 8. Correcting for multiple-testing in multi-arm trials: is it necessary and is it done?

Authors: James M S Wason; Lynne Stecher; Adrian P Mander
Journal: Trials Date: 2014-09-17 Impact factor: 2.279

9. Familywise error control in multi-armed response-adaptive trials.

Authors: D S Robertson; J M S Wason
Journal: Biometrics Date: 2019-04-03 Impact factor: 2.571

10. A comparison of Bayesian adaptive randomization and multi-stage designs for multi-arm clinical trials.

Authors: James M S Wason; Lorenzo Trippa
Journal: Stat Med Date: 2014-01-14 Impact factor: 2.373

10 in total

4 in total

1. BAGS: A Bayesian Adaptive Group Sequential Trial Design With Subgroup-Specific Survival Comparisons.

Authors: Ruitao Lin; Peter F Thall; Ying Yuan
Journal: J Am Stat Assoc Date: 2020-11-30 Impact factor: 4.369

2. Familywise error control in multi-armed response-adaptive trials.

Authors: D S Robertson; J M S Wason
Journal: Biometrics Date: 2019-04-03 Impact factor: 2.571

3. Adding flexibility to clinical trial designs: an example-based guide to the practical use of adaptive designs.

Authors: Thomas Burnett; Pavel Mozgunov; Philip Pallmann; Sofia S Villar; Graham M Wheeler; Thomas Jaki
Journal: BMC Med Date: 2020-11-19 Impact factor: 8.775

4. Efficient Adaptive Designs for Clinical Trials of Interventions for COVID-19.

Authors: Nigel Stallard; Lisa Hampson; Norbert Benda; Werner Brannath; Thomas Burnett; Tim Friede; Peter K Kimani; Franz Koenig; Johannes Krisam; Pavel Mozgunov; Martin Posch; James Wason; Gernot Wassmer; John Whitehead; S Faye Williamson; Sarah Zohar; Thomas Jaki
Journal: Stat Biopharm Res Date: 2020-07-29 Impact factor: 1.452

4 in total