
A comparison of methods for treatment selection in seamless phase II/III clinical trials incorporating information on short-term endpoints.

Cornelia Ursula Kunz¹, Tim Friede, Nicholas Parsons, Susan Todd, Nigel Stallard.

Abstract

In an adaptive seamless phase II/III clinical trial, interim analysis data are used for treatment selection, enabling resources to be focused on comparison of the more effective treatment(s) with a control. In this paper, we compare two methods recently proposed to enable use of short-term endpoint data for decision-making at the interim analysis. The comparison focuses on the power and the probability of correctly identifying the most promising treatment. We show that the choice of method depends on how well short-term data predict the best treatment, which may be measured by the correlation between treatment effects on short- and long-term endpoints.


Keywords: Adaptive seamless design; Multi-arm multi-stage trial; Surrogate endpoints


Year: 2015 PMID: 24697322 PMCID: PMC4339952 DOI: 10.1080/10543406.2013.840646

Source DB: PubMed Journal: J Biopharm Stat ISSN: 1054-3406 Impact factor: 1.051

INTRODUCTION

In recent years, adaptive designs in the various phases of drug development have gained popularity. Such designs use information from accumulating data in an ongoing trial to make decisions about the conduct of the rest of the study (Gallo et al., 2006). One particular form of adaptive design is the combined phase II/III adaptive seamless design. A trial of this type is conducted in two stages. During the first stage, the exploratory stage, patients are recruited to several experimental treatments and a control treatment. One or more interim analyses are then performed, at which treatments that appear ineffective are dropped. The main objective of this first stage is to identify the most promising treatments, so that recruitment of further patients can be restricted to only those treatments and the control. At the end of the second stage, the confirmatory stage, the selected treatment(s) is (are) compared to the control within a formal testing framework, again possibly involving a sequence of interim analyses, based on all data from the selected treatment(s) and the control. Several authors have developed methodology for conducting phase II/III studies that protects the overall type I error rate of the trial (see, e.g., Bauer and Kieser, 1999; Stallard and Todd, 2003; Kelly et al., 2005; Posch et al., 2005; Bretz et al., 2006; Koenig et al., 2008). Reviews of the different approaches are given by Chow et al. (2005), Friede and Stallard (2008), Bretz et al. (2009), and Stallard and Todd (2011). In the pharmaceutical setting, adaptive designs continue to gain acceptance. Regulatory authorities have recently produced guidance documents on the topic (Food and Drug Administration (FDA), 2010; European Medicines Agency (EMEA) - Committee for Medicinal Products for Human Use (CHMP), 2007), giving further evidence that they anticipate more clinical trials will be designed using this framework. 
Indeed, there are a number of therapeutic areas where phase II/III seamless adaptive designs have already been implemented. Schmoll et al. (2010) describe a pharmaceutical trial in oncology that was designed using the methodology of Stallard and Todd (2003) and Todd and Stallard (2005). Barnes et al. (2010) discuss the use of a phase II/III design in chronic obstructive pulmonary disease. In other therapeutic areas adaptive designs have been proposed and promoted. Dragalin (2011) discusses the potential for the use of adaptive designs in all phases of development, including discussion of phase II/III trials, in central nervous system studies. Chataway et al. (2011) and Friede et al. (2011) propose a phase II/III seamless adaptive design for use in secondary progressive multiple sclerosis trials. A recent area of research in the development of further methodology for phase II/III designs concerns the question of how to incorporate early endpoint data into the treatment selection part of such a trial. The desire to do this arises when the primary endpoint of interest for each patient is only available after a number of months or even years and yet there are more immediately measured endpoints available, building on earlier work on incorporation of early endpoints in sequential clinical trials comparing a single experimental treatment with a control (Cook and Farewell, 1996; Marschner and Becker, 2001; Galbraith and Marschner, 2003; Sooriyarachchi et al., 2006; Whitehead et al., 2008). An example can be found in secondary progressive multiple sclerosis, where long-term changes in disability scales are the main goal, but early evidence of treatment effect may be observed as changes to lesions in the brain detected using magnetic resonance imaging scanning technology. Two alternative methods for incorporating early endpoint data in phase II/III clinical trials have been proposed by Stallard (2010) and Friede et al. (2011). 
The methods differ in the way in which the treatment to continue to the second stage is chosen. Treatment selection under the method described by Stallard (2010) makes use of short-term endpoint data combined with any available long-term data. In contrast, Friede et al. (2011) propose a method of treatment selection that uses only short-term endpoint data. Both approaches base the final inference on the long-term endpoint data only, though they differ in the way in which data from the two stages of the trial are combined. The aim of this paper is to compare the methods proposed in these two manuscripts. Since both methods have been shown to control the type I error rate, we will focus on comparison of the power of the two approaches in a range of realistic scenarios. This will inform researchers aiming to design a seamless phase II/III trial in which short-term endpoint data can be used for decision-making at an interim analysis. The two methods under consideration are reviewed in detail in Section 2, where a common notation is also established. Sections 3 and 4 describe comparisons of the two approaches in the settings of fixed and random treatment effect models, respectively. The paper concludes with a discussion in Section 5.

NOTATION AND REVIEW OF METHODS

Setting and Notation

Consider a clinical trial conducted in two stages. In the first stage, patients are randomized to the control treatment $T_0$ or to one of $k$ experimental treatments, $T_1, \ldots, T_k$. Suppose that data on the primary, long-term, endpoint are available for $n_1$ patients in each treatment group, and that in addition, short-term endpoint data are observed for $N_1$ patients in each treatment group, with $N_1 \geq n_1$. In stage one, we therefore have $N_1 - n_1$ patients with short-term endpoint data only and $n_1$ patients with both short- and long-term endpoint data in each treatment group. Following an interim analysis, one experimental treatment, denoted by $T_I$, is chosen to continue to the second stage along with the control treatment, with a further $n_2 - N_1$ patients recruited to each of these treatment groups, giving a total of $n_2$ patients per group in all. Two possible ways of making the treatment selection are described below. Suppose that following the second stage, patients are followed up so that primary long-term endpoint data are available for the total of $n_2$ patients receiving each of treatments $T_I$ and $T_0$.

Denote by $X_{ij}$ and $Y_{ij}$, respectively, the short-term and long-term endpoint data from patient $j$ in group $i$. When both endpoints are observed, that is for $j = 1, \ldots, n_2$ for $i = 0, I$ and for $j = 1, \ldots, n_1$ for other $i$, the two endpoints for each patient are assumed to follow a bivariate normal distribution. When only the short-term endpoint is observed, that is for $i \neq 0, I$, $j = n_1 + 1, \ldots, N_1$, $X_{ij}$ is assumed to follow a normal distribution, so that we have

$$\begin{pmatrix} X_{ij} \\ Y_{ij} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_{X,i} \\ \mu_{Y,i} \end{pmatrix}, \begin{pmatrix} \sigma_X^2 & \rho\,\sigma_X \sigma_Y \\ \rho\,\sigma_X \sigma_Y & \sigma_Y^2 \end{pmatrix} \right) \quad \text{or} \quad X_{ij} \sim N(\mu_{X,i}, \sigma_X^2) \qquad (1)$$

where $\mu_{X,i}$ and $\mu_{Y,i}$ denote the true means on the short- and long-term endpoints, respectively, in group $i$; $\sigma_X^2$ and $\sigma_Y^2$ denote the true variances for the short- and long-term endpoints, respectively; and $\rho$ denotes the true correlation between the endpoints within each group. The variances $\sigma_X^2$ and $\sigma_Y^2$ and the correlation $\rho$ will be assumed known and equal for all patients. In the calculation of selection probabilities below, the true variances and correlation will be used. In the simulations, estimates obtained from the data will be used in place of the true values, as suggested by Stallard (2010) and Friede et al. (2011). Given the mean values, individual patients are assumed to be independent so that $\mathrm{cov}(X_{ij}, X_{i'j'}) = 0$, $\mathrm{cov}(Y_{ij}, Y_{i'j'}) = 0$, and $\mathrm{cov}(X_{ij}, Y_{i'j'}) = 0$ for $i \neq i'$ or $j \neq j'$. A summary of the parameters in the fixed and random effects models is given in Table 1.

The parameters of interest are the treatment effects relative to the control treatment on the long-term endpoint, that is $\theta_i = \mu_{Y,i} - \mu_{Y,0}$, and we wish to test the null hypotheses denoted $H_i: \theta_i \leq 0$ against the one-sided alternative hypotheses denoted by $H_i': \theta_i > 0$ for treatment group $i = 1, \ldots, k$.

Table 1

Summary of model parameters

Sample sizes
$N_1$	Total number of patients per group with short-term data at interim analysis
$n_1$	Number of patients per group with short-term and long-term data at interim analysis
$N_1 - n_1$	Number of patients per group with short-term data only at interim analysis
$n_2$	Number of patients per group with short-term and long-term data at final analysis
Fixed or random effects model parameters
$\mu_{Y,i}$	Long-term endpoint treatment mean for group i
$\sigma_Y^2$	Long-term endpoint variance
$\mu_{X,i}$	Short-term endpoint treatment mean for group i
$\sigma_X^2$	Short-term endpoint variance
$\rho$	Correlation between long-term and short-term endpoints within each treatment group
Random effects model parameters
$\nu_{Y,i}$	Mean long-term treatment mean for group i
$\tau_Y^2$	Variance of long-term treatment mean
$\nu_{X,i}$	Mean short-term treatment mean for group i
$\tau_X^2$	Variance of short-term treatment mean
$\rho_\mu$	Correlation between long-term and short-term treatment means

Two methods for use of short-term endpoint data for treatment selection in a two-stage trial have been proposed (Friede et al., 2011; Stallard, 2010). These methods are described briefly below. The aim of this paper is to compare these methods. This comparison will be based on model (1). We will consider two cases. In the first case, the fixed effects model, it is assumed that the true means $\mu_{X,i}$ and $\mu_{Y,i}$ can be specified so that these can be taken to be constant. In the second, the random effects model, the means will be taken to be random and to follow a bivariate normal distribution with

$$\begin{pmatrix} \mu_{X,i} \\ \mu_{Y,i} \end{pmatrix} \sim N\left( \begin{pmatrix} \nu_{X,i} \\ \nu_{Y,i} \end{pmatrix}, \begin{pmatrix} \tau_{X,i}^2 & \rho_{\mu,i}\,\tau_{X,i} \tau_{Y,i} \\ \rho_{\mu,i}\,\tau_{X,i} \tau_{Y,i} & \tau_{Y,i}^2 \end{pmatrix} \right) \qquad (2)$$

where $\nu_{X,i}$ and $\nu_{Y,i}$ denote the true means, $\tau_{X,i}^2$ and $\tau_{Y,i}^2$ denote the true variances, and $\rho_{\mu,i}$ denotes the true correlation between the means for the two endpoints for any given treatment. We assume that the random treatment means have the same variances and correlations, so we may drop the subscript and denote these by $\tau_X^2$, $\tau_Y^2$, and $\rho_\mu$, and that they are independent for different treatments, that is $\mathrm{cov}(\mu_{X,i}, \mu_{X,i'}) = 0$, $\mathrm{cov}(\mu_{Y,i}, \mu_{Y,i'}) = 0$, and $\mathrm{cov}(\mu_{X,i}, \mu_{Y,i'}) = 0$ for $i \neq i'$. The random effects model will allow us to model a situation in which we envisage that the treatments being evaluated are drawn at random from the distribution given by (2). In this case, the treatment means are considered to be unknown but correlated for the two endpoints with specified correlation and variance.

Method of Friede et al. (2011)

Friede et al. (2011) propose a method for selection of the treatment that will continue to the next stage based on the short-term endpoint only, selecting the experimental treatment with the largest observed sample mean at the interim analysis. Let

$$Z^X_i = \sqrt{\frac{N_1}{2}}\,\frac{\bar{X}_i - \bar{X}_0}{\sigma_X} \qquad (3)$$

denote the standardized test statistic for comparison of treatment $i$, $i = 1, \ldots, k$, to the control in terms of the short-term endpoint only on the basis of data available at the interim analysis, where $\bar{X}_i$ denotes the sample mean of the $N_1$ short-term observations in group $i$. The experimental treatment group with the highest value of $Z^X_i$ is then chosen to continue to the second stage along with the control while all other treatments are dropped. At the end of the trial, long-term endpoint data are available from all $n_2$ patients randomized to the selected treatment and the control. Thus, as the parameters of interest are the long-term endpoint means $\mu_{Y,i}$, only the long-term endpoint data will be used in the final analysis. In this method, the short-term data are used only for treatment selection, and the long-term data are used only for the final comparison of the selected treatment with the control.

In order to control the type I error rate, the final analysis must allow for the treatment selection. Friede et al. propose using a combination test approach to combine all data from those patients with any data observed at the interim analysis with the data from new patients observed at the end of the second stage, with a Dunnett correction applied to the first stage test statistics. In detail, let $Z^Y_{1,i}$ denote the standardized test statistic for comparison of group $i$ to the control group based on the long-term endpoint data from the $N_1$ patients per group who have short-term endpoint data available at the interim analysis. Let $p_{1,i} = 1 - \Phi(Z^Y_{1,i})$ denote a p-value based on $Z^Y_{1,i}$. Similarly, let $Z^Y_{2,I}$ denote the standardized test statistic for comparison of the selected group $I$ and the control group based on the additional long-term endpoint data observed at the end of the trial and let $p_{2,I} = 1 - \Phi(Z^Y_{2,I})$. Note that the $Z^Y_{1,i}$ are based on some data not observed at the time of the interim analysis and that $Z^Y_{2,I}$ is independent of all $Z^Y_{1,i}$ and of any data available at the interim analysis.

To allow for the treatment selection at the first stage, in order to test a null hypothesis $H_{\mathcal{K}}$, where $\mathcal{K}$ is some nonempty subset of $\{1, \ldots, k\}$ and $H_{\mathcal{K}}$ denotes the intersection hypothesis $\bigcap_{i \in \mathcal{K}} H_i$, the stage one p-value is obtained from a Dunnett test (Dunnett, 1955) using the test statistic in, for instance, equation (1) of Friede and Stallard (2008). This gives a stage one p-value $p_{1,\mathcal{K}}$ for the test of $H_{\mathcal{K}}$, corrected for the multiple comparisons. If the selected treatment, $I$, is in $\mathcal{K}$, a stage two p-value for testing $H_{\mathcal{K}}$, $p_{2,\mathcal{K}}$, is just that for testing the selected treatment, $p_{2,I}$. If $I \notin \mathcal{K}$, $p_{2,\mathcal{K}}$ is set to one to give a conservative test (Posch et al., 2005). The stage one and stage two p-values may then be combined, for example, using the weighted inverse normal combination function (Lehmacher and Wassmer, 1999)

$$C(p_{1,\mathcal{K}}, p_{2,\mathcal{K}}) = 1 - \Phi\left\{ w_1 \Phi^{-1}(1 - p_{1,\mathcal{K}}) + w_2 \Phi^{-1}(1 - p_{2,\mathcal{K}}) \right\} \qquad (4)$$

for predefined weights $w_1$ and $w_2$ with $w_1^2 + w_2^2 = 1$, which may be used to test $H_{\mathcal{K}}$. The construction of the p-values ensures that the stage two p-values are independent of any data available at the interim analysis, and hence of the treatment selection. The p-values obtained thus satisfy the weaker p-clud condition (Brannath et al., 2002), so that no further correction for the treatment selection is necessary and the combination test provides a test of $H_{\mathcal{K}}$ that controls the type I error rate at the nominal level for any treatment selection method. If the null hypothesis $H_I$ is rejected if and only if all $H_{\mathcal{K}}$ with $I \in \mathcal{K}$ are rejected, the type I error rate for the family of hypotheses $H_i$, $i = 1, \ldots, k$, is controlled in the strong sense (Marcus et al., 1976).
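The combination test and closed testing procedure described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not taken from the paper: the variance is treated as known, group sizes are equal (so the Dunnett statistics have pairwise correlation 1/2), the one-sided level `alpha = 0.025` and the equal default weights are arbitrary choices, and the function names are hypothetical. The Dunnett p-value is computed by one-dimensional numerical integration rather than with a multivariate normal routine.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def inverse_normal_combination(p1, p2, w1=sqrt(0.5), w2=sqrt(0.5)):
    """Weighted inverse normal combination of two p-values (w1^2 + w2^2 = 1)."""
    if abs(w1 * w1 + w2 * w2 - 1.0) > 1e-9:
        raise ValueError("weights must satisfy w1^2 + w2^2 = 1")
    if p1 >= 1.0 or p2 >= 1.0:   # a p-value of one gives a conservative result
        return 1.0
    if p1 <= 0.0 or p2 <= 0.0:
        return 0.0
    z = w1 * _STD.inv_cdf(1.0 - p1) + w2 * _STD.inv_cdf(1.0 - p2)
    return 1.0 - _STD.cdf(z)


def dunnett_p_value(z_max, m, steps=4000, lim=8.0):
    """P(max of m many-to-one z statistics >= z_max) under the global null.

    Assumes known variance and equal group sizes, so that each statistic is
    (U_i - U_0) / sqrt(2) with independent standard normal U_0, ..., U_m.
    """
    h = 2.0 * lim / steps
    total = 0.0
    for s in range(steps + 1):
        u = -lim + s * h
        weight = 0.5 if s in (0, steps) else 1.0  # trapezoidal rule
        total += weight * _STD.pdf(u) * _STD.cdf(sqrt(2.0) * z_max + u) ** m
    return max(0.0, 1.0 - total * h)


def reject_selected(z1, z2_selected, selected, alpha=0.025,
                    w1=sqrt(0.5), w2=sqrt(0.5)):
    """Closed test: reject the selected hypothesis only if every intersection
    hypothesis containing it is rejected by the combination test."""
    others = [i for i in range(len(z1)) if i != selected]
    p2 = 1.0 - _STD.cdf(z2_selected)  # stage two p-value for the selected arm
    for r in range(len(others) + 1):
        for extra in combinations(others, r):
            subset = (selected,) + extra
            p1 = dunnett_p_value(max(z1[i] for i in subset), len(subset))
            if inverse_normal_combination(p1, p2, w1, w2) > alpha:
                return False
    return True
```

Only intersection hypotheses that contain the selected treatment need to be checked, which is why the stage two p-value is always that of the selected arm in this sketch.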

Method of Stallard (2010)

Stallard (2010) proposes basing treatment selection on the maximum likelihood estimate of the long-term treatment effects, $\theta_i = \mu_{Y,i} - \mu_{Y,0}$, calculated at the interim analysis. Let $\tilde{Z}_{1,i}$ denote the standardized score statistic for $\theta_i$ obtained from all data available at the interim analysis. In the case that $\rho \neq 0$, this depends on the short-term data in addition to the long-term data. If $\sigma_X$, $\sigma_Y$, and $\rho$ are unknown, $\theta_i$ may be estimated using the double regression method proposed by Engel and Walstra (1991) (see Stallard, 2010), in which results of regression of $X$ on group membership for the $N_1$ patients per group with short-term data and of $Y$ on $X$ and group membership for the $n_1$ patients per group with long-term data are combined to give $\hat{\theta}_i$. For known $\sigma_X$, $\sigma_Y$, and $\rho$, $\tilde{Z}_{1,i}$ is shown by Hampson and Jennison (2013) to be given by

$$\tilde{Z}_{1,i} = \sqrt{\frac{\tilde{n}_1}{2}}\,\frac{\hat{\mu}_{Y,i} - \hat{\mu}_{Y,0}}{\sigma_Y}, \qquad \hat{\mu}_{Y,i} = \bar{y}_i - \rho\,\frac{\sigma_Y}{\sigma_X}\left(\bar{x}_i - \bar{X}_i\right) \qquad (5)$$

where

$$\tilde{n}_1 = \frac{n_1 N_1}{N_1 - \rho^2 (N_1 - n_1)}, \qquad (6)$$

$\bar{y}_i$ and $\bar{x}_i$ denote the sample means of the long- and short-term endpoints for the $n_1$ patients in group $i$ with both endpoints observed, and $\bar{X}_i$ denotes the sample mean of the $N_1$ short-term endpoint observations from group $i$ observed at the interim analysis. The quantity $\tilde{n}_1$ given by expression (6) can be viewed as an effective sample size per group, corresponding to the number of long-term observations per group that would give the same amount of information on $\theta_i$ as that available from the $n_1$ long-term and $N_1$ short-term responses allowing for the correlation $\rho$. If $\rho = 0$, so that long-term and short-term responses for any given patient are independent and the short-term observations give no information on $\theta_i$, $\tilde{n}_1 = n_1$. If $\rho = \pm 1$, so that short- and long-term responses are perfectly correlated, $\tilde{n}_1 = N_1$, so that the amount of information on $\theta_i$ is the same as if long-term data had been observed for all patients.

In the method described by Stallard (2010), treatment selection is based on the statistics $\tilde{Z}_{1,i}$, with the treatment group with the highest value of $\tilde{Z}_{1,i}$ being selected to continue to the second stage together with the control group. Note that, unlike the Friede et al. method, this method requires that at least some long-term endpoint data are available at the time of the interim analysis. At the end of the trial, long-term endpoint data are available from all $n_2$ patients randomized to the selected treatment and the control, so that as with the Friede et al. method, only the long-term endpoint data will be used in the final analysis. However, the final analysis using the Stallard method combines the evidence from the two stages in a different way to that suggested by Friede et al. Suppose that treatment $T_I$ is selected to continue with the control to the second stage. Let $\tilde{Z}_{2,I}$ denote the standardized score test statistic for $\theta_I$ based on all data available at the end of the trial, that is,

$$\tilde{Z}_{2,I} = \sqrt{\frac{n_2}{2}}\,\frac{\bar{Y}_I - \bar{Y}_0}{\sigma_Y},$$

where $\bar{Y}_i$ denotes the sample mean of the $n_2$ long-term observations in group $i$. Stallard derives the joint distribution of $(\tilde{Z}_{1,1}, \ldots, \tilde{Z}_{1,k}, \tilde{Z}_{2,I})$, showing that this is similar to that for test statistics in a seamless phase II/III trial with the primary endpoint alone used at an interim analysis with $\tilde{n}_1$ patients per group. The joint distribution of $\tilde{Z}_{1,I}$ and $\tilde{Z}_{2,I}$ can thus be obtained, allowing a critical value $c$ to be constructed so as to control the type I error rate if $H_0$ is rejected whenever $\tilde{Z}_{2,I} \geq c$.
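The effective sample size idea can be made concrete with a short Python sketch. The closed form used here is an assumption: it is the standard result for bivariate normal data with known variances and correlation when the long-term endpoint is missing for some patients, and it reproduces the two limiting cases stated above (no gain when the endpoints are uncorrelated, full gain when they are perfectly correlated). It should be checked against Stallard (2010) and Hampson and Jennison (2013) before use. All function and argument names are hypothetical.

```python
from math import sqrt


def effective_sample_size(n1, N1, rho):
    """Effective number of long-term observations per group (assumed form)."""
    if not 0 < n1 <= N1:
        raise ValueError("need 0 < n1 <= N1")
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    return n1 * N1 / (N1 - rho ** 2 * (N1 - n1))


def ml_long_term_mean(ybar_n1, xbar_n1, xbar_N1, rho, sigma_x, sigma_y):
    """Regression-adjusted estimate of one group's long-term mean.

    ybar_n1, xbar_n1: long- and short-term sample means of the n1 patients
    with both endpoints; xbar_N1: short-term mean of all N1 patients.
    """
    return ybar_n1 - rho * sigma_y / sigma_x * (xbar_n1 - xbar_N1)


def interim_statistic(mu_hat_trt, mu_hat_ctrl, n1, N1, rho, sigma_y):
    """Standardized interim statistic for one treatment versus control."""
    n_eff = effective_sample_size(n1, N1, rho)
    return sqrt(n_eff / 2.0) * (mu_hat_trt - mu_hat_ctrl) / sigma_y
```

For example, with 50 of 100 patients per group having long-term data, a within-group correlation of 0.5 gives an effective sample size only modestly above 50, which illustrates why the short-term data add little information unless the correlation is strong.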

COMPARISON OF METHODS: FIXED EFFECTS MODEL

We are interested in comparing the methods proposed by Stallard (2010) and Friede et al. (2011). We first consider the fixed effects model setting and explore the properties of the two methods for fixed treatment effects on the short- and long-term endpoints. The methods will be compared in terms of the probability of selecting an effective treatment in Section 3.1 and of the resulting power of the final analysis in Section 3.2.

Selection Probability

Although we wish to focus on the probability of selecting the correct treatment, we can define this in two different ways. For given treatment means for the long-term endpoints, $\mu_{Y,0}, \ldots, \mu_{Y,k}$, we could consider either the probability of selecting any effective treatment, that is choosing $I$ to be any $i$ with $\theta_i > 0$, or the probability of selecting the most effective treatment, that is choosing $I$ to be the $i$ that maximizes $\theta_i$. Throughout this paper, we will focus on the latter. Furthermore, we will, without loss of generality, generally consider scenarios in which $T_1$ has the best effect, and report the probability of selecting treatment $T_1$. The probability of selecting treatment $T_1$ with the Friede et al. method based on equation (3) is equal to

$$P\left(Z^X_1 > Z^X_i \text{ for all } i = 2, \ldots, k\right) \qquad (7)$$

while the probability of selecting treatment $T_1$ with the Stallard selection method based on equation (5) is equal to

$$P\left(\tilde{Z}_{1,1} > \tilde{Z}_{1,i} \text{ for all } i = 2, \ldots, k\right). \qquad (8)$$

These probabilities could be estimated via simulation. Alternatively, for $\sigma_X$, $\sigma_Y$, and $\rho$ assumed known, they can be calculated exactly from the joint distributions of $(Z^X_1, \ldots, Z^X_k)$ and $(\tilde{Z}_{1,1}, \ldots, \tilde{Z}_{1,k})$, respectively. These distributions are given in the Appendix. Selection probabilities can thus be found using standard numerical routines for calculation of multivariate normal tail areas, for example using pmvnorm in R (Genz et al., 2012). Computer code to perform these calculations and the simulations described below in R can be obtained from the corresponding author.

For the Friede et al. method, the selection probability depends only on $N_1$ and the standardized short-term endpoint treatment effects, $(\mu_{X,i} - \mu_{X,0})/\sigma_X$, while for the Stallard method, the selection probability depends on $n_1$, $N_1$, and the correlation between the endpoints, $\rho$, through $\tilde{n}_1$, and on the standardized long-term endpoint treatment effects, $\theta_i/\sigma_Y$, but not on the short-term endpoint treatment effects. The upper panels (panels A1 and B1) of Fig. 1 show the probability of selecting treatment $T_1$ using the Friede et al. and Stallard selection methods when three experimental treatments are included in the first stage and . Panel A1 gives the selection probability with for different stage one sample sizes for a range of $\rho$ values. Panel B1 gives the selection probability with , , and for a range of values (with $\sigma_X = \sigma_Y = 1$ so that these are the standardized values), again for a range of $\rho$ values.
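As noted above, the selection probabilities could also be estimated via simulation. The following Python sketch does this under the fixed effects model with known variances, and is not the authors' R code. It relies on two simplifications that are assumptions of this illustration: the control group estimate is common to all comparisons and so cancels when finding the largest statistic, and each arm's estimate is drawn directly from its normal sampling distribution instead of simulating individual patients. The effective sample size formula and all names and parameter values are hypothetical.

```python
import random
from math import sqrt


def effective_sample_size(n1, N1, rho):
    """Assumed closed form for the effective number of long-term observations."""
    return n1 * N1 / (N1 - rho ** 2 * (N1 - n1))


def selection_probability(means, sd, n_per_group, n_sim=20000, seed=1):
    """Monte Carlo estimate of P(treatment 1 has the largest estimated mean).

    means: true means of the experimental arms, treatment 1 first.
    Each arm's estimate is drawn as Normal(mean, sd^2 / n_per_group).
    """
    rng = random.Random(seed)
    se = sd / sqrt(n_per_group)
    hits = 0
    for _ in range(n_sim):
        estimates = [rng.gauss(m, se) for m in means]
        if estimates[0] == max(estimates):
            hits += 1
    return hits / n_sim


# Illustrative values only: three experimental arms, treatment 1 effective.
N1, n1, rho = 100, 50, 0.5
short_term_means = [0.3, 0.0, 0.0]   # drives selection on short-term data only
long_term_means = [0.3, 0.0, 0.0]    # drives selection on the ML estimate
p_short_only = selection_probability(short_term_means, 1.0, N1)
p_ml_estimate = selection_probability(
    long_term_means, 1.0, effective_sample_size(n1, N1, rho))
```

With equal standardized effects on the two endpoints, the only difference between the two calls is the sample size entering the standard error, which mirrors the comparison between the number of short-term observations and the effective sample size.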

Figure 1

Probability to select treatment 1 (panels A1 and B1) and power (panels A2 and B2) for the Stallard (2010) and Friede et al. (2011) methods for different parameter settings under the fixed effects model.

As indicated above, the probability of selection with the Friede et al. method does not depend on $\rho$. With $(\mu_{X,i} - \mu_{X,0})/\sigma_X = \theta_i/\sigma_Y$ for all $i$, the probability of selection with the Stallard method is equal to that with the Friede et al. method when $\rho = \pm 1$, when the most information is obtained from the $N_1 - n_1$ observations per group for whom only short-term endpoint data are available and $\tilde{n}_1 = N_1$. For $|\rho| < 1$, the probability of selecting treatment $T_1$ is lower for the Stallard method than that for the Friede et al. method when the effects of treatment $T_1$ on the short- and long-term endpoints exceed 0, so that treatment $T_1$ is actually the most effective, with the difference between the two methods larger for larger . The selection probability for the Stallard method is smallest, and most different from that for the Friede et al. method, when $\rho = 0$ and $\tilde{n}_1 = n_1$. The selection probability in this case is equal to that for a method that selects the best treatment solely on the basis of the long-term endpoint data from $n_1$ patients per group available at the interim analysis and so, unsurprisingly, decreases with decreasing $n_1$.

Since the selection probability for the Stallard method depends on the standardized long-term treatment effects and not on the short-term treatment effects, whilst that for the Friede et al. method depends on the standardized short-term treatment effects and not on the long-term treatment effects, panel B1 of Fig. 1 enables comparison of selection probabilities in settings in which the standardized effects on the two endpoints differ. Although when the standardized effects are equal the probability of selecting treatment $T_1$ with the Friede et al. method is always as great as that for the Stallard method, it can be seen that this probability may be lower for the Friede et al. method when .

Power

As was the case with the probability of correct selection, we can define the power in different ways. In order to be consistent with the definition of selection probability, we define the power as the probability of rejecting the false null hypothesis corresponding to the most effective treatment, that is of rejecting $H_I$ when $I$ is the $i$ that maximizes $\theta_i$. This definition is closely related to the "individual power" defined as the probability of rejecting a particular false null hypothesis (Westfall et al., 2011). The difference is that in the case of the individual power the null hypothesis we are interested in is specified in advance. Note that other definitions for the power are possible, such as, for example, defining the power as the probability to reject any false null hypothesis. For a discussion of different power concepts in the context of multiple testing see Westfall et al. (2011).

Assuming as above, without loss of generality, that the treatment effect on the long-term endpoint, $\theta_i$, is largest for $i = 1$, the power for the Friede et al. method is equal to

$$P\left(Z^X_1 > Z^X_i \text{ for all } i = 2, \ldots, k \text{ and } C(p_{1,\mathcal{K}}, p_{2,\mathcal{K}}) \leq \alpha \text{ for all } \mathcal{K} \ni 1\right) \qquad (9)$$

where $C$ is the combination function defined by (4) and $\alpha$ is the nominal one-sided level, so that $C(p_{1,\mathcal{K}}, p_{2,\mathcal{K}}) \leq \alpha$ for all $\mathcal{K} \ni 1$ corresponds to rejection of $H_1$ in the Friede et al. method using the combination test and closed testing procedure as described above. For the Stallard method, the power is equal to

$$P\left(\tilde{Z}_{1,1} > \tilde{Z}_{1,i} \text{ for all } i = 2, \ldots, k \text{ and } \tilde{Z}_{2,1} \geq c\right) \qquad (10)$$

where $c$ is the critical value obtained to control the type I error rate using the method of Stallard (2010). For the Stallard method, the power depends on $\tilde{n}_1$, $n_2$, and the standardized long-term endpoint treatment effects, $\theta_i/\sigma_Y$, but not on the short-term endpoint effects. For the Friede et al. method, as the selection is based on the short-term endpoint data and the final test on the long-term endpoint data, the power depends on the standardized short-term endpoint treatment effects in addition to the sample sizes $N_1$ and $n_2$ and the standardized long-term endpoint treatment effects.

As (9) and (10) involve data from both stage one and stage two, analytic calculation of the power is less straightforward than that for the selection probabilities. The power values can most easily be estimated through simulation of data from the fixed effects model (1). This also allows the assumption of known $\sigma_X$, $\sigma_Y$, and $\rho$ to be relaxed. Simulated power values for the two methods are shown in the lower panels (panels A2 and B2) of Fig. 1 in the same settings as the selection probabilities shown in the upper panels and discussed above with . Estimated power values plotted are based on 10,000 simulations for each of the scenarios considered.

For the larger effect sizes as shown in panel A2 and the upper curves in panel B2, the power is very similar to the selection probability shown in the upper two panels. In this case, if treatment $T_1$ is selected in stage one, the combination of the larger stage two sample size and the large effect size means that it is very likely to be shown to be superior to the control. For smaller effect sizes, there is a larger chance of failing to demonstrate superiority even if the treatment $T_1$ is correctly selected, so that power values are smaller than the selection probabilities. In this case, for extreme values of $\rho$ or for very small treatment effects the Friede et al. method may be less powerful than the Stallard method. It is also interesting to note that while the power for the Stallard method is the same for positive and negative values of $\rho$ of the same magnitude, for the Friede et al. method the power appears to be slightly lower for negative $\rho$ than for positive $\rho$. Figure 1 shows power values for . As the power cannot exceed the selection probability, we may note, as above, that the Stallard method will be more powerful than the Friede et al. method if the standardized long-term treatment effect is sufficiently large compared to the standardized short-term treatment effect.

COMPARISON OF METHODS: RANDOM EFFECTS MODEL

For the fixed effects model, the distributional forms and calculated values given above show that the probability of selecting treatment $T_1$ and the power to reject the null hypothesis for this treatment, $H_1$, are higher for the Friede et al. selection method than for the Stallard method when the standardized treatment effects on the short- and long-term endpoints are equal, but can be lower when the short-term effects are smaller. Unsurprisingly, given that the Friede et al. selection method relies solely on short-term endpoint observations, the performance of the method is good when the effects on the short-term endpoint are similar to (or larger than) those on the long-term endpoint, but may be poor when they are smaller or reversed. In order to capture the relationship between the treatment effects on the short-term endpoint and those on the long-term endpoint, it is therefore interesting to consider the random effects model introduced above, in which the correlation between the treatment means is explicitly included in the statistical model.

As with the fixed effects model, we will consider the probability of selecting treatment $T_1$. Since the mean effect for this treatment, $\mu_{Y,1}$, is now considered to be a random variable, however, treatment $T_1$ might not always be the most effective even if $\nu_{Y,1} > \nu_{Y,i}$ for $i = 2, \ldots, k$. We will therefore focus on the probability of selecting treatment $T_1$ given that it is the most effective treatment, that is given that $\mu_{Y,1} > \mu_{Y,i}$ for all $i = 2, \ldots, k$. This is given by

$$P\left(Z^X_1 > Z^X_i \text{ for all } i = 2, \ldots, k \mid \mu_{Y,1} > \mu_{Y,i} \text{ for all } i = 2, \ldots, k\right)$$

in the case in which selection is made using the Friede et al. method, and by

$$P\left(\tilde{Z}_{1,1} > \tilde{Z}_{1,i} \text{ for all } i = 2, \ldots, k \mid \mu_{Y,1} > \mu_{Y,i} \text{ for all } i = 2, \ldots, k\right)$$

in the case when selection is made using the Stallard method. These probabilities may be evaluated using the joint distributions of the $Z^X_i$ and $\mu_{Y,i}$ or of the $\tilde{Z}_{1,i}$ and $\mu_{Y,i}$ given in the Appendix.

Figure 2 shows the probability of selecting treatment $T_1$ given that this is actually the most effective treatment when selection uses either the Friede et al. or the Stallard method. Selection probabilities are shown for a range of $\rho$ values in the setting in which . In panels A1 and A2, and for , so that on average the first treatment is effective on both endpoints and all others are not, and . 
The separate lines give the treatment selection for different sample sizes. In panels B1 and B2, , and , for and with different lines on the plot corresponding to different treatment effects. In panels C1 and C2, and for , and and the different lines correspond to different values of and . The right-hand column in the figure shows selection probabilities for , that is when there is perfect correlation between the means on the two endpoints, and the left-hand column those for .

Figure 2

Probability to select treatment 1 based on the methods by Stallard (2010) and by Friede et al. (2011) for different parameter settings under the random effects model (given that treatment 1 is the most effective).

The selection probabilities for $\rho_\mu = 1$, where $\rho_\mu$ is the correlation between the treatment means on the two endpoints, shown in the right-hand column are generally similar to those for the fixed effects model with equal standardized effects on the two endpoints given in Fig. 1. The best treatment is more likely to be selected using the Friede et al. method than the Stallard method, with the two methods coinciding when $\rho = \pm 1$. The main difference between the selection probabilities under the random effects model and the fixed effects model in this case is that under the random effects model there is very little effect of the average treatment effect, $\nu_{Y,1}$, in contrast to the results for the fixed effects model considered above. This is reasonable given that the figure shows the probability of selecting $T_1$ given that the actual treatment effect is largest for that treatment, that is given $\mu_{Y,1} > \mu_{Y,i}$ for all $i = 2, \ldots, k$. An increase in $\sigma_Y$ for the Stallard method or in $\sigma_X$ for the Friede et al. method does, however, reduce the probability of selecting treatment $T_1$, as the standardized average difference between the treatments on the long- or short-term endpoint, respectively, is reduced. As the Friede et al. method uses only the short-term endpoint data for the selection, it is not surprising that it performs well when the means on the two endpoints are perfectly correlated, since the selection is based on a larger number of observations and a treatment performing well on the short-term endpoint is more likely to have a large long-term endpoint mean.

The left-hand column shows selection probabilities for . The selection probabilities for the Stallard method do not depend on $\rho_\mu$, so that these are exactly the same as those in the panels in the right-hand column. The Friede et al. method selects the correct treatment with lower probability than when the short-term and long-term treatment means are perfectly correlated; in this case the short-term endpoint means are less predictive of the treatment with the largest long-term responses. In this case, the Stallard method can lead to a higher probability of correctly selecting treatment $T_1$, particularly when is high. Smaller values of the correlation $\rho_\mu$ will result in worse performance of the Friede et al. method.

The latter point is illustrated more clearly in Fig. 3. This shows the probability under the random effects model of correctly selecting treatment $T_1$ given that this is the most effective for , , , , for the Stallard method for a range of $\rho$ values and for the Friede et al. method for a range of $\rho_\mu$ values. Since the selection probabilities for the Stallard method do not depend on $\rho_\mu$ and those for the Friede et al. method do not depend on $\rho$, the two lines are shown on the same graph. Comparing the two lines, we see that the Stallard method always has a higher selection probability than the Friede et al. method if $\rho = \rho_\mu$ except when $\rho = \rho_\mu = 1$, when both probabilities are the same. The three horizontal lines represent selection probabilities for the Friede et al. method where $\rho_\mu$ is fixed to either 1 (short dash), 0.95 (dash dot), or 0.9 (long dash). Comparing these lines with those for the Stallard method, we observe that if $\rho_\mu = 1$ the Friede et al. method is always better than the Stallard method regardless of $\rho$ (with the exception of $\rho = \pm 1$, when the selection probabilities for the two methods are again equal). If $\rho_\mu = 0.95$, the Friede et al. method is only better than the Stallard method if $\rho$ is small, while if $\rho_\mu = 0.9$, the Stallard method is always better than the Friede et al. method regardless of $\rho$.

Figure 3

Probability to select treatment 1 based on the methods by Stallard (2010) for a range of $\rho$ values and by Friede et al. (2011) for a range of $\rho_\mu$ values under the random effects model (given that treatment 1 is the most effective).

When $\rho_\mu < 0$, where $\rho_\mu$ is the correlation between the treatment means on the two endpoints, the probability of selecting treatment $T_1$ using the Friede et al. method can be low. The probability approaches zero as $\rho_\mu$ approaches −1 and the treatment effects on the long- and short-term endpoints consistently go in opposite directions.

In a similar approach to that used for evaluation of the treatment selection probabilities, we consider the power, defined to be the probability that treatment $T_1$ is selected at the interim analysis and found to be significantly superior to the control at the final analysis conditional on it actually being the best treatment, that is on $\mu_{Y,1} > \mu_{Y,i}$ for all $i = 2, \ldots, k$. As in the fixed effects model, the power will again be estimated via simulation. In this case, data are simulated from the random effects model given by (1) and (2). In detail, for each simulation, treatment means are first simulated from (2); then, given these treatment mean values, data are simulated from (1). Simulated power values are shown in Fig. 4 under the same scenarios as Fig. 2. As in the fixed effects setting, for reasonably large standardized effect sizes, the power is similar to the selection probability, but is slightly lower when the standardized effect is smaller, either because of a reduction in the effect size or an increase in the within-treatment variance.
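The two-step simulation scheme described above can be sketched in Python for the selection step alone (the final test is omitted). This is an illustration under several assumptions that are not taken from the paper: variances and correlations are treated as known, each arm's interim estimate is drawn directly from its sampling distribution instead of simulating individual patients, the two selection rules are evaluated with independent noise (which is enough for their separate conditional selection probabilities), and all names and parameter values are hypothetical.

```python
import random
from math import sqrt


def conditional_selection_probs(nu_x, nu_y, tau_x, tau_y, rho_mu,
                                sigma_x, sigma_y, n_short, n_eff,
                                n_sim=6000, seed=1):
    """Estimate P(select treatment 1 | treatment 1 has the largest true
    long-term mean) for selection on short-term means and on long-term
    estimates with effective sample size n_eff."""
    rng = random.Random(seed)
    k = len(nu_y)
    se_x = sigma_x / sqrt(n_short)
    se_y = sigma_y / sqrt(n_eff)
    root = sqrt(1.0 - rho_mu ** 2)
    kept = short_hits = long_hits = 0
    for _ in range(n_sim):
        # Step 1: draw correlated true treatment means for each arm.
        mu_x, mu_y = [], []
        for i in range(k):
            a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            mu_y.append(nu_y[i] + tau_y * a)
            mu_x.append(nu_x[i] + tau_x * (rho_mu * a + root * b))
        # Step 2: draw interim estimates given the true means.
        x_hat = [rng.gauss(m, se_x) for m in mu_x]
        y_hat = [rng.gauss(m, se_y) for m in mu_y]
        # Condition on treatment 1 truly being the best on the long-term endpoint.
        if mu_y[0] < max(mu_y):
            continue
        kept += 1
        short_hits += x_hat[0] == max(x_hat)
        long_hits += y_hat[0] == max(y_hat)
    if kept == 0:
        raise ValueError("no replicate satisfied the conditioning event")
    return short_hits / kept, long_hits / kept


# Illustrative comparison: perfectly correlated versus uncorrelated means.
zeros = [0.0, 0.0, 0.0]
short_corr, long_corr = conditional_selection_probs(
    zeros, zeros, 1.0, 1.0, 1.0, 1.0, 1.0, 100, 50)
short_uncorr, long_uncorr = conditional_selection_probs(
    zeros, zeros, 1.0, 1.0, 0.0, 1.0, 1.0, 100, 50)
```

In this sketch the rule based on long-term estimates is unaffected by the correlation between the treatment means, whereas the rule based on short-term means degrades towards a random choice as that correlation falls to zero.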

Figure 4

Power for the Stallard (2010) and Friede et al. (2011) methods for different parameter settings under the random effects model (given that treatment 1 is the most effective).

DISCUSSION

There has been much recent interest in adaptive seamless phase II/III clinical trials in which randomization is initially between a number of experimental treatments and a control, with less effective treatments dropped from the study on the basis of results from an interim analysis. Building on methods using short-term information to supplement long-term information originally developed in the context of interim analyses for early stopping, two methods have been proposed for using short-term endpoint data in the treatment selection (Stallard, 2010; Friede et al., 2011). In this paper, we have compared these two methods. Our aim has been to provide a comparison that will enable choice of the most appropriate method when designing an adaptive seamless phase II/III trial. In the Friede et al. method, only the short-term endpoint data are used for the treatment selection. In contrast, the Stallard method uses a combination of short- and long-term endpoint data. The latter method can thus only be used when some long-term responses are available for inclusion in the interim analysis. In both methods, the final analysis is based on the long-term endpoint data alone from the selected treatment and control. This is in contrast to other group-sequential methods in which it is desired to draw inference on both endpoints, for example requiring both to be sufficiently promising (see, e.g., Jennison and Turnbull, 1993; Kimani et al., 2009), or with early and late observations of the same endpoint treated as longitudinal data (see, e.g., Spiessens et al., 2000; Lee et al., 1996). Our comparison has considered scenarios in which the treatment means are taken to be fixed, with one treatment more effective than all others and the control, which are equally effective, and scenarios in which the treatment means are taken to be random but are correlated.
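The idea of combining short- and long-term data at the interim analysis, as in the Stallard method described above, can be illustrated with a regression-adjusted mean, a standard estimator for bivariate normal data in which the long-term response is still missing for some patients. The sketch below is an assumption-laden illustration of that general idea and is not claimed to be the exact estimator of Stallard (2010); the function name and argument layout are hypothetical.

```python
from statistics import mean


def regression_adjusted_mean(short_all, long_obs):
    """Estimate a long-term mean using additional short-term data.

    short_all: short-term responses for all interim patients; the first
               len(long_obs) entries belong to the patients in long_obs.
    long_obs:  long-term responses for the patients who already have them.
    """
    n = len(long_obs)
    if n < 2 or n > len(short_all):
        raise ValueError("need 2 <= len(long_obs) <= len(short_all)")
    x_paired = short_all[:n]
    x_bar_n, y_bar_n = mean(x_paired), mean(long_obs)
    sxx = sum((x - x_bar_n) ** 2 for x in x_paired)
    if sxx == 0:
        raise ValueError("paired short-term responses must vary")
    sxy = sum((x - x_bar_n) * (y - y_bar_n)
              for x, y in zip(x_paired, long_obs))
    slope = sxy / sxx  # least-squares slope of long-term on short-term
    # Shift the long-term mean by the slope times the gap between the
    # short-term mean in all patients and in the paired subset.
    return y_bar_n + slope * (mean(short_all) - x_bar_n)


print(regression_adjusted_mean([1, 2, 3, 4, 5, 6], [3, 5, 7]))
```

When the within-group correlation is high, the slope is well estimated and the short-term data on the additional patients carry real information about the long-term mean; when no extra short-term data are available, the estimate reduces to the plain long-term sample mean.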
A summary of the effects of the different model parameters on the selection probability, based on the simulations reported above, is given in Table 2. Our results indicate that under the fixed effects model, if the treatment effect on the short-term endpoint is as large as or larger than that on the long-term endpoint for the effective treatment, the Friede et al. method is more likely to lead to selection of the most effective treatment, and is correspondingly more powerful. If the effect on the short-term endpoint is smaller than that on the long-term endpoint, the Stallard method may be more likely to select the correct treatment and more powerful, particularly when the within-group correlation between the endpoints is high.

Table 2

Summary impact of model parameters on selection probabilities

Sample sizes
n₁	Larger values reduce impact of short-term endpoint data.
N₁	Larger numbers increase impact of short-term endpoint data.
n₂	Larger values increase power but do not influence treatment selection.
Fixed or random effects model parameters
	More disperse values increase differences between treatments making treatment selection easier.
	Larger values increase variability making treatment selection harder.
	More disperse values increase differences between treatments making treatment selection easier with Friede et al. method. No impact on Stallard method.
	Larger values increase variability making treatment selection harder.
	Larger values (of ) increase influence of short-term endpoints in Stallard method. No impact in Friede et al. method.
Random effects model parameters
	More disperse values increase differences between treatments making treatment selection easier.
	Larger values make treatment means more disperse making treatment selection easier.
	More disperse values increase differences between treatments making treatment selection easier with Friede et al. method. No impact on Stallard method.
	Larger values make treatment means more disperse making treatment selection easier with Friede et al. method. No impact on Stallard method.
	Larger values make treatment effects on two endpoints more closely related and improve treatment selection with Friede et al. method. No impact on Stallard method.

Under the random effects model, the effect of correlation between the treatment means on the two endpoints can be considered. This parameter gives an indication of the extent to which treatment effects on the long- and short-term endpoints go in the same direction. In this case, our results indicate that the Friede et al. method leads to a higher probability of selecting the best treatment and to higher power only when the correlation between the treatment means is sufficiently high. The threshold depends on the sample sizes and variances, but we have shown that even when the number of patients for whom long-term endpoint data are available at the interim analysis is small, under the scenarios we have considered, the Friede et al. method is less powerful unless the correlation between the means is relatively high; in the scenario we considered, above 0.9 when the within-group variance and between-group variance are both equal to 1. In order to choose between the different methods, some estimates of the model parameters, including the variances and correlations in (1) and (2), are required. In some cases, data from other trials will be available, particularly to give information on the parameters in (1). The correlation can vary considerably depending on the setting and endpoints chosen. Julious and Mullee (2008), for example, report a correlation of 0.67 between the same endpoint measured at baseline and at the end of the trial, so that the correlation between an early and final measurement of this endpoint would presumably be higher than this, whereas Chataway et al. (2011) report a correlation of 0.13 between two different endpoints, though it was still proposed to use the early endpoint for treatment selection. The parameters in (2) are harder to estimate since their estimation requires data from a number of different trials or treatments.
If detailed information on parameter values is unavailable, it may still be possible to guess plausible ranges for the parameters, or to use the methods described above to conduct sensitivity analyses. We are currently working on approaches that use the data from the first stage of the trial to estimate the parameters of (1) and (2) and to decide between the different treatment selection strategies on the basis of these estimates. Our comparison of the procedures has used a combination of analytic calculations based on multivariate normal distributions to calculate selection probabilities and simulations to estimate the power. The simulations can be time-consuming when an extensive search for an appropriate sample size is required, or when it is desirable to explore the tradeoff between patients in stages one and two of the trial. The power is bounded above by the selection probability and, in many of the settings considered above, the two probabilities are quite similar. This is likely to be particularly true when the assumed effect size is relatively large and the sample size for the second stage is substantially larger than that for the first stage. For example, in the settings described above, with three treatments compared to a control treatment on the basis of long-term data on 5 or 15 patients per group and short-term data on 20, 50, or 100 patients per group at the interim analysis, a final sample size of 200 per group, and a standardized effect size of 0.5 on both endpoints for the sole effective treatment, we found that the estimated power was at least 97.5% of the selection probability. In such cases, an approximate sample size calculation could be based on the selection probability using the analytic calculations described. If necessary, this could be followed by a much more restricted set of simulations to confirm the power of the final design chosen.
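An analytic selection probability of the kind mentioned above can be sketched for the simplest case. The example assumes a fixed effects setting in which one experimental arm has a standardized advantage over `k - 1` other, equally effective experimental arms, the arms are independent with equal sample sizes and known common variance, and selection picks the arm with the largest sample mean on a single endpoint. Under these assumptions the probability reduces to a one-dimensional normal integral, evaluated here with Simpson's rule. The function names and arguments are hypothetical, and this is not the multivariate normal calculation given in the Appendix.

```python
import math


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def prob_select_best(effect, n, k, steps=4000, lim=10.0):
    """P(arm 1 has the largest sample mean among k independent arms).

    effect: standardized advantage of arm 1 over each of the other arms
    n:      patients per arm contributing to the sample means
    k:      number of arms taking part in the selection
    steps:  number of Simpson intervals (must be even)
    """
    shift = effect * math.sqrt(n)
    h = 2.0 * lim / steps
    total = 0.0
    for i in range(steps + 1):
        z = -lim + i * h
        # Simpson weights: 1, 4, 2, 4, ..., 2, 4, 1
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        # Given arm 1's standardized mean z, each other arm falls below
        # it with probability Phi(z + shift).
        total += weight * norm_pdf(z) * norm_cdf(z + shift) ** (k - 1)
    return total * h / 3.0


for n in (20, 50, 100):
    print(n, prob_select_best(0.5, n, 3))
```

Since the power cannot exceed the selection probability, a value computed this way gives a quick upper bound when screening candidate first-stage sample sizes, with simulation reserved for confirming the design finally chosen.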

A. APPENDIX: DISTRIBUTIONS REQUIRED FOR CALCULATION OF TREATMENT SELECTION PROBABILITIES

A.1. Fixed Effects Model

Calculation of the probability of selecting treatment T₁ using the Friede et al. and Stallard methods under the fixed effects model requires the joint distribution of and , respectively. Detailed derivations of these are given in the online Supplementary Material, leading to and

A.2. Random Effects Model

Treatment selection probabilities using the Friede et al. and Stallard methods under the random effects model may be evaluated using the joint distributions of and or of and , respectively. The joint distribution of and is given by where with and The joint distribution of and is given by where with and given by (6). Detailed derivations are again given in the online Supplementary Material.

ACKNOWLEDGMENTS

We are grateful to the Editor and two anonymous reviewers for their helpful comments on this paper.

SUPPLEMENTAL MATERIAL

Supplemental data for this article can be accessed on the publisher’s website.

FUNDING

The work was funded by UK Medical Research Council grant number G1001344.

25 in total

Review 1. An overview of group sequential methods in longitudinal clinical trials.

Authors: B Spiessens; E Lesaffre; G Verbeke; K Kim; D L DeMets
Journal: Stat Methods Med Res Date: 2000-10 Impact factor: 3.021

2. Adaptive sample size calculations in group sequential trials.

Authors: W Lehmacher; G Wassmer
Journal: Biometrics Date: 1999-12 Impact factor: 2.571

3. Sequential designs for phase III clinical trials incorporating treatment selection.

Authors: Nigel Stallard; Susan Todd
Journal: Stat Med Date: 2003-03-15 Impact factor: 2.373

Review 4. Seamless phase II/III designs.

Authors: Nigel Stallard; Susan Todd
Journal: Stat Methods Med Res Date: 2010-08-19 Impact factor: 3.021

Review 5. An introduction to adaptive designs and adaptation in CNS trials.

Authors: Vladimir Dragalin
Journal: Eur Neuropsychopharmacol Date: 2011-02 Impact factor: 4.600

6. A novel adaptive design strategy increases the efficiency of clinical trials in secondary progressive multiple sclerosis.

Authors: Jeremy Chataway; Richard Nicholas; Susan Todd; David H Miller; Nicholas Parsons; Elsa Valdés-Márquez; Nigel Stallard; Tim Friede
Journal: Mult Scler Date: 2010-08-26 Impact factor: 6.312

7. Interim monitoring of clinical trials based on long-term binary endpoints.

Authors: I C Marschner; S L Becker
Journal: Stat Med Date: 2001-01-30 Impact factor: 2.373

8. Dose selection in seamless phase II/III clinical trials based on efficacy and safety.

Authors: Peter K Kimani; Nigel Stallard; Jane L Hutton
Journal: Stat Med Date: 2009-03-15 Impact factor: 2.373

9. Designing a seamless phase II/III clinical trial using early outcomes for treatment selection: an application in multiple sclerosis.

Authors: T Friede; N Parsons; N Stallard; S Todd; E Valdes Marquez; J Chataway; R Nicholas
Journal: Stat Med Date: 2011-02-22 Impact factor: 2.373