
Two-Stage Single-Arm Trials Are Rarely Analyzed Effectively or Reported Adequately.

Michael J Grayling, Adrian P Mander.

Abstract

PURPOSE: Two-stage single-arm designs have historically been the most common design used in phase II oncology. They remain a mainstay today, particularly for trials in rare subgroups. Consequently, it is imperative such studies be designed, analyzed, and reported effectively. We comprehensively review such trials to examine whether this is the case.
METHODS: Oncology trials that used Simon's two-stage design over a 5-year period were identified and reviewed. They were evaluated for whether they reported sufficient design (eg, required sample size) and analysis (eg, CI) details. Articles that did not adjust their inference for the incorporation of an interim analysis were also reanalyzed.
RESULTS: Four hundred twenty-five articles were included. Of these, just 47.5% provided the five components that ensure design reproducibility. Only 1.2% and 2.1% reported an adjusted point estimate or CI, respectively. Just 55.3% provided the final-stage rejection bound, indicating many trials did not test a hypothesis for their primary outcome. Trial reanalyses suggested reported point estimates underestimated treatment effects and reported CIs were too narrow.
CONCLUSION: Key design details of two-stage single-arm trials are often unreported. Their inference is rarely performed in a way that removes the bias introduced by the interim analysis. These findings are particularly alarming when considered against the growing trend in which nonrandomized trials make up a large proportion of all evidence on a treatment's effectiveness in a rare biomarker-defined patient subgroup. Future studies must improve the way they are analyzed and reported.
© 2021 by American Society of Clinical Oncology.


Year:  2021        PMID: 34950839      PMCID: PMC8691516          DOI: 10.1200/PO.21.00276

Source DB:  PubMed          Journal:  JCO Precis Oncol        ISSN: 2473-4284


BACKGROUND

For many types of cancers, randomized trials are becoming more common in phase II.[1] However, recent analyses indicate single-arm designs remain most widely used.[2] Additionally, as more cancer studies investigate treatments targeting particular molecular alterations, it is likely single-arm trials will remain commonly used in oncological drug development, given the consensus opinion that rarer subgroups are one area in which a single-arm trial is a logical design.[1]

CONTEXT

Key objective: Accurate reporting of clinical trial design and analysis is critically important for scientific reproducibility. Simon's two-stage design is among the most commonly used designs in cancer research. We use 425 recent reports on the results of phase II oncology trials to determine how the cancer community can improve its communication of such trials.
Knowledge generated: Many important features of the design and analysis of the included trials were not adequately described in the reports. Efficient design alternatives to the conventional optimal and minimax designs were rarely used. Numerous papers have now been published that help better analyze Simon's two-stage trials, but we found little evidence of their use in practice.
Relevance: Greater care is needed at the design, analysis, and reporting stages of trials using Simon's two-stage design. This may improve knowledge transfer on estimated patient response rates and is particularly relevant given the growing trend of nonrandomized trials for evaluating treatment effectiveness in rare biomarker-defined patient subgroups.

In single-arm trials, the primary outcome is often dichotomous,[2] typically chosen as objective response[1] through RECIST.[3] Among the available single-arm designs for a binary outcome, Simon's two-stage design[4] is generally preferred.[5] The habitual use of Simon's design has prompted much research into its effective utilization. Recent work includes methodology to account for deviation from the planned design,[6-8] criteria to simultaneously optimize design and analysis,[9] and evaluations of the value of such trials within wider drug development plans.[10] Indeed, many publications have now addressed how to handle issues that can arise in trials using Simon's design. Nonetheless, it is not known to what extent the advice provided has permeated through to practice.

Several authors have evaluated the reporting of phase II oncology trials without differentiating by design.
Grellety et al[11] reviewed 156 phase II oncology trials published in 2011, assessing the quality of reporting using two scores. One of these, the Key Methodological Score (KMS), consisted of three items: provision of a clear (1) definition of a criterion of principal judgment, (2) justification for the number of patients included, and (3) definition of the population on which the principal and/or secondary judgment criteria were evaluated. They found the median KMS was 2/3, whereas only 16.1% of the studies had a KMS of 3/3. Langrand-Escure et al[2] reviewed 557 phase II and phase II/III oncology trials published in 2010-2015 in three high-impact journals, also appraising the quality of reporting using the KMS. They concluded just 26.2% of the articles had a KMS of 3/3. They additionally found a sample size calculation was missing in 66% of the articles.

These findings are concerning, but it is possible they only scratch the surface of the issues in the use of two-stage single-arm designs in practice. No paper has sought to ascertain the degree to which precise components of the design of such trials are included in the published reports. Moreover, no research has evaluated the frequency with which trialists have heeded the recommendations of the many articles that argue for the need for the final analysis to be adjusted to account for the interim analysis. Finally, the extent to which deviation from the planned design occurs in practice, or the impact of this on study error rates, is unknown.

Given the extent of the use of two-stage single-arm designs in practice, it is paramount such studies be designed, analyzed, and reported effectively. This is particularly true when a confirmatory randomized trial is unlikely to be possible; the single-arm trial then forms the majority of evidence from which important decisions (eg, around licensing) must be made.
With little known about the quality of articles on trials that used a two-stage design, we sought to systematically review a large number of such trials to ascertain issues in design, analysis, and reporting.

METHODS

Simon's Two-Stage Design

We review trials that used Simon's two-stage design and therefore briefly summarize its statistical aspects. The design evaluates a binary primary outcome X_i from patient i, assumed to be distributed as X_i ~ Bern(p) (ie, P(X_i = 1) = p). Thus, p is the probability of success for the primary outcome. The following hypothesis is tested: H0: p ≤ p0 against H1: p > p0, with a type I error rate of α when p = p0. The trial is powered to level 1 − β when p = p1. Here, p0 and p1 are commonly referred to as the maximal success probability that does not warrant further investigation and the minimal success probability that allows further investigation. Often, p0 is based on the historical success probability for the current standard of care.

The design includes a single interim analysis for futility (a no-go decision) and is indexed by n1, n, a1, and a. In stage I, outcomes for n1 patients are accumulated. Then, a1 serves as a stopping boundary: if the number of stage I successes s1 satisfies s1 ≤ a1, the trial terminates for futility, with H0 not rejected. Otherwise, outcomes for a further n2 = n − n1 patients are gathered. Finally, a is used to determine whether to reject H0: it is rejected if the total number of successes s satisfies s > a and is not rejected otherwise. The design parameters, n1, n, a1, and a, are chosen to minimize an optimality criterion among the combinations that meet the type I error and power requirements. Simon[4] suggested two optimality criteria: (1) null-optimal, to minimize the expected sample size when p = p0, and (2) minimax, to minimize the maximal sample size n. Other optimality criteria have since been proposed.[12-14]

Post-trial inference could be performed using methods developed for one-sample proportions; eg, a CI could be computed using Clopper-Pearson.[15] Depending on the stage of termination, a point estimate for p could be given as s1/n1 or s/n (these are sometimes referred to as naive estimates within the context of a Simon two-stage trial). However, it is well known that the inclusion of an interim analysis means adjusted inference should be performed.
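To make the design search concrete, the following sketch (our own illustration, not the authors' code; it assumes NumPy and SciPy are available, and all function names are ours) finds the null-optimal and minimax designs for given p0, p1, α, and β by exhaustive search:

```python
import numpy as np
from scipy.stats import binom

def reject_prob(p, n1, n2, a1, a_vals):
    # P(reject H0) for each candidate final bound a in a_vals:
    # sum over continuing stage I outcomes s1 > a1 of
    # P(S1 = s1) * P(S2 > a - s1), where S2 ~ Bin(n2, p).
    s1 = np.arange(a1 + 1, n1 + 1)
    pmf1 = binom.pmf(s1, n1, p)
    sf2 = binom.sf(a_vals[:, None] - s1[None, :], n2, p)
    return sf2 @ pmf1

def simon_designs(p0, p1, alpha, beta, n_max=40):
    # Returns the (null-optimal, minimax) designs as (n1, n, a1, a, EN0) tuples.
    best_opt = best_mm = None
    for n in range(2, n_max + 1):
        for n1 in range(1, n):
            n2 = n - n1
            for a1 in range(0, n1):
                a_vals = np.arange(a1, n)
                t1 = reject_prob(p0, n1, n2, a1, a_vals)
                ok = np.flatnonzero(t1 <= alpha)
                if ok.size == 0:
                    continue
                a = int(a_vals[ok[0]])  # smallest valid a maximizes power
                power = reject_prob(p1, n1, n2, a1, np.array([a]))[0]
                if power < 1 - beta:
                    continue
                en0 = n1 + (1 - binom.cdf(a1, n1, p0)) * n2  # E[N] under H0
                design = (n1, n, a1, a, en0)
                if best_opt is None or en0 < best_opt[4]:
                    best_opt = design
                if best_mm is None or (n, en0) < (best_mm[1], best_mm[4]):
                    best_mm = design
    return best_opt, best_mm
```

For example, simon_designs(0.05, 0.25, 0.10, 0.10) returns designs meeting a type I error rate of at most 0.10 and power of at least 0.90; Simon's published tables[4] can be used to check the output for standard parameter choices.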
This is to ensure computed P values are consistent with the decision on whether to reject H0, that acquired CIs have the desired coverage, and to reduce point estimate bias.[16] Many adjusted methods have been proposed, including that of Jung et al[17] for P values, Jennison and Turnbull[18] for CIs, and Jung and Kim[19] for point estimates. Several methods for handling deviation from the planned design (ie, scenarios in which the interim or final analysis is conducted with a sample size different from n1 or n) have also been developed.[6-8,20] We provide extended details of all methods used later in the Data Supplement (online only).

Here, we focus on providing more details of a particular method for computing an adjusted point estimate, which will be used at length later. As noted above, several methods have been proposed for estimating p in a Simon two-stage trial. Each essentially aims to reduce the bias in the estimate; informally, bias can be thought of as expecting, on average, to incorrectly estimate p. The reason multiple methods have been developed is that no one approach is clearly best.[16] However, some believe the uniform minimum variance unbiased estimator (UMVUE) should be preferred.[16,19] It has the lowest variance among estimators that are always unbiased; low variance is useful because it means that, on average, the estimate should be closer to p. The UMVUE has a more complex form than the naive estimates given above (Data Supplement) but is still easy to calculate.

Literature Review

See the Data Supplement for further details.

Inclusion criteria.

To identify articles, PubMed was searched on February 21, 2018, using the term (“2013/01/01”[Date - Publication]: “2017/12/31”[Date - Publication]) AND Clinical Trial[Publication Type] AND (phase II[Title/Abstract] OR phase 2[Title/Abstract]) AND (cancer[All Fields] OR oncology[All Fields]), returning 5,344 articles for review. The key inclusion criteria were that an article (1) was full length, (2) was the primary publication on a trial's complete results, and (3) reported results for at least one treatment arm that used Simon's two-stage design. Next, 534 articles (10.0%) were randomly selected for evaluation for inclusion by M.J.G. and A.P.M., with a 10.0% duplicate extraction used to ensure agreement on inclusion could be precisely estimated. The authors agreed on inclusion for 520 articles (97.3%). Given the high level of agreement, the remaining articles were assessed for inclusion by M.J.G. only, with discussion with A.P.M. where required.

Data extraction.

Data on each of the questions listed in the Data Supplement were extracted by M.J.G. for each arm, in each article, deemed eligible for inclusion. To establish the reliability of this extraction, data extracted by M.J.G. were compared with those independently extracted by A.P.M. on 58 arms. Across 14 questions requiring nonbinary value extraction (eg, Q5, which requested the value of a specific design parameter), the duplicate extractions agreed 96.2% of the time. Across a wider set of 26 questions, including those requiring only binary value extraction, the duplicate extractions agreed 94.3% of the time.

Trial reanalyses.

Reanalyses of included articles were conducted to evaluate the possible impact of not using adjusted inferential procedures. The UMVUE (which, as discussed, may be preferred because of its unbiasedness) was compared with reported naive point estimates to measure the potential degree of over- or underestimation in practice compared with a best-practice analysis. We compared the estimated coverage of computed adjusted CIs with that of reported unadjusted CIs to determine whether CIs may be attaining the desired coverage. Given the absence of evidence is not evidence of absence, the included articles that did not state they reported an adjusted point estimate (Q25) were also reanalyzed (subject to reporting required design components) to evaluate which of seven possible point estimates the reported point estimate (Q26) was consistent with, to the reported number of decimal places. Equivalent computations were conducted for those articles that did not state they reported an adjusted CI (Q30); the reported CI (Q32-34) was compared for consistency with four unadjusted and two adjusted CIs. Reanalyses were limited to those trials (1) adjudged to have terminated in stage II, as point estimate and CI procedures do not, in general, adjust when a trial terminates in stage I, and (2) that reported the number of successes and the sample size assumed in the analysis, as these are required to calculate unadjusted point estimates and CIs. To reanalyze using adjusted inferential procedures, the design parameters n1 and a1 must also have been reported.
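The consistency check described above can be sketched as follows (a simplified illustration with hypothetical estimator names; the actual reanalysis compared against seven point estimates and six CI procedures):

```python
def consistent_estimates(reported, candidates):
    """Return the names of candidate estimates that match a reported value
    when rounded to the reported number of decimal places.
    `reported` is the value exactly as printed in the article (a string);
    `candidates` maps estimator names to recomputed values."""
    decimals = len(reported.split(".")[1]) if "." in reported else 0
    return [name for name, value in candidates.items()
            if f"{value:.{decimals}f}" == reported]
```

For example, a trial reporting a response rate of "0.35" would be judged consistent with a recomputed naive estimate of 7/20 but not with a (hypothetical) adjusted estimate of 0.3612.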

RESULTS

Included Articles

Five hundred articles were deemed eligible for inclusion, with 425 reporting the results of a single eligible treatment arm. The remaining 75 articles reported the results for an additional 204 eligible arms (arms per article: median 2, range [2-15]). To remove the need to account for skew caused by the quality of articles reporting multiple included treatment arms, we discuss here the findings for only the 425 articles that reported the results of a single eligible treatment arm. Findings for the remaining 75 articles are given in the Data Supplement. Table 1 provides descriptors on the 425 articles. At least 15.8% of the articles came from each allowed publication year, with included articles being published in 100 journals and considering a wide variety of cancer types.
TABLE 1.

Descriptors on the 425 Included Articles That Reported the Results of a Single Eligible Treatment Arm

One hundred ten trials (25.9%) were judged to have terminated in stage I and 298 (70.1%) in stage II. Among the 298 judged to have terminated in stage II, only 80 (26.4%) stated the criteria had been met for progression to stage II, indicating this judgment often had to be based on the enrolled sample size. For 17 articles (4.0%), it was not possible to ascertain when the trial terminated; this was typically caused by neither of the planned stagewise sample sizes being reported.

Reporting of Design Characteristics

Table 2 summarizes extracted data on reporting of design characteristics. Although 380 articles (89.4%) clearly stated p0, only 78 (18.4%) provided a justification for its value. The probability p1 was often reported (391 articles; 92.0%), as were the desired type I (372 articles; 87.5%) and type II error rates (382 articles; 89.9%). The chosen optimality criteria were stated in only 240 articles (56.5%). This drives the fact that only 202 articles (47.5%) reported p0, p1, α, β, and the optimality criteria, the five components that ensure easy design reproduction. Although n1 (349 articles; 82.1%), a1 (371 articles; 87.3%), and n (394 articles; 92.7%) were all regularly reported, the final rejection bound a was given in only 235 articles (55.3%).
TABLE 2.

Reporting of the Design of the 425 Included Articles That Reported the Results of a Single Eligible Treatment Arm


Reporting of Inferential Procedures

Table 3 summarizes extracted data on the reporting of inferential procedures. Although point estimates were often reported (372 articles; 87.8%), only five articles (1.2%) stated they had reported an adjusted point estimate. In contrast, P values were rarely reported (four articles; 1.3%). For CIs, just 233 articles (54.8%) reported a CI, with only nine (2.1%) indicating they reported an adjusted CI. All trials that stated they had reported an adjusted point estimate or CI were ones judged to have terminated in stage II; we return to this point in the Discussion.
TABLE 3.

Reporting of Inferential Procedures Performed in the 425 Included Articles That Reported the Results of a Single Eligible Treatment Arm, With Additional Stratification by Stage of Termination

To evaluate whether articles that reported a point estimate or CI but did not indicate it was adjusted were consistent (to their reported number of decimal places) with unadjusted or adjusted analyses, the trials were reanalyzed (Table 4). Two hundred seventy (96.1%) of the reanalyzed articles reported point estimates consistent with an unadjusted estimate. However, 133 of 228 articles (58.3%) for which adjusted point estimates could be calculated were also consistent with at least one adjusted estimate. For the CIs, 116 of 178 reanalyzed articles (65.2%) were consistent with at least one unadjusted interval. Far fewer articles (3/140; 2.1%) for which adjusted CIs could be computed were consistent with an adjusted CI.
TABLE 4.

Reanalysis of the Subset of the 425 Articles That Reported a Point Estimate or CI Not Stated to Have Been Adjusted

To visualize the impact of not using adjusted inferential procedures, Figure 1A displays the unadjusted estimate (s/n) against the UMVUE for the 233 trials that terminated in stage II where the UMVUE could be computed. The difference between the two is presented as a percentage of p1 − p0 in Figure 1B. These plots indicate that although the difference between the unadjusted and adjusted estimates may often be small, there are instances in which it is large; in 25 cases, it exceeded 25% of the difference p1 − p0. Furthermore, there were 103 trials for which p0 was reported, and a UMVUE was computable, in which the unadjusted estimate was at most p0. Potentially significantly, among these, 4.9% (5/103) of trials had a UMVUE above p0 (ie, the estimated response rate changed from below to above p0 following adjustment).
FIG 1.

Point estimate comparison. (A) A comparison of the naive unadjusted point estimate (s/n) and the UMVUE is given for the 233 trials that terminated in stage II where the UMVUE could be computed. (B) The difference between the two is presented as a percentage of p1 − p0 (where p0 and p1 are the maximal success probability that does not warrant further investigation and the minimal success probability that allows further investigation, specified at the design stage), along with a boxplot to indicate the distribution of these data. UMVUE, uniform minimum variance unbiased estimator.

Similar visualizations are provided in Figure 2. Figure 2A displays the length of the reported unadjusted CI against the length of the corresponding adjusted CI proposed by Jennison and Turnbull[18] for the 140 trials for which this adjusted CI could be computed. Figure 2B compares the respective coverage of these unadjusted and adjusted CIs, evaluated at p equal to the UMVUE, for the 131 trials in which the target coverage was 0.95. In general, the length of the unadjusted CI is shorter than that of the corresponding adjusted CI, which is reflected in the coverage being below the desired level for the unadjusted procedure in several instances. The adjusted CI procedure guarantees coverage of at least 0.95, but the cost of this is coverage sometimes far above that required.
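To make the coverage comparison concrete, the following sketch (our own illustration, not the authors' code; it assumes SciPy) computes the exact coverage of the naive Clopper-Pearson interval under the two-stage sampling distribution:

```python
from scipy.stats import beta, binom

def clopper_pearson(s, n, level=0.95):
    # Exact (Clopper-Pearson) CI for a binomial proportion via beta quantiles.
    a = 1 - level
    lo = 0.0 if s == 0 else beta.ppf(a / 2, s, n - s + 1)
    hi = 1.0 if s == n else beta.ppf(1 - a / 2, s + 1, n - s)
    return lo, hi

def naive_cp_coverage(p, n1, n2, a1, level=0.95):
    # P(CI contains p), accounting for the futility stop at a1:
    # stage I stops analyze (s1, n1); continuing trials analyze (s, n1 + n2).
    cov = 0.0
    for s1 in range(n1 + 1):
        w1 = binom.pmf(s1, n1, p)
        if s1 <= a1:
            lo, hi = clopper_pearson(s1, n1, level)
            cov += w1 * (lo <= p <= hi)
        else:
            for s2 in range(n2 + 1):
                lo, hi = clopper_pearson(s1 + s2, n1 + n2, level)
                cov += w1 * binom.pmf(s2, n2, p) * (lo <= p <= hi)
    return cov
```

With a1 = -1 the trial never stops early, and the usual fixed-sample guarantee (coverage at least the nominal level for every p) applies; with a genuine futility bound, the naive interval ignores the interim analysis, and its coverage can fall below the nominal level at some values of p, which is what Figure 2B reflects.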
FIG 2.

CI comparison. (A) The length of the reported unadjusted CI is compared with the length of the corresponding adjusted CI proposed by Jennison and Turnbull for the 140 articles for which this adjusted CI could be computed. (B) The respective coverage of these unadjusted and adjusted CIs, evaluated at p equal to the UMVUE, is given for the 131 of these articles in which the target coverage was 0.95. In both cases, points are colored by the unadjusted CI that the reanalysis indicated the reported CI matched with. For those CIs that matched none of the unadjusted CIs, Clopper-Pearson was used to compute the coverage.

Note that 348 trials that were judged to have ended in stage I or stage II reported a point estimate, P value, or CI, as well as the sample size required by their design. Among these, just 99 (28.4%) performed their analysis using the planned sample size. Differences between planned and analyzed sample sizes are shown in the Data Supplement.

DISCUSSION

A large proportion of all phase II evidence comes from trials using Simon's two-stage design. In addition, for trials in rare molecularly defined patient subgroups, it may often be the case that such two-stage single-arm trials will provide the majority of evidence ever available on a treatment's efficacy. This necessitates that such studies be designed, analyzed, and reported effectively. We evaluated the degree to which this is true through a comprehensive review.

It is easy to argue that reporting of design components was extremely poor. Reproducibility of designs is limited by infrequent reporting of p0, p1, α, and β in unison. It is alarming that only 18.4% of the trials provided a justification for p0, considering result interpretation is highly dependent on this value. It may be considered disappointing that most trials chose standard error rates (eg, α = 0.05), as it has been highlighted that small concessions in this regard can lead to notable efficiency gains.[21] Similar statements are true for the optimality criteria.[13,22]

Few articles stated they used adjusted inference. Given there is no additional cost to using these methods, this is disappointing. Figures 1 and 2 indicate the result of this may be that trials were conservative in their reported point estimate but anticonservative in the width of their CI. It is also concerning that only 54.8% of the articles included a CI, given the size of single-arm trials makes uncertainty around a point estimate important to quantify.

Many final analyses were performed with a sample size different from that specified in the design (71.6%). This highlights the need to plan for design deviation and echoes previous findings.[23] We initially hoped to extract data on how trials handled design deviation when interpreting their results. This was unfortunately judged to be too subjective an endeavor, as many studies interpreted findings through informal comparison of their point estimate or CI bounds to p0 and/or p1.
Difficulties in attaining the planned sample size may be reflected in only 55.3% of the trials reporting the final rejection bound a. Lack of reporting of a also indicates many trials that use Simon's design do not formally test the hypothesis they claim to. It is troubling that so many trials are being published without a formal statistical test being conducted for their primary outcome. We note that methodology to comprehensively handle design deviation is available; its use is depicted in the Data Supplement, which provides the error rates for 45 trials when the methodology of Englert and Kieser[7] is implemented. Using this methodology, trials are assured to conform to their desired type I error rate, and it appears sample sizes that enable power to reach close to the desired level may have been achieved in practice. Without using methodology to account for design deviation, many trials may be interpreting their findings in a manner associated with a high probability of erroneous decision making.

We acknowledge several limitations to our work. Only a 10% duplicate extraction was performed. Given the strength of our findings, though, it is unlikely our conclusions would be altered by additional duplicate extractions. It is also impossible to be certain that those trials that did not state they used an adjusted inferential procedure had used an unadjusted method. Our reanalyses (Table 4) provide evidence this may be the case. However, for trials that terminated in stage I, we cannot know whether a plan to use adjusted inference if the trial had continued to stage II went unreported.

Given past work assessing adherence to CONSORT recommendations,[24] our findings should perhaps not be surprising. Nonetheless, it may have been hoped the simplicity of Simon's design would lead to effective reporting. Our results indicate a CONSORT extension for single-arm oncology trials may be warranted.
References (22 in total; first 10 shown):

1.  On the estimation of the binomial probability in multistage clinical trials.

Authors:  Sin-Ho Jung; Kyung Mann Kim
Journal:  Stat Med       Date:  2004-03-30

2.  Optimal two-stage designs for phase II clinical trials.

Authors:  R Simon
Journal:  Control Clin Trials       Date:  1989-03

3.  Methods for proper handling of overrunning and underrunning in phase II designs for oncology trials.

Authors:  Stefan Englert; Meinhard Kieser
Journal:  Stat Med       Date:  2015-03-17

4.  New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1).

Authors:  E A Eisenhauer; P Therasse; J Bogaerts; L H Schwartz; D Sargent; R Ford; J Dancey; S Arbuck; S Gwyther; M Mooney; L Rubinstein; L Shankar; L Dodd; R Kaplan; D Lacombe; J Verweij
Journal:  Eur J Cancer       Date:  2009-01

5.  Admissible two-stage designs for phase II cancer clinical trials.

Authors:  Sin-Ho Jung; Taiyeong Lee; KyungMann Kim; Stephen L George
Journal:  Stat Med       Date:  2004-02-28

6.  Identifying combined design and analysis procedures in two-stage trials with a binary end point.

Authors:  Jack Bowden; James Wason
Journal:  Stat Med       Date:  2012-07-11

7.  Two-stage designs optimal under the alternative hypothesis for phase II cancer clinical trials.

Authors:  A P Mander; S G Thompson
Journal:  Contemp Clin Trials       Date:  2010-08-01

8.  Does use of the CONSORT Statement impact the completeness of reporting of randomised controlled trials published in medical journals? A Cochrane review.

Authors:  Lucy Turner; Larissa Shamseer; Douglas G Altman; Kenneth F Schulz; David Moher
Journal:  Syst Rev       Date:  2012-11-29

9.  What inference for two-stage phase II trials?

Authors:  Raphaël Porcher; Kristell Desseaux
Journal:  BMC Med Res Methodol       Date:  2012-08-06

10.  Do single-arm trials have a role in drug development plans incorporating randomised trials?

Authors:  Michael J Grayling; Adrian P Mander
Journal:  Pharm Stat       Date:  2015-11-26

