Literature DB >> 34870495

A class of two-sample nonparametric statistics for binary and time-to-event outcomes.

Marta Bofill Roig1,2, Guadalupe Gómez Melis1.   

Abstract

We propose a class of two-sample statistics for testing the equality of proportions and the equality of survival functions. We build our proposal on a weighted combination of a score test for the difference in proportions and a weighted Kaplan-Meier statistic-based test for the difference of survival functions. The proposed statistics are fully non-parametric and do not rely on the proportional hazards assumption for the survival outcome. We present the asymptotic distribution of these statistics, propose a variance estimator, and show their asymptotic properties under fixed and local alternatives. We discuss different choices of weights including those that control the relative relevance of each outcome and emphasize the type of difference to be detected in the survival outcome. We evaluate the performance of these statistics with small sample sizes through a simulation study and illustrate their use with a randomized phase III cancer vaccine trial. We have implemented the proposed statistics in the R package SurvBin, available on GitHub (https://github.com/MartaBofillRoig/SurvBin).


Keywords:  Clinical trials; mixed outcomes; multiple endpoints; non-proportional hazards; survival analysis; weighted mean survival test


Year:  2021        PMID: 34870495      PMCID: PMC8829729          DOI: 10.1177/09622802211048030

Source DB:  PubMed          Journal:  Stat Methods Med Res        ISSN: 0962-2802            Impact factor:   3.021


Introduction

In many clinical studies, two or more endpoints are investigated with the aim of providing a comprehensive picture of the treatment’s benefits and harms. Survival analysis has often been the central focus of clinical trial research. However, when there is more than one event of interest, the time until the appearance of an event is not always the sole center of attention; often the occurrence of an event over a fixed time period is also an outcome of interest. In cancer immunotherapy trials, short-term binary endpoints based on tumor size, such as objective response, are common in early-phase trials, whereas overall survival remains the gold standard in late-phase trials.[1,2] Since traditional oncology endpoints may not capture the clinical benefit of cancer immunotherapies, the idea of looking at both tumor response and survival has grown from the belief that, together, they may achieve a better characterization of the clinical response. Several authors have considered both objective response and overall survival as primary endpoints in cancer trials. Lai and Zee proposed a single-arm phase II trial design with a tumor response rate and a time-to-event outcome, such as overall survival or progression-free survival. In their design, the dependence between the binary response and the time-to-event outcome is modeled through a Gaussian copula. Lai et al. proposed a two-step sequential design in which the response rate and the time to the event are jointly modeled. Their approach relates the response rate and the time to the event by means of a mixture model built on the Cox proportional hazards assumption. Chen and Wang presented a joint model for binary marker responses and survival outcomes for clustered data. They based the statistical inference on a multivariate penalized likelihood method and estimated the standard errors using a jackknife resampling method. 
An additional challenge in immunotherapy trials lies in the fact that delayed effects are likely to be found, creating the need for alternative methods that account for non-proportionality of the hazards. Statistics that look at differences between integrated weighted survival curves, such as those defined by Pepe and Fleming[8,9] and extended by Gu et al., are better suited to detect early or late survival differences and do not depend on the proportional hazards assumption. In this paper, we aim to propose a class of two-sample statistics that could be used in phase II/III designs to jointly evaluate the efficacy on binary and survival endpoints, even in the presence of delayed treatment effects. The problem of how to analyze multiple outcomes has been widely discussed in the literature.[11,12] Depending on the inferential goals, this problem can be addressed in different ways. When the trial’s objective is to test the null hypothesis of no effect on any of the endpoints, one can use the univariate tests together with a multiplicity adjustment to control the inflation of the type I error. The most popular multiple testing procedures (e.g., the Bonferroni procedure) do not make any assumption regarding the joint distribution of the univariate tests, and therefore may lead to conservative designs when the endpoints are correlated. When aiming to demonstrate that the treatment has an effect across the endpoints without necessarily specifying the minimum effect on any of the endpoints, one can consider methods based on a combination of the univariate statistics, which generally require assumptions about the joint test distribution. Several approaches have been developed that allow for the joint distribution of the test statistics. O’Brien and Pocock et al. proposed global test statistics based on the sum of the individual statistics. 
O’Brien developed a generalized least squares method that combines multiple statistics into a single hypothesis test when the variables are normally distributed, whereas Pocock et al. extended O’Brien’s work to asymptotically normal test statistics. Hothorn et al. and Pipper et al. approached the problem of testing multiple hypotheses using parametric and semi-parametric models. Hothorn et al. used the limiting distribution of the parameter estimators to build the corresponding test statistics and their joint distribution. Based on that, their approach corrects the significance level by means of the simultaneous asymptotic normality of the test statistics. Pipper et al. proposed a procedure for evaluating efficacy in trials with multiple endpoints of different types. Their procedure is based on the simultaneous asymptotic normality of the effect estimators from the single models for each endpoint together with multiple testing adjustments. Extensive research has been done on the joint modeling of longitudinal measurements and survival data (comprehensive overviews can be found in Tsiatis and Davidian, Rizopoulos, and Papageorgiou et al.). In most cases, the primary focus is on characterizing the association between the longitudinal and event time processes. The common framework is to relate the time-to-event and longitudinal outcomes through the proportional hazards model. Nevertheless, the relationship between a binary response at a specific time point and a survival outcome has received less attention. In this paper, we follow the idea put forward by Pocock et al. of combining multiple test statistics into a single hypothesis test. Specifically, we propose a class of statistics based on a weighted sum of a difference-in-proportions test and a weighted Kaplan–Meier test based on the difference of survival functions to evaluate an overall effect on the binary and survival endpoints. 
Our proposal adds versatility to the study design by enabling different follow-up periods for each endpoint, and flexibility by incorporating weights. We define these weights to specify unequal priorities to the different endpoints and to anticipate the type of time-to-event difference to be detected. This article is organized as follows. In Section ‘A general class of binary and survival test statistics’, we present the class of statistics for binary and time-to-event outcomes. In Section ‘Large sample results’, we set out the assumptions and present the large sample distribution theory for the proposed statistics. In Section ‘On the choice of weights’, we introduce different weights and discuss their choice. We give an overview of our R package SurvBin in Section ‘Implementation’ and illustrate our proposal with a recent immunotherapy trial in Section ‘Example’. In Section ‘Simulation study’, we evaluate the performance of these statistics in terms of the significance level and the statistical power with a simulation study. We conclude with a discussion. All the required functions to use these statistics have been implemented in R and have been made available at: https://github.com/MartaBofillRoig/SurvBin.

A general class of binary and survival test statistics

Consider a randomized controlled trial comparing two treatment groups, control group ( ) and intervention group ( ), each composed of individuals, and denote by the total sample size. Suppose that both groups are followed over the time interval and are compared on the basis of the following two endpoints: the occurrence of an event before ( ), and the time to a different event within the interval ( ). For the -th group ( ), let be the probability of having the event before , and be the survival function of the time to the event . We consider the problem of testing simultaneously : and : , aiming to demonstrate an overall effect across the binary and survival endpoints, either by a higher probability of the occurrence of , : , or an improved survival with respect to in the intervention group, : , . The hypothesis problem can then be formalized as follows: We propose a class of statistics (hereafter called the -class) as a weighted linear combination of the difference of proportions statistic for the binary outcome and the integrated weighted difference of two survival functions for the time-to-event outcome, as follows: for some real numbers , such that , and where: denoting by the estimated proportion of events before , and by the Kaplan–Meier estimator of for group . The estimates and are such that they converge in probability to and , respectively, as , where and represent the asymptotic variances of and , respectively. Both theoretical and estimated expressions for the variances of and will be given in Section ‘Large sample results’ (see equations (5), (6) for the theoretical expressions and (10), (11) for the estimates). The term is a possibly random function that converges pointwise in probability to a deterministic function . For ease of notation, and letting , we will suppress the dependence on and use instead , , . Note that , , and depend on the sample size , but this dependence is omitted from the notation for brevity. 
The weights control the relative relevance of each outcome, if any, and the random weight function serves two purposes: to specify the type of survival difference that may exist between groups and to stabilize the variance of the difference of the two Kaplan–Meier functions. Some well-known special cases of are as follows: , where is the pooled Kaplan–Meier estimator for the censoring distribution. This choice of down-weights the contributions at those times where the censoring is heavy. , where and is the pooled Kaplan–Meier estimator for the survival function. This corresponds to the weights of the Fleming–Harrington family. Then, for instance, if and , emphasizes early differences between survival functions, whereas late differences could be highlighted with and . , where denotes the number of individuals at risk of at time . In this case, accentuates the information at the beginning of the survival curve, allowing early failures to receive more weight than later failures. We state the precise conditions for the weight function in Section ‘Large sample results’ and postpone the discussion about the choice of and to Section ‘On the choice of weights’. The statistics in the -class are defined for possibly different follow-up configurations based on different choices of: the overall follow-up period, ; the time at which the binary event is evaluated, ; and the origin time for the survival outcome, ; taking into account that . There are, however, no restrictions on whether or not these periods overlap and, if they do, by how much and when. We illustrate two situations with different configurations for in Figure 1. The first case is exemplified by an HIV therapeutic vaccination study where a safety–tolerability response (binary outcome) and time to viral rebound (survival outcome) are the outcomes of interest. Whereas the safety–tolerability is evaluated at week 6 ( ), the time to viral rebound is evaluated from week 6 to week 18 ( and ). 
The second example in the area of immunotherapy trials includes a binary outcome (objective response), evaluated at month 6, and overall survival, evaluated from randomization until year 4 ( , and ).
Figure 1.

Illustration of two different follow-up configurations; the red and blue arrows represent the time-frames for the binary and time-to-event outcomes, respectively. The red line goes from the start of the study (at time-point ) until the binary outcome is evaluated at time . The blue (dashed) line goes from when the time-to-event information begins to be collected ( ) to the end of the study ( ).

The -class of statistics includes several statistical tests as special cases. If , and , then corresponds to the global test statistic proposed by Pocock et al. If , , and , the statistic is equivalent to the linear combination test of Logan et al. for testing differences in survival curves after a pre-specified time-point, when there is no censoring until .
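The building blocks of the -class can be sketched in a few lines. The following Python snippet is a simplified illustration only (not the authors' SurvBin implementation): it computes a Kaplan–Meier curve on a grid and combines an unstandardized difference in proportions with an integrated difference of survival curves over a window [tau0, tau], using constant weight w(t) = 1 and user-chosen outcome weights; the standardization by a consistent variance estimator, described in the paper, is omitted to keep the sketch short.

```python
import numpy as np

def km_curve(time, status, grid):
    """Kaplan-Meier survival estimate, evaluated on a time grid
    (piecewise constant between event times)."""
    step_t, step_s = [0.0], [1.0]
    surv = 1.0
    for u in np.unique(time[status == 1]):
        d = np.sum((time == u) & (status == 1))  # events at time u
        r = np.sum(time >= u)                    # at risk just before u
        surv *= 1.0 - d / r
        step_t.append(u)
        step_s.append(surv)
    idx = np.searchsorted(step_t, grid, side="right") - 1
    return np.asarray(step_s)[idx]

def l_statistic(bin0, bin1, t0, s0, t1, s1, tau0, tau,
                wb=0.5, ws=0.5, ngrid=200):
    """Weighted combination of a difference in proportions and an
    integrated difference of Kaplan-Meier curves over [tau0, tau],
    with w(t) = 1 (both components left unstandardized)."""
    n0, n1 = len(t0), len(t1)
    scale = np.sqrt(n0 * n1 / (n0 + n1))
    Ub = scale * (bin1.mean() - bin0.mean())       # binary component
    grid = np.linspace(tau0, tau, ngrid)
    diff = km_curve(t1, s1, grid) - km_curve(t0, s0, grid)
    Us = scale * diff.sum() * (grid[1] - grid[0])  # Riemann-sum integral
    return wb * Ub + ws * Us
```

In the paper each component is divided by a consistent standard-error estimate before combining; the raw version above only illustrates the structure of the weighted linear combination.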

Large sample results

In this section, we derive the asymptotic distribution of the -class of statistics given in (2) under the null hypothesis and under contiguous alternatives, present an estimator of their asymptotic variance, and discuss the consistency of the -statistics against any alternative hypothesis of the form of in (1). We start the section with the conditions we require for the -class of statistics. To make the paper more concise and more readable, proofs and technical details are in the Supplemental material. Moreover, stratified -tests are also presented and discussed in the Supplemental material.

Further notation and assumptions

We consider two independent random samples of ( ) individuals and for each individual we denote the binary response by , the time to by , and the censoring time by , for and , where is the usual 0/1 indicator function. Assume that is non-informatively right-censored by , that is independent of , and that the occurrence of the survival and censoring times, and , does not preclude assessment of the binary response, . The observable data are summarized by , where and . Denote by and the censoring survival function and the Kaplan–Meier estimator for the censoring times, respectively. As we will see in the next section, the distribution of the -statistics relies, among other things, on the survival function for those individuals who respond ( ) to the binary endpoint. We therefore introduce the survival function for responders as P . Furthermore, we assume that: (i) at the end of follow-up, , , and ; (ii) the limiting fraction of the total sample size is non-negligible, that is, ; and (iii) is nonnegative and piecewise continuous with finitely many discontinuity points. For all continuity points in , converges in probability to as . Moreover, and are functions of total variation bounded in probability. Finally, we introduce the counting process as the number of observed events that have occurred by time for the -th group ( ) and as the number of subjects at risk at time for the -th group. We define and suppose that . Remark: throughout the paper, to refer to group ( ), we will use subindexes for the individual observations and stochastic processes, as in , while we will use superindexes in parentheses for the functions and parameters, as in .

Asymptotic distribution

To derive the asymptotic distribution of the statistic , we use that can be approximated by , the same statistic with the weights replaced by their deterministic limit (see Lemma 1 in the Supplemental material). Roughly speaking, thanks to this approximation, we can ignore the randomness of and use to obtain the limiting distribution of . In what follows, we state the asymptotic distributions under the null hypothesis in Theorem 1 and under a sequence of contiguous alternatives in Theorem 2. Let be the statistic defined in (2). Under the conditions outlined in the ‘Further notation and assumptions’ section, if the null hypothesis holds, converges in distribution, as , to a normal distribution as follows: where , stand for the asymptotic variances of and , respectively, and is the covariance between and . Their corresponding expressions are given by: where , ( or ), , and for . Recall that , , and depend on , but we omit them for notational simplicity. Let be the statistic defined in (2). Under the conditions outlined in the ‘Further notation and assumptions’ section, consider the following sequences of contiguous alternatives for both binary and time-to-event hypotheses satisfying, as : and for some constant and bounded function , and . Then, under contiguous alternatives of the form: we have that: in distribution as , where , and are given in (5)–(7), respectively. The covariance in (7) involves the conditional probabilities and , while represents the survival function for responders (individuals who have had the binary event ), and stands for the probability of being a responder among individuals experiencing at . Also note that, if , the survival experience starts after the binary event has been evaluated and only involves the second integral in (7). We notice that the efficiency of the -statistics, , under contiguous alternatives is driven by the non-centrality parameter , that is, by the sum of the weighted non-centrality parameters of and .

Variance estimation and consistency

We now describe how to use the -statistics to test versus given in (1). In particular, we propose a consistent estimator of the asymptotic variance of , and present the standardized -statistics to test . The asymptotic variance of , given in Theorem 1, can be consistently estimated by: where , , and denote the estimates of , and , and are given by: where ( or ), is the pooled Kaplan–Meier estimator of the survival functions, is the pooled estimator of the probabilities , is the Kaplan–Meier estimator of , and is the estimator of . The variance estimator presented in (9) is obtained assuming that the variances of the two groups are equal (pooled estimator). An unpooled variance estimator is proposed in the Supplemental material. For both pooled and unpooled estimators, smoothing techniques are used to estimate over the time period . In this work, we have chosen kernel smoothing methods. Note that resampling methods can also be used to obtain an estimator of the variance of . In the simulation section, we will discuss the results using the pooled, unpooled, and bootstrap variance estimators. To test the global null hypothesis in (1), we consider the normalized statistic of : Since this statistic (13) converges in distribution to a standard normal distribution, it can be used to test by comparing its value with standard normal quantiles. Moreover, for positive , the statistic is consistent against any alternative hypothesis of the form of in (1).
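The bootstrap route mentioned above can be sketched generically: resample each arm with replacement, recompute the statistic on each resample, and take the standard deviation of the replicates as the standard error. The Python snippet below is a hedged illustration, not the lstats_boots implementation; the stand-in statistic diff (a simple difference in means) is a placeholder for any two-sample statistic.

```python
import numpy as np

def bootstrap_se(stat_fn, sample0, sample1, n_boot=500, seed=1):
    """Bootstrap standard error of a two-sample statistic:
    resample each arm with replacement, recompute, take the SD."""
    rng = np.random.default_rng(seed)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        s0 = sample0[rng.integers(0, len(sample0), len(sample0))]
        s1 = sample1[rng.integers(0, len(sample1), len(sample1))]
        reps[b] = stat_fn(s0, s1)
    return reps.std(ddof=1)

# demo with a simple difference-in-means statistic as a stand-in
rng = np.random.default_rng(0)
x0, x1 = rng.normal(0.0, 1.0, 200), rng.normal(0.3, 1.0, 200)
diff = lambda a, b: b.mean() - a.mean()
se = bootstrap_se(diff, x0, x1, n_boot=300)
z = diff(x0, x1) / se  # standardized statistic, compare to N(0,1)
```

The same resampling scheme applies to the -statistic by replacing diff with a function that recomputes the full binary-plus-survival combination on the resampled data.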

On the choice of weights

An important consideration when applying the statistics proposed in this paper is the choice of the weight functions. The -class of statistics involves the already mentioned random weight function and deterministic weights . These weights are defined according to different purposes and have different roles in the statistic . In this section, we present different weights and discuss some of their strengths as well as their shortcomings. The list provided is not exhaustive; other weights are possible and might be useful in specific circumstances.

Choice of

The purpose of the weights is to prioritize the binary and the time-to-event outcomes. They have to be specified in advance according to the research questions. Whenever the two outcomes are equally relevant, we should choose . In this case, the statistics will be optimal whenever the standardized effects on both outcomes coincide. The choice of might be very general as long as converges in probability to a function , and both and satisfy the conditions outlined in the ‘Further notation and assumptions’ section. In this section, we center our attention on a family of weights of the form: where: (i) is a data-dependent function that converges in probability to , a nonnegative piecewise continuous function with bounded variation on . The term takes care of the expected differences between survival functions and can also be used to emphasize some parts of the follow-up according to the time-points ( ); (ii) the weights converge in probability to a deterministic positive bounded weight function . The main purpose of the weight is to ensure the stability of the variance of the difference of the two Kaplan–Meier functions. To do so, we make the additional assumption that: for all , and for some constants . Different choices of yield other known statistics. For instance, if , corresponds to the weighted Kaplan–Meier statistics.[8,9] Whenever and correspond to the weights (15) and (14), respectively, introduced below, we have the statistic proposed by Shen and Cai. Furthermore, note that the weight functions of the form are similar to those proposed by Shen and Cai; while they assume that is a bounded continuous function, we assume that is a nonnegative piecewise continuous function with bounded variation on , and instead of only considering the Pepe and Fleming weight function corresponding to (15), we also allow for different weight functions . 
Finally, if we do not consider any weight, that is, if , corresponds to the difference of restricted mean survival times from to . In what follows, we outline different choices of and , together with a brief discussion of each. We require to be small towards the end of the observation period if censoring is heavy. The usual weight functions involve Kaplan–Meier estimators of the censoring survival functions. The most common weight functions are and , both proposed by Pepe and Fleming. Among other properties, has been shown to be competitive with the log-rank test under the proportional hazards alternative. Note that if the censoring survival functions are equal for both groups and the sampling design is balanced ( ), then the differences in Kaplan–Meier estimators are weighted by the censoring survival function, that is, for . Also note that for uncensored data. Analogously to the Fleming–Harrington statistics, could be used to specify the type of expected differences between survival functions. That is, if we set: the choice , leads to a test to detect early differences, while , leads to a test to detect late differences; and leads to a test evenly distributed over time and corresponds to the weight function of the log-rank test. To put more emphasis on times after the binary follow-up period, we might consider: for .
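The Fleming–Harrington-type choice described above is easy to compute once a pooled survival estimate is available on a time grid. A minimal Python sketch follows; the array S here is a hypothetical pooled Kaplan–Meier curve used only for illustration.

```python
import numpy as np

def fh_weight(S, rho, gam):
    """Fleming-Harrington-type weight S(t)^rho * (1 - S(t))^gam:
    (rho, gam) = (1, 0) stresses early differences,
    (0, 1) stresses late differences, and (0, 0) weights evenly
    over time (the log-rank-type choice)."""
    return S**rho * (1.0 - S)**gam

S = np.array([1.0, 0.9, 0.7, 0.4, 0.2])  # hypothetical pooled KM values
w_early = fh_weight(S, 1, 0)  # equals S(t): down-weights late times
w_late = fh_weight(S, 0, 1)   # equals 1 - S(t): up-weights late times
w_even = fh_weight(S, 0, 0)   # constant 1 over the whole follow-up
```

In the full statistic this factor would be multiplied by a censoring-based weight such as the Pepe–Fleming choice before integrating the difference of the Kaplan–Meier curves.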

Implementation

We have developed the SurvBin package to facilitate the use of the -statistics; it is available on GitHub (https://github.com/MartaBofillRoig/SurvBin). The SurvBin package contains two key functions: lstats, to compute the standardized -statistic, , using the variance estimator given in Section ‘Large sample results’; and lstats_boots, to compute the standardized -statistic using a bootstrap procedure to estimate the variance. The SurvBin package also provides the function survbinCov, to calculate ; and bintest and survtest, to compute the univariate binary and survival statistics (3) and (4), and , respectively. In addition, the SurvBin package includes the function simsurvbin, which can be used to simulate bivariate binary and survival data in a variety of situations. The main function lstats can be called by: where time, status, binary and treat are vectors of the right-censored data, the status indicator, the binary data and the treatment group indicator, respectively; tau0, tau, taub denote the follow-up configuration; wb, ws are the weights ; rho, gam, eta are scalar parameters that control the weight , which is given by ; and var_est indicates the variance estimate to use (pooled or unpooled). In this work, we estimate by means of the Epanechnikov kernel function, with the local bandwidth selection and boundary correction described by Müller and Wang, using the muhaz package.
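The kernel-hazard step can also be sketched outside R. The following Python snippet illustrates Epanechnikov-kernel smoothing of Nelson–Aalen increments with a fixed global bandwidth; it is an illustration only, and does not attempt to reproduce the Müller–Wang local bandwidth selection and boundary correction that the paper obtains via the muhaz R package.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on |u| <= 1."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def smooth_hazard(time, status, grid, bandwidth):
    """Kernel-smoothed hazard estimate: a kernel-weighted sum of
    Nelson-Aalen increments d_i / Y(t_i) over the event times."""
    haz = np.zeros_like(grid, dtype=float)
    for u in np.unique(time[status == 1]):
        d = np.sum((time == u) & (status == 1))  # events at time u
        r = np.sum(time >= u)                    # at risk at time u
        haz += epanechnikov((grid - u) / bandwidth) * d / (r * bandwidth)
    return haz
```

With a fixed bandwidth this estimator is biased near the edges of the observation window, which is precisely what the Müller–Wang boundary correction addresses.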

Example

Melanoma has been considered a good target for immunotherapy, and its treatment has been a key goal in recent years. Here we consider a randomized, double-blind, phase III trial whose primary objective was to determine the safety and efficacy of the combination of a melanoma immunotherapy (gp100) together with an antibody vaccine (ipilimumab) in patients with previously treated metastatic melanoma. Although the original primary endpoint was the objective response rate at week 12, it was amended to overall survival, with the response rate then considered a secondary endpoint. A total of 676 patients were randomly assigned to receive ipilimumab plus gp100, ipilimumab alone, or gp100 alone. The study was designed to have at least power to detect a difference in overall survival between the ipilimumab-plus-gp100 and gp100-alone groups at a two-sided level of , using a log-rank test. Cox proportional-hazards models were used to estimate hazard ratios and to test their significance. The results showed that ipilimumab with gp100 improved overall survival as compared with gp100 alone in patients with metastatic melanoma. However, the treatment had a delayed effect, and an overlap between the Kaplan–Meier curves was observed during the first six months. Hence, the proportional hazards assumption appeared not to hold, and a different approach would have been advisable. To illustrate our proposal, we consider the comparison between the ipilimumab-plus-gp100 and gp100-alone groups based on overall survival and objective response as multiple primary endpoints of the study. For this purpose, we have reconstructed individual observed times by scanning the overall survival Kaplan–Meier curves reported in Figure 1A of Hodi et al. using the reconstructKM package (see Figure 2), and, afterwards, we have simulated the binary response to mimic the percentage of responses obtained in the study.
Figure 2.

Kaplan–Meier curves for overall survival for ipilimumab-plus-gp100 and gp100-alone groups (arms 1 and 0, respectively).

Using the reconstructed data, we apply the -statistic by means of the function lstats in the SurvBin package. To do so, we need to specify the weights ( ) to be used, and the time-points ( ). In our particular case, we take according to the trial design, choose to account for censoring and delayed effects at late times, and to emphasize the importance of overall survival over objective response. The function lstats returns the standardized -statistic, together with the -statistic and its standard deviation, and the individual statistics. The value of the -statistic, in (2), is and is obtained using the values of and ( and , respectively), and and ( and ). The statistic equals and is computed using the variance estimator in (9), and then by means of and together with the estimated covariance ( ). Since we obtained , we have a basis to reject and conclude that there is an overall effect of ipilimumab across the binary endpoint of tumor reduction and overall survival in patients with metastatic melanoma. Note that, in this example, we used the pooled variance estimator for the -statistic. We also calculated the statistic using the unpooled and bootstrap variance estimators (see Supplemental material) and noted that the results were not substantially different. We also explored how much the results might vary if we changed the assigned weights. Table 1 shows the values of the -statistic when we modify the weights . Moreover, we compared the values of the statistics for the survival endpoint when using weighted Kaplan–Meier statistics and weighted log-rank statistics. We observed no major differences among the weighted Kaplan–Meier tests with different weights, nor among the weighted log-rank tests. 
This is because the survival curves coincide at the beginning, and therefore putting a higher weight at the end of the curve does not have as much influence as it would if there were a small effect at the beginning of the trial. On the other hand, we observe that the weighted Kaplan–Meier statistics take higher values than the weighted log-rank tests for all combinations of weights.
Table 1.

Comparison of statistics (at years) using different weights: -statistics with different combinations of ; and weighted Kaplan–Meier (WKM) statistics and weighted log-rank (WLR) statistics with different .

Statistics        Weights                    Standardized test
L-statistics      (ωb,ωs) = (0.25,0.75)      4.11
L-statistics      (ωb,ωs) = (0.5,0.5)        3.93
L-statistics      (ωb,ωs) = (0.75,0.25)      2.94
WKM statistics    (ρ,γ,η) = (0,0,0)          3.70
WKM statistics    (ρ,γ,η) = (0,1,0)          3.69
WLR statistics    (ρ,γ) = (0,0)              3.13
WLR statistics    (ρ,γ) = (0,1)              2.75

Simulation study

Design

We conducted a simulation study to evaluate our proposal in terms of the statistical power and the type-I error with small sample sizes. We generated bivariate binary and time-to-event data through a copula-based framework and used conditional sampling as described in Trivedi and Zimmer. The parameters used for the simulation (summarized in Table 2) have been the following:
Table 2.

Scenarios used in the simulation study.

Parameter    Value                                   Parameter       Value
p(0)         0.1, 0.3                                a               0.5, 1, 2
b            1                                       c               3
θ            0.001, 0.51, 0.91                       n(i) (i=0,1)    250
τb           0.5, 1                                  τ               1
ρ, γ         0, 1                                    η               1
d            0, 0.075                                HR              0.75, 1
(ωb,ωs)      (0.25,0.75), (0.5,0.5), (0.75,0.25)     t*              0, 0.5
Clayton’s copula with association parameter between the marginal distributions of the binary and time-to-event outcomes equal to . These values correspond, respectively, to Spearman’s rank correlation values of , , , which represent increasing associations between the binary and time-to-event outcomes. We did not consider higher values of as they do not fulfill the condition that ( ). Weibull survival functions, , with and . Probability of having the binary endpoint . Sample size per arm for and total sample size . The censoring distributions between groups were assumed equal and uniform with . Two different follow-up configurations were considered for : (i) ; and (ii) . We considered the weights: with and . When simulating under the null hypothesis, we considered equal to ; whereas when simulating under the alternative hypotheses, we considered equal to , , and . The simulations under the alternative hypothesis considered four different situations depending on whether there is a treatment effect on both endpoints and on the type of difference between the survival curves. We used to simulate the effects on the binary endpoint; for the survival endpoint, we considered under proportional hazards, and and under delayed effects. Specifically, the following cases were considered: (1) effect on both binary and survival endpoints, where the effect on the survival endpoint satisfies the proportional hazards assumption, that is, the hazard ratio (HR) between treatment groups is constant over the study duration; (2) effect on the binary endpoint and no effect on the survival endpoint ( ); (3) no effect on the binary endpoint ( ) and effect on the survival endpoint with HR constant over the study duration; (4) effect on both binary and survival endpoints. 
The treatment differences on the survival endpoint have a delayed effect, that is, the survival functions are assumed to be equal until time , and there is a constant hazard ratio (HR) between treatment groups from to . We evaluated the empirical significance level and the statistical power using the -statistics with pooled, unpooled, and bootstrap variance estimators, and for the sake of the comparison, using the Bonferroni procedure. In addition, we presented the empirical results for testing the individual hypothesis and by using the statistics (3) and (4). The total number of scenarios was ( under the null hypothesis and under the alternative hypothesis). We ran replicates and estimated the significance level ( ) for each scenario under the null hypothesis. We ran replicates and estimated the statistical power for each scenario under alternative hypotheses. We performed all computations using the R software (version 4.0.2).
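As an illustration of this data-generating mechanism, the following Python sketch draws one arm with a binary response and a right-censored Weibull survival time associated through Clayton's copula, using parameter values from Table 2. This is a minimal sketch assuming a standard conditional-inversion sampler for the copula; the paper's own implementation is the R package SurvBin, and all function names here are ours.

```python
import numpy as np

def clayton_sample(n, theta, rng):
    """Draw n pairs (u, v) from a Clayton copula by conditional inversion."""
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)
    v = ((w ** (-theta / (1.0 + theta)) - 1.0) * u ** (-theta) + 1.0) ** (-1.0 / theta)
    return u, v

def simulate_arm(n, p, a, b, theta, cens_upper, tau, rng):
    """One trial arm: binary response and right-censored Weibull survival
    time, with Clayton-copula association between the two outcomes."""
    u, v = clayton_sample(n, theta, rng)
    binary = (u <= p).astype(int)              # response indicator, P(X = 1) = p
    t = b * (-np.log(1.0 - v)) ** (1.0 / a)    # Weibull(shape a, scale b) by inverse CDF
    c = rng.uniform(0.0, cens_upper, size=n)   # uniform censoring on (0, c)
    time = np.minimum(np.minimum(t, c), tau)   # follow-up ends at tau
    status = (t <= np.minimum(c, tau)).astype(int)  # 1 = event observed
    return binary, time, status

rng = np.random.default_rng(2021)
# control-arm draw with parameter values taken from Table 2
x, obs_time, status = simulate_arm(n=250, p=0.1, a=1.0, b=1.0,
                                   theta=0.51, cens_upper=3.0, tau=1.0, rng=rng)
```

Each simulation replicate would draw two such arms (with the treatment arm's p and hazard shifted by d and HR) and apply the test statistics to the pooled data.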

Power properties

When there is a treatment effect on both endpoints and the proportional hazards assumption is fulfilled (case 1), the proposed statistics with the pooled, unpooled, and bootstrap variance estimators yielded higher median empirical powers than the Bonferroni procedure. The same ordering holds when there is a treatment effect on both endpoints with delayed effects (case 4). Table 3 summarizes the simulation results on the power across the different parameters in case 1. Comparing the different variance estimators, we noticed that the empirical powers do not differ substantially between them. We also observed that the power is not affected by the different weight functions in the case of proportional hazards, whereas in the case of delayed effects we obtained higher powers when emphasizing late differences between the survival curves (γ = 1).
Table 3.

Median empirical size and median empirical power over the simulated null and alternative scenarios, respectively. The empirical size and powers are calculated using: the proposed statistics (in (2)) with the pooled, unpooled, and bootstrap variance estimators (labeled as Pooled, Unpooled, and Boots.); and the Bonferroni procedure (Bonf.). Under the null hypothesis there is no effect on either endpoint (d = 0, HR = 1). Under the alternative hypothesis there is an effect on both endpoints (Case 1: d = 0.075, HR = 0.75) and the effect on the survival endpoint satisfies the proportional hazards assumption (t* = 0).

                   | Empirical size                    | Empirical powers (Case 1)
                   | Pooled | Unpooled | Boots. | Bonf. | Pooled | Unpooled | Boots. | Bonf.
(τb, τ)   (0.5, 1) | 0.053  | 0.053    | 0.050  | 0.050 | 0.84   | 0.85     | 0.83   | 0.80
          (1, 1)   | 0.056  | 0.055    | 0.050  | 0.050 | 0.85   | 0.85     | 0.83   | 0.79
θ         0.001    | 0.051  | 0.051    | 0.052  | 0.051 | 0.87   | 0.88     | 0.87   | 0.82
          0.510    | 0.054  | 0.055    | 0.049  | 0.050 | 0.84   | 0.83     | 0.81   | 0.79
          0.910    | 0.056  | 0.056    | 0.048  | 0.049 | 0.82   | 0.82     | 0.79   | 0.78
p(0)      0.1      | 0.053  | 0.054    | 0.050  | 0.051 | 0.89   | 0.89     | 0.88   | 0.84
          0.3      | 0.053  | 0.055    | 0.050  | 0.049 | 0.78   | 0.78     | 0.76   | 0.73
a         0.5      | 0.053  | 0.054    | 0.051  | 0.050 | 0.87   | 0.87     | 0.86   | 0.82
          1        | 0.053  | 0.054    | 0.050  | 0.050 | 0.84   | 0.85     | 0.83   | 0.80
          2        | 0.054  | 0.055    | 0.048  | 0.050 | 0.82   | 0.81     | 0.79   | 0.78
(ρ, γ, η) (0,0,1)  | 0.054  | 0.053    | 0.051  | 0.050 | 0.84   | 0.85     | 0.83   | 0.80
          (0,1,1)  | 0.054  | 0.055    | 0.051  | 0.050 | 0.84   | 0.85     | 0.83   | 0.80
          (1,0,1)  | 0.053  | 0.055    | 0.050  | 0.050 | 0.83   | 0.82     | 0.80   | 0.79
          (1,1,1)  | 0.054  | 0.053    | 0.050  | 0.050 | 0.86   | 0.86     | 0.84   | 0.81
Figure 3 shows boxplots of the empirical powers using the pooled, unpooled, and bootstrap variance estimators and the Bonferroni procedure. These simulations show the superiority of the proposed statistics over the Bonferroni procedure, in terms of power, both under proportional hazards and under delayed effects, and regardless of the choice of the weights (ωb, ωs).
Figure 3.

Boxplots of empirical powers based on the scenarios in Table 2. The empirical powers are calculated using: the proposed statistics (in (2)) with the pooled, unpooled, and bootstrap variance estimators; the Bonferroni procedure; and the individual statistics (3) and (4). The individual statistics for the binary and survival endpoints are labeled BE and SE, respectively. The colors (red, blue, and green) distinguish the three weight combinations (ωb, ωs) ∈ {(0.25, 0.75), (0.5, 0.5), (0.75, 0.25)}.

When there is a treatment effect on only one of the endpoints (cases 2 and 3), the behavior of the power mainly depends on the pre-specified weights (ωb, ωs) (see Figure 3). If the survival endpoint is considered clinically more important than the binary endpoint and we use the weights (ωb, ωs) = (0.25, 0.75), then the empirical power is relatively high in case 2 (i.e., when there is a treatment effect on the survival endpoint) and low in case 3 (i.e., when there is no effect on the survival endpoint). We found similar behavior when the binary endpoint is considered more important and (ωb, ωs) = (0.75, 0.25). If both endpoints are equally important and there is a treatment effect on only one of them, the empirical powers of the proposed statistics lie between the powers the two individual statistics would have achieved. Given that the Bonferroni procedure assigns more importance to the more highly significant of the endpoints, the powers in this case are higher using Bonferroni than using the proposed statistics. We also evaluated the empirical powers when one endpoint has a small effect instead of no effect: the difference in power between the proposed statistics and the Bonferroni procedure becomes smaller, and the powers of the proposed statistics can even exceed those of Bonferroni (see Supplemental material). Furthermore, we assessed how the association between the binary and survival endpoints might affect the results by considering Frank's copula instead of Clayton's copula. We noticed no substantial difference between the results using Frank's and Clayton's copulas; results using Frank's copula can be found in the Supplemental material.
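The power trade-off between the combined test and the Bonferroni procedure can be illustrated with a small sketch. It assumes two already-standardized statistics for the binary and survival endpoints (z_b, z_s) and an estimate rho of their correlation are available, and normalizes their weighted sum; this illustrates the weighted-combination idea only, not the paper's exact variance construction, and all names are ours.

```python
import math

def norm_sf(z):
    """One-sided p-value: P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def combined_vs_bonferroni(z_b, z_s, w_b, w_s, rho, alpha=0.05):
    """Normalized weighted combination of two standardized statistics
    (with estimated correlation rho) versus Bonferroni applied to the
    two individual one-sided tests. Returns (combined_rejects,
    bonferroni_rejects)."""
    u = (w_b * z_b + w_s * z_s) / math.sqrt(
        w_b ** 2 + w_s ** 2 + 2.0 * w_b * w_s * rho)
    combined = norm_sf(u) < alpha
    bonferroni = min(norm_sf(z_b), norm_sf(z_s)) < alpha / 2.0
    return combined, bonferroni

# moderate evidence on both endpoints: the combination rejects,
# while neither endpoint alone clears the Bonferroni threshold
print(combined_vs_bonferroni(1.8, 1.8, 0.5, 0.5, 0.3))  # (True, False)
```

This mirrors the simulation finding: with moderate effects on both endpoints the combination pools the evidence, whereas Bonferroni needs one endpoint to be individually strong.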

Size properties

The empirical results show that the type I error is very close to the nominal level across a broad range of situations, with median empirical sizes near 0.050 for the unpooled, pooled, and bootstrap variance estimators. Table 3 summarizes the results according to the parameters of the simulation study. The results show that the proposed statistics have the appropriate size and are not especially influenced by the selection of the weights (ωb, ωs). We observed that when using the unpooled and pooled estimators, the empirical size is slightly larger than 0.050, especially when the follow-up for the binary endpoint is long (τb = 1). This is explained mainly by the number of individuals at risk at the end of the follow-up: having few individuals at risk makes the smooth estimation of the probability in (7) difficult. Therefore, we recommend the bootstrap variance estimator for studies with small sample sizes, long follow-up for both the binary and survival endpoints, and a low probability of observing the binary endpoint.
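As a hedged sketch of what a bootstrap variance estimator looks like here, the following resamples with replacement within each arm and recomputes the statistic; the function names and the toy difference-in-proportions statistic are ours, not the paper's construction for the combined statistic.

```python
import numpy as np

def bootstrap_variance(stat_fn, arm0, arm1, n_boot=1000, rng=None):
    """Nonparametric bootstrap variance of a two-sample statistic:
    resample with replacement within each arm, recompute the statistic,
    and take the sample variance of the bootstrap replicates."""
    rng = rng or np.random.default_rng(0)
    n0, n1 = len(arm0), len(arm1)
    reps = np.empty(n_boot)
    for i in range(n_boot):
        s0 = arm0[rng.integers(0, n0, size=n0)]  # resample control arm
        s1 = arm1[rng.integers(0, n1, size=n1)]  # resample treatment arm
        reps[i] = stat_fn(s0, s1)
    return reps.var(ddof=1)

# toy illustration with a difference in response proportions
rng = np.random.default_rng(1)
arm0 = rng.binomial(1, 0.100, size=250)
arm1 = rng.binomial(1, 0.175, size=250)
var_hat = bootstrap_variance(lambda a, b: b.mean() - a.mean(),
                             arm0, arm1, n_boot=1000, rng=rng)
```

For this toy statistic, var_hat should land close to the analytic value p0(1-p0)/n0 + p1(1-p1)/n1; the appeal of the bootstrap in the paper's setting is that it avoids the small-sample instability of plug-in variance estimators.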

Discussion

We have proposed a class of statistics for two-sample comparisons based on two different outcomes: a dichotomous one capturing, on most occasions, short-term effects, and a second one aimed at detecting long-term differences in a survival endpoint. These statistics test the equality of proportions and the equality of survival functions. The approach combines a score test for the difference in proportions with a weighted Kaplan-Meier test for the difference between survival functions. The statistics are fully non-parametric and asymptotically attain the nominal level for testing the null hypothesis of no effect on either of the two outcomes. The statistics in the proposed class are appealing in situations where both outcomes are relevant, regardless of the lengths of the follow-up periods of each outcome, and even when the hazards are not proportional for the time-to-event outcome or in the presence of delayed treatment effects, although the survival curves are assumed not to cross. In this work, we focus mainly on clinical trials where the groups to be compared are supposed to differ only with respect to the treatment they receive. However, in observational studies, and sometimes also in clinical trials, there is a need to adjust for covariates other than treatment. Whenever the number of covariates is not too large, stratified tests can be useful; the stratified version of the proposed test is deferred to the Supplemental material. We have incorporated weight functions in the statistics to control the relative relevance of each outcome and to specify the type of survival differences that may exist between groups. In our proposed statistics, the weights are defined with the goal of incorporating the potential difference in clinical importance between the binary and survival endpoints, and therefore they must be fixed at the planning stage.
As shown in the simulation study, the power of the trial will depend on the trial objectives and hence on the relevance assigned to each of the endpoints through the weights (ωb, ωs). A sensitivity analysis could be carried out by calculating the statistics for a pre-specified set of weights, as is often done with weighted log-rank statistics. The extension of these statistics to data-driven weights that maximize power will be considered in future work. The testing procedure based on the proposed class of statistics satisfies the coherence property, which says that the nonrejection of an intersection hypothesis implies the nonrejection of any sub-hypothesis it implies. However, the procedure does not fulfil the consonance property, which states that the rejection of the global null hypothesis implies the rejection of at least one of its sub-hypotheses. Bittman et al. addressed the problem of combining tests into a multiple testing procedure that satisfies both the coherence and consonance principles; extending the present work to obtain a testing procedure satisfying both properties is an important line of future research. If, however, an efficacy claim for the individual endpoints is desired, alternative methods could also be considered, such as an adjusted Bonferroni procedure (taking into account the joint distribution of the univariate tests stated in Section 'Large sample results'), or a two-stage method in which efficacy is assessed with the univariate tests after detecting an overall effect. In this paper, we have considered weighted Kaplan-Meier tests for comparing the survival curves, motivated by recent immunotherapy trials in which the proportional hazards assumption is in doubt. However, if the survival endpoint satisfies the proportional hazards assumption, the combination of a weighted log-rank test and a difference-in-proportions test could also be considered.
Previous works by Wei and Lachin[34,35] could then be used for this problem. The comparison of the relative efficiency between tests combining binary and survival data using weighted log-rank tests or weighted Kaplan-Meier tests together with differences in proportions, under a range of alternative hypotheses, is open for future research. This work has been restricted to cases in which censoring and survival do not prevent assessing the binary endpoint. Two questions remain open for future research on this point. The first is the possibility of censoring occurring early enough to make it impossible to assess the binary endpoint; the second is the fact that the survival event may preclude observing the binary endpoint. While the second case is a competing-risks problem, the first concerns censoring of the binary endpoint. Regarding the competing-risks problem, it should also be noted that the time-to-event outcome is usually the more relevant in practice, and in this situation it is generally assumed that if the survival event occurs before the binary endpoint can be assessed, the binary endpoint takes the value representing the worst outcome. Further research is needed to study the implications of competing risks and censoring for this type of design. We plan to extend the proposed class of statistics to more general censoring schemes in which the binary endpoint may be censored or subject to competing risks. Last but not least, extensions to sequential and adaptive procedures, in which the binary outcome could be tested at more than one time point, remain open for future research. Supplemental material for this article (sj-pdf-1-smm-10.1177_09622802211048030) is available online.
References (21 in total)

1.  Simultaneous inference in general parametric models.

Authors:  Torsten Hothorn; Frank Bretz; Peter Westfall
Journal:  Biom J       Date:  2008-06       Impact factor: 2.207

2.  Improved survival with ipilimumab in patients with metastatic melanoma.

Authors:  F Stephen Hodi; Steven J O'Day; David F McDermott; Robert W Weber; Jeffrey A Sosman; John B Haanen; Rene Gonzalez; Caroline Robert; Dirk Schadendorf; Jessica C Hassel; Wallace Akerley; Alfons J M van den Eertwegh; Jose Lutzky; Paul Lorigan; Julia M Vaubel; Gerald P Linette; David Hogg; Christian H Ottensmeier; Celeste Lebbé; Christian Peschel; Ian Quirt; Joseph I Clark; Jedd D Wolchok; Jeffrey S Weber; Jason Tian; Michael J Yellin; Geoffrey M Nichol; Axel Hoos; Walter J Urba
Journal:  N Engl J Med       Date:  2010-06-05       Impact factor: 91.245

3.  Weighted Kaplan-Meier statistics: a class of distance tests for censored survival data.

Authors:  M S Pepe; T R Fleming
Journal:  Biometrics       Date:  1989-06       Impact factor: 2.571

4.  The analysis of multiple endpoints in clinical trials.

Authors:  S J Pocock; N L Geller; A A Tsiatis
Journal:  Biometrics       Date:  1987-09       Impact factor: 2.571

5.  Joint modeling of binary response and survival for clustered data in clinical trials.

Authors:  Bingshu E Chen; Jia Wang
Journal:  Stat Med       Date:  2019-11-28       Impact factor: 2.373

6.  Multiple significance tests: the Bonferroni method.

Authors:  J M Bland; D G Altman
Journal:  BMJ       Date:  1995-01-21

7.  Design of oncology clinical trials: a review.

Authors:  Revathi Ananthakrishnan; Sandeep Menon
Journal:  Crit Rev Oncol Hematol       Date:  2013-04-25       Impact factor: 6.312

8.  Application of the Wei-Lachin multivariate one-directional test to multiple event-time outcomes.

Authors:  John M Lachin; Ionut Bebu
Journal:  Clin Trials       Date:  2015-09-02       Impact factor: 2.486

9.  Comparing treatments in the presence of crossing survival curves: an application to bone marrow transplantation.

Authors:  Brent R Logan; John P Klein; Mei-Jie Zhang
Journal:  Biometrics       Date:  2008-01-11       Impact factor: 1.701

10.  iHIVARNA phase IIa, a randomized, placebo-controlled, double-blinded trial to evaluate the safety and immunogenicity of iHIVARNA-01 in chronically HIV-infected patients under stable combined antiretroviral therapy.

Authors:  Wesley de Jong; Joeri Aerts; Sabine Allard; Christian Brander; Jozefien Buyze; Eric Florence; Eric van Gorp; Guido Vanham; Lorna Leal; Beatriz Mothe; Kris Thielemans; Montse Plana; Félipe Garcia; Rob Gruters
Journal:  Trials       Date:  2019-06-17       Impact factor: 2.279

