
Conditional screening for ultrahigh-dimensional survival data in case-cohort studies.

Jing Zhang, Haibo Zhou, Yanyan Liu, Jianwen Cai.

Abstract

The case-cohort design has been widely used to reduce the cost of covariate measurements in large cohort studies. In many such studies, the number of covariates is very large, and the goal of the research is to identify active covariates that have a great influence on the response. Since the introduction of sure independence screening, screening procedures have achieved great success in reducing the dimensionality and identifying active covariates. However, commonly used screening methods are based on marginal correlation or its variants, so they may fail to identify hidden active variables that are jointly important but weakly correlated with the response. Moreover, these screening methods were mainly proposed for data under simple random sampling and cannot be directly applied to case-cohort data. In this paper, we consider ultrahigh-dimensional survival data under the case-cohort design and propose a conditional screening method that incorporates important prior information on active variables. This method can effectively detect hidden active variables. Furthermore, it possesses the sure screening property under some mild regularity conditions and does not require any complicated numerical optimization. We evaluate the finite sample performance of the proposed method via extensive simulation studies and further illustrate the new approach through a real data set from patients with breast cancer.
© 2021. The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.

Keywords:  Case-cohort design; Conditional screening; Sure screening property; Survival data; Ultrahigh-dimensional data; Weighted estimating equation

Year:  2021        PMID: 34417679      PMCID: PMC8561435          DOI: 10.1007/s10985-021-09531-7

Source DB:  PubMed          Journal:  Lifetime Data Anal        ISSN: 1380-7870            Impact factor:   1.429


Introduction

In large epidemiological cohort studies, it is common for some diseases of interest (e.g., cancer, heart disease, HIV infection) to have very low incidence. In addition, some exposures can be very expensive to measure, and it is not feasible to obtain the measurements on all cohort members because of resource restrictions. To reduce the cost while keeping as much efficiency as possible, Prentice (1986) proposed the case-cohort design, in which the expensive covariates are obtained only for a random sample of the full cohort, called the subcohort, and for the additional cases who have experienced the event of interest during the follow-up period. When the covariate dimension p is smaller than the sample size n, various methods have been proposed for analyzing data under this design, such as the pseudo-likelihood approach (Prentice, 1986; Self and Prentice, 1988; Kalbfleisch and Lawless, 1988), the estimating equation method (Chen and Lo, 1999; Chen, 2001), the multiple imputation approach (Marti and Chavance, 2011; Keogh and White, 2013), maximum likelihood estimation (Scheike and Martinussen, 2004; Zeng and Lin, 2014), and the weighted estimating equation approach (Barlow, 1994; Borgan et al., 2000; Kulich and Lin, 2004; Breslow and Wellner, 2007; Kang and Cai, 2009; Kim et al., 2013), among others. With the rapid development of biomedical technology, high-dimensional data are frequently collected in large epidemiological studies. The distinguishing feature of such data is that the covariate dimension p is much larger than the sample size n. An important purpose of analyzing this type of data is to identify a subset of covariates related to the event of interest and to construct effective models based on the selected covariates.
For scenarios where p increases with n at a polynomial rate (e.g., p = n^α with α > 0), regularization has been demonstrated to be an effective dimension reduction method for simple random sampling (SRS) data (e.g., Tibshirani, 1996; Fan and Li, 2001; Zou, 2006; Candes and Tao, 2007; Zhang, 2010) and has recently been generalized to high-dimensional data under the case-cohort design. For example, Ni et al. (2016) proposed a variable selection procedure using the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) for scenarios where p increases at a slower rate than n. Kim and Ahn (2019) proposed a bi-level variable selection method to select non-zero groups and within-group variables when the variables have a group structure. These methods can select variables and estimate parameters simultaneously. However, when the dimension p is ultrahigh in the sense that p = exp(n^α) with α > 0, the computation inherent in regularization methods faces the simultaneous challenges of computational expediency, statistical accuracy, and algorithmic stability (Fan et al., 2009). For SRS data, feature screening has achieved great success in dealing with the challenges of ultrahigh-dimensional settings. Various marginal screening methods have been proposed under different settings, such as linear models (Fan and Lv, 2008), generalized linear models (Fan and Song, 2010), additive models (Fan et al., 2011), varying coefficient models (Fan et al., 2014; Liu et al., 2014) and model-free scenarios (e.g., Zhu et al., 2011; Li et al., 2012a; Li et al., 2012b; He et al., 2013; Chang et al., 2013; Cui et al., 2015; Mai and Zou, 2015; Wu and Yin, 2015).
For censored survival data, several model-based screening methods (e.g., Tibshirani, 2009; Zhao and Li, 2012; Gorst-Rasmussen and Scheike, 2013) and model-free screening methods (e.g., Song et al., 2014; Wu and Yin, 2015; Zhang et al., 2017; Zhou and Zhu, 2017; Liu et al., 2018; Zhang et al., 2018; Lin et al., 2018; Pan et al., 2019) have been proposed by defining different marginal utilities. Although they are powerful in reducing the dimensionality, they face challenges in some situations. For instance, as noted in Fan and Lv (2008), the correlation among covariates heavily influences the marginal utility. When the correlation among covariates is relatively high, marginal screening methods may fail to retain hidden active variables, which have a great influence on the response but are only weakly correlated with it marginally. Although some iterative screening methods (e.g., Fan and Lv, 2008; Zhu et al., 2011; Zhang et al., 2018; Pan et al., 2019) and forward screening approaches (e.g., Wang, 2009) have been proposed to alleviate this problem, their computation is relatively slow and their statistical properties are elusive. In many applications, researchers can obtain some prior information on active variables from previous investigations and experience. For example, in the breast cancer study (van de Vijver et al., 2002), gene AL080059 is known in the literature to be predictive of patients’ survival time (Yeung et al., 2005; van’t Veer et al., 2002). Barut et al. (2016) pointed out that the accuracy of variable screening can be improved by including such prior knowledge. Following this idea, they proposed the conditional screening approach for generalized linear models and showed that conditioning helps reduce the correlation among covariates and can thus detect the hidden active variables with higher probability. Hong et al. (2016) further proposed to integrate prior information using data-driven approaches.
Hu and Lin (2017) put forward a conditional screening procedure that ranks covariates based on conditional marginal empirical likelihood ratios. Liu and Wang (2017) proposed a screening method based on conditional distance correlation. Hong et al. (2018) developed a conditional screening method for censored data under the proportional hazards model. Liu and Chen (2018) considered the conditional quantile independence screening approach for ultrahigh-dimensional heterogeneous data. Lu and Lin (2020) proposed a model-free conditional screening method via conditional distance correlation. Extensive simulation studies showed that these conditional screening methods, which incorporate important prior information on active variables, provide a powerful means of identifying hidden active variables for ultrahigh-dimensional data.
The research on marginal and conditional screening methods has been fruitful for ultrahigh-dimensional SRS data. To the best of our knowledge, however, conditional screening has not been studied for case-cohort data, and the existing conditional screening methods cannot be directly applied to case-cohort data because of its special data structure. To fill this gap, we propose a conditional screening method for ultrahigh-dimensional case-cohort data under the framework of the Cox proportional hazards model. We construct a marginal hazards regression model for each covariate that also includes the known important covariates. As some covariates are not fully observed, we build a weighted estimating equation to obtain the estimators of the parameters. We then propose marginal utilities based on the parameter estimates to measure the contribution of each covariate and retain the covariates with the top-ranked contributions. We refer to this as the conditional weighted screening method, or the CWSIS procedure for short. As pointed out by Barut et al. (2016), the correlation between covariates can be weakened upon conditioning, so that hidden active covariates have a higher chance of being retained. Therefore, the proposed method enables the detection of hidden active covariates for ultrahigh-dimensional survival data under the case-cohort design. Under some reasonable conditions, it enjoys the sure screening property and ranking consistency. Our research is the first to focus on conditional screening for ultrahigh-dimensional case-cohort data, and it can be viewed as an extension of Hong et al. (2018) from SRS data to case-cohort data. Although the ideas are similar, the generalization is quite challenging because of the much more complex structure of case-cohort data, and both the implementation and the theory are quite different.
The rest of the article is organized as follows. In Section 2, we introduce the model and data and present the details of the CWSIS procedure. In Section 3, we establish the theoretical properties of the proposed CWSIS method. Section 4 presents results from simulation studies. A real data set from the breast cancer study is analyzed in Section 5. Section 6 provides some remarks and discussion. The regularity conditions and the technical proofs are presented in the Appendix.

Conditional screening for case-cohort data

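Suppose there are n independent subjects in a cohort study. Let $T_i$ and $C_i$ denote the failure time and censoring time of subject i. Because of right-censoring, we only observe $X_i = \min(T_i, C_i)$ and $\Delta_i = I(T_i \le C_i)$. Let $Z_i = (Z_{i1}, \ldots, Z_{ip})^T$ denote the p-dimensional covariate. Under the case-cohort design, $Z_i$ is available only for the cases ($\Delta_i = 1$) and the subcohort (a random subset of the full cohort). Let $\xi_i$ be the indicator of subcohort membership, i.e., $\xi_i = 1$ if the ith subject in the full cohort is selected into the subcohort and $\xi_i = 0$ otherwise. For the selection of the subcohort, we consider independent Bernoulli sampling with selection probability $\pi = \Pr(\xi_i = 1) \in (0, 1)$. Thus, the observable data for the ith subject are $\{X_i, \Delta_i, Z_i, \xi_i\}$ when $\xi_i = 1$ or $\Delta_i = 1$, and $\{X_i, \Delta_i, \xi_i\}$ when $\xi_i = 0$ and $\Delta_i = 0$.
Suppose that the failure time follows the proportional hazards model (Cox, 1972), under which the conditional hazard function of $T_i$ given $Z_i$ has the form
$$\lambda(t \mid Z_i) = \lambda_0(t)\exp(\alpha^T Z_i), \qquad (1)$$
where $\lambda_0(t)$ is the unspecified baseline hazard function and $\alpha = (\alpha_1, \ldots, \alpha_p)^T$ is the unknown regression parameter. Assume that the failure time $T_i$ and the censoring time $C_i$ are independent given $Z_i$. In an ultrahigh-dimensional setting, the dimensionality p greatly exceeds the sample size n and is allowed to increase at an exponential rate of n. Under the sparsity principle, only a small number of covariates have a great influence on the response variable, i.e., $\|\alpha\|_0$ is much smaller than p, where $\|\alpha\|_0$ denotes the number of nonzero elements of $\alpha$.
Assume we have the prior information that a set of covariates is related to the survival time T. Denote the index set of these covariates by $\mathcal{C}$, and let $q = |\mathcal{C}|$ denote the number of covariates in $\mathcal{C}$. Write $\alpha_{\mathcal{C}} = (\alpha_j, j \in \mathcal{C})^T$ and $\alpha_{-\mathcal{C}} = (\alpha_j, j \notin \mathcal{C})^T$, and let $Z_{i,\mathcal{C}}$ and $Z_{i,-\mathcal{C}}$ be the corresponding subvectors of $Z_i$. Here, $\mathcal{C}$ is known, and $\alpha_{\mathcal{C}}$ and $\alpha_{-\mathcal{C}}$ are unknown. The true hazard function in (1) is equivalent to
$$\lambda(t \mid Z_i) = \lambda_0(t)\exp(\alpha_{\mathcal{C}}^T Z_{i,\mathcal{C}} + \alpha_{-\mathcal{C}}^T Z_{i,-\mathcal{C}}).$$
Let $\mathcal{M}_{-\mathcal{C}} = \{j \notin \mathcal{C}: \alpha_j \ne 0\}$ be the true set of non-zero coefficients outside $\mathcal{C}$, with cardinality $|\mathcal{M}_{-\mathcal{C}}|$. Our goal is to recover the set $\mathcal{M}_{-\mathcal{C}}$ as precisely as possible based on data from case-cohort studies. In other words, we want to find a subset $\widehat{\mathcal{M}}_{-\mathcal{C}}$ of covariates that satisfies $\mathcal{M}_{-\mathcal{C}} \subset \widehat{\mathcal{M}}_{-\mathcal{C}}$.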
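The sampling scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `case_cohort_sample`, the toy event indicators, and the seed are hypothetical and do not come from the paper. The sketch draws the subcohort indicator for each subject by independent Bernoulli sampling and marks the subjects whose covariates would be measured (subcohort members and all cases).

```python
import random

def case_cohort_sample(delta, pi, seed=0):
    """Bernoulli subcohort sampling for a case-cohort design (illustrative).

    delta: list of 0/1 event indicators for the full cohort.
    pi:    subcohort selection probability in (0, 1).
    Returns (xi, observed): xi[i] = 1 if subject i is in the subcohort,
    observed[i] is True when covariates are measured (xi = 1 or delta = 1).
    """
    rng = random.Random(seed)
    xi = [1 if rng.random() < pi else 0 for _ in delta]
    observed = [bool(x or d) for x, d in zip(xi, delta)]
    return xi, observed

# Toy full cohort of 10 subjects with 2 cases (hypothetical values).
delta = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
xi, observed = case_cohort_sample(delta, pi=0.3, seed=1)
print(sum(observed), "of", len(delta), "subjects have covariates measured")
```

Every case is observed regardless of the draw, which is what makes the design economical when the event is rare.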
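To perform an initial screening, we construct a marginal Cox regression model for each covariate individually, adding the known covariates in $\mathcal{C}$ to each marginal model. Specifically, for $j \notin \mathcal{C}$, write $Z_{i,\mathcal{C}} = (Z_{ik}, k \in \mathcal{C})^T$ and $\tilde{Z}_{ij} = (Z_{i,\mathcal{C}}^T, Z_{ij})^T$. The hazard function of $T_i$ given $(Z_{i,\mathcal{C}}, Z_{ij})$ has the form
$$\lambda_j(t \mid Z_{i,\mathcal{C}}, Z_{ij}) = \lambda_{0j}(t)\exp(\beta_{\mathcal{C}j}^T Z_{i,\mathcal{C}} + \beta_j Z_{ij}),$$
where $\lambda_{0j}(t)$ is the unspecified baseline hazard function, and $\beta_{\mathcal{C}j}$ and $\beta_j$ are the unknown regression parameters corresponding to covariates $Z_{i,\mathcal{C}}$ and $Z_{ij}$ in the marginal Cox model, respectively.
Since the covariates can only be observed for the selected subcohort and the cases in case-cohort data, we consider the following weighted estimating equation
$$U_j(\eta_j) = \sum_{i=1}^{n} \int_0^{\tau} \left\{ \tilde{Z}_{ij} - \frac{S_j^{(1)}(\eta_j, t)}{S_j^{(0)}(\eta_j, t)} \right\} dN_i(t) = 0,$$
with $\eta_j = (\beta_{\mathcal{C}j}^T, \beta_j)^T$, $N_i(t) = I(X_i \le t, \Delta_i = 1)$ and $Y_i(t) = I(X_i \ge t)$, where
$$S_j^{(l)}(\eta_j, t) = \frac{1}{n}\sum_{i=1}^{n} w_i(t) Y_i(t) \tilde{Z}_{ij}^{\otimes l} \exp(\eta_j^T \tilde{Z}_{ij})$$
for $a^{\otimes 0} = 1$, $a^{\otimes 1} = a$, $a^{\otimes 2} = a a^T$ and l = 0, 1, 2. Here, we choose the time-varying weight function $w_i(t) = \Delta_i + (1 - \Delta_i)\xi_i/\hat{\pi}(t)$, where $\hat{\pi}(t)$ is a consistent estimator of the true sampling probability $\pi$. Note that $w_i(t)$ weights the ith subject by the inverse probability of selection: it equals 1 for the cases and $1/\hat{\pi}(t)$ for the sampled censored subjects.
The maximum marginal pseudo-partial likelihood estimator $\hat{\eta}_j = (\hat{\beta}_{\mathcal{C}j}^T, \hat{\beta}_j)^T$ is defined as the solution to the weighted estimating equation $U_j(\eta_j) = 0$. Define the information matrix $I_j(\eta_j)$, which is of dimension (q + 1). Let $\hat{V}_j$ be the variance estimate of $\hat{\beta}_j$, i.e., the (q + 1)th diagonal element of the matrix $I_j(\hat{\eta}_j)^{-1}$. For $j \notin \mathcal{C}$, we define a utility $\hat{u}_j$ based on $\hat{\beta}_j$ and its variance estimate, which serves as the proposed utility measure for the jth covariate. We rank the covariates $Z_j$ ($j \notin \mathcal{C}$) by the value of $\hat{u}_j$ from largest to smallest and retain those at the top of the ranked list. For a given threshold γ > 0, the index set selected in addition to the set $\mathcal{C}$ is given by
$$\widehat{\mathcal{M}}_{-\mathcal{C}} = \{ j \notin \mathcal{C}: \hat{u}_j \ge \gamma \}.$$
In practical applications, we can pre-determine a positive integer d0 and define the estimated active set as the set of indices $j \notin \mathcal{C}$ whose utilities $\hat{u}_j$ are among the d0 largest. Similar to Fan and Lv (2008) and other literature on feature screening, we can choose d0 = ⌈n/log n⌉, where n here denotes the case-cohort sample size. As with the conditional screening procedures of Barut et al. (2016) and Hong et al. (2018), the outstanding advantage of the proposed CWSIS procedure is that it enables the detection of hidden active covariates for ultrahigh-dimensional case-cohort data.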
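The weighting and the top-d0 retention rule can be illustrated with a small sketch. It makes two simplifying assumptions that are not the paper's implementation: the sampling probability estimate is treated as a single time-fixed number `pi_hat` rather than a time-varying estimator, and the utilities are supplied as an already computed dictionary rather than obtained by solving the weighted estimating equation. The names `ipw_weights` and `top_d0` are hypothetical.

```python
import math

def ipw_weights(delta, xi, pi_hat):
    """Inverse-probability weights: 1 for cases, 1/pi_hat for sampled
    non-cases, 0 for non-cases outside the subcohort (time-fixed sketch)."""
    return [d + (1 - d) * x / pi_hat for d, x in zip(delta, xi)]

def top_d0(utilities, n_cc):
    """Keep the d0 = ceil(n_cc / log(n_cc)) covariates with the largest utility.

    utilities: dict mapping covariate index -> utility value (assumed given).
    n_cc:      case-cohort sample size.
    """
    d0 = math.ceil(n_cc / math.log(n_cc))
    ranked = sorted(utilities, key=utilities.get, reverse=True)
    return ranked[:d0]

# Case, sampled non-case, unsampled non-case.
weights = ipw_weights(delta=[1, 0, 0], xi=[0, 1, 0], pi_hat=0.5)

# Hypothetical utilities for 40 candidate covariates.
utilities = {j: float(j) for j in range(40)}
selected = top_d0(utilities, n_cc=155)
```

Using a fixed d0 avoids having to choose the threshold γ directly; the two rules coincide when γ is set to the d0-th largest utility.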
To demonstrate this merit, we set up an example similar to those of Barut et al. (2016) and Hong et al. (2018). In particular, the failure time T follows the Cox proportional hazards model $\lambda(t \mid Z) = \lambda_0(t)\exp(\alpha^T Z)$, where $\lambda_0(t) = 1$, only the first five components of $\alpha$ are nonzero, and $Z \sim N(0, \Sigma)$ with $\Sigma = (\sigma_{ij})$, $\sigma_{ii} = 1$ for i = 1,…,p and $\sigma_{ij} = 0.5$ for i ≠ j. By this design, $Z_5$ is a hidden active covariate. We consider four different conditioning sets, $\mathcal{C}$ = ∅, {1}, {1, 2}, {6, 7, 8}. The densities of the proposed screening statistic for $Z_5$ (the hidden active covariate) and $Z_6, \ldots, Z_{2000}$ (inactive covariates) are summarized in Figure 1. When $\mathcal{C}$ = ∅, CWSIS is equivalent to the marginal screening approach, and the value of the screening statistic for $Z_5$ is, with high probability, much smaller than the corresponding values for the inactive covariates. When the conditioning set includes one truly active covariate ($\mathcal{C}$ = {1}), the curve for $Z_5$ lies to the right and there is a clear separation between the two curves. When we include more truly active covariates ($\mathcal{C}$ = {1, 2}), this separation becomes larger. We note a very interesting phenomenon: when the conditioning set consists of three inactive covariates ($\mathcal{C}$ = {6, 7, 8}), the chance of identifying the hidden variable $Z_5$ using CWSIS is still higher than with the marginal screening method. This may be due to the correlation between these covariates and the active covariates; such inactive variables can effectively function as surrogates for the active variables, so conditioning on them can help detect hidden variables. A similar phenomenon was also observed in Barut et al. (2016) and Hong et al. (2018).
Fig. 1

Density of the screening statistic for the hidden active covariate Z5 compared with a mixture of densities of the inactive covariates Z6, …, Z2000 under different conditioning sets C. Case 1: C = ∅, which is equivalent to marginal screening; Case 2: C = {1}, one truly active covariate; Case 3: C = {1, 2}, two truly active covariates; Case 4: C = {6, 7, 8}, three inactive covariates. The full cohort sample size is n = 500, the number of covariates is p = 2000, the noncase-to-case ratio is 1 : 1, and the failure rate equals 20%.

Theoretical property

In this section, we show that the CWSIS procedure enjoys the sure screening property and the ranking consistency property: the procedure tends to rank the active covariates above the inactive ones with high probability, and all the active covariates survive the screening with probability tending to 1 as n → ∞. These two properties lay the theoretical foundation of the CWSIS procedure. Define and for and l = 0, 1, 2. Let be the solution of the following equation , with The regularity conditions are given in Appendix A, under which we establish the following lemmas and theorems.
Lemma 1. Under conditions C1-C8, if and only if α = 0 for all .
Lemma 2. Suppose conditions C1-C8 hold. Then there exist constants c2 > 0 and 0 < κ < 1/2 such that
Lemma 3. Under conditions C1-C8, for any ϵ1 > 0 and ϵ2 > 0, there exist a positive constant c3 and an integer N such that for any n > N and 0 < κ < 1/2, where a is the size of the true active set, q is the size of the conditioning set, and c2 is the same constant as in Lemma 2.
Lemma 3 shows that the proposed maximum marginal pseudo-partial likelihood estimate is a consistent estimate of its population counterpart. By Lemmas 1 and 3, we can indeed distinguish the active covariates from the inactive ones by the proposed marginal utility. Theorem 1 states the sure screening property of the CWSIS procedure.
Theorem 1 (The sure screening property). Under conditions C1-C8, for any 0 < κ < 1/2 and ϵ2 > 0, there exists a positive constant c3 such that where a is the size of the true active set and q is the size of the conditioning set. Furthermore, we have
From this theorem, we can see that all active covariates survive the screening with probability tending to one. The next theorem establishes the ranking consistency property of the proposed method.
Theorem 2 (The ranking consistency). Under conditions C1-C8, we have when n → ∞.
This lays the theoretical foundation that our procedure ranks the active covariates ahead of the inactive ones with overwhelming probability. The proofs of the lemmas and theorems are presented in Appendix B.

Simulation studies

We examine the finite sample performance of the proposed CWSIS procedure and compare it with some existing methods via simulation studies. For brevity, we refer to the feature aberration at survival times screening procedure of Gorst-Rasmussen and Scheike (2013) as FAST, the principled sure independent screening procedure of Zhao and Li (2012) as PSIS, and the censored rank independence screening of Song et al. (2014) as CRIS. Furthermore, we consider the marginal weighted screening procedure (MWSIS), in which we fit a marginal Cox regression for each covariate, construct the weighted estimating equation to obtain the estimate of the marginal coefficient, and then define the active index set through a marginal utility that involves the information matrix I(β). As PSIS, FAST and CRIS can only deal with SRS data, we generate SRS data with the same sample size as the case-cohort data for PSIS, FAST and CRIS. We consider survival data generated from the Cox proportional hazards model and employ independent Bernoulli sampling to generate the subcohort. We consider full cohort sample sizes n = 500, 1000 and numbers of covariates p = 2000, 4000. As the incidence rate in case-cohort studies is usually very low or moderate, we consider a failure rate of 20% for n = 500, and 5% and 10% for n = 1000. We consider a noncase-to-case ratio of 1 : 1, so the sample size of the case-cohort data in our simulation studies equals 100 or 200. For each configuration, we repeat the simulation 500 times and employ three evaluation criteria (Li et al., 2012b). The first is the minimum model size required to include all active predictors; we present its median and interquartile range (IQR) over the 500 replications. The second is the proportion of replications in which each individual important variable is selected into the model of a given size d0, denoted by Pe. The third is the proportion of replications in which all important variables are selected into the model of a given size d0, denoted by Pa. An effective screening procedure is expected to yield a minimum model size close to the true model size and values of Pe and Pa close to one. Here, we choose d0 = ⌈n/log n⌉ (Fan and Lv, 2008), where n is taken to be the case-cohort sample size and ⌈x⌉ denotes the smallest integer not less than x.
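The three criteria can be computed directly from the ranked covariate lists produced in each replication. The sketch below is a minimal illustration; the function names and the toy rankings are hypothetical, and each ranking is assumed to list covariate indices from the largest utility to the smallest.

```python
def minimum_model_size(ranking, active):
    """Smallest k such that the top-k ranked covariates contain every active one."""
    position = {j: r + 1 for r, j in enumerate(ranking)}
    return max(position[j] for j in active)

def selection_proportions(rankings, active, d0):
    """Per-covariate proportion (Pe) and all-covariate proportion (Pa)
    of replications in which active covariates fall within the top d0."""
    reps = len(rankings)
    p_e = {j: sum(j in r[:d0] for r in rankings) / reps for j in active}
    p_a = sum(all(j in r[:d0] for j in active) for r in rankings) / reps
    return p_e, p_a

# Two toy replications with four covariates; covariates 1 and 5 are active.
rankings = [[1, 5, 2, 3], [2, 3, 1, 5]]
active = [1, 5]
sizes = [minimum_model_size(r, active) for r in rankings]
p_e, p_a = selection_proportions(rankings, active, d0=2)
```

Because Pa requires all active covariates to be retained at once, it can never exceed the smallest Pe, which is why a single hard-to-detect covariate drives Pa toward zero.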

Example 1.

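The failure times T are generated from the Cox proportional hazards model $\lambda(t \mid Z) = \lambda_0(t)\exp(\alpha^T Z)$, where only the first five components of $\alpha$ are nonzero and $Z \sim N(0, \Sigma)$ with $\Sigma = (\sigma_{ij})$, $\sigma_{ii} = 1$ for i = 1,…,p and $\sigma_{ij} = 0.5$ for i ≠ j. The censoring time C ~ Unif(0, τ), where the constant τ represents the end time of the study and is used to control the failure rate.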
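A data-generating step of this kind can be sketched as follows, under loudly stated assumptions. The coefficient vector `alpha` below is hypothetical (the values used in the study are not recoverable from this text); it is chosen only so that the fifth covariate is marginally uncorrelated with the linear predictor when the pairwise correlation is 0.5. The baseline hazard is taken to be 1, so failure times are exponential given the covariates, and the equicorrelated normal covariates are built from a shared factor.

```python
import math
import random

def simulate_cohort(n, p, alpha, rho=0.5, tau=1.0, seed=0):
    """Simulate (X, Delta, Z) from a Cox model with unit baseline hazard.

    Covariates are N(0, 1) with pairwise correlation rho, built as
    sqrt(rho) * shared + sqrt(1 - rho) * own noise.
    Censoring is Unif(0, tau); tau controls the failure rate.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        shared = rng.gauss(0, 1)
        z = [math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0, 1)
             for _ in range(p)]
        lp = sum(a * zj for a, zj in zip(alpha, z))
        t = rng.expovariate(math.exp(lp))  # hazard exp(lp) with lambda_0 = 1
        c = rng.uniform(0, tau)
        data.append((min(t, c), int(t <= c), z))
    return data

# Hypothetical coefficients: cov(Z5, alpha'Z) = 4 * 0.5 * 1 - 2 = 0.
alpha = [1, 1, 1, 1, -2] + [0] * 15
cohort = simulate_cohort(n=200, p=20, alpha=alpha, rho=0.5, tau=2.0, seed=42)
failure_rate = sum(d for _, d, _ in cohort) / len(cohort)
```

In a study like the one described, τ would be tuned until `failure_rate` matches the target (for example 5%, 10% or 20%), and the subcohort would then be drawn from the simulated cohort.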

Example 2.

We consider the same model as in example 1, except that only Z1 and Zp are active covariates. The first (p − 1) covariates are multivariate normal with mean zero and covariance matrix Σ = (σ_ij), where σ_ii = 1 for i = 1, …, (p − 1) and σ_ij = ρ for i ≠ j. We vary the value of ρ over 0, 0.3 and 0.7, with a larger ρ yielding higher collinearity. The last covariate Zp ~ N(0, 1). We compute the absolute correlation between the survival time T and each covariate Zj (j = 1, …, p) for p = 2000 through the inverse probability weighting scheme and summarize the marginal correlations in three groups: the active covariates (Z1, …, Z4 for example 1 and Z1 for example 2), the hidden active covariates (Z5 for example 1 and Zp for example 2), and the inactive covariates (Z6, …, Zp for example 1 and Z2, …, Z(p−1) for example 2). Figures 2 and 3 depict the distribution of the absolute correlation for these three groups, from which we can see that the marginal signal strength of the hidden active covariates is weaker than that of the inactive covariates. Therefore, the marginal screening methods MWSIS, PSIS, FAST and CRIS have difficulty identifying the hidden active covariates, and the proposed conditional screening method CWSIS is a natural alternative. In our simulations, we simply choose Z1 as the conditioning covariate. In practice, if we have no useful prior information about active covariates, we can choose the covariates with the highest marginal signal strength as the conditioning set (Barut et al., 2016; Lu and Lin, 2020). For a fair comparison, we add one (the number of conditioning covariates in our examples) to the minimum model size of the proposed conditional screening method CWSIS.
Fig. 2

Absolute correlation of the survival time and the covariates for p = 2000.

Fig. 3

Absolute correlation of the survival time and the covariates for p = 4000.

The simulation results for the minimum model size, Pe and Pa are summarized in Tables 1–2. From the values of Pe for Z5 in example 1 and Zp in example 2, we can conclude that the proposed CWSIS procedure detects the hidden active covariates with high probability, while the other four methods, MWSIS, PSIS, FAST and CRIS, fail to select them. In example 2, ρ equals 0, 0.3 and 0.7, with a larger ρ yielding higher collinearity. The proposed CWSIS method performs well even with high collinearity, while the other four methods do not behave well even when ρ = 0, and their performance deteriorates as ρ increases. As expected, CWSIS needs a smaller model size to possess the sure screening property in all settings. A larger case-cohort sample size and a higher failure rate are associated with better performance. In particular, a larger cohort sample size handles rare disease situations better.
Table 1

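The median and interquartile range (IQR) of the minimum model size, and the selection proportions Pe and Pa, among 500 replications for example 1

| p | n | FR | nc | Method | Median | IQR | Pe (X1) | Pe (X2) | Pe (X3) | Pe (X4) | Pe (X5) | Pa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000 | 1000 | 5% | 50 | PSIS | 1849 | 469 | 0.130 | 0.106 | 0.094 | 0.102 | 0.016 | 0.000 |
| | | | | FAST | 1843 | 446 | 0.116 | 0.108 | 0.106 | 0.104 | 0.016 | 0.000 |
| | | | | CRIS | 1398 | 764 | 0.290 | 0.270 | 0.288 | 0.250 | 0.220 | 0.084 |
| | | | | MWSIS | 2000 | 2 | 0.354 | 0.342 | 0.352 | 0.342 | 0.000 | 0.000 |
| | | | | CWSIS | 447 | 746 | - | 0.318 | 0.350 | 0.330 | 0.476 | 0.018 |
| | 1000 | 10% | 100 | PSIS | 1998 | 18 | 0.488 | 0.456 | 0.436 | 0.462 | 0.000 | 0.000 |
| | | | | FAST | 1998 | 19 | 0.474 | 0.444 | 0.422 | 0.450 | 0.000 | 0.000 |
| | | | | CRIS | 1721 | 379 | 0.050 | 0.032 | 0.020 | 0.038 | 0.004 | 0.000 |
| | | | | MWSIS | 2000 | 0 | 0.784 | 0.804 | 0.774 | 0.794 | 0.000 | 0.000 |
| | | | | CWSIS | 69 | 172 | - | 0.790 | 0.768 | 0.810 | 0.760 | 0.356 |
| | 500 | 20% | 100 | PSIS | 2000 | 1 | 0.686 | 0.706 | 0.706 | 0.668 | 0.002 | 0.000 |
| | | | | FAST | 2000 | 1 | 0.654 | 0.654 | 0.694 | 0.622 | 0.002 | 0.000 |
| | | | | CRIS | 1720 | 405 | 0.054 | 0.054 | 0.044 | 0.054 | 0.002 | 0.000 |
| | | | | MWSIS | 2000 | 0 | 0.812 | 0.832 | 0.840 | 0.798 | 0.000 | 0.000 |
| | | | | CWSIS | 47 | 168 | - | 0.828 | 0.852 | 0.806 | 0.764 | 0.442 |
| 4000 | 1000 | 5% | 50 | PSIS | 3747 | 716 | 0.384 | 0.364 | 0.376 | 0.380 | 0.002 | 0.000 |
| | | | | FAST | 3748 | 744 | 0.368 | 0.352 | 0.378 | 0.362 | 0.002 | 0.000 |
| | | | | CRIS | 3133 | 1100 | 0.022 | 0.018 | 0.014 | 0.022 | 0.000 | 0.000 |
| | | | | MWSIS | 4000 | 3 | 0.670 | 0.700 | 0.710 | 0.702 | 0.000 | 0.000 |
| | | | | CWSIS | 908 | 1477 | - | 0.720 | 0.680 | 0.734 | 0.688 | 0.252 |
| | 1000 | 10% | 100 | PSIS | 3995 | 46 | 0.384 | 0.364 | 0.376 | 0.380 | 0.002 | 0.000 |
| | | | | FAST | 3995 | 48 | 0.368 | 0.352 | 0.378 | 0.362 | 0.002 | 0.000 |
| | | | | CRIS | 3363 | 795 | 0.022 | 0.018 | 0.014 | 0.022 | 0.000 | 0.000 |
| | | | | MWSIS | 4000 | 0 | 0.670 | 0.700 | 0.710 | 0.702 | 0.000 | 0.000 |
| | | | | CWSIS | 136 | 389 | - | 0.720 | 0.680 | 0.734 | 0.688 | 0.252 |
| | 500 | 20% | 100 | PSIS | 4000 | 2 | 0.600 | 0.608 | 0.578 | 0.630 | 0.000 | 0.000 |
| | | | | FAST | 4000 | 2 | 0.582 | 0.592 | 0.574 | 0.582 | 0.000 | 0.000 |
| | | | | CRIS | 3447 | 871 | 0.036 | 0.050 | 0.038 | 0.024 | 0.000 | 0.000 |
| | | | | MWSIS | 4000 | 0 | 0.770 | 0.732 | 0.730 | 0.766 | 0.000 | 0.000 |
| | | | | CWSIS | 86 | 277 | - | 0.770 | 0.784 | 0.806 | 0.746 | 0.350 |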

n, the sample size of the full cohort; p, the number of covariates; FR, the failure rate; nc, the average number of cases; CWSIS: the proposed conditional screening method; MWSIS: the marginal weighted screening procedure; PSIS: the screening procedure of Zhao and Li (2012); FAST: the screening procedure of Gorst-Rasmussen and Scheike (2013); CRIS: the screening procedure of Song et al. (2014).

Table 2

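The median and interquartile range (IQR) of the minimum model size, and the selection proportions Pe and Pa, among 500 replications for example 2

| n | FR | ρ | nc | Method | Median (p = 2000) | IQR (p = 2000) | Pe X1 (p = 2000) | Pe Xp (p = 2000) | Pa (p = 2000) | Median (p = 4000) | IQR (p = 4000) | Pe X1 (p = 4000) | Pe Xp (p = 4000) | Pa (p = 4000) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 20% | 0 | 100 | PSIS | 578 | 979 | 1.000 | 0.092 | 0.092 | 1279 | 2088 | 1.000 | 0.066 | 0.066 |
| | | | | FAST | 594 | 975 | 1.000 | 0.090 | 0.090 | 1286 | 2054 | 1.000 | 0.068 | 0.068 |
| | | | | CRIS | 841 | 1005 | 1.000 | 0.032 | 0.032 | 1683 | 1920 | 0.998 | 0.014 | 0.014 |
| | | | | MWSIS | 424 | 951 | 1.000 | 0.104 | 0.104 | 936 | 1819 | 1.000 | 0.088 | 0.088 |
| | | | | CWSIS | 2 | 0 | - | 1.000 | 1.000 | 2 | 0 | - | 1.000 | 1.000 |
| | | 0.3 | 100 | PSIS | 1973 | 131 | 1.000 | 0.000 | 0.000 | 3958 | 302 | 1.000 | 0.000 | 0.000 |
| | | | | FAST | 1971 | 138 | 1.000 | 0.000 | 0.000 | 3958 | 311 | 1.000 | 0.000 | 0.000 |
| | | | | CRIS | 1278 | 1192 | 0.998 | 0.022 | 0.022 | 2795 | 2220 | 0.998 | 0.012 | 0.012 |
| | | | | MWSIS | 1997 | 33 | 1.000 | 0.000 | 0.000 | 3993 | 62 | 1.000 | 0.002 | 0.002 |
| | | | | CWSIS | 2 | 0 | - | 1.000 | 1.000 | 2 | 0 | - | 1.000 | 1.000 |
| | | 0.7 | 100 | PSIS | 2000 | 0 | 0.380 | 0.000 | 0.000 | 4000 | 0 | 0.278 | 0.000 | 0.000 |
| | | | | FAST | 2000 | 0 | 1.000 | 0.000 | 0.000 | 4000 | 0 | 1.000 | 0.000 | 0.000 |
| | | | | CRIS | 1945 | 460 | 0.990 | 0.008 | 0.008 | 3923 | 848 | 0.978 | 0.004 | 0.004 |
| | | | | MWSIS | 2000 | 0 | 0.676 | 0.000 | 0.000 | 4000 | 0 | 0.558 | 0.000 | 0.000 |
| | | | | CWSIS | 2 | 0 | - | 1.000 | 1.000 | 2 | 0 | - | 1.000 | 1.000 |
| 1000 | 10% | 0 | 100 | PSIS | 664 | 1033 | 1.000 | 0.064 | 0.064 | 1385 | 2072 | 1.000 | 0.038 | 0.038 |
| | | | | FAST | 684 | 1024 | 1.000 | 0.064 | 0.064 | 1376 | 2031 | 1.000 | 0.036 | 0.036 |
| | | | | CRIS | 937 | 982 | 0.920 | 0.014 | 0.008 | 1938 | 2140 | 0.816 | 0.002 | 0.000 |
| | | | | MWSIS | 315 | 795 | 1.000 | 0.170 | 0.170 | 599 | 1766 | 1.000 | 0.116 | 0.116 |
| | | | | CWSIS | 2 | 0 | - | 0.998 | 0.998 | 2 | 0 | - | 0.998 | 0.998 |
| | | 0.3 | 100 | PSIS | 1928 | 374 | 0.964 | 0.002 | 0.002 | 3870 | 710 | 0.946 | 0.000 | 0.000 |
| | | | | FAST | 1926 | 382 | 1.000 | 0.002 | 0.002 | 3863 | 678 | 1.000 | 0.000 | 0.000 |
| | | | | CRIS | 1233 | 1119 | 0.884 | 0.034 | 0.030 | 2427 | 2355 | 0.810 | 0.024 | 0.012 |
| | | | | MWSIS | 1999 | 15 | 1.000 | 0.000 | 0.000 | 3998 | 63 | 1.000 | 0.000 | 0.000 |
| | | | | CWSIS | 2 | 0 | - | 1.000 | 1.000 | 2 | 0 | - | 0.998 | 0.998 |
| | | 0.7 | 100 | PSIS | 2000 | 0 | 0.042 | 0.000 | 0.000 | 4000 | 0 | 0.016 | 0.000 | 0.000 |
| | | | | FAST | 2000 | 0 | 0.996 | 0.000 | 0.000 | 4000 | 0 | 0.994 | 0.000 | 0.000 |
| | | | | CRIS | 1737 | 930 | 0.794 | 0.028 | 0.024 | 3451 | 1921 | 0.710 | 0.024 | 0.010 |
| | | | | MWSIS | 2000 | 0 | 0.208 | 0.000 | 0.000 | 4000 | 0 | 0.150 | 0.000 | 0.000 |
| | | | | CWSIS | 2 | 0 | - | 1.000 | 1.000 | 2 | 0 | - | 0.998 | 0.998 |
| 1000 | 5% | 0 | 50 | PSIS | 1075 | 984 | 0.254 | 0.006 | 0.002 | 2010 | 2274 | 0.266 | 0.008 | 0.006 |
| | | | | FAST | 931 | 1022 | 0.990 | 0.008 | 0.008 | 1771 | 2138 | 1.000 | 0.002 | 0.002 |
| | | | | CRIS | 1332 | 1002 | 0.562 | 0.000 | 0.000 | 2364 | 1826 | 0.436 | 0.000 | 0.000 |
| | | | | MWSIS | 520 | 983 | 1.000 | 0.042 | 0.042 | 1082 | 2023 | 1.000 | 0.030 | 0.030 |
| | | | | CWSIS | 2 | 1 | - | 0.936 | 0.936 | 2 | 2 | - | 0.882 | 0.882 |
| | | 0.3 | 50 | PSIS | 1678 | 667 | 0.080 | 0.002 | 0.000 | 3459 | 1476 | 0.046 | 0.004 | 0.002 |
| | | | | FAST | 1580 | 825 | 0.958 | 0.002 | 0.002 | 3249 | 1642 | 0.976 | 0.002 | 0.002 |
| | | | | CRIS | 1501 | 1077 | 0.568 | 0.006 | 0.004 | 2592 | 2036 | 0.448 | 0.004 | 0.000 |
| | | | | MWSIS | 1981 | 155 | 0.984 | 0.000 | 0.000 | 3971 | 204 | 0.956 | 0.000 | 0.000 |
| | | | | CWSIS | 2 | 0 | - | 0.946 | 0.946 | 2 | 1 | - | 0.890 | 0.890 |
| | | 0.7 | 50 | PSIS | 2000 | 2 | 0.000 | 0.000 | 0.000 | 4000 | 1 | 0.000 | 0.000 | 0.000 |
| | | | | FAST | 2000 | 2 | 0.502 | 0.000 | 0.000 | 4000 | 1 | 0.536 | 0.000 | 0.000 |
| | | | | CRIS | 1721 | 1037 | 0.590 | 0.010 | 0.002 | 3093 | 2247 | 0.462 | 0.008 | 0.002 |
| | | | | MWSIS | 2000 | 0 | 0.020 | 0.000 | 0.000 | 4000 | 0 | 0.006 | 0.000 | 0.000 |
| | | | | CWSIS | 2 | 0 | - | 0.942 | 0.942 | 2 | 1 | - | 0.900 | 0.900 |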

n, the sample size of the full cohort; p, the number of covariates; FR, the failure rate; nc, the average number of cases; ρ, the correlation coefficient of covariates; CWSIS: the proposed conditional screening method; MWSIS: the marginal weighted screening procedure; PSIS: the screening procedure of Zhao and Li (2012); FAST: the screening procedure of Gorst-Rasmussen and Scheike (2013); CRIS: the screening procedure of Song et al. (2014).

To assess the performance of the proposed method in settings similar to the real data, we further consider n = 300 and a failure rate of 25% for example 2, with the remaining setup kept the same as before. Here, we also consider the unweighted conditional screening method NCWSIS, which does not adopt the weight function and simply treats the case-cohort data as SRS data, and the conditional screening method CSMPLE of Hong et al. (2018). Since CSMPLE was proposed for SRS data and cannot be directly used to handle case-cohort data, we generate SRS data with the same sample size as the case-cohort data for CSMPLE. The simulation results for the minimum model size, Pe and Pa are summarized in Table 3, from which we can see that the proposed method detects the hidden active covariates with high probability and shows a distinct advantage in all the considered settings. By comparing the results of NCWSIS, CSMPLE and CWSIS, we can conclude that the performance of the conditional screening method is improved by including the case-cohort weight. Moreover, the proposed conditional screening procedure based on the case-cohort design is more accurate in selecting the active covariates than conditional screening based on an SRS of the same size as the case-cohort sample. For example, when p = 2000 and ρ = 0.7, the selection proportion is only 0.460 for CSMPLE, while the corresponding value for the proposed CWSIS method equals 1.
Table 3

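The median and interquartile range (IQR) of the minimum model size, and the selection proportions Pe and Pa, among 500 replications for example 2 with n = 300 and FR = 25%

| ρ | nc | Method | Median (p = 2000) | IQR (p = 2000) | Pe X1 (p = 2000) | Pe Xp (p = 2000) | Pa (p = 2000) | Median (p = 4000) | IQR (p = 4000) | Pe X1 (p = 4000) | Pe Xp (p = 4000) | Pa (p = 4000) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75 | PSIS | 671 | 1040 | 1.000 | 0.062 | 0.062 | 1226 | 1920 | 1.000 | 0.032 | 0.032 |
| | | FAST | 711 | 1035 | 1.000 | 0.068 | 0.068 | 1221 | 1927 | 1.000 | 0.034 | 0.034 |
| | | CRIS | 854 | 1064 | 1.000 | 0.026 | 0.026 | 1626 | 2057 | 0.998 | 0.012 | 0.012 |
| | | MWSIS | 599 | 1057 | 1.000 | 0.058 | 0.058 | 1014 | 1900 | 1.000 | 0.040 | 0.040 |
| | | NCWSIS | 5 | 53 | - | 0.706 | 0.706 | 8 | 115 | - | 0.648 | 0.648 |
| | | CSMPLE | 7 | 68 | - | 0.674 | 0.674 | 12 | 108 | - | 0.608 | 0.608 |
| | | CWSIS | 2 | 0 | - | 1.000 | 1.000 | 2 | 0 | - | 0.994 | 0.994 |
| 0.3 | 75 | PSIS | 1960 | 236 | 1.000 | 0.000 | 0.000 | 3917 | 485 | 1.000 | 0.000 | 0.000 |
| | | FAST | 1959 | 236 | 1.000 | 0.000 | 0.000 | 3915 | 516 | 1.000 | 0.000 | 0.000 |
| | | CRIS | 1339 | 1216 | 0.998 | 0.014 | 0.014 | 2718 | 2467 | 0.994 | 0.020 | 0.020 |
| | | MWSIS | 1987 | 106 | 1.000 | 0.000 | 0.000 | 3966 | 225 | 1.000 | 0.000 | 0.000 |
| | | NCWSIS | 3 | 60 | - | 0.712 | 0.712 | 4 | 57 | - | 0.704 | 0.704 |
| | | CSMPLE | 10 | 65 | - | 0.648 | 0.648 | 16 | 223 | - | 0.574 | 0.574 |
| | | CWSIS | 2 | 0 | - | 1.000 | 1.000 | 2 | 0 | - | 1.000 | 1.000 |
| 0.7 | 75 | PSIS | 2000 | 0 | 0.596 | 0.000 | 0.000 | 4000 | 0 | 0.578 | 0.000 | 0.000 |
| | | FAST | 2000 | 0 | 1.000 | 0.000 | 0.000 | 4000 | 0 | 1.000 | 0.000 | 0.000 |
| | | CRIS | 1948 | 399 | 0.990 | 0.002 | 0.002 | 3921 | 1035 | 0.978 | 0.006 | 0.006 |
| | | MWSIS | 2000 | 0 | 0.610 | 0.000 | 0.000 | 4000 | 0 | 0.556 | 0.000 | 0.000 |
| | | NCWSIS | 2 | 40 | - | 0.732 | 0.732 | 2 | 38 | - | 0.736 | 0.736 |
| | | CSMPLE | 44 | 204 | - | 0.460 | 0.460 | 74 | 606 | - | 0.388 | 0.388 |
| | | CWSIS | 2 | 0 | - | 1.000 | 1.000 | 2 | 0 | - | 1.000 | 1.000 |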

n, the sample size of the full cohort; p, the number of covariates; FR, the failure rate; nc, the average number of cases; ρ, the correlation coefficient of covariates; PSIS: the screening procedure of Zhao and Li (2012); FAST: the screening procedure of Gorst-Rasmussen and Scheike (2013); CRIS: the screening procedure of Song et al. (2014); MWSIS: the marginal weighted screening procedure; NCWSIS: the unweighted conditional screening method; CSMPLE: the conditional screening method of Hong et al. (2018); CWSIS: the proposed conditional screening method.

Application to breast cancer data

As an illustration, we apply the proposed CWSIS method to the breast cancer data (van de Vijver et al., 2002), which include 295 female patients with primary invasive breast carcinoma. For each patient, the expression of 24885 genes was profiled on cDNA arrays from the tumor. A set of 4919 candidate genes was selected after initial screening using the Rosetta error model (van’t Veer et al., 2002). After excluding the individuals with missing values, we have 289 subjects with 4919 candidate genes. The median observed time was 7.23 years (range 0.05 to 18.34 years). During the follow-up, 78 patients died of breast cancer and the other 211 patients were still alive, which gives a failure rate of 26.99%. Of the 289 patient samples, 60 overlapped with the 78 training samples from van’t Veer et al. (2002); we use these 60 samples as the testing set and the case-cohort samples as our training set. The details of these two sets are summarized in Table 4. The interest of the study is to identify genes that have a great influence on patients’ overall survival rate.
Table 4

Summary of the breast cancer data

Dataset  Num  Min    Max     Median  Fail(%)
Train    289  0.055  18.341  7.225   26.99
Test     60   0.712  15.352  7.606   38.33

Train, the training set; Test, the testing set; Num, the number of patients; Min, the minimum observed survival time; Max, the maximum observed survival time; Median, the median of observed survival time; Fail, the failure rate.

We illustrate the proposed method by identifying genes that have great influence on patients’ overall survival rate based on data from a case-cohort sample. Specifically, we select the subcohort by independent Bernoulli sampling with selection probability π = 0.37, which results in about the same number of cases and noncases. The subcohort has 111 subjects and the final case-cohort sample has 155 subjects. Gene AL080059 is known in the literature to be predictive of patients’ survival time (Yeung et al., 2005; van’t Veer et al., 2002), so we use it as the conditional variable in the proposed procedure. Screening methods are usually used as an initial step to reduce the dimensionality and are then followed by model-based regularization methods. In particular, we first apply the proposed CWSIS procedure to reduce the dimension from p = 4919 to ⌈155/log(155)⌉ = 31 and then use the regularization methods LASSO, SCAD and MCP to select the significant genes among these 31 under the Cox proportional hazards regression framework, with the tuning parameter selected by 10-fold cross-validation. Table 5 lists the names and estimated coefficients of the selected genes. Genes Contig58368.RC, NM.014889, NM.005689, NM.013290, AL080059, NM.013332, Contig63649.RC and NM.002916 were selected by all three methods (LASSO, SCAD and MCP), indicating that these eight genes could be associated with patients’ survival rate. Moreover, genes Contig58368.RC, NM.014889 and NM.005689 were ranked in the first three positions, which suggests that these three genes may have great influence on patients’ survival rate.
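As a rough illustration of the sampling step and the cut-off described above, the following Python sketch draws a Bernoulli subcohort, adds all cases to form the case-cohort sample, and computes ⌈n/log(n)⌉ with the natural logarithm. The function names, the seed and the synthetic case indicator are assumptions made for illustration and are not the authors' code; the realized sizes depend on the random draw, so they will generally differ from the 111 and 155 reported in the text.

```python
import math
import random


def screening_cutoff(n):
    """Return ceil(n / log(n)) using the natural logarithm."""
    return math.ceil(n / math.log(n))


def case_cohort_sample(is_case, pi, seed=0):
    """Draw a Bernoulli subcohort and add all cases.

    Returns (subcohort, sample) as sorted lists of subject indices.
    """
    rng = random.Random(seed)
    # Each subject enters the subcohort independently with probability pi.
    subcohort = [i for i in range(len(is_case)) if rng.random() < pi]
    # The case-cohort sample is the subcohort together with every case.
    cases = {i for i, c in enumerate(is_case) if c}
    sample = sorted(set(subcohort) | cases)
    return subcohort, sample


# Hypothetical cohort with the counts from the text: 78 cases, 211 noncases.
is_case = [True] * 78 + [False] * 211
subcohort, sample = case_cohort_sample(is_case, pi=0.37, seed=1)

# Expected sizes under Bernoulli sampling (the realized sizes are random).
expected_subcohort = 0.37 * len(is_case)      # 0.37 * 289
expected_sample = 78 + 0.37 * 211             # all cases + sampled noncases

print(len(subcohort), len(sample))
print(round(expected_subcohort, 1), round(expected_sample, 1))
print(screening_cutoff(155))
```

Under these assumptions the expected case-cohort sample size is 78 + 0.37 × 211 ≈ 156, which is in line with the 155 subjects reported, and the expected number of sampled noncases (about 78) matches the number of cases, consistent with the choice π = 0.37.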
Table 5

The results of selected important genes for the breast cancer data using the regularization methods

LASSO                    SCAD                     MCP
Name             Est.    Name             Est.    Name             Est.
Contig58368.RC   0.392   Contig58368.RC   0.516   Contig58368.RC   0.515
NM.014889        0.277   NM.014889        0.446   NM.014889        0.445
NM.005689        0.201   NM.005689        0.329   NM.005689        0.329
NM.013332        0.199   NM.013290        0.326   NM.013290        0.325
Contig63649.RC   0.178   AL080059         0.312   AL080059         0.312
NM.013290        0.172   NM.013332        0.256   NM.013332        0.256
AL080059         0.168   Contig63649.RC   0.249   Contig63649.RC   0.249
NM.002916        0.140   NM.002916        0.204   NM.002916        0.206
NM.012291        0.102
Contig31288.RC   0.083
Contig38288.RC   0.049
NM.003376        0.017
NM.001673        0.014

Name: the name for selected genes; Est.: the corresponding estimated value of the coefficient for selected genes.

To evaluate the predictive accuracy of CWSIS, we further compute the C-statistic estimator (Uno et al., 2011). For comparison, we also apply the MWSIS and NCWSIS procedures to these data. In particular, we first apply each of the three screening methods to reduce the dimension to ⌈155/log(155)⌉ = 31, then apply the LASSO penalty to further remove irrelevant covariates, with the tuning parameter selected by 10-fold cross-validation. We obtain the risk score for each subject from the final model selected by LASSO and compute the corresponding concordance statistic (C-statistic) in the testing set. The standard deviations (SD) of the C-statistic are obtained from perturbation resampling with 1000 replications. The C-statistics (SD in parentheses) are 0.862 (0.059), 0.796 (0.078) and 0.802 (0.053) for the CWSIS, MWSIS and NCWSIS procedures, respectively. According to Uno et al. (2011), the larger the C-statistic, the stronger the predictive power of the method. We conclude that the proposed CWSIS method performs reasonably well for ultrahigh-dimensional survival data under the case-cohort design and compares favorably with the other two procedures in terms of prediction. We also consider d = n/2, n/3 and n/4 when analyzing these data and summarize the results in the supplementary material, which show that the genes selected under different cut-offs are highly consistent. Furthermore, we compute the C-statistic estimator for the CWSIS, MWSIS and NCWSIS procedures under these three cut-offs. The results in the supplementary material lead to a conclusion similar to that with d = n/log(n).
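To make the idea of a concordance statistic concrete, the sketch below computes a simple Harrell-type C-index from observed times, event indicators and risk scores. This is a deliberate simplification and not the estimator of Uno et al. (2011) used in the paper: it omits the inverse-probability-of-censoring weights and the truncation time that Uno's estimator involves. The function name and the toy data are assumptions for illustration only.

```python
def concordance_index(time, event, risk):
    """Harrell-type C-index (unweighted; NOT Uno's IPCW estimator).

    A pair (i, j) is usable when subject i has an observed failure and
    time[i] < time[j]. The pair is concordant when the subject who fails
    earlier has the higher risk score; tied scores count as one half.
    """
    concordant = 0.0
    usable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # the earlier time must be an observed failure
        for j in range(n):
            if time[i] < time[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable if usable else float("nan")


# Toy data (hypothetical): four subjects, the third one is censored.
time = [2.0, 4.0, 6.0, 8.0]
event = [1, 1, 0, 1]

perfect = concordance_index(time, event, [3.0, 2.0, 1.5, 0.5])
reversed_ = concordance_index(time, event, [0.5, 1.5, 2.0, 3.0])
mixed = concordance_index(time, event, [3.0, 0.5, 1.5, 2.0])
print(perfect, reversed_, mixed)
```

A value near 1 means that higher risk scores tend to go with earlier failures, while 0.5 corresponds to no discrimination, which is the sense in which a larger C-statistic indicates stronger predictive power.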

Conclusion

For ultrahigh-dimensional survival data under the case-cohort design, we propose a conditional screening procedure, CWSIS, that incorporates prior information on active covariates. This method enables the detection of hidden active covariates, which is a clear advantage over marginal screening procedures. Moreover, the proposed procedure does not require any complicated numerical optimization and is computationally efficient. Theoretically, it enjoys the sure screening property and the ranking consistency property under some mild regularity conditions. In developing the theoretical properties, we adopt the conditional linear expectation and conditional linear covariance, which were proposed in Hong et al. (2018) and are useful for specifying the regularity conditions.

Some issues deserve further consideration. First, the proposed method requires prior information on active covariates, and such information may sometimes be difficult to obtain. Hong et al. (2016) proposed a data-driven method to obtain the conditional set for generalized linear models. How to develop a data-driven conditional screening method for survival data under the case-cohort design is an interesting question. Furthermore, when we have prior knowledge of active covariates, how to balance it with the information extracted from the given data merits further investigation. Second, under our design, the subcohort is selected by independent Bernoulli sampling. When the subcohort is selected by simple random sampling without replacement, our method also works, although more complicated arguments would be needed to develop the theoretical properties. Moreover, when some covariates are available for all cohort members, we can consider the stratified case-cohort design based on those covariates. Third, one could develop more efficient screening methods that incorporate more complex prior knowledge, such as the network structure or the spatial information of the covariates.
  30 in total

1.  Nonparametric Independence Screening in Sparse Ultra-High Dimensional Additive Models.

Authors:  Jianqing Fan; Yang Feng; Rui Song
Journal:  J Am Stat Assoc       Date:  2011-06       Impact factor: 5.033

2.  Marginal hazards model for case-cohort studies with multiple disease outcomes.

Authors:  S Kang; J Cai
Journal:  Biometrika       Date:  2009-12       Impact factor: 2.445

3.  Censored cumulative residual independent screening for ultrahigh-dimensional survival data.

Authors:  Jing Zhang; Guosheng Yin; Yanyan Liu; Yuanshan Wu
Journal:  Lifetime Data Anal       Date:  2017-05-26       Impact factor: 1.588

4.  Variable selection for case-cohort studies with failure time outcome.

Authors:  A I Ni; Jianwen Cai; Donglin Zeng
Journal:  Biometrika       Date:  2016-08-10       Impact factor: 2.445

5.  More efficient estimators for case-cohort studies.

Authors:  S Kim; J Cai; W Lu
Journal:  Biometrika       Date:  2013       Impact factor: 2.445

6.  Gene expression profiling predicts clinical outcome of breast cancer.

Authors:  Laura J van 't Veer; Hongyue Dai; Marc J van de Vijver; Yudong D He; Augustinus A M Hart; Mao Mao; Hans L Peterse; Karin van der Kooy; Matthew J Marton; Anke T Witteveen; George J Schreiber; Ron M Kerkhoven; Chris Roberts; Peter S Linsley; René Bernards; Stephen H Friend
Journal:  Nature       Date:  2002-01-31       Impact factor: 49.962

7.  Marginal Empirical Likelihood and Sure Independence Feature Screening.

Authors:  Jinyuan Chang; Cheng Yong Tang; Yichao Wu
Journal:  Ann Stat       Date:  2013-08-01       Impact factor: 4.028

8.  Conditional Sure Independence Screening.

Authors:  Emre Barut; Jianqing Fan; Anneleen Verhasselt
Journal:  J Am Stat Assoc       Date:  2016-10-18       Impact factor: 5.033

9.  Univariate shrinkage in the Cox model for high dimensional data.

Authors:  Robert J Tibshirani
Journal:  Stat Appl Genet Mol Biol       Date:  2009-04-14

10.  Nonparametric Independence Screening in Sparse Ultra-High Dimensional Varying Coefficient Models.

Authors:  Jianqing Fan; Yunbei Ma; Wei Dai
Journal:  J Am Stat Assoc       Date:  2014       Impact factor: 5.033
