Conditional screening for ultrahigh-dimensional survival data in case-cohort studies.

Jing Zhang¹, Haibo Zhou², Yanyan Liu³, Jianwen Cai⁴.

Abstract

The case-cohort design has been widely used to reduce the cost of covariate measurements in large cohort studies. In many such studies, the number of covariates is very large, and the goal of the research is to identify active covariates that strongly influence the response. Since the introduction of sure independence screening, screening procedures have achieved great success in reducing the dimensionality and identifying active covariates. However, commonly used screening methods are based on marginal correlation or its variants, so they may fail to identify hidden active variables that are jointly important but only weakly correlated with the response. Moreover, these screening methods are mainly proposed for data obtained by simple random sampling and cannot be directly applied to case-cohort data. In this paper, we consider ultrahigh-dimensional survival data under the case-cohort design and propose a conditional screening method that incorporates important prior information about active variables. This method can effectively detect hidden active variables. Furthermore, it possesses the sure screening property under some mild regularity conditions and does not require any complicated numerical optimization. We evaluate the finite sample performance of the proposed method via extensive simulation studies and further illustrate the new approach through a real data set from patients with breast cancer.

Keywords: Case-cohort design; Conditional screening; Sure screening property; Survival data; Ultrahigh-dimensional data; Weighted estimating equation

Year: 2021 PMID: 34417679 PMCID: PMC8561435 DOI: 10.1007/s10985-021-09531-7

Source DB: PubMed Journal: Lifetime Data Anal ISSN: 1380-7870 Impact factor: 1.429

Introduction

In large epidemiological cohort studies, some diseases of interest (e.g., cancer, heart disease, HIV infection) commonly have very low incidence. In addition, some exposures can be very expensive to measure, and limited resources make it infeasible to measure them on all cohort members. To reduce the cost while retaining as much efficiency as possible, Prentice (1986) proposed the case-cohort design, in which the expensive covariates are obtained only for a random sample of the full cohort, called the subcohort, and for the additional cases who experience the event of interest during the follow-up period. When the covariate dimension p is smaller than the sample size n, various methods have been proposed for analyzing data under this design, such as the pseudo-likelihood approach (Prentice, 1986; Self and Prentice, 1988; Kalbfleisch and Lawless, 1988), the estimating equation method (Chen and Lo, 1999; Chen, 2001), the multiple imputation approach (Marti and Chavance, 2011; Keogh and White, 2013), maximum likelihood estimation (Scheike and Martinussen, 2004; Zeng and Lin, 2014), and the weighted estimating equation approach (Barlow, 1994; Borgan et al., 2000; Kulich and Lin, 2004; Breslow and Wellner, 2007; Kang and Cai, 2009; Kim et al., 2013), among others. With the rapid development of biomedical technology, high-dimensional data are frequently collected in large epidemiological studies. The defining feature of such data is that the covariate dimension p is much larger than the sample size n. An important purpose of analyzing this type of data is to identify a subset of covariates related to the event of interest and to construct effective models based on the selected covariates.
For scenarios where p increases with n at a polynomial rate (e.g., $p = n^{\alpha}$ with α > 0), the regularization method has been demonstrated to be an effective dimension reduction method for simple random sampling (SRS) data (e.g., Tibshirani, 1996; Fan and Li, 2001; Zou, 2006; Candes and Tao, 2007; Zhang, 2010) and has recently been generalized to high-dimensional data under the case-cohort design. For example, Ni et al. (2016) proposed a variable selection procedure using the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) for scenarios where p increases at a slower rate than n. Kim and Ahn (2019) proposed a bi-level variable selection method to select non-zero groups and within-group variables for cases where the variables have a group structure. These methods can select variables and estimate parameters simultaneously. However, when the dimension p is ultrahigh in the sense that $p = \exp(n^{\alpha})$ with α > 0, the computation inherent in regularization methods faces simultaneous challenges of computational expediency, statistical accuracy, and algorithmic stability (Fan et al., 2009). For SRS data, the feature screening method has achieved great success in dealing with the challenges of ultrahigh-dimensional settings. Various marginal screening methods have been proposed under different settings, such as linear models (Fan and Lv, 2008), generalized linear models (Fan and Song, 2010), additive models (Fan et al., 2011), varying coefficient models (Fan et al., 2014; Liu et al., 2014) and model-free scenarios (e.g., Zhu et al., 2011; Li et al., 2012a; Li et al., 2012b; He et al., 2013; Chang et al., 2013; Cui et al., 2015; Mai and Zou, 2015; Wu and Yin, 2015).
For censored survival data, several model-based screening methods (e.g., Tibshirani, 2009; Zhao and Li, 2012; Gorst-Rasmussen and Scheike, 2013) and model-free screening methods (e.g., Song et al., 2014; Wu and Yin, 2015; Zhang et al., 2017; Zhou and Zhu, 2017; Liu et al., 2018; Zhang et al., 2018; Lin et al., 2018; Pan et al., 2019) have been proposed by defining different marginal utilities. Although they are powerful in reducing the dimensionality, they face challenges in some situations. For instance, as noted in Fan and Lv (2008), the correlation among covariates heavily influences the marginal utility. When the correlation among covariates is relatively high, marginal screening methods may fail to retain hidden active variables, which have great influence on the response but are only weakly correlated with it marginally. Although some iterative screening methods (e.g., Fan and Lv, 2008; Zhu et al., 2011; Zhang et al., 2018; Pan et al., 2019) and forward screening approaches (e.g., Wang, 2009) have been proposed to alleviate this problem, their computation is relatively slow and their statistical properties are elusive. In many applications, researchers can obtain prior information about active variables from previous investigations and experience. For example, in the breast cancer study (van de Vijver et al., 2002), gene AL080059 is known from the literature to be predictive of patients’ survival time (Yeung et al., 2005; van’t Veer et al., 2002). Barut et al. (2016) pointed out that including such prior knowledge can improve the accuracy of variable screening. Motivated by this, they proposed the conditional screening approach for generalized linear models and showed that conditioning helps reduce the correlation among covariates and thus detects hidden active variables with higher probability. Hong et al. (2016) further proposed to integrate prior information using data-driven approaches.
Hu and Lin (2017) put forward a conditional screening procedure that ranks covariates by conditional marginal empirical likelihood ratios. Liu and Wang (2017) proposed a screening method based on conditional distance correlation. Hong et al. (2018) developed a conditional screening method for censored data under the proportional hazards model. Liu and Chen (2018) considered the conditional quantile independence screening approach for ultrahigh-dimensional heterogeneous data. Lu and Lin (2020) proposed a model-free conditional screening method via conditional distance correlation. Extensive simulation studies showed that these conditional screening methods, which incorporate important prior information about active variables, provide a powerful means to identify hidden active variables for ultrahigh-dimensional data. The research on marginal and conditional screening methods has been fruitful for ultrahigh-dimensional SRS data. To the best of our knowledge, however, conditional screening has not been studied for case-cohort data, and the existing conditional screening methods cannot be directly applied to case-cohort data because of its special data structure. To fill this gap, we propose a conditional screening method for ultrahigh-dimensional case-cohort data under the framework of the Cox proportional hazards model. We construct a marginal hazards regression model for each covariate that also includes the known important covariates. As some covariates are not fully observed, we build a weighted estimating equation to obtain the estimators of the parameters. We then propose marginal utilities based on the parameter estimates to measure the contribution of each covariate and retain the covariates with the top-ranked contributions. We refer to this as the conditional weighted screening method, in short the CWSIS procedure. As pointed out by Barut et al. (2016), the correlation between covariates can be weakened upon conditioning, so that hidden active covariates have a higher chance of being retained. Therefore, the proposed method enables the detection of hidden active covariates for ultrahigh-dimensional survival data under the case-cohort design. Under some reasonable conditions, it enjoys the sure screening property and ranking consistency. Our work is the first to focus on conditional screening for ultrahigh-dimensional case-cohort data; it can be viewed as an extension of Hong et al. (2018) from SRS data to case-cohort data. Although the ideas are similar, the generalization is quite challenging because of the much more complex structure of case-cohort data, and both the implementation and the theory are quite different.
The rest of the article is organized as follows. In Section 2, we introduce the model and data and present the details of the CWSIS procedure. In Section 3, we establish the theoretical properties of the proposed CWSIS method. Section 4 presents results from simulation studies. A real data set from the breast cancer study is analyzed in Section 5. Section 6 provides some remarks and discussion. The regularity conditions and the technical proofs are presented in the Appendix.

Conditional screening for case-cohort data

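Suppose there are n independent subjects in a cohort study. Let $T_i$ and $C_i$ denote the failure time and censoring time of subject i. Because of right-censoring, we observe only $X_i = \min(T_i, C_i)$ and $\Delta_i = I(T_i \le C_i)$. Let $Z_i = (Z_{i1}, \ldots, Z_{ip})^T$ denote the p-dimensional covariate vector. Under the case-cohort design, $Z_i$ is available only for the cases ($\Delta_i = 1$) and the subcohort (a random subset of the full cohort). Let $\xi_i$ be the indicator of subcohort membership, i.e., $\xi_i = 1$ if the ith subject in the full cohort is selected into the subcohort and $\xi_i = 0$ otherwise. For the selection of the subcohort, we consider independent Bernoulli sampling with selection probability $\pi = \Pr(\xi_i = 1) \in (0, 1)$. Thus, the observable data for the ith subject are $\{X_i, \Delta_i, Z_i, \xi_i\}$ when $\xi_i = 1$ or $\Delta_i = 1$, and $\{X_i, \Delta_i, \xi_i\}$ when $\xi_i = 0$ and $\Delta_i = 0$. Suppose that the failure time follows the proportional hazards model (Cox, 1972), under which the conditional hazard function of $T_i$ given $Z_i$ has the form $\lambda(t \mid Z_i) = \lambda_0(t) \exp(\alpha^T Z_i)$ (1), where $\lambda_0(t)$ is the unspecified baseline hazard function and $\alpha = (\alpha_1, \ldots, \alpha_p)^T$ is the unknown regression parameter. Assume that the failure time $T_i$ and the censoring time $C_i$ are independent given $Z_i$. In an ultrahigh-dimensional setting, the dimensionality p greatly exceeds the sample size n and is allowed to increase at an exponential rate in n. Under the sparsity principle, only a small number of covariates have great influence on the response variable, i.e., $\|\alpha\|_0$ is much smaller than p, where $\|\alpha\|_0$ denotes the number of nonzero elements of α. Assume we have the prior information that a set of covariates is related to the survival time T; denote its index set by $\mathcal{C}$, and let $q = |\mathcal{C}|$ denote the number of covariates in $\mathcal{C}$. Write $Z_{i,\mathcal{C}}$ and $Z_{i,-\mathcal{C}}$ for the sub-vectors of $Z_i$ indexed by $\mathcal{C}$ and by its complement, and $\alpha_{\mathcal{C}}$ and $\alpha_{-\mathcal{C}}$ for the corresponding sub-vectors of α. Here, $\mathcal{C}$ is known, and $\alpha_{\mathcal{C}}$ and $\alpha_{-\mathcal{C}}$ are unknown. The true hazard function in (1) is equivalent to $\lambda(t \mid Z_i) = \lambda_0(t) \exp(\alpha_{\mathcal{C}}^T Z_{i,\mathcal{C}} + \alpha_{-\mathcal{C}}^T Z_{i,-\mathcal{C}})$. Let $\mathcal{M}_{-\mathcal{C}} = \{j \notin \mathcal{C} : \alpha_j \ne 0\}$ and $|\mathcal{M}_{-\mathcal{C}}|$ be the true set of non-zero coefficients outside $\mathcal{C}$ and its cardinality. Our goal is to recover the set $\mathcal{M}_{-\mathcal{C}}$ as precisely as possible based on data from case-cohort studies. In other words, we want to find an index set $\hat{\mathcal{M}}_{-\mathcal{C}}$ of covariates which satisfies $\mathcal{M}_{-\mathcal{C}} \subset \hat{\mathcal{M}}_{-\mathcal{C}}$.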
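To perform an initial screening, we construct a marginal Cox regression model for each covariate individually and add the known covariates in $\mathcal{C}$ to each marginal model. Specifically, for $j \notin \mathcal{C}$, the hazard function of $T_i$ given $(Z_{i,\mathcal{C}}, Z_{ij})$, where $Z_{i,\mathcal{C}}$ collects the covariates of subject i indexed by $\mathcal{C}$, has the form $\lambda_j(t \mid Z_{i,\mathcal{C}}, Z_{ij}) = \lambda_{0j}(t) \exp(\beta_{\mathcal{C},j}^T Z_{i,\mathcal{C}} + \beta_j Z_{ij})$, where $\lambda_{0j}(t)$ is the unspecified baseline hazard function, and $\beta_{\mathcal{C},j}$ and $\beta_j$ are the unknown regression parameters corresponding to the covariates $Z_{i,\mathcal{C}}$ and $Z_{ij}$ in the marginal Cox model, respectively. Since for case-cohort data the covariates are observed only for the selected subcohort and the cases, we consider the following weighted estimating equation $U_j(\boldsymbol{\beta}_j) = \sum_{i=1}^{n} \int_0^{\tau} \{\tilde{Z}_{ij} - S_j^{(1)}(\boldsymbol{\beta}_j, t) / S_j^{(0)}(\boldsymbol{\beta}_j, t)\} \, dN_i(t) = 0$ with $S_j^{(l)}(\boldsymbol{\beta}_j, t) = n^{-1} \sum_{i=1}^{n} w_i(t) Y_i(t) \tilde{Z}_{ij}^{\otimes l} \exp(\boldsymbol{\beta}_j^T \tilde{Z}_{ij})$, where $\tilde{Z}_{ij} = (Z_{i,\mathcal{C}}^T, Z_{ij})^T$, $\boldsymbol{\beta}_j = (\beta_{\mathcal{C},j}^T, \beta_j)^T$, $N_i(t) = I(X_i \le t, \Delta_i = 1)$, $Y_i(t) = I(X_i \ge t)$, and $a^{\otimes 0} = 1$, $a^{\otimes 1} = a$, $a^{\otimes 2} = a a^T$, for $j \notin \mathcal{C}$ and l = 0, 1, 2. Here, we choose the time-varying weight function $w_i(t) = \Delta_i + (1 - \Delta_i) \xi_i / \hat{\pi}(t)$, where $\hat{\pi}(t)$ is a consistent estimator of the true sampling probability π. Note that $w_i(t)$ weights the ith subject by the inverse probability of selection: it equals 1 for the cases and $1/\hat{\pi}(t)$ for the sampled censored subjects. The maximum marginal pseudo-partial likelihood estimator $\hat{\boldsymbol{\beta}}_j = (\hat{\beta}_{\mathcal{C},j}^T, \hat{\beta}_j)^T$ is defined as the solution to the weighted estimating equation $U_j(\boldsymbol{\beta}_j) = 0$. Define the information matrix $I_j(\boldsymbol{\beta}_j) = \sum_{i=1}^{n} \int_0^{\tau} \{S_j^{(2)}(\boldsymbol{\beta}_j, t) / S_j^{(0)}(\boldsymbol{\beta}_j, t) - (S_j^{(1)}(\boldsymbol{\beta}_j, t) / S_j^{(0)}(\boldsymbol{\beta}_j, t))^{\otimes 2}\} \, dN_i(t)$, which is of dimension (q + 1). Let $\hat{\sigma}_j^2$ be the variance estimate of $\hat{\beta}_j$, i.e., the (q + 1)th diagonal element of the matrix $I_j(\hat{\boldsymbol{\beta}}_j)^{-1}$. For $j \notin \mathcal{C}$, we define a statistic $\hat{u}_j$ from $\hat{\beta}_j$ and $\hat{\sigma}_j$, which serves as the proposed utility measure for the jth covariate. We rank the covariates $Z_j$ ($j \notin \mathcal{C}$) by the value of $\hat{u}_j$ from largest to smallest and retain those at the top of the rank list. For a given threshold γ > 0, the selected index set in addition to the set $\mathcal{C}$ is given by $\hat{\mathcal{M}}_{-\mathcal{C}} = \{j \notin \mathcal{C} : \hat{u}_j \ge \gamma\}$. In practical applications, we can pre-determine a positive integer d0 and define the estimated active set as the indices $j \notin \mathcal{C}$ whose $\hat{u}_j$ are among the d0 largest. Similar to Fan and Lv (2008) and other literature on feature screening, we can choose d0 = ⌈n/log n⌉, where n denotes the case-cohort sample size. As with the conditional screening procedures of Barut et al. (2016) and Hong et al. (2018), the outstanding advantage of the proposed CWSIS procedure is that it enables the detection of hidden active covariates for ultrahigh-dimensional case-cohort data.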
To demonstrate this merit, we set up an example in a similar way to Barut et al. (2016) and Hong et al. (2018). In particular, the failure time T follows the Cox proportional hazards model $\lambda(t \mid Z) = \lambda_0(t) \exp(\alpha^T Z)$, where $\lambda_0(t) = 1$, the nonzero coefficients of α are confined to $Z_1, \ldots, Z_5$, and $Z \sim N(0, \Sigma)$ with $\Sigma = (\sigma_{ij})$, $\sigma_{ii} = 1$ for i = 1,…,p, and $\sigma_{ij} = 0.5$ for i ≠ j. By this design, $Z_5$ is a hidden active covariate. We consider four different conditioning sets: $\mathcal{C} = \emptyset$, {1}, {1, 2}, and {6, 7, 8}. The densities of the proposed screening statistic for $Z_5$ (the hidden active covariate) and for $Z_6, \ldots, Z_{2000}$ (the inactive covariates) are summarized in Figure 1. When $\mathcal{C} = \emptyset$, CWSIS is equivalent to the marginal screening approach, and with high probability the screening statistic for $Z_5$ is much smaller than the corresponding values for the inactive covariates. When the conditioning set includes one truly active covariate ($\mathcal{C} = \{1\}$), the curve for $Z_5$ lies to the right and there is a clear separation between the two curves. When we include more truly active covariates ($\mathcal{C} = \{1, 2\}$), this separation becomes larger. Interestingly, when the conditioning set consists of three inactive covariates ($\mathcal{C} = \{6, 7, 8\}$), the chance of identifying the hidden variable $Z_5$ using CWSIS is still higher than with the marginal screening method. This may be because these inactive variables are correlated with the active covariates and can therefore function as surrogates for them, so that conditioning on them helps detect hidden variables. A similar phenomenon was also observed in Barut et al. (2016) and Hong et al. (2018).
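
The way equicorrelation can hide an active covariate, and the way conditioning can expose it, can be checked with a short calculation. The sketch below is only an illustration under stated assumptions: the coefficient values (`c` for the first four covariates and `-4 * rho * c` for the fifth) are hypothetical choices that make the marginal covariance of Z5 with the linear predictor vanish, and the covariance with the linear predictor is used as a simple proxy for marginal signal rather than the Cox-based screening statistic itself.

```python
import numpy as np

def equicorrelated_cov(p, rho):
    """Covariance matrix with unit variances and common correlation rho."""
    return np.full((p, p), rho) + (1.0 - rho) * np.eye(p)

def partial_cov_given(sigma, alpha, cond):
    """Cov(Z_j, alpha^T Z | Z_cond) for every j, assuming joint normality."""
    cond = list(cond)
    eta_cov = sigma @ alpha                      # Cov(Z_j, alpha^T Z)
    s_cc = sigma[np.ix_(cond, cond)]             # Cov(Z_cond, Z_cond)
    s_jc = sigma[:, cond]                        # Cov(Z_j, Z_cond)
    return eta_cov - s_jc @ np.linalg.solve(s_cc, eta_cov[cond])

# Hypothetical setup: a smaller p than in the text, since these
# covariances do not depend on p when only five coefficients are nonzero.
p, rho, c = 200, 0.5, 1.0
alpha = np.zeros(p)
alpha[:4] = c                 # Z1..Z4 active (assumed effect size c)
alpha[4] = -4.0 * rho * c     # Z5 active, chosen so its marginal covariance is 0

sigma = equicorrelated_cov(p, rho)
marginal_cov = sigma @ alpha                           # no conditioning
conditional_cov = partial_cov_given(sigma, alpha, [0]) # conditioning on Z1
```

Under these assumed values, `marginal_cov[4]` is 0 while an inactive covariate such as `marginal_cov[5]` is 1.0, so Z5 looks weaker than noise marginally. After conditioning on Z1, `conditional_cov[4]` is -0.75 and `conditional_cov[5]` is 0.25, so Z5 now stands out in absolute value.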

Fig. 1

Density of the screening statistic for the hidden active covariate $Z_5$ compared with a mixture of densities of the inactive covariates $Z_6, \ldots, Z_{2000}$ under different conditioning sets: Case 1: $\mathcal{C} = \emptyset$, which is equivalent to marginal screening; Case 2: $\mathcal{C} = \{1\}$, one truly active covariate; Case 3: $\mathcal{C} = \{1, 2\}$, two truly active covariates; Case 4: $\mathcal{C} = \{6, 7, 8\}$, three inactive covariates. The full cohort sample size is n = 500, the number of covariates is p = 2000, the noncase-to-case ratio is 1 : 1, and the failure rate is 20%.
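
The conditional screening step described in this section can be sketched in a few dozen lines. This is a minimal illustration and not the authors' implementation. It assumes time-fixed weights `w` (the text uses a time-varying weight), a plain Newton-Raphson solver without step control, no tied event times, and a standardized utility |beta_j| / sigma_j. The function names `weighted_score_info`, `weighted_cox_fit` and `conditional_utilities` are hypothetical.

```python
import numpy as np

def weighted_score_info(beta, x, delta, zmat, w):
    """Weighted pseudo-partial-likelihood score and information at beta."""
    r = w * np.exp(zmat @ beta)                  # weighted relative risks
    score = np.zeros(beta.size)
    info = np.zeros((beta.size, beta.size))
    for i in np.flatnonzero(delta == 1):         # sum over observed events
        risk = x >= x[i]                         # risk set at this event time
        rr, zr = r[risk], zmat[risk]
        s0 = rr.sum()
        zbar = (rr @ zr) / s0                    # weighted covariate mean S1/S0
        s2 = (zr * rr[:, None]).T @ zr / s0      # weighted second moment S2/S0
        score += zmat[i] - zbar
        info += s2 - np.outer(zbar, zbar)
    return score, info

def weighted_cox_fit(x, delta, zmat, w, n_iter=50, tol=1e-10):
    """Solve the weighted score equation by Newton-Raphson (no step halving)."""
    beta = np.zeros(zmat.shape[1])
    for _ in range(n_iter):
        score, info = weighted_score_info(beta, x, delta, zmat, w)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    _, info = weighted_score_info(beta, x, delta, zmat, w)
    return beta, info

def conditional_utilities(x, delta, z, w, cond):
    """Assumed utility |beta_j| / sigma_j for each j outside the set cond."""
    keep = w > 0                                 # drop subjects without covariates
    x, delta, z, w = x[keep], delta[keep], z[keep], w[keep]
    cond = list(cond)
    utilities = np.full(z.shape[1], -np.inf)     # conditioning covariates are not ranked
    for j in range(z.shape[1]):
        if j in cond:
            continue
        beta, info = weighted_cox_fit(x, delta, z[:, cond + [j]], w)
        var_j = np.linalg.inv(info)[-1, -1]      # last diagonal entry of info^{-1}
        utilities[j] = abs(beta[-1]) / np.sqrt(var_j)
    return utilities
```

Ranking `utilities` from largest to smallest and keeping the top d0 indices then gives the screened set. The loop over covariates is independent across j, so it parallelizes directly when p is large.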

Theoretical property

In this section, we show that the CWSIS procedure enjoys the sure screening property and the ranking consistency property. Ranking consistency means that the procedure tends to rank the active covariates above the inactive ones with high probability; the sure screening property means that all the active covariates survive the screening with probability tending to 1 as n → ∞. These two properties lay the theoretical foundation of the CWSIS procedure. Define and for and l = 0, 1, 2. Let be the solution of the following equation , with The regularity conditions are given in Appendix A, under which we establish the following lemmas and theorems.
Lemma 1. Under conditions C1-C8, if and only if α = 0 for all .
Lemma 2. Suppose conditions C1-C8 hold. Then there exist constants c2 > 0 and 0 < κ < 1/2 such that
Lemma 3. Under conditions C1-C8, for any ϵ1 > 0 and ϵ2 > 0, there exist a positive constant c3 and an integer N such that for any n > N and 0 < κ < 1/2, where a is the size of , q is the size of the conditioning set, and c2 is the same constant as in Lemma 2.
Lemma 3 shows that the proposed maximum marginal pseudo-partial likelihood estimate is a consistent estimate of . By Lemmas 1 and 3, we can indeed distinguish from by the proposed marginal utility . Theorem 1 states the sure screening property of the CWSIS procedure.
Theorem 1 (The sure screening property). Under conditions C1-C8, for any 0 < κ < 1/2 and ϵ2 > 0, there exists a positive constant c3 such that where a is the size of and q is the size of the conditioning set. Furthermore, we have
From this theorem, we can see that all active covariates survive the screening with probability tending to one. The next theorem establishes the ranking consistency property of the proposed method.
Theorem 2 (The ranking consistency). Under conditions C1-C8, we have when n → ∞.
This lays the theoretical foundation for the claim that our procedure ranks the active covariates ahead of the inactive ones with overwhelming probability. The proofs of the theorems and lemmas are presented in Appendix B.

Simulation studies

We examine the finite sample performance of the proposed CWSIS procedure and compare it with some existing methods via simulation studies. For brevity, we refer to the feature aberration at survival times screening procedure of Gorst-Rasmussen and Scheike (2013) as FAST, the principled sure independence screening procedure of Zhao and Li (2012) as PSIS, and the censored rank independence screening of Song et al. (2014) as CRIS. Furthermore, we consider the marginal weighted screening procedure (MWSIS), in which we fit a marginal Cox regression for each $Z_j$, solve the weighted estimating equation to obtain the estimate $\hat{\beta}_j$, and then define the active index set by thresholding a utility computed from $\hat{\beta}_j$ and the information matrix $I_j(\beta_j)$. As PSIS, FAST and CRIS can only deal with SRS data, we generate SRS data with the same sample size as the case-cohort data for PSIS, FAST and CRIS. We consider survival data generated from the Cox proportional hazards model and employ independent Bernoulli sampling to generate the subcohort. We consider full cohort sample sizes n = 500 and 1000, and numbers of covariates p = 2000 and 4000. As the incidence rate in case-cohort studies is usually very low or moderate, we consider a failure rate of 20% for n = 500, and failure rates of 5% and 10% for n = 1000. We consider a noncase-to-case ratio of 1 : 1, so the sample size of the case-cohort data in our simulation studies is 100 or 200. For each configuration, we repeat the simulation 500 times and employ three evaluation criteria (Li et al., 2012b). The first is the minimum model size needed to include all active predictors; we present its median and interquartile range (IQR) over the 500 replications. The second is the proportion of replications in which each individual important variable is selected into the model with a given model size d0, denoted by $P_e$. The third is the proportion of replications in which all important variables are selected into the model with a given model size d0, denoted by $P_a$.
An effective screening procedure is expected to yield a minimum model size close to the true model size and both $P_e$ and $P_a$ close to one. Here, we choose d0 = ⌈n/log n⌉ (Fan and Lv, 2008), where n is the case-cohort sample size and ⌈x⌉ denotes the integer part of x.
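
The three evaluation criteria can be computed from a matrix of screening utilities with one row per replication. The sketch below is an illustration with hypothetical function names. It reads ⌈·⌉ as the ceiling; if the integer part is intended, `math.floor` would replace `math.ceil`.

```python
import math
import numpy as np

def screening_size(n_cc):
    """d0 = ceil(n / log n) for a case-cohort sample of size n_cc (ceiling assumed)."""
    return math.ceil(n_cc / math.log(n_cc))

def minimum_model_size(utility, active):
    """Smallest number of top-ranked covariates that contains all active ones."""
    order = np.argsort(-np.asarray(utility, dtype=float), kind="stable")
    rank = np.empty_like(order)
    rank[order] = np.arange(1, order.size + 1)   # rank 1 = largest utility
    return int(rank[list(active)].max())

def selection_proportions(utilities, active, d0):
    """Return (P_e for each active covariate, P_a) over the rows of utilities."""
    utilities = np.asarray(utilities, dtype=float)
    top = np.argsort(-utilities, axis=1, kind="stable")[:, :d0]
    hit = np.array([[a in row for a in active] for row in top])
    return hit.mean(axis=0), hit.all(axis=1).mean()

# Toy usage with two replications, four covariates, active set {0, 2}.
toy = np.array([[0.9, 0.1, 0.8, 0.3],
                [0.2, 0.7, 0.6, 0.1]])
p_e, p_a = selection_proportions(toy, active=[0, 2], d0=2)
sizes = [minimum_model_size(row, [0, 2]) for row in toy]
```

With the ceiling reading, `screening_size(100)` is 22 and `screening_size(200)` is 38, which correspond to the two case-cohort sample sizes used in the simulations.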

Example 1.

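The failure times T are generated from the Cox proportional hazards model with covariates $Z \sim N(0, \Sigma)$, where $\Sigma = (\sigma_{ij})$ with $\sigma_{ii} = 1$ for i = 1,…,p and $\sigma_{ij} = 0.5$ for i ≠ j. The censoring time is C ~ Unif(0, τ), where the constant τ represents the end time of the study and is used to control the failure rate.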
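
A data-generating step of this kind, followed by Bernoulli subcohort sampling, might look as follows. This is a sketch under assumptions: the baseline hazard is taken to be 1 (so failure times are exponential), the coefficient vector passed in is hypothetical, and the weights are the time-fixed simplification of the inverse probability weights rather than the time-varying version used in the text.

```python
import numpy as np

def simulate_case_cohort(n, alpha, rho, tau, pi, rng):
    """Simulate a full cohort and mask covariates outside the case-cohort sample."""
    p = len(alpha)
    # Equicorrelated N(0, Sigma): a shared factor plus independent noise.
    shared = rng.standard_normal((n, 1))
    z = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * rng.standard_normal((n, p))
    t = rng.exponential(1.0 / np.exp(z @ alpha))   # Cox model with lambda0(t) = 1 (assumed)
    c = rng.uniform(0.0, tau, size=n)              # censoring time ~ Unif(0, tau)
    x = np.minimum(t, c)
    delta = (t <= c).astype(int)
    xi = rng.binomial(1, pi, size=n)               # Bernoulli subcohort indicator
    observed = (xi == 1) | (delta == 1)            # covariates known for subcohort and cases
    z_obs = np.where(observed[:, None], z, np.nan)
    return x, delta, xi, z_obs

def fixed_ipw_weights(delta, xi):
    """Time-fixed weights: 1 for cases, 1/pi_hat for sampled noncases, 0 otherwise."""
    pi_hat = xi[delta == 0].mean()                 # sampled fraction among noncases
    return delta + (1 - delta) * xi / pi_hat

rng = np.random.default_rng(1)
alpha = np.zeros(50)
alpha[:4], alpha[4] = 1.0, -2.0                    # hypothetical coefficients
x, delta, xi, z_obs = simulate_case_cohort(500, alpha, 0.5, 1.0, 0.25, rng)
w = fixed_ipw_weights(delta, xi)
```

With this choice of `pi_hat`, the weights sum to the full cohort size, which is a quick sanity check that the sampled noncases stand in for all noncases. In practice τ would be tuned until the observed failure rate matches the target.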

Example 2.

We consider the same model as in example 1, but with only $Z_1$ and $Z_p$ as active covariates. The first (p − 1) covariates follow $(Z_1, \ldots, Z_{p-1})^T \sim N(0, \Sigma)$ with $\Sigma = (\sigma_{ij})$, where $\sigma_{ii} = 1$ for i = 1, …, (p − 1) and $\sigma_{ij} = \rho$ for i ≠ j. We vary the value of ρ over 0, 0.3 and 0.7, with a larger ρ yielding higher collinearity. The last covariate is $Z_p \sim N(0, 1)$. We compute the absolute correlation between the survival time T and each covariate $Z_j$ (j = 1, …, p) for p = 2000 through the inverse probability weighting scheme and summarize the marginal correlation in three groups: the active covariates ($Z_1, \ldots, Z_4$ for example 1 and $Z_1$ for example 2), the hidden active covariates ($Z_5$ for example 1 and $Z_p$ for example 2), and the inactive covariates ($Z_6, \ldots, Z_p$ for example 1 and $Z_2, \ldots, Z_{p-1}$ for example 2). Figures 2 and 3 depict the distribution of the absolute correlation for these three groups, from which we can see that the marginal signal strength of the hidden active covariates is weaker than that of the inactive covariates. Therefore, the marginal screening methods MWSIS, PSIS, FAST and CRIS have difficulty identifying the hidden active covariates, and the proposed conditional screening method CWSIS is an ideal alternative. In our simulations, we simply choose $Z_1$ as the conditioning covariate. In practice, if we have no useful prior information about active covariates, we can choose the covariates with higher marginal signal strength as the conditioning set (Barut et al., 2016; Lu and Lin, 2020). For a fair comparison, we add one (the number of conditioning covariates in our examples) to the minimum model size for the proposed conditional screening method CWSIS.
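
An inverse-probability-weighted absolute correlation of the kind summarized in Figures 2 and 3 can be sketched as a weighted Pearson correlation. The text does not spell out how censoring enters this calculation, so the sketch assumes the observed time `x` stands in for the survival time. The function name is hypothetical.

```python
import numpy as np

def weighted_abs_correlation(x, z, w):
    """|weighted Pearson correlation| between x and each column of z.

    Subjects with zero weight (covariates not measured) are dropped first,
    so their missing covariate values never enter the computation.
    """
    keep = w > 0
    x, z, w = x[keep], z[keep], w[keep]
    w = w / w.sum()                        # normalize weights to sum to one
    xc = x - w @ x                         # center at the weighted means
    zc = z - w @ z
    cov = (w * xc) @ zc                    # weighted covariances, one per column
    return np.abs(cov) / np.sqrt((w @ xc**2) * (w @ zc**2))

# Toy usage: the last subject has weight 0 and unmeasured covariates.
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z_toy = np.array([[3.0, 1.0], [5.0, -1.0], [7.0, -1.0], [9.0, 1.0], [np.nan, np.nan]])
w_toy = np.array([1.0, 2.0, 2.0, 1.0, 0.0])
corr = weighted_abs_correlation(x_toy, z_toy, w_toy)
```

In the toy data the first column is an exact linear function of `x_toy` and the second is orthogonal to it under the weights, so `corr` is approximately `[1, 0]`.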

Fig. 2

Absolute correlation of the survival time and the covariates for p = 2000.

Fig. 3

Absolute correlation of the survival time and the covariates for p = 4000.

The simulation results for the minimum model size, $P_e$ and $P_a$ are summarized in Tables 1 and 2. From the values of $P_e$ for $Z_5$ in example 1 and $Z_p$ in example 2, we conclude that the proposed CWSIS procedure detects the hidden active covariates with high probability, while the other four methods, MWSIS, PSIS, FAST and CRIS, fail to select them. In example 2, ρ equals 0, 0.3 and 0.7, with a larger ρ yielding higher collinearity. The proposed method CWSIS performs well even with high collinearity, while the other four methods do not behave well even when ρ = 0, and their performance deteriorates as ρ increases. As expected, CWSIS needs a smaller model size to achieve sure screening in all settings. A larger case-cohort sample size and a higher failure rate are associated with better performance. In particular, a larger cohort sample size handles rare disease situations better.

Table 1

The median and interquartile range (IQR) of the minimum model size, and the selection proportions $P_e$ and $P_a$, among 500 replications for example 1

							Pe
p	n	FR	n_c	Method	Median	IQR	X ₁	X ₂	X ₃	X ₄	X ₅	Pa
2000	1000	5%	50	PSIS	1849	469	0.130	0.106	0.094	0.102	0.016	0.000
				FAST	1843	446	0.116	0.108	0.106	0.104	0.016	0.000
				CRIS	1398	764	0.290	0.270	0.288	0.250	0.220	0.084
				MWSIS	2000	2	0.354	0.342	0.352	0.342	0.000	0.000
				CWSIS	447	746	–	0.318	0.350	0.330	0.476	0.018
	1000	10%	100	PSIS	1998	18	0.488	0.456	0.436	0.462	0.000	0.000
				FAST	1998	19	0.474	0.444	0.422	0.450	0.000	0.000
				CRIS	1721	379	0.050	0.032	0.020	0.038	0.004	0.000
				MWSIS	2000	0	0.784	0.804	0.774	0.794	0.000	0.000
				CWSIS	69	172	–	0.790	0.768	0.810	0.760	0.356
	500	20%	100	PSIS	2000	1	0.686	0.706	0.706	0.668	0.002	0.000
				FAST	2000	1	0.654	0.654	0.694	0.622	0.002	0.000
				CRIS	1720	405	0.054	0.054	0.044	0.054	0.002	0.000
				MWSIS	2000	0	0.812	0.832	0.840	0.798	0.000	0.000
				CWSIS	47	168	–	0.828	0.852	0.806	0.764	0.442
4000	1000	5%	50	PSIS	3747	716	0.384	0.364	0.376	0.380	0.002	0.000
				FAST	3748	744	0.368	0.352	0.378	0.362	0.002	0.000
				CRIS	3133	1100	0.022	0.018	0.014	0.022	0.000	0.000
				MWSIS	4000	3	0.670	0.700	0.710	0.702	0.000	0.000
				CWSIS	908	1477	–	0.720	0.680	0.734	0.688	0.252
	1000	10%	100	PSIS	3995	46	0.384	0.364	0.376	0.380	0.002	0.000
				FAST	3995	48	0.368	0.352	0.378	0.362	0.002	0.000
				CRIS	3363	795	0.022	0.018	0.014	0.022	0.000	0.000
				MWSIS	4000	0	0.670	0.700	0.710	0.702	0.000	0.000
				CWSIS	136	389	–	0.720	0.680	0.734	0.688	0.252
	500	20%	100	PSIS	4000	2	0.600	0.608	0.578	0.630	0.000	0.000
				FAST	4000	2	0.582	0.592	0.574	0.582	0.000	0.000
				CRIS	3447	871	0.036	0.050	0.038	0.024	0.000	0.000
				MWSIS	4000	0	0.770	0.732	0.730	0.766	0.000	0.000
				CWSIS	86	277	–	0.770	0.784	0.806	0.746	0.350

n, the sample size of the full cohort; p, the number of covariates; FR, the failure rate; n_c, the average number of cases; CWSIS: the proposed conditional screening method; MWSIS: the marginal weighted screening procedure; PSIS: the screening procedure of Zhao and Li (2012); FAST: the screening procedure of Gorst-Rasmussen and Scheike (2013); CRIS: the screening procedure of Song et al. (2014).

Table 2

The median and interquartile range (IQR) of the minimum model size, and the selection proportions $P_e$ and $P_a$, among 500 replications for example 2

					p = 2000					p = 4000
							Pe					Pe
n	FR	ρ	n_c	Method	Median	IQR	X ₁	X_p	Pa	Median	IQR	X ₁	X_p	Pa
500	20%	0	100	PSIS	578	979	1.000	0.092	0.092	1279	2088	1.000	0.066	0.066
				FAST	594	975	1.000	0.090	0.090	1286	2054	1.000	0.068	0.068
				CRIS	841	1005	1.000	0.032	0.032	1683	1920	0.998	0.014	0.014
				MWSIS	424	951	1.000	0.104	0.104	936	1819	1.000	0.088	0.088
				CWSIS	2	0	–	1.000	1.000	2	0	–	1.000	1.000
		0.3	100	PSIS	1973	131	1.000	0.000	0.000	3958	302	1.000	0.000	0.000
				FAST	1971	138	1.000	0.000	0.000	3958	311	1.000	0.000	0.000
				CRIS	1278	1192	0.998	0.022	0.022	2795	2220	0.998	0.012	0.012
				MWSIS	1997	33	1.000	0.000	0.000	3993	62	1.000	0.002	0.002
				CWSIS	2	0	–	1.000	1.000	2	0	–	1.000	1.000
		0.7	100	PSIS	2000	0	0.380	0.000	0.000	4000	0	0.278	0.000	0.000
				FAST	2000	0	1.000	0.000	0.000	4000	0	1.000	0.000	0.000
				CRIS	1945	460	0.990	0.008	0.008	3923	848	0.978	0.004	0.004
				MWSIS	2000	0	0.676	0.000	0.000	4000	0	0.558	0.000	0.000
				CWSIS	2	0	–	1.000	1.000	2	0	–	1.000	1.000
1000	10%	0	100	PSIS	664	1033	1.000	0.064	0.064	1385	2072	1.000	0.038	0.038
				FAST	684	1024	1.000	0.064	0.064	1376	2031	1.000	0.036	0.036
				CRIS	937	982	0.920	0.014	0.008	1938	2140	0.816	0.002	0.000
				MWSIS	315	795	1.000	0.170	0.170	599	1766	1.000	0.116	0.116
				CWSIS	2	0	–	0.998	0.998	2	0	–	0.998	0.998
		0.3	100	PSIS	1928	374	0.964	0.002	0.002	3870	710	0.946	0.000	0.000
				FAST	1926	382	1.000	0.002	0.002	3863	678	1.000	0.000	0.000
				CRIS	1233	1119	0.884	0.034	0.030	2427	2355	0.810	0.024	0.012
				MWSIS	1999	15	1.000	0.000	0.000	3998	63	1.000	0.000	0.000
				CWSIS	2	0	–	1.000	1.000	2	0	–	0.998	0.998
		0.7	100	PSIS	2000	0	0.042	0.000	0.000	4000	0	0.016	0.000	0.000
				FAST	2000	0	0.996	0.000	0.000	4000	0	0.994	0.000	0.000
				CRIS	1737	930	0.794	0.028	0.024	3451	1921	0.710	0.024	0.010
				MWSIS	2000	0	0.208	0.000	0.000	4000	0	0.150	0.000	0.000
				CWSIS	2	0	–	1.000	1.000	2	0	–	0.998	0.998
1000	5%	0	50	PSIS	1075	984	0.254	0.006	0.002	2010	2274	0.266	0.008	0.006
				FAST	931	1022	0.990	0.008	0.008	1771	2138	1.000	0.002	0.002
				CRIS	1332	1002	0.562	0.000	0.000	2364	1826	0.436	0.000	0.000
				MWSIS	520	983	1.000	0.042	0.042	1082	2023	1.000	0.030	0.030
				CWSIS	2	1	–	0.936	0.936	2	2	–	0.882	0.882
		0.3	50	PSIS	1678	667	0.080	0.002	0.000	3459	1476	0.046	0.004	0.002
				FAST	1580	825	0.958	0.002	0.002	3249	1642	0.976	0.002	0.002
				CRIS	1501	1077	0.568	0.006	0.004	2592	2036	0.448	0.004	0.000
				MWSIS	1981	155	0.984	0.000	0.000	3971	204	0.956	0.000	0.000
				CWSIS	2	0	–	0.946	0.946	2	1	–	0.890	0.890
		0.7	50	PSIS	2000	2	0.000	0.000	0.000	4000	1	0.000	0.000	0.000
				FAST	2000	2	0.502	0.000	0.000	4000	1	0.536	0.000	0.000
				CRIS	1721	1037	0.590	0.010	0.002	3093	2247	0.462	0.008	0.002
				MWSIS	2000	0	0.020	0.000	0.000	4000	0	0.006	0.000	0.000
				CWSIS	2	0	–	0.942	0.942	2	1	–	0.900	0.900

n, the sample size of the full cohort; p, the number of covariates; FR, the failure rate; n_c, the average number of cases; ρ, the correlation coefficient of covariates; CWSIS: the proposed conditional screening method; MWSIS: the marginal weighted screening procedure; PSIS: the screening procedure of Zhao and Li (2012); FAST: the screening procedure of Gorst-Rasmussen and Scheike (2013); CRIS: the screening procedure of Song et al. (2014).

To assess the performance of the proposed method in settings similar to the real data, we further consider n = 300 and a failure rate of 25% for example 2; the remaining setups are kept the same as before. Here, we also consider the unweighted conditional screening method NCWSIS, which does not adopt the weight function and simply treats the case-cohort data as SRS data, and the conditional screening method CSMPLE of Hong et al. (2018). Since CSMPLE was proposed for SRS data and cannot be directly used to handle case-cohort data, we generate SRS data with the same sample size as the case-cohort data for CSMPLE. The simulation results for the minimum model size, $P_e$ and $P_a$ are summarized in Table 3, from which we can see that the proposed method detects the hidden active covariates with high probability and retains its advantages in all the considered settings. Comparing the results of NCWSIS, CSMPLE and CWSIS, we conclude that the performance of the conditional screening method is improved by including the case-cohort weight. Moreover, the proposed conditional screening procedure based on the case-cohort design is more accurate in selecting the active covariates than conditional screening based on an SRS of the same size as the case-cohort sample. For example, when p = 2000 and ρ = 0.7, the selection proportion is only 0.460 for CSMPLE, while the corresponding value for the proposed method CWSIS equals 1.

Table 3

The median and interquartile range (IQR) of the minimum model size, and the selection proportions $P_e$ and $P_a$, among 500 replications for example 2 with n = 300 and FR = 25%

			p = 2000					p = 4000
					Pe					Pe
ρ	n_c	Method	Median	IQR	X ₁	X_p	Pa	Median	IQR	X ₁	X_p	Pa
0	75	PSIS	671	1040	1.000	0.062	0.062	1226	1920	1.000	0.032	0.032
		FAST	711	1035	1.000	0.068	0.068	1221	1927	1.000	0.034	0.034
		CRIS	854	1064	1.000	0.026	0.026	1626	2057	0.998	0.012	0.012
		MWSIS	599	1057	1.000	0.058	0.058	1014	1900	1.000	0.040	0.040
		NCWSIS	5	53	-	0.706	0.706	8	115	-	0.648	0.648
		CSMPLE	7	68	-	0.674	0.674	12	108	-	0.608	0.608
		CWSIS	2	0	-	1.000	1.000	2	0	-	0.994	0.994
0.3	75	PSIS	1960	236	1.000	0.000	0.000	3917	485	1.000	0.000	0.000
		FAST	1959	236	1.000	0.000	0.000	3915	516	1.000	0.000	0.000
		CRIS	1339	1216	0.998	0.014	0.014	2718	2467	0.994	0.020	0.020
		MWSIS	1987	106	1.000	0.000	0.000	3966	225	1.000	0.000	0.000
		NCWSIS	3	60	-	0.712	0.712	4	57	-	0.704	0.704
		CSMPLE	10	65	-	0.648	0.648	16	223	-	0.574	0.574
		CWSIS	2	0	-	1.000	1.000	2	0	-	1.000	1.000
0.7	75	PSIS	2000	0	0.596	0.000	0.000	4000	0	0.578	0.000	0.000
		FAST	2000	0	1.000	0.000	0.000	4000	0	1.000	0.000	0.000
		CRIS	1948	399	0.990	0.002	0.002	3921	1035	0.978	0.006	0.006
		MWSIS	2000	0	0.610	0.000	0.000	4000	0	0.556	0.000	0.000
		NCWSIS	2	40	-	0.732	0.732	2	38	-	0.736	0.736
		CSMPLE	44	204	-	0.460	0.460	74	606	-	0.388	0.388
		CWSIS	2	0	-	1.000	1.000	2	0	-	1.000	1.000

n, the sample size of the full cohort; p, the number of covariates; FR, the failure rate; n_c, the average number of cases; ρ, the correlation coefficient of covariates; PSIS: the screening procedure of Zhao and Li (2012); FAST: the screening procedure of Gorst-Rasmussen and Scheike (2013); CRIS: the screening procedure of Song et al. (2014); MWSIS: the marginal weighted screening procedure; NCWSIS: the unweighted conditional screening method; CSMPLE: the conditional screening method of Hong et al. (2018); CWSIS: the proposed conditional screening method.
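As a rough illustration of how selection proportions such as Pe and Pa can be tallied over replications, the sketch below assumes definitions that are not restated in this section: Pe for a covariate is taken to be the proportion of replications in which that covariate ranks within the top d, and Pa the proportion in which all active covariates do. The function name `selection_proportions` and the toy rankings are hypothetical.

```python
def selection_proportions(rankings, active, d):
    """Tally selection proportions over replications.

    rankings: one list per replication, covariate indices ordered best first.
    active:   indices of the truly active covariates (assumed known in a simulation).
    d:        number of top-ranked covariates retained by the screening step.
    """
    reps = len(rankings)
    hits = {j: 0 for j in active}
    all_hits = 0
    for order in rankings:
        top = set(order[:d])
        for j in active:
            hits[j] += j in top          # covariate j retained in this replication
        all_hits += all(j in top for j in active)  # every active covariate retained
    pe = {j: hits[j] / reps for j in active}
    pa = all_hits / reps
    return pe, pa


# Toy example: 3 replications, 4 covariates, covariates 0 and 3 active, keep top 2.
toy_rankings = [[0, 1, 2, 3], [1, 0, 3, 2], [3, 2, 1, 0]]
pe, pa = selection_proportions(toy_rankings, active=[0, 3], d=2)
```

Under these assumed definitions, Pa can never exceed the smallest individual Pe, which matches the pattern in Table 3 where Pa coincides with the proportion for X_p.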

Application to breast cancer data

As an illustration, we apply the proposed CWSIS method to the breast cancer data of van de Vijver et al. (2002), which comprise 295 female patients with primary invasive breast carcinoma. For each patient, the expression levels of 24885 genes were profiled on cDNA arrays from all tumors. A set of 4919 candidate genes was selected after initial screening using the Rosetta error model (van’t Veer et al., 2002). After excluding the individuals with missing values, we have 289 subjects with 4919 candidate genes. The median observed time was 7.23 years (ranging from 0.05 to 18.34 years). During the follow-up, 78 patients died of breast cancer and the other 211 patients were still alive, which leads to a failure rate of 26.99%. Of the 289 patient samples, 60 overlapped with the 78 training samples of van’t Veer et al. (2002); we use these 60 samples as the testing set and the case-cohort samples as our training set. The details of these two sets are summarized in Table 4. The interest of the study is to identify genes that have great influence on patients’ overall survival.

Table 4

Summary of the breast cancer data

Dataset	Num	Min	Max	Median	Fail(%)
Train	289	0.055	18.341	7.225	26.99
Test	60	0.712	15.352	7.606	38.33

Train, the training set; Test, the testing set; Num, the number of patients; Min, the minimum observed survival time; Max, the maximum observed survival time; Median, the median of observed survival time; Fail, the failure rate.

We illustrate the proposed method by identifying genes that have great influence on patients’ overall survival based on data from a case-cohort sample. Specifically, we select the subcohort by independent Bernoulli sampling with selection probability π = 0.37, which results in about the same number of cases and noncases. The subcohort has 111 subjects and the final case-cohort sample has 155 subjects. Gene AL080059 is known from the literature to be predictive of patients’ survival time (Yeung et al., 2005; van’t Veer et al., 2002), so we use it as the conditioning variable in the proposed procedure. Screening methods are usually applied as an initial step to reduce the dimensionality and are then followed by model-based regularization methods. In particular, we first apply the proposed CWSIS procedure to reduce the dimension from p = 4919 to ⌈155/log(155)⌉ = 31, and then use the regularization methods LASSO, SCAD and MCP to select the significant genes among these 31 under the Cox proportional hazards model, with the tuning parameter selected by 10-fold cross-validation. We summarize the names and the estimated coefficients of the selected genes in Table 5, from which we can see that genes Contig58368.RC, NM.014889, NM.005689, NM.013290, AL080059, NM.013332, Contig63649.RC and NM.002916 were selected by all of the LASSO, SCAD and MCP methods, indicating that these eight genes could be associated with patients’ survival. Moreover, genes Contig58368.RC, NM.014889 and NM.005689 were ranked in the top three positions by all three methods, which suggests that these three genes may have great influence on patients’ survival.
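The sampling step and the screening threshold described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `case_cohort_sample` and `screening_size`, the seed, and the synthetic case indicator (78 cases among 289 subjects, mirroring the counts reported here) are assumptions introduced for the example, and the natural logarithm is assumed in ⌈n/log(n)⌉, which is consistent with the reported value of 31 for n = 155.

```python
import math
import random


def screening_size(n):
    """Hard threshold d = ceil(n / log(n)), using the natural logarithm."""
    return math.ceil(n / math.log(n))


def case_cohort_sample(is_case, pi, seed=0):
    """Draw a case-cohort sample by independent Bernoulli sampling.

    Each cohort member enters the subcohort with probability pi; the
    case-cohort sample is the subcohort plus all remaining cases.
    Returns (subcohort indices, case-cohort sample indices).
    """
    rng = random.Random(seed)
    subcohort = [i for i in range(len(is_case)) if rng.random() < pi]
    in_subcohort = set(subcohort)
    sample = [i for i in range(len(is_case)) if i in in_subcohort or is_case[i]]
    return subcohort, sample


# Hypothetical cohort with the reported counts: 289 subjects, 78 of them cases.
is_case = [True] * 78 + [False] * 211
subcohort, sample = case_cohort_sample(is_case, pi=0.37)

# Number of genes kept after screening when the case-cohort sample has 155 subjects.
d = screening_size(155)
```

Because the subcohort is drawn by independent Bernoulli trials, its size varies from draw to draw, so a simulated sample will generally not reproduce the reported sizes of 111 and 155 exactly.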

Table 5

The results of selected important genes for the breast cancer data using the regularization methods

LASSO		SCAD		MCP
Name	Est.	Name	Est.	Name	Est.
Contig58368.RC	0.392	Contig58368.RC	0.516	Contig58368.RC	0.515
NM.014889	0.277	NM.014889	0.446	NM.014889	0.445
NM.005689	0.201	NM.005689	0.329	NM.005689	0.329
NM.013332	0.199	NM.013290	0.326	NM.013290	0.325
Contig63649.RC	0.178	AL080059	0.312	AL080059	0.312
NM.013290	0.172	NM.013332	0.256	NM.013332	0.256
AL080059	0.168	Contig63649.RC	0.249	Contig63649.RC	0.249
NM.002916	0.140	NM.002916	0.204	NM.002916	0.206
NM.012291	0.102
Contig31288.RC	0.083
Contig38288.RC	0.049
NM.003376	0.017
NM.001673	0.014

Name: the name for selected genes; Est.: the corresponding estimated value of the coefficient for selected genes.

To evaluate the predictive accuracy of CWSIS, we further compute the concordance statistic (C-statistic) estimator of Uno et al. (2011). For comparison, we also apply the MWSIS and NCWSIS procedures to these data. In particular, we first apply these three screening methods to reduce the dimension to ⌈155/log(155)⌉ = 31, and then perform LASSO penalization to further remove irrelevant covariates, with the tuning parameter selected by 10-fold cross-validation. We obtain the risk score for each subject from the final model selected by LASSO and compute the corresponding C-statistic in the testing set. The standard deviations (SD) of the C-statistic are obtained from perturbation resampling with 1000 replications. The C-statistic values, with SDs in parentheses, are 0.862 (0.059), 0.796 (0.078) and 0.802 (0.053) for the CWSIS, MWSIS and NCWSIS procedures, respectively. According to Uno et al. (2011), the larger the C-statistic, the stronger the predictive power of the method. We can conclude that the proposed CWSIS method performs reasonably well for ultrahigh-dimensional survival data under the case-cohort design and delivers favorable predictive performance. We also consider d = n/2, n/3, n/4 when analyzing these data and summarize the results in the supplementary material, from which we can see that the genes selected under different cut-offs are highly consistent. Furthermore, we compute the C-statistic estimator for the CWSIS, MWSIS and NCWSIS procedures under these three cut-offs. The results in the supplementary material lead to conclusions similar to those obtained with d = n/log(n).
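To make the notion of concordance concrete, the sketch below computes Harrell's concordance index from risk scores on a testing set. Note that this is a simpler, swapped-in statistic: the estimator of Uno et al. (2011) used in the paper additionally applies inverse-probability-of-censoring weights and a truncation time, both of which are omitted here. The function name `harrell_c` and the toy data are hypothetical.

```python
def harrell_c(times, events, scores):
    """Harrell's concordance index (simplified stand-in for Uno's C-statistic).

    A pair (i, j) is usable when subject i has an observed failure strictly
    before subject j's observed time. The pair is concordant when the subject
    who failed earlier has the higher risk score; tied scores count one half.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / usable if usable else float("nan")


# Toy testing set: observed times, failure indicators (1 = failure), risk scores.
toy_times = [2.0, 4.0, 6.0, 8.0]
toy_events = [1, 1, 0, 1]
toy_scores = [3.0, 2.0, 1.5, 0.5]  # higher score = higher predicted risk
c_index = harrell_c(toy_times, toy_events, toy_scores)
```

A value of 0.5 corresponds to random ordering and 1 to perfect ordering, which is the sense in which a larger C-statistic indicates stronger predictive power.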

Conclusion

For ultrahigh-dimensional survival data under the case-cohort design, we propose a conditional screening procedure, CWSIS, that incorporates prior information on active covariates. This method enables the detection of hidden active covariates, which is a clear advantage over marginal screening procedures. Moreover, the proposed procedure does not require any complicated numerical optimization and is computationally efficient. Theoretically, it enjoys the sure screening property and the ranking consistency property under some mild regularity conditions. In developing the theoretical properties, we adopt the conditional linear expectation and conditional linear covariance, which were proposed in Hong et al. (2018) and are useful for specifying the regularity conditions. Some issues deserve further consideration. First, the proposed method requires prior information on active covariates, and such information may sometimes be difficult to obtain. Hong et al. (2016) proposed a data-driven method to obtain the conditioning set for generalized linear models. How to develop a data-driven conditional screening method for survival data under the case-cohort design is an interesting question. Furthermore, when we have prior knowledge of active covariates, how to balance it with the information extracted from the given data merits further investigation. Second, under our design, the subcohort is selected by independent Bernoulli sampling. When the subcohort is selected by simple random sampling without replacement, our method also works, although more complicated arguments would be needed to develop the theoretical properties. Moreover, when some covariates are available for all cohort members, we can consider the stratified case-cohort design based on those covariates. Third, one could develop more efficient screening methods that incorporate more complex prior knowledge, such as the network structure or the spatial information of the covariates.

30 in total

1. Nonparametric Independence Screening in Sparse Ultra-High Dimensional Additive Models.

Authors: Jianqing Fan; Yang Feng; Rui Song
Journal: J Am Stat Assoc Date: 2011-06 Impact factor: 5.033

2. Marginal hazards model for case-cohort studies with multiple disease outcomes.

Authors: S Kang; J Cai
Journal: Biometrika Date: 2009-12 Impact factor: 2.445

3. Censored cumulative residual independent screening for ultrahigh-dimensional survival data.

Authors: Jing Zhang; Guosheng Yin; Yanyan Liu; Yuanshan Wu
Journal: Lifetime Data Anal Date: 2017-05-26 Impact factor: 1.588

4. Variable selection for case-cohort studies with failure time outcome.

Authors: A I Ni; Jianwen Cai; Donglin Zeng
Journal: Biometrika Date: 2016-08-10 Impact factor: 2.445

5. More efficient estimators for case-cohort studies.

Authors: S Kim; J Cai; W Lu
Journal: Biometrika Date: 2013 Impact factor: 2.445

6. Gene expression profiling predicts clinical outcome of breast cancer.

Authors: Laura J van 't Veer; Hongyue Dai; Marc J van de Vijver; Yudong D He; Augustinus A M Hart; Mao Mao; Hans L Peterse; Karin van der Kooy; Matthew J Marton; Anke T Witteveen; George J Schreiber; Ron M Kerkhoven; Chris Roberts; Peter S Linsley; René Bernards; Stephen H Friend
Journal: Nature Date: 2002-01-31 Impact factor: 49.962