Jing Zhang, Haibo Zhou, Yanyan Liu, Jianwen Cai.
Abstract
The case-cohort design has been widely used to reduce the cost of covariate measurements in large cohort studies. In many such studies, the number of covariates is very large, and the goal of the research is to identify active covariates that strongly influence the response. Since the introduction of sure independence screening, screening procedures have achieved great success in effectively reducing the dimensionality and identifying active covariates. However, commonly used screening methods are based on marginal correlation or its variants, so they may fail to identify hidden active variables that are jointly important but only weakly correlated with the response. Moreover, these screening methods are mainly proposed for data obtained under simple random sampling and cannot be directly applied to case-cohort data. In this paper, we consider ultrahigh-dimensional survival data under the case-cohort design and propose a conditional screening method that incorporates important prior information about active variables. This method can effectively detect hidden active variables. Furthermore, it possesses the sure screening property under some mild regularity conditions and does not require any complicated numerical optimization. We evaluate the finite-sample performance of the proposed method via extensive simulation studies and further illustrate the new approach through a real data set from patients with breast cancer.
Keywords: Case-cohort design; Conditional screening; Sure screening property; Survival data; Ultrahigh-dimensional data; Weighted estimating equation
Year: 2021 PMID: 34417679 PMCID: PMC8561435 DOI: 10.1007/s10985-021-09531-7
Source DB: PubMed Journal: Lifetime Data Anal ISSN: 1380-7870 Impact factor: 1.429
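The case-cohort design and the "weighted estimating equation" keyword can be made concrete with a small sampling sketch. The code below draws a random subcohort from a full cohort, keeps every case, and assigns inverse-probability-style weights. The function name `case_cohort_weights`, the subcohort size of 125, and the particular weight (number of non-cases in the cohort divided by number of non-cases sampled) are assumptions chosen for illustration; they are one common weighting scheme and are not claimed to be the exact weights used in the paper. The cohort size of 500 and the 20% failure rate mirror one simulation setting reported in this record.

```python
import random


def case_cohort_weights(delta, subcohort_size, rng):
    """Return one weight per subject for a case-cohort sample.

    delta: list of 0/1 failure indicators for the full cohort.
    subcohort_size: number of subjects drawn at random as the subcohort.
    Cases always get weight 1; sampled non-cases get the inverse of the
    observed sampling fraction among non-cases; everyone else gets 0.
    """
    n = len(delta)
    subcohort = set(rng.sample(range(n), subcohort_size))
    n_noncases = sum(1 for d in delta if d == 0)
    sampled_noncases = sum(1 for i in subcohort if delta[i] == 0)
    if sampled_noncases == 0:
        raise ValueError("subcohort contains no non-cases")
    noncase_weight = n_noncases / sampled_noncases

    weights = []
    for i, d in enumerate(delta):
        if d == 1:
            weights.append(1.0)             # every case is measured
        elif i in subcohort:
            weights.append(noncase_weight)  # sampled non-case, up-weighted
        else:
            weights.append(0.0)             # covariates never measured
    return weights


rng = random.Random(2021)
n = 500
delta = [1 if rng.random() < 0.20 else 0 for _ in range(n)]  # about 20% failures
weights = case_cohort_weights(delta, subcohort_size=125, rng=rng)
n_measured = sum(1 for w in weights if w > 0)
print("subjects with measured covariates:", n_measured)
print("sum of weights:", round(sum(weights), 6))
```

With these weights the weighted sample stands in for the full cohort: the weights add up to the cohort size even though covariates are measured on only a fraction of the subjects.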
Fig. 1 Density of the screening statistic for the hidden active covariate Z5 compared with a mixture of densities of the inactive covariates Z6, …, Z2000 under different conditioning sets. Case 1: the empty set ∅, which is equivalent to marginal screening; Case 2: {1}, one truly active covariate; Case 3: {1, 2}, two truly active covariates; Case 4: {6, 7, 8}, three inactive covariates. The full cohort sample size is n = 500, the number of covariates is p = 2000, the noncase-to-case ratio is 1:1, and the failure rate is 20%.
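The idea of a hidden active covariate, one that matters jointly but looks inactive marginally, can be illustrated with a toy example. The sketch below deliberately uses an uncensored linear model and a partial correlation as a stand-in; it is not the weighted estimating-equation statistic for survival data proposed in the paper. All names and parameter values (`rho`, `beta1`, `beta2`, the noise scale, the sample size) are assumptions for illustration. The coefficient of the second covariate is set to `-beta1 * rho` so that its marginal covariance with the response cancels exactly.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residualize(y, x):
    """Residuals of y after a least-squares fit on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


rng = random.Random(0)
n, rho, beta1 = 5000, 0.5, 1.0
beta2 = -beta1 * rho  # chosen so that cov(z2, y) = beta1 * rho + beta2 = 0

z1 = [rng.gauss(0, 1) for _ in range(n)]
z2 = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in z1]
y = [beta1 * a + beta2 * b + 0.5 * rng.gauss(0, 1) for a, b in zip(z1, z2)]

# Marginal screening: correlation of z2 with the response, ignoring z1.
marginal = pearson(z2, y)
# Conditional screening (toy version): correlate what is left of z2 and y
# after removing the part explained by the known active covariate z1.
conditional = pearson(residualize(z2, z1), residualize(y, z1))

print("marginal correlation of z2 with y:", round(marginal, 3))
print("correlation after conditioning on z1:", round(conditional, 3))
```

In this toy setting the marginal correlation is close to zero, so a marginal ranking would place `z2` among the noise variables, while conditioning on the known active covariate `z1` exposes a clear signal.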
Fig. 2 Absolute correlation of the survival time and the covariates for p = 2000.
Fig. 3 Absolute correlation of the survival time and the covariates for p = 4000.
The median and interquartile range (IQR) of , the selection proportions and among 500 replications for example 1
| p | n | FR | Cases | Method | Median | IQR | Sel. prop. | Sel. prop. | Sel. prop. | Sel. prop. | Sel. prop. | Sel. prop. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000 | 1000 | 5% | 50 | PSIS | 1849 | 469 | 0.130 | 0.106 | 0.094 | 0.102 | 0.016 | 0.000 |
| | | | | FAST | 1843 | 446 | 0.116 | 0.108 | 0.106 | 0.104 | 0.016 | 0.000 |
| | | | | CRIS | 1398 | 764 | 0.290 | 0.270 | 0.288 | 0.250 | 0.220 | 0.084 |
| | | | | MWSIS | 2000 | 2 | 0.354 | 0.342 | 0.352 | 0.342 | 0.000 | 0.000 |
| | | | | CWSIS | 447 | 746 | – | 0.318 | 0.350 | 0.330 | 0.476 | 0.018 |
| | 1000 | 10% | 100 | PSIS | 1998 | 18 | 0.488 | 0.456 | 0.436 | 0.462 | 0.000 | 0.000 |
| | | | | FAST | 1998 | 19 | 0.474 | 0.444 | 0.422 | 0.450 | 0.000 | 0.000 |
| | | | | CRIS | 1721 | 379 | 0.050 | 0.032 | 0.020 | 0.038 | 0.004 | 0.000 |
| | | | | MWSIS | 2000 | 0 | 0.784 | 0.804 | 0.774 | 0.794 | 0.000 | 0.000 |
| | | | | CWSIS | 69 | 172 | – | 0.790 | 0.768 | 0.810 | 0.760 | 0.356 |
| | 500 | 20% | 100 | PSIS | 2000 | 1 | 0.686 | 0.706 | 0.706 | 0.668 | 0.002 | 0.000 |
| | | | | FAST | 2000 | 1 | 0.654 | 0.654 | 0.694 | 0.622 | 0.002 | 0.000 |
| | | | | CRIS | 1720 | 405 | 0.054 | 0.054 | 0.044 | 0.054 | 0.002 | 0.000 |
| | | | | MWSIS | 2000 | 0 | 0.812 | 0.832 | 0.840 | 0.798 | 0.000 | 0.000 |
| | | | | CWSIS | 47 | 168 | – | 0.828 | 0.852 | 0.806 | 0.764 | 0.442 |
| 4000 | 1000 | 5% | 50 | PSIS | 3747 | 716 | 0.384 | 0.364 | 0.376 | 0.380 | 0.002 | 0.000 |
| | | | | FAST | 3748 | 744 | 0.368 | 0.352 | 0.378 | 0.362 | 0.002 | 0.000 |
| | | | | CRIS | 3133 | 1100 | 0.022 | 0.018 | 0.014 | 0.022 | 0.000 | 0.000 |
| | | | | MWSIS | 4000 | 3 | 0.670 | 0.700 | 0.710 | 0.702 | 0.000 | 0.000 |
| | | | | CWSIS | 908 | 1477 | – | 0.720 | 0.680 | 0.734 | 0.688 | 0.252 |
| | 1000 | 10% | 100 | PSIS | 3995 | 46 | 0.384 | 0.364 | 0.376 | 0.380 | 0.002 | 0.000 |
| | | | | FAST | 3995 | 48 | 0.368 | 0.352 | 0.378 | 0.362 | 0.002 | 0.000 |
| | | | | CRIS | 3363 | 795 | 0.022 | 0.018 | 0.014 | 0.022 | 0.000 | 0.000 |
| | | | | MWSIS | 4000 | 0 | 0.670 | 0.700 | 0.710 | 0.702 | 0.000 | 0.000 |
| | | | | CWSIS | 136 | 389 | – | 0.720 | 0.680 | 0.734 | 0.688 | 0.252 |
| | 500 | 20% | 100 | PSIS | 4000 | 2 | 0.600 | 0.608 | 0.578 | 0.630 | 0.000 | 0.000 |
| | | | | FAST | 4000 | 2 | 0.582 | 0.592 | 0.574 | 0.582 | 0.000 | 0.000 |
| | | | | CRIS | 3447 | 871 | 0.036 | 0.050 | 0.038 | 0.024 | 0.000 | 0.000 |
| | | | | MWSIS | 4000 | 0 | 0.770 | 0.732 | 0.730 | 0.766 | 0.000 | 0.000 |
| | | | | CWSIS | 86 | 277 | – | 0.770 | 0.784 | 0.806 | 0.746 | 0.350 |

p, the number of covariates; n, the sample size of the full cohort; FR, the failure rate; Cases, the average number of cases; Sel. prop., a selection proportion; CWSIS: the proposed conditional screening method; MWSIS: the marginal weighted screening procedure; PSIS: the screening procedure of Zhao and Li (2012); FAST: the screening procedure of Gorst-Rasmussen and Scheike (2013); CRIS: the screening procedure of Song et al. (2014).
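The simulation tables report, over 500 replications, a median and IQR together with selection proportions. A minimal sketch of how summaries of this kind could be computed is given here. It assumes that the summarized quantity is the largest rank attained by any active covariate in a replication (often called the minimum model size) and that a covariate counts as selected when its rank is at most a cutoff `d`; both are assumptions on my part, as are the function name `summarize_screening`, the input layout, and the toy ranks, none of which come from the paper.

```python
import statistics


def summarize_screening(ranks_per_rep, active, d):
    """Summarize screening results across replications.

    ranks_per_rep: one dict per replication mapping covariate name to its
                   rank (1 = strongest screening statistic).
    active: names of the truly active covariates.
    d: hypothetical cutoff; a covariate is "selected" if its rank <= d.
    """
    # Smallest model (top-k list) that contains every active covariate.
    model_sizes = [max(ranks[j] for j in active) for ranks in ranks_per_rep]
    q1, _, q3 = statistics.quantiles(model_sizes, n=4)
    n_rep = len(ranks_per_rep)
    per_covariate = {
        j: sum(ranks[j] <= d for ranks in ranks_per_rep) / n_rep for j in active
    }
    all_active = sum(size <= d for size in model_sizes) / n_rep
    return {
        "median": statistics.median(model_sizes),
        "iqr": q3 - q1,
        "per_covariate": per_covariate,
        "all_active": all_active,
    }


# Four toy replications with two active covariates (hypothetical ranks).
ranks_per_rep = [
    {"Z2": 1, "Z5": 2},
    {"Z2": 3, "Z5": 1},
    {"Z2": 2, "Z5": 5},
    {"Z2": 1, "Z5": 4},
]
summary = summarize_screening(ranks_per_rep, active=["Z2", "Z5"], d=2)
print(summary)
```

Under this reading, a small median with a small IQR means the active covariates are consistently ranked near the top, and a selection proportion near 1 means a covariate is almost always retained at the chosen cutoff. Note that IQR conventions differ between software packages; `statistics.quantiles` uses its default "exclusive" method here.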
The median and interquartile range (IQR) of , the selection proportions and among 500 replications for example 2
| n | FR | ρ | Cases | Method | Median (p = 2000) | IQR (p = 2000) | Sel. prop. (p = 2000) | Sel. prop. (p = 2000) | Sel. prop. (p = 2000) | Median (p = 4000) | IQR (p = 4000) | Sel. prop. (p = 4000) | Sel. prop. (p = 4000) | Sel. prop. (p = 4000) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 20% | 0 | 100 | PSIS | 578 | 979 | 1.000 | 0.092 | 0.092 | 1279 | 2088 | 1.000 | 0.066 | 0.066 |
| | | | | FAST | 594 | 975 | 1.000 | 0.090 | 0.090 | 1286 | 2054 | 1.000 | 0.068 | 0.068 |
| | | | | CRIS | 841 | 1005 | 1.000 | 0.032 | 0.032 | 1683 | 1920 | 0.998 | 0.014 | 0.014 |
| | | | | MWSIS | 424 | 951 | 1.000 | 0.104 | 0.104 | 936 | 1819 | 1.000 | 0.088 | 0.088 |
| | | | | CWSIS | 2 | 0 | – | 1.000 | 1.000 | 2 | 0 | – | 1.000 | 1.000 |
| | | 0.3 | 100 | PSIS | 1973 | 131 | 1.000 | 0.000 | 0.000 | 3958 | 302 | 1.000 | 0.000 | 0.000 |
| | | | | FAST | 1971 | 138 | 1.000 | 0.000 | 0.000 | 3958 | 311 | 1.000 | 0.000 | 0.000 |
| | | | | CRIS | 1278 | 1192 | 0.998 | 0.022 | 0.022 | 2795 | 2220 | 0.998 | 0.012 | 0.012 |
| | | | | MWSIS | 1997 | 33 | 1.000 | 0.000 | 0.000 | 3993 | 62 | 1.000 | 0.002 | 0.002 |
| | | | | CWSIS | 2 | 0 | – | 1.000 | 1.000 | 2 | 0 | – | 1.000 | 1.000 |
| | | 0.7 | 100 | PSIS | 2000 | 0 | 0.380 | 0.000 | 0.000 | 4000 | 0 | 0.278 | 0.000 | 0.000 |
| | | | | FAST | 2000 | 0 | 1.000 | 0.000 | 0.000 | 4000 | 0 | 1.000 | 0.000 | 0.000 |
| | | | | CRIS | 1945 | 460 | 0.990 | 0.008 | 0.008 | 3923 | 848 | 0.978 | 0.004 | 0.004 |
| | | | | MWSIS | 2000 | 0 | 0.676 | 0.000 | 0.000 | 4000 | 0 | 0.558 | 0.000 | 0.000 |
| | | | | CWSIS | 2 | 0 | – | 1.000 | 1.000 | 2 | 0 | – | 1.000 | 1.000 |
| 1000 | 10% | 0 | 100 | PSIS | 664 | 1033 | 1.000 | 0.064 | 0.064 | 1385 | 2072 | 1.000 | 0.038 | 0.038 |
| | | | | FAST | 684 | 1024 | 1.000 | 0.064 | 0.064 | 1376 | 2031 | 1.000 | 0.036 | 0.036 |
| | | | | CRIS | 937 | 982 | 0.920 | 0.014 | 0.008 | 1938 | 2140 | 0.816 | 0.002 | 0.000 |
| | | | | MWSIS | 315 | 795 | 1.000 | 0.170 | 0.170 | 599 | 1766 | 1.000 | 0.116 | 0.116 |
| | | | | CWSIS | 2 | 0 | – | 0.998 | 0.998 | 2 | 0 | – | 0.998 | 0.998 |
| | | 0.3 | 100 | PSIS | 1928 | 374 | 0.964 | 0.002 | 0.002 | 3870 | 710 | 0.946 | 0.000 | 0.000 |
| | | | | FAST | 1926 | 382 | 1.000 | 0.002 | 0.002 | 3863 | 678 | 1.000 | 0.000 | 0.000 |
| | | | | CRIS | 1233 | 1119 | 0.884 | 0.034 | 0.030 | 2427 | 2355 | 0.810 | 0.024 | 0.012 |
| | | | | MWSIS | 1999 | 15 | 1.000 | 0.000 | 0.000 | 3998 | 63 | 1.000 | 0.000 | 0.000 |
| | | | | CWSIS | 2 | 0 | – | 1.000 | 1.000 | 2 | 0 | – | 0.998 | 0.998 |
| | | 0.7 | 100 | PSIS | 2000 | 0 | 0.042 | 0.000 | 0.000 | 4000 | 0 | 0.016 | 0.000 | 0.000 |
| | | | | FAST | 2000 | 0 | 0.996 | 0.000 | 0.000 | 4000 | 0 | 0.994 | 0.000 | 0.000 |
| | | | | CRIS | 1737 | 930 | 0.794 | 0.028 | 0.024 | 3451 | 1921 | 0.710 | 0.024 | 0.010 |
| | | | | MWSIS | 2000 | 0 | 0.208 | 0.000 | 0.000 | 4000 | 0 | 0.150 | 0.000 | 0.000 |
| | | | | CWSIS | 2 | 0 | – | 1.000 | 1.000 | 2 | 0 | – | 0.998 | 0.998 |
| 1000 | 5% | 0 | 50 | PSIS | 1075 | 984 | 0.254 | 0.006 | 0.002 | 2010 | 2274 | 0.266 | 0.008 | 0.006 |
| | | | | FAST | 931 | 1022 | 0.990 | 0.008 | 0.008 | 1771 | 2138 | 1.000 | 0.002 | 0.002 |
| | | | | CRIS | 1332 | 1002 | 0.562 | 0.000 | 0.000 | 2364 | 1826 | 0.436 | 0.000 | 0.000 |
| | | | | MWSIS | 520 | 983 | 1.000 | 0.042 | 0.042 | 1082 | 2023 | 1.000 | 0.030 | 0.030 |
| | | | | CWSIS | 2 | 1 | – | 0.936 | 0.936 | 2 | 2 | – | 0.882 | 0.882 |
| | | 0.3 | 50 | PSIS | 1678 | 667 | 0.080 | 0.002 | 0.000 | 3459 | 1476 | 0.046 | 0.004 | 0.002 |
| | | | | FAST | 1580 | 825 | 0.958 | 0.002 | 0.002 | 3249 | 1642 | 0.976 | 0.002 | 0.002 |
| | | | | CRIS | 1501 | 1077 | 0.568 | 0.006 | 0.004 | 2592 | 2036 | 0.448 | 0.004 | 0.000 |
| | | | | MWSIS | 1981 | 155 | 0.984 | 0.000 | 0.000 | 3971 | 204 | 0.956 | 0.000 | 0.000 |
| | | | | CWSIS | 2 | 0 | – | 0.946 | 0.946 | 2 | 1 | – | 0.890 | 0.890 |
| | | 0.7 | 50 | PSIS | 2000 | 2 | 0.000 | 0.000 | 0.000 | 4000 | 1 | 0.000 | 0.000 | 0.000 |
| | | | | FAST | 2000 | 2 | 0.502 | 0.000 | 0.000 | 4000 | 1 | 0.536 | 0.000 | 0.000 |
| | | | | CRIS | 1721 | 1037 | 0.590 | 0.010 | 0.002 | 3093 | 2247 | 0.462 | 0.008 | 0.002 |
| | | | | MWSIS | 2000 | 0 | 0.020 | 0.000 | 0.000 | 4000 | 0 | 0.006 | 0.000 | 0.000 |
| | | | | CWSIS | 2 | 0 | – | 0.942 | 0.942 | 2 | 1 | – | 0.900 | 0.900 |

n, the sample size of the full cohort; p, the number of covariates; FR, the failure rate; ρ, the correlation coefficient of covariates; Cases, the average number of cases; Sel. prop., a selection proportion; CWSIS: the proposed conditional screening method; MWSIS: the marginal weighted screening procedure; PSIS: the screening procedure of Zhao and Li (2012); FAST: the screening procedure of Gorst-Rasmussen and Scheike (2013); CRIS: the screening procedure of Song et al. (2014).
The median and interquartile range (IQR) of , the selection proportions and among 500 replications for example 2 with n = 300 and FR=25%
| ρ | Cases | Method | Median (p = 2000) | IQR (p = 2000) | Sel. prop. (p = 2000) | Sel. prop. (p = 2000) | Sel. prop. (p = 2000) | Median (p = 4000) | IQR (p = 4000) | Sel. prop. (p = 4000) | Sel. prop. (p = 4000) | Sel. prop. (p = 4000) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75 | PSIS | 671 | 1040 | 1.000 | 0.062 | 0.062 | 1226 | 1920 | 1.000 | 0.032 | 0.032 |
| | | FAST | 711 | 1035 | 1.000 | 0.068 | 0.068 | 1221 | 1927 | 1.000 | 0.034 | 0.034 |
| | | CRIS | 854 | 1064 | 1.000 | 0.026 | 0.026 | 1626 | 2057 | 0.998 | 0.012 | 0.012 |
| | | MWSIS | 599 | 1057 | 1.000 | 0.058 | 0.058 | 1014 | 1900 | 1.000 | 0.040 | 0.040 |
| | | NCWSIS | 5 | 53 | – | 0.706 | 0.706 | 8 | 115 | – | 0.648 | 0.648 |
| | | CSMPLE | 7 | 68 | – | 0.674 | 0.674 | 12 | 108 | – | 0.608 | 0.608 |
| | | CWSIS | 2 | 0 | – | 1.000 | 1.000 | 2 | 0 | – | 0.994 | 0.994 |
| 0.3 | 75 | PSIS | 1960 | 236 | 1.000 | 0.000 | 0.000 | 3917 | 485 | 1.000 | 0.000 | 0.000 |
| | | FAST | 1959 | 236 | 1.000 | 0.000 | 0.000 | 3915 | 516 | 1.000 | 0.000 | 0.000 |
| | | CRIS | 1339 | 1216 | 0.998 | 0.014 | 0.014 | 2718 | 2467 | 0.994 | 0.020 | 0.020 |
| | | MWSIS | 1987 | 106 | 1.000 | 0.000 | 0.000 | 3966 | 225 | 1.000 | 0.000 | 0.000 |
| | | NCWSIS | 3 | 60 | – | 0.712 | 0.712 | 4 | 57 | – | 0.704 | 0.704 |
| | | CSMPLE | 10 | 65 | – | 0.648 | 0.648 | 16 | 223 | – | 0.574 | 0.574 |
| | | CWSIS | 2 | 0 | – | 1.000 | 1.000 | 2 | 0 | – | 1.000 | 1.000 |
| 0.7 | 75 | PSIS | 2000 | 0 | 0.596 | 0.000 | 0.000 | 4000 | 0 | 0.578 | 0.000 | 0.000 |
| | | FAST | 2000 | 0 | 1.000 | 0.000 | 0.000 | 4000 | 0 | 1.000 | 0.000 | 0.000 |
| | | CRIS | 1948 | 399 | 0.990 | 0.002 | 0.002 | 3921 | 1035 | 0.978 | 0.006 | 0.006 |
| | | MWSIS | 2000 | 0 | 0.610 | 0.000 | 0.000 | 4000 | 0 | 0.556 | 0.000 | 0.000 |
| | | NCWSIS | 2 | 40 | – | 0.732 | 0.732 | 2 | 38 | – | 0.736 | 0.736 |
| | | CSMPLE | 44 | 204 | – | 0.460 | 0.460 | 74 | 606 | – | 0.388 | 0.388 |
| | | CWSIS | 2 | 0 | – | 1.000 | 1.000 | 2 | 0 | – | 1.000 | 1.000 |

n, the sample size of the full cohort; p, the number of covariates; FR, the failure rate; ρ, the correlation coefficient of covariates; Cases, the average number of cases; Sel. prop., a selection proportion; PSIS: the screening procedure of Zhao and Li (2012); FAST: the screening procedure of Gorst-Rasmussen and Scheike (2013); CRIS: the screening procedure of Song et al. (2014); MWSIS: the marginal weighted screening procedure; NCWSIS: the unweighted conditional screening method; CSMPLE: the conditional screening method of Hong et al. (2018); CWSIS: the proposed conditional screening method.
Summary of the breast cancer data
| Dataset | Num | Min | Max | Median | Fail(%) |
|---|---|---|---|---|---|
| Train | 289 | 0.055 | 18.341 | 7.225 | 26.99 |
| Test | 60 | 0.712 | 15.352 | 7.606 | 38.33 |
Train, the training set; Test, the testing set; Num, the number of patients; Min, the minimum observed survival time; Max, the maximum observed survival time; Median, the median of observed survival time; Fail, the failure rate.
The results of selected important genes for the breast cancer data using the regularization methods
| LASSO | | SCAD | | MCP | |
|---|---|---|---|---|---|
| Name | Est. | Name | Est. | Name | Est. |
| Contig58368.RC | 0.392 | Contig58368.RC | 0.516 | Contig58368.RC | 0.515 |
| NM.014889 | 0.277 | NM.014889 | 0.446 | NM.014889 | 0.445 |
| NM.005689 | 0.201 | NM.005689 | 0.329 | NM.005689 | 0.329 |
| NM.013332 | 0.199 | NM.013290 | 0.326 | NM.013290 | 0.325 |
| Contig63649.RC | 0.178 | AL080059 | 0.312 | AL080059 | 0.312 |
| NM.013290 | 0.172 | NM.013332 | 0.256 | NM.013332 | 0.256 |
| AL080059 | 0.168 | Contig63649.RC | 0.249 | Contig63649.RC | 0.249 |
| NM.002916 | 0.140 | NM.002916 | 0.204 | NM.002916 | 0.206 |
| NM.012291 | 0.102 | ||||
| Contig31288.RC | 0.083 | ||||
| Contig38288.RC | 0.049 | ||||
| NM.003376 | 0.017 | ||||
| NM.001673 | 0.014 | ||||
Name: the name for selected genes; Est.: the corresponding estimated value of the coefficient for selected genes.
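The gene table compares three regularization methods. As background, the sketch below writes out the standard LASSO, SCAD, and MCP penalty functions so that their different treatment of large coefficients is visible. The tuning values (`lam = 0.1`, `a = 3.7`, `gamma = 3.0`) are conventional defaults picked for illustration; they are assumptions and not the values used in the breast cancer analysis.

```python
def lasso_penalty(t, lam):
    """LASSO penalty: grows linearly in |t| without bound."""
    return lam * abs(t)


def scad_penalty(t, lam, a=3.7):
    """SCAD penalty: linear near zero, quadratic blend, then constant."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def mcp_penalty(t, lam, gamma=3.0):
    """MCP penalty: concave up to gamma * lam, then constant."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2 * gamma)
    return gamma * lam * lam / 2


lam = 0.1  # hypothetical tuning parameter
for beta in (0.05, 0.5, 5.0):
    print(
        beta,
        round(lasso_penalty(beta, lam), 4),
        round(scad_penalty(beta, lam), 4),
        round(mcp_penalty(beta, lam), 4),
    )
```

Because the SCAD and MCP penalties level off for large coefficients while the LASSO penalty keeps growing, SCAD and MCP shrink strong effects less. That behaviour is consistent with, though not proven by, the pattern in the table, where the SCAD and MCP estimates for the shared genes are larger than the LASSO estimates and fewer genes are retained.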