Yuan Zhou1,2, Botao Fa1, Ting Wei1, Jianle Sun1, Zhangsheng Yu3, Yue Zhang4.
Abstract
Investigation of the genetic basis of traits or clinical outcomes relies heavily on identifying relevant variables in molecular data. However, characteristics of these data such as high dimensionality and complex correlation structures hinder the development of suitable methods, resulting in both false positives and false negatives. We developed a variable importance measure, termed the ECAR scores, that evaluates the importance of variables in a dataset. Based on these scores, ranking and selection of variables can be achieved simultaneously. Unlike most current approaches, the ECAR scores aim to rank the influential variables as high as possible while maintaining the grouping property, instead of selecting the ones that are merely predictive. The performance of the ECAR scores was tested and compared with other methods on simulated, semi-synthetic, and real datasets. The results showed that the ECAR scores improve on the CAR scores in terms of the accuracy of variable selection and the predictive power of high-rank variables. They also outperform other classic methods such as lasso and stability selection when there is a high degree of correlation among influential variables. As an application, we used the ECAR scores to analyze genes associated with forced expiratory volume in the first second in patients with lung cancer and reported six associated genes.
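The abstract compares ECAR against the CAR scores without restating how the latter are computed. As background only, the sketch below follows the standard published definition of CAR scores (the inverse square root of the predictor correlation matrix applied to the marginal correlations with the response). It is not the ECAR method, whose formula does not appear in this record. The function name `car_scores` is hypothetical, and the plain empirical estimator used here assumes more samples than variables and a nonsingular correlation matrix, which would not hold for typical high-dimensional molecular data without a regularized estimator.

```python
import numpy as np

def car_scores(X, y):
    """Empirical CAR scores: P^{-1/2} @ corr(X, y).

    Assumption: n > p and the correlation matrix P is nonsingular.
    """
    n = X.shape[0]
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize predictors
    ys = (y - y.mean()) / y.std()               # standardize response
    P = Xs.T @ Xs / n                           # predictor correlation matrix
    r = Xs.T @ ys / n                           # marginal correlations with y
    evals, evecs = np.linalg.eigh(P)            # P is symmetric
    P_inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T
    return P_inv_sqrt @ r
```

With this definition the squared scores sum to the R² of the ordinary least-squares fit, so each squared score can be read as a share of explained variance; ranking variables by squared score is the usual way such scores are used for selection.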
Year: 2021 PMID: 34857823 PMCID: PMC8640025 DOI: 10.1038/s41598-021-02706-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Data-generating diagram and univariate contributions of the variables to the response. (a) The causal diagram of our data-generating model. (b) The univariate contribution of each variable. The overlapping area represents the explained variance shared between variables in univariate regression.
Figure 2. The correlations between 100 transformed gene expression profiles and their original versions as the transformation parameter moves from 0 to 1. Each colored line represents one mRNA. The features are 100 gene expression profiles selected from The Cancer Genome Atlas (TCGA) LIHC cohort.
Figure 3. Comparison of feature selection performance on 500 simulated datasets. The median number of true positive variables as a function of the total number of selected genes, as well as the median of PR-AUC and its standard deviation, are shown for ECAR, CAR, SIS, ridge, lasso and stability selection under five scenarios. The total number of influential genes is 30; these are randomly selected from the first 300 genes (first block). The parameter of ECAR is estimated using the methods described in the Methods section. The regularization parameters of ridge and lasso are estimated using fivefold cross-validation and generalized cross-validation, respectively. As lasso cannot select more variables than the sample size, we let it choose genes randomly once all genes in its selected set have been chosen. (a) is controlled at 0.95 for the 100 simulated datasets. (b) Same as (a), controlled at 0.8. (c) Same as (a), controlled at 0.6. (d) Same as (a), controlled at 0.4. (e) Same as (a), controlled at 0.2.
Figure 4. Comparison of feature selection performance on 500 simulated datasets. This figure is the same as Fig. 3 except that paths are truncated at 100 genes.
Figure 5. Comparison of feature selection performance on 500 simulated datasets. This figure is the same as Fig. 3 except that the influential features are selected randomly from the whole set of features.
Figure 6. Comparison of variable selection performance on 500 semi-synthetic datasets. The median number of true positive variables as a function of the total number of selected genes, as well as the median of PR-AUC and its standard deviation, are shown for ECAR, CAR, SIS, ridge, lasso and stability selection under five scenarios. The total number of influential genes is 50; these are randomly selected from the 1000 genes. The parameter of ECAR is estimated using the methods described in the Methods section. The regularization parameters of ridge and lasso are estimated using fivefold cross-validation and generalized cross-validation, respectively. As lasso cannot select more variables than the sample size, we let it choose genes randomly once all genes in its selected set have been chosen. (a) is controlled at 0.95 for the 100 simulated datasets. (b) Same as (a), controlled at 0.8. (c) Same as (a), controlled at 0.6. (d) Same as (a), controlled at 0.4. (e) Same as (a), controlled at 0.2.
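The captions for Figs. 3 to 6 summarize each method by two quantities: the number of true positives among the top-ranked variables, and the PR-AUC of the ranking. A minimal sketch of both is given below for a single ranked list. The function names are hypothetical, and average precision is used here as one common step-wise estimate of the area under the precision-recall curve; the record does not say which PR-AUC estimator the authors used.

```python
def true_positive_path(ranking, influential):
    """Count of truly influential variables among the top k, for k = 1..len(ranking)."""
    influential = set(influential)
    path, hits = [], 0
    for var in ranking:
        hits += var in influential   # bool adds as 0 or 1
        path.append(hits)
    return path

def average_precision(ranking, influential):
    """Mean of the precision values at each rank where an influential variable appears.

    Assumption: `ranking` lists every candidate variable exactly once.
    """
    influential = set(influential)
    hits, total = 0, 0.0
    for k, var in enumerate(ranking, start=1):
        if var in influential:
            hits += 1
            total += hits / k        # precision at this recall step
    return total / len(influential)

# Toy ranking of five genes, two of which are truly influential.
ranking = ["g3", "g1", "g7", "g2", "g5"]
print(true_positive_path(ranking, {"g3", "g2"}))   # [1, 1, 1, 2, 2]
print(average_precision(ranking, {"g3", "g2"}))    # 0.75
```

Taking the median of these quantities over repeated datasets would give curves and summaries of the kind the captions describe.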
Summary of the generalization performance of high-rank SNP features evaluated by lasso on the three datasets.
| Data | Number of features | ECAR | CAR | Lasso | Stability selection | SIS | Ridge | Base lasso |
|---|---|---|---|---|---|---|---|---|
| Spike Length | 5 | 342.9 | 357.3 | 361.1 | 331.4 | 331.2 | 366.1 | 272.8 |
| | 10 | 320.4 | 330.1 | 339.0 | 317.1 | 325.0 | 332.2 | 272.8 |
| | 20 | 303.6 | 308.7 | 314.9 | 301.1 | 317.8 | 317.0 | 272.8 |
| | 30 | 301.9 | 306.2 | 310.2 | 296.3 | 316.1 | 307.2 | 272.8 |
| Lodging Degree | 5 | 3.48 | 3.56 | 3.57 | 3.51 | 3.15 | 3.80 | 2.91 |
| | 10 | 3.42 | 3.54 | 3.49 | 3.35 | 3.10 | 3.75 | 2.91 |
| | 20 | 3.36 | 3.39 | 3.46 | 3.30 | 3.14 | 3.60 | 2.91 |
| | 30 | 3.28 | 3.43 | 3.47 | 3.27 | 3.12 | 3.52 | 2.91 |
| Leaf Width | 5 | 0.067 | 0.067 | 0.070 | 0.062 | 0.060 | 0.069 | 0.050 |
| | 10 | 0.061 | 0.061 | 0.063 | 0.059 | 0.056 | 0.066 | 0.050 |
| | 20 | 0.058 | 0.058 | 0.059 | 0.056 | 0.055 | 0.062 | 0.050 |
| | 30 | 0.056 | 0.056 | 0.059 | 0.055 | 0.059 | 0.060 | 0.050 |
Base lasso is the prediction performance of lasso on the test sets using all features as input. See the “Methods” section for further details.
Summary of the generalization performance of high-rank SNP features evaluated by ridge on the three datasets.
| Data | Number of features | ECAR | CAR | Lasso | Stability selection | SIS | Ridge | Base ridge |
|---|---|---|---|---|---|---|---|---|
| Spike Length | 5 | 344.5 | 350.5 | 361.6 | 336.6 | 337.8 | 350.1 | 268.3 |
| | 10 | 326.5 | 329.4 | 340.4 | 319.7 | 331.8 | 331.4 | 268.3 |
| | 20 | 303.8 | 305.7 | 317.9 | 300.9 | 324.6 | 310.1 | 268.3 |
| | 30 | 297.7 | 301.4 | 305.5 | 297.2 | 324.8 | 303.1 | 268.3 |
| Lodging Degree | 5 | 3.45 | 3.46 | 3.44 | 3.34 | 3.10 | 3.69 | 2.61 |
| | 10 | 3.30 | 3.46 | 3.42 | 3.24 | 3.00 | 3.63 | 2.61 |
| | 20 | 3.26 | 3.39 | 3.33 | 3.18 | 2.95 | 3.56 | 2.61 |
| | 30 | 3.21 | 3.29 | 3.36 | 3.18 | 2.91 | 3.40 | 2.61 |
| Leaf Width | 5 | 0.065 | 0.065 | 0.066 | 0.062 | 0.058 | 0.070 | 0.047 |
| | 10 | 0.061 | 0.061 | 0.061 | 0.057 | 0.055 | 0.067 | 0.047 |
| | 20 | 0.057 | 0.057 | 0.059 | 0.054 | 0.053 | 0.062 | 0.047 |
| | 30 | 0.055 | 0.055 | 0.058 | 0.053 | 0.052 | 0.060 | 0.047 |
Base ridge is the prediction performance of ridge regression on the test sets using all features as input. See the “Methods” section for further details.
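The two tables above report how well a downstream model predicts on held-out data when it is given only the top-ranked features from each method. The sketch below illustrates that kind of evaluation with a closed-form ridge fit. Everything specific in it is an assumption rather than the authors' protocol: the function name `topk_test_mse`, the penalty `lam`, the use of mean squared error as the reported metric, and the standardization step (which would need guarding if a selected feature is constant in the training set).

```python
import numpy as np

def topk_test_mse(X_train, y_train, X_test, y_test, ranking, k, lam=1.0):
    """Fit ridge on the k top-ranked features and return the test-set MSE.

    `ranking` is a sequence of column indices, most important first.
    """
    cols = list(ranking[:k])
    mu = X_train[:, cols].mean(axis=0)
    sd = X_train[:, cols].std(axis=0)            # assumed nonzero
    A = (X_train[:, cols] - mu) / sd             # scale using training statistics only
    B = (X_test[:, cols] - mu) / sd
    y_mean = y_train.mean()
    # Closed-form ridge solution on centered response.
    beta = np.linalg.solve(A.T @ A + lam * np.eye(len(cols)), A.T @ (y_train - y_mean))
    pred = B @ beta + y_mean
    return float(np.mean((y_test - pred) ** 2))
```

Calling this for each method's ranking at k = 5, 10, 20 and 30 would produce one row group of a table with the same layout as those above, with the "base" column corresponding to a fit on all features.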
Summary of selected genes for each method.
| Methods | Selected genes (FDR = 0.05) |
|---|---|
| ECAR | |
| CAR | |
| SIS | |