Stephane Wenric, Ruhollah Shemirani.
Abstract
Whole transcriptome studies typically yield large amounts of data, with expression values for all genes or transcripts of the genome. The search for genes of interest in a particular study setting can thus be a daunting task, usually relying on automated computational methods. Moreover, most biological questions imply that such a search should be performed in a multivariate setting, to take into account the inter-gene relationships. Differential expression analysis commonly yields large lists of genes deemed significant, even after adjustment for multiple testing, which leaves a wide range of possible follow-up studies. Here, we explore the use of supervised learning methods to rank large ensembles of genes defined by their expression values measured with RNA-Seq in a typical two-class sample set. First, we use one of the variable importance measures generated by the random forests classification algorithm as a metric to rank genes. Second, we define the EPS (extreme pseudo-samples) pipeline, which uses VAEs (variational autoencoders) and regressors to extract a ranking of genes while leveraging the feature space of both virtual and comparable samples. We show that, on 12 cancer RNA-Seq data sets ranging from 323 to 1,210 samples, using either a random forests-based gene selection method or the EPS pipeline outperforms differential expression analysis for 9 and 8 out of the 12 data sets, respectively, in terms of identifying subsets of genes associated with survival. These results demonstrate the potential of supervised learning-based gene selection methods in RNA-Seq studies and highlight the need to use such multivariate gene selection methods alongside the widely used differential expression analysis.
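As a rough illustration of the first approach, the sketch below ranks genes by an impurity-based importance score accumulated over a small forest of one-split trees (decision stumps) grown on bootstrap samples. It is a simplified stand-in written with the Python standard library, not the authors' pipeline: the abstract does not say which importance measure or implementation was used, and the toy expression values, the gene names, and the parameters `n_trees` and `n_try` are all hypothetical.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split_gain(values, labels):
    """Largest impurity decrease achievable by thresholding a single gene."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini([label for _, label in pairs])
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best

def rank_genes(X, y, gene_names, n_trees=200, n_try=2, seed=0):
    """Rank genes by total impurity decrease over a forest of stumps."""
    rng = random.Random(seed)
    n, p = len(X), len(gene_names)
    importance = [0.0] * p
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        candidates = rng.sample(range(p), n_try)     # random subset of genes
        boot_y = [y[r] for r in rows]
        gains = {g: best_split_gain([X[r][g] for r in rows], boot_y)
                 for g in candidates}
        winner = max(gains, key=gains.get)           # gene used by this stump
        importance[winner] += gains[winner]
    order = sorted(range(p), key=lambda g: importance[g], reverse=True)
    return [(gene_names[g], importance[g]) for g in order]

# Hypothetical toy data: 40 samples, 6 genes; only gene_0 differs between classes.
data_rng = random.Random(1)
y = [0] * 20 + [1] * 20
X = [[data_rng.gauss(5.0 + 4.0 * label, 0.5)]
     + [data_rng.gauss(5.0, 1.0) for _ in range(5)]
     for label in y]
names = [f"gene_{i}" for i in range(6)]

ranking = rank_genes(X, y, names)
for name, score in ranking:
    print(f"{name}\t{score:.2f}")
```

On this toy input the class-dependent gene should come out on top. A real analysis would grow full trees on normalized RNA-Seq counts, typically with an established random forests implementation.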
Keywords: RNA-Seq; feature selection; gene expression; gene selection; random forests; supervised learning; transcriptomics; variational autoencoders
Year: 2018 PMID: 30123241 PMCID: PMC6085558 DOI: 10.3389/fgene.2018.00297
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
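TCGA data sets used in this study.
| Project | Cancer type | Tumor samples | Normal samples | Mean age (years) | Age range (years) |
| --- | --- | --- | --- | --- | --- |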
| TCGA-BRCA | Breast invasive carcinoma | 1,097 | 113 | 59.07 | 26-90 |
| TCGA-LUAD | Lung adenocarcinoma | 582 | 59 | 66.88 | 33-88 |
| TCGA-UCEC | Uterine Corpus endometrial carcinoma | 559 | 35 | 64.24 | 31-90 |
| TCGA-KIRC | Kidney renal clear cell carcinoma | 535 | 72 | 61.16 | 26-90 |
| TCGA-HNSC | Head and neck squamous cell carcinoma | 528 | 44 | 61.14 | 20-90 |
| TCGA-THCA | Thyroid carcinoma | 507 | 58 | 46.92 | 15-89 |
| TCGA-LUSC | Lung squamous cell carcinoma | 504 | 49 | 68.66 | 39-90 |
| TCGA-PRAD | Prostate adenocarcinoma | 498 | 52 | 61.99 | 42-78 |
| TCGA-COAD | Colon adenocarcinoma | 460 | 41 | 68.88 | 31-90 |
| TCGA-STAD | Stomach adenocarcinoma | 443 | 32 | 67.56 | 30-90 |
| TCGA-LIHC | Liver hepatocellular carcinoma | 377 | 50 | 61.53 | 16-88 |
| TCGA-KIRP | Kidney renal papillary cell carcinoma | 291 | 32 | 62.03 | 28-88 |
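The sample range quoted in the abstract (323 to 1,210) can be recovered from this table if the two count columns are read as tumor and normal samples. The extracted table does not label them, so that reading is an assumption. A quick check:

```python
# (tumor, normal) counts copied from the table above.
# Reading the two columns as tumor and normal samples is an assumption.
counts = {
    "TCGA-BRCA": (1097, 113),
    "TCGA-LUAD": (582, 59),
    "TCGA-UCEC": (559, 35),
    "TCGA-KIRC": (535, 72),
    "TCGA-HNSC": (528, 44),
    "TCGA-THCA": (507, 58),
    "TCGA-LUSC": (504, 49),
    "TCGA-PRAD": (498, 52),
    "TCGA-COAD": (460, 41),
    "TCGA-STAD": (443, 32),
    "TCGA-LIHC": (377, 50),
    "TCGA-KIRP": (291, 32),
}

totals = {name: tumor + normal for name, (tumor, normal) in counts.items()}
smallest = min(totals, key=totals.get)
largest = max(totals, key=totals.get)
print(smallest, totals[smallest])  # TCGA-KIRP 323
print(largest, totals[largest])    # TCGA-BRCA 1210
```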
Figure 1. Study design: a diagram describing the methodology.
Performance comparison of survival gene signatures: The random forests column denotes the number of random forests-based signatures having a lower log-rank p-value than their corresponding differential expression-based signatures.
| Project | Cancer type | Random forests | Extreme pseudo-samples |
| --- | --- | --- | --- |
| TCGA-BRCA | Breast invasive carcinoma | 5 | 19 |
| TCGA-LUAD | Lung adenocarcinoma | 14 | 14 |
| TCGA-UCEC | Uterine Corpus endometrial carcinoma | 16 | 9 |
| TCGA-KIRC | Kidney renal clear cell carcinoma | 13 | 10 |
| TCGA-HNSC | Head and neck squamous cell carcinoma | 14 | 15 |
| TCGA-THCA | Thyroid carcinoma | 15 | 15 |
| TCGA-LUSC | Lung squamous cell carcinoma | 5 | 0 |
| TCGA-PRAD | Prostate adenocarcinoma | 12 | 19 |
| TCGA-COAD | Colon adenocarcinoma | 11 | 18 |
| TCGA-STAD | Stomach adenocarcinoma | 13 | 19 |
| TCGA-LIHC | Liver hepatocellular carcinoma | 19 | 8 |
| TCGA-KIRP | Kidney renal papillary cell carcinoma | 10 | 19 |
The extreme pseudo-samples column denotes the number of extreme pseudo-samples-based signatures having a lower log-rank p-value than their corresponding differential expression-based signatures. The 3 colors (green, yellow, red) refer to cases where the proposed methods have a higher number, the same number, and a lower number of best-performing gene signatures than DESeq2, respectively.
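The "9 and 8 out of 12" figures in the abstract can be reproduced from the table with a short tally. The sketch assumes 20 signatures were compared per data set and that p-values never tie, so that DESeq2's count is 20 minus the listed count and a value of 10 means "same number". Neither assumption is stated in the extracted text. The first count in each pair is taken to be random forests and the second extreme pseudo-samples.

```python
N_SIGNATURES = 20  # assumed number of signatures compared per data set

# (random forests, extreme pseudo-samples) counts copied from the table above
wins = {
    "TCGA-BRCA": (5, 19),
    "TCGA-LUAD": (14, 14),
    "TCGA-UCEC": (16, 9),
    "TCGA-KIRC": (13, 10),
    "TCGA-HNSC": (14, 15),
    "TCGA-THCA": (15, 15),
    "TCGA-LUSC": (5, 0),
    "TCGA-PRAD": (12, 19),
    "TCGA-COAD": (11, 18),
    "TCGA-STAD": (13, 19),
    "TCGA-LIHC": (19, 8),
    "TCGA-KIRP": (10, 19),
}

def tally(column):
    """Count data sets where the method beats, ties with, or trails DESeq2."""
    better = same = worse = 0
    for counts in wins.values():
        ours = counts[column]
        deseq2 = N_SIGNATURES - ours  # assumes no tied p-values
        if ours > deseq2:
            better += 1
        elif ours == deseq2:
            same += 1
        else:
            worse += 1
    return better, same, worse

rf = tally(0)
eps = tally(1)
print("random forests (better, same, worse):", rf)           # (9, 1, 2)
print("extreme pseudo-samples (better, same, worse):", eps)  # (8, 1, 3)
```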
Figure 2. Performance comparison of survival gene signatures: evolution of the log-rank p-values obtained with survival gene signatures comprising an increasing number of genes, for the 3 methods compared and the 4 largest TCGA data sets.
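For readers unfamiliar with the metric, the sketch below computes a two-group log-rank p-value from scratch. It assumes patients have already been split into two groups by a gene signature, for example high versus low signature score. The extracted text does not say how groups were formed, and the survival times below are hypothetical.

```python
import math

def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; events are 1 (death observed) or 0 (censored)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for u, _, g in data if u >= t and g == 0)
        at_risk_b = sum(1 for u, _, g in data if u >= t and g == 1)
        n = at_risk_a + at_risk_b
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        d = sum(1 for u, e, _ in data if u == t and e == 1)
        observed_a += d_a
        expected_a += d * at_risk_a / n  # expected deaths in group A
        if n > 1:                        # hypergeometric variance term
            variance += d * (at_risk_a / n) * (at_risk_b / n) * (n - d) / (n - 1)

    if variance == 0.0:
        return 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical follow-up times (months); every patient has an observed event.
short_survivors = [2, 3, 4, 5, 6]
long_survivors = [10, 12, 14, 16, 18]
all_events = [1, 1, 1, 1, 1]

p_different = logrank_p(short_survivors, all_events, long_survivors, all_events)
p_identical = logrank_p(short_survivors, all_events, short_survivors, all_events)
print(f"separated groups: p = {p_different:.4f}")
print(f"identical groups: p = {p_identical:.4f}")
```

A lower p-value indicates that the signature separates patients with different survival more clearly, which is the sense in which signatures are compared in the figure.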
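Results of a pathway-based gene set enrichment analysis performed on the top 20 ranked genes obtained through the supervised learning methods.
| Project | Cancer type | Method | Pathway | p-value | Database |
| --- | --- | --- | --- | --- | --- |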
| TCGA-BRCA | Breast invasive carcinoma | EPS | Signaling by PTK6 (Goel and Lukong, | 0.00176 | Reactome |
| TCGA-UCEC | Uterine Corpus endometrial carcinoma | RF | Oncostatin_M (Zhu et al., | 0.000876 | NetPath |
| | | EPS | IGF1 (Baserga et al., | 0.000622 | PID |
| TCGA-HNSC | Head and neck squamous cell carcinoma | RF | PPAR signaling pathway (Michalik et al., | 0.00278 | Wikipathways |
| | | EPS | AURKA (Chou et al., | 0.00198 | Reactome |
| TCGA-LUSC | Lung squamous cell carcinoma | EPS | IGF1 | 0.000406 0.000724 | PID Wikipathways |
| TCGA-PRAD | Prostate adenocarcinoma | EPS | IGF1 | 0.000545 | PID |
| | | EPS | AURKA | 0.00311 | Reactome |
| TCGA-COAD | Colon adenocarcinoma | RF | Mitochondrial Beta-Oxidation of Long Chain Saturated Fatty Acids (Wen et al., | 3.6e-05 0.000105 | SMPDB Wikipathways |
| TCGA-LIHC | Liver hepatocellular carcinoma | RF | Angiogenesis (Muto et al., | 0.00168 | Wikipathways |
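The extracted text does not say which tool or statistic produced these enrichment p-values. One common choice for this kind of over-representation question is the hypergeometric tail probability, sketched below with hypothetical gene counts. It is shown only to illustrate how a p-value for "top 20 genes versus a pathway" can arise, not as the method used in the study.

```python
from math import comb

def enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_background genes, n_pathway of which belong to the pathway."""
    upper = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20,000 background genes, a 50-gene pathway,
# 20 top-ranked genes, 3 of which fall in the pathway.
p_value = enrichment_p(20000, 50, 20, 3)
print(f"enrichment p-value: {p_value:.2e}")
```

With so few expected overlaps by chance (20 × 50 / 20,000 = 0.05 genes), even three shared genes yield a small p-value. In practice such values are usually also adjusted for the number of pathways tested.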