Colin A Targonski, Courtney A Shearer, Benjamin T Shealy, Melissa C Smith, F Alex Feltus.
Abstract
Given the complex relationship between gene expression and phenotypic outcomes, computationally efficient approaches are needed to sift through large, high-dimensional datasets to identify biologically relevant biomarkers. In this report, we describe a method of identifying the most salient biomarker genes in a dataset, which we call "candidate genes", by evaluating the ability of gene combinations to classify samples from a dataset, which we call "classification potential". Our algorithm, Gene Oracle, uses a neural network to test user-defined gene sets for polygenic classification potential and then uses a combinatorial approach to further decompose selected gene sets into candidate and non-candidate biomarker genes. We tested this algorithm on curated gene sets from the Molecular Signatures Database (MSigDB) quantified in RNAseq gene expression matrices obtained from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) data repositories. First, we identified which MSigDB Hallmark subsets have significant classification potential for both the TCGA and GTEx datasets. Then, we identified the most discriminatory candidate biomarker genes in each Hallmark gene set and provide evidence that the improved biomarker potential of these genes may be due to reduced functional complexity.
Year: 2019 PMID: 31278367 PMCID: PMC6611793 DOI: 10.1038/s41598-019-46059-1
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. The Gene Oracle Algorithm.
Figure 2. Molecular Signature Classification Screening. Plots depict the classification accuracy of each Hallmark set versus random sets of equal size in both GTEx and TCGA. The left y-axis displays the 50 Hallmark sets. The right y-axis displays the number of genes in each Hallmark set. The x-axis displays the accuracy score produced by the MLP. The light blue brackets denote Hallmark sets that exhibit higher classification accuracy than their random counterparts.
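The screening comparison in Figure 2 (one gene set scored against random sets of equal size) can be illustrated with a small, self-contained sketch. It is not the Gene Oracle implementation: the expression matrix is synthetic, a nearest-centroid classifier stands in for the MLP so that the example needs only the standard library, and the names `make_toy_matrix`, `centroid_accuracy`, and `screen_gene_set` are hypothetical.

```python
import random


def make_toy_matrix(n_samples=200, n_genes=200, informative=range(5), shift=3.0, seed=0):
    """Synthetic expression matrix (assumption): two alternating classes,
    and only the `informative` genes differ between them."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for g in informative:
            row[g] += shift * label
        X.append(row)
        y.append(label)
    return X, y


def centroid_accuracy(X_train, y_train, X_test, y_test, genes):
    """Stand-in classifier (not the paper's MLP): predict the class whose
    centroid, restricted to `genes`, is nearest to the test sample."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, lab in zip(X_train, y_train) if lab == label]
        centroids[label] = [sum(r[g] for r in rows) / len(rows) for g in genes]
    correct = 0
    for x, lab in zip(X_test, y_test):
        distances = {
            label: sum((x[g] - c) ** 2 for g, c in zip(genes, centroid))
            for label, centroid in centroids.items()
        }
        correct += min(distances, key=distances.get) == lab
    return correct / len(y_test)


def screen_gene_set(X, y, gene_set, n_random=5, seed=1):
    """Score `gene_set`, then score `n_random` random gene sets of equal size."""
    rng = random.Random(seed)
    half = len(X) // 2
    split = (X[:half], y[:half], X[half:], y[half:])  # simple train/test split
    set_accuracy = centroid_accuracy(*split, gene_set)
    random_accuracies = [
        centroid_accuracy(*split, rng.sample(range(len(X[0])), len(gene_set)))
        for _ in range(n_random)
    ]
    return set_accuracy, random_accuracies


X, y = make_toy_matrix()
set_accuracy, random_accuracies = screen_gene_set(X, y, list(range(5)))
print(set_accuracy, sum(random_accuracies) / len(random_accuracies))
```

In this toy setting, a gene set counts as having classification potential when its accuracy clearly exceeds that of its random counterparts, which is the comparison the light blue brackets mark in the figure.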
Figure 3. Combinatorial Analysis of Angiogenesis. (A,B) Accuracy plots of subsets from Angiogenesis versus 5 random sets of equal size for GTEx (left) and TCGA (right). In each plot, the darkest random line represents the average of the 5 random experiments, while the faded lines represent the results of individual random experiments. (C,D) Heatmaps depicting the frequency of genes in the subsets that were generated at each iteration. Each row is an iteration and each column is a gene from the Hallmark Angiogenesis set. Darker colors correspond to higher frequencies.
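The heatmaps in panels C and D tabulate how often each gene appears among the subsets generated at each iteration. A minimal sketch of that bookkeeping follows. It assumes the subsets for each iteration are already available; how Gene Oracle selects them is not specified here. The function name `gene_frequencies` and the example subsets are hypothetical.

```python
from collections import Counter


def gene_frequencies(subsets_per_iteration, genes):
    """Return one row per iteration; each entry is the fraction of that
    iteration's subsets that contain the corresponding gene."""
    matrix = []
    for subsets in subsets_per_iteration:
        if not subsets:  # no subsets generated at this iteration
            matrix.append([0.0] * len(genes))
            continue
        counts = Counter(g for subset in subsets for g in set(subset))
        matrix.append([counts[g] / len(subsets) for g in genes])
    return matrix


# Hypothetical subsets, for illustration only (not results from the paper).
genes = ["VEGFA", "FGFR1", "CCND2", "JAG1"]
subsets_per_iteration = [
    [("VEGFA",), ("FGFR1",)],                  # iteration 1: subsets of size 1
    [("VEGFA", "FGFR1"), ("VEGFA", "CCND2")],  # iteration 2: subsets of size 2
]
for row in gene_frequencies(subsets_per_iteration, genes):
    print(row)
```

Each returned row corresponds to one heatmap row (an iteration) and each position to one column (a gene), so larger values map to darker cells.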
Gene Oracle Candidate Genes for GTEx Dataset.
| Hallmark Set | Candidate Genes |
|---|---|
| Hedgehog Signaling | SHH, UNC5C, HEY2, L1CAM, NKX6-1, CRMP1, CELSR1, CNTFR, CDK5R1, VLDLR, DPYSL2 |
| Angiogenesis | LRPAP1, LPL, FGFR1, TNFRSF21, CCND2, APP, JAG1, PTK2, VAV2, S100A4, JAG2 |
| Notch Signaling | FZD1, PSEN2, FZD7, CUL1, DTX4, HEYL, ARRB1, PRKCA, ST3GAL6, FBXW11 |
| MYC Targets V2 | SLC19A1, TMEM97, NDUFAF4, MYC, SRM, GNL3, TBRG4, NPM1, CBX3, CDK4, MAP3K6, LAS1L, PUS1, SLC29A2, DCTPP1, SORD |
Gene Oracle Candidate Genes for TCGA Dataset.
| Hallmark Set | Candidate Genes |
|---|---|
| Hedgehog Signaling | NRP1, HEY2, TLE3, TLE1, ETS2, VEGFA, CELSR1, CDK5R1, OPHN1 |
| Angiogenesis | FSTL1, LRPAP1, VEGFA, FGFR1, CCND2, COL5A2, SERPINA5, SPP1, NRP1 |
| Notch Signaling | FZD1, FZD7, FZD5, NOTCH1, DTX4, ARRB1, PRKCA, ST3GAL6 |
| MYC Targets V2 | SLC19A1, RRP9, IPO4, MYC, SRM, GNL3, RABEPK, PLK1, PRMT3, MAP3K6, SLC29A2, DCTPP1, SORD, UNG, MPHOSPH10 |
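The two tables can be compared directly to see which candidate genes were selected in both datasets. The sketch below only restates the gene lists from the tables above and intersects them. The variable and function names are hypothetical, and the overlap is a convenience view of the tables rather than an analysis reported by the authors.

```python
# Candidate genes copied from the GTEx and TCGA tables above.
GTEX = {
    "Hedgehog Signaling": "SHH, UNC5C, HEY2, L1CAM, NKX6-1, CRMP1, CELSR1, CNTFR, CDK5R1, VLDLR, DPYSL2",
    "Angiogenesis": "LRPAP1, LPL, FGFR1, TNFRSF21, CCND2, APP, JAG1, PTK2, VAV2, S100A4, JAG2",
    "Notch Signaling": "FZD1, PSEN2, FZD7, CUL1, DTX4, HEYL, ARRB1, PRKCA, ST3GAL6, FBXW11",
    "MYC Targets V2": "SLC19A1, TMEM97, NDUFAF4, MYC, SRM, GNL3, TBRG4, NPM1, CBX3, CDK4, MAP3K6, LAS1L, PUS1, SLC29A2, DCTPP1, SORD",
}
TCGA = {
    "Hedgehog Signaling": "NRP1, HEY2, TLE3, TLE1, ETS2, VEGFA, CELSR1, CDK5R1, OPHN1",
    "Angiogenesis": "FSTL1, LRPAP1, VEGFA, FGFR1, CCND2, COL5A2, SERPINA5, SPP1, NRP1",
    "Notch Signaling": "FZD1, FZD7, FZD5, NOTCH1, DTX4, ARRB1, PRKCA, ST3GAL6",
    "MYC Targets V2": "SLC19A1, RRP9, IPO4, MYC, SRM, GNL3, RABEPK, PLK1, PRMT3, MAP3K6, SLC29A2, DCTPP1, SORD, UNG, MPHOSPH10",
}


def parse(cell):
    """Split a comma-separated table cell into a set of gene symbols."""
    return {gene.strip() for gene in cell.split(",")}


def shared_candidates(gtex, tcga):
    """Map each Hallmark set to the sorted genes listed in both tables."""
    return {name: sorted(parse(gtex[name]) & parse(tcga[name])) for name in gtex}


for name, shared in shared_candidates(GTEX, TCGA).items():
    print(f"{name}: {len(shared)} shared: {', '.join(shared)}")
```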
Figure 4. Classification potential for decomposed gene sets. Classification accuracies for the full Hallmark gene sets (green) are compared to accuracies of the candidate genes identified by the combinatorial decomposition algorithm (yellow), non-candidate genes (blue), and random genes (red). The performance analyses for the (A) Hedgehog Signaling, (B) Angiogenesis, (C) Notch Signaling, and (D) MYC Targets V2 Hallmark sets are shown for both GTEx and TCGA datasets. The number at the bottom of each bar denotes the size of the gene set for that result.
Figure 5. Functional Complexity Analysis. Gene Ontology (GO) term enrichment was performed for each full Hallmark, candidate, non-candidate, or random gene set tested with the TCGA or GTEx gene expression matrix (GEM), where N = Notch Signaling, H = Hedgehog Signaling, A = Angiogenesis, and M = MYC Targets V2. (A) The total number of enriched GO terms is shown (Adj. p < 0.001). (B) The average connectivity 〈k〉 of unique GO terms shared between genes in a set is shown.
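The caption does not spell out how 〈k〉 is computed. One possible reading, offered here only as an assumption, treats each gene as a node, links two genes when they share at least one GO term, and reports the mean node degree, 〈k〉 = 2E/N. The function name `average_connectivity` and the toy annotations below are hypothetical placeholders, not real GO annotations.

```python
from itertools import combinations


def average_connectivity(annotations):
    """Mean degree of a gene graph in which two genes are linked when their
    term sets intersect (an assumed definition, not taken from the paper)."""
    genes = list(annotations)
    if not genes:
        return 0.0
    edges = sum(
        1 for a, b in combinations(genes, 2) if annotations[a] & annotations[b]
    )
    return 2 * edges / len(genes)


# Placeholder annotations for illustration only.
toy_annotations = {
    "geneA": {"term1", "term2"},
    "geneB": {"term2"},
    "geneC": {"term3"},
    "geneD": {"term1", "term3"},
}
print(average_connectivity(toy_annotations))
```

Under this reading, a gene set whose members share few terms yields a lower 〈k〉, which is one way to express reduced functional complexity.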