Zeev Waks, Omer Weissbrod, Boaz Carmeli, Raquel Norel, Filippo Utro, Yaara Goldschmidt.
Abstract
Compiling a comprehensive list of cancer driver genes is imperative for oncology diagnostics and drug development. While driver genes are typically discovered by analysis of tumor genomes, infrequently mutated driver genes often evade detection due to limited sample sizes. Here, we address sample size limitations by integrating tumor genomics data with a wide spectrum of gene-specific properties to search for rare drivers, functionally classify them, and detect features characteristic of driver genes. We show that our approach, CAnceR geNe similarity-based Annotator and Finder (CARNAF), enables detection of potentially novel drivers that eluded over a dozen pan-cancer/multi-tumor type studies. In particular, feature analysis reveals a highly concentrated pool of known and putative tumor suppressors among the <1% of genes that encode very large, chromatin-regulating proteins. Thus, our study highlights the need for deeper characterization of very large, epigenetic regulators in the context of cancer causality.
Year: 2016 PMID: 28008934 PMCID: PMC5180091 DOI: 10.1038/srep38988
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Approach for detection of infrequently mutated driver genes.
(A) There is likely a long tail of uncharacterized driver genes with infrequent somatic tumor aberrations or atypical mutation patterns. CNA – copy number alteration (deletions and gains). (B) Illustration of the CARNAF pipeline. A diverse set of gene-specific features is extracted and used for ranking genes as TSGs, OGs, or non-driver genes. (C) Breakdown of genes used for CARNAF training. 165 high confidence driver genes (84 TSGs and 81 OGs) are used as positive examples. Additional genes present in at least one of 15 pan-cancer/multi-tumor type studies used in this work are divided into medium confidence, low confidence, and other evidence drivers and are omitted from training (Online Methods). The remaining 15,972 background genes are used as negative examples for CARNAF training.
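The training breakdown in panel (C) can be sketched as a simple labeling step. This is a minimal illustration, not the CARNAF implementation: the function name `build_training_labels`, the label strings, and all gene names other than KMT2D are hypothetical, and the classifier that consumes these labels is not specified here.

```python
def build_training_labels(all_genes, hc_tsgs, hc_ogs, omitted):
    """Assign a training label to each gene, or leave it out of training.

    hc_tsgs, hc_ogs: high confidence driver sets (positive examples).
    omitted: medium confidence, low confidence, and other evidence genes.
    Every remaining gene is treated as background (negative example).
    """
    labels = {}
    for gene in all_genes:
        if gene in hc_tsgs:
            labels[gene] = "TSG"
        elif gene in hc_ogs:
            labels[gene] = "OG"
        elif gene in omitted:
            continue  # reported in a prior study but not high confidence
        else:
            labels[gene] = "background"
    return labels


# Toy gene universe; the placeholder names are for illustration only.
genes = ["KMT2D", "OG_1", "GENE_X", "GENE_Y", "GENE_Z"]
labels = build_training_labels(
    genes, hc_tsgs={"KMT2D"}, hc_ogs={"OG_1"}, omitted={"GENE_X"}
)
print(labels)
```

Leaving the intermediate-evidence genes unlabeled, rather than forcing them into the negative class, keeps possible true drivers from being learned as non-drivers.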
Gene-specific features used in study.
| Feature category (# features before, after selection) | Description |
|---|---|
| GC percent (1, 1) | GC content of gene, including introns |
| Genomic density (1, 1) | Number of genes that are present ≤4 Mb from gene center |
| DNA replication time (1, 1) | Stage in cell cycle in which gene is replicated |
| Number of transcripts (1, 1) | Number of transcripts per gene |
| Chromatin compartment (1, 1) | Extent that the chromatin compartment of the gene is open or closed (HiC experiment) |
| Tissue RNA levels (27, 27) | Expression levels from 27 different tissues |
| Median across tissues (1, 1) | Median expression level across tissues |
| Variation across tissues (1, 1) | Coefficient of variation (standard deviation divided by mean) across the 27 tissues |
| Coding sequence length (1, 1) | Number of amino acids in longest gene isoform |
| Number of modified residues (14, 10) | Number of acetylation, methylation (mono, di, & tri), phosphorylation, SUMOylation, and ubiquitination sites (normalized by CDS length) |
| Number of PPIs (1, 1) | Number of protein-protein interactions |
| Gene duplication (1, 1) | Is the gene a duplicate gene |
| Betweenness centrality (1, 1) | Measure for centrality within networks as quantified by frequency in shortest paths between nodes (proteins). |
| GO slims biological process (70, 36) | Biological process in which the gene is involved. Gene Ontology (GO) slims are high level gene ontology terms. |
| GO slims molecular function (34, 16) | Specific function of encoded proteins |
| GO slims cellular component (42, 19) | Spatial location of encoded proteins |
| Number of total GO slim terms (4, 4) | Number of total GO slims terms and total per each GO category |
| Predicted haploinsufficiency (1, 1) | Estimated probability of haploinsufficiency of the gene |
| Predicted essentiality (1, 1) | Essential gene or non-essential but phenotype-changing based on mouse homology |
| Mutation patterns (4, 4) | Four features: Mutation clustering estimation (distribution entropy) and ratio of predicted loss-of-function, damaging missense, and splice site mutations to benign mutations |
| Copy number alteration (2, 2) | Somatic gene amplification and deletion frequency |
A diverse set of feature classes were used in the study. The number of features within each category before and after feature selection is presented in parentheses (before, after). 131 features remained after feature selection (Online Methods and Supplementary Tables 1 and 2).
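Several of the features in the table are simple summaries, and a sketch may make them concrete. The code below is illustrative only and rests on stated assumptions: the coefficient of variation is taken in its standard form (standard deviation divided by mean, using the population standard deviation), genomic density counts other genes by their center position, and mutation clustering is approximated as Shannon entropy over equal-width position bins. The function names, the bin count, and the toy values are hypothetical and are not taken from the study.

```python
import math
import statistics


def expression_summary(levels):
    """Return (median, coefficient of variation) of per-tissue expression levels."""
    median = statistics.median(levels)
    mean = statistics.fmean(levels)
    # CV = standard deviation / mean; population SD is an assumption here.
    cv = statistics.pstdev(levels) / mean if mean else float("nan")
    return median, cv


def genomic_density(center, other_centers, window=4_000_000):
    """Count genes on the same chromosome whose center lies within `window` bp."""
    return sum(1 for c in other_centers if abs(c - center) <= window)


def position_entropy(positions, cds_length, n_bins=10):
    """Shannon entropy (bits) of 1-based mutation positions binned along the CDS.

    Lower entropy means mutations concentrate in fewer bins (more clustering).
    """
    counts = [0] * n_bins
    for pos in positions:
        idx = min((pos - 1) * n_bins // cds_length, n_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


# Toy inputs (not real measurements)
median, cv = expression_summary([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
density = genomic_density(10_000_000, [6_000_000, 12_000_000, 14_000_001])
clustered = position_entropy([5, 5, 5, 5], cds_length=100)
spread = position_entropy([5, 15, 25, 35], cds_length=100)
print(median, cv, density, clustered, spread)
```

In this toy case the four mutations at one position give an entropy of zero, while four mutations spread over four bins give two bits, which is the direction a clustering feature is expected to capture.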
Figure 2. Use of non-tumor based features improves detection of rare driver genes.
Precision at N shown for three gene sets: (A) high confidence driver genes, (B) medium confidence genes, and (C) low confidence genes. The 200 top ranked driver genes are shown sorted by rank. Going from left to right, the genes considered in each panel are excluded from subsequent panels. Precision in this scenario is equivalent to the fraction of detected genes. High confidence drivers, which are frequently mutated, are better detected using tumor genomics data. In contrast, non-tumor genomics data increases detection of candidate driver genes that are infrequently mutated.
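The precision-at-N evaluation described in this caption, including the removal of earlier gene sets from later panels, can be sketched as follows. This is a minimal illustration under assumptions: `precision_at_n` and the toy ranking are hypothetical, and the caption does not specify how ties in the ranking are handled.

```python
def precision_at_n(ranked_genes, target_set, excluded=frozenset(), max_n=200):
    """Precision at each N = 1..max_n, after dropping excluded genes from the ranking."""
    ranking = [g for g in ranked_genes if g not in excluded][:max_n]
    hits, curve = 0, []
    for n, gene in enumerate(ranking, start=1):
        hits += gene in target_set  # True counts as 1
        curve.append(hits / n)
    return curve


# Toy ranking with placeholder gene names
ranked = ["A", "B", "C", "D", "E", "F"]
high_conf = {"A", "C"}
medium_conf = {"B", "E"}

# Panel (A): high confidence genes scored against the full ranking
curve_high = precision_at_n(ranked, high_conf, max_n=4)
# Panel (B): high confidence genes are removed before scoring medium confidence genes
curve_medium = precision_at_n(ranked, medium_conf, excluded=high_conf, max_n=4)
print(curve_high, curve_medium)
```

Removing the previous panel's genes before scoring keeps frequently mutated drivers from occupying the top ranks when the question is how well the rarer candidates are recovered.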
Figure 3. Large driver gene proteins are almost exclusively encoded by TSGs and primarily regulate chromosome organization.
(a) Comparison of protein size distributions encoded by high confidence (HC) TSGs, high confidence OGs, medium confidence drivers, low confidence genes, other evidence genes that are present in at least one of 15 studies used in this work (Online Methods), and background genes (BGs). A high fraction of TSGs encode very large proteins. CDS–coding sequence. (b) Comparison of high confidence TSG and non-TSG protein size with respect to having a documented role in chromosome organization processes (Chr) based on gene ontology. Large TSG proteins are enriched for participation in chromosome organization processes. All P values are derived using the Welch t-test.
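For readers unfamiliar with the Welch t-test named in this caption, the sketch below computes its test statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library. The function name `welch_t` and the sample values are hypothetical, and the exact inputs used for the figure (for example, raw versus log-transformed lengths) are not stated here.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors from the sample (n - 1) variances
    se2_a = statistics.variance(sample_a) / na
    se2_b = statistics.variance(sample_b) / nb
    diff = statistics.fmean(sample_a) - statistics.fmean(sample_b)
    t = diff / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Toy samples (not real protein sizes)
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t_stat, dof)
```

Unlike Student's t-test, this form does not assume equal variances in the two groups, which suits comparisons between a small driver set and a much larger background. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with a P value.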
Large driver proteins are encoded almost exclusively by TSGs.
| Symbol | Type | CDS length (aa) | Percentile in genome |
|---|---|---|---|
| KMT2D | TSG | 5537 | 99.9% |
| KMT2C | TSG | 4911 | 99.8% |
| FAT1 | TSG | 4588 | 99.7% |
| CSMD1 | TSG | 3565 | 99.5% |
| BRCA2 | TSG | 3418 | 99.4% |
| ATM | TSG | 3056 | 99.2% |
| APC | TSG | 2843 | 99.1% |
| NF1 | TSG | 2839 | 99.1% |
| SETD2 | TSG | 2564 | 98.8% |
| NOTCH1 | TSG | 2555 | 98.8% |
| CIC | TSG | 2514 | 98.7% |
| ATRX | TSG | 2492 | 98.7% |
| NOTCH2 | TSG | 2471 | 98.6% |
| CREBBP | TSG | 2442 | 98.6% |
| NCOR1 | TSG | 2440 | 98.6% |
| EP300 | TSG | 2414 | 98.5% |
| ARID1A | TSG | 2285 | 98.3% |
| ARID1B | TSG | 2236 | 98.2% |
| TET2 | TSG | 2023 | 97.7% |
| BRCA1 | TSG | 1884 | 97.2% |
| ARID2 | TSG | 1835 | 97.0% |
| TSC2 | TSG | 1807 | 96.9% |
| BCOR | TSG | 1755 | 96.6% |
| PBRM1 | TSG | 1689 | 96.2% |
| SMARCA4 | TSG | 1681 | 96.2% |
| KDM5C | TSG | 1560 | 95.6% |
List of the high confidence driver genes encoding the 30 largest proteins. CDS – coding sequence length.
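The "Percentile in genome" column can be read as the rank of a gene's CDS length among all genes. A minimal sketch follows, assuming the percentile is the share of genes whose longest-isoform CDS length is less than or equal to the gene's own; the function name and the toy length list are hypothetical, and the study may use a slightly different convention.

```python
from bisect import bisect_right


def percentile_in_genome(cds_length, all_lengths_sorted):
    """Percent of genes whose CDS length is <= cds_length (input sorted ascending)."""
    rank = bisect_right(all_lengths_sorted, cds_length)
    return 100.0 * rank / len(all_lengths_sorted)


# Toy "genome" of ten CDS lengths (amino acids), sorted ascending
toy_lengths = list(range(100, 1100, 100))  # 100, 200, ..., 1000
pct = percentile_in_genome(950, toy_lengths)
print(pct)
```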
Figure 4. CARNAF and other methods predict an enrichment of uncharacterized TSGs among very large chromatin regulators.
(A) The abundance of high confidence TSGs and CARNAF predicted TSGs encoding very large (top 5% in genome) and small proteins (the remaining 95%) with respect to participation in chromosome organization processes. The top 84 CARNAF TSG predictions, using all features and excluding the high confidence driver gene set, were selected to match the abundance of TSGs in the high confidence set (n = 84). CARNAF predictions that overlap with the medium and low confidence driver gene sets are shown. Chr – chromosome organization biological process, according to gene ontology. (B) Prominent cellular processes for the 92 large, chromosome organization proteins. The fraction of high confidence and CARNAF predicted TSGs in each category is displayed. Categories are not mutually exclusive. (C) Specific cellular processes of the 66 genes annotated as involved in chromatin modification. Categories are mutually exclusive. Abbreviations: Chrsm – chromosome; Chrmt – chromatin; Chrmt mod – chromatin modification; Chrsm seg – chromosome segregation; DNA rep – DNA repair; Hist mod – histone modification; SWI/SNF cmplx – SWI/SNF complex; Chrmt remod* – other chromatin modification not annotated as histone modification or SWI/SNF complex.
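The grouping in panel (A), very large versus small proteins crossed with chromosome organization annotation, amounts to a two-by-two count. The sketch below shows one way to build it, assuming "very large" means a CDS length at or above the cutoff that bounds the top 5% of a genome-wide length list. All names, gene labels, and values are hypothetical toy data.

```python
from collections import Counter


def top_percent_cutoff(lengths, top_percent=5):
    """Smallest CDS length that still falls within the top `top_percent` of genes."""
    ordered = sorted(lengths, reverse=True)
    k = max(1, len(ordered) * top_percent // 100)
    return ordered[k - 1]


def size_by_chr_table(genes, size_cutoff):
    """Count genes by (size class, chromosome organization annotation).

    genes maps a gene name to (cds_length, is_chromosome_organization).
    """
    counts = Counter()
    for cds_length, is_chr in genes.values():
        size = "large" if cds_length >= size_cutoff else "small"
        counts[(size, "Chr" if is_chr else "non-Chr")] += 1
    return counts


# Toy genome-wide lengths and a toy gene set (placeholder names)
genome_lengths = list(range(100, 2100, 100))  # 20 genes: 100 ... 2000
cutoff = top_percent_cutoff(genome_lengths)
toy_genes = {
    "G1": (2100, True),
    "G2": (1500, True),
    "G3": (300, False),
    "G4": (2500, False),
}
table = size_by_chr_table(toy_genes, cutoff)
print(cutoff, dict(table))
```

A table of this shape can then be compared between the high confidence TSG set and the predicted TSG set to see whether both concentrate in the large, chromosome organization cell.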