Yang Liu, Feng Tian, Zhenjun Hu, Charles DeLisi.
Abstract
The number of mutated genes in cancer cells is far larger than the number of mutations that drive cancer. The difficulty this creates for identifying relevant alterations has stimulated the development of various computational approaches to distinguishing drivers from bystanders. We develop an ensemble classifier (EC) machine learning method, which integrates 10 publicly available classifiers, and apply it to breast and ovarian cancer. In particular we find the following: (1) Using both standard and non-standard metrics, EC almost always outperforms single method classifiers, often by wide margins. (2) Of the 50 highest ranked genes for breast (ovarian) cancer, 34 (30) are associated with other cancers in either the OMIM, CGC or NCG database (P < 10^-22). (3) Another 10, for both breast and ovarian cancer, have been identified by GWAS studies. (4) Several of the remaining genes are biologically plausible, including a protein kinase that regulates the Fra-1 transcription factor, which is overexpressed in ER negative breast cancer cells, and Fyn, which is overexpressed in pancreatic and prostate cancer, among others. Biological implications are briefly discussed. Source code and detailed results are available at http://www.visantnet.org/misi/driver_integration.zip.
Year: 2015 PMID: 25961669 PMCID: PMC4650817 DOI: 10.1038/srep10204
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Ensemble classifier (EC) flow chart.
TCGA mutation data are used as input to 8 of the 10 publicly available classifiers; two of the module methods take OMIM data as input. EC is applied to the training set (Methods) as part of a ten-fold cross-validation procedure, to obtain driver/passenger outputs. The vectors are separated in a ten-dimensional space by the Decorate ensemble classifier. After training and cross-validation, all known human genes, except those used for training, are scored.
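The cross-validation step in this caption can be illustrated with a minimal sketch. It assumes each gene is represented by a ten-dimensional vector (one score per base method) with a driver/passenger label. The nearest-centroid rule is only a stand-in for the Decorate ensemble, which is not reimplemented here, and every function name and the toy data are hypothetical.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle indices 0..n_items-1 and deal them into k disjoint folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cross_validate(vectors, labels, k=10, seed=0):
    """Return held-out predictions (1 = driver, 0 = passenger) for every item.

    Stand-in classifier: nearest class centroid. This is NOT Decorate; it only
    marks where a real ensemble learner would be trained and applied.
    Assumes every training split contains both classes.
    """
    preds = [None] * len(vectors)
    for fold in k_fold_indices(len(vectors), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(vectors)) if i not in held_out]
        pos = centroid([vectors[i] for i in train if labels[i] == 1])
        neg = centroid([vectors[i] for i in train if labels[i] == 0])
        for i in fold:
            closer_to_pos = sq_dist(vectors[i], pos) < sq_dist(vectors[i], neg)
            preds[i] = 1 if closer_to_pos else 0
    return preds

# Toy data (hypothetical): 20 "drivers" score high on all ten methods,
# 20 "passengers" score low.
rng = random.Random(1)
vectors = [[rng.uniform(0.8, 1.0) for _ in range(10)] for _ in range(20)] + \
          [[rng.uniform(0.0, 0.2) for _ in range(10)] for _ in range(20)]
labels = [1] * 20 + [0] * 20
predictions = cross_validate(vectors, labels)
```

Each gene is predicted only by a model that never saw it during training, which is the property the ten-fold procedure is meant to guarantee.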
Summary of 10 driver gene/module identification methods.
| Method | How it works | Use as a feature |
| --- | --- | --- |
| OncodriveFM | Computes a metric of functional impact using three well-known methods (SIFT, PolyPhen2 and MutationAssessor) and assesses how the functional impact of variants found in a gene across several tumor samples deviates from a null distribution. | Uses |
| OncodriveCLUST | Identifies genes whose mutations tend to cluster in particular locations on the protein. | Uses |
| MutSig | Estimates the background mutation rate for each gene–patient–category combination based on the observed silent mutations in the gene and non-coding mutations in the surrounding regions. | Uses |
| ActiveDriver | The method is based on a logistic regression strategy and identifies signalling sites in proteins that involve unexpectedly many (or few) sequence variants considering the general variability of the protein, disordered and ordered regions, density of signalling-related residues (such as phosphosites), and proximity of variants/mutations to signalling residues. | Uses |
| Simon | Accounts for the functional impact of mutations on proteins, variation in background mutation rate among tumors and the redundancy of the genetic code. | Uses |
| FLN | Counts the connections of a gene with known cancer-related genes in the Functional Linkage Network (FLN) and provides the Top 100 driver genes with the most connections. | Uses average weights (weights are obtained from FLN) between target gene and all Top 100 genes. |
| NetBox | Identifies a driver module by maximizing modularity based on the Human Interaction Network (HIN). | Uses total number of links between target gene and all genes interior to the module based on HIN. Target genes exterior to the module are assigned a weight of 1; interior genes are assigned a weight of 2. |
| MEMo | Identifies network modules whose members are recurrently altered across a set of tumor samples, are known to or are likely to participate in the same biological process, and are mutually exclusive. | Uses total number of links between target gene and all genes interior to the module based on HIN. Exterior and interior genes are weighted 1 and 2, respectively. |
| Dendrix | Finds sets of genes, domains, or nucleotides whose mutations exhibit both high coverage and high exclusivity in the analysed samples. | Uses total number of links between target gene and all genes interior to the module based on HIN. Same weight as above. |
| FLNP (Huang | Identifies a driver module by maximizing modularity based on the Functional Linkage Network (FLN). | Uses average weights (weights are obtained from FLN) between target gene and all genes interior to the module. |
This table describes the 10 methods integrated by EC: the name of each method, how it works, and how we use it as a feature.
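The two network-based feature definitions in the table can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: missing FLN links are treated as weight 0, and the interior/exterior weight (2 or 1) is multiplied by the link count, which is only one reading of the table since the source does not say how the two are combined. All function names and toy genes are hypothetical.

```python
def fln_average_weight(target, reference_genes, fln):
    """Mean FLN link weight between `target` and each reference gene.

    `fln` maps frozenset({gene_a, gene_b}) -> weight. Missing links count
    as 0 (an assumption; the source does not specify this).
    """
    others = [g for g in reference_genes if g != target]
    if not others:
        return 0.0
    total = sum(fln.get(frozenset((target, g)), 0.0) for g in others)
    return total / len(others)

def module_link_feature(target, module, hin_edges):
    """HIN links from `target` to module members, scaled by 1 or 2.

    Exterior targets get weight 1 and interior targets weight 2, as in the
    table. Multiplying weight by link count is an assumed combination rule.
    """
    links = sum(
        1 for g in module
        if g != target and frozenset((target, g)) in hin_edges
    )
    weight = 2 if target in module else 1
    return weight * links

# Toy networks with placeholder gene names (hypothetical).
fln = {frozenset(("A", "B")): 0.5, frozenset(("A", "C")): 0.25}
hin_edges = {frozenset(("A", "B")), frozenset(("B", "C")), frozenset(("C", "X"))}
module = {"A", "B", "C"}

avg_a = fln_average_weight("A", ["B", "C", "D"], fln)   # (0.5 + 0.25 + 0) / 3
interior_b = module_link_feature("B", module, hin_edges)  # 2 links, weight 2
exterior_x = module_link_feature("X", module, hin_edges)  # 1 link, weight 1
```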
Figure 2. Ensemble predictions for breast cancer.
(a) Thirty-four of the top 50 genes selected by EC (EC50) are either in CGC, OMIM, or NCG. The Venn diagram displays their distribution among the three databases. Of the remaining 16 genes, 10 have been discovered in GWAS studies (indicated by an asterisk). (b) EC50 genes identified by the 10 independent classifiers. (c) EC50 genes and enriched signalling pathways mapped onto the FLN as explained in the text. Only links with weights greater than 0.1 are retained.
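The bookkeeping behind panels (a) and (c) can be sketched with plain set operations: grouping top-ranked genes by the databases that list them, and keeping only links whose weight exceeds 0.1. The gene names, database contents, and link weights below are placeholders, not data from the paper, and the function names are hypothetical.

```python
def venn_regions(genes, databases):
    """Group genes by the exact set of databases that list them.

    Returns a dict mapping frozenset(database names) -> list of genes.
    Genes found in no database fall under the empty frozenset.
    """
    regions = {}
    for gene in genes:
        hits = frozenset(
            name for name, members in databases.items() if gene in members
        )
        regions.setdefault(hits, []).append(gene)
    return regions

def strong_links(links, threshold=0.1):
    """Keep only links whose weight is strictly greater than `threshold`."""
    return {pair: w for pair, w in links.items() if w > threshold}

# Placeholder inputs (hypothetical).
top_genes = ["G1", "G2", "G3", "G4"]
databases = {
    "CGC": {"G1", "G2"},
    "OMIM": {"G2"},
    "NCG": {"G1", "G2", "G3"},
}
regions = venn_regions(top_genes, databases)
unlisted = regions.get(frozenset(), [])  # genes in none of the databases

links = {("G1", "G2"): 0.35, ("G1", "G3"): 0.1, ("G2", "G4"): 0.02}
kept = strong_links(links)
```

A link with weight exactly 0.1 is dropped here because the caption says "greater than 0.1".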
KEGG pathways enriched in breast cancer using DAVID (FDR < 0.01).
| ErbB signaling | 12 | 3.0E-11 | 7.4E-10 | |
| Neurotrophin signaling | 12 | 1.5E-9 | 2.2E-8 | |
| T cell receptor signaling | 11 | 6.1E-9 | 6.6E-8 | |
| Jak-STAT signaling | 10 | 2.2E-6 | 1.6E-5 | |
| Cell cycle | 9 | 4.2E-6 | 2.9E-5 | |
| Toll-like receptor signaling | 8 | 1.1E-5 | 6.2E-5 | |
| Adipocytokine signaling | 7 | 1.1E-5 | 6.0E-5 | |
| B cell receptor signaling | 7 | 2.2E-5 | 1.0E-4 | |
| TGF-beta signaling | 7 | 5.1E-5 | 2.1E-4 | |
| Insulin signaling | 8 | 7.1E-5 | 2.8E-4 | |
| Chemokine signaling | 9 | 8.1E-5 | 3.0E-4 | |
| Fc epsilon RI signaling | 6 | 3.2E-4 | 1.1E-3 | |
| Natural killer cell mediated cytotoxicity | 7 | 5.3E-4 | 1.7E-3 | |
| Focal adhesion | 8 | 8.3E-4 | 2.4E-3 | |
| GnRH signaling | 6 | 9.3E-4 | 2.6E-3 | |
| RIG-I-like receptor signaling | 5 | 2.2E-3 | 5.8E-3 | |
| Adherens junction | 5 | 2.9E-3 | 7.6E-3 |
This table shows the KEGG pathways enriched in breast cancer (FDR < 0.01), in ascending order of FDR. The second and third columns are the number and names of the Top 50 genes in a given enriched pathway. Boldface indicates that the gene is newly predicted by EC, i.e. it is not identified as breast cancer related in any of the databases.
Figure 3. Comparison of performance metrics for the ensemble classifier and single feature classifiers for breast cancer.
(a) Sensitivity and PPV for each of the methods. (b) The number of genes in the Top 50 that are identified by at least 5 methods. For FLN and ActiveDriver, no Top 50 gene is selected by more than 5 methods. (c) The number of genes in the Top 50 that are annotated in two breast cancer studies. (d) Overall ranking of each method based on the sum of rankings in (b) and (c).
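The metrics named in this caption can be sketched as below. Sensitivity and PPV follow their standard definitions; the rank-sum ordering for panel (d) assumes that higher counts rank better and that tied methods share the best rank, neither of which the caption specifies. Method names and counts are placeholders, not results from the paper.

```python
def sensitivity_ppv(predicted, known):
    """Sensitivity = TP / (TP + FN); PPV = TP / (TP + FP)."""
    predicted, known = set(predicted), set(known)
    tp = len(predicted & known)
    sens = tp / len(known) if known else 0.0
    ppv = tp / len(predicted) if predicted else 0.0
    return sens, ppv

def ranks(scores):
    """Rank methods by descending score; tied methods share the best rank."""
    ordered = sorted(scores.values(), reverse=True)
    return {method: ordered.index(s) + 1 for method, s in scores.items()}

def overall_ranking(metric_b, metric_c):
    """Order methods by the sum of their ranks on two metrics (lower is better).

    Ties in the rank sum are broken alphabetically (an arbitrary choice).
    """
    rb, rc = ranks(metric_b), ranks(metric_c)
    totals = {m: rb[m] + rc[m] for m in metric_b}
    return sorted(totals, key=lambda m: (totals[m], m))

# Placeholder inputs (hypothetical).
sens, ppv = sensitivity_ppv({"g1", "g2", "g3", "g4"}, {"g1", "g2", "g5"})
metric_b = {"M1": 30, "M2": 12, "M3": 12}  # e.g. genes found by >= 5 methods
metric_c = {"M1": 9, "M2": 4, "M3": 7}     # e.g. genes annotated in studies
order = overall_ranking(metric_b, metric_c)
```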
Figure 4. Ensemble predictions for ovarian cancer.
(a) Thirty of the top 50 genes selected by EC (EC50) are either in CGC, OMIM, or NCG. The Venn diagram displays their distribution among the three databases. Of the remaining 20 genes, 10 have been discovered in GWAS studies (indicated by an asterisk). (b) EC50 genes identified by the 10 independent classifiers. (c) Mapping of EC50 genes and enriched signalling pathways onto an FLN as explained in the text. Only links with weights greater than 0.1 are retained.
KEGG pathways enriched in ovarian cancer using DAVID (FDR < 0.01).
| ErbB signaling pathway | 11 | 9.2E-10 | 2.1E-8 | |
| Neurotrophin signaling | 12 | 2.4E-9 | 4.1E-8 | |
| Chemokine signaling | 13 | 5.9E-9 | 6.7E-8 | |
| Focal adhesion | 13 | 2.4E-8 | 2.3E-7 | |
| Natural killer cell mediated cytotoxicity | 10 | 5.5E-8 | 3.7E-7 | |
| Cell cycle | 10 | 4.5E-7 | 2.8E-6 | |
| Adherens junction | 8 | 1.7E-6 | 8.6E-6 | |
| T cell receptor signaling | 9 | 2.2E-6 | 1.1E-5 | |
| GnRH signaling | 8 | 1.1E-5 | 4.1E-5 | |
| Jak-STAT signaling | 9 | 1.5E-5 | 5.3E-5 | |
| B cell receptor signaling | 7 | 1.9E-5 | 6.1E-5 | |
| Fc epsilon RI signaling | 7 | 2.4E-5 | 7.2E-5 | |
| MAPK signaling | 11 | 4.8E-5 | 1.4E-4 | |
| Gap junction | 7 | 9.2E-5 | 2.5E-4 | |
| TGF-beta signaling | 6 | 4.5E-4 | 1.1E-3 | |
| Insulin signaling | 7 | 5.5E-4 | 1.3E-3 | |
| Axon guidance | 7 | 6.0E-4 | 1.4E-3 | |
| Dorso-ventral axis formation | 4 | 8.6E-4 | 1.9E-3 | |
| VEGF signaling | 5 | 2.6E-3 | 5.7E-3 |
This table shows the KEGG pathways enriched in ovarian cancer (FDR < 0.01), in ascending order of FDR. The second and third columns are the number and names of the Top 50 genes in a given enriched pathway. Boldface indicates that the gene is newly predicted by EC.
Figure 5. Comparison of performance metrics for the ensemble classifier and single feature classifiers for ovarian cancer.
(a) Sensitivity and PPV for each of the methods. (b) The number of genes in the Top 50 that are identified by at least 5 methods. (c) The number of genes in the Top 50 that are annotated in two ovarian cancer studies. For MEMo, no Top 50 gene overlaps with these two studies. (d) Overall ranking of each method based on the sum of rankings in (b) and (c).