Literature DB >> 31465854

C3: Consensus Cancer Driver Gene Caller.

Chen-Yu Zhu1, Chi Zhou1, Yun-Qin Chen2, Ai-Zong Shen3, Zong-Ming Guo2, Zhao-Yi Yang3, Xiang-Yun Ye4, Shen Qu5, Jia Wei6, Qi Liu7.   

Abstract

Next-generation sequencing has allowed identification of millions of somatic mutations in human cancer cells. A key challenge in interpreting cancer genomes is to distinguish drivers of cancer development among available genetic mutations. To address this issue, we present the first web-based application, consensus cancer driver gene caller (C3), to identify the consensus driver genes using six different complementary strategies, i.e., frequency-based, machine learning-based, functional bias-based, clustering-based, statistics model-based, and network-based strategies. This application allows users to specify customized operations when calling driver genes, and provides solid statistical evaluations and interpretable visualizations on the integration results. C3 is implemented in Python and is freely available for public use at http://drivergene.rwebox.com/c3.
Copyright © 2019 The Authors. Published by Elsevier B.V. All rights reserved.

Entities:  

Keywords:  Cancer driver genes; Consensus; Data integration; Somatic mutation; Web server

Mesh:

Year:  2019        PMID: 31465854      PMCID: PMC6818389          DOI: 10.1016/j.gpb.2018.10.004

Source DB:  PubMed          Journal:  Genomics Proteomics Bioinformatics        ISSN: 1672-0229            Impact factor:   7.691


Introduction

The continued advancement of next-generation sequencing (NGS) technology has allowed for the sequencing of large sets of cancer samples for somatic mutation discovery [1], [2]. However, one of the main challenges in interpreting the cancer genomes is to efficiently distinguish the driver mutations from the passenger mutations. Driver mutations are causally implicated in oncogenes and positively selected along the lineage of cancer development under the specific microenvironment conditions in vivo, whereas passenger mutations do not confer clonal growth advantages and are thus irrelevant to tumor development [3]. To address this issue, various methods have been proposed to identify driver genes based on distinctive assumptions and strategies [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]. Intuitively, all these driver gene identification strategies exhibit the biased signals of positive selection exploited by corresponding mechanisms at varied degrees. Several studies have been reported on benchmarking these methods with consensus cancer driver genes derived from individual model [8], [17], [18]. Collin et al. [8] proposed an evaluation framework to benchmark several existing models based on several measurements including precision, consistency, and mean log fold change (MLFC). Matan et al. [17] also benchmarked the available methods by using measurements such as precision and recall. Eduard et al. [18] classified four subtypes of driver gene calling methods at a subgene resolution. Denis et al. [19] provided the most comprehensive benchmarking of 21 driver gene prediction methods and proposed a Borda-based integration approach ConsensusDriver. Despite these efforts, the available tools are often challenging for biologists or clinicians to carry out the related analysis directly, given the technical hurdles ranging from setting up the software to tuning parameters. A web-based user-friendly consensus driver gene prediction with intuitive visualization of the consensus mutation calling is needed. Here, we present the first web server-based consensus cancer driver gene caller (C) platform to derive the consensus mutation calling results [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], using six state-of-the-arts and complementary prediction strategies. These include frequency-based (MutSigCV) [6], machine learning-based (20/20 + ) [8], functional bias-based (OncodriveFM) [10], clustering-based (OncodriveCLUST) [11], statistics model-based (DrGaP) [5], and network-based (MUFFINN) [7]. Various calling evaluation and visualization strategies are incorporated in C as follows. (1) C provides a solid evaluation of the consensus mutation calling results with Top-N-Precision and Top-N-nDCG [20]. (2) C provides an efficient integration strategy to derive the consensus results by Robust Rank Aggregation (RRA) [21] and statistical model-based intersection visualization [22]. (3) Circos plots are presented in C to visualize the consensus mutation calling results [22], [23].

Method

General workflow of C

C accepts mutation annotation format (MAF) [24] file as input. The MAF file is annotated from variant calling format (VCF) [25] file, which can be acquired by using variant calling tool like Mutect on the NGS data. A schematic representation of the C workflow is shown in Figure 1A. The selected programs, including 20/20+, MutSigCV, OncodriveFM, OncodriveCLUST, DrGaP, and MUFFINN (Figure 1A and B; File S1 Part 1), run in the Ubuntu sever 16.04 system. Then all preprocessed input mutation data are processed in C to obtain candidate driver genes list for each strategy separately. We use SuperExactTest model to evaluate the statistical significance of the intersection of individual calling results using all the protein-coding gene as a whole background gene set. In addition, based on each discrepant driver gene list, a rank ensemble method, RobustRankAggreg, is used to obtain a consensus driver gene list. Four databases including the Cancer Gene Census (CGC) [26], Integrative Onco Genomics (IntOGen) [10], Network of Cancer Genes (NCG) [27], and Online Mendelian Inheritance in Man (OMIM) [28] are used to annotate the predicted driver genes. Two evaluation measurements, i.e., the Top-N-Precision and Top-N-nDCG, are applied to evaluate the calling performance. Finally, the KEGG [29] pathway and Gene Ontology analyses are also performed on the consensus driver genes for comprehensive annotations.
Figure 1

Guideline of

A. A schematic representation of the C workflow. A cancer sample input into C workflow is analyzed by six cancer driver gene calling strategies (shown on the left), resulting in a consensus driver gene set as the output. Then the driver gene set is evaluated in terms of precision and nDCG before it is annotated based on the reference databases (shown on the right). Finally, the results are visualized by SuperExactTest and Circos. B. Overview of six categories of distinct cancer driver gene calling strategies. C.SuperExactTest plot of the consensus calling results identified by C. The inner rings (green and white blocks) represent six driver gene sets generated by the six strategies in C. Blocks in white and green indicate the absence and presence of the driver gene sets, and each group represents an intersection of prediction results using 2–6 strategies (shown as the green blocks). The size of the intersections is proportionally shown by the heights of the bars on the outer ring. The number of each intersection is shown on the top of the respective bars and the color intensity of the bars represents the significance of the intersections (−Log10P value). D.Circos plot of potential driver genes identified by C. From the outer to inner circles, the first circle indicates the whole chromosomes across the genome and the second circle with gene symbols indicates the top 100 consensus driver genes identified by C. The six inner colorful circles represent the top 100 results predicted by the individual strategies, respectively. Names of the strategies are provided in the center of the circle. The size of the gene symbol is positively proportional to the rank order of the predicted results, with a larger size indicating the higher rank. BRCA dataset in TCGA was used as an example for analyses shown in panels C and D.

Guideline of A. A schematic representation of the C workflow. A cancer sample input into C workflow is analyzed by six cancer driver gene calling strategies (shown on the left), resulting in a consensus driver gene set as the output. Then the driver gene set is evaluated in terms of precision and nDCG before it is annotated based on the reference databases (shown on the right). Finally, the results are visualized by SuperExactTest and Circos. B. Overview of six categories of distinct cancer driver gene calling strategies. C.SuperExactTest plot of the consensus calling results identified by C. The inner rings (green and white blocks) represent six driver gene sets generated by the six strategies in C. Blocks in white and green indicate the absence and presence of the driver gene sets, and each group represents an intersection of prediction results using 2–6 strategies (shown as the green blocks). The size of the intersections is proportionally shown by the heights of the bars on the outer ring. The number of each intersection is shown on the top of the respective bars and the color intensity of the bars represents the significance of the intersections (−Log10P value). D.Circos plot of potential driver genes identified by C. From the outer to inner circles, the first circle indicates the whole chromosomes across the genome and the second circle with gene symbols indicates the top 100 consensus driver genes identified by C. The six inner colorful circles represent the top 100 results predicted by the individual strategies, respectively. Names of the strategies are provided in the center of the circle. The size of the gene symbol is positively proportional to the rank order of the predicted results, with a larger size indicating the higher rank. BRCA dataset in TCGA was used as an example for analyses shown in panels C and D.

Performance measurement

Previously, Collin et al. proposed a novel measurement of mean log fold change between the observed and desired theoretical P values [8]. Matan et al. [17] and Eduard et al. [18] applied measurements of precision and recall. Denis et al. also applied precision, recall, and F1 score [19] (File S1 Part 1). In our study, we applied the Top-N-Precision (using CGC data as a reference driver gene set [26]) and Top-N-nDCG (using IntOGen as a reference ranking driver gene set [30]) to facilitate the quantitative comparison and evaluation, focusing on the top n performance of the ranking results.

Precision

We evaluated the precision performance among the results acquired by the previous strategies based on the top 100 genes with respect to CGC cancer database through Equation (1). The average precision can measure a general predicting ability of individual methods among the pan-cancer cohort samples. We calculate the precision scores for each of 27 cancer types, and the SUM (precision) represents the sum of respective precision score of 27 cancer types (Equation (2)).

nDCG

Meanwhile, normalized discounted cumulative gain (nDCG) was applied to measure the ranking quality of the results using the IntOGen as a reference cancer driver gene set. Here, n represents the number of top predicted genes; i represents the rank of predicted genes; CG represents cumulative weight of top n predicted genes; DCG represents CG multiplied by a discount factor (i > 1); IDCG represents a DCG under the ideal condition, that is, the rank of predicted genes is exactly the same as that in the reference dataset. Top-N-nDCG represents normalized DCG and measures the ranking performance of predicted genes. To obtain the Top-N-nDCG, firstly, we download IntOGen cancer driver gene set (URL: https://www.intogen.org/) [31] and assign a weight for each reference driver gene in IntOGen based on their proportion of driver mutation counts [30] (Version 2014.12) calculated according to Equation (3). Specifically, the total number of cancer driver genes in IntOGen is 459. The weights of the predicted driver genes overlapping with the benchmark IntOGen dataset are calculated according to Equation (4). The weights of the predicted genes that are not available at the benchmark IntOGen dataset are set to 0. The Top-N- nDCG can be calculated through Equations (4), (5), (6), (7) [20].

Rank aggregation

The RRA algorithm [21] is applied to obtain a consensus driver gene list, which aggregates the ranking driver genes predicted by individual tools. Comparing with the original RankAggreg algorithm [32], the RRA algorithm has three advantages: (1) it deals with incomplete rankings, which is common in practice, (2) it performs robustly with tolerance to the data noise, and (3) it is fast to be integrated for interactive data analysis.

Intersection visualization and evaluation with SuperExactTest and Circos

We applied SuperExactTest [22] and Circos [23] to organize our visualization results. The former is a scalable visualization tool to illustrate high-order relationships among multi sets beyond Venn diagrams [33]. It evaluates the overlap of each of tools and presents a circular plot illustrating all possible intersections with statistical methods. The latter visualizes the predicted driver gene sets intuitively (Figure 1C and D; File S1 Part 5).

Implementation

As Figure 2 shows, C web application accepts MAF [24] file or a modified micro-MAF file (Table S1) as the input. After users select driver gene calling strategies and parameters, C runs as the back-end Ubuntu 16.04 system (with python-2.7, R-3.3.4 and MATLAB Runtime 2014). When the job is successfully finished, users will be notified through email including a “Request ID”. At the “Recent Request” page, users can preview and obtain candidate driver gene list by querying the “Request ID”. The output is directly viewable on the website and is available to downloaded for further analyses. The data submitted by every user are kept private. If there are any questions, users can visit the “Help” page for a detailed guidance.
Figure 2

General framework of

C web application provides a user-friendly and simple five-step workflow. These include (1) selecting tools used for analysis, (2) choosing a data file to upload from user’s own computer (refer to our file format to verify the integrity of the input data), (3) selecting parameters for the selected tools (refer to the help documentation), (4) entering a name of the task (make sure to provide a valid e-mail address), and (5) inquiring and downloading the results with the request ID at “Recent Request” page. The request ID is sent to the user via e-mail upon the completion of the analysis.

General framework of C web application provides a user-friendly and simple five-step workflow. These include (1) selecting tools used for analysis, (2) choosing a data file to upload from user’s own computer (refer to our file format to verify the integrity of the input data), (3) selecting parameters for the selected tools (refer to the help documentation), (4) entering a name of the task (make sure to provide a valid e-mail address), and (5) inquiring and downloading the results with the request ID at “Recent Request” page. The request ID is sent to the user via e-mail upon the completion of the analysis.

Detailed information of the test datasets

We test the stability of C web application by selecting tumor datasets collected from The Cancer Genome Atlas (TCGA) [2] databases. Initially, the whole dataset includes 34 cancer types with 7724 samples and 729,235 mutations, curated from the published whole-exome sequencing or whole-genome sequencing studies which are also used by TUSON [9] and Collin study [8]. Since some tools (such as MutSigCV and DrGaP) need additional cohort mutation information, we removed 7 cancer types with 290 samples and 5164 mutations through data preprocessing. Finally, we curated 27 cancer types with 7434 samples and 724,071 mutations for the final analysis, which constitute the updated comprehensive test datasets finally for driver gene calling (Table 1 and File S1 Part 2).
Table 1

Number of tested tumor samples and mutations

Cancer type
No. of samplesTotal No. of mutations per cancer typeAverage No. of mutations per sample
AbbreviationFull name
BLCAUrothelial bladder cancer14233,772237.83
BRCABreast cancer88951,76658.23
CESCCervical cancer386115160.92
CLLchronic lymphocytic leukemia224349115.58
COADColon adenocarcinoma24432,192131.93
DLBCLDiffuse large B-cell lymphoma575785101.49
ESCAEsophageal cancer16019,141119.63
GBMGlioblastoma multiforme36521,92360.06
HNSCHead and neck squamous cell carcinoma40760,074147.60
KIRCKidney renal clear cell carcinoma48428,48358.85
KIRPKidney renal papillary cell carcinoma112754167.33
LAMLAcute Myeloid Leukemia197418021.22
LIHCLiver hepatocellular carcinoma151764850.65
LGGLower Grade Glioma227996543.90
LUADLung adenocarcinoma394106,613270.59
LUSCLung squamous cell carcinoma17553,528305.87
MBMedulloblastoma332361510.89
MESOMesothelioma28997,806338.43
MMMultiple Myeloma20510,78152.59
NBLNeuroblastoma352645318.33
OVOvarian serous cystadenocarcinoma48028,13658.62
PAADPancreatic ductal adenocarcinoma234793933.93
PRADProstate adenocarcinoma42016,78439.96
STADStomach adenocarcinoma24442,456174.00
SCLCSmall cell lung cancer318378270.26
THCAPapillary thyroid carcinoma326642419.71
UCECUterine corpus endometrial carcinoma25539,234153.86
Number of tested tumor samples and mutations

Performance of C

We benchmarked the performance of the consensus results comparing with each alternative. As shown in Figure 3, the integration results of C application outperformed other methods evaluated with Top-N-Precision and Top-N-nDCG, revealing its superiority in driver genes prediction (File S1 Part 4).
Figure 3

Comparison of cancer driver gene calling performance using

The performance for Consensus and the six individual strategies on 27 cancer datasets is presented in radar plots in terms of the Top-N-precision (A) (calculated according to Equation (1)) and Top-N-nDCG (B) (calculated according to Equations (2), (3), (4), (5), (6)). Cancer types are labeled on the outmost circle. Values of precision in panel A and nDCG in panel B are labelled on each circle. The range of these values is between 0.1 and 1. For each cancer type, a higher value indicates a better performance and for each cancer driver gene calling strategy, the larger area means the better performance. nDCG, normalized discounted cumulative gain.

Comparison of cancer driver gene calling performance using The performance for Consensus and the six individual strategies on 27 cancer datasets is presented in radar plots in terms of the Top-N-precision (A) (calculated according to Equation (1)) and Top-N-nDCG (B) (calculated according to Equations (2), (3), (4), (5), (6)). Cancer types are labeled on the outmost circle. Values of precision in panel A and nDCG in panel B are labelled on each circle. The range of these values is between 0.1 and 1. For each cancer type, a higher value indicates a better performance and for each cancer driver gene calling strategy, the larger area means the better performance. nDCG, normalized discounted cumulative gain. C also helps to identify reliable potential driver genes by SuperExactTest intersection between different driver gene calling strategies with reference to CGC and literature review. Detailed results are shown in Table S2 and Table S3. In summary, although there exists a high discrepancy among different driver gene identification strategies, the intersection by individual strategies not only identifies the most reliable driver genes, but also helps to find potential novel driver genes that are not well-characterized.

Future developments

Currently C has some limitations and warrants future updates. (1) C is currently deployed on the Ali Cloud server, which requires a lot of memory and space to process the data. Any variant file exceeding 40,000 records may fail when running DrGaP. Since the Random Forest Model 20/20 + occupies too much CPU resources, it also takes a long time (>3 h for sample of 50,000 mutations with 8 cores of Intel Xeon E5-2643 3.3 GHz) to run a whole pipeline of C. Future optimizations are required to accelerate C. (2) Current version of C only supports the GRCH37 reference genome, and a new version of the reference genome such as GRCH38 will be added in the next version. (3) One potential application of C is to identify the target driver genes for drug discovery. However, the computationally predicted drivers should not be over-interpreted without additional experimental evidence.

Availability

C3 is freely available for public use at http://drivergene.rwebox.com/c3.

Authors’ contributions

QL, JW, XY, and SQ conceived the project. CYZ, CZ, YC, and ZG designed the platform. CYZ, AS, and ZY analyzed the data. QL, YC, CZ, and CYZ wrote the manuscript. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.
  32 in total

1.  The Cancer Genome Atlas Pan-Cancer analysis project.

Authors:  John N Weinstein; Eric A Collisson; Gordon B Mills; Kenna R Mills Shaw; Brad A Ozenberger; Kyle Ellrott; Ilya Shmulevich; Chris Sander; Joshua M Stuart
Journal:  Nat Genet       Date:  2013-10       Impact factor: 38.330

2.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

Review 3.  A census of human cancer genes.

Authors:  P Andrew Futreal; Lachlan Coin; Mhairi Marshall; Thomas Down; Timothy Hubbard; Richard Wooster; Nazneen Rahman; Michael R Stratton
Journal:  Nat Rev Cancer       Date:  2004-03       Impact factor: 60.716

4.  VennDiagram: a package for the generation of highly-customizable Venn and Euler diagrams in R.

Authors:  Hanbo Chen; Paul C Boutros
Journal:  BMC Bioinformatics       Date:  2011-01-26       Impact factor: 3.307

Review 5.  The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge.

Authors:  Katarzyna Tomczak; Patrycja Czerwińska; Maciej Wiznerowicz
Journal:  Contemp Oncol (Pozn)       Date:  2015

6.  OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders.

Authors:  Joanna S Amberger; Carol A Bocchini; François Schiettecatte; Alan F Scott; Ada Hamosh
Journal:  Nucleic Acids Res       Date:  2014-11-26       Impact factor: 19.160

7.  Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers.

Authors:  Jüri Reimand; Gary D Bader
Journal:  Mol Syst Biol       Date:  2013       Impact factor: 11.429

8.  The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website.

Authors:  S Bamford; E Dawson; S Forbes; J Clements; R Pettett; A Dogan; A Flanagan; J Teague; P A Futreal; M R Stratton; R Wooster
Journal:  Br J Cancer       Date:  2004-07-19       Impact factor: 7.640

9.  Efficient Test and Visualization of Multi-Set Intersections.

Authors:  Minghui Wang; Yongzhong Zhao; Bin Zhang
Journal:  Sci Rep       Date:  2015-11-25       Impact factor: 4.379

10.  IntOGen-mutations identifies cancer drivers across tumor types.

Authors:  Abel Gonzalez-Perez; Christian Perez-Llamas; Jordi Deu-Pons; David Tamborero; Michael P Schroeder; Alba Jene-Sanz; Alberto Santos; Nuria Lopez-Bigas
Journal:  Nat Methods       Date:  2013-09-15       Impact factor: 28.547

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.