Chen-Yu Zhu, Chi Zhou, Yun-Qin Chen, Ai-Zong Shen, Zong-Ming Guo, Zhao-Yi Yang, Xiang-Yun Ye, Shen Qu, Jia Wei, Qi Liu.
Abstract
Next-generation sequencing has allowed identification of millions of somatic mutations in human cancer cells. A key challenge in interpreting cancer genomes is to distinguish drivers of cancer development among available genetic mutations. To address this issue, we present the first web-based application, consensus cancer driver gene caller (C3), to identify the consensus driver genes using six different complementary strategies, i.e., frequency-based, machine learning-based, functional bias-based, clustering-based, statistics model-based, and network-based strategies. This application allows users to specify customized operations when calling driver genes, and provides solid statistical evaluations and interpretable visualizations on the integration results. C3 is implemented in Python and is freely available for public use at http://drivergene.rwebox.com/c3.
Keywords: Cancer driver genes; Consensus; Data integration; Somatic mutation; Web server
Year: 2019 PMID: 31465854 PMCID: PMC6818389 DOI: 10.1016/j.gpb.2018.10.004
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Figure 1 Guideline of
A. A schematic representation of the C3 workflow. A cancer sample input into the C3 workflow is analyzed by six cancer driver gene calling strategies (shown on the left), resulting in a consensus driver gene set as the output. The driver gene set is then evaluated in terms of precision and nDCG before it is annotated based on the reference databases (shown on the right). Finally, the results are visualized with SuperExactTest and Circos. B. Overview of the six categories of distinct cancer driver gene calling strategies. C. SuperExactTest plot of the consensus calling results identified by C3. The inner rings (green and white blocks) represent the six driver gene sets generated by the six strategies in C3. White and green blocks indicate the absence and presence of a driver gene set, respectively, and each group represents an intersection of the prediction results from 2–6 strategies (shown as the green blocks). The size of each intersection is shown proportionally by the height of the bar on the outer ring. The number of genes in each intersection is shown on top of the respective bar, and the color intensity of the bar represents the significance of the intersection (−log10 P value). D. Circos plot of potential driver genes identified by C3. From the outer to the inner circles, the first circle indicates the chromosomes across the genome, and the second circle, with gene symbols, indicates the top 100 consensus driver genes identified by C3. The six inner colored circles represent the top 100 results predicted by each individual strategy. Names of the strategies are provided in the center of the circle. The size of a gene symbol reflects its rank in the predicted results, with a larger size indicating a higher rank. The BRCA dataset in TCGA was used as an example for the analyses shown in panels C and D.
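The intersection groups described for panel C can be illustrated with a minimal sketch. The gene sets below are made-up placeholders rather than C3 output, and `intersection_sizes` is a hypothetical helper name. It only counts the genes shared by every strategy in each combination of two or more strategies; it does not compute the significance values that SuperExactTest reports.

```python
from itertools import combinations

def intersection_sizes(gene_sets, min_k=2):
    """Return {tuple_of_strategy_names: number_of_shared_genes} for every
    combination of at least `min_k` strategies."""
    names = sorted(gene_sets)
    sizes = {}
    for k in range(min_k, len(names) + 1):
        for combo in combinations(names, k):
            # Genes predicted by every strategy in this combination.
            shared = set.intersection(*(set(gene_sets[n]) for n in combo))
            sizes[combo] = len(shared)
    return sizes

# Hypothetical top-gene lists, for illustration only (not C3 results).
example_sets = {
    "frequency": {"TP53", "PIK3CA", "GATA3"},
    "network": {"TP53", "PIK3CA", "CDH1"},
    "clustering": {"TP53", "GATA3", "CDH1"},
}
sizes = intersection_sizes(example_sets)
```

With six strategies, the same loop would enumerate at most 57 combinations (15 pairs, 20 triples, 15 quadruples, 6 quintuples, and 1 six-way intersection).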
Figure 2 General framework of
The C3 web application provides a user-friendly, simple five-step workflow: (1) selecting the tools used for analysis, (2) choosing a data file to upload from the user’s own computer (refer to our file format to verify the integrity of the input data), (3) setting parameters for the selected tools (refer to the help documentation), (4) entering a name for the task (make sure to provide a valid e-mail address), and (5) querying and downloading the results with the request ID at the “Recent Request” page. The request ID is sent to the user via e-mail upon completion of the analysis.
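Number of tested tumor samples and mutations
| Abbreviation | Cancer type | No. of samples | No. of mutations | Mutations per sample |
| --- | --- | --- | --- | --- |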
| BLCA | Urothelial bladder cancer | 142 | 33,772 | 237.83 |
| BRCA | Breast cancer | 889 | 51,766 | 58.23 |
| CESC | Cervical cancer | 38 | 6115 | 160.92 |
| CLL | Chronic lymphocytic leukemia | 224 | 3491 | 15.58 |
| COAD | Colon adenocarcinoma | 244 | 32,192 | 131.93 |
| DLBCL | Diffuse large B-cell lymphoma | 57 | 5785 | 101.49 |
| ESCA | Esophageal cancer | 160 | 19,141 | 119.63 |
| GBM | Glioblastoma multiforme | 365 | 21,923 | 60.06 |
| HNSC | Head and neck squamous cell carcinoma | 407 | 60,074 | 147.60 |
| KIRC | Kidney renal clear cell carcinoma | 484 | 28,483 | 58.85 |
| KIRP | Kidney renal papillary cell carcinoma | 112 | 7541 | 67.33 |
| LAML | Acute myeloid leukemia | 197 | 4180 | 21.22 |
| LIHC | Liver hepatocellular carcinoma | 151 | 7648 | 50.65 |
| LGG | Lower grade glioma | 227 | 9965 | 43.90 |
| LUAD | Lung adenocarcinoma | 394 | 106,613 | 270.59 |
| LUSC | Lung squamous cell carcinoma | 175 | 53,528 | 305.87 |
| MB | Medulloblastoma | 332 | 3615 | 10.89 |
| MESO | Mesothelioma | 289 | 97,806 | 338.43 |
| MM | Multiple myeloma | 205 | 10,781 | 52.59 |
| NBL | Neuroblastoma | 352 | 6453 | 18.33 |
| OV | Ovarian serous cystadenocarcinoma | 480 | 28,136 | 58.62 |
| PAAD | Pancreatic ductal adenocarcinoma | 234 | 7939 | 33.93 |
| PRAD | Prostate adenocarcinoma | 420 | 16,784 | 39.96 |
| STAD | Stomach adenocarcinoma | 244 | 42,456 | 174.00 |
| SCLC | Small cell lung cancer | 31 | 8378 | 270.26 |
| THCA | Papillary thyroid carcinoma | 326 | 6424 | 19.71 |
| UCEC | Uterine corpus endometrial carcinoma | 255 | 39,234 | 153.86 |
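The last column of the table appears to be the number of mutations divided by the number of samples. The short check below, using three rows copied from the table, is a sketch of that arithmetic; the function name `mutations_per_sample` is an assumption introduced here for illustration.

```python
# (abbreviation, samples, mutations, reported mutations per sample),
# copied from three rows of the table.
rows = [
    ("BLCA", 142, 33772, 237.83),
    ("BRCA", 889, 51766, 58.23),
    ("MB", 332, 3615, 10.89),
]

def mutations_per_sample(samples, mutations):
    """Average number of mutations per tumor sample, rounded to 2 decimals."""
    return round(mutations / samples, 2)

for abbr, samples, mutations, reported in rows:
    computed = mutations_per_sample(samples, mutations)
    print(f"{abbr}: computed {computed:.2f}, reported {reported:.2f}")
```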
Figure 3 Comparison of cancer driver gene calling performance using
The performance of Consensus and the six individual strategies on 27 cancer datasets is presented in radar plots in terms of Top-N precision (A) (calculated according to Equation (1)) and Top-N nDCG (B) (calculated according to Equations (2), (3), (4), (5), (6)). Cancer types are labeled on the outermost circle. Values of precision in panel A and nDCG in panel B are labeled on each circle and range between 0.1 and 1. For each cancer type, a higher value indicates better performance; for each cancer driver gene calling strategy, a larger area indicates better performance. nDCG, normalized discounted cumulative gain.
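Equations (1) to (6) are not reproduced in this record, so the following is only a sketch of the two metrics using their common textbook definitions, which may differ in detail from the paper's formulas. It assumes binary relevance (a predicted gene either is or is not in a reference driver gene set) and an ideal ranking that places all reference genes first. The function names and the example genes are hypothetical.

```python
import math

def top_n_precision(ranked_genes, reference, n):
    """Fraction of the top-n predicted genes that are in the reference set."""
    top = ranked_genes[:n]
    if not top:
        return 0.0
    return sum(g in reference for g in top) / len(top)

def top_n_ndcg(ranked_genes, reference, n):
    """nDCG with binary relevance: DCG of the top-n list divided by the DCG
    of an ideal list that places reference genes in the first positions."""
    top = ranked_genes[:n]
    gains = [1.0 if g in reference else 0.0 for g in top]
    # Position i (0-based) is discounted by log2(i + 2).
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal_hits = min(len(reference), len(top))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranked prediction and reference set, for illustration only.
ranked = ["TP53", "GENE_X", "PIK3CA", "GENE_Y"]
reference = {"TP53", "PIK3CA", "GATA3"}
precision = top_n_precision(ranked, reference, 4)
ndcg = top_n_ndcg(ranked, reference, 4)
```

Precision ignores where the hits fall within the top N, whereas nDCG rewards rankings that place reference genes earlier, which is why the two panels can rank strategies differently.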