Malvika Sudhakar, Raghunathan Rengaswamy, Karthik Raman.
Abstract
An emergent area of cancer genomics is the identification of driver genes. Driver genes confer a selective growth advantage to the cell. While several driver genes have been discovered, many remain undiscovered, especially those mutated at a low frequency across samples. This study defines new features and builds a pan-cancer model, cTaG, to identify new driver genes. The features capture the functional impact of the mutations as well as their recurrence across samples, which helps build a model that is not biased against genes mutated at low frequency. The model classifies genes into the functional categories of driver genes, tumour suppressor genes (TSGs) and oncogenes (OGs), which have distinct mutation-type profiles. We overcome overfitting and show that certain mutation types, such as nonsense mutations, are more important for classification. Further, cTaG was employed to identify tissue-specific driver genes. Some known cancer driver genes predicted by cTaG as TSGs with high probability are ARID1A, TP53, and RB1. In addition to these known genes, the potential driver genes predicted are CD36, ZNF750 and ARHGAP35 as TSGs and TAB3 as an oncogene. Overall, our approach surmounts the issue of low recall and bias towards genes with high mutation rates and predicts potential new driver genes for further experimental screening. cTaG is available at https://github.com/RamanLab/cTaG.
Year: 2022 PMID: 34997044 PMCID: PMC8741763 DOI: 10.1038/s41598-021-04015-y
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Methodology for identifying new driver genes. The figure presents an overview of the steps involved in our study, as described in the “Outline of cTaG” section. The somatic mutation data are used to generate the feature matrix, where rows represent genes and columns represent features. The genes labelled as TSG or OG were used for building the models. Block A (light green frame) shows how the classifier is built; it is repeated five times, once per fold (details in the “Classification of genes” section). Block B (light blue frame) shows the random iterations used to estimate hyper-parameters; it is repeated 10 times to identify a set of stable hyper-parameters (details in the “Tuning hyperparameters” section). The five models are then used to make predictions on unlabelled genes to identify new driver genes.
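The fold structure of Block A can be illustrated with a minimal, standard-library sketch. It assumes that labelled genes are assigned to five folds in a class-balanced (stratified) way, which the caption does not state; the function names `stratified_folds` and `train_test_splits`, the seed, and the toy gene labels are hypothetical and are not taken from the cTaG code.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each gene to one of n_folds folds, balancing classes per fold.

    Stratification is an assumption of this sketch, not a documented cTaG step.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for gene, label in labels.items():
        by_class[label].append(gene)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        genes = sorted(by_class[label])
        rng.shuffle(genes)
        # Deal genes of this class round-robin across the folds.
        for i, gene in enumerate(genes):
            folds[i % n_folds].append(gene)
    return folds

def train_test_splits(folds):
    """Yield (train, test) gene lists, holding out one fold at a time."""
    for k, test in enumerate(folds):
        train = [g for j, fold in enumerate(folds) if j != k for g in fold]
        yield train, test

# Toy labels (hypothetical): 10 TSGs and 10 OGs.
labels = {f"TSG{i}": "TSG" for i in range(10)}
labels.update({f"OG{i}": "OG" for i in range(10)})
folds = stratified_folds(labels)
splits = list(train_test_splits(folds))
```

Each of the five splits would train one model, giving the five models that are later applied to unlabelled genes.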
Definitions of mutation categories and the ratio of mutation categories.
Along with mutation categories annotated by COSMIC, we define additional categories which combine multiple mutation types. These categories together with the 11 mutation categories defined in “Ratio-metric features” section are used to define ratio-metric features. The ratio of two mutation types A and B is defined for a given gene, where A and B are any two mutation categories.
The features used in this study for classification.
| Feature origin | Features |
|---|---|
| Previously defined in the literature (18 features) | Silent/kb, Total Missense, Total Splicing, Total LOF, Missense/kb, LOF/kb, |
| Defined in this paper (19 features) | |
The features in bold are ratio-metric features. The different mutation types and method for calculating the features are defined in “Mutation data” section.
Classification metrics for training and test set.
| Model | Dataset | Class | Accuracy | F1 score | Precision | Recall |
|---|---|---|---|---|---|---|
| cTaG | Training set | OG | 0.86 ± 0.04 | 0.77 ± 0.07 | 0.67 ± 0.09 | |
| | | TSG | | 0.84 ± 0.04 | | |
| | Test set | OG | 0.76 ± 0.03 | 0.59 ± 0.10 | 0.50 ± 0.19 | |
| | | TSG | | 0.77 ± 0.07 | | |
| BalancedBagging | Training set | OG | 0.93 ± 0.05 | 0.92 ± 0.06 | 0.86 ± 0.09 | 0.99 ± 0.01 |
| | | TSG | | 0.94 ± 0.04 | 1.00 ± 0.01 | 0.90 ± 0.07 |
| | Test set | OG | 0.69 ± 0.06 | 0.64 ± 0.06 | 0.56 ± 0.07 | 0.75 ± 0.06 |
| | | TSG | | 0.73 ± 0.06 | 0.82 ± 0.04 | 0.65 ± 0.09 |
Numbers in bold indicate the best performance for each metric between TSG and OG. The metrics are standard and are defined as follows (T stands for true, F for false, P for positives and N for negatives): Accuracy = (TP + TN)/(TP + FP + TN + FN); Precision = TP/(TP + FP); Recall = TP/(TP + FN); F1 score = 2 × Precision × Recall/(Precision + Recall), the harmonic mean of Precision and Recall.
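As a worked illustration of these definitions, the sketch below computes the four metrics from confusion-matrix counts. The function name `classification_metrics` and the example counts are hypothetical; they are not values from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, for illustration only.
m = classification_metrics(tp=8, fp=2, tn=6, fn=4)
```

With these counts, accuracy is 14/20, precision is 8/10, recall is 8/12, and F1 equals 2TP/(2TP + FP + FN) = 16/22.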
Figure 2. Distribution of top features identified by the classifier for TSG and OG. Training genes were used to study the differences between the distributions of features (kernel density) in TSG and OG. The Kolmogorov–Smirnov (KS) statistic and the p-value are given for each feature. A higher KS statistic indicates a larger difference between the two distributions.
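The two-sample KS statistic in the caption is the largest absolute gap between the two empirical cumulative distribution functions. A minimal standard-library sketch follows; the function name `ks_statistic` and the toy values are hypothetical, and the p-value is not computed here (a library routine such as SciPy's `ks_2samp` returns both).

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both ECDFs are step functions, so the maximum gap occurs at a data point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Toy feature values (hypothetical), e.g. one feature measured in TSGs and OGs.
d_separated = ks_statistic([0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8])
d_overlap = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
d_identical = ks_statistic([1, 2, 3], [1, 2, 3])
```

Fully separated samples give the maximum value of 1, identical samples give 0, and partially overlapping samples fall in between.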
Figure 3. Fraction of genes predicted plotted against log-transformed mutation rates. Genes predicted by a given method were sorted by mutation rate, and each mutation rate was plotted against the fraction of predicted genes at or below that rate.
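The curve described in this caption can be sketched as follows. This is an illustration only: it assumes a base-10 logarithm (the caption does not state the base), and the function name `cumulative_fraction_curve` and the toy mutation rates are hypothetical.

```python
import math

def cumulative_fraction_curve(mutation_rates):
    """Return (log10 rate, fraction of genes at or below that rate) pairs.

    The log base is an assumption of this sketch.
    """
    rates = sorted(math.log10(r) for r in mutation_rates)
    n = len(rates)
    return [(rate, (i + 1) / n) for i, rate in enumerate(rates)]

# Toy mutation rates (hypothetical) for genes predicted by one method.
curve = cumulative_fraction_curve([1e-4, 1e-6, 1e-3, 1e-5])
```

A method whose curve rises early predicts a larger share of its genes at low mutation rates, which is the behaviour the figure compares across methods.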
Driver genes predicted for each of the cancer types.
| Primary tissue | Genes |
|---|---|
| Breast cancer | TP53, |
| Central nervous system | TP53, |
| Endometrium | |
| Hematopoietic | TP53, B2M, |
| Kidney | PBRM1, |
| Large intestine | TP53, FBXW7, |
| Liver | TP53, |
The genes reported showed consensus in > 4 CV models. Genes in bold did not reach similar consensus in the pan-cancer predictions. New genes are underlined.
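The consensus rule in this note can be sketched as a simple vote count across the cross-validation models. With five CV models, "> 4" means a gene must be predicted by all five. The function name `consensus_genes`, the `more_than` parameter, and the placeholder gene names are hypothetical.

```python
from collections import Counter

def consensus_genes(predictions_per_model, more_than=4):
    """Return genes predicted by more than `more_than` of the models."""
    votes = Counter()
    for predicted in predictions_per_model:
        # Use a set so each model votes at most once per gene.
        votes.update(set(predicted))
    return sorted(g for g, n in votes.items() if n > more_than)

# Placeholder predictions (hypothetical) from five CV models.
predictions = [
    {"GENE_A", "GENE_B"},
    {"GENE_A", "GENE_B"},
    {"GENE_A", "GENE_B", "GENE_C"},
    {"GENE_A", "GENE_B"},
    {"GENE_A"},
]
consensus = consensus_genes(predictions)
```

Here only `GENE_A` is retained, because `GENE_B` is predicted by four models and so does not exceed the threshold.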