Sumeet Patiyal, Anjali Dhall, Gajendra P S Raghava.
Abstract
Identifying somatic mutations with high precision is one of the major challenges in predicting high-risk liver cancer patients. A number of mutation-calling techniques have been developed, including MuTect2, MuSE, Varscan2, and SomaticSniper. In this study, we benchmark the potential of these techniques for predicting prognostic biomarkers for liver cancer. We first extracted somatic mutations in liver cancer patients from Variant Call Format (VCF) and Mutation Annotation Format (MAF) files from The Cancer Genome Atlas. The MAF files are 42 times smaller than the VCF files and contain only high-quality somatic mutations. We then developed machine learning-based models for predicting high-risk cancer patients using the mutations obtained from each technique, and compared the techniques and file types on their ability to discriminate high- and low-risk liver cancer patients. Based on correlation analysis, we selected 80 genes having a significant negative correlation with the overall survival of liver cancer patients. Univariate survival analysis revealed the prognostic role of highly mutated genes. Single-gene analysis showed that the MuTect2-based MAF file achieved the maximum hazard ratio (HR for LAMC3) of 9.25, with a P-value of 1.78E-06. Further, we developed various prediction models using the risk-associated top-10 genes for each technique. Our results indicate that MuTect2-based VCF files outperform all other methods, with a maximum Area Under the Receiver-Operating Characteristic curve (AUROC) of 0.765 and HR = 4.50 (P-value = 3.83E-15). Overall, the VCF file generated using MuTect2 performs best among the mutation-calling techniques for the prediction of high-risk liver cancer patients.
We hope that our findings will provide a useful and comprehensive comparison of various mutation-calling techniques for the prognostic analysis of cancer patients. To serve the scientific community, we have provided a Python-based pipeline for developing prediction models from the mutation profiles (VCF/MAF) of cancer patients. It is available on GitHub at https://github.com/raghavagps/mutation_bench.
Keywords: liver cancer; machine learning; mutation calling techniques; prognosis; regression; survival analysis
Year: 2022 PMID: 35734767 PMCID: PMC9204470 DOI: 10.1093/biomethods/bpac012
Source DB: PubMed Journal: Biol Methods Protoc ISSN: 2396-8923
Figure 1: Pipeline illustrating the overall workflow of the study.
Total number of genes and mutations for each gene extracted from VCF and MAF files using different mutation calling techniques
| File type | Technique | Number of genes | Number of mutations |
|---|---|---|---|
| VCF | MuTect2 | 25 366 | 5 237 093 |
| VCF | MuSE | 19 425 | 379 368 |
| VCF | Varscan2 | 19 422 | 576 231 |
| VCF | SomaticSniper | 25 785 | 5 003 969 |
| MAF | MuTect2 | 16 474 | 59 741 |
| MAF | MuSE | 15 712 | 51 184 |
| MAF | Varscan2 | 15 950 | 54 877 |
| MAF | SomaticSniper | 14 979 | 44 102 |
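As a rough illustration of how gene and mutation counts of this kind can be tallied, the sketch below counts mutations per gene and per sample from MAF-style text. It is not the authors' pipeline (that is available at the GitHub link given in the abstract). The toy records and the `mutation_counts` helper are hypothetical; only the column names `Hugo_Symbol` and `Tumor_Sample_Barcode` follow the standard MAF layout.

```python
import csv
from collections import Counter, defaultdict

# Made-up, tab-separated MAF excerpt. Real MAF files carry many more
# columns; lines starting with '#' are header comments.
TOY_MAF = (
    "#version 2.4\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Classification\n"
    "TP53\tS1\tMissense_Mutation\n"
    "TP53\tS1\tNonsense_Mutation\n"
    "LAMC3\tS1\tMissense_Mutation\n"
    "TP53\tS2\tMissense_Mutation\n"
)


def mutation_counts(maf_text):
    """Return {sample: Counter({gene: number of mutations})}."""
    # Drop comment lines so the first remaining line is the column header.
    lines = [ln for ln in maf_text.splitlines() if not ln.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    counts = defaultdict(Counter)
    for row in reader:
        counts[row["Tumor_Sample_Barcode"]][row["Hugo_Symbol"]] += 1
    return counts


counts = mutation_counts(TOY_MAF)
# Fixed gene order so every sample gets a feature vector of the same shape.
genes = sorted({gene for per_sample in counts.values() for gene in per_sample})
matrix = {sample: [counts[sample][g] for g in genes] for sample in sorted(counts)}
total_mutations = sum(sum(c.values()) for c in counts.values())
print(genes, matrix, total_mutations)
```

A sample-by-gene matrix like `matrix` is one plausible input for the regression and classification models reported further down; the paper's actual feature encoding may differ.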
Figure 2: UpSet plot of the distribution of genes across the four techniques. (A) From VCF files and (B) from MAF files.
Figure 3: Visualization of the mutation summary (variant classification, type and SNVs) for (A) MuTect2, (B) MuSE, (C) Varscan2 and (D) SomaticSniper MAF files.
Figure 4: Oncoplot visualization of the mutation frequency of the most frequently mutated genes. Rows represent genes with their mutation percentages, and columns represent samples. (A) Oncoplot for MuTect2, in which 89.18% of samples have mutated genes. (B) Oncoplot for MuSE, in which 80.29% of samples have mutated genes. (C) Oncoplot for Varscan2, in which 88.43% of samples have mutated genes. (D) Oncoplot for SomaticSniper, in which 75.73% of samples have altered/mutated genes.
HR (with P-value in parentheses) for the risk-associated top-10 genes from VCF and MAF files derived using the MuTect2, MuSE, Varscan2 and SomaticSniper techniques
| MuTect2 gene | HR (P-value) | MuSE gene | HR (P-value) | Varscan2 gene | HR (P-value) | SomaticSniper gene | HR (P-value) |
|---|---|---|---|---|---|---|---|
| VCF files | | | | | | | |
| SNHG10 | 5.49 (3.94E-06) | CLMP | 3.01 (1.67E-05) | FAM160A2 | 6.81 (4.01E-05) | CLDN20 | 7.06 (6.62E-07) |
| WIZ | 2.69 (9.71E-07) | BIRC6 | 2.80 (4.46E-04) | LOC100420587 | 5.45 (1.31E-07) | NR2C2AP | 5.17 (3.16E-05) |
| MGAT4EP | 2.49 (4.46E-04) | LINC02210-CRHR1 | 2.03 (6.42E-03) | SPDYA | 3.08 (7.70E-04) | ATG9B | 3.34 (2.59E-04) |
| LINC00304 | 2.39 (7.40E-05) | DHX8 | 2.00 (2.90E-02) | BRSK2 | 2.55 (1.01E-03) | HAUS5 | 2.79 (2.22E-05) |
| CACNG7 | 1.93 (5.72E-04) | LINC00972 | 1.91 (9.31E-03) | ADGRF4 | 2.21 (1.23E-02) | LOC100287329 | 2.58 (8.23E-04) |
| OR52B6 | 1.83 (1.12E-03) | PAX7 | 1.90 (8.29E-04) | LINC00972 | 2.11 (2.18E-03) | P4HTM | 2.18 (2.43E-02) |
| TYK2 | 1.80 (2.21E-03) | TAS1R2 | 1.61 (2.63E-02) | TM4SF18 | 2.07 (1.40E-02) | OR6C76 | 2.12 (1.18E-03) |
| PIGO | 1.79 (1.66E-02) | SNTG1 | 1.53 (3.37E-02) | OR5AS1 | 1.86 (1.43E-02) | CLK2 | 1.94 (3.58E-02) |
| S100A12 | 1.71 (1.10E-02) | CNTN5 | 1.34 (2.25E-01) | PDE11A | 1.72 (2.74E-03) | FAM187B | 1.64 (1.51E-02) |
| DNAJC9-AS1 | 1.08 (6.51E-01) | ZNF521 | 1.26 (2.63E-01) | LOC101929073 | 1.29 (2.98E-01) | NOMO3 | 1.34 (1.45E-01) |
| MAF files | | | | | | | |
| LAMC3 | 9.25 (1.78E-06) | ITGB8 | 8.37 (5.69E-07) | SYDE1 | 8.46 (3.71E-05) | CAD | 5.56 (8.10E-04) |
| EVC2 | 4.30 (8.66E-05) | TBX3 | 8.10 (6.06E-05) | ALPP | 4.33 (1.44E-03) | TOP2A | 4.63 (2.73E-03) |
| NYNRIN | 3.94 (1.22E-03) | SIPA1L3 | 4.90 (5.54E-05) | KIAA2026 | 3.85 (1.49E-03) | KIAA2026 | 4.01 (2.62E-03) |
| KIAA2026 | 3.85 (1.49E-03) | CAD | 4.45 (3.58E-03) | CAD | 3.32 (1.91E-02) | EVC2 | 4.00 (1.04E-03) |
| SUPT20H | 3.41 (7.53E-03) | EVC2 | 4.16 (2.97E-04) | BRINP2 | 2.83 (2.43E-02) | KTN1 | 2.56 (1.09E-01) |
| BRINP2 | 2.83 (2.43E-02) | ARHGEF11 | 3.17 (2.37E-02) | TP53 | 1.60 (9.85E-03) | EPHA3 | 2.25 (1.67E-01) |
| LRP1B | 1.93 (7.81E-03) | BRINP2 | 2.80 (2.56E-02) | PCDH15 | 1.48 (2.81E-01) | KIF26B | 2.03 (1.66E-01) |
| TP53 | 1.48 (3.60E-02) | PCDH15 | 1.72 (1.20E-01) | TG | 1.46 (4.53E-01) | PCDH15 | 1.76 (1.78E-01) |
| TG | 1.46 (4.53E-01) | TG | 1.46 (4.55E-01) | PLCB1 | 1.25 (7.00E-01) | TP53 | 1.63 (1.20E-02) |
| PCDH15 | 1.43 (3.30E-01) | CSMD3 | 1.24 (4.54E-01) | XIRP2 | 1.11 (7.55E-01) | TG | 1.18 (8.17E-01) |
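The HR values above come from single-gene survival analysis. A minimal sketch of how such a ratio can be estimated is given below, assuming a Cox proportional-hazards model with one binary covariate (gene mutated or not), fitted by Newton-Raphson with Breslow handling of tied event times. The source does not spell out the exact procedure, so this choice is an assumption; `cox_hazard_ratio` and the toy follow-up data are made up for illustration.

```python
import math


def cox_hazard_ratio(times, events, x, n_iter=25, tol=1e-9):
    """Estimate exp(beta) for a one-covariate Cox model (Breslow ties).

    times  : follow-up time per patient
    events : 1 if death was observed, 0 if censored
    x      : covariate per patient (here 1 = gene mutated, 0 = not)
    """
    beta = 0.0
    for _ in range(n_iter):
        grad = 0.0  # first derivative of the partial log-likelihood
        info = 0.0  # negative second derivative (observed information)
        for t_i, e_i, x_i in zip(times, events, x):
            if not e_i:
                continue  # censored patients only appear in risk sets
            s0 = s1 = s2 = 0.0
            for t_j, x_j in zip(times, x):
                if t_j >= t_i:  # patient j is still at risk at time t_i
                    w = math.exp(beta * x_j)
                    s0 += w
                    s1 += w * x_j
                    s2 += w * x_j * x_j
            mean = s1 / s0
            grad += x_i - mean
            info += s2 / s0 - mean * mean
        if info == 0.0:
            break  # covariate carries no information
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    return math.exp(beta)


# Made-up data: eight patients, follow-up in arbitrary time units.
times = [2, 3, 5, 7, 8, 11, 13, 17]
events = [1, 1, 1, 0, 1, 1, 0, 1]
mutated = [1, 1, 0, 1, 0, 1, 0, 0]

hr = cox_hazard_ratio(times, events, mutated)
hr_flipped = cox_hazard_ratio(times, events, [1 - m for m in mutated])
print(f"HR (mutated vs. wild-type): {hr:.2f}")
```

An HR above 1 means the mutated group has the higher estimated hazard. Swapping the group labels gives the reciprocal, which is a quick sanity check on any implementation. P-values like those in the table would additionally require a test statistic (for example Wald or log-rank), which this sketch omits.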
Figure 5: Kaplan-Meier (KM) survival curves for the risk estimation of liver cancer patients based on the combined effect of mutations. (A) Survival plots for the VCF files and (B) survival plots for the MAF files.
Performance of best regressors on top-10 genes from VCF and MAF files extracted using all techniques
| Technique | File type | MAE | RMSE | R | P-value |
|---|---|---|---|---|---|
| MuTect2 | VCF | 12.52 | 19.58 | 0.57 | 7.00E-37 |
| MuTect2 | MAF | 16.47 | 22.16 | 0.37 | 1.31E-14 |
| MuSE | VCF | 13.88 | 20.38 | 0.51 | 1.38E-29 |
| MuSE | MAF | 16.89 | 22.48 | 0.34 | 1.68E-12 |
| Varscan2 | VCF | 14.57 | 20.78 | 0.48 | 4.77E-26 |
| Varscan2 | MAF | 16.53 | 22.26 | 0.36 | 9.11E-14 |
| SomaticSniper | VCF | 15.76 | 21.82 | 0.40 | 3.31E-17 |
| SomaticSniper | MAF | 16.72 | 22.26 | 0.33 | 8.46E-12 |
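The error measures in this table, mean absolute error (MAE) and root mean squared error (RMSE), can be computed as in the sketch below, together with a correlation between predicted and observed values. Treating that correlation as Pearson's coefficient is an assumption, and the survival values and the `regression_metrics` helper are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, Pearson correlation) for paired values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r = cov / math.sqrt(var_t * var_p)
    return mae, rmse, r


# Made-up observed and predicted survival times.
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

mae, rmse, r = regression_metrics(observed, predicted)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R={r:.3f}")
```

RMSE penalizes large errors more heavily than MAE, so it is never smaller than MAE on the same predictions, which matches the pattern in every row of the table.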
Performance of logistic regression based models on top-10 genes from VCF and MAF files extracted using all techniques on validation dataset
| Technique | File type | AUROC | F1 | Kappa | MCC |
|---|---|---|---|---|---|
| MuTect2 | VCF | 0.765 | 0.767 | 0.421 | 0.442 |
| MuTect2 | MAF | 0.659 | 0.661 | 0.259 | 0.335 |
| MuSE | VCF | 0.735 | 0.737 | 0.400 | 0.421 |
| MuSE | MAF | 0.621 | 0.667 | 0.225 | 0.277 |
| Varscan2 | VCF | 0.656 | 0.661 | 0.250 | 0.348 |
| Varscan2 | MAF | 0.653 | 0.661 | 0.308 | 0.309 |
| SomaticSniper | VCF | 0.638 | 0.672 | 0.276 | 0.277 |
| SomaticSniper | MAF | 0.617 | 0.667 | 0.225 | 0.243 |
| Average | VCF | 0.699 ± 0.061 | 0.709 ± 0.051 | 0.337 ± 0.086 | 0.372 ± 0.075 |
| Average | MAF | 0.638 ± 0.022 | 0.664 ± 0.003 | 0.254 ± 0.039 | 0.291 ± 0.040 |
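For readers unfamiliar with the four classification measures in this table, the sketch below computes each from scratch for a binary high-risk/low-risk problem. The labels, scores and the 0.5 decision threshold are made up, and F1 is computed for the positive (high-risk) class only, which may differ from the averaging the authors used.

```python
import math


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    return tp, fp, fn, tn


def f1_kappa_mcc(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    f1 = 2 * tp / (2 * tp + fp + fn)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_obs = (tp + tn) / n
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return f1, kappa, mcc


# Made-up labels (1 = high risk) and predicted probabilities.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]

auc = auroc(labels, scores)
f1, kappa, mcc = f1_kappa_mcc(*confusion(labels, predictions))
print(f"AUROC={auc:.4f}  F1={f1:.2f}  Kappa={kappa:.2f}  MCC={mcc:.2f}")
```

AUROC is threshold-free, while F1, Kappa and MCC depend on the chosen cutoff, which is one reason the ranking of techniques in the table is not identical across columns.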