Hashem A Shihab, Julian Gough, David N Cooper, Ian N M Day, Tom R Gaunt.
Abstract
MOTIVATION: The number of missense mutations being identified in cancer genomes has greatly increased as a consequence of technological advances and the reduced cost of whole-genome/whole-exome sequencing methods. However, a high proportion of the amino acid substitutions detected in cancer genomes have little or no effect on tumour progression (passenger mutations). Therefore, accurate automated methods capable of discriminating between driver (cancer-promoting) and passenger mutations are becoming increasingly important. In our previous work, we developed the Functional Analysis through Hidden Markov Models (FATHMM) software and, using a model weighted for inherited disease mutations, observed improved performance over alternative computational prediction algorithms. Here, we describe an adaptation of our original algorithm that incorporates a cancer-specific model to potentiate the functional analysis of driver mutations.
Year: 2013 PMID: 23620363 PMCID: PMC3673218 DOI: 10.1093/bioinformatics/btt182
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Summary of mutation datasets used in this study
| Dataset | Positives | Negatives | Description |
|---|---|---|---|
| Training datasets | |||
| CanProVar | 12 720 | — | A collection of cancer-associated mutations used to calculate our pathogenicity weights |
| UniProt | — | 36 928 | A collection of putative neutral polymorphisms used to calculate our pathogenicity weights |
| Capriotti and Altman benchmark | |||
| CNO | 3163 | 3163 | Comprising driver mutations used to train the CHASM algorithm and neutral polymorphisms |
| CND | 3163 | 3163 | Comprising driver mutations used to train the CHASM algorithm and other germ line mutations (both disease-causing and neutral polymorphisms) |
| Synthetic | 3163 | 3163 | Comprising driver and passenger mutations (somatic) used to train the CHASM algorithm |
| Gonzalez-Perez | |||
| COSMIC 2 + 1 | 3978 | 39 850 | Comprising COSMIC mutations occurring in 2+ samples and COSMIC mutations occurring in one sample |
| COSMIC 5 + 1 | 1631 | 39 850 | Comprising COSMIC mutations occurring in 5+ samples and COSMIC mutations occurring in one sample |
| COSMIC 2/POL | 3978 | 8040 | Comprising COSMIC mutations occurring in 2+ samples and neutral polymorphisms |
| COSMIC 5/POL | 1631 | 8040 | Comprising COSMIC mutations occurring in 5+ samples and neutral polymorphisms |
| COSMIC D/O | 2151 | 41 664 | Comprising driver mutations used to train the CHASM algorithm and COSMIC mutations not in the positive subset |
| COSMIC D/POL | 2151 | 8040 | Comprising driver mutations used to train the CHASM algorithm and neutral polymorphisms |
| COSMIC CGC/NONCGC | 4865 | 34 827 | Comprising COSMIC mutations falling within genes defined in the CGC and COSMIC mutations falling within genes outside the CGC |
| WG 2/1 | 790 | 24 079 | Comprising somatic mutations occurring in 2+ samples and somatic mutations occurring in one sample |
| WG CGC/NONCGC | 1302 | 22 983 | Comprising somatic mutations falling within genes defined in the CGC and somatic mutations falling within genes outside the CGC |
CGC, Cancer Gene Census (Futreal ).
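The recurrence-based datasets in the table above (for example COSMIC 2 + 1 and COSMIC 5 + 1) label a mutation as positive when it occurs in at least a given number of samples and as negative when it occurs in exactly one. A minimal sketch of that partitioning follows; the function name, the `(mutation, sample)` input format and the mutation identifiers are hypothetical and are not taken from the original study.

```python
def label_by_recurrence(observations, min_positive_samples=2):
    """Partition mutations by how many distinct samples they occur in.

    observations: iterable of (mutation_id, sample_id) pairs (hypothetical format).
    Returns (positives, negatives): mutations seen in at least
    min_positive_samples samples, and mutations seen in exactly one sample.
    Mutations in between (e.g. 2-4 samples when the cutoff is 5) are left out.
    """
    samples_per_mutation = {}
    for mutation, sample in observations:
        samples_per_mutation.setdefault(mutation, set()).add(sample)

    positives, negatives = set(), set()
    for mutation, samples in samples_per_mutation.items():
        if len(samples) >= min_positive_samples:
            positives.add(mutation)
        elif len(samples) == 1:
            negatives.add(mutation)
    return positives, negatives


# Made-up observations: m1 occurs in 2 samples, m2 in 1, m3 in 5.
observations = [
    ("m1", "s1"), ("m1", "s2"),
    ("m2", "s1"),
    ("m3", "s1"), ("m3", "s2"), ("m3", "s3"), ("m3", "s4"), ("m3", "s5"),
]
pos_2plus, neg_single = label_by_recurrence(observations, min_positive_samples=2)
pos_5plus, _ = label_by_recurrence(observations, min_positive_samples=5)
```

With a cutoff of 5, `m1` falls into neither set, which mirrors how the 5 + 1 datasets have fewer positives than the 2 + 1 datasets while sharing the same negatives.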
Fig. 1. The distribution of the predicted magnitude of effect for all driver mutations against all non–cancer-associated (germ line and somatic) mutations in the Capriotti and Altman (2011) benchmark. Here, the dashed line represents our prediction threshold of −0.75, at which the specificity and sensitivity of our algorithm are maximized across all mutation datasets.
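The idea of choosing a prediction threshold that balances sensitivity and specificity can be sketched as a simple sweep over candidate thresholds. The code below rests on two assumptions that the caption does not state: that lower (more negative) scores indicate driver mutations, and that "maximized" means the sum of sensitivity and specificity is largest. The scores are made-up toy values, not data from the benchmark.

```python
def sensitivity_specificity(driver_scores, neutral_scores, threshold):
    """Treat scores at or below the threshold as predicted drivers (assumption)."""
    true_positives = sum(score <= threshold for score in driver_scores)
    true_negatives = sum(score > threshold for score in neutral_scores)
    return (true_positives / len(driver_scores),
            true_negatives / len(neutral_scores))


def best_threshold(driver_scores, neutral_scores, candidates):
    """Return the candidate that maximizes sensitivity + specificity."""
    def balance(threshold):
        sens, spec = sensitivity_specificity(driver_scores, neutral_scores, threshold)
        return sens + spec
    return max(candidates, key=balance)


# Toy scores chosen only to illustrate the sweep.
driver_scores = [-4.0, -2.5, -1.0, -0.2]
neutral_scores = [-0.5, -0.3, 1.2, 2.0]
candidates = [-3.0, -0.75, 0.0, 1.0]

chosen = best_threshold(driver_scores, neutral_scores, candidates)
sens, spec = sensitivity_specificity(driver_scores, neutral_scores, chosen)
```

On these toy values the sweep selects −0.75, but that is a property of the example data only.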
Performance of computational prediction methods using the Capriotti and Altman benchmarking datasets
| Method | tp | fp | tn | fn | Accuracy | Precision | Specificity | Sensitivity | NPV | MCC |
|---|---|---|---|---|---|---|---|---|---|---|
| Cancer and neutral only (CNO) | ||||||||||
| SIFT | 2180 | 560 | 1266 | 982 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.38 |
| PolyPhen-2 | 2421 | 1244 | 1894 | 656 | 0.70 | 0.66 | 0.60 | 0.79 | 0.74 | 0.40 |
| Mutation Assessor | 2403 | 1004 | 2155 | 751 | 0.72 | 0.71 | 0.68 | 0.76 | 0.74 | 0.45 |
| SPF-Cancer | 2876 | 196 | 2967 | 287 | 0.92 | 0.94 | 0.94 | 0.91 | 0.91 | 0.85 |
| FATHMM | 2858 | 77 | 3077 | 300 | ||||||
| Cancer, neutral and other disease (CND) | ||||||||||
| SIFT | 2180 | 943 | 745 | 982 | 0.57 | 0.55 | 0.44 | 0.69 | 0.59 | 0.14 |
| PolyPhen-2 | 2421 | 1921 | 1238 | 656 | 0.56 | 0.54 | 0.34 | 0.79 | 0.62 | 0.14 |
| Mutation Assessor | 2403 | 1921 | 1238 | 751 | 0.58 | 0.56 | 0.39 | 0.76 | 0.62 | 0.17 |
| SPF-Cancer | 2876 | 418 | 2745 | 287 | 0.89 | 0.87 | 0.87 | 0.91 | 0.91 | 0.78 |
| FATHMM | 2858 | 161 | 2933 | 300 | ||||||
| Synthetic | ||||||||||
| SIFT | 2180 | 1431 | 1434 | 982 | 0.59 | 0.58 | 0.50 | 0.69 | 0.62 | 0.19 |
| PolyPhen-2 | 2421 | 1902 | 985 | 656 | 0.56 | 0.54 | 0.34 | 0.79 | 0.62 | 0.14 |
| Mutation Assessor | 2403 | 1474 | 1432 | 751 | 0.63 | 0.60 | 0.49 | 0.76 | 0.67 | 0.26 |
| SPF-Cancer | 2859 | 297 | 2866 | 304 | | | | 0.90 | 0.90 | |
| FATHMM | 2858 | 362 | 2710 | 300 | 0.89 | 0.88 | 0.88 | | | 0.79 |
Note: tp, fp, tn, fn refer to the number of true positives, false positives, true negatives and false negatives, respectively. Bold values indicate the best performing method across the corresponding performance statistics. Accuracy, precision, specificity, sensitivity, NPV and MCC are calculated from normalized numbers. PolyPhen-2 ‘Possibly damaging’ predictions are classified as pathogenic.
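The note does not spell out what "normalized numbers" means. One plausible reading, assumed here, is that each class is rescaled to the same size, so every statistic is computed from the true positive and true negative rates rather than from raw counts. The sketch below uses that reading; the function name is hypothetical. Under this assumption it reproduces the SIFT row of the CNO benchmark from its tp, fp, tn and fn counts.

```python
from math import sqrt


def normalized_metrics(tp, fp, tn, fn):
    """Performance statistics from class-balanced (normalized) rates.

    Assumption: positives and negatives are each rescaled to unit weight,
    so normalized tp = sensitivity, fn = 1 - sensitivity,
    tn = specificity and fp = 1 - specificity.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fp_rate = 1 - specificity
    fn_rate = 1 - sensitivity
    return {
        "accuracy": (sensitivity + specificity) / 2,
        "precision": sensitivity / (sensitivity + fp_rate),
        "specificity": specificity,
        "sensitivity": sensitivity,
        "npv": specificity / (specificity + fn_rate),
        # Matthews correlation coefficient on the normalized confusion matrix;
        # the (tp + fn) and (tn + fp) factors both equal 1 after normalization.
        "mcc": (sensitivity * specificity - fp_rate * fn_rate)
               / sqrt((sensitivity + fp_rate) * (specificity + fn_rate)),
    }


# Counts taken from the SIFT row of the CNO benchmark in the table above.
sift_cno = normalized_metrics(tp=2180, fp=560, tn=1266, fn=982)
rounded = {name: round(value, 2) for name, value in sift_cno.items()}
```

Because the rates are balanced, accuracy reduces to the mean of sensitivity and specificity, which is why methods with very different false positive counts can still be compared across datasets of different class sizes.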
Fig. 2. ROC curves showing the cumulative true positive rate versus the cumulative false positive rate for the computational prediction algorithms evaluated in our independent benchmark.
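A ROC curve of this kind can be traced by ranking mutations from most to least driver-like and accumulating true and false positive rates down the ranking. The sketch below is illustrative only: it assumes lower scores are more driver-like, does not group tied scores, and uses made-up scores and labels rather than any benchmark data.

```python
def roc_points(scores, labels):
    """Cumulative (false positive rate, true positive rate) points.

    labels: True for driver (positive), False for neutral (negative).
    Assumption: lower scores are more driver-like, so mutations are
    ranked in ascending score order. Tied scores are not grouped.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_positive in sorted(zip(scores, labels)):
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data: three drivers and one neutral mutation.
scores = [-3.0, -2.0, -1.0, 0.5]
labels = [True, True, False, True]
points = roc_points(scores, labels)
auc = area_under_curve(points)
```

In this toy ranking two of the three drivers score below the single neutral mutation, so the area under the curve is two thirds.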
A performance comparison using a 2-fold cross-validation procedure
| Method | Accuracy | Precision | Specificity | Sensitivity | NPV | MCC |
|---|---|---|---|---|---|---|
| CHASM | 0.80 | 0.85 | 0.87 | 0.73 | 0.76 | 0.60 |
| FATHMM | | | | | | |
Note: The performances of CHASM have been reproduced with permission from Capriotti and Altman (2011), Copyright 2013, Elsevier. Bold values indicate the best performing method across the corresponding performance statistics.
Performance of computational prediction methods using the Gonzalez-Perez et al. benchmarking datasets
| Dataset | SIFT Acc. | SIFT MCC | PolyPhen-2 Acc. | PolyPhen-2 MCC | Mutation Assessor Acc. | Mutation Assessor MCC | TransFIC Acc. | TransFIC MCC | FATHMM Acc. | FATHMM MCC | FATHMM Threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|
| COSMIC 2 + 1 | 0.49 | 0.10 | 0.59 | 0.06 | 0.30 | 0.80 | 0.50 | −3.50 | |||
| COSMIC 5 + 1 | 0.49 | 0.12 | 0.60 | 0.09 | 0.32 | 0.90 | 0.95 | −3.50 | |||
| COSMIC 2/POL | 0.70 | 0.32 | 0.79 | 0.39 | 0.80 | 0.91 | 0.84 | −1.50 | |||
| COSMIC 5/POL | 0.71 | 0.32 | 0.86 | 0.41 | 0.71 | 0.96 | 0.76 | 0.97 | −1.50 | ||
| COSMIC D/O | 0.48 | 0.09 | 0.61 | 0.10 | 0.18 | 0.78 | 0.88 | 0.25 | −3.00 | ||
| COSMIC D/POL | 0.70 | 0.29 | 0.85 | 0.42 | 0.64 | 0.92 | 0.94 | 0.69 | −0.75 | ||
| COSMIC CGC/NONCGC | 0.44 | 0.08 | 0.56 | 0.07 | 0.16 | 0.78 | 0.85 | 0.50 | −1.60 | ||
| WG 2/1 | 0.84 | 0.02 | 0.71 | 0.01 | 0.10 | 0.89 | 0.96 | 0.23 | −3.50 | ||
| WG CGC/NONCGC | 0.42 | 0.11 | 0.56 | 0.11 | 0.34 | 0.90 | 0.94 | 0.39 | −2.80 | ||
Note: The performances of alternative computational prediction algorithms have been reproduced with permission from Gonzalez-Perez et al. (Open Access Article). Bold values indicate the best performing method across the corresponding benchmark.