Hashem A Shihab, Julian Gough, David N Cooper, Peter D Stenson, Gary L A Barker, Keith J Edwards, Ian N M Day, Tom R Gaunt.
Abstract
The rate at which nonsynonymous single nucleotide polymorphisms (nsSNPs) are being identified in the human genome is increasing dramatically owing to advances in whole-genome/whole-exome sequencing technologies. Automated methods capable of accurately and reliably distinguishing between pathogenic and functionally neutral nsSNPs are therefore assuming ever-increasing importance. Here, we describe the Functional Analysis Through Hidden Markov Models (FATHMM) software and server: a species-independent method with optional species-specific weightings for the prediction of the functional effects of protein missense variants. Using a model weighted for human mutations, we obtained performance accuracies that outperformed traditional prediction methods (i.e., SIFT, PolyPhen, and PANTHER) on two separate benchmarks. Furthermore, in one benchmark, we achieve performance accuracies that outperform current state-of-the-art prediction methods (i.e., SNPs&GO and MutPred). We demonstrate that FATHMM can be efficiently applied to high-throughput/large-scale human and nonhuman genome sequencing projects with the added benefit of phenotypic outcome associations. To illustrate this, we evaluated nsSNPs in wheat (Triticum spp.) to identify some of the important genetic variants responsible for the phenotypic differences introduced by intense selection during domestication. A Web-based implementation of FATHMM, including a high-throughput batch facility and a downloadable standalone package, is available at http://fathmm.biocompute.org.uk.
Year: 2012 PMID: 23033316 PMCID: PMC3558800 DOI: 10.1002/humu.22225
Source DB: PubMed Journal: Hum Mutat ISSN: 1059-7794 Impact factor: 4.878
Summary of Mutation Datasets
| Dataset | Proteins | Amino acid substitutions | Description |
|---|---|---|---|
| HGMD | 2,298 | 49,532 | Inherited disease-causing mutations from HGMD used to calculate our pathogenicity weights |
| UniProt | 11,548 | 36,928 | Inherited putative functionally neutral mutations from UniProt used to calculate our pathogenicity weights |
| VariBench | 9,684 | 40,470 | Benchmarking dataset used in a review of nine alternative computational prediction algorithms [Thusberg et al., 2011] |
| Hicks et al. | 4 | 267 | Benchmarking dataset consisting of both disease-causing and functionally neutral mutations in four well-characterized genes (BRCA1, MSH2, MLH1, and TP53) |
| SwissVar | 11,986 | 59,976 | Benchmarking dataset used as an independent benchmark of eight alternative prediction algorithms |
Figure 1. The distribution of the predicted magnitude of effect for disease-associated (shaded region) and functionally neutral (unshaded region) AASs in the SwissVar dataset using our unweighted and weighted methods (A and B, respectively). From this, we calculated prediction thresholds at which both specificity and sensitivity were maximized (−3.0 and −1.5, respectively).
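The threshold selection described in the Figure 1 caption can be sketched as a scan over candidate cut-offs. This is a minimal illustration, not the authors' procedure: it assumes that lower scores indicate a more damaging substitution, that a substitution is called disease-associated when its score is at or below the threshold, and that "both maximized" means maximizing the sum of sensitivity and specificity. The function name `best_threshold` and the toy scores are hypothetical.

```python
def best_threshold(scores, labels):
    """Return the cut-off that maximizes sensitivity + specificity.

    scores: predicted magnitude of effect (assumed: lower = more damaging).
    labels: True for disease-associated, False for functionally neutral.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_sum, best_t = None, None
    for t in sorted(set(scores)):
        # Call a substitution disease-associated when score <= t (assumption).
        tp = sum(1 for s, y in zip(scores, labels) if y and s <= t)
        tn = sum(1 for s, y in zip(scores, labels) if not y and s > t)
        total = tp / n_pos + tn / n_neg  # sensitivity + specificity
        if best_sum is None or total > best_sum:
            best_sum, best_t = total, t
    return best_t


# Toy data, invented for illustration only.
toy_scores = [-5.0, -4.2, -3.1, -2.0, -0.5, 0.3, 1.1]
toy_labels = [True, True, True, False, True, False, False]
print(best_threshold(toy_scores, toy_labels))
```

On real data the scan would run over the full score distribution of the benchmark, separately for the unweighted and weighted methods.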
Performance of Computational Prediction Methods using the VariBench Benchmarking Dataset
| Method | tp | fp | tn | fn | Accuracy | Precision | Specificity | Sensitivity | NPV | MCC |
|---|---|---|---|---|---|---|---|---|---|---|
| Theoretical/unweighted computational prediction methods | ||||||||||
| SIFT | 10,464 | 4,856 | 12,188 | 7,433 | 0.65 | 0.64 | 0.62 | 0.68 | 0.66 | 0.30 |
| PolyPhen 1 | 10,093 | 9,185 | 17,669 | 3,199 | 0.69 | 0.52 | 0.64 | 0.39 | ||
| PolyPhen 1 | 14,285 | 4,993 | 13,671 | 7,197 | 0.70 | 0.68 | 0.66 | 0.74 | 0.72 | 0.40 |
| PANTHER | 9,689 | 2,859 | 8,676 | 2,797 | 0.76 | 0.76 | ||||
| FATHMM (unweighted) | 11,561 | 4,839 | 16,257 | 7,707 | 0.69 | 0.72 | 0.77 | 0.60 | 0.66 | 0.38 |
| Trained/weighted computational prediction methods | ||||||||||
| PolyPhen 2 | 13,807 | 5,102 | 13,863 | 6,010 | 0.71 | 0.71 | 0.70 | 0.73 | 0.72 | 0.43 |
| PolyPhen 2 | 16,206 | 2,703 | 10,199 | 9,674 | 0.69 | 0.64 | 0.51 | 0.86 | 0.78 | 0.39 |
| PhD-SNP | 11,900 | 6,896 | 16,788 | 4,377 | 0.71 | 0.75 | 0.79 | 0.63 | 0.68 | 0.43 |
| SNPs&GO | 13,736 | 5,487 | 17,028 | 1,382 | 0.82 | 0.71 | 0.76 | 0.65 | ||
| nsSNPAnalyzer | 4,360 | 2,778 | 1,319 | 943 | 0.60 | 0.59 | 0.58 | 0.61 | 0.60 | 0.19 |
| SNAP | 16,000 | 2,146 | 8,190 | 6,387 | 0.72 | 0.67 | 0.56 | 0.83 | 0.47 | |
| MutPred | 13,829 | 2,507 | 15,891 | 4,557 | 0.81 | 0.79 | 0.78 | 0.85 | 0.84 | 0.63 |
| FATHMM (weighted) | 14,231 | 1,633 | 10,146 | 2,336 | 0.86 | 0.86 | 0.86 | |||
tp, fp, tn, fn refer to the number of true positives, false positives, true negatives, and false negatives, respectively.
Accuracy, Precision, Specificity, Sensitivity, NPV, and MCC are calculated from normalized numbers.
“Probably Pathogenic” predictions classed as disease causing.
“Probably Pathogenic” predictions classed as functionally neutral.
The performances of alternative computational prediction algorithms have been reproduced with permission from Thusberg et al. (2011). Copyright 2012, Wiley.
Specificity and Sensitivity of Computational Prediction Methods in Four Well-Characterized Genes (BRCA1, MSH2, MLH1, and TP53)
| Algorithm | Specificity | Sensitivity | Specificity | Sensitivity | Specificity | Sensitivity | Specificity | Sensitivity |
|---|---|---|---|---|---|---|---|---|
| Theoretical/unweighted computational prediction methods | ||||||||
| SIFT | 0.31 | 0.46 | 0.52 | 0.72 | 0.75 | |||
| Align-GVGD | 0.71 | 0.55 | 0.52 | 0.82 | ||||
| FATHMM (unweighted) | 0.56 | 0.65 | 0.84 | 0.77 | 0.71 | |||
| Trained/weighted computational prediction methods | ||||||||
| PolyPhen-2 | 0.38 | 0.77 | 0.36 | 0.90 | 0.90 | 0.84 | ||
| X-Var | 0.56 | 0.27 | 0.33 | 0.50 | 0.96 | |||
| FATHMM (weighted) | 0.47 | 0.79 | 0.24 | 0.97 | NA | |||
The specificity for our weighted method in this instance is uninformative as there was only one neutral mutation falling within conserved protein domains.
The performances of alternative computational prediction algorithms have been reproduced with permission from Hicks et al. (2011). Copyright 2012, Wiley.
Figure 2. Receiver operating characteristic (ROC) curves for the top-ranking computational prediction algorithms evaluated using the SwissVar dataset. Here, we compare our unweighted method against SIFT and PANTHER (A: full curve; B: 10% false positive rate) whereas our weighted method is compared to SNPs&GO and MutPred (C: full curve; D: 10% false positive rate). Full ROC curves for all computational prediction algorithms evaluated are available in Supp. Figure S3.
Figure 3. The intersection of disease-associated amino acid substitutions correctly identified by the top-ranking computational prediction algorithms evaluated using the SwissVar dataset. Here, we compare our unweighted method against SIFT and PANTHER (A) whereas our weighted method is compared to SNPs&GO and MutPred (B).
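For readers who want to reproduce curves of the kind shown in Figure 2, the sketch below builds ROC points from prediction scores and integrates the area under the curve, optionally truncated at a maximum false positive rate such as 10%. It assumes that lower scores indicate a more damaging substitution; the function names and the toy data are hypothetical and are not taken from the FATHMM software.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    Assumes lower scores are more likely disease-associated, so the
    threshold sweeps from the lowest score upward. Tied scores are
    added to the curve as a single step.
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate the segment so it ends exactly at max_fpr.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data, invented for illustration only.
toy_scores = [-5.0, -4.2, -3.1, -2.0, -0.5, 0.3, 1.1]
toy_labels = [True, True, True, False, True, False, False]
pts = roc_points(toy_scores, toy_labels)
print(auc(pts), auc(pts, max_fpr=0.1))
```

The truncated area is unnormalized here; dividing it by `max_fpr` would rescale it to the 0 to 1 range if a comparable figure is wanted.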
Performance of Computational Prediction Methods Using the SwissVar Benchmarking Dataset
| Method | tp | fp | tn | fn | Accuracy | Precision | Specificity | Sensitivity | NPV | MCC |
|---|---|---|---|---|---|---|---|---|---|---|
| Unweighted computational prediction methods | ||||||||||
| SIFT | 15,634 | 6,318 | 28,236 | 7,716 | 0.67 | 0.71 | ||||
| PolyPhen 1 | 12,803 | 8,759 | 18,603 | 4,497 | 0.71 | 0.70 | 0.68 | 0.42 | ||
| PANTHER | 8,283 | 5,842 | 17,447 | 5,162 | 0.68 | 0.71 | 0.75 | 0.62 | 0.66 | 0.37 |
| FATHMM (unweighted) | 14,311 | 6,717 | 29,454 | 9,429 | 0.71 | 0.76 | 0.81 | 0.60 | 0.67 | 0.43 |
| Weighted/trained computational prediction methods | ||||||||||
| PolyPhen 2 (HumDiv) | 19,782 | 13,592 | 20,874 | 3,204 | 0.73 | 0.69 | 0.61 | 0.86 | 0.81 | 0.48 |
| PolyPhen 2 (HumVar) | 19,928 | 13,239 | 21,227 | 3,058 | 0.74 | 0.69 | 0.62 | 0.87 | 0.82 | 0.50 |
| PhD-SNP Sequence | 15,695 | 9,380 | 26,838 | 8,062 | 0.70 | 0.72 | 0.74 | 0.66 | 0.69 | 0.40 |
| PhD-SNP Profile | 17,548 | 7,233 | 27,731 | 5,161 | 0.78 | 0.79 | 0.79 | 0.77 | 0.78 | 0.57 |
| PMut | 13,498 | 12,156 | 23,636 | 10,159 | 0.62 | 0.63 | 0.66 | 0.57 | 0.61 | 0.23 |
| SNPs&GO | 17,768 | 3,768 | 29,101 | 5,655 | 0.82 | 0.87 | 0.89 | 0.76 | 0.79 | 0.65 |
| MutPred | 21,365 | 3,500 | 32,719 | 2,392 | ||||||
| FATHMM (weighted) | 15,916 | 3,017 | 19,713 | 4,496 | 0.82 | 0.85 | 0.87 | 0.78 | 0.80 | 0.65 |
tp, fp, tn, fn refer to the number of true positives, false positives, true negatives, and false negatives, respectively.
Accuracy, Precision, Specificity, Sensitivity, NPV, and MCC are calculated from normalized numbers.
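The six performance measures can be recomputed from the tp, fp, tn, and fn counts. The sketch below assumes that "normalized numbers" means rescaling the disease-associated and neutral classes to equal size before computing each measure. That interpretation is an assumption, but it reproduces the FATHMM (weighted) row of the SwissVar table. The function name `normalized_metrics` is hypothetical.

```python
from math import sqrt


def normalized_metrics(tp, fp, tn, fn):
    """Performance measures after rescaling both classes to unit size."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Normalized confusion matrix: each class sums to 1 (assumed meaning
    # of "normalized numbers" in the table footnote).
    tp_n, fn_n = sensitivity, 1 - sensitivity
    tn_n, fp_n = specificity, 1 - specificity
    mcc_den = sqrt((tp_n + fp_n) * (tp_n + fn_n) * (tn_n + fp_n) * (tn_n + fn_n))
    return {
        "accuracy": (tp_n + tn_n) / 2,
        "precision": tp_n / (tp_n + fp_n),
        "specificity": specificity,
        "sensitivity": sensitivity,
        "npv": tn_n / (tn_n + fn_n),
        "mcc": (tp_n * tn_n - fp_n * fn_n) / mcc_den,
    }


# FATHMM (weighted) counts from the SwissVar table above.
m = normalized_metrics(tp=15916, fp=3017, tn=19713, fn=4496)
print({name: round(value, 2) for name, value in m.items()})
```

Rescaling the classes keeps a benchmark with unequal numbers of disease-associated and neutral substitutions from inflating accuracy, precision, or NPV in favor of the larger class.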