Khalid Mahmood, Chol-Hee Jung, Gayle Philip, Peter Georgeson, Jessica Chung, Bernard J Pope, Daniel J Park.
Abstract
BACKGROUND: Genetic variant effect prediction algorithms are used extensively in clinical genomics and research to determine the likely consequences of amino acid substitutions on protein function. It is vital that we better understand their accuracies and limitations because published performance metrics are confounded by serious problems of circularity and error propagation. Here, we derive three independent, functionally determined human mutation datasets, UniFun, BRCA1-DMS and TP53-TA, and employ them, alongside previously described datasets, to assess the pre-eminent variant effect prediction tools.
Keywords: Benchmarking; Functional assays; Functional datasets; Genomic screening; Mutation assessment; Pathogenicity prediction; Protein function; Variant effect prediction
Year: 2017 PMID: 28511696 PMCID: PMC5433009 DOI: 10.1186/s40246-017-0104-8
Source DB: PubMed Journal: Hum Genomics ISSN: 1473-9542 Impact factor: 4.639
Characteristics of the protein variant effect prediction tools assessed in this study. The table indicates their scoring ranges and thresholds, training data, summary information about features and, where applicable, machine learning method
| Prediction tool | Score range | Deleterious score cutoff | Training data | Features | Machine learning method |
|---|---|---|---|---|---|
| GERP++ | −12.0 to 6.17 | >0.047 | None | Infers conserved or constrained elements from 33 mammalian genomes | – |
| fitCons | 0 to 1 | >0.4 | None | Functional genomics data mainly sourced from chromatin analysis, e.g. ChIP-seq, and evolutionary conservation data | – |
| SIFT | 0 to 1 | <0.05 | None | Conservation data (MSA of homologous sequences) transformed into a normalised probability matrix | – |
| PolyPhen | 0 to 1 | >0.5 | HumVar, HumDiv | Conservation data (MSA of homologous sequences), protein functional domain data and protein structural features | Naïve Bayes classifier |
| CADD | 0 to 35+ | >15 | Simulated, Swissvar, HumVar | Integrates several annotations into a single score, e.g. SIFT, GERP++, PolyPhen, CpG distance, GC content | SVM |
| Condel | 0 to 1 | >0.5 | | Builds a unified classification by integrating output from a collection of tools, e.g. SIFT, PolyPhen | Weighted average of normalised scores |
| REVEL | 0 to 1 | >0.5 | HGMD, ESP | HGMD and rare ESP variants used for training | Random forest |
| fathmm | 0 to 1 | >0.45 | HGMD, Swiss-Prot | Combines evolutionary conservation with disease-specific protein weights for intolerance to mutation | Hidden Markov models |
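
The cutoffs in the table run in different directions: SIFT flags a substitution as deleterious when its score falls below the cutoff, while the other tools flag scores above theirs. A minimal Python sketch of applying the cutoffs is given here. The `CUTOFFS` dictionary and the `is_deleterious` helper are illustrative names rather than part of the study's code, and the strictness of each inequality is taken literally from the `>` and `<` signs in the table.

```python
# Cutoffs transcribed from the table above. The direction flag is an
# assumption of this sketch: it records whether scores above or below the
# cutoff are called deleterious (only SIFT uses "below").
CUTOFFS = {
    "GERP++": (0.047, "above"),
    "fitCons": (0.4, "above"),
    "SIFT": (0.05, "below"),
    "PolyPhen": (0.5, "above"),
    "CADD": (15.0, "above"),
    "Condel": (0.5, "above"),
    "REVEL": (0.5, "above"),
    "fathmm": (0.45, "above"),
}


def is_deleterious(tool: str, score: float) -> bool:
    """Return True if `score` crosses the tool's deleterious cutoff."""
    cutoff, direction = CUTOFFS[tool]
    if direction == "above":
        return score > cutoff
    return score < cutoff


# Hypothetical scores, for illustration only.
print(is_deleterious("SIFT", 0.01))   # True: below the 0.05 cutoff
print(is_deleterious("CADD", 12.3))   # False: not above 15
```

Keeping the direction next to each cutoff avoids silently inverting SIFT calls when the tools are compared side by side.
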
Fig. 1 Venn diagram of datasets used in this study showing the overlaps among the deleterious and benign variants observed in these datasets. Humsavar displays a relatively high degree of overlap with the ClinvarHC and Swissvar datasets. The remaining datasets overlap to relatively small extents
Composition of the variant reference datasets used in this study. This table separates mutation catalogues into those derived from clinical databases (disease mutation catalogues) and those derived directly from functional assays (functional mutation catalogues). The table provides summary information for the numbers of proteins and variants of different classifications that have contributed to each dataset. See Additional file 1: Figure S1 and Table S1 for more detailed information
| Dataset | Total variants | Deleterious | Benign | Total proteins |
|---|---|---|---|---|
| Disease mutation catalogues | | | | |
| ClinvarHC | 29,752 | 19,461 | 10,291 | 2979 |
| Humsavar | 43,878 | 19,329 | 24,549 | 10,231 |
| Swissvar | 12,729 | 4526 | 8203 | 5036 |
| Varibench | 10,266 | 4309 | 5957 | 4203 |
| Functional mutation catalogues | | | | |
| TP53-TA | 1886 | 582 | 1304 | 1 |
| BRCA1-DMS | 1683 | 408 | 1275 | 1 |
| UniFun | 11,519 | 9503 | 2016 | 2209 |
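
The datasets differ considerably in their balance of deleterious and benign variants, which matters when reading threshold-based metrics such as MCC. The sketch below recomputes each total and the deleterious fraction from the counts in the table. The `DATASETS` dictionary and `deleterious_fraction` function are names introduced here for illustration only.

```python
# (deleterious, benign) counts transcribed from the table above.
DATASETS = {
    "ClinvarHC": (19461, 10291),
    "Humsavar": (19329, 24549),
    "Swissvar": (4526, 8203),
    "Varibench": (4309, 5957),
    "TP53-TA": (582, 1304),
    "BRCA1-DMS": (408, 1275),
    "UniFun": (9503, 2016),
}


def deleterious_fraction(deleterious: int, benign: int) -> float:
    """Fraction of variants in a dataset that are labelled deleterious."""
    return deleterious / (deleterious + benign)


for name, (deleterious, benign) in DATASETS.items():
    total = deleterious + benign
    fraction = deleterious_fraction(deleterious, benign)
    print(f"{name}: total={total}, deleterious={fraction:.1%}")
```
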
Fig. 2 Histogram depicting apparent accuracies of in silico variant effect predictors based on ROC curve AUCs for the benchmarking datasets used in this study
Measured accuracies of eight in silico predictors as benchmarked against seven different variant reference datasets. Measured accuracies are calculated as the areas under the respective ROC curves (AUCs) and Matthews correlation coefficients (MCCs). See Additional file 1: Figure S4 for the ROC curve graphs
| Prediction tool | ClinvarHC AUC | ClinvarHC MCC | Humsavar AUC | Humsavar MCC | Swissvar AUC | Swissvar MCC | Varibench AUC | Varibench MCC | TP53-TA AUC | TP53-TA MCC | BRCA1-DMS AUC | BRCA1-DMS MCC | UniFun AUC | UniFun MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GERP++ | 0.863 | 0.587 | 0.777 | 0.469 | 0.677 | 0.286 | 0.571 | 0.15 | 0.719 | 0.283 | 0.544 | 0.069 | 0.538 | 0.04 |
| fitCons | 0.641 | 0.3 | 0.533 | 0.033 | 0.564 | 0.008 | 0.651 | 0.024 | 0.557 | 0 | 0.559 | 0 | 0.515 | 0.033 |
| SIFT | 0.848 | 0.489 | 0.841 | 0.543 | 0.698 | 0.289 | 0.651 | 0.228 | 0.835 | 0.484 | 0.653 | 0.199 | 0.631 | 0.184 |
| PolyPhen | 0.827 | 0.447 | 0.831 | 0.541 | 0.699 | 0.301 | 0.672 | 0.256 | 0.859 | 0.469 | 0.596 | 0.088 | 0.623 | 0.168 |
| CADD | 0.939 | 0.731 | 0.851 | 0.57 | 0.73 | 0.331 | 0.663 | 0.25 | 0.869 | 0.418 | 0.556 | 0.032 | 0.589 | 0.119 |
| Condel | 0.879 | 0.51 | 0.911 | 0.664 | 0.728 | 0.333 | 0.86 | 0.57 | 0.883 | 0.074 | 0.747 | 0.172 | 0.614 | 0.098 |
| REVEL | 0.945 | 0.68 | 0.968 | 0.83 | 0.792 | 0.462 | 0.89 | 0.59 | 0.907 | 0.465 | 0.737 | 0.088 | 0.63 | 0.148 |
| fathmm | 0.787 | 0.288 | 0.902 | 0.538 | 0.701 | 0.253 | 0.936 | 0.509 | 0.53 | 0 | 0.621 | 0 | 0.531 | 0.02 |
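
For readers unfamiliar with the two metrics in this table, the following is a minimal, dependency-free Python sketch of how an AUC and an MCC can be computed from labels and scores. It is not the study's pipeline. The function names, the toy data and the 0.5 threshold are assumptions made for illustration, and the AUC uses the pairwise (Mann-Whitney) formulation, which equals the area under the ROC curve. Scores are assumed to increase with predicted deleteriousness, so SIFT-style scores would need to be inverted first.

```python
from math import sqrt


def roc_auc(labels, scores):
    """AUC as the probability that a deleterious variant outscores a benign one.

    Ties count as half a win. This O(n*m) pairwise form is for clarity only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mcc(labels, predictions):
    """Matthews correlation coefficient from binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0 when the coefficient is undefined.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy data (hypothetical): 1 = deleterious, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]
predictions = [1 if s > 0.5 else 0 for s in scores]

print(round(roc_auc(labels, scores), 3))   # 0.889
print(round(mcc(labels, predictions), 3))  # 0.333
```

The AUC summarises ranking quality across all thresholds, whereas the MCC depends on the single cutoff used to make binary calls, which is one reason the two columns can diverge for the same tool and dataset.
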
Fig. 3 Apparent prediction accuracies of variant effect prediction tools when assessed using ClinvarHC versus functional mutation derived datasets, reported as AUCs derived from ROC curves
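
The comparison in Fig. 3 can be approximated numerically from the AUC values in the accuracy table. The sketch below subtracts each functional-dataset AUC from the ClinvarHC AUC for the same tool. The `AUC` dictionary and `auc_drop` function are illustrative names, and the simple difference is an assumption of this sketch rather than a statistic reported in the source.

```python
# AUC values transcribed from the accuracy table above.
AUC = {
    "GERP++": {"ClinvarHC": 0.863, "TP53-TA": 0.719, "BRCA1-DMS": 0.544, "UniFun": 0.538},
    "fitCons": {"ClinvarHC": 0.641, "TP53-TA": 0.557, "BRCA1-DMS": 0.559, "UniFun": 0.515},
    "SIFT": {"ClinvarHC": 0.848, "TP53-TA": 0.835, "BRCA1-DMS": 0.653, "UniFun": 0.631},
    "PolyPhen": {"ClinvarHC": 0.827, "TP53-TA": 0.859, "BRCA1-DMS": 0.596, "UniFun": 0.623},
    "CADD": {"ClinvarHC": 0.939, "TP53-TA": 0.869, "BRCA1-DMS": 0.556, "UniFun": 0.589},
    "Condel": {"ClinvarHC": 0.879, "TP53-TA": 0.883, "BRCA1-DMS": 0.747, "UniFun": 0.614},
    "REVEL": {"ClinvarHC": 0.945, "TP53-TA": 0.907, "BRCA1-DMS": 0.737, "UniFun": 0.63},
    "fathmm": {"ClinvarHC": 0.787, "TP53-TA": 0.53, "BRCA1-DMS": 0.621, "UniFun": 0.531},
}

FUNCTIONAL = ("TP53-TA", "BRCA1-DMS", "UniFun")


def auc_drop(tool: str) -> dict:
    """ClinvarHC AUC minus each functional-dataset AUC for one tool."""
    row = AUC[tool]
    return {name: round(row["ClinvarHC"] - row[name], 3) for name in FUNCTIONAL}


for tool in AUC:
    print(tool, auc_drop(tool))
```

A positive value means the tool scored a higher AUC on ClinvarHC than on the functional dataset; a negative value means the reverse.
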