Dominik G Grimm, Chloé-Agathe Azencott, Fabian Aicheler, Udo Gieraths, Daniel G MacArthur, Kaitlin E Samocha, David N Cooper, Peter D Stenson, Mark J Daly, Jordan W Smoller, Laramie E Duncan, Karsten M Borgwardt.
Abstract
Prioritizing missense variants for further experimental investigation is a key challenge in current sequencing studies for exploring complex and Mendelian diseases. A large number of in silico tools have been employed for the task of pathogenicity prediction, including PolyPhen-2, SIFT, FatHMM, MutationTaster-2, MutationAssessor, Combined Annotation Dependent Depletion, LRT, phyloP, and GERP++, as well as optimized methods of combining tool scores, such as Condel and Logit. Due to the wealth of these methods, an important practical question to answer is which of these tools generalize best, that is, correctly predict the pathogenic character of new variants. Here we demonstrate, in a study of 10 tools on five datasets, that such a comparative evaluation of these tools is hindered by two types of circularity: they arise due to (1) the same variants or (2) different variants from the same protein occurring both in the datasets used for training and for evaluation of these tools, which may lead to overly optimistic results. We show that comparative evaluations of predictors that do not address these types of circularity may erroneously conclude that circularity-confounded tools are the most accurate among all tools, and may even outperform optimized combinations of tools.
Keywords: exome sequencing; pathogenicity prediction tools
Year: 2015 PMID: 25684150 PMCID: PMC4409520 DOI: 10.1002/humu.22768
Source DB: PubMed Journal: Hum Mutat ISSN: 1059-7794 Impact factor: 4.878
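The two types of circularity described in the abstract can be illustrated with a small overlap check between a training set and an evaluation set. This is only a minimal sketch, not the procedure used in the study: the `Variant` record, the protein identifiers, and the toy variants are all hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    protein: str   # hypothetical protein identifier
    position: int  # residue position in the protein
    ref: str       # reference amino acid
    alt: str       # substituted amino acid


def circularity_overlap(train, evaluation):
    """Split evaluation variants by the kind of overlap with the training set."""
    train_variants = set(train)
    train_proteins = {v.protein for v in train}
    # Type 1: the very same variant appears in training and evaluation data.
    type1 = [v for v in evaluation if v in train_variants]
    # Type 2: a different variant, but from a protein seen during training.
    type2 = [v for v in evaluation
             if v not in train_variants and v.protein in train_proteins]
    # Neither the variant nor its protein was seen during training.
    clean = [v for v in evaluation if v.protein not in train_proteins]
    return type1, type2, clean


# Toy data (illustrative only).
train = [Variant("P1", 10, "A", "V"), Variant("P1", 25, "G", "D"),
         Variant("P2", 7, "L", "P")]
evaluation = [Variant("P1", 10, "A", "V"), Variant("P1", 99, "K", "E"),
              Variant("P3", 5, "S", "T")]

type1, type2, clean = circularity_overlap(train, evaluation)
print(len(type1), len(type2), len(clean))
```

Removing only the `type1` variants leaves type 2 circularity in place, since the remaining evaluation variants can still share proteins with the training data.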
Overview of the Prediction Tools Used in This Study
| Tool (abbreviation) | Version | N | AA | Purpose, as stated by developers |
|---|---|---|---|---|
| PolyPhen‐2 (PP2) | 2.2.2 | Yes | Yes | “Predicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations” |
| MutationTaster‐2 (MT2) | 2 | Yes | No | “Evaluation of the disease‐causing potential of DNA sequence alterations” |
| MutationAssessor (MASS) | 2 | Yes | Yes | “Predicts the functional impact of amino acid substitutions in proteins, such as mutations discovered in cancer or missense polymorphisms” |
| LRT | – | Yes | No | “Identify a subset of deleterious mutations that disrupt highly conserved amino acids within protein‐coding sequences, which are likely to be unconditionally deleterious” |
| SIFT | 1.03 | Yes | Yes | “Predicts whether an amino acid substitution affects protein function” |
| GERP++ | – | Yes | No | “Identifies constrained elements in multiple alignments by quantifying substitution deficits. These deficits represent substitutions that would have occurred if the element were neutral DNA, but did not occur because the element has been under functional constraint. We refer to these deficits as “rejected substitutions.” Rejected substitutions are a natural measure of constraint that reflects the strength of past purifying selection on the element” |
| phyloP | – | Yes | No | “Compute conservation or acceleration |
| FatHMM unweighted (FatHMM‐U) | 2.2–2.3 | No | Yes | Predicts “functional consequences of both coding variants, that is, nonsynonymous single‐nucleotide variants, and noncoding variants” |
| FatHMM weighted (FatHMM‐W) | 2.2–2.3 | No | Yes | Predicts “functional consequences of both coding variants, that is, nonsynonymous single‐nucleotide variants, and noncoding variants” and its weighting scheme attributes higher tolerance scores to SNVs in proteins, related proteins, or domains that already include a high fraction of pathogenic variants |
| Combined Annotation Dependent Depletion (CADD) | 1.0 | Yes | No | “CADD is a tool for scoring the deleteriousness of single‐nucleotide variants as well as insertion/deletions variants in the human genome” |
For each tool, the Version column gives the version used, the N column shows whether the tool accepts nucleotide changes as input, and the AA column shows whether it accepts amino acid changes as input. The last column provides a description of the tool, as stated by the developers.
http://genetics.bwh.harvard.edu/pph2/index.shtml
http://www.mutationtaster.org
http://mutationassessor.org
http://www.genetics.wustl.edu/jflab/lrt_query.html
http://sift.jcvi.org
http://mendel.stanford.edu/sidowlab/downloads/gerp/index.html
http://compgen.bscb.cornell.edu/phast/
http://fathmm.biocompute.org.uk
http://cadd.gs.washington.edu/home
Purpose of Each Dataset, as Described by Dataset Creators
| Dataset | Purpose | Positive control: damaging/deleterious/disease causing/pathogenic | Negative control: neutral/benign/nondamaging/tolerated |
|---|---|---|---|
| HumVar | Mendelian disease variant identification | “All disease‐causing mutations from UniProtKB” | “Common human nsSNPs (MAF > 1%) without annotated involvement in disease…treated as nondamaging” |
| ExoVar | “Dataset composed of pathogenic nsSNVs and nearly nonpathogenic rare nsSNVs” | “5,340 alleles with known effects on the molecular function causing human Mendelian diseases from the UniProt database…positive control variants.” “Pathogenic nsSNVs” | “4,752 rare (alternative/derived allele frequency <1%) nsSNVs with at least one homozygous genotype for the alternative/derived allele in the 1000 Genomes Project…negative control variants.” “Other rare variants” |
| VariBench | “Variation datasets affecting protein tolerance” | “The pathogenic dataset of 19,335 missense mutations obtained from the PhenCode database (downloaded in June 2009), IDbases and from 18 individual LSDBs. For this dataset, the variations along with the variant position mappings to RefSeq protein (≥99% match), RefSeq mRNA, and RefSeq genomic sequences are available for download.” | “This is the neutral dataset or nonsynonymous coding SNP dataset comprising 21,170 human nonsynonymous coding SNPs with allele frequency >0.01 and chromosome sample count >49 from the dbSNP database build 131. This dataset was filtered for the disease‐associated SNPs. The variant position mapping for this dataset was extracted from dbSNP database.” |
| predictSNP | “Benchmark dataset used for the evaluation of…prediction tools and training of consensus classifier PredictSNP” | Disease‐causing and deleterious variants from | Neutral variants from |
| SwissVar | “Comprehensive collection of single amino acid polymorphisms (SAPs) and diseases in the UniProtKB/Swiss‐Prot knowledgebase” | “A variant is classified as disease when it is found in patients and disease association is reported in literature. However, this classification is not a definitive assessment of pathogenicity” | “A variant is classified as polymorphism if no disease association has been reported” |
For each dataset, the first column shows the general purpose. The last two columns describe the positive and negative control categories of variants.
http://genetics.bwh.harvard.edu/pph2/dokuwiki/overview
http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1003143
http://structure.bmc.lu.se/VariBench/tolerance_dataset1.php
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003440
http://bioinformatics.oxfordjournals.org/content/26/6/851.long
http://swissvar.expasy.org/cgi‐bin/swissvar/documentation
All Datasets Used in This Study
| Datasets | Deleterious variants (D) | Neutral variants (N) | Total | Ratio (D:Total) | Tools potentially trained on data (fully or partly) | Removed variants overlapping with: |
|---|---|---|---|---|---|---|
| HumVar | 21,090 | 19,299 | 40,389 | 0.52 | MT2, MASS, PP2, FatHMM‐W | CADD training data |
| ExoVar | 5,156 | 3,694 | 8,850 | 0.58 | MT2, MASS, PP2, FatHMM‐W | CADD training data |
| VariBenchSelected | 4,309 | 5,957 | 10,266 | 0.42 | MT2 | CADD training data, |
| predictSNPSelected | 10,000 | 6,098 | 16,098 | 0.62 | MT2 | CADD training data, |
| SwissVarSelected | 4,526 | 8,203 | 12,729 | 0.36 | MT2 | CADD training data, |
These preprocessed and filtered datasets are used to evaluate the performance of different prediction tools.
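The Total and Ratio (D:Total) columns follow directly from the two count columns. A minimal sketch that recomputes them, with the five rows given in table order (the variable and function names are illustrative):

```python
# (deleterious, neutral) counts for the five rows of the table, in order.
rows = [(21090, 19299), (5156, 3694), (4309, 5957), (10000, 6098), (4526, 8203)]


def total_and_ratio(deleterious, neutral):
    """Return the row total and the fraction of deleterious variants (2 decimals)."""
    total = deleterious + neutral
    return total, round(deleterious / total, 2)


summary = [total_and_ratio(d, n) for d, n in rows]
for (d, n), (total, ratio) in zip(rows, summary):
    print(f"{d:>6,} {n:>6,} {total:>6,} {ratio:.2f}")
```

The ratios range from 0.36 to 0.62, so none of the five datasets is strongly dominated by one class.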
Figure 1. Evaluation of the 10 different pathogenicity prediction tools (by AUC) over five datasets. The hatched bars indicate potentially biased results, due to the overlap (or possible overlap) between the evaluation data and the data used (by tool developers) for training the prediction tool. The dotted bars indicate that the tool is biased due to type 2 circularity. The protein MV predictor and the logistic regression (over the features used in the weighting scheme of FatHMM‐W) are discussed in the second part of the Results section.
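The AUC used to compare tools can be read as the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen neutral one. The sketch below computes it by direct pairwise comparison; it assumes scores oriented so that higher means more damaging, and the scores and labels are made up for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half).

    labels: 1 = pathogenic, 0 = neutral.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one variant of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical tool scores for three pathogenic and three neutral variants.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(round(auc(scores, labels), 3))
```

The pairwise form is quadratic in the number of variants; for datasets of the size listed above, a rank-based implementation from a statistics library would be the more practical choice.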
Protein Categories and Variants Per Category
| Datasets | “Pure” pathogenic proteins | Pathogenic variants in “pure” proteins | “Pure” neutral proteins | Neutral variants in “pure” proteins | Mixed proteins | Variants in mixed proteins | Total number of proteins |
|---|---|---|---|---|---|---|---|
| HumVar | 1,277 | 10,484 | 8,400 | 17,140 | 911 | 12,765 | 10,588 |
| ExoVar | 891 | 4,336 | 2,794 | 3,478 | 165 | 1,036 | 3,850 |
| VariBenchSelected | 286 | 3,865 | 4,139 | 5,869 | 65 | 532 | 4,490 |
| predictSNPSelected | 855 | 7,090 | 3,738 | 5,649 | 228 | 3,359 | 4,821 |
| SwissVarSelected | 1,444 | 2,749 | 3,614 | 6,568 | 540 | 3,412 | 5,598 |
Overview of the total number of proteins per dataset and of the composition of these datasets.
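The three protein categories in this table can be derived from a list of labeled variants by counting, per protein, how many pathogenic and neutral variants it contains. A minimal sketch, assuming each variant is given as a `(protein_id, label)` pair with label 1 for pathogenic and 0 for neutral; the identifiers and toy data are hypothetical.

```python
from collections import defaultdict


def categorize_proteins(variants):
    """Count proteins and variants in the pure pathogenic, pure neutral, and mixed categories."""
    per_protein = defaultdict(lambda: [0, 0])  # protein -> [neutral, pathogenic]
    for protein, label in variants:
        per_protein[protein][label] += 1

    # category -> [number of proteins, number of variants]
    summary = {"pure pathogenic": [0, 0], "pure neutral": [0, 0], "mixed": [0, 0]}
    for neutral, pathogenic in per_protein.values():
        if neutral == 0:
            key = "pure pathogenic"
        elif pathogenic == 0:
            key = "pure neutral"
        else:
            key = "mixed"
        summary[key][0] += 1
        summary[key][1] += neutral + pathogenic
    return summary


# Toy data: P1 is pure pathogenic, P2 is pure neutral, P3 is mixed.
variants = [("P1", 1), ("P1", 1), ("P2", 0), ("P3", 1), ("P3", 0), ("P3", 0)]
summary = categorize_proteins(variants)
print(summary)
```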
Figure 2. In the VariBenchSelected dataset, most SNPs are in genes with only neutral or only pathogenic variants. A: Protein perspective: proportion of proteins containing only neutral variants (“neutral‐only”), only pathogenic variants (“pathogenic‐only”), and both types of variants (“mixed”). Only 1.4% of the proteins are mixed. B: Variant perspective: proportions of variants in each of the three categories of proteins. Only 5.2% of variants are in mixed proteins. C: Fractions of variants in the VariBenchSelected dataset that belong to proteins with various ratios of pathogenic‐to‐neutral variants, binned into increasingly narrow bins, approaching balanced proteins. The open interval ]0.0, 1.0[ contains all mixed proteins (as in B). Only 0.7% of all variants belong to almost perfectly balanced proteins (closed interval [0.4, 0.6]).
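The 1.4% and 5.2% figures quoted in this caption can be cross-checked against the third row of the protein categories table (286 and 4,139 pure proteins, 65 mixed proteins, 4,490 proteins in total). The sketch below assumes that this row describes VariBenchSelected, which the matching percentages suggest.

```python
# Values taken from the third row of the protein categories table.
mixed_proteins, total_proteins = 65, 4490
pathogenic_pure, neutral_pure, mixed_variants = 3865, 5869, 532

total_variants = pathogenic_pure + neutral_pure + mixed_variants
pct_mixed_proteins = round(100 * mixed_proteins / total_proteins, 1)
pct_mixed_variants = round(100 * mixed_variants / total_variants, 1)

print(total_variants, pct_mixed_proteins, pct_mixed_variants)
```

The variant total of 10,266 also matches the third row of the dataset overview table, which is a useful consistency check between the two tables.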
Figure 3. Performance of 10 pathogenicity prediction tools according to protein pathogenic‐to‐neutral variant ratio. Evaluation of tool performance on subsets of VariBenchSelected, predictSNPSelected, and SwissVarSelected, defined according to the relative proportions of pathogenic and neutral variants in the proteins they contain. “Pure” indicates variants belonging to proteins containing only one class of variant. ]x, y[ indicates variants belonging to mixed proteins containing a ratio of pathogenic‐to‐neutral variants between x and y; ]0.0, 1.0[ therefore indicates all mixed proteins (the ratios of 0.0 and 1.0 being excluded by the reversed brackets). While FatHMM‐W performs well or excellently on variants belonging to pure proteins (VariBenchSelected and predictSNPSelected), it performs poorly on those belonging to mixed proteins.
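The subsets described in this caption can be built by computing a ratio for each protein and keeping the variants whose protein falls inside a given interval. In the sketch below the ratio is taken to be the fraction of a protein's variants that are pathogenic; this is an assumption, chosen because the intervals in the captions lie between 0.0 and 1.0. The function names and toy data are hypothetical.

```python
from collections import Counter


def pathogenic_fraction(variants):
    """Map each protein to the fraction of its variants labeled pathogenic (label 1)."""
    total = Counter(protein for protein, _ in variants)
    pathogenic = Counter(protein for protein, label in variants if label == 1)
    return {protein: pathogenic[protein] / total[protein] for protein in total}


def select_variants(variants, low, high, closed=False):
    """Keep variants whose protein has a pathogenic fraction in ]low, high[ or [low, high]."""
    fraction = pathogenic_fraction(variants)

    def inside(f):
        return low <= f <= high if closed else low < f < high

    return [(p, y) for p, y in variants if inside(fraction[p])]


# Toy data: P1 pure pathogenic, P2 pure neutral, P3 mixed (1/3), P4 balanced (1/2).
variants = [("P1", 1), ("P1", 1), ("P2", 0),
            ("P3", 1), ("P3", 0), ("P3", 0),
            ("P4", 1), ("P4", 0)]

mixed = select_variants(variants, 0.0, 1.0)                    # open interval ]0.0, 1.0[
balanced = select_variants(variants, 0.4, 0.6, closed=True)    # closed interval [0.4, 0.6]
print(len(mixed), len(balanced))
```

Evaluating a tool separately on each subset, rather than on the pooled data, is what exposes a predictor that relies on protein-level information.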
Figure 4. Comparison of the performance of two metapredictors (Logit and Condel) and their component tools, across five datasets. Bar heights reflect AUC for each tool and tool combination. Logit and Condel are metapredictors combining MASS, PP2, and SIFT. The “+” versions of Logit and Condel also include FatHMM‐W. While effective in prediction, FatHMM‐W (alone and in the Logit+ and Condel+ metapredictors) is optimistically biased due to type 2 circularity (see Results section). In the “Selected” datasets, Logit provides the best unbiased performance. SIFT has the lowest performance in the HumVar and ExoVar datasets, but it is also the only predictor that is unbiased in these two datasets.
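A logistic-regression metapredictor of the kind named here takes the scores of several component tools as features and learns a weighted combination of them. The sketch below is an illustration of that idea only, not the Logit implementation evaluated in the study: it fits the weights by plain gradient descent, and the three score columns are hypothetical values rescaled to [0, 1] with higher meaning more damaging (real tools use different scales and directions, so scores would need to be harmonized first).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit(X, y, lr=1.0, epochs=3000):
    """Fit logistic regression weights and bias by batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Combined pathogenicity probability for one variant."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical rescaled scores from three component tools (one column each).
X = [[0.9, 0.8, 0.7], [0.6, 0.9, 0.4], [0.8, 0.3, 0.9], [0.7, 0.7, 0.6],   # pathogenic
     [0.2, 0.1, 0.3], [0.4, 0.2, 0.1], [0.1, 0.5, 0.2], [0.3, 0.3, 0.4]]   # neutral
y = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = fit_logit(X, y)
combined = [predict(w, b, x) for x in X]
print([round(p, 2) for p in combined])
```

Here the model is fitted and applied on the same toy variants purely to keep the example short; as the study stresses, a fair comparison requires that the variants and proteins used for fitting do not reappear in the evaluation data.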