Panagiotis Katsonis, Amanda Koire, Stephen Joseph Wilson, Teng-Kuei Hsu, Rhonald C Lua, Angela Dawn Wilkins, Olivier Lichtarge.
Abstract
Genome-wide association studies (GWAS) and whole-exome sequencing (WES) generate massive amounts of genomic variant information, and a major challenge is to identify which variations drive disease or contribute to phenotypic traits. Because the majority of known disease-causing mutations are exonic non-synonymous single nucleotide variations (nsSNVs), most studies focus on whether these nsSNVs affect protein function. Computational studies show that the impact of nsSNVs on protein function reflects sequence homology and structural information and predict the impact through statistical methods, machine learning techniques, or models of protein evolution. Here, we review impact prediction methods and discuss their underlying principles, their advantages and limitations, and how they compare to and complement one another. Finally, we present current applications and future directions for these methods in biological research and medical genetics.
Keywords: disease causing SNV (single nucleotide variation); functional impact prediction methods; missense variant classification; non-synonymous protein mutations; single nucleotide polymorphism prioritization
Year: 2014 PMID: 25234433 PMCID: PMC4253807 DOI: 10.1002/pro.2552
Source DB: PubMed Journal: Protein Sci ISSN: 0961-8368 Impact factor: 6.725
Figure 1. Production cost and usage of whole exome sequencing over time. As the cost of exome sequencing (blue) decreases, the number of articles containing the phrase “whole exome sequencing” (red) increases. The number of articles is found via Scopus [2]. The production cost is defined by the National Human Genome Research Institute [3] and includes the costs of labor, sequencing instruments, and data processing, but not quality control, technology development, or data analysis. As of April 2014, the production cost for an exome on the Illumina or SOLiD platform at 30-fold coverage was $49.20, although the actual cost to the consumer is considerably higher, with costs advertised in the range of $700 to $2000 per sample [4].
Figure 2. TP53 sequences from different species and variations in their amino acids. Some homology-based methods would predict that the human sequence would tolerate a substitution of alanine to aspartic acid or to cysteine at the highlighted position. Other methods account for the conservation of a position, concluding that the highlighted position would likely tolerate more substitutions than other positions.
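The two lines of reasoning in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the algorithm of any published predictor. The alignment is an invented toy example (not real TP53 sequences), and the function names `tolerated_by_observation` and `column_entropy` are hypothetical. The first function applies the naive rule that a substitution is tolerated if some homolog carries that residue at the position. The second scores how variable a position is with Shannon entropy, one common way to quantify conservation.

```python
import math
from collections import Counter

# Toy alignment, invented for illustration (NOT real TP53 sequences).
# Row 0 plays the role of the query ("human") sequence.
ALIGNMENT = [
    "MKAL",
    "MKDL",
    "MKCL",
    "MRAL",
]

def column(alignment, pos):
    """Return the residues found at one alignment position."""
    return [seq[pos] for seq in alignment]

def tolerated_by_observation(alignment, pos, substitution):
    """Naive rule: tolerated if any aligned sequence carries the residue."""
    return substitution in set(column(alignment, pos))

def column_entropy(alignment, pos):
    """Shannon entropy (bits) of a column; 0 means fully conserved."""
    counts = Counter(column(alignment, pos))
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

if __name__ == "__main__":
    # Position 2 holds A, D, C, A: both A->D and A->C are observed in homologs.
    print(tolerated_by_observation(ALIGNMENT, 2, "D"))
    print(tolerated_by_observation(ALIGNMENT, 2, "C"))
    # Position 2 is more variable than the fully conserved position 0.
    print(column_entropy(ALIGNMENT, 2), column_entropy(ALIGNMENT, 0))
```

In this toy case, the variable position has higher entropy than the invariant one, which mirrors the caption's point that a poorly conserved position is expected to tolerate more substitutions.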
SNP Impact Predictors
| Server | Year | Input | URL | Pubmed ID |
|---|---|---|---|---|
| Structural | | | | |
| SDM | 1997 | PDB ID | | 9051729 |
| Dmutant | 2002 | PDB ID | | 12381853 |
| PoPMuSiC | 2009 | PDB ID | | 19654118 |
| SDS | 2014 | - | Cannot automate | 24795746 |
| Homology | | | | |
| SIFT | 2001 | Protein identifier, SNP IDs, or alignment | | 11337480 |
| Panther | 2003 | Sequence | | 12952881 |
| MAPP | 2005 | Alignment and phylogenetic tree | | 15965030 |
| A-GVGD | 2006 | Alignment | | 16014699 |
| mutationassessor (xvar) | 2011 | Protein identifier or chrom. location | | 21727090 |
| Provean | 2012 | Sequence or chrom. location | | 23056405 |
| Evolutionary action | 2014 | Protein identifier | | |
| Hybrid | | | | |
| PolyPhen | 2002 | Protein identifier or sequence | | 12202775 |
| LogR.E-value | 2004 | Site is down for maintenance | | 14751981 |
| nsSNPAnalyzer | 2005 | Sequence (requires available PDB structure) | | 15980516 |
| SNPeffect | 2005 | Sequence, PDB ID, UniProt ID | | 15608254 |
| LS-SNP | 2005 | SNP, protein or pathway identifier | | 15827081 |
| MUpro | 2005 | Protein sequence, structure (optional) | | 16372356 |
| pmut | 2005 | Sequence (on demand version) or PDB ID (precalculated version) | | 15879453 |
| PhD-SNP | 2006 | Protein identifier or sequence | | 16895930 |
| SNPs3D | 2006 | SNP identifier | | 16551372 |
| Parepro | 2007 | Alignment | | 18005451 |
| SAPRED | 2007 | Sequence and PDB files | | 17384424 |
| Imutant 3.0 | 2007 | Sequence or PDB ID | | 18387208 |
| SNAP | 2007 | Sequence | | 17526529 |
| AUTO-MUTE | 2010 | PDB ID | | 20573719 |
| Mutation Taster | 2010 | Transcript, gene, or ORF | | 20676075 |
| PolyPhen2 | 2010 | Protein or SNP identifier or sequence | | 20354512 |
| Condel | 2011 | Protein identifier, mutation, homology tree | No server, but can get PERL pipeline scripts and then download each tool | 21457909 |
| CADD | 2014 | VCF file | | 24487276 |
| VarMod | 2014 | Sequence | | 24906884 |
| SuSPect | 2014 | Sequence or VCF | | 24810707 |
Figure 3. Average rank of predictions in two CAGI challenges, from the competitions of 2011 and 2012–13. The Cystathionine beta-Synthase (CBS) challenge of 2011 asked predictors to submit the effect of 84 variants on the function of CBS at two different cofactor concentrations [127]; submissions were assessed by nine measures for each concentration (precision, recall, accuracy, harmonic mean F1, Spearman's rank, Student's t test, RMSD, RMSD over z scores, and AUC). The p16 challenge of 2012–13 asked predictors to submit evaluations of how 10 variants of the p16 protein impact its ability to block cell proliferation [128]; submissions were assessed by four measures (AUC, RMSD, Kendall tau, and the number of correct predictions within a range of 10%). A total of 16 participants (color-coded) in one or both challenges submitted one or multiple predictions (20 predictions in 2011 and 22 predictions in 2012–13). The number shown on the vertical axis is an average rank, so to have a rank of one, a prediction would need to rank first in all of the evaluation measures that were used. Conversely, the worst a prediction could do would be to rank last in every evaluation measure, leading to an average rank equal to the total number of prediction sets in that challenge. Besides Action, only participants B and C submitted predictions in both challenges. The Evolutionary Action method can be found at: http://mammoth.bcm.tmc.edu/EvolutionaryAction/.
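The average-rank summary described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration, not the CAGI assessors' code. The scores are invented, the names `average_ranks`, `SCORES`, and `HIGHER_IS_BETTER` are hypothetical, and only two of the measures listed in the caption are used. Ties are not handled specially: tied predictions simply receive consecutive ranks in sort order.

```python
def average_ranks(scores, higher_is_better):
    """Rank predictions within each measure, then average the ranks.

    scores: {prediction: {measure: value}}
    higher_is_better: {measure: bool}, e.g. True for AUC, False for RMSD
    Returns {prediction: mean rank}; 1.0 means first in every measure.
    """
    predictions = list(scores)
    totals = {p: 0.0 for p in predictions}
    for measure, higher in higher_is_better.items():
        # Best prediction first: descending for AUC-like, ascending for RMSD-like.
        ordered = sorted(predictions, key=lambda p: scores[p][measure], reverse=higher)
        for rank, p in enumerate(ordered, start=1):
            totals[p] += rank
    n_measures = len(higher_is_better)
    return {p: totals[p] / n_measures for p in predictions}

# Invented scores for three hypothetical prediction sets.
SCORES = {
    "pred_1": {"AUC": 0.90, "RMSD": 0.20},
    "pred_2": {"AUC": 0.80, "RMSD": 0.30},
    "pred_3": {"AUC": 0.85, "RMSD": 0.25},
}
HIGHER_IS_BETTER = {"AUC": True, "RMSD": False}

if __name__ == "__main__":
    for name, rank in average_ranks(SCORES, HIGHER_IS_BETTER).items():
        print(name, rank)
```

With these toy numbers, `pred_1` is first on both measures and gets an average rank of 1, while `pred_2` is last on both and gets 3, the number of prediction sets. These are the two extremes the caption describes.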
Figure 4. The total number of citations since each method was published, on a logarithmic scale, according to Scopus [2], for methods published before 2014. The methods are colored by the type of information they use, as shown in the figure legend. The older and well-established methods of PolyPhen, DMutant, SIFT, and Panther are at the bottom right, in contrast to the new and less-known methods at the top left, while many methods cluster at the center of the graph. Of particular interest is PolyPhen2, which, despite its recent release, is currently the most cited of any method.