Single nucleotide variations: biological impact and theoretical interpretation.

Panagiotis Katsonis, Amanda Koire, Stephen Joseph Wilson, Teng-Kuei Hsu, Rhonald C Lua, Angela Dawn Wilkins, Olivier Lichtarge.

Abstract

Genome-wide association studies (GWAS) and whole-exome sequencing (WES) generate massive amounts of genomic variant information, and a major challenge is to identify which variations drive disease or contribute to phenotypic traits. Because the majority of known disease-causing mutations are exonic non-synonymous single nucleotide variations (nsSNVs), most studies focus on whether these nsSNVs affect protein function. Computational studies show that the impact of nsSNVs on protein function reflects sequence homology and structural information and predict the impact through statistical methods, machine learning techniques, or models of protein evolution. Here, we review impact prediction methods and discuss their underlying principles, their advantages and limitations, and how they compare to and complement one another. Finally, we present current applications and future directions for these methods in biological research and medical genetics.

Keywords: disease causing SNV (single nucleotide variation); functional impact prediction methods; missense variant classification; non-synonymous protein mutations; single nucleotide polymorphism prioritization

Year: 2014 PMID: 25234433 PMCID: PMC4253807 DOI: 10.1002/pro.2552

Source DB: PubMed Journal: Protein Sci ISSN: 0961-8368 Impact factor: 6.725

Introduction

Accurate prediction of SNV impact is an important challenge

Since making its first clinical diagnosis in 2009,1 whole exome sequencing has been on the rise for both individual patient diagnosis and large-scale projects, in keeping with decreasing production costs (Fig. 1). Our capacity to obtain sequencing information has expanded so quickly that it now far out-paces Moore's doubling law for computing power.5 Whereas targeted gene sequencing and Genome Wide Association Studies (GWAS) at predetermined loci used to be the cutting edge,6,7 new studies aim to identify single nucleotide variations (SNVs) in all genes and to analyze their association with disease.8 There are now thousands of sequenced exomes encompassing phenotypes both rare (e.g., Joubert syndrome,9 myofibrillar myopathy10), and relatively common (e.g., cancer11–16 and epilepsy17,18). Many of these exome projects have been catalogued and made available for analysis through the Database of Genotypes and Phenotypes (dbGaP),19,20 and multi-center efforts like the NHLBI Exome Sequencing Project21 are actively gathering more data. With this influx of information, researchers are now limited not by a lack of material, but instead by the challenges of processing and interpreting this wealth of information. With more candidate SNVs to evaluate than ever before, accurate methods that predict the effect of SNVs are crucial to ensure that research focuses on those variations that are most likely to cause disease.

Figure 1

Production cost and usage of whole exome sequencing over time. As the cost of exome sequencing (blue) decreases, the number of articles containing the phrase “whole exome sequencing” (red) increases. The number of articles is found via Scopus.2 The production cost is defined by the National Human Genome Research Institute3 and includes the costs of labor, sequencing instruments, and data processing, but not quality control, technology development, or data analysis. As of April 2014, the production cost for an exome on the Illumina or SOLiD platform at 30-fold coverage was $49.20, although the actual cost to the consumer is considerably higher, with costs advertised in the range of $700 to $2000 per sample.4

Most tools focus on coding SNVs rather than other SNVs

Decoding the relationship between genotype and phenotype is a major challenge in genetics. In humans there are more than four million DNA differences between two random individuals.22,23 Because additions and deletions typically have a stronger impact24–26 and are selected against more often, ∼80% of these differences are single nucleotide variations (SNVs).27–29 Over the entire human population, an estimated 81%30 to 93%31 of human genes contain at least one SNV. Although only a small fraction of variants are non-synonymous single nucleotide variations (nsSNVs), about 10,000 are found between two random individuals,27–29 and over 85% of known disease associations are culled from this important class of mutations.1 For this reason, methods for predicting the impact of SNVs have historically focused on the high-yield category of non-synonymous coding SNVs. The existence of disease-associated synonymous mutations32,33 and noncoding variations with effects on lincRNA,34 miRNA,35,36 and promoters37,38 has produced interest in other types of mutations as well, but different tools will be needed to analyze these types of variations, and such tools are still comparatively new and untested.39–41

Most nsSNVs affect protein function but in distinct ways

nsSNVs may affect folding,42,43 binding affinity,44,45 expression,46 post-translational modification,47,48 and other protein features. However, not all nsSNVs impact protein function. Some variations may produce no perceivable changes to the protein, in which case the mutation may not be pathogenic. On the other hand, purifying selection should eliminate over time the mutations that are most deleterious to fitness. A telltale signal is a decreased ratio of non-synonymous to synonymous mutations compared to a model of neutral mutation theory.49 Importantly, not all non-synonymous mutations are under the same strength of purifying selection. An analysis of exomes from the 1000 Genomes Project,50 in accordance with simulations51 and with Fisher's geometric model52 showed that the number of the nsSNVs retained in the human population decreases exponentially as the impact on fitness increases. The same analysis also showed that the exponential decrease becomes steeper for nsSNVs with higher allelic frequency, reflecting that the more common mutations have been selected against stronger constraints. This demonstrates the complexity of the genotype-to-phenotype relationship and implies that a binary classification of a mutation into deleterious or neutral, although very convenient, may be too simplistic.53

Goal of the review

Predictors of the impact of nsSNVs are useful for associating variants to phenotypic traits and diseases, but they should be used cautiously and with an understanding of the benefits and pitfalls of using each method. However, researchers attempting to understand the field may feel overwhelmed by the plethora of available predictors to consider. Here we classify current predictors of functional impact by their underlying theory and we discuss the fundamental principles, assumptions, strengths, and limitations of each type of method. Finally, we speculate on the future directions of variant prioritization and review applications for nsSNV impact prediction in guided mutagenesis studies, the identification of disease-causing nsSNVs, the association of genes to diseases, and the prediction of polygenic phenotypes from whole exome data.

Predicting SNV Impact

While many features have been used to predict the impact of nsSNVs, there are two major features that are commonly used in bioinformatics tools: structure and sequence homology.

Structural metric of nsSNV impact

Some of the first methods to predict the impact of nsSNVs were based solely on structure.54,55 They assumed that deleterious nsSNVs destabilize the folding of proteins and therefore aimed to estimate the free energy change of folding (ΔΔG) due to a mutation. Roughly three quarters of amino acid substitutions that result in Mendelian diseases do affect protein stability, proving the value of this assumption.56,57 Impacting protein stability typically implies local or total unfolding of the protein, but occasionally deleterious aggregates like amyloid fibrils58,59 may form. Rarely, single mutations have been known to cause a switch between stable folds.60 To avoid the computational expense of physical models like Molecular Dynamics simulations, most methods use statistical (PopMuSiC-2.0,61 SDM54) or empirical (FoldX/SNPeffect,62,63 Dmutant55) effective energy functions. These methods typically require a structure for the region of the protein under investigation, although some methods can use sequence information alone.64 Originally, SDM, a knowledge-based approach, used environment-dependent amino acid substitution with propensity tables and considered a structure's main-chain conformations, solvent accessibilities, hydrogen bonds, and disulfide bonds.54 Later methods used this information to help calculate basic potentials, low-order and high-order coupling terms, volume terms, and solvent accessibility terms for comprehensive scoring functions that can be weighted through training with machine learning techniques61 or direct fitting to empirical data.63,65 Other structural components that are taken into account include small-molecule binding sites, protein–protein interactions, entropy optimization, and Van der Waals and torsional clashes.63,66 These structure-based methods give insight about the local environment of the mutation. 
Variants on the surface are, in general, more likely to be neutral than variants in the core,67 indicating that disease-associated mutations often affect intrinsic structural features of proteins.68 However, surface mutations at important protein–protein interaction sites are more likely to be disease-associated.69 Using the structure also has the advantage of accounting for the interactions between amino acid residues that are close in three-dimensional space but far apart in the protein's sequence. Loss or gain of disulfide, electrostatic or hydrophobic interactions that affect protein stability or aggregation are examples of interaction changes that the use of 3-D structure can help identify.70,71 Unfortunately, even with a deposition rate outpacing PubMed article submission72 and after recently reaching the milestone of 100,000 structures,73 it is still a relatively small fraction of all proteins that can be found in the protein data bank. For example, in a recent study on epilepsy disorders66 only 18/68 of the proteins of interest had partial structures. For the remaining proteins, only 22% of the mutations could be mapped onto a predicted structure from theoretical models based on homology of known structures.66 For a larger perspective, only 7.6% of 57,525 nsSNVs from the Humsavar database could be mapped to structures.74 This percentage increased to 60.4% when Phyre2 homology models75 were included,74 but still the proportion of unaddressed SNVs was large. 
Another pitfall is that the PDB may contain structures, often flagged with a warning,76 that have unresolved concerns regarding geometry, stereochemistry, or solvent, and that contribute to inconsistency in the quality of the available structures.77 Overall, structural information has its greatest value in nsSNV impact prediction in cases where a complete and robust protein structure is available and where the protein has few homologs, compromising the prediction accuracy of methods that rely heavily on homology.78,79

Evolutionary metrics of nsSNV impact

A complementary approach to determine the impact of nsSNVs is based on evolutionary principles. At first, substitution matrices like BLOSUM6280 were used to classify a nsSNV as impactful or not81 by the similarity of an amino acid substitution as judged by the interchanges between homologous proteins. This type of substitution matrix was originally designed for database searching and pairwise alignment82 and then repurposed to predict nsSNV impact. When used as a standalone prediction tool, BLOSUM62 matrices over-predict non-conservative substitutions,83 and many early methods demonstrated their feasibility by showing improvements in accuracy over BLOSUM62 predictions.83,84 While BLOSUM62 uses a non-specific substitution profile, many homology-based methods now assess amino acid substitution profiles in a more sophisticated and family specific manner. Homology-based methods typically assume that the overrepresented substitutions in a protein family are neutral on protein function and that the underrepresented ones are deleterious25,83,84 (Fig. 2). This implies two hypotheses: that each substitution has an independent effect on protein function (no epistasis) and that all homologs have identical function (the fitness landscape is constant).83,84 The prediction accuracy is significantly affected upon violation of these hypotheses and most methods attempt to minimize this problem by optimizing the sequence selection to mostly orthologous proteins, thereby minimizing changes in the fitness landscape.25 Although non-native alignments can sometimes improve the accuracy of a method,85 customizing the sequence alignment in a rational way requires a great deal of knowledge and finesse.

Figure 2

TP53 sequences from different species and variations in their amino acids. Some homology-based methods would predict that the human sequence would tolerate a substitution of alanine to aspartic acid or to cysteine at the highlighted position. Other methods account for the conservation of a position, concluding that the highlighted position would likely tolerate more substitutions than other positions.

At the most basic level, the early homology methods (SIFT,83 Panther84) judge the impact of nsSNVs by scoring the substitution frequency amongst homologues. To improve upon this simple principle, SIFT normalizes the probabilities of all possible amino acid substitutions and Panther uses a Hidden Markov Model.84 The next generation of methods (A-GVGD,86 MAPP87) score the observed frequency of biochemical properties in each position of the alignment, such as the volume, polarity, hydropathy and charge, and how they differ from the properties of the substituted amino acid. These methods then conclude that a substitution is deleterious for protein function when it does not comply with the protein family's substitution profile.86,87 More recent implementations have combined homology information with substitution matrices. Provean uses an alignment-based score that measures the change in sequence similarity of the query sequence with each of its homologs, before and after the introduction of the mutation.25 The similarity is estimated by using the BLOSUM62 matrix, and the method can provide predictions for multiple amino acid substitutions, insertions, and deletions. 
Alternatively, the Evolutionary Action method models the genotype-to-phenotype relationship with an equation stating that the impact of a mutation is a product of the functional importance of the mutated residue and of the amino acid similarity of the substitution.50 The functional importance is approximated by the Evolutionary Trace method88,89 and the amino acid similarity by substitution matrices that depend on the functional importance of the residues and optionally on their structural features. Overall, the abundance of such methods highlights the ability of homology to accurately predict the impact of nsSNVs independently from other features. Homology has been a steadfast component of nsSNV impact prediction, whether by itself or in combination with structural information, but there are several limitations to its predictive power. In particular, the lack of available homologous sequences may result in lower prediction accuracy.87 For example, the Provean method uses 100–200 homologous sequences on average, but when their number drops below 50, the accuracy is lower.25 Another caveat is that the selection of sequences must be balanced to represent sufficiently deep evolution of the protein family without being biased to distant phylogenetic branches that have evolved to retain functions that are specific only to that branch. When the Provean method was tested on sequence alignments derived by using the UniProtKB/Swiss-Prot instead of the NCBI NR database, the accuracy dropped by 7% and this was attributed to the lack of orthologous and distantly related sequences.25 Furthermore, the choice of the alignment has a major effect on the accuracy of a method. 
When each of four alignments was used as input to SIFT, A-GVGD, PolyPhen-2, and Xvar (now MutationAssessor), their accuracy varied widely, with A-GVGD being extremely sensitive to, and PolyPhen-2 being more robust to, changes in alignment.85 Interestingly, the native alignments of each method did not necessarily give the best predictions for that method.85 Overall, sequence homology can be applied to nsSNV impact prediction with great success if there are sufficient homologues in broad and deep branches of the phylogeny.
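
The core assumption of homology-based methods, that substitutions seen often in a protein family are tolerated and unseen ones are suspect, can be illustrated with a minimal Python sketch. It is not the algorithm of SIFT, Panther, or any other published tool: the function names, the pseudocount, the 0.05 cutoff, and the toy alignment in the usage note are all assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(alignment, position, pseudocount=1.0):
    """Smoothed amino acid frequencies at one column of an alignment.

    `alignment` is a list of equal-length aligned sequences (hypothetical
    input format); gaps and unknown characters are ignored.
    """
    counts = Counter(
        seq[position] for seq in alignment if seq[position] in AMINO_ACIDS
    )
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

def is_tolerated(alignment, position, substitution, cutoff=0.05):
    """Call a substitution tolerated if its smoothed column frequency
    reaches the cutoff. The cutoff is an arbitrary illustrative value."""
    return column_frequencies(alignment, position)[substitution] >= cutoff
```

For a toy alignment such as `["MAK", "MAK", "MDK", "MCK", "MAR"]`, calling `is_tolerated(aln, 1, "D")` asks whether aspartic acid is acceptable at the second column. With so few sequences the pseudocount dominates the estimate, which mirrors the point made above that predictions degrade when few homologs are available.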

Integrative machine learning approaches

Several methods predict the impact of nsSNVs using both structure and homology, along with other types of information such as function annotation and biochemical properties. To combine key features, these methods use supervised machine learning techniques that integrate disparate data types through nonlinear relationships and handle outliers and noise more readily than linear approaches.64 Supervised learning requires training with large numbers of known phenotype associations in order to deduce these complex relationships.64,90,91 Ultimately, they classify the data into categories91 like deleterious or neutral, and they may provide a confidence score for each prediction. Commonly used machine learning techniques include Support Vector Machines,5,64,71,92–96 Naive Bayes,97,98 Neural Networks,90,99 Random Forests,100,101 and Decision Trees.93,102 Perhaps the most well-known impact predictor that uses machine-learning is PolyPhen2, which uses a naive Bayes classifier on substitution events in homologs, structural parameters, function annotation, and physicochemical features.98 Typical training features include amino acid substitution profiles or homology derived scores,71,94,98 biophysical properties of the substitution (volume,71,92,98 hydropathy,71,92,94 and charge71,92,94), structure information (secondary structure,94 solvent accessibility,92,98 and crystallographic B-factors98), function annotation,92,98,103 local environment information (neighbors in sequence or space),64,93,94,104 statistical potentials,64 aggregation property,62,105 and intrinsically disordered regions.105 Recently, SuSPect74 even incorporated a network of protein-protein interactions from the STRING database106 into its analysis. 
Machine learning methods aim to identify and use non-redundant features that are highly correlated to accurate classification.107 However, optimizing the selection of features may cause predictions to be less accurate for those proteins dependent on “atypical” features. For example, disruption of intrinsic disorder, a rarely used feature, is critical for predicting the impact of mutations in the tumor suppressor APC.108 Determining which features contain the most relevant information and the least amount of noise has been a constant challenge, and several methods integrate predictions of existing methods with one another (Condel)109 or with additional features (SNAP99 and MutPred110) in order to increase accuracy. At the time of publication of this review, there is no consensus on a “best” set of features to predict the impact of SNVs, with different combinations working for different methods and datasets. The features considered by each method are detailed in Supporting Information Table 1.

Table I

SNP Impact Predictors

Server	Year	Input	URL	Pubmed ID
Structural
SDM	1997	PDB ID	http://www-cryst.bioc.cam.ac.uk/~sdm/sdm.php	9051729
Dmutant	2002	PDB ID	http://sparks.informatics.iupui.edu/hzhou/mutation.html (Unavailable)	12381853
PoPMuSiC	2009	PDB ID	http://dezyme.com/	19654118
SDS	2014	-	Cannot automate	24795746
Homology
SIFT	2001	Protein identifier, SNP IDs, or alignment	http://sift.jcvi.org/	11337480
Panther	2003	Sequence	http://www.pantherdb.org/tools/csnpScoreForm.jsp	12952881
MAPP	2005	Alignment and phylogenetic tree	http://mendel.stanford.edu/SidowLab/downloads/MAPP/index.html	15965030
A-GVGD	2006	Alignment	http://agvgd.iarc.fr/agvgd_input.php	16014699
mutationassessor (xvar)	2011	Protein identifier or chrom. location	http://mutationassessor.org/	21727090
Provean	2012	Sequence or chrom. location	http://provean.jcvi.org/index.php	23056405
Evolutionary action	2014	Protein identifier	http://mammoth.bcm.tmc.edu/EvolutionaryAction/
Hybrid
PolyPhen	2002	Protein identifier or sequence	http://genetics.bwh.harvard.edu/pph/	12202775
LogR.E-value	2004	Site is down for maintenance	http://lpgws.nci.nih.gov/cgi-bin/GeneViewer.cg	14751981
nsSNPAnalyzer	2005	Sequence (requires available PDB structure)	http://snpanalyzer.uthsc.edu/	15980516
SNPeffect	2005	Sequence, PDB ID, UniProt ID	http://snpeffect.switchlab.org/menu	15608254
LS-SNP	2005	SNP, protein or pathway identifier	http://modbase.compbio.ucsf.edu/LS-SNP/	15827081
MUpro	2005	Protein sequence, structure (optional)	mupro.proteomics.ics.uci.edu	16372356
pmut	2005	Sequence (on demand version) or PDB ID (precalculated version)	http://mmb2.pcb.ub.es:8080/PMut/	15879453
PhD-SNP	2006	Protein identifier or sequence	http://snps.biofold.org/phd-snp/phd-snp.html	16895930
SNPs3D	2006	SNP identifier	http://www.snps3d.org/	16551372
Parepro	2007	Alignment	http://www.mobioinfor.cn/parepro/index.htm	18005451
SAPRED	2007	Sequence and PDB files	http://sapred.cbi.pku.edu.cn/ (Login required)	17384424
Imutant 3.0	2007	Sequence or PDB ID	http://gpcr2.biocomp.unibo.it/cgi/predictors/I-Mutant3.0/I-Mutant3.0.cgi	18387208
SNAP	2007	Sequence	http://rostlab.org/services/snap/submit	17526529
AUTO-MUTE	2010	PDB ID	http://proteins.gmu.edu/automute/AUTO-MUTE_nsSNPs.html	20573719
Mutation Taster	2010	Transcript, gene, or ORF	http://www.mutationtaster.org	20676075
PolyPhen2	2010	Protein or SNP identifier or sequence	http://genetics.bwh.harvard.edu/pph2/	20354512
Condel	2011	Protein identifier, mutation, homology tree	No server, but can get PERL pipeline scripts and then download each tool	21457909
CADD	2014	VCF file	http://cadd.gs.washington.edu/score	24487276
VarMod	2014	Sequence	http://www.wasslab.org/varmod/	24906884
SuSPect	2014	Sequence or VCF	http://www.sbg.bio.ic.ac.uk/suspect/index.html	24810707

Another limitation of the machine learning methods is that they may rely on asymmetric training sets that may misrepresent population characteristics.104 For example, if a Gaussian distribution were randomly sampled, one might obtain by chance a few more samples on one side of the curve (Supporting Information Fig. 1). Using this skewed distribution in a machine learning technique underfits the data and can cause false predictions, defeating the purpose of the learning process.91,111 However, this “generalization error” can at least be minimized by mathematical models.112 Equally problematic, if a method is over-trained on a dataset, noise will be built in and the performance of the model will drop.91,111,112 Finally, using machine learning methods to predict the impact of mutations that differ fundamentally from the training data may require retraining and revalidating the tool. For example, using SuSPect, which was initially trained only on human SNVs, to predict the impact of mutations in non-human proteins dropped the AUC by about 10%.74

Availability and Comparisons

A summary of well-known current methods to predict the impact of nsSNVs is provided in Table I, and a more detailed version of this table exists in Supporting Information Table 1. The majority of these methods are freely available to the research community through web servers or through downloadable files for local use. Using them often requires basic to advanced bioinformatics skills, as presented in Karchin 2008.113 At its most basic, a user has to input just an identifier of the protein of interest or its sequence, and in some cases the specific amino acid substitution as well. To better assist users, many methods allow submitting large numbers of prediction requests at a time, and others give an option to input user-curated sequence alignments of the protein family. New tools determine their accuracy by applying their method to various sets of nsSNVs whose impact is known and measuring how well they are able to distinguish harmful mutations from benign ones. Ambitious mutagenesis work on a particular protein is one way these validation sets are developed. For example, 4041 mutations of the E. coli LacI protein,114,115 336 mutations of HIV-1 protease,116 2015 mutations of bacteriophage T4 lysozyme,117 and 2314 mutations of the human p53 protein118 have been assayed for functional effect and catalogued. Many tools, including SIFT,83 MutationAssessor,65 Provean,25 MAPP,87 and EA,50 compare to one or more of these classic datasets. Another type of validation set comes from reference human SNVs that have been classified as disease-associated variants (deleterious) or common polymorphisms (presumed benign). These datasets include VariBench,119 HGMD,120 and the “human polymorphisms and disease mutations” set available from the UniProtKB/Swiss-Prot database,121 each of which contains tens of thousands of missense variants. 
This type of validation set has the advantage of being human-specific and encompassing many proteins, but relies on the accuracy of annotations in the databases and can only consider SNV impact in a binary fashion. On the other hand, validation sets from mutagenesis studies are more limited in scope but involve functional assays that consider impact on a continuous scale. The performance of different methods to predict the impact of mutations is typically compared with the area under the curve (AUC) of the receiver operating characteristic (ROC) plots. An ROC curve plots the true positive rate against the false positive rate and demonstrates the trade-off between sensitivity and specificity. The AUC quantifies the success of this trade-off. A perfect prediction would result in a vertical line (infinite slope) at the origin and an AUC of 1, in contrast to a completely random prediction that would result in a line with a slope of 1 and an AUC of 0.5. Other measures to evaluate the ability of prediction methods to prioritize the impact of mutations include the balanced accuracy, which is the average of the sensitivity and specificity,25 the F1 score, which is the harmonic mean of precision and recall,122 the Matthews correlation coefficient (MCC),93 the Spearman's rank correlation coefficient,123 the Kendall tau rank correlation coefficient,124 and the scale-dependent metric root-mean-square deviation (RMSD).61 It is important to be cautious when attempting to objectively compare methods, and only new, unpublished data should be included in a validation set in order to keep the methods on equal footing. Otherwise, machine learning methods that have used part of the validation data in their training may appear to be more accurate than they really are. 
When available, comparisons that are performed by independent researchers are preferable.53,85 In one such study, the performance of four commonly used methods (SIFT, Align-GVGD, PolyPhen-2, and Xvar which is now called MutationAssessor) was compared for 267 well-characterized human missense mutations in the BRCA1, MSH2, MLH1, and TP53 genes.85 All four algorithms performed similarly, with an AUC of about 80%, but the predictions by each algorithm were often discordant even when each one was provided the same input alignment.85 Thus, while these methods perform similarly in their overall accuracy, their predictions are different,85 a phenomenon that is documented for other tools as well125 and suggests complementarity.109 There are also independent third-party challenges that use unpublished data to assess the ability of methods to predict the functional impact of mutations on proteins, including the critical assessment of genome interpretation (CAGI),126 in which competing groups evaluate genetic variants blindly and have their predictions judged against experimental results on a variety of measures. Most often, no single method outperforms all others in every one of these diverse measures of quality; nevertheless an average rank can be calculated for each method over all of the quality measures. In Figure 3, we plotted the average ranks of impact predictors in two of the CAGI challenges, where we participated with predictions made by the evolutionary action method (simply Action). The identities of methods other than our own will remain anonymous until the CAGI community publishes comprehensive results.
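
Several of the evaluation measures described above can be written down in a few lines. The following Python sketch uses standard textbook definitions of balanced accuracy, the F1 score, the Matthews correlation coefficient, and the ROC AUC (computed here as the probability that a randomly chosen deleterious variant receives a higher score than a randomly chosen neutral one). The function names and input conventions are assumptions made for illustration, not the evaluation code of any study cited here.

```python
import math

def confusion(labels, calls):
    """Count (tp, tn, fp, fn); labels and calls are truthy for 'deleterious'."""
    tp = sum(1 for y, c in zip(labels, calls) if y and c)
    tn = sum(1 for y, c in zip(labels, calls) if not y and not c)
    fp = sum(1 for y, c in zip(labels, calls) if not y and c)
    fn = sum(1 for y, c in zip(labels, calls) if y and not c)
    return tp, tn, fp, fn

# The three functions below assume both classes are present and that at
# least one variant was called deleterious; no zero-division guards are
# included apart from the MCC denominator.

def balanced_accuracy(tp, tn, fp, fn):
    """Average of sensitivity and specificity."""
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

def f1_score(tp, tn, fp, fn):
    """Harmonic mean of precision and recall."""
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def roc_auc(labels, scores):
    """AUC via pairwise comparison; tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form of the AUC is quadratic in the number of variants, which is acceptable for validation sets of a few thousand mutations; for larger sets a rank-based computation would be the usual choice. Note that the AUC uses the continuous scores directly, whereas the other three measures depend on the threshold used to turn scores into binary calls.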

Figure 3

Average rank of predictions in two CAGI challenges from the competitions of 2011 and 2012–13. The Cystathionine beta-Synthase (CBS) challenge of 2011 asked predictors to submit the effect of 84 variants on the function of CBS at two different cofactor concentrations,127 which were assessed by nine measures for each concentration (precision, recall, accuracy, harmonic mean F1, Spearman's rank, Student's t test, RMSD, RMSD over z scores, and AUC). The p16 challenge of 2012–13 asked predictors to submit evaluations of how 10 variants of the p16 protein impact its ability to block cell proliferation,128 which were assessed by four measures (AUC, RMSD, Kendall tau, and the number of correct predictions within a range of 10%). A total of 16 participants (color-coded) in one or both challenges submitted one or multiple predictions (20 predictions in 2011 and 22 predictions in 2012–13). The number shown on the vertical axis is an average rank, so that in order to have a rank of one, the prediction would need to rank first in all of the evaluation measures that were used. Conversely, the worst a prediction could do would be to be last in every evaluation measure, leading to an average rank equal to the total number of prediction sets in that challenge. Besides Action, only participants B and C submitted predictions in both challenges. The Evolutionary Action method can be found at: http://mammoth.bcm.tmc.edu/EvolutionaryAction/.
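
The average rank described in this caption can be sketched in Python as follows. This is an illustration only, not the CAGI assessors' procedure: the function name, the input layout, and the tie handling (ties are broken by input order rather than shared) are assumptions, and every submission is assumed to have a score for every measure.

```python
def average_ranks(scores_by_measure, higher_is_better):
    """Mean rank of each submission across evaluation measures.

    scores_by_measure: {measure: {submission: score}} (hypothetical layout)
    higher_is_better:  {measure: bool}, e.g. True for AUC, False for RMSD
    """
    totals = {}
    for measure, scores in scores_by_measure.items():
        # Best submission first; rank 1 is the best for this measure.
        ordered = sorted(scores, key=scores.get,
                         reverse=higher_is_better[measure])
        for rank, name in enumerate(ordered, start=1):
            totals[name] = totals.get(name, 0) + rank
    n_measures = len(scores_by_measure)
    return {name: total / n_measures for name, total in totals.items()}
```

A submission obtains an average rank of one only if it is ranked first on every measure, and the worst possible value equals the number of submissions, as stated in the caption.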

Figure 4

The total number of citations since each method was published, on a logarithmic scale, according to Scopus2 for methods published before 2014. The methods are colored by the type of information they use as seen in the figure legend. The older and well-established methods of PolyPhen, DMutant, SIFT, and Panther are at the bottom right, in contrast to the new and less-known methods at the top left, while an abundance of methods are clustered at the center of the graph. Of particular interest is PolyPhen2, which, despite its recent release, is currently the most cited of any method.

In summary, one may choose an impact prediction method not only based on its accuracy against a variety of benchmark datasets, but also based on the strengths and limitations of the method in the context of the data at hand. The availability of a structure, the number of available homologs, the convenience of a predictor (web server or local installation), and the ability to submit multiple requests with various formats (vcf files or lists of single amino acid variants) may all affect the preference of a user in practice. In general, the confidence of a prediction is higher when multiple methods are in agreement,129 so studies often use the results from multiple methods to bolster evidence for pathogenicity.42,130,131 To this end, metaservers that compile the results from multiple methods are often time-saving, and several are noted in Supporting Information Table 1 along with the original methods.
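
The practice of checking agreement among several predictors can be sketched as a simple unweighted vote. This Python example is only an illustration: it is not how Condel or any metaserver combines scores, and the function name and the placeholder method names in the usage note are hypothetical.

```python
def consensus_call(calls):
    """Combine binary calls from several predictors.

    calls: {method_name: True if the method predicts 'deleterious'}
    Returns (majority_call, agreement), where agreement is the fraction
    of methods on the larger side of the vote. Exact ties default to
    'not deleterious', an arbitrary choice for this sketch.
    """
    votes = list(calls.values())
    n_deleterious = sum(votes)
    majority = n_deleterious * 2 > len(votes)
    agreement = max(n_deleterious, len(votes) - n_deleterious) / len(votes)
    return majority, agreement
```

For example, `consensus_call({"method_a": True, "method_b": True, "method_c": False})` returns a deleterious majority call with two of three methods in agreement. An unweighted vote treats all methods as equally reliable and independent, which is a simplification given that many predictors share homology-derived features.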

Applications

Typically, SNV impact prediction methods are used to associate amino acid variations to loss of protein function or to risk of diseases. An increasing number of studies use the predicted impact in a variety of applications, and have reported that SNV impact predictions match experimental findings.130,132,133 Such applications include guiding mutagenesis,134,135 identifying disease associated genes in both Mendelian and common diseases,1,136–139 separating disease-causing variants from linkage disequilibrium variants,140 identifying somatic mutations that drive cancer,65,141,142 and predicting the overall phenotype of an organism.143 These applications highlight the value of SNV impact prediction and the need for further improvement.

Guided mutagenesis

Predictions of impact may guide mutagenesis studies that aim to uncover functional sites or fine-tune the activity of proteins. Rather than using laborious random mutagenesis and screening to identify functional residues,144,145 site-directed mutagenesis studies146 may be efficiently guided by computational predictions with high rates of success.134,147–149 Besides selecting strongly deleterious mutations that knock out protein function, it is often desirable to select mutations with an intermediate impact in order to redirect the protein activity.135 Methods like EA, which yield predictions on a continuous scale rather than in binary categories, are appropriate for engineering functional proteins that deviate to varying degrees from the wild-type phenotype.135
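As a minimal sketch of how a continuous score supports both goals, the snippet below selects candidate substitutions by score window. It assumes a predictor that reports impact on a 0 to 100 scale (higher meaning more impactful); the function `pick_by_impact`, the window boundaries, and the substitutions with their scores are all hypothetical.

```python
from typing import Dict, List


def pick_by_impact(scores: Dict[str, float], low: float, high: float) -> List[str]:
    """Return substitutions whose predicted impact lies in [low, high], lowest first."""
    chosen = [(score, name) for name, score in scores.items() if low <= score <= high]
    return [name for score, name in sorted(chosen)]


# Hypothetical substitutions and scores on an assumed 0-100 impact scale.
predicted = {"A12G": 8.0, "L45P": 92.0, "R77K": 41.0, "D103E": 55.0, "W130C": 97.0}

knockouts = pick_by_impact(predicted, 90.0, 100.0)  # candidates to abolish function
tuners = pick_by_impact(predicted, 30.0, 70.0)      # candidates to modulate function
print(knockouts, tuners)
```

A binary predictor could only supply the first list; the intermediate window is what a continuous scale adds.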

nsSNV disease association

nsSNV impact predictors can also aid in untangling disease etiology. Although thousands of associations have been made between nsSNVs and risk of various diseases through GWAS and catalogued in databases like HGMD,120 dbSNP,150 ENSEMBL,151 and UniProt,152 it is often unclear whether the nsSNV itself is causative or merely linked to the disease-causing variant. In addition, predisposing nsSNVs usually account for a small fraction of the predicted genetic risk of complex diseases, a major issue known as “missing” heritability.153–155 Current theories suggest that common diseases are caused either by common variants with small to modest effects155 or by multiple rare variants.156 In the first case statistical power is limited by linkage disequilibrium, and in the second by low population frequency. nsSNV impact predictions may be used to distinguish the most deleterious nsSNVs from those that are merely in strong linkage disequilibrium with a causative nsSNV,157 or to identify deleterious rare nsSNVs that occur in genes associated with the disease.158–160
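The rare-variant case can be sketched as a simple filter. The example assumes each variant carries a population allele frequency and a predicted impact normalized to the range 0 to 1; the `Variant` record, the cutoffs (1% frequency, impact of at least 0.8), and the gene and variant names are hypothetical and would need tuning for any real study.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Variant:
    gene: str
    change: str
    allele_frequency: float  # population frequency, 0..1
    impact: float            # predicted impact, 0 (neutral) .. 1 (deleterious)


def rare_deleterious(
    variants: Iterable[Variant],
    candidate_genes: Set[str],
    max_af: float = 0.01,     # assumed rarity cutoff
    min_impact: float = 0.8,  # assumed impact cutoff
) -> List[Variant]:
    """Keep rare, predicted-deleterious variants that fall in disease-associated genes."""
    return [
        v
        for v in variants
        if v.gene in candidate_genes
        and v.allele_frequency <= max_af
        and v.impact >= min_impact
    ]


# Hypothetical input data.
observed = [
    Variant("GENE_A", "R100W", 0.0005, 0.93),  # rare and impactful: kept
    Variant("GENE_A", "A20S", 0.2500, 0.15),   # common and mild: dropped
    Variant("GENE_B", "G55D", 0.0010, 0.90),   # not a candidate gene: dropped
    Variant("GENE_A", "L300F", 0.0020, 0.40),  # rare but mild: dropped
]
kept = rare_deleterious(observed, {"GENE_A"})
print([v.change for v in kept])
```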

Identifying genes that cause diseases

Another use of impact predictors is to discover genes associated with genetic disorders.161 In these studies, exome sequencing of unrelated patients with the disorder is conducted under the hypothesis that these exomes will be enriched in mutations that impact the function of a causative gene. The predicted impact of SNVs on protein function may then be used to associate new genes with the studied disorder, such as the genes FRAS1 and FREM2 with Congenital Abnormalities of the Kidney and Urinary Tract (CAKUT),136 the DHODH gene with Miller syndrome,137 the SLC26A3 gene with Bartter syndrome,1 the TGM6 gene with spinocerebellar ataxias,138 and the VCP gene with Amyotrophic Lateral Sclerosis (ALS).139 With more exome sequencing studies on the way, there is much potential for the widespread use of mutation impact predictors in the clinical setting, given their continuous improvement and almost immediate access to results.
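The enrichment hypothesis can be sketched by counting, for each gene, how many unrelated patients carry at least one variant predicted to be impactful. This is only an illustration under stated assumptions: impact is taken to be normalized to 0 to 1, the 0.8 cutoff is arbitrary, the patient and gene labels are hypothetical, and a real analysis would also compare against controls and gene-specific background mutation rates.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def patients_hit_per_gene(
    calls: Iterable[Tuple[str, str, float]],
    min_impact: float = 0.8,  # assumed cutoff on a 0..1 impact scale
) -> List[Tuple[str, int]]:
    """Rank genes by the number of distinct patients with a predicted-impactful variant.

    Each call is (patient_id, gene, predicted_impact).
    """
    carriers: Dict[str, Set[str]] = defaultdict(set)
    for patient, gene, impact in calls:
        if impact >= min_impact:
            carriers[gene].add(patient)  # a patient counts once per gene
    return sorted(
        ((gene, len(patients)) for gene, patients in carriers.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical calls from three unrelated patients.
cohort = [
    ("P1", "GENE_A", 0.95),
    ("P2", "GENE_A", 0.88),
    ("P3", "GENE_A", 0.91),
    ("P1", "GENE_A", 0.90),  # second hit in the same patient, not double-counted
    ("P1", "GENE_B", 0.85),
    ("P2", "GENE_B", 0.10),
    ("P3", "GENE_C", 0.40),
]
ranking = patients_hit_per_gene(cohort)
print(ranking)
```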

Identifying cancer driver mutations

The search for cancer-associated mutations also benefits from predictions of the functional impact. This is a particularly challenging problem, since although cancer-causing mutations may be inherited,162,163 most often they are acquired in somatic cells during tumor development.164,165 The average number of nonsynonymous somatic mutations in a tumor varies widely by cancer type, ranging from as low as four in pediatric rhabdoid cancer to as high as the thousands in colorectal cancer with microsatellite instability.166 Some of these mutations, called drivers, disrupt or further activate the function of proteins to promote cancer, while the rest confer no selective tumor growth advantage and are called passengers.167 Predicting the impact of the variants found by exome sequencing of numerous tumors can help in identifying the genes that are associated with each cancer type.168–170 Moreover, nsSNV impact can provide clinical information. For example, even when only the TP53 gene is under consideration, predicting the impact of head and neck tumor mutations can stratify patient survival into statistically significant groups.171 Several nsSNV impact predictors have specifically applied their method to cancer gene discovery, including CanPredict,141 MutationAssessor,65 and SNPs3D.142 CanPredict is a Random Forest classifier, trained on 800 cancerous and 200 non-cancerous mutations, that uses SIFT172 and Pfam-based scores173 to predict impact, and Gene Ontology174 to predict cancer association. 
This method identified as cancer-associated several novel germline variants that were not present in controls, suggesting they are markers for increased cancer risk.141 The MutationAssessor method predicted the impact of over 10,000 nsSNVs from the COSMIC database;175 these predictions, combined with the total number of mutations in a gene and the frequency of each mutation in different tumors, were used to rank genes for cancer association, recovering known drivers (TP53, PTEN, etc.) and suggesting many others.65 The SNPs3D method, consisting of two SVMs based on protein stability and homology, respectively, was applied to about 2000 somatic mutations from colorectal and breast cancer and found that virtually all mutations in known cancer genes are predicted to impact protein function and can therefore be detected by nsSNV impact prediction methods.142 These methods produced intriguing novel predictions and may foreshadow wider use of nsSNV impact predictions to elucidate cancer mechanisms.

Predicting the phenotypic behavior of single organisms by integrating the impact of multiple mutations

Although a simple, clinically useful pipeline to reliably annotate all likely phenotypes from a human genome is not yet possible,176 predicting phenotypic variation from genome sequences has made significant advances in model organisms like yeast and has illustrated the centrality of SNV impact prediction to these efforts.143 Genome-scale reverse genetic screens in model organisms have produced thorough, if not complete, sets of genes associated with a variety of phenotypes, aiding the prediction process and allowing for proof-of-concept experiments that apply to human genotype-to-phenotype research.177 One such study used gene sets for 115 phenotypes described by the Saccharomyces Genome Database (SGD)178 and considered how the mutational load in the protein-coding regions of these gene sets varied by yeast strain. The study applied an nsSNV effect predictor, SIFT,172 to determine the probability of damage for non-synonymous mutations. The overall phenotypic effect was calculated with an additive model that combined the SIFT scores with heuristic rules that evaluated premature stop codons and insertions and deletions.178 The actual phenotypic responses of the strains were determined experimentally, and the genotype-based predictions matched them with an ROC AUC value of 0.76.143 These results offer hope that, in the future, SNV impact prediction methods may be similarly applied to integrate the impact of multiple mutations in the human genome as the sets of genes known to be associated with a phenotype become more complete.179
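A minimal sketch of an additive burden model and its evaluation by ROC AUC follows. It is not the model of the cited study: it assumes that a missense mutation contributes one minus its SIFT score (SIFT scores near 0 indicate intolerance), that premature stops and frameshifts contribute a fixed damage of 1.0, and that burdens simply sum over a gene set. The function names, mutation labels, and strain data are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def mutation_damage(kind: str, sift_score: float = 1.0) -> float:
    """Assumed per-mutation damage in [0, 1]."""
    if kind in ("stop_gain", "frameshift"):
        return 1.0               # heuristic rule: treated as fully damaging
    if kind == "missense":
        return 1.0 - sift_score  # low SIFT score means high predicted damage
    return 0.0                   # e.g. synonymous changes


def strain_burden(mutations: Iterable[Tuple[str, float]]) -> float:
    """Additive burden over all (kind, sift_score) mutations in a gene set."""
    return sum(mutation_damage(kind, score) for kind, score in mutations)


def roc_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Probability that a positive outranks a negative (ties count one half)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical strains: mutations in one phenotype's gene set, and observed phenotype.
strains = {
    "strain_A": [("missense", 0.02), ("stop_gain", 0.0)],
    "strain_B": [("missense", 0.60)],
    "strain_C": [("missense", 0.01)],
    "strain_D": [("synonymous", 1.0)],
}
observed = [True, True, False, False]  # same order as the strains above

burdens = [strain_burden(m) for m in strains.values()]
print(roc_auc(burdens, observed))
```

In this toy data one affected strain has a lower burden than an unaffected one, so the AUC falls below 1; an AUC of 0.5 would mean the burden carries no information about the phenotype.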

Future Directions

What are the future challenges the field of SNV impact prediction needs to address?

Context-dependence

Despite steady progress in predicting the impact of non-synonymous coding variations, there remains a myriad of challenges for determining how the phenotype of an individual organism is affected by a specific SNV. For example, it is important to know whether and how the phenotypic impact is mitigated by zygosity,180,181 epistasis,182,183 mosaicism,184 gender,185,186 environment,187 epigenetics,188,189 or other unknown factors affecting penetrance and expressivity.190 The recently launched “Resilience Project”191 aims precisely to identify the factors that buffer disease in apparently healthy patients that carry high-risk disease variants.192 As our understanding of these factors expands, we may be able to incorporate this information on a large scale and provide personalized impact predictions.

Impact of protein function loss on phenotype

A necessary intermediate step in integrating genetic information is to understand the phenotypic association of each protein and its impact on the overall fitness of a species. For example, a SNV in a gene may render the protein nonfunctional, but this loss of protein function can, depending on the role of that protein, be fairly neutral to the organism193 or have observable consequences,194,195 including lethality.196 An additional complication comes from the redundant function of proteins or pathways, resulting in no noticeable phenotypic change when losing the function of only one involved gene.197,198 SNV impact prediction does not yet make any a priori assumptions about gene importance, but when the gene involved in the phenotype is well established, it can stratify patient outcomes171 and disease severity.50 Large-scale projects like the NIH Knockout Mouse Project (KOMP)196 and particularly systematic surveys of incidental human knockouts199–201 promise to shed light on the relative importance of the genes, their role in diseases, and the gene redundancy within a genome, presenting an opportunity for a leap forward in variant prioritization.

Noncoding regions

Finally, evidence that more than 80% of the human genome may display some functionality202 suggests that there are important limitations in exclusively analyzing exome sequencing data. Consequently, SNV impact prediction is beginning to branch into noncoding regions of the genome. Two recent tools, mrSNP39 and MicroSNiPer,40 attempt to identify SNVs in 3' UTRs that disrupt miRNA binding, and RNAsnp41 predicts the effect of SNVs on the local structure of noncoding RNAs. Future tools will hopefully expand upon this work and may also begin to predict how noncoding SNVs alter methylation patterns and other epigenetic changes.203,204 With the discovery that SNVs in noncoding regions are sometimes disease associated,34–38 additional methods to deal with these variants will likely arise over time.

Developing computational methods to estimate the functional impact of SNVs is crucial to understanding the genotype–phenotype relationship, and their importance to research and clinical practice will only grow as sequencing costs plummet further. Many nsSNV impact prediction methods already find broad application in guided mutagenesis and in the identification of disease-causing variants and genes. A plethora of tools is already available, and the steady arrival of new ones complicates the choice of which to use. This review explored current predictors of functional impact in light of the strengths and limitations of the fundamental principles they apply. Factors such as tool availability, public usage, and, most importantly, accuracy must be carefully weighed and understood in the context of the target dataset. In the future, technical improvements and the availability of new sequence and SNV data should help computational methods predict the impact of SNVs with even higher accuracy.
