João G R Cardoso, Mikael Rørdam Andersen, Markus J Herrgård, Nikolaus Sonnenschein.
Abstract
Genetic variation is the motor of evolution and allows organisms to overcome the environmental challenges they encounter. It can be both beneficial and harmful in the process of engineering cell factories for the production of proteins and chemicals. Throughout the history of biotechnology, there have been efforts to exploit genetic variation to create strains with favorable phenotypes. Genetic variation can either be present in natural populations or be artificially created by mutagenesis and selection or by adaptive laboratory evolution. On the other hand, unintended genetic variation during a long-term production process may lead to significant economic losses, and it is important to understand how to control this type of variation. With the emergence of next-generation sequencing technologies, genetic variation in microbial strains can now be determined on an unprecedented scale and resolution by systematically re-sequencing thousands of strains. In this article, we review challenges in the integration and analysis of large-scale re-sequencing data, present an extensive overview of bioinformatics methods for predicting the effects of genetic variants on protein function, and discuss approaches for interfacing existing bioinformatics approaches with genome-scale models of cellular processes in order to predict effects of sequence variation on cellular phenotypes.
Keywords: SNP; adaptive laboratory evolution; constraint-based modeling; genetic variation; high-throughput analysis; metabolic engineering; metabolism; next-generation sequencing
Year: 2015 PMID: 25763369 PMCID: PMC4329917 DOI: 10.3389/fbioe.2015.00013
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
Figure 1. Common genetic variations. Variations at the (A) nucleotide level and (B) structural level. (C) Single nucleotide polymorphism A/T across a population.
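The nucleotide-level variant types and the population-level A/T polymorphism named in the Figure 1 caption can be illustrated with a minimal Python sketch. The function names `classify_variant` and `allele_frequencies`, the category labels, and the toy population are assumptions made for illustration only; they are not taken from the article or from any particular variant-calling tool.

```python
from collections import Counter


def classify_variant(ref: str, alt: str) -> str:
    """Label a nucleotide-level variant from its reference and alternate alleles.

    Hypothetical helper: the labels are illustrative, not a standard vocabulary.
    """
    if ref == alt:
        return "no change"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "multi-nucleotide substitution"


def allele_frequencies(alleles):
    """Fraction of strains carrying each allele at a single position."""
    counts = Counter(alleles)
    total = len(alleles)
    return {allele: count / total for allele, count in counts.items()}


# Toy population (assumed data): the allele seen at one position in eight strains.
population = ["A", "A", "T", "A", "T", "A", "A", "A"]
frequencies = allele_frequencies(population)

print(classify_variant("A", "T"))    # single-base change
print(classify_variant("A", "ACG"))  # alternate allele is longer
print(frequencies)
```

Structural variants such as those in panel (B) span larger regions and would need coordinates rather than a pair of short allele strings, so they are outside the scope of this sketch.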
Table 1. A summary of the available software tools for predicting the effects of genetic variants.
| Tool | Description | Reference |
|---|---|---|
| AUTO-MUTE | Uses the “4-Body Statistical Potential” to compute a set of features – based on protein 3D structure – used to train a Random Forest model to predict | Masso and Vaisman ( |
| Align-GVGD | This algorithm is based on multiple sequence alignment and Grantham distance to identify missense SNPs. The authors propose a measure to calculate how much the substitution changes the Grantham distance. | Tavtigian ( |
| CADD | A machine-learning approach that uses an SVM model to predict deleterious phenotypes caused by SNPs. | Kircher et al. ( |
| Chasman and Adams ( | A probabilistic approach to identify which SNPs have an effect on the protein function using structural and evolutionary features that compare the variation against a dataset of mutations of lac repressor and T4 lysozyme. | Chasman and Adams ( |
| CONDEL | | González-Pérez and López-Bigas ( |
| Evolutionary action | Evolutionary action is a function that links genotype with phenotype using evolutionary information, by quantifying the impact of SNPs on the fitness of a population; it correlates with disease-associated mutations. | Katsonis and Lichtarge ( |
| FATHMM | Uses Hidden Markov Models (HMMs) to obtain position-specific information. The prediction is based on the probability change of the HMM between wild-type and mutant. | Shihab et al. ( |
| FunSAV | A random forest classifier for predicting deleterious SNPs. It combines properties of the mutated protein with other tools (i.e., nsSNPAnalyzer, PANTHER, PhD-SNP, PolyPhen2, SIFT, and SNAP). | Wang et al. ( |
| FuzzySnps | A machine-learning approach that uses a Random Forest model trained by combining “4-Body Statistical Potential” and sequence-based features to identify tolerant and intolerant SNPs. | Barenboim et al. ( |
| Goldgar et al. ( | A probabilistic approach to determine if a SNP is disease-causing, which is achieved by computing the likelihood of the protein to be similar to previously classified mutated proteins in a dataset. | Goldgar et al. ( |
| HANSA | A machine-learning classifier that uses an SVM model to predict whether a SNP will be neutral or disease-causing. | Acharya and Nagarajaram ( |
| LogR.E-value | Uses the | Clifford et al. ( |
| LS-SNP | A workflow/database that uses predefined rules and a machine-learning (SVM) approach to systematically characterize known SNPs. | Karchin et al. ( |
| Krishnan and Westhead ( | Two machine-learning approaches – using SVM and Decision Trees models – are used to predict the “effect” or “no-effect” of a SNP. | Krishnan and Westhead ( |
| MAPP | | Stone ( |
| Mutation assessor | Predicts the degree of impact in a protein by scoring the mutation based on the impact it causes regarding the properties of a multiple sequence alignment of homologous sequences. | Reva et al. ( |
| Mutation taster 2 | Uses a Bayes classifier to predict disease associated effects caused by SNPs or Indels. The classifier uses a set of features that includes splicing site and polyadenylation signal information along with structural and evolutionary properties. | Schwarz et al. ( |
| MutPred | Uses a machine-learning approach to predict disease or neutral SNPs. The features used refer to a probability of loss or gain of function regarding several functional and structural properties of the encoded protein. The authors trained SVM and Random Forest models in this work. | Li et al. ( |
| nsSNPAnalyzer | Uses a Random Forest model trained with features (consisting of SIFT score and information from multiple sequence alignment and protein 3D structures) to identify disease associated SNPs. | Bao et al. ( |
| Parepro | An SVM prediction model is used to separate deleterious from neutral SNPs. | Tian et al. ( |
| PANTHER | Using an internal database of HMMs, an evolutionary score is computed and the method predicts deleterious or neutral effects with an attached probability. The cutoff can be defined by the user (default is 3). | Thomas and Kejariwal ( |
| PhD-SNP | This approach uses one of two SVM models: one is trained using sequence profile features and the other is trained using sequence features. The choice of which model to use is based on a preliminary decision: if the mutation exists in the homology profile, the first model is used, otherwise the prediction is done using the second model. | Capriotti et al. ( |
| PMut | Predicts pathological or neutral effects of amino-acid substitutions. The prediction model is a neural network using structural-, physicochemical-, and evolutionary-based features, all calculated using sequence information only (without requiring a 3D protein structure). | Ferrer-Costa et al. ( |
| PolyPhen | A set of rules defined by the authors is used to predict the effect of a SNP. These rules are built on three properties: PSIC score, substitution site properties, and substitution type properties. If one of the rules matches, the output can be deleterious or benign; otherwise the substitution is classified as neutral. | Ramensky ( |
| PolyPhen2 | The follow-up version of PolyPhen. It uses a naive Bayes classifier to predict damaging, benign, or neutral effects of SNPs, and uses structural information if available. | Adzhubei et al. ( |
| PROVEAN | | Choi et al. ( |
| RCOL | Applies Bayes’ formula to calculate the probability of a SNP being deleterious. The likelihood is tested using 20 structural and physicochemical parameters. | Terp et al. ( |
| SAPRED | Using an SVM prediction model, the authors combine features computed from evolutionary, structural, and physicochemical properties to predict disease-associated SNPs. | Ye et al. ( |
| SIFT | Using a PSSM, SIFT determines the probability of a substitution being tolerated in a given position. | Ng and Henikoff ( |
| SNAP | Identifies non-neutral SNPs using a machine-learning approach that combines a battery of neural network models. | Bromberg et al. ( |
| SNPs3D | Combines a set of features obtained from protein 3D structure and evolutionary information to predict deleterious effects using an SVM model. | Yue et al. ( |
| SNPs&GO | A machine-learning approach that includes GO annotations as features in an SVM model to predict whether a SNP is neutral or disease-associated. | Calabrese et al. ( |
| SNPs&GO3D | It is the successor of SNPs&GO. It includes new features obtained from protein 3D structure. | Capriotti and Altman ( |
| Sunyaev ( | This approach uses a set of seven rules empirically defined by the authors to identify nsSNPs. If one of the rules is matched, then the SNP is likely to be deleterious. | Sunyaev ( |
| SuSPect | An SVM model implementation to predict disease phenotypes caused by SNPs. The authors started with a large number of features and narrowed them down to the nine that provided the best performance. | Yates et al. ( |
| VarMod | A machine-learning approach using an SVM model to predict the effect of non-synonymous SNPs, incorporating information on known protein–protein interactions. | Pappalardo and Wass ( |
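Several of the tools in the table score a substitution by how well the mutant residue fits the residues observed at the same position in related sequences. The Python sketch below illustrates that general idea with position-specific frequencies taken from a toy multiple sequence alignment. It is loosely inspired by profile-based scoring and is not the actual algorithm of SIFT or of any other listed tool; the function names, the pseudocount of 1.0, and the example alignment are all assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_probabilities(alignment, position, pseudocount=1.0):
    """Estimate per-residue probabilities at one alignment column.

    A uniform pseudocount (an assumed value) keeps unseen residues from
    receiving a probability of exactly zero. Gap characters are skipped.
    """
    column = [seq[position] for seq in alignment if seq[position] in AMINO_ACIDS]
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def tolerance_score(alignment, position, mutant):
    """Probability of the mutant residue, scaled by the most probable residue.

    Values near 1 mean the residue is as common as the consensus at that
    position; values near 0 mean it is rarely or never observed there.
    """
    probs = column_probabilities(alignment, position)
    return probs[mutant] / max(probs.values())


# Toy alignment of five homologous sequences (assumed data).
alignment = ["MKVLA", "MKILA", "MRVLA", "MKVLG", "MKVLA"]

conserved = tolerance_score(alignment, 0, "W")  # column 0 is always M
variable = tolerance_score(alignment, 2, "I")   # column 2 contains V and I

print(f"M->W at conserved position 0: {conserved:.3f}")
print(f"V->I at variable position 2:  {variable:.3f}")
```

In this toy case the substitution at the fully conserved position scores lower than the one at the variable position. A real method would additionally weight sequences, choose homologs carefully, and calibrate a decision cutoff against labeled variants.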
Figure 2. Summary of properties and approaches for the software listed in Table 1. The approaches fall into four categories: Machine-Learning, Probabilistic, Score (calculating a summarizing score from a set of hand-picked statistics), and Rule (using a set of empirically derived rules). Each approach provides one of two types of classification: a binary classification (e.g., neutral or deleterious) or a multi-class classification (e.g., benign, neutral, and deleterious). The features used by these approaches can be computed from properties in the following categories: (i) physicochemical properties (e.g., solvent accessibility, polarity, charge, disorder, and Grantham distance), (ii) structural information about the primary, secondary, and tertiary structure of a protein (e.g., α-helices, β-sheets, and coil), (iii) evolutionary properties (multiple sequence alignments, position-specific scoring matrices, and Hidden Markov models), and (iv) genome annotation (GO terms or other protein function annotations). The supported variants were determined either by accessing the tools’ websites or from the description of the approach itself.