Literature DB >> 20542913

MuD: an interactive web server for the prediction of non-neutral substitutions using protein structural data.

Gilad Wainreb¹, Haim Ashkenazy, Yana Bromberg, Alina Starovolsky-Shitrit, Turkan Haliloglu, Eytan Ruppin, Karen B Avraham, Burkhard Rost, Nir Ben-Tal.

Abstract

The discrimination between functionally neutral amino acid substitutions and non-neutral mutations, affecting protein function, is very important for our understanding of diseases. The rapidly growing amounts of experimental data enable the development of computational tools to facilitate the annotation of these substitutions. Here, we describe a Random Forests-based classifier, named Mutation Detector (MuD) that utilizes structural and sequence-derived features to assess the impact of a given substitution on the protein function. In its automatic mode, MuD is comparable to alternative tools in performance. However, the uniqueness of MuD is that user-reported protein-specific structural and functional information can be added at run-time, thereby enhancing the prediction accuracy further. The MuD server, available at http://mud.tau.ac.il, assigns a reliability score to every prediction, thus offering a useful tool for the prioritization of substitutions in proteins with an available 3D structure.

Entities: Chemical Disease Gene Mutation Species

Mesh：

Year: 2010 PMID： 20542913 PMCID： PMC2896130 DOI： 10.1093/nar/gkq528

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

The human population contains approximately 10 million single nucleotide polymorphism (SNP) sites (1). The non-synonymous SNPs (nsSNPs) account for a large portion of the known genetic variations associated with human diseases (2). Many experimental mutagenesis studies have been dedicated to the identification of disease-causing amino acid (AA) substitutions among SNP sites. However, experimental mutagenesis is time-, labor- and cost-demanding. Thus, numerous computational tools have been developed to predict effects of AA substitutions on protein function. The reported methods have significantly different input requirements. (i) One set of tools focuses only on sequence-based features (3–7). For example, Ng and Henikoff (5) developed a homology-based algorithm (SIFT; sorting intolerant from tolerant) to estimate the viability of substitutions according to the profiles of AA residues in alignment columns. (ii) A number of approaches extend beyond the use of alignments to sequence-based prediction of structural features (3,5,7). For instance, Bromberg and Rost (3) trained neural networks using, among other features, predicted secondary structure and residue solvent accessibility. (iii) To reflect the differences between the wild type AA and the mutant, several methods utilize physicochemical features (3,6,7), illustrating the differences in the hydropathy, secondary structure propensities, etc. (iv) With the growing number of solved structures, several tools choose to utilize observed structural data such as solvent accessibility (8–11), distance to the ligand (9–11), statistical knowledge-based potentials (9) and micro-environment description (8). (v) A number of studies show that prediction can be improved by combining information from various sources (3,7,11). For example, PolyPhen (11) employs a rule-based system that incorporates information from the UniProtKB/Swiss-Prot annotations (12) together with data extracted from solved 3D-structure and sequence alignment. AA substitution prediction algorithms are usually contingent on a data set of substitution variants that have been experimentally annotated as neutral or non-neutral. The available data sets can be broadly divided into four categories. (i) Relatively clean substitution data collected from extensive mutagenesis studies (13–15). These studies probe nearly all substitutions over whole proteins to reflect the whole spectrum of effects. (ii) Comprehensive collections of naturally occurring substitutions annotated through association studies and targeted laboratory mutagenesis experiments (UniProtKB/Swiss-Prot (12), HGMD (2), etc). Unfortunately, this data might be biased by investigator interest and some of the annotated neutrals are likely non-neutral mutations whose disease associations were overlooked. Moreover, the number of false non-neutral annotations obtained from association studies is also relatively high (16). (iii) The Protein Mutant Database (PMD) (17), representative of the third type of data, avoids the problem of false non-neutrals by reporting substitutions that have been experimentally validated. (iv) The fourth category includes evolutionary model (EM)-based substitution data sets that are a relatively reliable set of neutral mutations created by analyzing single residue substitutions between orthologous proteins (3,7). Herein, we present a web-based tool, named MuD, aimed at distinguishing functionally neutral and non-neutral AA substitutions. MuD utilizes a machine learning algorithm and a set of structural- and sequence-based features. A benchmark experiment using a cross-validation on a subset of the Bromberg and Rost (3) data set (Sub-BR data set) showed similar performance as SNAP (screening for non-acceptable polymorphisms). However, the performance on a test set of three proteins (3-PRO), which have previously been used for benchmarking, confirmed the importance of using reliable experimentally verified structural data (e.g. naturally occurring ligands pertinent to the function of the protein and the biological oligomerization state of the protein). Given this improvement in performance, we enabled the MuD server to allow users to guide the prediction scheme using select structural features. The web server is free and open to all users and there is no login requirement.

METHODS

We employed a Random Forests machine-learning algorithm to differentiate between non-neutral and neutral substitutions. The model was trained and tested on the Sub-BR data set (described below) using cross-validation. In addition, we evaluated the performance on the 3-PRO data set after training on the complete Sub-BR data set.

Data sets

MuD’s model was trained on substitutions extracted from the Bromberg and Rost (3) data set. Since MuD requires both sequence and structural data, we excluded proteins for which the structure was not available. We also excluded proteins with fewer than five homologs, and substitutions in positions with low information content, as measured by the ConSurf (18) confidence interval. This filtration procedure resulted in a set of 19 615 substitutions (70% non-neutral). A balanced data set contributes to the prediction accuracy (19). Thus, we randomly removed approximately a third of the non-neutral substitutions, yielding a balanced set (Sub-BR) comprising 12 133 substitutions from 1178 proteins. Of this subset, 10 253 substitutions originated from PMD and 2065 were EM substitutions. The classification of the Sub-BR proteins according to the Structural Classification of Proteins (SCOP) database (20) can be found on the web server. We extended our examination to the 3-PRO data set, comprised of the Escherichia coli lac I repressor (15), HIV protease (13) and T4 lysozyme (14,21) mutagenesis data. After applying the filtration procedure (as explained previously), we retained subsets with 1773/2230, 148/157 and 315/1475 non-neutral to neutral mutants ratios, respectively; constituting 99%, 90% and 89% of the original data sets, respectively. The original mutagenesis experiments classified each substitution into four categories: no phenotypic effect and three levels of severity of phenotypic change. We followed the reduced binary classification of Ref. (9), where the substitutions with no effect were deemed neutral and the rest were non-neutral.

Machine learning

Random Forests is a classifier consisting of an ensemble of tree-structured classifiers (22). We used the R software implementation of Random Forests (23). The number of trees to grow was set to 650 and the number of random features to be searched at each tree node was the square-root of the number of features.

Data gathering

Both the sequences and the PDB file names required for all data sets were extracted from the corresponding UniProtKB/Swiss-Prot entries (12). The multiple sequence alignments (MSAs) and the PDB (24) files were downloaded from the ConSurf-DB database (25) and from the protein quaternary structure (PQS) server (26), respectively. We excluded proteins sharing <30% sequence identity with the query protein to ensure that the sequences in the MSA would all belong to the same protein fold (27).

Feature set

We used a total of 14 features (with 41 dimensions). The novel features are presented here, whereas traditional descriptors such as secondary structure assignment, UniProtKB/Swiss-Prot annotations, physicochemical and AA preference (using SIFT), number of sequences in the alignment, stability prediction change and evolutionary conservation are included in the Supplementary Data.

Structure-based features

Solvent accessibility and oligomerization interface

The solvent accessibility was calculated according to the structure (among all the protein’s structures) in which the query position had the least area of side chain solvent accessibility. This structure was used to calculate two solvent accessibility features: Cβ density (10) and side chain accessible surface area (using NACCESS (28)). To measure the involvement of the query position in the oligomerization interfaces, we performed a search over all available structures for the maximal difference between the heavy atom density of the query position in its single chain and complex forms.

Ligand distance and binding site conservation

To identify residues involved in ligand binding or catalytic activity, we measured the shortest distance between the heavy atoms of the query position and of the ligands in all available structures. A ligand was defined as a hetero atom compound, dinucleotide chain or a protein chain shorter than 15 AAs. Since crystal structures often include ingredients of the crystallization solution, which are not naturally present, we strove to rank the ligands according to their biological relevance. We assumed that the evolutionary conservation of the binding site indicates whether the ligand bound to the site is indeed naturally present. Therefore, we ranked the ligands according to the ConSurf conservation score assigned to the most conserved AA with a distance of 4 Å from the ligand. This value was selected empirically (Supplementary Data).

Fold three-dimensional residue environment

Fold three-dimensional residue environment (F3DRE) was calculated following Ref. (8), but taking into account the residue environment of homologs (inferred from the MSA). Given a query position, we retrieved from the protein structure all the AA positions Ψ whose Cα atoms were within a distance of 9 Å from the Cα of the query position. Next, according to the alignment of the AA sequence of the query protein in the MSA we defined MSA columns T that correspond to Ψ. For every column t|t ∈ T, we calculated the AA composition vector C|1 ≤ aa ≤ 20, t ∈ T (Supplementary Equation S1), while disregarding gaps in the column. Next, we calculated the F3DRE for every AA as the average AA composition over all T (Supplementary Equation S2).

Sequence-based features

Sequence identity to the closest homolog bearing the substitution (SIDCH)

Counterintuitively, in some of the non-neutral mutations, the mutant AA appeared naturally in the aligned positions in homologous proteins, and we hypothesized that they are likely to occur mostly in distant homologs. Therefore, we added a feature that specifies the sequence identity of the query protein to the closest homolog bearing the mutant AA. For example, the mutation I104S in the human protein transthyretin (UniProtKB/Swiss-Prot ID: TTHY_HUMAN) is known to cause amyloidosis Type 1 (29). Two homologous proteins (A7UIU9_PERFV and Q9PTT3_SPAAU) with sequence identities of 58% and 53% to TTHY_HUMAN feature the AA serine in the corresponding position, and the value of I104S SIDCH was set to 58%.

SNAP

The prediction scheme presented here is based on a solved crystal structure of the query protein, reducing the size of the learning data set considerably. Thus, we also used SNAP, a prediction scheme based on sequence alone. The predictions utilized during the performance assessment were obtained from the testing procedures described in Ref. (3).

Performance measurements

The performance of MuD on the Sub-BR data set was examined using a leave-one-out cross-validation test. To empirically estimate how well the method can be generalized to unseen substitutions it is important that the training and testing sets are as dissimilar as possible. Therefore, during each iteration of the leave-one-out cross-validation, all substitutions from a single protein were designated for testing, whereas the substitutions belonging to proteins with HSSP distances ≤0 (27) to the test protein were designated for training. For alignments of >250 residues this HSSP-distance threshold infers that no pair of proteins had over 21% pairwise sequence identity. To calculate the average and SD for the performance measures, we used a bootstrap procedure with 1000 iterations. At each iteration we randomly selected 60% of the data set while maintaining a balanced ratio of non-neutral to neutral substitutions. The performance measures on each subset were calculated according to predictions obtained during the cross-validation. To compare PolyPhen’s performance, we defined the ‘benign’ and ‘possibly damaging’ predictions as neutral and the prediction ‘probably damaging’ as non-neutral. This binary classification gave the highest Matthew’s correlation coefficient (MCC) value for the PolyPhen predictions on the Sub-BR data set (MCC of 0.39 ± 0.01). The alternative definition, setting ‘benign’ predictions as neutral and ‘possibly damaging’ predictions and ‘probably damaging’ as non-neutral gave a lower MCC of 0.35 ± 0.01. To further measure MuD’s performance, we trained on the Sub-BR data set and tested on the 3-PRO data set in fully- and semi-automatic schemes. As in the cross validation procedure, only proteins with an HSSP-value ≤0 to any of the 3-PRO proteins were retained in the training set. In the semi-automatic scheme, all ligands not present naturally were disregarded and the oligomerization state of the protein was determined according to the literature. Specifically, (i) the T4 lysozyme was predicted as a dimer by the PQS. However, according to the literature its biological unit is a monomer (30). As the T4 lysozyme has been used as a scaffold for the creation of an artificial binding site, most of its solved structures include biologically non-relevant ligands (30). Hence, we disregarded all ligands found in the T4 lysozyme structures. (ii) Most of the HIV protease structures have an inhibitor located at the active site. All other ligands and additional chains were removed. (iii) At the semi-automatic prediction of the lac I repressor substitutions, we considered only the DNA segments and ligands that were located at the known repressor binding site.

Performance measures

We used several standard measures [Equations (1–5), and Supplementary Table S1 and Equation S5] to evaluate the performance of MuD and to compare it with other prediction tools. To assess the overall performance, we utilized both the receiver operating characteristic (ROC) area under curve (AUC) and the MCC. The ROC curve analysis does not require the determination of a decision threshold and as such can better describe the performance of the prediction. However, although the SIFT predictions are numeric they have been optimized for a default cutoff. Furthermore, PolyPhen predictions are not numeric. Therefore, the comparison of performance of SIFT and PolyPhen to MuD was only possible using MCC.

RESULTS

Cross-validation performance

According to the MCC and ROC AUC, MuD and SNAP exhibit similar performance and are both better than SIFT and PolyPhen (Table 1, and Supplementary Table S2 and Figure S1) on the Sub-BR data set.

Table 1.

The performance of MuD, SIFT, PolyPhen and SNAP on the Sub-BR, and the Sub-PMD data sets

	Sub-BR data set				Sub-PMD data set
	MuD	SNAP	SIFT	PolyPhen	MuD	SNAP	SIFT	PolyPhen
Precision	72.8 ± 0.5	68.7 ± 0.4	69.6 ± 0.5	66.6 ± 0.5	73.3 ± 0.4	69.5 ± 0.4	70.6 ± 0.5	67.8 ± 0.5
True positive rate	76.6 ± 0.4	79.9 ± 0.4	73.5 ± 0.5	73.5 ± 0.5	76.6 ± 0.4	79.9 ± 0.4	73.6 ± 0.5	73.5 ± 0.5
Specificity	71.8 ± 0.5	64.1 ± 0.5	68.4 ± 0.5	63.9 ± 0.5	58.8 ± 0.6	48.3 ± 0.6	55.9 ± 0.6	48.9 ± 0.7
MCC	49.5 ± 0.4	48.3 ± 0.8	42.9 ± 0.1	39.2 ± 0.8	45.2 ± 0.1	41.0 ± 1.1	36.0 ± 1.0	29.7 ± 1.1
ROC AUC	81.9 ± 0.3	78.8 ± 0.3	NR	NR	74.8 ± 0.4	70.9 ± 0.4	NR	NR

The average and SD of the performance measures were obtained by a bootstrap procedure run for 1000 iterations performed on the cross-validation predictions. The results on the Sub-PMD data set are a subset of the results obtained during the cross-validation on the entire data set. According to the MCC, MuD and SNAP perform better than SIFT and PolyPhen both on the entire data set and on the PMD subset. According to the ROC AUC, MuD performed better than SNAP. However, according to the MCC both methods exhibited similar performance. Although all methods exhibited a decline in the performance on the Sub-PMD data set relative to the performance on the Sub-BR data set, MuD surpassed all methods on the Sub-PMD data set. Values in the table have been multiplied by 100.

The performance of MuD, SIFT, PolyPhen and SNAP on the Sub-BR, and the Sub-PMD data sets The average and SD of the performance measures were obtained by a bootstrap procedure run for 1000 iterations performed on the cross-validation predictions. The results on the Sub-PMD data set are a subset of the results obtained during the cross-validation on the entire data set. According to the MCC, MuD and SNAP perform better than SIFT and PolyPhen both on the entire data set and on the PMD subset. According to the ROC AUC, MuD performed better than SNAP. However, according to the MCC both methods exhibited similar performance. Although all methods exhibited a decline in the performance on the Sub-PMD data set relative to the performance on the Sub-BR data set, MuD surpassed all methods on the Sub-PMD data set. Values in the table have been multiplied by 100. The Sub-BR data set consists of two subsets: substitutions extracted from the PMD database and neutral EM substitutions. We reviewed the performance of MuD on each of these subsets separately. MuD did better than SIFT, PolyPhen and SNAP on the Sub-PMD data set. However, the performance of all four methods on this set was lower than on the entire data set. MuD had a high FP rate of 41.1 ± 0.6% on the substitutions extracted from the PMD compared to only 2.0 ± 0.2% on the EM subset. The implications of this are discussed in the Supplementary Data. Prediction performance may also depend on the protein’s structural class. To examine this possibility, we analyzed the cross-validation results of SNAP and MuD and the predictions made by SIFT and PolyPhen on the Sub-BR data set according to the SCOP class assignment of the query proteins (Supplementary Figure S2). Ninty seven percent of the Sub-BR data set mutations occur at proteins assigned by SCOP to one of the seven ‘true’ SCOP classes. Across all SCOP classes MuD performed as good as any of the other methods or better. The performance of MuD, SNAP and SIFT on the ‘membrane and cell surface proteins and peptides’ and ‘small proteins’ classes showed a decline in the performance relative to the four main SCOP classes (all α, all β, α / β, α + β). These classes comprise a small number of substitutions (437 and 391, respectively), but the decline may be indicative of a true difficulty.

Performance on the 3-PRO data set

We also evaluated performance on additional data sets that have previously been used for benchmarking. Our results of the fully automated and of the semi-automatic schemes were compared with SIFT, PolyPhen and SNAP (Table 2). A comparison between all the fully automated methods indicated that neither was favorable over the others. SNAP performed best on the T4 lysozyme, SIFT performed best on the HIV protease and MuD performed best on the lac I repressor data set. However, semi-automatic MuD surpassed all other methods in all performance measures.

Table 2.

Introduction of reliable structural data improves the prediction performance

	HIV protease^a				T4 lysozyme					Lac repressor
	SA MuD	MuD	SNAP	SIFT	SA MuD	MuD	SNAP	SIFT	PP	SA MuD	MuD	SNAP	SIFT	PP
Precision	67.6	61.6	54.3	66.0	37.0	24.4	34.9	29.8	31.1	79.2	81.83	64.1	65.2	83.9
TP rate	93.2	95.3	98.0	93.2	92.1	97.1	81.0	91.4	89.9	77.9	77.83	74.6	70.1	67.9
Specificity	58.0	44.3	22.3	54.8	66.5	35.9	67.8	54.1	58.0	83.8	83.00	66.9	70.3	85.7
MCC	54.4	45.6	30.7	51.7	45.0	27.4	37.7	34.8	36.4	61.8	60.93	41.2	40.2	54.2
ROC AUC	89.9	81.2	78.2	NR	87.4	83.7	83.8	NR	NR	89.5	89.0	79.3	NR	NR

The performance of PolyPhen (PP), SNAP, SIFT, the semi-automatic (SA MuD) and the fully-automatic MuD (MuD) on the 3-PRO data set. The performance of the automatic predictors appears to depend on the evaluation criterion and data set. Nevertheless as indicated by all measures, the application of the semi-automatic prediction scheme improved the performance, thus surpassing all the fully-automatic methods. In the semi-automatic prediction, we removed from the solved crystal structures of the query proteins, the structures of the non-naturally present ligands and selected the appropriate dimerization state for each of the proteins. Values in the table have been multiplied by 100.

aPolyPhen did not produce predictions for the HIV-1 protease.

Introduction of reliable structural data improves the prediction performance The performance of PolyPhen (PP), SNAP, SIFT, the semi-automatic (SA MuD) and the fully-automatic MuD (MuD) on the 3-PRO data set. The performance of the automatic predictors appears to depend on the evaluation criterion and data set. Nevertheless as indicated by all measures, the application of the semi-automatic prediction scheme improved the performance, thus surpassing all the fully-automatic methods. In the semi-automatic prediction, we removed from the solved crystal structures of the query proteins, the structures of the non-naturally present ligands and selected the appropriate dimerization state for each of the proteins. Values in the table have been multiplied by 100. aPolyPhen did not produce predictions for the HIV-1 protease.

The web server

The MuD web server implementation encourages the user to introduce into the prediction scheme specific biological data about the target protein. The graphical user interface and an example of the results page are depicted in Supplementary Figure S6.

DISCUSSION

We tested MuD using cross-validation analysis on the Sub-BR data set and found automatic MuD to be as good as SNAP, and better than PolyPhen and SIFT. However, the assessment also indicates that MuD might be less suitable for the prediction of substitutions in non-globular and small proteins. The incorporation of structural features such as solvent accessibility, ligand proximity and oligomerization interfaces into the automatic MuD might not always be advantageous. This is due to (i) the presence of non-naturally present ligands, such as ingredients of the crystallization solution and (ii) incorrect oligomerization state assignments that may hinder the prediction accuracy. To alleviate these problems, the MuD web server implementation offers a semi-automatic scheme. It enables the user to incorporate additional data about the target protein in order to improve the prediction accuracy. This procedure aims reducing the erroneous features that might be extracted from the crystal structure. Specifically, the graphical interface of the server allows the user to filter out irrelevant ligands and to select the structure of the biological unit. Additionally, the user can change the indication of functionally important residues. Detailed analysis of three well-characterized proteins showed that the incorporation of target-specific data on each protein via the semi-automatic scheme improved the prediction performance, surpassing current methods. The largest difference in the performance between the semi-automatic and automatic MuD is manifested in the T4 lysozyme data set. This is not surprising since many of the PDB T4 lysozyme structures contain non-relevant ligands and do not present the protein in its correct biological shape. An important aspect of MuD is the ascription of a reliability score to every prediction (Supplementary Data). The reliability score offers the researcher a qualitative assessment of the prediction, indicating its expected accuracy, sensitivity and precision. The reliability score can be used to prioritize substitutions in a given set according to their likelihood of affecting the protein function. We are hopeful that this tool will assist researchers in the annotation of disease causing substitutions in proteins with a solved crystal structure.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

European Commission FP6 Integrated Project EuroHear LSHG-CT- 20054-512063 (to N.B.-T., K.B.A.); NATO traveling grant No. CBP.MD.CLG 983009 (to T.H. and N.B-T); National Library of Medicine (grant 2-R01-LM007329 to Y.B.). Funding for open access charge: European Commission FP6 Integrated Project EuroHear LSHG-CT- 20054-512063. Conflict of interest statement. None declared.

28 in total

Review 1. SNP association studies in Alzheimer's disease highlight problems for complex disease analysis.

Authors: T Emahazion; L Feuk; M Jobs; S L Sawyer; D Fredman; D St Clair; J A Prince; A J Brookes
Journal: Trends Genet Date: 2001-07 Impact factor: 11.639

2. Enzyme function less conserved than anticipated.

Authors: Burkhard Rost
Journal: J Mol Biol Date: 2002-04-26 Impact factor: 5.469

3. Evaluation of structural and evolutionary contributions to deleterious mutation prediction.

Authors: Christopher T Saunders; David Baker
Journal: J Mol Biol Date: 2002-09-27 Impact factor: 5.469

4. Complete mutagenesis of the HIV-1 protease.

Authors: D D Loeb; R Swanstrom; L Everitt; M Manchester; S E Stamper; C A Hutchison
Journal: Nature Date: 1989-08-03 Impact factor: 49.962

5. Molecular recognition. Conformational analysis of limited proteolytic sites and serine proteinase protein inhibitors.

Authors: S J Hubbard; S F Campbell; J M Thornton
Journal: J Mol Biol Date: 1991-07-20 Impact factor: 5.469

6. Temperature-sensitive mutations of bacteriophage T4 lysozyme occur at sites with low mobility and low solvent accessibility in the folded protein.

Authors: T Alber; D P Sun; J A Nye; D C Muchmore; B W Matthews
Journal: Biochemistry Date: 1987-06-30 Impact factor: 3.162

7. Rapid crystallization of T4 lysozyme by intermolecular disulfide cross-linking.

Authors: D W Heinz; B W Matthews
Journal: Protein Eng Date: 1994-03

8. The Ile-84-->Ser amino acid substitution in transthyretin interferes with the interaction with plasma retinol-binding protein.

Authors: R Berni; G Malpeli; C Folli; J R Murrell; J J Liepnieks; M D Benson
Journal: J Biol Chem Date: 1994-09-23 Impact factor: 5.157

9. Genetic studies of the lac repressor. XIV. Analysis of 4000 altered Escherichia coli lac repressors reveals essential and non-essential residues, as well as "spacers" which do not require a specific sequence.

Authors: P Markiewicz; L G Kleina; C Cruz; S Ehret; J H Miller
Journal: J Mol Biol Date: 1994-07-29 Impact factor: 5.469

10. Systematic mutation of bacteriophage T4 lysozyme.

Authors: D Rennell; S E Bouvier; L W Hardy; A R Poteete
Journal: J Mol Biol Date: 1991-11-05 Impact factor: 5.469

19 in total

1. The road from next-generation sequencing to personalized medicine.

Authors: Manuel L Gonzalez-Garay
Journal: Per Med Date: 2014 Impact factor: 2.512

2. A new disease-specific machine learning approach for the prediction of cancer-causing missense variants.

Authors: Emidio Capriotti; Russ B Altman
Journal: Genomics Date: 2011-07-07 Impact factor: 5.736

Review 3. Role of Structural Bioinformatics in Drug Discovery by Computational SNP Analysis: Analyzing Variation at the Protein Level.

Authors: David K Brown; Özlem Tastan Bishop
Journal: Glob Heart Date: 2017-03-13

4. Prediction of phenotypes of missense mutations in human proteins from biological assemblies.

Authors: Qiong Wei; Qifang Xu; Roland L Dunbrack
Journal: Proteins Date: 2012-11-05

5. FunSAV: predicting the functional effect of single amino acid variants using a two-stage random forest model.

Authors: Mingjun Wang; Xing-Ming Zhao; Kazuhiro Takemoto; Haisong Xu; Yuan Li; Tatsuya Akutsu; Jiangning Song
Journal: PLoS One Date: 2012-08-24 Impact factor: 3.240

6. Interpretation of the consequences of mutations in protein kinases: combined use of bioinformatics and text mining.

Authors: Jose M G Izarzugaza; Martin Krallinger; Alfonso Valencia
Journal: Front Physiol Date: 2012-08-22 Impact factor: 4.566

7. Improving the prediction of disease-related variants using protein three-dimensional structure.

Authors: Emidio Capriotti; Russ B Altman
Journal: BMC Bioinformatics Date: 2011-07-05 Impact factor: 3.169

8. Bioinformatics challenges for personalized medicine.

Authors: Guy Haskin Fernald; Emidio Capriotti; Roxana Daneshjou; Konrad J Karczewski; Russ B Altman
Journal: Bioinformatics Date: 2011-05-19 Impact factor: 6.937

9. Protein stability: a single recorded mutation aids in predicting the effects of other mutations in the same amino acid site.

Authors: Gilad Wainreb; Lior Wolf; Haim Ashkenazy; Yves Dehouck; Nir Ben-Tal
Journal: Bioinformatics Date: 2011-10-13 Impact factor: 6.937

10. Prioritization of pathogenic mutations in the protein kinase superfamily.

Authors: Jose M G Izarzugaza; Angela del Pozo; Miguel Vazquez; Alfonso Valencia
Journal: BMC Genomics Date: 2012-06-18 Impact factor: 3.969