nsSNPAnalyzer: identifying disease-associated nonsynonymous single nucleotide polymorphisms.

Abstract

Nonsynonymous single nucleotide polymorphisms (nsSNPs) are prevalent in genomes and are closely associated with inherited diseases. To help distinguish disease-associated nsSNPs from the large number of neutral nsSNPs, it is important to develop computational tools that predict an nsSNP's phenotypic effect (disease-associated versus neutral). nsSNPAnalyzer, a web-based tool developed for this purpose, extracts structural and evolutionary information from a query nsSNP and uses a machine learning method called Random Forest to predict the nsSNP's phenotypic effect. The nsSNPAnalyzer server is available at http://snpanalyzer.utmem.edu/.

Year: 2005 PMID: 15980516 PMCID: PMC1160133 DOI: 10.1093/nar/gki372

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Assessing susceptibility to diseases based on an individual's genotype has long been a central theme of genetics studies. Among inherited gene variations in humans, nonsynonymous single nucleotide polymorphisms (nsSNPs), which change an amino acid in the protein product, are the most relevant to human inherited diseases (1). nsSNPs can be classified into two categories according to their phenotypic effects: those that have deleterious effects on protein function and are hence disease-associated, and those that are functionally neutral. Given the huge number of nsSNPs already discovered (2,3), a major challenge is to predict which of them are potentially disease-associated. Computational tools have been developed to predict an nsSNP's phenotypic effect, e.g. the SIFT server (4) and the PolyPhen server (5). Recent studies have shown that combining information from multiple sequence alignment and three-dimensional protein structure can increase prediction accuracy (6). The nsSNPAnalyzer server integrates multiple sequence alignment and protein structure analysis to identify disease-associated nsSNPs. It takes a protein sequence and the accompanying nsSNP as inputs and reports whether the nsSNP is likely to be disease-associated or functionally neutral. It also provides additional information about the nsSNP to facilitate biological interpretation of the results, e.g. the structural environment class and the multiple sequence alignment.

PROGRAM DESCRIPTION

Algorithm and implementation

nsSNPAnalyzer is a web server that implements a machine learning method for nsSNP classification. The program design and data flow are illustrated in Figure 1. Briefly, on receiving the input sequence, nsSNPAnalyzer searches the ASTRAL database (7) for homologous protein structures. This step is skipped if the user supplies a protein structure. nsSNPAnalyzer then calculates three types of information from the user's input: (i) the structural environment of the SNP, including the solvent accessibility, environmental polarity and secondary structure (8); (ii) the normalized probability of the substitution in the multiple sequence alignment (9); and (iii) the similarity and dissimilarity between the original and mutated amino acids. These features are passed to a machine learning method called Random Forest (10), a classifier consisting of an ensemble of tree-structured classifiers, which classifies the nsSNPs. The Random Forest classifier was trained to combine these heterogeneous predictors using a curated training dataset prepared from the SwissProt database (11). Several recent studies have reported that Random Forest outperforms other machine learning approaches (12–14). For nsSNP phenotypic effect prediction, we likewise found that Random Forest gave the best results on this training dataset. In a cross-validation test, the false positive rate was 38% and the false negative rate was 21% (15). The nsSNPAnalyzer web server runs on Red Hat Linux 8.0, with the Common Gateway Interface scripts written in PHP.
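To make the reported error rates concrete, the following is a minimal Python sketch of how a false positive rate and a false negative rate can be computed from true and predicted labels. It is not the server's code (which the text says is written in PHP). The function name `error_rates`, the label strings and the toy counts are assumptions chosen only so that the rates match the 38% and 21% quoted above; they do not reflect the size or composition of the actual training dataset.

```python
def error_rates(true_labels, predicted_labels, positive="disease"):
    """Return (false_positive_rate, false_negative_rate).

    A false positive is a neutral nsSNP predicted as disease-associated;
    a false negative is a disease-associated nsSNP predicted as neutral.
    """
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == positive:
            if guess == positive:
                tp += 1
            else:
                fn += 1
        else:
            if guess == positive:
                fp += 1
            else:
                tn += 1
    if fp + tn == 0 or fn + tp == 0:
        raise ValueError("both classes must be present in true_labels")
    return fp / (fp + tn), fn / (fn + tp)


# Hypothetical toy data: 100 neutral and 100 disease-associated nsSNPs,
# with counts invented to reproduce the rates quoted in the text.
truth = ["neutral"] * 100 + ["disease"] * 100
guess = (["disease"] * 38 + ["neutral"] * 62      # 38 of 100 neutral misclassified
         + ["neutral"] * 21 + ["disease"] * 79)   # 21 of 100 disease misclassified

fpr, fnr = error_rates(truth, guess)
print(f"false positive rate: {fpr:.0%}, false negative rate: {fnr:.0%}")
```

Read this way, a 21% false negative rate corresponds to a sensitivity of 79%, and a 38% false positive rate corresponds to a specificity of 62%.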

Figure 1

The program design and data flow of nsSNPAnalyzer.

Input

Two inputs are mandatory: the protein sequence in FASTA format and the nsSNPs to be analyzed. An nsSNP is denoted as X#Y, where X is the original amino acid in one-letter code, # is the position of the substitution (counting from 1), and Y is the mutated amino acid in one-letter code. Multiple nsSNPs in a protein should be separated by new-line characters. Users may provide the inputs by copy-paste or by file upload. In addition to the two mandatory inputs, users may upload a protein structure file in PDB format if they want their own structure to be used. Finally, because the calculation usually takes a while, users may provide an email address to avoid waiting online; the results are sent to that address when the calculations are finished. Users can use the sample data to learn the input format and perform a demo run.
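The input format described above can be checked before submission. The sketch below, in Python, parses a FASTA sequence and a newline-separated list of X#Y substitutions and verifies that each original residue matches the sequence at the stated 1-based position. All function names and the demo sequence are hypothetical; this illustrates the notation only and is not the server's own validation logic.

```python
import re

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
SNP_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def read_fasta_sequence(text):
    """Return the residues of the first record in a FASTA-formatted string."""
    residues = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                break  # a second header starts another record; stop here
            seen_header = True
            continue
        residues.append(line.upper())
    return "".join(residues)


def parse_nssnp(token, sequence):
    """Parse one X#Y token and check it against the sequence (1-based)."""
    match = SNP_PATTERN.match(token.strip().upper())
    if match is None:
        raise ValueError(f"not in X#Y form: {token!r}")
    original, position, mutated = match.group(1), int(match.group(2)), match.group(3)
    if original not in AMINO_ACIDS or mutated not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid code in {token!r}")
    if original == mutated:
        raise ValueError(f"no amino acid change in {token!r}")
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[position - 1] != original:
        raise ValueError(
            f"sequence has {sequence[position - 1]} at position {position}, not {original}"
        )
    return original, position, mutated


def parse_nssnp_list(text, sequence):
    """Parse newline-separated nsSNPs, skipping blank lines."""
    return [parse_nssnp(line, sequence) for line in text.splitlines() if line.strip()]


# Hypothetical demo input: a made-up 12-residue sequence and two substitutions.
demo_fasta = ">demo_protein\nMKTAYIDKQ\nRLG\n"
demo_sequence = read_fasta_sequence(demo_fasta)
demo_snps = parse_nssnp_list("D7N\nR10W\n", demo_sequence)
```

Checking the original residue against the sequence catches off-by-one position errors, which are easy to make when a numbering scheme does not start from 1.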

Output

The results of nsSNPAnalyzer are displayed on a web page and stored on the server for a week. A link to the results page can also be sent to the user by email. A sample output is shown in Figure 2. The output includes several calculated features of the nsSNP: (i) the predicted phenotypic class (disease-associated versus neutral); (ii) a hyperlink to the homologous structure with a SCOP identifier (7); (iii) the normalized probability of the substitution calculated by the SIFT program (4); (iv) the area buried score, a measure of solvent accessibility; (v) the fraction polar score, a measure of environmental polarity related to hydrogen bond formation; (vi) the secondary structure (helix, sheet or coil); and (vii) the structural environment class, a discrete class defined by combining features (iv)–(vi) (8). The area buried score and fraction polar score are calculated by the ENVIRONMENT program (8), and the secondary structure is assigned by the STRIDE program (16). The user can click the ‘View Alignment’ button to see the local sequence alignment spanning the substitution site and judge directly how variable that position is. The original amino acid is highlighted in blue, and the mutated amino acid is highlighted in red.

Figure 2

The output of nsSNPAnalyzer. (A) The main output page of nsSNPAnalyzer. The user can click the icon to see the interpretation of each field. (B) An example of local sequence alignment spanning the nsSNP (D7N). The original amino acid (D) is highlighted in blue, and the mutated amino acid (N) is highlighted in red.

FUTURE PLANS

Given the considerable CPU cost of the calculation, we plan to provide precalculated results for all human nsSNPs in dbSNP (17) for which homologous structures are available. We will also test whether structural predictors can be extracted from predicted structures, which would remove the requirement for experimentally determined structures.

16 in total

1. dbSNP: the NCBI database of genetic variation.

Authors: S T Sherry; M H Ward; M Kholodov; J Baker; L Phan; E M Smigielski; K Sirotkin
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

2. Predicting deleterious amino acid substitutions.

Authors: P C Ng; S Henikoff
Journal: Genome Res Date: 2001-05 Impact factor: 9.043

3. Random forest: a classification and regression tool for compound classification and QSAR modeling.

Authors: Vladimir Svetnik; Andy Liaw; Christopher Tong; J Christopher Culberson; Robert P Sheridan; Bradley P Feuston
Journal: J Chem Inf Comput Sci Date: 2003 Nov-Dec

4. The ASTRAL Compendium in 2004.

Authors: John-Marc Chandonia; Gary Hon; Nigel S Walker; Loredana Lo Conte; Patrice Koehl; Michael Levitt; Steven E Brenner
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

5. Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitro.

Authors: Erik C Gunther; David J Stone; Robert W Gerwien; Patricia Bento; Melvyn P Heyes
Journal: Proc Natl Acad Sci U S A Date: 2003-07-17 Impact factor: 11.205

6. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants.

Authors: Yum L Yip; Holger Scheib; Alexander V Diemand; Alexandre Gattiker; Livia M Famiglietti; Elisabeth Gasteiger; Amos Bairoch
Journal: Hum Mutat Date: 2004-05 Impact factor: 4.878

7. Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information.

Authors: Lei Bao; Yan Cui
Journal: Bioinformatics Date: 2005-03-03 Impact factor: 6.937

8. Knowledge-based protein secondary structure assignment.

Authors: D Frishman; P Argos
Journal: Proteins Date: 1995-12

9. Human non-synonymous SNPs: server and survey.

Authors: Vasily Ramensky; Peer Bork; Shamil Sunyaev
Journal: Nucleic Acids Res Date: 2002-09-01 Impact factor: 16.971

10. Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data.

Authors: Baolin Wu; Tom Abbott; David Fishman; Walter McMurray; Gil Mor; Kathryn Stone; David Ward; Kenneth Williams; Hongyu Zhao
Journal: Bioinformatics Date: 2003-09-01 Impact factor: 6.937
