coliSNP database server mapping nsSNPs on protein structures.

Hidetoshi Kono, Tomo Yuasa, Shinya Nishiue, Kei Yura.

Abstract

We have developed coliSNP, a database server (http://yayoi.kansai.jaea.go.jp/colisnp) that maps non-synonymous single nucleotide polymorphisms (nsSNPs) on the three-dimensional (3D) structure of proteins. Once a week, the SNP data from the dbSNP database and the protein structure data from the Protein Data Bank (PDB) are downloaded, and the correspondence of the two data sets is automatically tabulated in the coliSNP database. Given an amino acid sequence, protein name or PDB ID, the server will immediately provide known nsSNP information, including the amino acid mutation caused by the nsSNP, the solvent accessibility, the secondary structure and the flanking residues of the mutated residue in a single page. The position of the nsSNP within the amino acid sequence and on the 3D structure of the protein can also be observed. The database provides key information with which to judge whether an observed nsSNP critically affects protein function and/or stability. As far as we know, this is the only web-based nsSNP database that automatically compiles SNP and protein information in a concise manner.

Year: 2007 PMID: 17921498 PMCID: PMC2238833 DOI: 10.1093/nar/gkm801

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Single nucleotide polymorphisms (SNPs) have the potential to affect gene function, especially when they are located in coding or regulatory regions. Among the many types of SNPs, non-synonymous SNPs (nsSNPs) are believed to have the greatest impact on protein function because they often lead to mutation of the encoded amino acids, which can have a deleterious effect on the structure and/or function of the proteins. Such nsSNPs are often associated with disease-modifying alleles that have been compiled, for example, in the OMIM database (http://www.ncbi.nlm.nih.gov/omim/) (1). Disease-associated SNPs were often interpreted solely on the basis of their sequences, mainly with respect to sequence conservation; however, thanks to structural genomics projects in the USA, Canada, Europe and Japan, which now make available more than 40 000 protein structures (2), it is possible to interpret the effects of a large number of SNPs on three-dimensional (3D) protein structures. To further investigate the possible causes of disease at the molecular level, we have started mapping nsSNPs on 3D protein structures and developed a database named coliSNP (Clue of Life SNP), which provides users with both the protein sequence and structural information on nsSNPs, enabling them to gain significant insight into the effects of nsSNPs at the molecular level. To date, several databases, including SAAP (3), PolyDoms (4), topoSNP (5), SNPeffect (6), SNPs3D (7), MutDB (8,9) and LS-SNP (10), have been developed to provide links between SNPs and protein sequence/structure data and/or cellular processes such as localization, phosphorylation and glycosylation. Among these, topoSNP, SNPs3D, MutDB and LS-SNP have a direct link to 3D protein structures from nsSNP locations within nucleotide sequences. Apparently, however, these databases are no longer actively maintained or are updated only about once a year, at best.
The coliSNP database we have launched is automatically updated every week and contains up-to-date nsSNP data mapped on the 3D protein structures. Moreover, coliSNP enables visualization of 3D protein structures directly using Jmol or RasMol by downloading the coordinates attached to a RasMol script. Both of these features are unique to coliSNP and enable one to easily observe the locations of mutations caused by nsSNPs, even when the nsSNPs/protein structures have only been very recently identified/determined. The regularly updated nsSNP information, combined with 3D protein structures, represents an invaluable resource for evaluating the effect of the mutation on protein function and stability.

Data sources and integrated information

To develop the coliSNP database, we integrated three publicly available databases: RefSeq, for a comprehensive, integrated, non-redundant set of sequences (11); Protein Data Bank (PDB), for 3D protein structures (2); and dbSNP, for SNP information (12). We first compared dbSNP and RefSeq and built a temporary database, RefSeqSNP, which was a subset of RefSeq that only contained amino acid sequences encoded by genes with nsSNPs. We then used Basic Local Alignment Search Tool (BLAST) (13) with default parameters to search for PDB entries that matched each of the sequences in RefSeqSNP. We collected the PDB entries whose amino acid sequence identity against the query was >95% over the region aligned by BLAST if the length of the aligned region was ≥ 30 amino acids. At this point, we removed similar amino acid sequences by limiting the PDB entries to those that were ranked at the top in each of the 90% sequence identity clusters and compiled in the ‘clusters90.txt’ file provided by PDB. Our reasoning was that sequences with such a high identity would assume very similar tertiary structures, close enough to assess the impact of mutations.
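The selection criteria described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the BlastHit record, its field names, the select_hits function and the example identifiers are all hypothetical, and the set of cluster representatives is assumed to have been extracted beforehand from the 'clusters90.txt' file, whose layout is not described in the text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class BlastHit:
    """One BLAST alignment between a RefSeqSNP query and a PDB sequence.

    All field names are hypothetical; they are not taken from the paper.
    """
    query_id: str            # identifier of the RefSeqSNP query sequence
    pdb_chain: str           # identifier of the matched PDB sequence
    percent_identity: float  # identity over the aligned region, in percent
    alignment_length: int    # length of the aligned region, in residues


MIN_IDENTITY = 95.0      # text: identity must be > 95%
MIN_ALIGNED_LENGTH = 30  # text: aligned region must be >= 30 amino acids


def passes_thresholds(hit: BlastHit) -> bool:
    """Apply the identity and aligned-length cut-offs quoted in the text."""
    return (hit.percent_identity > MIN_IDENTITY
            and hit.alignment_length >= MIN_ALIGNED_LENGTH)


def select_hits(hits: Iterable[BlastHit],
                cluster_representatives: Set[str]) -> List[BlastHit]:
    """Keep hits that pass the cut-offs and whose PDB sequence is the
    top-ranked member of its 90% sequence identity cluster."""
    return [h for h in hits
            if passes_thresholds(h) and h.pdb_chain in cluster_representatives]


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    representatives = {"1ABC_A"}
    example_hits = [
        BlastHit("query1", "1ABC_A", 99.0, 120),  # kept
        BlastHit("query1", "2XYZ_B", 99.0, 120),  # not a cluster representative
        BlastHit("query2", "1ABC_A", 95.0, 120),  # identity not above 95%
    ]
    for hit in select_hits(example_hits, representatives):
        print(hit.query_id, hit.pdb_chain)
```

Passing the representatives in as a plain set keeps the sketch independent of how the cluster file is actually parsed.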

Data access and the search interface

coliSNP can be accessed at http://yayoi.kansai.jaea.go.jp/colisnp. In the search form (Figure 1), a user can provide several search keys on the protein and/or the SNPs. In the protein section, the user can use as query terms the amino acid sequence, PDB ID, molecule name and keyword in PDB. The user can also limit the scope of the 3D protein structures to be mapped for SNPs by specifying the organisms that the proteins were derived from. In the SNP section, the user can give the organism, allele type and heterozygosity as queries. The user can also use both the protein and SNP sections to narrow the search.

Figure 1.

The coliSNP search interface. The user can use the protein section, SNP section or both to set the search conditions.

As shown in Figure 2, the search result emerges with information about the mutation, the flanking amino acids, the secondary structure and the solvent accessibility of the mutated residue, allele frequency and heterozygosity. The output page also has a link to the original dbSNP database, enabling more detailed information to be obtained. If desired, the information can be saved in a flat text format for further analysis. The user can also easily observe the location of an nsSNP on the 3D protein structure by clicking either the ‘Download Structure’ link or the ‘Structure View’ box. We adopted two graphics programs, RasMol (http://openrasmol.org) and Jmol (http://jmol.sourceforge.net), for 3D protein visualization. The former is one of the most widely used software packages for visualizing the 3D structures of proteins and has a number of handy operations. The latter displays 3D structures in a Java-implemented browser.

Figure 2.

A typical search result. nsSNP information is provided with structural information on the mutated amino acid residue—e.g. the secondary structure and solvent accessibility.

nsSNP location and its impact on 3D protein structure

One of the unique features of the coliSNP database is that it gives the solvent accessibility of a wild-type residue that has been mutated by an nsSNP. We found that the solvent accessibility is the best indicator of the impact of a mutation on protein function. Other properties that we evaluated include the secondary structure where the mutation occurred, changes in hydrogen bonding, and the chemical properties of the affected residue. The correlation between the effect of substituting a single residue and its solvent accessibility has long been discussed (14–16). To provide a quantitative limit for the solvent accessibility of residues able to tolerate mutation caused by nsSNPs, we collected experimental data showing the relationship between point mutations and the activities of proteins with known 3D structures. The point mutation studies on Lac repressor (17) and T4 lysozyme (18), in particular, provided us with sufficient data to determine that limit of solvent accessibility. The solvent accessibility was calculated with ASC (19). We then re-examined the relationship between solvent accessibility and viability of the organism for these two proteins. Figure 3 shows the loss of function rate plotted against the solvent accessibility of the wild-type residue. In the case of Lac repressor, about 90% of mutation-tolerant sites (see Figure 3 caption) were located at positions where the solvent accessibility of the wild-type residue was >30%, and about 80% of mutations in intolerant or partially tolerant sites were located at positions where the solvent accessibility was ≤30%. Based on this observation, we decided to provide the solvent accessibility value of the mutated residue together with the 3D structure of the protein in the database, and the residues with solvent accessibility of ≤30% were marked in yellow in the 3D structures. We believe that these data enable one to evaluate the possible effect of nsSNPs on protein stability and function.
For instance, the nsSNP in human SYK kinase shown in Figure 2 results in the substitution of Arg45 with His, and the solvent accessibility is 8%. Because of the degree to which this residue is buried, it is highly likely that the substitution will have a deleterious effect on the protein's function and/or stability, as suggested by Figure 3. In fact, the residue forms one of the loops for domain association and is located relatively close to the phosphorylated Tyr of the target peptide (20). Both of these pieces of information are easily retrieved from coliSNP and may add medically important annotation to the SNP site. It is worth noting that the impact of a mutation should also be evaluated based on sequence conservation. Disease-associated nsSNPs tend to be located at highly conserved sites (21). This information will be incorporated in the coliSNP database in the near future.
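As a rough illustration of the 30% rule, the following Python sketch flags a residue as buried and writes RasMol commands that colour the flagged residues yellow. It is a sketch under stated assumptions: the function names are hypothetical, the script actually attached by coliSNP is not shown in the text and may differ, and only the standard RasMol commands select and color are used.

```python
from typing import Iterable

BURIED_THRESHOLD = 30.0  # percent; text: residues with accessibility <= 30% are marked


def is_buried(solvent_accessibility: float) -> bool:
    """Return True when the wild-type residue falls at or below the 30% limit."""
    return solvent_accessibility <= BURIED_THRESHOLD


def rasmol_highlight_script(residue_numbers: Iterable[int]) -> str:
    """Build RasMol commands that colour the given residues yellow.

    Hypothetical helper: it only mimics the yellow marking described
    in the text and is not the script distributed by coliSNP.
    """
    lines = []
    for number in residue_numbers:
        lines.append(f"select {number}")  # select the residue by its number
        lines.append("color yellow")      # colour the current selection
    return "\n".join(lines)


if __name__ == "__main__":
    # Arg45 of human SYK kinase has a solvent accessibility of 8% (see text).
    accessibility = {45: 8.0}
    buried = [n for n, acc in accessibility.items() if is_buried(acc)]
    print(rasmol_highlight_script(buried))
```

For the SYK example, residue 45 is flagged because 8% lies below the 30% limit.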

Figure 3.

Cumulative plots of tolerant, partially tolerant and intolerant sites in Lac repressor against the solvent accessibility. In the experiment (17), 12 or 13 mutations (depending on the identity of the wild-type residue) were tested at 124 sites. We defined the tolerance at each site as follows: tolerant, <5 of the mutations cause loss of function (45 sites); intolerant, >8 of the mutations cause loss of function (69 sites); and partially tolerant, 5–8 of the mutations cause loss of function (10 sites). The solvent accessibility was calculated using the program ASC (19) with a protein–DNA complex form (PDB:1EFA) or a tetrameric form (PDB:1LBI), depending on the site considered.
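The tolerance classes defined in the caption, and the cumulative fraction plotted against solvent accessibility, can be expressed as a short Python sketch. The function names and the example accessibility values are hypothetical; only the cut-offs (<5, 5–8, >8 loss-of-function mutations per site) come from the caption.

```python
from typing import Sequence


def classify_site(loss_of_function_count: int) -> str:
    """Classify a site by the number of tested mutations that cause loss of function."""
    if loss_of_function_count < 5:
        return "tolerant"
    if loss_of_function_count > 8:
        return "intolerant"
    return "partially tolerant"  # 5 to 8 loss-of-function mutations


def cumulative_fraction(accessibilities: Sequence[float], cutoff: float) -> float:
    """Fraction of sites whose solvent accessibility is <= cutoff.

    Evaluating this at increasing cut-offs gives one curve of a
    cumulative plot for a single tolerance class.
    """
    if not accessibilities:
        return 0.0
    below = sum(1 for value in accessibilities if value <= cutoff)
    return below / len(accessibilities)


if __name__ == "__main__":
    print(classify_site(3))   # tolerant
    print(classify_site(6))   # partially tolerant
    print(classify_site(11))  # intolerant
    # Made-up accessibility values, for illustration only.
    print(cumulative_fraction([10.0, 20.0, 40.0, 50.0], 30.0))
```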

Database status and future work

As of 26 July 2007, coliSNP contains 4470 nsSNPs, which are mapped on 1559 distinct protein structures, mostly from Homo sapiens (4216 out of 4470). A process to include all of the data in dbSNP, which is derived from 22 organisms, is ongoing. Modification of 3D protein structure visualization tools to accept the new PDB format (see Remediation Documentation at http://www.wwpdb.org/docs.html) is also ongoing. In the near future, coliSNP will also provide SNPs located in the gene regulatory region together with potential target regions of the regulatory proteins.

Availability

The coliSNP database can be accessed freely at http://yayoi.kansai.jaea.go.jp/colisnp.

19 in total

1. dbSNP: the NCBI database of genetic variation.

Authors: S T Sherry; M H Ward; M Kholodov; J Baker; L Phan; E M Smigielski; K Sirotkin
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

2. Structural location of disease-associated single-nucleotide polymorphisms.

Authors: Nathan O Stitziel; Yan Yuan Tseng; Dimitri Pervouchine; David Goddeau; Simon Kasif; Jie Liang
Journal: J Mol Biol Date: 2003-04-11 Impact factor: 5.469

3. MutDB: annotating human variation with functionally relevant data.

Authors: Sean D Mooney; Russ B Altman
Journal: Bioinformatics Date: 2003-09-22 Impact factor: 6.937

4. topoSNP: a topographic database of non-synonymous single nucleotide polymorphisms with and without known disease association.

Authors: Nathan O Stitziel; T Andrew Binkowski; Yan Yuan Tseng; Simon Kasif; Jie Liang
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

5. SIFT: Predicting amino acid changes that affect protein function.

Authors: Pauline C Ng; Steven Henikoff
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

6. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

7. Genetic studies of the lac repressor. XIII. Extensive amino acid replacements generated by the use of natural and synthetic nonsense suppressors.

Authors: L G Kleina; J H Miller
Journal: J Mol Biol Date: 1990-03-20 Impact factor: 5.469

8. Human non-synonymous SNPs: server and survey.

Authors: Vasily Ramensky; Peer Bork; Shamil Sunyaev
Journal: Nucleic Acids Res Date: 2002-09-01 Impact factor: 16.971

9. Systematic mutation of bacteriophage T4 lysozyme.

Authors: D Rennell; S E Bouvier; L W Hardy; A R Poteete
Journal: J Mol Biol Date: 1991-11-05 Impact factor: 5.469

10. PolyDoms: a whole genome database for the identification of non-synonymous coding SNPs with the potential to impact disease.

Authors: Anil G Jegga; Sivakumar Gowrisankar; Jing Chen; Bruce J Aronow
Journal: Nucleic Acids Res Date: 2006-11-16 Impact factor: 16.971
