
SNPdbe: constructing an nsSNP functional impacts database.

Christian Schaefer, Alice Meier, Burkhard Rost, Yana Bromberg.

Abstract

Many existing databases annotate experimentally characterized single nucleotide polymorphisms (SNPs). Each non-synonymous SNP (nsSNP) changes one amino acid in the gene product (single amino acid substitution; SAAS). This change can either affect protein function or be neutral in that respect. Most polymorphisms lack experimental annotation of their functional impact. Here, we introduce SNPdbe (SNP database of effects), a database of computationally predicted functional impacts of SNPs. Database entries represent nsSNPs in dbSNP and the 1000 Genomes collection, as well as variants from UniProt and PMD. SAASs come from >2600 organisms, 'human' being the most prevalent. The impact of each SAAS on protein function is predicted using the SNAP and SIFT algorithms and augmented with experimentally derived function/structure information and disease associations from PMD, OMIM and UniProt. SNPdbe is consistently updated and easily augmented with new sources of information. The database is available as a MySQL dump and via a web front end that allows searches with any combination of organism names, sequences and mutation IDs. AVAILABILITY: http://www.rostlab.org/services/snpdbe.


Year: 2011 PMID: 22210871 PMCID: PMC3278761 DOI: 10.1093/bioinformatics/btr705

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Resources like dbSNP (Sherry ) and UniProt (Bairoch ) contain many experimentally determined nsSNPs, but few of these are annotated with respect to function. Some databases [e.g. PMD (Kawabata )] contain experimental annotations of functional effects of mutants. However, these are sparsely populated and do not directly link to dbSNP or UniProt. For the vast majority of mutations lacking experimental annotation, we can gauge functional impact only via in silico analysis. Proper use of computational methods requires specific skills and resources generally inaccessible to medical researchers or experimental biologists. To help, we created a MySQL database readily usable by non-experts. We collected SAASs from PMD, dbSNP, 1000 Genomes (1000_Genomes_Project_Consortium, 2010) and UniProt ‘variant’ and ‘mutant’ records. We also store ‘conflict’ records to illustrate how sequencing discrepancies may lead to differing interpretations of the functional significance of a given sequence position. For each SAAS we predict the functional effect using SNAP (Bromberg and Rost, 2007) and SIFT (Ng and Henikoff, 2001). Where available, predictions are augmented by experimental annotations and associated human diseases. We also compute evolutionary conservation of the mutant positions. A web interface provides convenient access to the underlying data via organism, sequence and mutation ID queries.

2 DATA SETUP AND RETRIEVAL

Database: SNPdbe mutation data come from dbSNP, UniProt, the 1000 Genomes collection (1KG) and PMD (Fig. 1A). UniProt and PMD store protein sequences explicitly, while dbSNP links to RefSeq (Pruitt ). dbSNP collects 1KG variants with a time delay, so for SNPdbe we mapped all 1KG nsSNPs to RefSeq using Annovar (Wang ). We keep only one version of redundant protein sequences, referenced by md5 checksums irrespective of origin. Sequences are treated as redundant when they are identical over their entire length, allowing at most one substitution per sequence and disregarding the presence or absence of a leading Met residue. This allows correlating mutations from different sources that reference the same sequence. We currently store 1 362 793 unique SAASs in 158 004 proteins from 2684 organisms covering all kingdoms of life; the top five contributors are human, mouse, rice, cow and rat. For each SAAS we provide the following information: (i) SNAP and SIFT binary predictions of functional effects (neutral/non-neutral); (ii) evolutionary conservation information from PSIC (Sunyaev ), PSI-BLAST (Altschul ) PSSMs and frequency scores from runs against PDB (Berman ) and UniProt; (iii) functional effects from PMD and UniProt and, for human SAASs, disease associations from PMD, UniProt and OMIM (Amberger ) (Fig. 1B); (iv) dbSNP evidence and average heterozygosity; and (v) interesting functional/structural features (UniProt) at the mutation site. Data are stored in a MySQL database and are downloadable as a dump file.
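The checksum-based referencing of sequences described above can be illustrated with a minimal Python sketch. It is not the SNPdbe implementation: the function names `sequence_key` and `group_by_sequence`, the toy sequences and the accessions are hypothetical, and the way the leading Met is normalized before hashing is an assumption, since the text does not specify it. The sketch covers only exact identity plus the leading-Met tolerance; the one-substitution tolerance is not handled.

```python
import hashlib


def sequence_key(seq: str, strip_leading_met: bool = True) -> str:
    """Return an md5 hex digest that identifies a protein sequence.

    Assumptions (not taken from the paper): whitespace is removed,
    residues are upper-cased, and a leading Met is dropped so that
    sequences differing only in the initiator Met share one key.
    """
    cleaned = "".join(seq.split()).upper()
    if strip_leading_met and cleaned.startswith("M"):
        cleaned = cleaned[1:]
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


def group_by_sequence(records):
    """Group (source, accession, sequence) tuples by their sequence key.

    Records from different sources that reference the same sequence
    end up in the same list, so their mutations can be correlated.
    """
    groups = {}
    for source, accession, seq in records:
        groups.setdefault(sequence_key(seq), []).append((source, accession))
    return groups


# Hypothetical toy records: the first two differ only by the leading Met.
records = [
    ("UniProt", "ACC_A", "MKTAYIAK"),
    ("RefSeq", "ACC_B", "KTAYIAK"),
    ("PMD", "ACC_C", "MKTAYIAR"),
]
groups = group_by_sequence(records)
```

Note that dropping the leading Met shifts residue numbering by one, so under this assumption the original coordinates of each SAAS would have to be tracked per source record.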

Fig. 1.

Venn diagrams describing the overlap of (A) all SNPdbe component databases and (B) functional and disease annotations of human SAASs. Note that <1% of human SAASs have both functional effect and disease annotations.

Web interface: The database is web-accessible, allowing gene/protein ID/name, disease, sequence (or its md5 hash) and mutant-based queries. Some queries (e.g. md5, gene ID) are exact. Sequence queries are BLAST similarity based. Keyword searches (e.g. disease) are ‘loose’, i.e. matched to corresponding free text fields. The results page lists all SAASs found within the specified sequence together with their functional effect predictions, wild-type/mutant conservation scores, information on disease (human only), experimentally derived functional/structural consequences, changes in biochemical properties at the position, per-variant validation status and average heterozygosity. This information is also accessible via single/batch mutation queries with dbSNP rsids, PMD or SwissVar IDs, or SAASs in the XposY format (with the associated sequence). The user can (i) restrict queries to specific organisms or protein keywords; and (ii) search for mutants in similar sequences. Query results may be sorted by different attributes and downloaded in CSV format. Linkouts to referenced web resources are available.

Example: dbSNP rsid 104894374 describes the mutation R157W in the RDH5 gene. This mutation is associated with the eye disease fundus albipunctatus (OMIM 601617.0008). Both SNAP and SIFT predict this substitution to be non-neutral. Indeed, it results in loss of activity in the gene product (PMD A010122). By combining mutation disease associations and their functional effects, new inferences can be made about molecular functions altered in disease.
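The XposY notation used for mutation queries (e.g. R157W: wild-type residue, position, mutant residue) can be handled as in the following minimal Python sketch. It is an illustration only, not code from SNPdbe: the names `SAAS`, `parse_saas` and `matches_sequence` are hypothetical, positions are assumed to be 1-based, and only the 20 standard amino acids are accepted.

```python
import re
from typing import NamedTuple

# Assumption: only the 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
_SAAS_RE = re.compile(rf"([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])")


class SAAS(NamedTuple):
    wild_type: str
    position: int  # assumed 1-based
    mutant: str


def parse_saas(text: str) -> SAAS:
    """Parse a substitution written in XposY format, e.g. 'R157W'."""
    match = _SAAS_RE.fullmatch(text.strip().upper())
    if match is None:
        raise ValueError(f"not in XposY format: {text!r}")
    wild_type, pos, mutant = match.groups()
    position = int(pos)
    if position < 1:
        raise ValueError("position must be >= 1")
    if wild_type == mutant:
        raise ValueError("wild-type and mutant residues are identical")
    return SAAS(wild_type, position, mutant)


def matches_sequence(saas: SAAS, sequence: str) -> bool:
    """Check that the sequence carries the wild-type residue at the position."""
    if saas.position > len(sequence):
        return False
    return sequence[saas.position - 1].upper() == saas.wild_type


example = parse_saas("R157W")
```

Checking the wild-type residue against the associated sequence is one way to catch off-by-one numbering or a mismatched sequence before a query is submitted.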

3 CONCLUSION

SNPdbe is designed to fill the annotation gap left by the high cost of experimental testing for functional significance of protein variants. It joins related bits of knowledge, currently distributed throughout various databases, into a consistent, easily accessible and updatable resource. The major features distinguishing SNPdbe from other databases are: (i) the inclusion of a much wider array of organisms and data sources; and (ii) the explicit differentiation between functional/structural effects and disease associations. Furthermore, unlike SNPdbe, existing resources (i) lack experimental annotation of functional/structural changes or offer only single-tool (e.g. SIFT) predictions (Mooney and Altman, 2003; Thorn ), (ii) are limited to naturally occurring variants (Chelala ), (iii) are not consistently updated (Jegga ; Wang ) or (iv) do not offer pre-computed effects on a large scale (Reva ; Wang ). SNPdbe's database schema and management scripts are designed to easily handle the addition of new sequences and SAASs and the integration of new predictors and sources of experimental data. Monthly updates are planned. Information about current versions of the included databases and statistics is available from the SNPdbe website. Our ultimate goal is to make SNPdbe a toolbox for biologists and medical researchers dealing with mutation data. Computationally acquired predictions and annotations found in SNPdbe will help design and prioritize further experimental research.

REFERENCES (10 of 18 listed)

1. PSIC: profile extraction from sequence alignments with position-specific counts of independent observations.

Authors: S R Sunyaev; F Eisenhaber; I V Rodchenkov; B Eisenhaber; V G Tumanyan; E N Kuznetsov
Journal: Protein Eng Date: 1999-05

2. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

3. Predicting deleterious amino acid substitutions.

Authors: P C Ng; S Henikoff
Journal: Genome Res Date: 2001-05 Impact factor: 9.043

4. MutDB: annotating human variation with functionally relevant data.

Authors: Sean D Mooney; Russ B Altman
Journal: Bioinformatics Date: 2003-09-22 Impact factor: 6.937

5. SNP Function Portal: a web database for exploring the function implication of SNP alleles.

Authors: Pinglang Wang; Manhong Dai; Weijian Xuan; Richard C McEachin; Anne U Jackson; Laura J Scott; Brian Athey; Stanley J Watson; Fan Meng
Journal: Bioinformatics Date: 2006-07-15 Impact factor: 6.937

6. The Protein Mutant Database.

Authors: T Kawabata; M Ota; K Nishikawa
Journal: Nucleic Acids Res Date: 1999-01-01 Impact factor: 16.971

7. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

8. Predicting the functional impact of protein mutations: application to cancer genomics.

Authors: Boris Reva; Yevgeniy Antipin; Chris Sander
Journal: Nucleic Acids Res Date: 2011-07-03 Impact factor: 16.971

9. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Authors: Kim D Pruitt; Tatiana Tatusova; Donna R Maglott
Journal: Nucleic Acids Res Date: 2006-11-27 Impact factor: 16.971

10. The Universal Protein Resource (UniProt).

Authors: Amos Bairoch; Rolf Apweiler; Cathy H Wu; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Darren A Natale; Claire O'Donovan; Nicole Redaschi; Lai-Su L Yeh
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
