MutDB: update on development of tools for the biochemical analysis of genetic variation.

Arti Singh, Adebayo Olowoyeye, Peter H Baenziger, Jessica Dantzer, Maricel G Kann, Predrag Radivojac, Randy Heiland, Sean D Mooney.

Abstract

Understanding how genetic variation affects the molecular function of gene products is an emergent area of bioinformatic research. Here, we present updates to MutDB (http://www.mutdb.org), a tool aiming to aid bioinformatic studies by integrating publicly available databases of human genetic variation with molecular features and clinical phenotype data. MutDB, first developed in 2002, integrates annotated SNPs in dbSNP and amino acid substitutions in Swiss-Prot with protein structural information, links to scores that predict functional disruption and other useful annotations. Though these functional annotations are mainly focused on nonsynonymous SNPs, some information on other SNP types included in dbSNP is also provided. Additionally, we have developed a new functionality that facilitates KEGG pathway visualization of genes containing SNPs and a SNP query tool for visualizing and exporting sets of SNPs that share selected features based on certain filters.

Year: 2007 PMID: 17827212 PMCID: PMC2238958 DOI: 10.1093/nar/gkm659

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Understanding how coding single nucleotide polymorphisms (cSNPs) and disease-associated mutations cause molecular alterations and expression changes in gene products is important to many fields of biological and medical research (1,2). We believe that linking disease with basic research data will enable the generation of hypotheses that can be tested experimentally in the laboratory with functional assays. Recently, several servers and databases aiming to understand the biochemical effects of nonsynonymous SNPs and disease-associated mutations have been developed. These include SIFT (3), PolyPhen (4), SNPs3D (5), PANTHER (6), PMUT (7), LS-SNP (8), PolyDoms (9) and SNPEffect (10). These methods and their resulting datasets generally apply DNA and protein sequence, protein structure and/or evolutionary features to classify a query amino acid substitution using a training set of putatively neutral and causative amino acid substitutions (4,5,8,11–17). Similarly, MutDB (18,19) is an online resource that serves as a step toward better understanding the potential molecular effects of a mutation. MutDB integrates genetic variation from two public databases, Swiss-Prot (20) and dbSNP (21), and annotates the variants with biochemically relevant information. These two databases were chosen because they are freely available and cover a broad range of known amino acid substitutions. However, neither database annotates disease-causing amino acid substitutions particularly well: dbSNP contains few links to OMIM (22), and Swiss-Prot does not distinguish disease-causing amino acid substitutions from other amino acid substitutions. A researcher studying a specific disease should therefore have some prior knowledge of the proteins and mutations of interest; to help, MutDB links to databases with disease and phenotype annotations, such as OMIM (22) and dbGaP (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap).
In addition to updating to the latest mutation and SNP datasets, here we present several additions to the MutDB resource. First, we have developed a pathway visualization add-on to MutDB that leads the biologist from mutations in a gene to KEGG (23) biological pathways involving the gene. This enables the researcher to view the systems context of both a mutation and its associated phenotype. Second, we have constructed an AJAX (Asynchronous JavaScript and XML) based SNP query tool that allows users to save searches, view Haploview-like haplotype structure (24), and select subsets of SNPs based on frequencies and SNP scores. Together these tools represent a useful addition to our existing library of genetic research tools (Figure 1).

Figure 1.

SNP query tool snapshot highlighting SNP filtering. Multiple filtering options include: validation (T/F), HapMap (31) (T/F), Location, avHet, avHetSE, HapMap Frequency (CEU, CHB, JPT, YRI), SIFT score and UCSC conservation. Users can preview the current filtering criteria by scrolling over the pop-up window link. Once SNPs are selected, Haploview-like images can be rendered showing HapMap LD structure (lower right).

METHODS AND USAGE

Web interface and data organization

The SNP and mutation data are parsed directly from Swiss-Prot (currently build 51.0) and dbSNP (currently build 126) without curation, other than to remove any annotations whose stated wild-type amino acid does not match the referenced sequence. The gene model provided by MutDB is organized using gene information extracted from a local copy of the UCSC Human Genome Annotation Database (ver. Hg18, http://genome.ucsc.edu/) (25). We also use Ensembl (ver. 41_36c, http://www.ensembl.org/) (26) for some gene name cross-references. We attempt to keep pages organized by Entrez Gene ID, with the most representative transcript as the primary gene page. Other known mRNA transcripts annotated in the UCSC Genome Annotation Database are listed at the bottom of the page with their annotations. These data may be browsed alphabetically by gene symbol or searched by keyword, gene symbol, protein or RefSeq ID, or individual variant identifier. Each gene has its own page, which provides a list of related SNPs and mutations classified by their effects on the protein product, as well as a pictorial representation of the sequence including points of conservation, location of exons and location of variants. Links to corresponding Swiss-Prot and dbSNP pages, a short description of the gene, and the related chromosome name are supplied. Each variant is annotated on its own page with further details, including the protein sequence, if known, and any related Protein Data Bank (PDB) (http://www.rcsb.org/pdb/) (27) structures, KEGG pathways, HapMap data and Entrez Gene information. We describe important aspects of our annotation pipeline below.
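The one curation step mentioned above, dropping annotations whose wild-type residue disagrees with the referenced sequence, can be illustrated with a minimal sketch. The `Substitution` record, the function names and the toy sequence are hypothetical and are not taken from the MutDB code base; positions are assumed to be 1-based.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substitution:
    """Hypothetical record for one annotated amino acid substitution."""
    position: int   # 1-based residue position in the reference protein
    wild_type: str  # one-letter code of the annotated wild-type residue
    mutant: str     # one-letter code of the substituted residue


def matches_reference(sub, sequence):
    """True if the annotated wild-type residue is found at that position."""
    if not 1 <= sub.position <= len(sequence):
        return False
    return sequence[sub.position - 1] == sub.wild_type


def keep_consistent(subs, sequence):
    """Drop annotations that do not agree with the reference sequence."""
    return [s for s in subs if matches_reference(s, sequence)]


# Toy data for illustration only (not a real protein or real variants).
sequence = "MKTAYIAKQR"
subs = [
    Substitution(3, "T", "A"),   # residue 3 is T: kept
    Substitution(5, "G", "D"),   # residue 5 is Y, not G: dropped
    Substitution(42, "K", "E"),  # beyond the sequence end: dropped
]
kept = keep_consistent(subs, sequence)
print(kept)
```

Only the first toy annotation survives the check; the other two would be discarded before loading.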

Protein structure annotations

Protein structural mapping for each amino acid substitution is performed by aligning the query sequence to each high scoring segment pair (HSP) from BLAST (28) search results using BioPerl scripts (29). BLAST results used for alignment come only from PDB (using a sequence data file downloaded in January of 2007) and are limited to those with 100% identity to the original sequence. These pairwise alignments are then used to map the wild-type and mutant sequences onto the structure sequence. The annotated mutations that are mapped to a structure can be displayed using the integrated Jmol visualization tool (http://jmol.sourceforge.net/) or in extensions developed for UCSF Chimera (30) and DeLano Scientific PyMOL (http://pymol.sourceforge.net/). To download the extensions visit http://lifescienceweb.org/.
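As an illustration of the mapping step, the sketch below converts a residue position in the query protein into a position in the structure sequence using the coordinates of a single HSP. It is written in Python rather than the BioPerl used by MutDB, and the `Hsp` record and function name are hypothetical. It assumes the HSP is ungapped, which is expected for hits restricted to 100% identity when identity is computed over the full alignment length, and it returns an index into the PDB sequence record, not the residue numbering used inside the structure file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hsp:
    """Hypothetical summary of one ungapped BLAST HSP (1-based coordinates)."""
    query_start: int    # first aligned residue in the query protein
    query_end: int      # last aligned residue in the query protein
    subject_start: int  # first aligned residue in the structure sequence
    identity: float     # fraction of identical positions, 0.0 to 1.0


def map_to_structure(position: int, hsp: Hsp) -> Optional[int]:
    """Map a query position to the structure sequence, or None if unmapped."""
    if hsp.identity < 1.0:
        return None  # only exact matches are used for mapping
    if not hsp.query_start <= position <= hsp.query_end:
        return None  # the substitution lies outside the aligned region
    # Ungapped alignment: the offset from the HSP start is preserved.
    return hsp.subject_start + (position - hsp.query_start)


# Toy HSP: query residues 25-120 align to structure residues 1-96.
hsp = Hsp(query_start=25, query_end=120, subject_start=1, identity=1.0)
print(map_to_structure(30, hsp))  # 6
print(map_to_structure(10, hsp))  # None
```

A gapped alignment would need a column-by-column walk instead of a constant offset, which is one reason the exact-identity restriction keeps the mapping simple.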

Function annotations

We provide links to other tools that predict functional or molecular disruptions caused by an amino acid substitution. These include SNPs3D (5), PolyPhen (4), SIFT (3), PolyDoms (9), PMUT (7) and PANTHER (6); each is deep-linked directly to the corresponding gene or SNP page, if available. Sorting Intolerant from Tolerant (SIFT) scores (3) and their associated predictions are supplied for each variant causing an amino acid substitution. Variants with low confidence scores are marked with an asterisk. Here, again, the source Swiss-Prot and dbSNP pages are linked.
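A minimal sketch of how a SIFT score might be turned into a displayed prediction with the low-confidence asterisk follows. The 0.05 cutoff is the conventional SIFT threshold and is an assumption here, since the text does not state which cutoff MutDB applies; the function name and the `low_confidence` flag are likewise hypothetical, with the flag assumed to be supplied by the SIFT output.

```python
SIFT_CUTOFF = 0.05  # conventional SIFT threshold; assumed, not from the paper


def sift_label(score, low_confidence=False, cutoff=SIFT_CUTOFF):
    """Return a display label for a SIFT score.

    Scores below the cutoff are labelled deleterious, others tolerated.
    Low-confidence predictions get a trailing asterisk.
    """
    prediction = "deleterious" if score < cutoff else "tolerated"
    return prediction + ("*" if low_confidence else "")


print(sift_label(0.01))                       # deleterious
print(sift_label(0.30, low_confidence=True))  # tolerated*
```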

Visualization on KEGG pathways

We have augmented MutDB annotations with KEGG pathways using KEGG web services (23). This enables visualization of proteins and mutations on approximately 188 human pathways found in KEGG. The addition of a link, ‘Visualize Pathways’, on the MutDB gene page takes the user to a page listing the names of all KEGG pathways involving the gene. When a pathway is chosen, the user is taken to a new page displaying the pathway and a list of involved genes and their associated phenotypes. All genes containing a SNP denoted as having a disease annotation or comment (per Swiss-Prot) are colored yellow in the pathway. This page is also hyperlinked to KEGG and MutDB. This functionality makes use of KEGG SOAP-based web services with supplementary data saved locally (Figure 2).

Figure 2.

MutDB-KEGG integration example of the VEGF Pathway. This pathway shows all proteins with SNPs or Swiss-Prot mutations and all unique diseases and comments provided by Swiss-Prot (top). The VEGF signaling pathway showing proteins with mutations in yellow (bottom).
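The coloring rule used for the pathway view (genes with a Swiss-Prot disease annotation or comment are drawn in yellow) can be sketched as a simple lookup. The function name, the default color and the placeholder gene names are hypothetical; the sketch deliberately does not call any KEGG service and only shows how the color assignment could be prepared from locally stored data.

```python
def color_pathway_genes(pathway_genes, disease_annotated,
                        highlight="yellow", default="white"):
    """Assign a display color to each gene in a pathway.

    pathway_genes:     iterable of gene symbols drawn in the pathway
    disease_annotated: set of symbols with a disease annotation or comment
    """
    return {
        gene: (highlight if gene in disease_annotated else default)
        for gene in pathway_genes
    }


# Placeholder symbols for illustration only.
pathway = ["GENE_A", "GENE_B", "GENE_C"]
annotated = {"GENE_B"}
colors = color_pathway_genes(pathway, annotated)
print(colors)
```

The resulting dictionary could then be handed to whatever rendering layer draws the pathway image.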

SNP query tool

A recent addition to our toolset is a SNP query tool that enables querying and exporting sets of SNPs that share selected features. The SNP query tool requires two sequence-tagged site (STS) markers or dbSNP reference cluster IDs (rs#) as input and returns all SNPs between the markers. The tool uses AJAX and a paging scheme to remain responsive on large data sets. AJAX improves speed by exchanging small amounts of data with the server, so the entire web page need not be reloaded each time the user makes a change. This technique, along with the broad filtering options, makes the tool interactive. Users may filter SNPs by manual selection or by any of the filtering criteria. There are currently eleven filter options: validation status in dbSNP; HapMap status; location (functional class); avHet (average heterozygosity in dbSNP); avHetSE (standard error of the average heterozygosity in dbSNP); HapMap frequency in each of four populations, namely CEU (CEPH, Utah residents with ancestry from northern and western Europe), CHB (Han Chinese in Beijing, China), JPT (Japanese in Tokyo, Japan) and YRI (Yoruba in Ibadan, Nigeria); SIFT score (3); and conservation score [based on the UCSC Genome Annotation Database conservation (25)]. The conservation score is the average over a 10-mer window of conservation values around each SNP, derived from alignments of the 16 vertebrate species in the UCSC Genome Annotation Database. A user can authenticate to enter the tool or visit as a guest, and may save each session and return later. The sequence surrounding a SNP can be retrieved, and SNP data exported to Microsoft Excel, via the provided links. Excel output includes the dbSNP rsID, primer sequences, and the polymorphic alleles. The tool displays a PNG image containing RefSeq transcript information and location information for all selected SNPs indexed by function type using the UCSC Genome Annotation Database.
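The query-then-filter workflow described above can be sketched as two small functions: one that selects SNPs lying between two marker positions, and one that applies optional filter criteria. The `Snp` record, its field names, the rs identifiers and all values are hypothetical; only a subset of the eleven filters is shown, and how MutDB implements them on the server is not described in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snp:
    """Hypothetical SNP record carrying a few of the filterable features."""
    rs_id: str
    position: int            # chromosomal coordinate
    validated: bool          # validation status in dbSNP
    in_hapmap: bool          # HapMap status
    location: str            # functional class, e.g. "intron"
    av_het: float            # average heterozygosity
    sift_score: Optional[float] = None  # only for amino acid substitutions


def snps_between(snps, marker_a, marker_b):
    """Return SNPs located between two marker positions (inclusive)."""
    lo, hi = sorted((marker_a, marker_b))
    return [s for s in snps if lo <= s.position <= hi]


def apply_filters(snps, validated=None, in_hapmap=None, location=None,
                  min_av_het=None, max_sift=None):
    """Keep SNPs that satisfy every filter that is not None."""
    kept = []
    for s in snps:
        if validated is not None and s.validated != validated:
            continue
        if in_hapmap is not None and s.in_hapmap != in_hapmap:
            continue
        if location is not None and s.location != location:
            continue
        if min_av_het is not None and s.av_het < min_av_het:
            continue
        if max_sift is not None and (s.sift_score is None
                                     or s.sift_score > max_sift):
            continue
        kept.append(s)
    return kept


# Made-up records for illustration only.
snps = [
    Snp("rs0000001", 1000, True, True, "coding-nonsynonymous", 0.42, 0.02),
    Snp("rs0000002", 1500, False, True, "intron", 0.10),
    Snp("rs0000003", 2500, True, False, "coding-nonsynonymous", 0.30, 0.40),
    Snp("rs0000004", 9000, True, True, "intron", 0.25),
]
region = snps_between(snps, 500, 3000)
selected = apply_filters(region, validated=True, max_sift=0.05)
print([s.rs_id for s in selected])  # ['rs0000001']
```

Treating each filter as optional mirrors the interactive use described above, where a user adds or removes criteria one at a time.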
A user may also visualize linkage disequilibrium for up to 200 selected SNPs in a Haploview-like (24) structure. The SNP query tool is located at http://www.mutdb.org/snp and is linked from each page (Figure 1).
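The windowed conservation score used as one of the query tool's filters can be sketched as a simple average over per-base conservation values. How the 10-base window is positioned relative to the SNP is not specified in the text; the sketch assumes it starts five bases upstream of the SNP and is truncated at the ends of the available scores. The function name and the toy values are hypothetical.

```python
def window_conservation(scores, snp_index, window=10):
    """Average per-base conservation in a window around a SNP.

    scores:    list of per-base conservation values
    snp_index: 0-based index of the SNP within `scores`
    window:    number of bases to average (10 in the description above)
    """
    if not 0 <= snp_index < len(scores):
        raise IndexError("SNP index lies outside the scored region")
    # Assumed placement: window begins window // 2 bases upstream.
    start = max(0, snp_index - window // 2)
    end = min(len(scores), start + window)
    segment = scores[start:end]
    return sum(segment) / len(segment)


# Toy conservation track for illustration only.
track = [1.0] * 5 + [0.0] * 5 + [1.0] * 2
print(window_conservation(track, 5))  # 0.5
```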

Continued web services support

MutDB continues to support its SOAP-based web services. The web services can be accessed via http://www.lifescienceweb.org. This interface is used to communicate with the structural visualization extensions for UCSF Chimera and DeLano Scientific PyMOL.

Most accessed gene pages

In MutDB, the most accessed genes may give insight into the current interests of researchers. The most accessed genes from October 2005 to January 2007 are listed in Table 1. Not surprisingly, the most accessed genes also have many mutations associated with them and are what we would consider to be well-studied disease-associated genes.

Table 1.

Top 15 accessed genes on MutDB from October 2005 to January 2007

Rank	Symbol	Name
1	BRCA1	Breast cancer 1, early onset
2	CFTR	Cystic fibrosis transmembrane conductance
3	AR	Androgen receptor
4	APOE	Apolipoprotein E precursor
5	ATP7B	ATPase, Cu++ transporting, beta polypeptide
6	TP53	Tumor protein p53
7	CD53	CD53 antigen
8	BRCA2	Breast cancer 2, early onset
9	FBN1	Fibrillin 1
10	APC	Adenomatosis polyposis coli
11	NOTCH3	Notch homolog 3
12	KALRN	Kalirin, RhoGEF kinase
13	CYP2D6	Cytochrome P450, family 2, subfamily D
14	RET	Ret proto-oncogene
15	HBB	Beta globin

BRCA1, CFTR, AR and APOE are the most requested pages within MutDB.

Future

Understanding the underlying molecular causes of disease remains an important area for research. We continue to investigate annotations that are useful for hypothesis generation and directing experimental validation. While we continue to update the database as new annotations become available, we are also adding useful annotations outside of protein amino acid changes such as noncoding, synonymous and intronic variation.

31 in total

1. KEGG: kyoto encyclopedia of genes and genomes.

Authors: M Kanehisa; S Goto
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation.

Authors: D Chasman; R M Adams
Journal: J Mol Biol Date: 2001-03-23 Impact factor: 5.469

3. Prediction of deleterious human alleles.

Authors: S Sunyaev; V Ramensky; I Koch; W Lathe; A S Kondrashov; P Bork
Journal: Hum Mol Genet Date: 2001-03-15 Impact factor: 6.150

4. Predicting deleterious amino acid substitutions.

Authors: P C Ng; S Henikoff
Journal: Genome Res Date: 2001-05 Impact factor: 9.043

5. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003.

Authors: Brigitte Boeckmann; Amos Bairoch; Rolf Apweiler; Marie-Claude Blatter; Anne Estreicher; Elisabeth Gasteiger; Maria J Martin; Karine Michoud; Claire O'Donovan; Isabelle Phan; Sandrine Pilbout; Michel Schneider
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

6. The UCSC Genome Browser Database.

Authors: D Karolchik; R Baertsch; M Diekhans; T S Furey; A Hinrichs; Y T Lu; K M Roskin; M Schwartz; C W Sugnet; D J Thomas; R J Weber; D Haussler; W J Kent
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

7. The Bioperl toolkit: Perl modules for the life sciences.

Authors: Jason E Stajich; David Block; Kris Boulez; Steven E Brenner; Stephen A Chervitz; Chris Dagdigian; Georg Fuellen; James G R Gilbert; Ian Korf; Hilmar Lapp; Heikki Lehväslaiho; Chad Matsalla; Chris J Mungall; Brian I Osborne; Matthew R Pocock; Peter Schattner; Martin Senger; Lincoln D Stein; Elia Stupka; Mark D Wilkinson; Ewan Birney
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

8. Evaluation of structural and evolutionary contributions to deleterious mutation prediction.

Authors: Christopher T Saunders; David Baker
Journal: J Mol Biol Date: 2002-09-27 Impact factor: 5.469

9. Human non-synonymous SNPs: server and survey.

Authors: Vasily Ramensky; Peer Bork; Shamil Sunyaev
Journal: Nucleic Acids Res Date: 2002-09-01 Impact factor: 16.971

10. The PANTHER database of protein families, subfamilies, functions and pathways.

Authors: Huaiyu Mi; Betty Lazareva-Ulitsky; Rozina Loo; Anish Kejariwal; Jody Vandergriff; Steven Rabkin; Nan Guo; Anushya Muruganujan; Olivier Doremieux; Michael J Campbell; Hiroaki Kitano; Paul D Thomas
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
