
MMDB: annotating protein sequences with Entrez's 3D-structure database.

Yanli Wang, Kenneth J Addess, Jie Chen, Lewis Y Geer, Jane He, Siqian He, Shennan Lu, Thomas Madej, Aron Marchler-Bauer, Paul A Thiessen, Naigong Zhang, Stephen H Bryant.

Abstract

Three-dimensional (3D) structure is now known for a large fraction of all protein families. Thus, it has become rather likely that one will find a homolog with known 3D structure when searching a sequence database with an arbitrary query sequence. Depending on the extent of similarity, such neighbor relationships may allow one to infer biological function and to identify functional sites such as binding motifs or catalytic centers. Entrez's 3D-structure database, the Molecular Modeling Database (MMDB), provides easy access to the richness of 3D structure data and its large potential for functional annotation. Entrez's search engine offers several tools to assist biologist users: (i) links between databases, such as between protein sequences and structures, (ii) pre-computed sequence and structure neighbors, (iii) visualization of structure and sequence/structure alignment. Here, we describe an annotation service that combines some of these tools automatically, Entrez's 'Related Structure' links. For all proteins in Entrez, similar sequences with known 3D structure are detected by BLAST and alignments are recorded. The 'Related Structure' service summarizes this information and presents 3D views mapping sequence residues onto all 3D structures available in MMDB (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=structure).


Year: 2006 PMID: 17135201 PMCID: PMC1751549 DOI: 10.1093/nar/gkl952

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

CONTENT

Access

The Molecular Modeling Database (MMDB) is Entrez's ‘Structure’ database (1). By querying MMDB with text terms, one may identify structures of interest based on, e.g., a protein name. Links between databases provide other search mechanisms. A query of Entrez's PubMed database, for example, will identify articles citing a particular protein name. Links from this set of articles to ‘Structure’ may identify structures not found by direct query, since PubMed abstracts contain additional descriptive terms. Currently, MMDB and its visualization services handle ∼25 000 user queries per day.

Data sources

Experimental three-dimensional (3D) structure data are obtained from the Protein Data Bank (PDB) (2). Author-annotated features provided by PDB are recorded in MMDB. The agreement between atomic coordinate and sequence data is verified, and sequence data are obtained from PDB coordinate records, if necessary, to resolve ambiguities (3). Data are mapped into a computer-friendly format and transferred between applications using Abstract Syntax Notation 1 (ASN.1). This validation and encoding support the interoperable display of sequence, structure and alignment. Uniformly defined secondary-structure and 3D-domain features are added to support structure neighbor calculations. MMDB currently contains ∼39 000 structure entries, corresponding to ∼90 000 chains and 170 000 3D domains.

Summary, links, neighbors and visualization

The MMDB web server generates structure summary pages, which provide a concise description of an MMDB entry's content and the available annotation (4). Sequences derived from MMDB are entered into Entrez's protein or nucleic acid sequence database, preserving links to the corresponding 3D structures. Links to PubMed are generated by matching citations. Links to Entrez's organism taxonomy database are generated by semi-automatic processing of ‘source records’ and other descriptive text provided by PDB. Ligands and other small molecules are identified and added to the PubChem resource, accessible at , also preserving reciprocal links to 3D structure. Sequence neighbors are identified by BLAST (5), and links to the Conserved Domain Database (CDD) (6) by the RPS-BLAST algorithm (5). Structure neighbors are identified by VAST (7). The 3D structure viewer supported by Entrez, Cn3D (8), provides molecular-graphics visualization.

ANNOTATING SEQUENCE WITH STRUCTURE

The ‘Related Structure’ service

In the Entrez database system, protein sequences are neighbored to each other by comparing each newly entered sequence to all other database entries. These database scans are run with the BLAST (5) engine, which identifies sequence neighbors with significant similarity. The resulting sequence identifiers and taxonomy indices are stored, so that Entrez can provide ‘Related Sequences’ links for all protein records in the collection. The ‘Related Structure’ service is built on top of this system. Sequence neighbors directly linked to MMDB are identified, and alignments are re-computed with the ‘BlastTwoSequences’ tool (9) to restore alignment footprints. The ‘Related Structure’ web interface provides direct access to this information. The service was initially restricted to sequences from microbial genomes (10), but it has now been expanded to cover all proteins in Entrez and is updated daily to provide a comprehensive 3D-structure annotation service. Identification of structure-linked neighbors and visualization of sequence-structure alignments are also possible using Entrez and the Cn3D alignment viewer/editor, but ‘Related Structures’ provides a convenient new summary and ‘one click’ shortcuts to 3D visualization. These 3D views may be used to identify conserved residues and map site-specific features derived from the 3D structure. Currently ∼48% of non-identical protein sequences in Entrez have been linked to at least one related structure, employing a conservative threshold for alignment length (50 aligned residues or more) and similarity (30% or more identical residues in the aligned footprint); see Figure 1 for details.
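The conservative threshold quoted above (at least 50 aligned residues and at least 30% identical residues in the aligned footprint) can be illustrated with a minimal sketch. The `StructureHit` record, its field names and the helper functions are hypothetical and are not taken from the MMDB implementation; the sketch only assumes that each neighbor relationship carries an aligned length and a count of identical residue pairs.

```python
from dataclasses import dataclass

# Thresholds as stated in the text; the constant names are illustrative.
MIN_ALIGNED_RESIDUES = 50
MIN_PERCENT_IDENTITY = 30.0


@dataclass(frozen=True)
class StructureHit:
    """Hypothetical record for one sequence-to-structure neighbor relationship."""
    structure_id: str    # e.g. a PDB code
    aligned_length: int  # number of aligned residues in the footprint
    identical: int       # number of identical residue pairs

    @property
    def percent_identity(self) -> float:
        if self.aligned_length == 0:
            return 0.0
        return 100.0 * self.identical / self.aligned_length


def is_related_structure(hit: StructureHit) -> bool:
    """Apply both the alignment-length and the identity threshold."""
    return (hit.aligned_length >= MIN_ALIGNED_RESIDUES
            and hit.percent_identity >= MIN_PERCENT_IDENTITY)


def related_structures(hits):
    """Keep only the hits that pass both thresholds, preserving input order."""
    return [hit for hit in hits if is_related_structure(hit)]
```

A sequence would then count as linked to a related structure if `related_structures` returns a non-empty list for its neighbors.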

Figure 1

Non-identical protein sequences in Entrez have been classified into groups linked to related structures, at various levels of sequence similarity. Sequence identity is calculated from the BLAST alignments, and here only those neighbor relationships are listed that produce an aligned footprint of 50 residues or more. The analysis also excludes protein sequences which have been directly obtained from MMDB. Forty-eight percent of sequences in Entrez protein have at least one structure neighbor with an extensive alignment footprint and at least 30% identical residues.

An example

A search with the term ‘Angiotensin converting enzyme’ in Entrez's protein database retrieves >400 hits. One may configure the Entrez browser to filter search results by various criteria, and one pre-configured filter selects those protein sequences with ‘Related Structures’ (Entrez can be configured by following links to ‘My NCBI’, or by clicking on the ‘toolbox’ icon shown at the top of Entrez document summaries). In this example, the ‘Related Structures’ filter shows that >240 of the identified sequence records have links to related structures. One such protein sequence is the ACE protein from Rattus norvegicus (accession no. ‘NP_036676’). On the ‘Links’ menu for this record, ‘Related structures’ generates a request to the Related Structure service (). The resulting page indicates, with a horizontal bar, the sequence region annotated by each related structure (Figure 2). The display also supports sorting by a variety of alignment parameters, such as score or length, and selection of sequence-dissimilar ‘non redundant’ subsets. A ‘Table’ option switches to a text view, listing descriptions of each structure as well as alignment scores.
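The summary display described above, with one horizontal bar per related structure and sorting by score or length, can be approximated in plain text. The following is a rough sketch only: the `Hit` record, `sort_hits` and `footprint_bar` are hypothetical names, and the actual service renders a graphical page rather than ASCII bars.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical related-structure hit with its footprint on the query."""
    structure_id: str
    query_start: int  # 1-based, inclusive
    query_end: int    # 1-based, inclusive
    score: float

    @property
    def length(self) -> int:
        return self.query_end - self.query_start + 1


def sort_hits(hits, by="score"):
    """Sort hits in descending order of 'score' or 'length'."""
    if by not in ("score", "length"):
        raise ValueError(f"unsupported sort key: {by}")
    return sorted(hits, key=lambda hit: getattr(hit, by), reverse=True)


def footprint_bar(query_length, start, end, width=60):
    """Draw the aligned region of the query as '=' on a line of '.' characters."""
    if not 1 <= start <= end <= query_length:
        raise ValueError("footprint must lie within the query sequence")
    # Integer arithmetic avoids floating-point rounding at the bar edges.
    left = (start - 1) * width // query_length
    right = max(left + 1, end * width // query_length)
    return "." * left + "=" * (right - left) + "." * (width - right)
```

For a 100-residue query, a hit covering residues 51 to 100 drawn at `width=10` occupies the right half of the bar.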

Figure 2

A screen shot of the ‘Related Structure’ summary along with Entrez's document summary for protein NP_036676. Clicking on the ‘Related Structure’ option from the ‘Links’ pull-down menu launches the summary view.

Using the table view with this example, one may notice that several related structures are complexes of the same protein with different drugs/inhibitors, e.g. structures with PDB codes 1O86 (11), 1UZF (12) and 1UZE (12). Clicking on the graphical alignment footprint of 1O86, a human ACE enzyme in complex with lisinopril, one can see a text representation of the corresponding BLAST alignment, and a Cn3D view of the alignment can be launched by clicking on ‘Get 3D Structure data’ (Figure 3). One may see that the query protein is highly similar in sequence to the human ACE enzyme, as identical residue pairs are colored red by default. The sequence identity across the aligned region is 82%, and it appears that the core of the structure is mostly formed by residues conserved between the two aligned rows, while non-conserved residues are mainly located on the structure's surface.
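A figure such as the sequence identity across the aligned region can be derived from the two rows of a pairwise alignment. The sketch below is illustrative and assumes the rows are equal-length strings with `-` marking gaps; the function name is hypothetical. It counts identity over ungapped columns only, which is an assumption: alignment tools may report identity with a different denominator.

```python
def alignment_identity(query_row, subject_row, gap="-"):
    """Return (aligned_pairs, identical_pairs, percent_identity).

    Columns in which either row has a gap are skipped, so the percentage
    refers to aligned residue pairs only (an assumption of this sketch).
    """
    if len(query_row) != len(subject_row):
        raise ValueError("alignment rows must have equal length")
    aligned = 0
    identical = 0
    for q, s in zip(query_row, subject_row):
        if q == gap or s == gap:
            continue
        aligned += 1
        if q.upper() == s.upper():
            identical += 1
    percent = 100.0 * identical / aligned if aligned else 0.0
    return aligned, identical, percent
```

Identical columns found this way correspond to the residue pairs that the viewer colors red by default.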

Figure 3

A Cn3D view of the query sequence from Figure 2 aligned to chain A of the related structure 1O86 (PDB code). Residues in aligned regions are displayed in upper case letters with identical residue pairs rendered in red color. Residues within a 5 Å contact radius of the bound drug lisinopril are highlighted in the 3D structure view and automatically mapped onto the aligned residues shown in the sequence alignment window. Side chains of these residues are displayed selectively and rendered as ball-and-stick models.

One may further identify the catalytic center by identifying residues that contact the catalytic zinc ion. Those sites can then be mapped from the structure to aligned regions in the sequence window using Cn3D's highlighting functionality. One may also examine the sequence-structure alignments with related structures 1UZE and 1UZF, human ACE binding to enalaprilat and captopril, respectively, drugs with chemical structures similar to that of lisinopril. This allows one to identify conserved interactions between the ACE enzyme and this series of antihypertensive drugs. Similarly, by examining the related structure 2AJF (13), one may be able to identify residues critical for cross-species infection by studying the protein–protein interactions between the receptor binding domain from SARS Coronavirus Spike and human versus rat angiotensin-converting enzyme 2. The ‘Related Structure’ service is also integrated with NCBI's protein BLAST service. A ‘Related Structures’ link is provided when one or more similar proteins with known 3D structures have been identified by BLAST. The NCBI single-nucleotide polymorphism resource (SNP) also links to the ‘Related Structure’ service, which in this context provides a mapping of both synonymous and non-synonymous coding SNPs onto experimentally determined 3D structures.
‘Related Structure’ may be expanded further in the future, to provide visualization for other NCBI resources and to support additional filtering and selection among related structures, e.g. to highlight those annotated with conserved domain footprints by the CDD resource or those linked to small molecules in the PubChem database.
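The site mapping described for Figure 3, in which residues within a contact radius of a bound ligand are selected and then carried over to the query through the alignment, can be sketched as follows. This is not how Cn3D is implemented; the function names, the `(residue_number, (x, y, z))` input layout, and the use of sequential 1-based residue positions are all assumptions made for illustration.

```python
import math


def residues_near_ligand(protein_atoms, ligand_atoms, radius=5.0):
    """Return sorted residue numbers with any atom within `radius` of the ligand.

    protein_atoms: iterable of (residue_number, (x, y, z)) tuples.
    ligand_atoms:  iterable of (x, y, z) tuples.
    Distances are in the same unit as the coordinates (Å for PDB data).
    """
    ligand = list(ligand_atoms)
    near = set()
    for residue_number, xyz in protein_atoms:
        if residue_number in near:
            continue  # one atom in contact is enough for this residue
        if any(math.dist(xyz, lig) <= radius for lig in ligand):
            near.add(residue_number)
    return sorted(near)


def map_to_query(structure_positions, query_row, structure_row,
                 query_start=1, structure_start=1, gap="-"):
    """Map structure positions onto query positions via two aligned rows.

    Positions aligned to a gap in the query are omitted from the result.
    """
    wanted = set(structure_positions)
    mapping = {}
    q_pos = query_start - 1
    s_pos = structure_start - 1
    for q, s in zip(query_row, structure_row):
        if q != gap:
            q_pos += 1
        if s != gap:
            s_pos += 1
        if q != gap and s != gap and s_pos in wanted:
            mapping[s_pos] = q_pos
    return mapping
```

The same two steps would apply to residues contacting the catalytic zinc ion, with the ion's coordinates passed as the only entry in `ligand_atoms`.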

13 in total

1. Cn3D: sequence and structure views for Entrez.

Authors: Y Wang; L Y Geer; C Chappey; J A Kans; S H Bryant
Journal: Trends Biochem Sci Date: 2000-06 Impact factor: 13.807

2. Links from genome proteins to known 3-D structures.

Authors: Y Wang; S Bryant; R Tatusov; T Tatusova
Journal: Genome Res Date: 2000-10 Impact factor: 9.043

3. BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences.

Authors: T A Tatusova; T L Madden
Journal: FEMS Microbiol Lett Date: 1999-05-15 Impact factor: 2.742

4. MMDB: Entrez's 3D-structure database.

Authors: Jie Chen; John B Anderson; Carol DeWeese-Scott; Natalie D Fedorova; Lewis Y Geer; Siqian He; David I Hurwitz; John D Jackson; Aviva R Jacobs; Christopher J Lanczycki; Cynthia A Liebert; Chunlei Liu; Thomas Madej; Aron Marchler-Bauer; Gabriele H Marchler; Raja Mazumder; Anastasia N Nikolskaya; Bachoti S Rao; Anna R Panchenko; Benjamin A Shoemaker; Vahan Simonyan; James S Song; Paul A Thiessen; Sona Vasudevan; Yanli Wang; Roxanne A Yamashita; Jodie J Yin; Stephen H Bryant
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

Review 5. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

6. MMDB: an ASN.1 specification for macromolecular structure.

Authors: H Ohkawa; J Ostell; S Bryant
Journal: Proc Int Conf Intell Syst Mol Biol Date: 1995

7. Structure of SARS coronavirus spike receptor-binding domain complexed with receptor.

Authors: Fang Li; Wenhui Li; Michael Farzan; Stephen C Harrison
Journal: Science Date: 2005-09-16 Impact factor: 47.728

8. Crystal structure of the human angiotensin-converting enzyme-lisinopril complex.

Authors: Ramanathan Natesh; Sylva L U Schwager; Edward D Sturrock; K Ravi Acharya
Journal: Nature Date: 2003-01-19 Impact factor: 49.962

9. Database resources of the National Center for Biotechnology Information.

Authors: David L Wheeler; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Vyacheslav Chetvernin; Deanna M Church; Michael DiCuccio; Ron Edgar; Scott Federhen; Lewis Y Geer; Wolfgang Helmberg; Yuri Kapustin; David L Kenton; Oleg Khovayko; David J Lipman; Thomas L Madden; Donna R Maglott; James Ostell; Kim D Pruitt; Gregory D Schuler; Lynn M Schriml; Edwin Sequeira; Stephen T Sherry; Karl Sirotkin; Alexandre Souvorov; Grigory Starchenko; Tugba O Suzek; Roman Tatusov; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

10. CDD: a Conserved Domain Database for protein classification.

Authors: Aron Marchler-Bauer; John B Anderson; Praveen F Cherukuri; Carol DeWeese-Scott; Lewis Y Geer; Marc Gwadz; Siqian He; David I Hurwitz; John D Jackson; Zhaoxi Ke; Christopher J Lanczycki; Cynthia A Liebert; Chunlei Liu; Fu Lu; Gabriele H Marchler; Mikhail Mullokandov; Benjamin A Shoemaker; Vahan Simonyan; James S Song; Paul A Thiessen; Roxanne A Yamashita; Jodie J Yin; Dachuan Zhang; Stephen H Bryant
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
