
The RCSB PDB information portal for structural genomics.

Andrei Kouranov, Lei Xie, Joanna de la Cruz, Li Chen, John Westbrook, Philip E Bourne, Helen M Berman.

Abstract

The RCSB Protein Data Bank (PDB) offers online tools, summary reports and target information related to the worldwide structural genomics initiatives from its portal at http://sg.pdb.org. There are currently three components to this site: Structural Genomics Initiatives contains information and links on each structural genomics site, including progress reports, target lists, target status, targets in the PDB and level of sequence redundancy; Targets provides combined target information, protocols and other data associated with protein structure determination; and Structures offers an assessment of the progress of structural genomics based on the functional coverage of the human genome by PDB structures, structural genomics targets and homology models. Functional coverage can be examined according to enzyme classification, gene ontology (biological process, cell component and molecular function) and disease.

Substances: Proteins

Year: 2006 PMID: 16381872 PMCID: PMC1347482 DOI: 10.1093/nar/gkj120

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The wwPDB (1) maintains the Protein Data Bank (PDB) archives of biological macromolecular structure data, currently comprising over 32500 structures. Since the year 2000, the worldwide structural genomics initiatives have provided more than 2400 structures, which have also added a large number of new folds. To represent the progress of this collective effort, the RCSB PDB (2) has developed and maintains the Structural Genomics Information Portal, which consists of three main sections, outlined below.

Structural genomics initiatives

The first section of the information portal provides summary information about each structural genomics center, including target lists, target status, targets in the PDB and sequence redundancy analyses. Summary statistics describing the overall progress of all contributing projects, including sequence similarity and number of structures determined, are regularly tabulated. As an example, an analysis of the sequence similarity of structures solved by structural genomics projects relative to structures in the PDB archive is shown in Figure 1.

Figure 1

August 2005 report from the structural genomics information portal showing structural genomics structures with sequence similarity <30% relative to solved structures in the PDB by year. Sequence comparisons are performed using the blastclust application (7).

Targets

The Targets section offers databases that track target registration data. Currently, 20 structural genomics centers contribute data to the TargetDB (3) resource. These data include the contributing project and target identifier; protein name, source organism and sequence; current production status (e.g. cloned, expressed and crystallized); related database references; and links to related project information. TargetDB assembles data from all contributing centers and makes these data available in a single validated XML data file, which is updated weekly. Targets can also be selected by searching TargetDB by target identifier, similar sequence, program or project, current production status, protein name or source organism. Search results can be captured in FASTA, TargetDB XML or HTML formats. The HTML report presents all of the contributed details about each target, including links to related project information and archival databases [e.g. sequence, PDB and BMRB (4)], and links out to protein domain databases. An additional online form constructs cumulative reports summarizing the status of a particular program or project.

Created as an extension to TargetDB, the Protein Expression Cloning and Purification Database, PepcDB, was established to collect more detailed status information and the experimental details of each step in the protein production pipeline. PepcDB captures a complete history of the experimental steps in each production trial, in addition to describing the current target production status. The status history in PepcDB also records the time interval required to complete each experimental step, with an explanation if work on a particular target or experiment was terminated. Standard protocol descriptions are collected in text form for each step of protein production. Multiple experimental trials can be described for each target. Each trial may reference a set of standard protocols and optionally include the special details of an experimental step and the experimentally observed sequence.

A validation server has been provided for PepcDB contributors. Data files validated through this form are automatically loaded into the PepcDB database. PepcDB currently includes protocol information from the NIH Protein Structure Initiative (PSI) centers. TargetDB status data from all other structural genomics centers are merged into PepcDB. As a result, PepcDB always provides the most complete view of target status and experimental information for structural genomics projects. The search features of PepcDB build upon those of TargetDB by offering additional tools to mine experimental protocols. Protocol searches are integrated with queries for target sequence and other target attributes. The resulting report includes the essential target description provided by TargetDB plus additional links to a chronological status history and links to related experimental protocols.
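
The per-step time intervals recorded in a status history can be illustrated with a small sketch. It assumes a trial is available as a list of (status, date) pairs; that layout, the function name and the example dates are hypothetical and do not reflect the actual PepcDB schema.

```python
from datetime import date


def step_intervals(history):
    """Return (from_status, to_status, days) for consecutive status events.

    `history` is a list of (status, date) pairs. This layout is an
    illustrative assumption, not the actual PepcDB schema.
    """
    # Sort chronologically so the intervals follow the order of the work.
    ordered = sorted(history, key=lambda event: event[1])
    return [
        (prev_status, next_status, (next_date - prev_date).days)
        for (prev_status, prev_date), (next_status, next_date)
        in zip(ordered, ordered[1:])
    ]


# Hypothetical production trial for a single target.
trial = [
    ("cloned", date(2005, 1, 10)),
    ("expressed", date(2005, 2, 1)),
    ("crystallized", date(2005, 4, 15)),
]

for step in step_intervals(trial):
    print(step)
```

Each tuple gives the number of days between two consecutive status changes, which is the kind of quantity a chronological status history makes available.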

Structures

The Structures section of the RCSB PDB Structural Genomics Information Portal provides information about the functional distribution of solved structures, structures being determined by structural genomics and homology models built from solved structures (5). Function is measured relative to Ensembl-assigned functions from the human genome (6) and disease relative to OMIM assignments for human diseases (7). This section answers the question ‘With respect to the function of proteins identified in humans and human disease, what does the present complement of structures in the PDB, the structural genomics targets (if all were solved) and homology models that can be built from the current set of templates add to our understanding of living systems?’ The answer to this question changes over time, and the functional distribution resource provides a current answer because the constituent components needed to address the question (PDB structures, structural genomics targets, homology models from SUPERFAMILY (8), functional assignments from Ensembl and disease classifications from OMIM) are all updated as they change, at intervals ranging from weekly for PDB structures and targets to approximately annually for SUPERFAMILY.

The answer to the question also depends on the definition of a homology model. Here the structural templates used in homology modeling were a set of hidden Markov models taken from SUPERFAMILY 1.65. The sequences were aligned to the structural template with HMMER (9). Only those assigned domains with sequence identity >30% in the alignment were considered as homology models.

Through the functional distribution site, this question can be addressed by examining molecular function, biological process and cellular component [as assigned by the Gene Ontology, GO (10)], enzymes via their EC numbers, and diseases assigned through OMIM (7). Several steps are used to define the search parameters; here molecular function is used as an example.
In Step 1 (Molecular Function) the breadth of the search is defined, which in turn defines the details presented in the results. For example, the top level of the GO hierarchy for molecular function is displayed and used by default. All structures can be selected, or a subgroup can be selected (e.g. all structures with the molecular function ‘vitamin binding’) by browsing through the hierarchical tree. Similarly, in Step 2 (Structure Type), all structures are chosen by default, but it is possible to drill down and explore just groups of structures based on the SCOP classification of class (all alpha, all beta, etc.) (11). Step 3 selects the genome. At present only the human genome is available, but other model organisms will be added. Step 4 selects the sequence identity to use, with 40% identity the default. Sequence identity defines how the human genome sequences are clustered and a single function assigned for that cluster; at lower sequence identity there are fewer clusters, i.e. the results are effectively at lower resolution. Step 5 specifies the domain combinations needed for a match. Since PDB structures frequently represent a single domain in a larger complex, statistics can be produced requiring overlap for one or more domains up to the whole structure, accounting for domain rearrangements [see (5) for a full description].

Based on these input parameters, one of three distributions can be generated: a comparison of the distribution of PDB structures, structural genomics targets or homology models against the human genome; a ‘most wanted list’ of structures, i.e. those not in the PDB and which (by default) are not identified through homology modeling or in the structural genomics targets yet have significant presence in the human genome; and simple charts showing the distribution of the genome sequences, PDB structures, structural genomics targets or homology models.
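
The effect of the Step 4 identity threshold on the number of clusters can be illustrated with a toy example. The sketch below is not the clustering procedure used by the portal, which is not described here; it assumes pre-aligned sequences of equal length, a simple position-by-position identity, and a greedy assignment to the first matching cluster representative. The function names and sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_clusters(seqs, threshold):
    """Greedy clustering: join the first cluster whose representative matches.

    A toy stand-in for a real clustering tool, used only to show how the
    threshold changes the number of clusters.
    """
    clusters = []  # each cluster is a list; element 0 is its representative
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters.append([seq])
    return clusters


# Hypothetical aligned sequences of length 10.
seqs = ["AAAAAAAAAA", "AAAAAAAACC", "AAAACCCCCC", "CCCCCCCCCC"]

for threshold in (0.9, 0.4):
    print(threshold, len(greedy_clusters(seqs, threshold)))
```

With these toy sequences the stricter threshold leaves every sequence in its own cluster, while the 40% threshold merges them into fewer clusters, which is the loss of resolution described for Step 4.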
Most distributions are accompanied by two tables illustrating, first, the functional coverage by each data type (Table 1), and second, the correlation between input data types (data not shown). The actual overlap between these groups will be added as part of ongoing development. For example, in Table 1, PDB structures cover 37.2% of the identified molecular functions in the human genome; if solved, structural genomics targets would cover 32.4% of functions; and 56.3% of the molecular functions can be modeled from existing structures. Figure 2 illustrates the resulting normalized distributions for the top level of the GO molecular function hierarchy. At this level most distributions are not skewed, with the exception of ‘molecular function unknown’, where PDB structures are underrepresented and structural genomics targets are overrepresented. This is not surprising, since until structural genomics began, structural biology was dominated by determining structures of proteins of known function. In the era of structural genomics, that trend has reversed. Drilling down to more detailed descriptions of molecular function (data not shown) reveals a more uneven distribution and suggests changes in structure determination strategies.
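
Although the overlap between data types is not tabulated, a rough estimate can be read from Table 1 under one assumption that the source does not state explicitly: that each combined row reports the union of the functions covered by its components. Writing P for the fraction of functions covered by PDB structures and S for the fraction covered by structural genomics targets, inclusion-exclusion then gives:

```latex
\begin{align}
|P \cap S| &= |P| + |S| - |P \cup S| \\
           &= 0.372 + 0.324 - 0.515 \\
           &= 0.181
\end{align}
```

Under that assumption, about 18% of the molecular functions would be covered by both a PDB structure and a structural genomics target.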

Table 1

Genome coverage

	Function coverage	Cluster coverage
Genome sequences	1.000	1.000
PDB structures	0.372	0.094
SG targets	0.324	0.156
Homology models	0.563	0.283
PDB structures + SG targets	0.515	0.239
PDB structures + homology models	0.595	0.303
SG targets + homology models	0.663	0.411
PDB structures + SG targets + homology models	0.687	0.428

Data are based upon 10801 functionally described human genome sequences from Ensembl, 942 PDB structures from human, 1680 structural genomics targets identified in human and 2823 homology models from SUPERFAMILY mapped on to the human genome. Cluster coverage is the ratio of the number of protein clusters that are structurally covered to the number of all clusters in the genome for a functional class at a specified sequence identity (40% in this case). Functional class and sequence identity are input parameters.
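
The two ratios in Table 1 can be sketched in a few lines. The sketch assumes that a cluster counts as structurally covered when at least one of its member sequences has a structure, target or model, and that a function counts as covered when at least one of its clusters is covered; both rules, the data layout and all names below are illustrative assumptions rather than the portal's documented procedure.

```python
def cluster_coverage(clusters, covered_ids):
    """Fraction of clusters with at least one structurally covered member."""
    if not clusters:
        return 0.0
    covered = sum(
        1 for members in clusters
        if any(member in covered_ids for member in members)
    )
    return covered / len(clusters)


def function_coverage(function_to_clusters, covered_ids):
    """Fraction of functional classes with at least one covered cluster."""
    if not function_to_clusters:
        return 0.0
    covered = sum(
        1 for clusters in function_to_clusters.values()
        if cluster_coverage(clusters, covered_ids) > 0
    )
    return covered / len(function_to_clusters)


# Hypothetical genome: functional class -> clusters of sequence identifiers.
genome = {
    "kinase activity": [["s1", "s2"], ["s3"]],
    "vitamin binding": [["s4"]],
    "molecular function unknown": [["s5"], ["s6"]],
}
covered = {"s2", "s5"}  # sequences assumed to have a structure

print(cluster_coverage(genome["kinase activity"], covered))
print(function_coverage(genome, covered))
```

Because a single covered cluster is enough to mark a function as covered under these assumptions, function coverage is never lower than the cluster coverage averaged over classes, which is consistent with the pattern of values in Table 1.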

Figure 2

Normalized functional coverage of the human genome by sequence (from Ensembl; red), by structures from the PDB (blue), by structural genomics targets (green) and homology models from SUPERFAMILY (yellow). When viewing the figure from the online structural genomics portal, clicking on the appropriate bar of the histogram will produce a list of sequences or structures that define the distribution.

An important feature of this resource is the ‘most wanted list’ of structures, which is based on the following criteria: (i) the protein belongs to a functional category that is underrepresented by structures; (ii) among the proteins from (i), the protein cannot be modeled, i.e. it is a human genome protein without a SUPERFAMILY assignment; (iii) the protein can be associated with a human disease; and (iv) the protein is not identified as likely to be intractable, i.e. proteins with a transmembrane segment are filtered out.
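
The four criteria can be read as successive filters. The sketch below treats all of them as required conditions, which is one interpretation of the list above; the record type, its field names and the example entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    name: str
    category_underrepresented: bool   # criterion (i)
    has_superfamily_assignment: bool  # criterion (ii): True means it can be modeled
    disease_associated: bool          # criterion (iii)
    has_transmembrane_segment: bool   # criterion (iv): True means likely intractable


def most_wanted(proteins):
    """Keep proteins that satisfy all four criteria (assumed to be conjunctive)."""
    return [
        p for p in proteins
        if p.category_underrepresented
        and not p.has_superfamily_assignment
        and p.disease_associated
        and not p.has_transmembrane_segment
    ]


# Hypothetical candidates.
candidates = [
    Protein("protA", True, False, True, False),   # passes every filter
    Protein("protB", True, True, True, False),    # can be modeled
    Protein("protC", True, False, True, True),    # transmembrane segment
    Protein("protD", False, False, True, False),  # category well represented
]

print([p.name for p in most_wanted(candidates)])
```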

CONCLUSION

In this report, we present the resources currently made available through the RCSB PDB in support of the structural genomics effort. It is expected that further functionality will be added as the second phase of the PSI and other worldwide efforts move forward.

11 in total

1. Announcing the worldwide Protein Data Bank.

Authors: Helen Berman; Kim Henrick; Haruki Nakamura
Journal: Nat Struct Biol Date: 2003-12

2. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

3. TargetDB: a target registration database for structural genomics projects.

Authors: Li Chen; Rose Oughtred; Helen M Berman; John Westbrook
Journal: Bioinformatics Date: 2004-05-06 Impact factor: 6.937

4. BioMagResBank database with sets of experimental NMR constraints corresponding to the structures of over 1400 biomolecules deposited in the Protein Data Bank.

Authors: Jurgen F Doreleijers; Steve Mading; Dimitri Maziuk; Kassandra Sojourner; Lei Yin; Jun Zhu; John L Markley; Eldon L Ulrich
Journal: J Biomol NMR Date: 2003-06 Impact factor: 2.835

6. Ensembl 2005.

Authors: T Hubbard; D Andrews; M Caccamo; G Cameron; Y Chen; M Clamp; L Clarke; G Coates; T Cox; F Cunningham; V Curwen; T Cutts; T Down; R Durbin; X M Fernandez-Suarez; J Gilbert; M Hammond; J Herrero; H Hotz; K Howe; V Iyer; K Jekosch; A Kahari; A Kasprzyk; D Keefe; S Keenan; F Kokocinsci; D London; I Longden; G McVicker; C Melsopp; P Meidl; S Potter; G Proctor; M Rae; D Rios; M Schuster; S Searle; J Severin; G Slater; D Smedley; J Smith; W Spooner; A Stabenau; J Stalker; R Storey; S Trevanion; A Ureta-Vidal; J Vogel; S White; C Woodwark; E Birney
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

7. Database resources of the National Center for Biotechnology Information.

Authors: David L Wheeler; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Deanna M Church; Michael DiCuccio; Ron Edgar; Scott Federhen; Wolfgang Helmberg; David L Kenton; Oleg Khovayko; David J Lipman; Thomas L Madden; Donna R Maglott; James Ostell; Joan U Pontius; Kim D Pruitt; Gregory D Schuler; Lynn M Schriml; Edwin Sequeira; Steven T Sherry; Karl Sirotkin; Grigory Starchenko; Tugba O Suzek; Roman Tatusov; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

8. The SUPERFAMILY database in 2004: additions and improvements.

Authors: Martin Madera; Christine Vogel; Sarah K Kummerfeld; Cyrus Chothia; Julian Gough
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

9. Profile hidden Markov models.

Authors: S R Eddy
Journal: Bioinformatics Date: 1998 Impact factor: 6.937

10. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

11. SCOP: a structural classification of proteins database.

Authors: L Lo Conte; B Ailey; T J Hubbard; S E Brenner; A G Murzin; C Chothia
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971
