SGCEdb: a flexible database and web interface integrating experimental results and analysis for structural genomics focusing on Caenorhabditis elegans.

David H Johnson, Jun Tsao, Ming Luo, Mike Carson.

Abstract

The SGCEdb (http://sgce.cbse.uab.edu) database/interface serves the primary purpose of reporting progress of the Structural Genomics of Caenorhabditis elegans project at the University of Alabama at Birmingham. It stores and analyzes results of experiments ranging from solubility screening arrays to individual protein purification and structure solution. External databases and algorithms are referenced and evaluated for target selection in the human, C.elegans and Pneumocystis carinii genomes. The flexible and reusable design permits tracking of standard and custom experiment types in a scientist-defined sequence. The database coordinates efforts between collaborators and is adaptable to a wide range of biological applications.

Year: 2006 PMID: 16381914 PMCID: PMC1347399 DOI: 10.1093/nar/gkj036

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The NIH-NIGMS-sponsored Protein Structure Initiative (1) initially sponsored seven pilot projects in structural genomics, including the Southeast Collaboratory for Structural Genomics (SECSG) (2). The Structural Genomics of Caenorhabditis elegans (SGCE) project established a pipeline for high-throughput protein expression, microarray crystallization screening, X-ray data collection and user-friendly bioinformatics. The initial SGCE focus was the nematode worm C.elegans, one of the best-studied multicellular model organisms (3), whose complete genomic sequence is known (4). The SGCE project is facilitated by the C.elegans ORFeome project (5), which aims to clone all predicted protein-encoding open reading frames (ORFs) as Gateway Entry clones (6). The supplied 96-well plates of cDNA enabled a high-throughput approach to recombinant protein expression and analysis. The initial purpose of the SGCEdb was to report progress on the SGCE project. However, a pipeline to express, purify and crystallize over 10 000 C.elegans ORFs (7) required more comprehensive tracking for meaningful analysis. Additional requirements included sample tracking and monitoring of group progress, to ensure a smooth transition between the expression, purification and crystallization groups. Plates of cDNA ORFs for robotic screening needed to be prioritized based on the expected results. To prioritize them, established external databases and analysis algorithms had to be applied on a genome-wide basis. Finally, the methods used to accomplish each step were updated regularly; the database and interface system therefore needed to be flexible enough to rearrange and modify experiments before the stored data became obsolete with respect to the experiments performed. The resulting SGCEdb is a database and interface framework that has been applied to a variety of genomes and heterogeneous experiment methodologies.

THE DATABASE

Browsing the SGCEdb

The website provides the public interface. Selecting the ‘Results’ button at the top of the homepage allows searching for any protein target. The target ID used for selecting a single protein is generally the WormBase (8) ‘AceID’, familiar to C.elegans researchers. Unlike WormBase, which concentrates on genomic information, the SGCEdb concentrates on information about the putative proteins. This is similar in spirit to other structural genomics databases (9). The SGCE was among the first structural genomics groups to make its protein production data publicly available. The most detailed way of browsing the SGCEdb is by viewing one of the 16 566 individual C.elegans proteins. In this view, links are provided to the WormBase website (10), the ORFeome project (5) and the Protein Information Resource (11), if available. BLAST (12) results against the Protein Data Bank (PDB) structures (13) are updated weekly. BLAST results against the non-redundant database (14) and Pfam (15) results are updated periodically. Theoretical values, such as isoelectric point and hydrophobicity, are calculated using the ExPASy site (16). Transmembrane protein (17), signal peptide (18) and PROSITE (19) results are available. For each of the 11 727 proteins on which experiments have been performed, a complete record of the experimental results is also given. All of these have at least expression and preliminary solubility data (7). An example web page of a protein view with a completed structure (20) is shown in Figure 1.
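As a rough illustration of what a calculated hydrophobicity value can look like, the sketch below computes the grand average of hydropathy (GRAVY) of a sequence with the Kyte-Doolittle scale. This is an assumption for illustration only: the SGCEdb obtains its theoretical values from the ExPASy site, and the text does not say which hydrophobicity measure is stored. The function name `gravy` is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Return the mean Kyte-Doolittle hydropathy of a protein sequence.

    Hypothetical helper, not part of the SGCEdb: it only shows the kind of
    per-sequence value that could be stored once per target.
    """
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
```

Positive values indicate a more hydrophobic sequence on this scale and negative values a more hydrophilic one.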

Figure 1

Web display for the target AceID no. C55C2.2. The left-hand side shows the information available for all targets. The right-hand side shows the experimental data generated. The orange ellipses indicate additional information omitted to create the figure.

Results are also accessible by XML and by experimental stage. The XML reports are updated weekly and are intended to provide improved interoperability with external databases. Database queries are used to format all results according to the TargetDB and the Protein Expression, Purification and Crystallization (PEPCdb) XML standards for structural genomics maintained at the PDB (13). These standards include experiment data for each stage from expression through protein structure. Individuals interested in detailed results of a specific step can go directly to that page, e.g. ‘Structures’. The modeling page contains secondary structures and other predictions available from PredictProtein (21). The modeling page also provides distributions per plate and histograms over the entire database for calculated protein parameters. The parameter distribution per plate is used internally to prioritize efforts. A variety of per-plate experimental reports are also available.
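The per-plate parameter distributions mentioned above could be summarized along the following lines. This is a minimal sketch under stated assumptions: the plate labels, the example values and the ranking rule (lowest median first) are all hypothetical, and the text does not describe how the SGCE project actually weighs the distributions.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_plate(records):
    """Group (plate_label, value) pairs and summarize each plate.

    `records` is any iterable of (plate_label, numeric_value) tuples,
    for example one calculated protein parameter per well.
    """
    grouped = defaultdict(list)
    for plate, value in records:
        grouped[plate].append(value)
    return {
        plate: {"n": len(values), "mean": mean(values), "median": median(values)}
        for plate, values in grouped.items()
    }


def rank_plates(summary, key="median"):
    """Order plate labels by a summary statistic, lowest first.

    The ordering rule is an assumption made for illustration only.
    """
    return sorted(summary, key=lambda plate: summary[plate][key])


# Hypothetical example values for two plates.
records = [
    ("plate_A", -0.5), ("plate_A", -0.1),
    ("plate_B", 0.4), ("plate_B", 0.2), ("plate_B", 0.3),
]
summary = summarize_by_plate(records)
ranking = rank_plates(summary)
```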

Database design

The complete system was developed in close collaboration with the SGCE project scientists and technicians. The infrastructure was constructed entirely with open-source tools including the Apache web server, the Python language and the PostgreSQL database system. A very brief synopsis is given below. A more detailed summary of the design and data entry, including figures, is provided in Supplementary Data. The database of SGCEdb is separated into two distinct parts: protein source and experiment tracking. The protein source tables handle external database references, target selection and sequence analysis. Experiment tracking tables store detailed results for expression, solubility, purification and crystallization. Each schema implements advanced database methods to efficiently and comprehensively track proteins, results and analyses throughout the experimental process. When a protein source is initially received or considered for study, the sequences are organized into plates and wells. The protein source schema has configurable plate geometry, allowing it to track protein production, 96-well solubility screens and 384-well crystallization trials. By keeping plates, wells and sequences separate in the protein source schema, data attributes can be assigned at the appropriate level. For instance, the Pfam (15) algorithm is initially executed and stored once per sequence. Experiment parameters common to an entire screening plate are stored once per plate, whereas individual experiment results are stored at the well level. This design provides efficient queries of experimental data by eliminating duplicate information and allows the design to scale to experiments over the entire C.elegans genome.

Experiment history is stored in a condensed form, called parenthetical tree notation (22). This notation is exceptionally useful for querying lineage information. Populating the history tree is handled by the database during data entry, with the scientists only having to indicate the source of the experiment being entered. A major benefit of storing an experiment history tree is the ability to efficiently retrieve every experiment that has influenced an experiment of interest. It is also useful to have a ‘generic’ experiment table when tracking the sequence of experiments. By tracking experiment history using a generic experiment ID, any combination of available experiments can be combined and tracked without a major database re-design. In general, the sequence of experiments is fixed by protocol. However, in the event of an unexpected result, quality control experiments such as mass spectrometry and additional chromatography column experiments may be inserted anywhere in the sequence of experiments. All subsequent experiments will include the unplanned mass spectrometry in a query of their history. SGCEdb uses Zope (23) for data entry forms and PHP for display. A benefit of Zope is the ability to reuse common data entry components to quickly assemble entry forms for the bench scientist.
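The separation of sequences, plates and wells, together with a generic experiment table and lineage queries, can be sketched as follows. This is not the SGCEdb schema: every table and column name is hypothetical, SQLite stands in for PostgreSQL so that the example is self-contained, and a simple parent pointer with a recursive query stands in for the parenthetical tree notation, whose exact encoding is not given in the text.

```python
import sqlite3

# Hypothetical, simplified schema: attributes live at the level they describe.
SCHEMA = """
CREATE TABLE sequence (id INTEGER PRIMARY KEY, target_id TEXT UNIQUE);
CREATE TABLE plate (id INTEGER PRIMARY KEY, label TEXT,
                    n_rows INTEGER, n_cols INTEGER);  -- configurable geometry
CREATE TABLE well (id INTEGER PRIMARY KEY,
                   plate_id INTEGER REFERENCES plate(id),
                   position TEXT,
                   sequence_id INTEGER REFERENCES sequence(id));
CREATE TABLE experiment (id INTEGER PRIMARY KEY,  -- generic experiment ID
                         kind TEXT NOT NULL,
                         well_id INTEGER REFERENCES well(id),
                         parent_id INTEGER REFERENCES experiment(id));
"""


def add_experiment(conn, kind, well_id, parent_id=None):
    """Insert an experiment; the caller only names its source (parent)."""
    cur = conn.execute(
        "INSERT INTO experiment (kind, well_id, parent_id) VALUES (?, ?, ?)",
        (kind, well_id, parent_id),
    )
    return cur.lastrowid


def history(conn, experiment_id):
    """Return the kinds of all ancestor experiments, oldest first."""
    rows = conn.execute(
        """
        WITH RECURSIVE lineage(id, kind, parent_id, depth) AS (
            SELECT id, kind, parent_id, 0 FROM experiment WHERE id = ?
            UNION ALL
            SELECT e.id, e.kind, e.parent_id, l.depth + 1
            FROM experiment AS e JOIN lineage AS l ON e.id = l.parent_id
        )
        SELECT kind FROM lineage WHERE depth > 0 ORDER BY depth DESC
        """,
        (experiment_id,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO sequence (id, target_id) VALUES (1, 'C55C2.2')")
conn.execute("INSERT INTO plate (id, label, n_rows, n_cols) VALUES (1, 'demo', 8, 12)")
conn.execute("INSERT INTO well (id, plate_id, position, sequence_id) VALUES (1, 1, 'A1', 1)")

expression = add_experiment(conn, "expression", 1)
solubility = add_experiment(conn, "solubility", 1, expression)
# Unplanned quality-control step inserted into the usual sequence.
mass_spec = add_experiment(conn, "mass_spectrometry", 1, solubility)
purification = add_experiment(conn, "purification", 1, mass_spec)
crystallization = add_experiment(conn, "crystallization", 1, purification)
lineage = history(conn, crystallization)
```

Because every experiment type shares one generic ID and records only its source, the inserted mass spectrometry step appears in the history of every later experiment without any change to the table layout.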

FUTURE DEVELOPMENTS

Currently, the SGCEdb framework has the capability to handle individual and plate-based experiments with external database reference and algorithm analysis. A modified version of the system is in use at Beijing University. Plans for the SGCEdb framework include an experiment creation tool and modular visualization methods. An experiment creation tool would generate table definitions, entry forms and standard HTML views based on data types specified in a graphical interface. This tool would allow end users to rapidly add new experiments to the framework. The modular visualization methods would use Java display layers to allow the user to superimpose additional information where it is most useful, for instance secondary structure under the sequence, information content over a 3D structure (24) or gel markers beside a gel image. These future development plans are geared toward making customization easier for independent laboratories.
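The table-definition part of such an experiment creation tool might look like the sketch below. Since the tool is described only as a plan, everything here is an assumption: the function name, the supported data types, the fixed `experiment_id` column and the example experiment are hypothetical, and SQLite is used only to check that the generated statement is valid.

```python
import re
import sqlite3

# Hypothetical mapping from user-facing data types to SQL column types.
SQL_TYPES = {"integer": "INTEGER", "real": "REAL", "text": "TEXT", "boolean": "INTEGER"}
IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")
RESERVED = {"id", "experiment_id"}


def table_definition(name, fields):
    """Build a CREATE TABLE statement from a {column_name: data_type} spec."""
    for identifier in [name, *fields]:
        if not IDENTIFIER.match(identifier):
            raise ValueError(f"invalid identifier: {identifier!r}")
    columns = ["id INTEGER PRIMARY KEY", "experiment_id INTEGER NOT NULL"]
    for column, data_type in fields.items():
        if column in RESERVED:
            raise ValueError(f"column name is reserved: {column!r}")
        if data_type not in SQL_TYPES:
            raise ValueError(f"unsupported data type: {data_type!r}")
        columns.append(f"{column} {SQL_TYPES[data_type]}")
    return f"CREATE TABLE {name} (\n    " + ",\n    ".join(columns) + "\n);"


# Hypothetical experiment type defined by an end user.
ddl = table_definition(
    "gel_filtration",
    {"column_type": "text", "flow_rate_ml_min": "real", "passed_qc": "boolean"},
)
check = sqlite3.connect(":memory:")
check.execute(ddl)  # raises sqlite3.Error if the statement is malformed
```

Entry forms and HTML views could be derived from the same field specification, which keeps a new experiment type defined in one place.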

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

21 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.

Authors: A Krogh; B Larsson; G von Heijne; E L Sonnhammer
Journal: J Mol Biol Date: 2001-01-19 Impact factor: 5.469

3. Structural genomics programs at the US National Institute of General Medical Sciences.

Authors: J C Norvell; A Z Machalek
Journal: Nat Struct Biol Date: 2000-11

4. SPINE: an integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics.

Authors: P Bertone; Y Kluger; N Lan; D Zheng; D Christendat; A Yee; A M Edwards; C H Arrowsmith; G T Montelione; M Gerstein
Journal: Nucleic Acids Res Date: 2001-07-01 Impact factor: 16.971

5. The Protein Information Resource.

Authors: Cathy H Wu; Lai-Su L Yeh; Hongzhan Huang; Leslie Arminski; Jorge Castro-Alvear; Yongxing Chen; Zhangzhi Hu; Panagiotis Kourtesis; Robert S Ledley; Baris E Suzek; C R Vinayaka; Jian Zhang; Winona C Barker
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

6. The Southeast Collaboratory for Structural Genomics: a high-throughput gene to structure factory.

Authors: Michael W W Adams; Harry A Dailey; Lawrence J DeLucas; Ming Luo; James H Prestegard; John P Rose; Bi-Cheng Wang
Journal: Acc Chem Res Date: 2003-03 Impact factor: 22.384

7. WormBase: network access to the genome and biology of Caenorhabditis elegans.

Authors: L Stein; P Sternberg; R Durbin; J Thierry-Mieg; J Spieth
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

8. Crystal structure of the cytoskeleton-associated protein glycine-rich (CAP-Gly) domain.

Authors: Songlin Li; Jim Finley; Zhi-Jie Liu; Shi-Hong Qiu; Hongli Chen; Chi-Hao Luan; Mike Carson; Jun Tsao; David Johnson; Guangda Lin; Jun Zhao; Willie Thomas; Lisa A Nagy; Bingdong Sha; Lawrence J DeLucas; Bi-Cheng Wang; Ming Luo
Journal: J Biol Chem Date: 2002-09-07 Impact factor: 5.157

9. Database resources of the National Center for Biotechnology.

Authors: David L Wheeler; Deanna M Church; Scott Federhen; Alex E Lash; Thomas L Madden; Joan U Pontius; Gregory D Schuler; Lynn M Schriml; Edwin Sequeira; Tatiana A Tatusova; Lukas Wagner
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

10. The Pfam protein families database.

Authors: Alex Bateman; Ewan Birney; Lorenzo Cerruti; Richard Durbin; Laurence Etwiller; Sean R Eddy; Sam Griffiths-Jones; Kevin L Howe; Mhairi Marshall; Erik L L Sonnhammer
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971
