Literature DB >> 17099234

WormBase: new content and better access.

Tamberlyn Bieri¹, Darin Blasiar, Philip Ozersky, Igor Antoshechkin, Carol Bastiani, Payan Canaran, Juancarlos Chan, Nansheng Chen, Wen J Chen, Paul Davis, Tristan J Fiedler, Lisa Girard, Michael Han, Todd W Harris, Ranjana Kishore, Raymond Lee, Sheldon McKay, Hans-Michael Müller, Cecilia Nakamura, Andrei Petcherski, Arun Rangarajan, Anthony Rogers, Gary Schindelman, Erich M Schwarz, Will Spooner, Mary Ann Tuli, Kimberly Van Auken, Daniel Wang, Xiaodong Wang, Gary Williams, Richard Durbin, Lincoln D Stein, Paul W Sternberg, John Spieth.

Abstract

WormBase (http://wormbase.org), a model organism database for Caenorhabditis elegans and other related nematodes, continues to evolve and expand. Over the past year WormBase has added new data on C.elegans, including data on classical genetics, cell biology and functional genomics; expanded the annotation of closely related nematodes with a new genome browser for Caenorhabditis remanei; and deployed new hardware for stronger performance. Several existing datasets including phenotype descriptions and RNAi experiments have seen a large increase in new content. New datasets such as the C.remanei draft assembly and annotations, the Vancouver Fosmid library and TEC-RED 5' end sites are now available as well. Access to and searching WormBase has become more dependable and flexible via multiple mirror sites and indexing through Google.

Entities: Chemical

Mesh：

Year: 2006 PMID： 17099234 PMCID： PMC1669750 DOI： 10.1093/nar/gkl818

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

DESCRIPTION

Much of our understanding of biological processes and human biology comes from a vast amount of data generated in the study of a few model organisms. The National Human Genome Research Institute (NHGRI; ) in the United States and the British Medical Research Council have funded model organism databases (MODs) to store and integrate information for individual model organisms. WormBase is one of these MODs for the small, soil nematode, Caenorhabditis elegans. WormBase originated in 2001 and initially was simply a web-based interface for its predecessor, ACeDB (1,2) (), which contained primarily the genetic and physical maps, and the genome sequence of C.elegans. Over the years, WormBase has evolved and expanded through developing and utilizing better software, and providing richer content (3) by bringing in data from classical genetics, cell biology and functional genomics (4,5), and the genomic sequences of other closely related nematodes (6,7). WormBase is maintained by the International WormBase Consortium, a group of ∼30 scientists located at four sites (). The data are organized into a single database that is available for downloading () and web browsing. Access to, and use of, the database and resources are freely available, subject to an Acceptable Use Policy () and copyrights (). There is also a User's Guide (), an evolving Wiki site () and an email-based Help Desk (wormbase-help@wormbase.org) WormBase is not a static database. It is constantly growing with new data continually being added and made available to users through new builds and releases of the database every three weeks. Every 10th release is maintained as a permanently available, stable data source for reference. This review is an overview of major new content, software and performance improvements, and how the user community helps to focus efforts of WormBase.

NEW CONTENT

WormBase's content has doubled in the last three years as measured by the size of its underlying database (Figure 1). This growth is expected to increase as research using C.elegans expands and more nematode genome sequencing projects are completed.

Figure 1

The increase in content of WormBase as measured by the size of the underlying database. Plotted are the sizes in gigabytes of the archived releases of the WormBase database.

Additions to existing data types

Phenotypes

Over the past year one of the largest percentage increases in an existing data type has been for phenotype objects. There have been major changes in phenotype curation, including a new and expanded Phenotype Ontology, the evaluation and consolidation of preexisting phenotype classes, and improvements in phenotype curation methods. All of these allow more accurate and detailed phenotype information to be stored in WormBase, and make querying for phenotype-related information more efficient. Consequently, the number of phenotype objects in WormBase has increased dramatically over the past year from 119 to 1282.

Gene Ontology

Gene products annotated with Gene Ontology (GO) terms are another existing data type to see a large percentage increase. WormBase is a member of the Gene Ontology Consortium (8), and is developing GO content and annotating C.elegans' genes in GO terms. Currently there are over 50 000 GO annotations for over 10 000 genes in WormBase. Most of these are automated mappings, although there are curator-performed, manual annotations for ∼700 genes with over 800 unique papers cited as references. The automatic GO annotations are of two types. The first is the automated assignment of GO terms to genes based on the RNAi-knockdown phenotype. For example, an RNAi phenotype of Egl (egg-laying defective) is automatically assigned the GO biological process term ‘oviposition’. These are based on manual mapping of phenotypes to GO terms from large-scale RNAi screens. The second type of automated GO annotation uses the InterPro2GO mappings () provided by the European Bioinformatics Institute (). WormBase has recently begun determining InterPro domains with each build to ensure that they are completely up to date. These are GO term assignments based on protein family, domain, repeat, etc. In WormBase the GO annotations can now be found in a Gene Ontology section on the Gene page, which is the most common entry point for users (Supplementary Figure 1a). A summary of all the GO annotations (Supplementary Figure 1b) is reached through the GO summary link in the ‘Gene Ontology’ section. Details for individual annotations (Supplementary Figure 1c) can also be obtained through links in the section.

RNAi

RNAi data have also increased significantly over the past year, specifically from papers that describe individual RNAi experiments. Curation of these experiments is slow and labor intensive compared with papers containing large-scale datasets. Nonetheless, there has been a 150% increase in the individual RNAi experiments in WormBase. There are currently 2754 such experiments. Including curation of large-scale datasets, there are 63 740 total, curated RNAi experiments in WormBase, compared with 59 882 a year ago. There is still more to do, however. Approximately 45% of 871 C.elegans papers that contain RNAi data have been curated, compared with just over 20% a year ago. RNAi results are summarized in the Function section of the Gene page (Supplementary Figure 2a). A link to view all RNAi experiments that target the gene takes the user to a table (Supplementary Figure 2b) containing the details and results of each RNAi experiment.

Microarray data

The June 2006 database release (WS160) has microarray data from 33 papers describing 417 experiments, which represents almost a doubling in the number of curated, microarray papers and a 78% increase in the number of curated experiments in WormBase over the past year. To support microarray research in the worm community, the probe sets for two new microarray platforms available from Agilent () and Washington University's Genome Sequencing Center () are now in WormBase. These are in addition to the Affymetrix chip data previously available in WormBase. All of these data can be downloaded using the WormBase data-mining tool, WormMart (). The probes are mapped to the genome sequence and displayed as a track on the Genome Browser (9) (Supplementary Figure 3a). They are also mapped to corresponding gene models (Supplementary Figure 3b) with detailed reports (Supplementary Figure 3c) linked to the Gene page.

Intellectual lineage

The Intellectual Lineage was conceived as a way to indicate the heritage of the C.elegans research community. The types of connections a person can have are as follows: Supervised, Supervised by and Collaborated with. There are 1461 people that have connections to other people with 3729 person-to-person connections. This is up from 1030 people and 2626 person-to-person connections from a year ago. The lineage for an individual can be found on their Person page (Supplementary Figure 4), while the global view of the entire worm community lineage is available at WormBase ().

New datasets

Caenorhabditis remanei annotations

A preliminary, draft assembly and automated annotations of the C.remanei genome are now available on a Genome Browser () and can be download from the WormBase ftp site (). Annotations include gene predictions from several ab initio and alignment-based algorithms. Sequence improvement is ongoing at the Washington University Genome Sequencing Center with a final draft assembly and new annotations anticipated early in 2007. A summary of the best BLASTP match to C.remanei is now displayed on C.elegans and Caenorhabditis briggsae Gene pages (Supplementary Figure 5).

Vancouver fosmids

The fosmid library constructed and mapped to the genome by Don Moerman and colleagues at the C.elegans Reverse Genetics Core Facility located at the University of British Columbia is now available on the Genome Browser in the YAC, Fosmids and Cosmids track (Supplementary Figure 6) and searchable in WormBase. This library was made to supplement the aging cosmid and YAC libraries used in the original mapping and sequencing of the genome, some of which are now over 20 years old, and are more difficult to work with than fosmids.

TEC-RED

This relatively new technique involves trans-spliced exon coupled 5′ RNA end determination, which identifies 5′ ends of expressed genes in nematodes [details of the procedure can be found in reference (10)]. The TEC-RED data of Hwang et al. (10) is now in WormBase and can be viewed in the Genome Browser (Supplementary Figure 7). These data have been used extensively to update numerous gene structures.

IMPROVED ACCESS AND USABILITY

Fast and robust access

Community use of WormBase continues to grow. There are now over 2 million page hits per month, a 40% increase over the past year, with the Gene page being the most frequently accessed page. This increase prompted the deployment of a multiply-redundant, load-balancing server system, which now distributes requests across three back-end servers, each capable of serving the complete WormBase site. Two additional servers handle specialized tasks such as BLAST requests and query pages. There is also robust caching that delivers the most frequently requested pages from a high-speed, in-memory and on-disk cache. These improvements have resulted in dramatic increases in reliability and performance. Inserting low-cost machines into the existing infrastructure can easily accommodate increased demand in the future. Access to the data at WormBase is becoming more diverse and flexible. In addition to the main WormBase site located at the Cold Spring Harbor Laboratory (), there are now two new mirror sites: one at the Welcome Trust Sanger Institute in England () and one at the University of Marseille in France () that provide flexibility and backup if the main server experiences problems. This increases the total number of mirror sites to four with the existing sites at the California Institute of Technology () and the Institute of Molecular Biology and Biochemistry in Greece (). There are links to all the mirror sites and the main and development site on the WormBase homepage (Supplementary Figure 8). Another portal to WormBase is through WormBook (). WormBook content is extensively linked to WormBase with Genes, Proteins and Cells linked to the relevant pages in WormBase. Links from WormBase back to WormBook have recently been implemented to provide users easy access to background and in-depth information.

Improved searching

More recently, access to WormBase can be gained through search engines such as Google, which are now able to index the site. Previously their access had been restricted due to load issues on the old, single server infrastructure. Researchers worldwide can now search the vast content of highly curated data without having to actually visit the WormBase site. Google searches can be limited to WormBase results by using the format ‘site: [search terms]’. Google's page caching also provides an additional layer of redundancy during unexpected server interruptions at the WormBase site. Searching for strains at the Caenorhabditis Genetic Center (CGC) has been improved and is now hosted at WormBase. In addition to searching by strain name, users may now search for strains carrying a specific gene or allele, or other information found in the strain class (i.e. species, mutagen, data received and remarks).

Improved data presentation

Data layout and access has become more intuitive with new or updated web pages. Information from the Protein Data Bank's TargetDB (11) (), which provides status and tracking information on the production and solution of structures, is now displayed on Gene pages. Also on the Gene page, a GO summary (Supplementary Figure 1) has been added as well as three types of evolutionary data: InParanoid clusters of orthologous genes (12) (), Treefam's (13) () curated ortholog and paralog assignments (Figure 2), and C.elegans/C.briggsae mutual best BLASTP matches (Supplementary Figure 5). The Gene pages also include the phenotype of RNAi experiments, and physical and genetic interactions, which are now displayed in a convenient table (Supplementary Figure 2a).

Figure 2

Screen shot of the Homology section of the mec-14 Gene page showing InParanoid groups, C.briggsae orthologs and a TreeFam tree.

WORMBASE IS COMMUNITY DRIVEN

The focus and direction of WormBase is driven by the user community. Two ways this occurs is through user surveys and the WormBase advisory board. User surveys are commissioned periodically to gauge user concerns and direct future efforts for data curation. The most recent survey was conducted in the fall of 2005 (). The results were discussed at the WormBase advisory board in November 2005 and summarized in the January 2006 WormBase Newsletter (). One of the concerns of many of the respondents was the speed of WormBase. The Advisory Board felt improving response time should be a major priority, and WormBase responded with the new hardware structure, additional servers and page caching described above. The WormBase Advisory board meets once a year with the entire WormBase staff to review progress and prioritize projects for the upcoming year. The board consists of several members of the C.elegans research community, as well as people with expertise in bioinformatics and databases. WormBase also reaches out to the user community by having staff members at all the international, regional and topical C.elegans meetings, as well as other meetings, often presenting talks and posters, but mainly to receive input from users regarding problems as well as new features or datasets they would like to see WormBase host. Some requests are for information for which WormBase simply does not have access, such as large-scale studies before publication. WormBase strives to have data available at the time of publication and encourages authors to contact WormBase about upcoming publications. The WormBase Wiki site () provides a mechanism for users to easily add content to WormBase. Individuals can post experimental protocols, make posting about meetings and job openings, or add useful information about their favorite gene(s). A long-term goal is for the Wiki site to assume the role of the print version of the Worm Breeder's Gazette.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

13 in total

1. Creating the gene ontology resource: design and implementation.

Authors:
Journal: Genome Res Date: 2001-08 Impact factor: 9.043

2. Genome annotation by high-throughput 5' RNA end determination.

Authors: Byung Joon Hwang; Hans-Michael Müller; Paul W Sternberg
Journal: Proc Natl Acad Sci U S A Date: 2004-02-02 Impact factor: 11.205

3. The generic genome browser: a building block for a model organism system database.

Authors: Lincoln D Stein; Christopher Mungall; ShengQiang Shu; Michael Caudy; Marco Mangone; Allen Day; Elizabeth Nickerson; Jason E Stajich; Todd W Harris; Adrian Arva; Suzanna Lewis
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

4. TargetDB: a target registration database for structural genomics projects.

Authors: Li Chen; Rose Oughtred; Helen M Berman; John Westbrook
Journal: Bioinformatics Date: 2004-05-06 Impact factor: 6.937

5. Scriptable access to the Caenorhabditis elegans genome sequence and other ACEDB databases.

Authors: L D Stein; J Thierry-Mieg
Journal: Genome Res Date: 1998-12 Impact factor: 9.043

6. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons.

Authors: M Remm; C E Storm; E L Sonnhammer
Journal: J Mol Biol Date: 2001-12-14 Impact factor: 5.469

Review 7. ACeDB and macace.

Authors: F H Eeckman; R Durbin
Journal: Methods Cell Biol Date: 1995 Impact factor: 1.441

8. WormBase: a multi-species resource for nematode biology and genomics.

Authors: Todd W Harris; Nansheng Chen; Fiona Cunningham; Marcela Tello-Ruiz; Igor Antoshechkin; Carol Bastiani; Tamberlyn Bieri; Darin Blasiar; Keith Bradnam; Juancarlos Chan; Chao-Kung Chen; Wen J Chen; Paul Davis; Eimear Kenny; Ranjana Kishore; Daniel Lawson; Raymond Lee; Hans-Michael Muller; Cecilia Nakamura; Philip Ozersky; Andrei Petcherski; Anthony Rogers; Aniko Sabo; Erich M Schwarz; Kimberly Van Auken; Qinghua Wang; Richard Durbin; John Spieth; Paul W Sternberg; Lincoln D Stein
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

9. The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics.

Authors: Lincoln D Stein; Zhirong Bao; Darin Blasiar; Thomas Blumenthal; Michael R Brent; Nansheng Chen; Asif Chinwalla; Laura Clarke; Chris Clee; Avril Coghlan; Alan Coulson; Peter D'Eustachio; David H A Fitch; Lucinda A Fulton; Robert E Fulton; Sam Griffiths-Jones; Todd W Harris; LaDeana W Hillier; Ravi Kamath; Patricia E Kuwabara; Elaine R Mardis; Marco A Marra; Tracie L Miner; Patrick Minx; James C Mullikin; Robert W Plumb; Jane Rogers; Jacqueline E Schein; Marc Sohrmann; John Spieth; Jason E Stajich; C Wei; David Willey; Richard K Wilson; Richard Durbin; Robert H Waterston
Journal: PLoS Biol Date: 2003-11-17 Impact factor: 8.029

10. TreeFam: a curated database of phylogenetic trees of animal gene families.

Authors: Heng Li; Avril Coghlan; Jue Ruan; Lachlan James Coin; Jean-Karim Hériché; Lara Osmotherly; Ruiqiang Li; Tao Liu; Zhang Zhang; Lars Bolund; Gane Ka-Shu Wong; Weimou Zheng; Paramvir Dehal; Jun Wang; Richard Durbin
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

50 in total

Review 1. In silico characterization of proteins: UniProt, InterPro and Integr8.

Authors: Nicola Jane Mulder; Paul Kersey; Manuela Pruess; Rolf Apweiler
Journal: Mol Biotechnol Date: 2007-10-04 Impact factor: 2.695

2. RBR ubiquitin ligases: Diversification and streamlining in animal lineages.

Authors: Ignacio Marín
Journal: J Mol Evol Date: 2009-06-13 Impact factor: 2.395

3. Comparison of diverse developmental transcriptomes reveals that coexpression of gene neighbors is not evolutionarily conserved.

Authors: Itai Yanai; Craig P Hunter
Journal: Genome Res Date: 2009-09-10 Impact factor: 9.043