
PlantGDB: a resource for comparative plant genomics.

Jon Duvick, Ann Fu, Usha Muppirala, Mukul Sabharwal, Matthew D Wilkerson, Carolyn J Lawrence, Carol Lushbough, Volker Brendel.

Abstract

PlantGDB (http://www.plantgdb.org/) is a genomics database encompassing sequence data for green plants (Viridiplantae). PlantGDB provides annotated transcript assemblies for >100 plant species, with transcripts mapped to their cognate genomic context where available, integrated with a variety of sequence analysis tools and web services. For 14 plant species with emerging or complete genome sequence, PlantGDB's genome browsers (xGDB) serve as a graphical interface for viewing, evaluating and annotating transcript and protein alignments to chromosome or bacterial artificial chromosome (BAC)-based genome assemblies. Annotation is facilitated by the integrated yrGATE module for community curation of gene models. Novel web services at PlantGDB include Tracembler, an iterative alignment tool that generates contigs from GenBank trace file data and BioExtract Server, a web-based server for executing custom sequence analysis workflows. PlantGDB also hosts a plant genomics research outreach portal (PGROP) that facilitates access to a large number of resources for research and training.


Year: 2007 PMID: 18063570 PMCID: PMC2238959 DOI: 10.1093/nar/gkm1041

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

PlantGDB serves the plant research community by providing access to plant sequence data as well as a variety of sequence and genome analysis tools in a single online resource [(1,2); Table 1]. This update outlines recent developments at PlantGDB that have expanded its usefulness as a tool for comparative genomics. Key features include: expanded EST assemblies; new genome browsers for a larger number of species; overnight annotation of emerging genome sequences; and novel tools for sequence retrieval and analysis, including an innovative system for the creation and management of workflows that integrates database queries, linked web services, and local tools.

Table 1.

Sequence resources and analytical tools available at PlantGDB

In this table, sequence resources are divided into four categories: Uploaded Sequence, Assembled Sequence, Genome Browsers, and Other Tools. For each resource in column 1, the species available, source/version, current sequence count, update frequency, tool/services, download options, and alignment to genome are shown in adjacent cells. Web links for both external and internal data/tool sources are indicated with superscript numbers and are listed at the end of the table under Web Resources. Sequence counts displayed here are as of 30 October 2007.


DATABASE FEATURES AND ADDITIONS

Plant sequence data and transcript assemblies

PlantGDB periodically uploads and parses all Viridiplantae sequences from GenBank (3) and Uniprot (4) into ∼70 000 individual data sets according to species or subspecies origin (Figure 1). PlantGDB's sequence data are refreshed approximately every 4 months, coinciding with alternate bimonthly GenBank version releases. GenBank and Uniprot sequences are uploaded, parsed by (sub)species, indexed for BLAST (5) and GeneSeqer (6) analysis, and loaded into MySQL tables. For all species with >10 000 published transcripts, a non-redundant set of PlantGDB-generated Unique Transcripts (PUTs) is generated using a custom assembly pipeline (http://www.plantgdb.org/prj/ESTCluster/PUT_procedure.php). PUTs are aligned to UniProt entries using BLAST, and the best matches (if any) and UniProt-associated Gene Ontology (GO) annotations (7) are stored. Currently, 116 species have PUT assemblies at PlantGDB, spanning diverse taxonomic groups (Figure 2). Users can track assembly progress in PlantGDB at http://www.plantgdb.org/prj/ESTCluster/progress.php. PlantGDB also provides genome survey sequence (GSS) assemblies for maize and sorghum (http://www.plantgdb.org/prj/GSSAssembly/). All PlantGDB sequence data (raw and processed) are available for download in a variety of file formats at http://www.plantgdb.org/download/download.php. In addition, all Zea mays sequence data at PlantGDB are uploaded monthly to MaizeGDB, the central repository for maize genetic information (http://www.maizegdb.org) (8).
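The species-level partitioning and the >10 000-transcript rule for PUT assembly described above can be illustrated with a minimal sketch. It is not PlantGDB's pipeline: the header layout (`>accession|species|description`) and the function names `species_of`, `count_by_species` and `eligible_for_puts` are assumptions made for illustration only.

```python
from collections import Counter

# From the text: species with >10 000 published transcripts receive PUT assemblies.
PUT_THRESHOLD = 10_000


def species_of(header: str) -> str:
    """Extract the species from a hypothetical header '>accession|species|description'."""
    return header.lstrip(">").split("|")[1].strip()


def count_by_species(headers):
    """Count transcript records per species."""
    return Counter(species_of(h) for h in headers)


def eligible_for_puts(counts, threshold=PUT_THRESHOLD):
    """Return the species whose transcript count strictly exceeds the threshold."""
    return sorted(sp for sp, n in counts.items() if n > threshold)


# Toy input; real data sets would hold many thousands of records per species.
headers = [
    ">A1|Zea mays|EST clone 1",
    ">A2|Zea mays|EST clone 2",
    ">A3|Oryza sativa|cDNA clone 1",
]
counts = count_by_species(headers)
```

With the toy input, lowering the threshold to 1 selects only Zea mays, whereas the default threshold selects nothing.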

Figure 1.

Database schema for PlantGDB, showing data sources, update frequency, computation and web services. PlantGDB is accessible at http://www.plantgdb.org, and genome browsers are accessible at http://www.plantgdb.org/XxGDB, where Xx is the first letter of the genus and species (e.g. AtGDB = Arabidopsis thaliana genome database).

Figure 2.

Transcript assemblies (PUTs) at PlantGDB, grouped by taxonomic affiliation. Sequence totals displayed here are as of 30 October 2007. Parentheses indicate the number of species/subspecies per genus. Genera highlighted in yellow are associated with a genome browser at PlantGDB; an underscore indicates chromosome-based genome browsers. An asterisk designates genera for which PlantGDB provides preprocessed GeneSeqer indices for quick access to spliced alignments.

Query and analysis tools

Data housed at PlantGDB are stored in MySQL tables and can be queried by accession number, GI number or text search. TableMaker, a search module for querying and retrieving PlantGDB's GenBank data in tabular format, described previously (1), has been expanded to include a new query wizard to simplify the search process for users not familiar with GenBank data models (http://www.bioextract.org/genbank/home/index.jsp). For sequence similarity searching, a batch NCBI-BLAST tool is available for querying any combination of plant species data sets with up to 100 query sequences at a time (http://www.plantgdb.org/PlantGDB-cgi/blast/PlantGDBblast). For specialized queries, PatternSearch (http://www.plantgdb.org/PlantGDB-cgi/vmatch/patternsearch.pl) interrogates the database for relatively short matches possibly interspersed with mismatches and indels, and ProbeMatch (http://www.plantgdb.org/PlantGDB-cgi/prj/PLEXdb/ProbeMatch.pl) allows users to match sequence to array probes and link to array probe databases. PlantGDB also provides online access to GeneSeqer alignment software, allowing the user to calculate spliced alignments of expressed transcripts to a target genomic sequence, as described previously (1) (http://www.plantgdb.org/PlantGDB-cgi/GeneSeqer/PlantGDBgs.cgi). Currently, transcript data sets (EST, cDNA, PUT) from 50 species are preprocessed at PlantGDB to allow rapid online GeneSeqer analysis, and a range of splice site models and alignment parameters can be specified at runtime. PlantGDB provides sequence analysis tools that automate processes that normally require iterative searching or tedious parsing of information. Tracembler (http://www.plantgdb.org/tool/tracembler/) allows the user to do chromosome walks with pre-assembly trace data by performing an iterated search of NCBI's trace archives with a seed sequence (9).
Tracembler invokes CAP3 (10) to assemble a contig sequence from one or more automated rounds of BLAST analysis and displays a pairwise alignment between contig and seed sequence. MuSeqBox (http://www.plantgdb.org/MuSeqBox/MuSeqBox.html) is an online tool for generating tabular output from multiple BLAST queries, based on user-specified thresholds (11). The tool also contains algorithms for detecting potential alternate splicing and full-length transcripts. MuSeqBox provides pre-computed Uniprot BLASTx data sets for maize, sorghum, barley, Arabidopsis and rice PUTs, allowing the user to generate filtered, tabulated output. Alternatively, the user can upload a custom BLAST output file.
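Threshold-based tabulation of BLAST results, as described for MuSeqBox, can be sketched as follows. This is not MuSeqBox's algorithm or interface: it assumes the standard 12-column BLAST tabular output (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score), and the function name `filter_hits` and the default thresholds are hypothetical.

```python
import csv
import io


def filter_hits(tabular_text, min_identity=90.0, max_evalue=1e-10):
    """Keep hits from 12-column BLAST tabular output that pass both thresholds.

    Returns a list of (query, subject, percent_identity, evalue) tuples.
    The threshold defaults are illustrative, not values taken from MuSeqBox.
    """
    kept = []
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        # Skip blank lines and comment lines emitted by some BLAST output modes.
        if not row or row[0].startswith("#"):
            continue
        identity = float(row[2])   # column 3: percent identity
        evalue = float(row[10])    # column 11: E-value
        if identity >= min_identity and evalue <= max_evalue:
            kept.append((row[0], row[1], identity, evalue))
    return kept


# Two made-up hits: the first passes both thresholds, the second fails both.
sample = (
    "PUT1\tP12345\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-50\t300\n"
    "PUT2\tP67890\t60.0\t150\t60\t2\t1\t150\t5\t154\t1e-5\t80\n"
)
hits = filter_hits(sample)
```

Relaxing both thresholds, for example `filter_hits(sample, min_identity=50.0, max_evalue=1e-3)`, retains both rows.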

Bioinformatics workflow tools

A bioinformatics research project may utilize a variety of query and computational tools, some web-based and others local to the user, which may be carried out in serial fashion along with parsing and formatting for input or display. There is a growing need for systems that can integrate disparate tools and workflows in a way that automates the process of input/output, computation and documentation. The PlantGDB-associated BioExtract Server (http://www.bioextract.org/login/Login.html) addresses this need by providing researchers with a web interface that allows them to query sequence databases, analyze data with web-based as well as local bioinformatics tools, save results and create and manage workflows using a directed acyclic graph (DAG) model (Lushbough,C., Bergman,M.K., Lawrence,C.J., Jennewein,D. and Brendel,V. BioExtract server – an integrated system to access and analyze heterogeneous, distributed biomolecular data. Submitted for publication.). As a simple example, a user could develop a workflow that performs a BLAST search, retrieves peptide sequences from query results, eliminates redundant sequences and produces a multiple sequence alignment output. BioExtract workflows can be paused, modified, saved, shared with an online workgroup or the world, and documented electronically for future reference.
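To make the DAG model concrete, the simple workflow mentioned above (BLAST search, peptide retrieval, redundancy removal, multiple alignment) can be sketched with Python's standard `graphlib` module. This is an illustration only, not the BioExtract Server implementation: the step functions are stand-ins that manipulate toy strings, and all names (`run_workflow`, `STEPS`, `DEPENDENCIES`) are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(steps, dependencies):
    """Run each step after all of its upstream steps, in topological order.

    steps:        step name -> callable taking a dict of upstream results
    dependencies: step name -> set of upstream step names (the DAG edges)
    """
    results = {}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = steps[name](upstream)
    return order, results


# Stand-in steps; a real workflow would call BLAST, a database and an aligner.
STEPS = {
    "blast_search": lambda up: ["seqA", "seqB", "seqA"],
    "retrieve_peptides": lambda up: [s.upper() for s in up["blast_search"]],
    "remove_redundant": lambda up: sorted(set(up["retrieve_peptides"])),
    "align": lambda up: "|".join(up["remove_redundant"]),
}

DEPENDENCIES = {
    "blast_search": set(),
    "retrieve_peptides": {"blast_search"},
    "remove_redundant": {"retrieve_peptides"},
    "align": {"remove_redundant"},
}

order, results = run_workflow(STEPS, DEPENDENCIES)
```

Representing the workflow as a dependency map keeps each step independent of execution order, which is one reason a DAG lends itself to pausing, modifying and re-running individual steps.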

Genome browsers

PlantGDB provides genome browsers (xGDB) for 14 plant species whose genomes have been completely or partially sequenced (12). AtGDB, OsGDB, MtGDB, SbGDB, VvGDB and PtGDB are chromosome-based genome browsers for Arabidopsis thaliana (thale cress), Oryza sativa (rice), Medicago truncatula (barrel medic), Sorghum bicolor (sorghum), Vitis vinifera (grapevine) and Populus trichocarpa (western balsam poplar), respectively, while ZmGDB, GmGDB, HvGDB, SlGDB, GhGDB, TaGDB, BrGDB and LjGDB are BAC-based browsers for Zea mays (corn or maize), Glycine max (soybean), Hordeum vulgare (barley), Solanum lycopersicum (tomato), Gossypium hirsutum (cotton), Triticum aestivum (bread wheat), Brassica rapa (field mustard) and Lotus japonicus, respectively (http://www.plantgdb.org/prj/Genome_browser.php). The xGDB browser Context View (Figure 3B) displays current gene model annotation together with high quality, cognate and non-cognate GeneSeqer alignments of ESTs, cDNAs, and PUTs to genomic sequence. Similarly, Oryza sativa predicted polypeptides from TIGR (http://rice.tigr.org/tdb/e2k1/osa1/) (13) and/or Arabidopsis thaliana predicted polypeptides from TAIR (http://www.arabidopsis.org/) (14) are splice-aligned to genomic sequence using GenomeThreader (15) and displayed in the same window. For species with microarray probe sequence, these are downloaded from the microarray database at PLEXdb (16), aligned relative to PUT assemblies and displayed. Significantly, users can view spliced alignments in a genomic region at the nucleotide level and also retrieve quality scores and provenance information for any spliced alignment displayed at xGDB. Additional sequence alignments, including GSS contigs and repeat masked regions, are displayed for some genomes. A subset of xGDB annotation data are accessible through the Distributed Annotation System (DAS) (http://www.biodas.org/). 
These data can be downloaded for further analysis, or alternatively imported into another genome browser capable of importing DAS formatted data.
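The browser names listed above follow a simple pattern: the first letter of the genus, the first letter of the species, then "GDB", served under http://www.plantgdb.org/XxGDB. A minimal sketch of that naming rule follows; the helper names `xgdb_name` and `xgdb_url` are hypothetical, and the functions only build names and URLs without checking that a browser exists for a given species.

```python
def xgdb_name(binomial: str) -> str:
    """Build the XxGDB browser name from a 'Genus species' binomial."""
    genus, species = binomial.split()[:2]
    return f"{genus[0].upper()}{species[0].lower()}GDB"


def xgdb_url(binomial: str, base: str = "http://www.plantgdb.org") -> str:
    """Build the browser URL following the http://www.plantgdb.org/XxGDB pattern."""
    return f"{base}/{xgdb_name(binomial)}"


names = [xgdb_name(s) for s in ("Arabidopsis thaliana", "Zea mays", "Solanum lycopersicum")]
```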

Figure 3.

Screenshots from ZmGDB and yrGATE illustrate the use of online tools for gene discovery and community gene annotation. (A) A web-accessible table of Z. mays BACs (alternately shaded) displaying (left to right) the BAC GI, BAC clone name, followed by the ID, start/end coordinates and functional annotation of splice-aligned TIGR-predicted proteins from O. sativa and finally the ZmGDB entry date. All fields are searchable and each row is linked via column 1 to a genome browser view of the BAC region. This table is currently updated daily at ZmGDB (http://www.plantgdb.org/ZmGDB/DisplayGeneAnn.php). (Similar tables are available for seven other BAC-based xGDB browsers.) Note that a region of BAC GI 156523432 is aligned to three paralogous rice predicted polypeptides, annotated as ‘autophagy-related protein 8 precursor’. Clicking on the BAC GI ‘156523432’ in table column 1 (circled) brings up a BAC/Clone Context View of the specified region (B), showing spliced alignments to the rice predicted polypeptides (black), along with other alignment data, in this case maize cDNAs (blue) and maize ESTs (red). Note the evidence for alternative splicing among the maize ESTs (circles) suggesting at least two alternate transcripts (labeled 1 and 2). The user has the option to explore and annotate this variation using yrGATE. (C) Launching the yrGATE annotation tool displays a scrolling list of evidence scores and supporting exons for all exon coordinates at a locus (alternative splice coordinates for 1 and 2 are circled). The user can build a complete gene model on screen by selecting each desired exon and then compare the resulting open reading frame to known proteins using BLAST (data not shown). (D) The chosen gene model is displayed graphically and will be published on the ZmGDB browser following curation by PlantGDB staff. Shown here are yrGATE models for the two putative splice variants, with translation start/stop positions indicated by triangles.
(E) Predicted protein sequence for the two yrGATE gene models. This example illustrates how xGDB and yrGATE can be used to identify and publish gene model predictions quickly and easily, enhancing the community genome knowledge base for maize as well as facilitating hypothesis-driven research.

Community annotation

Although excellent tools are available for defining genic regions and variant transcript forms from evidence-based data as well as ab initio prediction, models can often be improved further by human-curated annotation. PlantGDB's yrGATE (http://www.plantgdb.org/prj/yrGATE/) is a recently developed tool for community annotation of gene models that is integrated with PlantGDB's xGDB genome browsers (17). From a single browser window the user can rapidly evaluate a selected region for intron/exon structures based on any combination of EST/cDNA evidence and ab initio prediction, compare the model with known proteins via GenBank BLAST, and submit the annotation for review and publication on the genome browser. To assist in the identification of gene models in need of annotation, the Genome Annotation Evaluation (GAEVAL) module generates quality scores for gene structure predictions and classifies cases of incongruence of the annotation with experimental evidence (http://www.plantgdb.org/AtGDB-chtml/gaeval/). The yrGATE tool is available for both BAC and chromosome-based xGDB browsers and is being used to communicate evidence-based gene models to the A. thaliana genome database, TAIR (18). Figure 3 shows an example of how yrGATE can be used, together with xGDB's annotation tables and genome browser, to identify and annotate potential splicing variants for a gene of interest in maize.

Pipelines for genome annotation

Genome browsers at PlantGDB are refreshed on a timetable that depends on the pace of accumulation of new genomic or transcript sequence data or assemblies for the respective species. New spliced alignments are calculated for ESTs, cDNAs and PUTs as well as for other sequence types (where available) and data are uploaded. To match the rapid pace with which some genomes are being sequenced, PlantGDB staff have developed and implemented an automated genome data pipeline for species with rapidly expanding sequence data, using Zea mays as an initial example. In 2007, new maize BAC sequences began to be deposited in GenBank at the rate of over 60 BACs or ∼10 Mb of sequence per day (http://www.maizesequence.org). In addition, there is a growing catalog of transposable element-tagged maize genomic sequence in GenBank, facilitating reverse genetics in maize (http://www.plantgdb.org/prj/AcDsTagging/) (19) as well as a large repository of EST sequence-derived PUTs. PlantGDB's daily Z. mays pipeline downloads and processes all new maize BACs with transcript, protein, microarray probe, transposon insertion tag and other genomic alignments, and displays the cumulative output for all BACs in ZmGDB (the xGDB browser for Z. mays; http://www.plantgdb.org/ZmGDB/) within 12 h. The pipeline also updates BLAST and sequence download resources daily. Significantly, the pipeline also generates a browsable, searchable, tabular output of rice gene models and putatively transposon-tagged genes for the entire BAC data set (http://www.plantgdb.org/ZmGDB/DisplayGeneAnn.php), providing a powerful and timely gene discovery tool for researchers (Figure 3). This effort represents an early implementation of a real-time, high-throughput, discovery-oriented annotation process using automated workflows.

Outreach

The Plant Genome Research Outreach Portal (PGROP) at PlantGDB serves as a gateway to online plant genomics resources as well as a repository of outreach content, serving the needs of a wide-ranging audience from high school through postgraduate (20). Users can query for resources or add a resource using simple online tools, and query results are ranked via algorithms that highlight the most popular resources.

Future directions

PlantGDB will continue to provide comprehensive and up-to-date plant sequence information online and available for download. As additional genomes become available, xGDB browsers will be expanded, with additional annotations contemplated for certain species [e.g. tracks for transcription factor binding sites and conserved non-coding sequences (21)]. Also planned are additional comparative genomics tools such as SynBrowse (22), expanded DAS import and export, and the development of quantitative (e.g. quality score) and qualitative (e.g. library) filters for spliced alignments. Expanded help and tutorial sections are also under development.

CONCLUSIONS

PlantGDB has expanded greatly in scope since 2004, providing today a wide range of data sets, query methods and analysis tools for researchers interested in comparative plant genomics or gene discovery research. The site aims to complement other, more specialized plant genome sites by providing comprehensive plant sequence data as well as a suite of tools and genome browsers that emphasize spliced alignment of cognate and non-cognate transcripts and similar protein sequences. PlantGDB also addresses the need for timely access to, and processing of, high-volume informatics data through use of automated daily data pipelines (e.g. maize BAC pipeline) and online workflow tools (e.g. BioExtract Server and Tracembler). With the yrGATE community annotation tool, PlantGDB facilitates the sharing of user-generated gene annotation information across the entire plant research community.

20 in total

1. CAP3: A DNA sequence assembly program.

Authors: X Huang; A Madan
Journal: Genome Res Date: 1999-09 Impact factor: 9.043

2. Multi-query sequence BLAST output examination with MuSeqBox.

Authors: L Xing; V Brendel
Journal: Bioinformatics Date: 2001-08 Impact factor: 6.937

3. The Gene Ontology (GO) database and informatics resource.

Authors: M A Harris; J Clark; A Ireland; J Lomax; M Ashburner; R Foulger; K Eilbeck; S Lewis; B Marshall; C Mungall; J Richter; G M Rubin; J A Blake; C Bult; M Dolan; H Drabkin; J T Eppig; D P Hill; L Ni; M Ringwald; R Balakrishnan; J M Cherry; K R Christie; M C Costanzo; S S Dwight; S Engel; D G Fisk; J E Hirschman; E L Hong; R S Nash; A Sethuraman; C L Theesfeld; D Botstein; K Dolinski; B Feierbach; T Berardini; S Mundodi; S Y Rhee; R Apweiler; D Barrell; E Camon; E Dimmer; V Lee; R Chisholm; P Gaudet; W Kibbe; R Kishore; E M Schwarz; P Sternberg; M Gwinn; L Hannick; J Wortman; M Berriman; V Wood; N de la Cruz; P Tonellato; P Jaiswal; T Seigfried; R White
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

4. PlantGDB, plant genome database and analysis tools.

Authors: Qunfeng Dong; Shannon D Schlueter; Volker Brendel
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

5. Plant genome research outreach portal. A gateway to plant genome research "outreach" programs and activities.

Authors: Sanford B Baran; Carolyn J Lawrence; Volker Brendel
Journal: Plant Physiol Date: 2004-03 Impact factor: 8.340

6. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

7. Community-based gene structure annotation.

Authors: Shannon D Schlueter; Matthew D Wilkerson; Eva Huala; Seung Y Rhee; Volker Brendel
Journal: Trends Plant Sci Date: 2005-01 Impact factor: 18.313

8. Gene structure prediction from consensus spliced alignment of multiple ESTs matching the same genomic locus.

Authors: Volker Brendel; Liqun Xing; Wei Zhu
Journal: Bioinformatics Date: 2004-02-05 Impact factor: 6.937

9. BarleyBase--an expression profiling database for plant genomics.

Authors: Lishuang Shen; Jian Gong; Rico A Caldo; Dan Nettleton; Dianne Cook; Roger P Wise; Julie A Dickerson
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. Tracembler--software for in-silico chromosome walking in unassembled genomes.

Authors: Qunfeng Dong; Matthew D Wilkerson; Volker Brendel
Journal: BMC Bioinformatics Date: 2007-05-09 Impact factor: 3.169
