TcruziDB: an integrated, post-genomics community resource for Trypanosoma cruzi.

Fernán Agüero, Wenlong Zheng, D Brent Weatherly, Pablo Mendes, Jessica C Kissinger.

Abstract

TcruziDB (http://TcruziDB.org) is an integrated post-genomics database for the parasitic organism Trypanosoma cruzi, the causative agent of Chagas' disease. TcruziDB was established in 2003 as a flat-file database with tools for mining the unannotated sequence reads and preliminary contig assemblies emerging from the Tri-Tryp genome consortium (TIGR/SBRI/Karolinska). Today, TcruziDB houses the recently published assembled genomic contigs and annotation provided by the genome consortium in a relational database supported by the Genomics Unified Schema (GUS) architecture. The combination of an annotated genome and a relational architecture has facilitated the integration of genomic data with expression data (proteomic and EST) and permitted the construction of automated analysis pipelines. TcruziDB has accepted, and will continue to accept, the deposition of genomic and functional genomic datasets contributed by the research community.

Substances: Protozoan Proteins

Year: 2006 PMID: 16381904 PMCID: PMC1347470 DOI: 10.1093/nar/gkj108

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Trypanosoma cruzi is the causative agent of American trypanosomiasis (Chagas' disease), for which there is no definitive chemotherapeutic treatment. The parasite has a complex life cycle, with four main stages occurring in two hosts. In the insect host, T.cruzi is found in the form of epimastigotes and metacyclic trypomastigotes. In the vertebrate host, it is found in the form of bloodstream trypomastigotes and intracellular amastigotes. Based on a number of polymorphic markers, it has been shown that T.cruzi strains can be classified into two defined subgroups, Tcruzi I and II. Furthermore, strains that belong to the Tcruzi II group are more heterogeneous and can be further separated into distinct subgroups ranging from IIa to IIe. The T.cruzi genome sequence was generated using a whole-genome shotgun (WGS) approach. The sequence generation, assembly and annotation were performed by researchers who are part of an international consortium, the TSK-TSC, comprising The Institute for Genomic Research (TIGR, USA), the Seattle Biomedical Research Institute (SBRI, USA) and the Karolinska Institute (KI, Sweden) (1). The sequenced strain, CL-Brener, is a hybrid derived from an ancient hybridization event between parental Tcruzi IIb and IIc subgroups (1). The current T.cruzi genome assembly consists of 32 746 contigs totaling ∼89 Mb. Of these, 4008 contigs represent the majority of the coding portion of the genome; the remaining contigs represent smaller regions consisting primarily of repeats. Annotation for the 4008 contigs provided by the TSK-TSC (1) is represented in TcruziDB. The remaining contig sequences were not annotated, but they are available for searches within the database. TcruziDB was established before the genome sequencing project was completed to serve as a resource for the research community (2). In 2003, TcruziDB provided access to unassembled shotgun reads and preliminary contig assemblies.
Since that time, the genome sequence has been published (1) together with a whole-genome proteomic analysis covering the four main life cycle stages of the parasite (3). Also during this period, new EST sequences were deposited in GenBank, adding EST coverage to the two life-cycle stages that occur in the mammalian host (4,5). In this paper, we report on the status of TcruziDB, and highlight new database architectural features and datasets that have been added by the genome consortium and the T.cruzi research community.

DATA INVENTORY UPDATES

Version 4.0 of TcruziDB, released in July 2005, contained the recently published genome sequence and annotation of T.cruzi strain CL-Brener (1) as submitted by the sequencing consortium (TSK-TSC Genome Release v5.0), together with proteomic data contributed by Dr R. Tarleton (3) and EST data available in public repositories (Table 1). The annotated genome sequence data consist of 32 746 contigs, 4008 of which contain coding regions that were annotated. In the 5.0 TSK-TSC genome sequence release, 19 613 protein coding sequences and 3603 pseudogenes were predicted, and all are represented within TcruziDB.

Table 1

Available EST data

Library	Stage	Strain	ESTs	Observations
TEN (a)	Epimastigote	CL-Brener	9761	Normalized
TEU (b)	Epimastigote	CL-Brener	255	Non-normalized
Tomoo	Epimastigote	Y	37
	Epimastigote/metacyclic trypomastigotes	Dm28c	175	Differential display
TcAma	Amastigote	Tulahuen	968	Non-normalized
TcAM	Amastigote	CL-Brener	1269	Non-normalized
TcTR	Trypomastigote	CL-Brener	1503	Non-normalized

Clustered EST statistics
Total EST sequences				13968 (100%)
Included in assemblies (>50 nt)				13250 (94.8%)
Assemblies				7201 (100%)
Singletons				4988 (69.3%)
Clusters (2-112 ESTs)				2213 (30.7%)
Assemblies with an annotated SL sequence				1537 (21.3%)

EST data obtained from GenBank were loaded into TcruziDB in the form of separate datasets, one per cDNA library.

(a) The cDNA library was sequenced and submitted to GenBank as four different clone-sets (TENF, TENG, TENS, TENU).

(b) The cDNA library was sequenced and submitted to GenBank as two different clone-sets (TEUQ, TEUF).

EST assemblies were generated with CAP4 after contaminating vector sequences were removed.
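The counts in Table 1 can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python; the numbers are copied from the table, while the dictionary keys and the `percent` helper are illustrative names, not part of TcruziDB.

```python
# EST counts per cDNA library, copied from Table 1.
# The Dm28c row has no library name in the table, so the key is a label of our own.
library_ests = {
    "TEN": 9761,
    "TEU": 255,
    "Tomoo": 37,
    "Dm28c (unnamed library)": 175,
    "TcAma": 968,
    "TcAM": 1269,
    "TcTR": 1503,
}

total_ests = sum(library_ests.values())

# Assembly counts from the "Clustered EST statistics" block of Table 1.
assemblies = 7201
singletons = 4988
clusters = 2213
with_sl = 1537  # assemblies with an annotated SL sequence


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


print("Total ESTs:", total_ests)
print("Singletons + clusters == assemblies:", singletons + clusters == assemblies)
print("Singletons (%):", percent(singletons, assemblies))
print("Clusters (%):", percent(clusters, assemblies))
print("Assemblies with SL (%):", percent(with_sl, assemblies))
```

The per-library counts sum to the reported total of 13 968 ESTs, and singletons plus clusters account for all 7201 assemblies.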

A dataset representing whole-organism proteolytic peptides was deposited in TcruziDB by members of the T.cruzi research community. This peptide dataset was updated in TcruziDB version 4.0 following the publication of the annotated genome (3). The peptides obtained from metacyclic trypomastigotes (CL strain) and amastigotes, trypomastigotes and epimastigotes (Brazil strain) were separated by multidimensional liquid chromatography and analyzed by tandem mass spectrometry (LC-MS/MS): 139 147 high mass accuracy tandem mass spectra from this analysis were matched with a confidence of >99% to 2755 proteins in the annotated T.cruzi genome (3). Pre-computed datasets organized by life-cycle stage are available for viewing and download (Figure 1D).

Figure 1

New features of TcruziDB. (A) New queries for RNA and protein expression data. (B) New gene record page displaying expression data, community comments and link to GeneDB. (C) Sample assembly/EST cluster alignment. A predicted splice leader is highlighted in blue. An assembly alignment column indicating an SNP is highlighted in red. (D) New proteomic data page. The location of identified peptides within the coding sequence is shown in red. Quality values for each observed peptide are provided.

Version 4.1 of TcruziDB, released in September 2005, is an update that adds available EST data (Table 1) to TcruziDB: 13 968 EST sequences were obtained from the NCBI GenBank (6). In most cases, individual analyses of these datasets have been published (4,5,7–10). In previous versions of TcruziDB, EST data obtained from NCBI were available for download and sequence similarity searches, but they were not integrated with available genome data. Now, ESTs are clustered into RNA transcripts (assemblies) and mapped against the genome. In addition, because the new data include ESTs derived from directionally cloned, spliced-leader based cDNA libraries (5), we were able to map the T.cruzi miniexon sequence onto the EST assemblies (Table 1).
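Once EST assemblies are mapped against the genome, a natural question is how much of an annotated gene feature a given alignment covers. The following is a minimal sketch of that calculation, assuming closed 1-based coordinates on a named contig; the `Interval` class, the contig name and the coordinates are hypothetical and do not reflect how TcruziDB or GUS stores alignments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed, 1-based interval on a contig (hypothetical representation)."""
    contig: str
    start: int
    end: int


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of bases shared by two intervals; 0 if they are on different contigs."""
    if a.contig != b.contig:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


# Toy coordinates for illustration only.
gene = Interval("contig_1", 1000, 2500)
est_alignment = Interval("contig_1", 2300, 2900)

print(overlap_length(gene, est_alignment))  # 201
```

A query threshold on overlap length then reduces to comparing this value with a user-supplied minimum.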

SYSTEM DESIGN AND IMPLEMENTATION

TcruziDB was migrated from a flat-file database to a relational database structure beginning with release 3.0 of the database. The relational schema that has been employed is version 3.x of the Genomics Unified Schema, GUS (11), and our database management system is Oracle version 10g. The Web interface to the database is produced via a Java servlet. A new home page for the database was designed to include ‘one click’ access to the most commonly used features of the database. In addition to the relational database, several additional applications are provided, such as BLAST and a variety of custom Perl scripts to facilitate data mining and text searches. Beginning with TcruziDB release 4.0, a community comment field has been added to the database. Users with additional information on the annotation, function or properties of a predicted gene or feature of the genome sequence are encouraged to submit their comments to the database via email to help@tcruzidb.org. Comments, with author attribution, will be posted in the community comment field.

ANALYSIS TOOLS

The migration of the database to a relational architecture combined with the deposition of new genomic and functional genomic datasets has greatly expanded the types of analyses that can be performed. Queries of annotated features are now available including searches by gene name, gene location, gene type (protein coding, pseudogene, rRNA, slRNA, snRNA and tRNA) and key word descriptions, e.g. ‘mucin’. Users can view the genomic context of both genes and contigs (Figure 1B) and these sequences can be retrieved with either feature IDs (e.g. 2383.t00001) or locus tags (e.g. Tc00.1047053439653.10). To facilitate internet navigation by users, a direct link from each annotated gene page in TcruziDB to the same gene as represented in GeneDB (12) is provided. The integration of proteomic and EST datasets in TcruziDB permits users to construct combined queries for T.cruzi genes based on the available evidence of expression, both at the RNA and protein levels. This is particularly important for T.cruzi, but also for other trypanosomatid genomes (for which proteomic and EST data are lacking or limited) since only ∼50% of the predicted protein coding genes have been assigned a putative function. The integration of these functional analysis datasets with the genome annotation permits users to decide if a hypothetical protein encoding gene prediction can now be called a real protein (although perhaps still of unknown function), and at the same time provides information on developmental expression of the transcript and/or protein. A link to a detailed report of the analysis data is available for each gene having expression evidence. The queries for both types of expression data are based on the mapping of either proteomic data [as described in (3)] or EST assemblies [as described in (13)] against the genome, and can be modified to select for proteins and/or genes that are expressed in different life cycle stages of T.cruzi.
In the case of proteomic data, users can additionally adjust their queries to select cases of proteins showing (i) a specified minimum coverage of the sequence with peptide mass-spectrometry data, (ii) a minimum number of peptide spectra that should match the protein and (iii) a minimum number of high-quality peptide spectra being matched (Figure 1A). EST queries permit users to select for ESTs showing (i) a user-specified length of sequence overlap between annotated gene features and EST sequences, (ii) a specified minimum length of EST sequence that is required to be aligned to the genome and (iii) expression during a particular life cycle stage. Also, new queries have been implemented to let users search for genes with experimental evidence of trans-splicing, based on the mapping of EST assemblies onto the genome and the presence of the T.cruzi spliced leader (miniexon) on these transcripts. EST assemblies can be viewed graphically with the spliced leader and/or SNPs indicated, if present (Figure 1C). In addition to the new query functions described above, traditional analysis tools such as BLAST, user-defined protein motif searches and text searching are provided.
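The three proteomic thresholds described above can be thought of as a simple filter over per-gene, per-stage evidence records. The sketch below illustrates that logic under stated assumptions: the `ProteinEvidence` record, its field names, the `select_genes` function and the toy data are all hypothetical, and the actual TcruziDB queries run against the relational database rather than in-memory Python objects.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ProteinEvidence:
    """Hypothetical summary of peptide evidence for one gene in one life cycle stage."""
    gene_id: str
    stage: str
    coverage_pct: float          # percent of the protein sequence covered by peptides
    spectra: int                 # number of matching peptide spectra
    high_quality_spectra: int    # number of matching high-quality spectra


def select_genes(
    records: Iterable[ProteinEvidence],
    stage: str,
    min_coverage_pct: float = 0.0,
    min_spectra: int = 1,
    min_high_quality: int = 0,
) -> List[str]:
    """Return sorted gene IDs whose evidence in `stage` meets all three thresholds."""
    return sorted(
        {
            r.gene_id
            for r in records
            if r.stage == stage
            and r.coverage_pct >= min_coverage_pct
            and r.spectra >= min_spectra
            and r.high_quality_spectra >= min_high_quality
        }
    )


# Toy records for illustration only; these are not real TcruziDB data.
records = [
    ProteinEvidence("gene_A", "amastigote", 32.5, 12, 4),
    ProteinEvidence("gene_B", "amastigote", 4.0, 1, 0),
    ProteinEvidence("gene_A", "epimastigote", 10.0, 3, 1),
    ProteinEvidence("gene_C", "trypomastigote", 55.0, 40, 20),
]

hits = select_genes(
    records, "amastigote", min_coverage_pct=10.0, min_spectra=2, min_high_quality=1
)
print(hits)  # ['gene_A']
```

Loosening the thresholds (for example, leaving all three at their defaults) returns every gene with any peptide evidence in the chosen stage.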

FUTURE PLANS

For the T.cruzi research community, the challenge now lies in turning the wealth of data already available into potential molecular drug and diagnostic markers. Both the new features described herein and the planned additions and improvements are intended to maintain TcruziDB as an integral bioinformatics analysis platform for the trypanosomatid research community. TcruziDB will be migrated to the most recent version of GUS, ver. 3.5, and a new Web front-end will be installed using the recently released Web Development Kit (WDK). This infrastructure enhancement will greatly improve our database regeneration time and thus reduce the time needed between database updates. Database update cycles will be improved greatly by the implementation of new automated pipelines for the routine population of the database and common analyses such as BLAST comparisons and the identification of protein features such as signal peptides, GPI-anchors, transmembrane domains and protein motifs. As new data are deposited by the community into the database, they will be added to the analysis pipeline, integrated with existing data and presented to the community as rapidly as possible. Given the availability of two other kinetoplastid genome sequences, Trypanosoma brucei (14) and Leishmania major (15), analyses of orthologous gene relationships and hyperlinks between the existing database resources, TcruziDB and GeneDB (12), are important for researchers. Currently, TcruziDB provides direct gene-to-gene hyperlinks to GeneDB. Orthologous genes as determined by the sequencing consortium are currently provided on gene pages at GeneDB. We will be adding a link from the TcruziDB gene pages to the OrthoMCL database of orthologous genes (this issue). This database contains orthologous gene determinations for 55 organisms, representing all domains of life, including T.cruzi and the other kinetoplastid organisms.
One of the largest future challenges facing the database and the T.cruzi research community is the representation and characterization of T.cruzi haplotype information. Towards this end, we will be adding the increasing wealth of data that are available for the T.cruzi Esmeraldo strain and adding haplotype designations to existing sequence data as they are determined by the sequencing consortium and the T.cruzi research community.

13 in total

1. Gene survey of the pathogenic protozoan Trypanosoma cruzi.

Authors: B M Porcel; A N Tran; M Tammi; Z Nyarady; M Rydåker; T P Urmenyi; E Rondinelli; U Pettersson; B Andersson; L Aslund
Journal: Genome Res Date: 2000-08 Impact factor: 9.043

2. GeneDB: a resource for prokaryotic and eukaryotic organisms.

Authors: Christiane Hertz-Fowler; Chris S Peacock; Valerie Wood; Martin Aslett; Arnaud Kerhornou; Paul Mooney; Adrian Tivey; Matthew Berriman; Neil Hall; Kim Rutherford; Julian Parkhill; Alasdair C Ivens; Marie-Adele Rajandream; Bart Barrell
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

3. ApiEST-DB: analyzing clustered EST data of the apicomplexan parasites.

Authors: Li Li; Jonathan Crabtree; Steve Fischer; Deborah Pinney; Christian J Stoeckert; L David Sibley; David S Roos
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

4. GenBank.

Authors: Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; David L Wheeler
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

5. Generation and analysis of expressed sequence tags from Trypanosoma cruzi trypomastigote and amastigote cDNA libraries.

Authors: Fernán Agüero; Karim Ben Abdellah; Valeria Tekiel; Daniel O Sánchez; Antonio González
Journal: Mol Biochem Parasitol Date: 2004-08 Impact factor: 1.759

6. Analysis of expressed sequence tags from Trypanosoma cruzi amastigotes.

Authors: Gustavo C Cerqueira; Wanderson D DaRocha; Priscila C Campos; Cláudia S Zouain; Santuza M R Teixeira
Journal: Mem Inst Oswaldo Cruz Date: 2005-08-17 Impact factor: 2.743

7. The genome sequence of Trypanosoma cruzi, etiologic agent of Chagas disease.

Authors: Najib M El-Sayed; Peter J Myler; Daniella C Bartholomeu; Daniel Nilsson; Gautam Aggarwal; Anh-Nhi Tran; Elodie Ghedin; Elizabeth A Worthey; Arthur L Delcher; Gaëlle Blandin; Scott J Westenberger; Elisabet Caler; Gustavo C Cerqueira; Carole Branche; Brian Haas; Atashi Anupama; Erik Arner; Lena Aslund; Philip Attipoe; Esteban Bontempi; Frédéric Bringaud; Peter Burton; Eithon Cadag; David A Campbell; Mark Carrington; Jonathan Crabtree; Hamid Darban; Jose Franco da Silveira; Pieter de Jong; Kimberly Edwards; Paul T Englund; Gholam Fazelina; Tamara Feldblyum; Marcela Ferella; Alberto Carlos Frasch; Keith Gull; David Horn; Lihua Hou; Yiting Huang; Ellen Kindlund; Michele Klingbeil; Sindy Kluge; Hean Koo; Daniela Lacerda; Mariano J Levin; Hernan Lorenzi; Tin Louie; Carlos Renato Machado; Richard McCulloch; Alan McKenna; Yumi Mizuno; Jeremy C Mottram; Siri Nelson; Stephen Ochaya; Kazutoyo Osoegawa; Grace Pai; Marilyn Parsons; Martin Pentony; Ulf Pettersson; Mihai Pop; Jose Luis Ramirez; Joel Rinta; Laura Robertson; Steven L Salzberg; Daniel O Sanchez; Amber Seyler; Reuben Sharma; Jyoti Shetty; Anjana J Simpson; Ellen Sisk; Martti T Tammi; Rick Tarleton; Santuza Teixeira; Susan Van Aken; Christy Vogt; Pauline N Ward; Bill Wickstead; Jennifer Wortman; Owen White; Claire M Fraser; Kenneth D Stuart; Björn Andersson
Journal: Science Date: 2005-07-15 Impact factor: 47.728

8. The genome of the African trypanosome Trypanosoma brucei.

Authors: Matthew Berriman; Elodie Ghedin; Christiane Hertz-Fowler; Gaëlle Blandin; Hubert Renauld; Daniella C Bartholomeu; Nicola J Lennard; Elisabet Caler; Nancy E Hamlin; Brian Haas; Ulrike Böhme; Linda Hannick; Martin A Aslett; Joshua Shallom; Lucio Marcello; Lihua Hou; Bill Wickstead; U Cecilia M Alsmark; Claire Arrowsmith; Rebecca J Atkin; Andrew J Barron; Frederic Bringaud; Karen Brooks; Mark Carrington; Inna Cherevach; Tracey-Jane Chillingworth; Carol Churcher; Louise N Clark; Craig H Corton; Ann Cronin; Rob M Davies; Jonathon Doggett; Appolinaire Djikeng; Tamara Feldblyum; Mark C Field; Audrey Fraser; Ian Goodhead; Zahra Hance; David Harper; Barbara R Harris; Heidi Hauser; Jessica Hostetler; Al Ivens; Kay Jagels; David Johnson; Justin Johnson; Kristine Jones; Arnaud X Kerhornou; Hean Koo; Natasha Larke; Scott Landfear; Christopher Larkin; Vanessa Leech; Alexandra Line; Angela Lord; Annette Macleod; Paul J Mooney; Sharon Moule; David M A Martin; Gareth W Morgan; Karen Mungall; Halina Norbertczak; Doug Ormond; Grace Pai; Chris S Peacock; Jeremy Peterson; Michael A Quail; Ester Rabbinowitsch; Marie-Adele Rajandream; Chris Reitter; Steven L Salzberg; Mandy Sanders; Seth Schobel; Sarah Sharp; Mark Simmonds; Anjana J Simpson; Luke Tallon; C Michael R Turner; Andrew Tait; Adrian R Tivey; Susan Van Aken; Danielle Walker; David Wanless; Shiliang Wang; Brian White; Owen White; Sally Whitehead; John Woodward; Jennifer Wortman; Mark D Adams; T Martin Embley; Keith Gull; Elisabetta Ullu; J David Barry; Alan H Fairlamb; Fred Opperdoes; Barclay G Barrell; John E Donelson; Neil Hall; Claire M Fraser; Sara E Melville; Najib M El-Sayed
Journal: Science Date: 2005-07-15 Impact factor: 47.728

9. TcruziDB: an integrated Trypanosoma cruzi genome resource.

10. Gene discovery through expressed sequence Tag sequencing in Trypanosoma cruzi.
