
The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes.

Y Lee, J Tsai, S Sunkara, S Karamycheva, G Pertea, R Sultana, V Antonescu, A Chan, F Cheung, J Quackenbush.

Abstract

Although the list of completed genome sequencing projects has expanded rapidly, sequencing and analysis of expressed sequence tags (ESTs) remain a primary tool for discovery of novel genes in many eukaryotes and a key element in genome annotation. The TIGR Gene Indices (http://www.tigr.org/tdb/tgi) are a collection of 77 species-specific databases that use a highly refined protocol to analyze gene and EST sequences in an attempt to identify and characterize expressed transcripts and to present them on the Web in a user-friendly, consistent fashion. A Gene Index database is constructed for each selected organism by first clustering, then assembling EST and annotated cDNA and gene sequences from GenBank. This process produces a set of unique, high-fidelity virtual transcripts, or tentative consensus (TC) sequences. The TC sequences can be used to provide putative genes with functional annotation, to link the transcripts to genetic and physical maps, to provide links to orthologous and paralogous genes, and as a resource for comparative and functional genomic analysis.

Year:  2005        PMID: 15608288      PMCID: PMC540018          DOI: 10.1093/nar/gki064

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

The TIGR Gene Index databases (TGI) (http://www.tigr.org/tdb/tgi) are constructed using all publicly available expressed sequence tag (EST) and known gene sequence data stored in GenBank for each target species. Sequences are first cleaned to identify and remove contaminating sequences, including vector, adaptor, mitochondrial, ribosomal and chimeric sequences. The cleaned sequences are then searched pairwise against each other and grouped into clusters based on shared sequence similarity. The clusters are assembled at high stringency to produce tentative consensus (TC) sequences. The virtual transcripts represented in the TCs are annotated using a variety of tools for open reading frame (ORF) prediction, single nucleotide polymorphism (SNP) prediction, long oligo prediction for microarrays, putative annotation using a controlled vocabulary, and Gene Ontology (GO) and Enzyme Commission (EC) number assignments, and are mapped onto complete or draft genomes or available genetic maps. The TCs are used to construct a variety of other databases, including the Eukaryotic Gene Orthologs (EGO) database and RESOURCERER, a database that annotates and cross-references microarray resources for plants and animals. At present, 77 species are represented in the Gene Index databases, including 29 animals, 25 plants, 8 fungi and 15 protists; this includes most species for which public EST projects have released more than 50 000 ESTs. Current release information for each species-specific database is summarized in Table 1. Individual databases are updated and released three times yearly, on February 1, June 1 and October 1, if the number of available ESTs for that species has increased by either 25 000 or >10%, whichever is less.
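The update criterion stated above amounts to a simple threshold check. The following is a minimal sketch only: the function name and arguments are hypothetical, and it assumes that "whichever is less" means a release is triggered as soon as either threshold is met.

```python
def needs_new_release(previous_est_count: int, current_est_count: int) -> bool:
    """Return True if EST growth satisfies the stated release criterion.

    Assumed reading of the text: release when the increase is at least
    25 000 sequences OR strictly more than 10% of the previous count,
    i.e. whichever threshold is reached first.
    """
    increase = current_est_count - previous_est_count
    reached_absolute = increase >= 25_000
    # Integer arithmetic avoids floating-point rounding at the 10% boundary.
    reached_relative = increase * 10 > previous_est_count
    return reached_absolute or reached_relative
```

For a small index of 100 000 ESTs the 10% rule dominates, while for a large index of 1 000 000 ESTs the 25 000 rule is reached first.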
Table 1.

Summary of the current release of TIGR Gene Indices (TGI)

Species | Species_name | TGI | TCs, ETs, EST
Animals (29)
Human | Homo sapiens | HGI 15.0 | 221 41819 740594 468
Mouse | Mus musculus | MGI 14.0 | 167 6947499602 312
Rat | Rattus norvegicus | RGI 13.0 | 56 933213187 992
Cattle | Bos Taurus | BtGI 10.0 | 38 76041356 644
Pig | Sus scrofa | SsGI 9.0 | 33 96351950 376
Dog | Canis familiaris | DogGI 4.0 | 661368411 506
Chicken | Gallus gallus | GgGI 8.0 | 42 98884872 941
Frog | Xenopus laevis | XGI 9.0 | 39 72462637 249
Zebrafish | Danio rerio | ZGI 15.0 | 32 88939553 940
Catfish | Ictalurus punctatus | Cfgi 5.0 | 325415616 694
R.trout | Oncorhynchus mykiss | RtGI 4.0 | 23 13519027 448
A.salmon | Salmo salar | AsGI 2.1 | 12 2779318 971
C.intestinalis | Ciona intestinalis | CinGI 3.0 | 20 6163930 690
Medaka | Oryzias latipes | OlGI 5.0 | 12 84917113 669
Fugu | Takifugu rubripes | FGI 1.0 | 31204487667
A.burtoni | Astatotilapia burtoni | AbGI 1.0 | 402152300
H.chilotes | Haplochromis chilotes | HchGI 1.0 | 214704030
H.red_tail_sheller | Haplochromis sp. ‘rts’ | HsGI 1.0 | 188304422
Killifish | Fundulus heteroclitus | FhGI 1.0 | 35405711 941
Honeybee | Apis mellifera | AMGI 4.0 | 3700537571
A.aegypti | Aedes aegypti | AeGI 4.0 | 15 888325075
Drosophila | Drosophila melanogaster | DGI 9.0 | 20 69311046662
Mosquito | Anopheles gambiae | AgGI 7.0 | 17 120684714 940
A.variegatum | Amblyomma variegatum | AvGI 2.0 | 47801631
R.appendic | Rhipicephalus appendiculatus | RaGI 1.0 | 2543194797
C.elegans | Caenorhabditis elegans | CeGI 8.0 | 17 72850345678
B.malayi | Brugia malayi | BmGI 4.0 | 2060446841
O.volvulus | Onchocerca volvulus | OvGI 3.0 | 1065232942
S.mansoni | Schistosoma mansoni | SmGI 5.0 | 12 9123920 753
Plants (25)
Pine | Pinus | PGI 4.0 | 13 62220517 944
Cocoa | Theobroma cacao | TcaGI 1.0 | 754261759
Cotton | Gossypium | CGI 5.0 | 681214217 396
Arabidopsis | Arabidopsis thaliana | AtGI 11.0 | 28 010518812 485
L.japonicus | Lotus japonicus | LjGI 3.0 | 12 4855615 919
Lettuce | Lactuca sativa | LsGI 2.0 | 79615614 168
Sunflower | Helianthus annuus | HaGI 3.0 | 603811014 372
Tomato | Lycopersicon esculentum | LeGI 9.0 | 20 53016414 923
Pepper | Capsicum annuum | CaGI 1.0 | 3203477462
Potato | Solanum tuberosum | StGI 9.0 | 19 22510213 226
Tobacco | Nicotiana tabacum | NtGI 1.0 | 8978068529
N.benthamiana | Nicotiana benthamiana | NbGI 1.1 | 3819443735
Soybean | Glycine max | GmGI 12.0 | 30 08414137 601
Medicago | Medicago truncatula | MtGI 7.0 | 17 6102519 341
Ice_plant | Mesembryanthemum crystalline | McGI 4.0 | 2851475557
Grape | Vitis vinifera | VvGI 3.1 | 13 218549837
Rice | Oryza sativa | OsGI 15.0 | 33 08917 77637 900
Maize | Zea mays | ZmGI 14.0 | 29 41452426 426
Wheat | Triticum aestivum | TaGI 8.0 | 44 63016979 008
Sorghum | Sorghum bicolor | SbGI 8.0 | 20 02914318 976
Barley | Hordeum vulgare | HvGI 9.0 | 21 98116827 041
S.cereale | Secale cereale | RyeGI 3.0 | 1391663890
S.officinarum | Saccharum officinarum | SoGI 1.0 | 23 596772 281
A.cepa | Allium cepa | OnGI 1.0 | 3838187870
C.reinhardtii | Chlamydomonas reinhardtii | ChrGI 4.0 | 10 7779619 466
Fungi (8)
A.flavus | Aspergillus flavus | AfGI 4.0 | 3749103459
C.posadasii | Coccidioides posadasii | CpoGI 2.0 | 627503037
S.cerevisiae | Saccharomyces cerevisiae | ScGI 3.0 | 41072005198
S.pombe | Schizosaccharomyces pombe | SpGI 3.0 | 24492974510
Cryptococcus | Filobasidiella neoformans | CrGI 7.0 | 2384593231
N.crassa | Neurospora crassa | NcrGI 3.0 | 438965471586
A.nidulans | Aspergillus nidulans | AnGI 4.0 | 353266642904
M.grisea | Magnaporthe grisea | MgGI 5.0 | 637561958320
Protists (15)
P.berghei | Plasmodium berghei | PbGI 5.0 | 1168413980
P.falciparum | Plasmodium falciparum | PfGI 7.0 | 397824873142
P.vivax | Plasmodium vivax | PvGI 0.5 | 158175567
P.yoelii | Plasmodium yoelii | PyGI 5.0 | 361137842418
E.tenella | Eimeria tenella | EtGI 4.0 | 2077293066
T.gondii | Toxoplasma gondii | TgGI 6.0 | 69773111 401
N.caninum | Neospora caninum | NcGI 5.0 | 198033715
S.neurona | Sarcocystis neurona | SnGI 4.0 | 66501644
C.parvum | Cryptosporidium parvum | CpGI 4.0 | 171485254
T.vaginalis | Trichomonas vaginalis | TvGI 1.0 | 87109704
Leishmania | Leishmania | LshGI 4.0 | 60014541120
T.cruzi | Trypanosoma cruzi | TcGI 4.0 | 21891644749
T.brucei | Trypanosoma brucei | TbGI 5.0 | 73412872018
D.discoideum | Dictyostelium discoideum | DdGI 4.0 | 68261726392
T.thermophila | Tetrahymena thermophila | TtGI 3.0 | 14361652626

TIGR Gene Indices are a collection of species-based databases that assemble ESTs and Expressed Transcripts (ETs) into TC sequences. Singletons (sET and sEST) are the ET and EST sequences that are not incorporated into a TC during assembly. TCs, sETs and sESTs together make up the unique sequences in TGI. There are 77 gene indices in total (data as of September 1, 2004). Each line lists the species, species name, gene index name and version, total number of TCs in the current release, number of singleton ETs and number of singleton ESTs. For Leishmania, pine and cotton, the ESTs were pooled from dbEST for the genus rather than for a single species. The table groups the 77 gene indices into animals (29), plants (25), fungi (8) and protists (15).

RECENT DEVELOPMENTS

Construction of the Gene Indices

The process used to assemble each Gene Index is similar to that described previously (1–3), although some modifications have been made to improve its efficiency and accuracy. mgBLAST, a modified version of the Megablast (4) program, is now used for the pairwise sequence comparisons that define the sequence clusters used for assembly. For large clusters containing hundreds or thousands of sequences (e.g. highly expressed genes such as actin), sequence representation is reduced prior to assembly using a variety of multilayer approaches, including transitive clustering, containment clustering and seeded clustering with known genes. Following clustering, the Paracel Transcript Assembler (PTA), a modified version of the CAP3 assembly program (5), is used to assemble each TC. TGICL, an open-source set of software tools that embodies this process, is available (http://www.tigr.org/tdb/tgi/software) along with other open-source utilities for users interested in performing a similar analysis on their own datasets (6).
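As an illustration of transitive clustering, the sketch below groups sequences into clusters whenever they are connected by a chain of pairwise hits. It is not the TGICL implementation: the function name, the input format (a list of identifier pairs that already passed some similarity cutoff) and the use of a union-find structure are all assumptions made for the example.

```python
from collections import defaultdict

def transitive_clusters(sequence_ids, pairwise_hits):
    """Group sequences linked directly or indirectly by pairwise hits.

    sequence_ids: iterable of identifiers (hypothetical names).
    pairwise_hits: iterable of (id_a, id_b) pairs assumed to have passed
    a similarity threshold in an earlier search step.
    """
    sequence_ids = list(sequence_ids)
    parent = {seq_id: seq_id for seq_id in sequence_ids}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairwise_hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(list)
    for seq_id in sequence_ids:
        clusters[find(seq_id)].append(seq_id)
    return sorted(sorted(members) for members in clusters.values())


example = transitive_clusters(
    ["est1", "est2", "est3", "est4", "est5", "et1"],
    [("est1", "est2"), ("est2", "et1"), ("est3", "est4")],
)
print(example)
```

Here est1 and et1 share no direct hit but fall into the same cluster through est2, while est5 remains a singleton.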

New features of the TC report

The central elements of the TGI databases are the TC sequences and the TC reports that are presented through the project website. Each TC report presents a summary of the assembly and annotation process, including the consensus TC sequence in FASTA format with a history from previous builds in the header, a map showing component EST and gene sequences, and a table providing links to the primary sequences, putative annotation, an expression summary based on the number of ESTs from various libraries, genomic locations and links to tentative orthologs in EGO. Since the last presentation of the TGI databases in Nucleic Acids Research, several new features have been added to the TC report. Putative polyadenylation signals are identified and shaded in the consensus sequence, and putative poly(A/T) trimming sites are shown in the sequence map for each of the component ESTs. Potential ORFs are predicted for each TC using a variety of software tools including the NCBI ORF Finder, ESTScan (7) and FrameFinder; predicted ORFs can be searched against a variety of databases using WU-BLAST. Assembly of the TCs can result in incorrect orientations for the consensus, and an attempt is now made to determine the proper orientation using the annotated direction of component gene and EST sequences as well as BLAST search results. Putative SNP sites are found by analyzing the multiple sequence alignments that are produced in the assembly stage; putative SNPs are reported only if a variant is found in multiple sequences from independent libraries. Unique 70mers are predicted for each TC using OligoPicker (8). GO terms and KEGG metabolic pathways are provided for each TC based on protein database searches. Where possible, TCs are aligned with draft genomes and displayed using TGIviewer, gbrowse, EnsEMBL and the UCSC genome viewers.
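The SNP reporting rule described above (a variant must appear in multiple sequences from independent libraries) can be sketched as a per-column filter over an alignment. This is an illustrative sketch, not the TGI code: the input format, the function name and the default thresholds of two reads and two libraries per allele are assumptions.

```python
from collections import defaultdict

def putative_snps(aligned_reads, min_reads=2, min_libraries=2):
    """Report alignment columns with at least two well-supported bases.

    aligned_reads: list of (library_id, aligned_sequence) tuples; all
    sequences are assumed to have equal length, with '-' marking gaps
    or uncovered positions.
    A base counts as supported when it is seen in at least `min_reads`
    reads drawn from at least `min_libraries` distinct libraries.
    """
    if not aligned_reads:
        return []
    length = len(aligned_reads[0][1])
    snps = []
    for pos in range(length):
        reads_per_base = defaultdict(int)
        libraries_per_base = defaultdict(set)
        for library, sequence in aligned_reads:
            base = sequence[pos].upper()
            if base in "ACGT":
                reads_per_base[base] += 1
                libraries_per_base[base].add(library)
        supported = sorted(
            base for base in reads_per_base
            if reads_per_base[base] >= min_reads
            and len(libraries_per_base[base]) >= min_libraries
        )
        if len(supported) >= 2:
            snps.append((pos, supported))
    return snps


reads = [
    ("libA", "ACGTA"),
    ("libA", "ACGTA"),
    ("libB", "ACTTA"),
    ("libC", "ACTTA"),
    ("libB", "ACGTC"),
]
print(putative_snps(reads))
```

In this example the G/T difference at column 2 is reported because both bases are seen in two or more reads from two or more libraries, whereas the single C at column 4 is treated as a likely sequencing error.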

New databases and tools

The EGO (http://www.tigr.org/tdb/tgi/ego) (9) database, previously known as TIGR Orthologous Gene Alignments (TOGA), uses pairwise sequence similarity searches and a transitive, reciprocal closure process to identify Tentative Ortholog Groups (TOGs) in eukaryotes. EGO has expanded its representation to include all 77 species represented in the TGI, and TOGs have been cross-referenced to the Online Mendelian Inheritance in Man (OMIM) (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM) database of human disease genes. RESOURCERER (10) provides annotation based on the TIGR Gene Indices for widely available microarray resources in human, mouse, rat, zebrafish and Xenopus, including widely used clone sets and Affymetrix GeneChips™ as well as a variety of other sequence-based resources such as RefSeq. RESOURCERER provides a wide range of annotation and integration with genomic and other resources, including gene name assignments, GO term and EC number assignments, chromosomal localization, integration with genetic and quantitative trait locus (QTL) maps, ortholog identification, lists of relevant abstracts in PubMed and promoter region identification. Owing to its integration with the TGI and EGO, RESOURCERER also provides links between microarray platforms both within and between species. Users can also submit a list of GenBank accessions corresponding to their microarray databases for annotation and functional analysis. A plant-specific version, Plant RESOURCERER, was released in September 2004 with microarray resources from Arabidopsis, potato, tomato, maize and rice. Genomic maps align TCs to available complete or draft genomes, including human, mouse, rat, zebrafish, fly, worm, Fugu, mosquito, Arabidopsis, yeast, fission yeast and rice. These alignments can also be viewed using either TGIviewer or gbrowse or through a number of distributed annotation system (DAS) viewers (11), including one developed at TIGR.
Each Gene Index also includes graphical metabolic pathway maps linked to TCs associated with specific pathways through GO term and EC number annotation. Comparisons between TCs are also used to identify putative alternative splice forms based on shared blocks of sequence similarity.
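The transitive, reciprocal closure used to build TOGs in EGO can be pictured as two steps: keep only pairs of sequences that are each other's best hit across species, then merge pairs that share a member. The sketch below is one possible reading of that idea, not the EGO pipeline; the function name, the input dictionaries and the identifiers are hypothetical.

```python
def tentative_ortholog_groups(species_of, best_hit):
    """Group sequences connected by chains of reciprocal best hits.

    species_of: dict mapping sequence id -> species name.
    best_hit: dict mapping (query_id, target_species) -> id of the
    best-scoring sequence found in that species (assumed precomputed
    from pairwise similarity searches).
    """
    neighbours = {seq_id: set() for seq_id in species_of}
    for (query, _target_species), hit in best_hit.items():
        # Keep the pair only if the hit's best match in the query's
        # species is the query itself (reciprocal best hit).
        if best_hit.get((hit, species_of[query])) == query:
            neighbours[query].add(hit)
            neighbours[hit].add(query)

    groups, seen = [], set()
    for start in sorted(neighbours):
        if start in seen or not neighbours[start]:
            continue
        # Depth-first traversal collects the transitive closure.
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(group))
    return groups


species_of = {
    "TC_h1": "human", "TC_h2": "human",
    "TC_m1": "mouse", "TC_m2": "mouse",
    "TC_r1": "rat",
}
best_hit = {
    ("TC_h1", "mouse"): "TC_m1", ("TC_m1", "human"): "TC_h1",
    ("TC_m1", "rat"): "TC_r1", ("TC_r1", "mouse"): "TC_m1",
    ("TC_h2", "mouse"): "TC_m2", ("TC_m2", "human"): "TC_h1",
}
print(tentative_ortholog_groups(species_of, best_hit))
```

The human and rat sequences end up in one group through their shared mouse partner, while TC_h2 and TC_m2 are excluded because their best hits are not reciprocal.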

Using the TIGR Gene Indices

There are many ways in which users can access the TIGR Gene Index databases. Nucleotide or protein sequences can be searched using WU-BLAST against individual TGI databases, EGO or pre-selected classes of species, such as animals or plants. The TGI can be searched using unique identifiers (GB and TC Accessions, EST identifiers and ET numbers from the TIGR PREEGAD database), gene product names, functional classifications based on GO terms, metabolic pathways, library-related expression analysis, map position within various sequenced genomes, TOGs in the EGO database and alternative splice forms. Complete annotations for all of the ESTs and TCs in each TGI database are now also provided through the EST Annotator and TC Annotator features which provide comprehensive lists of sequences within each species-specific database. All of the TIGR Gene Indices are available for download through the main page for each species. Downloads consist of six files, including a FASTA file for all unique sequences, the TC list, the component ESTs in each TC, GO analysis, predicted oligos and a README file.

Software

Many of the software tools used to create the TGI are available with source code to the research community through the TGI software tools website (http://www.tigr.org/tdb/tgi/software). The TGI Clustering tool (TGICL) (6) is a software system for fast clustering and assembly of large EST datasets. TGICL starts with a large multi-FASTA file (and an optional quality value file) and outputs the assemblies produced by CAP3 (5). Both the clustering and assembly phases can be parallelized by distributing the searches and the assembly jobs across multiple CPUs, as TGICL can take advantage of either SMP or PVM (Parallel Virtual Machine) clusters. Other available software includes clview, for viewing sequence assemblies in .ace format; SeqClean, which removes contaminating sequences from EST and gene sequences; and cdbfasta/cdbyank, which index FASTA-formatted files and can be used to rapidly extract sequences from them.
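The idea behind indexing a FASTA file for rapid extraction can be shown with a byte-offset index. The sketch below illustrates the general technique only; it does not reproduce the cdbfasta index format or the cdbyank command-line interface, and the function names are hypothetical.

```python
import io

def build_fasta_index(handle):
    """Map each record identifier to the byte offset of its header line.

    handle: a binary file-like object positioned at the start of a
    FASTA file. The identifier is taken to be the first
    whitespace-delimited token after '>'.
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                index[fields[0].decode("ascii")] = offset
    return index

def fetch_record(handle, index, identifier):
    """Seek straight to a record and return it as text, header included."""
    handle.seek(index[identifier])
    lines = [handle.readline()]
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break
        lines.append(line)
    return b"".join(lines).decode("ascii")


fasta = io.BytesIO(b">TC1 example\nACGT\nAC\n>TC2\nGGCC\n")
index = build_fasta_index(fasta)
print(fetch_record(fasta, index, "TC2"))
```

Because the index stores offsets rather than sequences, a lookup needs only one seek and a short read, regardless of where the record sits in the file.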

REFERENCES

1.  CAP3: A DNA sequence assembly program.

Authors:  X Huang; A Madan
Journal:  Genome Res       Date:  1999-09       Impact factor: 9.043

2.  ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences.

Authors:  C Iseli; C V Jongeneel; P Bucher
Journal:  Proc Int Conf Intell Syst Mol Biol       Date:  1999

3.  A greedy algorithm for aligning DNA sequences.

Authors:  Z Zhang; S Schwartz; L Wagner; W Miller
Journal:  J Comput Biol       Date:  2000 Feb-Apr       Impact factor: 1.479

4.  An optimized protocol for analysis of EST sequences.

Authors:  F Liang; I Holt; G Pertea; S Karamycheva; S L Salzberg; J Quackenbush
Journal:  Nucleic Acids Res       Date:  2000-09-15       Impact factor: 16.971

5.  TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets.

Authors:  Geo Pertea; Xiaoqiu Huang; Feng Liang; Valentin Antonescu; Razvan Sultana; Svetlana Karamycheva; Yuandan Lee; Joseph White; Foo Cheung; Babak Parvizi; Jennifer Tsai; John Quackenbush
Journal:  Bioinformatics       Date:  2003-03-22       Impact factor: 6.937

6.  The TIGR gene indices: reconstruction and representation of expressed gene sequences.

Authors:  J Quackenbush; F Liang; I Holt; G Pertea; J Upton
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

7.  The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species.

Authors:  J Quackenbush; J Cho; D Lee; F Liang; I Holt; S Karamycheva; B Parvizi; G Pertea; R Sultana; J White
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

8.  Selection of oligonucleotide probes for protein coding sequences.

Authors:  Xiaowei Wang; Brian Seed
Journal:  Bioinformatics       Date:  2003-05-01       Impact factor: 6.937

9.  RESOURCERER: a database for annotating and linking microarray resources within and across species.

Authors:  J Tsai; R Sultana; Y Lee; G Pertea; S Karamycheva; V Antonescu; J Cho; B Parvizi; F Cheung; J Quackenbush
Journal:  Genome Biol       Date:  2001       Impact factor: 13.583

10.  The distributed annotation system.

Authors:  R D Dowell; R M Jokerst; A Day; S R Eddy; L Stein
Journal:  BMC Bioinformatics       Date:  2001-10-10       Impact factor: 3.169

