EGassembler: online bioinformatics service for large-scale processing, clustering and assembling ESTs and genomic DNA fragments.

Ali Masoudi-Nejad¹, Koichiro Tonomura, Shuichi Kawashima, Yuki Moriya, Masanori Suzuki, Masumi Itoh, Minoru Kanehisa, Takashi Endo, Susumu Goto.

Abstract

Expressed sequence tag (EST) sequencing has proven to be an economically feasible alternative for gene discovery in species lacking a draft genome sequence. Ongoing large-scale EST sequencing projects need bioinformatics tools that facilitate uniform EST handling, which gives renewed importance to a universal tool for processing and functional annotation of large sets of ESTs. EGassembler (http://egassembler.hgc.jp/) is a web server that provides an automated as well as a user-customized analysis tool for cleaning, repeat masking, vector trimming, organelle masking, clustering and assembling of ESTs and genomic fragments. The web server is publicly available and provides the community with a unique all-in-one online application for large-scale clustering and assembly of ESTs and genomic DNA. Because it runs on a Sun Fire 15K supercomputer, a large volume of data can be processed in a short period of time. The results can be used to functionally annotate genes, to facilitate splice alignment analysis, to link transcripts to genetic and physical maps, to design microarray chips, to perform transcriptome analysis and to map to KEGG metabolic pathways. The service offers wet-lab research groups a bioinformatics tool and offers bioinformatics researchers an all-in-one tool for sequence handling.

Year: 2006 PMID: 16845049 PMCID: PMC1538775 DOI: 10.1093/nar/gkl066

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Expressed sequence tags (ESTs) are partial sequences of expressed genes prepared by reverse transcribing mRNA and cloning the cDNA fragments into a plasmid (1). ESTs have proven to be an extremely valuable resource for high-throughput gene discovery and for annotating draft genomes, because they provide sequence information to identify novel genes, gene locations and intron–exon boundaries within genomic sequence assemblies (2,3). EST sequencing has also proven to be an economically feasible alternative for gene discovery in species lacking a draft genome sequence. It is a cost-effective way to survey the expressed portions of the genome, especially in plants with extremely large genomes (e.g. 16 000 Mbp in wheat). Large-scale EST sequencing projects, which are conducted by consortia of laboratories, require bioinformatics tools to facilitate the uniform handling of ESTs. The importance of EST clustering and assembly is well established, as evidenced by the number of databases currently available, such as the TIGR gene indices (4), STACK (5) and UniGene (6). This proliferation of online resources demonstrates the need for a universal tool for processing and functional annotation of large sets of ESTs. EGassembler is a single web server that provides an automated as well as a user-customized analysis tool for cleaning, repeat masking, vector trimming, organelle masking and assembling of ESTs and genomic fragments. It is also designed to serve as a stand-alone web application for each of these processes.

DESCRIPTION

EGassembler consists of a pipeline of five components, each using highly reliable open-source tools (4,7–9), together with non-redundant custom-made databases of repeats (EGrep) and vectors (EGvec) that cover almost all publicly available vector and repeat databases. EGrep is a non-redundant repeats database covering the latest release of RepBase (10), TREP (11), the TIGR plant repeats (12) and thousands of other repeat sequences publicly available on the Internet. EGrep was constructed by combining the repetitive elements and assembling them into a single database using the PHRAP and CAP3 assembly programs. EGvec was made by assembling NCBI's UniVec, EMBL's emvec vector/adaptor library and other vector sequences using the CAP3 program. Figure 1 shows a flow chart of the EGassembler process.

The web server accepts any type of DNA sequence in FASTA format (EST, GSS, cDNA, gDNA). The sequence cleaning process involves basic procedures such as removing the polyA/polyT tail, clipping low-quality ends (ends rich in undetermined bases) and discarding sequences that are too short (shorter than 100 bases) or that appear to be mostly low-complexity sequence. The repeat masking process compares the query sequences against one or more files of FASTA sequences (the library for masking). Vectors and organelles are masked using the program Cross_Match (9), a general-purpose utility for comparing any two sets of DNA sequences. It is used to compare the query sequences with a set of vector or organelle sequences and to produce vector/organelle-masked versions of the input sequences. The sequence assembling process uses the CAP3 program (7) for clustering and assembling the sequences into contigs and singletons. CAP3 assembles ESTs from the same gene under more stringent criteria than other approaches, and is able to distinguish gene family members while tolerating sequencing errors.
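The sequence cleaning step described above can be illustrated with a minimal Python sketch. This is not EGassembler's actual code: only the 100-base minimum length comes from the text, while the poly-A/poly-T run length, the window size and the N-fraction threshold are assumed values chosen for illustration, and the low-complexity filter is left out.

```python
import re

MIN_LENGTH = 100       # from the text: sequences shorter than 100 are discarded
MIN_TAIL = 10          # assumption: minimum poly-A/poly-T run treated as a tail
WINDOW = 20            # assumption: window size used when clipping ends
MAX_N_FRACTION = 0.2   # assumption: an end window above this N fraction is clipped


def trim_poly_tail(seq):
    """Remove a trailing poly-A run and a leading poly-T run."""
    seq = re.sub(r"A{%d,}$" % MIN_TAIL, "", seq)
    seq = re.sub(r"^T{%d,}" % MIN_TAIL, "", seq)
    return seq


def clip_n_rich_ends(seq):
    """Drop whole windows from either end while they are rich in N."""
    def n_rich(window):
        return window.count("N") / len(window) > MAX_N_FRACTION

    while len(seq) >= WINDOW and n_rich(seq[:WINDOW]):
        seq = seq[WINDOW:]
    while len(seq) >= WINDOW and n_rich(seq[-WINDOW:]):
        seq = seq[:-WINDOW]
    return seq


def clean(seq):
    """Return the cleaned sequence, or None if it should be discarded."""
    seq = clip_n_rich_ends(seq.upper())
    seq = trim_poly_tail(seq)
    if len(seq) < MIN_LENGTH:
        return None
    return seq
```

For example, `clean("ACGT" * 30 + "A" * 15)` returns the 120-base body without the tail, whereas a 40-base read is discarded and `clean` returns `None`.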

Figure 1

EGassembler data flow. The flowchart shows the pipeline used in the EGassembler web server. The middle portion shows the processes and their running modes (parallel or single). The right side shows the action of each process, and the left side shows the databases used by each process for masking.

All of the processes in the pipeline, except the assembling step, run in parallel using all CPU resources available on the server. Those programs that were originally written as serial programs, using only one CPU, are now executed in parallel by implementation of a new algorithm using the Perl thread module. This implementation is especially valuable for trimming the vector and masking the organelle sequences. Using the original program on a single CPU required several days depending on input sequences, but now it takes only a few hours. Figure 2 shows a diagram of the EGassembler performance under different loads.
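The general idea of running an originally serial program in parallel can be sketched as follows. The server itself uses the Perl thread module; this Python version is only an illustration under assumptions: the input is split into contiguous chunks, one worker handles each chunk, and the results are joined in the original order. The function `mask_chunk` and its `vector` motif are hypothetical placeholders for a real serial job such as one Cross_Match run.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal parts."""
    size, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            chunks.append(records[start:end])
        start = end
    return chunks


def mask_chunk(chunk, vector="GAATTC"):
    """Placeholder serial job: mask a hypothetical vector motif with X."""
    return [(name, seq.replace(vector, "X" * len(vector))) for name, seq in chunk]


def run_parallel(records, n_workers=4):
    """Run the serial job on every chunk concurrently, keeping input order."""
    chunks = split_into_chunks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(mask_chunk, chunks))  # map preserves chunk order
    return [record for chunk in results for record in chunk]
```

In a real deployment each worker would more likely launch the external program as a separate process on its chunk; pure-Python work inside threads, as in this placeholder, would not by itself use several CPUs.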

Figure 2

EGassembler performance. The large plot shows the EGassembler performance under different sequence loads and different numbers of CPUs. The inset displays the performance with ≤8000 sequences.

The main menu on the EGassembler interface has three sub-menus, which provide users with the following processing options.

One-click assembling

All the components in the pipeline are run consecutively with their default options. After uploading the sequences and choosing the libraries for trimming and masking, users can obtain the assembly results in one click. The results of all steps can be downloaded through URL addresses, both as a single zipped file and as separate files for each step. These URL addresses remain valid for one week after completion.

Step-by-step assembling

Users run all the components outlined in the pipeline interactively and have the opportunity to run each of them with advanced options. The output of each step is automatically used as the input to the next step of the pipeline; users can also jump into any step at any time using the previous results.

Stand-alone processing

Users can use each of the components alone with all options available. The web interface displays the default parameters of the original programs, any of which users can change for each program.
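The relationship between the three modes can be made concrete with a small sketch. It assumes only what the text states: five components run in a fixed order, each output feeds the next step, and a user may run everything, resume from any step, or run one step alone. The step names, the `run` helper and the `min_overlap` option are hypothetical, and each step is a placeholder that merely records that it ran.

```python
# Hypothetical names for the five components, in pipeline order.
STEP_NAMES = ["clean", "mask_repeats", "trim_vector", "mask_organelle", "assemble"]


def make_step(name):
    """Build a placeholder step that records its name and options."""
    def step(data, **options):
        return {"sequences": list(data["sequences"]),
                "history": data["history"] + [(name, options)]}
    return step


PIPELINE = [(name, make_step(name)) for name in STEP_NAMES]


def new_job(sequences):
    """Wrap the uploaded sequences together with an empty processing history."""
    return {"sequences": list(sequences), "history": []}


def run(data, start=STEP_NAMES[0], stop=STEP_NAMES[-1], options=None):
    """Run steps from start through stop, feeding each output to the next."""
    options = options or {}
    first, last = STEP_NAMES.index(start), STEP_NAMES.index(stop)
    for name, step in PIPELINE[first:last + 1]:
        data = step(data, **options.get(name, {}))
    return data


def one_click(sequences):
    """One-click mode: every step, default options."""
    return run(new_job(sequences))


def stand_alone(sequences, name, **options):
    """Stand-alone mode: a single step with user-chosen options."""
    return run(new_job(sequences), start=name, stop=name, options={name: options})
```

Step-by-step mode corresponds to calling `run` repeatedly with explicit `start` and `stop` values, for example `run(run(new_job(seqs), stop="mask_repeats"), start="trim_vector")`, which reuses the earlier results when resuming.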

APPLICATION

Using the One-Click Assembling option, we used the EGassembler server to analyze 386 515 rice ESTs deposited in NCBI's dbEST database. By searching the Nucleotide database of GenBank with the term ‘oryza sativa AND gbdiv_est[PROP]’, all the deposited ESTs (386 515) were downloaded in FASTA format and used as the input file. Of the 386 515 EST sequences, 125 404 reads were trimmed and 11 553 sequences were discarded by the sequence cleaning process. The repeat masking process identified 345 SINE, 83 LINE, 273 Copia and 1668 Gypsy elements, belonging to the retroelements group, and 191 hobo-activator, 1581 TC1-pogo, 398 En-Spm, 268 MuDr and 951 Tourist elements, all belonging to the DNA transposons group. The numbers of simple sequence repeats and low-complexity elements were 43 293 and 23 412, respectively. In total, 5 216 297 bp of repetitive elements were masked, about 2.7% of the query sequence. Vector and organelle sequence matches were found in 17 980 (1 300 270 bp, 0.65%) and 2958 (1 064 453 bp, 0.54%) sequences, respectively. CAP3 assembly resulted in 73 555 singletons (reads that are not used in the assembly) and 25 193 contigs. The EGENES database of KEGG (release 34.0, April 2005), a transcriptome-based plant database of genes with metabolic pathway information, has also been developed using the pipeline described here.
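The figures reported above allow a few back-of-the-envelope checks, sketched here in Python. Only the input numbers come from the text; the derived quantities are not reported by the authors. In particular, the reads-per-contig estimate assumes that every sequence retained after cleaning entered the assembly, which the text does not state.

```python
# Figures quoted in the text.
total_ests = 386_515
discarded = 11_553
singletons = 73_555
contigs = 25_193

# (masked bp, reported percentage of the query sequence)
masked = {
    "repeats": (5_216_297, 2.7),
    "vector": (1_300_270, 0.65),
    "organelle": (1_064_453, 0.54),
}

# Sequences remaining after the cleaning step.
retained = total_ests - discarded

# Contigs plus singletons: the number of putative unique sequences.
unique_sequences = singletons + contigs

# Total query size implied by each (bp, percent) pair.
implied_total_bp = {name: bp / (pct / 100) for name, (bp, pct) in masked.items()}

# Assumption: all retained reads that are not singletons ended up in contigs.
reads_in_contigs = retained - singletons
reads_per_contig = reads_in_contigs / contigs

if __name__ == "__main__":
    print("retained after cleaning:", retained)
    print("contigs + singletons:", unique_sequences)
    for name, bp in implied_total_bp.items():
        print(f"implied query size from {name}: {bp / 1e6:.1f} Mbp")
    print(f"reads per contig (under the stated assumption): {reads_per_contig:.1f}")
```

The three implied query sizes fall between about 193 and 200 Mbp, or roughly 500 bp per EST, so the reported percentages are mutually consistent to within rounding.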

IMPLEMENTATION

EGassembler is written in Perl CGI and uses suites of open-source programs. The web server runs on a Sun Fire 15K supercomputer located in the Human Genome Center at the University of Tokyo. While processing, the web server refreshes the results page every 30 s for small data sets (fewer than 1000 sequences). For larger data sets, it instead provides a hyperlink for downloading the results. A user manual for each program and a tutorial are available on the web server to provide assistance with using the interface.

FUTURE PLANS

Recently, many new algorithms have been introduced for sequence clustering that provide more flexibility and improvements for large-scale projects (13–15). We are planning to validate and use new algorithms to improve the quality of the pipeline on this web server. In the near future, there will be an option for users who want to annotate their assembly results based on the KEGG pathway database (16). The assembly results, including contigs and singletons, will be mapped to the pathways in KEGG by transferring them to another server for automatic functional annotation based on KEGG (KAAS). We will also continue collecting new repeat and vector sequences from public resources to enrich our custom databases for filtering the sequences.

14 in total

1. Computer-based methods for the mouse full-length cDNA encyclopedia: real-time sequence clustering for construction of a nonredundant cDNA library.

Authors: H Konno; Y Fukunishi; K Shibata; M Itoh; P Carninci; Y Sugahara; Y Hayashizaki
Journal: Genome Res Date: 2001-02 Impact factor: 9.043

2. STACK: Sequence Tag Alignment and Consensus Knowledgebase.

Authors: A Christoffels; A van Gelder; G Greyling; R Miller; T Hide; W Hide
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

3. Shotgun sequencing of the human transcriptome with ORF expressed sequence tags.

Authors: E Dias Neto; R G Correa; S Verjovski-Almeida; M R Briones; M A Nagai; W da Silva; M A Zago; S Bordin; F F Costa; G H Goldman; A F Carvalho; A Matsukuma; G S Baia; D H Simpson; A Brunstein; P S de Oliveira; P Bucher; C V Jongeneel; M J O'Hare; F Soares; R R Brentani; L F Reis; S J de Souza; A J Simpson
Journal: Proc Natl Acad Sci U S A Date: 2000-03-28 Impact factor: 11.205

4. Repbase update: a database and an electronic journal of repetitive elements.

Authors: J Jurka
Journal: Trends Genet Date: 2000-09 Impact factor: 11.639

5. The TIGR Plant Repeat Databases: a collective resource for the identification of repetitive sequences in plants.

Authors: Shu Ouyang; C Robin Buell
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

6. Complementary DNA sequencing: expressed sequence tags and human genome project.

Authors: M D Adams; J M Kelley; J D Gocayne; M Dubnick; M H Polymeropoulos; H Xiao; C R Merril; A Wu; B Olde; R F Moreno
Journal: Science Date: 1991-06-21 Impact factor: 47.728

7. Base-calling of automated sequencer traces using phred. I. Accuracy assessment.

Authors: B Ewing; L Hillier; M C Wendl; P Green
Journal: Genome Res Date: 1998-03 Impact factor: 9.043

8. Generation and analysis of 280,000 human expressed sequence tags.

Authors: L D Hillier; G Lennon; M Becker; M F Bonaldo; B Chiapelli; S Chissoe; N Dietrich; T DuBuque; A Favello; W Gish; M Hawkins; M Hultman; T Kucaba; M Lacy; M Le; N Le; E Mardis; B Moore; M Morris; J Parsons; C Prange; L Rifkin; T Rohlfing; K Schellenberg; M Bento Soares; F Tan; J Thierry-Meg; E Trevaskis; K Underwood; P Wohldman; R Waterston; R Wilson; M Marra
Journal: Genome Res Date: 1996-09 Impact factor: 9.043

9. Database resources of the National Center for Biotechnology.

Authors: David L Wheeler; Deanna M Church; Scott Federhen; Alex E Lash; Thomas L Madden; Joan U Pontius; Gregory D Schuler; Lynn M Schriml; Edwin Sequeira; Tatiana A Tatusova; Lukas Wagner
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

10. From genomics to chemical genomics: new developments in KEGG.

Authors: Minoru Kanehisa; Susumu Goto; Masahiro Hattori; Kiyoko F Aoki-Kinoshita; Masumi Itoh; Shuichi Kawashima; Toshiaki Katayama; Michihiro Araki; Mika Hirakawa
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971
