Efficient clustering of large EST data sets on parallel computers.

Anantharaman Kalyanaraman, Srinivas Aluru, Suresh Kothari, Volker Brendel.

Abstract

Clustering expressed sequence tags (ESTs) is a powerful strategy for gene identification, gene expression studies and the identification of important genetic variations such as single nucleotide polymorphisms. To enable fast clustering of large-scale EST data, we developed PaCE (for Parallel Clustering of ESTs), a software program for EST clustering on parallel computers. In this paper, we report on the design and development of PaCE and its evaluation using Arabidopsis ESTs. The novel features of our approach include: (i) memory-efficient algorithms that keep the memory requirement linear in the size of the input, (ii) a combination of algorithmic techniques that reduce the computational work without sacrificing the quality of clustering, and (iii) parallel processing to reduce run-time and facilitate clustering of larger data sets. Using a combination of these techniques, we report the clustering of 168 200 Arabidopsis ESTs in 15 min on an IBM xSeries cluster with 30 dual-processor nodes. We also clustered 327 632 rat ESTs in 47 min and 420 694 Triticum aestivum ESTs in 3 h and 15 min. We demonstrate the quality of our software using benchmark Arabidopsis EST data, and by comparing it with CAP3, a software program widely used for EST assembly. Our software allows clustering of much larger EST data sets than is possible with current software. Because of its speed, it also facilitates multiple runs with different parameters, providing biologists with a tool to better analyze EST sequence data. Using PaCE, we clustered EST data from 23 plant species and the results are available at the PlantGDB website.
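To make the idea of EST clustering concrete, the following is a minimal, single-process sketch. It is not PaCE's algorithm, which the abstract does not specify. It assumes a simple stand-in criterion: two sequences are placed in the same cluster if they share at least one exact substring of length k. The function names, the parameter `k`, and the toy sequences are all hypothetical, and reverse complements and sequencing errors are ignored.

```python
from collections import defaultdict


def find(parent, i):
    """Return the representative of i's cluster (union-find with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def cluster_by_shared_kmers(seqs, k=8):
    """Group sequences that share at least one exact k-mer (illustrative criterion only)."""
    parent = list(range(len(seqs)))

    # Index every k-mer by the sequences that contain it.
    # The index holds one entry per k-mer occurrence, so its size
    # grows linearly with the total input length.
    index = defaultdict(set)
    for i, s in enumerate(seqs):
        for p in range(len(s) - k + 1):
            index[s[p:p + k]].add(i)

    # Merge the clusters of all sequences that share a k-mer.
    for ids in index.values():
        ids = sorted(ids)
        for other in ids[1:]:
            ra, rb = find(parent, ids[0]), find(parent, other)
            if ra != rb:
                parent[rb] = ra

    # Collect members by cluster representative.
    clusters = defaultdict(list)
    for i in range(len(seqs)):
        clusters[find(parent, i)].append(i)
    return sorted(clusters.values())


# Hypothetical toy input: sequences 0 and 1 overlap, as do 2 and 3.
toy = ["ACGTACGTTTGA", "CGTACGTTTGAC", "GGGGCCCCAAAA", "CCCCAAAATTTT"]
result = cluster_by_shared_kmers(toy, k=8)
```

A realistic tool would replace the exact k-mer test with an alignment-based overlap check and would distribute the pairwise work across processors; this sketch only shows how pairwise evidence is folded into clusters.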

Year:  2003        PMID: 12771222      PMCID: PMC156714          DOI: 10.1093/nar/gkg379

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


  11 in total

1.  CAP3: A DNA sequence assembly program.

Authors:  X Huang; A Madan
Journal:  Genome Res       Date:  1999-09       Impact factor: 9.043

2.  Comparison of gene indexing databases.

Authors:  J Bouck; W Yu; R Gibbs; K Worley
Journal:  Trends Genet       Date:  1999-04       Impact factor: 11.639

3.  Functional annotation of a full-length Arabidopsis cDNA collection.

Authors:  Motoaki Seki; Mari Narusaka; Asako Kamiya; Junko Ishida; Masakazu Satou; Tetsuya Sakurai; Maiko Nakajima; Akiko Enju; Kenji Akiyama; Youko Oono; Masami Muramatsu; Yoshihide Hayashizaki; Jun Kawai; Piero Carninci; Masayoshi Itoh; Yoshiyuki Ishii; Takahiro Arakawa; Kazuhiro Shibata; Akira Shinagawa; Kazuo Shinozaki
Journal:  Science       Date:  2002-03-21       Impact factor: 47.728

4.  An optimized protocol for analysis of EST sequences.

Authors:  F Liang; I Holt; G Pertea; S Karamycheva; S L Salzberg; J Quackenbush
Journal:  Nucleic Acids Res       Date:  2000-09-15       Impact factor: 16.971

5.  The TIGR gene indices: reconstruction and representation of expressed gene sequences.

Authors:  J Quackenbush; F Liang; I Holt; G Pertea; J Upton
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

6.  Pieces of the puzzle: expressed sequence tags and the catalog of human genes. (Review)

Authors:  G D Schuler
Journal:  J Mol Med (Berl)       Date:  1997-10       Impact factor: 4.599

7.  A general method applicable to the search for similarities in the amino acid sequence of two proteins.

Authors:  S B Needleman; C D Wunsch
Journal:  J Mol Biol       Date:  1970-03       Impact factor: 5.469

8.  Identification of common molecular subsequences.

Authors:  T F Smith; M S Waterman
Journal:  J Mol Biol       Date:  1981-03-25       Impact factor: 5.469

9.  d2_cluster: a validated method for clustering EST and full-length cDNA sequences.

Authors:  J Burke; D Davison; W Hide
Journal:  Genome Res       Date:  1999-11       Impact factor: 9.043

10.  A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base.

Authors:  R T Miller; A G Christoffels; C Gopalakrishnan; J Burke; A A Ptitsyn; T R Broveak; W A Hide
Journal:  Genome Res       Date:  1999-11       Impact factor: 9.043

  27 in total

1.  PlantGDB, plant genome database and analysis tools.

Authors:  Qunfeng Dong; Shannon D Schlueter; Volker Brendel
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

2.  Refined annotation of the Arabidopsis genome by complete expressed sequence tag mapping.

Authors:  Wei Zhu; Shannon D Schlueter; Volker Brendel
Journal:  Plant Physiol       Date:  2003-06       Impact factor: 8.340

3.  Nearly identical paralogs: implications for maize (Zea mays L.) genome evolution.

Authors:  Scott J Emrich; Li Li; Tsui-Jung Wen; Marna D Yandeau-Nelson; Yan Fu; Ling Guo; Hui-Hsien Chou; Srinivas Aluru; Daniel A Ashlock; Patrick S Schnable
Journal:  Genetics       Date:  2006-11-16       Impact factor: 4.562

4.  Comparative plant genomics resources at PlantGDB.

Authors:  Qunfeng Dong; Carolyn J Lawrence; Shannon D Schlueter; Matthew D Wilkerson; Stefan Kurtz; Carol Lushbough; Volker Brendel
Journal:  Plant Physiol       Date:  2005-10       Impact factor: 8.340

5.  Anopheles gambiae genome reannotation through synthesis of ab initio and comparative gene prediction algorithms.

Authors:  Jun Li; Michelle M Riehle; Yan Zhang; Jiannong Xu; Frederick Oduol; Shawn M Gomez; Karin Eiglmeier; Beatrix M Ueberheide; Jeffrey Shabanowitz; Donald F Hunt; José M C Ribeiro; Kenneth D Vernick
Journal:  Genome Biol       Date:  2006-03-27       Impact factor: 13.583

6.  Quality assessment of maize assembled genomic islands (MAGIs) and large-scale experimental verification of predicted genes.

Authors:  Yan Fu; Scott J Emrich; Ling Guo; Tsui-Jung Wen; Daniel A Ashlock; Srinivas Aluru; Patrick S Schnable
Journal:  Proc Natl Acad Sci U S A       Date:  2005-08-15       Impact factor: 11.205

7.  PEACE: Parallel Environment for Assembly and Clustering of Gene Expression.

Authors:  D M Rao; J C Moler; M Ozden; Y Zhang; C Liang; J E Karro
Journal:  Nucleic Acids Res       Date:  2010-06-03       Impact factor: 16.971

8.  Microarray and cDNA sequence analysis of transcription during nerve-dependent limb regeneration.

Authors:  James R Monaghan; Leonard G Epp; Srikrishna Putta; Robert B Page; John A Walker; Chris K Beachy; Wei Zhu; Gerald M Pao; Inder M Verma; Tony Hunter; Susan V Bryant; David M Gardiner; Tim T Harkins; S Randal Voss
Journal:  BMC Biol       Date:  2009-01-13       Impact factor: 7.431

9.  PAVE: program for assembling and viewing ESTs.

Authors:  Carol Soderlund; Eric Johnson; Matthew Bomhoff; Anne Descour
Journal:  BMC Genomics       Date:  2009-08-26       Impact factor: 3.969

10.  EasyCluster: a fast and efficient gene-oriented clustering tool for large-scale transcriptome data.

Authors:  Ernesto Picardi; Flavio Mignone; Graziano Pesole
Journal:  BMC Bioinformatics       Date:  2009-06-16       Impact factor: 3.169
