Compressed suffix tree--a basis for genome-scale sequence analysis.

Niko Välimäki, Wolfgang Gerlach, Kashyap Dixit, Veli Mäkinen.

Abstract

The suffix tree is one of the most fundamental data structures in string algorithms and biological sequence analysis. Unfortunately, when these algorithms are implemented and applied to real genomic sequences, main memory size often becomes the bottleneck. The reason is simple: a DNA sequence of length n over the alphabet σ = {A, C, G, T} can be stored in n log |σ| = 2n bits, whereas its suffix tree occupies O(n log n) bits. In practice, the size difference easily reaches a factor of 50. We provide an implementation of the compressed suffix tree very recently proposed by Sadakane (Theory of Computing Systems, in press). The compressed suffix tree occupies space proportional to the text size, i.e. O(n log |σ|) bits, and supports all typical suffix tree operations with at most a log n factor slowdown. Our experiments show that, e.g. on a 10 MB DNA sequence, the compressed suffix tree takes 10% of the space of a normal suffix tree. Typical operations are slowed down by a factor of 60. AVAILABILITY: The C++ implementation under GNU license is available at http://www.cs.helsinki.fi/group/suds/cst/. An example program implementing a typical pattern discovery task is included. Experimental results in this note correspond to version 0.95.
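
The 2n-bit figure in the abstract can be illustrated with a minimal C++ sketch. It is not taken from the authors' package: the names `PackedDna`, `pack_dna`, `base_at`, `packed_bits` and `index_array_bits` are hypothetical, and the sketch only packs text at 2 bits per base and counts bits for a single array of n text positions. It does not build a suffix tree or a compressed suffix tree.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Map a base to its 2-bit code; throws on anything outside {A, C, G, T}.
inline std::uint8_t base_to_code(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default: throw std::invalid_argument("base outside {A, C, G, T}");
    }
}

// Hypothetical container: DNA stored at 2 bits per base, four bases per byte.
struct PackedDna {
    std::vector<std::uint8_t> bytes;
    std::size_t length = 0;  // number of bases
};

// Pack a text over {A, C, G, T} into ceil(n / 4) bytes.
PackedDna pack_dna(const std::string& text) {
    PackedDna packed;
    packed.length = text.size();
    packed.bytes.assign((text.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned shift = 2 * static_cast<unsigned>(i % 4);
        packed.bytes[i / 4] |= static_cast<std::uint8_t>(base_to_code(text[i]) << shift);
    }
    return packed;
}

// Read base i back out of the packed representation.
char base_at(const PackedDna& packed, std::size_t i) {
    assert(i < packed.length);
    static const char kBases[] = {'A', 'C', 'G', 'T'};
    const unsigned shift = 2 * static_cast<unsigned>(i % 4);
    return kBases[(packed.bytes[i / 4] >> shift) & 3u];
}

// Bits used by the packed text: n log |sigma| = 2n for a 4-letter alphabet.
std::size_t packed_bits(std::size_t n) { return 2 * n; }

// Bits for one array of n text positions, each ceil(log2 n) bits wide.
// This is a single n log n term, not the full size of a suffix tree.
std::size_t index_array_bits(std::size_t n) {
    std::size_t width = 0;
    while ((std::size_t{1} << width) < n) ++width;
    return n * width;
}
```

Assuming the 10 MB sequence corresponds to roughly n = 10,000,000 bases, `packed_bits(n)` gives 20,000,000 bits while `index_array_bits(n)` gives 240,000,000 bits, so a single array of text positions is already 12 times larger than the packed text. A pointer-based suffix tree keeps several such fields per node, which is in line with the larger practical gap the abstract reports.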

Year:  2007        PMID: 17237063     DOI: 10.1093/bioinformatics/btl681

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  6 in total

1.  Finding and Characterizing Repeats in Plant Genomes.

Authors:  Jacques Nicolas; Sébastien Tempel; Anna-Sophie Fiston-Lavier; Emira Cherif
Journal:  Methods Mol Biol       Date:  2022

2.  Motif discovery and transcription factor binding sites before and after the next-generation sequencing era.

Authors:  Federico Zambelli; Graziano Pesole; Giulio Pavesi
Journal:  Brief Bioinform       Date:  2012-04-19       Impact factor: 11.622

3.  Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT).

Authors:  Richard Durbin
Journal:  Bioinformatics       Date:  2014-01-09       Impact factor: 6.937

4.  Visual ModuleOrganizer: a graphical interface for the detection and comparative analysis of repeat DNA modules.

Authors:  Sebastien Tempel; Emmanuel Talla
Journal:  Mob DNA       Date:  2014-03-28

5.  Using the Sadakane compressed suffix tree to solve the all-pairs suffix-prefix problem.

Authors:  Maan Haj Rachid; Qutaibah Malluhi; Mohamed Abouelhoda
Journal:  Biomed Res Int       Date:  2014-04-16       Impact factor: 3.411

6.  HIVE-hexagon: high-performance, parallelized sequence alignment for next-generation sequencing data analysis.

Authors:  Luis Santana-Quintero; Hayley Dingerdissen; Jean Thierry-Mieg; Raja Mazumder; Vahan Simonyan
Journal:  PLoS One       Date:  2014-06-11       Impact factor: 3.240

