mkESA: enhanced suffix array construction tool.

Robert Homann, David Fleer, Robert Giegerich, Marc Rehmsmeier.

Abstract

We introduce the tool mkESA, an open source program for constructing enhanced suffix arrays (ESAs), striving for low memory consumption, yet high practical speed. mkESA is a user-friendly program written in portable C99, based on a parallelized version of the Deep-Shallow suffix array construction algorithm, which is known for its high speed and small memory usage. The tool handles large FASTA files with multiple sequences, and computes suffix arrays and various additional tables, such as the LCP table (longest common prefix) or the inverse suffix array, from given sequence data.

Year:  2009        PMID: 19246510      PMCID: PMC2666816          DOI: 10.1093/bioinformatics/btp112

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

The program mkESA is a software tool for constructing enhanced suffix arrays (ESAs) from biological sequence data. The ESA is an index data structure for textual data, introduced in Abouelhoda et al. (2004) as an extension of the well-known suffix array (Manber and Myers, 1993). The ESA is equivalent to the suffix tree, another very important but more space-consuming full-text index data structure (Gusfield, 1997). The major advantages of ESAs over suffix trees are their lower space overhead, their improved locality of reference and the ease with which they can be stored in files. A suffix array for a text T of length n is a table of size n + 1 that lists the start positions of the suffixes of T in lexicographic order. Using a suffix array, exact string queries can be answered in O(m log n) time, where m is the length of the query, instead of O(m + n) time without a suffix array. ESAs are composed of a suffix array and additional tables that can be used to improve query performance [e.g. O(m + log n) time using the LCP table, called the Hgt array in Manber and Myers (1993)] or to implement more advanced queries efficiently (e.g. finding maximal unique matches). Thus, ESAs are a fundamental technology in sequence analysis. Many interesting problems on sequences from the field of computational biology can be solved efficiently by transforming sequence data into (enhanced) suffix arrays (see, for instance, Beckstette et al., 2006; De Bona et al., 2008; Höhl et al., 2002; Krumsiek et al., 2007; Rahmann, 2003). Linear-time algorithms for suffix array construction have been proposed, as well as algorithms that are fast in practice and/or tuned for space efficiency, making suffix arrays feasible for large datasets; see Puglisi et al. (2007) for a comprehensive overview. In addition, by the results of Abouelhoda et al. (2004), any program using suffix trees can be transformed so as to employ ESAs instead and benefit from the advantages offered by that data structure.
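To make the definition and the O(m log n) query bound concrete, the following minimal Python sketch builds a suffix array naively by sorting and answers an exact query by binary search. It is an illustration only and not how mkESA works: the function names are hypothetical, the construction is far slower than Deep-Shallow, and we assume that the extra entry in the table of size n + 1 is the empty suffix at position n.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction by sorting suffix slices, for illustration only.
    The table has n + 1 entries: the n suffixes of text plus the empty
    suffix at position n (an assumption made for this sketch).
    """
    n = len(text)
    return sorted(range(n + 1), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text.

    Two binary searches over sa, each with O(log n) steps and an O(m)
    comparison per step, give the O(m log n) bound.
    """
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)                                  # [6, 5, 3, 1, 0, 4, 2]
    print(find_occurrences(text, sa, "ana"))   # [1, 3]
```

All suffixes that start with the query occupy one contiguous interval of the suffix array, which is why two boundary searches suffice.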
Despite the great interest in suffix arrays in the literature, only a few actual programs for ESA construction are available. Most existing programs construct plain suffix arrays only and do not address the specific needs of computational biology, such as handling multiple sequences and very large datasets. A notable exception is the widely used mkvtree program (http://www.vmatch.de/). mkvtree can read common file formats such as FASTA and keeps sequences separated from their descriptions. An ESA generated by mkvtree may contain multiple sequences, stored so that a match can easily be mapped to its corresponding sequence. The program is available free of charge as part of the Vmatch package but, unfortunately, in binary form and for non-commercial purposes only. This implies that software relying on mkvtree cannot be distributed easily, since the terms of the Vmatch license agreement restrict the legal use of mkvtree. Software that requires mkvtree also requires all of its users to obtain the Vmatch package, if available for their platform of choice, and to sign a license agreement, too. We have implemented the alternative open source software tool mkESA, using the Deep-Shallow algorithm (Manzini and Ferragina, 2004) for in-memory suffix array construction instead of multikey quicksort as used by mkvtree. Thus, mkESA is efficient even for highly repetitive sequence data, and is fast as long as all data can be held in main memory. As a further improvement, our implementation of Deep-Shallow can use multiple CPUs for increased speed.
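One simple way to keep several sequences in a single index while still mapping a match back to its sequence is sketched below in Python. This is a hedged illustration of the general idea only; it does not reproduce the file layout of mkvtree or mkESA. The separator character, the function names and the in-memory lists are all assumptions made for this example.

```python
from bisect import bisect_right

SEPARATOR = "\x00"  # hypothetical separator, assumed absent from all sequences


def parse_fasta(lines):
    """Split FASTA lines into parallel lists of descriptions and sequences."""
    descriptions, parts = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            descriptions.append(line[1:])   # description kept apart from the sequence
            parts.append([])
        elif parts:
            parts[-1].append(line)
    return descriptions, ["".join(p) for p in parts]


def concatenate(sequences):
    """Join sequences with SEPARATOR and record each sequence's start offset."""
    starts, offset = [], 0
    for seq in sequences:
        starts.append(offset)
        offset += len(seq) + 1              # +1 accounts for the separator
    return SEPARATOR.join(sequences), starts


def locate(starts, position):
    """Map a position in the concatenated text to (sequence index, local offset)."""
    index = bisect_right(starts, position) - 1
    return index, position - starts[index]


if __name__ == "__main__":
    descriptions, sequences = parse_fasta([">s1 first", "ACG", "T", ">s2", "GG", ">s3", "TTA"])
    text, starts = concatenate(sequences)
    index, offset = locate(starts, 6)       # position 6 lies in the second sequence
    print(descriptions[index], offset)      # s2 1
```

Because the start offsets are sorted, the lookup costs O(log k) for k sequences, independent of the total text length.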

2 IMPLEMENTATION

With mkvtree being the most widely used program for ESA construction, we tried to pick up all of the important ideas implemented in mkvtree and to improve upon its weaknesses. mkESA has been designed to produce output that is as compatible with mkvtree as possible. The files generated by mkESA are in fact the same as those made by mkvtree, meaning that data produced by mkESA can be processed by programs that expect mkvtree-generated ESAs. mkESA employs the ‘Deep-Shallow’ algorithm of Manzini and Ferragina (2004) for suffix array construction. This algorithm belongs to the family of ‘lightweight’ suffix sorting algorithms, i.e. algorithms that use only very little additional space besides the suffix array and the input text: (5+ε)n bytes in total for a text of length n when 32-bit integers are used for the suffix array. Our version of Deep-Shallow is multithreaded, i.e. the computational work for suffix sorting can be distributed over multiple CPUs or CPU cores. Since Deep-Shallow, unlike simple multikey quicksort, does not yield the LCP table as a by-product of suffix sorting, we use the space-efficient, linear-time algorithm of Manzini (2004) to construct LCP tables from suffix arrays. Moreover, mkESA can generate the inverse suffix array and the skip table (Beckstette et al., 2006). It is worth noting that mkESA can add further tables incrementally as they are needed.
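The relationship between the suffix array, the inverse suffix array and the LCP table can be sketched in a few lines of Python. Note that the sketch below uses the well-known linear-time algorithm of Kasai et al., not the more space-efficient variant of Manzini (2004) that mkESA employs, and that the naive suffix array builder and all function names are hypothetical helpers for illustration.

```python
def naive_suffix_array(text):
    """Hypothetical helper: sort all suffix start positions (incl. the empty suffix)."""
    return sorted(range(len(text) + 1), key=lambda i: text[i:])


def inverse_suffix_array(sa):
    """Return isa with isa[sa[r]] == r, i.e. the rank of every suffix."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


def lcp_table(text, sa):
    """Return lcp, where lcp[r] is the length of the longest common prefix
    of the suffixes at ranks r - 1 and r, and lcp[0] == 0.

    Kasai-style computation: suffixes are visited in text order, and the
    match length h drops by at most one from one suffix to the next, so
    the total number of character comparisons is linear in n.
    """
    n = len(text)
    isa = inverse_suffix_array(sa)
    lcp = [0] * len(sa)
    h = 0
    for i in range(n):
        r = isa[i]
        if r == 0:
            h = 0
            continue
        j = sa[r - 1]                       # predecessor of suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1
    return lcp


if __name__ == "__main__":
    text = "banana"
    sa = naive_suffix_array(text)
    print(inverse_suffix_array(sa))         # [4, 3, 6, 2, 5, 1, 0]
    print(lcp_table(text, sa))              # [0, 0, 1, 3, 0, 0, 2]
```

Since both tables are derived from the suffix array alone (plus the text for the LCP table), they can be computed after the fact, which matches the idea of adding tables to an index only when they are needed.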

3 RUNTIME BENCHMARKS

We compared the performance of mkESA with other programs for suffix array construction, namely mkvtree, mksary 1.2.0 (http://sary.sourceforge.net/, included for its ability to run multithreaded) and ds, Manzini's implementation of Deep-Shallow. We measured the time and space consumption for building suffix arrays from the datasets in Table 1, using memtime version 1.3. mkESA and mkvtree processed FASTA files; the other programs processed the bare sequence data with FASTA headers removed, so that all programs had comparable workloads. Only ‘parallel mkESA’ and ‘parallel mksary’ (Table 2) made explicit use of multiple CPU cores. Measurements were taken on a Sun Fire X4450 (4 Intel Xeon CPUs at 2.93 GHz, 16 cores, 96 GB RAM) running Solaris 10. The programs were compiled with gcc 4.1.1 using the flags -m64 -O3 -fomit-frame-pointer. Each experiment was repeated four times in a row; the best result (shortest elapsed time) is shown in Table 2. Our results show comparable memory requirements for all tested programs, while mkESA is usually the fastest among them, even when using only one CPU.

Table 1. Datasets used for performance measurements

| Name | Description | Size | σ |
|------|-------------|------|---|
| chr1 | Chromosome 1 human genome | 219 (219) MB | 4 |
| fmdv | Foot/mouth disease virus genomes | 65 (64) MB | 4 |
| spro | UniprotKB/Swiss-Prot rel. 56.4 | 181 (140) MB | 20 |
| trem | UniprotKB/TrEMBL rel. 39.4 | 2836 (2110) MB | 20 |
| f25 | 25th Fibonacci string | 73 (73) kB | 2 |
| f30 | 30th Fibonacci string | 813 (813) kB | 2 |

Sizes are given as file sizes, followed by sizes of files with FASTA headers removed in parentheses. Alphabet sizes are given as σ. We included Fibonacci strings since these are hard on many suffix tree and suffix array construction algorithms due to their high repetitiveness. They impose the worst case for the number of nodes in a suffix tree, 2n, and thus, e.g. trigger the worst case running time of O(n²) of the WOTD suffix tree construction algorithm (Giegerich et al., 2003). Dataset ‘fmdv’ is a non-artificial example for highly repetitive sequence data, with similar impact on performance (Table 2).
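For readers who wish to reproduce inputs like ‘f25’ and ‘f30’, a Fibonacci string can be generated as in the short Python sketch below. The exact convention used for the benchmark files is not stated here, so the definition f(1) = "b", f(2) = "a", f(k) = f(k-1) f(k-2) and the function name are assumptions. Under this convention the 25th and 30th strings have 75025 and 832040 characters, which agrees with the sizes in Table 1 if one kB is taken as 1024 bytes.

```python
def fibonacci_string(k):
    """Return the k-th Fibonacci string over the alphabet {a, b}.

    Assumed convention: f(1) = "b", f(2) = "a", f(k) = f(k-1) + f(k-2),
    so the length of f(k) is the k-th Fibonacci number.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    prev, curr = "b", "a"
    if k == 1:
        return prev
    for _ in range(k - 2):
        prev, curr = curr, curr + prev      # new string = f(k-1) followed by f(k-2)
    return curr


if __name__ == "__main__":
    print(fibonacci_string(5))              # abaab
    print(len(fibonacci_string(25)))        # 75025
    print(len(fibonacci_string(30)))        # 832040
```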

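Table 2. Results of performance measurements

| Name | mkESA sec | mkESA MB | Parallel mkESA sec | Parallel mkESA MB | mkvtree sec | mkvtree MB |
|------|-----------|----------|--------------------|-------------------|-------------|------------|
| chr1 | 91 (2.6) | 1085 | 66 (2.6) | 1093 | 138 (2.2) | 1148 |
| fmdv | 89 (0.9) | 353 | 66 (0.9) | 356 | 1797 (1.1) | 338 |
| spro | 47 (1.9) | 785 | 25 (1.9) | 785 | 76 (2.2) | 813 |
| trem | 2273 (545) | 21 461 | 1500 (553) | 21 462 | 2956 (530) | 21 827 |
| f25 | 0.1 (0.0) | 0.1 | 0.1 (0.0) | 0.1 | 7.3 (0.0) | 1.4 |
| f30 | 1.1 (0.0) | 5.1 | 1.1 (0.0) | 5.3 | 895 (0.0) | 5.4 |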

The ‘sec’ columns show the total time consumed in seconds (wall clock time), followed by the time attributed to operating system activities in parentheses. The ‘MB’ columns show main memory consumption in megabytes [resident set size (RSS)]. Parallel versions were allowed to use up to 16 threads. Some programs crashed for various datasets, in which cases results are not shown. For the same reason there is no row for ‘trem’ in the lower part. All values were rounded for readability.


4 CONCLUSION

We presented mkESA, a portable, lightweight, multithreaded and fast program for constructing enhanced suffix arrays. We carefully tested the software on a variety of UNIX-like operating systems and hardware architectures, including recent versions of Linux, Solaris, Mac OS X, FreeBSD, OpenBSD and NetBSD. Its ability to generate output compatible with mkvtree makes mkESA a convenient open source drop-in replacement for earlier programs.

Conflict of Interest: none declared.

REFERENCES

1.  Efficient multiple genome alignment.

Authors:  Michael Höhl; Stefan Kurtz; Enno Ohlebusch
Journal:  Bioinformatics       Date:  2002       Impact factor: 6.937

2.  Fast large scale oligonucleotide selection using the longest common factor approach.

Authors:  Sven Rahmann
Journal:  J Bioinform Comput Biol       Date:  2003-07       Impact factor: 1.122

3.  Gepard: a rapid and sensitive tool for creating dotplots on genome scale.

Authors:  Jan Krumsiek; Roland Arnold; Thomas Rattei
Journal:  Bioinformatics       Date:  2007-02-19       Impact factor: 6.937

4.  Optimal spliced alignments of short sequence reads.

Authors:  Fabio De Bona; Stephan Ossowski; Korbinian Schneeberger; Gunnar Rätsch
Journal:  Bioinformatics       Date:  2008-08-15       Impact factor: 6.937

5.  Fast index based algorithms and software for matching position specific scoring matrices.

Authors:  Michael Beckstette; Robert Homann; Robert Giegerich; Stefan Kurtz
Journal:  BMC Bioinformatics       Date:  2006-08-24       Impact factor: 3.169
