Robert Homann, David Fleer, Robert Giegerich, Marc Rehmsmeier.
Abstract
We introduce the tool mkESA, an open source program for constructing enhanced suffix arrays (ESAs), striving for low memory consumption, yet high practical speed. mkESA is a user-friendly program written in portable C99, based on a parallelized version of the Deep-Shallow suffix array construction algorithm, which is known for its high speed and small memory usage. The tool handles large FASTA files with multiple sequences, and computes suffix arrays and various additional tables, such as the LCP table (longest common prefix) or the inverse suffix array, from given sequence data.
Year: 2009 PMID: 19246510 PMCID: PMC2666816 DOI: 10.1093/bioinformatics/btp112
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
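To make the tables named in the abstract concrete, here is a minimal Python sketch of a suffix array, its inverse, and the LCP table for a short string. It is an illustration only: the suffix array is built by naive sorting, not by the Deep-Shallow algorithm that mkESA uses, and the LCP table uses the well-known linear-time scan of Kasai et al. The function names are hypothetical and do not come from mkESA.

```python
def suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text.

    Illustration only; this is NOT the Deep-Shallow algorithm used by mkESA.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def inverse_suffix_array(sa):
    """inv[pos] is the rank of the suffix starting at pos, so inv[sa[r]] == r."""
    inv = [0] * len(sa)
    for rank, pos in enumerate(sa):
        inv[pos] = rank
    return inv


def lcp_table(text, sa):
    """lcp[r] is the length of the longest common prefix of the suffixes
    at sa[r - 1] and sa[r]; lcp[0] is 0 by convention (Kasai et al. scan)."""
    n = len(text)
    inv = inverse_suffix_array(sa)
    lcp = [0] * n
    h = 0
    for i in range(n):
        r = inv[i]
        if r == 0:
            h = 0
            continue
        j = sa[r - 1]
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1
    return lcp


sa = suffix_array("banana")
print(sa)                        # [5, 3, 1, 0, 4, 2]
print(inverse_suffix_array(sa))  # [3, 2, 5, 1, 4, 0]
print(lcp_table("banana", sa))   # [0, 1, 3, 0, 0, 2]
```

The naive sort compares whole suffixes and is only practical for small inputs; the datasets listed in this record need a dedicated construction algorithm.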
Table 1. Datasets used for performance measurements
| Name | Description | Size | σ |
|---|---|---|---|
| chr1 | Chromosome 1 human genome | 219 (219) MB | 4 |
| fmdv | Foot/mouth disease virus genomes | 65 (64) MB | 4 |
| spro | UniprotKB/Swiss-Prot rel. 56.4 | 181 (140) MB | 20 |
| trem | UniprotKB/TrEMBL rel. 39.4 | 2836 (2110) MB | 20 |
| f25 | 25th Fibonacci string | 73 (73) kB | 2 |
| f30 | 30th Fibonacci string | 813 (813) kB | 2 |
Sizes are given as file sizes, followed in parentheses by the sizes of the files with FASTA headers removed. Alphabet sizes are given as σ. We included Fibonacci strings since these are hard on many suffix tree and suffix array construction algorithms due to their high repetitiveness. They impose the worst case for the number of nodes in a suffix tree, 2n, and thus, e.g. trigger the worst-case running time of O(n²) of the WOTD suffix tree construction algorithm (Giegerich et al., 2003). Dataset ‘fmdv’ is a non-artificial example of highly repetitive sequence data, with similar impact on performance (Table 2).
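The Fibonacci strings mentioned in the note above can be generated with a few lines of Python. This is a sketch under an assumed convention, since the record does not state which one the authors used: S(1) = "b", S(2) = "a", and S(n) = S(n-1) + S(n-2), so that the length of S(n) is the n-th Fibonacci number. The function name and the choice of letters are hypothetical.

```python
def fibonacci_string(n, first="b", second="a"):
    """Return the n-th Fibonacci string under the assumed convention
    S(1) = first, S(2) = second, S(n) = S(n-1) + S(n-2)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return first
    prev, cur = first, second
    for _ in range(n - 2):
        prev, cur = cur, cur + prev
    return cur


for n in (5, 25, 30):
    s = fibonacci_string(n)
    print(n, len(s), sorted(set(s)))
```

Under this convention the 25th and 30th strings have 75025 and 832040 characters, which agrees with the listed sizes of 73 kB and 813 kB if kB is read as 1024 bytes, and both use an alphabet of size 2.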
Table 2. Results of performance measurements
| Name | Parallel | | | | | |
|---|---|---|---|---|---|---|
| | sec | MB | sec | MB | sec | MB |
| chr1 | 91 (2.6) | 1085 | 66 (2.6) | 1093 | 138 (2.2) | 1148 |
| fmdv | 89 (0.9) | 353 | 66 (0.9) | 356 | 1797 (1.1) | 338 |
| spro | 47 (1.9) | 785 | 25 (1.9) | 785 | 76 (2.2) | 813 |
| trem | 2273 (545) | 21 461 | 1500 (553) | 21 462 | 2956 (530) | 21 827 |
| f25 | 0.1 (0.0) | 0.1 | 0.1 (0.0) | 0.1 | 7.3 (0.0) | 1.4 |
| f30 | 1.1 (0.0) | 5.1 | 1.1 (0.0) | 5.3 | 895 (0.0) | 5.4 |
The ‘sec’ columns show the total wall-clock time consumed in seconds, followed in parentheses by the time attributed to operating system activities. The ‘MB’ columns show main memory consumption in megabytes [resident set size (RSS)]. Parallel versions were allowed to use up to 16 threads. Some programs crashed for various datasets, in which case results are not shown. For the same reason there is no row for ‘trem’ in the lower part. All values were rounded for readability.
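The three quantities described in the note above (wall-clock time, operating system time, and peak RSS) could be collected for an arbitrary command roughly as follows. This is a sketch for Unix-like systems using Python's standard `resource` module, not the measurement setup the authors used; the helper name `measure` is hypothetical.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd and return (wall_seconds, system_seconds, peak_rss_mb).

    Unix only. ru_maxrss is the largest resident set size among all child
    processes waited for so far, so measure one command per interpreter
    run if an exact per-command peak is needed.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    system = after.ru_stime - before.ru_stime
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return wall, system, after.ru_maxrss / divisor


if __name__ == "__main__":
    # Placeholder command; substitute the program to be benchmarked.
    wall, system, rss_mb = measure([sys.executable, "-c", "pass"])
    print(f"{wall:.2f} s wall, {system:.2f} s system, {rss_mb:.1f} MB RSS")
```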