Felipe A Louza, Guilherme P Telles, Simon Gog, Nicola Prezza, Giovanna Rosone.
Abstract
BACKGROUND: The construction of a suffix array for a collection of strings is a fundamental task in Bioinformatics and in many other applications that process strings. Related data structures, such as the Longest Common Prefix array, the Burrows–Wheeler transform, and the document array, are often needed alongside the suffix array to efficiently solve a wide variety of problems. While several algorithms have been proposed to construct the suffix array for a single string, less emphasis has been put on algorithms to construct suffix arrays for string collections.
RESULT: In this paper we introduce gsufsort, an open-source tool for constructing the suffix array and related indexing data structures for a string collection with N symbols in O(N) time. Our tool is written in ANSI C and is based on the algorithm gSACA-K (Louza et al. in Theor Comput Sci 678:22-39, 2017), the fastest algorithm to construct suffix arrays for string collections. The tool supports large FASTA, FASTQ and text files with multiple strings as input. Experiments have shown very good performance on different types of strings.
Keywords: Burrows–Wheeler transform; Document array; LCP array; String collections; Suffix array
Year: 2020 PMID: 32973918 PMCID: PMC7507297 DOI: 10.1186/s13015-020-00177-y
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
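To make the structures named in the abstract concrete, the following is a minimal sketch that builds the suffix array (SA), LCP array, document array (DA) and BWT for a tiny collection by plain sorting. It is not gSACA-K and does not run in O(N) time; the function names, the `$` terminator, and the tie-breaking of equal suffixes by string index are assumptions made for illustration, and gsufsort's actual conventions may differ.

```python
TERMINATOR = "$"  # assumed separator; must not occur in the input and must sort below every input symbol


def common_prefix(a, b):
    """Length of the shared prefix of a and b, never matching across a terminator."""
    n = 0
    for x, y in zip(a, b):
        if x != y or x == TERMINATOR:
            break
        n += 1
    return n


def build_index(strings):
    """Return (SA, LCP, DA, BWT) for the concatenation s1$s2$...sk$ (naive sorting)."""
    text = "".join(s + TERMINATOR for s in strings)
    suffixes = []
    start = 0
    for doc, s in enumerate(strings):
        for off in range(len(s) + 1):
            # Each suffix is cut at its own terminator; equal suffixes from
            # different strings are ordered by string index (second tuple field).
            suffixes.append((s[off:] + TERMINATOR, doc, start + off))
        start += len(s) + 1
    suffixes.sort()

    sa = [pos for _, _, pos in suffixes]   # positions in the concatenated text
    da = [doc for _, doc, _ in suffixes]   # which string each suffix belongs to
    lcp = [0] * len(suffixes)
    for i in range(1, len(suffixes)):
        lcp[i] = common_prefix(suffixes[i - 1][0], suffixes[i][0])
    # Symbol preceding each suffix; text[-1] wraps around to the last terminator.
    bwt = "".join(text[pos - 1] for pos in sa)
    return sa, lcp, da, bwt


sa, lcp, da, bwt = build_index(["ab", "b"])
print(sa)   # [2, 4, 0, 1, 3]
print(lcp)  # [0, 0, 0, 0, 1]
print(da)   # [0, 1, 0, 0, 1]
print(bwt)  # bb$a$
```

The sketch materializes every suffix, so it is only suitable for toy inputs; a linear-time construction avoids exactly this cost.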
Collections
| Collection | Size (GB) | Alphabet size | N. of strings (millions) | Max. len. | Avg. len. | Max. lcp | Avg. lcp |
|---|---|---|---|---|---|---|---|
| shortreads | 16.00 | 5 | 171.8 | 100 | 100 | 100 | 32.87 |
| reads | 16.00 | 6 | 57.3 | 300 | 300 | 300 | 91.29 |
| pacbio | 16.00 | 5 | 1.9 | 71,561 | 9,117 | 3,084 | 19.08 |
| pacbio.1000 | 16.00 | 5 | 17.2 | 1,000 | 1,000 | 876 | 18.67 |
| uniprot | 16.04 | 25 | 46.1 | 74,488 | 374 | 74,293 | 99.24 |
| gutenberg | 15.88 | 255 | 334.3 | 757,936 | 50 | 9,060 | 18.97 |
| random.dna | 16.00 | 4 | 16.1 | 1,048,576 | 1,048,576 | 33 | 16.18 |
| random.protein | 16.00 | 25 | 16.1 | 1,048,576 | 1,048,576 | 13 | 6.89 |
Columns 2 and 3 show the collection size (in GB) and the alphabet size. Column 4 shows the number of strings (in millions). Columns 5 and 6 show the maximum and average lengths of the strings in a collection. Columns 7 and 8 show the maximum and average lcp values of the strings in a collection.
shortreads are Illumina reads from the human genome trimmed to 100 nucleotides (http://ftp.sra.ebi.ac.uk/vol1/ERA015/ERA015743/srf);
reads are Illumina HiSeq 4000 paired-end RNA-seq reads from the plant Setaria viridis trimmed to 300 nucleotides (http://www.trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR1942989);
pacbio are PacBio RS II reads from the Triticum aestivum (wheat) genome (http://www.trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5816161);
pacbio.1000 are strings from pacbio trimmed to length 1,000;
uniprot are protein sequences from TrEMBL downloaded on May 28, 2019 (http://www.ebi.ac.uk/uniprot/download-center);
gutenberg are ASCII books in English from Project Gutenberg (http://www.gutenberg.org);
random.dna was generated by sampling uniformly from the standard 4-letter DNA alphabet;
random.protein was generated by sampling uniformly from the IUPAC 25-letter alphabet.
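As an illustration of what "even sampling probability" means for the two random collections, here is a small sketch that draws each symbol uniformly and independently from a given alphabet. The function name, the seed parameter, and the use of `ACGT` as the DNA alphabet are assumptions; the generator actually used for the experiments is not described here, and the 25-letter protein alphabet would simply be passed in as the `alphabet` argument.

```python
import random


def random_collection(alphabet, n_strings, length, seed=0):
    """Generate n_strings strings of the given length, each symbol drawn uniformly."""
    rng = random.Random(seed)  # fixed seed so the collection is reproducible
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(n_strings)]


# Tiny example; the collections in the table use strings of length 1,048,576.
dna = random_collection("ACGT", n_strings=3, length=20)
for s in dna:
    print(s)
```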
Running times and peak memory usage of the algorithms on the different collections
| Collection | gsufsort Time | gsufsort RAM | gsufsort Bytes/N | gsufsort-light Time | gsufsort-light RAM | gsufsort-light Bytes/N | mkESA Time | mkESA RAM | mkESA Bytes/N |
|---|---|---|---|---|---|---|---|---|---|
| shortreads | | 336.00 | 21.00 | 5:30:54 | | | 4:51:48 | 274.73 | 17.17 |
| reads | | 336.00 | 21.00 | 5:10:04 | | | 5:44:58 | 280.68 | 17.54 |
| pacbio | | 336.04 | 21.00 | 4:54:21 | | | 4:26:39 | 272.58 | 17.03 |
| pacbio.1000 | | 336.00 | 21.00 | 5:20:39 | | | 4:44:50 | 272.32 | 17.02 |
| uniprot | | 336.90 | 21.00 | 5:25:37 | | | 9:58:03 | 294.86 | 18.38 |
| gutenberg | | 334.40 | 21.00 | 4:53:05 | | | – | – | – |
| random.dna | | 331.08 | 21.00 | 5:41:45 | | | 4:28:43 | 268.33 | 17.02 |
| random.protein | 5:20:06 | 331.08 | 21.00 | 5:47:38 | | | | 268.33 | 17.02 |
Columns RAM and Bytes/N show the peak memory in GB and the ratio of bytes used per input symbol. Each input symbol uses 1 byte. Results for gutenberg are reported for gsufsort and gsufsort-light only, as mkESA is restricted to DNA and amino-acid alphabets. The best results are indicated in italics.
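The relation between the RAM and Bytes/N columns can be checked with simple arithmetic: because each input symbol occupies 1 byte, dividing the peak memory by the collection size (both in GB) gives the bytes used per input symbol. The helper below is a hypothetical illustration of that calculation using values from the two tables, not part of any of the tools.

```python
def bytes_per_symbol(peak_ram_gb, collection_gb):
    """Peak memory divided by input size; valid when each symbol takes 1 byte."""
    return peak_ram_gb / collection_gb


# gsufsort on shortreads: 336.00 GB peak memory for a 16.00 GB collection.
print(round(bytes_per_symbol(336.00, 16.00), 2))  # 21.0
# mkESA on shortreads: 274.73 GB peak memory for the same collection.
print(round(bytes_per_symbol(274.73, 16.00), 2))  # 17.17
```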
Fig. 1 Running time in seconds and peak memory in GB (in logarithmic scale) on the random DNA and protein collections