
gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections.

Felipe A Louza¹, Guilherme P Telles², Simon Gog³, Nicola Prezza⁴, Giovanna Rosone⁵.

Abstract

BACKGROUND: The construction of a suffix array for a collection of strings is a fundamental task in Bioinformatics and in many other applications that process strings. Related data structures, such as the Longest Common Prefix array, the Burrows-Wheeler transform, and the document array, are often needed alongside the suffix array to efficiently solve a wide variety of problems. While several algorithms have been proposed to construct the suffix array for a single string, less emphasis has been put on algorithms to construct suffix arrays for string collections.
RESULT: In this paper we introduce gsufsort, an open-source tool for constructing the suffix array and related indexing data structures for a string collection with N symbols in O(N) time. Our tool is written in ANSI C and is based on the algorithm gSACA-K (Louza et al. in Theor Comput Sci 678:22-39, 2017), the fastest algorithm to construct suffix arrays for string collections. The tool supports large fasta, fastq and text files with multiple strings as input. Experiments have shown very good performance on different types of strings.
CONCLUSIONS: gsufsort is a fast, portable, and lightweight tool for constructing the suffix array and additional data structures for string collections.


Keywords: Burrows–Wheeler transform; Document array; LCP array; String collections; Suffix array

Year: 2020 PMID: 32973918 PMCID: PMC7507297 DOI: 10.1186/s13015-020-00177-y

Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405

Background

The suffix array (SA) [1] is one of the most important data structures in string processing. It enables efficient pattern searching in strings, as well as solving many other string problems [2-4]. More space-efficient solutions for such problems are possible by replacing the suffix array with an index based on the Burrows–Wheeler transform (BWT) [5]. Many applications require additional data structures on top of SA or BWT, most commonly the longest common prefix (LCP) array [6] and the document array (DA) [7]. These structures, possibly stored in compressed form, serve as a basis for building modern compact full-text indices, which make it possible to pre-process and query strings efficiently in compact space.

There are several internal memory algorithms designed for constructing the suffix array and additional data structures when the input consists of a single string [8, 9]. Less emphasis has been put on specialized algorithms for string collections, although in many applications the input is composed of many strings. A common approach is to concatenate all strings into a single one and use a standard construction algorithm. However, this approach may deteriorate either the theoretical bounds or the practical behavior of construction algorithms due to, respectively, the resulting alphabet size or unnecessary string comparisons [10-12].

Textual documents and webpages are examples of widespread large string collections. In Bioinformatics, important problems on collections of sequences may be solved rapidly with a small memory footprint using the aforementioned data structures, for example, finding suffix-prefix overlaps for sequence assembly [13], clustering cDNA sequences [14], finding repeats [15] and sequence matching [16].

In this paper we present gsufsort, an open-source tool that takes a string collection as input, constructs its (generalized) suffix array and additional data structures, such as the BWT, the LCP array, and the DA, and writes them directly to disk. This way, applications that rely on such data structures may either read them from disk or easily include gsufsort as a component. Large collections of d strings may be handled, provided that there is enough memory. This tool is an extension of previous results [10], with new implementations of procedures to obtain the BWT and the generalized suffix array (GSA) from SA during output to disk, and with the implementation of a lightweight alternative to compute DA.

Implementation

gsufsort is implemented in ANSI C and requires a single make command to be compiled. It receives a collection of strings in fasta, fastq or raw ASCII text format and computes SA and related data structures, according to input parameters. gsufsort optionally supports gzipped input data using the zlib and kseq libraries. Command-line arguments select which data structures are computed and written to disk, and which construction algorithm is used (see below). A function for loading pre-constructed data structures from disk is also provided.

Given a collection of d strings T_1, T_2, ..., T_d from an alphabet Σ of σ ASCII symbols, having lengths n_1, n_2, ..., n_d, the strings are concatenated into a single string T[0, N − 1] = T_1$T_2$···T_d$# using the same separator $ and an end-marker #, such that $ and # do not occur in any string T_i, and # < $ < α for any other symbol α ∈ Σ. The total length of T is N = (n_1 + 1) + (n_2 + 1) + ··· + (n_d + 1) + 1.

Before giving details on the implementation of gsufsort, we briefly recall the definitions of some data structures. For a string S of length n, let the suffix starting at position i be denoted S_i, 0 ≤ i ≤ n − 1. The suffix array SA of S is an array holding a permutation of [0, n − 1] that gives the suffixes of S in lexicographic order. The length of the longest common prefix of strings R and S is denoted by lcp(R, S). The LCP array for S gives the lcp between consecutive suffixes in the order of SA, that is, LCP[0] = 0 and LCP[i] = lcp(S_{SA[i−1]}, S_{SA[i]}), 1 ≤ i ≤ n − 1. For a suffix array of a collection of strings, position i of the document array DA gives the string to which suffix T_{SA[i]} belongs. For the last suffix T_{N−1} = # we have DA[0] = d + 1. The generalized suffix array GSA gives the order of the suffixes of every string in a collection, that is, the GSA is an array of N pairs of integers (a, b) where each entry (a, b) represents the suffix of string T_a that starts at position b.

gsufsort uses the algorithm gSACA-K [10] to construct SA for the concatenated string T, which breaks ties between equal suffixes from different strings T_i and T_j by their ranks, namely i and j. gSACA-K can also compute LCP and DA during SA construction, such that LCP values do not exceed separator symbols.
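The definitions above can be checked on a toy collection. The following is a minimal sketch in Python, not the gSACA-K algorithm: it sorts suffixes naively, so it is far slower than O(N) and is only meant to make SA, LCP and DA concrete. The function name `build_naive`, the 0-based positions in T and the numbering of strings from 1 to d (with d + 1 for the end-marker) are assumptions made for this illustration.

```python
def build_naive(strings):
    """Naive T, SA, LCP and DA for a small collection (illustration only).

    Assumes the input strings contain neither '$' nor '#'.
    Strings are numbered 1..d; the end-marker '#' gets document d + 1.
    """
    d = len(strings)
    T = "".join(s + "$" for s in strings) + "#"

    # One record per position of T:
    # (sort key, position in T, document number, text up to the separator).
    records = []
    pos = 0
    for doc, s in enumerate(strings, start=1):
        for b in range(len(s) + 1):      # b == len(s) is the separator itself
            text = s[b:]
            # The separator is encoded as 1, smaller than any symbol code;
            # equal suffixes from different strings are ordered by doc.
            key = ([ord(c) for c in text] + [1], doc)
            records.append((key, pos + b, doc, text))
        pos += len(s) + 1
    # The end-marker '#' is encoded as 0, so it is the smallest suffix.
    records.append((([0], d + 1), pos, d + 1, ""))

    records.sort(key=lambda r: r[0])
    SA = [r[1] for r in records]
    DA = [r[2] for r in records]

    def lcp(r, s):
        n = 0
        while n < len(r) and n < len(s) and r[n] == s[n]:
            n += 1
        return n

    # LCP[0] = 0; separators are never counted in an lcp value.
    LCP = [0] + [lcp(records[i - 1][3], records[i][3])
                 for i in range(1, len(records))]
    return T, SA, LCP, DA


T, SA, LCP, DA = build_naive(["ab", "a"])
print(T)    # ab$a$#
print(SA)   # [5, 2, 4, 3, 0, 1]
print(LCP)  # [0, 0, 0, 0, 1, 0]
print(DA)   # [3, 1, 2, 2, 1, 1]
```

In this example N = (2 + 1) + (1 + 1) + 1 = 6. The two separator suffixes at positions 2 and 4 are equal as strings and appear in SA in the order of the strings they terminate.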
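gSACA-K runs in O(N) time using σ log N bits of working space. The BWT is calculated during the output to disk according to its well-known relation to SA [3]: BWT[i] = T[SA[i] − 1] if SA[i] ≠ 0, and BWT[i] = # otherwise.

The generalized suffix array (GSA) can be computed by gsufsort from SA and DA during the output to disk: entry GSA[i] = (a, b) has a = DA[i], and b is the offset of position SA[i] from the position where T_a starts in T.

We also provide a lightweight version (gsufsort-light) for the computation of DA. It uses less memory at the price of being slightly slower. It computes a bitvector B[0, N − 1] with O(1) rank support [4] such that B[i] = 1 if T[i] = $ and B[i] = 0 otherwise. The values in DA are obtained on-the-fly while DA (or GSA) is written to disk, through the identity DA[i] = rank_1(B, SA[i]) + 1, where rank_1(B, p) is the number of 1s in B[0, p − 1].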
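The three derivations described in this section (BWT from SA, DA from separator counts, and GSA from SA and DA) can be sketched on a toy input. This is a hedged illustration rather than the gsufsort code: the function names are hypothetical, a plain prefix-sum array stands in for the O(1)-rank bitvector, string start positions are found by scanning T, and strings are assumed to be numbered from 1 with 0-based positions. The input T and SA correspond to the collection {ab, a}.

```python
def bwt_from_sa(T, SA):
    # BWT[i] is the symbol that precedes suffix SA[i] in T;
    # for the suffix at position 0 it is the end-marker T[-1] = '#'.
    return "".join(T[p - 1] if p > 0 else T[-1] for p in SA)


def da_from_rank(T, SA):
    # rank[p] = number of separators '$' in T[0..p-1].
    # A prefix-sum list is used here instead of a succinct bitvector.
    rank = [0]
    for c in T:
        rank.append(rank[-1] + (1 if c == "$" else 0))
    return [rank[p] + 1 for p in SA]


def gsa_from_sa_da(T, SA, DA):
    # start[a] = position in T where string a begins
    # (a = d + 1 is the position of the end-marker).
    start = {1: 0}
    doc = 1
    for p, c in enumerate(T):
        if c == "$":
            doc += 1
            start[doc] = p + 1
    return [(a, p - start[a]) for p, a in zip(SA, DA)]


T = "ab$a$#"                 # concatenation of "ab" and "a"
SA = [5, 2, 4, 3, 0, 1]      # suffix array of T, ties broken by string rank

BWT = bwt_from_sa(T, SA)
DA = da_from_rank(T, SA)
GSA = gsa_from_sa_da(T, SA, DA)
print(BWT)  # $ba$#a
print(DA)   # [3, 1, 2, 2, 1, 1]
print(GSA)  # [(3, 0), (1, 2), (2, 1), (2, 0), (1, 0), (1, 1)]
```

Each value is computed from SA[i] alone, plus a structure built once from T, which is why these arrays can be produced entry by entry while the output is being written.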

Results

We compared our tool with mkESA [17], a fast suffix array construction software designed for bioinformatics applications. We ran both versions of our tool, gsufsort and gsufsort-light, on each collection, while mkESA was run to build arrays SA and LCP for the concatenation of all strings (using the same symbol as separators). The experiments were conducted on a single core of a machine running GNU/Linux (Debian 8, kernel 3.16.0-4, 64 bits) with an Intel Xeon E5-2630 2.40 GHz processor, 384 GB of RAM and 13 TB of SATA storage. The sources were compiled with GNU GCC version 4.8.4 with option -O3. The collections we used in our experiments are described in Table 1. They comprise real DNA, real proteins, documents, random DNA and random proteins, and differ in their alphabet size and also in their maximum and average lcp, which offer an approximation of suffix sorting difficulty.

Table 1

Collections

Collection	Size (GB)	σ	N. of strings (millions)	Max. len.	Avg. len.	Max. lcp	Avg. lcp
shortreads	16.00	5	171.8	100	100	100	32.87
reads	16.00	6	57.3	300	300	300	91.29
pacbio	16.00	5	1.9	71,561	9117	3084	19.08
pacbio.1000	16.00	5	17.2	1,000	1000	876	18.67
uniprot	16.04	25	46.1	74,488	374	74,293	99.24
gutenberg	15.88	255	334.3	757,936	50	9060	18.97
random.dna	16.00	4	16.1	1,048,576	1,048,576	33	16.18
random.protein	16.00	25	16.1	1,048,576	1,048,576	13	6.89

Columns 2 and 3 show the collection size (in GB) and the alphabet size. Column 4 shows the number of strings (in millions). Columns 5 and 6 show the maximum and average lengths of strings in a collection. Columns 7 and 8 show the maximum and average lcp of strings in a collection.

Collections

shortreads are Illumina reads from the human genome trimmed to 100 nucleotides (http://ftp.sra.ebi.ac.uk/vol1/ERA015/ERA015743/srf);

reads are Illumina HiSeq 4000 paired-end RNA-seq reads from the plant Setaria viridis trimmed to 300 nucleotides (http://www.trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR1942989);

pacbio are PacBio RS II reads from the Triticum aestivum (wheat) genome (http://www.trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5816161);

pacbio.1000 are strings from pacbio trimmed to length 1,000;

uniprot are protein sequences from TrEMBL downloaded on May 28, 2019 (http://www.ebi.ac.uk/uniprot/download-center);

gutenberg are ASCII books in English from Project Gutenberg (http://www.gutenberg.org);

random.dna was generated with even sampling probability on the standard 4-letter alphabet;

random.protein was generated with even sampling probability on the IUPAC 25-letter alphabet.

The results are shown in Table 2. The data show a clear time/memory tradeoff for DNA sequences: gsufsort is faster while using approximately 1.25 times as much memory, whereas gsufsort-light uses slightly less memory than mkESA but, except on reads, takes more time. On the uniprot proteins, gsufsort-light is only marginally slower than gsufsort but faster than mkESA. The authors of mkESA reported a 32% gain on a large protein dataset using 16 threads [17], but larger lcp values seem not to favor mkESA when compared to gsufsort, which is 47.9% faster on proteins (uniprot) and 12.9% faster on DNA (reads).
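The relative gains quoted above can be recomputed from the running times in Table 2. The sketch below does this for the uniprot and reads collections, with the times transcribed from the table. It assumes that "faster" means the relative reduction in running time with respect to mkESA; the helper names are hypothetical.

```python
def seconds(hms):
    # Convert an h:mm:ss string into seconds.
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s


def percent_faster(t_tool, t_baseline):
    # Relative reduction in running time with respect to the baseline, in %.
    return 100.0 * (1.0 - seconds(t_tool) / seconds(t_baseline))


# Running times copied from Table 2 (h:mm:ss).
times = {
    "uniprot": {"gsufsort": "5:11:33", "gsufsort-light": "5:25:37", "mkESA": "9:58:03"},
    "reads": {"gsufsort": "5:00:27", "gsufsort-light": "5:10:04", "mkESA": "5:44:58"},
}

for name, t in times.items():
    for tool in ("gsufsort", "gsufsort-light"):
        gain = percent_faster(t[tool], t["mkESA"])
        print(f"{name:8s} {tool:15s} {gain:.1f}% faster than mkESA")
```

With these inputs the gsufsort rows give 47.9% on uniprot and 12.9% on reads; the gsufsort-light rows give somewhat smaller gains on both collections.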

Table 2

Algorithms’ running times and memory usage on different collections

Collection	gsufsort			gsufsort-light			mkESA		
	Time (h:mm:ss)	RAM (GB)	Bytes/N	Time (h:mm:ss)	RAM (GB)	Bytes/N	Time (h:mm:ss)	RAM (GB)	Bytes/N
shortreads	4:25:52	336.00	21.00	5:30:54	272.00	17.00	4:51:48	274.73	17.17
reads	5:00:27	336.00	21.00	5:10:04	272.00	17.00	5:44:58	280.68	17.54
pacbio	4:19:37	336.04	21.00	4:54:21	272.03	17.00	4:26:39	272.58	17.03
pacbio.1000	4:28:22	336.00	21.00	5:20:39	272.00	17.00	4:44:50	272.32	17.02
uniprot	5:11:33	336.90	21.00	5:25:37	272.73	17.00	9:58:03	294.86	18.38
gutenberg	4:17:52	334.40	21.00	4:53:05	269.90	17.00	–	–	–
random.dna	4:23:56	331.08	21.00	5:41:45	268.02	17.00	4:28:43	268.33	17.02
random.protein	5:20:06	331.08	21.00	5:47:38	268.02	17.00	4:37:16	268.33	17.02

Columns RAM and Bytes/N show the peak memory in GB and the ratio of bytes per input symbol. Each symbol of T uses 1 byte. Results for gutenberg are reported for gsufsort and gsufsort-light only, as mkESA is restricted to DNA and amino-acid alphabets. The best results are indicated in italics.

The memory ratio (bytes/N) of gsufsort and gsufsort-light is constant, 21 and 17 bytes per input symbol respectively, corresponding to the space of the input string T (N bytes) plus the space for arrays SA and LCP (8N bytes each) and, only for gsufsort, the space for DA (4N bytes).

We have also evaluated the performance of gsufsort, gsufsort-light and mkESA on collections of random DNA and random protein sequences. The collections have a growing number of 1 MB sequences. The running time in seconds and the peak memory usage in GB are shown in Fig. 1 (logarithmic scale). Using random sequences reduces the variation due to lcp among collections. We can see a perfectly steady behavior of mkESA. While still O(N), gsufsort displays a deviation due to larger constants.
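The memory accounting given earlier in this section (21 and 17 bytes per input symbol) can be written out as simple arithmetic. The sketch below is only a restatement of that accounting; the function names are hypothetical, and the 16.00 GB input size is the one reported for shortreads in Table 1.

```python
def bytes_per_symbol(with_da_array):
    text = 1                           # input string T: 1 byte per symbol
    sa = 8                             # SA: 8 bytes per entry
    lcp = 8                            # LCP: 8 bytes per entry
    da = 4 if with_da_array else 0     # DA array: kept in memory only by gsufsort
    return text + sa + lcp + da


def peak_memory_gb(size_gb, with_da_array):
    # Peak memory grows linearly with the input size.
    return size_gb * bytes_per_symbol(with_da_array)


print(bytes_per_symbol(True))        # 21 (gsufsort)
print(bytes_per_symbol(False))       # 17 (gsufsort-light)
print(peak_memory_gb(16.00, True))   # 336.0
print(peak_memory_gb(16.00, False))  # 272.0
```

For a 16.00 GB collection this gives 336 GB and 272 GB, matching the RAM columns of Table 2 for shortreads.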

Fig. 1

Running time in seconds and peak memory in GB (in logarithmic scale) on random DNA and protein collections

Conclusions

We have introduced gsufsort, a fast, portable, and lightweight tool for constructing the suffix array and additional data structures for string collections. gsufsort may be used to pre-compute indexing structures and write them to disk, or may be included as a component in different applications. As an additional advantage, gsufsort is not restricted to biological sequences, as it can process collections of strings over ASCII alphabets.

Availability and requirements

Project name: gsufsort
Project home page: http://www.github.com/felipelouza/gsufsort
Operating system(s): Platform independent
Programming language: ANSI C
Other requirements: make, zlib (optional)
License: GNU GPL v-3.0.

