Andrea Farruggia, Travis Gagie, Gonzalo Navarro, Simon J Puglisi, Jouni Sirén.
Abstract
Suffix trees are one of the most versatile data structures in stringology, with many applications in bioinformatics. Their main drawback is their size, which can be tens of times larger than the input sequence. Much effort has been put into reducing the space usage, leading ultimately to compressed suffix trees. These compressed data structures can efficiently simulate the suffix tree, while using space proportional to a compressed representation of the sequence. In this work, we take a new approach to compressed suffix trees for repetitive sequence collections, such as collections of individual genomes. We compress the suffix trees of individual sequences relative to the suffix tree of a reference sequence. These relative data structures provide competitive time/space trade-offs, being almost as small as the smallest compressed suffix trees for repetitive collections, and competitive in time with the largest and fastest compressed suffix trees.
Keywords: compressed text indexing; repetitive collections; suffix trees
Year: 2017 PMID: 29795706 PMCID: PMC5956352 DOI: 10.1093/comjnl/bxx108
Source DB: PubMed Journal: Comput J ISSN: 0010-4620 Impact factor: 1.494
Typical compressed suffix tree operations.
| Operation | Description |
|---|---|
| | The root of the tree |
| | Is node |
| | Is node |
| | Number of leaves in the subtree with |
| | Pointer to the suffix corresponding to leaf |
| | The parent of node |
| | The first child of node |
| | The next sibling of node |
| | The lowest common ancestor of nodes |
| | The highest ancestor of node |
| | The ancestor of node |
| | Suffix link iterated |
| | The child of node |
| | The character |
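Several of the operations described above can be illustrated with a plain suffix array and LCP array. The following is a minimal sketch, not the paper's data structure: it assumes a node is represented by an inclusive suffix array interval `(lo, hi)`, uses naive construction, and all function names are hypothetical.

```python
def suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[i] = longest common prefix length of suffixes sa[i-1] and sa[i]."""
    lcp = [0] * len(sa)
    for i in range(1, len(sa)):
        a, b = text[sa[i - 1]:], text[sa[i]:]
        while lcp[i] < min(len(a), len(b)) and a[lcp[i]] == b[lcp[i]]:
            lcp[i] += 1
    return lcp


# Assumption: a node is an inclusive suffix array interval (lo, hi).
def root(sa):
    return (0, len(sa) - 1)


def is_leaf(node):
    return node[0] == node[1]


def count(node):
    """Number of leaves in the subtree rooted at the node."""
    return node[1] - node[0] + 1


def locate(sa, leaf):
    """Text position of the suffix corresponding to a leaf."""
    assert is_leaf(leaf)
    return sa[leaf[0]]


def lca(lcp, u, v):
    """Lowest common ancestor: smallest LCP interval covering u and v."""
    lo, hi = min(u[0], v[0]), max(u[1], v[1])
    if lo == hi:
        return (lo, hi)
    depth = min(lcp[lo + 1:hi + 1])  # string depth of the ancestor
    while lo > 0 and lcp[lo] >= depth:
        lo -= 1
    while hi + 1 < len(lcp) and lcp[hi + 1] >= depth:
        hi += 1
    return (lo, hi)


text = "banana$"  # '$' is a unique terminator smaller than every letter
sa = suffix_array(text)
lcp = lcp_array(text, sa)
print(lca(lcp, (1, 1), (3, 3)))  # (1, 3): the node for "a", with 3 leaves
```

The minimum over an LCP range is the reason range minimum queries on the LCP array matter for suffix tree navigation.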
Figure 1. An example of our compression of .
Sequence lengths and resources used by index construction for NA12878 relative to the human reference genome with and without chromosome Y. Approx and Inv denote the approximate LCS and the bwt-invariant subsequence, respectively. Sequence lengths are in millions of base pairs, while construction resources are in minutes of wall clock time and gigabytes of memory.
| ChrY | Reference length (M) | Target length (M) | Approx length (M) | Inv length (M) | Time (min) | Memory (GB) | Time (min) | Memory (GB) | Time (min) | Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|---|
| Yes | 3096 | 3036 | 2992 | 2980 | 1.42 | 4.41 | 175 | 84.0 | 629 | 141 |
| No | 3036 | 3036 | 2991 | 2980 | 1.33 | 4.38 | 173 | 82.6 | 593 | 142 |
Various indexes for NA12878 relative to the human reference genome with and without chromosome Y. The total for RST includes the full RFM. Index sizes are in megabytes and in bits per character.
| ChrY | Basic | Full | Basic | Full | Basic | Full | | Total | |
|---|---|---|---|---|---|---|---|---|---|
| Yes | 1248 MB | 2110 MB | 636 MB | 1498 MB | 225 MB | 456 MB | 1233 MB | 1689 MB | 190 MB |
| | 3.45 bpc | 5.83 bpc | 1.76 bpc | 4.14 bpc | 0.62 bpc | 1.26 bpc | 3.41 bpc | 4.67 bpc | 0.52 bpc |
| No | 1248 MB | 2110 MB | 636 MB | 1498 MB | 186 MB | 400 MB | 597 MB | 997 MB | 163 MB |
| | 3.45 bpc | 5.83 bpc | 1.76 bpc | 4.14 bpc | 0.51 bpc | 1.11 bpc | 1.65 bpc | 2.75 bpc | 0.45 bpc |
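The two units in the table above are related by the target sequence length. As a hedged sketch of the conversion, assuming 1 MB means 2^20 bytes and taking the rounded target length of 3036 million characters from the construction table (both are assumptions, and the function name is hypothetical):

```python
def bits_per_character(size_mb, length_chars):
    """Convert an index size in MB to bits per character.

    Assumption: 1 MB = 2**20 bytes.
    """
    return size_mb * 2**20 * 8 / length_chars


TARGET_LENGTH = 3036e6  # rounded target length, in characters

for size_mb in (1248, 636, 186):
    print(size_mb, "MB ->", round(bits_per_character(size_mb, TARGET_LENGTH), 2), "bpc")
```

Under these assumptions the sizes 1248 MB, 636 MB and 186 MB come out at 3.45, 1.76 and 0.51 bpc, matching the listed values.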
Breakdown of component sizes in the RFM index for NA12878 relative to the human reference genome with and without chromosome Y in bits per character.
| | Basic, ChrY: Yes (bpc) | Basic, ChrY: No (bpc) | Full, ChrY: Yes (bpc) | Full, ChrY: No (bpc) |
|---|---|---|---|---|
| | 0.12 | 0.05 | 0.14 | 0.06 |
| | 0.05 | 0.05 | 0.06 | 0.06 |
| | 0.45 | 0.42 | 0.52 | 0.45 |
| | – | – | 0.35 | 0.35 |
| | – | – | 0.12 | 0.12 |
| | – | – | 0.06 | 0.06 |
Bold values emphasize the base structure (RFM).
Breakdown of component sizes in the RLCP array for NA12878 relative to the human reference genome with and without chromosome Y. The number of phrases, average phrase length and the component sizes in bits per character. ‘Parse’ contains and , ‘Literals’ contains and , and ‘Tree’ contains and .
| ChrY | Phrases (million) | Length | Parse (bpc) | Literals (bpc) | Tree (bpc) | Total (bpc) |
|---|---|---|---|---|---|---|
| Yes | 128 | 23.6 | 1.35 | 1.54 | 0.52 | 3.41 |
| No | 94 | 32.3 | 0.97 | 0.41 | 0.27 | 1.65 |
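The phrase, parse and literal columns above suggest a relative Lempel-Ziv style parse of the target array against the reference array. The following is a minimal sketch under that assumption: it uses greedy longest matching with one explicit literal per phrase and quadratic-time search for clarity. The function names and the phrase layout `(reference position, length, literal)` are hypothetical, and the encoding used in the paper may differ.

```python
def relative_parse(reference, target):
    """Greedily parse target into phrases (ref_pos, length, literal).

    Each phrase copies `length` symbols from reference[ref_pos:] and then
    appends one explicit literal symbol taken from the target.
    """
    phrases = []
    i = 0
    while i < len(target):
        best_pos, best_len = 0, 0
        max_len = len(target) - i - 1  # leave room for the literal
        for pos in range(len(reference)):
            length = 0
            while (length < max_len
                   and pos + length < len(reference)
                   and reference[pos + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        phrases.append((best_pos, best_len, target[i + best_len]))
        i += best_len + 1
    return phrases


def decode(reference, phrases):
    """Rebuild the target from the reference and the phrase list."""
    out = []
    for pos, length, literal in phrases:
        out.extend(reference[pos:pos + length])
        out.append(literal)
    return out


reference = list("ACGTACGT")
target = list("ACGAACGT")
phrases = relative_parse(reference, target)
print(phrases)  # [(0, 3, 'A'), (0, 3, 'T')]
```

Fewer, longer phrases mean a smaller parse, which is consistent with the table: the run without chromosome Y has fewer phrases (94 million vs. 128 million) and a smaller total.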
Average query times in microseconds for 10 million random queries in the full SSA, the full SSA-RRR and the full RFM for NA12878 relative to the human reference genome with and without chromosome Y.
Query times in microseconds in the LCP array (slarray) and the RLCP array for NA12878 relative to the human reference genome with and without chromosome Y. For the random queries, the query times are averages over 100 million queries. The range lengths for the rmq queries were (for ) with probability . For sequential access, we list the average time per position for scanning the entire array.
Figure 2. Average find and locate times in microseconds per occurrence for 2 million patterns of length 32 with a total of 255 million occurrences on NA12878 relative to the human reference genome without chromosome Y. Left: query time vs. suffix array sample interval. Right: query time vs. index size in bits per character.
Figure 3. Index size in bits per character vs. mutation rate for 25 synthetic sequences relative to a 20 MB reference.
Compressed suffix trees for the maternal haplotypes of NA12878 relative to the human reference genome without chromosome Y. Component choices; index size in bits per character; average time in microseconds per node for preorder traversal; and average time in microseconds per character for finding maximal substrings shared with the paternal haplotypes of chromosome 1 of NA12878 using forward and backward algorithms. The figures in parentheses are estimates based on the progress made in the first 24 hours.
| | | | Size (bpc) | Traversal (μs) | Maximal substrings: Forward (μs) | Maximal substrings: Backward (μs) |
|---|---|---|---|---|---|---|
| | | | 12.33 | | | |
| | | | 10.79 | | | |
| | | | 18.08 | | | |
| | | – | 4.98 | | | |
| | | | 2.75 | | | |
| | | | 3.21 | | | |