Michaël Vyverman, Bernard De Baets, Veerle Fack, Peter Dawyndt.
Abstract
Steady advances in sequencing technology, which produce large amounts of data, combined with innovative bioinformatics approaches designed to cope with this data flood, have led to interesting new results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less well understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article presents a comprehensive, state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, and their practical limitations are explained and compared.
Year: 2012 PMID: 22584621 PMCID: PMC3424560 DOI: 10.1093/nar/gks408
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Suffix tree for string S = ACATACAGATG$, where $ is the special end-character. Each number i inside a leaf represents suffix S[i..] of the string S. Dashed arrows correspond to suffix links. Edges are arranged in lexicographical order. For brevity, edge labels that spell out the rest of the suffix corresponding to the leaf the edge is connected with show only their first characters, followed by two dots and the special end-character $.
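The structure described in this caption can be illustrated with a minimal sketch. It assumes a naive quadratic-time insertion of every suffix, stores edge labels as plain strings, and does not build the suffix links drawn in the figure; the names `Node`, `build_suffix_tree` and `leaves_in_order` are hypothetical and not taken from the article.

```python
class Node:
    """A suffix tree node; leaves carry the start position of their suffix."""

    def __init__(self, suffix_index=None):
        self.children = {}  # first character of edge label -> (label, child)
        self.suffix_index = suffix_index


def build_suffix_tree(s):
    """Naive O(n^2) construction. Assumes s ends in a unique '$'."""
    if not s.endswith("$") or s.count("$") != 1:
        raise ValueError("text must end in a single, unique '$'")
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            first = rest[0]
            if first not in node.children:
                # No edge starts with this character: attach a new leaf.
                node.children[first] = (rest, Node(suffix_index=i))
                break
            label, child = node.children[first]
            # Length of the common prefix of the edge label and the suffix.
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Mismatch inside the edge: split it at position k.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def leaves_in_order(node):
    """Leaf labels, visiting edges in lexicographical order."""
    if not node.children:
        return [node.suffix_index]
    out = []
    for first in sorted(node.children):
        out.extend(leaves_in_order(node.children[first][1]))
    return out


tree = build_suffix_tree("ACATACAGATG$")
print(leaves_in_order(tree))
```

Because the edges are visited in lexicographical order, the leaf numbers read from left to right form the suffix array of S. Practical implementations use linear-time construction algorithms instead of this quadratic approach.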
Arrays used by enhanced suffix arrays (columns 2–5), compressed suffix arrays (columns 2, 6 and 7) and FM-indexes (columns 8–14) for string S = ACATACAGATG$
| i | SA | LCP | child | suffix link | SA⁻¹ | Ψ | BWT | rank $ | rank A | rank C | rank G | rank T | LF | suffix |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | −1 | | | 2 | 2 | G | 0 | 0 | 0 | 1 | 0 | 8 | $ |
| 1 | 4 | 0 | 6 | [0..11] | 7 | 6 | T | 0 | 0 | 0 | 1 | 1 | 10 | ACAGATG$ |
| 2 | 0 | 3 | 2 | [6..7] | 4 | 7 | $ | 1 | 0 | 0 | 1 | 1 | 0 | ACATACAGATG$ |
| 3 | 6 | 1 | 4 | [0..11] | 10 | 9 | C | 1 | 0 | 1 | 1 | 1 | 6 | AGATG$ |
| 4 | 2 | 1 | 5 | | 1 | 10 | C | 1 | 0 | 2 | 1 | 1 | 7 | ATACAGATG$ |
| 5 | 8 | 2 | 3 | [10..11] | 6 | 11 | G | 1 | 0 | 2 | 2 | 1 | 9 | ATG$ |
| 6 | 5 | 0 | 8 | | 3 | 3 | A | 1 | 1 | 2 | 2 | 1 | 1 | CAGATG$ |
| 7 | 1 | 2 | 7 | [1..5] | 9 | 4 | A | 1 | 2 | 2 | 2 | 1 | 2 | CATACAGATG$ |
| 8 | 10 | 0 | 10 | | 5 | 0 | T | 1 | 2 | 2 | 2 | 2 | 11 | G$ |
| 9 | 7 | 1 | 9 | [0..11] | 11 | 5 | A | 1 | 3 | 2 | 2 | 2 | 3 | GATG$ |
| 10 | 3 | 0 | | | 8 | 1 | A | 1 | 4 | 2 | 2 | 2 | 4 | TACAGATG$ |
| 11 | 9 | 1 | 11 | [0..11] | 0 | 8 | A | 1 | 5 | 2 | 2 | 2 | 5 | TG$ |
From left to right: index position, suffix array, LCP array, child array, suffix link array, inverse suffix array, Ψ-array, BWT text, ‘rank’ array, LF-mapping array and suffixes of string S. FM-indexes also require an array C(S).
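As an illustration of how several of these arrays relate to one another, the following sketch recomputes SA, LCP, SA⁻¹, Ψ, BWT and LF for the example string. It assumes naive suffix sorting, which is adequate only for short texts, and a text ending in a unique `$` that sorts before every other character. The function name `build_arrays` is hypothetical, and the child and suffix link arrays are not covered.

```python
def build_arrays(s):
    """Return (SA, LCP, inverse SA, Psi, BWT, LF) for a text ending in '$'."""
    n = len(s)
    # Naive suffix sorting; real tools use linear-time construction.
    sa = sorted(range(n), key=lambda i: s[i:])

    # Inverse suffix array: isa[SA[i]] = i.
    isa = [0] * n
    for row, pos in enumerate(sa):
        isa[pos] = row

    def common_prefix(a, b):
        k = 0
        while a + k < n and b + k < n and s[a + k] == s[b + k]:
            k += 1
        return k

    # LCP[i] is the longest common prefix of suffixes SA[i-1] and SA[i];
    # LCP[0] is set to -1 as in the table.
    lcp = [-1] + [common_prefix(sa[i - 1], sa[i]) for i in range(1, n)]

    # Psi[i] = SA^-1[SA[i] + 1], wrapping around at the end of the text.
    psi = [isa[(sa[i] + 1) % n] for i in range(n)]

    # BWT[i] = S[SA[i] - 1]; Python's s[-1] gives the wrap-around to '$'.
    bwt = "".join(s[sa[i] - 1] for i in range(n))

    # C[c]: number of characters in S that are strictly smaller than c.
    c_array, total = {}, 0
    for ch in sorted(set(s)):
        c_array[ch] = total
        total += s.count(ch)

    # LF[i] = C[BWT[i]] + (occurrences of BWT[i] before position i).
    seen = dict.fromkeys(c_array, 0)
    lf = []
    for ch in bwt:
        lf.append(c_array[ch] + seen[ch])
        seen[ch] += 1

    return sa, lcp, isa, psi, bwt, lf


sa, lcp, isa, psi, bwt, lf = build_arrays("ACATACAGATG$")
print(sa)
print(bwt)
```

The table's 'rank' columns count occurrences up to and including position i, so the LF value computed here equals C plus that rank minus one.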
Conceptual matrix M containing the lexicographically ordered n cyclic shifts of S = ACATACAGATG$
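| i | M[i, 0..11] | BWT[i] | offset[i] | LF[i] |
|---|---|---|---|---|
| 0 | $ACATACAGATG | G | 0 | 8 |
| 1 | ACAGATG$ACAT | T | 0 | 10 |
| 2 | ACATACAGATG$ | $ | 0 | 0 |
| 3 | AGATG$ACATAC | C | 0 | 6 |
| 4 | ATACAGATG$AC | C | 1 | 7 |
| 5 | ATG$ACATACAG | G | 1 | 9 |
| 6 | CAGATG$ACATA | A | 0 | 1 |
| 7 | CATACAGATG$A | A | 1 | 2 |
| 8 | G$ACATACAGAT | T | 1 | 11 |
| 9 | GATG$ACATACA | A | 2 | 3 |
| 10 | TACAGATG$ACA | A | 3 | 4 |
| 11 | TG$ACATACAGA | A | 4 | 5 |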
M[0..11,0] contains the lexicographically ordered characters of S and M[0..11,11] equals BWT(S). The last two columns are required for the inverse transformation. offset[i] stores the number of times BWT[i] has appeared earlier in BWT(S). The last column LF[i] contains pointers used during the inverse transformation algorithm: if S[i] = BWT[j], then BWT[LF[j]] = S[i − 1].
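The inverse transformation described in this caption can be sketched as follows. The sketch assumes that `$` is unique and sorts before every other character, so that row 0 of M starts with `$`; the function name `inverse_bwt` and the helper `first_row` (the row at which the block of rows starting with a given character begins) are hypothetical.

```python
def inverse_bwt(bwt):
    """Recover S from BWT(S) using the offset and LF arrays."""
    # offset[i]: number of times bwt[i] has appeared earlier in bwt.
    counts, offset = {}, []
    for ch in bwt:
        offset.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1

    # first_row[c]: first row of M whose first character is c.
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    # LF[i] points to the row of M that starts with this occurrence of bwt[i].
    lf = [first_row[ch] + off for ch, off in zip(bwt, offset)]

    # Row 0 starts with '$', so bwt[0] is the last character before '$'.
    # Following LF walks through S from right to left.
    reversed_text = []
    row = 0
    for _ in range(len(bwt) - 1):
        reversed_text.append(bwt[row])
        row = lf[row]
    return "".join(reversed(reversed_text)) + "$", offset, lf


text, offset, lf = inverse_bwt("GT$CCGAATAAA")
print(text)
```

Each step applies the property stated above: if the current row j satisfies S[i] = BWT[j], then row LF[j] yields S[i − 1].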
Representative memory requirements for different index structure implementations, expressed both as bits per indexed character (column 2) and estimated size in megabytes for several known genomes (columns 3–5)
| Index structure | Bits/char | Yeast (MB) | Fruit fly (MB) | Human (MB) | Reference |
|---|---|---|---|---|---|
| **Original sequence (2-bit encoding)** | 2 | 3 | 35 | 775 | NCBI |
| CSA Grossi | 2.4 | 4 | 42 | 931 | ( |
| FM-index | 3.36 | 5 | 59 | 1302 | ( |
| SSA (best) | 4 | 6 | 70 | 1551 | ( |
| CST Russo | 5 | 8 | 87 | 1939 | ( |
| CSA Sadakane (best) | 5.6 | 8 | 98 | 2171 | ( |
| LZ-index (best) | 6.64 | 10 | 116 | 2574 | ( |
| **Original sequence (byte encoding)** | 8 | 12 | 139 | 3102 | NCBI |
| CST Navarro | 12 | 18 | 209 | 4653 | ( |
| SSA (worst) | 20 | 30 | 349 | 7754 | ( |
| CST Sadakane | 30 | 45 | 523 | 11 632 | ( |
| LZ-index (worst) | 35.2 | 53 | 614 | 13 648 | ( |
| Suffix array | 40 | 60 | 697 | 15 509 | ( |
| Enhanced SA | 72 | 109 | 1255 | 27 916 | ( |
| WOTD suffix tree | 76 | 115 | 1325 | 29 467 | ( |
| ST McCreight | 232 | 350 | 4045 | 89 952 | ( |
Column 6 contains references to the original theoretical proposals and an additional reference to the articles from which these practical estimates originate. For ease of comparison, the index structures are sorted by increasing memory requirements. As a reference, the original (non-indexed) sequence is also included (bold), stored using both 2-bit encoding and byte encoding.
(a) Genome sizes were taken from the NCBI genome information pages (http://www.ncbi.nlm.nih.gov/genome) of Saccharomyces cerevisiae (yeast), Drosophila melanogaster (fruit fly) and Homo sapiens (human).
(b) Mean of the interval of possible memory requirements given in (62).
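The size columns follow from the bits-per-character figure by simple arithmetic, which the sketch below illustrates. The exact genome lengths and the megabyte convention behind the table are not stated in this section, so the parameter `n_chars`, the default of 2^20 bytes per megabyte and the function name `index_size_mb` are assumptions made for illustration only.

```python
def index_size_mb(bits_per_char, n_chars, bytes_per_mb=2**20):
    """Estimated index size in MB for a text of n_chars characters.

    bytes_per_mb is an assumption; the table's convention is not stated.
    """
    return bits_per_char * n_chars / 8 / bytes_per_mb


# Hypothetical text of 2**20 characters (1 MB when byte-encoded).
n = 2**20
print(index_size_mb(8, n))   # byte encoding
print(index_size_mb(2, n))   # 2-bit encoding
print(index_size_mb(40, n))  # plain suffix array at 40 bits/char
```

Whatever the megabyte convention, the ratio between two rows depends only on their bits-per-character values. A 40 bits/char suffix array is therefore five times the size of the byte-encoded sequence, which agrees with the human genome column (15 509 MB versus 3102 MB) up to rounding.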
Figure 2. Wavelet tree for indexing string S = GT$CCGAATAAA. Only the binary strings are stored in practice. Subsequences of S are shown only to ease the interpretation. This figure does not include data structures for resolving rank and select queries for every bit vector. For this small example, however, the answer to these queries is straightforward.
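A minimal sketch of a wavelet tree over this string is given below. It assumes that each node splits its sorted alphabet in half, which may differ from the partition drawn in the figure, and it answers rank queries on the bit vectors by plain counting instead of using the auxiliary rank and select structures that a practical implementation would need. The class name `WaveletTree` is hypothetical.

```python
class WaveletTree:
    """Wavelet tree storing one bit vector per internal node."""

    def __init__(self, text, alphabet=None):
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        if len(self.alphabet) <= 1:
            self.bits = None  # leaf: a single character, nothing to store
            return
        mid = len(self.alphabet) // 2
        left_chars = set(self.alphabet[:mid])
        # Bit 0 sends a character to the left child, bit 1 to the right.
        self.bits = [0 if ch in left_chars else 1 for ch in text]
        self.left = WaveletTree(
            [ch for ch in text if ch in left_chars], self.alphabet[:mid])
        self.right = WaveletTree(
            [ch for ch in text if ch not in left_chars], self.alphabet[mid:])

    def rank(self, ch, i):
        """Number of occurrences of ch in the first i characters."""
        if self.bits is None:
            return i if [ch] == self.alphabet else 0
        mid = len(self.alphabet) // 2
        if ch in self.alphabet[:mid]:
            # Naive counting; real implementations answer this in O(1).
            return self.left.rank(ch, self.bits[:i].count(0))
        return self.right.rank(ch, self.bits[:i].count(1))


wt = WaveletTree("GT$CCGAATAAA")
print(wt.rank("A", 12))
print(wt.rank("C", 5))
```

A rank query descends one level per bit of the character's code, so its cost depends on the alphabet size and not on the text length, provided the bit-vector rank at each level takes constant time.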