Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, Taher Mun.
Abstract
High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice and thus can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build a 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.
Keywords: Burrows-Wheeler Transform; Compression-aware algorithms; Genomic databases; Prefix-free parsing
Year: 2019 PMID: 31149025 PMCID: PMC6534911 DOI: 10.1186/s13015-019-0148-5
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
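The one-pass parsing described in the abstract can be illustrated on the running example. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes w = 2, a hand-picked trigger set {AC, AG, T!} (chosen so that the phrases match the dictionary elements #GATTAC, ACAT!, T!GATAC, T!GATTAG and AGATA$$ visible in the example table), and padding with one # in front and w copies of $ at the end. The function names are hypothetical; in a real implementation the trigger windows would presumably be chosen by hashing each length-w window rather than listed by hand.

```python
def prefix_free_parse(text, w, triggers):
    """Split text into overlapping phrases that begin and end with a trigger window."""
    padded = "#" + text + "$" * w          # assumed padding, as in #GATTAC and AGATA$$
    phrases, start = [], 0
    for end in range(w + 1, len(padded) + 1):
        window = padded[end - w:end]
        if window in triggers or end == len(padded):
            phrases.append(padded[start:end])
            start = end - w                # consecutive phrases overlap by w characters
    dictionary = sorted(set(phrases))      # D: the distinct phrases
    rank = {phrase: i for i, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]   # P: the text as a sequence of ranks
    return dictionary, parse


def reconstruct(dictionary, parse, w):
    """Rebuild the text from D and P by dropping the overlaps and the padding."""
    phrases = [dictionary[r] for r in parse]
    padded = phrases[0] + "".join(p[w:] for p in phrases[1:])
    return padded[1:len(padded) - w]


D, P = prefix_free_parse("GATTACAT!GATACAT!GATTAGATA", 2, {"AC", "AG", "T!"})
print(D)  # ['#GATTAC', 'ACAT!', 'AGATA$$', 'T!GATAC', 'T!GATTAG']
print(P)  # [0, 1, 3, 1, 4, 2]
```

The phrase ACAT! occurs twice in the text but is stored once in D, which is why D and P together can be much smaller than a repetitive input.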
Fig. 1. The suffix trie for our example with the three strings GATTACAT, GATACAT and GATTAGATA. The input is shown at the bottom, in red because we do not need to store it
Fig. 2. The suffix tree for our example. We now also need to store the input
Fig. 3. The suffix array for our example is the sequence of values stored in the leaves of the tree (which we need not store explicitly). The LF mapping is shown as the arrows between two copies of the suffix array; the arrows to values i such that T[SA[i]] = A are in red, to illustrate that they point to consecutive positions in the suffix array and do not cross. Since Ψ is the inverse of the LF mapping, it can be obtained by simply reversing the direction of the arrows
Fig. 4. The BWT and the sorted list of characters for our example. Drawing arrows between corresponding occurrences of characters in the two strings gives us the diagram for the LF-mapping
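The construction in Fig. 4 can be sketched in a few lines: pairing the k-th occurrence of each character in the BWT with its k-th occurrence in the sorted list is a stable sort of BWT positions. This is an illustrative sketch only. The BWT string is read off the example table, the character order $ < ! < A < C < G < T is an assumption inferred from the order of the suffixes in that table, and the function names are hypothetical.

```python
ALPHABET = "$!ACGT"   # assumed character order, inferred from the example's suffix order


def lf_mapping(bwt, alphabet=ALPHABET):
    """Map each BWT position to the position of the same character occurrence in sorted order."""
    rank = {c: i for i, c in enumerate(alphabet)}
    # A stable sort keeps equal characters in their original relative order,
    # so the arrows for one character never cross.
    order = sorted(range(len(bwt)), key=lambda i: rank[bwt[i]])
    lf = [0] * len(bwt)
    for sorted_pos, bwt_pos in enumerate(order):
        lf[bwt_pos] = sorted_pos
    return lf


def invert_bwt(bwt, alphabet=ALPHABET):
    """Recover the text by following the LF mapping backwards from the row of the suffix '$'."""
    lf = lf_mapping(bwt, alphabet)
    chars, i = [], 0                      # row 0 holds the suffix "$"
    for _ in range(len(bwt) - 1):
        chars.append(bwt[i])              # the character preceding the current suffix
        i = lf[i]
    return "".join(reversed(chars)) + "$"


bwt = "ATTTTTTCCGGGGAAA!$!AAATATAA"
print(invert_bwt(bwt))  # GATTACAT!GATACAT!GATTAGATA$
```

Reversing the list `lf` (swapping each index with its value) gives the inverse mapping, which corresponds to reversing the arrows in the diagram.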
The information we compute for our example
| Rank | Mapped characters | Suffix | Sources | Frequency | Preceding partial sum |
|---|---|---|---|---|---|
| 0 | A | #GATTAC | 1 | 1 | 0 |
| 1 | T | !GATAC | 2 | 1 | 1 |
| 2 | T | !GATTAG | 3 | 1 | 2 |
| 3 | T | A$$ | 5 | 1 | 3 |
| 4 | T | ACAT! | 4 | 2 | 4 |
| 5 | T | AGATA$$ | 5 | 1 | 6 |
| 6 | C | AT! | 4 | 2 | 7 |
| 7 | G | ATA$$ | 5 | 1 | 9 |
| 8 | G | ATAC | 2 | 1 | 10 |
| 9 | G | ATTAC | 1 | 1 | 11 |
| 10 | G | ATTAG | 3 | 1 | 12 |
| 11 | A | CAT! | 4 | 2 | 13 |
| 12 | A | GATA$$ | 5 | 1 | 15 |
| 13 | ! | GATAC | 2 | 1 | 16 |
| 14 | $ | GATTAC | 1 | 1 | 17 |
| 15 | ! | GATTAG | 3 | 1 | 18 |
| 16 | A | T!GATAC | 2 | 1 | 19 |
| 17 | A | T!GATTAG | 3 | 1 | 20 |
| 18 | A | TA$$ | 5 | 1 | 21 |
| 19 | T, A | TAC | 1; 2 | 2 | 22 |
| 20 | T | TAG | 3 | 1 | 24 |
| 21 | A | TTAC | 1 | 1 | 25 |
| 22 | A | TTAG | 3 | 1 | 26 |
Each line shows the lexicographic rank r of an element s; the characters mapped to r; s itself; the elements of D from which the mapped characters originate; the total frequency with which characters are mapped to r; and the preceding partial sum of the frequencies
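The rank, mapped-character, frequency and partial-sum columns of this table can be recomputed with a short script. This is a sketch under assumptions rather than the paper's algorithm: it walks over every phrase occurrence (so it is not space-efficient), it assumes w = 2 and the trigger set {AC, AG, T!}, it takes the character order # < ! < $ < A < C < G < T from the order of the rows, and it reports the leading # as $ because that is what the table shows for GATTAC. All names are hypothetical.

```python
from collections import Counter, defaultdict

ORDER = {c: i for i, c in enumerate("#!$ACGT")}   # assumed order, read off the table rows


def phrases_of(text, w, triggers):
    """Phrases of the padded text in order of occurrence (same parsing assumptions as above)."""
    padded = "#" + text + "$" * w
    phrases, start = [], 0
    for end in range(w + 1, len(padded) + 1):
        if padded[end - w:end] in triggers or end == len(padded):
            phrases.append(padded[start:end])
            start = end - w
    return phrases


def suffix_table(phrases, w):
    """Rows (rank, mapped characters, suffix, frequency, preceding partial sum)."""
    freq, mapped = Counter(), defaultdict(set)
    for k, phrase in enumerate(phrases):
        previous = phrases[k - 1]                  # wraps around for the first phrase
        for j in range(len(phrase) - w):           # suffixes longer than w
            # A whole phrase is preceded by the character just before the
            # trailing trigger of the previous phrase.
            ch = phrase[j - 1] if j > 0 else previous[-w - 1]
            ch = "$" if ch == "#" else ch          # the table shows $ in place of #
            freq[phrase[j:]] += 1
            mapped[phrase[j:]].add(ch)
    rows, partial = [], 0
    for r, s in enumerate(sorted(freq, key=lambda s: [ORDER[c] for c in s])):
        rows.append((r, sorted(mapped[s]), s, freq[s], partial))
        partial += freq[s]
    return rows


rows = suffix_table(phrases_of("GATTACAT!GATACAT!GATTAGATA", 2, {"AC", "AG", "T!"}), 2)
print(rows[19])  # (19, ['A', 'T'], 'TAC', 2, 22)
```

The partial sum of a row is the BWT position where the characters mapped to that row begin, for example positions 22 and 23 for TAC.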
The BWT for T = GATTACAT!GATACAT!GATTAGATA$
| Position | Character | Suffix |
|---|---|---|
| 0 | A | $ |
| 1 | T | !GATACAT!GATTAGATA$ |
| 2 | T | !GATTAGATA$ |
| 3 | T | A$ |
| 4 | T | ACAT!GATACAT!GATTAGATA$ |
| 5 | T | ACAT!GATTAGATA$ |
| 6 | T | AGATA$ |
| 7 | C | AT!GATACAT!GATTAGATA$ |
| 8 | C | AT!GATTAGATA$ |
| 9 | G | ATA$ |
| 10 | G | ATACAT!GATTAGATA$ |
| 11 | G | ATTACAT!GATACAT!GATTAGATA$ |
| 12 | G | ATTAGATA$ |
| 13 | A | CAT!GATACAT!GATTAGATA$ |
| 14 | A | CAT!GATTAGATA$ |
| 15 | A | GATA$ |
| 16 | ! | GATACAT!GATTAGATA$ |
| 17 | $ | GATTACAT!GATACAT!GATTAGATA$ |
| 18 | ! | GATTAGATA$ |
| 19 | A | T!GATACAT!GATTAGATA$ |
| 20 | A | T!GATTAGATA$ |
| 21 | A | TA$ |
| 22 | T | TACAT!GATACAT!GATTAGATA$ |
| 23 | A | TACAT!GATTAGATA$ |
| 24 | T | TAGATA$ |
| 25 | A | TTACAT!GATACAT!GATTAGATA$ |
| 26 | A | TTAGATA$ |
Each line shows a position in the BWT; the character in that position; and the suffix immediately following that character in T
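For a text this small, the table can be reproduced directly by sorting all suffixes. The sketch below is a naive baseline, not the prefix-free parsing method, and it assumes the character order $ < ! < A < C < G < T implied by the order of the suffixes in the table. The function name is hypothetical.

```python
def naive_bwt(text, alphabet="$!ACGT"):
    """Sort all suffixes and take the character preceding each one (cyclically)."""
    rank = {c: i for i, c in enumerate(alphabet)}
    suffix_array = sorted(range(len(text)), key=lambda i: [rank[c] for c in text[i:]])
    # text[i - 1] wraps to the final '$' when i == 0.
    bwt = "".join(text[i - 1] for i in suffix_array)
    return bwt, suffix_array


text = "GATTACAT!GATACAT!GATTAGATA$"
bwt, sa = naive_bwt(text)
print(bwt)  # ATTTTTTCCGGGGAAA!$!AAATATAA
for position in range(3):
    print(position, bwt[position], text[sa[position]:])
```

This approach materializes every suffix, so its time and memory grow far faster than the input, which is the cost that building the BWT from D and P is meant to avoid.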
The dictionary and parse sizes for several files from the Pizza and Chili repetitive corpus, with three settings of the parameters w and p
| File | Size | Setting 1: Dict. | Setting 1: Parse | Setting 1: % | Setting 2: Dict. | Setting 2: Parse | Setting 2: % | Setting 3: Dict. | Setting 3: Parse | Setting 3: % |
|---|---|---|---|---|---|---|---|---|---|---|
| cere | 440 | 61 | 77 | 31 | 43 | 159 | 46 | | | |
| cere_no_Ns | 409 | 33 | 77 | 27 | | | | 60 | 17 | 19 |
| dna.001.1 | 100 | 8 | 20 | 27 | | | | 21 | 4 | 25 |
| einstein.en.txt | 446 | 2 | 87 | 20 | 3 | 39 | 9 | | | |
| influenza | 148 | 16 | 28 | 30 | | | | 49 | 6 | 37 |
| kernel | 247 | 14 | 52 | 26 | 14 | 20 | 13 | | | |
| world_leaders | 45 | 5 | 5 | 21 | | | | 11 | 1 | 26 |
| world_leaders_no_dots | 23 | 4 | 5 | 34 | | | | 7 | 1 | 33 |
All sizes are reported in megabytes; percentages are the sums of the sizes of the dictionaries and parses, divided by the sizes of the uncompressed files
For each file, the sizes are in italics for the settings with the best overall compression
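The percentage columns follow directly from the definition in the table note. As a small worked check (the function name is hypothetical), the values for cere and einstein.en.txt under the first setting can be recomputed from the rounded megabyte figures:

```python
def combined_percent(dict_mb, parse_mb, file_mb):
    """Dictionary plus parse size as a percentage of the uncompressed file size."""
    return 100.0 * (dict_mb + parse_mb) / file_mb


print(round(combined_percent(61, 77, 440)))  # 31  (cere)
print(round(combined_percent(2, 87, 446)))   # 20  (einstein.en.txt)
```

Because the table reports sizes rounded to whole megabytes, recomputed percentages for the smallest files can differ from the reported ones by a few points.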
The dictionary and parse sizes for prefixes of a database of Salmonella genomes, with three settings of the parameters w and p
| Number of genomes | Size | Setting 1: Dict. | Setting 1: Parse | Setting 1: % | Setting 2: Dict. | Setting 2: Parse | Setting 2: % | Setting 3: Dict. | Setting 3: Parse | Setting 3: % |
|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 249 | 68 | 43 | 44 | | | | 91 | 10 | 40 |
| 100 | 485 | 83 | 85 | 35 | | | | 122 | 19 | 29 |
| 500 | 2436 | 273 | 424 | 29 | 314 | 194 | 21 | | | |
| 1000 | 4861 | 475 | 847 | 27 | 541 | 388 | 19 | | | |
| 5000 | 24936 | 2663 | 4334 | 28 | 2915 | 1987 | 20 | | | |
| 10,000 | 49420 | 4190 | 8611 | 26 | 4652 | 3939 | 17 | | | |
Again, all sizes are reported in megabytes; percentages are the sums of the sizes of the dictionaries and parses, divided by the sizes of the uncompressed files
For each prefix, the sizes are in italics for the settings with the best overall compression
Time (seconds) and peak memory consumption (megabytes) of BWT calculations for prefixes of a database of Salmonella genomes, for three settings of the parameters w and p and for the comparison method simplebwt
| Number of genomes | Setting 1: Time | Setting 1: Peak | Setting 2: Time | Setting 2: Peak | Setting 3: Time | Setting 3: Peak | simplebwt: Time | simplebwt: Peak |
|---|---|---|---|---|---|---|---|---|
| 50 | 71 | | 63 | 642 | 65 | 782 | | 2247 |
| 100 | 118 | | | 837 | 102 | 1059 | 103 | 4368 |
| 500 | 570 | | 443 | 2742 | | 3304 | 565 | 21,923 |
| 1000 | 1155 | | 876 | 4789 | | 5659 | 1377 | 43,751 |
| 5000 | 7412 | | 5436 | 46,040 | | 51,848 | 11,600 | 224,423 |
| 10,000 | 19,152 | | 12,298 | 74,500 | | 84,467 | 43,657 | 444,780 |
For each prefix, the time and memory are in italics for the sets which minimize them
Time (seconds) and peak memory consumption (megabytes) of BWT calculations on various files from the Pizza and Chili repetitive corpus, for three settings of the parameters w and p and for the comparison method simplebwt
| File | Setting 1: Time | Setting 1: Peak | Setting 2: Time | Setting 2: Peak | Setting 3: Time | Setting 3: Peak | simplebwt: Time | simplebwt: Peak |
|---|---|---|---|---|---|---|---|---|
| cere | 90 | 603 | 79 | | | 801 | 90 | 3962 |
| einstein.en.txt | 53 | 196 | 40 | 88 | | | 97 | 4016 |
| influenza | | | 27 | 284 | 33 | 435 | 30 | 1331 |
| kernel | 43 | 170 | 29 | | | 144 | 50 | 2216 |
| world_leaders | 7 | | 7 | 74 | | 98 | 7 | 405 |
For each file, the time and memory are in italics for the settings which minimize them
Fig. 5. Results versus various choices of parameters (w, p) on a collection of 1000 Salmonella genomes (2.7 GB)
Fig. 6. RLFM indexing efficiency for successively larger collections of genetically distinct human chr19s. Results for the prefix-free parsing step (“pfbwt”) are shown alongside the overall RLFM index-building (“rlfm_total”) and Bowtie (“bowtie”) results
Fig. 7. RLFM average “count” query time for successively larger collections of genetically distinct human chr19s
Fig. 8. Computational efficiency of the three stages of index building when SA sampling is included. Results are shown for the prefix-free parsing (“pfbwt”), back-stepping (“bwtscan”) and run-length encoding (“rle”) steps. “total” is the sum of the three steps
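The “count” queries timed in Fig. 7 rely on backward search over the BWT. The following sketch shows the idea on the small example BWT; it is not the RLFM index itself, since it uses plain string counting for rank queries where a real index would use run-length compressed rank structures. The character order $ < ! < A < C < G < T is an assumption taken from the example, and the function name is hypothetical.

```python
def count_occurrences(bwt, pattern, alphabet="$!ACGT"):
    """Count occurrences of pattern in the text using backward search on its BWT."""
    # first[c] is the number of characters in the text that sort before c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += bwt.count(c)
    lo, hi = 0, len(bwt)                  # current range of matching suffixes
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + bwt[:lo].count(c)     # naive rank query
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


bwt = "ATTTTTTCCGGGGAAA!$!AAATATAA"      # BWT of GATTACAT!GATACAT!GATTAGATA$
print(count_occurrences(bwt, "GATTA"))   # 2
print(count_occurrences(bwt, "AT"))      # 6
```

Each pattern character costs two rank queries, so the query time depends on the pattern length and the rank structure rather than on how many copies of the sequence are indexed.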