Timo Beller, Enno Ohlebusch.
Abstract
BACKGROUND: Recently, Marcus et al. (Bioinformatics 30:3476-83, 2014) proposed to use a compressed de Bruijn graph to describe the relationship between the genomes of many individuals/strains of the same or closely related species. They devised an O(n log g) time algorithm called splitMEM that constructs this graph directly (i.e., without using the uncompressed de Bruijn graph) based on a suffix tree, where n is the total length of the genomes and g is the length of the longest genome. Baier et al. (Bioinformatics 32:497-504, 2016) improved their result.
Keywords: Backward search; Burrows–Wheeler transform; Compressed de Bruijn graph; Pan-genome analysis
Year: 2016 PMID: 27437028 PMCID: PMC4950428 DOI: 10.1186/s13015-016-0083-7
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Fig. 1 The de Bruijn graph for k = 3 and the string ACTACGTACGTACG$ is shown on the left, while its compressed counterpart is shown on the right
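The relationship between the two graphs in Fig. 1 can be reproduced with a small sketch. It assumes k = 3 (which is consistent with the index table for the same string) and the usual contraction rule: an edge u → v is merged when u has exactly one successor and v has exactly one predecessor. The function names are illustrative, and this naive construction is not the index-based algorithm of the paper.

```python
from collections import defaultdict

def de_bruijn(s, k):
    """Distinct k-mers of s plus successor/predecessor maps (naive construction)."""
    succ, pred = defaultdict(set), defaultdict(set)
    nodes = {s[i:i + k] for i in range(len(s) - k + 1)}
    for i in range(len(s) - k):
        u, v = s[i:i + k], s[i + 1:i + k + 1]
        succ[u].add(v)
        pred[v].add(u)
    return nodes, succ, pred

def compress(s, k):
    """Contract non-branching chains; assumes s ends with a unique sentinel '$'."""
    nodes, succ, pred = de_bruijn(s, k)

    def mergeable(u, v):
        # Edge u -> v can be contracted if it is the only way out of u and into v.
        return len(succ[u]) == 1 and len(pred[v]) == 1

    chains = {}  # first k-mer of a chain -> (node label, last k-mer of the chain)
    for u in nodes:
        if len(pred[u]) == 1 and mergeable(next(iter(pred[u])), u):
            continue  # u is absorbed into the chain of its predecessor
        label, cur = u, u
        while len(succ[cur]) == 1:
            nxt = next(iter(succ[cur]))
            if not mergeable(cur, nxt):
                break
            label += nxt[-1]
            cur = nxt
        chains[u] = (label, cur)

    labels = sorted(label for label, _ in chains.values())
    edges = {(label, chains[v][0])
             for label, last in chains.values()
             for v in succ[last]}
    return labels, edges

labels, edges = compress("ACTACGTACGTACG$", 3)
print(labels)
print(sorted(edges))
```

For the example string, the seven distinct 3-mers collapse into four nodes, because only ACG has two successors and only TAC has two predecessors.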
Index data structures of the string ACTACGTACGTACG$
| i | SA | LCP | B_r | B_l | LF | ψ | BWT | S_SA[i] |
|---|---|---|---|---|---|---|---|---|
| 1 | 15 | −1 | 0 | 0 | 10 | 5 | G | $ |
| 2 | 12 | 0 | 1 | 0 | 13 | 6 | T | ACG$ |
| 3 | 8 | 3 | 0 | 0 | 14 | 7 | T | ACGTACG$ |
| 4 | 4 | 7 | 1 | 0 | 15 | 8 | T | ACGTACGTACG$ |
| 5 | 1 | 2 | 0 | 0 | 1 | 9 | $ | ACTACGTACGTACG$ |
| 6 | 13 | 0 | 0 | 0 | 2 | 10 | A | CG$ |
| 7 | 9 | 2 | 0 | 0 | 3 | 11 | A | CGTACG$ |
| 8 | 5 | 6 | 0 | 0 | 4 | 12 | A | CGTACGTACG$ |
| 9 | 2 | 1 | 0 | 1 | 5 | 15 | A | CTACGTACGTACG$ |
| 10 | 14 | 0 | 0 | 0 | 6 | 1 | C | G$ |
| 11 | 10 | 1 | 0 | 0 | 7 | 13 | C | GTACG$ |
| 12 | 6 | 5 | 0 | 1 | 8 | 14 | C | GTACGTACG$ |
| 13 | 11 | 0 | 0 | 0 | 11 | 2 | G | TACG$ |
| 14 | 7 | 4 | 0 | 0 | 12 | 3 | G | TACGTACG$ |
| 15 | 3 | 8 | 0 | 0 | 9 | 4 | C | TACGTACGTACG$ |
| 16 | | −1 | | | | | | |
The suffix array of the string ACTACGTACGTACG$ and related notions are defined in section "Preliminaries". The bit vectors B_r and B_l for k = 3 are explained in section "Computation of right-maximal k-mers and node identifiers"
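The SA, LCP, BWT, LF and ψ entries of the table above can be checked with a naive sketch. It uses the standard definitions with 1-based positions (LF[i] is the row of the suffix starting one position earlier, ψ[i] the row of the suffix starting one position later, both wrapping around at the sentinel) and sorts suffixes directly, which is only practical for short strings. The function name is illustrative; the bit vectors are not computed here because their definition lies outside this excerpt.

```python
def index_structures(s):
    """Naive SA, LCP, BWT, LF and psi for a string s ending in a unique '$'."""
    n = len(s)
    # 1-based start positions of the suffixes in lexicographic order
    sa = sorted(range(1, n + 1), key=lambda p: s[p - 1:])
    isa = {p: row for row, p in enumerate(sa, start=1)}

    def lcp_len(a, b):
        m = 0
        while m < min(len(a), len(b)) and a[m] == b[m]:
            m += 1
        return m

    # LCP has n + 1 entries; the first and the last are -1 by convention
    lcp = ([-1]
           + [lcp_len(s[sa[i - 1] - 1:], s[sa[i] - 1:]) for i in range(1, n)]
           + [-1])
    # BWT[i] is the character preceding suffix SA[i] (the sentinel for SA[i] = 1)
    bwt = "".join(s[p - 2] if p > 1 else s[-1] for p in sa)
    lf = [isa[p - 1] if p > 1 else isa[n] for p in sa]
    psi = [isa[p + 1] if p < n else isa[1] for p in sa]
    return sa, lcp, bwt, lf, psi

sa, lcp, bwt, lf, psi = index_structures("ACTACGTACGTACG$")
for i in range(len(sa)):
    print(i + 1, sa[i], lcp[i], lf[i], psi[i], bwt[i])
```

The printed rows correspond to the columns i, SA, LCP, LF, ψ and BWT of the table.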
Fig. 2 Explicit representation of the compressed de Bruijn graph from Fig. 1
Fig. 3 Implicit representation of the compressed de Bruijn graph from Fig. 1
Fig. 4 The string must be split if the length k prefix of is a right-maximal repeat or the length k prefix of is a left-maximal repeat
Fig. 5 The string has u as suffix and u has as prefix
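The splitting condition in the Fig. 4 caption refers to right-maximal and left-maximal repeats. As a minimal sketch, assuming the standard definitions (a k-mer is right-maximal if it is followed by at least two distinct characters in the string, and left-maximal if it is preceded by at least two distinct characters), such k-mers can be found by direct scanning. The function name is hypothetical, and the scan is far less efficient than an index-based computation.

```python
from collections import defaultdict

def maximal_kmers(s, k):
    """Return (right-maximal k-mers, left-maximal k-mers) of s by direct scanning."""
    right, left = defaultdict(set), defaultdict(set)
    for i in range(len(s) - k + 1):
        kmer = s[i:i + k]
        if i + k < len(s):
            right[kmer].add(s[i + k])   # character following this occurrence
        if i > 0:
            left[kmer].add(s[i - 1])    # character preceding this occurrence
    right_max = {w for w, chars in right.items() if len(chars) >= 2}
    left_max = {w for w, chars in left.items() if len(chars) >= 2}
    return right_max, left_max

right_max, left_max = maximal_kmers("ACTACGTACGTACG$", 3)
print(right_max, left_max)
```

For the example string and k = 3, ACG is followed by both T and $, and TAC is preceded by both C and G, so these are the only branching k-mers.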
Runtime and maximum main memory usage for the construction of the compressed de Bruijn graph
| k | Algorithm | 40 | 62 | 7 × Chr1 | 7 × HG |
|---|---|---|---|---|---|
| init | SplitMEM | 117 (315.25) | 141 (317.00) | − | − |
| init | A1, A2 | 38 (5.00) | 64 (5.00) | 380 (5.00) | − |
| init | A3, A4 | 131 (1.32) | 202 (1.24) | 1168 (1.24) | 20,341 (1.24) |
| 50 | SplitMEM | 2261 (572.19) | − | − | − |
| 50 | A1 | 57 (5.22) | 92 (5.34) | 596 (6.20) | − |
| 50 | A2 | 61 (8.49) | 97 (8.78) | 619 (9.98) | − |
| 50 | A3 | 188 (2.23) | 300 (2.26) | 1733 (3.07) | 29,816 (2.77) |
| 50 | A3compr1 | 208 (1.81) | 346 (1.85) | 1880 (2.66) | 31,472 (2.36) |
| 50 | A3compr2 | 236 (1.63) | 374 (1.66) | 2318 (2.51) | 39,366 (2.22) |
| 50 | A4 | 164 (1.75) | 254 (1.82) | 1419 (1.28) | 25,574 (1.96) |
| 50 | A4compr1 | 167 (1.46) | 257 (1.53) | 1435 (1.28) | 25,866 (1.66) |
| 50 | A4compr2 | 179 (1.32) | 272 (1.24) | 1526 (1.24) | 27,365 (1.39) |
| 50 | A4+explicit | 172 (3.26) | 268 (3.35) | 1515 (3.59) | 27,619 (3.88) |
| 50 | A4compr1+explicit | 176 (2.97) | 271 (3.06) | 1541 (3.31) | 28,044 (3.64) |
| 50 | A4compr2+explicit | 188 (2.66) | 289 (2.74) | 1629 (2.96) | 29,517 (3.38) |
| 100 | SplitMEM | 2568 (572.20) | − | − | − |
| 100 | A1 | 59 (5.00) | 95 (5.00) | 595 (5.95) | − |
| 100 | A2 | 62 (7.89) | 99 (8.19) | 605 (9.74) | − |
| 100 | A3 | 188 (1.63) | 299 (1.68) | 1738 (2.74) | 27,815 (2.23) |
| 100 | A3compr1 | 205 (1.50) | 326 (1.49) | 1839 (2.33) | 30,401 (1.80) |
| 100 | A3compr2 | 232 (1.32) | 411 (1.29) | 2340 (2.14) | 38,134 (1.66) |
| 100 | A4 | 174 (1.71) | 261 (1.79) | 1422 (1.28) | 25,723 (1.94) |
| 100 | A4compr1 | 171 (1.42) | 264 (1.50) | 1439 (1.28) | 26,040 (1.64) |
| 100 | A4compr2 | 185 (1.32) | 289 (1.24) | 1544 (1.24) | 27,464 (1.37) |
| 100 | A4+explicit | 178 (2.61) | 270 (2.73) | 1486 (3.21) | 26,878 (3.36) |
| 100 | A4compr1+explicit | 175 (2.32) | 273 (2.44) | 1500 (2.92) | 26,999 (3.07) |
| 100 | A4compr2+explicit | 190 (2.01) | 299 (2.12) | 1624 (2.68) | 28,665 (2.80) |
| 500 | SplitMEM | 2116 (570.84) | − | − | − |
| 500 | A1 | 72 (5.00) | 113 (5.00) | 620 (5.83) | − |
| 500 | A2 | 83 (7.17) | 117 (7.43) | 640 (9.66) | − |
| 500 | A3 | 194 (1.50) | 304 (1.49) | 1752 (2.67) | 28,548 (2.07) |
| 500 | A3compr1 | 216 (1.50) | 325 (1.49) | 1839 (2.19) | 30,488 (1.65) |
| 500 | A3compr2 | 241 (1.32) | 378 (1.29) | 2319 (2.06) | 36,993 (1.50) |
| 500 | A4 | 184 (1.65) | 283 (1.74) | 1453 (1.28) | 26,362 (1.93) |
| 500 | A4compr1 | 197 (1.35) | 287 (1.44) | 1477 (1.28) | 26,545 (1.63) |
| 500 | A4compr2 | 213 (1.32) | 322 (1.24) | 1622 (1.24) | 28,501 (1.36) |
| 500 | A4+explicit | 185 (1.81) | 285 (1.90) | 1509 (3.14) | 27,285 (3.14) |
| 500 | A4compr1+explicit | 198 (1.52) | 288 (1.61) | 1535 (2.83) | 27,417 (2.79) |
| 500 | A4compr2+explicit | 214 (1.32) | 323 (1.29) | 1694 (2.56) | 29,283 (2.58) |
The first column shows the k-mer size (an entry init means that only the index data structure is constructed) and the second column specifies the algorithm used in the experiment. The remaining columns show the run-times in seconds and, in parentheses, the maximum main memory usage in bytes per base pair (including the construction) for the data sets described in the text. A minus indicates that the respective algorithm was not able to solve its task on our machine equipped with 128 GB of RAM
Breakdown of the space usage of the variants of algorithm A4
| Algo | Part | 62 | 7 × Chr1 | 7 × HG |
|---|---|---|---|---|
| A4 | Wt-bwt | 0.42 (23.83 %) | 0.44 (36.23 %) | 0.43 (22.68 %) |
| A4 | Nodes | 0.10 (5.94 %) | 0.03 (2.61 %) | 0.04 (2.02 %) |
| A4 | B_r | 0.16 (8.93 %) | 0.16 (12.86 %) | 0.16 (8.25 %) |
| A4 | B_l | 0.14 (8.04 %) | 0.14 (11.57 %) | 0.14 (7.42 %) |
| A4 | Wt-doc | 0.93 (53.26 %) | 0.45 (36.73 %) | 1.13 (59.63 %) |
| A4compr1 | Wt-bwt | 0.42 (28.57 %) | 0.44 (47.83 %) | 0.43 (26.85 %) |
| A4compr1 | Nodes | 0.10 (7.12 %) | 0.03 (3.44 %) | 0.04 (2.39 %) |
| A4compr1 | B_r | 0.00 (0.23 %) | 0.00 (0.12 %) | 0.00 (0.09 %) |
| A4compr1 | B_l | 0.00 (0.23 %) | 0.00 (0.12 %) | 0.00 (0.08 %) |
| A4compr1 | Wt-doc | 0.93 (63.85 %) | 0.45 (48.49 %) | 1.13 (70.59 %) |
| A4compr2 | Wt-bwt | 0.16 (13.03 %) | 0.22 (31.01 %) | 0.22 (15.62 %) |
| A4compr2 | Nodes | 0.10 (8.67 %) | 0.03 (4.55 %) | 0.04 (2.76 %) |
| A4compr2 | B_r | 0.00 (0.28 %) | 0.00 (0.16 %) | 0.00 (0.10 %) |
| A4compr2 | B_l | 0.00 (0.28 %) | 0.00 (0.16 %) | 0.00 (0.10 %) |
| A4compr2 | Wt-doc | 0.93 (77.74 %) | 0.45 (64.11 %) | 1.13 (81.42 %) |
The first column shows the algorithm used in the experiment (the k-mer size is 50). The second column specifies the different data structures used: wt-bwt stands for the wavelet tree of the BWT (including rank and select support), nodes stands for the array of nodes (the implicit graph representation), B_r and B_l are the bit vectors described in "Computation of right-maximal k-mers" section (including rank support), and wt-doc stands for the wavelet tree of the document array. The remaining columns show the memory usage in bytes per base pair and, in parentheses, their percentage of the total
Space in bytes per input base pair for the explicit and the implicit representation of the compressed de Bruijn graph
| k | ds | 40 | 62 | 7 × Chr1 | 7 × HG |
|---|---|---|---|---|---|
| 50 | Explicit | 1.80 | 1.89 | 2.80 | 2.57 |
| 50 | Implicit | 0.84 | 0.82 | 0.77 | 0.76 |
| 50 | Implicit-c1 | 0.55 | 0.53 | 0.47 | 0.47 |
| 50 | Implicit-c2 | 0.30 | 0.27 | 0.25 | 0.26 |
| 100 | Explicit | 1.46 | 1.51 | 2.55 | 2.12 |
| 100 | Implicit | 0.80 | 0.79 | 0.75 | 0.74 |
| 100 | Implicit-c1 | 0.51 | 0.50 | 0.46 | 0.45 |
| 100 | Implicit-c2 | 0.26 | 0.24 | 0.23 | 0.24 |
| 500 | Explicit | 1.07 | 1.08 | 2.50 | 2.01 |
| 500 | Implicit | 0.74 | 0.74 | 0.75 | 0.74 |
| 500 | Implicit-c1 | 0.44 | 0.44 | 0.45 | 0.44 |
| 500 | Implicit-c2 | 0.20 | 0.18 | 0.23 | 0.23 |
The numbers for the explicit representation include the input, and the numbers for the implicit representation include the BWT stored in a wavelet tree. The suffix -c1 means that the bit vectors B_r and B_l of the implicit representation are compressed, and the suffix -c2 means that additionally the (bit vectors in the) wavelet tree are compressed
Runtime and main memory usage for finding nodes
| k | Algorithm | 62 | 7 × Chr1 | 7 × HG |
|---|---|---|---|---|
| 50 | A4 | 3 (1.81) | 9 (1.28) | 9 (1.96) |
| 50 | A4compr1 | 3 (1.52) | 9 (0.98) | 11 (1.66) |
| 50 | A4compr2 | 6 (1.20) | 20 (0.70) | 29 (1.39) |
| 100 | A4 | 3 (1.78) | 12 (1.26) | 27 (1.94) |
| 100 | A4compr1 | 3 (1.49) | 15 (0.97) | 19 (1.64) |
| 100 | A4compr2 | 6 (1.17) | 31 (0.68) | 51 (1.37) |
| 500 | A4 | 9 (1.73) | 20 (1.26) | 22 (1.93) |
| 500 | A4compr1 | 12 (1.43) | 24 (0.96) | 27 (1.63) |
| 500 | A4compr2 | 17 (1.11) | 55 (0.67) | 74 (1.36) |
The first column shows the k-mer size and the second column specifies the algorithm used in the experiment. The remaining columns show the run-times in seconds for finding the nodes corresponding to 10,000 patterns of length 900 (that occur in the pan-genome) and, in parentheses, the maximum main memory usage in bytes per base pair for the data sets described in the text
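Backward search, listed among the keywords, is the generic primitive for locating a pattern in a BWT-based index. The following is a minimal sketch of that primitive only, with rank queries answered by plain counting; it is not the node lookup procedure measured in the table, and the function names are illustrative.

```python
def bwt_of(s):
    """BWT of s (ending in a unique '$') via naive suffix sorting."""
    sa = sorted(range(len(s)), key=lambda p: s[p:])
    return "".join(s[p - 1] for p in sa)  # s[-1] is the sentinel when p == 0

def backward_search(bwt, pattern):
    """Return the 1-based suffix array interval (lo, hi) of pattern, or None."""
    # C[c] = number of characters in the text that are smaller than c
    C, total = {}, 0
    for c in sorted(set(bwt)):
        C[c] = total
        total += bwt.count(c)
    lo, hi = 1, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return None
        # Naive rank: count c in the first lo - 1 (resp. hi) characters of the BWT.
        lo = C[c] + bwt.count(c, 0, lo - 1) + 1
        hi = C[c] + bwt.count(c, 0, hi)
        if lo > hi:
            return None
    return lo, hi

bwt = bwt_of("ACTACGTACGTACG$")
print(backward_search(bwt, "ACG"))
print(backward_search(bwt, "TAC"))
```

A real index would replace the `str.count` calls with constant-time or logarithmic-time rank queries on a wavelet tree, which is where the wt-bwt structure from the space breakdown comes in.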
Runtime and main memory usage for finding sequences that correspond to given nodes
| k | Algorithm | 62 | 7 × Chr1 | 7 × HG |
|---|---|---|---|---|
| 50 | A4 | 10.84 (1.81) | 3.31 (1.28) | 15.33 (1.96) |
| 50 | A4compr1 | 10.91 (1.52) | 3.17 (0.98) | 14.88 (1.66) |
| 50 | A4compr2 | 11.02 (1.20) | 3.07 (0.70) | 13.02 (1.39) |
| 100 | A4 | 8.31 (1.78) | 2.72 (1.26) | 10.99 (1.94) |
| 100 | A4compr1 | 8.11 (1.49) | 2.83 (0.97) | 9.10 (1.64) |
| 100 | A4compr2 | 8.23 (1.17) | 2.84 (0.68) | 9.25 (1.37) |
| 500 | A4 | 2.43 (1.73) | 1.32 (1.26) | 4.51 (1.93) |
| 500 | A4compr1 | 2.78 (1.43) | 1.32 (0.96) | 4.22 (1.63) |
| 500 | A4compr2 | 2.32 (1.11) | 1.29 (0.67) | 4.30 (1.36) |
The first column shows the k-mer size and the second column specifies the algorithm used in the experiment. The remaining columns show the run-times in seconds for finding out to which sequences each of the nodes belongs (where the nodes correspond to 10,000 patterns of length 900 that occur in the pan-genome) and, in parentheses, the maximum main memory usage in bytes per base pair for the data sets described in the text
Length of the longest string corresponding to a node
| k | 62 | 7 × Chr1 | 7 × HG |
|---|---|---|---|
| 50 | 79,967 | 41,571 | 36,579 |
| 100 | 173,366 | 85,773 | 203,398 |
| 500 | 179,671 | 2,283,980 | 1,402,896 |
The first column specifies the k-mer size and the remaining columns show the length of the longest string corresponding to a node in the compressed de Bruijn graph