Martin D Muggli, Bahar Alipanahi, Christina Boucher.
Abstract
MOTIVATION: There exist several large genomic and metagenomic data collection efforts, including GenomeTrakr and MetaSub, which are routinely updated with new data. To analyze such datasets, memory-efficient methods to construct and store the colored de Bruijn graph were developed. Yet, a problem that has not been considered is constructing the colored de Bruijn graph in a scalable manner that allows new data to be added without reconstruction. This problem is important for large public datasets, which need both scalable construction and the ability to update the graph as new data arrive.
Year: 2019 PMID: 31510647 PMCID: PMC6612864 DOI: 10.1093/bioinformatics/btz350
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (a) A colored de Bruijn graph consisting of two individual graphs, whose edges are shown in red and blue. The nodes are shown in purple because they can occur in either graph. (b) A second colored de Bruijn graph, whose edges are green and yellow. Again, the nodes are shown in lime because they can occur in either graph. (c) A colored de Bruijn graph merged from the two colored de Bruijn graphs. (d) The nodes for all three graphs arranged in columns (red and blue, merged, green and yellow). Each column is sorted into co-lexicographic order, with each node’s number of incoming edges shown on its left and the labels of its outgoing edges shown on its right. Vertical alignment illustrates how the merged components (center) are copied from either the left, the right or both.
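The column alignment described for panel (d) can be illustrated with a minimal Python sketch. It assumes nodes are plain k-mer strings and that each input list is already in co-lexicographic order (strings compared from the last character backwards). The function names `colex_key` and `merge_colex` and the example k-mers are hypothetical, and this is a simplified illustration of the idea, not the authors' implementation.

```python
def colex_key(kmer: str) -> str:
    """Co-lexicographic order compares strings from the last character backwards."""
    return kmer[::-1]


def merge_colex(left: list[str], right: list[str]) -> list[tuple[str, str]]:
    """Merge two co-lexicographically sorted node lists in one linear pass.

    Returns (node, origin) pairs, where origin is 'left', 'right' or 'both',
    mirroring how each merged entry is copied from one side or from both.
    """
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        a, b = colex_key(left[i]), colex_key(right[j])
        if a == b:            # node occurs in both graphs
            merged.append((left[i], "both"))
            i += 1
            j += 1
        elif a < b:           # node occurs only in the left graph
            merged.append((left[i], "left"))
            i += 1
        else:                 # node occurs only in the right graph
            merged.append((right[j], "right"))
            j += 1
    # Whatever remains on one side has no counterpart on the other.
    merged.extend((node, "left") for node in left[i:])
    merged.extend((node, "right") for node in right[j:])
    return merged


# Hypothetical toy node sets, sorted into co-lexicographic order first.
left = sorted(["ACG", "CGA", "GAT"], key=colex_key)
right = sorted(["CGA", "TAC", "ACG", "GTT"], key=colex_key)
result = merge_colex(left, right)
for node, origin in result:
    print(node, origin)
```

Because both inputs are sorted by the same key, the merge needs only a single pass over each list, which is why sorted order makes the merged column straightforward to assemble.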
Table 1. Breakdown of the memory, disk and time usage of the proposed method to build the colored de Bruijn graph for 8000 strains
| Program and dataset | No. of k-mers | Colors | de Bruijn graph RAM (GB) | de Bruijn graph time | de Bruijn graph size (GB) | Color matrix RAM (GB) | Color matrix time | Color matrix size (GB) | Combined RAM (GB) | Combined external memory (TB) | Combined time | Combined size (GB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Run on subset 4A | 1.1 B | 4000 | 136 | 8 h 46 min | 0.31 | 52 | 1 h 39 min | 51.2 | 136 | 1 | 10 h 25 min | 51 |
| Run on subset 4B | 1.5 B | 4000 | 137 | 10 h 40 min | 0.52 | 54 | 2 h 22 min | 52.5 | 137 | 1.5 | 13 h 2 min | 53 |
| Merge (4A, 4B) | 2.4 B | 8000 | 10 | 2 h 1 min | 0.63 | 117 | 1 h 2 min | 106 | 117 | 0 | 3 h 3 min | 106 |
| Combined | 2.4 B | 8000 | 137 | 21 h 27 min | 0.63 | 117 | 5 h 3 min | 117 | 137 | 1.5 | 26 h 30 min | 106 |
Note: The method consists of running the construction program on subsets of the population (4A and 4B) and then merging the results with our proposed merge algorithm. We list the resources used by both individual runs, the resources required by the merge and the combined resources. The combined resources consist of the total time and maximum space used across all three components used in this dataset. No external memory is needed for merging itself, so the external memory column is ‘0’ for the merge.
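The rule stated in the note (total time, maximum space) can be checked against the table with a few lines of Python. The figures below are copied from the combined-requirements columns of the table above; the component labels are descriptive names chosen here for illustration, not labels from the source.

```python
# Combined requirements of the three components, copied from the table above:
# ((hours, minutes), peak RAM in GB, external memory in TB).
# The dictionary keys are illustrative labels, not names from the source.
components = {
    "run on 4A": ((10, 25), 136, 1.0),
    "run on 4B": ((13, 2), 137, 1.5),
    "merge (4A, 4B)": ((3, 3), 117, 0.0),
}

# Time adds up across components; space is the maximum over components.
total_minutes = sum(h * 60 + m for (h, m), _, _ in components.values())
peak_ram_gb = max(ram for _, ram, _ in components.values())
peak_external_tb = max(ext for _, _, ext in components.values())

total_time = divmod(total_minutes, 60)  # (hours, minutes)
print(f"total time: {total_time[0]} h {total_time[1]} min")
print(f"peak RAM: {peak_ram_gb} GB, peak external memory: {peak_external_tb} TB")
```

The computed totals agree with the last row of the table: 26 h 30 min, 137 GB of RAM and 1.5 TB of external memory.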
Table 2. Breakdown of the peak memory, peak disk and time required by the proposed method to build the colored de Bruijn graph for 16 000 strains of Salmonella
| Program and dataset | No. of k-mers | Colors | de Bruijn graph RAM (GB) | de Bruijn graph time | de Bruijn graph size (GB) | Color matrix RAM (GB) | Color matrix time | Color matrix size (GB) | Combined RAM (GB) | Combined external memory (TB) | Combined time | Combined size (GB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Run on 4000 strains | 1.7 B | 4000 | 135 | 10 h 53 min | 0.46 | 53 | 2 h 34 min | 51.8 | 135 | 1.6 | 13 h 27 min | 52 |
| Run on 4000 strains | 2.4 B | 4000 | 137 | 14 h 35 min | 0.67 | 59 | 3 h 37 min | 57.9 | 137 | 2.34 | 18 h 12 min | 59 |
| Merge (8000 strains) | 3.8 B | 8000 | 17 | 2 h 59 min | 1.00 | 118 | 57 min | 107 | 118 | 0 | 3 h 56 min | 108 |
| Merge (16 000 strains) | 5.8 B | 16 000 | 25 | 4 h 53 min | 1.60 | 254 | 2 h 10 min | 232 | 254 | 0 | 7 h 3 min | 233 |
| Overall (16 000 strains) | 5.8 B | 16 000 | 137 | 54 h 47 min | 1.60 | 254 | 14 h 21 min | 232 | 254 | 2.34 | 69 h 8 min | 233 |
Note: The overall row includes the resources required by the two 4000-strain runs (4A and 4B) and the merge run (4A, 4B) from Table 1. No extra external memory is needed for merging, so the external memory column is ‘0’ for the merge rows.
Table 3. Comparison between space-efficient colored de Bruijn graph construction methods for 4000, 8000 and 16 000 Salmonella strains using the proposed method versus competing methods
| Dataset | No. of k-mers | Program | Output size (GB) | Time | RAM (GB) |
|---|---|---|---|---|---|
| 4000 | 1.1 B | Without merging (single run) | 51 | 10 h 25 min | 136 |
| 4000 | 1.1 B | Bloom Filter Trie | 99 | 51 h 42 min | 120 |
| 4000 | 1.1 B | Multi-BRWT | 1.3 TB | 42 h 23 min | 156 |
| 4000 | 1.1 B | Mantis/method of Almodaresi | 36 | 5 h 58 min | 313 |
| 4000 | 1.1 B | Proposed method (with merging) | 51 | 10 h 25 min | 136 |
| 8000 | 2.4 B | Without merging (single run) | 114 | 37 h 27 min | 271 |
| 8000 | 2.4 B | Bloom Filter Trie | N/A | N/A | N/A |
| 8000 | 2.4 B | Multi-BRWT | N/A | N/A | N/A |
| 8000 | 2.4 B | Mantis/method of Almodaresi | 38 | 13 h 37 min | 370 |
| 8000 | 2.4 B | Proposed method (with merging) | 106 | 26 h 30 min | 137 |
| 16 000 | 5.8 B | Without merging (single run) | N/A | N/A | N/A |
| 16 000 | 5.8 B | Bloom Filter Trie | N/A | N/A | N/A |
| 16 000 | 5.8 B | Multi-BRWT | N/A | N/A | N/A |
| 16 000 | 5.8 B | Mantis/method of Almodaresi | 256 | 36 h 12 min | 316 |
| 16 000 | 5.8 B | Proposed method (with merging) | 233 | 69 h 8 min | 254 |
Note: We report N/A for any method that exceeded 140 CPU hours, 4 TB of disk space and 750 GB of memory. We anticipate that add-on methods will compress better but will still consume the resources shown for their base method, because they reuse the base method’s output. We measured RAM as the maximum resident set size. The Mantis authors noted that, because their program uses memory-mapped I/O, this measure reflects opportunistic consumption and not necessarily the memory the program requires. To the best of our knowledge, no extra external memory is needed for Bloom Filter Trie, Multi-BRWT, Mantis and the method of Almodaresi et al., so it is omitted from the table.