Guillaume Holley, Páll Melsted.
Abstract
Memory consumption of de Bruijn graphs is often prohibitive. Most de Bruijn graph-based assemblers reduce the complexity by compacting paths into single vertices, but this is challenging as it requires the uncompacted de Bruijn graph to be available in memory. We present a parallel and memory-efficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted graph. Bifrost features a broad range of functions, such as indexing, editing, and querying the graph, and includes a graph coloring method that maps each k-mer of the graph to the genomes it occurs in.
Availability: https://github.com/pmelsted/bifrost
Year: 2020 PMID: 32943081 PMCID: PMC7499882 DOI: 10.1186/s13059-020-02135-8
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Time and memory comparison of Bifrost and BCALM2 for different k-mer sizes and number of threads during graph construction
| Metric | Tool | k | 1 thread | 4 threads | 8 threads | 16 threads |
|---|---|---|---|---|---|---|
| Time (h) | Bifrost | 31 | | | | |
| Time (h) | Bifrost | 63 | | | | |
| Time (h) | Bifrost | 95 | | | | |
| Time (h) | Bifrost | 127 | | | | |
| Time (h) | BCALM2 | 31 | 44.25 | 14.11 | 8.48 | 6.33 |
| Time (h) | BCALM2 | 63 | N/A | 25.6 | 13.96 | 8.71 |
| Time (h) | BCALM2 | 95 | N/A | 39.91 | 21.45 | 12.56 |
| Time (h) | BCALM2 | 127 | N/A | N/A | 27.73 | 16.15 |
| Memory (GB) | Bifrost | 31 | 39.59 | 39.58 | 39.59 | 39.60 |
| Memory (GB) | Bifrost | 63 | 37.77 | 37.77 | 37.77 | 37.78 |
| Memory (GB) | Bifrost | 95 | 44.33 | 44.30 | 44.30 | 44.32 |
| Memory (GB) | Bifrost | 127 | 55.88 | 55.86 | 55.86 | 55.86 |
| Memory (GB) | BCALM2 | 31 | | | | |
| Memory (GB) | BCALM2 | 63 | N/A | | | |
| Memory (GB) | BCALM2 | 95 | N/A | | | |
| Memory (GB) | BCALM2 | 127 | N/A | N/A | | |
Best results are highlighted. N/A indicates the result is unavailable because the computation took more than 48 h
Unitig N50, k-mer, and unitig cardinalities in cdBGs built from NA12878 for different k-mer sizes
| k | k-mer cardinality | Unitig cardinality | Unitig N50 |
|---|---|---|---|
| 31 | 2,675,559,250 | 80,478,269 | 421 |
| 63 | 2,991,703,769 | 28,262,463 | 950 |
| 95 | 3,058,681,425 | 16,691,669 | 1299 |
| 127 | 2,702,556,396 | 44,221,433 | 297 |
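The unitig N50 column can be read with the usual N50 definition: the largest length L such that unitigs of length at least L account for at least half of the total length. The sketch below is a minimal illustration of that definition, not the code used to produce the table; the function name `n50` and the toy lengths are assumptions, and the table does not state the unit in which unitig length is measured.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Standard definition: sort lengths in decreasing order and return the
    length at which the running sum first reaches half of the total.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy unitig lengths (hypothetical values, not taken from the table).
unitig_lengths = [2, 3, 4, 5, 6]
print(n50(unitig_lengths))  # -> 5 (6 + 5 = 11 reaches half of the total, 10)
```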
Running time and memory usage for indexing and querying a de Bruijn graph for 30 million short reads
| Tool | Process | Time (m) | Memory (GB) |
|---|---|---|---|
| Bifrost | Build | | 39.6 |
| Bifrost | Index | | 26.8 |
| Bifrost | Query | | 26.8 |
| Bifrost | Query-total | | 26.8 |
| BCALM2 | Build | 380 | |
| Blight | Index | 80 | |
| Blight | Query | 13.6 | |
| Blight | Query-total | 93.6 | |
| Squeakr | Build | 1147 | 80 |
| Mantis | Index | 54 | 17 |
| Mantis | Query | 38.8 | 168 |
| Mantis | Query-total | 96.9 | 168 |
The total time of Bifrost and Blight is split into index and query as reported by the software, whereas query-total is the wall time measurement. For Mantis, the index is a separate process and needs only to be run once
Running time and fraction of queries found for different k-mer inclusion rates (θ) using exact and inexact k-mers
| Query type | θ | Time (m) | Queries found (%) |
|---|---|---|---|
| Exact | 0.50 | 2.8 | 99.0 |
| Exact | 0.75 | 3.8 | 96.0 |
| Exact | 0.90 | 4.4 | 93.9 |
| Exact | 1.00 | 4.7 | 92.2 |
| Inexact | 0.50 | 7.2 | 99.6 |
| Inexact | 0.75 | 14.8 | 99.0 |
| Inexact | 0.90 | 17.7 | 98.1 |
| Inexact | 1.00 | 21.2 | 97.3 |
Inexact k-mers allow for one substitution or indel in the k-mer search
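One way to read the inclusion rate θ is that a query counts as found when at least a fraction θ of its k-mers are present in the graph. The sketch below illustrates that reading on a plain Python set of k-mers. It is not Bifrost's query code: all function names are hypothetical, reverse-complements are ignored, and the inexact mode covers only single substitutions (indels are left out for brevity).

```python
def kmers(seq, k):
    """List every k-mer of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def one_substitution_variants(kmer, alphabet="ACGT"):
    """Yield every k-mer that differs from kmer by exactly one substitution."""
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def kmer_present(kmer, index, inexact=False):
    """Exact lookup, optionally falling back to one-substitution neighbours."""
    if kmer in index:
        return True
    return inexact and any(v in index for v in one_substitution_variants(kmer))


def query_found(query, index, k, theta, inexact=False):
    """True if at least a fraction theta of the query's k-mers are present."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return False
    hits = sum(kmer_present(x, index, inexact) for x in query_kmers)
    return hits / len(query_kmers) >= theta


# Toy index and query (hypothetical sequences).
index = set(kmers("ACGTACGGA", 3))
print(query_found("ACGTTCGG", index, k=3, theta=0.75))
print(query_found("ACGTTCGG", index, k=3, theta=0.75, inexact=True))
```

Each inexact lookup tries up to 3k neighbours per missing k-mer, which is consistent with the inexact rows taking longer than the exact ones.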
Running time, memory usage, and external disk usage for constructing the colored de Bruijn graphs of an increasing number of Salmonella strains
| Number of strains | Tool | Time (h) | Memory (GB) | Disk (GB) |
|---|---|---|---|---|
| 100 | Bifrost | | | |
| 100 | VARI-merge + KMC2 | 0.33 | 5.1 | 17 |
| 400 | Bifrost | | | |
| 400 | VARI-merge + KMC2 | 1.016 | 15.4 | 51 |
| 1600 | Bifrost | | | |
| 1600 | VARI-merge + KMC2 | 4.86 | 56.9 | 228 |
| 4000 | Bifrost | | | |
| 4000 | VARI-merge + KMC2 | 12.35 | 138 | 449 |
| 117,913 | Bifrost | | | |
| 117,913 | VARI-merge + KMC2 | N/A | N/A | N/A |
N/A indicates the result is unavailable
Fig. 1 A de Bruijn graph (a) and its compacted counterpart (b), both using 3-mers. For simplicity, reverse-complements are not considered
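Compaction of the kind shown in this figure can be sketched as merging every maximal non-branching path of k-mers into one unitig. The code below is the naive version that keeps the whole uncompacted k-mer set in memory, which is exactly the cost the abstract says Bifrost avoids, so it illustrates the definition rather than Bifrost's algorithm. Function names and the toy 3-mers are assumptions, and reverse-complements are ignored as in the figure.

```python
def successors(kmer, kmer_set):
    """k-mers in the set that overlap kmer's last k-1 bases."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmer_set]


def predecessors(kmer, kmer_set):
    """k-mers in the set whose last k-1 bases are kmer's first k-1 bases."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in kmer_set]


def compact(kmer_iterable):
    """Merge maximal non-branching paths of k-mers into unitigs."""
    kmer_set = set(kmer_iterable)
    visited = set()
    unitigs = []
    for start in sorted(kmer_set):
        if start in visited:
            continue
        # Walk backward to the first k-mer of the unitig containing start.
        first = start
        seen = {start}
        while True:
            preds = predecessors(first, kmer_set)
            if len(preds) != 1:
                break
            p = preds[0]
            if len(successors(p, kmer_set)) != 1 or p in seen:
                break  # branching point, or we closed a cycle
            first = p
            seen.add(p)
        # Walk forward, appending one base per merged k-mer.
        unitig = first
        visited.add(first)
        current = first
        while True:
            succs = successors(current, kmer_set)
            if len(succs) != 1:
                break
            nxt = succs[0]
            if len(predecessors(nxt, kmer_set)) != 1 or nxt in visited:
                break
            unitig += nxt[-1]
            visited.add(nxt)
            current = nxt
        unitigs.append(unitig)
    return sorted(unitigs)


# Toy 3-mers (hypothetical): TAC -> ACG, then ACG branches to CGA and CGT.
print(compact(["TAC", "ACG", "CGA", "CGT"]))  # -> ['CGA', 'CGT', 'TACG']
```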
Fig. 2 Data structure of a cdBG composed of a hash table M and a unitig array U. Unitigs are composed of 3-mers and are indexed using minimizers of length 1. For simplicity, a lexicographic ordering of minimizers is used here and only one minimizer is shown
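The two-part layout in this caption can be sketched as a dictionary M that maps each minimizer to the positions where it occurs in the unitigs of an array U. A k-mer lookup then only has to check the unitig positions listed under the k-mer's own minimizer. This is a toy model under stated assumptions: lexicographic minimizer ordering with leftmost tie-breaking, no reverse-complements, and hypothetical function names. Bifrost's actual structure may differ in all of these respects.

```python
from collections import defaultdict


def minimizer(kmer, g):
    """Return (offset, g-mer) of the lexicographically smallest g-mer.

    Ties are broken by taking the leftmost occurrence.
    """
    best = min(range(len(kmer) - g + 1), key=lambda i: (kmer[i:i + g], i))
    return best, kmer[best:best + g]


def build_index(unitigs, k, g):
    """Build M: minimizer -> set of (unitig id, minimizer position in unitig)."""
    M = defaultdict(set)
    for uid, unitig in enumerate(unitigs):
        for start in range(len(unitig) - k + 1):
            offset, m = minimizer(unitig[start:start + k], g)
            M[m].add((uid, start + offset))
    return M


def find_kmer(kmer, unitigs, M, g):
    """Return (unitig id, k-mer start) if kmer occurs in a unitig, else None."""
    k = len(kmer)
    offset, m = minimizer(kmer, g)
    for uid, pos in sorted(M.get(m, ())):
        start = pos - offset  # align the k-mer on the shared minimizer
        if start >= 0 and unitigs[uid][start:start + k] == kmer:
            return uid, start
    return None


# Toy unitig array U of 3-mers indexed with minimizers of length 1.
U = ["TACG", "CGA", "CGT"]
M = build_index(U, k=3, g=1)
print(find_kmer("ACG", U, M, g=1))  # -> (0, 1)
print(find_kmer("GGG", U, M, g=1))  # -> None
```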
Fig. 3 A compacted de Bruijn graph containing false positive 3-mers. Errors are shown as red dashed-line vertices: the k-mer “CCG” creates a false branch and “ACT” creates a false connection. K-mers that are compacted into a unitig are grouped in a gray box