Jamshed Khan1,2, Rob Patro1,2.
Abstract
MOTIVATION: The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem.
Year: 2021 PMID: 34252958 PMCID: PMC8275350 DOI: 10.1093/bioinformatics/btab309
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. For G(S, k) in Figure (1a), w1 = (GAC, ACA, ATG, ATG) (edges not listed) is a walk. w1 spells the string GACATG. It is not a path, as the vertex ATG is repeated. In contrast, the walk w2 = (GAC, ACA, ATG) is a path, spelling GACAT. It is also a unitig, and a maximal one, as it cannot be extended on either side while remaining a unitig. There are four maximal unitigs in the graph (the paths marked with the arrows), with the canonical spellings CGA, ATGTC, CTAAGA and GAGC. (a) A (bidirected) de Bruijn graph for a set S of strings, with k = 3. The vertices are the canonical k-mers from S, and each edge corresponds to some 4-mer(s) in S. Each pentagon is a vertex, with the flat and the pointy sides (vertically) denoting its front and back, respectively. For each vertex v, the string inside it is label(v), to be read in the visual direction from the front to the back. The string below it is the reverse complement of label(v), to be read in the opposite direction. For example, the 4-mers CGAC and GCTC correspond to the edges {CGA, GAC} and {AGC, CTC}, respectively. The edge corresponding to the 4-mer CATG is a loop {ATG, ATG}. (b) The corresponding compacted de Bruijn graph, with each maximal unitig in its canonical form.
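The mapping from 4-mers to edges in the caption can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes that the canonical form of a string is the lexicographically smaller of the string and its reverse complement, and the function names (`reverse_complement`, `canonical`, `edge_endpoints`) are hypothetical.

```python
# Assumption: "canonical" means the lexicographically smaller of a string
# and its reverse complement. All names below are illustrative only.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s: str) -> str:
    """Complement every nucleotide, then reverse the string."""
    return s.translate(_COMPLEMENT)[::-1]


def canonical(s: str) -> str:
    """Return the canonical form of a k-mer (or of a unitig spelling)."""
    return min(s, reverse_complement(s))


def edge_endpoints(kplus1_mer: str) -> tuple:
    """Map a (k+1)-mer to the pair of canonical k-mers (vertices) it connects."""
    prefix, suffix = kplus1_mer[:-1], kplus1_mer[1:]
    return canonical(prefix), canonical(suffix)


if __name__ == "__main__":
    for four_mer in ("CGAC", "GCTC", "CATG"):
        print(four_mer, edge_endpoints(four_mer))
    # The path spelling GACAT has a different canonical spelling.
    print(canonical("GACAT"))
```

Under this assumption, the 4-mer CATG maps both of its 3-mers to ATG, which is why it appears as a loop in the figure.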
Fig. 2. Classes of states of the vertices, and the transition relationships among those. (a) Four disjoint classes of states of the vertices, based on the properties of their sides. The pictorial shapes of the classes correspond to the actual incidence properties of the vertices. For example, the first class of states is for vertices that have exactly one edge incident per side, with the edges incident to the front and to the back encoded with the characters X1 and X2, respectively. There can be 16 different configurations of this shape, and this class contains those 16 states. The second class is for vertices that have either an ϵ-edge or > 1 distinct edges incident to the front, and one unique edge incident to the back. As there are four possible configurations with this property, this class contains four states. Note that, pictorially, a singular incident edge denotes a unique edge, whereas multiple incident edges mean that either > 1 edges or an ϵ-edge are incident. (b) Possible transition types between the various classes of the states. For example, consider a state of the class single-in single-out, with the unique edges incident to its front and back encoded with the characters X1 and X2, respectively. If the state is provided with the input (Y1, Y2), then based on the four different joint outcomes of the conditionals X1 = Y1 and X2 = Y2, the following transitions can happen: 1. X1 = Y1 and X2 = Y2: self-transition; 2. X1 ≠ Y1 and X2 = Y2: transition to a state of the class multi-in single-out that has X2 at the back; 3. X1 = Y1 and X2 ≠ Y2: transition to a state of the class single-in multi-out that has X1 at the front; 4. X1 ≠ Y1 and X2 ≠ Y2: transition to the only state of the class multi-in multi-out.
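The transition rule described in panel (b) can be sketched as a small function. The encoding here is an assumption made for illustration: a side is stored as a nucleotide character when it has a unique incident edge and as `None` when it is "multi" (more than one edge or an ϵ-edge), and a side that is already multi is assumed to stay multi. Handling of ϵ inputs and of unvisited vertices is left out, and all names are hypothetical.

```python
from itertools import product
from typing import Optional, Tuple

# Hypothetical encoding: a nucleotide for a unique edge, None for "multi".
Side = Optional[str]
State = Tuple[Side, Side]  # (front, back)


def merge_side(current: Side, observed: str) -> Side:
    """Keep a unique edge only if the same edge is observed again."""
    if current is None:  # assumption: a multi side never becomes unique again
        return None
    return current if current == observed else None


def transition(state: State, observed: Tuple[str, str]) -> State:
    """Apply the input (Y1, Y2) to a state (X1, X2), side by side."""
    return (merge_side(state[0], observed[0]), merge_side(state[1], observed[1]))


def state_class(state: State) -> str:
    """Name the class of a state from the properties of its two sides."""
    front = "single-in" if state[0] is not None else "multi-in"
    back = "single-out" if state[1] is not None else "multi-out"
    return f"{front} {back}"


if __name__ == "__main__":
    start = ("A", "C")  # single-in single-out with X1 = A, X2 = C
    for observed in product("AG", "CT"):
        new_state = transition(start, observed)
        print(observed, new_state, state_class(new_state))
```

Enumerating every pair of sides under this encoding gives 16 + 4 + 4 + 1 states across the four classes, matching the counts in the caption.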
Time- and memory-performance benchmarking for the steps of Cuttlefish on the 7 human genomes dataset, across different k
| k | Distinct k-mers (B) | k-mer set construction (s) | MPHF construction (s) | States computation (s) | Build time (s) | Build memory (GB) | Output, unipaths only (s) | Output, GFA2 (s) | Output memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| 23 | 2.39 | 154 | 62 | 762 | 978 | 2.67 | 744 | 1345 | 2.82 |
| 31 | 2.59 | 391 | 70 | 791 | 1252 | 2.88 | 737 | 1203 | 3.01 |
| 61 | 2.96 | 439 | 200 | 797 | 1436 | 3.25 | 798 | 831 | 3.37 |
| 91 | 3.12 | 1118 | 311 | 830 | 2259 | 3.42 | 806 | 860 | 3.49 |
| 121 | 3.24 | 1483 | 902 | 841 | 3226 | 3.55 | 850 | 820 | 3.62 |
Note: The running times are in seconds, and the maximum memory usages are in gigabytes.
Time- and memory-performance benchmarking for compacting single input reference de Bruijn graphs
| Dataset | Thread-count | k | Bifrost build | Bifrost output | deGSM build | deGSM output | TwoPaCo build | TwoPaCo output | Cuttlefish build | Cuttlefish output |
|---|---|---|---|---|---|---|---|---|---|---|
| Human | 1 | 31 | 04:54:50 (27.23) | 15:18 | 01:54:41 (37.94) | 25:06 (9.79) | 01:13:19 (4.15) | 39:38 (4.50) | | 19:23 (2.84) |
| Human | 1 | 61 | 05:16:51 (50.19) | 01:49 | 02:20:57 (84.16) | 21:37 (8.77) | 01:10:18 (6.02) | 12:25 (4.35) | | 15:37 (3.08) |
| Human | 8 | 31 | 01:33:54 (27.23) | 03:59 | 25:20 (37.94) | 05:37 (9.80) | 12:57 (5.04) | — | | 05:13 (2.92) |
| Human | 8 | 61 | 01:20:28 (50.18) | 00:40 | 47:52 (84.16) | 03:55 (8.80) | 11:28 (5.46) | — | | 03:20 (3.18) |
| Human | 16 | 31 | 01:24:40 (27.24) | 03:30 | 18:19 (37.94) | 03:56 (9.80) | 06:24 (5.57) | — | | 02:57 (2.93) |
| Human | 16 | 61 | 01:12:33 (50.18) | 00:52 | 46:34 (84.16) | 02:35 (8.80) | 07:12 (5.55) | — | | 01:54 (3.19) |
| Gorilla | 1 | 31 | 05:44:10 (28.08) | 16:30 | 01:34:29 (37.94) | 24:26 (9.75) | 01:00:15 (5.04) | 43:25 (4.49) | | 17:07 (2.77) |
| Gorilla | 1 | 61 | 05:31:06 (50.13) | 02:05 | 02:11:33 (84.16) | 22:03 (8.94) | 01:11:29 (5.83) | 17:52 (4.30) | | 15:59 (3.03) |
| Gorilla | 8 | 31 | 02:06:52 (28.08) | 03:44 | 28:52 (37.94) | 05:43 (9.76) | 13:02 (5.82) | — | | 04:37 (2.87) |
| Gorilla | 8 | 61 | 01:24:21 (50.13) | 00:54 | 47:45 (84.16) | 03:59 (8.98) | 10:03 (6.00) | — | | 02:54 (3.12) |
| Gorilla | 16 | 31 | 01:50:26 (28.08) | 02:59 | 20:47 (37.94) | 04:07 (9.76) | 07:29 (5.52) | — | | 03:25 (2.87) |
| Gorilla | 16 | 61 | 01:10:06 (50.13) | 04:04 | 38:45 (84.16) | 02:40 (8.98) | 06:24 (6.09) | — | | 02:06 (3.14) |
| Sugar pine | 16 | 31 | 22:18:24 (229.17) | 01:20:51 | 09:29:24 (145.23) | 01:10:55 (119.18) | 01:49:01 (61.93) | — | | 01:56:52 (14.28) |
| Sugar pine | 16 | 61 | | — | | — | | — | 03:14:44 ( | 01:26:26 (20.90) |
Note: Each cell contains the running time in wall clock format, and the maximum memory usage in gigabytes, in parentheses. The output steps report the compacted graph in the GFA2 format. The best value with respect to each metric in each row is highlighted.
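To compare cells numerically, the wall-clock format described in the note can be converted to seconds. The following is a small sketch, assuming that cells look like `HH:MM:SS (GB)` or `MM:SS (GB)` with the memory part optional; the helper name `parse_cell` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed cell shape: "HH:MM:SS" or "MM:SS", optionally followed by "(memory)".
_CELL = re.compile(r"^\s*(\d+(?::\d{2}){1,2})\s*(?:\(([\d.]+)\))?\s*$")


def parse_cell(cell: str) -> Tuple[int, Optional[float]]:
    """Return (running time in seconds, peak memory in GB or None)."""
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    seconds = 0
    for part in match.group(1).split(":"):
        seconds = seconds * 60 + int(part)  # fold hours/minutes/seconds
    memory = float(match.group(2)) if match.group(2) else None
    return seconds, memory


if __name__ == "__main__":
    print(parse_cell("01:13:19 (4.15)"))
    print(parse_cell("19:23 (2.84)"))
    print(parse_cell("15:18"))
```

Cells that hold only a dash are rejected with a `ValueError`, so a caller can treat them as not applicable.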
Bifrost builds the compacted graph and outputs it using the same command; we could split the timing of the steps but were unable to tease apart the maximum memory for the output step. The discrepancy between the memory usage of deGSM and its memory-limit input parameter is attributable to its initial k-mer enumeration step, which deGSM runs internally using the Jellyfish tool (Marçais and Kingsford, 2011) with parameters that deGSM sets. These resources must be accounted for, as the input for the problem is a set of references (from which deGSM first produces a k-mer database, much like Cuttlefish). TwoPaCo takes a logarithmic filter-size parameter f as input, and f is critical to its performance. It uses 2^f/8 bytes of memory for a Bloom filter in the first pass, which significantly affects the memory usage in the second pass. We used f = 35 for both k = 31 and k = 61 for human and gorilla, and f = 38 for k = 31 and f = 39 for k = 61 for sugar pine. We set f so that the maximum memory usage is minimized, by first approximating its optimal value and then trying a few nearby values. The best executions found (w.r.t. memory) are reported. Also, the output step of TwoPaCo is single-threaded, and the dashes in its output column indicate that multi-threading is not applicable. The cells with X indicate abnormal program terminations: Bifrost ran out of memory (with std::bad_alloc), and deGSM had a segmentation fault. The peak memory usages until the point of termination are reported.
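The effect of the filter-size parameter on memory can be illustrated with a short calculation. This sketch assumes that the Bloom filter occupies 2^f bits, that is 2^f / 8 bytes; the function names are hypothetical and the code does not call TwoPaCo itself.

```python
def bloom_filter_bytes(f: int) -> int:
    """Bytes used by a filter of 2**f bits (assumption: one bit per slot)."""
    return (1 << f) // 8


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes."""
    return num_bytes / 2**30


if __name__ == "__main__":
    # The values of f reported above for human/gorilla and for sugar pine.
    for f in (35, 38, 39):
        print(f, to_gib(bloom_filter_bytes(f)))
```

Under this assumption, each increment of f doubles the filter size, which is why nearby values of f are worth trying when minimizing peak memory.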
Time- and memory-performance benchmarking for compacting colored de Bruijn graphs (i.e. multiple input references) for k = 31, using 16 threads
| Dataset | Total genome-length (bp) | Distinct k-mers | Bifrost | deGSM | TwoPaCo | Cuttlefish |
|---|---|---|---|---|---|---|
| 62 | 310 M | 24 M | 1 ( | 1 (3.34) | 1 (0.80) | 1 (0.96) |
| 7 Humans | 21 G | 2.6 B | 95 (29.06) | 30 (37.94) | 62 (6.14) | |
| 7 Apes | 18 G | 7.1 B | 294 (100.25) | 172 (145.23) | 59 (28.87) | |
| 11 Conifers | 204 G | 82 B | — | — | 981 (288.99) | |
| 100 Humans | 322 G | 28 B | — | — | 1395 (126.25) | |
Note: Each cell contains the running time in minutes, and the maximum memory usage in gigabytes, in parentheses. The output step is excluded from executions. The best value with respect to each metric in each row is highlighted.
The filter-sizes for the TwoPaCo executions are set as described in Table 1. Dashed cells in the Bifrost and the deGSM columns indicate that the experiments were not performed, as it is anticipated that insufficient memory would be available given their memory usages for smaller datasets (w.r.t. k-mer count).
Fig. 3. Scalability metrics of Cuttlefish for varying number of threads, using k = 31 for the human genome. (a) Time taken by each step. (b) Speedup for each step.