Literature DB >> 34252958

Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections.

Abstract

MOTIVATION: The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem.
RESULTS: We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata's state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9 h, using ∼29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed by Cuttlefish in under 9 h, using ∼84 GB of memory. The only other tool completing these tasks on the hardware took over 23 h using ∼126 GB of memory, and over 16 h using ∼289 GB of memory, respectively.
AVAILABILITY AND IMPLEMENTATION: Cuttlefish is implemented in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical Disease Gene Species

Year: 2021 PMID： 34252958 PMCID： PMC8275350 DOI： 10.1093/bioinformatics/btab309

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

With the increasing affordability and throughput of sequencing, the set of whole genome references available for comparative analyses and indexing has been growing tremendously. Modern sequencing technologies can generate billions of short-read sequences per-sample, and with state-of-the-art de novo and reference-guided assembly algorithms, we now have thousands of mammalian-sized genomes available. Moreover, we now have successfully assembled genomes that are an order of magnitude larger than typical mammalian genomes, with the largest ones among these being the sugar pine (31 Gbp) (Stevens ) and the Mexican walking fish (32 Gbp) (Nowoshilow ). With modern long-read sequencing technologies and assembly techniques, the diversity and the completeness of reference-quality genomes is expected to continue increasing rapidly. Representation of these reference collections in compact forms facilitating common genomic analyses is thus of acute interest and importance, as are efficient algorithms for constructing these representations. To this end, the de Bruijn graph has become an object of central importance in many genomic analysis tasks. While it was initially used mostly in the context of genome (and transcriptome) assembly [EULER (Pevzner ), EULER-SR (Chaisson and Pevzner, 2008), Velvet (Zerbino and Birney, 2008; Zerbino ), ALLPATHS (Butler ; MacCallum ), ABySS (Simpson ), Trans-AByss (Robertson ), SPAdes (Bankevich ), Minia (Chikhi and Rizk, 2013), and SOAPdenovo (Li ; Luo et al., 2015)], it has seen increasing use in comparative genomics [Cortex (Iqbal ), DISCOSNP (Uricaru ), Scalpel (Fang ), and BubbZ (Minkin and Medvedev, 2020)], and has also been used increasingly in the context of indexing genomic data, either from raw sequencing reads [Vari (Muggli ), Mantis (Pandey ; Almodaresi ), VariMerge (Muggli ), and MetaGraph (Karasikov )] or from assembled reference sequences [deBGA (Liu ), Pufferfish (Almodaresi ), and deSALT (Liu )], or from both [BLight (Marchet ) and Bifrost (Holley and Melsted, 2020)]. These latter applications most frequently make use of the (colored) compacted de Bruijn graph, a variant of the de Bruijn graph in which the maximal non-branching paths (also referred to as unitigs) are condensed into single vertices in the underlying graph structure. This retains all the information of the original graph, while typically requiring much less space to store, index and process. The set of maximal unitigs for a de Bruijn graph is unique and forms a node decomposition of the graph (Chikhi ). The colored variant is built over multiple references. The (colored) compacted de Bruijn graph has become an increasingly useful data structure in computational genomics, for its use in detecting and representing variation within a population of genomes, comparing whole genome sequences, as well as, more recently, its accelerated use as a foundational structure to assist with indexing one or more related biological sequences. Applying de Bruijn graphs for these whole genome analysis tasks highlights a collection of algorithmic challenges, such as efficient construction of the graphs [out-of-core (Kundeti ), khmer (Pell ), BOSS (Bowe ), Minia (Chikhi and Rizk, 2013), kFM-index (Rødland, 2013), and BFT (Holley )], memory-efficient representations for the graphs and their construction [SplitMEM (Marcus ), DBGFM (Chikhi ), BWT-based algorithm (Baier ), BCALM2 (Chikhi ), TwoPaCo (Minkin ), deGSM (Guo ), and Bifrost (Holley and Melsted, 2020)], building space- and time-efficient indices on these representations [BFT (Holley ), Pufferfish (Almodaresi ), and BLight (Marchet )], and performing read-alignment on the graphs using the indices [deBGA (Liu ), deSALT (Liu ), and PuffAligner (Almodaresi )], to mention a few. In this work, we tackle the initial step of the pipelines associated with whole genome and pan-genome analysis tasks based on de Bruijn graphs: the time- and memory-efficient construction of (colored) compacted de Bruijn graphs. We present a novel algorithm, implemented in the tool Cuttlefish, for this purpose. It has excellent scaling behavior and low memory requirements, exhibiting better performance than state-of-the-art tools for compacting de Bruijn graphs from reference collections. The algorithm models each distinct k-mer (i.e. vertex of the de Bruijn graph) of the input reference collection as a finite-state automaton, and designs a compact hash table structure to store succinct encodings of the states of the automata. It characterizes the k-mers that flank maximal unitigs through an implicit traversal over the original graph—without building it explicitly—and dynamically transitions the states of the automata with the local information obtained along the way. The maximal unitigs are then extracted through another implicit traversal, using the obtained k-mer characterizations. We assess the algorithm on both individual genomes, and collections of genomes with diverse structural characteristics: 7 closely related humans; 7 related but different apes; 11 related, different and huge conifer plants; and 100 humans. We compare our tool to three existing state-of-the-art tools: Bifrost, deGSM and TwoPaCo. Cuttlefish is competitive with or faster than these tools under various settings of k-mer length and thread count, and uses less memory (often multiple times less) on all but the smallest datasets. Note that, based on the input to the construction of a de Bruijn graph being a set of either reference sequences or sequencing reads, we distinguish the graphs as reference de Bruijn graphs and read de Bruijn graphs, respectively (De Bruijn graphs may also be constructed from k-mer sets directly, which themselves are generated from sets of references or reads.). The presented Cuttlefish algorithm works with de Bruijn graphs based on reference sequences, i.e. it constructs compacted reference de Bruijn graphs.

2 Related work

The amount of short-read sequences produced from samples, like whole-genome DNA, RNA or metagenomic samples, can be in the order of billions. Construction of long-enough contiguous sequences, also referred to as contigs (Staden, 1980), from the sets of reads is known as the fragment assembly problem, and is a central and long-standing problem in computational biology. Many short-read fragment assembly algorithms [BCALM (Chikhi ), BCALM2 (Chikhi ), Bruno (Pan ), and deGSM (Guo )] use the de Bruijn graph to represent the input set of reads, and assemble fragments through graph compaction. More generally, fragment assembly algorithms are typically present as parts of larger and more complex genome assembly algorithms. In this work, we study the closely related problem of constructing and compacting de Bruijn graphs in the whole genome setting. While the computational requirements for the compacted de Bruijn graph construction from genome references are typically substantial compared to the latter phases of sequence analysis (whole- and pan-genome) tasks, only a few tools have focused on the specific problem. SplitMEM (Marcus ) exploits topological relationships between suffix trees and compacted de Bruijn graphs. It introduces a construct called suffix skips, which is a generalization of suffix links and is similar to pointer jumping techniques, to fast navigate suffix trees for identifying MEMs (maximal exact matches), and then transforms those into vertices and edges of the output graph. SplitMEM is improved upon by Baier using two algorithms: based on a compressed suffix tree, and on the Burrows–Wheeler Transform (BWT) (Burrows and Wheeler, 1994). TwoPaCo (Minkin ) takes the approach of enumerating the edges of the original de Bruijn graph, which aids in identifying the junction positions in the genomes, i.e. positions that correspond to vertices in TwoPaCo’s compacted graph format. Initially having the entire set of vertices as junction candidates, it shrinks the candidates set using a Bloom filter (Bloom, 1970), with a pass over the graph. Since the Bloom filter may contain false positive junctions, TwoPaCo makes another pass over the graph, this time keeping a hash table for the reduced set of candidates and thus filtering out the false positives.deGSM (Guo ) builds a BWT of the maximal unitigs without explicitly constructing the unitigs themselves. It partitions the (k + 2)-mers of the input into a collection of buckets, and applies parallel external sorting on the buckets. Then characterizing the k-mers at the ends of the maximal unitigs using the sorted information, it merges the buckets in a way to produce the unitigs-BWT. Bifrost (Holley and Melsted, 2020) initially builds an approximate de Bruijn graph using blocked Bloom filters. Then for each k-mer in the input, it extracts the maximal unitig that contains that k-mer, by extending the k-mer forward (and backward), constructing the suffix (and prefix) of the unitig. This produces an approximate compacted de Bruijn graph. Then using a k-mer counting and minimizer-based policy, it refines the extracted unitigs by removing the false positive k-mers present in the compacted graph.

3 Preliminaries

We consider all strings to be over the alphabet . For some string s, denotes its length. denotes the substring of s from the i’th to the j’th character, inclusive (with 1-based indexing). and denote the suffix and the prefix of s with length , respectively. For two strings, x and y such that , the glue operation is defined as , where is the append operation. A k-mer is a string of length k. For some string x, its reverse complement is the string obtained by reversing the order of the characters in x and complementing each character according to the nucleotide bases’ complementarity. The canonical form of a string x is the string , according to the lexicographic ordering. For a set S of strings and an integer k > 0, the corresponding de Bruijn graph is defined as a bidirected graph with: (i) its set of vertices being exactly the set of canonical k-mers from S; and (ii) two vertices u and v being connected with an edge iff there is some -mer e in S such that u and v are the canonical forms of the k-mers and , respectively, i.e. and (This is the bidirected variant of de Bruijn graphs, applicable practically with the treatment of double-stranded DNA. For a discussion on the simpler directed variant, see Chikhi ). We adopt an edge-centric definition of the de Bruijn graph where an edge exists iff there is some corresponding -mer present in S, as opposed to the node-centric definition where the edges are implicit given the vertices, i.e. an edge exists iff .). A vertex v is said to have exactly two sides, referred to as the front side and the back side . An edge e between two vertices u and v is incident to exactly one side of u and one side of v. These incidence sides are determined using the following rule: If , i.e. is in its canonical form, then e is incident to the back side of u; otherwise it is incident to u’s front side . If , i.e. is in its canonical form, then e is incident to the front side of v; otherwise it is incident to v’s back side . If u = v, then e is said to be a loop. So, an edge e can be defined as a 4-tuple , with s and s being the sides of the vertices u and v, respectively to which e is incident to. The canonical k-mer corresponding to a vertex v is referred to as its label, i.e. . An example illustration of a de Bruijn graph is given in Figure 1a.

Fig. 1.

For G(S, k) in Figure (1a), (edges not listed) is a walk. w1spells the string GACATG. It is not a path as the vertex ATG is repeated here. Whereas the walk is a path, spelling GACAT. Besides, it is a unitig, and also a maximal one as it cannot be extended on either side while retaining itself a unitig. There are four maximal unitigs in the graph (the paths referred with the arrows), with the canonical spellings: CGA, ATGTC, CTAAGA and GAGC. (a) A (bidirected) de Bruijn graph for , with k = 3. The vertices are the canonical k-mers from S, and each edge corresponds to some 4-mer(s) in S. Each pentagon is a vertex, with the flat and the pointy sides (vertically) denoting its front and back, respectively. For each vertex v, the string inside it is label(v), to be read in the visual direction from the front to the back. The string below it is , to be read in the opposite direction. For example, the 4-mers CGAC and GCTC correspond to the edges {CGA, GAC} and {AGC, CTC}, respectively. The edge corresponding to the 4-mer CATG is a loop {ATG, ATG}. (b) The corresponding compacted de Bruijn graph, with each maximal unitig in its canonical form A walk is defined as an alternating sequence of vertices and edges, , such that any two successive vertices and v in w are connected with the edge e (through any side). v0 and v are called the endpoints, and the v for are called the internal vertices of the walk. w is said to enter v using e, and exit v using . For the endpoints v0 and v, w does not have an entering and an exiting edge, respectively. For , w is said to enter v through its front side if e is incident to . Otherwise, it is said to enter v through its back side . For , the terminology for the exiting side is similarly defined using the incidence side of to v.w is said to be input-consistent if, for each of its internal vertices v, the sides of v to which e and are incident are different. Intuitively, an input-consistent walk enters and exits a vertex through different sides. We prove in lemma 1 (see Supplementary Section S2) that the reconstruction (spelling) of the input strings are possible only from input-consistent walks over G(S, k). Therefore, we are only interested in such walks, and refer to input-consistent walks whenever using the term walk onward. For a vertex v in w, let its label be l, i.e. . We say that w sees the spelling s for the vertex v (for ), such that s is l if w enters v from the front, and is otherwise. s0 is defined analogously but using the exiting edge e1 for v0. The spelling of w is defined as . A path is said to be a unitig if , or in G(S, k): each internal vertex v has exactly two incident edges, e and ; and v0 has exactly one edge e1 and v has exactly one edge e incident to the sides respectively through which p exits v0 and enters v. A unitig is said to be maximal if it cannot be extended by a vertex on either side. Figure 1 illustrates examples of walks, paths, their spellings and maximal unitigs in de Bruijn graphs. The compacted de Bruijn graph of the graph G(S, k) is obtained by collapsing each of its maximal unitigs into a single vertex. Figure 1b shows the compacted form of the graph G(S, k) from Figure 1a. The problem of compacting a de Bruijn graph is to compute the set of its maximal unitigs. To ensure that each can be expressed as a sequence tiling of complete maximal unitigs, a unitig needs to be prevented from spanning multiple input strings. We effectuate this through using a slightly altered topology of the de Bruijn graph G(S, k). During construction of , we treat each as , where ϵ denotes the empty symbol. We refer a vertex v as a sentinel if the first or the last k-mer x of some input string corresponds to v. In this occurrence of x, at least one side s of v does not have any incident edge—we refer to these sides as sentinel-sides. Thus, each sentinel-side s has an incident edge that can be encoded with ϵ. The ϵ-encoded edge connects s to a special vertex, say . All such edges connect to an arbitrary but fixed side of . Thus, has one side with 0 incident edges, and another with . Therefore, will be a maximal unitig by itself. Furthermore, for a sentinel v connecting to some vertex u through its sentinel-side s, a unitig containing v cannot extend to u through s, as s has multiple incident edges—restraining unitigs from spanning multiple input strings.

4 Algorithm

4.1 Motivation

Given a set S of strings and an odd integer k > 0, a simple naive algorithm to construct the corresponding compacted de Bruijn graph is to first construct the ordinary de Bruijn graph G(S, k), and then enumerate all the maximal non-branching paths from G(S, k). The graph construction can be performed by a linear scan over each storing information along the way in some graph representation format, like an adjacency list. Enumeration of the non-branching paths can be obtained using a linear-time graph traversal algorithm (Cormen ). However, storing the ordinary graph G(S, k) usually takes an enormous space for large genomes, and there is no simple way to traverse it without having the entire graph present in memory. This motivates the design of methods able to build directly from S without having to construct G(S, k). For some , it is important to note that by definition of edge-centric de Bruijn graphs, each -mer in s is an edge of G(s, k). Therefore, we can obtain a complete walk traversal w(s) over G(s, k) through a linear scan over s, without having to build G(s, k) explicitly. Also, each maximal unitig of the graph is contained as a subpath in this walk, as proven in lemma 2 (see Supplementary Section S2). For the set of strings S, the maximal unitigs are similarly contained in a collection of walks W(S). Thus, to construct efficiently, one approach is to extract off the maximal unitigs from these walks W(S), without building G(S, k). We describe below an asymptotically and practically efficient algorithm that performs this task.

4.2 Flanking vertices of the maximal unitigs

Similar to the TwoPaCo (Minkin ) algorithm, our algorithm is based on the observation that there exists a bijection between the maximal unitigs of G(S, k) and the substrings of the input strings in S whose terminal k-mers correspond to the endpoints of the maximal unitigs. We refer to these endpoint vertices as flanking vertices. This observation reduces the graph compaction problem to the problem of determining the set of the flanking vertices. Once each vertex in G(S, k) can be characterized as either flanking or internal with respect to the maximal unitig containing it, the unitigs themselves can then be enumerated using a walk over G(S, k), by identifying subpaths having flanking k-mers at both ends. Consider a maximal unitig p and one of its endpoints v. Say that v is connected to p through its side , and its other side is . v is a flanking vertex as it is not possible to extend p through while retaining itself a unitig, due to one of the following cases: there are either multiple edges incident to ; or there is exactly one edge incident to , but s has multiple incident edges (Due to the ϵ-encoded edges for sentinel-sides, it is not possible for any side to have zero incident edges—except for the special vertex (see Section 3).). For trivial unitigs (i.e. unitigs with exactly one k-mer), extending the unitig through in both the cases violates the second property of (non-trivial) unitigs; while for non-trivial unitigs, the first property is violated in case (i) and the other one is violated in case (ii). From this, we can observe that the adjacency information of the sides of the vertices can be used to determine the flanking vertices. As per lemma 4 (see Supplementary Section S2), a side of a vertex can have at most four distinct edges. This being finite, the adjacency information (including the presence of an ϵ-encoded edge) can be tracked using some data structure. Through a set of walks over G(S, k), each encountered edge can be recorded for the sides s and s. With another set of walks, the maximal unitigs can then be extracted as per the flanking vertices, determined using the obtained adjacency information. Another approach—adopted in TwoPaCo (Minkin )—is to have an edge list data structure supporting queries, and filling it up with the graph edges (i.e. -mers) with a set of walks. In another set of walks, the flanking vertices can be characterized using incidence queries (four per side), and the maximal unitigs can be extracted on the fly.

4.3 A deterministic finite-state automaton model for vertices

In these characterization approaches of the flanking vertices, a source of redundancy in the information stored is the vertex sides that have multiple incident edges, or are sentinel-sides. By definition, these sides belong to the flanking vertices of the maximal unitigs. So, in a first set of walks over the graph, once a side of a vertex has been determined to have more than one distinct edge, or to be a sentinel-side, the adjacency information for the side becomes redundant: for our purposes, we do not need to be able to enumerate the incident edges for this side, or to differentiate between them. Only the information that this side makes the vertex flanking is sufficient. Complementarily, another source of redundant information is the vertex sides observed to have exactly one incident edge (excluding an ϵ-encoded edge) up-to some point in the walks. For such a side, keeping track of only that single edge is sufficient, instead of having options to track more distinct edges: when a different edge is encountered, or the side is observed to be a sentinel-side, distinguishing between the edges becomes redundant. Thus, we observe that the only required information to keep per side of a vertex is: whether it has exactly one distinct edge (excluding the ϵ-encoded edge), and if so, which edge it is among the four possibilities (as per lemma 4, see Supplementary Section S2); or if it has either multiple distinct incident edges, or an ϵ-encoded edge. Hence, a side can have one of five different configurations: one for each possible singleton edge (excluding the ϵ one), and one for when it has either multiple edges or an ϵ-encoded edge. This implies that each vertex can be in one of different configurations. We refer to such a configuration as a state for a vertex. Figuring out the states of the vertices provides us with enough information to extract the maximal unitigs. To compute the actual state of each vertex v in G(S, k), we treat v as a deterministic finite-state automaton (DFA). Prior to traversing G(S, k), no information is available for v—we designate a special state for such, the unvisited state. Then in response to each incident edge or sentinel k-mer visited for v, an appropriate state-transition is made. Formally, a vertex v in G(S, k) is treated as a DFA , where— States: is the set of all possible 26 states. The 25 actual states (configurations) of the vertices in G(S, k) can be partitioned into four disjoint classes based on their properties, as in Figure 2a.

Fig. 2.

Classes of states of the vertices, and the transition relationships among those. (a) Time taken by each step. (b) Speedup for each step. (a) Four disjoint classes of states of the vertices, based on the properties of their sides. The pictorial shape of the classes correspond to the actual incidence properties of the vertices. For example, the first class of states is for vertices that have exactly one edge incident per side: the edges incident to the front and to the back being encoded with the characters X1 and X2, respectively. There can be different configurations of this shape, and this class contains those 16 states. Whereas the second class is for vertices that have either an ϵ-edge or > 1 distinct edges incident to the front, and one unique edge incident to the back. Due to four possible configurations with this property, this class contains four states. Note that, pictorially, a singular incident edge denotes a unique edge, whereas multiple incident edges mean either >1 edge or an ϵ-edge being incident. (b) Possible transition types between the various classes of the states. For example, consider a state of the class single-in single-out, with the unique edges incident to its front and back being encoded with the characters X1 and X2, respectively. Now, if the state is provided with the input (Y1, Y2), then based on the four different joint outcomes of the conditionals and , the following transitions can happen: 1. : self-transition; 2. : transition to a state of the class multi-in single-out that has X2 at the back; 3. : transition to a state of the class single-in multi-out that has X1 at the front; 4. : transition to the only state of the class multi-in multi-out Input symbols: (where ϵ denotes the empty symbol, used for sentinel-sides) is the set of input symbols for the automaton. In the algorithm, we actually make state-transitions not per edge, but per vertex. That is, for some walk , we process the vertices v. Processing v means checking the edges e and simultaneously and only for v (excluding the edges’ other endpoints); making state transitions for vs automaton as required (This shrinks the state-space for the automata. For the subwalk , if the edges e and are to be processed independent of each other, then during each processing, only one side of v can be seen. This requires each side to have an unvisited state independently, making the state-space size . From lemma 1 (see Supplementary Section S2), the edges e and are incident to different sides of v. Processing them simultaneously for v thus ensures that both the sides are seen together, making one state sufficient to denote the unvisited status’ of both the sides together. This reduces the state-space size to .). Thus, the two incident edges are being used as input for the automaton of v. The edges can be represented succinctly with a pair of characters (c1, c2) for v. c1 and c2 encode the edges incident to vs front and back respectively, in the subwalk (c1 and c2 do not necessarily correspond to e and , in this order; the order can also be the opposite based on the side of entrance of w to v.). Transition function: is the function governing the state-transitions. Figure 2b presents a high-level view of the possible types of transitions between states of the four classes. Supplementary Figure S1 illustrates a detailed overview of the transitions defined for the states as per various input symbols. Initial state: q 0 is the initial state for the automaton, which is the unvisited state. Accept states: is the set of the possible 25 different configurations (i.e. states) for the vertices (Formally, this parameter denotes the set of states reachable from the initial state under certain patterns of the input symbols. As our purpose is not the acceptance of any specific input patterns, rather just to compute the final state of each automaton, we define the accept states as the entire set of possible final states.).

4.4 The Cuttlefish algorithm

For a set S of input strings, the proposed algorithm, , works briefly as follows. Cuttlefish treats each vertex of the de Bruijn graph G(S, k) as a DFA—based on the novel modeling scheme introduced in Section 4.3. First, it enumerates the set K of vertices of G(S, k), applying a k-mer counting algorithm on S. Then it builds a compact hash table structure for K—employing a minimal perfect hash function—to associate the automata to their states (yet to be determined). Having instantiated a DFA M for each vertex , with M being in the initial state q0 (i.e. the unvisited state), it makes an implicit traversal W(S) over G(S, k). For each instance (i.e. k-mer) x of each vertex v visited along W(S), it makes an appropriate state transition for v’s automaton M—using the local information obtained for x, i.e. the two incident edges of x in W(S). Having finished the traversal, the computed final states of the automata correspond to their actual states in the underlying graph G(S, k)—which had not been built explicitly. Cuttlefish then characterizes the flanking vertices of the maximal unitigs in another implicit traversal over G(S, k), using the states information of the automata computed in the preceding step, and extracts the maximal unitigs on the fly. Cuttlefish(S) 1 2 3 4 for each 5 Extract-Maximal-Unitigs(s, h, B) The major components of the algorithm—an efficient associative data structure for the automata and their states, computing the actual states of the automata and extraction of the maximal unitigs from the input strings are discussed in the next subsections. Finally, the correctness of the algorithm is proven in theorem 1 in Supplementary Section S2.

4.5 Hash table structure for the automata

To maintain the (transitioning) states of the automata for the vertices in G(S, k) throughout the algorithm, some associative data structure for the vertices (i.e. canonical k-mers) and their states is required. We design a hash table for this purpose, with a (k-mer, state) key-value pairing. An efficient representation of hash tables for k-mers is a significant data structure problem in its own right, and some attempts even relate back to the compacted de Bruijn graph (Marchet ). In solving the subproblem efficiently for our case, we exploit the fact that the set K of the keys, i.e. canonical k-mers, are static here, and it can be built prior to computing the states. We build the set K efficiently from the provided set S of input strings using the KMC3 algorithm (Kokot ). Afterwards, we construct a minimal perfect hash function (MPHF) h over the set K, employing the BBHash algorithm (Limasset ). A perfect hash function h for a set K of keys is an injective function from K to the set of integers, i.e. for , if , then . Perfect hash functions guarantee that no hashing collisions are made for the keys in K. A perfect hash function is minimal if it maps K to the set . Since we do not need to support lookup of k-mers nonexistent in the input strings, i.e. no alien keys will be queried, an MPHF works correctly in this case. Also, as an MPHF produces no collisions, we do not need to store the keys with the hash table structure for collision resolution—reducing the space requirement for the structure. Although instead of the keys, we need to store the function itself with the structure, it takes much less space than the set of keys. In our setting, the function constructed using BBHash takes bits/k-mer, irrespective of the k-mer size k. Whereas storing the keys would take 2k bits/k-mer. As the hash value h(x) for some key is at most , we use a linear array of size , indexed by h(x), to store the state of the vertices (canonical k-mers) x as the hash table values. We call this array the buckets table B. So, the hash table structure consists of two components: an MPHF h, and an array B. For a canonical k-mer , its associated bucket is in the ’th index of B— stores the state of throughout the algorithm.

4.6 Computing the states and extracting the maximal unitigs

For a set S of input strings with n distinct canonical k-mers and an MPHF h over those k-mers, the algorithm Compute-States(S, h, n) computes the state of each vertex in G(S, k). Initially, it marks all the n vertices (i.e. their automata) with the unvisited state, in a buckets table B. Then it makes a collection of walks over G(S, k)—a walk for each . For each vertex v in w(s), i.e. its associated k-mer instance x in s, it examines the two incident edges of v in w(s), i.e. e and , making appropriate state transition of the automaton of v accordingly. Supplementary Figure S1 provides a detailed overview of the DFA transition function δ. After the implicit complete traversal over G(S, k), the actual states of the vertices (automata) in the graph are computed correctly. For an input string and the MPHF h, the algorithm Extract-Maximal-Unitigs(s, h, B) enumerates all the maximal unitigs present in s, using the states information of the automata present in the buckets table B. The enumeration is done through another implicit complete traversal over G(S, k). For a walk over G(S, k) spelling s, say w enters v through its side s, and the class of the state of v is c (see Fig. 2a for state-classes). Then, v initiates a maximal unitig traversal in w if: v terminates a maximal unitig traversal in w if: ci = multi-in multi-out; or s is the front of v and c = multi-in single-out; or s is the back of v and c = single-in multi-out; or terminates a maximal unitig traversal in w. c = multi-in multi-out; or s is the front of v and c = single-in multi-out; or s is the back of v and c = multi-in single-out; or initiates a maximal unitig traversal in w. The last conditions for initiation and termination do not recurse, i.e. only the first three conditions of the other case are checked for such a cross-referring condition—avoiding circular-reasoning. Supplementary Section S1.2 contains the pseudo-codes of the algorithms Compute-States(S, h, n) and its principal component Process-k-mer, and Extract-Maximal-Unitigs(s, h, B).

4.7 Parallelization scheme

The Cuttlefish algorithm is designed to be efficiently parallelizable on a shared-memory multi-core machine. The first step, i.e. the generation of the set K of canonical k-mers is parallelized in the KMC3 algorithm (Kokot ) itself. The two major steps of it—splitting the collection of k-mers from the input strings into different bins and then sorting and compacting the bins—are highly parallelized. The next step of constructing the MPHF with the BBHash algorithm (Limasset ) is also parallelized by partitioning its input key set K to multiple threads—distributing subsets of keys with some threshold size to the threads as soon as it is read off disk. The next step of computing the states of the vertices is parallelized as follows. For each input string , s is partitioned into a number of uniform sized substrings. Each substring is provided to an available thread, and each thread is responsible to process the k-mers having their starting indices in its substring. Although the threads process disjoint sets of k-mer instances, they query and update the same entry in the hash table structure for identical canonical k-mers. Accessing the MPHF h concurrently from multiple threads is safe, as the accesses are read-only; whereas accessing a hash table entry itself, i.e. an entry into the buckets table B, is not, since all the threads are (read-/write-) accessing the same table B. We ensure that only a single thread can probe and/or update some specific entry at any given time through maintaining a sparse collection of access-locks into B. To ensure low memory usage, we use a bit-packed table for B. As such, even with access-locks, multiple threads may access the same underlying memory-word concurrently while accessing nearby indices. To avoid such data races, we use a thread-safe bit-packed vector (Marçais, 2020). The last step of extracting the maximal unitigs is parallelized similarly, by distributing disjoint substrings of an input string s to different threads, where each thread is responsible to extract each maximal unitig that has its starting k-mer (in the walk spelling s) in its substring. Each maximal unitig is assigned a unique k-mer as its signature, and unique extraction of a maximal unitig is ensured by transitioning its signature k-mer to some special “output”-tagged state of the same state-class when the unitig is extracted first.

4.8 Asymptotics

Given a collection of strings S, let m be the total length of the input strings, and n be the number of distinct k-mers in the input. Then the running time of Cuttlefish is loosely bounded by , where H(k) is the expected time to hash a k-mer by Cuttlefish. Assuming that BBHash takes O(h) time (an expected constant) to hash a machine word of 64-bits, . See Supplementary Section S3.1 for a detailed analysis. The dependence of the running time on both the input size (m) and the variation in the input (expressed through n) is exhibited for an apes dataset in Supplementary Figures S2a–c. The dependence on k is discussed in Section 5.4, with benchmarking in Table 3.

Table 3

Time- and memory-performance benchmarking for the steps of Cuttlefish on the 7 human genomes dataset, across different k

		Build steps (s)			Build Time (s)	Build memory (GB)	Output step (s)		Output memory (GB)
k	Distinct k-mers count (B)	k-mer set construction	MPHF construction	States computation			Unipaths only	GFA2
23	2.39	154	62	762	978	2.67	744	1345	2.82
31	2.59	391	70	791	1252	2.88	737	1203	3.01
61	2.96	439	200	797	1436	3.25	798	831	3.37
91	3.12	1118	311	830	2259	3.42	806	860	3.49
121	3.24	1483	902	841	3226	3.55	850	820	3.62

Note: The running times are in seconds, and the maximum memory usages are in gigabytes.

The maximum memory usage of Cuttlefish is completely defined by the space requirement of the hash table structure. In our setting, the MPHF takes bits/k-mer (For further memory savings, this can be reduced to 3-bits/k-mer, trading off the speed of the hash function.). Each vertex (canonical k-mer) of G(S, k) can be in one of 26 different states (the state-space size for the DFA is ). At least bits are necessary to represent such a state uniquely. Thus, the buckets table consume 5n bits in total. Therefore, the maximum memory usage of the algorithm is bits—translating to roughly a byte per distinct k-mer (The explicit adjacency information based algorithm outlined in Section 4.2 requires 5-bits per side: one for each possible edge, and a sentinel marker—making 2 × 5 = 10 bits to be required for each vertex. Thus, the DFA model makes the memory savings 50% for the hash buckets (10 versus 5 bits), and 36% as a whole (13.7 versus 8.7 bits).). This linear relationship between the memory usage and the distinct k-mers count is illustrated for a human and an apes dataset at Supplementary Figures S2b and d. See Supplementary Section S3.2 for a detailed analysis.

5 Results

We evaluated the performance of Cuttlefish compared to other state-of-the-art tools for constructing (colored) compacted de Bruijn graphs from whole genome references. See Supplementary Section S4.1 for a discussion on the definition of ‘color’ that we adopt here. We also assessed its scaling properties, and effects of the input structure on performance. Experiments are performed on a server with an Intel Xeon CPU (E5-2699 v4) with 44 cores and clocked at 2.20 GHz, 512 GB of memory, and a 3.6 TB Toshiba MG03ACA4 HDD. The runtime and the peak resident memory usage statistics are obtained using the GNU time command.

5.1 Dataset characteristics

We use a varied collection of datasets to benchmark Cuttlefish in evaluating its performance on diverse input characteristics. First, we assess its performance on single input genomes. We use individual references of (i) a human (Homo sapiens, 3.2 Gbp), (ii) a western gorilla (Gorilla gorilla, 3 Gbp) and (iii) a sugar pine (Pinus lambertiana, 27.6 Gbp). We also evaluate its performance in building (colored) compacted de Bruijn graphs, i.e. for collections of references as inputs. We use a number of genome collections exhibiting diverse structural characteristics: (i) 62 E. coli (Escherichia coli) ( 310 Mbp), a dataset with small bacterial genomes; (ii) 7 Humans (Homo sapiens) (21 Gbp), very closely related moderate-sized mammalian genomes from the same species; (iii) 7 Apes (Hominoid) (18 Gbp), related moderate-sized mammalian genomes from the same order (Primate) and superfamily, but different species; and (iv) 11 Conifers (Pinophyta) (204 Gbp), related huge plant genomes from the same order (Pinales), but different species. The E. coli dataset with 62 strains (available at NCBI) was used in benchmarking SplitMEM (Marcus ). The human dataset was used in benchmarking a BWT-based algorithm (Baier ) for compacted de Bruijn graph construction, and includes five different assemblies of the human genome (from the UCSC Genome Browser), as well as the maternal and paternal haplotype of an individual NA12878 (Utah female), from the 1000 Genomes Project. The ape dataset includes seven available references from the orangutan and the chimp genera, a western gorilla, a human and a bonobo. The conifer dataset consists of nine references belonging to the pine (Pinaceae) family: from the pine, the spruce and the larch genera, and a douglas fir; and two references from the redwoods (Sequoioideae) subfamily. We assembled the ape and the conifer datasets from the GenBank database (Sayers ) of NCBI. We also assessed Cuttlefish’s performance on compacting a huge number of closely related genomes. For such, we used the 93 human references generated using the FIGG genome simulator (Killcoyne and Sol, 2014), that was used to benchmark TwoPaCo (Minkin ). Coupled with the previous 7 human genomes, this gives us a dataset of 100 human references (322 Gbp).

5.2 Benchmarking comparisons

We benchmarked Cuttlefish against three other implementations of whole genome reference de Bruijn graph compaction algorithms: Bifrost (Holley and Melsted, 2020), deGSM (Guo ) and TwoPaCo (Minkin ). While TwoPaCo, like Cuttlefish, is specialized to work with reference sequences, Bifrost and deGSM are also capable of constructing the compacted graph from short-read sequences as well. We compare against Bifrost and deGSM using their appropriate settings for construction from references. We also note that among the tools, only Bifrost constructs the compacted de Bruijn graph without using any intermediate disk space. See Supplementary Section S4.2 for a discussion on the disk space usage of the benchmarked tools. All these tools have multi-threading capability. deGSM has a max-memory option, and it is set to the best memory-usage obtained from the other tools. The rest of the tools are run without any memory restrictions. All the results reported for each tool are produced with a warm cache. Tables 1 and 2 contain a summary of the comparison results. Note that, the benchmark performances include all the steps of the pipelines to construct the compacted reference de Bruijn graphs, which for deGSM and Cuttlefish include the k-mer counting.

Table 1.

Time- and memory-performance benchmarking for compacting single input reference de Bruijn graphs

			Bifrost		deGSM		TwoPaCo		Cuttlefish
Dataset	Thread- count	k	Build	Output	Build	Output	Build	Output	Build	Output
Human	1	31	04:54:50 (27.23)	15:18	01:54:41 (37.94)	25:06 (9.79)	01:13:19 (4.15)	39:38 (4.50)	32:59 (2.79)	19:23 (2.84)
		61	05:16:51 (50.19)	01:49	02:20:57 (84.16)	21:37 (8.77)	01:10:18 (6.02)	12:25 (4.35)	38:21 (3.06)	15:37 (3.08)
	8	31	01:33:54 (27.23)	03:59	25:20 (37.94)	05:37 (9.80)	12:57 (5.04)	—	05:49 (2.79)	05:13 (2.92)
		61	01:20:28 (50.18)	00:40	47:52 (84.16)	03:55 (8.80)	11:28 (5.46)	—	07:45 (3.06)	03:20 (3.18)
	16	31	01:24:40 (27.24)	03:30	18:19 (37.94)	03:56 (9.80)	06:24 (5.57)	—	03:26 (2.79)	02:57 (2.93)
		61	01:12:33 (50.18)	00:52	46:34 (84.16)	02:35 (8.80)	07:12 (5.55)	—	04:23 (3.06)	01:54 (3.19)
Gorilla	1	31	05:44:10 (28.08)	16:30	01:34:29 (37.94)	24:26 (9.75)	01:00:15 (5.04)	43:25 (4.49)	31:46 (2.74)	17:07 (2.77)
		61	05:31:06 (50.13)	02:05	02:11:33 (84.16)	22:03 (8.94)	01:11:29 (5.83)	17:52 (4.30)	38:15 (3.02)	15:59 (3.03)
	8	31	02:06:52 (28.08)	03:44	28:52 (37.94)	05:43 (9.76)	13:02 (5.82)	—	05:30 (2.74)	04:37 (2.87)
		61	01:24:21 (50.13)	00:54	47:45 (84.16)	03:59 (8.98)	10:03 (6.00)	—	07:58 (3.02)	02:54 (3.12)
	16	31	01:50:26 (28.08)	02:59	20:47 (37.94)	04:07 (9.76)	07:29 (5.52)	—	03:13 (2.74)	03:25 (2.87)
		61	01:10:06 (50.13)	04:04	38:45 (84.16)	02:40 (8.98)	06:24 (6.09)	—	04:29 (3.02)	02:06 (3.14)
Sugar pine	16	31	22:18:24 (229.17)	01:20:51	09:29:24 (145.23)	01:10:55 (119.18)	01:49:01 (61.93)	—	51:30 (14.24)	01:56:52 (14.28)
		61	X (364.25)	—	X (166.54)	—	01:26:39 (64.86)	—	03:14:44 (20.88)	01:26:26 (20.90)

Note: Each cell contains the running time in wall clock format, and the maximum memory usage in gigabytes, in parentheses. The output steps report the compacted graph in the GFA2 format. The best value with respect to each metric in each row is highlighted.

Bifrost builds the compacted graph and outputs it using the same command; we could split the timing of the steps but were unable to tease apart the maximum memory for the output step. The discrepancy between the memory usage of deGSM and its memory-limit input parameter, , is attributable to their initial k-mer enumeration step—run internally by deGSM using the Jellyfish tool (Marçais and Kingsford, 2011), with parameters set by deGSM—these resources must be accounted for as the input for the problem is a set of references (from which deGSM first produces a k-mer database, much like Cuttlefish). TwoPaCo takes a logarithmic filter-size parameter f as input, and f is critical to the performance. It uses bytes of memory for a bloom filter in the first-pass, which significantly affects the memory usage in the second-pass. We used f = 35 in both k = 31 and k = 61 for human and gorilla; and f = 38 in k = 31 and f = 39 in k = 61 for sugar pine. We have set f such that the maximum memory usage is minimized, by first approximating its optimal value, and then trying it with a few of the nearby values. The best executions found (w.r.t. memory) are reported. Also, the output step of TwoPaCo is single-threaded, and the dashes in their output column indicate this inapplicability of multi-threading. The cells with X indicate abnormal program terminations—Bifrost ran out of memory (with std::bad_alloc), and deGSM had a segmentation fault. The peak memory usages until the point of termination are reported.

Table 2

Time- and memory-performance benchmarking for compacting colored de Bruijn graphs (i.e. multiple input references) for k = 31, using 16 threads

Dataset	Total genome-length (bp)	Distinct k-mers count	Bifrost	deGSM	TwoPaCo	Cuttlefish
62 E.coli	310 M	24 M	1 (0.47)	1 (3.34)	1 (0.80)	1 (0.96)
7 Humans	21 G	2.6 B	95 (29.06)	30 (37.94)	62 (6.14)	21 (2.88)
7 Apes	18 G	7.1 B	294 (100.25)	172 (145.23)	59 (28.87)	25 (7.42)
11 Conifers	204 G	82 B	—	—	981 (288.99)	525 (84.12)
100 Humans	322 G	28 B	—	—	1395 (126.25)	523 (28.75)

Note: Each cell contains the running time in minutes, and the maximum memory usage in gigabytes, in parentheses. The output step is excluded from executions. The best value with respect to each metric in each row is highlighted.

The filter-sizes for the TwoPaCo executions are set as described in Table 1. Dashed cells in the Bifrost and the deGSM columns indicate that the experiments were not performed, as it is anticipated that insufficient memory would be available given their memory usages for smaller datasets (w.r.t. k-mer count).

Time- and memory-performance benchmarking for compacting single input reference de Bruijn graphs Note: Each cell contains the running time in wall clock format, and the maximum memory usage in gigabytes, in parentheses. The output steps report the compacted graph in the GFA2 format. The best value with respect to each metric in each row is highlighted. Bifrost builds the compacted graph and outputs it using the same command; we could split the timing of the steps but were unable to tease apart the maximum memory for the output step. The discrepancy between the memory usage of deGSM and its memory-limit input parameter, , is attributable to their initial k-mer enumeration step—run internally by deGSM using the Jellyfish tool (Marçais and Kingsford, 2011), with parameters set by deGSM—these resources must be accounted for as the input for the problem is a set of references (from which deGSM first produces a k-mer database, much like Cuttlefish). TwoPaCo takes a logarithmic filter-size parameter f as input, and f is critical to the performance. It uses bytes of memory for a bloom filter in the first-pass, which significantly affects the memory usage in the second-pass. We used f = 35 in both k = 31 and k = 61 for human and gorilla; and f = 38 in k = 31 and f = 39 in k = 61 for sugar pine. We have set f such that the maximum memory usage is minimized, by first approximating its optimal value, and then trying it with a few of the nearby values. The best executions found (w.r.t. memory) are reported. Also, the output step of TwoPaCo is single-threaded, and the dashes in their output column indicate this inapplicability of multi-threading. The cells with X indicate abnormal program terminations—Bifrost ran out of memory (with std::bad_alloc), and deGSM had a segmentation fault. The peak memory usages until the point of termination are reported. Time- and memory-performance benchmarking for compacting colored de Bruijn graphs (i.e. multiple input references) for k = 31, using 16 threads Note: Each cell contains the running time in minutes, and the maximum memory usage in gigabytes, in parentheses. The output step is excluded from executions. The best value with respect to each metric in each row is highlighted. The filter-sizes for the TwoPaCo executions are set as described in Table 1. Dashed cells in the Bifrost and the deGSM columns indicate that the experiments were not performed, as it is anticipated that insufficient memory would be available given their memory usages for smaller datasets (w.r.t. k-mer count). We do not compare against compaction algorithms designed for constructing the compacted de Bruijn graph only for sequencing data. These tools usually filter k-mers and alter the topology of the resulting de Bruijn graph based on certain conditions (branch length, path coverage, etc.), and these filters cannot always be finely tuned. Thus, even if ideally configured, these approaches will produce (potentially non-trivially) different compacted graphs. More importantly, however, methods that explicitly build the compacted de Bruijn graph from reference sequences can adopt or even center their algorithms around certain optimizations that are not available to methods building the compacted de Bruijn graph from reads—it therefore seems most fair not to compare methods that do not explicitly support the compacted de Bruijn graph construction from references against methods that explicitly support or are explicitly designed for this purpose. This is evident when evaluating methods that construct compacted de Bruijn graphs from reference sequences for other purposes (Minkin and Medvedev, 2020). In such cases, even authors who have worked on both state-of-the-art construction methods for compacted reference- (Minkin ) and read- (Chikhi ) de Bruijn graphs select the former over the latter. We verified the correctness of the produced unitigs by designing a validator, that confirms: (i) the unique existence of each k-mer in the output unitigs collection [the set of maximal unitigs is unique and forms a node decomposition of the graph (Chikhi )]; and (ii) the complete spelling-ability of the set of input references by the unitigs (i.e. each input reference is correctly constructible from the compacted graph). A direct correspondence between the outputs of the different tools is not straightforward. See Supplementary Section S4.3 for a detailed discussion. The command line configurations for executing the tools, and the dataset sources are present in Supplementary Section S5.

5.3 Parallel scalability

To assess the scalability of Cuttlefish across varying number of processor-threads, we used the human genome and set k = 31, and executed Cuttlefish using 1–32 processor threads. Figures 3a and b show the scaling performance charts.

Fig. 3.

Scalability metrics of Cuttlefish for varying number of threads, using k = 31 for the human genome

Scalability metrics of Cuttlefish for varying number of threads, using k = 31 for the human genome Generation of the set of keys (i.e. k-mers) using KMC3 (Kokot ) is very fast for individual references, and the timing is quite low to begin with—the speedup is almost linear up-to around eight threads, and then saturates afterwards. The MPHF (minimal perfect hash function) construction does not scale past 24 threads; which we perceive as due to limitations involving bottlenecks in disk-access throughput, and memory access bottlenecks associated to the increasing cache misses in BBHash (Limasset ). The next two steps: computing the states of the vertices and extracting the maximal unitigs, both scale roughly linearly all the way up-to 32 threads. For the extraction step, we chose to report each maximal unitig, and skipped outputting the edges for the compacted graph in some GFA format.

5.4 Scaling with k

Next, we assess Cuttlefish’s performance with a range of different k-values. We use the 7 human genomes dataset for this experiment, and use 16 threads during construction. Table 3 shows the performance of each step with varying k-values. Time- and memory-performance benchmarking for the steps of Cuttlefish on the 7 human genomes dataset, across different k Note: The running times are in seconds, and the maximum memory usages are in gigabytes. We observe that the k-mer set construction step slows down with the increasing k, which may be attributable to overhead associated with the KMC3 algorithm (Kokot ) in processing larger k-mer sizes. Although the count of distinct k-mers grows slowly with increasing k, the MPHF construction slows down. Since the MPHF is constructed by iterating over the k-mer set on disk, the decreased speed in this step is mostly related to disk-access throughput. The steps involving graph traversals, computing the vertices’ states and extracting the maximal unitigs, are affected little in their running time by altering the value of k. Outputting the compacted graph in the GFA2 (and in GFA1) format takes much longer than outputting only the unitigs for lower values of k, due to structural properties of the compacted graph (see Supplementary Section S4.4 for a discussion). The memory usage by Cuttlefish is directly proportional to the distinct k-mers-count n, which typically increases with k. Thus, the maximum memory usage also grows with k. Referring back to Section 4.8, the running time of Cuttlefish is , and its memory usage is . Table 3 demonstrates this asymptotic increase in running time with k (also with effects from n), and the increase in memory usage with n (which typically grows with k).

5.5 Effects of the input structure

Next, we evaluate the effects of some of the structural characteristics of the input genomes on the performance of Cuttlefish. Specifically, we assess the impact of the genome sizes (total reference length, m) and their structural variations (through distinct k-mers count, n) on the time and memory consumptions of Cuttlefish. The running time is and the memory usage is (see Section 4.8). The input size determines the total number of k-mers to be processed at the k-mer set construction, vertices’ states computation and the maximal unitigs extraction steps. The variation in the input determines the performance of the MPHF construction, and indirectly affects the states computation and the unitigs extraction steps through their use of the hash table structure. We use the 7 humans and the 7 apes datasets with k = 31 for such, using 16 processor threads. References are incrementally added to the input set, and the performances are measured for each input instance. The benchmarks include both the steps of building the compacted graph and extracting the maximal unitigs, and are illustrated in Supplementary Figure S2. We observe that the running time varies both with the input size and with the structural variation. For the humans dataset, the distinct k-mers count does not increase much from one reference to seven: the increase is just 5%. This is because the dataset contains references from the same species, hence the genomes are very closely related. Thus, the effect of the distinct k-mers count remains roughly similar on all the instances of the dataset. The increase in running time is almost completely dominated by the total workload, i.e. the total size of the genomes. For the apes dataset, however, the genomes are not from the same species, and the variations in the genomes contribute to increasing the distinct k-mers count to 294% from one reference to seven. Thus, the running time on this dataset increases with the total genome size, with additional non-trivial effects from the varying k-mers count. On the other hand, the memory usage by the algorithm is completely determined by the distinct k-mers count, with no effect from the input size. As described in Section 4.8, the memory usage of the algorithm is constant per distinct k-mer, taking 8.7 bits/k-mer. Supplementary Figures 2b and d conform to the theory—the shapes of the memory consumption and the distinct k-mers count plots are identical.

6 Conclusion

We present a highly scalable and very low-memory algorithm, Cuttlefish, to construct the (colored) compacted de Bruijn graph for collections of whole genome references. It models each vertex of the original graph as a deterministic finite-state automaton; and without building the original graph, it correctly determines the state of each automaton. These states characterize the vertices of the graph that flank the maximal unitigs (non-branching paths), allowing efficient extraction of those unitigs. The algorithm is very scalable in terms of time, and it uses a small and constant number of bits per distinct k-mer in the dataset. Besides being efficient for medium and large-scale genomes, e.g. common mammalian genomes, the algorithm is highly applicable for huge collections of (very) large genomes. For example, Cuttlefish constructed the compacted de Bruijn graph for 100 closely related human genomes of total length 322 Gbp in <9 h, taking just 29 GB of memory. For 11 conifer plants that are from the same taxonomic order, each with very large individual genomes and having a total genome length of 204 Gbp, Cuttlefish constructed the compacted graph in <9 h, using 84 GB of memory. For these datasets, the next best method required more than 23 h using 126 GB of memory, and more than 16 h using 289 GB of memory, respectively. The improvement in performance over the state-of-the-art tools stems from the novel modeling of the graph vertices as deterministic finite-state automata. The core data structure is a fast hash table, designed using a minimal perfect hash function to assign k-mers to indices, and a bit-packed buckets table storing succinct encodings of the states of the automata. This compact data structure obtains a memory usage of 8.7 bits/k-mer, leading the algorithm to greatly outperform existing tools at scale in terms of memory consumption, while being equally fast if not faster. For scenarios with further memory savings requirements, the memory usage can be reduced to 8 bits/k-mer, trading off the speed of the hash function. The algorithm is currently only applicable for whole genome references. The assumption of the absence of sequencing errors in these references makes complete walk traversals over the original de Bruijn graph possible without having the graph at hand. This is not the case for short-read sequences, and the sequencing errors make such implicit traversals difficult. A significant line of future research is to extend Cuttlefish to be applicable on raw sequencing data. On repositories with large databases containing many genome references, Cuttlefish can be applied very fast in an on-demand manner, as follows. First, one can build and store a set of hash functions, each one over some class of related genome references. This consists of the first two steps of the algorithm and is a one-time procedure, containing the bottleneck part of the algorithm. Then, whenever some set of references is to be compacted, the hash function of the appropriate super-class can be loaded into memory, and the algorithm then executes only the last two steps, which are quite fast and scalable. This works correctly because the sets of keys that these hash functions are built upon are supersets of the sets of keys being used later for the construction. Cuttlefish provides the option to save and load hash functions, making the scheme feasible. As the number of sequenced and assembled reference genomes increases, the (colored) compacted de Bruijn graph will likely continue to be an important foundational representation for comparative analysis and indexing. To this end, Cuttlefish makes a notable advancement in the number and scale of the genomes to build compacted de Bruijn graphs upon. The algorithm is fast, highly parallelizable and very memory-frugal; and we provide a comprehensive analysis, both theoretical and experimental, of its performance. We believe that the progression will further improve the role of the de Bruijn graph in comparative genomics, computational pan-genomics and sequence analysis pipelines; also facilitating novel biological studies—especially for large-scale genome collections that may not have been possible earlier. Click here for additional data file.

42 in total

1. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing.

Authors: Anton Bankevich; Sergey Nurk; Dmitry Antipov; Alexey A Gurevich; Mikhail Dvorkin; Alexander S Kulikov; Valery M Lesin; Sergey I Nikolenko; Son Pham; Andrey D Prjibelski; Alexey V Pyshkin; Alexander V Sirotkin; Nikolay Vyahhi; Glenn Tesler; Max A Alekseyev; Pavel A Pevzner
Journal: J Comput Biol Date: 2012-04-16 Impact factor: 1.479

2. De novo assembly of human genomes with massively parallel short read sequencing.

Authors: Ruiqiang Li; Hongmei Zhu; Jue Ruan; Wubin Qian; Xiaodong Fang; Zhongbin Shi; Yingrui Li; Shengting Li; Gao Shan; Karsten Kristiansen; Songgang Li; Huanming Yang; Jian Wang; Jun Wang
Journal: Genome Res Date: 2009-12-17 Impact factor: 9.043

3. Fast de Bruijn Graph Compaction in Distributed Memory Environments.

Authors: Tony Pan; Rahul Nihalani; Srinivas Aluru
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2018-07-31 Impact factor: 3.710

4. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs.

Authors: Jason Pell; Arend Hintze; Rosangela Canino-Koning; Adina Howe; James M Tiedje; C Titus Brown
Journal: Proc Natl Acad Sci U S A Date: 2012-07-30 Impact factor: 11.205

5. Sequence of the Sugar Pine Megagenome.

Authors: Kristian A Stevens; Jill L Wegrzyn; Aleksey Zimin; Daniela Puiu; Marc Crepeau; Charis Cardeno; Robin Paul; Daniel Gonzalez-Ibeas; Maxim Koriabine; Ann E Holtz-Morris; Pedro J Martínez-García; Uzay U Sezen; Guillaume Marçais; Kathy Jermstad; Patrick E McGuire; Carol A Loopstra; John M Davis; Andrew Eckert; Pieter de Jong; James A Yorke; Steven L Salzberg; David B Neale; Charles H Langley
Journal: Genetics Date: 2016-10-28 Impact factor: 4.562

6. An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search.

Authors: Fatemeh Almodaresi; Prashant Pandey; Michael Ferdman; Rob Johnson; Rob Patro
Journal: J Comput Biol Date: 2020-03-16 Impact factor: 1.479

7. Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage.

Authors: Guillaume Holley; Roland Wittler; Jens Stoye
Journal: Algorithms Mol Biol Date: 2016-04-14 Impact factor: 1.405

8. ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads.

Authors: Iain Maccallum; Dariusz Przybylski; Sante Gnerre; Joshua Burton; Ilya Shlyakhter; Andreas Gnirke; Joel Malek; Kevin McKernan; Swati Ranade; Terrance P Shea; Louise Williams; Sarah Young; Chad Nusbaum; David B Jaffe
Journal: Genome Biol Date: 2009-10-01 Impact factor: 13.583