
Compact representation of k-mer de Bruijn graphs for genome read assembly.

Abstract

BACKGROUND: Processing of reads from high throughput sequencing is often done in terms of edges in the de Bruijn graph representing all k-mers from the reads. The memory requirements for storing all k-mers in a lookup table can be demanding, even after removal of read errors, but can be alleviated by using a memory efficient data structure.
RESULTS: The FM-index, which is based on the Burrows-Wheeler transform, provides an efficient data structure providing a searchable index of all substrings from a set of strings, and is used to compactly represent full genomes for use in mapping reads to a genome: the memory required to store this is in the same order of magnitude as the strings themselves. However, reads from high throughput sequences mostly have high coverage and so contain the same substrings multiple times from different reads. I here present a modification of the FM-index, which I call the kFM-index, for indexing the set of k-mers from the reads. For DNA sequences, this requires 5 bit of information for each vertex of the corresponding de Bruijn subgraph, i.e. for each different k-1-mer, plus some additional overhead, typically 0.5 to 1 bit per vertex, for storing the equivalent of the FM-index for walking the underlying de Bruijn graph and reproducing the actual k-mers efficiently.
CONCLUSIONS: The kFM-index could replace more memory demanding data structures for storing the de Bruijn k-mer graph representation of sequence reads. A Java implementation with additional technical documentation is provided which demonstrates the applicability of the data structure (http://folk.uio.no/einarro/Projects/KFM-index/).


Year: 2013 PMID: 24152242 PMCID: PMC4015147 DOI: 10.1186/1471-2105-14-313

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

High throughput sequencing is generating huge amounts of sequence data even from single experiments. The raw sequence data will typically be too much to keep in the memory of most off-the-shelf computers, and with sequencing technologies progressing faster than the improvements in computer memory, the memory challenge is likely to increase in the future. One key property of the raw sequencing data is that it is highly redundant. Genomes are usually sequenced at high coverage, which means there will frequently be at least 30–50 reads covering the same region of the genome, differing primarily by sequencing errors. Processing of sequencing reads for genome assembly usually involves two crucial steps: error correction to remove or correct sequencing errors, and assembly of overlapping reads to produce a smaller number of assembled sequences.

A common approach for simplifying the processing of the sequence data is to consider all the k-mers of the reads: i.e. all the k-substrings of the reads if we view them as strings. This set of k-strings is then thought of as a subgraph of the de Bruijn graph of order k − 1: i.e. one which has vertices corresponding to all (k − 1)-substrings and edges corresponding to the k-substrings. Even if sequenced at high coverage, each k-mer is thus represented only once, reducing the redundancy of the sequence data considerably. However, direct storage of all k-mers in a single list will require k letters per k-mer, i.e. 2k bit of information for DNA sequences, which can be quite memory consuming when k is large.

Naively, one might expect that this could be greatly improved. From each vertex in the graph, there may be 4 possible out-going (or in-coming) edges if the graph represents DNA sequences: one for each of the nucleotides.
Encoding which of these exist in the graph should require only 4 bit of information per vertex; if most vertices have only one out-edge, this might even be reduced towards 2 bit of information per vertex by only encoding which of the 4 possible edges is actually found. Of course, this approach requires that the vertices be known, but one might envision that the information about the vertices could be reconstructed when walking the graph: when walking k−1 steps, all k−1 letters of the resulting vertex will be known. A traversable representation of the de Bruijn subgraph is equivalent to storing a searchable index of all the k-substrings. By traversable, I mean that it is possible to efficiently walk the graph starting at any vertex, to check if any k-mer or k−1-mer is present as an edge or vertex in the graph, and preferably also to be able to retrieve the k-mers and k−1-mers represented by the graph. Thus, it is not only important that the data structure be compact, but efficient algorithms for using it are just as important. A number of data structures exist that provide more compact storage of the de Bruijn subgraph than naive k-mer lists or maps. Conway et al. [1] were able to represent a de Bruijn subgraph with 12 G edges in 40.8 GB, i.e. 28.5 bit per edge, by using a compressed array. Other approaches reduce memory by storing only a subset of the k-mers [2-4]. An entirely different approach uses a Bloom filter to store a hashed set of k-mers [5] using only 4 bit per k-mer. This is a probabilistic data structure with a known false positive rate, but where false positive edges can be identified by not being part of longer paths. However, while this data structure is effective for checking if a k-mer is contained in the graph, it does not easily allow listing of all vertices or edges. An enhancement of this method, Minia [6], avoids critical false positives and also allows retrieval of all vertices, but at the cost of higher memory consumption. 
Another memory-efficient solution uses the FM-index [7], which is based on the Burrows–Wheeler transform [8] used to represent a suffix array [9], to store the collection of reads in a compressed form [10]. The Burrows–Wheeler transform was originally developed for text compression and has the property that recurrent substrings in the text before the transform result in single-letter repeats in the transformed string. The FM-index adds auxiliary information on top of the Burrows–Wheeler transformed sequence that effectively turns it into a compactly stored suffix array. When concatenating the reads, the coverage makes the Burrows–Wheeler transformed sequence dominated by single-letter repeats which are highly compressible [10]. Effectively, it requires 2 bit per edge to store the nucleotide, which corresponds to specifying the in-edge (or out-edge) of a vertex, and additional memory to store the run-length of the nucleotide, which corresponds to the k-mer count. At least up to 50 times coverage, this data structure should be able to store one edge per byte if used to represent the de Bruijn subgraph of k-mers. It should be noted that the ability of different methods to handle read errors varies. Some of the cited methods are intended to perform error correction by filtering k-mers by their frequency, while other methods assume that read errors for the most part have been corrected or excluded in advance. I here provide a data structure with strong similarities to the FM-index, but which stores the de Bruijn subgraph representing the k-mer substrings rather than entire sequences. It is based on the idea of storing for each vertex which of the possible in-coming edges are actually present. For each vertex it thus needs one bit of information per letter in the alphabet, i.e. 4 bit per vertex for DNA sequences, plus some additional data. 
The additional data consists of a grouping of vertices which requires one extra bit per vertex, plus the equivalent to the FM-index for mapping in-coming edges to their parent vertices. This version of the FM-index, which I call the kFM-index since it applies to an index of k-substrings, can be generated from the stored data, but for computational speed a subset of the index is kept in memory. All in all, a de Bruijn subgraph for DNA sequences, including the stored subset of the index, can be stored using 5–6 bits per vertex if memory consumption is critical. In the case where most vertices are of degree 1, i.e. have one in-edge and one out-edge, the stored data may be compressed down to approximately half the size.

Like the FM-index, the kFM-index stores only one strand of DNA sequences, and is suitable for walking the graph in one direction. For genome assembly, one does not know in advance which strand the read is on, and so one is normally required to ensure that both the k-mers of the reads and their reverse complements are added to the graph. Some data structures, e.g. most hashing strategies, can combine k-mers and their reverse complements, and thus require roughly half the number of items. For the kFM-index, however, it is necessary to add both the reads and their reverse complements. In doing this, one may walk in the opposite direction by switching to the reverse complement, although there will be some computational overhead in doing so.

The basic operations available on the kFM-index are similar to those of the FM-index. Each vertex is identified by its index position, i = 0,1,…,n−1, where n is the number of vertices in the de Bruijn subgraph and the vertices are lexicographically ordered. For a given string, the vertices having that string as a prefix, identified by the interval of index positions, can be found efficiently: the computational time is proportional to the length of the string. Given a vertex, identified by its index position i, one can look up directly in the stored data which in-coming edges exist for that vertex. The index positions of the vertices from which the in-edges come can be computed efficiently. Thus, checking if a string exists as a path in the de Bruijn subgraph can be done. The reverse operation of identifying the string representation of a given vertex identified by index position i also exists, but is slower: time complexity is O(k lg n).

The kFM-index can be generated directly from a sorted list of in-edges, which is appropriate for amounts of sequence data that fit into the computer memory, although it should also be feasible to extend this by sorting the in-edges on disk: the time complexity is O(Nk lg σ lg N) where N is the total length of the sequence data, and thus the number of items to be sorted, and Nk lg σ is the amount of data being sorted. Generation of kFM-indexes in memory from sequentially read sequence data can be done by splitting the raw sequence data into parts, generating a kFM-index for each part, and then performing pairwise merges of these kFM-indexes. The time complexity of generating the kFM-index in this manner is essentially O(Nkσ lg(nm)), where n is the number of vertices in the final de Bruijn graph (i.e. not counting identical k-mers), σ is the alphabet size, and m is the number of parts the initial sequence data is partitioned into. This has proven to be quite time consuming: in part because of the time complexity of the provided merge algorithm, but probably also in part due to an inefficient implementation. I expect that there is room for major improvements. In addition, these operations are all open to parallelisation.

Readers familiar with the FM-index will see the similarities to it, despite the fact that the FM-index represents all suffixes while this new data structure only stores information about k-substrings.
Not only is the data structure very similar, but the functions and algorithms are also similar, or at least analogous, to those used with the FM-index. I therefore refer to this data structure as the kFM-index: an FM-index for k-substrings. And instead of pointing out the similarities throughout the article, I will point out differences where these are noteworthy. A Java implementation of the data structure is provided as a demonstration.
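
The requirement that both the reads and their reverse complements be added can be illustrated with a minimal Python sketch. The function names `reverse_complement` and `both_strand_kmers` and the toy read are assumptions made for illustration; they are not taken from the Java implementation.

```python
# Complement table for DNA letters (hypothetical helper, illustration only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def both_strand_kmers(reads, k):
    """Collect the k-mers of each read and of its reverse complement."""
    kmers = set()
    for read in reads:
        for strand in (read, reverse_complement(read)):
            for p in range(len(strand) - k + 1):
                kmers.add(strand[p:p + k])
    return kmers

kmers = both_strand_kmers(["AACG"], 3)   # toy read, not real data
print(sorted(kmers))
# The resulting set is closed under reverse complement, which is what
# makes walking in the opposite direction possible.
print(all(reverse_complement(x) in kmers for x in kmers))
```

A structure that merges each k-mer with its reverse complement would store roughly half as many items, which is the trade-off described above.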

Methods

Notation

Let Σ denote an alphabet of size σ = |Σ|, i.e. an arbitrary set whose elements we refer to as letters: for DNA sequences, Σ = {A,C,G,T} and σ = 4. A string of length l, or an l-string, is an element x ∈ Σ^l. Let Σ* = ⋃_{l≥0} Σ^l denote the set of all strings, including the empty string denoted ε. We denote the length of the string x by |x|. If x and y are strings, xy denotes the concatenated string of length |x|+|y|; for sets U and V of strings, the set of concatenated strings is denoted U∘V = {uv ∣ u ∈ U, v ∈ V}. We write x < y when x precedes y in the lexicographic order.

If x is an l-string, we write x = x_1…x_l where x_i ∈ Σ are the letters. For p ≤ q, the [p,q] substring x[p,q] = x_p…x_q is a string of length q−p+1; x[p,p−1] is the empty string. A substring x[1,q] at the start is referred to as a prefix, while a substring x[p,l] at the end is referred to as a suffix. The operation of trimming away the last letter is denoted x^− = x[1,l−1] = x_1…x_{l−1}.

For a list of strings S, i.e. with each s ∈ S an element of Σ*, let S_k denote the set of length k substrings: i.e. x ∈ Σ^k is contained in S_k if and only if there is a string s ∈ S with x = s[p,p+k−1] for some position p.

We denote the base 2 logarithm by lg x = log_2 x, which is convenient for quantifying information. Thus, the information required to specify one out of n options is lg n bit.
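
As a small illustration of this notation, the following Python sketch computes the set of length k substrings of a list of strings and the trimming operation. The names `substrings` and `trim` are hypothetical, and Python slices are 0-based and half-open, whereas the substring notation above is 1-based and inclusive.

```python
from math import log2

def substrings(strings, k):
    """The set of all length-k substrings of the given strings."""
    return {s[p:p + k] for s in strings for p in range(len(s) - k + 1)}

def trim(x):
    """The string x with its last letter removed."""
    return x[:-1]

S = ["ACGTAC", "GTACG"]          # toy strings, not real data
print(sorted(substrings(S, 4)))  # the distinct 4-substrings of S
print(trim("ACGT"))              # trimming away the last letter
print(log2(4))                   # bit needed to specify one of 4 letters
```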

Problem description

Given a set of strings S, e.g. a set of sequencing reads, we will construct a compact representation of S_k, i.e. the set of length k substrings, suitable for quickly checking if any particular k-string is present.

The data structure is best understood in terms of the de Bruijn subgraph representation of S_k. This has vertices V = S_{k−1} and edges E = S_k, where e ∈ E is an edge from e[1,k−1] to e[2,k]. It is a subgraph of the de Bruijn graph of order k−1, i.e. with vertices Σ^{k−1} and edges Σ^k. Some authors may refer to this as a word graph, or even just a de Bruijn graph. Since the set of vertices can be deduced from the edges, storing S_k is effectively the same as storing the information encoded in the de Bruijn subgraph. However, the graph structure highlights the overlap between edges meeting at vertices.

While some authors focus on k as the length of the strings represented by the edges, others focus on the order of the graph, which is the length k−1 of the strings represented by the vertices. Since our purpose is to represent the k-mer composition of the sequences, it is natural to focus on k as the k-mer length. However, the implementation of the algorithms is more naturally centered around the vertices, and so the Java implementation focuses on the order of the de Bruijn subgraph, which is k−1.
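
The relation between k-substrings and the de Bruijn subgraph can be sketched in a few lines of Python. This is an explicit, memory-hungry representation meant only to make the definitions concrete; the function names and the toy reads are assumptions.

```python
def substrings(strings, k):
    """The set of all length-k substrings of the given strings."""
    return {s[p:p + k] for s in strings for p in range(len(s) - k + 1)}

def de_bruijn_subgraph(strings, k):
    """Return (V, E): V holds the (k-1)-substrings, and E maps each
    k-substring e to the pair (source, target) of the vertices it joins."""
    vertices = substrings(strings, k - 1)
    edges = {e: (e[:-1], e[1:]) for e in substrings(strings, k)}
    return vertices, edges

reads = ["ACGTAC", "GTACG"]      # toy reads, not real data
V, E = de_bruijn_subgraph(reads, 4)
for kmer, (src, dst) in sorted(E.items()):
    print(f"{src} -> {dst}   (edge {kmer})")

# When every read has length >= k, the vertices can be deduced from the edges.
endpoints = {v for pair in E.values() for v in pair}
print(endpoints == V)
print(2 * 4 * len(E), "bit for a naive list of these 4-mers")
```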

The kFM-index data structure

The data structure for storing the k-substrings from a set of strings S has similarities to the FM-index and the Burrows–Wheeler transformation. One similarity is that the data structure stores the prefixing letters, which represent the in-edges to vertices, and backtracks the de Bruijn subgraph through these in-coming edges rather than walking paths from beginning to end; the sequences, including the strings the vertices and edges represent, are thus reconstructed from the in-edge data when backtracking through the graph.

The initial de Bruijn subgraph representing the k-string composition may contain any number of final vertices: i.e. vertices for which there are no out-going edges. These final vertices correspond to k−1-strings found only as suffixes of the strings in S, and represent a problem as they cannot be reached by backtracking the de Bruijn subgraph. As the data structure does not store the k−1-strings for each vertex, but instead reconstructs these strings when walking the graph, these final vertices cannot be thus reconstructed. The solution is to add extra vertices and edges leading from these final vertices to a special final vertex from which we may start the reconstruction. See Figure 1 for an example.

Figure 1

The kFM-index data and corresponding de Bruijn subgraph. Representation of the data structure for DNA 4-mers. The vertex strings, lexicographically sorted, are not stored, but reconstructed from the edge and group end data. The edges columns indicate in-coming edges to each vertex, i.e. letters that may prefix the vertex strings. The group end flag indicates groups of vertices with the same k−2-prefix. The previous position data can be generated from the edge set data and group end data and is constant within each vertex group; a subset is stored for computational speed.

Let V_final ⊂ Σ^{k−1} be a set that includes all k−1-strings which are final vertices in the graph with edge set S_k: i.e. if v ∈ S_{k−1} and v is not a prefix of any string in S_k, then v has to be in V_final. Ideally, in order to get the most compact representation of S_k, we want V_final to contain only these strings. However, we might start off by letting V_final contain all k−1-suffixes of the strings in S, knowing that the vertices required to be in V_final have to be a subset of these, and then later prune away superfluous edges and vertices. Hence, we permit V_final to be bigger than strictly required.

We now define the final-completed de Bruijn subgraph with paths added from each v ∈ V_final to a special vertex, $=$…$, which we refer to as the final vertex, having vertices V = S_{k−1} ∪ (V_final∘$)_{k−1} and edges E = S_k ∪ (V_final∘$)_k, where V_final∘$ = {v$…$ ∣ v ∈ V_final} denotes the strings from which these additional paths are constructed. These are strings over an extended alphabet Σ ∪ {$}, where $ is a special character that is sorted before any of the letters of Σ. The added vertices, i.e. those containing one or more $ at the end, are referred to as final-completing vertices and are parts of paths leading to the final vertex. In fact, the final-completing vertices form a tree with the final vertex, $, as the root. Note that, for the case where V_final is empty, we explicitly add the special vertex $: this is purely a matter of convenience.
By this extension of the de Bruijn subgraph, we have ensured that there is exactly one final vertex that cannot be reached by backtracking the graph, namely the final vertex $. When sorting the vertices lexicographically, this will always come first. Note that we do not require that the final vertex be reachable from the rest of the graph: if the original graph had V_final empty, the final vertex would not be reachable.

We may identify E with a subset of Σ×V describing the set of in-coming edges to each vertex, and will by abuse of notation say that the pair (a,v) ∈ Σ×V is an edge if the concatenation av ∈ E. We denote the in-coming edges to v by E_v ⊂ Σ: i.e. E_v = {a ∈ Σ ∣ av ∈ E}.

Note that backtracking through this de Bruijn subgraph corresponds to reading the strings in the backwards direction, from the end of the string towards the beginning, just as with the FM-index. A variant of the data structure which naturally reads the strings in the forward direction can be obtained by performing the construction on the reversed strings, the only effect of which is on the sorting of strings in the index, which would then be based on the reversed string.
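
The final-completion step can be made concrete with a short Python sketch. It assumes the minimal choice of final vertices (only vertices without out-going edges), uses explicit sets of strings rather than the compact representation, and relies on the fact that `$` sorts before the letters A, C, G, T in Python's default string order. The function names and toy reads are hypothetical.

```python
def substrings(strings, k):
    """The set of all length-k substrings of the given strings."""
    return {s[p:p + k] for s in strings for p in range(len(s) - k + 1)}

def final_completed(strings, k):
    """Sorted vertices and edges of the final-completed de Bruijn subgraph."""
    s_k, s_k1 = substrings(strings, k), substrings(strings, k - 1)
    has_out_edge = {e[:-1] for e in s_k}
    v_final = {v for v in s_k1 if v not in has_out_edge}   # minimal choice
    pad = "$" * (k - 1)                                     # the final vertex
    padded = [v + pad for v in v_final]
    vertices = s_k1 | substrings(padded, k - 1) | {pad}
    edges = s_k | substrings(padded, k)
    return sorted(vertices), sorted(edges)

def in_edges(vertices, edges):
    """For each vertex v, the letters a such that av is an edge."""
    return {v: {e[0] for e in edges if e[1:] == v} for v in vertices}

V, E = final_completed(["GACG", "GACT"], 3)   # toy reads, not real data
print(V)   # the final vertex $$ comes first
for v, letters in in_edges(V, E).items():
    print(v, sorted(letters))
```

In this toy case the reads end in CG and CT, so both become final vertices and each gets a path of final-completing vertices leading to `$$`.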

Main data

Let n = |V| be the number of vertices of the final-completed de Bruijn subgraph, and let v_0,…,v_{n−1} denote the vertices of V in lexicographic order; in particular, v_0 = $, which is the only final vertex of the final-completed de Bruijn subgraph. The basic information required to store the final-completed de Bruijn subgraph of S_k is:

Edges: The set of in-coming edges to each vertex v is stored; i.e. the edge set E identified as a subset of Σ×V. This may be encoded as a σ×n array, η(a,i), with binary values: i.e. η(a,i) ∈ {false,true} indicates if av_i ∈ E. The in-edges E_{v_i} = {a ∣ av_i ∈ E} to vertex v_i may be represented as a bit-mapped number on which set operations correspond to binary operations.

Group end flags: We group vertices v ∈ V with the same k−2-prefix together: i.e. u and v are grouped together if u^− = u[1,k−2] and v^− = v[1,k−2] are identical. We indicate the group end by a flag f_i which is true if v_i is the last vertex in its group, false otherwise. This requires one bit of information per vertex.

More formally, these binary arrays take logical values, true or false: η(a,i) is true if and only if av_i ∈ E, and f_i is true if and only if i = n−1 or v_i^− ≠ v_{i+1}^−.

The grouping of vertices with the same k−2-prefix allows us to check which in-edges originate from the same vertices: for a,b ∈ Σ, u,v ∈ V, edges au and bv originate from vertices au^− and bv^− respectively, which are the same vertex exactly when a = b and u^− = v^−; the latter corresponds to checking if u and v are in the same vertex group.
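
The two arrays can be sketched in Python as follows. The toy vertex list and edge set are a hand-made final-completed graph for k = 3; storing the in-edges as one 4-bit integer per vertex is only one possible encoding, and the function names are hypothetical rather than those of the Java implementation.

```python
SIGMA = "ACGT"   # letter order; bit number = position in this string

# Toy final-completed graph for k = 3 (vertices sorted, '$' sorts first).
V = ["$$", "AC", "CG", "CT", "G$", "GA", "T$"]
E = {"GAC", "ACG", "ACT", "CG$", "CT$", "G$$", "T$$"}

def edge_masks(vertices, edges):
    """One bit mask per vertex: bit of letter a is set iff a + v is an edge."""
    return [sum(1 << b for b, a in enumerate(SIGMA) if a + v in edges)
            for v in vertices]

def group_end_flags(vertices):
    """True for the last vertex of each group sharing a (k-2)-prefix."""
    n = len(vertices)
    return [i == n - 1 or vertices[i][:-1] != vertices[i + 1][:-1]
            for i in range(n)]

def same_origin(flags, a, i, b, j):
    """Do the in-edges (a, i) and (b, j) come from the same vertex?"""
    group = lambda t: sum(flags[:t])   # number of groups ending before t
    return a == b and group(i) == group(j)

eta = edge_masks(V, E)
f = group_end_flags(V)
print(eta)   # 4 bit per vertex
print(f)     # 1 bit per vertex
print(same_origin(f, "A", 2, "A", 3))   # A->CG and A->CT both leave AC
print(same_origin(f, "C", 4, "C", 6))   # C->G$ leaves CG, C->T$ leaves CT
```

Together the two arrays use 5 bit per vertex for DNA, in line with the figure quoted in the abstract.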

Fundamental functions for utilising the data structure

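For a string x and a ∈ Σ, we define the function γ(x) recursively by [equation missing in the source text]. This function has a natural interpretation. […]

In order to utilise the data structure, for strings x with length |x| < k we need the positions of the vertices that have x as a prefix. The first vertex with v ≥ x is at position α(x) = min{i ∣ v_i ≥ x}, while the first vertex with v > x∞ is at position β(x) = min{i ∣ v_i > x∞}; in both cases the value is n if no such vertex exists. Recall that ∞ > a for all a ∈ Σ, so β(x) for x ∈ Σ^l, l < k, finds the first v for which v[1,l] > x, while α(x) finds the first v for which v[1,l] ≥ x. Thus, α and β are defined by the property that i < α(x) if and only if v_i[1,l] < x, and i < β(x) if and only if v_i[1,l] ≤ x, for all x ∈ Σ^l, l < k. The vertices with prefix x are therefore exactly those at positions α(x) ≤ i < β(x).

Note that if we add the vertex v_n = ∞ to the list, we wouldn’t have to specify the i = n case in the above definitions. Again, this addition would purely be a matter of convenience and not have any practical impact.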
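
The meaning of α and β can be illustrated by a binary search over an explicit sorted list of vertex strings. This is only a reference sketch of the definitions: the kFM-index does not store the vertex strings, so this is not how the functions would be computed on the actual data structure. The names `alpha`, `beta` and `INF`, and the use of a high Unicode character as a stand-in for ∞, are assumptions.

```python
from bisect import bisect_left

INF = "\uffff"   # stand-in for the symbol that sorts after every letter

# Toy sorted vertex list for k = 3 (hand-made, '$' sorts before the letters).
V = ["$$", "AC", "CG", "CT", "G$", "GA", "T$"]

def alpha(vertices, x):
    """Position of the first vertex v with v >= x (len(vertices) if none)."""
    return bisect_left(vertices, x)

def beta(vertices, x):
    """Position of the first vertex v with v > x + INF (len(vertices) if none)."""
    return bisect_left(vertices, x + INF)

def with_prefix(vertices, x):
    """The vertices having x as a prefix: positions alpha(x) <= i < beta(x)."""
    return vertices[alpha(vertices, x):beta(vertices, x)]

print(alpha(V, "C"), beta(V, "C"))   # interval of vertices starting with C
print(with_prefix(V, "G"))
print(with_prefix(V, "AA"))          # empty: no vertex has this prefix
```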

Algorithms for using the data structure

The data structure described above encodes a de Bruijn subgraph representation of the k-substring composition of the strings S. However, to utilise this representation, we need efficient algorithms.

Throughout the algorithms, vertices of V will be identified by their position i ∈ {0,…,n−1} in the lexicographically sorted list v_0,…,v_{n−1} where n = |V|. The string that each vertex represents will generally not be known.

The alphabet Σ is known from the start and the letters ordered. Computationally, it is natural to represent the letters by numbers 0,…,σ−1 (ignoring the letter $) since they are to be used as array indexes. However, for added readability, I will denote them as letters a ∈ Σ in the algorithms rather than as numerical indexes.

Computing the previous position for arbitrary positions
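
The text of this section is not available here, so the following Python sketch is a reconstruction under stated assumptions rather than the article's algorithm. It uses a counting argument analogous to the LF-mapping of the FM-index: the vertex at the start of in-edge a to vertex i sits after all vertices beginning with a smaller letter (including the single final vertex), and after one vertex for every earlier group that has a as an in-edge. It scans the arrays linearly and keeps no stored subset of the index; all names and the toy data are hypothetical.

```python
SIGMA = "ACGT"

# Toy kFM-style data for the sorted vertices below (k = 3, hand-made).
V = ["$$", "AC", "CG", "CT", "G$", "GA", "T$"]   # used only for checking
eta = [12, 4, 1, 1, 2, 0, 2]     # in-edge bit masks, bit order as in SIGMA
f = [True, True, False, True, False, True, True]  # group end flags

def group_masks(eta, f):
    """Union of in-edge masks per group, and the group number of each vertex."""
    masks, group_of, current = [], [], 0
    for mask, is_end in zip(eta, f):
        current |= mask
        group_of.append(len(masks))
        if is_end:
            masks.append(current)
            current = 0
    return masks, group_of

def prev_position(eta, f, a, i):
    """Position of the vertex from which in-edge a to vertex i comes,
    or None if vertex i has no such in-edge."""
    bit = 1 << SIGMA.index(a)
    if not eta[i] & bit:
        return None
    masks, group_of = group_masks(eta, f)
    # Vertices starting with a smaller letter; the leading 1 is the final vertex.
    first = 1 + sum(1 for m in masks
                    for b in range(SIGMA.index(a)) if m >> b & 1)
    # Earlier groups that also have a as an in-edge.
    before = sum(1 for m in masks[:group_of[i]] if m & bit)
    return first + before

# Brute-force check against the explicit vertex strings.
for i, v in enumerate(V):
    for a in SIGMA:
        p = prev_position(eta, f, a, i)
        if p is not None:
            assert V[p] == a + v[:-1]
print(prev_position(eta, f, "C", 6))   # in-edge C to T$ comes from CT
```

The result depends only on the letter and on the group of vertex i, consistent with the Figure 1 remark that the previous position is constant within each vertex group.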

13 in total

1. SSAHA: a fast search method for large DNA databases.
Authors: Z Ning; A J Cox; J C Mullikin
Journal: Genome Res       Date: 2001-10       Impact factor: 9.043
2. Reducing storage requirements for biological sequence comparison.
Authors: Michael Roberts; Wayne Hayes; Brian R Hunt; Stephen M Mount; James A Yorke
Journal: Bioinformatics       Date: 2004-07-15       Impact factor: 6.937
3. Reptile: representative tiling for short read error correction.
Authors: Xiao Yang; Karin S Dorman; Srinivas Aluru
Journal: Bioinformatics       Date: 2010-08-16       Impact factor: 6.937
4. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.
Authors: Guillaume Marçais; Carl Kingsford
Journal: Bioinformatics       Date: 2011-01-07       Impact factor: 6.937
5. Velvet: algorithms for de novo short read assembly using de Bruijn graphs.
Authors: Daniel R Zerbino; Ewan Birney
Journal: Genome Res       Date: 2008-03-18       Impact factor: 9.043
6. ABySS: a parallel assembler for short read sequence data.
Authors: Jared T Simpson; Kim Wong; Shaun D Jackman; Jacqueline E Schein; Steven J M Jones; Inanç Birol
Journal: Genome Res       Date: 2009-02-27       Impact factor: 9.043
7. Succinct data structures for assembling large genomes.
Authors: Thomas C Conway; Andrew J Bromage
Journal: Bioinformatics       Date: 2011-01-17       Impact factor: 6.937
8. Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data.
Authors: Yongchao Liu; Jan Schröder; Bertil Schmidt
Journal: Bioinformatics       Date: 2012-11-29       Impact factor: 6.937
9. Efficient de novo assembly of large genomes using compressed data structures.
Authors: Jared T Simpson; Richard Durbin
Journal: Genome Res       Date: 2011-12-07       Impact factor: 9.043
10. Efficient counting of k-mers in DNA sequences using a bloom filter.
Authors: Páll Melsted; Jonathan K Pritchard
Journal: BMC Bioinformatics       Date: 2011-08-10       Impact factor: 3.169


