Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch, André Kahles.
Abstract
High-throughput DNA sequencing data are accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. Although there has been good progress toward representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this study, we present a new compression approach, the Multi-binary relation wavelet tree (Multi-BRWT), which is adaptive to different kinds of input data. We show up to a 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state of the art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data, and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world data sets.
Keywords: binary relations; compressed data structures; genome graph annotation; sparse binary matrices
Year: 2019 PMID: 31891531 PMCID: PMC7185347 DOI: 10.1089/cmb.2019.0324
Source DB: PubMed Journal: J Comput Biol ISSN: 1066-5277 Impact factor: 1.479
FIG. 1. Schematic of hierarchical compressed column-major representations. (a) BRWT for the binary case. Gray rows correspond to all-zero rows, also indicated through the vector to the right of each matrix. Each child encodes only nonzero rows of the submatrix passed to it by its respective parent. Numbers to the left of each matrix are the respective row indices in the initial matrix. (b) Multi-BRWT in the multiary case. Notation is as in the binary case. Stored vectors are shown in red. BRWT, binary relation wavelet tree.
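The hierarchy described in this caption can be illustrated with a minimal, uncompressed Python sketch. It is not the authors' implementation: the names `Node`, `build`, and `get_row`, the even split of columns among children, and the use of plain Python lists in place of succinct bit vectors with rank support are all assumptions made for illustration. Each node stores one bit per row passed down by its parent, and each child only sees the rows its parent marked as nonzero.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    columns: List[int]   # original column indices covered by this node
    nonzero: List[int]   # one bit per row passed down by the parent
    children: List["Node"] = field(default_factory=list)


def build(matrix, columns, rows, arity=2):
    """Build a (Multi-)BRWT-like tree over `columns`, restricted to `rows`.

    `arity` is a hypothetical parameter: the maximum number of children
    per internal node (2 gives the binary case).
    """
    # nonzero[i] is 1 if row rows[i] has a set bit in any covered column
    nonzero = [int(any(matrix[r][c] for c in columns)) for r in rows]
    node = Node(columns, nonzero)
    if len(columns) > 1:
        # children only receive the rows that are nonzero in this node
        kept = [r for r, bit in zip(rows, nonzero) if bit]
        step = -(-len(columns) // arity)  # ceiling division
        for i in range(0, len(columns), step):
            node.children.append(build(matrix, columns[i:i + step], kept, arity))
    return node


def get_row(node, i):
    """Return the set columns for the i-th row passed to `node`."""
    if not node.nonzero[i]:
        return []
    if not node.children:
        return list(node.columns)
    # rank1: position of this row among the nonzero rows of this node
    child_i = sum(node.nonzero[:i])
    result = []
    for child in node.children:
        result.extend(get_row(child, child_i))
    return result


matrix = [
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
]
root = build(matrix, columns=[0, 1, 2, 3], rows=[0, 1, 2, 3])
rows_decoded = [get_row(root, r) for r in range(len(matrix))]
```

In this toy version the rank is computed by summing a list prefix, which is linear time; a real implementation would presumably rely on compressed bit vectors with constant-time rank.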
FIG. 2. Schematic describing the construction of Multi-BRWT. (a) The columns of the input binary matrix depicted as numbered black dots are considered independently. (b, c) Columns are hierarchically pair-matched based on number of shared entries, forming the base Multi-BRWT topology. (d) Pruning internal nodes of Multi-BRWT to optimize the tree structure for a smaller representation size.
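Steps (a) to (c) of this caption can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the function names `greedy_pair_matching` and `build_topology` are hypothetical, columns are represented as sets of row indices, all pairs are scored exhaustively, and a merged group is represented by the union of its members' rows when it is matched again at the next level. The pruning step (d) is not covered.

```python
from itertools import combinations


def greedy_pair_matching(groups):
    """One matching round over a list of (topology, row_set) groups.

    Pairs are visited in order of decreasing number of shared rows, and a
    pair is accepted only if neither member has been matched yet.
    """
    pairs = sorted(
        combinations(range(len(groups)), 2),
        key=lambda p: len(groups[p[0]][1] & groups[p[1]][1]),
        reverse=True,
    )
    used, merged = set(), []
    for a, b in pairs:
        if a in used or b in used:
            continue
        used.update((a, b))
        (topo_a, rows_a), (topo_b, rows_b) = groups[a], groups[b]
        # assumption: a merged group is summarized by the union of its rows
        merged.append(((topo_a, topo_b), rows_a | rows_b))
    # with an odd number of groups, the unmatched one moves up unchanged
    merged.extend(g for i, g in enumerate(groups) if i not in used)
    return merged


def build_topology(columns):
    """Repeat matching rounds until a single binary tree remains."""
    groups = [(j, set(rows)) for j, rows in enumerate(columns)]
    while len(groups) > 1:
        groups = greedy_pair_matching(groups)
    return groups[0][0]


# four toy columns, each given as the set of rows where it has a 1
columns = [{0, 1, 2}, {5, 6}, {0, 1, 2, 3}, {5, 6, 7}]
topology = build_topology(columns)
```

Here columns 0 and 2 share three rows and columns 1 and 3 share two, so they are paired first, and the two pairs are then joined at the root. Scoring every pair is quadratic in the number of columns, so this form is only suitable for small examples.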
FIG. 3. Size of the representation of with densities using different approaches: (a) uniformly random bits, (b) uniformly random rows with multiplicity 5, and (c) uniformly random columns with multiplicity 5. We expect approach (c) to be best reflecting the real-world data of a de Bruijn graph built on related sequences. BinRel-WT, binary relation compressed with wavelet trees; GPM, greedy pairwise matching.
The Measured Size of the Compressed Binary Relation Matrix for Different Representations, in Gigabytes
| Methods | Kingsford | RefSeq |
|---|---|---|
| Column | 36.56 | 80.18 |
| Flat | 41.21 | 121.60 |
| Rainbowfish | 23.16 | 136.65 |
| BinRel-WT | 49.57 | N/A |
| BinRel-WT (sdsl) | 31.44 | 150.59 |
| BRWT | | |
| Multi-BRWT (Split 3) | 13.20 | 53.95 |
| Multi-BRWT (Split 5) | | |
| Multi-BRWT (Split 7) | 13.27 | 53.54 |
| Multi-BRWT (Split 10) | 13.54 | 54.77 |
| Multi-BRWT (Split 13) | 14.10 | 56.25 |
| Multi-BRWT (GPM) | 10.60 | 50.13 |
| Multi-BRWT (GPM + Relax 3) | 10.16 | 47.20 |
| Multi-BRWT (GPM + Relax 5) | | 44.22 |
| Multi-BRWT (GPM + Relax 7) | | 44.03 |
| Multi-BRWT (GPM + Relax 10) | 9.95 | 43.73 |
| Multi-BRWT (GPM + Relax 20) | 9.95 | |
Multi-BRWT (Split n) denotes the n-ary BRWT. Multi-BRWT (GPM) denotes the binary BRWT optimized with the GPM. Multi-BRWT (GPM + Relax t) denotes the Multi-BRWT (GPM) with internal nodes pruned to reduce the representation size, where each node has at most t children. The construction times can be found in the Supplementary Data.
BinRel-WT, binary relation compressed with wavelet trees; BRWT, binary relation wavelet tree; GPM, greedy pairwise matching.
Bold values are max in each block.
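As a rough cross-check of the sizes reported above, the relative size reduction between two rows of the table can be computed as in the following Python sketch. Treating Rainbowfish as the baseline is an assumption made here for illustration, and only values that appear in the table are used; the helper name `reduction_percent` is hypothetical.

```python
def reduction_percent(baseline_gb, compressed_gb):
    """Relative size reduction of `compressed_gb` versus `baseline_gb`, in percent."""
    return (1.0 - compressed_gb / baseline_gb) * 100.0


# sizes in gigabytes, copied from the table
rainbowfish = {"Kingsford": 23.16, "RefSeq": 136.65}
gpm_relax_10 = {"Kingsford": 9.95, "RefSeq": 43.73}

reductions = {
    name: round(reduction_percent(rainbowfish[name], gpm_relax_10[name]), 1)
    for name in rainbowfish
}
```

With these inputs the RefSeq reduction rounds to 68.0%, in line with the figure quoted in the abstract, and the Kingsford reduction to 57.0%.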