Harun Mustafa, Ingo Schilken, Mikhail Karasikov, Carsten Eickhoff, Gunnar Rätsch, André Kahles.
Abstract
Motivation: Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata.
Year: 2019 PMID: 30020403 PMCID: PMC6530811 DOI: 10.1093/bioinformatics/bty632
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. A wavelet trie constructed for a tuple of bit vectors. Each node is labelled with a longest common prefix (LCP) α and an assignment vector β. During construction at a particular node, the LCP of the bit vectors is extracted and the next significant bit is used to assign the bit vector suffixes to that node’s children. A node becomes a leaf when all bit vectors assigned to it are equal. An example is given in bold. The sequence 0010101 results from the traversal along the dashed line from top to bottom. The index i being queried is updated by calling rank0(·, i) (traverse left) or rank1(·, i) (traverse right) on the β vectors.
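The construction and traversal described in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `WaveletTrieNode`, the method `access` and the example rows are hypothetical, bit vectors are plain strings of equal length, and β is an uncompressed list with a linear-time rank, where a real index would use succinct bit vectors.

```python
import os


class WaveletTrieNode:
    """One trie node holding the common prefix `alpha` and assignment vector `beta`."""

    def __init__(self, rows):
        # alpha: longest common prefix of all bit strings assigned to this node.
        self.alpha = os.path.commonprefix(rows)
        suffixes = [r[len(self.alpha):] for r in rows]
        if all(s == "" for s in suffixes):
            # Leaf: every bit string assigned to this node is identical.
            self.beta, self.children = None, None
            return
        # beta[j] is the next significant bit of the j-th string at this node.
        self.beta = [int(s[0]) for s in suffixes]
        self.children = (
            WaveletTrieNode([s[1:] for s in suffixes if s[0] == "0"]),
            WaveletTrieNode([s[1:] for s in suffixes if s[0] == "1"]),
        )

    def access(self, i):
        """Reconstruct the i-th bit string by walking from this node to a leaf."""
        if self.beta is None:
            return self.alpha
        bit = self.beta[i]
        # rank_bit(beta, i): how many entries equal to `bit` precede position i.
        rank = sum(1 for b in self.beta[:i] if b == bit)
        return self.alpha + str(bit) + self.children[bit].access(rank)


def build_wavelet_trie(rows):
    # Assumption of this sketch: all rows have the same length.
    assert len({len(r) for r in rows}) == 1, "rows must have equal length"
    return WaveletTrieNode(list(rows))


rows = ["0010101", "0010110", "1100000", "0010101"]  # hypothetical example data
trie = build_wavelet_trie(rows)
print([trie.access(i) for i in range(len(rows))])
```

Duplicate rows share a leaf, so the structure grows with the number of distinct bit vectors rather than with the number of rows.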
Datasets used for evaluation
| Data set | Nodes | Edges | Colors | Colorings | Density (%) |
|---|---|---|---|---|---|
| Virus100 | 2,954,719 | 2,956,113 | 100 | 463 | 1.056 |
| | 30,310,634 | 30,347,373 | 1,000 | 11,612 | 0.117 |
| | 622,587,315 | 625,110,390 | 53,412 | 1,359,843 | 0.006 |
| | 134,951,429 | 135,369,397 | 135 | 6,630 | 1.475 |
| | 178,196,890 | 180,023,641 | 9 | 510 | 15.270 |
| | 5,714,136,751 | 5,728,489,633 | 30 | 380,051 | 1.762 |
Columns give the number of nodes and edges per dataset, the total number of colors, the number of unique edge colorings (unique rows of the annotation matrix), and the density of the annotation matrix. The quantity s is the number of set bits in the annotation matrix.
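To make the column definitions concrete, the following sketch computes the set-bit count s, the density and the number of unique colorings for a toy annotation matrix. The function name `annotation_stats` and the example rows are hypothetical. The density is assumed here to be s divided by the number of matrix entries (edges times colors), expressed in percent; the exact formula from the table header did not survive in this copy.

```python
def annotation_stats(rows, num_colors):
    """rows: one set of color indices per edge, i.e. a sparse binary matrix row."""
    set_bits = sum(len(r) for r in rows)  # s: number of set bits
    # Assumed normalization: set bits over all (edge, color) entries, in percent.
    density = 100.0 * set_bits / (len(rows) * num_colors)
    # Unique colorings are the distinct rows of the annotation matrix.
    colorings = len({frozenset(r) for r in rows})
    return set_bits, density, colorings


# Hypothetical toy matrix with 4 edges and 4 colors.
toy_rows = [{0}, {0, 1}, {0}, set()]
print(annotation_stats(toy_rows, num_colors=4))
```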
Fig. 2. Improvement in Bloom filter compression ratios after neighborhood correction. Bloom filter accuracy (average fraction of correct edge colorings) as a function of filter size. (Color version of this figure is available at Bioinformatics online.)
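The accuracy measure in this caption can be illustrated with a small experiment: store each color's edges in its own Bloom filter, decode every edge's coloring by querying all filters, and count how many colorings come back exactly right. This is only a sketch under stated assumptions. The class `BloomFilter`, the helper names, the SHA-256-based hashing and the example data are all hypothetical, and the neighborhood correction step itself is not shown.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability only

    def _positions(self, item):
        # Derive num_hashes positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))


def build_filters(edge_colors, num_colors, num_bits):
    """One Bloom filter per color, holding the edges annotated with that color."""
    filters = [BloomFilter(num_bits) for _ in range(num_colors)]
    for edge, colors in edge_colors.items():
        for c in colors:
            filters[c].add(edge)
    return filters


def decode_coloring(filters, edge):
    # False positives can add spurious colors; true colors are never lost.
    return {c for c, bf in enumerate(filters) if edge in bf}


def coloring_accuracy(edge_colors, num_colors, num_bits):
    """Fraction of edges whose decoded coloring equals the true coloring."""
    filters = build_filters(edge_colors, num_colors, num_bits)
    correct = sum(decode_coloring(filters, e) == cs for e, cs in edge_colors.items())
    return correct / len(edge_colors)


# Hypothetical data: 60 edges, each carrying one of 3 colors.
edges = {f"e{i}": {i % 3} for i in range(60)}
for size in (8, 64, 4096):
    print(size, coloring_accuracy(edges, num_colors=3, num_bits=size))
```

Because a Bloom filter has no false negatives, every decoded coloring is a superset of the true one, so errors only ever add colors.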
Compression ratio of wavelet trie and Bloom filter schemes (measured as number of bits per edge)
| Data set | Colors | gzip | bzip2 | VARI | RBF | WTr | WTr (CI) | BF 95% | BF 99.0% |
|---|---|---|---|---|---|---|---|---|---|
| | 100 | 11.4 | 4.8 | 9.8 | 5.8 | 2.2 | 1.3 (52) | 0.36 | 0.44 |
| | 1000 | 26.5 | 7.5 | 14.7 | 9.7 | 18.2 | 5.28 (272) | 0.49 | 0.82 |
| | 53,412 | 135.3 | 37.7 | | 56.0 | 662.1 | 64.8 (1693) | 2.58 | 7.41 |
| | 135 | 15.6 | 5.7 | 19.3 | 7.8 | 3.3 | 1.6 (20) | 0.95 | 1.40 |
| | 9 | 4.6 | 2.7 | 17.3 | 3.3 | N/A | 1.2 (1) | 0.45 | 2.41 |
| | 30 | 10.9 | 5.4 | 14.5 | 5.6 | N/A | 5.4 (22) | 0.68 | 1.82 |
Proposed schemes: WTr, WTr (CI), BF 95% and BF 99.0%.
Note: Each dataset is encoded with eight different compression schemes: general-purpose compression with gzip and bzip2, existing methods specific to colored de Bruijn graphs, VARI (Muggli et al.) and Rainbowfish (RBF; Almodaresi et al., 2017), the wavelet trie encoding (WTr) without and with class indicator bits set (CI; the value in parentheses is the number of leading columns of the annotation matrix used as indicator columns), and the corrected Bloom filters at 95% (BF 95%) and 99% (BF 99%) accuracy. All compression ratios are measured as the average number of bits per edge. VARI was compiled with 1024-bit support.
On these datasets, VARI and RBF results are generated by exporting the annotation data in compatible formats.
Consumed more than the 400 GB memory limit.
The class indicators were the columns representing the reference chromosomes, hence, no extra columns were added.
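The bits-per-edge measure used in the table above can be reproduced for the general-purpose baselines with Python's standard `gzip` and `bz2` modules. This is a sketch under assumptions: the helper names `pack_rows` and `bits_per_edge` and the example matrix are hypothetical, and the row-major, one-bit-per-entry serialization is chosen for illustration; the layout used for the published baselines is not stated in this copy.

```python
import bz2
import gzip


def pack_rows(rows, num_colors):
    """Serialize a binary annotation matrix row by row, one bit per (edge, color)."""
    row_bytes = (num_colors + 7) // 8
    out = bytearray()
    for colors in rows:
        value = 0
        for c in colors:
            value |= 1 << c  # set the bit for color c
        out += value.to_bytes(row_bytes, "little")
    return bytes(out)


def bits_per_edge(compressed_size_bytes, num_edges):
    """Average number of bits spent per edge (matrix row)."""
    return 8 * compressed_size_bytes / num_edges


# Hypothetical matrix: 1,000 edges over 16 colors with a repetitive pattern.
rows = [{i % 4, (i // 7) % 16} for i in range(1000)]
raw = pack_rows(rows, num_colors=16)
for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
    print(name, bits_per_edge(len(compress(raw)), len(rows)))
```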
Fig. 3. Growth of compression ratios. Compression ratios on virus graphs of increasing genome count. Error bars were computed from the virus graph chains resulting from six random draws of the Virus1000 dataset (see Section 3.2.1).
Fig. 4. Construction vs. update times of color compressors for virus datasets of differing numbers of columns. WTr, wavelet trie; BF, Bloom filter. (Color version of this figure is available at Bioinformatics online.)