Prashant Pandey, Michael A Bender, Rob Johnson, Rob Patro.
Abstract
MOTIVATION: Almost all de novo short-read genome and transcriptome assemblers start by building a representation of the de Bruijn Graph of the reads they are given as input. Even when other approaches are used for subsequent assembly (e.g. when one is using 'long read' technologies like those offered by PacBio or Oxford Nanopore), efficient k-mer processing is still crucial for accurate assembly, and state-of-the-art long-read error-correction methods use de Bruijn Graphs. Because of the centrality of de Bruijn Graphs, researchers have proposed numerous methods for representing de Bruijn Graphs compactly. Some of these proposals sacrifice accuracy to save space. Further, none of these methods store abundance information, i.e. the number of times that each k-mer occurs, which is key in transcriptome assemblers.
Year: 2017 PMID: 28881995 PMCID: PMC5870571 DOI: 10.1093/bioinformatics/btx261
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Space usage of several AMQs, as a function of the false positive rate and the load factor α
| Filter | Bits per element | Max load factor α |
|---|---|---|
| Bloom | | N/A |
| Cuckoo | | 0.95 |
| CQF | | 0.95 |
Note: The CQF is more space efficient than the cuckoo filter for all false positive rates and more space efficient than the Bloom filter for false positive rates less than 1/64. The cuckoo filter and CQF offer good performance up to a load factor of 0.95.
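As a rough illustration of the comparison in the note, the following sketch assumes the space formulas commonly cited in the filter literature: about log2(1/δ)/ln 2 bits per element for a Bloom filter, (3 + log2(1/δ))/α for a cuckoo filter, and (2.125 + log2(1/δ))/α for a CQF, where δ denotes the false positive rate. These formulas, the symbol δ, and the function names are assumptions introduced for this example; they are not values taken from the table above.

```python
import math

# Assumed bits-per-element formulas from the AMQ literature (not from the table).
# fpr is the false positive rate; load is the load factor (alpha).

def bloom_bits(fpr):
    """Bloom filter: log2(1/fpr) / ln 2, roughly 1.44 * log2(1/fpr)."""
    return math.log2(1 / fpr) / math.log(2)

def cuckoo_bits(fpr, load=0.95):
    """Cuckoo filter: (3 + log2(1/fpr)) / load."""
    return (3 + math.log2(1 / fpr)) / load

def cqf_bits(fpr, load=0.95):
    """Counting quotient filter: (2.125 + log2(1/fpr)) / load."""
    return (2.125 + math.log2(1 / fpr)) / load

if __name__ == "__main__":
    # Print the three costs for false positive rates 1/16 ... 1/1024.
    for exp in range(4, 11):
        fpr = 2.0 ** -exp
        print(f"1/{2**exp:<5} bloom={bloom_bits(fpr):6.2f} "
              f"cuckoo={cuckoo_bits(fpr):6.2f} cqf={cqf_bits(fpr):6.2f}")
```

Under these assumed formulas, the CQF cost is always below the cuckoo cost at equal load factor, and it drops below the Bloom cost once the false positive rate is small enough, which is in line with the note.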
Fig. 1. Weighted de Bruijn Graph invariant. The nodes are 4-mers and edges are 5-mers. The nodes and edges are drawn from Read1 and Read2 mentioned in the figure. The solid curve shows the read path. The nodes/edges are not canonicalized
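One way to read the invariant in the non-canonicalized setting of Fig. 1 is as flow conservation: because every read traces a path, each node's incoming abundance plus the number of reads starting there should equal its outgoing abundance plus the number of reads ending there. The sketch below checks this on two made-up reads. The reads, function names, and this exact formulation are assumptions for illustration, not the reads of the figure or the authors' implementation.

```python
from collections import Counter

def weighted_dbg(reads, k=5):
    """Count k-mer edges, plus how many reads start/end at each (k-1)-mer node."""
    edges, starts, ends = Counter(), Counter(), Counter()
    for read in reads:
        if len(read) < k:
            continue  # too short to contribute an edge
        for i in range(len(read) - k + 1):
            edges[read[i:i + k]] += 1
        starts[read[:k - 1]] += 1
        ends[read[-(k - 1):]] += 1
    return edges, starts, ends

def invariant_holds(edges, starts, ends):
    """Check inflow + starts == outflow + ends at every node."""
    inflow, outflow = Counter(), Counter()
    for edge, count in edges.items():
        outflow[edge[:-1]] += count  # edge leaves its (k-1)-prefix
        inflow[edge[1:]] += count    # edge enters its (k-1)-suffix
    nodes = set(inflow) | set(outflow)
    return all(inflow[n] + starts[n] == outflow[n] + ends[n] for n in nodes)

# Hypothetical reads, chosen only to exercise the check.
edges, starts, ends = weighted_dbg(["AAAACGT", "CAAAAT"], k=5)
ok_exact = invariant_holds(edges, starts, ends)

# Simulate an approximate counter that over-counts one edge.
edges["AAAAC"] += 1
ok_overcount = invariant_holds(edges, starts, ends)
```

With exact counts the check passes; after the simulated over-count it fails, which is the kind of inconsistency an invariant like this can expose.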
Fig. 2. Types of edges in a de Bruijn Graph. The nodes are 4-mers and edges are 5-mers. For node AAAA, CAAAA is a left edge and AAAAT, AAAAC are right edges. We introduced another edge AAAAA in order to show a duplex edge. All the nodes/edges are canonicalized and the graph is bi-directional
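The edge types named in Fig. 2 can be sketched as follows. The classification rule used here (an edge is a right edge if the node is its (k-1)-prefix, a left edge if the node is its (k-1)-suffix, in either orientation of the edge, and duplex if both hold) is an assumed reading of the caption, and the helper names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical(seq):
    """Lexicographically smaller of a sequence and its reverse complement."""
    return min(seq, revcomp(seq))

def edge_type(node, edge):
    """Classify a k-mer edge relative to a (k-1)-mer node.

    Returns "left", "right", "duplex", or None if the edge does not touch
    the node. Both orientations of the edge are examined.
    """
    sides = set()
    for oriented in (edge, revcomp(edge)):
        if oriented[:-1] == node:
            sides.add("right")
        if oriented[1:] == node:
            sides.add("left")
    if sides == {"left", "right"}:
        return "duplex"
    return next(iter(sides), None)

# The edges listed in the caption, classified relative to node AAAA.
types = {e: edge_type("AAAA", e) for e in ("CAAAA", "AAAAT", "AAAAC", "AAAAA")}
```

For the caption's node AAAA, this rule labels CAAAA as left, AAAAT and AAAAC as right, and AAAAA as duplex.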
Fig. 3. Total number of distinct k-mers in First QF, Last QF, and Main QF with increasing coverage of the same dataset. We generate dataset simulations using Huang
Datasets used in our experiments
| Dataset | File size (GB) | #Files | #k-mers | #Distinct k-mers |
|---|---|---|---|---|
| GSM984609 | 26 | 12 | 19662773330 | 1146347598 |
| GSM981256 | 22 | 12 | 16470774825 | 1118090824 |
| GSM981244 | 43 | 4 | 37897872977 | 1404643983 |
| SRR1284895 | 33 | 2 | 26235129875 | 2079889717 |
Note: The file size is in GB. All the datasets are compressed with gzip compression.
Space versus Accuracy trade-off in Squeakr and deBGR
| System | Dataset | Space (bits/k-mer) | Navigational errors: topological | Navigational errors: abundance |
|---|---|---|---|---|
| Squeakr | GSM984609 | 18.9 | 14263577 | 16655318 |
| Squeakr (exact) | GSM984609 | 50.8 | 0 | 0 |
| deBGR | GSM984609 | 26.5 | 0 | 0 |
| Squeakr | GSM981256 | 19.4 | 13591254 | 15864754 |
| Squeakr (exact) | GSM981256 | 52.1 | 0 | 0 |
| deBGR | GSM981256 | 27.1 | 0 | 0 |
| Squeakr | GSM981244 | 30.9 | 10462963 | 12257261 |
| Squeakr (exact) | GSM981244 | 79.8 | 0 | 0 |
| deBGR | GSM981244 | 37.0 | 0 | 0 |
| Squeakr | SRR1284895 | 20.9 | 23272114 | 27200821 |
| Squeakr (exact) | SRR1284895 | 53.95 | 0 | 0 |
| deBGR | SRR1284895 | 25.38 | 0 | 0 |
Note: Topological errors are false-positive k-mers. Abundance errors are k-mers with an over count.
The maximum number of items present in the auxiliary data structures during abundance correction: edges (k-mers) in the MBI and nodes ((k-1)-mers) in the work queue, as described in Algorithm 1
| Dataset | #Edges in MBI | #Nodes in work queue |
|---|---|---|
| GSM984609 | 30815799 | 76178634 |
| GSM981256 | 29359913 | 72606572 |
| GSM981244 | 22674515 | 56309858 |
| SRR1284895 | 50320986 | 124558299 |
Time to construct the weighted de Bruijn Graph, correct abundances globally in the weighted de Bruijn Graph, and perform local correction per edge in the weighted de Bruijn Graph (averaged over 1M local corrections)
| Dataset | Construction (seconds) | Global correction (seconds) | Local correction (microseconds) |
|---|---|---|---|
| GSM984609 | 6605.65 | 14857.68 | 12.93 |
| GSM981256 | 5470.83 | 15390.56 | 18.25 |
| GSM981244 | 13373.78 | 22266.86 | 16.50 |
| SRR1284895 | 8429.17 | 41218.85 | 16.62 |