Rayan Chikhi, Antoine Limasset, Paul Medvedev.
Abstract
MOTIVATION: As the quantity of data per sequencing experiment increases, the challenges of fragment assembly are becoming increasingly computational. The de Bruijn graph is a widely used data structure in fragment assembly algorithms, used to represent the information from a set of reads. Compaction is an important data reduction step in most de Bruijn graph based algorithms, in which long simple paths are compacted into single vertices. Compaction has recently become the bottleneck in assembly pipelines, and improving its running time and memory usage is an important problem.
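To make the compaction step concrete, the sketch below builds a de Bruijn graph from a set of k-mers and merges each maximal non-branching path into a single unitig string. This is a minimal single-threaded illustration under simplifying assumptions (the function name `compact_unitigs` is ours; reverse complements and isolated cycles are ignored), not the BCALM 2 algorithm itself:

```python
def compact_unitigs(kmers):
    """Compact the maximal simple paths of the de Bruijn graph of `kmers`
    (nodes overlap by k-1 characters) into unitig strings.
    Sketch only: ignores reverse complements and isolated cycles."""
    kmers = set(kmers)

    def succs(km):  # out-neighbors: k-mers overlapping km's (k-1)-suffix
        return [km[1:] + c for c in "ACGT" if km[1:] + c in kmers]

    def preds(km):  # in-neighbors: k-mers overlapping km's (k-1)-prefix
        return [c + km[:-1] for c in "ACGT" if c + km[:-1] in kmers]

    unitigs, seen = [], set()
    for km in kmers:
        if km in seen:
            continue
        p = preds(km)
        # skip internal nodes: exactly one predecessor, which does not branch
        if len(p) == 1 and len(succs(p[0])) == 1:
            continue
        path, cur = [km], km
        seen.add(km)
        while True:
            s = succs(cur)
            # stop at a branch, a merge point, or an already-visited node
            if len(s) != 1 or len(preds(s[0])) != 1 or s[0] in seen:
                break
            cur = s[0]
            seen.add(cur)
            path.append(cur)
        # spell the unitig: first k-mer plus the last character of each extension
        unitigs.append(path[0] + "".join(x[-1] for x in path[1:]))
    return sorted(unitigs)
```

On the Fig. 1 example (k = 4), this yields the unitig CCCTCTA plus the four single-k-mer unitigs CCCC, CCCA, CTAC and CTAA.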
Year: 2016 PMID: 27307618 PMCID: PMC4908363 DOI: 10.1093/bioinformatics/btw279
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Execution of bcalm 2 on a small example, with k = 4 and ℓ = 2. On the top left, we show the input de Bruijn graph. The maximal unitigs correspond to the path from CCCT to TCTA (spelling CCCTCTA) and to the k-mers CCCC, CCCA, CTAC, CTAA. In this example, minimizers are defined using a lexicographic ordering of ℓ-mers. In the top right, we show the contents of the bucket files. Only five of the bucket files are non-empty, corresponding to the minimizers CC, CT, AA, AC and CA. The doubled k-mers are italicized. Below that, we show the set of strings that each i-compaction generates. For example, in bucket CC the k-mers CCCT and CCTC are compacted into CCCTC; however, CCCC and CCCT are not compactable because CCCA is another out-neighbor of CCCC. The lonely ends are denoted by . In the bottom half we show the execution steps of the Reunite algorithm. Nodes in bold are output.
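The bucketing shown in Fig. 1 can be sketched as follows: each k-mer is routed to the bucket of the minimizer of its prefix (k-1)-mer and, when the minimizer of its suffix (k-1)-mer differs, is also written to that second bucket (a "doubled" k-mer). This is an illustrative sketch with our own helper names, using the lexicographic ℓ-mer ordering of the figure, not BCALM 2's actual partitioning code:

```python
from collections import defaultdict

def minimizer(s, l):
    """Lexicographically smallest l-mer of s."""
    return min(s[i:i + l] for i in range(len(s) - l + 1))

def bucket_kmers(kmers, l):
    """Distribute each k-mer to the bucket(s) of the minimizers of its
    prefix and suffix (k-1)-mers; a k-mer whose two minimizers differ
    is written to both buckets (doubled)."""
    buckets = defaultdict(list)
    for km in kmers:
        m_pre = minimizer(km[:-1], l)  # minimizer of prefix (k-1)-mer
        m_suf = minimizer(km[1:], l)   # minimizer of suffix (k-1)-mer
        buckets[m_pre].append(km)
        if m_suf != m_pre:
            buckets[m_suf].append(km)  # the doubled copy
    return dict(buckets)
```

On the Fig. 1 input with ℓ = 2, this produces exactly the five non-empty buckets CC, CT, AA, AC and CA; for instance, bucket CC receives CCCC, CCCA, CCCT and CCTC.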
Fig. 2. bcalm 2 wall-clock running times with respect to (a) parameters ℓ and k (using 4 cores) and (b) the number of cores (using k = 55 and ), on the chromosome 14 dataset.
Running times (wall-clock) and memory usage of compaction algorithms for the human datasets.
| Dataset | bcalm 2 | bcalm | ABySS-P | Meraculous 2 |
|---|---|---|---|---|
| Chr 14, time | 5 min | 15 min | 11 min | 62 min |
| Chr 14, memory | 400 MB | 19 MB | 11 GB | 2.35 GB |
| Whole human, time | 1.2 h | 12 h | 6.5 h | 16 h |
| Whole human, memory | 2.8 GB | 43 MB | 89 GB | unreported |
For bcalm 2 and bcalm we used k = 55, and ℓ values of and , respectively; abundance cutoffs were set to 5 for Chr 14 and 3 for whole human. We used 16 cores for the parallel algorithms ABySS, Meraculous 2 and bcalm 2. Meraculous 2 aborted with a validation failure due to insufficient peak k-mer depth when we ran it with an abundance cutoff of 5; we were able to execute it on chromosome 14 with a cutoff of 8, but not on the whole genome. For the whole genome, we show the running times given in Georganas et al. (2014); the exact memory usage was unreported there but is less than 1 TB. Meraculous 2 was executed with 32 prefix blocks.
Performance of bcalm 2 on the loblolly pine and white spruce datasets.
| Dataset | Loblolly pine | White spruce |
|---|---|---|
| Distinct k-mers (billions) | 10.7 | 13.0 |
| Num threads | 8 | 16 |
| CompactBucket() time | 4 h 40 m | 3 h 47 m |
| CompactBucket() mem | 6.5 GB | 6 GB |
| Reunite file size | 85 GB | 140 GB |
| Reunite() time | 4 h 32 m | 3 h 08 m |
| Reunite() memory | 31 GB | 39 GB |
| Total time | 9 h 12 m | 6 h 55 m |
| Total max memory | 31 GB | 39 GB |
| Unitigs (millions) | 721 | 1200 |
| Total length | 32.3 Gbp | 49.0 Gbp |
| Longest unitig | 11.2 kbp | 9.0 kbp |
The k-mer size was 31 and the abundance cutoff for k-mer counting was 7.