Inanç Birol, Justin Chu, Hamid Mohamadi, Shaun D Jackman, Karthika Raghavan, Benjamin P Vandervalk, Anthony Raymond, René L Warren.
Abstract
De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Although longer subsequences unlock longer genomic features for assembly, the associated increase in compute resources limits the practicability of DBG compared with other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, sequence specificity increases with gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.
Year: 2015 PMID: 26539459 PMCID: PMC4619942 DOI: 10.1155/2015/196591
Source DB: PubMed Journal: Int J Genomics ISSN: 2314-436X Impact factor: 2.326
Figure 1. Uniqueness of spaced seeds in the (a) E. coli and (b) H. sapiens genomes, as a function of the space length. The red, blue, and black curves correspond to spaced seeds of lengths 8, 16, and 32 bp, respectively. When the space length is zero, the uniqueness figures correspond to single k-mers of lengths 16, 32, and 64 bp, respectively. The curves show that, for the E. coli genome, spaced seeds of length 16 are equivalent to or better than k-mers of length 64 when the space length (delta) is longer than 100 bp.
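The quantity plotted in Figure 1 can be illustrated with a small sketch. It assumes that a spaced seed is a pair of k-mers separated by a fixed space of `delta` bases, and that "uniqueness" means the fraction of distinct seeds that occur exactly once. Both the function names and that definition of uniqueness are assumptions made for illustration. Reverse complements are ignored for brevity.

```python
from collections import Counter

def spaced_seeds(seq: str, k: int, delta: int):
    """Yield (left k-mer, right k-mer) pairs separated by `delta` bases."""
    span = 2 * k + delta
    for i in range(len(seq) - span + 1):
        yield seq[i:i + k], seq[i + k + delta:i + span]

def uniqueness(seq: str, k: int, delta: int) -> float:
    """Fraction of distinct spaced seeds seen exactly once (assumed definition)."""
    counts = Counter(spaced_seeds(seq, k, delta))
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)

if __name__ == "__main__":
    toy = "ACGTACGTAC"
    # With delta = 0 a seed of length k behaves like a single 2k-mer.
    print(uniqueness(toy, k=2, delta=0))
```

With `delta = 0` the pair is contiguous, which matches the caption's remark that a zero space length corresponds to a single k-mer of twice the seed length.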
Figure 2. Flowchart of cascading Bloom filters. The process of populating the stage 2 Bloom filter, indicated by the dashed box, is described in Table 1.
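A two-stage cascade can be sketched as follows. This is a minimal illustration, not the flowchart itself: it assumes that the first sighting of an item is recorded in stage 1 and that any repeat sighting promotes the item to stage 2. The `BloomFilter` class, its sizing, and the use of salted BLAKE2b hashes are all choices made for this example.

```python
import hashlib

class BloomFilter:
    """Plain bit-array Bloom filter (illustrative, not tuned)."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # One salted hash per hash function index.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))

def insert_cascading(item: str, stage1: BloomFilter, stage2: BloomFilter) -> None:
    """Assumed rule: first sighting goes to stage 1, repeats go to stage 2."""
    if item in stage1:
        stage2.add(item)
    else:
        stage1.add(item)

if __name__ == "__main__":
    s1, s2 = BloomFilter(1 << 16, 3), BloomFilter(1 << 16, 3)
    for seed in ["ACGT", "TTGA", "ACGT"]:
        insert_cascading(seed, s1, s2)
    print("ACGT" in s2, "TTGA" in s2)
```

Under this assumption, items observed only once (often sequencing errors) never reach stage 2, which keeps the second filter smaller.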
Update rules for the counting Bloom filter.
| Value at bit location x | Value at bit location x′ | Update at location | Set sign | Set count |
|---|---|---|---|---|
| 0 | 0 or −0 | x | 0 | 2 |
| Nonzero | 0 or −0 | x | No change | Increment |
| 0 or −0 | Nonzero | x′ | 1 | Increment |
| Nonzero | Nonzero | x and x′ | 1 | 0 |
| −0 | 0 | x′ | 1 | 2 |
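The update rules can be transcribed into a short function. This sketch reads the first, unlabeled column of the table as the value at location x, which is an assumption. The `Cell` class is hypothetical, and the count is kept as a plain integer for brevity rather than in a compact encoding. The combination x = −0 and x′ = −0 does not appear in the table and is left untouched here.

```python
class Cell:
    """One filter entry: a sign bit plus a count (hypothetical layout)."""

    def __init__(self, sign: int = 0, count: int = 0):
        self.sign = sign    # distinguishes 0 from -0 when count == 0
        self.count = count  # plain integer here, an assumption for brevity

def state(cell: Cell) -> str:
    if cell.count != 0:
        return "nonzero"
    return "-0" if cell.sign else "0"

def update(x: Cell, xp: Cell) -> None:
    """Apply the table's update rules to the cells at locations x and x'."""
    sx, sxp = state(x), state(xp)
    zeroish = ("0", "-0")
    if sx == "0" and sxp in zeroish:
        x.sign, x.count = 0, 2
    elif sx == "-0" and sxp == "0":
        xp.sign, xp.count = 1, 2
    elif sx == "nonzero" and sxp in zeroish:
        x.count += 1                # sign unchanged
    elif sx in zeroish and sxp == "nonzero":
        xp.sign = 1
        xp.count += 1
    elif sx == "nonzero" and sxp == "nonzero":
        x.sign, x.count = 1, 0
        xp.sign, xp.count = 1, 0
    # x = -0 with x' = -0 is not listed in the table: no action in this sketch
```

For example, starting from two empty cells, the first call sets the cell at x to a count of 2 with sign 0, and a second call increments it to 3.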
Minifloat counts and their representations.
| Count | Mantissa^a | Exponent^a |
|---|---|---|
| Zeros | | |
| 0 and −0^b | 000 | 0000 |
| Subnormal numbers^c | | |
| 1 | 001 | 0000 |
| 2 | 010 | 0000 |
| ⋮ | ⋮ | ⋮ |
| 7 | 111 | 0000 |
| Normalized numbers^c | | |
| 8 | 000 | 0001 |
| 9 | 001 | 0001 |
| ⋮ | ⋮ | ⋮ |
| 15 | 111 | 0001 |
| 16 | 000 | 0010 |
| 18 | 001 | 0010 |
| ⋮ | ⋮ | ⋮ |
| 122,880^d | 111 | 1110 |

^a Most significant digits on the left.
^b Distinguished by the sign bit.
^c Shown for a sign bit of 0.
^d Maximum possible 1.4.3.−2 minifloat number.
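The mapping in the table can be checked with a short decoder. This is a sketch inferred from the listed rows: a zero exponent gives the mantissa itself, and any other exponent adds an implicit leading 1 to the 3-bit mantissa and shifts by the exponent minus one. The function names are hypothetical. Exponent 1111 is not listed in the table, so the sketch rejects it.

```python
def decode(exponent: int, mantissa: int) -> int:
    """Count represented by a 4-bit exponent and a 3-bit mantissa (sign bit 0)."""
    if not (0 <= exponent <= 0b1110 and 0 <= mantissa <= 0b111):
        raise ValueError("outside the range listed in the table")
    if exponent == 0:
        return mantissa                        # zero and subnormal numbers: 0..7
    return (8 + mantissa) << (exponent - 1)    # normalized: implicit leading 1

def representable_counts():
    """All counts the format can hold, in increasing order."""
    return [decode(e, m) for e in range(0b1111) for m in range(8)]

if __name__ == "__main__":
    print(decode(0b0010, 0b001))   # 18
    print(decode(0b1110, 0b111))   # 122880
```

Counts up to 16 are exact. Beyond that, the gap between neighbouring representable counts doubles with each exponent step.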
Figure 3. Approximate counts versus true counts in the minifloat data type 1.4.3.-2. The box-whisker plots indicate the interquartile range and the variability of the counts outside the first and the third quartiles. The distributions represent a repetition of 10,000 counts in each logarithmic bin.
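One way such approximate counts can arise is a probabilistic increment, sketched below. The caption does not state the counting method, so this is an assumption: the stored code advances to the next representable value with probability equal to one over the gap to that value, which keeps the expected count equal to the true count. The 7-bit `code` (exponent in the high four bits, mantissa in the low three) and all names are hypothetical.

```python
import random

MAX_CODE = (0b1110 << 3) | 0b111   # largest listed value, 122,880

def decode(code: int) -> int:
    """Count for a 7-bit code: exponent in bits 6..3, mantissa in bits 2..0."""
    exponent, mantissa = code >> 3, code & 0b111
    return mantissa if exponent == 0 else (8 + mantissa) << (exponent - 1)

def increment(code: int, rng: random.Random) -> int:
    """Advance with probability 1/gap so the expected increase is exactly 1."""
    if code >= MAX_CODE:
        return code                               # saturate at the maximum
    gap = decode(code + 1) - decode(code)
    return code + 1 if rng.random() < 1.0 / gap else code

def approximate_count(true_count: int, rng: random.Random) -> int:
    code = 0
    for _ in range(true_count):
        code = increment(code, rng)
    return decode(code)

if __name__ == "__main__":
    rng = random.Random(0)
    print([approximate_count(1000, rng) for _ in range(5)])
```

While the gap is 1 the increment always succeeds, so small counts are exact. Larger counts scatter around the true value, which is the kind of spread a box-whisker plot would summarize.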
Error correction rules.
| Value at bit location | Interpretation | Action |
|---|---|---|
| 1 1 1 1 | Present in the set | Update count |
| 0 0 0 0 | Not present in the set | Insert in the filter |
| Mixed 0/1 patterns (individual bit values not recoverable) | There may be a single base correction that would make the pattern (1111) | If so, and if the corrected sequence has a nonzero count, correct the read. |
| Mixed 0/1 patterns (individual bit values not recoverable) | There may be two base corrections that would make the pattern (1111) | |
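The single-base rule can be sketched as a search over one-base substitutions. This is an illustration under stated assumptions, not the published procedure: the four bit locations are treated as opaque and supplied by a caller-provided `pattern_of` function, counts come from a caller-provided `count_of` function, and every mixed pattern is handled by trying single-base corrections only. The toy lookup in the usage section is invented for the demonstration.

```python
from typing import Callable, Optional, Tuple

Pattern = Tuple[int, int, int, int]
ALL_SET: Pattern = (1, 1, 1, 1)
NONE_SET: Pattern = (0, 0, 0, 0)

def single_base_correction(
    seq: str,
    pattern_of: Callable[[str], Pattern],
    count_of: Callable[[str], int],
    alphabet: str = "ACGT",
) -> Optional[str]:
    """Return a one-substitution variant with pattern 1111 and nonzero count."""
    for i, base in enumerate(seq):
        for alt in alphabet:
            if alt == base:
                continue
            candidate = seq[:i] + alt + seq[i + 1:]
            if pattern_of(candidate) == ALL_SET and count_of(candidate) > 0:
                return candidate
    return None

def handle(seq, pattern_of, count_of):
    """Map a bit pattern to an action; returns (action, sequence)."""
    pattern = pattern_of(seq)
    if pattern == ALL_SET:
        return "update count", seq
    if pattern == NONE_SET:
        return "insert in the filter", seq
    fixed = single_base_correction(seq, pattern_of, count_of)
    if fixed is not None:
        return "correct the read", fixed
    return "no single-base correction found", seq

# Toy stand-ins (hypothetical): one known sequence split into four 2-base chunks.
KNOWN = {"ACGTACGT": 5}

def toy_pattern(seq: str) -> Pattern:
    ref = next(iter(KNOWN))
    return tuple(int(seq[i:i + 2] == ref[i:i + 2]) for i in range(0, 8, 2))

def toy_count(seq: str) -> int:
    return KNOWN.get(seq, 0)

if __name__ == "__main__":
    print(handle("ACGTACGA", toy_pattern, toy_count))
```

A two-base search would nest the same substitution loop once more, at a quadratic cost in the sequence length.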