Guillaume Holley, Roland Wittler, Jens Stoye.
Abstract
BACKGROUND: High-throughput sequencing technologies have become fast and cheap in recent years. As a result, large-scale projects have started to sequence tens to several thousand genomes per species, producing a large number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences "colored" by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts all colored k-mers (strings of length k) from the sequences and stores them in vertices.
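The idea of "colored" k-mers can be illustrated with a minimal sketch. It is not the data structure described in this article: it uses a plain Python dictionary, ignores reverse complements, and the function name `colored_kmers` and the two toy genomes are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def colored_kmers(genomes, k):
    """Map every k-mer to the set of genome names (its colors) that contain it.

    `genomes` is a dict of {genome_name: sequence}. This is an illustrative
    helper, not an API from the article.
    """
    colors = defaultdict(set)
    for name, seq in genomes.items():
        # Slide a window of length k over the sequence.
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return dict(colors)


# Toy input (assumed for illustration): two short, overlapping sequences.
genomes = {"genome_A": "ACGTAC", "genome_B": "CGTACG"}
index = colored_kmers(genomes, k=4)

for kmer in sorted(index):
    print(kmer, sorted(index[kmer]))
```

In this toy input, the k-mers shared by both sequences carry two colors, while k-mers unique to one sequence carry a single color. A practical index would need a far more compact representation than a dictionary of sets.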
Keywords: Bloom filter; Colored de Bruijn graph; Compression; Index; Pan-genome; Population genomics; Similar genomes; Succinct data structure; Trie
Year: 2016 PMID: 27087830 PMCID: PMC4832552 DOI: 10.1186/s13015-016-0066-8
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Fig. 1 Insertion of six suffixes (that are here complete k-mers) with different colors (boxes with diagonal lines) into a BFT with , and . In a, the first five suffixes are inserted at the root into an uncompressed container. When a sixth suffix gcgccaggaatc is inserted, the uncompressed container exceeds its capacity and is burst, resulting in the BFT structure shown in b
Insertion time and memory usage for the real (P. aeruginosa) and simulated (Y. pestis) datasets. Compression ratios are given with respect to the original file sizes. Disk sizes for the SBT are given first for the leaves only and then for the complete tree
| | P. aeruginosa (real), BFT | P. aeruginosa (real), SBT | Y. pestis (simulated), BFT | Y. pestis (simulated), SBT |
|---|---|---|---|---|
| Insertion time (min) | 168.52 | 371.45 | 29.88 | 32.67 |
| Peak of main memory (MB/compr. ratio) | 7487/112:1 | 7356/114:1 | 1313/182:1 | 1586/151:1 |
| Disk size (MB/compr. ratio) | 1644/511:1 | 2076–4572/405:1–184:1 | 484/495:1 | 538–1117/445:1–214:1 |
| Compressed disk size (MB/compr. ratio) | 833/1009:1 | 1906–4280/441:1–196:1 | 225/1064:1 | 528–1099/454:1–218:1 |
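Because each ratio is stated relative to the original file sizes, multiplying a reported size by its ratio should give roughly the same input size in every row. The short check below uses the values from the first BFT column of the table; the helper name `implied_original_mb` is hypothetical, and the small disagreement between rows is presumably due to rounding of the reported ratios.

```python
def implied_original_mb(size_mb, ratio):
    """Original input size (MB) implied by a reported size and an N:1 ratio."""
    return size_mb * ratio


# First BFT column: disk size 1644 MB at 511:1, compressed disk size 833 MB at 1009:1.
bft_disk = implied_original_mb(1644, 511)
bft_compressed = implied_original_mb(833, 1009)

print(bft_disk, bft_compressed)  # 840084 840497
```

Both rows point to an input of roughly 840 GB, which is consistent with the ratios being computed against a single original file size.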
Fig. 2 BFT main memory and disk size for pan-genomes made of one up to all P. aeruginosa isolates
Fig. 3 BFT main memory and disk size for pan-genomes made of one up to all simulated Y. pestis isolates
Total and per k-mer query times for the real (P. aeruginosa) and simulated (Y. pestis) datasets, with peak main memory usage
| | P. aeruginosa (real), BFT | P. aeruginosa (real), SBT | Y. pestis (simulated), BFT | Y. pestis (simulated), SBT |
|---|---|---|---|---|
| Total query time (min) | 1.19 | 61.86 | 0.57 | 37.42 |
| Query time per k-mer | 7.14 | 371.16 | 3.42 | 224.52 |
| Peak of main memory (MB) | 2076 | 11,678 | 544 | 7775 |