Bo Liu, Dixian Zhu, Yadong Wang.
Abstract
MOTIVATION: With the development of high-throughput sequencing, the number of assembled genomes continues to rise. Organizing and indexing these genomes well is critical to future genomics studies. The Burrows-Wheeler Transform (BWT) is an important data structure for genome indexing with many fundamental applications; however, constructing the BWT for a large collection of genomes remains non-trivial, especially for highly similar or repetitive genomes. Moreover, state-of-the-art approaches do not support scalable parallel computing well owing to their incremental nature, which is a bottleneck when using modern computers to accelerate BWT construction.
Year: 2016 PMID: 27307614 PMCID: PMC4908350 DOI: 10.1093/bioinformatics/btw266
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. The unipath-based comparison between two suffixes with the same initial k-mer. (a) Because all copies of the same k-mer collapse to the same vertex of the dBG, two suffixes with the same initial k-mer must link to the same offset of the same unipath (the copies of the unipaths on the DNA sequence are marked by segments in various colors). Thus, the lexicographical order of the two suffixes cannot be determined until the comparison reaches the end of the unipath, as all the corresponding characters of the two suffixes are identical up to that point. (b) When the comparison moves from the finished (shared) unipath to new unipaths, the lexicographical order can be determined only if the two suffixes have different unipaths at the corresponding positions of their unipath representations; otherwise, more unipaths are needed. In this case, both suffixes have the same unipath (the red unipath) following the first unipath (the blue unipath), so the comparison continues to the third unipaths. At the third unipath, the lexicographical order can be determined, as the two suffixes go to two different branches (green and purple, respectively) at the end of the second unipath. (c) Owing to the property of the dBG, two different unipaths must differ from each other at their first k-mers. Furthermore, if two different k-mers have the same precursor, their first (k-1) characters must be the same as the last (k-1) characters of their precursor (the gray segments in the figure), i.e. the two branching k-mers differ only at their k-th character (the blocks marked in green and purple in the figure). In this situation, only the k-th characters of the two unipaths need to be compared to determine whether they are the same unipath. (d) With this property, the de Bruijn branch encoding is defined as a string concatenating all such characters along the DNA sequence, one for each position whose k-mer is a copy of a multiple-out vertex.
Each of these characters is also termed a branching character (marked as colored blocks in the figure). For a position j of the DNA sequence, a corresponding value is defined as the position, within the de Bruijn branch encoding, of the branching character that is downstream of and closest to position j.
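The definition in panel (d) can be illustrated on a toy sequence. The following is a minimal sketch, not the deBWT implementation: the function name `branch_encoding` and its return values are hypothetical, a k-mer is assumed to be multiple-out when its copies are followed by more than one distinct character, and "downstream and closest" is assumed to mean "at or after position j".

```python
from collections import defaultdict

def branch_encoding(seq: str, k: int):
    """Return (encoding, next_branch) for a toy sequence.

    Assumption: a k-mer is 'multiple-out' when its copies in seq are
    followed by more than one distinct character.
    """
    # Collect the set of characters that follow each k-mer.
    successors = defaultdict(set)
    for i in range(len(seq) - k):
        successors[seq[i:i + k]].add(seq[i + k])
    multi_out = {kmer for kmer, chars in successors.items() if len(chars) > 1}

    # Scan the sequence; each copy of a multiple-out k-mer contributes
    # the character right after it (the k-th character of the next k-mer).
    encoding = []
    branch_pos = []  # sequence positions of the branching characters
    for i in range(len(seq) - k):
        if seq[i:i + k] in multi_out:
            encoding.append(seq[i + k])
            branch_pos.append(i + k)

    # For each position j: index, within the encoding, of the closest
    # branching character at or after j (len(encoding) if none remains).
    next_branch = []
    idx = 0
    for j in range(len(seq)):
        while idx < len(branch_pos) and branch_pos[idx] < j:
            idx += 1
        next_branch.append(idx)
    return "".join(encoding), next_branch

if __name__ == "__main__":
    enc, nxt = branch_encoding("ACGTACGA", 3)
    print(enc, nxt)
```

In this toy input only `ACG` is followed by two different characters (`T` and `A`), so the encoding holds exactly those two branching characters, in sequence order.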
Fig. 2. A schematic illustration of the deBWT method. (a) DeBWT initially builds a dBG of the input sequence(s) with a user-defined parameter, k, which determines the size of the vertices. The dBG is then analyzed to build the k-mer partition of the BWT and to recognize all the unipaths (the colored bars in the figure indicate the copies of various unipaths of the input sequence). With the unipaths, all the multiple-in and multiple-out vertices are indexed by a hash table-based data structure, the de Bruijn branch index. Moreover, all the multiple-in vertices are marked. In this case, the red block indicates the first k-mer of the ‘red’ unipath of the dBG, which is a multiple-in vertex, and the grey and the white blocks respectively indicate other multiple-in and multiple-out k-mers. (b) DeBWT scans the input sequence(s) to recognize the branching characters with the de Bruijn branch index (marked as colored reverse rectangles above the input sequence) and to generate the de Bruijn branch encoding. Meanwhile, the suffixes with initial k-mers corresponding to multiple-in vertices, i.e. the suffixes belonging to the unsolved parts of the BWT, are also recognized with the index (marked as colored rectangles below the input sequence). Furthermore, for each of the suffixes within the unsolved parts, deBWT calculates its position value in the de Bruijn branch encoding to determine the corresponding projection suffix, and also records it in the de Bruijn branch index. (c) With the de Bruijn branch index, deBWT addresses all the unsolved BWT parts by sorting the projection suffixes.
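The split of the BWT into solved and unsolved parts can be sketched on a toy input. The sketch below is an illustration under stated assumptions rather than the deBWT algorithm: it sorts suffixes naively, appends a `$` sentinel, and treats a k-mer as multiple-in when its copies are preceded by more than one distinct character. The function names `kmer_partition` and `split_solved` are hypothetical.

```python
from collections import defaultdict

def kmer_partition(seq: str, k: int):
    """Group BWT characters by the initial k-mer of their suffix (naive sort)."""
    text = seq + "$"  # sentinel, smaller than A/C/G/T in ASCII order
    order = sorted(range(len(text)), key=lambda i: text[i:])
    parts = defaultdict(list)  # lists fill in sorted-suffix order
    for i in order:
        if i + k <= len(seq):  # the suffix starts with a full k-mer
            # BWT character = character preceding the suffix;
            # for i == 0 this wraps around to the sentinel.
            parts[text[i:i + k]].append(text[i - 1])
    return parts

def split_solved(parts):
    """A part is 'solved' when all its BWT characters are identical."""
    solved, unsolved = {}, {}
    for kmer, chars in parts.items():
        (solved if len(set(chars)) == 1 else unsolved)[kmer] = "".join(chars)
    return solved, unsolved

if __name__ == "__main__":
    solved, unsolved = split_solved(kmer_partition("ACGTACGA", 3))
    print("solved:", solved)
    print("unsolved:", unsolved)
```

When every copy of a k-mer has the same preceding character, the corresponding BWT block is that character repeated and needs no sorting; only the multiple-in k-mers leave blocks whose internal order must be resolved.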
Running Time with 32 CPU cores (in minutes)
| Methods | Human genomes | Human contigs | Primate genomes |
|---|---|---|---|
| deBWT | 134 | 129 | 330 |
| deBWT (no conversion) | 48 | 56 | 100 |
| ParaBWT | 241 | 262 | 180 |
| RopeBWT2 | 1694 | 2247 | 1546 |
‘DeBWT’ indicates the elapsed time of deBWT, and ‘deBWT (no conversion)’ deducts the time of the format conversion of Jellyfish output.
The time of the various steps of deBWT (in minutes)
| Steps | Human genomes | Human contigs | Primate genomes |
|---|---|---|---|
| Phase1: dBG building and analysis | |||
| k-mer counting | 16 | 16 | 26 |
| File conversion | 87 | 74 | 229 |
| k-mer sorting | 3 | 3 | 8 |
| dBG analysis | 8 | 7 | 19 |
| Phase2: The generation of de Bruijn branch encoding and projection suffixes | |||
| de Bruijn branch encoding and | 9 | 12 | 16 |
| Phase3: BWT construction with projection suffixes | |||
| Projection suffixes sorting | 4 | 12 | 10 |
| Additional processing | |||
| Additional processing | 7 | 6 | 22 |
| Supplement | |||
| k-mer counting with KMC2 | 7 | 9 | 12 |
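The per-step times can be cross-checked against the overall running times reported for 32 CPU cores. The sketch below simply sums the rows of the table above, assuming the step times come from the same 32-core runs; the KMC2 supplement row is left out because it is listed separately from the main phases. Variable names are illustrative.

```python
# Minutes per step, in the column order of the table above.
DATASETS = ("Human genomes", "Human contigs", "Primate genomes")
STEPS = {
    "k-mer counting": (16, 16, 26),
    "File conversion": (87, 74, 229),
    "k-mer sorting": (3, 3, 8),
    "dBG analysis": (8, 7, 19),
    "de Bruijn branch encoding (Phase 2)": (9, 12, 16),
    "Projection suffixes sorting": (4, 12, 10),
    "Additional processing": (7, 6, 22),
}

# Column sums and the share of each total spent on file conversion.
totals = [sum(row[i] for row in STEPS.values()) for i in range(len(DATASETS))]
conversion_share = [STEPS["File conversion"][i] / totals[i]
                    for i in range(len(DATASETS))]

if __name__ == "__main__":
    for name, total, share in zip(DATASETS, totals, conversion_share):
        print(f"{name}: {total} min in total, {share:.0%} in file conversion")
```

The sums come to 134, 130 and 330 minutes against reported totals of 134, 129 and 330; the one-minute gap for the contig dataset is presumably a rounding effect. File conversion accounts for more than half of the elapsed time on every dataset.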
Quantiles of and values of the 10 human genomes dataset
| Quantiles | 0.50 | 0.90 | 0.95 | 0.99 | 0.999 | 0.9999 |
|---|---|---|---|---|---|---|
| | 107 | 588 | 2382 | 95 019 | 298 598 | 515 006 |
| | 1760 | 11 872 | 238 368 | 1 925 600 | 3 232 832 | 3 387 040 |
Running time with various numbers of threads (in minutes)
| Methods | 8 threads | 16 threads | 24 threads | 32 threads |
|---|---|---|---|---|
| Human genomes | ||||
| deBWT | 194 | 153 | 142 | 134 |
| deBWT (no conversion) | 109 | 68 | 56 | 48 |
| ParaBWT | 265 | 240 | 240 | 241 |
| Human contigs | ||||
| deBWT | 183 | 154 | 123 | 129 |
| deBWT (no conversion) | 116 | 86 | 56 | 56 |
| ParaBWT | 294 | 277 | 276 | 262 |
| Primate genomes | ||||
| deBWT | 423 | 355 | 332 | 330 |
| deBWT (no conversion) | 193 | 125 | 105 | 100 |
| ParaBWT | 196 | 182 | 181 | 180 |
‘DeBWT’ indicates the elapsed time of deBWT, and ‘deBWT (no conversion)’ deducts the time of the format conversion of Jellyfish output file.
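The thread-scaling behaviour in the table above can be summarised as the ratio of the 8-thread time to the 32-thread time. This is a small arithmetic sketch over the reported numbers only; the dictionary layout and the `scaling` helper are illustrative names.

```python
THREADS = (8, 16, 24, 32)

# Elapsed minutes copied from the table above, one tuple per thread count.
RUNTIME_MIN = {
    "Human genomes": {
        "deBWT": (194, 153, 142, 134),
        "deBWT (no conversion)": (109, 68, 56, 48),
        "ParaBWT": (265, 240, 240, 241),
    },
    "Human contigs": {
        "deBWT": (183, 154, 123, 129),
        "deBWT (no conversion)": (116, 86, 56, 56),
        "ParaBWT": (294, 277, 276, 262),
    },
    "Primate genomes": {
        "deBWT": (423, 355, 332, 330),
        "deBWT (no conversion)": (193, 125, 105, 100),
        "ParaBWT": (196, 182, 181, 180),
    },
}

def scaling(times):
    """Speedup from the fewest to the most threads (8 -> 32 here)."""
    return times[0] / times[-1]

if __name__ == "__main__":
    for dataset, methods in RUNTIME_MIN.items():
        for method, times in methods.items():
            print(f"{dataset:16s} {method:22s} {scaling(times):.2f}x")
```

On these figures ParaBWT stays near 1.1x from 8 to 32 threads, while deBWT without the conversion step reaches roughly 1.9x to 2.3x.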
Fig. 3. Time consumption of the various steps of deBWT. The bars respectively indicate the elapsed time (in minutes) of the various steps of deBWT for the 10 human genomes dataset (a), the human genome contig dataset (b) and the 8 primate genomes dataset (c). Bars in the same color correspond to a specific number of threads, i.e. blue, red, green and purple bars are respectively for 8, 16, 24 and 32 threads.
Running time on the in silico human genome dataset with various settings of the k parameter (in minutes)
| Methods | k = 19 | k = 23 | k = 27 | k = 31 |
|---|---|---|---|---|
| deBWT | 142 | 124 | 131 | 134 |
| deBWT (no conversion) | 75 | 51 | 47 | 48 |
‘DeBWT’ indicates the elapsed time of deBWT, and ‘deBWT (no conversion)’ deducts the time of the format conversion of Jellyfish output file.
Memory footprints with 32 CPU cores (in Gigabytes)
| Methods | Human genomes | Human contigs | Primate genomes |
|---|---|---|---|
| deBWT | 120/78/38 | 120/63/34 | 235/203/58 |
| ParaBWT | 30 | 30 | 29 |
| RopeBWT2 | 30 | 24 | 40 |
| Supplement | |||
| KMC2 | 119 | 119 | 119 |
For the ‘x/y/z’ entries of deBWT in the memory columns, x, y and z respectively indicate the memory footprints of Jellyfish, phase 1 of deBWT, and phases 2 and 3 of deBWT.
Statistics on the in silico human genomes and contigs
| Statistics | Human genomes | Human contigs |
|---|---|---|
| length of input sequences | 30955436371 | 30200003020 |
| distinct k-mers | 5073730669 | 4025285321 |
| multiple-out k-mers | 18820763 | 17238123 |
| multiple-in k-mers | 18821805 | 17237511 |
| copies of multiple-out k-mers | 2364004617 | 2301293218 |
| copies of multiple-in k-mers | 2364445711 | 2300904807 |
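These statistics give a rough sense of how much of the BWT falls into the unsolved parts, since suffixes whose initial k-mer is a copy of a multiple-in vertex are the ones that need sorting. The sketch below only divides the reported counts; it assumes the input length is a close enough proxy for the number of suffixes, and the key names are illustrative.

```python
# Counts copied from the table above.
STATS = {
    "Human genomes": {
        "length": 30955436371,
        "distinct_kmers": 5073730669,
        "multi_in_kmers": 18821805,
        "multi_in_copies": 2364445711,
    },
    "Human contigs": {
        "length": 30200003020,
        "distinct_kmers": 4025285321,
        "multi_in_kmers": 17237511,
        "multi_in_copies": 2300904807,
    },
}

def unsolved_fraction(s):
    """Approximate share of suffixes that start with a multiple-in k-mer."""
    return s["multi_in_copies"] / s["length"]

def multi_in_vertex_fraction(s):
    """Share of distinct k-mers (dBG vertices) that are multiple-in."""
    return s["multi_in_kmers"] / s["distinct_kmers"]

if __name__ == "__main__":
    for name, s in STATS.items():
        print(f"{name}: {unsolved_fraction(s):.2%} of positions, "
              f"{multi_in_vertex_fraction(s):.2%} of distinct k-mers")
```

Under that assumption, fewer than 0.5% of the distinct k-mers are multiple-in, and their copies cover a little under 8% of the input positions in both datasets.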