Fatemeh Almodaresi, Hirak Sarkar, Avi Srivastava, Rob Patro.
Abstract
Motivation: Indexing reference sequences for search (both individual genomes and collections of genomes) is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure and to collapse highly repetitive sequence regions. Yet much less attention has been given to how best to index such a structure, so that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grow large.
Year: 2018 PMID: 29949982 PMCID: PMC6022659 DOI: 10.1093/bioinformatics/bty292
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
The time and memory required to load the index and query all k-mers in reads of the input FASTQ files for different datasets
| Tool | Memory (MB): Human transcriptome | Memory (MB): Human genome | Memory (MB): Bacterial genomes | Time (h:m:s): Human transcriptome | Time (h:m:s): Human genome | Time (h:m:s): Bacterial genomes |
|---|---|---|---|---|---|---|
| BWA | 308 | 4439 | 27 535 | 0:17:35 | 0:50:31 | 0:14:05 |
| Kallisto | 3336 | 110 464 | 232 353 | 0:02:01 | 0:19:11 | 0:22:25 |
| Pufferfish dense | 454 | 17 684 | 41 532 | 0:02:46 | 0:10:37 | 0:06:03 |
| Pufferfish sparse | 341 | 12 533 | 30 565 | 0:08:34 | 0:22:11 | 0:08:26 |
Fig. 1. An illustration of searching for a particular k-mer, x, in the dense pufferfish index. The minimum perfect hash yields the index in the pos vector of the entry that records where the k-mer appears in the unipath array. The k-mer is validated against the sequence recorded at this position in useq (and, in this case, it matches). A rank operation at this position is performed on bv, which yields the corresponding unipath-level information in the utab. If desired, the relative position of the k-mer within the unipath can be retrieved with an extra select and rank operation. Likewise, the rank used to determine this unipath's utab entry can also be used to look up the edges adjacent to this unipath in the etab table.
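The lookup described in the Fig. 1 caption can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the pufferfish implementation: the minimum perfect hash is replaced by a plain dictionary stand-in (`mph`), `bv` is assumed to carry a set bit at the last base of each unipath, rank and select are done by binary search over the set-bit positions rather than with succinct structures, the `utab` entries are hypothetical strings, and only forward-strand k-mers are handled.

```python
from bisect import bisect_left

K = 3  # toy k-mer length (assumption, chosen only to keep the example small)

# Two toy unipaths, concatenated into the unipath array useq.
unipaths = ["ACGTA", "GGCT"]
useq = "".join(unipaths)

# bv: assumed convention is a 1 at the last base of every unipath.
bv = [0] * len(useq)
end = 0
for u in unipaths:
    end += len(u)
    bv[end - 1] = 1
ones = [i for i, bit in enumerate(bv) if bit]  # positions of set bits

def rank(p):
    """Number of set bits in bv[0:p], i.e. the id of the unipath covering p."""
    return bisect_left(ones, p)

def select(i):
    """Position of the i-th (0-based) set bit in bv."""
    return ones[i]

# pos[i] is the start position in useq of the k-mer that hashes to i.
# Only k-mers lying fully inside one unipath are indexed.
pos = []
start = 0
for u in unipaths:
    pos.extend(start + off for off in range(len(u) - K + 1))
    start += len(u)
_mph_table = {useq[p:p + K]: i for i, p in enumerate(pos)}

def mph(x):
    """Stand-in for a minimum perfect hash: indexed k-mers get a unique slot,
    any other string still gets *some* slot, as a real MPHF would return."""
    return _mph_table.get(x, sum(map(ord, x)) % len(pos))

# utab: hypothetical per-unipath reference information.
utab = [["refA:0:+"], ["refA:5:+", "refB:2:-"]]

def lookup(x):
    """Return (unipath id, offset within unipath, utab entry), or None."""
    p = pos[mph(x)]
    if useq[p:p + K] != x:       # validate against useq; rejects alien k-mers
        return None
    uid = rank(p)                # unipath that contains position p
    ustart = 0 if uid == 0 else select(uid - 1) + 1
    return uid, p - ustart, utab[uid]

print(lookup("GCT"))  # k-mer inside the second unipath
print(lookup("TAG"))  # spans the unipath boundary, so it is not indexed
```

The validation step matters because a minimum perfect hash maps every input to some slot, including k-mers that were never indexed; comparing against `useq` is what turns the hash into an exact membership test.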
The upper half of the table shows construction time and memory requirements for BWA, kallisto and pufferfish (dense and sparse) on three different datasets
| Tool | Memory (MB): Human transcriptome | Memory (MB): Human genome | Memory (MB): Bacterial genomes | Time (h:m:s): Human transcriptome | Time (h:m:s): Human genome | Time (h:m:s): Bacterial genomes |
|---|---|---|---|---|---|---|
| BWA | 292 | 4443 | 32 213 | 0:02:56 | 0:58:27 | 13:11:45 |
| Kallisto | 3552 | 150 657 | 315 387 | 0:03:05 | 3:27:42 | 9:07:35 |
| Pufferfish dense | 1466 | 27 438 | 75 342 | 0:04:13 | 2:09:25 | 13:10:00 |
| Pufferfish sparse | 1466 | 27 438 | 75 342 | 0:04:41 | 2:28:53 | 13:46:11 |
| TwoPaCo | 1466 | 9380 | 17 407 | 0:02:47 | 0:34:43 | 9:59:05 |
| Pufferize | 584 | 27 438 | 75 342 | 0:00:10 | 0:21:53 | 1:03:17 |
| Pufferfish dense index | 438 | 20 000 | 50 459 | 0:01:16 | 0:51:20 | 2:07:38 |
| Pufferfish sparse index | 331 | 17 745 | 50 457 | 0:01:44 | 1:10:48 | 2:43:49 |
In the lower half of the table, the construction statistics are provided for the different phases of the pufferfish pipeline. The time requirement for pufferfish is the sum of the times of the individual phases of the workflow, whereas the memory requirement is the maximum over those phases. All of the tools in this table, with the exception of TwoPaCo, run single-threaded. We report here the timing results for running TwoPaCo with 16 threads. Timing results for TwoPaCo using a single thread are provided in Supplementary Table 4.
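As a worked check of the sum and max rule, the short sketch below recombines the three phase rows for the human transcriptome (TwoPaCo, Pufferize, dense index) using the values from the table above. The helper names `parse_hms` and `format_hms` are illustrative only.

```python
from datetime import timedelta

def parse_hms(text):
    """Parse an 'h:m:s' string from the table into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

def format_hms(delta):
    """Format a timedelta back into the table's 'h:mm:ss' style."""
    total = int(delta.total_seconds())
    return f"{total // 3600}:{total % 3600 // 60:02d}:{total % 60:02d}"

# (memory in MB, time) per phase, human transcriptome column of the table.
phases = {
    "TwoPaCo": (1466, "0:02:47"),
    "Pufferize": (584, "0:00:10"),
    "Pufferfish dense index": (438, "0:01:16"),
}

# Total time is the sum over phases; peak memory is the maximum over phases.
total_time = sum((parse_hms(t) for _, t in phases.values()), timedelta())
peak_memory = max(mem for mem, _ in phases.values())

print(format_hms(total_time), peak_memory)
```

For this column the result agrees with the "Pufferfish dense" row in the upper half of the table (0:04:13 and 1466 MB).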
Fig. 2. Full taxonomy classification evaluation for three tools: Kraken, Clark and Pufferfish. (a–c) We compare the F-1, Spearman correlation and mean absolute relative difference metrics for the results of the three tools over the 10 simulated read datasets LC1-8 and HC1, 2 without using any filtering options. In the plots in the second row, we evaluate the accuracy of the reports after running each tool with its default filtering option (which filters out any mapping with <20% k-mer coverage for Kraken, 44 nucleotide coverage for Pufferfish, and any mapping that is not ‘high-confidence’ for Clark).
Disk space required for the index of each tool on different datasets
| Tool | Human transcriptome | Human genome | Bacterial genomes |
|---|---|---|---|
| BWA | 347M | 5.12G | 31G |
| Kallisto | 1.7G | 58G | 120G |
| Pufferfish dense | 397M | 16.7G | 39G |
| Pufferfish sparse | 278M | 11.4G | 27.2G |