Vitor C Piro1,2,3, Temesgen H Dadi4, Enrico Seiler4, Knut Reinert4, Bernhard Y Renard1,3.
Abstract
MOTIVATION: The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification use indices that are often outdated and almost never truly up to date.
Year: 2020 PMID: 32657362 PMCID: PMC7355301 DOI: 10.1093/bioinformatics/btaa458
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Ganon methodology overview. (A) Empty circles are inner nodes of the tree; ‘x’ circles are leaf nodes (referenced in this manuscript as taxid nodes); full lines represent taxonomic relations, dotted lines represent the extension of the taxonomy to the assembly and sequence levels. Species+ represents all taxonomic groups that are more specific than species with species in the lineage (e.g. subspecies, species group, no rank). (B) A toy example of sequences clustered by species into equal-sized groups, performed by TaxSBP. (C) Sequences are fragmented into k-mers and with a given number of hash functions, those k-mers are inserted into equal-sized bit vectors (Bloom Filters). (D) The IBF, representing the previously generated bit vectors with each bit interleaved. (E) Classification of short reads (black lines) against the IBF. Reads are fragmented into k-mers, counted with the same hash functions against the IBF, filtered and assigned to one or more species followed by LCA assignment for multiple matches
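Panels (C) to (E) can be illustrated with a toy interleaved Bloom filter. The sketch below is not ganon's implementation: the class name, the parameters, the SHA-256-based hashing and the use of Python lists instead of packed bit vectors are all assumptions made for readability, and the filtering and LCA steps of panel (E) are left out.

```python
import hashlib


class InterleavedBloomFilter:
    """Toy IBF: `bins` Bloom filters of `bin_size` bits each, stored so that
    bit i of every bin lies in the same row (hypothetical illustration)."""

    def __init__(self, bins, bin_size, num_hashes=3, k=4):
        self.bins = bins
        self.bin_size = bin_size
        self.num_hashes = num_hashes
        self.k = k
        # rows[i][b] holds bit i of bin b, i.e. the bins are interleaved.
        self.rows = [[False] * bins for _ in range(bin_size)]

    def _kmers(self, seq):
        return [seq[i:i + self.k] for i in range(len(seq) - self.k + 1)]

    def _positions(self, kmer):
        # One row index per hash function, derived from a seeded SHA-256.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bin_size

    def insert(self, seq, bin_id):
        """Insert every k-mer of `seq` into the Bloom filter of `bin_id`."""
        for kmer in self._kmers(seq):
            for pos in self._positions(kmer):
                self.rows[pos][bin_id] = True

    def count(self, read):
        """Return, for each bin, how many k-mers of `read` it may contain."""
        counts = [0] * self.bins
        for kmer in self._kmers(read):
            hit = [True] * self.bins
            # A bin contains the k-mer only if all of its hashed bits are set,
            # so the rows are combined with a logical AND across all bins.
            for pos in self._positions(kmer):
                hit = [h and bit for h, bit in zip(hit, self.rows[pos])]
            for b, h in enumerate(hit):
                counts[b] += h
        return counts


ibf = InterleavedBloomFilter(bins=2, bin_size=1024)
ibf.insert("ACGTACGTTGCA", 0)   # bin 0: first toy reference group
ibf.insert("GGGGCCCCAAAA", 1)   # bin 1: second toy reference group
counts = ibf.count("ACGTACGT")  # per-bin k-mer counts for one read
```

Because one row holds the same bit position of every bin, a single lookup per hash function answers the membership query for all bins at once, which is the point of the interleaving. Bloom filters can return false positives but never false negatives, so the count for the bin that truly contains the read always equals the number of k-mers in the read.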
Fig. 2. Cumulative-based precision, sensitivity and F1-score values at all ranks for the simulated reads against all evaluated reference sets (blue = RefSeq-OLD, orange = RefSeq-CG-top-3 and red = RefSeq-ALL-top-3)
Genomic DNA of reference sequences used for evaluations
| | Base pairs | # assemblies | # sequences |
|---|---|---|---|
| RefSeq-OLD | 9 632 441 987 | 3042 | 5242 |
| RefSeq-CG | 46 986 899 184 | 19 623 | 33 029 |
| RefSeq-ALL | 587 607 072 429 | 147 713 | 15 201 684 |
Note: Protein data information can be found in the Supplementary Table S5. Detailed information of each dataset can be found in the Supplementary Section S2.5.1. Data were downloaded using https://github.com/pirovc/genome_updater.
Reference sequences after over-representation filtering
| | Base pairs | # species | # leaf taxids | # assemblies | # sequences |
|---|---|---|---|---|---|
| RefSeq-CG-top-3 | (62%) | 11 464 (100%) | 14 071 (100%) | 15 171 (77%) | 24 290 (74%) |
| RefSeq-ALL-top-3 | (36%) | 29 061 (100%) | 51 292 (100%) | 56 805 (38%) | 4 400 402 (29%) |
Note: Percentages in brackets show the amount of data left compared to the original set (Table 1). Protein data information can be found in the Supplementary Table S5.
Build times, memory consumption and index sizes at taxonomic level
| Reference | Method | Time (hh:mm:ss) | Memory (GiB) | Index size (GiB) |
|---|---|---|---|---|
| RefSeq-OLD | Centrifuge | 02:51:03 | 98 | 4 |
| | Clark | 04:07:56 | 150 | 32 |
| | Diamond | 00:08:07 | 28 | 3 |
| | Ganon | 00:02:08 | 22 | 15 |
| | Kraken | 02:04:16 | 87 | 73 |
| | Kraken2 | 00:17:28 | 13 | 10 |
| RefSeq-CG-top-3 | Centrifuge | 06:51:25 | 262 | 12 |
| | Clark | 08:45:31 | 243 | 81 |
| | Diamond | 00:10:33 | 27 | 9 |
| | Ganon | 00:07:01 | 68 | 62 |
| | Kraken | 04:53:31 | 195 | 184 |
| | Kraken2 | 00:45:25 | 29 | 26 |
| RefSeq-ALL-top-3 | Diamond | 00:36:23 | 30 | 70 |
| | Ganon | 00:54:48 | 248 | 249 |
| | Kraken2 | 05:04:24 | 124 | 123 |
Note: Memory and index size in GiB. All tools build at taxonomic leaf nodes (taxid), except clark, which builds at species level. Tools running more than 24 h to build were not considered. A total of 48 threads were used for all tools. Computer specifications and parameters used are in the Supplementary Sections S2.1 and S2.4. Krakenuniq was not evaluated at the taxonomic level since it runs exactly the same base algorithm as kraken in this configuration.
Build times, memory consumption and index sizes at assembly level
| Reference | Method | Time (hh:mm:ss) | Memory (GiB) | Index size (GiB) |
|---|---|---|---|---|
| RefSeq-OLD | Centrifuge | 02:51:03 | 98 | 4 |
| | Ganon | 00:02:22 | 30 | 23 |
| | Krakenuniq | 02:06:41 | 87 | 73 |
| RefSeq-CG | Centrifuge | 12:32:08 | 428 | 20 |
| | Ganon | 00:10:49 | 100 | 93 |
| | Krakenuniq | 08:54:56 | 321 | 190 |
| RefSeq-ALL | Ganon | 02:30:47 | 493 | 501 |
Note: Memory and Index size in GiB. Tools running more than 24 h to build were not considered. A total of 48 threads were used for all tools. Computer specifications and parameters used are in the Supplementary Sections S2.1 and S2.4.
Rank-based precision, sensitivity and F1-score values for the simulated reads at species level
| Reference | Method | Sensitivity (%) | Precision (%) | F1-score (%) |
|---|---|---|---|---|
| RefSeq-OLD | Centrifuge | | 79.59 | 54.54 |
| | Clark | 41.13 | 84.28 | 55.28 |
| | Diamond | 4.91 | 29.01 | 8.40 |
| | Ganon | 40.57 | | |
| | Kraken | 41.23 | 83.91 | 55.29 |
| | Kraken2 | 41.43 | 78.40 | 54.21 |
| RefSeq-CG-top-3 | Centrifuge | | 79.28 | 56.31 |
| | Clark | 43.01 | 82.59 | 56.57 |
| | Diamond | 11.30 | 74.43 | 19.62 |
| | Ganon | 41.87 | | |
| | Kraken | 43.23 | 82.49 | 56.73 |
| | Kraken2 | 43.61 | 79.01 | 56.20 |
| RefSeq-ALL-top-3 | Diamond | 13.10 | 88.64 | 22.82 |
| | Ganon | | | |
| | Kraken2 | 53.78 | 91.60 | 67.77 |
Note: Numbers in bold denote the best results for each reference set. The use of a larger reference set with RefSeq-ALL-top-3 significantly improves results. Only diamond, ganon and kraken2 indexed the RefSeq-ALL-top-3 in <24 h; centrifuge, clark and kraken were therefore excluded. Results for all taxonomic levels are in the Supplementary Figure S8 and Supplementary Material S2.
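The F1-scores reported alongside sensitivity and precision are consistent with the usual harmonic mean of the two values. The short sketch below recomputes one row as a check; the function name is an assumption, and returning zero when both inputs are zero is a convention chosen here to avoid division by zero.

```python
def f1_score(sensitivity, precision):
    """Harmonic mean of sensitivity and precision, both given in percent."""
    if sensitivity + precision == 0:
        return 0.0  # convention assumed here for the undefined 0/0 case
    return 2 * sensitivity * precision / (sensitivity + precision)


# Clark on RefSeq-OLD, simulated reads at species level:
# sensitivity 41.13 %, precision 84.28 %, reported F1-score 55.28 %.
clark_f1 = round(f1_score(41.13, 84.28), 2)
print(clark_f1)
```

The same relation holds for the other complete rows, for example kraken2 on RefSeq-ALL-top-3 (53.78 and 91.60 give 67.77).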
Fig. 3. Cumulative-based precision, sensitivity and F1-score values at all ranks for the real reads against all evaluated reference sets (blue = RefSeq-OLD, orange = RefSeq-CG-top-3 and red = RefSeq-ALL-top-3)
Rank-based precision, sensitivity and F1-score values for the real reads at species level
| Reference | Method | Sensitivity (%) | Precision (%) | F1-score (%) |
|---|---|---|---|---|
| RefSeq-OLD | Centrifuge | 0.51 | 2.24 | 0.84 |
| | Clark | 0.49 | 3.21 | 0.86 |
| | Diamond | 0.00 | 0.00 | 0.00 |
| | Ganon | 0.43 | | 0.82 |
| | Kraken | 0.50 | 3.13 | 0.86 |
| | Kraken2 | | 2.19 | |
| RefSeq-CG-top-3 | Centrifuge | 2.41 | 7.03 | 3.59 |
| | Clark | 2.34 | 9.57 | 3.76 |
| | Diamond | 1.74 | 11.23 | 3.02 |
| | Ganon | 1.77 | | 3.26 |
| | Kraken | 2.39 | 9.61 | |
| | Kraken2 | | 7.22 | 3.82 |
| RefSeq-ALL-top-3 | Diamond | 12.38 | | 20.27 |
| | Ganon | 24.78 | 38.67 | 30.20 |
| | Kraken2 | | 37.84 | |
Note: Numbers in bold denote the best results for each reference set. The use of a larger reference set with RefSeq-ALL-top-3 significantly improves results. Only diamond, ganon and kraken2 indexed the RefSeq-ALL-top-3 in <24 h; centrifuge, clark and kraken were therefore excluded. Results for all taxonomic levels are in the Supplementary Figure S9 and Supplementary Material S2.
Fig. 4. AMBER average completeness/sensitivity (green) and purity/precision (blue) values for real reads. Results for diamond (left), ganon (middle) and kraken2 (right) using RefSeq-ALL-top-3 set of references. Strain level in AMBER plots is equivalent to species+ in our evaluations
Rank-based precision, sensitivity and F1-score values for the simulated reads at assembly level
| Reference | Method | Sensitivity (%) | Precision (%) | F1-score (%) |
|---|---|---|---|---|
| RefSeq-OLD | Centrifuge | | 64.54 | 33.68 |
| | Ganon | 22.32 | | |
| | Krakenuniq | 22.68 | 69.66 | 34.22 |
| RefSeq-CG | Centrifuge | | 30.77 | 17.08 |
| | Ganon | 11.52 | | |
| | Krakenuniq | 11.67 | 32.45 | 17.17 |
| RefSeq-ALL | Ganon | 21.56 | 87.89 | 34.62 |
Note: Numbers in bold denote the best results for each reference set. Only ganon indexed the RefSeq-ALL in <24 h; centrifuge, clark and kraken were therefore excluded. Results for all taxonomic levels are in the Supplementary Material S2.
Classification performance
| Reference | Method | Simulated: Mbp/m | Simulated: wall time | Simulated: memory (GiB) | Real: Mbp/m | Real: wall time | Real: memory (GiB) |
|---|---|---|---|---|---|---|---|
| RefSeq-CG-top-3 | Centrifuge | 298 | 00:24:59 (±51 s) | 13 | 802 | 00:09:19 (±4 s) | 13 |
| | Clark | 1104 | 00:06:44 (±5 s) | 101 | 1208 | 00:06:11 (±4 s) | 100 |
| | Diamond | 36 | 03:27:00 (±259 s) | 14 | 33 | 03:40:55 (±170 s) | 15 |
| | Ganon | 380 | 00:20:31 (±8 s) | 61 | 538 | 00:14:50 (±1 s) | 61 |
| | Kraken | 2113 | 00:03:46 (±1 s) | 177 | 2734 | 00:02:57 (±3 s) | 177 |
| | Kraken2 | 3085 | 00:02:47 (±1 s) | 27 | 3833 | 00:02:19 (±1 s) | 27 |
| RefSeq-ALL-top-3 | Diamond | 6 | 18:23:09 (±729 s) | 21 | 5 | 21:23:00 (±181 s) | 22 |
| | Ganon | 107 | 01:13:44 (±7 s) | 243 | 153 | 00:53:13 (±12 s) | 243 |
| | Kraken2 | 2991 | 00:04:11 (±4 s) | 123 | 3659 | 00:03:52 (±1 s) | 122 |
Note: Memory in GiB. Full set of simulated and real reads classified with 48 threads. Centrifuge, clark and diamond performance in Mbp/m calculated from wall time. Values are the average of four out of five consecutive runs (excluding the slowest run), with SD for the run-time in parentheses. Computer specifications and parameters used are in the Supplementary Sections S2.1 and S2.4.
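For the tools whose Mbp/m is calculated from wall time, the conversion can be sketched as follows. It assumes that the rate is simply the total number of classified bases (in Mbp) divided by the wall time in minutes; the helper names are hypothetical, and the size of the read sets is not stated in this section, so the example only derives the size implied by one table row.

```python
def to_minutes(hhmmss):
    """Convert an hh:mm:ss wall time into minutes."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 60 + minutes + seconds / 60


def mbp_per_minute(total_mbp, wall_time):
    """Throughput in Mbp/m, assuming rate = bases / wall time."""
    return total_mbp / to_minutes(wall_time)


def implied_total_mbp(rate_mbp_per_minute, wall_time):
    """Read-set size implied by a reported rate and wall time."""
    return rate_mbp_per_minute * to_minutes(wall_time)


# Centrifuge, RefSeq-CG-top-3, simulated reads: 298 Mbp/m in 00:24:59.
simulated_mbp = implied_total_mbp(298, "00:24:59")
print(round(simulated_mbp))
```

Under this assumption, the clark and diamond rows for the same read set imply a very similar total, which suggests the reported rates follow this simple ratio.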