Jarno N Alanko1,2, Ilya B Slizovskiy3, Daniel Lokshtanov4, Travis Gagie2, Noelle R Noyes3, Christina Boucher5.
Abstract
MOTIVATION: Bait enrichment is a protocol that is becoming increasingly ubiquitous, as it has been shown to successfully amplify regions of interest in metagenomic samples. In this method, a set of synthetic probes ('baits') is designed, manufactured and applied to fragmented metagenomic DNA. The probes bind to the fragmented DNA, and any unbound DNA is rinsed away, leaving the bound fragments to be amplified for sequencing. Metsky et al. demonstrated that bait enrichment is capable of detecting a large number of human viral pathogens within metagenomic samples.
Year: 2022 PMID: 35758776 PMCID: PMC9235489 DOI: 10.1093/bioinformatics/btac226
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
MEGARES
| Input length (base pairs) | Syotti: Time (mm:ss) | Syotti: Memory (MB) | Syotti: Baits (count) | CATCH: Time (mm:ss) | CATCH: Memory (MB) | CATCH: Baits (count) | MrBait: Time (mm:ss) | MrBait: Memory (MB) | MrBait: Baits (count) |
|---|---|---|---|---|---|---|---|---|---|
| 125 271 | 00:00 | 6 | 820 | 00:01 | 78 | 824 | 00:02 | 72 | 987 |
| 254 294 | 00:00 | 8 | 1372 | 00:04 | 104 | 1392 | 00:04 | 73 | 1981 |
| 508 459 | 00:00 | 11 | 2604 | 00:09 | 158 | 2635 | 00:09 | 76 | 3983 |
| 1 032 802 | 00:00 | 20 | 4633 | 00:28 | 260 | 4742 | 00:19 | 80 | 8093 |
| 2 090 517 | 00:00 | 35 | 7901 | 01:30 | 423 | 8121 | 00:37 | 88 | 16 374 |
| 4 187 569 | 00:01 | 67 | 13 099 | 05:37 | 735 | 13 489 | 01:12 | 101 | 32 764 |
| 8 106 325 | 00:03 | 125 | 20 976 | 19:13 | 1250 | 21 771 | 02:19 | 128 | 63 428 |
Note: Running time (Time), peak memory usage (Memory) and number of baits (Baits) for increasingly large subsets of sequences from MEGARES (without VSEARCH filtering). The seconds are rounded down. Rows where all tools ran in <1 s have been removed. The full table is in the Supplementary Material.
Fig. 1. Time scaling on all three datasets
Fig. 2. Average sequence coverage on the MEGARES dataset with all three methods (left) and with just CATCH and Syotti (right). The sequences are in the order of the MEGARes database. The coverage plot with all tools after filtering MrBait baits with VSEARCH is provided in the Supplementary Material.
BACTERIA
| Input length (base pairs) | Syotti: Time (hh:mm:ss) | Syotti: Memory (MB) | Syotti: Baits (count) | CATCH: Time (hh:mm:ss) | CATCH: Memory (MB) | CATCH: Baits (count) | MrBait: Time (hh:mm:ss) | MrBait: Memory (MB) | MrBait: Baits (count) |
|---|---|---|---|---|---|---|---|---|---|
| 15 920 441 | 00:00:08 | 237 | 72 035 | 00:29:32 | 4096 | 90 182 | 00:04:48 | 195 | 132 408 |
| 30 786 116 | 00:00:14 | 454 | 96 595 | 01:00:27 | 7575 | 134 112 | 00:10:00 | 316 | 256 041 |
| 62 502 135 | 00:00:27 | 917 | 123 541 | 02:26:56 | 14 234 | 181 825 | 00:18:44 | 572 | 519 838 |
| 125 063 199 | 00:00:52 | 1831 | 157 818 | 08:43:14 | 24 269 | 240 747 | 00:38:48 | 1076 | 1 040 174 |
| 254 576 853 | 00:01:45 | 3723 | 188 813 | 41:23:20 | 45 002 | 303 051 | 01:20:03 | 2071 | 2 117 436 |
| 505 422 833 | 00:03:31 | 7387 | 223 931 | NA | NA | NA | 02:39:03 | 4056 | 4 203 752 |
| 1 003 934 029 | 00:07:11 | 14 673 | 267 890 | NA | NA | NA | 05:18:28 | 7980 | 8 349 888 |
| 2 018 459 352 | 00:15:50 | 29 496 | 324 797 | NA | NA | NA | 10:30:37 | 15 697 | 16 788 084 |
| 3 040 260 476 | 00:24:52 | 44 425 | 366 761 | NA | NA | NA | 16:13:43 | 23 615 | 25 286 576 |
Note: Running time (Time), peak memory usage (Memory) and number of baits (Baits) for increasingly large subsets of sequences from the BACTERIA dataset. ‘NA’ signifies that the dataset surpassed 72 h of running time. Rows where all tools took <30 min have been removed to save space. The full data table is in the Supplementary Material.
Fig. 3. Left: Fraction of nucleotides covered by Syotti as the algorithm progresses on the full BACTERIA dataset. Right: average coverage of the reference sequences in BACTERIA after taking the first 200k baits produced by Syotti and filling gaps to maximum length 50. The dashed line shows coverage 1.0
VIRAL
| Input length (base pairs) | Syotti: Time (hh:mm:ss) | Syotti: Memory (MB) | Syotti: Baits (count) | CATCH: Time (hh:mm:ss) | CATCH: Memory (MB) | CATCH: Baits (count) | MrBait: Time (hh:mm:ss) | MrBait: Memory (MB) | MrBait: Baits (count) |
|---|---|---|---|---|---|---|---|---|---|
| 26 238 | 00:00:00 | 5 | 148 | 00:00:00 | 55 | 164 | 00:00:01 | 71 | 218 |
| 9 672 491 | 00:00:03 | 145 | 1103 | 01:19:09 | 2878 | 1065 | 00:02:43 | 130 | 67 054 |
| 66 970 374 | 00:00:37 | 1011 | 1640 | NA | NA | NA | 00:18:59 | 577 | 522 683 |
| 91 129 015 | 00:00:55 | 1370 | 6016 | NA | NA | NA | 00:25:36 | 705 | 680 298 |
| 93 144 457 | 00:00:56 | 1400 | 6959 | NA | NA | NA | 00:28:41 | 734 | 696 193 |
| 103 290 818 | 00:01:00 | 1549 | 11 737 | NA | NA | NA | 00:32:01 | 799 | 773 385 |
| 155 140 833 | 00:01:28 | 2321 | 28 543 | NA | NA | NA | 00:45:40 | 1154 | 1 160 119 |
| 230 198 416 | 00:04:04 | 3424 | 60 539 | NA | NA | NA | 01:09:07 | 1549 | 1 557 350 |
| 564 924 375 | 00:07:45 | 8350 | 103 401 | NA | NA | NA | 02:45:21 | 3975 | 4 139 176 |
| 1 040 580 227 | 00:12:39 | 15 391 | 174 742 | NA | NA | NA | 05:14:59 | 7164 | 7 757 360 |
| 1 257 789 768 | 00:16:41 | 18 613 | 226 751 | NA | NA | NA | 06:24:44 | 8519 | 9 225 038 |
Note: Running time (Time), peak memory usage (Memory) and number of baits (Baits) for increasingly large subsets of sequences from the VIRAL dataset. ‘NA’ signifies that the dataset surpassed 72 h of running time.
Fig. 4. Coverage of the 684k baits generated by Syotti for the VIRAL dataset, versus the published bait sets of sizes 250k, 350k and 700k from CATCH. The dashed line shows coverage 1.0
Summary of the main results on the full datasets.
| | MEGARES: Syotti | MEGARES: CATCH | MEGARES: MrBait | BACTERIA: Syotti | BACTERIA: CATCH | BACTERIA: MrBait | VIRAL: Syotti | VIRAL: CATCH | VIRAL: MrBait |
|---|---|---|---|---|---|---|---|---|---|
| Coverage | 100% | 100% | 96.4% | 100% | * | 100% | 100% | * | ** |
| Number of baits | 20 976 | 21 771 | 63 428 | 366 761 | * | 25 286 576 | 226 751 | * | 9 225 038 |
| Time | 3 s | 19 min 13 s | 2 min 19 s | 24 min 52 s | * | 16 h 13 min 43 s | 16 min 41 s | * | 6 h 24 min 44 s |
| Memory | 125 MB | 1250 MB | 128 MB | 44 425 MB | * | 23 615 MB | 18 613 MB | * | 8519 MB |
Note: Applying VSEARCH filtering to the MrBait output on MEGARES reduces the bait count to 22 230 baits with coverage 95.9%, for a total run time of 4 h 22 min 36 s (the filtering was infeasible on the other two datasets due to the large number of baits from MrBait). (*) Unavailable because CATCH did not finish within 72 h. (**) Unavailable because the coverage analysis took more than 72 h of running time, owing to the large number of baits and matches.
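
To make the comparison in the summary table easier to read, the bait counts and running times can be expressed as ratios relative to Syotti. The following is a minimal sketch of that arithmetic, not part of the published analysis: the numbers are copied from the summary table, while the names `to_seconds`, `results` and `ratios_vs_syotti` are hypothetical helpers introduced here for illustration. CATCH is left out for BACTERIA and VIRAL because it did not finish within 72 h. The MEGARES time ratios are only rough, since those times are rounded down to whole seconds.

```python
# Values copied from the summary table (full datasets).
# CATCH is omitted for BACTERIA and VIRAL: it did not finish within 72 h.

def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/min/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds

# Hypothetical container for the table values (illustrative naming only).
results = {
    "MEGARES": {
        "Syotti": {"baits": 20_976, "time_s": to_seconds(seconds=3)},
        "CATCH": {"baits": 21_771, "time_s": to_seconds(minutes=19, seconds=13)},
        "MrBait": {"baits": 63_428, "time_s": to_seconds(minutes=2, seconds=19)},
    },
    "BACTERIA": {
        "Syotti": {"baits": 366_761, "time_s": to_seconds(minutes=24, seconds=52)},
        "MrBait": {"baits": 25_286_576, "time_s": to_seconds(16, 13, 43)},
    },
    "VIRAL": {
        "Syotti": {"baits": 226_751, "time_s": to_seconds(minutes=16, seconds=41)},
        "MrBait": {"baits": 9_225_038, "time_s": to_seconds(6, 24, 44)},
    },
}

def ratios_vs_syotti(dataset):
    """Return {tool: (bait_ratio, time_ratio)} relative to Syotti."""
    base = results[dataset]["Syotti"]
    return {
        tool: (r["baits"] / base["baits"], r["time_s"] / base["time_s"])
        for tool, r in results[dataset].items()
        if tool != "Syotti"
    }

if __name__ == "__main__":
    for name in results:
        for tool, (bait_ratio, time_ratio) in ratios_vs_syotti(name).items():
            print(f"{name}: {tool} produced {bait_ratio:.1f}x the baits "
                  f"and took {time_ratio:.1f}x the time of Syotti")
```

Ratios above 1 mean the other tool produced more baits or ran longer than Syotti on the same dataset. Memory is not included because Syotti's peak memory is higher than MrBait's on the two larger datasets, so a single "relative to Syotti" ratio would read differently there.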