Steven Hofmeyr, Rob Egan, Evangelos Georganas, Alex C Copeland, Robert Riley, Alicia Clum, Emiley Eloe-Fadrosh, Simon Roux, Eugene Goltsman, Aydın Buluç, Daniel Rokhsar, Leonid Oliker, Katherine Yelick.
Abstract
Metagenome sequence datasets can contain terabytes of reads, too many to be coassembled together on a single shared-memory computer; consequently, they have only been assembled sample by sample (multiassembly) and combining the results is challenging. We can now perform coassembly of the largest datasets using MetaHipMer, a metagenome assembler designed to run on supercomputers and large clusters of compute nodes. We have reported on the implementation of MetaHipMer previously; in this paper we focus on analyzing the impact of very large coassembly. In particular, we show that coassembly recovers a larger genome fraction than multiassembly and enables the discovery of more complete genomes, with lower error rates, whereas multiassembly recovers more dominant strain variation. Being able to coassemble a large dataset does not preclude one from multiassembly; rather, having a fast, scalable metagenome assembler enables a user to more easily perform coassembly and multiassembly, and assemble both abundant, high strain variation genomes, and low-abundance, rare genomes. We present several assemblies of terabyte datasets that could never be coassembled before, demonstrating MetaHipMer's scaling power. MetaHipMer is available for public use under an open source license and all datasets used in the paper are available for public download.
Year: 2020 PMID: 32612216 PMCID: PMC7329831 DOI: 10.1038/s41598-020-67416-5
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Genome fractions for references from MarRef found in the WA assemblies.
Figure 2. Genome fractions for strains of P. ubique found in the WA assemblies.
Assembly quality for WAmix.
| Sample type | Length (gbp) | Largest alignment | Contigs (millions) | Genome % | Misassemblies (extensive) | Misassemblies (local) | Mismatches/100 kbp | Indels/100 kbp | Duplication ratio |
|---|---|---|---|---|---|---|---|---|---|
| Coassembly | 7.4 | 1,085,233 | 5.5 | 94 | 96 | 58 | 159 | 3.3 | 1.2 |
| Multiassembly | 12.0 | 50,429 | 10.2 | 46 | 365 | 396 | 1654 | 66.1 | 7.3 |
| Deduplicated | 8.6 | 50,429 | 7.0 | 45 | 243 | 294 | 1146 | 59.6 | 4.3 |
| Single sample avg | 1.0 | 29,207 | 0.8 | 24 | 30 | 33 | 264 | 10.6 | 1.2 |
| ArcticSynth | 0.1 | 1,085,223 | 0.01 | 94 | 45 | 15 | 26 | 1.5 | 1.0 |
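As a rough worked check of the WAmix table above, the following Python sketch computes how the Coassembly and Multiassembly rows compare on genome fraction and on error rates. The numbers are copied from those two rows; the dictionary keys and the helper name `fold_change` are illustrative choices, not names from the paper.

```python
# Values copied from the Coassembly and Multiassembly rows of the WAmix table.
# Key names are illustrative assumptions, not taken from the paper.
coassembly = {
    "genome_pct": 94,
    "mismatches_per_100kbp": 159,
    "indels_per_100kbp": 3.3,
}
multiassembly = {
    "genome_pct": 46,
    "mismatches_per_100kbp": 1654,
    "indels_per_100kbp": 66.1,
}


def fold_change(numerator, denominator):
    """Return numerator / denominator, rejecting a zero denominator."""
    if denominator == 0:
        raise ZeroDivisionError("denominator must be non-zero")
    return numerator / denominator


# How much more of the reference genomes the coassembly covers.
genome_gain = fold_change(coassembly["genome_pct"], multiassembly["genome_pct"])
# How many times higher the multiassembly error rates are.
mismatch_ratio = fold_change(
    multiassembly["mismatches_per_100kbp"], coassembly["mismatches_per_100kbp"]
)
indel_ratio = fold_change(
    multiassembly["indels_per_100kbp"], coassembly["indels_per_100kbp"]
)

print(f"Genome fraction, coassembly vs multiassembly: {genome_gain:.1f}x")
print(f"Mismatches, multiassembly vs coassembly: {mismatch_ratio:.1f}x")
print(f"Indels, multiassembly vs coassembly: {indel_ratio:.1f}x")
```

On these figures, the coassembly covers roughly twice the genome fraction of the multiassembly, with about 10 times fewer mismatches and about 20 times fewer indels per 100 kbp.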
Figure 3. Genome fraction vs depth for synthetic reference genomes within WAmix.
Figure 4. Cumulative lengths for contigs aligned to synthetic reference genomes within WAmix.
Quality of assemblies of synthetic datasets.
| Assembler | NGA50 (kbp) | Largest alignment | Contigs | Genome % | Misassemblies (extensive) | Misassemblies (local) | Mismatches/100 kbp | Indels/100 kbp | Genome bins | rRNAs (16S) | rRNAs (23S) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ArcticSynth | | | | | | | | | | | |
| MetaHipMer | 52 | 1,085,233 | 9060 | 93.8 | 45 | 15 | 26 | 1.5 | 15 | 13 | 11 |
| MEGAHIT | 65 | 1,085,427 | 7432 | 95.1 | 110 | 78 | 134 | 1.9 | 16 | 14 | 7 |
| MetaSPAdes | 17 | 1,290,245 | 17,701 | 91.2 | 53 | 50 | 291 | 1.9 | 8 | 7 | 3 |
| | | | | | | | | | | | |
| MetaHipMer | 152 | 2,055,376 | 4239 | 92.3 | 80 | 30 | 3 | 0.4 | 19 | 15 | 11 |
| MEGAHIT | 121 | 1,636,294 | 5371 | 93.1 | 69 | 34 | 8 | 0.7 | 19 | 16 | 12 |
| MetaSPAdes | 193 | 2,055,367 | 4987 | 92.1 | 76 | 58 | 69 | 3.3 | 4 | 6 | 4 |
Figure 5. Genome fraction vs depth for assemblies of the ArcticSynth dataset.
Figure 6. Cumulative lengths for contigs for assemblies of the ArcticSynth dataset.
Quality of Gut mixed assemblies for various depths.
| Assembler | NGA50 (kbp) | Largest alignment | Contigs | Genome % | Misassemblies (extensive) | Misassemblies (local) | Mismatches/100 kbp | Indels/100 kbp | Genome bins | rRNAs (16S) | rRNAs (23S) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MetaHipMer | 0.13 | 12,213 | 82,299 | 81.8 | 23 | 3 | 77 | 10.0 | 8 | 18 | 8 |
| MEGAHIT | 0.14 | 37,232 | 94,473 | 82.8 | 48 | 15 | 66 | 4.1 | 12 | 5 | 6 |
| MetaSPAdes | 0.15 | 32,154 | 106,047 | 87.4 | 29 | 9 | 94 | 6.3 | 10 | 5 | 6 |
| MetaHipMer | 127 | 667,522 | 74,209 | 99.1 | 8 | 3 | 17 | 0.9 | 14 | 15 | 11 |
| MEGAHIT | 88 | 391,021 | 86,687 | 98.3 | 30 | 10 | 22 | 1.6 | 17 | 3 | 7 |
| MetaSPAdes | 138 | 458,963 | 97,852 | 99.1 | 1 | 6 | 13 | 1.8 | 9 | 3 | 6 |
| MetaHipMer | 150 | 667,521 | 73,881 | 98.6 | 5 | 4 | 14 | 0.6 | 13 | 19 | 12 |
| MEGAHIT | 111 | 666,969 | 86,608 | 98.7 | 18 | 7 | 21 | 1.8 | 16 | 5 | 8 |
| MetaSPAdes | 150 | 1,365,043 | 97,869 | 99.2 | 0 | 6 | 14 | 1.8 | 9 | 2 | 6 |
Quality of Marine mixed assemblies for various depths.
| Assembler | NGA50 (kbp) | Largest alignment | Contigs | Genome % | Misassemblies (extensive) | Misassemblies (local) | Mismatches/100 kbp | Indels/100 kbp | Genome bins | rRNAs (16S) | rRNAs (23S) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MetaHipMer | 0.18 | 28,718 | 196,390 | 88.2 | 3 | 95 | 72 | 2.4 | 7 | 12 | 6 |
| MEGAHIT | 0.24 | 34,471 | 197,872 | 90.5 | 68 | 27 | 71 | 2.5 | 9 | 15 | 8 |
| MetaSPAdes | 0.36 | 28,334 | 158,348 | 95.6 | 23 | 24 | 96 | 4.2 | 12 | 1 | 6 |
| MetaHipMer | 251 | 1,238,326 | 179,308 | 99.6 | 4 | 10 | 16 | 0.2 | 11 | 20 | 14 |
| MEGAHIT | 239 | 1,382,528 | 184,773 | 99.7 | 3 | 15 | 24 | 1.3 | 13 | 15 | 9 |
| MetaSPAdes | 249 | 1,382,094 | 147,628 | 99.5 | 0 | 20 | 12 | 0.7 | 14 | 1 | 7 |
| MetaHipMer | 328 | 1,410,119 | 181,509 | 99.6 | 6 | 7 | 15 | 1.3 | 12 | 21 | 12 |
| MEGAHIT | 239 | 1,410,120 | 184,646 | 99.7 | 6 | 16 | 24 | 1.2 | 12 | 17 | 9 |
| MetaSPAdes | 246 | 1,382,118 | 147,619 | 99.8 | 0 | 24 | 16 | 1.2 | 12 | 1 | 6 |
Running time (minutes) and memory usage (GB) of the assemblers.
| Assembler | ArcticSynth time | ArcticSynth memory | SYNTH64D time | SYNTH64D memory | MBARC-26 time | MBARC-26 memory | Gut 50x time | Gut 50x memory | Marine 50x time | Marine 50x memory |
|---|---|---|---|---|---|---|---|---|---|---|
| MetaHipMer | 20 | 93 | 24 | 100 | 153 | 281 | 22 | 119 | 203 | 614 |
| MEGAHIT | 22 | 4 | 15 | 4 | 90 | 42 | 16 | 4 | 143 | 42 |
| MetaSPAdes | 101 | 76 | 101 | 76 | 347 | 129 | 80 | 42 | 403 | 128 |
New terabase-scale dataset assemblies. Columns labeled * are calculated for scaffolds .
| Description | Data (TB) | Assembly (gbp) | Scaffolds (millions) | N50 (kbp) | Time (hrs) |
|---|---|---|---|---|---|
| | 2.63 | 46.2 | 41.6 | 1.2 | 5.14 |
| | 2.66 | 18.4 | 12.7 | 1.7 | 2.10 |
| | 3.34 | 15.1 | 15.5 | 1.0 | 3.20 |
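The N50 column above follows the standard assembly statistic: the length L such that sequences of length L or longer account for at least half of the total assembled bases. A minimal Python sketch of that definition is given below; the function name `n50` and the example lengths are illustrative and are not taken from the paper or from MetaHipMer's code.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together contain at least half of the total bases.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths in bp, for illustration only.
example_lengths = [100, 200, 300, 400, 500]
print(n50(example_lengths))  # 500 + 400 = 900 >= 750, so N50 is 400
```

NGA50, reported in the assembler comparison tables, is a related statistic computed over aligned blocks relative to the reference genome length rather than over raw contig lengths, so this sketch does not reproduce it.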
Figure 7. Iterative contig generation workflow in MetaHipMer. Image source: Georganas et al. [8]. Reproduced under a CC BY 4.0 open access license by permission of E. Georganas.