Thomas D S Sutton, Adam G Clooney, Feargal J Ryan, R Paul Ross, Colin Hill.
Abstract
BACKGROUND: The viral component of microbial communities plays a vital role in driving bacterial diversity, facilitating nutrient turnover and shaping community composition. Despite their importance, the vast majority of viral sequences are poorly annotated and share little or no homology to reference databases. As a result, investigation of the viral metagenome (virome) relies heavily on de novo assembly of short sequencing reads to recover compositional and functional information. Metagenomic assembly is particularly challenging for virome data, often resulting in fragmented assemblies and poor recovery of viral community members. Despite the essential role of assembly in virome analysis and difficulties posed by these data, current assembly comparisons have been limited to subsections of virome studies or bacterial datasets.
Keywords: Assembly; Bacteriophage; Benchmark; Comparison; Metagenome; Phage; Viral; Virome
Year: 2019 PMID: 30691529 PMCID: PMC6350398 DOI: 10.1186/s40168-019-0626-5
Source DB: PubMed Journal: Microbiome ISSN: 2049-2618 Impact factor: 14.650
A list of assemblers used in this study
| Assembler | Link | Version used | Reference |
|---|---|---|---|
| ABySS | http://www.bcgsc.ca/downloads/abyss/ | v2.0.2 | [ |
| CLC | | v5.0.5 | |
| Geneious | | v11.0.3 | [ |
| IDBA UD | | v1.1.1 | [ |
| MEGAHIT | | v1.1.1-2 | [ |
| MetaVelvet | | v1.2.02 | [ |
| MIRA | | v4.0.2 | [ |
| Ray Meta | | v2.3.0 | [ |
| SOAPdenovo2 | | v2.04 | [ |
| SPAdes | | v3.10.0 | [ |
| SPAdes meta | | v3.10.0 | [ |
| Velvet | | v1.2.10 | [ |
| VICUNA | https://github.com/broadinstitute/mvicuna | v1.3 | [ |
Fig. 1 Relationship between the percentage of each genome recovered (genome fraction), the number of contigs generated for each combination of genome and assembler, and the abundance and proportion of repeats for each genome. a, b Genomes are ordered along the x-axis by their average genome fraction across all assemblers, from high to low. a (main) Relative abundance, normalised by genome length, is plotted along the y-axis with an upper limit of 0.75%; bar colour is determined by the proportion of repeat regions in each genome. Blue bars represent genomes with a high proportion of genomic repeats (4th quartile of all genomes) and red bars represent all other genomes below this quartile. a (inset) Expanded view of a without an upper limit on the y-axis. b Percentage of genome recovered is plotted along the y-axis. Points are coloured by assembler, and the shape of each point denotes the number of contigs generated by each assembler for each genome
Fig. 2 Number of contigs each assembler recovered to a minimum genome fraction of 90% in a single contig
The number of false positive and false negative contigs generated by each assembler for the simulated community, together with the sensitivity rates
| Assembler | False positives | False negatives | True positives | No. of contigs returnedᵃ | Sensitivity (%) |
|---|---|---|---|---|---|
| ABySS ( | 0 | 111 | 461 | 7957 | 80.59 |
| ABySS ( | 1 | 123 | 449 | 7732 | 78.50 |
| CLC | 34 | 5 | 567 | 9152 | 99.13 |
| Geneious | 9 | 190 | 382 | 958 | 66.78 |
| IDBA UD | 25 | 9 | 563 | 8999 | 98.43 |
| MEGAHIT | 21 | 8 | 564 | 10,083 | 98.60 |
| MetaVelvet | N/A | N/A | N/A | N/A | N/A |
| MIRA | 4 | 13 | 559 | 27,600 | 97.73 |
| Ray Meta | 0 | 213 | 359 | 4224 | 62.76 |
| SOAPdenovo2 | 536 | 116 | 456 | 11,548 | 79.72 |
| SPAdes | 29 | 3 | 569 | 8230 | 99.48 |
| SPAdes meta | 5 | 14 | 558 | 7419 | 97.55 |
| SPAdes sc | 38 | 7 | 565 | 9506 | 98.78 |
| SPAdes sc careful | 40 | 6 | 566 | 9724 | 98.95 |
| Velvet | 1 | 65 | 507 | 6343 | 88.64 |
| VICUNA | 0 | 558 | 14 | 4 | 2.45 |
ᵃ 572 in community
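The sensitivity values in the table appear consistent with the usual definition, true positives divided by the sum of true positives and false negatives, expressed as a percentage. The sketch below is a minimal illustration of that arithmetic using three rows copied from the table; the function name `sensitivity` and the dictionary layout are assumptions made for this example and are not taken from the study's own code.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Return sensitivity as a percentage: TP / (TP + FN) * 100."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("at least one expected genome is required")
    return 100.0 * true_positives / total


# (true positives, false negatives) copied from the simulated-community table.
rows = {
    "CLC": (567, 5),
    "SPAdes": (569, 3),
    "VICUNA": (14, 558),
}

for assembler, (tp, fn) in rows.items():
    print(f"{assembler}: {sensitivity(tp, fn):.2f}")
```

In each of these rows, true positives and false negatives sum to 572, the community size given in the table footnote.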
The number of false positive and false negative contigs generated by each assembler for (a) mock community A and (b) mock community B along with the sensitivity rates for each
| Assembler | False positives | False negatives | True positives | No. of contigs returnedᵃ | Sensitivity (%) |
|---|---|---|---|---|---|
| Mock community A | | | | | |
| ABySS ( | 52 | 4 | 8 | 61 | 66.67 |
| ABySS ( | 50 | 6 | 6 | 56 | 50.00 |
| CLC | 1143 | 0 | 12 | 1299 | 100.00 |
| Geneious | 53 | 0 | 12 | 65 | 100.00 |
| IDBA UD | 0 | 0 | 12 | 12 | 100.00 |
| MEGAHIT | 0 | 0 | 12 | 13 | 100.00 |
| MetaVelvet | 0 | 3 | 9 | 26 | 75.00 |
| MIRA | 0 | 0 | 12 | 89 | 100.00 |
| Ray Meta | 0 | 0 | 12 | 12 | 100.00 |
| SOAPdenovo2 | 2 | 0 | 12 | 23 | 100.00 |
| SPAdes | 0 | 0 | 12 | 14 | 100.00 |
| SPAdes meta | 0 | 0 | 12 | 14 | 100.00 |
| SPAdes sc | 1513 | 0 | 12 | 1527 | 100.00 |
| SPAdes sc careful | 0 | 0 | 12 | 15 | 100.00 |
| Velvet | 0 | 3 | 9 | 26 | 75.00 |
| VICUNA | 4969 | 0 | 12 | 5385 | 100.00 |
| Mock community B | | | | | |
| ABySS ( | 60 | 4 | 8 | 69 | 66.67 |
| ABySS ( | 132 | 6 | 6 | 139 | 50.00 |
| CLC | 450 | 0 | 12 | 505 | 100.00 |
| Geneious | 14 | 0 | 12 | 30 | 100.00 |
| IDBA UD | 0 | 0 | 12 | 12 | 100.00 |
| MEGAHIT | 0 | 0 | 12 | 14 | 100.00 |
| MetaVelvet | 0 | 1 | 11 | 24 | 91.67 |
| MIRA | 94 | 1 | 11 | 157 | 91.67 |
| Ray Meta | 0 | 0 | 12 | 13 | 100.00 |
| SOAPdenovo2 | 2 | 2 | 10 | 27 | 83.33 |
| SPAdes | 0 | 0 | 12 | 13 | 100.00 |
| SPAdes meta | 0 | 0 | 12 | 14 | 100.00 |
| SPAdes sc | 593 | 0 | 12 | 607 | 100.00 |
| SPAdes sc careful | 0 | 0 | 12 | 14 | 100.00 |
| Velvet | 0 | 1 | 11 | 24 | 91.67 |
| VICUNA | 0 | 0 | 12 | 15 | 100.00 |
ᵃ 12 in community
Fig. 3 Mauve output of the Q33 reference genome (top) along with the assemblies from the six assemblers that recovered > 99% of the genome in a single contig. Assembly regions outside of locally collinear blocks, which do not share homology with the reference genome, are highlighted by a black outline. For assemblies in the opposite orientation to the reference (VICUNA, CLC, Geneious), the reverse complement was plotted for visualisation purposes
Fig. 4 a Time, measured in seconds, for each assembly to reach completion successfully for each read subset. b The maximum RAM, measured in MB, used for each assembly for each read subset. c Mean N50 length and d mean contig length for four samples for each assembly across the read subsets after filtering out contigs shorter than 1000 bases. Points represent the mean time for the four samples, while error bars show the standard error
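For readers unfamiliar with the N50 statistic reported in panel c, the following is a minimal sketch of the standard definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembled length. The 1000-base filter mirrors the caption; the function name `n50`, the `min_length` parameter and the example lengths are illustrative assumptions, not the study's own implementation or data.

```python
from typing import Iterable


def n50(lengths: Iterable[int], min_length: int = 1000) -> int:
    """Return the N50 of contig lengths after dropping contigs shorter than min_length.

    Returns 0 when no contig passes the filter.
    """
    # Keep contigs of at least min_length bases, longest first.
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        return 0
    half_total = sum(kept) / 2
    running = 0
    for length in kept:
        running += length
        if running >= half_total:
            return length
    return kept[-1]  # not reached in practice; kept for completeness


# Hypothetical contig lengths; the 500-base contig is removed by the filter.
example_lengths = [500, 1000, 2000, 3000, 4000]
print(n50(example_lengths))
```

Filtering before computing N50 matters because many very short contigs can inflate the total assembled length and pull the statistic down.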