Metagenomic methods enable the simultaneous characterization of microbial communities without time-consuming and bias-inducing culturing. Metagenome-assembled genome (MAG) binning methods aim to reassemble individual genomes from this data. However, the recovery of mobile genetic elements (MGEs), such as plasmids and genomic islands (GIs), by binning has not been well characterized. Given the association of antimicrobial resistance (AMR) genes and virulence factor (VF) genes with MGEs, studying their transmission is a public-health priority. The variable copy number and sequence composition of MGEs makes them potentially problematic for MAG binning methods. To systematically investigate this issue, we simulated a low-complexity metagenome comprising 30 GI-rich and plasmid-containing bacterial genomes. MAGs were then recovered using 12 current prediction pipelines and evaluated. While 82-94 % of chromosomes could be correctly recovered and binned, only 38-44 % of GIs and 1-29 % of plasmid sequences were found. Strikingly, no plasmid-borne VF nor AMR genes were recovered, and only 0-45 % of AMR or VF genes within GIs. We conclude that short-read MAG approaches, without further optimization, are largely ineffective for the analysis of mobile genes, including those of public-health importance, such as AMR and VF genes. We propose that researchers should explore developing methods that optimize for this issue and consider also using unassembled short reads and/or long-read approaches to more fully characterize metagenomic data.
Metagenomic methods enable the simultaneous characterization of microbial communities without time-consuming and bias-inducing culturing. Metagenome-assembled genome (MAG) binning methods aim to reassemble individual genomes from this data. However, the recovery of mobile genetic elements (MGEs), such as plasmids and genomic islands (GIs), by binning has not been well characterized. Given the association of antimicrobial resistance (AMR) genes and virulence factor (VF) genes with MGEs, studying their transmission is a public-health priority. The variable copy number and sequence composition of MGEs makes them potentially problematic for MAG binning methods. To systematically investigate this issue, we simulated a low-complexity metagenome comprising 30 GI-rich and plasmid-containing bacterial genomes. MAGs were then recovered using 12 current prediction pipelines and evaluated. While 82-94 % of chromosomes could be correctly recovered and binned, only 38-44 % of GIs and 1-29 % of plasmid sequences were found. Strikingly, no plasmid-borne VF nor AMR genes were recovered, and only 0-45 % of AMR or VF genes within GIs. We conclude that short-read MAG approaches, without further optimization, are largely ineffective for the analysis of mobile genes, including those of public-health importance, such as AMR and VF genes. We propose that researchers should explore developing methods that optimize for this issue and consider also using unassembled short reads and/or long-read approaches to more fully characterize metagenomic data.
Entities:
Keywords:
antimicrobial resistance; genomic islands; metagenome-assembled genomes; metagenomics; mobile genetic elements
In keeping with FAIR principles (Findable, Accessible, Interoperable, Reusable data), all analyses presented in this paper can be reproduced and inspected with the associated GitHub repository (https://github.com/fmaguire/MAG_gi_plasmid_analysis) (10.5281/zenodo.4005062) and data repository (https://osf.io/nrejs/) (10.17605/OSF.IO/NREJS).Metagenome-assembled genome (MAG) binning has become an increasingly common approach in environmental, microbiome and public-health studies that makes use of short-read metagenomic data. By examining 12 widely used MAG binning workflows, we demonstrate that these methods are not suitable for the analysis of mobile genetic elements. Given the potential human and animal health implications of antimicrobial resistance and virulence genes associated with these elements, inappropriate use of short-read MAGs has the potential to be misleading at best and harmful at worst. This result will hopefully stimulate a shift in MAG methods to focus on developing approaches optimized for these elements, as well as incorporating additional read-based and long-read analyses.
Introduction
Metagenomics, the sequencing of DNA from within an environmental sample, is widely used to characterize the functional potential and identity of microbial communities [1, 2]. These approaches have been instrumental in developing our understanding of the distribution and evolutionary history of antimicrobial resistance (AMR) genes [3-5], as well as tracking pathogen outbreaks [6]. Although long-read DNA technologies [e.g. Oxford Nanopore Technologies’ nanopore sequencing [7] and Pacific Biosciences’ (PacBio) single-molecule real-time sequencing [8] platforms] are now being used for metagenomic sequencing [9, 10], high-throughput sequencing of relatively short reads (150–250 bp) on platforms such as the Illumina MiSeq still dominates metagenomics. These reads can be directly analysed using reference databases and a variety of homology search tools [11-14]. Since these reads are shorter than most genes, however, read-based methods provide very little information about genomic organization. This lack of contextual information is particularly problematic in the study of AMR genes and virulence factors (VFs), as the genomic context plays a role in function [15], selective pressures [16] and likelihood of lateral gene transfer (LGT) [17].Sequence assembly using specialized metagenomic de Bruijn graph assemblers (e.g. metaSPAdes [18], IDBA-UD [19] and megahit [20]) is often used to try to recover information about genomic context [21]. To disentangle the resulting mix of assembled fragments, there has been a move to group these contigs based on the idea that those from the same source genome will have similar relative abundance and sequence composition [22]. These resulting groups or ‘bins’ are known as metagenome-assembled genomes (MAGs). A range of tools have been released to perform this binning, including CONCOCT [23], MetaBAT2 [24], MaxBin2 [25] and a tool that combines their predictions: DAS Tool [26]. These MAG binning methods have been used successfully in unveiling previously uncharacterized genomic diversity [27-29], but metagenomic assembly and binning has been shown to involve the loss of some information. This means as little as 24.2–36.4 % of reads [30, 31] and ~23 % of genomes [31] are successfully assembled and binned in some metagenomic analyses. The Critical Assessment of Metagenome Interpretation (CAMI) challenge’s (https://data.cami-challenge.org/) Assessment of Metagenome BinnERs (amber) [32] benchmarks different MAG recovery methods in terms of global completeness and bin purity. Similarly, a recent study has also used the amber approach to evaluate 15 different binning methods applied to a common metaSPAdes assembly [33]. However, to the best of our knowledge, there has not been a specific assessment of MAG-based recovery of mobile genetic elements (MGEs) such as genomic islands (GIs) and plasmids, despite their health and research importance.GIs are clusters of chromosomal genes that are known or predicted to have been acquired through LGT events. GIs can arise following the integration of MGEs, such as integrons, transposons, integrative and conjugative elements (ICEs), and prophages (integrated phages) [34, 35]. They are of high interest since VFs are disproportionately associated with these mobile sequences [36], as are certain AMR genes [37, 38]. GIs often have differing nucleotide composition compared to the rest of the genome [34], a trait exploited by GI prediction tools such as SIGI-HMM [39] and IslandPath-DIMOB [40], and integrative tools like IslandViewer [41]. GIs may also exist as multiple copies within a genome [42], leading to potential assembly difficulties and biases in the calculation of coverage statistics.Plasmids are circular or linear extrachromosomal self-replicating pieces of DNA with variable copy numbers and repetitive sequences [43, 44]. Similar to GIs, the sequence composition (including G+C content, dinucleotide bias, etc.) of plasmids are often markedly different from the genome with which they are associated [45-47]. Plasmids are also of high interest as a major source of the lateral dissemination of AMR genes throughout microbial ecosystems [37, 48].GIs and plasmids have proven particularly difficult to assemble from short-read sequencing data. Due to the history of their integration at specific insertion sites, GIs are commonly flanked by direct repeats [49, 50]. Repetitive sequences are known to complicate assembly from short reads, with repeats often found at contig break sites [51]. Given that assembly of closely related genomes in a metagenome is already challenging [52], the polymorphic nature of GIs and the known presence of flanking repeats would be expected to compound these separate assembly issues. Repeats also inhibit the assembly of plasmids from short-read sequencing data, particularly for longer plasmid sequences [53]. Additionally, the varying sequence composition and relative abundance features mean that GIs and plasmids pose significant challenges in MAG recovery.As these MGEs are key to the function and spread of pathogenic traits such as AMR and virulence, and with MAG approaches becoming increasingly popular within microbial and public-health research, it is both timely and vital that we assess the impact of metagenome assembly and binning on the recovery of these elements. Therefore, to address this issue, we performed an analysis of the recovery accuracy of GI and plasmid sequences, and associated AMR/VF genes, across a set of 12 state-of-the-art methods for short-read metagenome assemblies. We show that short-read MAG-based analyses alone are not suitable for the study of mobile sequences, including those of public-health importance.
Methods
Metagenome simulation
Thirty RefSeq genomes were selected using IslandPath-DIMOB [40] GI prediction data collated into the IslandViewer database (www.pathogenomics.sfu.ca/islandviewer) [41] (Table S1, available with the online version of this article). The selected genomes and associated plasmids (listed in Table S2 and deposited at https://osf.io/nrejs/ under ‘data/sequences’) were manually selected to satisfy the following criteria: 10 genomes with 1–10 plasmids, 10 genomes with >10 % of chromosomal DNA predicted to reside in GIs, and 10 genomes with <1 % of chromosomal DNA predicted to reside in GIs.In accordance with the recommendation in the CAMI challenge [52], the genomes were randomly assigned a relative abundance following a log-normal distribution (μ=1, σ=2). Plasmid copy number estimates could not be accurately found for all organisms. Therefore, plasmids were randomly assigned a copy number regime, low (1–20), medium (20–100) or high (500–1000) at a 2 : 1 : 1 rate. Within each regime, the exact copy number was selected using an appropriately scaled gamma distribution (α=4, β=1) truncated to the regime range.Finally, the effective plasmid relative abundance was determined by multiplying the plasmid copy number with the genome relative abundance. The full set of randomly assigned relative abundances and copy numbers can be found in Table S3. Sequences were then concatenated into a single fasta file with the appropriate relative abundance. MiSeq v3 250 bp paired-end reads with a mean fragment length of 1000 bp (standard deviation of 50 bp) were then simulated using art_illumina (v2016.06.05) [54], resulting in a simulated metagenome of 31 174 411 read pairs. The selection of relative abundance and metagenome simulation itself was performed using the ‘data_simluation/simulate_metagenome.py’ script.
MAG recovery
Reads were trimmed using sickle (v1.33) [55] resulting in 25 682 644 surviving read pairs. The trimmed reads were then assembled using three different metagenomic assemblers: metaSPAdes (v3.13.0) [18], IDBA-UD (v1.1.3) [19] and megahit (v1.1.3) [20]). The resulting assemblies were summarized using metaQUAST (v5.0.2) [56]. The assemblies were then indexed and reads mapped back using bowtie2 (v2.3.4.3) [12].Samtools (v1.9) was used to sort the read mappings, and the read coverage was calculated using the MetaBAT2 accessory script (jgi_summarize_bam_contig_depths). The three metagenome assemblies were then separately binned using MetaBAT2 (v2.13) [24] and MaxBin2 (v2.2.6) [25]. MAGs were also recovered using CONCOCT (v0.4.2) [23], following the recommended protocol in the user manual. Briefly, the supplied CONCOCT accessory scripts were used to cut contigs into 10 kb fragments (cut_up_fasta.py) and read coverage calculated for the fragments (CONCOCT_coverage_table.py). These fragment coverages were then used to bin the 10 kb fragments before the clustered fragments were merged (merge_cutup_clustering.py) to create the final CONCOCT MAG bins (extra_fasta_bins.py). Finally, for each metagenome assembly, the predicted bins from these three binners (MaxBin2, MetaBAT2 and CONCOCT) were combined using the DAS Tool (v1.1.1) meta-binner [26]. This resulted in 12 separate sets of MAGs (one set for each assembler and binner pair).
MAG assessment
Synthetic read coverage and depth
The trimmed synthetic reads were mapped back to each reference replicon using bowtie2 (v2.4.1), and sorted and indexed using Samtools (v1.10). The coverage of each reference replicon is calculated using ‘samtools coverage’ and the per base sequencing depth calculated using ‘samtools depth’. The mean and per base depth are then plotted using R (v.3.4.2).
Chromosomal coverage
The MAG assessment for chromosomal coverage was performed by creating a BLASTN 2.9.0+ [57] database consisting of all the chromosomes of the input reference genomes. Each MAG contig was then used as a query against this database and the coverage of the underlying chromosomes tallied by merging the overlapping aligning regions and summing the total length of aligned MAG contigs. The most represented genome in each MAG was assigned as the ‘identity’ of that MAG for further analyses. Coverage values of less than 5 % were filtered out and the number of different genomes to which contigs from a given MAG aligned were tallied. Finally, the overall proportion of chromosomes that were not present in any MAG was tallied for each binner and assembler.In order to investigate the impact of the presence of closely related genomes in the metagenome on the ability to bin chromosomes, we generated a phylogenetic tree for all the input genomes. Single copy universal bacterial proteins were identified in the reference genomes using BUSCO v4.0.2 with the Bacteria Odb10 data [58]. The 86 of these proteins that were found in every reference genome were concatenated and aligned using MAFFT v7.427 [59] and masked with trimal v1.4.1–3 [60]. A maximum-likelihood phylogeny was then inferred with IQ-Tree v1.6.12 [61] using 1000 ultrafast-bootstraps and the in-built ModelFinder determined partitioning [62]. The phylogeny was then visualized using the Interactive Tree of Life (iTOL) v4 [63]. Pairwise branch distances were extracted from the resulting tree using ete3 v3.1.1 [64] and regressed using a linear model against coverage (via seaborn v0.10.0 [65]) and using a Poisson logistic regression model (via statsmodel v0.12.0 [66]) against contamination. R2 and McFadden’s pseudo-R2 were calculated for each model using the statsmodel library.
Plasmid and GI coverage
Plasmid and GI coverage were assessed in the same way. Firstly, a BLASTN database was generated for each set of MAG contigs. Then, each MAG database was searched for plasmid and GI sequences with greater than 50 % coverage. All plasmids or GIs that could be found in the unbinned contigs or MAGs were recorded as having been successfully assembled. The subset of these that were found in the binned MAGs was then separately tallied. Finally, we evaluated the proportion of plasmids or GIs that were correctly assigned to the bin that was maximally composed of chromosomes from the same source genome.
AMR and VF assessment
Detection of AMR/VF genes
For the reference genomes, as well as 12 sets of MAGs, prodigal [67] was used to predict ORFs using the default parameters. AMR genes were predicted using Resistance Gene Identifier (RGI v5.0.0; default parameters) and the Comprehensive Antibiotic Resistance Database (card v3.0.3) [68]. VFs were predicted using the predicted ORFs and BLASTX 2.9.0+ [57] against the Virulence Factor Database (vfdb; obtained on August 26 2019) with an E value cut-off of 0.001 and a minimum identity of 90 % [69]. Each MAG was then assigned to a reference chromosome using the above-mentioned mapping criteria for downstream analysis.
AMR/VF gene recovery
For each MAG set, we counted the total number of AMR/VF genes recovered in each metagenomic assembly and each MAG, and compared this to the number predicted in their assigned reference chromosome and plasmids. We then assessed the ability for MAGs to correctly bin AMR/VF genes of chromosomal, plasmid and GI origin by mapping the location of the reference replicon’s predicted genes to the location of the same genes in the MAGs.
Results
Recovery of genomic elements
Chromosomes
The overall ability of MAG methods to recover the original chromosomal source genomes varied widely. We considered the ‘identity’ of a given MAG bin to be that of the genome that comprises the largest proportion of sequence within that bin. In other words, if a bin is identifiably 70 % species A and 30 % species B, we consider that to be a bin of species A. Ideally, we wish to generate a single bin for each source genome consisting of the entire genome and no contigs from other genomes. Some genomes are cleanly and accurately binned regardless of the assembler and binning method used (Fig. 1). Specifically, greater than 90 % of (minimum 91.8 %) and (minimum 96.4 %) chromosomes are represented in individual bins across all methods. However, no other genomes were consistently recovered at >30 % chromosomal coverage across methods. The three genomes were particularly problematic with the best recovery for each ranging from 1.7–47.49 %. Contrary to what might be expected, the number of close relatives to a given genome in the metagenome did not clearly affect the MAG coverage (Fig. S2).
Fig. 1.
Top genome coverage for input genomes across MAG binners. Each dot represents the coverage of a specified genome when it comprised the plurality of the sequences in a bin. If a genome did not form the plurality of any bin for a specific binner-assembler pair, no dot was plotted for that genome and binner-assembler. The binning tool is indicated by the colour of the dot as per the legend. Genomes such as were accurately recovered across all binner-assembler combinations, whereas genomes such as were systematically poorly recovered.
Top genome coverage for input genomes across MAG binners. Each dot represents the coverage of a specified genome when it comprised the plurality of the sequences in a bin. If a genome did not form the plurality of any bin for a specific binner-assembler pair, no dot was plotted for that genome and binner-assembler. The binning tool is indicated by the colour of the dot as per the legend. Genomes such as were accurately recovered across all binner-assembler combinations, whereas genomes such as were systematically poorly recovered.In terms of the impact of different metagenome assemblers, megahit resulted in the highest median chromosomal coverage across all binners (81.9%), with metaSPAdes performing worst (76.8%) (Fig. 2a). Looking at binning tools, CONCOCT performed very poorly with a median 26 % coverage for top hit per bin, followed by MaxBin2 (83.1%) and MetaBAT2 (88.5%). It is perhaps unsurprising that the best-performing binner in terms of bin top hit coverage was the metabinner DAS
Tool that combines predictions from the other three binners (94.3 % median top hit chromosome coverage per bin; Fig. 2a).
Fig. 2.
Overall binning performance for every combination of metagenome assembler (as indicated by pane titles) and MAG binning tool (x-axis and legend colours). Diamonds in the plots represent outliers (greater or lower than the interquartile range marked by the error bars), and the boxes represent the lower quartile, median and upper quartile. (a) Chromosomal coverage of the most prevalent genome in each bin across binners and metagenome assemblies. Of the three assemblers, megahit resulted in the highest median chromosomal coverage (y-axis) across all binners (coloured bars) at 81.9 %, with metaSPAdes performing the worst (76.8%). Of the four binners, CONCOCT (red) performed poorly with a median coverage, followed by MaxBin2 (blue), MetaBAT2 (purple) and DAS Tool (green) performing the best. (b) Distribution of bin purity across assemblers and binners. The total number of genomes present in a bin at >5 % coverage (y-axis) was largely equivalent across assemblers (x-axis). For the binning tools, MaxBin2 (blue) produced nearly twice as many bins containing multiple species compared to CONCOCT (red), MetaBAT2 (purple) and DAS Tool (green), which all produced chimeric bins at roughly the same rate.
Overall binning performance for every combination of metagenome assembler (as indicated by pane titles) and MAG binning tool (x-axis and legend colours). Diamonds in the plots represent outliers (greater or lower than the interquartile range marked by the error bars), and the boxes represent the lower quartile, median and upper quartile. (a) Chromosomal coverage of the most prevalent genome in each bin across binners and metagenome assemblies. Of the three assemblers, megahit resulted in the highest median chromosomal coverage (y-axis) across all binners (coloured bars) at 81.9 %, with metaSPAdes performing the worst (76.8%). Of the four binners, CONCOCT (red) performed poorly with a median coverage, followed by MaxBin2 (blue), MetaBAT2 (purple) and DAS Tool (green) performing the best. (b) Distribution of bin purity across assemblers and binners. The total number of genomes present in a bin at >5 % coverage (y-axis) was largely equivalent across assemblers (x-axis). For the binning tools, MaxBin2 (blue) produced nearly twice as many bins containing multiple species compared to CONCOCT (red), MetaBAT2 (purple) and DAS Tool (green), which all produced chimeric bins at roughly the same rate.Bin purity, i.e. the number of genomes present in a bin at >5 % coverage, was largely equivalent across assemblers, with a very marginally higher purity for IDBA-UD. Across binning tools, MaxBin2 proved an exception with nearly twice as many bins containing multiple species as the next binner (Fig. 2b). The remaining binning tools were largely equivalent, producing chimeric bins at approximately the same rates. Similar to coverage, there was a weak to non-existent relationship between bin purity and the number of closely related genomes in the metagenome (Fig. S3). There was also not a clear relationship between coverage of a bin and purity, with low purity but high coverage bins observed, as well as high purity but low coverage bins.
Plasmids
Regardless of method, a very small proportion of plasmids were correctly grouped in the bin that was principally composed of chromosomal contigs from the same source genome. Specifically, between 1.5 % (IDBA-UD assembly with DAS Tool bins) and 29.2 % (metaSPAdes with CONCOCT bins) were correctly binned at over 50 % coverage. In terms of metagenome assembly, metaSPAdes was by far the most successful assembler at assembling plasmids, with 66.2 % of plasmids identifiable at greater than 50 % coverage. IDBA-UD performed worst with 17.1 % of plasmids recovered, and megahit recovered 36.9 %. If the plasmid was successfully assembled it was, with one exception, placed in a MAG bin by MaxBin2 and CONCOCT, although a much smaller number were correctly binned (typically less than one third). Interestingly, the MetaBAT2 and DAS Tool binners were more conservative in assigning plasmid contigs to bins; but of those assigned to bins, nearly all were correctly binned (Fig. 3).
Fig. 3.
The performance of metagenomic assembly and binning in recovery of plasmid sequences. Each plot represents a different metagenome assembler, with the groups of bars along the x-axes showing the plasmid recovery performance of each binning tool when applied to the assemblies produced by that tool. For each of these 12 assembler-binner-pair-produced MAGs, the grouped bars from left to right show the percentage of plasmids assembled, assigned to any bin and binned with the correct chromosomes. These stages of the evaluation are indicated by the bar colours as per the legend. Across all tools the assembly process resulted in the largest loss of plasmid sequences and only a small proportion of the assembled plasmids were correctly binned.
The performance of metagenomic assembly and binning in recovery of plasmid sequences. Each plot represents a different metagenome assembler, with the groups of bars along the x-axes showing the plasmid recovery performance of each binning tool when applied to the assemblies produced by that tool. For each of these 12 assembler-binner-pair-produced MAGs, the grouped bars from left to right show the percentage of plasmids assembled, assigned to any bin and binned with the correct chromosomes. These stages of the evaluation are indicated by the bar colours as per the legend. Across all tools the assembly process resulted in the largest loss of plasmid sequences and only a small proportion of the assembled plasmids were correctly binned.
GIs
GIs were poorly assembled and binned across methods (Fig. 4). Unlike for plasmids, the performance of different methods was generally less variable, with no clear best-performing method. Assembly of GIs with >50 % coverage was consistently poor (37.8–44.1 %), with metaSPAdes outperforming the other two assembly approaches. For the CONCOCT and MaxBin2 binning tools, all GIs that were assembled were assigned to a bin, although the proportion of binned GIs that were correctly binned was lower than for DAS Tool and MetaBAT2. DAS Tool, MetaBAT2 and CONCOCT did not display the same precipitous drop between those assembled and those correctly binned as was observed for plasmids. In terms of overall correct binning with the chromosomes from the same genome, the metaSPAdes assembly with CONCOCT (44.1 %) and MaxBin2 (43.3 %) binners performed best.
Fig. 4.
Impact of metagenomic assembly and MAG binning on recovery of GIs. GIs were recovered in a similarly poor fashion to plasmids. Regardless of binning (x-axis) and assembly (panel) methods, <40% of GIs were correctly assigned to the correct source genome. MaxBin2 and CONCOCT placed GIs in a bin the majority of the time (orange); however, a very small fraction was correctly binned (green). Generally, GIs were correctly binned better than plasmids with DAS Tool, MetaBAT2 and CONCOCT.
Impact of metagenomic assembly and MAG binning on recovery of GIs. GIs were recovered in a similarly poor fashion to plasmids. Regardless of binning (x-axis) and assembly (panel) methods, <40% of GIs were correctly assigned to the correct source genome. MaxBin2 and CONCOCT placed GIs in a bin the majority of the time (orange); however, a very small fraction was correctly binned (green). Generally, GIs were correctly binned better than plasmids with DAS Tool, MetaBAT2 and CONCOCT.
AMR genes
The recovery of AMR genes in MAGs was poor with only ~49–55 % of all AMR genes predicted in our reference genomes regardless of the assembly tool used, and metaSPAdes performing marginally better than other assemblers (Fig. 5a). Binning the contigs resulted in a ~1–15 % loss in AMR gene recovery with the concoct-
metaSPAdes pair performing best at only 1 % loss and DAS Tool-
megahit performing the worst at 15 % reduction of AMR genes recovered. Overall, only 24–40 % of all AMR genes were correctly binned. This was lowest with the MaxBin2-
IDBA-UD pair (24 %) and highest in the CONCOCT-
metaSPAdes pipeline (40 %).
Fig. 5.
Recovery of AMR genes across assemblers, binners and genomic context. (a) The proportion of reference AMR genes recovered (y-axis) was largely similar across assembly tools (panels as indicated by title) at roughly 50 %, with metaSPAdes performing marginally better overall. Binning tools (x-axis) resulted in a small reduction in AMR genes recovered (orange); however, only 24–40 % of all AMR genes were correctly binned (green). metaSPAdes-
CONCOCT was the best performing MAG binning pipeline. (b) Per cent of correctly binned AMR genes recovered by genomic context. MAG methods were best at recovering chromosomally located AMR genes (light blue) regardless of metagenomic assembler or binning tool used. Recovery of AMR genes in GIs showed a bigger variation between tools (light green). None of the 12 evaluated MAG recovery methods were able to recover plasmid-located AMR genes.
Recovery of AMR genes across assemblers, binners and genomic context. (a) The proportion of reference AMR genes recovered (y-axis) was largely similar across assembly tools (panels as indicated by title) at roughly 50 %, with metaSPAdes performing marginally better overall. Binning tools (x-axis) resulted in a small reduction in AMR genes recovered (orange); however, only 24–40 % of all AMR genes were correctly binned (green). metaSPAdes-
CONCOCT was the best performing MAG binning pipeline. (b) Per cent of correctly binned AMR genes recovered by genomic context. MAG methods were best at recovering chromosomally located AMR genes (light blue) regardless of metagenomic assembler or binning tool used. Recovery of AMR genes in GIs showed a bigger variation between tools (light green). None of the 12 evaluated MAG recovery methods were able to recover plasmid-located AMR genes.Moreover, focusing on only the AMR genes that were correctly binned (Fig. 5b), we can evaluate the impact of different genomic contexts (i.e. chromosomal, plasmid, GI). Across all methods only 30–53 % of all chromosomally located AMR genes (n=120), 0–45 % of GI located AMR genes (n=11) and none of the plasmid-localized AMR genes (n=20) were correctly binned.
VF genes
We also examined the impact of MAG approaches on recovery of VF genes as identified using the Virulence Factor Database (vfdb). We saw a similar trend as AMR genes (Fig. 6a). Between 56 and 64 % of VFs were identifiable in the metagenomic assemblies (with megahit recovering the greatest proportion). The binning process further reduced the number of recovered VFs by 4–26 %, with DAS Tool-
megahit performing the worst (26 % reduction) and CONCOCT-
metaSPAdes performing the best (4 % reduction). Unlike AMR genes, the majority of VF genes assigned to a bin were assigned to the correct bin (i.e. that bin largely made up of contigs from the same input genome). Overall, CONCOCT-
metaSPAdes again performed best with 43 % of all VFs correctly assigned.
Fig. 6.
Recovery of VF genes across assemblers, binners and genomic context. (a) Per cent of reference VF genes recovered across assemblers and binners. The proportion of reference VF genes recovered (y-axis) exhibited a similar trend as AMR genes. Recovery was greatest after the assembling stage (blue), with megahit performing best. Binning tools resulted in a larger reduction in VF genes recovered (orange) compared to AMR genes. However, in the majority of cases, VF genes that were binned were correctly binned (green). metaSPAdes-
CONCOCT was again the best performing pair. (b) Per cent of correctly binned VF genes recovered in each genomic region. MAGs were again best at recovering chromosomally located VF genes (light blue), and able to correctly bin the majority of chromosomally located VFs. Again, there was very poor performance in terms of the recovery of GIs (light green), and none of the plasmid-located AMR genes (orange) were correctly binned.
Recovery of VF genes across assemblers, binners and genomic context. (a) Per cent of reference VF genes recovered across assemblers and binners. The proportion of reference VF genes recovered (y-axis) exhibited a similar trend as AMR genes. Recovery was greatest after the assembling stage (blue), with megahit performing best. Binning tools resulted in a larger reduction in VF genes recovered (orange) compared to AMR genes. However, in the majority of cases, VF genes that were binned were correctly binned (green). metaSPAdes-
CONCOCT was again the best performing pair. (b) Per cent of correctly binned VF genes recovered in each genomic region. MAGs were again best at recovering chromosomally located VF genes (light blue), and able to correctly bin the majority of chromosomally located VFs. Again, there was very poor performance in terms of the recovery of GIs (light green), and none of the plasmid-located AMR genes (orange) were correctly binned.As with AMR genes, the genomic context (chromosome, plasmid, GI) of a given VF largely determined how well it was binned (Fig. 6b). The majority (73–98 %) of all chromosomally located VF genes (n=757) were correctly binned. However, 0–16 % of GI-localized VF genes (n=809) and again none of the plasmid-associated VF genes (n=3) were recovered across all 12 MAG pipelines.
Comparisons of rates of loss
We combined the performance metrics for Figs 3–6 to compare the rates of loss of different components (Fig. S5). This highlighted that genomic components (GIs and plasmids) and plasmids in particular are lost at a disproportionately higher rate than individual gene types during MAG recovery. This also emphasizes that better metagenomic assembly does not necessarily result in better binning/recovery of GIs and plasmids.
Simulated read analysis
To further explore the potential causes of poor assembly and binning of MGEs, we analysed the resultant coverage distribution from mapping our synthetically generated reads back to the original chromosomes, GIs and plasmids from which they were simulated. This analysis identified that while coverage of our synthetic metagenome was consistently between 91–100 % across all reference chromosomes, the coverage of GIs and plasmids encompassed a larger range (Fig. 7). Inspecting individual genomes shows large spikes and drops in coverage and per base read depth in and around these elements (Figs S6 and S7). This variability in coverage might be attributed to repeated elements and the sequence composition differences that are commonly associated with MGEs. This issue is likely responsible for failures to accurately estimate the read-depth/coverage in these regions, upon which both assembly (in traversal of the assembly graph) and binning rely.
Fig. 7.
Mean coverage by genomic region. The mean coverage of our synthetic reads to their source genome is plotted by their genomic region. Chromosome (blue) and GI (green) displayed a similar mean coverage of ~96.5 %. Plasmids (orange) had a higher mean coverage at ~98 %. The per genome coverage range of plasmids and GIs are higher than that of chromosomes. Diamond dots indicate the mean coverage of a region and black dots indicate outliers.
Mean coverage by genomic region. The mean coverage of our synthetic reads to their source genome is plotted by their genomic region. Chromosome (blue) and GI (green) displayed a similar mean coverage of ~96.5 %. Plasmids (orange) had a higher mean coverage at ~98 %. The per genome coverage range of plasmids and GIs are higher than that of chromosomes. Diamond dots indicate the mean coverage of a region and black dots indicate outliers.
Discussion
In this paper, we evaluated the ability of MAG binning methods to correctly recover MGEs (i.e. GIs and plasmids) from metagenomic samples. Overall, chromosomal sequences were binned well (up to 94.3 % coverage, with perfect bin purity using megahit
DAS Tool). The presence of closely related genomes had unclear impacts on the coverage and cross-contamination of bins (e.g. species in Figs S2 and S3). Additionally, the trade-off between false positives and sensitivity in the binning of closely related sequences is an area in need of further exploration.Given the importance of MGEs in the function and spread of virulence traits and AMR, it is particularly noteworthy that regardless of MAG binning method, plasmids and GIs were disproportionately lost compared to core chromosomal regions. At best (with metaSPAdes and CONCOCT), 29.2 % of plasmids and 44.1 % of GIs were identifiable at >50 % coverage in the correct bin (i.e. grouped with a bin that was mostly made up of contigs from the same genome). While some MGEs were likely recovered in more partial forms (<50 % coverage), use of these by researchers interested in selective pressures and LGT could lead to inaccurate inferences. This poor result is congruent with the intuition that the divergent compositional features and repetitive nature of these MGEs is problematic for MAG methods (a conclusion supported by the observed high coverage and read-depth variability of MGEs when mapping simulated reads back to the original genomes). The particularly poor plasmid binning performance is likely attributable to the known difficulties in assembly of plasmids from short-read data [53]. Therefore, binning efficiency might improve with use of long-read sequencing or assembly methods optimized for the assembly and binning of plasmid sequences [53] (such as scapp [70]). Incorporating long-read data has been shown to improve overall MAG binning [71] and facilitate metagenomic characterization of plasmids [72]. However, the low throughput and high error rate of current long-read technologies relative to widely used short-read approaches present a challenge when characterizing MGEs in metagenomes, especially those of greater complexity. Further research is needed to fully characterize the performance of different long-read protocols and analytical approaches (including hybrid approaches with short-reads) on the accuracy of recovering MGEs in metagenomic samples.With the growing use of MAG methods in infectious disease research [73-77] and the public-health and agri-food importance of the LGT of AMR and VF genes, we also specifically evaluated the binning of these gene classes. The majority of these genes were correctly assembled across assemblers, but were either not assigned or incorrectly assigned to MAG bins during binning. At best across all binners, 40 % of all AMR genes and ~63 % of VF genes (CONCOCT-
metaSPAdes) present in the reference genomes were assigned to the correct MAG. While a majority of chromosomally located VF genes (73–98 %) and AMR genes (53%) were binned correctly, only 16 % of GI VFs (n=809), 45 % of GI AMR genes (n=11) and not a single plasmid-associated VF (n=3) or AMR gene (n=20) were correctly binned. This included critical high-threat MGE-associated AMR genes such as oxacillinases (OXA) and carbapenemases (KPC). One potential caveat of this is that some AMR genes and VFs may no longer be detectable in MAGs due to issues with ORF prediction (see Supplementary Information Supplementary and Fig. S4). We also observed a higher variability in per base read depth and range of coverage in MGEs (Figs 7, S6 and S7). This, combined with previous studies observing fragmented ORF predictions in draft genomes, can lead to downstream over- or under-annotation with functional labels depending on the approach used [78]. Although not yet developed, methods that combine the assembly/binning pipelines tested here with read-based inference would provide a better sense of which functions are potentially being missed by the MAG reconstructions.Our simulated metagenomic community comprised 30 distinct bacterial genomes with varying degrees of relatedness. While this diversity can be representative of certain clinical samples [79-81], other environments with relevance to public health, such as the human gut, soil and livestock, can have 100–1000s of species [82-85]. In addition, MGEs such as GIs and plasmids are known to recombine, producing closely related variants [86-88] that could further complicate assembly from a metagenomic sample. Polymorphic MGEs were not explicitly introduced in our simulated metagenome. Consequently, our analysis likely over-represents the effectiveness of the methods tested in a public-health setting. Metagenomic simulation is also unlikely to perfectly represent the noise and biases in real metagenomic sequencing, but it does provide the ground-truth necessary for evaluation [32, 89]. This simulation approach, combined with the development of an MGE/AMR-focused mock metagenome (similarly to the mockrobiota initiative [90]), could provide a key resource to develop and validate new binning approaches and different sequencing strategies. Additionally, it would provide a way to further optimize parameter settings of existing metagenomic assembly and binning tools beyond the default settings used in these analyses (considered representative of most ‘real-world’ usage [91]) without overfitting to a particular metagenome.This study has shown that while short-read MAG-binning approaches provide a useful tool to study a bacterial species’ core chromosomal elements, they have severe limitations in the recovery of MGEs. The majority of these MGEs will either fail to be assembled or be incorrectly binned. The consequence of this is the disproportionate loss of key public-health MGE-associated VF and AMR genes that may be crucial markers for monitoring the spread of virulence and resistance among clinically important pathogens. As many of these clinically relevant genes have a high propensity for LGT between unrelated bacteria [36, 37], it is critical to highlight that short-read MAG approaches are currently insufficient to thoroughly profile them. Within public-health metagenomic research, as well as other research areas that study MGEs, it is vital we utilize MAGs in conjunction with other methods (e.g. targeted AMR assembly [92], long-read sequencing, plasmid optimized assembly [70] and read-based sequence homology search [11]) before drawing biological or epidemiological conclusions.Click here for additional data file.Click here for additional data file.
Authors: Mya Breitbart; Peter Salamon; Bjarne Andresen; Joseph M Mahaffy; Anca M Segall; David Mead; Farooq Azam; Forest Rohwer Journal: Proc Natl Acad Sci U S A Date: 2002-10-16 Impact factor: 11.205
Authors: Vanessa M D'Costa; Christine E King; Lindsay Kalan; Mariya Morar; Wilson W L Sung; Carsten Schwarz; Duane Froese; Grant Zazula; Fabrice Calmels; Regis Debruyne; G Brian Golding; Hendrik N Poinar; Gerard D Wright Journal: Nature Date: 2011-08-31 Impact factor: 49.962
Authors: Christian J H von Wintersdorff; John Penders; Julius M van Niekerk; Nathan D Mills; Snehali Majumder; Lieke B van Alphen; Paul H M Savelkoul; Petra F G Wolffs Journal: Front Microbiol Date: 2016-02-19 Impact factor: 5.640
Authors: Bhavjinder K Dhillon; Matthew R Laird; Julie A Shay; Geoffrey L Winsor; Raymond Lo; Fazmin Nizam; Sheldon K Pereira; Nicholas Waglechner; Andrew G McArthur; Morgan G I Langille; Fiona S L Brinkman Journal: Nucleic Acids Res Date: 2015-04-27 Impact factor: 16.971
Authors: Elizabeth G Wilbanks; Hugo Doré; Meredith H Ashby; Cheryl Heiner; Richard J Roberts; Jonathan A Eisen Journal: ISME J Date: 2022-04-22 Impact factor: 11.217
Authors: Ashley M Rooney; Amogelang R Raphenya; Roberto G Melano; Christine Seah; Noelle R Yee; Derek R MacFadden; Andrew G McArthur; Pierre H H Schneeberger; Bryan Coburn Journal: mSystems Date: 2022-06-01 Impact factor: 7.324
Authors: Wing Yin Venus Lau; Gemma R Hoad; Vivian Jin; Geoffrey L Winsor; Ashmeet Madyan; Kristen L Gray; Matthew R Laird; Raymond Lo; Fiona S L Brinkman Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971
Authors: Nicholas D Youngblut; Jacobo de la Cuesta-Zuluaga; Georg H Reischer; Silke Dauser; Nathalie Schuster; Chris Walzer; Gabrielle Stalder; Andreas H Farnleitner; Ruth E Ley Journal: mSystems Date: 2020-11-03 Impact factor: 6.496