Jacob T Evans, Vincent J Denef.
Abstract
Metagenome-assembled genomes (MAGs) expand our understanding of microbial diversity, evolution, and ecology. Concerns have been raised about how sequencing, assembly, binning, and quality assessment tools may result in MAGs that do not reflect single populations in nature. Here, we reflect on another issue, i.e., how to handle highly similar MAGs assembled from independent data sets. Obtaining multiple genomic representatives for a species is highly valuable, as it allows for population genomic analyses; however, retaining genomes of closely related populations complicates MAG quality assessment and abundance inferences. We show that (i) published data sets contain a large fraction of MAGs sharing >99% average nucleotide identity, (ii) different software packages and parameters used to resolve this redundancy remove very different numbers of MAGs, and (iii) the removal of closely related genomes leads to losses of population-specific auxiliary genes. Finally, we highlight some approaches that can infer strain-specific dynamics across a sample series without dereplication.
Keywords: MAG; binning; dereplication; metagenomics; population genomics; software
Year: 2020 PMID: 32434845 PMCID: PMC7380574 DOI: 10.1128/mSphere.00971-19
Source DB: PubMed Journal: mSphere ISSN: 2379-5042 Impact factor: 4.389
FIG 1 Overview of dereplication approaches used in this study. All approaches first cluster similar genomes (Mash clusters are delineated with boxes) using a fast, less accurate approach (Mash), which is included in the dRep package but is a separate preprocessing step we carried out for the pyani analysis (indicated with the dotted line). Each cluster of MAGs then is separately dereplicated using pairwise alignments by identifying MAGs within each Mash cluster that share ANI above the specified threshold. These clusters are indicated by boxes, with Mash clusters split into multiple cluster groups using the same line type (full or dashed lines). Which genomes end up in the same cluster varies depending on the approach used; only one clustering is shown. Finally, a representative MAG is selected, either as part of the package (dRep) or using a custom script (our approach that used pyani for pairwise comparisons, indicated by the dotted line), selecting the MAG with the highest estimated completion and lowest estimated contamination.
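The second and third steps of the workflow in FIG 1 (grouping MAGs that share ANI above a threshold, then picking one representative per group) can be illustrated with a minimal Python sketch. This is not the dRep or pyani implementation: the function name `dereplicate`, the toy ANI values, and the MAG names are hypothetical, groups are formed by simple single-linkage (an assumption; the tools may use other clustering rules), and the representative is assumed to be chosen by highest completeness first, with lowest contamination as the tie-breaker.

```python
from itertools import combinations


def dereplicate(ani, quality, threshold=0.99):
    """Keep one representative per group of MAGs linked by ANI >= threshold.

    ani:     dict mapping frozenset({a, b}) -> ANI as a fraction (toy input)
    quality: dict mapping MAG name -> (completeness %, contamination %)
    Grouping is single-linkage via union-find (an assumption for this sketch).
    """
    parent = {name: name for name in quality}

    def find(x):
        # Follow parent pointers to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair whose ANI meets the threshold.
    for a, b in combinations(sorted(quality), 2):
        if ani.get(frozenset((a, b)), 0.0) >= threshold:
            parent[find(a)] = find(b)

    # Collect group members under their root.
    clusters = {}
    for name in quality:
        clusters.setdefault(find(name), []).append(name)

    # Representative: highest completeness, then lowest contamination.
    reps = [
        max(members, key=lambda m: (quality[m][0], -quality[m][1]))
        for members in clusters.values()
    ]
    return sorted(reps)


# Hypothetical toy data: four MAGs with made-up ANI and quality estimates.
quality = {"A": (98.0, 1.0), "B": (95.0, 0.5), "C": (92.0, 2.0), "D": (99.0, 3.0)}
ani = {
    frozenset(("A", "B")): 0.995,
    frozenset(("B", "C")): 0.992,
    frozenset(("A", "C")): 0.985,
    frozenset(("A", "D")): 0.970,
    frozenset(("B", "D")): 0.960,
    frozenset(("C", "D")): 0.950,
}

print(dereplicate(ani, quality, threshold=0.99))
print(dereplicate(ani, quality, threshold=0.965))
```

With these toy values, lowering the cutoff from 99% to 96.5% merges all four MAGs into a single group, which mirrors how the choice of threshold changes the number of MAGs retained.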
FIG 2 Effects of dereplication. Phylogenetic tree of a set of closely related MAGs (family Muribaculaceae) from Parks et al. (3), grouped based on sequence similarity by Mash. A box outline indicates the genome was preserved after dereplication, while white space indicates it was removed. The dRep default does not remove multiple nearly identical MAGs, while dRep-gANI removes MAGs that are more distantly related than the 99% or 96.5% ANI cutoff. Black bars show the average sequence read coverage across all contigs of each MAG, ranging from 0 to 2,000, when aligning a metagenomic data set (Sequence Read Archive accession no. SRR1702559) using all genomes in the tree (none) or dereplicated genome sets using different tools. Reads were mapped to each Multi-FASTA file of retained MAGs using BWA-MEM with default parameters (26). Average coverage per contig was computed with pileup.sh from bbtools (https://sourceforge.net/projects/bbmap/). The phylogenetic tree was created by searching for marker genes with PhyloSift (27) using its default set of marker genes. All MAGs had estimated completeness levels of >90% (3). The genes then were aligned with PhyloSift and the resulting alignments concatenated, and the tree was created with FastTree (28) using the -nt and -gtr parameters.
Summary of comparison of dereplication tools
| Parameter | No. of gene clusters/MAGs retained by dereplication tool (% identity cutoff) | | | | |
|---|---|---|---|---|---|
| | None | Pyani (99%) | dRep-default (99%) | dRep-gANI (99%) | dRep-gANI (96.5%) |
| Effect of dereplication on retained pangenome gene clusters | |||||
| | 9,175 | 8,962 | 9,175 | 8,728 | 6,947 |
| Effect of dereplication on number of retained MAGs | |||||
| Parks et al. ( | 7,800 | 5,236 | 6,288 | 4,047 | 3,357 |
| Almeida et al. ( | 1,951 | 1,865 | 1,607 | 1,605 | 1,590 |
| Pasolli et al. ( | 50 | 8 | 40 | 1 | 1 |
| Pasolli et al. ( | 49 | 36 | 41 | 26 | 1 |
Retained gene clusters of the Microcystis aeruginosa pangenome of 46 MAGs (9). A gene cluster consists of all genes across all genomes that had a minimum bit score of at least 0.5 when using the pangenome analysis workflow in Anvi’o (18). Retention of at least one representative in each gene cluster was evaluated when using different dereplication tools and ANI settings.
The number of MAGs remaining after dereplication is tool dependent, as shown using data from three published data sets. SGB refers to numbered species-level clusters generated in analyses by Pasolli et al. (5).
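The gene cluster retention check described in the first table note (a gene cluster counts as retained if at least one MAG carrying it survives dereplication) can be sketched as follows. The function name `retained_gene_clusters`, the cluster identifiers, and the MAG names are hypothetical, and the input mapping is assumed to have been derived from a pangenome analysis; this is an illustration, not the authors' script.

```python
def retained_gene_clusters(cluster_to_mags, retained_mags):
    """Count gene clusters with at least one member in a retained MAG.

    cluster_to_mags: dict mapping gene cluster ID -> iterable of MAG names
                     that contain a gene from that cluster (toy input)
    retained_mags:   iterable of MAG names kept after dereplication
    """
    retained = set(retained_mags)
    return sum(1 for mags in cluster_to_mags.values() if retained & set(mags))


# Hypothetical toy pangenome: GC3 occurs only in MAG "M3".
toy_clusters = {
    "GC1": {"M1", "M2", "M3"},  # core cluster, present in every MAG
    "GC2": {"M1", "M2"},
    "GC3": {"M3"},              # population-specific auxiliary cluster
}

print(retained_gene_clusters(toy_clusters, ["M1", "M2", "M3"]))
print(retained_gene_clusters(toy_clusters, ["M1"]))
```

In the toy example, dropping "M3" loses the cluster found only in that MAG, which is the kind of population-specific loss the first row of the table quantifies.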
Impact of genome completeness and percent alignment threshold on dereplication using pyani
| % alignment | Genome completeness (%) | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
| 10 | 37 | 35 | 36 | 33 | 33 | 31 | 29 | 27 | 23 | 21 |
| 25 | 45 | 44 | 37 | 33 | 33 | 31 | 29 | 27 | 23 | 21 |
| 50 | 45 | 45 | 45 | 45 | 40 | 31 | 29 | 27 | 23 | 21 |
| 75 | 45 | 45 | 45 | 45 | 45 | 45 | 44 | 38 | 24 | 21 |
Microcystis aeruginosa MAGs were artificially reduced in completeness by random subsampling of contigs. The full MAGs ranged in estimated completeness between 95 and 100%. Shown are the numbers of the original 46 MAGs retained by our combined Mash-pyani approach using different percent alignment thresholds.
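One way to reduce a MAG's completeness by random subsampling of contigs is sketched below. The exact procedure used for the table is not given here, so this is only an assumed variant: contigs are shuffled with a fixed seed and kept until their cumulative length reaches a target fraction of the total assembly length, which serves as a rough proxy for completeness. The function name `subsample_contigs` and the contig names and lengths are hypothetical.

```python
import random


def subsample_contigs(contig_lengths, fraction, seed=0):
    """Randomly keep contigs until `fraction` of the total length is reached.

    contig_lengths: dict mapping contig name -> length in bp (toy input)
    fraction:       target share of total length to keep, between 0 and 1
    Length share is used as a stand-in for completeness (an assumption).
    """
    rng = random.Random(seed)
    names = sorted(contig_lengths)  # sort first so the seed gives a stable order
    rng.shuffle(names)

    target = fraction * sum(contig_lengths.values())
    kept, total = [], 0
    for name in names:
        if total >= target:
            break
        kept.append(name)
        total += contig_lengths[name]
    return kept


# Hypothetical toy MAG with ten contigs of 10 kbp to 100 kbp.
toy_mag = {f"contig_{i}": 10_000 * i for i in range(1, 11)}

half = subsample_contigs(toy_mag, 0.5, seed=42)
print(len(half), sum(toy_mag[c] for c in half))
```

Because whole contigs are kept, the retained length can overshoot the target by up to one contig; a real analysis would re-estimate completeness on each subsampled MAG rather than rely on the length fraction.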