Naseer Sangwan, Fangfang Xia, Jack A Gilbert.
Abstract
Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost of applying them to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improve the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins, i.e., bins containing sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on genome-wide evolution.
Year: 2016 PMID: 26951112 PMCID: PMC4782286 DOI: 10.1186/s40168-016-0154-5
Source DB: PubMed Journal: Microbiome ISSN: 2049-2618 Impact factor: 14.650
Fig. 1 Workflow and overview for recovering population genomes from shotgun metagenomics data
Key methodological features of three main metagenome binning approaches
| Method | Starting point | Clustering methods | Negatives | Positives | Computational Resources |
|---|---|---|---|---|---|
| Nucleotide composition (NC) | Oligonucleotide frequency matrix and %G+C-based screening. | Hierarchical clustering (HCL), correlation-based network graph, and emergent self-organizing maps (ESOM). | (i) More efficient for genomes with skewed nucleotide composition patterns. (ii) Less efficient in differentiating between closely related genotypes. (iii) Depends on visualization and manual inspection of bins and is therefore not suitable for very large assemblies representing complex environments. | (i) Individual metagenome assemblies or samples where populations do not change over time can be used. | (i) R packages: qgraph (8), igraph, pvclust; (ii) tetramerFreqs; (iii) Databionic ESOM tools; (iv) 2T-binning |
| Nucleotide composition and abundance (NCA) | A composite distance matrix from oligonucleotide frequency matrix and coverage. | K-medoids clustering, Gaussian mixture models, and expectation-maximization algorithm. | (ii), (iv) Require multiple samples for better performance and are therefore associated with cost, time, and computational resources. | (i), (ii) Better contig binning than the NC method. | (i) MetaBAT; (ii) CONCOCT; (iii) MaxBin; (iv) GroopM; (v) Databionic ESOM tools |
| Differential abundance (DA) | Differential coverage patterns across multiple samples in which populations changed in abundance over time. | Profile-based correlation cut-off. | (iv) Requires multiple samples in which populations changed in abundance over time and is therefore associated with cost, computational time, and resources. | (ii), (iii) Strain-level resolution can be achieved. | (i) Multi-metagenome; (ii) MGS Canopy algorithm; (iii) Databionic ESOM tools |
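
The NC starting point listed above (an oligonucleotide frequency matrix plus %G+C screening) can be illustrated with a minimal Python sketch. It is not the implementation of any tool in the table: the function names `tetranucleotide_frequencies` and `gc_content` are hypothetical, only forward-strand 4-mers are counted (no reverse-complement merging), and windows containing ambiguity codes such as N are simply skipped.

```python
from itertools import product

# All 256 possible 4-mers in a fixed order, so every contig yields a
# vector with the same layout (one row of the frequency matrix).
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(sequence):
    """Return the relative frequency of each 4-mer as a list of 256 floats.

    Illustrative assumption: forward strand only, and windows that
    contain non-ACGT characters are ignored.
    """
    counts = dict.fromkeys(TETRAMERS, 0)
    seq = sequence.upper()
    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[k] / total for k in TETRAMERS]


def gc_content(sequence):
    """Return the fraction of G and C among unambiguous bases."""
    seq = sequence.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / unambiguous


if __name__ == "__main__":
    contig = "ACGTACGTGGCCGGCCACGT"
    vector = tetranucleotide_frequencies(contig)
    print(len(vector), round(gc_content(contig), 2))
```

Stacking one such vector per contig gives the matrix on which a clustering method (hierarchical clustering, ESOM, and so on) would then operate.
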
Key methodological features of NCA-based metagenome binning tools
| Binning software | Sequence composition model | Differential abundance model | Clustering algorithm | Stopping criteria | Post-processing and other notable heuristics |
|---|---|---|---|---|---|
| ABAWACA | Combined mono-, di-, and trinucleotide frequencies | | Hierarchical clustering with iterative splitting; long scaffolds are broken into 5-kb fragments at the beginning; splitting based on a single metric that results in the best separation in each round | No separation can be made given a quality score based on the extent to which the broken scaffolds are grouped correctly | Genome assessment based on marker genes and consensus taxonomic placement with reciprocal best BLAST hits; manual inspection using ggKBase; scaffold extension |
| Canopy^a | Inter-assembly tetranucleotide frequency | Abundance distance defined in terms of Pearson correlation and Spearman’s rank correlation coefficients | Canopy clustering (seed-and-recruit) | Stabilization of canopy profiles | Sample-specific augmented assemblies on two samples with most mapped reads and one with most gene-containing de novo contigs |
| CONCOCT | K-mer frequencies (tetranucleotide by default); uniform Dirichlet distribution prior on the relative frequencies; dimension reduction using principal component analysis to keep 90 % of joint composition and coverage variance | Combined log-transformed profile of normalized coverage and composition vectors | Gaussian mixture models; regularized expectation-maximization; cluster number determined by automatic relevance determination | Parameter convergence and maximum iteration number | Empirical variational Bayesian approach; variational approximation used to perform integral in optimizing mixing coefficients |
| GroopM | Tetranucleotide frequencies; dimension reduction using principal component analysis to keep 80 % of compositional variance | Transformed coverage space to reduce unevenness of variability distribution | Iterative clustering in two custom steps: two-way clustering and Hough partitioning; bin refinement using self-organizing map | 1:1 correspondence between bins and subregions on the SOM surface | GC variance model for chimera detection |
| MaxBin | Tetranucleotide frequencies; Euclidean distance; empirically estimated Gaussian distributions of intra- and inter-genome distances | Poisson distribution | Expectation-maximization; cluster number estimated from single-copy genes; initial parameters inferred from the shortest marker gene | Parameter convergence and maximum iteration number | Recursive checking of all bins for median number of marker genes |
| MetaBAT | Tetranucleotide frequencies; Euclidean distance; empirical posterior probability derived from different contig sizes using logistic regression | Abundance distance defined as the non-shared area of two normal distributions | Modified K-medoid clustering without the need to set the number of clusters | Medoid convergence | Progressive weighting of the relative importance of DA vs TNF based on the number of samples; optional assembly, based on CheckM assessment, of mapped reads from a single most represented sample to reduce contamination |
^a We have also included the DA method Canopy because it uses sequence composition in post-binning refinement
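
To make the idea of a correlation-based abundance distance and seed-and-recruit clustering more concrete, the following Python sketch computes a Pearson-correlation distance between contig coverage profiles and recruits contigs around a seed. It is an illustration only, not the Canopy implementation: the function names, the `max_distance` cut-off of 0.1, and the example coverage values are assumptions introduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length coverage profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A flat profile carries no covariation signal; treat as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def coverage_distance(profile_a, profile_b):
    """Distance in [0, 2]: 0 for perfectly covarying profiles."""
    return 1.0 - pearson(profile_a, profile_b)


def recruit(seed_name, profiles, max_distance=0.1):
    """Return names of contigs whose coverage covaries with the seed.

    `profiles` maps contig name -> list of coverages, one per sample.
    `max_distance` is a hypothetical cut-off chosen for illustration.
    """
    seed = profiles[seed_name]
    return [
        name
        for name, profile in profiles.items()
        if coverage_distance(seed, profile) <= max_distance
    ]


if __name__ == "__main__":
    # Made-up coverages across three samples.
    profiles = {
        "contig_1": [10.0, 20.0, 30.0],
        "contig_2": [1.0, 2.0, 3.1],
        "contig_3": [30.0, 20.0, 10.0],
    }
    print(recruit("contig_1", profiles))
```

Because the distance depends only on how coverage covaries across samples, contigs at very different absolute abundances can still be grouped, which is why this family of methods needs multiple samples in which populations change in abundance.
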