Computational Framework for High-Quality Production and Large-Scale Evolutionary Analysis of Metagenome Assembled Genomes.

Boštjan Murovec, Leon Deutsch, Blaz Stres.

Abstract

Microbial species play important roles in different environments, and producing high-quality genomes from metagenome data sets remains a major obstacle to understanding their ecological and evolutionary dynamics. Metagenome-Assembled Genomes Orchestra (MAGO) is a computational framework that integrates and simplifies metagenome assembly, binning, bin improvement, bin quality assessment (completeness and contamination), bin annotation, and evolutionary placement of bins via detailed maximum-likelihood phylogeny based on multiple marker genes using different amino acid substitution models, alongside average nucleotide identity analysis of genomes for delineation of species boundaries and operational taxonomic units. MAGO offers streamlined execution of the entire metagenomics pipeline, error checking, computational resource distribution, and compatibility of data formats, governed by user-tailored pipeline processing. MAGO is an open-source software package released in three forms: as a Singularity image and a Docker container, for HPC use as well as for running MAGO on commodity hardware, and as a virtual machine that gives full access to MAGO's underlying structure and source code. MAGO is open to suggestions for extensions and is amenable for use in both research and teaching of genomics and molecular evolution of genomes assembled from small single-cell projects or large-scale and complex environmental metagenomes.
© The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

Keywords:  FastANI; evolutionary analyses; genome assembly and binning; metagenomics; microbial draft genomes; species boundaries

Year:  2020        PMID: 31633780      PMCID: PMC6993843          DOI: 10.1093/molbev/msz237

Source DB:  PubMed          Journal:  Mol Biol Evol        ISSN: 0737-4038            Impact factor:   16.240


Microbial species play important roles in different environments characterized by a span of organismal complexities. Shotgun sequencing coupled to metagenomic analyses is used to study microbial communities in these environments. The analysis and biological interpretation of sequence information derived from complex communities or single-amplified cell communities, represented as metagenome or whole-genome sequencing data sets, respectively, is challenging and crucially depends on sophisticated computational resources and analyses. These include various pieces of software and steps (e.g., read assembly, binning, annotation, bin evaluation) next to program-specific settings, file format conversions, and decision points that consume substantial time and computational resources and may introduce unintended bias (Sczyrba et al. 2017). Obtaining genomes from metagenomes is an emerging approach with the potential for large-scale recovery of high-quality near-complete genomes amenable to analyses of their evolutionary divergence, evolutionary dynamics, and abundance in original samples (Meyer et al. 2018). Advances in computational tools have improved our ability to address relevant evolutionary questions. However, computational costs for hundreds of samples are measured in tens of thousands of CPU hours. The development of highly successful tools such as FastQC (Andrews 2010), fastp (Chen et al. 2018), IDBA-UD (Peng et al. 2012), MEGAHIT (Li et al. 2015), metaSPAdes (Nurk et al. 2017), MaxBin (Wu et al. 2016), MetaBAT (Kang et al. 2015), CONCOCT (Alneberg et al. 2014), BinSanity (Graham et al. 2017), Dereplication-Aggregation Scoring Tool (Sieber et al. 2018), CheckM (Parks et al. 2015), and ezTree (Wu, 2018), together with lessons learned through the Critical Assessment of Metagenome Interpretation (CAMI; Sczyrba et al. 2017; Meyer et al. 2018; Fritz et al. 2019), has enabled the field of molecular evolution of the Bacteria and Archaea domains to progress from a descriptive to an experimental endeavor. These tools provide insight into the evolutionary wealth of novel metagenome-assembled genomes (MAGs) and novel microbial lineages uncovered from the environment, hence substantially revising and expanding the tree of life (Parks et al. 2017; Parks et al. 2018), and into evolutionary dynamics in complex environments and medicine (Lin and Kussell, 2019; Garud et al. 2019). Although the tools are widely used, a number of limitations (supplementary table S1, Supplementary Material online) and their dispersed and boutique nature limit their integration and present an obstacle to their reproducible use within the community and to their further adoption alongside the ubiquitous increases in sequencing volumes, study complexity (Jain et al. 2018), emerging standards (Sczyrba et al. 2017; Bowers et al. 2017), and technology upgrades (e.g., Nanopores). To date, no uniform piece of software exists that integrates efficiently, scalably, and reproducibly all the steps linking the raw outputs from the sequencing platform (i.e., sequence data sets) through sequence quality trimming, assembly, binning, bin improvement, bin quality control, and bin annotation to evolutionary and phylogenomic placement of bins based on multiple orthologous marker genes at the protein level, and that also provides core- and pan-genome analyses and species boundary delineation through fast average nucleotide identity (ANI) analysis of the resulting draft genomes. Field-wide analysis standards are emerging due to ongoing efforts (Sczyrba et al. 2017; Meyer et al. 2018; Fritz et al. 2019); however, the lack of a reproducible framework makes it difficult to embrace these standards, perform meta-analyses of existing data (Schloss et al. 2009; Parks et al. 2017), or simply remap and extend past analyses (Parks et al. 2018; Jain et al. 2018) to evolutionary dynamics (Garud et al. 2019).
A single software platform, Metagenome Assembled Genomes Orchestra (MAGO) (fig. 1; supplementary table S1, Supplementary Material online), was developed to fill this gap and to overcome the limitations (supplementary table S2, Supplementary Material online) by integrating an ensemble of previously developed tools, streamlining their performance, and delivering compatibility of data formats, together with additional features for error checking and effective computational resource use, governed by user-tailored pipeline processing (as specified by a textual configuration file). MAGO currently makes use of the three most effective assemblers and six binners put forward by the CAMISIM (Fritz et al. 2019) and AMBER (Meyer et al. 2018) studies, respectively. The resulting bins are further improved by an additional (seventh) binner, Dereplication-Aggregation Scoring Tool (Sieber et al. 2018), and evaluated by CheckM according to their quality (% completeness and % contamination; Parks et al. 2015) in line with the MIMAG standard (Bowers et al. 2017). CheckM utilizes a broader set of orthologous protein marker genes specific to the position of each MAG within a reference genome tree and information about collocation of these genes, based on amino acid identity between marker genes. Finally, the produced collection of high-quality MAGs can be used to extract protein-coding single-copy orthologous marker genes using functional annotation and to build maximum-likelihood trees from amino acid sequences with different amino acid substitution models within MAGO using ezTree (Wu, 2018). The resulting alignment file can be exported to build user-specific trees in existing high-end software (e.g., MEGA, Kumar et al. 2018). To annotate and calculate core- and pan-genomes, MAGO integrates Prokka (Seemann, 2014) and Roary (Page et al. 2015) and makes outputs (fasta, gbk) available for additional downstream analyses of genome rearrangements (e.g., Mauve, Darling et al. 2010). FastANI (Jain et al. 2018) is utilized for high-throughput ANI analysis of MAGs, which is used to define species boundaries and to delineate Operational Taxonomic Units (OTUs) at various ANI thresholds. All outputs are made available in structured directories for additional inspection and inclusion in other analysis tools (e.g., MEGA-X, Kumar et al. 2018; GTDB-Tk, Parks et al. 2018; MAGpy, Stewart et al. 2019). In total, MAGO consists of 53 externally developed pieces of software (supplementary table S1, Supplementary Material online) and >9,000 lines of Python code integrated into a seamless workflow that performs error checking of the pipeline configuration and prevents suboptimal utilization of computational resources.
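As an illustration of OTU delineation at a chosen ANI threshold, the following minimal Python sketch groups genomes from pairwise ANI values. It assumes FastANI-style tab-separated rows (query, reference, ANI, followed by fragment counts) and uses single-linkage grouping via union-find; the grouping rule, the function names, and the example file names are assumptions made for illustration and are not taken from MAGO.

```python
from collections import defaultdict


def parse_fastani(lines):
    """Yield (query, reference, ani) from FastANI-style tab-separated rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        yield fields[0], fields[1], float(fields[2])


def cluster_by_ani(pairs, genomes, threshold=95.0):
    """Single-linkage grouping (an assumption): genomes joined if ANI >= threshold."""
    parent = {g: g for g in genomes}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, ref, ani in pairs:
        parent.setdefault(query, query)
        parent.setdefault(ref, ref)
        if query != ref and ani >= threshold:
            root_q, root_r = find(query), find(ref)
            if root_q != root_r:
                parent[root_q] = root_r

    clusters = defaultdict(list)
    for genome in parent:
        clusters[find(genome)].append(genome)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical example rows; the bin names and values are made up.
rows = [
    "bin1.fa\tbin2.fa\t97.5\t800\t900",
    "bin2.fa\tbin3.fa\t96.1\t750\t900",
    "bin1.fa\tbin4.fa\t81.2\t300\t900",
]
genomes = ["bin1.fa", "bin2.fa", "bin3.fa", "bin4.fa"]
otus = cluster_by_ani(parse_fastani(rows), genomes, threshold=95.0)
print(otus)
```

Re-running the grouping with a different `threshold` value gives OTUs at a different ANI level from the same pairwise table.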
Fig. 1.

A schematic representation of steps integrated within MAGO starting from the input of raw sequencing data to MAGs, bin quality checking and the production of a collection of high-quality MAGs. These are further utilized in analysis of evolutionary relationships to produce maximum-likelihood (ML) phylogenomic placement, MAGs annotation, and core/pan genome calculations next to determination of species boundaries and operational taxonomic units at genomic level. The outputs are easily integrated into recently developed tools (e.g., MEGA-X, Kumar et al. 2018; GTDB-Tk, Parks et al. 2018; MAGpy, Stewart et al. 2019).

To overcome the constraints of web-based implementations of existing software and the known software limitations described above (supplementary table S2, Supplementary Material online), MAGO was made available as a Singularity image (https://www.sylabs.io/singularity/; last accessed September 04, 2019) and a Docker container (https://www.docker.com; last accessed September 04, 2019) for high performance computing (HPC) purposes, and also as a VirtualBox (https://www.virtualbox.org/; last accessed September 04, 2019) virtual machine (as outlined in supplementary materials and methods, Supplementary Material online). Because MAGO is released as an open-source software package under the Creative Commons Attribution (CC-BY) License (https://creativecommons.org/licenses/; last accessed September 04, 2019), the software is free and open to modifications by other researchers. It is available for download at the project website (http://mago.fe.uni-lj.si; last accessed October 28, 2019). The accompanying prepared example pipelines and test data set document the necessary information about the use of MAGO and enhance reproducibility, as the entire pipeline settings can now easily be shared between researchers as a single textual pipeline file and the results reproduced independently (supplementary figs. S1 and S2, Supplementary Material online).
The abilities of MAGO are attested by the quality of the underlying pieces of software (supplementary table S1, Supplementary Material online) and their respective publications. Increasingly complex model data sets spanning CAMI (Sczyrba et al. 2017) and EBI (https://www.ebi.ac.uk/ena/data/view/PRJEB8286; last accessed September 04, 2019) were used in benchmarking MAGO (supplementary table S3, Supplementary Material online; results not shown). The Genome Assembly Gold-standard Evaluations (GAGE) and single-cell amplified genome project (Salzberg et al. 2012; Kogawa et al. 2018) were used for realistic pure culture data analyses (supplementary table S3, Supplementary Material online; supplementary figs. S3–S7, Supplementary Material online). Finally, a number of real case metagenomics data sets (n = 106; s = 0.4 TB; supplementary table S3, Supplementary Material online) were analyzed: 1) the moose rumen microbiome (Svartström et al. 2017; figs. 2 and 3), and 2) longitudinal American pre/term delivery microbiomes (Goltsman et al. 2018; supplementary figs. S4–S9, Supplementary Material online).
Fig. 2.

Overview of the basic quality metrics of MAGs reconstructed from the moose rumen microbiome collection (samples S1–6) (supplementary table S3, Supplementary Material online; Svartström et al. 2017): (A) completeness (>50%); (B) contamination (<10%).

Fig. 3.

Genetic discontinuity observed in the wild moose rumen MAGs shown for the first 5,000 pairwise genome comparisons (supplementary table S3, Supplementary Material online). Values of FastANI estimates in the ANI range of 75–100% are shown. The 95% and 83% ANI thresholds of FastANI estimates serve to delineate comparisons belonging to the same species (>95% intraspecies ANI) or different species (<83% interspecies ANI).

Unless otherwise stated, in analyses of the 280 GB data set of the moose rumen microbiome collection (supplementary table S3, Supplementary Material online; Svartström et al. 2017) all parameters used were the defaults for each subroutine. After initial sequence quality control (FastQC, fastp), each sample was assembled (MEGAHIT) and binned individually (MaxBin, MetaBAT, and CONCOCT), and the bins were aggregated and dereplicated (Dereplication-Aggregation Scoring Tool). CheckM was used to assess the quality of the resulting MAGs (% completeness; % contamination). Single-sample binning produced a total of 3,012 bins. The distribution of the produced MAGs into high- and medium-quality MAGs was based on the criteria defined by the minimum information about a metagenome-assembled genome (MIMAG) standards (Bowers et al. 2017) (high: >90% completeness and <5% contamination, presence of 5S, 16S, and 23S rRNA genes, and at least 18 tRNAs; medium: ≥50% completeness and <10% contamination). Given that few of the MAGs with >90% completeness and <5% contamination pass the MIMAG thresholds regarding the presence of rRNA and tRNA genes, owing to known difficulties in assembling rRNA regions, such MAGs are generally described as “near complete” (Bowers et al. 2017).
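The quality tiers described above can be expressed as a small decision rule. The Python sketch below is illustrative only: the `Mag` record, its field names, and the "low" label for bins that meet neither tier are assumptions made for this example, whereas the numeric thresholds are those quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class Mag:
    """Hypothetical record of per-bin quality metrics (field names are assumptions)."""
    name: str
    completeness: float   # percent
    contamination: float  # percent
    has_5s: bool = False
    has_16s: bool = False
    has_23s: bool = False
    trna_count: int = 0


def quality_tier(mag):
    """Assign a tier using the thresholds quoted in the text."""
    if mag.completeness > 90 and mag.contamination < 5:
        has_rrna = mag.has_5s and mag.has_16s and mag.has_23s
        if has_rrna and mag.trna_count >= 18:
            return "high"
        # Passes completeness/contamination but lacks rRNA or tRNA evidence.
        return "near complete"
    if mag.completeness >= 50 and mag.contamination < 10:
        return "medium"
    return "low"  # label chosen for this sketch; not defined in the text above


# Made-up example bins.
bins = [
    Mag("bin_a", 96.2, 1.1, True, True, True, 20),
    Mag("bin_b", 93.0, 2.4, False, True, False, 15),
    Mag("bin_c", 61.5, 7.8),
    Mag("bin_d", 35.0, 12.0),
]
tiers = {m.name: quality_tier(m) for m in bins}
print(tiers)
```

Keeping "near complete" separate from "high" mirrors the distinction drawn in the text between bins that pass only the completeness and contamination thresholds and those that also carry the required rRNA and tRNA genes.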
Medium-quality bins (n = 670) represented 22.2 ± 3.4% of all bins, whereas 75% and 80% complete bins (10% contamination) (Stewart et al. 2019), next to near-complete bins, represented 14.7 ± 3.4% (n = 443), 12.9 ± 2.9% (n = 389), and 6.5 ± 1.2% (n = 197) of all recovered MAGs, respectively. In general, MAGO enabled the recovery of 13 MAGs (80% complete; 10% contamination; dereplicated) per 10 GB of input sequence data. The resulting MAGs obtained in this study were first used to explore the existence of genetic discontinuity among microbial species, as observed in large collections of complete genomes from unrelated studies (Jain et al. 2018). The bimodal distribution, with the vast majority (99.8%) of the total genome comparisons showing either >95% intraspecies ANI or <83% interspecies ANI values, was also observed for the pairwise comparisons of MAGs recovered in this study (fig. 3). It is highly likely that the discontinuity represents a true biological signature, confirming the existence of sequence-discrete populations in natural environments. Although the exact biological mechanisms giving rise to this phenomenon were not explored in this study, the existence of genetic discontinuity in various environments provides an opportunity to reconsider its potential origins: 1) decreased recombination frequency below 95% ANI; 2) dispersal limitations in habitats; 3) reduced diversity due to ongoing competition; and 4) stochastic events over long periods of time. It also provides an opportunity to extend analyses from the Bacterial and Archaeal domains toward plasmids (Nurk et al. 2017) and viruses (Sutton et al. 2019), for which MAGO can be adopted. In addition, the reconstructed MAGs were compared with a large and heterogeneous collection of characterized prokaryotic genomes (n = 91,761; Jain et al. 2018). The majority of MAGs recovered in this study exhibited ANI < 83% (i.e., interspecies ANI values) with genomes in the collection.
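The share of pairwise comparisons falling on either side of the ANI gap can be tallied with a few lines of Python. This is a minimal sketch under stated assumptions: the function name and the example values are hypothetical, and only the 95% and 83% cut-offs come from the text.

```python
def summarize_ani(values, intra=95.0, inter=83.0):
    """Count comparisons above the intraspecies and below the interspecies cut-off."""
    total = len(values)
    intra_n = sum(1 for v in values if v > intra)
    inter_n = sum(1 for v in values if v < inter)
    gap_n = total - intra_n - inter_n  # comparisons inside the 83-95% gap
    outside = (intra_n + inter_n) / total if total else 0.0
    return {
        "intraspecies": intra_n,
        "interspecies": inter_n,
        "in_gap": gap_n,
        "fraction_outside_gap": outside,
    }


# Made-up ANI estimates (percent) for five genome pairs.
example = [99.1, 97.0, 82.0, 78.5, 90.0]
summary = summarize_ani(example)
print(summary)
```

A `fraction_outside_gap` close to 1 corresponds to the bimodal pattern described above, in which few comparisons fall between the two cut-offs.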
According to the species demarcation cut-off of ∼95% ANI, the MAGs recovered from the actively fermenting wild moose rumen represent potentially new species amenable to detailed genomic analyses. MAGO efficiently alleviates the metagenome data analysis bottleneck and provides an important and straightforward-to-implement step toward making future large-scale evolutionary analyses of MAGs efficient, flexible, scalable, and reproducible, while enforcing the MIMAG standard. Its outputs are easily integrated into downstream pipelines such as the Genome Taxonomy Database (GTDB) to establish a standardized microbial taxonomy based on genome phylogeny (http://gtdb.ecogenomic.org/; last accessed September 04, 2019). MAGO is open to suggestions for extensions and is amenable for use in both research and teaching of genomics and molecular evolution of genomes assembled from small single-cell projects or large-scale and complex environmental metagenomes.
References (10 of 33 shown)

1.  IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth.

Authors:  Yu Peng; Henry C M Leung; S M Yiu; Francis Y L Chin
Journal:  Bioinformatics       Date:  2012-04-11       Impact factor: 6.937

2.  Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life.

Authors:  Donovan H Parks; Christian Rinke; Maria Chuvochina; Pierre-Alain Chaumeil; Ben J Woodcroft; Paul N Evans; Philip Hugenholtz; Gene W Tyson
Journal:  Nat Microbiol       Date:  2017-09-11       Impact factor: 17.745

3.  MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms.

Authors:  Sudhir Kumar; Glen Stecher; Michael Li; Christina Knyaz; Koichiro Tamura
Journal:  Mol Biol Evol       Date:  2018-06-01       Impact factor: 16.240

4.  Ninety-nine de novo assembled genomes from the moose (Alces alces) rumen microbiome provide new insights into microbial plant biomass degradation.

Authors:  Olov Svartström; Johannes Alneberg; Nicolas Terrapon; Vincent Lombard; Ino de Bruijn; Jonas Malmsten; Ann-Marie Dalin; Emilie El Muller; Pranjul Shah; Paul Wilmes; Bernard Henrissat; Henrik Aspeborg; Anders F Andersson
Journal:  ISME J       Date:  2017-07-21       Impact factor: 10.302

5.  CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes.

Authors:  Donovan H Parks; Michael Imelfort; Connor T Skennerton; Philip Hugenholtz; Gene W Tyson
Journal:  Genome Res       Date:  2015-05-14       Impact factor: 9.043

6.  metaSPAdes: a new versatile metagenomic assembler.

Authors:  Sergey Nurk; Dmitry Meleshko; Anton Korobeynikov; Pavel A Pevzner
Journal:  Genome Res       Date:  2017-03-15       Impact factor: 9.043

7.  fastp: an ultra-fast all-in-one FASTQ preprocessor.

Authors:  Shifu Chen; Yanqing Zhou; Yaru Chen; Jia Gu
Journal:  Bioinformatics       Date:  2018-09-01       Impact factor: 6.937

8.  MAGpy: a reproducible pipeline for the downstream analysis of metagenome-assembled genomes (MAGs).

Authors:  Robert D Stewart; Marc D Auffret; Timothy J Snelling; Rainer Roehe; Mick Watson
Journal:  Bioinformatics       Date:  2019-06-01       Impact factor: 6.937

9.  Choice of assembly software has a critical impact on virome characterisation.

Authors:  Thomas D S Sutton; Adam G Clooney; Feargal J Ryan; R Paul Ross; Colin Hill
Journal:  Microbiome       Date:  2019-01-28       Impact factor: 14.650

10.  Evolutionary dynamics of bacteria in the gut microbiome within and across hosts.

Authors:  Nandita R Garud; Benjamin H Good; Oskar Hallatschek; Katherine S Pollard
Journal:  PLoS Biol       Date:  2019-01-23       Impact factor: 8.029
