
MAGpy: a reproducible pipeline for the downstream analysis of metagenome-assembled genomes (MAGs).

Robert D Stewart¹, Marc D Auffret², Timothy J Snelling³, Rainer Roehe², Mick Watson¹.

Abstract

MOTIVATION: Metagenomics is a powerful tool for assaying the DNA from every genome present in an environment. Recent advances in bioinformatics have enabled the rapid assembly of near-complete metagenome-assembled genomes (MAGs), and there is a need for reproducible pipelines that can annotate and characterize thousands of genomes simultaneously, to enable identification and functional characterization.
RESULTS: Here we present MAGpy, a scalable and reproducible pipeline that takes multiple genome assemblies as FASTA and compares them to several public databases, checks quality, suggests a taxonomy and draws a phylogenetic tree.
AVAILABILITY AND IMPLEMENTATION: MAGpy is available on GitHub: https://github.com/WatsonLab/MAGpy.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 30418481 PMCID: PMC6581432 DOI: 10.1093/bioinformatics/bty905

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Discovering and studying microbes in the environment has been a goal of genomic technologies for many years (Brodie ; Watson ), but advances in DNA sequencing (Goodwin ; Loman and Watson, 2015; Watson, 2014) have enabled a revolution in metagenomics that has accelerated this area of research. Metagenomics refers to the whole-genome investigation of every organism within a particular environment, and is often used in microbiome studies to investigate changes in the taxonomic and functional profile of samples of interest. This method of simultaneously quantifying taxonomic and functional structure has been used in studies of age and geography in the human gut (Yatsunenko ), release of carbon due to permafrost thawing (Mackelprang ), the environmental impact and feed efficiency of animal agriculture (Roehe ; Wallace et al., 2015), environmental characterization of Earth’s oceans (Venter ) and the extraction of industrially and commercially relevant enzymes from environmental samples (Roumpeka ).

Metagenomics also offers the ability to assemble near-complete and draft microbial genomes without the need for culture. Such ‘metagenomic binning’ approaches involve the assembly of metagenomic sequence reads into contigs followed by clustering, or binning, of contigs into putative genomes, called metagenome-assembled genomes (MAGs) (Bowers ). Recently, we have used this technique to assemble complete and draft genomes from the cattle rumen (Stewart ), and in this manuscript we present MAGpy, the pipeline we used to characterize the 913 genomes presented in that paper.

A major challenge in the analysis of MAGs is that researchers are often presented with hundreds or thousands of putative genomes, which need to be annotated, characterized, placed within a phylogenetic tree and assigned a putative function or role. This is further complicated by the fact that many of the putative genomes do not have close relatives with good quality reference genomes, making comparative genomics almost impossible.

Here we present MAGpy (pronounced ‘magpie’), a reproducible pipeline for the characterisation of MAGs using open source and freely available bioinformatics software. MAGpy is implemented as a Snakemake (Köster and Rahmann, 2012) pipeline, enabling reproducible analyses, extensibility, integration with high-performance-compute clusters and restart capabilities. MAGpy annotates the genomes, predicts putative protein sequences, compares the MAGs to multiple genomic, proteomic and protein family databases, produces several reports and draws a taxonomic tree. We demonstrate the utility of MAGpy on a subset of 800 bacterial and archaeal MAGs recently published by Parks et al. (Parks ).

2 Comparison to other tools

The aim of MAGpy is to assist researchers in characterizing hundreds or thousands of MAGs, specifically to help researchers identify the likely taxonomy of each MAG, and it is the pipeline we use for characterization of rumen microbes assembled from metagenomes (Stewart ). Few other tools have similar aims or scope. CheckM predicts the taxonomic lineage of each MAG as an initial step in testing MAG quality, and this evidence is incorporated into MAGpy. PhyloPhlAn enables researchers to place any genome(s) into the tree of life, which can assist in identification. Again, PhyloPhlAn is run as part of the MAGpy pipeline. Generic genome and metagenome annotation tools exist: Prokka (Seemann, 2014) is a genome annotation tool that can be installed locally and which can annotate microbial genomes and prepare them for submission to GenBank; whereas PATRIC (Wattam ), RAST (Aziz ), MG-RAST (Keegan et al., 2016), Microscope (Vallenet ) and IMG/M (Chen ) are online tools that provide services such as genome and metagenome annotation. The focus of these tools is on annotation—i.e. identifying the location and likely function of genes and proteins. Whilst this information can be used to identify the likely taxonomy of a newly assembled genome or MAG, it is not their primary focus. The focus of MAGpy is not (meta)genome annotation per se; rather we wish to leverage genome sequence data and predicted protein sequences to help identify the closest sequenced relative to each MAG; we want to enable this as a local analysis and we want to do this at scale. Proteins are more conserved and can provide matches to more distant relatives. MAGpy uses Prodigal to predict proteins, a similar approach to Prokka. Mash (Ondov ) and Sourmash (Brown and Irber, 2016) are relatively new tools that use MinHash distances to compare massive sequence datasets rapidly. Both enable novel genomes to be compared to tens of thousands of existing genomes in the public databases. 
We integrate Sourmash into MAGpy to enable comparison of MAGs to over 100 000 public genomes in GenBank.
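
To make the MinHash idea concrete, the following minimal Python sketch estimates the Jaccard similarity of two sequences from the smallest k-mer hashes of each. It is an illustration only and not the Mash or Sourmash implementation: the function names, the use of SHA-1, the k-mer size of 21 and the sketch size of 500 are all assumptions chosen for the example, and reverse-complement (canonical) k-mers are ignored for brevity.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of a DNA string."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def minhash_sketch(seq, k=21, size=500):
    """Return the `size` smallest 64-bit k-mer hashes (a bottom-s sketch).

    Assumption: SHA-1 truncated to 8 bytes stands in for the hash
    function a real tool would use.
    """
    hashes = {
        int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        for kmer in kmers(seq.upper(), k)
    }
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size=500):
    """Estimate Jaccard similarity from two bottom-s sketches.

    The `size` smallest hashes of the union are taken, and the fraction
    of them present in both sketches is returned.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

Because only a fixed number of hashes is kept per genome, each comparison costs roughly the same regardless of genome length, which is what makes searching a MAG against a very large reference collection practical.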

3 Materials and methods

MAGpy makes use of Snakemake to define an analysis workflow for MAGs based on open source and freely available bioinformatics software. First, CheckM (Parks ) is run, which uses a set of pre-computed core genes to assess the completeness and contamination of MAGs. CheckM also attempts to assign a taxonomic level to the MAGs, though in our experience this is often a conservative estimate. In tandem, MAGpy predicts the protein coding sequences of MAGs using Prodigal (Hyatt ). DIAMOND (Buchfink ) BLASTP is used to compare the proteins to UniProt (UniProt Consortium, 2018). This has multiple purposes: the hits from UniProt provide a form of annotation of the putative proteins and may predict function; many MAGs may show little similarity to published genomes at the DNA level, but as proteins are more conserved, protein hits can help define the closest sequenced genome; and the length of the predicted protein and that protein’s hits can be used to detect truncated genes and proteins in the MAG annotation. Reports of the DIAMOND results at the level of the MAG and for each contig within each MAG are produced. The proteins are also compared to protein families in Pfam (Finn ) using PfamScan, and are used to create a tree using PhyloPhlAn (Segata ), which is subsequently visualised using GraPhlAn (Asnicar ). The MAG genome sequences are also compared to over 100 000 public genomes using MinHash signatures as implemented in Sourmash (Brown and Irber, 2016).
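
The ordering of these steps can be pictured as a dependency graph, which is the kind of structure a workflow engine such as Snakemake resolves before running anything. The sketch below is not MAGpy's Snakefile: the step names are hypothetical labels, and the dependencies are one reading of the description above. It uses Python's standard `graphlib` module (Python 3.9 or later) to group steps into batches that could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical step names. Each step maps to the steps it depends on,
# as inferred from the text (not taken from the MAGpy source code).
STEPS = {
    "checkm": set(),
    "prodigal": set(),
    "sourmash": set(),
    "diamond_uniprot": {"prodigal"},
    "pfam_scan": {"prodigal"},
    "phylophlan": {"prodigal"},
    "graphlan": {"phylophlan"},
    "diamond_reports": {"diamond_uniprot"},
}


def execution_batches(steps):
    """Group steps into ordered batches whose members are independent."""
    sorter = TopologicalSorter(steps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps with all inputs done
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

Calling `execution_batches(STEPS)` places CheckM, Prodigal and Sourmash in the first batch, since none of them depends on another step, and everything that consumes predicted proteins in later batches.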

4 Results and discussion

We applied MAGpy to 800 bacterial and archaeal MAGs from Parks et al. (Parks ). The CheckM report [which uses Ete3 (Huerta-Cepas ) to expand the taxonomic prediction] can be seen in Supplementary Table S1, the Sourmash report in Supplementary Table S2 and the UniProt report in Supplementary Table S3. The PhyloPhlAn tree can be seen in Figure 1. Specific examples reveal the strengths of each approach. Whilst CheckM predicts a lineage of s__algicola for UBA6511, the UniProt results show 3403 (86%) of that genome’s 3945 predicted proteins have a top hit to Maribacter dokdonensis DSW-8 with an average similarity of 95.78%. The Sourmash results (Supplementary Table S2) also predict a strong hit to Maribacter. On many occasions, CheckM is only able to predict a lineage of k__Bacteria, as in the case of UBA3429. However, both the UniProt and Sourmash results show a strong similarity of this genome to Thermus thermophilus HB8, providing strain-level resolution where CheckM fails.
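
The MAG-level UniProt summary quoted above (the organism receiving the most top hits, and the share of predicted proteins it accounts for) can be illustrated with a short Python sketch. The function name, the input dictionary and the separate total-protein argument are assumptions made for illustration; MAGpy's own report scripts may be organised differently.

```python
from collections import Counter


def summarise_top_hits(top_hits, total_proteins):
    """Summarise which organism receives most of a MAG's top hits.

    top_hits: dict mapping protein ID -> organism of its best hit
              (proteins without any hit are simply absent).
    total_proteins: number of proteins predicted for the MAG.

    Returns (organism, hit_count, share_of_all_predicted_proteins).
    """
    if not top_hits or total_proteins <= 0:
        return None, 0, 0.0
    organism, count = Counter(top_hits.values()).most_common(1)[0]
    return organism, count, count / total_proteins
```

With 3403 of 3945 predicted proteins assigned to a single organism, as reported for UBA6511, the returned share is roughly 0.86.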

Fig. 1.

Phylogenetic tree of 800 MAGs created using PhyloPhlAn and produced by MAGpy

The outputs of MAGpy can also be used to identify potential chimeric MAGs. As well as producing a MAG-level report for the UniProt comparisons, a contig-level report is also produced for each MAG. This report includes the number of proteins predicted for each contig, and the most popular genus and species for those proteins from the DIAMOND search. Supplementary Table S4 shows a contig-level report for UBA7370, a high-quality MAG. Most contigs show very high protein-level similarity to the same genus (‘Synechococcus’) and species (‘Synechococcus sp. KORDI-49’). There are only two exceptions, with one contig showing similarity to ‘Synechococcus sp. (strain WH8103)’ and another hitting the genus ‘uncultured’. Deeper examination shows these to come from hits to ‘uncultured marine type-A Synechococcus’. In comparison, Supplementary Table S5 shows a contig-level report from UBA6779. In this MAG, many of the contigs show high protein-level similarity to genus ‘Zunongwangia’ and species ‘Zunongwangia profunda (strain DSM 18752/CCTCC AB 206139/SM-A87)’; however, there are also contigs with high similarity to Salegentibacter and Leeuwenhoekiella, and, towards the bottom of the table, multiple contigs with low protein-level similarity to Gramella, Mesonia, Legionella and various others. Whilst many of these are from the same family (Flavobacteriaceae), there are certainly signs this is a chimeric MAG. Researchers would be advised to remove these contigs from the MAG and re-analyze.

We conclude that MAGpy represents a novel, useful and reproducible workflow that enables researchers to predict the closest relative to newly sequenced and assembled MAGs. MAGpy carries out extensive comparative genomics at the DNA and protein-level, attempts to place MAGs within a phylogenetic tree, produces detailed reports and allows for the identification of potential chimeric MAGs.
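
The contig-level check for chimeric MAGs described in this section can be sketched in Python as follows. This is a hedged illustration rather than MAGpy's reporting code: the function names, the (contig, genus) input format and the 0.8 agreement threshold are assumptions, and in practice the decision is made by inspecting the report rather than by a fixed cut-off.

```python
from collections import Counter


def contig_genera(protein_hits):
    """Map each contig to the most common genus among its proteins' top hits.

    protein_hits: iterable of (contig_id, genus) pairs, one per protein.
    """
    per_contig = {}
    for contig, genus in protein_hits:
        per_contig.setdefault(contig, Counter())[genus] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in per_contig.items()}


def dominant_genus_share(contig_to_genus):
    """Return the MAG's dominant genus and the fraction of contigs agreeing."""
    if not contig_to_genus:
        return None, 0.0
    genus, n = Counter(contig_to_genus.values()).most_common(1)[0]
    return genus, n / len(contig_to_genus)


def looks_chimeric(contig_to_genus, min_share=0.8):
    """Flag a MAG when too few contigs agree on one genus.

    min_share is an arbitrary illustrative threshold, not a published value.
    """
    if not contig_to_genus:
        return False
    _, share = dominant_genus_share(contig_to_genus)
    return share < min_share
```

A MAG whose contigs almost all point to one genus, as in the UBA7370 example, would pass this check, whereas a MAG whose contigs are spread across several genera, as in the UBA6779 example, would be flagged for manual review.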

Funding

This project was supported by the Biotechnology and Biological Sciences Research Council (BBSRC; BB/N016742/1, BB/N01720X/1), including institute strategic programme and national capability awards to The Roslin Institute (BBSRC: BB/P013759/1, BB/P013732/1, BB/J004235/1, BB/J004243/1); and by the Scottish Government (RESAS) as part of the 2016-2021 commission. Conflict of Interest: none declared.

30 in total

1. Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw.

Authors: Rachel Mackelprang; Mark P Waldrop; Kristen M DeAngelis; Maude M David; Krystle L Chavarria; Steven J Blazewicz; Edward M Rubin; Janet K Jansson
Journal: Nature Date: 2011-11-06 Impact factor: 49.962

2. Application of a high-density oligonucleotide microarray approach to study bacterial population dynamics during uranium reduction and reoxidation.

Authors: Eoin L Brodie; Todd Z Desantis; Dominique C Joyner; Seung M Baek; Joern T Larsen; Gary L Andersen; Terry C Hazen; Paul M Richardson; Donald J Herman; Tetsu K Tokunaga; Jiamin M Wan; Mary K Firestone
Journal: Appl Environ Microbiol Date: 2006-09 Impact factor: 4.792

3. Snakemake--a scalable bioinformatics workflow engine.

Authors: Johannes Köster; Sven Rahmann
Journal: Bioinformatics Date: 2012-08-20 Impact factor: 6.937

4. MicroScope: a platform for microbial genome annotation and comparative genomics.

Authors: D Vallenet; S Engelen; D Mornico; S Cruveiller; L Fleury; A Lajus; Z Rouy; D Roche; G Salvignol; C Scarpelli; C Médigue
Journal: Database (Oxford) Date: 2009-11-25 Impact factor: 3.451

5. Prodigal: prokaryotic gene recognition and translation initiation site identification.

Authors: Doug Hyatt; Gwo-Liang Chen; Philip F Locascio; Miriam L Land; Frank W Larimer; Loren J Hauser
Journal: BMC Bioinformatics Date: 2010-03-08 Impact factor: 3.169

6. Environmental genome shotgun sequencing of the Sargasso Sea.

Authors: J Craig Venter; Karin Remington; John F Heidelberg; Aaron L Halpern; Doug Rusch; Jonathan A Eisen; Dongying Wu; Ian Paulsen; Karen E Nelson; William Nelson; Derrick E Fouts; Samuel Levy; Anthony H Knap; Michael W Lomas; Ken Nealson; Owen White; Jeremy Peterson; Jeff Hoffman; Rachel Parsons; Holly Baden-Tillson; Cynthia Pfannkoch; Yu-Hui Rogers; Hamilton O Smith
Journal: Science Date: 2004-03-04 Impact factor: 47.728

7. Human gut microbiome viewed across age and geography.

Authors: Tanya Yatsunenko; Federico E Rey; Mark J Manary; Indi Trehan; Maria Gloria Dominguez-Bello; Monica Contreras; Magda Magris; Glida Hidalgo; Robert N Baldassano; Andrey P Anokhin; Andrew C Heath; Barbara Warner; Jens Reeder; Justin Kuczynski; J Gregory Caporaso; Catherine A Lozupone; Christian Lauber; Jose Carlos Clemente; Dan Knights; Rob Knight; Jeffrey I Gordon
Journal: Nature Date: 2012-05-09 Impact factor: 49.962

8. The RAST Server: rapid annotations using subsystems technology.

Authors: Ramy K Aziz; Daniela Bartels; Aaron A Best; Matthew DeJongh; Terrence Disz; Robert A Edwards; Kevin Formsma; Svetlana Gerdes; Elizabeth M Glass; Michael Kubal; Folker Meyer; Gary J Olsen; Robert Olson; Andrei L Osterman; Ross A Overbeek; Leslie K McNeil; Daniel Paarmann; Tobias Paczian; Bruce Parrello; Gordon D Pusch; Claudia Reich; Rick Stevens; Olga Vassieva; Veronika Vonstein; Andreas Wilke; Olga Zagnitko
Journal: BMC Genomics Date: 2008-02-08 Impact factor: 3.969

9. DetectiV: visualization, normalization and significance testing for pathogen-detection microarray data.

Authors: Michael Watson; Juliet Dukes; Abu-Bakr Abu-Median; Donald P King; Paul Britton
Journal: Genome Biol Date: 2007 Impact factor: 13.583

10. PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes.

Authors: Nicola Segata; Daniela Börnigen; Xochitl C Morgan; Curtis Huttenhower
Journal: Nat Commun Date: 2013 Impact factor: 14.919
