Robert D Stewart [1], Marc D Auffret [2], Timothy J Snelling [3], Rainer Roehe [2], Mick Watson [1]
[1] The Roslin Institute and R(D)SVS, University of Edinburgh, Easter Bush, UK. [2] Scotland's Rural College, Easter Bush, UK. [3] The Rowett Institute of Nutrition and Health, University of Aberdeen, King's College, Aberdeen, UK.
Abstract
MOTIVATION: Metagenomics is a powerful tool for assaying the DNA from every genome present in an environment. Recent advances in bioinformatics have enabled the rapid assembly of near-complete metagenome-assembled genomes (MAGs), and there is a need for reproducible pipelines that can annotate and characterize thousands of genomes simultaneously, to enable identification and functional characterization. RESULTS: Here we present MAGpy, a scalable and reproducible pipeline that takes multiple genome assemblies as FASTA and compares them to several public databases, checks quality, suggests a taxonomy and draws a phylogenetic tree. AVAILABILITY AND IMPLEMENTATION: MAGpy is available on github: https://github.com/WatsonLab/MAGpy. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
1 Introduction
Discovering and studying microbes in the environment have been a goal of genomic technologies for many years (Brodie ; Watson ), but advances in DNA sequencing (Goodwin ; Loman and Watson, 2015; Watson, 2014) have enabled a revolution in metagenomics that has accelerated this area of research. Metagenomics refers to the whole-genome investigation of every organism within a particular environment, and is often used in microbiome studies to investigate changes in the taxonomic and functional profile of samples of interest. This method of simultaneously quantifying taxonomic and functional structure has been used in studies of age and geography in the human gut (Yatsunenko ), release of carbon due to permafrost thawing (Mackelprang ), the environmental impact and feed efficiency of animal agriculture (Roehe ; Wallace et al., 2015), environmental characterization of Earth’s oceans (Venter ) and the extraction of industrially and commercially relevant enzymes from environmental samples (Roumpeka ).
Metagenomics also offers the ability to assemble near-complete and draft microbial genomes without the need for culture. Such ‘metagenomic binning’ approaches involve the assembly of metagenomic sequence reads into contigs followed by clustering, or binning, of contigs into putative genomes, called metagenome-assembled genomes (MAGs) (Bowers ). Recently, we have used this technique to assemble complete and draft genomes from the cattle rumen (Stewart ), and in this manuscript we present MAGpy, the pipeline we used to characterize the 913 genomes presented in that paper.
A major challenge in the analysis of MAGs is that researchers are often presented with hundreds or thousands of putative genomes, which need to be annotated, characterized, placed within a phylogenetic tree and assigned a putative function or role. This is further complicated by the fact that many of the putative genomes do not have close relatives with good quality reference genomes, making comparative genomics almost impossible.
Here we present MAGpy (pronounced ‘magpie’), a reproducible pipeline for the characterisation of MAGs using open source and freely available bioinformatics software. MAGpy is implemented as a Snakemake (Koster and Rahmann, 2012) pipeline, enabling reproducible analyses, extensibility, integration with high-performance-compute clusters and restart capabilities. MAGpy annotates the genomes, predicts putative protein sequences, compares the MAGs to multiple genomic, proteomic and protein family databases, produces several reports and draws a taxonomic tree. We demonstrate the utility of MAGpy on a subset of 800 bacterial and archaeal MAGs recently published by Parks et al. (Parks ).
2 Comparison to other tools
The aim of MAGpy is to assist researchers in characterizing hundreds or thousands of MAGs, specifically to help researchers identify the likely taxonomy of each MAG, and it is the pipeline we use for characterization of rumen microbes assembled from metagenomes (Stewart ). Few other tools have similar aims or scope. CheckM predicts the taxonomic lineage of each MAG as an initial step in testing MAG quality, and this evidence is incorporated into MAGpy. PhyloPhlAn enables researchers to place any genome(s) into the tree of life, which can assist in identification. Again, PhyloPhlAn is run as part of the MAGpy pipeline.
Generic genome and metagenome annotation tools exist: Prokka (Seemann, 2014) is a genome annotation tool that can be installed locally and which can annotate microbial genomes and prepare them for submission to GenBank; whereas PATRIC (Wattam ), RAST (Aziz ), MG-RAST (Keegan et al., 2016), Microscope (Vallenet ) and IMG/M (Chen ) are online tools that provide services such as genome and metagenome annotation. The focus of these tools is on annotation, i.e. identifying the location and likely function of genes and proteins. Whilst this information can be used to identify the likely taxonomy of a newly assembled genome or MAG, it is not their primary focus. The focus of MAGpy is not (meta)genome annotation per se; rather we wish to leverage genome sequence data and predicted protein sequences to help identify the closest sequenced relative to each MAG; we want to enable this as a local analysis and we want to do this at scale.
Proteins are more conserved than DNA and can provide matches to more distant relatives. MAGpy uses Prodigal to predict proteins, a similar approach to Prokka. Mash (Ondov ) and Sourmash (Brown and Irber, 2016) are relatively new tools that use MinHash distances to compare massive sequence datasets rapidly. Both enable novel genomes to be compared to tens of thousands of existing genomes in the public databases. We integrate Sourmash into MAGpy to enable comparison of MAGs to over 100 000 public genomes in GenBank.
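To illustrate the MinHash idea behind Mash and Sourmash, the following is a minimal sketch of a bottom-s MinHash estimate of Jaccard similarity between the k-mer sets of two sequences. It is not the implementation used by either tool: the function names, the k-mer size, the sketch size and the use of SHA-1 are assumptions made for illustration, and the sketch ignores reverse complements, which real tools account for.

```python
import hashlib
import random


def kmers(seq, k):
    """Return the set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a stable 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(seq, k=21, size=500):
    """Bottom-s sketch: the `size` smallest distinct k-mer hash values."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size=500):
    """Estimate Jaccard similarity from the smallest hashes of the union."""
    a, b = set(sketch_a), set(sketch_b)
    merged = sorted(a | b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in a and h in b)
    return shared / len(merged)


# Toy data (invented): a random "genome", a relative sharing its first half,
# and an unrelated random sequence.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
related = genome[:2500] + "".join(rng.choice("ACGT") for _ in range(2500))
unrelated = "".join(rng.choice("ACGT") for _ in range(5000))

same = jaccard_estimate(minhash_sketch(genome), minhash_sketch(genome))
partial = jaccard_estimate(minhash_sketch(genome), minhash_sketch(related))
distant = jaccard_estimate(minhash_sketch(genome), minhash_sketch(unrelated))
```

Because only a fixed number of hash values is kept per genome, each comparison costs roughly the same regardless of genome length, which is what makes comparison against very large collections of public genomes feasible.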
3 Materials and methods
MAGpy makes use of Snakemake to define an analysis workflow for MAGs based on open source and freely available bioinformatics software. First, CheckM (Parks ) is run, which uses a set of pre-computed core genes to assess the completeness and contamination of MAGs. CheckM also attempts to assign a taxonomic level to the MAGs, though in our experience this is often a conservative estimate. In tandem, MAGpy predicts the protein coding sequences of MAGs using Prodigal (Hyatt ). DIAMOND (Buchfink ) BLASTP is used to compare the proteins to UniProt (UniProt Consortium, 2018). This serves several purposes: the hits from UniProt provide a form of annotation of the putative proteins and may predict function; many MAGs may show little similarity to published genomes at the DNA level, but as proteins are more conserved, protein hits can help define the closest sequenced genome; and the length of the predicted protein and that protein’s hits can be used to detect truncated genes and proteins in the MAG annotation. Reports of the DIAMOND results at the level of the MAG and for each contig within each MAG are produced. The proteins are also compared to protein families in Pfam (Finn ) using PfamScan, and are used to create a tree with PhyloPhlAn (Segata ), which is subsequently visualised using GraPhlAn (Asnicar ). The MAG genome sequences are also compared to over 100 000 public genomes using MinHash signatures as implemented in Sourmash (Brown and Irber, 2016).
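The use of protein and hit lengths to detect truncated genes can be sketched as follows. This is an illustration only, not MAGpy's own code: the `TopHit` record, the function names, the 0.8 length-ratio threshold and the example identifiers are all hypothetical, and the sketch assumes that the length of each predicted protein and of its best database hit have already been extracted from the search results.

```python
from dataclasses import dataclass


@dataclass
class TopHit:
    protein_id: str   # predicted protein in the MAG (hypothetical identifier)
    protein_len: int  # length of the predicted protein, in amino acids
    hit_id: str       # best-matching database protein (hypothetical identifier)
    hit_len: int      # length of that database protein, in amino acids


def length_ratio(hit):
    """Length of the predicted protein relative to its best hit."""
    return hit.protein_len / hit.hit_len


def flag_truncated(hits, min_ratio=0.8):
    """Return IDs of proteins much shorter than their best hit.

    The 0.8 cut-off is an assumed value chosen for illustration.
    """
    return [h.protein_id for h in hits if length_ratio(h) < min_ratio]


# Invented example records.
example_hits = [
    TopHit("contig1_1", 300, "hitA", 305),
    TopHit("contig1_2", 120, "hitB", 400),
    TopHit("contig2_1", 250, "hitC", 240),
]
truncated = flag_truncated(example_hits)
```

Here only `contig1_2` is flagged, since it covers 30% of the length of its best hit. A flagged protein is a candidate for a gene split by a contig boundary or an assembly error, and would still need manual inspection.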
4 Results and discussion
We applied MAGpy to 800 bacterial and archaeal MAGs from Parks et al. (Parks ). The CheckM report [which uses Ete3 (Huerta-Cepas ) to expand the taxonomic prediction] can be seen in Supplementary Table S1, the Sourmash report in Supplementary Table S2 and the UniProt report in Supplementary Table S3. The PhyloPhlAn tree can be seen in Figure 1. Specific examples reveal the strengths of each approach. Whilst CheckM predicts a lineage of s__algicola for UBA6511, the UniProt results show 3403 (86%) of that genome’s 3945 predicted proteins have a top hit to Maribacter dokdonensis DSW-8 with an average similarity of 95.78%. The Sourmash results (Supplementary Table S2) also predict a strong hit to Maribacter. On many occasions, CheckM is only able to predict a lineage of k__Bacteria, as in the case of UBA3429. However, both the UniProt and Sourmash results show a strong similarity of this genome to Thermus thermophilus HB8, providing strain-level resolution where CheckM fails.
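The MAG-level figures quoted above (the share of predicted proteins whose top hit is to one organism, and their average similarity) can be reproduced with a summary of the following kind. This is a sketch under stated assumptions, not the MAGpy report code: the function name and input layout are hypothetical, the toy hits are invented, and we assume the average similarity is taken over the proteins whose top hit is to the most frequent organism.

```python
from collections import Counter


def summarise_top_hits(top_hits, n_predicted):
    """Summarise per-protein top hits for one MAG.

    top_hits: list of (organism, percent_similarity), one per protein with a hit.
    n_predicted: total number of proteins predicted for the MAG.
    """
    counts = Counter(organism for organism, _ in top_hits)
    organism, n = counts.most_common(1)[0]
    sims = [sim for org, sim in top_hits if org == organism]
    return {
        "organism": organism,
        "n_proteins": n,
        "percent_of_predicted": 100.0 * n / n_predicted,
        "mean_similarity": sum(sims) / len(sims),
    }


# Invented toy input: three proteins with hits out of four predicted.
toy = summarise_top_hits(
    [("Organism A", 96.0), ("Organism A", 94.0), ("Organism B", 70.0)],
    n_predicted=4,
)

# The percentage reported in the text for UBA6511: 3403 of 3945 proteins.
uba6511_percent = 100.0 * 3403 / 3945
```

For UBA6511 this gives 100 × 3403 / 3945 ≈ 86.3, which is the 86% quoted above.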
Fig. 1. Phylogenetic tree of 800 MAGs created using PhyloPhlAn and produced by MAGpy
The outputs of MAGpy can also be used to identify potential chimeric MAGs. As well as producing a MAG-level report for the UniProt comparisons, a contig-level report is also produced for each MAG. This report includes the number of proteins predicted for each contig, and the most frequent genus and species among the DIAMOND hits for those proteins. Supplementary Table S4 shows a contig-level report for UBA7370, a high quality MAG. Most contigs show very high protein-level similarity to the same genus (‘Synechococcus’) and species (‘Synechococcus sp. KORDI-49’). There are only two exceptions, with one contig showing similarity to ‘Synechococcus sp. (strain WH8103)’ and another hitting the genus ‘uncultured’. Deeper examination shows these to come from hits to ‘uncultured marine type-A Synechococcus’.
In comparison, Supplementary Table S5 shows a contig-level report from UBA6779. In this MAG, many of the contigs show high protein-level similarity to genus ‘Zunongwangia’ and species ‘Zunongwangia profunda (strain DSM 18752/CCTCC AB 206139/SM-A87)’; however, there are also contigs with high similarity to Salegentibacter and Leeuwenhoekiella, and, towards the bottom of the table, multiple contigs with low protein-level similarity to Gramella, Mesonia, Legionella and various others. Whilst many of these are from the same family (Flavobacteriaceae), there are certainly signs this is a chimeric MAG. Researchers would be advised to remove the discordant contigs from the MAG and re-analyze it.
We conclude that MAGpy represents a novel, useful and reproducible workflow that enables researchers to predict the closest relative to newly sequenced and assembled MAGs. MAGpy carries out extensive comparative genomics at the DNA and protein level, attempts to place MAGs within a phylogenetic tree, produces detailed reports and allows for the identification of potential chimeric MAGs.
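The contig-level check for potential chimeric MAGs described in this section can be sketched as follows. This is a simplified illustration rather than the MAGpy report code: the function names and input layout are hypothetical, the toy contigs are invented (only the genus names are borrowed from the UBA6779 example), and weighting the dominant genus by protein count is our assumption.

```python
from collections import Counter


def contig_report(contig_hits):
    """Per contig: number of proteins with hits and the most frequent genus.

    contig_hits: dict mapping contig name -> list of genus names,
    one per protein with a top hit. Contigs without hits are skipped.
    """
    report = {}
    for contig, genera in contig_hits.items():
        if not genera:
            continue
        genus, n = Counter(genera).most_common(1)[0]
        report[contig] = {
            "n_proteins": len(genera),
            "top_genus": genus,
            "top_genus_fraction": n / len(genera),
        }
    return report


def discordant_contigs(report):
    """Return the MAG-wide dominant genus and the contigs that disagree with it."""
    weights = Counter()
    for row in report.values():
        weights[row["top_genus"]] += row["n_proteins"]
    dominant = weights.most_common(1)[0][0]
    odd = sorted(c for c, row in report.items() if row["top_genus"] != dominant)
    return dominant, odd


# Invented toy MAG with four contigs.
toy_hits = {
    "contig_1": ["Zunongwangia"] * 5,
    "contig_2": ["Zunongwangia"] * 3 + ["Gramella"],
    "contig_3": ["Salegentibacter"] * 4,
    "contig_4": ["Legionella", "Mesonia", "Legionella"],
}
toy_report = contig_report(toy_hits)
dominant_genus, suspects = discordant_contigs(toy_report)
```

In this toy example `contig_3` and `contig_4` are listed as discordant. As noted above, genera from the same family can legitimately co-occur, so such a list is a prompt for inspection of the contig-level report rather than proof that a MAG is chimeric.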
Funding
This project was supported by the Biotechnology and Biological Sciences Research Council (BBSRC; BB/N016742/1, BB/N01720X/1), including institute strategic programme and national capability awards to The Roslin Institute (BBSRC: BB/P013759/1, BB/P013732/1, BB/J004235/1, BB/J004243/1); and by the Scottish Government (RESAS) as part of the 2016-2021 commission.
Conflict of Interest: none declared.
Rachel Mackelprang; Mark P Waldrop; Kristen M DeAngelis; Maude M David; Krystle L Chavarria; Steven J Blazewicz; Edward M Rubin; Janet K Jansson. Nature, 2011-11-06.
Eoin L Brodie; Todd Z Desantis; Dominique C Joyner; Seung M Baek; Joern T Larsen; Gary L Andersen; Terry C Hazen; Paul M Richardson; Donald J Herman; Tetsu K Tokunaga; Jiamin M Wan; Mary K Firestone. Appl Environ Microbiol, 2006-09.
D Vallenet; S Engelen; D Mornico; S Cruveiller; L Fleury; A Lajus; Z Rouy; D Roche; G Salvignol; C Scarpelli; C Médigue. Database (Oxford), 2009-11-25.
Doug Hyatt; Gwo-Liang Chen; Philip F Locascio; Miriam L Land; Frank W Larimer; Loren J Hauser. BMC Bioinformatics, 2010-03-08.
J Craig Venter; Karin Remington; John F Heidelberg; Aaron L Halpern; Doug Rusch; Jonathan A Eisen; Dongying Wu; Ian Paulsen; Karen E Nelson; William Nelson; Derrick E Fouts; Samuel Levy; Anthony H Knap; Michael W Lomas; Ken Nealson; Owen White; Jeremy Peterson; Jeff Hoffman; Rachel Parsons; Holly Baden-Tillson; Cynthia Pfannkoch; Yu-Hui Rogers; Hamilton O Smith. Science, 2004-03-04.
Tanya Yatsunenko; Federico E Rey; Mark J Manary; Indi Trehan; Maria Gloria Dominguez-Bello; Monica Contreras; Magda Magris; Glida Hidalgo; Robert N Baldassano; Andrey P Anokhin; Andrew C Heath; Barbara Warner; Jens Reeder; Justin Kuczynski; J Gregory Caporaso; Catherine A Lozupone; Christian Lauber; Jose Carlos Clemente; Dan Knights; Rob Knight; Jeffrey I Gordon. Nature, 2012-05-09.
Ramy K Aziz; Daniela Bartels; Aaron A Best; Matthew DeJongh; Terrence Disz; Robert A Edwards; Kevin Formsma; Svetlana Gerdes; Elizabeth M Glass; Michael Kubal; Folker Meyer; Gary J Olsen; Robert Olson; Andrei L Osterman; Ross A Overbeek; Leslie K McNeil; Daniel Paarmann; Tobias Paczian; Bruce Parrello; Gordon D Pusch; Claudia Reich; Rick Stevens; Olga Vassieva; Veronika Vonstein; Andreas Wilke; Olga Zagnitko. BMC Genomics, 2008-02-08.