Literature DB >> 32426808

MetaSanity: an integrated microbial genome evaluation and annotation pipeline.

Christopher J Neely¹, Elaina D Graham¹, Benjamin J Tully^1,2.

Abstract

SUMMARY: As the importance of microbiome research continues to become more prevalent and essential to understanding a wide variety of ecosystems (e.g. marine, built, host associated, etc.), there is a need for researchers to be able to perform highly reproducible and quality analysis of microbial genomes. MetaSanity incorporates analyses from 11 existing and widely used genome evaluation and annotation suites into a single, distributable workflow, thereby decreasing the workload of microbiologists by allowing for a flexible, expansive data analysis pipeline. MetaSanity has been designed to provide separate, reproducible workflows that (i) can determine the overall quality of a microbial genome, while providing a putative phylogenetic assignment, and (ii) can assign structural and functional gene annotations with varying degrees of specificity to suit the needs of the researcher. The software suite combines the results from several tools to provide broad insights into overall metabolic function. Importantly, this software provides built-in optimization for 'big data' analysis by storing all relevant outputs in an SQL database, allowing users to query all the results for the elements that will most impact their research.
AVAILABILITY AND IMPLEMENTATION: MetaSanity is provided under the GNU General Public License v.3.0 and is available for download at https://github.com/cjneely10/MetaSanity. This application is distributed as a Docker image. MetaSanity is implemented in Python3/Cython and C++. Instructions for its installation and use are available within the GitHub wiki page at https://github.com/cjneely10/MetaSanity/wiki, and additional instructions are available at https://cjneely10.github.io/year-archive/. MetaSanity is optimized for users with limited programing experience. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical Disease Species

Year: 2020 PMID： 32426808 PMCID： PMC7520038 DOI： 10.1093/bioinformatics/btaa512

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

The analysis of microbial genomes has become an increasingly common task for many fields of biology and geochemistry. Researchers can routinely generate hundreds/thousands of environmentally derived microbial genomes using methodologies, such as metagenomics (Tully ), high-throughput culturing (Thrash ) and single-cell sorting (Stepanauskas ). However, analyzing the data can be problematic, as data analysis is computationally intensive and requires a knowledge of software that is constantly changing and may be difficult to install or execute. For the average researcher, the task of evaluating and annotating a set of microbial genomes may be time intensive and computationally rigorous. Here, we present MetaSanity, a comprehensive solution for generating evaluation and annotation pipelines for bacterial and archaeal isolate genomes, metagenome-assembled genomes (MAGs) and single-amplified genomes (SAGs). MetaSanity provides genome quality evaluation, phylogenetic assignment, as well as structural and functional annotation through a variety of integrated programs based on the procedure described in Tully (2019). By providing users with the ability to create annotation pipelines using multiple annotation suites, MetaSanity achieves a broad level of functional annotation that may be missed by implementing a single annotation program. MetaSanity provides a workflow that combines all outputs into a single queryable database that operates easily from the command line. Installation can be performed at the user level, limiting the need for intervention by system administrators, and, except for certain memory intensive programs, can be run locally on high-end personal computers.

2 Materials and methods

MetaSanity consists of two smaller workflows (Fig. 1): (i) PhyloSanity, to evaluate the completion, contamination, redundancy and phylogeny of each genome in a dataset, and (ii) FuncSanity, to provide structural and functional annotations of each genome. Each component consists of several optional applications that can be modified to specific research needs. While each component contained within the two pipelines runs independently and generates component-specific outputs, MetaSanity combines all outputs into a single queryable SQL database that allows fast and easy retrieval of data—in this case, gene annotations and other related genomic data. MetaSanity focuses on allowing users the ability to fine tune and adjust their data analysis pipelines with minimal effort and maximize computational and storage efficiency (Supplementary Table S1). MetaSanity is distributed as a Docker image (Merkel, 2014) and is implemented using a combination of Python3 (Python Software Foundation (2014)/Cython (Bradshaw ) and C++ (ISO/IEC 2014). MetaSanity is also available as a source code installation for systems that do not support Docker.

Fig. 1.

MetaSanity pipeline schema. Programs and databases that are part of the MetaSanity installation are in blue boxes. Programs in the dotted orange boxes must be installed separately by the user due to licensing agreements. DB, database. (Color version of this figure is available at Bioinformatics online.)

2.1 Phylosanity

PhyloSanity is designed to provide metrics of genome quality and to filter genomes for downstream analysis based on user-defined quality metrics. The workflow integrates CheckM v1.0.18 (Parks ), GTDB-Tk v.0.3.2 (Parks ) and FastANI (Jain ) as part of its evaluation pipeline. CheckM estimates the completion and contamination of each genome (Parks ). Next, FastANI compares each genome in a pairwise fashion against all other genomes to determine the average nucleotide identify (ANI) for each genome pair (Jain ). For any set of genomes that shares an ANI above a user-defined value, a non-redundant genome representative will be selected from the set that is the most complete and least contaminated. This allows users the option to exclude redundant genomes from further analysis. Differentiating genomes as non-redundant versus redundant can be useful for researchers working with MAGs or SAGs that are generated from replicate samples and may not have biological meaning when working with isolates or strain level differences. All genomes can undergo phylogenetic assignment based on relative evolutionary distance (Parks ) through GTDB-Tk (Pierre-Alain ), which will replace the CheckM-returned taxonomic assignment.

2.2 Funcsanity

FuncSanity provides structural and functional annotation of microbial genomes. The workflow incorporates annotation suites from eight existing and widely used programs. The use of multiple annotation programs has the advantage of capturing functional predictions that may not have been detected due to database or search limitations. Specialized annotation programs, such as VirSorter v1.0.5 (Roux ), use custom tools and/or databases to return relevant annotations that are not captured by other programs in MetaSanity. Open reading frames (ORFs) are predicted using Prodigal v2.6.3 (Hyatt ); however, users may opt to use the putative coding DNA sequences (CDS) generated by Prokka v1.13.3 (Seemann, 2014). From here, putative ORFs are processed by a set of annotation tools that can be selected by the user with user-defined filtering and cut-off values.

2.2.1 Kyoto Encyclopedia of Genes and Genomes annotation

Putative ORFs can be searched against the KofamKOALA database using KofamScan v.1.1.0 (Aramaki ). Default parameters are used, and the ‘mapper’ tab-delimited output option is generated, linking ORF IDs to Kyoto Encyclopedia of Genes and Genomes (KEGG) Ontology (KO) IDs. Users can query any KO ID to generate specific functional search results in BioMetaDB.

2.2.2 KEGG-Decoder

KEGG annotations can be used to estimate the completeness of various biogeochemically relevant metabolic pathways in a genome using KEGG-Decoder v.1.0.10 (Graham ; https://github.com/bjtully/BioData/tree/master/KEGGDecoder). Users can search genomes based on completeness of a pathway or function of interest. An additional heatmap summary visualization is generated.

2.2.3 Virsorter

VirSorter v1.0.5 (Roux ) can be implemented to identify phage and prophage signatures in each genome using default parameters. Users can search for matches to each of the phage and prophage categories returned by VirSorter and generate lists of contigs and/or genomes with the assignments (Supplementary Table S1).

2.2.4 Interproscan

InterProScan 5.36-75.0 (Jones ) is an optional installation and can be used for domain prediction on putative ORFs. Users have the option of downloading all of the InterProScan databases, including TIGRfam (Haft ), Pfam (Finn ), CDD (Marchler-Bauer , Zhang et al., 2017) and PANTHER (Mi ). Each InterProScan database result is indexed separately in BioMetaDB and can be used to return matching genomes using database specific IDs (e.g. PF01036 would return putative rhodopsin ORFs from a Pfam result).

2.2.5 Prokka annotation

If not chosen as the option for structural annotation, genomes can be annotated using Prokka and its associated databases with the parameters –addgenes (adds the ‘gene’ feature to each CDS in the GenBank output format), –addmrna (adds the ‘mRNA’ feature to each CDS in the GenBank output format), –usegenus (use the genus-specific databases), –metagenome (improve gene predictions for fragmented genomes) and –rnammer [sets RNAmmer as the preferred rRNA prediction tool instead of Barrnap (http://www.vicbioinformatics.com/software.barrnap.shtml)]. rRNA identification with RNAmmer v.1.2 (Lagesen ) and signal peptide detection with SignalP v.4.1 (Nielsen, 2017) are optional installations. Users are given the option to use Prokka-derived putative CDS, which take into consideration the locations of RNA gene sequences (tRNA, rRNA, tmRNA and ncRNA) in their downstream analysis. These putative proteins are used in further analysis instead of the Prodigal-derived CDS, which are not aware of RNA boundaries.

2.2.6 Carbohydrate-active enzyme annotation

Putative ORFs can be assigned a putative carbohydrate-active enzyme (CAZy) functionality (Cantarel ) based on the dbCANv2 database (Zhang ). ORFs are searched against dbCANv2 using HMMER v3.1b2 (Eddy, 2011) with the minimum score threshold set to 75 (-T parameter).

2.2.7 Peptidase annotation

Putative ORFs can be assigned to a peptidase family using a set of HMMs that represent the MEROPS database (Rawlings ). PSORTb v.3.0 (Yu ) and SignalP can be optionally performed on MEROPS matches to determine if a putative enzyme is predicted to be extracellular. An extracellular assignment is made if PSORTb predicts ‘extracellular’ or ‘outer membrane’ localization or if PSORTb returns ‘unknown’ localization and SignalP predicts the presence of a signal peptide. Users can search for genes and genomes based on overall MEROPS annotations or by searching for specific designations (e.g. GT41 for glycosyl transferase family 41). InterProScan, SignalP and RNAmmer are not automatically distributed with MetaSanity and require users to download their binaries separately and agree to their individual license requirements.

2.2.8 Configuration adjustments

Each run of MetaSanity can be adjusted to the needs of the genome(s) being analyzed. One of the required parameters for MetaSanity if a configuration (config) file that sets which analyses will be performed on a genome or set of genomes, including optional settings (e.g. number of CPU threads, etc.). Config files allow users to determine specific pipelines of analyses, as needed, and provide a record which can be used for future genome sets for rapid re-deployment of analyses.

2.3 Biometadb

MetaSanity outputs consist of tab-delimited descriptions of genomic data, which can easily be analyzed by a variety of external tools. We additionally provide BioMetaDB, a specialized relational database management system tool that integrates modularized storage and retrieval of FASTA records with the metadata describing them. This application uses tab-delimited data files to generate table relation schemas via Python3. Based on SQLAlchemy v.1.3.7 (Bayer, 2012), BioMetaDB allows researchers to efficiently manage data from the command line by providing operations that include: (i) the ability to store information from any valid tab-delimited data file and to quickly retrieve FASTA records or annotations related to these datasets by using SQL-optimized command-line queries; and, (ii) the ability to run all CRUD operations (create, read, update, delete) from the command line and from python scripts. Output from both workflows is stored into a BioMetaDB project, providing users a simple interface to comprehensively examine their data (Supplementary Table S2). Users can query application results used across the entire genome set for specific information that is relevant to their research, allowing the potential to screen genomes based on returned taxonomy, quality, annotation, putative metabolic function or any combination thereof. More information on using BioMetaDB is available via the BioMetaDB GitHub (https://github.com/cjneely10/BioMetaDB).

3 Results

MetaSanity was tested on two separate systems—a personal computer with an Intel core i5-4570 CPU @ 3.20 GHz processor with 4 cores and 32 GB of RAM operating the Ubuntu 18.04.3 LTS Linux distribution, and an academic server with an Intel Xeon E7-4850 v2 @ 2.30 GHz processor with 96 cores and 1 TB of RAM operating the Ubuntu 18.04.3 LTS Linux distribution. Reduced options were calculated on the personal computer using all four available threads and preset parameter flags that skip memory-intensive processes. Complete options were calculated on the academic server using 10 threads and no parameters to reduce memory usage. Runtime results are available in Supplementary Table S3. The current architecture relies on sequential completion of time intensive processes, several of which are optional for users. Ongoing modifications that take advantage of parallelizing these processes should decrease the overall computation time. Click here for additional data file.

25 in total

1. PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes.

Authors: Nancy Y Yu; James R Wagner; Matthew R Laird; Gabor Melli; Sébastien Rey; Raymond Lo; Phuong Dao; S Cenk Sahinalp; Martin Ester; Leonard J Foster; Fiona S L Brinkman
Journal: Bioinformatics Date: 2010-05-13 Impact factor: 6.937

2. Protocol Update for large-scale genome and gene function analysis with the PANTHER classification system (v.14.0).

Authors: Huaiyu Mi; Anushya Muruganujan; Xiaosong Huang; Dustin Ebert; Caitlin Mills; Xinyu Guo; Paul D Thomas
Journal: Nat Protoc Date: 2019-02-25 Impact factor: 13.491

3. Accelerated Profile HMM Searches.

Authors: Sean R Eddy
Journal: PLoS Comput Biol Date: 2011-10-20 Impact factor: 4.475

4. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes.

Authors: Donovan H Parks; Michael Imelfort; Connor T Skennerton; Philip Hugenholtz; Gene W Tyson
Journal: Genome Res Date: 2015-05-14 Impact factor: 9.043

5. InterProScan 5: genome-scale protein function classification.

Authors: Philip Jones; David Binns; Hsin-Yu Chang; Matthew Fraser; Weizhong Li; Craig McAnulla; Hamish McWilliam; John Maslen; Alex Mitchell; Gift Nuka; Sebastien Pesseat; Antony F Quinn; Amaia Sangrador-Vegas; Maxim Scheremetjew; Siew-Yit Yong; Rodrigo Lopez; Sarah Hunter
Journal: Bioinformatics Date: 2014-01-21 Impact factor: 6.937

6. Potential for primary productivity in a globally-distributed bacterial phototroph.

Authors: E D Graham; J F Heidelberg; B J Tully
Journal: ISME J Date: 2018-03-09 Impact factor: 10.302

7. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database.

Authors: Pierre-Alain Chaumeil; Aaron J Mussig; Philip Hugenholtz; Donovan H Parks
Journal: Bioinformatics Date: 2019-11-15 Impact factor: 6.937

8. The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics.

Authors: Brandi L Cantarel; Pedro M Coutinho; Corinne Rancurel; Thomas Bernard; Vincent Lombard; Bernard Henrissat
Journal: Nucleic Acids Res Date: 2008-10-05 Impact factor: 16.971

9. The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans.

Authors: Benjamin J Tully; Elaina D Graham; John F Heidelberg
Journal: Sci Data Date: 2018-01-16 Impact factor: 6.444

10. Improved genome recovery and integrated cell-size analyses of individual uncultured microbial cells and viral particles.

Authors: Ramunas Stepanauskas; Elizabeth A Fergusson; Joseph Brown; Nicole J Poulton; Ben Tupper; Jessica M Labonté; Eric D Becraft; Julia M Brown; Maria G Pachiadaki; Tadas Povilaitis; Brian P Thompson; Corianna J Mascena; Wendy K Bellows; Arvydas Lubys
Journal: Nat Commun Date: 2017-07-20 Impact factor: 14.919

6 in total

1. Metagenome-Assembled Genome Sequence of Kapabacteriales Bacterium Strain Clear-D13, Assembled from a Harmful Algal Bloom Enrichment Culture.

Authors: Shaikha Al-Saud; Kyra M Florea; Eric A Webb; J Cameron Thrash
Journal: Microbiol Resour Announc Date: 2020-12-03

2. Metagenome-Assembled Genome Sequence of Vulcanococcus sp. Strain Clear-D1, Assembled from a Cyanobacterial Enrichment Culture.

Authors: Alexis E Floback; Kyra M Florea; J Cameron Thrash; Eric A Webb
Journal: Microbiol Resour Announc Date: 2020-12-03

3. Metagenome-Assembled Genome Sequence of Rhodobacteraceae Bacterium Strain Clear-D3, Assembled from a Dolichospermum Consortium from Clear Lake, California.

Authors: Chuankai Cheng; Kyra M Florea; Eric A Webb; J Cameron Thrash
Journal: Microbiol Resour Announc Date: 2020-12-03

4. Metagenome-Assembled Genome Sequence of Dolichospermum circinale Strain Clear-D4, Assembled from a Harmful Cyanobacterial Bloom Enrichment Culture.

Authors: Kyra M Florea; J Cameron Thrash; Eric A Webb
Journal: Microbiol Resour Announc Date: 2020-12-03

5. Metagenome-Assembled Genome Sequence of Aphanizomenon flos-aquae Strain Clear-A1, Assembled from an Enrichment Culture.

Authors: Tiffany Yemenian; Kyra M Florea; J Cameron Thrash; Eric A Webb
Journal: Microbiol Resour Announc Date: 2020-12-03

6. Diversity and prevalence of ANTAR RNAs across actinobacteria.

Authors: Dolly Mehta; Arati Ramesh
Journal: BMC Microbiol Date: 2021-05-29 Impact factor: 3.605

6 in total