GToTree: a user-friendly workflow for phylogenomics.

Abstract

SUMMARY: Genome-level evolutionary inference (i.e. phylogenomics) is becoming an increasingly essential step in many biologists' work. Accordingly, there are several tools available for the major steps in a phylogenomics workflow. But for the biologist whose main focus is not bioinformatics, much of the computational work required (such as accessing genomic data on large scales, integrating genomes from different file formats, performing required filtering and stitching different tools together) can be prohibitive. Here I introduce GToTree, a command-line tool that takes any combination of fasta files, GenBank files and/or NCBI assembly accessions as input and outputs an alignment file, estimates of genome completeness and redundancy, and a phylogenomic tree based on a specified single-copy gene (SCG) set. Although GToTree can work with any custom hidden Markov Models (HMMs), also included are 13 newly generated SCG-set HMMs for different lineages and levels of resolution, built based on searches of ∼12 000 bacterial and archaeal high-quality genomes. GToTree aims to give more researchers the capability to make phylogenomic trees.
AVAILABILITY AND IMPLEMENTATION: GToTree is open-source and freely available for download from: github.com/AstrobioMike/GToTree. It is implemented primarily in bash with helper scripts written in python. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 30865266 PMCID: PMC6792077 DOI: 10.1093/bioinformatics/btz188

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The number of sequenced genomes is increasing rapidly, largely through the recovery of metagenome-assembled genomes (e.g. Hug ; Parks ) and through the generation of single-cell amplified genomes (e.g. Berube ; Kashtan ). Phylogenomics (inferring genome-level evolutionary relationships) is becoming a fundamental step in many biologists’ work, for example in characterizing newly recovered genomes or in leveraging available reference genomes to guide evolutionary questions (Braakman ). There are several tools available for the major steps in a typical phylogenomics workflow, and at least one analysis platform that incorporates a phylogenomics workflow amid a larger infrastructure (anvi’o; Eren ). But a complete workflow focused solely on phylogenomics, enabling greater efficiency and scalability, and with flexibility with regard to input formats, is lacking. GToTree fills this void on three primary fronts: (i) it accepts as input any combination of fasta files, GenBank files and/or NCBI accessions, allowing integration of genomes from various sources and stages of analysis without any computational burden to the user; (ii) it automates required between-tool tasks such as filtering out hits by gene length, filtering out genomes with too few hits to a specified target gene-set, and swapping genome identifiers so resulting trees and alignments can be explored more easily; and (iii) it is scalable: GToTree can turn ∼1700 input genomes into a tree in 1 h on a standard laptop, and can optionally run many steps in parallel. This software gives more researchers the capability to create phylogenomic trees to aid in their work. At the time of publication, GToTree is primarily implemented in bash, but it will be converted entirely to python and be controlled by a more appropriate workflow language in the near future.

2 Description

2.1 Input

The required inputs to GToTree are (i) any combination of fasta files, GenBank files and/or NCBI assembly accessions, and (ii) a hidden Markov model (HMM) file with the target genes. The HMM file can be custom or one of the 13 included HMM files covering varying breadths of diversity (discussed below). Optionally, the user can also provide a mapping file that pairs specific input genome IDs with the labels they would like to have displayed in the final alignment and tree.

2.2 Processing

An overview of the GToTree workflow is presented in Figure 1 and detailed here:

Fig. 1.

Overview of general workflow and an example Tree of Life made with GToTree encompassing ∼1700 genomes from NCBI’s RefSeq using a universal SCG-set (Hug )

1. Retrieve coding sequences (CDSs) for input genomes, depending on the input source:
   (a) fasta files: identify CDSs with prodigal (Hyatt et al., 2010);
   (b) GenBank files: extract CDSs if annotated; if not, identify them with prodigal (Hyatt et al., 2010);
   (c) NCBI accessions: download amino acid sequences of CDSs if annotated; if not, download the assembly and identify CDSs with prodigal (Hyatt et al., 2010).
2. Identify target genes in all genomes with HMMER3 (Eddy, 2011) using pre-defined model cutoffs (--cut_ga). By default, if a genome has more than one hit to a target gene, no gene will be contributed to the alignment for that target gene from that genome.
3. Report estimates of genome completeness/redundancy using the information from the HMM search (see Supplementary Note S1).
4. Filter out potentially spurious gene-hits based on length, and genomes based on the fraction of target genes detected.
5. Align each gene-set with MUSCLE (Edgar, 2004), perform automated trimming with trimAl (Gutiérrez ), and concatenate all.
6. Optionally add custom genome labels or lineages (for any genomes that have taxids associated with them, whether from an NCBI accession or found in provided GenBank files; utilizes TaxonKit; Shen and Xiong, 2019).
7. Generate the tree. Currently supported are FastTree (Price ; note: FastTree does not enable incorporation of a specified root in tree generation) and IQ-TREE (Nguyen ; IQ-TREE does enable the incorporation of a specified root).
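The hit-filtering logic described above (single-hit rule, length filter, genome filter) can be illustrated with a minimal Python sketch. This is not GToTree's own code: the function names, the input data structure, and the thresholds (`tolerance=0.2` around the median length, `min_fraction=0.5` of target genes) are assumptions chosen for illustration, since the text does not state the cutoffs used.

```python
from statistics import median


def single_copy_hits(hits):
    """Keep, per genome, only target genes that have exactly one hit.

    hits: {genome: {gene: [protein sequences that hit the gene's HMM]}}
    """
    return {
        genome: {gene: seqs[0] for gene, seqs in genes.items() if len(seqs) == 1}
        for genome, genes in hits.items()
    }


def filter_by_length(kept, tolerance=0.2):
    """Drop hits whose length differs from the gene's median length by more
    than `tolerance` (a hypothetical cutoff, as a fraction of the median)."""
    lengths = {}
    for genes in kept.values():
        for gene, seq in genes.items():
            lengths.setdefault(gene, []).append(len(seq))
    medians = {gene: median(vals) for gene, vals in lengths.items()}
    return {
        genome: {
            gene: seq
            for gene, seq in genes.items()
            if abs(len(seq) - medians[gene]) <= tolerance * medians[gene]
        }
        for genome, genes in kept.items()
    }


def filter_genomes(kept, target_genes, min_fraction=0.5):
    """Drop genomes with hits to too small a fraction of the target genes
    (`min_fraction` is a hypothetical cutoff)."""
    return {
        genome: genes
        for genome, genes in kept.items()
        if len(genes) / len(target_genes) >= min_fraction
    }


if __name__ == "__main__":
    # Toy data: g2 has two hits to gene A; g3 has a suspiciously short hit to B.
    toy_hits = {
        "g1": {"A": ["MKV" * 10], "B": ["MLT" * 10]},
        "g2": {"A": ["MKV" * 10, "MKI" * 10], "B": ["MLT" * 10]},
        "g3": {"A": ["MKV" * 10], "B": ["ML"]},
    }
    kept = filter_by_length(single_copy_hits(toy_hits))
    print(filter_genomes(kept, ["A", "B"], min_fraction=0.6))
```

In this toy example the multi-hit gene and the short hit are discarded first, so the genome filter then acts on what remains for each genome.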

2.3 Outputs

The primary outputs from GToTree include the full alignment file (fasta), the tree file (newick), and tab-delimited summary tables with information on all genomes and individual ones for each genome input source. Additionally, outputs include report files on filtered or problematic genes/genomes.

2.4 SCG-set generation

All 17 929 Pfam (protein families; El-gebali ) HMM profiles from release 32.0 (accessed in December 2018) were downloaded from the Pfam ftp site (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/). As Pfam HMMs actually target specific domains or protein regions, there are many unique Pfam entries that come from the same functional protein, e.g. Enolase_N (PF03952) and Enolase_C (PF00113). This is not ideal if using them to search for single-copy genes (SCGs) for purposes such as phylogenomics or completion/redundancy estimates. To ensure no two Pfam HMMs from the same protein were contained in a SCG-set, only Pfams with HMMs that on average covered >50% of the underlying protein sequences that went into building that Pfam’s HMM were retained. This left 8924 Pfams. To identify target SCGs, amino-acid CDSs of all ‘complete’ genomes with annotations in NCBI were downloaded for bacteria (n = 11 405; accessed December 9, 2018) and archaea (n = 309; accessed December 15, 2018). (‘Complete’ is a specific classification of genome quality assigned by NCBI; see Supplementary Material Note S2.) All protein sequences were searched against the 8924 filtered Pfam HMMs with ‘hmmsearch’ (HMMER v3.2.1; Eddy, 2011) with default settings other than specifying the ‘--cut_ga’ flag to utilize the gathering thresholds stored in the curated Pfam models. Reported protein hits for each individual Pfam were tallied for each individual genome (Supplementary Table S1; available at figshare.com/articles/Supp_Table_1/7562453). SCG-sets were generated for all Bacteria, all Archaea, and then for each bacterial phylum that held >99 genomes, and each proteobacterial class that had >99 genomes. For each of those taxonomic groups, Pfams that had exactly 1 hit in at least 90% of the genomes of that group were retained as the SCG-set for that group.
The counts for HMM hits for all genomes assayed are presented in Supplementary Table S2, and the code used to generate the bacterial SCG-set as an example is presented here: github.com/AstrobioMike/GToTree/wiki/SCG-sets.
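The two selection rules described in this section (retain Pfams whose HMM covers on average >50% of the underlying proteins, then retain Pfams with exactly one hit in at least 90% of a group's genomes) can be sketched in Python as follows. This is an illustrative sketch only, not the code linked above: the function names, the dictionary-based inputs, and the placeholder identifiers such as `PF_A` are assumptions made for the example.

```python
def filter_by_coverage(avg_coverage, min_coverage=0.5):
    """Return Pfam IDs whose HMM covers, on average, more than `min_coverage`
    of the protein sequences used to build it.

    avg_coverage: {pfam_id: mean fraction of protein length covered}
    """
    return sorted(p for p, cov in avg_coverage.items() if cov > min_coverage)


def select_scg_set(hit_counts, min_fraction=0.9):
    """Return Pfam IDs with exactly one hit in at least `min_fraction`
    of the genomes in a taxonomic group.

    hit_counts: {genome: {pfam_id: number of hits in that genome}}
    """
    n_genomes = len(hit_counts)
    if n_genomes == 0:
        return []
    single_copy_tally = {}
    for counts in hit_counts.values():
        for pfam, n_hits in counts.items():
            if n_hits == 1:
                single_copy_tally[pfam] = single_copy_tally.get(pfam, 0) + 1
    return sorted(
        pfam
        for pfam, tally in single_copy_tally.items()
        if tally / n_genomes >= min_fraction
    )


if __name__ == "__main__":
    # Made-up counts for ten genomes and three placeholder Pfams.
    toy_counts = {
        f"genome_{i}": {
            "PF_A": 1,                    # single-copy everywhere
            "PF_B": 1 if i < 9 else 2,    # duplicated in one genome
            "PF_C": 1 if i < 8 else 0,    # missing from two genomes
        }
        for i in range(10)
    }
    print(select_scg_set(toy_counts))
```

Applying `select_scg_set` separately to the genomes of each taxonomic group would give one SCG-set per group, mirroring the per-group procedure described in the text.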

3 Results

To exemplify GToTree, NCBI assembly accessions were downloaded for all RefSeq, complete, representative genomes (with the search query ‘“latest refseq”[filter] AND “complete genome”[filter] AND “representative genome”[filter] AND all[filter] NOT anomalous[filter]’ performed on December 20, 2018). This resulted in 1698 genomes spanning Archaea, Bacteria and Eukarya (please see Supplementary Material Note S3 on including Eukaryotes with GToTree). Using a SCG-set that spans all three domains (Hug ), runtime to create this tree (Fig. 1) was ∼60 min on a standard laptop (a late-2013 MacBook Pro). The tree was visualized by uploading the output newick file to the web-hosted Interactive Tree of Life (Letunic and Bork, 2016); all code to generate it and the results files come packaged with GToTree.

14 in total

1. Single-cell genomics reveals hundreds of coexisting subpopulations in wild Prochlorococcus.

Authors: Nadav Kashtan; Sara E Roggensack; Sébastien Rodrigue; Jessie W Thompson; Steven J Biller; Allison Coe; Huiming Ding; Pekka Marttinen; Rex R Malmstrom; Roman Stocker; Michael J Follows; Ramunas Stepanauskas; Sallie W Chisholm
Journal: Science Date: 2014-04-25 Impact factor: 47.728

2. Metabolic evolution and the self-organization of ecosystems.

Authors: Rogier Braakman; Michael J Follows; Sallie W Chisholm
Journal: Proc Natl Acad Sci U S A Date: 2017-03-27 Impact factor: 11.205

3. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life.

Authors: Donovan H Parks; Christian Rinke; Maria Chuvochina; Pierre-Alain Chaumeil; Ben J Woodcroft; Paul N Evans; Philip Hugenholtz; Gene W Tyson
Journal: Nat Microbiol Date: 2017-09-11 Impact factor: 17.745

4. Accelerated Profile HMM Searches.

Authors: Sean R Eddy
Journal: PLoS Comput Biol Date: 2011-10-20 Impact factor: 4.475

5. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies.

Authors: Lam-Tung Nguyen; Heiko A Schmidt; Arndt von Haeseler; Bui Quang Minh
Journal: Mol Biol Evol Date: 2014-11-03 Impact factor: 16.240

6. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees.

Authors: Ivica Letunic; Peer Bork
Journal: Nucleic Acids Res Date: 2016-04-19 Impact factor: 16.971

7. Single cell genomes of Prochlorococcus, Synechococcus, and sympatric microbes from diverse marine environments.

Authors: Paul M Berube; Steven J Biller; Thomas Hackl; Shane L Hogle; Brandon M Satinsky; Jamie W Becker; Rogier Braakman; Sara B Collins; Libusha Kelly; Jessie Berta-Thompson; Allison Coe; Kristin Bergauer; Heather A Bouman; Thomas J Browning; Daniele De Corte; Christel Hassler; Yotam Hulata; Jeremy E Jacquot; Elizabeth W Maas; Thomas Reinthaler; Eva Sintes; Taichi Yokokawa; Debbie Lindell; Ramunas Stepanauskas; Sallie W Chisholm
Journal: Sci Data Date: 2018-09-04 Impact factor: 6.444

8. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses.

Authors: Salvador Capella-Gutiérrez; José M Silla-Martínez; Toni Gabaldón
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

9. MUSCLE: a multiple sequence alignment method with reduced time and space complexity.

Authors: Robert C Edgar
Journal: BMC Bioinformatics Date: 2004-08-19 Impact factor: 3.169

10. The Pfam protein families database in 2019.

Authors: Sara El-Gebali; Jaina Mistry; Alex Bateman; Sean R Eddy; Aurélien Luciani; Simon C Potter; Matloob Qureshi; Lorna J Richardson; Gustavo A Salazar; Alfredo Smart; Erik L L Sonnhammer; Layla Hirsh; Lisanna Paladin; Damiano Piovesan; Silvio C E Tosatto; Robert D Finn
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971
