Literature DB >> 28057682

phyloSkeleton: taxon selection, data retrieval and marker identification for phylogenomics.

Abstract

Summary: With the wealth of available genome sequences, a difficult and tedious part of inferring phylogenomic trees is now to select genomes with an appropriate taxon density in the different parts of the tree. The package described here offers tools to easily select the most representative organisms, following a set of simple rules based on taxonomy and assembly quality, to retrieve the genomes from public databases (NCBI, JGI), to annotate them if necessary, to identify given markers in these, and to prepare files for multiple sequence alignment. Availability and Implementation: phyloSkeleton is a Perl module and is freely available under GPLv3 at https://bitbucket.org/lionelguy/phyloskeleton/ . Contact: lionel.guy@imbim.uu.se.

Entities: Species

Mesh：

Substances：
Genetic Markers

Year: 2017 PMID： 28057682 PMCID： PMC5408842 DOI： 10.1093/bioinformatics/btw824

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Many studies (e.g. Pick ) have highlighted the importance of the effect of taxon sampling on recovering the correct species tree with phylogenomics methods, although the relative benefits of adding more taxa or more genes has been hotly debated (e.g. Philippe ; Rosenberg and Kumar, 2003; Zwickl and Hillis, 2002). Thanks to the wealth of publicly available sequence data, the affordability of DNA sequencing, and the rise of single-cell and metagenomics methods, the bottleneck in establishing phylogenies is now computational. To keep phylogenetic inference tractable, researchers face a trade-off between number of taxa and number of marker genes to include. One of the frequent aims of phylogenomic studies is to place novel, unknown organisms in their phylogenetic context, using a backbone tree composed of well-known organisms. To achieve this, or to resolve a particular region of the tree of life, a common solution is to obtain a denser sampling close to the nodes of interest, and less dense further away. In practice, this is often achieved by a recursive process: Select representative genomes to include. Retrieve the corresponding genomes and, if available, proteomes or else annotate the genomes. Identify orthologs of selected marker genes in each proteome. Align the sequences for each marker separately. Concatenate the alignment, tracking protein names and ids. Infer a phylogeny. Upon tree inspection, if the density of taxa at the place of interest is not good enough, go back to (1). The whole process is often repeated many times: first, trees are computed with faster phylogenetic methods, e.g. FastTree (Price ), until the right sampling density is achieved at the right place, when more sensitive algorithms like RAxML (Stamatakis, 2014) or PhyloBayes (Rodrigue and Lartillot, 2014) can be used. Identifying orthologs (step 3) is a difficult algorithmic problem, and accurately excluding paralogous sequences generally requires visual inspection of each single-gene tree. Multiple sequence alignment (step 4) and phylogenetic inference (step 6) are the most computationally intensive and are often the bottleneck of the analysis. However, the other steps are often tedious, requiring long hands-on time searching databases and keeping track of protein and organisms names, slowing down the whole process. Step 1, selecting representative genomes, can be especially problematic when sampling well-studied taxa (e.g. the Enterobacteriaceae) counting thousands of available genomes. Many software packages aim at automating different parts of the phylogenomics process, but to the best of my knowledge, none covers it all or addresses the automated selection of representative taxa. For example, Agalma (Dunn ) automates the annotation of transcriptome data, the alignment of homologous marker sets and performs a preliminary phylogeny, but relies on data provided by the user. BIR (Kumar ) and PhlyoTreePruner (Kocot ) attempt to automatically identify orthologs, but they require the user to provide curated sets of markers or single-gene trees, respectively. Phyla-AMPHORA (Wang and Wu, 2013) gathers phylum-level markers, but does not provide means to vary the density of taxon sampling. It would be interesting to include phylum-level alignments in the phyloSkeleton pipeline, but unfortunately, Phyla-AMPHORA has not been updated since its publication. PhyloSift (Darling ) and CheckM (Parks ), intended for metagenomics datasets, have the possibility to place sequences into a fixed reference backbone tree. MicrobeDB (Langille ) allows the user to maintain a local database of publicly available and own genome sequences, easing the burden of maintaining a backbone tree. The purpose of phyloSkeleton is to automate steps 1 and 2, picking up the best representative at a variable density as decided by the user, retrieving the genomic data, annotating genes if necessary; it also facilitates step 3, identifying orthologs, by automatically preparing single-gene trees. Finally, phyloSkeleton automates the concatenation of single-gene alignments, and provides useful tools to visualize trees. So far, phyloSkeleton is aimed primarily at prokaryotic genomes, but could potentially be used for eukaryotic ones.

2 Methods

2.1 Selection of representative genomes

The user first retrieves lists of available genomes from NCBI (Genbank) and, optionally, from the Joint Genomic Institute (JGI; IMG database). The selection of representative genomes is based on a set of simple taxonomic rules: at a specific higher level (e.g. class), select one representative per lower level (e.g. genus). These rules can be combined to achieve the right sampling density at the right place, for example sampling at species level in the order of interest, and at class level in the selected outgroup phylum. The user also has the possibility to add their own data. The selection algorithm selects the best representative, first preferring (i) a reference or (ii) a representative genome if they are available in NCBI’s Entrez genome collection. Then, the assembly level and the source are considered, looking in decreasing order of preference at (i) complete or (ii) chromosome level assemblies at NCBI, (iii) finished projects at JGI, (iv) chromosome with gaps, (v) scaffolds or (vi) contigs assembly level at NCBI and at (vi) permanent draft or (vii) draft at JGI. Lastly, the largest genomes are favored. The genomes and, eventually, the proteomes of the selected representatives are automatically retrieved from NCBI and/or JGI. Genomes for which no proteome is available are annotated with prodigal (Hyatt ) or prokka (Seemann, 2014).

2.2 Marker selection

All genomes are screened for marker genes that will be used for the concatenated phylogeny. The user provides a set of HMM profiles corresponding to these markers. Three generic sets, one consisting of 15 ribosomal protein genes, one bacteria- and one archaea-specific (Rinke ), are shipped with the software. HMMER (Eddy, 2011) is used to identify the best matches. If there is more than one significant match per proteome, a warning is raised and a single-gene phylogeny for this specific marker is prepared, to help the user selecting the correct paralog.

2.3 Other tools

After marker identification, phyloSkeleton gathers the protein sequences in fasta files, and, upon alignment, concatenate the alignments. It also contains scripts to facilitate analyzing large trees by adding colors and group names, in conjunction with FigTree (Andrew Rambaut, http://tree.bio.ed.ac.uk/software/figtree/).

3 Conclusion

PhyloSkeleton gathers genome sequences to infer a phylogenetic tree with variable taxon sampling density, following simple rules based on taxonomy and genome assembly quality. It is especially useful to place a novel, unknown organism in a backbone tree, or to resolve a particular region of a large tree, or to explore the monophyly of certain taxa. It allows the user to quickly perform many iterations of the phylogenomic process: changing the selection rules to modify taxon sampling density is the only manual step once the initial run has completed. The software, released under GPLv3, comes with a comprehensive manual, a complete tutorial and a test data set, available at https://bitbucket.org/lionelguy/phyloskeleton.

18 in total

1. Taxon sampling, bioinformatics, and phylogenomics.

Authors: Michael S Rosenberg; Sudhir Kumar
Journal: Syst Biol Date: 2003-02 Impact factor: 15.683

2. FastTree 2--approximately maximum-likelihood trees for large alignments.

Authors: Morgan N Price; Paramvir S Dehal; Adam P Arkin
Journal: PLoS One Date: 2010-03-10 Impact factor: 3.240

3. A phylum-level bacterial phylogenetic marker database.

Authors: Zhang Wang; Martin Wu
Journal: Mol Biol Evol Date: 2013-03-21 Impact factor: 16.240

4. Prokka: rapid prokaryotic genome annotation.

Authors: Torsten Seemann
Journal: Bioinformatics Date: 2014-03-18 Impact factor: 6.937

5. Improved phylogenomic taxon sampling noticeably affects nonbilaterian relationships.

Authors: K S Pick; H Philippe; F Schreiber; D Erpenbeck; D J Jackson; P Wrede; M Wiens; A Alié; B Morgenstern; M Manuel; G Wörheide
Journal: Mol Biol Evol Date: 2010-04-08 Impact factor: 16.240

6. MicrobeDB: a locally maintainable database of microbial genomic sequences.

Authors: Morgan G I Langille; Matthew R Laird; William W L Hsiao; Terry A Chiu; Jonathan A Eisen; Fiona S L Brinkman
Journal: Bioinformatics Date: 2012-05-09 Impact factor: 6.937

Review 7. BIR Pipeline for Preparation of Phylogenomic Data.

Authors: Surendra Kumar; Anders K Krabberød; Ralf S Neumann; Katerina Michalickova; Sen Zhao; Xiaoli Zhang; Kamran Shalchian-Tabrizi
Journal: Evol Bioinform Online Date: 2015-04-27 Impact factor: 1.625

8. Site-heterogeneous mutation-selection models within the PhyloBayes-MPI package.

Authors: Nicolas Rodrigue; Nicolas Lartillot
Journal: Bioinformatics Date: 2013-12-18 Impact factor: 6.937

9. Agalma: an automated phylogenomics workflow.

Authors: Casey W Dunn; Mark Howison; Felipe Zapata
Journal: BMC Bioinformatics Date: 2013-11-19 Impact factor: 3.169

10. PhyloSift: phylogenetic analysis of genomes and metagenomes.

Authors: Aaron E Darling; Guillaume Jospin; Eric Lowe; Frederick A Matsen; Holly M Bik; Jonathan A Eisen
Journal: PeerJ Date: 2014-01-09 Impact factor: 2.984

6 in total

1. Resurrecting ancestral genes in bacteria to interpret ancient biosignatures.

Authors: Betul Kacar; Lionel Guy; Eric Smith; John Baross
Journal: Philos Trans A Math Phys Eng Sci Date: 2017-12-28 Impact factor: 4.226

2. A Parasitic Arsenic Cycle That Shuttles Energy from Phytoplankton to Heterotrophic Bacterioplankton.

Authors: Stephen J Giovannoni; Kimberly H Halsey; Jimmy Saw; Omran Muslin; Christopher P Suffridge; Jing Sun; Chih-Ping Lee; Eric R Moore; Ben Temperton; Stephen E Noell
Journal: mBio Date: 2019-03-19 Impact factor: 7.867

3. Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT.

Authors: F A Bastiaan von Meijenfeldt; Ksenia Arkhipova; Diego D Cambuy; Felipe H Coutinho; Bas E Dutilh
Journal: Genome Biol Date: 2019-10-22 Impact factor: 13.583

4. Host Adaptation in Legionellales Is 1.9 Ga, Coincident with Eukaryogenesis.

Authors: Eric Hugoson; Andrei Guliaev; Tea Ammunét; Lionel Guy
Journal: Mol Biol Evol Date: 2022-03-02 Impact factor: 16.240

5. Transitional genomes and nutritional role reversals identified for dual symbionts of adelgids (Aphidoidea: Adelgidae).

Authors: Dustin T Dial; Kathryn M Weglarz; Akintunde O Aremu; Nathan P Havill; Taylor A Pearson; Gaelen R Burke; Carol D von Dohlen
Journal: ISME J Date: 2021-09-10 Impact factor: 10.302

6. SprayNPray: user-friendly taxonomic profiling of genome and metagenome contigs.

Authors: Arkadiy I Garber; Catherine R Armbruster; Stella E Lee; Vaughn S Cooper; Jennifer M Bomberger; Sean M McAllister
Journal: BMC Genomics Date: 2022-03-12 Impact factor: 3.969

6 in total