
Phyx: phylogenetic tools for unix.

Joseph W Brown, Joseph F Walker, Stephen A Smith

Abstract

SUMMARY: The ease with which phylogenomic data can be generated has drastically escalated the computational burden for even routine phylogenetic investigations. To address this, we present phyx: a collection of programs written in C++ to explore, manipulate, analyze and simulate phylogenetic objects (alignments, trees and MCMC logs). Modelled after Unix/GNU/Linux command line tools, individual programs perform a single task and operate on standard I/O streams that can be piped to quickly and easily form complex analytical pipelines. Because of the stream-centric paradigm, memory requirements are minimized (often only a single tree or sequence is in memory at any instant), and hence phyx is capable of efficiently processing very large datasets.
AVAILABILITY AND IMPLEMENTATION: phyx runs on POSIX-compliant operating systems. Source code, installation instructions, documentation and example files are freely available under the GNU General Public License at https://github.com/FePhyFoFum/phyx.
CONTACT: eebsmith@umich.edu.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2017. Published by Oxford University Press.

Year:  2017        PMID: 28174903      PMCID: PMC5870855          DOI: 10.1093/bioinformatics/btx063

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Phylogenetic and phylogenomic analyses now involve massive datasets, which makes traditional approaches to the analysis and manipulation of data onerous undertakings. A number of phylogenetic toolkits exist, including but not limited to ETE (Huerta-Cepas et al., 2016), newick utilities (Junier and Zdobnov, 2010), Mesquite (Maddison and Maddison, 2016), ape (Popescu et al., 2012), phyutility (Smith and Dunn, 2008) and DendroPy (Sukumaran and Holder, 2010). Of course, no individual software package is exhaustive in its functionality (i.e. methods and supported file formats), so these packages largely complement one another, both in terms of novel and overlapping functionalities (i.e. confirming computed values), with differences (e.g. in memory requirements and speed) sometimes making certain packages more conducive to particular workflows.
However, despite the rich diversity of existing tools, there is a niche to be filled for programs that combine high-throughput processing with the convenience of a POSIX-style interface. In an effort to provide a more flexible and efficient software package for processing phylogenetic data and for conducting phylogenomic research, we present phyx, a set of programs to carry out a wide range of phylogenetic tasks. Written in C++ and modeled after Unix/GNU/Linux command line tools, individual programs perform a single task, have individual manual (i.e. man) pages and operate on standard I/O streams. A result of this stream-centric approach is that, for most programs, only a single sequence or tree is in memory at any moment. Thus, large datasets can be processed with minimal memory requirements. phyx's ever-growing complement currently consists of 35+ programs (see Table 1 for a subset) focused on exploring, manipulating, analyzing and simulating phylogenetic objects (alignments, trees and MCMC logs).
As with standard Unix command line tools, these programs can be piped (together with non-phyx tools), allowing the easy construction of efficient analytical pipelines. phyx also logs all program calls to a plain text file, which serves as an executable record that can be submitted as part of a manuscript for review and replicability purposes. phyx thus provides a convenient, lightweight and inclusive toolkit spanning the wide breadth of tasks performed by researchers conducting phylogenomic analyses.
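The stream-centric idea can be illustrated with a minimal Python sketch. This is not phyx's C++ implementation; the function names (`read_fasta`, `remove_taxa`, `write_fasta`) are hypothetical and chosen only for illustration. The sketch assumes FASTA input and shows how generators keep a single record in memory while taxa are filtered out, so the same functions could be wired to standard input and output in a pipeline.

```python
import io
from typing import Iterable, Iterator, Set, TextIO, Tuple

Record = Tuple[str, str]  # (label, sequence)


def read_fasta(stream: TextIO) -> Iterator[Record]:
    """Yield (label, sequence) pairs one at a time from a FASTA stream."""
    label = None
    chunks = []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, which is emitted now.
            if label is not None:
                yield label, "".join(chunks)
            label = line[1:]
            chunks = []
        elif label is not None:
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


def remove_taxa(records: Iterable[Record], unwanted: Set[str]) -> Iterator[Record]:
    """Lazily drop records whose label is in `unwanted`."""
    return (rec for rec in records if rec[0] not in unwanted)


def write_fasta(records: Iterable[Record], out: TextIO) -> None:
    """Write records as they arrive; nothing is accumulated."""
    for label, seq in records:
        out.write(f">{label}\n{seq}\n")


# Demonstration on an in-memory stream; sys.stdin and sys.stdout could be
# passed instead to use the same functions inside a shell pipeline.
source = io.StringIO(">taxonA\nACGT\nACGT\n>taxonB\nTTTT\n>taxonC\nGGGG\n")
sink = io.StringIO()
write_fasta(remove_taxa(read_fasta(source), {"taxonB"}), sink)
print(sink.getvalue(), end="")
```

Because each stage is a generator, memory use is bounded by the longest single sequence rather than by the size of the alignment.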
Table 1. Selected phyx programs and their functions. See GitHub for additional details and the full program list.

Program            Function
pxlssq/pxlstr      List attributes of alignments/trees
pxrms/pxrmt        Remove taxa from alignments/trees
pxrls/pxrlt        Relabel taxa in alignments/trees
pxboot             Alignment bootstrap/jackknife resampling
pxclsq             Remove missing/ambiguous sites from an alignment
pxs2fa/phy/nex     Convert alignment to fasta/phylip/Nexus format
pxlog              Concatenate and resample MCMC parameter/tree logs
pxfqfilt           Filter fastq files by quality
pxrr               Reroot/unroot trees
pxtlate            Translate nucleotide sequences
pxsw/pxnw          Pairwise sequence alignment
pxstrec            Ancestral state reconstruction, stochastic mapping
pxbdfit/pxbdsim    Birth-death tree inference/simulator
pxseqgen           Simulate nucleotide/protein sequences on user tree

2 Materials and methods

2.1 File processing, manipulation and conversion

File manipulation and conversion are tedious and error-prone, but often required, components of phylogenetic analysis, made more burdensome by the volume of data available in current phylogenomics studies. phyx supports the popular formats for sequence alignments (fasta, fastq, phylip and Nexus) and trees (newick and Nexus), and provides lightweight, high-throughput utilities to convert data among formats without the user needing to specify the input format, as phyx will attempt to auto-detect it. Alignments can be further manipulated by removing individual taxa, resampling (bootstrapping or jackknifing), sequence recoding, translation to protein, reverse complementation, filtering by quality scores or the amount of missing data, and concatenation across mixed alignment formats.
Processing large data matrices is only one step required for phylogenomic analyses. In order to perform downstream analyses (e.g. orthology detection (Yang and Smith, 2014), mapping gene trees to a species tree (Smith et al., 2015), or gene tree/species tree reconciliation (Mirarab et al., 2014)), it is now also essential to be able to manipulate the individual gene trees constructed from these data. phyx enables fast, efficient manipulations such as pruning individual taxa, extracting subclades and rerooting/unrooting trees.
Finally, Bayesian MCMC analyses involving phylogenies have become common in the biological sciences, and often involve large log files generated from replicated analyses. phyx enables both the concatenation and resampling (burn-in and/or thinning) of MCMC tree or parameter logs for downstream summary.
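Burn-in removal and thinning of MCMC logs can be sketched in a few lines of Python. This is an illustrative sketch rather than the logic of pxlog: the function names are hypothetical, and each log is assumed to be an iterable of samples with any header lines already removed.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def resample_log(samples: Iterable[T], burnin: int = 0, thin: int = 1) -> Iterator[T]:
    """Skip the first `burnin` samples, then keep every `thin`-th one."""
    if burnin < 0 or thin < 1:
        raise ValueError("burnin must be >= 0 and thin must be >= 1")
    # islice consumes the input lazily, so the log is never held in memory.
    return islice(samples, burnin, None, thin)


def concatenate_logs(logs: Iterable[Iterable[T]], burnin: int = 0, thin: int = 1) -> Iterator[T]:
    """Apply the same burn-in and thinning to each replicate, then chain them."""
    for log in logs:
        yield from resample_log(log, burnin, thin)


# Two hypothetical replicate runs of ten samples each.
run1 = [f"run1_sample{i}" for i in range(10)]
run2 = [f"run2_sample{i}" for i in range(10)]
for sample in concatenate_logs([run1, run2], burnin=4, thin=3):
    print(sample)
```

Applying burn-in per replicate before concatenation matters because each independent run has its own pre-convergence phase.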

2.2 Analysis and simulation

In addition to file manipulation, phyx provides a growing number of tools for data analysis and simulation. Analytical capabilities presently include pairwise sequence alignment using either the Needleman-Wunsch or Smith-Waterman algorithms, tree inference using the neighbour-joining criterion, ancestral state reconstruction and stochastic mapping of discrete characters, fitting of Brownian or OU models to continuous characters, fitting birth-death models to trees, and computing alignment column bipartitions either in isolation or on a user tree. Data simulation is an essential tool with which to explore model sensitivity and adequacy through parametric bootstrapping or posterior predictive analyses (Bollback, 2002). phyx currently enables simulation of both birth-death trees (see example in Fig. 1) and nucleotide or protein alignments given a tree and substitution model parameters.
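As an illustration of the global pairwise alignment mentioned above, the following Python sketch implements the textbook Needleman-Wunsch recurrence with a linear gap penalty. The scoring values (match +1, mismatch -1, gap -1) are assumptions made for the example and are not taken from phyx, whose scoring scheme and options are not described here.

```python
from typing import Tuple


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> Tuple[int, str, str]:
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("ACGT", "AGT")
print(total)
print(top)
print(bottom)
```

Smith-Waterman local alignment uses the same table but clamps each cell at zero and traces back from the highest-scoring cell rather than the corner.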
Fig. 1. Parametric bootstrapping of a diversification process. The primate phylogeny of Springer et al. (2012) was fit to a birth-death model (pxbdfit). To explore the breadth of plausible diversification outcomes, the maximum likelihood parameters (b: 0.339487, d: 0.268944) were used to simulate (pxbdsim) 25 000 phylogenies conditioned on either the extant diversity (367, left) or root age (66.7066 Ma, right) of the empirical tree.
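The time-conditioned simulation in Fig. 1 can be approximated with a small Gillespie-style sketch in Python. It tracks only the number of lineages, not tree topology or branch lengths, and it is not the algorithm used by pxbdsim. It assumes that b and d are per-lineage rates per million years, that the process starts with two lineages at the root, and that runs going extinct are kept rather than rejected; the function name is hypothetical.

```python
import math
import random


def simulate_lineage_count(birth: float, death: float, duration: float,
                           rng: random.Random, start: int = 2) -> int:
    """Return the number of lineages alive after `duration` time units."""
    if birth < 0 or death < 0 or birth + death <= 0:
        raise ValueError("rates must be non-negative and not both zero")
    n, t = start, 0.0
    while n > 0:
        # Waiting time to the next event among n independent lineages.
        t += rng.expovariate(n * (birth + death))
        if t >= duration:
            break
        # The event is a speciation with probability birth / (birth + death).
        if rng.random() < birth / (birth + death):
            n += 1
        else:
            n -= 1
    return n


rng = random.Random(2017)
b, d, root_age = 0.339487, 0.268944, 66.7066  # values quoted in Fig. 1
counts = [simulate_lineage_count(b, d, root_age, rng) for _ in range(200)]
print("replicates:", len(counts))
print("extinct before the present:", sum(1 for c in counts if c == 0))
print("mean lineages at the present (extinct runs count as 0):",
      sum(counts) / len(counts))
print("model expectation 2 * exp((b - d) * T):",
      2 * math.exp((b - d) * root_age))
```

Conditioning on extant diversity instead would require a different stopping rule (for example, running until a target number of lineages is reached), which this sketch does not attempt.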

2.3 Comparison to existing programs

While we view phyx as a complement to existing tools, we demonstrate the relative performance (speed and memory requirements) of some phyx programs for common phylogenomics tasks in the Supplementary Data, available at Bioinformatics online.

3 Conclusion

phyx was designed to complement existing phylogenetic toolkits by enabling the exploration, manipulation, analysis and simulation of phylogenetic objects directly from the command line. Moreover, by conforming to a stream-centric approach, memory requirements are reduced significantly so that large volumes of data can be processed on even personal laptop computers.
References

1. Jonathan P Bollback. Bayesian model adequacy and choice in phylogenetics. Mol Biol Evol, 2002-07.
2. Andrei-Alin Popescu; Katharina T Huber; Emmanuel Paradis. ape 3.0: New tools for distance-based phylogenetics and evolutionary analysis in R. Bioinformatics, 2012-04-11.
3. Jeet Sukumaran; Mark T Holder. DendroPy: a Python library for phylogenetic computing. Bioinformatics, 2010-04-25.
4. Stephen A Smith; Casey W Dunn. Phyutility: a phyloinformatics tool for trees, alignments and molecular data. Bioinformatics, 2008-01-28.
5. Thomas Junier; Evgeny M Zdobnov. The Newick utilities: high-throughput phylogenetic tree processing in the UNIX shell. Bioinformatics, 2010-05-13.
6. Mark S Springer; Robert W Meredith; John Gatesy; Christopher A Emerling; Jong Park; Daniel L Rabosky; Tanja Stadler; Cynthia Steiner; Oliver A Ryder; Jan E Janečka; Colleen A Fisher; William J Murphy. Macroevolutionary dynamics and historical biogeography of primate diversification inferred from a species supermatrix. PLoS One, 2012-11-16.
7. Ya Yang; Stephen A Smith. Orthology inference in nonmodel organisms using transcriptomes and low-coverage genomes: improving accuracy and matrix occupancy for phylogenomics. Mol Biol Evol, 2014-08-25.
8. Jaime Huerta-Cepas; François Serra; Peer Bork. ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Mol Biol Evol, 2016-02-26.
9. Stephen A Smith; Michael J Moore; Joseph W Brown; Ya Yang. Analysis of phylogenomic datasets reveals conflict, concordance, and gene duplications with examples from animals and plants. BMC Evol Biol, 2015-08-05.
10. S Mirarab; R Reaz; Md S Bayzid; T Zimmermann; M S Swenson; T Warnow. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics, 2014-09-01.
