
The Newick utilities: high-throughput phylogenetic tree processing in the UNIX shell.

Thomas Junier, Evgeny M Zdobnov

Abstract

SUMMARY: We present a suite of Unix shell programs for processing any number of phylogenetic trees of any size. They perform frequently used tree operations without requiring user interaction. They also allow tree drawing as scalable vector graphics (SVG), suitable for high-quality presentations and further editing, and as ASCII graphics for command-line inspection. As an example we include an implementation of bootscanning, a procedure for finding recombination breakpoints in viral genomes.

AVAILABILITY: C source code, Python bindings and executables for various platforms are available from http://cegg.unige.ch/newick_utils. The distribution includes a manual and example data. The package is distributed under the BSD License.

CONTACT: thomas.junier@unige.ch


Year:  2010        PMID: 20472542      PMCID: PMC2887050          DOI: 10.1093/bioinformatics/btq243

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Phylogenetic trees are a fundamental component of evolutionary biology, and methods for computing them are an active area of research. Once computed, a tree may be further processed in various ways (Table 1). Small datasets consisting of a few trees of moderate size can be processed with interactive GUI programs. As datasets grow, however, interactivity becomes a burden and a source of errors, and it becomes impractical to process large datasets of hundreds of trees and/or very large trees without automation.

Table 1. Selected Newick utilities programs and their functions

Program      Function
nw_clade     Extracts clades (subtrees), specified by labels
nw_distance  Extracts branch lengths in various ways (from root, from parent, as matrix, etc.)
nw_display   Draws trees as ASCII or SVG (suitable for further editing for presentations or publications), several options
nw_match     Reports matches of a tree in a larger tree
nw_order     Orders tree nodes, without altering topology
nw_rename    Changes node labels
nw_reroot    Reroots trees on an outgroup, specified by labels
nw_trim      Trims a tree at a specified depth
nw_topology  Retains topological information

SVG, scalable vector graphics.

Automation is facilitated if the programs that constitute an analysis pipeline can easily communicate data with each other. One way of doing this in the Unix shell environment is to make them capable of reading from standard input and writing to standard output; such programs are called filters. Although there are many automatable programs for computing trees [e.g. PhyML (Guindon and Gascuel, 2003), PHYLIP (Felsenstein, 1989)], programs for processing trees [e.g. TreeView (Page, 2002), iTOL (Letunic and Bork, 2007)] are typically interactive. Here, we present the Newick utilities, a set of automatable filters that implement the most frequent tree-processing operations.
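To make the notion of a filter concrete, the following is a minimal Python sketch of a program that reads Newick trees from standard input and writes the leaf labels of each tree to standard output. It is an illustration only and not part of the Newick utilities; the names leaf_labels and filter_stream are hypothetical, and the regular expression assumes unquoted labels and trees with at least two leaves.

```python
import re
import sys

# Assumption: labels are unquoted. A leaf label directly follows "(" or ","
# and runs until the next Newick delimiter; labels after ")" are inner nodes.
LEAF = re.compile(r"[(,]\s*([^\s(),:;]+)")


def leaf_labels(newick):
    """Return the leaf labels of one Newick tree, in order of appearance."""
    return LEAF.findall(newick)


def filter_stream(infile, outfile):
    """Read ';'-terminated trees from infile; write one line of labels per tree."""
    for tree in infile.read().split(";"):
        if tree.strip():
            outfile.write(" ".join(leaf_labels(tree)) + "\n")


if __name__ == "__main__":
    # Filter behaviour: standard input in, standard output out.
    filter_stream(sys.stdin, sys.stdout)
```

Because the script touches only standard input and standard output, it can sit anywhere in a shell pipeline, between a tree-building program and a text-processing tool for instance.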

2 RESULTS

The Newick utilities have the following features: no user interaction is required; input is read from a file or from standard input; output is written to standard output; all options are passed on the command line (no control files); the input format is Newick (Archie et al., 1986); the output is in plain text (Newick, ASCII graphics or SVG); there are no limits to the number or size of the input trees; each program performs one function, with some variants; and the programs are self-documenting (option -h).

2.1 Example: Bootscanning

Bootscanning (Salminen, 1995) locates recombination breakpoints by identifying (locally) closest relatives of a reference sequence. An example implementation is as follows:

1. Produce a multiple alignment of all sequences, including the reference.
2. Divide the alignment into equidistant windows of constant size (e.g. 300 bp every 50 bp).
3. Compute a maximum-likelihood tree for each window.
4. Root the trees on the appropriate outgroup (not the reference).
5. From each tree, extract the distance (along the tree) from the reference to each of the other sequences.
6. Plot the result (Fig. 1).
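The windowing step (dividing the alignment into windows of 300 bp every 50 bp) can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in the distribution; the function names are hypothetical, the alignment is assumed to be a dictionary mapping sequence names to equal-length aligned strings, and trailing columns that do not fill a whole window are simply dropped.

```python
def sliding_windows(alignment_length, size=300, step=50):
    """Return (start, end) pairs, 0-based and end-exclusive, for full-size windows."""
    return [(start, start + size)
            for start in range(0, alignment_length - size + 1, step)]


def slice_alignment(alignment, start, end):
    """Cut the same column range out of every aligned sequence."""
    return {name: seq[start:end] for name, seq in alignment.items()}
```

For a 1000-column alignment, sliding_windows(1000) yields 15 windows, from (0, 300) to (700, 1000); each slice would then be handed to the tree-building step.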
Fig. 1.

Bootscanning using PhyML, EMBOSS, Muscle, Newick utilities, GNUPlot and standard Unix shell programs. The species with the lowest distance is the reference's nearest neighbor (by distance along tree branches). A recombination breakpoint is predicted near position 450, as the nearest neighbor changes abruptly.

The distribution includes a Bash script, bootscan.sh, that performs the procedure with Muscle (Edgar, 2004) (Step 1), EMBOSS (Rice et al., 2000) (Step 2), PhyML (Step 3), GNUPlot (Step 6) and Newick utilities for Steps 4 and 5. This method was used to detect breakpoints in human enterovirus (Tapparel et al., 2007).
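The reading of the bootscanning plot (a breakpoint is suspected where the reference's nearest neighbor changes) can also be expressed programmatically. The sketch below is an assumption-laden illustration, not part of bootscan.sh: it supposes that the per-window distances have already been collected as (position, {name: distance}) pairs, and the function names and the numbers in the example are invented.

```python
def nearest_neighbors(distances_by_window):
    """For each (position, {name: distance}) pair, return (position, closest name)."""
    return [(pos, min(dist, key=dist.get)) for pos, dist in distances_by_window]


def neighbor_changes(neighbors):
    """Return (position, previous, current) wherever the nearest neighbor
    differs from that of the preceding window."""
    return [(pos, prev, cur)
            for (_, prev), (pos, cur) in zip(neighbors, neighbors[1:])
            if prev != cur]


# Made-up distances for three windows and two candidate relatives.
example = [
    (150, {"A": 0.10, "B": 0.40}),
    (200, {"A": 0.12, "B": 0.38}),
    (250, {"A": 0.35, "B": 0.09}),
]
changes = neighbor_changes(nearest_neighbors(example))
```

With these invented numbers, the nearest neighbor switches from A to B at the window at position 250, which is the kind of abrupt change that suggests a breakpoint.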

3 DISCUSSION

The Newick utilities add tree-processing capabilities to a shell user's toolkit. Since they have no hard-coded limits, they can handle large amounts of data; since they are non-interactive, they are easy to automate into pipelines, and since they are filters, they can easily work with other shell tools. Tree processing may also be programmed using a specialized package [e.g. BioPerl (Stajich et al., 2002), APE (Paradis et al., 2004) or ETE (Huerta-Cepas et al., 2010)], but this implies knowledge of the package, and such programs tend to be slower and use more resources than their C equivalents. The difference is particularly apparent for large trees (Fig. 2).
Fig. 2.

Average run times (10 samples) of rerooting tasks on various tree sizes in different implementations. The task involved reading, rerooting and printing out the tree as Newick. Runs of the BioPerl and APE implementations on the 20 000-leaf tree did not complete. Error bars show 1 SD. Computer: 3 GHz 64 bit Intel Core 2 Duo, 1 GB RAM, Linux 2.6. Made with R (R Development Core Team, 2008).


3.1 Python bindings

To combine the advantages of a high-level, object-oriented language for the application logic with a C library for fast data manipulation, one can use the Newick utilities through Python's ctypes module. This allows one to code a rerooting program in 25 lines of Python while retaining good performance (Fig. 2). A detailed example is included in the documentation. Some users will feel more at ease working in the shell or with shell scripts, using existing bioinformatics tools; others will prefer to code their own tools in a scripting language. The Newick utilities are designed to meet the requirements of both.
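The general ctypes pattern (load a shared C library, declare a function's signature, call it from Python) is sketched below. Since this text does not list the functions exported by the Newick utilities library, the sketch calls strlen from the C standard library instead; it assumes a POSIX system, and c_strlen is a hypothetical helper name. The documentation's detailed example remains the reference for the actual bindings.

```python
import ctypes
import ctypes.util

# Load the C standard library. If find_library returns None, CDLL(None)
# exposes the symbols already linked into the Python process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text):
    """Return the byte length of an ASCII string, computed by the C library."""
    return libc.strlen(text.encode("ascii"))
```

Declaring argtypes and restype up front lets ctypes check and convert arguments, which matters more once the C functions take or return pointers to structures such as tree nodes.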
REFERENCES

1. P Rice, I Longden, A Bleasby. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet, 2000-06.

2. Jason E Stajich, David Block, Kris Boulez, Steven E Brenner, Stephen A Chervitz, Chris Dagdigian, Georg Fuellen, James G R Gilbert, Ian Korf, Hilmar Lapp, Heikki Lehväslaiho, Chad Matsalla, Chris J Mungall, Brian I Osborne, Matthew R Pocock, Peter Schattner, Martin Senger, Lincoln D Stein, Elia Stupka, Mark D Wilkinson, Ewan Birney. The Bioperl toolkit: Perl modules for the life sciences. Genome Res, 2002-10.

3. Stéphane Guindon, Olivier Gascuel. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol, 2003-10.

4. Robert C Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 2004-03-19.

5. Ivica Letunic, Peer Bork. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics, 2006-10-18.

6. Roderic D M Page. Visualizing phylogenetic trees using TreeView. Curr Protoc Bioinformatics, 2002-08.

7. M O Salminen, J K Carr, D S Burke, F E McCutchan. Identification of breakpoints in intergenotypic recombinants of HIV type 1 by bootscanning. AIDS Res Hum Retroviruses, 1995-11.

8. Jaime Huerta-Cepas, Joaquín Dopazo, Toni Gabaldón. ETE: a python Environment for Tree Exploration. BMC Bioinformatics, 2010-01-13.

9. Emmanuel Paradis, Julien Claude, Korbinian Strimmer. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics, 2004-01-22.

10. Caroline Tapparel, Thomas Junier, Daniel Gerlach, Samuel Cordey, Sandra Van Belle, Luc Perrin, Evgeny M Zdobnov, Laurent Kaiser. New complete genome sequences of human rhinoviruses shed light on their phylogeny and genomic features. BMC Genomics, 2007-07-10.
