Genesis and Gappa: processing, analyzing and visualizing phylogenetic (placement) data.

Lucas Czech¹, Pierre Barbera¹, Alexandros Stamatakis¹,².

Abstract

SUMMARY: We present genesis, a library for working with phylogenetic data, and gappa, an accompanying command-line tool for conducting typical analyses on such data. The tools target phylogenetic trees and phylogenetic placements, sequences, taxonomies and other relevant data types, offer high-level simplicity as well as low-level customizability, and are computationally efficient, well-tested and field-proven.
AVAILABILITY AND IMPLEMENTATION: Both genesis and gappa are written in modern C++11, and are freely available under GPLv3 at http://github.com/lczech/genesis and http://github.com/lczech/gappa.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2020 PMID: 32016344 PMCID: PMC7214027 DOI: 10.1093/bioinformatics/btaa070

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The necessity of computation in biology, and in (metagenomic) sequence analysis in particular, has long been acknowledged. In phylogenetics, for example, there is a plethora of software for analyzing data, covering tasks such as sequence alignment (Pervez ), phylogenetic tree inference (Zhou ) and diverse types of downstream analyses (Washburne ). Furthermore, in metagenomics, a key task is the taxonomic identification of sequences obtained from microbial environments. An increasingly popular method for this is phylogenetic (or evolutionary) placement, which can classify large numbers of (meta-)genomic sequences with respect to a given reference phylogeny. Common tools for phylogenetic placement are pplacer (Matsen ), RAxML-EPA (Berger ) as well as the more recent and more scalable tools EPA-ng (Barbera ), APPLES (Balaban ) and RAPPAS (Linard ). The result of a phylogenetic placement can be understood as a distribution of sequences over the reference tree, which makes it possible to examine the composition of microbial communities and to derive biological and ecological insights (Czech ; Czech and Stamatakis, 2019).

Here, we introduce Genesis, a library for working with phylogenetic data, as well as Gappa, a command-line tool for typical analyses of such data. They focus on phylogenetic trees and phylogenetic placements, but also offer various additional functionality. Combined, they make it possible to analyze and visualize phylogenetic (placement) data with existing methods, and to experiment with and develop novel ideas. To maximize the usability of our tools, our implementation is guided by the following design objectives: (i) most users require a fast and simple application for analyzing their data, (ii) some power users desire customization, for example, via scripting, (iii) developers require a flexible toolkit for rapid prototyping and (iv) with the ongoing data growth, the implementation needs to be scalable and efficient with respect to memory and execution times.
To this end, Genesis and Gappa are written in C++11, relying on a modern, modular and function-centric software design. In the Supplementary Material, we evaluate the code quality, the runtime behavior and the memory requirements for conducting typical tasks, such as file parsing and data processing. An exemplary benchmark for reading Newick files is shown in Figure 1. We find that Genesis has the best overall code quality score among the scientific codes written in C or C++ that we compared, using softwipe for the comparison (https://github.com/adrianzap/softwipe). It is also consistently faster than all evaluated Python and R libraries in our tests. Furthermore, Gappa is faster and more memory efficient than its main competitor Guppy in almost all tests and, most importantly, it scales better on larger datasets in all benchmarks.

Fig. 1.

Runtimes for reading Newick files with 1 K–1 M taxa (tip/leaf nodes) and a randomly generated topology, using a variety of different libraries and tools. See the Supplementary Material for details.

2 Features of Genesis

Genesis is a highly flexible library for reading, manipulating and evaluating phylogenetic data with a simple and straightforward application programming interface (API). Typical tasks, such as parsing and writing files, iterating over the elements of a data structure and other frequently used functions, are mostly one-liners that integrate well with modern C++. The library is multi-threaded and can thus fully leverage multi-core systems for scalable processing of large datasets. The functionality is divided into loosely coupled modules, which are organized in C++ namespaces. We briefly describe them in the following.

2.1 Phylogenetic trees

Phylogenetic trees are implemented via a pointer-based data structure that enables fast and flexible operations and allows arbitrary data to be stored at the nodes and at the edges. The trees may contain multifurcations and may have a designated root node. Trees can be parsed from Newick files and written to Newick, phyloxml and nexus files, again including support for arbitrary edge and node annotations. Traversing the tree starting from an arbitrary node in, for example, post-order, pre-order or level-order can be accomplished via simple for loops:

    // Read a tree from a Newick file.
    Tree tree = CommonTreeNewickReader().read(
        from_file( "path/to/tree.newick" )
    );

    // Traverse tree in preorder, print node names.
    for( auto it : preorder( tree )) {
        auto& data = it.node().data<CommonNodeData>();
        std::cout << data.name << "\n";
    }

Functionality is provided for manipulating trees, finding lowest common ancestors of nodes [e.g. using the Euler tour technique of Berkman and Vishkin (1993)] or paths between nodes, calculating distances between nodes, testing monophyly of clades, obtaining a bitset representation of the bipartitions/splits of the tree and many other standard tasks. Furthermore, functions for drawing rectangular and circular phylograms or cladograms to svg files, using individual custom edge colors and node shapes, are provided for creating publication-quality figures.
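To illustrate the Euler tour technique for lowest common ancestors mentioned above, the following is a minimal, self-contained sketch. It does not use the Genesis API: SimpleTree, EulerTour, build_euler_tour and lowest_common_ancestor are hypothetical names chosen for this example, and the range-minimum query is a plain linear scan rather than the constant-time structure of Berkman and Vishkin (1993).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical minimal rooted tree: children[i] lists the child indices of node i.
// This is a stand-in for illustration, not the Genesis tree class.
struct SimpleTree {
    std::vector<std::vector<std::size_t>> children;
    std::size_t root = 0;
};

// Euler tour of the tree plus, for every node, the position of its first visit.
struct EulerTour {
    std::vector<std::size_t> nodes;   // node visited at each tour step
    std::vector<std::size_t> depths;  // depth of that node
    std::vector<std::size_t> first;   // first tour position of each node
};

static void euler_visit(
    SimpleTree const& tree, std::size_t node, std::size_t depth, EulerTour& tour
) {
    tour.first[node] = tour.nodes.size();
    tour.nodes.push_back(node);
    tour.depths.push_back(depth);
    for (std::size_t child : tree.children[node]) {
        euler_visit(tree, child, depth + 1, tour);
        // Returning from a child revisits the current node.
        tour.nodes.push_back(node);
        tour.depths.push_back(depth);
    }
}

EulerTour build_euler_tour(SimpleTree const& tree)
{
    assert(tree.root < tree.children.size());
    EulerTour tour;
    tour.first.assign(tree.children.size(), 0);
    euler_visit(tree, tree.root, 0, tour);
    return tour;
}

// The LCA of a and b is the shallowest node between their first visits.
// A linear scan is used here; a sparse table would speed up repeated queries.
std::size_t lowest_common_ancestor(EulerTour const& tour, std::size_t a, std::size_t b)
{
    assert(a < tour.first.size() && b < tour.first.size());
    std::size_t lo = tour.first[a];
    std::size_t hi = tour.first[b];
    if (lo > hi) { std::swap(lo, hi); }
    std::size_t best = lo;
    for (std::size_t i = lo + 1; i <= hi; ++i) {
        if (tour.depths[i] < tour.depths[best]) { best = i; }
    }
    return tour.nodes[best];
}
```

The sketch handles multifurcations, since each node may have any number of children. The tour is built once per tree, after which every query only inspects the tour segment between the two nodes.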

2.2 Phylogenetic placements

Handling phylogenetic placement data constitutes a primary focus of Genesis. Placement data are usually stored in so-called jplace files (Matsen ). Our implementation offers low-level functions for reading, writing, filtering, merging and otherwise manipulating these data, as well as high-level functions for distance calculations (Evans and Matsen, 2012), edge PCA and squash clustering (Matsen and Evans, 2011) and phylogenetic k-means clustering (Czech and Stamatakis, 2019), among others. Advanced functions for analyzing and visualizing the data are implemented as well, for instance, our adaptation of phylofactorization to phylogenetic placement data (Czech and Stamatakis, 2019; Washburne ). Lastly, we offer a simple simulator for generating random placement data (e.g. for testing). To the best of our knowledge, competing software that can parse placement data in the form of jplace files [BoSSA (Lefeuvre, 2018), ggtree (Yu ) or iTOL (Letunic and Bork, 2016)] merely offers some very basic analyses and visualizations, such as displaying the distribution of placed sequences on the reference phylogeny, but does not offer the wide range of functionality of Genesis.
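As a rough illustration of how placements can be read as a distribution of sequences over the reference tree, the following sketch accumulates placement mass per edge. It is not the Genesis API: the types Placement and Pquery and the function edge_masses are hypothetical, and only the field names edge_num and like_weight_ratio are taken from the jplace format. The sketch assumes that the likelihood weight ratios of each query sum to at most 1.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One candidate position of a query sequence on the reference tree.
// Field names follow the jplace fields "edge_num" and "like_weight_ratio".
struct Placement {
    std::size_t edge_num;
    double like_weight_ratio;
};

// A query sequence with all of its candidate placements (hypothetical type).
struct Pquery {
    std::string name;
    std::vector<Placement> placements;
};

// Sum the likelihood weight ratios per edge, then divide by the number of
// queries, so that the total mass on the tree is at most 1 (assuming the
// ratios of each query sum to at most 1).
std::vector<double> edge_masses(std::vector<Pquery> const& pqueries, std::size_t edge_count)
{
    std::vector<double> masses(edge_count, 0.0);
    if (pqueries.empty()) { return masses; }
    for (auto const& pquery : pqueries) {
        for (auto const& placement : pquery.placements) {
            assert(placement.edge_num < edge_count);
            masses[placement.edge_num] += placement.like_weight_ratio;
        }
    }
    for (double& mass : masses) {
        mass /= static_cast<double>(pqueries.size());
    }
    return masses;
}
```

Per-edge masses of this kind are the sort of quantity that a visualization of the placement distribution on the tree would color edges by. The methods cited above are considerably more involved than this sketch.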

2.3 Other features

Sequences and alignments can be efficiently read from and written to fasta and phylip files; high-level functions for managing sequences include several methods for calculating consensus sequences, the entropy of sequence sets and sequence re-labeling via hashes. Taxonomies and taxonomic paths (e.g. ‘Eukaryota; Alveolata; Apicomplexa’) can be parsed from databases, such as Silva (Quast ; Yilmaz ) or NCBI (Benson ; Sayers ), and stored in a hierarchical taxonomic data structure, again with the ability to store arbitrary meta-data at each taxon and to traverse the taxonomy. Furthermore, Genesis supports several standard file formats, such as json, csv and svg. All input methods automatically and transparently handle gzip-compressed files. Moreover, a multitude of auxiliary functions and classes is provided: matrices and dataframes to store data, statistical functions and histogram generation to examine such data, regression via the generalized linear model, multi-dimensional scaling, k-means clustering, an efficient bitvector implementation (e.g. used for the bitset representation of phylogenetic trees mentioned above), color support for handling gradients in plots, etc. The full list of functionality is available via the online documentation. Lastly, Genesis offers a simple architecture for scripting-like development, intended for rapid prototyping or small custom programs (e.g. converting some files or examining some data for a particular experiment).
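The idea of parsing a taxonomic path into a hierarchical structure with per-taxon meta-data can be sketched as follows. This is an illustration only and does not reflect the Genesis classes: Taxon, split_taxopath and add_taxopath are hypothetical names, the separator is assumed to be a semicolon as in the example path above, and the count member merely stands in for arbitrary meta-data.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Split a taxonomic path at ';' and trim surrounding spaces from each name.
// Empty entries (e.g. after a trailing ';') are skipped.
std::vector<std::string> split_taxopath(std::string const& path)
{
    std::vector<std::string> names;
    std::size_t start = 0;
    while (start <= path.size()) {
        std::size_t end = path.find(';', start);
        if (end == std::string::npos) { end = path.size(); }
        std::size_t first = path.find_first_not_of(' ', start);
        if (first != std::string::npos && first < end) {
            std::size_t last = path.find_last_not_of(' ', end - 1);
            names.push_back(path.substr(first, last - first + 1));
        }
        start = end + 1;
    }
    return names;
}

// Hypothetical taxon node: children are keyed by name, and `count` is an
// example of meta-data stored at each taxon.
struct Taxon {
    std::map<std::string, std::unique_ptr<Taxon>> children;
    std::size_t count = 0;
};

// Insert a path below `root`, creating missing taxa and counting each visit.
// Returns the taxon at the end of the path.
Taxon& add_taxopath(Taxon& root, std::vector<std::string> const& names)
{
    Taxon* current = &root;
    for (auto const& name : names) {
        assert(!name.empty());
        auto& child = current->children[name];
        if (!child) { child.reset(new Taxon()); }
        current = child.get();
        ++current->count;
    }
    return *current;
}
```

Adding the paths ‘Eukaryota; Alveolata; Apicomplexa’ and ‘Eukaryota; Alveolata; Ciliophora’ to the same root yields a single Eukaryota branch whose Alveolata taxon has two children. This shared-prefix behavior is what makes such a structure hierarchical rather than a flat list of paths.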

3 Features of Gappa

The flexibility of a library such as Genesis is primarily useful for method developers. Most users, however, are better served by a simple interface for typical, frequent tasks. To this end, we have developed the command-line program Gappa. Gappa implements and makes available the methods we presented in Czech ) and Czech and Stamatakis (2019), such as: automatically obtaining a set of reference sequences from large databases, which can be used to infer a reference tree for phylogenetic placement; visualization tools to display the distribution of placements on the tree or to visualize per-branch correlations with meta-data features of environmental samples; and analysis methods, such as phylogenetic k-means and placement-factorization for environmental samples. Gappa also contains re-implementations of a few prominent methods of Guppy (Matsen ), as well as commands for sanitizing, filtering and manipulating files in formats such as jplace, Newick or fasta, and a command for conducting a taxonomic assignment of phylogenetic placements (Kozlov ). As Gappa internally relies on Genesis, it is also efficient and scalable. Hence, Gappa can also be considered a collection of demo programs for using Genesis, which might be helpful as a starting point for developers who intend to use our library. In comparison to Guppy, we have observed speedups of several orders of magnitude and, in general, significantly lower memory requirements when processing large data volumes; see the Supplementary Material and Czech ) for details.

4 Conclusion

We presented Genesis, a library for working with phylogenetic (placement) data and related data types, as well as Gappa, a command-line interface for analysis methods and common tasks related to phylogenetic placements. Genesis and Gappa have already formed an integral part of several of our previous publications and programs (Barbera ; Czech ; Czech and Stamatakis, 2019; Mahé ; Zhou ), demonstrating their flexibility and utility. In future Genesis releases, we intend to offer API bindings to Python, thus making the library more accessible to developers. In Gappa, we will implement additional commands, in particular for working with phylogenetic placements, as well as re-implement the remaining commands of Guppy, in order to facilitate the analysis of larger datasets. Both Genesis and Gappa are freely available under GPLv3 at http://github.com/lczech/genesis and http://github.com/lczech/gappa.

Funding

This work was financially supported by the Klaus Tschira Stiftung gGmbH in Heidelberg, Germany. Conflict of Interest: none declared.

22 in total

1. Scalable methods for analyzing and visualizing phylogenetic placement of metagenomic samples.

Authors: Lucas Czech; Alexandros Stamatakis
Journal: PLoS One Date: 2019-05-28 Impact factor: 3.240

2. Methods for phylogenetic analysis of microbiome data.

Authors: Alex D Washburne; James T Morton; Jon Sanders; Daniel McDonald; Qiyun Zhu; Angela M Oliverio; Rob Knight
Journal: Nat Microbiol Date: 2018-05-24 Impact factor: 17.745

3. Parasites dominate hyperdiverse soil protist communities in Neotropical rainforests.

Authors: Frédéric Mahé; Colomban de Vargas; David Bass; Lucas Czech; Alexandros Stamatakis; Enrique Lara; David Singer; Jordan Mayor; John Bunge; Sarah Sernaker; Tobias Siemensmeyer; Isabelle Trautmann; Sarah Romac; Cédric Berney; Alexey Kozlov; Edward A D Mitchell; Christophe V W Seppey; Elianne Egge; Guillaume Lentendu; Rainer Wirth; Gabriel Trueba; Micah Dunthorn
Journal: Nat Ecol Evol Date: 2017-03-20 Impact factor: 15.460

4. The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples.

Authors: Steven N Evans; Frederick A Matsen
Journal: J R Stat Soc Series B Stat Methodol Date: 2012-02-15 Impact factor: 4.488

5. A format for phylogenetic placements.

Authors: Frederick A Matsen; Noah G Hoffman; Aaron Gallagher; Alexandros Stamatakis
Journal: PLoS One Date: 2012-02-22 Impact factor: 3.240

6. Phylogenetic factorization of compositional data yields lineage-level associations in microbiome datasets.

Authors: Alex D Washburne; Justin D Silverman; Jonathan W Leff; Dominic J Bennett; John L Darcy; Sayan Mukherjee; Noah Fierer; Lawrence A David
Journal: PeerJ Date: 2017-02-09 Impact factor: 2.984

7. Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets.

Authors: Xiaofan Zhou; Xing-Xing Shen; Chris Todd Hittinger; Antonis Rokas
Journal: Mol Biol Evol Date: 2018-02-01 Impact factor: 16.240

8. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools.

Authors: Christian Quast; Elmar Pruesse; Pelin Yilmaz; Jan Gerken; Timmy Schweer; Pablo Yarza; Jörg Peplies; Frank Oliver Glöckner
Journal: Nucleic Acids Res Date: 2012-11-28 Impact factor: 16.971

9. GenBank.

Authors: Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Eric W Sayers
Journal: Nucleic Acids Res Date: 2008-10-21 Impact factor: 16.971

10. Methods for automatic reference trees and multilevel phylogenetic placement.

Authors: Lucas Czech; Pierre Barbera; Alexandros Stamatakis
Journal: Bioinformatics Date: 2019-04-01 Impact factor: 6.937
