
apex: phylogenetics with multiple genes.

Thibaut Jombart¹, Frederick Archer², Klaus Schliep³, Zhian Kamvar⁴, Rebecca Harris⁵, Emmanuel Paradis⁶, Jérome Goudet^7,8, Hilmar Lapp⁹.

Abstract

Genetic sequences of multiple genes are becoming increasingly common for a wide range of organisms including viruses, bacteria and eukaryotes. While such data may sometimes be treated as a single locus, in practice, a number of biological and statistical phenomena can lead to phylogenetic incongruence. In such cases, different loci should, at least as a preliminary step, be examined and analysed separately. The R software has become a popular platform for phylogenetics, with several packages implementing distance-based, parsimony and likelihood-based phylogenetic reconstruction, and an even greater number of packages implementing phylogenetic comparative methods. Unfortunately, basic data structures and tools for analysing multiple genes have so far been lacking, thereby limiting potential for investigating phylogenetic incongruence. In this study, we introduce the new R package apex to fill this gap. apex implements new object classes, which extend existing standards for storing DNA and amino acid sequences, and provides a number of convenient tools for handling, visualizing and analysing these data. In this study, we introduce the main features of the package and illustrate its functionalities through the analysis of a simple data set.


Keywords: R; genetics; package; phylogenies; software


Year: 2016 PMID: 27417145 PMCID: PMC5215480 DOI: 10.1111/1755-0998.12567

Source DB: PubMed Journal: Mol Ecol Resour ISSN: 1755-098X Impact factor: 7.090

Introduction

The constant improvement of sequencing technologies provides ever‐increasing amounts of genetic sequences for a wide range of organisms including viruses, bacteria and a variety of eukaryotes. As a consequence, organisms are more frequently sequenced for multiple genes, and full‐genome data are becoming increasingly common (Pettersson et al. 2009; Pareek et al. 2011). This is especially true for viruses and bacteria, whose smaller genomes allow for large collections of complete genome sequences to be assembled relatively easily (e.g. Harris et al. 2013; Weinert et al. 2015). While it is tempting to treat multiple genes as a single evolutionary unit (de Queiroz & Gatesy 2007), a number of biological processes and statistical issues can lead to phylogenetic incongruence, causing different genes to exhibit distinct phylogenies (Wendel & Doyle 1998; Rokas et al. 2003; Pollard et al. 2006; Knowles 2009; Som 2015). Therefore, multiple gene data should ideally be first analysed separately to look for congruent genealogies and then possibly concatenated to derive a single phylogenetic tree. In practice, however, such an approach may not be trivial to implement. Indeed, handling and analysing multiple gene alignments can be a daunting task. While the R software is growing as a platform of choice for phylogenetic analyses (Paradis et al. 2004; Jombart et al. 2010a; Kembel et al. 2010; Schliep 2011; Popescu et al. 2012), tools for handling and analysing multiple gene data are missing. In this study, we introduce apex, a new R package which fills this gap. apex implements new object classes for storing and handling multiple gene data. This package is fully integrated and compatible with existing R standards for phylogenetic reconstruction and makes the analysis of multiple genes as straightforward as for single genes. In the following, we provide an overview of these functionalities and illustrate the main features using a worked example.

Functionalities

New object classes

The main feature of apex is to provide classes for storing and handling multiple gene data. This is implemented through two novel formal (S4) classes, multidna and multiphyDat, which are respectively extensions of the DNAbin objects from the package ape (Paradis et al. 2004; Popescu et al. 2012), optimized for distance‐based methods, and phyDat objects from the package phangorn (Schliep 2011), better suited for parsimony and likelihood‐based approaches. As in any formal class, the structure of multidna and multiphyDat is defined by a collection of predefined ‘slots’, each containing specific information such as the number of individuals in the data set, the various gene sequences and additional meta‐information. For the sake of simplicity, both classes have identical slots (Tables 1 and 2) and only differ in the way genetic sequences are stored internally: multidna uses bytes to code nucleotides (DNAbin objects, Table 1), while multiphyDat enumerates variable patterns in the sequences and can be used for DNA as well as amino acid sequences (phyDat objects, Table 2). In both cases, aligned sequences are stored inside a list (Tables 1 and 2), in which each element corresponds to a specific gene/locus. Besides storing genetic sequences, multidna and multiphyDat can also store labels for the individuals sequenced, as well as any metadata regarding the individuals or the genes of the data set (Tables 1 and 2).

Table 1

Content of multidna objects. The content of each slot can be accessed using ‘@[slot name]’, where ‘[slot name]’ is any of the values listed in the first column

Slot name	Data stored	Description
dna a	list of DNAbin matrices	A list of DNAbin matrices, each storing sequences of a given locus/gene; names are optional, and if provided, identify the genes; all matrices have the same individuals in rows, and nucleotide positions in columns
labels	character vector	A vector of labels for the individuals
n.ind	integer	The number of individuals in the data set
n.seq	integer	The total number of sequences, pooling all genes, and including gap‐only sequences
n.seq.miss	integer	The total number of gap‐only sequences
ind.info a	data.frame	A data.frame containing information on the individuals, where individuals are in rows
gene.info a	data.frame	A data.frame containing information on the genes, where genes are in rows

a Slots whose content is NULL when empty.

Table 2

Content of multiphyDat objects. The content of each slot can be accessed using ‘@[slot name]’, where ‘[slot name]’ is any of the values listed in the first column

Slot name	Data stored	Description
seq a	list of phyDat objects	A list of phyDat objects, each storing sequences of a given locus/gene; names are optional, and if provided identify the genes; all matrices have the same individuals in rows, and nucleotide positions in columns
type	character vector	A character string indicating the type of sequences (e.g. DNA, protein)
labels	character vector	A vector of labels for the individuals
n.ind	integer	The number of individuals in the data set
n.seq	integer	The total number of sequences, pooling all genes, and including gap‐only sequences
n.seq.miss	integer	The total number of gap‐only sequences
ind.info a	data.frame	A data.frame containing information on the individuals, where individuals are in rows
gene.info a	data.frame	A data.frame containing information on the genes, where genes are in rows

a Slots whose content is NULL when empty.

Both classes ensure that data are stored in a consistent way. When creating new multidna or multiphyDat objects, all individuals with at least one sequence are first enumerated and sorted alphanumerically. For each locus, gap‐only sequences are created for every missing individual, and new alignments containing all individuals, sorted identically, are created. As a result, the different gene alignments and their analyses can be readily compared, which greatly facilitates the assessment of congruence amongst the loci.
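The construction step described above can be sketched as follows. This is a minimal illustration, not code from the paper: it uses the woodmouse alignment distributed with ape, and the split into ‘gene1’ and ‘gene2’ is an arbitrary assumption made only to obtain two loci, one of which lacks some individuals.

```r
library(ape)   # DNAbin class and the woodmouse example alignment
library(apex)

data(woodmouse)  # 15 aligned sequences

# Hypothetical two-gene data set: the second "gene" is missing
# the last five individuals (split point chosen arbitrarily)
genes <- list(gene1 = woodmouse[, 1:500],
              gene2 = woodmouse[1:10, 501:965])

x <- new("multidna", genes)

x@n.ind        # individuals with at least one sequence
x@n.seq        # sequences over all genes, gap-only ones included
x@n.seq.miss   # gap-only sequences created for missing individuals

# After construction, every alignment holds the same individuals,
# sorted identically
sapply(x@dna, nrow)
identical(rownames(x@dna$gene1), rownames(x@dna$gene2))
```

Printing ‘x’ gives a summary of the same slots in the format shown in the worked example.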

Handling data

The fact that different gene data are stored in a consistent way also makes data manipulation easy. In both classes, we implemented matrix‐like subsetting using the ‘[’ operator. Assuming ‘x’ is a multidna or multiphyDat object, ‘x[i, j]’ is a subset of ‘x’ where ‘i’ indicates the individuals to be kept and ‘j’ the genes to be retained. This subsetting follows R standards and accepts integer, character or logical vectors, so that handling multidna or multiphyDat objects should be as natural as handling usual objects (vectors and matrices). In addition to the easy subsetting and reordering of individuals and genes, the generic function ‘concatenate’ can be used to merge several genes into a single alignment. By default, all genes in the object are used, but an optional argument selects which genes to include in the alignment.
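A short sketch of subsetting and concatenation, again on a toy object built from ape's woodmouse data (the two-gene split is an assumption for illustration). The name of the optional argument of ‘concatenate’ is taken here to be ‘genes’; this is not stated in the text above and should be checked against ?concatenate.

```r
library(ape)
library(apex)

data(woodmouse)
# Toy two-gene object (arbitrary split of a single alignment)
x <- new("multidna", list(gene1 = woodmouse[, 1:500],
                          gene2 = woodmouse[, 501:965]))

# x[i, j]: first five individuals (integer index),
# second gene selected by name (character index)
sub <- x[1:5, "gene2"]
sub

# Merge all genes into a single DNAbin alignment
aln <- concatenate(x)
dim(aln)

# Merge selected genes only; 'genes' is the assumed argument name
aln1 <- concatenate(x, genes = "gene1")
dim(aln1)
```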

Importing and exporting data

Building on resources provided in ape (Paradis et al. 2004; Popescu et al. 2012), phangorn (Schliep 2011) and adegenet (Jombart 2008; Jombart & Ahmed 2011), apex provides functions to import and export data from and to a variety of formats (Table 3). multidna and multiphyDat objects can both be created in R using the constructor ‘new(a, …)’, where ‘a’ is a character string indicating the class of the object (‘multidna’ or ‘multiphyDat’), and ‘…’ is a list of arguments passed to the constructor, the main one being a list of objects (character, DNAbin or phyDat matrices) storing DNA or amino acid sequences. However, it is likely that in most cases, genetic sequences of different loci will be stored in separate text files, using one file per gene. Three functions import data from a list of files directly into apex. The functions ‘read.multidna’ and ‘read.multiFASTA’ build upon ape's procedures to read data in interleaved, sequential, Clustal or FASTA format. In addition, the function ‘read.multiphyDat’ imports amino acid sequences in interleaved, sequential or FASTA format. Note that in this case, the resulting multiphyDat object can no longer be converted to multidna, which is restricted to nucleotide sequences only. Once imported in apex, data can also be converted to various formats (Table 3). Conversion from multidna to multiphyDat is implemented by multidna2multiphyDat, while the reverse operation is performed by multiphyDat2multidna. Single‐nucleotide polymorphisms (SNPs) can be extracted from the alignments and translated into a genind object, thereby providing access to a wide range of multivariate analyses (Jombart et al. 2008, 2009, 2010b).
As an alternative, each unique gene sequence can be treated as a separate allele, generalizing the multilocus sequence type (MLST) approach which proved highly useful for classifying clonal organisms in microbiology (Maiden et al. 1998; Aanensen & Spratt 2005; Maiden 2006), and could be equally useful for large gene collections of nonclonal organisms.

Table 3

Functions for importing and exporting data in apex

Function	Input	Output	Notes
read.multidna	Interleaved, sequential, clustal, fasta files	multidna	Based on read.dna (ape package)
read.multiFASTA	Fasta files	multidna	Based on read.FASTA (ape package)
read.multiphyDat	Interleaved, fasta	multiphyDat	Based on read.phyDat (phangorn package); can read amino acid sequences
multidna2multiphyDat	multidna	multiphyDat
multiphyDat2multidna	multiphyDat	multidna	Only works for DNA sequences
multidna2genind	multidna	genind a	Extract either SNPs or haplotypes
multiphyDat2genind	multiphyDat	genind a	Extract either SNPs or haplotypes

a Base class for genetic markers in the adegenet package.
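The conversions listed in Table 3 can be sketched as below. The toy object is built from ape's woodmouse data with an arbitrary two-gene split (an assumption for illustration). The ‘mlst’ argument used to request one allele per unique gene sequence is the argument name as recalled from the package documentation and should be checked against ?multidna2genind.

```r
library(ape)
library(apex)

data(woodmouse)
# Toy two-gene object (arbitrary split of a single alignment)
x <- new("multidna", list(gene1 = woodmouse[, 1:500],
                          gene2 = woodmouse[, 501:965]))

# multidna -> multiphyDat, and back again (DNA only)
y <- multidna2multiphyDat(x)
z <- multiphyDat2multidna(y)

# SNPs extracted into a genind object (adegenet)
snps <- multidna2genind(x)

# Alternative coding: each unique gene sequence treated as an allele
# ('mlst' is the assumed argument name)
types <- multidna2genind(x, mlst = TRUE)

class(y)
class(z)
class(snps)
```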

Accessors

A set of functions is provided to access the slots of multidna and multiphyDat objects. For example, the number of loci and their names can be obtained with the getNumLoci and getLocusNames functions, respectively. The setLocusNames function can also be used to set the names in the same way that the names, colnames and rownames functions work on standard R objects. The number of sequences at each locus stored in the @dna slot can be obtained with the getNumSequences function. Counts for specific loci can be produced by providing a vector of their names to the loci argument of this function. By default, only sequences not composed entirely of gaps are counted; however, if the exclude.gap.only argument is FALSE, the number of all sequences is returned. Names of sequences at each locus can be obtained with the getSequenceNames function, whose exclude.gap.only argument works in the same way. The sequences themselves can be returned with the getSequences function. By default, this returns the list of DNAbin‐formatted sequences stored in the @dna slot. The sequences can be filtered for specific individuals and loci by providing a character vector to the ids and loci arguments, respectively. If only a single locus is returned and the simplify argument is TRUE (the default), then the return value is a single DNAbin object. If simplify is FALSE, the function will always return a list of DNAbin objects.
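The accessors described above can be tried on a small object. The sketch below uses ape's woodmouse data with an arbitrary two-gene split in which the second gene lacks five individuals (an assumption for illustration); the replacement form ‘setLocusNames(x) <- value’ is inferred from the comparison with names(), colnames() and rownames() and should be checked against the package documentation.

```r
library(ape)
library(apex)

data(woodmouse)
# Toy object: gene2 is missing the last five individuals
x <- new("multidna", list(gene1 = woodmouse[, 1:500],
                          gene2 = woodmouse[1:10, 501:965]))

getNumLoci(x)
getLocusNames(x)

# Sequence counts per locus, without and with gap-only sequences
n.real <- getNumSequences(x)
n.all  <- getNumSequences(x, exclude.gap.only = FALSE)

# Sequence names per locus
getSequenceNames(x)

# One locus with simplify = TRUE (default): a single DNAbin object
g1 <- getSequences(x, loci = "gene1")

# simplify = FALSE: always a list of DNAbin objects
all.genes <- getSequences(x, simplify = FALSE)

# Rename loci (assumed replacement-function syntax)
setLocusNames(x) <- c("locusA", "locusB")
getLocusNames(x)
```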

Analysing data

As data visualization is often the first step in data analysis, we implemented a plot method for multidna objects, which displays the separate alignments simultaneously. This can be useful for a quick assessment of the quality of different alignments, or patterns of missing data (N or other ambiguous nucleotides) or alignment gaps. As each gene is an item of a list (the slots @dna and @seq, Tables 1 and 2), any operation usually carried out on a single gene can be applied to all genes using the base R function lapply. However, to further facilitate the analysis of sequences from multiple genes, we provide functions for applying the most common phylogenetic analyses to multiple gene data. The 17 different genetic distances implemented in ape can be computed using multidna objects, either separately for each gene or after concatenating them, using dist.multidna. The function getTree provides a wrapper for a number of phylogenetic reconstruction methods implemented in ape, including different versions of neighbour‐joining (Saitou & Nei 1987; Gascuel 1997) and minimum evolution trees (Desper & Gascuel 2002).
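A sketch of these analysis steps on a toy object (ape's woodmouse data, arbitrarily split into two genes for illustration). The ‘pool’ argument of getTree appears in the worked example of the paper; using the same argument with dist.multidna is an assumption based on the package documentation as recalled.

```r
library(ape)
library(apex)

data(woodmouse)
# Toy two-gene object (arbitrary split of a single alignment)
x <- new("multidna", list(gene1 = woodmouse[, 1:500],
                          gene2 = woodmouse[, 501:965]))

# Visualize all alignments at once
plot(x)

# Any single-gene operation can be applied to every gene with lapply,
# e.g. base frequencies (ape::base.freq) per gene
freqs <- lapply(x@dna, base.freq)

# Pairwise genetic distances: one 'dist' object per gene ...
d.sep <- dist.multidna(x)
# ... or a single 'dist' object pooling the genes (assumed argument)
d.pool <- dist.multidna(x, pool = TRUE)

# One tree per gene, or a single tree for the pooled genes
trees <- getTree(x)
tree  <- getTree(x, pool = TRUE)
```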

Worked example

We illustrate a typical workflow of phylogenetic analysis in apex using data on eight species of New World chickadees (Aves: Paridae) typed on four nuclear loci (Harris et al. 2014). This data set is distributed with apex in two forms: a multidna object (data set ‘chickadees’) and as raw DNA alignments. We use the latter in this example. We assume all DNA alignments are stored as FASTA files in the working directory. After loading apex, we read and process the DNA alignments by selecting all files with ‘.fasta’ extension (this uses the base R command dir(pattern=".fasta")) using the apex function read.multiFASTA:

> library(apex)
> x <- read.multiFASTA(dir(pattern=".fasta"))
> x
=== multidna ===
[32 DNA sequences in 4 genes]
@n.ind: 8 individuals
@n.seq: 32 sequences in total
@n.seq.miss: 8 gap-only (missing) sequences
@labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 ...
@dna: (list of DNAbin matrices)
...

Inspecting ‘x’, for example, by printing it (the full output, too long to be displayed, has been cut here) will indicate that the data consist of 32 DNA sequences coming from four different alignments which pooled together document 8 taxa. Eight gap‐only sequences were added by default for missing sequences, so that all DNAbin matrices in ‘x’ correspond to the same taxa (Fig. 1). The actual number of taxa present in each alignment is as follows:

> getNumSequences(x)
patr_poat43 patr_poat47 patr_poat48 patr_poat49
          5           6           8           5

Figure 1

Individual and concatenated sequence alignments of chickadees data. The first four graphs are a plot of a multidna object containing DNA alignments for four different loci (patr_poat 43, 47, 48 and 49). The fifth graph displays the concatenated alignment.

To ensure matching of taxa across genes, gap‐only sequences have been added where sequences were missing in ‘x’. For phylogenetic reconstruction, however, these gap sequences are best removed:

> x.nogaps <- read.multiFASTA(dir(pattern=".fasta"), gaps=FALSE)

Alignments can be visualized using the simple plot function (Fig. 1). A concatenated alignment (Fig. 1) can be obtained using the following:

> y <- concatenate(x)

Individual phylogenies can be obtained for each locus using the following:

> trees <- getTree(x.nogaps)

By default, getTree produces neighbour‐joining trees using Hamming distances, using pairwise deletions, enforcing non‐negative branch lengths and ladderizing the trees. All of these can be customized using any distance and tree algorithms implemented in ape (Paradis et al. 2004), through arguments passed to getTree. Alternatively, a single tree can be obtained after concatenation of the alignments using the following:

> tree <- getTree(x, pool=TRUE)

Despite the modest size of this data set, results show substantial phylogenetic incongruence amongst different loci (Fig. 2), stressing the need for considering several informative loci for achieving a more robust phylogenetic reconstruction (Harris et al. 2014).

Figure 2

Phylogenies of the chickadees data. The function getTree was used to obtain individual phylogenies for each locus (patr_poat43, 47, 48 and 49), and from the concatenated alignment (central phylogeny). Each phylogeny is an unrooted, neighbour‐joining tree based on Hamming distances between DNA sequences. Taxa are identified using colours to facilitate comparison of the trees.
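A display in the spirit of Fig. 2 can be sketched as follows. This is not the code used to produce the figure: it uses a toy two-gene object built from ape's woodmouse data (arbitrary split, for illustration only), draws the trees with ape's plot method, and adds a topological distance from ape::dist.topo as one possible numerical summary of incongruence, which is not part of the analysis described in the paper.

```r
library(ape)
library(apex)

data(woodmouse)
# Toy two-gene object (arbitrary split of a single alignment)
x <- new("multidna", list(gene1 = woodmouse[, 1:500],
                          gene2 = woodmouse[, 501:965]))

trees  <- getTree(x)               # one tree per locus
pooled <- getTree(x, pool = TRUE)  # single tree for the pooled loci

# Draw per-locus trees and the pooled tree side by side
op <- par(mfrow = c(1, length(trees) + 1), mar = c(1, 1, 3, 1))
for (i in seq_along(trees)) {
  plot(trees[[i]], type = "unrooted", main = getLocusNames(x)[i])
}
plot(pooled, type = "unrooted", main = "pooled")
par(op)

# Topological distance between the two locus trees
# (0 would indicate identical topologies)
topo.d <- dist.topo(trees[[1]], trees[[2]])
topo.d
```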

Discussion

Phylogenetic analysis of multiple gene data can be a daunting task, requiring substantial prior data processing (e.g. comparing sequenced taxa across different loci, handling missing data) and replication and comparison of the analyses over all loci. So far, this could only be achieved in R using multiple ad hoc scripts (e.g. Bryson et al. 2013; Harris et al. 2013; Grummer et al. 2015), thereby limiting the reproducibility and reliability of the analyses, and discouraging the investigation of potential phylogenetic incongruence (Som 2015). The aim of apex is to streamline this process and make phylogenetic reconstruction with multiple genes as straightforward as with a single locus. To achieve this goal, apex implements new object classes which extend existing standards for DNA and amino acid sequences. It provides simple functions for importing multiple alignment data from classical formats and software, seamlessly taking care of re‐ordering and matching taxa across loci, identifying missing sequences, and optionally adding gap‐only sequences to ensure matching of taxa for all loci. The resulting objects can be handled easily using various accessor functions, procedures for subsetting data by taxa, locus or site, and for concatenating selected loci. Additionally, apex implements a series of wrappers for the visualization and analysis of multiple gene data and implements a generalization of sequence‐type polymorphism (Maiden et al. 1998; Aanensen & Spratt 2005; Maiden 2006) directly compatible with standard population genetics methods (Goudet 2005; Jombart 2008; Paradis 2010; Jombart & Ahmed 2011). This latter feature should be particularly useful for deriving fast classifications of organisms (e.g. ‘sequence types’ in bacteria) from large genomic data sets. Arguably, dedicated statistical approaches are needed for an in‐depth investigation of phylogenetic discrepancies exhibited by multilocus data (Hillis et al. 2005; Knowles 2009; Jombart et al. 2015; Som 2015).
By providing support for the representation, visualization and basic analysis of these data, apex should facilitate and hopefully encourage the development of new methods for exploring phylogenetic incongruence. As such, we believe it is a worthwhile addition to the growing platform for genetic sequence analysis in R.

T.J., F.A., K.S. and Z.K. developed the package. R.H. provided data sets. J.G., E.P. and H.L. provided advice for the package design. T.J., F.A. and H.L. wrote the manuscript.

Software availability

The stable version of apex is released on the Comprehensive R Archive Network (CRAN): http://cran.r-project.org/web/packages/apex/index.html and can be installed in R by typing the following:

> install.packages("apex")

The development version of apex is hosted on GitHub: https://github.com/thibautjombart/apex

apex is distributed under the GNU General Public Licence (GPL) version 2 or greater. It is fully documented in a vignette accessible by typing the following:

> vignette("apex")

29 in total

1. Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle.

Authors: Richard Desper; Olivier Gascuel
Journal: J Comput Biol Date: 2002 Impact factor: 1.479

2. Estimating species trees: methods of phylogenetic analysis when there is incongruence across genes.

Authors: L Lacey Knowles
Journal: Syst Biol Date: 2009-09-17 Impact factor: 15.683

3. The influence of sampling design on species tree inference: a new relationship for the New World chickadees (Aves: Poecile).

Authors: Rebecca B Harris; Matthew D Carling; Irby J Lovette
Journal: Evolution Date: 2013-10-24 Impact factor: 3.694

4. pegas: an R package for population genetics with an integrated-modular approach.

Authors: Emmanuel Paradis
Journal: Bioinformatics Date: 2010-01-14 Impact factor: 6.937

5. adegenet: a R package for the multivariate analysis of genetic markers.

Authors: Thibaut Jombart
Journal: Bioinformatics Date: 2008-04-08 Impact factor: 6.937

6. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data.

Authors: O Gascuel
Journal: Mol Biol Evol Date: 1997-07 Impact factor: 16.240

Review 7. Causes, consequences and solutions of phylogenetic incongruence.

Authors: Anup Som
Journal: Brief Bioinform Date: 2014-05-27 Impact factor: 11.622

8. The neighbor-joining method: a new method for reconstructing phylogenetic trees.

Authors: N Saitou; M Nei
Journal: Mol Biol Evol Date: 1987-07 Impact factor: 16.240

9. Estimating the temporal and spatial extent of gene flow among sympatric lizard populations (genus Sceloporus) in the southern Mexican highlands.

Authors: Jared A Grummer; Martha L Calderón-Espinosa; Adrián Nieto-Montes de Oca; Eric N Smith; Fausto R Méndez-de la Cruz; Adam D Leaché
Journal: Mol Ecol Date: 2015-04 Impact factor: 6.185

10. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations.

Authors: Thibaut Jombart; Sébastien Devillard; François Balloux
Journal: BMC Genet Date: 2010-10-15 Impact factor: 2.797
