
The Prevalence and Impact of Model Violations in Phylogenetic Analysis.

Suha Naser-Khdour¹, Bui Quang Minh¹,², Wenqi Zhang¹, Eric A Stone¹, Robert Lanfear¹.

Abstract

In phylogenetic inference, we commonly use models of substitution which assume that sequence evolution is stationary, reversible, and homogeneous (SRH). Although the use of such models is often criticized, the extent of SRH violations and their effects on phylogenetic inference of tree topologies and edge lengths are not well understood. Here, we introduce and apply the maximal matched-pairs tests of homogeneity to assess the scale and impact of SRH model violations on 3,572 partitions from 35 published phylogenetic data sets. We show that roughly one-quarter of all the partitions we analyzed (23.5%) reject the SRH assumptions, and that for 25% of data sets, tree topologies inferred from all partitions differ significantly from topologies inferred using the subset of partitions that do not reject the SRH assumptions. This proportion increases when comparing trees inferred using the subset of partitions that rejects the SRH assumptions, to those inferred from partitions that do not reject the SRH assumptions. These results suggest that the extent and effects of model violation in phylogenetics may be substantial. They highlight the importance of testing for model violations and possibly excluding partitions that violate models prior to tree reconstruction. Our results also suggest that further effort in developing models that do not require SRH assumptions could lead to large improvements in the accuracy of phylogenomic inference. The scripts necessary to perform the analysis are available in https://github.com/roblanf/SRHtests, and the new tests we describe are available as a new option in IQ-TREE (http://www.iqtree.org).


Keywords: model violations; phylogenetic inference; systematic bias; test of symmetry


Year: 2019 PMID: 31536115 PMCID: PMC6893154 DOI: 10.1093/gbe/evz193

Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416

Introduction

Phylogenetics is an essential tool for inferring evolutionary relationships between individuals, species, genes, and genomes. Moreover, phylogenetic trees form the basis of a huge range of other inferences in evolutionary biology, from gene function prediction to drug development and forensics (Eisen 1998; Farrell et al. 2000; Mäser et al. 2001; Gardner et al. 2002; Yao et al. 2003, 2004; Grenfell et al. 2004; Salipante and Horwitz 2006; Gray et al. 2009; Brady and Salzberg 2011; Dunn et al. 2011). Most phylogenetic studies use models of sequence evolution which assume that the evolutionary process follows stationary, reversible, and homogeneous (SRH) conditions. Stationarity implies that the marginal frequencies of the nucleotides or amino acids are constant over time, reversibility implies that the evolutionary process is stationary and undirected (substitution rates between nucleotides or amino acids are equal in both directions), and homogeneity implies that the instantaneous substitution rates are constant along the tree or over an edge (Felsenstein 2004; Yang and Rannala 2012; Jermiin et al. 2017). However, these simplifying assumptions are often violated by real data (Foster and Hickey 1999; Tarrío et al. 2001; Paton et al. 2002; Goremykin and Hellwig 2005; Murray et al. 2005; Bourlat et al. 2006; Hyman et al. 2007; Sheffield et al. 2009; Nesnidal et al. 2010; Nabholz et al. 2011; Martijn et al. 2018). Such model violation may lead to systematic error that, unlike stochastic error, cannot be remedied simply by increasing the size of a data set (Felsenstein 2004; Ho and Jermiin 2004; Jermiin et al. 2004; Philippe et al. 2005; Sullivan and Joyce 2005; Kumar et al. 2012; Brown and Thomson 2017; Duchene et al. 2017). 
As phylogenetic data sets are steadily growing in terms of taxonomic and site sampling, it is vital that we develop and employ methods to measure and understand the extent to which systematic error affects phylogenetic inference (systematic bias), and explore ways of mitigating this systematic bias in empirical studies. One approach to accommodate data that have evolved under non-SRH conditions is to employ models that relax the SRH assumptions. A number of non-SRH models have been implemented in a variety of software packages (Foster 2004; Lartillot and Philippe 2004; Blanquart and Lartillot 2006; Boussau and Gouy 2006; Jayaswal et al. 2007, 2011, 2014; Knight et al. 2007; Dutheil and Boussau 2008; Sumner et al. 2012; Zou et al. 2012; Groussin et al. 2013; Nguyen et al. 2015; Woodhams et al. 2015). However, such models remain infrequently used as searching for optimal phylogenetic trees under these models is computationally demanding (Betancur-r et al. 2013) and the implementations are often not easy to use. As a result, the vast majority of empirical phylogenetic inferences rely on models that assume sequences have evolved under SRH conditions, such as the general time reversible family of models implemented in many of the most widely used phylogenetics software packages (Swofford 2001; Drummond and Rambaut 2007; Guindon et al. 2010; Ronquist et al. 2012; Bazinet et al. 2014; Bouckaert et al. 2014; Stamatakis 2014; Nguyen et al. 2015; Höhna et al. 2016). Another approach to accounting for data that may have evolved under non-SRH conditions is to test for model violations prior to tree reconstruction. Here, one first screens data sets or parts of data sets, and reconstructs trees exclusively from data that do not reject SRH conditions. 
A number of methods have been proposed to test for violation of SRH conditions in aligned sequences prior to estimating trees (Bowker 1948; Stuart 1955; Rzhetsky and Nei 1995; Kumar and Gadagkar 2001; Weiss and von Haeseler 2003; Ababneh et al. 2006; Ho et al. 2006), and there are also a posteriori tests for absolute model adequacy which are employed after trees have been estimated (Goldman 1993; Bollback 2002; Brown and ElDabaje 2009; Brown 2014; Duchene et al. 2017; Brown and Thomson 2018). Allowing the data to reject the model when the assumptions of the model are violated is an important approach to reducing systematic bias in phylogenetic inference (Philippe et al. 2005; Brown 2014). Knowing in advance which sequences and loci are inconsistent with the SRH assumptions will allow us to choose more complex models or to omit some of these sequences and loci from downstream analyses (Kumar and Gadagkar 2001). The need for methods that assess the evolutionary process prior to phylogenetic inference becomes more important as the number of sequences and sites per data set increases, because systematic bias has an increasing effect on inferences from larger phylogenetic data sets (Ho and Jermiin 2004; Jermiin et al. 2004; Phillips et al. 2004; Delsuc et al. 2005). In this article, we evaluate the extent and effect of model violation due to non-SRH evolution using 35 empirical data sets with a total of 3,572 partitions. We determine if the SRH assumptions are violated by extending and applying the matched-pairs tests of homogeneity (Jermiin et al. 2017) to each partition. We then compare the phylogenetic trees for each data set estimated from all of the partitions, the partitions that reject the SRH assumptions, and the partitions that do not reject the SRH assumptions, in order to evaluate the effect of violating SRH conditions on phylogenetic inference. Our results suggest that violating SRH assumptions can have substantial impacts on phylogenetic inference.

Materials and Methods

Empirical Data Sets

In order to assess the impact of model violation in phylogenetics, we first gathered a representative sample of 35 partitioned empirical data sets that had been used for phylogenetic analysis in recent studies (table 1). Within the constraints of selecting data that were publicly available and suitably annotated, that is, such that all loci and all codon positions within protein-coding loci could be identified, we selected the data sets to provide as representative a sample as possible of the data types, taxa, and genomic regions most commonly used to infer bifurcating phylogenetic trees from concatenated alignments. These data sets include nucleotide sequences from nuclear, mitochondrial, plastid, and virus genomes, and include protein-coding DNA, introns, intergenic spacers, tRNA, rRNA, and ultraconserved elements. The number of taxa and sites in these data sets range from 27 to 355 and from 699 to 1,079,052, respectively. The clades represented in these data sets include animals, plants, and viruses. We partitioned all data sets to the maximum possible extent based on the biological properties of the data, that is, we divided every locus and every codon position within each protein-coding locus into a separate partition. All partitioning information is available at the github repository (https://github.com/roblanf/SRHtests/tree/master/datasets), and the full details of every data set are provided in table 1 and in supplementary extended table 5, Supplementary Material online.
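The partitioning scheme described above (one partition per noncoding locus, and one per codon position within each protein-coding locus) can be illustrated with a short sketch. This is not the authors' script: the function name, the tuple layout, and the example loci are hypothetical, and the output uses the NEXUS `charset` convention in which `\3` selects every third site.

```python
from typing import List, Tuple

# (name, first site, last site, is protein-coding); coordinates are 1-based and inclusive.
Locus = Tuple[str, int, int, bool]


def codon_partitions(loci: List[Locus]) -> List[str]:
    """Return NEXUS charset lines that split coding loci by codon position."""
    lines = []
    for name, start, end, coding in loci:
        if coding:
            # One charset per codon position: start, start+1, start+2, each stepping by 3.
            for pos in range(3):
                lines.append(
                    "charset {}_pos{} = {}-{}\\3;".format(name, pos + 1, start + pos, end)
                )
        else:
            # Noncoding loci (introns, rRNA, UCEs, ...) stay as a single partition.
            lines.append("charset {} = {}-{};".format(name, start, end))
    return lines


if __name__ == "__main__":
    # Hypothetical loci, for illustration only.
    for line in codon_partitions([("COI", 1, 300, True), ("ITS", 301, 500, False)]):
        print(line)
```

The sketch assumes that each coding locus starts in frame at its first site; real alignments may need a per-locus frame offset.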

Table 1

Number of Taxa, Number of Sites, Clade, and Study Reference for Each Data Set That Have Been Used in This Study

Data Set	Study References	Data Set References	Clade	Taxa	Sites
Anderson_2013	Anderson et al. (2014)	Anderson et al. (2013)	Loliginids	145	3,037
Bergsten_2013	Bergsten et al. (2013)	Bergsten et al. (2013)	Dytiscidae	38	2,111
Broughton_2013	Broughton et al. (2013)	Broughton et al. (2013)	Osteichthyes	61	19,997
Brown_2012	Brown et al. (2012)	Brown et al. (2012)	Ptychozoon	41	1,665
Cannon_2016a	Cannon et al. (2016)	Cannon et al. (2016)	Metazoa	78	89,792
Cognato_2001	Cognato and Vogler (2001)	Cognato and Vogler (2001)	Coleoptera: Scolytinae	44	1,897
Day_2013	Day et al. (2013)	Day et al. (2013)	Synodontis	152	3,586
Devitt_2013	Devitt et al. (2013)	Devitt et al. (2013)	Ensatina eschscholtzii klauberi	69	823
Dornburg_2012	Dornburg et al. (2012)	Dornburg et al. (2012)	Teleostei: Beryciformes: Holocentridae	44	5,919
Faircloth_2013	Faircloth et al. (2013)	Faircloth et al. (2013)	Actinopterygii	27	149,366
Fong_2012	Fong et al. (2012)	Fong et al. (2012)	Vertebrata	110	25,919
Horn_2014	Horn et al. (2014)	Horn et al. (2014)	Euphorbia	197	11,587
Kawahara_2013	Kawahara and Rubinoff (2013)	Kawahara and Rubinoff (2013)	Hyposmocoma	70	2,238
Lartillot_2012	Lartillot and Delsuc (2012)	Lartillot and Delsuc (2012)	Eutheria	78	15,117
McCormack_2013	McCormack et al. (2013)	McCormack et al. (2013)	Neoaves	33	1,079,052
Moyle_2016	Moyle et al. (2016)	Moyle et al. (2016)	Oscines	106	375,172
Murray_2013	Murray et al. (2013)	Murray et al. (2013)	Eucharitidae	237	3,111
Oaks_2011	Oaks (2011)	Oaks (2011)	Crocodylia	79	7,282
Rightmyer_2013	Rightmyer et al. (2013)	Rightmyer et al. (2013)	Hymenoptera: Megachilidae	94	3,692
Sauquet_2011	Sauquet et al. (2012)	Sauquet et al. (2011)	Nothofagus	51	5,444
Seago_2011	Seago et al. (2011)	Seago et al. (2011)	Coccinellidae	97	2,253
Sharanowski_2011	Sharanowski et al. (2011)	Sharanowski et al. (2011)	Braconidae	139	3,982
Siler_2013	Siler et al. (2013)	Siler et al. (2013)	Lycodon	61	2,697
Tolley_2013	Tolley et al. (2013)	Tolley et al. (2013)	Chamaeleonidae	203	5,054
Unmack_2013	Unmack et al. (2013)	Unmack et al. (2013)	Melanotaeniidae	139	6,827
Wainwright_2012	Wainwright et al. (2012)	Wainwright et al. (2012)	Acanthomorpha	188	8,439
Wood_2012	Wood et al. (2013)	Wood et al. (2012)	Archaeidae	37	5,185
Worobey_2014a	Worobey et al. (2014)	Worobey et al. (2014)	Influenzavirus A	146	3,432
Worobey_2014b	Worobey et al. (2014)	Worobey et al. (2014)	Influenzavirus A	327	759
Worobey_2014c	Worobey et al. (2014)	Worobey et al. (2014)	Influenzavirus A	92	1,416
Worobey_2014d	Worobey et al. (2014)	Worobey et al. (2014)	Influenzavirus A	355	1,497
Worobey_2014e	Worobey et al. (2014)	Worobey et al. (2014)	Influenzavirus A	340	699
Worobey_2014f	Worobey et al. (2014)	Worobey et al. (2014)	Influenzavirus A	332	2,151
Worobey_2014g	Worobey et al. (2014)	Worobey et al. (2014)	Influenzavirus A	326	2,274
Worobey_2014h	Worobey et al. (2014)	Worobey et al. (2014)	Influenzavirus A	351	2,280


Workflow Summary

Figure 1 outlines the workflow. For each partition in each data set, we used a new approach based on the three matched-pairs tests of homogeneity to ask whether the evolution of the aligned sequences in the partition rejects the SRH assumptions. The three matched-pairs tests of homogeneity, described in more detail below, test three slightly different assumptions about the historical process that generated each aligned pair of sequences in a given partition. A significant result from any test suggests that the nature of the evolutionary process required to explain the aligned sequences violates at least one of the three SRH conditions (Jermiin et al. 2017). For each test, we classify each partition as pass if the result of the test is nonsignificant or fail if the result of the test is significant. We then denote the original data set as Dall, while the concatenation of pass partitions is denoted Dpass and the concatenation of fail partitions as Dfail (fig. 1).

Fig. 1. Flow chart of methodology. For each partition in the alignment, we choose the pair of sequences with the maximum divergence and apply the matched-pairs tests of homogeneity on that pair.

To investigate the impact of model violation on phylogenetic inference, we infer and compare three phylogenetic trees, Tall, Tpass, and Tfail, estimated from Dall, Dpass, and Dfail, respectively.

Matched-Pairs Tests of Homogeneity

The three matched-pairs tests of homogeneity that are applied to pairs of sequences are: the MPTS (matched-pairs test of symmetry), MPTMS (matched-pairs test of marginal symmetry), and MPTIS (matched-pairs test of internal symmetry). The statistics are computed on an m-by-m (m is 4 for nucleotides and 20 for amino acids) divergence matrix $D$ with elements $d_{ij}$, where $d_{ij}$ is the number of alignment sites having nucleotide (or amino acid) $i$ in the first sequence and nucleotide (or amino acid) $j$ in the second sequence. The MPTS tests the symmetry of $D$ by computing Bowker's (1948) test statistic as the χ2 distance between $D$ and its transpose:

$$S_B^2 = \sum_{i<j} \frac{(d_{ij} - d_{ji})^2}{d_{ij} + d_{ji}},$$

where $d_{ij} + d_{ji} > 0$. A P value is then obtained by a χ2 test with $f$ degrees of freedom, where $f$ is the number of pairs $(i, j)$ with $i < j$ for which $d_{ij} + d_{ji} > 0$. A small P value (e.g., <0.05) indicates that the assumption of symmetry is rejected at that significance level, suggesting that evolution is nonstationary, nonhomogeneous, or both (Jermiin et al. 2017). The MPTMS tests the equality of nucleotide or amino acid composition between two sequences. To do so, MPTMS computes Stuart's test statistic, $S_S^2 = \mathbf{u}^{T} \mathbf{V}^{-1} \mathbf{u}$, using the difference between nucleotide or amino acid frequencies of two sequences, $\mathbf{u}$, and its variance–covariance matrix, $\mathbf{V}$. In detail, $\mathbf{u}$ is given by

$$\mathbf{u} = (d_{1\bullet} - d_{\bullet 1},\; d_{2\bullet} - d_{\bullet 2},\; \ldots,\; d_{k\bullet} - d_{\bullet k})^{T},$$

where $d_{i\bullet}$ is the sum of $d_{ij}$ over j, $d_{\bullet j}$ is the sum of $d_{ij}$ over i, and k = m−1. $\mathbf{V}$, the estimated variance–covariance matrix of $\mathbf{u}$ under the assumption of marginal symmetry, is defined elementwise by:

$$v_{ij} = \begin{cases} d_{i\bullet} + d_{\bullet i} - 2 d_{ii}, & i = j, \\ -(d_{ij} + d_{ji}), & i \neq j. \end{cases}$$

A P value is obtained by a χ2 test with m−1 degrees of freedom. A small P value (<0.05) indicates that the stationarity assumption is rejected. Note that when $\mathbf{V}$ is not invertible, Stuart's statistic is ill-defined and the MPTMS is not applicable. The MPTIS uses as its test statistic the difference between Bowker's and Stuart's statistics: $S_I^2 = S_B^2 - S_S^2$. $S_I^2$ is χ2 distributed with $f - (m - 1)$ degrees of freedom. A small P value (<0.05) indicates that the homogeneity assumption is rejected.
The MPTS, MPTMS, and MPTIS test different aspects of the symmetry with which differences accumulate between pairs of sequences due to the substitution process. The MPTS is a comprehensive and sufficient test to determine whether the data comply with the SRH assumptions (Jermiin et al. 2017), but it cannot provide any information about the source of this violation. Some information on the underlying source of model violation may be obtained by performing the other two tests of symmetry: the MPTMS and the MPTIS. If the violation of the SRH assumptions stems from differences in base composition between the sequences, this should affect the marginal symmetry of the sequence pair, which can in principle be detected by the MPTMS. If the violation of the SRH assumptions stems from changes in the relative substitution rates over time, this should affect the internal symmetry of the sequence pair, which can in principle be detected by the MPTIS. However, even after performing all three tests, it is difficult to ascertain which of the three SRH assumptions is violated during the evolutionary process because the relationships between the SRH conditions and the three matched-pair tests is neither bijective nor injective, that is, there is not a one-to-one correspondence between the three tests and violation of the three SRH conditions (Jermiin et al. 2017). The three matched-pairs tests of homogeneity are appropriate to test for SRH assumptions as they consider the alignment on a site-by-site basis. The basic intuition that underlies these tests is that two sequences diverging under SRH conditions should accumulate differences symmetrically (e.g., both sequences are equally likely to accumulate at a C to T change at a site in which both originally shared a C). This symmetry of accumulation is reflected by symmetries in the resulting difference matrix, violations of which can be assessed statistically. 
However, these tests were designed to ask whether any single pair of sequences rejects the SRH conditions (Jermiin et al. 2017). To ask whether a given partition rejects SRH conditions, we developed an approach to extend the matched-pairs tests of homogeneity to accommodate data sets with more than two sequences.
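As a rough illustration of how the three statistics relate to one another, the following sketch computes Bowker's, Stuart's, and the internal-symmetry statistic from a single divergence matrix, where `d[i][j]` counts the sites with state i in the first sequence and state j in the second. It is a minimal pure-Python sketch rather than the IQ-TREE implementation: the function names are hypothetical, the linear system is solved with a small hand-written Gaussian elimination, and the conversion of each statistic to a P value (for example with `scipy.stats.chi2.sf(stat, df)`) is left out.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def bowker(d: Matrix) -> Tuple[float, int]:
    """Bowker's statistic and its degrees of freedom (pairs with d_ij + d_ji > 0)."""
    m = len(d)
    stat, df = 0.0, 0
    for i in range(m):
        for j in range(i + 1, m):
            total = d[i][j] + d[j][i]
            if total > 0:
                stat += (d[i][j] - d[j][i]) ** 2 / total
                df += 1
    return stat, df


def solve(a: Matrix, b: List[float]) -> Optional[List[float]]:
    """Solve a x = b by Gaussian elimination with partial pivoting; None if singular."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - rest) / aug[r][r]
    return x


def stuart(d: Matrix) -> Optional[Tuple[float, int]]:
    """Stuart's statistic u^T V^-1 u with m-1 degrees of freedom; None if V is singular."""
    m = len(d)
    k = m - 1
    row = [sum(d[i]) for i in range(m)]
    col = [sum(d[i][j] for i in range(m)) for j in range(m)]
    u = [row[i] - col[i] for i in range(k)]
    v = [
        [(row[i] + col[i] - 2 * d[i][i]) if i == j else -(d[i][j] + d[j][i])
         for j in range(k)]
        for i in range(k)
    ]
    x = solve(v, u)  # x = V^-1 u, so the statistic is the dot product u . x
    if x is None:
        return None
    return sum(ui * xi for ui, xi in zip(u, x)), k


def internal_symmetry(d: Matrix) -> Optional[Tuple[float, int]]:
    """Difference between Bowker's and Stuart's statistics and degrees of freedom."""
    s_b, df_b = bowker(d)
    marginal = stuart(d)
    if marginal is None:
        return None
    s_s, df_s = marginal
    return s_b - s_s, df_b - df_s
```

For two states the symmetry and marginal-symmetry tests coincide, which gives a convenient sanity check: with `d = [[10, 5], [15, 20]]` both statistics reduce to (5 − 15)² / (5 + 15).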

Maximum Symmetry Test

In order to determine whether a given multiple sequence alignment rejects SRH conditions, we consider only the pair of taxa with the maximum divergence. To find the most divergent pair, we compute a divergence score for every pair of sequences by summing the off-diagonal elements of that pair's divergence matrix and dividing by the sum of all of its elements. We then randomly choose one pair from all the pairs with the maximum divergence score (if there is more than one such pair). By using the most divergent sequence pair, we maximize our power to detect model violations without a priori knowledge of the underlying tree topology and the dependencies that it induces in the data. For the most divergent pair, we then apply the matched-pairs tests of homogeneity and calculate their χ2 P values. If the obtained P value is <0.05, then we consider that the null hypothesis of SRH evolution is rejected for the corresponding partition and we add it to the Dfail data set. Otherwise, we add it to the Dpass data set. We denote our applications of the MPTS, MPTMS, and MPTIS based on the most divergent pair as MaxSymTest, MaxSymTestmar, and MaxSymTestint, respectively.
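The pair-selection step can be sketched as follows. This is an illustrative sketch, not the IQ-TREE code: the function names are hypothetical, the alignment is assumed to be a dictionary of equal-length nucleotide strings, and sites where either sequence has a gap or an ambiguous character are simply skipped, which is an assumption the text does not specify.

```python
import random
from itertools import combinations
from typing import Dict, List, Optional, Tuple

STATES = "ACGT"


def divergence_matrix(seq1: str, seq2: str, states: str = STATES) -> List[List[int]]:
    """Count sites with state i in seq1 and state j in seq2 (other characters skipped)."""
    index = {s: i for i, s in enumerate(states)}
    m = len(states)
    d = [[0] * m for _ in range(m)]
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in index and b in index:
            d[index[a]][index[b]] += 1
    return d


def divergence(d: List[List[int]]) -> float:
    """Sum of off-diagonal elements divided by the sum of all elements."""
    total = sum(map(sum, d))
    if total == 0:
        return 0.0
    diagonal = sum(d[i][i] for i in range(len(d)))
    return (total - diagonal) / total


def most_divergent_pair(
    alignment: Dict[str, str], rng: Optional[random.Random] = None
) -> Tuple[str, str]:
    """Return the pair of taxa with the highest divergence, breaking ties at random."""
    rng = rng or random.Random()
    scores = {
        pair: divergence(divergence_matrix(alignment[pair[0]], alignment[pair[1]]))
        for pair in combinations(sorted(alignment), 2)
    }
    best = max(scores.values())
    ties = [pair for pair, score in scores.items() if score == best]
    return rng.choice(ties)


if __name__ == "__main__":
    toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTACGT"}  # made-up sequences
    print(most_divergent_pair(toy))
```

The divergence matrix of the selected pair is then the input to the three matched-pairs tests, and the resulting P value decides whether the partition is assigned to Dpass or Dfail.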

Phylogenetic Inference

We used IQ-TREE (Nguyen et al. 2015) to infer up to seven phylogenetic trees for every data set: Tall (all partitions from the original data set; Dall); and Tpass and Tfail based on the Dpass and Dfail data sets from each of the three tests (MaxSymTest, MaxSymTestmar, MaxSymTestint), provided that there was at least one partition in each category. We ran IQ-TREE using the default settings with the best-fit fully partitioned model (Chernomor et al. 2016), which allows each partition to have its own evolutionary model and edge-linked rate determined by ModelFinder (Kalyaanamoorthy et al. 2017), followed by 1,000 ultrafast bootstrap replicates (Hoang et al. 2018).

Distance between Trees

For each of the three tests (MPTS, MPTMS, MPTIS) we calculated the Normalized Path-Difference (NPD) and quartet distance (QD) (Steel and Penny 1993; Sand et al. 2014) between all three possible pairs of trees (Tall vs. Tpass; Tall vs. Tfail; and Tpass vs. Tfail), as long as Dpass and Dfail were nonempty and so Tpass and Tfail had been estimated. The path-difference metric (PD) is defined as the Euclidean distance between the vectors of path lengths separating each pair of taxa in the two trees (Steel and Penny 1993; Mir and Russello 2010). In this study, because we are interested only in differences between topologies, we use the variant of the PD metric that ignores branch lengths. In order to compare path distances between trees with different numbers of taxa, we normalized PD (to obtain NPD) by the mean of a null distribution of PDs generated from 10,000 random pairs of trees with the same number of taxa (Bogdanowicz et al. 2012). Thus, an NPD of 0 indicates an identical pair of trees, an NPD of 1 indicates that a pair of trees is as similar as a pair of randomly selected trees with the same number of taxa, and an NPD >1 indicates a pair of trees that are less similar than a randomly selected pair of trees with the same number of taxa. Since path differences are always nonnegative, the NPD is also guaranteed to be nonnegative. The QD metric is defined as the fraction of quartets (subsets of four taxa) that induce different subtrees between the two trees being compared. QD ranges between 0 and 1, where 0 means that two trees are identical and 1 means that they do not share any quartet subtrees. Compared with PD, QD has the advantage that its distribution is less sensitive to the underlying distribution of tree topologies (Steel and Penny 1993).
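To make the topological path-difference metric concrete, the sketch below computes it for two small trees. It is a minimal illustration under stated assumptions: each tree is represented as an undirected adjacency dictionary (a hypothetical format chosen for brevity), leaves are the nodes with exactly one neighbour, and every edge counts as length 1 because branch lengths are ignored. The normalization step, which divides by the mean PD of random tree pairs, is not implemented here.

```python
import math
from collections import deque
from typing import Dict, List, Tuple

Tree = Dict[str, List[str]]  # node -> neighbouring nodes (undirected)


def leaf_path_lengths(tree: Tree) -> Dict[Tuple[str, str], int]:
    """Number of edges on the path between every pair of leaves."""
    leaves = sorted(node for node, nbrs in tree.items() if len(nbrs) == 1)
    lengths = {}
    for leaf in leaves:
        # Breadth-first search gives edge counts from this leaf to all other nodes.
        seen = {leaf: 0}
        queue = deque([leaf])
        while queue:
            node = queue.popleft()
            for nbr in tree[node]:
                if nbr not in seen:
                    seen[nbr] = seen[node] + 1
                    queue.append(nbr)
        for other in leaves:
            if leaf < other:
                lengths[(leaf, other)] = seen[other]
    return lengths


def path_difference(tree1: Tree, tree2: Tree) -> float:
    """Euclidean distance between the leaf-to-leaf path-length vectors of two trees."""
    p1, p2 = leaf_path_lengths(tree1), leaf_path_lengths(tree2)
    if p1.keys() != p2.keys():
        raise ValueError("trees must have the same set of taxa")
    return math.sqrt(sum((p1[pair] - p2[pair]) ** 2 for pair in p1))


if __name__ == "__main__":
    # Two four-taxon trees: ((A,B),(C,D)) and ((A,C),(B,D)); x and y are internal nodes.
    t1 = {"A": ["x"], "B": ["x"], "C": ["y"], "D": ["y"],
          "x": ["A", "B", "y"], "y": ["C", "D", "x"]}
    t2 = {"A": ["x"], "C": ["x"], "B": ["y"], "D": ["y"],
          "x": ["A", "C", "y"], "y": ["B", "D", "x"]}
    print(path_difference(t1, t2))
```

In the example, four of the six taxon pairs change their path length by one edge between the two topologies, so the path difference is the square root of 4.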

Tree Topology Tests

The NPD and the QD give us measures of the differences between pairs of trees, but they do not tell us whether the differences are phylogenetically significant in the three data sets (Dpass, Dall, and Dfail) derived from a given test. For example, trees that differ due to stochastic error associated with small data sets may be very different, but such differences may not be statistically significant. To assess the significance of the differences between Tpass, Tall, and Tfail, we used the weighted Shimodaira–Hasegawa (wSH) test (Shimodaira and Hasegawa 1999; Shimodaira 2002) implemented in IQ-TREE with 1,000 RELL replicates (Kishino et al. 1990). Given the alignment (Dpass), the wSH test computes a P value for each tree, where a small P value (<0.05) implies that the corresponding tree has a significantly worse likelihood than the best tree in the set of Tpass, Tall, and Tfail. We use Dpass for these tests because it is, by definition, the only data set that does not reject the underlying assumptions of the SH test. As such, we only compute wSH P values when Dpass is nonempty. Thus, we performed a wSH test for each of the three MaxSymTest variants, each of which asks whether Tall and/or Tfail can be rejected in favor of Tpass.

Correlation between Number of Substitutions and Model Violation

We hypothesized that partitions with more substitutions may be more likely to violate the SRH assumptions, since substitutions form the raw data for the matched-pairs tests of homogeneity. To assess this, we fitted a generalized linear mixed-effects model for each of the three tests using the glmer function from the lme4 package in R (Bates et al. 2015). In this model, we treat each partition as a datapoint, the number of substitutions measured for that partition as a fixed effect, and the data set from which that partition was taken as a random effect. This allows us to estimate the extent to which the number of substitutions in a partition associates with whether a partition fails a given test of symmetry, after accounting for differences between the data sets. To calculate the R2 value, we use the r.squaredGLMM function from the MuMIn package in R (Barton 2009; Nakagawa and Schielzeth 2013).
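Written out, a model of this kind could take the following form. The binomial response and logit link are assumptions, since the text names the fitting function but not the family or link. Here $y_{pd}$ is 1 if partition $p$ of data set $d$ fails the test and 0 otherwise, $x_{pd}$ is its number of substitutions, and $b_d$ is the random intercept for data set $d$; all of these symbols are introduced here for illustration.

```latex
\[
\operatorname{logit} \Pr(y_{pd} = 1) = \beta_0 + \beta_1 x_{pd} + b_d,
\qquad b_d \sim \mathcal{N}(0, \sigma_b^2)
\]
```

In lme4 formula notation this would correspond to something like `fail ~ substitutions + (1 | dataset)` with `family = binomial`, where the column names are hypothetical.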

Software Implementation

We implemented a new option --symtest in IQ-TREE to perform the three MaxSymTest matched-pairs tests of symmetry. In addition, the option --symtest-remove-bad allows users to remove from the final analysis partitions that fail the MaxSymTest. One can change the removal criterion to MaxSymTestmar or MaxSymTestint via the --symtest-type MAR|INT option. In addition, the cutoff P value can be changed using the --symtest-pval NUM option, where the default value is 0.05.
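For scripted analyses, the options above can be assembled into a command line programmatically. The following is a hedged sketch: the helper name is hypothetical, the executable name `iqtree2` is an assumption that depends on the installed version, and the file names are placeholders. The symmetry-test options are the ones named in the text, written with a double hyphen as is conventional for long command-line options; `-s` and `-p` are IQ-TREE's alignment and partition-file options. The helper only builds the argument list; it could be passed to `subprocess.run` to launch the analysis.

```python
from typing import List, Optional


def symtest_command(
    alignment: str,
    partitions: str,
    remove_bad: bool = False,
    test_type: Optional[str] = None,
    pval: Optional[float] = None,
    binary: str = "iqtree2",  # assumption: adjust to the installed executable name
) -> List[str]:
    """Build an IQ-TREE argument list that runs the matched-pairs tests of symmetry."""
    cmd = [binary, "-s", alignment, "-p", partitions]
    # Either only report the tests, or also drop failing partitions from the analysis.
    cmd.append("--symtest-remove-bad" if remove_bad else "--symtest")
    if test_type is not None:
        if test_type not in ("MAR", "INT"):
            raise ValueError("test_type must be 'MAR' or 'INT'")
        cmd += ["--symtest-type", test_type]
    if pval is not None:
        cmd += ["--symtest-pval", str(pval)]
    return cmd


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    print(" ".join(symtest_command("alignment.fasta", "partitions.nex",
                                   remove_bad=True, test_type="MAR", pval=0.01)))
```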

Reproducibility

The GitHub repository (https://github.com/roblanf/SRHtests) contains the raw data and Python and R scripts necessary to perform all analyses reported in this study.

Results

Violation of SRH Conditions Is Common across 35 Empirical Data Sets

Across all 3,572 partitions analyzed, 573 (16.0%) failed the MaxSymTest, 728 (20.4%) failed the MaxSymTestmar, and 312 (2.8%) failed the MaxSymTestint. In total, 840 (23.5%) of the partitions failed at least one test. The proportion of partitions failing each test varied substantially among data sets (fig. 2), but on average, 21.8% of the partitions in each data set failed the MaxSymTest, 27.5% failed the MaxSymTestmar, and 5.1% failed the MaxSymTestint.

Fig. 2. The proportion of partitions that reject the null hypothesis of the MaxSymTest, MaxSymTestmar, and MaxSymTestint (P value <0.05) in each data set.

The fraction of failing partitions also varied with the genome type (e.g., mitochondrial, chloroplast, or nuclear) and context (e.g., protein-coding, UCE, tRNA) from which the partition was sequenced (table 2), although we note that a substantial proportion of the partitions from almost every category failed at least one of the tests (table 2).

Table 2

The Proportion of Partitions That Failed At Least One of the Three Tests—MaxSymTest, MaxSymTestmar, and MaxSymTestint

Type/Genome	Nuclear	Mitochondrial	Plastid	Virus
First codon positions	20.2%	27.6%	33.3%	25.0%
Second codon positions	21.0%	7.4%	0.0%	25.0%
Third codon positions	76.6%	44.8%	0.0%	75.0%
Other (e.g., intron)	27.8%	100.0%	0.0%
rRNA	30.0%	25.0%
UCE	22.5%
tRNA		0.0%

There were no clear differences in the substitution models that were selected for the partitions that pass or fail the tests (see supplementary extended tables 1–3, Supplementary Material online). However, we note that the two most-frequently selected substitution models (for 35% of the partitions) were relatively simple: K80 (Kimura 1980) and HKY (Hasegawa et al. 1985).

Model Violation Has a Large Influence on Tree Topologies

Using both MaxSymTest and MaxSymTestmar, we compared each tree inferred from each data set (Tall) to the corresponding trees estimated from the failed (Tfail) and passed (Tpass) partitions. Disturbingly, for each of the two tree distance metrics that we considered (NPD and QD), we find that the tree inferred from the original data set tended to be more similar to the tree estimated from the failed partitions (table 3 and supplementary extended table 4, Supplementary Material online). Furthermore, the mean NPD distance between Tpass and Tfail across all 35 data sets for the MaxSymTest was 0.69, that is, the two trees are 69% as dissimilar as random pairs of trees. This suggests that violations of SRH assumptions drive large changes in tree topologies.

Table 3

The Proportion of Data Sets That Have the Highest NPD Metric (and QD metric) between the Three Comparisons (All-fail, All-pass, Pass–fail) for MaxSymTest, MaxSymTestmar, and MaxSymTestint

	T_fail	T_pass
MaxSymTest
T_all	14.3% (4.8%)	4.8% (4.8%)
T_pass	80.9% (90.4%)
MaxSymTest_mar
T_all	8.3% (0.0%)	8.3% (4.2%)
T_pass	83.4% (95.8%)
MaxSymTest_int
T_all	28.6% (28.6%)	0.0% (0.0%)
T_pass	71.4% (71.4%)

The results of the wSH tests (table 4) confirm that the differences between trees that we observe tend to be statistically significant. For example, when using the MaxSymTestmar, Tpass is a significantly better description of the Dpass data than Tall in ∼37% of the data sets, and better than Tfail in ∼89% of the data sets.

Table 4

The Proportion of Data Sets That Have a Significant P Value in the Weighted SH Test When Using Dpass As the Input Alignment for the Test

	T_all	T_fail
MaxSymTest	25%	79%
MaxSymTest_mar	37%	89%
MaxSymTest_int	4%	28%


The Number of Substitutions Explains Less than One-Third of the Variance in Passing or Failing the Tests of Symmetry

The number of substitutions in a partition explained 27.5% of the variation in whether or not a partition passed or failed the MaxSymTest (supplementary extended fig. 7, Supplementary Material online). This proportion is very similar for MaxSymTestmar (24.4%) (supplementary extended fig. 8, Supplementary Material online), but is dramatically lower for the MaxSymTestint (1.8%) (supplementary extended fig. 9, Supplementary Material online). Thus, although the number of substitutions in a partition is a highly significant (P < 2e-16) predictor of passing or failing any of the tests, that it explains only about a quarter of the variation suggests that other factors, such as underlying differences in the extent to which partitions violate the SRH assumptions, are driving the remaining ∼75% of the variation.

Model Violation Due to Non-SRH Evolution Affects the Inferred Relationship between Even-Toed and Odd-Toed Ungulates in the Tree of Mammals

To examine the effects of model violation in more detail, we selected two data sets for more detailed consideration. Conflicting support for the placement of Xenacoelomorpha, the clade that contains Xenoturbella and Acoelomorpha, in the tree of life across different analyses has led to various hypotheses about the evolution of Bilateria (Cannon et al. 2016). In addition, the interordinal relationships in Laurasiatheria, especially the relationships within Fereuungulata (Perissodactyla, Cetartiodactyla, Carnivora, and Pholidota), in the tree of placental mammals are controversial (Cao et al. 1998; Zhou et al. 2012). It has been suggested that such inferences might be strongly affected by model violation and systematic error (Cao et al. 1998; Delsuc et al. 2005; Philippe et al. 2011; Tsagkogeorga et al. 2013). To assess whether data that pass or fail the MaxSymTestmar show different signals regarding the evolution of the Bilateria and the superorder Laurasiatheria, we examined in more detail the Tall, Tpass, and Tfail trees from recent studies that explored the tree of placental mammals (Lartillot and Delsuc 2012) and the tree of all animals (Cannon et al. 2016). The mammals’ data set comprises 78 mammalian taxa, including 73 placental mammals, with 51 partitions representing the first, second, and third codon positions of the 17 genes (Lartillot and Delsuc 2012). The tree reconstructed from all of the partitions (Tall) and the tree reconstructed from the partitions that pass the MaxSymTest (Tpass, 29 partitions) both show Perissodactyla (odd-toed ungulates) as a sister group to Cetartiodactyla (even-toed ungulates) (fig. 3 and supplementary extended figs. 4 and 5, Supplementary Material online). Even so, the bootstrap support for this branch is not high: 73% for Tall and 34% for Tpass. 
On the other hand, the tree reconstructed from the data that fail the MaxSymTest (Tfail, 22 partitions) shows Perissodactyla as the sister group to the clade that contains Carnivora + Pholidota with 49% bootstrap support (fig. 3 and supplementary extended fig. 6, Supplementary Material online).

Fig. 3. Maximum-likelihood trees of mammalian relationships based on analysis of the Lartillot_2012 data set. (a) The tree inferred from all 51 partitions and from the 29 partitions that passed the MaxSymTest. (b) The tree inferred from 22 partitions that failed the MaxSymTest. Red numbers at the internal branches indicate the bootstrap support values that are <100% under the best fitting model. Numbers in curly brackets show the GC content (in panel a, %GC and bootstrap support values are for Tall and Tpass, respectively).

The animal data set comprises 76 metazoan taxa, 2 choanoflagellate outgroups, 212 genes, and 424 partitions representing first and second codon positions (Cannon et al. 2016). The tree reconstructed from all of the partitions (Tall) is identical to the trees reconstructed from the 381 partitions that pass the MaxSymTest (Tpass), the partitions that fail the MaxSymTest (Tfail, 43 partitions), and the tree shown in the original paper from both DNA and amino acid data (Cannon et al. 2016), which places Xenacoelomorpha as the sister group of Nephrozoa (Deuterostomia and Protostomia) with 100% bootstrap support (supplementary extended figs. 1–3, Supplementary Material online).

Discussion

In this article, we show that model violation is prevalent and has a strong impact on tree reconstruction in many phylogenetic data sets. This impact varies substantially between different data sets and different types of partitions. The trees inferred from different groups of partitions from the same data set often have topologies that are biologically and statistically significantly different.

Our results show great heterogeneity in the extent of model violation among different data sets and partitions. This is demonstrated by the varying proportion of partitions that failed the matched-pairs tests of homogeneity in each data set and in each genomic context (codon position, rRNA, tRNA, UCE, or other) and type of genome (nuclear, mitochondrial, plastid, and virus). Model violations are most frequently observed in the third codon positions for viral, mitochondrial, and nuclear genomes, and in intergenic spacers in plastid sequences. Yet, our results affirm that non-SRH evolution is far from confined to these genomic regions. For example, in a data set of placental mammals, of the 22 partitions that failed the MaxSymTest, only 11 are third codon positions.

The tree inferred from the partitions that show significant violation of the SRH conditions (Tfail) differs in its topology from the tree inferred from the partitions that do not show significant violation of the SRH conditions (Tpass) with respect to the interordinal relationships in Laurasiatheria (fig. 3). The Tfail tree is consistent with the results from the original paper in that it places Perissodactyla as a sister group to Carnivora + Pholidota (Lartillot and Delsuc 2012). However, other studies using ML analysis show Perissodactyla to be a sister group to Cetartiodactyla (Graur et al. 1997; Murphy et al. 2001; Tsagkogeorga et al. 2013; Liu et al. 2017), which is also the relationship we find in this study with the tree inferred from partitions that do not show significant violation of the SRH assumptions.

Examining the results of the two other tests (MaxSymTestmar and MaxSymTestint), we noticed that all the partitions that failed the MaxSymTest also failed the MaxSymTestmar, suggesting that those partitions violate the models mainly due to nonstationarity. Based on this observation, GC content may drive the differences between the trees inferred from all partitions and those inferred from partitions that failed neither MaxSymTest nor MaxSymTestmar. Trees inferred from partitions that violate the models tend to group together clades with similar GC content (e.g., as in Betancur-r et al. 2013). However, it is hard to discern any clear evidence for this from examining the GC content of the clades (fig. 3). Yet, our results show that all the clades in the partitions that failed the MaxSymTest have, on average, a higher GC content (fig. 3).

The results of our study also provide some insight into the likely cause of model violation in the data sets we examined. Figure 2 shows that violation of marginal symmetry (assessed with MaxSymTestmar) was much more common than violation of internal symmetry (assessed with MaxSymTestint). This suggests that nonstationarity, which is associated with marginal symmetry, is likely a more common cause of systematic bias than nonhomogeneity in the data sets that we examined (see also Jayaswal et al. 2005; Ababneh et al. 2006; Song et al. 2010). Yet, the difference between the proportion of partitions that failed the MaxSymTestmar and the proportion of partitions that failed the MaxSymTestint could also be due to the higher power of the MaxSymTestmar. Either way, this result hints that the development and application of nonstationary models (Yang 1994; Roberts and Yang 1995; Yap and Speed 2005) may be an important avenue toward reducing systematic bias in future analyses. Moreover, our results show a clear preference for simple substitution models with a single transition/transversion ratio over more complex models such as general time reversible. This suggests that developing nonstationary models with a single parameter for the transition/transversion ratio might be sufficient to reduce systematic bias in phylogenetic analysis.

One limitation of the tests that we propose in this article is that their power will be limited if there are few differences between the sequences being examined. Indeed, our analyses show that in our representative sample of >3,500 partitions from published data sets, roughly 25% of the variance in whether a partition passes or fails a given test can be attributed to the number of observed differences between the sequences. Nevertheless, this implies that the remaining ∼75% of the variance in whether a partition passes or fails a test could be attributable to other processes, such as variation in the extent of model violation among partitions. This suggests that we should be cautiously optimistic: although a lack of power on small or slowly evolving partitions may induce some false negatives (i.e., failures to identify partitions that have evolved under non-SRH conditions), the tests we propose still have significant power to identify partitions that show evidence of model violation. It is possible that removing such partitions from phylogenetic analyses may improve the accuracy of results by reducing the overall burden of model violation on the inference of the tree topology. We hope that our implementation of these tests in the user-friendly software IQ-TREE will allow empirical phylogeneticists to continue to explore whether this is the case.
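
The dependence of test power on the number of observed differences can be illustrated with a minimal sketch of a matched-pairs test of symmetry for a single pair of aligned sequences, assuming the test takes Bowker's classical form. This is not the IQ-TREE implementation: the function names (`chi2_sf`, `divergence_matrix`, `bowker_symmetry`) and the two count tables are hypothetical, the choice of which sequence pair to test is not shown, and only the overall symmetry test is computed (not the marginal or internal variants).

```python
import math
from itertools import combinations


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite series in odd powers of sqrt(x)
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total


def divergence_matrix(seq_a, seq_b, alphabet="ACGT"):
    """Count site patterns (state in seq_a, state in seq_b); skip gaps/ambiguities."""
    index = {state: i for i, state in enumerate(alphabet)}
    counts = [[0] * len(alphabet) for _ in alphabet]
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in index and b in index:
            counts[index[a]][index[b]] += 1
    return counts


def bowker_symmetry(counts):
    """Return (statistic, degrees of freedom, p-value) for Bowker's test of symmetry."""
    stat, df = 0.0, 0
    for i, j in combinations(range(len(counts)), 2):
        total = counts[i][j] + counts[j][i]
        if total > 0:  # pairs of cells with no observations carry no information
            stat += (counts[i][j] - counts[j][i]) ** 2 / total
            df += 1
    p_value = chi2_sf(stat, df) if df > 0 else 1.0
    return stat, df, p_value


if __name__ == "__main__":
    # Hypothetical divergence matrices: the second has the same proportions
    # as the first but one-tenth as many sites.
    large = [[500, 40, 60, 20],
             [10, 480, 30, 50],
             [20, 10, 510, 40],
             [10, 20, 10, 490]]
    small = [[c // 10 for c in row] for row in large]
    for name, table in (("large", large), ("small", small)):
        stat, df, p = bowker_symmetry(table)
        print(f"{name}: statistic={stat:.3f}, df={df}, p={p:.3g}")
```

Because the statistic scales linearly with the counts, the smaller table yields a statistic one-tenth the size of the larger one even though the asymmetry is proportionally identical. In this made-up example the larger table rejects symmetry at the 5% level and the smaller one does not, which is the kind of false negative discussed above.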

Supplementary Material

Supplementary data are available at Genome Biology and Evolution online.

103 in total

1. Compositional heterogeneity and phylogenomic inference of metazoan relationships.

Authors: Maximilian P Nesnidal; Martin Helmkampf; Iris Bruchhaus; Bernhard Hausdorf
Journal: Mol Biol Evol Date: 2010-04-09 Impact factor: 16.240

2. Phylogenomic analyses elucidate the evolutionary relationships of bats.

Authors: Georgia Tsagkogeorga; Joe Parker; Elia Stupka; James A Cotton; Stephen J Rossiter
Journal: Curr Biol Date: 2013-10-31 Impact factor: 10.834

3. Detection of implausible phylogenetic inferences using posterior predictive assessment of model fit.

Authors: Jeremy M Brown
Journal: Syst Biol Date: 2014-01-11 Impact factor: 15.683

4. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences.

Authors: M Kimura
Journal: J Mol Evol Date: 1980-12 Impact factor: 2.395

5. RevBayes: Bayesian Phylogenetic Inference Using Graphical Models and an Interactive Model-Specification Language.

Authors: Sebastian Höhna; Michael J Landis; Tracy A Heath; Bastien Boussau; Nicolas Lartillot; Brian R Moore; John P Huelsenbeck; Fredrik Ronquist
Journal: Syst Biol Date: 2016-05-28 Impact factor: 15.683

6. Large-scale phylogeny of chameleons suggests African origins and Eocene diversification.

Authors: Krystal A Tolley; Ted M Townsend; Miguel Vences
Journal: Proc Biol Sci Date: 2013-03-27 Impact factor: 5.349

7. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA.

Authors: M Hasegawa; H Kishino; T Yano
Journal: J Mol Evol Date: 1985 Impact factor: 2.395

8. Deuterostome phylogeny reveals monophyletic chordates and the new phylum Xenoturbellida.

Authors: Sarah J Bourlat; Thorhildur Juliusdottir; Christopher J Lowe; Robert Freeman; Jochanan Aronowicz; Mark Kirschner; Eric S Lander; Michael Thorndyke; Hiroaki Nakano; Andrea B Kohn; Andreas Heyland; Leonid L Moroz; Richard R Copley; Maximilian J Telford
Journal: Nature Date: 2006-10-18 Impact factor: 49.962

9. Multi-locus phylogenetic analysis reveals the pattern and tempo of bony fish evolution.

Authors: Richard E Broughton; Ricardo Betancur-R; Chenhong Li; Gloria Arratia; Guillermo Ortí
Journal: PLoS Curr Date: 2013-04-16

10. Estimation of phylogeny using a general Markov model.

Authors: Vivek Jayaswal; Lars S Jermiin; John Robinson
Journal: Evol Bioinform Online Date: 2007-02-25 Impact factor: 1.625
