Suha Naser-Khdour1, Bui Quang Minh1,2, Wenqi Zhang1, Eric A Stone1, Robert Lanfear1.
Abstract
In phylogenetic inference, we commonly use models of substitution which assume that sequence evolution is stationary, reversible, and homogeneous (SRH). Although the use of such models is often criticized, the extent of SRH violations and their effects on phylogenetic inference of tree topologies and edge lengths are not well understood. Here, we introduce and apply the maximal matched-pairs tests of homogeneity to assess the scale and impact of SRH model violations on 3,572 partitions from 35 published phylogenetic data sets. We show that roughly one-quarter of all the partitions we analyzed (23.5%) reject the SRH assumptions, and that for 25% of data sets, tree topologies inferred from all partitions differ significantly from topologies inferred using the subset of partitions that do not reject the SRH assumptions. This proportion increases when comparing trees inferred using the subset of partitions that rejects the SRH assumptions, to those inferred from partitions that do not reject the SRH assumptions. These results suggest that the extent and effects of model violation in phylogenetics may be substantial. They highlight the importance of testing for model violations and possibly excluding partitions that violate models prior to tree reconstruction. Our results also suggest that further effort in developing models that do not require SRH assumptions could lead to large improvements in the accuracy of phylogenomic inference. The scripts necessary to perform the analysis are available in https://github.com/roblanf/SRHtests, and the new tests we describe are available as a new option in IQ-TREE (http://www.iqtree.org).
Keywords: model violations; phylogenetic inference; systematic bias; test of symmetry
Year: 2019 PMID: 31536115 PMCID: PMC6893154 DOI: 10.1093/gbe/evz193
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Number of Taxa, Number of Sites, Clade, and Study Reference for Each Data Set Used in This Study
| Data Set | Study References | Data Set References | Clade | Taxa | Sites |
|---|---|---|---|---|---|
| Anderson_2013 | | Anderson et al. (2013) | Loliginids | 145 | 3,037 |
| Bergsten_2013 | | | Dytiscidae | 38 | 2,111 |
| Broughton_2013 | | | Osteichthyes | 61 | 19,997 |
| Brown_2012 | | | Ptychozoon | 41 | 1,665 |
| Cannon_2016a | | | Metazoa | 78 | 89,792 |
| Cognato_2001 | | | Coleoptera: Scolytinae | 44 | 1,897 |
| Day_2013 | | | Synodontis | 152 | 3,586 |
| Devitt_2013 | Devitt | | Ensatina eschscholtzii klauberi | 69 | 823 |
| Dornburg_2012 | | | Teleostei: Beryciformes: Holocentridae | 44 | 5,919 |
| Faircloth_2013 | | | Actinopterygii | 27 | 149,366 |
| Fong_2012 | | | Vertebrata | 110 | 25,919 |
| Horn_2014 | | | Euphorbia | 197 | 11,587 |
| Kawahara_2013 | | | Hyposmocoma | 70 | 2,238 |
| Lartillot_2012 | | | Eutheria | 78 | 15,117 |
| McCormack_2013 | | | Neoaves | 33 | 1,079,052 |
| Moyle_2016 | | | Oscines | 106 | 375,172 |
| Murray_2013 | | | Eucharitidae | 237 | 3,111 |
| Oaks_2011 | | | Crocodylia | 79 | 7,282 |
| Rightmyer_2013 | | | Hymenoptera: Megachilidae | 94 | 3,692 |
| Sauquet_2011 | | Sauquet et al. (2011) | Nothofagus | 51 | 5,444 |
| Seago_2011 | | | Coccinellidae | 97 | 2,253 |
| Sharanowski_2011 | | | Braconidae | 139 | 3,982 |
| Siler_2013 | | | Lycodon | 61 | 2,697 |
| Tolley_2013 | | | Chamaeleonidae | 203 | 5,054 |
| Unmack_2013 | | | Melanotaeniidae | 139 | 6,827 |
| Wainwright_2012 | | | Acanthomorpha | 188 | 8,439 |
| Wood_2012 | | Wood et al. (2012) | Archaeidae | 37 | 5,185 |
| Worobey_2014a | | | Influenzavirus A | 146 | 3,432 |
| Worobey_2014b | | | Influenzavirus A | 327 | 759 |
| Worobey_2014c | | | Influenzavirus A | 92 | 1,416 |
| Worobey_2014d | | | Influenzavirus A | 355 | 1,497 |
| Worobey_2014e | | | Influenzavirus A | 340 | 699 |
| Worobey_2014f | | | Influenzavirus A | 332 | 2,151 |
| Worobey_2014g | | | Influenzavirus A | 326 | 2,274 |
| Worobey_2014h | | | Influenzavirus A | 351 | 2,280 |
Fig. 1. Flow chart of methodology. For each partition in the alignment, we choose the pair of sequences with the maximum divergence and apply the matched-pairs tests of homogeneity to that pair.
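The procedure in this caption can be illustrated with a minimal Python sketch for the test of symmetry only. It rests on several assumptions that are not stated in the source: divergence is measured here as the proportion of differing sites, the data are nucleotides, and the test applied is Bowker's matched-pairs test of symmetry with a chi-square reference distribution. The function names (`p_distance`, `chi2_sf`, `bowker_test`, `max_pair_symmetry_test`) are hypothetical and do not correspond to the IQ-TREE implementation.

```python
import math
from itertools import combinations

def p_distance(a, b, alphabet="ACGT"):
    """Proportion of differing sites (assumed divergence measure); gaps and ambiguity codes are skipped."""
    pairs = [(x, y) for x, y in zip(a, b) if x in alphabet and y in alphabet]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1 (closed forms)."""
    z = x / 2.0
    if df % 2 == 0:
        return math.exp(-z) * sum(z ** j / math.factorial(j) for j in range(df // 2))
    k = (df - 1) // 2
    tail = sum(z ** (j - 0.5) / math.gamma(j + 0.5) for j in range(1, k + 1))
    return math.erfc(math.sqrt(z)) + math.exp(-z) * tail

def bowker_test(a, b, alphabet="ACGT"):
    """Bowker's test of symmetry on the divergence matrix of two aligned sequences."""
    idx = {c: i for i, c in enumerate(alphabet)}
    size = len(alphabet)
    n = [[0] * size for _ in range(size)]
    for x, y in zip(a, b):
        if x in idx and y in idx:
            n[idx[x]][idx[y]] += 1
    stat, df = 0.0, 0
    for i, j in combinations(range(size), 2):
        total = n[i][j] + n[j][i]
        if total > 0:  # cell pairs with no observations carry no information
            stat += (n[i][j] - n[j][i]) ** 2 / total
            df += 1
    p_value = chi2_sf(stat, df) if df > 0 else 1.0
    return stat, df, p_value

def max_pair_symmetry_test(seqs):
    """seqs: dict mapping taxon name to aligned sequence for one partition."""
    pair = max(
        combinations(sorted(seqs), 2),
        key=lambda names: p_distance(seqs[names[0]], seqs[names[1]]),
    )
    return pair, bowker_test(seqs[pair[0]], seqs[pair[1]])
```

Calling `max_pair_symmetry_test` on one partition returns the selected pair of taxa together with the statistic, degrees of freedom, and P value; a partition would be counted as rejecting symmetry when that P value falls below the chosen threshold.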
Fig. 2. The proportion of partitions that reject the null hypothesis of the MaxSymTest, MaxSymTestmar, and MaxSymTestint (P value < 0.05) in each data set.
The Proportion of Partitions That Failed At Least One of the Three Tests—MaxSymTest, MaxSymTestmar, and MaxSymTestint
| Type/Genome | Nuclear | Mitochondrial | Plastid | Virus |
|---|---|---|---|---|
| First codon positions | 20.2% | 27.6% | 33.3% | 25.0% |
| Second codon positions | 21.0% | 7.4% | 0.0% | 25.0% |
| Third codon positions | 76.6% | 44.8% | 0.0% | 75.0% |
| Other (e.g., intron) | 27.8% | 100.0% | 0.0% | |
| rRNA | 30.0% | 25.0% | | |
| UCE | 22.5% | | | |
| tRNA | 0.0% |
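As a rough illustration of how proportions like those in this table could be tallied, the sketch below counts a partition as failing when any of its three P values is below 0.05, then reports the failing fraction per group. The record layout, the group labels, and the function name `fail_proportions` are hypothetical, and the example values are made up for demonstration rather than taken from the study.

```python
from collections import defaultdict

def fail_proportions(records, alpha=0.05):
    """records: iterable of (group, p_sym, p_mar, p_int); a P value may be None if a test could not be run."""
    counts = defaultdict(lambda: [0, 0])  # group -> [number failed, number of partitions]
    for group, *p_values in records:
        failed = any(p is not None and p < alpha for p in p_values)
        counts[group][0] += int(failed)
        counts[group][1] += 1
    return {group: failed / total for group, (failed, total) in counts.items()}

# Made-up example records: (group, MaxSymTest P, MaxSymTestmar P, MaxSymTestint P)
example = [
    ("nuclear/third codon", 0.01, 0.20, 0.50),
    ("nuclear/third codon", 0.30, 0.04, 0.60),
    ("nuclear/third codon", 0.30, 0.40, 0.60),
    ("mitochondrial/first codon", 0.20, 0.50, None),
]
proportions = fail_proportions(example)
```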
The Proportion of Data Sets That Have the Highest NPD Metric (and QD metric) between the Three Comparisons (All-fail, All-pass, Pass–fail) for MaxSymTest, MaxSymTestmar, and MaxSymTestint
| | | |
|---|---|---|
| MaxSymTest | | |
| | 14.3% (4.8%) | 4.8% (4.8%) |
| | 80.9% (90.4%) | |
| MaxSymTestmar | | |
| | 8.3% (0.0%) | 8.3% (4.2%) |
| | 83.4% (95.8%) | |
| MaxSymTestint | | |
| | 28.6% (28.6%) | 0.0% (0.0%) |
| | 71.4% (71.4%) | |
The Proportion of Data Sets That Have a Significant P Value in the Weighted SH Test When Using Dpass As the Input Alignment for the Test
| | | |
|---|---|---|
| MaxSymTest | 25% | 79% |
| MaxSymTestmar | 37% | 89% |
| MaxSymTestint | 4% | 28% |
Fig. 3. Maximum-likelihood trees of mammalian relationships based on analysis of Lartillot 2012 data set. (a) The tree inferred from all 51 partitions and from the 29 partitions that passed the MaxSymTest. (b) The tree inferred from 22 partitions that failed the MaxSymTest. Red numbers at the internal branches indicate the bootstrap support values that are <100% under the best-fitting model. Numbers in curly brackets show the GC content (in panel a, %GC and bootstrap support values are for Tall and Tpass, respectively).