Samuel H Church, Joseph F Ryan, Casey W Dunn.
Abstract
The Swofford-Olsen-Waddell-Hillis (SOWH) test evaluates statistical support for incongruent phylogenetic topologies. It is commonly applied to determine whether the maximum likelihood tree in a phylogenetic analysis is significantly different from an alternative hypothesis. The SOWH test compares the observed difference in log-likelihood between two topologies to a null distribution of differences in log-likelihood generated by parametric resampling. The test is a well-established phylogenetic method for topology testing, but it is sensitive to model misspecification, it is computationally burdensome to perform, and its implementation requires the investigator to make several decisions that each have the potential to affect the outcome of the test. We analyzed the effects of multiple factors using seven data sets to which the SOWH test was previously applied. These factors include the number of sample replicates, the likelihood software, the introduction of gaps to simulated data, the use of distinct models of evolution for data simulation and likelihood inference, and a suggested test correction wherein an unresolved "zero-constrained" tree is used to simulate sequence data. To facilitate these analyses and future applications of the SOWH test, we wrote SOWHAT, a program that automates the SOWH test. We find that inadequate bootstrap sampling can change the outcome of the SOWH test. The results also show that using a zero-constrained tree for data simulation can result in a wider null distribution and higher p-values, but does not change the outcome of the SOWH test for most of the data sets tested here. These results will help others implement and evaluate the SOWH test and allow us to provide recommendations for future applications of the SOWH test. SOWHAT is available for download from https://github.com/josephryan/SOWHAT.
Keywords: Phylogenetics; SOWH test; topology test
Year: 2015 PMID: 26231182 PMCID: PMC4604836 DOI: 10.1093/sysbio/syv055
Source DB: PubMed Journal: Syst Biol ISSN: 1063-5157 Impact factor: 15.683
Figure 1. A typical SOWH test. The test begins with two maximum likelihood searches on a single alignment. One search, represented by the black arrow, is performed with no constraining topology. The other search, represented by the shaded arrow, is constrained to follow an a priori topology that represents a phylogenetic hypothesis incongruent with the maximum likelihood topology. The black gears represent the maximum likelihood software used to score the trees (e.g., GARLI or RAxML). These two searches yield two maximum likelihood scores, and the difference between them is the test statistic. The optimized parameters and topology from the constrained search are then used to simulate new alignments with software (shaded gear) such as Seq-Gen. For each simulated alignment (shaded), two maximum likelihood searches are performed, one unconstrained (black arrow) and one constrained (shaded arrow), scores are obtained, and the difference between them is calculated. The test statistic is compared to this null distribution of differences. A significantly large test statistic is one that falls above some proportion of the simulated differences (e.g., 95%).
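The procedure in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not SOWHAT's implementation: the callables `score_unconstrained`, `score_constrained`, and `simulate` are hypothetical stand-ins for the likelihood software and the sequence simulator, and the p-value is assumed to be the plain fraction of simulated differences at least as large as the observed one.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

A = TypeVar("A")  # alignment type; deliberately left abstract in this sketch


def sowh_p_value(observed: float, null: Sequence[float]) -> float:
    """Fraction of simulated differences that are >= the observed difference."""
    if not null:
        raise ValueError("null distribution is empty")
    return sum(1 for d in null if d >= observed) / len(null)


def sowh_test(
    alignment: A,
    score_unconstrained: Callable[[A], float],  # hypothetical: best log-likelihood, no constraint
    score_constrained: Callable[[A], float],    # hypothetical: best log-likelihood under the constraint
    simulate: Callable[[], A],                  # hypothetical: one alignment simulated from the constrained fit
    n_samples: int,
) -> Tuple[float, float]:
    """Return (test statistic, p-value) for a SOWH-style parametric test."""
    # Test statistic: difference between the two scores on the observed data.
    observed = score_unconstrained(alignment) - score_constrained(alignment)

    # Null distribution: the same difference on each simulated alignment.
    null: List[float] = []
    for _ in range(n_samples):
        sim = simulate()
        null.append(score_unconstrained(sim) - score_constrained(sim))

    return observed, sowh_p_value(observed, null)
```

In practice each callable would wrap a call to external software, which is where nearly all of the run time goes; the comparison itself is trivial.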
Figure 2. A SOWH test using two models of evolution. In this test, two maximum likelihood searches are performed as described above using one model of evolution (Model 1). Instead of retrieving parameter values from the constrained search, as would be done in a typical SOWH test, an additional constrained maximum likelihood search is performed using a different model of evolution (Model 2). Parameters are retrieved from this additional search and used to simulate new data. The simulated data sets are then scored using the same model of evolution used to score the original data set (Model 1). This adjustment to the typical SOWH test was suggested on the assumption that a SOWH test performed with the same model of evolution for likelihood scoring and data simulation would result in smaller differences on simulated data, a narrower null distribution, and a more liberal test. For tests in our study that use the CAT–GTR model in PhyloBayes, both the additional constrained search and the data simulation are performed using PhyloBayes; all other likelihood searches are performed using the specified likelihood software (i.e., GARLI or RAxML).
Number of sample replicates
| Data set | Samples | Tests | ML soft. | Average p-value | Lowest p-value | Highest p-value | Min. conf. interval (lower) | Min. conf. interval (upper) |
|---|---|---|---|---|---|---|---|---|
| Buckley | 100 | 100 | RAxML | 0.261 | 0.140 | 0.312 | 0.114 | 0.185 |
| Buckley | 500 | 100 | RAxML | 0.263 | 0.219 | 0.264 | 0.171 | 0.211 |
| Sullivan | 100 | 100 | RAxML | 0.411 | 0.290 | 0.510 | 0.256 | 0.327 |
| Sullivan | 500 | 100 | RAxML | 0.401 | 0.344 | 0.458 | 0.329 | 0.360 |
| Dixon | 100 | 100 | RAxML | 0.092 | 0.030* | 0.160 | 0.017 | 0.051 |
| Dixon | 500 | 100 | RAxML | 0.095 | 0.065 | 0.132 | 0.056 | 0.073 |
Notes: For each of three data sets, 100 SOWH tests were performed with a sample size of 100 and 100 tests were performed with a sample size of 500. P-values for the Dixon data set at a sample size of 100 vary from 0.030 to 0.169, indicating that repeated SOWH tests at this sample size could result in different outcomes at a significance level of 0.05. The minimum confidence interval is 0.017–0.051, indicating that every p-value below 0.05 is accompanied by a confidence interval that spans the significance level; more sampling is therefore required. At a sample size of 500, all p-values and all confidence intervals fall entirely above the significance level (the minimum interval is 0.056–0.073), indicating a sufficient sample size. * indicates a p-value less than 0.05.
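The stopping rule described in these notes amounts to checking whether a confidence interval straddles the significance level. A minimal sketch follows; the function name is hypothetical and the check is not taken from SOWHAT itself.

```python
def interval_spans_level(lower: float, upper: float, alpha: float = 0.05) -> bool:
    """True if the confidence interval [lower, upper] contains the significance level.

    When this is True the outcome of the test is ambiguous at that level,
    which the notes above treat as a sign that more samples are needed.
    """
    if lower > upper:
        raise ValueError("lower bound exceeds upper bound")
    return lower <= alpha <= upper


# Minimum intervals reported for the Dixon data set in the table above.
print(interval_spans_level(0.017, 0.051))  # sample size 100
print(interval_spans_level(0.056, 0.073))  # sample size 500
```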
Choice of likelihood software
| Data set | Source | Samples | Model | ML software | P-value | Conf. interval (lower) | Conf. interval (upper) |
|---|---|---|---|---|---|---|---|
| Buckley | New | 1000 | GTR+I+G | GARLI | 0.018* | 0.011 | 0.028 |
| Buckley | New | 1000 | GTR+I+G | RAxML | 0.010* | 0.005 | 0.018 |
| Buckley | Reported | 1000 | GTR+I+G | PAUP* (ML) | 0.018* | – | – |
| Buckley | New | 1000 | GTR+G | GARLI | 0.118 | 0.099 | 0.140 |
| Buckley | New | 1000 | GTR+G | RAxML | 0.252 | 0.225 | 0.280 |
| Buckley | Reported | 1000 | GTR+G | PAUP* (ML) | 0.015* | – | – |
| Dixon | New | 500 | HKY | GARLI | 0.012* | 0.004 | 0.026 |
| Dixon | Reported | 500 | HKY | PAUP* (ML) | 0.258 | – | – |
| Dixon | New | 500 | HKY+G | GARLI | 0.002** | <0.005 | 0.011 |
| Dixon | Reported | 500 | HKY+G | PAUP* (ML) | <0.005** | – | – |
| Dunn | New | 100 | GTR+I+G | GARLI | <0.01** | <0.01 | 0.036 |
| Dunn | New | 100 | GTR+I+G | RAxML | <0.01** | <0.01 | 0.036 |
| Edwards | New | 100 | GTR+I+G | GARLI | <0.01** | <0.01 | 0.036 |
| Edwards | New | 100 | GTR+I+G | RAxML | <0.01** | <0.01 | 0.036 |
| Edwards | Reported | 100 | GTR+I+G | PAUP* (ML) | <0.01** | – | – |
| Liu | New | 100 | GTR+I+G | GARLI | <0.01** | <0.01 | 0.036 |
| Liu | New | 100 | GTR+I+G | RAxML | <0.01** | <0.01 | 0.036 |
| Sullivan | New | 100 | GTR+I+G | GARLI | <0.01** | <0.01 | 0.036 |
| Sullivan | New | 100 | GTR+I+G | RAxML | <0.01** | <0.01 | 0.036 |
| Sullivan | Reported | 100 | GTR+I+G | PAUP* (ML) | <0.01** | – | – |
| Wang | New | 500 | GTR+G | GARLI | <0.005** | <0.005 | 0.007 |
| Wang | New | 500 | GTR+G | RAxML | <0.005** | <0.005 | 0.007 |
| Wang | Reported | 500 | GTR+G | PAUP* (ML) | <0.005** | – | – |
Notes: SOWH tests were performed using GARLI and RAxML for each data set, with the exception of Dixon et al. (2007), as HKY is not an option in RAxML. Each SOWH test was performed using the model, sample size, and constraint topology specified in the original performance of the test. Buckley and Dixon were analyzed using two different models of evolution, as reported originally. The resulting p-values were compared to those reported in the literature for five data sets. The other two data sets were not directly compared due to known differences in implementation: the SOWH test performed by Dunn et al. (2005) used parsimony to score trees, and the test by Liu et al. (2012) was performed with a partition scheme not used here. The outcome of the tests differed from the literature for two data sets, Buckley using GTR+G and Dixon using HKY. * indicates p-values less than 0.05; ** indicates less than 0.01.
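The method used for the confidence intervals is not named here. The reported bounds are, however, consistent with a 95% exact binomial (Clopper-Pearson) interval on the p-value; for example, 0 of 100 simulated values exceeding the test statistic gives an upper bound of about 0.036. Treating that as an assumption, a dependency-free sketch looks like this:

```python
from math import comb
from typing import Callable, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p). Intended for modest n (hundreds to ~1000)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect_decreasing(f: Callable[[float], float], iters: int = 100) -> float:
    """Root on [0, 1] of a function that decreases from positive to negative."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, conf: float = 0.95) -> Tuple[float, float]:
    """Exact binomial interval for k 'successes' out of n samples.

    Assumption: k is the number of simulated differences at least as large
    as the observed test statistic, so the p-value estimate is k / n.
    """
    if not 0 <= k <= n or n <= 0:
        raise ValueError("need 0 <= k <= n and n > 0")
    a = (1 - conf) / 2
    # Lower bound solves P(X >= k | p) = a; upper bound solves P(X <= k | p) = a.
    lower = 0.0 if k == 0 else _bisect_decreasing(
        lambda p: a - (1 - binom_cdf(k - 1, n, p))
    )
    upper = 1.0 if k == n else _bisect_decreasing(
        lambda p: binom_cdf(k, n, p) - a
    )
    return lower, upper


if __name__ == "__main__":
    for k, n in [(0, 100), (0, 500), (18, 1000)]:
        lo, hi = clopper_pearson(k, n)
        print(f"{k}/{n}: p = {k / n:.3f}, interval = ({lo:.3f}, {hi:.3f})")
```

The bisection avoids any dependency on a statistics library; with SciPy available, the same bounds could be obtained from beta-distribution quantiles instead.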
Treatment of gaps
| Data set | % Gaps in data set | Simulate gaps | P-value | Conf. interval (lower) | Conf. interval (upper) |
|---|---|---|---|---|---|
| Dunn | 14.759 | yes | <0.01** | <0.01 | 0.036 |
| Dunn | 14.759 | no | <0.01** | <0.01 | 0.036 |
| Edwards | 14.268 | yes | <0.01** | <0.01 | 0.036 |
| Edwards | 14.268 | no | <0.01** | <0.01 | 0.036 |
| Liu | 9.149 | yes | <0.01** | <0.01 | 0.036 |
| Liu | 9.149 | no | <0.01** | <0.01 | 0.036 |
| Wang | 13.092 | yes | 0.026* | 0.014 | 0.044 |
| Wang | 13.092 | no | 0.022* | 0.011 | 0.039 |
Notes: SOWHAT by default propagates the exact number and position of gaps present in the original data set into all simulated data sets. Suppressing this feature, thereby excluding gaps from subsequent analyses, did not change the outcome of any SOWH tests examined here. * indicates p-values less than 0.05; ** indicates less than 0.01.
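Gap propagation of the kind described in these notes can be illustrated as a simple masking step. The sketch below is not SOWHAT's code: it assumes both alignments are lists of equal-length strings in the same taxon order, and that gaps are written as `-`.

```python
from typing import List


def propagate_gaps(original: List[str], simulated: List[str], gap: str = "-") -> List[str]:
    """Copy every gap position of `original` into the matching row of `simulated`."""
    if len(original) != len(simulated):
        raise ValueError("alignments must contain the same number of sequences")
    masked = []
    for orig_row, sim_row in zip(original, simulated):
        if len(orig_row) != len(sim_row):
            raise ValueError("sequences must have the same length")
        # Keep the simulated character unless the original has a gap there.
        masked.append(
            "".join(gap if o == gap else s for o, s in zip(orig_row, sim_row))
        )
    return masked


print(propagate_gaps(["AC-T", "-GGA"], ["TTTT", "CCCC"]))
```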
Model specification: JC69 analysis
| Data set | Model 1 (ML score) | Model 2 (parameters) | P-value | Conf. interval (lower) | Conf. interval (upper) |
|---|---|---|---|---|---|
| Buckley | GTR+G | GTR+G | 0.241 | 0.215 | 0.269 |
| Buckley | JC69 | GTR+G+I | <0.001** | <0.001 | 0.004 |
| Dixon | HKY+G | HKY+G | 0.012* | 0.004 | 0.026 |
| Dixon | JC69 | GTR+G+I | <0.005** | <0.005 | 0.007 |
| Dunn | GTR+G+I | GTR+G+I | <0.01** | <0.01 | 0.036 |
| Dunn | JC69 | GTR+G+I | <0.01** | <0.01 | 0.036 |
| Edwards | GTR+G+I | GTR+G+I | <0.01** | <0.01 | 0.036 |
| Edwards | JC69 | GTR+G+I | <0.01** | <0.01 | 0.036 |
| Liu | GTR+G+I | GTR+G+I | <0.01** | <0.01 | 0.036 |
| Liu | JC69 | GTR+G+I | <0.01** | <0.01 | 0.036 |
| Sullivan | GTR+G+I | GTR+G+I | 0.290 | 0.204 | 0.389 |
| Sullivan | JC69 | GTR+G+I | <0.01** | <0.01 | 0.036 |
| Wang | GTR+G+I | GTR+G+I | 0.026* | 0.014 | 0.044 |
| Wang | JC69 | GTR+G+I | <0.005** | <0.005 | 0.007 |
Notes: Model 1 represents the model used for likelihood inference (i.e., searching and scoring both the original and simulated data sets). Model 2 was used to estimate parameter values and simulate data sets. Separating the models for scoring and simulation has been suggested as a correction for a liberal bias present in the SOWH test. Using a model for scoring with fewer parameters free to vary, such as JC69, here resulted in a more liberal test. All hypotheses were rejected when JC69 was used as Model 1. * indicates p-values less than 0.05; ** indicates less than 0.01.
Model specification: CAT analysis
| Data set | Model 1 (ML score) | Model 2 (parameters) | P-value | Conf. interval (lower) | Conf. interval (upper) |
|---|---|---|---|---|---|
| Buckley | GTR+G | GTR+G | 0.241 | 0.215 | 0.269 |
| Buckley | GTR+G | CAT | 0.013* | 0.007 | 0.022 |
| Dunn | GTR+I+G | GTR+I+G | <0.01** | <0.01 | 0.036 |
| Dunn | GTR+I+G | CAT | 0.080 | 0.035 | 0.152 |
| Edwards | GTR+I+G | GTR+I+G | <0.01** | <0.01 | 0.036 |
| Edwards | GTR+I+G | CAT | <0.01** | <0.01 | 0.036 |
| Sullivan | GTR+I+G | GTR+I+G | 0.290 | 0.204 | 0.389 |
| Sullivan | GTR+I+G | CAT | 0.030* | 0.006 | 0.085 |
| Liu | GTR+I+G | GTR+I+G | <0.01** | <0.01 | 0.036 |
| Liu | GTR+I+G | CAT | <0.01** | <0.01 | 0.036 |
| Wang | GTR+G | GTR+G | 0.026* | 0.014 | 0.044 |
| Wang | GTR+G | CAT | 0.020* | 0.010 | 0.036 |
Notes: Model 1 and Model 2 are the same as described in Table 4. Using a model for simulation with a greater number of parameters free to vary, such as the CAT model of PhyloBayes, did not result in universally larger values and therefore a more conservative test, though this was true for one data set, Dunn. The outcome of two other tests also differed, for Buckley and Sullivan, but the result was a more liberal test. * indicates p-values less than 0.05; ** indicates less than 0.01.
Generating topology
| Data set | Model | Generating tree | P-value | Conf. interval (lower) | Conf. interval (upper) |
|---|---|---|---|---|---|
| Buckley | GTR+I+G | Fully Resolved | 0.025* | 0.016 | 0.037 |
| Buckley | GTR+I+G | Zero-constrained | 0.123 | 0.103 | 0.145 |
| Buckley | GTR+G | Fully Resolved | 0.241 | 0.215 | 0.269 |
| Buckley | GTR+G | Zero-constrained | 0.439 | 0.408 | 0.470 |
| Dixon | HKY+G | Fully Resolved | 0.012* | 0.004 | 0.026 |
| Dixon | HKY+G | Zero-constrained | 0.032* | 0.018 | 0.051 |
| Dixon | HKY | Fully Resolved | 0.002** | <0.005 | 0.011 |
| Dixon | HKY | Zero-constrained | <0.005** | <0.005 | 0.007 |
| Dunn | GTR+I+G | Fully Resolved | <0.01** | <0.01 | 0.036 |
| Dunn | GTR+I+G | Zero-constrained | <0.01** | <0.01 | 0.036 |
| Edwards | GTR+I+G | Fully Resolved | <0.01** | <0.01 | 0.036 |
| Edwards | GTR+I+G | Zero-constrained | <0.01** | <0.01 | 0.036 |
| Liu | GTR+I+G | Fully Resolved | <0.01** | <0.01 | 0.036 |
| Liu | GTR+I+G | Zero-constrained | <0.01** | <0.01 | 0.036 |
| Sullivan | GTR+I+G | Fully Resolved | 0.290 | 0.204 | 0.389 |
| Sullivan | GTR+I+G | Zero-constrained | 0.240 | 0.160 | 0.336 |
| Wang | GTR+G | Fully Resolved | 0.026* | 0.014 | 0.044 |
| Wang | GTR+G | Zero-constrained | 0.034* | 0.020 | 0.054 |
Notes: We compared SOWH tests performed using a fully resolved generating topology to tests performed using the zero-constrained tree, as suggested by Susko (2014). The zero-constrained tree is created by manipulating the most likely unconstrained tree so that edges incongruent with the alternative hypothesis are reduced to nearly zero. Using this method changed the outcome of only one test, Buckley, using the model GTR+I+G. * indicates p-values less than 0.05; ** indicates less than 0.01.
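One way to picture the zero-constrained tree is to treat each edge of the unconstrained tree as a bipartition of the taxa and to shrink every edge whose bipartition conflicts with the constraint. The sketch below rests on assumptions not stated in the notes: "incongruent" is read as standard split incompatibility, the tree is given as a hypothetical list of `(split, length)` pairs, and "nearly zero" is an arbitrary small constant.

```python
from typing import FrozenSet, Iterable, List, Tuple

Split = FrozenSet[str]  # the taxa on one side of an edge


def compatible(a: Split, b: Split, taxa: FrozenSet[str]) -> bool:
    """Two splits can coexist in one tree iff some pair of their sides is disjoint."""
    a_other, b_other = taxa - a, taxa - b
    return any(not (x & y) for x in (a, a_other) for y in (b, b_other))


def zero_constrain(
    edges: Iterable[Tuple[Split, float]],
    constraint: Iterable[Split],
    taxa: FrozenSet[str],
    eps: float = 1e-6,  # assumed value for "nearly zero"
) -> List[Tuple[Split, float]]:
    """Shrink every edge that conflicts with any constraint split to length eps."""
    constraint = list(constraint)
    return [
        (split, length if all(compatible(split, c, taxa) for c in constraint) else eps)
        for split, length in edges
    ]


taxa = frozenset("ABCDE")
ml_edges = [(frozenset("AB"), 0.1), (frozenset("ABC"), 0.2)]
hypothesis = [frozenset("AC")]  # alternative hypothesis groups A with C
print(zero_constrain(ml_edges, hypothesis, taxa))
```

Here the edge separating A and B from the rest conflicts with a grouping of A and C, so it is shrunk, while the edge separating A, B, and C from D and E is kept at its original length.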