David A Duchêne, Sebastian Duchêne, Simon Y W Ho.
Abstract
Statistical phylogenetic analyses of genomic data depend on models of nucleotide or amino acid substitution. The adequacy of these substitution models can be assessed using a number of test statistics, allowing the model to be rejected when it is found to provide a poor description of the evolutionary process. A potentially valuable use of model-adequacy test statistics is to identify when data sets are likely to produce unreliable phylogenetic estimates, but their differences in performance are rarely explored. We performed a comprehensive simulation study to identify test statistics that are sensitive to some of the most commonly cited sources of phylogenetic estimation error. Our results show that, for many test statistics, traditional thresholds for assessing model adequacy can fail to reject the model when the phylogenetic inferences are inaccurate and imprecise. This is particularly problematic when analysing loci that have few informative sites. We propose new thresholds for assessing substitution model adequacy and demonstrate their effectiveness in analyses of three phylogenomic data sets. These thresholds lead to frequent rejection of the model for loci that yield topological inferences that are imprecise and are likely to be inaccurate. We also propose the use of a summary statistic that provides a practical assessment of overall model adequacy. Our approach offers a promising means of enhancing model choice in genome-scale data sets, potentially leading to improvements in the reliability of phylogenomic inference.
Year: 2018 PMID: 29788113 PMCID: PMC6007652 DOI: 10.1093/gbe/evy094
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Details of Nine Test Statistics Used to Assess the Adequacy of Nucleotide Substitution Models
| Test Statistic | Calculation | Model Component Assessed | Type of Statistic | Reference |
|---|---|---|---|---|
| Chi-squared | Tree- and model-based chi-squared statistic of base frequencies across taxa. It is calculated using matrices of base composition with a row for each taxon and a column for each nucleotide. One matrix corresponds to the values expected under the tree and model of base composition, and the other corresponds to the observed base composition. The statistic is calculated using the following formula: $\chi^2 = \sum_{i}\sum_{j} (O_{ij} - E_{ij})^2 / E_{ij}$, where $O_{ij}$ and $E_{ij}$ are the observed and expected values for taxon $i$ and nucleotide $j$ | Stationarity of base composition | Data-based | |
| Multinomial (or unconstrained) likelihood | Product of the unique site frequencies ( | Overall fit | Data-based | |
| | Likelihood of the data using the unconstrained model ( | Overall fit | Data-inference hybrid | |
| Biochemical diversity | Calculated as the number of different bases occurring at each site, and the mean value taken across the alignment | Diversity in base composition across sites | Data-based | |
| Consistency index | Minimum possible number of substitutions in the data divided by the minimum number required to describe a given tree using parsimony. It has been considered as a measure of homoplasy, and is expected to take a value of 1 in the absence of homoplasy | Consistency of phylogenetic information in the data compared with the most parsimonious scenario | Data-inference hybrid | |
| Branch support | Mean of branch-support values across the maximum-likelihood tree. Branch support can be calculated in a variety of ways, including nonparametric bootstrap or approximate likelihood-ratio test ( | Overall fit | Inference-based | |
| 95% CI in branch-support statistic | 95% range in branch-support values across the maximum-likelihood tree | Overall fit | Inference-based | |
| Tree length | Sum of the branch lengths in the maximum-likelihood tree | Overall fit | Inference-based | |
| Mahalanobis distance | A test of model adequacy can be placed in a multivariate setting that simultaneously considers multiple test statistics by using Mahalanobis distances. The aim of this approach is to estimate a distance between the empirical test statistics and the multivariate predictive distribution from several test statistics. Individual test statistics are first standardized so that they appear on the same scale. Then it is possible to calculate the mean ($\mu$) and covariance matrix ($S$) of the predictive distribution and the distance $D = \sqrt{(x - \mu)^{\mathsf{T}} S^{-1} (x - \mu)}$, where $x$ is the vector of empirical test statistics. For assessing substitution model adequacy, we used two combinations of test statistics to define the multivariate distribution. One included the other eight test statistics considered in this study, and the other included the four statistics that were the most sensitive to biased phylogenetic inferences. | Summary assessment from multiple test statistics; Overall fit | Based on other test statistics | |
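
Several of the statistics in the table above are simple enough to sketch in a few lines. The following Python is a minimal illustration, not the authors' implementation. It assumes an alignment is a list of equal-length strings, that the multinomial likelihood takes its standard form (the product over unique site patterns of (n_i / N)^n_i, computed here on the log scale), and that the predictive distribution is supplied as one row of test statistics per simulated data set. All function names are hypothetical. The standardization step is omitted because the Mahalanobis distance does not change when individual statistics are rescaled.

```python
import math
from collections import Counter


def biochemical_diversity(alignment):
    """Mean number of distinct bases per alignment column."""
    columns = list(zip(*alignment))
    return sum(len(set(col)) for col in columns) / len(columns)


def multinomial_log_likelihood(alignment):
    """Log of the product over unique site patterns of (n_i / N) ** n_i."""
    patterns = Counter(zip(*alignment))
    n_sites = sum(patterns.values())
    return sum(n * math.log(n / n_sites) for n in patterns.values())


def chi_squared_composition(observed, expected):
    """Sum of (O - E)^2 / E over taxa (rows) and nucleotides (columns)."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))


def _solve(matrix, vector):
    """Solve matrix @ x = vector by Gauss-Jordan elimination with pivoting."""
    n = len(vector)
    aug = [list(row) + [v] for row, v in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def mahalanobis_distance(empirical, predictive):
    """Distance from the empirical statistics to the predictive distribution.

    `empirical` is a list of k test statistics; `predictive` is a list of
    rows, each holding the same k statistics for one simulated data set.
    """
    k, m = len(empirical), len(predictive)
    means = [sum(row[j] for row in predictive) / m for j in range(k)]
    # Sample covariance matrix of the predictive test statistics.
    cov = [[sum((row[a] - means[a]) * (row[b] - means[b])
                for row in predictive) / (m - 1)
            for b in range(k)] for a in range(k)]
    diff = [x - mu for x, mu in zip(empirical, means)]
    # Solve cov @ y = diff rather than inverting the covariance matrix.
    solved = _solve(cov, diff)
    return math.sqrt(sum(d * s for d, s in zip(diff, solved)))


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGA", "ACTA"]  # hypothetical three-taxon data
    print(biochemical_diversity(toy_alignment))
    print(multinomial_log_likelihood(toy_alignment))
```

The sketch uses only the standard library to stay self-contained. In practice a numerical library would be the more usual choice for the covariance and linear-solve steps, and the covariance matrix must be non-singular for the distance to be defined.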
Fig. 1. The six characteristics that were varied in simulations of sequence evolution to investigate the performance and adequacy of the candidate substitution model (GTR + Γ): (a) substitution model parameterization; (b) compositional heterogeneity; (c) covarion-like rate variation; (d) terminal branch lengths; (e) covarion-like rate variation and terminal branch lengths; and (f) sequence length for each locus. One hundred replicates were performed under each scenario from (a) to (e), under each of the sequence lengths shown in (f). Colors in (a) indicate different rate parameters, whereas in (b) they indicate the magnitude and proportion of taxa undergoing a change in base composition. Branch thickness corresponds to evolutionary rate in (c), (d), and (e).
Fig. 2. The performance of phylogenetic inference using the GTR + Γ substitution model in simulations with 5,000 nucleotides under six representative simulation conditions (for results from every simulation scenario, see supplementary figs. S1–S3, Supplementary Material online). Each box represents the results of 100 replicate analyses. Performance is described by (a) the length of the estimated tree minus that of the simulated tree, divided by that of the simulated tree, (b) the difference in stemminess, defined as the proportion of the inferred tree length represented by internal branches, (c) the unweighted Robinson–Foulds topological distance between estimated and simulated trees, and (d) the mean node support in the estimated tree, which is a measure of precision in estimates.
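
The performance measures in panels (a) to (c) reduce to short calculations once branch lengths and bipartitions have been extracted from the trees. The sketch below is illustrative only and is not taken from the study. It assumes branch lengths are given as plain lists of numbers and that each tree's internal branches are represented as a set of `frozenset` taxon groups, each written from the same side of the split (for example, the side that excludes a fixed reference taxon). The function names are hypothetical.

```python
def relative_tree_length_error(estimated_lengths, simulated_lengths):
    """(estimated tree length - simulated tree length) / simulated tree length."""
    estimated, simulated = sum(estimated_lengths), sum(simulated_lengths)
    return (estimated - simulated) / simulated


def stemminess(internal_lengths, terminal_lengths):
    """Proportion of the total tree length carried by internal branches."""
    internal = sum(internal_lengths)
    return internal / (internal + sum(terminal_lengths))


def robinson_foulds(splits_a, splits_b):
    """Unweighted Robinson-Foulds distance: splits found in only one tree."""
    return len(set(splits_a) ^ set(splits_b))


if __name__ == "__main__":
    # Hypothetical five-taxon example; splits exclude reference taxon "E".
    simulated_splits = {frozenset("AB"), frozenset("ABC")}
    estimated_splits = {frozenset("AB"), frozenset("ABD")}
    print(robinson_foulds(estimated_splits, simulated_splits))
    print(stemminess([0.1], [0.2, 0.2, 0.25, 0.25]))
```

The difference in stemminess shown in panel (b) would then be the value for the estimated tree minus the value for the simulated tree.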
Fig. 3. The sensitivity of nine test statistics for assessing the adequacy of the GTR + Γ substitution model in simulations with 5,000 nucleotides under six representative simulation conditions (for results from every simulation scenario, see supplementary figs. S4–S6, Supplementary Material online). The Mahalanobis test statistic was calculated to summarize all test statistics (M1), or the four sensitive test statistics (M2). Each box represents the results of 100 replicate analyses.
Fig. 4. Estimated two-dimensional representation of tree-space for samples of loci from turtles, birds, and mammals. Data are colored such that warmer colors indicate higher values of (a) Mahalanobis distance (M2), (b) mean branch support, (c) tree length, and (d) number of variable sites. High values of these variables occur in similar locations of tree-space. In the sequence data from turtles, loci with high values of M2 are not necessarily the same as those that have high values for the other variables (see supplementary fig. S7, Supplementary Material online).
Fig. 5. Estimated two-dimensional representation of tree-space for loci from turtles, birds, and mammals. Data are colored according to whether each locus passes (black) or fails (red) each of the five tests of model adequacy using our new thresholds for assessment.