Michael D Woodhams, Jesús Fernández-Sánchez, Jeremy G Sumner.
Abstract
When the process underlying DNA substitutions varies across evolutionary history, some standard Markov models underlying phylogenetic methods are mathematically inconsistent. The most prominent example is the general time-reversible model (GTR) together with some, but not all, of its submodels. To rectify this deficiency, nonhomogeneous Lie Markov models have been identified as the class of models that are consistent in the face of a changing process of DNA substitutions regardless of taxon sampling. Some well-known models in popular use are within this class, but are either overly simplistic (e.g., the Kimura two-parameter model) or overly complex (the general Markov model). On a diverse set of biological data sets, we test a hierarchy of Lie Markov models spanning the full range of parameter richness. Compared against the benchmark of the ever-popular GTR model, we find that as a whole the Lie Markov models perform well, with the best performing models having 8-10 parameters and the ability to recognize the distinction between purines and pyrimidines.
Keywords: Lie Markov models; Model selection; ModelTest; multiplicative closure; phylogenetics
Year: 2015 PMID: 25858352 PMCID: PMC4468350 DOI: 10.1093/sysbio/syv021
Source DB: PubMed Journal: Syst Biol ISSN: 1063-5157 Impact factor: 15.683
The rate matrices of RY Lie Markov models are linear combinations of basis matrices
Notes: Each model uses a subset of the first 12 matrices listed here. Under some circumstances it is mathematically convenient to replace with the 13th matrix, .
The RY Lie Markov models. Basis matrix can be substituted for throughout
| Name | Basis matrices | Name | Basis matrices |
|---|---|---|---|
| 1.1 | | 6.6 | |
| 2.2b | | 6.7a | |
| 3.3a | | 6.7b | |
| 3.3b | | 6.8a | |
| 3.3c | | 6.8b | |
| 3.4 | | 6.17a | |
| 4.4a | | 6.17b | |
| 4.4b | | 8.8 | |
| 4.5a | | 8.10a | |
| 4.5b | | 8.10b | |
| 5.6a | | 8.16 | |
| 5.6b | | 8.17 | |
| 5.7a | | 8.18 | |
| 5.7b | | 9.20a | |
| 5.7c | | 9.20b | |
| 5.11a | | 10.12 | |
| 5.11b | | 10.34 | |
| 5.11c | | 12.12 | |
| 5.16 | | | |
Notes: The number before the point indicates the dimension (number of parameters) of the model, the number after the point is the number of rays generated by the model.
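The naming convention in the note above can be sketched as a small parser. This is only an illustration: the function and type names are hypothetical, and the optional `RY`/`WS`/`MK` prefix is handled simply because such prefixed names appear in the later tables.

```python
import re
from typing import NamedTuple, Optional


class LieMarkovName(NamedTuple):
    prefix: Optional[str]  # "RY", "WS", "MK", or None when the name has no prefix
    dimension: int         # number before the point: number of parameters
    rays: int              # number after the point: number of rays
    variant: str           # trailing letter such as "a", "b", "c", or "" if absent


# Hypothetical pattern covering names such as "5.11a", "RY8.10b", "12.12".
_NAME = re.compile(r"^(RY|WS|MK)?(\d+)\.(\d+)([a-z]?)$")


def parse_model_name(name: str) -> LieMarkovName:
    """Split a model name into prefix, dimension, rays, and variant letter."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a Lie Markov model name: {name!r}")
    prefix, dimension, rays, variant = match.groups()
    return LieMarkovName(prefix, int(dimension), int(rays), variant)


print(parse_model_name("5.11a"))
print(parse_model_name("RY8.10b"))
```

Names of non-Lie Markov models such as `GTR` do not follow this pattern and are rejected with a `ValueError`.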
Some properties of the RY Lie Markov models
| Name | aka | Rev? | EBFDF | Name | aka | Rev? | EBFDF |
|---|---|---|---|---|---|---|---|
| 1.1 | JC | ✓ | 0 | 6.6 | (SSM) | × | 1 |
| 2.2b | K2ST | ✓ | 0 | 6.7a | | × | 3 |
| 3.3a | K3ST | ✓ | 0 | 6.7b | | × | 3 |
| 3.3b | | × | 0 | 6.8a | | × | 3 |
| 3.3c | TrNef | ✓ | 0 | 6.8b | | × | 1 |
| 3.4 | | ✓ | 1 | 6.17a | | × | 1 |
| 4.4a | F81 | ✓ | 3 | 6.17b | | × | 1 |
| 4.4b | | ✓ | 1 | 8.8 | | × | 3 |
| 4.5a | | × | 1 | 8.10a | | × | 3 |
| 4.5b | | × | 1 | 8.10b | | × | 1 |
| 5.6a | | × | 0 | 8.16 | | × | 3 |
| 5.6b | | × | 3 | 8.17 | | × | 3 |
| 5.7a | | × | 2 | 8.18 | | × | 3 |
| 5.7b | | × | 0 | 9.20a | | × | 2 |
| 5.7c | | × | 0 | 9.20b | DS | × | 0 |
| 5.11a | | × | 2 | 10.12 | | × | 3 |
| 5.11b | | × | 0 | 10.34 | | × | 3 |
| 5.11c | | × | 0 | 12.12 | GM | × | 3 |
| 5.16 | | × | 1 | | | | |
Notes: The “aka” (“also known as”) column identifies models already known to phylogenetics under a different name (see text). “Rev?” indicates which models are time reversible (✓) and which are not (×). “EBFDF” is the equilibrium base frequency degrees of freedom. EBFDF = 0 has . EBFDF = 1 has . EBFDF=2 has . EBFDF = 3 has unconstrained EBF.
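The consistency property that motivates these models (the "multiplicative closure" keyword) can be illustrated numerically for the simplest nontrivial case, K2ST (model 2.2b). The sketch below is not the paper's construction: the nucleotide order A, G, C, T, the parameter names `a` (transition probability) and `b` (probability of each transversion), and the numeric values are all assumptions chosen for illustration. It checks that the product of two K2ST-form Markov matrices with different parameters is again of K2ST form.

```python
from typing import List

Matrix = List[List[float]]


def k2st_matrix(a: float, b: float) -> Matrix:
    """Markov matrix of K2ST form over the (assumed) order A, G, C, T.

    a: probability of the transition (A<->G, C<->T);
    b: probability of each of the two transversions.
    """
    d = 1.0 - a - 2.0 * b  # diagonal entry so that each row sums to 1
    return [
        [d, a, b, b],
        [a, d, b, b],
        [b, b, d, a],
        [b, b, a, d],
    ]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain 4x4 matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def is_k2st_form(m: Matrix, tol: float = 1e-12) -> bool:
    """True if m equals the K2ST matrix implied by its own first row."""
    expected = k2st_matrix(m[0][1], m[0][2])
    return all(abs(m[i][j] - expected[i][j]) <= tol
               for i in range(4) for j in range(4))


# Two branches with different (illustrative) substitution processes.
m1 = k2st_matrix(0.10, 0.02)
m2 = k2st_matrix(0.03, 0.05)
product = matmul(m1, m2)
print(is_k2st_form(product))
```

A single numerical check like this is evidence for one pair of matrices, not a proof of closure. It is meant only to make concrete what "consistent in the face of a changing process" requires of a model.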
Nesting relationships of the RY Lie Markov models. Box shape and weight indicate the degrees of freedom in EBF. Alternate model names are in parentheses. Solid or dotted connecting lines are to reduce visual confusion and have no additional significance.
The top 10 models for each data set, by Bayesian Information Criterion (BIC)
| Clade: | Human | Angiosperms | Cormorants | Yeast | Teleost Fish | Buttercups | Ratites |
|---|---|---|---|---|---|---|---|
| Approx range: | Species | Class | Family | Genus | mult. orders | Genus | Order |
| Tree diameter: | 0.008 | 0.434 | 0.721 | 1.465 | 0.523 | 0.021 | 1.085 |
| DNA type | mitoch | chlorop | mito/nuc | mostly nuc | nuclear | chlorop | mitoch |
| taxa×sites | |||||||
| Site rate model | |||||||
| | TrN | MK10.34 | HKY | 12.12 | RY5.11b | WS4.4b | RY8.16 |
| | HKY | RY8.18 | TrN | GTR | RY3.3c | WS3.4 | RY10.34 |
| | 8.9 | 16.0 | 6.5 | 79.8 | 0.1 | 0.0 | 7.8 |
| | TIM | 12.12 | K81uf | RY10.12 | RY2.2b | WS4.5a | TVM |
| | 9.7 | 16.9 | 6.8 | 912.0 | 3.4 | 5.1 | 8.6 |
| | RY8.8 | MK8.17 | RY8.8 | RY8.8 | RY5.7b | WS4.5b | 12.12 |
| | 13.5 | 26.9 | 10.8 | 946.5 | 6.4 | 6.0 | 14.0 |
| | RY8.18 | WS8.10a | TIM | RY9.20a | TIMef | MK5.7a | GTR |
| | 15.4 | 27.3 | 13.3 | 1156.6 | 6.7 | 10.3 | 15.2 |
| | K81uf | WS10.12 | RY8.18 | WS10.12 | RY4.4b | RY5.7a | WS10.12 |
| | 18.6 | 28.3 | 16.2 | 1450.6 | 7.5 | 10.7 | 18.6 |
| | GTR | RY10.12 | MK8.10a | TVM | RY3.4 | WS6.8a | RY8.17 |
| | 21.3 | 29.8 | 17.3 | 1518.7 | 8.5 | 11.5 | 30.5 |
| | TVM | WS10.34 | TVM | TIM | SYM | WS5.6b | WS8.10a |
| | 29.9 | 36.1 | 19.3 | 1613.4 | 9.5 | 12.4 | 34.7 |
| | RY10.12 | RY9.20a | RY10.12 | TrN | RY5.11a | WS6.6 | MK10.12 |
| | 30.5 | 89.5 | 20.1 | 1640.5 | 9.6 | 13.1 | 42.3 |
| | MK10.34 | WS8.10b | WS8.17 | MK10.34 | 3.3a | K81uf | WS10.34 |
| | 31.1 | 108.5 | 21.6 | 1663.0 | 10.0 | 14.4 | 44.6 |
Notes: The number beneath each model name is how much worse that model scores than the optimal model (listed first). A complete table of BIC scores is available in the Supplementary Material on Dryad at http://dx.doi.org/10.5061/dryad.461g6. Tree diameter is approximately the number of mutations per site between the most distant taxa.
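As a reminder of how such score differences arise, here is a minimal sketch using the standard definition BIC = k ln(n) − 2 ln(L), where k is the number of free parameters, n the sample size, and L the maximized likelihood. Taking n to be the number of alignment sites is an assumption, and the log-likelihoods, parameter counts, and model labels in the example are invented for illustration, not taken from the paper.

```python
import math
from typing import Dict, List, Tuple


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian Information Criterion; lower is better."""
    return n_params * math.log(n_sites) - 2.0 * log_likelihood


def delta_bic(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (model, BIC minus best BIC) pairs, best model first."""
    best = min(scores.values())
    return sorted(((name, score - best) for name, score in scores.items()),
                  key=lambda pair: pair[1])


# Hypothetical fits: (maximized log-likelihood, number of free parameters).
fits = {"model_A": (-10250.0, 5), "model_B": (-10243.0, 8), "model_C": (-10241.5, 12)}
n_sites = 1500  # assumed alignment length

scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in fits.items()}
for name, diff in delta_bic(scores):
    print(f"{name}: {diff:.1f}")
```

The penalty term k ln(n) is what allows a model with fewer parameters to outrank a richer model that fits slightly better.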
Summary of rankings of models under BIC for the seven data sets. Models marked “*” are time reversible, non-Lie Markov models
| Model | Median rank | Best rank | EBFDF | Model | Median rank | Best rank | EBFDF | Model | Median rank | Best rank | EBFDF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| *TVM | 8 | 3 | 3 | WS9.20a | 38 | 28 | 2 | WS8.8 | 73 | 22 | 3 |
| RY10.12 | 9 | 3 | 3 | *SYM | 41 | 8 | 0 | MK6.7b | 74 | 42 | 3 |
| MK10.34 | 11 | 1 | 3 | 9.20b | 41 | 39 | 0 | WS8.16 | 75 | 23 | 3 |
| WS10.34 | 11 | 8 | 3 | RY6.8b | 42 | 34 | 1 | WS6.7b | 77 | 14 | 3 |
| *GTR | 12 | 2 | 3 | WS6.6 | 43 | 9 | 1 | MK8.8 | 77 | 39 | 3 |
| 12.12 | 12 | 1 | 3 | MK8.10b | 43 | 17 | 1 | WS6.8a | 78 | 7 | 3 |
| RY8.18 | 14 | 2 | 3 | RY8.10b | 46 | 33 | 1 | WS5.6b | 79 | 8 | 3 |
| RY8.8 | 14 | 4 | 3 | WS8.10b | 47 | 10 | 1 | MK6.8a | 79 | 40 | 3 |
| *K81uf | 15 | 3 | 3 | RY6.6 | 48 | 33 | 1 | MK8.16 | 81 | 38 | 3 |
| *TIM | 16 | 3 | 3 | *TVMef | 49 | 13 | 0 | WS5.11a | 81 | 50 | 2 |
| WS8.10a | 17 | 5 | 3 | RY4.4b | 49 | 6 | 1 | WS6.17b | 82 | 16 | 1 |
| MK8.17 | 17 | 4 | 3 | MK6.6 | 50 | 18 | 1 | MK5.6b | 83 | 34 | 3 |
| *HKY | 18 | 1 | 3 | RY5.11b | 50 | 1 | 0 | WS6.8b | 87 | 12 | 1 |
| *TrN | 18 | 1 | 3 | RY5.11c | 50 | 15 | 0 | WS5.16 | 87 | 11 | 1 |
| WS10.12 | 18 | 6 | 3 | RY5.16 | 51 | 39 | 1 | MK5.11a | 88 | 37 | 2 |
| RY9.20a | 19 | 5 | 2 | RY4.5a | 53 | 18 | 1 | MK4.5b | 88 | 69 | 1 |
| MK10.12 | 20 | 9 | 3 | MK6.17a | 54 | 32 | 1 | WS5.11b | 88 | 65 | 0 |
| RY8.16 | 21 | 1 | 3 | RY6.17a | 54 | 34 | 1 | WS4.5b | 89 | 4 | 1 |
| RY8.10a | 23 | 11 | 3 | MK5.6a | 55 | 19 | 0 | MK4.4b | 89 | 64 | 1 |
| RY10.34 | 23 | 2 | 3 | RY6.17b | 57 | 35 | 1 | WS3.3b | 90 | 59 | 0 |
| MK8.10a | 24 | 7 | 3 | MK4.5a | 58 | 25 | 1 | WS3.4 | 91 | 2 | 1 |
| RY6.8a | 24 | 13 | 3 | RY5.7b | 59 | 4 | 0 | MK6.8b | 91 | 67 | 1 |
| 6.7a | 25 | 13 | 3 | *TIMef | 60 | 5 | 0 | WS4.4b | 92 | 1 | 1 |
| MK8.18 | 25 | 16 | 3 | RY3.4 | 60 | 7 | 1 | WS2.2b | 93 | 57 | 0 |
| RY8.17 | 26 | 7 | 3 | WS5.6a | 61 | 23 | 0 | MK6.17b | 94 | 71 | 1 |
| RY5.6b | 26 | 21 | 3 | RY5.6a | 61 | 14 | 0 | WS3.3c | 95 | 56 | 0 |
| WS8.17 | 27 | 10 | 3 | RY4.5b | 61 | 22 | 1 | MK5.16 | 96 | 70 | 1 |
| WS8.18 | 28 | 24 | 3 | RY5.7c | 61 | 43 | 0 | WS5.11c | 97 | 67 | 0 |
| RY6.7b | 31 | 22 | 3 | WS5.7c | 62 | 16 | 0 | MK5.11b | 97 | 85 | 0 |
| RY5.11a | 32 | 9 | 2 | MK5.7b | 63 | 37 | 0 | MK3.4 | 98 | 68 | 1 |
| MK5.7a | 32 | 5 | 2 | MK5.7c | 64 | 17 | 0 | MK3.3c | 98 | 75 | 0 |
| WS6.17a | 32 | 15 | 1 | WS5.7b | 64 | 40 | 0 | 4.4a | 99 | 44 | 3 |
| RY5.7a | 33 | 6 | 2 | 3.3a | 65 | 10 | 0 | MK5.11c | 99 | 78 | 0 |
| MK9.20a | 33 | 19 | 2 | RY3.3c | 67 | 2 | 0 | MK3.3b | 100 | 77 | 0 |
| WS5.7a | 35 | 23 | 2 | RY3.3b | 71 | 11 | 0 | MK2.2b | 102 | 74 | 0 |
| WS4.5a | 37 | 3 | 1 | RY2.2b | 72 | 3 | 0 | *JC | 108 | 83 | 0 |
Notes: EBFDF = Equilibrium base frequency degrees of freedom (see text under “Equilibrium base frequencies”). The best ranked models have high EBFDF.
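The "Median rank" and "Best rank" columns can be derived from per-data-set rankings along the following lines. This is a sketch under assumptions: the data-set labels and ranks below are hypothetical, and breaking ties in median rank by best rank is an arbitrary choice made here, not necessarily the ordering used in the table.

```python
from statistics import median
from typing import Dict, List, Tuple


def summarize_ranks(
    ranks_by_dataset: Dict[str, Dict[str, int]]
) -> List[Tuple[str, float, int]]:
    """Return (model, median rank, best rank), sorted by median then best rank."""
    per_model: Dict[str, List[int]] = {}
    for ranks in ranks_by_dataset.values():
        for model, rank in ranks.items():
            per_model.setdefault(model, []).append(rank)
    summary = [(model, median(ranks), min(ranks)) for model, ranks in per_model.items()]
    return sorted(summary, key=lambda row: (row[1], row[2]))


# Hypothetical ranks of three models on three data sets (1 = best BIC).
example = {
    "dataset_1": {"model_A": 1, "model_B": 2, "model_C": 3},
    "dataset_2": {"model_A": 3, "model_B": 1, "model_C": 2},
    "dataset_3": {"model_A": 2, "model_B": 1, "model_C": 3},
}

for model, med, best in summarize_ranks(example):
    print(model, med, best)
```

With an odd number of data sets, such as the seven used here, the median is simply the middle rank, so it is always an integer.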
A parameterization of model 3.4 which is restricted to only the stochastic rate matrices. (a) The region of stochasticity for model 3.4 with fixed . (b) Without loss of generality, we take . Given in defines point (representing a matrix) on the edge of the region of stochasticity, and a measure of how far is from the origin, which defines the JC matrix . (c) () have defined a stochastic rate matrix .
Approximate levels of saturation of model Markov matrices before their product matrix has significant (>5%) chance of being nonembeddable (i.e., “average” rate matrix , as defined in the text, is nonstochastic)
| Saturation | Possible embeddability issues |
|---|---|
| 1 Substitution/site | 5.6a, 6.6, 6.8a, 6.8b, 8.8, 8.10a, 8.10b, 8.16, 8.17, 8.18, 10.12, 10.34 |
| 2 Substitution/site | 5.6b, 5.7b, 5.11a, 5.11b, 5.11c, 5.16, 6.7a, 6.7b, 6.17a, 6.17b |
| 3 Substitution/site | 4.4b, 5.7c |
| > 3 Substitution/site | 3.4, 4.5a, 4.5b |
| never | 2.2b, 3.3a, 3.3b, 3.3c, 4.4a |
Notes: Data derived from Monte Carlo simulation.