Arong Luo, Huijie Qiao, Yanzhou Zhang, Weifeng Shi, Simon YW Ho, Weijun Xu, Aibing Zhang, Chaodong Zhu.
Abstract
BACKGROUND: Explicit evolutionary models are required in maximum-likelihood and Bayesian inference, the two methods that are overwhelmingly used in phylogenetic studies of DNA sequence data. Appropriate selection of nucleotide substitution models is important because the use of incorrect models can mislead phylogenetic inference. To better understand the performance of different model-selection criteria, we used 33,600 simulated data sets to analyse the accuracy, precision, dissimilarity, and biases of the hierarchical likelihood-ratio test, Akaike information criterion, Bayesian information criterion, and decision theory.
Year: 2010 PMID: 20696057 PMCID: PMC2925852 DOI: 10.1186/1471-2148-10-242
Source DB: PubMed Journal: BMC Evol Biol ISSN: 1471-2148 Impact factor: 3.260
Conditions used in simulations for 24 models of the GTR family.
| Simulation | Parameter set | Tree height | Ntaxa | Nchar | Simulation software | No. |
|---|---|---|---|---|---|---|
| I | A | 0.7 | 30 | 1000 | Seq-Gen | I-1 |
| I | A | 0.5 | 30 | 1000 | Seq-Gen | I-2 |
| I | A | 0.3 | 30 | 1000 | Seq-Gen | I-3 |
| I | A | 0.1 | 30 | 1000 | Seq-Gen | I-4 |
| I | B | 0.7 | 30 | 1000 | Seq-Gen | I-5 |
| I | B | 0.5 | 30 | 1000 | Seq-Gen | I-6 |
| I | B | 0.3 | 30 | 1000 | Seq-Gen | I-7 |
| I | B | 0.1 | 30 | 1000 | Seq-Gen | I-8 |
| II | A | 0.5 | 22 | 1000 | Seq-Gen | II-1 |
| II | A | 0.5 | 50 | 1000 | Seq-Gen | II-2 |
| III | A | 0.5 | 22 | 1000 | Mesquite | III-1 |
| IV | A | Nonclock | 22 | 1000 | Seq-Gen | IV-1 |
| V | A | 0.5 | 30 | 300 | Seq-Gen | V-1 |
| V | A | 0.5 | 30 | 2000 | Seq-Gen | V-2 |
One hundred replicates were performed for each set of conditions. The models consisted of JC (0) [61], K80 (1) [62], SYM (5) [63], F81 (3) [64], HKY (4) [65,66], GTR (8) [67], JC + I (1), K80 + I (2), SYM + I (6), F81 + I (4), HKY + I (5), GTR + I (9), JC + Γ (1), K80 + Γ (2), SYM + Γ (6), F81 + Γ (4), HKY + Γ (5), GTR + Γ (9), JC + I + Γ (2), K80 + I + Γ (3), SYM + I + Γ (7), F81 + I + Γ (5), HKY + I + Γ (6) and GTR + I + Γ (10), where 'I' represents the proportion of invariable sites, 'Γ' represents the discrete gamma distribution with four rate categories, and the number in parentheses is the number of free parameters of each model. The 24 models were classified in two ways. The first placed them into four categories: base (JC, K80, etc.), base + I (JC + I, K80 + I, etc.), base + Γ (JC + Γ, K80 + Γ, etc.) and base + I + Γ (JC + I + Γ, K80 + I + Γ, etc.). The second grouped together models with the same number of free parameters, giving a total of 11 categories. In addition, we refer to the four models that share the same parameters in the substitution-rate matrix as base-like models (e.g., SYM, SYM + I, SYM + Γ, and SYM + I + Γ are the SYM-like models).
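The two classifications described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the names `BASE_PARAMS`, `EXTRAS`, `build_models`, and `group_by_parameter_count` are hypothetical, and the sketch assumes that "+ I" and "+ Γ" each add exactly one free parameter to the base model.

```python
from collections import defaultdict

# Free parameters of the six base models, as listed in the text.
BASE_PARAMS = {"JC": 0, "K80": 1, "SYM": 5, "F81": 3, "HKY": 4, "GTR": 8}

# Assumption: "+ I" (proportion of invariable sites) and "+ Γ" (gamma shape)
# each contribute one additional free parameter.
EXTRAS = {"": 0, " + I": 1, " + Γ": 1, " + I + Γ": 2}


def build_models():
    """Return {model name: number of free parameters} for all 24 models."""
    return {
        base + suffix: k + extra
        for base, k in BASE_PARAMS.items()
        for suffix, extra in EXTRAS.items()
    }


def group_by_parameter_count(models):
    """Group model names by their number of free parameters."""
    groups = defaultdict(list)
    for name, k in models.items():
        groups[k].append(name)
    return dict(sorted(groups.items()))


models = build_models()
groups = group_by_parameter_count(models)

# 14 sets of conditions x 24 simulated models x 100 replicates each.
n_datasets = 14 * len(models) * 100

print(len(models), len(groups), n_datasets)  # prints: 24 11 33600
```

Under these assumptions the parameter counts run from 0 (JC) to 10 (GTR + I + Γ), which gives the 11 categories mentioned in the text, and the total matches the 33,600 simulated data sets reported in the abstract.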
Figure 1. Accuracy values of the four model-selection criteria for selecting 24 simulated models. In the multiple-line charts, categories along the x-axis represent the simulated models. For the sake of clarity, only six models are labelled, and each one is followed by three similar ones (e.g., JC is followed by JC + I, JC + Γ, and JC + I + Γ). The y-axis represents the accuracy values (%). A shows the results of simulations I-1, I-2, I-3, II-1, II-2, III-1, V-1, and V-2; B shows the results of simulations I-5, I-6, I-7, and I-8; and C shows the results of the other two simulations (I-4 and IV-1).
Figure 2. Trees used to guide dataset simulations. Tree heights are 0.7, 0.5, 0.3, and 0.1 substitutions per site for A, B, C, and D, respectively (30 taxa each); tree heights are 0.5 substitutions per site for both E (22 taxa) and F (50 taxa). All trees are ultrametric except for tree G, which is the non-clock tree (22 taxa) and was only used in simulation IV-1.
Figure 3. Precision of the four criteria corresponding to 24 simulated models. Categories along the x-axis represent the 24 simulated models. For the sake of clarity, only seven models are labelled, and each one is followed by three similar ones (e.g., JC is followed by JC + I, JC + Γ, and JC + I + Γ). The y-axis represents the means and standard deviations of the precision values for each simulated model across the 14 simulations; these summary statistics differ from those reported in Additional file 2. The markers denote the means, and the lengths of the error bars denote the standard deviations.
Number of distinct models selected by the four model-selection criteria in the 14 simulations (% of data sets).
| Simulation | One model | Two models | Three models | Four models |
|---|---|---|---|---|
| I-1 | 41.33 | 48.08 | 10.5 | 0.08 |
| I-2 | 46.13 | 47.96 | 5.88 | 0.04 |
| I-3 | 46.13 | 47.54 | 6.25 | 0.08 |
| I-4 | 42.29 | 50.04 | 7.67 | 0 |
| I-5 | 33.83 | 54.75 | 11.33 | 0.08 |
| I-6 | 35.96 | 53.96 | 9.96 | 0.13 |
| I-7 | 35.63 | 53.54 | 10.63 | 0.21 |
| I-8 | 34.54 | 52.5 | 12.88 | 0.08 |
| II-1 | 46.75 | 46.75 | 6.42 | 0.08 |
| II-2 | 42.38 | 49.46 | 8.13 | 0.04 |
| III-1 | 46.71 | 48.04 | 5.17 | 0.08 |
| IV-1 | 37.83 | 52.67 | 9.5 | 0 |
| V-1 | 43.75 | 47.38 | 8.75 | 0.13 |
| V-2 | 47.67 | 46.13 | 6.21 | 0 |
Due to rounding, the four numerical values in some simulations do not sum exactly to 100%.
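The rounding effect can be checked directly from the table. The sketch below takes rows I-1 and I-2 as printed, sums the percentages exactly, and converts them back to whole-number counts. It assumes each simulation comprises 2,400 data sets (24 models × 100 replicates); the helper names are hypothetical.

```python
from decimal import Decimal

# Assumption: 24 simulated models x 100 replicates per simulation.
DATASETS_PER_SIMULATION = 24 * 100

# Percentages copied from the table (one, two, three, four models selected).
ROWS = {
    "I-1": ["41.33", "48.08", "10.5", "0.08"],
    "I-2": ["46.13", "47.96", "5.88", "0.04"],
}


def percent_sum(row):
    """Exact sum of the rounded percentages in one table row."""
    return sum(Decimal(v) for v in row)


def implied_counts(row, total=DATASETS_PER_SIMULATION):
    """Nearest whole-number counts consistent with the rounded percentages."""
    return [int((Decimal(v) * total / 100).to_integral_value()) for v in row]


for name, row in ROWS.items():
    counts = implied_counts(row)
    print(name, percent_sum(row), counts, sum(counts))
```

For these two rows the percentages sum to 99.99 and 100.01, while the implied counts add up to exactly 2,400 in both cases, which is consistent with the discrepancy being a rounding artefact.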
Figure 4. Dissimilarity of six criterion pairs for 24 simulated models. These charts illustrate the dissimilarity of every pair of model-selection criteria corresponding to 24 simulated models. Categories along the x-axis represent the 24 models. For the sake of clarity, only seven models are labelled, and each one is followed by three similar ones (e.g., JC is followed by JC + I, JC + Γ, and JC + I + Γ). The y-axis represents the dissimilarity values (%). A shows the results of simulations I-1, I-2, I-3, II-1, II-2, III-1, V-1, and V-2, while B shows the results of the other simulations.
Statistics of χ² test and multiple comparison tests for the 14 simulations. The six criterion-pair columns give the multiple comparison Sig. (α' = 0.0083).
| Simulation | χ² test Sig. | hLRT-AIC | hLRT-BIC | hLRT-DT | AIC-BIC | AIC-DT | BIC-DT |
|---|---|---|---|---|---|---|---|
| I-1 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.669 | < 0.001 | < 0.001 |
| I-2 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.244 | 0.474 | 0.941 |
| I-3 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.243 | 0.231 | 1.000 |
| I-4 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 1.000 |
| I-5 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.025 | 0.032 | 1.000 |
| I-6 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.067 | 0.073 | 1.000 |
| I-7 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.126 | 0.121 | 1.000 |
| I-8 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.059 | 0.057 | 1.000 |
| II-1 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.564 | 0.657 | 0.942 |
| II-2 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.581 | 0.473 | 0.842 |
| III-1 | < 0.001 | 0.008 | < 0.001 | < 0.001 | 0.289 | 0.297 | 0.996 |
| IV-1 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 1.000 |
| V-1 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.012 | 0.058 | 0.128 |
| V-2 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.740 | 0.763 | 1.000 |
The 24 models were classified into four categories: base, base + I, base + Γ, and base + I + Γ in these tests.
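The adjusted threshold α' = 0.0083 is what a Bonferroni correction gives for the six pairwise comparisons among four criteria. The sketch below assumes a family-wise α of 0.05, which the text does not state explicitly, and uses hypothetical helper names. Row I-2 is read on the assumption that the first value column is the overall χ² test and the remaining six are the pairwise comparisons.

```python
from itertools import combinations

CRITERIA = ["hLRT", "AIC", "BIC", "DT"]
FAMILYWISE_ALPHA = 0.05  # assumption: not stated in the text


def criterion_pairs(criteria=CRITERIA):
    """All unordered pairs of criteria, in table order."""
    return [f"{a}-{b}" for a, b in combinations(criteria, 2)]


def bonferroni_alpha(alpha, n_tests):
    """Per-comparison threshold under a Bonferroni correction."""
    return alpha / n_tests


def p_upper_bound(cell):
    """Parse a table cell such as '0.244' or '< 0.001' into a float bound."""
    return float(cell.replace("<", "").strip())


def significant_pairs(row, alpha_adj):
    """Pairs whose reported p-value falls below the adjusted threshold."""
    return [pair for pair, cell in row.items() if p_upper_bound(cell) < alpha_adj]


pairs = criterion_pairs()
alpha_adj = bonferroni_alpha(FAMILYWISE_ALPHA, len(pairs))

# Pairwise values for simulation I-2, copied from the table.
row_i2 = dict(zip(pairs, ["< 0.001", "< 0.001", "< 0.001", "0.244", "0.474", "0.941"]))

print(pairs)
print(round(alpha_adj, 4))
print(significant_pairs(row_i2, alpha_adj))
```

Read this way, in simulation I-2 only the three comparisons involving hLRT fall below the adjusted threshold, whereas AIC, BIC, and DT do not differ significantly from one another.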
Figure 5. General percentages of four model categories recovered. The four stacked-bar charts illustrate the percentages of base, base + I, base + Γ, and base + I + Γ models among all models recovered by each criterion in each simulation. For the sake of clarity, the numbers along the x-axis represent the simulations in the order I-1, I-2, I-3, I-4, I-5, I-6, I-7, I-8, II-1, II-2, III-1, IV-1, V-1, and V-2 from left to right.
Figure 6. Counts of models recovered, classified by the number of free parameters. In these charts, the x-axis represents the number of free parameters in the model. The y-axis represents the means and standard deviations of the counts for each of the 11 model categories across the 14 simulations. The markers denote the means, and the lengths of the error bars denote the standard deviations.