Maria Anisimova, Manuel Gil, Jean-François Dufayard, Christophe Dessimoz, Olivier Gascuel.
Abstract
Phylogenetic inference and the evaluation of support for inferred relationships are at the core of many studies testing evolutionary hypotheses. Despite the popularity of nonparametric bootstrap frequencies and Bayesian posterior probabilities, the interpretation of these measures of tree branch support remains a source of discussion. Furthermore, both methods are computationally expensive and become prohibitive for large data sets. Recent fast approximate likelihood-based measures of branch support (approximate likelihood ratio test [aLRT] and Shimodaira-Hasegawa [SH]-aLRT) provide a compelling alternative to these slower conventional methods, offering not only speed advantages but also excellent levels of accuracy and power. Here we propose an additional method: a Bayesian-like transformation of aLRT (aBayes). Considering both probabilistic and frequentist frameworks, we compare the performance of the three fast likelihood-based methods with the standard bootstrap (SBS), the Bayesian approach, and the recently introduced rapid bootstrap (RBS). Our simulations and real data analyses show that with moderate model violations, all tests are sufficiently accurate, but aLRT and aBayes offer the highest statistical power and are very fast. With severe model violations, aLRT, aBayes, and Bayesian posteriors can produce elevated false-positive rates. With data sets for which such violation can be detected, we recommend using SH-aLRT, the nonparametric version of aLRT based on a procedure similar to the Shimodaira-Hasegawa tree selection. In general, the SBS seems to be excessively conservative and is much slower than our approximate likelihood-based methods.
Year: 2011 PMID: 21540409 PMCID: PMC3158332 DOI: 10.1093/sysbio/syr041
Source DB: PubMed Journal: Syst Biol ISSN: 1063-5157 Impact factor: 15.683
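The abstract describes aBayes as a Bayesian-like transformation of aLRT but gives no formula. The following is a minimal sketch, assuming (this is not stated in the record above) that the support of a branch is the posterior probability of the inferred configuration among the three nearest-neighbor-interchange arrangements around that branch, with a flat prior over the three. The function name and its parameters are hypothetical and do not correspond to any PhyML interface.

```python
import math


def abayes_support(lnl_inferred, lnl_alt1, lnl_alt2):
    """Posterior of the inferred configuration under a flat prior.

    Assumption: each argument is the maximized log-likelihood of one of
    the three arrangements around a single internal branch. With a flat
    prior the prior terms cancel, leaving a ratio of likelihoods.
    """
    log_likelihoods = (lnl_inferred, lnl_alt1, lnl_alt2)
    # Subtract the maximum before exponentiating to avoid underflow,
    # since log-likelihoods of alignments are large negative numbers.
    shift = max(log_likelihoods)
    weights = [math.exp(value - shift) for value in log_likelihoods]
    return weights[0] / sum(weights)


if __name__ == "__main__":
    # Three equally likely arrangements give no preference at all.
    print(abayes_support(-1000.0, -1000.0, -1000.0))
    # An inferred arrangement that is clearly better gets high support.
    print(abayes_support(-1000.0, -1003.0, -1005.0))
```

Under this assumption the support depends only on differences between the three log-likelihoods, which is why the measure is cheap to compute once the alternative arrangements have been evaluated.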
Table 1. Simulated and real data sets used for the comparison of branch support methods
| Data set description | Data type | No. taxa | Sequence length | Gaps or missing (%) | Phylogenetic signal |
| Simulated data | | | | | |
| (A) 1000 replicates from | DNA | 100 | 600 | None | Distribution with mean = 0.26 |
| (B) 1500 replicates from ( | DNA | 12 | 1000 | None | Distribution with mean = 0.48 |
| (C) 1000 replicates simulated on large random star trees; simulation model HKY + Γ | DNA | 100 | 600 | None | 0 |
| (D) 1000 replicates simulated on random star trees; simulation model HKY + Γ | DNA | 12 | 1000 | None | 0 |
| Real data from | | | | | |
| (1) Orchids, nuclear ribosomal | DNA | 23 | 682 | 7.93 | 0.26 |
| (2) Mammals, nuclear protein-coding | DNA | 13 | 1161 | 0.23 | 0.50 |
| (3) Insects 1, nuclear protein-coding EF1α | DNA | 14 | 2033 | 6.15 | 0.18 |
| (4) Insects 2, mitochondrial (12 | DNA | 14 | 2249 | 0.06 | 0.17 |
| (5) Sharks 1, mitochondrial (12 | DNA | 23 | 1880 | 3.24 | 0.19 |
| (6) Sharks 2, mitochondrial (12 | DNA | 21 | 1963 | 2.58 | 0.24 |
| (7) Snakes, mitochondrial (12 | DNA | 23 | 1545 | 6.46 | 0.32 |
| (8) 3 domains of life, | AA | 15 | 258 | 19.10 | 0.56 |
| Real Metazoan proteins from | | | | | |
| (9) | AA | 49 | 190 | 3.71 | 0.29 |
| (10) | AA | 47 | 249 | 5.07 | 0.27 |
| (11) | AA | 39 | 598 | 24.57 | 0.32 |
| (12) | AA | 26 | 145 | 15.12 | 0.31 |
| (13) | AA | 25 | 215 | 16.71 | 0.33 |
| (14) | AA | 27 | 713 | 34.78 | 0.34 |
| (15) | AA | 33 | 1145 | 31.94 | 0.31 |
| (16) | AA | 26 | 414 | 13.41 | 0.30 |
| Real data from test set used in | | | | | |
| (17) Protein-coding | DNA | 500 | 1398 | 2.25 | 0.28 |
The phylogenetic signal is the proportion of the total tree length that is taken up by internal branches (Phillips et al. 2001).
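The definition in the note above can be written as a short calculation. This is a minimal sketch: the function name and the input format (a flat list of branch lengths, each flagged as internal or terminal) are assumptions made for illustration, not part of the original study.

```python
def phylogenetic_signal(branches):
    """Fraction of total tree length carried by internal branches.

    `branches` is an iterable of (length, is_internal) pairs; this input
    format is a hypothetical convenience, not a standard tree format.
    """
    total_length = 0.0
    internal_length = 0.0
    for length, is_internal in branches:
        if length < 0:
            raise ValueError("branch lengths must be non-negative")
        total_length += length
        if is_internal:
            internal_length += length
    if total_length == 0:
        raise ValueError("tree has zero total length")
    return internal_length / total_length


if __name__ == "__main__":
    # Four terminal branches of length 0.25 and one internal branch of 1.0.
    quartet = [(0.25, False)] * 4 + [(1.0, True)]
    print(phylogenetic_signal(quartet))  # 0.5

    # A star tree has no internal branches, so its signal is zero.
    star = [(0.1, False)] * 12
    print(phylogenetic_signal(star))  # 0.0
```

The star-tree case matches simulated data sets (C) and (D) in the table, whose phylogenetic signal is listed as 0.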
FP error rate (continuous lines) and power (dotted lines) of branch support methods. Data are simulated with 100 taxa, 600 nucleotides under the covarion model and analyzed using incorrect models: (a) HKY + Γ and (b) JC + Γ.
FP error rate (FP rate) and power of branch support methods for simulated data set (A in Table 1) for a threshold of 0.95
| Analysis model | Support method | FP rate (%) | Power (%) |
| HKY + Γ | | | |
| | aLRT | 6.3 | 79 |
| | aBayes | 2.8 | 71 |
| | SH-aLRT | 0.2 | 36 |
| | SBS | 0.3 | 48 |
| | RBS | 0.2 | 35 |
| JC + Γ | | | |
| | aLRT | 10 | 81 |
| | aBayes | 5 | 74 |
| | SH-aLRT | 0.3 | 38 |
| | SBS | 0.3 | 35 |
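Quantities like those in the table above can be tallied from per-branch supports once each inferred branch is known to be correct or incorrect, as it is in simulations. The sketch below rests on assumptions that this record does not confirm: FP rate is taken as the share of incorrectly inferred branches whose support reaches the threshold, power as the share of correctly inferred branches that do, and "reaches" is read as greater than or equal to. All names are hypothetical.

```python
def fp_rate_and_power(supports, is_correct, threshold=0.95):
    """Return (fp_rate, power) as fractions in [0, 1].

    Assumed definitions (not taken from the source):
      fp_rate = significant incorrect branches / all incorrect branches
      power   = significant correct branches / all correct branches
    A branch counts as significant when support >= threshold.
    """
    if len(supports) != len(is_correct):
        raise ValueError("supports and is_correct must have equal length")
    n_correct = n_incorrect = 0
    sig_correct = sig_incorrect = 0
    for support, correct in zip(supports, is_correct):
        significant = support >= threshold
        if correct:
            n_correct += 1
            sig_correct += significant
        else:
            n_incorrect += 1
            sig_incorrect += significant
    # NaN signals that a rate is undefined because its denominator is empty.
    fp_rate = sig_incorrect / n_incorrect if n_incorrect else float("nan")
    power = sig_correct / n_correct if n_correct else float("nan")
    return fp_rate, power


if __name__ == "__main__":
    supports = [0.99, 0.50, 0.97, 0.96, 0.20]
    is_correct = [True, True, False, True, False]
    fp_rate, power = fp_rate_and_power(supports, is_correct)
    print(f"FP rate: {100 * fp_rate:.1f}%  Power: {100 * power:.1f}%")
```

Multiplying the returned fractions by 100 gives percentages comparable in form to the table columns; the toy inputs here are illustrative and are not data from the study.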
Probabilistic interpretation is rarely achieved. Inferred average support of a clade is plotted against the true probability under the true (HKY + Γ) and the incorrect (JC + Γ) models.
Bayesian PP compared with aBayes supports, and their distributions in real data: (a) DNA data 1–8 from Table 1, analyzed assuming HKY + Γ; (b) AA data 9–16 from Table 1, analyzed assuming WAG + Γ.
Bayesian PP compared with aBayes supports and their distributions in simulations: (a) for correctly inferred branches under HKY + Γ; (b) for incorrectly inferred branches under HKY + Γ; (c) for correctly inferred branches under JC + Γ; (d) for incorrectly inferred branches under JC + Γ.
Comparison of branch support measures on the nsf2-F gene: (a) Metazoan phylogeny reconstructed for the nsf2-F gene with ML using PHYML; (b) estimated branch supports corresponding to reconstructed branches; and (c) the hypothesized species tree (Guindon and Gascuel 2003; Lartillot et al. 2007).