Yu Fan, Rui Wu, Ming-Hui Chen, Lynn Kuo, Paul O Lewis.
Abstract
Bayesian phylogenetic analyses often depend on Bayes factors (BFs) to determine the optimal way to partition the data. The marginal likelihoods used to compute BFs, in turn, are most commonly estimated using the harmonic mean (HM) method, which has been shown to be inaccurate. We describe a new, more accurate method for estimating the marginal likelihood of a model and compare it with the HM method on both simulated and empirical data. The new method generalizes our previously described stepping-stone (SS) approach by making use of a reference distribution parameterized using samples from the posterior distribution. This avoids one challenging aspect of the original SS method, namely the need to sample from distributions that are close (in the Kullback-Leibler sense) to the prior. We specifically address the choice of partition models and find that using the HM method can lead to a strong preference for an overpartitioned model. Using simulated data, we show that the generalized SS method is strikingly more precise than both the HM method and the original SS method (that is, it gives repeatable BF values for the same data and partition model) and yields BF values that are much more reasonable than those produced by the HM method. Comparisons of the HM and generalized SS methods on an empirical data set demonstrate that the generalized SS method tends to choose simpler partition schemes that are more in line with expectation based on inferred patterns of molecular evolution. The generalized SS method shares with thermodynamic integration the need to sample from a series of distributions in addition to the posterior. Such dedicated path-based Markov chain Monte Carlo analyses appear to be a cost of estimating marginal likelihoods accurately.
Year: 2010 PMID: 20801907 PMCID: PMC3002242 DOI: 10.1093/molbev/msq224
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
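The contrast between the HM estimator and the generalized SS estimator described in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a one-parameter normal model with a conjugate normal prior (so the exact marginal likelihood is known), draws exact samples in place of MCMC, uses evenly spaced power values `betas`, and fits a normal reference distribution to the posterior samples. All names in it are hypothetical.

```python
import math
import random
import statistics

def log_mean_exp(xs):
    """Numerically stable log of the mean of exp(x) over xs."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs) / len(xs))

def normal_logpdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

rng = random.Random(1)

# Toy data (assumption): y_i ~ Normal(mu, 1), vague prior mu ~ Normal(0, 100).
n, prior_var = 20, 100.0
y = [rng.gauss(1.0, 1.0) for _ in range(n)]

def log_lik(mu):
    return sum(normal_logpdf(yi, mu, 1.0) for yi in y)

def log_prior(mu):
    return normal_logpdf(mu, 0.0, prior_var)

# Conjugacy gives the exact posterior, hence the exact log marginal likelihood:
# log m = log L(mu) + log prior(mu) - log posterior(mu), valid for any mu.
post_var = 1.0 / (n + 1.0 / prior_var)
post_mean = post_var * sum(y)
exact = log_lik(0.0) + log_prior(0.0) - normal_logpdf(0.0, post_mean, post_var)

# Harmonic mean estimator: 1 / mean(1 / L) over posterior samples.
post = [rng.gauss(post_mean, math.sqrt(post_var)) for _ in range(2000)]
log_hm = -log_mean_exp([-log_lik(mu) for mu in post])

# Generalized SS: the reference distribution g is a normal fitted to the
# posterior samples, and the path runs from g (beta = 0) to the posterior.
ref_mean = statistics.fmean(post)
ref_var = statistics.variance(post)

def log_ratio(mu):
    """log of L(mu) * prior(mu) / g(mu)."""
    return log_lik(mu) + log_prior(mu) - normal_logpdf(mu, ref_mean, ref_var)

K = 8
betas = [k / K for k in range(K + 1)]
log_gss = 0.0
for b_prev, b_next in zip(betas[:-1], betas[1:]):
    # (L * prior)^b * g^(1 - b) is itself normal here, so sample it exactly.
    prec = b_prev / post_var + (1.0 - b_prev) / ref_var
    mean = (b_prev * post_mean / post_var + (1.0 - b_prev) * ref_mean / ref_var) / prec
    draws = [rng.gauss(mean, math.sqrt(1.0 / prec)) for _ in range(500)]
    log_gss += log_mean_exp([(b_next - b_prev) * log_ratio(mu) for mu in draws])

print("exact log marginal likelihood:", round(exact, 3))
print("harmonic mean estimate:       ", round(log_hm, 3))
print("generalized SS estimate:      ", round(log_gss, 3))
```

Because the fitted reference distribution is close to the posterior, the ratio inside each stepping-stone term is nearly constant, which is why the generalized SS estimate has low variance in this toy setting. The HM estimate, by contrast, is largely insensitive to the vague prior and tends to overestimate the marginal likelihood.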
Figure: Plots relating the number of sites to twice the natural logarithm of the BF (2log(BF)) in favor of the partitioned model (with two equal-size subsets) over the unpartitioned model for 200 data sets simulated under a diversity of unpartitioned GTR + G models (see text for details). (a) Left: 2log(BF) estimated using the HM method. (b) Middle: 2log(BF) estimated using the original SS method. (c) Right: 2log(BF) estimated using the generalized SS method.
Figure: Scatterplots showing twice the natural logarithm of the BF (2log(BF)) estimated using two independent analyses started with different pseudorandom number seeds. (a) Left: 2log(BF) estimated using the HM method. (b) Middle: 2log(BF) estimated using the original SS method. (c) Right: 2log(BF) estimated using the generalized SS method.
Mean Log Marginal Likelihoods and Standard Deviations Based on 20 Independent Replicates from Analysis of the Four-gene New Zealand Cicada Data Set
| Partition Model | HM | Generalized SS |
|---|---|---|
| Unpartitioned | −10246.78 (1.60) | −10336.83 (0.19) |
| By gene | −10215.31 (2.51) | −10361.76 (0.78) |
| By codon | −9692.18 (3.05) | −9823.35 (0.82) |
| By gene and codon | −9634.64 (3.52) | −9875.39 (0.31) |
Figure: Results of applying the HM and generalized SS methods to the empirical New Zealand cicada data set for four different partitioning schemes: unpartitioned (None), partitioned by gene (Gene, 4 subsets), partitioned by codon (Codon, 3 subsets), and partitioned by both gene and codon (Both, 12 subsets). Error bars represent standard deviations based on 20 independent replicates. The dotted line connects mean log marginal likelihoods estimated using the HM method, and the solid line connects mean log marginal likelihoods estimated using the generalized SS method.
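The mean log marginal likelihoods reported for the cicada data set can be converted into 2log(BF) values to see which partition scheme each method prefers. The short sketch below uses only the tabulated means (standard deviations are ignored); the function and variable names are hypothetical.

```python
# Mean log marginal likelihoods for the cicada data set: (HM, generalized SS).
log_ml = {
    "Unpartitioned": (-10246.78, -10336.83),
    "By gene": (-10215.31, -10361.76),
    "By codon": (-9692.18, -9823.35),
    "By gene and codon": (-9634.64, -9875.39),
}
HM, GSS = 0, 1  # column indices into the tuples above

def two_log_bf(model_a, model_b, method):
    """2*log(BF) in favor of model_a over model_b for the given method."""
    return 2.0 * (log_ml[model_a][method] - log_ml[model_b][method])

def best_model(method):
    """Partition scheme with the highest mean log marginal likelihood."""
    return max(log_ml, key=lambda name: log_ml[name][method])

for label, method in (("HM", HM), ("Generalized SS", GSS)):
    bf = two_log_bf("By gene and codon", "By codon", method)
    print(f"{label}: best = {best_model(method)}, "
          f"2log(BF) for 'By gene and codon' over 'By codon' = {bf:.2f}")
```

On these means, the HM method ranks the 12-subset scheme highest, whereas the generalized SS method ranks the 3-subset by-codon scheme highest, which matches the pattern of overpartitioning described in the abstract.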
Tree Length, Subset Relative Rates (m1 and m2), and Proportion of Invariable Sites Parameter Values (pinvar,1 and pinvar,2) for Two Subsets (subscripts 1 and 2). In Total, 549 (11%) of the 5,000 Sites in Subset 1, and 99 (99%) of the 100 Sites in Subset 2 Were Variable
| Model | HM | Generalized SS | Tree Length | m1 | m2 | pinvar,1 | pinvar,2 |
|---|---|---|---|---|---|---|---|
| JC + I\|JC + I | −13788.91 | −14025.85 | 0.15 | 1.00 | 1.00 | 0.27 | 0.96 |
| JC + M\|JC + M | −13442.35 | −13642.55 | 0.24 | 0.52 | 25.03 | 0.00 | 0.00 |
| JC + I + M\|JC + I + M | −13433.07 | −13646.04 | 0.24 | 0.52 | 24.90 | 0.22 | 0.01 |
| True | n/a | n/a | 0.22 | 0.51 | 25.50 | 0.00 | 0.00 |
NOTE.—I, invariable sites model; M, subset relative rates model; JC, Jukes–Cantor model.
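As a consistency check on the subset relative rates in the table above, the sketch below computes their site-weighted mean using the subset sizes from the table caption (5,000 and 100 sites). It assumes, as is common for partition models but not stated in this excerpt, that relative rates are scaled so that this weighted mean equals 1. The names are hypothetical.

```python
# Subset sizes from the table caption: 5,000 sites in subset 1, 100 in subset 2.
SUBSET_SIZES = (5000, 100)

# Subset relative rates (m1, m2) copied from the table rows.
relative_rates = {
    "JC + M|JC + M": (0.52, 25.03),
    "JC + I + M|JC + I + M": (0.52, 24.90),
    "True": (0.51, 25.50),
}

def weighted_mean_rate(rates, sizes=SUBSET_SIZES):
    """Site-weighted mean of the subset relative rates."""
    return sum(r * s for r, s in zip(rates, sizes)) / sum(sizes)

for name, rates in relative_rates.items():
    # Assumption: a value near 1 indicates the usual mean-one normalization.
    print(f"{name}: weighted mean relative rate = {weighted_mean_rate(rates):.4f}")
```

The true values average to 1 under this weighting, and the estimated values come close, which is consistent with the assumed normalization.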