
A biologist's guide to Bayesian phylogenetic analysis.

Fabrícia F Nascimento^1,2, Mario Dos Reis³, Ziheng Yang⁴.

Abstract

Bayesian methods have become very popular in molecular phylogenetics due to the availability of user-friendly software implementing sophisticated models of evolution. However, Bayesian phylogenetic models are complex, and analyses are often carried out using default settings, which may not be appropriate. Here, we summarize the major features of Bayesian phylogenetic inference and discuss Bayesian computation using Markov chain Monte Carlo (MCMC), the diagnosis of an MCMC run, and ways of summarising the MCMC sample. We discuss the specification of the prior, the choice of the substitution model, and partitioning of the data. Finally, we provide a list of common Bayesian phylogenetic software and provide recommendations as to their use.

Year: 2017 PMID: 28983516 PMCID: PMC5624502 DOI: 10.1038/s41559-017-0280-x

Source DB: PubMed Journal: Nat Ecol Evol ISSN: 2397-334X Impact factor: 15.460

Introduction

Bayesian phylogenetic methods were introduced in the 1990s1,2 and have since revolutionised the way we analyse genomic sequence data3. Examples of such analyses include phylogeographic analysis of virus spread in humans4–7, inference of phylogeographic history and migration between species8–10, analysis of species diversification rates11,12, divergence time estimation13–15, and inference of phylogenetic relationships among species or populations13,16–20. The popularity of Bayesian methods appears to be due to two factors: (1) the development of powerful models of data analysis; and (2) the availability of user-friendly computer programs implementing the models (Table 1).

Table 1

List of Bayesian programs

Program	Brief description	Refs
BEAST	Implements a vast number of models. Examples are simultaneous estimation of the tree topology and divergence times, phylodynamics, phylogeography, and species tree estimation under the multispecies coalescent model.	86
MrBayes	Implements a large number of models for analysis of nucleotide, amino acid, and morphological data. Estimates species phylogenies and species divergence times.	87
RevBayes	Similar to MrBayes, but with its own programming language to set up complex hierarchical Bayesian models.	88
MCMCTree	Estimates divergence times on a fixed phylogenetic tree.	89
Phycas	Estimates phylogenetic trees based on nucleotide data. This allows for multifurcating trees, helping to reduce spuriously high posterior probabilities for phylogenies.	90,91
PhyloBayes	Reconstructs phylogenetic trees using infinite mixture models to account for among-site and among-lineage heterogeneity in nucleotide or amino acid compositions, which may be important for inferring deep phylogenies.	92
BPP	Implements species tree estimation and species delimitation under the multi-species coalescent model using multi-loci genomic sequence data.	56
Migrate	Estimates population sizes and migration rates under the population-subdivision model based on molecular data.	93
IMa2	Estimates divergence times, population sizes and migration rates under the isolation-with-migration model using multi-loci DNA sequence data and a fixed phylogenetic tree for populations.	94
Structure	Estimates population structure from multi-locus genotype data.	95
BAMM	Estimates clade diversification rates on phylogenies.	96
Tracer	A program for MCMC diagnostics and summaries.	81
AWTY	A package for MCMC diagnostics for Bayesian phylogenetic inference.	97

Models implemented in Bayesian software programs are becoming increasingly complicated, and the priors and model assumptions made in those programs are not always clear to the user. Analyses are often conducted using default priors, which may not be appropriate and may lead to biased or incorrect results. Likewise, over-simplified likelihood models may produce biased results, while over-complicated models may lead to loss of power as well as inefficient computation. The workhorse underlying all modern Bayesian phylogenetic programs is the Markov chain Monte Carlo (MCMC) or Metropolis-Hastings algorithm21,22. However, MCMC is both art and science, and a basic understanding of its workings is essential for the correct use of those programs. In this review, we explain the basic concepts of Bayesian statistics and discuss the major features of MCMC algorithms, such as the prior and the likelihood, MCMC proposals, diagnosis of MCMC convergence and mixing, and summary of the posterior sample. Our intended reader is the empirical biologist who needs to use Bayesian phylogenetic programs to analyse their data. We lay out and answer a set of questions important for setting up a Bayesian analysis. We focus on Bayesian estimation of phylogenetic trees. However, the basic concepts discussed here apply to other phylogenetic problems as well, such as divergence time estimation or species tree estimation under the multi-species coalescent model. Extensive reviews of these are available elsewhere23–25.

What is the Bayesian method?

The Bayesian method is a statistical inference methodology. Its main feature is the use of probability distributions to describe the uncertainty of all unknowns including the model parameter(s). Let D be the observed data and θ the unknown parameter. We assign a distribution f(θ), called the prior distribution, based on our knowledge about θ before analysis of the data. After the data are observed, we use Bayes’s theorem to calculate the posterior distribution of θ given the data:

f(θ|D) = f(θ) f(D|θ) / z,   (1)

where the probability of the data given the parameter, f(D|θ), is called the likelihood. This summarises the information about θ in the data. The normalising constant z = ∫ f(θ)f(D|θ)dθ ensures that f(θ|D) integrates to 1 and is a proper statistical distribution. Equation (1) indicates that the posterior is proportional to the prior times the likelihood; that is, the posterior combines the information in the prior and in the data. An example of the prior, likelihood and posterior for a two-parameter phylogenetic example is given in Figure 1.
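For a single parameter, the normalising constant can be approximated numerically on a grid, which makes the roles of prior, likelihood and z concrete. The sketch below is only an illustration, not the authors' code: it assumes the one-parameter JC69 model (rather than the two-parameter model of Figure 1), pools the 84 transitional and 6 transversional differences at 948 sites quoted later in this article, and uses a gamma G(2, 20) prior on the distance d. The function names and grid limits are hypothetical choices.

```python
import math

N_SITES = 948    # alignment length quoted in the text
N_DIFF = 84 + 6  # transitions + transversions, pooled because JC69 treats them alike

def log_prior(d, shape=2.0, rate=20.0):
    """Log density of a gamma prior (shape, rate) on the distance d."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(d) - rate * d)

def log_likelihood(d):
    """JC69 log likelihood up to an additive constant.

    Under JC69 a site differs between the two sequences with
    probability p = 3/4 - 3/4 * exp(-4d/3).
    """
    p = 0.75 - 0.75 * math.exp(-4.0 * d / 3.0)
    return N_DIFF * math.log(p) + (N_SITES - N_DIFF) * math.log(1.0 - p)

def grid_posterior(lo=0.001, hi=0.5, n_points=5000):
    """Approximate the posterior density of d on a grid of midpoints."""
    h = (hi - lo) / n_points
    grid = [lo + (i + 0.5) * h for i in range(n_points)]
    log_unnorm = [log_prior(d) + log_likelihood(d) for d in grid]
    peak = max(log_unnorm)                        # subtract the peak to avoid underflow
    unnorm = [math.exp(v - peak) for v in log_unnorm]
    area = sum(unnorm) * h                        # numerical integral of prior x likelihood
    density = [u / area for u in unnorm]          # posterior = prior x likelihood / z
    log_z = peak + math.log(area)                 # log z, up to the constant dropped above
    return grid, density, h, log_z

if __name__ == "__main__":
    grid, density, h, log_z = grid_posterior()
    post_mean = sum(d * f for d, f in zip(grid, density)) * h
    print("posterior mean of d:", round(post_mean, 4))
```

Grid integration of this kind is feasible only for one or two parameters; with many parameters the integral for z becomes intractable, which is the motivation for MCMC.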

Figure 1

Prior, likelihood and posterior distribution for a two-parameter phylogenetic example.

The data of the 12S rRNA mitochondrial genes from human and orang-utan are used to estimate the evolutionary distance (d) and the transition/transversion rate ratio (κ) under the K80 model75.

In the above we assume that the model for generating the data is known. In the so-called trans-model inference, we have several competing models, with each model m having its own parameters θ. Then a prior, f(m, θ) = f(m) f(θ|m), is assigned to both the model (m) and its parameters (θ), and the posterior of the model and parameter is similarly given by Bayes’s theorem: f(m, θ|D) ∝ f(m, θ) f(D|m, θ). In phylogenetics, the tree topology and the substitution model together specify the statistical model for the data. Different tree topologies thus correspond to different models, while the branch lengths or divergence times as well as the substitution parameters (such as the transition/transversion rate ratio) are parameters in the model. The data are usually a molecular sequence alignment or an alignment of morphological characters (or a combination of both). An appealing property of Bayesian inference is that it makes direct probabilistic statements about the model or unknown parameter. The posterior probability of a model, f(m|D), is the probability that the model is correct, given the data. The 95% credibility interval (CI) of a parameter covers the true parameter with probability 0.95, given the data. Such statements are impossible using confidence intervals and p-values in classical statistics, which treat parameters as unknown constants26.

What type of data can I use?

The most common type of data used in phylogenetic analyses is DNA and amino acid sequence alignments. Morphological characters can also be used27. Here, we focus on DNA sequences. The sequences must be aligned before they are used as input data in phylogenetic programs, and alignment accuracy is important in phylogenetic analysis. Much effort has been made to develop models of insertions and deletions28–30. For species phylogeny estimation, the sequences must be orthologs, as incorrect use of paralogs may lead to incorrect phylogenies. Several methods are now available to infer paralogy/orthology31,32.

How do I select a substitution model for my data?

A number of models have been developed to describe nucleotide or amino acid substitutions26,33,34. For nucleotide sequences, these range from the simple JC69 (for Jukes and Cantor)35 to the complex GTR (for General Time Reversible)36–38, and the unrestricted model (UNREST)37. In JC69 all nucleotide changes occur at the same rate, while in GTR or UNREST substitutions occur at different rates depending on the source and target nucleotides. It is also common to assume a gamma model of variable rates across sites, in particular, in analysis of coding DNA or protein sequences39–41. Programs such as jModelTest42, Modelgenerator43 or PartitionFinder44 are commonly used to choose a substitution model. Those programs examine the goodness of fit of the model to the data but do not consider the robustness of the analysis to model assumptions. For example, it is well known that the transition/transversion bias typically has a greater impact on the fit of the model to data (judged by the improvement in likelihood), but less effect on estimation of the tree topology and branch lengths than rate variation among sites41. Although there does not seem to be serious harm in mechanical use of those programs, it may be unnecessary to do so in many cases. As a rule of thumb, different substitution models tend to give very similar sequence distance estimates when sequence divergence is less than 10%, so that a simple model can be used even though it may not fit the data. Complex models are necessary in reconstruction of deep phylogenies. Two of the most complex nucleotide substitution models, HKY+Γ and GTR+Γ, often produce similar estimates of phylogenetic trees and branch lengths37,45. When in doubt, note that it is more problematic to under-specify than to over-specify the model in Bayesian phylogenetics46. For discrete morphological data, the Mk model, an extension of the JC69 model to k morphological character states, can be used27.
An extension that allows for unequal rates of substitution is available in MrBayes47. A correction for assertion bias is applied in calculation of the likelihood function because only variable characters are used27. For continuous characters, diffusion process models (such as the Wiener or the Ornstein-Uhlenbeck process) can be used48. Definitions and detailed review of these models are given elsewhere49. There has been much interest in the joint analysis of morphological and molecular data to estimate divergence times for extant and fossil species50–52.
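The rule of thumb that different substitution models give similar distances at low divergence can be checked with the standard closed-form distance formulas for JC69 and K80. The sketch below is an illustration under stated assumptions: it uses the counts quoted elsewhere in this article for the human and orang-utan 12S rRNA alignment (84 transitional and 6 transversional differences at 948 sites), and the function names are hypothetical.

```python
import math

def jc69_distance(p):
    """JC69 distance from the proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def k80_distance(s, v):
    """K80 distance and kappa from the proportions of transitional (s)
    and transversional (v) differences."""
    a = math.log(1.0 - 2.0 * s - v)
    b = math.log(1.0 - 2.0 * v)
    d = -0.5 * a - 0.25 * b
    kappa = 2.0 * a / b - 1.0
    return d, kappa

n, n_s, n_v = 948, 84, 6
d_jc = jc69_distance((n_s + n_v) / n)
d_k80, kappa = k80_distance(n_s / n, n_v / n)

if __name__ == "__main__":
    print("JC69 distance:", round(d_jc, 4))
    print("K80 distance:", round(d_k80, 4), "kappa:", round(kappa, 1))
```

For these data the two distance estimates are both close to 0.10 and differ by only a few percent, even though the estimated transition/transversion rate ratio is far from the value of 1 implied by JC69.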

What is over- and under-parameterisation?

A model is non-identifiable if different values of parameters make the same predictions about the data, so that such data can never be used to estimate those parameters; in other words, the model is non-identifiable if f(D|θ1) = f(D|θ2) for certain θ1 ≠ θ2 and for all possible data D [53]. A simple phylogenetic example is estimation of the geological time of divergence between two species (t) and the molecular evolutionary rate (r) using data of a pair of aligned sequences. The likelihood depends only on the molecular distance, d = rt, and not on t and r separately, and is the same for, say, t = 1 and r = 0.1, or t = 0.1 and r = 1, or any other combination of t and r such that rt = d = 0.1. In theory, non-identifiability (or over-parameterisation) is not a serious problem for Bayesian analysis, especially if informative priors are assigned on the parameters. In practice, over-parameterisation can cause both inference difficulties (such as loss of power, strong correlations between parameters, large variance in the posterior, and extreme sensitivity to the prior and model assumptions) and computational problems (such as poor mixing of the MCMC). Sometimes, a model is identifiable, but the data contain only weak information about the parameters with the likelihood surface being nearly flat. Then similar symptoms will show up in the data analysis. An example is the popular I+G model of rate variation among sites, which assumes a proportion of sites p0 in the alignment are invariable with rate 0, while the other sites (1 – p0) evolve according to a discrete gamma distribution54. Because the gamma distribution allows for extremely conserved sites with rates close to 0, p0 and the gamma shape parameter α are strongly correlated55. The MCMC algorithm may have to spend a long time exploring a ridge on the posterior surface. 
A similar case applies to the use of the parameter-rich GTR+Γ model in analysis of highly similar sequences from closely related species, as in Bayesian species delimitation or species tree estimation under the multi-species coalescent model24,56. The GTR model has eight free parameters (five relative exchangeabilities between nucleotides and three base frequencies). If there are only a few variable sites in the alignment, there will be little information about those parameters. Simple models, such as JC69 and K80, may be adequate in such analysis. On the other hand, the use of an overly simplistic model, or under-parameterisation, can cause systematically incorrect phylogenetic trees, seriously biased estimates of branch lengths and substitution parameters, and over-confident assessment of uncertainties such as spuriously high posterior probabilities for trees or clades46. For example, ignoring variable substitution rates among sites leads to underestimated branch lengths41. Systematic errors tend to be greater when sequences are more divergent. In short, the choice of substitution model is a trade-off between bias on one hand and variance and computational expense on the other, and the model should ideally be chosen by careful consideration of its role in the analysis rather than by mechanical use of a model selection procedure.
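The non-identifiability of the divergence time t and the rate r in the two-sequence example of this section can be verified directly: any likelihood that depends on the parameters only through d = rt returns the same value for every pair (t, r) with the same product. The sketch below assumes the JC69 model and hypothetical counts (90 differences at 948 sites); the function name is illustrative.

```python
import math

def jc69_log_likelihood(t, r, n_sites=948, n_diff=90):
    """JC69 log likelihood (up to a constant) for two sequences.

    The parameters t and r enter only through the distance d = r * t.
    """
    d = r * t
    p = 0.75 - 0.75 * math.exp(-4.0 * d / 3.0)
    return n_diff * math.log(p) + (n_sites - n_diff) * math.log(1.0 - p)

# Three (t, r) pairs with the same product rt = 0.1.
pairs = [(1.0, 0.1), (0.1, 1.0), (0.5, 0.2)]
values = [jc69_log_likelihood(t, r) for t, r in pairs]

if __name__ == "__main__":
    for (t, r), lnl in zip(pairs, values):
        print("t =", t, "r =", r, "log likelihood =", lnl)
```

Because the likelihood is constant along the curve rt = d, the data alone cannot separate t from r; in a Bayesian analysis any separation comes entirely from the priors on the two parameters.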

How do I decide to concatenate or partition my data?

The rationale for partitioned analysis is that sites in the same partition have similar evolutionary characteristics while those in different partitions have different characteristics40,44,57. The characteristics here may be substitution rates, base composition, branch lengths, or even the tree topology. The Bayesian program will estimate different parameter values or even different gene tree topologies for the different partitions, thus accounting for their heterogeneity in the evolutionary process. For example, genes with different G+C compositions or evolutionary rates may be analysed as separate partitions in phylogeny reconstruction. Vertebrate mitochondrial genes coded on the same strand of the genome have similar G+C content and may be concatenated and analysed as a single partition, although the three codon positions may be treated as different partitions to account for their large differences in rate and in base compositions58. Non-coding mitochondrial genes (rRNAs and tRNAs) may be analysed as another partition. Likewise, mitochondrial and nuclear sequences should also be analysed as different partitions59. For nuclear sequences, exons and introns should be analysed as different partitions, and the three codon positions should be placed in their own partitions. Some partitioning software may suggest the use of different substitution models for partitions44 (e.g., HKY for one partition and GTR+G for another). This is unnecessary because with the same model for all partitions, different parameter values will accommodate the heterogeneity among partitions. An important issue is whether partitions should share the same tree topology. In traditional phylogenetic inference, topology is assumed to be the same across partitions. However, a number of biological processes, such as gene duplication, horizontal gene transfer, and incomplete lineage sorting can cause different genes to have different trees60,61. 
Recently, a number of methods for species tree estimation have been developed under the multi-species coalescent (MSC) model24,62,63, which account for the process of incomplete lineage sorting (the so-called deep coalescent, due to polymorphism in ancestral species, where coalescence may occur in ancient ancestors leading to gene trees that differ from the species tree). Under the MSC different genomic regions (or exons) are placed into different partitions and allowed to have their own gene-trees, which are embedded into the species tree. The mitochondrial genome does not recombine and mitochondrial genes should be treated as one partition within the MSC. In some viruses, such as influenza, different genome segments can re-assort (i.e. be horizontally transferred) among related strains64, and thus different segments can have different topologies and should be treated as different partitions.

How do I choose the prior for my Bayesian analysis?

In theory the prior should summarize the biologist’s best knowledge about the model or parameters before the data are analysed26,65. In practice, specification of the prior is often a thorny issue, especially if there are multiple parameters with complex correlations or if little is known about the parameters. While we are supposed to specify a joint prior distribution for all parameters, the common practice is to ignore the correlation, and assign independent priors for the parameters. When there are many parameters of the same kind, such independent and identically distributed (i.i.d.) priors can sometimes cause problems because they may make a strong statement about the mean or sum of those parameters. For example, it is common to assign independent exponential or uniform priors for branch lengths in the unrooted tree, but this i.i.d. prior can cause very long trees in analysis of highly similar sequence data66,67. In relaxed-clock dating analysis, the i.i.d. prior for substitution rates among different partitions makes a strong statement about the average rate over loci, leading to biased but over-confident divergence time estimates68, in particular as the number of partitions increases. Such i.i.d. priors should be avoided. Default priors in many Bayesian software packages may not be appropriate for the data being analysed and should be used with caution. Specification of the prior is the biologist’s responsibility even though it may not be an easy task. Robustness analysis should also be an important component of any Bayesian analysis. By evaluating the posteriors generated under different priors, the biologist can evaluate whether the posterior is robust to the prior. In Bayesian estimation of phylogenetic trees without the assumption of a molecular clock, it is common to assign a uniform prior on the unrooted tree topologies.
When phylogenetic analysis is conducted on rooted trees under the clock or relaxed clock models69, rooted trees are commonly assigned a prior using a model of cladogenesis such as the Yule process and the birth-death-sampling process70. Note that all those models favour balanced trees, and the impact of the prior on the posterior probabilities of the rooted trees can be substantial if the tree is large. For coalescent-based species tree estimation, the MSC model specifies a probability distribution for the rooted gene trees (topologies and node ages)71. This is part of the model rather than a prior on gene trees to be specified. In molecular clock dating analysis, fossils may be used to specify minimum and maximum bounds on clade age, which are used to construct a so-called calibration density to calibrate the age of the clade. It is also advisable to include a prior on the age of the root of the tree. For an overview of calibration densities for use in divergence dating, see ref. 72. It is also necessary to specify a prior on the evolutionary rates for the different loci or partitions. A gamma-Dirichlet prior can be used instead of the i.i.d. prior mentioned above68. In relaxed-clock models, the rates not only vary among partitions, but also drift along branches on the tree. Current Bayesian implementations assume that rates drift independently among partitions so that different partitions are independent realizations of the rate-drift process73,74. A discussion of the different rate-drift models is given in ref. 68.
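The remark in this section that i.i.d. priors make a strong statement about the sum of the parameters can be illustrated by simulating the tree length (the sum of branch lengths) implied by independent exponential priors. The sketch below is hypothetical: the prior mean of 0.1 per branch and the choice of 100 taxa are illustrative assumptions, not settings taken from any particular program. An unrooted binary tree of s taxa has 2s − 3 branches.

```python
import math
import random

def simulate_tree_length(n_branches, mean_branch, n_draws=2000, seed=1):
    """Mean and standard deviation of the tree length under
    i.i.d. exponential priors on the branch lengths."""
    rng = random.Random(seed)
    rate = 1.0 / mean_branch
    totals = [sum(rng.expovariate(rate) for _ in range(n_branches))
              for _ in range(n_draws)]
    mean = sum(totals) / n_draws
    var = sum((t - mean) ** 2 for t in totals) / (n_draws - 1)
    return mean, math.sqrt(var)

n_taxa = 100
n_branches = 2 * n_taxa - 3      # 197 branches in an unrooted binary tree
sim_mean, sim_sd = simulate_tree_length(n_branches, 0.1)

# Analytical values: the sum of n i.i.d. exponentials is gamma distributed
# with mean n * mu and standard deviation sqrt(n) * mu.
exact_mean = n_branches * 0.1
exact_sd = math.sqrt(n_branches) * 0.1

if __name__ == "__main__":
    print("simulated:", round(sim_mean, 2), round(sim_sd, 2))
    print("analytical:", round(exact_mean, 2), round(exact_sd, 2))
```

Under these assumptions the prior places the tree length near 19.7 substitutions per site with a standard deviation of only about 1.4, so a tree length of, say, 1 is effectively excluded a priori. This is the sense in which an apparently vague prior on each branch is highly informative about their sum.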

What is Markov chain Monte Carlo (MCMC)?

Once the biologist has decided on the data, model and prior, the next step is to obtain a sample from the posterior. This is done by using MCMC, a simulation technique for sampling from a probability distribution that is known up to a normalising constant21,22. Note that all terms on the right hand side of equation (1) are straightforward to calculate except the normalizing constant z, which involves multidimensional integrals and may be too expensive to compute. Thus, MCMC is particularly suitable for Bayesian computation. Instead of calculating the posterior distribution f(θ|D), the algorithm generates a sample from the posterior, which can be used to estimate the mean, the standard deviation of the posterior, or even the whole posterior distribution. Here we illustrate the major features of MCMC by applying it to the problem of estimating the sequence distance d and the transition/transversion rate ratio κ under the K80 model75 using a pair of DNA sequences. The data (D) are an alignment of the human and orangutan mitochondrial 12S rRNA genes, summarized as n_S = 84 transitional differences and n_V = 6 transversional differences at n = 948 sites26, p.7. We assign independent gamma priors, d ~ G(2, 20) and κ ~ G(2, 0.1), with densities (Fig. 1a):

f(d) = 20^2 d e^{-20d} / Γ(2) and f(κ) = 0.1^2 κ e^{-0.1κ} / Γ(2).

The likelihood (Fig. 1b) is given by the K80 model26,75 as

f(D|d, κ) = (p_0/4)^{n − n_S − n_V} × (p_1/4)^{n_S} × (p_2/4)^{n_V},

where

p_0 = 1/4 + (1/4) e^{-4d/(κ+2)} + (1/2) e^{-2d(κ+1)/(κ+2)},
p_1 = 1/4 + (1/4) e^{-4d/(κ+2)} − (1/2) e^{-2d(κ+1)/(κ+2)},
p_2 = 1/4 − (1/4) e^{-4d/(κ+2)}.

Thus, the unnormalized posterior (Fig. 1c) is

f(d, κ|D) ∝ f(d) f(κ) f(D|d, κ).

We give a sketch of an MCMC algorithm in Box 1, and then discuss its main features. We use two sliding windows (uniform distributions centred around the current parameter value) to update parameters d and κ. The sliding window (even with reflection) is a symmetrical proposal, in the sense that the probability density of proposing d* from d is equal to that of proposing d from d*. If the proposal is asymmetrical, a correction term, called the Hastings ratio22, needs to be applied. Note that the parameter values (d and κ) visited in the next iteration depend on the current values but not on values visited in the past.
The algorithm has no memory. This memoryless property is called the Markovian property. As a result, the sequence of visited parameter values forms a Markov chain, and the algorithm is called Markov chain Monte Carlo. An important feature of the algorithm is that it requires the calculation of the ratio of posterior densities, but not the posterior density itself. The normalizing constant z of equation (1) cancels in the calculation of the acceptance ratio α in steps 2a & 2b, and the algorithm thus avoids its calculation. It is easy to see that the algorithm visits parameter values with high posterior more often than those with low posterior. Indeed, in the long run it visits the parameter values exactly in proportion to their posterior. One runs the algorithm over many iterations, and then uses the visited values of d and κ to construct a histogram to estimate the posterior distribution or to calculate the mean and standard deviation of the posterior (Fig. 2).
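The algorithm described above can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation or the tutorial code: it assumes the standard K80 expressions for the probabilities of identical sites, transitions and transversions, the gamma priors G(2, 20) and G(2, 0.1) stated in the text, and sliding windows of total width w reflected at zero. The function names, starting values and seed are hypothetical.

```python
import math
import random

N, N_S, N_V = 948, 84, 6  # sites, transitional and transversional differences (from the text)

def log_gamma_pdf(x, shape, rate):
    """Log density of the gamma distribution with given shape and rate."""
    return shape * math.log(rate) - math.lgamma(shape) + (shape - 1.0) * math.log(x) - rate * x

def log_posterior(d, k):
    """Unnormalised log posterior: log priors plus the K80 log likelihood."""
    if d <= 0.0 or k <= 0.0:
        return -math.inf
    e1 = math.exp(-4.0 * d / (k + 2.0))
    e2 = math.exp(-2.0 * d * (k + 1.0) / (k + 2.0))
    p0 = 0.25 + 0.25 * e1 + 0.5 * e2   # identical site
    p1 = 0.25 + 0.25 * e1 - 0.5 * e2   # transitional difference
    p2 = 0.25 - 0.25 * e1              # one specific transversional difference
    if p1 <= 0.0 or p2 <= 0.0:         # guard against rounding at extreme values
        return -math.inf
    log_lik = ((N - N_S - N_V) * math.log(p0 / 4.0)
               + N_S * math.log(p1 / 4.0) + N_V * math.log(p2 / 4.0))
    return log_gamma_pdf(d, 2.0, 20.0) + log_gamma_pdf(k, 2.0, 0.1) + log_lik

def accept(log_ratio, rng):
    """Metropolis rule: accept with probability min(1, posterior ratio)."""
    return log_ratio >= 0.0 or rng.random() < math.exp(log_ratio)

def run_mcmc(n_iter, w_d, w_k, d=0.2, k=20.0, seed=1):
    """Update d and kappa in turn with reflected sliding windows."""
    rng = random.Random(seed)
    lp = log_posterior(d, k)
    sample, acc_d, acc_k = [], 0, 0
    for _ in range(n_iter):
        d_new = abs(d + rng.uniform(-w_d / 2.0, w_d / 2.0))   # reflect at 0
        lp_new = log_posterior(d_new, k)
        if accept(lp_new - lp, rng):                          # z cancels in the ratio
            d, lp, acc_d = d_new, lp_new, acc_d + 1
        k_new = abs(k + rng.uniform(-w_k / 2.0, w_k / 2.0))
        lp_new = log_posterior(d, k_new)
        if accept(lp_new - lp, rng):
            k, lp, acc_k = k_new, lp_new, acc_k + 1
        sample.append((d, k))
    return sample, acc_d / n_iter, acc_k / n_iter

if __name__ == "__main__":
    sample, acc_d, acc_k = run_mcmc(20000, w_d=0.12, w_k=180.0)
    kept = sample[2000:]  # discard the first 10% as burn-in
    print("mean d:", sum(s[0] for s in kept) / len(kept), "acceptance:", acc_d)
    print("mean kappa:", sum(s[1] for s in kept) / len(kept), "acceptance:", acc_k)
```

Only differences of log posteriors appear in the acceptance step, which is why the normalising constant never has to be computed. Working on the log scale is a design choice to avoid numerical underflow of the likelihood.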

Figure 2

Trace plots and histograms for parameters d and κ sampling the posterior distribution of Figure 1c using efficient and inefficient MCMC chains.

Parts a and b show the trace plots of d and κ for an efficient chain with good mixing. The window sizes are w_d = 0.12 and w_κ = 180, with acceptance proportions P = 30.4% for d and 29.8% for κ, achieving efficiency Eff = 23% for d and 20% for κ. Parts a’ and b’ show the trace plots for an inefficient chain with poor mixing, with w_d = 5 and w_κ = 1. In a’, the window for d is too wide, and most proposals are rejected (P = 1.5%), so that the chain is often stuck at the same value for many iterations, leading to poor mixing with Eff = 1.79%. In b’, the window for κ is too small, so that most of the proposals are accepted (with P = 98.6%), but the chain makes small baby steps and is very slow in traversing the posterior parameter space, with Eff = 1.28%. Parts c and c’ show histograms of κ for two runs of the efficient and inefficient chains (sample size n = 10,000). The posterior mean (and standard deviation) calculated using a very long run of the efficient chain is 0.104 (0.0114) for d, and 29.2 (10.0) for κ.

The window size (or step length) in the sliding window proposal (w_d for d and w_κ for κ) can affect the mixing efficiency of the chain (Box 2). If the window is too large, most of the proposals will fall in the tails of the posterior and be rejected. The chain then stays at the current value and does not move (Fig 2a’). If the window is too small, the chain takes tiny baby steps, almost all of which are accepted but the chain is ineffective in exploring the posterior surface (Fig 2b’). Thus, both small steps (with high acceptance proportion) and large steps (with very low acceptance proportion) lead to inefficient algorithms. The step lengths should be adjusted to achieve a near optimal acceptance proportion, at about 30-40%. Fine-tuning a phylogenetic MCMC chain to be efficient is important because MCMC runs may take weeks or months. It is easy to monitor the acceptance proportion and use it to adjust the step length automatically76. Most current MCMC phylogenetic programs have automatic fine-tuning algorithms and this is in most cases not a concern for the user. In trans-model MCMC algorithms, both the model index m and the model parameters θ change over the chain. The algorithm will involve both within-model proposals, which change parameters of the current model, and trans-model proposals, which move from the current model to another new model77. In the long run, the frequency at which the MCMC visits each model is an estimate of the posterior probability of that model. There are a number of differences between within-model and trans-model algorithms26, and here we note a few concerning mixing efficiency and acceptance proportion. First, for a within-model move (such as a sliding window changing the sequence distance or branch length), we can make the window size small enough so that the acceptance proportion is arbitrarily close to 100%. However, in trans-model moves, the acceptance proportion is constrained by the posterior model probabilities.
If the maximum a posteriori (MAP) model (the model with the highest posterior probability) has the posterior P1, then the acceptance proportion cannot exceed 2(1 – P1) [26]. Thus, if the MAP tree has posterior 99%, the highest acceptance proportion for cross-tree moves is 2%. Second, while an acceptance proportion of near 0 indicates a poor proposal (e.g., the window size is too large) for a within-model move, this may or may not indicate a mixing problem in cross-model moves because it may be caused by the MAP model having posterior near 100%. Third, for a within-model move, the optimal acceptance proportion is intermediate at 30-40%, but for a trans-model move, a mobile chain is in general more efficient than a lazy chain, so that we should strive to achieve as high an acceptance proportion as possible. All those comments apply to Bayesian phylogenetic MCMC algorithms, which include both within-tree moves that change the branch lengths and substitution parameters without changing the tree topology and cross-tree moves that change the tree topology. The cross-tree moves are typically constructed using tree-perturbation (branch-swapping) algorithms such as nearest-neighbour interchange (NNI), subtree pruning and re-grafting (SPR) and tree bisection and reconnection (TBR)26,78. About a dozen MCMC phylogenetic programs are now available (Table 1).
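The bound 2(1 – P1) on the trans-model acceptance proportion can be motivated by a short flow-balance argument. The derivation below is a sketch of one way to see it, not necessarily the argument given in ref. 26. The notation is an assumption introduced here: \pi_m is the posterior probability of model m (so \pi_1 = P_1 for the MAP model), and a_{m \to m'} is the long-run proportion of iterations in which the chain makes an accepted move from model m to a different model m'.

```latex
% Overall acceptance proportion A of trans-model moves, split by the model the chain leaves:
A = \sum_{m' \neq 1} a_{1 \to m'} + \sum_{m \neq 1} \sum_{m' \neq m} a_{m \to m'} .

% The chain can leave a non-MAP model only while it is in one, so
\sum_{m \neq 1} \sum_{m' \neq m} a_{m \to m'} \le \sum_{m \neq 1} \pi_m = 1 - P_1 .

% At stationarity the flow out of model 1 equals the flow into model 1:
\sum_{m' \neq 1} a_{1 \to m'} = \sum_{m \neq 1} a_{m \to 1} \le 1 - P_1 .

% Adding the two bounds:
A \le 2 (1 - P_1), \qquad \text{e.g. } P_1 = 0.99 \;\Rightarrow\; A \le 0.02 .
```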

What are convergence, burn-in and mixing of the MCMC?

An MCMC algorithm may suffer from two problems: slow convergence and poor mixing. In the long run, the Markov chain should be spending most of the time visiting high-probability regions of the posterior. The convergence rate is the rate at which a chain starting from any initial position (which may be in the tails of the posterior) moves to the high-posterior region of the parameter space79. Parameter values sampled before reaching this stationary phase are usually discarded as the burn-in. Thus, if convergence is slow, a long burn-in will be necessary. Convergence rate is affected by the proposals used and by the shape of the posterior in the tails67. If the posterior is nearly flat in the tail, it will be difficult for the chain to get out of the tail and move to the high-posterior region. Mixing efficiency refers to how efficiently the chain traverses the posterior after it has reached the stationary distribution. If the chain is more efficient, the estimate based on the MCMC sample will have a smaller variance, the results will show less variation among independent runs (Box 2), and a relatively short chain will provide an acceptable estimate. The proposal (such as the uniform sliding window vs. the normal-distribution sliding window) as well as the step length for the same proposal (such as the width of the sliding window) can have a great effect on mixing efficiency76. Both convergence and mixing problems can be diagnosed by using a trace plot, in which we plot the log likelihood or sampled parameter values against the MCMC iteration, for example, using R80 or Tracer81. It is also very important to run the same algorithm multiple times to check consistency between runs. With fast convergence, different chains that started from very different positions become indistinguishable very quickly. Efficient mixing is indicated by different runs generating nearly identical means, standard deviations, and histograms.
If the runs are healthy, samples from different runs can be combined to produce posterior summaries. The trace plots of Figures 2a and 2b are from an efficient chain with good mixing, while those of Figures 2a' and 2b' have poor mixing and low efficiency. The histograms from the efficient algorithm match each other much better than those from the inefficient algorithm (Fig. 2c and 2c'). In theory, the consistency among multiple runs could be because all runs got stuck in a region of the parameter space, giving the false impression that convergence was reached. This may happen when there are multiple peaks in the posterior. Thus, it is important to initiate the runs from widely dispersed starting points.

How many iterations should I run my chain for? How many samples should I take?

Ideally one would like to run the MCMC long enough to obtain a reliable estimate of the posterior distribution, but not so long as to waste computational resources. However, reliable automatic stopping rules do not currently exist. As a result, the user has to specify the number of iterations, and then decide whether the chain is long enough or additional iterations are necessary using certain diagnosis tools. MCMC algorithms tend to generate huge output files. To save disk space, one takes a sample only for every certain number of iterations. For example, running an MCMC chain for 10^7 iterations and using a sample frequency of 10^3 iterations will produce 10^4 samples. Note that in some programs (such as MCMCtree and BPP), each MCMC iteration consists of a fixed sequence of MCMC proposals, while in some others (such as MrBayes and BEAST), it consists of one proposal, chosen at random from a collection of proposals. Thus, if there are 1,000 parameters in the model and if each proposal changes one parameter, each MCMC iteration in the former programs is worth about 1,000 iterations in the latter programs. Thus, MCMC iterations from different programs are not comparable. The biologist should instead aim to accumulate a reasonable (as large as practically possible) effective sample size (ESS) for each parameter (Box 2).
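The effective sample size mentioned above can be estimated from the autocorrelation of the sampled values. The sketch below uses one common estimator, ESS = n / (1 + 2 Σ ρ_k), with the sum over lags truncated at the first non-positive autocorrelation. This truncation rule is an assumption made here for simplicity and may differ from what programs such as Tracer do. The test chains (independent draws and a first-order autoregressive series) are synthetic stand-ins for MCMC output, and all function names are hypothetical.

```python
import random

def effective_sample_size(x, max_lag=1000):
    """Estimate ESS = n / (1 + 2 * sum of autocorrelations).

    The sum is truncated at the first lag whose estimated
    autocorrelation is not positive.
    """
    n = len(x)
    mean = sum(x) / n
    centred = [v - mean for v in x]
    var = sum(c * c for c in centred) / n
    tau = 1.0
    for lag in range(1, min(max_lag, n - 1) + 1):
        cov = sum(centred[i] * centred[i + lag] for i in range(n - lag)) / n
        rho = cov / var
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return n / tau

def ar1_chain(n, phi, seed=1):
    """Synthetic autocorrelated series x[t] = phi * x[t-1] + noise."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

# Thinning arithmetic from the text: 10^7 iterations sampled every 10^3.
n_samples = 10**7 // 10**3

if __name__ == "__main__":
    print("samples written to file:", n_samples)
    print("ESS, independent draws:", effective_sample_size(ar1_chain(5000, 0.0)))
    print("ESS, strongly correlated:", effective_sample_size(ar1_chain(5000, 0.9)))
```

The efficiency of a chain is then ESS divided by the number of samples; a strongly autocorrelated chain of 5,000 samples may carry the information of only a few hundred independent draws.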

Why should an MCMC analysis be run with an “empty alignment”? Are the data informative?

It is useful to run the MCMC algorithm sampling from the prior. This is achieved by setting the likelihood to 1 in equation (1). Some programs generate a dummy “empty” alignment that can be used to achieve the same effect. Runs should also be assessed for good convergence and mixing. Running the chain without data is a good way of checking the correctness of the software, because the mean, variance, etc. of the prior are often analytically available and can be checked against the MCMC sample. In molecular clock dating using fossil calibrations, the prior on divergence times incorporates the calibration information and is typically intractable. Running the program without using the sequences allows one to generate the prior used by the program. The sample from the prior can also be compared with the sample from the posterior (which is generated by using the data) to assess how informative the data are, and whether there are serious conflicts between the prior and the data. High similarity between the prior and the posterior suggests that the data contain little information about the parameters. Considerable overlap between the prior and posterior but with the posterior being much more concentrated than the prior means that the data are informative and the prior is reasonable. In contrast, if the prior and posterior do not overlap well, there may be a conflict between the prior and the data, possibly caused by misspecified priors. One can also modify the prior to assess the impact of the prior on the posterior. Note, however, that it is incorrect to specify the prior by trying to match the posterior, since the prior is supposed to reflect our knowledge before the analysis of the data.
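The two uses of a prior-only run described above, checking the sample against analytical moments and comparing prior with posterior, can be sketched as follows. Everything in this example is a hypothetical toy: a gamma prior on a rate with made-up Poisson counts, chosen only because the posterior is then also a gamma distribution and can be sampled directly without MCMC. The helper names and the two summary measures (ratio of 95% interval widths, and the fraction of the posterior interval lying inside the prior interval) are illustrative assumptions, not quantities defined in this paper.

```python
import random

def central_interval(sample, mass=0.95):
    """Equal-tailed interval containing roughly `mass` of the sample."""
    s = sorted(sample)
    n = len(s)
    lo = s[int((1.0 - mass) / 2.0 * (n - 1))]
    hi = s[int((1.0 + mass) / 2.0 * (n - 1))]
    return lo, hi

def compare_prior_posterior(prior, posterior):
    """Return (posterior width / prior width, fraction of posterior
    interval that lies inside the prior interval)."""
    p_lo, p_hi = central_interval(prior)
    q_lo, q_hi = central_interval(posterior)
    width_ratio = (q_hi - q_lo) / (p_hi - p_lo)
    overlap = max(0.0, min(p_hi, q_hi) - max(p_lo, q_lo))
    return width_ratio, overlap / (q_hi - q_lo)

rng = random.Random(42)
n_draws = 20000

# Hypothetical prior: gamma with shape 2 and rate 2 (mean 1.0, variance 0.5).
shape, rate = 2.0, 2.0
prior_sample = [rng.gammavariate(shape, 1.0 / rate) for _ in range(n_draws)]

# Check the prior-only sample against the analytical mean and variance.
prior_mean = sum(prior_sample) / n_draws
prior_var = sum((v - prior_mean) ** 2 for v in prior_sample) / (n_draws - 1)

# Made-up data: 50 Poisson counts summing to 60, so the posterior
# is gamma with shape 2 + 60 and rate 2 + 50.
post_shape, post_rate = shape + 60, rate + 50
posterior_sample = [
    rng.gammavariate(post_shape, 1.0 / post_rate) for _ in range(n_draws)
]

width_ratio, inside_fraction = compare_prior_posterior(
    prior_sample, posterior_sample
)
print(prior_mean, prior_var, width_ratio, inside_fraction)
```

In this toy case the posterior interval is much narrower than the prior interval and lies within it, which corresponds to the situation described above in which the data are informative and the prior is reasonable. A width ratio near 1 would instead suggest uninformative data, and a small inside fraction would point to a possible conflict between prior and data.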

Conclusions

Bayesian phylogenetics has undergone explosive growth during the past decade. The implementation of sophisticated models in easy-to-use software programs has made the method extremely appealing to biologists. The method is especially powerful in combining different sources of information in an integrated data analysis. As a result, Bayesian MCMC methods are the most commonly used framework for the development of new models of data analysis, especially in the areas of divergence time estimation integrating molecular, morphological and fossil information82, species tree estimation using multi-locus genomic sequence data24, and species delimitation incorporating genetic and morphological/ecological information83. The potential of the Bayesian method to deal with these and future questions has never been greater. For further reading on the Bayesian method and Bayesian phylogenetics the reader may consult26,84,85. A tutorial that helps the user to write a simple R program to conduct phylogenetic MCMC and to reproduce the figures of this paper is available at: http://github.com/thednainus/Bayesian_tutorial
