Remco Bouckaert1,2, Timothy G Vaughan3,4, Joëlle Barido-Sottani3,4, Sebastián Duchêne5, Mathieu Fourment6, Alexandra Gavryushkina7, Joseph Heled8, Graham Jones9, Denise Kühnert2, Nicola De Maio10, Michael Matschiner11, Fábio K Mendes1, Nicola F Müller3,4, Huw A Ogilvie12, Louis du Plessis13, Alex Popinga1, Andrew Rambaut14, David Rasmussen15, Igor Siveroni16, Marc A Suchard17, Chieh-Hsi Wu18, Dong Xie1, Chi Zhang19, Tanja Stadler3,4, Alexei J Drummond1.
Abstract
Elaboration of Bayesian phylogenetic inference methods has continued at pace in recent years, with major new advances in nearly all aspects of the joint modelling of evolutionary data. It is increasingly appreciated that some evolutionary questions can only be adequately answered by combining evidence from multiple independent sources of data, including genome sequences, sampling dates, phenotypic data, radiocarbon dates, fossil occurrences, and biogeographic range information, among others. Including all relevant data in a single joint model is very challenging both conceptually and computationally. Advanced computational software packages that allow robust development of compatible (sub-)models which can be composed into a full model hierarchy have played a key role in these developments. Developing such software frameworks is increasingly a major scientific activity in its own right, and comes with specific challenges, from practical software design, development and engineering challenges to statistical and conceptual modelling challenges. BEAST 2 is one such computational software platform, and was first announced over 4 years ago. Here we describe a series of major new developments in the BEAST 2 core platform and model hierarchy that have occurred since the first release of the software, culminating in the recent 2.5 release.
Year: 2019 PMID: 30958812 PMCID: PMC6472827 DOI: 10.1371/journal.pcbi.1006650
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Phylogenetic structures available in BEAST 2.
(a) A tip-dated time tree, with leaf times as boundary conditions but not data (generally a coalescent prior is applied in this setting). (b) A species tree with one or more embedded gene trees. (c) A multi-type time tree has measured types at the leaves, and the type changes that paint the ancestral lineages in the tree are sampled as latent variables by MCMC. (d) A sampled ancestor tree, with two types of sampling events: extinct species (red) and extant species (blue). Extinct species can be leaves or, if they are the direct ancestor of another sample, degree-2 sampled ancestor nodes. (e) An ancestral gene conversion graph is composed of a clonal frame (solid time tree) and an extra edge and gene boundaries for each gene conversion event. (f) A species network with one or more embedded gene trees.
BEAST 2 packages.
| Package | Subspecification | Special Feature | Reference |
|---|---|---|---|
| bModelTest | nucleotide subst. | model averaging, model comparison | [ |
| SSM | nucleotide subst. model | standard named nucleotide models | - |
| CodonSubstModels | codon subst. model | M0 | [ |
| MM | morphological model | discrete | [ |
| BEASTvntr | microsatellite model | variable number of tandem repeat data | [ |
| RBS | subst. | model averaging for contiguous site partitions | [ |
| PoMo | nucleotide subst. model | mutation-selection | [ |
| MGSM | site model | multi-gamma & relaxed gamma | [ |
| substBMA | site model | Dirichlet mixture model for site partitions | [ |
| FLC | molecular clock model | strict and relaxed clocks within local clock model | [ |
| SA | unstructured population, non-par. | sampled ancestor | [ |
| CA | unstructured population, non-par. | calibration density, sampling rate estimate | [ |
| BDSKY | unstructured population, non-par. | BD serial skyline | [ |
| | | BD incomplete sampling (no | [ |
| phylodynamics | unstructured population, par. | deterministic closed SIR, stochastic closed SIR | [ |
| | | birth-death SIR | [ |
| EpiInf | unstructured population, par. | prevalence estimation, particle filtering | [ |
| PhyDyn | unstructured and structured populations, par. | define epidemic model by ODEs | [ |
| MultiTypeTree | structured population | structured tree | [ |
| BadTrIP | structured population | within-host, transmission inference | [ |
| BDMM | structured population | multitype BD | [ |
| BASTA | structured population | approx. structured coalescent | [ |
| MASCOT | structured population | approx. structured coalescent and time variant GLMs | [ |
| SCOTTI | structured population | transmission inference | [ |
| BREAK AWAY | geographical model | break-away model of phylogeography | [ |
| GEO_SPHERE | geographical model | whole world phylogeography | [ |
| SSE | geographical and structured population | state-dependent birth-death + cladogenetic events | [ |
| BACTER | network model | clonal frame ancestral recombination graph | [ |
| SpeciesNetwork | network model | species networks | [ |
| DENIM | multispecies coalescent | species tree estimation with gene flow | [ |
| SNAPP | multispecies coalescent | from independent biallelic markers | [ |
| STACEY | multispecies coalescent | species delimitation & species tree estimation | [ |
| StarBEAST 2 | multispecies coalescent | faster, species tree clocks, FBD-MSC, AIM | [ |
| MODEL_SELECTION | model selection | path sampling, stepping stone | [ |
| NS | model selection | nested sampling | [ |
| MASTER | simulation | stochastic population dynamics simulation | [ |
| TreeModelAdequacy | model adequacy using simulation | phylodynamic model adequacy using phylogenetic tree test statistics | [ |
* birth-death skyline handles sampled ancestors.
1 subst. for substitution models;
2 par. for parametric and non-par. for nonparametric models;
3 BD for birth-death;
4 ODEs for ordinary differential equations;
5 analy. integ. of pop. for analytical integration of population.
Fig 2. bModelTest analysis for 36 mammalian species [50].
a) Posterior distribution of substitution models. Each circle represents a substitution model indicated by a six digit number corresponding to the six rates of reversible substitution models. In alphabetical order, these are A→C, A→G, A→T, C→G, C→T, and G→T, which can be shared in groups. The six digit numbers indicate these groupings, for example 121121 indicates the HKY model, which has shared rates for transitions and shared rates for transversions. Here, only models are considered that are reversible and do not share transition and transversion rates (with the exception of the JC69 and F81 models). Other substitution model sets are available. Links between substitution models indicate possible jumps during the MCMC chain from simpler (tail of arrow) to more complex (head of arrow) models and back. There is no single preferred substitution model for this data, as the posterior probability is spread over a number of alternative substitution models. Blue circles indicate the eight models contained in the 95% credible set, models with red circles are outside of this set, and models without circles have negligible support. b) Posterior tree distribution resulting from the bModelTest analysis.
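The six-digit labels in panel (a) can be read mechanically. The following is a minimal sketch, not part of bModelTest itself: the function names `rate_groups` and `free_rate_count` are hypothetical, and the only thing taken from the caption is the alphabetical rate order and the convention that equal digits mean a shared rate.

```python
# Rate order as stated in the caption (alphabetical).
RATES = ["A->C", "A->G", "A->T", "C->G", "C->T", "G->T"]


def rate_groups(code: str) -> dict[str, list[str]]:
    """Map each digit of a six-digit model label to the rates sharing it.

    Hypothetical helper for illustration; it is not a BEAST 2 API.
    """
    if len(code) != 6 or not code.isdigit():
        raise ValueError("expected a six-digit label such as '121121'")
    groups: dict[str, list[str]] = {}
    for digit, rate in zip(code, RATES):
        groups.setdefault(digit, []).append(rate)
    return groups


def free_rate_count(code: str) -> int:
    """Number of distinct rate groups encoded by the label."""
    return len(rate_groups(code))


if __name__ == "__main__":
    # 121121 is the HKY example from the caption: the two transitions
    # (A->G, C->T) share one rate, the four transversions share another.
    for digit, rates in rate_groups("121121").items():
        print(digit, rates)
```

Under this reading, a label such as 111111 has a single shared rate and 123456 gives each of the six rates its own group.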
Fig 3. Birth-death skyline (bdsky) analysis of the 2013–2016 West African Ebola virus disease epidemic.
(a) The maximum clade credibility tree of the 811 sequences used in the analysis. (b) The median posterior estimate of the effective reproductive number (R) over time is shown in orange, with the 95% highest posterior density (HPD) interval in orange shading. The red dotted line indicates the epidemic threshold (R = 1). If R is below this threshold, the epidemic has reached a turning point and is no longer spreading. The posterior distribution of the origin time of the epidemic (t0) is shown in green. The number of laboratory-confirmed cases per week is shown in blue. Red arrows indicate weeks with fewer than 10 confirmed cases. The dotted line at A indicates the onset of symptoms in the suspected index case (see text for details). The dotted lines at B and C indicate the dates at which the WHO declared an Ebola virus disease outbreak in Guinea and a Public Health Emergency of International Concern (PHEIC), respectively. The dotted line at D indicates the first time any of the three countries with intense transmission (Liberia) was declared Ebola free following 42 days without any new infections being reported (new cases were subsequently detected in Liberia in June 2015). (c) The median posterior estimate of the monthly sampling proportion is shown in purple, with the 95% HPD interval in purple shading. The red dashed line indicates the number of sampled sequences in the dataset, divided by the number of laboratory-confirmed cases, for each month in the analysis. This serves as an empirical estimate of the true sampling proportion. The posterior distributions and medians (dashed lines) of the infected period and the mean clock rate (truncated at the 95% HPD limits) are shown in panels (d) and (e).
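The empirical estimate in panel (c) is a simple ratio, which can be sketched as follows. This is an illustration only: the function name `empirical_sampling_proportion` and the monthly counts are hypothetical and are not taken from the Ebola dataset.

```python
from typing import Optional


def empirical_sampling_proportion(
    sequences: dict[str, int], cases: dict[str, int]
) -> dict[str, Optional[float]]:
    """Sampled sequences divided by confirmed cases, month by month.

    Months with no confirmed cases get None, because the ratio is undefined.
    """
    proportions: dict[str, Optional[float]] = {}
    for month, n_cases in cases.items():
        n_sequences = sequences.get(month, 0)
        proportions[month] = n_sequences / n_cases if n_cases > 0 else None
    return proportions


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    sequences = {"month-1": 50, "month-2": 30}
    cases = {"month-1": 200, "month-2": 120, "month-3": 0}
    print(empirical_sampling_proportion(sequences, cases))
```

A ratio like this ignores unconfirmed cases, so it is best read as a rough reference line for the posterior estimate rather than as the true sampling proportion.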
Fig 4. The multispecies coalescent (MSC) model with three species and a single gene tree.
A separate coalescent process applies to each of the five branches in the tree: the branches for the extant species A (red), B (green) and C (blue), the ancestral branch of A and B (yellow), and the root branch (grey). Several individuals have been sampled per species. In this example the ancestral lineage of individual b4 does not coalesce in species B or ancestral species 4. In ancestral species 5, it coalesces with the ancestral lineage of species C. This leads to incomplete lineage sorting and enables gene tree discordance: in this example b4 is a sister taxon to individuals from species C, rather than to individuals from its own species or from sister species A. If b4 were the representative individual for its species, then this gene would exhibit gene tree discordance. Other individuals which show concordance at this locus are expected to show discordance at other unlinked loci when populations are large or speciation times are recent.
Fig 5. AIM analysis of 100 nuclear gene alignments for the five Princess cichlid species.
Species are Neolamprologus marunguensis, N. gracilis, N. brichardi, N. olivaceous, N. pulcher, as well as the outgroup Metriaclima zebra. a) to d) show the best-supported tree topologies. Arrows show directions of gene flow that are supported with a Bayes Factor of more than 10. Trees a) and c) only differ in the timing of the speciation events; however, AIM differentiates between differently ranked topologies, since these have to be characterized by using different parameters.
Fig 6. Posterior predictive distributions for two phylodynamic models.
The left column shows the trajectories of the reproductive number over time for a set of 100 publicly available genomes from the 2009 H1N1 influenza pandemic in North America using stochastic (birth-death SIR; [28]) and deterministic (deterministic coalescent SIR; [27]) models. Each blue line is a trajectory sampled from the posterior distribution. The models make different inferences of when the reproductive number falls below 1 (vertical dotted line; the horizontal dashed line is for R = 1), indicating that the pandemic is past its infectious peak. The right column shows the posterior predictive distributions of the root height for both models (grey histograms) and the value for the empirical data (orange vertical lines). Trees simulated from the stochastic model are more consistent with the empirical tree than those from the deterministic model, suggesting that stochasticity may play an important role in the early stages of the pandemic (samples were collected up to June 2009).
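One common way to summarise the comparison in the right column is a posterior predictive p-value: the fraction of simulated test statistics at least as large as the empirical one. The sketch below assumes that summary; the caption does not say it is what the authors computed, and the function name `posterior_predictive_p` and the example values are hypothetical.

```python
from typing import Sequence


def posterior_predictive_p(simulated: Sequence[float], observed: float) -> float:
    """Fraction of simulated statistics that are >= the observed statistic.

    Values near 0 or 1 place the observed statistic in a tail of the
    posterior predictive distribution, which hints at a poor model fit.
    """
    if len(simulated) == 0:
        raise ValueError("need at least one simulated value")
    at_least_observed = sum(1 for value in simulated if value >= observed)
    return at_least_observed / len(simulated)


if __name__ == "__main__":
    # Hypothetical root heights from simulated trees, illustration only.
    simulated_root_heights = [1.0, 2.0, 3.0, 4.0]
    print(posterior_predictive_p(simulated_root_heights, observed=2.5))
```

In terms of the figure, an empirical root height near the centre of the grey histogram would give a value around 0.5, while one far out in a tail would give a value close to 0 or 1.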