Simon Möller, Louis du Plessis, Tanja Stadler.
Abstract
Bayesian phylogenetics aims to estimate phylogenetic trees together with evolutionary and population dynamic parameters based on genetic sequences. It has been noted that the estimated clock rate, one of the evolutionary parameters, decreases as the sampling period of the sequences increases. In particular, clock rates of epidemic outbreaks are often estimated to be higher than the long-term clock rate. Purifying selection has been suggested as a biological factor that contributes to this phenomenon, since it purges slightly deleterious mutations from a population over time. However, other factors such as methodological biases may also play a role and make a biological interpretation of results difficult. In this paper, we identify methodological biases originating from the choice of tree prior, that is, the model specifying epidemiological dynamics. With a simulation study we demonstrate that a misspecification of the tree prior can upwardly bias the inferred clock rate and that the interplay of the different models involved in the inference can be complex and nonintuitive. We also show that the choice of tree prior can influence the inference of clock rate on real-world Ebola virus (EBOV) datasets. While commonly used tree priors result in very high clock-rate estimates for sequences from the initial phase of the epidemic in Sierra Leone, tree priors allowing for population structure lead to estimates agreeing with the long-term rate for EBOV.
Keywords: Bayesian phylodynamics; Ebola; molecular clock; phylogenetics; tree inference
Year: 2018 PMID: 29610334 PMCID: PMC5910814 DOI: 10.1073/pnas.1713314115
Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205
Fig. 1. Results of the simulation study. (A) The tree that was used in the simulation study [this tree is the maximum clade credibility (MCC) tree of an analysis under a birth–death skyline model on a dataset consisting of the coding regions of 236 EBOV genomes sampled from patients in Guinea]. (B) The median values and 95% HPD intervals for key parameters estimated from simulated sequences. The dashed lines indicate the true values used in simulations. Clock rate is reported in substitutions per site per year, tree height and tree length in years, and total divergence (product of clock rate and tree length) in substitutions per site. (C) The distribution of topologies of posterior tree samples for analyses of simulated datasets of different sequence lengths, where we projected the Euclidean distances between real-valued representations of the topologies onto a 2D space. The red cross marks the true tree.
Fig. 2. Median and 95% HPD intervals for key parameters inferred from the Guinea (A) and Sierra Leone (B) datasets under different tree priors. For units refer to the Fig. 1 legend.
Fig. 3. A toy example of how the sequence data can influence the branch length via changing the topology.
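The summaries named in the Fig. 1 legend (median, 95% HPD interval, and total divergence as the product of clock rate and tree length) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `hpd_interval`, `total_divergence`, and `summarize` are hypothetical, the HPD is approximated as the shortest interval containing 95% of the posterior samples, and the posterior samples in the usage example are synthetic values chosen only for illustration.

```python
import math
import random
import statistics


def hpd_interval(samples, mass=0.95):
    """Approximate HPD: the shortest interval containing `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    k = max(1, math.ceil(mass * n))  # number of samples the interval must cover
    # Slide a window of k consecutive sorted samples and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def total_divergence(clock_rate, tree_length):
    """Substitutions per site = (subst./site/year) * (tree length in years)."""
    return clock_rate * tree_length


def summarize(samples, mass=0.95):
    """Median and approximate HPD bounds for one parameter."""
    low, high = hpd_interval(samples, mass)
    return {"median": statistics.median(samples), "hpd_low": low, "hpd_high": high}


if __name__ == "__main__":
    # Synthetic posterior samples (assumed values, not taken from the paper).
    rng = random.Random(1)
    clock_rates = [rng.lognormvariate(math.log(1e-3), 0.2) for _ in range(1000)]
    tree_lengths = [rng.gauss(40.0, 3.0) for _ in range(1000)]
    # Divergence is computed per posterior sample, then summarized.
    divergences = [total_divergence(r, t) for r, t in zip(clock_rates, tree_lengths)]
    for name, values in [("clock rate", clock_rates),
                         ("tree length", tree_lengths),
                         ("total divergence", divergences)]:
        print(name, summarize(values))
```

Computing the product per posterior sample, rather than multiplying the two medians, preserves any correlation between clock rate and tree length in the resulting interval.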