Mandev S Gill, Philippe Lemey, Marc A Suchard, Andrew Rambaut, Guy Baele.
Abstract
Reconstructing pathogen dynamics from genetic data as they become available during an outbreak or epidemic represents an important statistical scenario in which observations arrive sequentially in time and one is interested in performing inference in an "online" fashion. Widely used Bayesian phylogenetic inference packages are not set up for this purpose, generally requiring one to recompute trees and evolutionary model parameters de novo when new data arrive. To accommodate increasing data flow in a Bayesian phylogenetic framework, we introduce a methodology to efficiently update the posterior distribution with newly available genetic data. Our procedure is implemented in the BEAST 1.10 software package, and relies on a distance-based measure to insert new taxa into the current estimate of the phylogeny and imputes plausible values for new model parameters to accommodate growing dimensionality. This augmentation creates informed starting values and re-uses optimally tuned transition kernels for posterior exploration of growing data sets, reducing the time necessary to converge to target posterior distributions. We apply our framework to data from the recent West African Ebola virus epidemic and demonstrate a considerable reduction in time required to obtain posterior estimates at different time points of the outbreak. Beyond epidemic monitoring, this framework easily finds other applications within the phylogenetics community, where changes in the data (alignment changes, sequence addition or removal) present common scenarios that can benefit from online inference.
Keywords: BEAST; Bayesian phylogenetics; Markov chain Monte Carlo; online inference; pathogen phylodynamics; real-time analysis
Year: 2020 PMID: 32101295 PMCID: PMC7253210 DOI: 10.1093/molbev/msaa047
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Reduced Burn-In (in millions of iterations) Achieved with Online Bayesian Phylodynamic Inference.
| Data | Sequences (Total) | Sequences (Added) | Standard Analysis: Burn-In (G) | Standard Analysis: Burn-In (ESS) | Online Analysis: Burn-In (G) | Online Analysis: Burn-In (ESS) |
|---|---|---|---|---|---|---|
| 2014, Epi week 26 | 158 | 13 | 0.2 (<0.1) | <0.1 (<0.1) | <0.1 (<0.1) | <0.1 (<0.1) |
| 2014, Epi week 31 | 240 | 8 | 0.8 (0.3) | <0.1 (<0.1) | <0.1 (<0.1) | 0.4 (0.9) |
| 2014, Epi week 42 | 706 | 32 | 8.6 (2.1) | 10.2 (10.3) | 0.6 (0.9) | 1.0 (1.0) |
| 2015, Epi week 2 | 1,072 | 24 | 16.4 (7.3) | 17.6 (7.1) | 0.6 (0.5) | 0.4 (0.5) |
| 2015, Epi week 42 | 1,610 | 2 | 49.6 (20.6) | 60.2 (15.4) | <0.1 (<0.1) | 0.6 (1.3) |
Note. Comparison of burn-in for the log joint density sample resulting from two different analysis methods applied to Ebola virus data taken from the West African Ebola epidemic of 2013–2016. The standard de novo approach of analyzing the full data set from scratch is compared with the online inference approach that updates inferences from the previous epi week upon the arrival of new data. The length of burn-in (in millions of states) is determined in two ways: through a graphical approach (G) that consists of inspecting posterior trace plots, and by computing the amount of discarded burn-in that maximizes the effective sample size (ESS). Results are averaged over five replicates for each analysis, with standard deviation in parentheses.
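The ESS-based burn-in criterion in the note can be illustrated with a minimal Python sketch. It assumes a simple autocorrelation-based ESS estimator that truncates the sum at the first non-positive autocorrelation; this is an illustrative choice, not necessarily the estimator used for the table. The function names, the `step` and `max_fraction` parameters, and the synthetic trace are all hypothetical.

```python
import random
from statistics import fmean


def effective_sample_size(x):
    """Crude ESS: n / (1 + 2 * sum of autocorrelations at positive lags).

    The sum is truncated at the first non-positive autocorrelation.
    Illustrative estimator only (an assumption, not the source's method).
    """
    n = len(x)
    if n < 3:
        return float(n)
    mean = fmean(x)
    centered = [v - mean for v in x]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        return float(n)
    rho_sum = 0.0
    for lag in range(1, n):
        acov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = acov / var
        if rho <= 0.0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


def burn_in_maximizing_ess(trace, step=1, max_fraction=0.5):
    """Scan candidate burn-in lengths; return (burn_in, ess) with the largest ESS.

    Candidates are 0, step, 2*step, ... up to max_fraction of the trace.
    """
    best_burn, best_ess = 0, effective_sample_size(trace)
    limit = int(len(trace) * max_fraction)
    for burn in range(step, limit + 1, step):
        ess = effective_sample_size(trace[burn:])
        if ess > best_ess:
            best_burn, best_ess = burn, ess
    return best_burn, best_ess


def make_demo_trace(seed=0):
    """Synthetic trace (hypothetical data): a 50-sample transient, then noise."""
    rng = random.Random(seed)
    transient = [100.0 - 2.0 * i for i in range(50)]
    stationary = [rng.gauss(0.0, 1.0) for _ in range(450)]
    return transient + stationary


if __name__ == "__main__":
    burn, ess = burn_in_maximizing_ess(make_demo_trace(), step=5)
    print(f"chosen burn-in: {burn} samples, ESS of remainder: {ess:.1f}")
```

On a real analysis the trace would be the logged log joint density, and the chosen burn-in in samples would be multiplied by the logging interval to express it in states.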
Fig. 1. Comparison of burn-in resulting from standard de novo analyses versus online Bayesian analyses to compute updated inferences from data taken from different time points of the West African Ebola virus epidemic. The data flow of the epidemic, in terms of total sequences available during each epi week, is recreated in the background of the plot in gray bars. Dark gray bars show the data corresponding to the five time points at which we compute updated inferences. The plots chart the burn-in required by de novo analyses, represented by circles, and online analyses, represented by diamonds. Solid lines correspond to burn-in estimates based on visual analyses of trace plots whereas dotted lines correspond to burn-in estimates based on maximizing ESS values.
Fig. 2. Box plots show the distribution of savings in computation time by using online inference as compared with standard de novo analyses to update inferences for data from different time points in the West African Ebola virus epidemic. White box plots correspond to analyses using a Tesla P100 graphics card for scientific computing and gray boxes correspond to analyses using a multi-core CPU. Irrespective of the actual hardware used, the time savings are substantial, with up to 600 h on average saved using our online approach on CPU for our most demanding scenario. The axis corresponding to running time (in hours) is log-transformed to allow for greater visibility of plots for smaller data sets.
Fig. 3. A new sequence is inserted into an existing phylogenetic tree by determining the closest observed sequence (in terms of genetic distance) already in the tree, and inserting a new ancestor node for the new sequence and its closest sequence. The genetic distance between the new sequence and its closest sequence is converted into a distance in units of time, $d$, by dividing by the evolutionary rate associated with the branch leading to the closest sequence. To determine the insertion time $t_a$ of the new ancestor node (in terms of time prior to the present time), we require $(t_a - t_n) + (t_a - t_c) = d$, where $t_n$ is the sampling time of the new sequence, and $t_c$ the sampling time of its closest sequence. This yields $t_a = (d + t_n + t_c)/2$.
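The insertion step described in this caption can be sketched in Python as follows. This is a minimal illustration, not the BEAST implementation. It assumes a proportion-of-differing-sites (Hamming) genetic distance, and it assumes the two branches from the new ancestor to the two tips sum to the time distance d, so that the ancestor time is (d + t_n + t_c)/2 with times measured backwards from the present. The function names and the `tips` and `rates` dictionaries are hypothetical.

```python
def hamming_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def insertion_time(new_seq, new_time, tips, rates):
    """Pick the closest tip and compute the time of the new ancestor node.

    tips:  dict name -> (aligned sequence, sampling time before present)
    rates: dict name -> evolutionary rate on the branch leading to that tip
    Returns (closest_name, d, t_a), where d is the genetic distance
    converted into units of time and t_a is the ancestor time.
    """
    # Closest observed sequence in terms of genetic distance.
    closest = min(tips, key=lambda name: hamming_distance(new_seq, tips[name][0]))
    closest_seq, closest_time = tips[closest]

    # Convert genetic distance to time using the closest tip's branch rate.
    d = hamming_distance(new_seq, closest_seq) / rates[closest]

    # Solve (t_a - t_n) + (t_a - t_c) = d for t_a.
    t_a = (d + new_time + closest_time) / 2.0
    return closest, d, t_a


if __name__ == "__main__":
    # Toy data (hypothetical): two tips already in the tree, one new sequence.
    tips = {"A": ("ACGTACGT", 0.0), "B": ("ACGTTTTT", 1.0)}
    rates = {"A": 0.125, "B": 0.125}
    print(insertion_time("ACGTACGA", 0.5, tips, rates))
```

The sketch does not check that the computed ancestor time is older than both tips or younger than the parent of the closest tip. A working implementation would need to handle those cases, and how they are resolved is not stated in the caption.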