Sebastian Duchene, Leo Featherstone, Melina Haritopoulou-Sinanidou, Andrew Rambaut, Philippe Lemey, Guy Baele.
Abstract
The ongoing SARS-CoV-2 outbreak marks the first time that large amounts of genome sequence data have been generated and made publicly available in near real time. Early analyses of these data revealed low sequence variation, a finding that is consistent with a recently emerging outbreak, but which raises the question of whether such data are sufficiently informative for phylogenetic inferences of evolutionary rates and time scales. The phylodynamic threshold is a key concept that refers to the point in time at which sufficient molecular evolutionary change has accumulated in available genome samples to obtain robust phylodynamic estimates. For example, before the phylodynamic threshold is reached, genomic variation is so low that even large numbers of genome sequences may be insufficient to estimate the virus's evolutionary rate and the time scale of an outbreak. We collected genome sequences of SARS-CoV-2 from public databases at eight different points in time and conducted a range of tests of temporal signal to determine if and when the phylodynamic threshold was reached, and the range of inferences that could be reliably drawn from these data. Our results indicate that by 2 February 2020, estimates of evolutionary rates and time scales had become possible. Analyses of subsequent data sets, which included between 47 and 122 genomes, converged at an evolutionary rate of about 1.1 × 10⁻³ subs/site/year and a time of origin of around late November 2019. Our study provides guidelines to assess the phylodynamic threshold and demonstrates that establishing this threshold constitutes a fundamental step for understanding the power and limitations of early data in outbreak genome surveillance.
Keywords: 2019 novel coronavirus (SARS-CoV-2); molecular clock; phylodynamic threshold; phylogenetics; severe acute respiratory syndrome corona virus 2; temporal signal
Year: 2020 PMID: 33235813 PMCID: PMC7454936 DOI: 10.1093/ve/veaa061
Source DB: PubMed Journal: Virus Evol ISSN: 2057-1577
Description of data snapshots of SARS-CoV-2.
| Publication date range (from 10 January 2020) | Number of genomes | Sampling window (from 23 December 2019) | Days since first genome sample |
|---|---|---|---|
| 23 January | 22 | 17 January 2020 | 31 |
| 2 February | 47 | 27 January 2020 | 41 |
| 6 February | 55 | 28 January 2020 | 45 |
| 10 February | 66 | 3 February 2020 | 49 |
| 15 February | 90 | 7 February 2020 | 54 |
| 18 February | 95 | 9 February 2020 | 57 |
| 21 February | 109 | 9 February 2020 | 60 |
| 24 February | 122 | 10 February 2020 | 63 |
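The last column of the table can be reproduced with simple date arithmetic. The sketch below assumes that each value counts days from the first genome sample (23 December 2019) to the snapshot's publication cut-off date in the first column, an assumption that matches all eight rows. The function and variable names are illustrative and do not come from the study.

```python
from datetime import date

# Collection date of the first genome sample, as stated in the table header.
FIRST_SAMPLE = date(2019, 12, 23)

def days_since_first_sample(snapshot: date, first: date = FIRST_SAMPLE) -> int:
    """Number of days between the first genome sample and a snapshot date."""
    return (snapshot - first).days

# Publication cut-off dates of the eight snapshots (first column of the table).
snapshots = [
    date(2020, 1, 23), date(2020, 2, 2), date(2020, 2, 6), date(2020, 2, 10),
    date(2020, 2, 15), date(2020, 2, 18), date(2020, 2, 21), date(2020, 2, 24),
]

days = [days_since_first_sample(d) for d in snapshots]
for snapshot, n_days in zip(snapshots, days):
    print(snapshot.isoformat(), n_days)
```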
Figure 1. BETS results. Each panel corresponds to a snapshot data set collected up to a given month and day in 2020, with a certain number, n, of genomes, and the number of days since the first genome sample was collected (23 December 2019). The y-axis represents the log Bayes factors, where the best-performing model has a value of 0. Each bar corresponds to an analysis configuration for BETS, with two possible molecular clock models: the strict (SC) and the uncorrelated relaxed clock with an underlying lognormal distribution (UCLN). For the UCLN, we considered two possible priors on the standard deviation of the lognormal distribution: an exponential distribution with mean 0.33 or with mean 100, labelled as Exp(0.33) and Exp(100), respectively. The sampling times could be configured using the true values (dates), no sampling times (none), or permuted, with these latter two options indicating no temporal signal. For the analyses with permuted sampling times and the UCLN, we used an exponential prior with mean 0.33 for the standard deviation of the lognormal distribution. Black and dark grey bars correspond to analyses with the correct sampling times with the SC or UCLN clock models, respectively. Dark and light red bars are for analyses with no sampling times with these two clock models, and all light grey bars are for analyses with permuted sampling times.
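The scaling of the y-axis in Figure 1 can be illustrated as follows: each configuration's log marginal likelihood is expressed relative to the best-performing configuration, which therefore has a log Bayes factor of 0. This is a minimal sketch. The log marginal likelihood values are hypothetical placeholders rather than results from the study, and the rule that temporal signal is supported when the best configuration uses the true sampling dates is a simplification.

```python
def log_bayes_factors(log_mls: dict) -> dict:
    """Log Bayes factor of each configuration relative to the best one (best = 0)."""
    best = max(log_mls.values())
    return {name: lml - best for name, lml in log_mls.items()}

# HYPOTHETICAL log marginal likelihoods, for illustration only.
log_mls = {
    "SC dates": -41000.0,
    "UCLN dates": -41002.5,
    "SC none": -41010.0,
    "SC permuted": -41020.0,
}

lbf = log_bayes_factors(log_mls)
best_model = max(lbf, key=lbf.get)

# Simplified decision rule: temporal signal is supported when the best
# configuration is one that uses the true sampling dates.
has_temporal_signal = best_model.endswith("dates")

for name, value in lbf.items():
    print(f"{name}: {value:.1f}")
print("best:", best_model, "| temporal signal:", has_temporal_signal)
```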
Figure 2. Root-to-tip regressions for snapshot data sets. The y-axis corresponds to the root-to-tip distance of phylogenetic trees with branch lengths in units of substitutions per site. The x-axis represents calendar time. Each point corresponds to a tip in the tree. The regression line is the best-fitting line using the root position that maximised R². The R², the intercept with the x-axis (x-intercept), and the slope are shown for each data set, with the latter two representing crude estimates of the time of origin and the evolutionary rate, respectively.
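A root-to-tip regression of this kind can be sketched as an ordinary least-squares fit of root-to-tip distance against sampling time. The example below assumes the tree has already been rooted and the distances extracted, so the search for the root position that maximises R² is not shown. The tip dates and distances are synthetic, generated without noise from the rate reported in the abstract and an arbitrary origin, purely to show how the slope and x-intercept are read off.

```python
def root_to_tip_regression(times, distances):
    """Least-squares fit of distance on time; returns slope, x-intercept and R^2."""
    n = len(times)
    mean_t = sum(times) / n
    mean_d = sum(distances) / n
    s_tt = sum((t - mean_t) ** 2 for t in times)
    s_td = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, distances))
    s_dd = sum((d - mean_d) ** 2 for d in distances)
    slope = s_td / s_tt                    # crude evolutionary rate (subs/site/year)
    x_intercept = mean_t - mean_d / slope  # crude time of origin (distance = 0)
    r_squared = (s_td ** 2) / (s_tt * s_dd)
    return slope, x_intercept, r_squared

# SYNTHETIC data: tip dates in decimal years and noise-free distances
# generated from an assumed rate and an assumed origin.
assumed_rate = 1.1e-3
assumed_origin = 2019.9
tip_dates = [2019.98, 2020.00, 2020.05, 2020.10]
tip_distances = [assumed_rate * (t - assumed_origin) for t in tip_dates]

slope, x_intercept, r_squared = root_to_tip_regression(tip_dates, tip_distances)
print(f"slope={slope:.2e}, x-intercept={x_intercept:.3f}, R^2={r_squared:.3f}")
```

With real data the points scatter around the line, and a low R² suggests weak temporal signal, which is why the regression is treated only as a crude estimate.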
Figure 3. Prior and posterior densities for parameters of interest, using the best-fitting molecular clock model for each snapshot data set (the SC for all data sets except 24 February, for which the UCLN was chosen). The y-axis corresponds to parameter values, while the x-axis represents the relative density. Light blue densities correspond to the effective prior, while those in dark blue show the posterior.
Prior distributions used for key parameters.
| Parameter | Prior |
|---|---|
| Evolutionary (clock) rate | Continuous time Markov chain (CTMC) |
| Standard deviation of evolutionary rate (UCLN only) | Exponential (mean = 0.33 or mean = 100) |
| Exponential coalescent growth rate | Laplace ( |
| Exponential coalescent population size | Lognormal ( |