Gemma G R Murray, Fang Wang, Ewan M Harrison, Gavin K Paterson, Alison E Mather, Simon R Harris, Mark A Holmes, Andrew Rambaut, John J Welch.
Abstract
'Dated-tip' methods of molecular dating use DNA sequences sampled at different times to estimate the age of their most recent common ancestor. Several tests of 'temporal signal' are available to determine whether data sets are suitable for such analysis. However, it remains unclear whether these tests are reliable. We investigate the performance of several tests of temporal signal, including some recently suggested modifications. We use simulated data (where the true evolutionary history is known), and whole genomes of methicillin-resistant Staphylococcus aureus (to show how particular problems arise with real-world data sets). We show that all of the standard tests of temporal signal are seriously misleading for data where temporal and genetic structures are confounded (i.e. where closely related sequences are more likely to have been sampled at similar times). This is not an artefact of genetic structure or tree shape per se, and can arise even when sequences have measurably evolved during the sampling period. More positively, we show that a 'clustered permutation' approach introduced by Duchêne et al. (Molecular Biology and Evolution, 32, 2015, 1895) can successfully correct for this artefact in all cases, and we introduce techniques for implementing this method with real data sets. The confounding of temporal and genetic structures may be difficult to avoid in practice, particularly for outbreaks of infectious disease, or when using ancient DNA. Therefore, we recommend the use of 'clustered permutation' for all analyses. The failure of the standard tests may explain why different methods of dating pathogen origins have reached such wildly different conclusions.
Keywords: Bayesian dating; Staphylococcus aureus; dated‐tips; pathogen origins; permutation tests
Year: 2015 PMID: 27110344 PMCID: PMC4832290 DOI: 10.1111/2041-210X.12466
Source DB: PubMed Journal: Methods Ecol Evol Impact factor: 7.781
Figure 1. The left‐hand column shows schematic representations of the tree topologies over which evolution was simulated. The grey triangle represents the variable branching patterns of a simulated coalescence process. The middle column shows results of the regression of root‐to‐tip distance against sampling date. A significant positive correlation is consistent with the presence of temporal signal. P‐values were obtained by random permutation of sampling dates across sequences (P) or across monophyletic clusters of sequences that shared a sampling date (P_clust). The right‐hand column shows the maximum a posteriori estimate of the age of the most recent common ancestor (tMRCA), with 95% highest posterior density intervals (red), as inferred using BEAST. These are compared to equivalent estimates from data sets with the sampling dates randomly permuted across sequences (purple) or across clusters of sequences (blue). For the model selection approach, we report the increase in AICM values when sampling dates were included in the analysis. (a) and (b) represent a ‘balanced’ sampling strategy where each clade was sampled equally thoroughly at each of the sampling times; (c) and (d) represent a confounded sampling strategy where each clade was sampled at a different time. For (a) and (c), true temporal structure is high, such that a substantial amount of molecular evolution could occur between the sampling dates, while for (b) and (d), temporal structure is low.
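The two permutation schemes described in the Figure 1 legend can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`pearson_r`, `permutation_p`), the one-sided test on the Pearson correlation, and the toy data are all assumptions made for illustration. Root‐to‐tip distances and cluster assignments are taken as given; in practice they would come from a phylogeny.

```python
import random
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation undefined; treat as no signal
    return sxy / (sxx * syy) ** 0.5


def permutation_p(dates, distances, clusters=None, n_perm=999, seed=0):
    """One-sided permutation P-value for a positive date/distance correlation.

    clusters=None permutes dates across individual sequences (P).
    Otherwise clusters[i] labels the single-date cluster of sequence i,
    and dates are permuted across whole clusters (P_clust).
    """
    rng = random.Random(seed)
    observed = pearson_r(dates, distances)
    if clusters is None:
        clusters = list(range(len(dates)))  # every sequence is its own cluster
    cluster_date = {}
    for c, d in zip(clusters, dates):
        # Assumption: all members of a cluster share one sampling date.
        if cluster_date.setdefault(c, d) != d:
            raise ValueError("cluster members must share one sampling date")
    ids = sorted(cluster_date)
    pool = [cluster_date[c] for c in ids]
    hits = 0
    for _ in range(n_perm):
        shuffled = pool[:]
        rng.shuffle(shuffled)
        new_date = dict(zip(ids, shuffled))
        permuted = [new_date[c] for c in clusters]
        if pearson_r(permuted, distances) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy confounded data set (hypothetical): two clades, each sampled on one date.
dates = [2000] * 5 + [2010] * 5
distances = [0.010, 0.011, 0.009, 0.010, 0.012,
             0.020, 0.021, 0.019, 0.020, 0.022]
clusters = [0] * 5 + [1] * 5

p_standard = permutation_p(dates, distances)
p_clustered = permutation_p(dates, distances, clusters=clusters)
```

With this toy data set, permuting dates across sequences gives a small P‐value, whereas permuting across the two single‐date clusters cannot distinguish the observed arrangement from its only alternative, so the clustered P‐value is large. This mirrors the confounded case in panels (c) and (d).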
Figure 2. Illustrative phylogenies in which genetic and temporal structure are (a) unconfounded or (b) confounded. Grey arrows describe the distance between pairs of sequences sampled on different dates (t0 and t1).
Figure 3. Dating analyses for genomes sampled over 17 years. Plots show the maximum a posteriori (MAP) estimates of the age of the most recent common ancestor (tMRCA), with 95% highest posterior density (HPD) intervals. (a) shows the estimate from the complete data set (red), and from random (purple) or confounded (blue) subsamples, all with the same common ancestor and range of sampling dates. (b) shows estimates from subsamples with a narrower sampling range, and a different true tMRCA. Red dashed lines and shaded areas describe the best estimate of the tMRCA and its 95% HPD interval as inferred from the complete data set. Grey dashed lines show the youngest possible tMRCA, as determined by the oldest sample. Below are the results of tests of temporal signal and confounding. For BEAST permutation tests: ✓ indicates that the true MAP estimate lay outside the range of the MAP estimates from the randomized data sets, ✓✓ indicates that the true MAP estimate is not within the HPD intervals of the estimates from randomized data sets, and ✓✓✓ indicates that the HPD interval of the true estimate does not overlap with the HPD intervals of estimates from the randomized data sets. For the model selection approaches, we report the probability that the model without sampling dates is the ‘true’ model (AICM analysis), or the Bayes factor support for the inclusion of sampling dates (Kass & Raftery 1995). Tests indicating temporal signal are in bold; *P < 0.05; **P < 0.01.
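The three tick levels defined in the Figure 3 legend can be written as a small classifier. The sketch below is an illustration under stated assumptions, not code from the paper: the function name `signal_level` and the `(map, hpd_low, hpd_high)` tuple layout are hypothetical, and the criteria are treated as nested (each level requires the previous one), which the legend suggests but does not state explicitly.

```python
def signal_level(true_est, randomized):
    """Return 0-3, the number of ticks earned by the true estimate.

    true_est: (map, hpd_low, hpd_high) for the real sampling dates.
    randomized: list of such tuples, one per date-randomized data set.
    Assumption: the three criteria are nested in increasing stringency.
    """
    t_map, t_lo, t_hi = true_est
    maps = [m for m, _, _ in randomized]

    # Level 1: true MAP lies outside the range of the randomized MAPs.
    if min(maps) <= t_map <= max(maps):
        return 0
    # Level 2: true MAP lies within none of the randomized HPD intervals.
    if any(lo <= t_map <= hi for _, lo, hi in randomized):
        return 1
    # Level 3: true HPD interval overlaps none of the randomized HPD intervals.
    if any(t_lo <= hi and lo <= t_hi for _, lo, hi in randomized):
        return 2
    return 3


# Hypothetical estimates (arbitrary time units), for illustration only.
true_est = (50.0, 40.0, 60.0)
level_a = signal_level(true_est, [(200.0, 150.0, 300.0), (250.0, 180.0, 400.0)])
level_b = signal_level(true_est, [(200.0, 55.0, 300.0)])
level_c = signal_level(true_est, [(200.0, 45.0, 300.0), (100.0, 80.0, 150.0)])
level_d = signal_level(true_est, [(40.0, 30.0, 45.0), (60.0, 55.0, 70.0)])
```

Here `level_a` earns three ticks, `level_b` two, `level_c` one and `level_d` none. The same function applies whether the randomized estimates come from permuting dates across sequences or across clusters.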
Figure 4. The Bayesian dating test for strains sampled from a dog during an outbreak in a veterinary hospital. Differences in the degree of clustering with sampling date are apparent between the phylogenies estimated with (the MCC tree from the Bayesian dated‐tip analysis) and without the use of temporal information (a neighbour‐joining tree). Colour and symbol shape represent strains sampled on the same date. The plot shows the maximum a posteriori estimates of the age of the most recent common ancestor (tMRCA, on a log scale) with 95% highest posterior density intervals. The true estimate (red) is compared to estimates with the sampling dates randomly permuted across sequences (purple), across single‐date clusters identified from the neighbour‐joining tree (blue), or across single‐date clusters identified from the MCC tree (green). The blue horizontal line indicates the date of admission of the dog into the veterinary hospital. Significance levels are described in the legend of Fig. 3.