Literature DB >> 28348834

Genome-scale rates of evolutionary change in bacteria.

Sebastian Duchêne^1,2,3, Kathryn E Holt^2,3, François-Xavier Weill⁴, Simon Le Hello⁴, Jane Hawkey^2,3, David J Edwards^2,3, Mathieu Fourment¹, Edward C Holmes¹.

Abstract

Estimating the rates at which bacterial genomes evolve is critical to understanding major evolutionary and ecological processes such as disease emergence, long-term host-pathogen associations and short-term transmission patterns. The surge in bacterial genomic data sets provides a new opportunity to estimate these rates and reveal the factors that shape bacterial evolutionary dynamics. For many organisms estimates of evolutionary rate display an inverse association with the time-scale over which the data are sampled. However, this relationship remains unexplored in bacteria due to the difficulty in estimating genome-wide evolutionary rates, which are impacted by the extent of temporal structure in the data and the prevalence of recombination. We collected 36 whole genome sequence data sets from 16 species of bacterial pathogens to systematically estimate and compare their evolutionary rates and assess the extent of temporal structure in the absence of recombination. The majority (28/36) of data sets possessed sufficient clock-like structure to robustly estimate evolutionary rates. However, in some species reliable estimates were not possible even with 'ancient DNA' data sampled over many centuries, suggesting that they evolve very slowly or that they display extensive rate variation among lineages. The robustly estimated evolutionary rates spanned several orders of magnitude, from approximately 10-5 to 10-8 nucleotide substitutions per site year-1. This variation was negatively associated with sampling time, with this relationship best described by an exponential decay curve. To avoid potential estimation biases, such time-dependency should be considered when inferring evolutionary time-scales in bacteria.

Entities: Chemical Disease Species

Keywords: bacteria; evolution; molecular clock; phylogeny; substitution rates; time-dependency

Mesh：

Year: 2016 PMID： 28348834 PMCID： PMC5320706 DOI： 10.1099/mgen.0.000094

Source DB: PubMed Journal: Microb Genom ISSN： 2057-5858

Data Summary

This study consists of nucleotide sequence alignments, all of which are available online (http://zenodo.org/record/45951#.Vr1M-JN95E4).

Impact Statement

Phylogenetic methods can be used to infer the evolutionary time-scale of groups of organisms. In some pathogens it is possible to estimate the time of origin of disease outbreaks or of cross-species transmission events. However, the accuracy of these estimates relies on our understanding of the rate at which genetic change accumulates over time. The recent surge of genomic data presents an unprecedented opportunity to enhance our understanding of microbial evolution, with the potential of improving future inferences of their evolutionary time-scale. We estimated the rate of nucleotide substitution in 36 complete genomes of different bacterial pathogens using a range of computational methods. We find large differences in the rates produced. For example, some bacteria, such as those that cause hospital-derived infections, evolve orders of magnitude more rapidly than those that cause tuberculosis, which undergo extended periods of latency. We also find that the sampling times appear to play an important role in determining their rate. Our results provide the first genomic perspective of bacterial rates of evolution, thereby improving our understanding of the time-scale over which they diversify.

Introduction

Estimating the rate of molecular evolution is critical for understanding a variety of evolutionary and epidemiological processes. Rates of molecular evolution are the product of the number of mutations that arise per replication event, the frequency of replication events per unit time and the probability of mutational fixation. These rates are determined by a variety of factors including background mutation rate, the direction and strength of natural selection, generation time and population size. Both generation time and population size have been shown to scale negatively with the substitution rate in a range of organisms (Bromham, 2009). For example, because of differences in generation time, spore-forming bacteria evolve more slowly over time than those that do not form spores (Weller & Wu, 2015). Similarly, lineages undergoing adaptive evolution are expected to accumulate substitutions more rapidly than those subject to purifying selection (Eyre-Walker & Keightley, 2007). Despite the wealth of sequence data, genome-wide rates of evolutionary change in bacteria are often uncertain. At one end of the spectrum, rates as high as ~10−5 nucleotide substitutions per site year−1 have been reported for Neisseria gonorrhoeae (Pérez-Losada ). In contrast, genome-wide rates of only ~10−9 substitutions per site year−1 have been observed in Mycobacterium tuberculosis (Comas ). Importantly, however, these estimates are not always readily comparable because they use different methods and sources of data. Furthermore, most previous studies have not investigated the degree of temporal structure (i.e. clock-like behaviour) in the data, such that their reliability is uncertain. The increasing availability of genomic data means that it is now possible to estimate substitution rates with sequences collected over a number of years (i.e. tip-date calibration), not only for rapidly evolving pathogens, such as RNA viruses, but also in some DNA viruses and bacteria (Biek ). The rates at which bacteria evolve, the strength of clock-like signal in the data and the determinants of any rate differences observed have not been systematically investigated. However, these parameters are of importance both for the accurate interpretation of outbreak investigations that may depend on the reliable estimation of the time-scale of putatively linked transmission cases, and for revealing the long-term time-scales over which bacteria have been associated with specific hosts. Importantly, estimates of evolutionary rate have been shown to scale negatively with their time-scale of measurement in several organisms (Ho ), a pattern that has been attributed to the gradual purging of deleterious mutations over time (Ho & Larson, 2006; Penny, 2005). In the context of phylogenetic analyses, the time-scale of measurement corresponds to the age of the calibration or the sampling time-frame (Duchêne ; Molak & Ho, 2015). In viruses, natural selection, mutational saturation and substitution model inadequacy have also been shown to contribute to this pattern (Duchêne , 2015a; Ho ). However, the phenomenon of time-dependency of rate estimates remains largely unexplored in bacterial genomes, although there is some empirical evidence that it may also hold in these organisms (Comas ). To provide a comprehensive picture of genomic-scale evolutionary rates in bacteria and their temporal dynamics, particularly the extent of time-dependency in the data, we analysed, using a variety of phylogenetic methods, 36 publically available whole genome SNP data sets from bacterial pathogens associated with human disease sampled over periods extending over 2000 years.

Methods

Data collection.

We collected 35 whole genome nucleotide SNP alignments from previously published studies (Baines ; Bart ; Bos ; Bratcher ; Croucher ; Davies ; Devault ; Eldholm ; Gaiarsa ; Holden ; Holt , 2013, 2016; Howden ; Marvig ; Merker ; Njamkepo ; Schuenemann ; Schultz ; Stinear ; Uhlemann ; Wagner ; Ward ; Zhou , 2014), and one unpublished data set of Salmonella enterica serovar Kentucky (Table S1). The Salmonella enterica serovar Kentucky data set comprised 88 strains isolated between 1937 and 2012 (Le Hello ). Whole genome sequencing was performed at GATC Biotech (Germany) using an Illumina HiSeq system and analysed as described previously (Holt ). Although sequence reads for most other data sets are publically available in the NCBI Short Read Archive (SRA), we obtained the original alignments from the authors wherever possible. With this approach we take advantage of domain knowledge of the original studies for choice of reference sequence and identification of repetitive or horizontally transferred sequences, which are important for the accurate generation of SNP alignments. We removed outgroup taxa and samples that were distantly related to the majority of sequences to limit our analyses to the taxonomic group of interest and to minimize the artificial inflation of among-lineage rate variation due to the presence of very long branches in the phylogenetic trees. For the Neisseria meningitidis and M. tuberculosis Lineages 2 and 4 data sets we obtained sequence reads from the Short Read Archive and called SNPs using the RedDog pipeline [as described previously (Schultz ); available at https://github.com/katholt/RedDog]. SNPs can be introduced into bacterial genomes individually via mutations or in clusters by recombination, a process that may bias demographic inferences and rate estimates (Hedge & Wilson, 2014; Lapierre ). In particular, ignoring recombination can lead to incorrect estimates of branch lengths, which in turn results in an apparently over-dispersed molecular clock, precluding accurate estimates of evolutionary rates and time-scales. In some cases, removing sites with evidence of recombination can alleviate this problem (see Fig. S1). Thus, we removed all genomic regions with evidence of recombination using Gubbins v1.4.2 (Croucher ), and verified these results using RDP4 (Martin ). For RDP4 we removed sequences with significant evidence of recombination according to six of the methods implemented in the program: Bootscan (Salminen ), Geneconv (Padidam ), Maxchi (Smith, 1992), Siscan (Gibbs ), RDP (Martin & Rybicki, 2000) and 3seq (Boni ). Accordingly, we discarded an entire data set of N. meningitidis in which nearly half of the sequences displayed recombination (Table S1). We also visually checked that the alignments were free of additional obviously recombining regions. Our final data set comprised 16 bacterial species from 13 genera, with genome sizes ranging from 1.4 Mbp (Streptococcus pyogenes) to 6.2 Mbp (Pseudomonas aeruginosa). The data set sizes ranged from 189 (Streptococcus pneumonia and M. tuberculosis Lineage 4) to 15 (Mycobacterium leprae ancient DNA) sequences, and alignment lengths from 15 394 SNPs (Acinetobacter baumannii) to 402 (M. tuberculosis Lineage 4) SNPs. All the data sets analysed here included sampling times associated with each sequence in the form of year of isolation. Individual data sets spanned sampling ranges of approximately 2000 years in the case of M. leprae to 2.5 years in Staphylococcus aureus ST8:USA300. Four data sets included ancient DNA samples: M. leprae (Schuenemann ), two Yersinia pestis data sets (Wagner ), and one M. tuberculosis data set comprising strains sampled from New World mummies, animal strains and human Lineage 6 (Bos ). In M. leprae, the oldest sample had an estimated age of approximately 2000 years (Schuenemann ), while in the Y. pestis data sets the oldest samples had ages of approximately 600 and 1500 years (Wagner ). The New World mummy M. tuberculosis samples had ages of approximately 900 years (Bos ). Notably, one of the Y. pestis data sets included samples from both contemporary strains and those from the first, second and third plague pandemics, while the other is a subset that only included sequences from the second pandemic that began with the Black Death and contemporary strains. The remaining data sets had sampling times from a few years to several decades (Table S1). All the data sets used here are available online (http://zenodo.org/record/45951#.Vr1M-JN95E4).

Maximum-likelihood (ML) analysis and root-to-tip regression.

We estimated ML trees using PhyML v3.1 (Guindon ), employing the GTR+Γ substitution model with four categories for the Γ distribution of among-site rate heterogeneity and sub-tree pruning regrafting branch-swapping. We did not consider the proportion of invariable sites in the substitution model because the data sets comprised SNPs only, such that all sites are variable. The branch lengths in the ML trees are the expected number of nucleotide substitutions per site; as we are working with SNP alignments, this corresponds to the expected number of substitutions per variable site. To convert the branch lengths into genome-wide distances, we multiplied them by the length of the SNP alignment, and divided them by the core genome length that reflects the size of the genome in which SNPs were called (this information was extracted from the original publications). We fitted regressions for the root-to-tip distance as a function of sampling time using nelsi v0.21 (Ho ), where the position of the root is selected to maximize the determination coefficient, R2. Under clock-like evolution there should be a linear relationship between sampling time (year) and the expected number of nucleotide substitutions along the tree. The slope of the line corresponds to the substitution rate, although this estimate is statistically invalid because data points may not be phylogenetically independent. The extent to which the points deviate from the regression line reflects the amount of among-lineage rate variation, measured using R2 (Rambaut ).

Bayesian analysis and date randomizations.

We estimated rates of evolutionary change using a Bayesian Markov chain Monte Carlo method implemented in beast v1.8 (Drummond ), with a chain length of 108 steps, sampling every 5000 steps. If the effective sample size of any of the parameters was less than 200, we increased the chain length by 50 % and reduced the sampling frequency accordingly. For all data sets we used the GTR+Γ substitution model and the sampling times (tip dates) for calibration. For this analysis we utilized both strict and uncorrelated lognormal molecular clock models and constant-size coalescent and Bayesian Skyline demographic models, resulting in four possible model combinations. We compared the statistical fit of these models by estimating marginal likelihoods via path-sampling (Xie ). We report the rate estimates from the model with the highest marginal likelihood. To validate our Bayesian rate estimates we conducted a date-randomization test, which consists of repeating the analysis while assigning the sampling times randomly to the sequences (Firth ). In all cases we used the best-fit combination of demographic and clock model, described above. We conducted ten randomization replicates per data set to generate an expectation of the rate estimates in the absence of temporal structure, which appears to be sufficient to assess temporal structure in bacterial data (Murray ). Two criteria have been proposed to assess temporal structure. One method, known as CR1 (Duchêne ), considers that data do not have temporal structure if the mean rate with the correct sampling times is contained within the 95 % highest posterior density (HPD) of that from any of the randomizations. CR2 is more conservative; the data do not have temporal structure if the HPD of the estimate with the correct sampling times overlaps with that from any of the randomizations. These criteria have different levels of type I and type II errors (Duchêne ). Here, we considered the proportion of randomized replicates with HPDs that overlapped with those obtained using the correct sampling times, following CR2. We arbitrarily determined that data had ‘strong’ temporal structure if the proportion was 0, ‘moderate’ if it was between 0 and 0.5, and ‘low’ if it was less than 0.5. We verified that samples with the same sampling time did not form monophyletic groups in our ML trees. Such pattern can produce false positives in the date-randomization test (i.e. incorrectly suggesting that a data set has temporal structure) (Duchêne ; Murray ).

Regression analyses of the time-dependency of substitution rates.

A key element of our study was to determine whether the time-span of sampling was associated with the evolutionary rate estimate. To test for this pattern we conducted a least squares linear regression for the genome-wide rate as a function of sampling time. Importantly, only rate estimates with strong and moderate temporal structure were included in this analysis. We used a log10 transformation because our rate estimates span several orders of magnitude. A potential shortcoming of this analysis is that the data do not necessarily represent independent samples because some correspond to closely related lineages. This can be addressed by using phylogenetic independent contrasts, or phylogenetic generalized least squares. However, these methods require a phylogenetic tree of the evolutionary relationships of all data points that cannot be estimated for our data because the sites in different SNP alignments are not necessarily homologous. Instead, we used an approach in which the regression is conducted using a single randomly chosen data point from each species. We repeated this procedure 1000 times and verified that the range of slope estimates with random subsamples did not include zero, and we report the mean value and the corresponding confidence interval. For our regression of the rate as a function of sampling time we used a test described previously (Duchêne ) to verify that the slope estimate was not a statistical artefact that sometimes occurs when fitting a regression for a ratio as a function of its denominator. The linear regression method, however, does not provide a realistic description of time-dependency, which has been suggested to follow a decay curve (Ho ; Penny, 2005). We therefore modelled the relationship between rate and time using a double exponential curve of the form r=a/T, where r and T correspond to the rate estimate and sampling time on log10 scale, respectively, and parameters a and b control the asymptote and rate of decay in the function. This parameterization is similar to that proposed by O’Fallon (2010). Parameters a and b were optimized using the algorithm of Nelder & Mead (1965). To obtain a confidence interval around the parameter estimates, we conducted 100 bootstrap replicates of the rate estimates and sampling times.

Results

Measuring the extent of clock-like structure in bacterial evolution

Our root-to-tip regressions revealed large differences in the degree of clock-like behaviour, with 22 of the 35 data sets having R2 values of less than 0.5, suggesting weak clock-like behaviour, and eight with R2 values of 0.7 or higher, suggesting stronger clock-like behaviour (Fig. 1). A data set of Staphylococcus aureus multilocus sequence type ST239 had the highest R2, at 0.96, with similar values observed in Vibrio cholerae (0.92), Shigella sonnei (0.89) and Shigella dysenteriae type 1 (0.88). The lowest R2 was 1.52×10−3 for M. tuberculosis Lineage 2. The regression slope (rate) for M. tuberculosis Lineage 2 and Y. pestis from the second global pandemic included negative values, suggesting that these rates are either too low, or that there is extensive among-lineage rate variation, to allow reliable rate estimation.

Fig. 1.

Regressions of the root-to-tip genetic distance (expected nucleotide substitutions per site) as a function of sampling time (year) for 35 bacterial data sets. Each point corresponds to an individual sampled genome (SNP sequence in the alignment), and the red dashed line is the linear regression using least squares; the R2 coefficients are also shown. Shading corresponds to the degree of temporal structure according to the date-randomization test in beast; blue indicates strong temporal structure, while orange and red indicate moderate and low temporal structure, respectively. Our comparison of molecular clock models included the strict clock, which assumes that the rate of evolution is constant across lineages, and the uncorrelated lognormal relaxed clock that treats the rate across lineages as a random variable. Most bacterial data sets favoured a relaxed molecular clock, such that there is marked rate variation among lineages (Figs S2 and S3). The date-randomization test suggested that 28 data sets had strong to moderate temporal signal (Figs 2 and S2) (defined as ≤5 randomizations with an HPD overlapping with the HPDs estimated with the correct tip-dates). Only one (M. leprae) of the 15 bacterial species (excluding N. meningitidis, which was discarded because of the large number of recombinant sequences) analysed had no data sets displaying moderate to strong temporal signal. However, five of eight species represented by two or more data sets displayed a mix of both weak signal and strong to moderate temporal signal (Fig. 2), suggesting that a lack of temporal signal may be a property of the individual data sets rather than a true species effect.

Fig. 2.

Bayesian estimates of genome-scale nucleotide substitution rates for all bacterial data sets. The axis for the nucleotide substitution rate is shown on log10 scale. Circular points represent the mean rate estimate, and error bars correspond to the 95 % HPD values. Colours indicate the degree of temporal structure according to the date-randomization test as indicated in Fig. 1. For comparison, the x symbol represents the point estimates using regression, while the asterisk (*) corresponds to estimates that were negative, and are thus not shown. That even data sets from slowly evolving bacteria such as M. tuberculosis, which exhibit substitution rates in the order of 10-8 substitutions per site year−1, possessed moderate temporal signal underlines the power of genome-scale data to estimate evolutionary dynamics. Despite this, it is striking that some data sets, including ancient DNA samples, had very little temporal signal, such that incorporating historical DNA data is not always sufficient for reliable rate estimation. For example, previous studies have suggested no temporal structure in data sets of Y. pestis with sampling time ranges of ~290 years (Bos ) and ~650 years (Wagner ), and of M. leprae with a sampling range of almost 2000 years (Schuenemann ).

The range of genome-wide rates of evolutionary change in bacteria

The highest mean genome-wide rate estimate for the data sets with temporal structure was 9.35×10−6 substitutions per site year−1 (HPD: 2.50×10−6–1.74×10−5) for a vancomycin-resistant Enterococcus faecium (VRE) data set sampled over 10 years. High rates were also observed in A. baumannii Global Clone 2 (GC2) sampled over 7 years and with a mean rate of 3.15×10−6 substitutions per site year−1 (HPD: 2.34×10−6–4.44×10−6) and Staphylococcus aureus clonal complex 398 (CC398) sampled over 9.5 years and with a mean rate of 2.43×10−6 substitutions per site year−1 (HPD: 1.14×10−6–3.98×10−6). The lowest estimate was for Y. pestis with samples collected over ~1500 years which resulted in a mean rate of 1.57×10−8 substitutions per site year−1 (HPD: 1.03×10−8–2.27×10−8), although this data set had only moderate temporal structure. Other data sets with low rates and strong temporal structure were: M. tuberculosis, with samples collected over ~900 years and a mean rate of 5.39×10−8 substitutions per site year−1 (HPD: 2.49×10−8–1.02×10−7); M. tuberculosis Lineage 4 with samples collected over 13 years and a mean rate of 5.67×10−8 substitutions per site year−1 (HPD: 3.80×10−8–8.02×10−8); and three Salmonella enterica serovar Paratyphi A data sets, with sampling times of 58–84 years and mean rates ranging from 7.60×10−8 to 9.47×10−8 substitutions per site year−1(Figs 2 and S2). There was large variation in rate estimates for some closely related bacteria. Our analyses included several data sets of different serovars of Salmonella enterica: Kentucky, Agona, Typhi and Paratyphi A. Interestingly, the rate estimates for the human-restricted serovars Typhi and Paratyphi A (the agents of typhoid fever) (mean rates ranging from 1.78×10−7 to 8.02×10−8 substitutions per site year−1) were consistently lower than those estimated for the host-generalist serovars Kentucky and Agona (5.34–3.95×10−7 substitutions per site year−1). The rate estimates for Staphylococcus aureus were also highly variable between lineages, which included CC398, type USA300 and multilocus sequence types ST22, ST93 and ST239. The estimated mean rate for the livestock-associated CC398 was 2.43×10−6 substitutions per site year−1 (HPD: 1.14×10−6–3.86×10−6), the highest for this species. In contrast, the estimated mean rate for ST93 was 5.55×10−7 substitutions per site year−1 (HPD: 2.77×10−7–9.60×10−7), nearly five times lower (Fig. 2). We also investigated whether rate estimates based on root-to-tip regression were consistently biased compared to those obtained using the Bayesian approach in beast (Fig. 3). If the estimates from the two methods are the same, then they should fall along the line y=x when plotted against each other. Notably, most points fell above the regression line, implying that the mean Bayesian estimates (y-axis) were higher than those from the regression (x-axis). A probable cause of this pattern is that deep branches in the phylogenetic trees are over-represented in the regression method. If substitution rates do indeed vary in a time-dependent manner (Duchêne ; see below), such that higher rates are observed toward the present, then rate estimates obtained using regression may exhibit a downward bias (Fig. 3). The exceptionswere three Klebsiella pneumoniae data sets (CC258), only one of which had sufficient temporal structure for reliable estimation, with a mean rate estimate using beast of 2.99×10−7 substitutions per site year−1 (HPD: 9.51×10−8–5.65×10−7), while that using regression was 3.05×10−7 substitutions per site year−1 [95 % confidence interval (CI): 1.03×10−7–5.07×10−7); and Bordetella pertussis, with a mean rate of 1.74×10−7 substitutions per site year−1 using beast (HPD: 1.53×10−7–1.94×10−7), and of 2.15×10−7 substitutions per site year−1 (CI: 2.24×10−7–2.78×10−7) using regression.

Fig. 3.

Estimates of genome-scale nucleotide substitution rates using a Bayesian method (beast) compared to those estimated via root-to-tip regression. The line represents y=x. Points that fall on the line correspond to data sets for which the mean Bayesian estimates closely match those from the regression. Points above and below the line are data sets for which the Bayesian estimate is higher or lower than that from the regression, respectively. The colour corresponds to the degree of temporal structure according to the date-randomization test as indicated in Fig. 1.

Modelling time-dependent rates of evolution

For those data sets with strong to moderate temporal structure, we investigated the relationship between evolutionary rate and sampling time, considered as the time-span between the youngest and oldest samples in each data set. Our linear regression for the rate estimates as a function of sampling time on a log10 scale revealed a significant negative association between rate and time with slope =−0.701 (CI: −4.74 to −5.81 and P=0.0003; Fig. 4). The optimal parameterization for the decay function r=a/T was a=−5.90 (CI: −6.39 to −5.77) and b=−0.17 (CI: −0.26 to −0.13) (Fig. 4). Importantly, the residual errors were normally distributed around the fitted curve (Fig S4).

Fig. 4.

Estimates of genome-wide nucleotide substitution rates in human-associated bacterial pathogens as a function of sampling time in years. The axes are shown on a log10 scale. The colour corresponds to the degree of temporal structure according to the date-randomization test; blue indicates strong temporal structure and orange indicates moderate temporal structure. The dashed line corresponds to the linear regression, while the solid line corresponds to the decay curve (both fitted using only the points with strong and moderate temporal structure). The grey lines represent 100 bootstrap replicates of the decay curve, and thus represent the uncertainty in the decay function.

Discussion

We have performed the largest comparative and systematic study of evolutionary rates in bacteria to date, providing an important resource for future studies of bacterial evolution. This analysis revealed an approximately two order of magnitude range of evolutionary rates for those bacterial data sets in which there was sufficient temporal structure for rate estimation. For the most rapidly evolving bacteria analysed here, such as E. faecium, Staphylococcus aureus and A. baumannii, we estimated genome-wide rates between 10−5 and 10−6 substitutions per site year−1. Not only are these rates within the range of those previously estimated for rapidly evolving bacteria (Holt ; Ward ) but, more strikingly, they are also similar to those of slowly evolving DNA viruses (Duchêne ; Firth ). However, even our highest rate estimates are lower than some reported previously, such as those for Helicobacter pylori (Kennemann ) and N. gonorrhoeae (Pérez-Losada ), at ~10−5 and 10−4 substitutions per site year−1, respectively. As these species also experience high rates of recombination (Maixner ; Yahara ), which can disrupt temporal signal, we suggest that these unusually high estimates be treated with caution until they are verified. For example, our N. meningitidis data set exhibited extensive recombination, precluding standard phylogenetic analyses. The lowest rate estimates obtained here, at ~10−8 substitutions per site year−1, were also comparable to previous studies of slowly evolving bacteria, notably M. tuberculosis (Kay ). Importantly, however, even though they were relatively low, these rates were sufficiently rapid to be accurately estimated using genomic-scale data. In contrast, rate estimates lower than about 1.5×10−8 substitutions per site year−1 could not be validated using our methods and hence may require data sampled over far longer time periods. However, it was also noteworthy that the M. leprae and one of the Y. pestis data sets did not have sufficient temporal structure for rate estimation despite the availability of ancient DNA. Interestingly, a recent analysis of Bronze Age strains of Y. pestis is consistent with a much higher substitution rate in this bacterium, at ~10−7 substitutions per site year−1 (Rasmussen ), in contrast to the data presented here and previously (Cui ; Morelli ). Why these data sets differ so fundamentally in estimated substitution rate is clearly an important area for future study. Nearly all bacterial species investigated here displayed genome evolution that was measurable over a period of 10–100 years. Data sets with sampling times spanning less than 10 years were largely unreliable. Importantly, for several species the strength of temporal signal varied substantially between data sets, and a lack of temporal signal in one data set could not be taken as a general indication of overall lack of signal for the species. Notably, our analysis reveals that evolutionary rates in bacteria display a negative relationship with sampling time, such that their rates can be considered time-dependent (Ho ). This observation is of particular importance in the context of the analysis of genome sequences taken from individual disease outbreaks or transmission chains (Didelot ), as the evolutionary rates measured over these very short time-scales will probably contain deleterious mutations yet to be purged by purifying selection and substitutional saturation is increasingly apparent over time. As such, these estimates are expected to be much higher than those for samples obtained over longer time-scales. For example, based on our data (Fig. 4), we would predict the evolutionary rate estimated for a given bacterial species or clade over a sampling frame of 10 years to be more than an order of magnitude higher than that estimated for the same bacteria sampled over a period of 100 years. Conversely, the longer-term (and lower) evolutionary rates of the type inferred here would tend to over-estimate the time-scale of bacterial transmission when applied to outbreak data. The relationship between evolutionary rate and sampling time can be used to assess the extent to which extrapolating rates of evolution over different time-scales will lead to a bias in estimates of divergence times. To this end, future studies should compare bacterial data sets of the same species across different sampling time-frames to more precisely determine the nature of the time-dependent curve. Strikingly, in some cases the time-dependent pattern appears to hold within closely related bacterial lineages, such as our A. baumanii and Salmonella enterica Paratyphi A data sets. However, it is notable that both reliable rate estimates for M. tuberculosis, estimated over sampling frames of 15 and 895 years, were nearly identical. M. tuberculosis are notoriously slow growing bacteria whose generation time is orders of magnitude slower than the other bacteria analysed here. It is likely that other factors such as genome size and DNA G+C (guanine and cytosine) content also contribute to rate variation between bacteria (Rocha ). Given the time dependency of rate estimates we observed, we suggest future studies of bacterial evolutionary dynamics would be best addressed by comparing multiple independent replicate sample sets for each species, collected over matched time-spans. Overall, our analyses show that genome evolution can now be reliably measured in bacteria and establish a genomic framework for understanding long-term evolutionary dynamics in bacteria.

67 in total

Review 1. Analyzing the mosaic structure of genes.

Authors: J M Smith
Journal: J Mol Evol Date: 1992-02 Impact factor: 2.395

2. Second-pandemic strain of Vibrio cholerae from the Philadelphia cholera outbreak of 1849.

Authors: Alison M Devault; G Brian Golding; Nicholas Waglechner; Jacob M Enk; Melanie Kuch; Joseph H Tien; Mang Shi; David N Fisman; Anna N Dhody; Stephen Forrest; Kirsten I Bos; David J D Earn; Edward C Holmes; Hendrik N Poinar
Journal: N Engl J Med Date: 2014-01-08 Impact factor: 91.245

3. Temporal trends in gonococcal population genetics in a high prevalence urban community.

Authors: Marcos Pérez-Losada; Keith A Crandall; Jonathan Zenilman; Raphael P Viscidi
Journal: Infect Genet Evol Date: 2006-12-01 Impact factor: 3.342

4. Historical variations in mutation rate in an epidemic pathogen, Yersinia pestis.

Authors: Yujun Cui; Chang Yu; Yanfeng Yan; Dongfang Li; Yanjun Li; Thibaut Jombart; Lucy A Weinert; Zuyun Wang; Zhaobiao Guo; Lizhi Xu; Yujiang Zhang; Hancheng Zheng; Nan Qin; Xiao Xiao; Mingshou Wu; Xiaoyi Wang; Dongsheng Zhou; Zhizhen Qi; Zongmin Du; Honglong Wu; Xianwei Yang; Hongzhi Cao; Hu Wang; Jing Wang; Shusen Yao; Alexander Rakin; Yingrui Li; Daniel Falush; Francois Balloux; Mark Achtman; Yajun Song; Jun Wang; Ruifu Yang
Journal: Proc Natl Acad Sci U S A Date: 2012-12-27 Impact factor: 11.205

5. High-throughput sequencing provides insights into genome variation and evolution in Salmonella Typhi.

Authors: Kathryn E Holt; Julian Parkhill; Camila J Mazzoni; Philippe Roumagnac; François-Xavier Weill; Ian Goodhead; Richard Rance; Stephen Baker; Duncan J Maskell; John Wain; Christiane Dolecek; Mark Achtman; Gordon Dougan
Journal: Nat Genet Date: 2008-07-27 Impact factor: 38.330

6. Out-of-Africa migration and Neolithic coexpansion of Mycobacterium tuberculosis with modern humans.

Authors: Iñaki Comas; Mireia Coscolla; Tao Luo; Sonia Borrell; Kathryn E Holt; Midori Kato-Maeda; Julian Parkhill; Bijaya Malla; Stefan Berg; Guy Thwaites; Dorothy Yeboah-Manu; Graham Bothamley; Jian Mei; Lanhai Wei; Stephen Bentley; Simon R Harris; Stefan Niemann; Roland Diel; Abraham Aseffa; Qian Gao; Douglas Young; Sebastien Gagneux
Journal: Nat Genet Date: 2013-09-01 Impact factor: 38.330

7. Bacterial phylogenetic reconstruction from whole genomes is robust to recombination but demographic inference is not.

Authors: Jessica Hedge; Daniel J Wilson
Journal: mBio Date: 2014-11-25 Impact factor: 7.867

8. Prolonged decay of molecular rate estimates for metazoan mitochondrial DNA.

Authors: Martyna Molak; Simon Y W Ho
Journal: PeerJ Date: 2015-03-05 Impact factor: 2.984

9. Early divergent strains of Yersinia pestis in Eurasia 5,000 years ago.

Authors: Simon Rasmussen; Morten Erik Allentoft; Kasper Nielsen; Ludovic Orlando; Martin Sikora; Karl-Göran Sjögren; Anders Gorm Pedersen; Mikkel Schubert; Alex Van Dam; Christian Moliin Outzen Kapel; Henrik Bjørn Nielsen; Søren Brunak; Pavel Avetisyan; Andrey Epimakhov; Mikhail Viktorovich Khalyapin; Artak Gnuni; Aivar Kriiska; Irena Lasak; Mait Metspalu; Vyacheslav Moiseyev; Andrei Gromov; Dalia Pokutta; Lehti Saag; Liivi Varul; Levon Yepiskoposyan; Thomas Sicheritz-Pontén; Robert A Foley; Marta Mirazón Lahr; Rasmus Nielsen; Kristian Kristiansen; Eske Willerslev
Journal: Cell Date: 2015-10-22 Impact factor: 41.582

10. Genomic insights to control the emergence of vancomycin-resistant enterococci.

Authors: Benjamin P Howden; Kathryn E Holt; Margaret M C Lam; Torsten Seemann; Susan Ballard; Geoffrey W Coombs; Steven Y C Tong; M Lindsay Grayson; Paul D R Johnson; Timothy P Stinear
Journal: mBio Date: 2013-08-13 Impact factor: 7.867

92 in total

Review 1. Real-Time Analysis and Visualization of Pathogen Sequence Data.

Authors: Richard A Neher; Trevor Bedford
Journal: J Clin Microbiol Date: 2018-10-25 Impact factor: 5.948

2. Host-Specific Evolutionary and Transmission Dynamics Shape the Functional Diversification of Staphylococcus epidermidis in Human Skin.

Authors: Wei Zhou; Michelle Spoto; Rachel Hardy; Changhui Guan; Elizabeth Fleming; Peter J Larson; Joseph S Brown; Julia Oh
Journal: Cell Date: 2020-01-30 Impact factor: 41.582

Review 3. Ecology and evolution of Mycobacterium tuberculosis.

Authors: Sebastien Gagneux
Journal: Nat Rev Microbiol Date: 2018-02-19 Impact factor: 60.633

4. Genomic Diversity and Recombination among Xylella fastidiosa Subspecies.

Authors: Mathieu Vanhove; Adam C Retchless; Anne Sicard; Adrien Rieux; Helvecio D Coletta-Filho; Leonardo De La Fuente; Drake C Stenger; Rodrigo P P Almeida
Journal: Appl Environ Microbiol Date: 2019-06-17 Impact factor: 4.792

5. Chicken Meat-Associated Enterococci: Influence of Agricultural Antibiotic Use and Connection to the Clinic.

Authors: Abigail L Manson; Daria Van Tyne; Timothy J Straub; Sarah Clock; Michael Crupain; Urvashi Rangan; Michael S Gilmore; Ashlee M Earl
Journal: Appl Environ Microbiol Date: 2019-10-30 Impact factor: 4.792

6. Embeddability and rate identifiability of Kimura 2-parameter matrices.

Authors: Marta Casanellas; Jesús Fernández-Sánchez; Jordi Roca-Lacostena
Journal: J Math Biol Date: 2019-11-08 Impact factor: 2.259

7. Getting sick in the Neolithic.

Authors: Anne C Stone
Journal: Nat Ecol Evol Date: 2020-03 Impact factor: 15.460

Review 8. Population Biology and Comparative Genomics of Campylobacter Species.

Authors: Lennard Epping; Esther-Maria Antão; Torsten Semmler
Journal: Curr Top Microbiol Immunol Date: 2021 Impact factor: 4.291

9. Long-term Persistence of an Extensively Drug-Resistant Subclade of Globally Distributed Pseudomonas aeruginosa Clonal Complex 446 in an Academic Medical Center.

Authors: Nathan B Pincus; Kelly E R Bachta; Egon A Ozer; Jonathan P Allen; Olivia N Pura; Chao Qi; Nathaniel J Rhodes; Francisco M Marty; Alisha Pandit; John J Mekalanos; Antonio Oliver; Alan R Hauser
Journal: Clin Infect Dis Date: 2020-09-12 Impact factor: 9.079

10. Bayesian inference of ancestral dates on bacterial phylogenetic trees.

Authors: Xavier Didelot; Nicholas J Croucher; Stephen D Bentley; Simon R Harris; Daniel J Wilson
Journal: Nucleic Acids Res Date: 2018-12-14 Impact factor: 16.971