Literature DB >> 29038424

Frequency-dependent selection in vaccine-associated pneumococcal population dynamics.

Jukka Corander^1,2,3, Christophe Fraser⁴, Michael U Gutmann⁵, Brian Arnold⁶, William P Hanage⁶, Stephen D Bentley³, Marc Lipsitch^6,7, Nicholas J Croucher⁸.

Abstract

Many bacterial species are composed of multiple lineages distinguished by extensive variation in gene content. These often cocirculate in the same habitat, but the evolutionary and ecological processes that shape these complex populations are poorly understood. Addressing these questions is particularly important for Streptococcus pneumoniae, a nasopharyngeal commensal and respiratory pathogen, because the changes in population structure associated with the recent introduction of partial-coverage vaccines have substantially reduced pneumococcal disease. Here we show that pneumococcal lineages from multiple populations each have a distinct combination of intermediate-frequency genes. Functional analysis suggested that these loci may be subject to negative frequency-dependent selection (NFDS) through interactions with other bacteria, hosts or mobile elements. Correspondingly, these genes had similar frequencies in four populations with dissimilar lineage compositions. These frequencies were maintained following substantial alterations in lineage prevalences once vaccination programmes began. Fitting a multilocus NFDS model of post-vaccine population dynamics to three genomic datasets using Approximate Bayesian Computation generated reproducible estimates of the influence of NFDS on pneumococcal evolution, the strength of which varied between loci. Simulations replicated the stable frequency of lineages unperturbed by vaccination, patterns of serotype switching and clonal replacement. This framework highlights how bacterial ecology affects the impact of clinical interventions.

Entities: CellLine Chemical Disease Gene Mutation Species

Mesh：

Substances：

Year: 2017 PMID： 29038424 PMCID： PMC5708525 DOI： 10.1038/s41559-017-0337-x

Source DB: PubMed Journal: Nat Ecol Evol ISSN： 2397-334X Impact factor: 15.460

Population genomics has revealed many bacterial species exhibit extensive variation in their ‘accessory’ genomes. While neutral evolutionary models can account for such diversity1–3, allowing for heterogeneity in the evolutionary rate between genes significantly improves their fit to genomic data4–6, consistent with selection causing differences in gene content7. If recombination rates are sufficiently high, selection can alter the distribution of individual genes8. However, lower levels of recombination are associated with chromosome-wide sweeps, such that niche specialization at one or more loci can result in largely clonal ‘ecotypes’9,10. Similarly, a recent model suggested selection acting on a high proportion of the genome could partition even freely-recombining bacteria into highly-diverged ‘metabolic types’11. As well as adaptation to particular niches, this latter model11 considered antigenic loci to be under negative frequency-dependent selection (NFDS), the situation in which alleles are most beneficial to genotypes when they are rare. This is based on the assumption antigens become more costly when common, because they are more frequently recognised by acquired immune responses. Such NFDS has been proposed to explain the extensive antigenic diversity of the nasopharyngeal coloniser and respiratory pathogen Streptococcus pneumoniae (the pneumococcus)11–13. This variation makes anti-pneumococcal vaccine development challenging. The first licensed conjugate vaccine (PCV7) targeted seven of over ninety serotypes14, and consequently was associated with ‘serotype replacement’ as vaccine types (VTs) were replaced by non-vaccine types (NVTs), with no overall change in carriage rates15. This was driven by both serotype switching, the replacement of VTs by NVTs that differed at few loci other than that which determined the serotype, and clonal replacement of VTs by distantly-related NVTs. These population dynamics are now amenable to detailed study, having been tracked by genomic surveillance of isolates carried by children in both Massachusetts (USA) 14 and Southampton (UK) 16,17, and isolates from invasive pneumococcal disease in adults in Nijmegen (the Netherlands) 18. Here we use the distribution of the accessory genome across isolates to develop a gene frequency-based model of bacterial population structure based on multiple NFDS mechanisms19,20.

Results

Enrichment of loci under frequency-dependent selection in the accessory genome

Previous analyses of 5,442 clusters of orthologous genes (COGs) in the Massachusetts pneumococcal population suggested those present at intermediate frequencies were important in distinguishing sequence clusters21. To identify functions that were enriched in this set of genes, the 1,112 COGs present in 5% to 95% of isolates and 1,194 core COGs14 were annotated by integrating multiple analyses (Fig 1a & Supplementary Datasets 1 and 2). The most substantive difference was in mobile genetic elements (MGEs; Fisher’s exact test; odds ratio, OR = 336; two-sided p < 2.2x10-16). However, few of these genes were ‘cargo’ beneficial to the host bacterium, and were instead likely to be parasitic, consistent with the distribution of prophage between pneumococci21,22. Correspondingly, restriction modification systems (RMSs) that protect against MGE infection accounted for 2.4% of the intermediate-frequency genes, but were absent from the core COGs. These are most often advantageous when rare, such that the donor of an infecting MGE is unlikely to have the same system19, but typically futile when ubiquitous. Hence the co-existence of lineages likely involves competition between bacteria and MGEs through ‘kill-the-winner’ dynamics, a form of NFDS in which an increase in a genotype’s frequency would be associated with a counterbalancing rise in the prevalence of MGE genotypes able to infect such cells23.

Figure 1

Diversity and structure of the pneumococcal population. a Functional classification of the 1,112 intermediate-frequency and 1,194 core COGs in the Massachusetts pneumococcal population, as detailed in Supplementary Datasets 1 and 2. Each barchart compares the frequencies of functional categories in intermediate-frequency and core COGs. Categories are grouped as likely to be under NFDS resulting from bacterium-MGE interactions (pink segments), bacterium-bacterium interactions (blue segments), or bacterium-host interactions (green segments). The chart with orange segments shows the frequencies of loci with roles in general metabolism or signal transduction, or otherwise could not be classified. b Population structure of the 4,127 isolates from Massachusetts (Mass), Southampton (Soton), Nijmegen and Maela (Supplementary Dataset 3). The maximum likelihood phylogeny was generated from 1,447 core gCOGs. The adjacent columns contain a row for each genome, which represent the population in which the bacterium was isolated, its susceptibility to PCV7-induced immunity, and sequence cluster classification. c Comparison of core genome divergence, quantified as the cophenetic distance between isolates in the core genome phylogeny, and the accessory genome divergence, quantified as the Jaccard distance between the gCOG content of isolates. Each point is a pairwise comparison between randomly sampled isolates (excluding the polyphyletic SC0), which was coloured orange if the isolates belonged to the same sequence cluster; purple if they belonged to different sequence clusters but were both encapsulated; or otherwise dark blue, revealing the presence of some genetically divergent unencapsulated genotypes. Isocontour lines quantify the distribution of points in each category.

Annotation also suggested direct interference competition between bacteria was likely to be important in maintaining a diversity of lineages24. Bacteriocins, which mediate interstrain killing25, were significantly enriched in the accessory genome relative to the core (Fisher’s exact test; OR = 24.0; two-sided p < 2.2x10-16). Although regulatory components of the bacteriocin-like peptide (blp) locus were conserved across the population, most of the gene cluster was composed of various combinations of bacteriocin and immunity protein genes, many of which were found in multiple loci26. Despite this diversity, each of the previously-described fifteen monophyletic sequence clusters14 was typically associated with one distinctive blp allele (Supplementary Fig 1), with an exception being sequence clusters (SCs) 3 and 14, which did not co-exist for long owing to vaccine-induced population dynamics14. Sequence clusters also varied in their complement of rarer bacteriocin biosynthesis gene clusters, including pneumocyclicin27, pneumolancidin28, two loci likely regulated by the TprA/PhrA quorum-sensing system29, and other putative loci (Supplementary Fig 1). No individual gene cluster replicated the diversity of the blp locus, with sequence variation instead often corresponding to disruptive mutations in bacteriocin structural or biosynthetic genes. Assuming relevant phenotypes can be reliably inferred from the gene clusters, such mutations result in bacteria immune to the bacteriocin, but unable to kill competitors. These immune non-producers co-circulate with producer cells carrying the putatively fully-functional allele, and susceptible cells completely lacking the gene cluster. Analogous variation with respect to individual bacteriocins is likely present between the blp loci, given their diverse complements of production and immunity genes. If both biosynthesis and immunity functions are costly, these phenotypes can co-exist through rock-paper-scissors NFDS dynamics as producers kill susceptible cells, immune non-producers outcompete producers, and susceptible cells outcompete immune non-producers30. Hence the distinctive overall bacteriocin production profile of strains may be shaped by NFDS acting on multiple loci. NFDS can also result from competition for resources20,31. A particular nutrient import strategy, either optimized for different nutrients31 or different concentrations of the same nutrient32, will become less advantageous as it becomes more common, as a consequence of more intense competition for the same resource24. While nutrient importers account for 11.5% of the core COGs, because many are universally necessary, they also make up 9.35% of the intermediate-frequency COGs. Hence they are significantly enriched relative to general metabolic genes in the latter category (Fisher’s exact test, OR = 2.48, two-sided p = 2.61x10-8). This suggests NFDS may sustain multiple nutrient acquisition strategies in the population as a consequence of interstrain competition for resources. Antibiotic resistance, also variable between isolates, could be affected by similar competition33. If resistant bacteria are considered adapted to hosts consuming antibiotics, but suffering a cost in untreated hosts, then resistance will be most effective as a resource acquisition strategy where rare owing to the lessened competition with other strains. This could directly result in NFDS, although there are alternative explanations for the co-existence of sensitive and resistant pneumococci that instead imply NFDS through other mechanisms34. A further functional category to be enriched in the intermediate-frequency COGs relative to the core genome were genes encoding for the biosynthesis of immunogenic structures, such as surface proteins35 or the capsule36 (Fisher’s exact test, OR = 2.56, two-sided p = 9.23x10-10). These can be under NFDS as long as alleles are immunologically distinguishable, a criterion met by the serotype-defining capsule36, as well as accessory antigens that are typically either present as large surface structures, or completely absent, such as the pili37. Therefore multiple disparate functions enriched in the intermediate-frequency genes relative to the core genome can each be understood as being subject to NFDS, albeit through different processes. While no NFDS mechanism could be identified for 32.1% of the intermediate frequency COGs, this category is likely to include both metabolic enzymes and signal transduction proteins linked to loci under NFDS on genomic islands, and loci under NFDS that cannot be identified as such owing to incomplete functional information. To test whether these inferences applied to other pneumococcal populations in a similar manner, further genomic datasets were compared to those from Massachusetts.

Population similarities in frequencies of genes, but not genotypes

Overall, 4,127 isolates were combined from available reference sequences, Massachusetts, Southampton, Nijmegen, and the Maela refugee camp in Thailand where the population is unvaccinated38 (Supplementary Dataset 3). A new analysis identified 11,049 ‘global’ COGs (gCOGs), from which a ‘relaxed’ core of 1,447 gCOGs was extracted to generate a maximum likelihood phylogeny (Fig 1b; Supplementary Fig 2). Strikingly, there was little evidence of genetic isolation-by-distance, as both vaccine-type status and country of isolation had a polyphyletic distribution, indicating a history of recombination and frequent international migration. The core alignment was also used to define 74 sequence clusters. Plotting the pairwise core genome divergence of isolates, represented by their cophenetic separation in the tree, against their accessory genome divergence, calculated as the Jaccard distance between the isolates’ gCOG content, demonstrated members of the same sequence cluster were substantially more similar in their accessory, as well as core, genomes (Fig 1c). These differences between lineages were likely biologically meaningful, as they represented a significant proportion of the accessory genome and were preserved despite international dissemination of some genotypes and ongoing horizontal DNA transfer. Although some of the previously-identified atypical unencapsulated lineages were associated with extensive private gene content21, sequence clusters of encapsulated pneumococci each contained few unique accessory loci. The mean numbers of gCOGs present in ≥95% of the isolates in a given sequence cluster, but not meeting this criterion in any other sequence cluster in the same population, were just 16.75 in Massachusetts, 19.94 in Southampton, 19.46 in Nijmegen, and 15.02 Maela (Supplementary Fig 2). Sequence clusters’ distinctiveness instead resulted from the polyclonal distribution of the 1,731 intermediate frequency gCOGs, present in between 5% and 95% of the pre-vaccination isolates in at least one population (Supplementary Fig 2). Hence a long history of recombination was reflected in intermediate-frequency loci being associated with multiple lineages, with each lineage in turn defined by a unique combination of intermediate-frequency loci. Despite the lineages representing discrete and distinct sets of genotypes, their prevalences were highly heterogeneous between the four populations, with a significant correlation only between those in Massachusetts and Southampton (Fig 2a). In marked contrast, the frequencies of accessory gCOGs were strongly correlated between Massachusetts and every other population (Fig 2b; Pearson correlation, two-sided p < 10-15 in all comparisons). This suggests pneumococcal populations are configured by genomic islands being maintained at equilibrium frequencies that are conserved between populations, consistent with their prevalence being influenced by NFDS19. A significant deviation between populations was the elevated frequency of Tn916 in Maela; this transposon underlies tetracycline resistance21, and therefore the difference is likely to represent a location-specific selection pressure rather than drift39. Hence selection appears to shape pneumococcal populations to be similar in frequencies of genes, rather than genotypes.

Figure 2

Distribution of genetic diversity between populations. Column (a) compares the distribution of sequence clusters between populations; the frequency of each sequence cluster in Massachusetts is shown on the horizontal axis, and the corresponding frequencies in Maela, Southampton and Nijmegen are shown on the vertical axes in the plots from top to bottom. Red points correspond to predominantly VT (≥75%) sequence clusters; blue points to predominantly NVT (≥75%) sequence clusters, and black points to mixed sequence clusters. Column (b) compares the distribution of gCOGs between populations. The frequency of each in Massachusetts is shown on the horizontal axis, and the corresponding frequencies in Maela, Southampton and Nijmegen are shown on the vertical axes. Only gCOGs present at a mean frequency between 5% and 95% across the two compared populations were included, and the corresponding points are coloured according to the functional annotation of COGs in Fig 1a. The elevated frequencies of gCOGs encoded by Tn916, including the tetM tetracycline resistance gene, in Maela are annotated. Column (c) compares the pre- and post-vaccination frequencies of sequence clusters in Massachusetts, Southampton and Nijmegen. Points are coloured as in (a), showing the general decline in the frequency of VT sequence clusters. Column (d) compares the pre- and post-vaccination frequency of gCOGs in Massachusetts, Southampton and Nijmegen. Only gCOGs with an overall frequency between 5% and 95% in the relevant population were included in the panels. Points are coloured as in (b). The reduced frequency of the wciN allele involved in synthesis of the VT 6A and 6B capsules is annotated. As the relationships between gCOG frequencies were linear, each panel displays Pearson's correlation statistics, including two-sided p values.

Vaccination as a test of negative frequency-dependent selection

The partial-coverage vaccines introduced to limit pneumococcal disease can be used as a natural experiment, to test whether loci expected to change in frequency due to association with VTs were actually maintained at equilibrium frequencies by NFDS. Although a significant correlation existed between pre- and post-PCV7 sequence cluster frequencies in the three vaccinated populations (Fig 2c), divergence in population composition was driven by the replacement of some VT sequence clusters with distantly-related NVT lineages. Across all comparisons of pre- and post-PCV7 populations, gCOG frequencies showed a stronger positive correlation. This stability in gene frequencies reflected the significant correlation between the post-PCV7 decrease in a gCOG’s absolute frequency in VT isolates, and the contemporaneous increase in its absolute frequency in NVT isolates (Supplementary Fig 3), consistent with the NFDS hypothesis. The greatest deviation in the Massachusetts population was wciN, directly involved in the synthesis of the vaccine-targeted 6A and 6B capsules, reflecting differences in selection pressures between timepoints14. This suggested the equilibrium frequencies of the intermediate frequency gCOGs were likely to govern the post-vaccine restructuring of the population. To quantify whether NFDS of intermediate frequency gCOGs could explain changes in pneumococcal populations better than a neutral model, a discrete time Wright-Fisher multilocus NFDS model was constructed in which the number of offspring produced by a genotype i at generation t, X, was distributed as: General density-dependent competition was parameterised by the total number of pneumococci in the simulated population at time t, N, and the environment’s carrying capacity κ. This was constant across t, reflecting the stable levels of pneumococcal carriage post-PCV715,16. The other demographic process was migration, at rate m (per month-long generation), by which isolates in the resident simulated population were replaced by genotypes randomly selected from the genomic data from the same location. VT genotypes were subject to a fitness cost, v, representing vaccine efficacy at preventing transmission. The final term parameterised NFDS, the strength of which was determined by σ and the exponent π: where l is an intermediate frequency locus (gCOG or antibiotic resistance phenotype), and g is a binary variable indicating whether l is present in genotype i. Each l has an equilibrium frequency e, its prevalence in the pre-vaccination sample, and an instantaneous frequency at generation t, f. Therefore f determines whether l benefits its host, when it is rare relative to e, or has a net cost, when it is common relative to e. Model details are described in Supplementary Fig 4 and the Methods. The σ, v and m parameters were estimated for the Massachusetts population using Approximate Bayesian Computation, an inference technique for intractable simulator-based models40,41. The simulated population was compared to the sequence cluster distribution across three time points (Fig 3a) using the Jensen-Shannon divergence (JSD) to determine similarity. Convergence of the parameter estimates found strong evidence for NFDS (σ significantly greater than its lower bound; Table 1 & Supplementary Table 1, Supplementary Fig 5). The precedent of other models4,6 suggested the fit could be improved by allowing the strength of selection to be heterogeneous across loci. Hence an expanded model featured a proportion, p, of the intermediate frequency loci experiencing NFDS at strength σ, while (1-p) experienced NFDS at strength σ (see Methods). Convergence of parameter estimates again found strong evidence for NFDS (σ and p significantly greater than their lower bounds; Table 1 & Supplementary Table 1, Supplementary Fig 5), with a substantial improvement over the homogeneous selection model, as quantified by the significantly smaller JSD values from appropriately parameterised simulations (Wilcoxon test on 100 simulation pairs, W = 9902, two-sided p = 4.73x10-33; Supplementary Fig 6).

Fig. 3

Comparing the sampled and simulated pneumococcal populations. In each barplot, the bacterial population is split into sequence clusters by vertical black lines, annotated at the top of the graph. Each sequence cluster is split into three timepoints: pre-vaccination, a midpoint sample and a late sample. Only sequence clusters present at greater than 2.5% frequency at one of these timepoints in the genomic sample are included in the graphs; full results are shown in the supplementary materials. The bars at each timepoint are split into red segments, for VT isolates, and blue segments, for NVT isolates. In each comparison, the top row is the genomic sample against which simulations were evaluated. The bottom row summarises the output of 100 simulations using the heterogeneous rate multilocus NFDS model performed using the point estimate parameter values from Table 1. At the times at which samples were present in the respective genomic collections, the same numbers of isolates were randomly selected from the simulated populations. The bars represent the median result, and the error bars (orange for VT isolates, and purple for NVT isolates) represent the interquartile range observed across the simulations. (a) The results for Massachusetts split isolates into pre-vaccination (2001; 133 isolates), midpoint (2004; 203 isolates) and late (2007; 280 isolates) samples. (b) The results for Southampton, splitting isolates into pre-vaccination (up to 2007; 100 isolates), midpoint (2008-2009; 194 isolates) and late (2010-2011; 195 isolates) samples. (c) The results for Nijmegen, splitting isolates into pre-vaccination (up to 2007; 209 isolates), midpoint (2008-2009; 80 isolates) and late (2010-2011; 48 isolates) samples.

Table 1

Parameter estimates from model fits achieved through Approximate Bayesian Computation with BOLFI, run for 2,000 iterations. The displayed values represent point estimates of parameters generated based on the Gaussian process minimisers, with 95% credibility intervals in parentheses where calculated. The simplest neutral model required fitting only v and m to the genomic data. Homogeneous rate (σ, v and m) and heterogeneous rate (σ, v, m, p and σ) fits are shown for the multilocus NFDS model, in which intermediate frequency gCOGs and resistance phenotypes have equilibrium frequencies; for the serotype NFDS model, in which serotypes have equilibrium frequencies; and for the ecotype model, in which sequence clusters have equilibrium frequencies. Replicate fits of the heterogeneous rate multilocus NFDS models to the Massachusetts, Southampton and Nijmegen datasets are shown in Supplementary Table 1 to demonstrate the robustness of the fitting process to stochastic effects.

Population	Model	Maximal NFDS strength, σ_f	Vaccine selection strength, v	Migration rate, m	Proportion of loci under strong NFDS, p_f	Weaker NFDS strength, σ_w
Mass	Neutral	-	0.0375	0.0073	-	-
Mass	Homogeneous rate multilocus NFDS	0.0075 (0.0017 - 0.0234)	0.0733 (0.0430 - 0.1207)	0.0057 (0.0020 - 0.0131)	-	-
Mass	Heterogeneous rate multilocus NFDS	0.1363 (0.0213-0.2113)	0.0812 (0.0491-0.1254)	0.0044 (0.0015-0.0165)	0.2483 (0.1197-0.5448)	0.0023 (0.0010-0.0514)
Mass	Homogeneous rate serotype NFDS	0.0333	0.0415	0.0071	-	-
Mass	Heterogeneous rate serotype NFDS	3.2613	0.0394	0.0053	0.1862	0.0127
Mass	Homogeneous rate ecotype	3.4514	0.0525	0.0090	-	-
Mass	Heterogeneous rate ecotype	1.0101	0.0541	0.0071	0.99	0.0009
Soton	Homogeneous rate multilocus NFDS	0.0028 (0.0010 - 0.0117)	0.1175 (0.0667 - 0.2262)	0.0032 (0.0011 - 0.0132)	-	-
Soton	Heterogeneous rate multilocus NFDS	0.1393 (0.0121 - 0.2148)	0.2063 (0.0832 - 0.3150)	0.0124 (0.0012 - 0.0394)	0.4035 (0.1005 - 0.5951)	0.0023 (0.0010 - 0.0238)
Nijmegen	Homogeneous rate multilocus NFDS	0.0605 (0.0012 - 0.0966)	0.0318 (0.0011 - 0.2621)	0.0018 (0.0009 - 0.0184)	-	-
Nijmegen	Heterogeneous rate multilocus NFDS	0.1462 (0.0013 - 0.2012)	0.0381 (0.0016 - 0.3235)	0.0015 (0.0009 - 0.0060)	0.1988 (0.0013 - 0.8356)	0.0032 (0.0010 -0.1247)
Maela	Heterogeneous rate multilocus NFDS	0.1115 (0.0020 - 0.2138)	0.0011 (0.0010 - 0.0354)	0.0227 (0.0012 - 0.0568)	0.4995 (0.0028 - 0.9468)	0.0129 (0.0010 - 0.1416)

At the locus level, those genes subject to stronger NFDS stabilised close to their equilibrium frequencies, whereas the frequencies of those subject to weaker NFDS drifted near-neutrally (Supplementary Fig 6). At the lineage level, these simulations replicated three important facets of the post-vaccination population dynamics (Fig 3a & Supplementary Fig 7). The first was the stable post-vaccine prevalence of some NVT sequence clusters, such as SC4 and SC8. The second was serotype switching, the replacement of VT by NVT within sequence clusters that remained at stable overall frequencies, as observed in SC1, SC5, SC9 and SC15. The third was clonal replacement of VT by unrelated NVT, such as the contemporaneous disappearance of SC13, SC14, SC22 and SC24, and expansion of SC3, SC6, SC7 and SC11. These trends were not trivial to replicate. The same framework was used to fit a neutral model (NFDS eliminated, with σ = 0); a serotype-focused single locus NFDS model (e applied to serotype, rather than locus, frequencies), and an ecotype model (e applied to sequence cluster, rather than locus, frequencies). Both the neutral and serotype models poorly reproduced the stability of SC8’s frequency, serotype switching within SC9 and SC15, or any patterns of clonal expansion. The ecotype models better reproduced NVT sequence cluster stability and serotype switching, but did not replicate the observed patterns of clonal replacement. All of these models resulted in significantly worse fits to the data than the heterogeneous multilocus NFDS model (Supplementary Figs 6 & 7). The estimated vaccine selection strength, v, of 0.081 per month from the heterogenous rate multilocus NFDS is consistent with PCV7’s halving of the rate at which VT are acquired42, if pneumococci transmit at least once every six months, an interval similar to the carriage duration of VT serotypes43. Similarly, the estimated migration rate, m, of 0.0044 per month suggests half the resident Massachusetts pneumococcal population is replaced by immigrant strains over approximately 13 years, which is realistic given the 50% probability that a pneumococcal lineage was detectable in different localities within Massachusetts after 3-4 years14.

Consistent evidence of NFDS in other populations

The homogeneous and heterogeneous multilocus NFDS models were also fitted to similar surveillance data from Southampton (Supplementary Fig 5 & 8). The JSD values for the heterogenous rate model were again reproducible and significantly smaller than for the homogeneous rate version (Wilcoxon test on 100 simulation pairs, W = 9954, two-sided p = 1.01x10-33). The point estimates of parameter values were again robust and, in the case of the three parameters determining the strength of NFDS, very similar to those for Massachusetts (Table 1 & Supplementary Table 1). However, the vaccine selection strength was estimated to be 2.54-fold higher in Southampton than in Massachusetts. This difference is likely attributable to the substantially higher PCV7 coverage in children under 24 months of age in the years immediately after the vaccine’s introduction in the UK relative to the USA44,45, combined with the lower age range included in the Southampton study, excluding older children who are less likely to have been immunized, or in whom natural acquisition of immunity blunted the selective pressure of the vaccine16,46. Simulations using these point estimates again replicated the strain dynamics observed in the genomic sample (Fig 3b). VT SC5 and SC18 were eliminated at realistic rates; NVT SC3, SC19 and SC35 remained at stable frequencies; serotype switching occurred within SC1 and SC9, while NVT SC2 rose in prevalence at a much faster rate than same lineage did in Massachusetts. The homogeneous and heterogeneous rate multilocus NFDS models were also fitted to a genomic dataset from cases of invasive pneumococcal disease in Nijmegen. The heterogeneous model was again a significantly closer fit to the genomic data, as assessed by the JSDs (Wilcoxon test, W = 3988, two-sided p = 0.0135; Table 1 and Supplementary Fig 5). Precisely replicating the observed population dynamics was difficult (Fig 3c and Supplementary Fig 9), owing to the sparser sampling, particularly post-PCV7, and inevitable bias towards more invasive genotypes in this dataset. While the estimated strength of NFDS was similar to both Massachusetts and Southampton, the estimated vaccine selection strength was lower than in these infant carriage surveillance projects, consistent with the Nijmegen collection being isolated in an adult population primarily protected by herd immunity18. Correspondingly, fitting the heterogenous rate model to the Maela dataset, isolated from an entirely unvaccinated community, estimated v close to zero (Table 1). NFDS acting on genomic islands can also affect variation in the core genome. Comparisons between pre- and post-vaccination populations, and between different locations, revealed allele frequencies of core genome single nucleotide polymorphisms (SNPs) typically showed very similar correlations to those of accessory loci frequencies (Supplementary Fig 10). This was not a consequence of tight linkage between SNPs in the regions flanking genomic islands (Supplementary Fig 10). Nevertheless, simulations in which NFDS acted on only accessory loci precisely replicated the post-vaccination changes in the core SNP allele frequencies, and similar correlations to those between collections were observed in simulations where the Massachusetts population was gradually replaced with isolates from other datasets (Supplementary Fig 10). Therefore while it is possible core genome loci may also be under NFDS, the observed correlations can be attributed to NFDS acting only on accessory loci.

Consequences of NFDS for the impact of vaccination

Simulations were used to investigate counterfactual scenarios. In the absence of vaccination (v = 0), the pre-PCV7 populations were stable in Massachusetts (Supplementary Fig 7), Southampton (Supplementary Fig 8) and Nijmegen (Supplementary Fig 9). Eliminating migration (m = 0) significantly increased the proportion of VTs observed in simulations in all three populations (Wilcoxon tests; Massachusetts, W = 0, two-sided p = 2.56x10-34; Southampton, W = 0, two-sided p = 2.56x10-34; Nijmegen, W = 1453, two-sided p = 4.50x10-18), highlighting the importance of imported or previously rare NVTs in driving out VTs. However, removing NFDS significantly decreased the proportion of VTs observed in all three populations (Wilcoxon tests; W = 10000, two-sided p = 2.56x10-34 in Massachusetts and Southampton; W = 9979, two sided p = 4.81x10-34 in Nijmegen). This is because following vaccination, those loci enriched in VT genotypes become increasingly advantageous to their bacterial hosts as they become rarer, resulting in NFDS slowing the rate at which VT genotypes are eliminated until such loci rise in frequency in NVT genotypes.

Discussion

These combined analyses of multiple population genomic datasets suggest that NFDS plays an important role both in the stable structuring of pneumococcal populations, and their dynamics following disruption by vaccine-induced immunity. According to the best-fitting model, relatively strong NFDS acts on a few hundred accessory genes, corresponding to 5.0% of the Massachusetts pangenome, and 8.3% of that in Southampton. This cumulative effect across multiple loci in complex populations is predicted to maintain stable lineage compositions in the absence of disruption by vaccination, without the oscillatory dynamics associated with some single locus NFDS processes19,30,47,48. Hence multiple lineages can persistently coexist within this framework despite their confinement to a niche, the human nasopharynx, that is physiochemically homogeneous compared with the varied environments inhabited by species often considered as split into ecotypes, such as Escherichia coli. Furthermore, although intraspecific recombinations are slow over the timescales simulated in this study14, horizontal DNA transfer has comprehensively reassorted genomic islands between genotypes over the species’ history. Their consequent polyclonal distribution means accessory locus frequencies can be preserved by multiple lineage combinations, thereby accounting for the diverse population structures observed globally, and the panoply of strains they contain49. While the NFDS processes represented in the multilocus model were also sufficient to explain the major post-vaccination population changes, further work is required to determine whether core loci are also involved. Continued development of such quantitative models with large genomic datasets promises to improve our understanding of how diverse selective pressures affecting bacterial populations shape their response to public health interventions, and how best to design novel pathogen control strategies.

Methods

Annotation of the accessory genome

The previously analysed Massachusetts population14,50 contained 1,112 COGs present in between 5% and 95% of the 616 isolates and 1,194 COGs present in a single copy in every isolate. Information on whether these were associated with capsule polysaccharide synthesis, antibiotic resistance, RMSs, Pneumococcal Pathogenicity Island 1 or MGEs was extracted from previously described analyses14,21,50. Coding sequences (CDSs) associated with proteinaceous immunogenic structures were identified through the results of protein antigen array data35. Candidate bacteriocins were identified using the BAGEL3 algorithm51. The variation at the blp locus, and the other putative bacteriocin production loci, was manually identified within de novo assemblies of the Massachusetts isolates using Artemis and ACT52. The heatmap showing the distribution of the blp alleles in Supplementary Fig 1 was generated by mapping Illumina reads for each of the Massachusetts isolates against the concatenated set of loci using BWA with default settings53. Further information on COG functional domains14 and previous automated annotations50 was additionally used to manually curate all available information into the annotation and classification in Supplementary Datasets 1 and 2.

Bioinformatic analysis of genomic data

The isolate collections analysed each came from systematic sampling of defined host populations. The Massachusetts pneumococcal dataset, isolated from the nasopharynxes of children up to five years of age during routine primary care physician visits, consisted of the 616 de novo assemblies generated with Velvet54 as described previously14,50. VelvetOptimiser55 was used to assemble data from the Maela collection38 (3,085 genomes), isolated from the nasopharynxes of infants up to two years of age, and their mothers, in a Thai refugee camp; the Southampton collection16 (516 genomes), isolated from the nasopharynxes of children up to four years of age during outpatient visits; and the Nijmegen18 collection (337 genomes), isolated from adults hospitalised with bacteraemic pneumonia. These were supplemented with 20 complete, publically available reference genomes (Supplementary Dataset 3). To standardise these genome collections relative to the Massachusetts dataset, assemblies were discarded if they were less than 1.98 Mb, or greater than 2.19 Mb, in length; or had an N50 less than 15 kb14,50; or necessary information was absent from the public databases. Of the 4,586 genomes, 4,462 met these criteria and were included in a preliminary analysis that identified non-pneumococcal streptococci, which were then excluded from the final analysis. Consequently, the final dataset of 4,127 genomes contained 20 reference sequences, 616 Massachusetts sequences, 491 Southampton sequences, 337 Nijmegen sequences, and 2,663 Maela sequences. Each genome was processed with RNAmmer v1.2, to annotate rRNA56; tRNAscan-SE v1.3.1, to annotate tRNA57; Rfam scan, to annotate other non-coding RNA58; scanned for BOX, RUP and SPRITE repeats using HMM profiles59,60; and Prodigal v2.661, to annotate CDSs using a model trained on the genome of S. pneumoniae ATCC 70066962. CDSs that overlapped with the non-coding RNA or short interspersed repeat sequences were then removed from the annotation, and the remaining set translated to allow a non-redundant set of proteins to be identified. A version without low complexity regions was generated through filtering with segmasker63 and masking of choline binding domains. All-against-all comparisons of these protein databases were then generated using BLAT v0.3464. Global COGs (gCOGs) were then generated using COGtriangles and COGcognitor65, and through linking pairs of highly similar sequences, as described previously14. The gCOG nomenclature was then applied to the full, redundant set of protein sequences. To correct for misassemblies, particularly those reflecting differences between the methods used to assemble the Massachusetts isolates’ genomes and those from other populations, false positive CDSs were eliminated from the intermediate frequency gCOGs. A database generated from the annotation of S. pneumoniae ATCC 70066962 was used to search intermediate frequency gCOG DNA sequences using BLASTALL v2.2.25. This identified 39 gCOGs corresponding to fragments of tRNA, oligomers of choline binding domains, or antisense fragments of insertion sequences. This left a final set of 11,049 gCOGs, of which 1,731 were present at a frequency between 5% and 95% in the pre- or peri-vaccination samples (grouped as “pre-vaccination” samples in the Results section) of at least one of the four study populations. To transfer the functional annotation onto the gCOG sequences, the annotated protein sequences from Massachusetts in Supplementary Table 1 were used to identify identical proteins in the gCOG dataset. When COGs could not be matched to gCOGs through this approach, links were instead made through searching gCOGs for proteins with identity to the middle 50% of annotated protein sequences from Massachusetts. These links were then manually curated to categorise the 1,731 intermediate frequency gCOG sequences where possible, as shown in Fig 2.

Analysis of population structure

To analyse the overall population structure, a ‘relaxed’ core set of 1,447 gCOGs were identified that met two criteria: first, that they were present in at least 95% of the isolates; and second, that the total number of gCOG representatives was less than 105% of the number of isolates containing the gCOG, to exclude gCOGs that are present in high copy number in some, or all, genomes. A codon alignment was then generated for each gCOG using mafft v7.22166, excluding any sequences from isolates containing more than one representative of the gCOG. These were concatenated, with gap sites used to pad regions where data were missing for a particular isolate, and a 293,508 bp alignment of polymorphic sites extracted using SNP-sites67. A phylogeny was generated from this alignment using FastTree2 with the ‘fastest’ option68. Population structure was analysed with hierarchical BAPS clustering69 using five independent runs of the estimation algorithm starting from the upper bound of 200-500 clusters, which all converged to the same posterior mode. Two polyphyletic primary BAPS clusters were split into their secondary level clusters, yielding 73 sequence clusters that were almost entirely congruent with the phylogeny, and SC0, which remained polyphyletic. The monophyletic sequence clusters most similar to those in Massachusetts14 were numbered accordingly. The plot in Fig 1c combined cophenetic distances from the core genome phylogeny, extracted with Bioperl70, and Jaccard distance calculated from the presence and absence matrix of gCOGs using the R package vegan71. For each isolate, 100 comparator isolates were selected at random, and this sample of pairwise comparisons used to generate the plot. Of the polymorphic sites in the core genome, 282,043 corresponded to a base in the S. pneumoniae ATCC 700669 reference genome. For each population, the set of sites that were both biallelic and had a non-reference allele frequency between 5% and 95% in that population were extracted with VCFtools v0.1.1472; there were 27,616 of these in the Massachusetts dataset, 26,954 in the Southampton dataset, 28,396 in the Nijmegen dataset, and 30,579 in the Maela dataset. The r2 statistics between these polymorphic sites, and between the binary presence and absence information of accessory gCOGs with a representative in the S. pneumoniae ATCC 700669 genome, were then calculated with VCFtools by treating each isolate as a phased haplotype. These were used to generate the linkage analysis plots in Supplementary Fig 10.

Inference of antibiotic resistance profiles

Individual isolates’ genotypes were used to predict their antimicrobial resistance profiles. The presence of aph3’ (the gCOG CLS350021) was inferred to cause resistance to aminoglycosides; the presence of tetM (CLS03712) was inferred to cause resistance to tetracycline; the presence of cat (CLS01043) was inferred to cause resistance to chloramphenicol; and the presence of ermB (CLS01283), mef (CLS02227), or both was inferred to cause macrolide resistance62,73. These gCOGs themselves were removed from the set of loci used in the simulations, and the inferred antibiotic resistance phenotype used instead. Non-susceptibility to other antibiotics is determined by core genome loci; to incorporate these into the model, resistant alleles of relevant loci were treated analogously to the presence of an accessory resistance gene. The presence of the I100L substitution in the dihydrofolate reductase protein (CLS03211) was inferred to result in resistance to trimethoprim74,75, and the presence of an insertion shortly after S61 in the dihydropteroate synthase protein (CLS01442) was inferred to result in resistance to sulphamethoxazole76. Three penicillin-binding proteins substantially contribute to β-lactam resistance. Using a similar approach to that of Li et al77, the population-wide protein sequences of Pbp1A (CLS01776), Pbp2X (CLS01031) and Pbp2B (CLS01093) were aligned with mafft v7.22166, and the transpeptidase domain regions extracted. Following validation using the isolates from Massachusetts14, sequences exhibiting less than 97% amino acid identity with the susceptible alleles defined by Li et al in the multiple sequence alignment were considered resistance-associated. These antibiotic resistance phenotypes were included as intermediate frequency loci if they met the criteria for a given population.

Multilocus negative frequency dependent selection model

The multilocus negative frequency dependent selection model was generated within a discrete-time Wright-Fisher framework78,79. Although such models were designed with a number of strong assumptions, the results of simulations have been found robust to violations of these conditions80. Each individual i had a genotype g defined by a binary string representing the presence and absence of each gCOG or antibiotic resistance phenotype present at an intermediate frequency in the starting population. The number of offspring arising from i at time t is a Poisson-distributed random variable X. This Poisson approximation is justifiable if only a small proportion of descendants survive to the next generation78, as is likely to be the case for a nasopharyngeal coloniser with a small within-host effective population size81 that experiences a strong bottleneck at transmission. To allow for differential reproductive success between genotypes in a manner that depended on the composition of the overall population, Xi,t was parameterised using the function (Supplementary Fig 4): The four components of the function each correspond to a different biological process. General density-dependent selection depends on κ, the carrying capacity of the environment, and Nt, the total number of individuals at time t. This maintained an approximately stable population size throughout simulations. This is justifiable, as S. pneumoniae colonization levels did not substantially change in the years immediately after PCV7’s introduction. Migration into the population occurred at rate m, subject to the limits 0 ≤ m ≤ 1, and therefore the reproductive fitness of resident individuals was reduced by a factor of (1-m) accordingly to maintain an approximately constant population size of κ. The number of immigrating individuals at time t, N, was a random variable calculated as: Migrant individuals were selected, with replacement, from all isolates observed at any time point in the geographically-specified dataset being studied. Therefore it was the only mechanism by which genotypes not present in the pre-vaccine genome samples could enter the simulated population. To prevent artefactually improving the fit of the model at high values of m through sampling all isolates in proportion to their observed frequency, the selection of an immigrating isolate was biased such that it was equally likely to come from any sequence cluster with at least one representative in the studied population, although these were present at very different frequencies within each population. Hence the probability of an immigrating individual being of genotype i and sequence cluster s, p, was: Where S is the number of sequence clusters in the population, n is the number of isolates in sequence cluster s of genotype i in the genome dataset, and ns was the number of isolates in the sequence cluster s in the genomic dataset. The vaccine selection pressure to which individual i was subject, vi, depended on whether the individuals were of a vaccine serotype or not; for PCV7, the vaccine serotypes were 4, 6B, 9V, 14, 18C, 19F and 23F, as well as 6A, a vaccine-related type to which PCV7 elicited strong cross-immunity14. Consequently, vi was determined as: Where v was subject to the constraint 0 ≤ v ≤ 1. In the homogeneous rate multilocus model of NFDS, the magnitude of this pressure was determined by the term (1 + σ) , where σ ≥ 0. The selection pressure depended on the genotype g and distribution of intermediate frequency loci at time t, as summarised by the exponent π. The calculation of π necessitated determining the frequency f of each locus l at time t in the simulation, using the binary variables g that represent presence or absence of l in i: These were compared to the equilibrium frequencies, e, of the same loci, which were assumed to correspond to their frequencies in the sample of G0 genomes from isolates sampled pre- or peri-vaccination: The overall deviation of the L accessory genome loci included in the simulations, for individual i at time t, πi,t, was calculated as: Therefore if all accessory genes are at their equilibrium frequencies, then (1 + σ) = 1, and NFDS has no effect on an individual’s reproductive fitness. When a genotype contains many genes rarer than their equilibrium frequencies, (1 + σ) > 1, and NFDS increases an individual’s reproductive fitness. Lastly, when a genotype contains many genes more common than their equilibrium frequencies, (1 + σ) < 1, and therefore NFDS reduces an individual’s reproductive fitness. In the absence of l from an individual’s genotype, f has no direct effect on its fitness.

Extension to heterogeneous frequency-dependent selection

Two further parameters were introduced when accessory genes were split into two categories, each subject to a different level of frequency dependent selection. The σ parameter represented the strength of weaker NFDS acting on a fraction, (1-p), of the accessory genes included in the model. To facilitate inference of these two parameters, it was assumed that loci under weaker negative frequency dependent selection would vary in frequency to a greater extent between the initial and final genomic samples; therefore the accessory loci were ordered by the statistic Δ: Where e is the frequency of the gCOG or antibiotic resistance phenotype across all pre- or peri-vaccination samples, as defined previously, and f0, is its frequency across all post-vaccination samples. The denominator is intended to emphasise the effects of gCOGs at frequencies of approximately 50%, which are likely to have a large effect on the overall population structure. The proportion p of genes for which Δl was smallest were considered subject to NFDS with strength (1+σ), whereas the rest where subject to NFDS of strength (1+σ). If the L loci were ordered by ascending values of Δ, then l was the highest ranking meeting the criterion, This resulted in two distinct measures of the deviation of f from e: And: Hence the modified offspring distribution was:

Simulations and parameter estimation

The model was implemented in C++ using the GNU scientific library, and is available for download from https://github.com/nickjcroucher/multiLocusNFDS. In each simulation, genotypes were represented by the gCOGs and antibiotic resistance phenotypes present in between 5% and 95% of the pre- or peri-vaccination population. Hence L was 1,090 for Massachusetts, 1,175 for Southampton, 1,090 for Nijmegen and 1,254 for Maela. For simplicity, κ was assumed to represent the number of pneumococci likely to transmit between individuals in the sampled population. This was estimated to correspond to 25% colonisation of children under ten years of age in the USA and European samples. In Massachusetts15, an under ten population of 828,129 in 2000 82 implied a bacterial population size of 2x105 (105 was actually used for model fitting for computational efficiency; comparing simulations demonstrated this had no detectable effect on the results); in Southampton (including Hampshire and Portsmouth), an under ten population of 202,404 in 201183 implied a bacterial population size of 5x104; and in Nijmegen (including Arnhem), an under ten population of 77,753 in 201184 implied a bacterial population of 2x104. An elevated colonisation rate of 50%85 was used for Maela, where estimating that 15% of the 40,000 residents being under 10 implied a bacterial population size of 3x103. Each simulation was run for a number of timesteps corresponding to the number of months spanned by the genomic collection, excluding early or late years in which sampling was sparse. The well-sampled periods were the 72 months between spring 2001 and spring 2007 for Massachusetts50; the 48 months between spring 2007 and spring 2011 for Southampton16; the 120 months between 2001 and 2011 for Nijmegen18; and the 24 months between 2007 and 2009 for Maela85. All isolates from a single winter were assigned to the year in which the season ended. In simulations of the Nijmegen population, where a substantial proportion of samples pre-dated the vaccine’s introduction, v = 0 for years up to 2007. In each case, the starting population for the simulation, of size κ, was generated by randomly resampling with replacement from the genotypes present in the pre- and peri-vaccination samples in each study; hence the ‘pre-vaccination’ population consisted of isolates sampled up to spring 2001 in Massachusetts, up to spring 2007 in Southampton, and up to 2007 in Nijmegen. These were the genomic samples used to calculate e for all intermediate frequency loci; all later samples were used to calculate f in the definition of Δ. At each time t at which a genomic sample was available, the equivalent number of genotypes was randomly sampled from the simulated population. The similarity between the simulated and genomic samples at t was then calculated as the Jensen-Shannon divergence86 (JSD) between the real and simulated samples: Where f is the simulated frequency of genotypes of sequence cluster s and vaccine type status v at time t, and a is the equivalent value from the genomic sample. This value was summed over all vaccine type statuses and sequence clusters for each timepoint sampled in the genomic dataset to calculate the overall divergence of the simulation from the sampled data. Each set of simulations was run with variation in the parameters v (range 0-0.5); m (range 0-0.2); σ (range 10-6-0.22); σw (range 10-6-0.15; only in the heterogeneous rates model), and p (range 0-1; only in the heterogeneous rates model). Model fitting was achieved through Approximate Bayesian Computation with the BOLFI algorithm40, run for 2,000 iterations of Bayesian optimisation to identify best-fitting parameter sets through minimizing the JSD (Table 1, Supplementary Fig 5). Point estimates of parameter values were generated based on the Gaussian process minimisers, with the distribution of the projected JSD values shown for each fit in Supplementary Fig 5. Exploration of parameter space was performed with logarithmically transformed values to avoid discontinuity of the approximate likelihood function near the natural boundary and to enable better fit of the Gaussian process regression. The 95% posterior credible intervals for the parameters were obtained using three generations of sequential Monte Carlo sampling with the same default settings as used in Gutmann and Corander40 for the pneumococcal day care transmission model.

Alternative model formulations

To test whether equivalently good fits to the genomic data could be achieved using different approaches within the same framework, alternative model formulations were tested. The neutral model was fitted in the same way as the multilocus NFDS models, except that σ was fixed at zero. The serotype NFDS model assumed all serotypes were present at equilibrium frequencies in the pre-vaccine samples, and therefore πi,t was calculated as the deviation of an isolate’s serotype from its initial frequency. This was fitted using both the homogeneous and heterogeneous selection rate models. In the latter case, Δ was calculated by comparing the serotype e values with their post-vaccination frequencies, as for the intermediate frequency loci. The same parameter ranges were used as for the multilocus NFDS model, except σ and σ were allowed to take values within the range 10-6-25 to compensate for the single locus contributing to π and ω. Additionally, to avoid many lower frequency sequence clusters evolving neutrally, serotypes were considered to be at intermediate frequencies if they were between 1% and 99% prevalence in the pre- or peri-vaccination population. The ecotype NFDS model assumed each sequence cluster was adapted to a specific ecological niche, and therefore was present at an equilibrium frequency in the pre-vaccine samples. Therefore π and ω were calculated as the deviation of an isolate’s sequence cluster from its initial frequency. This was fitted using both the homogeneous and heterogeneous selection rate models using the same parameter ranges and intermediate frequency range as for the serotype NFDS model, as well as the same approach to the calculation of Δ. For both the serotype and ecotype models, model fitting was conducted with BOLFI40 as for the multilocus NFDS model, using JSDs to quantify the differences between the simulated and sampled populations. Results are shown in Table 1. These implementations are not intended to represent the optimal versions of each model, but instead demonstrate that the fits of the multilocus NFDS models cannot be trivially replicated through changing the genetic basis of NFDS. Simulations in which isolates from two populations were combined used the pre-vaccination population from Massachusetts and post-vaccine isolates from one of the alternative populations. The initial population was drawn only from Massachusetts; both these isolates, and those from the alternative dataset, could enter the simulated population through migration. When the alternative population corresponded to Southampton or Nijmegen, the population size, number of generations, parameter point estimates, Δ and e values were those of the alternative population. When the alternative population was Maela, the e and Δ values were those of the alternative population, but the simulations were otherwise parameterised for the Massachusetts population, due to the difficulty of obtaining robust point estimates for parameters from the Maela population as a consequence of the lack of vaccine introduction in this location.

Statistical analyses

Statistical analyses, including calculation of Pearson’s R2, Wilcoxon tests, interquartile ranges and Fisher’s exact tests, were performed in R87. Estimation of parameter values and credibility intervals through model fitting were performed with BOLFI40. All reported p values are two-sided.

Code availability

The model code used in this analysis is freely available from the GitHub repository, https://github.com/nickjcroucher/multilocusNFDS.

Data availability

The sequence datasets analysed in the current study are available in the public sequence databases with the accession codes listed in Supplementary Dataset 3. The epidemiological and phylogenetic data analysed in the current study are available from https://microreact.org/project/multilocusNFDS.

76 in total

1. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

2. Local dispersal promotes biodiversity in a real-life game of rock-paper-scissors.

Authors: Benjamin Kerr; Margaret A Riley; Marcus W Feldman; Brendan J M Bohannan
Journal: Nature Date: 2002-07-11 Impact factor: 49.962

3. FastTree 2--approximately maximum-likelihood trees for large alignments.

Authors: Morgan N Price; Paramvir S Dehal; Adam P Arkin
Journal: PLoS One Date: 2010-03-10 Impact factor: 3.240

4. Outpatient antibiotic use in Europe and association with resistance: a cross-national database study.

Authors: Herman Goossens; Matus Ferech; Robert Vander Stichele; Monique Elseviers
Journal: Lancet Date: 2005 Feb 12-18 Impact factor: 79.321

5. MAFFT multiple sequence alignment software version 7: improvements in performance and usability.

Authors: Kazutaka Katoh; Daron M Standley
Journal: Mol Biol Evol Date: 2013-01-16 Impact factor: 16.240

6. Vaccination Drives Changes in Metabolic and Virulence Profiles of Streptococcus pneumoniae.

Authors: Eleanor R Watkins; Bridget S Penman; José Lourenço; Caroline O Buckee; Martin C J Maiden; Sunetra Gupta
Journal: PLoS Pathog Date: 2015-07-16 Impact factor: 6.823

7. Five winters of pneumococcal serotype replacement in UK carriage following PCV introduction.

Authors: Rebecca A Gladstone; Johanna M Jefferies; Anna S Tocheva; Kate R Beard; David Garley; Wei Wei Chong; Stephen D Bentley; Saul N Faust; Stuart C Clarke
Journal: Vaccine Date: 2015-03-14 Impact factor: 3.641

8. Penicillin-Binding Protein Transpeptidase Signatures for Tracking and Predicting β-Lactam Resistance Levels in Streptococcus pneumoniae.

Authors: Yuan Li; Benjamin J Metcalf; Sopio Chochua; Zhongya Li; Robert E Gertz; Hollis Walker; Paulina A Hawkins; Theresa Tran; Cynthia G Whitney; Lesley McGee; Bernard W Beall
Journal: mBio Date: 2016-06-14 Impact factor: 7.867

9. Characterization and modeling of the Haemophilus influenzae core and supragenomes based on the complete genomic sequences of Rd and 12 clinical nontypeable strains.

Authors: Justin S Hogg; Fen Z Hu; Benjamin Janto; Robert Boissy; Jay Hayes; Randy Keefe; J Christopher Post; Garth D Ehrlich
Journal: Genome Biol Date: 2007 Impact factor: 13.583

10. Population genomic datasets describing the post-vaccine evolutionary epidemiology of Streptococcus pneumoniae.

Authors: Nicholas J Croucher; Jonathan A Finkelstein; Stephen I Pelton; Julian Parkhill; Stephen D Bentley; Marc Lipsitch; William P Hanage
Journal: Sci Data Date: 2015-10-27 Impact factor: 6.444

45 in total

1. The great escape.

Authors: Eva Heinz
Journal: Nat Rev Microbiol Date: 2017-12-08 Impact factor: 60.633

2. Immune exclusion by naturally acquired secretory IgA against pneumococcal pilus-1.

Authors: Ulrike Binsker; John A Lees; Alexandria J Hammond; Jeffrey N Weiser
Journal: J Clin Invest Date: 2020-02-03 Impact factor: 14.808

3. Negative frequency-dependent selection maintains coexisting genotypes during fluctuating selection.

Authors: Caroline B Turner; Sean W Buskirk; Katrina B Harris; Vaughn S Cooper
Journal: Mol Ecol Date: 2019-12-05 Impact factor: 6.185

4. Pneumococcal quorum sensing drives an asymmetric owner-intruder competitive strategy during carriage via the competence regulon.

Authors: Pamela Shen; John A Lees; Gavyn Chern Wei Bee; Sam P Brown; Jeffrey N Weiser
Journal: Nat Microbiol Date: 2018-12-10 Impact factor: 17.745

5. Pneumococcal within-host diversity during colonization, transmission and treatment.

Authors: Gerry Tonkin-Hill; Clare Ling; Chrispin Chaguza; Susannah J Salter; Pattaraporn Hinfonthong; Elissavet Nikolaou; Natalie Tate; Andrzej Pastusiak; Claudia Turner; Claire Chewapreecha; Simon D W Frost; Jukka Corander; Nicholas J Croucher; Paul Turner; Stephen D Bentley
Journal: Nat Microbiol Date: 2022-10-10 Impact factor: 30.964

6. The research and development process for multiscale models of infectious disease systems.

Authors: Winston Garira
Journal: PLoS Comput Biol Date: 2020-04-02 Impact factor: 4.475

7. Atlas of group A streptococcal vaccine candidates compiled using large-scale comparative genomics.

Authors: Mark R Davies; Liam McIntyre; Ankur Mutreja; Jake A Lacey; John A Lees; Rebecca J Towers; Sebastián Duchêne; Pierre R Smeesters; Hannah R Frost; David J Price; Matthew T G Holden; Sophia David; Philip M Giffard; Kate A Worthing; Anna C Seale; James A Berkley; Simon R Harris; Tania Rivera-Hernandez; Olga Berking; Amanda J Cork; Rosângela S L A Torres; Trevor Lithgow; Richard A Strugnell; Rene Bergmann; Patric Nitsche-Schmitz; Gusharan S Chhatwal; Stephen D Bentley; John D Fraser; Nicole J Moreland; Jonathan R Carapetis; Andrew C Steer; Julian Parkhill; Allan Saul; Deborah A Williamson; Bart J Currie; Steven Y C Tong; Gordon Dougan; Mark J Walker
Journal: Nat Genet Date: 2019-05-27 Impact factor: 38.330

8. Negative frequency-dependent selection and asymmetrical transformation stabilise multi-strain bacterial population structures.

Authors: Gabrielle L Harrow; John A Lees; William P Hanage; Marc Lipsitch; Jukka Corander; Caroline Colijn; Nicholas J Croucher
Journal: ISME J Date: 2021-01-06 Impact factor: 10.302

9. Panproteome-wide analysis of antibody responses to whole cell pneumococcal vaccination.

Authors: Joseph J Campo; Timothy Q Le; Jozelyn V Pablo; Christopher Hung; Andy A Teng; Hervé Tettelin; Andrea Tate; William P Hanage; Mark R Alderson; Xiaowu Liang; Richard Malley; Marc Lipsitch; Nicholas J Croucher
Journal: Elife Date: 2018-12-28 Impact factor: 8.140

10. Serotype Patterns of Pneumococcal Disease in Adults Are Correlated With Carriage Patterns in Older Children.

Authors: Anne L Wyllie; Joshua L Warren; Gili Regev-Yochay; Noga Givon-Lavi; Ron Dagan; Daniel M Weinberger
Journal: Clin Infect Dis Date: 2021-06-01 Impact factor: 9.079