Jukka Corander1,2,3, Christophe Fraser4, Michael U Gutmann5, Brian Arnold6, William P Hanage6, Stephen D Bentley3, Marc Lipsitch6,7, Nicholas J Croucher8. 1. Helsinki Institute for Information Technology, Department of Mathematics and Statistics, University of Helsinki, 00014, Helsinki, Finland. 2. Department of Biostatistics, University of Oslo, 0317, Oslo, Norway. 3. Infection Genomics, The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. 4. Big Data Institute, Nuffield Department of Medicine, University of Oxford, Oxford, OX3 7LF, UK. 5. School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK. 6. Center for Communicable Disease Dynamics, Harvard T. H. Chan School of Public Health, 677 Huntington Avenue, Boston, MA, 02115, USA. 7. Departments of Epidemiology and Immunology and Infectious Diseases, Harvard T. H. Chan School of Public Health, 677 Huntington Avenue, Boston, MA, 02115, USA. 8. MRC Centre for Outbreak Analysis and Modelling, Department of Infectious Disease Epidemiology, Imperial College London, London, W2 1PG, UK. n.croucher@imperial.ac.uk.
Abstract
Many bacterial species are composed of multiple lineages distinguished by extensive variation in gene content. These often cocirculate in the same habitat, but the evolutionary and ecological processes that shape these complex populations are poorly understood. Addressing these questions is particularly important for Streptococcus pneumoniae, a nasopharyngeal commensal and respiratory pathogen, because the changes in population structure associated with the recent introduction of partial-coverage vaccines have substantially reduced pneumococcal disease. Here we show that pneumococcal lineages from multiple populations each have a distinct combination of intermediate-frequency genes. Functional analysis suggested that these loci may be subject to negative frequency-dependent selection (NFDS) through interactions with other bacteria, hosts or mobile elements. Correspondingly, these genes had similar frequencies in four populations with dissimilar lineage compositions. These frequencies were maintained following substantial alterations in lineage prevalences once vaccination programmes began. Fitting a multilocus NFDS model of post-vaccine population dynamics to three genomic datasets using Approximate Bayesian Computation generated reproducible estimates of the influence of NFDS on pneumococcal evolution, the strength of which varied between loci. Simulations replicated the stable frequency of lineages unperturbed by vaccination, patterns of serotype switching and clonal replacement. This framework highlights how bacterial ecology affects the impact of clinical interventions.
Many bacterial species are composed of multiple lineages distinguished by extensive variation in gene content. These often cocirculate in the same habitat, but the evolutionary and ecological processes that shape these complex populations are poorly understood. Addressing these questions is particularly important for Streptococcus pneumoniae, a nasopharyngeal commensal and respiratory pathogen, because the changes in population structure associated with the recent introduction of partial-coverage vaccines have substantially reduced pneumococcal disease. Here we show that pneumococcal lineages from multiple populations each have a distinct combination of intermediate-frequency genes. Functional analysis suggested that these loci may be subject to negative frequency-dependent selection (NFDS) through interactions with other bacteria, hosts or mobile elements. Correspondingly, these genes had similar frequencies in four populations with dissimilar lineage compositions. These frequencies were maintained following substantial alterations in lineage prevalences once vaccination programmes began. Fitting a multilocus NFDS model of post-vaccine population dynamics to three genomic datasets using Approximate Bayesian Computation generated reproducible estimates of the influence of NFDS on pneumococcal evolution, the strength of which varied between loci. Simulations replicated the stable frequency of lineages unperturbed by vaccination, patterns of serotype switching and clonal replacement. This framework highlights how bacterial ecology affects the impact of clinical interventions.
Population genomics has revealed many bacterial species exhibit extensive variation in their ‘accessory’ genomes. While neutral evolutionary models can account for such diversity1–3, allowing for heterogeneity in the evolutionary rate between genes significantly improves their fit to genomic data4–6, consistent with selection causing differences in gene content7. If recombination rates are sufficiently high, selection can alter the distribution of individual genes8. However, lower levels of recombination are associated with chromosome-wide sweeps, such that niche specialization at one or more loci can result in largely clonal ‘ecotypes’9,10. Similarly, a recent model suggested selection acting on a high proportion of the genome could partition even freely-recombining bacteria into highly-diverged ‘metabolic types’11. As well as adaptation to particular niches, this latter model11 considered antigenic loci to be under negative frequency-dependent selection (NFDS), the situation in which alleles are most beneficial to genotypes when they are rare. This is based on the assumption antigens become more costly when common, because they are more frequently recognised by acquired immune responses.Such NFDS has been proposed to explain the extensive antigenic diversity of the nasopharyngeal coloniser and respiratory pathogen Streptococcus pneumoniae (the pneumococcus)11–13. This variation makes anti-pneumococcal vaccine development challenging. The first licensed conjugate vaccine (PCV7) targeted seven of over ninety serotypes14, and consequently was associated with ‘serotype replacement’ as vaccine types (VTs) were replaced by non-vaccine types (NVTs), with no overall change in carriage rates15. This was driven by both serotype switching, the replacement of VTs by NVTs that differed at few loci other than that which determined the serotype, and clonal replacement of VTs by distantly-related NVTs. These population dynamics are now amenable to detailed study, having been tracked by genomic surveillance of isolates carried by children in both Massachusetts (USA) 14 and Southampton (UK) 16,17, and isolates from invasive pneumococcal disease in adults in Nijmegen (the Netherlands) 18. Here we use the distribution of the accessory genome across isolates to develop a gene frequency-based model of bacterial population structure based on multiple NFDS mechanisms19,20.
Results
Enrichment of loci under frequency-dependent selection in the accessory genome
Previous analyses of 5,442 clusters of orthologous genes (COGs) in the Massachusetts
pneumococcal population suggested those present at intermediate frequencies were
important in distinguishing sequence clusters21. To identify functions that were enriched in this set of genes,
the 1,112 COGs present in 5% to 95% of isolates and 1,194 core COGs14 were annotated by integrating multiple
analyses (Fig 1a & Supplementary Datasets 1
and 2). The most
substantive difference was in mobile genetic elements (MGEs; Fisher’s
exact test; odds ratio, OR = 336; two-sided p <
2.2x10-16). However, few of these genes were
‘cargo’ beneficial to the host bacterium, and were instead likely
to be parasitic, consistent with the distribution of prophage between
pneumococci21,22. Correspondingly, restriction modification systems
(RMSs) that protect against MGE infection accounted for 2.4% of the
intermediate-frequency genes, but were absent from the core COGs. These are most
often advantageous when rare, such that the donor of an infecting MGE is
unlikely to have the same system19, but
typically futile when ubiquitous. Hence the co-existence of lineages likely
involves competition between bacteria and MGEs through
‘kill-the-winner’ dynamics, a form of NFDS in which an increase in
a genotype’s frequency would be associated with a counterbalancing rise
in the prevalence of MGE genotypes able to infect such cells23.
Figure 1
Diversity and structure of the pneumococcal population. a Functional classification of the 1,112 intermediate-frequency and 1,194 core COGs in the Massachusetts pneumococcal population, as detailed in Supplementary Datasets 1 and 2. Each barchart compares the frequencies of functional categories in intermediate-frequency and core COGs. Categories are grouped as likely to be under NFDS resulting from bacterium-MGE interactions (pink segments), bacterium-bacterium interactions (blue segments), or bacterium-host interactions (green segments). The chart with orange segments shows the frequencies of loci with roles in general metabolism or signal transduction, or otherwise could not be classified. b Population structure of the 4,127 isolates from Massachusetts (Mass), Southampton (Soton), Nijmegen and Maela (Supplementary Dataset 3). The maximum likelihood phylogeny was generated from 1,447 core gCOGs. The adjacent columns contain a row for each genome, which represent the population in which the bacterium was isolated, its susceptibility to PCV7-induced immunity, and sequence cluster classification. c Comparison of core genome divergence, quantified as the cophenetic distance between isolates in the core genome phylogeny, and the accessory genome divergence, quantified as the Jaccard distance between the gCOG content of isolates. Each point is a pairwise comparison between randomly sampled isolates (excluding the polyphyletic SC0), which was coloured orange if the isolates belonged to the same sequence cluster; purple if they belonged to different sequence clusters but were both encapsulated; or otherwise dark blue, revealing the presence of some genetically divergent unencapsulated genotypes. Isocontour lines quantify the distribution of points in each category.
Annotation also suggested direct interference competition between bacteria was likely to
be important in maintaining a diversity of lineages24. Bacteriocins, which mediate interstrain killing25, were significantly enriched in the
accessory genome relative to the core (Fisher’s exact test; OR = 24.0;
two-sided p < 2.2x10-16). Although regulatory
components of the bacteriocin-like peptide (blp) locus were
conserved across the population, most of the gene cluster was composed of
various combinations of bacteriocin and immunity protein genes, many of which
were found in multiple loci26. Despite
this diversity, each of the previously-described fifteen monophyletic sequence
clusters14 was typically associated
with one distinctive blp allele (Supplementary Fig 1),
with an exception being sequence clusters (SCs) 3 and 14, which did not co-exist
for long owing to vaccine-induced population dynamics14.Sequence clusters also varied in their complement of rarer bacteriocin biosynthesis gene clusters, including pneumocyclicin27, pneumolancidin28, two loci likely regulated by the TprA/PhrA quorum-sensing system29, and other putative loci (Supplementary Fig 1). No individual gene cluster replicated the diversity of the blp locus, with sequence variation instead often corresponding to disruptive mutations in bacteriocin structural or biosynthetic genes. Assuming relevant phenotypes can be reliably inferred from the gene clusters, such mutations result in bacteria immune to the bacteriocin, but unable to kill competitors. These immune non-producers co-circulate with producer cells carrying the putatively fully-functional allele, and susceptible cells completely lacking the gene cluster. Analogous variation with respect to individual bacteriocins is likely present between the blp loci, given their diverse complements of production and immunity genes. If both biosynthesis and immunity functions are costly, these phenotypes can co-exist through rock-paper-scissors NFDS dynamics as producers kill susceptible cells, immune non-producers outcompete producers, and susceptible cells outcompete immune non-producers30. Hence the distinctive overall bacteriocin production profile of strains may be shaped by NFDS acting on multiple loci.NFDS can also result from competition for resources20,31. A particular nutrient
import strategy, either optimized for different nutrients31 or different concentrations of the same nutrient32, will become less advantageous as it
becomes more common, as a consequence of more intense competition for the same
resource24. While nutrient importers
account for 11.5% of the core COGs, because many are universally necessary, they
also make up 9.35% of the intermediate-frequency COGs. Hence they are
significantly enriched relative to general metabolic genes in the latter
category (Fisher’s exact test, OR = 2.48, two-sided p =
2.61x10-8). This suggests NFDS may sustain multiple nutrient
acquisition strategies in the population as a consequence of interstrain
competition for resources.Antibiotic resistance, also variable between isolates, could be affected by similar
competition33. If resistant bacteria
are considered adapted to hosts consuming antibiotics, but suffering a cost in
untreated hosts, then resistance will be most effective as a resource
acquisition strategy where rare owing to the lessened competition with other
strains. This could directly result in NFDS, although there are alternative
explanations for the co-existence of sensitive and resistant pneumococci that
instead imply NFDS through other mechanisms34.A further functional category to be enriched in the intermediate-frequency COGs relative to the core genome were genes encoding for the biosynthesis of immunogenic structures, such as surface proteins35 or the capsule36 (Fisher’s exact test, OR = 2.56, two-sided p = 9.23x10-10). These can be under NFDS as long as alleles are immunologically distinguishable, a criterion met by the serotype-defining capsule36, as well as accessory antigens that are typically either present as large surface structures, or completely absent, such as the pili37.Therefore multiple disparate functions enriched in the intermediate-frequency genes relative to the core genome can each be understood as being subject to NFDS, albeit through different processes. While no NFDS mechanism could be identified for 32.1% of the intermediate frequency COGs, this category is likely to include both metabolic enzymes and signal transduction proteins linked to loci under NFDS on genomic islands, and loci under NFDS that cannot be identified as such owing to incomplete functional information. To test whether these inferences applied to other pneumococcal populations in a similar manner, further genomic datasets were compared to those from Massachusetts.
Population similarities in frequencies of genes, but not genotypes
Overall, 4,127 isolates were combined from available reference sequences, Massachusetts, Southampton, Nijmegen, and the Maela refugee camp in Thailand where the population is unvaccinated38 (Supplementary Dataset 3). A new analysis identified 11,049 ‘global’ COGs (gCOGs), from which a ‘relaxed’ core of 1,447 gCOGs was extracted to generate a maximum likelihood phylogeny (Fig 1b; Supplementary Fig 2). Strikingly, there was little evidence of genetic isolation-by-distance, as both vaccine-type status and country of isolation had a polyphyletic distribution, indicating a history of recombination and frequent international migration.The core alignment was also used to define 74 sequence clusters. Plotting the pairwise
core genome divergence of isolates, represented by their cophenetic separation
in the tree, against their accessory genome divergence, calculated as the
Jaccard distance between the isolates’ gCOG content, demonstrated members
of the same sequence cluster were substantially more similar in their accessory,
as well as core, genomes (Fig 1c). These
differences between lineages were likely biologically meaningful, as they
represented a significant proportion of the accessory genome and were preserved
despite international dissemination of some genotypes and ongoing horizontal DNA
transfer. Although some of the previously-identified atypical unencapsulated
lineages were associated with extensive private gene content21, sequence clusters of encapsulated
pneumococci each contained few unique accessory loci. The mean numbers of gCOGs
present in ≥95% of the isolates in a given sequence cluster, but not
meeting this criterion in any other sequence cluster in the same population,
were just 16.75 in Massachusetts, 19.94 in Southampton, 19.46 in Nijmegen, and
15.02 Maela (Supplementary Fig
2). Sequence clusters’ distinctiveness instead resulted from
the polyclonal distribution of the 1,731 intermediate frequency gCOGs, present
in between 5% and 95% of the pre-vaccination isolates in at least one population
(Supplementary Fig
2). Hence a long history of recombination was reflected in
intermediate-frequency loci being associated with multiple lineages, with each
lineage in turn defined by a unique combination of intermediate-frequency
loci.Despite the lineages representing discrete and distinct sets of genotypes, their prevalences were highly heterogeneous between the four populations, with a significant correlation only between those in Massachusetts and Southampton (Fig 2a). In marked contrast, the frequencies of accessory gCOGs were strongly correlated between Massachusetts and every other population (Fig 2b; Pearson correlation, two-sided p < 10-15 in all comparisons). This suggests pneumococcal populations are configured by genomic islands being maintained at equilibrium frequencies that are conserved between populations, consistent with their prevalence being influenced by NFDS19. A significant deviation between populations was the elevated frequency of Tn916 in Maela; this transposon underlies tetracycline resistance21, and therefore the difference is likely to represent a location-specific selection pressure rather than drift39. Hence selection appears to shape pneumococcal populations to be similar in frequencies of genes, rather than genotypes.
Figure 2
Distribution of genetic diversity between populations. Column (a) compares the
distribution of sequence clusters between populations; the frequency of each
sequence cluster in Massachusetts is shown on the horizontal axis, and the
corresponding frequencies in Maela, Southampton and Nijmegen are shown on the
vertical axes in the plots from top to bottom. Red points correspond to
predominantly VT (≥75%) sequence clusters; blue points to predominantly
NVT (≥75%) sequence clusters, and black points to mixed sequence
clusters. Column (b) compares the distribution of gCOGs between
populations. The frequency of each in Massachusetts is shown on the horizontal
axis, and the corresponding frequencies in Maela, Southampton and Nijmegen are
shown on the vertical axes. Only gCOGs present at a mean frequency between 5%
and 95% across the two compared populations were included, and the corresponding
points are coloured according to the functional annotation of COGs in Fig 1a. The elevated frequencies of gCOGs
encoded by Tn916, including the tetM
tetracycline resistance gene, in Maela are annotated. Column (c)
compares the pre- and post-vaccination frequencies of sequence clusters in
Massachusetts, Southampton and Nijmegen. Points are coloured as in (a), showing
the general decline in the frequency of VT sequence clusters. Column
(d) compares the pre- and post-vaccination frequency of gCOGs
in Massachusetts, Southampton and Nijmegen. Only gCOGs with an overall frequency
between 5% and 95% in the relevant population were included in the panels.
Points are coloured as in (b). The reduced frequency of the
wciN allele involved in synthesis of the VT 6A and 6B
capsules is annotated. As the relationships between gCOG frequencies were
linear, each panel displays Pearson's correlation statistics, including
two-sided p values.
Vaccination as a test of negative frequency-dependent selection
The partial-coverage vaccines introduced to limit pneumococcal disease can be used as a
natural experiment, to test whether loci expected to change in frequency due to
association with VTs were actually maintained at equilibrium frequencies by
NFDS. Although a significant correlation existed between pre- and post-PCV7
sequence cluster frequencies in the three vaccinated populations (Fig 2c), divergence in population composition
was driven by the replacement of some VT sequence clusters with
distantly-related NVT lineages. Across all comparisons of pre- and post-PCV7
populations, gCOG frequencies showed a stronger positive correlation. This
stability in gene frequencies reflected the significant correlation between the
post-PCV7 decrease in a gCOG’s absolute frequency in VT isolates, and the
contemporaneous increase in its absolute frequency in NVT isolates (Supplementary Fig 3),
consistent with the NFDS hypothesis. The greatest deviation in the Massachusetts
population was wciN, directly involved in the synthesis of the
vaccine-targeted 6A and 6B capsules, reflecting differences in selection
pressures between timepoints14. This
suggested the equilibrium frequencies of the intermediate frequency gCOGs were
likely to govern the post-vaccine restructuring of the population.To quantify whether NFDS of intermediate frequency gCOGs could explain changes in pneumococcal populations better than a neutral model, a discrete time Wright-Fisher multilocus NFDS model was constructed in which the number of offspring produced by a genotype i at generation t, X, was distributed as:General density-dependent competition was parameterised by the total number of
pneumococci in the simulated population at time t,
N, and the environment’s carrying
capacity κ. This was constant across t, reflecting the
stable levels of pneumococcal carriage post-PCV715,16. The other demographic
process was migration, at rate m (per month-long generation),
by which isolates in the resident simulated population were replaced by
genotypes randomly selected from the genomic data from the same location. VT
genotypes were subject to a fitness cost, v, representing
vaccine efficacy at preventing transmission. The final term parameterised NFDS,
the strength of which was determined by σ
and the exponent π: where l is an intermediate
frequency locus (gCOG or antibiotic resistance phenotype), and
g is a binary variable indicating whether
l is present in genotype i. Each
l has an equilibrium frequency
e, its prevalence in the pre-vaccination
sample, and an instantaneous frequency at generation t,
f. Therefore
f determines whether l
benefits its host, when it is rare relative to e,
or has a net cost, when it is common relative to e.
Model details are described in Supplementary Fig 4 and the Methods.The σ, v and m parameters were estimated for the Massachusetts population using Approximate Bayesian Computation, an inference technique for intractable simulator-based models40,41. The simulated population was compared to the sequence cluster distribution across three time points (Fig 3a) using the Jensen-Shannon divergence (JSD) to determine similarity. Convergence of the parameter estimates found strong evidence for NFDS (σ significantly greater than its lower bound; Table 1 & Supplementary Table 1, Supplementary Fig 5). The precedent of other models4,6 suggested the fit could be improved by allowing the strength of selection to be heterogeneous across loci. Hence an expanded model featured a proportion, p, of the intermediate frequency loci experiencing NFDS at strength σ, while (1-p) experienced NFDS at strength σ (see Methods). Convergence of parameter estimates again found strong evidence for NFDS (σ and p significantly greater than their lower bounds; Table 1 & Supplementary Table 1, Supplementary Fig 5), with a substantial improvement over the homogeneous selection model, as quantified by the significantly smaller JSD values from appropriately parameterised simulations (Wilcoxon test on 100 simulation pairs, W = 9902, two-sided p = 4.73x10-33; Supplementary Fig 6).
Fig. 3
Comparing the sampled and simulated pneumococcal populations. In each barplot, the bacterial population is split into sequence clusters by vertical black lines, annotated at the top of the graph. Each sequence cluster is split into three timepoints: pre-vaccination, a midpoint sample and a late sample. Only sequence clusters present at greater than 2.5% frequency at one of these timepoints in the genomic sample are included in the graphs; full results are shown in the supplementary materials. The bars at each timepoint are split into red segments, for VT isolates, and blue segments, for NVT isolates. In each comparison, the top row is the genomic sample against which simulations were evaluated. The bottom row summarises the output of 100 simulations using the heterogeneous rate multilocus NFDS model performed using the point estimate parameter values from Table 1. At the times at which samples were present in the respective genomic collections, the same numbers of isolates were randomly selected from the simulated populations. The bars represent the median result, and the error bars (orange for VT isolates, and purple for NVT isolates) represent the interquartile range observed across the simulations. (a) The results for Massachusetts split isolates into pre-vaccination (2001; 133 isolates), midpoint (2004; 203 isolates) and late (2007; 280 isolates) samples. (b) The results for Southampton, splitting isolates into pre-vaccination (up to 2007; 100 isolates), midpoint (2008-2009; 194 isolates) and late (2010-2011; 195 isolates) samples. (c) The results for Nijmegen, splitting isolates into pre-vaccination (up to 2007; 209 isolates), midpoint (2008-2009; 80 isolates) and late (2010-2011; 48 isolates) samples.
Table 1
Parameter estimates from model fits achieved through Approximate Bayesian Computation with BOLFI,
run for 2,000 iterations. The displayed values represent point estimates of
parameters generated based on the Gaussian process minimisers, with 95%
credibility intervals in parentheses where calculated. The simplest neutral
model required fitting only v and m to the
genomic data. Homogeneous rate (σ,
v and m) and heterogeneous rate
(σ, v,
m, p and
σ) fits are shown for the multilocus
NFDS model, in which intermediate frequency gCOGs and resistance phenotypes have
equilibrium frequencies; for the serotype NFDS model, in which serotypes have
equilibrium frequencies; and for the ecotype model, in which sequence clusters
have equilibrium frequencies. Replicate fits of the heterogeneous rate
multilocus NFDS models to the Massachusetts, Southampton and Nijmegen datasets
are shown in Supplementary
Table 1 to demonstrate the robustness of the fitting process to
stochastic effects.
Population
Model
Maximal NFDS strength, σf
Vaccine selection strength, v
Migration rate, m
Proportion of loci under strong NFDS, pf
Weaker NFDS strength, σw
Mass
Neutral
-
0.0375
0.0073
-
-
Mass
Homogeneous rate multilocus NFDS
0.0075 (0.0017 - 0.0234)
0.0733 (0.0430 - 0.1207)
0.0057 (0.0020 - 0.0131)
-
-
Mass
Heterogeneous rate multilocus NFDS
0.1363 (0.0213-0.2113)
0.0812 (0.0491-0.1254)
0.0044 (0.0015-0.0165)
0.2483 (0.1197-0.5448)
0.0023 (0.0010-0.0514)
Mass
Homogeneous rate serotype NFDS
0.0333
0.0415
0.0071
-
-
Mass
Heterogeneous rate serotype NFDS
3.2613
0.0394
0.0053
0.1862
0.0127
Mass
Homogeneous rate ecotype
3.4514
0.0525
0.0090
-
-
Mass
Heterogeneous rate ecotype
1.0101
0.0541
0.0071
0.99
0.0009
Soton
Homogeneous rate multilocus NFDS
0.0028 (0.0010 - 0.0117)
0.1175 (0.0667 - 0.2262)
0.0032 (0.0011 - 0.0132)
-
-
Soton
Heterogeneous rate multilocus NFDS
0.1393 (0.0121 - 0.2148)
0.2063 (0.0832 - 0.3150)
0.0124 (0.0012 - 0.0394)
0.4035 (0.1005 - 0.5951)
0.0023 (0.0010 - 0.0238)
Nijmegen
Homogeneous rate multilocus NFDS
0.0605 (0.0012 - 0.0966)
0.0318 (0.0011 - 0.2621)
0.0018 (0.0009 - 0.0184)
-
-
Nijmegen
Heterogeneous rate multilocus NFDS
0.1462 (0.0013 - 0.2012)
0.0381 (0.0016 - 0.3235)
0.0015 (0.0009 - 0.0060)
0.1988 (0.0013 - 0.8356)
0.0032 (0.0010 -0.1247)
Maela
Heterogeneous rate multilocus NFDS
0.1115 (0.0020 - 0.2138)
0.0011 (0.0010 - 0.0354)
0.0227 (0.0012 - 0.0568)
0.4995 (0.0028 - 0.9468)
0.0129 (0.0010 - 0.1416)
At the locus level, those genes subject to stronger NFDS stabilised close to their
equilibrium frequencies, whereas the frequencies of those subject to weaker NFDS
drifted near-neutrally (Supplementary Fig 6). At the lineage level, these simulations
replicated three important facets of the post-vaccination population dynamics
(Fig 3a & Supplementary Fig 7). The
first was the stable post-vaccine prevalence of some NVT sequence clusters, such
as SC4 and SC8. The second was serotype switching, the replacement of VT by NVT
within sequence clusters that remained at stable overall frequencies, as
observed in SC1, SC5, SC9 and SC15. The third was clonal replacement of VT by
unrelated NVT, such as the contemporaneous disappearance of SC13, SC14, SC22 and
SC24, and expansion of SC3, SC6, SC7 and SC11. These trends were not trivial to
replicate. The same framework was used to fit a neutral model (NFDS eliminated,
with σ = 0); a serotype-focused single locus
NFDS model (e applied to serotype, rather than
locus, frequencies), and an ecotype model (e
applied to sequence cluster, rather than locus, frequencies). Both the neutral
and serotype models poorly reproduced the stability of SC8’s frequency,
serotype switching within SC9 and SC15, or any patterns of clonal expansion. The
ecotype models better reproduced NVT sequence cluster stability and serotype
switching, but did not replicate the observed patterns of clonal replacement.
All of these models resulted in significantly worse fits to the data than the
heterogeneous multilocus NFDS model (Supplementary Figs 6 & 7).The estimated vaccine selection strength, v, of 0.081 per month from the heterogenous rate multilocus NFDS is consistent with PCV7’s halving of the rate at which VT are acquired42, if pneumococci transmit at least once every six months, an interval similar to the carriage duration of VT serotypes43. Similarly, the estimated migration rate, m, of 0.0044 per month suggests half the resident Massachusetts pneumococcal population is replaced by immigrant strains over approximately 13 years, which is realistic given the 50% probability that a pneumococcal lineage was detectable in different localities within Massachusetts after 3-4 years14.
Consistent evidence of NFDS in other populations
The homogeneous and heterogeneous multilocus NFDS models were also fitted to similar
surveillance data from Southampton (Supplementary Fig 5 & 8). The JSD values for the
heterogenous rate model were again reproducible and significantly smaller than
for the homogeneous rate version (Wilcoxon test on 100 simulation pairs,
W = 9954, two-sided p =
1.01x10-33). The point estimates of parameter values were again
robust and, in the case of the three parameters determining the strength of
NFDS, very similar to those for Massachusetts (Table 1 & Supplementary Table 1).However, the vaccine selection strength was estimated to be 2.54-fold higher in
Southampton than in Massachusetts. This difference is likely attributable to the
substantially higher PCV7 coverage in children under 24 months of age in the
years immediately after the vaccine’s introduction in the UK relative to
the USA44,45, combined with the lower age range included in the Southampton
study, excluding older children who are less likely to have been immunized, or
in whom natural acquisition of immunity blunted the selective pressure of the
vaccine16,46. Simulations using these point estimates again
replicated the strain dynamics observed in the genomic sample (Fig 3b). VT SC5 and SC18 were eliminated at
realistic rates; NVT SC3, SC19 and SC35 remained at stable frequencies; serotype
switching occurred within SC1 and SC9, while NVT SC2 rose in prevalence at a
much faster rate than same lineage did in Massachusetts.The homogeneous and heterogeneous rate multilocus NFDS models were also fitted to a genomic dataset from cases of invasive pneumococcal disease in Nijmegen. The heterogeneous model was again a significantly closer fit to the genomic data, as assessed by the JSDs (Wilcoxon test, W = 3988, two-sided p = 0.0135; Table 1 and Supplementary Fig 5). Precisely replicating the observed population dynamics was difficult (Fig 3c and Supplementary Fig 9), owing to the sparser sampling, particularly post-PCV7, and inevitable bias towards more invasive genotypes in this dataset. While the estimated strength of NFDS was similar to both Massachusetts and Southampton, the estimated vaccine selection strength was lower than in these infant carriage surveillance projects, consistent with the Nijmegen collection being isolated in an adult population primarily protected by herd immunity18. Correspondingly, fitting the heterogenous rate model to the Maela dataset, isolated from an entirely unvaccinated community, estimated v close to zero (Table 1).NFDS acting on genomic islands can also affect variation in the core genome. Comparisons
between pre- and post-vaccination populations, and between different locations,
revealed allele frequencies of core genome single nucleotide polymorphisms
(SNPs) typically showed very similar correlations to those of accessory loci
frequencies (Supplementary Fig
10). This was not a consequence of tight linkage between SNPs in the
regions flanking genomic islands (Supplementary Fig 10). Nevertheless, simulations in which
NFDS acted on only accessory loci precisely replicated the post-vaccination
changes in the core SNP allele frequencies, and similar correlations to those
between collections were observed in simulations where the Massachusetts
population was gradually replaced with isolates from other datasets (Supplementary Fig 10).
Therefore while it is possible core genome loci may also be under NFDS, the
observed correlations can be attributed to NFDS acting only on accessory
loci.
Consequences of NFDS for the impact of vaccination
Simulations were used to investigate counterfactual scenarios. In the absence of vaccination (v = 0), the pre-PCV7 populations were stable in Massachusetts (Supplementary Fig 7), Southampton (Supplementary Fig 8) and Nijmegen (Supplementary Fig 9). Eliminating migration (m = 0) significantly increased the proportion of VTs observed in simulations in all three populations (Wilcoxon tests; Massachusetts, W = 0, two-sided p = 2.56x10-34; Southampton, W = 0, two-sided p = 2.56x10-34; Nijmegen, W = 1453, two-sided p = 4.50x10-18), highlighting the importance of imported or previously rare NVTs in driving out VTs. However, removing NFDS significantly decreased the proportion of VTs observed in all three populations (Wilcoxon tests; W = 10000, two-sided p = 2.56x10-34 in Massachusetts and Southampton; W = 9979, two sided p = 4.81x10-34 in Nijmegen). This is because following vaccination, those loci enriched in VT genotypes become increasingly advantageous to their bacterial hosts as they become rarer, resulting in NFDS slowing the rate at which VT genotypes are eliminated until such loci rise in frequency in NVT genotypes.
Discussion
These combined analyses of multiple population genomic datasets suggest that NFDS plays an important role both in the stable structuring of pneumococcal populations, and their dynamics following disruption by vaccine-induced immunity. According to the best-fitting model, relatively strong NFDS acts on a few hundred accessory genes, corresponding to 5.0% of the Massachusetts pangenome, and 8.3% of that in Southampton. This cumulative effect across multiple loci in complex populations is predicted to maintain stable lineage compositions in the absence of disruption by vaccination, without the oscillatory dynamics associated with some single locus NFDS processes19,30,47,48. Hence multiple lineages can persistently coexist within this framework despite their confinement to a niche, the human nasopharynx, that is physiochemically homogeneous compared with the varied environments inhabited by species often considered as split into ecotypes, such as Escherichia coli. Furthermore, although intraspecific recombinations are slow over the timescales simulated in this study14, horizontal DNA transfer has comprehensively reassorted genomic islands between genotypes over the species’ history. Their consequent polyclonal distribution means accessory locus frequencies can be preserved by multiple lineage combinations, thereby accounting for the diverse population structures observed globally, and the panoply of strains they contain49. While the NFDS processes represented in the multilocus model were also sufficient to explain the major post-vaccination population changes, further work is required to determine whether core loci are also involved. Continued development of such quantitative models with large genomic datasets promises to improve our understanding of how diverse selective pressures affecting bacterial populations shape their response to public health interventions, and how best to design novel pathogen control strategies.
Methods
Annotation of the accessory genome
The previously analysed Massachusetts population14,50 contained 1,112 COGs present in between 5% and 95% of the 616 isolates and 1,194 COGs present in a single copy in every isolate. Information on whether these were associated with capsule polysaccharide synthesis, antibiotic resistance, RMSs, Pneumococcal Pathogenicity Island 1 or MGEs was extracted from previously described analyses14,21,50. Coding sequences (CDSs) associated with proteinaceous immunogenic structures were identified through the results of protein antigen array data35. Candidate bacteriocins were identified using the BAGEL3 algorithm51. The variation at the blp locus, and the other putative bacteriocin production loci, was manually identified within de novo assemblies of the Massachusetts isolates using Artemis and ACT52. The heatmap showing the distribution of the blp alleles in Supplementary Fig 1 was generated by mapping Illumina reads for each of the Massachusetts isolates against the concatenated set of loci using BWA with default settings53. Further information on COG functional domains14 and previous automated annotations50 was additionally used to manually curate all available information into the annotation and classification in Supplementary Datasets 1 and 2.
Bioinformatic analysis of genomic data
The isolate collections analysed each came from systematic sampling of defined host
populations. The Massachusetts pneumococcal dataset, isolated from the
nasopharynxes of children up to five years of age during routine primary care
physician visits, consisted of the 616 de novo assemblies
generated with Velvet54 as described
previously14,50. VelvetOptimiser55 was used to assemble data from the Maela collection38 (3,085 genomes), isolated from the
nasopharynxes of infants up to two years of age, and their mothers, in a Thai
refugee camp; the Southampton collection16 (516 genomes), isolated from the nasopharynxes of children up to
four years of age during outpatient visits; and the Nijmegen18 collection (337 genomes), isolated from
adults hospitalised with bacteraemic pneumonia. These were supplemented with 20
complete, publically available reference genomes (Supplementary Dataset 3).
To standardise these genome collections relative to the Massachusetts dataset,
assemblies were discarded if they were less than 1.98 Mb, or greater than 2.19
Mb, in length; or had an N50 less than 15 kb14,50; or necessary
information was absent from the public databases. Of the 4,586 genomes, 4,462
met these criteria and were included in a preliminary analysis that identified
non-pneumococcal streptococci, which were then excluded from the final analysis.
Consequently, the final dataset of 4,127 genomes contained 20 reference
sequences, 616 Massachusetts sequences, 491 Southampton sequences, 337 Nijmegen
sequences, and 2,663 Maela sequences.Each genome was processed with RNAmmer v1.2, to annotate rRNA56; tRNAscan-SE v1.3.1, to annotate tRNA57; Rfam scan, to annotate other non-coding
RNA58; scanned for BOX, RUP and
SPRITE repeats using HMM profiles59,60; and Prodigal v2.661, to annotate CDSs using a model trained on the genome of
S. pneumoniae ATCC 70066962. CDSs that overlapped with the non-coding RNA or short
interspersed repeat sequences were then removed from the annotation, and the
remaining set translated to allow a non-redundant set of proteins to be
identified. A version without low complexity regions was generated through
filtering with segmasker63 and masking of
choline binding domains. All-against-all comparisons of these protein databases
were then generated using BLAT v0.3464.
Global COGs (gCOGs) were then generated using COGtriangles and COGcognitor65, and through linking pairs of highly
similar sequences, as described previously14. The gCOG nomenclature was then applied to the full, redundant
set of protein sequences.To correct for misassemblies, particularly those reflecting differences between the methods used to assemble the Massachusetts isolates’ genomes and those from other populations, false positive CDSs were eliminated from the intermediate frequency gCOGs. A database generated from the annotation of S. pneumoniae ATCC 70066962 was used to search intermediate frequency gCOG DNA sequences using BLASTALL v2.2.25. This identified 39 gCOGs corresponding to fragments of tRNA, oligomers of choline binding domains, or antisense fragments of insertion sequences. This left a final set of 11,049 gCOGs, of which 1,731 were present at a frequency between 5% and 95% in the pre- or peri-vaccination samples (grouped as “pre-vaccination” samples in the Results section) of at least one of the four study populations.To transfer the functional annotation onto the gCOG sequences, the annotated protein sequences from Massachusetts in Supplementary Table 1 were used to identify identical proteins in the gCOG dataset. When COGs could not be matched to gCOGs through this approach, links were instead made through searching gCOGs for proteins with identity to the middle 50% of annotated protein sequences from Massachusetts. These links were then manually curated to categorise the 1,731 intermediate frequency gCOG sequences where possible, as shown in Fig 2.
Analysis of population structure
To analyse the overall population structure, a ‘relaxed’ core set of 1,447
gCOGs were identified that met two criteria: first, that they were present in at
least 95% of the isolates; and second, that the total number of gCOG
representatives was less than 105% of the number of isolates containing the
gCOG, to exclude gCOGs that are present in high copy number in some, or all,
genomes. A codon alignment was then generated for each gCOG using mafft
v7.22166, excluding any sequences
from isolates containing more than one representative of the gCOG. These were
concatenated, with gap sites used to pad regions where data were missing for a
particular isolate, and a 293,508 bp alignment of polymorphic sites extracted
using SNP-sites67. A phylogeny was
generated from this alignment using FastTree2 with the ‘fastest’
option68.Population structure was analysed with hierarchical BAPS clustering69 using five independent runs of the estimation algorithm
starting from the upper bound of 200-500 clusters, which all converged to the
same posterior mode. Two polyphyletic primary BAPS clusters were split into
their secondary level clusters, yielding 73 sequence clusters that were almost
entirely congruent with the phylogeny, and SC0, which remained polyphyletic. The
monophyletic sequence clusters most similar to those in Massachusetts14 were numbered accordingly. The plot in
Fig 1c combined cophenetic distances
from the core genome phylogeny, extracted with Bioperl70, and Jaccard distance calculated from the presence and
absence matrix of gCOGs using the R package vegan71. For each isolate, 100 comparator isolates were selected at
random, and this sample of pairwise comparisons used to generate the plot.Of the polymorphic sites in the core genome, 282,043 corresponded to a base in the
S. pneumoniae ATCC 700669 reference genome. For each
population, the set of sites that were both biallelic and had a non-reference
allele frequency between 5% and 95% in that population were extracted with
VCFtools v0.1.1472; there were 27,616 of
these in the Massachusetts dataset, 26,954 in the Southampton dataset, 28,396 in
the Nijmegen dataset, and 30,579 in the Maela dataset. The
r2 statistics between these polymorphic sites,
and between the binary presence and absence information of accessory gCOGs with
a representative in the S. pneumoniae ATCC 700669 genome, were
then calculated with VCFtools by treating each isolate as a phased haplotype.
These were used to generate the linkage analysis plots in Supplementary Fig 10.
Inference of antibiotic resistance profiles
Individual isolates’ genotypes were used to predict their antimicrobial
resistance profiles. The presence of aph3’ (the gCOGCLS350021) was inferred to cause resistance to aminoglycosides; the presence of
tetM (CLS03712) was inferred to cause resistance to
tetracycline; the presence of cat (CLS01043) was inferred to
cause resistance to chloramphenicol; and the presence of ermB
(CLS01283), mef (CLS02227), or both was inferred to cause
macrolide resistance62,73. These gCOGs themselves were removed
from the set of loci used in the simulations, and the inferred antibiotic
resistance phenotype used instead.Non-susceptibility to other antibiotics is determined by core genome loci; to incorporate these into the model, resistant alleles of relevant loci were treated analogously to the presence of an accessory resistance gene. The presence of the I100L substitution in the dihydrofolate reductase protein (CLS03211) was inferred to result in resistance to trimethoprim74,75, and the presence of an insertion shortly after S61 in the dihydropteroate synthase protein (CLS01442) was inferred to result in resistance to sulphamethoxazole76. Three penicillin-binding proteins substantially contribute to β-lactam resistance. Using a similar approach to that of Li et al77, the population-wide protein sequences of Pbp1A (CLS01776), Pbp2X (CLS01031) and Pbp2B (CLS01093) were aligned with mafft v7.22166, and the transpeptidase domain regions extracted. Following validation using the isolates from Massachusetts14, sequences exhibiting less than 97% amino acid identity with the susceptible alleles defined by Li et al in the multiple sequence alignment were considered resistance-associated. These antibiotic resistance phenotypes were included as intermediate frequency loci if they met the criteria for a given population.
Multilocus negative frequency dependent selection model
The multilocus negative frequency dependent selection model was generated within a discrete-time Wright-Fisher framework78,79. Although such models were designed with a number of strong assumptions, the results of simulations have been found robust to violations of these conditions80. Each individual i had a genotype g defined by a binary string representing the presence and absence of each gCOG or antibiotic resistance phenotype present at an intermediate frequency in the starting population. The number of offspring arising from i at time t is a Poisson-distributed random variable X. This Poisson approximation is justifiable if only a small proportion of descendants survive to the next generation78, as is likely to be the case for a nasopharyngeal coloniser with a small within-host effective population size81 that experiences a strong bottleneck at transmission. To allow for differential reproductive success between genotypes in a manner that depended on the composition of the overall population, Xi,t was parameterised using the function (Supplementary Fig 4):The four components of the function each correspond to a different biological process. General density-dependent selection depends on κ, the carrying capacity of the environment, and Nt, the total number of individuals at time t. This maintained an approximately stable population size throughout simulations. This is justifiable, as S. pneumoniae colonization levels did not substantially change in the years immediately after PCV7’s introduction.Migration into the population occurred at rate m, subject to the limits 0 ≤ m ≤ 1, and therefore the reproductive fitness of resident individuals was reduced by a factor of (1-m) accordingly to maintain an approximately constant population size of κ. The number of immigrating individuals at time t, N, was a random variable calculated as:Migrant individuals were selected, with replacement, from all isolates observed at any time point in the geographically-specified dataset being studied. Therefore it was the only mechanism by which genotypes not present in the pre-vaccine genome samples could enter the simulated population. To prevent artefactually improving the fit of the model at high values of m through sampling all isolates in proportion to their observed frequency, the selection of an immigrating isolate was biased such that it was equally likely to come from any sequence cluster with at least one representative in the studied population, although these were present at very different frequencies within each population. Hence the probability of an immigrating individual being of genotype i and sequence cluster s, p, was: Where S is the number of sequence clusters in the population, n is the number of isolates in sequence cluster s of genotype i in the genome dataset, and ns was the number of isolates in the sequence cluster s in the genomic dataset.The vaccine selection pressure to which individual i was subject, vi, depended on whether the individuals were of a vaccine serotype or not; for PCV7, the vaccine serotypes were 4, 6B, 9V, 14, 18C, 19F and 23F, as well as 6A, a vaccine-related type to which PCV7 elicited strong cross-immunity14. Consequently, vi was determined as: Where v was subject to the constraint 0 ≤ v ≤ 1.In the homogeneous rate multilocus model of NFDS, the magnitude of this pressure was determined by the term (1 + σ) , where σ ≥ 0. The selection pressure depended on the genotype g and distribution of intermediate frequency loci at time t, as summarised by the exponent π. The calculation of π necessitated determining the frequency f of each locus l at time t in the simulation, using the binary variables g that represent presence or absence of l in i:These were compared to the equilibrium frequencies, e, of
the same loci, which were assumed to correspond to their frequencies in the
sample of G0 genomes from isolates sampled pre- or
peri-vaccination:The overall deviation of the L accessory genome loci included in the simulations, for individual i at time t, πi,t, was calculated as:Therefore if all accessory genes are at their equilibrium frequencies, then (1 + σ) = 1, and NFDS has no effect on an individual’s reproductive fitness. When a genotype contains many genes rarer than their equilibrium frequencies, (1 + σ) > 1, and NFDS increases an individual’s reproductive fitness. Lastly, when a genotype contains many genes more common than their equilibrium frequencies, (1 + σ) < 1, and therefore NFDS reduces an individual’s reproductive fitness. In the absence of l from an individual’s genotype, f has no direct effect on its fitness.
Extension to heterogeneous frequency-dependent selection
Two further parameters were introduced when accessory genes were split into two categories, each subject to a different level of frequency dependent selection. The σ parameter represented the strength of weaker NFDS acting on a fraction, (1-p), of the accessory genes included in the model. To facilitate inference of these two parameters, it was assumed that loci under weaker negative frequency dependent selection would vary in frequency to a greater extent between the initial and final genomic samples; therefore the accessory loci were ordered by the statistic Δ: Where e is the frequency of the gCOG or antibiotic resistance phenotype across all pre- or peri-vaccination samples, as defined previously, and f0, is its frequency across all post-vaccination samples. The denominator is intended to emphasise the effects of gCOGs at frequencies of approximately 50%, which are likely to have a large effect on the overall population structure. The proportion p of genes for which Δl was smallest were considered subject to NFDS with strength (1+σ), whereas the rest where subject to NFDS of strength (1+σ). If the L loci were ordered by ascending values of Δ, then l was the highest ranking meeting the criterion, This resulted in two distinct measures of the deviation of f from e: And: Hence the modified offspring distribution was:
Simulations and parameter estimation
The model was implemented in C++ using the GNU scientific library, and is available for download from https://github.com/nickjcroucher/multiLocusNFDS. In each simulation, genotypes were represented by the gCOGs and antibiotic resistance phenotypes present in between 5% and 95% of the pre- or peri-vaccination population. Hence L was 1,090 for Massachusetts, 1,175 for Southampton, 1,090 for Nijmegen and 1,254 for Maela. For simplicity, κ was assumed to represent the number of pneumococci likely to transmit between individuals in the sampled population. This was estimated to correspond to 25% colonisation of children under ten years of age in the USA and European samples. In Massachusetts15, an under ten population of 828,129 in 2000 82 implied a bacterial population size of 2x105 (105 was actually used for model fitting for computational efficiency; comparing simulations demonstrated this had no detectable effect on the results); in Southampton (including Hampshire and Portsmouth), an under ten population of 202,404 in 201183 implied a bacterial population size of 5x104; and in Nijmegen (including Arnhem), an under ten population of 77,753 in 201184 implied a bacterial population of 2x104. An elevated colonisation rate of 50%85 was used for Maela, where estimating that 15% of the 40,000 residents being under 10 implied a bacterial population size of 3x103.Each simulation was run for a number of timesteps corresponding to the number of months spanned by the genomic collection, excluding early or late years in which sampling was sparse. The well-sampled periods were the 72 months between spring 2001 and spring 2007 for Massachusetts50; the 48 months between spring 2007 and spring 2011 for Southampton16; the 120 months between 2001 and 2011 for Nijmegen18; and the 24 months between 2007 and 2009 for Maela85. All isolates from a single winter were assigned to the year in which the season ended. In simulations of the Nijmegen population, where a substantial proportion of samples pre-dated the vaccine’s introduction, v = 0 for years up to 2007. In each case, the starting population for the simulation, of size κ, was generated by randomly resampling with replacement from the genotypes present in the pre- and peri-vaccination samples in each study; hence the ‘pre-vaccination’ population consisted of isolates sampled up to spring 2001 in Massachusetts, up to spring 2007 in Southampton, and up to 2007 in Nijmegen. These were the genomic samples used to calculate e for all intermediate frequency loci; all later samples were used to calculate f in the definition of Δ.At each time t at which a genomic sample was available, the equivalent
number of genotypes was randomly sampled from the simulated population. The
similarity between the simulated and genomic samples at t was
then calculated as the Jensen-Shannon divergence86 (JSD) between the real and simulated
samples: Where f is the
simulated frequency of genotypes of sequence cluster s and
vaccine type status v at time t, and
a is the equivalent value from the
genomic sample. This value was summed over all vaccine type statuses and
sequence clusters for each timepoint sampled in the genomic dataset to calculate
the overall divergence of the simulation from the sampled data.Each set of simulations was run with variation in the parameters v
(range 0-0.5); m (range 0-0.2);
σ (range 10-6-0.22);
σw (range 10-6-0.15; only in the heterogeneous
rates model), and p (range 0-1; only in the
heterogeneous rates model). Model fitting was achieved through Approximate
Bayesian Computation with the BOLFI algorithm40, run for 2,000 iterations of Bayesian optimisation to identify
best-fitting parameter sets through minimizing the JSD (Table 1, Supplementary Fig 5). Point estimates of parameter values were
generated based on the Gaussian process minimisers, with the distribution of the
projected JSD values shown for each fit in Supplementary Fig 5.
Exploration of parameter space was performed with logarithmically transformed
values to avoid discontinuity of the approximate likelihood function near the
natural boundary and to enable better fit of the Gaussian process regression.
The 95% posterior credible intervals for the parameters were obtained using
three generations of sequential Monte Carlo sampling with the same default
settings as used in Gutmann and Corander40 for the pneumococcal day care transmission model.
Alternative model formulations
To test whether equivalently good fits to the genomic data could be achieved using different approaches within the same framework, alternative model formulations were tested. The neutral model was fitted in the same way as the multilocus NFDS models, except that σ was fixed at zero. The serotype NFDS model assumed all serotypes were present at equilibrium frequencies in the pre-vaccine samples, and therefore πi,t was calculated as the deviation of an isolate’s serotype from its initial frequency. This was fitted using both the homogeneous and heterogeneous selection rate models. In the latter case, Δ was calculated by comparing the serotype e values with their post-vaccination frequencies, as for the intermediate frequency loci. The same parameter ranges were used as for the multilocus NFDS model, except σ and σ were allowed to take values within the range 10-6-25 to compensate for the single locus contributing to π and ω. Additionally, to avoid many lower frequency sequence clusters evolving neutrally, serotypes were considered to be at intermediate frequencies if they were between 1% and 99% prevalence in the pre- or peri-vaccination population.The ecotype NFDS model assumed each sequence cluster was adapted to a specific
ecological niche, and therefore was present at an equilibrium frequency in the
pre-vaccine samples. Therefore π and
ω were calculated as the deviation
of an isolate’s sequence cluster from its initial frequency. This was
fitted using both the homogeneous and heterogeneous selection rate models using
the same parameter ranges and intermediate frequency range as for the serotype
NFDS model, as well as the same approach to the calculation of
Δ. For both the serotype and ecotype
models, model fitting was conducted with BOLFI40 as for the multilocus NFDS model, using JSDs to quantify the
differences between the simulated and sampled populations. Results are shown in
Table 1. These implementations are
not intended to represent the optimal versions of each model, but instead
demonstrate that the fits of the multilocus NFDS models cannot be trivially
replicated through changing the genetic basis of NFDS.Simulations in which isolates from two populations were combined used the pre-vaccination population from Massachusetts and post-vaccine isolates from one of the alternative populations. The initial population was drawn only from Massachusetts; both these isolates, and those from the alternative dataset, could enter the simulated population through migration. When the alternative population corresponded to Southampton or Nijmegen, the population size, number of generations, parameter point estimates, Δ and e values were those of the alternative population. When the alternative population was Maela, the e and Δ values were those of the alternative population, but the simulations were otherwise parameterised for the Massachusetts population, due to the difficulty of obtaining robust point estimates for parameters from the Maela population as a consequence of the lack of vaccine introduction in this location.
Statistical analyses
Statistical analyses, including calculation of Pearson’s
R2, Wilcoxon tests, interquartile ranges and
Fisher’s exact tests, were performed in R87. Estimation of parameter values and credibility intervals through
model fitting were performed with BOLFI40. All reported p values are two-sided.
Code availability
The model code used in this analysis is freely available from the GitHub repository, https://github.com/nickjcroucher/multilocusNFDS.
Data availability
The sequence datasets analysed in the current study are available in the public sequence databases with the accession codes listed in Supplementary Dataset 3. The epidemiological and phylogenetic data analysed in the current study are available from https://microreact.org/project/multilocusNFDS.
Authors: Eleanor R Watkins; Bridget S Penman; José Lourenço; Caroline O Buckee; Martin C J Maiden; Sunetra Gupta Journal: PLoS Pathog Date: 2015-07-16 Impact factor: 6.823
Authors: Rebecca A Gladstone; Johanna M Jefferies; Anna S Tocheva; Kate R Beard; David Garley; Wei Wei Chong; Stephen D Bentley; Saul N Faust; Stuart C Clarke Journal: Vaccine Date: 2015-03-14 Impact factor: 3.641
Authors: Yuan Li; Benjamin J Metcalf; Sopio Chochua; Zhongya Li; Robert E Gertz; Hollis Walker; Paulina A Hawkins; Theresa Tran; Cynthia G Whitney; Lesley McGee; Bernard W Beall Journal: mBio Date: 2016-06-14 Impact factor: 7.867
Authors: Justin S Hogg; Fen Z Hu; Benjamin Janto; Robert Boissy; Jay Hayes; Randy Keefe; J Christopher Post; Garth D Ehrlich Journal: Genome Biol Date: 2007 Impact factor: 13.583
Authors: Nicholas J Croucher; Jonathan A Finkelstein; Stephen I Pelton; Julian Parkhill; Stephen D Bentley; Marc Lipsitch; William P Hanage Journal: Sci Data Date: 2015-10-27 Impact factor: 6.444
Authors: Gerry Tonkin-Hill; Clare Ling; Chrispin Chaguza; Susannah J Salter; Pattaraporn Hinfonthong; Elissavet Nikolaou; Natalie Tate; Andrzej Pastusiak; Claudia Turner; Claire Chewapreecha; Simon D W Frost; Jukka Corander; Nicholas J Croucher; Paul Turner; Stephen D Bentley Journal: Nat Microbiol Date: 2022-10-10 Impact factor: 30.964
Authors: Mark R Davies; Liam McIntyre; Ankur Mutreja; Jake A Lacey; John A Lees; Rebecca J Towers; Sebastián Duchêne; Pierre R Smeesters; Hannah R Frost; David J Price; Matthew T G Holden; Sophia David; Philip M Giffard; Kate A Worthing; Anna C Seale; James A Berkley; Simon R Harris; Tania Rivera-Hernandez; Olga Berking; Amanda J Cork; Rosângela S L A Torres; Trevor Lithgow; Richard A Strugnell; Rene Bergmann; Patric Nitsche-Schmitz; Gusharan S Chhatwal; Stephen D Bentley; John D Fraser; Nicole J Moreland; Jonathan R Carapetis; Andrew C Steer; Julian Parkhill; Allan Saul; Deborah A Williamson; Bart J Currie; Steven Y C Tong; Gordon Dougan; Mark J Walker Journal: Nat Genet Date: 2019-05-27 Impact factor: 38.330
Authors: Gabrielle L Harrow; John A Lees; William P Hanage; Marc Lipsitch; Jukka Corander; Caroline Colijn; Nicholas J Croucher Journal: ISME J Date: 2021-01-06 Impact factor: 10.302
Authors: Joseph J Campo; Timothy Q Le; Jozelyn V Pablo; Christopher Hung; Andy A Teng; Hervé Tettelin; Andrea Tate; William P Hanage; Mark R Alderson; Xiaowu Liang; Richard Malley; Marc Lipsitch; Nicholas J Croucher Journal: Elife Date: 2018-12-28 Impact factor: 8.140
Authors: Anne L Wyllie; Joshua L Warren; Gili Regev-Yochay; Noga Givon-Lavi; Ron Dagan; Daniel M Weinberger Journal: Clin Infect Dis Date: 2021-06-01 Impact factor: 9.079