Jason A Hendry1, Dominic Kwiatkowski1,2,3,4, Gil McVean1,3. 1. Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, United Kingdom. 2. Wellcome Centre for Human Genetics, University of Oxford, Oxford, United Kingdom. 3. Medical Research Council Centre for Genomics and Global Health, University of Oxford, Oxford, United Kingdom. 4. Wellcome Sanger Institute, Cambridge, United Kingdom.
Abstract
There is an abundance of malaria genetic data being collected from the field, yet using these data to understand the drivers of regional epidemiology remains a challenge. A key issue is the lack of models that relate parasite genetic diversity to epidemiological parameters. Classical models in population genetics characterize changes in genetic diversity in relation to demographic parameters, but fail to account for the unique features of the malaria life cycle. In contrast, epidemiological models, such as the Ross-Macdonald model, capture malaria transmission dynamics but do not consider genetics. Here, we have developed an integrated model encompassing both parasite evolution and regional epidemiology. We achieve this by combining the Ross-Macdonald model with an intra-host continuous-time Moran model, thus explicitly representing the evolution of individual parasite genomes in a traditional epidemiological framework. Implemented as a stochastic simulation, we use the model to explore relationships between measures of parasite genetic diversity and parasite prevalence, a widely-used metric of transmission intensity. First, we explore how varying parasite prevalence influences genetic diversity at equilibrium. We find that multiple genetic diversity statistics are correlated with prevalence, but the strength of the relationships depends on whether variation in prevalence is driven by host- or vector-related factors. Next, we assess the responsiveness of a variety of statistics to malaria control interventions, finding that those related to mixed infections respond quickly (∼months) whereas other statistics, such as nucleotide diversity, may take decades to respond. These findings provide insights into the opportunities and challenges associated with using genetic data to monitor malaria epidemiology.
There is an abundance of malaria genetic data being collected from the field, yet using these data to understand the drivers of regional epidemiology remains a challenge. A key issue is the lack of models that relate parasite genetic diversity to epidemiological parameters. Classical models in population genetics characterize changes in genetic diversity in relation to demographic parameters, but fail to account for the unique features of the malaria life cycle. In contrast, epidemiological models, such as the Ross-Macdonald model, capture malaria transmission dynamics but do not consider genetics. Here, we have developed an integrated model encompassing both parasite evolution and regional epidemiology. We achieve this by combining the Ross-Macdonald model with an intra-host continuous-time Moran model, thus explicitly representing the evolution of individual parasite genomes in a traditional epidemiological framework. Implemented as a stochastic simulation, we use the model to explore relationships between measures of parasite genetic diversity and parasite prevalence, a widely-used metric of transmission intensity. First, we explore how varying parasite prevalence influences genetic diversity at equilibrium. We find that multiple genetic diversity statistics are correlated with prevalence, but the strength of the relationships depends on whether variation in prevalence is driven by host- or vector-related factors. Next, we assess the responsiveness of a variety of statistics to malaria control interventions, finding that those related to mixed infections respond quickly (∼months) whereas other statistics, such as nucleotide diversity, may take decades to respond. These findings provide insights into the opportunities and challenges associated with using genetic data to monitor malaria epidemiology.
It is widely accepted that relationships exist between the regional epidemiology of malaria and the genetic diversity of local parasite populations. For example, the origin and spread of antimalarial drug resistance, the rate and directions of parasite migration, and the local intensity of transmission all have an impact on population genetic diversity (reviewed in [1-7]). In most cases, however, the precise nature of these relationships remains unclear. From a modelling perspective, exploring these relationships would require that both genetic processes (including mutation, drift and meiosis) and epidemiological ones (including the transmission dynamics and life cycle of malaria) are combined into a single framework. At present such integrated models are rare, yet without them, parasite genetic data will be under-utilized as a resource for malaria surveillance.One epidemiological parameter of central importance to malaria surveillance is transmission intensity, as it is used by National Malaria Control Programs (NMCPs) to prescribe malaria control interventions and assess their efficacy [8]. NMCPs can attempt to measure transmission intensity with a variety of statistics. These include the basic reproduction number (R0), defined as the number of secondary infections produced by a single infected individual in a naive population; the entomological inoculation rate (EIR), defined as the mean number of infectious bites received by an individual per annum; parasite prevalence or rate (PR or PfPR if the focus is P. falciparum), defined as the fraction of individuals in the population carrying detectable infection; or simply the rate of clinical incidence (reviewed in [9]). However, there are well-documented issues with all of these statistics. Though a theoretical gold-standard, R0 is difficult to measure in practice, with estimation methods relying either on exploiting equilibrium relationships with other measures of transmission, or formulae involving several poorly-characterised parameters [9, 10]. The EIR suffers from small-scale variability in mosquito density, a lack of standardisation across mosquito catching methods, and difficulties associated with catching sufficient mosquitoes when transmission intensities are low [9, 11, 12]. Rates of clinical incidence are confounded by variation in acquired immunity and treatment seeking behaviour, as well as incomplete record keeping [12]. Parasite prevalence is the most widely collected measure and has been used as the basis of large-scale maps [13-15], yet it requires prohibitively extensive sampling at low transmission intensities [12], and must address biases in detection power that may arise from infection-course and age-dependent variation in parasitemia [9, 16]. Thus, a means to either estimate or improve existing estimates of transmission intensity with genetic data would be valuable.The problem of estimating an epidemiological parameter like transmission intensity from genetic data is superficially similar to the demographic inference problems that are commonly encountered in population genetics. For example, a multitude of methods now exist for estimating effective population size (N) from genetic data [17-20], and it could be hypothesized that the regional transmission intensity of malaria is a function of the N of the local parasite population. However, there are at least two challenges unique to epidemiological inference from malaria genetic data that make it a distinctive and more difficult problem.First, it is not clear that the classical models in population genetics (including the Wright-Fisher, Moran, and other models that converge to Kingman’s n−Coalescent in the infinite population limit [21]) that are often employed by demographic inference methods are suitable for malaria. The life cycle of P. falciparum involves oscillating between human host and mosquito vector populations, which may be of different sizes, and may also induce different rates of drift and mutation. Within both the host and the vector, parasite populations likely experience bottlenecks (for example, as the ookinetes penetrate the midgut wall of the vector), exponential growth phases (merozoites replicating in the blood of the host), and interactions with the host or vector immune system. Indeed, there has been work demonstrating that the malaria life cycle simultaneously intensifies drift and selection; a result contrary to what is expected under a Wright-Fisher model [22]. So, while classical models in population genetics have the advantage of being extensively studied and mathematically tractable, with a known set of relationships between equilibrium genetic diversity statistics and demographic parameters, they do not readily apply to P. falciparum. Finally, even if these models did apply, the relationships between demographic parameters (such as N) and epidemiological ones (such as transmission intensity) would need to be defined.Conversely, the epidemiological models that have been designed to reflect malaria biology and transmission do not explicitly incorporate genetic processes. The most well known class of epidemiological models are the compartment-based models, where individual hosts and vectors transition between compartments which can represent a variety of disease states (such as susceptible, infected, or immune), and the overall population is represented by the total number of hosts and vectors occupying each state (reviewed in [23, 24]). A canonical compartment model for malaria is the Ross-Macdonald, where hosts and vectors can be either susceptible or infected [24]. Conveniently, in these models, the equilibrium prevalence of infected hosts and vectors is a function of the parameters specifying the transition rates between compartments. However, the absence of genetic processes (and indeed, individual parasites) means they offer no insight into how parasite genetic diversity relates to these transition rates or, as a corollary, any epidemiological parameters derived from them. Thus, at least with respect to malaria, the traditional modelling landscape cannot address questions that involve both genetic data and epidemiology.A second challenge specific to epidemiological inference using genetic data is that the ultimate aim is often to inform disease control and, as a result, the time-dimension of the inference is of critical importance. Many demographic inference methods base their estimates of N on distributions of coalescent times between segments of DNA. As these coalescent events typically occur on the scale of N generations in past, the estimates are historical; reflecting the average population size over hundreds or thousands of generations. Such approaches are not suitable for disease control, where policy decisions need to be made on the basis of information about the near-present, or predictions about the future.In view of these challenges, the development of integrated genetic-epidemiological models is a critical undertaking for the future of malaria genomic surveillance. To date, there have been two notable examples of such integrated models. The first, developed by Daniels et al. in 2015, was designed to support epidemiological inference from single-nucleotide polymorphism (SNP) data, collected during a period of intensified intervention in Thiès, Senegal [25]. Here, the model was used with an Approximate Bayesian Computation (ABC) algorithm to independently corroborate a decline and rebound in transmission intensity on the basis of 24-SNP barcodes [25]. More recently, Watson et al. have developed a model of substantially greater epidemiological complexity—including features such as individual vectors and a vector development cycle, six host infection states, host age and immunity—and used this model to estimate parasite prevalence in five locations from Uganda and Kenya after fitting a single parameter in the model to SNP data [26]. In both these cases, the focus on SNP barcodes reflected the data available. Yet, as P. falciparum whole-genome sequencing (WGS) data continues to increase [27], there is a need for modelling frameworks that can investigate the broader suite of genetic diversity statistics calculable from WGS data.Here, we aim to address the lack of integrative genetic-epidemiological models by developing a new forward-time model called forward-dream. forward-dream merges the Ross-Macdonald model with a continuous-time intra-host and intra-vector Moran model, and further incorporates meiosis within the vector (allowing for multiple oocysts), multiple infection (by either super- or co-infection), and a representation of the transmission bottlenecks. Implemented as a stochastic simulation, we use the model to explore relationships between measures of genetic diversity and parasite prevalence, both at equilibrium and in response to malaria control interventions that perturb equilibrium. We confirm that a variety of genetic diversity statistics are correlated with parasite prevalence, although to varying degrees and over different time-scales. In addition, we find that interventions that affect the duration of infection in hosts have a greater influence on parasite genetic diversity than those that influence vector biting rate or density. Overall, our results suggest that statistics based on the complexity of infection (COI) are strongly, robustly, and rapidly responsive to changes in prevalence, highlighting their potential value for malaria surveillance.
Results
Developing a model of P.falciparum malaria transmission and evolution
We developed an agent-based simulation of P.falciparum malaria incorporating features of its transmission and life cycle, as well as explicitly modelling the genetic material of parasites. Our integrated genetic-epidemiological model is called forward-dream (forward-time drift, recolonisation, extinction, admixture and meiosis) and is comprised of three layers: (1) a stochastic epidemiological layer, which controls how malaria spreads through a population of hosts and vectors and reaches equilibrium; (2) a stochastic infection layer, which controls the behaviour of malaria parasites during individual transmission events and within individual hosts and vectors; and (3) a stochastic genetic layer, which controls how the genetic material of individual parasites is represented, mutated and recombined (Fig 1). We describe each layer below and provide additional information, including a discussion of parameterisation, in the S1 Appendix.
Fig 1
Schematic of forward-dream.
(a) The epidemiological layer. Hosts and vectors oscillate between susceptible and infected compartments according to a Ross-Macdonald model. (b) The infection layer. The capacity of individual hosts and vectors to be infected is represented by a fixed number of sub-compartments (black boxes) each which can harbour a unique parasite genome (colored circles). In a susceptible host/vector, all sub-compartments are empty. Upon infection, all sub-compartments are populated. Drift and mutation occur among sub-compartments according to a continuous-time Moran model. Parasite genomes undergo meiosis during transmission from host to vector. Super-infection can occur, resulting in an average of half of all sub-compartments being replaced with newly transmitted parasite genomes. Note that the infection layer can be nested within the Ross-Macdonald model. (c) The genetic layer. The genome is represented by a fixed-length array of 0’s and 1’s. Mutation is reversible, converting 0 to 1 or 1 to 0. Recombination occurs during meiosis.
Schematic of forward-dream.
(a) The epidemiological layer. Hosts and vectors oscillate between susceptible and infected compartments according to a Ross-Macdonald model. (b) The infection layer. The capacity of individual hosts and vectors to be infected is represented by a fixed number of sub-compartments (black boxes) each which can harbour a unique parasite genome (colored circles). In a susceptible host/vector, all sub-compartments are empty. Upon infection, all sub-compartments are populated. Drift and mutation occur among sub-compartments according to a continuous-time Moran model. Parasite genomes undergo meiosis during transmission from host to vector. Super-infection can occur, resulting in an average of half of all sub-compartments being replaced with newly transmitted parasite genomes. Note that the infection layer can be nested within the Ross-Macdonald model. (c) The genetic layer. The genome is represented by a fixed-length array of 0’s and 1’s. Mutation is reversible, converting 0 to 1 or 1 to 0. Recombination occurs during meiosis.
Epidemiological layer
For the epidemiological layer we implemented a stochastic, agent-based version of the Ross-Macdonald model (reviewed in [24]). In this model, a fixed number of hosts (N) and vectors (N) alternate between susceptible and infected based on four fixed rate parameters (Fig 1a). The model can be described by two coupled differential equations:
where h0 and h1 are the number of susceptible and infected hosts, respectively, with N = h0+ h1; and v0 and v1 are the number of susceptible and infected vectors, respectively, with N = v0+ v1. Note that in this model λ and ϕ are compound parameters representing the rate at which hosts and vectors become infected. In particular,
where b is the daily vector biting rate, (N/N) gives the vector density, and π is the probability that an infectious bite from a vector produces an infected host (the vector-to-host transmission efficiency). Similarly, we have,
where b is again the daily vector biting rate and π is the probability that a vector that bites an infected host becomes infected (the host-to-vector transmission efficiency). We allow for mixed infections however, for the epidemiological layer, they have no consequence: both clonal and mixed infections are assigned to the infected compartments (h1 or v1).Note that the epidemiological layer dictates the equilibrium prevalence of infection in hosts (X = h1/N) and vectors (X = v1/N). In particular, the equilibrium prevalence in hosts is given by:
and in vectors by:We have confirmed that our implementation of the Ross-Macdonald model in forward-dream converges to the expected equilibrium prevalence values (Fig 2a and S1 Fig). In total, the behaviour of the epidemiological layer is specified by seven parameters.
Fig 2
Simultaneously monitoring parasite prevalence and genetic diversity using forward-dream.
(a) A forward-dream simulation seeded with ten infected hosts harbouring identical parasite genomes and run for 50 years. Prevalence of all infected hosts (‘Host All’, which corresponds to PfPR) and multiply-infected hosts (‘Host Mixed’) is indicated by red and pink lines, respectively. The same is shown for vectors in blues. The prevalence of infected hosts and vectors fluctuates around their Ross-Macdonald equilibrium values (X and X), indicated with red and blue horizontal lines, respectively. (b) The same simulation as in (a), but visualizing eight genetic diversity statistics that were computed by collecting parasite genomes from twenty randomly selected infected hosts every 30 days. The statistics are defined in S1 Table. For reference, the light and dark grey shaded areas show the host and vector prevalence (right y-axis), corresponding to the red and dark blue lines in panel (a).
Simultaneously monitoring parasite prevalence and genetic diversity using forward-dream.
(a) A forward-dream simulation seeded with ten infected hosts harbouring identical parasite genomes and run for 50 years. Prevalence of all infected hosts (‘Host All’, which corresponds to PfPR) and multiply-infected hosts (‘Host Mixed’) is indicated by red and pink lines, respectively. The same is shown for vectors in blues. The prevalence of infected hosts and vectors fluctuates around their Ross-Macdonald equilibrium values (X and X), indicated with red and blue horizontal lines, respectively. (b) The same simulation as in (a), but visualizing eight genetic diversity statistics that were computed by collecting parasite genomes from twenty randomly selected infected hosts every 30 days. The statistics are defined in S1 Table. For reference, the light and dark grey shaded areas show the host and vector prevalence (right y-axis), corresponding to the red and dark blue lines in panel (a).
Infection layer
The infection layer specifies the biology of individual infection and transmission events. This includes a representation of: (i) the capacity of hosts and vectors to harbour infection, (ii) the evolution of infections within hosts and vectors, and (iii) the transmission of infection between hosts and vectors.We model an individual host’s capacity to be infected with n within-host sub-compartments (Fig 1b). When an individual host is in the susceptible state (i.e. is uninfected), all n sub-compartments are empty. When an individual host becomes infected, all n sub-compartments simultaneously become populated. Each sub-compartment can potentially harbour a unique parasite genome (such that the maximum complexity of infection k = n), although typically multiple sub-compartments will be occupied by identical, or near-identical, genomes. For example, if a host is infected by a vector carrying a single distinct parasite genome, all n sub-compartments will initially be occupied by that genome. Alternatively, if a host is co-infected by a vector carrying two distinct parasite genomes, its sub-compartments will be drawn from a mixture of the two genomes.Moving forward in time, the P. falciparum infection of a host evolves according to a continuous-time Moran process for a population with n individuals, parameterised by a drift rate (d) and mutation rate (θ) [28]. We have confirmed that our implementation of the Moran process yields fixation times consistent with theoretical expectation (S2 Fig). The Moran process continues until the infection is cleared and the host returns to the susceptible state, with all sub-compartments becoming simultaneously empty. For hosts, clearance occurs at rate γ, as specified in the epidemiological layer. The infection of a vector evolves according to a comparable process as in the host, but with a set of parameters n, d, θ, and ϵ.In forward-dream, a P. falciparum infection is transmitted from a vector to a host through a transmission bottleneck, such that not all of the parasites within the infectious vector establish themselves in the host. In particular, a random subset of all parasites (where is the set of all parasites in the vector) is transmitted. The number of transmitted parasites is drawn from a truncated binomial:This results in an average of parasites passing through the transmission bottleneck; the size of the bottleneck is controlled by p. For all of the simulations presented here, p = 0.2 and n = 20, resulting in an average of ≃4 parasites passing through the bottleneck.Note that in cases where n > 1 and contains unique parasites, co-infection may occur. If the vector is infecting a susceptible (i.e. uninfected) host, n parasites are drawn with replacement from with each having an equal probability (1/n) of being drawn. These n parasites then populate the n host sub-compartments. Alternatively, super-infection occurs if the host is already infected. Defining the parasites already within the host by the set , we first create the union . From this set n parasites are drawn, where each parasite has probability 1/2n or 1/2n of being drawn, if it is from h or , respectively. As a consequence, on average super-infection results in half of the within-host compartments being occupied by new parasites.The transmission from host to vector is the same as above, but with the addition of meiosis. In brief, the n parasite strains selected at random from the infected host may undergo meiosis before populating the the n within-vector sub-compartments. The meiosis model is based on a simplified implementation of our previously published meiosis simulator, pf-meiosis [29]. The number of recombination events is drawn from a Poisson distribution scaled with respect to chromosome length, such that an average of one cross-over event occurs per bivalent during meiosis. Recombination breakpoints are sampled uniformly from along the chromosome. The model includes multiple oocysts, allowing for parallel rounds of meiosis to occur during a single transmission event, with the number of oocysts being drawn from a truncated geometric distribution: ∼min[10, Geo(p)].In total, the infection layer is specified by nine parameters.
Genetic layer
The genetic layer of forward-dream describes the malaria genome model. We represent the genetic material of an individual parasite as a single fixed-length array of zeros and ones, defined by the parameter N (Fig 1c). In effect, this array represents a single chromosome marked with N single-nucleotide polymorphisms (SNPs). Mutation is symmetric and reversible. The only parameter specific to the genome evolution layer is N.Overall, forward-dream is specified by 17 parameters (Table 1). It is implemented in Python and available on GitHub at https://github.com/JasonAHendry/fwd-dream.
Table 1
Complete list of simulation parameters for forward-dream.
Values given represent those of a simulation with a host (X) and vector (X) prevalence of 0.65 and 0.075, respectively. This corresponds to the Initialise epoch of all malaria control intervention simulations (see below). For details on how parameter values were selected see the S1 Appendix.
Parameter
Definition
Value
Nh
Number of hosts
1000
Nv
Number of vectors
5000
b
Biting rate (per vector per day)
0.25
πh
Vector-to-host transmission efficiency
0.1
πv
Host-to-vector transmission efficiency
0.1
γ
Host infection clearance rate
0.005
ϵ
Vector infection clearance rate
0.2
nh
Maximum COI for hosts
20
nv
Maximum COI for vectors
20
dh
Drift rate in hosts (events per day per host)
2
dv
Drift rate in vectors (events per day per vector)
2
θh
Probability of mutation event given drift has occurred in hosts
1.25⋅10−4
θv
Probability of mutation event given drift has occurred in vectors
1.25⋅10−4
ph
Probability a given parasite genome passes through host-to-vector transmission bottleneck
0.2
pv
Probability a given parasite genome passes through vector-to-host transmission bottleneck
0.2
poocysts
Number of oocysts drawn from ∼Geo(poocysts)
0.5
Nsnps
Number of SNPs modelled per parasite genome
8000
Complete list of simulation parameters for forward-dream.
Values given represent those of a simulation with a host (X) and vector (X) prevalence of 0.65 and 0.075, respectively. This corresponds to the Initialise epoch of all malaria control intervention simulations (see below). For details on how parameter values were selected see the S1 Appendix.
Relationships between parasite prevalence and genetic diversity at equilibrium
Within the forward-dream framework, it is straightforward to monitor the fraction of hosts that are infected (h1/N), which corresponds to the most ubiquitously collected measure of transmission intensity, parasite prevalence (PfPR) (Fig 2a). It is also possible to monitor the fraction of vectors infected (v1/N), which corresponds closely to the sporozoite rate (SP). Finally, the genetic diversity of the parasite population can be monitored by collecting parasite genomes from a sample of infected hosts and simulating DNA sequencing (see Materials and methods) (Fig 2b).We sought to use forward-dream to elucidate relationships between PfPR and the genetic diversity of the parasite population. To this end, we varied PfPR across simulations and observed the resulting differences in parasite genetic diversity. Within the Ross-Macdonald framework, PfPR is a function of the four rate parameters (see Eq 1). In nature, what underlies prevalence differences observed between two geographies or points in time is often unknown, and likely the outcome of a myriad of epidemiological and environmental factors. To achieve different PfPR values in forward-dream, we choose three parameters to vary separately: (i) the human infection clearance rate (γ), which may vary between sites if, at one site there is quicker recourse to treatment, or differing proportions of symptomatic and asymptomatic individuals; (ii) the vector biting rate (λ), which may vary as a consequence of differences in vector species, environmental conditions, or the presence of bednets; and (iii) the number of vectors (N)—which is influenced by climate and weather, local geography, and also by insecticide-based interventions (see [30]). We tuned each of these parameters to achieve equilibrium PfPR values that varied from 0.1 to 0.8 in a population of one thousand human hosts (S3 Fig). At each prevalence value, forward-dream was seeded with forty infected hosts carrying identical parasites and then run to equilibrium. After reaching equilibrium, simulations were continued for an additional 10 years, during which time parasite genomes were collected every thirty days by sampling from the infected host population. Since we also wanted to explore the noise distributions of different genetic diversity statistics and how they are influenced by sample size, at each sampling time we collected parasite genomes from five randomly-drawn samples ranging in size from 20 to 100 infected hosts.The genetic diversity statistics calculated from these parasite genomes are described in S1 Table. The statistics can be divided into three broad categories: (i) those related to mixed infections, which includes the fraction of mixed samples and the mean complexity of infection (COI); (ii) those that are related to the size and shape of the genetic genealogy of the sample, which includes the number of segregating sites, the number of singletons, nucleotide diversity (π), Watterson’s Theta (θ), and Tajima’s D; and (iii) those that summarize the structure of identity-by-state (IBS) and identity-by-descent (IBD) within the population, including, between a pair of samples, the average fraction of the genome in IBD or IBS and the average length of an IBD or IBS segment. We note that these statistics are not independent, indeed many are co-linear (S4 Fig), however they reflect commonly-used measures of genetic diversity in population genetics.We conducted a total of thirty replicate simulations at each prevalence level and aggregated all the samples of a given size to explore both relationships with PfPR and sampling noise. The relationships for all eleven genetic diversity statistics are shown (Fig 3a and S5–S7 Figs), partitioned by which epidemiological parameter was varied. Linear regression was used to determine the fraction of the variance (r2) in each genetic diversity statistic that could be explained by variation by PfPR, and how this changed with sample size (Fig 3b). We observed that nearly all genetic diversity statistics have a considerable proportion of their variance explained by prevalence. For many statistics—the fraction of mixed samples, mean COI, number of segregating sites, nucleotide diversity, Watterson’s θ, and fraction IBS—the r2 exceeded 0.9 for at least one epidemiological driver. The two statistics with the lowest r2 values were Tajima’s D and the mean IBD segment length. A stable value for Tajima’s D would suggest that the parasite population’s genealogical tree has a consistent shape across different prevalence values. One possible cause for the modest r2 values we observe for Tajima’s D is recurrent mutation that results from a finite genome size. A lower r2 for the mean IBD segment length may be in part related to a lack of power to detect small IBD segments at higher prevalence values.
Fig 3
Equilibrium relationships between genetic diversity and parasite prevalence.
(a) Boxplots showing the equilibrium parasite prevalence in hosts (x-axis) versus genetic diversity of the parasite population, by different statistics. Each individual box summarizes 30 replicate simulations at the indicated prevalence; in each, 40 infected hosts were sampled every 30 days for ten years to collect parasite genomes from which diversity statistics were computed. Left, middle, and right columns show relationships when equilibrium parasite prevalence is varied as a function of the host clearance rate (γ, in blue), vector biting rate (b, in green) or number of vectors (N, in red). The variance explained in an ordinary linear regression (r2) is shown at the top left for each relationship, and in (b) the variance explained is shown across a larger panel of genetic diversity statistics and as a function of sample size. The variation in r2 as a function of sample size is indicated by point size. Slope of the line of best fit is indicated at left (+, increasing; -, decreasing). Boxplots for all statistics in (b) can be found in S5–S7 Figs.
Equilibrium relationships between genetic diversity and parasite prevalence.
(a) Boxplots showing the equilibrium parasite prevalence in hosts (x-axis) versus genetic diversity of the parasite population, by different statistics. Each individual box summarizes 30 replicate simulations at the indicated prevalence; in each, 40 infected hosts were sampled every 30 days for ten years to collect parasite genomes from which diversity statistics were computed. Left, middle, and right columns show relationships when equilibrium parasite prevalence is varied as a function of the host clearance rate (γ, in blue), vector biting rate (b, in green) or number of vectors (N, in red). The variance explained in an ordinary linear regression (r2) is shown at the top left for each relationship, and in (b) the variance explained is shown across a larger panel of genetic diversity statistics and as a function of sample size. The variation in r2 as a function of sample size is indicated by point size. Slope of the line of best fit is indicated at left (+, increasing; -, decreasing). Boxplots for all statistics in (b) can be found in S5–S7 Figs.A second major observation is that there are pronounced differences in the variance explained by PfPR when different epidemiological parameters drive variation in prevalence. In particular, varying host clearance rate (γ) results in a significantly higher r2 for all of the genealogy statistics, as well as for all of the IBD and IBS statistics, excluding mean IBS segment length. For example, the variance in nucleotide diversity explained by PfPR drops from 94% to 60% (for samples of size 40) when parasite prevalence is modulated by the number of vectors (N), instead of the host clearance rate (γ). This reduction in explanatory power is associated with a plateau in diversity at higher prevalence values observed when either the number of vectors (N) or biting rate (b) are increased. In contrast, the two statistics related to mixed infections have consistently high r2 values (>75%), regardless of which epidemiological parameter is driving prevalence variation.The r2 value is a reflection of the signal-to-noise ratio for different genetic diversity statistics, and may be influenced by sample size. Here, we find that for some statistics there is considerable benefit to collecting additional samples, whereas for others there is only little benefit (Fig 3b and S8 Fig). When prevalence change is driven by the number of vectors (N) or biting rate (b), the variance in the fraction of mixed samples explained by PfPR climbs from 0.81 to 0.94 as the sample size is increased from 20 to 100. For the same increase in sample size, the r2 of nucleotide diversity increases only from 0.62 to 0.68. The r2 of the mean IBD segment length also increases substantially with an increase in the number of samples collected (from 0.01 to 0.15). Here, a small sample size can result in a very high variance in mean estimates, due to the chance detection of rare, long IBD segments. We found that increasing from 20 to 100 samples reduced the variance in mean IBD segment length by up to 60% (S8 Fig).There are multiple sources for the noise that remains even after increasing sample size, related to the stochastic nature of both the epidemiological and genetic processes. Epidemiological noise means that prevalence fluctuates around its equilibrium value, and this can impact some diversity statistics. Noise on the genetic level can exist deep within the structure of the genealogical tree (nearer to the root), but also also more recently. To assess these contributions, we performed a variance decomposition analysis, determining what fraction of the variation in different genetic diversity statistics was occurring within- versus between- replicate simulations (S9 Fig). Since we collect parasite genetic data for 10 years for each replicate simulation, variation on a timescale longer than this will manifest as between- rather than within-simulation. Overall, we found that the majority of variation occurred between replicate simulations for the genealogy related statistics, indicating that they vary over long timescales (S9 Fig). Conversely, more than 90% of the variation in COI statistics existed within individual simulations, indicative of a short timescale of variation.
Non-equilibrium relationships between parasite prevalence and genetic diversity
An important application of our model is in settings where malaria control interventions are actively occurring. In such cases, and also in cases with seasonal variation in parasite prevalence, it is unlikely that the parasite population will be in equilibrium. Thus, our second aim with forward-dream was to explore which measures of genetic variation are most predictive of instantaneous PfPR in non-equilibrium settings.
Malaria control interventions
In order to understand how genetic diversity statistics relate to PfPR in scenarios where a malaria control intervention has been deployed, we developed a framework where individual forward-dream simulations pass through three distinct epochs: Initialise, Crash and Recovery (Fig 4). In the Initialise epoch, simulations are run until equilibrium at a parasite prevalence of 0.65, under the parameters listed in Table 1. At the start of the Crash epoch, one of either the host clearance rate (γ), the vector biting rate (b), or the number of vectors (N), is changed such that the new equilibrium PfPR is 0.2. The parameter change occurs incrementally over a period of thirty days following a logistic transition function, as to mimic the staged introduction of a malaria control intervention. As a consequence, the simulation leaves equilibrium, with host and vector prevalence declining. The Crash epoch is allowed to continue until the population has regained equilibrium. At the start of the Recovery epoch, we return the changed simulation parameter to its original value, again over thirty days following a logistic transition function. Again, this results in the simulation leaving equilibrium, with the PfPR increasing back to 0.65. As with the Crash epoch, the Recovery epoch continues until the population regains equilibrium. In summary, the three epoch model allows us to explore both a parasite population decline and rebound.
Fig 4
Responses of genetic diversity statistics to a crash and recovery in parasite prevalence.
(a) An individual forward-dream simulation where the number of vectors (N) is reduced at time zero (x-axis, indicated by grey vertical bar), resulting in a decline of parasite prevalence in hosts from 0.65 to 0.2. (b) Left column, the same simulation as in (a), but showing the response of three genetic diversity statistics (colored lines) to the prevalence change. Light and dark grey areas show host and vector prevalence, as in Fig 2. Right column, mean value of each genetic diversity statistics across 100 independent replicate simulations, with shading showing the 95% confidence interval. (c) Same simulation as in (a) at a later time, where the number of vectors is returned to its original value. Parasite prevalence increases back to 0.65. (d) Same as (b), but corresponding to the recovery shown in (c).
Responses of genetic diversity statistics to a crash and recovery in parasite prevalence.
(a) An individual forward-dream simulation where the number of vectors (N) is reduced at time zero (x-axis, indicated by grey vertical bar), resulting in a decline of parasite prevalence in hosts from 0.65 to 0.2. (b) Left column, the same simulation as in (a), but showing the response of three genetic diversity statistics (colored lines) to the prevalence change. Light and dark grey areas show host and vector prevalence, as in Fig 2. Right column, mean value of each genetic diversity statistics across 100 independent replicate simulations, with shading showing the 95% confidence interval. (c) Same simulation as in (a) at a later time, where the number of vectors is returned to its original value. Parasite prevalence increases back to 0.65. (d) Same as (b), but corresponding to the recovery shown in (c).Throughout the three epochs parasite genomes are collected every five days from twenty hosts selected at random, allowing the same suite of genetic diversity statistics discussed above to be followed through time. An immediate observation was that in individual simulations, the trajectories of genetic diversity statistics exhibited considerable noise, tending to fluctuate through time even in the absence of any change to simulation parameters (Fig 4). To better discern the average behaviour of different statistics, we sought to smooth their trajectories by computing a rolling mean. However, consistent with our variance decomposition analysis, we found that the timescale of random fluctuations differs greatly among genetic diversity statistics. Statistics such as the mean COI and the fraction of mixed samples fluctuate rapidly, and variation around their true mean could be substantially reduced (to <20% the original variance) by averaging the genetic data collected over roughly a month (S10 Fig). Yet we found that other statistics, such as nucleotide diversity, are more inertial in nature; they can trend up or down over very long time periods without any change to the underlying simulation parameters. Reducing their initial variance to an equivalent degree required averaging over 10 years worth of genetic data (S10 Fig).Thus, we took an alternate approach to extract underlying trends: we averaged the trajectories of each statistic across 100 independent, replicate forward-dream simulations (Fig 4b and 4d). Several observations emerged. First and consistent with population genetic theory, Tajima’s D increases in the period where PfPR is declining (indicative of a contracting N), and falls below zero in the period where PfPR is climbing (indicative of an expanding N) [31] (S11 Fig). Second, there are marked differences in the rate at which different genetic diversity statistics respond to changing prevalence. For example, the COI-related statistics respond faster than IBD or IBS related statistics, which in turn respond faster than nucleotide diversity. Finally, we noted that the rate at which a given genetic diversity statistics responds to a decline in PfPR may be different from the rate at which it responds to an increase in PfPR (Fig 4).We developed two metrics to summarize the temporal responses of different genetic diversity statistics to changes in parasite prevalence in our simulations (Fig 5, see also Materials and methods). Within the three epoch simulation framework, we could construct equilibrium distributions for each statistic before and after each prevalence change (i.e. equilibrium distributions for Initialise, Crash, and Recovery). Using these distributions, we computed a “detection time” (t), which we define as the amount of time, following an intervention, until a given genetic diversity statistic takes on values outside of its pre-intervention equilibrium. In effect, t is an estimate of how long it takes to detect that a change in parasite prevalence has occurred by monitoring a given genetic diversity statistic through time. We also computed an “equilibrium time” (t), which we define as the amount of time until a given genetic diversity statistic reaches its new, post-intervention equilibrium. Note that these metrics were designed to be informative summaries of the simulations only; it is unlikely they could be deployed in real-world settings.
Fig 5
Detection and equilibrium times of genetic diversity statistics following a crash in parasite prevalence.
(a) Each plot shows the behaviour of a genetic diversity statistic in an individual simulation through a crash in parasite prevalence, induced by: left column, increasing the host clearance rate (γ); middle column, reducing the vector biting rate (b); or right column, reducing the number of vectors. The intervention occurs at time zero (x-axis, grey vertical bar) in all cases. For each plot, the detection time (vertical dashed bar) and equilibrium time (vertical solid bar) of the genetic diversity statistic is indicated. Note that here a single simulation is shown for each intervention type. (b) Empirical cumulative density functions (ECDFs) of the detection and equilibrium times of diversity statistics, created from 100 independent replicate simulations for each intervention type. The y-axis gives the fraction of replicate simulations with a detection (dashed line) or equilibrium (solid line) less than the time indicated on the x-axis. Line color specifies the type of intervention. Open and closed circles give medians for the detection and equilibrium times, respectively. The first year is magnified for clarity.
Detection and equilibrium times of genetic diversity statistics following a crash in parasite prevalence.
(a) Each plot shows the behaviour of a genetic diversity statistic in an individual simulation through a crash in parasite prevalence, induced by: left column, increasing the host clearance rate (γ); middle column, reducing the vector biting rate (b); or right column, reducing the number of vectors. The intervention occurs at time zero (x-axis, grey vertical bar) in all cases. For each plot, the detection time (vertical dashed bar) and equilibrium time (vertical solid bar) of the genetic diversity statistic is indicated. Note that here a single simulation is shown for each intervention type. (b) Empirical cumulative density functions (ECDFs) of the detection and equilibrium times of diversity statistics, created from 100 independent replicate simulations for each intervention type. The y-axis gives the fraction of replicate simulations with a detection (dashed line) or equilibrium (solid line) less than the time indicated on the x-axis. Line color specifies the type of intervention. Open and closed circles give medians for the detection and equilibrium times, respectively. The first year is magnified for clarity.Fig 5b shows the empirical cumulative density functions for t and t across all considered genetic diversity statistics, following the host prevalence decline at the beginning of the Crash epoch. Consistent with our observations from the averaged trajectories, the mean COI and fraction of mixed samples had both the shortest detection times (median of 6–9 mo., depending on intervention) and equilibrium times ( 2–3 yrs). The statistic with the next shortest detection time was the number of segregating sites ( 1–2 yrs), however this statistic had a very long equilibrium time ( 40–75 yrs). The next two fastest statistics were the fraction IBD and mean IBS segment length. We note these statistics are highly co-linear within our simulations (S4 Fig, Pearson’s R ≥ 0.89), and had both fast detection and equilibrium times (detection 1–4 yrs, equilibrium 8–16 yrs). The least responsive statistic was nucleotide diversity, which had a median detection time of 3–6 years and a median equilibrium time of 45–75 years.We hypothesised that the fast detection time for the number of segregating sites was a consequence of starting from a relatively high prevalence (65%). In this context, the parasite population carries a large and relatively stable number of segregating sites (CV <10%), and so a significant reduction can be quickly detected. At the same time, IBD related statistics are less sensitive to changes that occur at high prevalence. To explore this further, we repeated the above experiment but declined from an initial prevalence of 30% to a post-intervention prevalence of 10%. Consistent with this hypothesis, the detection time for IBD fraction was faster than number of segregating sites in this context (S12 Fig).Finally, we sought to understand how the detection and equilibrium times we observed depended on the size of the population under consideration. To this end, we repeated the above experiment but with 400, 700, and 1000 hosts. For statistics related to the genetic genealogy, we found that the equilibrium times grew markedly with the host population size (S13 Fig). For example, the median equilibrium time for nucleotide grew from 28 years for a host population size of 400, to 75 years for a host population size of 1000. Equilibrium times for the mean IBS segment length and Fraction IBD statistics also increased with population size, but to a much lesser degree. Interestingly, the detection times exhibited little dependency on host population size.In terms of the relative behaviour of the statistics, the t and t values for the Recovery epoch were similar to that of the Crash epoch (S14 Fig), with mean COI and the fraction of mixed samples being the fastest statistics, and nucleotide diversity being the slowest. Overall, The median times tended to be longer, in particular for t. This is likely a consequence of the rate of diversity being re-established (by mutation) being slower than the rate of it being eliminated (by an intervention).
Seasonality
We next aimed to explore whether any measures of genetic diversity were responsive to changes in parasite prevalence driven by seasonality. To this end, we developed a simulation framework where the number of vectors oscillates between a peak reached in the wet season and trough reached in the dry season, with PfPR fluctuating between ∼0.6 and ∼0.2 (see Materials and methods). Vectors that die entering the dry season are selected at random, with the dry season lasting 170 days and the wet season lasting 195 days. Consistent with our results from the intervention analysis, we found that the fraction of mixed samples and mean COI showed clear correlations with seasonal change in prevalence (r2 = 0.62 for mean COI, r2 = 0.59 for fraction mixed samples; Fig 6). We found that the mean IBD and IBS segment lengths also exhibited weak correlations with seasonally varying prevalence (r2 = 0.12 and r2 = 0.06, respectively). However, compared to equilibrium patterns, the relationships are in the opposite direction—with an increase in mean segment length at higher prevalence values. This is likely driven by “epidemic expansion” in the early wet season—with the parasite population expanding faster than it acquires new mutations, resulting in increased IBD [32]. Consistent with this, we observed similar increases in mean IBD and IBS segment length in the first year of the Recovery epoch explored in the previous section (S15 Fig).
Fig 6
Responses of genetic diversity statistics to seasonal change in parasite prevalence.
Annual variation in parasite prevalence was induced by varying the number of vectors (top row, see Materials and methods). The behaviour of genetic diversity statistics for an individual simulation is shown in the left column from second row onwards. The mean behaviour of 10 independent replicate simulations is shown at middle, with shaded areas giving the 95% confidence intervals. Scatterplots at right show the relationship between each genetic diversity at parasite prevalence across the six years of seasonal fluctuation. Each point represents a genetic diversity estimate (y-axis) computed from sampling parasite genomes from 20 infected hosts in an individual simulation; parasite prevalence (x-axis) is computed across the entire host population at the same time. Data from all 10 replicate simulations has been aggregated. Variance explained r2 from an ordinary linear regression is indicated at top left.
Responses of genetic diversity statistics to seasonal change in parasite prevalence.
Annual variation in parasite prevalence was induced by varying the number of vectors (top row, see Materials and methods). The behaviour of genetic diversity statistics for an individual simulation is shown in the left column from second row onwards. The mean behaviour of 10 independent replicate simulations is shown at middle, with shaded areas giving the 95% confidence intervals. Scatterplots at right show the relationship between each genetic diversity at parasite prevalence across the six years of seasonal fluctuation. Each point represents a genetic diversity estimate (y-axis) computed from sampling parasite genomes from 20 infected hosts in an individual simulation; parasite prevalence (x-axis) is computed across the entire host population at the same time. Data from all 10 replicate simulations has been aggregated. Variance explained r2 from an ordinary linear regression is indicated at top left.Notably, none of the other genetic diversity statistics we calculated exhibited correlations with seasonal fluctuation in prevalence.
Discussion
The collection of parasite genetic data may, over the next few years, become a routine part of malaria surveillance. Yet, deriving the maximum benefit from this data will require an understanding of the relationships between malaria genetic diversity and epidemiology. Such understanding can be guided by modelling approaches, but only if both evolutionary and epidemiological processes are integrated. At present this is rare, as classical models in population genetics are poor approximations of the malaria life cycle, and classical epidemiological models don’t incorporate the evolution of parasites. To address this issue, we have combined the Ross-Macdonald and Moran models into a single framework, which we have implemented as a stochastic simulation called forward-dream.We have used forward-dream to investigate the relationships between parasite genetic diversity and parasite prevalence in equilibrium and non-equilibrium settings. We find that many measures of parasite genetic diversity correlate with parasite prevalence at equilibrium. Our findings align with existing empirical data [25, 32–35], and support the idea that the rate of mixed infection (and as a consequence the rate of recombination) is positively correlated with parasite prevalence [36]. Moreover, we find that, for a given human host population size, statistics that reflect the long-term effective population size (N) of the parasite, such as a nucleotide diversity and the number of segregating sites, also increase with equilibrium prevalence. We also explored the behaviour of these genetic diversity statistics in non-equilibrium settings, most importantly in response to changes in parasite prevalence that mimic malaria control interventions. Other authors have emphasised that the viability of a genomic approach to malaria surveillance will depend on how rapidly signals of epidemiological change become detectable in a reasonably sized sample of parasite genetic data [25]. We find that statistics related to the COI distribution respond most rapidly (on the order of months), whereas other statistics, such as nucleotide diversity, may take decades to respond to a change in parasite prevalence. These pronounced differences in response times are related to the type of events that must occur to change different statistics’ values. The COI distribution changes directly in response to individual infection events (on the timescale of the rate of transmission and clearance), whereas other statistics change in response to drift (on the timescale of N) and/or mutational events (on the timescale of μ).In addition to understanding the relationships of different genetic diversity statistics with prevalence, it is also important to understand sources of noise and the extent to which they can be controlled. The noise we observe in forward-dream has contributions from across the layers of our simulation: for a given set of epidemiological parameters, there will be variation in host prevalence due to the stochastic nature of infection and clearance events; for a given host prevalence, there will be variation in genetic diversity driven by the stochastic nature of drift, mutation and recombination events; and for a given population-level genetic diversity, there will be variation in sequencing data introduced by sampling a subset of all infected hosts. Moreover, on the genetic level, noise can be generated deep within the genealogical tree, related to the timings of coalescent events nearer to the root; and also in the more recent past, driven by more recent coalescent events. The extent to which different statistics are influenced by these sources of noise dictates the value of collecting additional samples, either as part of a cross-sectional survey or longitudinally. For statistics that are influenced by epidemiological noise (such as those derived from the COI distribution) and recent coalescent events (in particular mean length of IBD and IBS segments), increasing sample size can act to average over temporal fluctuations, or reduce sampling variation in a cross-sectional survey. In contrast, collecting additional samples in the present day does little to reduce variation driven by historical events and manifest in statistics like nucleotide diversity or Watterson’s Theta (θ). For these statistics, relatively few samples are required to produce good estimates, and additional longitudinal sampling in the short-term has little added value. Though not explored here, one way to average over such deep genealogical noise is by assessing a larger proportion of the genome.Our results also demonstrate how relationships between prevalence and genetic variation are sensitive to the underlying epidemiological process. Specifically, we find that changes to the host clearance rate had a more profound effect on several genetic diversity statistics than changes to either the vector biting rate or density. The statistics exhibiting this behaviour are all related to θ = N
μ. As we did not alter the mutation rate across simulations, we expect this observation is being driven by effects on N. Where a population’s size fluctuates through time, the N can be approximated as the harmonic mean of those sizes, and thus is more influenced by periods where the population is small [37]. Similarly, a P. falciparum lineage alternates between a large vector population and a small host population, and so one explanation for our observation is that the amount of diversity is impacted more by changes influencing the smaller host population. This result complicates efforts to use genetic variation metrics to compare parasite prevalence across space or time, as it implies that only under certain conditions will changes in prevalence be reflected by changes in genetic variation. Furthermore, it implies that it will be important to assess which modes of prevalence variation are observed more frequently in nature. This assessment can be aided directly by epidemiological data. For example, the empirical observation that the relationship between EIR and prevalence is log-linear suggests that variation in the number of vectors or biting rate may be a dominant epidemiological driver of geographical prevalence variation [38].forward-dream has several limitations, most obviously with respect to its simplicity and scale. With regards to simplicity for example, heterogeneous biting, acquired immunity, and migration are all phenomena that have been proposed to influence the rate of mixed infections [2], yet they are not included in forward-dream. We do not explore the effect of selection, though this may be relevant in many contexts, particularly in Southeast Asia where drug resistance is widespread [39]. With regards to scale, we simulate only one thousands hosts, and our genome model is limited to single chromosome of eight thousand sites. As a consequence, some genetic diversity statistics have absolute values that differ from existing empirical observations. Two examples include our estimate of Fraction IBD at low prevalence values, which is higher than observation [40] as a result of the inbreeding that may occur in the small populations we simulate; and our estimate of the number of segregating sites, which is low as a result of our much smaller genome.Many of these limitations could be addressed with the continued development of forward-dream, however, there are at least two salient considerations. The first is that increasing either complexity or scale will likely increase computational costs. Merging the Moran and Ross-Macdonald models resulted in forward-dream being more computationally expensive than either, and for most of the individual simulations in this study, run-times were on the order of several hours (S16 Fig). The majority of this was associated with the time required to bring forward-dream to equilibrium; a general problem faced by forward-time genetic models (see [41]). Reverse-time simulations harnessing coalescent theory can have greatly accelerated computational times by omitting processes extraneous to the sample of genetic data collected (for example [42]), however the reverse-time formulation of the model described here is yet to be elucidated. In a similar vein, the development of models separating epidemiological and genetic processes are underway [43], and could result in significantly faster simulations. Second, it is important to consider that more complex models typically require more parameters. Most likely these models will be both analytically intractable and statistically non-identifiable, thus making inference about their values impossible without additional and complex field experiments. Indeed, many of the parameters used within the present model have substantial uncertainty and were hard to find in current literature (see S1 Appendix). Community efforts to collate existing knowledge and address key uncertainties through experimental work would greatly benefit the field.The immediate value of forward-dream, as demonstrated here, is as a tool through which relationships between genetics and epidemiology can be explored and experimental and analytical strategies can be evaluated. Moreover, as methods for epidemiological inference from malaria genetic data are developed, forward-dream can provide a basis for assessing their expected performance and designing ideal sampling strategies under different epidemiological scenarios. We do not see forward-dream as likely powering such inference methods directly, for example through techniques such as Approximate Bayesian Computation (ABC) (reviewed in [44, 45]), because of the limitations described above. Rather, it will enable the identification of approaches to hypothesis testing and estimation that are insensitive to the specific values of a multitude of epidemiological parameters (for example, assessing relative changes in prevalence by monitoring COI). It is in these ways that forward-dream, and other simulations like it, can provide a platform for interpreting the signals within the projected tens of thousands of malaria genomes that will be collected over the next decade, and can help to leverage those signals for malaria surveillance.
Materials and methods
Collection of genetic data and computation of summary statistics
For all of the simulations in this manuscript, parasite genomes were collected from randomly selected infected hosts. For each host, we simulated DNA sequencing by taking a subset of all parasite genomes within the host (where ) such that each genome in was different from all others at a minimum of 5% of its sites; the assumption being that genomes more similar than this would not be readily distinguishable by sequencing. The COI of each host is then . The fraction of mixed samples and the mean COI is computed directly from the distribution of k across all sequenced hosts. To compute other statistics, we pooled all genomes collected across all hosts. The number of segregating sites, the number of singletons, nucleotide diversity, Watterson’s Theta (θ), and Tajima’s D were calculated using scikit-alllel (https://scikit-allel.readthedocs.io/en/stable/). We estimated identity-by-descent (IBD) profiles between pairs of parasite genomes by imposing a 2cM length threshold on contiguous segments of identity-by-state (IBS), an approach similar to that used by methods like GERMLINE [46].
Averaging of genetic diversity statistics across independent forward-dream simulations
To produce smoothed trajectories of genetic diversity statistics for the intervention analysis (Fig 4b and 4d), we averaged independent replicate forward-dream simulations. As forward-dream operates in continuous-time, parasite genetic data is never sampled at exactly the same time in independent simulations. Thus, to average simulations, we binned time into 25-day intervals. Finally, across all replicate simulations and for the entire duration of the intervention analysis, genetic diversity statistics computed within each 25-day bin were averaged.
Computing response time statistics t and t
We created two simple metrics to characterize the temporal response of different genetic diversity to changes in PfPR. These metrics were developed specifically in the context of the intervention experiments described in the Results section, and they are computed for individual simulations. The “detection time” (t) estimates the amount of time before a change in parasite prevalence would be detected, if that change was being monitored for using a given genetic diversity statistic. To compute it, we first construct a distribution for the genetic diversity statistic of interest at equilibrium. When considering a decline in parasite prevalence, this is achieved by recording the genetic diversity statistic’s value for 25 years proceeding the Crash epoch, during which time the simulation is at equilibrium (at a host prevalence value of 0.65). For an increase, the statistic is recorded for 25 years proceeding the Recovery epoch. t is then computed as the first time, after the prevalence chance has occurred, that three consecutive samples have a value for that statistic outside of the quantile interval [α/2, 1−α/2], with α = 0.01; i.e. the first time when three consecutive samples have a value that would be observed with a probability of less than 1% if the simulation were at equilibrium. Requiring that three consecutive samples (equivalent to approximately two weeks of genetic data) have values outside the interval makes t more robust to the high variability observed in individual simulations.Similarly, the “equilibrium time” (t) estimates the time until a given genetic diversity statistic regains equilibrium following a host prevalence change. t is computed as the first time that six consecutive samples have a value within the inter-quartile range ([α/2, 1−α/2], with α = 0.5), of the distribution of the statistic at its new equilibrium. Again, requiring six consecutive values within the inter-quartile range makes our estimates of t more robust to the high variability of individual simulations; we elected for six rather than only three samples as the criterion of being within the inter-quartile range is weaker than the t criterion.
Parameters for malaria control intervention and seasonality experiments
The complete parameter files (stored as ‘.ini’) used to specify the malaria control intervention and seasonality experiments are available on GitHub within the ‘params’ directory. All of these experiments began with the same set of parameters listed in Table 1 before individual parameters were changed to either mimic malaria control interventions or induce seasonality. To achieve a parasite prevalence of 0.2 during the Crash epoch of malaria control intervention experiments, either the host clearance rate (γ) was increased to 0.012, the vector biting rate (b) was reduced to 0.16, or the number of vectors (N) was reduced to 2050. In the Recovery epoch they were returned to their original values. To achieve an annually varying parasite prevalence in the seasonality experiment, the number of vectors (N) was oscillated between 10 during a dry season lasting 170 days, and 2800 in the wet season lasting 195 days.
Additional information on forward-dream implementation and parameterisation.
(PDF)Click here for additional data file.
Validating equilibrium host prevalence values in forward-dream.
The epidemiological layer of forward-dream implements the Ross-Macondald model, where the host prevalence is a function of the rate parameters (see Eq 1). Violinplots summarize the prevalence values observed in forward-dream simulations with expected equilibrium prevalence values varying from 0.1 to 0.8 (computed using Eq 1) given on the x-axis. The different equilibrium prevalence values were achieved by varying either the host clearance rate (γ), the vector biting rate (b), or the number of vectors (N). The variance explained (r2) in an ordinary linear regression is shown at top-left of each plot.(TIF)Click here for additional data file.
Validating intra-host fixation times in forward-dream.
(a) The infection of a single host is evolved through time and the within-host alelle frequency of a given site is indicated by the red line. The site fixes around day 125. The experiment is repeated 1000 times (grey lines) and the fraction of infections fixed at a given time is indicated by the blue line. All experiments started with an initial allele frequency of 0.5 and a drift rate of 1/event per day. (b) Distribution of fixation times from (a). The observed mean (64.81 days) is very close to the theoretically expected mean from the Moran model (64.56 days). (c) The experiment in (a) is repeated but with different initial allele frequencies (x-axis) and three different drift rates (light blue, dark blue, and green line). In all cases, the observed mean fixation times are close to the theoretically expected times. Shading gives 95% confidence intervals for mean estimates.(TIF)Click here for additional data file.
Varying equilibrium prevalence values in forward-dream.
The parameter values of forward-dream are varied to produce simulations with equilibrium parasite prevalence values varying from 0.1 to 0.8. (a) Varying the number of vectors. Prevalence in hosts indicated in red, vectors in blue. Dots mark parasite prevalence values of 0.1 through 0.8. (b) Varying the vector biting rate b. Note 1/b gives the average time between successive bites, show in right plot. (c) Varying the host clearance rate (γ). Note 1/γ gives the average duration of host infection, shown in right plot.(TIF)Click here for additional data file.
Co-linearity between different genetic diversity statistics in forward-dream.
Matrices of Pearson’s Correlation Co-efficient (R) calculated between all pairs of genetic diversity statistics is shown. In panel (a) host prevalence was tuned to different values between 0.1 and 0.8 by varying the host clearance rate (γ); (b) by varying the vector biting rate (b); or (c) by varying the number of vectors N. In all cases there is significant co-linearity between different genetic diversity statistics.(TIF)Click here for additional data file.
Equilibrium relationships between parasite prevalence and mixed infection related statistics.
Distributions of mixed infection related genetic diversity statistics (y-axis), plotted for equilibrium parasite prevalence values tuned to between 0.2 and 0.8 (x-axis) in forward-dream simulations. Left, middle, and right columns show distributions when parasite prevalence is varied as a function of the host clearance rate (γ, in blue), vector biting rate (b, in green) or number of vectors (N, in green). Each boxplot contains the result of 30 replicate experiments, where the parasite genomes within 40 randomly selected hosts are collected at every 30 days for 10 years and are used to compute the genetic statistic of interest. The variance explained by ordinary least squares regression is given at top left, and line of best fit and confidence intervals indicated in grey.(TIF)Click here for additional data file.
Equilibrium relationships between parasite prevalence and genetic diversity statistics related to the size and shape of the sample genealogy.
See S5 Fig for details.(TIF)Click here for additional data file.
Equilibrium relationships between parasite prevalence and genetic diversity statistics related to identity-by-descent patterns.
See S5 Fig for details.(TIF)Click here for additional data file.
Exploring the effect of sample size on the noise distributions of genetic diversity statistics.
(a) Shows the influence of sample size on the mean and standard deviation of different genetic diversity statistics. The mean and standard deviation of each genetic diversity statistics across all samples collected (over ten years from thirty replicate simulations) is shown, for three different prevalence levels (indicated by color). Prevalence was varied by changing the number of vectors (N), and the mean and standard deviation are normalised to their value at a sample size of twenty. Note how increasing sample size reduces the standard deviation of different statistics to varying degrees. (b) Change in r2 for increasing sample sizes. Each row is a different genetic diversity statistic and each column a different sample size (indicated at top by n). Individual points within scatter plots represent estimates of the given genetic diversity statistic from samples of the indicated size, after simulations have reached equilibrium. Shades of green indicate the equilibrium prevalence values of individual simulations. r2 values for all genetic diversity statistics can be found in Fig 3b.(TIF)Click here for additional data file.
Variance decomposition of genetic diversity statistics at equilibrium across a range of prevalence values.
(a) Trajectories of a set of four genetic diversity statistics over a ten year period at equilibrium, for three randomly selected replicate simulations (indicated by color). Each replicate simulation has the same host prevalence of 60%, achieved by tuning the number of vectors N. Marginal densities for the diversity statistics in each replicate simulation are shown at right. Note how for nucleotide diversity, there is substantial variation between replicate simulations; whereas for other statistics most variation is observed within individual simulations. (b) The fraction of the total variance occurring within- rather than between-individual replicate simulations is shown for all genetic diversity statistics. Each point represents a particular statistic (y-axis), varied epidemiological parameter (color) and equilibrium prevalence value (shade). Analysis is conducted over thirty replicate simulations for each epidemiological parameter and prevalence level.(TIF)Click here for additional data file.
Temporal fluctuations in genetic diversity statistics of different frequencies, without any change in parasite prevalence.
Panel (a) shows the noisy behaviour of three genetic diversity statistics (y-axis) from an individual simulation where parasite prevalence was kept fixed at 0.65 for a 25 year period (x-axis). The trajectory of each statistic was smoothed using a rolling mean, with window sizes varying from 1 day (1 d., purple), which is equivalent to no smoothing, up to 10 years (10 y., yellow). The mean of the statistic during the 25-year window is indicated with the grey horizontal bar. Notice how even with a 10-year window rolling mean, the nucleotide divesity still deviates from its mean value. (b) Across 100 independent replicate simulations, the reduction in variance of each genetic diversity statistic with increasing rolling mean window sizes is shown. The y-axis gives the ratio of the variance for the window size indicated by the x-axis (Var(X)) divided by the unsmoothed variance in the genetic diversity statistic (Var(X)). Increasing with window size of the rolling mean always reduces the variance, but at different rates for different statistics.(TIF)Click here for additional data file.
Average behaviour of Tajima’s D during a crash and recovery in parasite prevalence.
Colored lines show mean estimate across 100 replicate simulations, shaded area gives 95% confidence intervals. Notice how Tajima’s D increases during a population contraction and decreases during population growth.(TIF)Click here for additional data file.
Comparing the detection and equilibrium times of genetic diversity statistics following a reduction in the number of vectors from a high prevalence (65%) or low prevalence (30%).
(a) Distributions of detection and equilibrium times are shown for experiments where the prevalence change was from 30% to 10% (Low PfPR, blue) or from 65% to 20% (High PfPR, red). Each point represents one of one hundred replicate simulations for each experiment. In both cases, prevalence was changed by reducing the number of vectors. Black circle indicates median. (b) Histograms of detection time distributions for Mean COI (blue), Fraction IBD (yellow) and Number of Segregating Sites (green). Vertical dashed lines indicate medians. Notice how the Fraction IBD has a faster median detection time when decline starts at 30%.(TIF)Click here for additional data file.
Comparing detection and equilibrium times for varying host population sizes.
(a) The trajectories of Mean COI, Fraction IBD, and nucleotide diversity (π) are shown following a reduction in the number of vectors. Colored lines indicate average metric behaviour across 100 replicate simulations, with shaded area indicating standard error of the mean. Host population sizes of 1000 (red), 700 (blue) and 400 (green) are shown. Metrics are scaled to represent percentage of their pre-intervention mean. (b) Distributions of detection and equilibrium times for all statistics across 100 replicate simulations for each host population size. Median is indicated by black circle. Text at right gives median in years and parentheses give rank amongst all metrics for that host population size.(TIF)Click here for additional data file.
Detection and equilibrium times of genetic diversity statistics following a recovery in parasite prevalence.
(a) Each plot shows the behaviour of a genetic diversity statistic in an individual simulation through a recovery of parasite prevalence, induced by: left column, reducing the host clearance rate (γ); middle column, increasing the vector biting rate (b); or right column, increasing the number of vectors. The intervention occurs at time zero (x-axis, grey vertical bar) in all cases. For each plot, the detection time (vertical dashed bar) and equilibrium time (vertical solid bar) of the genetic diversity statistic is indicated. Note that here a single simulation is shown for each intervention type. (b) Empirical cumulative density functions (ECDFs) of the detection and equilibrium times of diversity statistics, created from 100 independent replicate simulations for each intervention type. The y-axis gives the fraction of replicate simulations with a detection (dashed line) or equilibrium (solid line) less than the time indicated on the x-axis. Line color specifies the type of intervention. Open and closed circles give medians for the detection and equilibrium times, respectively. The first year is magnified for clarity.(TIF)Click here for additional data file.
Zoom on response of average IBD track length to increase in prevalence at beginning of Recovery epoch.
Focus on IBD and IBS statistics during a four-year window around the beginning of the recovery. Notice how for a change in the number of vectors (N) there is an increase in average IBS and IBD segment track length at the beginning of the recovery, consistent with epidemic expansion.(TIF)Click here for additional data file.
Peak memory usage and runtime scaling for forward-dream simulations.
Left panel shows peak memory usage of individual forward-dream simulations brought to equilibrium at different prevalence values. For each prevalence value thirty replicate simulations are plotted; the same simulations analysed in 3. Right panel, same as left but showing run-times of individual simulations (run on 2.6GHz Intel Ivybridge CPUs with 15Gb RAM). Mean run times for each prevalence value are shown at right.(TIF)Click here for additional data file.
Genetic diversity statistics computed in forward-dream.
(PDF)Click here for additional data file.30 Oct 2020Dear Dr Hendry,Thank you very much for submitting your manuscript "Elucidating relationships between P.falciparum prevalence and measures of genetic diversity with a combined genetic-epidemiological model of malaria" for consideration at PLOS Computational Biology.As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.When you are ready to resubmit, please upload the following:[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).Important additional instructions are given below your reviewer comments.Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.Sincerely,Alex PerkinsAssociate EditorPLOS Computational BiologyVirginia PitzerDeputy EditorPLOS Computational Biology***********************Reviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: General Comments:I really enjoyed reading this paper and think it is a welcome addition to the slowly growing number of malaria transmission models looking at understanding parasite genetics. The study is well set out and explores commonly used metrics to explore the dynamics and utility of different metrics. I do have a few major comments I would like to see addressed and then some smaller minor comments.Major CommentsCOI vs Prevalence RelationshipThe COI prevalence relationship seems too low (i.e. not high enough COI at higher prevalences). Looking at estimates derived from THE REAL McCOIL (RMCL) we see evidence in Uganda of mean COI equal to ~7, ~5, and ~2.5. These are in three study sites in Uganda in the original THE REAL McCOIL study (see last table in their SI). RMCL has also been used to estimate COI in 5 countries in https://www.nature.com/articles/s41467-020-15779-8 and the mean COI estimated (see Supp Figure 3) also shows mean COI at least above 2. Also there is good literature on MOI by msp2 by prevalence (Fig 3 in https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0164054), which has been updated in https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msaa225/5902837. However, I think it would be easy enough to parameterise your model to this relationship.IBDNot sure how you are able to work out IBD if you are only tracking loci as 0 and 1. As you note, you are measuring identity by state. I would recommend not calling it IBD as it is incorrect and should be called IBS.If you want to track IBD though it should be easy enough. Two approaches. Rather than encoding your genomes as 0/1 encode your loci with integers. (you may need larger than int8 for 400 individuals). Run your simulation until equilibrium prevalence and then replace each individual’s strains with integers, i.e. the first infected individual, all loci are encoded as 1, the next infected individual all loci are 2 etc all the way up to n. Then you are able to track ancestry and IBD relative to this time point. If you also store the genetic population at the time of replacing all your loci in this way, you can then use this as a look up table to work out the actual alleles. Then for mutations, when a mutation occurs, assign that compartment as the next integer value (n+1) and store this in your genetic population table (as well as that integer n+1 is a mutant integer and has the same identity as whichever integer the compartment was before the mutation event). This should then allow you to work out true IBD.Alternatively, define a population level allele frequency for your 1000 loci. Then draw from those frequencies when seeding your population. Then you could estimate IBD using something like hmmIBD or deploidIBD and report that.As it is currently, however, I do not agree with it being referred to as IBD and without seeing the comparison between IBD as above and IBS, I do not agree that we know trends in IBD will track with trends in IBS.Lastly, it is a shame that prevalence less than 20% was not explored as this is where I would expect measures of actual IBD to be most interest (For example, figure 4 in the isorelate methodological paper https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1007279 shows that it’s only really in SE Asia that you start observing higher IBD sharing). Alternatively, looking at within host IBD comparisons (Figure 4c in the deploidIBD paper https://elifesciences.org/articles/40845#s2) shows the importance of these measures generally at lower transmission intensities. Perhaps worth thinking about if you do look at IBD.(Also I think it would be fine to say that it is not possible to look at IBD at this stage of fwd-dream, instead reporting IBS statistics and then discuss that this is something to be looked at in future versions).Superinfection Sub CompartmentsThe use of sub-compartments to reflect parasites in hosts seems strange or I may have misunderstood something. Why does super infection have to replace previous parasites? I am not aware of any research suggesting that new infections cause old infections to be eliminated. Perhaps it might increase intra-host competition that would drive down the parasite density of older infections more quickly.Related to sub-compartments. Why do you choose to always replace probabilistically half the genotypes in a superinfection? Should this not depend on the level of cotransmission and the genetic diversity of the incoming sporozoites? As well as perhaps the age/parasite density of present infections? For example, if the vector causing the infection was infected by an individual with a COI of 1, then only one genotype would be introduced but this one genotype being introduced could be causing a large reduction in COI to occur by replacing multiple genetic variants in an infected host that could have had previously a high COI. I wonder if this is one of the reasons for the low COI observed from the model as you are making the assumption that previous strains are lost upon a superinfection event.Timescales for metrics changing and crash dynamicsThe discussion of timescales for metrics changing in response to changing prevalence is a really nice way of judging their utility. However, the reported time for metrics to change is somewhat dependent on the effective population size. You have quite a small population size and so I agree COI and other coinfection metrics will change quickly and relatedness metrics will change more slowly, the specific speed of this is really dependent on the effective population size. It would be good to redo this but with a larger and smaller population size (though may need to be 2 larger population sizes to make sure you can look at the same prevalence range).The crash dynamics are quite strong, i.e. a large reduction in a short time span. This could be similar to maybe a period of an effective mass drug administration campaign. However, I would be interested to see the variance change in the metrics in response to a more gradual intervention campaign (similar to scale up of bed nets etc.) whereby we see prevalence change by say 30% absolute over 10 years? How useful are each metric in this more subtle change in transmission intensity and how reliably would these metrics be able to detect these changes?Other CommentsThroughout C.O.I. has been used as opposed to COI which is more common in the literature for complexity of infection. Would recommend replacing in order to align with the literature.In the introduction you mention the model developed for the Daniels et al Senegal study. I agree this is a bespoke model developed for a specific application and is not currently suitable in its current form for looking at other transmission dynamics. However, very recently a new transmission model for simulating malaria genetics was published - https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msaa225/5902837. This appears to address most of the requirements you note as necessary of an integrative genetic-epidemiological model. It would be good to reference this and also to then state the improvements/differences that forward-dream makes.Line 61: Replace ‘so-called “compartment-based”’ with simply ‘compartment-based’Drift model could also be cause of low COI. Perhaps this is also another way to parameterise the COI distribution by reducing the frequency of drift events. Also on the parameter estimate chosen for drift events, there may be some data you could use to estimate this perhaps. There are studies where infected individuals have had their parasites sequenced at various time points throughout an infection, often to test for parasite clearance vs reinfection events. I wonder if this information could be used to estimate observed COI in an individual over time, which could help identify drift event frequencies (in my head the time taken for drift events to remove a parasite lineage from a mixed infection could be the same as the time taken for a newly infected individual with a COI equal to 2 to be detected as a monoclonal).Fig3b where is the biting rate variation for tajimas and no. singletons?I really enjoyed the discussion in the Supplementary on the “Probability a given parasite genome passes through the host-to-vector (ph) or vector-to-host bottleneck (pv)” about the methodological implications of bottlenecks at the vector-to-host bottleneck. I agree that with high sporozoite counts that it is likely that the majority of genetic diversity in the mosquito will be passed on. Might be worth though highlighting that if the SMFA results of ~5 sporozoites being passed on and defined by a geometric distribution is the correct distribution, then actually a 20% bottleneck will have substantial implications on diversity passed on as we would expect over 50% (cumulative density underneath geometric distribution with mean of 5 after removing zero counts as we are assuming infection is occurring) of infections to have between 1 and 5 sporozoites.The parameterisation of prevalence is conducted using biting rate, vector abundance and clearance rate. I would think that vector abundance and clearance rate are the more realistic two rather than biting rate. However, these are only explored in parallel. When vector abundance increases and there is increased immunity in the population this then leads to an increase in clearance rate. Not sure it will change the results but would be interested to see how the dynamics in Figure 3b would look when both of these are changed at the same time.As an aside, it would be interesting to see how forward-dream could be sped up and population size increased. Having gone through the code the first thing that springs to mind re memory is that you have your full human and vector model state as one 3 dimensional array, a lot of which will not be being used when individuals are susceptible. Would be great to see if this could be improved for future versions so that spatial dynamics etc could be explored between multiple populations etc.Reviewer #2: The manuscript is an attempt by the authors to fill an important gap in the malaria field – translating epidemiologic setting to expected genomic data by linking an epidemiologic and genomic model, the latter including within host as well as transmission elements. A useful model in the space would be quite welcome and have numerous immediate practical applications. The model set up and outputs are clearly described and the manuscript is overall well written, particularly given the potential layers of complexity. While all models are abstractions of reality, they need to at least recover the main features of key relationships have any utility. Unfortunately, I am not convinced that the authors have surpassed this bar, limiting my enthusiasm for what is otherwise a well-performed piece of work.Major comments:1) The stated goal of fwd-dream is to provide a basis for assessing the expected performance of methods for inferring transmission intensity from genetic data, and designing sampling strategies for different scenarios (paraphrased from lines 464-7 at the end of the discussion). To have any confidence in utility towards these goals, we would ideally see some model calibration and at a bare minimum at least some very basic “sanity checking” of model outputs with available empiric data on relationships between key transmission parameters and genetic outputs. Unless I have missed it, I do not see any attempt to do this in the manuscript and in fact some of the results presented appear to be red flags. In particular, the percentage of mixed infections and COI (both metrics which are widely, if not systematically, available in the literature) appear to be gross underestimates of what is empirically observed for a given level of parasite prevalence. This first becomes apparent in figure 2, where at 40% parasite prevalence we see only a ~5% prevalence of mixed infections which is much lower available empiric data I am familiar with. In figure 3 (and associated supplemental data) we see COI increase to a maximum of ~1.5 at 80% parasite prevalence, in which setting one might expect much higher COI in most of the population even just based on first principles i.e. nearly everyone is infected, most people are semi-immune and as such have long, untreated infections with considerable superinfection and coinfection. Because these relationships are not believable, it is difficult to know what to make of the other ones, or to be convinced of any of the conclusions regarding the different epidemiologic parameters having differential effects on genetic outputs. The result that metrics of within host diversity would change faster than population level metrics is probably true but does not seem an open question to me given that infections last on the order of one year and most population genetic parameters would be expected to change at a much slower rate.It is difficult to say if this is a fundamental flaw of the relatively simple model structure or a function of the specific parameters and limited population size/number of years for “burn-in” evaluated (which may derive from the amount of time needed to run the model). Regarding model structure, potential problems include not incorporating immunity, treatment, or age structure, heterogeneity, a limited and fixed number of within host compartments, and if I’m reading things correctly clearing all infections simultaneously as opposed to having independent clearance times for different parasites. Re: run-time parameters, authors could consider a introducing a larger number of diverse parasites at the start instead of 10 identical parasites, allowing for more than 10 within host compartments (which seems like a good number but not if you are replacing half of them with every superinfection and trying to incorporate drift) a much larger host population size (400 seems quite low), running for hundreds or thousands of years instead of 25, etc.2) Another major limitation of the current model is that it only “works” at transmission levels consistent with 20% or greater parasite prevalence, and many of the metrics of interest to the community, e.g. those based on IBD, may only be expected to show relevant signal in areas of lower transmission.Minor comments:3) The authors discuss a metric of IBD throughout the manuscript but I believe what they have actually calculated is IBS. This should be called IBS unless it is actually IBD being calculated.4) Some terms such as PR and EIR should be properly defined for clarity.5) Some discussion of how the model could be fitted to real data and used to infer parameters, if this is the goal, should be included.6) The authors should consider justification of the simplicity of the transmission model, given the considerations above.Reviewer #3: Hendry et al. tackle a computational and theoretically difficult problem that has rarely been thoroughly addressed before. They aim to link a population-genetic model summarizing within-host genetic variation with a Ross-MacDonald epidemiology/ecological model. Notably, they use the model to study the impact of potential ecology and interventions on common genetic summary statistics. The goal of the model is not directly inference, but is instead an important contribution to our mechanistic and qualitative understanding of a complicated system.Additionally, one of the most difficult and vague parts of related models is where the parameter values come from. The authors put together a thorough and easy to read summary with many citations and tables in the supplement, which itself is a useful contribution for other papers.Major comments1. I appreciate model-based work as a means to understand general qualitative principles, but a main argument the authors make (especially in the introduction) is the need for inference and application to data. While data analysis is beyond the scope of the paper, it would be really beneficial to have a deeper discussion on the types of data, sampling schemes, inference methods, etc to make the model useful to empirically-driven investigators.2. how does the population size of 400, 100 replicate simulations, or sample of 20 individuals influence results? It seems like a computational decision rather than a biological one? The small sample size is particularly confusing given the high noise. While it is informative to try mimic sampling in data, it would be nice to compare the summary statistics calculated on the full 400 indvs to see where the stochasticity is coming from. Indeed, the authors close the discussion suggesting the tool be used when designing experiments—one of the biggest questions I could foresee the simulations helping with is choosing sample size to sequence under a limited budget, yet no attention is paid to sampling. For example, in Fig 3A, how much of the variation comes from the 30 replicate simulations and how much comes from subsamples of only 20 individuals?3. Figure 3 shows the equilibrium behavior as a function of parameters that are known to vary or humans can directly influence. However, it is also important to understand how parameter choice for other parameters influences model outcomes and if for example error in one of those produce a signal of similar strength as intervention/biologically varying parameters. Said another way, how much does it matter if parameters such as recombination rate are inaccurate?Minor comments•The approach used to study equilibrium behavior is interesting, namely matching the parasite prevalence rather than perturbations in the parameters. I agree this in some ways makes comparisons more equal and interpretable, but it seems important to know the parameter values (perhaps as supplementary information) that are required to produce such parasite prevalences. Are they reasonable?•Brief summary of the recombination rather than citation of their previous paper would help.•A number of the intervention timings seem quite dependent on parameters chosen. This would be worth discussing•DNA sequencing seems to be simulated in a way that only decreases diversity/increases IBD by not differentiating similar strains, but doesn’t account for processes in the other direction such as sequencing error. Is this because with the 5% cutoff similarity, most error would not produce a sequence called as a new strain? How much of an effect on summary statistics does the initial sequencing process make?**********Have all data underlying the figures and results presented in the manuscript been provided?Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology
data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.Reviewer #1: YesReviewer #2: YesReviewer #3: None**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: NoReviewer #2: NoReviewer #3: NoFigure Files:While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at .Data Requirements:Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.Reproducibility:To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see31 Mar 2021Submitted filename: r2r_fwd-dream.docxClick here for additional data file.10 May 2021Dear Dr Hendry,Thank you very much for submitting your manuscript "Elucidating relationships between P.falciparum prevalence and measures of genetic diversity with a combined genetic-epidemiological model of malaria" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.When you are ready to resubmit, please upload the following:[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).Important additional instructions are given below your reviewer comments.Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.Sincerely,Alex PerkinsAssociate EditorPLOS Computational BiologyVirginia PitzerDeputy Editor-in-ChiefPLOS Computational Biology***********************A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:[LINK]Reviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: Thank you to the authors for their revised manuscript. It is much improved and encouraging to see the speed increase. I still have some major concerns though about how accurate the model is in terms of replicating malaria epidemiology and consequently some of the genetic relationships observed. I understand the model tries to be simple (no heterogeneity/immunity) and is still useful for detailing the mean behaviour. However, as a result it is important that the model reflects current evidence and the literature in terms of replication the observed relationships between malaria prevalence, EIR, COI and IBD.The following are major comments that need to be adequately addressed.1. The COI vs prevalence relationship has been improved but is still not sufficiently high. The author's response rebutting the validity of the RMCL at accurately estimating COI at high COIs is not to me fair. The authors claim it struggles greater than 4, but that is for COIL - the performance of RMCL is notably more capable at resolving COIs >4, but I will agree that it will struggle at accurately determining say COI > 8. The comments about the prior are not relevant - a flat prior is used because although we know COI is very unlikely to be as high as 25, we know that they can be high (because biting heterogeneity is very overdispersed) and so the flat prior is simplest and unlikely to be altering the inferred COI because it is by definition an uninformative prior. And although RMCL does not have the power to distinguish well between high COIs, it is not acceptable to use this as an argument to have parameterised the COI vs prevalence relationship to have the mean COI still only equal to 2 at 80% prevalence. For example, msp2 MOI literature I referenced in my initial review shows this. Also your comment about DEploidIBD showing lower COIs is because DEploidIBD fails with COIs above 4. For evidence of this, see the recent single cell analysis from Malawi here https://www.sciencedirect.com/science/article/pii/S1931312819306304. In fact, RMCL run on these same samples actually gives higher COIs than DEploidIBD. The relationship between prevalence and COI should still be higher and should scale exponentially given the relationship between EIR and prevalence.2. EIR outputs. Please could EIR vs prevalence relationships be shown. The relationship between EIR and prevalence is well studied and should be approximately log-linear (https://malariajournal.biomedcentral.com/articles/10.1186/s12936-015-0864-3). I don't think this will be true for a number of the simulations you have done in which prevalence is achieved by altering the clearance rate and keeping the biting rate low. I want to see this relationship as although you have shown that you can recreate different prevalence settings by changing one of three parameters, we have good evidence to know that the biting rate of mosquitoes is unlikely to be as low as 1 feed per 6 days and is closer to 2-3 days climate depending. I think if you constrain this parameter you should be able to identify a set (or range of sets) of the other two that give a suitable EIR vs prevalence relationship as well as a suitable COI vs prevalence relationship (if you don't want to believe the RMCL one then the msp2 literature).3. Could you plot the relationship between mean pairwise IBD fraction of samples within host against the proportion of mixed infections. I.e. what does it look like in comparison to Figure 4c in the DeploidIBD paper - https://elifesciences.org/articles/40845. The population level pairwise IBD seems very high. At 20% prevalence the Fraction IBD seems close to 0.25, which would mean that on average each parasite is a half sibling of a randomly chosen parasite in the population. That cannot be correct. Have I misunderstood? Is Fraction IBD the fraction of the genomes in mixed infections that are IBD? If it is the population level mean IBD, then this needs to be again addressed to better match relationships observed for IBD. There are a number of studies that have used hmmIBD to estimate population level IBD which would be a starting point - collecting these and plotting the relationship between these and malaria prevalence and compare your relationship (which I thinks shows that fraction IBD is too high).---Thank you for the changes made so far and it is encouraging to see the speed gains that have been made so far. Also thank you for keeping your code and analysis notebooks public - has made it much easier to understand what is going on. I look forward to reviewing the manuscript again with these comments addressed.Reviewer #2: The authors have gone to some length to improve the model performance, analysis, and interpretation. The manuscript is improved and I have no remaining major comments. It is still difficult to know to what degree the method will be useful in exploring any quantitative relationships as opposed to broad, qualitative ones, but the authors defer more quantitative evaluation / calibration for future work. The authors now state directly in the discussion section that the goal of the model at least at present is not e.g. for ABC type estimation but for broad stroke evaluation of relationships, and they have done a sufficient job to justify that use.If the authors do embark on more quantitative analysis in the future I still feel somewhat comfortable in stating that available empiric data in the literature suggest that the relationship with COI is substantially underestimated with the current model/parameters, hindering potential inference. The authors state in response to Reviewer 1 that the majority of estimates have been using The Real McCOIL and suggest there may be issues with overestimation with this method, but this is a recent approach based on availability of SNP data and for many years the most common way of evaluating within-host diversity (and likely representing the majority of existing literature) has been through genotyping of a limited number of polymorphic markers (initially via length polymorphism of antigens and microsatellites and more recently via amplicon sequencing). These methods may still over or underestimate COI, but most available data including more recent amplicon sequencing data suggest much higher COI. A few references on quick search:Recent data using amp seq, see supplemental fig 3a and 3b: https://pubmed.ncbi.nlm.nih.gov/31819062/Using microsats, mean 3.4: https://pubmed.ncbi.nlm.nih.gov/32246127/Table 4 here has some refs: https://pubmed.ncbi.nlm.nih.gov/33172455/Table S1 here has some refs: https://pubmed.ncbi.nlm.nih.gov/27711149/Reviewer #3: The authors have spent considerable time and thought putting together a reasonable and thorough response to reviewer requests. I agree with them that their model, while (as always) an imperfect representation of the real world, is a big improvement to the current available models. More detailed models are susceptible to many assumptions based on processes and parameters we have poor estimates for. They also demonstrate biologically-relevant information that can be learned from their model.**********Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: YesReviewer #2: YesReviewer #3: None**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: NoReviewer #2: NoReviewer #3: Yes: Amy GoldbergFigure Files:While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.Data Requirements:Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.Reproducibility:To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocolsReferences:Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.23 Jun 2021Submitted filename: r2r_r2_fwd-dream.docxClick here for additional data file.19 Jul 2021Dear Dr Hendry,We are pleased to inform you that your manuscript 'Elucidating relationships between P.falciparum prevalence and measures of genetic diversity with a combined genetic-epidemiological model of malaria' has been provisionally accepted for publication in PLOS Computational Biology.Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.Best regards,Alex PerkinsAssociate EditorPLOS Computational BiologyVirginia PitzerDeputy Editor-in-ChiefPLOS Computational Biology***********************************************************Reviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: I feel the authors have done sufficient work to demonstrate that the model developed is a helpful tool. Personally, I feel it would be more helpful for the authors to present model defaults that better capture certain relationships observed but I am also more than happy for that to remain as an academic differing of opinions and it is not one that should prevent this paper being published. I am very happy that they have taken the time to explain that the tool is more useful for exploring relationships in a qualitative fashion rather than being used to provide quantitative predictions of specific metrics and detailing that some of this results from the population sizes that are explored by the model. As such I am satisfied with their revisions. Once again thank you for making all code and analysis scripts open and easy to review.**********Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: Yes**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: No16 Aug 2021PCOMPBIOL-D-20-01626R2Elucidating relationships between P.falciparum prevalence and measures of genetic diversity with a combined genetic-epidemiological model of malariaDear Dr Hendry,I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!With kind regards,Olena SzaboPLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
Authors: Benjamin C Haller; Jared Galloway; Jerome Kelleher; Philipp W Messer; Peter L Ralph Journal: Mol Ecol Resour Date: 2019-02-21 Impact factor: 7.090
Authors: Sha Joe Zhu; Jason A Hendry; Jacob Almagro-Garcia; Richard D Pearson; Roberto Amato; Alistair Miles; Daniel J Weiss; Tim Cd Lucas; Michele Nguyen; Peter W Gething; Dominic Kwiatkowski; Gil McVean Journal: Elife Date: 2019-07-12 Impact factor: 8.140
Authors: Peter W Gething; Anand P Patil; David L Smith; Carlos A Guerra; Iqbal R F Elyazar; Geoffrey L Johnston; Andrew J Tatem; Simon I Hay Journal: Malar J Date: 2011-12-20 Impact factor: 2.979
Authors: Melissa A Penny; Nicolas Maire; Caitlin A Bever; Peter Pemberton-Ross; Olivier J T Briët; David L Smith; Peter W Gething; Thomas A Smith Journal: Malar J Date: 2015-10-05 Impact factor: 2.979
Authors: William L Hamilton; Roberto Amato; Rob W van der Pluijm; Christopher G Jacob; Huynh Hong Quang; Nguyen Thanh Thuy-Nhien; Tran Tinh Hien; Bouasy Hongvanthong; Keobouphaphone Chindavongsa; Mayfong Mayxay; Rekol Huy; Rithea Leang; Cheah Huch; Lek Dysoley; Chanaki Amaratunga; Seila Suon; Rick M Fairhurst; Rupam Tripura; Thomas J Peto; Yok Sovann; Podjanee Jittamala; Borimas Hanboonkunupakarn; Sasithon Pukrittayakamee; Nguyen Hoang Chau; Mallika Imwong; Mehul Dhorda; Ranitha Vongpromek; Xin Hui S Chan; Richard J Maude; Richard D Pearson; T Nguyen; Kirk Rockett; Eleanor Drury; Sónia Gonçalves; Nicholas J White; Nicholas P Day; Dominic P Kwiatkowski; Arjen M Dondorp; Olivo Miotto Journal: Lancet Infect Dis Date: 2019-07-22 Impact factor: 71.421
Authors: Standwell C Nkhoma; Shalini Nair; Salma Al-Saai; Elizabeth Ashley; Rose McGready; Aung P Phyo; François Nosten; Tim J C Anderson Journal: Mol Ecol Date: 2012-11-02 Impact factor: 6.185
Authors: Ananias A Escalante; Marcelo U Ferreira; Joseph M Vinetz; Sarah K Volkman; Liwang Cui; Dionicia Gamboa; Donald J Krogstad; Alyssa E Barry; Jane M Carlton; Anna Maria van Eijk; Khageswar Pradhan; Ivo Mueller; Bryan Greenhouse; M Andreina Pacheco; Andres F Vallejo; Socrates Herrera; Ingrid Felger Journal: Am J Trop Med Hyg Date: 2015-08-10 Impact factor: 2.345
Authors: Oliver J Watson; Lucy C Okell; Joel Hellewell; Hannah C Slater; H Juliette T Unwin; Irene Omedo; Philip Bejon; Robert W Snow; Abdisalan M Noor; Kirk Rockett; Christina Hubbart; Joaniter I Nankabirwa; Bryan Greenhouse; Hsiao-Han Chang; Azra C Ghani; Robert Verity Journal: Mol Biol Evol Date: 2021-01-04 Impact factor: 16.240
Authors: Angela M Early; Flavia Camponovo; Stéphane Pelleau; Gustavo C Cerqueira; Yassamine Lazrek; Béatrice Volney; Manuela Carrasquilla; Benoît de Thoisy; Caroline O Buckee; Lauren M Childs; Lise Musset; Daniel E Neafsey Journal: Proc Natl Acad Sci U S A Date: 2022-07-22 Impact factor: 12.779