Tyler S Brown1,2, Olufunmilayo Arogbokun3, Caroline O Buckee1, Hsiao-Han Chang1,4. 1. Center for Communicable Disease Dynamics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America. 2. Infectious Diseases Division, Massachusetts General Hospital, Boston, Massachusetts, United States of America. 3. Infectious Disease Epidemiology and Ecology Lab, University of North Carolina School of Medicine, Chapel Hill, North Carolina, United States of America. 4. Institute of Bioinformatics and Structural Biology, National Tsing Hua University, Hsinchu City, Taiwan.
Abstract
Measuring gene flow between malaria parasite populations in different geographic locations can provide strategic information for malaria control interventions. Multiple important questions pertaining to the design of such studies remain unanswered, limiting efforts to operationalize genomic surveillance tools for routine public health use. This report examines the use of population-level summaries of genetic divergence (FST) and relatedness (identity-by-descent) to distinguish levels of gene flow between malaria populations, focused on field-relevant questions about data size, sampling, and interpretability of observations from genomic surveillance studies. To do this, we use P. falciparum whole genome sequence data and simulated sequence data approximating malaria populations evolving under different current and historical epidemiological conditions. We employ mobile-phone associated mobility data to estimate parasite migration rates over different spatial scales and use this to inform our analysis. This analysis underscores the complementary nature of divergence- and relatedness-based metrics for distinguishing gene flow over different temporal and spatial scales and characterizes the data requirements for using these metrics in different contexts. Our results have implications for the design and implementation of malaria genomic surveillance studies.
Measuring gene flow between malaria parasite populations in different geographic locations can provide strategic information for malaria control interventions. Multiple important questions pertaining to the design of such studies remain unanswered, limiting efforts to operationalize genomic surveillance tools for routine public health use. This report examines the use of population-level summaries of genetic divergence (FST) and relatedness (identity-by-descent) to distinguish levels of gene flow between malaria populations, focused on field-relevant questions about data size, sampling, and interpretability of observations from genomic surveillance studies. To do this, we use P. falciparum whole genome sequence data and simulated sequence data approximating malaria populations evolving under different current and historical epidemiological conditions. We employ mobile-phone associated mobility data to estimate parasite migration rates over different spatial scales and use this to inform our analysis. This analysis underscores the complementary nature of divergence- and relatedness-based metrics for distinguishing gene flow over different temporal and spatial scales and characterizes the data requirements for using these metrics in different contexts. Our results have implications for the design and implementation of malaria genomic surveillance studies.
Measuring the extent to which malaria parasite populations are linked across different geographic locations (“connectivity”) can provide important guidance for the design and implementation of malaria control interventions. Multiple approaches are available for this task, including methods that measure human host mobility (using census-derived [1] or mobile phone-associated mobility data [2, 3]) as a proxy for parasite migration and genomic surveillance strategies that directly measure parasite gene flow between locations.Recent studies have combined these two approaches, using both human mobility data and estimated parasite gene flow to examine previously undescribed aspects of malaria spatial epidemiology [3] (for example, source-sink dynamics in endemic areas of Bangladesh [2]). Other recent work has focused on estimating the probability that a pair of genomes are identical by descent (IBD) at a particular locus [4] and using population-level summaries of IBD estimates to assess gene flow between malaria populations [5]. These relatedness-based approaches have uncovered spatial structure in malaria populations at small, local scales, where conventional methods using differentiation-based estimators (including the fixation index, FST) fail to identify population structure [5]. IBD-based measures capture variation due to recombination and as such may be well suited for measuring more recent gene flow between malaria populations (in which mutation rates are relatively slow, but variation due to recombination can accrue more quickly).These studies and others evaluate for the presence or absence of spatial structure by evaluating for isolation by distance, i.e. by testing whether estimated pairwise differentiation or relatedness correlates with inter-location distance (for example, see [6, 7]). Importantly, methods based on isolation by distance may be difficult to interpret if migration is not spatially coherent, i.e. if human and/or parasite migration rates do not consistently correlate with distance, as has been observed in multiple contexts [8-10]. Another common approach used in malaria genomic surveillance studies classifies two malaria populations as somewhat connected (versus not connected) if the confidence interval of their pairwise FST value does not include zero [11-14]. This binary approach may have limited value in situations where ranking weakly versus strongly connected populations is important. From a practical standpoint, important use cases for gene flow estimation include ranking different levels of connectivity between locations (“Are the malaria populations in locations i and j linked by higher or lower connectivity than those in locations k and l?”) and classifying location pairs by their relative levels of connectivity (“Among the malaria populations in locations i, j, k, and l, which of the six total location pairs are most highly connected?”). Isolation by distance can be considered a special case of ranking, in which the ranks of estimated gene flow are compared to the ranks of inter-location geographic distances.The properties of various estimators of FST have been studied extensively and there is a rich and longstanding literature on IBD [15]. However, there is currently only limited guidance on the size [5] and type [16] of genetic data needed for measuring gene flow between malaria parasite populations, and none specific to the questions of ranking and classification. Most studies of this kind use genetic data of widely varying sizes and types, including both whole genome sequence data and SNP ‘barcode’ data (typically consisting of 24- to 100-SNPs [17, 18]), limiting generalizability and comparability between studies. Importantly, population-level summaries of differentiation and relatedness capture genetic signatures resulting from multiple, interconnected processes, including recent prior and ancestral migration, changes in population size over time, and selection. Multiple studies indicate that effective sizes of malaria populations are dynamic, often over relatively short periods of time (for example, marked contraction of effective population size following the successful introduction of various malaria control interventions [19, 20]). For these reasons, it is likely that the amount or type of data needed for estimating gene flow, and reliably ranking or classifying these estimates, will vary across different epidemiological contexts.This report examines the use of both relatedness- and differentiation-based methods for estimating gene flow between malaria populations. We numerically evaluate data size requirements for ranking and classification of these estimates, using P. falciparum genetic data from field isolates and a coalescent simulation informed by mobile phone-associated human mobility data. In addition, we explore how changes in population size and migration rates over time influence IBD- and FST-based estimates of gene flow. (Although a multitude of other methods and estimators exist for studying differentiation and relatedness between populations, we focus here on FST given its common usage in malaria genomic epidemiology studies and IBD-based methods given their increasing use in similar applications.) We specifically examine populations that have undergone recent reduction in their effective population sizes or changes in relative parasite migration rates between locations, given the direct relevance of this scenario to real-world malaria populations. Findings from these analyses have important implications for the design and implementation of malaria genomic surveillance studies.
Methods
P. falciparum genomic data
We obtained P. falciparum whole genome variant call data, generated using the Genome Analysis Toolkit (GATK, [21]), from the MalariaGen database (Pf3k data release 5.1 [22]). We restricted our analysis to samples from the Greater Mekong Subregion (GMS), excluding more widely divergent P. falciparum populations from Sub-Saharan Africa and Bangladesh that are less relevant to the context of our analysis (i.e. genomic surveillance at the regional and national level), and including only individual sequences from clinical cases or survey participants. After excluding “un-callable” regions of the P. falciparum genome (i.e. highly repetitive or highly variable regions in which short-read based genotyping is unreliable [23]) and sites failing any GATK quality filter, we removed suspected polyclonal sequences and those with poor quality sequence data based on the proportions of heterozygous and missing SNP calls across all sites for each sample, and removed low-quality SNPs based on site-wise missingness and heterozygosity (similar to [24]). We first removed SNP sites for which > 1% of sequences had missing or heterozygous calls and then removed sequences for which > 5% SNP sites were missing or heterozygous. This filtering protocol yielded 472 individual sequences and 143,480 SNPs. Of the 143,480 SNPs, 437 had estimated minor allele frequencies > 0.35 and 5537 had estimated minor allele frequencies > 0.05 (estimated using all 472 sequences in this collection, Fig A in S1 Appendix). We conducted analyses with both of these SNP sets, with the goal of examining how SNP allele frequency influences gene flow estimates. In terms of precision, markers with equifrequent alleles (thus high minor allele frequencies) are most informative for relatedness estimation [4]. The first SNP set (estimated minor allele frequencies > 0.35) aligns with prior and ongoing efforts to design other barcodes for malaria surveillance [17, 25].Given evidence that a selective sweep on genes linked to artemisinin and piperaquine resistance (kelch13 and plasmepsin2–3, respectively) substantially altered P. falciparum population structure across the GMS after 2012 [26-28], we included only sequences from 2009–2011 and excluded all sequences carrying the kelch13 haplotype associated with this selective sweep (called KEL1 and identified using the haplotype scoring system in [28]). Lastly, we excluded two Pf3k study locations (Sisakhet and Ranong, Thailand) from the analysis that included < 10 P. falciparum individual sequences after removing post-2011 sequences and KEL1 mutants. Fig B in S1 Appendix provides information on the number of individual samples included from each location. To evaluate whether estimated gene flow decays with distance between locations, we obtained shortest road distances between Pf3k study locations from the Google Maps Distance Matrix API [29]. Between location distances ranged from 34 km (Bu Gia Map-Phouc Long) to 1783 km (Bu Gia Map-Bago Division).
Mobility-informed coalescent simulation
We used a multi-population coalescent simulation to extend our analysis to a wider range of possible population genetic and epidemiological conditions. This approach also allows for unbiased sampling of the simulated populations, avoiding issues with sampling bias inherent to real sequence data collected in the field (as discussed in more detail below). We simulated sequence data for a metapopulation with five demes using the msprime implementation of Hudson’s ms coalescent simulator [30, 31]. For consistency with the terms used to describe the Pf3k data, we hereafter refer to populations in the coalescent simulation as “locations”. Here “individual” refers to either sequence data obtained from a single monoclonal P. falciparum infection in the Pf3k data or sequence data from a single individual in the simulations. Our analysis uses two coalescent modeling frameworks. Motivated by observations from existing studies [5, 32], and in our analysis of the Pf3k dataset, these models seek to approximate situations where migration patterns and effective population sizes vary over time, resulting in different population genetic signatures from ancestral versus more recent migration events. The first framework (Model A, Fig C in S1 Appendix) approximates an ancestral metapopulation with high mixing between locations, followed by a more recent period with lower migration rates with higher statistical variation. Specifically, Model A incorporates two different sets of migration rates between locations (ρ), an ancestral set in which migration is high and equal for all location pairs followed at time = g generations in the past by a more recent set that is specified using aggregated, anonymized call detail records (CDRs) from mobile phone users in Thailand (). We estimated migration rates (measured as the proportion of the population in j that are migrants in i per generation) between 930 districts in Thailand (Supplementary Methods in S1 Appendix) and sampled sets of districts over two spatial scales: “local” or nearby district sets separated by short geographic distances (approximately 100 km or less) and “subnational’ district sets separated by larger geographic distances (Figs D, E, and F in S1 Appendix). We used sets of five districts each, sampling 10 local district sets and 10 subnational district sets, and used the resulting 5 × 5 migration rate matrices to parameterize the coalescent simulation. Effective population size is constant in Model A and denoted N.The second framework (Model B, Fig C in S1 Appendix) seeks to approximate a metapopulation with recent-onset contraction in effective population sizes (for example, following successful implementation of malaria control interventions [32]). Model B uses the same mobility data to specify a constant set of migration rates between locations, with an exponential decrease in effective population size from initial size (N) to current size (N) starting at time = g generations in the past. The total number of migrants in each generation is a function of both migration rate and population size and contraction of effective population size reduces the total number of migrants starting after time = g.For each simulation, we examine a range of parameters for N, N, g, and the recombination rate ϕ. The mutation rate was set at 6.82E-9 mutations per site per generation for all simulations [19, 33]. We obtained a total of 1000 simulation replicates for each parameter set (100 simulation replicates for each of 10 sets of CDR-estimated migration rates). We calculated estimated allele frequencies for each site in the simulated data from a random sample of 80% of all individual sequences. Subsequent analysis was restricted to sites with estimated minor allele frequency > 0.35.
Estimating gene flow between malaria parasite populations
Following [4], we define relatedness, r, between two individuals as the probability that, at any locus across the genome, the alleles for both individuals are IBD. In this study, we use the fract_sites_IBD output from hmmIBD [34]to estimate r and let denote this inter-individual relatedness estimate. To summarize numerous relatedness estimates between individuals from different locations, let denote the proportion of between-location estimates greater than some specified threshold, α. We estimate inter-location differentiation using Hudson’s estimator of F [35] and let denote this estimate of inter-location differentiation. We refer to and as estimates of gene flow since we expect gene flow between malaria parasite populations to correlate with both (positively with and negatively with ).We calculate and using different numbers of individuals per location, n, and different numbers of SNPs, p. For the Pf3k data, we calculate and using all 437 SNPs with minor allele frequency estimates > 0.35 and, for each location pair, the maximum number of individuals available for each location. To distinguish IBD sharing due to ancestral versus more recent migration events, we also examine mean shared IBD sequence length for between-location sequence pairs and stratify this analysis over IBD segments of different lengths [36, 37]. Shorter IBD segment lengths, reflecting haplotypes that have been broken down by recombination over time, represent more distant ancestral events; longer IBD segments reflect sharing due more recent migration events.
Ranking, classification, and isolation by distance for gene flow estimates
We calculate Kendall’s τ for the correlation between or and geographic distance between locations, using distance between study cities for the Pf3k and distance between district centroids in Thailand for the simulated data (which uses administrative districts as the unit of aggregation for mobile phone-associated movement data). A Mantel test with 1000 permutations was used to evaluate the statistical significance of each τ value.For the simulated data, where the migration rate between locations is known a priori, we compared differences in estimated gene flow to differences in pre-specified migration rates to determine whether or values correctly rank pairs of location pairs. For example, if a location pair with a high migration rate has a higher estimated value for than a location pair with a lower pairwise migration rate, we consider these two location pairs as ranked correctly by their respective values. We calculate the proportion of all 45 pairs of location pairs (five choose two choose two) that are ranked correctly by their or values (proportion ranked correctly, “PRC”). We examine PRC as a practical measure for contextualizing observed values of or (i.e. if n total individuals are sampled and genotyped across p SNPs, what is the probability that the observed values for or correctly rank location pairs by their shared gene flow?).Motivated by expected practical use cases in malaria genomic surveillance, we also consider how well or identify or classify highly connected location pairs. Specifically, for each individual simulation, we evaluate whether (1) the single location pair with the highest migration rate also has the highest estimated pairwise gene flow and (2) whether the location pairs with the five highest migration rates also have the five highest gene flow estimates. For each parameter set, we report the proportion of all simulations in which these location pairs are classified correctly.
Results
Relatedness and differentiation between P. falciparum populations
We first estimated connectivity between P. falciparum sequences collected in multiple study sites across the Greater Mekong Subregion using both differentiation-based and relatedness-based estimates of gene flow (Fig 1). We observed distinct signatures of isolation by distance with both (increasing differentiation with distance) and and (decreasing relatedness with distance). The strength of these correlations decreases with smaller data sizes, most markedly with p (the number of SNPs used) < 100, and by SNP ascertainment scheme (Fig 2). Using SNPs with higher estimated minor allele frequencies (> 0.35) yielded stronger correlations with distance for both and ; using SNPs from a wider range of estimated minor allele frequencies (> 0.05), and thus including more rare alleles, yielded less consistent correlations with distance.
Fig 1
Gene flow versus distance for P. falciparum isolates.
(A) F (Hudson’s estimator), (B) , and (C) . Kendall’s τ for divergence or relatedness versus distance and the corresponding p-value obtained via Mantel testing are listed for each metric. The four location pairs with highest estimated gene flow (as measured by F,,or ) are circled in the left panels. Contains information from OpenStreetMap and OpenStreetMap Foundation, which is made available under the Open Database License.
Fig 2
Gene flow versus distance for P. falciparum isolates by SNP number and estimated minor allele frequency (AF).
Kendall’s τ for distance versus F (A & B) or (C & D) for SNPs with estimated minor allele frequency of > 0.35 (A & C) or > 0.05 (B & D). Points show τ values for 1000 SNP sets of size p randomly sampled from the subsets of SNPs with estimated minor allele frequency of either > 0.35 or > 0.05.
Gene flow versus distance for P. falciparum isolates.
(A) F (Hudson’s estimator), (B) , and (C) . Kendall’s τ for divergence or relatedness versus distance and the corresponding p-value obtained via Mantel testing are listed for each metric. The four location pairs with highest estimated gene flow (as measured by F,,or ) are circled in the left panels. Contains information from OpenStreetMap and OpenStreetMap Foundation, which is made available under the Open Database License.
Gene flow versus distance for P. falciparum isolates by SNP number and estimated minor allele frequency (AF).
Kendall’s τ for distance versus F (A & B) or (C & D) for SNPs with estimated minor allele frequency of > 0.35 (A & C) or > 0.05 (B & D). Points show τ values for 1000 SNP sets of size p randomly sampled from the subsets of SNPs with estimated minor allele frequency of either > 0.35 or > 0.05.Examining location pairs with the highest estimated shared gene flow, we found largely similar geographic patterns for and (Fig 1A and 1B), with the strongest estimated gene flow observed between geographically-proximate locations in southern Laos, eastern Cambodia, and Vietnam. However, values were highest (consistent with higher gene flow) for location pairs within southern Cambodia and Vietnam. We found an identical geographic pattern when examining mean pairwise IBD sequence length for long IBD tracts (> 95 percentile, equal to tract length ≥ 285.64 kb), indicating that these patterns reflect more recent migration events (Fig G in S1 Appendix) [36, 37]. Location pairs with the highest values also have relatively higher numbers of nearly clonal individual-individual pairs (Fig H in S1 Appendix), which result from recent migration events (where the time since migration is short enough that there is limited or no out-crossing between imported individuals and the receiving population). Mean IBD segment length for between-location individual-individual pairs is also longer for location pairs with the highest (Fig H in S1 Appendix). These findings indicate that population-level summaries of between-location relatedness, if used with appropriate thresholds that can identify highly-related individual-individual pairs, can provide useful estimates of recent migration between P. falciparum populations. reflects gene flow due to both recent and more historically distant migration events and in some contexts may be strongly influenced by more ancestral population structure.Sequence data in the Pf3k database was obtained during multiple clinical and cross-sectional studies [38], each of which may be subject to different forms of sampling bias. We observed marked heterogeneity across study sites in the within-location distributions for (Fig I in S1 Appendix). The proportion of clonal or nearly-clonal individual-individual pairs (where ) differs widely across study locations, likely reflecting both true differences in levels of within-population diversity and/or sampling bias (for example, if samples were collected from individuals infected with the same clone during an outbreak or other shared transmission event).
Data size and isolation by distance over local and subnational scales
To extend this analysis, we used a mobility-informed coalescent simulation to numerically evaluate and under a wider range of possible epidemiological and evolutionary conditions. This approach also allows for unbiased sampling of individuals in each location, obviating the aforementioned issues with sampling bias in real P. falciparum sequence data sets. The two models used for this purpose (Fig C in S1 Appendix, Models A and B) yield and values that closely approximate those observed for P. falciparum populations over both subnational and local geographic scales (Figs J, K, and L in S1 Appendix) [5, 38]. Simulations with higher recombination rates better approximated observed and vales in the Pf3k dataset (Figs M and N in S1 Appendix). The CDR-derived mobility data used to parameterize these models indicates that between-district migration in Thailand is less strongly correlated with distance over local geographic areas (i.e. between nearby districts) compared to subnational movement over a wider range of distances (Fig D in S1 Appendix). This suggests that migration may be less spatially coherent over short distances, such that the presence or absence of isolation by distance may be difficult to interpret as a signature of population structure under certain conditions.Consistent with this finding, we observe weaker correlations between geographic distance and both and for migration over local scales (Fig 3) when compared to migration over larger subnational scales (Fig 4). Similarly, and are more closely correlated with migration rates (as specified in the coalescent model) than distance (Figs O and P in S1 Appendix). These findings are in part attributable to the wider variation of distances, and larger differences in migration rates, for the districts sampled to represent subnational migration, which include districts separated by both large and small distances (Figs D and F in S1 Appendix). Migration rates between nearby or local districts are higher overall and restricted to a narrower range of values with lower dispersion, resulting in and values that are less strongly correlated with distance (Figs J and K in S1 Appendix).
Fig 3
Gene flow versus distance for simulated sequence data (local migration, model A).
Migration rates are estimated using CDR data for neighboring districts in Thailand (separated by ≥ 100 km). (A) and (B): correlation between distance and observed values of F and , respectively. Points show τ values for 1000 independent simulation replicates. (C) and (D): proportion of simulation replicates where Mantel-estimated p-values for observed τ values are ≥ 0.05, by number of SNPs (p) and number of individuals (n). Model parameters: recombination rate, ϕ = 0.7, multiplier for recent migration rates, m = 15; multiplier for ancestral migration rates, M = 5; population size, N = 500; time since ancestral migration rates, g = 10 generations.
Fig 4
Gene flow versus distance for simulated sequence data (subnational migration, model A).
Migration rates are estimated using CDR data for neighboring districts in Thailand. Distances are (A) and (B): correlation between distance and observed values of F and , respectively. Points show τ values for 1000 independent simulation replicates. (C) and (D): proportion of simulation replicates where Mantel-estimated p-values for observed τ values are ≥ 0.05, by number of SNPs (p) and number of individuals (n). Model parameters: recombination rate, ϕ = 0.7, multiplier for recent migration rates, m = 15; multiplier for ancestral migration rates, M = 5; population size, N = 500; time since ancestral migration rates, g = 10 generations.
Gene flow versus distance for simulated sequence data (local migration, model A).
Migration rates are estimated using CDR data for neighboring districts in Thailand (separated by ≥ 100 km). (A) and (B): correlation between distance and observed values of F and , respectively. Points show τ values for 1000 independent simulation replicates. (C) and (D): proportion of simulation replicates where Mantel-estimated p-values for observed τ values are ≥ 0.05, by number of SNPs (p) and number of individuals (n). Model parameters: recombination rate, ϕ = 0.7, multiplier for recent migration rates, m = 15; multiplier for ancestral migration rates, M = 5; population size, N = 500; time since ancestral migration rates, g = 10 generations.
Gene flow versus distance for simulated sequence data (subnational migration, model A).
Migration rates are estimated using CDR data for neighboring districts in Thailand. Distances are (A) and (B): correlation between distance and observed values of F and , respectively. Points show τ values for 1000 independent simulation replicates. (C) and (D): proportion of simulation replicates where Mantel-estimated p-values for observed τ values are ≥ 0.05, by number of SNPs (p) and number of individuals (n). Model parameters: recombination rate, ϕ = 0.7, multiplier for recent migration rates, m = 15; multiplier for ancestral migration rates, M = 5; population size, N = 500; time since ancestral migration rates, g = 10 generations.We next examined how the ability to detect isolation by distance is influenced by data size. To do this, we evaluated the proportion of all simulation replicates, of data size of p SNPs and n individuals, for which the correlation between distance and or is significant by Mantel testing. These proportions (proxy measures for the power to detect isolation by distance) are markedly decreased for data sizes with < 100 SNPs or < 50 individuals, across both models A and B and for both local and subnational migration rates (Figs 3 and 4 and Fig Q in S1 Appendix).In both models A and B, we found identified isolation by distance more reliably than over almost all data sizes (Figs 3 and 4 and Fig Q in S1 Appendix). This likely reflects the relatively short number of post-ancestral generations used in our simulations, in which the change from “ancestral” to “recent” migration rates (Model A) or the onset of population size contraction (Model B) occurs at 10–50 generations before present. These findings are consistent with recent studies on the Thai-Myanmar border, where contraction of P. falciparum effective population size began approximately 20 years ago [32] (80–120 generations, assuming 4–6 generations per year) and where IBD-based methods for estimating gene flow delineated hyperlocal spatial population structure that could not be resolved by F [5]. Supporting this observation, sensitivity analysis examining different values for g (in Model A, the number of generations before present when ancestral migration rates are supplanted by recent migration rates) indicates that values are strongly influenced by g, whereas F values are largely similar for both large and small values of g (Fig R in S1 Appendix).
Data size and ranking gene flow estimates
Motivated by practical questions in malaria surveillance, and acknowledging the potential limitations of isolation by distance in contexts where migration is poorly correlated with distance, we sought to evaluate data size requirements for ranking and classification of gene flow estimates. To examine ranking, we calculated for each simulation replicate the proportion of all pairs of locations where ranking by or matches ranking according to the migration rates specified in the coalescent simulation (“proportion ranked correctly”). This proportion approximates the probability, for a given simulation replicate with data size p SNPs and n individuals, that the or values for two location pairs will correspond to the true ordering of their respective migration rates. Overall, a lower proportion of location pairs were ranked correctly when migration is specified using local (Fig S in S1 Appendix) rather than subnational migration matrices (Fig T in S1 Appendix). Notably, when migration is specified using subnational district sets, the proportion of location pairs ranked correctly by is greater than the proportion ranked correctly by over almost all data sizes (Fig T in S1 Appendix).Lastly, we examined, for a given data size, the ability to correctly distinguish highly connected location pairs from less highly connected pairs using or . We found that misclassification of highly connected location pairs was more frequent when local rather than subnational district sets were used to specify migration (Figs S and T in S1 Appendix). The five location pairs with highest connectivity were correctly classified in only a small proportion of simulation replicates (Figs S and T in S1 Appendix), indicating that distinguishing highly connected groups of location pairs may be difficult in many epidemiological contexts.
Discussion
Genomic and genotype-based methods have become increasingly important tools for malaria surveillance and control, providing valuable information on the impact of disease control interventions [20], the geographic dispersal of antimalarial drug resistance [36], and connectivity between regional and local parasite populations [5], among other critical insights. There is an ongoing effort to operationalize these tools for use in routine disease surveillance activities, and developing the evidence base around their optimal use is a central scientific and public health priority. In this study, we used genome sequence data from naturally-occurring P. falciparum infections and sequence data simulated using parameters associated with different epidemiological conditions to examine a specific but potentially important use cases in malaria genomic surveillance, i.e. testing for isolation by distance and ranking gene flow between different P. falciparum population pairs.We highlight the following observations:Recent and ancestral population dynamics, together with the magnitude and dispersion of migration rates, all influence the ability to distinguish levels of gene flow between malaria populations.Identifying highly related or nearly clonal between-location sequence pairs provides an important measure for recent parasite migration events. IBD-based measures that capture these recent migration events (including ) are less influenced by ancestral migration than F.The ability to identify isolation by distance and rank gene flow is lower when 24-SNP datasets are used or when fewer than 50 individuals are sampled per location. Larger data sizes (100 or more SNPs and > 50 individuals per location) are recommended.SNPs with higher estimated minor allele frequencies more reliably identify isolation by distance. For , these results are consistent with prior work [4]. F estimates can be biased by the allele frequencies of the marker set under analysis, with the inclusion of more rare polymorphisms leading to under-estimation of F and enrichment for more equifrequent polymorphisms resulting in over-estimation [39]. Bias toward lower F estimates when using the SNP set with more rare alleles (estimated minor allele frequency > 0.05) may explain why this ascertainment scheme yielded weaker correlations between F and distance (when compared with the set of more equifrequent SNPs). Identifying optimal SNP ascertainment strategies for malaria genomic surveillance studies is an area of active research [16].Detecting or failing to detect isolation by distance may be less informative if parasite migration rates are not consistently correlated with distance [40], as observed in our analysis of mobile phone-associated movement data (specifically, for migration between nearby or local districts in Thailand) and in other studies [8-10]. In these contexts, the ability to reliably rank gene flow estimates or distinguish highly connected versus less highly connected location pairs may be more useful from a programmatic standpoint.Reliable ranking of gene flow between pairs of populations may be difficult in situations where migration rates are high or highly uniform across location pairs (for example, between local or nearby districts), even when using larger numbers of highly informative SNPs and sampling large numbers of individuals.There are multiple limitations that are important for understanding these observations in context. The P. falciparum sequence data we obtained from the Pf3k database includes only a small number of individuals from certain locations (for example, Pailin, Cambodia and Bago Division, Myanmar). This analysis is limited to samples collected over a two-year period (2009–2011) in a relatively small number of locations in Southeast Asia, and observations from this dataset are not generalizable beyond the epidemiological and ecological conditions specific to this context. In addition, our analysis is likely influenced by biases inherent to sample collection in each Pf3k study location, making it difficult to disambiguate whether observed patterns in relatedness and differentiation are due to underlying P. falciparum population structure or sampling biases. Potential sampling biases in field studies include over-sampling of clinical or outbreak related infections, with resultant undersampling of subclinical infections, and laboratory-based sources of bias related to difficulties genotyping infections with low parasite density. Our analysis also excluded polyclonal infections and individuals carrying resistance-associated KEL1 haplotypes. Removal of polyclonal infections may result in substantial levels of missingness, particularly in highly endemic contexts where polyclonal infections are more common. Moreover, the distribution of polyclonal infections across locations may provide important information about gene flow between malaria populations. Likewise, exclusion of individuals carrying known resistance-associated mutations may result in substantial loss of potentially informative data and focusing on these isolates may provide insights into patterns of connectivity underlying dispersal of drug resistance-associated genetic polymorphisms. Lastly, it is not clear that geographic distance between Pf3k study locations should correlate with parasite migration rates and signatures consistent with isolation by distance should be interpreted with caution here.We sought to address some of these limitations via coalescent simulation, allowing for numerical evaluation of relatedness- and differentiation-based approaches over a wider range of possible epidemiological conditions. This approach avoids the potential biases inherent to sampling from naturally-occurring P. falciparum populations in the field and allows for direct comparison between known migration rates between locations and their corresponding F and values. Parameters used for coalescent simulation are subject to potential mis-specification, particularly those parameters where there is uncertainty about their true values in naturally-occurring P. falciparum populations (for example, recombination rate). Although we do not statistically infer model parameters, the parameters used in our simulations yield F and values that are consistent with the range and distribution of values observed in Pf3k data. Importantly, we specified equal population sizes across the five locations in the coalescent simulation, despite the fact that differential transmission between locations (resulting in larger and smaller effective population sizes) could have important impacts on gene flow estimates.There are other aspects of our approach that may limit generalizability to real P. falciparum populations in the field. For example, we have assumed the ability to sample individuals evenly across populations, and that sampling by default captures every population in the metapopulation; in real-life genomic surveillance applications, sampling is likely to be unbalanced and imperfect, such that individuals from certain populations may be undersampled or not captured at all. In addition, our analysis only considers direct connectivity between locations and, in certain contexts, indirect migration between locations may be an important contributor to gene flow. Lastly, our analysis assumes perfect overlap of survey catchment area and population, whereas in real-life surveillance applications, it is more difficult to ascribe a known geolocation to each individual infection (given that individuals may move location between infection and medical diagnosis or survey participation) and thus the task of assigning individual infectious to specific locations involves much more uncertainty.
Conclusion
In conclusion, we have examined numerically the properties of and as estimates of population-level gene flow between P. falciparum populations, and outlined, insofar as our simulations allow, basic data requirements (number of individuals and number of SNPs) for their use. Although this study is focused on a limited set of genomic surveillance applications (identifying isolation by distance and correctly ranking location pairs versus less highly connected ones), our work underscores an important limitation that applies across a wider range of surveillance applications: under certain epidemiological conditions (high migration between large populations) it likely becomes infeasible to reliably rank gene flow between P. falciparum populations, even with dense sampling of individuals and sequence data for large numbers of highly informative genetic markers. Additional scientific work is needed to understand how unbalanced sampling of individuals, incomplete sampling of metapopulations (for example, non-sampling of entire demes), and other exigencies of real-world genomic surveillance impact the ability to rank different levels of gene flow between P. falciparum populations in the field.
Supplementary methods and figures.
Fig A. Positions for filtered SNPs from the Pf3k dataset. (A) SNPs with estimated minor allele frequency > 0.35. (B) SNPs with estimated minor allele frequency > 0.05. Fig B. Number of monoclonal . Data includes only those sequences collected between 2009 and 2011 and excludes KEL1 mutants (as described in Methods). Contains information from OpenStreetMap and OpenStreetMap Foundation, which is made available under the Open Database License. Fig C. Coalescent models used to generate simulated sequence data. (A): Constant population size with change from ancestral migration rates to current migration rates at time = g generations in the past. The ancestral migration rate is equal for all location-location pairs and specified as ρ = M × max(ρ), where M is a multiplier. Recent migration rates are specified using CDR-estimated mobility data (ρ, as described in the Methods and Supplementary Methods) and a multiplier value m. (B): Constant migration rates with exponential decrease in population size from an initial size (N) to current size (N) starting at time = g generations in the past. (C): Constant population size and constant migration rates. Fig D. CDR-estimated migration rates. (A) Relative migration rates (ρ/max(ρ)) between 930 districts in Thailand with adequate CDR data. (B) 10 randomly sampled sets of n = 5 neighboring districts with distances between district centroids of approximately 100 km or less. (C) 10 randomly sampled sets of n = 5 distantly-separated districts, selected to include districts > 600 km from randomly chosen a center district. Sampling procedures for the district sets in (B) and (C) and their use for specifying migration rates in the coalescent simulation are described in the Methods and Supplementary Methods. Fig E. Example sets of nearby districts. Figure shows four of ten total randomly selected district sets used to specify migration rates in the coalescent model (corresponding to Panel B in Fig D). Contains information from OpenStreetMap and OpenStreetMap Foundation, which is made available under the Open Database License. Fig F. Example sets of distant districts. Panels show four of ten total randomly selected district sets used to specify migration rates in the coalescent model (corresponding to Panel C in Fig D). Contains information from OpenStreetMap and OpenStreetMap Foundation, which is made available under the Open Database License. Fig G. Distance versus alternative population-level estimates of relatedness. (A) Distance versus mean pairwise IBD sequence length when only the longest IBD tract lengths are considered (> 95 percentile, equal to tract length ≥ 285.64 kb). (B) Distance versus , the proportion between-location individual-individual pairs with r > 0.8. Contains information from OpenStreetMap and OpenStreetMap Foundation, which is made available under the Open Database License. Fig H. Between-location relatedness for . (A) Distribution of r values (proportion of SNPs that are IBD between pairs) for between-location individual-individual pairs across 36 location-location pairs in the Great Mekong Subregion. (B) Distribution of mean pairwise IBD segment length (the mean length of shared IBD tracts for individual-individual pairs) for the same location-location pairs. Symbols show location-location pairs with the highest estimated gene flow per F (red circles), (blue circles), (open blue squares), and mean pairwise IBD sequence length (as described in Fig G, filled blue squares). Fig I. Pairwise relatedness for intra-location individual-individual pairs. Plots show distribution of intra-location r values for 9 locations (proportion of SNPs that are IBD between within-location pairs of sequences) included from the Pf3k dataset. Fig J. Observed values for . Data was simulated using constant population size and instantaneous change from ancestral migration rates to recent migrations at time = g generations in the past (as described in Panel A of Fig C). Model parameters: recombination rate, ϕ = 0.7, multiplier for recent migration rates, m = 15; multiplier for ancestral migration rates, M = 5; population size, N = 500; time since ancestral migration rates, g = 10 generations. (A) and (B): Observed F values compared to specified migration rates and distance between districts, respectively. (C) and (D): Observed values compared to specified migration rates and distance between districts, respectively. Filled circles show values for 1000 independent simulations using 10 randomly sampled district sets specifying CDR-estimated migration rates (as shown in Fig D). Open circles show the mean F or values for each migration rate or distance. Rug plot on y-axis shows estimated F or values obtained from the Pf3k P. falciparum data (for comparison with the model-derived values). Fig K. Observed values for . Data was simulated using constant population size and instantaneous change from ancestral migration rates to recent migrations at time = g generations in the past (as described in Panel A of Fig C). Model parameters: recombination rate, ϕ = 0.7, multiplier for recent migration rates, m = 15; multiplier for ancestral migration rates, M = 5; population size, N = 500; time since ancestral migration rates, g = 10 generations. (A) and (B): Observed F values compared to specified migration rates and distance between districts, respectively. (C) and (D): Observed values compared to specified migration rates and distance between districts, respectively. Filled circles show values for 1000 independent simulations using 10 randomly sampled district sets specifying CDR-estimated migration rates (as shown in Fig D). Open circles show the mean F or values for each migration rate or distance. Rug plot on y-axis shows estimated F or values obtained from the Pf3k P. falciparum data (for comparison with the model-derived values). Fig L. Observed values for . Data was simulated using a coalescent model with a constant set of migration rates and exponential decrease from an ancestral population size (N) to current populations size (N) starting time = g generations in the past (as described in Panel B of Fig C). Migration rates are specified using CDR-estimated mobility between nearby districts in Thailand separated by approximately 100 km or less (“local” migration). Model parameters: recombination rate, ϕ = 0.7, multiplier for migration rates, m = 15; final population size, N = 100; ancestral population size, N = 1000; time since onset of exponential decrease in population size, g = 50 generations. (A) and (B): Observed F values compared to specified migration rates and distance between districts, respectively. (C) and (D): Observed values compared to specified migration rates and distance between districts, respectively. Filled circles show values for 1000 independent simulations using 10 randomly sampled district sets specifying CDR-estimated migration rates (as shown in Fig D). Open circles show the mean F or values for each migration rate or distance. Rug plot on y-axis shows estimated F or values obtained from the Pf3k P. falciparum data (for comparison with the model-derived values). Fig M. Sensitivity analysis for recombination rate . Model parameters: recombination rate, ϕ = 0.1 or 0.7, multiplier for recent migration rates m = 15, multiplier for ancestral migration rates M = 5; final population size, N = 500; time since ancestral migration rates, g = 10 generations. (A) and (B): Observed F values compared to specified migration rates between districts for ϕ = 0.1 and 0.7, respectively. (C) and (D): Observed values compared to specified migration rates. Filled circles show values for 1000 independent simulations using 10 randomly sampled district sets specifying CDR-estimated migration rates. Open circles show the mean F or values for each migration rate or distance. Rug plot on y-axis shows estimated F or values obtained from the Pf3k P. falciparum data (for comparison with the model-derived values). Fig N. Sensitivity analysis for recombination rate Model parameters: recombination rate, ϕ = 0.1 or 0.7, multiplier for migration rates m = 15; final population size, N = 100; ancestral population size, N = 1000; time since onset of exponential decrease in population size, g = 50 generations. (A) and (B): Observed F values compared to specified migration rates between districts for ϕ = 0.1 and 0.7, respectively. (C) and (D): Observed values compared to specified migration rates. Filled circles show values for 1000 independent simulations using 10 randomly sampled district sets specifying CDR-estimated migration rates. Open circles show the mean F or values for each migration rate or distance. Rug plot on y-axis shows estimated F or values obtained from the Pf3k P. falciparum data (for comparison with the model-derived values). Fig O. Gene flow versus migration rate for coalescent-simulated sequence data (local migration, model A). Migration rates are estimated using CDR data for neighboring districts in Thailand (separated by ≤ 100 km). (A) and (B) correlation between migration rate (as specified in the coalescent model) and observed values of F and , respectively. Points show τ values for 1000 independent simulation replicates. (C) and (D): proportion of simulation replicates where Mantel-estimated p-values for observed τ values are ≥ 0.05, by number of SNPs (p) and number of individuals (n). Fig P. Gene flow versus migration rate for coalescent-simulated sequence data (subnational migration, model A). Migration rates are estimated using CDR data for distant districts in Thailand. (A) and (B) correlation between migration rate (as specified in the coalescent model) and observed values of F and , respectively. Points show τ values for 1000 independent simulation replicates. (C) and (D): proportion of simulation replicates where Mantel-estimated p-values for observed τ values are ≥ 0.05, by number of SNPs (p) and number of individuals (n). Fig Q. Gene flow versus distance for simulated sequence data (local migration, model B) Data was simulated using a coalescent model with a constant set of migration rates and exponential decrease from an ancestral population size (N) to current populations size (N) starting time = g generations in the past (as described in Panel B of Fig C). Migration rates are estimated using CDR data for neighboring districts in Thailand (separated by ≥ 100 km). (A) and (B): correlation between distance and observed values of F and , respectively. Points show τ values for 1000 independent simulation replicates. (C) and (D): proportion of simulation replicates where Mantel-estimated p-values for observed τ values are ≥ 0.05, by number of SNPs (p) and number of individuals (n). Fig R. Time since ancestral migration events versus . Top panels: Each point compares the value for the same location pair in the simulation using g = 10 (x-axis) versus g = 500 (y-axis). Bottom panels: Model parameters: M, the multiplier for ancestral migration rates (before g generations in the past), is equal to either 5 (left panels) or 50 (right panels). m, the multiplier for recent migration rates, is equal to 15 and the recombination rate, ϕ, is 0.7 for all simulations shown. Points are colored by the number of migrants per generation in the “recent” migration matrix (from 0 to g generations in the past). The dashed line shows where x-axis and y-axis values are equal. Fig S. Ranking and classification metrics for coalescent-simulated sequence data (local migration, model A) (A) and (B): Proportion of all location-location pairs that are ranked correctly by either F or , respectively, when compared to the migration rate specified in the coalescent simulation. Points show these values for 1000 independent simulation replicates over different numbers of individuals (n) and SNPs (p) used to calculate F or . (C) and (D): Proportion of all simulation replicates where the location-location pair with the highest migration rate is correctly identified as such by F or , respectively. (E) and (F): Proportion of all simulation replicates the location-location pairs with the five highest migration rates are correctly classified as such as such by F or . Fig T. Ranking and classification metrics for coalescent-simulated sequence data (subnational migration, model A). (A) and (B): Proportion of all location-location pairs that are ranked correctly by either F or , respectively, when compared to the migration rate specified in the coalescent simulation. Points show these values for 1000 independent simulation replicates over different numbers of individuals (n) and SNPs (p) used to calculate F or . (C) and (D): Proportion of all simulation replicates where the location-location pair with the highest migration rate is correctly identified as such by F or , respectively. (E) and (F): Proportion of all simulation replicates the location-location pairs with the five highest migration rates are correctly classified as such as such by F or .(PDF)Click here for additional data file.9 Feb 2021Dear Dr Brown,Thank you very much for submitting your Research Article entitled 'Distinguishing gene flow between malaria parasite populations' to PLOS Genetics.The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review a revised version. We cannot, of course, promise publication at that time.Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see our guidelines.Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.[LINK]We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.Yours sincerely,Xavier DidelotAssociate EditorPLOS GeneticsHua TangSection Editor: Natural VariationPLOS GeneticsReviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: The manuscript attempts to address statistical inference of a single use case of molecular(genomic) epidemiology in malaria control, that is, ranking gene flow between population pairs. Traditionally, this has been done with allele frequency based metrics such as Fst, however here the authors incorporate a more recently established measure of population connectivity, Identity by Descent (IBD) as measured by the parameter R here. Several key findings include that sample size, marker number and population levels and migration influence the ability to rank gene flow. The paper is written well though is a little technical with the mathematics for some general genetics readers, so some effort could be made to simplify some of the descriptions.One point that stands out for me with IBD statistics is that they rely on high levels of relatedness (near clonality) to measure gene flow. However, these signals will be rapidly lost if a genotype migrated to a high transmission area. Thus, these approaches might be limited when comparing low to low transmission or high to low, but not high to high (as this paper suggests) or low to high transmission. This is something to consider and would be worth commenting on/considering for future analyses.The real data set used was SNPs extracted from genomic data from clinical samples collected in multiple populations of the Greater Mekong Subregion. Here P. falciparum transmission is low and patchy, and finding positive cases is challenging through routine surveys, many thousands of individuals screened to find only a few samples. Therefore, passive collection of isolates from clinical cases may be the only feasible/affordable means of identifying isolates for genomic analysis. However clinical (febrile) cases, as the authors also point out in the discussion, may present a biased view of transmission dynamics, especially if substantial numbers of asymptomatic infections persist that are not being analysed. There are two issues here, inability to accurately define gene flow due to lack of sufficient samples numbers (also keeping in mind that low density infections are very difficult to genotype currently), and the fact that clinical cases may not present an accurate view of what is actually going on.The use of the distance by road measures might not be accurate in all settings as in South East Asia, travel by motorbike, boat and foot might be alternative means. Thus a comparison to other distance measures might be warranted. Some of the co-authors have experience in spatial epidemiology so I suspect they have considered this. A comment on why the distance by road method was chosen would be appropriate.Reviewer #2: Please find the attached review.Reviewer #3: This is a well written manuscript describing approaches for measuring geneflow between populations, connectivity. It addresses an important question relevant to malaria elimination and the translation of genomic surveillance approaches for this deadly disease. It contrasted measures of geneflow based on allele frequencies (FST) and ancestry (measured by an IBD index, R). They employed real and simulated data, the latter controlling for migration rates and parameters that remain vastly unknown in natural parasite populations. Despite the importance of the findings, the manuscript is limited by a by the following:1. The authors deliberately used data from the Greater Mekong region and only monoclonal isolates. This is convenient but far from the reality across malaria endemic regions. With the bulk of malaria transmission in Africa, there should be strong motivation to employ such approaches in this population especially for populations with reduced transmission, where monoclonality is high and elimination plans are being developed. Overall, how this will apply to more complex populations need further discussion.2. As recognised by the authors, the bias in sampling is a major issue for connectivity studies based on genetic data. The authors could discuss the best sampling strategy for capturing connectivity at high confidence3. The simulations allow for only 100 individuals per location without any justification for this number.4. The filtered data retained 418 SNPs already biased for high minor allele frequencies. This will affect measurement of FST as low frequency variants are eliminated. Again, the choice on 0.35 MAF is based on a previous reference.5. The index for pairwise relatedness is considered a summary for the genomes of the pairs. This will be true if the 418 SNPs are well distributed across all chromosomes. Except I missed this, data on how the retained loci span across the genome should be helpful.6. Authors excluded sequences carrying the KEL1 haplotype. Connectivity studies as indicated in their introduction will help also to determine how antimalarial resistance flows between populations. Instead of excluding the isolates with KEL1 haplotypes, removal of SNPs around drug resistant gene sweeps could allow for tracking the flow of drug resistant isolates.7. The Ne estimates used were from an African population in Senegal, obtained after years of interventions. Not sure they apply to SEA, and why this not determined. Determining Ne accurately is an issue as with other popgen parameters.8. Phi was not defined and the value of 0.044 was used to scale the recombination rate. Only specialised readers will get this for an approach intended for future wider translation. How was this derived?9. What is the added value of PRC, given R seems a robust measure of connectivity?10. FST was not displayed on figure 1 to help readers11. R is highly skewed. Low FST but high R will result from the presence of highly related pairs, which can be due to a common source outbreak rather than connectivity.**********Have all data underlying the figures and results presented in the manuscript been provided?Large-scale datasets should be made available via a public repository as described in the PLOS Genetics
data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.Reviewer #1: YesReviewer #2: YesReviewer #3: Yes**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: NoReviewer #2: Yes: Jack O'BrienReviewer #3: NoSubmitted filename: plos_genetics_brown_2021.pdfClick here for additional data file.16 Sep 2021Submitted filename: BROWN GENE FLOW RESPONSE 091521.pdfClick here for additional data file.12 Oct 2021Dear Dr Brown,We are pleased to inform you that your manuscript entitled "Distinguishing gene flow between malaria parasite populations" has been editorially accepted for publication in PLOS Genetics. Congratulations!Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!Yours sincerely,Xavier DidelotAssociate EditorPLOS GeneticsHua TangSection Editor: Natural VariationPLOS Geneticswww.plosgenetics.orgTwitter: @PLOSGenetics----------------------------------------------------Comments from the reviewers (if applicable):Reviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #2: Please find the attached review.**********Have all data underlying the figures and results presented in the manuscript been provided?Large-scale datasets should be made available via a public repository as described in the PLOS Genetics
data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.Reviewer #2: Yes**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #2: Yes: Jack O'Brien----------------------------------------------------Data DepositionIf you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly:http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-20-01965R1More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.----------------------------------------------------Press QueriesIf you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.Submitted filename: plos_genetics_brown_first_revision_2021.pdfClick here for additional data file.24 Nov 2021PGENETICS-D-20-01965R1Distinguishing gene flow between malaria parasite populationsDear Dr Brown,We are pleased to inform you that your manuscript entitled "Distinguishing gene flow between malaria parasite populations" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!With kind regards,Agnes PapPLOS GeneticsOn behalf of:The PLOS Genetics TeamCarlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdomplosgenetics@plos.org | +44 (0) 1223-442823plosgenetics.org | Twitter: @PLOSGenetics
Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043
Authors: T J Anderson; B Haubold; J T Williams; J G Estrada-Franco; L Richardson; R Mollinedo; M Bockarie; J Mokili; S Mharakurwa; N French; J Whitworth; I D Velez; A H Brockman; F Nosten; M U Ferreira; K P Day Journal: Mol Biol Evol Date: 2000-10 Impact factor: 16.240
Authors: Ricardo L D Machado; Marinete M Povoa; Vanja S P Calvosa; Marcelo U Ferreira; Andrea R B Rossit; Eduardo J M dos Santos; David J Conway Journal: J Infect Dis Date: 2004-09-21 Impact factor: 5.226
Authors: Daniel M Parker; Verena I Carrara; Sasithon Pukrittayakamee; Rose McGready; François H Nosten Journal: Malar J Date: 2015-10-05 Impact factor: 2.979
Authors: Rachel Daniels; Sarah K Volkman; Danny A Milner; Nira Mahesh; Daniel E Neafsey; Daniel J Park; David Rosen; Elaine Angelino; Pardis C Sabeti; Dyann F Wirth; Roger C Wiegand Journal: Malar J Date: 2008-10-29 Impact factor: 2.979
Authors: Angela M Early; Marc Lievens; Bronwyn L MacInnis; Christian F Ockenhouse; Sarah K Volkman; Samuel Adjei; Tsiri Agbenyega; Daniel Ansong; Stacey Gondi; Brian Greenwood; Mary Hamel; Chris Odero; Kephas Otieno; Walter Otieno; Seth Owusu-Agyei; Kwaku Poku Asante; Hermann Sorgho; Lucas Tina; Halidou Tinto; Innocent Valea; Dyann F Wirth; Daniel E Neafsey Journal: Nat Commun Date: 2018-04-11 Impact factor: 14.919
Authors: Christopher G Jacob; Nguyen Thuy-Nhien; Mayfong Mayxay; Richard J Maude; Huynh Hong Quang; Bouasy Hongvanthong; Viengxay Vanisaveth; Thang Ngo Duc; Huy Rekol; Rob van der Pluijm; Lorenz von Seidlein; Rick Fairhurst; François Nosten; Md Amir Hossain; Naomi Park; Scott Goodwin; Pascal Ringwald; Keobouphaphone Chindavongsa; Paul Newton; Elizabeth Ashley; Sonexay Phalivong; Rapeephan Maude; Rithea Leang; Cheah Huch; Le Thanh Dong; Kim-Tuyen Nguyen; Tran Minh Nhat; Tran Tinh Hien; Hoa Nguyen; Nicole Zdrojewski; Sara Canavati; Abdullah Abu Sayeed; Didar Uddin; Caroline Buckee; Caterina I Fanello; Marie Onyamboko; Thomas Peto; Rupam Tripura; Chanaki Amaratunga; Aung Myint Thu; Gilles Delmas; Jordi Landier; Daniel M Parker; Nguyen Hoang Chau; Dysoley Lek; Seila Suon; James Callery; Podjanee Jittamala; Borimas Hanboonkunupakarn; Sasithon Pukrittayakamee; Aung Pyae Phyo; Frank Smithuis; Khin Lin; Myo Thant; Tin Maung Hlaing; Parthasarathi Satpathi; Sanghamitra Satpathi; Prativa K Behera; Amar Tripura; Subrata Baidya; Neena Valecha; Anupkumar R Anvikar; Akhter Ul Islam; Abul Faiz; Chanon Kunasol; Eleanor Drury; Mihir Kekre; Mozam Ali; Katie Love; Shavanthi Rajatileka; Anna E Jeffreys; Kate Rowlands; Christina S Hubbart; Mehul Dhorda; Ranitha Vongpromek; Namfon Kotanan; Phrutsamon Wongnak; Jacob Almagro Garcia; Richard D Pearson; Cristina V Ariani; Thanat Chookajorn; Cinzia Malangone; T Nguyen; Jim Stalker; Ben Jeffery; Jonathan Keatley; Kimberly J Johnson; Dawn Muddyman; Xin Hui S Chan; John Sillitoe; Roberto Amato; Victoria Simpson; Sonia Gonçalves; Kirk Rockett; Nicholas P Day; Arjen M Dondorp; Dominic P Kwiatkowski; Olivo Miotto Journal: Elife Date: 2021-08-10 Impact factor: 8.713