
Approaches to estimating inbreeding coefficients in clinical isolates of Plasmodium falciparum from genomic sequence data.

John D O'Brien¹, Lucas Amenga-Etego^2,3, Ruiqi Li⁴.

Abstract

BACKGROUND: The advent of whole-genome sequencing has generated increased interest in modelling the structure of strain mixture within clinical infections of Plasmodium falciparum. The life cycle of the parasite implies that the mixture of multiple strains within an infected individual is related to the out-crossing rate across populations, making methods for measuring this process in situ central to understanding the genetic epidemiology of the disease.
RESULTS: This paper derives a set of new estimators for inferring inbreeding coefficients using whole genome sequence read count data from P. falciparum clinical samples, providing resources to assess within-sample mixture that connect to extensive literatures in population genetics and conservation ecology. Features of the P. falciparum genome mean that standard methods for inbreeding coefficients and related F-statistics cannot be used directly. After reviewing an initial effort to estimate the inbreeding coefficient within clinical isolates of P. falciparum, several generalizations using both frequentist and Bayesian approaches are provided. A simpler, more intuitive frequentist estimator is shown to have nearly identical properties to the initial estimator both in simulation and in real data sets. The Bayesian approach connects these estimates to the Balding–Nichols model, a mainstay within genetic epidemiology, and a possible framework for more complex modelling. A simulation study shows strong performance for all estimators with as few as ten variants. Application to samples from the PF3K data set indicates significant across-country variation in within-sample mixture. Finally, a comparison with results from a recent mixture model for within-sample strain mixture shows that inbreeding coefficients provide a strong proxy for these more complex models.
CONCLUSIONS: This paper provides a set of methods for estimating inbreeding coefficients within P. falciparum samples from whole-genome sequence data, supported by simulation studies and empirical examples. It includes a substantially simpler estimator with similar statistical properties to the estimator in current use. These methods will also be applicable to other species with similar life-cycles. Implementations of the methods described are available in an open-source R package, pfmix. Estimates for the PF3K public data release are provided as part of this resource.


Keywords: Balding–Nichols model; COI; F-statistics; Inbreeding coefficient; MOI


Year: 2016 PMID: 27634595 PMCID: PMC5025560 DOI: 10.1186/s12936-016-1531-z

Source DB: PubMed Journal: Malar J ISSN: 1475-2875 Impact factor: 2.979

Background

While genetic factors play a crucial role in the emergence of drug resistance within Plasmodium falciparum, many aspects of the genetic epidemiology of the parasite remain obscure [1, 2]. The beginnings of a global perspective on the genetic structure of parasite populations emerged from the analysis of whole-genome sequencing (WGS) data derived from ~200 parasite genomes collected directly from clinical patients in six countries on three continents [3]. This study gave further evidence for the widespread presence of within-isolate strain mixture and significant amounts of variation in its degree across continents. In grappling with the complexity of WGS read count data, the study departed from standard approaches for quantifying the amount of within-sample variation by instead using an inbreeding coefficient, a form of F-statistic. Strain mixture has been traditionally assessed via multiplicity of infection (MOI) [4-6], using methods for inferring the number of strains from single-nucleotide polymorphisms (SNPs) or other typing technologies applied at a small number of loci. Researchers have subsequently shown how finite mixture models can infer MOI using WGS, but under the heading of complexity of infection (COI), as these models can capture additional mixture features [7, 8]. Still, inbreeding coefficients have a long connection to population genetics and conservation biology and may be of interest to researchers connecting P. falciparum studies to other genetic contexts [9, 10]. This paper presents a collection of statistical methods for estimating inbreeding coefficients, explores their performance in simulation, details their connection to COI estimates, and confirms the variation in their values across countries using the P. falciparum 3000 genomes (PF3K), a publicly available data resource.
Inbreeding coefficients and the F-statistics from which they derive are measurements of the departure of allelic heterozygosity observed within a population from that expected at Hardy–Weinberg equilibrium (HWE) [10, 11]. HWE specifies the distribution of alleles assuming panmixia, a population exhibiting perfectly random mating with an absence of mutation, migration, drift, selection or other effects [12]. F-statistics calibrate the empirical allele distribution within a subpopulation against that expected under HWE, ranging from a value of one (no mixture) to zero (perfect HWE-type mixture). In the context of comparing the parasites' genetic diversity within a single infected individual relative to the local geographic population (and absent any geographic structuring of the population, i.e. the Wahlund effect), these statistics effectively become inbreeding coefficients. $F_{ws}$ denotes the relative amount of inbreeding within an individual sample (w) relative to the expected amount in a subpopulation (s). Since estimates here are considered only relative to a single country (subpopulation), the paired-subscript notation $F_{ws}$ is dropped in favour of $f_i$ for a specific sample $i$. F-statistics have proven to be an effective and extremely popular means for investigating species' population structure from both allelic and genomic data [10, 13, 14]. However, standard software tools assume specific ploidy structures incommensurate with WGS data from P. falciparum and so cannot be used directly. The critical difference is that, within a human host, P. falciparum exists only in the haploid stage of its life-cycle [15]. Since short-read WGS data cannot yet capture full haplotypes, individual reads cannot be uniquely identified with their strain of origin. Without being able to associate reads to individual P. falciparum strains, no 'out-of-the-box' use of standard F-statistics approaches with this new data appears possible.
Several earlier works have applied the F-statistic framework to P. falciparum within-sample mixture. These concepts, while not under the heading of inbreeding coefficients, undergird much of the seminal work on MOI estimation [5, 6]. More recently, Manske et al. [3] provided an initial estimator for inbreeding coefficients using WGS based on the slope of a modified regression line between the expected and observed heterozygosity within a sample. Auburn et al. [16] explore the connection between this estimator and standard MOI approaches by comparing these estimates with MOI values inferred by genotyping the msp-1 and msp-2 genes, showing strong correlation between these values in their sample sets. This estimator has been further utilized in a number of recent studies on P. falciparum, including analyses of populations in the Gambia, Ghana, and Guinea [17, 18]. It has also been used in analysis of the population structures of Plasmodium vivax and Plasmodium knowlesi [19, 20]. A recent examination of this estimator in the context of microsatellite genotyping explores a strong relationship between the number of variants, allele frequency, and estimator performance [21]. There has been otherwise little statistical work characterizing this estimator or its properties. This paper seeks to remedy some of this deficit by providing: a simple presentation of this estimator; a set of alternate estimators that make stronger connections to the tradition around F-statistics; an investigation of their properties through simulations; and several applications to relevant data sets. This paper proceeds as follows. First, an overview of the data and the notation is provided. The initial estimator employed by Manske et al. [3] for estimating $f_i$ is then reviewed, followed by the presentation of two additional frequentist estimators. A Bayesian approach for these statistics is then derived from the Balding–Nichols model.
All of these estimators are compared in an extensive simulation study. To consider their empirical performance, the correlation across all estimators in 344 Ghanaian samples is examined and the Bayesian estimates are compared to COI estimates. To show the performance under controlled circumstances, we apply the methods to several clonal laboratory strains. As a final example, the estimates for the PF3K sample set are presented for each country, confirming significant variation in the amount of within-sample mixture across countries. The conclusion provides a brief discussion of the strengths and limitations of the approaches, and possible future directions for modelling within-sample mixture using WGS.

Data and models

Data and preparation

The data used come largely from Release 3.0 of the PF3K resource. An overview of this project, collection protocols, and a full sequencing protocol can be found at the consortial website [22]. For all the samples considered below, data come from Illumina HiSeq sequencing applied to clinical P. falciparum samples collected from 14 countries. Starting from the publicly available VCF files, samples from Nigeria and Senegal were excluded due to sample size and differing sequencing technology, respectively. First, only positions that exhibited minor allele frequencies greater than 0.01 were retained. Samples were then filtered at the country level by removing those that exhibited fewer than 80 % of variant positions with at least 20× coverage. SNPs with less than 20× coverage were then removed from all remaining samples. This yielded a variable number of SNPs within countries, from 1108 in Cambodia to 6596 in Laos. The number of samples within each country ranged from 35 for Laos to 344 in Ghana. Four additional samples (two replicates each of DD2 and 7G8) were taken from Release 5.0 of the PF3K resource for use as negative controls. Each of these samples comprised a single, unmixed strain and was sequenced to high coverage (~65×). These were cleaned according to the steps above, yielding 23,109 viable positions. A subset of 1000 SNPs was randomly selected from those remaining for inference here.
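The three filtering steps can be sketched as follows. This is a minimal illustration in Python rather than the authors' R code. The function name `filter_country`, the nested-list layout of the read counts, the use of pooled reads to compute minor allele frequency, and the reading of the final step as "drop any SNP that falls below 20× in any remaining sample" are all assumptions made for illustration.

```python
def filter_country(ref, alt, maf_min=0.01, min_cov=20, min_frac=0.80):
    """Return (kept sample indices, kept SNP indices) for one country.

    ref[i][j], alt[i][j]: reference / non-reference read counts for
    sample i at SNP j (hypothetical input layout).
    """
    n_samples, n_snps = len(ref), len(ref[0])

    def pooled_maf(j):
        # Minor allele frequency from reads pooled over samples (assumption).
        r = sum(ref[i][j] for i in range(n_samples))
        n = sum(alt[i][j] for i in range(n_samples))
        if r + n == 0:
            return 0.0
        p = r / (r + n)
        return min(p, 1.0 - p)

    def covered(i, j):
        return ref[i][j] + alt[i][j] >= min_cov

    # Step 1: keep SNPs with minor allele frequency above the threshold.
    snps = [j for j in range(n_snps) if pooled_maf(j) > maf_min]

    # Step 2: keep samples with enough well-covered variant positions.
    samples = [i for i in range(n_samples)
               if snps and sum(covered(i, j) for j in snps) >= min_frac * len(snps)]

    # Step 3: drop SNPs under-covered in any remaining sample.
    snps = [j for j in snps if all(covered(i, j) for i in samples)]
    return samples, snps
```

The thresholds are exposed as parameters so that the effect of stricter or looser cleaning on the number of retained SNPs can be explored per country.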

Notation

Within a country, samples are indexed by $i = 1,\ldots,N$ and SNPs by $j = 1,\ldots,M$. At SNP $j$ within sample $i$, we observe $r_{ij}$ reads that agree with the reference and $n_{ij}$ reads that differ from the reference. $p_{ij}$ denotes the allele frequency of the reference allele for SNP $j$ in sample $i$ and is estimated via the maximum-likelihood estimator (MLE) for proportions: $\widehat{p}_{ij} = r_{ij}/(r_{ij} + n_{ij})$. Similarly, $p_j$ denotes the population-level reference allele frequency for each SNP and is estimated by the across-sample MLE, $\widehat{p}_j$. All MLEs are calculated by country. Table 1 is provided as a reference to the reader for notation.
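As a small illustration of this notation, the sketch below computes the within-sample and population-level allele frequency estimates from read counts. The function names are hypothetical, and the across-sample estimator shown here pools read counts over samples, which is an assumption; averaging the per-sample frequencies is an equally plausible reading of "across-sample MLE".

```python
def sample_freqs(ref, alt):
    """Within-sample MLE p_hat[i][j] = r_ij / (r_ij + n_ij).

    Assumes every site has non-zero coverage (true after filtering).
    """
    return [[r / (r + n) for r, n in zip(ref_i, alt_i)]
            for ref_i, alt_i in zip(ref, alt)]


def population_freqs(ref, alt):
    """Population-level estimate p_hat[j] from reads pooled over samples.

    Pooling is an illustrative assumption, not necessarily the paper's choice.
    """
    n_snps = len(ref[0])
    freqs = []
    for j in range(n_snps):
        r = sum(row[j] for row in ref)
        n = sum(row[j] for row in alt)
        freqs.append(r / (r + n))
    return freqs
```

Because heterozygosity $2p(1-p)$ is symmetric in $p$, the estimators that follow give the same value whether $p$ is taken as the reference or the non-reference allele frequency.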

Table 1

Notation for parameters used throughout the manuscript

Parameter	Description
$j = 1,\ldots,M$	Index over number of SNPs, M
$i = 1,\ldots,N$	Index over number of samples, N
$r_{ij},\ n_{ij}$	Reference/non-reference read count data in sample i at variant j
$d_{ij} = (r_{ij}, n_{ij})$	Read count data in sample i at variant j
$p_j$ ($\widehat{p}_j$)	Population-level non-reference allele frequency for SNP j (estimate)
$p_{ij}$ ($\widehat{p}_{ij}$)	Within-sample non-reference allele frequency for SNP j in sample i (estimate)
$f_i$	Inbreeding coefficient for sample i
$H_o(b,i)$	Observed heterozygosity for sample i in bin b
$H_e(b)$	Expected heterozygosity for bin b
$\widehat{f}^{*}_i$	Estimator of $f_i$ by method $*$
$\mathbf{f},\ \mathbf{p}$	Vectors of the $f_i$ and $p_j$ in the Bayesian model


A previous frequentist estimator for $f_i$, and two alternatives

In Manske et al. [3], the authors provide an initial approach to estimating $f_i$. This estimator is referred to as $\widehat{f}^{\text{initial}}_i$ to contrast it with subsequent estimators. For each sample $i$, the estimator first partitions alleles into 10 equally-spaced bins based on their minor allele frequency: $[0, 0.05), [0.05, 0.1), \ldots, [0.45, 0.5]$. Within each bin, $b$, the average expected heterozygosity assuming country-level HWE is calculated by

$$H_e(b) = \frac{1}{M_b} \sum_{j \in b} 2\,\widehat{p}_j (1 - \widehat{p}_j),$$

where $M_b$ is the number of SNPs within bin $b$. The average observed heterozygosity within each bin and each sample is calculated by

$$H_o(b,i) = \frac{1}{M_b} \sum_{j \in b} 2\,\widehat{p}_{ij} (1 - \widehat{p}_{ij}).$$

Finally, $\widehat{f}^{\text{initial}}_i$ is calculated as $1 - s_i$, where $s_i$ is the slope found by regressing the $H_o(b,i)$ values against the $H_e(b)$ values centered within their respective allele frequency bins and constrained to pass through the origin. This is the initial estimator. The binning procedure, while stabilizing the estimator against influence from an excess of low frequency alleles common within samples, may also introduce bias. This effect can be removed by discarding the binning procedure in favour of directly regressing observed heterozygosity for each SNP against the expected value, still constrained to pass through the origin. This provides a closed-form expression for the regressed estimator, $\widehat{f}^{\text{regressed}}_i$, as

$$\widehat{f}^{\text{regressed}}_i = 1 - \frac{\sum_{j=1}^{M} \widehat{p}_{ij}(1 - \widehat{p}_{ij})\,\widehat{p}_j(1 - \widehat{p}_j)}{\sum_{j=1}^{M} \left[\widehat{p}_j(1 - \widehat{p}_j)\right]^2}.$$

A similar estimator, more transparently connected to the ideas underpinning traditional F-statistics, can be found in the following way. For a single SNP $j$, suppose $f_i$ to be the fraction of the population-level heterozygosity equal to the difference between the population-level heterozygosity and the sample-level heterozygosity, that is,

$$f_i \cdot 2 p_j (1 - p_j) = 2 p_j (1 - p_j) - 2 p_{ij} (1 - p_{ij}).$$

Dividing through by $2 p_j (1 - p_j)$ gives an estimate for $f_i$ for the SNP $j$. Averaging across all SNPs and taking the ratio of expectations to be the expectation of the ratios gives the estimator

$$\widehat{f}^{\text{direct}}_i = 1 - \frac{\sum_{j=1}^{M} \widehat{p}_{ij}(1 - \widehat{p}_{ij})}{\sum_{j=1}^{M} \widehat{p}_j(1 - \widehat{p}_j)}.$$

This is the direct estimator, since it contains the critical ratio of the mean observed heterozygosity over the mean expected heterozygosity characteristic of F-statistics. For each of these estimators, a bootstrap approach is employed to estimate the variance in the estimates for confidence intervals [23, 24].
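The three point estimators can be sketched in a few lines of Python. This is an illustrative sketch and not the pfmix implementation: the function names are hypothetical, inputs are assumed to be lists of estimated allele frequencies for one sample and for the population, and the binned version regresses bin-mean observed heterozygosity on bin-mean expected heterozygosity, which is one interpretation of the initial estimator.

```python
def het(p):
    """Heterozygosity 2p(1 - p) at a single SNP."""
    return 2.0 * p * (1.0 - p)


def slope_through_origin(x, y):
    """Least-squares slope of y on x for a line forced through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def regressed_estimator(p_sample, p_pop):
    """1 - slope of per-SNP observed on expected heterozygosity."""
    return 1.0 - slope_through_origin([het(p) for p in p_pop],
                                      [het(p) for p in p_sample])


def direct_estimator(p_sample, p_pop):
    """1 - (summed observed heterozygosity) / (summed expected heterozygosity)."""
    return 1.0 - sum(het(p) for p in p_sample) / sum(het(p) for p in p_pop)


def initial_estimator(p_sample, p_pop, n_bins=10):
    """Binned estimator: group SNPs by population minor allele frequency."""
    width = 0.5 / n_bins
    bins = {}
    for p_s, p_p in zip(p_sample, p_pop):
        maf = min(p_p, 1.0 - p_p)
        b = min(int(maf / width), n_bins - 1)  # MAF of 0.5 goes in the last bin
        bins.setdefault(b, []).append((het(p_p), het(p_s)))
    h_e = [sum(e for e, _ in v) / len(v) for v in bins.values()]
    h_o = [sum(o for _, o in v) / len(v) for v in bins.values()]
    return 1.0 - slope_through_origin(h_e, h_o)
```

A sample whose within-sample frequencies are all zero or one has no observed heterozygosity, so all three functions return one; a sample whose frequencies match the population frequencies returns a value of approximately zero.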
The bootstrap works by assuming that the empirically observed distribution (here, the allele frequencies) provides a reasonable approximation to the true underlying distribution. By repeatedly resampling with replacement from the observed distribution and recalculating the estimator at each iteration, a distribution of estimates is built from which confidence intervals can be calculated.
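A minimal sketch of this procedure for the direct estimator is given below. It assumes that SNPs are the resampling unit and that a percentile interval is used; neither detail is stated above, and the function names and defaults are hypothetical.

```python
import random


def direct_estimator(p_sample, p_pop):
    """1 - (summed observed heterozygosity) / (summed expected heterozygosity)."""
    h_o = sum(2.0 * p * (1.0 - p) for p in p_sample)
    h_e = sum(2.0 * p * (1.0 - p) for p in p_pop)
    return 1.0 - h_o / h_e


def bootstrap_interval(p_sample, p_pop, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling SNPs with replacement.

    Assumes every population frequency lies strictly between 0 and 1,
    so the expected heterozygosity of a resample is never zero.
    """
    rng = random.Random(seed)
    m = len(p_pop)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(m) for _ in range(m)]
        estimates.append(direct_estimator([p_sample[j] for j in idx],
                                          [p_pop[j] for j in idx]))
    estimates.sort()
    lower = estimates[int((1.0 - level) / 2.0 * n_boot)]
    upper = estimates[min(n_boot - 1, int((1.0 + level) / 2.0 * n_boot))]
    return lower, upper
```

The sorted list of resampled estimates can equally be used to report a bootstrap standard deviation in place of an interval.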

Bayesian model framework

Inbreeding coefficients comparable to the above estimators can also be derived from the Balding–Nichols model, a widely used method for measuring inbreeding in other genetic contexts [25]. This approach also has strong similarities to previous work in the context of P. falciparum [5, 6]. In using this model, several simplifying assumptions are required. SNPs are treated as unlinked (i.e. no linkage disequilibrium) and it is assumed that individual parasites within a sample represent a random sample of the surrounding population. It is further assumed that read counts are sampled independently and identically, and represent an unbiased sample of the within-sample allele frequency $p_{ij}$.

Likelihood and priors

The approach for the Bayesian estimator adapts the Balding–Nichols model of allele frequency within inbred subpopulations to the specific context of P. falciparum WGS data [25, 26]. In P. falciparum the relevant subpopulation is the collection of parasites within a clinical sample. For sample $i$ and SNP $j$, conditional upon an inbreeding coefficient $f_i$ and a population-level allele frequency $p_j$, the Balding–Nichols model gives the allele frequency $p_{ij}$ as a Beta distribution:

$$p_{ij} \mid f_i, p_j \sim \text{Beta}\left(\frac{1 - f_i}{f_i}\,p_j,\ \frac{1 - f_i}{f_i}\,(1 - p_j)\right).$$

Since the read counts are assumed to be independent and identically distributed, $p_{ij}$ is drawn from a Beta distribution, and the probability of the data is binomial, the conjugacy of these distributions can be used to eliminate the dependence on the unknown $p_{ij}$ and give a Beta-binomial distribution for the likelihood at sample $i$ and SNP $j$:

$$P(d_{ij} \mid f_i, p_j) = \binom{r_{ij} + n_{ij}}{r_{ij}} \frac{B\left(r_{ij} + \frac{1 - f_i}{f_i}\,p_j,\ n_{ij} + \frac{1 - f_i}{f_i}\,(1 - p_j)\right)}{B\left(\frac{1 - f_i}{f_i}\,p_j,\ \frac{1 - f_i}{f_i}\,(1 - p_j)\right)},$$

where $B(\cdot,\cdot)$ is the beta function. Since independence by site and by sample is assumed, the complete likelihood of the data $D$, conditional upon the inbreeding coefficients for all samples within the population, $\mathbf{f}$, and the allele frequencies for all SNPs, $\mathbf{p}$, becomes

$$P(D \mid \mathbf{f}, \mathbf{p}) = \prod_{i=1}^{N} \prod_{j=1}^{M} P(d_{ij} \mid f_i, p_j).$$

The only prior information about the $f_i$ values suggests that high levels of inbreeding are common but not obligatory in west African populations, and this is quantitatively interpreted as a uniform prior on each $f_i$ between zero and one. Similarly, a uniform prior distribution is put on each allele frequency $p_j$, although rare variants were eliminated as part of the data cleaning described in the Data and preparation subsection above.
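The Beta-binomial likelihood can be evaluated stably on the log scale using log-gamma functions. The sketch below is illustrative only: the function names are hypothetical, and it assumes $0 < f_i < 1$ and $0 < p_j < 1$ so that both Beta parameters are positive.

```python
from math import lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_choose(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def site_loglik(r, n, f, p):
    """Log Beta-binomial probability of (r, n) reads given f_i and p_j."""
    scale = (1.0 - f) / f
    a, b = scale * p, scale * (1.0 - p)
    return log_choose(r + n, r) + log_beta(r + a, n + b) - log_beta(a, b)


def sample_loglik(ref, alt, f, p_pop):
    """Log-likelihood for one sample: sum over independent SNPs."""
    return sum(site_loglik(r, n, f, p) for r, n, p in zip(ref, alt, p_pop))
```

Summing `sample_loglik` over samples gives the logarithm of the complete likelihood, since samples are also treated as independent.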

Inference

Since the posterior distribution is not known in closed form, a standard random-walk Metropolis–Hastings Markov chain Monte Carlo (MCMC) approach is used to approximate it numerically [27, 28]. The Metropolis–Hastings algorithm constructs a discrete-time Markov chain over the parameter space in such a way that the posterior distribution is the stationary distribution of the chain. This requires that, at a given iteration in the chain, the move from the current parameter state $x$ to a new parameter state $x'$ occurs with probability

$$A(x, x') = \min\left(1,\ \frac{\pi(x' \mid D)}{\pi(x \mid D)} \cdot \frac{q(x \mid x')}{q(x' \mid x)}\right).$$

The first ratio is that of the posterior probabilities of $x'$ and $x$. The second ratio, the proposal ratio, gives the probability of choosing the current state from the proposed state over that of the reverse move. Since the first ratio involves only the likelihood and prior functions, which can be calculated directly from the specifications above, only the proposal used for each move is described. Proposed parameters are denoted with an apostrophe.
$f_i$: randomly select $i$ and propose $f_i'$ by a random-walk step from the current value.
$p_j$: randomly select $j$ and then draw the proposed parameter $p_j'$ from the uniform prior.
$\mathbf{f}, \mathbf{p}$: for both of these parameters, randomly select individual components and propose new values directly from the prior distribution, where $x$ and $x'$ are the current and proposed state of the relevant component.
The autocorrelation of the log-posterior has minimal lag. As a secondary check, the chain was run for each chromosome individually and the resulting values were compared with those from the complete data set. Since SNPs are treated as independent, inference on a single chromosome should agree with inference on the complete data set if the model is performing well. Across all chromosomes performance is nearly identical, with very high correlation among maximum a posteriori (MAP) estimates.
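To make the sampling scheme concrete, the sketch below runs a random-walk Metropolis–Hastings chain for a single sample's inbreeding coefficient with the population allele frequencies held fixed. This is a simplification of the full model, which also updates the allele frequencies. The Gaussian step, its width, the starting value, and the function names are assumptions for illustration, not the authors' implementation.

```python
import random
from math import exp, lgamma


def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def sample_loglik(ref, alt, f, p_pop):
    """Beta-binomial log-likelihood, up to a constant that does not depend on f."""
    scale = (1.0 - f) / f
    total = 0.0
    for r, n, p in zip(ref, alt, p_pop):
        a, b = scale * p, scale * (1.0 - p)
        total += log_beta(r + a, n + b) - log_beta(a, b)
    return total


def mh_chain(ref, alt, p_pop, n_iter=5000, step=0.1, seed=1):
    """Random-walk Metropolis-Hastings chain for one sample's f."""
    rng = random.Random(seed)
    f = 0.5
    current = sample_loglik(ref, alt, f, p_pop)
    chain = []
    for _ in range(n_iter):
        proposal = f + rng.gauss(0.0, step)
        # The uniform prior is zero outside (0, 1): such proposals are rejected.
        if 0.0 < proposal < 1.0:
            proposed = sample_loglik(ref, alt, proposal, p_pop)
            # Symmetric proposal and flat prior: accept on the likelihood ratio.
            if rng.random() < exp(min(0.0, proposed - current)):
                f, current = proposal, proposed
        chain.append(f)
    return chain
```

Discarding an initial portion of the chain as burn-in and summarizing the remainder gives a posterior estimate for the sample.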

Implementation

All code was implemented in the R computational environment [29]. The set of scripts implementing each of the estimators, the MCMC algorithm, visualizations, data simulations, and filtered data sets are available at the package website [30]. This repository includes a tutorial and workflow for completing analyses using these approaches. All materials are released under a Creative Commons license.

Results

Simulations

To compare the qualities of the four estimators, a simulation study was performed under a range of parameter values to capture how estimator performance may vary with the quality of data collected in the field. The number of SNPs, the number of read counts at each SNP, the degree of skew in the allele frequency distribution ($\beta$, described below), and the amount of inbreeding were examined. For each parameter set, 100 replicate data sets were simulated. The full set of parameters is listed in Table 2. For comparing these results to empirical data, it is important to note that the simulated coverage level is more comparable to the minimum coverage level than to the average coverage level, which can vary substantially. This is because, absent other errors, the coverage level determines the statistical properties of the within-sample allele frequency, with the standard error of the estimate scaling inversely with the square root of the minimum coverage.

Table 2

Parameter values for simulated data sets

Parameter	Description	Simulation values
M	Number of SNPs	10, 50, 150, 500, 1500
C	Total read counts per SNP	10, 100, 1000, 10000
f	Inbreeding coefficient	0.01, 0.1, 0.5, 0.9, 0.99
$\beta$	Controls skew in allele frequency	1, 10, 100, 1000

For each parameter set, 100 replicate data sets were generated

The simulated data were created by first fixing the inbreeding coefficient $f$ and the allele frequency distribution. The skewness of the underlying allele frequency was parameterized as a Beta distribution with parameters $\alpha$ and $\beta$. $\alpha$ was fixed to one, while $\beta$ was varied according to the simulation to induce differing degrees of skew. As $\beta$ increases, the distribution becomes increasingly right-skewed: when $\beta = 1$, 1 % of alleles have less than a 0.01 frequency, while when $\beta = 1000$, more than 99 % of alleles have less than a 0.01 frequency. For a fixed $\beta$ and $f$, $M$ alleles are then sampled from the Beta distribution with parameters defined by Eq. 7. The read counts were then simulated according to a binomial distribution with those within-sample allele frequencies.
Figure 1 summarizes the comparison of point estimates made by the initial, regressed, direct, and Bayesian estimators across the simulated data. All boxplots use Tukey's design, showing median, inter-quartile range and whiskers up to 1.5 times the inter-quartile range. Inferred/simulated values are reported as a measure of performance. Across all parameter values, the estimators performed similarly, with the Bayesian estimate showing the least bias and highest accuracy. The number of SNPs proves the largest determinant of performance, with 50 SNPs sufficient to ensure reasonable performance in most regimes. Very low $f$ values correspond to noticeable bias for the frequentist estimators.
The initial estimator is largely robust to large skew in the allele frequency distribution, while the other two estimators are slightly biased by it at high levels of mixture. The data were simulated under the Balding–Nichols model, so the Bayesian method has an intrinsic advantage.
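The simulation scheme described in this subsection can be sketched as follows. This is a hedged illustration rather than the authors' simulation code: the function names are hypothetical, population frequencies are clamped slightly away from zero and one to keep the Beta parameters valid, and within-sample frequencies are drawn from the Balding–Nichols Beta distribution with parameters $\frac{1-f}{f}p_j$ and $\frac{1-f}{f}(1-p_j)$.

```python
import random


def frac_below(x, beta):
    """Fraction of Beta(1, beta) allele frequencies below x (closed-form CDF)."""
    return 1.0 - (1.0 - x) ** beta


def simulate_sample(m, c, f, beta, seed=0):
    """Simulate read counts for one sample at m SNPs with c reads per SNP.

    Returns (allele_reads, other_reads, p_pop), where p_pop holds the
    simulated population-level allele frequencies.
    """
    rng = random.Random(seed)
    eps = 1e-6  # keeps both Beta parameters strictly positive (assumption)
    scale = (1.0 - f) / f
    allele_reads, other_reads, p_pop = [], [], []
    for _ in range(m):
        # Population frequency from the skewed Beta(1, beta) distribution.
        p = min(max(rng.betavariate(1.0, beta), eps), 1.0 - eps)
        # Within-sample frequency from the Balding-Nichols model.
        p_ij = rng.betavariate(scale * p, scale * (1.0 - p))
        # Binomial read counts at coverage c.
        x = sum(rng.random() < p_ij for _ in range(c))
        allele_reads.append(x)
        other_reads.append(c - x)
        p_pop.append(p)
    return allele_reads, other_reads, p_pop
```

For example, `frac_below(0.01, 1)` gives 0.01 and `frac_below(0.01, 1000)` exceeds 0.99, matching the two skew regimes quoted above.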

Fig. 1

Inferred value over simulated values for each estimator across a range of parameter values: $M$, $C$, $f$, and $\beta$. Vertical axis shows inferred/simulated value, with dashed line at one. Specific simulated values can be found in Table 2. Each Tukey boxplot represents 100 replicate data sets with the same parameters

Figure 2 shows that the estimator standard deviation was similar for the three frequentist estimators and markedly smaller in the Bayesian case. For each of the parameter regimes in Fig. 1, 100 bootstrap resamplings were performed. The standard deviation is largely diminished with increasing numbers of SNPs, with read counts and $\beta$ values playing little role. Note that bias for the frequentist estimators increases with increasing $f$ values.

Fig. 2

Bootstrap standard deviation for each estimator for the same parameter values as Fig. 1. Specific simulated values can be found in Table 2. Each boxplot represents 50 bootstrap samples, each with 100 replicate data sets

Comparison in empirical data

Since the underlying Balding–Nichols model within the simulations is likely misspecified relative to empirical data, performance was examined for each of the estimators applied to the WGS data from 344 Ghanaian samples. Figure 3 shows very strong correlation between the three frequentist estimators, with correlations greater than 0.95. For the Bayesian estimate, the maximum a posteriori (MAP) estimate is reported. The Bayesian estimator is still highly correlated (>0.9) with the other estimators but is significantly more variable in its estimates. Highly mixed and highly unmixed samples appear to have the most correlation, with moderately mixed samples deviating the most from the other three estimators.

Fig. 3

Correlation in inferred value for the four estimators across the set of 344 Ghanaian samples, with each sample represented as a point. Each panel shows the correlation between the two estimators on the corresponding diagonal positions. For the Bayesian case, the MAP value is reported

Comparison with COI

As noted in the introduction, two recent efforts have extended MOI to WGS read count data and introduced the concept of COI [7, 8]. Both methods use finite mixture models to model the underlying number of strains in the sample. For comparison here, the model of O’Brien et al. is used, as it allows for more careful inference of the number of underlying strains and is more robust to errors in the read count data. For each of the 344 Ghanaian samples, the maximum a posteriori iteration is taken as the point estimate for the number of strains. Figure 4 graphs the relationship between the inferred number of strains and the F-statistic (a Spearman correlation of 0.83). While complex mixture models may provide a more penetrating understanding of within-sample variance, F-statistics appear to capture much of the same information in a single quantity.

Fig. 4

Boxplot of the direct estimator for each of the 344 Ghanaian samples, grouped by number of inferred strains using the complex mixture model of [8]

Laboratory strain data

The direct and Bayesian estimators were applied to the four clonal laboratory samples. The resulting values were very close to one, with the smallest observed value (0.98), above the standard threshold for clonality (0.95). The standard deviation of these estimates was less than 0.01. Within these samples, a moderate number of SNPs ( 20) exhibited minor allele frequencies greater than 0.05, indicating some sequencing or alignment errors. These results indicate that these methods can reliably infer clonality even in the presence of some poorly-behaved SNPs.

PF3K data set

The PF3K clinical samples outlined in the Data section were grouped by country, and the direct estimator was used to calculate the inbreeding coefficient for each sample. These values are available on the companion website as a community resource [30]. Figure 5 summarizes the results, showing relatively low $f$ values throughout West Africa, with the noticeable exception of The Gambia. South and southeast Asian countries exhibit distinctly less mixture (higher median $f$ values) than West Africa. This is consistent with previous reports of highly variable amounts of within-sample mixture across countries [3]. Interestingly, while the median level of mixture varies significantly across countries, highly mixed samples and unmixed samples are present everywhere.

Fig. 5

Boxplot of $1 - f$ for each sample, grouped by country of origin for 12 countries from the PF3K, arranged from west to east. The more intuitive $1 - f$ is used to emphasize where low and high levels of mixture are prevalent

These data overlap with two studies noted in the Background, where $f$ values were also calculated using the initial estimator [17, 18]. Using the paper-reported values, we find that there is a 0.97 correlation with the samples from Ghana, and 0.96 for those from Guinea and the Gambia, against the direct estimator values found here. The high correlation between these estimates highlights the similar properties of the initial and direct estimators and indicates the strong consistency of these estimates across different data cleaning procedures.

Discussion

This work presents several new approaches to inferring inbreeding coefficients using read counts from WGS, including a frequentist estimator that is considerably simpler and more intuitive than the initial estimator, as well as a Bayesian approach that derives from a classical population genetics model. These approaches help connect MOI investigations to a broader set of work within population genetics and conservation ecology that may be helpful in control efforts [31]. This work also demonstrates a strong correlation between these metrics and the results of more complex mixture models for inferring COI [7, 8]. While not intended to supplant these more involved methods for investigating within-sample mixture, this additional tool can assist researchers in connecting P. falciparum population genetics to a larger literature. To assist other researchers, the implementation of these methods is provided as an R package, pfmix, with tutorials and example datasets in an open-source framework at the package site, along with the direct estimates for the PF3K data set [30].

The model underlying the inbreeding coefficient makes a number of assumptions about the structure of the read count data and the biological mixing process that may affect inference. For the read count data, read counts are assumed to be unbiased and the SNPs are assumed to be unlinked. While short read data can be biased in several ways, previous research indicates that mixture proportions calculated by read count ratios are largely unbiased (for instance, see [3] supporting information). However, P. falciparum exhibits substantial linkage disequilibrium on scales much larger than the average distance between neighbouring SNPs in the data. This violation is not expected to bias the estimates, as this absence of independence occurs (roughly) evenly across the genome. Inference from a small region of the genome, however, will likely exhibit bias.
A perhaps more troublesome assumption is embedded in the underlying structure of the F-statistic. An F-statistic measures the departure of the observed number of heterozygotes relative to those expected under Hardy–Weinberg equilibrium. In the context of mixed P. falciparum infections, the equilibrium assumptions (random mating, no selection, large population size, genetic isolation) are likely each violated at some level. For example, the mixture within a sample may be the result of a small number of founding individuals or be strongly selected by the human immune system. Without a more general approach to understanding the mixing process, anticipating the robustness of these estimates to this sort of misspecification is difficult. However, we do find that the PF3K samples from Cambodia, which possess quite significant population structure, still exhibit strong correlation between and the inferred number of strains.

As genomic data enables more elaborate statistical models for mixed infections and a broader understanding of P. falciparum genetic epidemiology, it will still be useful for field researchers to connect their work with population genetics and ecology through simple metrics. These issues are also relevant for researchers working on a number of other Plasmodium species and protozoa with similar life cycles. Inbreeding coefficients, which have a history going back to the beginnings of modern genetics, connect to a number of population genetic quantities such as effective population size and genetic drift [9, 32, 33] and may serve to complement traditional MOI values and newer models to this end. This work meets this need by providing a basis to infer these quantities and a suite of open-source tools for researchers to calculate them.
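The F-statistic definition used in the Discussion (the departure of observed heterozygosity from that expected under Hardy–Weinberg equilibrium) can be sketched in its generic textbook form, f = 1 - H_observed / H_expected. This is not the direct or Bayesian estimator derived in the paper. It assumes biallelic SNPs, takes the within-sample allele frequency as a stand-in for observed heterozygosity, and sums over SNPs before taking the ratio; the function names and inputs are hypothetical.

```python
def heterozygosity(p):
    """Chance that two alleles drawn at random differ, 2p(1 - p)."""
    return 2.0 * p * (1.0 - p)


def f_statistic(within_sample_freqs, population_freqs):
    """Generic F-statistic: 1 - (observed / expected heterozygosity).

    within_sample_freqs: per-SNP allele frequency within one sample
        (assumed here to come from read-count ratios).
    population_freqs: per-SNP allele frequency in the population.
    Heterozygosities are summed over SNPs before taking the ratio.
    """
    if len(within_sample_freqs) != len(population_freqs):
        raise ValueError("frequency lists must have the same length")
    observed = sum(heterozygosity(p) for p in within_sample_freqs)
    expected = sum(heterozygosity(p) for p in population_freqs)
    if expected == 0:
        raise ValueError("no polymorphic SNPs in the population")
    return 1.0 - observed / expected


# Made-up frequencies for three SNPs (illustration only).
population = [0.3, 0.5, 0.2]
clonal_sample = [0.0, 1.0, 0.0]   # one allele fixed at every SNP
mixed_sample = [0.3, 0.5, 0.2]    # mirrors the population frequencies
f_clonal = f_statistic(clonal_sample, population)
f_mixed = f_statistic(mixed_sample, population)
```

Under these assumptions a sample with no within-sample variation gives a value of one and a sample that mirrors the population frequencies gives zero, which matches the reading of high values as unmixed and low values as highly mixed.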
