Literature DB >> 36000888

Dcifer: an IBD-based method to calculate genetic distance between polyclonal infections.

Inna Gerlovina¹, Boris Gerlovin¹, Isabel Rodríguez-Barraquer¹, Bryan Greenhouse¹.

Abstract

An essential step toward reconstructing pathogen transmission and answering epidemiologically relevant questions from genomic data is obtaining pairwise genetic distance between infections. For recombining organisms such as malaria parasites, relatedness measures quantifying recent shared ancestry would provide a meaningful distance, suggesting methods based on identity by descent (IBD). While the concept of relatedness and consequently an IBD approach is fairly straightforward for individual parasites, the distance between polyclonal infections, which are prevalent in malaria, presents specific challenges, and awaits a general solution that could be applied to infections of any clonality and accommodate multiallelic (e.g. microsatellite or microhaplotype) and biallelic [single nucleotide polymorphism (SNP)] data. Filling this methodological gap, we present Dcifer (Distance for complex infections: fast estimation of relatedness), a method for calculating genetic distance between polyclonal infections, which is designed for unphased data, explicitly accounts for population allele frequencies and complexity of infection, and provides reliable inference. Dcifer's IBD-based framework allows us to define model parameters that represent interhost relatedness and to propose corresponding estimators with attractive statistical properties. By using combinatorics to account for unobserved phased haplotypes, Dcifer is able to quickly process large datasets and estimate pairwise relatedness along with measures of uncertainty. We show that Dcifer delivers accurate and interpretable results and detects related infections with statistical power that is 2-4 times greater than that of approaches based on identity by state. Applications to real data indicate that relatedness structure aligns with geographic locations. Dcifer is implemented in a comprehensive publicly available software package.

Entities: Chemical

Keywords: zzm321990 Plasmodiumzzm321990 ; ancestry; genetic distance; genetic relatedness; identity by descent; microhaplotype; polyclonal infection

Mesh：

Year: 2022 PMID： 36000888 PMCID： PMC9526043 DOI： 10.1093/genetics/iyac126

Source DB: PubMed Journal: Genetics ISSN： 0016-6731 Impact factor: 4.402

Introduction

Monitoring, effective control, and ultimately elimination of malaria can be accelerated by understanding the dynamics of malaria transmission including evaluation of interventions, identification of sources and sinks, and determining the drivers of sustained transmission. Given the substantial genetic diversity of malaria parasites, genomic data have the potential to illuminate important aspects of epidemiology (World Health Organization 2019). Compared to viruses where mutations are the main source of variation and can be used directly to make temporal inferences, reconstructing transmission for recombining organisms with lower mutation rates requires a different approach. Since genetic recombination between malaria parasites occurs in the mosquito during person-to-person transmission, genetic relatedness can provide information on their shared ancestry and therefore transmission epidemiology at relevant timescales. Consequently, pairwise genetic distance as a measure of relatedness between infections may be more useful and detailed for answering epidemiologic questions than metrics based on comparison between populations (Taylor ; Wesolowski ; Chang ; Tessema ). By assessing how closely related individual infections are, pairwise distance can also provide answers to questions such as whether particular infections were more likely to have been acquired locally or imported. Due to coinfection and super-infection, individuals in endemic areas are often infected with multiple genetically distinct clones simultaneously. These polyclonal infections are the rule rather than the exception for Plasmodium falciparum in many endemic areas, even in relatively low transmission settings of sub-Saharan Africa (Roh ); polyclonality may be even more common for Plasmodium vivax (Koepfli ; White ). Assessing genetic relatedness between polyclonal infections is more complicated both conceptually and methodologically than doing so for individual parasites. Obtaining phased genotypes of individual parasites from polyclonal infections would present a potential solution, but outside of single-cell sequencing this currently requires the use of statistical methods which are computationally intensive and may have limited accuracy in the absence of informative reference genomes, particularly when more than 2 clones are present (Zhu ). Even with phased genotypes, a unified summary of relatedness might be useful as a distance measure, so that it could be compared across pairs of infections that may be either monoclonal or have a higher complexity of infection (COI). Incorporating multiallelic genetic data, i.e. diverse loci with more than 2 variants, can improve the estimation of relatedness between monoclonal infections (Taylor ) and may offer an even greater improvement over biallelic loci for polyclonal infections (Tessema ). Fortunately, current technologies make it feasible to efficiently amplify and sequence multiple diverse regions of the Plasmodium genome, generating multiallelic data for this purpose (Lerch ; Aydemir ; Tessema ; LaVerriere ). Much of the epidemiologically useful information contained in relatedness measures lies in detecting shared ancestry; there is therefore interest in estimating the proportion of genomes that are identical due to descent. Currently available methods based on identity by descent (IBD) for Plasmodia are developed for monoclonal infections or are adapted from human genetics: hmmIBD is designed for monoclonal infections and can incorporate multiallelic as well as biallelic loci (Schaffner ); isoRelate is able to accommodate polyclonal infections (Henden ), but is limited to biallelic loci and has unclear applicability to infections with COI >2 since it is based on the diploid model. With no existing IBD-based methods to infer a degree of shared ancestry from polyclonal infections using multiallelic data, various suboptimal workarounds are generally employed. For example, some studies have attempted to infer a “dominant strain” from polyclonal infections using within host allele frequencies, while others have excluded polyclonal infections from the analysis altogether. Depending on the proportion of infections that are polyclonal, such procedures may grossly underutilize data or introduce bias to the analysis due to informative missingness. Alternatively, a simple identity by state (IBS) approach has been used (Pringle ; Tessema ; Atuh ); it is convenient and fast but has extensive drawbacks as it produces similarity measures that are not easy to interpret and address relatedness only indirectly (Taylor ). To fill the methodological gap, we introduce Dcifer (distance for complex infections: fast estimation of relatedness), a method employing IBD to estimate the level of common ancestry between polyclonal samples. It allows for unphased multiallelic data such as microsatellites or microhaplotypes as well as SNPs, explicitly takes into account COI and population allele frequencies, and does not require densely spaced or linked markers. Focusing on interhost relatedness, we developed a working model that allowed us to define an estimator with desirable statistical properties and formal inference. As the method provides a probabilistic solution to the multitude of possible underlying phased genomes, we used a unified mixed radix incrementing combinatorial algorithm for its implementation as a comprehensive R software package (Gerlovina 2022): github.com/EPPIcenter/dcifer (accessed 2022 Aug 25). Finally, we assessed the performance of Dcifer for estimating relatedness between P. falciparum infections using simulations and empirical data.

Methods

The working model we developed is designed to address interhost relatedness and includes assumptions that reduce a complex realistic dependence structure to a simpler model that still allows us to formally define and estimate the quantity of interest without introducing significant bias. The main assumptions can be summarized as the absence of linkage disequilibrium and intrahost relatedness. As the observed data are several levels removed from the random variables we are interested in, the likelihood for the model accounts for various possible unobserved combinations (phased haplotypes) with multinomial-based probabilities and the use of combinatorics; consequently, numerical methods are used to find a maximum likelihood estimate. The likelihood incorporates population allele frequencies and COI of both samples, and the estimation process explicitly accounts for the fact that alleles present in both infections may match by chance (i.e. be identical by state but not by descent). Consider 2 infections with COI of n, n, and a panel of T multiallelic markers. At each locus t, , there is a set of possible alleles. For convenience, we can arbitrarily order the alleles and map them to the corresponding population allele frequencies . We assume that the underlying population allele frequencies are the same for both infections.

IBD model for 2 haplotypes

To build up a model for relatedness between polyclonal infections, we first consider 2 haplotypes. Let sequences of random variables represent these haplotypes, and let be a sequence of independent identically distributed random variables, where and parameter r describes the level of relatedness of the 2 haplotypes (Taylor ). Let , where P is a categorical distribution with values in A and corresponding probabilities ; let Y be a random variable such that Note that X and Y are interchangeable in this setup, and the joint distribution of X and Y (marginal and conditional on IBD) would not change if they were switched. While IBD are i.i.d., (X, Y) are marginally independent but not identically distributed since P is different for each t. We can also define a random variable , with . In this model, realizations of X and Y could be observed (e.g. if they represent monoclonal infections and there is no genotyping error) but IBD’s are unobservable. In contrast, IBS are directly observed if X and Y are observed.

Working model for polyclonal infections

Let an matrix and matrix represent 2 polyclonal infections with COI of n and n, with rows of the matrices referring to haplotypes and columns to loci. Thus, , is an i’th haplotype of the first infection, and a column is a sequence of random variables with values in A representing alleles for all the haplotypes at a locus t. Let denote a multiset (a collection of elements that are not necessarily distinct) of unordered elements of a t’th column of and let be a set of unique elements in that column; are defined similarly. For realizations of and we will use notation and ( and are not observed, but and are). The model assumes no genotyping error, and the sequences of sets and are observed data, for which Dcifer is designed. There are pairs of malaria strains that can be related. To differentiate IBD and parameters of their distributions for different pairs, let , and refer to a pair within first infection, a pair within second infection, and a between-host pair, respectively, and, similarly, let , and denote corresponding relatedness parameters. If we are only interested in between-host relatedness (which may be the case for many practical applications), we might formulate the goal as “estimating interhost relatedness adjusted for intrahost relatedness,” which would condense parameters into some lower-dimensional summary. Usefulness of adjusting for intrahost relatedness can be illustrated by considering a case where an extra haplotype , very closely related to an existing one (say, X1, with ), is added to one of the infections. That would result in essentially doubling X1’s contribution to the sum of all interhost relatedness parameters, as well as increasing COI (recall that n is defined as a number of distinct haplotypes). With such goal as our scientific question, we introduce a simplifying assumption of no intrahost relatedness, which projects a realistic model of unconstrained intrahost and interhost relatedness parameters onto a much smaller model space and allows us to make the problem tractable while aiming to arrive at the same summary estimate as we would if we were able to estimate all the parameters in a bigger model. For each pair of strains in 2 infections, e.g. i’th strain in the first sample and j’th in the second, let , and be the random variables as defined in IBD Model for 2 Haplotypes. Then, for the working model for polyclonal infections, we introduce the following assumptions: for all , for all (no intrahost relatedness) and if or for all (all interhost IBD variables are independent at a given locus). An important implication of these two assumptions is that any strain in one sample can be related to at most one strain in another: and . This can be proven by contradiction: since . If and , then , which contradicts assumption 1. For further discussion on the model assumptions, see Supplementary Section 1 in File_1. Since we can order strains within an infection arbitrarily, and in light of the constraints of the model, we order the haplotypes in 2 infections in such a way that X1 can only be related to Y1, X2 to Y2, and so on (Fig. 1). In addition, we introduce M—the number of strain pairs that can be related, . Then, for brevity, we suppress some of the subscripts and use for and for (note that parameters for all the other IBD variables are zero). The goal of Dcifer is to estimate parameters of the joint distribution of . Let denote an estimand, and —its maximum likelihood estimator (MLE):

Fig. 1.

Working model presented at a single locus t: an example featuring n = 4, n = 3, and M = 3. Colors of the circles represent alleles; 2 clones in each infection have the same allele. is a multiset of shared (nonunique) alleles at a locus t. At each locus t, the likelihood needs to account for all the possible combinations of nonunique alleles in both samples (multiple haplotypes will have the same allele if COI is greater than the number of unique alleles). For one sample, this is done by considering a set of all multisets with given support and cardinality (all the that could have produced , see Fig. 1); we denote a set of all multisets such that and by . can be calculated using a probability mass function of a multinomial distribution: the number of permutations of is equal to a multinomial coefficient (“assigning” alleles in s to strains, or going from to ), and allele frequencies correspond to event probabilities. Multiplicities of the multiset’s elements , or the numbers of strains having the same allele, are multinomial random variables. Adopting a short notation for this key component of the likelihood, let denote a probability mass function for a multinomial distribution , where is a multiset of cardinality n () with elements from K categories, and are probabilities for these categories; set . Next, for given and , we divide their elements into 3 groups: shared alleles that are identical by descent (say ), remaining alleles in (), and remaining alleles in (). The probability of each of these multisets is similarly calculated using multinomial distributions. Supplementary Section 2 in File_1 provides more details and builds up the likelihood from M = 1 and . For a general case, where and . When , the likelihood reduces to where . While are independent, are not. This dependence stems from the fact that we do not observe ordered alleles at each locus (or, in other words, phased haplotypes). That also provides intuition for why is estimated more accurately than individual r's : estimating at a locus t is easier than estimating an actual binary sequence . Another useful observation is that the order of parameter values in does not affect the value of , which can be taken into account when the likelihood is evaluated over a grid of .

Implementation

Calculating the likelihood in (2) requires solving a number of combinatorial problems: finding all the collections of nonunique alleles at a locus that are concordant with observed alleles and COI, finding all the multisets included in a given multiset of shared nonunique alleles, and finding all the possible binary sequences with given constraints for IBD variables. These problems are solved with a unified mixed radix incrementing algorithm (https://github.com/innager/mirsa/tree/v1.0.0; accessed 2022 Aug 25) (Supplementary File_2), which is an extension of an algorithm to generate all n-tuples in (Knuth 2011). As the calculation traverses the combinations described above, multiple sequences can be processed at each step, and thus the likelihood for a range of parameter values can be calculated in a single pass. With bounded parameter space, this allows for an efficient way to find MLE by simply calculating the likelihood for an M-dimensional grid of a desired coarseness. The resulting log-likelihood curve or surface can also be useful for inference, especially for procedures based on a likelihood ratio approach, such as testing various hypotheses or determining confidence regions. For a special case of M = 1, the log-likelihood can be calculated using Supplementary Equation (2) in File_1, which also admits fast calculation of the score and consequently numerical methods of solving the likelihood equation (Newton’s method adapted for bounded parameter space is used in the package).

Inference

Along with an estimate of , Dcifer provides a log-likelihood function, which can serve as a basis for various inferential procedures (for some intuition on the shape of that log-likelihood function, the effect of COI and population allele frequencies on it, and implications for the inference, see Supplementary Section 3 in File_1). In our model, sample size is T, but different loci do not provide the same amount of information (recall that are independent but not identically distributed); their contribution can be associated with different measures, e.g. heterozygosity. Given these measures and the complexity of the estimator, methods relying on asymptotic approximations should be approached cautiously; still, as the sample size increases, precision of estimation increases as well. For hypothesis testing and confidence intervals (CI)/regions, we consider common inferential approaches as applied to Dcifer: asymptotic normality, likelihood-ratio statistics, and resampling methods. There are common challenges that affect all 3 approaches: bounded parameter space with edge cases not only included but conceptually important, such as a null hypothesis of infections being unrelated; for other cases, sampling distributions for different (even neighboring) parameter values on the interior of the support could be quite different for panels with even fairly large T. Still, some approaches might be better suited for Dcifer, and some may be chosen on the basis of convenience and computational efficiency. For Wald-type CI, observed Fisher information can be easily calculated numerically (and arguably preferred to expected Fisher information; Efron and Hinkley 1978); likelihood-ratio-based CI, while asymptotically equivalent to Wald’s (1943), are more robust as they are invariant to parameter transformation that could be used to make the log-likelihood function approximately quadratic at MLE (Beale 1960; Cox and Hinkley 1979, pp. 342–343; Cook and Weisberg 1990; Vander Wiel and Meeker 1990; Meeker and Escobar 1995). Resampling methods include bootstrap and generating a null distribution for hypothesis testing. While there are many advantages this approach provides for finite sampling distributions not yet approaching normality, there is a caveat: if a centered sampling distribution at MLE is not close enough to that at the true value, the inference will be problematic. In addition, inverting quantiles of a bootstrap distribution for CI endpoints can lead to violating the bounds of the parameter space (see Supplementary Section 4 in File_1), as can Wald CI. Simulated null distributions do not suffer from this problem but still rely on various assumptions and might be sensitive to misspecifications as demonstrated in the Results. In contrast, likelihood-ratio confidence regions respect parameter bounds and do not require any additional model assumptions for hypothesis testing. Likelihood-ratio-based inference is based on Wilks’ theorem (Wilks 1938) and uses the likelihood ratio test statistic, which in the context of Dcifer hypothesis testing with can be written as , where , and is approximated by chi-squared distribution with M degrees of freedom. The approximate confidence region consists of the values where is a ’th quantile of . As the Wilk’s theorem does not apply to border cases, we specifically address these important cases (for which the likelihood-ratio test is still the most powerful; Neyman and Pearson 1933) and compare the corresponding distribution of the likelihood ratio statistic with that no longer approximates it. First, the test at the boundaries is 1-sided while the chi-squared distribution implies 2-sided tests. Accounting for that would mean dividing a P-value obtained from the chi-squared distribution by 2 or finding a corresponding critical value for the significance level α. Second, it turns out that even with this adjustment, the resulting P-value is still somewhat conservative, and, as shown in the Results, the method has excellent error rate control.

Estimating the number of related strain pairs and rtotal

Parameters of the working model include n, n, and . M, the length of , can be considered a nuisance parameter. In addition, let be a number of positively related strain pairs; unlike M, can be a quantity of interest. The estimator in Equation (1) assumes that the model is constrained by given values of n, n, , and M; the likelihood is calculated using these values. However, while n, n, and are “external” to and are provided or obtained through other processes, M is inherent to relatedness between 2 infections. Thus, here, we consider a less constrained model where M is not given. In this case, a trivial solution to estimating would be to set since ’s associated with different will only differ in the number of zeros (r = 0). If we want to estimate or , which are functions of , they can be similarly obtained from as and . In practical applications, this trivial solution can incur high-computational cost for higher , and therefore we propose alternative estimators , , and that use an iterative procedure with underlying calculation of at each step. The first step is to set and calculate , then at each consecutive step increment M and recalculate until it contains one zero () or until . Accept obtained at the final step as , with and . If rtotal, rather than , is of main interest, the computation time can be cut even further by assuming and using Equation (3) to calculate the likelihood. In this case, we propose yet another set of estimators and , where is calculated for all , and is the value of M that produced the highest maximum likelihood. Then , where is an MLE at .

Simulations and comparison with an IBS metric

Our simulations are based on previously published SNP and microhaplotype panels (Daniels ; Tessema ; Jacob ); allele frequencies for these panels were obtained from previously analyzed empiric datasets. The SNP panels have 23 and 101 loci with 2 alleles per locus, and the microhaplotype panel has 91 loci with the number of alleles at each locus ranging between 3 and 95. To assess the performance of Dcifer with varying number of multiallelic loci, we used allele frequencies for the 91-loci panel and repeated them for consistency (thus creating synthetic 182-loci, 273-loci panels, and so on). To include genotyping errors in simulations, we devised a “miss-and-split” model with parameters ϵ and λ: False negatives: one of k present alleles (drawn with probabilities ) has zero probability of being missed; the remaining alleles can be missed with probability ϵ. Let K be a number of alleles remaining present after this step; then . False positives: draw a number of added alleles (“splitting” event) for each nonmissing allele; subsequently draw Nadd alleles from alleles with replacement. In the final “observed” data, an allele is considered present if selected by at least one of the splitting events. Note that is very small for reasonably small λ’s. For analysis procedures that involve estimating COI and population allele frequencies prior to Dcifer, we used naïve COI estimation with a locus rank c that depended on the number of loci (COI determined by a locus with c’th greatest number of detected alleles) for multiallelic panels, and THE REAL McCOIL method for biallelic SNP panels (Chang ). Population allele frequencies were estimated from simulated datasets of 400 samples, where relatedness was induced in 10% of the pairs and COI for the rest of the samples was generated with truncated Poisson distribution with parameter λ = 3. We used COI-adjusted estimation (see Supplementary Section 5 in File_1), which is important for polyclonal infections; failure to adjust for COI can lead to overestimating heterozygosity and, consequently, relatedness parameters. Most simulations consisted of 10,000 pairs of related infections for a given COI and combination. To compare Dcifer performance with an IBS approach, we used the Jaccard similarity coefficient (Jaccard 1912) as an example since it is a commonly used statistic, which is conceptually simple and fast, and which performs similarly to the other IBS measures we have considered. For this calculation, loci data for a sample (each locus represented by a binary sequence with elements indicating if an allele has been detected or not) were concatenated into a single binary sequence of length , where K is the number of possible alleles at a locus t. Then Jaccard similarity between 2 sequences and that represent 2 infections was calculated as

Results

The main goal of Dcifer is to estimate parameters describing relatedness between infections, and this estimation requires values of the other parameters in the model. These external parameters represent COI (n, n) and population allele frequencies (), which can be known (e.g. in simulations), estimated from data, or otherwise specified. Dcifer is implemented in a software package that takes raw data on the alleles detected at each locus (biallelic or multiallelic) in Plasmodium infections, allowing for missing data, along with COI and population allele frequencies. In simulations, we assess the performance of Dcifer when data have no genotyping error and COI and are known, as well as in the presence of genotyping error with COI and estimated from these data. We also evaluate how sensitive Dcifer is to misspecification of these external quantities and to assumption violations. We start with a case when only one pair of strains can be related (M = 1), since it can be used to quickly identify related infections in a large dataset, and later proceed to the general case. Finally, we apply Dcifer to analysis of real data, where COI and allele frequencies are estimated from the data.

Dcifer produces accurate and interpretable estimates of relatedness

Unlike IBS metrics that simply measure similarity between infections comparing detected alleles, Dcifer aims to produce more interpretable results by estimating parameters that represent IBD and thus separating shared ancestry and chance as underlying reasons for alleles matching between 2 infections. To evaluate the performance of this method in comparison to IBS approach (we used Jaccard similarity coefficient as an example), we simulated genetic data for infection pairs with different degrees of relatedness (induced on a single pair of strains between 2 infections) and COI, based on previously published SNP and microhaplotype panels (Daniels ; Tessema ; Jacob ). Across various values of COI, Dcifer estimates were concentrated around the true values of the parameter while IBS results were not (Fig. 2, shown for a panel of 91 microhaplotypes with no genotyping error and known COI and allele frequencies). As COI increased, Dcifer estimates became more variable but remained centered around the true values and maintained some degree of separation, whereas IBS results shifted and overlapped considerably more. For these simulations, separation between results for completely unrelated (r = 0) and related infections, quantified in receiver operating characteristic (ROC) curves, indicated considerable gain in accuracy by Dcifer compared to the IBS metric across the range of COI, especially for lower degrees of relatedness (Supplementary Fig. 1 in File_1). For example, Dcifer estimates of sibling-level relatedness (r = 0.5) remained readily distinguishable from those of unrelated infections even for fairly high COI.

Fig. 2.

Densities of Dcifer relatedness estimator and IBS similarity metric results obtained from data simulated using a panel of 91 microhaplotypes (Tessema ). Simulations were performed for 5 values of r and for COI combinations ranging between 1 and 5; true values of COI and population allele frequencies were used for Dcifer. Upward arrows indicate highly concentrated distributions with density values extending above the plot range (note: y-axis scales are different for the 2 methods). When the task of detecting related infections is approached in practice, there are additional issues to be considered because neither COI nor population allele frequencies are known, genetic data often contain genotyping errors, and the extent of these errors is unknown. Supplementary Fig. 2 in File_1 illustrates how results changed when genotyping error was included in the simulations, and estimates of COI and population allele frequencies, and not their true values, were used as inputs to Dcifer. IBS results shifted to the left more or less uniformly; since distributions for different values of r were so tightly concentrated and close together, apparently small shifts were significant compared to the differences between these distributions. Dcifer estimates also shifted to the left, but, relative to differences in sampling distributions, the shifts were smaller than those for IBS. The fact that the shifts were more pronounced for larger values of r is explained by genotyping errors breaking up some of the relatedness between infections.

Greater power of hypothesis tests using Dcifer vs. IBS

One way of detecting related infections along with a measure of uncertainty (e.g. P-value) is to compare Dcifer relatedness estimates or IBS similarity results with their corresponding null distributions (), which can in theory be obtained by simulating a large number of unrelated infections. To evaluate the performances of Dcifer and the IBS metric, we calculated false-positive rates (FPR) and power of tests with significance level for different types of genetic data across a range of COI. Genetic data were simulated with genotyping errors; they were incorporated into simulated “null” distributions as well. Distributions for r = 0 are different for different COI and therefore a separate null distribution was generated for each COI pair combination; the effect of COI on such distributions was substantial for IBS (Supplementary Fig. 3 in File_1). Relatedness estimates for each pair of infections were then compared to a rejection cut-off determined by a null distribution corresponding to their estimated COI, and FPR and statistical power were subsequently calculated. In addition to the complexity and computational costs associated with generating a null distribution, this approach relies on a number of assumptions such as COI, allele frequencies, and the error model and its parameters, which are all subject to misspecification. As a welcome alternative, Dcifer offers another inferential approach based on the likelihood ratio, which does not require any additional information (i.e. does not require generating a null distribution) and has essentially no computational overhead. Figure 3 compares hypothesis testing results for IBS, using simulated reference distributions, and Dcifer, using likelihood-ratio P-values adjusted for 1-sided tests. For both methods, FPR was mostly at or below the nominal significance level α across different simulations of COI and genotyping panels, with Dcifer close to α. Statistical power, however, varied considerably. As expected, higher values of relatedness were detected with greater power, increasing the number or diversity of loci increased power, and higher COI led to lower power. Across all simulations, Dcifer consistently demonstrated greater power to detect related infections than the IBS metric, with differences particularly notable for polyclonal infections. For example, with a 91 microhaplotype panel, the power to detect half-siblings (r = 0.25) in a pair of infections with COI of 2 was 0.81 for Dcifer and 0.43 for the IBS metric; with 455 microhaplotypes and COI of 5 that power was 0.88 and 0.22, respectively. Results for an alternative scenario when α is a function of COI, which might be useful if error rate control on the scale of parasite strain pairs rather than infection pairs is desired, are presented in Supplementary Fig. 4 in File_1. While Supplementary Figs. 1 and 2 in File_1 would suggest that there is still some separation between distributions for different values of r for the IBS metric results, which would be expected to improve with increasing the number of loci, its performance was remarkably poor, having very low power for larger COI and r < 0.5 even with highly informative panels. This reflects the fact that for tightly concentrated distributions of IBS results, the difference between cut-offs associated with different assumed null distributions is critical, and consequently, misspecification of COI or an error process had a deleterious effect on either FPR or power (Supplementary Fig. 3 in File_1). The likelihood-ratio-based approach performed very similarly to the one based on null distributions for Dcifer, evidencing this as a preferred approach for the reasons described above (Supplementary Fig. 5 in File_1).

Fig. 3.

Detecting related infections. FPR and statistical power of a test at significance level are shown. Simulations included genotyping error with fixed error model parameters; COIs were estimated from these data. For simulated null distributions, we varied error model parameters since they would not be normally known. SNP and microhaplotype (MH) panels were used as a basis for simulations.

Dcifer provides likelihood-ratio-based CI

The Dcifer likelihood-ratio-based approach allows for calculating M-dimensional confidence regions (where M is the number of related pairs)—or, in a case when only one pair of strains is assumed to be related between 2 infections, CI. Figure 4 shows CI’s for a range of true r values and COI. Infections were simulated using microhaplotype panels with various number of loci. As expected, CI’s were narrower for panels with more loci. In general, the intervals were narrower near endpoints (r = 0 and r = 1) and wider in the midrange. Interestingly, the least COI in the pair () had a greater effect on the CI than the sum of COI (); this can be seen in more rapid widening of the intervals from left to right than from top to bottom of the figure. With large numbers of diverse loci, CI stayed narrow even for higher COI. Coverage for these CI was around , and consistently higher for endpoints, indicating that CI for these endpoints were conservative, even taking into account the 1-sided nature of such intervals (Supplementary Fig. 6 in File_1; also demonstrated by FPR in Fig. 3).

Fig. 4.

95% CI for relatedness estimates using the likelihood ratio produced by Dcifer.

Allowing for multiple pairs of strains to be related

So far we have only presented results for a single related pair of strains between 2 infections (M = 1) regardless of COI. When we allow that multiple pairs of strains may be related, Dcifer produces a corresponding number of estimates—one for each pair. To accurately estimate multiple relatedness parameters without any additional assumptions, a large number of diverse loci is needed; otherwise, there is a lot of variation in the individual estimates (see an example in Supplementary Fig. 7 in File_1). However, while the estimation of individual relatedness parameters was challenging, their sum rtotal was estimated more accurately even with a lower number of loci (which can be seen in the contour plots of Supplementary Fig. 7 in File_1). Supplementary Fig. 8 in File_1 shows likelihood surfaces for 2-dimensional parameters (), where we can clearly see the ridge along ; that value is close to the true sum even when the individual estimates are further away from (r1, r2). If the goal is to estimate overall relatedness between 2 infections, we suggest rtotal as a more identifiable and useful quantity than . To estimate rtotal and the number of positively related strain pairs , we used the procedure described in Estimating the Number of Related Strain Pairs and rtotal and compared the estimates obtained from 2 approaches: (1) with “equality assumption” (estimators ) and (2) without it (estimators ). First, we validated the stopping rule for the second approach (when r are not assumed to be equal), confirming that incrementing M past the iteration that estimates one of r's to be 0 only appended additional 0’s to MLE in most cases. Next, we compared the 2 approaches and assessed the accuracy of the corresponding estimators. For each simulated pair of infections, we first randomly generated for given rtotal and . The estimates were compared across a grid of COI, rtotal, and . Figure 5 shows illustrative examples of these comparisons: in 5(a), is changed while COI and rtotal are fixed, in 5(b) rtotal is changed, and in 5(c) COI is changed. Distributions of and were quite similar, so the equality constraint had a very limited effect on the overall relatedness estimates. There was more difference between and , but, importantly, these differences did not significantly affect rtotal estimates. An effect of varying on the distributions of and was small, while lower COI resulted in more accurate estimates. Higher rtotal made for more accurate estimation of , as it eliminated lower values incompatible with estimates. It is worth noting that simply increasing dimensionality of the grid of relatedness values to evaluate over can become unfeasible for larger M, so the grid would have to be coarsened to accommodate, which in turn would affect precision. No such limitation exists for the fast “equal r” approach as it estimates a single parameter.

Fig. 5.

Estimation of rtotal and with and without equality assumption . Densities of (no assumption) and (with assumption) are shown on the left, and probabilities for the values of and are shown on the right. Quantities highlighted in red indicate those varied between the top and bottom simulations within each panel. Simulations were performed using a panel of 91 microhaplotypes. a) Varying M’. b) Varying rtotal. c) Varying COI.

Misspecifications and assumption violations

In data analysis, COI and population allele frequencies are usually unknown and need to be estimated from data. Allele frequencies can often be estimated from sufficiently large datasets (e.g. over 100 samples) and as such, their estimates are often fairly stable; some implications of their misspecifications are discussed in Supplementary Section 3 in File_1. COI estimation, however, relies on a smaller amount of information, resulting in greater variability of the estimates and more frequent misspecifications. Fortunately, Dcifer appeared to be relatively robust to COI misspecifications, especially for less complex infections (Supplementary Fig. 9 in File_1). Even for higher COI, relatedness estimates were fairly close to the true value in the neighborhood of the correct COI. Next, we address our working model and its defining assumption of no intrahost relatedness. To assess how violating this assumption affects interhost relatedness estimation, we compared 5 scenarios: one with no intrahost relatedness and 4 where different strains within the samples are siblings (r = 0.5)—see the diagrams in Supplementary Fig. 10 in File_1. Note that in scenario 2 (Supplementary Fig. 10b), there are 2 interhost pairs that are related (strains X1-Y1 and X2-Y1) but that “extra” relatedness is only a consequence of the induced X1-X2 sibship and does not add anything to our quantity of interest. The same is true for more complex scenarios 4 (Supplementary Fig. 10d, with related interhost pairs X1-Y1, X2-Y1, X1-Y2, and X2-Y2) and 5 (Supplementary Fig. 10e, with related interhost pairs X1-Y1, X2-Y1, and X3-Y1). All simulations included genotyping error, and processing involved estimating COI and population allele frequencies. Relatedness estimates for all 3 scenarios were very similar (Supplementary Figs. 11 and 12 in File_1), confirming that in many common cases of intrahost relatedness, the working model estimates interhost relatedness without significant biases.

Applications to empirical data

We applied Dcifer to a small dataset that has 87 microhaplotypes and consists of samples obtained from patients presenting with malaria from 2 health facilities in Maputo and Inhambane provinces of Mozambique (Tessema ). There were 52 samples overall with 26 from each clinic; only samples with data for at least 75 loci were considered for the analysis (Supplementary Data_1). From these samples, naïve COI estimates (60% polyclonal samples with maximum COI of 6) and subsequently estimates of population allele frequencies adjusted for COI were calculated (Supplementary Data_2). We initially set M = 1 and used likelihood-ratio statistics to test a null hypothesis at significance level (with the procedure adjusted for a 1-sided test). For comparison, Jaccard similarity was used as an IBS metric; Fig. 6 displays results from both methods. Dcifer results indicated that the majority of samples were unrelated and that related samples were mostly from the same clinic. Note that significant pairs with relatively low relatedness estimates usually shared a rare allele at one or two loci. The IBS metric also picked up very highly related pairs, but, apart from those, it was more difficult to distinguish related samples from background. Some samples appeared to be less related to all the other ones in IBS results, and some—more (stripe-like patterns in the lower triangle); these single-sample relatedness levels correlated with estimated COI (e.g. all the lighter “stripes” corresponded to monoclonal samples) highlighting the fact that IBS similarity is strongly influenced by COI, which obscures contribution of descent. For related samples, we also estimated (the number of related strain pairs) and rtotal (overall relatedness). For these samples, ranged between 0.105 and 1.97 and there were 4 pairs of samples (all in Maputo), for which exceeded 1, with estimated COI of 2 in all samples in these pairs and for all such pairs.

Fig. 6.

Relatedness between samples from 2 clinics: one from Inhambane province and one from Maputo province of Mozambique. Lower triangular matrix displays IBS metric results and upper triangular matrix—Dcifer estimates. The color of each matrix entry represents an estimate for 2 corresponding samples; pairs for which H0 has been rejected using the Dcifer likelihood ratio test are outlined in red. Estimated COIs are displayed on the margins. We also reevaluated microsatellite data from a previously published dataset, which contained 2,585 samples from 29 clinics in 4 districts in Namibia (Tessema ), using Dcifer to estimate relatedness (Supplementary Data_3 contains estimated allele frequencies). These data had 26 loci, and 77% of the samples were polyclonal. The average of relatedness estimates between monoclonal samples (which can be taken to represent relatedness between individual parasites) was similar to the average overall relatedness () between polyclonal infections scaled by the minimum COI for each pair. Minimum COI represents the number of interhost strain pairs in 2 infections that can be related within the Dcifer working model, so scaled can be viewed as an average estimate of relatedness between individual strains belonging to 2 different infections. We assessed the proportion of related pairs of samples within and between clinics (), then performed a permutation test to determine which clinic combinations had more related samples than expected by chance. Most of the within-clinic entries had significantly large numbers of related samples (Fig. 7). In addition, clinics with geographical proximity had significantly more related between-clinic infections, as illustrated by clusters of darker circles along the diagonal. Rundu DH is a large referral hospital, which could explain relative genetic closeness between samples from this and more geographically distant clinics. Rundu, Nyanganna, and Andara districts are adjacent to each other and Zambezi district is distant from them, which is reflected in the relative lack of relatedness between Zambezi and other districts.

Fig. 7.

Namibia clinic-level relatedness and permutation test results. Each circle represents a single clinic (on a diagonal) or a 2-clinic combination (off-diagonal), with clinics ordered geographically and divided into districts. The color of the circle corresponds to the permutation test’s P-value, and the diameter—to the proportion of related samples within or between the clinics. Permutation distributions for clinics with smaller numbers of samples have larger variance leading to larger P-values—e.g. Chinchimani and Kanono with only 9 samples each have relatively high proportions of related infections but their P-values are not small.

Computational efficiency

Because of various simplifications of the likelihood expression for important special cases, there are 3 main versions of the estimation procedure (in the order of increasing computational complexity): (1) M = 1, using Newton’s method adapted for a bounded parameter space to find MLE or calculating likelihood for a 1-dimensional grid of parameter values if a CI is requested, (2) M > 1 with assumption involving likelihood calculation for a 1-dimensional grid, and (3) M > 1 without assumption involving likelihood calculation for an M-dimensional grid. Since we expect that most sample pairs in a typical dataset will be unrelated, an initial analysis step will likely attempt to identify significantly related (or unrelated) ones by evaluating all pairs assuming . Therefore, execution times for this special case are the most important ones. Times for “grid” versions depend heavily on the resolution of the grid, while the Newton-based procedure usually converges in 3–4 iterations even with low tolerance. Table 1 shows execution times for calculating for 1,000 sample pairs (100 loci, each with 2–20 alleles) with M = 1 and various COI combinations for both Newton’s method and a grid; times for all but the highest COI combinations ( for Newton’s method and for the grid) were <5 s. Simple vectorized computation of the IBS metric does not depend on COI and averages 0.055 s for the same data (1,000 pairs). Calculation times are essentially linear in the number of loci, so times for larger panels can easily be estimated. When , calculation times are roughly exponential in M for when r equality is not assumed and increase only slightly with M when it is; these times are shown in Table 2.

Table 1.

Execution times (in seconds) for Dcifer processing 1,000 pairs of samples with COI combinations ranging between 1 and 10.

1	2.24
	3.16
2	2.31	2.42
	3.24	3.24
3	2.31	2.35	2.39
	3.29	3.30	3.29
4	2.20	2.30	2.43	2.43
	3.47	3.44	3.40	3.32
5	2.21	2.26	2.28	2.35	2.36
	3.41	3.55	3.51	3.53	3.66
6	2.26	2.34	2.32	2.36	2.48	2.58
	3.38	3.47	3.47	3.57	3.69	3.79
7	2.26	2.32	2.37	2.49	2.61	2.86	3.46
	3.52	3.51	3.65	3.68	3.90	4.07	4.60
8	2.30	2.40	2.47	2.60	2.85	3.38	4.46	6.53
	3.51	3.68	3.65	3.82	4.09	4.68	5.70	7.99
9	2.40	2.52	2.61	2.88	3.37	4.38	6.50	10.81	19.31
	3.75	3.61	3.78	4.01	4.62	5.60	7.78	12.18	21.03
10	2.67	2.77	2.95	3.45	4.43	6.65	11.28	20.03	37.02	72.40
	3.87	3.98	4.15	4.61	5.72	7.69	12.16	21.04	38.43	72.37
COI	1	2	3	4	5	6	7	8	9	10

The data were simulated from 100 multiallelic loci (the number of alleles at each locus distributed uniformly between 2 and 20). M = 1, each COI combination displays times for using Newton’s method (unshaded) and calculating likelihood for a grid (shaded). Precision was set at 0.001 for both methods.

Table 2.

Execution times (in seconds) for a pair of infections with COI of 6 and 9.

M	1	2	3	4	5	6
r1=⋯=rM constraint	0.01	0.67	0.79	0.86	0.89	0.92
No constraint	0.01	1.32	2.17	3.72	6.65	11.54

M ranges between 1 and 6, and the likelihood is calculated over a grid of 1,000 combinations (or 1,000 r values when is assumed).

Execution times (in seconds) for Dcifer processing 1,000 pairs of samples with COI combinations ranging between 1 and 10. The data were simulated from 100 multiallelic loci (the number of alleles at each locus distributed uniformly between 2 and 20). M = 1, each COI combination displays times for using Newton’s method (unshaded) and calculating likelihood for a grid (shaded). Precision was set at 0.001 for both methods. Execution times (in seconds) for a pair of infections with COI of 6 and 9. M ranges between 1 and 6, and the likelihood is calculated over a grid of 1,000 combinations (or 1,000 r values when is assumed). For data analysis presented in Applications to empirical data, Dcifer execution times were 2.7 s for Mozambique (pairwise distances for 52 samples, or 1,326 comparisons) and 32 min for Namibia (2,585 samples, 3,339,820 comparisons). Consequently, we can estimate that it would take approximately 17 min to calculate all pairwise distances for 1,000 samples of Mozambique-type data (87 microhaplotypes). Calculations were performed on MacBook Pro, 2019, 2.3 GHz Intel Core i9 and were not parallelized.

Discussion

The ability to infer genetic distance between infections is a critical step in translating pathogen genetic data into insight regarding transmission. Despite this, there is a lack of established methods available to infer genetic distance between malaria infections containing multiple parasites, which are the majority in many endemic areas. Options are even more limited when using multiallelic loci, which offer more resolution than biallelic SNPs. The lack of any formal approach has left the community with only ad hoc calculations such as IBS, which yield ambiguous results and require extensive efforts to guide any attempt at meaningful inference. In contrast, Dcifer provides relatedness estimates that are based on IBD, interpretable quantities with consistent meaning across studies—regardless of genotyping methods used—with clear implications for ancestry. Importantly, we also show that Dcifer’s statistical power to detect related infections consistently surpasses that of IBS. The method produces reliable measures of uncertainty, and inference obtained from Dcifer vs IBS is more robust to misspecifications of estimated quantities such as COI or population allele frequencies. The R software package implementing Dcifer provides a fast, convenient, and flexible tool that can be easily incorporated into the analysis stream of a wide range of genotyping data to understand transmission. While Dcifer is designed to work with many types of genotyping data, e.g. biallelic SNPs, multiallelic loci such as microsatellites and microhaplotypes, and any combination thereof, we show that including multiallelic loci results in substantial gains in power to detect related infections. These benefits become more dramatic as COI increases. The result makes intuitive sense, as multiallelic panels provide more within-host strain differentiation and consequently can allow information pertaining to descent to be more easily detected, benefiting relatedness estimation. For example, where 2 infections with high COI may look similar with biallelic genotyping panels regardless of their level of relatedness (both alleles present at most loci with some diversity), having multiallelic data provides the opportunity to compare these infections more meaningfully. Fortunately, the greater availability of methods to obtain multiallelic data from across the genome makes it feasible to generate these data efficiently and in a high-throughput manner (Aydemir ; Tessema ; LaVerriere ). The concept of relatedness for individual parasites does not extend trivially to polyclonal infections, where strains within and between infections can be related. Dcifer offers an approach that focuses on relatedness between infections as this information is very relevant to transmission. Simulations with imposed intrahost relatedness indicate that the working model achieves its stated goal of capturing interhost relatedness by implicitly downweighting the independent contribution of related strains within a host to comparisons between hosts. When multiple pairs of strains are related between 2 infections, degrees of relatedness between these strains could potentially provide an insight into more nuanced aspects of transmission. However, these strain-level comparisons may be difficult to estimate without accurate phasing, which remains a challenging problem. Here, we proposed a more easily identifiable summary, rtotal, which provides a single measure of the overall degree of relatedness between 2 infections. This summary reliably encapsulates information encoded by the number of related strains and degree of relatedness between them even when individual r values are difficult to identify from realistic data—which was the case in our analyses except in situations when a large number of highly informative loci were used. For example, an rtotal of 1.5 could indicate, amongst other scenarios, that 3 pairs of siblings or one clonal pair and one sibling pair are present between 2 infections. Either way, multiple closely related parasites exist between these 2 infections and they are likely to be closely linked by transmission events, e.g. via a single cotransmission event or multiple independent transmissions. We anticipate that the interpretation of this summary and its potential derivatives, as well as its incorporation into downstream analyses, will evolve as it is evaluated in more sophisticated population-level simulations and estimates from empirical data sets. Another conceptual issue concerns population allele frequencies, which can affect Dcifer estimates. The foremost question is what constitutes the relevant source population in regards to relatedness between samples and consequently from which data the frequencies should be estimated. If 2 infections are from communities with different within-community allele frequencies, what are the implications for descent? Dcifer currently assumes the same allele frequencies for both samples but further exploration might be warranted depending on the question of interest. Questions concerning population and scope of the analysis might also arise in regards to potential multiple testing procedures when many pairwise relatedness hypotheses are tested simultaneously. The fact that these hypotheses are not independent should be taken into account when such procedures are considered. The Dcifer model does not account for linkage disequilibrium and assumes independence of loci. As the malaria genome has relatively short linkage disequilibrium segments, loci independence can be assumed up to a reasonably large number of loci for a correspondingly designed genotyping panel. If, however, the panel has loci that are likely to be linked, e.g. those selected to be in close proximity or for a large number of loci, the independence assumption would no longer hold, which could result in anticonservative inference. Another limitation is that the model currently does not account for genotyping errors. Future modification could explicitly incorporate the error process via an appropriate model or assess how a specific error process affects the estimates and inference beyond the explorations we have performed here. Another potential venue for further work is developing an MLE estimator for rtotal directly as this might become a commonly used summary. A direct estimator might be more efficient, would have the properties of MLE, and would require less processing time. In addition, scaling rtotal by some function of COI could provide a useful way of comparing all pairwise relatedness estimates with standardized values in a [0,1] interval. Other future directions could explore alternative inferential approaches, including a nonparametric bootstrap, where loci data would be sampled with replacement. In that case, the fact that variables associated with different loci are not identically distributed, and therefore loci might not be equally informative, would need to be addressed. With potential to facilitate understanding of relatedness structure from unphased genetic data, including multiallelic loci, Dcifer can provide a vital link in the analytical process leading to better understanding of malaria transmission dynamics. While we have demonstrated the utility of this method for Plasmodium infections here, Dcifer may be useful in analyses of other organisms that undergo sexual recombination and where polyclonal infections are encountered, such as shistosomiasis, filarial disease, and soil transmitted helminths (Brouwer ; Churcher ). With the ability to incorporate most types of genetic data, rapid computation, and readily available inference, Dcifer may prove to be an important tool in the analytical toolbox for obtaining epidemiologic insight from pathogen genetics. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file.

22 in total

1. The origins and relatedness structure of mixed infections vary with local prevalence of P. falciparum malaria.

Authors: Sha Joe Zhu; Jason A Hendry; Jacob Almagro-Garcia; Richard D Pearson; Roberto Amato; Alistair Miles; Daniel J Weiss; Tim Cd Lucas; Michele Nguyen; Peter W Gething; Dominic Kwiatkowski; Gil McVean
Journal: Elife Date: 2019-07-12 Impact factor: 8.140

2. Multiplicity and diversity of Plasmodium vivax infections in a highly endemic region in Papua New Guinea.

Authors: Cristian Koepfli; Amanda Ross; Benson Kiniboro; Thomas A Smith; Peter A Zimmerman; Peter Siba; Ivo Mueller; Ingrid Felger
Journal: PLoS Negl Trop Dis Date: 2011-12-20

3. Identity-by-descent analyses for measuring population dynamics and selection in recombining pathogens.

Authors: Lyndal Henden; Stuart Lee; Ivo Mueller; Alyssa Barry; Melanie Bahlo
Journal: PLoS Genet Date: 2018-05-23 Impact factor: 5.917

4. Plasmodium vivax and Plasmodium falciparum infection dynamics: re-infections, recrudescences and relapses.

Authors: Michael T White; Stephan Karl; Cristian Koepfli; Rhea J Longley; Natalie E Hofmann; Rahel Wampfler; Ingrid Felger; Tom Smith; Wang Nguitragool; Jetsumon Sattabongkot; Leanne Robinson; Azra Ghani; Ivo Mueller
Journal: Malar J Date: 2018-04-17 Impact factor: 3.469

5. Mapping imported malaria in Bangladesh using parasite genetic and human mobility data.

Authors: Hsiao-Han Chang; Amy Wesolowski; Richard J Maude; Caroline Buckee; Ipsita Sinha; Christopher G Jacob; Ayesha Mahmud; Didar Uddin; Sazid Ibna Zaman; Md Amir Hossain; M Abul Faiz; Aniruddha Ghose; Abdullah Abu Sayeed; M Ridwanur Rahman; Akramul Islam; Mohammad Jahirul Karim; M Kamar Rezwan; Abul Khair Mohammad Shamsuzzaman; Sanya Tahmina Jhora; M M Aktaruzzaman; Eleanor Drury; Sonia Gonçalves; Mihir Kekre; Mehul Dhorda; Ranitha Vongpromek; Olivo Miotto; Kenth Engø-Monsen; Dominic Kwiatkowski
Journal: Elife Date: 2019-04-02 Impact factor: 8.140

6. Estimating Relatedness Between Malaria Parasites.

Authors: Aimee R Taylor; Pierre E Jacob; Daniel E Neafsey; Caroline O Buckee
Journal: Genetics Date: 2019-06-17 Impact factor: 4.562

7. Using parasite genetic and human mobility data to infer local and cross-border malaria connectivity in Southern Africa.

Authors: Sofonias Tessema; Amy Wesolowski; Anna Chen; Maxwell Murphy; Jordan Wilheim; Anna-Rosa Mupiri; Nick W Ruktanonchai; Victor A Alegana; Andrew J Tatem; Munyaradzi Tambo; Bradley Didier; Justin M Cohen; Adam Bennett; Hugh Jw Sturrock; Roland Gosling; Michelle S Hsiang; David L Smith; Davis R Mumbengegwi; Jennifer L Smith; Bryan Greenhouse
Journal: Elife Date: 2019-04-02 Impact factor: 8.140

8. High Genetic Diversity of Plasmodium falciparum in the Low-Transmission Setting of the Kingdom of Eswatini.

Authors: Michelle E Roh; Sofonias K Tessema; Maxwell Murphy; Nomcebo Nhlabathi; Nomcebo Mkhonta; Sibonakaliso Vilakati; Nyasatu Ntshalintshali; Manik Saini; Gugu Maphalala; Anna Chen; Jordan Wilheim; Lisa Prach; Roly Gosling; Simon Kunene; Michelle S Hsiang; Bryan Greenhouse
Journal: J Infect Dis Date: 2019-09-13 Impact factor: 5.226

9. Quantifying connectivity between local Plasmodium falciparum malaria parasite populations using identity by descent.

Authors: Aimee R Taylor; Stephen F Schaffner; Gustavo C Cerqueira; Standwell C Nkhoma; Timothy J C Anderson; Kanlaya Sriprawat; Aung Pyae Phyo; François Nosten; Daniel E Neafsey; Caroline O Buckee
Journal: PLoS Genet Date: 2017-10-27 Impact factor: 5.917

10. Sensitive, Highly Multiplexed Sequencing of Microhaplotypes From the Plasmodium falciparum Heterozygome.

Authors: Sofonias K Tessema; Nicholas J Hathaway; Noam B Teyssier; Maxwell Murphy; Anna Chen; Ozkan Aydemir; Elias M Duarte; Wilson Simone; James Colborn; Francisco Saute; Emily Crawford; Pedro Aide; Jeffrey A Bailey; Bryan Greenhouse
Journal: J Infect Dis Date: 2022-04-01 Impact factor: 5.226