
An Unbiased Estimator of Gene Diversity with Improved Variance for Samples Containing Related and Inbred Individuals of any Ploidy.

Alexandre M Harris^1,2, Michael DeGiorgio^3,4.

Abstract

Gene diversity, or expected heterozygosity ($H$), is a common statistic for assessing genetic variation within populations. Estimation of this statistic decreases in accuracy and precision when individuals are related or inbred, due to increased dependence among allele copies in the sample. The original unbiased estimator of expected heterozygosity underestimates true population diversity in samples containing relatives, as it only accounts for sample size. More recently, a general unbiased estimator of expected heterozygosity was developed that explicitly accounts for related and inbred individuals in samples. Though unbiased, this estimator's variance is greater than that of the original estimator. To address this issue, we introduce a general unbiased estimator of gene diversity for samples containing related or inbred individuals, which employs the best linear unbiased estimator of allele frequencies, rather than the commonly used sample proportion. We examine the properties of this estimator, $\tilde{H}_{\text{BLUE}}$, relative to alternative estimators using simulations and theoretical predictions, and show that it predominantly has the smallest mean squared error relative to others. Further, we empirically assess the performance of $\tilde{H}_{\text{BLUE}}$ on a global human microsatellite dataset of 5795 individuals, from 267 populations, genotyped at 645 loci. Additionally, we show that the improved variance of $\tilde{H}_{\text{BLUE}}$ leads to improved estimates of the population differentiation statistic, $F_{ST}$, which employs measures of gene diversity within its calculation. Finally, we provide an R script, BestHet, to compute this estimator from genomic and pedigree data.


Keywords: expected heterozygosity; identity state; inbreeding; locus-specific branch length; relatedness


Year: 2017 PMID: 28040781 PMCID: PMC5295611 DOI: 10.1534/g3.116.037168

Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154

The gene diversity of a locus, also known as its expected heterozygosity ($H$), is a fundamental measure of genetic variation in a population, and describes the proportion of heterozygous genotypes expected under Hardy-Weinberg equilibrium (Nei 1973). Formally, gene diversity is the probability that a pair of randomly sampled allele copies from a population are different, and is computed as

$$H = 1 - \sum_{i=1}^{I} p_i^2, \qquad (1)$$

where $I$ is the number of distinct alleles at a locus, and $p_i$ ($\sum_{i=1}^{I} p_i = 1$) is the frequency of allele $i$ in the population. For a sample without related or inbred individuals composed of $n$ allele copies, an unbiased estimator of expected heterozygosity is (Nei and Roychoudhury 1974)

$$\hat{H} = \frac{n}{n-1}\left(1 - \sum_{i=1}^{I} \hat{p}_i^2\right), \qquad (2)$$

where $\hat{p}_i$ is the sample proportion of allele $i$. $\hat{H}$ is a biased estimator when inbred or related individuals are included in the sample (DeGiorgio and Rosenberg 2009). This result is based on the idea that, as the proportion of related individuals in the sample increases, the number of independent allele observations decreases. When two alleles are drawn from a sample, one each from a pair of related individuals, there is a nonzero probability that they will be identical by descent (IBD), rather than just identical by state (Lange 2002). This IBD probability is known as the kinship coefficient, and is denoted by $\Phi_{jk}$ for a pair of individuals $j$ and $k$. Thus, the observed diversity will be lower than the true value because a greater proportion of identical alleles are observed than for a sample in which there are no related individuals. DeGiorgio et al. developed an estimator of expected heterozygosity,

$$\tilde{H} = \frac{1}{1 - \bar{\Phi}_2}\left(1 - \sum_{i=1}^{I} \hat{p}_i^2\right), \qquad (3)$$

which is unbiased for samples containing related and inbred individuals of any ploidy, and employs a weighted mean kinship coefficient as a bias correction factor. $\bar{\Phi}_2$ is the average of all kinship coefficients for every pair of individuals within the sample (see Methods).

Further, DeGiorgio et al. derived the theoretical variance of $\tilde{H}$, as well as its approximate value for samples wherein individuals are related to no more than one other sampled individual. As an alternative to the sample proportion ($\hat{p}_i$), McPeek et al. introduced the best linear unbiased estimator (BLUE, denoted as $\tilde{p}_i$) of population allele frequency, which is an unbiased linear estimator with smaller variance than the unbiased linear estimator $\hat{p}_i$. The BLUE incorporates the relatedness of individuals in the sample as a covariance matrix to define the weight of each observation. Simulations and analytical evaluation corroborating their result suggest that the mean squared error (MSE) of $\tilde{p}_i$ is always smaller than that of $\hat{p}_i$, and this difference is especially evident for samples with complex pedigrees. Because $\tilde{p}_i$ has the smallest variance of any unbiased linear estimator of allele frequencies, we expect its low variance to translate to smaller variance of gene diversity statistics that use $\tilde{p}_i$. We developed such a statistic, termed $\tilde{H}_{\text{BLUE}}$, that is an unbiased estimator of expected heterozygosity in samples containing related and inbred individuals of arbitrary ploidy. Through simulations, analytical predictions, and empirical assessments, we compare the performance of $\tilde{H}_{\text{BLUE}}$ to that of $\tilde{H}$ and $\hat{H}$ for samples containing related individuals of various types across different ploidy and inbreeding status. Additionally, we derive the variance of any measure of expected heterozygosity that uses unbiased linear estimators of allele frequencies. We find that the increased precision of allele frequency estimates transfers to our unbiased estimator, yielding values for MSE invariably equal to or smaller than those of $\tilde{H}$, while occasionally exceeding the precision of $\hat{H}$. The improved properties of $\tilde{H}_{\text{BLUE}}$ translate to its applications as well, which we demonstrate in the calculation of the population differentiation statistic, $F_{ST}$ (Wright 1951).

$F_{ST}$ can be written in terms of intrapopulation and interpopulation gene diversity as (Hudson et al.)

$$F_{ST} = 1 - \frac{H_1 + H_2}{2H_{12}}, \qquad (4)$$

where $H_1$ and $H_2$ are the values of expected heterozygosity within each of two compared populations, and $H_{12}$ is the expected heterozygosity between them.
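As a small illustration of the two classical quantities introduced above, the following Python sketch computes the population gene diversity from known allele frequencies and the sample-size-corrected estimator of Nei and Roychoudhury from a list of sampled allele copies. It assumes the standard forms $H = 1 - \sum_i p_i^2$ and $\frac{n}{n-1}(1 - \sum_i \hat{p}_i^2)$; the function names are hypothetical and are not taken from the BestHet script.

```python
from collections import Counter


def gene_diversity(freqs):
    """Population gene diversity, assumed form H = 1 - sum_i p_i^2."""
    return 1.0 - sum(p * p for p in freqs)


def nei_roychoudhury(allele_copies):
    """Assumed form n/(n-1) * (1 - sum_i phat_i^2) for n sampled allele copies.

    Unbiased only when the copies come from unrelated, non-inbred individuals.
    """
    n = len(allele_copies)
    if n < 2:
        raise ValueError("need at least two allele copies")
    counts = Counter(allele_copies)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Two equally frequent alleles in the population.
h_true = gene_diversity([0.5, 0.5])
# Four sampled allele copies: two of allele "A" and two of allele "B".
h_hat = nei_roychoudhury(["A", "A", "B", "B"])
```

The correction factor $n/(n-1)$ only compensates for finite sample size, which is why this estimator falls short once allele copies are dependent through relatedness or inbreeding.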

Methods

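Consider a locus with $I$ distinct alleles in a sample of $n$ individuals. Let $X_k^{(i)}$ denote the fraction of alleles at the locus in individual $k$ that are of type $i$, $i = 1, 2, \ldots, I$. An unbiased linear estimator of population allele frequencies $p_i$, denoted by $\overset{\frown}{p}_i$, is defined as

$$\overset{\frown}{p}_i = \sum_{k=1}^{n} w_k X_k^{(i)}, \qquad (5)$$

where $w_k$ is the weight of individual $k$, and $\sum_{k=1}^{n} w_k = 1$. Formally, we have that

$$X_k^{(i)} = \frac{1}{m_k} \sum_{t=1}^{m_k} A_{kt}^{(i)}, \qquad (6)$$

where $A_{kt}^{(i)}$ is an indicator random variable whose value is 1 if allele $t$ of individual $k$ is of type $i$, and zero otherwise, and where $m_k$ is the ploidy of individual $k$. As an example, if individual $k$ were diploid at the locus, then $m_k = 2$. Taking the expectation of $\overset{\frown}{p}_i$ shows that it is an unbiased estimator of $p_i$.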
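The weighted estimator described above can be sketched in Python as follows. This is a minimal illustration under the assumption that the estimator is the weighted sum, over individuals, of each individual's within-genotype allele fractions, with weights summing to one; the function names and the example genotypes are hypothetical.

```python
def allele_fractions(genotype, alleles):
    """Fraction of an individual's allele copies that are of each allele type."""
    ploidy = len(genotype)
    return [genotype.count(a) / ploidy for a in alleles]


def linear_frequency_estimate(genotypes, weights, alleles):
    """Weighted sum of per-individual allele fractions (weights must sum to 1)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    estimate = [0.0] * len(alleles)
    for genotype, weight in zip(genotypes, weights):
        for i, x in enumerate(allele_fractions(genotype, alleles)):
            estimate[i] += weight * x
    return estimate


# One diploid heterozygote, one haploid, one diploid homozygote.
genotypes = [("A", "B"), ("A",), ("B", "B")]
ploidies = [len(g) for g in genotypes]
# Weighting each individual by its share of allele copies recovers
# the ordinary sample proportion (2 copies of A and 3 of B out of 5).
weights = [m / sum(ploidies) for m in ploidies]
p_hat = linear_frequency_estimate(genotypes, weights, ["A", "B"])
```

Any other set of weights summing to one also gives an unbiased estimate; the choice of weights only changes the variance.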

Unbiased estimation of gene diversity using unbiased linear estimators of allele frequencies

In this section, we construct an unbiased estimator, $\overset{\frown}{H}$, of expected heterozygosity that uses a general unbiased linear estimator, $\overset{\frown}{p}_i$, of allele frequency (Proposition 1). We then show that the unbiased estimator, $\tilde{H}$, of DeGiorgio et al. follows as a corollary, assuming that $\overset{\frown}{p}_i = \hat{p}_i$, the sample proportion allele frequency estimator (Corollary 2). We then derive a new estimator, $\tilde{H}_{\text{BLUE}}$, also as a corollary, assuming that $\overset{\frown}{p}_i = \tilde{p}_i$, the BLUE of allele frequency (Corollary 3).

Proposition 1:

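Consider a locus with $I$ distinct alleles and parametric allele frequencies $p_i \in [0, 1]$, $i = 1, 2, \ldots, I$, and $\sum_{i=1}^{I} p_i = 1$. For a sample of size $n$ individuals of any ploidy, inbreeding status, and relatedness,

$$\overset{\frown}{H} = \frac{1}{1 - \bar{\Phi}_2}\left(1 - \sum_{i=1}^{I} \overset{\frown}{p}_i^2\right)$$

is an unbiased estimator of expected heterozygosity, where

$$\bar{\Phi}_2 = \sum_{j=1}^{n} \sum_{k=1}^{n} w_j w_k \Phi_{jk}$$

is a weighted mean kinship coefficient of the sample for all pairs of individuals in the sample, and where $w_k$ is the weight for individual $k$. The proof of Proposition 1 is found in the Appendix.

From the sample proportion estimator of allele frequency $i$, $\tilde{H}$ is recovered when $w_k = m_k / \sum_{j=1}^{n} m_j$ for individual $k$ of ploidy $m_k$, leading to

$$\hat{p}_i = \sum_{k=1}^{n} \frac{m_k}{\sum_{j=1}^{n} m_j} X_k^{(i)}.$$

Here, each individual is weighted by its contribution to the number of allele copies in the sample.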
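The bias correction in Proposition 1 can be illustrated with a short Python sketch. It assumes the estimator has the form $(1 - \sum_i \overset{\frown}{p}_i^2)/(1 - \bar{\Phi}_2)$ with $\bar{\Phi}_2 = \sum_j \sum_k w_j w_k \Phi_{jk}$, and that the diagonal of the kinship matrix holds each individual's self-kinship (1/2 for an outbred diploid, 1 for a haploid). Function names are hypothetical.

```python
def weighted_mean_kinship(weights, kinship):
    """Sum over all ordered pairs (j, k) of w_j * w_k * Phi_jk."""
    n = len(weights)
    return sum(weights[j] * weights[k] * kinship[j][k]
               for j in range(n) for k in range(n))


def corrected_heterozygosity(freq_estimate, weights, kinship):
    """(1 - sum_i p_i^2) / (1 - mean kinship): assumed form of the estimator."""
    phi2 = weighted_mean_kinship(weights, kinship)
    return (1.0 - sum(p * p for p in freq_estimate)) / (1.0 - phi2)


# Two unrelated outbred diploids (4 allele copies), equal weights.
unrelated = [[0.5, 0.0],
             [0.0, 0.5]]
h_unrelated = corrected_heterozygosity([0.5, 0.5], [0.5, 0.5], unrelated)

# The same two individuals treated as outbred full siblings (kinship 1/4).
siblings = [[0.5, 0.25],
            [0.25, 0.5]]
h_siblings = corrected_heterozygosity([0.5, 0.5], [0.5, 0.5], siblings)
```

For the unrelated pair the correction factor is $1/(1 - 1/4) = 4/3$, which equals $n/(n-1)$ for $n = 4$ allele copies, so the sketch reduces to the classical estimator. For the sibling pair the mean kinship rises to $3/8$ and the correction grows accordingly.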

Corollary 2:

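Consider a locus with $I$ distinct alleles and parametric allele frequencies $p_i \in [0, 1]$, $i = 1, 2, \ldots, I$, and $\sum_{i=1}^{I} p_i = 1$. For a sample of size $n$ individuals of any ploidy, inbreeding status, and relatedness,

$$\tilde{H} = \frac{1}{1 - \bar{\Phi}_2}\left(1 - \sum_{i=1}^{I} \hat{p}_i^2\right)$$

is an unbiased estimator of expected heterozygosity, where

$$\hat{p}_i = \sum_{k=1}^{n} \frac{m_k}{\sum_{j=1}^{n} m_j} X_k^{(i)}$$

is the sample proportion estimator of allele frequency $i$, where

$$\bar{\Phi}_2 = \sum_{j=1}^{n} \sum_{k=1}^{n} \frac{m_j m_k}{\left(\sum_{l=1}^{n} m_l\right)^2} \Phi_{jk}$$

is a weighted mean kinship coefficient of the sample for all pairs of individuals, and where $m_k$ is the ploidy for individual $k$. The proof of Corollary 2 is found in the Appendix.

It may be beneficial to apply an unbiased linear estimator of allele frequencies that has minimum variance. McPeek et al. introduced the BLUE of allele frequencies, which we formally define here. We will use the BLUE of allele frequencies to construct a new unbiased estimator of gene diversity that would ideally have improved variance over other estimators. Let $\mathbf{K}$ be an $n \times n$ symmetric matrix of kinship coefficients, with $K_{jk} = \Phi_{jk}$. The BLUE ($\tilde{p}_i$) of allele frequency $p_i$ is obtained when $\mathbf{w} = (\mathbf{1}^T \mathbf{K}^{-1} \mathbf{1})^{-1} \mathbf{K}^{-1} \mathbf{1}$, yielding

$$\tilde{p}_i = (\mathbf{1}^T \mathbf{K}^{-1} \mathbf{1})^{-1} \mathbf{1}^T \mathbf{K}^{-1} \mathbf{X}^{(i)},$$

where $\mathbf{X}^{(i)} = (X_1^{(i)}, \ldots, X_n^{(i)})^T$, $\mathbf{K}^{-1}$ denotes the inverse matrix of $\mathbf{K}$, $\mathbf{1}$ is a column vector of $n$ elements with all entries equal to 1, and $\mathbf{1}^T$ is the transpose of $\mathbf{1}$.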
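To make the BLUE weighting concrete, the sketch below computes weights proportional to $\mathbf{K}^{-1}\mathbf{1}$, normalized to sum to one, for a small kinship matrix. It is an illustration under the assumption that the BLUE weights take this form and that the matrix diagonal holds self-kinship values; it uses exact rational arithmetic from the Python standard library rather than any routine from the BestHet script, and all names are hypothetical.

```python
from fractions import Fraction


def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with exact fractions."""
    n = len(matrix)
    aug = [[Fraction(x) for x in row] + [Fraction(b)]
           for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("kinship matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def blue_weights(kinship):
    """Weights proportional to K^{-1} 1, scaled so that they sum to 1."""
    v = solve_linear(kinship, [1] * len(kinship))
    total = sum(v)  # equals 1^T K^{-1} 1
    return [x / total for x in v]


def weighted_mean_kinship(weights, kinship):
    """Sum over all ordered pairs (j, k) of w_j * w_k * Phi_jk."""
    n = len(weights)
    return sum(weights[j] * weights[k] * Fraction(kinship[j][k])
               for j in range(n) for k in range(n))


# Individuals 0 and 1 are outbred full siblings; individual 2 is unrelated.
kinship = [[0.5, 0.25, 0.0],
           [0.25, 0.5, 0.0],
           [0.0, 0.0, 0.5]]
weights = blue_weights(kinship)
phi2_blue = weighted_mean_kinship(weights, kinship)
phi2_equal = weighted_mean_kinship([Fraction(1, 3)] * 3, kinship)
```

In this example the unrelated individual receives weight 3/7 and each sibling 2/7, so the two partly redundant observations are down-weighted. The resulting mean kinship (3/14) is smaller than under equal weights (2/9), which is the mechanism by which the BLUE is expected to reduce variance.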

Corollary 3:

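Consider a locus with $I$ distinct alleles, and parametric allele frequencies $p_i \in [0, 1]$, $i = 1, 2, \ldots, I$, and $\sum_{i=1}^{I} p_i = 1$. For a sample of size $n$ individuals of any ploidy, inbreeding status, and relatedness,

$$\tilde{H}_{\text{BLUE}} = \frac{1}{1 - \bar{\Phi}_2}\left(1 - \sum_{i=1}^{I} \tilde{p}_i^2\right)$$

is an unbiased estimator of expected heterozygosity, where

$$\tilde{p}_i = (\mathbf{1}^T \mathbf{K}^{-1} \mathbf{1})^{-1} \mathbf{1}^T \mathbf{K}^{-1} \mathbf{X}^{(i)}$$

is the BLUE of allele frequencies, and where

$$\bar{\Phi}_2 = \mathbf{w}^T \mathbf{K} \mathbf{w} = (\mathbf{1}^T \mathbf{K}^{-1} \mathbf{1})^{-1},$$

with $\mathbf{K}$ the kinship matrix and $\mathbf{w} = (\mathbf{1}^T \mathbf{K}^{-1} \mathbf{1})^{-1} \mathbf{K}^{-1} \mathbf{1}$ the vector of BLUE weights, is a weighted mean kinship coefficient of the sample for all pairs of individuals. The proof of Corollary 3 is found in the Appendix.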

Variance of H estimators using unbiased linear estimators of allele frequencies

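We now derive the equation (Proposition 4) describing the variance of the unbiased estimator $\overset{\frown}{H}$, which takes $\overset{\frown}{p}_i$ as the unbiased linear estimate of population allele frequency $p_i$. This value depends on the weighted mean kinship coefficients of the sample for all pairs, trios, quartets, and pairs of pairs of individuals in the sample, defined as

$$\bar{\Phi}_2 = \sum_{j} \sum_{k} w_j w_k \Phi_{jk}, \qquad \bar{\Phi}_3 = \sum_{j} \sum_{k} \sum_{\ell} w_j w_k w_\ell \Phi_{jk\ell},$$

$$\bar{\Phi}_4 = \sum_{j} \sum_{k} \sum_{\ell} \sum_{q} w_j w_k w_\ell w_q \Phi_{jk\ell q}, \qquad \bar{\Phi}_{2,2} = \sum_{j} \sum_{k} \sum_{\ell} \sum_{q} w_j w_k w_\ell w_q \Phi_{jk,\ell q}.$$

Here, $\Phi_{jk\ell}$ is the probability that three randomly sampled alleles, one each from individuals $j$, $k$, and $\ell$, are IBD. $\Phi_{jk\ell q}$ is the probability that four randomly sampled alleles, one each from individuals $j$, $k$, $\ell$, and $q$, are IBD. Finally, $\Phi_{jk,\ell q}$ is the joint probability that two randomly sampled alleles, one each from individuals $j$ and $k$, are IBD, and two randomly sampled alleles, one each from individuals $\ell$ and $q$, are IBD. Note that individuals $j$, $k$, $\ell$, and $q$ are not necessarily distinct. The variances of $\tilde{H}$ and of $\tilde{H}_{\text{BLUE}}$ follow as Corollaries 7 and 8, once again differing only in the weight of a sampled individual in the mean kinship coefficient calculation.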

Proposition 4:

Consider a locus with I distinct alleles and parametric allele frequencies and For a sample of size n individuals of any ploidy, inbreeding status, and relatedness,is the variance of the unbiased estimator of expected heterozygosity where is a weighted mean kinship coefficient of the sample, and where for is the weight of individual k. Further, we haveThe proof of Equation 10 is presented for the specific case of in Appendix B of DeGiorgio , where is substituted for and and and coefficients are substituted for and coefficients, respectively. We provide an abbreviated version of this proof for the general case in the Appendix. Further, the approximate value of Equation 10 for samples wherein no individual is related to more than one other isFor this simplifying case, the terms and are negligible compared to In the Appendix, we reintroduce the definition of from DeGiorgio (Corollary 7), and then define (Corollary 8), both of which take the form illustrated in Proposition 4. As demonstrated by DeGiorgio , the mean kinship coefficients composing Equation 10 derive from the relationship between the 15 identity states available to four alleles (Gillois 1965; Cockerham 1971), and the coefficients of kinship between pairs, trios, quartets, and pairs of pairs of alleles within those four.

Bias of $\hat{H}$ for samples containing related or inbred individuals

Here, we briefly derive an equation (Equation 12) within Proposition 5 that describes the bias of $\hat{H}$, which we display in the left panels of Supplemental Material, Figure S1A and Figure S2A. We include Corollaries 9 and 10 to Proposition 5 within the Appendix for specific cases of bias derived from $\hat{p}_i$-based and $\tilde{p}_i$-based estimations, respectively. We also note that Equation A10 of Corollary 9 represents the form of the bias typically encountered in applications of $\hat{H}$, as well as in all of our experimental scenarios.

Proposition 5:

Consider a locus with I distinct alleles and parametric allele frequencies and For a sample of n possibly related or inbred individuals, the bias of the estimator of expected heterozygosity changes with the true locus expected heterozygosity such thatwhere

Proof:

We begin by substituting Equation 6 into Equation 13 such thatandFrom the definition of bias,

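Variance of $F_{ST}$ estimators using unbiased linear estimators of allele frequencies

Because the population differentiation statistic $F_{ST}$ (Wright 1951) can be defined in terms of expected heterozygosities, it is possible to theoretically evaluate its approximate variance. A general estimator of $F_{ST}$ can be written as

$$\overset{\frown}{F}_{ST} = 1 - \frac{\overset{\frown}{H}_1 + \overset{\frown}{H}_2}{2\overset{\frown}{H}_{12}},$$

where $\overset{\frown}{H}_{12}$ is an unbiased estimator for the expected heterozygosity between a pair of sampled populations, numbered 1 and 2, defined as $\overset{\frown}{H}_{12} = 1 - \sum_{i=1}^{I} \overset{\frown}{p}_{1i} \overset{\frown}{p}_{2i}$ (where $\overset{\frown}{p}_{2i}$ is a linear unbiased estimator of the frequency of allele $i$ in population 2, analogous to $\overset{\frown}{p}_{1i}$ in population 1), while $\overset{\frown}{H}_1$ and $\overset{\frown}{H}_2$ are the within-population expected heterozygosities for populations 1 and 2, respectively. Referring to the numerator as $x$, and the denominator as $y$, we can write the expression for an approximation of the variance of a ratio as

$$\mathrm{Var}\left(\frac{x}{y}\right) \approx \left(\frac{\mathbb{E}[x]}{\mathbb{E}[y]}\right)^2 \left[\frac{\mathrm{Var}(x)}{\mathbb{E}[x]^2} + \frac{\mathrm{Var}(y)}{\mathbb{E}[y]^2} - \frac{2\,\mathrm{Cov}(x, y)}{\mathbb{E}[x]\,\mathbb{E}[y]}\right],$$

following the definition for the approximate variance of a ratio (Wolter 2007).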
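The following Python sketch illustrates the two ingredients of this section: an $F_{ST}$ value computed from within- and between-population diversities, and a first-order (delta-method) approximation to the variance of a ratio. It assumes the form $F_{ST} = 1 - (H_1 + H_2)/(2H_{12})$ with $H_{12} = 1 - \sum_i p_{1i} p_{2i}$, and the standard Taylor-series ratio approximation; function names are hypothetical, and the moments passed to `ratio_variance` would in practice come from the variance and covariance expressions derived in the Appendix.

```python
def gene_diversity(freqs):
    """Within-population diversity, 1 - sum_i p_i^2."""
    return 1.0 - sum(p * p for p in freqs)


def between_population_diversity(freqs1, freqs2):
    """Probability that one allele from each population differ (assumed form)."""
    return 1.0 - sum(a * b for a, b in zip(freqs1, freqs2))


def fst_from_diversities(h1, h2, h12):
    """Assumed form F_ST = 1 - (H1 + H2) / (2 * H12)."""
    return 1.0 - (h1 + h2) / (2.0 * h12)


def ratio_variance(mean_x, mean_y, var_x, var_y, cov_xy):
    """First-order approximation to Var(x / y)."""
    ratio = mean_x / mean_y
    return ratio ** 2 * (var_x / mean_x ** 2
                         + var_y / mean_y ** 2
                         - 2.0 * cov_xy / (mean_x * mean_y))


# Identical populations give F_ST = 0; populations fixed for
# different alleles give F_ST = 1.
same = [0.5, 0.5]
fst_same = fst_from_diversities(gene_diversity(same), gene_diversity(same),
                                between_population_diversity(same, same))
pop1, pop2 = [1.0, 0.0], [0.0, 1.0]
fst_fixed = fst_from_diversities(gene_diversity(pop1), gene_diversity(pop2),
                                 between_population_diversity(pop1, pop2))
```

Because $F_{ST}$ is one minus the ratio, its approximate variance equals the approximate variance of the ratio itself.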

Proposition 6:

Consider a locus with I distinct alleles across two populations and parametric allele frequencies and for population 1, and and for population 2. For samples of size and individuals from populations 1 and 2, respectively, each with individuals of any ploidy, inbreeding status, and relatedness, the variance of the population differentiation statistic calculated from their respective expected heterozygosities is approximated aswhereIn the Appendix, we provide a derivation of the variance and covariance components of Equations 16 and 17. For each of these equations, the result and proof are fairly long, and do not simplify when arranged into Equation 16.

Data availability

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

Results

Analytical validation of $\tilde{H}_{\text{BLUE}}$

We tested the performance of $\tilde{H}_{\text{BLUE}}$ using both theory and simulations against that of the unbiased estimator $\tilde{H}$ (DeGiorgio et al.), and of $\hat{H}$ (Nei and Roychoudhury 1974). Here, we applied the estimators to samples of individuals wherein each individual was related to exactly one other. Thus, for samples of size $n$ individuals, the number of relative pairs was $n/2$. When inbred or closely related individuals are included in a sample, the estimator of Nei and Roychoudhury is a biased estimator of gene diversity, for which we use the symbol $\hat{H}$. To construct an unbiased estimator with $\hat{H}$, we also applied $\hat{H}$ to a reduced sample in which one member of each relative pair was removed randomly for samples containing only diploid individuals, and the haploid member was removed for each haploid-diploid (i.e., male-female) pair (reduced sample size of $n/2$), and we denote this estimator by $\hat{H}_{\text{red}}$. To evaluate the performance of the four estimators ($\hat{H}$, $\hat{H}_{\text{red}}$, $\tilde{H}$, and $\tilde{H}_{\text{BLUE}}$), we modified the factors upon which their variance depends: true locus expected heterozygosity ($H$), sample size $n$, and relatedness of individuals within the sample ($\bar{\Phi}_2$).

Effect of true locus expected heterozygosity, H, on estimators

We first evaluated the theoretical bias, variance, and mean squared error (MSE) of each estimator across the 645 human microsatellite loci from across the genome in the composite dataset MS5795 of Pemberton et al., where MSE is the sum of the squared bias and variance. The data used in our analyses are freely available online within File S1 of Pemberton et al. (http://www.g3journal.org/content/early/2013/03/27/g3.113.005728/suppl/DC1). We took the sample allele frequencies calculated from all individuals in the MS5795 dataset as the true population allele frequencies for the variance calculations, and, from these, determined the true expected heterozygosity at each locus using Equation 1 (see File S1; incorporated into Equation A10). Here, each sample contained 60 diploid individuals composed of 10 inbred full-sibling, 10 outbred full-sibling, and 10 outbred avuncular pairs. Each point in Figure 1 and Figure S1 represents a single analytical computation for a sample of 60 (or 30 for $\hat{H}_{\text{red}}$) individuals at a microsatellite locus. We report the approximate variance and MSE because each individual is related to exactly one other in the sample, satisfying the assumption of Equation 11. Further, under this scenario, DeGiorgio et al. showed that this was a reasonable approximation of the exact variance.
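For intuition about the size of the bias in this 60-individual design, the sketch below computes the mean kinship under equal individual weights and the resulting relative bias of the sample-size-corrected estimator. Several quantities here are assumptions made for illustration rather than values quoted from the text: the relation $\mathbb{E}[1 - \sum_i \hat{p}_i^2] = (1 - \bar{\Phi}_2)H$, a self-kinship of 1/2 for outbred diploids and 5/8 for offspring of a brother-sister mating, and pairwise kinships of 3/8, 1/4, and 1/8 for inbred full-sibling, outbred full-sibling, and avuncular pairs. Function names are hypothetical.

```python
def mean_kinship_equal_weights(self_kinships, pair_kinships):
    """Mean kinship with weight 1/n per individual when each individual
    has at most one relative in the sample (each pair is counted twice)."""
    n = len(self_kinships)
    return (sum(self_kinships) + 2.0 * sum(pair_kinships)) / n ** 2


def relative_bias_uncorrected(n_copies, phi2):
    """(E[H_hat] - H) / H for the n/(n-1)-corrected estimator, assuming
    E[1 - sum_i phat_i^2] = (1 - phi2) * H."""
    return n_copies / (n_copies - 1) * (1.0 - phi2) - 1.0


def mean_squared_error(bias, variance):
    """MSE as the sum of squared bias and variance."""
    return bias ** 2 + variance


# 10 inbred full-sibling pairs, 10 outbred full-sibling pairs,
# 10 outbred avuncular pairs: 60 diploid individuals, 120 allele copies.
self_kinships = [5 / 8] * 20 + [1 / 2] * 40
pair_kinships = [3 / 8] * 10 + [1 / 4] * 10 + [1 / 8] * 10
phi2 = mean_kinship_equal_weights(self_kinships, pair_kinships)
rel_bias = relative_bias_uncorrected(120, phi2)

# Sanity check: 30 unrelated outbred diploids give no bias.
phi2_unrelated = mean_kinship_equal_weights([1 / 2] * 30, [])
rel_bias_unrelated = relative_bias_uncorrected(60, phi2_unrelated)
```

Under these assumptions the uncorrected estimator is expected to fall short of the true value by about 0.5% of $H$ for this design, so the absolute bias is linear in $H$ and the squared bias that enters the MSE is quadratic in $H$.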

Figure 1

Theoretical difference in MSE between the unbiased estimator $\hat{H}_{\text{red}}$ (left), $\tilde{H}$ (center), or $\tilde{H}_{\text{BLUE}}$ (right), and the biased estimator $\hat{H}$, calculated at each of 645 microsatellite loci in the MS5795 dataset for samples of 60 diploid individuals containing some inbred relative pairs. Each sampled individual was related to exactly one other, and samples contained 10 pairs of inbred full-siblings ($\Phi_{jk} = 3/8$), 10 pairs of outbred full-siblings ($\Phi_{jk} = 1/4$), and 10 outbred avuncular pairs ($\Phi_{jk} = 1/8$). Dotted lines in each plot correspond to a difference in MSE of zero with $\hat{H}$. See File S1 for the true expected heterozygosity values incorporated into analytical calculations.

We begin by demonstrating the relative performance of the unbiased estimators $\hat{H}_{\text{red}}$, $\tilde{H}$, and $\tilde{H}_{\text{BLUE}}$, measured in terms of MSE, against the biased estimator $\hat{H}$ (Figure 1). While the variance of $\hat{H}$ is invariably smaller than that of the other estimators, and the MSE and variance of each estimator decrease with increasing locus expected heterozygosity ($H$), $\hat{H}$ accumulates bias quadratically with increasing $H$, and thus yields an increasingly unreliable estimate with increasing site diversity (Figure S1A, left). However, the effect of this trend differs for each comparison. The MSE of $\hat{H}_{\text{red}}$ always exceeds that of $\hat{H}$ because the removal of relatives to create the reduced sample causes a substantial increase in estimator variance, though, for high diversity markers, the MSE values of $\hat{H}_{\text{red}}$ and $\hat{H}$ converge (Figure 1, left). In contrast, $\tilde{H}$ outperforms $\hat{H}$ for most loci, demonstrating that the rate of decrease in MSE with increasing $H$ is greater for $\tilde{H}$ than for $\hat{H}$ (Figure 1, center). Interestingly, the comparison of $\tilde{H}_{\text{BLUE}}$ with $\hat{H}$ shows an opposite trend to the preceding two. Despite the impact of bias, the decrease in variance of $\hat{H}$ over the analyzed range outpaces that of $\tilde{H}_{\text{BLUE}}$. Even so, $\tilde{H}_{\text{BLUE}}$ uniformly yields a smaller MSE for the analyzed diploid samples (which contain a proportion of inbred individuals) across all loci (Figure 1, right).
To validate these theoretical predictions, we simulated 30 independent genotypes for each locus, and, for each independent genotype, simulated a single relative’s genotype (inbred full-sibling, outbred full-sibling, or avuncular). Briefly, we generated the independent genotypes by sampling alleles uniformly at random from the distribution of allele frequencies at each microsatellite locus, and generated relatives by copying zero, one, or two alleles from the relative according to the probability the pair would share zero, one, or two alleles IBD [see Lange (2002), Chapter 5]. The patterns observed for the simulated data accord with those of the theoretical predictions (Figure S2, each point is based on simulations). It is clear from these results that locus expected heterozygosity is heavily influential on estimator MSE. However, we also find that the observed value of expected heterozygosity for a locus normalized to its range of expected heterozygosity values has an impact on estimator MSE. The maximum and minimum values of expected heterozygosity for a locus depend on the number of distinct alleles (I), and the frequency of the most frequent allele (M), at that locus [see Theorem 2 of Reddy and Rosenberg (2012)]. We quantify proximity of H for a locus to its maximum possible value as where D is the observed value of expected heterozygosity for a locus minus its minimum possible value given I and M, and R is the maximum minus the minimum value of expected heterozygosity, given I and M, such that Loci with a smaller value of B yield a smaller MSE for all estimators (Figure S3).
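The simulation scheme described above (draw an independent genotype from the locus allele frequencies, then build a relative by copying zero, one, or two alleles according to the pair's IBD-sharing probabilities) can be sketched as follows. This is an illustrative diploid-only version, not the authors' code: the function names, the example allele frequencies, and the IBD-sharing probabilities shown (full siblings 1/4, 1/2, 1/4; parent-offspring 0, 1, 0) are assumptions stated for the example.

```python
import random


def draw_allele(freqs, rng):
    """Draw one allele label according to population allele frequencies."""
    labels = list(freqs)
    return rng.choices(labels, weights=[freqs[a] for a in labels])[0]


def draw_genotype(freqs, rng):
    """Independent diploid genotype: two alleles drawn from the population."""
    return (draw_allele(freqs, rng), draw_allele(freqs, rng))


def simulate_relative(genotype, ibd_probs, freqs, rng):
    """Copy 0, 1, or 2 alleles IBD from `genotype`; draw the rest afresh.

    ibd_probs = (P[share 0 IBD], P[share 1 IBD], P[share 2 IBD]).
    """
    n_shared = rng.choices([0, 1, 2], weights=ibd_probs)[0]
    copied = rng.sample(list(genotype), n_shared)
    fresh = [draw_allele(freqs, rng) for _ in range(2 - n_shared)]
    return tuple(copied + fresh)


rng = random.Random(2017)                    # fixed seed for reproducibility
freqs = {"A": 0.2, "B": 0.3, "C": 0.5}       # hypothetical locus
FULL_SIB = (0.25, 0.5, 0.25)
PARENT_OFFSPRING = (0.0, 1.0, 0.0)

founder = draw_genotype(freqs, rng)
sibling = simulate_relative(founder, FULL_SIB, freqs, rng)
offspring = simulate_relative(founder, PARENT_OFFSPRING, freqs, rng)
```

Repeating this for 30 independent genotypes per locus, each with one simulated relative, yields samples with the one-relative-per-individual structure assumed in the theoretical calculations.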

Effect of sample size, n, on estimators

We next examined the properties of each estimator as a function of sample size. All estimators perform increasingly well for samples of increasing size. We demonstrate this property by measuring estimator MSE for samples containing 2–100 relative pairs of various type and ploidy at the D3S2427 locus, selected to highlight the improved performance of as the bias of increases ( Figure 2). For these tests, we considered only a single relative pair type at a time. The unbiased estimators and perform identically for diploid samples of first- and second-degree relative pairs regardless of inbreeding (Figure 2, A–D). Additionally, estimator MSE is uniformly smaller for samples containing only second-degree relative pairs than it is for samples containing only first-degree pairs (cf. Figure 2, A and B, and Figure 2, C and D; see also, Figure S4A). However, unambiguously outperforms the other estimators with relative pairs of varying ploidy (in this case, male-female full-sibling pairs at an X-linked locus). In this scenario, provides a more accurate estimate of expected heterozygosity than does when the reduced set is created by removing only males from the original while retaining females (Figure 2E). When all females are removed instead, and males retained (Figure 2F), the MSE of is markedly the largest of the four estimators because 2/3 of the alleles in the sample are discarded, rather than 1/3. For samples with inbred full-siblings whose parents are brother and sister (Figure 2, C and D), the trend of MSE with sample size mirrors that of outbred diploid samples (Figure 2, A and B), but with larger MSE. However, the relative performance of is notably worse for samples containing inbred diploid avuncular pairs (Figure 2D) than for samples containing outbred diploid avuncular pairs (Figure 2B). 
That is, its MSE remains greater than, or equal to, that of the other estimators over the range of sample sizes considered for the inbred diploid avuncular pair scenario (Figure 2D), but consistently has smaller MSE than for the outbred diploid avuncular pair scenario (Figure 2B). Generally, increasing the sample size is most effective for samples of <20 individuals, and it is over this range that the difference in performance of the estimators is most apparent.

Figure 2

Theoretical MSE as a function of sample size for samples of outbred diploid full-siblings (A), outbred diploid avuncular pairs (B), inbred diploid full-siblings (C), inbred diploid avuncular pairs (D), male-female full siblings at an X-linked locus with the reduced set omitting males and retaining females (E), and male-female full siblings at an X-linked locus with the reduced set omitting females and retaining males (F). The samples were evaluated for the D3S2427 locus (), and sample size was always twice the number of relative pairs included in the sample for samples containing 2–100 relative pairs. Each individual in the sample was related to exactly one other.

Effect of varying sample relative pair composition on estimators

Finally, we calculated the MSE of each estimator for all 1326 combinations of one to three relative pair types for samples of 100 individuals fixed at 50 relative pairs, which we represent as triangular heat maps, across samples containing outbred diploids, males and females at an X-linked locus, or inbred diploids (each individual related to exactly one other; Figure 3, Figure S4, Figure S5, Figure S6, Figure S7, and Figure S8). The kinship coefficients ($\Phi_{jk}$) for each relative pair type considered across our tests are defined in Lange (2002, Chapter 5) and DeGiorgio et al. (see Table 2), and modeled on the D3S2427 locus.
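The count of 1326 sample configurations follows from enumerating every way to split 50 relative pairs among three pair types, allowing a type to be absent. A short Python check (function name hypothetical):

```python
def pair_type_configurations(total_pairs):
    """All (a, b, c) with a + b + c = total_pairs and a, b, c >= 0."""
    return [(a, b, total_pairs - a - b)
            for a in range(total_pairs + 1)
            for b in range(total_pairs + 1 - a)]


configurations = pair_type_configurations(50)
n_configurations = len(configurations)

# The three vertices of the triangular heat map: a single pair type only.
vertices = [c for c in configurations if c.count(0) == 2]
```

Each tuple corresponds to one point of a triangular heat map, with the three single-type samples at the vertices.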

Figure 3

Table 2

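Wilcoxon signed-rank test for weighted mean across all loci of $\hat{F}_{ST,\text{red}}$ with $\hat{F}_{ST}$ and $\tilde{F}_{ST,\text{BLUE}}$ for the French population with the 92 other populations whose samples contained related individuals

Comparison	P-value for Wilcoxon signed-rank test
$\hat{F}_{ST,\text{red}}$ with $\hat{F}_{ST}$	$5.25 \times 10^{-15}$
$\hat{F}_{ST,\text{red}}$ with $\tilde{F}_{ST,\text{BLUE}}$	0.967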

Theoretical difference in MSE between (left), (center), or (right), and for samples of 100 (A) outbred diploid individuals, (B) male and female individuals at an X-linked locus, or (C) diploid individuals wherein some full siblings are inbred with brother-sister parents. The samples and MSE values considered for each subtraction were modeled on the D3S2427 locus (). Each sample contained 50 relative pairs, such that each individual was related to exactly one other. Each sample configuration is a single point in the space of a heat map defined by three coordinates (each representing the count of a relative pair type). For each configuration, the MSE of is subtracted from that of the other estimators, yielding a value >0. Samples were composed of one to three relative pair types where the vertex of each heat map represents a sample with only a single relative pair type. The relative pair types were (A) parent-offspring (PO), second-degree avuncular (AV), and full-sibling (FS), (B) male-male (MM), male-female (MF), and female-female (FF) full-sibling such that the number of males and females in each sample is not fixed, or (C) inbred full-sibling (FSi), second-degree avuncular (AV), and outbred full-sibling (FSo). Blue and black points indicate the smallest and largest values, respectively, on each map. Threshold values for coloration are indicated in the scales to the right of each heat map, with smaller values colored lighter. Note that the scales are not identical across heat maps. The values upon which these subtractions are based are represented as heat maps in (A) Figure S4A, (B) Figure S4B, or (C) Figure S4C. The outbred diploid samples included parent-offspring (), avuncular (), and full-sibling () relative pairs. 
Because parent-offspring and full-sibling pairs have the same kinship coefficient, the heat maps in Figure 3, Figure S4A, Figure S5A, Figure S6A, and Figure S7A are symmetrical with parent-offspring and full-sibling pairs on the bottom vertices, and avuncular pairs on the top vertex. yielded the largest MSE of the four estimators, and this value was constant throughout the space of the heat map (Figure S4A, second triangle), because all reduced sets are identical for outbred diploid samples. consistently yielded the smallest MSE across configurations (Figure S4A, fourth triangle). As was the case in Figure 2, the MSE of the estimators and was smallest for samples with only avuncular pairs, because these contain fewer dependent allele observations on average. We observed these features in simulated data as well (Figure S8A). Although performed best overall for samples including outbred diploid relative pairs at D3S2427, the estimator with the smallest variance in all situations is the biased estimator (Figure S6A). However, because its squared bias increases with the number of first-degree pairs (Figure S5A), its relative performance declines compared to as more of these pairs are sampled (Figure 3A, left triangle). The relative performance of is highest when the number of first degree pairs is maximized, but this is due to the decreasing performance of as more dependent observations are included (Figure 3A, center triangle). While the difference in MSE between and is always slight for samples of noninbred diploids, these values diverge as the complexity of the sample increases (Figure 3A, right triangle). That is, as the numbers of first- and second-degree pairs approach each other, emerges decisively as the more accurate estimator, with the maximum value of this difference reached at 23 second-degree and 27 first-degree pairs. 
Thus, while the performance of the estimators for a sample containing relatives follows the same general trend, provides the greatest accuracy for heterogeneous samples of outbred diploid individuals. We also considered the relative performance of each estimator when using either the BLUE () or the sample proportion () to estimate allele frequencies. Notably, all estimators perform best when the BLUE () of allele frequency rather than the sample proportion () is used to infer population allele frequencies. We calculated the theoretical MSE for each estimator once with and once with across all combinations of relative pairs for diploid individuals at the D3S2427 locus and mapped its value for the estimate with minus the estimate with (Figure S7A). Because both frequency estimations yield the same values in samples of unrelated individuals, performs identically for and and is not included. The MSE of an estimator calculated with is always smaller than that of the estimator calculated with and the pattern of divergence between their MSEs follows a similar trend across all estimators, resembling the rightmost panel in Figure 3A. This result suggests that the difference in MSE between and is driven primarily by the difference in performance between and Both the and estimators yield the same value at the vertices of the triangles, and the difference in their MSEs reaches a maximum at 22 second-degree pairs for and 24 second-degree pairs for and (Figure S7A, center and right triangles). The MSE of calculated with is, at most, on the order of greater than that of calculated with indicating its robustness to variance in allele frequency determination (Figure S7A, right triangle). In contrast, the other estimators return a maximum difference in MSE on the order of The estimation of expected heterozygosity with or will always yield a smaller MSE for samples of outbred, diploid individuals when rather than is taken as the estimator of population allele frequency. 
We repeated these tests in samples of mixed ploidy (Figure 3B, Figure S4B, Figure S5B, Figure S6B, Figure S7B, and Figure S8B), and emerged similarly superior to the other estimators, once again yielding the smallest MSE. We analyzed the D3S2427 locus as X-linked for these tests, counting males as haploid and females as diploid, and observed full-sibling pairs [similarly to DeGiorgio , for male-male pairs, for male-female pairs, and for female-female pairs] for samples of 100 individuals and 50 relative pairs. All estimators reach their maximum MSE in samples containing only male-male pairs (Figure S4B). This is because the number of independent observations (indicated by a larger mean kinship coefficient) is smallest when there are no females in the sample. Correspondingly, the estimators yield smaller MSE values with increasing incorporation of male-female pairs. The minimum MSE of is reached at 50 male-female pairs, as with and because its squared bias (Figure S5B) decreases with increasing male-female pairs, though its variance is smallest at 50 female-female pairs, due to the greater number of alleles in the sample (Figure S6B). To create the reduced sets, males were removed from male-female pairs to minimize the subsequent increase in MSE. That is, the removal of males removes 1/3 of the allele copies from the sample, rather than 2/3 if females are removed, or 1/2 for a pair of same-ploidy individuals, and so has the same value across samples with the same number of male-male pairs (Figure S4B, second triangle). The direct comparison of with the other estimators once again yielded different signatures for each subtraction for mixed-ploidy samples (Figure 3B). The point of greatest difference in MSE between and occurs when all relative pairs are male-male, while the point of least difference occurs for samples of only male-female pairs (Figure 3B, left triangle). 
This pattern broadly resembles the squared bias of (Figure S5B, first triangle), underscoring the effect of bias on estimator performance. The pattern of difference in performance between and differs markedly, and the two estimators perform most similarly as the number of male-male pairs decreases, reaching a minimum at 33 male-female pairs plus 17 female-female pairs (Figure 3B, middle triangle). yields the closest MSE to that of for all relative pair configurations, and their difference is, at most, on the order of (Figure 3B, right triangle). The pattern here mainly reflects the difference in performance between and estimates of population allele frequency, as in Figure S7B, where estimators yield increasingly smaller comparative MSE values as the numbers of relative pairs in the sample approach each other. We repeated the preceding tests once more for a sample in which full-siblings resulting from a brother-sister mating were included alongside second-degree and outbred full-sibling pairs (Figure 3C, Figure S4C, Figure S5C, Figure S6C, Figure S7C, and Figure S8C). Here, the kinship of inbred individuals with each other was 3/8 rather than 1/4. For all estimators, the inclusion of inbred full-siblings increased the MSE of the estimator, with a maximum MSE at 50 inbred full-sibling pairs, and a minimum at 50 second-degree pairs. For this minimum was also reached for any sample in which there were no inbred individuals, because the reduced sample is identical for these (Figure S4C, second triangle). Again, was the least errant estimator across the space of sample configurations (Figure S4C, fourth triangle), and its advantage over the other estimators differs for each estimator (Figure 3C). Because the bias of is largest at 50 inbred full-sibling pairs, the greatest difference in performance between it and is at this point (Figure 3C, left triangle). 
Meanwhile, the largest differences in MSE between and are near the top vertex, where second-degree relative pairs predominate, while the smallest are toward the bottom vertices (Figure 3C, center triangle). The difference in MSE between and is at least an order of magnitude less than for the other comparisons, and increases for increasing sample complexity, but reaches its maximum for samples of 28 inbred full-sibling plus 22 second-degree pairs (Figure 3C, right triangle). This pattern reflects the decreased MSE for the estimators when calculated with compared to their calculation with (Figure S7C). Ultimately, the performance of the estimators of expected heterozygosity across varying sample compositions depends on the estimator of allele frequency incorporated into the expected heterozygosity calculation. No matter the sample type, estimators based on outperform estimators based on and outperforms and
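The dependence on the allele frequency estimator can be illustrated with a minimal sketch for diploid samples. It rests on assumptions that are not taken from the BestHet script: the BLUE weights are taken as K^{-1}1 / (1'K^{-1}1), following the best linear unbiased estimator of McPeek; the bias correction 1 / (1 - w'Kw) is assumed from the unbiasedness argument; and all function names and the toy genotypes are hypothetical. The example sample holds one inbred full-sibling pair (kinship 3/8, as stated above) and two unrelated outbred individuals.

```python
from fractions import Fraction as Fr


def sib_kinship(self_father, self_mother, parents_kinship):
    """Kinship of two full siblings from the parents' self-kinships and the
    kinship between the parents (standard pedigree recursion)."""
    return (self_father + self_mother + 2 * parents_kinship) / 4


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination in exact fractions."""
    n = len(matrix)
    aug = [[Fr(v) for v in row] + [Fr(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("kinship matrix is not invertible")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def blue_weights(kinship):
    """Weights K^{-1} 1 / (1' K^{-1} 1); assumed form of the BLUE."""
    u = solve(kinship, [1] * len(kinship))
    total = sum(u)
    return [v / total for v in u]


def quad_form(weights, kinship):
    """w' K w, the variance inflation factor of the weighted frequency."""
    n = len(weights)
    return sum(weights[i] * kinship[i][j] * weights[j]
               for i in range(n) for j in range(n))


def corrected_heterozygosity(dosages, weights, kinship):
    """(1 - sum_a p_a^2) / (1 - w' K w), with p_a the weighted frequency.
    dosages[i][a] is the fraction of individual i's copies that are allele a."""
    n_alleles = len(dosages[0])
    freqs = [sum(w * d[a] for w, d in zip(weights, dosages))
             for a in range(n_alleles)]
    return (1 - sum(p * p for p in freqs)) / (1 - quad_form(weights, kinship))


# Outbred full siblings (unrelated parents) versus siblings whose parents
# are themselves outbred full siblings (brother-sister mating).
outbred_sibs = sib_kinship(Fr(1, 2), Fr(1, 2), Fr(0))
inbred_sibs = sib_kinship(Fr(1, 2), Fr(1, 2), outbred_sibs)
inbred_self = (1 + outbred_sibs) / 2  # (1 + F) / 2 with F = 1/4

# Hypothetical sample: individuals 0 and 1 are the inbred siblings.
K = [[inbred_self, inbred_sibs, 0, 0],
     [inbred_sibs, inbred_self, 0, 0],
     [0, 0, Fr(1, 2), 0],
     [0, 0, 0, Fr(1, 2)]]
dosages = [[1, 0], [1, 0], [Fr(1, 2), Fr(1, 2)], [0, 1]]  # AA, AA, Aa, aa

w_blue = blue_weights(K)
w_prop = [Fr(1, 4)] * 4  # sample proportion weights
h_blue = corrected_heterozygosity(dosages, w_blue, K)
h_prop = corrected_heterozygosity(dosages, w_prop, K)
```

In this toy sample the BLUE down-weights the two inbred siblings relative to the unrelated individuals, and its inflation factor w'Kw is smaller than the mean kinship coefficient that results from equal weights.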

Tests of H̃_BLUE on single-nucleotide polymorphism (SNP) loci

Because SNP datasets are more common in recent studies, we performed analyses equivalent to our microsatellite analyses for 50 hypothetical SNP loci. These loci were biallelic with minor allele frequency (MAF) between 0.01 and 0.5, with increments of 0.01, corresponding to expected heterozygosity values ranging from 0.0198 to 0.5. We first measured the difference in MSE of with that of or as a function of true locus expected heterozygosity (H), as we did in Figure 1 (Figure S9). For each locus, the MSE of was smallest, while that of was generally second-smallest, following the trend for microsatellite loci visible in Figure 1, wherein less diverse loci yielded a smaller MSE for than for However, unlike for microsatellite loci, estimator MSE peaks midway through the range of evaluated SNP loci, such that the smallest MSE values lie at either extreme of the range and the largest MSE value, as well as the largest difference in MSE values for all comparisons, is at the locus with MAF (). Additionally, performs comparatively better than (Figure S9, left) and (Figure S9, center) as H approaches 0.255, but is outperformed by these unbiased estimators as H approaches 0.5. Once more, the trend is opposite for the comparison between and showing the greatest comparative performance by at the same locus (MAF ). Thus, considering the results presented in Figure 1 and Figure S9, the greatest relative performance of for inbred samples is achieved at loci for which estimator MSE is largest. We next examined the effect of sample size on estimator performance for hypothetical samples of outbred diploid, inbred diploid, and outbred male-female relative pairs at the simulated locus with MAF (). 
As we varied the sample size from two relative pairs to 100 (each individual related to exactly one other, one relative pair type per sample), we found that yielded the smallest MSE of all estimators only for samples containing male-female full-sibling pairs modeled at an X-linked locus (Figure S10, E and F). This observation mirrors the trend seen in Figure 2, wherein outperformed the other estimators across all sample sizes. However, yielded the smallest MSE across all sample sizes for outbred and inbred diploid full-siblings and avuncular pairs (Figure S10, A–D). This result is because the samples modeled here are minimally complex, with only one relative pair type, and modeled for a highly homozygous marker—two conditions under which the low bias and variance of result in favorable performance. Finally, we analyzed estimator performance once more for the locus with MAF (), for a sample of 50 individuals across changing outbred diploid, inbred diploid, and male-female full-sibling relative pair compositions (Figure S11, A–C). We display these results as heat maps, and find that our results here are broadly concordant with those for the D3S2427 human microsatellite locus (). As with the experiments displayed in Figure S10, the least complex samples yielded a smaller MSE for estimates than for estimates. Correspondingly, samples whose relative pair compositions resulted in fewer independent allele observations were more accurately and precisely evaluated with Thus, while sampling lower-diversity markers may occasionally favor the use of the inclusion of two or more relative pair types in the sample is likely to bias , and require the use of to yield accurate inferences.
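The grid of hypothetical SNP loci described in this section can be reproduced with a short sketch, assuming the standard biallelic relation H = 2p(1 - p) between minor allele frequency p and expected heterozygosity. The function and variable names are illustrative only.

```python
def expected_heterozygosity(maf):
    """Expected heterozygosity of a biallelic locus with minor allele frequency maf."""
    return 2 * maf * (1 - maf)


# 50 loci with MAF from 0.01 to 0.50 in steps of 0.01.
mafs = [k / 100 for k in range(1, 51)]
het_values = [expected_heterozygosity(p) for p in mafs]

print(len(mafs), round(het_values[0], 4), round(het_values[-1], 4))
```

Under this relation the grid spans H = 0.0198 to H = 0.5, and a value of H = 0.255 corresponds to MAF = 0.15.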

Empirical application of H̃_BLUE

To conclude our investigation into the performance of we applied it to empirical data from the MS5795 dataset. We retrieved human microsatellite data from 5795 individuals (11,590 allele copies) across 645 autosomal loci sampled genome wide. We assumed the mean value across loci for in each of 267 populations to be the true expected heterozygosity value for these populations, as it is an unbiased estimate. We additionally chose to compare the other estimators with because an important basis for their evaluation is their agreement with this unbiased estimator, irrespective of the data to which they are applied. To emphasize this, we performed three Wilcoxon signed-rank tests to compare the ranking of populations by their mean expected heterozygosity across all loci calculated with and either or (Table 1). At the significance level, the comparisons showed that the inclusion of relatives for was highly significant on the rankings it yielded, indicating that not correcting for relatedness among samples can significantly alter the estimates of expected heterozygosity. However, both and, especially, yielded P-values greater than the threshold for the test against These results indicate that the estimates of expected heterozygosity are not significantly affected by the inclusion of related individuals in the sample when relatedness is taken into account. Furthermore, a test between and yielded a P-value of suggesting no significant difference in the ranking of populations by mean expected heterozygosity with these two estimators.

Table 1

Wilcoxon signed-rank test for mean across loci of with and for the 93 populations whose samples contained related individuals

Comparison	P-value (Wilcoxon signed-rank test)
Ĥ_red with Ĥ_full	4.39 × 10^−15
Ĥ_red with H̃	1.00 × 10^−2
Ĥ_red with H̃_BLUE	0.255
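For readers unfamiliar with the test behind Table 1, the following is a minimal sketch of a two-sided Wilcoxon signed-rank test on paired per-population means. It uses the large-sample normal approximation without tie or continuity corrections, which is an assumption of this sketch and not necessarily the procedure the authors ran. The input values are hypothetical and are not drawn from MS5795.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W+, two-sided p) for paired samples, normal approximation."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied absolute differences and give it the mean rank.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, p_value


# Hypothetical per-population means: the uncorrected full-sample estimate is
# lower than the reduced-sample estimate in every one of 20 populations.
h_red = [0.60 + 0.01 * i for i in range(20)]
h_full = [h - 0.001 * (i + 1) for i, h in enumerate(h_red)]
w_stat, p_val = wilcoxon_signed_rank(h_full, h_red)
```

A consistent one-directional shift, as in this toy input, drives W+ to zero and yields a small P-value, which is the pattern a downwardly biased estimator would be expected to produce.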

Although the unbiased estimators and have smaller MSE than for samples with related individuals, their variance tends to be larger than that of DeGiorgio previously showed that the difference in SD of with was small, while the mean values of and were much more similar to each other than either of them was to the mean of We again show this to be the case, and find as well that not only repeats, or improves upon, the concordance of with but, in some cases, has a smaller SD than does (Figure 4, left and center panels). A direct comparison of the performance of against that of (Figure 4, right panel) shows that has a generally improved SD, and similarity to the estimate over For some samples (primarily those from the Americas), this is not the case, possibly because all close relatives were not identified in the original dataset, resulting in an incorrect kinship matrix for calculation of the statistic.

Figure 4

Application of the estimators to dataset MS5795. Here, we show a comparison of two estimators at a time ( or ) by the difference in their mean with that of across the 645 sampled microsatellite loci of MS5795 (vertical axis), and by their SDs (horizontal axis). The horizontal dotted line corresponds to no difference between the mean of the estimator and the mean of the unbiased estimator Solid lines connect calculations made for the same population with different estimators. Points are colored by geographic division defined in the dataset. Only the 93 populations with relatives in their samples were included because and return the same value for samples of unrelated individuals. In the leftmost plot, open points are estimates for while closed points are for In the center plot, open points are estimates for while closed points are for In the rightmost plot, open points are estimates for while closed points are for

Improving estimates of F_ST by application of H̃_BLUE

We predicted that the smaller MSE of would translate to improved accuracy for estimators that are summaries of expected heterozygosity when samples contain related individuals. To test this hypothesis, we calculated the population differentiation statistic, (Equation 4), for pairs of populations whose samples in the MS5795 dataset contained related individuals. Our intent was to compare the MSE and bias of the commonly used estimator of Reynolds , which is based on and which we label as to an estimate of calculated from which we label The formulas for these estimators follow the form of the general estimator of (Equation 14). We first measured the MSE of both methods (and an estimate using ) on simulated data, where the of pairs of populations with samples of size 60 diploids each (30 relative pairs, 10 inbred full-sibling, 10 outbred full-sibling, and 10 avuncular pairs; Figure 5) was averaged across simulated replicates. The calculations included here were performed for simulated Gujarati and Maya (left), Gujarati and Japanese (center), or Gujarati and Hadza (right) samples for the least diverse (TCTA015M_22), median diverse (D10S2327), and most diverse (D3S2427) loci of the MS5795 dataset, following their allele frequency distribution in MS5795. consistently has a smaller MSE than the others, and the MSE of all estimators of decreases with increasing locus diversity, as the MSE of the estimator of expected heterozygosity decreases.

Figure 5

Application of the estimators and to the calculation of as and respectively, using simulated data for the Gujarati sample, with either the Maya (left), Japanese (center), or Hadza (right) samples, showing MSE on the vertical axis. The Reynolds estimator is equivalent to the application of in calculating population differentiation. The simulated samples contained 60 individuals and 30 relative pairs, of which 10 were inbred full-siblings, 10 were outbred full-siblings, and 10 were outbred avuncular pairs. Each individual was related to exactly one other, and the data were simulated following the same probabilistic method as employed to generate Figure S2. The three loci displayed on the horizontal axis are the least diverse, median diverse, and most diverse loci of the 645 MS5795 human microsatellites.

We additionally find that has an upward bias compared with (calculated with ), as well as a larger SD in general than (Figure 6). Furthermore, all values of are smaller than the paired value of calculated for the same population. The difference in the mean of and of across all loci with the mean of an estimator which serves as a proxy for the true value of is displayed on the vertical axis, while the horizontal axis measures the SD of and of (Figure 6). Supporting our observations indicating the improved accuracy of over Wilcoxon signed-rank tests (Table 2) between and either or indicate that the inclusion of relatives significantly affects the estimate of population differentiation at the significance level. Meanwhile, and are not significantly different in their estimates. These results suggest that the improved properties of transfer to the summaries that include it in their calculations.

Figure 6

Application of the estimators and to the estimation of as and respectively, from empirical data. Similarly to Figure 4, the difference between the mean of the estimator of (either derived from or ) and an unbiased estimator (derived from ), is displayed on the vertical axis, while the SD of the estimator is displayed on the horizontal axis. The empty circles represent the Reynolds estimator (identical to the -derived estimation), while the filled circles represent the estimation derived from Here, the values for the French sample with each of the 92 other samples containing related individuals in the dataset MS5795 are plotted, colored by the region of the changing sample.

Discussion

We have introduced an extension to the estimator () of expected heterozygosity developed by DeGiorgio that yields a smaller mean squared error in samples containing related individuals, while maintaining unbiasedness. Conveniently, the derivations of and its variance, are parallel in form to those of and we were therefore able to analytically evaluate the performance of the new estimator simultaneously with that of its predecessor. Our updated estimator, is based on results from McPeek , who characterized the BLUE () of allele frequency. The BLUE improves the precision of allele frequency estimation in complex pedigrees, for which the sample proportion ( the estimator of allele frequency used in and ) is unbiased, but increases in variance with inclusion of related and inbred individuals. Because the properties of the estimator of allele frequency transfer to the estimator of expected heterozygosity, is likely to outperform in situations where has a smaller variance than This trend is true for genome-wide data as well (Figure 4 and Table 1). Overall, yields identical results to in samples containing only one relative pair type, but the two diverge in performance as sample complexity increases (see heat maps in Figure 3, Figure S4, Figure S5, Figure S6, Figure S7, and Figure S8). While both estimators are unbiased, experiences a larger increase in variance for each additional relative pair type introduced into a sample after the first. This holds true for all sample types regardless of ploidy and inbreeding, suggesting that will outperform in practice, where datasets are often complex. Furthermore, the results of our empirical analysis provide an equally important complement to this observation. Of the 93 populations from the MS5795 dataset we considered that contained relative pairs in their samples, each contained sampled individuals that were not related to any other in the sample. 
Thus, these samples were more complex than those in which each individual was part of a relative pair of the same type. For most of these cases, except for some American populations (discussed below), outperformed This is corroborated by the Wilcoxon signed-rank test (Table 1). We expect therefore that any scenario in which there is heterogeneity in relative pair type among sampled individuals, as is observed in many human population-genetic datasets (Pemberton , 2013), should favor the application of over other estimators. In addition, random sampling of small isolated populations yields an increased chance that related individuals will be included with large enough sample sizes. Further, inbreeding may confound estimates of diversity, and mislead to underreport true population expected heterozygosity. Populations of interest that may display these attributes include geographically isolated human settlements in remote alpine (Coia ; Capocasa ), South American rainforest (Wang ), and Siberian taiga and steppe habitats (Dulik ), and groups such as the Old Order Amish (Van Hout ), Hutterites (Abney ; Chong ), and Mennonites (Payne ). Further, though our analysis did not directly consider polyploid organisms, the applicability of to samples containing individuals of any, and varying, ploidy highlights its usefulness for such data. 
Prominently, analysis on polyploid organisms such as plants including tetraploid Arabidopsis thaliana (Hollister ), and hexaploid bread wheat (Nielsen ), both of which self-fertilize, and may therefore be inbred, as well as commercially and ecologically significant Hymenopteran insects, including honeybees (Solignac ; Harpur ), bumblebees (Lye ), and ants (Butler ), whose males are haploid at all loci, while females are diploid, is likely to benefit from the improved accuracy and precision of We additionally believe that continued investigations into the diversity at single sites in organisms as diverse as dogs (Sutter ), gray wolves (Zhang ), humans living at high altitude (Simonson ; Huerta-Sánchez ), and rice (Huang ), in addition to host-microbiome studies (Blekhman ), will benefit from the advances provided by These studies, as well as many others, have performed scans for positive selection using genomic outliers of population differentiation-based statistics (e.g., locus-specific branch length, and the population branch statistic), where the calculation is performed per-site, rather than averaged across a large number of sites. Such studies would benefit from estimators of genetic diversity, such as and with improved variance. It is pertinent at this point to revisit a pair of potential limitations in our method and examine their implications. First, in Figure 4 (rightmost panel), the mean of is either closer to that of than to has smaller SD than or both for certain samples (predominantly from the Americas). These observations indicate that the accuracy and precision of may be impacted by the accuracy of the kinship information incorporated into the calculation. The pedigrees of smaller, more remotely located, populations may be more complex compared to those of larger groups. Further, with a greater proportion of relative pairs in each sample, the effect of relative pair type misidentification may be larger. 
For RELPAIR (Epstein ), which was the software chosen to identify relative pairs in MS5795 samples, second-degree pairs cannot be identified as confidently as first-degree pairs (Pemberton ). Even so, although may exhibit a somewhat greater robustness to relative pair misclassification, it is still generally outperformed by The second point we address is the smaller MSE of at less diverse loci in the dataset, especially for samples with fewer relative pairs. While the variance of is always smaller than that of the other estimators, its bias increases with increasing locus allelic diversity. It is for this reason that the unbiasedness of is its most desirable property. In practice, the mean of expected heterozygosity is often taken across loci. Based on such an approach, (and as well) will return the mean expected heterozygosity, and the variance of this estimation (as with all estimators taking the mean across loci) approaches zero as more loci are sampled. An interesting property of all estimators is that their variance (and therefore MSE) is larger for loci whose value for B is closer to 1, where ( see Results and Figure S3). Because this effect is greatest for loci with lower true values of H, we expect to have the smallest MSE of all estimators at less diverse loci that are close to their maximum expected heterozygosity, and for which the sample mean kinship coefficient is insufficiently large to appreciably bias the estimator (Equation 12). It is thus important to note that no estimator is uniformly superior to the others. Accordingly, the unique limitation of is that the sample kinship matrix must be invertible for the calculation to proceed. additionally confers its improved MSE over downstream to calculations that incorporate estimates of expected heterozygosity. 
To illustrate this point, we computed as a function of three estimators: and For simulated data, we found that yielded an estimate with smaller MSE for the three tested loci than did (Figure 5) or and a much smaller mean distance from the true value than For empirical data (Figure 6), we observed a consistent upward bias for compared to in samples containing relatives that followed much the same pattern as the downward bias of for such samples. This trend is clear when we consider the formula for which can be written as Taking and as and this expression yields a larger value than if and were used, because the ratio is smaller for downwardly biased estimators. Interestingly, the SD of is, in most cases, smaller than that of for the dataset, while the SD of was frequently (though not consistently) larger than that of (Figure 4, center panel). It is thus noteworthy to consider that the performance of and may diverge further in their applications, where any improvement in MSE for may be magnified downstream. This is highlighted by the increased concordance between and compared to and (cf. P-values between Table 1 and Table 2). With this in mind, applications of can also be considered. Two such examples are the locus-specific branch length (LSBL; Shriver ) and the similar population branch statistic (PBS; Yi ). These statistics incorporate values between three populations as measures of branch length to detect positive selection at a locus. Loci for which the unrooted three-taxon tree indicates a significantly longer branch length in a particular lineage may represent regions possibly under selection. To allow for the easy application of we have written an R script, BestHet, that computes and given matrices of genotype and kinship data for a sample (download available at http://www.personal.psu.edu/mxd60/best_het.html).
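The downstream effect described above can be sketched numerically. The sketch assumes the common heterozygosity-based form F_ST = 1 - H_S / H_T (within-population over total heterozygosity), which may differ in detail from the paper's Equation 14, together with the usual definitions of LSBL as (F_AB + F_AC - F_BC) / 2 and of PBS as the same combination applied to T = -log(1 - F_ST). The function names and numeric inputs are hypothetical.

```python
import math


def fst_from_heterozygosity(h_within, h_total):
    """F_ST = 1 - H_S / H_T (assumed general form)."""
    return 1 - h_within / h_total


def lsbl(f_ab, f_ac, f_bc):
    """Locus-specific branch length for population A from pairwise F_ST values."""
    return (f_ab + f_ac - f_bc) / 2


def pbs(f_ab, f_ac, f_bc):
    """Population branch statistic for A, using T = -log(1 - F_ST)."""
    def branch(f):
        return -math.log(1 - f)
    return (branch(f_ab) + branch(f_ac) - branch(f_bc)) / 2


# Holding H_T fixed for illustration, a downwardly biased within-population
# heterozygosity lowers the ratio H_S / H_T and therefore inflates F_ST.
fst_unbiased = fst_from_heterozygosity(0.70, 0.75)
fst_biased = fst_from_heterozygosity(0.68, 0.75)

# Hypothetical pairwise F_ST values for populations A, B, and C.
branch_lsbl = lsbl(0.10, 0.20, 0.05)
branch_pbs = pbs(0.10, 0.20, 0.05)
```

Because LSBL and PBS are evaluated per locus in selection scans, any error in the per-locus heterozygosity estimates enters each pairwise F_ST term directly; it is not averaged away across sites.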

Supplementary Material

Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.116.037168/-/DC1.

34 in total

1. Quantitative-trait homozygosity and association mapping and empirical genomewide significance in large, complex pedigrees: fasting serum-insulin level in the Hutterites.

Authors: Mark Abney; Carole Ober; Mary Sara McPeek
Journal: Am J Hum Genet Date: 2002-03-04 Impact factor: 11.025

2. Amish, mennonite, and hutterite genetic disorder database.

Authors: Michael Payne; C Anthony Rupar; Geoffrey M Siu; Victoria Mok Siu
Journal: Paediatr Child Health Date: 2011-03 Impact factor: 2.253

3. Evidence of high genetic variation among linguistically diverse populations on a micro-geographic scale: a case study of the Italian Alps.

Authors: Valentina Coia; Ilaria Boschi; Federica Trombetta; Fabio Cavulli; Francesco Montinaro; Giovanni Destro-Bisol; Stefano Grimaldi; Annaluisa Pedrotti
Journal: J Hum Genet Date: 2012-03-15 Impact factor: 3.172

4. Population genomics of the honey bee reveals strong signatures of positive selection on worker traits.

Authors: Brock A Harpur; Clement F Kent; Daria Molodtsova; Jonathan M D Lebon; Abdulaziz S Alqarni; Ayman A Owayss; Amro Zayed
Journal: Proc Natl Acad Sci U S A Date: 2014-01-31 Impact factor: 11.205

5. Refining the relationship between homozygosity and the frequency of the most frequent allele.

Authors: Shashir B Reddy; Noah A Rosenberg
Journal: J Math Biol Date: 2011-02-09 Impact factor: 2.259

6. Extent and distribution of linkage disequilibrium in the Old Order Amish.

Authors: Cristopher V Van Hout; Albert M Levin; Evadnie Rampersaud; Haiqing Shen; Jeffrey R O'Connell; Braxton D Mitchell; Alan R Shuldiner; Julie A Douglas
Journal: Genet Epidemiol Date: 2010-02 Impact factor: 2.135

7. Genomic estimated breeding values using genomic relationship matrices in a cloned population of loblolly pine.

Authors: Jaime Zapata-Valenzuela; Ross W Whetten; David Neale; Steve McKeand; Fikret Isik
Journal: G3 (Bethesda) Date: 2013-05-20 Impact factor: 3.154

8. Genetic adaptation associated with genome-doubling in autotetraploid Arabidopsis arenosa.

Authors: Jesse D Hollister; Brian J Arnold; Elisabeth Svedin; Katherine S Xue; Brian P Dilkes; Kirsten Bomblies
Journal: PLoS Genet Date: 2012-12-20 Impact factor: 5.917

9. Conserved microsatellites in ants enable population genetic and colony pedigree studies across a wide range of species.

Authors: Ian A Butler; Kimberly Siletti; Peter R Oxley; Daniel J C Kronauer
Journal: PLoS One Date: 2014-09-22 Impact factor: 3.240

10. A map of rice genome variation reveals the origin of cultivated rice.

Authors: Xuehui Huang; Nori Kurata; Xinghua Wei; Zi-Xuan Wang; Ahong Wang; Qiang Zhao; Yan Zhao; Kunyan Liu; Hengyun Lu; Wenjun Li; Yunli Guo; Yiqi Lu; Congcong Zhou; Danlin Fan; Qijun Weng; Chuanrang Zhu; Tao Huang; Lei Zhang; Yongchun Wang; Lei Feng; Hiroyasu Furuumi; Takahiko Kubo; Toshie Miyabayashi; Xiaoping Yuan; Qun Xu; Guojun Dong; Qilin Zhan; Canyang Li; Asao Fujiyama; Atsushi Toyoda; Tingting Lu; Qi Feng; Qian Qian; Jiayang Li; Bin Han
Journal: Nature Date: 2012-10-03 Impact factor: 49.962
