The fixation index F(st) plays a central role in ecological and evolutionary genetic studies. The estimators of Wright ([Formula: see text]), Weir and Cockerham ([Formula: see text]), and Hudson et al. ([Formula: see text]) are widely used to measure genetic differences among different populations, but all have limitations. We propose a minimum variance estimator [Formula: see text] using [Formula: see text] and [Formula: see text]. We tested [Formula: see text] in simulations and applied it to 120 unrelated East African individuals from Ethiopia and 11 subpopulations in HapMap 3 with 464,642 SNPs. Our simulation study showed that [Formula: see text] has smaller bias than [Formula: see text] for small sample sizes and smaller bias than [Formula: see text] for large sample sizes. Also, [Formula: see text] has smaller variance than [Formula: see text] for small Fst values and smaller variance than [Formula: see text] for large F(st) values. We demonstrated that approximately 30 subpopulations and 30 individuals per subpopulation are required in order to accurately estimate F(st).
The fixation index F(st) plays a central role in ecological and evolutionary genetic studies. The estimators of Wright ([Formula: see text]), Weir and Cockerham ([Formula: see text]), and Hudson et al. ([Formula: see text]) are widely used to measure genetic differences among different populations, but all have limitations. We propose a minimum variance estimator [Formula: see text] using [Formula: see text] and [Formula: see text]. We tested [Formula: see text] in simulations and applied it to 120 unrelated East African individuals from Ethiopia and 11 subpopulations in HapMap 3 with 464,642 SNPs. Our simulation study showed that [Formula: see text] has smaller bias than [Formula: see text] for small sample sizes and smaller bias than [Formula: see text] for large sample sizes. Also, [Formula: see text] has smaller variance than [Formula: see text] for small Fst values and smaller variance than [Formula: see text] for large F(st) values. We demonstrated that approximately 30 subpopulations and 30 individuals per subpopulation are required in order to accurately estimate F(st).
The fixation index F
is widely used as a measure of population differentiation due to genetic structure. Wright [1, 2] defined F
as the ratio of the observed variance of allele frequencies between subpopulations to the expected variance of allele frequencies assuming panmixis. Wright’s estimator of F
is biased, because a priori expected allele frequencies are unknown and the numerator and denominator terms in the equation are not independent. In practice, various frameworks have been proposed to improve estimation of F
. Weir and Cockerham used an analysis of variance (ANOVA) approach to estimate within- and between-population variance components [3, 4]. Weir and Cockerham’s estimator is widely used because their estimator can describe the genetic population structure in a single summary statistic, is asymptotically unbiased with respect to sample size, and can compensate for overestimates particularly at low levels of genetic differentiation unlike Wright’s estimator [5]. However, it can be upwardly biased unless adjustment is done for intralocus sampling error, the number of subpopulations sampled, time of divergence, etc. [6]. In the present study, we propose a method that improves F
estimation by combining Wright’s and Weir and Cockerham’s estimators to achieve a minimum variance estimate. For comparison, we also include Hudson et al.’s estimator [7], which recently has been recommended by Bhatia et al. [8]. We demonstrate application of our modified estimator in analysis of real data.
Methods
For a diallelic marker, let p be the true minor allele frequency in the total population. Let the true subpopulation allele frequencies be p
1, …, p
in r ≥ 2 subpopulations. Let σ
2 be the true population variance in allele frequencies across subpopulations. Suppose the observed sample frequencies are and the sample sizes are n
1, …, n
. Let and . Let ϑ be the difference in allele frequencies, such that for two subpopulations, .Wright’s F
[2] is defined as
and is estimated as
with
For the special case of two subpopulations, Rosenberg et al. [9] showed that by algebraic rearrangement
Thus, F
is a function of the difference in allele frequencies and is proportional to ϑ
2.Weir and Cockerham’s estimator [4], assuming a random union of gametes or equivalently no individual-level inbreeding, is based on
and
yieldingThe definition of F
of Hudson et al. [7] is
Given observed sample estimates , is a biased estimate of H
, because
An unbiased estimate of p
(1 − p
) is thus given by . However, is an unbiased estimate of H
if , i.e., under the null hypothesis. Therefore, we estimate F
by
which is a ratio of unbiased estimates. This estimator generalizes Bhatia et al.’s [8] version of Hudson et al.’s [7] estimator for r > 2.Note that under the null hypothesis of p
1 = ⋯ = p
, both and are asymptotically zero. Our goal is to construct an estimator based on a linear combination of and such that the new estimator has the smallest variance among all such linear combinations. Let and be the asymptotic variances of and , and σ
12 be the asymptotic covariance. We propose the following weighted version of :
where b > 0 is a fixed number to be chosen later. We choose a = a
0 such that is minimized:
It is seen that and hence is more precise in estimation. From the proof of the Proposition we see that Eq (1) is equivalent to,
which gives, with b = (δ − 1)/(δ + 1),
At the end of the proof of the following Proposition, we show that δ ≥ 1 with equality if and only if n
1 = ⋯ = n
. When n
1 = ⋯ = n
, we have and . Let denote convergence in distribution.Proposition. Assume that 0 < p
0 < 1 and that the n
and
we have
where λ
1, …,λ Ω′1/2
BΩ1/2, Ω = (ω
)
with
ω
= p
0(1 − p
0) if
i = j
and
ω
= 0 if
i ≠ j, Ω−1/2
is the square root of Ω: Ω = Ω′1/2Ω1/2, and
, b
= (−γ
1, …, −γ
, (1 − γ
), −γ
, …, −γ
)′.In the above Proposition, take a = a
0, then a
0+δ(1 − a
0) = 0 and δ (1 − a
0) = −δ/(δ − 1), and we getCorollary 1. Under conditions of the Proposition,IfIfSimulations Under the Balding-Nichols model [10], the allele frequency in each of r subpopulations conditional on p and F
is a random deviate from the beta distribution β
, which has mean p and variance p(1 − p)F
= σ
2.Simulation 1. This simulation was designed to estimate bias in the worst case scenario of two subpopulations. We evaluated the relationships between and F
and between and . First, given the true average allele frequency p for r = 2, F
reaches its maximum value for p
values of 0 and 2p. The estimator [4] yields a constrained range for from 0 to 2p. Therefore, we first assigned the true value for p by drawing a random uniform deviate from the interval (0, 0.5) and the true value for F
by independently drawing a random uniform deviate from the interval (0, 2p). Conditional on the true values of p and F
, we randomly generated p
from the beta distribution. We next assigned the number of individuals per subpopulation n
= [5, 10, 20, 50, 100, 110]. We then randomly drew alleles from the binomial distribution Bin(2n
, p
). We generated 10,000 independent replicate data sets. Based on the above formulae, the four estimators , , , and were calculated. Linear regression models were used to evaluate the relationship between F
and and between and . We assessed the fit in a linear regression model with the F-test, r
2, and the root mean squared error (RMSE), which is the square root of the sum of the variance and the squared bias.Simulation 2. This simulation was designed to evaluate variance under sampling conditions approaching unbiasedness, i.e., large numbers of subpopulations and individuals per subpopulation. We evaluated the relationships between and the number of subpopulations (r) and between and the number of individuals per subpopulations (n
). Conditional on the average allele frequency p, F
, the number of subpopulations r = [5, 10, 20, 50, 100, 250], and the number of individuals per subpopulation n
= [5, 10, 20, 50, 100, 250, 1000], we randomly generated r allele frequencies as in Simulation 1 and calculated , , , and .
Application to data
We included genotype data from a total of 120 unrelated individuals from the Wolaita (WETH) ethnic group from southern Ethiopia who served as controls in a genome-wide association study of podoconiosis [11]. The Wolaita ethnic group speaks an Omotic language, and comparison with HapMap African populations has shown that it has the closest genetic similarity with the Maasai from Kenya and the lowest genetic similarity with the Yoruba in Nigeria [12]. Genotyping was performed by deCODE Genetics using the Illumina HumanHap 610 Bead Chip, which assays > 620,000 single-nucleotide polymorphisms (SNPs). Of the 551,840 autosomal SNPs in the raw genotype data, we excluded 39,249 SNPs that had a minor allele frequency of < 0.05, 378 that were missing in > 0.05 of individuals, and 321 that had a Hardy-Weinberg p-value < 0.001. The remaining 511,892 SNPs were merged with genotype data for ASW (n = 49), CEU (n = 112), CHB (n = 84), CHD (n = 85), GIH (n = 88), JPT (n = 86), LWK (n = 90), MKK (n = 143), MXL (n = 50), TSI (n = 88), and YRI (n = 113) in HapMap phase 3, release 2, which contained 1,440,616 SNPs. A total of 464,642 SNPs were common to both of WETH and HapMap data sets.
, , and were calculated per marker.
Results
Simulation 1: We first compared with the true F
for the worst-case scenario of r = 2. For small sample sizes, was the least biased estimator, followed by , , and (Table 1). For large sample sizes, and were comparably good, and and were identically worse (Table 1). None of the four estimators was strongly sensitive to equal vs. unequal sample sizes (Fig 1). When was close to 0, yielded the most negative estimates, followed by and . As expected, all four estimators showed a quadratic relationship with (Fig 1). With respect to , by all four measures was the best estimator whereas was the worst estimator (Table 1).
Table 1
vs. F
and for two subpopulations.
F^st1
F^st2
F^stm
F^st3
n = 5
F^st vs. Fst
Root MSE
0.2213
0.2241
0.2228
0.2241
Squared bias
0.0465
0.0481
0.0473
0.0484
r2
0.1096
0.0869
0.0979
0.0869
F-test
12310
9522
10848
9522
F^st vs. ϑ2^
Root MSE
0.0340
0.1002
0.0657
0.1127
Squared bias
0.0008
0.0089
0.0036
0.0114
r2
0.9855
0.9129
0.9549
0.9129
F-test
6.808 × 106
1.048 × 106
2.116 × 106
1.048 × 106
n = 1000
F^st vs. Fst
Root MSE
0.2102
0.2105
0.2101
0.2105
Squared bias
0.0415
0.0419
0.0416
0.0419
r2
0.1981
0.1958
0.1986
0.1958
F-test
24701
24344
24778
24344
F^st vs. ϑ2^
Root MSE
0.0233
0.0706
0.0455
0.0706
Squared bias
0.0002
0.0041
0.0015
0.0041
r2
0.9913
0.9355
0.9700
0.9355
F-test
1.140 × 107
1.450 × 106
3.236 × 106
1.450 × 106
Fig 1
The relationship between and for simulated data.
The x-axis shows the difference of allele frequencies between two subpopulations (left plots) and (right plots); the y-axis shows values for Wright’s (top row), Weir and Cockerham’s (second row), the modified (third row), and Hudson et al.’s estimators (bottom row), and the legend indicates the sample sizes n
1 (before hyphen) and n
2 (after hyphen).
The relationship between and for simulated data.
The x-axis shows the difference of allele frequencies between two subpopulations (left plots) and (right plots); the y-axis shows values for Wright’s (top row), Weir and Cockerham’s (second row), the modified (third row), and Hudson et al.’s estimators (bottom row), and the legend indicates the sample sizes n
1 (before hyphen) and n
2 (after hyphen).An assessment of bias by the total sample size (n
1 and n
2) for r = 2 is presented in Fig 2. was biased and this bias was constant across total sample size, as expected given that this estimator does not account for n
. In contrast, , , and were less biased as the total sample size increased. When the total sample size exceeded 30, was the least biased estimator; otherwise, was the least biased estimator. For r = 2, the magnitude of bias for all four estimators was constant when the total sample size was at least 60.
Fig 2
Bias as a function of total sample size.
The x-axis shows the total sample size (n
1 + n
2). The y-axis shows (red), (blue), (green), and (orange) for r = 2.
Bias as a function of total sample size.
The x-axis shows the total sample size (n
1 + n
2). The y-axis shows (red), (blue), (green), and (orange) for r = 2.Simulation 2: Given p = 0.2, F
= [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], n = 1000 individuals, and r = 200 subpopulations, mean values are presented in Table 2. The means for , , , and equaled the expected values, consistent with all four estimators being asymptotically unbiased. First, we investigated the relationship between F
and the variance of the four estimators. Given p = 0.2 and F
< 0.5, had the smallest variance, followed by and (Fig 3). Given p = 0.2 and F
> 0.5, had the smallest variance, followed by and . Similar results were obtained for p = 0.1, 0.3, 0.4, and 0.5 (S1 Table).
Table 2
Means, Variances, and MSEs of in simulation 2.
True Fst
F^st1*
F^st2
F^stm
F^st3
0.1
Means
9.95E-02
9.95E-02
9.95E-02
9.99E-02
Variances
9.40E-05
9.49E-05
9.44E-05
9.48E-05
MSE
9.43E-05
9.51E-05
9.47E-05
9.48E-05
0.2
Means
1.99E-01
1.99E-01
1.99E-01
2.00E-01
Variances
3.36E-04
3.38E-04
3.37E-04
3.38E-04
MSE
3.37E-04
3.39E-04
3.38E-04
3.38E-04
0.3
Means
2.99E-01
2.99E-01
2.99E-01
3.00E-01
Variances
6.27E-04
6.30E-04
6.28E-04
6.29E-04
MSE
6.29E-04
6.30E-04
6.29E-04
6.29E-04
0.4
Means
3.98E-01
3.99E-01
3.99E-01
3.99E-01
Variances
9.12E-04
9.15E-04
9.14E-04
9.14E-04
MSE
9.15E-04
9.16E-04
9.15E-04
9.14E-04
0.5
Means
4.98E-01
4.99E-01
4.98E-01
4.99E-01
Variances
1.11E-03
1.11E-03
1.11E-03
1.11E-03
MSE
1.11E-03
1.11E-03
1.11E-03
1.11E-03
0.6
Means
5.98E-01
5.99E-01
5.99E-01
6.00E-01
Variances
1.18E-03
1.18E-03
1.18E-03
1.18E-03
MSE
1.18E-03
1.18E-03
1.18E-03
1.18E-03
0.7
Means
6.98E-01
6.99E-01
6.99E-01
6.99E-01
Variances
1.12E-03
1.12E-03
1.12E-03
1.11E-03
MSE
1.12E-03
1.12E-03
1.12E-03
1.11E-03
0.8
Means
7.98E-01
7.99E-01
7.99E-01
7.99E-01
Variances
8.67E-04
8.63E-04
8.65E-04
8.62E-04
MSE
8.69E-04
8.64E-04
8.66E-04
8.63E-04
0.9
Means
8.99E-01
8.99E-01
8.99E-01
8.99E-01
Variances
4.81E-04
4.77E-04
4.79E-04
4.77E-04
MSE
4.81E-04
4.78E-04
4.79E-04
4.77E-04
* Means, variances, and Mean Squared Errors (MSEs) from 200 subpopulations with 1000 individuals per subpopulation, given p = 0.2.
Fig 3
Effect of the number of subpopulations on bias.
The x-axis shows the number of subpopulations. The y-axis shows the mean (left) and variance (right) of (red), (blue), (green), and (orange) values, given F
= 0.5 and average allele frequency p = 0.2. The top plot represents 5 individuals per subpopulation and the bottom plot represents 1000 individuals per subpopulation.
* Means, variances, and Mean Squared Errors (MSEs) from 200 subpopulations with 1000 individuals per subpopulation, given p = 0.2.
Effect of the number of subpopulations on bias.
The x-axis shows the number of subpopulations. The y-axis shows the mean (left) and variance (right) of (red), (blue), (green), and (orange) values, given F
= 0.5 and average allele frequency p = 0.2. The top plot represents 5 individuals per subpopulation and the bottom plot represents 1000 individuals per subpopulation.Second, we investigated how the number of subpopulations and the number of individuals per subpopulation affected bias. When the number of subpopulations was approximately 30, no matter the number of individuals per subpopulation, bias was stable (Fig 3). For r > 30 and small n
, all four estimators were biased, with the order of . For r > 30 and large n
, all four estimators were unbiased. For n
> 30, all four estimators were stable and bias decreased as r increased, with the best estimator and the worst estimator (Fig 4).
Fig 4
Effect of the number of individuals per subpopulation on bias.
The x-axis shows the number of individuals per subpopulation. The y-axis shows the mean (left) and variance (right) of (red), (blue), (green), and (orange) values, given F
= 0.5 and an average allele frequency p = 0.2. From top to bottom, the plots represent the number of subpopulations r = 10, 20, and 40, respectively.
Effect of the number of individuals per subpopulation on bias.
The x-axis shows the number of individuals per subpopulation. The y-axis shows the mean (left) and variance (right) of (red), (blue), (green), and (orange) values, given F
= 0.5 and an average allele frequency p = 0.2. From top to bottom, the plots represent the number of subpopulations r = 10, 20, and 40, respectively.Application to Data: The means and variances of values between the WETH and 11 samples in HapMap 3 are presented in Table 3. The WETH sample was closest to the MKK sample, consistent with shared Cushitic and Nilo-Saharan ancestry [13]. and yielded the same order for all pairs of relationships and all four estimators yielded the same order of relationships for the five HapMap samples closest to the WETH sample. The order of the means was < < < . was approximately 30% larger than and approximately 20% smaller than , which has corresponding effects on divergence time estimates. Given that and are less downward biased than for these sample sizes (Fig 2), the larger values are more likely to be correct.
Table 3
between WETH and HapMap 3 samples.
F^st1*
F^st2*
F^stm*
F^st3*
ASW
0.0155 (0.0005)
0.0222 (0.0016)
0.0185 (0.0008)
0.0226 (0.0016)
CEU
0.0368 (0.0023)
0.0630 (0.0067)
0.0499 (0.0042)
0.0632 (0.0067)
CHB
0.0624 (0.0059)
0.1012 (0.0139)
0.0815 (0.0094)
0.1044 (0.0146)
CHD
0.0629 (0.0059)
0.1021 (0.0141)
0.0822 (0.0095)
0.1052 (0.0147)
GIH
0.0359 (0.0022)
0.0603 (0.0063)
0.0479 (0.0039)
0.0611 (0.0064)
JPT
0.0634 (0.0060)
0.1029 (0.0142)
0.0828 (0.0096)
0.1060 (0.0149)
LWK
0.0210 (0.0008)
0.0343 (0.0023)
0.0276 (0.0014)
0.0351 (0.0024)
MKK
0.0081 (0.0002)
0.0121 (0.0005)
0.0101 (0.0003)
0.0122 (0.0005)
MXL
0.0371 (0.0023)
0.0601 (0.0067)
0.0474 (0.0039)
0.0612 (0.0067)
TSI
0.0344 (0.0020)
0.0578 (0.0058)
0.0459 (0.0036)
0.0586 (0.0060)
YRI
0.0264 (0.0011)
0.0451 (0.0035)
0.0358 (0.0021)
0.0454 (0.0036)
* Shown are means (variances) of .
* Shown are means (variances) of .
Discussion
F
is directly related to the variance in allele frequencies among subpopulations. The dependence of F
on allele frequencies and genetic diversity has been observed [14]. In our study, an approximately linear relationship between , , , and with the squared difference of allele frequencies () was observed, as expected. By simulation, we found that all four estimators were unbiased for large numbers of subpopulations and individuals per subpopulation but that no one estimator was uniformly better than the others. For F
< 0.5, had smaller variances and MSE values. For F
> 0.5, had smaller variances and MSE values. For F
≈ 0.5, , , and had similar variance and MSE values.The numbers of individuals and markers have been reported to affect F
estimation [5]. We found that the number of subpopulations was more important than the number of individuals per subpopulation. Estimation of F
, both in terms of means and variances, stabilized with approximately 30 subpopulations, regardless of the number of individuals per subpopulation. This behavior occurs because there are r estimates of with which to estimate p and σ
2. Estimation was biased for r = 2 and improved as r increased, according to the Central Limit Theorem. Estimation was biased for n
< 30 and improved as n
increased (except for Wright’s estimator), also according to the Central Limit Theorem. Our proposed estimator is a minimum variance combination of Wright’s and Weir and Cockerham’s estimators and is less biased than Weir and Cockerham’s estimator for small samples sizes and less biased than Wright’s estimator for large sample sizes.
Conclusion
A modified F
estimator is proposed, which combines Wright’s and Weir and Cockerham’s estimators. It splits the difference in biases present in Wright’s and Weir and Cockerham’s estimators. We propose the routine use of this new and improved estimator of F
as a way to reduce the biases and limitations of the classical estimators. We demonstrated that, in order to estimate F
accurately, at least 30 subpopulations and 30 individuals per subpopulations are required.
Appendix
Proof of the Proposition
As , is asymptotically a chi-squared random variable, , , and
Thus
andLet
and be the asymptotic variance of , then the asymptotic variance of is , and the asymptotic covariance of is . Now we haveIf p
1 = ⋯ = p
2, . Note the ’s are independent, and
Let , p
0 = (p
0, …, p
0)’, γ
= n
/n, and b
= (−γ
1, …, −γ
, (1 − γ
), −γ
, …, −γ
)′, then for j = 1, …, r, and so by the Central Limit Theorem,
where , Ω = (ω
) with ω
= p
0(1 − p
0) if i = j and ω
= 0 if i ≠ j.Now we have, with ,Let Ω−1/2 be the square root of Ω: Ω = Ω′1/2Ω1/2, λ
1, …, λ
be all the eigenvalues of Ω′1/2
BΩ1/2, and Λ = diag(λ
1, …, λ
), then there is an orthogonal normal matrix Q such that Ω′1/2
BΩ1/2 = Q′ΛQ, and so
where the ’s are independent chi-squared random variables with one degree of freedom. This gives the desired result.Lastly, we prove
In fact,
It is known that for r = 1 or 2, with “=” if and only if n
1 = ⋯ = n
. Now we use induction to prove this is true for all integer r. In fact, suppose the above conclusion is true for some integer r > 2, then for integer r + 1,
and
Since by assumption ,
with “=” if and only if n
1 = ⋯n
, since , with “=” if and only if n
= n
.This gives δ ≥ 1 with “=” if and only if n
1 = ⋯ = n
.