Estimating and interpreting FST: the impact of rare variants.

Gaurav Bhatia¹, Nick Patterson, Sriram Sankararaman, Alkes L Price.

Abstract

In a pair of seminal papers, Sewall Wright and Gustave Malécot introduced FST as a measure of structure in natural populations. In the decades that followed, a number of papers provided differing definitions, estimation methods, and interpretations beyond Wright's. While this diversity in methods has enabled many studies in genetics, it has also introduced confusion regarding how to estimate FST from available data. Considering this confusion, wide variation in published estimates of FST for pairs of HapMap populations is a cause for concern. These estimates changed, in some cases more than twofold, when comparing estimates from genotyping arrays to those from sequence data. Indeed, changes in FST from sequencing data might be expected due to population genetic factors affecting rare variants. While rare variants do influence the result, we show that this is largely through differences in estimation methods. Correcting for this yields estimates of FST that are much more concordant between sequence and genotype data. These differences relate to three specific issues: (1) estimating FST for a single SNP, (2) combining estimates of FST across multiple SNPs, and (3) selecting the set of SNPs used in the computation. Changes in each of these aspects of estimation may result in FST estimates that are highly divergent from one another. Here, we clarify these issues and propose solutions.

Year: 2013 PMID: 23861382 PMCID: PMC3759727 DOI: 10.1101/gr.154831.113

Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043

Since its introduction by Sewall Wright (1949) and Gustave Malécot (1948), FST estimation (Weir and Cockerham 1984; Holsinger and Weir 2009) has become a key component of studies of population structure in humans (International HapMap Consortium 2007; Li et al. 2008; The 1000 Genomes Project Consortium 2010; International HapMap 3 Consortium 2010) and other species (Malécot 1948; Wright 1949; Selander and Hudson 1976; Guries and Ledig 1982; Ellstrand and Elam 1993; Palumbi and Baker 1994). Though the utility of FST and related measures has been subject to recent debate (Jost 2008; Ryman and Leimar 2009), FST continues to be widely used by population geneticists (Xu et al. 2009; Edelaar et al. 2012; Hangartner et al. 2012). Despite this widespread use in genetic studies, confusion remains about what FST is and how to estimate it. Beyond Wright's original description of FST as a ratio of variances, FST has been conceptually defined in many ways (Wright 1949; Cockerham 1969; Cavalli-Sforza and Bodmer 1971; Nei 1973; Slatkin 1991; Hudson et al. 1992). Additionally, multiple estimators for FST have been described in the literature (Nei 1973, 1986; Weir and Cockerham 1984; Hudson et al. 1992; Holsinger 1999; Weir and Hill 2002), often making the correct choice of estimator unclear. With this diversity of definition and estimation in mind, we consider estimates of FST published by The 1000 Genomes Project Consortium (2010) of 0.052 for European and East Asian populations and 0.071 for European and West African populations. These are less than half of the published estimates, 0.111 and 0.156, from HapMap3 data (International HapMap 3 Consortium 2010) and may be the result of demography that differentially impacts FST at rare variants. 
These estimates have subsequently been used to simulate properties of recent rare variants (Mathieson and McVean 2012), making it imperative to know whether this reduction in FST is a meaningful result of the inclusion of rare variants or merely an artifact of estimation. To answer these questions, we examine the issues surrounding FST estimated on data containing rare variants. We focus our attention on FST estimation in the context of comparing two populations (potentially with differing amounts of drift since the populations split) using a series of bi-allelic SNPs. We use the definition of Weir and Hill (2002), which allows for population-specific FST. Using this definition, we divide the issues surrounding estimation into three categories and examine them using both simulated and 1000 Genomes data: (1) choice of FST estimator, (2) combining estimates of FST across multiple SNPs, and (3) dependence of FST on the set of SNPs analyzed. We conclude that the lower FST estimates reported by The 1000 Genomes Project Consortium (2010) are a consequence of the estimation method that was applied and are not informative for human demographic history. Correcting for differences in estimation method yields FST estimates of 0.106 for Europeans and East Asians and 0.139 for Europeans and West Africans, much closer to HapMap3 estimates. Overall, our results contradict a recent statement that “among human populations, FST is typically estimated to be <0.1” by Mathieson and McVean (2012), which was based on results from The 1000 Genomes Project Consortium (2010). Altogether, in the setting of rare variants, a careful protocol for producing FST estimates is warranted. We provide such a protocol.

Results

Theory

Defining FST

We use the definition of Weir and Hill (2002) (WH) throughout our manuscript to analyze estimators in the context of comparing two populations at a series of bi-allelic SNPs. In this context, WH define FST as the correlation between randomly drawn alleles from a single population relative to the most recent common ancestral population:

$$E[p_i^s \mid p_{anc}^s] = p_{anc}^s, \qquad \mathrm{Var}(p_i^s \mid p_{anc}^s) = F_{ST}^i \, p_{anc}^s (1 - p_{anc}^s)$$

where $p_i^s$ is the allele frequency of the derived allele in population i at SNP s, $p_{anc}^s$ is the allele frequency of the derived allele in the ancestral population at SNP s, and $F_{ST}^i$ is the population-specific FST for population i. For a pair of populations, FST is

$$F_{ST} = \frac{F_{ST}^1 + F_{ST}^2}{2}$$

Although we use the WH definition of FST to compare estimation methods, numerous alternate definitions exist in the literature (see Supplemental Material), in part because of confusion regarding Wright's original description of FST. Wright (1949) defined FST as the correlation of randomly drawn gametes from the same population, relative to the total population. However, he did not clearly specify the “total population,” leaving subsequent investigators to interpret its meaning. For Nei (1973) the “total population” is the combination of the two population samples. This means that FST quantifies drift relative to an average of the two population samples. For Cockerham (1969) and WH, the “total” population is the most recent common ancestral population to the two populations being considered. Consistent with those investigators, we view FST as a parameter of the evolutionary process and not a statistic from observed samples as Nei has described. To view FST as a parameter of the evolutionary process, the Cockerham and WH definitions assume that studied SNPs were polymorphic in the ancestral population. This is clear from Equation 1, as $p_{anc}^s = 0$ for SNPs arising from recent mutations. While this assumption does not always hold, we believe that the WH definition provides a valid basis for comparing estimation methods, and we also assess the performance of estimators when this assumption is violated.
By defining only one FST for both populations in a comparison, Cockerham (1969) and Weir and Cockerham (1984) also assumed that the two populations have experienced identical amounts of drift since splitting. This assumption, which may be unrealistic in many real data sets, was generalized by WH, and motivates our use of the WH definition. In this study, we focus on cases without migration and admixture, though these cases were considered in WH and are the subject of future work (B Weir, pers. comm.). In addition to the definitions described above, FST has been related to divergence time, coalescent times, and migration rates. Additionally, likelihood-based definitions view FST as a parameter of the distribution of allele frequencies in current populations (Balding and Nichols 1995; Nicholson et al. 2002; Balding 2003). Further details are provided in the Supplemental Material.
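The WH definition fixes only the mean and variance of each population's allele frequency given the ancestral frequency. As a minimal sketch, assuming a beta distribution of the kind used in likelihood-based treatments (Balding and Nichols 1995), the following Python draws frequencies with exactly those two moments and checks them numerically. The beta form, the function name `draw_population_frequency`, and the parameter values are illustrative assumptions, not part of the WH definition.

```python
import random

def draw_population_frequency(p_anc, fst, rng):
    """Draw one present-day allele frequency given ancestral frequency p_anc.

    Assumption: a beta distribution, chosen so that the mean is p_anc and
    the variance is fst * p_anc * (1 - p_anc).
    """
    scale = (1.0 - fst) / fst
    return rng.betavariate(p_anc * scale, (1.0 - p_anc) * scale)

rng = random.Random(0)
p_anc, fst = 0.3, 0.1  # hypothetical values for illustration
draws = [draw_population_frequency(p_anc, fst, rng) for _ in range(100_000)]

mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)

print(f"mean frequency: {mean:.4f} (target {p_anc})")
print(f"variance / [p(1-p)]: {var / (p_anc * (1 - p_anc)):.4f} (target {fst})")
```

Any other distribution with the same first two moments would satisfy the definition equally well; the beta is used here only because it is convenient to sample.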

Choice of FST estimator

While estimators of FST handle issues related to finite sample size, we are interested in their behavior in the limit of large sample sizes, or the “quantity being estimated.” Most published estimates of FST are produced using the Weir and Cockerham (WC) (Weir and Cockerham 1984) (>8000 citations) or Nei (Nei 1973) (>5500 citations) estimators. However, we recommend a different estimator motivated by Hudson et al. (1992).

The WC estimator was developed for the case of populations with identical FST, and if it is used when FST is not identical for both populations, we demonstrate that the WC quantity being estimated becomes dependent on the ratio of sample sizes $M = n_1/n_2$ according to (see Methods):

$$F_{ST}^{WC} = \frac{F_{ST}^1 + F_{ST}^2}{F_{ST}^1 + F_{ST}^2 + \frac{2}{M+1}\left[ M (1 - F_{ST}^1) + (1 - F_{ST}^2) \right]}$$

We note that this variation with sample size is not due to any flaw in the WC estimator, but rather due to the use of the WC estimator for a purpose different from what was intended. We also note that the WC estimator is often used to produce single SNP estimates of FST to detect selection. We caution that when sample sizes are very different, the WC estimator can give inflated single SNP estimates of FST, resulting in false-positive signals of selection (see Supplemental Material).

In the context of the WH definition, the Nei estimator will consistently overestimate FST, and the degree of overestimation will depend upon the magnitude of FST values (see Methods):

$$F_{ST}^{Nei} = \frac{2 (F_{ST}^1 + F_{ST}^2)}{4 - (F_{ST}^1 + F_{ST}^2)}$$

We note that this result, with a maximum value of 2, makes it impossible to view FST as a correlation.

The Hudson estimator (Hudson et al. 1992; Keinan et al. 2007) produces estimates that are the simple average of FST according to the WH definition. These estimates are independent of sample sizes even when FST is not identical across populations. We note that while Hudson did not explicitly provide an estimator of FST, he did describe a method of estimation that corresponds to the estimator that we explicitly provide here (see Supplemental Material). Thus, we refer to this estimator as the Hudson estimator. Hudson estimates correspond to a simple average of the population-specific FST estimates as given by (see Methods):

$$F_{ST}^{Hudson} = \frac{F_{ST}^1 + F_{ST}^2}{2}$$

We note that the Hudson estimator is a simple average of the population-specific estimators proposed by Weir and Hill (2002). We provide comparisons of this estimator to the WC and Nei estimators when applied to simulated data (see Supplemental Material) and empirical data (see below).
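The large-sample behavior described for the three estimators can be explored numerically. The sketch below assumes the limits obtained by taking ratios of expectations under the WH moments, with M taken to be n1/n2; the function names and the example values of the population-specific FST are hypothetical.

```python
def wc_limit(f1, f2, m):
    """Large-sample WC quantity for sample-size ratio m (assumed n1 / n2)."""
    within = 2.0 * (m * (1.0 - f1) + (1.0 - f2)) / (m + 1.0)
    return (f1 + f2) / (f1 + f2 + within)

def nei_limit(f1, f2):
    """Large-sample Nei quantity; reaches 2 when both inputs equal 1."""
    return 2.0 * (f1 + f2) / (4.0 - (f1 + f2))

def hudson_limit(f1, f2):
    """Large-sample Hudson quantity: the simple average."""
    return (f1 + f2) / 2.0

f1, f2 = 0.05, 0.15  # hypothetical population-specific FST values
for m in (0.1, 1.0, 10.0):
    print(f"M = {m:>4}: WC = {wc_limit(f1, f2, m):.4f}")
print(f"Nei    = {nei_limit(f1, f2):.4f}")
print(f"Hudson = {hudson_limit(f1, f2):.4f}")
```

With unequal population-specific values, only the WC quantity moves as M changes; it coincides with the simple average at M = 1 and with the common value when the two inputs are equal.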

Combining estimates of FST across multiple SNPs

We investigate two approaches for combining estimates of FST across multiple SNPs. In the first approach, variance components—the numerator and denominator—are averaged separately and the genome-wide estimate of FST is a “ratio of averages” (Weir and Cockerham 1984; International HapMap 3 Consortium 2010). In the second approach, single SNP estimates of FST are averaged across SNPs. The resulting “average of ratios” is reported as the genome-wide estimate (The 1000 Genomes Project Consortium 2010) (see Methods). In the context of the WH definition, the numerator of the Hudson FST estimator (see Methods) is an unbiased estimator of the variance between populations. The denominator is an unbiased estimator of the total variance in the ancestral population. However, this does not mean that the ratio of the estimators is itself an unbiased estimator of FST. We are not aware of any unbiased estimator. While an unbiased estimator is not available, FST estimates produced using a ratio of these two unbiased estimates will be asymptotically consistent, in the sense that they will converge to the correct underlying value as the number of independent SNPs increases. This is the basis of our recommendation that FST be estimated as a ratio of averages. We analyze the effects of choosing an average of ratios in coalescent simulations detailed in the Supplemental Material.
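The two ways of combining single-SNP estimates can be sketched in a few lines of Python. The per-SNP numerators and denominators below are built from toy population frequencies using a large-sample form, (p1 - p2)^2 over p1(1 - p2) + p2(1 - p1), with no finite-sample correction; the frequencies and function names are illustrative assumptions only.

```python
def ratio_of_averages(numerators, denominators):
    """Combine SNPs by summing variance components before dividing."""
    return sum(numerators) / sum(denominators)

def average_of_ratios(numerators, denominators):
    """Combine SNPs by averaging single-SNP ratios (skips monomorphic SNPs)."""
    ratios = [n / d for n, d in zip(numerators, denominators) if d > 0]
    return sum(ratios) / len(ratios)

# Toy data: one common SNP and three rare, population-private SNPs.
freqs = [(0.5, 0.3), (0.01, 0.0), (0.01, 0.0), (0.01, 0.0)]
nums = [(p1 - p2) ** 2 for p1, p2 in freqs]
dens = [p1 * (1 - p2) + p2 * (1 - p1) for p1, p2 in freqs]

roa = ratio_of_averages(nums, dens)
aor = average_of_ratios(nums, dens)
print(f"ratio of averages: {roa:.4f}")
print(f"average of ratios: {aor:.4f}")
```

In this toy case the rare SNPs contribute little to either summed component, so the ratio of averages stays close to the common-SNP value, whereas each rare SNP counts as a full observation in the average of ratios.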

Dependence of FST on the set of SNPs analyzed

It is well known that population genetic factors can cause variation in FST estimates, and that ascertainment schemes can alter the properties of studied SNPs (Ramírez-Soriano and Calafell 2008; Albrechtsen et al. 2010). For example, selection can result in differences between FST estimated on genic and nongenic SNPs (Clark et al. 2005; Barreiro et al. 2008; Hernandez et al. 2011); complex demography can cause FST to vary with SNP allele frequency (Schaffner et al. 2005) (see below). Indeed, variation in FST estimates between ascertained classes of SNPs can be used to test a variety of hypotheses about population history (Weir et al. 2005; McVicker et al. 2009). This usage of FST demonstrates that there is no single correct ascertainment scheme, as FST is a parameter of both the populations and the set of SNPs that are used in the computation. Though there is no single correct ascertainment scheme, ascertainment in an outgroup may have desirable properties. Outgroup ascertainment guarantees that studied SNPs were polymorphic in the most recent common ancestral population (ignoring recurrent mutation), satisfying an assumption made in the Weir and Hill definition. This leads estimates of FST to be independent of allele frequency and depend upon time since divergence according to a simple equation (see Supplemental Material, Equation s1). While we view these as desirable properties, if no reasonable outgroup sample is available, it may become necessary to choose SNPs that are polymorphic in one, both, or either of the populations studied. These choices will affect the estimate of FST produced and may explain discrepancies in FST estimates across studies of the same populations. We explore the effects of various ascertainment schemes on FST estimates across the allele frequency spectrum in a variety of simulated demographic scenarios (see Supplemental Material).
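The ascertainment schemes discussed above amount to different filters on the SNP set. A minimal sketch, assuming each SNP is represented by a dictionary of sample allele frequencies with hypothetical keys `pop1`, `pop2`, and `outgroup`:

```python
def is_polymorphic(freq):
    """A SNP is polymorphic in a sample if both alleles are observed."""
    return 0.0 < freq < 1.0

# Each rule decides whether a SNP is kept under a given ascertainment scheme.
RULES = {
    "outgroup": lambda s: is_polymorphic(s["outgroup"]),
    "pop1": lambda s: is_polymorphic(s["pop1"]),
    "pop2": lambda s: is_polymorphic(s["pop2"]),
    "both": lambda s: is_polymorphic(s["pop1"]) and is_polymorphic(s["pop2"]),
    "either": lambda s: is_polymorphic(s["pop1"]) or is_polymorphic(s["pop2"]),
}

def ascertain(snps, scheme):
    """Return the SNPs retained under the named ascertainment scheme."""
    keep = RULES[scheme]
    return [s for s in snps if keep(s)]

# Toy SNPs (hypothetical frequencies).
snps = [
    {"pop1": 0.2, "pop2": 0.0, "outgroup": 0.0},
    {"pop1": 0.5, "pop2": 0.4, "outgroup": 0.3},
    {"pop1": 0.0, "pop2": 0.1, "outgroup": 0.2},
    {"pop1": 0.0, "pop2": 0.0, "outgroup": 0.5},
]
counts = {scheme: len(ascertain(snps, scheme)) for scheme in RULES}
print(counts)
```

Reporting the number of SNPs retained under each scheme, as `counts` does here, is one simple way to document the ascertainment used alongside an FST estimate.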

Other FST estimators

In addition to the WC, Nei, and Hudson estimators that we analyzed above, we have also analyzed several additional estimators. Our results on each of these estimators are described in detail in the Supplemental Material. The moment-based estimator of Weir and Hill (2002) (WH) introduced population-specific estimates of FST. Weir and Hill recommend a sample size weighted average of these estimates, which may result in a wide variation with sample size. However, one could also report these estimates independently or perform a simple average of these estimates. A separate maximum-likelihood estimator of Weir and Hill (2002) (WH-ML) is based upon a normal approximation to genetic drift. However, the equations provided for the WH-ML estimator are not applicable to the general case of unequal sample size, and the investigators recommend that estimates be “simply averaged across loci,” causing WH-ML estimates to vary widely with the inclusion of rare variants. We evaluated two max-likelihood estimators based on the beta-binomial likelihood using point estimates for the allele frequency in the ancestral population (D Balding, pers. comm.). These estimates perform well for small values of FST, but do poorly as FST increases. It may be possible to improve on these methods by integrating over the distribution of ancestral allele frequencies, an interesting direction for future research. We also considered the beta-binomial MCMC method of Holsinger (1999). However, our simulations suggest that Holsinger estimates increase dramatically if rare SNPs are analyzed. Additionally, the MCMC-based approach imposes a significant computational burden, making the method difficult to apply to modern data sets.

Analysis of 1000 Genomes data

We analyzed data from 1000 Genomes populations (The 1000 Genomes Project Consortium 2010) to illustrate the effects of changes in each of the aspects of estimation described above. We focus largely on the comparison of Utah residents of European ancestry (CEU) and Chinese individuals from Beijing (CHB), as the Yoruba in Ibadan, Nigeria (YRI) sample functions as a natural outgroup for ascertainment of SNPs. This ascertainment has desirable properties (see above). Estimates of FST for CEU and CHB are 0.106 (s.e. 0.0006), 0.112 (s.e. 0.0006), and 0.107 (s.e. 0.0006) for the WC, Nei, and Hudson estimators, respectively. These estimates were produced over SNPs ascertained as polymorphic in YRI. The higher Nei estimate is expected. In addition, sample sizes for CEU (85 individuals) and CHB (97 individuals) are similar, so we do not expect WC and Hudson estimates to differ. In order to investigate the effects of sample size variation we selected 14 individuals—the size of the smallest sample (Iberian populations in Spain; IBS) in the 1000 Genomes Consortium data—from both CEU and CHB to produce populations CEU14 and CHB14. Hudson FST estimates for CEU14 and CHB are similar to those for CHB14 and CEU (see Table 1). However, WC estimates are 0.114 (s.e. 0.0006) and 0.107 (s.e. 0.0006) for CEU14 vs. CHB and CHB14 vs. CEU, respectively. The difference between these estimates is statistically significant (greater than eight standard errors). To verify that this difference is not due to different sets of polymorphic SNPs, we re-estimated FST restricting to SNPs that were polymorphic in YRI and at least one of CEU14 or CHB14. Re-estimated values of FST were similar to those above and WC estimates remained discordant (data not shown).

Table 1.

FST estimates for pairs of populations in 1000 Genomes

The effect of sample size variation is further exacerbated when ascertainment is performed within the populations studied. For example, in comparing IBS (with a sample size of only 14 individuals) to YRI, no reasonable outgroup population exists in the 1000 Genomes data. If we ascertain within one of these populations, WC estimates are 0.121 and 0.144 for ascertainment in YRI and IBS, respectively. These estimates, computed using identical populations and even identical individuals, are highly divergent at >25 standard errors apart, whereas Hudson estimates are much more stable (see Table 1). This underscores that FST estimates can vary substantially based on the choice of estimator. Regardless of choice of estimator, our estimates of FST from 1000 Genomes data are relatively close to previously reported values of FST (see Supplemental Table S1 for all populations). This suggests that while the choice of estimator can impact the resulting value of FST, it does not explain the disparate results reported by the 1000 Genomes Consortium, and other aspects of estimation may be involved. We consider these in the sections below.

From 1000 Genomes data, we estimated FST for CEU and CHB as 0.106 (s.e. 0.0006) and 0.072 (s.e. 0.0003) for the ratio of averages and average of ratios, respectively. These estimates were produced over SNPs ascertained as polymorphic in YRI. This suggests that the result reported by the 1000 Genomes Consortium (0.052) may be partially explained by the large reduction in FST obtained by use of an average of ratios. These results are replicated for several comparisons of populations included in the 1000 Genomes data (see Table 2).

Table 2.

A comparison of the FST estimated using 1000 Genomes and HapMap data by either using a ratio of averages or an average of ratios

To explore the effect of the rare variants included in sequence data, we compared our results to those obtained using HapMap3 genotypes. We obtain FST estimates for CEU and CHB of 0.110 (s.e. 0.0010) and 0.089 (s.e. 0.0006) using the ratio of averages and average of ratios, respectively. This suggests that the inclusion of rare variants with low single-SNP FST estimates in the 1000 Genomes data tends to exacerbate the discrepancy produced by the average of ratios. We expect that this discrepancy will grow with sample sizes and sequencing depth (see Supplemental Fig. S2). Ultimately, using the average of ratios may make estimates incomparable across studies and unrelated to population demographic history. While the use of the average of ratios clearly results in lower estimates of FST, these estimates are not as low as those published by the 1000 Genomes Consortium. Below, we explore the possibility that the remaining discrepancy can be accounted for by differences in the set of SNPs analyzed.

When estimating FST for CEU and CHB, we compared the effects of ascertaining in YRI (YRI ascertainment) versus ascertaining SNPs that were polymorphic in CEU, CHB, both populations, or either population (see Table 3). When using a ratio of averages, our estimates of FST were ∼0.103 for all of these modified ascertainment schemes. These can be compared to an FST of 0.106 produced from YRI ascertainment in 1000 Genomes data or 0.110 in HapMap3 data. Though statistically significant, these results suggest that the effects of modified ascertainment are not very large when analyzing human populations using a ratio of averages. This indicates that reasonable estimates of FST may be produced when comparing populations without access to an outgroup.

Table 3.

Assessing the effect of ascertainment schemes and combination methods on the resulting FST estimate for CEU and CHB

However, when using an average of ratios and including all SNPs polymorphic in either CEU or CHB, our estimate changed from 0.072 to 0.047 (s.e. 0.0002), which is similar to the result reported by the 1000 Genomes Consortium. This suggests that much of the discrepancy between previously published estimates of FST for CEU and CHB and the published 1000 Genomes estimate is explained by using the average of ratios and an ascertainment scheme that includes all SNPs that are polymorphic in either of the two populations. These results are replicated for comparisons of continental populations included in the 1000 Genomes data as we obtained values of 0.056 and 0.063 for comparisons of CEU-YRI and CHB-YRI, respectively.

Separately, we note that when comparing CEU to CHB on the 1000 Genomes data we observed larger FST estimates of 0.108 for the lowest frequency SNPs (0.0 < MAF ≤ 0.05) versus estimates of 0.103 for the most common SNPs (0.45 < MAF < 0.5) when ascertaining in CEU. These estimates were 0.131 and 0.097 when ascertaining in CHB (see Fig. 1). Increased FST for rare variants suggests that bottlenecks are likely to be a stronger influence on FST estimates for CEU and CHB than recent expansions. Our results also indicate that bottlenecks in the population history of CHB are likely to be stronger than those in the population history of CEU, consistent with the findings of Keinan et al. (2007). This is in contrast to the much lower FST estimates reported on sequence data by the 1000 Genomes Consortium, which might suggest that expansions are a stronger influence on FST at rare SNPs.

Figure 1.

Allele frequency dependence of FST under different ascertainment schemes. This shows FST for CEU and CHB as a function of allele frequency when ascertaining in either CEU, CHB, or YRI. The increased FST for rare variants is consistent with bottlenecks being a stronger force on FST for CEU and CHB than recent expansion. In fact, this is consistent with a stronger bottleneck in the population history of CHB. We note that this frequency dependence disappears when ascertaining in YRI, suggesting that YRI is a reasonable outgroup for the comparison of CEU and CHB.

Under a simple demographic history (i.e., without migration or admixture), this dependence on minor allele frequency is expected to disappear when ascertaining SNPs in an outgroup. When ascertaining in YRI we do not observe any significant dependence on frequency, which suggests that YRI is a reasonable outgroup for the comparison of CEU and CHB. We note that when ascertaining in YRI, our genome-wide estimate of FST (0.106) is lower than estimated from HapMap3 (0.110). To investigate whether this difference is due to non-random ascertainment of HapMap3 SNPs, we sampled 10 subsets of SNPs from the 1000 Genomes data that matched the allele frequency spectrum of HapMap3 SNPs (see Supplemental Material). We estimated FST for CEU and CHB in each of these subsets, obtaining values ranging from 0.106 to 0.107 (s.e. 0.0010). This suggests that HapMap3 SNPs are more highly differentiated than random SNPs, consistent with previous findings on the effects of ascertainment on genotyping arrays (Clark et al. 2005; Albrechtsen et al. 2010).

Recommendations

Because the Hudson estimator is not sensitive to the ratio of sample sizes and does not systematically overestimate FST, we recommend that it be used to estimate FST for pairs of populations. The Hudson estimator for FST and a corresponding block-jackknife estimator for the standard error of FST are implemented in the EIGENSOFT software package (EIGENSOFT 4.2 http://www.hsph.harvard.edu/faculty/alkes-price/software/). Using an average of ratios will result in large reductions in FST estimates. This effect will be exacerbated when estimating FST from sequence data. Therefore, we recommend using a ratio of averages. Estimating FST from SNPs ascertained in an outgroup has the following valuable properties: (1) FST estimates are expected to be independent of allele frequency in the outgroup, and (2) FST estimates will relate to divergence time according to Supplemental Equation s1 if there has been no migration or admixture. However, data from a reasonable outgroup is not always available. Additionally, comparison of FST between ascertained classes of SNPs (e.g., genic vs. nongenic) can be used to test a variety of hypotheses regarding population history. Thus, we recommend that future publications of FST estimates include details of the ascertainment scheme used, including the proportion of SNPs that are polymorphic in each sample.
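A block jackknife standard error for a ratio-of-averages estimate can be sketched as follows. This is a plain, unweighted delete-one-block jackknife written for illustration; it is not taken from EIGENSOFT, whose implementation may differ (for example, in how blocks are defined or weighted). The function name and toy inputs are assumptions.

```python
import math

def jackknife_ratio(block_nums, block_dens):
    """Delete-one-block jackknife for a ratio-of-averages estimate.

    block_nums and block_dens hold, for each block of contiguous SNPs,
    the summed single-SNP numerators and denominators.
    """
    total_num, total_den = sum(block_nums), sum(block_dens)
    estimate = total_num / total_den
    g = len(block_nums)
    # Recompute the ratio with each block left out in turn.
    leave_one_out = [
        (total_num - n) / (total_den - d)
        for n, d in zip(block_nums, block_dens)
    ]
    mean_loo = sum(leave_one_out) / g
    variance = (g - 1) / g * sum((x - mean_loo) ** 2 for x in leave_one_out)
    return estimate, math.sqrt(variance)

# Toy input: two blocks with different numerator sums.
est, se = jackknife_ratio([1.0, 3.0], [10.0, 10.0])
print(f"estimate = {est:.3f}, s.e. = {se:.3f}")
```

Blocks are used rather than single SNPs so that correlated neighboring SNPs are dropped together; the block size is a choice left to the analyst in this sketch.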

Discussion

The use of FST to quantify the genetic distance between populations and to assess differentiation at individual SNPs is widespread. Here, we point out several challenges surrounding FST and provide a protocol for its robust estimation in the case of two populations and bi-allelic SNPs. We show that the estimator of FST, the method of combining estimates across SNPs, and the scheme for SNP ascertainment can impact the resulting estimate of FST. An inappropriate choice for any of these aspects of estimation can lead to widely disparate estimates of FST, especially in a setting of large numbers of rare variants. Indeed, the FST estimate 0.052 for CEU and CHB reported by The 1000 Genomes Project Consortium (2010) underscores the need for a careful analysis. Utilizing the careful protocol set out here, we provide an estimate of 0.106 for CEU and CHB on 1000 Genomes data, which is close to our estimate of 0.110 on HapMap3 (International HapMap 3 Consortium 2010) data. Additionally, we show that when ascertaining for SNPs in one of the two populations studied, rare variants have higher FST estimates than common variants. This is the exact opposite of the results suggested by the 1000 Genomes data. The difference between these two results changes the conclusions that are drawn about the role of demography in shaping the patterns of differentiation between human populations. In addition to altering genome-wide estimates of FST, the choice of estimator can introduce inflation at the level of single SNP estimates, potentially making it difficult to interpret high FST estimates as signals of selection (see Supplemental Material). Another concern about FST was considered by Jost (2008), who showed that as heterozygosity becomes large, FST will naturally approach 0—indicating low differentiation—even if all alleles at a locus are population private. In an effort to avoid this problem, Jost introduced D as an alternate measure of differentiation. 
However, it has been suggested that Jost's D shares the same problems as FST, and that these problems are sometimes even more pronounced for Jost's D (Ryman and Leimar 2009). In any case, FST and related measures “unquestionably provide important insights into population structure” (Jost 2008), particularly for species such as humans, in which heterozygosity is relatively low. In conclusion, we recommend the use of the Hudson estimator (Hudson et al. 1992; Keinan et al. 2007) of FST that is independent of sample size. We demonstrate that a ratio of averages is an appropriate method for combining these estimates across multiple SNPs. We also show the value of estimating FST from SNPs ascertained in an outgroup, though we do not view this as a necessity. We do recommend, however, that future publications of FST estimates include details of the ascertainment of SNPs.

Methods

Weir and Cockerham's FST (WC)

Definition

Weir and Cockerham (1984) used the definition provided by Cockerham (1969) of FST as a ratio of the variance between populations to the total variance in the ancestral population. We analyze this definition in the Supplemental Material.

Estimator

In the setting of population-specific FST, described by the WH definition, the WC estimator will result in estimates that vary with the ratio of sample sizes (see Supplemental Material for details). In the case of two populations and biallelic SNPs, the WC estimator is

$$\hat{F}_{ST}^{WC} = 1 - \frac{2 \frac{n_1 n_2}{n_1 + n_2} \frac{1}{n_1 + n_2 - 2} \left[ n_1 \tilde{p}_1 (1 - \tilde{p}_1) + n_2 \tilde{p}_2 (1 - \tilde{p}_2) \right]}{\frac{n_1 n_2}{n_1 + n_2} (\tilde{p}_1 - \tilde{p}_2)^2 + \left( \frac{2 n_1 n_2}{n_1 + n_2} - 1 \right) \frac{1}{n_1 + n_2 - 2} \left[ n_1 \tilde{p}_1 (1 - \tilde{p}_1) + n_2 \tilde{p}_2 (1 - \tilde{p}_2) \right]}$$

where $n_i$ is the sample size and $\tilde{p}_i$ is the sample allele frequency in population i for i ∈ {1, 2}. Then, in the limit of large sample sizes (n − 1 ≈ n), we can assume that sample allele frequencies become close to population allele frequencies ($\tilde{p}_i \approx p_i$). We analyze the estimator as the sample sizes increase, but their ratio goes to a constant M (see Supplemental Material for a derivation). In this case, we show (see Supplemental Material) that the estimate tends toward Equation 1 (see Results). If the sample sizes are equal, M = 1, then the estimate becomes $(F_{ST}^1 + F_{ST}^2)/2$. Also, when FST is identical for both populations, i.e., $F_{ST}^1 = F_{ST}^2 = F_{ST}$, it is straightforward to see that $F_{ST}^{WC} = F_{ST}$, i.e., the estimate will not depend upon the ratio of sample sizes (M). We note that if FST is identical across populations, weighting by sample sizes will reduce the variance of the estimator. This was the intent of Weir and Cockerham. If the sample sizes are unequal and this assumption does not hold, however, the estimate will depend upon the ratio of sample sizes underlying the limit. Given the complexity of human population history, it is unlikely that this assumption will hold in general. This means that even if large numbers of samples and SNPs are used to estimate FST for a pair of populations, this estimate may not be comparable across studies with different sample sizes. We note that when FST is not identical for both populations, it is possible to estimate FST separately for each population (i.e., $F_{ST}^1$ and $F_{ST}^2$) (Weir and Hill 2002). Population-specific estimates produced according to the method given in Weir and Hill (2002) will not depend on sample size. We focus here on estimating FST for a pair of populations, as this is a very common use when analyzing human genetic data.
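A minimal Python sketch of a two-population, bi-allelic WC-style estimator, written in the mean-square form commonly attributed to Weir and Cockerham (1984). The function name, the decomposition into `msp` and `msg`, and the example frequencies are illustrative assumptions; the example only shows how the single-SNP value shifts with the ratio of sample sizes when allele frequencies are held fixed.

```python
def wc_fst(p1, n1, p2, n2):
    """Single-SNP WC-style estimate from sample frequencies and sample sizes."""
    pooled = n1 * n2 / (n1 + n2)
    # Mean square between populations.
    msp = pooled * (p1 - p2) ** 2
    # Mean square within populations, weighted by sample size.
    msg = (n1 * p1 * (1 - p1) + n2 * p2 * (1 - p2)) / (n1 + n2 - 2)
    nc = 2 * pooled
    return (msp - msg) / (msp + (nc - 1) * msg)

# Same allele frequencies, different sample-size ratios (hypothetical values).
equal = wc_fst(0.5, 1_000_000, 0.3, 1_000_000)
skewed = wc_fst(0.5, 4_000_000, 0.3, 1_000_000)
print(f"n1 = n2:   {equal:.4f}")
print(f"n1 = 4*n2: {skewed:.4f}")
```

Because the within-population term is weighted by sample size, the population with the larger sample dominates the denominator, which is the mechanism behind the dependence on M described above.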

Nei's FST

Nei (1986) defined FST (he used the term GST) based upon the sample gene diversity between and within populations as

$$F_{ST} = \frac{D_{ST}}{H_T}$$

where $D_{ST}$ is the average gene diversity between populations and $H_T$ is the diversity in the average of the two population samples. We consider this definition in detail in the Supplemental Material. In the case of two populations and bi-allelic SNPs, Nei's estimator is

$$\hat{F}_{ST}^{Nei} = \frac{(\tilde{p}_1 - \tilde{p}_2)^2}{2 \tilde{p}_{avg} (1 - \tilde{p}_{avg})}$$

where

$$\tilde{p}_{avg} = \frac{\tilde{p}_1 + \tilde{p}_2}{2}$$

and $\tilde{p}_i$ is the sample allele frequency in population i for i ∈ {1, 2}. We note that this is Nei's updated estimator and, in the case of two populations, differs from the estimator given in Nei (1973) and Nei and Chesser (1983) by a factor of 2. We use the estimator given in Nei (1986), as it is most closely related to the other estimators considered. Using the definition of Weir and Hill (2002) we show (see Supplemental Material) that estimates made using Nei's estimator will tend toward Equation 2 (see Results), with a maximum value of 2 as $F_{ST}^1, F_{ST}^2 \to 1$. This overestimates the average of population-specific FST values and alters the relation from this average of FST values to divergence time (see Supplemental Material). Estimates of FST given for the Nei estimator were generated using the proposed estimator for the numerator (see Supplemental Material) and a simple estimator for the denominator.
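The maximum value of 2 is easy to see in a plug-in version of the two-population Nei estimator. The sketch below assumes the simple form with the squared frequency difference over twice the heterozygosity of the averaged frequency, without any finite-sample correction; the function name is hypothetical.

```python
import math

def nei_fst(p1, p2):
    """Plug-in two-population Nei-style estimate from allele frequencies."""
    p_avg = (p1 + p2) / 2
    denominator = 2 * p_avg * (1 - p_avg)
    if denominator == 0:
        # Undefined when the SNP is monomorphic in the combined sample.
        return math.nan
    return (p1 - p2) ** 2 / denominator

fixed_difference = nei_fst(1.0, 0.0)
moderate = nei_fst(0.5, 0.3)
print(f"fixed difference: {fixed_difference}")
print(f"p1 = 0.5, p2 = 0.3: {moderate:.4f}")
```

For a fixed difference between the two populations the averaged frequency is 0.5, so the denominator is 0.5 while the numerator is 1, giving a value of 2 rather than 1.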

Hudson's FST

Hudson et al. (1992) defined FST in terms of heterozygosity. The fundamental difference from Nei's estimator is that, for Hudson, the total variance is based upon the ancestral population and not the current sample. Hudson's estimator for FST is given by

$$\hat{F}_{ST}^{Hudson} = 1 - \frac{H_w}{H_b},$$

where $H_w$ is the mean number of differences within populations, and $H_b$ is the mean number of differences between populations. While Hudson did not give explicit equations for $H_w$ and $H_b$, we cast his description into an explicit estimator (see Supplemental Material for a derivation). The estimator that we analyze is

$$\hat{F}_{ST}^{Hudson} = \frac{(\tilde{p}_1-\tilde{p}_2)^2-\frac{\tilde{p}_1(1-\tilde{p}_1)}{n_1-1}-\frac{\tilde{p}_2(1-\tilde{p}_2)}{n_2-1}}{\tilde{p}_1(1-\tilde{p}_2)+\tilde{p}_2(1-\tilde{p}_1)},$$

where $n_i$ is the sample size and $\tilde{p}_i$ is the sample allele frequency in population i for i ∈ {1, 2}. Analyzing this estimator using the definition of Weir and Hill (2002), we show (see Supplemental Material) that FST estimated using Hudson's estimator will tend toward Equation 3 (see Results), which is exactly the average of population-specific FST values that we seek to estimate. This emerges naturally, as the proposed estimator is the simple average of the population-specific estimators given in Weir and Hill (2002). This estimator has two desirable properties: it (1) is independent of sample composition and (2) does not overestimate FST (it has a maximum value of 1). We recommend its use to produce estimates of FST for two populations.
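Both properties can be illustrated with a few lines of code. This is a minimal sketch, assuming the explicit two-population, biallelic form of the Hudson estimator as it appears in the published paper; the function name `hudson_fst` and the example inputs are illustrative only.

```python
def hudson_fst(p1, p2, n1, n2):
    """Single-SNP Hudson-style estimate for two populations (assumed form).

    p1, p2: sample allele frequencies; n1, n2: sample sizes (each > 1).
    Undefined when both samples are fixed for the same allele.
    """
    # Squared frequency difference, corrected for sampling noise in each sample.
    numerator = (
        (p1 - p2) ** 2
        - p1 * (1 - p1) / (n1 - 1)
        - p2 * (1 - p2) / (n2 - 1)
    )
    # Probability that one allele drawn from each population differs.
    denominator = p1 * (1 - p2) + p2 * (1 - p1)
    return numerator / denominator


# Fixed differences give the maximum value of 1.
print(hudson_fst(1.0, 0.0, 10, 10))
# For fixed sample frequencies, sample sizes enter only through the small
# correction terms, so changing their ratio barely moves the estimate.
print(round(hudson_fst(0.5, 0.1, 10_000, 10_000), 4))
print(round(hudson_fst(0.5, 0.1, 90_000, 10_000), 4))
```

Unlike a pooled, sample-size-weighted estimator, nothing here weights one population's diversity more heavily than the other's, which is the design choice behind the independence from sample composition.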

Combining estimates of FST across multiple SNPs

The Hudson estimator is asymptotically consistent, as the estimators of the variance components involved in the computation of FST are unbiased in the context of the WH definition. However, as their quotient is not an unbiased estimator of FST, averaging the per-SNP ratios (an average of ratios) will, in general, result in a biased estimate. As many rare variants discovered by deep sequencing are population specific, we analyze the effect of this approach in the presence of many such variants. Consider a rare SNP with p1 = ɛ, p2 = 0. This yields a single-SNP FST of ɛ. An estimate produced using an average of ratios will be highly sensitive to rare SNPs of this type and is likely to depend on both the sequencing depth and the sample size used in the analysis (see Supplemental Fig. S2). Previous works have examined this choice and advocated the use of a ratio of averages, in which the numerators and denominators are averaged separately across SNPs before taking their quotient (Reynolds et al. 1983; Weir and Cockerham 1984). However, in describing the WH-ML method, Weir and Hill recommend that estimates be “simply averaged over loci.” We believe that use of an average of ratios can account for the bulk of the discrepancy between the estimates of FST from The 1000 Genomes Project Consortium (2010) and previously published estimates (International HapMap 3 Consortium 2010) (see Results).
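The contrast between the two ways of combining SNPs can be seen on a toy data set. The sketch below assumes the Hudson-style numerator and denominator for each SNP; the helper names, the sample sizes, and the mix of 10 common and 90 rare population-specific SNPs are all invented for illustration and are not data from the paper.

```python
def hudson_components(p1, p2, n1, n2):
    """Return (numerator, denominator) of the assumed Hudson-style estimator."""
    numerator = (
        (p1 - p2) ** 2
        - p1 * (1 - p1) / (n1 - 1)
        - p2 * (1 - p2) / (n2 - 1)
    )
    denominator = p1 * (1 - p2) + p2 * (1 - p1)
    return numerator, denominator


def average_of_ratios(snps, n1, n2):
    """Mean of per-SNP estimates; every SNP counts equally."""
    ratios = []
    for p1, p2 in snps:
        num, den = hudson_components(p1, p2, n1, n2)
        ratios.append(num / den)
    return sum(ratios) / len(ratios)


def ratio_of_averages(snps, n1, n2):
    """Summed numerators over summed denominators across SNPs."""
    components = [hudson_components(p1, p2, n1, n2) for p1, p2 in snps]
    total_num = sum(num for num, _ in components)
    total_den = sum(den for _, den in components)
    return total_num / total_den


# Hypothetical data: a few common SNPs plus many rare, population-specific ones.
common = [(0.5, 0.1)] * 10
rare = [(0.01, 0.0)] * 90
snps = common + rare

print("average of ratios:", round(average_of_ratios(snps, 1000, 1000), 3))
print("ratio of averages:", round(ratio_of_averages(snps, 1000, 1000), 3))
```

Each rare SNP contributes a per-SNP estimate near its own frequency, so the average of ratios is pulled toward zero as more of them are discovered. In the ratio of averages they add very little to either sum, so the result stays close to the value implied by the common SNPs.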

Dependence of FST on the set of SNPs analyzed

In relating quantities estimated from current populations to parameters of the evolutionary model, we have calculated expected values given the allele frequency in the ancestral population. This implicitly ascertains SNPs that are polymorphic in the ancestral population or, equivalently, in an outgroup population. Provided there is no migration or admixture between populations, the relationship between FST and divergence time is given in Supplemental Equation s12. This relationship accounts for changes in effective population size (i.e., bottlenecks or expansions) in the demographic history of the populations being compared. Additionally, ascertainment in an outgroup renders the estimate independent of the allele frequency spectrum in the outgroup. Therefore, with this type of ascertainment scheme, estimates should be concordant regardless of whether they are produced from rare or common SNPs. While ascertainment in an outgroup has several helpful properties, in many practical circumstances no data from a reasonable outgroup are available. In these instances, FST can be estimated using SNPs ascertained in either one of the populations under study. However, such estimates are not expected to be independent of the allele frequency spectrum or of complex demographic scenarios.

References (30 in total)

Review 1. Estimating F-statistics.

Authors: B S Weir; W G Hill
Journal: Annu Rev Genet Date: 2002-06-11 Impact factor: 16.830

2. The genetical structure of populations.

Authors: S WRIGHT
Journal: Ann Eugen Date: 1951-03

3. G(ST) is still a useful measure of genetic differentiation - a comment on Jost's D.

Authors: Nils Ryman; Olof Leimar
Journal: Mol Ecol Date: 2009-05 Impact factor: 6.185

4. FABSIM: a software for generating FST distributions with various ascertainment biases.

Authors: Anna Ramírez-Soriano; Francesc Calafell
Journal: Bioinformatics Date: 2008-10-09 Impact factor: 6.937

5. Genomic dissection of population substructure of Han Chinese and its implication in association studies.

Authors: Shuhua Xu; Xianyong Yin; Shilin Li; Wenfei Jin; Haiyi Lou; Ling Yang; Xiaohong Gong; Hongyan Wang; Yiping Shen; Xuedong Pan; Yungang He; Yajun Yang; Yi Wang; Wenqing Fu; Yu An; Jiucun Wang; Jingze Tan; Ji Qian; Xiaoli Chen; Xin Zhang; Yangfei Sun; Xuejun Zhang; Bailin Wu; Li Jin
Journal: Am J Hum Genet Date: 2009-12 Impact factor: 11.025

6. Estimation of fixation indices and gene diversities.

Authors: M Nei; R K Chesser
Journal: Ann Hum Genet Date: 1983-07 Impact factor: 1.670

7. A second generation human haplotype map of over 3.1 million SNPs.

Authors: Kelly A Frazer; Dennis G Ballinger; David R Cox; et al. (International HapMap Consortium)
Journal: Nature Date: 2007-10-18 Impact factor: 49.962

8. Contrasting population structure from nuclear intron sequences and mtDNA of humpback whales.

Authors: S R Palumbi; C S Baker
Journal: Mol Biol Evol Date: 1994-05 Impact factor: 16.240

Review 9. Genetics in geographically structured populations: defining, estimating and interpreting F(ST).

Authors: Kent E Holsinger; Bruce S Weir
Journal: Nat Rev Genet Date: 2009-09 Impact factor: 53.242

10. Widespread genomic signatures of natural selection in hominid evolution.

Authors: Graham McVicker; David Gordon; Colleen Davis; Phil Green
Journal: PLoS Genet Date: 2009-05-08 Impact factor: 5.917
