
Mathematical constraints on F_ST: multiallelic markers in arbitrarily many populations.

Abstract

Interpretations of values of the FST measure of genetic differentiation rely on an understanding of its mathematical constraints. Previously, it has been shown that FST values computed from a biallelic locus in a set of multiple populations and FST values computed from a multiallelic locus in a pair of populations are mathematically constrained as a function of the frequency of the allele that is most frequent across populations. We generalize from these cases to report here the mathematical constraint on FST given the frequency M of the most frequent allele at a multiallelic locus in a set of multiple populations. Using coalescent simulations of an island model of migration with an infinitely-many-alleles mutation model, we argue that the joint distribution of FST and M helps in disentangling the separate influences of mutation and migration on FST. Finally, we show that our results explain a puzzling pattern of microsatellite differentiation: the lower FST in an interspecific comparison between humans and chimpanzees than in the comparison of chimpanzee populations. We discuss the implications of our results for the use of FST. This article is part of the theme issue 'Celebrating 50 years since Lewontin's apportionment of human diversity'.


Keywords: allele frequency; chimpanzee; genetic differentiation; migration; population structure


Year: 2022 PMID: 35430885 PMCID: PMC9014193 DOI: 10.1098/rstb.2020.0414

Source DB: PubMed Journal: Philos Trans R Soc Lond B Biol Sci ISSN: 0962-8436 Impact factor: 6.671

Introduction

Multiallelic loci such as microsatellites and haplotype assignments are used to study genetic differentiation in a variety of fields, ranging from ecology and conservation genetics to anthropology and human genomics. Genetic differentiation is often measured for multiallelic loci using the multiallelic extension of Wright's fixation index F_ST [1]:

$$F_{ST} = \frac{H_T - H_S}{H_T}. \tag{1.1}$$

For a polymorphic multiallelic locus with I distinct alleles in a set of K subpopulations, denoting by p_{k,i} the frequency of allele i in subpopulation k, $H_S = 1 - \frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{I} p_{k,i}^2$ and $H_T = 1 - \sum_{i=1}^{I}\left(\frac{1}{K}\sum_{k=1}^{K} p_{k,i}\right)^2$. F_ST values are known to be smaller for multiallelic than for biallelic loci [2]. One reason invoked to explain this difference is that within-subpopulation heterozygosity H_S mathematically constrains the maximal value of F_ST to be below 1, and the constraint is stronger when H_S is high. This phenomenon was noticed concurrently in simulation-based, empirical and theoretical studies [3-7], and the mathematical constraints describing the dependence were subsequently clarified [8,9]. Studies have found that the maximal value of F_ST can be viewed as constrained not only by functions of the within-subpopulation allele frequency distribution such as H_S, but alternatively by aspects of the global allele frequency distribution across subpopulations. For a biallelic locus in K = 2 subpopulations, Maruki et al. [10] showed that the maximal F_ST as a function of the frequency M of the most frequent allele decreases as M increases from 1/2 to 1 (see also [11]). Generalizing the biallelic case to arbitrarily many alleles, Jakobsson et al. [12] showed that for multiallelic loci with an unspecified number of distinct alleles, the maximal F_ST increases from 0 to 1 as a function of M if 0 < M < 1/2, and decreases from 1 to 0 for 1/2 ≤ M < 1 in the manner reported by Maruki et al. [10] for biallelic loci. Edge & Rosenberg [13] generalized these results to the case of a fixed finite number of alleles, showing that the maximal F_ST differs slightly from the unspecified case when the fixed number of distinct alleles is an odd number.
Generalizing the simplest case of K = I = 2 in a different direction, Alcala & Rosenberg [14] considered biallelic loci in the case of a fixed number of subpopulations K ≥ 2. We showed that the maximal value of F_ST displays a peculiar behaviour as a function of M: the upper bound has a maximum of 1 if and only if M = k/K, for integers k with ⌈K/2⌉ ≤ k ≤ K − 1. The constraints on the maximal value of F_ST dissipate as K tends to infinity, even though for any fixed K, there always exists a value of M for which the maximal F_ST lies below 1. Relating F_ST to its maximum as a function of M helps explain surprising phenomena that arise during population-genetic data analysis. For example, Jakobsson et al. [12] showed that stronger constraints on F_ST could explain the low F_ST values seen in pairs of African human populations. They also found that such constraints could explain the lower F_ST values seen in high-diversity multiallelic loci compared to lower-diversity loci (microsatellites compared to single-nucleotide polymorphisms). Alcala & Rosenberg [14] showed that constraints on the maximal F_ST could explain the lower F_ST values between human populations seen when computing F_ST pairwise rather than from all populations simultaneously. In this study, we characterize the relationship between F_ST and the frequency M of the most frequent allele, for a multiallelic locus and an arbitrary specified value of the number of subpopulations K. We derive the mathematical upper bound on F_ST in terms of M, extending the biallelic result of Alcala & Rosenberg [14] to the multiallelic case, and providing the most comprehensive description of the mathematical constraints on F_ST in terms of M to date (table 1). To assist in interpreting the new bound, we simulate the joint distribution of F_ST and M in the island migration model, describing its properties as a function of the number of subpopulations, the migration rate and the mutation rate.
The K-subpopulation upper bound on F_ST in terms of M facilitates an explanation of counterintuitive aspects of inter-species genetic differentiation. We discuss the importance of the results for applications of F_ST more generally.

Table 1

Studies describing the mathematical constraints on F_ST. H_S and H_T denote the within-subpopulation and total heterozygosities, respectively. δ denotes the absolute difference in the frequency of a specific allele between two subpopulations, and M denotes the frequency of the most frequent allele in the total population. Instead of heterozygosities H_S or H_T, some studies consider homozygosities 1 − H_S or 1 − H_T.

reference	number of alleles	number of subpopulations	variable in terms of which constraints are reported
Long & Kittles [8]	unspecified value ≥2	fixed finite value ≥2	H_S
Rosenberg et al. [11]	2	2	δ
Hedrick [9]	unspecified value ≥2	fixed finite value ≥2	H_S
Maruki et al. [10]	2	2	H_S, M
Jakobsson et al. [12]	unspecified value ≥2	2	H_T, M
Edge & Rosenberg [13]	fixed finite value ≥2	2	H_T, M
Alcala & Rosenberg [14]	2	fixed finite value ≥2	M
this paper	unspecified value ≥2	fixed finite value ≥2	M

Model

Our goal is to derive the range of values that F_ST can take (the lower and upper bounds on F_ST) as a function of the frequency M of the most frequent allele for a multiallelic locus, when the number of subpopulations K is a fixed finite value greater than or equal to 2. We follow previous studies [12-15] in describing notation and constructing the scenario. We consider a polymorphic locus with an unspecified number of distinct alleles, in a setting with K subpopulations contributing equally to the total population. We denote the frequency of allele i in subpopulation k by p_{k,i}, with sum $\sigma_i = \sum_{k=1}^{K} p_{k,i}$ across subpopulations. Each allele frequency p_{k,i} lies in [0, 1]. Within subpopulations, allele frequencies sum to 1: for each k, $\sum_{i} p_{k,i} = 1$. Hence, σ_i lies in [0, K], and $\sum_{i} \sigma_i = K$. We number alleles from most to least frequent, so σ_i ≥ σ_j for i ≤ j. Because by assumption the locus is polymorphic, σ_i < K for each i. Alleles 1 and 2 have non-zero frequency in at least one subpopulation, not necessarily the same one; we have σ1 > 0 and σ2 > 0. We denote the mean frequency of the most frequent allele across subpopulations by M = σ1/K. We then have 0 < M < 1. We treat the allele frequencies p_{k,i} and associated quantities M and σ_i as parametric values, and not as estimates computed from data. Equation (1.1) expresses F_ST as a ratio involving within-subpopulation heterozygosity, H_S, and total heterozygosity, H_T, with 0 ≤ H_S < 1 and 0 ≤ H_T < 1. Because we assume the locus is polymorphic, H_T > 0. We write equation (1.1) in terms of allele frequencies, permitting the number of distinct alleles to be arbitrarily large:

$$F_{ST} = \frac{\frac{1}{K}\sum_{i=1}^{\infty}\sum_{k=1}^{K} p_{k,i}^2 - \frac{1}{K^2}\sum_{i=1}^{\infty}\sigma_i^2}{1 - \frac{1}{K^2}\sum_{i=1}^{\infty}\sigma_i^2}. \tag{2.1}$$

Hence, our goal is, for fixed σ1 = KM, 0 < σ1 < K, to identify the matrices (p_{k,i}), with p_{k,i} in [0, 1], $\sum_{k=1}^{K} p_{k,1} = \sigma_1$, and $\sum_{i} p_{k,i} = 1$ for each k, that minimize and maximize F_ST in equation (2.1). Note that we adopt the interpretation of F_ST as a 'statistic' that describes a mathematical function of allele frequencies rather than as a 'parameter' that describes coancestry of individuals in a population [e.g. 16]. See Alcala & Rosenberg [14] for a discussion of interpretations of F_ST when studying its mathematical properties.
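The definitions in this section can be turned into a short computation. The following is a minimal sketch, assuming equation (2.1) has the form (mean within-subpopulation sum of squared frequencies minus the sum of squared mean frequencies) divided by (one minus the sum of squared mean frequencies); the function names `fst` and `most_frequent_allele_frequency` are hypothetical and do not come from the article.

```python
def fst(p):
    """F_ST for a K x I matrix p, with p[k][i] the frequency of allele i
    in subpopulation k (each row is assumed to sum to 1)."""
    K = len(p)
    # (1/K) * sum over subpopulations and alleles of p_{k,i}^2, i.e. 1 - H_S
    mean_sq = sum(x * x for row in p for x in row) / K
    # sigma_i: sum of the frequencies of allele i across subpopulations
    sigma = [sum(col) for col in zip(*p)]
    # (1/K^2) * sum_i sigma_i^2, i.e. 1 - H_T
    total_sq = sum(s * s for s in sigma) / K ** 2
    return (mean_sq - total_sq) / (1.0 - total_sq)


def most_frequent_allele_frequency(p):
    """M = sigma_1 / K, the mean frequency of the most frequent allele."""
    K = len(p)
    return max(sum(col) for col in zip(*p)) / K


# Two subpopulations fixed for different alleles: complete differentiation.
fixed = [[1.0, 0.0], [0.0, 1.0]]
# Two subpopulations with identical frequencies: no differentiation.
same = [[0.5, 0.5], [0.5, 0.5]]
print(fst(fixed), fst(same), most_frequent_allele_frequency(fixed))
```

The locus must be polymorphic in the total population; otherwise the denominator is zero and the function raises a `ZeroDivisionError`.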

Mathematical constraints

Lower bound of F

Bounds on F_ST in terms of the frequency of the most frequent allele can be written with respect to M or σ1, noting that M ranges in (0, 1) and σ1 ranges in (0, K). For the lower bound, from equation (2.1), for any choice of σ1, F_ST = 0 can be achieved. Consider (σ1, σ2, …) with σ_i in [0, K) for each i, σ_i ≥ σ_j for i ≤ j, $\sum_{i} \sigma_i = K$, and σ1 > 0 and σ2 > 0. We set p_{k,i} = σ_i/K for all subpopulations k and alleles i; this choice yields F_ST = 0. Conversely, F_ST = 0 implies that the numerator of equation (2.1), H_T − H_S, is zero. This numerator can be written $\frac{1}{K^2}\sum_{i}\left[K\sum_{k=1}^{K} p_{k,i}^2 - \left(\sum_{k=1}^{K} p_{k,i}\right)^2\right]$. The Cauchy–Schwarz inequality guarantees that $K\sum_{k=1}^{K} p_{k,i}^2 \ge \left(\sum_{k=1}^{K} p_{k,i}\right)^2$, with equality if and only if p_{1,i} = p_{2,i} = … = p_{K,i} = σ_i/K. Applying the Cauchy–Schwarz inequality to all alleles i, the numerator of equation (2.1) is zero only if for all i, (p_{1,i}, p_{2,i}, …, p_{K,i}) = (σ_i/K, σ_i/K, …, σ_i/K). Thus, we can conclude that the allele frequency matrices in which all K subpopulations have identical allele frequency vectors are the only matrices for which F_ST = 0. The lower bound on F_ST is equal to 0 irrespective of M or σ1, for any value of the number of subpopulations K.

Upper bound of F

To derive the upper bound on F_ST in terms of M = σ1/K, we must maximize F_ST in equation (2.1), assuming that σ1 and K are constant. The computations are performed in appendix A; we write the main result as a function of σ1, noting that it can be converted into a function of M by replacing σ1 with KM. In theorem A.1, we treat the case in which σ1 has an integer value. For non-integer σ1, theorem A.2 shows that the maximal F_ST requires that (i) the sum of squared allele frequencies across alleles and subpopulations, $\sum_{i}\sum_{k=1}^{K} p_{k,i}^2$, is maximal, and (ii) alleles i = 2, 3, … are each present in at most one subpopulation, but allele 1 might be present in more than one subpopulation. We then separately maximize F_ST as a function of σ1 for σ1 in (0, 1) and non-integer σ1 in (1, K). These two cases differ in that allele 1 appears in a single subpopulation in the former case, and it must appear in at least two subpopulations in the latter. The maximal F_ST as a function of σ1 for σ1 in (0, K) is

$$F_{ST} \le \begin{cases} \dfrac{(K-1)\left[1 - \sigma_1 (J-1)(2 - J\sigma_1)\right]}{K - \left[1 - \sigma_1 (J-1)(2 - J\sigma_1)\right]}, & 0 < \sigma_1 < 1, \\[2ex] \dfrac{K(K-1) - \sigma_1^2 + \lfloor\sigma_1\rfloor - 2(K-1)\{\sigma_1\} + (2K-1)\{\sigma_1\}^2}{K(K-1) - \sigma_1^2 - \lfloor\sigma_1\rfloor + 2\sigma_1 - \{\sigma_1\}^2}, & 1 \le \sigma_1 < K, \end{cases} \tag{3.1}$$

where $J = \lceil 1/\sigma_1 \rceil$. Here, ⌈x⌉ denotes the smallest integer greater than or equal to x, ⌊x⌋ denotes the greatest integer less than or equal to x, and {x} = x − ⌊x⌋ denotes the fractional part of x. Note that for an integer choice of σ1, the maximum from equation (3.1) and the limits as σ1 tends to the integer from above and below all equal 1, so that the maximum as a function of σ1 is continuous. From appendix A, F_ST reaches its upper bound for integer σ1 when allele 1 has frequency 1 in each of σ1 subpopulations, and when in each of the remaining K − σ1 subpopulations, an allele other than allele 1 has frequency 1. These alleles of frequency 1 need not be private, although they can be; any identity relationships among them are permissible, provided that when summing frequencies across subpopulations, none of these alleles has a sum that exceeds σ1. The locus can have as few as ⌈K/σ1⌉ alleles of non-zero frequency and as many as K − σ1 + 1.
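The piecewise bound can be evaluated numerically. The sketch below assumes the two-branch form of equation (3.1) obtained by evaluating equation (2.1) at the maximizing allele frequency configurations described in this section; the function name `fst_max` is hypothetical.

```python
import math


def fst_max(sigma1, K):
    """Upper bound on F_ST given sigma1 = K * M, for K subpopulations."""
    if not 0 < sigma1 < K:
        raise ValueError("sigma1 must lie in the open interval (0, K)")
    if sigma1 < 1:
        # Branch for M < 1/K: J alleles per subpopulation, all private.
        J = math.ceil(1 / sigma1)
        s = 1 - sigma1 * (J - 1) * (2 - J * sigma1)
        return (K - 1) * s / (K - s)
    # Branch for M >= 1/K: depends on the integer and fractional parts.
    whole = math.floor(sigma1)
    frac = sigma1 - whole
    base = K * (K - 1) - sigma1 ** 2
    num = base + whole - 2 * (K - 1) * frac + (2 * K - 1) * frac ** 2
    den = base - whole + 2 * sigma1 - frac ** 2
    return num / den


# Bound at M = 0.27 with K = 2 and at M = 0.36 with K = 6.
print(round(fst_max(2 * 0.27, 2), 2), round(fst_max(6 * 0.36, 6), 2))
```

For K = 2 and σ1 in [1, 2), the second branch simplifies to (2 − σ1)/σ1 = (1 − M)/M, which gives a convenient spot check.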
For σ1 in the interval (0, 1), F_ST is maximal when each allele is present in only a single subpopulation, and when each subpopulation has exactly J alleles with a non-zero frequency: J − 1 alleles at frequency σ1 and one allele at frequency 1 − (J − 1)σ1 ≤ σ1. Because each subpopulation has J distinct alleles and no alleles are shared across subpopulations, this upper bound requires that the locus has KJ alleles of non-zero frequency. For non-integer σ1 in (1, K), F_ST reaches its maximum when there are ⌊σ1⌋ subpopulations in which the most frequent allele has frequency 1, a single subpopulation in which it has frequency {σ1} and a private allele has frequency 1 − {σ1}, and K − ⌊σ1⌋ − 1 subpopulations each with a different private allele at frequency 1. Only the most frequent allele is shared across subpopulations, and a single subpopulation displays polymorphism. At the maximum, K − ⌊σ1⌋ + 1 alleles have non-zero frequency.
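The maximizing configurations described above can be built explicitly and checked against a direct F_ST computation. This is an illustrative sketch, not the authors' code: `maximizing_matrix` and `fst` are hypothetical names, and `fst` assumes the allele-frequency form of F_ST given in the Model section.

```python
import math


def fst(p):
    """F_ST for a K x I matrix p of allele frequencies (rows sum to 1)."""
    K = len(p)
    mean_sq = sum(x * x for row in p for x in row) / K
    sigma = [sum(col) for col in zip(*p)]
    total_sq = sum(s * s for s in sigma) / K ** 2
    return (mean_sq - total_sq) / (1.0 - total_sq)


def maximizing_matrix(sigma1, K):
    """Allele frequency matrix of the form described as maximizing F_ST."""
    if not 0 < sigma1 < K:
        raise ValueError("sigma1 must lie in the open interval (0, K)")
    rows = []
    if sigma1 < 1:
        # Each subpopulation: J - 1 private alleles at frequency sigma1,
        # plus one private allele carrying the remaining frequency.
        J = math.ceil(1 / sigma1)
        block = [sigma1] * (J - 1) + [1 - (J - 1) * sigma1]
        for k in range(K):
            row = [0.0] * (K * J)
            row[k * J:(k + 1) * J] = block
            rows.append(row)
        return rows
    whole = math.floor(sigma1)
    frac = sigma1 - whole
    n_alleles = K - whole + 1  # allele 1 plus one private allele per other row
    for k in range(K):
        row = [0.0] * n_alleles
        if k < whole:
            row[0] = 1.0  # allele 1 fixed
        else:
            # Only the first non-fixed subpopulation carries allele 1,
            # at frequency {sigma1}; its private allele takes the rest.
            row[0] = frac if k == whole else 0.0
            row[1 + (k - whole)] = 1.0 - row[0]
        rows.append(row)
    return rows


print(round(fst(maximizing_matrix(0.54, 2)), 4))
print(round(fst(maximizing_matrix(2.16, 6)), 4))
```

When σ1 is an integer, the fractional part is zero and the same construction gives σ1 subpopulations fixed for allele 1 and K − σ1 subpopulations each fixed for a private allele, which is one of the permissible configurations for the integer case.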

Properties of the upper bound

Figure 1 shows the maximal value of F_ST in terms of M = σ1/K for various values of the number of subpopulations, K. We describe a number of properties of this upper bound.

Figure 1

Bounds on F_ST as a function of the frequency of the most frequent allele, M, for a multiallelic locus, for each of several different numbers of subpopulations K. (a) K = 2, (b) K = 3, (c) K = 6, (d) K = 40 and (e) K = 100. The grey regions represent the space between the upper and lower bounds on F_ST. The dashed lines represent the curves that the jagged maximal F_ST touches when M < 1/K, computed from equation (3.2). The upper bound is computed from equation (3.1); for each K, the lower bound is 0 for all values of M.

Piecewise structure of the upper bound

First, we observe that the upper bound has a piecewise structure. For M < 1/K, the upper bound depends on J = ⌈1/(KM)⌉. As KM increases in (0, 1), each decrement in the integer value of ⌈1/(KM)⌉ produces a distinct 'piece' with domain [1/(Kj), 1/(K(j − 1))), for integers j ≥ 2. Within each interval [1/(Kj), 1/(K(j − 1))), J has the constant value j. At M = 1/K, the upper bound has its first transition between cases. For M > 1/K, the upper bound depends on ⌊KM⌋. As KM increases in [1, K), each increment in ⌊KM⌋ also produces a distinct piece of the domain. For each k from 1 to K − 1, ⌊KM⌋ = k for M in [k/K, (k + 1)/K). Counting the intervals of the domain, we see that an infinite number of distinct intervals occur for M in (0, 1/K), and K − 1 intervals occur for M in [1/K, 1). Within intervals, the function describing the upper bound is smooth.

Behaviour of the upper bound for M = 1/K, 2/K, …(K − 1)/K

The upper bound is equal to 1 at M = 1/K, 2/K, …, (K − 1)/K. For M in (0, 1/K), setting the numerator and denominator equal in equation (3.1), we find that the upper bound is never equal to 1. For M in (1/K, 1), the upper bound is equal to 1 if and only if {σ1} = 0, that is, if and only if σ1 is an integer and M = k/K for k = 2, 3, …, K − 1. Hence, noting that the upper bound is equal to 1 at M = 1/K, we conclude that the upper bound can equal 1 if and only if M = k/K for integers k = 1, 2, …, K − 1. For fixed K, the upper bound on F_ST has exactly K − 1 maxima at which F_ST can equal 1, at M = 1/K, 2/K, …, (K − 1)/K. We can conclude that F_ST is unconstrained within the unit interval only for a finite set of values of the frequency M of the most frequent allele. The size of this set increases with the number of subpopulations K.

Behaviour of the upper bound for M in (0, 1/K)

For M in (0, 1/K), we can compute the value of the upper bound at the transition points between distinct pieces of the domain, namely values of 1/(Kj) for integers j ≥ 2. Applying equation (3.1), we observe that at M = 1/(Kj), the upper bound has value (K − 1)/(Kj − 1). In other words, the upper bound touches the curve

$$q^*(M) = \frac{(K-1)M}{1-M} = \frac{(K-1)\sigma_1}{K-\sigma_1}. \tag{3.2}$$

This curve is represented in figure 1 as a dashed line. Note that for K = 2, the special case considered by Jakobsson et al. [12], equation (3.2) reduces to q*(M) = M/(1 − M) = σ1/(2 − σ1), which matches equation 21 from Jakobsson et al. [12]. In fact, setting K = 2, equation (3.1) for M in (0, 1/K) reduces to the K = 2 upper bound on F_ST in equation 9 of [12].
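The value at the transition points can be checked by hand. The worked substitution below is a sketch that assumes the M < 1/K branch of the upper bound can be written as (K − 1)S/(K − S), where S = 1 − σ1(J − 1)(2 − Jσ1) is the within-subpopulation sum of squared frequencies at the maximizing configuration (J − 1 alleles at frequency σ1 and one at 1 − (J − 1)σ1); the symbol S is introduced here only for illustration.

```latex
\begin{align*}
\sigma_1 = \tfrac{1}{j},\ J = j: \quad
S &= 1 - \tfrac{1}{j}\,(j-1)\left(2 - j\cdot\tfrac{1}{j}\right)
   = 1 - \tfrac{j-1}{j} = \tfrac{1}{j}, \\
F_{ST}^{\max} &= \frac{(K-1)S}{K-S}
   = \frac{(K-1)/j}{K - 1/j}
   = \frac{K-1}{Kj-1}, \\
\text{and with } M = \tfrac{1}{Kj}: \quad
\frac{K-1}{Kj-1} &= \frac{K-1}{1/M - 1} = \frac{(K-1)M}{1-M}.
\end{align*}
```

For K = 2 the last expression is M/(1 − M), in agreement with the reduction to the two-subpopulation case quoted in the text.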

Behaviour of the upper bound for M in (1/K, 1)

Because the upper bound is a smooth function on each interval of its domain, and because it possesses maxima at interval boundaries M = 1/K, 2/K, …, (K − 1)/K, it must possess local minima in intervals [k/K, (k + 1)/K) for k = 1, 2, …, K − 2. Indeed, such minima are visible in figure 1 in cases with K = 3, K = 6, K = 40 and K = 100; for K = 2, only one maximum occurs, so that there is no interval between a pair of maxima in which a minimum can occur. Note that because we restrict attention to M in (0, 1), we do not count the point at M = 1 and F_ST = 0 as a local minimum.
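As an illustration of such a local minimum, the smallest case K = 3 can be worked out directly. The sketch below is not taken from the article: it assumes the maximizing configuration described earlier for non-integer σ1 (one subpopulation fixed for allele 1, one carrying allele 1 at frequency f = {σ1} together with a private allele, and one fixed for a private allele) and evaluates F_ST on the single interval between the two peaks.

```latex
\begin{align*}
K = 3,\ \sigma_1 = 1 + f,\ 0 \le f < 1: \quad
\sum_{k,i} p_{k,i}^2 &= 1 + f^2 + (1-f)^2 + 1 = 3 - 2f + 2f^2, \\
\sum_i \sigma_i^2 &= (1+f)^2 + (1-f)^2 + 1 = 3 + 2f^2, \\
F_{ST}^{\max}(f) &= \frac{K\sum_{k,i} p_{k,i}^2 - \sum_i \sigma_i^2}{K^2 - \sum_i \sigma_i^2}
  = \frac{3 - 3f + 2f^2}{3 - f^2}, \\
\frac{d}{df}F_{ST}^{\max}(f) = 0 &\iff f^2 - 6f + 3 = 0 \iff f = 3 - \sqrt{6} \approx 0.551, \\
M = \frac{1+f}{3} = \frac{4 - \sqrt{6}}{3} \approx 0.517, &\qquad
F_{ST}^{\max} = \frac{\sqrt{6} - 1}{2} \approx 0.725.
\end{align*}
```

The function equals 1 at f = 0 and tends to 1 as f approaches 1, so under these assumptions the bound dips to roughly 0.72 between the peaks at M = 1/3 and M = 2/3.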

Joint distribution of M and F under an evolutionary model

So far, we have described the mathematical constraint imposed on F_ST by M without respect to the frequency with which particular values of M arise in evolutionary scenarios. As an assessment of the bounds in evolutionary models can illuminate the settings in which they are most salient in population-genetic data analysis [9,14,17-20], we simulated the joint distribution of F_ST and M under an island migration model, relating the distribution to the mathematical bounds on F_ST. This analysis considers allele frequency distributions, and hence values of M and F_ST, generated by evolutionary models. The simulation approach is modified from [14,15].

Simulations

We simulated alleles under a coalescent model, using the software MS [21]. We considered a total population of KN diploid individuals subdivided into K subpopulations of size N. At each generation, a proportion m of the individuals in a subpopulation originated outside the subpopulation. Thus, the scaled migration rate is 4Nm, and it corresponds to twice the number of individuals in a subpopulation that originate elsewhere. We considered the island model [22-24], in which migrants have the same probability m/(K − 1) of coming from any other specific subpopulation. We used an infinitely-many-alleles model; mutations occur at rate μ, and the scaled mutation rate is 4Nμ. We examined three values of K (2, 6, 40), three values of 4Nμ (0.1, 1, 10) and three values of 4Nm (0.1, 1, 10). Note that in MS, time is scaled in units of 4N generations, and there is no need to specify subpopulation sizes N. MS simulates an infinitely-many-sites model, where each mutation occurs at a new site; each haplotype is a new allele, so that each mutation creates a new allele. For our analysis, we are concerned only with the allelic categories and not with the simulated sequences; thus, although the simulation follows the infinitely-many-sites model, the analysis treats simulated datasets as having been generated under an infinitely-many-alleles model. For each parameter triplet (K, 4Nμ, 4Nm), we performed 1000 replicate simulations, sampling 100 sequences per subpopulation in each replicate. We computed F_ST values from the parametric allele (haplotype) frequencies. MS commands appear in electronic supplementary material, File S1; note that the simulation approach here uses the standard method of simulating MS with a specified mutation rate θ = 4Nμ, whereas in our previous analyses of biallelic cases [14,15], we had employed the alternative approach of requiring simulated datasets to possess exactly one segregating site.
Figure 2 shows the joint distribution of M and F_ST for the nine values of (4Nμ, 4Nm) in the case of K = 2. Electronic supplementary material, figures S1 and S2 provide similar figures for K = 6 and K = 40, respectively.
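The simulation design can be sketched in code. The exact commands used by the authors are in File S1 and are not reproduced here; the command string below is only an assumed form based on the standard MS options `-t` (scaled mutation rate) and `-I` (island model with a common scaled migration rate), and the helper names `ms_command` and `allele_frequencies` are hypothetical.

```python
from collections import Counter


def ms_command(K, theta, mig, n_per_pop=100, reps=1000):
    """Assumed MS call: K islands, n_per_pop sequences sampled from each,
    scaled mutation rate theta = 4N*mu and scaled migration rate mig = 4N*m."""
    sizes = " ".join(str(n_per_pop) for _ in range(K))
    return f"ms {K * n_per_pop} {reps} -t {theta} -I {K} {sizes} {mig}"


def allele_frequencies(haplotypes_by_pop):
    """Treat every distinct haplotype string as one allele and return the
    K x I matrix of allele frequencies (one row per subpopulation)."""
    alleles = sorted({h for pop in haplotypes_by_pop for h in pop})
    rows = []
    for pop in haplotypes_by_pop:
        counts = Counter(pop)
        rows.append([counts[a] / len(pop) for a in alleles])
    return rows


print(ms_command(2, 1, 0.1))
# Toy haplotypes for two subpopulations (not simulated output).
print(allele_frequencies([["00", "01"], ["01", "01"]]))
```

Collapsing haplotypes into allelic categories mirrors the statement that the analysis treats infinitely-many-sites output as if it were generated under an infinitely-many-alleles model.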

Figure 2

Joint density of the frequency M of the most frequent allele and F_ST in the island migration model with K = 2 subpopulations, for different scaled migration rates 4Nm and mutation rates 4Nμ. (a) 4Nμ = 0.1, 4Nm = 0.1. (b) 4Nμ = 1, 4Nm = 0.1. (c) 4Nμ = 10, 4Nm = 0.1. (d) 4Nμ = 0.1, 4Nm = 1. (e) 4Nμ = 1, 4Nm = 1. (f) 4Nμ = 10, 4Nm = 1. (g) 4Nμ = 0.1, 4Nm = 10. (h) 4Nμ = 1, 4Nm = 10. (i) 4Nμ = 10, 4Nm = 10. The black solid line represents the upper bound on F_ST in terms of M (equation 3.1); the black point plots the mean values of M and F_ST. Colours represent the density of loci, estimated using a Gaussian kernel density estimate with a bandwidth of 0.02, with density set to 0 outside of the bounds. Loci are simulated using coalescent software MS, assuming an island model of migration and an infinitely-many-alleles mutation model. Each panel considers 1000 replicate simulations, with 100 lineages sampled per subpopulation. Electronic supplementary material, figures S1 and S2 present similar results for K = 6 and K = 40 subpopulations, respectively.

Impact of the mutation rate

For fixed migration rate 4Nm and number of subpopulations K, the main impact of the mutation rate is on the frequency M of the most frequent allele. For K = 2, under weak mutation (4Nμ = 0.1), the joint density of M and F_ST is highest in the high-M region, for all values of 4Nm (figure 2a,d,g). Although most simulation replicates produce M > 1/2, with an upper bound on F_ST less than one, this set of parameter values does give rise to replicates near the peak at M = 1/2. Under intermediate mutation (4Nμ = 1), the increased mutation rate tends to decrease M, shifting the joint distribution to lower values of M for all values of 4Nm (figure 2b,e,h). Finally, under strong mutation (4Nμ = 10), the joint density of M and F_ST is highest in the low-M region, for all values of 4Nm (figure 2c,f,i). In this region, the upper bound on F_ST is most strongly constrained, leading to low F_ST values.

Impact of the migration rate

For fixed mutation rate 4Nμ and number of subpopulations K, the impact of the migration rate is seen primarily in the F_ST values rather than the values of M. Under weak migration (4Nm = 0.1), subpopulations are differentiated, and the joint density of M and F_ST is highest near the upper bound on F_ST in terms of M (figure 2a–c). Under intermediate migration (4Nm = 1), differentiation between subpopulations decreases, and the joint density of M and F_ST is highest at lower values of F_ST (figure 2d–f). Under strong migration (4Nm = 10), the joint density of M and F_ST nears the lower bound (figure 2g–i).

Impact of the number of subpopulations

In figure 1, the number of subpopulations changes the shape of the region in which F_ST is permitted to range as a function of M. Thus, in simulations, the impact of the number of subpopulations K is observed in cases in which a change in K permits F_ST to expand its range within the unit square for (M, F_ST). For each of the nine choices of (4Nμ, 4Nm), figure 3 summarizes the means observed for (M, F_ST) in figure 2 and electronic supplementary material, figures S1 and S2, corresponding to K = 2, K = 6 and K = 40, respectively.

Figure 3

Mean frequency M of the most frequent allele and mean F_ST in the island migration model, for different scaled migration rates 4Nm and mutation rates 4Nμ and different numbers of subpopulations K. (a) K = 2, (b) K = 6 and (c) K = 40. The black solid lines represent the upper bound on F_ST in terms of M (equation 3.1). The coloured points represent the mean M and mean F_ST, where colours correspond to values of 4Nm. These points are taken from figure 2 and electronic supplementary material, figures S1 and S2.

Increasing the number of subpopulations generally increases F_ST for fixed 4Nμ and 4Nm. For example, the mean F_ST can be substantially larger for K = 6 than for K = 2. Consider (4Nμ, 4Nm) = (0.1, 0.1). For K = 2, the mean F_ST is near its upper bound (figure 3a); for K = 6, F_ST is not as close to the bound (figure 3b). However, because the upper bound for K = 6 exceeds that for K = 2, the mean F_ST is nevertheless larger in the case of K = 6.

Example: humans and chimpanzees

We now use our theoretical results to examine genetic differentiation in humans and chimpanzees. Because humans and chimpanzees are distinct species, we might expect a genetic differentiation measure such as F_ST to produce a greater value for a computation between them than for a computation among populations within one or the other. Indeed, studies of multiallelic loci do find that adding chimpanzees to data on multiple human populations increases the value of F_ST [8,25]. However, we will see that F_ST has a more subtle pattern when considering data on multiple chimpanzee populations, and that our theoretical computations explain a surprising result. We examine data on 246 multiallelic microsatellite loci assembled by Pemberton et al. [26] from several studies of worldwide human populations and a study of chimpanzees [27]. We consider F_ST comparisons both between humans and chimpanzees and among populations of chimpanzees. For the human data, we consider all 5795 individuals in the dataset, and for the chimpanzee data, we consider 84 chimpanzee individuals from six populations: one bonobo population, and five common chimpanzee populations (Central, Eastern, Western, hybrid and captive). In the data analysis, we perform a computation to summarize the relationship of F_ST to the upper bound. For a set of Z loci, denote by F_z and M_z the values of F_ST and M at locus z. The mean F_ST for the set, or $\bar{F}_{ST}$, is

$$\bar{F}_{ST} = \frac{1}{Z}\sum_{z=1}^{Z} F_z. \tag{5.1}$$

Using equation (3.1), we can compute the corresponding maximum F_ST given the observed σ_{1,z} = KM_z, z = 1, 2, …, Z. Denoting this quantity by F_{max,z}, we have

$$\bar{F}_{\max} = \frac{1}{Z}\sum_{z=1}^{Z} F_{\max,z}. \tag{5.2}$$

The ratio $\bar{F}_{ST}/\bar{F}_{\max}$ measures the proximity of the F_ST values to their upper bounds: it ranges from 0, if F_ST values at all loci equal 0, to 1, if F_ST values at all loci equal their upper bounds.
We computed the parametric allele frequencies for each subpopulation (the human and chimpanzee groups for the human–chimpanzee comparison, and chimpanzee subpopulations for the comparison of chimpanzees), averaging across subpopulations to obtain the frequency M of the most frequent allele. We then computed F_ST and the associated upper bound for each locus, averaging across loci to obtain the overall $\bar{F}_{ST}$ and $\bar{F}_{\max}$ for the full microsatellite set (equations (5.1) and (5.2)). Surprisingly, given the longer evolutionary time between humans and chimpanzees than among chimpanzee populations, the F_ST value is significantly greater when comparing chimpanzee populations than when comparing humans and chimpanzees (p = 4.2 × 10^−14, Wilcoxon rank sum test). The explanation for this result can be found in the properties of the upper bound on F_ST given M. Values of M are similar in the two comparisons (figure 4a,b). However, K differs, equaling 2 for the human–chimpanzee comparison and 6 for the comparison of chimpanzee subpopulations. Because the theoretical range of F_ST is seen to be smaller for F_ST values computed among smaller sets of subpopulations than among larger sets (figure 1), the F_ST values among chimpanzees possess a larger range. For example, the maximal F_ST at the mean M of 0.27 observed in pairwise comparisons is 0.34 for K = 2 (red segment in figure 4a), whereas the maximal F_ST at the mean M of 0.36 observed for six chimpanzee populations is 0.93 for K = 6 (figure 4b). Given the stronger constraint in pairwise calculations than in calculations with more subpopulations, it is not unexpected that pairwise F_ST values would be smaller than those in a six-subpopulation computation. A high F_ST among chimpanzees compared to between humans and chimpanzees is a by-product of mathematical constraints on F_ST.
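The multi-locus summary can be sketched as follows. The sketch assumes that the summary is the ratio of the mean F_ST to the mean per-locus maximum (a ratio of means rather than a mean of per-locus ratios); the function name `proximity_to_bound` and the toy numbers are hypothetical, not data from the article.

```python
def proximity_to_bound(fst_values, fmax_values):
    """Ratio of mean F_ST to mean maximal F_ST across Z loci.

    fst_values[z] is F_ST at locus z; fmax_values[z] is the upper bound
    on F_ST at that locus given its M and the number of subpopulations K.
    """
    if len(fst_values) != len(fmax_values) or not fst_values:
        raise ValueError("need one bound per locus and at least one locus")
    Z = len(fst_values)
    mean_fst = sum(fst_values) / Z
    mean_fmax = sum(fmax_values) / Z
    return mean_fst / mean_fmax


# Toy example with three loci (illustrative numbers only).
print(proximity_to_bound([0.10, 0.20, 0.30], [0.40, 0.80, 0.60]))
```

A ratio of means weights loci by the size of their permissible range, so loci with tight bounds contribute less than they would to an average of per-locus ratios F_z/F_{max,z}.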

Figure 4

F_ST values for comparisons involving humans and chimpanzees based on multiallelic microsatellite loci. (a) F_ST between humans and chimpanzees, considering K = 2 subpopulations (humans, chimpanzees). (b) F_ST among K = 6 chimpanzee subpopulations. In (a,b), colours represent the number of points in a neighbourhood of radius 0.03; red points indicate the mean M and F_ST, and vertical red segments indicate the permissible range of F_ST at the mean M. (c) F_ST, computed using equation (2.1), and F_ST/F_max, computed using equations (2.1) and (3.1). Each point plotted represents one locus.

Interestingly, the effect of K on F_ST is largely eliminated when each F_ST value is normalized by the associated maximum given K and M (figure 4c). The normalization leads to higher values for human–chimpanzee comparisons than among chimpanzee subpopulations (0.20 among chimpanzee subpopulations; p = 1.1 × 10^−9, Wilcoxon rank sum test), as expected from the greater evolutionary distance between humans and chimpanzees compared to that among chimpanzees.

Discussion

We have analysed the range of values that F_ST can take as a function of the frequency M of the most frequent allele at a multiallelic locus, for an arbitrary value of the number of subpopulations K. We showed that F_ST can span the full unit interval only for a finite set of values of M, at M = k/K for integers k in [1, K − 1]. For all other M, F_ST necessarily lies below 1. As the number of subpopulations K increases, the range of values that F_ST can take enlarges. This study provides the most complete relationship between F_ST and M obtained to date, generalizing previous results for the case of K = 2 subpopulations [12] and for a restriction to I = 2 alleles [14]. Interestingly, the maximal F_ST we have obtained merges patterns observed in these previous studies. Fixing K = 2, we obtain the upper bound on F_ST in terms of M that was reported by Jakobsson et al. [12]. As K increases, the piecewise pattern seen by Jakobsson et al. [12] for the maximal F_ST in the K = 2 case for M in (0, 1/2) is observed in the multiallelic case for M in (0, 1/K). The decay from (M, F_ST) = (1/2, 1) to (M, F_ST) = (1, 0) seen by Jakobsson et al. [12] for K = 2 is observed in the decay from ((K − 1)/K, 1) to (1, 0) for arbitrary K. The allele frequency values for which the upper bound is reached for M in (0, 1/K) generalize those seen for the case of K = 2 and M in (0, 1/2) [12]. The upper bound is reached when all alleles are private, each subpopulation has as many alleles as possible at frequency KM, and at most one additional allele. The allele frequency values for which the upper bound is reached for M in ((K − 1)/K, 1) also generalize those seen for K = 2 and M in [1/2, 1): the maximum is reached when the most frequent allele is fixed in all subpopulations except one, and a single private allele is present in this last subpopulation. The results from Alcala & Rosenberg [14] for I = 2 produce a more constrained upper bound on F_ST than for arbitrary I, with the domain of M restricted to [1/2, 1).
Nevertheless, many properties of the maximal F_ST we observe for unspecified I and M in (1/K, 1) are similar to those seen for I = 2 and M in [1/2, 1): finitely many peaks at points M = k/K, local minima between the peaks, and an increase in coverage of the unit square for (M, F_ST) as K increases. The maximal F_ST functions for M in ((K − 1)/K, 1) for unspecified I and for I = 2 agree, as the number of alleles required to maximize F_ST in this interval in the case of unspecified I is simply equal to 2. In assuming that the number of alleles is unspecified, we found that, writing σ1 = KM, the number of distinct alleles needed for achieving the maximal F_ST is K⌈1/σ1⌉ for M in (0, 1/K) and K − ⌊σ1⌋ + 1 for M in (1/K, 1) with non-integer σ1; the maximum can be achieved with each number of distinct alleles in [⌈K/σ1⌉, K − σ1 + 1] for M equal to 1/K, 2/K, …, (K − 1)/K. With a fixed maximal number of distinct alleles, such as in the I = 2 case of Alcala & Rosenberg [14] with K specified and in the K = 2 case with I specified [13], the upper bound on F_ST is less than or equal to that seen in the corresponding unspecified-I case. For K = 2, specifying I has a relatively small effect in reducing the maximal value of F_ST [13]. As in Edge & Rosenberg [13], specifying I in the case of larger values of K is expected to have the greatest impact on the F_ST upper bound at the lowest end of the domain for M. In coalescent simulations, we found that the joint distribution of M and F_ST within their permissible space can help separate the impact of mutation and migration. Although the dependence of F_ST on mutation and migration rates has been long documented, the symmetric effects of mutation and migration under the island model [22] illustrate the difficulty in separating their effects. Under the island model, allele frequency M is informative about the scaled mutation rate 4Nμ, and comparing the value of F_ST to its maximum given M is informative about the scaled migration rate 4Nm.
Adding a dimension that is more sensitive to mutation than to migration (M in our case) enables the separation of their effects. Other statistics, such as total heterozygosity H_T or within-subpopulation heterozygosity H_S, have the potential to play a similar role [20]. Our results can inform data analyses. In particular, we caution users to examine upper bounds on F_ST to assess how mathematical constraints influence observations. As the constraints are strongest for K = 2, this step is valuable in pairwise comparisons; it is also useful when the frequency M of the most frequent allele can be small in relation to the number of populations K, such as for high-diversity forensic [28] and immunological [29] loci in human populations. Visual inspection of the values of M and F_ST within their bounds can suggest that constraints have an effect. The ratio $\bar{F}_{ST}/\bar{F}_{\max}$ can provide a helpful summary by evaluating the proximity of F_ST values to their maxima. Further, joint use of M along with F_ST could be useful in various applications of F_ST, such as in inference of model parameters by approximate Bayesian computation [30] and machine learning [31]. F_ST outlier tests to detect local adaptation from multiallelic loci [32] could search for F_ST values that represent outliers not in the distribution of F_ST values, but rather, outliers in relation to associated upper bounds. Computing null distributions for F_ST conditional on M could enhance the approach. In an example data analysis, we have shown that taking into account mathematical constraints on F_ST can help understand puzzling F_ST behaviour. In our example, F_ST at a set of loci was higher when comparing K = 6 chimpanzee populations than when comparing humans and chimpanzees (K = 2), even though the same loci were used and the mean value for M was similar in the two comparisons. A comparison of F_ST values to their respective maxima explained these counterintuitive results.
We note that analyses of F in relation to M differ from analyses of F in relation to within-subpopulation statistics H_S and J = 1 − H_S, such as those performed in deriving the influential Hedrick’s G'_ST [9] and Jost’s D [33] statistics. We have previously shown that for biallelic loci in K subpopulations, for fixed M, the statistics F, G'_ST, and D are all maximized at the same set of allele frequency values [15]. Although the normalizations of F used to produce G'_ST and D lead to statistics that are unconstrained in the unit interval as functions of H_S, G'_ST and D continue to be constrained as functions of M. A statistic that instead normalizes F by its maximum as a function of M, a statistic of the total population, captures aspects of the allele frequency dependence of F that differ from those captured by normalizations by functions of within-subpopulation statistics. In human populations, efforts to understand F patterns trace in large part to Lewontin’s foundational F-like variance-partitioning computation [34], in which it was seen that among-population differences (analogous to F) were small relative to within-population differences (analogous to 1 − F). Studies using loci with different numbers of alleles, loci with different frequencies for the most frequent allele, and samples with different numbers of subpopulations have varied to some extent in their numerical estimates of F [14,35-38]. Mathematical results on F bounds provide part of the explanation for these differences: they establish that each dataset differing in the character of its loci and subpopulation set has its own distinctive interval in which its associated F calculation could potentially land. Hence, each dataset can give rise to a numerically distinct value not due to features of the underlying human biology, but rather, due to different constraints on the F measure itself. 
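To make the dependence on the number of subpopulations concrete, the sketch below evaluates the upper bound on F in the simplest setting, a biallelic locus (I = 2) in K equally weighted subpopulations. The closed form used here follows from maximizing the sum of squared subpopulation frequencies for a fixed mean M, which places the allele at frequency 1 in ⌊KM⌋ subpopulations, at the fractional part of KM in one subpopulation, and at 0 elsewhere. It is offered as an illustration in the spirit of the biallelic setting of [14], not as the multiallelic bound derived in this study, and the function name max_f_biallelic is hypothetical.

```python
from fractions import Fraction
from math import floor


def max_f_biallelic(M, K):
    """Upper bound on F for I = 2 alleles in K equally weighted subpopulations.

    M is the mean frequency of the most frequent allele, with 1/2 <= M < 1.
    Derived by minimizing H_S for fixed H_T = 2M(1 - M); an illustrative
    sketch, not the multiallelic bound.
    """
    if K < 2 or not (Fraction(1, 2) <= M < 1):
        raise ValueError("need K >= 2 and 1/2 <= M < 1")
    sigma = K * M                   # sum of the allele's frequencies over subpopulations
    frac = sigma - floor(sigma)     # fractional part of K * M
    return 1 - frac * (1 - frac) / (K * M * (1 - M))


# Same M, different numbers of subpopulations (hypothetical values).
M = Fraction(3, 4)
for K in (2, 6):
    print(K, max_f_biallelic(M, K))  # prints "2 1/3" then "6 7/9"

# At M = k/K the fractional part vanishes and the bound reaches 1.
print(max_f_biallelic(Fraction(2, 3), 6))  # 1
```

Under these assumptions, the same value M = 3/4 permits F no larger than 1/3 when K = 2 but as large as 7/9 when K = 6. This is one way to see why datasets with different numbers of subpopulations have different attainable intervals for F.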
F bounds contribute to explaining quantitative variation in variance-partitioning computations—in which, although numerical values differ, the within-population component of genetic variation consistently predominates. The mathematics serves to support the qualitative claim that worldwide human genetic differentiation measurements represented by F-like statistics have low values—as was argued by Lewontin 50 years ago.

36 in total

1. G(ST) and its relatives do not measure differentiation.

Authors: Lou Jost
Journal: Mol Ecol Date: 2008-09 Impact factor: 6.185

2. Exegeses on maximum genetic differentiation.

Authors: François Rousset
Journal: Genetics Date: 2013-07 Impact factor: 4.562

3. Mathematical Constraints on F_ST: Biallelic Markers in Arbitrarily Many Populations.

Authors: Nicolas Alcala; Noah A Rosenberg
Journal: Genetics Date: 2017-05-05 Impact factor: 4.562

4. ESTIMATING F-STATISTICS FOR THE ANALYSIS OF POPULATION STRUCTURE.

Authors: B S Weir; C Clark Cockerham
Journal: Evolution Date: 1984-11 Impact factor: 3.694

5. G'_ST, Jost's D, and F_ST are similarly constrained by allele frequencies: A mathematical, simulation, and empirical study.

Authors: Nicolas Alcala; Noah A Rosenberg
Journal: Mol Ecol Date: 2019-04 Impact factor: 6.185