Bray-Curtis (AFD) differentiation in molecular ecology: Forecasting, an adjustment ($^{A}A$), and comparative performance in selection detection.

Abstract

Geographic genetic differentiation measures are used for purposes such as assessing genetic diversity and connectivity, and searching for signals of selection. Confirmation by unrelated measures can minimize false positives. A popular differentiation measure, Bray-Curtis, has been used increasingly in molecular ecology, renamed AFD (hereafter called BCAFD). Critically, BCAFD is expected to be partially independent of the commonly used Hill "Q-profile" measures. BCAFD needs scrutiny for potential biases, by examining limits on its value, and comparing simulations against expectations. BCAFD has two dependencies on within-population (alpha) variation, undesirable for a between-population (beta) measure. The first dependency is derived from similarity to $G_{ST}$ and $F_{ST}$. The second dependency is that BCAFD cannot be larger than the highest allele proportion in either location (alpha variation), which can be overcome by data-filtering or by a modified statistic $^{A}A$, or "Adjusted AFD". The first dependency does not forestall applications such as assessing connectivity or selection, if we know the measure's null behavior under selective neutrality with specified conditions. This article shows that null behavior for $^{A}A$, at equilibrium and nonequilibrium, for the commonly used data type of single-nucleotide polymorphisms (SNPs) in two locations. Thus, $^{A}A$ can be used in tandem with mathematically contrasting differentiation measures, with the aim of reducing false inferences. For detecting adaptive loci, the relative performance of $^{A}A$ and other measures was evaluated, showing that it is best to use two mathematically different measures simultaneously, and that $^{A}A$ is in one of the best such pairwise criteria. For any application, using $^{A}A$, rather than BCAFD, avoids the counterintuitive limitation by maximum allele proportion within localities.

Keywords: adaptation; allele frequency difference; biodiversity; genetic differentiation; mutual information; outlier loci

Year: 2022 PMID: 36110882 PMCID: PMC9465203 DOI: 10.1002/ece3.9176

Source DB: PubMed Journal: Ecol Evol ISSN: 2045-7758 Impact factor: 3.167

INTRODUCTION

Comparisons of biodiversity between regions are important aspects of understanding both ecological and genetic systems. There are many geographic differentiation measures, used for purposes such as assessing genetic diversity and connectivity (Gruber et al., 2018; Guillot et al., 2005; Manni et al., 2004; Meirmans, 2020; Sherwin et al., 2017, 2021) and searching for signals of geographically different selective regimes, a search that is expected to have high false-positive rates (Bierne et al., 2013; Lotterhos & Whitlock, 2014; Narum & Hess, 2011; Schneider et al., 2021; Xiang-Yu et al., 2016). Because of the anticipated high false-positive rates, it is important to confirm findings using a wide range of mathematically unrelated measures. Often these measures are chosen from the Hill "Q-profile," which includes: counts or sharing of allelic types (Q = 0 measures); Shannon information and differentiation (Q = 1); and heterozygosity, nucleotide diversity, Gini-Simpson, $G_{ST}$, $F_{ST}$, Morisita-Horn, and $D_{EST}$ (Q = 2) (Chao et al., 2014; Gaggiotti et al., 2018; Jost, 2008; Jost et al., 2010; Sherwin et al., 2017, 2021). However, despite their different sensitivity to some matters, such as rare and common alleles, the members of the Q-profile are all mathematically related (Sherwin et al., 2017, 2021). Notably, one recent addition to the range of measures in molecular ecology is outside the Hill Q-profile: the Bray–Curtis index of dissimilarity, a method that is extremely popular in its original use, assessing differentiation between species assemblages (Bray & Curtis, 1957). During 2021 alone, Bray-Curtis was mentioned over 10,000 times in Google Scholar. Bray-Curtis can be expressed in a way that facilitates comparison with differentiation measures derived from Hill numbers; the mathematical equivalence to other formulations of Bray-Curtis is documented in Chao & Chiu (2016), Jost et al. (2010), Ricotta et al. (2021), Ricotta & Pavoine (2022), and Ricotta & Podani (2017):

$$BC = \frac{\sum_{j=1}^{S} |N_{1j} - N_{2j}|}{\sum_{j=1}^{S} (N_{1j} + N_{2j})} \quad (1)$$
where $N_{1j}$ and $N_{2j}$ are the abundances (counts or frequencies N) in each of two locations (1,2), for variant j ($j = 1, \ldots, S$), and S is the total number of species. This measure satisfies many of the requirements of a good measurement of differentiation between assemblages (Chao & Chiu, 2016; Magurran, 2004; Ricotta & Podani, 2017). This index is also used for analysis of operational taxonomic units in metagenomics (Peng et al., 2020). Unification of ecological and genetic approaches is desirable, because of their interaction as parts of the same biological systems, and because of their underlying mathematical similarities (Rosindell et al., 2015; Sherwin, 2018), so it is encouraging that a simplified version of Bray-Curtis has been proposed as a measure of differentiation in molecular ecology and evolution (Berner, 2019a, 2019b; Shriver et al., 1997), echoing a similar measure in community ecology (Whittaker, 1975, p. 118). It was renamed "allele frequency difference" (AFD), but I will call it BCAFD, in deference to its original proponents, and because it is a difference of proportions (p) rather than frequencies (N). In the two-variant two-location case, Bray-Curtis simplifies to the unsigned difference between locations 1 and 2 of proportions of either of the two allelic variants (Berner, 2019a, 2019b):

$$BCAFD = |p_1 - p_2| \quad (2)$$

where $p_1 = N_{11}/(N_{11} + N_{12})$ and $q_1 = 1 - p_1$, and similarly for the other location, $p_2$ and $q_2$. When there are multiple alleles, it is suggested to use the sum of the absolute allele proportion differences divided by two (Berner, 2019a, Table S1), which is equivalent to the more general Equation (1). However, unless otherwise stated, this article deals with the biallelic case that is very common in current molecular ecology: SNPs, or single-nucleotide polymorphisms.
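The relationship between the general Bray-Curtis index and its two-variant simplification can be checked numerically. The sketch below assumes the common sum-of-absolute-differences form of Bray-Curtis and the biallelic form |p1 - p2|; the function names `bray_curtis` and `bcafd` are illustrative and do not come from the article or from Berner's software.

```python
from typing import Sequence


def bray_curtis(abund1: Sequence[float], abund2: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity between two locations.

    Assumed form: sum of absolute abundance differences divided by the
    total abundance over both locations.
    """
    if len(abund1) != len(abund2):
        raise ValueError("both locations must list the same variants")
    total = sum(abund1) + sum(abund2)
    if total == 0:
        raise ValueError("total abundance must be positive")
    return sum(abs(a - b) for a, b in zip(abund1, abund2)) / total


def bcafd(p1: float, p2: float) -> float:
    """Biallelic simplification: unsigned difference in allele proportion."""
    return abs(p1 - p2)


# With proportions (each location sums to 1), the general index and the
# biallelic simplification agree up to floating-point rounding.
p1, p2 = 0.7, 0.4
general = bray_curtis([p1, 1 - p1], [p2, 1 - p2])
simple = bcafd(p1, p2)
```

When proportions are used as the abundances, the denominator is always 2, which is why the multiallelic version amounts to half the summed absolute differences.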
In the molecular ecology literature, BCAFD has been used or mentioned many times since Berner's publications (Berner, 2019a, 2019b), including for assessment of population differentiation in time or space, with implications for likely genetic connectivity (Amos, 2021; Lou et al., 2021; Popovic et al., 2021; Subramanian, 2021; Taylor et al., 2021; Weldekidan et al., 2022; Wolf et al., 2021), as well as identifying candidate adaptive loci by their strong differentiation relative to other presumably neutral loci (Bharti et al., 2021; Boyle et al., 2021; Haenel, Guerard, et al., 2021; Haenel, Oke, et al., 2021; Price et al., 2020; Yin et al., 2021; Zhou et al., 2021). For applications including selection detection and assessment of connectivity between locations, it is critical to know the measure's null behavior, that is, its behavior in the absence of selection ("neutrality"), with specified conditions such as population size, dispersal, and mutation (Bierne et al., 2013; Gruber et al., 2018; Guillot et al., 2005; Lotterhos & Whitlock, 2014; Manni et al., 2004; Meirmans, 2020; Narum & Hess, 2011; Schneider et al., 2021; Sherwin et al., 2017, 2021; Xiang-Yu et al., 2016). Despite not belonging to the Hill Q-profile, BCAFD appears to have some mathematical relationship to two of the Hill measures: $G_{ST}$ and $F_{ST}$ (Appendix 1). Therefore, based on forecasts for those two measures, it will be shown that it is possible to develop forecasts for BCAFD for two-location, two-variant systems such as single-nucleotide polymorphisms (SNPs). All diversity measures must be scrutinized for their particular properties (Leinster & Cobbold, 2012; Leinster, 2021; Sherwin et al., 2017; Sherwin et al., 2021).
An important property of differentiation measures is independence between alpha (within-location) variation, beta (between-location) differentiation, and total (gamma) variation (Chao et al., 2014; Gaggiotti et al., 2018; Jost, 2008; Jost et al., 2010; Leinster, 2021; Sherwin et al., 2017; Sherwin et al., 2021). Critically, $G_{ST}$ and $F_{ST}$ are well known to have the serious limitation of being heavily influenced by within-location variation (alpha), something that is not desirable in a between-location (beta) differentiation measure. Although $F_{ST}$ was explicitly proposed as a measure of between-subgroup differentiation (Wright, 1943) and has been used for that extensively, unlike some other Hill-profile measures, $F_{ST}$ shows strong dependence on alpha within-locality diversity, as does the related measure $G_{ST}$ (Jost, 2008; Meirmans & Hedrick, 2010; Nei, 1977, 1973). Because of its relationship to $G_{ST}$ and $F_{ST}$, it is likely that there will be dependency of BCAFD on alpha variation. Another dependency of BCAFD on alpha variation is obvious from Equation (2): BCAFD can never be larger than $p_{max}$, the higher of the two allele proportions, $p_1$ and $p_2$. In other words, if either $p_1$ or $p_2$ is zero, then the value of Bray-Curtis will be equal to the other, more abundant, proportion. Of course, the values $p_1$ and $p_2$ are within-location proportions of one of the two alleles, that is, a within-population (alpha) measure. This is an extremely counterintuitive limitation on a between-location (beta) differentiation measure, and is expected to result in biased values. This might be particularly important when using the measure to search for loci that experience different directions of selection in different locations, because this difference of selective regime will give a signal of large differentiation values between locations, relative to other neutral loci. As a result, the truncation of large values of BCAFD due to $p_{max}$ might be expected to reduce the ability to distinguish such adaptive loci from neutral loci.
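The upper bound described above follows directly from the biallelic form |p1 - p2|, and can be illustrated with a brute-force check. This is only a sketch: the helper names and the number of random trials are arbitrary choices, not part of the article.

```python
import random


def bcafd(p1: float, p2: float) -> float:
    """Biallelic BCAFD: unsigned difference in allele proportion."""
    return abs(p1 - p2)


def check_pmax_bound(n_trials: int = 10_000, seed: int = 1) -> bool:
    """Return True if BCAFD never exceeds max(p1, p2) over random proportions."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        p1, p2 = rng.random(), rng.random()
        if bcafd(p1, p2) > max(p1, p2):
            return False
    return True


bound_holds = check_pmax_bound()

# Edge case from the text: if one proportion is zero, BCAFD equals the other.
edge_case = bcafd(0.0, 0.3)
```

The bound is attained only when one of the two proportions is zero, which is why loci with a low maximum proportion cannot show a large BCAFD no matter how different the two locations are.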
The confounds with alpha variation due to the relationship to $G_{ST}$ and $F_{ST}$, and the restriction by maximal allele proportion $p_{max}$, require examination in this article; however, another possible confound does not appear to be of concern. As well as the proportions of variants, a between-location (beta) differentiation measure can be confounded by the number of variant types. This confound can be avoided by restriction to two-variant systems such as SNPs, as is done in this article. Also, it does not appear to be a problem for the multiallelic version of BCAFD (Equation (1); also Berner, 2019a, Table S1). When there is maximal differentiation, that is, no alleles shared between locations, one expects to always get the maximal value for the genetic differentiation statistic. This in fact does happen. For example, if there are four alleles w, x, y, and z, with w and x present only in location 1 and the other two present only in location 2, so that all other proportions are equal to zero, then the multiallelic statistic is equal to BCAFD = 1.0. Also, if location 1 only has allele w, and the other three alleles are in location 2, then the multiallelic statistic remains BCAFD = 1.0, as expected for the same situation of maximal differentiation (no shared alleles). Irrespective of these confounds, it should be noted that alpha-dependency does not forestall all use of these measures, provided that we know their behavior under selective neutrality with specified conditions such as population size, dispersal, and mutation (Bierne et al., 2013; Gruber et al., 2018; Guillot et al., 2005; Lotterhos & Whitlock, 2014; Manni et al., 2004; Meirmans, 2020; Narum & Hess, 2011; Schneider et al., 2021; Sherwin et al., 2017, 2021; Xiang-Yu et al., 2016). With this in mind, and responding to the increased use of BCAFD in molecular ecology described above, this paper carries out the following tasks:
(1) It creates a modified version of BCAFD, termed $^{A}A$ ("Adjusted AFD"), that is corrected for the limitation by $p_{max}$.
(2) Forecasts are made and tested, for Bray-Curtis (BCAFD) and $^{A}A$, for selectively neutral biallelic SNPs (a very common data type at present) under various scenarios of population size, mutation, and dispersal. This will allow BCAFD, and especially $^{A}A$, to be used for evaluating competing models of population connectivity, making projections for the future, or identifying outlier loci whose differentiation level departs from neutral expectations, and so are candidate adaptive loci.
(3) Simulations are performed to investigate how the $^{A}A$ correction for bias performs in detecting loci under directional selection, in comparison to competing measures, or in concert with those measures.
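The maximal-differentiation examples with four alleles (w, x, y, z) given earlier in this introduction can be reproduced with the multiallelic form, half the sum of absolute differences in allele proportion. The specific proportions used below (0.5 each, and one third each) are assumptions chosen for illustration; any proportions with no shared alleles should give the same result.

```python
from typing import Sequence


def multiallelic_bcafd(props1: Sequence[float], props2: Sequence[float]) -> float:
    """Half the summed absolute differences in allele proportion."""
    if len(props1) != len(props2):
        raise ValueError("both locations must list the same alleles")
    return sum(abs(a - b) for a, b in zip(props1, props2)) / 2


# Alleles ordered as (w, x, y, z).
# Case 1: w and x only in location 1; y and z only in location 2.
case_two_vs_two = multiallelic_bcafd([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5])

# Case 2: only w in location 1; x, y and z only in location 2.
third = 1 / 3
case_one_vs_three = multiallelic_bcafd([1.0, 0.0, 0.0, 0.0], [0.0, third, third, third])
```

Both cases evaluate to 1.0 (up to floating-point rounding), so the number of allelic types per location does not lower the maximum in this situation.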

MATERIALS AND METHODS

Forecasting equations for Bray-Curtis were developed for the common and simple case of a single neutral biallelic SNP locus, with two locations (1,2); the measure can be averaged over multiple loci, and can be applied to haploids, or to diploids in Hardy–Weinberg equilibrium (i.e., no population-wide correlation between the two alleles within diploid genotypes). When there are only two variants, the Bray-Curtis equation is $BCAFD = |p_1 - p_2|$ (Berner, 2019a, 2019b) (Equation 2, above), where $p_1$ and $p_2$ are proportions of one of the two alleles at each location. The quantity in Equation (2) is a transform of two well-known differentiation measures, $G_{ST}$ and $F_{ST}$ (Halliburton, 2004; Wright, 1943) (Equation 3). In Equation (3), $\sigma_p^2$ is the variance of p between locations; H is the Hardy–Weinberg (binomial) expected heterozygosity, for example, $H_T = 2\bar{p}(1 - \bar{p})$; and $\bar{p}$ is the average p over the two locations (1,2), $\bar{p} = (p_1 + p_2)/2$. The measures $G_{ST}$ and $F_{ST}$ in Equation (3) are identical in the two-allele, two-location case (Halliburton, 2004, Box 9.5). Appendix 1 shows how BCAFD can be written in terms of these measures (Equation 4). Because of its close relationship to $G_{ST}$ or $F_{ST}$, BCAFD forecasts can be based on well-known forecasts for those measures (Appendix 1). The expectation for diploid BCAFD at drift-dispersal-mutation equilibrium is given by Equation (5), in which m is symmetrical dispersal between the two locations (0 ≤ m ≤ 1); μ is the rate of mutation (0 ≤ μ ≤ 1); N is the effective population size at each location (identical); and $^{2}D$ is the second-order Hill diversity, or effective number of alleles. The equivalent equation for the haploid SNPs simulated in this article is Equation (6). The performance of these equations was assessed by simulation of biallelic neutral single-nucleotide polymorphisms (SNPs) in two haploid subpopulations, for a wide range of scenarios covering all possible combinations of three symmetric dispersal rates (m = 0.01, 0.03, 0.1) and three subpopulation effective sizes (N = 1000, 10,000, 100,000). Starting allele proportions in each subpopulation (p values) were randomized in each replicate.
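The statement that |p1 - p2| is a transform of $F_{ST}$ can be sketched for two locations and two alleles. This is not a reproduction of Appendix 1: it assumes the textbook definition $F_{ST} = \sigma_p^2 / (\bar{p}\bar{q})$, the divide-by-two (population) variance across the two locations, and the shorthand $H_T = 2\bar{p}\bar{q}$; the article's own Equations (3) and (4) may be arranged differently.

```latex
\begin{align}
\bar{p} &= \tfrac{1}{2}(p_1 + p_2), \qquad \bar{q} = 1 - \bar{p}, \qquad H_T = 2\bar{p}\bar{q} \\
\sigma_p^2 &= \tfrac{1}{2}\left[(p_1 - \bar{p})^2 + (p_2 - \bar{p})^2\right]
            = \tfrac{1}{4}(p_1 - p_2)^2 \\
F_{ST} &= \frac{\sigma_p^2}{\bar{p}\bar{q}} = \frac{(p_1 - p_2)^2}{2 H_T} \\
|p_1 - p_2| &= \sqrt{2 H_T F_{ST}} = 2\sqrt{\bar{p}\,\bar{q}\,F_{ST}}
\end{align}
```

Under these assumptions the unsigned difference scales with the square root of total heterozygosity, which is one way to see why a forecast for $F_{ST}$ translates into a forecast for BCAFD only once within-location variation is specified.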
Simulations used the typical SNP mutation rate (μ = 10⁻⁹), but essentially identical results were obtained with rates between μ = 10⁻⁶ and 10⁻¹². The simulation was programmed in MATLAB; full details are in Appendix 2 and in Dewar et al. (2011). There were 1000 replicate iterations of each scenario, which could also be considered as 1000 independently inherited loci (i.e., in linkage equilibrium). Each iteration was run for 200 generations, and each generation included stochastic binomial sampling of the parents' alleles to establish the allele proportions for the offspring, followed by symmetrical dispersal to create the parent populations for the next generation. Because the forecasts are for drift-dispersal-mutation equilibrium, it is important to know whether the simulations had reached equilibrium. The adequacy of the run-time of 200 generations was confirmed in three ways, detailed in Appendix 2: 200 generations was several times longer than the expected time to half-equilibrium values; inspection ensured an asymptote to a stable value for BCAFD; and the variance of BCAFD between generations was much lower than the variance between replicate iterations (typically one tenth or less). The performance of the simulation was checked by comparison with results of EASYPOP (Balloux, 2001) and with known predictions for related differentiation measures (see Appendix 2 for details). To assess whether the expectation from Equation (6) was an adequate forecast of BCAFD, BCAFD was calculated at the final generation, then linear regression was used (in EXCEL). If the expectation from Equation (6) is accurate, a regression of the simulated BCAFD against the expected BCAFD should have a slope of unity and an intercept of zero. Additionally, alpha-dependence was assessed, and possible corrections suggested, including an adjusted measure $^{A}A$ that has no limitation by $p_{max}$.
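The generation cycle described above (binomial sampling, then symmetrical dispersal, repeated for 200 generations) can be sketched in Python with NumPy. This is not the author's MATLAB program: the function name `simulate_bcafd`, the symmetric two-way mutation step, and its placement before sampling are assumptions made for illustration.

```python
import numpy as np


def simulate_bcafd(n_eff: int, m: float, mu: float = 1e-9,
                   generations: int = 200, replicates: int = 1000,
                   seed: int = 0) -> np.ndarray:
    """Neutral haploid two-location simulation; returns final BCAFD per replicate."""
    rng = np.random.default_rng(seed)
    # Randomized starting allele proportions: one row per replicate,
    # one column per location.
    p = rng.random((replicates, 2))
    for _ in range(generations):
        # Assumed symmetric mutation between the two alleles.
        p = p * (1 - mu) + (1 - p) * mu
        # Drift: binomial sampling of n_eff haploid offspring per location.
        p = rng.binomial(n_eff, p) / n_eff
        # Symmetrical dispersal: each location receives a fraction m
        # of its parents from the other location.
        p = (1 - m) * p + m * p[:, ::-1]
        # Guard against floating-point drift outside [0, 1].
        p = np.clip(p, 0.0, 1.0)
    return np.abs(p[:, 0] - p[:, 1])


# Example: one of the nine scenarios (N = 1000, m = 0.1).
final_bcafd = simulate_bcafd(n_eff=1000, m=0.1)
```

Each replicate can be read as an independent locus, as in the article; comparing `final_bcafd` with a forecast would additionally need the forecasting equation, which is not reproduced here.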
In other investigations, I examined the relationship between BCAFD and three other differentiation measures: $G_{ST}$, $D_{EST}$, and mutual information, I (Sherwin et al., 2017, 2021). I also examined whether the forecasts could be made completely independent of within-location variation. Finally, I produced nonequilibrium forecasts, suitable for situations where there has been recent disturbance to connectivity, for example. Simulations were used to investigate the effect of the adjusted measure $^{A}A$ on detectability of loci under different directional selection in each population. These simulations were identical to the ones described above, with two alterations. First, the simulations were restricted to large population size and low dispersal (N = 100,000, m = 0.01). Second, selection was simulated each generation by, in one location, increasing the number of surviving progeny of one genotype by multiplying by a factor of 1 + s/2, and, in the other location, decreasing the same genotype by multiplying by 1 − s/2 (s = 0.001, 0.003, 0.005, 0.05). The highest selection strength (s = 0.05) would be expected to result in very high differentiation after the 200-generation simulation period. At the final generation, the program calculated the genetic differentiation measures: $^{A}A$; BCAFD; $G_{ST}$; $D_{EST}$ (Jost, 2008); and mutual information (Sherwin et al., 2017, 2021). For each measure, I tallied the percentage of loci (out of 1000 simulated) that would be identified as outliers (i.e., potentially under selection) using the "univariate" criterion that their genetic differentiation values were in the top 1% of the 1000 loci simulated without selection in a parallel neutral simulation, separately for each one of the five differentiation measures.
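The per-generation selection step described above can be written as a reweighting of allele proportions in a haploid population. The function below is a hypothetical sketch of that step only; the name `apply_selection` and the renormalization by mean fitness are assumptions, since the article's Appendix 2 is not reproduced here.

```python
def apply_selection(p: float, s: float, favoured: bool) -> float:
    """Allele proportion after one round of selection on the focal genotype.

    The focal genotype's survival is multiplied by 1 + s/2 where it is
    favoured and by 1 - s/2 where it is disfavoured; the other genotype
    keeps weight 1, and proportions are renormalized.
    """
    weight = 1 + s / 2 if favoured else 1 - s / 2
    return p * weight / (p * weight + (1 - p))


# Opposite directions of selection in the two locations, starting from p = 0.5.
p_location_1 = apply_selection(0.5, s=0.05, favoured=True)
p_location_2 = apply_selection(0.5, s=0.05, favoured=False)
```

In a full simulation this step would be applied every generation, alongside the sampling and dispersal steps, so small per-generation shifts accumulate over the 200 generations.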
As well as those univariate criteria, the same analysis was repeated using a series of more restrictive "bivariate" criteria: for a locus in the selection simulation to be identified as an outlier, it was required to have differentiation in the top 1% of neutral loci for each of a pair of the differentiation measures listed above. For each of these diagnoses (univariate or bivariate), the true positive (TP) count was the number of loci known to be under selection that were actually identified as being under selection, out of the total of 1000 independent loci simulated with selection. The false positive (FP) count was the number of loci identified as being under selection in the parallel neutral simulation, again of 1000 loci; with the univariate criteria this of course must be 1% of 1000 = 10 loci, but the bivariate criteria are expected to be more restrictive, giving lower FP. Then I calculated a performance value separately for each strength of selection. The performance value is the percentage of loci that are true positive, out of all loci that were identified as outliers potentially under selection (TP + FP); in the case that 1% of all loci were under that selective regime, and all other loci were neutral, the calculation is 100 × (TP × 0.01)/[(TP × 0.01) + (FP × 0.99)]. Of course, the proportions of neutral and selected loci would not be known beforehand, but given that the analysis is standardized to a constant univariate FP rate, the performance values can be used to compare the performance of the different criteria.
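The outlier criteria and the performance value described above can be sketched as follows. The function names are illustrative, and the use of `numpy.quantile` to set the top-1% cutoff from the neutral loci is an assumption about how that cutoff might be computed; the performance formula is the one stated in the text.

```python
import numpy as np


def outlier_flags(values, neutral_values, top_fraction: float = 0.01) -> np.ndarray:
    """Flag loci whose value exceeds the top-`top_fraction` cutoff of neutral loci."""
    cutoff = np.quantile(neutral_values, 1 - top_fraction)
    return np.asarray(values) > cutoff


def bivariate_flags(values_a, neutral_a, values_b, neutral_b,
                    top_fraction: float = 0.01) -> np.ndarray:
    """A locus is an outlier only if it passes the cutoff for both measures."""
    return (outlier_flags(values_a, neutral_a, top_fraction)
            & outlier_flags(values_b, neutral_b, top_fraction))


def performance(tp: float, fp: float, frac_selected: float = 0.01) -> float:
    """Percentage of flagged loci that are truly selected.

    tp and fp are counts per 1000 selected and 1000 neutral loci; the
    genome is assumed to contain `frac_selected` selected loci.
    """
    weighted_tp = tp * frac_selected
    weighted_fp = fp * (1 - frac_selected)
    return 100 * weighted_tp / (weighted_tp + weighted_fp)


# Check against two reported rows: univariate adjusted AFD at s = 0.001
# (TP = 488.18, FP = 10) and the G_ST with D_EST pair (TP = 441.36, FP = 4.91).
perf_univariate = performance(488.18, 10)
perf_bivariate = performance(441.36, 4.91)
```

These two values round to 33.03 and 47.59, matching the corresponding performance entries in Table 2, which suggests the formula is being applied to the mean TP and FP counts.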

RESULTS

Trials of Equation (6) used the data from the haploid simulation program described above. Figure 1a shows simulated BCAFD (Equation 2), calculated for all 9000 datapoints (nine scenarios × 1000 replicates), regressed against algebraic predictions (Equation 6) of BCAFD for each replicate in each scenario (again 9000 points). The predictions have to be made separately for each replicate because the stochastic nature of the simulations results in each replicate having a different final value of the diversity term that is used in Equation 6. Five things are apparent in Figure 1a:

FIGURE 1

(a) Comparison of simulation results with algebraic predictions for BCAFD; 9000 points from the 1000 replicates of each of nine neutral scenarios (effective size N = 1000, 10,000, 100,000; dispersal rate m = 0.01, 0.03, 0.10), with regression equation (Simulated-BCAFD) = 0.83 × (Predicted-BCAFD from Equation 6) (significance P reported as zero by the program, $R^2$ = .50, intercept negligibly different from zero: −7.6 × 10⁻⁵). The black line is the regression line; the red line is the expected 1:1 relationship.
(b) The same data again, using the correction for the limitation by maximum p, that is, a plot of $^{A}A = |p_1 - p_2| / (0.6152 + 0.3985 \times p_{max})$ against the expectation shown in Equations (6) and (8). In this case, the expected 45-degree plot is achieved, with the expected slope of unity (slope coefficient = 1.00, 95% confidence limits 0.98 to 1.02, significance P reported as zero by the program, R = 0.50, intercept negligibly different from zero: 0.0004). The red line for 1:1 slope is coincident with the regression line.
(c) The nine scenarios from (b) plotted individually: comparison of simulation results with algebraic predictions, using $^{A}A$, the correction for the limitation by maximum p. Each panel shows 1000 points from the 1000 replicates of one scenario, whose dispersal rate m and effective size N are shown in the panel's headline. The slopes of regression lines are shown on the panels, with 95% confidence intervals, which included unity in all except two marginal cases, and are therefore each concordant with the overall result shown in (b) and the relationship in Equation (8). In all cases, the intercept was negligibly different from zero, and P for significance was < 10⁻¹⁸.

(1) There are distinct clumps of points, which identify scenario-specific upper limits on the expected BCAFD (Equation 6): 0.035, 0.064, and 0.111.
(2) There appears to be an oblique upper bound to the scatter of points from the 1000 replicates of each scenario; this will be discussed later.
(3) Despite the scatter of replicates, there is an extremely good regression of simulated BCAFD on predicted BCAFD (significance P was extremely low, assigned to zero by the program; see caption of Figure 1a). Note that the scatter is not unexpected given that the initial allele proportions were randomized.
(4) The intercept is extremely close to zero, as expected.
(5) However, the slope is below the expected 45-degree line for perfect prediction, with a slope of 0.83 (see caption of Figure 1a); the 95% confidence limits for the slope were 0.81 to 0.85, so the limits did not include the expected unity.

In the introduction it was pointed out that the value of BCAFD is restricted by the maximum p value in either of the two locations, at the generation where BCAFD is calculated. This is a potential reason for the oblique upper bound for the observations in Figure 1a. To investigate this, the regression of simulated BCAFD on expected BCAFD was repeated on ten subsets of the 9000 datapoints, subdivided by the final value of $p_{max}$, the maximum p in either of the two locations. Results in Table 1 show that the departure from a 1:1 slope is indeed due to the restriction by $p_{max}$. The bottom two rows of this table are where there is the least constraint on simulated BCAFD values (central $p_{max}$ of 0.85 and 0.95), and in these two cases the confidence limits for the slope of the regression of simulated BCAFD on expected BCAFD include unity, as expected. The slope of this regression decreases linearly when it is more constrained, that is, with lower $p_{max}$ values (Table 1 and Figure 2).

TABLE 1

The effect of $p_{max}$ on forecasts for BCAFD

Central $p_{max}$	$R^2$	P for significance	Intercept	Slope coefficient (95% CL)
0.05	.465	1.0 × 10⁻¹³¹	+0.0008	0.630258 (0.59–0.67)
0.15	.444	1.7 × 10⁻¹¹¹	+0.0020	0.673769 (0.62–0.72)
0.25	.420	2.3 × 10⁻¹⁰⁴	+0.0022	0.713834 (0.66–0.77)
0.35	.456	8.2 × 10⁻¹¹⁸	+0.0007	0.79615 (0.74–0.85)
0.45	.414	2.2 × 10⁻¹⁰⁶	+0.0024	0.766259 (0.71–0.83)
0.55	.482	4.5 × 10⁻¹²⁶	+0.0001	0.849727 (0.79–0.91)
0.65	.569	7.4 × 10⁻¹⁵⁸	−0.0015	0.900086 (0.85–0.95)
0.75	.482	2.8 × 10⁻¹²⁸	−0.0008	0.824037 (0.77–0.88)
0.85	.538	1.6 × 10⁻¹⁵¹	−0.0020	0.947642 (0.89–1.01)
0.95	.586	2.1 × 10⁻²⁰¹	−0.0023	1.042645 (0.99–1.10)

Note: The 9000 data points from Figure 1a, sorted by $p_{max}$ in the final generation. In the first column, "Central $p_{max}$" identifies the midpoint of each class of $p_{max}$ values, the classes being 0.1 wide (for example, 0.05 identifies the points with $p_{max}$ between 0 and 0.1). The remaining columns show the results of regression analysis of (Simulated-BCAFD) against (Predicted-BCAFD from Equation 6) for the subset of the datapoints identified in the left column. All regressions showed an intercept very close to zero, as expected. Large numbers of significant digits are retained in the slope coefficients because of their subsequent use in the analysis in Figure 2, where the coefficients are plotted against central $p_{max}$ values.

FIGURE 2

The effect of maximum p value ($p_{max}$) on the regression slope coefficient of (simulated BCAFD) on (expected BCAFD from Equation 6). This plot itself has a regression equation: slope coefficient = 0.6152 + 0.3985 × $p_{max}$, with $R^2$ = .90, and P = .000025. The values upon which the plot is based are taken from Table 1.
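The regression summarized in Figure 2 can be recomputed from the central maximum-p values and slope coefficients listed in Table 1. The sketch below uses an ordinary least-squares fit via `numpy.polyfit`; the article reports using EXCEL, so this is an independent check rather than the original calculation.

```python
import numpy as np

# Central p_max of each class and the corresponding slope coefficient (Table 1).
central_pmax = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])
slope_coeff = np.array([0.630258, 0.673769, 0.713834, 0.79615, 0.766259,
                        0.849727, 0.900086, 0.824037, 0.947642, 1.042645])

# Ordinary least-squares line: slope_coeff = intercept + gradient * central_pmax.
gradient, intercept = np.polyfit(central_pmax, slope_coeff, deg=1)

# Coefficient of determination for the fitted line.
fitted = intercept + gradient * central_pmax
ss_res = float(np.sum((slope_coeff - fitted) ** 2))
ss_tot = float(np.sum((slope_coeff - slope_coeff.mean()) ** 2))
r_squared = 1 - ss_res / ss_tot
```

The fitted intercept and gradient come out close to 0.6152 and 0.3985, the two constants that appear in the denominator of the adjusted statistic.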

There are two possible corrections for this dependency on maximum p value. First, the data could be filtered to only include loci with very high maximum p values (the highest $p_{max}$ classes in Table 1 and Figure 2), but of course this would greatly reduce the usable data. Second, because the regression in Figure 2 is very linear, one can correct the expectations for the effect seen in that figure, where (coefficient of simulated BCAFD on expected BCAFD) = 0.6152 + 0.3985 × $p_{max}$, so that we create a modified version of BCAFD, called $^{A}A$, which is free of dependence upon $p_{max}$:

$$^{A}A = \frac{|p_1 - p_2|}{0.6152 + 0.3985\,p_{max}} \quad (7)$$

We then find that the forecasts are general for all values of $p_{max}$, for haploid loci (Equation 8), or equivalently for diploid loci in Hardy–Weinberg equilibrium (Equation 9). Figure 1b shows the plot of $^{A}A$ (i.e., BCAFD adjusted to compensate for limitation by $p_{max}$) against the expectations from Equation (8). This regression shows the expected slope of unity and intercept of zero, demonstrating that the simulation confirms the haploid prediction for $^{A}A$ in Equation (8), including for each individual scenario (Figure 1c).
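The adjusted statistic can be computed directly from two allele proportions, using the constants given in the caption of Figure 1b (0.6152 and 0.3985). The function name `adjusted_afd` and its keyword arguments are illustrative only.

```python
def adjusted_afd(p1: float, p2: float,
                 intercept: float = 0.6152, gradient: float = 0.3985) -> float:
    """Adjusted AFD: |p1 - p2| divided by (intercept + gradient * p_max).

    p_max is the higher of the two allele proportions p1 and p2.
    """
    p_max = max(p1, p2)
    return abs(p1 - p2) / (intercept + gradient * p_max)


# A locus with a low maximum proportion is scaled up more than one
# with a high maximum proportion.
low_pmax = adjusted_afd(0.3, 0.0)
high_pmax = adjusted_afd(1.0, 0.7)
```

Because the denominator is below 1 whenever the maximum proportion is below about 0.97, the adjustment increases most values relative to unadjusted BCAFD, and does so most strongly where the maximum proportion is low.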
There are nonlinear relationships between $^{A}A$ and three other differentiation measures: $G_{ST}$, $D_{EST}$, and mutual information, I, as was suggested by a previous investigation of BCAFD (Berner, 2019a, 2019b) (Figure 3). This shows that $^{A}A$ provides information that is not linearly dependent on these other measures, which is important when using multiple measures for confirmation of results such as assessment of connectivity, and searches for loci potentially under selection.

FIGURE 3

$G_{ST}$, $D_{EST}$, and I (mutual information) plotted against $^{A}A$ (i.e., BCAFD corrected for maximum-value dependency). The three measures are distinguished by symbol (squares, discs, and triangles). All measures were from the same simulated dataset that was used in Figure 1.

As well as the equilibrium forecasts just described, it is important to have nonequilibrium forecasts for $^{A}A$, which will be relevant in many situations, including recently disturbed populations; nonequilibrium forecasts are shown in Equation (A1.11b). It was also investigated whether the dependence of BCAFD on within-location (alpha) variation could be fixed by basing the expectations for BCAFD not on $G_{ST}$, but upon a standardized alternative (Meirmans & Hedrick, 2010) that, unlike $G_{ST}$, is free of influence of within-population variation. In Equation (A1.14), it can be seen that this new formulation of BCAFD is still heavily dependent upon heterozygosity H, including its within-population (alpha) components.

With the false detection of selection held constant at 1%, the important matter is the performance value: what percentage of loci that are classified as outliers, due to their differentiation value surpassing the univariate or bivariate criterion, are actually under selection, that is, the true positives (TP). For a wide range of selection strengths, Table 2 shows the performance values for each univariate criterion (a single differentiation measure), and each bivariate criterion (i.e., an outlier locus must surpass the cutoff value for two differentiation measures). Of course, with the strongest selection (s = 0.05), all criteria performed well, but with very weak selection (s = 0.001), there was poor performance. The right-hand column of Table 2 shows the performance averaged over all selection strengths, which had similar rankings for the performance of the criteria. The univariate criteria did not perform as well as the bivariates, with no overlap of mean performance ±1 × SE. Within the univariates, there was similar performance for all criteria, but when averaged over all selection strengths, the three best performers were $^{A}A$, $G_{ST}$, and I. Within the bivariate criteria, again there was similar performance for all criteria. Nevertheless, consistently the three best performers were "$G_{ST}$ & $D_{EST}$," tied with "$D_{EST}$ & I," followed by "$^{A}A$ & $G_{ST}$."

TABLE 2

Detection of loci under directional selection

Criteria (differentiation measure(s))	s = 0.001 (TP; FP; performance %)	s = 0.003 (TP; FP; performance %)	s = 0.005 (TP; FP; performance %)	s = 0.05 (TP; FP; performance %)	Mean performance % (± SE)
$^{A}A$	488.18 ± 5.90; 10 ± 0; 33.03	851.18 ± 4.20; 10 ± 0; 46.23	911 ± 2.57; 10 ± 0; 47.92	999 ± 0.30; 10 ± 0; 50.23	44.35 ± 3.86
BCAFD	485.36 ± 6.15; 10 ± 0; 32.9	820.45 ± 4.69; 10 ± 0; 45.32	891 ± 3.16; 10 ± 0; 47.37	998.73 ± 0.38; 10 ± 0; 50.22	43.95 ± 3.82
$G_{ST}$	459.91 ± 5.46; 10 ± 0; 31.72	857.55 ± 3.75; 10 ± 0; 46.42	938.09 ± 2.27; 10 ± 0; 48.65	999.82 ± 0.12; 10 ± 0; 50.25	44.26 ± 4.25
$D_{EST}$	488.36 ± 6.57; 10 ± 0; 33.03	804.82 ± 4.54; 10 ± 0; 44.84	874.73 ± 3.73; 10 ± 0; 46.91	998.18 ± 0.40; 10 ± 0; 50.21	43.75 ± 3.74
I	458.27 ± 5.35; 10 ± 0; 31.64	857 ± 3.71; 10 ± 0; 46.4	938 ± 2.27; 10 ± 0; 48.65	999.82 ± 0.12; 10 ± 0; 50.25	44.24 ± 4.27
$^{A}A$, BCAFD	468.91 ± 5.52; 7.45 ± 0.25; 38.87	820.45 ± 4.69; 7.45 ± 0.25; 52.66	891 ± 3.16; 7.45 ± 0.25; 54.71	998.73 ± 0.38; 7.45 ± 0.25; 57.52	50.94 ± 4.15
$^{A}A$, $G_{ST}$	443.64 ± 5.09; 5.64 ± 0.24; 44.28	843.82 ± 4.32; 5.64 ± 0.24; 60.18	910.73 ± 2.57; 5.64 ± 0.24; 61.99	999 ± 0.30; 5.64 ± 0.24; 64.15	57.65 ± 4.53
$^{A}A$, $D_{EST}$	470.18 ± 5.87; 7.18 ± 0.26; 39.81	804.82 ± 4.54; 7.18 ± 0.26; 53.1	874.3 ± 3.73; 7.18 ± 0.26; 55.16	998.18 ± 0.40; 7.18 ± 0.26; 58.41	51.62 ± 4.08
$^{A}A$, I	442.18 ± 5.01; 5.64 ± 0.24; 44.19	843.36 ± 4.26; 5.64 ± 0.24; 60.17	910.73 ± 2.57; 5.64 ± 0.24; 61.99	999 ± 0.30; 5.64 ± 0.24; 64.15	57.63 ± 4.55
BCAFD, $G_{ST}$	450.45 ± 5.67; 5.64 ± 0.34; 44.65	819.67 ± 4.77; 5.64 ± 0.34; 59.48	891 ± 3.16; 5.64 ± 0.34; 61.48	998.73 ± 0.38; 5.64 ± 0.34; 64.14	57.44 ± 4.37
BCAFD, $D_{EST}$	475.55 ± 6.22; 8.91 ± 0.16; 35.03	804.82 ± 4.54; 8.91 ± 0.16; 47.71	874.73 ± 3.73; 8.91 ± 0.16; 49.79	998.18 ± 0.40; 8.91 ± 0.16; 53.09	46.40 ± 3.95
BCAFD, I	449 ± 5.56; 6.18 ± 0.44; 42.33	819.55 ± 4.71; 6.18 ± 0.44; 57.26	891 ± 3.16; 6.18 ± 0.44; 59.29	998.73 ± 0.38; 6.18 ± 0.44; 62.01	55.22 ± 4.41
$G_{ST}$, $D_{EST}$	441.36 ± 5.67; 4.91 ± 0.31; 47.59	804.09 ± 4.62; 4.91 ± 0.31; 62.32	874.73 ± 3.73; 4.91 ± 0.31; 64.28	998.18 ± 0.40; 4.91 ± 0.31; 67.25	60.36 ± 4.38
$G_{ST}$, I	458.27 ± 5.35; 10 ± 0; 31.64	857 ± 3.71; 10 ± 0; 46.4	938 ± 2.26; 10 ± 0; 48.65	999.82 ± 0.12; 10 ± 0; 50.25	44.24 ± 4.27
$D_{EST}$, I	439.91 ± 5.59; 4.91 ± 0.31; 47.51	804.09 ± 4.62; 4.91 ± 0.31; 62.32	874.73 ± 3.73; 4.91 ± 0.31; 64.28	998.18 ± 0.40; 4.91 ± 0.31; 67.25	60.34 ± 4.40

Criteria {Differentiation measure(s)}

Known selection strength (s)

0.001

0.003

0.005

0.05

Mean performance

^A A

488.18 ± 5.90

10 ± 0

33.03

851.18 ± 4.20

10 ± 0

46.23

911 ± 2.57

10 ± 0

47.92

999 ± 0.30

10 ± 0

50.23

44.35 ± 3.86

BCAFD

485.36 ± 6.15

10 ± 0

32.9

820.45 ± 4.69

10 ± 0

45.32

891 ± 3.16

10 ± 0

47.37

998.73 ± 0.38

10 ± 0

50.22

43.95 ± 3.82

GST

459.91 ± 5.46

10 ± 0

31.72

857.55 ± 3.75

10 ± 0

46.42

938.09 ± 2.27

10 ± 0

48.65

999.82 ± 0.12

10 ± 0

50.25

44.26 ± 4.25

DEST

488.36 ± 6.57

10 ± 0

33.03

804.82 ± 4.54

10 ± 0

44.84

874.73 ± 3.73

10 ± 0

46.91

998.18 ± 0.40

10 ± 0

50.21

43.75 ± 3.74

458.27 ± 5.35

10 ± 0

31.64

857 ± 3.71

10 ± 0

46.4

938 ± 2.27

10 ± 0

48.65

999.82 ± 0.12

10 ± 0

50.25

44.24 ± 4.27

^A A, BCAFD

468.91 ± 5.52

7.45 ± 0.25

38.87

820.45 ± 4.69

7.45 ± 0.25

52.66

891 ± 3.16

7.45 ± 0.25

54.71

998.73 ± 0.38

7.45 ± 0.25

57.52

50.94 ± 4.15

^A A, GST

443.64 ± 5.09

5.64 ± 0.24

44.28

843.82 ± 4.32

5.64 ± 0.24

60.18

910.73 ± 2.57

5.64 ± 0.24

61.99

999 ± 0.30

5.64 ± 0.24

64.15

57.65 ± 4.53

^A A, DEST

470.18 ± 5.87

7.18 ± 0.26

39.81

804.82 ± 4.54

7.18 ± 0.26

53.1

874.3 ± 3.73

7.18 ± 0.26

55.16

998.18 ± 0.40

7.18 ± 0.26

58.41

51.62 ± 4.08

^A A, I

442.18 ± 5.01

5.64 ± 0.24

44.19

843.36 ± 4.26

5.64 ± 0.24

60.17

910.73 ± 2.57

5.64 ± 0.24

61.99

999 ± 0.30

5.64 ± 0.24

64.15

57.63 ± 4.55

BCAFD, GST

450.45 ± 5.67

5.64 ± 0.34

44.65

819.67 ± 4.77

5.64 ± 0.34

59.48

891 ± 3.16

5.64 ± 0.34

61.48

998.73 ± 0.38

5.64 ± 0.34

64.14

57.44 ± 4.37

BCAFD, DEST

475.55 ± 6.22

8.91 ± 0.16

35.03

804.82 ± 4.54

8.91 ± 0.16

47.71

874.73 ± 3.73

8.91 ± 0.16

49.79

998.18 ± 0.40

8.91 ± 0.16

53.09

46.40 ± 3.95

BCAFD, I

449 ± 5.56

6.18 ± 0.44

42.33

819.55 ± 4.71

6.18 ± 0.44

57.26

891 ± 3.16

6.18 ± 0.44

59.29

998.73 ± 0.38

6.18 ± 0.44

62.01

55.22 ± 4.41

GST, DEST

441.36 ± 5.67

4.91 ± 0.31

47.59

804.09 ± 4.62

4.91 ± 0.31

62.32

874.73 ± 3.73

4.91 ± 0.31

64.28

998.18 ± 0.40

4.91 ± 0.31

67.25

60.36 ± 4.38

GST, I

458.27 ± 5.35

10 ± 0

31.64

857 ± 3.71

10 ± 0

46.4

938 ± 2.26

10 ± 0

48.65

999.82 ± 0.12

10 ± 0

50.25

44.24 ± 4.27

DEST, I

439.91 ± 5.59

4.91 ± 0.31

47.51

804.09 ± 4.62

4.91 ± 0.31

62.32

874.73 ± 3.73

4.91 ± 0.31

64.28

998.18 ± 0.40

4.91 ± 0.31

67.25

60.34 ± 4.40

Note: The table shows the number of loci (±SE) from selection simulations of 1000 loci that were identified as being under selection by criteria based on differentiation values from neutral simulations of 1000 loci: either a “univariate” criterion of being in the top 1% of neutral values for one differentiation measure, or a “bivariate” criterion of being simultaneously in the top 1% for two differentiation measures. In each of columns 2-5, the first value in each cell is the number of loci identified as being under selection (true positive, TP) in the selection simulation with the known value of selection shown at the top of the column, out of the total of 1000 independent loci simulated. The second value in each cell is the number of loci identified as being under selection (false positive, FP) in the parallel neutral simulation; with univariate criteria and the cutoff being the top 1%, the FP value is always 10 (1% of 1000 loci). The third value in each cell is the “performance” value: the percentage of loci that are true positives, out of all loci identified as outliers by that criterion (TP plus FP). The performance value shown is for the case where 1% of all loci were under that selective regime and all other loci were neutral; the calculation is 100 × (TP × 0.01)/[(TP × 0.01) + (FP × 0.99)]. The proportions of neutral and selected loci would not be known beforehand in a study designed to detect loci under selection, but because the calculation is standardized to a constant univariate false-detection (FD) rate, the performance values can be used to compare the criteria. The right column shows the performance averaged over all four selection strengths. Within each of the univariate criteria and the bivariate criteria, the three criteria with the best average performance are bolded. Note that the rank order of performance values is similar for most selection strengths, except the weakest selection (s = 0.001).
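
The performance calculation stated in the note can be checked with a short sketch. The function and parameter names are illustrative only; the formula and the input counts are taken from the note and from Table 2.

```python
def performance(tp, fp, prop_selected=0.01):
    """Percentage of outlier loci that are true positives.

    tp, fp        : counts per 1000 simulated loci (selection and neutral runs)
    prop_selected : assumed proportion of loci under selection (0.01 in Table 2)
    """
    weighted_tp = tp * prop_selected
    weighted_fp = fp * (1.0 - prop_selected)
    return 100.0 * weighted_tp / (weighted_tp + weighted_fp)


# Univariate ^A A at s = 0.001 (TP = 488.18, FP = 10)
print(round(performance(488.18, 10), 2))    # 33.03
# Bivariate G_ST & D_EST at s = 0.001 (TP = 441.36, FP = 4.91)
print(round(performance(441.36, 4.91), 2))  # 47.59
```

Both results agree with the corresponding cells of Table 2. The comparison also shows why the bivariate criterion scores higher despite a lower TP count: its FP count falls proportionally more.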

DISCUSSION

Science progresses by making forecasts under given conditions, then testing whether these forecasts are confirmed by the data. Examples include assessing levels of dispersal by identifying whether neutral loci depart from expectations for isolation or panmixia, and testing for loci that may be responding to geographically variable selection, by identifying whether genetic differentiation is higher than the neutral expectation (“outlier loci”; Bierne et al., 2013; Lotterhos & Whitlock, 2014; Narum & Hess, 2011; Schneider et al., 2021; Xiang-Yu et al., 2016). Unfortunately, many false results are expected from such molecular ecological methods (Bierne et al., 2013; Lotterhos & Whitlock, 2014; Narum & Hess, 2011; Schneider et al., 2021; Whitlock & McCauley, 1999; Xiang-Yu et al., 2016). Therefore, it is advisable to confirm conclusions by methods that are mathematically independent, or at least partially independent. Figure 3 shows that ^A A (Equation 7) provides information that is not linearly dependent on three other differentiation measures, G_ST, D_EST, and I, as previously suggested by an investigation of BCAFD (Berner, 2019a). ^A A is therefore complementary to G_ST, D_EST, I, and other measures (discussed below), and so it is a useful addition to the range of genetic differentiation measures, able to provide at least partly independent validation of results. The forecasts in Equations (8) and (9) for ^A A can now be added to the armory of null expectations for assessing connectivity and searching for loci under selection, because these forecasts are very accurate in simulation results for the common data type of neutral biallelic SNPs, over a wide range of dispersal rates and effective population sizes (Figure 1b,c).
Researchers who wish to use BCAFD and still achieve this accuracy need to filter their data so that they use only those loci with maximum allele proportion (in either of the two locations) in the range 0.8 to unity, thus losing much of their dataset. It is worth noting that ^A A (and BCAFD) are still dependent upon other aspects of within-locality alpha-variation, because of their relationship to G_ST and F_ST. It was not possible to remove this dependence by basing the expectations for ^A A upon the alternative measure of Meirmans & Hedrick (2010) (Equation A1.14); moreover, such a correction would considerably complicate the derivation of theoretical expectations for ^A A or BCAFD, such as Equations (8) and (9). However, the alpha-dependence is not fatal; despite their alpha-dependence, G_ST and F_ST are frequently used in various ways, including assessing connectivity and searching for loci under geographically variable selection. Moreover, under some conditions G_ST and F_ST perform comparably to, or better than, other measures (Schneider et al., 2021). Nevertheless, as with all such methods, many false positives are expected, so corroboration with semi-independent assessments is needed (Bierne et al., 2013; Lotterhos & Whitlock, 2014; Narum & Hess, 2011; Schneider et al., 2021; Xiang-Yu et al., 2016), which is where ^A A might be used. The neutral forecasts for ^A A can be used either to make biological inventories of differentiation between locations (or times), or to be compared with observations in order to assess the processes that underlie all biology, which some conservation initiatives aim to conserve (Anonymous, 1988). Processes to be investigated include population size, mutation, and dispersal in natural or managed systems, as well as searches for outlier loci that depart from neutral expectations and are thus candidate adaptive loci, which are very important in evolution and conservation (Teixeira & Huber, 2021).
Candidate adaptive loci are identified because they depart from neutral forecasts, as is commonly done with G_ST, F_ST, and other measures (Bierne et al., 2013; Lotterhos & Whitlock, 2014; Narum & Hess, 2011; Schneider et al., 2021; Xiang-Yu et al., 2016). Similar searches for adaptive loci are now using BCAFD (Bharti et al., 2021; Boyle et al., 2021; Haenel, Guerard, et al., 2021; Haenel, Oke, et al., 2021; Price et al., 2020; Yin et al., 2021; Zhou et al., 2021). These searches are expected to benefit from using ^A A instead of BCAFD because, as shown in the results above, the maximum-allele-proportion limitation of BCAFD truncates the high values of differentiation, which are the very values used to identify the potentially adaptive loci. Table 2 shows the performance of various criteria for identifying candidate adaptive loci under selection, based on their being outliers whose geographic genetic differentiation is in the top 1% of values for neutral loci for either a single measure (univariate criterion) or two measures (bivariate criterion). The more restrictive bivariate criterion eliminated more neutral loci, so the bivariate measures showed the best performance, measured as the percentage of all outlier loci that were truly under selection (right column in Table 2). Table 2 indicates that there is no perfect measure for detecting selection, because when we are searching for loci under selection, we cannot know in advance the proportion of loci that are experiencing each selection strength. Nevertheless, it is reassuring that the rank order of the average performance over all selection strengths is similar to the rank order within each selection strength. Of the univariate criteria, the three best performers were ^A A, G_ST, and I. The bivariate criteria generally performed much better, showing the advantage of using more than one differentiation measure as the cutoff in searches for candidate adaptive loci.
Of the bivariate criteria, the three best performers were “G_ST & D_EST,” tied with “D_EST & I,” followed by “^A A & G_ST.” The differences in performance were small, but even small improvements are very important, because this commonly used approach can only identify outlier loci that are putatively under selection; each of these “candidate” loci must then be confirmed by separate extensive investigations, such as “evolve and resequence” experiments in one or more standard environmental conditions (Schlötterer et al., 2015). There could be further research into which complementary measures are best to use with ^A A. This will depend upon the aim of the investigation and the different sensitivities of each measure, but some generalizations are possible. There has been considerable investigation of the properties of the Hill diversity measures, with many having good predictions from underlying factors such as population size, speciation/mutation, and dispersal, as well as showing independence of alpha, beta, and gamma (total) diversity (Sherwin et al., 2017, 2021). In particular, Shannon Mutual Information I/Shannon Differentiation and Morisita-Horn/D_EST are differentiation measures that have available forecasts under neutrality that can be used as null models. These measures also avoid the dependency on within-location (alpha) variation seen with G_ST and F_ST; moreover, the Shannon measures avoid the heavy emphasis on common variants that is seen with Morisita-Horn and D_EST (Jost, 2008; Magurran, 2004; Sherwin et al., 2017; Sherwin et al., 2021). If the primary purpose of assessing differentiation is identification of loci under selection, another good measure to contrast with identifications by ^A A would be the measure of Schneider et al. (2021), which can be used at any level of the Hill-family “Q,” has good sensitivity to selection, and is particularly appropriate for multi-SNP haplotypes, which are not considered in the current work.
Any use of theory relies upon adherence to assumptions, and this initial theory for ^A A is no exception. The underlying differentiation equations, upon which the ^A A forecasts are based, rely on a number of assumptions (Neigel, 2002; Ochoa & Storey, 2021; Semenov et al., 2019; Whitlock & McCauley, 1999), and each of these needs to be investigated before Equations (8) or (9) are applied to any particular case. First, it was assumed that there are only two locations, of approximately equal effective size, which may be the case especially in some conservation applications, but other possibilities would require further theory. Second, it was assumed that there is symmetric dispersal m, the same for both locations, so that addressing a source-sink situation would require further theory based on the continent-island model. Third, unlike the Hill family of diversity measures, ^A A (or BCAFD) cannot currently be corrected by the Good-Turing correction for absence or under-representation of rare alleles due to incomplete sampling of individuals (Chao & Jost, 2015; A. Chao, pers. comm.); however, this correction method is also inapplicable to any measure based on a two-variant system such as SNPs. Finally, Figure 1b shows a wide scatter, but the regression analysis shows that if there are multiple independent replicates, such as hundreds or a thousand neutral SNP loci in linkage equilibrium, the neutral forecast is very accurate. This number of statistically unlinked SNP loci is easily achievable with current methods for genotyping-by-sequencing (e.g., www.diversityarrays.com). Irrespective of whether one wishes to use theoretical expectations, it is advisable to use ^A A rather than BCAFD, because the latter's dependence on maximum allele proportion limits its comparability to other studies, even within the same species, if the population pairs analyzed are in parts of the range that have different maximum allele proportions due to a strong cline.
Several further possible developments are obvious. First, Appendix 1 principally shows equilibrium forecasts; Tables A2.1 and A2.2 show that there is often a wide range of generation times for which equilibrium is a reasonable assumption. However, some populations are known to have had recent changes such as severe reductions in connectivity, and for these Equation (A1.11b) can be used. For other changes, such as reduction of population size, further nonequilibrium forecasts could be derived in later research. Second, the initial neutral theory of ^A A in this article gives a good null model for use in searches for outlier loci that may be under directional selection, but it could also form the basis of further theory that is specific to particular modes of selection, including more complicated geographical patterns of directional selection, or balancing or disruptive selection. Third, at present the theory is limited to cases where there are only two alleles, as is often the case for SNPs, but not for haplotypes encompassing many nucleotides. In future, all the theory in this paper might be extended to cases with multiple alleles, broadening it to encompass the multiallele version of BCAFD (Berner, 2019a, Supplement). Fourth, the theory could be extended to multiple locations. Fifth, the haploid Equation (8) might also be developed to deal with species variants in two local communities, if the speciation rate is negligible relative to the dispersal rate; this is the original use of Bray-Curtis (Bray & Curtis, 1957). Such a development would require multivariant theory plus simulations tailored to species assemblages, including investigation of the wide scatter seen in Figure 1. For species analyses, this scatter could not be overcome by using hundreds or more replicate loci; instead, hundreds or more replicate pairs of communities would be needed, which is probably unattainable.
In conclusion: The new ^A A measure (Equation 7) provides a semi-independent means for assessing connectivity, selection, etc., based on geographic genetic differentiation, that can be used in combination with other such measures to minimize errors such as false positives. The ^A A measure avoids counterintuitive truncation of high values of beta-differentiation by alpha within-population variation (maximum allele proportion). Avoiding this truncation means that studies with different maximum allele proportions can now be compared realistically, either between species, or even within the same species if the population pairs analyzed are in parts of the range that have different maximum allele proportions due to a strong cline. Avoiding this truncation is especially important if the high values of differentiation are to be used to identify candidate adaptive loci, because the truncation would pull the truly high values in amongst the not-quite-so-high, leading to increased false negatives and positives. As predicted, the best performance at identifying outlier loci that are potentially under selection comes from using two geographic genetic differentiation measures simultaneously, to make bivariate criteria; the three best performers were “G_ST & D_EST,” tied with “D_EST & I,” followed by “^A A & G_ST.” The differences in performance are very important given that each of the identified “candidate” loci must be confirmed by separate extensive investigations. As well as simply presenting patterns in the data, if researchers consider that their system conforms to the assumptions herein, the neutral forecasts for ^A A can be used as a rigorous basis for investigations such as tests for selection and assessment of connectivity. There are equilibrium and nonequilibrium versions of the theory for ^A A (Equations 8, 9, A1.11b). Irrespective of whether the theory in this paper is used, BCAFD cannot be free of the limit set by maximum within-population allele proportion, so reported differentiation values are best based upon ^A A, not BCAFD.

AUTHOR CONTRIBUTIONS

William B. Sherwin: Conceptualization (lead); data curation (lead); formal analysis (lead); investigation (lead); methodology (lead); project administration (lead); resources (lead); software (lead); validation (equal); visualization (lead); writing – original draft (lead); writing – review and editing (lead).

CONFLICT OF INTEREST

There are no conflicts of interest.

TABLE A1.1

Scheme for the simulation, for each generation, using terms defined in text of Appendix 1

	Location 1	Location 2
Generation t, initially	p1, q1	p2, q2
After drift	p1′, q1′	p2′, q2′
After dispersal	p1″ = p1′ − m p1′ + m p2′; q1″ = 1 − p1″	p2″ = p2′ − m p2′ + m p1′; q2″ = 1 − p2″
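
One generation of the scheme in Table A1.1 might be sketched as below. The dispersal step follows the table directly. The drift step is an assumption: it is written here as binomial resampling of a fixed number of gene copies, which may differ from the procedure defined in Appendix 1. All function and parameter names are hypothetical.

```python
import random


def drift(p, n_copies, rng):
    """Resample the allele proportion from n_copies draws (assumed binomial drift)."""
    count = sum(1 for _ in range(n_copies) if rng.random() < p)
    return count / n_copies


def dispersal(p1, p2, m):
    """Symmetric dispersal step of Table A1.1, applied to both locations."""
    p1_new = p1 - m * p1 + m * p2
    p2_new = p2 - m * p2 + m * p1
    return p1_new, p2_new


def one_generation(p1, p2, n_copies, m, rng):
    """Drift within each location, then dispersal; returns (p1'', p2'')."""
    p1_drift = drift(p1, n_copies, rng)
    p2_drift = drift(p2, n_copies, rng)
    return dispersal(p1_drift, p2_drift, m)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
p1, p2 = one_generation(0.5, 0.5, n_copies=2000, m=0.01, rng=rng)
q1, q2 = 1 - p1, 1 - p2  # second allele, as in the last row of the table
```

Because dispersal is symmetric, the step leaves the mean allele proportion across the two locations unchanged and only moves the two proportions toward each other.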

TABLE A2.1

Time in generations to half‐equilibrium for the scenario conditions simulated

N	m	t1/2eq
1000	0.01	32.8488
1000	0.03	11.1944
1000	0.10	3.27386
10,000	0.01	34.3131
10,000	0.03	11.3596
10,000	0.10	3.28785
100,000	0.01	34.4666
100,000	0.03	11.3764
100,000	0.10	3.28925

Note: See Appendix 1 for definitions of other symbols.
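
The tabulated values can be reproduced numerically by the expression ln(0.5)/ln[(1 − m)² (1 − 1/N)]. This expression is inferred here from the numbers in Table A2.1 and is not quoted from Appendix 1, so it should be read as a consistency check rather than as the derivation; the function name is hypothetical.

```python
import math


def half_equilibrium_time(n, m):
    """Generations to half-equilibrium under the inferred per-generation factor.

    Assumption: the factor (1 - m)**2 * (1 - 1/n) is inferred from Table A2.1;
    it is not taken from Appendix 1.
    """
    factor = (1.0 - m) ** 2 * (1.0 - 1.0 / n)
    return math.log(0.5) / math.log(factor)


for n in (1000, 10_000, 100_000):
    for m in (0.01, 0.03, 0.10):
        print(n, m, round(half_equilibrium_time(n, m), 4))
```

Under this expression the half-equilibrium time is governed almost entirely by m, with N contributing only a small correction, which matches the pattern in the table.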

TABLE A2.2

Expected time in generations to fixation for the scenario conditions simulated

Initial p	N	Fixation time
0.5	100,000	277258.9
0.1	100,000	102337.1
0.01	100,000	18606.7
0.5	10,000	27725.9
0.1	10,000	10233.7
0.01	10,000	1860.7
0.5	1000	2772.6
0.1	1000	1023.4
0.01	1000	186.1

Note: See Appendix 1 for definitions of symbols.
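
The values in Table A2.2 are numerically consistent with the expression −4N p ln(p)/(1 − p), a form familiar from diffusion approximations in population genetics. This identification is inferred from the tabulated numbers rather than quoted from Appendix 1, and the function name is hypothetical.

```python
import math


def fixation_time(p, n):
    """Expected generations to fixation under the inferred expression.

    Assumption: -4 * n * p * ln(p) / (1 - p) is inferred from Table A2.2;
    Appendix 1 holds the authors' own definition.
    """
    return -4.0 * n * p * math.log(p) / (1.0 - p)


for n in (100_000, 10_000, 1000):
    for p in (0.5, 0.1, 0.01):
        print(p, n, round(fixation_time(p, n), 1))
```

The expression is linear in N, which is why each block of the table is ten times the next.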

48 in total

1. Geographic patterns of (genetic, morphologic, linguistic) variation: how barriers can be detected by using Monmonier's algorithm.

Authors: Franz Manni; Etienne Guérard; Evelyne Heyer
Journal: Hum Biol Date: 2004-04 Impact factor: 0.553

2. Isolation by Distance.

Authors: S Wright
Journal: Genetics Date: 1943-03 Impact factor: 4.562

3. Assessing population structure: F(ST) and related measures. (Review)

Authors: Patrick G Meirmans; Philip W Hedrick
Journal: Mol Ecol Resour Date: 2010-10-26 Impact factor: 7.090

4. Pre-introduction introgression contributes to parallel differentiation and contrasting hybridization outcomes between invasive and native marine mussels.

Authors: Iva Popovic; Nicolas Bierne; Federico Gaiti; Miloš Tanurdžić; Cynthia Riginos
Journal: J Evol Biol Date: 2020-12-14 Impact factor: 2.411

5. Measuring biological diversity.

Authors: Anne E Magurran
Journal: Curr Biol Date: 2021-10-11 Impact factor: 10.834

6. Evaluation of demographic history and neutral parameterization on the performance of FST outlier tests.

Authors: Katie E Lotterhos; Michael C Whitlock
Journal: Mol Ecol Date: 2014-04-11 Impact factor: 6.185

7. Allele Frequency Difference AFD - An Intuitive Alternative to F_ST for Quantifying Genetic Population Differentiation.

Authors: Daniel Berner
Journal: Genes (Basel) Date: 2019-04-18 Impact factor: 4.096

8. Integrated 16S rRNA Sequencing, Metagenomics, and Metabolomics to Characterize Gut Microbial Composition, Function, and Fecal Metabolic Phenotype in Non-obese Type 2 Diabetic Goto-Kakizaki Rats.

Authors: Weijun Peng; Jianhua Huang; Jingjing Yang; Zheyu Zhang; Rong Yu; Sharmeen Fayyaz; Shuihan Zhang; Yu-Hui Qin
Journal: Front Microbiol Date: 2020-01-20 Impact factor: 5.640

9. The inflated significance of neutral genetic diversity in conservation genetics.

Authors: João C Teixeira; Christian D Huber
Journal: Proc Natl Acad Sci U S A Date: 2021-03-09 Impact factor: 11.205

10. A Linkage-Based Genome Assembly for the Mosquito Aedes albopictus and Identification of Chromosomal Regions Affecting Diapause.

Authors: John H Boyle; Pasi M A Rastas; Xin Huang; Austin G Garner; Indra Vythilingam; Peter A Armbruster
Journal: Insects Date: 2021-02-16 Impact factor: 2.769