Literature DB >> 31061475

Exploiting selection at linked sites to infer the rate and strength of adaptation.

Lawrence H Uricchio^1,2, Dmitri A Petrov³, David Enard⁴.

Abstract

Genomic data encode past evolutionary events and have the potential to reveal the strength, rate and biological drivers of adaptation. However, joint estimation of adaptation rate (α) and adaptation strength remains challenging because evolutionary processes such as demography, linkage and non-neutral polymorphism can confound inference. Here, we exploit the influence of background selection to reduce the fixation rate of weakly beneficial alleles to jointly infer the strength and rate of adaptation. We develop a McDonald-Kreitman-based method to infer adaptation rate and strength, and estimate α = 0.135 in human protein-coding sequences, 72% of which is contributed by weakly adaptive variants. We show that, in this adaptation regime, α is reduced ~25% by linkage genome-wide. Moreover, we show that virus-interacting proteins undergo adaptation that is both stronger and nearly twice as frequent as the genome average (α = 0.224, 56% due to strongly beneficial alleles). Our results suggest that, while most adaptation in human proteins is weakly beneficial, adaptation to viruses is often strongly beneficial. Our method provides a robust framework for estimation of adaptation rate and strength across species.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2019 PMID： 31061475 PMCID： PMC6693860 DOI： 10.1038/s41559-019-0890-6

Source DB: PubMed Journal: Nat Ecol Evol ISSN： 2397-334X Impact factor: 15.460

Introduction

The relative importance of selection and drift in driving species’ diversification has been a matter of debate since the origins of evolutionary biology. In the earliest formulations of evolutionary theory, natural selection was proposed to be the predominant driver of differences between species [1, 2]. Subsequent theorists argued that random genetic drift could be a more important contributor to species differences [3, 4, 5, 6], with random changes accumulating over evolutionary time due to reproductive isolation between populations. Although it is now clear that natural selection plays a substantial role in both diversification and constraint in many species [7, 8, 9, 10], considerable uncertainty remains about the relative importance of stochastic drift, mutation, selection, and linkage, with no clear consensus among evolutionary geneticists [11, 12, 13, 14]. A better mechanistic understanding of these processes and how they jointly shape genetic diversity could help to resolve old evolutionary puzzles, such as the narrow range of observed genetic diversity across species [15] and the apparently low rate of adaptation in primates [16]. With the exception of rapidly evolving microbial species, most adaptation events occur too slowly to be directly observed over the timescale of a scientific study. Therefore, detailed study of the molecular basis of adaptation has required the development of computational methods to infer adaptation rates (denoted α, defined as the proportion of fixed differences between species that confer fitness benefits) directly from genetic sequence data. Most existing approaches derive from the McDonald-Kreitman (MK) test [7, 17] and related Poisson random field framework [18], both of which use divergence and polymorphism data to infer adaptation rates. Note that a recent approach uses polymorphism data alone to infer the distribution of fitness effects of fixing mutations [19]. The critical idea behind each of these methods is to compare evidence for differentiation at alleles that are likely to have fitness effects (e.g., nonsynonymous alleles that change protein function by altering the amino acid sequence) to alleles that are less likely to have fitness effects (e.g., synonymous alleles that do not change amino acid sequences). In the classic MK framework, the rate of divergence at putatively functional sites (D, often defined as non-synonymous differences within proteins) is compared to putatively neutral diverged sites (D, often defined as synonymous differences). Polymorphic sites within both the functional and non-functional class (P and P, respectively) are used as a background to calibrate the expected rate of divergence under a neutral model. If mutations at functional sites are assumed to be either virtually lethal or neutral, then the ratio has the same expected value as given that virtually lethal mutations contribute to neither P nor D. When exceeds , this is interpreted as evidence of adaptation because sites with functional effects on proteins are over-represented among the fixed differences relative to the neutral expectation. Smith and Eyre-Walker developed a simple equation that uses the same logic as the MK test to estimate adaptation rate α, and used this approach to provide evidence for a high rate of adaptation in Drosophila [17]. In principle, non-adaptive processes (i.e., processes that do not increase fitness) such as GC-biased gene conversion [20] could also lead to an excess of nonsynonymous fixed differences between species, but only if these processes differentially affect synonymous and nonsynoymous mutations. Unfortunately, this elegant framework is susceptible to many biases, most notably driven by the presence of weakly deleterious polymorphism in the class P. Deleterious polymorphism effectively makes the test overly conservative, because deleterious alleles are unlikely to ever reach fixation and therefore lead to the overestimation of the expected background rate of substitutions in the functional class. Fay et al introduced the idea of including only common polymorphic alleles (e.g., alleles at frequency 15% or greater), which should remove many deleterious alleles [21] – however, this approach has been shown to provide conservative adaptation rate estimates in many contexts [22]. More recently, Messer & Petrov showed that even removing all polymorphism below 50% is insufficient to correct this bias, especially when slightly deleterious mutations are common and the rate of adaptive evolution is high [23]. In order to mitigate this effect, Messer & Petrov introduced the idea of the asymptotic MK test (aMK). In this implementation, in eqn. 1 is replaced by , where P(x) and P(x) are the number of segregating nonsynonymous and synonymous alleles at frequency x, respectively [23]. An exponential curve is fit to the resulting α(x) function, which can be calculated for all values of x in the interval (0,1) for a sample of sequenced chromosomes. The intercept of the best-fit exponential curve at x = 1 is a good approximation for α, as it effectively removes all slightly deleterious polymorphism at all frequencies. This approach was shown to be robust to both the underlying distribution of deleterious effects and recent demographic events [23]. aMK has inspired new approaches to inferring adaptation in mitochondrial genes [24] and revealed a high rate of adaptation in proteins interacting with pathogens [25]. While aMK extends the elegant MK framework for estimating adaptation rate, it does not explicitly account for the possibility that beneficial alleles contribute to segregating polymorphism. It is unknown whether aMK is robust to the presence of weakly beneficial alleles, but there is reason to believe that beneficial alleles would be problematic because they are preferentially found at very high frequencies [19], and thus their effect would not be eliminated by the asymptotic procedure. The recent emphasis on adaptation from standing variation [26, 27, 28, 29, 30] and the reported evidence for weakly-beneficial polymorphism in Drosophila [31] suggest that robust methods for inferring adaptation strength over longer evolutionary time-scales are needed. A key limitation of existing MK-based approaches is that they provide estimates of adaptation rate but not adaptation strength, and therefore it is not clear whether weakly beneficial mutations contribute substantially to the fixation process. The underlying processes driving weak and strong adaptation might differ, and the ability to separately estimate rates of weak and strong adaptation could provide insight into the biological drivers of adaptation. We hypothesized that such a method could be developed by exploiting the impact of background selection (BGS) on the fixation rate of weakly-beneficial alleles. BGS removes neutral and weakly-beneficial variation via linkage to deleterious loci [32], while the fixation rate of strongly-adaptive alleles is not substantially affected [33]. Given that the strength of BGS varies widely and predictably across the human genome [34], a method that interrogates the rate of adaptation as a function of BGS might be able to jointly infer the rate and strength of adaptation. Here, we probe the performance of aMK when weakly-beneficial alleles substantially contribute to segregating polymorphism, and we show that aMK underestimates α in this adaptation regime. We additionally show that when adaptation is weak, true α is predicted to vary substantially across the genome as a function of the strength of BGS. We exploit this signal of covariation between α and BGS in the weak-adaptation regime to develop an approximate Bayesian computation method, which we call ABC-MK, that separately infers the rate of adaptation for weakly-beneficial and strongly-beneficial alleles. Our approach and aMK rely on similar input data, but we use a model-based fitting procedure that directly accounts for BGS and weakly-beneficial alleles. We apply our method to human genetic data to provide evidence that adaptation in humans is primarily weakly-beneficial and varies as a function of BGS strength. Interestingly, adaptation rate estimates on virus-interacting proteins support a much higher rate of strong adaptation, suggesting that adaptation to viruses is both frequent and strongly fitness-increasing. We address seven potential sources of confounding, and discuss our results in light of recent research on adaptation in humans.

Results

α estimates are conservative for weakly-beneficial selection

The aMK approach is known to converge to the true α at high frequency under the assumption that positively-selected mutations make negligible contributions to the frequency spectrum [23]. This assumption is likely to be met when beneficial alleles confer large fitness benefits, because selective sweeps occur rapidly and beneficial alleles are rarely observed as polymorphic. However, when selection is predominantly weak, attaining a substantial α requires much larger mutation rates for beneficial alleles and longer average transit time to fixation, introducing the possibility that weakly-beneficial alleles will contribute non-negligibly to the frequency spectrum. To test whether aMK is sensitive to polymorphic weakly-adaptive alleles, we used simulated polymorphism and divergence data to estimate the rate of adaptation using published aMK software [35]. In our simulations, we set the true value of α to 0.2 and vary the contribution of weakly-beneficial alleles and strongly-beneficial alleles to the adaptation process (see Methods & Supplementary Information). When adaptation is due entirely to strongly-adaptive alleles, the estimated value of was close to the true value but slightly conservative (; Fig. 1A). As we increased the contribution of weakly-beneficial alleles to α, estimates of α became increasingly conservative ( when α = 0.1, and when α = 0.2; Fig. 1B–C). Removing polymorphism above frequency 0.5 has been suggested as approach to account for potential biases induced by high-frequency derived alleles, which could be mispolarized in real datasets [25]. Restricting to alleles below frequency 0.5 produced similar (but conservative) estimates for all three models ( = 0.14271, 0.14529, and 0.14264 for α = 0.0, 0.1 and 0.2, respectively), likely because the frequency spectrum is not strongly dependent on the rate of weakly-beneficial mutation for low-frequency alleles. Lastly, we performed a much larger parameter sweep across α values and selection coefficients. We find that α estimates become increasingly conservative as the proportion of weakly deleterious alleles increases, and as the strength of selection at beneficial alleles decreases (Fig. S12A & Supplementary Information). Asymptotic-MK estimates of α are only weakly dependent on the distribution of deleterious selection coefficients (Fig. S12).

Figure 1:

aMK estimates as a function of adaptation strength.

A-C: We plot α(x) as a function of derived allele count x in a sample of 50 chromosomes. The true value of α = 0.2 in each panel, with varying contributions from weakly (2N s = 10) and strongly-adaptive alleles (2N s = 500). The solid lines show the results of our analytical approximation (eqn. 11 in the Supplemental Information), while the points show the value of α(x) from forward simulations. The blue points and curves show the calculation as applied to all polymorphic loci, while in the pink points and curves we have removed positively selected alleles from the calculation. The dotted line shows the estimated value of α from the simulated data using existing asymptotic-MK methods [23, 35], while the gray bars show the 95% confidence interval around the estimate.

To better understand why parameter estimates decreased as the proportion of weakly-adaptive alleles increased, we performed analytical calculations of α(x) using diffusion theory [36, 37]. Since we use large sample sizes in our analysis herein, we replace the terms p(x) and p(x) in α(x) with ∑p(x) and ∑p(x) in our calculations, which trivially asymptote to the same value as the original formulation but are not strongly affected by sample size (see Supplementary Information). We find that the downward bias in estimates of α is due to segregating weakly-adaptive alleles, and removing these alleles from the simulated and calculated α(x) curves would restore the convergence of α(x) to the true a at high frequency (Fig. 1A–C, red curves). In real data, it is not possible to perfectly partition positively-selected and deleterious polymorphic alleles. Hence, in later sections we focus on using the shape of the α(x) curve to infer the strength and rate of adaptation under models that include linkage and complex demography.

Background selection reduces true α when adaptation is weak

We have shown that weakly-beneficial alleles may impact aMK analyses by contributing to segregating polymorphism. This presents an opportunity to study whether aMK estimates vary as a function of background selection (BGS) strength. BGS, the action of linkage between deleterious alleles and neutral alleles, reduces genetic diversity in the human genome [34] and affects neutral divergence rates [38], and is predicted to decrease the fixation probability of weakly-adaptive alleles [33]. Hence, we hypothesized that if adaptation is partially driven by weakly-beneficial alleles in some species, BGS could play a role in modulating adaptation rate across the genome. To better understand how BGS might affect aMK inference in the presence of weakly-beneficial alleles, we performed analytical calculations and simulations of α(x) with various levels of BGS. We set α = 0.2 in the absence of BGS, and then performed simulations while fixing the rate of adaptive mutations and changing the amount of BGS (ranging from , where π is neutral nucleotide diversity as compared to the neutral diversity in the absence of linked selection, π0). We find that when adaptation is strong, BGS has a modest effect on α(x) and the true value of α (Fig. 2A&C), mostly driven by an increase in the rate of fixation of deleterious alleles(Fig. S2E). When adaptation is weak, BGS removes a substantial portion of weakly-adaptive alleles and precludes them from fixing, resulting in much stronger dependence of α(x) on BGS and a substantial reduction in the true value of α (Fig. 2B&D and Fig. S2C). Similar to the previous section, estimates of α were conservative across all models, but the underestimation was much more pronounced for weak adaptation (Fig. 2C&D).

Figure 2:

The effect of BGS on α.

A-B: α(x) is plotted as a function of derived allele count x for various background selection (/) values. In A, adaptive alleles are strongly-beneficial (2N s = 500), while in B they are weakly-beneficial (2N s = 10). The lines represent analytical approximations, while the points represent the results of stochastic simulations. The dashed lines at α = 0.2 represent the true rate of adaptation in the absence of BGS. C-D: True (dark colors) and estimated (light colors) α for each of the corresponding models in A-B. Panel C corresponds to strong adaptation (2N s = 500) while D corresponds to weak adaptation (2N s = 10). Estimates of α were made using existing asymptotic-MK software [35], and the error bars correspond to 95% confidence intervals reported by the software. For each parameter combination, we used 2 × 105 independent simulations of 103 coding base pairs each.

Human adaptation rate is shaped by linked selection

Our modeling results show that α is likely to be underestimated when weakly-beneficial alleles contribute substantially to the frequency spectrum, and that background selection may reduce adaptation rate when fitness benefits of adaptive alleles are small. Since BGS is thought to drive broad-scale patterns of diversity across the human genome [34], we hypothesized that directly accounting for the action of BGS on adaptation rate could provide new insights into the evolutionary mechanisms driving adaptation. Moreover, the fact that weak adaptation is strongly affected by BGS while strong adaptation is not suggests that strong and weak adaptation could be differentiated in genomic data by comparing regions of differing BGS strengths (from to ). We therefore designed an ABC-based method to infer α while accounting for both BGS and weakly-beneficial alleles. We applied our inference procedure (ABC-MK) to empirical α(x) data computed from human genomes obtained from the Thousand Genomes Project (TGP) for all 661 samples with African ancestry [39]. We find strong posterior support for a substantial component of α driven by weakly-beneficial alleles (; Fig. 3A & see Tab. 1 for area of 95% HPD), as well as posterior support for a smaller component of α from strongly-beneficial alleles ). We estimate that the total , nearly twice the estimate obtained with the same dataset using the original aMK approach (, see Supplementary Information; we note that while our estimate is similar to previous estimates [40, 23], we use a much larger set of genes in our inference and hence the estimates are not directly comparable). In addition to rates of positive selection, our approach provides estimates of negative selection strength. We find support for mean strength of negative selection of 2N s ≈ −220 (Fig. S9C), which is consistent with recent studies using large sample sizes [41] and weaker than earlier estimates using small samples [42, 40].

Figure 3:

Adaptation rate and strength estimates for human genomic data

A: Posterior distribution of α, α, and α = α + α as inferred by applying our ABC approach to 661 samples of African ancestry from the TGP phase 3. B: α(x) as a function of derived allele frequency (DAF) for genomic data (black points) plotted along with the mean posterior estimate from our model (orange line) and 99% confidence interval (gray envelope), as obtained by an independent set of simulations using the posterior parameter estimates. C: Inferred posterior distribution of α as a function of BGS strength in the human genome. D: Mean posterior estimates of α, as determined by separately fitting the model to alleles from each independent background selection strength bin. A linear model fit to the data (green line) supported statistically significant covariation between / and α (p-value=0.0343). The black dashed line shows the predicted change in α as a function of B given the mean estimate of α.

Table 1:

Datasets and corresponding adaptation rates.

Estimated α values represent the mean of the posterior distribution. NS represents the number of nonsynonymous fixations and SYN represents the number of synonymous fixations. Values in parentheses represent the area of 95% highest posterior density. α: total adaptation rate; α

Datasets & inferred adaptation rates
Dataset	NS	SYN	α^	α^W	α^S

Whole-egoism	29925	38135	0.135 (0.096,0.17)	0.097 (0.0,0.21)	0.041 (0.0,0.13)
VIPs	6249	10309	0.224 (0.17,0.28)	0.098 (0.0,0.24)	0.126 (0.018,0.26)
Non-VIPs	23676	27826	0.12 (0.09,0.15)	0.077 (0.01,0.13)	0.042 (0.0, 0.09)

In addition to estimating evolutionary parameters, we sought to better understand how BGS may impact adaptation rate across the genome. We resampled parameter values from our posterior estimates of each parameter, and ran a new set of forward simulations using these parameter values. We then calculated α as a function of BGS in our simulations. We find that α co-varies strongly with BGS, with α in the lowest BGS bins being 33% of α in the highest bins (Fig. 3C). Integrating across the whole genome, our results suggest that human adaptation rate in coding regions is reduced by approximately 25% by BGS (Fig. S9D). To confirm that these model projections are supported by the underlying data, we split the genome into BGS bins and separately estimated adaptation rate in each bin. Although these estimates are substantially noisier than our inference on the full dataset, we find that the rate of adaptation due to weakly-beneficial alleles decreases as a function of BGS strength in accordance with the model predictions (Fig. 3D). In contrast, estimates of the mean strength of negative selection against nonsynoymous mutations did not covary with BGS strength (Fig. S20). Lastly, to validate that our model recapitulates α(x) values that we observe in real data, we also used our independent forward simulations to recompute α(x). We find that our model is in tight agreement with the observed data across the majority of the frequency spectrum. The model and data deviate at high frequency, but both are within the sampling uncertainty (Fig. 3B, gray envelope). Previous research has shown that virus-interacting proteins (VIPs) have undergone faster rates of adaptation than the genome background [25]. However, the strength of selection acting on these genes is unknown, and given our BGS results it is plausible that the higher rate of adaptation in VIPs is driven by lower overall background selection at VIPs rather than increased selection pressure for adaptation. In contrast, if pathogens have imposed large fitness costs on humans it is possible that VIPs would support both higher adaptation rates and greater adaptation strength. We ran our method while restricting to an expanded set of 4,066 VIPs for which we had divergence and polymorphism data available. We found evidence for strikingly higher adaptation rates in VIPs than the genome background (α = 0.224) and a much larger contribution from strongly-adaptive alleles (α = 0.126; Fig. 4). The higher α for VIPs cannot be explained by BGS, because VIPs undergo slightly stronger BGS than average genes; the mean BGS strength at VIPs is 0.574, as compared to 0.629 for all genes (in units of /). Taking α as a point estimate for the rate of strongly-beneficial substitutions in VIPs and α genome-wide, we estimate that 61% of all strongly-beneficial substitutions occurred in VIPs (Tab. 1). Moreover, we estimate that the posterior probability that α is greater in VIPs than non-VIPs is 99.97%, while the posterior probability that α is greater in VIPs is 88.9% (Fig. 4C). Bootstrap samples of non-VIPs (1,000 replicates) never resulted in α estimates as high as those obtained from VIPs (Fig. S19). These results are concordant with the α(x) summary statistics for VIPs, which had larger values at high frequency alleles than non-VIPs (Fig. 4D). Interestingly, α(x) is lower for VIPs than the non-VIPs at low frequency, suggesting increased overall levels of conservation among VIPs (see also Fig. S9, where we find support for stronger negative selection against nonsynonymous mutations in VIPs).

Figure 4:

Virally-interacting genes support a high rate and strength of adaptation

A: Posterior distributions for α, α, and α for virus-interacting proteins (VIPs, 4,066 genes). B: The same quantities for non-VIPs (12,962 genes). C: The posterior distribution of the difference in α for VIPs and non-VIPs. D: α(x) as a function of derived allele frequency x for VIPs and non-VIPs as a function of derived allele frequency x, specifically at the values of x that we use for statistical inference.

Discussion

A long-running debate in evolutionary biology has concerned the relative importance of drift and selection in determining the rate of diversification between species [3, 4, 6, 13]. While previous studies have shown that there is a substantial signal of adaptation in Drosophila [17], estimates of adaptation rate in humans are much lower [13]. Here, we extended the classic MK framework to account for weakly-beneficial alleles, and we provided evidence for a large rate of weakly-adaptive mutation in humans. We showed that a state-of-the-art approach to adaptation rate estimation that does not account for beneficial polymorphism provides conservative estimates of α ( for this data) [23], while our method nearly doubles the estimated human adaptation rate (to ). Most of the adaptation signal that we detect is due to weakly-beneficial alleles. Interestingly, virus-interacting proteins supported a much higher rate of adaptation than the genome background (), especially for strongly-beneficial substitutions as compared to genome-wide). Our results provide an evolutionary mechanism that partially explains the apparently low observed rate of human adaptation in previous studies, and extends the support for viruses as a major driver of adaptation in humans [25]. It has long been known that recombination could in principle affect the evolutionary trajectories of both beneficial and deleterious alleles [43, 44, 33], and studies in Drosophila [45, 46] and dogs [47] have provided evidence for the effect of recombination on divergence and load. Despite the expectation that recombination could have a strong effect on adaptation in humans, studies have differed on how recombination affects human divergence and polymorphism. One human genomic study explored the ratio as a function of recombination rate, and found no evidence for an effect of recombination on divergence rate [48]. Our results may partially explain why does not fully capture the effect of recombination on divergence in humans. As BGS increases in strength, the rate of accumulation of deleterious alleles increases, while the rate of fixation of weakly-adaptive alleles decreases. The two effects partially offset each other, which should reduce the sensitivity of as a tool to detect the effect of recombination on divergence. A more recent study provided evidence that recombination affects the accumulation of deleterious polymorphic alleles [49], but did not provide detailed information about the effect of recombination on adaptation. Our results are consistent with the idea that weakly deleterious alleles are predicted to segregate at higher frequencies in regions under strong BGS, and we additionally show that BGS affects the accumulation of weakly-beneficial alleles in humans. While classic MK approaches estimate only the rate of adaptation, our method extends the MK-framework to provide information about both the rate and strength of selection. Previous approaches to estimating the strength of adaptation have focused on the dip in diversity near sweeping alleles [45, 50, 51, 52, 31] or have directly inferred the DFE from the frequency spectrum [19] – our approach capitalizes on an orthogonal signal of the reduction in fixation rate of weakly-beneficial alleles induced by selection at linked sites. We developed an ABC method to capture this signal, but less computationally intensive methods could also be used - for example, the original aMK approach could be applied in bins of BGS strength. If a substantial proportion of adaptation is due to weakly-beneficial alleles, such an analysis should result in a strong correlation between BGS strength and (potentially conservative) α estimates. However, it should be noted that cryptic covariation between gene functions (such as VIPs) and BGS strength could confound such inferences. We supposed that the main effects of linked selection in humans were due to background selection, but in principle genetic draft could drive similar patterns. Draft is expected to substantially reduce genetic diversity when sweeps occur frequently, and can impede the fixation of linked beneficial alleles [53, 54]. Previous work has also shown that strong draft can alter the fixation rate and frequency spectra of neutral and deleterious alleles [23]. We performed simulations of strong draft in 1MB flanking sequences surrounding a gene evolving under natural selection and tested the magnitude of the deviation from theoretical predictions under a model of background selection alone. Consistent with previous work, we observe that draft increases the fixation rate of deleterious alleles and thereby decreases α [23]. However, the effect on α(x) is only modest at the frequencies that we use in our inference procedure (i.e., below 75%), even when the strength and rate of positive selection are much larger than we and others have inferred in humans (although there is a modest deviation around 75% frequency, the highest frequency we use in our inference; Fig. S4C&D). This implies that draft due to selected sites outside genes would have to be much stronger than draft due to positive selection inside exons in order to drive the effects that we infer in the human genome. We note that it is likely that in species undergoing both strong, frequent sweeps and BGS (e.g., Drosophila – see [31]), draft will contribute to the removal of weakly-beneficial polymorphism. Selection has left many imprints on the human genome, with studies reporting signatures of selective sweeps [52], soft sweeps [29], background selection [34], negative selection [40, 42], and polygenic adaptation [28]. Still, considerable uncertainty remains about the relative importance of these evolutionary mechanisms, especially as concerns the rate and strength of positive selection. Recent work has suggested that the contrasting adaptation rate estimates of previous studies [51, 52] can be reconciled by arguing that most adaptation signals in humans are consistent with adaptation from standing variation [29]. Our results show that the frequency spectra and patterns of divergence are also consistent with the idea that many adaptive alleles segregate much longer than is expected for a classic sweep, and hence also help to reconcile the results of previous studies. In addition to determining the rate, strength, and mechanisms of adaptation, there is an ongoing effort to find the biological processes most important for driving adaptation. Previous work has shown that viruses are a critical driver of adaptation in mammals [25], but the strength of the fitness advantages associated with resistance to (or tolerance of) infection remain unclear. Our approach clarifies that strongly-adaptive fixed differences are also approximately three-fold enriched in virus-interacting proteins relative to non-VIPs. In contrast, weak adaptation rate was not substantially different between VIPs and non-VIPs, suggesting that weak adaptation may proceed through mechanisms that are shared across proteins regardless of function (for example, optimization of stability). While we have focused on VIPs here due to the expected fitness burdens associated with infection, in future research our approach could be used to investigate adaptation in any group of genes, or extended to partition genes into strong and weak adaptation classes. The model that we fit to human data does an excellent job of recapitulating the observed patterns in the Thousand Genomes Project data, but we were concerned that several possible confounding factors could influence our results. We showed that seven confounding factors (ancestral mispolarization [55], demographic model misspecification [56, 57], BGS model misspecification, covariation of BGS and sequence conservation, GC-biased gene conversion [20], selection on synonymous alleles [58], and misspecification of strongly- and/or weakly-beneficial selection coefficients) are unlikely to substantially influence the results (see Supplementary Information), but it should be noted that the adaptive process in our model is exceedingly simple, and it is very likely that the evolutionary processes driving diversification are much more complex. We supposed that adaptation proceeds in two categories, weak and strong selection, each of which is described by a single selection coefficient. In reality, adaptive alleles are likely to have selection coefficients drawn from a broad distribution, and adaptation is likely to proceed by a variety of mechanisms, including sweeps [52], polygenic adaptation [28], and selection from standing variation [29]. While our results show that BGS shapes adaptation rate across the genome, our method does not differentiate among adaptation mechanisms. We expect that future research will further clarify the relative importance of various selection mechanisms to shaping genomic patterns of diversity in the genomes of humans and other organisms [59, 10]. Our method is flexible in that it could be applied to any species for which both divergence/polymorphism data and estimates of background selection strength are available. As with the original aMK approach, we showed that the α estimates we obtained are not highly sensitive to recent demographic uncertainty. Our approach may therefore be effective for providing more accurate estimates of adaptation rate in non-model species. Despite recent advances, the evolutionary mechanisms that shape genetic diversity across species (which could include linked selection, population size, and/or population demography) remain the subject of debate [15, 11, 12]. Future work using and extending our method, which accounts for the effect of weakly-beneficial alleles on adaptation rate estimates, could help to resolve this open question.

Methods

Divergence and polymorphism data

We retrieved the number of polymorphic sites and their allele frequencies in human coding sequences as well as the number of human-specific fixed substitutions in coding sequences since divergence with chimpanzees. Fixed substitutions were identified by parsimony based on alignments of human (hg19 assembly), chimpanzee (panTro4 assembly) and orangutan (ponAbe2 assembly) coding sequences. Human coding sequences from Ensembl v73 [60] were blatted [61] on the panTro4 and ponAbe2 assemblies and the best corresponding hits were blatted back on the hg19 human assembly to finally identify human-chimp-orangutan best reciprocal orthologous hits. We used the Blatfine option to ensure that even short exons at the edge of coding sequences would be included in the hits. We further used a Blat protein -minIdentity threshold of 60%. The corresponding human, chimp and orangutan coding sequences were then aligned with PRANKs coding sequence evolution model [62] after codons containing undefined positions were removed. For each human coding gene in Ensembl we considered all possible protein-coding isoforms and aligned separately each isoform between human, chimp and orangutan. The numbers of polymorphic or divergent sites are therefore the numbers over all possible isoforms of a human gene (however, the same polymorphic or divergent site present in multiple isoforms was counted only once). If a polymorphic or divergent site was synonymous in an isoform but non-synonymous in another isoform, we counted it as a single non-synonymous polymorphic or divergent site. Only fixed divergent sites were included, meaning that substitutions still polymorphic in humans were not counted as divergent. The derived allele frequency of polymorphic sites herein corresponds to the frequency across all African populations from the Thousand Genomes Project phase 3 (TGP), which comprises 661 individuals spread across seven different subpopulations [39]. Allele frequencies were extracted from vcf files provided by the TGP for the phase 3 data. In total, 17,740 human-chimp-orangutan orthologs were included in the analysis. Supplemental Data Table S1 provides the number of synonymous and non-synonymous polymorphic or divergent sites for each of these 17,740 orthologs, as well as the allelic frequencies of the polymorphic sites. Polymorphic sites were counted only if they overlapped those parts of human coding sequences that were aligned with chimp and orangutan coding sequences. The ancestral and derived allele frequencies were based on the ancestral alleles inferred by the TGP phase 3 and available in the previously mentioned vcf files [39].

Model-based simulations and calculations

We tested the robustness of the aMK approach to the presence of weakly-beneficial alleles using simulation and theory. We simulated simultaneous negative and positive selection in coding sequences using model-based forward simulations under a range of scenarios [63, 64]. We supposed that nonsynonymous alleles are under selection, while synonymous alleles are neutral. In each simulation, we set α = α + α = 0.2, where α is the component of α due to weakly-beneficial mutations (2N s = 10) and α represents strongly-beneficial alleles (2N s = 500). Note that α is not treated as a parameter in the analyses herein; we use analytical theory to calculate the mutation rates for deleterious alleles and advantageous alleles that result in the desired α, meaning that α is a model output and not a model input. We drew deleterious selection coefficients from a Gamma distribution inferred from human sequence data [40], and we varied α from 0 to 0.2. We used the simulated allele frequency spectra and fixed differences to calculate the α(x) summary statistics. Results of these simulations are provided in Fig. 1 and additional simulation details are included in the Supplementary Information. We also performed analytical calculations under the same evolutionary model model using results from diffusion theory. These calculations are described in the Supplementary Information (see the sections entitled “Analytical approximation to α(x)” and “Background selection & adaptive divergence”). Software to perform these calculations is available at https://github.com/uricchio/mktest.

Using ABC-MK to infer adaptation rate and strength

We developed an approximate Bayesian computation (ABC) approach for estimating α and α in the presence of BGS and complex human demography [65]. We sampled parameters from prior distributions corresponding to the shape and scale of deleterious selection coefficients (assumed to be Gamma-distributed) and the rate of mutation of weakly and strongly-beneficial mutations. We performed forward simulations [63, 64] of simultaneous negative and positive selection at a coding locus under a demographic model inferred from NHLBI Exome project African American samples [66] with varying levels of background selection from / = 0.2 to / = 1.0 and the sampled parameter values. We then calculated α(x) using this simulated data, sampling alleles from the simulations such that the distribution of BGS values in the simulation matches the distribution in the empirical data as calculated by a previous study [34]. We used α(x) values at a subset of frequencies × as summary statistics in ABC (specifically, at derived allele counts 1, 2, 5, 10, 20, 50, 100, 200, 500, and 1000 in a sample of 1322 chromosomes). To improve efficiency, we employed a resampling-based approach that allows us to query many parameter values using the same set of forward simulations (see Supplementary Information). We tested our approach by estimating parameter values (population scaled mutation rates θ, θ, and the parameters of a Gamma distribution controlling negative selection strength) and quantities of interest (α, α, α) from simulated data. We found that the method produces high-accuracy estimates for most inferred parameters and α values (including α, α, and total α – Fig. S6). Some parameter values (particularly those corresponding the the distribution of fitness effects (DFE) over deleterious alleles and mutation rates of beneficial alleles) were somewhat noisily inferred. We found that α estimates were not very sensitive to various types of model misspecification (See Supplementary Information – Robustness analyses), but α and α are modestly affected by misspecification of the demographic model or the DFE of alleles driving BGS. We call our approach ABC-MK.

Data & code availability

Supplemental Data Table S1 is provided on the publisher’s website. The data and code that we used to parameterize our model are freely available online at https://github.com/uricchio/mktest. Columns in Supplemental Data Table S1 are as follows: First column – Ensembl coding gene ID. Second column – number of non-synonymous polymorphic sites. Third column – respective derived allele frequencies of these sites separated by commas. Fourth column – number of synonymous polymorphic sites. Fifth column – respective frequencies derived allele frequencies of these sites. Sixth column – number of fixed non-synonymous substitutions on the human branch. Seventh column – number of fixed synonymous substitutions on the human branch.

16 in total

1. How Much Does N_e Vary Among Species?

Authors: Nicolas Galtier; Marjolaine Rousselle
Journal: Genetics Date: 2020-08-24 Impact factor: 4.562

2. Comparison of the Full Distribution of Fitness Effects of New Amino Acid Mutations Across Great Apes.

Authors: David Castellano; Moisès Coll Macià; Paula Tataru; Thomas Bataillon; Kasper Munch
Journal: Genetics Date: 2019-09-05 Impact factor: 4.562

Review 3. Males, Outcrossing, and Sexual Selection in Caenorhabditis Nematodes.

Authors: Asher D Cutter; Levi T Morran; Patrick C Phillips
Journal: Genetics Date: 2019-09 Impact factor: 4.562

4. Decreased recent adaptation at human mendelian disease genes as a possible consequence of interference between advantageous and deleterious variants.

Authors: Chenlu Di; Jesus Murga Moreno; Diego F Salazar-Tortosa; M Elise Lauterbur; David Enard
Journal: Elife Date: 2021-10-12 Impact factor: 8.140

5. Sporadic occurrence of recent selective sweeps from standing variation in humans as revealed by an approximate Bayesian computation approach.

Authors: Guillaume Laval; Etienne Patin; Pierre Boutillier; Lluis Quintana-Murci
Journal: Genetics Date: 2021-12-10 Impact factor: 4.402

6. Biased Gene Conversion Constrains Adaptation in Arabidopsis thaliana.

Authors: Tuomas Hämälä; Peter Tiffin
Journal: Genetics Date: 2020-05-15 Impact factor: 4.562

7. The relative fitness of the de novo variants in general Lithuanian population vs. in individuals with intellectual disability.

Authors: Laura Pranckėnienė; Vaidutis Kučinskas
Journal: Eur J Hum Genet Date: 2021-08-06 Impact factor: 4.246

8. Global adaptation complicates the interpretation of genome scans for local adaptation.

Authors: Tom R Booker; Sam Yeaman; Michael C Whitlock
Journal: Evol Lett Date: 2020-12-15

9. An ancient viral epidemic involving host coronavirus interacting genes more than 20,000 years ago in East Asia.

Authors: Yassine Souilmi; M Elise Lauterbur; Ray Tobler; Christian D Huber; Angad S Johar; Shayli Varasteh Moradi; Wayne A Johnston; Nevan J Krogan; Kirill Alexandrov; David Enard
Journal: Curr Biol Date: 2021-06-17 Impact factor: 10.834

10. Inferring Parameters of the Distribution of Fitness Effects of New Mutations When Beneficial Mutations Are Strongly Advantageous and Rare.

Authors: Tom R Booker
Journal: G3 (Bethesda) Date: 2020-07-07 Impact factor: 3.154