
Background selection as null hypothesis in population genomics: insights and challenges from Drosophila studies.

Abstract

The consequences of selection at linked sites are multiple and widespread across the genomes of most species. Here, I first review the main concepts behind models of selection and linkage in recombining genomes, present the difficulty in parametrizing these models simply as a reduction in effective population size (Ne) and discuss the predicted impact of recombination rates on levels of diversity across genomes. Arguments are then put forward in favour of using a model of selection and linkage with neutral and deleterious mutations (i.e. the background selection model, BGS) as a sensible null hypothesis for investigating the presence of other forms of selection, such as balancing or positive. I also describe and compare two studies that have generated high-resolution landscapes of the predicted consequences of selection at linked sites in Drosophila melanogaster. Both studies show that BGS can explain a very large fraction of the observed variation in diversity across the whole genome, thus supporting its use as a null model. Finally, I identify and discuss a number of caveats and challenges in studies of genetic hitchhiking that have been often overlooked, with several of them sharing a potential bias towards overestimating the evidence supporting recent selective sweeps to the detriment of a BGS explanation. One potential source of bias is the analysis of non-equilibrium populations: it is precisely because models of selection and linkage predict variation in Ne across chromosomes that demographic dynamics are not expected to be equivalent chromosome- or genome-wide. Other challenges include the use of incomplete genome annotations, the assumption of temporally stable recombination landscapes, the presence of genes under balancing selection and the consequences of ignoring non-crossover (gene conversion) recombination events. This article is part of the themed issue 'Evolutionary causes and consequences of recombination rate variation in sexual organisms'.


Keywords: background selection; genetic hitchhiking; recombination; selective sweep


Year: 2017 PMID: 29109230 PMCID: PMC5698629 DOI: 10.1098/rstb.2016.0471

Source DB: PubMed Journal: Philos Trans R Soc Lond B Biol Sci ISSN: 0962-8436 Impact factor: 6.237

Introduction

Selection at a given genomic site has evolutionary consequences for genetically linked sites, either neutral or under selection themselves. These consequences of selection at linked sites, however, are multiple and strongly dependent on the selective regime and type of data under study. As such, ‘selection at linked sites’ is not a homogeneous phenomenon and the evolutionary outcomes may overlap less than is often assumed. Below, I first present the main concepts behind models of selection and linkage with particular focus on the expected consequences for patterns of diversity across recombining genomic regions and how these models may differ in their estimates of the population parameter ‘effective population size’ (Ne). Based on recent genome-wide studies and a re-analysis of a published dataset in Drosophila melanogaster, I later propose that the background selection model (BGS; [1-4]) should be considered as a default conceptual framework and its predictions across genomes as a null hypothesis in population genomics studies. Finally, I describe several limitations, challenges and potential biases in studies of selection and linkage. Throughout, I will use the term ‘selection at linked sites’ rather than ‘linked selection’ to emphasize that (i) linkage refers to a genetic property between genomic sites that do not recombine freely during meiosis (as opposed to an association between selective events) and (ii) the consequences of selection at a site can alter population dynamics at both neutral and non-neutral sites. Fisher [5] and Muller [6] discussed one of the first proposed consequences of selection at genetically linked sites in terms of selection at a polymorphic site interfering with selection at a second polymorphic site. Hill & Robertson [7] quantified this phenomenon and showed that selection acting on a segregating variant causes a reduction in the probability of fixation of a beneficial mutation at a linked site. 
This reduction in efficacy of selection at a site due to selection acting at nearby sites is known as the ‘Hill–Robertson effect’ (HRE; [7-9]). Moreover, and because the probability of fixation of a beneficial mutation with selection coefficient s decreases when the product Ne × s decreases [10-12], Hill & Robertson viewed their results in terms of a reduction in Ne relative to single-locus predictions of selection. Another category of models focuses on how selection changes levels of neutral diversity at linked sites; these are the so-called genetic hitchhiking models. In 1974, Maynard Smith & Haigh [13] described the dynamics of a beneficial variant increasing in frequency together with a variant at a linked site. This process eliminates segregating variation (diversity) near the site of the beneficial mutation once it reaches fixation (the classic selective sweep (CS) model). The size of the genomic region showing a reduction in diversity depends on the strength of selection (increasing size when s increases) and the recombination rate (decreasing size when the rate of meiotic recombination per base pair increases) [13,14]. In recombining chromosomes, therefore, genetic diversity is predicted to increase with genetic distance from the position of the recently fixed beneficial mutation, producing a ‘valley’ of diversity characteristic of the CS model (figure 1). Because beneficial mutations are assumed to be rare, the CS model predicts a very dynamic process with highly variable levels of diversity over time and across genomes. The CS model was later expanded by Wiehe & Stephan [15] and Gillespie [16] to include a constant input of beneficial mutations and recurrent selective sweeps (RSS or ‘Draft’ models). Although the target of selection may vary along a chromosome, Draft models forecast a more steady reduction in diversity due to genetic hitchhiking than CS models when analysing large genomic regions (figure 1). 
Charlesworth and co-workers [1,3,4] proposed a similar phenomenon describing the consequences of selection eliminating linked neutral diversity due to genetic hitchhiking, but with the critical difference that the cause of hitchhiking is the continuous removal of deleterious mutations (BGS; see also Hudson & Kaplan [2,17]).
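
The dependence of the probability of fixation on the product Ne × s, which underlies the Hill–Robertson view of interference as a reduction in Ne, can be illustrated with a minimal sketch. It assumes Kimura's diffusion approximation for a new semidominant mutation (heterozygote advantage s, initial frequency 1/(2N)); this formula is a standard single-locus result and is not given in the text above, and the function name and parameter values are hypothetical.

```python
import math

def fixation_probability(s, ne, n_census=None):
    """Kimura's diffusion approximation for a new mutation at frequency 1/(2N).

    s        : selection coefficient in heterozygotes (genic selection assumed)
    ne       : effective population size
    n_census : census size N (defaults to ne)
    Assumption: single locus; interference is represented only through ne.
    """
    if n_census is None:
        n_census = ne
    if s == 0:
        return 1.0 / (2.0 * n_census)  # neutral expectation
    # (1 - exp(-2*Ne*s/N)) / (1 - exp(-4*Ne*s)), written with expm1 for accuracy
    return math.expm1(-2.0 * ne * s / n_census) / math.expm1(-4.0 * ne * s)

# Same census size and same s, but a tenfold lower Ne (e.g. through linkage effects)
p_free = fixation_probability(0.01, ne=1000, n_census=1000)
p_linked = fixation_probability(0.01, ne=100, n_census=1000)
p_neutral = fixation_probability(0.0, ne=1000, n_census=1000)
print(p_free, p_linked, p_neutral)
```

With these illustrative values, the beneficial mutation is still favoured when Ne is reduced, but its probability of fixation moves much closer to the neutral expectation.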

Figure 1.

Models of selection with recombination. Horizontal lines represent different genetic backgrounds or haplotypes across a genomic region. Panels below the haplotypes depict qualitative levels of neutral diversity across the region, with the dashed line representing the expected level of neutral diversity in the absence of selection. Blue rectangles below the neutral diversity panels represent the location of two functionally relevant sequences where beneficial and deleterious mutations can occur. (a) CS and (b) RSS (or Draft) involve the fixation of new beneficial mutations (red circles) together with linked genetic variants. As a consequence, neutral diversity near the genomic location of the beneficial mutation is strongly reduced immediately after fixation and is expected to recover with time. The RSS/Draft model assumes that selective sweeps can occur before the complete recovery of neutral diversity from a previous sweep. (c) BGS predicts the continual appearance and elimination of deleterious mutations (black circles, shown here before being eliminated by selection) together with linked genetic variants. The deleterious mutation rate at functional sequences is assumed to be much higher than the beneficial mutation rate.

Because neutral diversity in finite populations is an increasing function of Ne × u (where u is the mutation rate/bp/generation), the consequences of CS/Draft and BGS models have been also presented in terms of a reduction in Ne near the sites under selection, in this case relative to predictions in the absence of selection. Not all forms of selection, however, predict a reduction in linked genetic variation and local Ne. Balancing selection, for instance, maintains multiple variants for long evolutionary times and increases diversity at closely linked neutral sites [18-23]. Associative overdominance, in more general terms, can enhance variability at linked sites [24-27].
As such, balancing selection and associative overdominance can be viewed as genetic hitchhiking models that would cause local increases in Ne, more noticeable in genomic regions with reduced recombination [21,25]. Across genomes and chromosomes, all models of selection and linkage predict a variable influence of genetic hitchhiking (and, therefore, variable local Ne) as a result of variable recombination rates. More specifically, CS/Draft and BGS genetic hitchhiking models predict a positive correlation between recombination rates and neutral diversity (Ne × u) across genomes. This qualitative prediction was first confirmed in Drosophila [28-33] and has been observed in most other species analysed [34-45]. Moreover, it is now better understood that hitchhiking models predict that local Ne (ultimately at the resolution of single genomic sites) will vary across genomes not only with recombination rates but also with differences in the distribution of sites under selection (e.g. gene distribution and the intron–exon structure of genes) [35,46-50]. The generality of the positive correlation between diversity and recombination rates also suggests that balancing selection and associative overdominance play secondary roles in explaining the levels of genetic variation across recombining chromosomes. Quantitative estimates of the magnitude and form of selection causing these general trends across genomes are, however, less straightforward. For instance, most population genetic models accept that the actual census population size N can be (much) greater than the effective population size in the absence of selection owing to factors such as temporal variation in population size, variance in fecundity and sex ratios. Nevertheless, different models of selection and linkage are differentially sensitive to the disparity between this selection-free effective population size and N [51].
More notably, the predicted ‘Ne’ within different models (CS/Draft, BGS or HRE) is not equivalent, given that each model captures different aspects of population dynamics [52-54]. The diversity-related Ne within CS/Draft models is not mathematically equivalent to an Ne representing a population with constant size [52,55,56]. Likewise, the Ne associated with a reduction in diversity within the BGS model is not equivalent to that explaining a reduction in the probability of fixation within HRE models [52,57]. In this regard, it is arguable that an equivalency of Ne and Ne × s among models has been abused and can generate a number of confusing interpretations when used interchangeably [52-54]. This last point can be particularly problematic when trying to model the consequences of BGS/Draft on diversity using HRE-related estimates of Ne × s from divergence data. Moreover, as expanded below, inferences about the effects of linkage on rates of evolution of non-neutral mutations need to assume temporal constancy in recombination landscapes, which is not always the case [58-62]. Additionally, demographic events can alter the relative differences in Ne across the same genome. Combined, it may be safe to point out that estimates of the parameters associated with selection at linked sites can be less direct and less equivalent among models and datasets than is often accepted. Finally, it is worth noting that HRE and the different genetic hitchhiking models are not mutually exclusive. With very weak selection and tight linkage, HRE will reduce the efficacy of selection removing deleterious mutations and, therefore, BGS models can overestimate the predicted impact reducing diversity [4,52,63,64]. Equivalently, interference between beneficial mutations can limit the rate of adaptation and, therefore, the diversity-reducing effects predicted by Draft models [65-67]. 
Moreover, in the case of non-recombining genomes, random genetic drift can cause the fixation of weakly deleterious mutations owing to Muller's ratchet [68-73]. Furthermore, under specific non-recombining conditions, Muller's ratchet can explain low levels of diversity and a reduction in the rate of adaptation (see [71,74] and references therein). Whereas all these factors are important for the interpretation of patterns of diversity and adaptation in asexual species as well as in non-recombining genomic regions such as Y chromosomes, below I will focus on the influence of selection on levels of diversity in recombining genomic regions.

The background selection model as baseline for studying diversity across recombining genomes

Certainly, beneficial mutations are essential in the evolution of species but a much higher number of non-neutral mutations must be deleterious [75]. Accordingly, the use of models that incorporate neutral and deleterious mutations as a null alternative to investigate the potential presence of other forms of selection has been a hallmark of evolutionary studies for almost 50 years. Specifically, predictions of the neutral theory of molecular evolution [76], which allows for neutral and strongly deleterious mutations, as well as later models that include weakly deleterious mutations [77,78], have been used as null hypothesis in molecular population genetic analyses, and other forms of selection are accepted when these predictions are incompatible with the data. It is, therefore, also sensible to use BGS as a conceptual framework and its predictions of diversity across genomes as null hypothesis when testing for alternative selective regimes using population genomics data [3,36,79], now without the assumption of independence between sites. The use of high-resolution BGS predictions of diversity facilitates the identification of outlier genomic regions that show significantly higher or lower diversity than expected, suggesting the action of balancing selection or recent adaptive events, respectively [79]. However, this approach would only be valid if the BGS model could explain a large fraction of the observed variation in diversity across genomes. Recent advances in generating high-resolution recombination maps together with comprehensive genome annotations are now allowing this type of preliminary studies in a number of species [36,79-82].

How good is the background selection model at explaining the distribution of diversity across genomes? Lessons from Drosophila melanogaster

Charlesworth [3] used theoretical predictions of the BGS model to investigate whether naturally occurring deleterious mutations and variation in recombination rates across the genome could account for the observed heterogeneity in levels of nucleotide diversity in D. melanogaster (see also Hudson & Kaplan [2]). These initial analyses estimated the magnitude of BGS (estimates of the parameter B) by using relatively crude information on variation in recombination rates and assuming a uniform distribution of deleterious mutations along chromosomes. Despite these approximations, the results showed that BGS is a realistic explanation for the observed reduction in neutral diversity on the fourth achiasmate (non-recombining) chromosome and near centromeres of recombining chromosomes, particularly when the deleterious consequences of transposable element (TE) insertions were taken into account [3]. More recently, Charlesworth [83] estimated the magnitude of BGS in the middle of the X chromosome and autosomes of D. melanogaster and Drosophila pseudoobscura. By taking into account overall differences in recombination rates between chromosomes and variable selection at exons, introns, untranslated regions (UTRs) and intergenic regions, this study showed that BGS could explain observed differences in diversity levels on the X chromosome relative to autosomes (i.e. the ratio πX/πA). In 2014, Comeron [79] further expanded this approach and estimated the predicted effects of BGS models at every genomic position of the D.
melanogaster genome by combining (i) the actual genomic distribution of all regions putatively under selection (UTRs, exons and introns, overlapping coding sequence (CDS), non-coding RNA (ncRNA) and TEs), (ii) the variable incidence of selection at exons, introns, UTRs, ncRNA, TEs and intergenic regions, (iii) the potential cumulative effect of every position across a chromosome arm when estimating B at a site and (iv) different selective and mutational parameters. As in Charlesworth [83], deleterious mutation rates and the distribution of fitness effects (DFEs) were estimated using datasets of diversity that were independent of the ones used to study the potential effects of BGS. Additionally, this study used high-resolution recombination rates [40] and explored the influence of crossover (CO) and non-crossover (NCO, or gene conversion) recombination events on the distribution of BGS effects (see below). This comprehensive approach generated whole-genome high-resolution landscapes of the consequences of selection at linked sites under BGS (i.e. BGS landscapes or B-maps) that were then used to evaluate the general fit to the observed levels of neutral diversity and to identify outlier regions [79]. Rank correlation analyses (based on Spearman's ρ) between estimates of B and observed levels of nucleotide diversity at silent sites (πsil) suggest that BGS landscapes do a very good job of explaining the observed diversity in D. melanogaster. For instance, analyses at the scale of 100 kb show ρ2 = 0.59 between πsil and B across autosomes (see [79] for results at different genomic scales). Elyashiv et al. [82] have applied an alternative approach to improve the study of selection at linked sites across the D. melanogaster genome. Expanding the methodology developed by McVicker et al. [36] (see also [84]), these authors inferred selection parameters by maximizing the composite-likelihood (CL) for the observed levels of neutral diversity along the genome. 
Importantly, Elyashiv et al. [82] applied this approach to models with deleterious mutations (a BGS scenario), with beneficial mutations (a CS scenario) or their joint effects (BGS + CS). Together with CL calculations of selective parameters from diversity and divergence data, the authors incorporated properties (i) to (iv) described above and high-resolution genome-wide CO rates to generate detailed landscapes of estimates of B (denoted here as BCL_BGS, BCL_CS and BCL_BGS+CS). Their study also shows that BGS can explain a large fraction of the observed variation in diversity across autosomes. Analyses of 100 kb regions show goodness-of-fit estimates of the coefficient of determination (R2) between BCL_BGS and levels of diversity at synonymous sites (πsyn) of 0.42 (see figure 2 with data from Elyashiv et al. [82] for results at different genomic scales). Interestingly, Elyashiv et al. show that a model allowing for deleterious and beneficial mutations (BGS + CS) can improve the overall fit even further when the resolution of the study is 100 kb or smaller. At the same time, this study shows that estimates of BCL_CS perform consistently worse at explaining diversity levels than either BCL_BGS or BCL_BGS+CS (figure 2).
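
To make the idea of a B landscape concrete, the sketch below computes B at a focal neutral site by accumulating the contribution of every selected site along a region. It is only an illustration under strong assumptions: it uses a commonly cited approximation from the BGS literature, B ≈ exp(−Σ u_i t / (t + r_i(1 − t))²), with a single heterozygous selection coefficient t, a uniform deleterious mutation rate u per site and a uniform crossover rate per bp. None of these choices, names or parameter values is taken from [79] or [82], which use full DFEs, annotation-specific parameters and empirical recombination maps.

```python
import math

def b_at_site(focal_pos, selected_positions, u, t, rec_rate_per_bp):
    """Approximate B (diversity relative to the selection-free expectation).

    focal_pos          : position (bp) of the neutral site
    selected_positions : positions (bp) of sites under purifying selection
    u                  : deleterious mutation rate per selected site per generation
    t                  : selection coefficient against heterozygotes (hypothetical single value)
    rec_rate_per_bp    : crossover rate per bp per generation (assumed uniform)
    """
    total = 0.0
    for pos in selected_positions:
        # Recombination frequency between the focal and the selected site;
        # the linear form is only reasonable for short map distances.
        r = rec_rate_per_bp * abs(pos - focal_pos)
        total += u * t / (t + r * (1.0 - t)) ** 2
    return math.exp(-total)

# Hypothetical 100 kb block of selected sites around a focal site
selected = range(0, 100_000)
b_no_rec = b_at_site(50_000, selected, u=1e-8, t=1e-3, rec_rate_per_bp=0.0)
b_low_rec = b_at_site(50_000, selected, u=1e-8, t=1e-3, rec_rate_per_bp=1e-9)
b_high_rec = b_at_site(50_000, selected, u=1e-8, t=1e-3, rec_rate_per_bp=1e-7)
print(b_no_rec, b_low_rec, b_high_rec)
```

Evaluating such a function at every position, with position-specific mutation rates, selection coefficients and recombination distances, is what would turn this toy calculation into a genome-wide B-map.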

Figure 2.

Summaries of goodness of fit between high-resolution estimates of the strength of selection at linked sites across the D. melanogaster genome and levels of diversity (π) at synonymous sites. B indicates estimates of BGS following the methodology presented in [79]. BCL_BGS, BCL_CS and BCL_BGS+CS indicate CL-based estimates of B from Elyashiv et al. [82] when including deleterious mutations, beneficial mutations or the joint effects of deleterious and beneficial mutations, respectively. Estimates of the coefficient of determination (R2) are shown for analyses of non-overlapping regions of 1, 10, 100 and 1000 kb across autosomes. R2 estimates for BCL_BGS, BCL_CS and BCL_BGS+CS are from Fig. 2 in [82] (see the text for details). In all cases, only regions with recombination rates greater than 0.1 cM per Mb were analysed.

A direct comparison of results from the two D. melanogaster studies is, however, not necessarily appropriate because estimates of the explained variance from rank correlations such as Spearman's ρ2 (in [79]) may differ from those from R2 (in [82]). To have more comparable estimates of fit, I reanalysed the data from Comeron [79] to obtain R2 between B and diversity, focusing on neutral synonymous sites and limiting summaries of goodness of fit to regions with recombination greater than 0.1 cM per Mb, as in [82]. The comparison of analyses of fit shows that B landscapes based on BGS models implemented in [79] have a higher or equal explanatory power (higher or equal R2) describing variation in autosomal diversity than CL approaches based on BGS (BCL_BGS; [82]) at all physical scales analysed (figure 2). Combined, the results of these studies in D.
melanogaster provide three main lessons: (i) all genomic regions, including those with high recombination rates, are likely influenced by BGS, (ii) predictions of BGS show an impressive fit to diversity data at intermediate and large genomic scales, thus supporting the need for considering BGS when evaluating the presence of additional forms of selection, and (iii) when using the same methodological framework, models of BGS + CS are a better explanation for the observed genomic distribution of diversity than models of BGS alone. This latter point is in agreement with previous population genetic studies in this species that detected severely reduced levels of diversity in a number of genomic regions with average recombination rates [85-91] and with the presence of significant outlier regions with lower diversity than expected when using BGS as baseline [79]. In all, BGS, and models of selection and linkage in general, are an active area of research and the approach proposed by Elyashiv et al. [82] puts forward a valuable framework for future studies of selection in natural populations. As a guide for these studies of selection at linked sites, I present and discuss below a number of potential limitations, caveats and challenges that should be considered and, ideally, addressed.
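
The reason why explained variance based on Spearman's rank correlation and on a linear R2 need not agree can be seen with a small, purely hypothetical example: when diversity increases monotonically but nonlinearly with B, the squared rank correlation equals 1 whereas the squared Pearson correlation (the R2 of a simple linear regression) is lower. The values and function names below are invented for illustration and are not data from [79] or [82].

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical per-window values: diversity rises monotonically but not linearly with B
b_values = [0.2, 0.4, 0.6, 0.8, 0.9]
pi_values = [0.001 * math.exp(3.0 * b) for b in b_values]

rho_sq = spearman_rho(b_values, pi_values) ** 2
r_sq = pearson_r(b_values, pi_values) ** 2
print(rho_sq, r_sq)
```

Reporting both statistics on the same windows and filters, as done in the re-analysis described above, avoids comparing fits that measure different things.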

Influence of demographic events

Selection can distort estimates of demographic events. BGS, for instance, generates a consistent excess of low-frequency variants at neutral sites resulting in negative Tajima's D [64,92-99]. The predicted magnitude of the skew in the frequency of variants is particularly evident when incorporating DFEs with weakly selected mutations and when recombination is reduced or absent, but it is also expected for a wide range of recombination rates as long as the number of sites under selection is high. In fact, an excess of rare mutations due to BGS is expected across most of the genome for species with a range of recombination rates like D. melanogaster [79,100]. These patterns of polymorphism predicted by BGS models and confirmed by simulations could be easily understood as evidence of population expansion and BGS is also likely to bias inferences about most other demographic events [100,101]. At the same time, demography can influence estimates of parameters associated with selection at linked sites [102-104]. In this regard, work by Zeng and co-workers [97,102,105,106] on estimating the joint effects of BGS and demography may help to improve the parametrization of selection coefficients and, ultimately, a default BGS framework. It is, however, tempting to assume that demographic events should affect different genomic regions similarly and, therefore, play a minor role in inferences of selection once genome-wide patterns are taken into account. This common assumption is not correct when considering that models of selection and linkage predict variable Ne across genomes, and that the dynamics and consequences of demographic changes depend on population size. 
The idea that demographic events are not expected to influence different genomic regions similarly follows arguments put forward in studies of neutral diversity on the X chromosome relative to autosomes or in comparisons of mitochondrial relative to autosomal genes when populations undergo demographic changes [107-112]. I argue that the same should be expected across recombining chromosomes as a consequence of variation in Ne due to genetic hitchhiking. To exemplify the point presented above, I used forward simulations to explore the consequences of a population going through a severe bottleneck and fast recovery at genomic regions under varying intensities of BGS. Figure 3 depicts the temporal dynamics for (panel a) the relative change in diversity at neutral sites, (panel b) relative Tajima's D at neutral sites (D/Dmin) [96,113] and (panel c) the fraction of adaptive amino acid substitutions (α; [114-117]). Different degrees of BGS were generated by using a range of rates of recombination realistic for D. melanogaster: very low (but non-zero) recombination (very strong BGS, red line), low recombination (strong BGS, green line) and genome-wide average recombination (moderate BGS, blue line). Moreover, for each of the three recombination levels, lower (dashed lines) and higher (solid lines) degrees of BGS were generated as a consequence of having lower and higher density of sites under selection (figure 3). For any given time point, the different lines should be taken as exemplars of genomic regions across a chromosome under different degrees of BGS (from dashed blue line to continuous red line representing weakest and strongest degrees of BGS, respectively).

Figure 3.

Population dynamics after a bottleneck and rapid recovery under varying intensities of BGS. (a) The relative change in diversity at neutral sites (π/π0, where π0 indicates neutral diversity at equilibrium), (b) estimates of Tajima's D at neutral sites after normalizing by Dmin (D/Dmin) [96,113] and (c) estimates of α, the fraction of adaptive amino acid substitutions [114-117]. All results are based on forward simulations of a panmictic population of 10 000 diploid individuals (N) with a severe bottleneck at time 0.1N (0.22% of initial population size) and rapid recovery to the initial N after 0.01N generations. Simulations using the program SLiM [118] followed a chromosome segment of 2 Mb that contains one representative Drosophila protein-encoding gene every 10 kb (solid lines) or every 50 kb (dashed lines). Different degrees of BGS were generated through the use of different rates of total recombination observed across D. melanogaster chromosomes. Very strong BGS was accomplished with very low rates of CO (c; N × c = 1 × 10^−4/bp/generation; red line), strong BGS was accomplished with low rates of CO (N × c = 1 × 10^−3/bp/generation; green line) and moderate BGS was accomplished with a D. melanogaster genome-wide average rate of CO (N × c = 1.2 × 10^−2/bp/generation; blue line). All simulations also included an NCO rate (g) of N × g = 4.8 × 10^−2/bp/generation [40]. Black lines in (a) and (b) indicate results for neutral sequences not influenced by selection. Estimates of the fraction of adaptive amino acid substitutions α were obtained using the DFE-alpha programs [116,117] after jointly inferring the DFEs on amino acid mutations and demography under a two-epoch model. See electronic supplementary material for details on SLiM simulations and analyses.
At the time of the bottleneck, regions with initially stronger BGS (with more skewed frequency spectrum of neutral variants) will subsample relatively fewer variants and, as a result, analyses shortly after the bottleneck could overestimate the consequences of genetic hitchhiking. More notable is the different speed of approach to equilibrium shown by the different regions [111,112]. Because the time to reach equilibrium after a severe bottleneck mainly depends on the final Ne, regions with stronger local BGS and, therefore, smaller local Ne will reach equilibrium much faster than those with weaker local BGS and greater Ne. For long periods (in the hundreds of thousands of years for Drosophila owing to its large population size), regions with weak BGS will likely exhibit more extreme non-equilibrium patterns than those with moderate or strong BGS. Figure 3a shows that regions with high recombination (moderate BGS, blue lines) are much slower in recovering expected levels of neutral diversity than those with low and very low recombination (green and red lines, respectively). Figure 3b shows these dynamics for neutral mutations in terms of frequency spectra. For long periods of time, genomic regions with moderate levels of BGS (e.g. blue lines) will show a more negative Tajima's D than those with stronger BGS (e.g. green lines). Moreover, neutral mutations in regions with lower density of sites under selection (dashed lines) also tend to show a more negative Tajima's D than those in regions with more sites under selection (solid lines) when recombination rates are equivalent. In fact, these results suggest that neutral sites embedded in genomic regions with low BGS (e.g. intergenic regions) may exhibit patterns reminiscent of those caused by selective sweeps when compared with neutral sites under relatively stronger BGS (e.g. synonymous sites or short introns) simply as a consequence of the predicted longer periods of time to equilibrium in the former regions. 
These different dynamics also influence estimates of α for amino acid substitutions even when using methods that take into account potential changes in population size [116,117,119]. For regions that are considered to have non-reduced recombination rates in D. melanogaster (greater than 0.1 cM per Mb; blue and green lines), non-equilibrium creates a tendency for α to be positive. Moreover, the weaker the strength of BGS, the longer it takes for α to reach the expected equilibrium values (close to zero). Note that for regions with very strong BGS (red lines), the joint estimate of α and demography causes positive α at equilibrium, whereas estimates of α assuming constant population size show equilibrium with α ∼ 0. Combined, figure 3 not only shows that demographic events in species with detectable BGS can generate patterns of diversity that vary across the genome, but also that these patterns can be qualitatively different from those predicted at equilibrium. The simulation study shown in figure 3 represents only one of the many possible demographic events occurring in natural populations, but illustrates the point that qualitative and quantitative interpretations of selection across genomes based on diversity data may be influenced by the joint effects of BGS and demographic events. Moreover, these results emphasize that for long periods of time, every genomic region may be at a different stage of its temporal dynamics after demographic events, with some regions showing diversity patterns that can be interpreted as evidence of recent selective sweeps. That is, because every genomic region across recombining chromosomes is probably subject to a different intensity of BGS, it should be analysed separately when studying patterns of demography and selection. Future models of genetic hitchhiking should, therefore, consider genome-wide as well as region-specific temporal changes in Ne.
Complementary studies could also take advantage of advances in optimizing forward simulations that can incorporate BGS and demographic events (e.g. SLiM [118,120,121] and SFS_CODE [122,123]). Machine learning approaches to studying jointly demography, selection and linkage are similarly exciting avenues of research [124-126] and offer new opportunities to better evaluate the causes and consequences of selection at linked sites across genomes.
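
Because much of the argument above rests on Tajima's D and its normalized form D/Dmin, a minimal sketch of how these statistics can be computed per genomic region from a site frequency spectrum may be useful. The function follows Tajima's standard formulas; Dmin is taken here as the value of D when every segregating site is a singleton, which is one common normalization, and whether it matches the exact procedure in [96,113] is an assumption. The function name and the input layout are hypothetical.

```python
import math

def tajimas_d(sfs):
    """Return (D, Dmin) from a site frequency spectrum.

    sfs[k] is the number of segregating sites at which one allele is present
    in k + 1 of the n sampled chromosomes, so len(sfs) == n - 1.
    """
    n = len(sfs) + 1
    s = sum(sfs)  # number of segregating sites
    if n < 4 or s <= 0:
        raise ValueError("need at least 4 chromosomes and 1 segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    sd = math.sqrt(e1 * s + e2 * s * (s - 1))
    # Mean pairwise differences from the spectrum
    pi = sum(c * 2.0 * i * (n - i) / (n * (n - 1)) for i, c in enumerate(sfs, start=1))
    d = (pi - s / a1) / sd
    # Smallest possible pi for s segregating sites: all variants are singletons
    d_min = (2.0 * s / n - s / a1) / sd
    return d, d_min

# Region with only singletons (extreme excess of rare variants), n = 10
d_rare, dmin_rare = tajimas_d([20, 0, 0, 0, 0, 0, 0, 0, 0])
# Spectrum proportional to 1/i, the neutral equilibrium expectation
d_neutral, _ = tajimas_d([60.0 / i for i in range(1, 10)])
print(d_rare / dmin_rare, d_neutral)
```

Applying such a function separately to windows that differ in recombination rate and density of selected sites is the kind of region-specific analysis argued for above.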

Influence of temporal variation in recombination landscapes

Recombination rates and their distribution across genomes vary between species, among populations and among individuals of the same population. Studies by Noor and co-workers [41,59,127] have shown that recombination rate variation within and between Drosophila species is more extreme when comparing rates at a fine (sub-megabase) genomic scale. This strong dependency on the genomic scale of conservation of recombination rates has now been observed in many species [40,41,58,59,61,62,127-130]. Under models of selection and linkage, temporal changes in recombination rates predict that Ne at a given genomic region could also change over time, without invoking demographic events. Moreover, this change in Ne may be associated with a genome-wide change in recombination or with a local change in recombination that would alter Ne relative to other regions of the genome. Temporal variation in recombination landscapes, therefore, adds another layer of uncertainty for long-term Ne at a specific genomic location and can influence studies of selection that combine divergence (past) and diversity (present) data. In general terms, frequent temporal changes in local recombination rates may generate evolutionary patterns that will closely follow the harmonic mean of Ne across generations and thus predict a tendency for past Ne and past Ne × s to be smaller than current Ne and Ne × s. As discussed in [79,116,119], a potential consequence of such temporal disparity in Ne × s is an excess of fixed weakly deleterious mutations relative to levels of diversity, a pattern that could also be interpreted as evidence for adaptive mutations under models that assume constant population size. Equivalently, if recombination rates change frequently, studies that use divergence data to estimate the distribution of selection coefficients on deleterious mutations will estimate past Ne × s and may underestimate recent selection and the consequences of BGS on diversity. 
Moving forward, analyses of selection may benefit from including the effects of demography [97,116] and from allowing for ‘demographic’ parameters to vary across genomes in order to capture the consequences of changes in recombination landscapes (see above). In addition, Smukowski Heil et al. [59] have proposed restricting analyses to regions that exhibit conserved recombination rates between species. To this end, further efforts should be invested in generating recombination data not only for the populations under study but also for outgroup populations and species, as is customary for evolutionary analyses using sequence data.
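
The harmonic-mean argument made earlier in this section can be illustrated with a small numerical sketch. The Ne trajectory below is purely hypothetical (a region that spends 10% of generations at a tenfold lower local Ne after a shift in local recombination), and the names `long_term_ne` and `ne_trajectory` are illustrative rather than taken from any of the cited studies.

```python
from statistics import harmonic_mean, mean


def long_term_ne(ne_by_generation):
    """Harmonic mean of per-generation Ne values for one genomic region."""
    return harmonic_mean(ne_by_generation)


# Hypothetical trajectory: 90 generations at high local Ne and 10 generations
# at a tenfold lower local Ne (e.g. after a drop in local recombination).
ne_trajectory = [1_000_000.0] * 90 + [100_000.0] * 10

arithmetic_ne = mean(ne_trajectory)
harmonic_ne = long_term_ne(ne_trajectory)

# The harmonic mean is dominated by the low-Ne generations.
print(f"arithmetic mean Ne: {arithmetic_ne:.0f}")
print(f"harmonic mean Ne:   {harmonic_ne:.0f}")
```

Because short periods of low Ne dominate the harmonic mean, a region whose recombination rate fluctuates is expected to show a long-term Ne below the value suggested by its typical present-day state, which is the sense in which past Ne × s would tend to be smaller than current Ne × s.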

Influence of incomplete genome annotations

High-resolution predictions of BGS are as good as the genomic annotation used to assign the distribution of sites potentially targeted by deleterious mutations. Genomic annotations are, however, a work in progress and depend on both the methodology used to obtain data (e.g. transcriptomes) and the variety of biotic (e.g. cell types, age and sex) and abiotic (e.g. temperature and food) conditions investigated. The D. melanogaster genome annotation is a good case in point. Only 2 years after the initial genome reference was released (Release 1 [131]), the D. melanogaster Release 3 [132] altered the majority (85%) of gene models. Even more significantly, Release 5 (2006) of the genome annotation described almost 7000 new alternative splicing forms, 1200 novel ncRNAs and more than 1000 new genes when compared with Release 4 (2004) [133]. Predictably, genome annotations vary almost always in the direction of describing novel functional regions and reveal that a fraction of previously assumed neutral sites is, in fact, sporadically or constitutively under selection. To exemplify how this point can influence population genomic studies, I used D. melanogaster genome annotations from Release 4 (2004) and the more recent R6 (2016) [133]. The fraction of genomic sequence solely annotated as intergenic (not counting ‘N's) decreases from 0.40 to 0.27 for R4 and R6, respectively. The study of nucleotide diversity from the African Rwanda (RG) population of D. melanogaster [134,135] also reveals that the use of incomplete annotations can underestimate levels of diversity and generate a more negative Tajima's D at putatively neutral sites. Genome-wide levels of neutral diversity (fourfold synonymous sites; π4f) are substantially lower when using the R4 relative to when using the R6 annotation (median π4f of 0.009 and 0.014, respectively). 
Similarly, the relative Tajima's D at fourfold synonymous sites is more negative when using R4 than when using the R6 annotation (median D/Dmin of −0.101 and −0.011, respectively). It follows that studies using an incomplete annotation may also underestimate the influence of BGS across genomes. To quantify this potential effect, I used the complete R4 and R6 genome annotations to generate two different, annotation-specific, high-resolution B landscapes under a BGS model [79]. At 100 kb scale, the B landscapes generated under BGS fit the observed variation in π4f across autosomes substantially better when using the R6 annotation (R² = 0.484 and ρ² = 0.557 for genomic regions with recombination rates greater than 0.1 cM per Mb) than when using the more incomplete R4 annotation (R² = 0.226 and ρ² = 0.358 for genomic regions with recombination rates greater than 0.1 cM per Mb). Equivalent conclusions are drawn for analyses at 1 and 10 kb scales (data not shown). The reduction in predictive power of BGS to explain neutral diversity across the genome, together with inaccurate estimates of neutral diversity when using an incomplete annotation, uncovers limitations in studies of selection and linkage, particularly for non-model systems. Equivalent caveats may emerge from using partial information such as when considering only protein-coding sequences or when using a single transcript for genes with multiple transcripts. In all these cases, the influence of BGS could be underestimated and may therefore generate a tendency towards overestimating the need to include adaptive events. 
For model organisms such as D. melanogaster, with a comprehensive annotation, a better alternative may be to follow a ‘shadowing’ approach where all annotations (coding and non-coding genes, all alternative splicing variants, TEs and repetitive sequences) are mapped onto a reference sequence, and only sites that are never annotated as being part of a functional region are considered as potentially neutral [79].
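
The ‘shadowing’ idea can be sketched as an interval operation: pool every annotated feature, merge overlaps, and keep only the gaps. The following is a minimal sketch under stated assumptions, not the pipeline used in [79]; the function name `unannotated_intervals`, the half-open 0-based coordinates and the toy feature list are all hypothetical.

```python
def unannotated_intervals(seq_length, features):
    """Return half-open (start, end) intervals never covered by any feature.

    `features` pools (start, end) intervals from every annotation class
    (genes, all splice variants, ncRNAs, TEs, repeats) on one sequence.
    """
    merged = []
    for start, end in sorted(features):
        # Clamp each feature to the sequence and skip empty intervals.
        start, end = max(start, 0), min(end, seq_length)
        if start >= end:
            continue
        if merged and start <= merged[-1][1]:
            # Overlapping or abutting feature: extend the current shadow.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    gaps, cursor = [], 0
    for start, end in merged:
        if start > cursor:
            gaps.append((cursor, start))
        cursor = end
    if cursor < seq_length:
        gaps.append((cursor, seq_length))
    return gaps


# Toy example: overlapping transcripts and a feature running off the end.
toy_features = [(10, 20), (15, 30), (50, 60), (58, 70), (95, 120)]
neutral_candidates = unannotated_intervals(100, toy_features)
print(neutral_candidates)
```

Adding a newly described transcript or ncRNA to the pooled feature list can only shrink the returned intervals, which mirrors the observation that newer annotation releases reduce the fraction of sequence treated as putatively neutral.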

Consequences of ignoring non-crossover (gene conversion) recombination events

Recombination results from the repair of DNA double-strand breaks through either CO events which shuffle large genomic regions between homologous chromosomes or NCO events which only involve the transfer of short genomic segments called gene conversion tracts. Gene conversion tracts are often only a few hundred nucleotides long or even shorter [40,136-139]; therefore, the number and genomic location of potential new allelic combinations caused by NCOs is highly limited relative to the overall effects associated with COs. As a consequence, NCO recombination has been often assumed to play a minor role in reducing hitchhiking effects in natural populations. In fact, most studies of selection and linkage directly omit NCOs and use COs as the only source of recombination when predicting patterns of diversity and effectiveness of selection across genomes. The comparison of B landscapes based on BGS models that consider only COs (BCO) and those that include both CO and NCO events (BCO+NCO), however, reveals that BCO+NCO landscapes perform consistently better than BCO landscapes when describing variation in nucleotide diversity across the D. melanogaster genome [79]. This result is also in agreement with previous studies by Loewe & Charlesworth [50], and the more recent study by Campos et al. [140] showing that models of BGS that consider CO and NCO recombination can better explain evolutionary patterns across and among genes than models that consider only CO [49,50,141,142]. Combined, these results reveal a non-trivial influence of NCOs and the importance of using both COs and NCOs in studies of genetic hitchhiking and BGS specifically, at least for Drosophila. High-resolution maps of NCO rates are, however, still difficult to obtain. 
At a practical level, and because NCO rates show a more limited range of variation across genomes than CO rates [40,137,138], it may be sensible for future studies of selection and linkage to use variable CO rates together with a genome-wide average rate for NCOs when generating fine-scale landscapes under BGS or BGS + CS models.
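
As a rough illustration of combining a variable local CO rate with a single genome-wide NCO rate, the sketch below uses a commonly used approximation for the recombination frequency between two sites separated by d bp when tract lengths are exponentially distributed: r(d) = r_CO·d + 2·g·L·(1 − e^(−d/L)). This formula, including its constant prefactor, is an assumption made here and is not stated in this article; g (NCO initiation rate per bp), L (mean tract length) and all numerical values are illustrative only.

```python
import math


def effective_recombination(distance_bp, co_rate, nco_init_rate, mean_tract_bp):
    """Approximate recombination frequency between two sites (assumed model).

    co_rate:        local crossover rate per bp per generation
    nco_init_rate:  genome-wide NCO initiation rate per bp per generation
    mean_tract_bp:  mean gene conversion tract length in bp
    """
    co_term = co_rate * distance_bp
    nco_term = 2.0 * nco_init_rate * mean_tract_bp * (
        1.0 - math.exp(-distance_bp / mean_tract_bp)
    )
    return co_term + nco_term


def nco_fraction(distance_bp, co_rate, nco_init_rate, mean_tract_bp):
    """Fraction of the effective recombination contributed by NCO events."""
    total = effective_recombination(distance_bp, co_rate, nco_init_rate, mean_tract_bp)
    if total == 0.0:
        return 0.0
    return (total - co_rate * distance_bp) / total


# Illustrative values: one genome-wide NCO rate, several local CO rates.
GENOME_WIDE_NCO = 1e-7   # hypothetical initiation rate per bp
MEAN_TRACT = 500.0       # hypothetical mean tract length in bp
for local_co in (0.5e-8, 2e-8, 5e-8):
    for d in (100, 1_000, 100_000):
        frac = nco_fraction(d, local_co, GENOME_WIDE_NCO, MEAN_TRACT)
        print(f"CO={local_co:.1e} d={d:>7} bp  NCO share={frac:.2f}")
```

Under this assumed model, the NCO term saturates at 2·g·L while the CO term keeps growing with distance, so NCOs matter most for closely spaced sites and in regions with low CO rates.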

Consequences of not considering balancing selection

An advantage to using a BGS prediction as baseline for levels of diversity across genomes is that it allows the detection of regions under modes of selection other than CS, such as balancing selection [79]. In D. melanogaster, a number of studies have revealed signatures of balancing selection associated with immunity genes and fitness-related temporal and spatial variation [19,79,86,143-145]. That is, our current understanding of selective forces acting in natural populations of this species suggests that the number of genes potentially associated with balancing selection may not be much smaller than that of genes showing clear signals of recent selective sweeps (or at least not smaller by orders of magnitude). When balancing selection is not included in methods where diversity data are fitted to selection models, regions experiencing balancing selection (with an excess of diversity) are likely to be taken as regions with the weakest degree of genetic hitchhiking. Such a case would move upward genome-wide estimates of B (underestimate BGS), thus possibly leaving unexplained the regions with the lowest levels of diversity. This scenario may have little influence on genome-wide patterns of selection, but I propose that future studies designed to identify CS within a general BGS framework may benefit from first identifying and excluding from the analysis genomic regions with diversity-based signatures of balancing selection.
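
One simple way to operationalize such a pre-filter is to compare observed diversity in each window with the BGS expectation B × π0 and set aside windows with a large excess. The sketch below is a hypothetical filter, not the procedure of [79]: the function name, the window tuples, the value of π0 and the ratio threshold are all assumptions chosen for illustration.

```python
def flag_excess_diversity(windows, pi0, ratio_threshold=2.0):
    """Return names of windows whose diversity greatly exceeds the BGS baseline.

    windows: iterable of (name, observed_pi, b_value), where b_value is the
             BGS prediction B (0 < B <= 1) for that window
    pi0:     assumed diversity in the absence of selection at linked sites
    """
    flagged = []
    for name, observed_pi, b_value in windows:
        expected_pi = b_value * pi0
        if expected_pi > 0 and observed_pi / expected_pi > ratio_threshold:
            flagged.append(name)
    return flagged


# Toy data: only the second window shows a strong excess over its baseline.
toy_windows = [
    ("win1", 0.010, 0.8),
    ("win2", 0.030, 0.6),
    ("win3", 0.004, 0.3),
]
candidates = flag_excess_diversity(toy_windows, pi0=0.015)
print(candidates)
```

Windows flagged in this way would be excluded before fitting BGS or BGS + CS models, so that their excess diversity is not misread as weak hitchhiking.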

Influence of different methods for creating recombination maps

The direct analysis of co-inheritance of markers in meiotic products is the classic experimental approach for detecting recombination events. Advances in high-throughput sequencing and genotyping methods now allow the study of thousands of genetic markers (single nucleotide polymorphisms (SNPs) or small indels) and the generation of fine-scale maps of recombination events across genomes. Although this direct approach is still time-consuming and relatively expensive, whole-genome high-resolution recombination maps are available for a number of taxa, including yeast [137,146], humans [147-149], mice [150], dogs [151], Drosophila [40,41], Caenorhabditis elegans [152,153] and birds [154]. A complementary approach was proposed by Singh et al. [155], with a clever experimental and sequencing design that can generate recombination maps between two visual markers with unparalleled ultra-high resolution, thus adequate for gene-level studies across a specific genomic region. Moreover, when marker density is high enough, direct approaches can identify NCO and CO events [40,137,138]. In all, high-resolution experimental maps of recombination rates can be used to parametrize models of selection and linkage and are required to study the molecular basis of recombination plasticity (e.g. [156]). On the potentially negative side, the genotyping strategies used in these experimental maps are almost certainly limited to a small number of genotypes and biotic/abiotic conditions; therefore, these recombination landscapes might differ, to an unknown degree, from those in natural populations. Moreover, the presence of polymorphic chromosomal inversions in the individuals used to create genetic maps could not only reduce CO rates within the inverted region, but also increase rates elsewhere in the genome (the interchromosomal effect [157,158]). 
Additionally, even with the new methodologies, it is unrealistic for most species to generate genome-wide recombination maps that provide reliable recombination rates at sub-gene resolution (≤10 kb), particularly when multiple crosses and conditions need to be analysed. Furthermore, the degree of conservation of recombination rate within and between species decreases fast at finer scales (see [59] and references therein). All these factors are probably responsible, at least in part, for the strong dependence on the genomic window size of the goodness of fit between predictions of models of selection and observed diversity in all species analysed. An alternative approach to obtaining recombination maps is to take advantage of whole-genome sequence data from multiple individuals and use linkage disequilibrium (LD) between polymorphisms to estimate recombination rate [128,159-165]. Based on population genetic theory, estimates of LD can be transformed into a population-scaled estimate of recombination (LDρ; LDρ ∼ Ne × r, where r is the recombination rate/bp/generation). This approach can be easily applied genome-wide and provides an estimate of historical recombination rates that are often at a much higher resolution than traditional cross- or pedigree-based genetic maps (see [166] for a review contrasting LD- and pedigree-based approaches to estimating recombination rates). A downside of the LD-based methods is that they require the use of an estimate of Ne to obtain the more relevant recombination rate r. This challenge may be significant because it may not always be straightforward to gauge the adequate Ne for a given genomic region (see above). Moreover, estimates of LDρ generate a sex-average compound estimate of historical recombination that may not be adequate to study recent patterns of diversity owing to the potential change in recombination rates across genomes with time, including changes in the frequency of polymorphic chromosomal inversions (see above). 
Below, however, I briefly discuss another potential challenge (to be described in more detail elsewhere). A number of the proposed methods for estimating LDρ at genome-wide scale generate a compound estimate of CO plus NCO rates. This is relevant for at least two reasons. First, COs and NCOs play different roles in predicting the effects of selection at nearby sites and should be used separately in models of selection and linkage. Second, for population genomic analyses using high-density SNP data (as opposed to markers separated by hundreds of kilobases), the distance between SNPs can influence how much of the total LDρ is due to NCOs. This is because most, if not all, COs will be detected regardless of the distance between the SNPs flanking the location of a CO, whereas the probability of detecting a gene conversion tract decreases with increasing distance between SNPs. Therefore, regions with high levels of nucleotide diversity (or marker density) will include more NCOs in estimates of LDρ than those with lower levels of diversity. If this rationale is correct, reducing SNP density (‘thinning’ SNP data) should be accompanied by a reduction in estimates of LDρ in species with high levels of nucleotide diversity. In agreement, Chan et al. [128] reported a reduction in estimates of LDρ when they reduced SNP density: eliminating half of the SNPs along the X chromosome of the D. melanogaster RG population caused a moderate (13%) reduction in LDρ. To further investigate this trend, I applied a more extreme thinning strategy to the same RG population by using 1 of every 10 informative SNPs, increasing the average distance between SNPs to approximately 400 bp. LDρ estimates using low SNP density (LDρSNP1/10) are severely reduced relative to those estimated when using the complete SNP dataset (median LDρ of 0.011 and 0.021, respectively; Wilcoxon matched pairs test Z = 20.7, p < 1 × 10⁻¹⁰⁰ based on 100 kb non-overlapping regions). 
Thus, estimates of LDρ may vary across genomes not only due to differences in CO and NCO rates, but also as a result of a higher fraction of NCOs that will be detected when SNP density increases. The potential upward bias in LDρ in regions with high nucleotide diversity is particularly relevant in studies of selection and linkage, because a positive relationship between recombination rates and levels of diversity is accepted as evidence of pervasive selection within BGS and CS/Draft models. In short, in species with high levels of diversity or with variable mutation rates across genomes, the use of recombination rates based on LDρ could overestimate the impact of selection on nearby diversity. Computational demands are still limiting genome-wide applications of full-likelihood methods for jointly estimating CO and NCO rates from population genomic data [167]. Meanwhile, I propose considering the influence of variable SNP density in LD-based estimates of recombination that do not differentiate between COs and NCOs. Controlled thinning strategies may help to identify and lessen potential biases in species, or genomic regions, with high levels of nucleotide diversity.
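
A controlled thinning step of the kind discussed above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the procedure applied to the RG data: the helper names `thin_snps` and `mean_spacing` are hypothetical, and the evenly spaced toy SNP positions stand in for real variant coordinates. The LDρ estimation itself would be run separately on the full and thinned sets with whichever LD-based method is being evaluated.

```python
def thin_snps(positions, keep_every):
    """Keep 1 of every `keep_every` SNPs, in coordinate order."""
    if keep_every < 1:
        raise ValueError("keep_every must be a positive integer")
    return sorted(positions)[::keep_every]


def mean_spacing(positions):
    """Average distance in bp between consecutive SNPs."""
    ordered = sorted(positions)
    if len(ordered) < 2:
        return float("nan")
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)


# Toy data: 1000 SNPs placed every 40 bp (hypothetical coordinates).
all_snps = list(range(0, 40_000, 40))
thinned = thin_snps(all_snps, keep_every=10)

print(len(all_snps), mean_spacing(all_snps))
print(len(thinned), mean_spacing(thinned))
```

Running the same LD-based estimator at several thinning levels, and checking whether LDρ declines as spacing grows, gives a direct diagnostic of how much of the estimate depends on closely spaced SNP pairs.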

Conclusion

Here, I reviewed the main models of selection and linkage applicable to recombining genomes and two studies supporting the concept that BGS explains a very large fraction of the variation in diversity across the whole genome in D. melanogaster. As such, the BGS framework should be accepted as a sensible null model to study other forms of natural selection. I also identified and discussed demographic, analytical and methodological challenges in studies of selection at linked sites that have been often overlooked. Some of the challenges are easily addressable and I put forward that all should be considered when designing future studies of selection. Notably, several of these challenges and limitations share a potential bias towards overestimating the evidence supporting recent selective sweeps to the detriment of a BGS explanation. In part, some of the challenges stem from the units of reliable data. Most, if not all, selective sweeps initially identified in D. melanogaster covered large genomic regions of tens or hundreds of kilobases and continue to be fine examples of recent adaptive events. Inaccurate recombination rates at the scale of single genes or incomplete genome annotations have evolutionary consequences at finer scales, and caution should be applied when inferring selective signals at this resolution. From several perspectives, the effects of selection on linked sites could be regarded by population geneticists as equivalent to a physicist's view of gravitational waves (though—understandably—with much less fanfare on the news and popular culture). The analogy of selection disrupting the dynamics of drift-related parameters around selected sites, altering (curving) Ne along chromosomes, may be—up to a point—not merely a graphical one but also one that exemplifies how insightful theoretical works move research forward. 
As discussed, current approaches to identify the signatures of selection using diversity data across genomes do not fully consider the joint effects of demography, genetic linkage, rapid temporal changes in recombination landscapes and different forms of selection. The next steps towards a better understanding of how all these factors influence different genomic regions may require combining traditional population genetics, forward simulations and machine learning methods.
