Literature DB >> 29635416

New Methods for Inferring the Distribution of Fitness Effects for INDELs and SNPs.

Abstract

Small insertions and deletions (INDELs; ≤50 bp) are the most common type of variability after single nucleotide polymorphism (SNP). However, compared with SNPs, we know little about the distribution of fitness effects (DFE) of new INDEL mutations and how prevalent adaptive INDEL substitutions are. Studying INDELs has been difficult partly because identifying ancestral states at these sites is error-prone and misidentification can lead to severely biased estimates of the strength of selection. To solve these problems, we develop new maximum likelihood methods, which use polymorphism data to simultaneously estimate the DFE, the mutation rate, and the misidentification rate. These methods are applicable to both INDELs and SNPs. Simulations show that they can provide highly accurate results. We applied the methods to an INDEL polymorphism data set in Drosophila melanogaster. We found that the DFE for polymorphic INDELs in protein-coding regions is bimodal, with the variants being either nearly neutral or strongly deleterious. Based on the DFE, we estimated that 71.5-83.7% of the INDEL substitutions that took place along the D. melanogaster lineage were fixed by positive selection, which is comparable with the prevalence of adaptive substitutions at nonsynonymous sites. The new methods have been implemented in the software package anavar.

Entities: Chemical Disease Species

Mesh：

Year: 2018 PMID： 29635416 PMCID： PMC5967470 DOI： 10.1093/molbev/msy054

Source DB: PubMed Journal: Mol Biol Evol ISSN： 0737-4038 Impact factor: 16.240

Introduction

New mutations can have a range of effects on an organism’s fitness, ranging from being strongly harmful, through being only slightly deleterious, to being neutral, and finally on to being either mildly or highly beneficial. The relative frequencies of mutations with different selective effects is known as the distribution of fitness effects (DFE). The DFE is an important parameter as it is required for addressing many fundamental questions (Eyre-Walker and Keightley 2007). Examples include understanding determinants of the efficacy of natural selection (Galtier 2016; Corcoran et al. 2017), the genetic basis of polygenic traits (Zuk et al. 2014), and the evolutionary advantage of sex and recombination (Hartfield and Keightley 2012). Taking advantage of the massive increase in data availability, many methods have been proposed for estimating the DFE using polymorphism data (Eyre-Walker et al. 2006; Keightley and Eyre-Walker 2007; Eyre-Walker and Keightley 2009; Kousathanas and Keightley 2013; Kim et al. 2017; Tataru et al. 2017). Their development in turn allows more reliable inferences about other important quantities such as α, the proportion of adaptive substitutions (Eyre-Walker and Keightley 2009). However, all these methods are concerned with estimating the DFE for single nucleotide polymorphisms (SNPs). Consequently, much less is known about the DFE and α for other types of genetic variation such as small insertions and deletions (INDELs; ≤50 bp), despite the fact that INDELs are the second most common type of variants (e.g., Montgomery et al. 2013), and hence represent an important source of raw materials for selection to act on. A major difficulty in studying INDELs lies with ancestral state identification. This requires multispecies genome alignments. However, INDELs occur disproportionately in repetitive genomic regions (Ananda et al. 2013; Montgomery et al. 2013), where alignment algorithms perform poorly (Earl et al. 2014). Furthermore, there is evidence that homoplasy is a significant issue outside repetitive regions, probably due to the existence of cryptic INDEL mutation hotspots (Kvikstad and Duret 2014). Thus ancestral state identification can be expected to be particularly error prone for INDELs. It is well established that misidenfication of ancestral states can lead to severely biased estimates of the strength of selection using the site-frequency spectrum (SFS) (Hernandez et al. 2007). For SNPs, this difficulty can be avoided by using the folded SFS (e.g., Eyre-Walker et al. 2006; Keightley and Eyre-Walker 2007). However, to determine whether a length variant is an insertion or a deletion, we have to know what the ancestral state is, meaning that the issue of polarisation error is inherent for INDELs. As a result, applying existing methods for estimating the DFE to INDEL data may be liable to biases. Another challenge is that the SFSs for insertions and deletions may be affected by polarisation errors to different extents. This is because when the ancestral state of an insertion segregating at low frequency is misidentified, it will be incorrectly inferred as a deletion segregating at high frequency (and vice versa). There is direct experimental evidence that the deletion mutation rate is higher than the insertion mutation rate (Keightley et al. 2009; Schrider et al. 2013; Besenbacher et al. 2015; Yang et al. 2015). This mutational bias means that there are more deletions segregating in the population than insertions. The larger number of deletions may lead to the SFS for insertions being disproportionally affected by polarisation errors (fig. 1). This asymmetry can cause the insertion SFS to have a more pronounced, but artificial, uptick at the high-frequency end, which can be misinterpreted as stronger positive selection on insertions over deletions. As pointed out by Kvikstad and Duret (2014), this methodological issue can, at least in principle, compromises the results of previous studies, which suggest that insertions are more likely to be under positive selection than deletions to prevent the genome size from unconstrained contraction caused by the mutational bias toward deletions (Parsch 2003). Similarly, it will make it difficult to test the possibility that insertions have a higher fixation probability because they are favored by insertion-biased gene conversion (Leushkin and Bazykin 2013).

. 1.

The SFSs for insertions and deletions may be affected to different extents by polarisation errors. We assume that the population size is constant, that INDELs are neutral, and that the sample size is 10. In the genomic region under consideration, the total scaled mutation rate toward insertions, , is 10, where N is the effective population size u is the insertion mutation rate per site per generation, and m is that size of the focal region. The total scaled mutation rate toward deletions is 20. The expected SFSs were generated using standard neutral theory. The SFSs with polarisation errors were generated by assuming that the ancestral state of an INDEL was wrongly identified with probability 0.1. Toward resolving the confounding efforts ancestral state misidentification have on the study of INDELs, we propose new maximum likelihood methods for inferring the DFE using polymorphism data. These methods are based on recent studies on SNPs which show that polymorphism data contain enough information for simultaneous estimation of the mutation rate, the DFE, and the polarisation error rate (Glémin et al. 2015; Tataru et al. 2017). Our methods are more general than the existing methods in the following aspects. First, they can handle both INDELs and SNPs. Second, insertions and deletions can have different polarisation error rates, mutation rates, and DFEs. Third, for both INDELs and SNPs, the new methods allow the mutation and polarisation error rates to vary across the genome. Incorporating these heterogeneities may be particularly important for INDELs (Kvikstad and Duret 2014). We carried out extensive simulations to examine the performance of the new methods. As an example, we applied the methods to an INDEL polymorphism data set in Drosophila melanogaster we obtained by reanalysing the raw short-read data published by the Drosophila Population Genomics Project (Pool et al. 2012). Through model comparisons, we tried to find the DFE that best described the observed pattern of INDEL polymorphism within protein-coding regions of the genome. Finally, using the best-fitting DFE, we estimated the proportion of INDEL substitutions fixed by positive selection (α).

New Approach

For ease of presentation, we will start with a description of the SNP models. The INDEL models will be presented later as an extension.

The SNP Models

Consider a diploid population with effective size N. The size of the genomic region of interest is m base pairs, and the sample size is n.

The Discrete Model

Assume that there are C different classes of sites in the focal region. These sites can be different with respect to their mutation rates, the fitness effects of new mutations, and polarisation error rates. This discrete model has several advantages. First, it does not assume that the DFE follows a specific probability distribution, and is therefore able to accommodate complex scenarios such as a multimodal DFE (Kousathanas and Keightley 2013). Second, by allowing the mutation and polarisation error rates to vary freely between site classes, the method can include situations whereby these two variables covary (e.g., hypermutable regions may have a higher polarisation error rate). We assume that the mutation process can be approximated by the infinite-sites model. Let the total scaled mutation rate for sites of class c be , where and . To understand u, consider an alternative formulation whereby the mutation rate for the class of sites is v per site per generation, and sites of class c account for a fraction p of all sites in the focal region (i.e., ). We have , which leads to . By using θ, we can perform searches for maximum likelihood estimates (MLEs) of the parameters without having to deal with the constraint . Define: Thus, θ is the average scaled mutation rate per site, and the total scaled mutation rate is . If the per-site mutation rate is uniform across the focal region (i.e., for and ), then . To model selection, we assume that, for mutations arising at sites of class c, the fitnesses of the wild-type, heterozygote, and mutant homozygote genotypes are 1, 1 + s, and 1 + , respectively. The corresponding scaled selection coefficient γ is defined as . Positive and negative γ values signify beneficial and deleterious mutations, respectively. The site-frequency spectrum (SFS) for the cth site class, which is defined as the expected number of polymorphic sites of size i (i.e., sites where the derived allele is represented i times; ), is given by: where Polarisation errors distort the SFS. Specifically, when the ancestral state of a polymorphic site of size i is mis-identified, it will be regarded as a polymorphic site of size n−i. To model polarisation errors, we let ϵ be the probability that the ancestral state of a polymorphic site of class c is incorrectly identified (Glémin et al. 2015). The final SFS for sites of class c is then: In what follows, we refer to the SFS with and without the correction of polarisation errors as the corrected and uncorrected SFS, respectively. The corrected SFS for the focal region is simply the sum of all the contributions from the sites in different classes: Existing models either do not model polarisation error (Keightley and Eyre-Walker 2007; Eyre-Walker and Keightley 2009; Kim et al. 2017) or assume that the error rate is constant across the focal region (Glémin et al. 2015; Tataru et al. 2017). The model described here is therefore more general. Allowing variation in the polarisation error rate can be important. For instance, sites under stronger selective constraints tend to evolve slower, and are less likely to be polarised incorrectly due to homoplasy. It should, however, be noted that, when for , not all the parameters are identifiable. To see this, we rewrite equation (5) as: Appealing to equation (1) and defining such that we can rewrite equation (6) as Thus, when there is no difference in fitness effects between mutations arising at sites of different classes, we cannot detect variation in the scaled mutation rate and polarisation error rate because the model reduces to one that depends on θ, γ and . This result has important implications for data analysis by pointing out that a model with a small number of site classes may provide an adequate description of the data even when the underlying biological process features complex variation in the mutation rate across the genome.

The Continuous Model

Instead of assuming that the focal region is composed of several classes of sites, we can assume that the fitness effects of new mutations follows a continuous distribution characterised by parameters . Let θ be the scaled mutation rate per site, and ϵ be the polarisation error rate. The uncorrected SFS becomes: where is the probability density function. The corrected SFS is analogous to equation (4) with c in the subscripts omitted. Although the modeling framework allows the DFE to follow arbitrary probability distribution (including those mixture distributions considered by Galtier 2016), here we only consider the reflected distribution, that is, , where and a and b are the shape and scale parameters, respectively.

Parameter Estimation

Let X = (x1, x2,…, ) represent the observed SFS, where x is the number of polymorphic sites of size i in the sample. Let denote all the parameters in the model (i.e., θ, γ, and ϵ for for the discrete model and θ, , and ϵ for the continuous model). To obtain MLEs of , we use the Poisson random field model (Sawyer and Hartl 1992; Bustamante et al. 2001). Omitting constants that have no effects on the shape of the likelihood surface, the log likelihood function is defined as:

Controlling for Demography

We have so far assumed that the population is panmictic and of constant size N. To control for demography, we employ the method of Eyre-Walker et al. (2006). Take the continuous model as an example. First, we define augmented SFSs as: Next, a set of neutral variants is added to the model, which introduces two additional parameters and , which are the scaled mutation rate per site and the polarisation error rate, respectively, for the neutral sites. Let denote these new parameters and denote the neutral SFS. The log likelihood of the observed data can be calculated as: where R = (r2, r3,…, ) and the two log likelihood functions on the right-hand side are calculated in the same way as equation (10) with ψ replaced by . The above method for controlling for demography has been used extensively (Eyre-Walker et al. 2006; Muyle et al. 2011; Glémin et al. 2015; Galtier 2016; Jackson et al. 2017; Tataru et al. 2017). These previous efforts have gathered clear theoretical and empirical evidence that the method is robust against a wide range of demographic processes, as well as the effects caused by selection at linked sites (e.g., background selection and/or selective sweeps). For instance, in a recent analysis of selection on codon usage bias in Drosophila, Jackson et al. (2017) showed that the estimates of γ produced by an estimation method that corrects for demography using the r parameters as set out above closely matched those produced by another estimation method that considers an explicit one-step change in population size (see figure 4 A in Jackson et al. 2017). It should be noted that equation (12) accommodates the possibility that the focal region and the neutral region have different mutation rates. This is more general than several previous models (Keightley and Eyre-Walker 2007; Eyre-Walker and Keightley 2009; Kim et al. 2017; Tataru et al. 2017). However, it may be challenging to distinguish this model from one in which the two regions have the same mutation rate, but a proportion of new mutations in the focal region are so strongly deleterious that they make negligible contributions to the observed SFS.

The INDEL Models

First consider insertions. Assume that there are C different classes of sites. The total scaled mutation rate toward insertions for sites of class c is , and the fitness effect and polarisation error rate are and , respectively (). The uncorrected SFS for insertions of class c can be calculated using equation (2), and is denoted by . For deletions, we can similarly assume that there are C different classes of sites. The associated parameters are , and , and the uncorrected SFS is denoted by (). When the ancestral state of a derived insertion of size i is misidentified, it will be wrongly identified as a deletion of size n−i, and vice versa for deletions (note that size in this context refers to the frequency of the derived allele, not the number of base pairs inserted or deleted). Thus, the corrected SFSs for insertions and deletions are: For insertions, define the per-site scaled mutation rate and the polarisation error rate as θ and ϵ, respectively. The DFE for insertions is determined by parameters . For deletions, we similarly define the following parameters: θ, and ϵ. Finally, the corrected SFSs are: where and are the uncorrected SFSs for insertions and deletions, respectively, and are calculated in the same way as equation (9). As in the SNP case, we only consider cases where the DFE follows a reflected distribution. The shape and scale parameters for insertions and deletions are denoted by a, b, a, and b, respectively. Let X = (,…, ) and X = (,…, ) be the observed SFSs for insertions and deletions, respectively. The log likelihood of the data is calculated as: Take the continuous model as an example. The augmented SFSs are: As for the neutral reference, we can in principle use any combinations of SNPs, insertions, and deletions collected from putatively neutrally evolving regions. Assume that we have access to both neutral insertions and neutral deletions, and the observed SFSs are denoted by and , respectively. The additional parameters needed to model the neutral variants include , and , which are denoted collectively by . The log likelihood is: where the two terms on the right are calculated using equation (15) with replaced by ().

Results and Discussion

Simulation Results

We evaluate the statistical properties of the new models using computer simulations. Unless stated otherwise, the sample size (n) is 50 and the results are based on 100 replicates. In all cases, we assume the population size is constant and only analyse data from the selected region (see Materials and Methods for justification). For the SNP models, we only present results for the discrete SNP model with site classes, because both the C = 1 case and the continuous model have been analysed before (Glémin et al. 2015; Tataru et al. 2017).

Properties of the Discrete SNP Model

First consider a model with C = 2 site classes. As can be seen from table 1, there is information in the SFS for simultaneously estimating all the parameters to a high degree of accuracy. Before discussing more simulation results, it should be pointed out that, when , the order of the site classes is arbitrary. That is, the model considered in table 1 is equivalent to one with parameters , and . For both cases shown in table 1, all the MLEs can be sorted such that and . In other words, the MLEs can be assigned unambiguously to site classes according to the order given in the “True value” row. However, if we were to reduce the amount of data, parameter estimates will become more uncertain, and cases such as those with and will occur, which makes assigning the MLEs to site classes impossible. Thus, presenting mean and SD of the MLEs may give misleading information about the performance of the model.

Table 1.

Maximum Likelihood Estimates (MLEs) of the Parameters of Discrete SNP Models with C = 2 Classes of Sites.

	m	θ1	γ1	ϵ1	θ2	γ2	ϵ2
True value	–	0.005	−5	0.05	0.01	−20	0.01
Mean (SD) of MLEs	10⁶	0.0050 (0.0007)	−5.0 (0.4)	0.051 (0.006)	0.010 (0.001)	−20.2 (1.9)	0.009 (0.006)
Mean (SD) of MLEs	10⁵	0.0044 (0.0017)	−4.4 (1.5)	0.042 (0.022)	0.011 (0.001)	−20.0 (5.7)	0.016 (0.014)

Note.—Simulated data were generated using the parameter values shown in the “True value” row, with two different region sizes, m. For each parameter combination, 100 samples of size 50 were simulated and analysed to obtain MLEs.

Maximum Likelihood Estimates (MLEs) of the Parameters of Discrete SNP Models with C = 2 Classes of Sites. Note.—Simulated data were generated using the parameter values shown in the “True value” row, with two different region sizes, m. For each parameter combination, 100 samples of size 50 were simulated and analysed to obtain MLEs. In light of the earlier discussion, we investigate the statistical properties of the model using two alternative methods. First, we compare the full model to the following reduced models using the χ2 test: “Equal ϵ” (all site share the same polarisation error rate), “ϵ = 0” (no polarisation error), and “”(a model with site classes, where C is the true number of site classes). Second, we assess how well these various models predict the average fixation probability (see eq. 18 in Materials and Methods), which is essential for estimating the prevalence of adaptive substitutions (i.e., α and ω). Considering the two pairs of cases in table 2, and focusing on the data presented under “Percent significant,” we make the following observations. First, as the amount of data reduces, the ability of the model to infer separate ϵ for different site classes drops more rapidly than its ability to detect the existence of either polarisation error or more than one site class. This suggests that estimating heterogeneity in ϵ may be challenging. Considering all four cases, it appears that the tests for detecting the presence of polarisation error (i.e., the full model vs. “ϵ = 0”) and for detecting the existence of more site classes (i.e., the full model vs. “”) are more powerful, especially the latter. It should be noted that the likelihood surface appears to be rather flat when C = 3 such that different parameter combinations may produce very similar log likelihoods. This is particularly evident when the amount of data is limited (Case 3 vs. Case 4), leading to a reduction in power of the tests. A similar observation was made by Keightley and Eyre-Walker (2010), who also showed that it can be partly alleviated by increasing the sample size. Nonetheless there may well be a limit as to how many site classes can be included. This identifiability problem is analogous to that discussed extensively in the context of using SNP-based methods for estimating past demographic changes (e.g., Myers et al. 2008).

Table 2.

Statistical Properties of the Discrete SNP Model.

Case	Parameters	m	Percent Significant			μ¯
			Equal ϵ	ϵ = 0	C−1	True	Full	Equal ϵ	ϵ = 0	C−1
1	Same as table 1	10⁶	93	100	100	0.0113	0.0114	0.0171	>1	0.0022
2	Same as table 1	10⁵	15	92	100	0.0113	0.0158	0.0204	>1	0.0022
3	See notes below	10⁷	3	100	100	0.2204	0.2267	0.2613	>1	0.1755
4	Same as Case 3	2×106	0	33	55	0.2204	0.2271	0.2580	>1	0.1768

Note.—The parameters used in Case 3 were , and n = 100. A large sample size was used for Cases 3 and 4 due to the inclusion of strongly deleterious mutations (i.e., ). Values under “Percent significant” show how often the full model fitted the data better than the three reduced models (see the main text for more details). The (see eq. 18 in Materials and Methods) obtained under the ϵ = 0 model are large because ignoring polarisation error results in the inference of a site class with a strongly positive γ.

Statistical Properties of the Discrete SNP Model. Note.—The parameters used in Case 3 were , and n = 100. A large sample size was used for Cases 3 and 4 due to the inclusion of strongly deleterious mutations (i.e., ). Values under “Percent significant” show how often the full model fitted the data better than the three reduced models (see the main text for more details). The (see eq. 18 in Materials and Methods) obtained under the ϵ = 0 model are large because ignoring polarisation error results in the inference of a site class with a strongly positive γ. Interestingly, the reduced model “Equal ϵ” makes worse predictions of than the full model in all cases presented in table 2, even when the full model does not normally provide a better fit to the data (Cases 2 and 4). The same applies to the other two reduced models. Thus, despite the statistical difficulties discussed earlier, fitting the full model to the data may be important for obtaining accurate estimates of α and ω.

Properties of the INDEL Models

Table 3 contains simulation results based on a discrete model (with ) and two continuous models (differing from each other in terms of the size of the focal region m). The mutation rates are ∼10 times lower than those used in the SNP cases (tables 1 and 2), and polarisation error rates are ∼2 times higher. These choices are to reflect the fact that INDELs are generally less prevalent than SNPs, and are potentially more difficult to polarise. As can be seen, with a reasonable amount of data, all the parameters can be reliably estimated. Comparing the two continuous models, we notice that, with limited data, the scale parameter b of the distribution may be overestimated, but estimates of the shape parameter a and the polarisation error rate remain unbiased.

Table 3.

MLEs of the Parameters of Several INDEL Models.

Model	m	Parameters
Discrete	2×106	Name	θ1ins	γ1ins	ϵ1ins	θ1del	γ1del	ϵ1del
		True	0.0005	−5	0.02	0.001	−15	0.02
		Mean MLE	0.00050	−5.0	0.021	0.0010	−15.0	0.020
Continuous	2×107	Name	θ^ins	a^ins	b^ins	ϵ^ins	θ^del	a^del	b^del	ϵ^del
		True	0.0005	0.5	10	0.08	0.001	0.25	50	0.04
		Mean MLE	0.00050	0.51	10.4	0.080	0.0010	0.251	51.2	0.040
Continuous	2×106	Name	θ^ins	a^ins	b^ins	ϵ^ins	θ^del	a^del	b^del	ϵ^del
		True	0.0005	0.5	10	0.08	0.001	0.25	50	0.04
		Mean MLE	0.00054	0.51	144.7	0.082	0.0010	0.253	93.2	0.041

MLEs of the Parameters of Several INDEL Models. The true values of and for the discrete model are 0.0339 and , respectively. The mean (SD) of the estimates is 0.0345 (0.0055) for , and () for . Thus, the true values are well within the observed ranges of variability. The true values of and for the two continuous cases are 0.384 and 0.429, respectively. The mean (SD) of the estimates for the case with more data is 0.382 (0.012) for and 0.429 (0.008) for . Encouragingly, for the continuous case with less data, despite the tendency to overestimate the scale parameter, estimates of the average fixation probabilities are still highly accurate: 0.388 (0.050) for and 0.418 (0.028) for , suggesting that the reliability of estimates of α and ω is unlikely to be compromised.

Application to D. melanogaster Data

A Summary of the Data

Using the variant calling pipeline detailed in Materials and Methods, a total of 370, 217 INDELs (≤50 bp) and 1,789,367 SNPs were identified from the 17 Rwandan individuals. Our analysis primarily focuses on INDELs because SNPs have been analysed extensively before (Keightley and Eyre-Walker 2007; Eyre-Walker and Keightley 2009; Schneider et al. 2011). Similar to previous reports (e.g., Ptak and Petrov 2002), smaller INDELs are more prevalent than larger ones (supplementary fig. S1, Supplementary Material online). INDEL diversity is ∼30 times lower in protein-coding (CDS) regions than in either intronic or intergenic regions (table 4). Additionally, frameshift INDELs are rarer than nonframeshift ones (table 4 and supplementary fig. S1, Supplementary Material online). Interestingly, nonsense mutations are somewhat rarer than frameshift INDELs, an observation also made by Leushkin et al. (2013). These results indicate strong purifying selection against INDELs in protein-coding regions. INDEL diversity patterns appear to be similar between intronic and intergenic regions. They are combined and referred to as noncoding INDELs in what follows to increase statistical power.

Table 4.

Summary Statistics for the INDEL and SNP Data.

Data	Type	Diversity (π)	Tajima’s D
INDELs	CDS	5.20×10−5	−1.208
	Frameshift	2.06×10−5	−1.253
	Nonframeshift	3.14×10−5	−1.177
	Intron	0.0016	−0.729
	Intergenic	0.0017	−0.704
	Noncoding	0.0017	−0.718
SNPs	Nonsense	5.83×10−6	−1.510
	0-fold degenerate sites	0.0016	−0.868
	4-fold degenerate sites	0.0165	−0.210

Summary Statistics for the INDEL and SNP Data. Comparing between INDELs and SNPs, we notice that INDEL diversity in noncoding regions is ∼10 times lower than (4-fold site diversity; table 4), consistent with the fact that the INDEL mutation rate is lower than the point mutation rate (Haag-Liautard et al. 2007; Schrider et al. 2013). However, Tajima’s D calculated on noncoding INDELs is more negative than that calculated on 4-fold sites (table 4), probably reflecting the fact that many noncoding DNA in the D. melanogaster genome are under selection (Andolfatto 2005). Furthermore, (0-fold site diversity; table 4) is only ∼10 times smaller than . This level of reduction is much smaller than the 30-fold difference observed between CDS and noncoding INDELs. This suggests that, in protein-coding regions, INDEL mutations are under much stronger purifying selection than 0-fold mutations, which is consistent with the more negative Tajima’s D value calculated on CDS INDELs (table 4). To further investigate the data, we calculated dN, substitution rate at nonsynonymous sites, using PAML and the reference genomes of D. simulans and D. yakuba (see Materials and Methods). The genes were then divided into 20 equal-sized bins. For each bin, we calculated average and . Both statistics decrease as dN decreases (supplementary fig. S2, Supplementary Material online), consistent with the expectation that mutations are on average more deleterious in more conserved genes (Jackson et al. 2015). The results in this and the preceding paragraphs suggest that our INDEL data set is of high quality.

Inferring the DFE and α Using Noncoding INDELs as the Neutral Reference

To infer the DFE for INDELs in CDS regions, we used noncoding INDELs as the neutral reference. Following previous efforts in estimating the DFE for SNPs (Keightley and Eyre-Walker 2007; Eyre-Walker and Keightley 2009; Schneider et al. 2011; Galtier 2016; Tataru et al. 2017), we also assumed that the mutation rate toward insertions and deletions, respectively, were the same between the neutral and selected regions. The best-fitting DFE is one with C = 2 classes of selected sites (table 5 and supplementary table S1, Supplementary Material online). The MLEs of γ suggest that polymorphic INDELs are either nearly neutral or are so strongly deleterious that they contribute little to polymorphism. This seems to be consistent with the 30-fold difference in INDEL diversity level between CDS and noncoding regions, which is more substantial than the 10-fold difference between 0-fold and 4-fold sites (table 4). Fitting the data to a discrete model with C = 3 classes of sites also reveals a bimodal DFE, suggesting that the conclusion is robust (supplementary table S1, Supplementary Material online). With a larger sample containing hundreds or even thousands of alleles, and by fitting a DFE with more site classes, it should be possible to obtain further details of the relative frequencies and fitness effects of strongly selected variants, which tend not to segregate in our current sample of size 17. However, this additional information about the strongly selected end of the DFE is unlikely to affect our estimation of α (see below) because these variants make effectively no contribution to divergence.

Table 5.

Results Based on the Best-Fitting Models for INDELs in the CDS Regions of the D. melanogaster Genome.

Neutral Ref/DFE/Mutation Rate	Parameters for CDS INDELs							α
Noncoding INDELs	Name	θ1ins	γ1ins	ϵ1ins	θ1del	γ1del	ϵ1del	83.7%
Discrete C = 2	MLE	1.8×10−5	1.98	0.023	5.3×10−5	−1.69	0.016
Uniform mutation rate	Name	θ2ins	γ2ins	ϵ2ins	θ2del	γ2del	ϵ2del
	MLE	7.2×10−4	−1566.4	3.6×10−5	0.0011	−642.5	1.6×10−5
4-fold degenerate sites	Name	θ1ins	γ1ins	ϵ1ins	θ1del	γ1del	ϵ1del	71.5%
Discrete C = 2	MLE	1.6×10−5	−1.31	0.0092	4.9×10−5	−3.77	0.0082
Fixed mutation ratios	Name	θ2ins	γ2ins	ϵ2ins	θ2del	γ2del	ϵ2del
	MLE	1.9×10−4	−284.1	1.2×10−4	0.0010	−454.8	6.2×10−5

Note.—The DFE for polymorphic INDELs in the CDS regions were inferred using either noncoding INDELs or 4-fold sites as the neutral reference. A series of different DFEs were fitted to the data, and the best-fitting models presented above were determined by using the Akaike information criterion (AIC) (see supplementary tables S1 and S5, Supplementary Material online). When noncoding INDELs were used as the neutral reference, α was estimated using INDEL divergence in noncoding regions. When 4-fold sites were used as the neutral reference, the mutation rate ratio between SNPs and INDELs, and that between deletions and insertions, were fixed at values obtained from a mutation accumulation experiment (Schrider et al. 2013). α was estimated using a method based on divergence in the 8–30 bp region of short introns < 66 bp long (see the main text).

Results Based on the Best-Fitting Models for INDELs in the CDS Regions of the D. melanogaster Genome. Note.—The DFE for polymorphic INDELs in the CDS regions were inferred using either noncoding INDELs or 4-fold sites as the neutral reference. A series of different DFEs were fitted to the data, and the best-fitting models presented above were determined by using the Akaike information criterion (AIC) (see supplementary tables S1 and S5, Supplementary Material online). When noncoding INDELs were used as the neutral reference, α was estimated using INDEL divergence in noncoding regions. When 4-fold sites were used as the neutral reference, the mutation rate ratio between SNPs and INDELs, and that between deletions and insertions, were fixed at values obtained from a mutation accumulation experiment (Schrider et al. 2013). α was estimated using a method based on divergence in the 8–30 bp region of short introns < 66 bp long (see the main text). To better understand the effects of length, we separated the INDELs in CDS regions into the following length categories: 1 bp, 2 bp, 3 bp, frameshifting (≥4 bp), and nonframeshifting (≥6 bp). We analysed the data in each category separately. As above, noncoding INDELs with the same length were used as the neutral reference and the mutation was assumed to be constant across neutral and selected sites. Considering the dearth of variants, we only fitted a DFE with C = 1 class of selected sites. Viewing the γ in this model as the “average” selection coefficient, frameshift INDELs are consistently more deleterious than nonframeshift INDELs (supplementary fig. S3, Supplementary Material online). Consistent with a prevous study (Leushkin et al. 2013), there is no obvious evidence that longer INDELs are under stronger selection. Using the best-fitting DFE (table 5), the proportion of INDEL substitutions in the CDS regions fixed by positive selection in the D. melanogaster lineage, α, is 83.7% (100% for insertions and 81.8% for deletions). These α estimates are comparable with previous estimates for SNP substitutions in CDS regions (Andolfatto et al. 2011; Schneider et al. 2011). As mentioned earlier, some noncoding INDELs are probably nonneutral, as suggested by the negative Tajima’s D value (table 4). Our use of these variants as the neutral reference are for several practical reasons. Although using INDELs in “dead-on-arrival” transposable elements as neutral reference may be preferable (Petrov 2002), calling variants from repetitive regions using short-read data is highly prone to error (Li 2014). Using data from the 8–30 bp region of short introns ≤65 bp, which are also putatively neutral (Parsch et al. 2010), is also problematic because of evidence for selection maintaining intron size (Ptak and Petrov 2002; Parsch 2003; Leushkin et al. 2013). Note that Tajima’s D is more negative for INDELs in CDS regions than for those in noncoding regions, suggesting that the latter are probably under weaker purifying selection (table 4). If this is the case, our method tends to underestimate the strength of purifying selection on INDELs in CDS regions, as suggested by the simulation results presented in supplementary table S2, Supplementary Material online. This should lead to an overestimation of , the average fixation rate (eq. 18), which should in turn put a downward pressure on the estimation of α (eq. 19). However, biases in α also depend on the way selection on noncoding INDELs alters divergence. For example, if fixations of beneficial noncoding INDELs are so common that dS is greater than the divergence level expected under neutral evolution, then this combined with the overestimation of can lead to a substantial underestimation of α. In contrast, if most noncoding INDELs are selected against and dS is much smaller than the neutral expectation, it may offset the effect caused by the overestimation of and result in an overestimation of α.

Inferring the DFE and α Using 4-Fold Degenerate Sites as the Neutral Reference

To check the robustness of our results, we conducted a second set of analyses without using noncoding INDELs. We extended our model such that it can infer the DFE for INDELs in CDS regions using 4-fold sites as the neutral reference. We chose 4-fold sites instead of the 8–30 bp region of short introns ≤65 bp because 4-fold sites are probably not under ongoing selection on codon usage in D. melanogaster, and are similar to short introns in multiple aspects of polymorphism patterns (Jackson et al. 2017). Considering the parameter richness of the models, using 4-fold SNPs as the neutral reference should help statistical inference because they are much more numerous than short-intron SNPs. We used the following approach to obtain neutral divergence for INDELs along the D. melanogaster lineage. The nucleotide divergence in the 8–30 bp region of short introns ≤65 bp is 0.0674 (Jackson B, personal communication). In a mutation accumulation experiment (Schrider et al. 2013), it was found that the rate to point mutations is 12.2 times higher than that to short INDELs, and that the rate to deletions is 5 times higher than that to insertions (averaging across the two genetic backgrounds considered therein). Thus, an estimate of neutral INDEL divergence can be obtained as , and the corresponding estimates for insertions and deletions are and 0.0046, respectively. Due to the use of 4-fold sites as the neutral reference, it is no longer appropriate to assume that the mutation rate is the same between the selected and neutral regions. Given the evidence that the DFE for INDELs probably features a class of strongly deleterious mutations that make little contribution to polymorphism, allowing the selected and neutral regions to have their separate mutation rates is likely to cause the model to underestimate both the mutation rate in the selected region and strength of purifying selection, as confirmed by simulation results presented in supplementary table S3, Supplementary Material online. An underestimation of the strength of purifying selection is likely to cause an underestimation of α. We observed this in our data set—α for all INDELs obtained from the best-fitting DFE for this analysis (supplementary table S4, Supplementary Material online) is only 21.7%, much smaller than the value of 83.7% when noncoding INDELs were used as the neutral reference (table 5). To resolve the above problem, we again made use of the information reported in the aforementioned mutation accumulation experiment (Schrider et al. 2013). Specifically, we further extended our model, so that the mutation rate ratio between SNPs and INDELs, and that between deletions and insertions, were fixed at 12.2 and 5, respectively. As shown in table 5 (see also supplementary table S5, Supplementary Material online), the best-fitting DFE has C = 2 class of sites, with one under weak selection, and the other being strongly deleterious. The α estimates for all INDELs, insertions, and deletions are, respectively, 71.5%, 59.7%, and 81.3%. To make sure that the above results are not dependent on our use of the mutation rate ratios estimated by Schrider et al. (2013), we repeated the analysis using ratios obtained by either Petrov and Hartl (1998) (SNP/INDEL = 6.9 and deletion/insertion = 8.7) or Haag-Liautard et al. (2007) (SNP/INDEL = 4.2 and deletion/insertion = 3.0) (supplementary table S6, Supplementary Material online). In both cases, the best-fitting DFE has C = 2 classes of selected sites, under weak and strong selection, respectively (supplementary tables S7 and S8, Supplementary Material online). Furthermore, estimates of the strength of purifying selection acting on sites in the weakly selected class are almost identical regardless of the choice of mutation rate ratios (supplementary table S9, Supplementary Material online). Thus, unsurprisingly, all three analyses also produce very similar α estimates (supplementary table S9, Supplementary Material online). Overall, these results are consistent with those based on noncoding INDELs and suggest that a substantial fraction of INDEL substitutions were fixed by positive selection.

Materials and Methods

Numerical Details

We used numerical routines provided by the GNU Scientific Library (GSL; https://www.gnu.org/software/gsl/; last accessed April 6, 2018) to perform the integration in equation (3) numerically. For the continuous model (e.g., eq. 9), the integral was evaluated using Gaussian quadrature, which was implemented based on a routine included in the R package statmod (https://cran.r-project.org/web/packages/statmod/index.html; last accessed April 6, 2018). Maximum likelihood estimates of the model parameters were obtained by both gradient-based and derivative-free optimization algorithms implemented in the NLopt package (http://ab-initio.mit.edu/wiki/index.php/NLopt; last accessed April 6, 2018). To ensure the global maximum was found, we initialised the search algorithm using multiple randomly selected starting points.

Simulations

We performed parameter estimation using our program, anavar, on random samples simulated using Mathematica (http://www.wolfram.com/; last accessed April 6, 2018). Because the generation of simulated data is separate from the numerical routines we used to implement anavar, this set-up can help verify the numerical robustness of anavar. Note that, in all simulations, we only used the models to analyse variants from selected regions because we wanted to find out how much information we could obtain by analysing them alone. Including neutral variants, as routinely done in real data analysis, may help to increase the accuracy of parameter estimation. Therefore, our choice should give us a rather conservative assessment of the methods’ performance. In addition to testing whether the data contained enough information for all the parameters to be estimated, we also assessed how well a model could predict the average fixation rate, (expressed in units of generations). As an example, if nonsynonymous polymorphism data are fitted to the discrete SNP model, can be estimated as: where signifies the MLE of parameter Z and θ is defined by equation (1). Understanding the ability to accurately estimate is important because it is needed for estimating α, the proportion of substitutions fixed by positive selection, which can be written as: where dN and dS are the numbers of selected (e.g., nonsynonymous) and neutral (e.g., synonymous) substitutions per site, respectively (Eyre-Walker and Keightley 2009). We did not generate simulated data from models with demographic changes and selection at linked sites because the effectiveness of the method of Eyre-Walker et al. (2006) in controlling for these confounding factors have been studied extensively (Eyre-Walker et al. 2006; Muyle et al. 2011; Glémin et al. 2015; Galtier 2016; Jackson et al. 2017; Tataru et al. 2017).

The Drosophila melanogaster Data Set

This data set consisted of 17 Rwandan individuals as described in Jackson et al. (2015, 2017) and made available by the Drosophila Population Genomics Project (Pool et al. 2012).

Variant Calling

INDEL realigned BAM files were obtained from Jackson et al. (2017). Initial genotype calling was performed with the HaplotypeCaller and GenotypeGVCF (with the -includeNonVariantSites flag to output genotype calls at both variant and nonvariant positions) tools from GATK 3.7 (DePristo et al. 2011; Van der Auwera et al. 2013). Variant quality score recalibration (VQSR) requires one “truth set” for SNPs and one for INDELs. To generate the truth sets, we intersected the raw variants called from GATK with variants called from SAMtools (version 1.2) (Li et al. 2009). The consensus data were further filtered using the GATK best practice hard filters (for SNPs: QD < 2.0, MQ < 40.0, FS > 60.0, SOR > 3.0, MQRankSum < −12.5, ReadPosRankSum < −8.0; for INDELs: QD < 2.0, ReadPosRankSum < −20.0, FS > 200.0, SOR > 10.0; see https://software.broadinstitute.org/gatk/guide/article? id=3,225; last accessed April 6, 2018). Variants with coverage more than twice, or less than half, the mean coverage of 20× were excluded, along with variants falling into regions identified by RepeatMasker (http://www.repeatmasker.org). Multiallelic sites were excluded along with SNPs falling within INDELs and INDELs >50 bp. We ran VQSR separately for SNPs and INDELs, retaining variants that fell within the 95% tranche cut-off as in Jackson et al. (2017). The passing variants were then refiltered as above with the exception of the GATK hard filters which were not reapplied.

Multispecies Alignments and Polarisation

Multispecies alignments were generated between D. melanogaster (v5.34), D. simulans (Hu et al. 2013), and D. yakuba (v1.3) using D. melanogaster as reference. Firstly pairwise alignments were created using LASTZ (Harris 2007). These were then chained and netted using axtChain and chainNet, respectively (Kent et al. 2003). Single coverage was ensured for the reference genome using single_cov2.v11 from the MULTIZ package (Blanchette et al. 2004) and the pairwise alignments were aligned with MULTIZ. Variants were polarised using the whole genome multispecies alignment and a parsimony approach, where either the alternate or the reference allele had to be supported by all outgroups in the the alignment to be considered ancestral. The site-frequency spectra for insertions and deletions in different genomic regions are presented in supplementary figure S4, Supplementary Material online.

Annotation

Variants were annotated as either intronic, intergenic, or CDS using the D. melanogaster GFF annotation file (version 5.34, available from: ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.34_FB2011_02/gff/). Four-fold degenerate and 0-fold degenerate SNPs in CDS regions were annotated using coordinates obtained from the D. melanogaster CDS fasta sequences (version 5.34, available from: ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.34_FB2011_02/fasta/dmel-all-CDS-r5.34.fasta.gz).

Summary Statistics

Nucleotide diversity (π) (Tajima 1983), Watterson’s θ (Watterson 1975), and Tajima’s D (Tajima 1989) were calculated for variants in noncoding (intronic and intergenic) and coding regions, as well as for 0-fold and 4-fold degenerate SNPs. The numbers of callable sites used to obtain per-site estimates was taken to be the number of sites in each region that were called in the “all sites” VCF file and passed the filters described previously. Additionally for polarised variants the number of callable sites was reduced to those that could be polarised by our parsimony approach. To obtain rates of divergence at nonsynonymous and synonymous sites, denoted by dN and dS, CDS regions were extracted from the multispecies alignment using the coordinates from the D. melanogaster CDS fasta alignment file. CDS alignments were removed if they were not in frame, did not start with a start codon, did not end with a stop codon or contained premature stop codons. Additionally any codons with missing data were dropped. For each gene, we retained only the longest transcript. These data were then analysed using codeml in PAML (Yang 2007) with a one ratio model to obtain dN and dS.

Software Availability

The new models have been implemented in a user-friendly package anavar, which is freely available at http://zeng-lab.group.shef.ac.uk. In addition to the models developed herein, anavar also contains implementations of several other widely-used models for estimating the DFE (Eyre-Walker et al. 2006) and for studying GC-biased gene conversion (gBGC) (Glémin et al. 2015). All scripts used for the anavar simulation analyses are available at https://github.com/henryjuho/anavar_simulations. Additionally, all scripts used in the D. melanogaster analyses can be found at https://github.com/henryjuho/drosophila_indels.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online. Click here for additional data file.

51 in total

1. Directional selection and the site-frequency spectrum.

Authors: C D Bustamante; J Wakeley; S Sawyer; D L Hartl
Journal: Genetics Date: 2001-12 Impact factor: 4.562

2. On the number of segregating sites in genetical models without recombination.

Authors: G A Watterson
Journal: Theor Popul Biol Date: 1975-04 Impact factor: 1.570

Review 3. Current hypotheses for the evolution of sex and recombination.

Authors: Matthew Hartfield; Peter D Keightley
Journal: Integr Zool Date: 2012-06 Impact factor: 2.654

4. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.

Authors: F Tajima
Journal: Genetics Date: 1989-11 Impact factor: 4.562

5. How intron splicing affects the deletion and insertion profile in Drosophila melanogaster.

Authors: Susan E Ptak; Dmitri A Petrov
Journal: Genetics Date: 2002-11 Impact factor: 4.562

6. On the utility of short intron sequences as a reference for the detection of positive and negative selection in Drosophila.

Authors: John Parsch; Sergey Novozhilov; Sarah S Saminadin-Peter; Karen M Wong; Peter Andolfatto
Journal: Mol Biol Evol Date: 2010-02-11 Impact factor: 16.240

7. Joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies.

Authors: Peter D Keightley; Adam Eyre-Walker
Journal: Genetics Date: 2007-12 Impact factor: 4.562

8. DNA loss and evolution of genome size in Drosophila.

Authors: Dmitri A Petrov
Journal: Genetica Date: 2002-05 Impact factor: 1.082

9. Variation in the Intensity of Selection on Codon Bias over Time Causes Contrasting Patterns of Base Composition Evolution in Drosophila.

Authors: Benjamin C Jackson; José L Campos; Penelope R Haddrill; Brian Charlesworth; Kai Zeng
Journal: Genome Biol Evol Date: 2017-01-01 Impact factor: 3.416

10. Population Genomics of sub-saharan Drosophila melanogaster: African diversity and non-African admixture.

Authors: John E Pool; Russell B Corbett-Detig; Ryuichi P Sugino; Kristian A Stevens; Charis M Cardeno; Marc W Crepeau; Pablo Duchen; J J Emerson; Perot Saelao; David J Begun; Charles H Langley
Journal: PLoS Genet Date: 2012-12-20 Impact factor: 5.917

9 in total

1. Comparison of the Full Distribution of Fitness Effects of New Amino Acid Mutations Across Great Apes.

Authors: David Castellano; Moisès Coll Macià; Paula Tataru; Thomas Bataillon; Kasper Munch
Journal: Genetics Date: 2019-09-05 Impact factor: 4.562

2. GROWTH POLE RING protein forms a 200-nm-diameter ring structure essential for polar growth and rod shape in Agrobacterium tumefaciens.

Authors: J R Zupan; R Grangeon; J S Robalino-Espinosa; N Garnica; P Zambryski
Journal: Proc Natl Acad Sci U S A Date: 2019-05-13 Impact factor: 11.205

3. Exonic splice regulation imposes strong selection at synonymous sites.

Authors: Rosina Savisaar; Laurence D Hurst
Journal: Genome Res Date: 2018-08-24 Impact factor: 9.043

4. The Impact of Natural Selection on Short Insertion and Deletion Variation in the Great Tit Genome.

Authors: Henry J Barton; Kai Zeng
Journal: Genome Biol Evol Date: 2019-06-01 Impact factor: 3.416

5. Methods for Estimating Demography and Detecting Between-Locus Differences in the Effective Population Size and Mutation Rate.

Authors: Kai Zeng; Benjamin C Jackson; Henry J Barton
Journal: Mol Biol Evol Date: 2019-02-01 Impact factor: 16.240

6. Genetic Drift Dominates Genome-Wide Regulatory Evolution Following an Ancient Whole-Genome Duplication in Atlantic Salmon.

Authors: Jukka-Pekka Verta; Henry J Barton; Victoria Pritchard; Craig R Primmer
Journal: Genome Biol Evol Date: 2021-05-07 Impact factor: 3.416

7. Inferring Parameters of the Distribution of Fitness Effects of New Mutations When Beneficial Mutations Are Strongly Advantageous and Rare.

Authors: Tom R Booker
Journal: G3 (Bethesda) Date: 2020-07-07 Impact factor: 3.154

8. A Study of Faster-Z Evolution in the Great Tit (Parus major).

Authors: Kai Hayes; Henry J Barton; Kai Zeng
Journal: Genome Biol Evol Date: 2020-03-01 Impact factor: 3.416

9. Inferring Genome-Wide Correlations of Mutation Fitness Effects between Populations.

Authors: Xin Huang; Alyssa Lyn Fortier; Alec J Coffman; Travis J Struck; Megan N Irby; Jennifer E James; José E León-Burguete; Aaron P Ragsdale; Ryan N Gutenkunst
Journal: Mol Biol Evol Date: 2021-09-27 Impact factor: 16.240

9 in total