Abstract
Small insertions and deletions (INDELs; ≤50 bp) are the most common type of variability after single nucleotide polymorphisms (SNPs). However, compared with SNPs, we know little about the distribution of fitness effects (DFE) of new INDEL mutations and how prevalent adaptive INDEL substitutions are. Studying INDELs has been difficult partly because identifying ancestral states at these sites is error-prone and misidentification can lead to severely biased estimates of the strength of selection. To solve these problems, we develop new maximum likelihood methods, which use polymorphism data to simultaneously estimate the DFE, the mutation rate, and the misidentification rate. These methods are applicable to both INDELs and SNPs. Simulations show that they can provide highly accurate results. We applied the methods to an INDEL polymorphism data set in Drosophila melanogaster. We found that the DFE for polymorphic INDELs in protein-coding regions is bimodal, with the variants being either nearly neutral or strongly deleterious. Based on the DFE, we estimated that 71.5–83.7% of the INDEL substitutions that took place along the D. melanogaster lineage were fixed by positive selection, which is comparable with the prevalence of adaptive substitutions at nonsynonymous sites. The new methods have been implemented in the software package anavar.
Year: 2018 PMID: 29635416 PMCID: PMC5967470 DOI: 10.1093/molbev/msy054
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Fig. 1. The SFSs for insertions and deletions may be affected to different extents by polarisation errors. We assume that the population size is constant, that INDELs are neutral, and that the sample size is 10. In the genomic region under consideration, the total scaled mutation rate toward insertions is 10; it depends on N, the effective population size, u, the insertion mutation rate per site per generation, and m, the size of the focal region. The total scaled mutation rate toward deletions is 20. The expected SFSs were generated using standard neutral theory. The SFSs with polarisation errors were generated by assuming that the ancestral state of an INDEL was wrongly identified with probability 0.1.
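The setup in this caption can be sketched numerically. The following is a minimal illustration, not the authors' code: it assumes the standard neutral expectation of θ/i variants at derived count i, and it assumes that a variant with a misidentified ancestral state is recorded as the opposite type at the complementary count n − i. The function names are hypothetical.

```python
def neutral_sfs(theta, n):
    """Expected unfolded SFS under standard neutral theory: theta / i for i = 1..n-1."""
    return [theta / i for i in range(1, n)]


def apply_polarisation_error(focal, other, eps):
    """Return the SFS observed for the focal type when ancestral states are wrong with probability eps.

    Assumption of this sketch: a misidentified variant of the other type with
    derived count i is recorded as the focal type at count n - i.
    """
    k = len(focal)  # k = n - 1 frequency classes
    return [(1 - eps) * focal[j] + eps * other[k - 1 - j] for j in range(k)]


n = 10                              # sample size from the caption
true_ins = neutral_sfs(10.0, n)     # total scaled mutation rate toward insertions
true_del = neutral_sfs(20.0, n)     # total scaled mutation rate toward deletions
obs_ins = apply_polarisation_error(true_ins, true_del, 0.1)
obs_del = apply_polarisation_error(true_del, true_ins, 0.1)
```

Under these assumptions the highest-frequency class for insertions rises from 10/9 to 3, whereas for deletions it rises from 20/9 to 3, so the rarer mutation type is distorted proportionally more.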
Maximum Likelihood Estimates (MLEs) of the Parameters of Discrete SNP Models with C = 2 Classes of Sites.
| | m | θ₁ | γ₁ | ϵ₁ | θ₂ | γ₂ | ϵ₂ |
|---|---|---|---|---|---|---|---|
| True value | – | 0.005 | −5 | 0.05 | 0.01 | −20 | 0.01 |
| Mean (SD) of MLEs | 10⁶ | 0.0050 (0.0007) | −5.0 (0.4) | 0.051 (0.006) | 0.010 (0.001) | −20.2 (1.9) | 0.009 (0.006) |
| Mean (SD) of MLEs | 10⁵ | 0.0044 (0.0017) | −4.4 (1.5) | 0.042 (0.022) | 0.011 (0.001) | −20.0 (5.7) | 0.016 (0.014) |
Note.—Simulated data were generated using the parameter values shown in the “True value” row, with two different region sizes, m. For each parameter combination, 100 samples of size 50 were simulated and analysed to obtain MLEs.
Statistical Properties of the Discrete SNP Model.
| Case | Parameters | Percent Significant | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Equal | True | Full | Equal | |||||||
| 1 | Same as | 10⁶ | 93 | 100 | 100 | 0.0113 | 0.0114 | 0.0171 | 0.0022 | |
| 2 | Same as | 10⁵ | 15 | 92 | 100 | 0.0113 | 0.0158 | 0.0204 | 0.0022 | |
| 3 | See notes below | 10⁷ | 3 | 100 | 100 | 0.2204 | 0.2267 | 0.2613 | 0.1755 | |
| 4 | Same as Case 3 | | 0 | 33 | 55 | 0.2204 | 0.2271 | 0.2580 | 0.1768 | |
Note.—The parameters used in Case 3 were , and n = 100. A large sample size was used for Cases 3 and 4 due to the inclusion of strongly deleterious mutations (i.e., ). Values under “Percent significant” show how often the full model fitted the data better than the three reduced models (see the main text for more details). The (see eq. 18 in Materials and Methods) obtained under the ϵ = 0 model are large because ignoring polarisation error results in the inference of a site class with a strongly positive γ.
MLEs of the Parameters of Several INDEL Models.
| Model | Parameters | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Discrete | Name | |||||||||
| True | 0.0005 | −5 | 0.02 | 0.001 | −15 | 0.02 | ||||
| Mean MLE | 0.00050 | −5.0 | 0.021 | 0.0010 | −15.0 | 0.020 | ||||
| Continuous | Name | |||||||||
| True | 0.0005 | 0.5 | 10 | 0.08 | 0.001 | 0.25 | 50 | 0.04 | ||
| Mean MLE | 0.00050 | 0.51 | 10.4 | 0.080 | 0.0010 | 0.251 | 51.2 | 0.040 | ||
| Continuous | Name | |||||||||
| True | 0.0005 | 0.5 | 10 | 0.08 | 0.001 | 0.25 | 50 | 0.04 | ||
| Mean MLE | 0.00054 | 0.51 | 144.7 | 0.082 | 0.0010 | 0.253 | 93.2 | 0.041 | ||
Summary Statistics for the INDEL and SNP Data.
| Data | Type | Diversity (π) | Tajima’s D |
|---|---|---|---|
| INDELs | CDS | | −1.208 |
| | Frameshift | | −1.253 |
| | Nonframeshift | | −1.177 |
| | Intron | 0.0016 | −0.729 |
| | Intergenic | 0.0017 | −0.704 |
| | Noncoding | 0.0017 | −0.718 |
| SNPs | Nonsense | | −1.510 |
| | 0-fold degenerate sites | 0.0016 | −0.868 |
| | 4-fold degenerate sites | 0.0165 | −0.210 |
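For readers who want to see how summary statistics of this kind relate to an SFS, the sketch below computes pairwise diversity and Tajima's D from an unfolded SFS using the standard textbook formulas (Tajima 1989). It is an illustration only and not the pipeline used for the table; the function name is hypothetical, and the diversity it returns is a total over the region, so it would need to be divided by the number of sites to obtain a per-site value.

```python
import math


def diversity_and_tajimas_d(sfs):
    """Return (pi, D) from an unfolded SFS, where sfs[i-1] is the number of sites with derived count i."""
    n = len(sfs) + 1                      # sample size
    s = sum(sfs)                          # number of segregating sites
    # Mean number of pairwise differences, summed over sites
    pi = sum(x * 2 * i * (n - i) for i, x in enumerate(sfs, start=1)) / (n * (n - 1))
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))
    return pi, d


# Hypothetical example: an excess of singletons gives a negative D
pi_example, d_example = diversity_and_tajimas_d([10, 0, 0, 0])
```

A negative D, as seen throughout the table, indicates an excess of low-frequency variants relative to the neutral, constant-size expectation.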
Results Based on the Best-Fitting Models for INDELs in the CDS Regions of the D. melanogaster Genome.
| Neutral Ref/DFE/Mutation Rate | Parameters for CDS INDELs | |||||||
|---|---|---|---|---|---|---|---|---|
| Noncoding INDELs | Name | 83.7% | ||||||
| Discrete | MLE | 1.98 | 0.023 | −1.69 | 0.016 | |||
| Uniform mutation rate | Name | |||||||
| MLE | −1566.4 | 0.0011 | −642.5 | |||||
| 4-fold degenerate sites | Name | 71.5% | ||||||
| Discrete | MLE | −1.31 | 0.0092 | −3.77 | 0.0082 | |||
| Fixed mutation ratios | Name | |||||||
| MLE | −284.1 | 0.0010 | −454.8 | |||||
Note.—The DFE for polymorphic INDELs in the CDS regions was inferred using either noncoding INDELs or 4-fold sites as the neutral reference. A series of different DFEs were fitted to the data, and the best-fitting models presented above were determined using the Akaike information criterion (AIC) (see supplementary tables S1 and S5, Supplementary Material online). When noncoding INDELs were used as the neutral reference, α was estimated using INDEL divergence in noncoding regions. When 4-fold sites were used as the neutral reference, the mutation rate ratio between SNPs and INDELs, and that between deletions and insertions, were fixed at values obtained from a mutation accumulation experiment (Schrider et al. 2013). α was estimated using a method based on divergence in the 8–30 bp region of short introns < 66 bp long (see the main text).
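The AIC-based model choice mentioned in the note can be illustrated with a short sketch. The model names, log-likelihoods, and parameter counts below are hypothetical placeholders, not values from the paper or its supplementary tables; only the formula AIC = 2k − 2 ln L is standard.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical candidate models: name -> (maximised log-likelihood, number of free parameters)
candidates = {
    "one_class": (-1200.0, 3),
    "two_class": (-1190.0, 6),
}

aic_scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
best_model = min(aic_scores, key=aic_scores.get)
```

In this made-up example the two-class model is preferred because its gain in log-likelihood outweighs the penalty for three extra parameters.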