| Literature DB >> 21840857 |
Badri Padhukasahasram1, Bruce Rannala.
Abstract
Meiotic recombination is a fundamental cellular mechanism in sexually reproducing organisms and its different forms, crossing over and gene conversion both play an important role in shaping genetic variation in populations. Here, we describe a coalescent-based full-likelihood Markov chain Monte Carlo (MCMC) method for jointly estimating the crossing-over, gene-conversion, and mean tract length parameters from population genomic data under a Bayesian framework. Although computationally more expensive than methods that use approximate likelihoods, the relative efficiency of our method is expected to be optimal in theory. Furthermore, it is also possible to obtain a posterior sample of genealogies for the data using this method. We first check the performance of the new method on simulated data and verify its correctness. We also extend the method for inference under models with variable gene-conversion and crossing-over rates and demonstrate its ability to identify recombination hotspots. Then, we apply the method to two empirical data sets that were sequenced in the telomeric regions of the X chromosome of Drosophila melanogaster. Our results indicate that gene conversion occurs more frequently than crossing over in the su-w and su-s gene sequences while the local rates of crossing over as inferred by our program are not low. The mean tract lengths for gene-conversion events are estimated to be ∼70 bp and 430 bp, respectively, for these data sets. Finally, we discuss ideas and optimizations for reducing the execution time of our algorithm.Entities:
Mesh:
Year: 2011 PMID: 21840857 PMCID: PMC3189816 DOI: 10.1534/genetics.111.130195
Source DB: PubMed Journal: Genetics ISSN: 0016-6731 Impact factor: 4.562
Comparison between the MCMC program and coalescent simulations
| ρ | γ | E(H) | E(H) | Mean tract | Sample size | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.000 | 0.300 | 0.0000 | 0.0000 | 4.4084 | 4.4233 | 2.4454 | 2.4572 | 5 | 8 | 4 | 4 |
| 0.000 | 0.100 | 0.0000 | 0.0000 | 1.2583 | 1.2614 | 1.9721 | 1.9781 | 2 | 8 | 4 | 4 |
| 0.100 | 0.100 | 0.3163 | 0.3162 | 0.5782 | 0.5777 | 1.2694 | 1.2682 | 5 | 2 | 4 | 4 |
| 0.300 | 0.300 | 2.4264 | 2.4281 | 2.7716 | 2.7760 | 4.4577 | 4.4683 | 5 | 8 | 4 | 4 |
| 0.100 | 0.300 | 0.8060 | 0.8061 | 2.5602 | 2.5605 | 4.4287 | 4.4282 | 5 | 8 | 4 | 4 |
| 0.001 | 0.001 | 29.884 | 30.646 | 3.3402 | 3.4341 | 5.7887 | 5.9864 | 100 | 8 | 8 | 8000 |
Population crossing-over and gene-conversion rates per base pair.
Average of crossing-over count, gene-conversion count, and height of genealogy from coalescent simulations.
Average of crossing-over count, gene-conversion count, and height of genealogy from MCMC algorithm.
Number of markers.
Sequence length in base pairs.
Estimated locations of hotspots in three independent simulated data sets
| Data set | Hotspot start, bp | Hotspot end, bp | Hotspot intensity estimate | |
|---|---|---|---|---|
| Replicate 1 | 9,486 (8,746–10,226) | 10,555 (8,911–12,199) | 4,030.71 | 2.375 (0.375–6.125) |
| Replicate 2 | 10,885 (5,240–16,530) | 11,879 (6,969–16,789) | 8,160.21 | 0.375 (0.125–7.375) |
| Replicate 3 | 9,822 (8,087–11,557) | 11,068 (9,240–12,896) | 3,968.12 | 3.625 (1.125–4.625) |
Each simulated data set consists of 20-kb sequences and 20 samples with a single hotspot of intensity 5000/Mb between 9 kb and 11 kb. The relative rate of gene conversion to crossing over (f) was uniform along the sequence and was equal to 1. The estimated locations for the start and end of hotspots and f are shown along with the 90% credible intervals in parentheses.
Figure 1 Estimated locations of hotspots in data sets simulated with a uniform ratio of gene conversion to crossing over (f) = 10. Simulations consisted of 20-kb sequences and 50 samples with a single hotspot between 9 kb and 11 kb. The red curve shows the gene-conversion rates as inferred by the program of Gay under a variable rate model. The red asterisks mark the posterior means of the starts and ends of hotspots as estimated by InferRho, the blue lines represent 90% credible intervals, and the green lines show the true locations of hotspots. Most of the posterior samples obtained from InferRho contained a single hotspot.
Comparison between three different methods for recombination rate estimation
| Method | ρ | γ | MSE( | MSE( | MSE( | #( | #( | #( | |
|---|---|---|---|---|---|---|---|---|---|
| 500 | 500 | 500 | 3.488 | 1,022,389 | 0.723 | 0.71 | 0.26 | 0.36 | |
| 500 | 500 | 500 | 0.079 | 0.661 | 0.105 | 0.71 | 0.55 | 0.88 | |
| InferRho | 500 | 500 | 500 | 0.041 | 0.222 | 0.644 | 0.89 | 0.77 | 0.56 |
| 500 | 1000 | 500 | 0.085 | 618,219 | 0.349 | 0.78 | 0.48 | 0.35 | |
| 500 | 1000 | 500 | 0.060 | 1.348 | 0.098 | 0.79 | 0.70 | 0.87 | |
| InferRho | 500 | 1000 | 500 | 0.043 | 0.487 | 0.250 | 0.88 | 0.81 | 0.62 |
ρ denotes population crossing-over rate, γ denotes the population gene-conversion rate, and m denotes the mean conversion tract length for the 100 simulated data sets. Recombination rates and mean tract lengths are shown in units of per megabase and base pair, respectively.
MSE, the mean square error of the parameter estimates for the 100 simulated data sets × 10−6. For InferRho, we calculated the marginal estimate of each recombination parameter from the joint posterior distribution. Numbers for the other methods are taken from Yin (2010).
#(:2) represents the fraction of the 100 data sets for which estimates are within a factor of 2 of the true parameter value used in simulations (i.e., ρ). #(:2) and #(:2) are defined similarly.
Joint estimates of crossing over () and gene conversion () for m = 352 bp
| Gene | Method | |||
| 1,700 | 12,000 | 7.10 | ||
| 2,240 | 11,510 | 5.14 | ||
| InferRho | 1,500 (700–5,500) | 6,900 (3,500–10,300) | 4.60 (0.875–7.875) | |
| 570 | 28,000 | 48.0 | ||
| 33 | 27,040 | 819.4 | ||
| InferRho | 4,300 (2,500–8,100) | 11,700 (5,300–16,100) | 2.72 (0.875–3.625) |
The 95% maximum posterior density credible intervals for InferRho are shown in parentheses. For InferRho, MCMC chains were run for 40 million iterations and the first 10 million iterations were discarded as burn-in. For other methods, the numbers are taken from Yin (2010).
Estimates of crossing over (), gene conversion (), and mean tract length ()
| Gene | Method | |||
|---|---|---|---|---|
| 920 | 11,600 | 480 | ||
| 1,290 | 9,860 | 560 | ||
| InferRho | 3,350 | 5,950 | 430 | |
| 8,520 | 251,130 | 5 | ||
| 1,450 | 41,090 | 162 | ||
| InferRho | 5,550 | 28,950 | 70 |
For InferRho, MCMC chains were run for 41 million and 37 million iterations (for su-s and su-w, respectively) and the first 10 million iterations were discarded as burn-in. We show the marginal estimate of each parameter from the joint posterior distribution. For other methods, the numbers are taken from Yin (2010).
Figure 2 (A) Bayes factors along the su-w gene for 10-bp windows. The logarithm of the ratio of posterior to prior odds for gene-conversion initiation points is plotted along the gene. The su-w gene is 2.5 kb long. (B) Bayes factors along the su-s gene for 10-bp windows. The logarithm of the ratio of posterior to prior odds for gene-conversion initiation points is plotted along the gene. The su-s gene is 4.1 kb long.