Hua Chen, Montgomery Slatkin.
Abstract
Inferring selection intensity and allele age from population genetic data is challenging. Here we present a method that efficiently estimates selection intensity and allele age from the multilocus haplotype structure in the vicinity of a segregating mutant under positive selection. We use a structured-coalescent approach to model the effect of directional selection on the gene genealogies of neutral markers linked to the selected mutant. The frequency trajectory of the selected allele follows the Wright-Fisher model. Given the position of the selected mutant, we propose a simplified multilocus haplotype model that efficiently captures the dynamics of the ancestral haplotypes under the joint influence of selection and recombination. This model approximates the ancestral genealogies of the sample, reducing the number of states from an exponential function of the number of single-nucleotide polymorphism loci to a quadratic function. This reduction allows parameter inference from data covering DNA regions as large as several hundred kilobases. Importance sampling algorithms are adopted to evaluate the probability of a sample by exploring the space of both allele frequency trajectories of the selected mutation and gene genealogies of the linked sites. We demonstrate by simulation that the method can accurately estimate selection intensity for moderate and strong positive selection. We apply the method to a data set of the G6PD gene in an African population and obtain an estimate of 0.0456 (95% confidence interval 0.0144-0.0769) for the selection intensity. The proposed method is novel in jointly modeling the multilocus haplotype pattern caused by recombination and mutation, allowing the analysis of haplotype data in recombining regions. Moreover, the method is applicable to data from populations under exponential growth and a variety of other demographic histories.
Keywords: allele age; haplotype structure; importance sampling; selection coefficient; structured coalescent; time-varying population size
Year: 2013 PMID: 23797107 PMCID: PMC3737182 DOI: 10.1534/g3.113.006197
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1. An illustration of the structured-coalescent approach for modeling positive selection. The historical population size is indicated by the distance between the two dashed lines, and the frequency trajectory of the selected allele is indicated by a thin solid curve. The coalescent history of the selected locus, with five derived lineages (solid bold lines) and five ancestral lineages (dotted bold lines), is superimposed on the trajectory and population-size curves. The present time, t = 0, is at the bottom. The time at which the trajectory of the selected mutant merges with the population-size curve is the time when the selected mutant arose in the population, i.e., the allele age T. In the model presented in the main text, only the sub-genealogies in the selected allele group (bold solid lines) are considered.
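The frequency trajectory and allele age T described in this caption can be made concrete with a small forward simulation. The following is a minimal sketch, not the authors' implementation: it assumes a haploid Wright-Fisher population of constant size with genic selection, and the function name `simulate_trajectory`, its parameters, and the toy values are hypothetical. The trajectory is conditioned on the mutant reaching a chosen present-day frequency by discarding runs in which it is lost.

```python
import random

def simulate_trajectory(n_copies, s, target_freq, rng, max_tries=10_000):
    """Forward Wright-Fisher trajectory of a new beneficial mutant.

    Assumptions (not from the source): haploid population of n_copies gene
    copies, constant size, genic selection with coefficient s. Runs in which
    the mutant is lost are discarded, so the returned trajectory is
    conditioned on reaching target_freq.
    """
    for _ in range(max_tries):
        count = 1  # the mutant arises as a single copy
        traj = [count / n_copies]
        while 0 < count < n_copies and count / n_copies < target_freq:
            p = count / n_copies
            # Expected frequency after selection, before binomial sampling.
            p_sel = p * (1.0 + s) / (1.0 + s * p)
            # Binomial(n_copies, p_sel) draw as a sum of Bernoulli trials.
            count = sum(rng.random() < p_sel for _ in range(n_copies))
            traj.append(count / n_copies)
        if count / n_copies >= target_freq:
            return traj
    raise RuntimeError("mutant never reached the target frequency")

# Toy values chosen only to keep the run fast.
rng = random.Random(1)
traj = simulate_trajectory(n_copies=200, s=0.1, target_freq=0.5, rng=rng)
allele_age = len(traj) - 1  # generations between origin and the present
print(f"allele age T = {allele_age} generations, final frequency = {traj[-1]:.3f}")
```

Read backward from the last entry, the list plays the role of the thin solid curve in the figure: the last entry is the present (t = 0) and the first entry is the generation in which the mutant arose.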
Figure 2. Flowchart of the importance sampling procedures of the method.
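The basic mechanics of importance sampling can be shown with a toy example: draw from a proposal distribution, weight each draw by the ratio of target to proposal density, and average. This sketch is a generic illustration under stated assumptions, not the procedure in the flowchart; the method itself samples allele frequency trajectories and genealogies, whereas here the target is a standard normal density and all function names are hypothetical.

```python
import math
import random

def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of a normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def importance_estimate(h, target_pdf, proposal_pdf, proposal_draw, n, rng):
    """Estimate E_target[h(X)] using n draws from the proposal.

    Each draw x contributes h(x) * target_pdf(x) / proposal_pdf(x); the ratio
    is the importance weight that corrects for sampling from the proposal.
    """
    total = 0.0
    for _ in range(n):
        x = proposal_draw(rng)
        total += h(x) * target_pdf(x) / proposal_pdf(x)
    return total / n

# Toy problem: P(X > 3) for X ~ N(0, 1), using a proposal centered on the
# region that matters, N(3, 1).
rng = random.Random(0)
est = importance_estimate(
    h=lambda x: 1.0 if x > 3.0 else 0.0,
    target_pdf=normal_pdf,
    proposal_pdf=lambda x: normal_pdf(x, mean=3.0),
    proposal_draw=lambda r: r.gauss(3.0, 1.0),
    n=20_000,
    rng=rng,
)
exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))
print(f"importance sampling estimate = {est:.6f}, exact = {exact:.6f}")
```

The design point carried over to the genealogical setting is the same: a proposal that concentrates draws where the target probability is non-negligible keeps the variance of the weighted average low.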
Possible transitions from (q1, q2; q3, q4) and their rates for the two-locus haplotype model
| Transition | Rate |
|---|---|
An example of a haplotype configuration demonstrating the coding rules used to denote the haplotype structure
| Haplotype Type | Count Number | Code |
|---|---|---|
There are 10 haplotypes with 25 single-nucleotide polymorphisms in four distinct groups in the sample. The mutant is located at position 18 and shown in boldface type. The ancestral region for each haplotype is highlighted. The codes for the four haplotype groups are listed in the third column.
The transition probabilities for some states of the multilocus haplotype model
| Transition | Rate |
|---|---|
Figure 3. A realization of the genealogies for a sample of four haplotypes (lineages 1−4), illustrating the possible events. Black denotes the ancestral haplotype region (see the main text for the definition of “ancestral haplotypes”), and white denotes background haplotypes. A star denotes a neutral mutation arising on the ancestral haplotype. The present time, t = 0, is at the bottom. Going back in time, the events are, in sequence: coalescence (lineages 2 and 3 coalesce into ancestral lineage 5), recombination (lineage 6 → lineage 4), coalescence (lineages 5 and 6 coalesce into ancestral lineage 7), recombination (lineage 8 → lineage 7), mutation (on lineage 9), and coalescence (lineages 8 and 9 coalesce into lineage 10).
The proposal distribution and importance weights for the importance sampling algorithm presented in the section “A proposal distribution for sampling genealogical histories conditional on a trajectory”
| ℚ | ℙ | Importance Weight |
|---|---|---|
{} denotes four possible events of the genealogical history. ℙ(−1|) and ℚ(|−1) are the one-step transition probabilities of the forward and backward Markov processes constructed for simulating the genealogical history. The importance weight is estimated by
Figure 4. The relative likelihood curve for the simulated data with the selection coefficient s = 0.05 and a constant population size N = 10,000. Eight estimates of the likelihood curve are compared. Each estimate is an independent run of our method on different simulated data. The results are from 1 million iterations.
Figure 5. The relative likelihood curve for the simulated data with the selection coefficient s = 0.005 and a constant population size N = 10,000. Eight estimates of the likelihood curve are compared. Each estimate is an independent run of our method on different simulated data. The results are from 1 million iterations.
The sample configuration of the G6PD data according to the coding rules in the section “A simplified multilocus model for haplotype structure”
| Haplotype Type | Count Number |
|---|---|
| (11, 7) | 5 |
| (4, 7) | 1 |
| (12, 7) | 1 |
| (17, 7) | 3 |
Figure 6. Likelihood curve for the G6PD data as a function of the selection coefficient with a constant population size of 10,000. The likelihood curve is smoothed by a local polynomial smoother. The point estimate of the selection coefficient is 0.0456, with a 95% confidence interval of (0.0144, 0.0769).
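A point estimate and an interval can be read off a likelihood curve evaluated on a grid of selection coefficients. The caption does not state how the 95% interval was obtained; the sketch below assumes one common choice, a likelihood-ratio interval that keeps every grid point whose log-likelihood lies within 1.92 units (half the 0.95 quantile of a chi-square distribution with one degree of freedom) of the maximum. The function name `mle_and_interval` is hypothetical, and the curve is a toy quadratic, not the G6PD data.

```python
CHI2_1DF_95 = 3.841  # 0.95 quantile of the chi-square distribution, 1 d.f.

def mle_and_interval(grid, loglik, drop=CHI2_1DF_95 / 2.0):
    """Return (estimate, lower, upper) from a gridded log-likelihood curve.

    Assumes the curve is unimodal, so that the grid points within `drop`
    log-likelihood units of the maximum form a single interval.
    """
    best = max(range(len(grid)), key=lambda i: loglik[i])
    cutoff = loglik[best] - drop
    inside = [g for g, ll in zip(grid, loglik) if ll >= cutoff]
    return grid[best], min(inside), max(inside)

# Toy log-likelihood: quadratic, peaked at s = 0.05 with curvature
# corresponding to a standard error of 0.01 (illustrative values only).
grid = [i * 0.0005 for i in range(201)]  # s from 0 to 0.1
toy_loglik = [-0.5 * ((s - 0.05) / 0.01) ** 2 for s in grid]

s_hat, lower, upper = mle_and_interval(grid, toy_loglik)
print(f"estimate = {s_hat:.4f}, interval = ({lower:.4f}, {upper:.4f})")
```

The resolution of the interval endpoints is limited by the grid spacing, so a finer grid or interpolation of the smoothed curve would be needed for tighter bounds.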
Figure 7. The posterior distribution of the allele age in generations when the selection coefficient is set to the value estimated from Figure 6.
Definitions of notation used in this article
| Notation | Meaning |
|---|---|
|  | Total number of haplotypes in the sample |
|  | Number of selected haplotypes |
|  | Number of SNPs in the sample |
|  | Number of SNPs on the left and right sides of the mutant |
| D | The |
|  | Population size at time |
| T | Allele age, or the time when the mutant arose in the population |
| s | The selection coefficient |
|  | Recombination fraction of the haplotype |
|  | Mutation rate of the haplotype |
|  | The scaled mutation rate of the haplotype |
|  | The scaled recombination rate of the haplotype |
|  | The proportion of ancestral haplotype region as a fraction of the length of the |
|  | The allele frequency trajectory |
|  | The number of copies of the selected allele in the whole population at time |
|  | The frequency of the selected allele at time |
|  | A recoded haplotype which includes two recombination |
|  | Coordinates and |
|  | The |
|  | The number of haplotypes for each haplotype group in |
| ( | The sample configuration at time |
|  | Sampling probability of the sample configuration ( |
|  | The |
|  | The total rate for events at time |
| λ | The ratio of population size at |
|  | The deleting operator that deletes |
| ℒ( | Likelihood function of the data |
|  | The |
| ℚ( | Proposal distribution for |
| ℚ( | Proposal distribution for |