Population genetics is the study of allele frequency changes driven by various evolutionary forces such as mutation, natural selection, and random genetic drift. Although natural selection is widely recognized as a bona-fide phenomenon, the extent to which it drives evolution continues to remain unclear and controversial. Various qualitative techniques, or so-called "tests of neutrality", have been introduced to detect signatures of natural selection. A decade and a half ago, Stanley Sawyer and Daniel Hartl provided a mathematical framework, referred to as the Poisson random field (PRF), with which to determine quantitatively the intensity of selection on a particular gene or genomic region. The recent availability of large-scale genetic polymorphism data has sparked widespread interest in genome-wide investigations of natural selection. To that end, the original PRF model is of particular interest for geneticists and evolutionary genomicists. In this article, we will provide a tutorial of the mathematical derivation of the original Sawyer and Hartl PRF model.
Population genetics is the study of allele frequency changes driven by various evolutionary forces such as mutation, natural selection, and random genetic drift. Although natural selection is widely recognized as a bona-fide phenomenon, the extent to which it drives evolution continues to remain unclear and controversial. Various qualitative techniques, or so-called "tests of neutrality", have been introduced to detect signatures of natural selection. A decade and a half ago, Stanley Sawyer and Daniel Hartl provided a mathematical framework, referred to as the Poisson random field (PRF), with which to determine quantitatively the intensity of selection on a particular gene or genomic region. The recent availability of large-scale genetic polymorphism data has sparked widespread interest in genome-wide investigations of natural selection. To that end, the original PRF model is of particular interest for geneticists and evolutionary genomicists. In this article, we will provide a tutorial of the mathematical derivation of the original Sawyer and Hartl PRF model.
Selectionists and neutralists have fiercely debated, for
the past five decades, the extent to which Darwinian selection has shaped
molecular evolution. However, both camps do agree that Darwinian selection is a
bona fide natural phenomenon. Therefore, various so-called “tests of
neutrality” have been developed to detect natural selection on a particular
gene or genomic location (for a review on this topic, see [1]). However, these
tests are often qualitative and only provide the directionality of selection. A
decade and a half ago, S. Sawyer and D. Hartl provided a mathematical framework
with which to determine quantitatively the intensity of selection on a
particular gene, which they applied to the Adh locus in the Drosophila genome [2].
This framework is referred to as the Poisson random field (PRF) model. They
then further used this framework to analyze codon bias in enteric bacteria [3].
Owing to the recent availability of whole genome sequences and genome-wide
human polymorphism data, it has become increasingly tractable to perform
genome-wide scans for signatures of selection. The PRF model has been applied
to estimate the intensity of selection on synonymous and nonsynonymous sites
throughout mitochondrial and nuclear genomes of a variety of species, including
human [4-12]. Very recently, due to the
advent of high-throughput experimental and computational identification of
genomic regulatory elements, there has been an interest to estimate the
intensity of natural selection on regulatory mutations. Chen and Rajewsky [13]
use the PRF, among other techniques, to provide evidence for purifying
selection (even stronger than on nonsynonymous coding sites) on a class of
regulatory sites known as microRNA target sites. Due to the potentially wide
range of applications of, and opportunities for theoretical extensions to, the
PRF model, it is an increasingly important mathematical framework for quantitative
geneticists. In this article, we will provide a tutorial of the mathematical
derivation of the basic PRF model that was originally developed in [2]. The tutorial will follow the outline provided
below:
The first three items are discussed
in [14], and the last point was
originally presented in [2]. In this tutorial, we aim to provide an
integrated and comprehensive presentation that is accessible to
nonprofessionals or beginners in the field of population genetics. Since the
primary purpose is to review mathematical derivations, familiarity with
calculus and at least a cursory knowledge of genetics will be helpful for the
reader.Wright-Fisher
model,diffusion approximation
to the Wright-Fisher model,derivation,
via diffusion theory, of formulas describing evolutionary processes of interest,derivation of
the PRF using the above-mentioned formulas.
2. The Wright-Fisher Model
The Wright-Fisher (WF) model
describes the change in frequency of a single mutation (derived allele) in a
population over time. The simplest version of the model makes the following
assumptions: (1) nonoverlapping generations, (2) constant population size in
each generation, and (3) random mating, and is described as follows.Consider a population of N diploid individuals that has a single
polymorphic site with two alleles, one ancestral and one derived. Under this
model, the frequency of the derived allele in the current generation is a
function of the selection pressure on this allele and the binomial sampling
effect with probabilities proportional to the frequency of this allele in the
previous generation. The probability, p,
that there are j genes of the derived
allele present at generation G + 1 given i genes of the derived allele present at generation G is given by the following
binomial calculation:
where Ψ depends on the relative fitness of the derived
allele.Assuming no dominance and no
recurrent mutation,
where 1 + s is the fitness of the derived allele relative to 1 for the
ancestral allele, and x (which is
simply i/2N) is the derived allele
frequency (daf) in generation G. In the simplest model (no selection and no
recurrent mutation), Ψ is simply x or i/2N.The intuition behind Ψ is the following. Consider
the scenario where both the ancestral and the derived alleles are neutrally
evolving (no or negligible selection pressure). In this case, the probability
of sampling a gene of the derived allele from the population in generation G is
simply the frequency of the derived allele in generation G, i/2N or x. This can be rewritten as x/[x + (1 − x)]. Now, suppose that the
derived allele is under some selection, s,
meaning that the fitness of the derived allele is 1 + s relative to 1 for the ancestral allele. In this case, genes are
sampled according to their relative fitnesses (as in the equation for Ψ above). Figure 1(a) provides a pictorial representation of the basic Wright-Fisher model.
Figure 1
Pictorial representation of the Wright-Fisher
process and its diffusion approximation: (a) basic Wright-Fisher model assuming selection, but no dominance
or recurrent mutation and (b) diffusion approximation
to the basic Wright-Fisher model.
3. Diffusion Theory
We define p
( as the probability that a polymorphic site has i genes of the derived allele at time t, given that it had k genes of the derived allele at time 0. p
( satisfies the following:
where p
is given in (1).It is convenient to change notation
and write p
( as f(x; p, t),
so that the above becomes
In this framework, it has been
shown to be extremely difficult to explicitly derive formulas for several
quantities of evolutionary interest. However, as the size of the population
approaches infinity (i.e., N → ∞), and assuming that the scaled selection
pressure (Ns) and scaled mutation
rate (N) remain constant, the
discrete Markov process given above can be closely approximated by a
continuous-time, continuous-space diffusion process (Figure 1(b)):
where f(x; p, t) is the probability distribution of x at time t, x is the daf at time t, p is the daf at time 0, and δx is the daf change in time δt.We can perform a Taylor series expansion on both sides in δt and δx to derive the forward Kolmogorov equation:
where
and a(x) and b(x) depend on
the genetic model (e.g., see eq (24).Equation (5) can be represented
diagrammatically as in Figure 2.
The probability of derived allele frequency
x + δx at time t + δt is the product of the probability of moving
from p to x in time t and the
probability of moving from x to x + δx in time δt,
summed over all possible values of x.
Figure 2
A diagrammatic intuition for (3)
illustrates that the probability of derived allele frequency x + δx at time t + δt is the product of the probability of
moving from p to x in time
t and the probability of moving from
x to x + δx in
time δt, summed over all possible values of
x.
The frequency trajectory of a
derived allele can also be depicted as in Figure 3, which illustrates that the
probability of frequency x at time t + δt is the product of the probability of moving
from p to p + δp in time δt and the probability of moving from p + δp to x in time t, summed over
all possible values of . This is
formalized as follows:
We can again perform a Taylor
series expansion on
both sides to derive the backward Kolmogorov equation:
The forward and backward Kolmogorov
equations have played a central role in theoretical population genetics since
1922. For details regarding their derivation, we refer the reader [15, Chapter 4].
Next, we will discuss how they are utilized to derive formulas for various
quantities of evolutionary interest (yellow boxes in Figure 4).
Figure 3
A diagrammatic intuition for (5)
illustrates that the probability of frequency x
at time t + δt is the product of the probability of moving from
p to p + δp in time δt and the probability of moving from
p + δp to x in time
t, summed over all possible values of
δp.
Figure 4
The utility of the Kolmogorov
equations for studying evolutionary processes. Blue boxes
correspond to questions that clarify the assumptions of the
genetic model being used, the red boxes correspond to when the
Kolmogorov equations (KEs) are utilized, and yellow boxes
correspond to quantities of evolutionary interest.
In a model where there is two-way
recurrent mutation (i.e., there are no absorbing states, either extinction or
fixation), stationarity is achieved when the probability of change in the
derived allele frequency is no longer dependent on time t. We solve for the stationary distribution, f(x), in the following manner. First, we integrate through the
forward Kolmogorov equation with respect to x:
where F(x; p, t) is the
probability of the derived allele assuming a frequency between 0 and x at time t. Therefore, the derivative of F(x; p, t) with respect to t can be
interpreted as the probability flux (change in probability over time) of the
diffusion process. The stationary distribution, f(x), can be solved by setting the probability flux equal to zero.
4. Derivation of Formulas Describing Evolutionary Processes of Interest
Let us now focus on a genetic model
that assumes no recurrent mutation (i.e., two absorbing states, one at x = 0 and another at x = 1). As depicted by Figure 4, in such a model, it is possible to
determine the probability of extinction (x = 0),
the probability of fixation (x = 1),
and the mean time until absorption (either at x = 0 or x = 1) by using the
Kolmogorov backward equation (Figure 4). It is also possible to derive the mean
time until absorption conditioned on always eventually reaching only one of the
two states. Since this quantity is not directly
applicable to the PRF, we do not review its derivation here, but instead refer
the reader to [14].
4.1. Probability of Extinction
Using (11), we arrive at an equation
parallel to (9):
The probability that the derived allele
frequency, x, reaches 0 at or before time t follows from (11) and is given by
where p is the initial frequency of the derived allele and 0+ indicates 0 + ε, where ε is very small.Replacing F(0+; p, t) with P
0(p, t),
(12) can be written as
As t → ∞, P
0(p, t) can be interpreted as the probability that
extinction ever occurs (independent of time) and can be rewritten in the form P
0(p).
From (14), it is evident that P
0(p) satisfies the following equation:
Solving (15), we arrive at the
following:
where
and where a(z) and b(z) are defined
as in (6).
4.2. Probability of Fixation
The probability that the derived
allele frequency, x, reaches 1 at
time t follows from (11) and is given
by
where p is the initial frequency of the derived allele and 1− indicates 1 − ε, where ε is very small.In (12), F(x; p, t) can be replaced by 1 − F(x; p, t) without any loss of generality. Also, by replacing 1 − F(1−; p, t) with P
1(p, t),
(12) can be rewritten as
By letting t → ∞ and solving for P
1(p), we arrive at the following:
where ψ(y) has been defined in (17) and a(z) and b(z) have been defined in (6).The probability of fixation and the
probability of extinction must sum to 1. Using (16) and (20), we can verify
that this is indeed the case.Consider a genetic model that
assumes the presence of selection, but no recurrent mutation, where a(x) = sx(1 − x) and b(x) = x(1 − x)/2N.
Starting from (20), we can express the probability of fixation under this
genetic model in the following manner:
4.3. Mean Time Until Either Extinction or Fixation
We define ϕ(p, t) to be the density function of the time t at which absorption occurs. The probability that absorption occurs, at
either boundary x = 0 or x = 1, by time t, is
Furthermore, since absorption must
happen by t = ∞,
we know that
Performing integration by parts, we
get the following:
Equations (14), (19), and (22) show
that ϕ(p, t) satisfies the following equation:
Using (25) and the fact that ϕ(p, t) approaches 0 faster than t approaches ∞,
we can rewrite ((24) as
After interchanging the order of
integration and differentiation we get
where
and t(p, x)dx is the mean time that the daf spends in the interval (x, x + δx) before absorption occurs.We are interested in the case,
where p = 1/2N, since this is the
initial frequency of the derived allele. In this case, we are interested only
in values of x greater than 1/2N, and for these values we can write
and ψ(x) is defined in (17).Under the simplest genetic model
that assumes no selection and no recurrent mutation, we can set s = 0 in (17) and (21) and show that P
1(p) reduces to p and ψ(y) reduces to 1.
It follows from this that (29) can be reduced to
Under a genetic model where s ≠ 0,
using γ = 2Ns, (29) can be rewritten as
After integrating and simplifying
the terms, we obtain
Finally, substituting γ = 2Ns and p = 1/2N, and invoking the approximation e
− = (1 − a) for small values of a, t(p, x) reduces approximately to
where f(x)dx is a notation common in the literature to represent the
expected time for which the population frequency of a derived allele is in the
range (x, x + dx) before eventual
absorption.
5. Poisson Random Field Theory
S. Sawyer and D. Hartl expanded the
modeling of site evolution to multiple sites. Their model makes the following
assumptions: (1) mutations arise at Poisson times, (2) each mutation occurs at
a new site (infinite sites, irreversible), and (3) each mutant follows an independent
WF process (no linkage). Sawyer and Hartl noticed from f(x) in (33), that
is the expected number of sites in
the population with derived allele frequency between x and x
2 (where θ equals 2Nμ, the per-locus mutation rate). The function g(x), for which the full expression is
given below, is also referred to in the literature as the limiting,
equilibrium, or expected density function for derived allele frequencies.
In a sample of size n, the expected number of sites with i (which ranges from 1 to n − 1) copies of the derived allele is defined as a function of g(x):
The intuition behind F(i) is the following. The expected
number of polymorphic sites with population daf x that have i copies of
the derived allele out of n samples
is given by the product of the expected number of sites with population daf x, g(x),
and the probability that each of those sites has i copies in the sample, which is given by the binomial calculation
in the right-hand side of (36). To determine the expected number of sites with any population daf that have i copies of the derived allele, this
product must be integrated over all possible values of x (resulting in F(i) above).Consider the sample data X = (X
1, X
2, X
3,…, X
), where X
is the observed number of sites with i copies of the derived allele out of n.
Sawyer and Hartl showed that the number of derived alleles in the entire
population at a particular frequency is a PRF with mean density given by (35) [2].
It follows, from the marking theorem on Poisson processes [16],
that each random variable X is
an independent Poisson distribution with mean equal to F(i) [2]. This framework
allows us to define the probability of observing x sites that have i copies of the derived allele (and n − i copies of the ancestral allele) as the following:
Since the X's are independent, the probability of observing X = (X
1, X
2, X
3,…, X
) is given as
The likelihood equation above
provides a convenient means of estimating the values of the parameters and . The use of the PRF theory leads directly to a likelihood-ratio
test of neutrality. Λ is defined as
the ratio of the likelihood value under the maximum likelihood estimate of to the likelihood value under the
neutral value of . It is a standard result
that 2lnΛ is asymptotically
chi-square distributed with one degree of freedom [17].Sawyer and Hartl further extended
the PRF model in order to calculate the ratio of expected number of
polymorphisms within species to expected number of fixed differences between
species.In 1991, McDonald and
Kreitman devised a 2-by-2 contingency table test of neutrality that was later
named the MK test [18]. In the
traditional MK test, a 2-by-2 contingency table is formed in order to compare
the number of nonsynonymous and synonymous sites that are polymorphic within a
species (RP and SP) and diverged between species (RF and SF) (Table 1). The
central assumption of the MK test is that only nonsynonymous sites may be under
selective pressure (i.e., synonymous sites are assumed to be neutrally
evolving). If nonsynonymous sites are evolving according to a neutral model,
then the expectation is that P
/P
= D
/D
. However, if nonsynonymous sites are under
negative selection, then the expectation is that P
/P
> D
/D
, and if under positive selection, then P
/P
< D
/D
. Sawyer
and Hartl derived the formulas for the expected values of SP, SF, RP, and RF
using their PRF theory [2]. Below are
the derivations of each of these formulas. For all of the derivations, assume
that the data consists of samples of size m and n from two different species.
Table 1
McDonald-Kreitman contingency
table. 2-by-2 contingency table introduced by [18] for the inference of natural selection
on nonsynonymous coding sites.
MK Table
No. of polymorphic sites
No. of fixed substitutions
Synonymous
SP
SF
Replacement (nonSynonymous)
RP
RF
5.1. Expected Number of Synonymous Polymorphic Sites
Under neutral evolution (s = 0), the expected number of polymorphic
sites with population daf x can be
computed by taking the product of the per-locus mutation rate (θ = 2Nμ)
and the probability under a neutral model of a single mutation having a
frequency of x (from (30)):
Now, consider species 1 with sample
size m. The probability that a
polymorphic site, with population daf equal to x, is detected as polymorphic in a sample of size m is given as
The expected number of synonymous
polymorphic sites, with population daf x,
in the species 1 sample is the product of the expected number of synonymous
polymorphic sites with daf x in the
population (g
neutral(x)) and the fraction of those that are
expected to be detected in a sample of size m(P
(x)). It follows then
that the total expected number of synonymous polymorphic sites, with any
population daf, in the species 1 sample is computed by integrating the product
of g
neutral(x) and P
(x) over the range of
possible values for x:
Finally, the total number of
expected synonymous polymorphic sites in both species' sample data is given as
5.2. Expected Number of Replacement Polymorphic Sites
The derivation of the expected
value of RP follows the same logic. As described in (35), the expected number
of polymorphic sites with population daf x given some average selection pressure γ is given by g(x). Similar to (41), the total expected number of replacement
polymorphic sites in the species 1 sample is computed by integrating the
product of g(x) and P from 0 to 1:
Finally, the total expected number
of replacement polymorphic sites in both species' sample data is given as
5.3. Expected Number of Synonymous Fixed Substitutions
When s = 0, the expected number of fixed substitutions in one species
relative to another that diverged t
div2N generations ago is given as the product of the number of total mutations and
the probability of fixation of each mutation. The number of total mutations is
the product of the mutation rate per generation and the number of generations since
divergence is
The probability of fixation is
given in (21). As s approaches 0
(i.e., neutral evolution), the probability of fixationcan be reduced to p using the approximation e
− = (1 − a) for small values of a. Thus, for a newly derived neutral allele that has an initial
frequency of 1/2N, the probability of
fixation is also 1/2N.Therefore, the total expected
number of fixed substitutions in species 1 is
However, given that the data are
samples of the populations from both species, not all sites identified as fixed
substitutions in the sample are truly fixed substitutions in the entire
population. The expected number of sites in the species 1 sample that fall into
this category is given by
where T
(x) = Pr(a derived allele daf x < 1 is observed with x = 1 in a size m sample) and g
neutral(x) is given in (39).Therefore, the total expected
number of synonymous fixed substitutions in both species' sample data is given
as
5.4. Expected Number of Replacement Fixed Substitutions
Similar to the calculation of (46),
given some selection pressure, , the
expected number of fixed substitutions in one species relative to another that
diverged t
div2N generations ago is given as the product of (45) and (21):
Substituting 1/2N for p and invoking the
approximation that e
− = (1 − a) for small values of a, we arrive at the following:
However, again, given that the data
are samples of the populations from both species, not all sites identified as
fixed substitutions in the sample are truly fixed substitutions in the entire
population. The expected number of sites in the species 1 sample that fall into
this category is given by
Therefore, the total expected
number of replacement fixed substitutions in both species' sample data is given
as
5.5. Estimating Parameters
It is possible to obtain estimates
of and by setting each of the observed values SP, RP, SF, and RF (Table 1) to their PRF expectations given by (42), (44), (48), and (52), respectively,
and solving for the parameters. It has been shown that these estimates are equivalent
to maximum-likelihood estimates [2, 19].
Bustamante et al. also eloquently describe and implement a hierarchical
Bayesian model for parameter estimation [9].
6. Concluding Remarks
Sawyer and Hartl's seminal
presentation of the PRF in 1992 provided an innovative mathematical framework
for estimating selection pressures and mutation rates, which are critical
parameters that influence molecular evolution. However, it is worth noting that
the model does harbor certain limitations. Foremost among these is the
assumption of site independence, which is equivalent to the assumption of free
recombination among mutations (i.e., no linkage). Thus, the model may not be
appropriate for many data wherein strong linkage is present. Another limitation
is the assumption of infinite sites (i.e., each mutation is at a new site). Although
this assumption allows for a simpler model, it is not always biologically
appropriate, especially for organisms that experience a higher mutation rate. Indeed,
recent work has shown that the assumption of infinite sites can underestimate
selection pressures and mutation rates and even infer positive selection, when
in fact there is weak negative selection [20].
Recent theoretical work has focused on relaxing these and other assumptions of
the original PRF model, so as to make it more appropriate for diverse biological
contexts. For a brief list of such studies, we refer the reader to [20]. Ongoing theoretical and empirical work in this area
will undoubtedly continue to extend the power of a PRF-based approach for
population genetic inference.
Authors: Carlos D Bustamante; Adi Fledel-Alon; Scott Williamson; Rasmus Nielsen; Melissa Todd Hubisz; Stephen Glanowski; David M Tanenbaum; Thomas J White; John J Sninsky; Ryan D Hernandez; Daniel Civello; Mark D Adams; Michele Cargill; Andrew G Clark Journal: Nature Date: 2005-10-20 Impact factor: 49.962