Cortland K Griswold1. 1. Department of Integrative Biology, University of Guelph, Guelph, ON, Canada.
Abstract
In polymerase chain reaction (PCR)-based DNA sequencing studies, there is the possibility that mutations at the binding sites of primers result in no primer binding and therefore no amplification. In this article, we call such mutations PCR dropouts and present a coalescent-based theory of the distribution of segregating PCR dropout mutations within a species. We show that dropout mutations typically occur along branch sections that are at or near the base of a coalescent tree, if at all. Given that a dropout mutation occurs along a branch section near the base of a tree, there is a good chance that it causes the alleles of a large fraction of a species to go unamplified, which distorts the tree shape. Expected coalescence times and distributions of pairwise sequence differences in the presence of PCR dropout mutations are derived under the assumptions of both neutrality and background selection. These expectations differ from when PCR dropout mutations are absent and may form the basis of inferential approaches to detect the presence of dropout mutations, as well as the development of unbiased estimators of statistics associated with population-level genetic variation.
In polymerase chain reaction (PCR)-based DNA sequencing studies, there is the possibility that mutations at the binding sites of primers result in no primer binding and therefore no amplification. In this article, we call such mutations PCR dropouts and present a coalescent-based theory of the distribution of segregating PCR dropout mutations within a species. We show that dropout mutations typically occur along branch sections that are at or near the base of a coalescent tree, if at all. Given that a dropout mutation occurs along a branch section near the base of a tree, there is a good chance that it causes the alleles of a large fraction of a species to go unamplified, which distorts the tree shape. Expected coalescence times and distributions of pairwise sequence differences in the presence of PCR dropout mutations are derived under the assumptions of both neutrality and background selection. These expectations differ from when PCR dropout mutations are absent and may form the basis of inferential approaches to detect the presence of dropout mutations, as well as the development of unbiased estimators of statistics associated with population-level genetic variation.
DNA sequencing is either primer based, whereby a specific genetic locus is amplified
by polymerase chain reaction (PCR) and sequenced[1] or by shotgun sequencing, whereby genomes are randomly broken into small
segments and sequenced directly with no amplification.[2,3] In primer-based sequencing, 2
primers that together flank a locus are used. Primers are typically about 18 to 24
nucleotides in length and bind to DNA. Each primer initiates replication on separate
strands, with complementary effects such that after several rounds of replication
and strand separation a specific locus is amplified.[4]Substitutions at the binding sites of primers can lower the probability a primer
binds and initiates replication. Not all substitutions at the binding site of a
primer are equal. For example, substitutions within the last 3 to 4 nucleotides of
the 3′ end can significantly reduce PCR replication.[5,6] Primer coverage corresponds to
the proportion of species or taxa in the sample that are, or are expected to be,
amplified and sequenced for a primer or set of primers.[7] Studies of primer coverage that compare primers to DNA sequence databases
indicate that coverage for a specific primer varies depending on taxon, ranging from
0% to nearly 100%,[7,8]
where the taxon level is at the domain or phylum. For example, in the bacterial
phylum, Nitrospirae coverage is about 97% for the commonly used 16S
rRNA primer 519F if a single mismatch is allowed within the last 4 nucleotides of a
primer binding site, but only about 32% if a single mismatch is not allowed.[7] For the bacterial phylum Lentisphaerae and the commonly used
16S rRNA primer 338F, coverage is 97% if a single mismatch is allowed within the
last 4 nucleotides of a primer binding site, but 0% if a single mismatch is not allowed.[7] High coverage may occur at the phylum level for a specific primer, but is
typically lower at the domain level due to broader sequence diversity.A lack of primer coverage is straightforward to detect in studies where it is sought
to sequence DNA from a single individual because amplification and sequencing will
fail, which is directly detectable. In contrast, it is becoming increasingly common
to pool, amplify, and sequence DNA from multiple individuals across one or more
species. For example, in primer-based metagenomic and eDNA studies, a sample is
taken from the environment that contains individuals and/or DNA from potentially
thousands of species or OTUs (operational taxonomic units) and multiple individuals
per species or OTU. In this approach, PCR amplification and sequencing is often
non-targeted and used to discover or detect the species, OTUs or higher taxonomic
groups present, as well as targeted and used to detect a specific species, OTU or
taxonomic group. In non-targeted applications, there is the possibility that a
taxonomic group or subset of individuals within a taxonomic group go undiscovered or
undetected because of a mismatch between primers and binding sites. Furthermore,
even with targeted sequencing, there is the possibility of low primer
coverage.[7,8]As context, many datasets deposited in the European Bioinformatics Institute
metagenomics database (https://www.ebi.ac.uk/metagenomics/)[9] involve primer/amplicon-based and non-targeted sequencing from nature.
Although there has been progress in developing reference databases for
primer/amplicon-based metagenomic studies to match sequences to species or higher
taxonomic groups in a targeted manner,[10] there is still a need for assessing whether a taxonomic group goes partially
or fully unsequenced in non-targeted sequencing studies, as well as assess the
degree of coverage of targeted taxa.A challenge for primer-based metagenomic studies is assessing coverage for a sample
from nature. A rarefaction curve may be used to assess coverage by plotting the
number of taxa delineated versus the number of sequences recorded.[11] If this curve is asymptotic, then nearly all taxa have been delineated.
Nevertheless, this only applies to taxa that are amplified by the primer or primer
set. Even if primer coverage was less than 100%, an asymptotic rarefaction curve can
occur with more sequences because eventually all amplified species can eventually be
sequenced and delineated to a taxon or as coming from an unknown taxon. Generally,
it would be helpful for a metagenomic study to have available approaches to assess
primer coverage for their specific sample from nature. In this article, we present
theory to help the development of approaches to assess coverage.In particular, we use coalescent theory to derive expectations for the pattern of
nucleotide variation within a species when a mutation or a set of mutations at
primer site(s) block the chain of events during a primer-based sequencing study,
such that a set of sequences are not recorded as being present. This mutation or a
set of mutations could either completely block a primer from binding to its binding
site or reduce binding, such that amplification is so low that the locus is not
sequenced for a set of individuals descendent from the mutation or a set of
mutations. We call such mutations “PCR dropouts.”A PCR dropout mutation or a set of PCR dropout mutations may occur ancestrally to all
individuals within a species or later (forward in time), such that only a subset of
individuals from a species have one or more of the dropouts. Dropout mutations that
occur ancestrally to all individuals and are sufficient to block amplification of
all individuals within the species are not the focus of this study.
Instead, we focus on dropouts that are segregating within a population, such that
these dropouts occur later (forward in time) than the most recent common ancestor
(MRCA) to the species. Using coalescent theory,[12,13] we show that segregating
dropout mutations give rise to distinct patterns of DNA sequence variation. This
pattern may be used to assess whether a species (or OTU) is prone to PCR dropouts
and therefore reduced coverage. Furthermore, if several species (or OTUs) within a
taxon have a signal of dropouts, this may indicate a larger problem of coverage for
the entire taxon.Although this article focuses on dropouts within a species, similar principles apply
at higher levels of biological organization. We choose to first focus on
within-species variation because DNA sequence evolution is well characterized by the
coalescent process and this process is universal among species and taxa. In
contrast, at higher levels, there is no standard theoretical framework of DNA
sequence evolution, although Hey’s[14] approach may be promising in the context PCR dropouts and metagenomics (see
section “Discussion”).Expectations for DNA sequence variation at the population level and in the presence
of segregating dropout mutations will also be of potential use in other contexts
besides assessing primer coverage. For example, there has been the development of
estimators of nucleotide diversity that account for the higher rate of sequencing
errors associated with next-generation sequencing technology.[15,16] Yet, methods
are not available that account for segregating PCR dropouts. Furthermore, methods
have been developed to use population-level variation to infer the phylogenetic
structure of ecological communities using metagenomic data.[17] Accordingly, it would be helpful to consider the effect of PCR dropouts on
population-level variation and whether they may affect inferences of phylogenetic
structure.The organization of this article is as follows: First, we present theory assessing
the probability that dropout mutations occur along branch sections that are deep in
the coalescent tree of a species or along branch section that occur more recently,
as well as whether we expect a species to segregate a single dropout mutation or
multiple dropout mutations. Generally, we find that dropout mutations typically
occur along branch sections that are deep in the coalescent tree of a species.
Furthermore, given rates of mutation at primer sites, a single dropout mutation
typically segregates in a sample. Focusing on dropouts that completely block
amplification, we then derive expectations for the distribution of coalescence times
and pairwise differences of sequenced alleles at a locus. This analysis indicates
that the distribution of coalescence times and pairwise differences is affected by
the presence of a dropout and may therefore give rise to a detectable signal of a
dropout. The theory accounts for the process of background selection,[18,19] which is
occurring at loci used in primer-based metagenomic studies, such as 16S/18S RNA.[1]
Theory
It is assumed that a sample is taken from an environment, be it water, soil, feces,
etc, and DNA is purified from the sample and subject to PCR and sequencing. Although
it may be that within a sample there are individuals across the tree of life, only a
single genetic locus is amplified by PCR. For simplicity, we assume that the PCR
primers target bacteria and archaea such that species are haploid and effective
population sizes are potentially large. Although this article focuses on haploid
microbes, the theory applies directly to diploid microbes with a change in timescale
(see below). For a given species in the sample, there is a sample size of
n. This sample size is not known. In the theory presented, we
nevertheless assume a sample size of n and then develop
expectations for how many individuals within the sample are expected to be
unamplified given a sample of size n. The theory demonstrates that
n is a nuisance parameter and detectable effects of dropout
mutations can be inferred without the knowledge of n.For PCR sequencing of a species or OTU to occur, DNA primers need to amplify the DNA
of a species or OTU to a sufficient level that allows for sequencing. Here we define
to be the rate of mutation that leads to a PCR dropout. For
simplicity, we assume that a mutation either causes a dropout or does not, ie, the
effects of mutations are discrete. An alternative model is quantitative, such that a
mutation may decrease amplification by a certain fraction that is between, but not
equal to, 0 and 1. We assume that the nucleotide sites that affect PCR amplification
do not affect the fitness of an organism and consequently evolve neutrally.
Amplified loci in metagenomic and environmental DNA studies are typically under
purifying selection, such that we expect the amplified PCR locus to be affected by
background selection.[18] Initially, we assume neutrality and no background selection, but later
consider the consequences of background selection. Furthermore, when developing
expectations for polymorphism at an amplified locus, we focus on synonymous sites
and/or assume neutrality and no background selection, but later consider the
consequences of background selection.
Neutrality and no background selection at the amplified locus
A natural theoretical framework to model dropouts assuming neutrality is Kingman’s[12] coalescent (see also Tajima[13]). In the Kingman coalescent, the genealogical history of a sample of size
n is divided into sections in which coalescent events define the sections,
whereby descending from the top to the bottom of the coalescent tree, the first
section consists of n lineages, the second lineages, etc, with the last section consisting of 2 lineages.
Following Fu,[20] we call the section of the tree consisting of k branches
“level k” (Figure 1). Mutations causing a dropout can occur along branches
within each of these sections. A dropout mutation along a branch section is
“unique” if no other dropout mutation occurs along its corresponding ancestral
lineage from the point of mutation back to the MRCA of the sample (see Figure 2, section
“Probability of the dropouts at level k are unique along their
corresponding lines of descent”). Unique dropout mutations are of interest
because all of the descendent lineages of the dropout will not be amplified by
PCR, including lineages that have incurred subsequent dropout mutations (hence
why we focus on unique dropouts). Next, we derive expectations for the
distribution of the number of dropouts when there are k
lineages remaining from a sample of size n. We then derive the
conditional distribution of the number of dropouts that are unique at level
k given a certain number of dropout mutations. For the set
of unique dropouts, we derive the distribution of the number of lineages
descendent from each dropout. With information about the number of descendant
lineages, we then derive the conditional distribution of pairwise differences in
the sample. This distribution is conditioned on the number of unique dropouts
and numbers of descendent lineages from these dropouts, noting that by
determining what individuals are not amplified by PCR we can then focus
attention on the properties of individuals that are amplified by PCR.
Figure 1.
The use of the term “level.” A level corresponds to the number of branch
sections in the coalescent tree. Branch sections are not drawn to scale.
Lower levels in the tree have longer branch sections according to
coalescent theory.[12]
Figure 2.
The use of the term “unique.” In the coalescent tree, 2 branch sections
at level 4 have a dropout mutation, such that and 1 branch section at level 2 has a dropout
mutation, such that . Although there are 2 dropout mutations at level
, only 1 is unique , such that there are no other dropout mutations along
its line of descent to the most recent common ancestor (MRCA) of the
sample. At level , there is 1 unique dropout because no other dropout mutation occurs along its
line of descent to the MRCA.
The use of the term “level.” A level corresponds to the number of branch
sections in the coalescent tree. Branch sections are not drawn to scale.
Lower levels in the tree have longer branch sections according to
coalescent theory.[12]The use of the term “unique.” In the coalescent tree, 2 branch sections
at level 4 have a dropout mutation, such that and 1 branch section at level 2 has a dropout
mutation, such that . Although there are 2 dropout mutations at level
, only 1 is unique , such that there are no other dropout mutations along
its line of descent to the most recent common ancestor (MRCA) of the
sample. At level , there is 1 unique dropout because no other dropout mutation occurs along its
line of descent to the MRCA.
Probability of branches with dropouts at level k,
Define to be a random variable giving the length of branch
sections at level k in the coalescent tree. In continuous
time, the distribution of is approximately exponential[12] such thatwhere to simplify notation we let . The approximation is in the diffusion limit as the
effective size of a population tends to infinity. In the Wright-Fisher model, 1 unit of
time equals generations and in the Moran model 1 unit of time equals
generations.[12] For simplicity, we use 1 effective population size in our analysis,
keeping in mind that the effective size of a Moran population is one half
that of a Wright-Fisher population. Furthermore, note that for diploid
microbes the baseline timescale is generations.The expected number of dropout mutations per branch section at level
k is , where the mutation rate is measured per generation and is measured on a generation timescale. The distribution of
the number of mutations along a branch section is Poisson distributed and
the probability of at least 1 dropout along a branch section at level
k isnoting that is the probability of no dropout mutations. Define
to be a random variable for the number of branch sections
at level k with at least 1 dropout, then the probability
iswhere , is the gamma function, and, to simplify notation, we let
. A straightforward calculation from is the probability , which isand corresponds to the probability of at least 1 mutation prior to a
coalescent event at level k (cf. Wakeley[21]).
Probability of the dropouts at level k are unique along their corresponding
lines of descent
Define to be a random variable for the number of lineages at
level k that have a dropout mutation at level
k and no prior dropout mutations from level
k to the MRCA of the sample, such that is the number of unique dropout mutations at level
k and . Figure
2 illustrates the notation, whereby at level in the tree, there are 2 dropout mutations , but a second dropout mutation occurs along the line of
descent to the MRCA for one of the dropouts, such that ; at level , and .To calculate the distribution for , we note that moving down the coalescent tree from level
k to the MRCA of the sample, we can define a set of
numbers for i from 1 to which consists of the number of lineages with
i descendants that have a unique dropout mutation at
level k. Combining the set with the number , which is the number of lineages out of the
k lineages remaining in the sample, forms a state space
that characterizes the genealogical and dropout mutation
process from level k to the MRCA of the sample. At the
MRCA, and either all of the equal zero, such that there are no unique dropout
mutations at level k, or one and only one of the
equals 1, such that if and then .Backward in time, 2 events can occur that are relevant to determining the
distribution of . One event is a coalescence and the other event is a
dropout mutation along a descendent lineage. If a coalescent event occurs,
it could be between the 2 lineages with a dropout mutation, between a
lineage with a dropout mutation and without a dropout mutation or between 2
lineages without a dropout mutation. For example, 2 lineages within the set
with i descendants may coalesce resulting in the transition
and this occurs at rate on an appropriate timescale. Or, 2 lineages, one with
i descendants and the other with j
descendants, may coalesce resulting in the transitionfor at rate . A coalescent event could occur between lineages that are
not descendants of dropout mutations or between a lineage with a descendent
dropout mutation and one without a descendent dropout, such that
at rate . Last, if a dropout mutation occurs along a descendent
lineage below level k, then a dropout mutation at level
k is no longer unique. For each element in , this event occurs at rate .Generally, for , the following transitions are possible and their
corresponding rate of transition, where for efficiency we only show elements
in the state space set that changeFor the set , there will be a corresponding state space from level k to the MRCA. Define
to be a random variable for the state of the system at
time t into the past starting at some point in level
k. is a finite-state and continuous-time Markov process with
multiple absorbing states when . We can define an matrix to be the corresponding finite-state and continuous-time
Markovian transition rate matrix, where is the size of the set . If the current state of the system is i,
then the probability of being in state j at time
t generations into the past satisfies the differential equationwhere is the full matrix consisting of elements and is defined by the transition rates in equation
(3). The solution to the differential equation isFor our purposes, we are interested in the probability at time
t that the system is in a state in which
starting from the initial state and furthermore whether one of the equals 1. Label the initial state as and states in which with an integer j corresponding to the
number of descendent lineages with a unique dropout mutation at level
k. For example, for and , the possibilities are , , and .Define to be the probability at time t given , thenFurthermore, in the limit as , thenAs has a finite dimension and satisfies properties of
instantaneous transition rates, the limit is finite and defined.[21]For the case , it is straightforward to directly write down
for , which iswhere and , as well as recognizing that for the dropout to be unique
no subsequent dropout mutations can occur along its lineage back to the MRCA
of the sample.
Probability of the set of descendants given unique dropouts at level k
Define to be a set consisting of elements, in which an element is an integer greater than 0
equaling the number of descendants of one of the lineages, respectively, and noting that for
the set is . Enumeration of is a direct application of the Pólya urn scheme and nicely
applied in a similar population genetic context by Fu.[19] For a sample of size n and level k
in a coalescent tree, there areways the n individuals can be assigned to k
ancestors. Of the k ancestors, we are interested in the
number of descendants of each of the unique dropout mutations. The number of ways of drawing a
set from n individuals under the constraint
that at level k issuch that the probability of the set given is
Probability of the set of descendants of unique dropouts at level k
From the law of conditional probability, the probability of the set of
descendants of unique dropouts at level
k isFor sufficiently small , it is most likely that at a level there is a single
dropout, given a dropout occurs, and we can focus our attention on the
caseFor the purposes of metagenomics or eDNA studies, we may be interested in the
probability for level k, which is the probability of the loss of
d or more individuals in the sample due to a unique
dropout at level k, and is calculated as
Distribution of pairwise differences with dropouts
The presence of dropouts may affect the distribution of pairwise differences
for a given species. For sufficiently small , unique dropouts tend to occur along branch sections near
the MRCA of a sample (see section “Analysis and Results”). We therefore
focus on the derivation of expectations for the distribution of pairwise
difference when and , noting that, for unique dropouts at higher levels, the
calculation of pairwise differences follows similar principles.
Single unique dropout at level 2 (Y2 = 1)
A dropout generates a subtree of sequenced individuals from a sample of
n individuals in total. A subtree has levels just
like the coalescent tree for an entire sample. To calculate the
distribution of pairwise differences between 2 random and sequenced
individuals involves determining (1) whether the 2 sequenced individuals
coalesce at level j in the subtree, (2) whether the
entire sample is at level k given the subtree is at
level j, and (3) the time to coalesce given the level
in the entire sample.For a subtree of size , the probability that 2 randomly chosen individuals at
the tips of a subtree coalesce at level
j for isWiuf and Donnelly[22] derived the probability that, given a subtree first enters level
j, the entire sample is at level
k, such thatand where here and .Given k, the distribution of coalescence times
isand can be calculated directed from the sum of exponentially distributed
random variables.[21,24] Note further that
i is indexed initially with because the transition in was from . If Z is a random variable for the
number of pairwise differences between the 2 sequences, then its
distribution isassuming that mutations are Poisson distributed along branch sections and
the rate of mutation causing pairwise differences is . Over the distributions for j and
k, the probability of z pairwise
differences is
Single unique dropout at level 3 (Y3=1)
Here we assume that the dropout mutation occurred such that the 2 clades
of sequenced individuals coalesce when for the entire sample (Figure 3). We label these 2
clades A and B, such that the number
of sequenced individuals in clade A is
a and the number in clade B is
b. Two lineages of sequenced individuals can both
occur in clade A with probability , both in clade B with probability
, or 1 in clade A and 1 in clade
B.
Figure 3.
A coalescent tree when . Triangles indicate clades of individuals in
the sample. Clades A and B are sequenced, whereas the gray clade
descends from a dropout mutation.
A coalescent tree when . Triangles indicate clades of individuals in
the sample. Clades A and B are sequenced, whereas the gray clade
descends from a dropout mutation.Let be the probability of z pairwise
differences given that 2 sequences are in clade A. As
we are finding the marginal probability of z
conditioned on a single clade, the result from Wiuf and Donnelly[22] given above applies, such thatwhich differs from because for clade A the minimum value
of k for which is , noting that a is substituted for
. For clade B, the 2 lineages can
coalesce up to also such that is equal to , except for b substituted for
a.When one sequenced individual is in clade A and the
other in clade B, the time to coalesce is distributed as
, such thatand together for is
The effect of dropouts on coalescence times of sequenced
individuals
The previous section demonstrated that dropouts affect the distribution of
pairwise differences relative to when there are no dropouts. The difference
in the distribution of pairwise difference comes about because of
differences in coalescence times, which in turn reflects the mapping in the function. Here we further explore the effect of the
mapping on coalescence times to see if there is additional
information that may be used to detect dropouts.For , there is a subtree of sequenced individuals that can be divided into
j levels in a similar manner as for the entire
sample n that is divided into k
levels. The mapping is itself a Markov process from to , with the probability of a mapping depending on the k level in
the previous mapping. We leave the study of this Markov process for
later and here investigate the unconditional expectation for the
mapping, which is defined as the expected value of
k at level j in the subtree
orThe expected time between the and levels in the subtree iswhere is the largest integer less than or equal to
. In contrast, the expected time between the
jth and levels without drops isfrom the standard coalescent.
Single unique dropout at level 3 (Y3 = 1)
As before, we assume that the sequenced individuals are in subclades
A and B, as in Figure 3. The
numbers of sequenced individuals in subclades A and
B are a and b,
respectively. We could use a similar approach as in section
“Distribution of pairwise differences with dropouts” and condition on
each subclade. This conditioning is justified when considering 2
sequenced lineages because either the 2 sequenced lineages occur within
the same subclade with mapping according to , or in separate subclades and by definition coalesce
at . In contrast, when seeking to derive expectations for
the properties of coalescence times for the full set of sequenced individuals, we need to account for the
joint set of coalescence times in both subtrees. The Wiuf and Donnelly[23] theory does not include this case.A simple argument indicates that the coalescence times are expected to be
distorted with a dropout mutation versus no dropouts. For example,
consider a period in the history of the sample in which there are
a lineages in clade A and
b lineages in clade B, and
correspondingly unsequenced lineages. In the absence of recognizing
that a dropout mutation has generated a structured coalescent process,
the observed sample size is , as opposed to n. Accordingly, the
expected coalescence rate, given the observed sample, is on a timescale of generations. Yet, the dropout mutation causes the
coalescence process to be structured, such that the actual coalescence
rate among sequenced individuals isfor in the coalescent tree. Together, the coalescence rate
is reduced by a factor ofcompared with what is expected for an observed sample size of
.
Neutrality at the primer sites and background selection at the amplified
locus
If deleterious mutations occur at rate U and have multiplicative
effects on fitness, each with effect s, then the expected
frequency of individuals with j deleterious mutations
is Poisson distributed with mean .[25] For a sample of size n, the expected number of sampled
individuals in the jth mutational class is , where the jth mutational class has
j deleterious mutations. Following the structured
coalescent modeling framework of Nordborg,[26] the class structure of deleterious mutations can be represented
abstractly as a set , where is the number of individuals in the jth class
and is the maximum value of j for which
. Assuming no back mutation and a continuous-time model,
transitions corresponding to deleterious mutations, such that and , occur at rate U and transitions
corresponding to coalescent events, such that , occur at rate . For sufficiently large , mutational transitions occur on a fast timescale, whereas
coalescent transitions occur on a slow timescale, such that the sample is
expected to quickly transition to the 0-class. Once in the 0-class, the sample
behaves in a neutral manner with coalescence rate . Accordingly, background selection is expected to shorten
coalescence times proportionally along a genealogical tree.[18]In principle, we could just scale time as and measure dropout properties leading to an expectation for
pairwise differences, but this would be somewhat inaccurate because it would
miss coalescent events that occur during the transition from the initial sample
to the 0-class. Although these events are rare, they lead to a lack of
differences between sequences, allowing for average pairwise difference to be
less than , where S is the number of segregating sites
and .[27]For a genotype in the jth class, the expected time to reach the
0th class is units of time. During this time, lineages starting in the
jth class may coalesce with other lineages, such that the
probability that of the sequences in the jth mutational class
coalesce at or before reaching the 0-class is approximatelywhere is an approximation because it assumes that only sequences
starting within the same mutational class can coalesce. This assumption is
relaxed later when the distribution of pairwise differences is derived. Allowing
for sequences starting within different mutational classes to coalesce during
the fast period from the initial sample to the 0-class could be modeled as a
nested Markovian process. Nevertheless, is an accurate approximation given that coalescent events are
rare relative to mutation. The expected number of sequences that coalesce during
the period of collapse to the 0th mutational class iswhere is the probability of the jth mutational
class. With , .Once we have , we can use the theory for the neutral case to derive
expectations for the number of descendants of dropouts (, , etc), as well as the distribution of pairwise differences.
Expectations for the probability of a dropout along a branch section are a
function of with background selection compared with without background selection.
Conditional distribution of pairwise differences with dropouts
Single unique dropout at level 2
With background selection, the corresponding expressions for
, , and are the same, except for the replacement of
n with . We need to consider that a pair of sequences coalesce
during the fast process in which sequences collapse to the 0th
mutational class. For a sample of size n, the
probability that 2 random sequences are in deleterious mutation classes
and is for and for .For 2 sequences in the same class , the average time to reach the 0th class is
units of time and the probability that they coalesce
before reaching the 0-class isFor 2 sequences in different classes ( and ), the probability that they come together in the
mutational class isand once in the class they can coalesce, such that the overall
probability they coalesce isTogether, the probability of no pairwise difference due to a coalescent event during the fast process of
collapse to the 0th mutational class isAfter entering the 0-class, pairwise differences are expected to behave
as with n being replaced by
and by such that becomes , and with background selection the distribution of
pairwise differences isFollowing similar logic as for and assuming the topology in Figure 3, with background
selection becomeswhere again is with n substituted for
and for . Likewise, becomesand becomesnoting that the 2 sequences cannot coalesce during the fast time period
when sampled in clades A and B.Once the sample reaches the 0th class mutationally, we expect the effects of
dropouts on coalescence times to be the same as the neutral case, except for
being replaced by . In the neutral case, a signal of the effect of dropouts
on coalesce times comes about by taking the ratio of adjacent coalescence
times, which is independent of . We therefore expect a similar effect of dropouts on
adjacent coalescence times when is replaced by in the presence of background selection.
Analysis and Results
Expected values for
In the context of dropout mutations, is not expected to be large. As presented in the introduction,
primer sites are typically around 20 nucleotides, with the highest probability
that a mutation leads to a dropout occurring within 3 to 4 nucleotides of the 3′ end.[7] Accordingly, regarding mutations with a high probability of directly
leading to a dropout, the mutation rate is near that of the baseline rate for a
single nucleotide and at most about 10 times the baseline rate per nucleotide
site. Estimates of per-site mutation rates in bacteria are typically
.[28] Nevertheless, a species or OTU may be predisposed to a dropout mutation
due to prior mutational history within primer sites, leading to a higher dropout
rate of mutation. is also a function of . Less is known about for microbes. Smith[29] estimated an effective size of or less in Neisseria meningitis, based on
electrophoretic data and noting that this is many orders of magnitude less than
our understanding of census population size. By comparison in the eukaryotic
microbial species Saccharomyces paradoxus, an estimate of its
effective population size is of the order .[29] Focusing on effective population sizes in the range of and mutation rates in the range of to capture uncertainty leads to in the range of approximately . In the following section, we focus the analysis on
values in the middle of this range, namely, of the order
.
Under neutrality
Figure 4A plots
and indicates that the probability of a lineage with 1
dropout along branch sections that are at lower levels in the coalescent
tree is not too uncommon in the context of studies involving thousands of
species or OTUs and assuming 2 values of . Two lineages with dropouts are rarer, but still
appreciable at lower levels of the coalescent tree, provided that
is large enough. Three or more lineages with dropout
mutations are fairly rare in the context of studies involving thousands of
species or OTUs. Although higher levels in a coalescent tree have more
branch sections, these sections are short in length, such that the total
probability of a dropout mutation along levels 500 to 1000 for a sample of
size is less than that for level by itself and levels 4 through 50 together (Figure 4B). The
values of the order lead to the values between 0.010 and 0.091 at level , 0.005 and 0.047 at level , and 0.003 and 0.032 at level . In total, for , equals 0.007 and 0.070 for and , respectively.
Figure 4.
(A) The probability of the number of lineages with a dropout mutation
at different levels in a coalescent tree. Points
in the plot are represented by numbers in which the number
corresponds to the k level in the coalescent tree
of a sample. Black numbers correspond to and gray numbers to . (B) The sum of across levels for . Points in the plot indicate the levels that are
summed. For example, sums for to .
(A) The probability of the number of lineages with a dropout mutation
at different levels in a coalescent tree. Points
in the plot are represented by numbers in which the number
corresponds to the k level in the coalescent tree
of a sample. Black numbers correspond to and gray numbers to . (B) The sum of across levels for . Points in the plot indicate the levels that are
summed. For example, sums for to .
Probability of the dropouts at level k are unique along their corresponding
lines of descent, and
Expressions for and (see section “Probability of the dropouts at level k are unique along
their corresponding lines of descent”) work for any , but given a low dropout mutation rate it is likely that
only 1 dropout occurs at level k, if at all. Multiplying
by gives the joint probability of and in the limit as . Figure
5 indicates that lower levels of a coalescent tree are expected
to have a unique dropout mutation with the highest probability, and that for
the probability that at least 1 level has a unique
mutation is still less than level .
Figure 5.
The combined probability of 1 dropout mutation at level
k and it being unique . For levels greater than or equal to 500, the
point is the sum of at least 1 dropout at each level up to
and it being unique .
The combined probability of 1 dropout mutation at level
k and it being unique . For levels greater than or equal to 500, the
point is the sum of at least 1 dropout at each level up to
and it being unique .Although single dropouts occur with the highest probability, to illustrate
the utility of and , we consider the case when a dropout occurs at level
and assume 2 dropouts at that level . For this case, the state space corresponding to
(see section “Probability of the dropouts at level k are unique along
their corresponding lines of descent”) is the set . The corresponding matrix isand whereas is a somewhat complicated function of and t, is more easily written down, such thatand, numerically, forindicating that there is a high probability given 2 dropout mutations at
level that they are both unique.
Probability of d or more descendent lineages of a unique dropout mutation
at level k,
There is a fairly good chance of a unique dropout at level 2 and that it
leads to the loss of at least 100 individuals (~8%; Figure 6) for and . There is a smaller, but appreciable, chance under the
same conditions at level 4 (~2%). Figure 6 assumes a single dropout
mutation at a given level. To illustrate calculations for the case of 2
dropout mutations at level
Figure 6.
The probability of a dropout out at level k leading
to the loss of d or more individuals in a sample
assuming and . The bottom curve corresponds to .
As is more likely than when , the probability of d or more dropout
individuals is greater for versus (Figure
7). For , it is rare (~0.004%) to simultaneously have 2 unique
dropout mutations at level 4 and the loss of at least 100 individuals in a
sample of size 1000. Here, the rarity primarily comes about from the rarity
of the occurrence of 2 dropout mutations at level 4 (cf. Figure 4A).
Figure 7.
The probability of 2 dropout mutations at level leading to the loss of d or more
individuals in a sample assuming and .
The probability of a dropout out at level k leading
to the loss of d or more individuals in a sample
assuming and . The bottom curve corresponds to .The probability of 2 dropout mutations at level leading to the loss of d or more
individuals in a sample assuming and .
Distribution of pairwise differences with dropouts and ratio of
coalescent times between adjacent nodes
The distribution of pairwise differences is distorted when there is a single
dropout at level 2 (Figure
8A). When the number of sequenced individuals is small relative
to the entire sample, the distribution of pairwise difference is shifted
toward smaller values because sequenced individuals must coalesce before the
dropout, which shortens coalesce times. For a moderate to large number of
sequenced individuals relative to the entire sample, the distribution can
have a mode, unlike the case without dropouts.
Figure 8.
The distribution of pairwise differences with (A) 1 dropout at level
2 leading to a single clade of sequenced individuals and (B) 1
dropout mutation at level 3, leading to 2 clades of sequenced
individuals with the topology shown in Figure 3. The gray curve
shows the distribution in the absence of dropouts. The parameters
are and for both plots. The number of sequenced
individuals is shown in the plots. Note that the shape of the
distribution is qualitatively independent of sample size, such that
we expect a similar pattern for and the corresponding increases in
a and b.
The distribution of pairwise differences with (A) 1 dropout at level
2 leading to a single clade of sequenced individuals and (B) 1
dropout mutation at level 3, leading to 2 clades of sequenced
individuals with the topology shown in Figure 3. The gray curve
shows the distribution in the absence of dropouts. The parameters
are and for both plots. The number of sequenced
individuals is shown in the plots. Note that the shape of the
distribution is qualitatively independent of sample size, such that
we expect a similar pattern for and the corresponding increases in
a and b.With a dropout leading to 2 clades of sequenced individuals, such that
(ie, Figure 3), the distribution of pairwise differences is distorted
further relative to the case, such that coalescences restricted to be within
A or B result in relatively few
pairwise differences and coalescences restricted to be between clades
A and B result in relatively large
pairwise differences (Figure 8B).The ratio of coalescent times between adjacent nodes indicates that it is
lower at levels near the base of the tree in the presence of a dropout
(Figure 9).
Figure 9.
The ratio of coalescence times between adjacent nodes in the absence
of a dropout mutation (gray) and in the presence of a dropout at
level (black). The parameters are and . Note that the ratios are independent of
.
The ratio of coalescence times between adjacent nodes in the absence
of a dropout mutation (gray) and in the presence of a dropout at
level (black). The parameters are and . Note that the ratios are independent of
.
With background selection
Expected values for
Background selection is expected to decrease the effective value of
by a factor of , which is the expected proportion of sequences free of
deleterious mutation as a result of background selection.[17] Furthermore, Charlesworth et al[18] showed that the average pairwise difference is accurately
approximated by a neutral model with multiplied by a factor of , where is of the order to for strong and moderate background selection.
Properties of samples with dropout mutations and background
selection
Let us first revisit in the context of background selection for a species with
effective population size . With background selection, such that and (following Charlesworth et al[18]), and . For mutation rates in the range of , the corresponding values of encompass the range of values explored in Figure 4, which were
and . As an additional point of comparison, for a smaller value
of , , which indicates that segregating dropouts would be rare
for metagenomic studies with about 1000 species under these conditions.
Nevertheless, for smaller values of U and/or
s, is expected to increase.Next, consider for , , , and . The expected number of sequences that coalesce is
prior to collapse of the sample to the 0th mutational
class, such that we expect . As a point of comparison when is smaller , and n, U, and
s remain the same, the number of sequences that
coalesce is . Overall, this suggests that with background selection and large effective population
sizes.The analysis involving suggests that, for large relative to , the probability of no pairwise difference due to a
coalescent event during the fast process of collapse to the 0th mutational
class is small because coalescence events are rare relative to
the mutational collapse process to the 0th class. Numerically, for the
conditions and and for a large sample size, for and for .As for large , , we expect only a small difference between the
distribution of pairwise difference with background selection
versus without background selection , as well as between , , and and , , and , respectively. Therefore, Figure 8 presents an accurate
approximation of the distribution of pairwise differences under background
selection for .
Discussion
Bacterial and other microbial species are associated with populations of large
effective size. This in combination with metagenomic and/or eDNA studies of
microbial communities that include thousands of species or OTUs[9] brings about the possibility that one or more of the sampled species or OTUs
may be segregating mutations that cause dropouts at PCR primer sites. This article
quantified the probability of a segregating dropout mutation at a primer site, as
well as the effects of dropouts on the distribution of pairwise differences and
coalescence times within a species or OTU.The analysis indicates that segregating PCR dropout mutations are reasonably common
provided that the effective population size of a species is large and/or the dropout
mutation rate is reasonably high. Furthermore, if dropout mutations occur, they are
expected to occur along basal branch sections in the coalescent tree of a sample.
That dropouts are expected to occur along the basal branch section brings about the
possibility that a large fraction of sampled individuals go unsequenced due to a
dropout mutation within a species. This fraction of individuals that go unsequenced
leaves behind a signal in the pattern of pairwise differences, as well as the
coalescence times among sequenced individuals. In particular, there is a greater
tendency for the distribution of pairwise differences to have a non-zero mode in its
distribution, with a mode being particularly pronounced when a dropout mutation
gives rise to 2 subclades of sequenced individuals as depicted in Figure 3. Furthermore, a
dropout mutation is expected to distort coalescence times near the base of a
coalescent tree, such that coalescence times are more equal in value relative to
when dropouts are absent. The distortion of coalescence times is independent of
n (the true sample size) in the context of equation
(21). In other contexts, such as pairwise differences, the effect of
dropouts is qualitatively the same across n. Here, inference
approaches could integrate across possible values of n to assess
whether a distribution is consistent with the presence of one or more dropout
mutations.When a dropout mutation gives rise to 2 clades of sequenced individuals as in Figure 3, the coalescent
process for these 2 clades is structured such that individuals between clades cannot
coalesce until after the dropout mutation (backwards in time). Qualitatively, the
distribution of pairwise differences is similar to that when there is population structure,[25] in which the distribution is a combination of small pairwise differences
between individuals within the same clade and large pairwise differences between
individuals in different clades. When a single dropout mutation occurs at level 2 in
a coalescent tree, its effect on the distribution of pairwise differences is less
pronounced than a level 3 dropout mutation that gives rise to 2 clades, such that
there is a tendency for the distribution to have a non-zero mode.A future direction for research is to develop statistical techniques to assess
whether a metagenomic or eDNA sample has segregating dropout mutations. Assessing
whether a dropout mutation occurs in a single species will be challenging because
the pattern that is generated is similar to population structure[26] or changes in population size.[31] Furthermore, the coalescence process in a panmictic population of constant
size gives rise to a highly diverse set of distributions of pairwise differences and
coalesce times for a given sample size. It may be more promising to first develop
methods that seek to assess whether there is a signal of dropouts at the population
level across a clade of species or OTUs in a metagenomic study. A consistent signal
of dropouts within a clade may lead to greater confidence in the presence of
dropouts. A clade of species or OTUs may be predisposed to dropout mutations due to
prior substitutions at the primer sites in the ancestor to the clade. We think that
the EMBL Metagenomics database[9] provides an ideal resource to begin to assess the potential occurrence of PCR
dropout mutations. The database consists of thousands of studies and often within a
study there are multiple samples from a site, each of which is sequenced using an
amplicon approach. Having multiple samples from a site would allow for the
assessment of whether a signature of PCR dropouts, perhaps across species or OTUs
within a clade, is replicated across samples. If it is, then this would be evidence
against the random occurrence of a pattern that is similar to predictions made in
this study.A consistent signal across samples could nevertheless be due to population structure
or changes in population size, instead of PCR dropouts. Although qualitatively there
are similarities in patterns of DNA sequence variation, quantitatively there are
likely differences that can distinguish PCR dropouts from the population structure
and changes in population size. For example, polymorphisms within clades of a
coalescent tree are expected to be at different frequencies geographically because
of partial isolation, drift, and the history of mutation. In contrast at a single
geographic site and with no population structure at a larger spatial scale, we would
expect more equal frequencies of polymorphisms, on average, across clades of a
coalescent tree in the presence of PCR dropouts. Population growth may be
distinguished from PCR dropouts in that recent growth is expected to increase the
number of rare polymorphisms in a population, in contrast to PCR dropouts. Overall,
the detection of PCR dropouts will likely be quantitative and involve the
development of a model-based inference method with the processes of PCR dropout
mutations, population structure, and changes in population size co-occurring.
Assessment of the occurrence of PCR dropouts would then be probabilistic, whereby
the likelihood or posterior probability of occurrence of a set of PCR dropouts would
be weighed in combination with parameters that model population structure and
demography.The detection of the presence of dropout mutations may be useful in metagenomic
studies that seek to not only identify taxon-level diversity, but also the
abundances of taxa in a community.[32] If a taxon has a signature of dropout mutations, then its abundance estimate
may be lower than its true value.Regarding existing methods that use next-generation sequencing approaches to the
estimate nucleotide diversity[14,15] and phylogenetic structure,[17] our results indicate that PCR dropouts are a factor of consideration. In the
case of nucleotide diversity, dropouts give rise to patterns of DNA sequence
variation similar to population structure or changes in population size (see above)
and therefore are expected to affect estimates of nucleotide diversity. The process
of PCR dropout mutations could be added to coalescent-based estimators of
population-level statistics, particularly in the context of the use of
next-generation sequencing and environmental samples. Likewise, when inferring the
phylogenetic structure of environmental samples that use primer-based sequencing,
the addition of the process of mutation at primer sites may be warranted, as these
mutations may lead to DNA sequence patterns that suggest demographic processes such
as population structure and growth. Similar to the previous discussion of detecting
PCR dropouts, the incorporation of dropouts to estimators of statistics and
phylogenetic structure will likely involve adding the dropout process to model-based
approaches, which would then affect the likelihood or posterior distributions of
focal statistics or histories.This article focused on properties of segregating dropout mutations within a species.
The development of more powerful methods to detect dropouts will likely involve
combining within-species and between-species approaches. In the context of
between-species approaches, a promising theoretical framework is Hey’s[14] model of cladogenesis and its potential generalization. In this article, we
showed that dropout mutations affect the shape of coalescent trees and it is likely
that they also affect the shape of phylogenetic trees. Properties of tree shape at
the phylogenetic level under different diversification processes has been a topic of
ongoing research, with detectable differences in diversification processes arising
from differences in internode lengths in a phylogenetic tree.[33] Differences in internode lengths phylogenetically are analogous to the effect
of dropouts on the ratio of coalescence times. A strong signal for the presence of
dropouts would be a clade with a distorted tree shape at the species level combined
with distorted coalescence times within species within the clade. Of course, this
type of inference requires segregating dropouts within species and polymorphism for
dropouts at the clade level, whereas a more common pattern may be that an entire
clade is not amplified by PCR due to ancestral dropout mutations. Furthermore, it is
important to note that expectations about where dropout mutations occur
genealogically at the population or species level using the coalescent may not hold
at higher taxonomic levels due to differences in the expected distribution of
relative internode lengths in a phylogenetic versus coalescent model.Regarding sequencing methods, a solution to the occurrence of PCR dropout mutations
is to use complementary non-overlapping primer sets that both amplify the same
genetic locus.[34] Using this approach, a locus in a species’ DNA will be amplified even if
there are dropout mutations at a primer binding site. Nevertheless, this approach is
not always used and there exist a large number of datasets that did not use
complementary non-overlapping primer sets (see Mitchell et al).[9] For these studies, approaches to assess coverage would be useful.This article is a starting point to derive expectations for the effects of PCR
efficiencies on the inference of taxon diversity and within-taxon genetic diversity
in other contexts. For example, a common approach to biodiversity assessment is DNA metabarcoding.[35] DNA metabarcoding is often applied to organisms with smaller effective
population sizes than prokaryotes, such that the population-level rate of mutation
at primer sites is lower. Accordingly, PCR dropout mutations are expected to be less
common in these types of studies, but other factors such as differential PCR
inhibition or amplification bias may result in the dropout of individual sequences.[36] Generally, accounting for PCR efficiency may be important in studies that
seek to infer population-level genetic diversity from metabarcode data.[37]Coalescent theory has been applied in metagenomic studies previously. For example,
O’Brien et al[17] derived a coalescent-based approach to infer the phylogenetic relationships
of species from a metagenomic sample, and Bittner et al[38] and Liberles et al[39] provide a broad overview for integrating coalescent approaches into
metagenomic studies. Furthermore, Johnson and Slatkin[40] presented a coalescent-based approach to detect recombination in a
metagenomic sample. Our article suggests that coalescent theory may allow for the
detection of PCR primer dropouts in metagenomic studies.
Conclusions
Primer coverage in microbial metagenomic and eDNA studies is a key uncertainty. In
this article, we derived coalescent-based expectations for the pattern of DNA
sequence variation within a species in the presence of segregating PCR dropout
mutations, where a dropout mutation completely blocks PCR amplification. Dropout
mutations can alter coalescence times and the distribution of pairwise differences,
which may form the basis of statistical techniques to detect or account for reduced
primer coverage.
Authors: J Craig Venter; Karin Remington; John F Heidelberg; Aaron L Halpern; Doug Rusch; Jonathan A Eisen; Dongying Wu; Ian Paulsen; Karen E Nelson; William Nelson; Derrick E Fouts; Samuel Levy; Anthony H Knap; Michael W Lomas; Ken Nealson; Owen White; Jeremy Peterson; Jeff Hoffman; Rachel Parsons; Holly Baden-Tillson; Cynthia Pfannkoch; Yu-Hui Rogers; Hamilton O Smith Journal: Science Date: 2004-03-04 Impact factor: 47.728
Authors: Anna Klindworth; Elmar Pruesse; Timmy Schweer; Jörg Peplies; Christian Quast; Matthias Horn; Frank Oliver Glöckner Journal: Nucleic Acids Res Date: 2012-08-28 Impact factor: 16.971
Authors: Alex L Mitchell; Maxim Scheremetjew; Hubert Denise; Simon Potter; Aleksandra Tarkowska; Matloob Qureshi; Gustavo A Salazar; Sebastien Pesseat; Miguel A Boland; Fiona M I Hunter; Petra Ten Hoopen; Blaise Alako; Clara Amid; Darren J Wilkinson; Thomas P Curtis; Guy Cochrane; Robert D Finn Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971
Authors: Yoamel Milián-García; Robert Young; Mary Madden; Erin Bullas-Appleton; Robert H Hanner Journal: Ecol Evol Date: 2021-02-10 Impact factor: 2.912