Evolution of sequences mostly involves independent changes at different sites. However, substitutions at neighboring sites may co-occur as multinucleotide replacement events (MNRs). Here, we compare noncoding sequences of several species of primates, and of three species of Drosophila fruit flies, in a phylogenetic analysis of the replacements that occurred between species at nearby nucleotide sites. Both in primates and in Drosophila, the frequency of single-nucleotide replacements is substantially elevated within 10 nucleotides from other replacements that occurred on the same lineage but not on another lineage. The data imply that dinucleotide replacements (DNRs) affecting sites at distances of up to 10 nucleotides from each other are responsible for 2.3% of single-nucleotide replacements in primate genomes and for 5.6% in Drosophila genomes. Among these DNRs, 26% and 69%, respectively, are in fact parts of replacements of three or more trinucleotide replacements (TNRs). The plurality of MNRs affect nearby nucleotides, so that at least six times as many DNRs affect two adjacent nucleotide sites than sites 10 nucleotides apart. Still, approximately 60% of DNRs, and approximately 90% of TNRs, span distances more than two (or three) nucleotides. MNRs make a major contribution to the observed clustering of substitutions: In the human-chimpanzee comparison, DNRs are responsible for 50% of cases when two nearby replacements are observed on the human lineage, and TNRs are responsible for 83% of cases when three replacements at three immediately adjacent sites are observed on the human lineage. The prevalence of MNRs matches that is observed in data on de novo mutations and is also observed in the regions with the lowest sequence conservation, suggesting that MNRs mainly have mutational origin; however, epistatic selection and/or gene conversion may also play a role.
Evolution of sequences mostly involves independent changes at different sites. However, substitutions at neighboring sites may co-occur as multinucleotide replacement events (MNRs). Here, we compare noncoding sequences of several species of primates, and of three species of Drosophilafruit flies, in a phylogenetic analysis of the replacements that occurred between species at nearby nucleotide sites. Both in primates and in Drosophila, the frequency of single-nucleotide replacements is substantially elevated within 10 nucleotides from other replacements that occurred on the same lineage but not on another lineage. The data imply that dinucleotide replacements (DNRs) affecting sites at distances of up to 10 nucleotides from each other are responsible for 2.3% of single-nucleotide replacements in primate genomes and for 5.6% in Drosophila genomes. Among these DNRs, 26% and 69%, respectively, are in fact parts of replacements of three or more trinucleotide replacements (TNRs). The plurality of MNRs affect nearby nucleotides, so that at least six times as many DNRs affect two adjacent nucleotide sites than sites 10 nucleotides apart. Still, approximately 60% of DNRs, and approximately 90% of TNRs, span distances more than two (or three) nucleotides. MNRs make a major contribution to the observed clustering of substitutions: In the human-chimpanzee comparison, DNRs are responsible for 50% of cases when two nearby replacements are observed on the human lineage, and TNRs are responsible for 83% of cases when three replacements at three immediately adjacent sites are observed on the human lineage. The prevalence of MNRs matches that is observed in data on de novo mutations and is also observed in the regions with the lowest sequence conservation, suggesting that MNRs mainly have mutational origin; however, epistatic selection and/or gene conversion may also play a role.
Entities:
Keywords:
D. melanogaster; H. sapiens; complex mutations; multinucleotide replacements; mutagenesis
Evolution at all levels, including that of DNA sequences, primarily proceeds through small steps. Simple evolutionary models do not include complex events, assuming that different sites of diverging sequences accumulate nucleotide substitutions independently (e.g., Yang and Bielawski 2000). However, several factors can lead to violations of this assumption.First, changes at different sites can be correlated due to complex mutations. A complex mutation consists of multiple changes in the DNA sequence, which appear not too far from each other as parts of a single event; each change may be a single-nucleotide mutation (multinucleotide mutations, MNMs), or more complex changes may be involved. As with other types of mutations, the frequency of occurrence of complex mutations can be most reliably estimated from data on de novo mutations, because other factors of evolution have had least chance to affect the fate of these mutations; the second best option is the mutation-accumulation experiments where selection has been relaxed (Kondrashov 2008). Analyses of de novo mutations (Kondrashov 2003; Chen et al. 2009) and of mutation-accumulation experiments (Keightley et al. 2009) reveal that complex mutations occur with a nonnegligible frequency. Patterns in within-population variation (Hodgkinson and Eyre-Walker 2010; Schrider et al. 2011) and in interspecies divergence (Averof et al. 2000; Smith et al. 2003) also demonstrate a substantial frequency of occurrence of correlated changes. Although other, nonmutational, explanations are possible for complex events observed in polymorphism and divergence data (see later), in each instance, arguments have been put forward that these observations are primarily due to complex mutational events (Averof et al. 2000; Smith et al. 2003; Hodgkinson and Eyre-Walker 2010; Schrider et al. 2011).The differences in contamination by non-MNMs may be one reason why the estimates of the rates of MNMs vary widely. Early estimates based on the frequencies of transitions between the two serine codon families in metazoans, and of double-nucleotide substitutions in a pseudogene of primates, put the rate of MNMs at approximately 0.1 per site per billion years (Averof et al. 2000). In primates, the fraction of mutations that are involved in a MNM was thus estimated as approximately 2% (Averof et al. 2000); this estimate dropped to 0.3% after the local heterogeneity of the mutation rates was taken into account (Smith et al. 2003). The rate of double mutations was estimated as approximately 10−11 per site per generation from data on human de novo nonsense mutations (Kondrashov 2003). The frequency of MNMs among the de novo mutations in mouse was estimated at 0.2–1% (Wang et al. 2007). More recently, the frequencies of single nucleotide polymorphisms (SNPs) in human put this estimate at 0.9% for MNMs hitting the adjacent sites (Hodgkinson and Eyre-Walker 2010) or at 3% for MNMs spanning distances of up to 20 nucleotides (Schrider et al. 2011). To the best of our knowledge, the only estimate for the rate of the MNMs in Drosophila is based on the observation of four individual dinucleotide mutations among 174 mutations observed in a mutation-accumulation experiment (Keightley et al. 2009). Most analyses do not distinguish between dinucleotide, trinucleotide, and higher order co-occurring mutations (but see Wang et al. 2007).Second, nonindependent evolution at different sites on one lineage can be caused by selection. A strong correlation of nonsynonymous substitutions, and a weaker effect in synonymous substitutions, was observed in evolution of rodent (Bazykin et al. 2004), HIV-1 (Bazykin et al. 2006), and Drosophila (Callahan et al. 2011; Bazykin and Kondrashov 2012) proteins; a similar effect was also observed for coding SNPs of Ciona savignyi (Donmez et al. 2009). The contrast between the nonsynonymous and synonymous substitutions or SNPs suggests that this effect is caused by selection. Runs of adjacent substitutions in coding sequences were also described by Stoletzki and Eyre-Walker (2011).Third, evolution at nearby sites can be correlated due to nonallelic gene conversion (Chen et al. 2007). When a segment of DNA is converted by a paralogous sequence carrying a number of nucleotide differences, the resulting pattern may be reminiscent of multiple simultaneous substitutions (Kidd et al. 2010, fig. S3). A typical conversion tract spans DNA segments of more than 300 nucleotides (Chen et al. 2007; Mancera et al. 2008; Paigen and Petkov 2010).Multinucleotide replacement events (MNRs) can be accounted for in evolutionary models. There are a lot of approaches, based on HMM (Yang 1995), maximum likelihood (ML; Whelan and Goldman 2004), Bayesian graphical models (Poon et al. 2008), and so on. In ML modeling of evolution of protein-coding sequence, including MNRs significantly improves the fit of the model (Whelan and Goldman 2004; Kosiol et al. 2007; Miyazawa 2011). The model implemented in PAML (Yang 2007) put the estimates for the rates of dinucleotide replacements (DNRs) and trinucleotide replacements (TNR) for aligned protein domains in Pandit database (Whelan et al. 2006) at 21% and 3.5%, respectively (Kosiol et al. 2007).Here, we estimate the frequency of MNRs from data on divergence of noncoding (intronic and intergenic) sequences of primates and Drosophila. We show that the frequency of single-nucleotide substitutions is increased by the presence of a nearby substitution on the same lineage to a larger extent than by the presence of a nearby substitution on another lineage, implying a substantial frequency of MNRs.
Results
To estimate the fraction α of single-nucleotide replacements that occurred as part of an MNR, we use a phylogenetic approach (fig. 1). In each analysis, we consider a trio of species, including the “focal” species (Homo sapiens or Drosophila melanogaster), a “proxy” sister species, and an outgroup. Using the outgroup to infer the ancestral state, we then consider the replacements that have occurred either on the focal or on the proxy lineage since their divergence from their last common ancestor. The logic of the analysis is as follows. Replacements on different lineages may occur only as independent events; conversely, multiple nearby replacements on the same lineage may occur either as independent events or as MNRs. Therefore, to calculate the rate of DNRs αd(k), we compare the conditional frequencies of substitutions on the focal lineage, given another substitution that has occurred at distance k nucleotides on the focal lineage (sd(k)) or on the proxy lineage (dd(k)):
F
Inferring the frequencies of DNRs (A,B) and of TNRs (C–E). The schematic phylogenetic tree at the left represents, from top to bottom, the two sister species, focal and proxy, and the outgroup; the adjacent horizontal lines represent multiple sequence alignments for the corresponding species. (A,B) The frequencies of substitutions on the focal lineage are measured within the set of sites (vertical ovals), such that another substitution (vertical line) is observed at distance k from them on the proxy (A) or on the focal (B) lineage (dd(k) and sd(k), respectively). (C,D) The frequencies of pairs of substitutions on the focal lineage are measured within the set of sites (pairs of vertical ovals at distances l from each other), such that another substitution (vertical line) is observed at distance k from them on the proxy (A) or the focal (B) lineage (dt(k,l) and st(k,l), respectively). E, five scenarios that can give rise to three adjacent substitutions on the same lineage, from top to bottom: three distinct mutations; two distinct mutations (an arc connects positions involved in an MNR), one being a DNR involving the second and the third, the first and the second, or the first and the third of the three nucleotides; and one TNR.
Inferring the frequencies of DNRs (A,B) and of TNRs (C–E). The schematic phylogenetic tree at the left represents, from top to bottom, the two sister species, focal and proxy, and the outgroup; the adjacent horizontal lines represent multiple sequence alignments for the corresponding species. (A,B) The frequencies of substitutions on the focal lineage are measured within the set of sites (vertical ovals), such that another substitution (vertical line) is observed at distance k from them on the proxy (A) or on the focal (B) lineage (dd(k) and sd(k), respectively). (C,D) The frequencies of pairs of substitutions on the focal lineage are measured within the set of sites (pairs of vertical ovals at distances l from each other), such that another substitution (vertical line) is observed at distance k from them on the proxy (A) or the focal (B) lineage (dt(k,l) and st(k,l), respectively). E, five scenarios that can give rise to three adjacent substitutions on the same lineage, from top to bottom: three distinct mutations; two distinct mutations (an arc connects positions involved in an MNR), one being a DNR involving the second and the third, the first and the second, or the first and the third of the three nucleotides; and one TNR.A similar approach can be used to estimate the frequency of TNR events, on the basis of the difference in the frequencies of substitutions pairs when one more substitution has occurred nearby on the same lineage or on a different lineage. The frequency of TNRs can be calculated as the difference between the prevalence of these two patterns, after the DNRs are controlled for (see Materials and Methods). Three substitutions on the same lineage may occur due to five scenarios: as three independent events; involving three different sets of DNRs; or involving a TNR (fig. 1E). Therefore, we calculate the rate of TNRs αt(k,l) as the difference between the conditional frequencies at which two substitutions at distances k and k + l from the first substitution are observed, given the first substitution that has occurred on the same (focal) phylogenetic lineage (st(k,l)) or on the other (proxy) lineage (dt(k,l)). We also need to subtract the probabilities that the first and the second mutations come as a DNR and that the first and the third mutations come as a DNR (see Materials and Methods for a more detailed explanation). Therefore,This approach controls for the nonuniformity of the replacement rates along the genome, as long as this nonuniformity is conserved between the focal and the proxy species. It allows us to infer the rates of DNRs and TNRs for different distances between sites k and l; it can also be easily extended for more complex events, that is, those involving more than three simultaneous nucleotide substitutions or other types of mutations.
Primates
In primates, we study four trios of species. All trios use the lineage of H. sapiens as the focal lineage but consider a range of species at increasing phylogenetic distances from H. sapiens as the proxy and the outgroup species.Even for the shortest considered phylogenetic distances (human–chimpanzee pair), the observation of a substitution in the proxy species does not bias the frequency of a substitution at a nearby site on the H. sapiens lineage (fig. 2, red lines). The sole exception is the immediately adjacent sites (k = 1), where this frequency is underestimated due to exclusion of a subset of pairs of substitutions giving rise to a CpG dinucleotide. This result, consistently with previous findings (Hodgkinson et al. 2009; Johnson and Hellmann 2011; Seplyarskiy et al. 2012), shows that there is little, if any, short-scale variation in the mutation rates along the genomes of primates, perhaps with the exception of the immediately adjacent positions.
F
Frequencies of DNRs in primates for different distances between sites. dd(k) (red), sd(k) (blue), and αd(k) (green) are shown for distances between the sites (horizontal axis). Left column (A–D), introns; right column (E–H), intergenic regions. (A,E) Homo sapiens and Pan troglodytes (Gorilla gorilla as outgroup), (B,F) H. sapiens and G. gorilla (Pongo pygmaeus as outgroup), (C,G) H. sapiens and P. pygmaeus (Macaca mulatta as outgroup), and (D,H) H. sapiens and M. mulatta (Callithrix jacchus as outgroup). CpG dinucleotides are excluded, which leads to underestimation of dd(1) and sd(1) (see Materials and Methods).
Frequencies of DNRs in primates for different distances between sites. dd(k) (red), sd(k) (blue), and αd(k) (green) are shown for distances between the sites (horizontal axis). Left column (A–D), introns; right column (E–H), intergenic regions. (A,E) Homo sapiens and Pan troglodytes (Gorilla gorilla as outgroup), (B,F) H. sapiens and G. gorilla (Pongo pygmaeus as outgroup), (C,G) H. sapiens and P. pygmaeus (Macaca mulatta as outgroup), and (D,H) H. sapiens and M. mulatta (Callithrix jacchus as outgroup). CpG dinucleotides are excluded, which leads to underestimation of dd(1) and sd(1) (see Materials and Methods).In contrast, a substitution that occurs on the H. sapiens lineage radically increases the probability of another substitution on the same lineage nearby (fig. 2, blue lines). Therefore, a substantial fraction of substitutions within a lineage at nearby sites comes as a single-MNR event. These results suggest that DNRs spanning distances of up to 10 nucleotides are responsible for at least of all observed single-nucleotide substitutions. decreases rapidly with k, so that αd(1) exceeds αd(10) by at least a factor of approximately 6 (fig. 2, green lines); among all DNRs at distances up to 10 nucleotides, 36% occur at immediately adjacent sites.A substantial fraction of substitutions also come as part of a TNR. Figure 3 presents the results for the case when two of the three mutations involved in a TNR occurred at adjacent sites (l =1), and figure 4A, all values of for k
10 and l
10 (see also supplementary table S1, Supplementary Material online). TNRs spanning distances of up to 10 nucleotides are responsible for at least of all observed single-nucleotide substitutions or for 26% of DNRs spanning such distances. Similar to DNRs, the rate of TNRs decreases rapidly with the distance spanned, so that αt(1,1) exceeds αt(10,1) by a factor of approximately 7 (fig. 3, green lines). Among TNRs spanning distances up to 10 nucleotides, 7.8% affect the three immediately adjacent nucleotides (fig. 4A).
F
Frequencies of TNRs, such that two of the three replacements affect adjacent sites (l = 1), for different distances k from the third site in primates. dt(k,l) (red), st(k,l) (blue), and αt(k,l) (green) are shown for distances (horizontal axis). The panels correspond to the panels in figure 2.
F
Frequencies of TNRs. αt(k,l) for all values of k and l between 1 and 10. (A) Human–chimpanzee comparison and (B) D. melanogaster–D. simulans comparison. Only intronic sites are shown.
Frequencies of TNRs, such that two of the three replacements affect adjacent sites (l = 1), for different distances k from the third site in primates. dt(k,l) (red), st(k,l) (blue), and αt(k,l) (green) are shown for distances (horizontal axis). The panels correspond to the panels in figure 2.Frequencies of TNRs. αt(k,l) for all values of k and l between 1 and 10. (A) Human–chimpanzee comparison and (B) D. melanogaster–D. simulans comparison. Only intronic sites are shown.Between the four analyzed phylogenies, the length of the considered focal (H. sapiens) lineage differs by a factor of 3.8. Correspondingly, the number of pairs of nearby nucleotide sites each carrying a substitution on the human lineage Dd(k) (which scales with lineage length quadratically) differs by a factor of approximately 10, and the fraction of sites, among sites with a substitution on the human lineage, that also carry another substitution at a nearby site dd(k) (which scales with lineage length linearly) differs by a factor of approximately 4. Nevertheless, we get similar estimates of from different phylogenies, implying that our estimates of the frequencies of MNRs are robust (fig. 5A).
F
Dependence of the contribution of DNRs on the length of the phylogenetic lineage of the focal species in introns. (A) as the function of and (B) as the function of . At each panel, the four points, from left to right, correspond to the substitutions on the Homo sapiens lineage after its divergence from Pan troglodytes, Gorilla gorilla, Pongo pygmaeus, and Macaca mulatta, respectively.
Dependence of the contribution of DNRs on the length of the phylogenetic lineage of the focal species in introns. (A) as the function of and (B) as the function of . At each panel, the four points, from left to right, correspond to the substitutions on the Homo sapiens lineage after its divergence from Pan troglodytes, Gorilla gorilla, Pongo pygmaeus, and Macaca mulatta, respectively.Although the fraction of MNRs among all replacements is constant and independent of the phylogeny, their contribution to the genome-level patterns is disproportionally high for very closely related species. Indeed, in the human–chimp comparison, 50% of pairs of substitutions at adjacent sites were due to DNRs (fig. 5B) and 83% of triplets of substitutions at adjacent sites were due to TNRs. This fraction is reduced as species accumulate more differences: In the human–macaque comparison, DNRs and TNRs are responsible for only 22% (44%) of such pairs (fig. 5B). This is intuitive: In closely related species, multiple substitutions at adjacent sites were unlikely to have accumulated by chance and are usually due to an MNR.
Drosophila
In contrast to primates, in Drosophila, even a substitution that occurs on the proxy lineage substantially increases the frequency of nearby substitutions on the D. melanogaster lineage (figs. 6 and 7, red lines). This effect spans distances of tens of nucleotides and is consistent with pervasive short-scale heterogeneity of the substitution rate in Drosophila (Seplyarskiy et al. 2012).
F
Frequencies of DNRs in D. melanogaster–D. simulans comparison for different distances between sites. (A–C) dd(k) (red), sd(k) (blue), and αd(k) (green) are shown for distances between the sites (horizontal axis). (A) Introns; (B) intergenic regions; and (C) positions 8–30 in introns with lengths up to 120 nucleotides. (D) αd(k) for all intronic sites (gray) and positions 8–30 in introns with lengths up to 120 nucleotides (orange) plotted together. The differences between the two curves are insignificant for all k. Error bars for αd(k) correspond to 95% confidence intervals obtained by 1,000 bootstrap simulations; for data from all intronic sites and intergenic regions, they are not shown because they would be barely visible.
F
Frequencies of TNRs in D. melanogaster–D. simulans comparison, such that two of the three replacements affect adjacent sites (l = 1), for different distances k from the third site in drosophilids. dt(k,l) (red), st(k,l) (blue), and αt(k,l) (green) are shown for distances between sites (horizontal axis). (A) Introns and (B) intergenic regions.
Frequencies of DNRs in D. melanogaster–D. simulans comparison for different distances between sites. (A–C) dd(k) (red), sd(k) (blue), and αd(k) (green) are shown for distances between the sites (horizontal axis). (A) Introns; (B) intergenic regions; and (C) positions 8–30 in introns with lengths up to 120 nucleotides. (D) αd(k) for all intronic sites (gray) and positions 8–30 in introns with lengths up to 120 nucleotides (orange) plotted together. The differences between the two curves are insignificant for all k. Error bars for αd(k) correspond to 95% confidence intervals obtained by 1,000 bootstrap simulations; for data from all intronic sites and intergenic regions, they are not shown because they would be barely visible.Frequencies of TNRs in D. melanogaster–D. simulans comparison, such that two of the three replacements affect adjacent sites (l = 1), for different distances k from the third site in drosophilids. dt(k,l) (red), st(k,l) (blue), and αt(k,l) (green) are shown for distances between sites (horizontal axis). (A) Introns and (B) intergenic regions.Nevertheless, a substitution that occurs on the D. melanogaster lineage increases the frequency of another substitution on the D. melanogaster lineage nearby more radically than the substitution on the proxy lineage (figs. 6 and 7, blue lines), implying a high frequency of MNRs (figs. 6 and 7, green lines). The contributions of DNRs and TNRs spanning distances of up to 10 nucleotides are
and (fig. 4B and supplementary table S2, Supplementary Material online), respectively; DNRs and TNRs affecting two or three immediately adjacent nucleotides are responsible for αd(1) 0.02 and αt(1,1) 0.004 of all single-nucleotide substitutions, respectively. Therefore, DNRs and TNRs contribute, respectively, approximately 2 times and approximately 6 times more to the substitution rate in Drosophila than in primates. DNRs are responsible for 34% of pairs of substitutions at adjacent sites, and TNRs are responsible for 43% of triplets of substitutions at adjacent sites. As in primates, both and decrease rapidly with k over a distance of approximately 10 nucleotides. Slightly higher estimates of are obtained using the ML-based inference of the ancestral states (supplementary fig. S1, Supplementary Material online). This difference is due to inclusion of nucleotide sites with multiple replacements in the ML analysis; when such less reliable sites are excluded, identical estimates are obtained using the two methods.
Discussion
Our results imply that at least 2.3% of single-nucleotide replacements observed in evolution of the human lineage, and at least 5.6% of those observed on the D. melanogaster lineage, occurred as part of MNR events spanning multiple nucleotides at distances up to 10 from each other. These estimates control for the nonuniformity of replacement rates along the genomes, as long as this nonuniformity is preserved in both evolving lineages. Such nonuniformity may occur due to variation in the mutation rates, which is known to be high both in primates (Hodgkinson et al. 2009; Hodgkinson and Eyre-Walker 2011; Johnson and Hellmann 2011; Seplyarskiy et al. 2012) and, in particular, in Drosophila (Stamatoyannopoulos et al. 2009; Seplyarskiy et al. 2012; Weber et al. 2012) or to differences in the strength of negative selection affecting noncoding regions. For example, if the local mutation rates are autocorrelated at distances up to approximately 10 nucleotides, and for k < 10 will be elevated equally (because any such heterogeneity will cause mutations to be clumped along the sequence, no matter on what lineage they occur), but will not be affected.Still, the observation of MNRs can come from a number of sources: nonuniformity of mutation rates and/or selection that varies between the two analyzed lineages; genome sequencing, assembly, or alignments errors; mutations simultaneously affecting several nucleotides; epistatic interactions among replacements; and/or nonallelic gene conversion.Local patterns of mutation and selection may change in the course of evolution. If the sister species are so distant that homologous genome segments have different local replacement rates (e.g., because their local mutation rates diverged), α may provide a biased estimate for the frequency of MNRs; in this case, for example, substitutions may be clustered on the same lineage due to differences in the mutation rates between lineages rather than due to MNRs. To estimate this effect, we used as proxy species four different primate species located at different phylogenetic distances from the focal species (there is no range of suitable Drosophila species) and compared the obtained . The results were similar (fig. 5), suggesting that the contribution of the change in the local mutation or selection rates to our estimates of the rate of MNRs is minor.Assembly and/or alignment errors may lead to spurious clustering of nucleotide replacements, and sequencing errors may have the same effect if they are clustered in the genome (Löytynoja and Goldman 2008; Wong et al. 2008; Mallick et al. 2009; Schneider et al. 2009). With our approach, such double errors in the genomes of proxy species do not affect or . Only multiple errors at nearby sites in the genomes of H. sapiens or D. melanogaster can inflate or ; however, these genomes have high-quality sequences, and errors in them are infrequent. Nevertheless, we did two additional tests to estimate their contribution. First, we reasoned that errors are unlikely to coincide in two species with independently assembled genomes. However, when only nucleotides matching between human and chimpanzee are used in the human–orangutan comparison, the values of remain very similar to those when obtained using all nucleotides (supplementary fig. S2, Supplementary Material online). Second, to minimize the effect of alignment errors, we only considered nucleotides further than 20 nucleotides from alignment gaps; again, the results were similar (supplementary figs. S3 and S4, Supplementary Material online), implying that the contribution of alignment errors is also minor.MNRs may occur due to epistatic selection. Under positive epistasis, that is, when the first nucleotide replacement facilitates the second, replacements can co-occur on the same lineage (Bazykin et al. 2004). This is the leading cause of co-occurrence of nonsynonymous replacements within coding regions (Bazykin et al. 2004, 2006; Callahan et al. 2011; Bazykin and Kondrashov 2012). However, epistasis is impossible in the absence of selective constraint. Indeed, if the ancestral and the two-substitution variants have the same fitness, but the intermediate variant is deleterious, the rate of evolution decreases as the selection against the intermediate variant increases (Kimura 1985). Conversely, if selection against the intermediate variant is weak, we expect a near-neutral rate of evolution, but little clumping, and no clumping at all if the intermediate variant is neutral. Clumping due to epistasis can also occur when the double mutant has a higher fitness than the ancestral variant and the two-substitution variant as a whole is positively selected. In this case, the overall rate of evolution (and, correspondingly, degree of conservation) can be higher, lower, or the same as in the neutral regions, depending on the parameters (fig. 1 in Lynch and Abegg 2010). Still, data suggest that this mode of selection is more likely in regions where the overall conservation is high. Indeed, clumping of nonsynonymous substitutions caused by positive selection is stronger in conservative regions of proteins (Bazykin and Kondrashov 2012). More generally, positive selection affects a larger fraction of substitutions in conservative regions of proteins (Bazykin and Kondrashov 2012) and of noncoding regions (Cai et al. 2009, Halligan et al. 2011). Therefore, excluding the conservative sites should restrict the influence of epistatic selection.The estimates for the fraction of intronic or intergenic sites in primates that are selectively constrained based on divergence and polymorphism patterns range from 2.8% to 11% (Waterston et al. 2002; Dermitzakis et al. 2005; Asthana et al. 2007; Ponting and Hardison 2011; Ward and Kellis 2012). Selection in noncoding regions of Drosophila is more prevalent and constrains the evolution of more than 50% of intronic and intergenic sites (Andolfatto 2005). The higher fraction of genome under selection in Drosophila, compared with human, seems consistent with a higher observed rate of MNRs in the former if MNRs are caused by selection. However, to minimize the contribution of selection in our analyses, we excluded all the nucleotide sites that show signs of conservation between species (see Materials and Methods). In doing so, we excluded the fraction of the genome (∼40% in humans and ∼60% in Drosophila) that is larger than the currently estimated fraction of the genome under selective constraint. Choosing an even more radical threshold (excluding ∼70% of the genome in Drosophila) leads to very similar results (supplementary fig. S5, Supplementary Material online).To further study the possible contribution of selection in Drosophila to MNRs, we compared the rate of MNRs at positions 8–30 of short introns (fig. 6C) and the rest of the intronic and intergenic sites (fig. 6A and B). Positions 8–30 of short introns are the class of sites in Drosophila that is least subject to both positive or negative selection (Parsch et al. 2010). Because of an approximately 10× smaller sample size, the confidence intervals for the estimates of the rate of MNRs at this class of sites were large; still, we observe a significantly nonzero for k < 3, and no difference between the for short and all introns for all values of k (fig. 6D). In summary, the data seem to suggest that the contribution of selection to MNRs in nonconstrained regions is minor.Conceivably, MNRs may also be caused by nonallelic gene conversion. In this process, a segment of DNA may be converted in one of the lineages by a similar genomic region. If the conversion tract spans more than one nucleotide difference, this will lead to nearby correlated changes in the genome sequence and may inflate . However, our results show that the majority of MNRs span distances of no more than 10 nucleotides, whereas conversion tracts are much longer (Hilliker et al. 1994; Jeffreys and May 2004; Mancera et al. 2008; Wang et al. 2012). To estimate the magnitude of any potential contribution of gene conversion, we calculated for the 50% of the genome that lacked close paralogous sequences and was thus unlikely to have experienced nonallelic gene conversion. The results of this analysis (supplementary figs. S6 and S7, Supplementary Material online) were similar to those obtained using the entire genome. Therefore, it is unlikely that gene conversion is a major contributor to the MNRs.Finally, MNRs can be caused by MNMs. Although direct sequencing experiments are the best way to measure the mutation rates, such data sets so far are of limited scope; few MNMs have been observed in them, and therefore, the resulting estimates have a large variance. Still, our 2.3% estimate for the rate of DNRs in primates falls near to 2–3% obtained by 1000 Genomes Project Consortium et al. (2010); and our 5.6% estimate for this rate in Drosophila is consistent with four dinucleotide mutations among 174 de novo mutations (2.3%) observed by Keightley et al. (2009). These numbers are also concordant with indirect measurements of MNM rates in humans based on polymorphism frequencies (2.0%; Schrider et al. 2011).In summary, our results support a major role for MNRs in metazoans. A large fraction of the MNRs involves more than two nucleotides. Moreover, although the MNRs spanning immediately adjacent nucleotides are numerically the most prevalent, they comprise only approximately 30% of DNRs, and only approximately 10% of TNRs; the remaining MNRs span larger distances. Although a high prevalence of MNRs appears to be universal (Schrider et al. 2011), we show that their frequency differs between species, so that the fraction of MNRs among all replacements in Drosophila exceeds that in primates by a factor of approximately 2 for DNRs and by a factor of approximately 7 for TNRs. Such a high fraction of MNRs should be taken into account in evolutionary models, especially when estimation of epistatic selection is the goal, because MNRs may facilitate adaptation (Kimura 1985).
Materials and Methods
Data
Multiple complete genome alignment of Callithrix jacchus, Macaca mulatta, Pongo pygmaeus, Gorilla gorilla, and Pan troglodytes to H. sapiens (hg18) (Fujita et al. 2011) was obtained from UCSC Genome Browser (http://genome.ucsc.edu). Data on human variation in nine diploid nuclear genotypes and the reference human genome, for a total of 19 haploid genotypes, were obtained as described in Seplyarskiy et al. (2012).Multiple complete genome alignment of D. simulans, D. yakuba, and D. erecta to D. melanogaster (dm3) was downloaded from UCSC Genome Browser (Fujita et al. 2011) (http://genome.ucsc.edu). Polymorphism data for 162 complete genomes of D. melanogaster aligned to dm3 reference genome of D. melanogaster (Mackay et al. 2012) were downloaded from http://www.hgsc.bcm.tmc.edu/projects/dgrp/freeze1_July_2010/, and data for six complete genomes of D. simulans (Begun et al. 2007) aligned to dm2 reference genome of D. melanogaster were downloaded from DPGP (http://www.dpgp.org/). dm2 co-ordinates were converted to dm3 co-ordinates using liftover (http://hgdownload.cse.ucsc.edu/goldenPath/dm3/liftOver/). FlyBase genes (Tweedie et al. 2009, BDGP release 5) were used to map D. melanogaster introns and intergenic regions onto the four-species alignment.The sets of analyzed nucleotide sites were formed as follows. Nucleotides masked by RepeatMasker, not aligned, or containing gaps or non-ACGT characters were not considered. To reduce the effect of natural selection on our inferences, we excluded all coding regions, 5′- and 3′-UTRs, all alternatively spliced regions, and nucleotide sites with high interspecies conservation: phastCons (Siepel et al. 2005) scores >0.1 in Drosophila and PhyloP (Pollard et al. 2010) scores >0.1 in primates. (For supplementary fig. S5, Supplementary Material online, phastCons scores >0.05 in Drosophila were excluded.) Including polymorphic sites may bias the estimates of divergence when the compared species are closely related (Keightley and Eyre-Walker 2012); therefore, we also excluded sites polymorphic in human in the analyses of primates, or in D. melanogaster or in D. simulans in the analyses of Drosophila.To infer the lineage-specific nucleotide substitutions, two approaches were used. First, we used a maximum parsimony-based approach. To this end, in each comparison, we compared trios of species including two sister species and an outgroup. In primates, we considered several trios, with increasing phylogenetic distance between the sister species: H. sapiens and Pan troglodytes (G. gorilla as outgroup), H. sapiens and G. gorilla (P. pygmaeus as outgroup), H. sapiens and P. pygmaeus (M. mulatta as outgroup), and H. sapiens and M. mulatta (C. jacchus as outgroup). In Drosophila, we compared D. melanogaster and D. simulans, using matched positions of D. yakuba and D. erecta to infer the ancestral state; positions that differed between D. yakuba and D. erecta were excluded. We then assumed that a substitution has occurred on one of the sister lineages if the nucleotides on the other sister lineage and in the outgroup coincided and differed from the nucleotide on the considered lineage.Second, in Drosophila, where the inference of the ancestral state is less reliable due to higher phylogenetic distances between species, we used an ML-based approach. To this end, we used the baseml program of PAML (Yang 2007).
Analysis
To estimate the fraction α of single-nucleotide substitutions that occurred as part of an MNR, we compared the lineage-specific substitutions in the genomes of two sister species, using the genome of H. sapiens or D. melanogaster as the “focal” genome, and its sister genome as the “proxy” genome. The focal genomes have a higher quality sequence and annotation; therefore, in each analysis, we used them to measure the substitution rates.We estimated the fraction of single-nucleotide differences that arose due to a DNR as follows. For each distance, between the two analyzed nucleotide sites k ∈ (1..100), we calculated two values: 1) dd(k), the fraction of sites carrying a single-nucleotide substitution on the focal lineage, among sites at distance k from a site with a substitution on the proxy lineage:
and 2) sd(k), the fraction of sites carrying a single-nucleotide substitution on the focal lineage, among sites at distance k from a site with another substitution on the focal lineage:Here, is the number of pairs of nucleotide sites with co-ordinates (i, i + k), such that one substitution occurred at the first site on the proxy linage and another at the second site on the focal lineage; is the number of pairs of nucleotide sites with co-ordinates (i, i + k), such that one substitution occurred at each of the two sites on the focal lineage (fig. 1A and B); and P and F are the overall numbers of single-nucleotide substitutions on the proxy and the focal lineage, respectively.Because two substitutions that occur on different lineages can only arise due to two distinct replacement events, , where is the probability of a substitution on the focal lineage at a site at distance k from a substitution on the proxy lineage, andIn contrast, two substitutions on the same lineage can occur due to either two distinct replacement events or a single DNR. Therefore, , where is the probability of a substitution on the focal lineage at a site at distance k from another substitution on the focal lineage, and is the probability that a single-nucleotide substitution at position i in the focal genome originated as part of a DNR also involving a substitution at position i + k. Therefore,If the distribution of the replacement rates along the genome is preserved between the two lineages,
, andTo estimate the contribution of TNRs, we calculate the frequencies of cases when two substitutions occurred at sites separated by l nucleotides in the focal genome, within the set of sites at distance k from the nearest of them in the proxy genome or in the focal genome :
where is the number of triplets of nucleotide sites with co-ordinates (i, i + k, i + k + l), such that one substitution occurred at the first site on the proxy lineage and two more in the second and the third sites on the focal lineage; and is the number of triplets of nucleotide sites with co-ordinates (i, i + k, i + k + l), such that a substitution occurred at each of these three sites on the focal lineage (fig. 1C and D). The first pattern can arise due to either three distinct replacement events or two replacement events, one of them being a DNR:The second pattern can arise under five different scenarios (fig. 1E):
where is the probability that a single-nucleotide substitution at position i in the focal genome originated as part of a TNR also involving substitutions at positions i + k and i + k + l, and the five components of the sum correspond to the five scenarios in figure 1E. Therefore,CpG dinucleotides are hypermutable in vertebrates; this can lead to correlations between consecutive mutations at adjacent sites if the first mutation creates a CpG (Bird 1980). To avoid the contribution of this phenomenon to measurements of MNR rates, in all analyses involving primates, we excluded all sites involved in a CpG dinucleotide in the outgroup species. For analyses involving pairs of adjacent sites, we also excluded pairs of substitutions in the same lineage that possibly involved a CpG intermediate, that is, N1pG → CpN2 or CpN1 →N2pG, where N1 and N2 correspond to any nucleotide and pairs of substitutions on different lineages, such that the ancestral variant was N1pG or CpN2 and one of the derived variants was a CpG dinucleotide. This latter filter leads to underestimation of dd(1), sd(1), dt(1,l), and st(1,l) but should not bias or .To estimate the possible contribution of nonallelic gene conversion to MNRs, we took the following approach. We reasoned that genomic regions that have a close paralog elsewhere in the genome are more likely to undergo nonallelic gene conversion. To identify these regions, we ran basic local alignment search tool (BLAST) (Altschul et al. 1990) of the entire H. sapiens (D. melanogaster) genome against itself. Among the resulting BLAST hits, we removed all those with co-ordinates overlapping the query sequence. We then sorted the remaining BLAST hits by their e values and considered the number of BLAST hits from the top of this list that cumulatively covered 50% of the genome. The 50% of the genome that was covered by these BLAST hits and the remaining 50% were then considered as prone and not prone to nonallelic gene conversion, respectively.
Supplementary Material
Supplementary figures S1–S7 and tables S1 and S2 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
Authors: Georgii A Bazykin; Jonathan Dushoff; Simon A Levin; Alexey S Kondrashov Journal: Proc Natl Acad Sci U S A Date: 2006-12-12 Impact factor: 11.205
Authors: John Parsch; Sergey Novozhilov; Sarah S Saminadin-Peter; Karen M Wong; Peter Andolfatto Journal: Mol Biol Evol Date: 2010-02-11 Impact factor: 16.240
Authors: John A Stamatoyannopoulos; Ivan Adzhubei; Robert E Thurman; Gregory V Kryukov; Sergei M Mirkin; Shamil R Sunyaev Journal: Nat Genet Date: 2009-03-15 Impact factor: 38.330
Authors: Jian-Min Chen; David N Cooper; Nadia Chuzhanova; Claude Férec; George P Patrinos Journal: Nat Rev Genet Date: 2007-09-11 Impact factor: 53.242
Authors: Pauline A Fujita; Brooke Rhead; Ann S Zweig; Angie S Hinrichs; Donna Karolchik; Melissa S Cline; Mary Goldman; Galt P Barber; Hiram Clawson; Antonio Coelho; Mark Diekhans; Timothy R Dreszer; Belinda M Giardine; Rachel A Harte; Jennifer Hillman-Jackson; Fan Hsu; Vanessa Kirkup; Robert M Kuhn; Katrina Learned; Chin H Li; Laurence R Meyer; Andy Pohl; Brian J Raney; Kate R Rosenbloom; Kayla E Smith; David Haussler; W James Kent Journal: Nucleic Acids Res Date: 2010-10-18 Impact factor: 16.971
Authors: Simon Whelan; Paul I W de Bakker; Emmanuel Quevillon; Nicolas Rodriguez; Nick Goldman Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971
Authors: Susan Tweedie; Michael Ashburner; Kathleen Falls; Paul Leyland; Peter McQuilton; Steven Marygold; Gillian Millburn; David Osumi-Sutherland; Andrew Schroeder; Ruth Seal; Haiyan Zhang Journal: Nucleic Acids Res Date: 2008-10-23 Impact factor: 16.971
Authors: Jakob M Goldmann; Vladimir B Seplyarskiy; Wendy S W Wong; Thierry Vilboux; Pieter B Neerincx; Dale L Bodian; Benjamin D Solomon; Joris A Veltman; John F Deeken; Christian Gilissen; John E Niederhuber Journal: Nat Genet Date: 2018-03-05 Impact factor: 38.330