Literature DB >> 31147380

Quantifying GC-Biased Gene Conversion in Great Ape Genomes Using Polymorphism-Aware Models.

Rui Borges¹, Gergely J Szöllősi², Carolin Kosiol^3,4.

Abstract

As multi-individual population-scale data become available, more complex modeling strategies are needed to quantify genome-wide patterns of nucleotide usage and associated mechanisms of evolution. Recently, the multivariate neutral Moran model was proposed. However, it was shown insufficient to explain the distribution of alleles in great apes. Here, we propose a new model that includes allelic selection. Our theoretical results constitute the basis of a new Bayesian framework to estimate mutation rates and selection coefficients from population data. We apply the new framework to a great ape dataset, where we found patterns of allelic selection that match those of genome-wide GC-biased gene conversion (gBGC). In particular, we show that great apes have patterns of allelic selection that vary in intensity-a feature that we correlated with great apes' distinct demographies. We also demonstrate that the AT/GC toggling effect decreases the probability of a substitution, promoting more polymorphisms in the base composition of great ape genomes. We further assess the impact of GC-bias in molecular analysis, and find that mutation rates and genetic distances are estimated under bias when gBGC is not properly accounted for. Our results contribute to the discussion on the tempo and mode of gBGC evolution, while stressing the need for gBGC-aware models in population genetics and phylogenetics.

Entities: Chemical Disease Species

Keywords: GC-bias; Moran model; allelic selection; boundary mutations; gBGC; great apes

Mesh：

Year: 2019 PMID： 31147380 PMCID： PMC6707462 DOI： 10.1534/genetics.119.302074

Source DB: PubMed Journal: Genetics ISSN： 0016-6731 Impact factor: 4.562

THE field of molecular population genetics is currently being revolutionized by progress in data acquisition. New challenges are emerging as new lines of inquiry are posed by increasingly large population-scale sequence data (Casillas and Barbadilla 2017). Mathematical theory describing population dynamics was developed before molecular sequences were available (e.g., Fisher 1930; Wright 1931; Moran 1958; Kimura 1964); now that ample data are available to perform statistical inference, many models have been revisited. Recently, the multivariate Moran model with boundary mutations was developed and applied to exome-wide allele frequency data from great ape populations (Schrempf and Hobolth 2017). However, drift and mutation are not fully sufficient to explain the observed allele counts (Schrempf and Hobolth 2017). It was hypothesized that other forces, such as directional selection and GC-biased gene conversion (gBGC), may also play a role in shaping the distribution of alleles in great apes. Directional selection and gBGC have different causes but similar signatures: under directional selection, the advantageous allele increases as a consequence of differences in survival and reproduction among different phenotypes; under gBGC, GC alleles are systematically preferred. gBGC is a recombination-associated segregation bias that favors GC-alleles (hereafter, strong alleles) over AT-alleles (hereafter, weak alleles) during the repair of mismatches that occur within heteroduplex DNA during meiotic recombination (Marais 2003). gBGC has been studied in several taxa including mammals (Duret and Galtier 2009; Romiguier ; Lartillot 2013), birds (Webster ; Weber ; Smeds ; Corcoran ), reptiles (Figuet ), plants (Muyle ; Serres-Giardi ; Clément ; Liu ), and fungi (Pessia ; Lesecque ; Liu ). However, apart from some studies in human populations (Katzman ; Glémin ; Pouyet ), a population-level perspective of the intensity and diversity of patterns of gBGC among closely related populations is still lacking. Several questions remain open regarding the tempo and mode of gBGC evolution. For example, the effect of demography on gBGC is still controversial. While theoretical results and studies in mammals and birds advocate a positive relationship between the effective population size and the intensity of gBGC (Nagylaki 1983; Romiguier ; Weber ), Galtier failed to detect such relationship between animal phyla. These results open the question as to which extent demography shapes the intensity of gBGC in closely vs. distantly related species/populations. Another aspect that is not completely understood is the impact of GC-bias on the base composition of genomes (Phillips ; Romiguier ). In particular, the individual and joint effect of gBGC and mutations shaping the substitution process remains elusive. Here, we address these two questions by revisiting great ape data (Prado-Martinez ) with a Moran model that accounts for allelic selection. The Moran model (Moran 1958) has a central position in describing the evolution of a population in that it models the dynamics of allele frequency changes in a finite haploid population. Recently, an approximate solution for the multivariate Moran model with boundary mutations (i.e., low mutation rates) was derived (Schrempf and Hobolth 2017). In particular, the stationary distribution was shown useful to infer population parameters from allele frequency data (Schrempf ; Schrempf and Hobolth 2017). Here, we present the Moran model with boundary mutations and allelic selection, derive the stationary distribution, and build a Bayesian framework to estimate population parameters. While De Maio had previously proposed a Moran model with allelic selection, we introduce further assumptions on the mutation scheme that permit us to mechanistically describe the relative importance and impact of the population processes mediating the base composition of genomes and expected divergence. Other approaches making use of allele frequency data to estimate mutation rates and selection coefficients have been proposed in the literature. Glémin proposed a method to quantify gBGC from the derived allele frequency spectra that incorporates polarization errors, takes spatial heterogeneity into account, and jointly estimates mutation bias. The number of derived alleles is modeled by a Poisson distribution on the mutation rates among weak, strong, and neutral alleles (Muyle ). Our approach differs from that of Glémin as it does not require polarized data or need to account for polarization errors. In addition, our method makes use of the information given by the fixed sites—information that is usually discarded by other methods (Glémin included). Furthermore, our application to great apes shows that most great apes have patterns of allelic selection consistent with gBGC. Our results suggest further that demography plays a major role in determining the intensity of gBGC among great apes, as the intensity of the obtained selection coefficients correlates significantly with the effective population size of great apes. We also show that not accounting for GC-bias may considerably distort the reconstructed evolutionary process, as mutation and substitution rates are estimated under bias.

Methods

The multivariate Moran model with allelic selection

The modeling framework defined in this work builds on the model described by Schrempf , which, according to some proposed terminology (Vogl and Bergman 2015; Schrempf and Hobolth 2017), can be addressed as the multivariate Moran model with boundary mutations (hereafter, ). Here, we describe the and allelic selection (hereafter, ). The multivariate Moran model can be also referred to as a polymorphism-aware phylogenetic model (PoMo) if we consider the four-variate case (De Maio , 2015; Schrempf ), i.e., representing the four nucleotide bases (Figure 1).

Figure 1

PoMo state-space using . The four alleles represent the four nucleotide bases. Black and gray arrows indicate mutations, and genetic drift plus selection, respectively. Monomorphic or boundary states are represented in the tetrahedron’s vertices, while the polymorphic states are represented in its edges. Monomorphic states interact with polymorphic states via mutation, but a polymorphic state can reach a monomorphic state only via drift or selection. Between polymorphic states, only drift and selection events occur. Consider a haploid population of N individuals and a single locus with K alleles: and are two possible alleles. The evolution of this population over time is described by a continuous-time Markov chain with a discrete character-space defined by N and K, each of which represents a specific assortment of alleles. Two types of state can be defined: if all the individuals in a populations have the same allele, the population is monomorphic , i.e., the N individuals have the allele ; differently, if two alleles are present in the population, the population is polymorphic , meaning that n individuals have the allele and have the allele . is therefore the frequency of allele in the population. Allele trajectories are given by the rate matrix . Time is accelerated by a factor of N, and, therefore, instead of describing Moran dynamics in terms of Moran events (Moran 1958), we developed a continuous version in which time is measured as the coalescent in generation time (in units of N). Drift is defined by the neutral Moran model: the transition rates of the allelic frequency shifts depend only on the allele frequency, and are therefore equal regardless of allele increases or decreases in the population (Durrett 2008).We accommodated mutation and selection in the framework of the neutral Moran model by assuming them to be decoupled (Baake and Bialowons 2008; Etheridge ). Mutation is incorporated based on a boundary mutation model, in which mutations occur only in the boundary states. The boundary mutations assumption is met if the mutation rates are small (and N is not too large). More specifically, by comparing the expectations of the diffusion equation with the polymorphic diversity under the Moran model, Schrempf established that should be lower than 0.1. In fact, most eukaryotes fulfill this condition [see Lynch for a review of mutation rates]. Another assumption of our boundary mutation model is that the polymorphic states can only be biallelic. However, this assumption is not a significant constraint as tri-or-more allelic sites are rare in sequences with low mutation rates. We employed the strategy used by Burden and Tang (2016), and separated our model into a time-reversible and a flux part. We wrote the mutation rates as the entries of a specific mutation model, the general time-reversible model (GTR) (Tavaré 1986): , where represents the exchangeabilities between any two alleles ,and the allele base composition (Equation 2). Here, we restricted ourselves to the GTR, as this model simplifies obtaining formal results (Burden and Tang 2016). Because has free parameters, and includes the exchangeabilities for all the possible pairwise combinations of K alleles, we ended up having free parameters in the GTR-based boundary mutation model. Until now, we have essentially described the model proposed by Schrempf ); this work extends this model by including allelic selection. We modeled allelic selection by defining relative selection coefficients : the selection coefficient of an arbitrary allele (A in our experiments) is fixed to 0. The selection coefficients defined this way guarantee that our multi-allelic model behaves neutrally only under the condition that all the selection coefficients are the same and equal to 0. Defining the fitness as the probability that an offspring of allele is replaced with probability (Durrett 2008), we can formulate the component of allelic selection alongside with drift, and thus among the polymorphic states (Equation 2). Altogether, the instantaneous rate matrix of the multivariate Moran model with boundary mutations and allelic selection can be defined aswhere u and v represent a frequency change in the allele counts (though N remains constant). The diagonal elements are defined by the mathematical requirement such that the respective row sum is 0. As the parameters of the population size, mutation rate, and selection coefficients are confined, it is possible to scale them down to a small value of N while keeping the overall dynamics unchanged (Appendix A). The virtual population size N becomes a parameter describing the number of bins the allele frequencies can fall into. As a result, we can think of N either as a population size or a discretization scheme.

The stationary distribution

The stationary distribution of a Markov process can be obtained by computing the vector satisfying the condition (Appendix B). is the normalized stationary vector, and has the solutionk is the normalization constantwhere is the alphabet of the K alleles , representing the monomorphic states, and all the possible pairwise combinations of representing the types of polymorphic states , , …, . For example, for the four-multivariate case, we write as the alphabet of the four nucleotide bases and as all the possible pairwise combinations of the four nucleotide bases . For a population of size N, we have possible states, four of which are monomorphic (Figure 1).

Expected number of Moran events

From .. and , we can compute the expected number of Moran events (mutations, drift, and selection) or the expected divergence per unit of time (in generations) under the model (Appendix C):The quantity (5) can also be interpreted as the overall rate of the model. The expected number of Moran events for the neutral model can be easily calculated by letting . To compare the Moran distance with the standard models of evolution, we recalculated the Moran distance to account only for substitutions events : we corrected by the probability of a mutation and subsequent fixation under the Moran model (Appendix D)

Bayesian inference with stationary distribution

We can define a likelihood function based on the stationary distribution for a set of S independent sites in N individuals by taking the product of over counts of monomorphic and polymorphic sites , thus:We employed a Bayesian approach: we define the prior distributions independently, a Dirichlet prior for and an exponential prior for and ; a Dirichlet and multiplier proposals were set for the aforementioned parameters with tuning parameters guaranteeing a target acceptance rate of 0.234 (Roberts ). We employed the Metropolis-Hastings algorithm (Hastings 1970) for each conditional posterior in a Markov chain Monte Carlo (MCMC) sequence to obtain random samples from the posterior. The algorithm was coded in the R statistical programing language (R Core Team 2015): the packages MCMCpack and expm were integrated in our code to obtain samples from the Dirichlet density and to compute the matrix exponential, respectively (Martin ; Goulet ).

Application: great ape population data

The stationary distribution of the four-multivariate model was employed to infer the distribution of allele frequencies, selection coefficients, and mutation rates from fourfold degenerate sites of exome-wide population data from great apes (Prado-Martinez ). We used 11 populations with up to 27 diploid individuals, totaling ∼2.8 million sites per population (Table 1). Data preparation follows the pipeline described in De Maio . Estimates of the Watterson’s θ genetic diversity is <0.003 for all the studied populations (Schrempf ), validating the boundary mutations assumption of 0.1.

Table 1

Evidence of allelic selection among the great ape populations.

Population	N	S	logp(c\|MM)	logp(c\|MS)	log BF
African humans	6	2827135	−3941390.98	−3940993.95	397
Non-African humans	12	2826956	−3940071.64	−3939858.12	213
Eastern gorillas	6	2823830	−3917375.00	−3917370.00	5
Western gorillas	54	2813092	−3955462.98	−3954663.09	799
Western chimpanzees	10	2823911	−3935188.83	−3934928.50	260
Nigeria-Cameroon chimpanzees	20	2825739	−3980386.43	−3979429.05	957
Eastern chimpanzees	12	2822976	−3961202.57	−3960561.15	641
Central chimpanzees	8	2822685	−3958674.29	−3957704.55	969
Bonobos	26	2824240	−3948520.55	−3947835.54	685
Bornean orangutans	10	2824768	−3952527.89	−3952358.67	169
Sumatran orangutans	10	2824618	−3973247.40	−3972725.44	521

The number of haploid individuals and the number of fourfold degenerate sites per population are indicated by N and S, respectively. The log Bayes factors ( BF) were calculated as the sum over the product of the allele counts and the posterior predictive probabilities under the Moran model with boundary mutations and allelic selection . BF favors the model with allelic selection when >1.

Data availability

Data and R scripts necessary to confirm the findings of this article are available on GitHub (https://github.com/pomo-dev/pomo_selection). Supplemental material available at FigShare: https://doi.org/10.6084/m9.figshare.8180960.

Results

Simulations and algorithm validation

To validate the analytical solution for the stationary distribution of the multivariate Moran model, we compare it to the numerical solution obtained by calculating the probability matrix of for large enough t. We confirmed that the numerical solution converges to the analytical solution (Supplemental Material, Figure S1). We validated the Bayesian algorithm for estimating population parameters from the stationary distribution by performing simulations (Figures S2–S5 and Table S2). Our algorithm efficiently recovers the true population parameters from simulated allele counts. We tested the algorithms for different numbers of sites (, and ) and state-spaces ( 5, 10, and 50). The number of sites does not increase the computational time substantially and is not a limiting factor for genome-wide analysis. In contrast, the size of the state-space influences the computational time. For larger state-spaces, N, more iterations are needed to obtain converged, independent, and mixed MCMC chains during the posterior estimation.

Patterns of allelic selection in great apes

To test the role of allelic selection defining the distribution of alleles in the great apes, we compared the neutral model and the model with allelic selection (). Using the predictive stationary distribution and the observed allele counts, we computed the Bayes factors (BF) favoring the more complex model (i.e., BF favors the model with allelic selection) for all populations. It is clear that fits the data considerably better for most of the studied great apes ( BF , Table 1). The only exception is the Eastern gorilla population, for which a lower BF was obtained ( BF , Table 1). We have also corroborated our BF by inspecting the fit of the predictive distribution of and with the allele counts (Figure S6, A–K). The allele counts for the polymorphic states are not symmetrical; generally, one allele is preferred, and thus also the polymorphic states that have it in higher proportions. As expected, we observed that better reproduces the skewed distribution of allele counts among great apes. We further investigated the patterns of allelic selection in great apes by analyzing the posterior distribution of the relative selection coefficients of C, G, and T ( was set to 0) under . A general pattern of allelic selection is observed in great apes: the selection coefficients of C and G are similar (meaning that their posterior distributions largely overlap), but different from the selection coefficient of T, which, in turn, overlaps 0 (approximately equal to the selection coefficient of A) (Figure 2). The only exception is Eastern gorillas, for which the selection coefficients are all only slightly higher than 0 and rather similar (Figure 2). This result corroborates the relatively low BF found for evidence of allelic selection in the Eastern gorilla population.

Figure 2

Scaled allelic selection coefficients for the great apes fourfold degenerate synonymous sites. The boxplots represent the posterior distribution of the C, G, and T scaled selection coefficients ( was set to 0); the estimates were obtained using the four-variate Moran model. The line in blue represents . We scaled or selection coefficients according to Equation 11 in Appendix A: . Note that, if σ is small enough, the scaling is valid, as . Table S3 summarizes the average unscaled selection coefficients for each great ape population. We further explored this result in order to check if the patterns of GC-bias found among great apes can be associated with gBGC. We correlate the GC-bias per chromosome (by averaging the scaled and ) with the chromosome size and recombination rate in the non-African human population (Figure S7), for which these data are particularly well characterized (Jensen-Seaman 2004). We found a significant positive correlation between the GC-bias and recombination rate (Spearman’s ρ = 0.57, P = 0.006), but a negative correlation with the chromosome length (Spearman’s ρ = −0.52, P = 0.014). Although the patterns of selection among great apes are similar qualitatively, they differ quantitatively. For example, the Central chimpanzees have patterns of GC-bias ∼6.17/4.05 (, Table S3 and Figure 2; scaled according to Equation 11), while the closely related population of Western chimpanzees shows less strong patterns (∼2.21/2.30). Likewise, the GC-bias content in African and non-African human populations contrasts: 4.47/2.86 and 1.83/1.76, respectively. These results show that the patterns of allelic selection vary greatly among great apes, even among closely related populations. It has been hypothesized that gBGC is a compensation mechanism for the mutational bias that exists in favor of the weak alleles, A and T (Duret and Galtier 2009; Philippe )—the AT/GC toggling effect. We observed that mutation rates from strong to weak alleles are more frequent (by a factor of 2.80; 3.26 if the stationary frequencies are accounted for), while no mutational bias was found between alleles of the same type (1.02; 0.98 if the stationary frequencies are accounted for; Table S3). As the estimated selection coefficients have a clear pattern of GC-bias in most great apes, we can conclude that our analyses are congruent with the expectations of the AT/GC toggling effect. Furthermore, we compared our method with that of Glémin , by considering only two alleles [the strong (S) and weak (W) alleles] using human allele counts from the first human chromosome, divided into 51 regions of 1 million sites (data taken from Glémin ). We compared estimates of the gBGC rate coefficient as predicted by our model and that of Glémin ( and B, respectively), and observed that they are negatively correlated (Spearman’s , value = 0.012). Interestingly, B correlates significantly with our estimates of (the mutation rate of weak to strong alleles; , value = 0.001). We have further checked the influence of the fixed sites in our estimates of gBGC, and, as expected, we observed that correlates positively with the percentage of monomorphic sites (, value = 0.012); intriguingly, B is negatively correlated (, value = 0.001). Scatter plots of the mentioned correlation tests can all be found in Figure S8.

and the total rate of mutation and selection in great apes

It is widely known that the intensity of mutation and selection reflect population demography. To check whether the estimated mutation and selection coefficients among great ape populations may be explained by demography, we tested the correlation between the total rate of mutation and selection and (obtained from Tenesa ; Prado-Martinez ). Positive correlations between the total mutation and selection rates and the effective population size were obtained (Figure 3): Pearson’s correlation coefficient of 0.57 (P = 0.089) and 0.89 (P 0.001), respectively. These correlations were obtained using independent contrasts (Felsenstein 1985) accounting for the great apes phylogeny as predicted in Prado-Martinez .

Figure 3

Correlating with the strength of (A) mutation and (B) selection in great apes. The strength of mutation and selection were obtained by summing up the scaled mutation rates and selection coefficients, respectively. Great ape populations are numbered: 1. African humans, 2. Non-African humans, 3. Eastern gorillas, 4. Western gorillas, 5. Western chimpanzees, 6. Nigeria-Cameroon chimpanzees, 7. Eastern chimpanzees, 8. Central chimpanzees, 9. Bonobos, 10. Bornean orangutans, and 11. Sumatran orangutans. Estimates of were taken from Prado-Martinez and Tenesa . *Correlation coefficients calculated using independent contrasts and correcting for the effect of the great apes phylogeny [as predicted in Prado-Martinez ]. This result shows that plays an important role in determining the intensity of selection. In particular, it becomes clear that the different patterns of GC-bias found among great apes are due, in part, to different demographies. For example, Central chimpanzees have the highest GC-bias among the studied great apes, and they are indeed the population that was estimated with the largest (30,000; Prado-Martinez ). Eastern gorillas showed the opposite pattern; this population had no evidence of GC-bias (with very homogeneous selection coefficients) and congruently Prado-Martinez estimated its as only 2000—the lowest of the studied populations.

Comparing the expected number of substitutions in great apes

We calculated the expected number of substitutions under and to evaluate the impact of allelic selection (in particular, GC-bias) in the evolutionary process. With Equation 6, we calculated and using the posterior estimates of the respective model parameters. We observe that, for most great ape populations, the expected number of substitutions is lower when allelic selection is accounted for (Table 2); Eastern gorillas are an exception, and the opposite pattern was observed. We also calculated the ratio between the expected number of substitutions in both models (i.e., ), and we obtained minor (99.8% in Bornean orangutans) to major (82.1% in bonobos) deviations; the average difference is −7.3% (Table 2). These results suggest that not accounting for GC-bias may distort the reconstructed evolutionary process by overestimating the expected number of substitutions.

Table 2

Expected number of substitutions per unit of time

Population	dMM*×103	dMS*×103	dMS/dMM
African humans	0.123	0.120	0.978
Non-African humans	0.041	0.039	0.954
Eastern gorillas	0.061	0.064	1.045
Western gorillas	0.011	0.009	0.845
Western chimpanzees	0.054	0.052	0.956
Nigeria-Cameroon chimpanzees	0.045	0.038	0.858
Eastern chimpanzees	0.073	0.066	0.910
Central chimpanzees	0.130	0.114	0.873
Bonobos	0.019	0.016	0.821
Bornean orangutans	0.077	0.077	0.998
Sumatran orangutans	0.111	0.106	0.959

The expected number of substitutions for the four-variate Moran model with boundary mutations and allelic selection were calculated based on the posterior distributions of the model parameters and Equation 6. The relative difference between the average number of events between the two models was used to assess how dissimilar these distances are. We complement this result by comparing the posterior distribution of the mutations rates in and . Because we wanted to identify the mutational types that may be estimated differently between these models, we calculated the relative difference between the mutation rate from allele to allele using the following ratio: . If for a certain mutation rate , then this mutation rate is being underestimated in when compared with (and vice versa if ); if the mutation rates are equally estimated in both models. We observed a systematic bias among great apes. While weak-to-weak and strong-to-strong mutation rates are generally nondeferentially estimated in both models (most of their r overlap 1, Figure 4) the strong-to-weak and weak-to-strong mutation rates are generally biased in . In particular, we obtained that weak-to-strong mutation rates are augmented, while mutations rates from strong-to-weak alleles are deprecated (Figure 4), which suggests that not accounting for GC-bias may bias the estimation of mutation rates. Eastern gorillas behave differently by not showing significant differences between the estimated mutations rates (all overlap 1, Figure 4).

Figure 4

Relative difference in the mutation rates estimated under the neutral and non-neutral Moran model. represents the ratio between the mutation from allele to allele in the model with allelic selection and the model with boundary mutations: . The 12 mutational types are indicated in the western gorillas plot: all of the plots follow this arrangement.

Discussion

In this work, we built on the multivariate Moran model with boundary mutations and allelic selection to explain the population processes shaping the observed distribution of alleles. We obtained new formulae to characterize this model. In particular, we derived the stationary distribution and the rate of the process. In addition, we built a Bayesian framework to estimate population parameters (mutation rates and selection coefficients) from population data. This work accomplishes tasks set by Schrempf and Hobolth (2017), who observed derivations from neutrality without having a model in place to enlighten the causes.

Variable patterns of gBGC among great apes

A genome-wide application to the great apes provides important insight into the strength and magnitude of GC-bias patterns and also the impact of gBGC in the evolutionary process. To our knowledge, this is the first work giving a population perspective of the patterns of GC-bias in nonhuman populations. Here, we focus on GC-bias because it is a genome-wide effect. Mathematically speaking, it is difficult to disentangle gBGC from directional selection: they may have different biological explanations, but represent the exact same process modeling-wise (i.e., one allele is preferred over the others). Therefore, existing signatures of directional selection are most likely canceling out, when several site-histories (∼2.8 million sites in our case) are summarized to perform inferences. In agreement with previous studies in mammals and humans (Spencer ; Capra ; Lartillot 2013; Lachance and Tishkoff 2014; Glémin ), we found that gBGC is weak on average. Indeed, among great apes, the effect of GC-bias is 2.751.27 (value obtained by averaging scaled and ), consistent with the nearly neutral scenario (Ohta and Gillespie 1996; Vogl and Bergman 2015). Other studies provided estimates of the scaled conversion coefficient in coding regions: Lynch (2010) estimated as 0.82 in humans and Lartillot (2013) adopted a phylogenetic approach that predicted scaled conversion coefficients <1 in all apes. Our estimates are comparatively higher; however, ours methods and those of Lynch (2010) and Lartillot (2013) have different underlying assumptions. In particular, our method employs the Moran model, which has a rate of genetic drift twice as fast as the Wright-Fisher model. Therefore, we expect to estimate selection coefficients that are twice as high as those in the studies cited. We found no quantitative agreement between our estimates of the gBGC rate coefficient and those derived from the method of Glémin . In addition, we found that our model attributes to mutation what Glémin attributes to gBGC. This might be a consequence the use of monomorphic sites in our method. Indeed, unlike those of Glémin , our estimates of gBGC correlate positively with the percentage of fixed sizes. In general, the gBGC rate coefficient should promote greater fixation by boosting the purging of polymorphic sites (at least for low mutation rates, as those observed in humans). On the other hand, Glémin also considered a varying GC-content, which may explain why their estimates of gBGC do not correlate with the percentage of fixed sites. We have preliminary evidence showing that monomorphic sites can significantly impact estimates of population parameters. Nevertheless, a more comprehensive model accounting for both fixed states and variable GC-content would be necessary to disentangle their relative contribution to explaining allele counts. The patterns of GC-bias we found in great apes are in concordance with the well-known process of gBGC. As expected, we observed that the larger the recombination rate or the lower the chromosome length, the higher the GC-effect. Evidently, recombination promotes gBGC; however, a negative association between gBGC and chromosome size is expected [in most organisms, small chromosomes undergo more recombination per unit of physical distance than large chromosomes (Kaback )]. We performed these analyses in non-African humans, for which these data are available; however, we are confident that the patterns of GC-bias found in great apes are due to gBGC. It has been hypothesized that GC-bias is a compensation mechanism for the mutational bias that exists in favor of the weak alleles, A and T (Duret and Galtier 2009; Galtier ; Philippe ). Congruent with this expectation, we observed that mutation rates from strong to weak alleles are higher, but rather similar between alleles of the same type. Interestingly, as we have demonstrated, this symmetric manner by which mutations and selection act in great apes leads the number of substitutions to decrease on average. This suggests that AT/GC toggling may increase population variability by promoting more polymorphic sites; however, further studies would be necessary to clarify this prediction.

Intensity of gBGC and demography in great apes

Glémin hypothesized that differences in GC-bias intensity among human populations were due to the effects of demography. We also advance that demography regulates the intensity of gBGC in great apes. We obtained a positive correlation between the total rate of selection and in great apes. An important conclusion of our study is that patterns of gBGC can change rapidly due to demography, even among closely related populations. In fact, most of the studied populations are known to have diverged <0.5 MYA (Prado-Martinez ). Here, we showed that GC-bias determines the genome-wide base composition of genomes in a factor proportional to [or in the true dynamic]. Therefore, by either changing or s, we are able to change the AT/GC composition of genomes. Because we were able to correlate with the intensity of allelic selection (Pearson’s ρ = 0.89), we are convinced that demography plays a major role determining the base composition of great ape genomes. Studies using life history traits (i.e., body size) in mammals (Romiguier ) and ancestral reconstructions of the effective population size in birds (Weber ) also advocated for correlations between and GC-content [although not as strong as that found here; ρ ∼ in Weber ]. In contrast, Galtier did not find this correlation in a data set covering 31 species of distinct metazoa phyla (including vertebrates, insects, molluscs, crustaceans, echinoderms, tunicates, annelids, nematodes, nemertians, and cnidarians). This is most likely happening because aspects of the recombination landscape, such as genome-wide recombination rate, length of gene conversion tracts, and repair biases, may also affect the intensity of gBGC (Duret and Galtier 2009; Lesecque ; Galtier ). As the recombination landscape varies significantly across species, but not so much across closely related populations (e.g., the karyotype is very conserved among great apes, with humans having 46 diploid chromosomes whereas other great apes having 48), we expected stronger correlations between the intensity of gBGC and demography. Knowing to what extent variations in or s determine the base composition of genomes will require further study. In particular, determining s experimentally in different populations/species would help assess the real impact of gBGC. If we assume that s varies slightly among closely related populations/species, then we might attribute different intensities of GC-bias almost solely to demographic effects, which simplifies the task of accommodating gBGC in population models.

gBGC calls for caution in molecular and phylogenetic analyses

The effects of gBGC in molecular analysis have been described extensively in the literature [reviewed in Romiguier and Roux (2017)]. We complement these results by showing how GC-bias affects the base composition of genomes, and how the mutation rates and genetic distances may be biased if GC-bias is not properly accounted for. In particular, we observed that mutation rates from weak-to-strong and strong-to-weak alleles are systematically over and underestimated, respectively. The idea that gBGC may distort the reconstructed evolutionary process comes mainly from phylogenetic studies. For example, it is hypothesized that gBGC may promote substitution saturation (Romiguier and Roux 2017). We have shown that the number of substitutions may be significantly overestimated if we do not account for GC-bias, meaning that gBGC may indeed promote branch saturation. Based on this and other gBGC-related complications [e.g., GC-bias promotes incomplete lineage sorting (Hobolth )], some authors advocate that only GC-poor markers should be used for phylogenetic analysis (McCormack ; Romiguier ). Contradicting this approach, our results show that we may gain more inferential power if GC-bias is accounted for when estimating evolutionary distances. Here, we have not performed phylogenetic inference, but previous applications of the Moran model to phylogenetic problems (i.e., PoMo) (De Maio ; Schrempf ) show that it can be done. Therefore, a necessary future work would be to test the effect of allelic selection (or, more specifically, GC-bias) in phylogeny reconstruction; in particular, it would be of major interest to determine how much of its signal can account for the increase in accuracy of tree estimation. Recently, a nucleotide substitution process that accounts for gBGC was proposed by Lartillot (2013). In this latter model, a scaled conversion coefficient is used to correct substitution rates in a manner similar to that used here to calculate the expected number of substitutions for the Moran distance (i.e., assessing the relative fixation probabilities under GC-bias, File S3). Therefore, these models should perform similarly, with the exception that PoMo should be able to disentangle the contribution of selection and mutation to the observed diversity, as it additionally accounts for polymorphic sites.

Conclusions

Despite widespread evidence of gBGC in several taxa, several questions remain open regarding the role of gBGC in determining the base composition of genomes. In this work, we provide a mechanistic model and theoretical results that allow quantification of patterns of gBGC in nonhuman closely related populations, providing a new layer of understanding of the tempo and mode of gBGC evolution in vertebrate genomes. In addition, our multivariate Moran model with allelic selection makes a significant contribution to the endeavor of estimating population parameters from multi-individual population-scale data. Importantly, our analysis showed that gBGC may significantly distort estimates of population parameters and genetic distances, highlighting that gBGC-aware models should be used when employing molecular phylogenetics and population genetics analyses. We stress that, although our application to great apes show evidence of GC-bias, our framework can be employed more generally to estimate patterns of nucleotide usage and associated mechanisms of evolution.

48 in total

Quantifying GC-Biased Gene Conversion in Great Ape Genomes Using Polymorphism-Aware Models.

Methods

The multivariate Moran model with allelic selection

The stationary distribution

Expected number of Moran events

Bayesian inference with stationary distribution

Application: great ape population data

Data availability

Results

Simulations and algorithm validation

Patterns of allelic selection in great apes

and the total rate of mutation and selection in great apes

Comparing the expected number of substitutions in great apes

Discussion

Variable patterns of gBGC among great apes

Intensity of gBGC and demography in great apes

gBGC calls for caution in molecular and phylogenetic analyses

Conclusions

Review 1. Biased gene conversion: implications for genome and sex evolution.

2. Evolution in Mendelian Populations.

3. Recent human effective population size estimated from linkage disequilibrium.

4. Strong regional biases in nucleotide substitution in the chicken genome.

5. Rate, molecular spectrum, and consequences of human mutation.

6. Genome-scale phylogeny and the detection of systematic biases.

7. Comparative recombination rates in the rat, mouse, and human genomes.

Review 8. Biased gene conversion and the evolution of mammalian genomic landscapes.

9. GC-biased gene conversion promotes the fixation of deleterious amino acid changes in primates.

10. The influence of recombination on human genetic diversity.

1. Nucleotide Usage Biases Distort Inferences of the Species Tree.

2. Population dynamics of GC-changing mutations in humans and great apes.

3. GC-biased gene conversion in X-chromosome palindromes conserved in human, chimpanzee, and rhesus macaque.

4. Runaway GC Evolution in Gerbil Genomes.