Michael A Gilchrist1, Wei-Chen Chen2, Premal Shah3, Cedric L Landerer4, Russell Zaretzki5. 1. Department of Ecology & Evolutionary Biology, University of Tennessee, Knoxville National Institute for Mathematical and Biological Synthesis, Knoxville, Tennessee mikeg@utk.edu. 2. Department of Ecology & Evolutionary Biology, University of Tennessee, Knoxville Center for Biologics Evaluation and Research, Food and Drug Administration, Silver Spring, Maryland. 3. Department of Biology, University of Pennsylvania. 4. Department of Ecology & Evolutionary Biology, University of Tennessee, Knoxville. 5. National Institute for Mathematical and Biological Synthesis, Knoxville, Tennessee Department of Business Analytics and Statistics, University of Tennessee, Knoxville.
Abstract
Extracting biologically meaningful information from the continuing flood of genomic data is a major challenge in the life sciences. Codon usage bias (CUB) is a general feature of most genomes and is thought to reflect the effects of both natural selection for efficient translation and mutation bias. Here we present a mechanistically interpretable, Bayesian model (ribosome overhead costs Stochastic Evolutionary Model of Protein Production Rate [ROC SEMPPR]) to extract meaningful information from patterns of CUB within a genome. ROC SEMPPR is grounded in population genetics and allows us to separate the contributions of mutational biases and natural selection against translational inefficiency on a gene-by-gene and codon-by-codon basis. Until now, the primary disadvantage of similar approaches was the need for genome scale measurements of gene expression. Here, we demonstrate that it is possible to both extract accurate estimates of codon-specific mutation biases and translational efficiencies while simultaneously generating accurate estimates of gene expression, rather than requiring such information. We demonstrate the utility of ROC SEMPPR using the Saccharomyces cerevisiae S288c genome. When we compare our model fits with previous approaches we observe an exceptionally high agreement between estimates of both codon-specific parameters and gene expression levels ([Formula: see text] in all cases). We also observe strong agreement between our parameter estimates and those derived from alternative data sets. For example, our estimates of mutation bias and those from mutational accumulation experiments are highly correlated ([Formula: see text]). Our estimates of codon-specific translational inefficiencies and tRNA copy number-based estimates of ribosome pausing time ([Formula: see text]), and mRNA and ribosome profiling footprint-based estimates of gene expression ([Formula: see text]) are also highly correlated, thus supporting the hypothesis that selection against translational inefficiency is an important force driving the evolution of CUB. Surprisingly, we find that for particular amino acids, codon usage in highly expressed genes can still be largely driven by mutation bias and that failing to take mutation bias into account can lead to the misidentification of an amino acid's "optimal" codon. In conclusion, our method demonstrates that an enormous amount of biologically important information is encoded within genome scale patterns of codon usage, accessing this information does not require gene expression measurements, but instead carefully formulated biologically interpretable models.
Extracting biologically meaningful information from the continuing flood of genomic data is a major challenge in the life sciences. Codon usage bias (CUB) is a general feature of most genomes and is thought to reflect the effects of both natural selection for efficient translation and mutation bias. Here we present a mechanistically interpretable, Bayesian model (ribosome overhead costs Stochastic Evolutionary Model of Protein Production Rate [ROC SEMPPR]) to extract meaningful information from patterns of CUB within a genome. ROC SEMPPR is grounded in population genetics and allows us to separate the contributions of mutational biases and natural selection against translational inefficiency on a gene-by-gene and codon-by-codon basis. Until now, the primary disadvantage of similar approaches was the need for genome scale measurements of gene expression. Here, we demonstrate that it is possible to both extract accurate estimates of codon-specific mutation biases and translational efficiencies while simultaneously generating accurate estimates of gene expression, rather than requiring such information. We demonstrate the utility of ROC SEMPPR using the Saccharomyces cerevisiae S288c genome. When we compare our model fits with previous approaches we observe an exceptionally high agreement between estimates of both codon-specific parameters and gene expression levels ([Formula: see text] in all cases). We also observe strong agreement between our parameter estimates and those derived from alternative data sets. For example, our estimates of mutation bias and those from mutational accumulation experiments are highly correlated ([Formula: see text]). Our estimates of codon-specific translational inefficiencies and tRNA copy number-based estimates of ribosome pausing time ([Formula: see text]), and mRNA and ribosome profiling footprint-based estimates of gene expression ([Formula: see text]) are also highly correlated, thus supporting the hypothesis that selection against translational inefficiency is an important force driving the evolution of CUB. Surprisingly, we find that for particular amino acids, codon usage in highly expressed genes can still be largely driven by mutation bias and that failing to take mutation bias into account can lead to the misidentification of an amino acid's "optimal" codon. In conclusion, our method demonstrates that an enormous amount of biologically important information is encoded within genome scale patterns of codon usage, accessing this information does not require gene expression measurements, but instead carefully formulated biologically interpretable models.
Genomic sequences encode a trove of biologically important information. Over 49,600
genomes are currently available from the Genomes OnLine Database (Pagani et al. 2012) alone and the flow of newly sequenced
genomes is expected to continue far into the future. As a result, developing ways to
turn these data into useful information is one of the major challenges in the life
sciences today. Although great strides have been made in extracting this information,
ranging from the simple, for example, identification of protein-coding regions, to the
more difficult, for example, identification of regulatory elements (Hughes et al. 2000; Wasserman and Sandelin 2004; Dunham et al. 2012; Kundaje et al. 2015), much of this information remains untapped. To address
one aspect of this challenge, we present a method to estimate the expression levels of
every gene, codon-specific selection coefficients, and mutation biases solely from
patterns of codon usage bias (CUB) in protein-coding sequences within a genome.One of the earliest arguments against neutrality between synonymous codon usages was
given by Clarke (1970). Since then,
evidence for selection acting on CUB has been repeatedly observed. CUB clearly varies
systematically within and between open-reading frames (ORFs) within a species as well as
across species (Grantham et al. 1980;
Ikemura 1981, 1985; Bennetzen and
Hall 1982; Sharp and Li 1987;
Andersson and Kurland 1990; Qin et al. 2004; Gilchrist and Wagner 2006; Chamary et al. 2006; Hershberg and Petrov 2008; Plotkin and
Kudla 2011). These patterns in CUB are driven by two evolutionary forces:
Mutation bias and natural selection (Ikemura
1981; Bulmer 1988, 1991). Current evidence supports multiple
selective forces contributing to the evolution of CUB. Most of these hypothesized
selective forces affect the efficiency and efficacy of ORF translational through
factors, such as ribosome pausing times (Andersson
and Kurland 1990; Bulmer 1991;
Sørensen and Pedersen 1991;
Plotkin and Kudla 2011; Shah and Gilchrist 2011), missense and
nonsense errors (Kurland 1987, 1992; Akashi 1994; Gilchrist 2007;
Drummond and Wilke 2008, 2009), cotranslational protein folding (Thanaraj and Argos 1996; Kimchi-Sarfaty et al. 2007; Tsai et al. 2008; Pechmann and Frydman 2013), equalizing tRNA availability
(Qian et al. 2012), and the stability
and/or accessibility of mRNA secondary structures (Kudla et al. 2009; Tuller et al. 2010; Gu et al.
2012; Bentele et al. 2013). The
relative importance of each of these selective forces is expected to vary both within
and between genes. The effects of these forces can be unified within a single framework
by considering how the codon usage of a given ORF alters the ratio of the expected cost
of protein synthesis over the expected benefit of protein synthesis, or the
cost–benefit ratio η for short (Gilchrist et al. 2009) (see Materials and Methods).One likely way different synonymous codons lead to changes in a gene’s
cost–benefit ratio η results from differences in the abundances of cognate and
near cognate tRNAs and the stability of the Watson–Crick base pairing between a
given codon and tRNA anticodons (Ikemura
1981; Zaher and Green 2009;
Plotkin and Kudla 2011). These
differences, in turn, are predicted to lead to differences in ribosome pausing times and
error rates between codons. Specifically, codons with higher abundances of cognate and
near-cognate tRNAs are thought to have both shorter pausing times and lower error rates
than codons with lower abundances of cognate and near-cognate tRNA (Ikemura [1981]; Kurland [1992], though see Shah and Gilchrist [2010] for a more nuanced view).The assumption that natural selection favors codon usage which reduces the protein
synthesis cost–benefit ratio η implies that the strength of this selection
should scale with the gene’s protein synthesis rate: Highly expressed genes should
show the strongest bias for codons with shorter pausing times and error rates (Ikemura 1981, 1985; Sharp and Li
1986, 1987). As a result, given
sufficiently large effective population size N, such that
high expression genes contain some signal of adaptation, the patterns of CUB observed
within a genome should contain a significant amount of information about the average
protein synthesis rate for a given gene. Further, because low expression genes
are under very weak selection to reduce η, their patterns of CUB should provide
information on the mutational biases experienced within a genome.Accessing this information held within CUB patterns of an organism’s genome has
been the focus of several decades of research in molecular evolution. However, most
approaches examine mutation bias and selection in isolation and ignore their possible
interactions. The strength of mutation bias has typically been investigated by comparing
the differences in GC content of synonymous sites of codons with the rest of the gene
(Galtier et al. 2001; Knight et al. 2001; Palidwor et al. 2010).Numerous methods have been used to quantify or describe selection on synonymous codon
usage. For example, Sharp and Li (1987)
relied on the codon usage in a set of highly expressed genes to identify the
“optimal” codon for a given amino acid as these genes are under stronger
selection to be translated efficiently and accurately. Approaches that focus on a subset
of high expression genes in this way implicitly assume the contribution of mutation bias
to CUB is overwhelmed by natural selection and, therefore, can be ignored. As our
results show, because this view lacks an explicit population genetics framework it is
likely overly simplistic and may lead to the misidentification of optimal codons.Phylogenetic models of protein evolution, some of which are derived from population
genetics models, have also been used to generate estimates of codon-specific selection
coefficients and mutation biases (Tamuri et al.
2012; Rodrigue et al. 2010;
Yang and Nielsen 2008). Other
approaches have relied on intraspecific variation to make similar types of inferences
(Keightley and Eyre-Walker 2007; Lawrie et al. 2013) or a combination of
interspecific divergence and intraspecific variation (Akashi 1995). However, all of these approaches fail to
disentangle how the contributions of mutation bias and natural selection change with
gene expression. Furthermore, these models are either fitted independently across genes
and thus estimate a large number of gene-specific parameters from a relatively small
amount of data or assume that the magnitude of selection is uniform across genes.We, along with others, have previously worked to link gene expression levels to patterns
of CUB by nesting a mechanistic model of protein translation into a population genetics
model of allele fixation in order to estimate codon-specific mutation and selection
parameters (Gilchrist and Wagner 2006;
Gilchrist 2007; Shah and Gilchrist 2011; Wallace et al. 2013). Although these methods represent
significant advances in estimating codon-specific mutation biases and selection
coefficients from genomic data, they are limited to genomes with independent
measurements of gene-specific protein synthesis rates or a close proxy. Historically,
mRNA abundances have been used as such a proxy due to the fact that generating reliable
genome scale measurements of protein synthesis is an expensive undertaking (e.g., Arava et al. 2005; Ingolia et al. 2009; Li et al. 2014). In contrast, the method proposed here does away with the
necessity of having protein synthesis rate estimates (or their proxy) and provides
estimates of the average protein synthesis rate for each gene, . Importantly, our method also provides estimates of
codon-specific mutation biases and translational inefficiencies, which is the additive
contribution of a codon to the cost of protein synthesis.Furthermore, we can combine our gene-specific estimates of protein synthesis rates and
codon-specific translational inefficiencies to produce estimates of the strength of
natural selection on synonymous substitutions on a gene-by-gene and codon-by-codon
basis. Estimating gene-specific selection coefficients on synonymous codons is critical
to determining whether a gene is evolving under purifying or positive selection. Current
models to identify the selection regime under which a gene evolves rely on estimates of
the rates of nonsynonymous changes to rates of synonymous changes
(dN/dS) (Li
et al. 1985; Nei and Gojobori
1986; Yang and Nielsen 2000).
However, the commonly made assumption that all synonymous changes within a gene are
neutral can bias values of dN/dS toward overestimating
the number of genes evolving under positive selection (Spielman and Wilke 2015). By accurately estimating strength
of selection on synonymous changes, researchers can begin to explicitly incorporate
these effects into methods for identifying purifying and positive selection.In order to extract information from the genome-wide patterns of CUB using our
Stochastic Evolutionary Model of Protein Production Rate (SEMPPR) (Gilchrist 2007; Shah
and Gilchrist 2011), we build on the Bayesian statistical advances of Wallace .
Because the costs in our model can be interpreted as proportional differences in
ribosome overhead costs (ROC) due to ribosome pausing, for simplicity we refer to the
model formulated here as ROC SEMPPR (see Materials and Methods).Using the Saccharomyces cerevisiae S288c genome as an example, we
demonstrate that ROC SEMPPR can be used to accurately estimate differences in
codon-specific mutation biases and contributions to the protein synthesis
cost–benefit function η without the need for gene expression data. ROC
SEMPPR’s codon-specific estimates of mutation biases and translational
inefficiencies generated without gene expression data match almost exactly those
generated with gene expression data (Pearson correlation coefficient
for both sets of parameters). In the end, we observe a
Pearson correlation coefficient of between our predicted protein synthesis rates and the
mRNA abundances from Yassour (which was identified as the most reliable data set out
of five different mRNA abundance data sets by Wallace ). The variation between our
predictions and Yassour ’s measurements is on par with the variation observed between mRNA
abundance measurements from different laboratories (Wallace et al. 2013). Further, our predictions show strong
and significant correlations with measurements of mRNA abundance from four other labs
and estimates of protein synthesis rates based on ribosome-profiling (RPF) data from
three other labs (supplementary figs. S4 and S5, Supplementary Material online).By releasing our work as a standalone package in R (see Chen ), researchers can
potentially take the genome of any microorganism and obtain accurate, quantitative
information on the effect of synonymous substitutions on protein translation costs, gene
expression levels, and the strength of selection on CUB.
Results
The posterior means estimated from our Bayesian Markov chain Monte Carlo (MCMC)
simulation of ROC SEMPPR demonstrate two key facts: 1) We are able to estimate the
strength of selection on synonymous CUB from the patterns of codon usage observed within
a genome and 2) we can attribute this selection to the interaction of two underlying
biological traits: Difference between synonymous codons in their contribution to the
cost–benefit ratio η for protein synthesis and the protein synthesis rate of
the ORF averaged across its various environments and
life-stages.For this study, we scale our codon-specific translational inefficiencies relative to the
strength of genetic drift, , where q described the proportional
decline in fitness per ATP wasted per unit time. More specifically,
describes the difference in the contribution of
synonymous codons i and j to the protein synthesis
cost–benefit ratios of an ORF, , scaled by effective population size
and the relative fitness cost of expending an extra ATP
per unit time, q. The greater the contribution of a codon to η, the
greater its inefficiency. For a set of synonymous codons, by convention, we define codon
1 as the codon with the lowest inefficiency, that is, the codon which makes the smallest
additive contribution to η and is most favored by selection. Thus,
for i = 1 and
for i > 1. For notational
simplicity, we will only include the subscripts when needed for clarity.At equilibrium, under the weak-mutation regime (Sella and Hirsh 2005; Shah and
Gilchrist 2011b; McCandlish and
Stoltzfus 2014), the expected frequency of observing a synonymous codon
i (p) of an amino acid in a gene with
an average protein synthesis rate follows a multinomial logistic distribution.
Specifically, for a given amino acid a with
n unique codons where is a unitless measure of codon-specific mutation bias.
Note that, as with for i = 1 but, unlike with
can be positive or negative. Further, because it relies
on the stationary probability of observing a synonymous codon, ROC SEMPPR can only
detect variation in mutation bias, not variation in absolute mutation rates. Additional
model details can be found in the Materials and Methods.The utility of equation (2) is that
it allows us to probabilistically link ROC SEMPPR’s parameters of interest, that
is, codon-specific differences in mutation biases, , translational inefficiencies , and gene-specific protein synthesis rates,
, to the CUB patterns observed within and between ORF of
a given genome. The term represents the number of amino acids that use multiple
codons, whereas n represents the number of genes in the
genome.Because moving between the synonymous codon groups (TCA, TCC, TCG, TCT) and (AGC, AGT)
for Ser requires at least one nonsynonymous nucleotide substitution, we treated these
two groups as if they were different amino acids, Ser4 and Ser2,
respectively. So while strictly speaking, 18 of the canonical 20 amino acids use more
than one codon, because we treat Ser as two separate amino acids, Ser2 and
Ser4, for our purposes . Assuming a log-normal distribution (LogN) with a mean
of 1 as the prior for allows us to employ a random walk Metropolis chain to
estimate the posteriors for , and without the need for any laboratory measurements of gene
expression, . This ability to fit our ROC SEMPPR model
without
data is the major advance of our work over Wallace .
Tables with estimates of gene-specific protein synthesis rates,
, mutation biases, ΔM, and
translational inefficiencies, Δη, based on ROC SEMPPR’s posterior
sampling for the S. cerevisiae genome can be found in the supplementary material, Supplementary Material online.
Evaluating Model Parameter Estimates
Briefly, when fitted to the S. cerevisiae S288c genome, we find
nearly perfect agreement between ROC SEMPPR’s with and
without
estimates for codon-specific protein synthesis
translational inefficiencies, Δη, and mutation bias,
ΔM (Pearson correlation for both sets of parameters, see figs. 1 and 2).
We note that, with the exception Arginine’s , the central 95% credibility intervals (CIs)
for ROC SEMPPR’s Δη and ΔM parameters do not
overlap with zero (see supplementary tables S1–S4, Supplementary Material online). These results indicate that
information on the genome scale parameters, and are robustly encoded and estimable from CUB patterns
and that provides little additional information.
F
Comparison of with and without
ROC SEMPPR estimates for codon-specific
differences in translational efficiencies Δη which have units
where the units of time are set such that
the average protein synthesis rate across the genome,
, equals 1. To improve legibility of the
plots, the two codon amino acids have been combined into two plots and all
of the amino acids with greater than two codons into separate plots. The
dashed blue line represents the 1:1 line between axes and error bars
indicate the 95% posterior CIs for each parameter. For both the
with and without
fits of ROC SEMPPR, all codons but one, Arg
codon CGT, have CIs that do not overlap with 0. As illustrated in the last
plot, a linear regression between estimates of for all codons produces a correlation
coefficient .
F
Comparison of with and without
ROC SEMPPR estimates for codon-specific
differences in mutation biases terms ΔM which are
unitless. Specifically, equals the natural logarithm of the ratio of
the frequencies of synonymous codon 1 to i in the absence
of natural selection. To improve legibility of the plots, the two codon
amino acids have been combined into two plots and all of the amino acids
with greater than two codons into separate plots. The dashed blue line
represents the 1:1 line between axes and error bars indicate the 95%
posterior CIs for each parameter. For both the with and
without fits of ROC SEMPPR, all codons have CIs that do
not overlap with 0. As illustrated in the last plot, a linear regression
between estimates of for all codons produces a correlation
coefficient .
Comparison of with and without
ROC SEMPPR estimates for codon-specific
differences in translational efficiencies Δη which have units
where the units of time are set such that
the average protein synthesis rate across the genome,
, equals 1. To improve legibility of the
plots, the two codon amino acids have been combined into two plots and all
of the amino acids with greater than two codons into separate plots. The
dashed blue line represents the 1:1 line between axes and error bars
indicate the 95% posterior CIs for each parameter. For both the
with and without
fits of ROC SEMPPR, all codons but one, Arg
codon CGT, have CIs that do not overlap with 0. As illustrated in the last
plot, a linear regression between estimates of for all codons produces a correlation
coefficient .Comparison of with and without
ROC SEMPPR estimates for codon-specific
differences in mutation biases terms ΔM which are
unitless. Specifically, equals the natural logarithm of the ratio of
the frequencies of synonymous codon 1 to i in the absence
of natural selection. To improve legibility of the plots, the two codon
amino acids have been combined into two plots and all of the amino acids
with greater than two codons into separate plots. The dashed blue line
represents the 1:1 line between axes and error bars indicate the 95%
posterior CIs for each parameter. For both the with and
without fits of ROC SEMPPR, all codons have CIs that do
not overlap with 0. As illustrated in the last plot, a linear regression
between estimates of for all codons produces a correlation
coefficient .Instead of simply comparing our ROC SEMPPR model’s without
estimates of ΔM and
Δη to its with
estimates, we can also compare these parameters with
other data. Due to the detailed balance requirement of the stationary distribution of
our population genetics model (Sella and Hirsh
2005), differences in ΔM values between codons that
can directly mutate to one another will equal the log of the ratio of their mutation
rates. Thus, our estimates of ΔM provide testable hypotheses
about the ratio of mutation rates in S. cerevisiae. We use estimates
of per base-pair mutation rates from a recent high-throughput mutation accumulation
experiment in S. cerevisiae (Zhu et al. 2014). These experimental estimates of mutation bias,
, are calculated as where is the number of mutations observed per n
bases in the genome. As mutations in mutation accumulation experiments are strand
agnostic, that is, they do not distinguish between the coding and template strand
nucleotides, we cannot distinguish between the mutations NNCNNG and NNGNNC nor NNANNT and NNTNNA. As a result, our empirical estimates of
and are set to 0. We find that our estimates of
codon-specific mutation rates correlate highly with empirical mutation rates in
S. cerevisiae (, fig. 3).
F
Comparison of without
estimates of codon-specific mutation biases
ΔM and estimates generated from mutation
accumulation experiments (Zhu et al.
2014). For each amino acid the codon with the shortest pausing
time is used as a reference and is not shown because, by definition their
values are 0. Pearson correlation
coefficient ρ for all of the codons is given. The solid line represents
the best fit linear regression.
Comparison of without
estimates of codon-specific mutation biases
ΔM and estimates generated from mutation
accumulation experiments (Zhu et al.
2014). For each amino acid the codon with the shortest pausing
time is used as a reference and is not shown because, by definition their
values are 0. Pearson correlation
coefficient ρ for all of the codons is given. The solid line represents
the best fit linear regression.Unlike mutation bias parameters, empirical estimates of the codon-specific
differences in translational efficiencies do not exist. However, one of the simplest
ways of linking a codon to η is based on the indirect cost of codon-specific
ribosome pausing during translation. That is, where t is the average
time a ribosome pauses when translating codon i. We calculate
empirical estimates of pausing times based on a simple model of translation where
pausing times at a codon depend only on its cognate tRNA abundances and associated
wobble parameters (Ikemura 1981; Andersson and Kurland 1990; Sørensen and Pedersen 1991; Kanaya et al. 1999; Gilchrist and Wagner 2006; Zaher and Green 2009; Shah et al. 2013).Specifically, tRNA is the gene copy number of the tRNA
that recognize codon i and w is the
wobble term between the anticodon of and codon i. When a codon is
recognized by its canonical tRNA, we set w = 1.
We assume a purine–purine (RR) or pyrimidine–pyrimidine (YY) wobble term
to be and a purine–pyrimidine (RY/YR) wobble term to
be based on Curran and Yarus (1989) and Lim and
Curran (2001). We find that our genome-wide estimates of
Δt are positively correlated with empirical estimates of
Δt in S. cerevisiae
(, fig. 4).
F
Comparison of without
estimates of codon-specific translational
inefficiencies Δη and estimates of differences in ribosome pausing
times, based on tRNA gene copy number and wobble
inefficiencies. For each amino acid, the codon with the shortest pausing
time is used as a reference and is not shown because, by definition their
values are 0. Pearson correlation
coefficient ρ for all of the codons is given. The
dashed blue line represents the 1:1 line and the red line represents the
best fit linear regression line.
Comparison of without
estimates of codon-specific translational
inefficiencies Δη and estimates of differences in ribosome pausing
times, based on tRNA gene copy number and wobble
inefficiencies. For each amino acid, the codon with the shortest pausing
time is used as a reference and is not shown because, by definition their
values are 0. Pearson correlation
coefficient ρ for all of the codons is given. The
dashed blue line represents the 1:1 line and the red line represents the
best fit linear regression line.
Predicting Protein Synthesis Rates
Given the strong correlation between ROC SEMPPR’s with and
without
estimates of the codon-specific mutation biases
and translational inefficiencies
, it is not surprising that with and
without
estimates of from ROC SEMPPR are highly correlated
(, fig.
5a). More importantly, the without
based estimates of show substantial correlation with the mRNA abundance
based estimates of values from Yassour (, fig.
5b). To be clear, these values are the same values used as inputs to the
with
model fits.
F
Evaluation of predicted gene expression levels between models and empirical
measurements from Yassour . (a) Comparison of
with and without
ROC SEMPPR estimates of protein synthesis
rates, . The units for are proteins/t and time
t is scaled such that the prior for
satisfies . Note the very strong correlation between
the with and without
estimates of for the high expression genes.
(b) Comparison of without
estimates of and empirical measurements of mRNA
abundances, . The empirical mRNA abundance measurements,
[mRNA], are being used here as a proxy for protein synthesis rates, that is,
[mRNA] proteins/t. The
measurements are scaled such that the mean [mRNA] value is 1. Pearson
correlation coefficients ρ are given and the dashed gray line indicates
1:1 line.
Evaluation of predicted gene expression levels between models and empirical
measurements from Yassour . (a) Comparison of
with and without
ROC SEMPPR estimates of protein synthesis
rates, . The units for are proteins/t and time
t is scaled such that the prior for
satisfies . Note the very strong correlation between
the with and without
estimates of for the high expression genes.
(b) Comparison of without
estimates of and empirical measurements of mRNA
abundances, . The empirical mRNA abundance measurements,
[mRNA], are being used here as a proxy for protein synthesis rates, that is,
[mRNA] proteins/t. The
measurements are scaled such that the mean [mRNA] value is 1. Pearson
correlation coefficients ρ are given and the dashed gray line indicates
1:1 line.Supplementary figures S4 and S5, Supplementary Material online, explore this issue further by plotting
ROC SEMPPR’s posterior mean estimates of produced with and
without
against eight sets of experimental data. These data
include three genome-wide estimates based on RPF measurements (Ingolia et al. 2009; Artieri and Fraser 2014; McManus et
al. 2014) and five other genome-wide estimates of mRNA abundances (Arava et al. 2003; Nagalakshmi et al. 2008; Holstege et al. 1998; Sun et al. 2012). The with
posterior estimates are generated using mRNA
abundance measurements from Yassour and are, therefore, independent of the measurements
from other labs. Correlation between estimates for the without
ROC SEMPPR fits and measured mRNA abundances ranges
from 0.534 to 0.707, and measured RPF reads ranges from 0.629 to 0.742. The
correlation between estimates for the with
fits and mRNA provides only a 7–15%
increase in explanatory power over the without
ROC SEMPPR predictions of . Similarly, correlation between
estimates for ROC SEMPPR’s
with
fits and RPF reads provides a 6–12%
increase in explanatory power over its without
predictions of .
Changes in CUB with Protein Synthesis Rate
As first shown in Shah and Gilchrist
(2011), the relationship between codon usage and protein synthesis rate
can range from simple and monotonic to complex. Figure 6 illustrates how codon usage changes
across approximately 2 orders of magnitudes of for each of the multicodon amino acids. Both ROC SEMPPR’s
with and without
model fits accurately predict how CUB changes with
protein synthesis rates (fig. 6). Indeed,
the predicted changes in CUB between the with and
without
ROC SEMPPR model fits are almost indistinguishable
from one another, reflecting the strong agreement between their estimates of
ΔM and Δη across models as discussed above.
F
Model predictions and observed codon usage frequencies as a function of
estimated protein synthesis rate for the S. cerevisiae S288c
genome. The units for are proteins/t and time
t is scaled such that the prior for
satisfies . Each amino acid is represented by a
separate subplot. Solid, dashed, and dotted lines represent the
without
, with
ROC SEMPPR model fits, and a simple logistic
regression approach where the estimation error in is ignored, respectively. None of the
parameter estimates’ 95% CIs overlaps with 0 except
. Genes are binned by their expression levels
with solid dots indicating the mean codon frequency of the genes in the
respective bin. Error bars indicate the standard deviation in codon
frequency across genes within a bin. For each amino acid, the codon favored
by natural selection for reducing translational inefficiency is indicated by
a . The four indicate codons that have been previously
identified as optimal but our ROC SEMPPR model fits indicate these codons
actually are the second most efficient codons. A histogram of the
values is presented in the lower right
corner. Estimates of protein synthesis rates are based on the with
ROC SEMPPR model fits, thus representing our
best estimate of their values.
Model predictions and observed codon usage frequencies as a function of
estimated protein synthesis rate for the S. cerevisiae S288c
genome. The units for are proteins/t and time
t is scaled such that the prior for
satisfies . Each amino acid is represented by a
separate subplot. Solid, dashed, and dotted lines represent the
without
, with
ROC SEMPPR model fits, and a simple logistic
regression approach where the estimation error in is ignored, respectively. None of the
parameter estimates’ 95% CIs overlaps with 0 except
. Genes are binned by their expression levels
with solid dots indicating the mean codon frequency of the genes in the
respective bin. Error bars indicate the standard deviation in codon
frequency across genes within a bin. For each amino acid, the codon favored
by natural selection for reducing translational inefficiency is indicated by
a . The four indicate codons that have been previously
identified as optimal but our ROC SEMPPR model fits indicate these codons
actually are the second most efficient codons. A histogram of the
values is presented in the lower right
corner. Estimates of protein synthesis rates are based on the with
ROC SEMPPR model fits, thus representing our
best estimate of their values.Changes in codon frequency with are the result of a subtle interplay between natural
selection for reducing η and mutation bias (Figure 6). The simplest cases involve two codon amino acids where the same
codon is favored by both selection and mutation bias, that is, Cys, Glu, and
Ser2. In these three cases, the selectively and mutationally favored
codon 1 is used preferentially across all protein synthesis rates and the frequency
of the preferred codon increases monotonically with . The next simplest cases involve two codon amino
acids where codon 1 is favored by selection and codon 2 is favored by mutation bias,
for example, Asp, Asn, and Phe. In these cases, the mutationally favored codon 2 is
used preferentially at low values and the selectively favored codon 1 is used
preferentially used in genes at high values. Nevertheless, as before the codon frequencies
change monotonically with .More complex, nonmonotonic changes in codon frequencies can occur in amino acids that
use three or more codons. For example, the Ile codon ATC has the lowest translational
inefficiency Δη and, therefore, is the most favored codon by natural
selection whereas ATT has the second lowest translational inefficiency. As a result,
both codons initially increase in frequency with increasing at the expense of the most inefficient codon ATA.
However, once the frequency of ATA approaches 0, selection for ATC begins driving the
frequency of ATT down. These nonmonotonic changes in codon frequency is most notable
in Ala, Ile, Thr, and Val. Examining the derivative of with respect to indicates that if , a given codon i will increase in
frequency with , if , that is, if the sum of the derivatives of the
selective advantage of codon i over the other codons is positive.
For the reference codon 1 where, by definition, , we see that this inequality always holds. This
criterion can only be met by the nonreference codon in amino acids with more than two
synonyms and when there are other nonreference codons with lower fitnesses at
appreciable probabilities. In the S. cerevisiae S288c genome, these
conditions can occur when the codon most favored by natural selection is strongly
disfavored by mutation. Although this nonlinear quality of multinomial logistic
regression is well known among statisticians, the fact that nonoptimal codons can
increase with production rate has not been widely recognized by biologists.If we ignore the noise in the data, our with
model fitting simplifies to the standard logistic
regression model applied in Shah and Gilchrist
(2011). This simplification results in a slight distortion of
ΔM estimates and a general attenuation of our estimates of
Δη (Wallace et al. 2013). The
effect of this attenuation can be seen in figure
6 where the changes in CUB predicted from the standard logistic regression
model fit lag behind the predicted changes when either the error in
is accounted for or the data are not used. In the case of Ser2,
controlling for error leads to a change in the codon identified as being favored by
natural selection. Although Shah and Gilchrist
(2011) predicted codon AGC would be favored by selection over AGT, both of
ROC SEMPPR’s with and without
fits predict the opposite. Although, this switch in
order is “significant” in that the 95% CI for
is less than 0, the amino acid Ser2 is
used at very low frequency in high expression genes and its 97.5% CI boundary
lies very close to 0. (The upper boundary lies at 0.00387 and 0.000634 for the
with and without
ROC SEMPPR fits, respectively.) As a result, this
discrepancy is not strongly supported and warrants further investigation.In summary, for genes with protein synthesis rates substantially lower than the
average, that is, , codon usage is largely determined by mutation bias
terms . For about half of the amino acids (e.g., Cys, Lys,
and Pro), in genes with protein synthesis rates ten or more times greater than
average, that is, , codon usage is largely determined by selection for
the codon with the smallest translational inefficiency . This result is largely consistent with the frequent
assumption that in the set of genes with the highest expression levels the most
translationally efficient codon dominates. However, for the amino acids (e.g., Ala,
Ile, and Arg) selection for reducing η in high expression genes is substantially
tempered by the force of mutation bias.
Estimating Selection on Synonymous Codon Usage
The assumptions of the ROC SEMPPR model imply that the codon-specific translational
inefficiencies are independent of codon position within a sequence. As a result, the
relative strength of purifying selection on synonymous codon j in
comparison to codon i in a gene with an average protein synthesis
rate is We remind the reader that includes the effective population size,
, in its definition. As a result, our selection
coefficients S are measured relative to the strength of genetic
drift, , as is commonly done. The distribution of
S across all genes for each alternative to an amino acid’s
reference codon is illustrated in figure
7 and summarized in table 1.
Tables with genome-wide gene and codon-specific estimates of S can
be found in the supplementary material, Supplementary Material online. Recall that S is
scaled by and that the distribution of
values across genes appears to follow a heavy tailed
distribution. As a result even though, by definition, the average value of
is 1, the large majority of genes have
values less than 1. As a result, although purifying
selection on synonymous codons is universal, its selection coefficients are usually
quite small (i.e., ). Nevertheless, because our framework utilizes
information on CUB held across genes, we can clearly detect the signature of
selection at the genome level, specifically in the form of values whose posterior CIs differ from 0, whereas
other approaches might fail.
F
Distribution of gene-specific selection coefficients on synonymous codon
usage from the without
model fit to the S.
cerevisiae genome. Selection coefficient S was
calculated on a gene-by-gene basis and relative to reference codon, which is
most favored by selection and for which, by definition, S
= 0, and is listed first within the legend of each panel. Genes with
were combined together into a single bin.
For reference, the fixation probability of a codon relative to a pure drift
process, , is also plotted (–line).
Summary statistics can be found in table 1.
Table 1
Summary Statistics for Gene-Specific Selection Coefficients on Synonymous
Codon Usage from the without
ROC SEMPPR Model Fit to the
Saccharomyces cerevisiae Genome
Quantile
AA
Codon
S¯
var(S)
1%
5%
25%
50%
75%
95%
99%
A
GCA
−0.4743
0.6171
−4.3994
−1.7817
−0.4715
−0.2096
−0.1135
−0.0620
−0.0466
A
GCC
−0.0648
0.0115
−0.6014
−0.2436
−0.0645
−0.0287
−0.0155
−0.0085
−0.0064
A
GCG
−0.6420
1.1306
−5.9551
−2.4117
−0.6382
−0.2837
−0.1536
−0.0839
−0.0631
C
TGC
−0.3777
0.3913
−3.5032
−1.4187
−0.3754
−0.1669
−0.0903
−0.0493
−0.0371
D
GAT
−0.1201
0.0396
−1.1142
−0.4512
−0.1194
−0.0531
−0.0287
−0.0157
−0.0118
E
GAG
−0.2527
0.1752
−2.3440
−0.9493
−0.2512
−0.1117
−0.0605
−0.0330
−0.0248
F
TTT
−0.2741
0.2061
−2.5425
−1.0297
−0.2725
−0.1211
−0.0656
−0.0358
−0.0269
G
GGA
−0.8286
1.8834
−7.6861
−3.1127
−0.8237
−0.3662
−0.1982
−0.1083
−0.0815
G
GGC
−0.4436
0.5398
−4.1149
−1.6664
−0.4410
−0.1960
−0.1061
−0.0580
−0.0436
G
GGG
−0.7052
1.3641
−6.5412
−2.6490
−0.7010
−0.3116
−0.1687
−0.0921
−0.0693
H
CAT
−0.1966
0.1060
−1.8233
−0.7384
−0.1954
−0.0869
−0.0470
−0.0257
−0.0193
I
ATA
−0.8874
2.1604
−8.2318
−3.3337
−0.8822
−0.3922
−0.2123
−0.1159
−0.0872
I
ATT
−0.0906
0.0225
−0.8403
−0.3403
−0.0901
−0.0400
−0.0217
−0.0118
−0.0089
K
AAA
−0.2661
0.1943
−2.4686
−0.9997
−0.2646
−0.1176
−0.0637
−0.0348
−0.0262
L
CTA
−0.2636
0.1906
−2.4452
−0.9902
−0.2620
−0.1165
−0.0631
−0.0344
−0.0259
L
CTC
−0.7185
1.4162
−6.6650
−2.6991
−0.7143
−0.3175
−0.1719
−0.0939
−0.0706
L
CTG
−0.4662
0.5961
−4.3242
−1.7512
−0.4634
−0.2060
−0.1115
−0.0609
−0.0458
L
CTT
−0.4601
0.5807
−4.2679
−1.7284
−0.4574
−0.2033
−0.1101
−0.0601
−0.0452
L
TTA
−0.2128
0.1242
−1.9741
−0.7995
−0.2116
−0.0941
−0.0509
−0.0278
−0.0209
N
AAT
−0.3164
0.2746
−2.9347
−1.1885
−0.3145
−0.1398
−0.0757
−0.0413
−0.0311
P
CCC
−0.5061
0.7027
−4.6948
−1.9013
−0.5031
−0.2237
−0.1211
−0.0661
−0.0498
P
CCG
−0.8002
1.7567
−7.4230
−3.0061
−0.7955
−0.3537
−0.1914
−0.1046
−0.0787
P
CCT
−0.2319
0.1476
−2.1513
−0.8712
−0.2306
−0.1025
−0.0555
−0.0303
−0.0228
Q
CAG
−0.3840
0.4046
−3.5624
−1.4427
−0.3818
−0.1697
−0.0919
−0.0502
−0.0378
R
AGG
−0.5378
0.7935
−4.9890
−2.0204
−0.5347
−0.2377
−0.1287
−0.0703
−0.0529
R
CGA
−1.7330
8.2391
−16.0757
−6.5103
−1.7228
−0.7659
−0.4146
−0.2264
−0.1704
R
CGC
−0.5568
0.8504
−5.1646
−2.0915
−0.5535
−0.2461
−0.1332
−0.0727
−0.0547
R
CGG
−1.5092
6.2486
−13.9999
−5.6696
−1.5003
−0.6670
−0.3611
−0.1972
−0.1484
R
CGT
−0.0080
0.0002
−0.0744
−0.0301
−0.0080
−0.0035
−0.0019
−0.0010
−0.0008
S
TCA
−0.3942
0.4264
−3.6571
−1.4810
−0.3919
−0.1742
−0.0943
−0.0515
−0.0388
S
TCG
−0.4927
0.6660
−4.5705
−1.8509
−0.4898
−0.2178
−0.1179
−0.0644
−0.0484
S
TCT
−0.0121
0.0004
−0.1120
−0.0454
−0.0120
−0.0053
−0.0029
−0.0016
−0.0012
T
ACA
−0.4278
0.5020
−3.9679
−1.6069
−0.4252
−0.1890
−0.1023
−0.0559
−0.0421
T
ACG
−0.6503
1.1603
−6.0327
−2.4431
−0.6465
−0.2874
−0.1556
−0.0850
−0.0639
T
ACT
−0.0482
0.0064
−0.4468
−0.1810
−0.0479
−0.0213
−0.0115
−0.0063
−0.0047
V
GTA
−0.6310
1.0922
−5.8529
−2.3703
−0.6272
−0.2789
−0.1509
−0.0824
−0.0620
V
GTG
−0.4308
0.5092
−3.9966
−1.6185
−0.4283
−0.1904
−0.1031
−0.0563
−0.0424
V
GTT
−0.0570
0.0089
−0.5288
−0.2142
−0.0567
−0.0252
−0.0136
−0.0074
−0.0056
Y
TAT
−0.2857
0.2239
−2.6501
−1.0732
−0.2840
−0.1263
−0.0683
−0.0373
−0.0281
Z
AGC
−0.0248
0.0017
−0.2297
−0.0930
−0.0246
−0.0109
−0.0059
−0.0032
−0.0024
Note.—The selection coefficient S was
calculated relative to the most translationally efficient codon for a
given amino acid on a gene-by-gene basis.
Distribution of gene-specific selection coefficients on synonymous codon
usage from the without
model fit to the S.
cerevisiae genome. Selection coefficient S was
calculated on a gene-by-gene basis and relative to reference codon, which is
most favored by selection and for which, by definition, S
= 0, and is listed first within the legend of each panel. Genes with
were combined together into a single bin.
For reference, the fixation probability of a codon relative to a pure drift
process, , is also plotted (–line).
Summary statistics can be found in table 1.Summary Statistics for Gene-Specific Selection Coefficients on Synonymous
Codon Usage from the without
ROC SEMPPR Model Fit to the
Saccharomyces cerevisiae GenomeNote.—The selection coefficient S was
calculated relative to the most translationally efficient codon for a
given amino acid on a gene-by-gene basis.Figure 8 compares our
without
ROC SEMPPR-based estimates of S with
those estimated using the FMutSel phylogenetic model of Yang and Nielsen (2008) using PAML (Yang 2007) for the 106 genes in the Rokas data set. Overall, we
observe reasonable qualitative agreement between the two models with the majority of
codon-specific predictions having correlation coefficients . Unfortunately, although PAML provides
maximum-likelihood point estimates of parameters, it does not provide any confidence
intervals for these parameters. Given the large number of parameters (>60)
estimated from each coding sequence by FMutSel, the confidence intervals for each
parameter are likely to be large and, hence, could explain much of the variation we
observe between ROC SEMPPR and FMutSel parameter estimates. Nonetheless, for
85% of the codons examined (34/40), we observe is a significant
(P < 0.05) and positive linear relationship between the ROC
SEMPPR and the FMutSel estimates of S (see supplementary table S11, Supplementary Material online). Of the remaining six codons, half
exhibit a positive, but nonsignificant relationship between ROC SEMPPR and
FMutSel’s estimates of S, whereas the other half exhibit a
negative, but again nonsignificant, relationship between estimates of
S. Thus for 92% of the codons, both the ROC SEMPPR and
FMutSel estimates of S agree qualitatively.
F
Comparison of gene-specific selection coefficients on synonymous codon usage
from the without
model fit to the S.
cerevisiae genome and those from fitting the FMutSel model from
Yang and Nielsen (2008) for
106 yeast genes used in Rokas
as estimated by Kubatko LS, Shah
P, Herbei R, Gilchrist M (unpublished data). For more details, see the main
text. Selection coefficient S was calculated on a
gene-by-gene basis and relative to the most translationally efficient codon
for a given amino acid (which is the codon listed first in the legend).
Lines indicate linear regression line best fit and the corresponding
correlation coefficients are listed as well with an
indicating model fits with
P < 0.05. Under the FMutSel model, monomorphic sites
across species can lead to estimates of S =
−∞, these observations are plotted on the
x axis.
Comparison of gene-specific selection coefficients on synonymous codon usage
from the without
model fit to the S.
cerevisiae genome and those from fitting the FMutSel model from
Yang and Nielsen (2008) for
106 yeast genes used in Rokas
as estimated by Kubatko LS, Shah
P, Herbei R, Gilchrist M (unpublished data). For more details, see the main
text. Selection coefficient S was calculated on a
gene-by-gene basis and relative to the most translationally efficient codon
for a given amino acid (which is the codon listed first in the legend).
Lines indicate linear regression line best fit and the corresponding
correlation coefficients are listed as well with an
indicating model fits with
P < 0.05. Under the FMutSel model, monomorphic sites
across species can lead to estimates of S =
−∞, these observations are plotted on the
x axis.The three exceptions to this qualitative agreement are codons CGT (Arg), TCT
(Ser4), and ACT (Thr) and it is worth noting two points. First, the
central 95% CI for CGT (Arg) overlaps with 0 in both the with
and without
ROC SEMPPR model fits. Second, the Ser4
codon TCT and Thr codon ACT are two of the four codons that ROC SEMPPR indicates have
been misidentified as optimal codons in the past. Relative to the ROC SEMPPR
reference codons, TCT and ACT have small Δη values, approximately 0.01 and
approximately 0.05, respectively, and large ΔM values,
approximately −0.5 for both. Thus, it appears in these last two cases the
FMutSel model is misattributing the CUB toward these codons to selection rather than
mutation (see figure 6).
Discussion
Recent advances in technology have led a remarkable and continuing decrease in the cost
of genome sequencing. What is now needed are robust models and computational tools that
allow researchers to access the information encoded within these genomes. Several models
have been proposed that estimate selection coefficients of all 61 sense codons either on
a whole gene basis or on a site-by-site basis (Tamuri et al. 2012; Rodrigue et al.
2010; Yang and Nielsen 2008).
While important advances, these models fail to leverage information on CUB encoded
across genes. In contrast, ROC SEMPPR estimates selection coefficients and other key
parameters by assuming a common direction of selection on CUB, but where the strength of
selection varies with protein synthesis rate.As a result, ROC SEMPPR provides a modeling framework which can quickly extract
information on codon-specific translational inefficiencies Δη, mutation biases
ΔM, and gene-specific estimates of protein synthesis rates
, using only genome-wide patterns of CUB. This ability
stems from the hypotheses that the intergenic variation in patterns of CUB observed
within a genome reflects a lineage’s evolutionary responses to selection against
inefficient protein translation as well as mutation bias. Our results clearly show that
these CUB patterns contain remarkably large amounts of useful quantitative information
and the use of carefully constructed, mechanistically driven mathematical models can
greatly improve our ability to access and interpret this information. Indeed, we find
that for S. cerevisiae ROC SEMPPR’s without
estimates of ΔM, Δη,
and Φ⃗ values match almost exactly with the
with
estimates of these parameters. By removing the need for
gene expression data and, instead, providing reliable predictions of their
average protein synthesis rates , the methods developed here should be especially helpful
for molecular-, systems-, and microbiologists for whom genomic sequence data are both
abundant and inexpensive to obtain. For example, the protein translation rates we
estimate should contain useful information about the physiology
and ecology of the organism. Indeed, for the large number of sequenced microorganisms
that cannot be easily cultured in the laboratory, their genome sequence may become the
primary source of information about their biology for the near future.Of course, ROC SEMPPR may not work for all organisms. For example, some organisms may
evolve under N values too small for adaptation in CUB to
occur. Under these conditions, our method should fail to confidently identify the
selectively preferred codon (i.e., our CIs for our Δη parameters will overlap
with 0). However, because our estimates of Δη are based on the analysis of the
entire genome simultaneously rather than the combination of independent assessments of
individual genes, our method may be able to detect the signature of selection on CUB in
organisms where it previously went undetected. Alternatively, there may be organisms
where N is so large that, as a result, there is not enough
variation in CUB to reliably estimate our parameters. Assuming we retain our flat priors
on Δη, in these cases we expect our estimates of Δη for the most
selectively favored codons to continually increase in magnitude during the MCMC
simulation rather than eventually stabilizing. Such behavior reflects a lack of
information in the data rather than a flaw in our model and has been observed in other
approaches, such as those using inter- and intraspecific variation (Yang and Nielsen [2008] and Lawrie ,
respectively). ROC SEMPPR may also fail to work with organisms whose adaptation in CUB
is driven by more complex or less consistent selective forces. If these forces are
uncorrelated across amino acids within a gene or varied greatly with position within a
gene, then our method should not be able to confidently identify the selectively
preferred codons, similar to the case with of organisms with small
N.Although direct, codon-specific estimates of and do not exist, data from mutation accumulation lines and
tRNA copy number can be used as proxies. Reassuringly, we observe strong and consistent
agreement between ROC SEMPPR’s parameter estimates and these proxies. In addition,
when comparing ROC SEMPPR’s estimates of S with the FMutSel model
we observe general agreement between our estimates with the three key exceptions likely
due to relatively small differences in translational inefficiencies between these
synonymous codons and their most efficient alternative and strong mutation bias against
the most efficient, which is misinterpreted by FMutSel as selection. In contrast, the
selective values S on synonymous codon usage we estimate using ROC
SEMPPR are substantially smaller than those estimated by Lawrie based on intraspecific
variation in 4-fold degenerate sites for Drosophila melanogaster.
Although we have no immediate explanation for these differences we do note that Lawrie
acknowledge that the high S values they estimate are the exception
rather than the rule for population genetic studies, including those looking at
nonsynonymous substitutions.Because of ROC SEMPPR’s derivation from population genetics, it should be possible
to take any observed intraspecific variation into account by expanding our codon counts
likelihood function in equation (8)
to be calculated across the polymorphic alleles in proportion to their frequencies.
Further, given that the direction of selection in ROC SEMPPR is estimated using
information from across the genome, our ability to detect site-specific violations of
the model should be much greater than when analyzing the CUB of each gene separately.
Expanding ROC SEMPPR to utilize interspecific variation, however, is more complex and
will require expanding the model to include the effects of nonsynonymous substitutions
and phylogenetic history.For organisms that can be cultured in the laboratory, researchers can utilize
experimental techniques to measure mRNA, ribosome profiles, and protein abundances. Even
though impressive gains have been made in our ability to measure these quantities at a
genome scale, abundance data still have limitations. For example, mRNA abundance
measurements have been shown to vary substantially between labs using the same strain
and the same general conditions (Wallace et al.
2013). Indeed, our posterior mean estimate of the error in mRNA abundance
measurement () indicates that the error in a given measurement ranges
over an order of magnitude. In terms of protein abundance measurements, most proteomic
studies have difficulty quantifying membrane bound proteins (Durr ; Babu ; Chen ). Furthermore, both
transcriptomic and proteomic measurements are, by their very nature, restricted to the
specific growth conditions used. Unfortunately, the frequency with which organisms
outside of the lab encounter such conditions is generally unknown. This is particularly
important for understanding a pathogenic organism, where expression of genes involved in
its persistence and spread is highly dependent on their hosts and is difficult to mimic
in vitro.The predictions of protein synthesis rates generated by ROC SEMPPR contain independent and
complementary information to that found in mRNA or protein abundance measurements. As a
result, this information can be used on its own or in combination with other measures of
gene expression. For example, our work provides estimates of protein production based on
the average environment that an organism’s lineage has experienced. These
estimates of average gene expression can be used to further contextualize gene
expression measurements in different environments. For example, comparing the
values for proteins involved in different,
environment-specific pathways should give researchers an understanding of the relative
importance of these environments in the lineage’s evolutionary history. At a finer
scale, gene-specific incongruences between mRNA abundance measurements and
estimates may indicate genes undergoing extensive
posttranscriptional regulation, a hypothesis that can be evaluated experimentally.The fact that the additional information provided by the data from Yassour
leads to a relatively small increase in the
quality of our predictions of data from other labs may seem surprising.However, we believe this behavior indicates that the information in
about gene-specific protein synthesis rates is largely
redundant with the information held within the CUB patterns within a gene and across a
genome.Of course a skeptic might proffer a different interpretation, that is, the model is
somehow ignoring or insensitive to the information in . We, however, believe this is not the case for the
following reasons. First, ROC SEMPPR was carefully formulated to combine the information
from independent measurements and the CUB of each gene in a
straightforward and logical manner (see supplementary material: Fitting of Model to Genomic Data and Noisy
Measurements and eq. S1 in particular, Supplementary Material online). Instead of a priori assuming one source
of information is better than the other, ROC SEMPPR actually evaluates the relative
quality of each source of information in explaining the observed bias in codon usage for
a gene across the naa amino acids. Next, given the fact that
the 95% posterior CIs for Δη differ for at least one pair of codons for
each amino acid indicates that the information held within the CUB patterns is reliable.
In contrast, ROC SEMPPR’s estimate of the error in indicates that the empirical measurements
are noisy, consistent with the findings from other
studies. For example, Wallace looked at the correlation in measurements between independent labs and found
nontrivial disagreement in their values. Finally, and perhaps most convincingly, the
without
version of ROC SEMPPR treats the
values as missing values and is able to predict their
values to a similar level of accuracy observed between empirical measurements from
different laboratories and using different platforms (supplementary figs. S4 and S5, Supplementary Material online).Accessing information on using a mechanistic, model-based approach as developed
here has additional, distinct advantages over more ad-hoc approaches frequently used by
other researchers. Quantifying selection on synonymous codons is important for
phylogenetic inference. Classical codon substitution models of protein evolution
typically assume that synonymous codons of an amino acid are selectively neutral. In
contrast, our estimates of codon-specific translation inefficiencies Δη and
expression levels provide an independent measure of selection on
synonymous codons from a single genome. By incorporating these measures in codon
substitution models, researchers would be able to measure selection on nonsynonymous
changes either within a gene or on a site-by-site basis.In addition, current measures to identify the selective regime in which a gene evolves,
for example, positive, negative or nearly neutral, are based on estimating the number of
nonsynonymous to synonymous changes (dN/dS) (Li et al. 1985; Nei and Gojobori 1986; Yang and Nielsen 2000) or polymorphism data (McDonald and Kreitman 1991). These tests generally assume
that synonymous changes are neutral. However, Spielman and Wilke (2015) have recently shown that ignoring selection on
synonymous changes can lead to a false positive signal of a gene evolving in response to
diversifying selection. By using our codon-specific estimates of translation
inefficiencies, researchers will now be able to explicitly account for biases in
estimates of dS due to selection on synonymous changes (Spielman and Wilke 2015).Estimates of codon-specific translation inefficiencies are also important for practical
applications such as codon-optimization algorithms that are used to increase
heterologous gene expression, for example, insulin expression in
Escherichia coli. When heterologous genes are
expressed in a particular model organism such as E. coli or S.
cerevisiae, their codon usage is “optimized” by assuming that
the most frequently used codon in a set of highly expressed genes is the optimal one.
This approach implicitly assumes that natural selection against translational
inefficiencies overwhelms any mutation bias. In several amino acids that use more than
two synonymous codons, for example, Ser4, Thr and Val, genes with highest
expression are more often encoded by the mutationally favored, second-best codon rather
than the mutationally disfavored optimal codon. As a result, relying on the codon usage
of highly expressed genes appears to be overly simplistic in the case of the S.
cerevisiae genome and, if our inferences are correct, has led to
misidentification of the optimal codon.In addition to codon-specific translation inefficiencies Δη, we also estimate
codon-specific mutation biases ΔM. We find that the direction of
mutation biases between synonymous codons is consistent across all amino acids and in
the same direction as genomic AT content. However, as we documented in Shah and Gilchrist (2011),
ΔM for similar sets of nucleotides differs significantly
between amino acids. For instance, in the case of two-codon amino acids with C-T wobble,
we find that ranges from 0.27 to 0.75. For genes with low expression
levels (i.e., ), this corresponds to ratios of T-ending codons to
C-ending codons between amino acids ranging from 1.3 to 2.1. One possible explanation
for this wider than expected range of mutation biases could be context-dependence of
mutation rates. Recent high-throughput mutation accumulation experiments in yeast
support this idea, estimating that the mutation rate at a particular nucleotide depends
on the context of surrounding nucleotides: The C nucleotide in the context
of CCG has several fold higher mutation rate than in the context of
CCT (Zhu et al. 2014).Despite the numerous advances outlined above, our work is not without its limitations.
One important limitation stems from our assumption that codons contribute to the
cost–benefit ratio of protein translation in an additive manner. Although this
assumption is consistent with certain costs of protein translation, such as ribosome
pausing, it ignores many others selective forces potentially shaping the evolution of
CUB. For example, the cost of nonsense errors, that is, premature termination events, is
generally expected to increase with codon position along an ORF and, thus, leads to a
nonadditive contribution of a given codon to the cost–benefit ratio η (Gilchrist et al. 2009). Similarly, if one
assumes that the main effect of missense errors is to reduce the functionality of the
protein produced, then the cost of these errors is expected to depend greatly on
specific details such as the structural and functional role of the amino acid at which
the error occurs and the physiochemical differences between the correct and the
erroneously incorporated amino acids. Finally, the pausing time at a codon is also
influenced by several factors such as downstream mRNA folding (Yang et al. 2014), presence of polybasic stretches (Brandman et al. 2012) as well as
cotranslational folding of the growing polypeptide (Thanaraj and Argos 1996; Pechmann and Frydman 2013). Although the contributions of
these factors to ribosomal pausing times are often idiosyncratic and vary widely between
genes, they can all influence the cost–benefit ratio η. The situation becomes
even more complex and nonlinear when considering how nonsense and missense errors along
with various factors influencing pausing time costs combine to affect η. In all of
these situations, the nonlinear mapping between a codon sequence and η makes direct
evaluation of the likelihood function difficult. In such situations alternative,
approximate methods and simulation techniques, such as those developed by Murray et al. (2006), will become necessary.
Expanding our approach to include these additional selective forces should allow us to
quantitatively evaluate the separate contributions of ribosome pausing time, nonsense
errors, and missense errors have made to the evolution of CUB for a given species. Doing
so will allow us to address the long held goal in molecular and evolutionary biology of
accurately quantifying the factors contributing to the evolution of CUB within a coding
sequence and across a genome.
Materials and Methods
Modeling Natural Selection on Synonymous Codons
Following the notation and framework introduced in Gilchrist (2007) and Shah and Gilchrist (2011), we assume that for each gene,
the organism has a target, average protein synthesis rate . Protein synthesis rates have units of 1/time; for
convenience and ease of interpretation, we define our time units such that the
average or expected protein synthesis rate across the genome is one, that is,
. The cost–benefit ratio
represents the expected cost, in ATPs, to produce one
functional protein from the coding sequence where c represents the
codon used at position i in a protein of length n.
In its most general form, , where is the expected direct and indirect energetic costs
incurred by a cell when a ribosome initiates translation of a transcript containing
c⃗. Similarly, is the expected benefit, relative to a complete and
error free protein, received by a cell when a ribosome initiates translation of a
transcript containing c⃗. By definition, in the absence of
translation errors, ribosomes will only produce complete and error free proteins,
that is, for ROC SEMPPR . Thus any differences in η are the result of
differences in between alternative
c⃗’s and E(Cost) simplifies to
where a1 is the direct
and indirect cost of translation initiation, a2 is the
direct cost of peptide elongation (four ATPs/amino acid), is the average pausing time a ribosome takes to
translate codon c, and v scales this
indirect cost of ribosome pausing from units of time to ATPs. Based on these
definitions, represents the average energy flux an organism must
expend to meet its target production rate for a given protein. If we assume that
every ATP/time spent leads to a small, proportional reduction in genotype fitness
q, then the fitness of a given genotype isIn the simplest scenarios, such as when there is selection to minimize ribosome
pausing during protein synthesis, a synonymous codon i makes an
additive, position-independent contribution to η. In this scenario, the evolution
of the codons in is independent between positions. As a result, the
information held within c⃗ can be summarized by the number
of times each synonymous codon is used within c⃗. Given
these assumptions, within the ORF of a given gene the stationary probability of
observing a set of codon counts for a given amino acid with
n synonymous codons within
c⃗ will follow a multinomial distribution with the
probability vector . Here, for , where is a measure of codon-specific mutation bias and
is a measure of translational inefficiency.
Specifically, , that is the natural logarithm of the ratio of the
frequencies of synonymous codon 1 to i in the absence of natural
selection. Following the detailed balance assumptions in our population genetics
model, in the specific cases where codons i and 1 can mutate
directly between each other, is also equal to the log of the ratio of the mutation
rates between the two codons (Sella and Hirsh
2005; Shah and Gilchrist
2011; Wallace et al. 2013).
Following Sella and Hirsh (2005), for
, for both a haploid and diploid Fisher–Wright
populations, we scale the differences in the contribution two synonymous codons make
to η relative to genetic drift, that is, . Because the reference codon 1 is determined by
pausing time values, values can be both negative and positive, unlike
.
Fitting the Model to Genomic Data
Our main goal is to estimate codon-specific differences in mutation bias,
, translational inefficiencies,
, and protein synthesis rates for all genes,
from the information encoded in the codon usage
patterns found across a genome. To test our approach, we used the S.
cerevisiae S288c genome file orf_coding.fasta.gz which was posted on
February 3, 2011 by Saccharomyces Genome Database http://www.yeastgenome.org/
(last accessed April 4, 2015) (Engel et al.
2014). These data contain 5,887 genes and consist of the ORFs for all
“Verified” and “Uncharacterized” genes as well as any
transposable elements. To fit the with
model, we used RNA-seq-derived mRNA abundance
measurements from Yassour . We combined the abundance measures from the four
samples, YPD0.1, YPD0.2, YPD15.1, and YPD15.2, taken during log growth phase and used
the geometric mean of these values as a proxy for relative protein synthesis rates
. As is commonly done by empiricists, we rescaled our
values such that they summed to 15,000. Because our
with
model fits estimate the scaling term,
, the only effect of this rescaling is on our estimate
of . To reduce noise in the data, we only used genes with at least three nonzero
measurements. The intersection of 5,887 DNA ORF sequences and 6,303 mRNA abundance
measurements produced 5,346 ORFs in common to both data sets. These 5,346 genes made
up the final data set used for ROC SEMPPR’s with and
without
model fits.Using an MCMC approach we sample from the posterior distribution, according to the
equation where the likelihood of the codon counts,
, are naturally modeled as a multinomial distribution
(Multinom) for the amino acid i in the ORF of gene
j as defined in equation (7), is an inverse multinomial logit function
() of , , and , and is the prior for the protein synthesis rate
. In order to enforce the restriction that
for all genes, we include the constraint that
. As a result there is only one free parameter for the
distribution . Further, we propose a flat prior for
, that is, for .Figure 9 presents an overview of the
structure of our approach, but to summarize, Our MCMC routine provides posterior samples of the
genome-wide parameters , , and and the gene-specific, protein synthesis parameters
Φ⃗. We refer to this model as the ROC SEMPPR
without
model.
F
Dependence graph of with and without ROC
SEMPPR methods. Shaded circles and represent observed data. Dashed circles
represent key random model parameters, whereas the solid oval represents a
random hierarchical parameter. Solid black squares provide information on
the distributional relationships between quantities. Large rectangular boxes
represent replication of each model component across both amino acids and
genes, for example, pausing, and mutation parameters differ across amino
acids but are common across genes, whereas counts differ across both amino acids and
genes.
Dependence graph of with and without ROC
SEMPPR methods. Shaded circles and represent observed data. Dashed circles
represent key random model parameters, whereas the solid oval represents a
random hierarchical parameter. Solid black squares provide information on
the distributional relationships between quantities. Large rectangular boxes
represent replication of each model component across both amino acids and
genes, for example, pausing, and mutation parameters differ across amino
acids but are common across genes, whereas counts differ across both amino acids and
genes.We refer to the more general model which incorporates information on
from noisy protein synthesis measurements or their
proxy, such as mRNA abundances, as the with
model. This model differs from that of Wallace in
that 1) we assume is drawn from a log-normal distribution rather than
an asymmetric Laplace distribution, 2) we include and estimate an explicit empirical
scaling term for the data, and 3) as in the without
approach, we force the prior for
, to have instead of rescaling estimates of
as a postprocessing step. This prevents the
introduction of additional biases in our parameter estimates. See the supplementary material, Supplementary Material online, for more details.
Model Fitting Details
We briefly describe the model fitting procedure here; full details can be found in
Chen W-C, Zaretzki R, Gilchrist MA (unpublished data). The code was originally based
on a script published by Wallace , which was modified extensively and expanded
greatly. Unless otherwise mentioned, all model fits were carried out using R version
3.0.2 (R Core Team 2013) using standard
routines, specifically developed routines, and custom scripts. All code was run on a
multicore workstation with AMD Opteron 6378 processors. For both ROC SEMPPR’s
with and without model fits, it takes less than
30 min and less than 3 GB of memory to run 10,000 iterations of a chain when using
5,346 genes of S. cerevisiae S288c genome. Each MCMC sampling
iteration was divided into three parts:Conditional on a new set of parameters, propose new
and values independently for each amino
acid.Conditional on the updates of (1), propose a new value for the prior distribution of
.Conditional on the updates of (2), propose new values independently for each gene. Update
the new set of parameters and return to (1).In all three phases, proposals were based on a random walk with step sizes normally
or log-normally distributed around the current state of the chain.In order to generate reasonable starting values for in the without
version of ROC SEMPPR, we first calculated the SCUO
value for each gene (Wan et al. 2006)
and then ordered the genes according to these corresponding values. We then simulated
a random vector of equal dimension to from a distribution where represents the initial value of
and controls the standard deviation of
. Next, these random variates were rank ordered and assigned to the
corresponding gene of the same SCUO rank. As a result, the rank order of a
gene’s initial value, , was the same as the rank order of its SCUO value. We
tried a variety of values and they all converged to similar parameter
values. For the with
model, we tried both the SCUO-based approach and
using the data to initialize our values of
. In this second scenario, we set
where represents the geometric mean of the observed mRNA
abundances for gene j. As in the without
ROC SEMPPR model fit, we found the
with
chains consistently converged to the same region of
parameter space independent of the initial values. It is worth noting that the structure of the
probability function defined in equation (7) is such that if the rank order of were reversed from their true order, the model would
converge to a similar quality of model fit and the signs of the parameters would
change. Thus it is recommended that model fits be checked to ensure that the final
estimates of for housekeeping genes, such as and ribosomal
proteins, are much greater than 1.Treating our initial protein synthesis rates for the entire genome as explanatory variables, the
initial values for and were generated through multinomial logistic
regression using the vglm() function of the VGAM package
(Yee 2013). We also used the
covariance matrix returned by vglm() as the proposal
covariance matrix for and for each amino acid. In order to make our random walk
more efficient, we used an adaptive proposal function for all parameters in order to
reach a target range of acceptance rates between 20% and 35%. For
example, the covariance matrix of the step sizes was multiplied by a scalar value
that was then increased or decreased by 20% every 100 steps when the
acceptance rate of a parameter set was greater than 35% or less than
20%, respectively. The variance terms of the random walks for the
and the global parameter were also adjusted in a similar manner.The results presented here were generated by running the MCMC algorithm for 10,000
iterations and, after examining the traces of the samples for evidence of
convergence, selecting the last 5,000 iterations as our posterior samples. The
arithmetic means of the posterior samples were used as point estimates based on the
mean of our posterior samples. Posterior CIs are generated by excluding the lower and
upper 2.5% of samples. Additional details on the model fit can be found in the
supplementary material, Supplementary Material online, and in Chen W-C, Zaretzki R, Gilchrist
MA (unpublished data). The code is implemented in an R package “cubfits”
(Chen et al. 2014) which is freely
available for download at http://cran.r-project.org/package=cubfits (last accessed April
4, 2015).
Estimating Selection Coefficients Using FMutSel
In order to evaluate the consistency of our estimates of with other approaches, we used the data set from
Rokas
which consisted of 106 aligned genes from 8 yeast species. Details of the model
fitting can be found in Kubatko LS, Shah P, Herbei R, Gilchrist M (unpublished data)
(available at bioRxiv doi: http://dx.doi.org/10.1101/007849, last accessed May 28, 2015), but
briefly, we used the maximum-likelihood tree found by Rokas and then generated
maximum likelihood estimates of the stationary probability of a given codon under the
FMutSel model from Yang and Nielsen
(2008) using CODONML in PAML 4.4 (Yang 2007). Using the same notation as in Yang and Nielsen (2008) we have where, for a given gene,
π represents the stationary probability of
observing codon J given nucleotide-specific mutational bias terms
and where . It follows that the comparable selection coefficient
on synonymous codon usage relative to our reference codon 1 isA list of these parameter estimates can be found in the supplementary material, Supplementary Material online.
Supplementary Material
Supplementary material, figures S1–S5, and tables S1–S11 are available at Genome Biology and
Evolution online (http://www.gbe.oxfordjournals.org/).
Authors: Mai Sun; Björn Schwalb; Daniel Schulz; Nicole Pirkl; Stefanie Etzold; Laurent Larivière; Kerstin C Maier; Martin Seizl; Achim Tresch; Patrick Cramer Journal: Genome Res Date: 2012-03-30 Impact factor: 9.043
Authors: Anshul Kundaje; Wouter Meuleman; Jason Ernst; Misha Bilenky; Angela Yen; Alireza Heravi-Moussavi; Pouya Kheradpour; Zhizhuo Zhang; Jianrong Wang; Michael J Ziller; Viren Amin; John W Whitaker; Matthew D Schultz; Lucas D Ward; Abhishek Sarkar; Gerald Quon; Richard S Sandstrom; Matthew L Eaton; Yi-Chieh Wu; Andreas R Pfenning; Xinchen Wang; Melina Claussnitzer; Yaping Liu; Cristian Coarfa; R Alan Harris; Noam Shoresh; Charles B Epstein; Elizabeta Gjoneska; Danny Leung; Wei Xie; R David Hawkins; Ryan Lister; Chibo Hong; Philippe Gascard; Andrew J Mungall; Richard Moore; Eric Chuah; Angela Tam; Theresa K Canfield; R Scott Hansen; Rajinder Kaul; Peter J Sabo; Mukul S Bansal; Annaick Carles; Jesse R Dixon; Kai-How Farh; Soheil Feizi; Rosa Karlic; Ah-Ram Kim; Ashwinikumar Kulkarni; Daofeng Li; Rebecca Lowdon; GiNell Elliott; Tim R Mercer; Shane J Neph; Vitor Onuchic; Paz Polak; Nisha Rajagopal; Pradipta Ray; Richard C Sallari; Kyle T Siebenthall; Nicholas A Sinnott-Armstrong; Michael Stevens; Robert E Thurman; Jie Wu; Bo Zhang; Xin Zhou; Arthur E Beaudet; Laurie A Boyer; Philip L De Jager; Peggy J Farnham; Susan J Fisher; David Haussler; Steven J M Jones; Wei Li; Marco A Marra; Michael T McManus; Shamil Sunyaev; James A Thomson; Thea D Tlsty; Li-Huei Tsai; Wei Wang; Robert A Waterland; Michael Q Zhang; Lisa H Chadwick; Bradley E Bernstein; Joseph F Costello; Joseph R Ecker; Martin Hirst; Alexander Meissner; Aleksandar Milosavljevic; Bing Ren; John A Stamatoyannopoulos; Ting Wang; Manolis Kellis Journal: Nature Date: 2015-02-19 Impact factor: 69.504
Authors: Stacia R Engel; Fred S Dietrich; Dianna G Fisk; Gail Binkley; Rama Balakrishnan; Maria C Costanzo; Selina S Dwight; Benjamin C Hitz; Kalpana Karra; Robert S Nash; Shuai Weng; Edith D Wong; Paul Lloyd; Marek S Skrzypek; Stuart R Miyasato; Matt Simison; J Michael Cherry Journal: G3 (Bethesda) Date: 2014-03-20 Impact factor: 3.154
Authors: Péter Szövényi; Kristian K Ullrich; Stefan A Rensing; Daniel Lang; Nico van Gessel; Hans K Stenøien; Elena Conti; Ralf Reski Journal: Genome Biol Evol Date: 2017-06-01 Impact factor: 3.416
Authors: Jeremy M Beaulieu; Brian C O'Meara; Russell Zaretzki; Cedric Landerer; Juanjuan Chai; Michael A Gilchrist Journal: Mol Biol Evol Date: 2019-04-01 Impact factor: 16.240
Authors: Yun Deng; Fabio de Lima Hedayioglu; Jeremie Kalfon; Dominique Chu; Tobias von der Haar Journal: J R Soc Interface Date: 2020-02-19 Impact factor: 4.118