Literature DB >> 30111621

Accounting for Programmed Ribosomal Frameshifting in the Computation of Codon Usage Bias Indices.

Victor Garcia^1,2, Stefan Zoller³, Maria Anisimova^4,2.

Abstract

Experimental evidence shows that synonymous mutations can have important consequences on genetic fitness. Many organisms display codon usage bias (CUB), where synonymous codons that are translated into the same amino acid appear with distinct frequency. Within genomes, CUB is thought to arise from selection for translational efficiency and accuracy, termed the translational efficiency hypothesis (TEH). Indeed, CUB indices correlate with protein expression levels, which is widely interpreted as evidence for translational selection. However, these tests neglect -1 programmed ribosomal frameshifting (-1 PRF), an important translational disruption effect found across all organisms of the tree of life. Genes that contain -1 PRF signals should cost more to express than genes without. Thus, CUB indices that do not consider -1 PRF may overestimate genes' true adaptation to translational efficiency and accuracy constraints. Here, we first investigate whether -1 PRF signals do indeed carry such translational cost. We then propose two corrections for CUB indices for genes containing -1 PRF signals. We retest the TEH in Saccharomyces cerevisiae under these corrections. We find that the correlation between corrected CUB index and protein expression remains intact for most levels of uniform -1 PRF efficiencies, and tends to increase when these efficiencies decline with protein expression. We conclude that the TEH is strengthened and that -1 PRF events constitute a promising and useful tool to examine the relationships between CUB and selection for translation efficiency and accuracy.

Entities: Chemical Disease Gene Species

Keywords: codon usage bias programmed ribosomal frameshifting translation efficiency hypothesis genetics molecular biology

Mesh：

Substances：
Codon
RNA, Messenger

Year: 2018 PMID： 30111621 PMCID： PMC6169388 DOI： 10.1534/g3.118.200185

Source DB: PubMed Journal: G3 (Bethesda) ISSN： 2160-1836 Impact factor: 3.154

Across an organism’s genome, among those codons that decode for the same amino acids (codon families) some are used preferentially over others (Sharp and Li 1986) (reviews in (Hershberg and Petrov 2008; Plotkin and Kudla 2011)). The variation of this codon usage bias (CUB) within organisms’ genomes is mainly explained by selection for translation efficiency by protein expression demands –the translation efficiency hypothesis (TEH) (Hershberg and Petrov 2008; Plotkin and Kudla 2011). The main support for the TEH stems from two separate lines of evidence that concern associations between CUB and tRNA abundances and CUB and protein expression, respectively (Plotkin and Kudla 2011). In the first line, studies revealed that the most frequently used codons –so called preferred codons (Hershberg and Petrov 2008)– typically match the most abundant iso-accepting tRNA in both Escherichia coli (Ikemura 1985; Yamao ) and Saccharomyces cerevisiae (Ikemura 1985). Similar conclusions were drawn for a number of additional species (Kanaya , 2001), but using tRNA gene copy numbers as proxies for tRNA abundances. The assumption here was that tRNA gene copy numbers and tRNA abundances correlate, as is the case in E. coli and yeast (Ikemura, 1981, 1982). Further studies suggest that employing these more abundant tRNAs in translation provides efficiency (Tuller ) as well as accuracy gains (at least for E. coli (Stoletzki and Eyre-Walker 2007; Varenne ) and Drosophila melanogaster (Akashi 1994)). Furthermore, exchanging preferred codons in highly expressed genes by non-preferred had strong effects on gene expression (Carlini and Stephan 2003; Carlini 2004). If valid, these two mechanisms together imply that the disproportionately frequent usage of such preferred codons in an mRNA sequence is indicative of high translational efficiency. The second line of evidence concerns the existence of associations between CUB and protein expression (Ghaemmaghami ; Jansen ; Kliman ; Roth ; Quax ). High protein expression demands are assumed to generate a selective pressure for translation efficiency (Hershberg and Petrov 2008; Chamary ; Duret 2002). Given the mechanisms of the first line of evidence, these efficiency increases could be provided by synonymous mutations that alter codon usage frequencies toward preferred codons (Hershberg and Petrov, 2008, 2009). Taken together with the first line of evidence, these associations suggest that protein expression is also associated with translation efficiency, completing the support of TEH. The test of TEH by detection of CUB-protein expression association does therefore not rely on a direct comparison between translation efficiency and protein expression. Instead, CUB indices serve as a measure for adaptation due to translational selection and thus as a proxy for translational efficiency (Roth ). CUB indices are computed from a given mRNA sequence. Thus, the test of the TEH by CUB-protein expression association hinges on the assumption that the CUB index of an analyzed mRNA appropriately reflects its translational efficiency. The phenomenon of programmed ribosomal frameshifting (PRF) (Jacks and Varmus 1985; Craigen ; Clare ; Caliskan ; Dinman, 2006, 2012b, 2012a) indicates that the validity of that assumption may not always be warranted. Programmed ribosomal frameshifting is a process by which ribosomal translation of an mRNA sequence is induced to stall at specific sites, termed slippery sites, which leads to a rearrangement of the ribosome on the mRNA sequence (Plant and Dinman 2006; Caliskan ). Translation then proceeds in a new frame that is shifted relative to the original open reading frame (ORF) (Belew and Dinman 2015). This mechanism emerges in all domains of the tree of life (Cobucci-Ponzano ; Baranov ; Caliskan ; Dinman, 2012b). Here, we focus on events that shift ribosomes one nucleotide back in the translational direction, termed -1 PRF (Caliskan ; Advani and Dinman 2016), further restricting ourselves to yeast. The function of -1 PRF differs across organisms. In viruses, -1 PRF is predominantly employed to package more information into available sequence material. By opening a new reading frame, the -1 PRF mechanism allows for dual coding. This function is exemplified in the gag-pol gene overlap appearing in many retroviruses (Jacks and Varmus 1985), in particular human immunodeficiency virus (HIV) (Brierley and Dos Ramos 2006). In these viruses, the appropriate proportion of gag vs. pol protein expression is regulated by a -1 PRF signal (Maia ; Dulude ; Brierley and Dos Ramos 2006). In eukaryotes, -1 PRF is predominantly used for gene expression regulation (Plant ; Advani and Dinman 2016). About 10% of genes in yeast are hypothesized to contain functional slippery sites (Belew ). Ribosomes redirected to -1 shifted frames majorly encounter premature termination codons (PTC), that is, stop codons that appear well before the poly-A tails of mRNA sequences (Advani and Dinman 2016). This typically triggers the activation of the nonsense-mediated decay (NMD) pathway (Chang ). The NMD pathway induces the degradation of both the mRNA, as well as of the partly assembled protein, while leaving the ribosome intact (Advani and Dinman 2016). Thus, -1 PRF acts as an NMD-mediated destabilizing element of mRNA (Belew , 2014; Advani ). -1 PRF efficiency appears to be controlled by sequence specific elements, such as miRNA (Belew ), but can be uniformly affected across all PRF-harboring genes by mutations as well as drugs (Advani ). The reason why CUB may not appropriately reflect translation efficiency for genes containing -1 PRF signals, is because -1 PRF is likely to conceal costs to translation. When a -1 PRF signal redirects a fraction of the mRNA-translating ribosomes to premature termination (Dinman, 2012b; Advani and Dinman 2016), a translation efficiency cost is incurred that is not reflected in the sequence’s CUB index. In this study, we retest the CUB-protein expression association that lends major support for the TEH while accounting for the -1 PRF phenomenon. We draw from PRFdb, a database of predicted -1 PRF signals in the Saccharomyce cerevisiae genome (Belew ). First, we devise a series of hypotheses for how -1 PRF signals are influenced by evolutionary pressures. In accordance with our hypotheses, we find evidence for the existence of a cost of -1 PRF mechanism maintenance. Second, we devise general corrections for any codon usage bias index of mRNA containing -1 PRF signals. We use sequences from PRFdb to compute the new, corrected codon bias indices, and compare them to uncorrected values. We find that the TEH is robust against these corrections and strengthened under biologically plausible assumptions of a -1 PRF dependency on protein expression. We conclude that the -1 PRF signals offer untapped potential to analyze translation efficiency in mRNA.

Materials and Methods

Slippery site data

We obtained a dataset of the Saccharomyces cerevisiae S288C strain genes with predicted -1 PRF events from the Programmed Ribosomal Frameshifting database (PRFdb) of the University of Maryland (Belew ). These putatitve -1 PRF signals were identified using first a filter for slippery site identification, and subsequent detection of mRNA pseudoknots by means of the Nupack algorithm (Dirks and Pierce 2004). For a slippery site to be found, a confluence of signals in mRNA is required. First, a site must match the pattern (X XXY YYZ), where X is some base, Y is either A or U, and Z is A, C or U. Second, a spacer sequence of a minimum of 8 nucleotides in length needs to exist between the slippery site and the following pseudoknot. Third, a pseudoknot predicted by minimum free energy values of the mRNA secondary structure is expected downstream the slippery site. The dataset also included a full list of gene annotations, accession numbers, the relative position of the slippery sites within the gene in which they were found, as well as the gene mRNA. For each slippery site, we computed were premature stop codons first appear downstream. With this, we computed the frameshifted sequences.

Protein expression and mRNA level data

We obtained protein expression as well as mRNA level data from the Supporting Information of (von der Haar 2008). Von der Haar has produced an extensive curated data set that merges data from various sources. For protein expression, the data include the seminal studies of Ghaemmaghami et al. (Ghaemmaghami ), Newman et al. (Newman ), and Lu et al. (Lu ). Additional data stems from 46 further studies, specified in (von der Haar 2008). To ensure comparability, only studies were included in which yeast was grown in rich medium. Transcriptome data for mRNA levels were obtained from (Holstege ; Arava ; Holland 2002; Jelinsky and Samson 1999; Roth ). These data include about 6000 genes. The construction of the curated data set is described in detail in (von der Haar 2008).

Indices for codon usage bias

Many indices have been proposed to assess the degree of codon usage bias present in a gien mRNA sequence. Here, we focus on the codon adaptation index (CAI) (Sharp and Li 1987; Carbone ). In the following, we give the implemented definition of the CAI. Let L be the length of the mRNA sequence in codons, c is the index of the synonymous codons decoding the same amino acid a, and is the observed count of the synonymous codon c of amino acid a in the sequence. is the index set of all codons within an amino acid a. A is the index set of all amino acids a. We defineas the relative adaptiveness of codon within amino acid a of a given mRNA sequence. Then, the codon adaptation index is defined aswhere is measured on the observed on the mRNA of interest, and is taken from a reference sequence of highly expressed genes. For the reference sequence, we concatenated all of the mRNA of ribosomal genes (Friberg ). The ribosomal genes were taken from the ribosomal gene database http://ribosome.med.miyazaki-u.ac.jp/ (Nakao ).

Data Availability

The code used for analysis of the data are available on Figshare (doi: 10.6084/m9.figshare.5955886; url: https://figshare.com/s/ff89f9e614b129469fd4). Supplemental material available at Figshare: https://doi.org/10.6084/m9.figshare.6970277.

Results

The cost of -1 PRF maintenance

Effective -1 PRF events classically comprise three constitutive elements. The first element is a heptameric sequence –the slippery site– (Jacks and Varmus 1985; Plant ). This site has the structure (X XXY YYZ), where X can be any base (A,C,G,U), Y is either A or U, and lastly, Z is either A, C or U (Plant ; Plant and Dinman 2006; Ketteler 2012). The second element is a spacer sequence, separating the slippery site from the third element (Dinman, 2012b). The spacer sequence typically comprises around 1-12 nucleotides (Ketteler 2012; Dinman, 2012a). The third element is a pseudoknot, an mRNA secondary structure that is thought to generate a mechanical tension on the spacer sequence when the slippery site is being translated (Ketteler 2012; Dinman, 2012b, a). To release the tension, the ribosome, while stalling at the slippery site, is pushed one nucleotide back (Dinman, 2012b; Caliskan ; Ketteler 2012). The frequency with which the ribosome is redirected to the -1 frame is called -1 PRF efficiency (Advani ; Advani and Dinman 2016). These criteria were used to generate an extensive database of -1 PRF signals, PRFdb, of the yeast genome by algorithmic search (see Materials and Methods). If PRFdb comprises a sufficiently large set of functional slippery sites, the effects of putative -1 PRF induced costs to translation efficiency should be detectable. To this end, we devised three hypotheses. We derive these hypotheses from the assumption that the maintenance of -1 PRF signals carries an intrinsic cost to the organism. Furthermore, we assume that mutations that alter CUB are accumulated more slowly than mutations that alter -1 PRF efficiency, p. This is because the -1 PRF regulation mechanism is very likely to be affected by only a few mutations, whereas several mutations would be needed to significantly modify CUB. Hence, most short-term changes in protein expression demand in genes with slippery sites are more likely absorbed by -1 PRF. We first introduce a framework to formulate these hypotheses (see Figure 1). Let L be the length (in nucleotides) of an mRNA sequence x. A single slippery site is located at site l of x, downstream from the start. Ribosomes are redirected to the -1 frame with probability p, the -1 PRF efficiency. They continue with the mRNA’s translation until encountering a premature termination codon (PTC) at a distance λ post slippery site.

Figure 1

Notation for -1 PRF signals. A) The mRNA sequence in which a -1 PRF signal does not lead to ribosomal frameshifting is termed x (green). Arrows indicate the distances in nucleotides. B) The mRNA sequence with -1 PRF is termed . This is the concatenation of the mRNA x until the slippery site (green), and the -1 frame after the slippery site until a premature termination codon is encountered (violet). In a first hypothesis, we formed expectations about how the number of slippery sites per gene should vary with the ratio of protein expression to mRNA levels per cell. This ratio has been used as a measure for translation efficiency in past studies (Tuller ). To formulate the first hypothesis, we conduct a thought experiment. Let us assume that an organism’s environmental conditions favor lower expression of a protein from a -1 PRF-gene. Then, increasing the average p in that gene will adapt the organism to the new conditions. Conversely, if such environmental changes demanded higher protein expression, genes containing -1 PRF signals may adapt by reducing p. If protein demands exceed the production capacity of a gene with very small p, the -1 PRF mechanism may offer only costs, but no benefits from protein regulation. Then, mutations that remove the heptameric slippery site signature will be selected for. Taken together, these effects imply an asymmetry in the probability of slippery site loss depending on the direction of protein demand changes. Losses of slippery sites are likelier when protein demands increase than when they decrease. -1 PRF signals should thus become rarer with increasing protein expression levels. The second hypothesis states that the costs of -1 PRF maintenance are reduced as slippery sites approach the 5′ end or start of the mRNA. Every time ribosomes slip through -1 PRF, the energy of translating a stretch of length of mRNA is wasted for protein production. Since λ is roughly constant across within-gene sites (see Fig. S1), the cost to translational efficiency incurred from -1 PRF should depend on l. Therefore, diminishing l will minimize cost. Thus, the distribution of slippery site positions relative to the gene length should be skewed toward mRNA starts, that is, small l values. This effect is expected to be weaker than the effect in the first hypothesis. In the first hypothesis, if protein demands exceed yields generated by a gene and p is small, a fitness cost is incurred from i) the reduction of protein expression, ii) the cost of NMD-mediated mRNA degradation and iii) the cost of unnecessary translation of frameshifted mRNA sequences. Displacing the slippery site would only reduce the latter cost. Thus, we expect the signal for this hypothesis to be weaker than for the first hypothesis. The third hypothesis states that a skew in the distribution of the location of the slippery site within a gene (relative to that gene’s length) should also become more pronounced with higher expression levels. This is because for fixed p, increasing expression levels should mirror an increase in -1 PRF induced translation costs. Thus, the benefits of slippery sites closer to translation-initiation positions should increase with expression levels. As in the second hypothesis, selection for such an effect is likely to be very weak compared to effects in the first hypothesis. Figure 2 shows the results of testing all three hypotheses. Figure 2A shows that -1 PRF signals in genes become rarer with gene expression, consistent with our first hypothesis. Figure 2B shows that slippery sites positions are more prevalent in the first half of a gene than in the second, consistent with our second hypothesis. The frequency of slippery sites diminishes toward both extremes of mRNA sequences, and more markedly so toward the stop codon of the mRNA. This skew persists when including all genes from the PRFdb (see Figure S2). Figure 2C shows that the average slippery site position is displaced toward the 5′ end of the mRNA as protein expression levels increase from 102 molecules per cell to 105 molecules per cell. Above 105 molecules per cell, the uncertainty around the averages becomes large due to low sample sizes, and unambiguous deductions become impossible. The decreasing trend in the average with protein expression is confirmed by a linear regression. Lastly, we retested all of the hypotheses with an alternative integrated data set from PaxDB, and obtained the same results (see Figure S3).

Figure 2

Testing the cost of -1 PRF mechanism maintenance. A) Number of slippery sites –identified by relying on Nupack algorithm for pseudoknot detection (Dirks and Pierce 2004)– per gene and per nucleotide across protein expression levels per mRNA level for genes from the von der Haar data set (von der Haar 2008). The red dotted line is the average number of slippery sites per gene. The blue line is a regression line through the data set. The text in the panels gives i) the Spearman correlation coefficient r and the p-value for the null-hypothesis that the correlation is zero and ii) the slope of the line and the p-value of the t-test for a non-zero slope value. B) The frequency distribution of the within-gene positions of the slippery sites, relative to the length of the gene, To ensure comparability, only genes from the von der Haar data set are considered. The mean of the distribution, as well as its 95% confidence intervals, are smaller than 0.5 -the expectation in the absence of selective pressure. C) Slippery site positions relative to gene length across protein expression levels in the von der Haar data set. The red line is the average slippery site position (computed across 20 bins of equal width in logarithmic scale) and the widgets are the uncertainty ( standard error of the mean) around the average estimate. Averages with large uncertainties are in violet. Analogously to A), a regression line with the corresponding correlation coefficient r (with p-value) and slope (with p-value) are added.

Retesting the translational efficiency hypothesis while accounting for -1 PRF genes

Since -1 PRF signals carry a translational cost, their presence in an mRNA sequence changes the meaning of the associated CUB index. In fact, a slippery site may indicate a mismatch between the CUB index of that sequence and the actual efficiency with which it is translated (see Figure 3).

Figure 3

-1 PRF signals bias codon usage bias indices’ predictiveness of translation efficiency. A) An mRNA sequence (green line) without a -1 PRF signal. Within a given period of time, the transcribed mRNA in the cell will produce a finite number of proteins. We define the efficiency of the protein synthetisation process as the energy content of the proteins (the nominator) divided by the energy expended by the mRNA transcription and translation processes (the denominator) (Tuller ). The value of a codon usage bias index should reflect the efficiency of the protein synthesis process under the TEH. B) The same mRNA considering a slippery site (green point). The slippery site redirects ribosomes to a -1 shifted frame, where they encounter premature termination codons (PTC). The same number of proteins are produced as in A). An additional cost, unaccounted for in A), is incurred for protein production due to NMD-mediated protein decay. Thus, the effective translation efficiency when accounting for -1 PRF must be smaller than A). This diminished efficiency should be reflected by a correction in the codon usage bias index value. To elucidate this point, consider an mRNA without -1 PRF signal. In the -1 PRF signal’s absence, the CUB index of a translated mRNA sequence appropriately reflects how efficiently ribosomes elongate it in a pre-specified time (see Figure 3A) (Tuller ; Stoletzki and Eyre-Walker 2007; Akashi 1994). However, given a functional slippery site in that same mRNA sequence, the interpretation changes (see Figure 3B). Let us assume an average -1 PRF efficiency p. Then, a fraction p of the times in which a ribosome is in the process of translating a specific mRNA, it will be redirected to another frame by -1 PRF. This redirection leads to degradation of the mRNA. Thus, the production of the same number of proteins will require more energy and time in the presence of a -1 PRF signal than it would in its absence. However, traditional CUB indices will give no indication of that process. If accounted for, -1 PRF must therefore lead to a downward correction of the mRNA’s translational efficiency and with it, of the associated CUB index. Since a substantial fraction of genes may contain -1 PRF signals (hypothesized to be 10% of genes in yeast (Belew )), neglecting this effect might introduce considerable biases in CUB index values. These biases could affect the associations found between CUB and protein expression (Ghaemmaghami ; Jansen ; Kliman ; Roth ; Quax ). To address this issue, in the following we propose two general corrections for CUB indices in the presence of -1 PRF signals, and retest the basis for the TEH stemming from CUB and protein expression correlations.

Codon Usage Bias Index Correction from Translation Efficiency:

In a first approach, we derive an estimator for a CUB index correction from basic relationships between CUB indices and translation efficiencies of an mRNA. In the absence of a -1 PRF signal, the CUB on x will be measured by some index function As noted, if a functional -1 PRF signal is present, does not account for the translation efficiency loss due to the unproductive translation of the hybrid sequence (see Figure 1). We aim to derive a “corrected” index that more appropriately reflects the diminished translation efficiency in the presence of -1 PRF signals. To find an expression for we begin with the assumption that there exists a monotonous mapping F that maps the translation efficiency of a sequence x (in the absence of -1 PRF) to a codon usage bias index I: F is a monotonously increasing function, where increases in translation efficiency are reflected by increases in the codon usage bias index. Research by Tuller et al. further suggests that F is concave (Tuller ), such that codon usage bias index values saturate with increasing translation efficiency values. We define translation efficiency classically as an input-output energy ratio. More specifically, is the ratio of the total per-protein energy contained in the synthesized proteins from x to the energy, used for producing those proteins in a sufficiently long time frame Here, the energy going into the synthesization machinery, does not contain the expenditure for -1 PRF’s regulatory use. Thus, for a sequence x that additionally carries a functional slippery site, we haveHere, the same number of proteins were produced as when -1 PRF was not considered, that is, Since the final proteins are structurally equivalent to those in the absence of -1 PRF, we also have However, more energy was expended to produce these proteins, and therefore Their synthesization requires at least A part of the expended energy on translation is not implemented in the proteins, but in the translation of and the NMD pathway activation, Thus, it follows that The corrected usage bias statistic is defined asSince the form of F as well as the values of , and are either difficult to measure or unknown, we aim to to compute indirectly from To this end, we separate the translation efficiency into two components:Solving for giveswhere With this, we haveFurther analysis of the ratio is complicated by the inability to directly measure and To address this issue, we pursue an approach where the ratio is approximated by information about the relative one-elongation energies spent translating x and More specifically, we assume that each time x is translated, an energy input of is expended per time unit The total number of times translation is initiated on the mRNA sequence, is This is because corresponds to the fraction of times that an mRNA-elongating ribosome remains in frame. Analogously, the number of times translation is interrupted by a -1 PRF event is Hence, each time the sequence is translated, an energy expenditure of is ensued. Note that does not correspond to a protein number. The ratio between these two is approximated by:which is independent of the number of synthesized proteins Here, we have assumed that ratio of the per-run translation energies expended for the non-frameshifting to the frameshifting scenarios corresponds roughly to the ratio of the lengths of the translated sequences to x, respectively. However, there is an extra cost to each frameshifted mRNA, stemming from NMD-mediated degradation. This cost is accounted for by π, which is an unknown fixed cost associated with each -1 PRF event. π is measured in units of equivalents of translation cost per nucleotide. With (9), we can approximate the With this, we are ready to address the last approximation required to find an analytical expression for Since F is concave, it follows that for Thus, setting and using definition (4), we find a lower bound for The approximation for results in reasonable and useful properties of More specifically, as we have that This entails that no mRNA is translated into proteins, as expected. The same follows if The extraction of a from within the brackets leads to the undesirable behavior that as or we have that Instead, should approximate some minimal value A major drawback of is its reliance on p and π, which are both unknown. Typically, p is assumed to lie between but may reach up to (for example the gene in yeast, (Advani )). In the following, and if not stated otherwise, we assume that p is constant across all genes. It is unclear how large π should be. We have found no evidence that the value of π is dependent on any characteristic of mRNA, such as its length, or codon composition (Chang ; Dinman, 2012b). NMD is an evolutionarily conserved surveillance pathway (Chang ). Its activation may thus differ in energy expenditure between organisms (indeed, in yeast, NMD does not require an exon-junction complex, unlike most other eukaryotes). However, we have found no indication that the energy required to fully complete the NMD process will substantially vary from mRNA to mRNA within an organism. Thus, we assume that π is uniform across genes, although we do not know how large it is relative to other energy inputs. Again, if not stated otherwise, we assume that in order to give a conservative estimate of the effects of -1 PRF on the TEH.

Codon Usage Bias Index Correction from Averaging:

In a second approach, we define a correction for a -1 PRF-aware CUB index using a balancing principle. To this end, we add the CUB index in successful, non-NMD mediated translations of x and the index of untranslated while weighing both with their respective probability of occurrence:The corrected index is thus the expected value of I when considering that x is translated a fraction of the time, and is translated a fraction p of the time. therefore acts as a penalization function. may underestimate the extra energy costs to -1 PRF incurred from the activation of the NMD-mediated degradation processes, since these are not comprised in Similarly to the major drawback of the estimator is its dependence on p. As before, we thus evaluate for different, plausible values of p to assess -1 PRF’s effect on the TEH. However, has the advantage that it is also well defined for

Reexamining the TEH with corrected CUB indices:

Both corrections and represent a change in the value of a codon usage bias index given a -1 PRF efficacy. The models for correcting I presented here do not relate these measures to protein expression levels, P. They do therefore not predict protein expression. Instead, they allow us to reexamine the association between protein expression and a corrected codon usage bias index that in part underpins the TEH. To examine how these corrections affect associations between CUB and protein expression, we used the widely employed codon adaptation index (CAI) (Carbone ) as an example for a CUB index I. We then computed both corrected and for different values of p and for all genes in the von der Haar data set (von der Haar 2008) that contain -1 PRF signals. For we also assumed different values of π. A precise computation of and would require information about the value of p across genes. Since no such information is available as of now, we follow two approaches. First, we treat all -1 PRF sites equally, assuming an equal p value for all, independent of the protein expression levels P of the genes they are located in. Second, we assume that p is associated with P, and allow p to slowly decrease with the P of the gene. Figure 4 shows that correcting for the presence of -1 PRF signals with a -1 PRF efficiency of up to does not substantially affect the CUB index to protein expression relationship for both CUB index corrections, if the extra cost of mRNA degradation by NMD is neglected (). Figure 4A) shows that all CAI values for all genes are diminished when using the correction A substantial correlation between protein expression and corrected CAI index remains. Accounting for the frameshifted sequences also diminishes codon adaptation measures considerably compared to uncorrected CUB index values, as shown in Figure 4B). However, unlike with this reduction does not correspond to a uniform negative offset. Instead, the effect of accounting for -1 PRF is to both shift and broaden the distribution of values relative to (see Figure S4).

Figure 4

Corrected and original codon adaptation indices vs. protein expression levels for genes containing algorithmically identified -1 PRF signals and with known expression levels (von der Haar 2008), assuming and A) The original codon adaptation index values of mRNA sequences x of genes from the (von der Haar 2008) data set are shown as orange, semi-transparent points. The shown genes all contain algorithmically identified -1 PRF signals. The shape of the relation between CAI and protein expression is similar to that shown in (Ghaemmaghami ) for all genes with known expression levels. Blue, semi-transparent points denote the values of -1 PRF corrected CAI values, vs. same protein expression levels. Regressions between the -protein expression levels and both corrected an original CAI values are shown as orange and blue lines, respectively. B) is analogous to A) using the corrected CUB index The two corrections and show different sensitivities to the -1 PRF efficiency. In particular, the sensitivity of to -1 PRF efficiency is mediated by the value of π. Figure 5A shows that the correlations between both and to protein expression levels decline for increasing values of p and different values of π. As p increases, and for low costs of NMD-mediated mRNA decay, the correlation of to protein expression levels declines to very low values ( at for ). If the energetic cost of -1 PRF induced mRNA degradation, π, becomes larger (), the correlation declines very rapidly with p, and almost vanishes when p reaches For estimates, correlations’ dependency on p is moderate, never falling below 0.4 at biologically implausible values of p of unity.

Figure 5

Correlation values between corrections of CAI to -1 PRF presence () to protein expression levels P (von der Haar 2008) for different -1 PRF efficiencies, p. A) For we considered different values of π, the cost of degrading an mRNA after -1 PRF, namely For correlation values were computed at -1 PRF efficiencies was computed at the same levels except for where instead was used. B) Analogous to A), but with p dependent on protein expression. For each gene, the -1 PRF efficiency is computed by where for the baseline -1 PRF efficiency, we chose the same values as for p in A). These results corroborate the support for the translation efficiency hypothesis. For uniform -1 PRF efficiencies and for -based estimates, correlations of corrected CAI to protein expression levels decline with p. They only become sufficiently diminished to challenge the TEH when both the values of π and p are very high. Indeed, according to our mathematical framework, corresponds to the cost of either translating a large gene, or tenfold that cost. Except for all correlation values between corrected CUB indices and protein expression levels remain at around for biologically relevant -1 PRF efficiencies of For estimates, correlation seems generally robust to changes in p.

Varying -1 PRF efficiency with protein expression levels:

Since the data in Figure 2 strongly suggest a cost to -1 PRF maintenance, we also explored how a -1 PRF efficiency decline with protein expression levels consistent with such cost would affect the correlations in Figure 5A. Unlike a uniform p across protein levels, corrections to are expected to be larger at low expression levels than at high levels. The -1 PRF decline is assumed as follows: where we call the baseline -1 PRF efficiency. For this additional assumption leads to a measure:Analogously, becomesFigure 5B shows that under this assumption, correlations between protein expression and rise with larger baseline -1 PRF efficiencies for all except the largest values of π, while again remains unaffected. We observe analogous results using the Spearman rank correlation (see Figure S5). When -1 PRF efficiencies decline with protein expression –in accordance with a cost to -1 PRF– most correlations between -based estimates and protein expression rise with -1 PRF baselines. Except for all correlation values increase with -1 PRF baselines. This also adds support for the TEH, suggesting that -1 PRF conceals stronger associations than measured with uncorrected codon usage bias indices.

Discussion

In this study, we have investigated the hypothesis that -1 PRF presupposes translational costs to an organism, while at the same time generating benefits associated with protein expression regulation. We explored whether such a cost might be identified more directly in data and used both PRFdb (Belew ) and the von der Haar data set (von der Haar 2008) to address this question. We devised three hypotheses for likely signals of such cost in -1 PRF carrying genes. We could not find contradictory evidence for any of those hypotheses in the data. In a second step, we explored whether these costs, if not accounted for, bias important measures of CUB. We devised two new general approaches to correct CUB indices for the presence of -1 PRF signals. We then tested whether the concealment of such costs may unduly influence, falsify or strengthen, one classical test of the translational efficiency hypothesis: the association of CUB indices with protein expression levels. Under the assumption of uniform -1 PRF efficiencies, the energetic costs related with NMD activation would need to be implausibly high to warrant this conclusion. We find that on the contrary, assuming that -1 PRF efficiency decreases with protein expression levels -as suggested by the existence of a cost to -1 PRF-, the TEH is strengthened. Thus, taken together, our results suggests that high organismal demands for specific proteins are reflected in CUB-mediated translation efficiency gains. Our study comes with a series of caveats. A first caveat of this study lies in that the slippery sites are inferred (utilizing the Nupack algorithm (Dirks and Pierce 2004) for mRNA pseudoknot detection), but not always experimentally confirmed (Advani ). Previous studies suggest that inferred slippery sites are likely to be functional. In (Advani ) it is reported that in previous work, nine out of nine high confidence -1 PRF sites detected by methods used in the PRFdb in S. cerevisae were confirmed to be functional in vivo. Crucially, whether a -1 PRF site is regarded to be functional depends on whether it exceeds a predetermined p-threshold. For example, in (Advani ), a search of -1 PRF slippery sites identified 10 candidate genes in , three in , 2 in STN, and 1 in CDC in mRNA involved in telomerase. Out of these, and employing a cutoff of 1% for -1 PRF efficiency, seven carry functional slippery sites (: 3, : 2, STN: 1, CDC: 1) (see (Advani ), Table 1). Had a cutoff of 0% -1 PRF efficiency been employed, all slippery sites except one would be functional. This limited sample provides confidence that the methods described by (Belew ) appropriately capture biological mechanisms. Even with such uncertainty, it is unlikely that the presence of non-functional -1 PRF sites in the analyzed data would affect the claim of an intrinsic -1 PRF maintenance cost (Figure 2). This claim would only be biased if the probability of a candidate site to be functional were affected by either protein expression levels of the gene within which it resides or alternatively, the relative position in the gene of the putative slippery site. Note that these effects should arise from mechanisms that are independent of the ones studied here. Again, we are unaware of any such mechanisms acting in yeast, except for the requirement of a spacer sequence. A spacer sequence of a minimum length of say, 8 nt, will prohibit slippery sites to be located within 8 nt of the mRNA end. This restriction will slightly bias the a priori position of candidate slippery sites. However, as average mRNA by far exceed this length, such a restriction cannot explain the effects documented here. Another caveat of our study was the lack of estimates for p for the whole data set. Due to this restriction, we could not explore how -1 PRF efficiency p relates to CAI or protein expression. As assumed in Figure 5B, we expect that -1 PRF maintenance costs should on average translate into an inverse relationship between p and protein expression levels, because higher protein expression demands by the organism should be countered with reductions of p, minimizing the loss of mRNA to NMD-induced degradation. However, it is also possible that there exist cases in which -1 PRF has important functions in highly expressed genes. Hence, although there are fewer slippery sites for highly expressed genes and they are costly in terms of translation efficiencies, these sites could have large -1 PRF efficiencies. We have not found evidence in the literature to support this notion, and it would concomitantly contradict the evidence here presented. Additional testing of our hypotheses on the costs -1 PRF maintenance would be greatly helped if information on how -1 PRF efficiencies vary with protein expression was available. While our analysis of the behavior of gives us an indication how -1 PRF costs to translation efficiency could bias CUB indices, these insights rely on the assumptions made during ’s derivation. Importantly, is a lower bound to which means that downward corrections of are potentially exaggerated. Thus, the claim that corrected CUB indices will leave the basis for the TEH unaffected is resistant to such a bias. Another key assumption is that CUB indices of an mRNA sequence should increase monotonically with translation efficiency. In practice, this association is surely not perfect, due to multiple additional influences on CUB from unrelated biological processes. For these processes to systematically bias our results they should i) dominate over translational efficiency effects or ii) result in directional effects when combined. Despite these possible shortcomings, the assumption reflects key properties with which many CUB indices are designed, namely to mirror translational efficiency. For example, the CAI uses ribosomal mRNA as a reference to compute preferred codons because it is highly plausible that they are translationally efficient. Moreover, our results suggest that utilizing expected values for corrected CUB index definition, such as in the case of is suboptimal. Analysis of the distribution of levels with of genes in the von der Haar data set show that corrected codon adaptation can increase. This behavior contradicts the rationale behind introducing such corrections in the first place, as, surprisingly, can occur. To test the hypotheses about the TEH derived here, we have only utilized genes with identified slippery sites, and not all genes. Testing on all genes would be indicated if the TEH was challenged. However, if the correlation coefficient measured within a subset in which a correction has been applied is not substantially diminished relative to uncorrected values, the same correction will not affect the whole set either. Therefore testing it on the whole set is not necessary. The results indicating intrinsic -1 PRF maintenance cost and TEH support are in mutual agreement. -1 PRF maintenance induces a reduction of the size of l with increasing protein expression levels (Figure 2C). At fixed p, as assumed in our analysis, significant reductions in l would diminish the penalization to the original, un-corrected CUB. This could only be compensated by increases of p or π with protein expression levels, contrary to intuition and available evidence (Advani ). In fact, our analysis shows that a more plausible p dependency to protein expression leads to a strengthening of the basic CUB index to expression level association. These results have to be interpreted in the context of current codon usage bias research. The mechanism of translational selection (TEH) remains the main explanation put forward for selection based origins of codon bias. This explanation presupposes that silent mutations affect fitness via translation processes. These fitness increases originate from translation efficiency and accuracy gains, that in turn are hypothesized to stem from translation initiation and elongation processes. In fact, mRNA elongation rates do indeed appear to correlate positively with preferred codon frequencies (Gustafsson ), although better evidence would be desirable. However, for initiation processes –which are assumed to contribute the bulk of these fitness increases–, the effects of CUB on initiation rate increase remain subject of debate (Quax ). In fact, Tuller and Zur have analyzed the effect of the structure of the 5′ end of an mRNA on translation initiation and elongation rates and found various regulatory signals that affect these rates in different ways (Tuller and Zur 2015). Indeed, the induced folding at the 5′ end of the ORF appears to affect translation efficiency. Overall, the evidence from codon usage bias statistics and its associations to tRNA abundance and protein expression offers a compelling narrative for the TEH. However, current research efforts aiming to identify the exact mechanisms that give rise to these associations must account for conflating selective forces. Indeed, the mechanisms laid out in the TEH may not be the only way in which selection shapes codon usage frequencies. For example, besides the abundance of tRNAs, other factors have been discovered to crucially affect elongation rates and hence, to be possible targets selection (Chamary ; Resch , 2009; Hunt ). More precisely, specific synonymous changes can influence mRNA splicing, mRNA secondary structure, protein stability as well as protein folding (Komar 2009; Chamary ; Carlini and Genut 2006; Parmley ; Tsai ; Komar 2007). Synonymous changes may also alter the secondary structure of mRNA and thus affect the rate of translation –as established in vitro (Ivanov ; Parmley and Hurst 2007; Tuller and Zur 2015) and subsequently in vivo (Kimchi-Sarfaty ). A comprehensive review on how protein expression is fine tuned by codon usage bias is given by Quax (Quax ). Further, in a very recent study, a synonymous difference between mammalian cytoskeletal β- and γ-actin proteins was found to affect co-translational processing, ubiquitination, and co-translational degradation, leading to differential stability properties of the corresponding protein products (Zhang ). Within genomes, the biologial role of a gene influences that gene’s CUB in various additional ways. Some codons are more abundant in genes depending on their function, displaying distinct codon bias patterns. Supek has reviewed how gene function modifies codon preferences (Supek 2016). Selective pressure to maintain (or alter) gene function is superimposed to what is expected from translational efficiency and accuracy optimization. Genes where altered CUB patterns have been found are involved in diverse functions: amino acid starvation responses, cyclical protein expression, tissue specific expression, cellular differentiation, stress responses, and carcinogenesis. While genes can differ in function, there are also differences in function within a gene’s sequence that also affect local CUB. For example, Tuller and Zur have surveyed the multiple roles of the 5′ end of coding sequences in gene expression regulation (Tuller and Zur 2015). They hypothesize that due to multitude of regulatory signals found in that region, selection pressures regarding codon utilization are likely different than in other regions. Unlike cross-gene function, such effects are local, which should not affect our analysis. How effects from gene function may affect our results, crucially depends on their frequency and direction. The literature reviewed here does not appear to warrant the assumption that all the function-dependent selective pressures will align to influence CUB in the same way, leading to systematic bias. The phenomenon of -1 PRF is thus only one of many ways in which CUB may be influenced. The particular appeal of the -1 PRF phenomenon lies in its potential to elucidate many of these processes. Because slippery sites are precisely localized and -1 PRF efficiencies measurable, -1 PRF signals constitute natural experiments to translation efficiency and accuracy theories. This is because the separation of the mRNA by a slippery site should create differential translational costs across that mRNA. This translational cost gradient should be reflected in codon usage bias differences. -1 PRF based approaches to analyzing codon usage bias behavior, like the one presented in this study, may thus offer novel tools to better understand the means by which translation efficiency gains are realized in nature.

73 in total

1. Genome-wide analysis of mRNA translation profiles in Saccharomyces cerevisiae.

Authors: Yoav Arava; Yulei Wang; John D Storey; Chih Long Liu; Patrick O Brown; Daniel Herschlag
Journal: Proc Natl Acad Sci U S A Date: 2003-03-26 Impact factor: 11.205

Review 2. Evolution of synonymous codon usage in metazoans.

Authors: Laurent Duret
Journal: Curr Opin Genet Dev Date: 2002-12 Impact factor: 5.578

Review 3. Translational recoding in archaea.

Authors: Beatrice Cobucci-Ponzano; Mosè Rossi; Marco Moracci
Journal: Extremophiles Date: 2012-09-27 Impact factor: 2.395

4. The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications.

Authors: P M Sharp; W H Li
Journal: Nucleic Acids Res Date: 1987-02-11 Impact factor: 16.971

5. Correlation between the abundance of yeast transfer RNAs and the occurrence of the respective codons in protein genes. Differences in synonymous codon choice patterns of yeast and Escherichia coli with reference to the abundance of isoaccepting transfer RNAs.

Authors: T Ikemura
Journal: J Mol Biol Date: 1982-07-15 Impact factor: 5.469

Review 6. Synonymous but not the same: the causes and consequences of codon bias.

Authors: Joshua B Plotkin; Grzegorz Kudla
Journal: Nat Rev Genet Date: 2010-11-23 Impact factor: 53.242

7. Experimental reduction of codon bias in the Drosophila alcohol dehydrogenase gene results in decreased ethanol tolerance of adult flies.

Authors: David B Carlini
Journal: J Evol Biol Date: 2004-07 Impact factor: 2.411