Cis-regulatory elements explain most of the mRNA stability variation across genes in yeast.

Jun Cheng^1,2, Kerstin C Maier^3, Žiga Avsec^1,2, Petra Rus^3, Julien Gagneur^4,2.

Abstract

The stability of mRNA is one of the major determinants of gene expression. Although a wealth of sequence elements regulating mRNA stability has been described, their quantitative contributions to half-life are unknown. Here, we built a quantitative model for Saccharomyces cerevisiae based on functional mRNA sequence features that explains 59% of the half-life variation between genes and predicts half-life at a median relative error of 30%. The model revealed a new destabilizing 3′ UTR motif, ATATTC, which we functionally validated. Codon usage proves to be the major determinant of mRNA stability. Nonetheless, single-nucleotide variations have the largest effect when occurring on 3′ UTR motifs or upstream AUGs. Analyzing mRNA half-life data of 34 knockout strains showed that the effect of codon usage requires not only functional decapping and deadenylation but also the 5′-to-3′ exonuclease Xrn1 and the nonsense-mediated decay genes, whereas it does not require no-go decay. Altogether, this study quantitatively delineates the contributions of mRNA sequence features to stability in yeast, reveals their functional dependencies on degradation pathways, and allows accurate prediction of half-life from mRNA sequence.

Keywords: cis-regulatory elements; codon optimality; mRNA half-life

Year: 2017 PMID: 28802259 PMCID: PMC5648033 DOI: 10.1261/rna.062224.117

Source DB: PubMed Journal: RNA ISSN: 1355-8382 Impact factor: 4.942

INTRODUCTION

The stability of messenger RNAs is an important aspect of gene regulation. It influences the overall cellular mRNA concentration, as mRNA steady-state levels are the ratio of synthesis and degradation rate. Moreover, low stability confers high turnover to mRNA and, therefore, the capacity to rapidly reach a new steady-state level in response to a transcriptional trigger (Shalem et al. 2008). Hence, stress genes, which must rapidly respond to environmental signals, show low stability (Miller et al. 2011; Zeisel et al. 2011; Marguerat et al. 2014; Rabani et al. 2014). In contrast, high stability provides robustness to variations in transcription. Accordingly, a wide range of mRNA half-lives is observed in eukaryotes, with typical variations in a given genome spanning one to two orders of magnitude (Schwanhäusser et al. 2011; Eser et al. 2016; Schwalb et al. 2016). Also, significant variability in mRNA half-life among human individuals could be demonstrated for about a quarter of genes in lymphoblastoid cells and estimated to account for more than a third of the gene expression variability (Duan et al. 2013). How mRNA stability is encoded in a gene sequence has long been a subject of study. Cis-regulatory elements (CREs) affecting mRNA stability are mainly encoded in the mRNA itself. Here we use the formal definition of CRE, i.e., a regulatory element affecting expression of the gene it belongs to in an allele-specific manner (Rockman and Kruglyak 2006; Skelly et al. 2009). CREs affecting mRNA stability include but are not limited to secondary structure (Rabani et al. 2008; Geisberg et al. 2014), sequence motifs present in the 3′ UTR including binding sites of RNA-binding proteins (Olivas and Parker 2000; Duttagupta et al. 2005; Shalgi et al. 2005; Hogan et al. 2008; Hasan et al. 2014), and, in higher eukaryotes, microRNAs (Lee et al. 1993). Moreover, translation-related features are frequently associated with mRNA stability. 
For instance, inserting strong secondary structure elements in the 5′ UTR or modifying the translation start codon context strongly destabilizes the long-lived PGK1 mRNA in S. cerevisiae (Muhlrad et al. 1995; LaGrandeur and Parker 1999). Codon usage, which affects the translation elongation rate, also regulates mRNA stability (Hoekema et al. 1987; Presnyak et al. 2015; Bazzini et al. 2016; Mishima and Tomari 2016). Further correlations between codon usage and mRNA stability have been reported in E. coli and S. pombe (Boël et al. 2016; Harigaya and Parker 2016). Adjacent codon pairs were also demonstrated to associate with mRNA decay in addition to individual codons in S. cerevisiae (Harigaya and Parker 2017). Since the RNA degradation machineries are well conserved among eukaryotes, the pathways have been extensively studied using S. cerevisiae as a model organism (Garneau et al. 2007; Parker 2012). The general mRNA degradation pathway starts with the removal of the poly(A) tail by the Pan2/Pan3 (Brown et al. 1996) and Ccr4/Not complexes (Tucker et al. 2001). Subsequently, mRNA is subjected to decapping carried out by Dcp2 and promoted by several factors, including Dhh1 and Pat1 (Pilkington and Parker 2008; She et al. 2008). The decapped and deadenylated mRNA can be rapidly degraded in the 3′ to 5′ direction by the exosome (Anderson and Parker 1998) or in the 5′ to 3′ direction by Xrn1 (Hsu and Stevens 1993). Further mRNA degradation pathways are triggered when aberrant translational status is detected, including nonsense-mediated decay (NMD), no-go decay (NGD), and nonstop decay (NSD) (Garneau et al. 2007; Parker 2012). Despite all this knowledge, prediction of mRNA half-life from a gene sequence is still not established. Moreover, most of the mechanistic studies so far were only performed on individual genes or reporter genes. It is therefore unclear how the measured effects generalize genome-wide. 
A recent study showed that translation-related features can be predictive for mRNA stability (Neymotin et al. 2016). Although this analysis supported the general correlation between translation and stability (Lackner et al. 2007), the model was not based purely on sequence-derived features. It also contained measured transcript properties such as ribosome density and normalized translation efficiencies. Hence, the question of how half-life is genetically encoded in mRNA sequence remains to be addressed. Additionally, the dependencies of sequence features on distinct mRNA degradation pathways have not been systematically studied. One example of this is codon-mediated stability control. Although a causal link from codon usage to mRNA half-life has been shown for a wide range of organisms (Hoekema et al. 1987; Presnyak et al. 2015; Bazzini et al. 2016; Mishima and Tomari 2016), the underlying mechanism remains poorly understood. In S. cerevisiae, reporter gene experiments showed that codon-mediated stability control depends on the RNA helicase Dhh1 (Radhakrishnan et al. 2016). However, it is unclear whether this generalizes to all mRNAs genome-wide. Also, the role of other closely related degradation pathways has not been systematically assessed with genome-wide half-life data. Here, we mathematically modeled mRNA half-life as a function of its sequence. Applied to S. cerevisiae, our model can explain most of the between-gene half-life variance from sequence alone. Using a semimechanistic model, we could interpret individual sequence features in the 5′ UTR, coding region, and 3′ UTR. Quantification of the respective contributions revealed that codon usage is the major contributor to mRNA stability. Applying the modeling approach to S. pombe supports the generality of these findings. Moreover, we systematically assessed the dependencies of these sequence elements on mRNA degradation pathways using half-life data for 34 knockout strains. This analysis revealed, in particular, novel pathways through which codon usage affects half-life.
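
The kinetic argument that opens this introduction (steady-state level as the ratio of synthesis to degradation, and faster response for short-lived mRNAs) can be illustrated with a minimal sketch. It assumes simple first-order decay, dm/dt = s − d·m, which is a common textbook model and not necessarily the one used in the cited studies; the function names are hypothetical.

```python
import math

def degradation_rate(half_life):
    """First-order degradation rate constant d = ln(2) / half-life."""
    return math.log(2) / half_life

def steady_state(synthesis_rate, half_life):
    """Steady-state mRNA level: synthesis rate divided by degradation rate."""
    return synthesis_rate / degradation_rate(half_life)

def time_to_fraction(half_life, fraction=0.9):
    """Time to cover `fraction` of the distance to a new steady state
    after a step change in synthesis rate (assumes dm/dt = s - d*m)."""
    return -math.log(1.0 - fraction) / degradation_rate(half_life)

# Illustrative numbers only: a short-lived and a long-lived transcript.
for hl in (5.0, 50.0):
    print(hl, steady_state(1.0, hl), time_to_fraction(hl))
```

Under this assumption the response time scales linearly with half-life, which is why low stability allows rapid adjustment to a transcriptional trigger.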

RESULTS

To study cis-regulatory determinants of mRNA stability in S. cerevisiae, we chose the data set by Sun et al. (2013), which provides genome-wide half-life measurements for 4388 expressed genes of a wild-type laboratory strain and 34 strains knocked out for RNA degradation pathway genes (Fig. 1; Supplemental Table S1). When applicable, we also investigated half-life measurements of S. pombe for 3614 expressed mRNAs in a wild-type laboratory strain from Eser et al. (2016). We considered sequence features within five overlapping regions: the 5′ UTR, the start codon context, the coding sequence, the stop codon context, and the 3′ UTR. We assessed their effects in the wild type and in the 34 knockout strains (Fig. 1). Finally, we fitted a joint model to assess the contribution of individual sequence features and their single-nucleotide effects (Fig. 1). In all analyses, we considered the logarithm of half-life as the response variable rather than half-life in the natural scale. The primary motivation for choosing a logarithmic scale is that measurement noise for half-life is typically multiplicative. Also, the data did not provide supportive evidence discriminating between multiplicative and additive effects of the cis-regulatory elements on half-life (Supplemental Information). For simplicity, we used linear regressions, which, given the logarithmic response, correspond to multiplicative models.
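
To make the last point concrete, the following minimal sketch shows why a linear model on log half-life is multiplicative on the natural scale. The feature names and coefficient values are hypothetical and chosen for illustration only; they are not the fitted values of this study.

```python
import math

def predict_half_life(intercept, coefficients, features):
    """Half-life predicted by a linear model fitted on log(half-life)."""
    log_half_life = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return math.exp(log_half_life)

def fold_change(coefficients, name, delta):
    """Multiplicative effect on half-life of changing one feature by `delta`."""
    return math.exp(coefficients[name] * delta)

# Hypothetical coefficients (assumption, not taken from the paper).
coef = {"n_uAUG": -0.3, "n_ATATTC": -0.2}
base = predict_half_life(2.5, coef, {"n_uAUG": 0, "n_ATATTC": 1})
with_uaug = predict_half_life(2.5, coef, {"n_uAUG": 1, "n_ATATTC": 1})
print(with_uaug / base, fold_change(coef, "n_uAUG", 1))
```

Whatever the other features are, adding one unit of a feature multiplies the predicted half-life by the same factor, exp(coefficient).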

FIGURE 1.

Study overview. The goal of this study is to discover and integrate cis-regulatory mRNA elements affecting mRNA stability and assess their dependence on mRNA degradation pathways. (Data) We obtained S. cerevisiae genome-wide half-life data from wild-type (WT) as well as from 34 knockout strains from Sun et al. (2013). Each of the knockout strains has one gene closely related to mRNA degradation pathways knocked out. (Analysis) We systematically searched for novel sequence features associating with half-life from 5′ UTR, start codon context, CDS, stop codon context, and 3′ UTR. Effects of previously reported cis-regulatory elements were also assessed. Moreover, we assessed the dependencies of different sequence features on degradation pathways by analyzing their effects on the knockout strains. (Integrative model) We built a statistical model to predict genome-wide half-life solely from mRNA sequence. This allowed the quantification of the relative contributions of the sequence features to the overall variation across genes and assessing the sensitivity of mRNA stability with respect to single-nucleotide variants.

The correlations between sequence lengths, GC contents and folding energies (Materials and Methods) with half-life and corresponding P-values are summarized in Supplemental Table S2 and Supplemental Figures S1–S3. In general, sequence lengths correlated negatively with half-life and folding energies correlated positively with half-life in both yeast species, whereas correlations of GC content varied with species and gene regions. In the following subsections, we describe first the findings for each of the five gene regions and then a model that integrates all these sequence features.

Upstream AUGs destabilize mRNAs by triggering nonsense-mediated decay

Occurrence of an upstream AUG (uAUG) associated significantly with shorter half-life (median fold-change = 1.37, P < 2 × 10^−16). This effect was strengthened for genes with two or more uAUGs (Fig. 2A,B). Among the 34 knockout strains, the association between uAUG and shorter half-life was almost lost only for mutants of the two essential components of the nonsense-mediated mRNA decay (NMD) UPF2 and UPF3 (Leeds et al. 1992; Cui et al. 1995), and for the general 5′ to 3′ exonuclease Xrn1 (Fig. 2A; Supplemental Fig. S6). The dependence on NMD suggested that the association might be due to the occurrence of a premature stop codon. Consistent with this hypothesis, the association of uAUG with decreased half-lives was only found for genes with a premature stop codon cognate with the uAUG (Fig. 2C). This held not only for cognate premature stop codons within the 5′ UTR, leading to a potential upstream ORF, but also for cognate premature stop codons within the ORF, which occurred almost always for uAUG out-of-frame with the main ORF (Fig. 2C). This finding likely holds for many other eukaryotes as we found the same trends in S. pombe (Fig. 2D). These observations are consistent with a single-gene study demonstrating that translation of upstream ORFs can lead to RNA degradation by NMD (Gaba et al. 2005) and that uORFs are enriched in NMD substrates (Celik et al. 2017). Altogether, these results show that uAUGs are mRNA destabilizing elements as they almost surely match with cognate premature stop codons, which, whether in frame or not with the gene, and within the UTR or in the coding region, trigger NMD.
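
The grouping of uAUGs by their cognate stop codon can be sketched as follows. This is a minimal illustration, assuming DNA-alphabet sequences, a CDS that ends with its stop codon, and a uAUG lying entirely within the 5′ UTR; the function name and category labels are hypothetical and may differ from the procedure actually used in the study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def classify_uaugs(utr5, cds):
    """Return (position, category) for every ATG fully inside the 5' UTR.

    The cognate stop is the first stop codon in the reading frame of the
    uAUG, searched along the 5' UTR followed by the CDS.
    """
    utr5, cds = utr5.upper(), cds.upper()
    transcript = utr5 + cds
    utr_len = len(utr5)
    calls = []
    for start in range(utr_len - 2):
        if utr5[start:start + 3] != "ATG":
            continue
        in_frame = (utr_len - start) % 3 == 0  # same frame as the main ORF
        stop = None
        for pos in range(start + 3, len(transcript) - 2, 3):
            if transcript[pos:pos + 3] in STOP_CODONS:
                stop = pos
                break
        if stop is None:
            label = "no cognate stop"
        elif stop < utr_len:
            label = "uORF (stop in 5' UTR)"
        elif in_frame:
            label = "in-frame, no premature stop"
        else:
            label = "out-of-frame, premature stop in CDS"
        calls.append((start, label))
    return calls

print(classify_uaugs("CCATGAAATAGCC", "ATGGCCTAA"))
```

For an in-frame uAUG without a stop in the 5′ UTR, the first stop encountered is the main ORF's own stop codon, so no premature termination codon is called.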

FIGURE 2.

Upstream AUG codons (uAUG) destabilize mRNA. (A) Distribution of mRNA half-lives for mRNAs without uAUG (left) and with at least one uAUG (right). From left to right: wild type, XRN1, UPF2, and UPF3 knockout S. cerevisiae strains. Median fold-change (Median FC) was calculated by dividing the median of the group without uAUG by the median of the group with uAUG. A complete view of the effect of uAUG across different knockouts is provided in Supplemental Figure S6. (B) Distribution of mRNA half-lives for mRNAs with zero (left), one (middle), or more (right) uAUGs in S. cerevisiae. (C) Distribution of mRNA half-lives for S. cerevisiae mRNAs with, from left to right: no uAUG, with one in-frame uAUG but no cognate premature termination codon, with one out-of-frame uAUG and one cognate premature termination codon in the CDS, and with one uAUG and one cognate stop codon in the 5′ UTR (uORF). (D) Same as in C for S. pombe mRNAs. All P-values were calculated with Wilcoxon rank-sum test. Numbers in the boxes indicate number of members in the corresponding group. Boxes represent quartiles, whiskers extend to the highest or lowest value within 1.5 times the interquartile range, and horizontal bars in the boxes represent medians. Data points falling further than 1.5-fold the interquartile distance are considered outliers and are shown as dots.

Translation initiation sequence features associate with mRNA stability

Several sequence features in the 5′ UTR including the start codon context associated with mRNA half-life (Supplemental Information; Supplemental Figs. S4–S5). This indicates that 5′ UTR elements may affect mRNA stability by altering translation initiation. However, none of these sequence features remained significant in the final joint model. Our analysis is therefore not conclusive on this point. A detailed analysis is provided in the Supplemental Information for interested readers.

Codon usage regulates mRNA stability through common mRNA decay pathways

When using frequency of each codon as an independent covariate, codon usage marginally explained 55% of the between-gene half-life variation in S. cerevisiae on test data (linear regression, Materials and Methods, Fig. 3A). The species-specific tRNA adaptation index (sTAI) (Sabi and Tuller 2014) significantly positively correlated with the coefficients for codons in this regression [Supplemental Fig. S4E, r = 0.48 with log(sTAI), P = 0.0001, Materials and Methods], confirming the association between codon optimality and mRNA stability (Presnyak et al. 2015; Harigaya and Parker 2016). We also performed regression against gene-level sTAI. However, it yielded significant yet less accurate predictions (40% explained variance on test data). We therefore proceeded with modeling frequency of each codon as an independent covariate.
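
As a minimal sketch of how per-codon covariates could be derived from a coding sequence, the snippet below computes the relative frequency of each of the 61 sense codons. The normalization (codon count divided by the number of sense codons in the gene) is an assumption made for illustration; the exact definition used in the study is given in its Materials and Methods.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = [
    "".join(c) for c in product("ACGT", repeat=3)
    if "".join(c) not in STOP_CODONS
]  # the 61 coding codons

def codon_frequencies(cds):
    """Map each sense codon to its relative frequency in `cds`."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in set(SENSE_CODONS))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sense codons found in sequence")
    return {codon: counts[codon] / total for codon in SENSE_CODONS}

freqs = codon_frequencies("ATGGCTGCTTAA")
print(freqs["GCT"], freqs["ATG"])
```

Each gene then contributes one row of 61 values to the design matrix of the regression.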

FIGURE 3.

Codon usage regulates mRNA stability through common mRNA decay pathways. (A) Predicted mRNA half-life using only codons as features (linear regression) versus measured mRNA half-life. (B) mRNA half-life explained variance (y-axis, Materials and Methods) in wild-type (WT) and across all 34 knockout strains (grouped according to their functions). Each blue dot represents one replicate; bar heights indicate means across replicates. Bars with a red star are significantly different from the wild-type level (FDR < 0.1, Wilcoxon rank-sum test, followed by Benjamini–Hochberg correction).

Next, we quantified how much variation of mRNA half-life can be explained by codons in different knockout strains using the out-of-folds explained variance as a summary statistic (Supplemental Methods). The effect of codon usage exclusively depended on the genes from the common deadenylation- and decapping-dependent 5′ to 3′ mRNA decay pathway and the NMD pathway (all FDR < 0.1, Fig. 3B). In particular, all assessed genes of the Ccr4–Not complex, including CCR4, NOT3, CAF40, and POP2, were required for wild-type level effects of codon usage on mRNA decay. Among them, CCR4 has the largest effect. This confirmed a recent study in zebrafish showing that accelerated decay of nonoptimal codon genes requires deadenylation activities of Ccr4–Not (Mishima and Tomari 2016). In contrast to genes of the Ccr4–Not complex, PAN2/3 genes that also encode deadenylation enzymes were not found to be essential for the coupling between codon usage and mRNA decay (Fig. 3B). Furthermore, our results not only confirm the dependence on Dhh1 (Radhakrishnan et al. 2016) but also show a dependence on its interacting partner Pat1. The difference might come from the fact that we analyzed genome-wide half-life data, whereas mRNA half-life measurements from Radhakrishnan and colleagues were only performed on reporter genes.
Our systematic analysis revealed two additional novel dependencies: First, on the common 5′ to 3′ exonuclease Xrn1, and second, on UPF2 and UPF3 genes, which are essential players of NMD (all FDR < 0.1, Fig. 3B). Consistently, previous studies have shown that UPF genes are involved in more than just the degradation of nonsense messages, but rather target a wide range of mRNAs, including aberrant and normal ones (He et al. 2003; Hug et al. 2015). In line with this, substrates of Upf proteins have lower codon optimality (Celik et al. 2017). Furthermore, we did not observe any change of effect upon knockout of DOM34 and HBS1 (Fig. 3B), which are essential genes for the No-Go decay pathway. This implies that the effect of codon usage is unlikely due to stalled ribosomes at nonoptimal codons. Altogether, our analysis indicates that the so-called “codon-mediated decay” (Mishima and Tomari 2016) is not an mRNA decay pathway itself, but a regulatory mechanism of the common mRNA decay pathways.
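
The FDR thresholds used for the knockout comparisons rely on Benjamini–Hochberg correction of the per-strain P-values (Fig. 3B). A minimal sketch of that standard procedure is given here; the input P-values in the usage line are made up for illustration and are not results from this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Made-up P-values, one per hypothetical knockout strain.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

A strain would be flagged when its adjusted value falls below the chosen FDR threshold (0.1 in the analysis above).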

Stop codon context associates with mRNA stability

The first nucleotide 3′ of the stop codon significantly associated with mRNA stability. This association was observed for each of the three possible stop codons, and for each codon a cytosine significantly associated with lower half-life (Supplemental Fig. S4, also for P-values and fold-changes). However, this feature was not significant in the joint model, and analysis of the knockout strains did not reveal clear pathway dependencies for it (Supplemental Fig. S6). A detailed description is provided in the Supplemental Information for interested readers.

Sequence motifs in 3′ UTR

De novo motif search identified four motifs in the 3′ UTR to be significantly associated with mRNA stability (Fig. 4A, Materials and Methods). These include three described motifs: the Puf3 binding motif TGTAAATA (FDR = 3.2 × 10^−5, median fold-change 1.29) (Gerber et al. 2004; Gupta et al. 2014), the Whi3 binding motif TGCAT (FDR = 7 × 10^−4, median fold-change 1.24) (Colomina et al. 2008; Cai and Futcher 2013), and a poly(U) motif TTTTTTA (FDR = 0.09, median fold-change 1.20), which can be bound by Pub1 (Duttagupta et al. 2005), or is part of the long poly(U) stretch that forms a looping structure with a poly(A) tail (Geisberg et al. 2014). Moreover, an uncharacterized motif, ATATTC, was associated with lower mRNA half-life (FDR = 2 × 10^−5, median fold-change 1.24). Genes harboring the ATATTC motif are significantly enriched for genes involved in oxidative phosphorylation (Bonferroni corrected P < 0.01, 4.4-fold enrichment, Gene Ontology analysis, Supplemental Methods; Supplemental Table S3). The motif ATATTC preferentially localizes in the vicinity of the poly(A) site (Fig. 4B), and functionally depends on Ccr4 (FDR < 0.1, Supplemental Fig. S6), suggesting a potential interaction with deadenylation factors. Notably, the motif ATATTC was found in 13% of the genes (591 out of 4388) and significantly co-occurred with the other two destabilizing motifs found in the 3′ UTR: the Puf3 (FDR = 0.01) and Whi3 (FDR = 3 × 10^−3) binding motifs (Fig. 4F). This 3′ UTR motif had been computationally identified by conservation analysis (Kellis et al. 2003), by regression of steady-state expression levels (Foat et al. 2005), and by enrichment analysis within gene expression clusters (Elemento et al. 2007). It was proposed to be named PRSE (positive response to starvation element), because of its enrichment among genes that are up-regulated upon starvation (Foat et al. 2005). However, it had not been experimentally validated as controlling mRNA stability.
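
The motif co-occurrence significance mentioned above is based on a Fisher test (Fig. 4F). A minimal sketch of a one-sided version is shown here; whether the study used a one- or two-sided test is not stated in this section, so the sidedness is an assumption, as are the function name and the toy counts in the usage line.

```python
from math import comb

def cooccurrence_p_value(n_both, n_a, n_b, n_total):
    """One-sided Fisher exact test for motif co-occurrence.

    Probability of observing at least `n_both` genes carrying both motifs,
    given `n_a` genes with motif A and `n_b` genes with motif B among
    `n_total` genes (hypergeometric upper tail).
    """
    upper = min(n_a, n_b)
    denominator = comb(n_total, n_b)
    numerator = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_both, upper + 1)
    )
    return numerator / denominator

# Toy counts for illustration only (not the counts from this study).
print(cooccurrence_p_value(n_both=8, n_a=20, n_b=15, n_total=100))
```

The resulting P-values for all motif pairs would then be corrected for multiple testing before being reported as FDR.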

FIGURE 4.

3′ UTR half-life determinant motifs in S. cerevisiae. (A) Distribution of half-lives for mRNAs grouped by the number of occurrence(s) of the motif ATATTC, TGCAT (Whi3), TGTAAATA (Puf3), and TTTTTTA (Pub1), respectively, in their 3′ UTR sequence. Numbers in the boxes represent the number of members in each box. FDR were reported from the linear mixed effect model (Materials and Methods). (B) Fraction of transcripts containing the motif (y-axis) within a 20-bp window centered at a position (x-axis) with respect to poly(A) site for different motifs (facet titles). Positional bias was not observed when aligning 3′ UTR motifs with respect to the stop codon. (C) Prediction of the relative effect on half-life (y-axis) for single-nucleotide substitution in the motif with respect to the consensus motif (y = 1, horizontal line). The motifs were extended two bases at each flanking site (positions +1, +2, −1, −2). (D) Nucleotide frequency within motif instances, when allowing for one mismatch compared with the consensus motif. (E) Mean conservation score (phastCons, Materials and Methods) of each base in the consensus motif with two flanking nucleotides (y-axis). (F) Co-occurrence significance (FDR, Fisher test P-value corrected with Benjamini–Hochberg) between different motifs (left). Number of occurrences among the 4388 mRNAs (right). (G) Steady-state expression level of SFG1 and NYV1 (normalized by ACT1 and TUB2 expression, Supplemental Methods). Bar height represents mean of each group, error bars represent ± one standard error of the mean, each dot represents one biological replicate (jittered at x-axis to avoid overlapping). P-values were calculated by comparing the normalized expression level of constructs with two scrambled motifs embedded versus that with two functional ATATTC motifs embedded (Wilcoxon rank-sum test).

We validated the 3′ UTR motif ATATTC with a reporter assay on two different genes, SFG1 and NYV1.
Given the predicted small effect of a single motif, we generated constructs with two instances of the motif and compared them to constructs harboring two scrambled motifs at the same locations (Fig. 4G, Materials and Methods). Both reporter genes showed decreased expression levels compared to scrambled controls (P = 0.019 for SFG1, P = 0.00016 for NYV1, Wilcoxon rank-sum test). Since the 3′ UTR motif ATATTC is not significantly associated with mRNA synthesis rate (P = 0.38, Wilcoxon rank-sum test, synthesis rate of genes without motif versus genes with motif), we conclude that this decreased expression is due to decreased stability.

Consistent with the role of Puf3 in recruiting deadenylation factors, Puf3 binding motif localized preferentially close to the poly(A) site (Fig. 4B). The effect of the Puf3 motifs was significantly lower in the knockout of PUF3 (FDR < 0.1, Supplemental Fig. S6). We also found a significant dependence on the deadenylation (CCR4, POP2) and decapping (DHH1, PAT1) pathways (all FDR < 0.1, Supplemental Fig. S6), consistent with previous single gene experiments showing that Puf3 binding promotes both deadenylation and decapping (Olivas and Parker 2000; Goldstrohm et al. 2007). Strikingly, the Puf3 binding motif switched to a stabilization motif in the absence of Puf3 and Ccr4 (all FDR < 0.1, Supplemental Fig. S6), suggesting that deadenylation of the Puf3 motif containing mRNAs is not only facilitated by Puf3 binding, but also depends on it.

Whi3 plays an important role in cell cycle control (Garí et al. 2001). Binding of Whi3 leads to destabilization of the CLN3 mRNA (Cai and Futcher 2013). A subset of yeast genes are up-regulated in the Whi3 knockout strain (Cai and Futcher 2013). However, so far it was unclear whether Whi3 generally destabilizes mRNAs upon its binding. Our analysis showed that mRNAs containing the Whi3 binding motif (TGCAT) have a significantly shorter half-life (FDR = 6.9 × 10^−4, median fold-change 1.24).
Surprisingly, this binding motif is extremely widespread: 896 of the 4388 genes that we examined (20%) contain the motif in their 3′ UTR, and these genes are enriched for several processes (Supplemental Table S3). Functionality of the Whi3 binding motif was found to be dependent on Ccr4 (FDR < 0.1, Supplemental Fig. S6).

The mRNAs harboring the TTTTTTA motif tended to be more stable (FDR = 0.086, median fold-change 1.22) and were enriched for genes involved in translation (P = 1.34 × 10^−3, twofold enrichment; Supplemental Table S3). No positional preferences were observed for this motif (Fig. 4B). The effect of this motif depends on genes from the Ccr4–Not complex and on Xrn1 (Supplemental Fig. S6).

Four additional lines of evidence further supported the functionality of our identified motifs. First, single-nucleotide deviations from the motif's consensus sequence associated with decreased effects on half-life (Fig. 4C, linear regression allowing for one mismatch, Materials and Methods). Moreover, the flanking nucleotides did not show further associations, indicating that the whole lengths of the motifs were recovered (Fig. 4C). Second, when allowing for one mismatch, the motif still showed strong preferences (Fig. 4D). Third, the motif instances were more conserved than their flanking bases from the 3′ UTR (Fig. 4E). Fourth, all four motifs show significant effects in the RNA half-life data set generated by Miller et al. (2011), which is also based on 4sU labeling, as well as in the data set of Presnyak et al. (2015), which is in contrast based on transcriptional arrest (Supplemental Fig. S7).
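
Motif instances "allowing for one mismatch", as used for Figure 4C,D, can be located with a simple sliding-window scan. The sketch below is an illustration under that reading; the function name is hypothetical and the example sequence is invented, not taken from a yeast 3′ UTR.

```python
def motif_hits(sequence, motif, max_mismatches=1):
    """Return (position, matched_window, n_mismatches) for each window of
    `sequence` that differs from `motif` at no more than `max_mismatches`
    positions (substitutions only, no gaps)."""
    sequence, motif = sequence.upper(), motif.upper()
    hits = []
    for i in range(len(sequence) - len(motif) + 1):
        window = sequence[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits

# Invented sequence with one exact and one single-mismatch instance.
print(motif_hits("GGATATTCGGATAATCGG", "ATATTC"))
```

Grouping the hits by mismatch position and substituted base gives the categories needed to estimate per-position effects on half-life.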

Fifty-nine percent of between-gene half-life variation can be explained by sequence features

We next asked how well one could predict mRNA half-life from these mRNA sequence features, and what their respective contributions were when considered jointly. To this end, we performed a multivariate linear regression of the logarithm of the half-life against the identified sequence features. The predictive power of the model on unseen data was assessed using 10-fold cross-validation (Materials and Methods; a complete list of model features and their P-values is provided in Supplemental Table S4). To prevent overfitting, we performed motif discovery on each of the 10 training sets and observed the same set of motifs across all the folds. Altogether, 59% of S. cerevisiae half-life variance in the logarithmic scale can be explained by simple linear combinations of the above sequence features (Fig. 5A; Supplemental Table S5). The median out-of-folds relative error across genes is 30%. A median relative error of 30% for half-life is remarkably low: it is on the order of the expression variation that is typically physiologically tolerated, and it is also about the amount of variation observed between replicate experiments (Eser et al. 2016). To make sure that our findings are not biased to a specific data set, we fitted the same model to a data set using RATE-seq (Neymotin et al. 2014), a modified version of the protocol used by Sun et al. (2013). On these data, the model was able to explain 51% of the variance (Supplemental Fig. S8). Moreover, the same procedure applied to S. pombe explained 45% of the total half-life variance, suggesting the generality of this approach. Because the measurements also entail measurement noise, these numbers are conservative underestimates of the total biological variance explained by our model.
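
The two summary statistics quoted above can be sketched as follows, given measured and out-of-fold predicted log half-lives. The definitions used here (explained variance as one minus residual over total sum of squares, and relative error computed on the natural scale) are assumptions for illustration; the study's exact definitions are in its Materials and Methods. The fold assignment by index is likewise a simplification.

```python
import math
from statistics import mean, median

def explained_variance(measured_log, predicted_log):
    """1 - residual sum of squares / total sum of squares."""
    centre = mean(measured_log)
    ss_res = sum((y - p) ** 2 for y, p in zip(measured_log, predicted_log))
    ss_tot = sum((y - centre) ** 2 for y in measured_log)
    return 1.0 - ss_res / ss_tot

def median_relative_error(measured_log, predicted_log):
    """Median of |predicted - measured| / measured on the natural scale."""
    return median(
        abs(math.expm1(p - y)) for y, p in zip(measured_log, predicted_log)
    )

def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; gene i goes to fold i modulo k."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test

# Invented half-lives (in minutes) for illustration only.
measured = [math.log(v) for v in (10.0, 20.0, 40.0)]
predicted = [math.log(v) for v in (13.0, 20.0, 20.0)]
print(explained_variance(measured, predicted),
      median_relative_error(measured, predicted))
```

In a cross-validated setting, each gene's prediction comes from the model fitted on the folds that exclude it, and both statistics are computed on those out-of-fold predictions.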

FIGURE 5.

Genome-wide prediction of mRNA half-life from sequence features and analysis of the contributions. (A,B) mRNA half-life predicted (x-axis) versus measured (y-axis) for S. cerevisiae (A) and S. pombe (B), respectively. (C) Contribution of each sequence feature individually (Individual), cumulatively when sequentially added into a combined model (Cumulative), and explained variance drop when each single feature is removed from the full model separately (Drop). Values reported are the mean of 100 cross-validated evaluations (Materials and Methods). (D) Expected half-life fold-change of single-nucleotide variations on sequence features. For length and GC, dots represent median half-life fold-change of one nucleotide shorter or one G/C to A/T transition, respectively. For codon usage, each dot represents median half-life fold-change of one type of synonymous mutation; all kinds of synonymous mutations are considered. For uAUG, each dot represents median half-life fold-change of mutating out one uAUG. For motifs, each dot represents median half-life fold-change of one type of nucleotide transition at one position on the motif (Materials and Methods). Medians are calculated across all mRNAs.

The uAUG, 5′ UTR length, 5′ UTR GC content, 61 coding codons, CDS folding energy, all four 3′ UTR motifs, and 3′ UTR length remained significant in the joint model, indicating that they contributed individually to half-life (Supplemental Table S4). Most of them showed decreased effect in a joint model compared to marginal effects (Fig. 5C), likely because they correlate with each other. In contrast, start codon context, stop codon context, 5′ folding energy, the 5′ UTR motif AAACAAA (Supplemental Fig. S5), CDS length, and 3′ UTR GC content dropped below significance when considered in the joint model (Supplemental Table S4). This loss of statistical significance may be due to lack of statistical power.
Another possibility is that the marginal association of these sequence features with half-life is a consequence of a correlation with other sequence features. Among all sequence features, codon usage as a group is the best predictor both in a univariate model (55.29%) and in the joint model (44.63%) (Fig. 5C). This shows that, quantitatively, codon usage is the major determinant of mRNA stability in yeast. This explains why only a small fraction of mRNA stability variation can be explained by RNA-binding proteins (Hasan et al. 2014). The variance analysis quantifies the contribution of each sequence feature to the variation across genes. Features that vary a lot between genes, such as UTR length and codon usage, favorably contribute to the variation. However, this does not reflect the effect on a given gene of elementary sequence variations in these features. For instance, a single-nucleotide variant can lead to the creation of a uAUG with a strong effect on half-life, but a single-nucleotide variant in the coding sequence may have little impact on overall codon usage. We used the joint model to assess the sensitivity of each feature to single-nucleotide mutations as median fold-change across genes, simulating single-nucleotide deletions for the length features and single-nucleotide substitutions for the remaining ones (Materials and Methods). Single-nucleotide variations typically altered half-life by <10%. The largest effects were observed in the 3′ UTR motifs and uAUG (Fig. 5D). Notably, although codon usage was the major contributor to the variance, synonymous variation on codons typically affected half-life by <2% (Fig. 5D; Supplemental Fig. S9). Most of the synonymous variations that changed half-life by more than 2% involved the most nonoptimized codons, CGA or ATA (Supplemental Fig. S9; Presnyak et al. 2015).
Altogether, our results show that most of yeast mRNA half-life variation can be predicted from mRNA sequence alone, with codon usage being the major contributor. However, single-nucleotide variation at 3′ UTR motifs or uAUG had the largest expected effect on mRNA stability.

DISCUSSION

We systematically searched for mRNA sequence features associating with mRNA stability and estimated their effects at single-nucleotide resolution in a joint model. With the exception of GC content and length, all elements of the joint model are causal. One of them, the 3′ UTR motif ATATTC, was validated in this study. Overall, the joint model showed that 59% of the variance could be predicted from mRNA sequence alone in S. cerevisiae. This analysis showed that translation-related features, in particular codon usage, contributed most to the explained variance. This finding further strengthens the importance of the coupling between translation and mRNA degradation (Roy and Jacobson 2013; Huch and Nissan 2014; Radhakrishnan and Green 2016). Moreover, we assessed the dependencies of each sequence element on RNA degradation pathways. Remarkably, we identified that codon-mediated decay is a regulatory mechanism of the canonical decay pathways, including deadenylation- and decapping-dependent 5′ to 3′ decay and NMD (Figs. 3B, 6).

FIGURE 6.

Overview and summary of conclusions from this study.

Predicting various steps of gene expression from sequence alone has long been a subject of study (Beer and Tavazoie 2004; Vogel et al. 2010; Zur and Tuller 2013; Wang et al. 2016). To this end, two distinct classes of models have been proposed: biophysical models and machine learning models (Zur and Tuller 2016). Biophysical models provide detailed understanding of the processes. Machine learning approaches, in contrast, can reach much higher predictive accuracy but are more difficult to interpret. Also, machine learning approaches can pick up signals with predictive power that are correlative but not causal. Here we adopted an intermediate, semimechanistic modeling approach. We used a simple linear model that is interpretable. Also, all elements are functional, except for two covariates: GC content and length. Our approach was based on the analysis of endogenous sequence, which allowed the identification of a novel cis-regulatory element. An alternative approach to the modeling of endogenous sequence is to use large-scale synthetic libraries (Dvir et al. 2013; Shalem et al. 2015; Wissink et al. 2016). Although these large-scale perturbation screens are very powerful for dissecting known cis-regulatory elements or investigating small variations around select genes, the sequence space is so large that they cannot uncover all regulatory motifs. It would be interesting to combine both approaches and design large-scale validation experiments guided by insights coming from modeling of endogenous sequences as we developed here.

Recently, Neymotin et al. (2016) showed that several translation-related transcript properties are associated with half-life. That study derived a model explaining 50% of the total variance using many transcript properties, including some not based on sequence (ribosome profiling, expression levels, etc.). 
Although non-sequence-based predictors can facilitate prediction, they may do so because they are consequences rather than causes of half-life. For instance, increased half-life causes higher expression level. Also, increased cytoplasmic half-life provides a higher ratio of cytoplasmic over nuclear RNA, and thus more RNAs available to ribosomes. Hence both expression level and ribosome density may help make good predictions of half-life, but not necessarily because they causally increase half-life. In contrast, we aimed here to understand how mRNA half-life is encoded in mRNA sequence and derived a model that is based on functional elements. This avoided using transcript properties that could be consequences of mRNA stability. Hence, our present analysis confirms the quantitative importance of translation in determining mRNA stability that Neymotin and colleagues quantified, and anchors it into pure sequence elements.

Confounding associations of sequence elements with mRNA stability could arise because of selection on expression levels acting at multiple stages of gene expression. For instance, genes that are selected for high protein expression levels may be enriched for elements that enhance translation and for elements that enhance mRNA stability. Functional validations are therefore needed to disentangle causality from co-selection. The sequence elements of our joint model, except for GC content and length, are all functional. However, we reported further elements that associate marginally with half-life. One of the interesting sequence elements that we found associated with half-life but that did not turn out significant in the joint model is the start codon context. Given its established effect on translation initiation (Kozak 1986; Dvir et al. 2013), the general coupling between translation and mRNA degradation (Roy and Jacobson 2013; Huch and Nissan 2014; Radhakrishnan and Green 2016), as well as several observations directly on mRNA stability for single genes (LaGrandeur and Parker 1999; Schwartz and Parker 1999), the start codon context may nonetheless functionally affect mRNA stability. Consistent with this hypothesis, large-scale experiments that perturb 5′ sequence secondary structure and start codon context indeed showed a wide range of mRNA level changes in the direction that we would predict (Dvir et al. 2013).

We are not aware of previous studies that systematically assessed the effects of cis-regulatory elements in the context of knockout backgrounds, as we did here. This part of our analysis turned out to be very insightful. By assessing the dependencies of codon-usage-mediated mRNA stability control systematically and comprehensively, we generalized results from recent studies on the Ccr4–Not complex and Dhh1, but also identified important novel ones including NMD factors, Pat1 and Xrn1. With the growing availability of knockout or mutant backgrounds in model organisms and human cell lines, we anticipate this approach to become a fruitful methodology to unravel regulatory mechanisms.

MATERIALS AND METHODS

Data and genomes

Wild-type and knockout genome-wide S. cerevisiae half-life data were obtained from Sun et al. (2013), whereby all strains are histidine, leucine, methionine, and uracil auxotrophs. A complete list of knockout strains used in this study is provided in Supplemental Table S1. S. cerevisiae gene boundaries were taken from the boundaries of the most abundant isoform quantified by Pelechano et al. (2013). Reference genome fasta file and genome annotation were obtained from the Ensembl database (release 79). UTR regions were defined by subtracting out the gene body (exons and introns from the Ensembl annotation) from the gene boundaries. Processed S. cerevisiae UTR annotation is provided in Supplemental Table S6. Genome-wide half-life data of S. pombe as well as refined transcription unit annotation were obtained from Eser et al. (2016). Reference genome version ASM294v2.26 was used to obtain sequence information. Half-life outliers of S. pombe (half-life less than 1 or larger than 250 min) were removed. For both half-life data sets, only mRNAs with mapped 5′ UTR and 3′ UTR were considered. mRNAs with 5′ UTR length shorter than 6 nt were further filtered out. Codon-wise species-specific tRNA adaptation index (sTAI) values for the yeasts were obtained from Sabi and Tuller (2014). The sTAI of each gene was calculated as the geometric mean of the sTAIs of all its codons (stop codon excluded).

Analysis of knockout strains

The effect level of an individual sequence feature was compared against the wild-type with Wilcoxon rank-sum test followed by multiple hypothesis testing P-value correction (FDR < 0.1). For details, see Supplemental Methods.

Motif discovery

Motif discovery was conducted for the 5′ UTR, the CDS and the 3′ UTR regions. A linear mixed effect model was used to assess the effect of each individual k-mer while controlling for the effects of the other k-mers and with the region length as a covariate, as described previously (Eser et al. 2016). For CDS we also used codons as further covariates. In contrast to Eser and colleagues, we tested the effects of all possible k-mers with lengths from 3 to 8. The linear mixed model for motif discovery was fitted with GEMMA software (Zhou et al. 2013). P-values were corrected for multiple testing using Benjamini–Hochberg's FDR. Motifs were subsequently manually assembled based on overlapping significant (FDR < 0.1) k-mers.
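The multiple-testing step can be illustrated with a minimal sketch of the Benjamini–Hochberg adjustment. This is not the authors' code; the k-mer P-values below are invented for illustration, and only the FDR threshold of 0.1 comes from the text.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P-values for four k-mers (illustrative numbers only).
kmer_p = {"ATATTC": 0.001, "AAACAAA": 0.02, "CCCCCC": 0.04, "GGGGGG": 0.5}
q_values = benjamini_hochberg(list(kmer_p.values()))
# Keep k-mers passing the FDR < 0.1 threshold used in the text.
significant = [k for k, q in zip(kmer_p, q_values) if q < 0.1]
print(significant)
```

In this toy input, three of the four k-mers pass the threshold; overlapping significant k-mers would then be assembled into motifs by hand, as described above.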

Folding energy calculation

RNA sequence folding energy was calculated with RNAfold from ViennaRNA version 2.1.9 (Lorenz et al. 2011), with default parameters.

S. cerevisiae conservation analysis

The phastCons (Siepel et al. 2005) conservation track for S. cerevisiae was downloaded from the UCSC Genome Browser (http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/phastCons7way/). Motif single-nucleotide level conservation scores were computed as the mean conservation score of each nucleotide (including two extended nucleotides at each side of the motif) across all motif instances genome-wide (removing NA values).
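As a sketch of the per-position averaging described above, assume each motif instance has already been converted to a list of phastCons scores covering the motif plus two flanking nucleotides on each side, with missing values stored as None. The function name and the scores are hypothetical.

```python
def mean_conservation_per_position(instances):
    """Average the score at each position across instances, skipping None (NA)."""
    width = len(instances[0])
    means = []
    for pos in range(width):
        values = [inst[pos] for inst in instances if inst[pos] is not None]
        # A position with no usable score in any instance stays undefined.
        means.append(sum(values) / len(values) if values else None)
    return means


# Two hypothetical instances of a 6-nt motif with 2 flanking nt on each side.
instances = [
    [0.1, 0.2, 0.9, 0.8, 1.0, 0.9, 0.7, 0.8, 0.3, None],
    [0.3, None, 0.7, 1.0, 1.0, 0.7, 0.9, 0.6, 0.1, 0.2],
]
profile = mean_conservation_per_position(instances)
print(profile)
```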

Linear regression model for codon usage

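Throughout the study, we modeled codon usage in the linear model with each codon as an independent covariate using its frequency:

$$y_g = \beta_0 + \sum_{c} \beta_c x_{cg} + \varepsilon_g \qquad (1)$$

where $y_g$ is the log half-life of gene $g$, $x_{cg} = n_{cg}/L_g$, $n_{cg}$ is the number of occurrences of codon $c$ in gene $g$, and $L_g$ is the CDS length of gene $g$.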
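A minimal sketch of how such codon-frequency covariates could be computed for one coding sequence follows. It assumes that the frequency is the codon count divided by the CDS length in nucleotides and that a terminal stop codon is excluded from both; the function name and the toy sequence are hypothetical.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def codon_frequencies(cds):
    """Return {codon: count / CDS length in nt} for the coding codons of one CDS."""
    # Split into complete triplets, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    # Assumption: the terminal stop codon is not counted as a coding codon.
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]
    length_nt = 3 * len(codons)
    counts = Counter(codons)
    return {codon: n / length_nt for codon, n in counts.items()}


# Toy CDS: ATG GCT GCT AAA followed by the stop codon TAA.
x = codon_frequencies("ATGGCTGCTAAATAA")
print(x)
```

Because every codon spans three nucleotides, the frequencies of one gene sum to 1/3 under this normalization.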

Relation between codon regression coefficient and sTAI

The coefficients of codon frequencies have an interpretation analogous to the species-specific tRNA adaptation index (sTAI). The same also applies to tAI. The sTAI of a gene is defined as the geometric mean of the sTAIs of all its coding codons (Sabi and Tuller 2014). For a gene $g$ with $N$ codons, its sTAI is defined as follows:

$$\mathrm{sTAI}_g = \left( \prod_{i=1}^{N} w_i \right)^{1/N}$$

where $w_i$ represents the sTAI of the $i$th codon in the gene. The logarithm of the sTAI of a gene with $N$ codons is

$$\log(\mathrm{sTAI}_g) = \frac{1}{N} \sum_{i=1}^{N} \log(w_i) = \frac{1}{N} \sum_{c} n_{cg} \log(w_c) = 3 \sum_{c} x_{cg} \log(w_c)$$

where $x_{cg} = n_{cg}/L_g$ is the codon frequency defined in Equation 1, $3N = L_g$ is the CDS length, $n_{cg}$ is the number of occurrences of codon $c$ in gene $g$, and $w_c$ is the sTAI of codon $c$. Hence, in a linear model the regression coefficient $\beta_c$ of Equation 1 has an analogous interpretation to the log of sTAI [$\log(w_c)$].
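The identity between the log of the geometric mean and the weighted sum over codon frequencies can be checked numerically. The sketch below uses invented codon weights and assumes the frequency is the codon count divided by the CDS length in nucleotides (3N).

```python
import math
from collections import Counter


def log_stai_direct(codons, w):
    """Log of the geometric mean of per-codon weights."""
    return sum(math.log(w[c]) for c in codons) / len(codons)


def log_stai_from_frequencies(codons, w):
    """Same quantity written as 3 * sum_c x_c * log(w_c), with x_c = n_c / L."""
    cds_length = 3 * len(codons)
    counts = Counter(codons)
    return 3 * sum((n / cds_length) * math.log(w[c]) for c, n in counts.items())


# Hypothetical codon weights; real sTAI values come from Sabi and Tuller (2014).
w = {"ATG": 0.6, "GCT": 0.9, "AAA": 0.3}
codons = ["ATG", "GCT", "GCT", "AAA"]

direct = log_stai_direct(codons, w)
via_freq = log_stai_from_frequencies(codons, w)
print(direct, via_freq)
```

Both expressions give the same value, which is why a fitted coefficient on a codon frequency plays a role comparable to the log weight of that codon.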

Linear model for genome-wide half-life prediction

Multivariate linear regression models were used to predict genome-wide mRNA half-life on the logarithmic scale from sequence features. Only mRNAs that contain all features were used to fit the models, resulting in 3838 mRNAs for S. cerevisiae and 3360 mRNAs for S. pombe. All predictions in this study are out-of-fold predictions obtained with 10-fold cross-validation. For each fold, a linear model was first fitted to the training data with all sequence features as covariates, then a stepwise model selection procedure was applied to select the best model with the Bayesian Information Criterion as the criterion [step function in R, with k = log(n)]. L1 or L2 regularization was not necessary, as neither improved the out-of-fold prediction accuracy (tested with the glmnet R package [Friedman et al. 2010]). Motif discovery was performed again at each fold, using the training set only, and identified the same set of motifs each time. For details, see Supplemental Methods.
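The out-of-fold scheme can be sketched as follows. This is a simplified stand-in, not the authors' pipeline: it fits a one-covariate least-squares line instead of the multivariate model with BIC-based stepwise selection, and it runs on synthetic data. All function names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def out_of_fold_predictions(xs, ys, k=10, seed=0):
    """Predict every sample from a model fitted without that sample's fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    preds = [None] * len(xs)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            preds[i] = a + b * xs[i]
    return preds


def explained_variance(ys, preds):
    """1 - residual sum of squares / total sum of squares."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Synthetic data: one feature and a noisy log half-life (illustrative only).
rng = random.Random(1)
xs = [rng.uniform(0, 1) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.3) for x in xs]
preds = out_of_fold_predictions(xs, ys)
r2 = explained_variance(ys, preds)
print(round(r2, 3))
```

Because each prediction comes from a model that never saw the corresponding sample, the explained variance computed this way is not inflated by overfitting.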

Analysis of sequence feature contribution

Linear models were first fitted on the complete data with all sequence features as covariates. Nonsignificant sequence features were then removed from the final models, leaving 69 features for the S. cerevisiae model and 76 features for the S. pombe model (each coding codon was fitted as a single covariate). The contribution of each sequence feature was analyzed individually as a univariate regression and also jointly in a multivariate regression model. The contribution of each feature individually was calculated as the variance explained by a univariate model. Features were then added to a joint model in descending order of their individual explained variance, and the “cumulative” variance explained was calculated at each step. The “drop” quantifies the decrease in variance explained when a single feature is left out of the full model. All contribution statistics were computed as the average over 100 repetitions of 10-fold cross-validation.

Single-nucleotide variant effect predictions

The same model used in sequence feature contribution analysis was used for single-nucleotide variant effect prediction. For motifs, effects of single-nucleotide variants were predicted with the linear model modified from Eser et al. (2016). When assessing the effect of a given motif variation, instead of estimating the marginal effect size, we controlled for the effect of all other sequence features using a linear model with the other features as covariates. For details, see Supplemental Methods. For other sequence features, effects of single-nucleotide variants were predicted by introducing a single-nucleotide perturbation into the full prediction model for each gene, and summarizing the effect with the median half-life change across all genes. For details, see Supplemental Methods.
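For the non-motif features, the perturbation idea can be sketched for a synonymous codon change. The sketch assumes the model is linear in codon frequencies on the natural-log scale, with frequency defined as count divided by CDS length, so that swapping one codon changes log half-life by the coefficient difference divided by the CDS length. The coefficients and gene lengths are invented.

```python
import math
from statistics import median


def synonymous_fold_change(beta_old, beta_new, cds_length):
    """Half-life fold-change when one codon is replaced by a synonymous one.

    One codon lost and one gained shifts each frequency by 1 / cds_length,
    so the change in log half-life is (beta_new - beta_old) / cds_length.
    """
    return math.exp((beta_new - beta_old) / cds_length)


# Hypothetical coefficients for two synonymous arginine codons.
beta = {"CGA": -30.0, "AGA": 10.0}
# Hypothetical CDS lengths (nt) of three genes.
cds_lengths = [900, 1200, 1500]

fold_changes = [
    synonymous_fold_change(beta["CGA"], beta["AGA"], length)
    for length in cds_lengths
]
# Summarize the effect as the median across genes, as described in the text.
median_fc = median(fold_changes)
print(round(median_fc, 4))
```

Dividing by the CDS length is what keeps single synonymous changes small even when the coefficient difference between two codons is large.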

Construction of SFG1 and NYV1 mutant strains

One hundred base pair primers (IDT) containing the respective 3′ UTR mutations were used to amplify the kanMX cassette from plasmid pFA6a-KanMX6 (Euroscarf). PCR products were used for transformation of strain BY4741 (MATa his3Δ1 leu2Δ0 met15Δ0 ura3Δ0, Euroscarf) by homologous recombination, and transformants were selected on G418 plates. Correct clones were confirmed by sequencing. Details of the reporter assay design are provided in the Supplemental Methods. Sequences of the constructs are given in Supplemental Table S7.

Quantitative PCR

Cells were grown to OD600 0.8 in YPD from overnight cultures inoculated from single colonies. Cells were centrifuged at 4000 rpm for 1 min at 30°C and pellets were flash-frozen in liquid nitrogen. RNA was phenol/chloroform purified. cDNA synthesis was performed with 1.5 µg RNA using the Maxima Reverse Transcriptase (Thermo Fisher). qPCR was performed on a qTower 2.2 (Analytik Jena) using a 2-min denaturing step at 95°C, followed by 39 cycles of 5 sec at 95°C, 10 sec at 64°C, and 15 sec at 72°C with a final step at 72°C for 5 min. qPCR was performed using the SensiFAST SYBR No-ROX Kit (Bioline). Primer efficiencies were determined by performing standard curves for all primer combinations. All primer pairs had efficiencies of 95% or higher. Sequence information of primer pairs and efficiencies are provided in Supplemental Table S7. Ct data from nine biological and three technical replicates were used for analysis. Details of analyzing qPCR data are described in Supplemental Methods.

DATA DEPOSITION

Analysis scripts are available at https://github.com/gagneurlab/Manuscript_Cheng_RNA_2017.

SUPPLEMENTAL MATERIAL

Supplemental material is available for this article.

75 in total

1. Codon replacement in the PGK1 gene of Saccharomyces cerevisiae: experimental approach to study the role of biased codon usage in gene expression.

Authors: A Hoekema; R A Kastelein; M Vasser; H A de Boer
Journal: Mol Cell Biol Date: 1987-08 Impact factor: 4.272

2. The transcription factor associated Ccr4 and Caf1 proteins are components of the major cytoplasmic mRNA deadenylase in Saccharomyces cerevisiae.

Authors: M Tucker; M A Valencia-Sanchez; R R Staples; J Chen; C L Denis; R Parker
Journal: Cell Date: 2001-02-09 Impact factor: 41.582

3. Codon identity regulates mRNA stability and translation efficiency during the maternal-to-zygotic transition.

Authors: Ariel A Bazzini; Florencia Del Viso; Miguel A Moreno-Mateos; Timothy G Johnstone; Charles E Vejnar; Yidan Qin; Jun Yao; Mustafa K Khokha; Antonio J Giraldez
Journal: EMBO J Date: 2016-07-19 Impact factor: 11.598

4. Gene products that promote mRNA turnover in Saccharomyces cerevisiae.

Authors: P Leeds; J M Wood; B S Lee; M R Culbertson
Journal: Mol Cell Biol Date: 1992-05 Impact factor: 4.272

5. Mutations in translation initiation factors lead to increased rates of deadenylation and decapping of mRNAs in Saccharomyces cerevisiae.

Authors: D C Schwartz; R Parker
Journal: Mol Cell Biol Date: 1999-08 Impact factor: 4.272

6. Regularization Paths for Generalized Linear Models via Coordinate Descent.

Authors: Jerome Friedman; Trevor Hastie; Rob Tibshirani
Journal: J Stat Softw Date: 2010 Impact factor: 6.440

Review 7. RNA degradation in Saccharomyces cerevisae.

Authors: Roy Parker
Journal: Genetics Date: 2012-07 Impact factor: 4.562

8. Transcript features alone enable accurate prediction and understanding of gene expression in S. cerevisiae.

Authors: Hadas Zur; Tamir Tuller
Journal: BMC Bioinformatics Date: 2013-10-15 Impact factor: 3.169

9. High-resolution sequencing and modeling identifies distinct dynamic RNA regulatory strategies.

Authors: Michal Rabani; Raktima Raychowdhury; Marko Jovanovic; Michael Rooney; Deborah J Stumpo; Andrea Pauli; Nir Hacohen; Alexander F Schier; Perry J Blackshear; Nir Friedman; Ido Amit; Aviv Regev
Journal: Cell Date: 2014-12-11 Impact factor: 41.582

10. Extensive transcriptional heterogeneity revealed by isoform profiling.

Authors: Vicent Pelechano; Wu Wei; Lars M Steinmetz
Journal: Nature Date: 2013-04-24 Impact factor: 49.962
