Literature DB >> 25170020

Quantitative analysis of mammalian translation initiation sites by FACS-seq.

William L Noderer¹, Ross J Flockhart², Aparna Bhaduri³, Alexander J Diaz de Arce¹, Jiajing Zhang², Paul A Khavari⁴, Clifford L Wang⁵.

Abstract

An approach combining fluorescence-activated cell sorting and high-throughput DNA sequencing (FACS-seq) was employed to determine the efficiency of start codon recognition for all possible translation initiation sites (TIS) utilizing AUG start codons. Using FACS-seq, we measured translation from a genetic reporter library representing all 65,536 possible TIS sequences spanning the -6 to +5 positions. We found that the motif RYMRMVAUGGC enhanced start codon recognition and translation efficiency. However, dinucleotide interactions, which cannot be conveyed by a single motif, were also important for modeling TIS efficiency. Our dataset combined with modeling allowed us to predict genome-wide translation initiation efficiency for all mRNA transcripts. Additionally, we screened somatic TIS mutations associated with tumorigenesis to identify candidate driver mutations consistent with known tumor expression patterns. Finally, we implemented a quantitative leaky scanning model to predict alternative initiation sites that produce truncated protein isoforms and compared predictions with ribosome footprint profiling data. The comprehensive analysis of the TIS sequence space enables quantitative predictions of translation initiation based on genome sequence.

Entities: CellLine Chemical Disease Gene Mutation Species

Keywords: FACS‐seq; Kozak motif; proteome modeling; start codon; translation initiation

Mesh：

Substances：

Year: 2014 PMID： 25170020 PMCID： PMC4299517 DOI： 10.15252/msb.20145136

Source DB: PubMed Journal: Mol Syst Biol ISSN： 1744-4292 Impact factor: 11.429

Introduction

Cells can control protein translation levels by tuning translation initiation (Kozak, 1991; Sonenberg & Hinnebusch, 2009; Ivanov et al, 2010). In eukaryotes, translation initiation typically follows the scanning ribosome model. In this model, the ribosomal preinitiation complex consisting of the small 40S ribosomal subunit, Met-tRNA, eIF2-GTP, eIF1, eIF1A, eIF3, and eIF5 is loaded onto the mRNA 5′ cap (Kozak, 2002b; Jackson et al, 2010; Hinnebusch, 2011). The preinitiation complex, along with additional initiation factors, scans from the mRNA 5′ cap in the 3′ direction in search of a start codon, which in most circumstances is AUG. When the preinitiation complex recognizes a start codon, initiation factors dissociate and a phosphate group is irreversibly released. The large 60S ribosomal subunit is then able to join the small 40S ribosomal subunit to complete the translation initiation process (Pestova & Kolupaeva, 2002; Nanda et al, 2013). Yet, the scanning ribosomal preinitiation complex does not initiate at every start codon that it encounters. With a certain probability, the ribosomal complex initiates translation, while others ‘leak’ past the start codon and continue scanning. This probability of initiation, or translational initiation efficiency, is governed by the sequence of the translation initiation site (TIS), which consists of the start codon and its adjacent bases. Therefore, cells can control translation levels in a gene sequence-dependent manner by controlling the efficiency at which a ribosome recognizes the start codon and initiates translation. Kozak has reported GCCRCCAUGG (purine, R=A or G; start codon underlined) to be a highly efficient mammalian TIS (Kozak, 1986, 1987a,b). By further generating point mutants of TIS sequences and evaluating reporter expression, Kozak found the −3R and +4G to be the first and second most important bases for efficient initiation, respectively (+1 denotes first base of start codon) (Kozak, 1986, 1997). Consequently, as a ‘rule of thumb’, highly efficient TISs are typically considered those with an AUG start codon, a −3R, and a +4G (Kozak, 1995; Harte et al, 2012). Applying these criteria, 40% of human genes (8,629) utilize highly efficient TISs (RefSeq release 55, September 2012). Yet, thousands of genes have apparently evolved TISs that are not highly efficient and could have been purposefully tuned for low or moderate translation efficiency (Kozak, 1991; Lukaszewicz et al, 2000). In fact, 9.6% of human genes (2,065) utilize neither a −3R nor a +4G. Any quantitative, genome-wide analysis of protein expression would need to account for the broad range of translation initiation efficiencies achieved by the diverse TIS sequences. To our knowledge, no study has systematically analyzed all possible mammalian TIS sequences. In summary, we are limited by our knowledge of the relationship between TIS sequence and translation initiation efficiency. Here, we combined fluorescence-activated cell sorting (FACS) with high-throughput DNA sequencing to analyze the translation initiation efficiency of 65,536 TIS sequences. After gauging the translation level mediated by each TIS sequence utilizing an AUG start codon, we report a comprehensive analysis of the TIS motif. We also developed algorithms that accounted for variable TIS-dependent initiation levels throughout the genome. Our analysis revealed key roles for TIS-dependent expression control in regulating cellular processes, generating protein isoforms, and tumorigenesis.

Results

High-throughput analysis TIS sequences by FACS-seq

Our first objective was to measure translation levels associated with each possible TIS sequence utilizing AUG as a start codon. Because we sought a high-throughput solution, it was crucial that these measurements could be made easily and precisely. To achieve this objective, we adapted a genetic reporter system that we developed previously (Ferreira et al, 2013). With this system, different TIS sequences of interest were used to initiate translation of green fluorescent protein (GFP) (Fig 1A). As a reference, red fluorescent protein (RFP) was translated from the same transcript using an internal ribosome entry site (IRES) and the TIS sequence GCCACCAUGGU. Because both fluorescent proteins were translated from the same transcript, normalizing GFP with RFP reduced the effect of extrinsic noise and improved our measurement of translation (Supplementary Fig S1) (Dean & Grayhack, 2012). Next, we generated a TIS reporter library. By utilizing a PCR-amplification approach using degenerate primers, we inserted TIS sequences with randomly chosen bases flanking the AUG start codon at positions −6 to −1, +4, and +5 (i.e., NNNNNNAUGNN where Ns indicate varied positions). Our library was sufficiently large so that it likely (> 99% probability) contained all 65,536 (48) TIS sequences.

Figure 1

Analysis of TIS sequences via FACS-seq

Data information: green fluorescent protein, GFP; red fluorescent protein, RFP; internal ribosomal entry site, IRES; 2A slippage site, 2A; puromycin resistance gene, PuroR.

A TIS reporter library was created by degenerate PCR to create a random permutation of the six bases upstream and two bases downstream of a GFP start codon, resulting in a library size of 65,536 TIS sequences. An IRES followed by RFP was used to normalize the GFP expression.

Stably transduced cells were sorted based on translation efficiency Analysis of TIS sequences via FACS-seq (GFP/RFP). 20 gates were drawn such that each gate contained 5% of the total library population. Cells that were off-scale were collected in the last gate.

The TIS sequences from each sorted population were PCR amplified and barcoded.

The barcoded TIS library was pooled and sequenced. The number of reads for each TIS sequence and barcode combination was counted.

An individual histogram for each TIS sequence is created in silico.

Analysis of TIS sequences via FACS-seq

Data information: green fluorescent protein, GFP; red fluorescent protein, RFP; internal ribosomal entry site, IRES; 2A slippage site, 2A; puromycin resistance gene, PuroR. A TIS reporter library was created by degenerate PCR to create a random permutation of the six bases upstream and two bases downstream of a GFP start codon, resulting in a library size of 65,536 TIS sequences. An IRES followed by RFP was used to normalize the GFP expression. Stably transduced cells were sorted based on translation efficiency Analysis of TIS sequences via FACS-seq (GFP/RFP). 20 gates were drawn such that each gate contained 5% of the total library population. Cells that were off-scale were collected in the last gate. The TIS sequences from each sorted population were PCR amplified and barcoded. The barcoded TIS library was pooled and sequenced. The number of reads for each TIS sequence and barcode combination was counted. An individual histogram for each TIS sequence is created in silico. In the past, we have used flow cytometry to analyze individual TIS sequences one genetic construct at a time (Ferreira et al, 2013). However, because of the sheer number of possible TIS sequences, we could not take this approach. Instead, we stably transduced cells (PD-31 mouse pre-B lymphocytes) with the entire TIS reporter library. This produced a heterogeneous population where each cell received one copy of the reporter construct. We now sought to generate a population histogram that reflected the translation level for each TIS sequence in the library. Using FACS, we isolated cell subpopulations producing GFP/RFP levels within 20 different ranges (a.k.a. FACS gates), which correspond to the x-axis intervals of the desired histograms (Fig 1B). To reduce sorting time while maintaining adequate analytical resolution, the 20 gates were chosen such that 5% of the total population fell into each gate. After sorting, each subpopulation still contained a mixture of cells with different TIS reporters, but now the numbers of each specific TIS sequence in each subpopulation could be used to generate population histograms for each translation level. To determine the numbers of each TIS sequence, we PCR amplified the vector region containing the TIS sequence from each subpopulation (Fig 1C). Using PCR, we also added a barcode sequence that could be used to identify subpopulations. All amplified products were then pooled and sequenced on the Illumina next-generation sequencing platform (Fig 1D). We counted the number of each TIS sequence for each translation interval and generated histograms describing the translation level associated with each TIS sequence (Fig 1E). We refer to this combined FACS and high-throughput sequencing approach as FACS-seq. From each of these population distributions (i.e., histograms, Fig 2A), we determined a median translation level (GFP/RFP). These values were then normalized so that translation of 100 relative units was equal to that produced by the TIS GCCACCAUGGG, a TIS often cited as being optimal for initiation (Kozak, 1987a; Harte et al, 2012). Since the TIS reporters only differed at the TIS sequence, we used reporter output as a measure of the relative translation initiation efficiency of each TIS sequence; however, when interpreting our results, one should still consider that changes to the eight TIS bases could have indirectly affected the many other factors that influence translation, for example, loading of the ribosome preinitiation complex or the rate of translation elongation. To visualize all of the relative translation initiation efficiencies measured by FACS-seq, we generated a heat map (Fig 2B) where sequences were clustered to highlight trends in TIS efficiency. Immediately, two trends supported the validity of our FACS-seq method. The strongest trend could be seen when comparing the translation efficiencies of TIS with a purine (A or G) to those with a pyrimidine (C or U) at the −3 position (Fig 2B; bottom vs. top halves, respectively). This confirmed previous reports that −3 purines promote efficient translation initiation (Kozak, 1986, 1987a, 1995). Second, when the GFP TIS sequences contained an upstream, out-of-frame AUG start codon, it would be expected that ribosomes preferentially initiate translation at the upstream AUG and synthesize an out-of-frame peptide instead of GFP (Kozak, 1995; Calvo et al, 2009). In agreement with this prediction, the GFP TIS sequences with −5A/−4U/−3G demonstrated significantly reduced translation, distinguished by the blue block of values near the bottom-left corner of the heat map (Fig 2B).

Figure 2

TIS efficiency determined by FACS-seq

Heat map of entire TIS sequence space. The sequences were arranged to highlight trends in TIS strength. The +4/+5 positions were not labeled, but they follow the same nucleotide order of U/C/A/G.

TIS efficiency determined by FACS-seq

Individual FACS-seq histograms spanning a range of translation efficiency values. The background color corresponds to the median translation initiation efficiency. The TIS sequences are indicated above each histogram. Heat map of entire TIS sequence space. The sequences were arranged to highlight trends in TIS strength. The +4/+5 positions were not labeled, but they follow the same nucleotide order of U/C/A/G.

Modeling the relationship between TIS sequence and initiation efficiency

Ideally, the sequence coverage generated by FACS-seq would be sufficient to generate histograms of high resolution for all TIS sequences. In practice, TIS sequences with repeat regions were sometimes absent or underrepresented (Fig 2B, white values). Because of data noise, we also could not be certain that any single value could precisely represent an initiation efficiency. Analogous to data obtained from microarray analysis of mRNA, the raw data revealed meaningful trends but individual data points may or may not stand on their own (Kerr et al, 2000; Sultan et al, 2002). By fitting the raw data to a model, our goal was to more accurately estimate the TIS efficiency for every TIS sequence. We first attempted to model the relationship between TIS sequence and efficiency with a mononucleotide position weight matrix (PWM), a model where each base contributes independently to initiation without cooperativity. The mononucleotide PWM was constructed by performing regression analysis on the natural logarithm of the raw FACS-seq data (Fig 3A and B) (Barrick et al, 1994; Salis et al, 2009). To avoid multiple initiation sites, TIS sequences containing an upstream AUG (e.g., TISs containing −5A, −4U, and −3G) were excluded from our training dataset. The resulting mononucleotide PWM revealed many of the known qualitative trends in TIS efficiency. For example, the mononucleotide PWM predicted that a −3A enhanced TIS efficiency by 58% relative to a −3U (Kozak, 1986, 1987a). To test the mononucleotide PWM, we compared the TIS efficiencies predicted by the model with those determined by conventional flow cytometry of cells expressing individual TIS reporters (i.e., each cell population translated GFP using a specific TIS sequence). We found that the mononucleotide PWM only moderately improved the accuracy of the TIS efficiency predictions relative to the raw FACS-seq data (raw data, R2 = 0.44, P = 2.0 × 10−6; mononucleotide PWM, R2 = 0.52, P = 1.5 × 10−15) (Fig 4A and B).

Figure 3

Mononucleotide PWM of TIS efficiency

Normalized values of the mononucleotide PWM (exp(C)). Values above 1.00 enhance TIS efficiency; values below 1.00 reduce TIS efficiency.

The same values in matrix form. The bases from the high-efficiency TIS motif are highlighted. TIS efficiency is calculated by multiplying the intercept value 76.9 by the appropriate base/position values.

Figure 4

Comparison on TIS efficiency models

Source data are available online for this figure.

Mononucleotide PWM of TIS efficiency

Normalized values of the mononucleotide PWM (exp(C)). Values above 1.00 enhance TIS efficiency; values below 1.00 reduce TIS efficiency. The same values in matrix form. The bases from the high-efficiency TIS motif are highlighted. TIS efficiency is calculated by multiplying the intercept value 76.9 by the appropriate base/position values.

Comparison on TIS efficiency models

A–D Individual reporter constructs with TIS sequences of interest were assayed for TIS efficiency (y-axis) and compared with (A) the raw FACS-seq data generated from analysis of the degenerate TIS reporter library, (B) the mononucleotide PWM predictions, (C) the dinucleotide PWM predictions, or (D) the trinucleotide PWM predictions. Filled squares correspond to TIS sequences absent from the raw FAC-seq data. Source data are available online for this figure. Because the mononucleotide PWM accounts for base contributions independently and does not consider base combinations, this treatment might be an inadequate model of the ribosome–mRNA interactions (Bulyk et al, 2002). To investigate the possibility of cooperativity between TIS positions, we next evaluated a dinucleotide PWM, which accounts for all possible interactions between any two base positions. The dinucleotide PWM was trained on the raw FACS-seq data in a similar manner as the mononucleotide PWM, again excluding TIS sequences with an upstream AUG (Supplementary Table S1). Compared to the mononucleotide PWM, the dinucleotide PWM was in better agreement with the test data generated by conventional flow cytometry (R2 = 0.83, P < 2 × 10−16) (Fig 4C). The model also more accurately predicted TIS efficiency values for those TIS sequences absent from the raw FACS-seq data (Fig 4C, filled squares). The substantial improvement of the dinucleotide PWM over the mononucleotide PWM demonstrated the importance of cooperativity between pairs of bases in start codon recognition. A trinucleotide PWM did not further improve the model (R2 = 0.71, P < 2 × 10−16) (Fig 4D). The poor performance of the trinucleotide PWM may indicate over-fitting of the raw FACS-seq data or may indicate that higher order cooperativity beyond pairwise interactions is of lesser importance. The dinucleotide PWM was used to generate a complete TIS efficiency reference table for every TIS sequence (Supplementary Table S2). The dinucleotide PWM indicated the strongest interaction to be between the +4 and +5 positions. To visualize this interaction, we grouped TIS sequences according to their +4 and +5 bases and calculated the median translation initiation efficiency for each set of TISs (Fig 5A). In line with the dinucleotide PWM, we observed a substantial interaction between the +4 and +5 positions. For example, TIS sequences containing a +4G/+5C were on average 24.8 ± 0.2% more efficient than sequences containing a +4G/+5A, suggesting that a +5C improved start codon recognition (relative to a +5A). However, TIS sequences containing a +4C/+5C were on average 14.6 ± 0.3% less efficient than sequences containing a +4C/+5A, suggesting the opposite trend. Moreover, a +4G, which is commonly thought to be critical for efficient start codon recognition (Kozak, 1986, 1987a), was not always optimal. TIS sequences containing a +4G/+5U were on average 8.2 ± 0.3% less efficient than sequences containing a +4U/+5U. Due to the strong interaction between the +4 and +5 positions, no individual base at either position was a singular determinant for translation initiation efficiency.

Figure 5

Cooperativity in TIS efficiency

Source data are available online for this figure.

Cooperativity in TIS efficiency

A–C Median TIS efficiency (GFP/RFP) for all TIS sequences that contain the specified (A) +4/+5 nucleotides, (B) −3/−2 nucleotides, or (C) −4/−3 nucleotides. All median TIS efficiency values are ± 0.2 (95% CI). Source data are available online for this figure. Surprisingly, the dinucleotide PWM predictions depended heavily on the −4 and −2 positions, which were previously reported to have a minimal impact on start codon recognition (Kozak, 1986, 1987a). Moreover, the model revealed a strong interaction between both positions and the critical −3 position. To visualize the effect, we grouped TIS sequences according to their −4, −3, and −2 sequence (Fig 5B and C). In agreement with the dinucleotide PWM, we observed that the influence of the −4 and −2 positions was dependent on the −3 position. For example, TIS sequences containing a −4C/−3A were on average only 1.7 ± 0.3% more efficient than sequences containing a −4G/−3A, suggesting that the −4 position had minimal influence. However, sequences containing a −4C/−3U were 15.3 ± 0.4% more efficient than sequences containing a −4G/−3U, demonstrating that the influence of the −4 position was dependent on the −3 position. A similar cooperativity was observed between the −3 and −2 positions. For example, TIS sequences containing a −3G/−2A were 2.8 ± 0.3% more efficient than sequences containing a −3G/−2G. However, sequences containing a −3C/−2A were 17.6 ± 0.4% more efficient than sequences containing a −3C/−2G. The implication of these results is that an A or C in the −4 and/or −2 positions partially compensates for a U or C in the −3 position (which generally disfavors start codon recognition). The dinucleotide PWM provided the most accurate, quantitative method of determining TIS efficiency. Yet, for simplicity, we also sought to generalize our results with the following high-efficiency TIS motif: where Y = U or C, M = A or C, R = A or G, and V = A, C, or G. The −4, −3, −2, +4, and +5 positions were the most critical for efficient start codon recognition. Due to the considerable interactions between the +4 and +5 positions, it was difficult to generalize these positions without accounting for dinucleotide combinations. Nonetheless, we chose to include the +4G/+5C combination because it resulted in consistently high initiation efficiency, even without the −3R. Other +4/+5 combinations, such as +4G/+5G or +4A/+5C, were also favorable to initiation. TIS sequences that matched our full motif were on average 10 ± 1% more efficient than TIS sequences that only contained a −3R/+4G (Supplementary Fig S4). As opposed to previously reported TIS consensus motifs, which were based on the frequency of TIS usage in the genome, we have constructed a high-efficiency TIS motif by directly measuring the relative translation efficiency of tens of thousands of TIS sequences.

Effects due to different reporter genes, cell types, environment, and mRNA secondary structure

We next sought to determine the degree to which other experimental variables affected our measurements of translation initiation. When we evaluated a subset of TIS sequences using our reporter system, we observed reproducible levels of translation initiation across a variety of cell lines: mouse pre-B lymphocytes (PD-31), mouse fibroblasts (NIH-3T3), human colon cancer (HCT-116), human cervical cancer (HeLa), human hepatocellular carcinoma (HepG2), and human chronic myelogenous leukemia (K562) (Supplementary Fig S2A). We also found that translation of yellow fluorescent protein (YFP) and blue fluorescent protein (BFP) yielded similar but not identical measurements when compared to the translation levels of the GFP reporter (Supplementary Fig S2B). Furthermore, the TIS efficiency measurements were not greatly affected by varying the culture and growth conditions (Supplementary Fig S2C). In summary, while factors other than the TIS sequence can affect translation initiation efficiency, our experiments indicated that our reporter system was adequate to gauge relative levels of translation initiation efficiency across various experimental situations. We considered whether mRNA secondary structures could have affected the output from our TIS reporters. Secondary structures, especially near the 5′ cap, have been reported to decrease the overall rate of translation, presumably by decreasing the flux of scanning ribosomes (Babendure et al, 2006). The 5′ UTR used in this study was predicted to have a weak hairpin at the 5′ cap (bases 1–28, ΔG = −8 kcal/mol) and a stable structure in the retroviral U5 region (bases 83–191, ΔG = −51 kcal/mol). These mRNA structures may have affected the absolute rate of translation. However, because the structures were shared by the entire TIS reporter library, the relative TIS efficiency values were not affected. It was also possible that specific TIS sequences could have resulted in the formation of an mRNA structure near the start codon (Gu et al, 2010; Tuller et al, 2010; Goodman et al, 2013). To investigate this possibility, we calculated the folding energy for 65,536 mRNA sequence associated with each TIS sequence in our reporter library. Each sequence consisted of the 70 bases upstream of the TIS, a single 11-base TIS region, and 70 bases downstream of the TIS (Supplementary Fig S3). The difference between the most stable and least stable secondary structure was −16 kcal/mol. We did not observe any significant relationship between the mRNA folding energy and the TIS efficiency (P = 0.18).

Genome-wide analysis of TIS efficiency

Next, we performed a quantitative genome-wide analysis of human TIS efficiency. As a point of reference, we first considered the distribution of initiation efficiencies for the entire TIS sequence space (i.e., all 65,536 TIS sequences) as determined by our dinucleotide PWM (Fig 6A). The range of TIS efficiencies varied 12-fold with the distribution skewed toward efficient initiation, indicating that the majority of possible TIS sequences that contain an AUG start codon result in efficient initiation. The TIS sequences predicted to be the most efficient often contained a −3R or a +4G, consistent with previous research (Kozak, 1986, 1987a). However, there were many discrepancies. For example, 15% of the TIS sequences in the top quartile did not have a −3R and 68% did not have a +4G. These discrepancies emphasize the importance of considering the full sequence when determining TIS efficiency.

Figure 6

Genome-wide analysis of human translation initiation efficiency

Distribution of the initiation efficiencies for the entire TIS sequence space (i.e., all permutations of the motif NNNNNNAUGNN).

Distribution of initiation efficiencies for all annotated human TISs.

Distribution of TIS efficiencies for all human uORF TISs (i.e., AUG start codons located in 5′ UTRs). Distribution of efficiencies for all TIS sequences included for comparison (dashed-line, B and C). Efficiency values were predicted with the dinucleotide PWM. A value of 100 corresponds to the TIS efficiency of GCCACCAUGGG.

Genome-wide analysis of human translation initiation efficiency

Distribution of the initiation efficiencies for the entire TIS sequence space (i.e., all permutations of the motif NNNNNNAUGNN). Distribution of initiation efficiencies for all annotated human TISs. Distribution of TIS efficiencies for all human uORF TISs (i.e., AUG start codons located in 5′ UTRs). Distribution of efficiencies for all TIS sequences included for comparison (dashed-line, B and C). Efficiency values were predicted with the dinucleotide PWM. A value of 100 corresponds to the TIS efficiency of GCCACCAUGGG. Next, we analyzed the initiation efficiency for all human protein coding TISs (i.e., TISs that start a protein coding ORF). The dinucleotide PWM was applied to 21,579 protein coding TIS sequences from the RefSeq database (RefSeq release 55, September 2012). By comparing the distribution of the protein coding TISs to that of the entire TIS sequence space, we concluded that human protein coding TISs are enriched for efficient initiation sequences (Fig 6B, Supplementary Fig S5), which qualitatively agrees with previous reports (Kozak, 1986). Quantitatively, the mean efficiency of human protein coding TISs was 15.2 ± 0.4% higher than that of the TIS sequence space (i.e., higher than if the TIS sequences were chosen at random). 42% of the protein coding TISs had an efficiency value in the top quartile of the TIS sequence space, while only 7% of the protein coding TISs had an efficiency value in the bottom quartile of the TIS sequence space. We also compared the protein coding TISs to upstream ORF (uORF) TISs (i.e., AUG-containing TISs found in the 5′ UTR that initiate uORFs). The dinucleotide PWM was used to predict the efficiency of 37,474 uORF TIS sequences found in the RefSeq database. The distribution of the uORF TISs was similar to that of the entire TIS sequence space (Fig 6C, Supplementary Fig S5). The mean efficiency of uORF TISs was only 1.6 ± 0.4% higher than that of the TIS sequence space, suggesting that there was no overarching preference for higher or lower initiation efficiency at uORFs. While protein coding TISs were enriched for efficient initiation and minimal leaky scanning, uORFs used the full spectrum of possible TIS efficiencies. To our knowledge, this is the first quantitative genome-wide survey of TIS efficiency.

Somatic TIS mutations associated with tumorigenesis

Our comprehensive analysis of the TIS sequence space enabled us to search for candidate driver mutations in tumorigenesis. Mutations in the TIS motif have the potential to alter the efficiency of start codon recognition, thereby disrupting the protein translation levels and possibly impacting human health (Kozak, 2002a; Wolf et al, 2011). Prior to this work, the qualitative effect of TIS mutations at positions other than the −3 or +4 was difficult to predict (Xu et al, 2010). However, a few studies had already demonstrated that mutations at positions other than the −3 or +4 are biologically relevant (Afshar-Kharghan et al, 1999; Usuki & Maruyama, 2000; González-Conejero et al, 2002; Jacobson et al, 2005). For example, a −1C > U polymorphism in the CD40 gene has been associated with Graves' disease, an autoimmune disorder (Jacobson et al, 2005). The polymorphism decreased the CD40 TIS efficiency by 15–32%, demonstrating that even modest changes in TIS efficiency can be biologically relevant. TIS mutations at positions other than the −3 and +4 are largely uncharacterized despite their potential impact on human health. We screened the Catalogue of Somatic Mutations in Cancer (COSMIC) for somatic TIS mutations in positions other than the −3 or +4. The database contains somatic mutations associated with genes implicated in tumorigenesis (reported in the literature and from sequencing of patient samples) (Forbes et al, 2011). We were most interested in the TIS mutations that could be causative drivers of tumorigenesis (in contrast to passenger mutations). Therefore, we chose mutations where the predicted change in protein translation efficiency was consistent with other known tumor expression patterns (i.e., changes in the mRNA level or gene copy number). For example, if a TIS mutation in the database was predicted to decrease the protein translation efficiency and the gene containing that mutation was also known to be down-regulated at the mRNA level in other tumor samples, then the TIS mutation was consistent with known tumor expression patterns. The COSMIC database contained 554 mutations in the TIS motif (−6 to +5 positions, excluding AUG start codon) (Supplementary Table S3). Of these mutations, 276 were located at positions other than −3 or +4. We assessed the candidate mutations using publicly available gene expression data, gene copy number data, and finally, through a literature search. Here, we have chosen to report 7 TIS mutations that spanned an array of predicted changes to TIS efficiency, which we used to verify our predictions (Table 1). The identified mutations occurred at a variety of positions and were potentially causative in tumorigenesis. The 11-base TIS sequences, both with and without the somatic mutation, were generated, inserted into our dual fluorescence TIS reporter, and analyzed by conventional flow cytometry. The predicted changes in TIS efficiency were in good agreement (R2 = 0.93) with the measured changes in TIS efficiency (Fig 7), indicating that the dinucleotide PWM can be used to predict the effect of mutations in the TIS sequence.

Table 1

Somatic TIS mutations associated with tumorigenesis

Gene	TIS sequence and mutation	Tumor expression patterns
MAP2K3		Downregulated in human breast cancer. Forced expression induces senescence (Jia et al, 2010)

MBP		Deleted in 61% of ovarian serous carcinomas and 54% of head neck squamous cell carcinomas (P < 3 × 10⁻¹³)

NOS1		Downregulated 2-fold in glioma samples (P = 3.6 × 10⁻³⁹)

PLD5		Downregulated 2.3-fold in serous carcinomas (P < 3 × 10⁻¹³)

DAB2		Amplified in 71% of lung squamous cell carcinomas, and 56% of lung adenocarcinomas (P = 3.2 × 10⁻⁴)

MED8		Upregulated 3-fold in uterine tumor and endometrioid samples (P = 4 × 10⁻¹⁰¹). Key regulator of cell proliferation (Miklos et al, 2008)

DHX33		Upregulated 2-fold in lung cancer (P < 10⁻¹⁰⁰). Ras^G12V upgregulates gene; p19^ARF downregulates gene (Zhang et al, 2013)

Figure 7

Effect of mutations on TIS reporter

Somatic TIS mutations associated with tumorigenesis

Effect of mutations on TIS reporter

TIS sequences from Table 1 (both with and without mutations) were inserted into the TIS reporter. The TIS sequences were analyzed individually by conventional flow cytometry, and the change in TIS efficiency caused by the TIS mutation was calculated. The observed changes in TIS reporter efficiencies were compared to the dinucleotide PWM predictions. Error bars represent 95% CI (n = 3). The predicted change in TIS efficiency for DHX33 was 798% (off-scale). Source data are available online for this figure. For example, we identified a mutation in the −4 position of the NOS1 TIS sequence that could play a role in tumorigenesis. The gene NOS1 is part of the neuronal nitrous oxide synthesis pathway and has been associated with tumorigenesis (Fukumura et al, 2006). In lower grade glioma datasets, NOS1 mRNA expression was on average down-regulated twofold (The Cancer Genome Atlas, https://tcga-data.nci.nih.gov/tcga/). A mutation that decreases the translation efficiency would impact on NOS1 protein levels in a manner similar to the observed mRNA down-regulation. A −4C > U mutation in the NOS1 TIS was identified in the COSMIC database and was predicted by the dinucleotide PWM to reduce the translational efficiency by 27 ± 9%. When we analyzed the mutation with specifically generated TIS reporters (with and without the mutation), we observed a 30 ± 3% decrease in efficiency. As another example, we identified a possible driver mutation in the +5 position of the MED8 gene, which assists in the regulation of cell proliferation through the transcriptional activation of RNA polymerase II-dependent genes (Miklos et al, 2008; Taatjes, 2010). A threefold up-regulation of MED8 mRNA was observed in uterine tumors and endometrioid samples (The Cancer Genome Atlas). A +5U > G mutation in the MED8 TIS was predicted to increase the efficiency by 46 ± 15% and was observed to increase the efficiency of our TIS reporter by 73 ± 13%. Therefore, the mutation was consistent with the tumor expression pattern and was identified as a candidate tumorigenic driver mutation. In summary, we have linked known tumorigenic expression patterns, previously validated only for RNA level or gene copy number, with changes in TIS efficiency, thereby providing a candidate explanation for how the TIS mutations could promote tumorigenesis.

Employing a quantitative leaky scanning model to predicting truncated protein isoforms

Next, we investigated how suboptimal TIS efficiency could increase proteomic diversity. Ribosomes that ‘leak’ past a TIS are capable of initiating at downstream TISs, thereby generating alternative translational isoforms. Just as alternative RNA splicing enables a single gene to encode multiple transcript variants, leaky scanning enables a single gene to encode multiple protein isoforms (Ingolia et al, 2011; Lee et al, 2012; Michel et al, 2012; Ben-Yehezkel et al, 2013). In the instances where a TIS is downstream and in-frame with the transcript's annotated start codon, an N-terminal truncated protein is translated. Compared to the full-length protein, the N-terminal truncated protein may have an altered biological function or, because many localization signals reside on the N-terminus, may have an altered cellular localization (Danpure, 1995). The effect of leaky scanning on proteomic diversity remains poorly investigated in part due to the difficulty in identifying when N-terminal truncated protein isoforms are translated (Bazykin & Kochetov, 2011). To identify N-terminal truncated protein isoforms, we developed a quantitative leaky scanning model that incorporated our TIS efficiency values to predict the relative translation of truncated ORFs. To model the leaky scanning mechanism, we considered the sequence of two TISs: the annotated TIS and the nearest downstream TIS with an in-frame AUG start codon (Supplementary Fig S6). The ratio of the initiation occurring at the second TIS compared to the initiation occurring at the first TIS was described by The parameters E1 and E2 are the relative efficiency values determined by the first and second TIS sequences, respectively. The constant k relates the relative efficiency values to an absolute probability of initiation. We previously estimated k to be 0.86 by analyzing the effect of synthetic uORFs on a downstream initiation reporter (Ferreira et al, 2013). Using the quantitative leaky scanning model, we predicted 1,023 human genes to have secondary protein isoforms where 0.50 ≤ X < 1.0 and 318 genes where X ≥ 1.0 (Supplementary Table S4). In other words, for all 1,341 genes, we predict initiation to occur at a truncation TIS (i.e., initiation site of a truncated ORF) with a frequency greater than or equal to 50% of that of the annotated TIS. The predictions from our model suggest that ‘leaky’ ribosome scanning broadly increases the proteomic diversity. We sought to corroborate the quantitative leaky scanning model predictions with previously reported ribosome footprint profiling data. Briefly, ribosome footprint profiling uses the small molecule drug harringtonine to prevent newly initiated ribosomes from translating, resulting in the accumulation of ribosomes at functional initiation sites and the depletion of downstream ribosomes (Ingolia et al, 2009, 2011, 2012). By sequencing and aligning the protected mRNA fragments, one can determine the locations of translation initiation. A drawback of the method is that the harringtonine treatment skews initiation toward upstream TISs, since a ribosome arrested at an upstream TIS would block another scanning ribosome, one that has not yet reached the upstream TIS, from reaching the downstream TIS. In contrast, a ribosome arrested at the downstream TIS would not prevent a second ribosome from initiating at the upstream TIS. We compared the quantitative leaky scanning model predictions (applied to the mouse transcriptome; Supplementary Table S5) to ribosome footprint profiling data from mouse embryonic stem cells (Ingolia et al, 2011). We filtered the data such that we only analyzed transcripts with a sufficient number of ribosome footprint reads (≥ 50 at the annotated TIS), where the TISs were sufficiently separated (≥ 20 bases), and no annotated alternative 5′ transcript isoforms existed. The 980 transcripts that fit our criteria were scored for evidence of initiation at the putative truncation TIS (Fig 8A). We observed evidence of a truncation TIS in 93% of the transcripts that had a predicted initiation ratio, X ≥ 1.0. These results were compared to qualitative predictions based on −3/+4 position ‘rules of thumb’, which would have generated the most reasonable predictions prior to this work (Fig 8B). Transcripts where the annotated TIS contained a −3U and lacked a +4G would previously have been considered the most inefficient TIS sequences and therefore the most likely TISs to allow for leaky scanning (Kozak, 1995). We observed evidence of a truncation TIS in 63% of the transcripts that met these criteria. Therefore, the quantitative leaky scanning model was more accurate than simple qualitative −3/+4 position rules at predicting which genes were most likely to have an active truncation TIS (P = 0.013).

Figure 8

Ribosome footprint profiling of genes with predicted truncation isoforms

A The fraction of transcripts where ribosome profiling supports initiation at the truncation TIS grouped by predicted initiation ratio, X. Error bars represent 95% C.I.

B Same as (A), but transcripts were grouped by the −3 and +4 positions of the annotated TIS.

C–E Examples of transcripts with observed initiation at the truncation TIS. Ribosome density provided for transcripts with and without harringtonine treatment. The full-length annotated ORFs and the predicted truncated ORFs are indicated with their TIS sequences. Gene names, transcript names, and predicted initiation ratios are provided.

Ribosome footprint profiling of genes with predicted truncation isoforms

An in silico quantitative leaky scanning model predicted the ratio, X, of initiation at truncation TISs relative to the initiation at annotated TISs. The predictions were compared to ribosome footprint profiling data from mouse embryonic stem cells. A The fraction of transcripts where ribosome profiling supports initiation at the truncation TIS grouped by predicted initiation ratio, X. Error bars represent 95% C.I. B Same as (A), but transcripts were grouped by the −3 and +4 positions of the annotated TIS. C–E Examples of transcripts with observed initiation at the truncation TIS. Ribosome density provided for transcripts with and without harringtonine treatment. The full-length annotated ORFs and the predicted truncated ORFs are indicated with their TIS sequences. Gene names, transcript names, and predicted initiation ratios are provided. Sap18, Npepps, and Hagh are examples where the quantitative leaky scanning model was in strong agreement with the ribosome footprinting data. Sap18, which encodes for a histone deacetylase protein (McCallum et al, 2006), has a low-efficiency annotated TIS sequence UAGCUCAUGCU followed by the high-efficiency TIS sequence AGGAAGAUGGC (Fig 8C). The quantitative leaky scanning model predicted significant amounts of initiation at the truncation TIS. In support of this prediction, the ribosomal footprint profiling data showed peaks at both TISs, suggesting that initiation was occurring at both TISs. As a second example, Npepps, which encodes for an aminopeptidase protein (Yanagi et al, 2009), also demonstrated evidence of initiation at a truncation TIS (Fig 8D). The low-efficiency annotated TIS CGGUGAAUGUG was followed by the high-efficiency TIS sequence GCCGCGAUGCC. The quantitative leaky model predicted approximately equal initiation at the annotated TIS and at the truncation TIS, which was in agreement with ribosome footprint profiling data. A third example is Hagh, which encodes a glyoxalase II enzyme involved in metabolism (Fig 8E). The gene's annotated TIS sequence, UGGGUCAUGGU contains a −3G/+4G, which would traditionally be considered highly efficient (Harte et al, 2012). However, using the full TIS sequence, our quantitative leaky scanning model predicted that a considerable fraction of the ribosomes would leak past the annotated TIS and initiate at the truncation TIS (GACAUAAUGAA). The ribosome footprint profiling peaks at both the annotated TIS and the putative truncation TIS suggest that initiation occurs at both sites, supporting our predictive model. Overall, we found the quantitative leaky scanning model generally agreed with the ribosome footprint profiling data, demonstrating the ability of our model to predict the expression of translational protein isoforms from the mRNA sequence alone.

Discussion

A long-term goal of genomics has been to use the genome sequence to predict proteome-wide expression (McLeay et al, 2012). Any quantitative prediction would need to account for variation in the efficiency of translation initiation, which in turn depends on the start codon context. As a step toward this goal, we employed FACS-seq to quantitatively measure the efficiency of translation initiation for a library consisting of 65,536 TIS sequences, which covers more TIS sequences than are present in the human genome. Using a dinucleotide PWM model to analyze this dataset revealed the importance of cooperativity between pairs of bases. The model also allowed us to process the FACS-seq raw data and more reliably gauge initiation efficiency. We generalized our findings with the high-efficiency motif RYMRMVAUGGC. Consistent with previous research (Kozak, 1986, 1987a), we found the −3 and the +4 positions to be important in determining TIS efficiency. However, our analysis also found the −2, −4, and +5 positions to be influential. The divergence of consensus sequences within eukaryotic species has been used to argue that the initiation machinery, and its preference for TIS sequences, has undergone substantial evolution (Hinnebusch, 2011; Zur & Tuller, 2013). Indeed, there are differences between the reported consensus sequences for yeast (AMAMAAUGUCU (Cigan & Donahue, 1987; Hamilton et al, 1987)), higher plants (CAAMAAUGGCG (Joshi et al, 1997)), invertebrates (CAARAUGG (Pesole et al, 2000)), and vertebrates (GCCRCCAUGG (Kozak, 1987b)). However, the TIS usage does not necessarily equate to the ribosome's TIS sequence preference (Lukaszewicz et al, 2000). Except for the −5A and +4U in the yeast sequence, the three consensus sequences are remarkably similar to the high-efficiency mammalian TIS motif RYMRMVAUGGC, which we determined without incorporating any genome information. The differences in consensus sequences may be related to factors other than ribosome binding preferences. For example, eukaryotes with low GC-content genomes have been shown to favor A-rich TIS sequences (Nakagawa et al, 2008). In the case of yeast, which has a low GC-content genome, the scanning ribosome is particularly sensitive to secondary mRNA structure (Cigan & Donahue, 1987; Baim & Sherman, 1988). The high frequency of As in the TIS may reflect a selective pressure to prevent stable secondary structures while also maintaining high translational efficiency. In our study, we do not observe any appreciable relationship between secondary structure and TIS efficiency, perhaps because the mammalian eIF4 RNA helicase prevents stable secondary structures from interfering with the scanning ribosome. We hypothesize that much of the divergence in eukaryotic TIS consensus sequences could be due to variations in sensitivity of the preinitiation complex to RNA secondary structure near the TIS. The debate surrounding the role, if any, of the +5 position originally inspired us to vary the +5 position in our TIS library (Boeck & Kolakofsky, 1994; Grunert & Jackson, 1994; Kozak, 1997; Xia, 2007; Nakagawa et al, 2008). We found that the +5 position had a large impact on TIS efficiency and displayed significant cooperativity with the +4 position. We cannot discount the possibility that the amino acid encoded by the +4 and +5 nucleotides affected our fluorescence reporter, either by altering the fluorescence per GFP molecule or by altering the GFP half-life. Since GFP fluorescence is generally robust to fusion proteins (Tsien, 1998), it is unlikely that a single amino acid fused to GFP altered the fluorescence per molecule. Alternatively, it is possible that the GFP half-life was affected by the phenomenon commonly referred to as the N-end rule. This rule states that the N-terminal amino acid affects protein stability (Hwang et al, 2010; Varshavsky, 2011). Since all TIS sequences in our library shared a +6C, according to the N-end rule, any TIS sequence without a +4G/+5G (Gly) or +4C/+5C (Pro) should have produced a GFP targeted for acetylation and eventual degradation. Therefore, if the N-end rule had significantly impacted our data, then TIS sequences with +4G/+5G or +4C/+5C should have produced a GFP with a longer half-life, increasing the GFP/RFP measurements. We did not observe this trend and, in fact, TIS sequences with +4C/+5C had some of the lowest GFP/RFP measurements (Fig 5A). Thus, our data suggest that the N-end rule had a minor effect, if any, on our GFP/RFP measurements. The biophysical details of how the TIS sequence affects the ribosomal preinitiation complex remain unclear. Recently, Lomakin and Steitz (2013) determined the crystal structure of the mammalian preinitiation complex (48S PIC). The structure confirmed that the −6 to the +7 positions were within the ribosome's mRNA channel, allowing for the possibility that the +6 and +7 positions could also influence initiation. Downstream of the start codon, the +4 and +5 positions were proximal to 18S helix h44, eIF1, and the basic loop of eIF1A. Upstream of the start codon, the ribosomal proteins rpS26e and rpS28e were proximal to the influential −4, −3, and −2 positions. Additionally, the −3 position has been shown to UV-cross-link with eIF2α, which was not included in the crystal structure (Pisarev et al, 2006). Interestingly, the suboptimal −3U was cross-linked less efficiently with eIF2α than a −3G. It is possible −2 and −4 positions influence initiation by promoting the −3 interaction with eIF2α. Unfortunately, the preinitiation complex crystal structure resolution was too low to determine the orientation of the mRNA bases or the orientation of the residue side chains in the mRNA channel. A higher resolution crystal structure would enable us to determine not just which parts of the preinitiation complex are proximal to the TIS, but also the nature of these interactions, enabling a biophysical explanation for the TIS efficiency trends described here. The comprehensive and quantitative analysis of TIS efficiency enabled us to perform a genome-wide analysis of TIS efficiency, to identify TIS mutations that could drive tumorigenesis, and to identify transcripts that encode truncated protein isoforms. Using a quantitative leaky scanning model that accounted for TIS efficiency, we predicted the occurrence of protein isoforms resulting from internal initiation within annotated ORFs, effectively generating truncations. The agreement between our model and the available ribosome footprint profiling data suggests that initiation events previously attributed to internal ribosome entry site (IRES) structures may instead be a result of leaky scanning (Ingolia et al, 2011). For example, a truncation TIS in the HAGH gene had been experimentally validated (Cordell et al, 2004). Since the annotated TIS contained a −3G/+4G, which was thought to cause efficient initiation with minimal leaky scanning, the initiation at the truncation TIS was attributed to a putative IRES structure. However, our leaky scanning model, which considered the full TIS sequence (UGGGUCAUGGU), predicted substantial leaky scanning (predicted initiation ratio X = 0.52) (Fig 8E). While we cannot rule out the presence of an IRES, our results suggested that leaky scanning can account for some, if not all, of the initiation at the downstream TIS. Thus, our results agree with previous studies calling into doubt many of the putative mammalian IRES structures (Han & Zhang, 2002; Kozak, 2003, 2007; Bert et al, 2006; Elango et al, 2006). By employing FACS-seq, a method combining FACS and high-throughput DNA sequencing, we were able to analyze a library of genetic reporters and determine the translation initiation efficiency for all possible TIS sequences utilizing an AUG start codon. The method is similar in design to other recent studies (Kinney et al, 2010; Sharon et al, 2012; Kosuri et al, 2013). To our knowledge, our implementation of FACS-seq has to date analyzed the largest number of sequences in one experiment (65,536 vs. 14,234 (Goodman et al, 2013)). With some optimization, we believe that FACS-seq can be applied to libraries approaching 106 sequences, with the practical limit being determined by the time required to sort the library. Our ability to analyze such a large number of sequences was dependent on the precision of the readout from our translation reporter. As long as an effective fluorescent reporter can be designed, FACS-seq and similar methods should in principle enable comprehensive analysis of any DNA or RNA sequence library. Because the sequence space of biological motifs grows exponentially larger with each additional base, comprehensive analysis with traditional techniques can be challenging or impractical. Instead, massively parallel methods like FACS-seq enable the thorough analysis of a desired sequence space such that one can now predict the behavior of all sequence elements employed in the genome.

Materials and Methods

TIS reporter library construction

The TIS reporter library was constructed using degenerate primers followed by a high-efficiency Gibson reaction. Monomeric enhanced GFP (EGFP A207K, here referred to as GFP) was PCR amplified and gel purified using the forward primer 5′-CATCCTCTAGACTGCCGGATCTCGAGTAACTGACTAGT NNNNNN ATG NN CGAATTCAGCAAGGGCGAGGAG-3′ and reverse primer 5′-CGGAATTGGCCGCCCTAGATG-3′. The eight nucleotides that were varied are indicated by Ns in the degenerate forward primer. The three upstream stop codons in all three reading frames prevented an upstream start codon from translating GFP. To construct the plasmid library, pCru5-IRES-mCherry-F2A-Puro plasmid was digested with XhoI and NotI, gel purified, and then used along with the amplified GFP in an optimized Gibson reaction. TOP10 E. coli cells were transformed with the Gibson reaction product to yield greater than 106 E. coli transformants. This yielded a > 99% chance of each possible TIS sequence being present in at least one of the transformants (assuming that the reaction was not biased for specific TIS sequences). The colonies were scraped into a single flask followed by a DNA HiPure Maxiprep purification (Life Technologies, Carlsbad, CA).

Cell culture

Retroviral particles of the TIS reporter library were produced by transiently transfecting HEK-293T cells with equal amounts of pCru5-GFP-IRES-mCherry-F2A-Puro DNA and pCL-Eco retrovirus packaging vector. The transfection followed the CalPhos Mammalian Transfection Kit protocol (Clonetech Laboratories, Inc., Mountain View, CA). Viral particles were harvested and filtered through a 0.4-μm filter. FACS-seq experiments were carried out in PD-31 cells, an Abelson murine leukemia virus-transformed pre-B cell line. The cells were cultured in RPMI-1640 medium (Life Technologies) supplemented with 10% fetal bovine serum (FBS, Gemini Bio Products, Sacramento, CA), 2 mM glutamine, 1 mM sodium pyruvate, 0.05 mM 2-mercaptoethanol, 100 U/ml penicillin, and 100 μg/ml streptomycin at 37°C with 5% CO2. To insure full library coverage, virus was added to 350 × 106 PD-31 cells with 8 μg/ml polybrene (hexadimethrine bromide). The infection frequency was 2.3% as judged by flow cytometry 2 days post-infection, insuring that transduced cells received a single copy of the vector. Puromycin (2 μg/ml) was added to the media 3 days post-infection to select for infected cells. Additional experiments were carried out in a variety of cell lines and conditions. NIH-3T3 cells, HeLa cells, and HepG2 cells were cultured in DMEM medium (Life Technologies) with 10% FBS, 4.5 g/ml glucose, and 2 mM glutamine. PD-31 cells (standard conditions) and K562 cells were cultured in RPMI-1640 medium with 10% FBS, 2 mM glutamine, 1 mM sodium pyruvate, and 0.05 mM 2-mercaptoethanol. HCT-116 cells were cultured in McCoy's 5A medium (HyClone Laboratories, Logan, UT) with 10% FBS. Where indicated, 1.75 μM imatinib (Novartis, Basel, Switzerland) was added (Muljo & Schlissel, 2003; Ferreira & Wang, 2013). All cells were cultured with 100 U/ml penicillin and 100 μg/ml streptomycin at 37°C with 5% CO2.

Cell sorting

PD-31 cells transduced with the TIS reporter library were sorted based on the ratio of GFP to mCherry (RFP). All sorting was performed on an Aria II (BD Biosciences, Franklin Lakes, NJ) using the GFP channel (488 nm excitation laser, 505 nm splitter, 525/50 nm emission filter) and the mCherry channel (561 nm excitation laser, no splitter, 582/18 nm emission filter). The voltages of the GFP and RFP channels were reduced such that the GFP and RFP values of non-fluorescent cells were approximately 50 a.u. Cells were gated for RFP values greater than 103 (i.e., 20× autofluorescence). Using the FACSDiva software (BD Biosciences), a GFP to RFP ratio parameter (here referred to as GFP/RFP) was created. On the day of sorting, a preliminary reading was taken of the TIS library expressing cells. Custom MATLAB software (MathWorks, Natick, MA) analyzed the preliminary data to calculate the correct gate spacing so that 5% of the cells fell into each of the 20 gates. 106 PD-31 cells were sorted into each of the 20 gates.

Barcoding and sequencing

The TIS sequences from each sorted population were PCR amplified, barcoded, and sequenced. The PD-31 cells were cultured for 2 days after cell sorting. Genomic DNA (gDNA) was isolated with DNeasy spin column (Qiagen, Venlo, Netherlands). An isopropanol precipitation was used to concentrate the gDNA. A mean of 22 μg of gDNA was added to each of the 20 PCRs using barcoded primers (Supplementary Table S6). Each barcoding PCR was 100 μl in size and used OneTaq HS with the standard buffer (New England Biolabs, Ipswich, MA). The PCR products were gel extracted, purified, and mixed in equimolar ratios. A second PCR was used to add the standard Illumina sequencing adapters (Supplementary Table S6). The final PCR product was sequenced with the Illumina Genome Analyzer IIx (San Diego, CA).

Data analysis

The TIS sequences and barcodes were extracted from the DNA sequencing reads to create the FACS-seq histograms. The sequences were first normalized by barcode number, such that each barcode represented an equal fraction of the total population. The gate coordinates used for the cell sorting were used to simulate the original GFP/RFP histograms and to calculate the median translation initiation (GFP/RFP). All of the analysis was performed with custom MATLAB scripts.

Position weight matrices

To generate a position weight matrix, linear regression was performed using the natural logarithm of the raw FACS-seq data as the dependent variable and the TIS sequence as the independent variables (Barrick et al, 1994; Salis et al, 2009). By using the logarithm, we were modeling the interaction between the mRNA and the ribosomal preinitiation complex as an association reaction at equilibrium. Hence, the efficiency of translation initiation, E, would be described by: where ΔG is the change in Gibbs free energy upon binding of the preinitiation complex with the TIS sequence, R is the ideal gas constant, and T is temperature. Using this rational, the mononucleotide PWM modeled the efficiency as: where k was a proportionality constant and C was the coefficient for base b = {U, C, A, or G} at position i = {−6, −5, −4, −3, −2, −1, +4, or +5}. The coefficients were normalized such that: which caused the coefficients to be proportional to the specific free energies of ribosome binding (Barrick et al, 1994). The dinucleotide PWM was constructed in a manner similar to the mononucleotide PWM except the independent variables included all possible pair-wise interaction. where C matched the mononucleotide PWM coefficients, and was the interaction coefficient for base b at position i and base b at position i (subject to i ≠ i). The coefficients were normalized such that The dinucleotide PWM was applied using custom MATLAB and R scripts. For genome-wide predictions, the mouse and human RefSeq GenBank files (release 55, September 2012) were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/refseq/). The AUG-containing TIS sequences were extracted from the mRNA sequence and used to predict TIS efficiency values.

mRNA secondary structure

All mRNA secondary structure predictions were performed with the open source NUPACK software (Zadeh et al, 2011). The mRNA folding free energy (ΔG) was calculated using the pfunc command. The temperature was set at 37°C, dangle energies were included (-dangles all), and the default RNA parameters were used (-material rna1995).

SNP analysis

Cancer-related mutations were identified by intersecting COSMIC mutations (version 65 full export updated May 28, 2013) with the annotated human TIS sequences (GRCh7/hg19) using bedTools (Quinlan & Hall, 2010). Mutations in these regions were evaluated using the dinucleotide PWM to predict the change in expression. Genes shown in Table 1 were evaluated for expression and copy number changes using publicly available data from The Cancer Genome Atlas Project using the Cancer Genome Browser (Goldman et al, 2012) and the Copy Number Portal. All statistical analyses on expression data were performed using a Student's t-test with a Bonferroni multiple hypothesis testing correction. Copy number variation P-values came from the Copy Number Portal.

Ribosome footprint profiling analysis

Ribosome footprint profiling datasets from Ingolia et al, 2011, were downloaded from NCBI SRA browser (http://www.ncbi.nlm.nih.gov/sra): SRR315612, SRR315613, SRR315614, SRR315615 (mESC treated with 1 μg/ml harringtonine followed by 100 μg/ml cyclohexamide) and SRR315602, SRR315601 (mESC treated with 100 μg/ml cyclohexamide). After removing low-quality reads and the cloning adapter, ribosomal RNAs were removed using Bowtie. The remaining reads were mapped to the mouse genome (NCBI37/mm9). Scoring of ribosome footprint profiles was performed using a custom MATLAB graphical user interface. The user was displayed a trace of the ribosome footprint data with the location of the annotated TIS and the putative truncation TIS. The user was asked to manually score if the ribosome footprint data supported the putative truncation TIS. As with all visual scoring methods, human bias was a major concern. In order to limit any bias, the transcripts were scored in a random order and with no additional information provided to the user. An attempt was made to automate the TIS peak calling. However, the results were inconsistent with visual inspection and were, in our opinion, unsatisfactory. Because we had a limited number of transcripts to score, we chose a manual process that emphasized accuracy over throughput. For full transparency, all transcript traces, predictions, and scoring are provided in Supplementary Fig S7.

80 in total

1. NUPACK: Analysis and design of nucleic acid systems.

Authors: Joseph N Zadeh; Conrad D Steenberg; Justin S Bois; Brian R Wolfe; Marshall B Pierce; Asif R Khan; Robert M Dirks; Niles A Pierce
Journal: J Comput Chem Date: 2011-01-15 Impact factor: 3.376

Review 2. The human Mediator complex: a versatile, genome-wide regulator of transcription.

Authors: Dylan J Taatjes
Journal: Trends Biochem Sci Date: 2010-03-17 Impact factor: 13.807

3. Causes and effects of N-terminal codon bias in bacterial genes.

Authors: Daniel B Goodman; George M Church; Sriram Kosuri
Journal: Science Date: 2013-09-26 Impact factor: 47.728

4. Kozak sequence polymorphism of the glycoprotein (GP) Ibalpha gene is a major determinant of the plasma membrane levels of the platelet GP Ib-IX-V complex.

Authors: V Afshar-Kharghan; C Q Li; M Khoshnevis-Asl; J A López
Journal: Blood Date: 1999-07-01 Impact factor: 22.113

5. Coordinated movements of eukaryotic translation initiation factors eIF1, eIF1A, and eIF5 trigger phosphate release from eIF2 in response to start codon recognition by the ribosomal preinitiation complex.

Authors: Jagpreet S Nanda; Adesh K Saini; Antonio M Muñoz; Alan G Hinnebusch; Jon R Lorsch
Journal: J Biol Chem Date: 2013-01-04 Impact factor: 5.157

Review 6. Sequence and structural features associated with translational initiator regions in yeast--a review.

Authors: A M Cigan; T F Donahue
Journal: Gene Date: 1987 Impact factor: 3.688

7. A common polymorphism in the annexin V Kozak sequence (-1C>T) increases translation efficiency and plasma levels of annexin V, and decreases the risk of myocardial infarction in young patients.

Authors: Rocio González-Conejero; Javier Corral; Vanessa Roldán; Constantino Martínez; Francisco Marín; José Rivera; Juan A Iniesta; María L Lozano; Pascual Marco; Vicente Vicente
Journal: Blood Date: 2002-09-15 Impact factor: 22.113

8. A universal trend of reduced mRNA stability near the translation-initiation site in prokaryotes and eukaryotes.

Authors: Wanjun Gu; Tong Zhou; Claus O Wilke
Journal: PLoS Comput Biol Date: 2010-02-05 Impact factor: 4.475

9. The Human hydroxyacylglutathione hydrolase (HAGH) gene encodes both cytosolic and mitochondrial forms of glyoxalase II.

Authors: Paul A Cordell; T Simon Futers; Peter J Grant; Richard J Pease
Journal: J Biol Chem Date: 2004-04-26 Impact factor: 5.157

10. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling.

Authors: Nicholas T Ingolia; Sina Ghaemmaghami; John R S Newman; Jonathan S Weissman
Journal: Science Date: 2009-02-12 Impact factor: 47.728

64 in total

Review 1. The Untranslated Regions of mRNAs in Cancer.

Authors: Samantha L Schuster; Andrew C Hsieh
Journal: Trends Cancer Date: 2019-03-22

Review 2. uORF-mediated translational control: recently elucidated mechanisms and implications in cancer.

Authors: Hung-Hsi Chen; Woan-Yuh Tarn
Journal: RNA Biol Date: 2019-06-24 Impact factor: 4.652

3. Directed evolution of excited state lifetime and brightness in FusionRed using a microfluidic sorter.

Authors: Premashis Manna; Sheng-Ting Hung; Srijit Mukherjee; Pia Friis; David M Simpson; Maria N Lo; Amy E Palmer; Ralph Jimenez
Journal: Integr Biol (Camb) Date: 2018-09-17 Impact factor: 2.192

4. Upstream ORFs are prevalent translational repressors in vertebrates.

Authors: Timothy G Johnstone; Ariel A Bazzini; Antonio J Giraldez
Journal: EMBO J Date: 2016-02-19 Impact factor: 11.598

Review 5. Conserved Upstream Open Reading Frame Nascent Peptides That Control Translation.

Authors: Thomas E Dever; Ivaylo P Ivanov; Matthew S Sachs
Journal: Annu Rev Genet Date: 2020-09-01 Impact factor: 16.830

Review 6. Functional 5' UTR mRNA structures in eukaryotic translation regulation and how to find them.

Authors: Kathrin Leppek; Rhiju Das; Maria Barna
Journal: Nat Rev Mol Cell Biol Date: 2017-11-22 Impact factor: 94.444

7. The dynamic N(1)-methyladenosine methylome in eukaryotic messenger RNA.

Authors: Dan Dominissini; Sigrid Nachtergaele; Sharon Moshitch-Moshkovitz; Eyal Peer; Nitzan Kol; Moshe Shay Ben-Haim; Qing Dai; Ayelet Di Segni; Mali Salmon-Divon; Wesley C Clark; Guanqun Zheng; Tao Pan; Oz Solomon; Eran Eyal; Vera Hershkovitz; Dali Han; Louis C Doré; Ninette Amariglio; Gideon Rechavi; Chuan He
Journal: Nature Date: 2016-02-10 Impact factor: 49.962

Introduction

Results

High-throughput analysis TIS sequences by FACS-seq

Analysis of TIS sequences via FACS-seq

TIS efficiency determined by FACS-seq

Modeling the relationship between TIS sequence and initiation efficiency

Mononucleotide PWM of TIS efficiency

Comparison on TIS efficiency models

Cooperativity in TIS efficiency

Effects due to different reporter genes, cell types, environment, and mRNA secondary structure

Genome-wide analysis of TIS efficiency

Genome-wide analysis of human translation initiation efficiency

Somatic TIS mutations associated with tumorigenesis

Effect of mutations on TIS reporter

Employing a quantitative leaky scanning model to predicting truncated protein isoforms

Ribosome footprint profiling of genes with predicted truncation isoforms

Discussion

Materials and Methods

TIS reporter library construction

Cell culture

Cell sorting

Barcoding and sequencing

Data analysis

Position weight matrices

mRNA secondary structure

SNP analysis

Ribosome footprint profiling analysis

Review 2. The human Mediator complex: a versatile, genome-wide regulator of transcription.

Review 6. Sequence and structural features associated with translational initiator regions in yeast--a review.

Review 1. The Untranslated Regions of mRNAs in Cancer.

Review 2. uORF-mediated translational control: recently elucidated mechanisms and implications in cancer.

Review 5. Conserved Upstream Open Reading Frame Nascent Peptides That Control Translation.

Review 6. Functional 5' UTR mRNA structures in eukaryotic translation regulation and how to find them.

Review 9. Functional assays for transcription mechanisms in high-throughput.