Can Cenik, Hon Nian Chua, Guramrit Singh, Abdalla Akef, Michael P Snyder, Alexander F Palazzo, Melissa J Moore, Frederick P Roth.
Abstract
Introns are found in 5' untranslated regions (5'UTRs) for 35% of all human transcripts. These 5'UTR introns are not randomly distributed: genes that encode secreted, membrane-bound and mitochondrial proteins are less likely to have them. Curiously, transcripts lacking 5'UTR introns tend to harbor specific RNA sequence elements in their early coding regions. To model and understand the connection between coding-region sequence and 5'UTR intron status, we developed a classifier that can predict 5'UTR intron status with >80% accuracy using only sequence features in the early coding region. Thus, the classifier identifies transcripts with 5' proximal-intron-minus-like-coding regions ("5IM" transcripts). Unexpectedly, we found that the early coding sequence features defining 5IM transcripts are widespread, appearing in 21% of all human RefSeq transcripts. The 5IM class of transcripts is enriched for non-AUG start codons, more extensive secondary structure both preceding the start codon and near the 5' cap, greater dependence on eIF4E for translation, and association with ER-proximal ribosomes. 5IM transcripts are bound by the exon junction complex (EJC) at noncanonical 5' proximal positions. Finally, N1-methyladenosines are specifically enriched in the early coding regions of 5IM transcripts. Taken together, our analyses point to the existence of a distinct 5IM class comprising ∼20% of human transcripts. This class is defined by depletion of 5' proximal introns, presence of specific RNA sequence features associated with low translation efficiency, N1-methyladenosines in the early coding region, and enrichment for noncanonical binding by the EJC.
Keywords: 5′-UTR introns; N1-methyladenosine; exon junction complex; random forest
Year: 2016 PMID: 27994090 PMCID: PMC5311483 DOI: 10.1261/rna.059105.116
Source DB: PubMed Journal: RNA ISSN: 1355-8382 Impact factor: 4.942
FIGURE 1. Modeling the relationship between sequence features in the early coding region and the absence of 5′UTR introns (5UIs). (A) For all human transcripts, information about 36 sequence features of the early coding region (first 99 nt) and 5UI presence was extracted. (B) Transcripts containing a signal sequence coding region (SSCR) were used to train a random forest classifier that modeled the relationship between 5UI absence and 36 sequence features. (C) With this classifier, all human transcripts were assigned a score that quantifies the likelihood of 5UI absence based on specific RNA sequence features in the early coding region. Transcripts with high scores are thus considered to have 5′-proximal intron minus-like coding regions (5IMs). (D) “5′UTR-intron-minus-predictor” (5IMP) score distributions for SSCR-containing transcripts shift to higher scores with later-appearing first introns, suggesting that 5IM coding region features predict not only the lack of a 5UI but also the lack of early coding region introns. (E) Classifier performance was optimized by excluding 5UI− transcripts with introns appearing early in the coding region. Cross-validation performance (area under the precision recall curve, AUPRC) was examined for a series of alternative 5IM classifiers using different first-intron-position criteria for excluding 5UI− transcripts from the training set (Materials and Methods).
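The AUPRC metric named in panel E can be illustrated with a minimal sketch. It uses average precision, one common estimator of the area under the precision-recall curve; the source does not say which estimator the authors used, so this choice is an assumption. The function name, labels, and scores are hypothetical toy values, not data from the paper.

```python
def average_precision(labels, scores):
    """Estimate AUPRC as the mean precision at each rank where a positive occurs.

    labels: 1 for the positive class (here, hypothetically, 5UI-absent), else 0.
    scores: classifier scores; higher means more likely positive.
    Ties in score are broken by input order (an arbitrary choice).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Toy example (illustrative values only).
toy_labels = [1, 0, 1, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_ap = average_precision(toy_labels, toy_scores)
```

In a cross-validation setting, a value like this would be computed on held-out transcripts for each alternative training-set definition and then compared across definitions.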
FIGURE 2. Predicting 5UI status accurately using only early coding sequences. (A) As judged by area under the receiver operating characteristic curve (AUROC) and AUPRC, the 5IM classifier performed well for several different transcript classes. (B) The distribution of 5IMP scores reveals clear separation of 5UI+ and 5UI− transcripts for SSCR-containing transcripts, where each SSCR-containing transcript was scored using a classifier that did not use that transcript in training (Materials and Methods). (C) Coding sequence features that are predictive of 5′ proximal intron presence are restricted to the early coding region. This was supported by identical 5IM classifier score distributions with respect to 5UI presence for negative control sequences, each derived from a single randomly chosen “window” downstream from the third exon from one of the evaluated transcripts. (D) MSCR transcripts exhibited a major difference in 5IMP scores based on their 5UI status even though no MSCR transcripts were used in training the classifier. (E) Transcripts predicted to contain signal peptides (SignalP+) had a 5IMP score distribution similar to that of SSCR-containing transcripts. (F) After eliminating SSCR, MSCR, and SignalP+ transcripts, the remaining S–/MSCR– SignalP– transcripts were still significantly enriched for high 5IM classifier scores among 5UI− transcripts. (G) The control set of randomly chosen sequences downstream from the third exon from each transcript was used to calculate an empirical cumulative null distribution of 5IMP scores. Using this function, we determined the P-value corresponding to the 5IMP score for all transcripts. The red dashed line indicates the P-value corresponding to 5% false discovery rate. The inset depicts the distribution of various classes of mRNAs among the input set and 5IM transcripts.
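The procedure in panel G (empirical null distribution, then a 5% FDR cutoff) could be sketched as follows. Two details are assumptions rather than statements from the source: the add-one correction that keeps P-values above zero, and the use of the Benjamini-Hochberg procedure for FDR control. All function names and numbers are hypothetical.

```python
from bisect import bisect_left


def empirical_p_values(scores, null_scores):
    """P-value = fraction of null scores at least as large as the observed score.

    An add-one correction (an assumption here) avoids P-values of exactly zero.
    """
    null_sorted = sorted(null_scores)
    n = len(null_sorted)
    p_values = []
    for score in scores:
        n_at_least = n - bisect_left(null_sorted, score)  # null scores >= score
        p_values.append((n_at_least + 1) / (n + 1))
    return p_values


def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking P-values significant at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= fdr * rank / m:
            cutoff_rank = rank  # keep the largest rank passing the threshold
    significant = [False] * m
    for i in order[:cutoff_rank]:
        significant[i] = True
    return significant


# Toy example: 99 null scores (1..99) and three observed scores.
toy_null = list(range(1, 100))
toy_p = empirical_p_values([100, 50, 1], toy_null)
toy_significant = benjamini_hochberg(toy_p, fdr=0.05)
```

With real data, the null scores would come from the control windows downstream of the third exon, and the observed scores from the early coding regions.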
FIGURE 3. 5IM transcripts have sequence features associated with lower translation efficiency. (A) The 5IM classifier score was positively correlated with the propensity for mRNA structure preceding the start codon (−ΔG) (Spearman ρ = 0.39; P < 2.2 × 10^−16). For each transcript, 35 nt immediately upstream of the AUG were used to calculate −ΔG (Materials and Methods). (B) The 5IM classifier score was positively correlated with the propensity for mRNA structure near the 5′cap (−ΔG) (Spearman ρ = 0.18; P = 7.9 × 10^−130; Materials and Methods). (C) Transcripts that are translationally up-regulated in response to eIF4E overexpression (Larsson et al. 2007) (blue) were enriched for higher 5IMP scores. Light green shading indicates 5IMP scores corresponding to 5% FDR. (D) Transcripts with non-AUG start codons (blue) exhibited significantly higher 5IMP scores than transcripts with a canonical AUG start codon (yellow). (E) Higher 5IMP scores were associated with less optimal codons (as measured by the tRNA adaptation index, tAI) for the first 33 codons. For all transcripts within each 5IMP score category (blue, high; orange, low), the mean tAI was calculated at each codon position. The start codon is not shown. (F) Transcripts with lower translation efficiency were enriched for higher 5IMP scores. Transcripts with translation efficiency one standard deviation below the mean (“LOW” translation, yellow) and one standard deviation higher than the mean (“HIGH” translation, blue) were identified using ribosome profiling and RNA-seq data from human lymphoblastoid cell lines (Materials and Methods).
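The grouping in panel F can be sketched under stated assumptions. Here translation efficiency is taken as the log2 ratio of ribosome-profiling signal to RNA-seq signal, and the sample standard deviation is used; the source gives neither detail, so both are assumptions. Transcript names and counts are hypothetical.

```python
import math
from statistics import mean, stdev


def translation_efficiency_groups(ribo, rna):
    """Label transcripts LOW or HIGH when TE is > 1 SD below or above the mean.

    ribo, rna: dicts mapping transcript ID to a positive abundance value.
    TE is defined here (as an assumption) as log2(ribo / rna).
    """
    te = {t: math.log2(ribo[t] / rna[t]) for t in ribo if t in rna}
    mu = mean(te.values())
    sd = stdev(te.values())  # sample standard deviation (assumption)
    groups = {}
    for transcript, value in te.items():
        if value < mu - sd:
            groups[transcript] = "LOW"
        elif value > mu + sd:
            groups[transcript] = "HIGH"
    return groups


# Toy example: log2 TE values are -2, 0, 0, 0, and 2.
toy_ribo = {"tx_a": 1, "tx_b": 4, "tx_c": 4, "tx_d": 4, "tx_e": 16}
toy_rna = {"tx_a": 4, "tx_b": 4, "tx_c": 4, "tx_d": 4, "tx_e": 4}
toy_groups = translation_efficiency_groups(toy_ribo, toy_rna)
```

Transcripts within one standard deviation of the mean are left unlabeled, which matches the caption's focus on the two tails.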
FIGURE 4. 5IM transcripts are more likely to exhibit ER-proximal ribosome occupancy, even where there is no evidence of ER-targeting. A moving average of ER-proximal ribosome occupancy was calculated by grouping genes by 5IMP score (see Materials and Methods). We plotted the moving average of 5IMP scores for transcripts with no evidence of ER- or mitochondrial targeting (green) or for transcripts predicted to be mitochondrial (purple). We plotted a random subsample of transcripts on top of the moving average (circles).
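A moving average over genes grouped by 5IMP score might look like the sketch below. The window size, the use of a simple unweighted mean, and the sliding step of one gene are assumptions; the actual parameters are in the paper's Materials and Methods and are not reproduced here. The function name and values are hypothetical.

```python
def moving_average_by_score(records, window):
    """Sort (score, value) pairs by score and average each sliding window.

    Returns a list of (mean score, mean value) pairs, one per window position.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    ordered = sorted(records, key=lambda record: record[0])
    averaged = []
    for start in range(len(ordered) - window + 1):
        chunk = ordered[start:start + window]
        mean_score = sum(score for score, _ in chunk) / window
        mean_value = sum(value for _, value in chunk) / window
        averaged.append((mean_score, mean_value))
    return averaged


# Toy example: (5IMP-like score, ER-proximal occupancy-like value) pairs.
toy_records = [(3, 30.0), (1, 10.0), (2, 20.0), (4, 40.0)]
toy_trend = moving_average_by_score(toy_records, window=2)
```

Plotting the second element against the first for each window gives a smoothed trend line, over which individual transcripts can be overlaid as points.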
FIGURE 5. 5IM transcripts harbor noncanonical exon junction complex (EJC) binding sites. (A) Observed EJC binding sites (Singh et al. 2012) are shown for an example 5IM transcript (LAMC1). Canonical EJC binding sites (purple) are ∼24 nt upstream of an exon–intron boundary. The remaining binding sites are considered to be noncanonical (green). (B) A CG-rich sequence motif previously identified to be enriched among ncEJC binding sites in first exons (Singh et al. 2012) is shown. (C) 5IMP score for transcripts with zero, one, two, or more noncanonical EJC binding sites in the first 99 coding nucleotides reveals that transcripts with high 5IMP scores frequently harbor noncanonical EJC binding sites. (D) Transcripts with high 5IMP scores are enriched for noncanonical EJCs regardless of 5UI presence or absence.
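The canonical versus noncanonical distinction in panel A can be expressed as a simple positional rule. In this sketch, a site counts as canonical if it lies about 24 nt upstream of any exon boundary, within a tolerance. The tolerance of 5 nt and the use of transcript coordinates are assumptions for illustration; the source states only the approximate 24-nt offset.

```python
def classify_ejc_sites(site_positions, boundary_positions, offset=24, tolerance=5):
    """Label each binding site as canonical or noncanonical.

    site_positions: binding-site coordinates along the transcript.
    boundary_positions: coordinates of exon 3' ends (exon-intron boundaries).
    A site is canonical if (boundary - site) is within `tolerance` of `offset`
    for at least one boundary; `tolerance` is a hypothetical parameter.
    """
    labels = {}
    for site in site_positions:
        is_canonical = any(
            abs((boundary - site) - offset) <= tolerance
            for boundary in boundary_positions
        )
        labels[site] = "canonical" if is_canonical else "noncanonical"
    return labels


# Toy example with two exon boundaries at positions 200 and 500.
toy_labels = classify_ejc_sites([176, 60, 480, 230], [200, 500])
```

Counting the noncanonical labels that fall within the first 99 coding nucleotides would give the per-transcript tallies used to group transcripts in panel C.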
FIGURE 6. 5IM transcripts are enriched for mRNAs with early coding region m1A modifications. (A) Transcripts with m1A modifications (blue) in the first 99 coding nucleotides exhibit significant enrichment for 5IM transcripts and have higher 5IMP scores than transcripts without m1A modifications in the first 99 coding nucleotides (yellow). (B) Transcripts with m1A modifications (blue) in the 5′UTR do not display a similar enrichment.