
Ultra-deep sequencing of ribosome-associated poly-adenylated RNA in early Drosophila embryos reveals hundreds of conserved translated sORFs.

Hongmei Li¹, Chuansheng Hu², Ling Bai², Hua Li², Mingfa Li³, Xiaodong Zhao², Daniel M Czajkowsky⁴, Zhifeng Shao⁴.

Abstract

There is growing recognition that small open reading frames (sORFs) encoding peptides shorter than 100 amino acids are an important class of functional elements in the eukaryotic genome, with several already identified to play critical roles in growth, development, and disease. However, our understanding of their biological importance has been hindered owing to the significant technical challenges limiting their annotation. Here we combined ultra-deep sequencing of ribosome-associated poly-adenylated RNAs with rigorous conservation analysis to identify a comprehensive population of translated sORFs during early Drosophila embryogenesis. In total, we identify 399 sORFs, including those previously annotated but without evidence of translational capacity, those found within transcripts previously classified as non-coding, and those not previously known to be transcribed. Further, we find, for the first time, evidence for translation of many sORFs with different isoforms, suggesting their regulation is as complex as longer ORFs. Furthermore, many sORFs are found not associated with ribosomes in late-stage Drosophila S2 cells, suggesting that many of the translated sORFs may have stage-specific functions during embryogenesis. These results thus provide the first comprehensive annotation of the sORFs present during early Drosophila embryogenesis, a necessary basis for a detailed delineation of their function in embryogenesis and other biological processes.


Keywords: PhyloCSF; early Drosophila embryo; sORFs; small open reading frames; translatome


Year: 2016 PMID: 27559081 PMCID: PMC5144680 DOI: 10.1093/dnares/dsw040

Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458

1. Introduction

Now that the genomes of many organisms have been sequenced, their comprehensive annotation is required to fully understand the functional elements that are encoded therein. Although much of this annotation has focused on longer open reading frames (ORFs), there is growing appreciation that the much less studied small ORFs (sORFs), historically defined as <100 amino acids (aa) in length, may prove to be as widely significant in many biological processes as the larger class. For example, the tarsal-less (tal) gene in Drosophila, which was previously thought to be a long non-coding RNA (lncRNA), has been found to contain four conserved sORFs encoding three 11-aa and one 32-aa peptides that are required for embryonic tracheal development and leg morphogenesis. Similarly, a conserved 56-aa peptide encoded by the Toddler gene in zebrafish has been found to function as a mitogen to promote cell migration during gastrulation. However, what has primarily hindered our appreciation of the extent to which the sORFs are biologically significant is, in fact, their reliable identification, the necessary first step before any detailed functional characterization. In general, strictly bioinformatic approaches require a large number of experimentally validated ORFs to serve as a training set to enable subsequent de novo prediction, but there are presently not a sufficient number of identified sORFs for this purpose. Alternatively, ORFs can be identified based on conservation alone, but for that, reliable prediction generally requires sequences that encode peptides that are longer than 100-aa. Direct identification of translated peptides in the cell by proteomic methods is also often highly effective; however, these methods are well known to be inefficient at detecting proteins of small size. Thus, it is highly likely that many sORFs have been misclassified as lncRNAs or missed entirely from present annotation.
Several attempts have been made to identify sORFs on a genome-wide scale in model organisms such as Drosophila using more targeted bioinformatics approaches. Yet, while this work suggests that there may be thousands of sORFs translated in these organisms, there is presently a lack of experimental translational verification for the majority of the predictions. Perhaps the most successful experimental method to identify sORFs so far has been deep sequencing of ribosome-associated RNA. However, recent work has suggested that ribosome occupancy alone is insufficient to unequivocally conclude that a potential sORF is indeed translated. Instead, combining deep sequencing of ribosome-associated RNAs with bioinformatics analysis has emerged as a powerful approach to identify translated sORFs genome-wide. Of the few well characterized sORFs, a surprisingly large fraction, like tal and Toddler mentioned above, has been found to play vital roles during development. We thus speculated that there might be a large number of presently unannotated sORFs that perform critical functions during development, and that identifying the complete repertoire of translated sORFs during embryogenesis would prove both a useful strategy to identify a large set of sORFs to aid de novo sORF prediction and a necessary resource to understand this fundamental biological process. To this end, we have performed ultra-deep sequencing of ribosome-associated poly-adenylated RNAs together with conservation analyses to identify conserved translated sORFs during the first 4 hours of embryogenesis in Drosophila, the period during which control shifts from maternal- to zygotic-encoded transcripts. Overall, we have identified 399 sORFs that are translated during early embryogenesis, which substantially increases the number of verified sORFs in Drosophila.
Of these, 128 were previously predicted sORFs but lacked experimental support, 22 are located in transcripts previously classified as lncRNAs, and 45 are novel sORFs found on transcripts not previously known to be transcribed. Further, among the sORFs that were previously annotated, we provide the first evidence of translation for sORFs with multiple isoforms. We tested the translational capability of randomly selected sORFs identified here in Drosophila S2R+ cell lines using an eGFP-tagged transfection assay, and found that most (22 out of 23) were highly translated, attesting to the validity of our combined experimental and bioinformatics approach. Thus, with this work, we provide the first comprehensive annotation of sORFs during early Drosophila embryogenesis, which we anticipate will aid our understanding of early development as well as the functions of sORFs in biological processes in general.

2. Materials and methods

2.1. Drosophila embryo collections

Early (0–4 h) Canton S embryos were collected from egg laying dishes. The embryos were then dechorionated by treatment with 50% bleach for 3–5 min and washed thoroughly with phosphate-buffered saline (PBS, pH 7.4). The embryos were then transferred into Eppendorf tubes.

2.2. Ribosome material preparation

To obtain the ribosome material, the dechorionated embryos were immediately incubated with 100 µg/ml cycloheximide in PBS for 5 min on ice. The embryos were then homogenized with a plastic pellet pestle in 100 µl of a mild ribosome lysis buffer (20 mM Tris-HCl, pH 7.4, 140 mM KCl, 5 mM MgCl2, 0.5 mM DTT, 1% Triton X-100, 100 µg/ml cycloheximide, 0.5 U/ml RNasin) and incubated for 10 min on ice. The nuclei and whole cells were removed by centrifugation at 16,000 × g for 10 min at 4°C, and the lipid and other membranes were filtered out from the supernatant using a 100 µm sieve mesh. Finally, the cytosolic supernatant was loaded onto 20–50% continuous sucrose density gradients prepared with the lysis buffer (20 mM Tris-HCl, pH 7.4, 140 mM KCl, 5 mM MgCl2, 0.5 mM DTT), followed by ultracentrifugation at 35,000 rpm for 2.5 h in a SW41 rotor at 4°C (OptimaL-100 XP 2100 Ultracentrifuge, Beckman). Absorbance in each layer of the sucrose density gradient was measured at 254 nm using the Piston Gradient Fractionator (BioComp). The RNA components and the amount of ribosomes were determined based on distinct peaks in the polysome profile. Ribosome material was collected by pooling both the 80S (monosome) and the polysome peaks identified in the profile. The polysome profile of the material treated with EDTA was obtained as above, except that the cytosolic supernatant was treated with 50 mM EDTA for 5 min on ice before loading onto the sucrose gradient.

2.3. Strand-specific RNA-seq library construction

The ribosome-associated RNA was extracted by adding an equal volume of Trizol reagent (Invitrogen) to the ribosome material, followed by chloroform extraction and ethanol precipitation. The RNA concentration was quantified with a Nanodrop2000 (Thermo Scientific) and the RNA quality was assessed with an Agilent Bioanalyzer 2100 (Shanghai Biotechnology Corporation). The method to purify poly-adenylated RNA from the ribosome-associated RNA was optimized using the RiboMinus Eukaryote Kit for RNA-Seq (no. A10837-08, Ambion) to deplete ribosomal RNAs (rRNAs) and Dynabeads oligo (dT)25 (no. 61002, Life Technologies) purification to select RNAs with poly-adenylated tails (Supplementary Fig. S1). The strand-specific RNA-seq library of the ribosome-associated poly-adenylated RNA was prepared using the Illumina TruSeq Stranded mRNA Sample Preparation Kit (A10837-08, Ambion). The library was sequenced on the Illumina HiSeq 2000 (Shanghai Biotechnology Corporation) to a depth of about 70 M reads per library.

2.4. Preparation of total RNA and cytosolic RNA

Total RNA was isolated by homogenization of dechorionated embryos, followed by the RNA extraction protocol with Trizol. The total RNA was then treated with DNase I to eliminate genomic DNA contamination. Cytosolic RNA was isolated from the supernatant of the embryo extract before loading onto the sucrose gradient. The enrichment of both total RNA and cytosolic RNA for poly-adenylated RNAs was also performed using Dynabeads oligo (dT)25 purification. Both samples were sequenced in the same way as the ribosome-associated RNA above.

2.5. Transcript assembly

The ribosome-associated RNA, total RNA, and cytosolic RNA deep sequencing reads were each separately aligned using the TopHat v2.0.9 package. We used a built-in strategy of TopHat for a higher mapping rate. Raw data were first mapped to the Drosophila melanogaster transcriptome (FlyBase r5.57 release) using the ‘-G’ parameter with other parameters set as default. The read pairs that were not completely mapped were then aligned to the D. melanogaster genome (dm3) in a second run. Only the uniquely aligned and concordant read pairs were used for further analysis. We used Cufflinks v2.2.1 with default parameters, except for the ‘-G’ parameter, to assemble transcripts in the ribosome RNA, total RNA, and cytosolic RNA separately, and merged the three assembled gtf files and the FlyBase r5.57 annotation into a comprehensive transcriptome annotation, on which all further analysis was based. The raw read counts were normalized to FPKM (fragments per kilobase of exon model per million mapped fragments) for each transcript based on the FlyBase r5.57 gene models. The FPKM values were calculated using the Cufflinks v2.2.1 package with parameters set as default.
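As a rough illustration of the FPKM normalization named above, the following sketch computes the value for a single transcript from a raw fragment count. It is a simplification and not the Cufflinks implementation, which estimates abundances with a statistical model; the function name `fpkm` and the example numbers are hypothetical.

```python
def fpkm(fragment_count, transcript_length_nt, total_mapped_fragments):
    """Fragments per kilobase of exon model per million mapped fragments.

    Simplified textbook definition; all argument names are illustrative.
    """
    if transcript_length_nt <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    kilobases = transcript_length_nt / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragment_count / (kilobases * millions)


# Hypothetical example: 500 fragments on a 2 kb transcript
# in a library of 50 million mapped fragments.
example_value = fpkm(500, 2_000, 50_000_000)
```

Dividing by both transcript length and library size is what makes values comparable between short and long transcripts and between the three libraries.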

2.6. Estimation of detectable level of transcription

A machine learning algorithm described by Ramsköld et al. was first used to determine an optimal FPKM cutoff by comparing the expression levels of all annotated genes with that of randomly selected intergenic regions. The intergenic regions were at least 5 kb away from any annotated genes of FlyBase and the length distribution of the selected intergenic regions was the same as the distribution of the annotated exons to avoid a FPKM calculation bias. In this way, we identified a threshold FPKM value of 0.15 (Supplementary Fig. S2). To increase the confidence of the expression, we further required that the read counts be >20 reads. This value was determined based on an analysis of the annotated genes in our data of length between 0.2 and 0.6 kb, assuming that background reads for a gene would follow a geometric distribution. Small RNAs (shorter than 200 nt) were excluded because these were not efficiently captured and would more likely result in assembly artifacts. For translation detection, we also required that the transcripts exhibit an expression level of ≥ 0.15 FPKM and ≥ 20 reads in our ribosome-associated RNA data.
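The detection criteria in this section can be sketched as a simple filter. This is an illustrative sketch only: the `Transcript` record and its field names are hypothetical, and the read-count cutoff is written as "at least 20", which is an assumption because the text states the cutoff both as >20 and as ≥ 20.

```python
from dataclasses import dataclass

MIN_FPKM = 0.15       # FPKM threshold stated in the text
MIN_READS = 20        # read-count threshold (inclusive here by assumption)
MIN_LENGTH_NT = 200   # transcripts shorter than this are excluded as small RNAs


@dataclass
class Transcript:
    # Hypothetical record; field names are illustrative.
    name: str
    length_nt: int
    fpkm: float
    read_count: int


def is_detectable(t):
    """Return True if the transcript passes all three cutoffs."""
    return (
        t.length_nt >= MIN_LENGTH_NT
        and t.fpkm >= MIN_FPKM
        and t.read_count >= MIN_READS
    )


# Hypothetical transcripts for illustration.
candidates = [
    Transcript("tx_a", 850, 3.2, 140),
    Transcript("tx_b", 150, 9.0, 300),   # too short
    Transcript("tx_c", 600, 0.05, 25),   # FPKM below threshold
]
detected = [t.name for t in candidates if is_detectable(t)]
```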

2.7. Reverse transcription polymerase chain reaction

Total RNA or ribosome RNA was reverse transcribed into cDNAs with SuperScript III Reverse Transcriptase (Invitrogen) and oligo (dT)20VN primer. cDNAs were used to amplify the RNA targets by polymerase chain reaction (PCR) using the internal gene-specific primers and DNA Taq polymerase (no. DR100A, Takara).

2.8. Evaluation of translational evidence for annotated sORFs

We considered either of the following as evidence of translation for a given annotated sORF: (i) the peptide fragment corresponding to the sORF was present in the most recent proteomic screen (FlyBase r5.57 release); or (ii) the sORF was identified as a translated sORF in a previous ribosome profiling study in Drosophila S2 cells.

2.9. Analysis of translated potential by PhyloCSF

To identify conserved translated sORFs, we utilized two annotated datasets to estimate the PhyloCSF threshold, comparing with 11 other Drosophila species: (i) the positive dataset was the annotated sORFs in FlyBase; and (ii) the negative dataset was all the sORFs contained in annotated lincRNAs, based on the assumption that all of these sORFs are non-translated. From the histogram of the PhyloCSF score distribution in each set, we found an optimal PhyloCSF score of 50 which could discriminate annotated sORFs from untranslated sORFs (Supplementary Fig. S3). We then used the PhyloCSF package to evaluate the coding potential of ORFs contained in ribosome-associated lncRNAs and novel transcripts with parameters ‘12flies --orf=ATGStop --frames=3 --minCodons=10’, which was intended to find all in-frame ORFs of at least 10-aa. To avoid any influence of annotated ORFs, we excluded the ORFs which overlapped with annotated ORFs in the same or opposite strand. The multi-alignment file of the 12 fly species was downloaded from the Galaxy cloud tool.
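The ATG-to-stop ORF enumeration can be illustrated with a short scan over the three forward frames of a transcript sequence. This is a minimal sketch and not PhyloCSF's own code: the function name `find_orfs` is hypothetical, the input is assumed to be the sense-strand transcript sequence, and the codon count is assumed to include the start codon but not the stop codon. For each stop codon, only the most upstream in-frame ATG is kept, which gives the longest ORF for that stop, as the Results describe.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=10):
    """Return (start, end, frame, n_codons) for each ATG..stop ORF.

    start/end are 0-based, end-exclusive, and include the stop codon.
    n_codons counts the ATG but not the stop codon (an assumption).
    """
    seq = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS:
                if start is not None:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((start, i + 3, frame, n_codons))
                start = None  # reset after every stop codon
    return orfs


# Hypothetical toy transcript: 2 nt leader, ATG, ten GCT codons, TAA stop.
toy = "CC" + "ATG" + "GCT" * 10 + "TAA"
toy_orfs = find_orfs(toy)
```

ORFs that run off the end of the transcript without a stop codon are not reported in this sketch.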

2.10. Identification of embryo specific sORFs

For the annotated sORFs, we compared them directly with the translated sORFs identified in the S2 cell line. Because no such information was available for the sORFs in lncRNAs and novel transcripts, we manually examined whether at least one read in the S2 ribosome data mapped to the identified sORF regions.
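The read-presence check for sORFs in lncRNAs and novel transcripts was done manually in this study; the sketch below only illustrates the underlying interval test. The tuple layout, the half-open coordinates, and the requirement that chromosome and strand match are all assumptions made for the example.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def is_embryo_specific(sorf, s2_reads):
    """True if no S2 read overlaps the sORF region.

    sorf and each read are (chrom, strand, start, end) tuples (hypothetical layout).
    """
    chrom, strand, start, end = sorf
    for r_chrom, r_strand, r_start, r_end in s2_reads:
        if r_chrom == chrom and r_strand == strand and overlaps(
            start, end, r_start, r_end
        ):
            return False
    return True


# Hypothetical coordinates for illustration.
s2_reads = [("chr2L", "+", 1_000, 1_100), ("chr3R", "-", 5_000, 5_100)]
sorf_seen_in_s2 = ("chr2L", "+", 1_050, 1_140)
sorf_not_in_s2 = ("chr2L", "+", 2_000, 2_090)
```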

2.11. Calculation of arginine frequency of the identified sORFs

We calculated the arginine frequency by counting the number of arginines within all sORFs in each set. For the random control, we determined the expected frequency of arginine encoded by a random distribution of nucleotides. For this calculation, we used the following naturally observed frequencies of the four bases: 0.22 for uracil, 0.303 for adenine, 0.217 for cytosine, and 0.261 for guanine.
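One way to compute the random-control expectation is to sum the probabilities of the six arginine codons under independent base draws. The sketch below does this with the base frequencies quoted above. Two steps are assumptions rather than the documented procedure: the base frequencies are rescaled to sum to one, and the result is conditioned on sense (non-stop) codons. The function names are hypothetical.

```python
from itertools import product

BASE_FREQ = {"U": 0.22, "A": 0.303, "C": 0.217, "G": 0.261}
ARG_CODONS = {"CGU", "CGC", "CGA", "CGG", "AGA", "AGG"}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def expected_arginine_frequency(base_freq):
    """Expected arginine fraction among sense codons for random sequence."""
    total = sum(base_freq.values())
    p = {base: f / total for base, f in base_freq.items()}  # rescale to 1
    arg_prob = 0.0
    stop_prob = 0.0
    for b1, b2, b3 in product(p, repeat=3):
        prob = p[b1] * p[b2] * p[b3]
        codon = b1 + b2 + b3
        if codon in ARG_CODONS:
            arg_prob += prob
        elif codon in STOP_CODONS:
            stop_prob += prob
    return arg_prob / (1.0 - stop_prob)


def observed_arginine_frequency(peptides):
    """Fraction of residues that are arginine (R) across a set of peptides."""
    residues = sum(len(pep) for pep in peptides)
    arginines = sum(pep.count("R") for pep in peptides)
    return arginines / residues if residues else 0.0


expected = expected_arginine_frequency(BASE_FREQ)
```

Comparing `observed_arginine_frequency` for a set of sORF peptides against `expected` mirrors the comparison the text describes.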

2.12. eGFP-tagged sORF construction

The eGFP-tagged sORF vectors were based on the full-length cDNA of the corresponding sORFs. The full-length cDNAs were amplified by gene-specific full-length primers that introduced two different restriction enzyme digestion sites at the two ends. The cDNAs were then cloned into the pGEM T-easy vector, inserting an AvrII enzyme digestion site before the stop codon. The sequence of the eGFP coding regions (CDS) which did not contain start or stop codons of the CDS was amplified-tagged with AvrII digestion sites at the two ends. CDS sequences of eGFP were digested by AvrII and cloned into the AvrII linearized sORF vector in-frame. These eGFP-sORF sequences were excised by double restriction enzyme digestion and directionally cloned into pUAST.

2.13. Transfections and immunoblotting

S2R+ cells were grown in Schneider’s medium (no. 21720-024, Invitrogen) with 10% heat-inactivated fetal bovine serum (no. 16140-071, Gibco). S2R+ cells were transfected with the reconstructed pUAST plasmid using X-tremeGENE HP DNA Transfection Reagent (no. 06366244001, Roche). After 48 h, proteins were extracted with RIPA Buffer (no. R0278, Sigma) containing protease inhibitor (no. 04693159001, Roche). The cell extract was then run on 12% Bis-Tris gels. Immunoblots were incubated with anti-GFP (1:1,000; no. M048-3, MBL) and then the secondary antibody Alexa Fluor 680 donkey anti-mouse IgG (1:1,000; no. A10038, Life Technologies). Controls were incubated with anti-α-tubulin (1:1,000; no. PM054, MBL) and then the secondary antibody Alexa Fluor 680 goat anti-rabbit IgG (1:1,000; no. A21076, Life Technologies).

2.14. Localization of eGFP-fusion peptide

At 48 h post-transfection, the cells were fixed for 10 min with 4% paraformaldehyde and then mounted in antifade mountant with DAPI (no. P36962, Life Technologies). Images were acquired using a Nikon A1Si confocal microscope with a CFI Plan Fluor 40× objective.

3. Results and discussion

3.1. Ultra-deep sequencing identifies an exhaustive set of ribosome-associated RNAs

To globally identify translated sORFs in the 0–4 h Drosophila embryos, we performed ultra-deep sequencing of ribosome-associated poly-adenylated RNAs isolated using density gradient velocity sedimentation (see Materials and methods) (Fig. 1A). Our ribosome profile revealed the presence of ribosomal subunits (40S and 60S) as well as monosomes (80S) and polysomes (Fig. 1B). To verify the identity of this ribosome material, prior to the sucrose gradient separation, we treated the embryo extract with 50 mM EDTA, which is known to dissociate intact ribosomes into their constituent subunits. As expected, this treatment completely eliminated the peaks of the intact ribosomes in the profile, while the peaks associated with the 40S and 60S subunits increased significantly (Fig. 1C). For our analysis, we collected not only polysomes, as is typical with this approach, but also monosomes, since some short transcripts are associated only with monosomes. Further, we purified only the poly-adenylated RNAs from this ribosome material, since non-poly-adenylated RNAs, a large proportion of the transcriptome, are unlikely to contain translated ORFs. The ribosome-associated poly-adenylated RNAs were converted into cDNA libraries for strand-specific, paired-end 100 base-pair (bp) sequencing with HiSeq 2000.

Figure 1

Isolation procedure of ribosome-associated RNA. (A) Schematic procedure for the preparation of ribosome material from 0–4 h Drosophila embryos. (B) Polysome profiling of this sample enables clear resolution of the monosome and polysome fractions that were isolated for deep sequencing of the ribosome-associated RNA. (C) Validation of the identification of the monosome and polysome peaks. As expected, treating the sample with 50 mM EDTA before loading the sample onto the sucrose gradient caused the intact ribosomal peaks to disappear and those of the 40S and 60S subunits to increase.

Overall, we obtained a total of 71.2 million (M) aligned reads, of which 68.9 M (96.8%) were uniquely mapped to the Drosophila genome (Dm3), representing a ∼460-fold coverage of the Drosophila transcriptome. To enable the calculation of ribosome association efficiency (see below), we also deep sequenced the poly-adenylated RNAs from the total RNA population from the 0–4 h Drosophila embryos in a similar way as described for the ribosome-associated RNAs. We also deep sequenced the cytosolic RNA from this same embryonic sample to maximize the annotation of the translated transcripts (see ‘Materials and methods’ section). In total, 213.9 M paired-end aligned reads were obtained in the combined dataset, of which 91.7% were uniquely aligned to Dm3, nearly 10-fold higher in depth than previously obtained transcriptomic datasets of this stage. Following a procedure detailed in the Materials and methods, we finally identified 20,614 unique transcripts from 9,582 loci with high confidence in the 0–4 h Drosophila embryo (Supplementary Table S1). Comparing this with the latest release of the Drosophila transcriptome (FlyBase r5.57), we found that 18,613 transcripts (90.3%) are identical to the annotated transcripts, attesting to the high quality of our assembly.
We further validated this assembly by testing a set of 18 randomly selected novel transcripts with reverse transcription-PCR (RT-PCR), and found that 17 transcripts are indeed detected in the 0–4 h embryo (Supplementary Fig. S4 and Supplementary Table S3). From among the 20,614 unique transcripts identified in the combined dataset, we found that 17,166 from 8,803 loci are ribosome-associated (Supplementary Table S1). We directly compared the FPKM values for each transcript in the ribosome RNA data with the corresponding average intensities measured in a previous microarray study of 0–2 h embryos, and found a high degree of correlation (Spearman correlation coefficient 0.76). These ribosome-associated transcripts thus represent a large fraction of the Drosophila transcriptome (83.3%). We classified these transcripts into three categories based on the annotation in FlyBase: protein-coding transcripts including novel variant isoforms (16,576), lncRNAs that also include novel variant isoforms (349), and assembled novel transcripts (241). We validated the ribosome association of 35 randomly selected transcripts in the latter two categories with RT-PCR, and found that 34 transcripts could indeed be detected in the 0–4 h embryo (Supplementary Fig. S5 and Supplementary Table S3).

3.2. Ribosome association provides translational evidence for annotated sORFs

The 16,576 ribosome-associated protein-coding transcripts correspond to 11,092 different ORFs, including 332 sORFs (Supplementary Table S2). Of the latter, inspection of FlyBase revealed that there was prior evidence of translation for only 204 annotated sORFs (Fig. 2A). Thus, our data provides the first necessary translational evidence for 128 annotated sORFs, substantially increasing the number of sORFs in Drosophila with evidence of translation. We note that the well-studied functional sORFs in Drosophila, tal and scl, were highly enriched in our ribosome-associated fraction (Fig. 2B), lending further confidence in the quality of our data.

Figure 2

Ribosome-associated sORFs, lncRNAs and novel transcripts. (A) Among the annotated sORFs associated with ribosomes, we provide evidence of translation for 128 previously predicted Drosophila sORFs. (B) The well-studied Drosophila genes tal and scl, which were previously thought to be lincRNAs but then shown to encode functional sORFs, are well-resolved in our ribosome-associated RNA data. (C) Proportion of expressed lncRNAs that are ribosome-associated. (D) Proportion of expressed novel transcripts that are ribosome-associated. For both the lncRNAs and the novel transcripts, a large fraction of the expressed transcripts was found to be associated with ribosomes.

Much of this previous translational evidence for the annotated sORFs was obtained from Drosophila S2 cells, which are derived from late stage embryos (20–24 h). Of the 332 sORFs in our data, 148 are not translated in the S2 cell line. Though cultured cells can exhibit phenotypes different from the original cells from which they are derived, these results indicate that a substantial fraction of the sORFs might be translated specifically during defined developmental stages. This comparison also suggests that a detailed characterization of the ribosome-associated RNAs at the other stages of development might uncover evidence for the translation of other sORFs. One advantage of our experimental approach is that we obtain the identity of the full length of the ribosome-associated transcripts, enabling the detection of specific gene isoforms. Of the 332 different sORFs (corresponding to 313 genes), there were 17 genes with two different sORFs and 1 gene with three different sORFs. In addition, there were 177 genes with more than one annotated isoform and 86 genes with variant isoforms not previously described. This is, in fact, the first detection of variant isoforms of translated sORFs.
For example, with the gene CG40228, we found five isoforms with (in total) 3 different sORFs, including a variant isoform not previously characterized with a unique 5′ untranslated region (UTR) that is 19 bp upstream of all other isoforms. Variant isoforms in longer ORFs are usually associated with their highly regulated, differential translation, and so this observation of variant sORF isoforms suggests that the translational processing of the sORFs may be as complex as that governing their longer counterparts. Ribosome profiling might prove useful in this regard, since a recent study employing this method thoroughly characterized stop codon readthrough in many genes in early Drosophila embryos.

3.3. Ribosome-associated RNA sequencing identifies translated sORFs among the lncRNAs

Transcripts that do not encode peptides and lack a 100-aa ORF are usually classified as lncRNAs. Although a number of these transcripts have been found to indeed function in a non-coding capacity, it remains to be determined whether at least some of these transcripts are misclassified and actually encode sORFs. In this regard, it is interesting that we have found 349 ribosome-associated lncRNA transcripts from 264 genes, which corresponds to 76.9% of the expressed lncRNAs in these early embryos (Fig. 2C), an amount that is consistent with previous findings in different species. We reasoned that if these lncRNAs actually encode peptides, then they may be associated with ribosomes to a similar extent as established protein-coding genes. We thus evaluated the degree to which these lncRNAs and the protein-coding genes are associated with ribosomes compared with their total level of expression (their ‘ribosome association efficiency’). We found that, although these ribosome-associated lncRNAs are expressed at a much lower level than the protein-coding genes (Wilcoxon test, P-value < 2.2e-16) (Fig. 3A), their ribosome association efficiency is not significantly different from that of the protein-coding genes (Wilcoxon test, P-value > 0.05) (Fig. 3B). Thus, these apparent lncRNA transcripts are indeed associated with ribosomes to a similar extent as bona fide protein-coding transcripts.
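The ribosome association efficiency is defined in the Figure 3 legend as the ratio of the abundance of the ribosome-associated RNA to that of the total RNA. A minimal sketch of that ratio, summarized per transcript class, is given below. Using FPKM as the abundance measure, skipping transcripts with zero total abundance, and summarizing by the median are assumptions made for the example, and the data values are hypothetical. The statistical comparison (Wilcoxon test) is not reproduced here.

```python
from statistics import median


def association_efficiency(ribosome_fpkm, total_fpkm):
    """Ratio of ribosome-associated abundance to total abundance.

    Returns None when the total abundance is zero (undefined ratio).
    """
    if total_fpkm <= 0:
        return None
    return ribosome_fpkm / total_fpkm


def median_efficiency(pairs):
    """Median efficiency over (ribosome_fpkm, total_fpkm) pairs."""
    values = [association_efficiency(r, t) for r, t in pairs]
    values = [v for v in values if v is not None]
    return median(values) if values else None


# Hypothetical FPKM pairs for two transcript classes.
coding = [(40.0, 50.0), (9.0, 10.0), (120.0, 200.0)]
lncrna = [(0.8, 1.0), (0.3, 0.5), (1.8, 2.0), (0.0, 0.0)]
coding_median = median_efficiency(coding)
lncrna_median = median_efficiency(lncrna)
```

In this toy data the lncRNAs are far less abundant than the coding transcripts yet have a similar efficiency, which is the pattern the text reports.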

Figure 3

Characterization of the ribosome-associated lncRNAs and novel transcripts. (A) There was a lower abundance of the lncRNAs and novel transcripts on the ribosomes than the protein-coding RNAs (P-value < 2.2e-16 (***); P-value = 3.638e-12 (**); Wilcoxon rank-sum test). (B) Despite their lower ribosome occupation, the ribosome association efficiency of the lncRNAs and novel transcripts was not significantly different from that of the protein-coding RNAs (all P-values > 0.05 relative to the protein-coding RNA). Here, the ribosome association efficiency is defined as the ratio of the abundance of the ribosome-associated RNA to the total RNA. (C) Overall, the ORFs contained within these lncRNAs and novel transcripts are, in general, shorter than the annotated sORFs.

Although it is likely that many of these transcripts are indeed non-coding, we reasoned that at least some of these transcripts may encode sORFs and thus examined these transcripts with a bioinformatics approach for potential sORFs. In particular, we first identified ORFs with an ATG start codon and an in-frame stop codon, using the longest ORF for each stop codon. We then discarded those ORFs that encode peptides smaller than 10-aa or that overlapped annotated ORFs in the same or opposite strand. In this way, we identified 1,784 putative sORFs (median length of 20-aa) and 9 potential long-ORFs in 347 out of the 349 ribosome-associated lncRNAs (Fig. 3C). As a stricter criterion, we examined the conservation at the amino-acid level of these putative sORFs using PhyloCSF, which has been demonstrated to be a highly effective method to identify sORFs. Using annotated sORFs and lincRNAs to establish rigorous threshold values in this approach (see ‘Materials and methods’ section), we identified 28 ORFs located in 21 ribosome-associated lncRNA genes as conserved translated ORFs.
Of these, 22 are sORFs (Supplementary Table S2), including 3 that are poly-cistronic like tal and scl. We note that among these 22 novel sORFs, 64% are not found in the late-stage Drosophila S2 cell line, and thus may be specific for early embryos. The finding of a fair number of ribosome-associated lncRNAs that lack coding potential is intriguing and remains to be resolved. Based on the current understanding, one might speculate that they may play roles in RNA localization, nonsense-mediated RNA decay, or translational regulation, or that they may produce non-canonical proteins that are quickly degraded. Alternatively, these lncRNAs may encode proteins with non-AUG start codons. However, one should also not exclude the possibility that further analysis would reveal functions that are not yet known.

3.4. Ribosome-associated RNA sequencing identifies translated sORFs in novel transcripts

Of the 350 assembled novel transcripts identified in this study, 241 are associated with the ribosomes (Fig. 2D). Similar to the ribosome-associated lncRNAs, we found that these ribosome-associated novel transcripts are as tightly associated to the ribosomes as the protein-coding genes (Wilcoxon test, P-value > 0.05) (Fig. 3B), although their expression level is much lower than those known to encode for peptides (Wilcoxon test, P-value = 3.638e-12) (Fig. 3A). Examining these transcripts for potential ORFs similarly to the lncRNAs described above, we identified 2,521 putative ORFs, most of which (98.5%) were sORFs (median length of 24-aa) (Fig. 3C). Analysing these putative ORFs in terms of their conservation using PhyloCSF, we identified 66 different conserved translated ORFs contained within 32 of these novel transcripts (Supplementary Table S2). Of these, 45 are sORFs, most (87%) of which were present only in the early embryos and not in the S2 cell line.

3.5. Translational validation of identified translated sORFs

As a low frequency of arginine occurrence is a common feature of proteins, and has been used as an indicator of the translational capacity of potential ORFs, we compared the arginine usage of our identified translated ORFs with that expected from the aa frequencies associated with randomly distributed nucleotides. We indeed found that, like the annotated sORFs, the novel translated ORFs in both the previously identified lncRNAs and in the novel transcripts exhibit a much lower usage of arginine than this random distribution, consistent with the notion that these are indeed translatable sORFs (Fig. 4A).

Figure 4

Translation validation of identified sORFs. (A) The novel identified translated sORFs contained in lncRNAs and novel transcripts have a much lower arginine frequency compared with that expected from random sequences of nucleotides, similar to annotated sORFs. (B) Schematic of the transfection construct of the eGFP-tagged sORFs. (C–E) Translational validation of CG34136, CR44101 and pncr004:X. For each of the fusions, the number of reads in the ribosome-associated RNA data of the corresponding sORF is shown in the left panel, whereas a typical image from fluorescence microscopy (DAPI: DNA; eGFP: translated sORF) and the results from the Western blot are presented in the middle and right panels, respectively. In each Western blot, the upper panel is the α-tubulin control detected with an anti-α-tubulin antibody and the molecular weight standard corresponds to 49 kDa. The lower panel in each Western blot is the fused-sORF detected with anti-eGFP and the molecular weight standards refer to 38 kDa (top) and 28 kDa (bottom). The expected molecular weights of the sORFs are 36.6, 30.8, and 30.3 kDa, respectively. Scale bar: 2.5 µm.

To provide additional support for this translational capacity, we examined the ability of 23 randomly selected sORFs, including 15 annotated sORFs without evidence for translation, 4 sORFs in lncRNAs, and 4 sORFs in novel transcripts, to be translated in Drosophila S2R+ cells (Supplementary Table S3). We generated eGFP-fusion vectors that contained all of the translation-related elements of the sORF, including the 5′UTR and 3′UTR, together with the enhanced green fluorescent protein (eGFP) coding sequence (CDS) in-frame following the sORF (Fig. 4B). Thus, translation of this eGFP-tagged sORF would produce an eGFP-fusion protein, which we examined using Western blotting and fluorescence microscopy. Overall, we found that 22 of the 23 candidates were well translated (Fig. 4C–E and Supplementary Fig. S6).
Of these, 3 were clearly localized in both the nucleus and cytoplasm as observed with eGFP control (Supplementary Fig. S6), while the other 19 were mainly localized in cytoplasm (Fig. 4C–E and Supplementary Fig. S6). Of note though, the cytoplasmic localization did not appear to be the same for all of the fusions, with some enriched in a single, large subsection of the cytoplasm (Fig. 4C and Supplementary Fig. S6d, e, h, i, s), others with a more punctate distribution scattered throughout the cytoplasm (Fig. 4D and Supplementary Fig. S6c, f, g, o), and others with a more uniform distribution within the cytoplasm except for punctate locations (Fig. 4E and Supplementary Fig. S6f, l, m, n, p, q, t, u). Such a wide range of localizations is thus likely owing to the sORF-encoded peptide and not the eGFP, since the latter is present in all of the fusions (Supplementary Fig. S6b). Taken together, these results indicate that most of the identified translated sORFs could indeed be translated into peptides in vivo.

4. Conclusion

In this study, we provide the first genome-wide annotation of the translated sORF population that is present during the very early stages of Drosophila embryogenesis, thus setting the stage for detailed characterization of their functions during this fundamental biological process. The 399 sORFs identified here significantly expand the population of known sORFs in this model organism, which we anticipate will aid future bioinformatics approaches for de novo prediction of sORFs both in Drosophila and in much less well studied organisms, such as humans. Determining whether their translation is indeed as complex as that of the longer ORFs, whether they form and evolve by mechanisms distinct from their longer counterparts, and whether their spectra of biological functions are as diverse as those of the longer ORFs will be fascinating to resolve.

5. Availability

RNA-seq data have been submitted to the EMBL under the accession number E-MTAB-4571.
