RNA at 92 °C: the non-coding transcriptome of the hyperthermophilic archaeon Pyrococcus abyssi.

Claire Toffano-Nioche, Alban Ott, Estelle Crozat, An N Nguyen, Matthias Zytnicki, Fabrice Leclerc, Patrick Forterre, Philippe Bouloc, Daniel Gautheret.

Abstract

The non-coding transcriptome of the hyperthermophilic archaeon Pyrococcus abyssi is investigated using the RNA-seq technology. A dedicated computational pipeline analyzes RNA-seq reads and prior genome annotation to identify small RNAs, untranslated regions of mRNAs, and cis-encoded antisense transcripts. Unlike other archaea, such as Sulfolobus and Halobacteriales, P. abyssi produces few leaderless mRNA transcripts. Antisense transcription is widespread (215 transcripts) and targets protein-coding genes that are less conserved than average genes. We identify at least three novel H/ACA-like guide RNAs among the newly characterized non-coding RNAs. Long 5' UTRs in mRNAs of ribosomal proteins and amino-acid biosynthesis genes strongly suggest the presence of cis-regulatory leaders in these mRNAs. We selected a high-interest subset of non-coding RNAs based on their strong promoters, high GC-content, phylogenetic conservation, or abundance. Some of the novel small RNAs and long 5' UTRs display high GC contents, suggesting unknown structural RNA functions. However, we were surprised to observe that most of the high-interest RNAs are AU-rich, which suggests an absence of stable secondary structure in the high-temperature environment of P. abyssi. Yet, these transcripts display other hallmarks of functionality, such as high expression or high conservation, which leads us to consider possible RNA functions that do not require extensive secondary structure.

Keywords: archaea; hyperthermophile; non-coding RNA; transcriptome

Year: 2013; PMID: 23884177; PMCID: PMC3849170; DOI: 10.4161/rna.25567

Source DB: PubMed; Journal: RNA Biol; ISSN: 1547-6286; Impact factor: 4.652

Introduction

Widespread or “pervasive” transcription of genomic regions outside protein-coding genes is now well established in a wide range of eukaryotic species. The case for pervasive transcription in bacterial and archaeal genomes is not as clear, since these genomes are compact with short intergenic regions that are often part of transcribed operons. Yet, several recent studies used deep sequencing or tiling arrays to evaluate non-coding transcription in a variety of bacterial species and identified large numbers of small RNAs, antisense transcripts, and UTR extensions of protein-coding genes (reviewed in ref. 1). Although the term “pervasive transcription” is not often associated with non-eukaryotic organisms, it turns out that a large part of the non-coding regions in bacteria is covered by regulatory RNAs or UTR extensions, and that many, if not all, bacterial coding genes produce antisense transcripts. Screens for non-coding RNAs (ncRNAs) in archaea are not as developed as in the other domains of life. Early attempts at archaeal Rnomics involved cDNA cloning and computational screens, possibly combined with northern blot or PCR validation. The classes of RNAs identified by these archaeal screens differed notably from bacterial RNAs and were dominated by the modification guides H/ACA and C/D box RNAs. These approaches are progressively being superseded by high-throughput sequencing technologies enabling deep sequencing of total or size-selected RNA. In addition to new H/ACA and C/D box RNAs, deep sequencing identified a number of new CRISPR (a defense system present in bacteria and archaea) RNA loci, as well as widespread antisense transcription of coding and non-coding loci. Such “cis-encoded antisenses” are clearly a significant part of the non-coding transcriptome, as they were already visible using low-throughput Rnomics. Other recently identified classes of archaeal RNAs include circular RNAs, split RNA genes, and a bacterial-like trans-acting small RNA.
Among the archaeal species sampled by Rnomics studies are a number of hyperthermophiles (growth temperatures higher than 80 °C): Pyrococcus abyssi, Nanoarchaeum equitans, Sulfolobus solfataricus (refs. 2–3, 5–6, and 10) and members of the Pyrobaculum genus. Counter-intuitively, organisms living at high temperature do not necessarily have GC-rich genomes. Indeed, topological constraints on circular DNA molecules may enforce double strand formation up to over 100 °C independently of GC-content. Structured RNAs, such as tRNAs and rRNAs, tend to have very high GC-content in hyperthermophiles, possibly because they do not have such constraints. Computational biologists exploited this discrepancy to develop ncRNA detection programs based on local GC-enrichment in otherwise AT-rich genomes. These algorithms were successful in identifying a number of structured RNAs, but the question of non-coding RNA genes in hyperthermophiles has been set aside since then. Currently, it is not known whether the multiple non-coding RNAs identified by deep sequencing adopt secondary structures like rRNA and tRNA do. Here, we sequenced the complete transcriptome of P. abyssi grown at 92 °C and reconstructed the set of coding and non-coding transcripts. Taking advantage of a higher coverage than in previous P. abyssi transcriptome screens, we confirmed a set of short or extended ORFs and estimated the number of novel non-coding elements in each major ncRNA category: independent transcripts, cis-encoded antisenses, and UTR extensions. We analyzed the properties of novel RNA candidates in each category and identified several new functional RNAs of the H/ACA box class, and possibly of the cis-regulatory RNA class. Finally, we analyzed the GC content and abundance of new ncRNA elements and showed how they differ from ncRNAs identified in previous experimental and computational screens.
This analysis revealed that a large fraction of the newly identified ncRNAs do not adopt stable secondary structures although they harbor other hallmarks of functional RNAs such as the presence of strong promoters, high expression levels, or phylogenetic conservation.

Results

RNA deep sequencing and transcript classification

We collected RNA from a P. abyssi culture grown at 92 °C, sampled at successive growth stages up to stationary phase, and submitted the RNA to directional RNA-seq library preparation. Illumina sequencing produced 51 million single reads of length 40 nt, of which 5.6 million mapped to unique, non-rRNA loci. The P. abyssi genome is very dense and comprises a large number of genes that overlap at their UTRs. This context makes it difficult to distinguish new independent transcripts from the 5′ and 3′ extensions of previously annotated genes. We developed a protocol that clusters RNA-seq reads in a strand-specific fashion and detects overlaps with previous annotations in order to merge clusters into transcription units (TUs). Our basis for previous annotation was the NCBI annotation updated by that of Gao et al., who identified 115 extra ORFs. TUs overlapping previous annotation were split into 5′ UTR, 3′ UTR, CDS, and operon spacers. New TUs were analyzed for their coding potential and classified either as novel protein-coding genes (CDS) or independent ncRNAs, here called small RNAs (sRNA). Small RNAs overlapping TUs on the opposite strand were reclassified as cis-encoded antisense RNAs (asRNAs). 5′ UTRs were further classified as “long 5′ UTRs” when longer than 50 nt.
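The strand-specific clustering step described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `cluster_reads`, the `Cluster` record, and the `max_gap` parameter are hypothetical, and reads are assumed to be given as (strand, start, end) tuples with 1-based inclusive coordinates. Overlapping or adjacent reads on the same strand are merged into one cluster; reads on opposite strands are never merged.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Cluster:
    strand: str
    start: int
    end: int
    n_reads: int


def cluster_reads(reads: Iterable[Tuple[str, int, int]], max_gap: int = 0) -> List[Cluster]:
    """Merge mapped reads into clusters, one strand at a time.

    Two reads join the same cluster when the gap between them is at most
    `max_gap` nucleotides (0 means overlapping or directly adjacent).
    """
    reads = list(reads)
    clusters: List[Cluster] = []
    for strand in ("+", "-"):
        current = None
        # Sorting by start position lets us grow clusters left to right.
        for _, start, end in sorted(r for r in reads if r[0] == strand):
            if current is not None and start <= current.end + max_gap + 1:
                current.end = max(current.end, end)
                current.n_reads += 1
            else:
                current = Cluster(strand, start, end, 1)
                clusters.append(current)
    return clusters


example = [("+", 1, 40), ("+", 30, 69), ("+", 200, 239), ("-", 35, 74)]
for c in cluster_reads(example):
    print(c)
```

In this toy example the two overlapping plus-strand reads form a single cluster, while the minus-strand read covering the same region stays separate, which is the property that allows antisense transcripts to be detected.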

Promoters and UTR regions

Consensus sequence motifs were extracted from the upstream region of annotated CDSs (Fig. 1A) and the region upstream of RNA-seq-based TUs (Fig. 1B). The Pyrococcus consensus promoter is composed of two boxes often referred to in archaea as BRE element and TATA box. Here, the TATA box sequence is TTT(A/T)(T/A)AA, similar to that of Methanococcus vannielii (TTTATAATA), and Sulfolobus solfataricus (TTTTAAA). RNA-seq-based transcription start sites (TSS) are located on average 20 nt downstream of the 3′ end of the TATA box (Fig. 1B). The fraction of transcription units featuring both a TATA and a BRE box (68%) is higher than the fraction of annotated CDS in this case (53%), indicating that our annotation of TUs improves previous CDS annotation.
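As an illustration only, the TATA box consensus TTT(A/T)(T/A)AA quoted above can be written as a regular expression. This is a literal reading of the consensus string, not a substitute for the MEME position-specific motifs used in this study; the helper name `find_tata_boxes` is hypothetical.

```python
import re

# Literal translation of the consensus TTT(A/T)(T/A)AA into a character-class regex.
TATA_BOX = re.compile(r"TTT[AT][TA]AA")


def find_tata_boxes(upstream: str) -> list:
    """Return 0-based start positions of non-overlapping consensus matches."""
    return [m.start() for m in TATA_BOX.finditer(upstream.upper())]


# The two TATA box sequences cited in the text both satisfy the consensus.
print(find_tata_boxes("GGTTTATAATAGG"))  # M. vannielii-like box embedded in flanks
print(find_tata_boxes("CCTTTTAAACC"))    # S. solfataricus-like box embedded in flanks
```

A strict regex such as this one misses degenerate promoters, which is one reason why a probabilistic motif model is the more appropriate tool for genome-wide scans.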

Figure 1. Consensus promoter motifs. Each frame shows the best ranking sequence motif identified by a MEME search performed on the 50 nt upstream region of: (A) CDS annotations; (B) RNA-seq-based transcription units; (C) long 5′ UTRs; (D) sRNAs; (E) asRNAs. Number of occurrences, MEME P value and number of sites are given for each motif. Motif coordinates are numbered from the first base of the RNA-seq transcription unit (B‒D) or from the ATG start codon (A) and correspond to the dominant motif location. No dominant location was found for the asRNA motif (E).

Few leaderless transcripts are present in P. abyssi. Previous observations in Sulfolobus and Halobacteriales established a majority of leaderless transcripts in these species (> 69% in Sulfolobus, > 90% in Halobacteriales), whereas the methanogen Methanosarcina mazei revealed a majority of long 5′ UTRs (up to 500 nt), unveiling a variety of situations in Archaea. The median size of the P. abyssi 5′ UTRs is 37 nt (). We predict 292 5′ UTRs with a size under 20 nt (), but most can be attributed to imperfect RNA-seq coverage rather than to a true lack of leader sequence. Indeed, an analysis of sequences upstream of the predicted TSS reveals an RBS motif in 180 of the 207 UTRs with a size of 6‒16 nt, leaving only 27 transcripts (13%) that may lack a proper leader (). Independently of the predicted UTR size, the distance between the 3′ end of the TATA box and the RBS is constant () and suggests an actual size of these short 5′ UTRs of about 20 nt.

The promoters of long 5′ UTRs and independent non-coding RNAs (sRNAs) display consensus motifs similar to that of other promoters (Fig. 1C and D). However, the fraction of sRNAs with a major promoter (27/107 = 25%) is significantly lower than for protein coding genes (68%), suggesting a relatively lower accuracy of TSS definition in our sRNA annotation. Antisense RNAs (asRNAs) are not preceded by canonical promoters.
Instead, a small proportion (36/215) are flanked by a motif (Fig. 1E), which does not correspond to a known regulatory sequence or its reverse complement. The fact the majority of asRNAs are not flanked by any visible promoter sequence suggests a large fraction of these RNAs result either from leaky transcription or processing of longer transcripts. Our sequencing coverage was sufficient to identify 3′ UTRs of size 10 nt or more for 446 genes. Considering all genes with a detectable 3′ UTR, the median size of predicted 3′ UTRs in Pyrococcus is 37 nt (). However, note that single-end RNA-seq protocols like the one we used here do not sequence the 3′ side of cDNA library fragments, thus actual 3′ UTRs are probably longer.

Overall portrait of non-coding elements

Table 1 presents the number of RNA-seq reads associated with each class of ncRNA element. The most populated classes of ncRNAs are asRNAs (215), followed by sRNAs (107) and long 5′ UTRs (98). A substantial fraction of the sRNAs and long 5′ UTRs (respectively, 37% and 45%) were already identified in previous studies and/or annotated as encoding ncRNA elements in databases. However, the vast majority of asRNAs (97%) were previously unreported (Fig. 2A). Non-coding RNAs are generally less conserved than coding genes at the nucleotide sequence level but more conserved than intergenic regions. Between 47% and 57% of the non-coding elements are specific to P. abyssi, compared with 28% for coding sequences and 76% for intergenic regions (Fig. 2B).

Table 1. RNA abundance by class

RNA class	Number of elements	Total number of reads	Median RPKM
rRNA	5	32,727,463	14,818
tRNA	46	33,414	264
Long 5′ UTR	98	172,014	185
sRNA	107	84,797	63
Antisense RNA	215	13,253	43
CDS	1,893	5,261,317	164

Figure 2. Characteristics of ncRNAs identified by RNA-seq. (A) Numbers of known RNAs vs. novel RNAs. Sources for known ncRNAs: NCBI features, RFAM, and studies from Klein et al. and Phok et al. Classes are ranked in order of decreasing confidence starting from experimentally validated RNAs (from left to right on the bar plot). When the known ncRNA is from P. abyssi, a minimum overlap of 25 nt between the known RNA and the new candidate is required to assign the candidate to a class. When the known ncRNA is from another archaeal species, a BLASTN sequence conservation with a minimum BLASTN bit score of 42 is required to assign the candidate to a class. When a candidate appears in several classes, it is counted only in the class with highest confidence. (B) Conservation of ncRNA classes. At each taxonomic level, the histogram shows the fraction of elements conserved up to, and not deeper than, this taxonomic level (see Materials and Methods). Elements shown include the three ncRNA classes (107 sRNAs, 98 long 5′ UTRs, 215 asRNAs), CDS, antisense-associated CDS (CDS-asRNA), and intergenic region. To avoid conservation bias due to size differences, conservation for CDS, CDS-asRNA, and intergenic region were obtained on randomly sampled fragments from CDS, CDS-asRNA, and intergenic regions with the same size distribution as ncRNA, asRNA, and ncRNA, respectively.

We measured RPKM (reads per kb per million) as an approximation of RNA abundance (Table 1). Since we used in our library preparation an exonuclease degrading 5′-monophosphorylated RNA species (Materials and Methods), we expect a depletion of rRNAs and tRNAs, as well as of any processed RNAs, including some sRNAs or asRNAs. Keeping this limitation in mind, the most abundant non-coding elements after rRNAs and tRNAs are long 5′ UTRs (median: 185 RPKM), followed by sRNAs (63 RPKM) and asRNAs (43 RPKM).
The relatively high abundance of long 5′ UTRs only reflects that of coding transcripts in general, since the median RPKM in coding regions is 164 RPKM (Table 1). Analysis of the new transcription units revealed 26 new or extended protein coding sequences: nine in sRNAs and 18 in long 5′ UTRs (). The coverage of intergenic regions by RNA-seq reads enabled us to combine a number of annotated genes into multicistronic transcription units (). Overall, 1,466 out of 1,944 annotated genes (75%) are part of multicistronic transcripts and are thus likely expressed as operons. Conversely, 444 out of 888 extended transcription units involve two or more CDSs or ncRNA genes. The longest polycistronic transcripts (28 CDS and 20 CDS) are ones encoding ribosomal proteins.
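For readers unfamiliar with the RPKM measure used in Table 1, the following minimal sketch shows the standard definition (reads per kilobase of transcript per million mapped reads). The function names are hypothetical, and which read total serves as the denominator (for example all mapped reads or only uniquely mapped non-rRNA reads) is an assumption left to the caller, since it is not specified here.

```python
from statistics import median


def rpkm(read_count: int, length_nt: int, total_mapped_reads: int) -> float:
    """Reads per kilobase per million mapped reads.

    RPKM = reads / (length in kb * total mapped reads in millions)
         = reads * 1e9 / (length in nt * total mapped reads)
    """
    if length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return read_count * 1e9 / (length_nt * total_mapped_reads)


def median_rpkm(elements, total_mapped_reads: int) -> float:
    """Median RPKM over (read_count, length_nt) pairs of one RNA class."""
    return median(rpkm(n, length, total_mapped_reads) for n, length in elements)


# Toy numbers, not data from this study.
print(rpkm(500, 1000, 1_000_000))
print(median_rpkm([(100, 200), (50, 500), (10, 100)], 5_000_000))
```

Because RPKM normalizes by length, a short sRNA and a long operon with the same read count receive very different values, which is why class medians rather than raw read totals are compared in the text.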

Widespread cis-antisense transcription includes known functional RNAs

Although we used a strict definition of antisense, which we require to be fully embedded in the opposite strand of a transcription unit, our asRNA list still includes four C/D box guide RNAs. Indeed, one of the longest and most expressed asRNAs (PabO115, size 277 nt: ) matches a C/D box RNA (RFAM family RF01130) and sits right against the center of a five-gene operon (). Therefore, in the compact P. abyssi genome, functional trans-acting ncRNAs can be produced from fully antisense transcripts. Conservation analysis does not support a predominance of trans-acting asRNAs. Indeed, such asRNAs would sustain a dual selection pressure, first due to their trans-acting role and second due to the coding sequence on the opposite strand. However, asRNAs are significantly less conserved than average protein coding genes (see “CDS,” Fig. 2B), which argues against widespread trans-acting functions. The conservation level of asRNAs is similar to that of other ncRNAs (5′ UTRs and sRNAs). This relatively low conservation also applies to other regions of genes harboring asRNA (see CDS-asRNA in Fig. 2B). Moreover, genes harboring asRNA are less expressed than average CDS (). Altogether, this shows that asRNAs generally occur in a class of genes that are less conserved and less expressed than average protein-coding genes. Overall, the antisense category is the one with the fewest previous annotations (Fig. 2A), which is unsurprising since previous screens mostly relied on sequence conservation or GC-content and systematically excluded coding sequences and their antisense strands. The number of actual functional RNAs among the 215 identified antisense elements remains to be determined. We selected 71 “high interest” candidate functional asRNAs based on their high abundance, large size, strong promoter, or high GC-content (noted PabO100 to PabO170 in ). It should be noted that the only four asRNAs with a known function rank among the top six asRNAs ranked by expression level.
This suggests RPKM is a good criterion for selecting putative functional asRNAs.

New sRNA elements are mostly unstructured

Our annotation workflow identified 107 putative sRNAs (). We requalified nine of them as ORFs. Eighteen others were previously annotated in RFAM or in prior publications as CRISPR, C/D box, or H/ACA box RNAs, while 13 were identified in previous screens, but had no assigned function. As expected, sRNAs with high GC-contents, high conservation, or high expression were more likely to be reported in previous studies. Among 27 sRNAs meeting at least two out of these three criteria, only five are novel to this study (PabO1-5, ). Of 67 sRNAs identified here for the first time, 47 can be considered as “high interest” based on any of the criteria: high expression, high conservation, strong promoter, or high-GC (noted PabO01 to PabO47, ). Of note, 16 sRNAs have unusually long sizes of over 200 nt, which includes four CRISPR RNAs and nine high interest candidates that were not previously reported. Furthermore, two of the high interest sRNAs (PabO9, PabO10) are arranged in reciprocal antisense orientation (), reminiscent of a bacterial toxin/antitoxin system. These antisense sRNAs were not classified as asRNAs by our annotation pipeline as they overlap only partially.

A high GC-content is a hallmark of structured RNA in hyperthermophiles. Prior RNA detection screens based on GC-content mostly identified RNAs in the range 50–75% GC. Here a threshold of 50% GC would retain only 15% of sRNAs. Nine of the novel RNAs are above this threshold and may thus constitute new instances of archaeal structured RNAs. Most of these new high-GC sRNAs are low-expression transcripts, which explains why they were not previously detected. Figure 3A shows a plot of abundances vs. GC-contents for all sRNAs, identified by their functional status. This confirms the general tendency for novel sRNAs to harbor low GC% and low abundance. Notable exceptions are PabO5 expressed at 293 RPKM and PabO17 at 155 RPKM. Three novel high-GC RNAs (PabO5, PabO43, PabO2) were computationally predicted by Phok et al. but could not be confirmed experimentally.
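The GC-content screen discussed above reduces to a simple computation, sketched here under stated assumptions: the function names are hypothetical, ambiguous bases (such as N) are ignored in the denominator, and the threshold is treated as strict (GC above 50%), in line with the definition of high-GC given in Materials and Methods.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C among unambiguous bases (A, C, G, T, U)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGTU")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def is_high_gc(seq: str, threshold: float = 50.0) -> bool:
    """True when the GC percentage is strictly above the threshold."""
    return gc_percent(seq) > threshold


# Toy sequences, not transcripts from this study.
print(gc_percent("GGCCAATT"))
print(is_high_gc("GGGCCCAT"))
```

Applying such a filter to every sRNA sequence would separate the minority of candidate structured RNAs from the AU-rich majority described in this section.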

Figure 3. (A) Distribution of 107 sRNAs as a function of GC% and RPKM. Unknown RNAs are dark blue, known RNAs are colored as in legend. (B) GC-contents of transcript classes. Sequences were extracted following annotations (CDS, tRNA, rRNA, snoRNA) or RNA-seq analysis (sRNA and long 5′ UTR).

If we compare the GC content distribution of new sRNAs with that of other transcript classes, the AU-richness of new sRNAs appears even more striking (Fig. 3B). Novel sRNAs peak at around 40–45% GC, against 70% for tRNAs and rRNAs, and 55% for C/D and H/ACA RNAs. Even coding regions have a GC content peak (50%) that is significantly higher than that of new sRNAs. This extreme AU-richness is a strong indication that most of the novel sRNAs identified in this study are generally unstructured. Importantly, however, AU-richness does not imply absence of function: for instance, CRISPR RNAs are generally AU-rich (Fig. 2A). A strong indication of functionality in the P. abyssi AU-rich sRNAs is the fact that 41% of sRNAs identified by RNA-seq are conserved in Thermococcales or deeper in the evolutionary tree (Fig. 2B), which is much higher than the 15% GC-rich fraction. Indeed, 16 AU-rich sRNAs (< 45% GC) are deeply conserved and 41 AU-rich sRNAs are considered “high quality” by the above criteria ().

New long 5′ UTR elements suggest cis-regulated mRNAs

Our workflow identified 98 5′ UTRs of 50 nt or more. Among these, 19 most likely include new ORFs that constitute extensions of previously annotated CDSs (). As for sRNAs, most of the “low hanging fruits” in the long 5′ UTR category had been picked up by previous screens: among 26 long 5′ UTRs meeting at least two criteria among high GC, high conservation, and high expression, only seven are novel to this study or were only known as computational predictions (PabO60–66). Thirteen of the long 5′ UTRs overlap RFAM entries for C/D or H/ACA guide RNAs. Location in 5′ UTR is a frequent feature of archaeal guide RNAs, which are known to often overlap coding regions. In such cases, it is difficult to infer from the RNA-seq data alone whether the ncRNA is transcribed independently or processed from the mRNA leader. Both situations have been reported. At the time we submit this manuscript, there is still no experimentally demonstrated riboswitch or other cis-regulatory RNA in archaea. Here, however, we can pinpoint several strong candidates. The crcB RNA, known as the fluoride riboswitch, is similar to one of our highly expressed long 5′ UTRs. This ncRNA was predicted (and not confirmed) by the Klein et al. and Phok et al. screens, and the presence of the riboswitch in archaea was inferred by comparative genomics. Here, we confirm its expression in the form of a 176 nt leader. Another long leader that was validated in previous screens is proposed to be a pre-Q1 riboswitch by Phok et al. (under the name RNA28) and confirmed in our analysis (). Besides previously proposed riboswitches and guide RNAs, we confirm the expression of Ssca, an archaeal RNA of unknown function already identified by Klein and Phok and described as RFAM family RF00063.
Several long 5′ UTRs are located upstream of genes that are known to be cis-regulated in bacteria, such as ribosomal protein genes (rps15p, rpl15e, rpl21e, rps6e, rpl7ae), and amino acid metabolism pathway genes (aminotransferase, N-terminal protease, leu-, val-, and trp-aminoacyl tRNA synthetase). The presence of regulatory leader sequences upstream of ribosomal protein genes was suggested in S. solfataricus. Here, we observe that such cis-regulatory leaders may also involve amino acid biosynthesis genes. Few of the novel long 5′ UTRs are GC-rich (Fig. 3B; ), suggesting the absence of extensive secondary structure such as that found, for instance, in the T-boxes of bacterial aminoacyl tRNA synthetases. Most of the high-GC leaders are already annotated as C/D or H/ACA box RNAs. Only five new leaders of unknown function have a GC content above 50%. One of them, PabO66, is 71% GC and is located upstream of an uncharacterized CDS.

Novel H/ACA RNAs

H/ACA RNAs are well-characterized small non-coding RNAs present both in eukaryotes and archaea. They are involved in RNA-guided modifications where a specific U residue, often in rRNA, is converted into a pseudouridine, a common base modification in the ribosome. Other RNA targets like tRNAs also exhibit specific pseudo-uridylations, some of which can be directed by the same RNA-guided machinery. In Pyrococcus genomes, seven H/ACA sRNAs have been identified by an in silico approach and validated experimentally. Those H/ACA RNAs are annotated in the Pyrococcus genomes and can also be retrieved from their RFAM identifier (Pab19, Pab21-snoR9: RF00065, Pab160, Pab35-HgcF: RF00058, Pab40-HgcG: RF00064, Pab91, Pab105-HgcE: RF00060). They were initially identified using a structure descriptor corresponding to a helix-loop-helix motif where an internal loop of variable length (from five to 11 residues on both 5′ and 3′ ends) connects two stems: a basal stem and an apical stem including at least seven and five base pairs, respectively. Another essential feature in the descriptor is the presence of a terminal loop including an embedded K-turn or K-loop sub-motif, which is required to recruit one of the proteins (L7Ae) necessary for the H/ACA sRNP assembly. A new family of H/ACA-related motifs was recently discovered in Pyrobaculum genomes. These non-canonical H/ACA sRNAs differ from their canonical relatives found in P. abyssi in that the first stem is absent. However, the truncated motifs still maintain the second stem and a K-turn motif, two structural features which are sufficient to preserve the function of these H/ACA-like motifs as RNA guides. The 5′ and 3′ ends of the H/ACA-like motifs remain unpaired and may be used as antisense elements for the RNA-RNA interaction with their target(s). Although the most abundant motifs in Pyrobaculum are H/ACA-like motifs, both H/ACA and H/ACA-like motifs coexist and are functional.
Nevertheless, the H/ACA motifs in Pyrobaculum are slightly different from those in Pyrococcus, with a basal stem shortened by one or two base pairs (five or six base pairs instead of seven or eight) that may also include some bulge residue. We compared candidate ncRNAs from our RNA-seq analysis with the results of a genome-wide search for novel H/ACA RNAs containing H/ACA or H/ACA-like motifs similar to those found in Pyrobaculum (see Materials and Methods). Four new H/ACA RNA candidates were identified (shown with their possible targets in ). Figure 4 presents the predicted secondary structure of the three most reliable candidates (PabO1, PabO48, and PabO78), based on the criteria mentioned above (high GC content, high conservation, high abundance) in a 2D structure representation including the H/ACA(-like) motif and its possible targets. These findings suggest H/ACA-like RNAs are not specific to crenarchaea, and may also occur in euryarchaea.

Figure 4. New H/ACA gene candidates (PabO1, PabO48, PabO78) corresponding to H/ACA or H/ACA-like motifs identified in RNA-seq ncRNAs. The 2D structure representations were generated using VARNA version 3.9 (Darty et al., 2009). The guide RNA candidates include the annotation of the K-turn or K-loop motif (green). The RNA targets (red) are represented as paired to the guide RNA in the internal loop of the H/ACA motif or in the free ends of the H/ACA-like motif. The coding strand is indicated for all targets by a (+) or (-) symbol. Targets located on the non-coding strand with respect to the CDS are noted “strand (-).” The CRISPR 2 (+) target follows the nomenclature adopted by Phok et al. All targets shown are expressed to some degree in the genome.

Discussion

In hyperthermophilic bacteria and archaea, structured RNAs such as tRNA, rRNA, or modification guide RNA have seldom been found with a GC content below 50% (this study and refs. 13 and 14; some C/D box RNAs with very short hairpin structures may constitute rare exceptions). It is tempting to generalize this observation and propose that no low-GC transcript can ever fold into a stable secondary structure at extreme high temperature. Admittedly, factors other than a high GC content can contribute to RNA stabilization in hyperthermophiles. These include nucleotide modifications, the presence of polyamines, and high salt concentration, which was shown to stabilize DNA and RNA. Most likely, however, RNA molecules that require secondary structure to operate at high temperature should be, just like rRNAs and tRNAs, under selection for GC-rich helices in addition to any extra stabilizing factor. Consequently, we can reasonably assume an absence of secondary structure over most of the AU-rich sRNAs and mRNA leaders. However, an absence of secondary structure does not mean these RNAs are not functional. Indeed, there are several strong indications of functionality in a large fraction of the AU-rich ncRNAs, including high expression level, the presence of strong promoters and, importantly, sequence conservation. Then what can these functions be? Small RNAs exert regulatory functions by targeting mRNAs and triggering RNA degradation or translation block. Cases have recently emerged showing that such regulatory RNA activity exists in archaea. Of course, RNA-RNA interactions would require at least a short GC-rich stretch for efficient targeting at high temperature, but this may not affect the overall GC-content. Alternatively, AU-rich RNAs may act by recruiting proteins through sequence-based interactions or short local structures. Again, a short local GC-rich hairpin may form in an otherwise AU-rich RNA.
This type of behavior would be reminiscent of the function of certain lncRNAs in eukaryotes, which act mainly by recruiting proteins although they lack extensive sequence or structure signals.

Pervasive transcription was first described in eukaryotic organisms and later extended to bacterial cells. It is now clear that archaeal genomes also produce transcripts over most of their intergenic regions and antisense to a large number of genes. Klein et al. observed that structured ncRNAs were rarer in the archaea P. furiosus and M. jannaschii than in bacterial genomes. They hypothesized that the use of ncRNA could be selected against in high temperature environments, or that organisms with a reduced gene set such as many archaea did not require sophisticated RNA-based regulation. However, later transcriptome analysis of the reduced bacterial genome of Mycoplasma pneumoniae revealed over a hundred RNAs with potential regulatory functions, most of them antisense with respect to protein-coding genes. Hence compact genomes retain extensive RNA-based regulation. It is important though to dissociate RNA functionality from the existence of stable secondary structures. The emerging classes of eukaryotic regulatory RNAs such as endogenous siRNAs, lncRNAs, or XUTs all seem to bind their RNA or protein targets without help from extensive secondary structure. Archaea have been invaluable models for studying various aspects of eukaryotic cell functions. It might turn out that RNA-based regulation is another key feature that may benefit from the archaeal model.

Materials and Methods

Strain growth

Pyrococcus abyssi GE5 was grown anaerobically at 92 °C in ASW-YT rich medium, containing artificial seawater, yeast extract and tryptone. After two overnight transfers, cells were diluted to 2 × 10⁶ cells/mL. After an initial growth of 3 h, samples (10 mL) were taken every 1.5 h until cells reached the stationary phase (11 h of growth). Samples were quickly cooled down in liquid nitrogen, spun, and the pellets were frozen at -80 °C.

RNA extraction and sequencing

Total RNAs were extracted using the TRIzol method (Invitrogen), following the manufacturer's instructions. RNA quality was monitored by agarose gel electrophoresis and concentration was measured using the NanoDrop 1000 Spectrophotometer (Thermo Fisher Scientific Inc.). A pool of equal amounts of each sample was checked for integrity by 2100 Bioanalyzer (Agilent Technologies Inc.) and treated to enrich in primary transcripts using a 5′-phosphate-dependent exonuclease (Terminator, Epicentre), following the manufacturer’s instructions, except that the amount of enzyme used in the reaction was 1 U per microgram of enriched RNAs and incubation was done at 30 °C for 1.5 h. Total RNA composition before and after treatment are shown in . RNA samples were subsequently purified and concentrated using the RNA Clean-Up and Concentration Kit (Norgen Biotek Corp) before library preparation as described elsewhere in detail. Briefly, the strand-specific RNA-seq template library was prepared starting from a pool of total samples (50 ng) following the directional mRNA-seq library preparation protocol provided by Illumina Inc. The library was sequenced (40 bp single-read) using an Illumina GA-IIx sequencer.

ncRNA classification into long 5′ UTR, sRNA, asRNA

Oriented RNA-seq reads were mapped to the P. abyssi GE5 genome sequence using the Bowtie program. Uniquely mapping reads were then processed using the DetR’prok ncRNA annotation pipeline, available in the Galaxy Tool Shed (http://toolshed.g2.bx.psu.edu). Briefly, this workflow clusters mapped reads, compares clusters with previous annotation and generates extended annotations including 5′ and 3′ non-coding regions. Furthermore, extended annotations merge tandem protein coding, tRNA, and rRNA genes separated by intergenic regions shorter than 25 nt or fully covered by RNA-seq reads. The workflow then classifies non-coding fragments of extended annotations into long 5′ UTRs, sRNAs, and asRNAs (). Antisense RNAs must be fully included in an extended annotation on the opposite strand. We used the P. abyssi genome annotation corrected by Gao et al. The ORF search was conducted using the EasyGene software, which uses an HMM to score putative coding sequences based on codon statistics. The model is trained over a set of genes that is automatically extracted based on similarity with known genes.

Naming convention: we provide a name in the form PabOx (P. abyssi Orsay x) to any novel “high interest” RNA that was not experimentally confirmed in prior publications. We define as high interest any RNA meeting at least one criterion among the following. For sRNAs: high-GC, high conservation, high-RPKM, strong promoter; for long 5′ UTRs: high-GC, high conservation, high-RPKM; for asRNA: high-GC, strong promoter, long size. We define high-GC as > 50% GC; high conservation as conserved in five species or more (see below definition of conservation); high-RPKM as more than twice the median RPKM value for this type of element; and long size as over 200 nt. Annotated ncRNAs and extended_annotations are available in supplementary data as GFF (General Feature Format) files providing locations, names, and qualitative information for each transcript ().
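The "high interest" rule stated above can be expressed as a short decision function. The sketch below is illustrative only: the `Transcript` record, its field names, and the class labels are hypothetical, and "strong promoter" is taken as a precomputed boolean because its exact definition is not given in this paragraph. The thresholds and the per-class criteria lists follow the definitions in the text.

```python
from dataclasses import dataclass

# Criteria that apply to each ncRNA class, as listed in the naming convention.
CRITERIA_BY_CLASS = {
    "sRNA": ("high_gc", "high_conservation", "high_rpkm", "strong_promoter"),
    "long_5utr": ("high_gc", "high_conservation", "high_rpkm"),
    "asRNA": ("high_gc", "strong_promoter", "long_size"),
}


@dataclass
class Transcript:
    rna_class: str             # one of the keys of CRITERIA_BY_CLASS
    length_nt: int
    gc_percent: float
    rpkm: float
    n_species_conserved: int
    has_strong_promoter: bool  # assumed to be computed elsewhere


def criteria_met(t: Transcript, class_median_rpkm: float) -> list:
    """Return the names of the class-relevant criteria that the transcript meets."""
    flags = {
        "high_gc": t.gc_percent > 50.0,                    # > 50% GC
        "high_conservation": t.n_species_conserved >= 5,   # five species or more
        "high_rpkm": t.rpkm > 2 * class_median_rpkm,       # more than twice the class median
        "strong_promoter": t.has_strong_promoter,
        "long_size": t.length_nt > 200,                    # over 200 nt
    }
    return [name for name in CRITERIA_BY_CLASS[t.rna_class] if flags[name]]


def is_high_interest(t: Transcript, class_median_rpkm: float) -> bool:
    """A transcript is high interest when it meets at least one relevant criterion."""
    return bool(criteria_met(t, class_median_rpkm))


# Toy records; the values are illustrative, not taken from the supplementary tables.
srna = Transcript("sRNA", 120, 42.0, 293.0, 1, False)
print(criteria_met(srna, class_median_rpkm=63.0))
```

Keeping the per-class criteria in a single table makes the asymmetry between classes explicit: for instance, a highly expressed antisense RNA is not selected by abundance alone under the rule as written in this section.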

Promoter motif detection

We used the MEME web server with default options, except that “search given strand only” was selected, to detect motifs in the sequences upstream of CDSs, transcription units, long 5′ UTRs, sRNAs, and asRNAs (Fig. 1). We extracted the 50 nt sequence upstream of each annotation. Upstream sequences were excluded when another CDS was encountered before 50 nt. Sequences shorter than 10 nt were excluded from analysis. The list of sRNAs with both BRE and TATA box motifs was produced by combining two searches: (i) a MEME search of the 50 nt sequences upstream of sRNAs, and (ii) a genome-wide FIMO search (Grant et al. 2011) for sRNAs with a BRE-TATA box motif in their 50 nt upstream region, using the BRE-TATA box motif defined from the 50 nt upstream regions of expressed genes (default options, with the P value threshold set to 10e-3).
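As an illustration of the strand-aware extraction of upstream sequences, a minimal Python sketch follows. It assumes 0-based, half-open feature coordinates and treats the genome as a linear string; the function names and the tuple layout of `features` are hypothetical, and the check for an intervening CDS is not modeled.

```python
# Translation table for complementing DNA, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_sequence(genome, start, end, strand, length=50):
    """Sequence of up to `length` nt located 5' of a feature, read on its strand.

    `start` and `end` are 0-based, half-open coordinates on the forward strand.
    """
    if strand == "+":
        low = max(0, start - length)
        return genome[low:start]
    if strand == "-":
        high = min(len(genome), end + length)
        # For a minus-strand feature, "upstream" lies after `end` on the
        # forward strand and must be reverse-complemented.
        return reverse_complement(genome[end:high])
    raise ValueError(f"unknown strand: {strand!r}")


def collect_upstream(genome, features, length=50, min_len=10):
    """Upstream sequences keyed by feature name, dropping those under `min_len` nt.

    `features` is an iterable of (name, start, end, strand) tuples.
    """
    sequences = {}
    for name, start, end, strand in features:
        seq = upstream_sequence(genome, start, end, strand, length)
        if len(seq) >= min_len:
            sequences[name] = seq
    return sequences
```

The resulting sequences could then be written to a FASTA file and submitted to MEME with the single-strand option, so that motifs are only searched in the orientation of each transcript.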

Conservation analysis

All identified ncRNA sequences were submitted to a BLASTN search (version 2.2.15 with parameters: -W 7 -r 2 -q -3 -G 5 -E 2 -e 10) against a database of 76 archaeal genomes, selected based both on their “complete” status in the Genomes OnLine Database (http://www.genomesonline.org) in March 2012 and on their availability in the NCBI genome repository. An empirical filter was applied to remove BLASTN alignments with a bit score below 42 (corresponding to an expected value of 0.06), and all retained hits were inspected to assess the extent of conservation for each query. Inspection was based on the taxonomic lineage given in the ORGANISM record of the genome GenBank file, where “Archaea” is the deepest level (the root) and the species strain is the uppermost level. A BLASTN hit associates two species and, thus, two lineages. For each BLASTN hit, the uppermost common level of the two lineages was retained. The conservation level of an ncRNA (Fig. 2B) was then given by the deepest level among all pairwise taxonomy levels. For example, suppose an RNA is found in two species: Pyrococcus furiosus and Methanosarcina acetivorans. The taxonomic description for P. furiosus is “Archaea; Euryarchaeota; Thermococci; Thermococcales; Thermococcaceae; Pyrococcus” and that for M. acetivorans is “Archaea; Euryarchaeota; Methanomicrobia; Methanosarcinales; Methanosarcinaceae; Methanosarcina.” The deepest common level between these two species is then “Euryarchaeota.”
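The lineage comparison can be sketched as follows. This is a minimal Python illustration under stated assumptions: lineages are given root first as in the GenBank ORGANISM record, hits are supplied as hypothetical (lineage string, bit score) pairs, and the function names are ours. In the terminology used above, the "deepest" level is the one closest to the root.

```python
MIN_BIT_SCORE = 42  # alignments with a lower bit score are discarded


def parse_lineage(text):
    """Split an ORGANISM-style lineage string into a list of taxa, root first."""
    return [taxon.strip() for taxon in text.rstrip(".").split(";") if taxon.strip()]


def shared_depth(lineage_a, lineage_b):
    """Number of leading taxa shared by two lineages."""
    depth = 0
    for taxon_a, taxon_b in zip(lineage_a, lineage_b):
        if taxon_a != taxon_b:
            break
        depth += 1
    return depth


def conservation_level(hits):
    """Most specific taxon common to every species with a retained hit.

    `hits` is a list of (lineage string, bit score) pairs. Returns None when
    no hit passes the bit-score filter or the lineages share no taxon.
    """
    lineages = [parse_lineage(text) for text, score in hits if score >= MIN_BIT_SCORE]
    if not lineages:
        return None
    reference = lineages[0]
    # The level closest to the root among all pairwise common levels equals
    # the longest prefix shared by all lineages.
    depth = min(
        (shared_depth(reference, other) for other in lineages[1:]),
        default=len(reference),
    )
    return reference[depth - 1] if depth else None
```

With the two lineages quoted in the example above, the function returns "Euryarchaeota".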

H/ACA-like motif search

A standard descriptor-based search was performed using RNAMotif to identify canonical or non-canonical H/ACA motifs with structural features similar to those found in Pyrobaculum, except that we also allowed substitution of the K-turn motif by a K-loop. A series of filters was then used to post-process the results, which initially included around 3,300 hits: around 100 H/ACA or H/ACA-like motifs with a K-turn and ~3,100 H/ACA or H/ACA-like motifs with a K-loop. The first filter was expression-based: it retained only hits in regions expressed in the RNA-seq data, thus eliminating 33% and 87% of the H/ACA(-like) motifs with a K-turn and a K-loop, respectively. The second filter was based on sequence and structure conservation: it kept hits conserved in other archaea where the H/ACA(-like) motif is also preserved. The third filter was function-based: it looked for potential RNA targets (using the YASS program) that might be chemically modified by the RNA guide machinery. The following base-pairing constraints were imposed between guide and target. For the canonical H/ACA motifs where the first stem is still present but slightly truncated with respect to the typical H/ACA motifs in Pyrococcus abyssi (e.g., Pae sR201 and sR202 in Pyrobaculum aerophilum; see Bernick et al.), a minimum of seven “consecutive” base pairs was required, spanning the unpaired UN dinucleotide that corresponds to the pseudouridylable position. For H/ACA-like motifs, a minimum of 11 consecutive base pairs was required. In both cases, the constraints are consistent with the minimum number of base pairs observed in H/ACA or H/ACA-like motifs when paired to their RNA target(s). The basic workflow of the full search is summarized in .

48 in total

1. Normalized nucleotide frequencies allow the definition of archaeal promoter elements for different archaeal groups and reveal base-specific TFB contacts upstream of the TATA box.

Authors: J Soppa
Journal: Mol Microbiol Date: 1999-03 Impact factor: 3.501

Review 2. A dedicated computational approach for the identification of archaeal H/ACA sRNAs.

Authors: Sébastien Muller; Bruno Charpentier; Christiane Branlant; Fabrice Leclerc
Journal: Methods Enzymol Date: 2007 Impact factor: 1.600

Review 3. Hyperthermophiles and the problem of DNA instability.

Authors: D W Grogan
Journal: Mol Microbiol Date: 1998-06 Impact factor: 3.501

4. Fitting a mixture model by expectation maximization to discover motifs in biopolymers.

Authors: T L Bailey; C Elkan
Journal: Proc Int Conf Intell Syst Mol Biol Date: 1994

5. Cell-free transcription at 95 degrees: thermostability of transcriptional components and DNA topology requirements of Pyrococcus transcription.

Authors: C Hethke; A Bergerat; W Hausner; P Forterre; M Thomm
Journal: Genetics Date: 1999-08 Impact factor: 4.562

6. The expanding world of small RNAs in the hyperthermophilic archaeon Sulfolobus solfataricus.

Authors: Maria A Zago; Patrick P Dennis; Arina D Omer
Journal: Mol Microbiol Date: 2005-03 Impact factor: 3.501

7. The role of posttranscriptional modification in stabilization of transfer RNA from hyperthermophiles.

Authors: J A Kowalak; J J Dalluge; J A McCloskey; K O Stetter
Journal: Biochemistry Date: 1994-06-28 Impact factor: 3.162

8. Protection of DNA by salts against thermodegradation at temperatures typical for hyperthermophiles.

Authors: E Marguet; P Forterre
Journal: Extremophiles Date: 1998-05 Impact factor: 2.395

9. YASS: enhancing the sensitivity of DNA similarity search.

Authors: Laurent Noé; Gregory Kucherov
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

10. Experimental characterization of Cis-acting elements important for translation and transcription in halophilic archaea.

Authors: Mariam Brenneis; Oliver Hering; Christian Lange; Jörg Soppa
Journal: PLoS Genet Date: 2007-12 Impact factor: 5.917
