Bioinformatics of prokaryotic RNAs.

Rolf Backofen¹, Fabian Amman², Fabrizio Costa³, Sven Findeiß⁴, Andreas S Richter⁵, Peter F Stadler⁶.

Abstract

The genomes of most prokaryotes give rise to surprisingly complex transcriptomes, comprising not only protein-coding mRNAs, often organized as operons, but also dozens or even hundreds of highly structured small regulatory RNAs and unexpectedly large levels of anti-sense transcripts. Comprehensive surveys of prokaryotic transcriptomes, and the need to characterize their non-coding components as well, depend heavily on computational methods and workflows, many of which have been developed or at least adapted specifically for use with bacterial and archaeal data. This review provides an overview of the state of the art of RNA bioinformatics, focusing on applications to prokaryotes.

Keywords: RNA bioinformatics; RNA–RNA interaction; TSS annotation; gene finding; secondary structure prediction; target prediction

Substances: RNA

Year: 2014 PMID: 24755880 PMCID: PMC4152356 DOI: 10.4161/rna.28647

Source DB: PubMed Journal: RNA Biol ISSN: 1547-6286 Impact factor: 4.652

Introduction

During the last decade, thousands of small RNAs (sRNAs) have been discovered in a widely diverse set of prokaryotes. Beyond the evolutionarily ancient “housekeeping” RNA genes encoding tRNAs, rRNAs, RNase P RNA, and SRP RNA (as well as tmRNA and 6S RNA in bacteria), typical genomes harbor dozens or even hundreds of sRNAs with predominantly regulatory roles. Archaea, in addition, have homologs of the small nucleolar RNAs of Eukarya (snoRNAs), directing chemical modifications of rRNAs and other RNA targets. Compared with protein-coding genes, most of the prokaryotic RNAs are still rather poorly characterized in terms of their structure, function, and phylogenetic distribution. In particular, with the advent of high-throughput transcriptomics, large numbers of sRNA candidates have been detected, but so far have not received attention beyond a note of their genomic coordinates. Computational approaches have been very successful in facilitating, extending, and complementing experimental investigations. In this contribution, we review the state of the art and the limitations of RNA bioinformatics as applied to prokaryotes. Although we cover a broad variety of approaches, our presentation emphasizes particular methods and tools that were developed or substantially improved within the Priority Program SPP 1258: Sensory and regulatory RNAs in Prokaryotes, funded by the Deutsche Forschungsgemeinschaft from 2007–2013. It is a successful example of a coordinated project in which many new or adapted bioinformatics tools have been developed specifically according to the needs of several experimental groups.

Structure Prediction

The complex three-dimensional structures formed by many functional single-stranded nucleic acids are dominated by base pairing, both in terms of the energy of folding and in the sense that much of the shape can be understood in terms of the co-planar arrangement of the bases. At the same time, the status of a nucleotide as either paired or unpaired can be interrogated experimentally by means of chemical or enzymatic probing. This makes secondary structures an important level of description. The problem of secondary structure prediction is well investigated and described elsewhere. The most prominent implementations of RNA folding algorithms are mfold and the ViennaRNA Package. Standard approaches consider only non-crossing structures, a condition that is not always satisfied. Different classes of pseudoknot structures have been defined and corresponding prediction algorithms have been implemented, albeit at the expense of higher computational complexity. The accuracy of secondary structure prediction from single sequences is far from perfect for a wide variety of reasons. Some derive from limitations of the secondary structure model, such as deviations from the additive model, insufficient knowledge of energy parameters, simplified parametrization of multi-loops, and the exclusion of non-standard base pairs. In addition, the precise transcript might be known only partially, or structure motifs are embedded into a larger RNA, which leads to the even harder problem of local structure prediction. There are two remedies for these problems: (1) instead of just a single sequence, evolutionary information on patterns of sequence conservation may be taken into account, or (2) experimental evidence such as chemical probing or FRET data may be incorporated into structure prediction. When accurate sequence alignments can be obtained, these may serve as a basis for computing consensus structures. 
The simplest approach, implemented e.g., in RNAalifold, is to extend the RNA folding algorithms to compute a secondary structure that minimizes the average folding energy of the aligned sequences. A more sophisticated phylogenetic model replacing simple averaging is implemented in PETfold. At lower levels of sequence conservation, folding and alignment must be computed simultaneously at a much higher computational cost. Several practical approaches exist, from full-fledged implementations of the Sankoff algorithm, e.g., in Foldalign and Dynalign, to computationally much more efficient approximations that restrict themselves to base pairs that are thermodynamically plausible for the individual sequences. Tools of the latter type are LocARNA and its variants, and SPARSE. A conceptually different approach taken by the RNAshapes package makes use of coarse-grained structures. In all cases, the output consists of a sequence alignment annotated by a consensus structure, which is exactly the input required later on for homology search. Experimental data can be integrated into structure prediction either as hard constraints (enforcing or prohibiting certain base pairs) or as soft constraints that distort the ensemble of structures by adding bonus energies or energy penalties to encouraged or discouraged structural elements, respectively. Measurements from SHAPE, PARS, or other chemical or enzymatic probing methods can be converted into pseudo-energies added to paired or unpaired bases, leading to a distortion of the Boltzmann ensemble toward the experimental signal. Most recently, more sophisticated approaches have appeared toward reconciling experimental data with the thermodynamic folding approach. RNAassist formulates the problem in terms of simultaneously minimizing position-dependent energy penalties and the deviation of observed and predicted probabilities for unpaired nucleotides. 
SeqFold uses the experimental data to select locally stable secondary structure from the Boltzmann ensemble. In ShapeKnots, an iterative procedure is used to include pseudoknots and SHAPE information. It has been applied, e.g., to investigate the structure of a SAM-I riboswitch.
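To make the notion of a non-crossing secondary structure concrete, the following is a minimal sketch of base-pair maximization in the style of the classic Nussinov recursion. It is not the energy-based algorithm used by mfold or the ViennaRNA Package; the function names and the `min_loop` parameter are illustrative assumptions.

```python
# Minimal sketch: maximize the number of non-crossing base pairs
# (Nussinov-style recursion). Real folding tools minimize free energy
# with a loop-based energy model; this only counts pairs.

# The six canonical and wobble pairs allowed in the sketch.
ALLOWED_PAIRS = {("G", "C"), ("C", "G"), ("A", "U"),
                 ("U", "A"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    """Return True if nucleotides a and b may form a base pair."""
    return (a, b) in ALLOWED_PAIRS


def max_base_pairs(seq, min_loop=3):
    """Maximum number of non-crossing pairs in seq.

    min_loop is the minimal number of unpaired bases enclosed by a
    hairpin (an assumed, commonly used value of 3).
    """
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximal pair count on the subsequence seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j stays unpaired.
            score = best[i][j - 1]
            # Case 2: position j pairs with some k, splitting the interval.
            for k in range(i, j - min_loop):
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]
```

For example, `max_base_pairs("GGGAAAUCC")` finds the three stacked pairs closing the `AAA` hairpin loop. Because every pair splits the interval into an inner and a left part, crossing (pseudoknotted) pairs cannot be represented, which is the restriction mentioned in the text.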

Gene Finding and Transcriptomics

Homology search

The initial gene annotation of a newly sequenced genome is created by comparison with known sequences of related organisms together with the application of de novo prediction methods; in particular, the search for open reading frames of sufficient length. Since non-coding RNAs (ncRNAs) do not offer a similar generic sequence pattern, they are much harder to predict from scratch. As a consequence, only a few well-known RNA genes such as tRNAs, RNase P RNA, SRP RNA, and the rRNA subunits, are annotated for most prokaryotic genomes. Both homology search and many of the comparative genomics approaches discussed below are applicable not only to independent sRNAs but also to structured RNA elements, which include, in particular, riboswitches, RNA thermometers, and several other cis-acting elements. For brevity, we will simply speak of ncRNAs in the following. The Rfam database, as the most extensive repository of structured RNAs, lists in its current version 11.0 a total of 605 RNA families with prokaryotic members (527 bacterial and 107 archaeal). This number includes, however, a large number of CRISPR RNA repeats, many riboswitches, and other mRNA elements, as well as ubiquitous RNA families such as tRNAs or RNase P. There is, at present, no comprehensive repository of prokaryotic small RNAs. The overwhelming majority of sRNAs discovered after the publication of a reference genome are documented only in the main text of publications or in supplemental material. Despite community efforts and incentives such as free open access publication of RNA family descriptions in this journal, only a very moderate number of prokaryotic RNA families have been described in detail and deposited to databases, see e.g. references 39–42. As a consequence, the majority of sRNA families remain in practice unavailable for genome annotation pipelines. 
For the same reason, it is impossible to give an accurate estimate of the total number of bacterial or archaeal sRNA families or to globally assess their phylogenetic distributions with any degree of certainty. The most widely used tool for homology search is blast. For highly diverged sequences, blast typically reports several small fragments instead of the full-length match to the query sequence. Thus, it is not necessarily the method of choice. Specialized ncRNA sequence homology search derivatives of blast are available, e.g., blastR. Semi-global dynamic programming algorithms such as Gotohscan are a viable alternative given the small genome size of prokaryotes. This program reports full-length hits, makes subsequent processing of the predicted homologs much easier, and is particularly well-suited for ncRNAs, which, in contrast to protein-coding genes, are typically short and evolve rapidly at the sequence level. These properties generally limit the sensitivity of purely sequence-based methods. The information content of the query can be increased by making use of secondary structure conservation as well. Covariance models (CMs), a generalization of HMMs to tree-like structures, provide a convenient technical basis. They have to be trained from multiple sequence alignments annotated by a consensus structure. In contrast to blast, which is content with a single query sequence, CMs require a collection of evolutionarily related and alignable homologs as a starting point. With infernal 1.1, a highly efficient implementation of a search tool for CMs has become available that is suitable for large-scale applications. Most covariance models, in particular, the models of the Rfam families, are dominated by sequence information. At least in this regime, infernal is the most effective tool available. Phylogenetic distance, and hence, decreasing sequence conservation, eventually limits applicability of homology search. 
It is possible in principle to include thermodynamic stability, either using the idea of thermodynamic matchers or employing structural alignments. It remains unclear, however, whether such techniques can substantially improve the sensitivity of homology search for distantly related species.
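The difference between a local search that may return fragments and a semi-global search that always reports a full-length hit can be illustrated with a small dynamic programming sketch. This is not the Gotohscan implementation: it uses a linear gap penalty instead of affine gap costs, and the scoring values and function name are assumptions chosen for illustration.

```python
# Minimal sketch of semi-global alignment: the whole query must be
# aligned, while unaligned target sequence before and after the hit
# is free of charge. Linear gap costs are a simplification.

def semiglobal_best_hit(query, target, match=2, mismatch=-1, gap=-2):
    """Return (score, end) of the best full-length query match.

    end is the number of target characters consumed when the hit ends,
    i.e., the hit ends just before target[end].
    """
    n = len(target)
    # Row 0 is all zeros: the hit may start anywhere in the target.
    prev = [0] * (n + 1)
    for i in range(1, len(query) + 1):
        # Column 0: the first i query characters aligned to gaps.
        cur = [i * gap] + [0] * n
        for j in range(1, n + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,   # (mis)match
                         prev[j] + gap,       # gap in target
                         cur[j - 1] + gap)    # gap in query
        prev = cur
    # The last row holds scores for complete query alignments;
    # trailing target sequence is free, so take the best column.
    best_end = max(range(n + 1), key=lambda j: prev[j])
    return prev[best_end], best_end
```

Calling `semiglobal_best_hit("ACGT", "GGGACGTGGG")` returns the score of four matches together with the end coordinate of the hit. A traceback, omitted here for brevity, would recover the start coordinate as well.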

Feature-based gene prediction

sRNApredict uses typical features of prokaryotic sRNAs: elevated sequence conservation, putative promoter sequences, and Rho-independent terminator elements. TranstermHP, for instance, is used to predict Rho-independent terminators. Its scoring function favors G/C-rich stem loops followed by a poly-T tract. It is obviously extremely difficult to detect correct terminator elements in species with a high G/C-content and in those that use structural elements deviating from the canonical terminator structure. In order to increase sensitivity and specificity, sRNApredict focuses on intergenic regions and analyzes the co-occurrence of several of the above-mentioned features. While this strategy works quite well for well-characterized bacterial clades, it is bound to fail in others. Xanthomonas and Helicobacter, for example, lack typical promoter sequences and distinct terminator hairpins.
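The canonical terminator signature described above (a G/C-rich stem loop followed by a run of T residues) can be sketched as a naive pattern scan. This is not the scoring function of TranstermHP; the perfect-stem requirement, the parameter names, and all thresholds are illustrative assumptions.

```python
# Naive sketch: find perfect hairpins with a G/C-rich stem that are
# immediately followed by a T-rich tail (DNA alphabet). Real predictors
# score imperfect stems and use energy-like terms; this does not.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Reverse complement of a DNA string; unknown symbols become N."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq))


def terminator_like_hairpins(seq, stem=5, loop_sizes=range(3, 9),
                             tail=6, min_t=4, min_gc=0.6):
    """Return (start, end, gc_fraction) for each candidate hairpin.

    start/end delimit the hairpin (end exclusive); the tail window of
    length `tail` directly after the hairpin must contain >= min_t Ts.
    """
    hits = []
    for start in range(len(seq)):
        left = seq[start:start + stem]
        for loop in loop_sizes:
            right_start = start + stem + loop
            end = right_start + stem
            if end + tail > len(seq):
                continue  # hairpin plus tail would run off the sequence
            if seq[right_start:end] != reverse_complement(left):
                continue  # arms are not perfectly complementary
            gc = sum(base in "GC" for base in left) / stem
            t_count = seq[end:end + tail].count("T")
            if gc >= min_gc and t_count >= min_t:
                hits.append((start, end, gc))
    return hits
```

On a G/C-rich genome many spurious G/C-rich hairpins pass such a filter, and in genomes lacking canonical hairpins nothing does, which illustrates why this class of feature fails for some clades.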

Transcriptomics

Bacterial (and archaeal) transcriptomics can almost always be performed with a reference genome in place. This simplifies the workflow, which is basically composed of the following steps. (1) Library preparation: Transcriptome analyses consist of “wet-lab” experiments and “dry-lab” data evaluation. Both components greatly influence the final outcome and it is therefore recommended to design the experimental setup in a cooperative way, such that practical and theoretical issues are discussed at the very beginning. Selection of an appropriate sequencing platform, e.g., 454 or Illumina, and the enrichment or depletion of certain RNA classes, are only two of many design decisions that depend on the research question. The actual experiments are performed and, depending on the sequencing platform and sequencing depth, several gigabytes of RNA transcript data are reported. (2) Quality check: Sequencing machines typically output FASTQ-formatted files. This extended version of FASTA files is augmented by quality information for each called nucleotide along the sequence. FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) is commonly used to initially check and visualize the quality of the raw sequencing data. Software suites such as the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit) provide several tools to preprocess the raw sequencing reads by e.g., removal of the adaptor and bar code sequences that have been attached during library preparation, or by filtering of low complexity reads. These steps can have a drastic influence on the mapping quality. (3) Read mapping: A large number of software tools for read mapping has become available that differ widely in their algorithmic basis, memory consumption, speed, and versatility. 
Mapping strategies furthermore differ in their treatment of reads that map equally well to multiple genomic locations and in their handling of insertions and deletions. It is therefore important to match the choice of mapping tool to the research question. We have used segemehl very successfully in a variety of studies, ranging from dRNA-seq analysis to split read mapping in prokaryotes. In our hands, segemehl has proven to be a flexible and highly accurate framework. This has also been repeatedly shown in benchmarks using real-life and simulated data. Once the mapping step is completed, mapping summary statistics help to verify whether all prior steps have been successful. Transcriptome studies that investigate prokaryotes usually assume that reads map without interruption (“split-free”) and with near perfect sequence identity to the genome. This is, indeed, the case for the overwhelming majority of the reads. There are, however, biologically relevant exceptions that usually end up in the “sequencing trash bin.” Examples include transcripts containing self-splicing introns in bacteria, as well as enzymatically spliced and circularized RNAs in archaea. A recent study showed that such “atypical” transcript structures may be much more abundant than expected. It remains, however, unclear to what extent rare transcripts of this type are biologically relevant, how many of them are technical artifacts, and to what extent one detects true cellular RNAs that are nevertheless functionally irrelevant. Post-transcriptional modifications may furthermore lead to large local error rates. (4) Transcript annotation and classification: The transcripts are then evaluated with respect to the genomic loci they have been mapped to. This covers in general a classification into protein-coding, non-coding, and intergenic regions. 
For a typical prokaryotic genome, the non-coding portion is mainly comprised of reads that originate from the highly abundant tRNAs and rRNAs and from a few well-characterized housekeeping genes such as tmRNA and 6S RNA. In most prokaryotes, only the open reading frames of protein-coding genes are annotated, while regulatory regions of mRNA transcripts, i.e., their UTRs (untranslated regions), are missing and the structure of polycistronic transcripts, i.e., transcripts that contain more than one gene, remains uncertain. As a result of this knowledge gap, the number of reads mapping to intergenic regions is overestimated. The detection of polycistronic transcripts can be achieved by using a high sequencing depth close to saturation. The exact determination of transcriptional units is, however, challenging, as gap-free expression cannot be found even for well-characterized cases such as the cag pathogenicity island of H. pylori. Another difficult task is the precise mapping of the genomic positions where transcription is initiated. This challenge has been addressed by specific sequencing library preparation steps; the evaluation of the resulting read patterns is described in more detail in the next subsection on transcription start site (TSS) annotation. The determined TSS maps revealed an unexpected complexity of the transcription unit organization. Transcription is initiated as expected ahead of annotated genes and polycistronic transcripts but also internally and anti-sense to them, and therefore, almost everywhere along the genome. Upstream of the determined TSS, promoter sequence motifs are expected. Textbook knowledge describing two conserved elements, i.e., the -10 and -35 box, has been revised, as these motifs are extremely variable between species. 
In Xanthomonas and Helicobacter, for instance, only traces of the -10 box are detectable, but no distinct -35 box has been reported. It appears that the current experimental setups enable the detection of TSS with species-specific housekeeping promoters, but alternative binding motifs are still hidden. The sequence between an annotated TSS and the start of a nearby downstream protein-coding gene gives rise to its 5′ UTR. So-called leaderless transcripts that lack 5′ UTRs completely, i.e., translation start and TSS are mapped to (almost) the same position, are abundant in archaea, but have been thought to be quite rare in bacteria. Surprisingly, however, dRNA-seq experiments reported a large number of leaderless transcripts and 5′ UTRs lacking Shine-Dalgarno sequence patterns in diverse bacteria. Besides the possibility to gain new insights into protein-coding genes, most prokaryotic transcriptome studies are set up to detect novel non-coding RNA genes. These are typically identified by the analysis of read accumulations in intergenic regions or anti-sense to annotated genes. The existence of transcription units that might correspond to non-coding genes is verified by independent experiments such as northern blotting, and their exact size is determined by RACE. A single study reveals dozens of novel RNA genes that need to be further characterized. Common tasks are the detection of homologous sequences, structural conservation analysis, evaluation of their coding potential, and target prediction. For a detailed description of these evaluations, we refer to the sections on homology search, comparative genomics, and RNA–RNA interactions, respectively.
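As an illustration of the quality-check step in the workflow above, the following sketch parses FASTQ records and filters reads by mean base quality. It assumes four-line records with Phred+33 encoded qualities; the function names and the threshold are hypothetical, and the code is not a substitute for FastQC or the FASTX-Toolkit.

```python
# Minimal sketch: read four-line FASTQ records and keep reads whose
# mean Phred quality reaches a threshold. Assumes Phred+33 encoding
# and no line-wrapped sequences.

def parse_fastq(lines):
    """Yield (read_id, sequence, qualities) from an iterable of lines."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # separator line starting with '+'
        qual = next(it).rstrip("\n")
        # Phred+33: quality score = ASCII code minus 33
        yield header[1:], seq, [ord(c) - 33 for c in qual]


def mean_quality(qualities):
    """Arithmetic mean of per-base Phred scores (0.0 for empty reads)."""
    return sum(qualities) / len(qualities) if qualities else 0.0


def filter_reads(lines, min_mean=20.0):
    """Return the IDs of reads whose mean quality is >= min_mean."""
    return [read_id for read_id, _, quals in parse_fastq(lines)
            if mean_quality(quals) >= min_mean]
```

A usage example would pass an open file handle, e.g. `filter_reads(open("reads.fastq"))`, where the file name is a placeholder. Adapter and barcode removal, which the text also mentions, would precede such a filter.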

TSS annotation

In contrast to translation start sites that can be identified by well-established gene annotation strategies, surprisingly little is known about transcription start sites (TSS) in most bacteria. This is the case even though a thorough TSS annotation can serve as a valuable source of information to (1) understand the architecture of polycistronic transcripts, (2) provide a key hallmark for ncRNA gene annotation, and (3) determine the extent of the 5′ UTR, which often harbors regulatory elements such as riboswitches, RNA thermometers, and sRNA binding sites. The first successfully applied methods to annotate TSS were primer extension and RACE. Both techniques aim to find the 5′ end of partly characterized genes, but suffer from two major drawbacks. First, with these techniques it is not possible to distinguish between 5′ ends of an RNA formed by a transcription initiation event or by an RNA cleavage event, which often occurs in the course of RNA processing. Second, both techniques are difficult to scale up to a genome-wide high-throughput application. Therefore, two RNA-seq-based methods for reliable annotation of TSS in bacterial genomes were developed recently. Both methods exploit the phosphorylation pattern unique to primary TSS. Mono-nucleotides for transcription are provided to the RNA polymerase in the form of nucleoside triphosphates, which are broken down in the process of transcription elongation and the released energy is used to form a phosphodiester bond between the newly conjoined nucleosides. As a consequence, the first nucleotide still has a triphosphate attached to its 5′ carbon atom. In contrast, if the phosphodiester bond of two consecutive nucleosides is broken by endonucleolytic cleavage, the remaining fragment is a 5′-phosphomonoester. 
In the method developed by Wurtzel, et al., the total RNA is treated with tobacco acid pyrophosphatase (TAP), which removes the 5′-triphosphate, and hence, makes the RNA amenable to the subsequent 5′-sequencing-adaptor ligation. The 3′-adaptor is attached by a random primer. Compared with a library that is not TAP-treated, reads associated with primary TSS are enriched in the TAP-treated library. An alternative method uses the Terminator-5′-phosphate-dependent exonuclease (TEX) to deplete the total RNA of fragments that are not protected from exonuclease degradation by a 5′-triphosphate. As a control, total RNA from the same extraction is processed the same way, but without the TEX treatment. Therefore, in the final analysis step, the differences between the treated (a.k.a. plus) library and the untreated (a.k.a. minus) library have to be screened position-wise for sites with a compelling enrichment of RNA-seq read starts in the plus vs. the minus library. That is why this method was named differential RNA-seq (dRNA-seq). The first applications of dRNA-seq were manually analyzed by visualizing the reads and assessing the enrichment. Since such a screening is very time-consuming and tedious on genome-scale, and since it involves the subjective assessment of the analyzer, the results suffer from a certain lack of reproducibility and consistency. Therefore, soon after, the first statistical approaches to evaluate dRNA-seq data were proposed. Schmidtke, et al. modeled the density of read starts within the genome locally by applying a sliding window approach. Within each window, the distribution of read start counts per position is assumed to follow a Poisson distribution. As a consequence, the differences between the two libraries can be modeled by the Skellam distribution, which allows one to calculate the probability of encountering the observed enrichment by chance. 
Alternatively, global thresholds are applied to discriminate between significant read enrichment and background noise. To gain specificity, the TSS calling is split into two steps. First, the relative read coverage increase in the treated library from position i-1 to position i is evaluated. If this increase surpasses a defined threshold, the position is further tested for whether the ratio of observed transcription initiation between treated and untreated library exceeds a defined threshold. If both tests are passed, the position is annotated as a TSS. The strength of this method, as implemented in the program TSSpredator, lies in the ability to regard dRNA-seq data from different strains and/or growth conditions and dynamically adjust the thresholds if strong signals are observed in one sample. This circumvents a strict a priori threshold definition, which might be difficult to find for a new data set with different sequencing depth, genome size, and TEX treatment efficiency. The most recent development in automated TSS annotation from dRNA-seq data, TSSAR, picks up the idea from Schmidtke, et al. to model the differences between the treated and untreated library with a Skellam distribution. However, to deduce the parameters from the underlying individual libraries, a zero-inflated Poisson distribution is used instead of a mere Poisson distribution. This allows one to consider the region in focus as a mixture of transcribed and not transcribed segments, where the former are assumed to follow a Poisson distribution and the latter to be zeros with probability 1. The parameters specifying the Skellam distribution are solely deduced from the read density in the transcribed region. The main advantage of TSSAR is the statistically sound analysis resulting in a robust enrichment P value for each genomic position, which in turn, leads to little dependence on a priori defined parameters that can greatly depend on the details of the experimental design and execution. 
Furthermore, TSSAR is provided as an easy-to-use web service, making its application rather convenient. A comparison of TSSpredator and TSSAR is given in Figure 1.
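The Skellam-based enrichment test can be sketched in a few lines. The code below is a minimal illustration, not the implementation of TSSAR or of Schmidtke, et al.: the two Poisson rates are passed in as given (their estimation from sliding windows, with or without zero inflation, is left out), the infinite sums are truncated at an assumed `terms` value that is only adequate for small rates, and all names are hypothetical.

```python
# Minimal sketch: upper-tail probability of the difference of two
# independent Poisson counts (Skellam distribution), used as an
# enrichment P value for read starts in the plus vs. minus library.

import math


def poisson_pmf(k, mu):
    """P(X = k) for X ~ Poisson(mu), computed in log space."""
    if mu == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def skellam_pmf(d, mu_plus, mu_minus, terms=200):
    """P(X - Y = d) with X ~ Poisson(mu_plus), Y ~ Poisson(mu_minus).

    The convolution sum over y is truncated after `terms` values.
    """
    return sum(poisson_pmf(y + d, mu_plus) * poisson_pmf(y, mu_minus)
               for y in range(max(0, -d), terms))


def enrichment_p_value(plus_count, minus_count, mu_plus, mu_minus,
                       terms=200):
    """P(X - Y >= observed difference) under the null rates."""
    d = plus_count - minus_count
    return sum(skellam_pmf(k, mu_plus, mu_minus, terms)
               for k in range(d, d + terms))
```

With equal background rates of two read starts per position in both libraries, a position showing 20 read starts in the plus library and one in the minus library receives a vanishingly small P value, whereas equal counts in both libraries do not. Where SciPy is available, `scipy.stats.skellam` provides the same distribution without the truncation.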

Figure 1. Comparison of automated TSS annotation from dRNA-seq data with TSSpredator and TSSAR. The upper plot pair shows the mapped read coverage in the treated (L+) and untreated (L-) library for an exemplary region from H. pylori dRNA-seq data. Blue dashed lines indicate TSS annotated by TSSpredator (using default parameters). The middle plot pair shows essentially the same data, but only the read start coverage is plotted. This is how TSSAR looks at the data. Dashed red lines indicate TSS annotated by TSSAR (P value cutoff of 10⁻⁴). The bottom part shows the positions of the annotated genes in the considered region. The read coverage plots indicate that the data produced by dRNA-seq is more complex than it might appear from the method description; therefore, statistical data analysis is required.

Similar to the eukaryotic research community, the understanding of prokaryotic genomes can benefit from shifting from the established protein-coding gene centered genome annotation to the incorporation of more information on transcripts, with all their diversity in function and architecture. With the recent developments both in wet-lab experiments and computational analysis that allow one to characterize bacterial transcriptomes in a semi-automated, high-throughput manner, a comprehensive transcript annotation becomes feasible.
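The two-step threshold strategy for TSS calling described in this subsection can be sketched as follows. This is a simplified illustration and not the TSSpredator implementation: the enrichment step compares coverage at the position itself, fixed thresholds replace the dynamic adjustment across samples, and the parameter names and the pseudocount are assumptions.

```python
# Minimal sketch of a two-step TSS call on per-position coverage:
# (1) coverage must jump from position i-1 to i in the treated library,
# (2) coverage at i must be enriched in treated vs. untreated.

def call_tss(treated, untreated, min_step_factor=2.0,
             min_enrichment=2.0, pseudocount=1.0):
    """Return positions i that pass both threshold tests.

    treated and untreated are equal-length lists of read coverage.
    The pseudocount avoids division by zero at uncovered positions.
    """
    calls = []
    for i in range(1, len(treated)):
        # Step 1: relative coverage increase within the treated library.
        step = (treated[i] + pseudocount) / (treated[i - 1] + pseudocount)
        if step < min_step_factor:
            continue
        # Step 2: enrichment of the treated over the untreated library.
        enrichment = ((treated[i] + pseudocount)
                      / (untreated[i] + pseudocount))
        if enrichment >= min_enrichment:
            calls.append(i)
    return calls
```

A sharp coverage step that is present in both libraries, as expected for a processing site, passes the first test but fails the second, so it is not reported.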

Comparative genomics

Non-coding RNAs are in many cases detectable by comparative genomics alone, i.e., without the benefit of either known homologs or expression data. SIPHT makes use of invariant features of many bacterial genes. It identifies candidate loci based on sequence conservation in intergenic regions combined with predicted Rho-independent terminators (downstream) and predicted transcription factor binding sites (upstream). The software also evaluates homology with known sRNAs and cis-regulatory RNA elements. The tool is not directly applicable to some genera such as Helicobacter, which has an A/T-rich genome, and thereby, lacks recognizable terminator hairpins. Stabilizing selection acting to preserve secondary structure elements imposes constraints on variations that become fixed in a population, and hence, are observable as differences between orthologous sequences from evolutionarily related organisms. In particular, evolutionarily conserved base pairs admit only six of 16 possible nucleotide pairs: GC, CG, AU, UA, GU, and UG. Computer simulations have indicated that RNA sequences still evolve in a drift-like manner even under very strong selection on their secondary structure, so that sequence patterns reflecting the structural constraints rapidly accumulate and become readily detectable already at 10% of sequence divergence. Qrna investigates pair-wise alignments. The algorithm is based on stochastic context-free grammars and estimates the posterior probabilities for an input alignment to be structured RNA, protein-coding, or neither. Its first application to E. coli resulted in the prediction of several dozens of novel ncRNAs, many of which have been validated. Multiple sequence alignments convey much more information on substitution patterns than pairwise alignments but are also much harder to capture in a detailed stochastic model such as the one used in EvoFold. In RNAz (Figure 2), we have therefore taken a different approach. 
Two lines of evidence inform about conservation of RNA structures: (1) structural similarity above the level expected from placing the differences at random positions, (2) a lower free energy of folding than expected for the same sequence composition. Instead of an explicit stochastic model, RNAz uses machine learning to distinguish between true ncRNAs and decoys with the same dinucleotide content and the same gap pattern as the input alignments. The software is primarily designed for the large genomes of higher eukaryotes but has been employed successfully also for many prokaryotes. It detects all types of conserved secondary structure elements, including bona fide sRNAs, riboswitches, RNA thermometers, structured cis-acting elements, as well as terminator hairpins. Since its initial publication, several improvements have been introduced. In particular, RNAz 2.0 makes use of improved consensus structure prediction for assessing structural conservation, it explicitly accounts for dinucleotide distribution, and it has been retrained on a much larger training set, including many prokaryotic RNAs. Nevertheless, RNAz still suffers from relatively large false discovery rates (FDR) and a limited accuracy in particular of the boundaries of its predicted structures. Reevaluating the RNAz predictions with structure-based alignment reliability scores computed by LocARNA-P not only improves the boundary prediction by more than a factor of three but also halves the FDR.
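The constraint that conserved base pairs admit only six of the 16 nucleotide combinations suggests a simple column-pair statistic, sketched below. It is an illustration of the covariation signal only, not the descriptor set or classifier used by RNAz; the function name and return convention are assumptions.

```python
# Minimal sketch: for two alignment columns, measure how many sequences
# can form a base pair there and how many distinct pair types occur.
# Several distinct pair types at full compatibility indicate
# compensatory (covarying) substitutions.

PAIR_TYPES = {"GC", "CG", "AU", "UA", "GU", "UG"}


def column_pair_support(alignment, i, j):
    """Return (compatible_fraction, distinct_pair_types) for columns i, j.

    alignment is a list of equal-length RNA strings (rows = sequences).
    """
    pairs = [row[i] + row[j] for row in alignment]
    compatible = sum(pair in PAIR_TYPES for pair in pairs)
    distinct = len({pair for pair in pairs if pair in PAIR_TYPES})
    return compatible / len(pairs), distinct
```

For the toy alignment `["GAAAC", "CAAAG", "AAAAU", "GAAAU"]`, columns 0 and 4 are compatible in every sequence while using four different pair types, which is the pattern expected under selection for a conserved helix.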

Figure 2. Evolutionary signals are used to classify multiple sequence alignments into non- or protein-coding. RNAz combines structural and thermodynamic descriptors and measures of sequence conservation to detect excess conservation of secondary structure, while RNAcode identifies increased conservation of putative ORFs compared with the observed sequence conservation of the nucleic acid sequences. Well-conserved structured RNAs, such as Xanthomonas sX13, which is involved in virulence-specific gene expression and hfq mRNA regulation, can easily be identified with RNAz. The E. coli transcript C0343, originally annotated as a small RNA, does not exhibit typical features of a structured RNA. Instead, RNAcode reveals a well-conserved short coding sequence. Dual transcripts such as B. subtilis SR1 are detectable by both RNAz and RNAcode.

A completely different comparative approach is taken by NAPP. First, it determines the phylogenetic distribution of conserved sequence elements as well as annotated protein-coding genes. Coherent phylogenetic distribution and co-occurrence in clusters of conserved non-coding elements and coding sequences then indicate that conserved, un-annotated sequences may harbor sRNAs or conserved UTR elements, including riboswitches. An advantage of this approach is that the association with known proteins at least hints at potential functions of the candidate sRNA. A comparison of different computational approaches toward sRNA prediction can be found e.g., in reference 90. Discrimination between coding and non-coding regions poses technical as well as biological challenges not addressed by standard gene finders. Ironically, authors working on non-coding RNAs repeatedly had to implement ad hoc solutions to detect coding regions. 
While longer protein-coding sequences are easily recognized by the absence of stop codons and characteristic, often species-specific patterns of codon usage, it is impossible to reliably detect short peptides of 20 amino acids or less in a single sequence. In complete analogy to RNA secondary structures, however, conservation of peptide sequences constrains the variation of the underlying nucleic acid sequence in characteristic ways. Most obviously, third codon positions are expected to be much more variable. RNAcode (Figure 2) is based on this idea and evaluates for all six possible reading frames whether the amino acids obtained by translating putative codons are more conserved than expected from the conservation at the nucleic acid level. Translated into log odds scores, these estimates form the basis of a dynamic programming algorithm that identifies statistically significant conserved peptides in the alignment of nucleic acid sequences. The method was applied e.g., to identify very small peptides as well as annotation errors in H. pylori. A particular difficulty is posed by transcripts that function both as sRNA by virtue of a conserved secondary structure and at the same time code for a conserved peptide. Well-known examples from the realm of prokaryotes are the Staphylococcus aureus RNAIII, which regulates target genes as sRNA and encodes the 26 amino acid sequence of delta-hemolysin, and the Bacillus SR1 RNA involved in the regulation of arginine catabolism. The detection of such cases in genome-wide surveys remains difficult, although software for similar tasks has become available. In particular, RNAdecoder searches for conserved RNA structure within DNA regions known to be protein-coding; it suffers from very high FDRs, however. The intersection of RNAz and RNAcode predictions can provide at least plausible candidates but is certainly not ideal either. 
To the best of our knowledge, no systematic survey for dual-function RNAs has been conducted in prokaryotes so far.
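
The coding signal exploited by RNAcode can be illustrated with a toy calculation. The sketch below is not RNAcode's log-odds scoring or its dynamic programming step; it only compares, for one pair of aligned gap-free sequences, the peptide identity with the nucleotide identity in each forward reading frame. All function names and the two example sequences are hypothetical, and the three reverse-complement frames are omitted for brevity.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq, frame=0):
    """Translate a nucleotide sequence in the given forward frame (0, 1 or 2)."""
    seq = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(frame, len(seq) - 2, 3)
    )


def identity(a, b):
    """Fraction of identical positions between two aligned strings."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def coding_signal(seq_a, seq_b, frame=0):
    """Peptide identity minus nucleotide identity (toy score, assumption)."""
    nucleotide_identity = identity(seq_a[frame:], seq_b[frame:])
    peptide_identity = identity(translate(seq_a, frame), translate(seq_b, frame))
    return peptide_identity - nucleotide_identity


# Hypothetical aligned sequences that differ only at third codon positions.
seq_a = "ATGGCTAAAGGT"
seq_b = "ATGGCCAAGGGA"
signals = {frame: coding_signal(seq_a, seq_b, frame) for frame in range(3)}
best_frame = max(signals, key=signals.get)
```

In this example only frame 0 yields a positive value, because all substitutions are synonymous in that frame. A real screen would need a proper background model and significance estimate, which this sketch does not attempt.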

Estimation of RNA families and classes

The Rfam database divides ncRNAs according to inherent functional, structural, or compositional similarities in more than 2200 different RNA families. Rfam’s notion of a clan aggregates families that clearly share a common ancestor but are too divergent to be reasonably aligned or groups of families that could be aligned, but have clearly distinct functions. At an even higher level, an RNA class further groups together ncRNA families or clans whose members have no clear homology at the sequence level and presumably do not derive from a common ancestor, but still share common structural properties as a consequence of functional analogy. Prominent examples are microRNAs (miRNAs) and the two distinct classes of snoRNAs (box H/ACA and box C/D). Current methods for the de novo annotation of ncRNAs rely on unsupervised techniques, such as clustering, to group similar RNAs and subsequent computation of the consensus structure. Using methods implemented in tools like RNAz and EvoFold, further characteristics that are indicative of functional ncRNA genes are evaluated. In this framework, the initial clustering phase is a crucial step, and in order to be successful it requires the specification of an appropriate distance or similarity notion that can characterize the functional properties of RNA sequences. The distance measures of course depend on the level of information available and ultimately on the representation used to encode the RNA molecules. These representations can be based on (1) the nucleotide sequence, (2) the connectivity graph of base pairing interactions, or (3) the full three-dimensional conformation. The third option is not yet viable as there is a lack of both experimental techniques to determine 3D conformations of functional RNAs in a large-scale setting (i.e., for machine learning approaches), and of efficient, and sufficiently accurate, modeling techniques to compute these conformations. 
Frequently, only sequence information is used since it is directly available from sequencing experiments, has relatively low noise, and can be manipulated efficiently and with ease by computers. By construction, any pure sequence-based approach is restricted to RNA families and must fail to detect functional similarity in case of low sequence identity. Indeed, family assignments of structured RNAs obtained from sequence alignments are often wrong when pairwise sequence identities drop below 60%. Much lower similarity levels are quite common within a single RNA class. There is therefore a pressing need for similarity and distance notions that efficiently take into account both sequence and structure. One possible solution is to perform structure prediction simultaneously with the construction of alignments, as described in the section on structure prediction. This approach was successfully used to classify all known CRISPR repeats. However, these alignment-based methods do not scale to efficiently cluster hundreds of thousands of candidate ncRNAs predicted by, e.g., RNAz screens. With GraphClust, a very different approach has become available. It avoids the alignment phase and the explicit computation of a distance matrix altogether. At the same time, it is not restricted to a single structural hypothesis. In order to deal with structural alternatives, abstract shape analysis is used to summarize the ensemble of predicted structures. It provides an a priori classification of structures and allows the efficient retrieval of a single representative secondary structure per class, so that each sequence is represented by a small set of sufficiently different secondary structures. Each structure is then interpreted as a labeled graph from which structural features, defined as small, localized subgraphs, are extracted as outlined in Figure 3. 
The resulting sparse feature vectors for each structure amount to a direct generalization of the well-known k-mer similarity from strings to labeled graphs, which could be used for clustering.
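
For orientation, the string-based k-mer similarity that these graph features generalize can be sketched as follows. This is a minimal illustration on plain sequences, assuming cosine similarity between sparse k-mer count vectors; the function names are hypothetical, and GraphClust itself extracts neighborhood subgraphs of the structure graph rather than substrings.

```python
from collections import Counter
from math import sqrt


def kmer_features(seq, k=3):
    """Sparse feature vector: counts of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical example sequences.
features_a = kmer_features("GGGAAACCC")
features_b = kmer_features("GGGAAACCU")
similarity = cosine_similarity(features_a, features_b)
```

Only features that actually occur are stored, so the cost of a comparison depends on the number of distinct features per molecule rather than on the size of the full feature space.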

Figure 3. Features describing a secondary structure graph. Each graph is described by the set of all neighborhood subgraphs (indicated by shaded areas) up to a maximal radius r around a reference nucleotide (marked by a circle).

For large data sets (i.e., > 10^4 sequences), one cannot afford the quadratic complexity of clustering algorithms that rely on pairwise distance or similarity information. Instead, GraphClust formulates the clustering problem in terms of approximate nearest neighbor queries, which can be answered with sub-linear complexity using locality-sensitive hashing. The similarity of the k-nearest neighbors can then be used to estimate how compact or dense each neighborhood is within the set of feature vectors so that the most compact non-overlapping neighborhoods can be selected as candidate clusters. Each of these candidate clusters is then refined using alignment techniques designed to discard incompatible RNA sequences. A corresponding covariance model is employed to scan the original data set for similar sequences that were missed by graph-based pre-clustering. The entire procedure is then iterated on the remaining instances producing in each round a user-defined number of clusters that can later be merged to decrease the final cluster fragmentation. GraphClust was successfully applied to cluster bacterial ncRNAs. Using a benchmark set of 363 ncRNAs, GraphClust detected 43 high-quality clusters representing 38 families. In this benchmark, additional genomic context was added to simulate the application scenario of unknown precise transcript boundaries. The quality of clustering (measured with the F-measure or with the Rand index) was higher than the state-of-the-art clustering using LocARNA. Thus, GraphClust can successfully determine RNA classes for bacterial ncRNAs, even when the precise transcript boundaries are unknown.
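
The approximate nearest-neighbor step mentioned above can be illustrated with a small locality-sensitive hashing sketch. The choice of MinHash signatures with banding is an assumption made for illustration, as are all names, parameters, and example sequences; plain k-mer sets stand in for the graph features.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def kmer_set(seq, k=3):
    """Set of k-mers, used here as a stand-in for structural features."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def stable_hash(feature, seed):
    """Deterministic 64-bit hash of a feature under a given seed."""
    digest = hashlib.md5(f"{seed}:{feature}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(features, num_hashes=32):
    """MinHash signature: the minimum hash value per seeded hash function."""
    return tuple(min(stable_hash(f, seed) for f in features) for seed in range(num_hashes))


def candidate_pairs(signatures, band_size=4):
    """Pairs of items that share at least one identical signature band."""
    buckets = defaultdict(set)
    for name, signature in signatures.items():
        for start in range(0, len(signature), band_size):
            buckets[(start, signature[start:start + band_size])].add(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical input: two identical sequences and one unrelated sequence.
sequences = {
    "rna_a": "GGGAAACCCUUUGGGAAACCC",
    "rna_b": "GGGAAACCCUUUGGGAAACCC",
    "rna_c": "AUAUAUAUAUAUAUAUAUAUA",
}
signatures = {name: minhash_signature(kmer_set(seq)) for name, seq in sequences.items()}
pairs = candidate_pairs(signatures)
```

Items with similar feature sets tend to fall into the same bucket, so neighbor candidates are found by bucket lookup instead of by comparing all pairs. Items that share no features, such as rna_c here, are practically never proposed as neighbors.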

RNA-RNA Interactions

Models for predicting sRNA–mRNA interactions

The rise of high-throughput methods to characterize transcriptomes, first tiling arrays and now RNA-seq, has led to an explosion in the number of identified sRNAs in prokaryotes; more than 100 sRNAs have been reported in most species (e.g., refs. 105–108). Most sRNAs studied to date form base pair interactions with mRNAs to post-transcriptionally regulate their targets’ translation and stability. The functional characterization of novel sRNAs thus involves identification of their interaction partners together with the precise interaction sites. A promising strategy to cope with the steadily increasing number of discovered but uncharacterized sRNAs is computational prediction of candidate sRNA targets, followed by experimental verification using transcriptomics and proteomics approaches. Computational methods for predicting RNA–RNA interactions fall into four main classes. The following section gives an overview of the available methods and tools with an emphasis on sRNA–mRNA interaction prediction (previously also reviewed in refs. 110 and 111). Table 1 summarizes web-based applications designed for genome-wide sRNA target predictions.

Table 1. Web servers for genome-scale prediction of sRNA target genes

Name	Feature: conservation	Feature: accessibility	Feature: seed region	Classifier	Functional enrichment	URL of web server	References
CopraRNA	X	X	X	-	X	http://rna.informatik.uni-freiburg.de/CopraRNA	112
IntaRNA	-	X	X	-	X	http://rna.informatik.uni-freiburg.de/IntaRNA	113, 114
RNApredator	-	X	-	-	X	http://rna.tbi.univie.ac.at/RNApredator	115, 116
sRNATarget	-	-	X	X	-	http://ccb.bmi.ac.cn/srnatarget	117
sTarPicker	-	X	X	X	-	http://ccb.bmi.ac.cn/starpicker	118
TargetRNA2	X	X	X	-	-	http://snowwhite.wellesley.edu/targetRNA	

All web servers are based on computational methods that score the sRNA–target interaction by its hybridization energy and by additional features as indicated in the table. Some servers directly allow for functional enrichment analysis of the highest-ranking target predictions.

The first class of methods evaluates the stability of the duplex formed between two RNA molecules, aiming to find the loci in both partners that yield the energetically most favorable hybridization. Only base pairs between the two RNAs are evaluated, while their intramolecular structure is ignored. The most popular tools of this type are RNAhybrid, RNAduplex, RNAplex, DINAMelt, and RIsearch. Methods of this class are primarily tailored for predicting potential binding sites of short RNAs (like eukaryotic miRNAs) in large target RNAs, as they tend to maximize the hybridization length. The prediction is based on a modified version of the secondary structure prediction algorithm of reference 124 that omits multi-loops. A simplified loop energy model was introduced by RNAplex. This tool also allows one to favor shorter interactions by per-nucleotide penalties. RIsearch further simplifies the nearest-neighbor energy model by a local alignment-like algorithm that uses dinucleotide scoring. Its main application is the efficient pre-filtering of interaction candidates in genome-wide screens; the resulting putative interactions can later be evaluated with more complex interaction prediction approaches. The web server TargetRNA was specifically designed for the prediction of bacterial sRNA targets; it provides two scoring schemes: (1) scoring of individual base pairs by a local alignment-like algorithm or (2) duplex minimum free energy (mfe) similar to RNAhybrid. Recently, its successor TargetRNA2 was released (unpublished). Methods of the second class determine a joint secondary structure of two RNAs, i.e., a common structure including both intra- and intermolecular base pairs. 
The two input RNA sequences are concatenated and then folded by an RNA folding algorithm such as Zuker’s algorithm, which is extended to handle the loop containing the concatenation point energetically as an external loop. Tools implementing this idea are, for example, PairFold and RNAcofold. The sRNATarget web server computes the mfe structure of the concatenated sequence to derive interaction features, such as length-normalized free energy, seed match length, and A/U-content in single-stranded regions. A naïve Bayes classifier based on these features is then applied to discriminate sRNA–mRNA interactions from non-interacting sRNAs and mRNAs. The main disadvantage of all concatenation-based approaches is their restriction on the allowed interaction types. The underlying RNA folding algorithm can only predict pseudoknot-free secondary structures, although many interaction sites are actually located in loop regions. Interactions between two stem loops (loop–loop interactions) represent a pseudoknot in the context of the concatenated sequences and therefore cannot be predicted by these approaches. The third class comprises interaction prediction methods that model the competition between formation of duplex and intramolecular base pairs by the structural accessibility of the interaction sites. This strategy is supported by two systematic studies, which showed that functional interaction sites are typically well-accessible in both sRNAs and their target mRNAs. The tools IntaRNA and RNAup calculate the thermodynamics of RNA–RNA interactions as the sum of two energy contributions: (1) the energy required to make the sRNA and target interaction sites accessible, which is calculated from the ensemble of all secondary structures, and (2) the hybridization energy of the two interacting subsequences. IntaRNA additionally incorporates seed regions, i.e., regions of (nearly) perfect sequence complementarity that are thought to initiate interaction formation. 
The IntaRNA web server allows for genome-scale sRNA target predictions followed by functional enrichment analysis of top target predictions and visualization of putative interaction regions. RNAplex optionally approximates interaction site accessibility by position-specific per-nucleotide penalties. An sRNA target prediction web server on top of RNAplex is implemented by the software RNApredator. The web server sTarPicker combines ideas from accessibility-based and concatenation-based approaches. Putative seed interactions are extended by computing a joint secondary structure of sRNA and mRNA. The predictions are then classified into true and false interaction predictions based on the interaction features A/U-content, hybridization energy, accessibility, and seed length. All methods represented by this class can predict complex interactions like loop–loop interactions, but interactions are restricted to one locus. For RNA–RNA interactions involving two or more interaction sites as, e.g., OxyS–fhlA and RNAIII–rot, only one of the interaction sites can be predicted. Whether formation of interactions at multiple loci is a common principle and frequently required for regulation by sRNAs in vivo is still an open question. The sRNA RNAIII, for example, binds its target coa in Staphylococcus aureus both via an imperfect duplex and a loop–loop interaction, but the former interaction alone is sufficient for in vivo repression. Several tools of the third class have been successfully applied to identify sRNA targets in various prokaryotic species. IntaRNA, for example, aided in finding that the cyanobacterial sRNA Yfr1 inhibits translation of two outer membrane proteins and that the sRNA PhrS stimulates translation of the quorum-sensing regulator pqsR in Pseudomonas. But sRNA–mRNA interactions are not restricted to the bacterial domain of life. 
Jäger et al., for example, showed by a combination of computational and experimental approaches that the archaeal sRNA162 targets both a cis- and a trans-encoded mRNA via two distinct domains. Methods of the final class can predict more complex joint secondary structures and also allow for multiple interaction sites. The IRIS tool introduced a model that maximizes the number of base pairs. Alkan et al. then presented a more realistic energy model. The type of joint structures considered in this study was the basis for several subsequent approaches to predict mfe structures, to compute the partition function of joint secondary structures, and to sample joint secondary structures. All these algorithms have a high time and space complexity, in practice precluding genome-wide application. Except for IRIS, the methods of this class are also unable to handle pseudoknotted structures or crossing interactions. Consequently, they still cannot predict instances like the two loop–loop interactions between RNAIII and rot in Staphylococcus aureus, as these constitute a crossing interaction.
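
The two-term energy model used by the accessibility-based tools of the third class can be sketched numerically. The sketch assumes the common formulation in which the energy needed to open a site is derived from the probability that the site is unpaired in the structure ensemble, ED = -RT ln(P_unpaired). The unpaired probabilities and hybridization energies below are hypothetical inputs that would normally come from a folding package; none of the names corresponds to the interface of IntaRNA or RNAup.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal / (mol * K)


def opening_energy(p_unpaired, temperature_c=37.0):
    """Energy (kcal/mol) needed to make a site single-stranded.

    Assumes ED = -RT * ln(P_unpaired), where P_unpaired is the ensemble
    probability that the whole site is unpaired.
    """
    rt = GAS_CONSTANT * (temperature_c + 273.15)
    return -rt * math.log(p_unpaired)


def interaction_energy(hybrid_energy, p_unpaired_srna, p_unpaired_mrna):
    """Hybridization energy plus the opening energies of both sites."""
    return (
        hybrid_energy
        + opening_energy(p_unpaired_srna)
        + opening_energy(p_unpaired_mrna)
    )


# Hypothetical candidates: a stable duplex at a poorly accessible site
# versus a weaker duplex at a well-accessible site.
buried_site = interaction_energy(-12.0, p_unpaired_srna=0.001, p_unpaired_mrna=0.01)
open_site = interaction_energy(-9.0, p_unpaired_srna=0.5, p_unpaired_mrna=0.4)
```

With these illustrative numbers the well-accessible site receives the more favorable total energy although its duplex alone is weaker, which is the effect the accessibility term is meant to capture.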

Comparative sRNA target prediction

Genome-scale prediction of sRNA target genes is a computationally challenging task, and all methods presented above suffer from a high false positive rate. Starting from the observation that the target binding site in the sRNA is marked by high sequence conservation across related species, comparative target prediction for conserved sRNAs appears to be a promising strategy to reduce the number of false positive predictions. PETcofold was the first comparative method for the prediction of RNA–RNA interactions and joint secondary structures. Using two multiple alignments of RNA sequences as input, PETcofold predicts conserved RNA–RNA interactions and RNA structures, taking into account covariance information arising from compensatory base pair exchanges. Such an alignment-based strategy will predominantly report duplexes in which the interaction base pairing is conserved across species. Its applicability is, therefore, limited to a subclass of interactions that exhibit broad evolutionary conservation. The same constraint applies to other comparative joint secondary structure prediction approaches such as ripalign. Interactions with a conserved base pairing pattern cover only a subset of all observed interactions; conservation of target complementarity can range from marginal to full conservation even for different targets of the same sRNA. This observation is particularly challenging for alignment-based approaches, as it is not known a priori whether the interaction between a specific sRNA and mRNA is well conserved or not. CopraRNA introduced a very promising alternative strategy that does not depend on fixed input sequence alignments. As for other comparative approaches, CopraRNA’s main idea is to combine the target prediction in several species. But in contrast to the above-mentioned approaches, CopraRNA enforces conservation of neither the interaction site nor the interaction pattern. 
Rather, it performs target prediction in each organism independently and then combines the evidence for all these predictions (see Fig. 4). The basic assumption is that only the target regulation by the sRNA is required to be conserved, but the specific base-pairing pattern can be variable and the interaction site might have even been shifted, especially in the mRNA. For a functional interaction, it is often sufficient to have a binding in proximity to the ribosomal binding site without the necessity of a fixed position.

Figure 4. Comparative prediction of sRNA targets as implemented in the CopraRNA pipeline. For a given pair of sRNA and mRNA sequences, the associated homologs are selected. In the next step, the best interaction in each species is determined and scored by its P value. Finally, all species-specific P values are combined into a single joint P value while taking the evolutionary distances into account.

In order to combine the evidence for an interaction from each organism, one could naïvely use the average of all calculated scores. This approach has, however, two caveats: (1) the scores are not normalized and depend, e.g., on the G/C-content of the organism, and (2) closely related species are likely to have similar scores due to their similarity in sequence composition. Concerning the first point, a way to normalize the score is to use P values instead of raw scores. Since each sRNA typically has only a few functional interactions (for example, a total of 21 direct targets has previously been reported for the well-characterized sRNA GcvB), one can use the score distribution of all genome-wide predicted interactions for a given sRNA in one organism as background to calculate the P values. For the second point, one first has to determine how P values from different organisms can be combined. Even though it is intuitively a good solution, the product of P values no longer constitutes a P value, as it is not uniformly distributed over the background. For that purpose, one has to use a transformation. In CopraRNA, the inverse normal method of Hartung was used since it additionally allows the P values to be weighted, thus correcting for the evolutionary distance of the species.
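
The two normalization steps described above can be sketched as follows. This is a minimal illustration and not the CopraRNA implementation: the empirical P value with a pseudo-count, the clamping constant, the example weights, and all function names are assumptions, and only the basic weighted inverse normal combination is shown, without any further correction for dependence between species.

```python
from math import sqrt
from statistics import NormalDist


def empirical_p_value(score, background_scores):
    """Fraction of background energies at least as favorable (as low) as score.

    A pseudo-count of one keeps the P value strictly above zero.
    """
    at_least_as_good = sum(1 for s in background_scores if s <= score)
    return (at_least_as_good + 1) / (len(background_scores) + 1)


def weighted_inverse_normal(p_values, weights, eps=1e-12):
    """Combine P values by the weighted inverse normal method."""
    std_normal = NormalDist()
    z = 0.0
    for p, w in zip(p_values, weights):
        p = min(max(p, eps), 1.0 - eps)  # keep inv_cdf within its domain
        z += w * std_normal.inv_cdf(1.0 - p)
    z /= sqrt(sum(w * w for w in weights))
    return 1.0 - std_normal.cdf(z)


# Hypothetical per-species P values; the weights stand in for a
# down-weighting of closely related species.
species_p_values = [0.05, 0.05, 0.20]
species_weights = [0.5, 0.5, 1.0]
joint_p = weighted_inverse_normal(species_p_values, species_weights)
```

Each P value is mapped to a standard normal quantile, the quantiles are averaged with weights, and the result is mapped back to a P value, so the combined statistic remains interpretable as a P value when the inputs are independent.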

Open Questions

Many questions and computational problems remain open. Although experimental and computational methods are now in place to identify transcription start sites, the corresponding termination sites still cannot be determined reliably, in particular when they are not associated with Rho-independent terminator structures. Even less is known about other forms of RNA processing, such as cleavage and editing: Where does it occur? What do processing patterns look like in RNA-seq data? Although it has become clear that sRNAs are abundant in most prokaryotes, we still lack a clear picture of their phylogenetic distribution. In particular, distant homologies have remained largely unexplored. The abundance of pseudoknots and complex interaction structures is still unknown, at least in part due to the high computational cost but also the limited reliability of prediction algorithms, in particular when applied to single sequences. The RNA chaperone Hfq facilitates pairing of sRNA and target mRNA in diverse bacterial lineages. The still unknown rules governing the binding of Hfq to specific sRNAs in what appears to be a highly dynamic molecular mechanism are likely to provide a dramatic improvement for predicting functional sRNA–mRNA interactions, and thus for the functional annotation of sRNAs. Eventually, the goal would be to complete the whole bacterial gene regulatory network. Because protein–RNA interactions influence RNA–RNA interactions, this must also include their determination. Furthermore, not only the sRNA targets but also the transcriptional regulation of the sRNA itself has to be understood. This would allow one to apply the systems biology toolbox to explore the dynamics of the full gene regulatory network, which is most likely to be altered by the introduction of sRNAs into the network. Recent years have seen the development of a plethora of high-throughput approaches like CLIP-seq to further investigate the gene regulatory network. 
It can also be seen that these new experimental techniques require a constant development of appropriate bioinformatic tools. The constant mutual development of experimental techniques and associated bioinformatic methods was well established in the Priority Program SPP 1258, which thus can serve as a blueprint for similar collaborative projects.
