Anke Busch1, Andreas S Richter, Rolf Backofen. 1. Bioinformatics Group, Albert-Ludwigs-University Freiburg, Georges-Koehler-Allee 106, Freiburg D-79110, Germany. IntaRNA@informatik.uni-freiburg.de
Abstract
MOTIVATION: During the last few years, several new small regulatory RNAs (sRNAs) have been discovered in bacteria. Most of them act as post-transcriptional regulators by base pairing to a target mRNA, causing translational repression or activation, or mRNA degradation. Numerous sRNAs have already been identified, but the number of experimentally verified targets is considerably lower. Consequently, computational target prediction is in great demand. Many existing target prediction programs neglect the accessibility of target sites and the existence of a seed, while other approaches are either specialized to certain types of RNAs or too slow for genome-wide searches. RESULTS: We introduce INTARNA, a new general and fast approach to the prediction of RNA-RNA interactions incorporating accessibility of target sites as well as the existence of a user-definable seed. We successfully applied INTARNA to the prediction of bacterial sRNA targets and determined the exact locations of the interactions with a higher accuracy than competing programs. AVAILABILITY: http://www.bioinf.uni-freiburg.de/Software/
MOTIVATION: During the last few years, several new small regulatory RNAs (sRNAs) have been discovered in bacteria. Most of them act as post-transcriptional regulators by base pairing to a target mRNA, causing translational repression or activation, or mRNA degradation. Numerous sRNAs have already been identified, but the number of experimentally verified targets is considerably lower. Consequently, computational target prediction is in great demand. Many existing target prediction programs neglect the accessibility of target sites and the existence of a seed, while other approaches are either specialized to certain types of RNAs or too slow for genome-wide searches. RESULTS: We introduce INTARNA, a new general and fast approach to the prediction of RNA-RNA interactions incorporating accessibility of target sites as well as the existence of a user-definable seed. We successfully applied INTARNA to the prediction of bacterial sRNA targets and determined the exact locations of the interactions with a higher accuracy than competing programs. AVAILABILITY: http://www.bioinf.uni-freiburg.de/Software/
Starting with the discovery of microRNAs (miRNAs) and the advent of genome-wide transcriptomics, it has become obvious that RNA plays a large variety of important, often regulatory, roles in living organism that extends far beyond being a mere intermediate in protein biosynthesis (Storz, 2002). Several of these non-protein coding RNAs (ncRNAs) regulate gene expression post-transcriptionally through base pairing to a target mRNA, like eukaryotic miRNAs and small interfering RNAs (siRNAs) (Bartel, 2004; Hannon, 2002; Zamore and Haley, 2005) as well as bacterial small regulatory RNAs (sRNAs) (Gottesman, 2005).Typically, the size of bacterial sRNAs ranges from 50 nt to 250 nt (Vogel and Wagner, 2007). The post-transcriptional interaction with their target mRNA causes translational repression or activation, mRNA degradation or changes in mRNA stability (Storz et al., 2004). They have been found to be crucial in the bacterial stress response and in bacterial virulence (Gottesman, 2005). Please note that we use the term sRNAs for small, regulatory RNAs in bacteria as done predominantly in the literature (e.g. Storz and Haas, 2007).Since base pairing to a target mRNA is the main regulatory mechanism of sRNAs (Vogel and Wagner, 2007), the hybridization energy is a widely used criterion to predict RNA–RNA interactions (Rehmsmeier et al., 2004; Tjaden et al., 2006). For miRNAs and siRNAs, it was shown that the accessibility of the target sites has also an important influence (Ameres et al., 2007; Kertesz et al., 2007; Kretschmer-Kazemi Far and Sczakiel, 2003; Long et al., 2007; Luo and Chang, 2004; Shao et al., 2007). It can be assumed that the same holds for sRNAs. Furthermore, perfect Watson–Crick pairing of seven or eight consecutive bases (typically at positions 2–8) at the 5′ end of animal miRNAs (the seed region) is often sufficient for effective regulation (Bentwich, 2005; Brennecke et al., 2005; Doench and Sharp, 2004). There is not much known about possible seed regions in bacterial sRNAs. A previous work about sRNA target prediction suggested that the interaction of sRNA and mRNA also starts with a stretch of bases that are unpaired in the sRNA and the mRNA and that form at least a minimal number of consecutive base pairs (Tjaden et al., 2006). A very recent study in Salmonella confirms this assumption by showing that the conserved 5′-end of the RybB sRNA recognizes many omp mRNA targets by short seed pairings (Mika et al., 2008).In Escherichia coli (E.coli), there are >70 validated sRNAs, but only about 20 with a known cellular function since the experimental identification and verification of sRNA targets has been lagging behind (Vogel and Wagner, 2007). Thus, there is a high demand for computational target predictions for regulatory sRNAs. So far, there is no tool that integrates both accessibility of the target regions in the mRNA and in the sRNA and the existence of an arbitrary seed in a general approach.This article presents INTARNA, a new approach to the prediction of interacting RNAs (INTARNA), where a combined energy score of the interaction is calculated as the sum of the free energy of hybridization and the free energy required for making the interaction sites accessible. In addition, the existence (but not the exact location) of a seed is enforced. The length of the seed can be set by the user. We present two variants: a complete approach, which is O(n2m2) in time and O(nm) in space, when restricting the size of internal loops and where n and m are the lengths of the interacting RNA sequences (n>m), and a heuristic simplification of the complete approach, which has a time complexity of and a space complexity of O(nm), where and L is the size of the sequence window in which both the target mRNA and the sRNA are folded. We successfully applied INTARNA to the prediction of sRNA targets. Additionally, INTARNA predicted the exact location of the RNA–RNA interactions.Related work: to date, most existing target prediction approaches are based on one of the two following basic ideas. The first major group determines a common structure for ncRNA and targeted mRNA by concatenating the two RNA sequences and memorizing the linkage location. The new single sequence is folded by an usual RNA-folding algorithm, e.g. the algorithm of Zuker and Stiegler (1981), with slightly different parameters for the loop including the linkage location. The most prominent tools using this idea are PAIRFOLD (Andronescu et al., 2005) and RNACOFOLD (Bernhart et al., 2006b). They have a space complexity of O((n+m)2) and a time complexity of O((n+m)3) when restricting the size of interior loops. The main problem with these tools is that they can only predict interactions where the common structure is pseudoknot-free. In contrast, many interactions in living cells are located in loop regions and would represent a pseudoknot in the context of concatenated sequences.The second major group of target prediction tools neglects intra-molecular binding in both RNA molecules. Algorithms based on this idea find the energetically most favorable hybridization of two RNA sequences. The most popular tools incorporating this idea are RNAHYBRID (Kruger and Rehmsmeier, 2006; Rehmsmeier et al., 2004), RNADUPLEX and RNAPLEX (Tafer and Hofacker, 2008), and DINAMELT (Dimitrov and Zuker, 2004; Markham and Zuker, 2005). RNAHYBRID is primarily tailored for predicting potential miRNA binding sites in large target RNAs. This method uses a modification of the classical secondary structure prediction algorithm of Zuker and Stiegler (1981) that neglects multiloops. Furthermore, the loop size is restricted to a fixed value to reduce complexity. In principle, RNADUPLEX and RNAPLEX incorporate the same ideas as RNAHYBRID, but RNAPLEX uses a simplified energy scoring of loops and a length penalty to favor short stable interactions. By doing so, RNAPLEX performs 10–27 times faster than RNAHYBRID (Tafer and Hofacker, 2008).There is a variety of additional tools that are specially designed to search for miRNA target sites [for a review see Bentwich (2005) and Yoon and De Micheli (2006)]. In contrast, there has been little investigation so far regarding the computational prediction of mRNA targets of bacterial sRNAs. RNAPLEX is also suitable for longer queries like sRNAs because it integrates a nucleotide penalty. It was recently applied to the prediction of sRNA targets (Tafer and Hofacker, 2008). Tjaden et al. (2006) developed a tool named TARGETRNA that predicts the targets of bacterial sRNAs (neglecting intra-molecular base pairs) and outputs them in a ranked list.While most of the aforementioned approaches use the free energy of the hybridized duplex to predict the potential target site, in general, the free energy of the entire duplex is a poor predictor for that aim (Rajewsky, 2006). Several authors have shown that the secondary structure of the target mRNA (Ameres et al., 2007; Kertesz et al., 2007; Kretschmer-Kazemi Far and Sczakiel, 2003; Long et al., 2007; Luo and Chang, 2004; Shao et al., 2007) and the ncRNA (Koberle et al., 2006) has a strong effect on target recognition. To the best of our knowledge, there are only two tools that incorporate the secondary structure of the mRNA (Kertesz et al., 2007; Mückstein et al., 2008). RNAUP (Mückstein et al., 2008) calculates the thermodynamics of RNA–RNA interactions as the sum of two energy contributions: the energy needed to open the binding sites and make them accessible, and the hybridization energy. It has a space complexity of O(n2+nw3) and a time complexity of O(n3+nw5), when restricting the interior loop sizes to a fixed value and limiting the size of interaction to w. In contrast to our approach, RNAUP does not use any seed condition. PITA, developed by Kertesz et al. (2007) for miRNA target prediction, starts with a genome-wide search for initial seed regions and tries to extend these sites in one direction. This is typical for a large proportion of miRNAs, but is unlikely to hold for other ncRNAs.Herein, we present the algorithmic details of INTARNA, our general approach to the prediction of RNA–RNA interactions, including target site accessibility and user-definable seeds.
2 METHODS
We are given two potentially interacting sequences S1 and S2 of lengths n and m, respectively. For every single RNA sequence S, a target site is a pair of positions [x,y] that define an interval with x being the first and y being the last included position.Hybridization energy: the first component determining the quality of an RNA–RNA interaction between target site [i,k] in S1 and target site [j,l] in S2 is the hybridization energy Ehybrid(i,j,k,l). Its calculation is based on the energy model of RNAHYBRID. The energy parameters used are from Mathews et al. (1999). With H(i,j), we denote the hybridization energy of the best interaction of subsequences S1…S1 and S2…S2, where the left-most positions of both subsequences i and j form a base pair. H(i,j) can be calculated using a restricted variant of the algorithm of Zuker and Stiegler (1981) discarding multiloop structures. This algorithm has a time complexity of O(nm) when restricting the internal loop length. Using the RNAHYBRID convention to number the first RNA 5′→3′ and the second in the reverse direction, we get the following basic recursion for H(i,j):
Here, Eloop(i,j,p,q) indicates the free energy of the loop including base pairs (i,j) and (p,q). We disregard dangling end energy contributions for the purpose of simplification here and in the following. Thus, matrix H is initialized with 0. The final hybridization energy is then calculated by choosing min{H(i,j)}. The target site itself is calculated using a normal traceback.Accessibility: the second component contributing to the quality of an RNA–RNA interaction is the accessibility of the target sites in each sequence, which is the energy required to make them single stranded. It is defined as the difference between the energy of the ensemble of all structures and the energy of the ensemble of structures, where the target site [i,k] is single stranded. It is denoted by ED(i,k) and calculated using a partition function approach (McCaskill, 1990). Let 𝒮 be the set of all structures (called ensemble) that can be formed by a sequence S. Then
where Z𝒮 is the partition function of 𝒮, E(Q) is the free energy of sequence S folded into the secondary structure Q and Eens(𝒮) denotes the ensemble energy of the set of structures 𝒮. Let 𝒮unpaired be the set of all structures of S that have nucleotides S, S,…,S unpaired. Then,
which is greater or equal 0 by definition. The probability PU(i,k) that the complete region between i and k is unpaired can be calculated by the equation . ED(i,k) can be obtained using RNAPLFOLD (parameter -u) (Bernhart et al., 2006a; Bompfünewerer et al., 2008) in O(nL2), where L represents the size of the locally folded subsequence.Next, we combined both energy contributions. The extended hybridization energy of a specific interaction of two target sites [i,k] and [j,l] is now defined by summing up the ED-values and the hybridization energy. For calculating the ED-values, we must know the first and the last interacting base in both sequences. Hence, the basic recursion for calculating the extended hybridization energy requires a 4D array C(i,j,k,l). Note that the basic assumption in C(i,j,k,l) is that both (i,j) and (k,l) form a base pair. Thus, we have
where H(i,j,k,l) is the 4D variant of the matrix calculated in Equation (1):
We achieve a complexity of O(n2m2) time and O(n2m2) space when limiting the size of the loops similar to RNAUP. When we limit the interaction length to w (as done in RNAUP), this approach has a complexity of O(nmw2) time and O(nmw2) space. In the following, we will show how to improve the time and space complexity without restricting the interaction size while integrating seed information in addition.
2.1 Reducing the space complexity
The space complexity can be improved by calculating all interactions for a common interaction start in one step. This leads to a 2D matrix C(i,j), which is basically the slice of C(i,j,k,l) for fixed k,l. The hybridization that starts at base pair (k,l), is elongated to the left and ends with base pair (i,j) (Fig. 1).
Fig. 1.
Interpretation of the matrix C(i,j).
Interpretation of the matrix C(i,j).Considering the recursion for C(i,j), note that the ED-values are already included. Since ED-values are not additive, we have to subtract the old ED-values before adding the new ones. The idea of the recursion for C(i,j) is illustrated in Figure 2.
Fig. 2.
Visualization of the recursion for C(i,j). The hybridized part is shown in red, while the energy required to make the mRNA and the sRNA target site accessible (ED) is given in blue and green, respectively. Since ED-values are not additive, e.g. ED(i,k)≠ED(i,p)+ED(p,k), we need to substract ED(p,k) and ED(q,l), and add ED(i,k) and ED(j,l) to get the final result of C(i,j).
Visualization of the recursion for C(i,j). The hybridized part is shown in red, while the energy required to make the mRNA and the sRNA target site accessible (ED) is given in blue and green, respectively. Since ED-values are not additive, e.g. ED(i,k)≠ED(i,p)+ED(p,k), we need to substract ED(p,k) and ED(q,l), and add ED(i,k) and ED(j,l) to get the final result of C(i,j).Formally, we get the following recursion:
For the initial case we have:
Finally, we get a 2D matrix C(i,j) that stores for all left-end base pairs (i,j) the best value found so far for all (k,l) with i≤k and j≤l, i.e.
The calculation of C(i,j) for all (k,l) first and finding the minimal value C(i,j) afterwards has still time and space complexity O(n2m2). Instead, we can update C(i,j) successively after each evaluation of a right-end base pair (k,l) and reuse the matrix C in order to reduce the space complexity. Thus
This gives an O(n2m2) time algorithm (applying the usual trick of restricted loops) with only O(nm) space requirement.
2.2 Incorporation of seed features
According to the findings on seed regions presented in Section 1, we introduce seed features that define their properties:
The seed features are a variable part of our algorithm and can be specified by the user. Although there is a preferred position for the seed (in the case of miRNA, it is the 5′-end), it has been shown that seeds could also be on other positions (Brennecke et al., 2005). Hence, we require only the existence of a single seed sequence at any position. For this purpose, we introduce a function seed(i,j,k,l;P), which stores the minimal free energy between [i,k] and [j,l] such that the interaction includes exactly P base pairs. If i, j, k, l and P are given, the numbers of unpaired bases in the mRNA and the sRNA are fixed to k−i+1−P and l−j+1−P, respectively.
The conditions k−p+1≥P−1 and l−q+1≥P−1 ensure that P−1 base pairs are possible in [p,k] and [q,l], respectively. Let l=k−i+1 and l=l−j+1 be the lengths of intervals [i,k] and [j,l]. Then, seed(i,j,k,l;P) is only valid if l−P≤bmax, l-P≤bmax and l+l−2P≤bmax. These three conditions assure compliance with the seed features.P: the number of bases perfectly paired in the seed regionbmax, bmax and bmax: the maximal number of bases not hybridized in the seed region of both RNAs, the mRNA and the sRNA, respectively.While seed(i,j,k,l;P) finds the minimal free energy for two fixed intervals [i,k] and [j,l], all valid intervals have to be analyzed to find the optimal seed region. This is done during the calculation of a second 2D array Cseed(i,j), which contains energy scores of interactions with a seed region. The interpretation of Cseed(i,j) is given as in Figure 3. This leads to the following recursion:
where l=p−i+1 and l=q−j+1 now are the lengths of intervals [i,p] and [j,q], respectively. The first of the two inner minima addresses the case where a seed region was already found right of base pair (p,q). The second minimum refers to the case where a seed region was not found right of base pair (p,q), but between pairs (i,j) and (p,q).
Fig. 3.
Cseed(i,j) matrix and its relation to the other matrices. Note that both C(p,q) and seed(i,j,p,q;5) consistently assume that (p,q) is a pair. Here, a seed of 5 bp and one unpaired base is shown.
Cseed(i,j) matrix and its relation to the other matrices. Note that both C(p,q) and seed(i,j,p,q;5) consistently assume that (p,q) is a pair. Here, a seed of 5 bp and one unpaired base is shown.Note that this extension of the algorithm does not increase its complexity. The final values are stored in C(i,j) by replacing C(i,j) with Cseed(i,j) in Equation (5).
2.3 Reducing the space and time complexity
Although the INTARNA algorithm presented above has a space complexity of O(nm) like RNAHYBRID, its time complexity of O(n2m2) is still impractical for a genome-wide search. Hence, we want to achieve also the same time complexity as RNAHYBRID. In the following, the idea is described for the case without seeds to simplify the presentation. However, seeds are integrated in the same way as described above.Before introducing the version of INTARNA with reduced space and time complexity, we summarize the full version of the algorithm to clarify the differences between them. For this purpose, Equations (2), (3) and (4) are combined into the following single equation:To reduce both time and space complexity, we use a heuristic simplification that is inspired by the sparsification technique. Here, the basic idea is that the matrix C(i,j,k,l) is sparse in the sense that many entries will have the same values. This is due to the fact that many right hybridization ends will not be used in the next recursion steps. Our idea is to store the C(i,j) values only for one starting point (k,l). Thus, we use a matrix che(i,j) that stores for every (i,j) the right hybridization end (k,l), which yields the best extended hybridization energy until the left end (i,j) [see Equation (10)]. Given che(i,j), we can directly use a 2D matrix C′(i,j) and do not need C(i,j) and C(i,j) any longer. Denoting with che1(i,j) the first component k of the pair (k,l)=che(i,j), and with che2(i,j) the second component l, we get the following recursion:
After the calculation of C′(i,j), the corresponding value in che(i,j) has to be updated according to the following recursion:
Equivalent simplifications are applied to the recursions that incorporate a seed region. The respective values are stored in Cseed′(i,j). The final best hybridization score including a seed can be found by min{Cseed′(i,j)}.Compared with Equation (8), Equation (9) computes the minimum over only two instead of four variables. Since these two variables p and q are restricted by the maximal loop size, INTARNA has a space complexity of O(nm) and a time complexity of , where and L is the size of the sequence window in which both mRNA and sRNA are folded. All ED-values are calculated in O(nL3) time by integrating RNAPLFOLD into INTARNA via the Vienna RNA library (Hofacker et al., 1994).Suboptimal hybridizations: INTARNA can predict multiple potential target sites per sRNA. The computation of suboptimal hybridizations is implemented by multiple traceback. Since target sites at different locations within the mRNA are especially of interest, an interaction is accepted as suboptimal if it does not overlap with any other interaction predicted thus far. The desired number of suboptimal hybridizations are computed iteratively from the matrix Cseed′.
2.4 Functional analysis of sRNA target sites
In prokaryotes, the interaction between ribosome and mRNA is promoted by the Shine–Dalgarno (SD) sequence, a sequence motif typically 4–5 nt in length and located around 5–8 nt upstream of the start codon. The SD sequence is bound by a complementary motif of the 3′-tail of the 16S ribosomal RNA (rRNA). The translation is usually regulated by blocking access to this initiation site (Kozak, 2005).The majority of bacterial sRNAs act as antisense regulators on trans-encoded mRNAs. Often, the base pairing occurs at the ribosome binding site (RBS), which leads to blockage of ribosome entry and thus to translation inhibition. In contrast, some sRNAs activate translation of their target mRNAs. Thereby, the sRNA binding results in melting of an inhibitory structure that sequesters the RBS (Vogel and Wagner, 2007).Here, we study the consequences of sRNA binding to the target mRNA regarding the accessibility of the SD sequence, and thus the translational regulation. First, we predict the SD sequence location for every studied gene by simulating the hybridization between the mRNA and the single stranded 16S rRNA 3′-tail. The SD sequence is located by the position of the minimum free energy hybridization with a free energy below a significance threshold (Starmer et al., 2006). Then, we calculate the probability that the SD sequence is unpaired before (PUnohybrid) and after (PUhybrid) the hybridization of the sRNA and the mRNA. The change in this probability, ΔPU, is defined as
where the region between i and j is the location of the SD sequence, the region between k and l is the mRNA target site and 𝒮unpaired is the set of all structures of the sequence S having S…S and S…Sunpaired. ΔPU>0 suggests translational activation by the sRNA–mRNA interaction, whereas ΔPU<0 suggests translational repression. The higher the absolute value, the higher is the expected regulatory outcome. However, a special case arises if the mRNA target site overlaps with or is in close vicinity to the SD sequence. Then, the RBS is blocked and translation inactivation is expected. This measurement of single strandness has the advantage that it is based on base pair probabilities. Thus, it accounts for all possible secondary structures within the thermodynamic ensemble. The concept has been previously applied for searching binding motifs of RNA-binding proteins (Hiller et al., 2006).
3 RESULTS
In order to assess the performance of the INTARNA algorithm as presented in Section 2.3, we used the program to predict targets of bacterial regulatory sRNAs. The test set consisted of 10 biochemically mapped sRNA–mRNA interactions from E.coli and eight interactions from Salmonella typhimurium (denoted Salmonella in the following) that were previously published. For each sRNA, we predicted interactions for all genes of the respective genome. The genome sequences were downloaded from GenBank database of the National Center for Biotechnology Information (NCBI) (Benson et al., 2008). Since the majority of the known sRNAs bind their target gene in close proximity to the RBS, we defined a subsequence of 150-nt upstream and 50-nt downstream of the first base of the start codon as the (putative) target region. We obtained 4294 target regions from the E.coli genome (GenBank accession number NC_000913) and 4425 target regions from the Salmonella genome (GenBank accession number NC_003197).The seed features and other parameters used for the target prediction were chosen according to the known interactions from our test set. They included a seed of at least eight paired nucleotides length and no restriction on the interaction length. All interactions except OxyS-fhlA have a continuous hybridization pattern. Amongst these examples, the Spot42-galK interaction is the longest one with a length of 75 nt.We compared the results with several state-of-the-art methods for the prediction of RNA–RNA interactions, namely TARGETRNA, RNAHYBRID, RNAPLEX and RNAUP. Although RNAHYBRID is primarily designed for the prediction of miRNA target sites, it has been used occasionally for prediction related to sRNAs (see, e.g. Sharma et al., 2007; Urban and Vogel, 2008). Therefore, it has been included in our comparison for the sake of completeness using the default parameters. For TARGETRNA, we used the web application (Tjaden, 2008) with default parameters, except that the search was focused on our target regions. RNAPLEX was used with a penalty of 0.3 kcal/mol per nucleotide as suggested by Tafer and Hofacker (2008). We used RNAUP including the probability of unpaired regions in both RNAs (parameter -b) (Mückstein et al., 2008) and set the maximal length of interaction to 80, which is slightly longer than the maximal interaction length in our dataset.
3.1 Accuracy of predicted sRNA–mRNA interactions
In a first experiment, we assessed whether INTARNA is able to predict precisely the interaction between each sRNA and its mRNA target. Therefore, we computed the sensitivity and the positive predictive value (PPV) for each sRNA–target pair, where and . These measures have been used in the past to compare different RNA secondary structure prediction methods (see e.g. Do et al., 2006).As shown in Table 1, INTARNA outperforms existing methods in the accuracy of the predicted interactions. TARGETRNA achieves the second best sensitivity and the third best PPV, but reports only 12 out of 18 interactions due to its cutoff. The program RNAHYBRID tends to maximize the length of hybridization, which leads to high sensitivity, but very low PPV. Thus, the program is more appropriate to predict interactions between short RNAs (like miRNAs) and long RNAs. To overcome this problem, RNAPLEX introduced a length penalty, which significantly increased its PPV compared with RNAHYBRID.
Table 1.
Prediction accuracy of INTARNA compared with leading RNA–RNA interaction prediction methods on a set of experimentally verified interactions
Sensitivity
PPV
sRNA–target
Reference
INTARNA
TARGETRNA
RNAHYBRID
RNAPLEX
RNAUP
INTARNA
TARGETRNA
RNAHYBRID
RNAPLEX
RNAUP
DsrA-rpoS
Repoila et al. (2003)
0.808
0.808
0.000
0.808
0.808
0.778
0.778
0.000
0.778
0.778
GcvB-argT
Sharma et al. (2007)
0.950
1.000
1.000
0.000
0.900
0.950
0.625
0.160
0.000
0.947
GcvB-dppA
Sharma et al. (2007)
1.000
0.941
0.941
0.765
1.000
0.586
0.421
0.132
0.448
0.459
GcvB-gltI
Sharma et al. (2007)
0.000
–
0.875
1.000
0.000
0.000
–
0.210
0.857
0.000
GcvB-livJ
Sharma et al. (2007)
0.955
–
1.000
0.955
0.955
0.955
–
0.180
0.955
0.955
GcvB-livK
Sharma et al. (2007)
0.542
–
0.542
0.542
0.542
0.565
–
0.108
0.565
0.565
GcvB-oppA
Sharma et al. (2007)
1.000
1.000
1.000
1.000
1.000
0.957
0.957
0.200
0.957
0.957
GcvB-STM4351
Sharma et al. (2007)
0.760
0.000
0.000
0.000
0.880
0.905
0.000
0.000
0.000
0.957
IstR-tisAB
Vogel et al. (2004)
0.879
0.939
0.939
0.750
0.667
0.690
0.775
0.403
1.000
1.000
MicA-ompA
Udekwu et al. (2005)
1.000
1.000
1.000
1.000
1.000
1.000
1.000
0.302
1.000
1.000
MicA-lamB
Bossi and Figueroa-Bossi (2007)
1.000
–
0.609
1.000
0.826
0.821
–
0.318
1.000
0.704
MicC-ompC
Chen et al. (2004)
1.000
0.636
1.000
0.000
0.727
0.537
0.286
0.333
0.000
0.410
MicF-ompF
Schmidt et al. (1995)
0.960
0.560
0.960
0.920
0.800
0.960
0.636
0.545
0.958
0.952
OxyS-fhlA
Argaman and Altuvia (2000)
0.500
–
0.938
0.563
0.375
1.000
–
0.288
0.750
1.000
RyhB-sdhD
Masse and Gottesman (2002)
0.588
0.882
0.794
0.824
0.794
1.000
0.909
0.403
1.000
0.794
RyhB-sodB
Geissmann and Touati (2004)
1.000
1.000
1.000
1.000
1.000
0.818
0.375
0.167
0.818
0.900
SgrS-ptsG
Kawamoto et al. (2006)
0.739
–
0.000
0.739
0.739
1.000
–
0.000
1.000
1.000
Spot42-galK
Møller et al. (2002)
0.409
0.545
0.523
0.432
0.523
0.643
0.558
0.280
0.655
0.523
Average
0.783
0.776
0.729
0.683
0.752
0.787
0.610
0.224
0.708
0.772
For every sRNA–target pair, sensitivity and PPV were calculated for the highest scoring interaction predicted. ‘-’ means that no interaction was predicted. The best average result for each measure is highlighted in bold.
Prediction accuracy of INTARNA compared with leading RNA–RNA interaction prediction methods on a set of experimentally verified interactionsFor every sRNA–target pair, sensitivity and PPV were calculated for the highest scoring interaction predicted. ‘-’ means that no interaction was predicted. The best average result for each measure is highlighted in bold.RNAUP achieves the third best sensitivity and the second best PPV. Among all programs compared, it has an overall accuracy closest to INTARNA.INTARNA and RNAUP are the only programs whose optimal solution locates every sRNA binding site correctly, except for the interaction GcvB-gltI. For this interaction, both INTARNA and RNAUP predict optimal hybridizations, which do not share a base pair with the experimentally verified location. However, INTARNA, RNAHYBRID and RNAPLEX also give suboptimal solutions, see Supplementary Table 1. When these predictions are additionally taken into account, the average sensitivity/PPV of RNAHYBRID and RNAPLEX improve to 0.774/0.251 and 0.736/0.761, respectively. INTARNA achieves, on average, both a sensitivity and a PPV greater than 0.8 when the first suboptimal prediction for GcvB-gltI is included.To study the influence of seed features on INTARNA's prediction quality, we repeated the experiments neglecting them (Supplementary Table 2). In this case, the averaged values of sensitivity and PPV are 0.699 and 0.728, respectively, which is below the accuracy of INTARNA with seed features and RNAUP. The difference to the latter, which uses a similar energy model, can be explained by the heuristic of INTARNA.Altogether, the results demonstrate that RNAUP and INTARNA, which both incorporate the accessibility of binding sites, perform better in the prediction of sRNA–mRNA interactions than the other programs neglecting the accessibility. Furthermore, the quality of INTARNA's predictions is substantially improved when seed features are additionally taken into account.
3.2 Performance on prediction of sRNA targets
In a second experiment, we compared INTARNA and the existing methods with respect to the ability of finding sRNA targets. We applied every program to our test set and for each sRNA searched all target regions for potential target sites. The resulting list of target candidates for each sRNA was sorted by the computed energy score. All programs except TARGETRNA and INTARNA give an interaction for each putative target region. TARGETRNA reports at most 100 putative interaction sites per sRNA. INTARNA returns interactions that have both a seed with specified features and an energy score below 0.0 kcal/mol. For each method, we calculated the sensitivity and specificity. For our test set, there are 18 true interactions. Each of the nine sRNAs in E.coli may interact with any of the 4294 target regions, and each sRNA in Salmonella may interact with any of the 4425 target regions. Consequently, there are 47 496 potential interactions, of which 47 478 are considered non-interactions. A similar approach to evaluate the performance on prediction of sRNA targets has been used by Tjaden et al. (2006).The ROC curves in Figure 4a illustrate the performance of different target prediction methods on our test set. We generated each ROC curve by calculating sensitivity and specificity while varying the number of computed interactions that were taken into account for each sRNA. The plot shows that INTARNA and RNAUP are the methods performing best on prediction of sRNA targets. Both RNAHYBRID and RNAPLEX achieve a low sensitivity suggesting that these programs are suitable only to a limited degree for genome-wide sRNA target searches. Taking in consideration that TARGETRNA limits the number of reported putative interaction sites, it achieves a fairly high sensitivity at a low false positive rate, although only an alignment-like algorithm based on base pairing potential is used. However, it can be assumed that the program will perform worse on interactions that show lower sequence complementarity, but underlie more complex duplex formation rules. The curves show that INTARNA and RNAUP have a similar performance on predicting sRNA targets and perform best among all studied programs. However, there is a clear difference in the practical applicability of both programs (Fig. 4b). On an Intel Xeon 5160 (3.0 GHz) with 7.8 GB available RAM, a GcvB target search in all Salmonella target regions allowing a maximal interaction length of 80 nt takes 21 h and requires 33 MB RAM with INTARNA. The same search with RNAUP needs 95 h and 840 MB RAM. An increase of the maximal interaction length to 140 nt raises INTARNA's runtime to 29 h with unchanged memory usage, whereas RNAUP now requires 207 h and 4.3 GB RAM. Without a restriction on the interaction length, INTARNA takes 36 h and requires again 33 MB RAM. Since RNAUP requires a restriction, we limited the interaction size to the length of the sRNA. This causes exhaustion of the complete memory and, as a consequence, a crash of RNAUP. The dramatic increase of RNAUP's resource requirements results from its higher asymptotic complexity and impairs its applicability on normal work stations with limited available memory.
Fig. 4.
Performance of INTARNA. (a) Comparison of INTARNA and other leading methods in the prediction of targets on our test set of 18 sRNAs with experimentally verified targets. The sensitivity (true positive rate) is shown as a function of the false positive rate (1−specificity). For each prediction method, the target candidates for each sRNA were sorted by energy score. Each ROC curve was generated from the rate of true and false predictions, while varying the number of considered interactions per sRNA. (b) Comparison of resource requirements of INTARNA (including computation of ED-values) and RNAUP for a GcvB target search in Salmonella. Without restricting the interaction length, RNAUP uses up the available complete memory and, as a consequence, crashes.
Performance of INTARNA. (a) Comparison of INTARNA and other leading methods in the prediction of targets on our test set of 18 sRNAs with experimentally verified targets. The sensitivity (true positive rate) is shown as a function of the false positive rate (1−specificity). For each prediction method, the target candidates for each sRNA were sorted by energy score. Each ROC curve was generated from the rate of true and false predictions, while varying the number of considered interactions per sRNA. (b) Comparison of resource requirements of INTARNA (including computation of ED-values) and RNAUP for a GcvB target search in Salmonella. Without restricting the interaction length, RNAUP uses up the available complete memory and, as a consequence, crashes.Furthermore, it should be noted that the calculation of sensitivity and specificity is conservative, since we rely only on biochemically mapped interactions from literature. For instance, there are several unpublished experimentally verified targets among the top-ranked predictions of INTARNA (Jörg Vogel lab, personal communication) that have not been taken into consideration when evaluating the performance.
3.3 Prediction of the type of regulation
For all interactions of our test set, we also analyzed the regulatory outcome of the sRNA binding to its target mRNA. Many sRNAs regulate the translation of their target by changing the accessibility of the SD sequence where translation initiation occurs. Therefore, we studied the change ΔPU in the probability that the SD sequence is unpaired as a consequence of sRNA–target interaction. We predicted SD sequence locations within a region of 35-nt up- and downstream of the first base of the start codon for each gene following the approach of Starmer et al. (2006). Then, we determined whether the sRNA binding site predicted by INTARNA is at or close to the SD sequence. In this case, the SD sequence is inaccessible for ribosome binding. Otherwise, we calculated ΔPU for the gene. Supplementary Table 3 shows the results for all interactions of our test set.In 11 out of 18 examples, our method successfully predicted the type of translational regulation by the sRNA. For three of the remaining seven interactions, the SD sequence could either not be located (GcvB-STM4351) or was located at an incorrect position (MicF-ompF and OxyS-fhlA). Another sRNA, IstR, blocks translation of its mRNA target tisAB by binding 100-nt upstream of the start codon without inducing structural changes at the RBS. Instead, IstR blocks a ribosome standby site that is essential for translation initiation (Unoson and Wagner, 2007). The remaining three interactions all involve the sRNA GcvB. Its targets argT, livJ and gltI are bound upstream of the ribosome binding site, and the inhibitory activity cannot be directly explained by competition with ribosome binding. At least for the last example, translational repression by a simple interference model or by masking a ribosome standby site is unlikely (Sharma et al., 2007). Consequently, the regulation cannot be predicted by our model.
4 CONCLUSIONS
Although numerous regulatory ncRNAs have already been identified, the number of experimentally verified targets is much smaller. Consequently, computational target prediction is in great demand to restrain the list of putative targets.In this article, we presented INTARNA, a new method for the prediction of interactions between two RNAs based on minimization of an extended hybridization energy. Our algorithm accounts for two important features that influence the strength of RNA–RNA interactions: the accessibility of the interaction sites and the existence of a user-specified seed. In contrast to previous methods for the prediction of RNA–RNA interactions, both features are integrated in a general approach for arbitrary RNAs. Although INTARNA was applied to predict targets for bacterial sRNAs in this work, the program can readily be used to find other RNA–RNA interactions as well.The INTARNA target predictions for bacterial sRNA were compared to results of several state-of-the-art methods for the prediction of RNA–RNA interactions. INTARNA outperforms existing methods in the accuracy of the predicted interaction, and our method performs as well as the best existing program on finding putative sRNA targets, while the required CPU time and memory decrease drastically. Overall, the results show that our method is well suited both for general searches for putative target sites and the prediction of accurate RNA–RNA interactions. The comparison with RNAHYBRID, whose hybridization energy model is the basis of our more sophisticated extended hybridization energy, shows how the incorporation of the free energy required for making both interaction sites single stranded and the existence of a seed can improve the prediction quality. In addition, we were also able to successfully predict the regulatory effect on translation initiation for a number of sRNAs.The interaction between the OxyS sRNA and the fhlA mRNA is the only one in our test set with a discontinuous hybridization pattern. In fact, the two RNAs form kissing hairpins at two sites (Argaman and Altuvia, 2000). We are aware of two approaches that can be used to predict interactions consisting of such independent substructures (Aksay et al., 2007; Alkan et al., 2006; Pervouchine, 2004). However, these algorithms are rather expensive with a time complexity of O(n3m3). Further extensions of INTARNA could incorporate some basic ideas of those approaches to allow for prediction of more complex interactions with multiple sites.Many bacterial sRNA genes and their mRNA interaction sites are conserved in closely related bacteria (see e.g. Delihas, 2003; Udekwu et al., 2005). It can be assumed that the sRNA–target interaction mechanisms are conserved as well. Therefore, the performance on finding sRNA targets may be further improved by restricting the prediction to orthologous genes. A scoring scheme that accounts for interactions conserved between related species may be a promising extension of our combined energy score.
Authors: Alexander J Jones; Alan G Fast; Michael Clupper; Eleftherios T Papoutsakis Journal: Appl Environ Microbiol Date: 2018-07-02 Impact factor: 4.792