Tsukasa Fukunaga1,2, Michiaki Hamada1,3. 1. Faculty of Science and Engineering, Waseda University, Tokyo 169-8555, Japan. 2. Japan Society for the Promotion of Science, Tokyo 102-0083, Japan. 3. Computational Bio Big-Data Open Innovation Laboratory, AIST-Waseda University, Tokyo 169-8555, Japan.
Abstract
MOTIVATION: LncRNAs play important roles in various biological processes. Although more than 58 000 human lncRNA genes have been discovered, most known lncRNAs are still poorly characterized. One approach to understanding the functions of lncRNAs is the detection of the interacting RNA target of each lncRNA. Because experimental detections of comprehensive lncRNA-RNA interactions are difficult, computational prediction of lncRNA-RNA interactions is an indispensable technique. However, the high computational costs of existing RNA-RNA interaction prediction tools prevent their application to large-scale lncRNA datasets. RESULTS: Here, we present 'RIblast', an ultrafast RNA-RNA interaction prediction method based on the seed-and-extension approach. RIblast discovers seed regions using suffix arrays and subsequently extends seed regions based on an RNA secondary structure energy model. Computational experiments indicate that RIblast achieves a level of prediction accuracy similar to those of existing programs, but at speeds over 64 times faster than existing programs. AVAILABILITY AND IMPLEMENTATION: The source code of RIblast is freely available at https://github.com/fukunagatsu/RIblast . CONTACT: t.fukunaga@kurenai.waseda.jp or mhamada@waseda.jp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
MOTIVATION: LncRNAs play important roles in various biological processes. Although more than 58 000 human lncRNA genes have been discovered, most known lncRNAs are still poorly characterized. One approach to understanding the functions of lncRNAs is the detection of the interacting RNA target of each lncRNA. Because experimental detections of comprehensive lncRNA-RNA interactions are difficult, computational prediction of lncRNA-RNA interactions is an indispensable technique. However, the high computational costs of existing RNA-RNA interaction prediction tools prevent their application to large-scale lncRNA datasets. RESULTS: Here, we present 'RIblast', an ultrafast RNA-RNA interaction prediction method based on the seed-and-extension approach. RIblast discovers seed regions using suffix arrays and subsequently extends seed regions based on an RNA secondary structure energy model. Computational experiments indicate that RIblast achieves a level of prediction accuracy similar to those of existing programs, but at speeds over 64 times faster than existing programs. AVAILABILITY AND IMPLEMENTATION: The source code of RIblast is freely available at https://github.com/fukunagatsu/RIblast . CONTACT: t.fukunaga@kurenai.waseda.jp or mhamada@waseda.jp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Long non-coding RNAs (lncRNAs) play integral roles in diverse biological processes including transcriptional regulation (Kino ) and subnuclear structure formation (Naganuma and Hirose, 2013). The dysfunctions of many lncRNAs are associated with severe diseases such as diabetes and various cancers (Wapinski and Chang, 2011), and thus elucidating lncRNA functions is an important research topic. Although large-scale transcriptome analysis has revealed that more than 58 000 lncRNA genes are encoded by the human genome (Iyer ), most of these lncRNAs are still poorly characterized (de Hoon ).Sequence similarity search and RNA secondary structure similarity search have achieved substantial success in characterizing the function of protein-coding genes and small RNAs (sRNAs), respectively (Altschul ; Nawrocki and Eddy, 2013). However, these strategies are unsuitable for inferring the function of lncRNAs because lncRNAs frequently lack sequence and structure conservation (Cabili ). In contrast, the identification of interaction partners for each lncRNA should be a powerful approach for inferring the function of lncRNAs because lncRNAs function by being assembled with other proteins or RNAs (Hirose ). Several lncRNAs have been experimentally confirmed to regulate biological processes through their interactions with target RNAs. For example, Abdelmohsen et al. determined that lncRNA 7SL reduces p53 protein translation levels by binding TP53 mRNA (Abdelmohsen ). Gong and Maquat also discovered that lncRNA 1/2-sbsRNAs inhibit the translation of the interaction target RNA through a Staufen1-mediated mRNA decay process (Gong and Maquat, 2011). These examples show that the identification of lncRNA–RNA interactions is an important step in characterizing lncRNA functions.Several sequencing-based technologies have been developed for the experimental discovery of RNA–RNA interactions. RIA-seq (Kretz ) and RAP-RNA (Engreitz ) can identify target RNAs attached to an anchored RNA. Although these methods are outstanding technologies to exhaustively detect interaction targets of a specific lncRNA, repeating these experiments across many lncRNAs is labour intensive. In contrast, PARIS (Lu ), SPLASH (Aw ), LIGR-seq (Sharma ) and MARIO (Nguyen ) can comprehensively identify RNA–RNA interactions in vivo. However, the majority of the detected interactions have been related to ribosomal RNAs or small RNAs, and the number of identified lncRNA–RNA interactions has been limited. In addition, because most of the lncRNAs show tissue-specific expression patterns (Cabili ; Iyer ), these experiments on various tissues are necessary but they require quite hard work. Since the detection of genome-wide lncRNA–RNA interactions through experiments is difficult, computational prediction of lncRNA–RNA interactions is an indispensable technique.Szcześniak and Makalowska predicted entire lncRNA–RNA interactions across the human transcriptome using sequence similarity search without consideration of RNA secondary structure (Szcześniak and Makałowska, 2016). However, benchmarking results of RNA–RNA interaction predictions showed that omitting RNA secondary structure information decreases prediction accuracy (Lai and Meyer, 2016). To date, many RNA–RNA interaction prediction tools that consider RNA secondary structure have been proposed, e.g. IntaRNA (Busch ), RNAplex (Tafer and Hofacker, 2008; Tafer ) and RactIP (Kato ), and can detect sRNA interactions with high accuracy. However, as these programs were designed for detecting sRNA interactions, the computational costs are too high to predict lncRNA interactions comprehensively. To predict a comprehensive lncRNA interactome with consideration of RNA secondary structure, Terai ) firstly screened interaction candidates based on only sequence complementarity and then exhaustively predicted interactions using IntaRNA. Although their approach effectively narrowed down interaction candidates, it still required extensive computational resources to utilize IntaRNA. Therefore, a much faster RNA–RNA interaction prediction program that considers RNA secondary structure is required for further progress in lncRNA function estimation.In the present study, we developed an ultrafast RNA–RNA interaction prediction algorithm for comprehensive lncRNA interaction analysis. The key idea of our algorithm is the utilization of seed-and-extension approach, which is widely adopted in sequence homology search tools including BLAST (Altschul ). We implemented this high-speed algorithm as a program named RIblast, which detects seed regions using suffix arrays, and subsequently extends both ends of seed regions based on an RNA secondary structure energy model. While the prediction accuracies of RIblast were comparable to those of existing programs, RIblast was more than 64 times faster than existing tools.
2 Materials and methods
2.1 Concept of the RIblast algorithm
RIblast enumerates potentially interacting segments between a query RNA x and a target RNA y. RIblast uses two energies as the evaluation criteria to determine whether two segments, (x and y) in sequences x and y, intermolecularly interact: accessible energy and hybridization energy. Accessible energy is the energy required to prevent the segments from forming intramolecular base pairs and can be calculated by utilizing a partition function algorithm (McCaskill, 1990). Briefly, a segment with high accessible energies tends to not form intermolecular base pairs because the segment forms intramolecular base pairs (Fig. 1A). Hybridization energy is the free energy derived from intermolecular base pairs between two segments and can be calculated as the sum of stacking energies and loop energies in the formed base-paired structure based on a nearest-neighbour energy model (Fig. 1B). When calculating hybridization energies, intra-molecular base pairs are not taken into consideration. Here, we defined the interaction energy between two segments x and y as the sum of the accessible energy of x, accessible energy of y and hybridization energy between x and y. RIblast outputs two segments with a particularly low interaction energy as a detected RNA–RNA interaction. Note that RNAup (Mückstein ), IntaRNA (Busch ) and RNAplex-a (Tafer ) also predict RNA–RNA interactions based on this combination of hybridization energy and accessible energy, and each showed high prediction accuracies in a previous benchmarking test (Lai and Meyer, 2016).
Fig. 1
(A) A schematic illustration of the effect of accessible energies. While a segment with low accessible energy tends to form inter-molecular base pairs, a segment with high accessible energy tends not to form inter-molecular base pairs because such a segment tends to form intra-molecular base pairs. (B) Example of hybridization energy calculation. Hybridization energy can be calculated as the sum of stacking energies and loop energies in the formed base-paired structure. Generally, stacking energies stabilize RNA–RNA interactions but loop energies destabilize interactions. This calculation is based on Turner’s parameter. (C) Overview of the RIblast algorithm
(A) A schematic illustration of the effect of accessible energies. While a segment with low accessible energy tends to form inter-molecular base pairs, a segment with high accessible energy tends not to form inter-molecular base pairs because such a segment tends to form intra-molecular base pairs. (B) Example of hybridization energy calculation. Hybridization energy can be calculated as the sum of stacking energies and loop energies in the formed base-paired structure. Generally, stacking energies stabilize RNA–RNA interactions but loop energies destabilize interactions. This calculation is based on Turner’s parameter. (C) Overview of the RIblast algorithmFor all-to-all interaction predictions of lncRNAs, the calculation time of accessible energies scales linearly with the number of sequences. This is because accessible energies of an RNA sequence can be calculated independently of the other RNA sequences. On the other hand, the calculation time of hybridization energies is quadratic with the number of sequences when an all-to-all interaction prediction is conducted. This calculation is the obstacle to comprehensive lncRNA–RNA interaction prediction. The calculation of hybridization energy between two RNA segments is similar to the calculation of a local alignment score between two sequences (Tjaden ). Therefore, hybridization energy can be calculated based on a Smith-Waterman algorithm-like method. In the subject of local sequence alignment, many researches have been conducted to speed up the calculation of alignments. Seed-and-extension heuristic is one of the most successful approaches and has been adopted by many sequence alignment tools, such as BLAST (Altschul ) and LAST (Kiełbasa ). This method first finds short matching regions, which are called seeds, between a query and target sequence and subsequently extends alignments from both end points of the detected seeds. We recognized that the application of this approach to the calculation of hybridization energy should accelerate the computation speed considerably.
2.2 Methodology overview
RIblast implements two major steps: database construction and an RNA interaction search. Figure 1C shows the flowchart of the RIblast algorithm. In the database construction step, RIblast first calculates the accessible energy of each segment in the target RNA dataset. To speed up calculation, RIblast calculates approximated accessible energies. Second, target RNA sequences are reversed and concatenated with delimiter symbols inserted between the two sequences. Third, a suffix array of the concatenated sequence is constructed. The suffix array is an efficient text-indexing data structure that comprises a table of the starting indices of all suffixes of the string in alphabetical order. It can be constructed in linear-time relative to sequence length (Nong ; Shrestha ). Fourth, in order to speed-up the RNA interaction search, search results of short strings are exhaustively pre-calculated. Then, the approximated accessible energies, concatenated sequences, suffix array and search results of short strings are stored in a database.In the RNA interaction search step, RIblast first calculates approximated accessible energies and constructs a suffix array for a query sequence. Second, RIblast finds seed regions whose hybridization energy is less than a threshold energy T1 based on two suffix arrays of the query and the database. Third, the interaction energies of the detected seed regions are calculated by summation of hybridization energy and two accessible energies. In this step, RIblast removes seed regions whose interaction energies exceed 0 kcal/mol. Fourth, RIblast extends interactions from seed regions without a gap. RIblast terminates the extension when the duplexes is no longer formed in the extension. Fifth, the interactions that fully overlap with other interactions are removed. In addition, those interactions with interaction energies exceeding the threshold energy T2 are also excluded. Note that no interactions are removed if T2 is set to 0 kcal/mol. Finally, RIblast extends interactions from seed regions with a gap. RIblast terminates the extension when the duplexes is no longer formed in the extension.
2.3 The method for calculating accessible energy
We defined as a segment from position i to position j in an RNA sequence x. Here, we defined the accessible energy that is required to make the segment form a single-stranded structure as , and the accessible probability that the segment is single-stranded as . For a fixed segment length, Raccess can calculate accessible energies and accessible probabilities of all segments with using dynamic programming (Kiryu ). Here, N is the sequence length and W is the constraint of maximal distance between the bases that may form base pairs.However, RIblast requires the accessible energies of segments with arbitrary length, and the exhaustive calculation is computationally expensive. Therefore, RIblast uses approximated accessible energies instead of . This method was proposed in RNAplex-a (Tafer ). was defined as follows:
R represents the gas constant and T represents the absolute temperature (we used T = 310.15 K in this study). By this approximation, we only have to calculate the accessible energies of segments with length δ and . In addition, by restricting the minimum length of seeds to δ, we need not calculate accessible energies of segments whose length is less than δ.
2.4 Seed search
BLAST searches seeds with a fixed length, but this method is unsuitable for RNA–RNA interaction search from the viewpoint of RNA energy model. For example, in Andronescu’s energy model, the hybridization energy of a 6-mer seed consisting of only G-C base pairs is about -10 kcal/mol, but that consisting of only G-U base pairs is about -1 kcal/mol. The large difference in hybridization energies between seeds of the same length may depress the performance of tools. Therefore, RIblast adopts score-based seeds, as proposed in GHOSTX (Suzuki ). Our score-based seeds were defined as the perfect base-pairing region whose hybridization energy is less than the threshold energy T1 and length is at least δ.In RIblast, the enumeration of duplexes satisfying the seed criteria and the search of the seeds in the database and the query are conducted simultaneously based on the suffix array and the depth-first search. Supplementary Figure S1 shows the schematic illustration of the seed search. First, RIblast searches for a single inter-molecular base pair such as G-C. If this pair is found in the query and the database, then RIblast extends the base pair by one base pair as GG-CC, GC-CG, …, GU-CG and then checks whether these extended strings are found in the query and the database. If the extended strings are detected and meet the conditions for score-based seeds, then RIblast stores the string pair as a seed. If extended strings are detected but do not meet the conditions for score-based seeds, then RIblast extends the strings by one base pair again and repeats this step. If extended strings are not detected, then the extension is stopped. The search of extended duplexes in the database and the query is efficiently calculated based on the suffix array (Shrestha ). To avoid overly long seeds, we restricted the max seed length length. Supplementary Figure S2 shows the pseudo-code of the RIblast seed search algorithm. Here, S and S are the query sequence and the reversed and concatenated database sequence, respectively. SA and SA are the suffix arrays of S and S, respectively. seed and seed represent the temporary seeds for the query and database, respectively. sp, ep, sp and ep are the indices of SA and SA. The SASearchNextString function returns the indices of the new extended string in a suffix array. If the string exists in the query and the database, then the returned sp() is smaller than the returned ep().In order to accelerate this seed search step, we pre-calculate the indices of the strings whose length is shorter than l for a database sequence in the database construction step. The results of the short sequence search are used on the database sequences. Therefore, this binary search of the suffix array is needed only for the search of query sequences or long strings in the database sequence.
2.5 Extension
After the seeds are found in the query and the database, RIblast tries to extend interactions from both end points of these seed regions. The gapless extension is first conducted, and then the gapped extension is performed in a similar way to BLAST, LAST and GHOSTX.RIblast first extends interactions without a gap from seed regions. If extended interactions have lower interaction energies than the present minimum interaction energy in this extension step, then RIblast updates the minimum interaction energy. Otherwise, extensions are repeated. If RIblast extends Y nucleotides from the length that requires the minimum interaction energy in this extension but the energy has not been updated, then RIblast terminates the gapless extension. In this step, we assume that the possible complementary bases always interacts with each other. After the gapless extension step, if two interactions and satisfy the conditions and , then we exclude the later interaction. In addition, if the interaction energy of an interaction exceeds threshold T2, then we remove the interaction.Next, RIblast tries to extend interactions with a gap. Like the gapless extension step, if the interaction energy of extended interactions is lower than the present minimum interaction energy in this extension step, then the minimum interaction energy is updated. If RIblast extends X nucleotides from the length that requires the minimum interaction energy in this extension but the energy has not been updated, then RIblast terminates the gapless extension. The calculation of the interaction energy of extended interactions is as follows (Supplementary Fig. S3 shows the schematic illustration). Here, we regard as an interaction after gapless extension. In the extension towards the end of the query sequence (and end of the database sequence), RIblast calculates , which is the minimum interaction energy for sequences and , as the following equation.
where indicates the free energy of the loop consisting of base pairs (a, b) and (c, d) and n is the sequence length of the database sequence. Here, and are satisfied. In addition, the internal loop size is restricted to within X. This formula is the same as the calculation formula of interaction energy in RNAup and IntaRNA. The extension in the opposite direction is calculated in the same manner. Dangling energies are added only after gapped extensions are finished.
2.6 Evaluation method
We assessed the performance of RIblast using three evaluation methods: base pair prediction performance, bacterial sRNA target prediction performance, and lncRNA TINCR target prediction performance.
2.6.1 Method for evaluating base pair prediction performance
To evaluate the base pair prediction performance, we used 109 validated bacterial sRNA-mRNA pairs and 52 validated fungal snoRNA-rRNA pairs as datasets, which were constructed by Lai and Meyer (Lai and Meyer, 2016). The bacterial sRNA-mRNA interaction dataset was composed of 64 E.coli and 45 Salmonella enterica interactions as well as 18 query sRNAs and 82 target mRNAs. Following the benchmark research of Lai and Meyer (Lai and Meyer, 2016), we used the sequences between 150 bp upstream and 150 bp downstream of each start codon as the target sequences. All fungal snoRNA-rRNA interactions in the dataset were S. cerevisiae C/D box interactions, and these interactions were between 43 snoRNAs and 2 rRNAs. For target rRNAs, full rRNA sequences were used. We selected IntaRNA and RNAplex-a as the software to compare with RIblast although there are many programs to predict RNA–RNA interactions. This is because both these programs consider accessible energies as with RIblast and show the best prediction performance in current tools (Lai and Meyer, 2016). The command line options used for IntaRNA and RNAplex-a were the same as those used by previous benchmark research (Lai and Meyer, 2016). As the energy parameter characterizing RNA secondary structures, we used two energy parameters, Turner’s 2004 parameter (Mathews ) and Andronescu’s BL* parameter (Andronescu ). Because IntaRNA did not have an option to change the energy parameter, we used only the default Turner’s 1999 parameter (Mathews ) in the IntaRNA evaluation. TPR, PPV and MCC were calculated for each RNA–RNA interaction, and the averaged scores were evaluated. The definitions of these three scores are as follows: TPR = TP/(TP + FN), PPV = TP/(TP + FP), and
We only evaluated the minimum energy interactions.
2.6.2 Method for evaluating sRNA target prediction performance
We evaluated the sRNA target prediction performance by predicting all-to-all interactions between 18 sRNAs and all 4319 E.coli mRNAs. This evaluation method was originally proposed by Richter and Backofen (2012). As target mRNA sequences, we used sequences between 150 base-pairs (bp) upstream and 50 bp downstream from each start codon. This sequence length setting is the same as that used by Terai ). The sequence data were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/nuccore/NC_000913). We used 64 experimentally validated interactions as positive data, which were also used to evaluate base pair prediction performance. As negative data, we used all non-positive interactions in all-to-all interactions between 18 sRNAs and all 4319 E.coli mRNAs.
2.6.3 Method for evaluating lncRNA target prediction accuracy
To evaluate the human lncRNA target prediction performance, we used an RIA-seq-based TINCR interaction dataset (Kretz ). We used the same dataset and evaluation method as Terai ). The dataset was composed of 5195 target RNAs and 1062 RNAs among them that interact with TINCR at one or more interacting segments. The target RNAs that have more interacting segments are more likely to be TINCR-interacting RNAs. As positive data, we used RNAs that at least had a threshold number of the interacting segments. When this threshold was set to 1, 2, 3, 4 and 5 interactions, the numbers of positive data were 1062, 434, 191, 104 and 65, respectively. Instead of comparing RIblast to IntaRNA or RNAplex-a, we compared the performance of RIblast with those of the Terai et al. pipeline (Terai ) and LAST (Kiełbasa ), a fast local alignment tool. This is because lncRNA target predictions by IntaRNA and RNAplex-a have heavy computational costs. As LAST alignment parameter, we used the parameter settings used by Szcześniak and Makalowska (2016). We regarded the alignment score × (–1) as the interaction energy. The simple repeat regions were masked by TANTAN (Frith, 2010) with the default options. In order to match our study with previous research by Terai et al., we excluded masked regions in the seed search step but considered them in the RIblast extension step. The short summary of the Terai et al. pipeline is as follows. First, accessible energies were calculated by Raccess, and inaccessible regions were removed. Second, pairs of complementary gapless subsequences were detected as interaction regions by LAST. Finally, the interaction energies of the interaction regions were calculated by IntaRNA.
2.7 Assesment of parameter settings
RIblast uses multiple parameters in the algorithm, and the appropriate parameters have to be set in order to predict RNA–RNA interaction accurately and efficiently. Because the influence of these parameters on the performance is not independent, the simultaneous optimization of all parameters is desirable. However, there are too many parameters in RIblast, it is difficult to optimize all parameters simultaneously. Therefore, we determined parameter settings as follows. First, we fixed length, which is the parameter of a maximum seed length, to 20 because this parameter does not influence software performance when the value is sufficiently large. Second, l, which is the parameter for pre-search of seed in the database construction step, was fixed to 8. This is because this parameter influences only the calculation speed of seed search step and the calculation time of this step is very short compared to the total calculation time. Third, we investigated the influence of parameter T1 and X on the base pair prediction performance of the bacterial sRNA-mRNA dataset with the other parameters fixed. T1 is a threshold energy for seed detection, and X is a threshold length for extension termination. We adopted the parameter combination that yielded the highest MCC score. If there were several parameter combinations with the best performance, we adopted the smallest X and largest parameter combination in order to accelerate the computation. In this step, W, δ, Y and T2 is set to 70, 5, 5, and 0, respectively. The parameter setting of W is the same as that of a benchmark research for RNA–RNA interaction prediction tools (Lai and Meyer, 2016). Fourth, we assessed the effect of parameter W, δ and Y on the base pair prediction performance of the bacterial sRNA-mRNA dataset with determined X and T1 values. Fifth, to determine values for the parameter T2, we examined the dependence of lncRNA target prediction accuracy decreases from area under the receiver operating characteristic curve (AUROC) scores of SUMENERGY on T2. We used AUC scores of -16 kcal/mol and -8 kcal/mol as interaction energy thresholds for SUMENERGY when the energy parameters were Turner and Andronescu parameters, respectively.
3 Results
3.1 Evaluation of basepair prediction performance
We investigated base pair prediction performance by evaluating whether programs predict correct base pairs between two RNAs with experimental interaction evidence.We firstly determined the values of the parameter T1 and X in RIblast based on the base pair prediction performance of the bacterial sRNA-mRNA dataset. The performances of various T1 and X values were investigated, and the parameter set that yielded the best performance was adopted (Supplementary Tables S1 and S2). These determined values of T1 and X were used in the following analyses. T2 was set to 0 kcal/mol in this evaluation. As a result, we set X and T1 to 18 and -10.0, respectively, when we used Turner’s model, and we set X and T1 to 16 and -6.0, respectively, when we used Andronescu’s model. Next, we assessed the effect of parameter W, δ, and Y on the base pair prediction performance with determined X and T1 values. (Supplementary Tables S3–S5). In consequence, we set W, δ and Y to 70, 5 and 5, respectively.Tables 1 and 2 show the evaluation results of base pair prediction performance. For the bacterial sRNA-mRNA dataset, RIblast with Andronescu’s parameter achieved the best PPV (0.73) and MCC (0.67) performance. The best TPR score was obtained by IntaRNA (0.66). For the fungal snoRNA-rRNA dataset, RNAplex-a with Andronescu’s parameter was the best performing tool according to all three accuracy measures (TPR, 0.74; PPV, 0.69; MCC, 0.71), and was followed by RIblast using Andronescu’s parameter (TPR, 0.66; PPV, 0.60; MCC, 0.62). In both datasets, tools using Andronescu’s parameter showed superior performance to the same tool with Turner’s parameter.
Table 1
The base pair prediction performance on the bacterial sRNA dataset
Program
TPR
PPV
MCC
IntaRNA
0.66
0.61
0.62
RNAplex-a (Turner)
0.63
0.56
0.58
RNAplex-a (Andronescu)
0.60
0.68
0.63
RIblast (Turner)
0.58
0.66
0.61
RIblast (Andronescu)
0.63
0.73
0.67
Note: The columns correspond to the three evaluation criteria. The rows indicate the performance of each program. The bold values are the highest scores in each column.
Table 2
The base pair prediction performance on the fungal snoRNA dataset
Program
TPR
PPV
MCC
IntaRNA
0.61
0.53
0.56
RNAplex-a (Turner)
0.56
0.49
0.52
RNAplex-a (Andronescu)
0.74
0.69
0.71
RIblast (Turner)
0.57
0.49
0.53
RIblast (Andronescu)
0.66
0.60
0.62
Note: The columns correspond to the three evaluation criteria. The rows indicate the performance of each program. The bold values are the highest scores in each column.
The base pair prediction performance on the bacterial sRNA datasetNote: The columns correspond to the three evaluation criteria. The rows indicate the performance of each program. The bold values are the highest scores in each column.The base pair prediction performance on the fungal snoRNA datasetNote: The columns correspond to the three evaluation criteria. The rows indicate the performance of each program. The bold values are the highest scores in each column.
3.2 Evaluation of sRNA target prediction accuracy
Although the information of interacted base pair is useful for designing experiments for further functional analyses of the RNAs, what most of the users would like to know may be whether an RNA is a target or not. Therefore, we evaluated bacterial sRNA target prediction performance by validating whether the predicted interaction energies of positive sRNA-mRNA interactions are lower than those of negative sRNA-mRNA interactions. After all-to-all interaction between mRNAs and sRNAs were predicted by each software, we sorted mRNAs for each sRNA by the minimum interaction energy. Then, we plotted ROC-like curves whose x- and y-axes were the number of true positive predictions and the total number of target predictions per sRNA, respectively. The parameter T2 was also set to 0 kcal/mol in this evaluation.Figure 2 shows the bacterial sRNA target prediction performance. The best performing tool was RNAplex-a with Andronescu’s parameter. The prediction performance of RIblast with Turner’s parameter was slightly lower, but RIblast with Andronescu’s parameter showed similar performance to the other programs.
Fig. 2
The performance of bacterial sRNA target prediction. The x- and y-axes represent target prediction numbers per sRNA and true positives, respectively.
The performance of bacterial sRNA target prediction. The x- and y-axes represent target prediction numbers per sRNA and true positives, respectively.
3.3 Evaluation of lncRNA TINCR target prediction accuracy
Third, we validated human lncRNA target prediction performance by comparing predicted interactions of human lncRNA TINCR with interactions experimentally validated by RIA-seq (Kretz ). After all interactions between TINCR and target RNAs were calculated, we sorted target RNAs based on the minimum interaction energy among all predicted interactions in the target RNA (denoted by MINENERGY) or the sum of the interaction energies that are lower than some threshold value in the target RNA (denoted by SUMENERGY). Then, we calculated AUROC scores using the pROC R package (Robin ).Supplementary Table S6 shows AUROC results for MINENERGY sorting. LAST, Terai et al. pipeline and RIblast exhibited performances that were similar to each other in this case. On the other hand, Figure 3 and Supplementary Tables S7–S9 show AUROC scores for SUMENERGY sorting. This result illustrates that SUMENERGY sorting performs better than MINENERGY sorting among all methods. This result is consistent with at least one previous study (Terai ). In addition, for SUMENERGY sorting, RIblast achieved higher AUROC scores than the other methods for any threshold number of interacting segments. Unlike the evaluation of base pair prediction or sRNA target prediction performance, there was no difference in performance between Turner’s and Andronescu’s parameters. Finally, to obtain the appropriate parameter T2, we investigated the influence of T2 on TINCR target prediction accuracy (Supplementary Tables S10–S11). Although lower T2 values cause faster computation speed with lower prediction accuracy, the accuracy was robust to the T2 parameter setting. We set T2 to -6 and -4 when the energy models were Turner’s and Andronescu’s parameters, respectively.
Fig. 3
The performance of human lncRNA TINCR target prediction. The x-axis represents the threshold number of interacting segments in the positive data. The y-axis represents the AUROC score.
The performance of human lncRNA TINCR target prediction. The x-axis represents the threshold number of interacting segments in the positive data. The y-axis represents the AUROC score.
3.4 Evaluation of running time
We finally evaluated the computational speed of RIblast by comparing its run time with the times required for IntaRNA, RNAplex-a and Terai et al. pipeline. We excluded joint secondary structure prediction using RactIP (Kato ) in the Terai et al. pipeline (Terai ) because this step does not affect prediction accuracy. The calculation time for RNAplex-a included the run time of accessibility calculation by RNAplfold (Bernhart ). The query and target sequences were randomly selected from human lncRNAs and mRNAs in Gencode version 24, respectively (Harrow ). Then, all-to-all interaction predictions between query and target sequences were conducted. The computation was performed on an Intel(R) Xeon E5 2670 2.6GHz CPU with 4 GB of memory. Table 3 and Figure 4 show the computational times depended on the dataset size for each software tool. In all cases, RIblast was much faster than the other programs. As the dataset size increased, the speed advantage over the other programs became quite large. In particular, when the dataset consisted of 500 lnRNAs and mRNAs, RIblast was 64-fold and 73-fold and faster than the Terai et al. pipeline and RNAplex-a, respectively.
Table 3
The results of the run time evaluation on partial human lncRNA and mRNA datasets
The number of lncRNAs and mRNAs
Program
5 × 5
10 × 10
50 × 50
100 × 100
500 × 500
IntaRNA
59 m 04 s (126.6)
3 h 30 m 17 s (252.3)
− (-)
− (-)
− (-)
RNAplex-a
2 m 34 s (5.5)
10 m 37 s (12.8)
4 h 20 m 42 s (47.0)
17 h 56 m (63.4)
19 d 20 h 53 m (74.1)
Terai et al. pipeline
1 m 02 s (2.2)
3 m 08 s (3.8)
2 h 26 m 43 s (26.4)
14 h 50 m (52.4)
17 d 09 h 44 m (64.9)
RIblast
28 s (1.0)
50 s (1.0)
5 m 33 s (1.0)
17 m (1.0)
6 h 26 m (1.0)
Note: The columns correspond to the number of lncRNAs assessed in the dataset. The rows indicate the run times and the run time ratio of each program to RIblast. The symbol ‘−’ indicates that we did not investigate the computational speed for a particular combination of dataset size and program because the calculation time was prohibitively long.
Fig. 4
The results of the run time evaluation on partial human lncRNA and mRNA datasets. The x-axis represents the number of lncRNAs and mRNAs. The y-axis represents the runtime ratio of each program to RIblast. The dotted line, the dashed line and the solid line represent the performances of IntaRNA, RNAplex-a and RIblast, respectively
The results of the run time evaluation on partial human lncRNA and mRNA datasetsNote: The columns correspond to the number of lncRNAs assessed in the dataset. The rows indicate the run times and the run time ratio of each program to RIblast. The symbol ‘−’ indicates that we did not investigate the computational speed for a particular combination of dataset size and program because the calculation time was prohibitively long.The results of the run time evaluation on partial human lncRNA and mRNA datasets. The x-axis represents the number of lncRNAs and mRNAs. The y-axis represents the runtime ratio of each program to RIblast. The dotted line, the dashed line and the solid line represent the performances of IntaRNA, RNAplex-a and RIblast, respectively
4 Discussion
In this study, we developed RIblast, which is an RNA–RNA interaction prediction algorithm based on the seed-and-extension approach. RIblast is the fastest software that can be applied to large-sclae lncRNA datasets.We used an interaction energy cutoff to exclude likely incorrect predictions, but this method may be highly arbitrary. As such, we should determine the reliability of the predicted interactions based on a statistical score like the e-values generated by BLAST. Rehmsmeier et al. developed a calculation method for the statistical significance of predicted RNA–RNA interactions (Rehmsmeier ). However, their calculation method cannot be applied to our software directly because the method did not consider the effect of accessible energies. Therefore, we need to develop an e-value calculation method for RIblast’s predicted interactions.While the previous benchmarking paper of bacterial sRNA target prediction found that IntaRNA shows the superior performance to RNAplex (Pain ), these two programs show almost the same performances in our evaluation. This difference of evaluation may be caused by the difference of some experimental conditions such as the benchmark dataset or the version of software. Specifically, the difference of energy parameter can cause large difference of evaluation results. The previous paper uses Turner’s 1999 parameter for the energy parameter in RNAplex, but we use Turner’s 2004 parameter for that. Therefore, RNAplex in our evaluation should show high performance than RNAplex in the previous paper.Although Hajiaghayi et al. reported that the accuracy of RNA secondary structure prediction with Andronescu’s parameter outperforms those that use other energy parameters (Hajiaghayi ), we provide the first report that Andronescu’s parameter also delivers superior performances compared with Turner’s parameter in small RNA–RNA interaction predictions. Currently, major miRNA or snoRNA target prediction tools (Agarwal ; Betel ; Tafer ) utilize Turner’s parameter, and thus the application of Andronescu’s parameter to these programs may improve the accuracies.RIblast efficiently predicts RNA–RNA interaction, but further acceleration is an essential task. Considering that the seed-and-extension approach contributes to the acceleration of RNA–RNA interaction predictions, other acceleration techniques in sequence homology search may be effective for RNA–RNA interaction predictions. Specifically, parallelization is a promising technique. At present, many parallelization methods (Rognes, 2011; Suzuki , 2016) have been proposed for sequence homology search and have successfully speed up calculation.While typical mRNAs tend to be localized in the cytoplasm, typical lncRNAs tend to be localized in the nucleus (Ulitsky and Bartel, 2013). This tendency may suggest that lncRNAs exert their functions by interacting with pre-mRNAs (Engreitz ). Thus, interaction prediction between lncRNAs and pre-mRNAs is a fascinating research topic, but the current RIblast cannot be applied to this task. This is because the accessible energy calculation of pre-mRNAs by the Raccess algorithm is computationally difficult for too long RNA sequences. For this purpose, we will integrate the ParasoR (Kawaguchi and Kiryu, 2016), which can calculate accessible energies for quite long RNAs, with RIblast.The evolution of lncRNA is a hot topic (Ulitsky, 2016). Although the majority of lncRNAs are lineage-specific, a thousand human lncRNAs have homologs (Hezroni ). In addition, Ngueyn et al. revealed that experimentally validated RNA–RNA interaction sites are evolutionarily conserved (Nguyen ). These results suggest that the interaction relationships between lncRNA and RNA are widely conserved among species. We aim to validate this hypothesis by comparing RIblast-based lncRNA interactome networks between species.Click here for additional data file.