Prediction of conserved long-range RNA-RNA interactions in full viral genomes.

Abstract

MOTIVATION: Long-range RNA-RNA interactions (LRIs) play an important role in viral replication; however, only a few of these interactions are known, and only for a small number of viral species. Up to now, it has been impossible to screen a full viral genome for LRIs experimentally or in silico. Most known LRIs are cross-reacting structures (pseudoknots) undetectable by most bioinformatical tools.
RESULTS: We present LRIscan, a tool for LRI prediction in full viral genomes based on a multiple genome alignment. We confirmed 14 out of 16 experimentally known and evolutionarily conserved LRIs in genome alignments of HCV, Tombusviruses, Flaviviruses and HIV-1. We provide several promising new interactions, which include compensatory mutations and are highly conserved in all considered viral sequences. Furthermore, we provide reactivity plots highlighting the hot spots of predicted LRIs.
AVAILABILITY AND IMPLEMENTATION: Source code and binaries of LRIscan are freely available for download at http://www.rna.uni-jena.de/en/supplements/lriscan/, implemented in Ruby/C++ and supported on Linux and Windows.
CONTACT: manja@uni-jena.de
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Substances：
RNA, Viral

Year: 2016 PMID： 27288498 PMCID： PMC7189868 DOI： 10.1093/bioinformatics/btw323

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Long-range RNA-RNA interactions (LRIs) have been marginally reported in various positive strand RNA viruses like Tombusvirus [8 in Carnation Italian Ringspot Virus (CIRV) and Tomato Bushy Stunt Virus (TBSV)], Hepacivirus [5 in Hepatitis C Virus (HCV)], Coronavirus [1 in Transmissible Gastroenteritis Virus (TGEV)], Flavivirus [3 in Dengue Virus (DENV) and West Nile Virus (WNV)], Luteovirus [2 in Barley Yellow Dwarf Virus (BYDV)], Aphthovirus [2 in Foot-and-Mouth Disease Virus (FMDV)], Pestivirus [1 in Classical Swine Fever Virus (CSFV)] and Human Immunodeficiency Virus (5 in HIV-1) (Abbink and Berkhout, 2003; Andersen ; Beerens and Kjems, 2010; Huthoff and Berkhout, 2001; Ooms ). According to their definition, a long-range interaction spans distances between a few hundred and several thousands of nucleotides (>26 kb in TGEV). LRIs are often located in loop regions or internal bulges of local RNA structures (known as cis-acting regulatory elements) and therefore build pseudoknot-like structures.

Various programs have been developed for general RNA-RNA interaction prediction, which can be classified into five groups (Kato ; Seemann ). The first group neglects intramolecular base-pairs and is based on the hybrid's minimum free energy (MFE). Members of this group are RNAduplex and RNAplex (Lorenz ; Tafer and Hofacker, 2008) or RNAhybrid (Rehmsmeier ). The second category includes RNAcofold (Bernhart ) and PairFold (Andronescu ). These tools concatenate two interacting RNA sequences and calculate the MFE of the joint RNA sequences. The third group considers intramolecular and intermolecular RNA-RNA interactions in separate steps; however, only one binding site is predicted. Members of this group are IntaRNA (Busch ) or RNAup (Mückstein ). The fourth group considers more complex RNA-RNA interactions and also allows more than one binding site. This group includes tools like RactIP (Kato ), inteRNA (Alkan ) or inRNAs (Salari ). The final group contains e.g. PETcofold (Seemann ), RNAaliduplex (Lorenz ), IRBIS (Pervouchine, 2014), ripalign (Li ) and simulfold (Meyer and Miklós, 2007). Unlike the tools mentioned earlier, these tools do not consider only a pair of single sequences but use multiple sequence alignments as input. With this comparative method, it is possible to reduce the false-positive rate by incorporating evolutionarily conserved information.

All of these programs have properties that make them unsuitable for viral genomes. For example, RNAaliduplex is unable to predict pseudoknots and neglects intramolecular RNA foldings. PETcofold considers both intra- and intermolecular interactions as well as pseudoknots, but by default returns only a single secondary structure, which makes the detection of multiple functional binding sites impossible. Ripalign's running time makes the program not applicable to viral sequences, and IRBIS is only applicable to predictions of RNA interactions related to RNA splice sites. Up to now, we are aware of a single program that is designed for LRI prediction, called CovaRNA (Bindewald and Shapiro, 2013). This tool detects long-range nucleotide covariation from multiple sequence alignments of eukaryotic genomes using an index-based algorithm to find clusters of covarying base-pairs. The extended function CovStat determines the statistical significance of observed covariation clusters. CovaRNA has very strict filter criteria and is therefore very conservative in predicting LRIs. For short genomes, such as viral genomes, this leads to almost no predicted interactions.

Here, we present LRIscan for detecting LRIs in complete viral genome alignments without prior knowledge. Sparse alignment interaction dotplots in combination with RNAalifold (Bernhart ) secondary structure foldings based on MFE calculations are used to predict possible LRIs. We add several filter steps and scoring functions to reduce false positive candidates. We confirm 14 out of 16 experimentally known and evolutionarily conserved LRIs in HCV, Tombusviruses, Flaviviruses and HIV-1. We predict several promising new interactions with multiple compensatory mutations that are highly conserved in all considered viral sequences.

2 Materials and methods

With LRIscan, we propose for the first time a method for conserved genome-wide LRI prediction in viral genomes based on a multiple sequence alignment. LRIscan is based on the C-library of the ViennaRNA Package 2.0 (Lorenz ). The pipeline consists of four basic steps (see workflow Fig. 1):

1. Calculate alignment coverage and complexity.
2. Find LRI seeds with a sparse dotplot method.
3. Filter LRI candidates based on MFE; calculate z-score/P-value and compensatory score.
4. Extend seed interaction.

Fig. 1.

LRIscan workflow. (A) Coverage (dark gray) and complexity (light gray) of the entire alignment. Only regions which pass both the coverage-threshold (dotted line) and the complexity-threshold (dashed line) are considered for further calculations. (B) Dotplot containing all possible seed interactions (gray and black lines) without gaps and a given minimum interaction length. We calculate only interactions with a distance of at least $w$ (the minimum distance between the interacting columns, default 100 nt). To decrease the run time, multiple CPUs are used, each calculating only a specific range of the dotplot matrix. All ranges overlap by the minimum seed length. (C) For all seeds, we calculate the MFE of the alignment with RNAalifold. For each seed passing the MFE threshold (black dots in B), we calculate a P-value based on the z-score and the compensatory score as defined in the methods. (D) Seeds are extended toward both sides. An extended z-score/P-value and compensatory score is calculated. (Color version of this figure is available at Bioinformatics online.)

The input of LRIscan is a nucleotide alignment $A$ of length $n$ with $m$ sequences. By $A_i$, we denote the i-th column of the alignment. Entry $a_{i,k}$ is the k-th row of column $i$. We define the alignment coverage for each column $A_i$ as the percentage of nucleotides over all $m$ sequences, without gaps. We introduce the pairing matrix $\Pi$ with entries $\Pi_{i,j} = 1$ if at least a fraction $t$ (default $t = 0.95$ for more than 100 sequences, otherwise default $t = 0.80$) of the corresponding sequences can form a base-pair $(a_{i,k}, a_{j,k})$, otherwise $\Pi_{i,j} = 0$.

2.1 Alignment cleaning, coverage and complexity

To improve the alignment quality for the nucleotide folding, rarely occurring IUPAC nucleotide ambiguities are replaced by the most frequent valid nucleotide of the same alignment column. We introduce the coverage matrix $\Phi$ with entries $\Phi_{i,j} = 1$ if the coverage of columns $A_i$ and $A_j$ is greater than the minimum coverage defined by the user (default 0.5), otherwise $\Phi_{i,j} = 0$. Let $\delta$ be the compression function replacing stretches of identical nucleotides by a single nucleotide, e.g. $\delta(\mathrm{AAAUUG}) = \mathrm{AUG}$. The complexity for each alignment column $A_i$ is stored in the complexity matrix $C$, whose definition involves $\delta$ and $s$, where $s$ is the minimum seed interaction length (default: 5 bp). With the complexity score, we avoid calculations of regions with gaps or low complexity, such as poly-A/U stretches. A minimum coverage- and complexity-threshold can be defined by the user (default: coverage = 0.5, complexity = 0.5), with a direct effect on run time.
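A short Python sketch may help to make the coverage and compression steps concrete. The coverage and δ functions follow the definitions in the text; the complexity ratio in `window_complexity` is an assumption introduced here for illustration, because the exact complexity formula is not given in this text. All function names are hypothetical.

```python
from itertools import groupby

GAP = "-"

def column_coverage(alignment, i):
    """Fraction of sequences that have a nucleotide (no gap) in column i."""
    column = [seq[i] for seq in alignment]
    return sum(1 for c in column if c != GAP) / len(column)

def compress(seq):
    """Compression function delta: collapse runs of identical nucleotides."""
    return "".join(base for base, _ in groupby(seq))

def window_complexity(seq, start, s=5):
    """ASSUMED complexity measure: compressed length of a window of s
    nucleotides divided by the window length. Only a plausible stand-in."""
    window = seq[start:start + s]
    return len(compress(window)) / len(window)

alignment = ["AAAUUG", "AA-UUG", "A--UCG", "AAAUUG"]
print(column_coverage(alignment, 2))    # 2 of 4 sequences have a nucleotide
print(compress("AAAUUG"))               # runs collapse to single nucleotides
print(window_complexity("AAAAAGC", 0))  # a poly-A window gets a low score
```

Under this assumed measure, a poly-A window scores 1/s and would fall below the default complexity threshold of 0.5.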

2.2 LRI seed detection

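To find LRIs, we use a dotplot calculation combined with several base-pairing criteria computed by RNAalifold (Bernhart ). To efficiently identify interacting seed regions, we initialize the dotplot seed matrix $S$ with $S_{i,j} = 0$. Each entry $S_{i,j}$ with minimum distance $w$ between column $i$ and $j$ (default $w = 100$ nt) is calculated following the recursion

$$S_{i,j} = \begin{cases} S_{i-1,j+1} + 1 & \text{if } \Phi_{i,j} = 1 \text{ and } \Pi_{i,j} = 1, \\ 0 & \text{otherwise,} \end{cases}$$

where $\Phi$ is the coverage matrix and $\Pi$ the pairing matrix (PHI and PI in Pseudocode 1). An entry $S_{i,j}$ will be considered as seed candidate if (i) $S_{i,j} \geq s$ and $S_{i+1,j-1} = 0$, i.e. the stack of consecutive base-pairs reaches the minimum seed length and cannot be extended further, (ii) the MFE of the seed alignment, calculated by RNAalifold, is smaller than the maximum MFE defined by the user (default −10 kcal/mol) and (iii) the mean sequence complexity of the seed is greater than the user defined threshold (see black dots in Fig. 1B). To speed up the seed finding, which needs quadratic time $O(n^2)$, multiple CPUs can be used. Each CPU calculates a row of 100 nt (Fig. 1B) overlapping by the minimum seed length $s$. To save memory, we store only the last valid entry for each seed in a hash.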
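The diagonal recursion of the seed matrix can be illustrated with a small Python sketch. It is deliberately simplified and makes several assumptions: it works on a single sequence instead of an alignment, it omits the coverage, complexity and MFE filters, it uses a tiny minimum distance for the demo, and it collects maximal stacks in a second pass rather than returning them inside the loop. The name `find_seeds` is hypothetical.

```python
# Simplified sketch of the sparse dotplot recursion; not the LRIscan implementation.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def find_seeds(seq, w=4, min_seed=3):
    """Return maximal stacks as (i_start, j_start, length), where the stack is
    (i_start, j_start), (i_start + 1, j_start - 1), ... with j - i >= w."""
    n = len(seq)
    S = {}  # sparse matrix: only non-zero entries are stored
    for i in range(n):
        for j in range(i + w, n):
            if (seq[i], seq[j]) in PAIRS:
                # extend the stack ending at the outer neighbour (i-1, j+1)
                S[(i, j)] = S.get((i - 1, j + 1), 0) + 1
    seeds = []
    for (i, j), length in sorted(S.items()):
        # keep a stack only if it is long enough and cannot be extended inward
        if length >= min_seed and (i + 1, j - 1) not in S:
            seeds.append((i - length + 1, j + length - 1, length))
    return seeds

print(find_seeds("GGGAAAAAAACCC"))  # one G-C stack of three base-pairs
```

Storing only non-zero entries in a dictionary keeps memory proportional to the number of pairing column pairs, in the spirit of the hash mentioned in the text.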

2.3 LRI seed scoring

For each LRI, we calculate the MFE with RNAalifold. Sequences including only gaps are neglected. Based on the MFE, we calculate a z-score to determine the reliability of each LRI, compared with a randomly sampled alignment. The z-score can be calculated for each predicted LRI as

$$z = \frac{X - \mu}{\sigma},$$

where $X$ is the MFE of the corresponding interaction, $\mu$ the mean MFE and $\sigma$ the corresponding standard deviation of the z-times randomly swap-shuffled alignments.

Pseudocode 1. Algorithm to find seed interactions in a multiple genome alignment.

    S[i,j] = 0
    for i in 1…n−w
      if PHI(i) > 0
        for j in i+w…n
          if PHI(j) > 0 and PI(i,j) > 0
            S[i,j] = S[i−1,j+1] + 1
          else
            if S[i−1,j+1] ≥ MIN_SEED_LENGTH and
               C(S[i−1,j+1]) > MIN_COMPLEX and
               mfe(S[i−1,j+1]) < MAX_MFE
              return S[i−1,j+1]
            end
          end
        end
      end
    end

For each consensus base-pair of a given seed, we determine the compensatory score τ to find LRIs with a high amount of compensatory mutations coincident with a high amount of compatible base-pairs. We consider 1 nt changes (e.g. AU to GU), which preserve a base-pairing, as well as 2 nt changes (e.g. AU to GC) as compensatory mutations. For each consensus base-pair, u is the number of different base-pair types and h is the number of compatible base-pairs. We normalize by the maximum number of different base-pairs over all consensus base-pairs for all sequences k included in the LRI.
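The scoring quantities can be sketched in Python as follows. The z-score follows the definition in the text; the use of the sample standard deviation, the one-sided normal P-value and the function names are assumptions made for this illustration, since the text does not specify how the P-value is derived from the z-score. The counts u and h are computed for a single consensus base-pair; the score τ itself is not reproduced because its formula is not given in this text. The MFE values are made up.

```python
from statistics import NormalDist, mean, stdev

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def z_score(mfe, shuffled_mfes):
    """z = (X - mu) / sigma, with mu and sigma taken from shuffled alignments."""
    mu = mean(shuffled_mfes)
    sigma = stdev(shuffled_mfes)  # sample standard deviation (assumption)
    return (mfe - mu) / sigma

def p_value(z):
    """One-sided P-value, ASSUMING normally distributed shuffled MFEs."""
    return NormalDist().cdf(z)

def pair_counts(column_i, column_j):
    """Return (u, h) for one consensus base-pair: number of distinct
    base-pair types and number of sequences forming a compatible pair."""
    pairs = [(a, b) for a, b in zip(column_i, column_j) if (a, b) in PAIRS]
    return len(set(pairs)), len(pairs)

shuffled = [-4.0, -6.0, -5.0, -3.0, -7.0]  # made-up MFEs in kcal/mol
z = z_score(-12.0, shuffled)
print(round(z, 2), p_value(z) < 0.05)
print(pair_counts("AAGG-", "UUCUC"))       # AU, AU, GC, GU pair; gap does not
```

A strongly negative z-score means the interaction is much more stable than expected from shuffled alignments, which yields a small P-value under the normality assumption.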

2.4 LRI seed extension

For each LRI seed, we attempt to extend the alignment by 10 bp at the 5′ and 3′ end. For the extended alignment, the MFE is calculated by RNAalifold, with a hard constraint to build the given seed base-pairs and a soft constraint which forces an intermolecular interaction of the surrounding base-pairs. For the extended alignment, a separate z-score and P-value calculation is possible (default off). These scores are independent of the seed scores.

2.5 Output results

All resulting LRIs are presented in a tab-separated file and additionally as an HTML table, linking all corresponding figures to allow the user to browse through the output. The interacting alignment positions are mapped back to the original position in each virus isolate to easily assess the interactions.
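Mapping an alignment column back to a position in an individual isolate amounts to counting the non-gap characters up to that column. The following Python sketch shows one way to do this; the function name, the 0-based column index and the 1-based sequence position are conventions chosen here, not taken from LRIscan.

```python
def alignment_to_sequence_pos(aligned_seq, column, gap="-"):
    """Map a 0-based alignment column to the 1-based position in the
    ungapped sequence; return None if the row has a gap in that column."""
    if aligned_seq[column] == gap:
        return None
    return sum(1 for c in aligned_seq[:column + 1] if c != gap)

row = "AU--GC-A"
print(alignment_to_sequence_pos(row, 4))  # 'G' is the third nucleotide of this isolate
print(alignment_to_sequence_pos(row, 2))  # gap column, no position in this isolate
```

For full genomes a precomputed prefix count per row would avoid rescanning the sequence for every query.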

2.6 CovaRNA

To compare LRIscan with CovaRNA, we converted all alignments to the UCSC MAF format and used CovaRNA with a minimum number of two input sequences, reading out of the MAF input file (-s 2).

2.7 Dataset

For HCV, we downloaded 950 genomes from NCBI and the HCV database (http://hcv.lanl.gov/content/sequence/NEWALIGN/align.html) [v. 2008 (Kuiken )]. Sequences which occur in both datasets were considered only once. For Tombusvirus and Flavivirus, we downloaded all genomes listed as complete genomes at NCBI-Genome, resulting in 13 sequences for Tombusvirus and six sequences for Flavivirus (mosquito/vertebrate Flaviviruses). For the 950 HCV sequences, an alignment was generated with MAFFT --auto, v.6.8 (Katoh ), and a phylogenetic tree was built with Geneious, v.6.1 (Kearse ), using the neighbor-joining method with the Tamura-Nei model (Tamura and Nei, 1993). Based on this tree and the annotations from NCBI and the HCV database, the dataset was reduced to two of the longest genomes from each subtype if available, resulting in 106 sequences from 65 subtypes. For Tombusvirus (13 sequences), Flavivirus (6 sequences) and the reduced HCV set (106 sequences), we generated MAFFT --maxiterate 1000 --localpair alignments as input for LRIscan. For HIV, we downloaded the hand-curated compendium alignment from the HIV sequence database (http://www.hiv.lanl.gov/content/sequence/NEWALIGN/align.html), choosing all subtypes of the HIV-1 alignment of 2014, resulting in an alignment of 200 sequences. All input alignments can be found at the supplementary page (http://www.rna.uni-jena.de/en/supplements/lriscan/).

2.8 Sensitivity and specificity

To calculate the sensitivity and specificity of LRIscan, we performed dinucleotide shuffling of each sample alignment with multiperm (Anandam ) and hid true positive LRIs. We defined all experimentally detected LRIs as true positives as well as the top LRIs of the original alignment. We applied LRIscan with the same parameters as with the original alignments.
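For reference, the two measures can be written down in Python using their standard definitions, which are assumed here because the text does not spell them out. The counts in the example are hypothetical and are not results from the paper.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical counts for illustration only
print(sensitivity(tp=8, fn=2))    # 8 of 10 true LRIs recovered
print(specificity(tn=30, fp=10))  # 30 of 40 negatives correctly rejected
```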

3 Results and discussion

To validate our tool, we chose four positive strand RNA viruses with known LRIs: HCV, Flavivirus, Tombusvirus and HIV. The best studied viruses, with the highest number of known LRIs, are HCV and Tombusviruses. In HCV, five LRIs are experimentally verified (Fig. 2B) and another 12 LRIs have been predicted semimanually in Fricke . Currently, eight LRIs in Tombusvirus (Fig. 5B), three LRIs in Flaviviruses (Fig. 3B) and two LRIs in HIV-1 (Fig. 6B) have been experimentally verified. As input for LRIscan, we used alignments of 106 HCV genomes, 13 Tombusvirus genomes, 6 Flavivirus genomes and 200 HIV-1 genomes.

Fig. 2.

(A) Plot of all predicted LRIs with P < 0.05 (74) found in the HCV alignment of 106 sequences. The outer circle represents the genome. The histogram represents the number of LRIs per alignment position. Highly reactive genome positions can be found in the 5′/3′ UTR and the coding region of the core gene C. The inner circle shows all predicted interactions between all genome positions. Gray: all new LRIs; black: LRIs corresponding to B and C. The plot was created with Circos (Krzywinski ). (B) Experimentally verified LRIs, which can be predicted by LRIscan, named SLIIId-5BSL3.2 (Romero-López and Berzal-Herranz, 2009, 2012), 5BSL2-5BSL3.2 (Romero-López ; Tuplin ), 5BSL3.2-3′SLII (Friebe ) and 5′ UTR-core (Beguiristain ; Honda ). (C) Highly interesting new LRIs predicted by LRIscan. In a former study, we suggested that LRI 2e could be a seed interaction for a HCV genome circularization (Fricke ). A complete list including all predicted LRIs can be found at the supplementary page. Colors are used to indicate conserved base-pairs: from red (no variation of a base-pair within the alignment) to purple (all six base-pair types are found); from dark (all sequences contain this base-pair) to light colors (1 or 2 sequences are unable to form this base-pair). Compensatory mutations are marked by a circle around the variable base(s) (Color version of this figure is available at Bioinformatics online.)

Fig. 3.

(A) Plot of all predicted LRIs with P < 0.05 (113) found in the Flavivirus alignment of six sequences. The outer circle represents the genome. The histogram represents the number of LRIs per alignment position. The inner circle shows all predicted interactions between all genome positions. Gray: all new LRIs; black: LRIs corresponding to B and C. The plot was created with Circos (Krzywinski ). (B) Experimentally verified LRI, which can be predicted by LRIscan, named 5′–3′ CS (Friebe and Harris, 2010). (C) Highly interesting new LRIs predicted by LRIscan. A complete list including all predicted LRIs can be found at the supplementary page (Color version of this figure is available at Bioinformatics online.)

Fig. 5.

(A) Plot of all predicted LRIs with P < 0.05 (213) found in the Tombusvirus alignment of 13 sequences. The outer circle represents the genome. The histogram represents the number of LRIs per alignment position. The inner circle shows all predicted interactions between all genome positions. Gray: all new LRIs; black: LRIs corresponding to B and C. The plot was created with Circos (Krzywinski ). (B) Experimentally verified LRIs, which have also been predicted by LRIscan, named SL3-SLB (Fabian and White, 2004, 2006; Nicholson and White, 2008), PRTE-DRTE (Cimino ), UL-DL (Wu ), sg2-SLB (Fabian and White, 2004), AS1-RS1 (Lin and White, 2004) and AS2-RS2 (Lin and White, 2004). (C) Highly interesting new LRIs predicted by LRIscan. A complete list including all predicted LRIs can be found at the supplementary page (Color version of this figure is available at Bioinformatics online.)

Fig. 6.

(A) Plot of all predicted LRIs with P < 0.05 (115) found in the HIV-1 alignment of 200 sequences. The outer circle represents the genome. The histogram represents the number of LRIs per alignment position. The inner circle shows all predicted interactions between all genome positions. Gray: all new LRIs; black: LRIs corresponding to B and C. The plot was created with Circos (Krzywinski ). (B) Experimentally verified LRIs, which can be predicted by LRIscan, named LDI, U5-AUG and TAR-TAR (Abbink and Berkhout, 2003; Andersen ; Beerens and Kjems, 2010; Huthoff and Berkhout, 2001). (C) Highly interesting new LRIs predicted by LRIscan. A complete list including all predicted LRIs can be found at the supplementary page (Color version of this figure is available at Bioinformatics online.)

3.1 HCV

To determine the LRIs of the HCV alignment, we used LRIscan with default parameters. Due to general assembly problems of the 5′/3′ untranslated region (UTR) of viral sequences, the HCV alignment contains only 19 complete sequences. Therefore, we set the minimum number of involved sequences to 17% (18 sequences), resulting in 311 predicted LRIs (74 LRIs with P < 0.05). We mapped all detected LRIs of HCV passing the P-value threshold to the corresponding genome alignment position (Fig. 2A) to detect regions with a high number of interactions. Interestingly, we detected some highly reactive regions in the 5′ and 3′ UTR but also high reactivity sites in the coding sequence (CDS) of the core and NS5B protein coding region. The UTRs and the core region show the highest number of possible LRIs, in agreement with the known highly structured regions for all HCV subtypes (Fricke ). Consistent with regions of high reactivity (superior interacting regions), it has been shown that the 5′ UTR includes three LRIs (Beguiristain ; Filbin and Kieft, 2011; Honda ; Romero-López and Berzal-Herranz, 2009, 2012). For HCV, five LRIs have been experimentally verified: SLII-SLIV (Filbin and Kieft, 2011), SLIIId-5BSL3.2 (Romero-López and Berzal-Herranz, 2009, 2012), 5BSL2-5BSL3.2 (Romero-López ; Tuplin ), 5BSL3.2-3′SLII (Friebe ) and 5′ UTR-core (Beguiristain ; Honda ). We identified three known LRIs within the first 58 hits ranked by P-value (Fig. 2B). We missed SLII-SLIV because of the unfavorable seed MFE of only −7 kcal/mol (default threshold −10 kcal/mol). The LRI SLIIId-5BSL3.2 did not pass our conservative P-value threshold but is part of the LRIscan output (LRI 2d). Increasing the MFE or P-value threshold would increase the number of LRIs dramatically, very likely resulting in a high false positive rate (Fig. 4).

Fig. 4.

Plot of predicted LRIs in the HCV alignment using different MFE thresholds. The default threshold is −10 kcal/mol, resulting in 311 LRIs. With increasing MFE threshold, the number of predicted LRIs increases dramatically, and with it the false positive rate. Thus, we decided to choose a conservative MFE threshold. Black: cumulative sum of LRIs per MFE threshold. Gray: sum of LRIs per MFE threshold.

We present three highly interesting new LRIs based on P-value, compensatory score and location in the HCV genome (Fig. 2C). The LRI g of Figure 2 (LRI 2g) is one of the best ranked LRIs, with a very long seed interaction of 8 nt and three compensatory base-pairs, one of which shows three different types of base-pairs changed at both sites. This LRI is conserved in 89/106 isolates and spans a distance of 754 nt between the 5′ UTR and the coding region of the core protein. An experimental verification of this interaction would be highly recommended. LRI 2f is also highly conserved in all isolates and spans a distance of 9440 nt, connecting the 5′ UTR with the NS5B coding regions [corresponding to LRI 4 in Fricke ]. We also identified the possible genome circularization (LRI 2e) between 5′SLII and 3′DLS [see Fricke ].

3.2 Flavivirus

For Flavivirus genomes, we used LRIscan with default values. We predicted 113/157 LRIs with a P-value smaller than 0.05. Although Flaviviruses and Hepacivirus are both assigned to the family Flaviviridae, the LRI distribution is very different. In Flaviviruses, we found some clearly separated peaks with an accumulation of LRIs, located at the 3′ end of gene M and in the center of the NS3 and NS5 gene (Fig. 3A). Different from the HCV alignment, we did not find an accumulation of LRIs in the 5′ and 3′ UTR. This effect could be explained by different sequence conservation of the UTRs. In HCV, the pairwise identity of the UTRs is > 95%, whereas the UTR pairwise identity of the Flaviviruses is < 50%. The Flavivirus alignment consists of only six sequences, including only the mosquito/vertebrate Flaviviruses. In Flavivirus, three experimentally verified LRIs are known: 5′–3′-UAR, DAR and 5′–3′-CS (Alvarez ; Corver ; Friebe and Harris, 2010; Khromykh ; You ; Zhang ). All known interactions are in close proximity and can build a genome circularization, essential for viral replication. The strongest interaction (CS) was ranked at fourth position (Fig. 3B). It was not possible to identify the 5′–3′-UAR and DAR, because the seed region does not appear to be conserved in the highly variable 5′ UTR. However, we found several new promising LRIs (Fig. 3C) in the considered Flaviviruses. All depicted interactions are ranked among the first 75 hits, show a high number of compensatory mutations (up to four types of base-pairs) and are conserved in all sequences.

3.3 Tombusvirus

For the genomes of the Tombusviridae, we used LRIscan with default parameters. We found 529 LRIs (213 LRIs with P < 0.05). Most of the LRIs in Tombusvirus are located in the p92 region (Fig. 5A). This is in line with the already known interactions, where four out of eight known LRIs start in the p92 region (Cimino ; Lin and White, 2004; Wu ). The intergenic regions between p41/p22 and the 3′ end of the p92 coding regions also show high reactivity (Fig. 5A). These areas harbor the known interacting regions of AS1/RS1, DE/CE, sg1/SLB, AS2/RS2 and sg2/SLB (Fabian and White, 2004; Lin and White, 2004; Wu ). In contrast, the p41 region contains only a few predicted LRIs. This is due to the very variable sequence of this gene, which encodes the coat protein of the Tombusviruses. In addition to the mentioned five, three more LRIs are known from experimental data for Tombusviruses: SL3-SLB (Fabian and White, 2004, 2006; Nicholson and White, 2008), PRTE-DRTE (Cimino ) and UL-DL (Wu ). We detected six of the eight known LRIs (Fig. 5B). AS2-RS2 was ranked very low due to a very short seed of 5 bp, a low P-value and a poly-G stretch. The missing SL3-SLB and DE-CE have only been described for one species and cannot be found manually in other species either. Both interactions are located in variable regions with poorly conserved RNA sequences. Here, we present three novel LRIs (Fig. 5C). The LRI 5g, with strong compensatory mutations in 6 of 6 base-pairs and up to four types, is conserved in all sequences. The LRIs 5h and 5i are conserved in all sequences and have a very high rank as well as compensatory base-pairs with up to three types of base-pairs mutated at both interaction sites. The introduced LRIs are interesting candidates for further wet lab studies.

3.4 HIV

For the HIV alignment, LRIscan was used with default parameters. Only 20% of HIV genomes (40 sequences) contained a complete 5′ and 3′ UTR; therefore, we decreased the minimum involved sequences to 20%. With these settings, we identified 314 LRIs (115 LRIs with P < 0.05). In HIV, five LRIs are known, termed R-GAG, LDI, U5-AUG, TAR-TAR and GAG-U3R (Abbink and Berkhout, 2003; Andersen ; Beerens and Kjems, 2010; Huthoff and Berkhout, 2001; Ooms ). We identified three of the known interactions: U5-AUG, TAR-TAR and LDI (Fig. 6B). Due to the conservative MFE threshold, it was not possible to detect the conserved R-GAG interaction. For the GAG-U3R interactions, no conserved seed interaction exists. We also suggest three novel LRIs (Fig. 6C). These LRIs are conserved in all 200 sequences and have several compensatory mutations. LRI 6d is ranked at position one and has compensatory mutations in 7 out of 9 base-pairs with up to 3 different base-pair types.

3.5 Comparison to CovaRNA

We compare LRIscan with CovaRNA, designed for large eukaryotic genomes. CovaRNA outputs no reliable results for the small viral genomes. For the HCV and for the Tombusvirus alignment, CovaRNA found only one covariation cluster consisting of only two base-pairs. This cluster covers no known LRI. No cluster could be identified for the HIV alignment. The output of the Flavivirus alignment includes five covariation clusters. All clusters contain only two base-pairs and cover none of the known LRIs.

3.6 Sensitivity and specificity

An accurate sensitivity/specificity calculation is difficult due to the limited number of experimentally verified LRIs. We reach a mean sensitivity of 0.83. Most of the non-detected true positive LRIs are specific and experimentally verified only for single isolates. We investigated their conservation throughout the individuals of the alignment manually, resulting in isolate-specific LRIs. The sensitivity is independent of the sequence number and alignment identity (Table 1). The mean specificity of 0.64 reflects a rather high number of false positives. A more stringent P-value would remove a large fraction of false positives but would also result in a loss of true positive LRIs, see Figures 2, 3, 5, and 6. In practice, known LRIs have very small P-values (high ranks) and/or high compensatory scores and/or can be extended in length. The user selects the LRIs based on all metrics and on the regions of interest. A minimal P-value threshold including almost all experimentally verified LRIs can be found at 0.05.

Table 1.

Plot of sensitivity and specificity (P < 0.05) of the four shuffled genome alignments of HCV, Flaviviruses, Tombusviruses and HIV

	#seq	identity %	true positives (TP)	sensitivity	specificity
HCV	106	62.2	10	0.81	0.69
Flaviviruses	6	58.3	10	0.83	0.46
Tombusviruses	13	63.5	23	0.88	0.75
HIV	200	80.0	16	0.81	0.69
Mean				0.83	0.64


4 Conclusions

The identification of LRIs is experimentally and bioinformatically a challenging task. The huge number of theoretically possible interactions makes it impossible to verify all possible interactions by wet lab experiments. LRIscan offers the opportunity to predict possible LRIs under certain criteria and dramatically reduces the number of candidates for wet lab studies. As shown in part A of Figures 2, 3, 5, and 6, even well studied viruses provide a huge list of highly ranked LRIs, which could be involved in viral translation and transcription mechanisms. Further wet lab studies, which verify the functionality of these interactions, could improve our understanding of viral replication.

References (47 in total)

1. Circularization of the HIV-1 genome facilitates strand transfer during reverse transcription.

Authors: Nancy Beerens; Jørgen Kjems
Journal: RNA Date: 2010-04-29 Impact factor: 4.942

2. Multiperm: shuffling multiple sequence alignments while approximately preserving dinucleotide frequencies.

Authors: Parvez Anandam; Elfar Torarinsson; Walter L Ruzzo
Journal: Bioinformatics Date: 2009-01-09 Impact factor: 6.937

3. Interplay of RNA elements in the dengue virus 5' and 3' ends required for viral RNA replication.

Authors: Peter Friebe; Eva Harris
Journal: J Virol Date: 2010-03-31 Impact factor: 5.103

4. RNAplex: a fast tool for RNA-RNA interaction search.

Authors: Hakim Tafer; Ivo L Hofacker
Journal: Bioinformatics Date: 2008-04-23 Impact factor: 6.937

5. RactIP: fast and accurate prediction of RNA-RNA interaction using integer programming.

Authors: Yuki Kato; Kengo Sato; Michiaki Hamada; Yoshihide Watanabe; Kiyoshi Asai; Tatsuya Akutsu
Journal: Bioinformatics Date: 2010-09-15 Impact factor: 6.937

6. Co-selection of West Nile virus nucleotides that confer resistance to an antisense oligomer while maintaining long-distance RNA/RNA base pairings.

Authors: Bo Zhang; Hongping Dong; David A Stein; Pei-Yong Shi
Journal: Virology Date: 2008-10-07 Impact factor: 3.616

7. Context-influenced cap-independent translation of Tombusvirus mRNAs in vitro.

Authors: Beth L Nicholson; K Andrew White
Journal: Virology Date: 2008-09-04 Impact factor: 3.616

8. IntaRNA: efficient prediction of bacterial sRNA targets incorporating target site accessibility and seed regions.

Authors: Anke Busch; Andreas S Richter; Rolf Backofen
Journal: Bioinformatics Date: 2008-10-21 Impact factor: 6.937

9. Circularization of the HIV-1 RNA genome.

Authors: Marcel Ooms; Truus E M Abbink; Chi Pham; Ben Berkhout
Journal: Nucleic Acids Res Date: 2007-08-07 Impact factor: 16.971

10. A discontinuous RNA platform mediates RNA virus replication: building an integrated model for RNA-based regulation of viral processes.

Authors: Baodong Wu; Judit Pogany; Hong Na; Beth L Nicholson; Peter D Nagy; K Andrew White
Journal: PLoS Pathog Date: 2009-03-06 Impact factor: 6.823
