nextPARS: parallel probing of RNA structures in Illumina.

Ester Saus^1,2, Jesse R Willis^1,2, Leszek P Pryszcz^1,2,3, Ahmed Hafez^4, Carlos Llorens^4, Heinz Himmelbauer^1,2,5, Toni Gabaldón^1,2,6.

Abstract

RNA molecules play important roles in virtually every cellular process. These functions are often mediated through the adoption of specific structures that enable RNAs to interact with other molecules. Thus, determining the secondary structures of RNAs is central to understanding their function and evolution. In recent years several sequencing-based approaches have been developed that allow probing structural features of thousands of RNA molecules present in a sample. Here, we describe nextPARS, a novel Illumina-based implementation of in vitro parallel probing of RNA structures. Our approach achieves comparable accuracy to previous implementations, while enabling higher throughput and sample multiplexing.

Keywords: RNA secondary structure; genome-wide enzymatic probing; multiplex sequencing

Substances: RNA, Messenger

Year: 2018 PMID: 29358234 PMCID: PMC5855959 DOI: 10.1261/rna.063073.117

Source DB: PubMed Journal: RNA ISSN: 1355-8382 Impact factor: 4.942

INTRODUCTION

The structure of RNA molecules plays a central role in their function and regulation (Cruz and Westhof 2009). Over recent years, several new approaches have been developed that couple high-throughput sequencing technologies to traditional enzymatic- or chemically-based assays to probe RNA structure, thereby enabling the structural profiling of transcribed RNAs at genome-wide scales (Table 1). These include, among others, PARS (Kertesz et al. 2010), FragSeq (Underwood et al. 2010), or the more recent in vivo approaches DMS-Seq (Ding et al. 2014; Rouskin et al. 2014), icSHAPE (Spitale et al. 2015), and PARIS (Lu et al. 2016). Analyses made in vivo represent a powerful tool to validate in vitro studies and to obtain more physiologically relevant information. However, currently available in vivo methods can only probe either single-stranded or double-stranded regions, but not both at the same time unless using different technologies, missing the direct combined information obtained in PARS, for example. In addition, RNA is generally associated with proteins and other molecules, which limits the obtained information to unprotected regions (Ge and Zhang 2015). Thus, both in vitro and in vivo studies are complementary and there is a niche of applications for the two approaches (Sanbonmatsu 2016). In this context, efforts have been made to develop computational methods that infer RNA secondary structure by accounting for such high-throughput experimental data (Ouyang et al. 2013; Ge and Zhang 2015; Wu et al. 2015).

TABLE 1.

Summary and main characteristics of methods to probe RNA secondary structure

Here, we describe nextPARS, an adaptation of the PARS technique (originally developed using SOLiD sequencing) to Illumina technology, which allows higher throughput and sample multiplexing. Although the PARS approach has been previously adapted to Illumina (Wan et al. 2013), that protocol does not enable pooling of different samples and, moreover, it requires the use of an Ambion kit, which has been discontinued (https://www.lifetechnologies.com/order/catalog/product/4454073). As a consequence, the use of that protocol has been limited to very few studies (Wan et al. 2014; Dominissini et al. 2016; Righetti et al. 2016). Here we developed a related method, based on the parallel specific enzymatic digestion of single-stranded (S1) and double-stranded (V1) regions directly followed by Illumina library preparation and sequencing. Among all previously published methods for probing RNA secondary structure in vitro, nextPARS is the only one capable of tagging all four bases in both single- and double-stranded conformation at a genome-wide scale while enabling multiplexing in Illumina's sequencing technology, therefore dramatically reducing the sequencing costs and enabling higher throughput. We tested the validity of our approach by comparing our results with reported structural profiles obtained using similar techniques. We probed polyadenylated [poly(A)] and total RNA of Saccharomyces cerevisiae as well as various in vitro-transcribed RNAs added in the experiments as controls. To estimate the accuracy of our approach, we compared our high-throughput results with structural data obtained by crystallography of five RNA molecules with resolved secondary structures, totaling 5607 bases, including the Tetrahymena ribozyme fragment TETp4p6, and the Saccharomyces cerevisiae ribosomal RNAs 5S, 18S, 25S, and 5.8S.

RESULTS

With the aim of obtaining higher sequencing throughput and multiplexing capacity to study the secondary structure of RNA molecules at a genome-wide scale, we adapted the PARS protocol (Kertesz et al. 2010) to the Illumina sequencing platform. We refer to this approach as “nextPARS” (see Materials and Methods and Fig. 1 for a more detailed description of the protocol). In brief, our adapted protocol adds phosphatase and kinase treatments after the RNase probing step, to allow ligation of the corresponding 5′ and 3′ adaptors of the Illumina TruSeq Small RNA Sample Preparation Kit. Reverse transcription of the RNA fragments and PCR amplification are then performed to obtain sequencing-ready libraries. Finally, single-end sequencing of the gel size-selected libraries and subsequent mapping allow the enzymatic cleavage points to be determined at base resolution, enabling the computation of the final nextPARS scores (see below).

FIGURE 1.

Summary of the different steps performed in the nextPARS protocol. From the cells or tissue of interest (A), total RNA is extracted (B) and then poly(A)+ RNA is selected (C) to initially prepare the samples for nextPARS analyses. Once the quality and quantity of the poly(A)+ RNA samples are confirmed, the RNA samples are denatured and folded in vitro, and the molecules are enzymatically probed with the corresponding concentrations of RNase V1 and S1 nuclease (D). For the library preparation using the Illumina TruSeq Small RNA Sample Preparation Kit, an initial phosphatase treatment of the 3′ ends and a kinase treatment of the 5′ ends are required (E) before ligating the corresponding 5′ and 3′ adapters to the ends of the RNA fragments (F). Then a reverse transcription of the RNA fragments and a PCR amplification are performed to obtain the library (G). The library is size-selected on an acrylamide gel to remove primers and adapter dimers, and a final quality control is performed (H). Libraries are sequenced as single reads of 50 nucleotides (nt) on Illumina sequencing platforms (I), and computational analyses are done as described in the Materials and Methods section in order to map the Illumina reads and determine the enzymatic cleavage points, using the first nucleotide at the 5′ end of the reads (which corresponds to the 5′ end of the original RNA fragments) (J).

We obtained highly reproducible results in terms of enzymatic digestion profiles, with high and statistically significant correlations among at least three independent replicates (Table 2; Supplemental Table S1). These correlations were of the same level as those found in the original PARS protocol. Detailed analysis of the digestion profiles in the control molecules showed significant agreement with published structures, classical chemical footprinting, and results from other methodologies, particularly the SOLiD-based PARS method (Fig. 2; Supplemental Figs. S1–S3). However, we noted that different probing approaches, differences in the relative abundance of molecules, and even the use of different sequencing protocols (e.g., Illumina versus SOLiD) had an impact on reactivity profiles. Consequently, results from PARS and nextPARS were positively correlated but showed only moderate levels of agreement. Thus, raw data provided by the two methods should not be considered equivalent but rather related.

TABLE 2.

Correlations within nextPARS replicates, within PARS replicates, and between nextPARS and PARS

FIGURE 2.

nextPARS results for TETp4p6 fragment. (A) Sites having a nextPARS score higher than 0.5 (predicted paired site) or lower than −0.5 (predicted unpaired site) are indicated as green (+1, double-stranded) and pink (−1, single-stranded), respectively, on the reference secondary structure of TETp4p6 RNA according to the PDB database and visualized using the VARNA program (Visualization Applet for RNA, http://varna.lri.fr/ [Darty et al. 2009]). Nucleotides that do not pass the threshold are assigned as 0. Green crosses (+) show V1 cuts (paired sites) which target double-stranded nucleotides in the reference structure, and pink asterisks (*) show S1 cuts (unpaired sites) which target single-stranded nucleotides in the reference structure. Numbers and percentages of bases detected for each conformation are shown below the RNA molecule. (B) Plot comparing both PARS and nextPARS techniques: normalized number of reads for V1 enzyme are plotted for each technique. (C) Plot comparing the results obtained with nextPARS with those of previously published results obtained by traditional footprinting experiments (Kertesz et al. 2010).

The original method for calculating the PARS scores was based on the log2 ratio between normalized values of V1 reads and S1 reads for a given PARS experiment (Kertesz et al. 2010). However, due to discrepancies in the read counts between V1 and S1 experiments, as mentioned in the Materials and Methods section under “Computation of nextPARS scores” Phase I: part (iii), we found that this approach was not entirely satisfactory. Here we propose an alternative procedure (see Materials and Methods) that combines normalization of the raw digestion profiles and a sequence-based recurrent neural network (RNN) classifier. This procedure produces a single “nextPARS score” that ranges from −1.0 (highest preference for single-stranded regions) to 1.0 (highest preference for double-stranded regions). The full computational pipeline to derive these scores from raw nextPARS data can be found at https://github.com/Gabaldonlab/nextPARS. To provide a better framework for comparison of PARS and nextPARS raw data and the two scoring procedures, we used secondary structure data obtained by crystallography of five RNA molecules, totaling 5607 bases, including the Tetrahymena ribozyme fragment TETp4p6 (Cate et al. 1996), and four yeast ribosomal RNAs (Ben-Shem et al. 2011), as downloaded from the Protein Data Bank (PDB) (Berman et al. 2000). Our results indicate that both methods are comparable in their sensitivity and accuracy in directly determining single- or double-stranded sites, regardless of the scoring scheme used (Fig. 3). In fact, when comparing the nextPARS sequencing data processed with the nextPARS scoring method to those of PARS processed with the PARS scoring method, we observed a statistically significant correlation (Spearman coefficient = 0.563). As expected, the correlations are stronger when comparing the two scoring methods on the same sequencing data set, since the experimental and sequencing methods differ between the two data sets. Correlating the two different scoring methods on the nextPARS data set gives a coefficient of 0.835, and on the PARS data set gives a coefficient of 0.874. Overall, the PARS and nextPARS approaches had a similar ability to determine whether a nucleotide is paired or unpaired, as determined by receiver operating characteristic (ROC) curves (Fig. 3).
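To make the two scoring ideas above concrete, the following is a minimal sketch, not the pipeline in the nextPARS repository. It shows a per-site log2 ratio of normalized V1 and S1 counts (the original PARS idea) and the ternary call at ±0.5 that the Figure 2 legend applies to nextPARS scores. The function names and the pseudocount are assumptions made for illustration; the text does not state how zero counts were handled.

```python
import math


def pars_score(v1_norm, s1_norm, pseudocount=1.0):
    """Log2 ratio of normalized V1 to S1 read counts at one position.

    The pseudocount is an assumption of this sketch (it avoids log of zero).
    Positive values favor double-stranded, negative values single-stranded.
    """
    return math.log2((v1_norm + pseudocount) / (s1_norm + pseudocount))


def call_site(nextpars_score, threshold=0.5):
    """Ternary call for a score in [-1, 1]: +1 paired, -1 unpaired, 0 undecided."""
    if nextpars_score > threshold:
        return 1
    if nextpars_score < -threshold:
        return -1
    return 0


if __name__ == "__main__":
    print(pars_score(3.0, 1.0))                        # V1-enriched site
    print([call_site(s) for s in (0.9, 0.1, -0.7)])    # one call per site
```

Note that the ±0.5 threshold belongs to the bounded nextPARS score; a log2 ratio is unbounded, so the same cut-off would not carry the same meaning there.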

FIGURE 3.

Comparison of scoring methods from nextPARS and PARS. Receiver operating characteristic (ROC) curves to assess the ability to determine whether a nucleotide is paired or unpaired. They were generated by varying a score threshold at 101 evenly spaced values over the full range of scores for the five benchmark molecules with known structures. True positives here are considered to be those sites that are paired in the reference structure with a score greater than the given threshold, while true negatives are those sites that are unpaired in the reference structure with a score lower than the given threshold. The blue curves are the nextPARS scores for the benchmark molecules in the two different data sets and the red curves are for the PARS scores in the two data sets. Solid curves are for a data set with its own scoring method, dashed curves are for a data set with the opposite scoring method. Also included in the legend is the area under the curve (AUC).

Finally, to test whether stochastic, nonenzymatic breaks could be a source of noise in our protocol, we probed a set of five RNA molecules (TETp4p6, TETp9-9.1, SRA, B2, and U1) with known secondary structure with our nextPARS protocol but using RNase A, an enzyme that specifically cuts at single-stranded C's and U's. We obtained only one cut at an unexpected base (A) out of 86 when applying a threshold of 0.8 to nextPARS scores (Fig. 4; Supplemental Fig. S4). This indicates that most observed cuts are enzymatically derived and that nonenzymatic breaks represent a very minor fraction. Hence, differences in signal strength could reflect real cutting preferences of the enzymes, differences in accessibility, or the presence of alternative coexisting structures.
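The ROC construction described in the Figure 3 legend can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the authors' script: the function names are hypothetical, the AUC is computed with the trapezoidal rule (the text does not say how it was obtained), and sites whose score equals the threshold are counted as neither true positives nor true negatives, following the strict inequalities in the legend.

```python
def roc_points(scores, paired, n_thresholds=101):
    """Return (FPR, TPR) pairs for evenly spaced thresholds over the score range.

    scores: per-site scores; paired: per-site booleans from the reference structure.
    """
    n_pos = sum(1 for p in paired if p)
    n_neg = len(paired) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both paired and unpaired reference sites")
    lo, hi = min(scores), max(scores)
    step = (hi - lo) / (n_thresholds - 1)
    points = []
    for i in range(n_thresholds):
        t = lo + i * step
        # True positive: paired site scoring above the threshold.
        tp = sum(1 for s, p in zip(scores, paired) if p and s > t)
        # True negative: unpaired site scoring below the threshold.
        tn = sum(1 for s, p in zip(scores, paired) if not p and s < t)
        points.append((1 - tn / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule (an assumption of this sketch)."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


if __name__ == "__main__":
    toy_scores = [-1.0, -0.5, 0.5, 1.0]
    toy_paired = [False, False, True, True]
    print(auc(roc_points(toy_scores, toy_paired)))
```

In the toy input the scores separate paired from unpaired sites perfectly, so the curve passes through the top-left corner.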

FIGURE 4.

Probing of RNA molecules with RNase A enzyme. Examples of the signals obtained in some RNA molecules when performing nextPARS using RNase A, an enzyme that cuts specifically at single-stranded cytosines (C) and uracils (U). Scores were calculated for each site by first capping all read counts for a given transcript at the 95th percentile and then normalizing to have a maximum of 1 (as done in the “Computation of nextPARS scores” of the Materials and Methods, but since RNase A is the only enzyme in this case, no subtraction is performed, so all values fall in the range of 0 to 1). Cuts are considered for signals above a threshold of 0.8. (A) nextPARS signals above the threshold of 0.8 are depicted for TETp4p6 and TETp9-9.1 RNA fragments after probing them by nextPARS using RNase A. Secondary structures of the RNA fragments according to PDB are displayed using the VARNA program (Visualization Applet for RNA, http://varna.lri.fr/ [Darty et al. 2009]). In green, nucleotides with a cut signal above 0.8; green crosses (+) show cuts obtained at a C or U; pink asterisks (*) show cuts obtained at a G or A; and blue arrows (→) show cuts obtained at double-stranded positions. (B) Table summarizing the total number (N) and percentages (%) of cuts with a signal above the 0.8 threshold obtained in five different RNA fragments with known secondary structure (TETp4p6, TETp9-9.1, SRA, B2, U1): first column, N and % of cuts with a signal above 0.8 in the molecules; second column, N and % of these cuts at C or U nucleotides; and third column, N and % of cuts at G or A nucleotides.
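The capping and normalization described in the Figure 4 legend can be illustrated with a short sketch. It assumes a nearest-rank definition of the 95th percentile (the legend does not specify which percentile convention was used), and the function names are hypothetical.

```python
import math


def cap_and_normalize(counts, percentile=95):
    """Cap per-site read counts at the given percentile, then scale to a maximum of 1."""
    if not counts:
        return []
    ordered = sorted(counts)
    # Nearest-rank percentile: an assumption of this sketch.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    cap = ordered[rank - 1]
    capped = [min(c, cap) for c in counts]
    top = max(capped)
    if top == 0:
        return [0.0] * len(counts)
    return [c / top for c in capped]


def cut_sites(signal, threshold=0.8):
    """Indices (0-based) of positions whose normalized signal exceeds the threshold."""
    return [i for i, s in enumerate(signal) if s > threshold]


if __name__ == "__main__":
    # Nineteen modest counts plus one extreme outlier at the last position.
    counts = list(range(1, 20)) + [1000]
    signal = cap_and_normalize(counts)
    print(cut_sites(signal))
```

Capping before scaling keeps a single extreme position from compressing every other site toward zero, which is why the outlier in the toy input does not suppress the neighboring high-count sites.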

DISCUSSION

In this work, we present nextPARS, an adaptation of the PARS technology to the Illumina platform for RNA structure probing. The main advantages of nextPARS are that the experimental procedure is straightforward to follow, that all scripts needed to obtain the secondary structure profiles for downstream analyses are freely available, and that the results are at least as accurate as those of previous methodologies while allowing higher throughput and sample multiplexing in comparison to PARS.

One of the main limitations in the experimental protocol of nextPARS is that, when using the Illumina TruSeq Small RNA Kit to prepare the libraries, we cannot directly ligate first the 5′ adaptor and then the 3′ adaptor, so phosphatase and kinase treatments of the RNA ends are required. This implies that we cannot discern 5′ ends in the RNA molecules caused by V1 or S1 enzymatic digestion from those produced by unspecific fragmentation of RNA molecules (RNase V1 and S1 nuclease enzymatically digest RNA leaving 5′-phosphate ends, while random fragmentation produces 5′-hydroxyl groups, which cannot be ligated). In the original PARS protocol (Kertesz et al. 2010; Wan et al. 2013), after the initial enzymatic digestion, a random fragmentation step is done followed by the 5′ adaptor ligation, which only occurs at 5′-phosphate ends. In the nextPARS protocol, we skipped the random fragmentation after the V1- and S1-specific enzymatic digestion (we obtained high-quality libraries, and a final gel size-selection is performed) and proceeded directly to the adaptor ligations (after the required phosphatase and kinase treatments). Thus, it is possible that nonspecific fragmentation occurring in RNA molecules during the nextPARS protocol would produce noisy signals. However, when we enzymatically digested some control molecules of previously described secondary structure with RNase A (which specifically cuts at single-stranded C's and U's), all signals obtained were at the expected nucleotides except one (at an A) out of 86 (Fig. 4) when applying a threshold of 0.8 to nextPARS scores. This means we can rule out unspecific fragmentation as a major cause of noisy signals, which could instead be explained by the simultaneous presence of different conformations of the RNA molecules or by different enzymatic preferences for cutting points.

In addition, we cannot rule out the possibility that more than one cut occurs in some RNA molecules, which could also lead to noisy signals, since conformational changes in the RNA molecule caused by the first cut could be detected by a second cut. Although this is an intrinsic characteristic of both the PARS and nextPARS techniques that cannot be avoided, we tried to minimize this confounding effect. First, we performed several nextPARS experiments testing different enzyme concentrations, and we chose the optimal V1 and S1 amounts to achieve single-cut kinetics based on the results obtained for some molecules with known secondary structure. Moreover, data analyses and thresholds are applied to remove as much of the noisy signal as possible and to obtain the most accurate and reliable results. When applying different thresholds to nextPARS scores in the control molecules, we observe that relaxed thresholds (between 0.4 and 0.6) are enough to discard unspecific and noisy signals in most of them, leaving only real single-stranded or double-stranded signals according to their reference structure (Supplemental Figs. S1–S3).

Altogether, nextPARS is a rapid and easy protocol using Illumina sequencing technology to experimentally and massively probe the secondary structure of RNAs. It achieves the same level of resolution as, and similar accuracy to, previously published in vitro structure probing methodologies, while providing higher throughput and multiplexing capacity. In addition, we provide a computational procedure to go from the sequencing reads to a single score that can be used in downstream analyses.

MATERIALS AND METHODS

Sample preparation

Total and poly(A)+ RNA from yeast

Saccharomyces cerevisiae S288C was grown in YPDA medium in an orbital shaker (30°C, 200 rpm, overnight). Total RNA from these cultures was extracted using the RiboPure-Yeast Kit according to manufacturer's instructions (Ambion), starting with a total amount of 3 × 10^8 cells per sample as recommended for a maximum yield. Total RNA integrity and quantity of the samples were assessed using the Agilent 2100 bioanalyzer with the RNA 6000 Nano LabChip Kit (Agilent, see Supplemental Fig. S5A) and NanoDrop 1000 Spectrophotometer (Thermo Scientific). To obtain poly(A)+ RNA, total RNA from yeast was purified by two rounds of selection using the MicroPoly(A)Purist Kit according to manufacturer's instructions (Ambion), and the quality of the samples was controlled as above (Supplemental Fig. S5A).

RNA positive controls

Three RNA fragments with previously determined secondary structures were spiked into the samples and used as positive controls in the experiments. Tetrahymena ribozyme and HOTAIR clones were obtained from Howard Chang's laboratory (Stanford University School of Medicine). Both of them had been previously used as positive controls in the original PARS protocol (Kertesz et al. 2010) and were used in the present work to compare PARS with nextPARS performance. In addition, three other RNA molecules with previously described structures (SRA, B2 and U1) were spiked into the samples in one experiment to probe them with RNase A. In all cases, plasmids were transformed in One Shot TOP10 Chemically Competent E. coli according to manufacturer's instructions (Invitrogen). Single colonies were grown in LB+Ampicillin medium (37°C, 220 rpm, overnight), and plasmids were purified using the QIAprep Spin Miniprep Kit according to the manufacturer's instructions (Qiagen). PCRs were performed to amplify two fragments of Tetrahymena ribozyme (TETp4p6 and TETp9-9.1) and one fragment of the other molecules (HOTAIR -HOT2-, SRA, B2 and U1). Primer sequences and amplicon sizes are shown in Supplemental Table S2. PCR amplicons were purified using a QIAquick PCR Purification Kit according to manufacturer's instructions (Qiagen) and then were sequenced with Sanger, to confirm that no mutation had been introduced in the fragments of interest. Then, the fragments used as positive controls were transcribed in vitro using the T7 RiboMax Large-scale RNA production system according to manufacturer's instructions (Promega). Finally, RNAs of interest were selected by size and purified using Novex-TBE Urea gels according to manufacturer's instructions (Life Technologies). A final quality control of the purified RNAs was performed as described above.

Enzymatic probing

For the enzymatic probing of RNA samples, we reproduced the PARS protocol using RNase V1 and S1 nuclease to cleave RNAs in double- or single-stranded conformation, respectively (Kertesz et al. 2010; Wan et al. 2013). Two micrograms of poly(A)+ RNA or total RNA were mixed with 20 ng of each positive control RNA (TETp4p6, TETp9-9.1, HOT2) and were brought to a final volume of 80 µL with nuclease-free water in a 200 µL thin wall PCR tube. We took 1 µL of each experiment to perform a quality control using the Agilent 2100 bioanalyzer with the RNA 6000 Pico LabChip Kit (Agilent) and NanoDrop 1000 Spectrophotometer (Thermo Scientific). Once we confirmed that RNA samples were not degraded, they were denatured at 90°C for 2 min (in the thermal cycler with heated lid on) and the tubes were immediately placed on ice for 2 min. Then, 10 µL of ice-cold 10X RNA structure buffer (Ambion) were added to the samples and mixed by pipetting up and down several times. RNA samples were subsequently brought from 4°C to 23°C in 20 min (1°C per min) in a thermal cycler. Finally, 10 µL of nuclease-free water, either alone or containing the corresponding dilution of RNase V1 (Ambion) or S1 nuclease (Fermentas), were added to the samples inside the thermal cycler for nondigested, V1-digested, and S1-digested samples, respectively (see the following section, “Determination of the optimal RNase V1 and S1 nuclease concentrations”). After mixing by pipetting, samples were incubated at 23°C for 15 min. The RNAs were purified using the RNeasy MinElute Cleanup Kit following manufacturer's instructions (Qiagen). We took 1 µL of each experiment to perform a quality control as described earlier (Supplemental Fig. S5B–D). For the probing with RNase A, we added 20 ng of each of the following RNA molecules to a total of 2 µg of poly(A)+ RNA from S. cerevisiae: TETp4p6, TETp9-9.1, SRA, B2, and U1. All the experiments were performed following exactly the same steps described above, but adding 0.05 µg of RNase A (Ambion) to the samples instead of the V1 or S1 enzymes.

Determination of the optimal RNase V1 and S1 nuclease concentrations

The original PARS protocol used 0.01 U of RNase V1 (Ambion) and 1000 U of S1 nuclease (Fermentas) in a 100 µL reaction volume (Kertesz et al. 2010), which the authors reported as the appropriate enzyme concentrations for cleavage reactions occurring with single-hit kinetics. However, in their next study (Wan et al. 2013) they suggested titrating the nucleases to choose the optimal conditions for cleaving RNA molecules once per molecule on average, to avoid putative conformational changes after the first enzymatic cleavage. In that work they considered single-hit kinetics to occur when around 10%–20% of the RNA is cleaved. The authors suggested that the titration could be done by radiolabeling the RNA molecules, digesting them with different enzyme dilutions, running them on a gel, and quantifying the percentage of full-length RNA before and after the digestion. In our study, we took a more direct approach, performing full nextPARS experiments using different enzyme dilutions and probing some RNA control molecules with known secondary structure. We tested different dilutions of both enzymes and checked their bioanalyzer profiles with the RNA 6000 Pico LabChip Kit (Agilent). This served to assess the digestion pattern and confirm that RNA samples were neither completely digested nor left undigested (Supplemental Fig. S5C,D). Then, rather than measuring the percentage of undigested/digested RNA molecules as in previous studies developing the PARS technique (Wan et al. 2013), we directly analyzed the sequencing results of samples treated with distinct enzyme concentrations, as well as a nondigested sample. In this way, we could directly identify the enzyme concentration whose digestion profile gave the most accurate results according to the known secondary structure of the positive controls. Specifically, we tested the following RNase V1 dilutions (numbers in parentheses correspond to units used in a 100 µL reaction volume): 1:30 (0.03 U), 1:50 (0.02 U), 1:100 (0.01 U), and 1:250 (0.004 U). We also tested 1:10 (0.1 U) and 1:20 (0.05 U) RNase V1 dilutions, but samples were completely digested, so we did not proceed to the preparation of the libraries. For S1, the following dilutions were tested: stock concentration (1000 U), 1:2 (500 U), 1:5 (200 U), 1:20 (50 U), and 1:50 (20 U). Triplicates were performed for all samples. The final amounts used for our experiments were 0.03 U and 200 U for V1 and S1, respectively.
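The dilution-to-units arithmetic in the list above can be checked with a small sketch. It assumes that the undiluted amount added per 100 µL reaction corresponds to 1 U for RNase V1 (inferred from 1:100 giving 0.01 U, not stated explicitly in the text) and to 1000 U for S1 nuclease (stated as the stock condition). The function name is hypothetical.

```python
def units_in_reaction(undiluted_units, dilution_factor):
    """Enzyme units in the reaction for a 1:dilution_factor dilution."""
    return undiluted_units / dilution_factor


if __name__ == "__main__":
    V1_UNDILUTED = 1.0      # assumption: inferred from 1:100 -> 0.01 U
    S1_UNDILUTED = 1000.0   # stock condition given in the text
    for factor in (10, 20, 30, 50, 100, 250):
        print(f"V1 1:{factor} -> {units_in_reaction(V1_UNDILUTED, factor):.3f} U")
    for factor in (1, 2, 5, 20, 50):
        print(f"S1 1:{factor} -> {units_in_reaction(S1_UNDILUTED, factor):.0f} U")
```

Under these assumptions the 1:30 V1 dilution gives about 0.033 U, which the text reports rounded as 0.03 U.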

Library preparation

nextPARS: library preparation using TruSeq Small RNA Sample Preparation Kit (Illumina)

With the aim of obtaining higher sequencing throughput and multiplexing capacity, we adapted the PARS protocol to the Illumina sequencing platform to study the secondary structure of RNA molecules genome-wide, and named the resulting method “nextPARS” (Fig. 1). In PARS, after the in vitro folding and RNase digestion step, the authors include a random fragmentation step and a size selection of RNA fragments by a column cleanup, both of which we skip in the nextPARS protocol. In PARS, the 5′ adapter is ligated first and the 3′ adapter second, after a phosphatase treatment. In nextPARS, we performed a phosphatase and a kinase treatment just after the enzyme digestion, to leave a hydroxyl group at the 3′ end and a phosphate group at the 5′ end of all RNA fragments coming from nuclease digestion, so that they can be ligated to the adapters in the subsequent library preparation steps. To control for unspecific degradation of RNA, we included the nondigested sample. For the phosphatase treatment, we incubated at 37°C for 30 min a reaction mixture with 16 µL of the nondigested, V1-digested, or S1-digested sample, 2.5 µL of 10× phosphatase buffer, 2.5 µL of nuclease-free water, 1 µL of RNase inhibitor, and 3 µL of Antarctic phosphatase (New England BioLabs). After 5 min at 65°C, we put samples on ice and added 4 µL of T4 Polynucleotide Kinase (PNK, New England BioLabs), 5 µL of 10× PNK buffer, 10 µL of 10 mM ATP, 1 µL of RNase inhibitor, and nuclease-free water up to a total volume of 50 µL. After 1 h of incubation at 37°C, samples were purified using the RNeasy MinElute Cleanup Kit following manufacturer's instructions (Qiagen) with a final elution step in 10 µL of RNase-free water. Then, samples were concentrated using a centrifugal evaporator (Speed Vac) to a final volume of 5 µL, and we started the TruSeq Small RNA Sample Preparation Kit (Illumina) protocol. All reagents used in the next steps are from the Illumina kit if not specified otherwise.
Briefly, we first performed the 3′ adapter ligation with an initial denaturing step at 70°C for 2 min with the 5 µL of RNA samples and 1 µL of RNA 3′ adapter. Samples were then put on ice, and 2 µL of 5× HM Ligation Buffer, 1 µL of RNase inhibitor and 1 µL of T4 RNA Ligase 2, truncated (New England BioLabs) were added. After 1 h incubation at 28°C, we added 1 µL of stop solution, gently pipetted up and down, incubated for 15 min more at 28°C and placed the samples on ice. Next, for the 5′ adapter ligation, we denatured the RNA 5′ adapter (1.1 µL per sample) at 70°C for 2 min and placed the tube immediately on ice. We added 1.1 µL of 10 mM ATP and 1.1 µL of T4 RNA Ligase per sample to the same tube, mixed it, and added 3 µL of the mix to each of the samples coming from the 3′ adapter ligation. We incubated them at 28°C for 1 h. We then performed the reverse transcription of the samples, starting with a denaturing step at 70°C for 2 min of 6 µL of the 5′ and 3′ adapter-ligated RNA with 1 µL of RNA RT primer. After putting the samples on ice, we added 2 µL of 5× First strand buffer, 1 µL of SuperScript II Reverse Transcriptase (Life Technologies), 1 µL of 100 mM DTT, 1 µL of RNase inhibitor, and 0.5 µL of 1:2 diluted dNTP mix and incubated them at 50°C for 1 h. To perform the PCR amplification we added to the samples 8.5 µL of ultrapure water, 25 µL of PCR mix, 2 µL of RNA PCR primer, and 2 µL of indexed RNA PCR primer (with a different index for each of the samples tested). Cycling conditions began with a denaturation step of 30 sec at 98°C, followed by 11 cycles of 10 sec at 98°C, 30 sec at 60°C, and 15 sec at 72°C, with a final extension step at 72°C for 10 min. We diluted 1 µL of each sample in 1 µL of ultrapure water to perform a quality control using the Agilent 2100 bioanalyzer with the High Sensitivity DNA Kit (Agilent). Finally, we purified and size-selected the prepared libraries to remove primers and adapter dimers using a Novex TBE 6% gel (Invitrogen). 
We loaded into the gel the total volume of cDNA constructs (50 µL), as well as the High resolution ladder and Custom ladder (1 µL each), mixed with the proper amount of DNA loading dye, and ran it at 145 V for 1 h. Gels were stained for 10 min with 4 µL SYBR Safe (Invitrogen) mixed with 50 mL of TBE. Using blue light (Dark Reader; Clare Chemical Research), we cut a gel slice from around 180 to 500 nt, which was shredded as previously mentioned, and 400 µL of ultrapure water were added to elute the DNA by rotating the tube at room temperature for at least 2 h. Both the eluate and gel debris were transferred to a Spin-X centrifuge tube filter (pore size 0.45 µm, Sigma-Aldrich) and centrifuged at 8000 rpm for 1 min. Then, 40 µL of 3 M NaOAc, 2.5 µL of glycogen and 1300 µL of prechilled 100% ethanol were added and centrifuged at maximum speed for 20 min at 4°C. We washed the pellet with 500 µL of prechilled 70% ethanol, and after centrifuging at maximum speed for 2 min and removing the ethanol, the pellet was dried placing the tube with the lid open in a 37°C heat block for around 10 min. We resuspended the pellet in 10 µL of EB buffer for at least 10 min and a final quality control of each library was run using the Agilent 2100 bioanalyzer with the DNA 1000 Kit (Agilent).

Sequencing

Libraries were sequenced in single-reads with read lengths of 50 nt in Illumina HiSeq2000 machines at the Genomics Unit of the CRG. All raw sequences used in this project have been deposited in the Short Read Archive of the European Nucleotide Archive under project number PRJNA380612.

Mapping of Illumina reads and determination of enzymatic cleavage points

Illumina reads were aligned with tophat2 version 2.0.9 with the --no-coverage-search option enabled (Trapnell et al. 2009). SOLiD reads from the original PARS publication (Kertesz et al. 2010) were aligned with SHRiMP version 2.2.3 (Rumble et al. 2009). The reference sequence set depended on the data analyzed: S. cerevisiae S288C full chromosomes were downloaded from the Saccharomyces Genome Database (SGD) (Cherry et al. 2012) and concatenated with the sequences of the control molecules spiked into each experiment. Reads aligned nonuniquely (i.e., having mapping quality below 20) were ignored. Subsequently, enzyme cleavage positions were determined as follows. For each read alignment, we retrieved the 5′-end position in the reference genome and compared it to the genome annotation. If the position coincided with an exonic region of the genome, the information about the cleavage site was stored. The resulting digestion profile is stored as the number of cuts per position of the transcript. The load is defined as the average number of cuts per position. We provide all necessary scripts to perform this operation in the following repository (https://github.com/Gabaldonlab/nextPARS).
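The counting step described above can be illustrated with a minimal Python sketch. It is not the code from the repository: the input format (a list of 5′-end positions in 0-based transcript coordinates, each with a mapping quality) and the function names `count_cuts` and `load` are assumptions made for illustration, and the 5′-end position is taken as the cut site without any offset.

```python
from collections import Counter

MIN_MAPQ = 20  # reads with mapping quality below 20 are treated as nonunique


def count_cuts(alignments, transcript_length, min_mapq=MIN_MAPQ):
    """Return the number of cuts at each position of one transcript.

    alignments: iterable of (five_prime_pos, mapq) tuples, where
    five_prime_pos is a 0-based transcript coordinate (hypothetical
    input format; a real pipeline would read these from alignment files).
    """
    counts = Counter(
        pos
        for pos, mapq in alignments
        if mapq >= min_mapq and 0 <= pos < transcript_length
    )
    return [counts.get(i, 0) for i in range(transcript_length)]


def load(profile):
    """Average number of cuts per position of a digestion profile."""
    return sum(profile) / len(profile) if profile else 0.0


# Toy data: one low-quality read (mapq 3) and one read outside the
# transcript (position 12) are discarded.
reads = [(0, 50), (0, 50), (3, 50), (3, 3), (7, 40), (12, 50)]
profile = count_cuts(reads, transcript_length=10)
print(profile)
print(load(profile))
```

Keeping the profile as a plain per-position list makes the later steps (load filtering, capping, normalization) simple element-wise operations.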

Assessment of correlations between digestion profiles

We compared all sequencing runs in a pairwise manner in terms of the number of enzymatic cleavage points (cuts) for all transcript positions. For each pair of sequencing runs, we retrieved the number of cuts in all positions from all transcripts meeting the threshold described below and computed the Spearman's correlation coefficients to ensure that the results were consistent. The following is an example of the number of cuts in the first 20 nt positions (separated by semicolons) for the same control molecule (HOT2) in two individual sequencing runs being compared, one from nextPARS and one from PARS:

nextPARS: 971;268;191;234;387;639;460;114;20;18;23;18;112;111;106;14;61;57;43;21;…
PARS: 136;42;20;19;15;27;47;77;126;25;8;13;19;4;2;5;6;24;30;8;…

Since each run of the experiment will produce different values, and transcript expression levels may differ each time (as can be seen by the values in the example above), we used the rank-based Spearman's value to see if coinciding positions have the same relative number of cuts for the given enzyme (the correlation coefficient in the above example was 0.498 with a P-value of 1.826 × 10^−18). We defined the transcript load as the average number of inferred enzymatic cuts per position in a given transcript. We expect the load to depend on the relative concentration of the transcript in the sample and the depth of the sequencing run. Correlations at different load cut-offs are shown in Supplemental Table S3. From this, we determined that a load of 5.0 or greater is optimal to retain a sufficient number of transcripts that all have a high enough expression level and sequencing depth to produce consistent results, so for subsequent analyses, transcripts with a load below 5.0 (on average less than five cuts per position of a transcript) in a given run were ignored. We computed correlations among replicates within each nextPARS and PARS experiment, as well as between the two approaches. 
Since nextPARS uses 50 nt reads, the final 50 nt of each probed molecule are uninformative and are not included in the correlation calculations involving nextPARS (correlations for the original PARS with itself do include the final 50 positions). The resultant load values for the HOT2 sequencing runs in the example above were 174.9 and 16.3 for nextPARS and PARS, respectively. Table 2 shows the average correlations for the whole set of yeast transcripts in all pairwise comparisons of sequencing runs, as well as for the three control molecules shared by the different experiments (TETp4p6, TETp9p9.1, HOT2), while Supplemental Table S1 shows all correlation results for each individual pairwise comparison.
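As an illustration of the comparison, the sketch below computes a Spearman correlation in plain Python (average ranks for ties, then Pearson on the ranks) and applies the load cut-off of 5.0. It uses only the first 20 HOT2 positions quoted above, so its result is not expected to reproduce the reported coefficient of 0.498, which was computed over the whole molecule. The function names are illustrative choices, not the authors' code.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def load(profile):
    """Average number of cuts per position."""
    return sum(profile) / len(profile)


MIN_LOAD = 5.0  # transcripts below this load in a run are ignored

nextpars = [971, 268, 191, 234, 387, 639, 460, 114, 20, 18,
            23, 18, 112, 111, 106, 14, 61, 57, 43, 21]
pars = [136, 42, 20, 19, 15, 27, 47, 77, 126, 25,
        8, 13, 19, 4, 2, 5, 6, 24, 30, 8]

if load(nextpars) >= MIN_LOAD and load(pars) >= MIN_LOAD:
    print(spearman(nextpars, pars))
```

Because the coefficient depends only on ranks, it is unaffected by the large difference in absolute counts between the two runs.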

Computation of nextPARS scores

In brief, nextPARS scores are derived in two phases, which are then combined into a final score, as follows. Supplemental Figure S6 shows an example of how raw scores are transformed in each step of the scoring procedure for one of the control molecules (HOT2).

Phase I: scores from raw experimental data ($S_{profile}$)

The input is a digestion profile indicating the raw number of enzymatic cuts per position. One such profile is available for each enzyme and replicate (Supplemental Fig. S6A). Raw numbers are capped to a given maximum percentile (the default is 95%, meaning the upper 5% of the values are set to the value at the 95th percentile). This step is introduced because there are often positions that are preferentially cleaved, with numbers of cuts that are orders of magnitude greater than at other positions (Supplemental Fig. S6B). Then read counts from each digestion in a given molecule are normalized to their average, giving an average of 1 read per position in each molecule and run of the experiment (Supplemental Fig. S6C). This accounts for a number of factors, including the different expression levels of each molecule, the potentially different sequencing depth of each run, and the different cutting rates of the S1 and V1 enzymes, allowing for a comparable range of values among all of the digestions. When comparing the read counts per site between S1 and V1 experiments, including only those molecules shared between both PARS and nextPARS data sets having an average of at least five counts per site (420 molecules in total), a Student's t-test showed that in the PARS experiments, S1 experiments had significantly greater counts per site than V1 experiments (P = 5.1 × 10^−14), while the reverse was true for nextPARS (P = 2.2 × 10^−16). It is therefore important to have a normalized value before comparing V1 against S1. 
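The capping and mean-normalization steps can be sketched as follows. This is a minimal illustration rather than the pipeline's implementation: the text does not say how the percentile is computed, so linear interpolation between order statistics is assumed here, and the function names are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between order statistics.

    The interpolation rule is an assumption; the source only states that
    values are capped at a maximum percentile (95 by default).
    """
    if not values:
        raise ValueError("cannot take a percentile of an empty profile")
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def cap_profile(profile, cap_percentile=95.0):
    """Set every value above the given percentile to that percentile."""
    cap = percentile(profile, cap_percentile)
    return [min(v, cap) for v in profile]


def normalize_to_mean(profile):
    """Divide by the mean so the profile averages 1 read per position."""
    mean = sum(profile) / len(profile)
    if mean == 0:
        return [0.0] * len(profile)
    return [v / mean for v in profile]


# Toy profile with one preferentially cleaved position.
raw = [3, 5, 2, 4, 1000, 6, 3, 5, 4, 2]
capped = cap_profile(raw)
normalized = normalize_to_mean(capped)
print(max(capped))
print(sum(normalized) / len(normalized))
```

Capping before normalizing keeps a single hyper-cleaved site from dominating the mean and compressing every other position toward zero.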
Also, since part of the nextPARS protocol involves mapping reads with a length of 50 nt, the final 50 sites have no information, and so these sites are given scores of 0 and not considered when normalizing to the average. When replicates are available, we obtain a single list of values for each of the V1 and S1 digestions by taking the average at each position over all of the normalized V1 and S1 profiles, respectively (Supplemental Fig. S6D). With these two lists we can calculate combined scores per position in a manner similar to the original PARS protocol, but now with a more reliable footing. Since the aim is to use these scores to determine whether each position is paired or unpaired, we must compare to a given threshold, and thus we need a standard range of values for the scores. After the steps taken so far, there are no set maximum or minimum values for each position in either the V1 or S1 lists, so the combined scores must be normalized. A few different methods have been tested on the benchmark molecules: (A) subtract S1 from V1, then normalize to give a maximum value of 1.0; (B) normalize V1 and S1 values, separately, to a maximum value of 1.0, then subtract S1 values from the corresponding V1 values, so that positive values suggest a paired site and negative values suggest an unpaired site. However, since we want to be able to apply a universal threshold value to scores when determining structural constraints, we need to have a fixed range of values (ideally from −1.0 to +1.0). Method B will generally have bounds smaller than this range because at most sites there are cuts from both enzymes. So we then tried method (C), which is to first subtract S1 from V1 values, then normalize positive values to have a maximum of 1.0 and negative values to have a minimum of −1.0. With this method, there is still the potential bias toward one enzyme or the other (as mentioned above in subsection iii). 
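The replicate averaging and normalization methods B and C described so far can be sketched as small composable functions. This is an illustrative reading of the text, not the authors' code; the function names and the toy values are assumptions.

```python
def average_replicates(replicates):
    """Position-wise mean across replicate profiles of equal length."""
    n = len(replicates)
    return [sum(vals) / n for vals in zip(*replicates)]


def scale_to_max(values):
    """Divide by the maximum so that the largest value becomes 1.0."""
    m = max(values, default=0.0)
    return [v / m for v in values] if m > 0 else [0.0] * len(values)


def split_rescale(diff):
    """Scale positive values to a maximum of +1.0, negative to a minimum of -1.0."""
    pos_max = max((d for d in diff if d > 0), default=0.0)
    neg_min = min((d for d in diff if d < 0), default=0.0)
    out = []
    for d in diff:
        if d > 0:
            out.append(d / pos_max)
        elif d < 0:
            out.append(-d / neg_min)  # neg_min < 0, so the result stays negative
        else:
            out.append(0.0)
    return out


def method_b(v1, s1):
    """Normalize V1 and S1 separately to a maximum of 1.0, then subtract."""
    return [v - s for v, s in zip(scale_to_max(v1), scale_to_max(s1))]


def method_c(v1, s1):
    """Subtract S1 from V1, then rescale positive and negative values separately."""
    return split_rescale([v - s for v, s in zip(v1, s1)])


# Toy mean-normalized profiles, two replicates per enzyme.
v1 = average_replicates([[2.0, 0.5, 0.1, 1.4], [1.8, 0.7, 0.3, 1.2]])
s1 = average_replicates([[0.2, 1.0, 3.0, 1.1], [0.4, 1.2, 2.6, 0.9]])
print(method_b(v1, s1))
print(method_c(v1, s1))
```

In this toy case method B stays strictly inside the interval from −1.0 to +1.0, while method C reaches both bounds. Applying `split_rescale` to the output of `method_b` applies the two normalizations in sequence.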
So we tested method (D), which essentially combines B and C by first normalizing S1 and V1 values to a maximum of 1.0, subtracting, and then normalizing positive values to 1.0 and negative values to −1.0, to ensure that sites with the strongest V1 score always have a score of 1.0 and those with the strongest S1 score always have a score of −1.0. The effect is not large, since the sites cut most frequently by one enzyme are typically cut infrequently by the other, but it ensures the appropriate range for the final scores. The justifications for the normalization technique and the maximum percentile cap are shown in Supplemental Figure S7. The final normalization method chosen was D. After normalization we produce a single nextPARS score (Supplemental Fig. S6E).

Phase II: scores from a recurrent neural network (RNN) classifier ($S_{RNN}$)

To enhance the nextPARS score, a prior classification score can be inferred from the local sequence context of each nucleotide. The classification method takes each nucleotide together with its neighboring nucleotides and calculates the probability of it being in a double-stranded (DS) or single-stranded (SS) position in the folded secondary structure, based on a data set of known RNA secondary structures of other molecules. The classifier used here for DS/SS prediction is built on a recurrent neural network (RNN) model constructed with a long short-term memory (LSTM) layer and a dense neural network layer (Hochreiter and Schmidhuber 1997; Sutskever et al. 2014). Sequence fragments used as input for the network are represented by a binary vector in which each nucleotide character is represented by a 4-bit one-hot encoding vector. The final input vector size for a k-mer sequence fragment is n = 4k. The input vectors are fed to the first hidden LSTM layer, which consists of n LSTM cells. The output of the LSTM layer is then propagated to a fully connected neural network layer with a sigmoid activation function. 
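The input encoding can be illustrated with a short sketch. The nucleotide order (A, C, G, U) and the mapping of T to U are assumptions made here; the text only specifies a 4-bit one-hot vector per nucleotide and a total length of n = 4k for a k-mer.

```python
NUCLEOTIDES = "ACGU"  # assumed channel order; not specified in the text


def one_hot(fragment):
    """Encode a k-mer as a flat binary vector of length 4 * k."""
    vector = []
    for base in fragment.upper().replace("T", "U"):
        if base not in NUCLEOTIDES:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        vector.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return vector


encoded = one_hot("ACGUACG")  # k = 7
print(len(encoded))
print(encoded[:8])
```

Each block of four entries contains exactly one 1, so the vector carries no ordinal relationship between nucleotides.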
Training of the RNN classifier is performed using the Adam optimizer with cross entropy as the loss function (Kingma and Ba 2015). We trained this classifier with RNA molecules of known secondary structure downloaded from the RNA STRAND Database (Andronescu et al. 2008), filtering for those molecules with 30 or more nucleotides that were validated by NMR or X-ray, from any source. We removed from this data set the TETp4p6 and the rRNA molecules used as controls in our experiments. The molecules in the final training set ranged in length from 30 to 3032 nt, with a total of 484,539 nt and an average length of 683 nt. These sequences were fragmented into k-mer subsequences, and the corresponding class (DS/SS) from the molecule's secondary structure was assigned to the nucleotide in the center of the k-mer subsequence. We trained separate k-classifiers for different k-mer sizes (k = 7, 9, 11, 13, 15) on the filtered RNA STRAND data set. Supplemental Table S4 lists the number of fragments in the constructed training set for different values of k. The RNN classification score is calculated according to the following equation:

$$S_{RNN} = \sum_{k} w_k S_k$$

where $S_{RNN}$ is the final RNN classification score, $S_k$ is the classification score from each k-classifier for k-mer sequence fragments, and $w_k$ is a weight associated with each k-classifier, which we set to 0.25 for all five values of k.

Phase III: final nextPARS score

The final nextPARS score is then calculated according to the following equation (Supplemental Fig. S6F):

$$S_{nextPARS} = w_{profile} S_{profile} + w_{RNN} S_{RNN}$$

where $S_{profile}$ is the score calculated from the nextPARS experimental data in Phase I, $S_{RNN}$ is the RNN score obtained in Phase II, and $w_{profile}$ and $w_{RNN}$ are adjustable weights for each score, which we set to 0.5. The full computational pipeline to derive nextPARS scores and structural constraints from raw nextPARS data can be found at https://github.com/Gabaldonlab/nextPARS. A few options are provided to output scores or other values in a number of formats. 
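The construction of training examples and the combination of classifier scores can be sketched as below. Several points are assumptions made for illustration: the known structure is taken in dot-bracket notation (any non-dot character counts as paired), only positions with a full window on both sides yield a fragment because edge handling is not described, and the per-classifier scores are combined as a weighted sum. The toy scores are invented.

```python
def kmer_training_pairs(sequence, structure, k):
    """Return (k-mer, label) pairs, labeling each k-mer by its central nucleotide.

    structure: dot-bracket string of the same length as sequence (assumed
    input format); '.' is single-stranded (SS), anything else double-stranded (DS).
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that the k-mer has a central nucleotide")
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    half = k // 2
    pairs = []
    # Assumption: positions without a full window are skipped.
    for center in range(half, len(sequence) - half):
        fragment = sequence[center - half:center + half + 1]
        label = "SS" if structure[center] == "." else "DS"
        pairs.append((fragment, label))
    return pairs


def weighted_score(scores_by_k, weights):
    """Weighted sum of per-classifier scores (assumed form of the combination)."""
    return sum(weights[k] * scores_by_k[k] for k in scores_by_k)


pairs = kmer_training_pairs("GGGAAAUCCC", "(((....)))", k=5)
print(pairs)

weights = {k: 0.25 for k in (7, 9, 11, 13, 15)}  # weight stated in the text
toy_scores = {7: 0.9, 9: 0.8, 11: 0.7, 13: 0.6, 15: 0.5}  # invented values
print(weighted_score(toy_scores, weights))
```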
Users can output the nextPARS scores for given molecules both before and after the application of the RNN classifier. By applying a threshold, the score can be converted into a structure preference profile (SPP), which is modeled on that used in the SeqFold protocol (Ouyang et al. 2013). For each position in the molecule, if the combined score is greater than the threshold, it is assigned a value of 0, which indicates a paired site. If the score is less than the negative of the threshold, it is assigned a value of 1, for an unpaired site. Otherwise it is assigned “NA,” as there is not enough information at that site to say definitively whether it should be paired or unpaired. Scores can also be output in a format compatible with the program VARNA (Visualization Applet for RNA, http://varna.lri.fr/ [Darty et al. 2009]), which allows visualization of secondary structures colored by the scores. The README document in the GitHub repository above describes how to produce any of these files.
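The thresholding rule for the structure preference profile can be written in a few lines. This is a minimal sketch of the rule as stated; the function name and the threshold value in the example are arbitrary choices, not defaults of the pipeline.

```python
def structure_preference_profile(scores, threshold):
    """Convert per-position scores into SPP symbols.

    "0" marks a paired site (score > threshold), "1" an unpaired site
    (score < -threshold), and "NA" a site without enough information.
    """
    spp = []
    for score in scores:
        if score > threshold:
            spp.append("0")
        elif score < -threshold:
            spp.append("1")
        else:
            spp.append("NA")
    return spp


# Arbitrary example threshold of 0.5.
print(structure_preference_profile([0.9, -0.9, 0.1, 0.5, -0.5], threshold=0.5))
```

Because both comparisons are strict, a score exactly at the threshold (or its negative) is reported as "NA".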

Comparison of nextPARS and PARS scores on known structures

We used five benchmark structures (TETp4p6, and the Saccharomyces cerevisiae ribosomal RNAs 5S, 18S, 25S, and 5.8S). The PDB IDs for their structures are 1GID for TETp4p6 and 4V88 for the ribosomal RNAs. 4V88 is a collection of multiple older PDB IDs: 3U5H contains RDN5-1 (5S), RDN25-1 (25S), and RDN58-1 (5.8S) as distinct chains, and 3U5F contains RDN18-1 (18S). The structures provided by PDB contain 3D coordinates of the molecules, so we converted these to connectivity table files (which represent secondary structures) using the RNApdbee webserver (Antczak et al. 2014). We then used these structures to determine true and false positive calls (where the nextPARS or PARS scores indicate a paired site) and true and false negative calls (where the scores indicate an unpaired site) when making the ROC curves in Figure 3.
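The way such calls translate into ROC points can be sketched as follows. The sketch assumes that every position is called paired when its score is at or above a sweeping threshold and unpaired otherwise; how uninformative positions were treated is not described in the text, and the scores and reference labels below are invented.

```python
def roc_points(scores, is_paired, thresholds):
    """Return (false positive rate, true positive rate) for each threshold.

    A position is called paired (positive) when score >= threshold;
    is_paired holds the reference state from the known structure.
    """
    points = []
    for t in thresholds:
        tp = sum(1 for s, p in zip(scores, is_paired) if s >= t and p)
        fp = sum(1 for s, p in zip(scores, is_paired) if s >= t and not p)
        fn = sum(1 for s, p in zip(scores, is_paired) if s < t and p)
        tn = sum(1 for s, p in zip(scores, is_paired) if s < t and not p)
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        points.append((fpr, tpr))
    return points


# Invented toy data: two paired and two unpaired reference positions.
scores = [0.9, 0.4, -0.2, -0.8]
is_paired = [True, True, False, False]
print(roc_points(scores, is_paired, thresholds=[1.0, 0.0, -1.0]))
```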

SUPPLEMENTAL MATERIAL

Supplemental material is available for this article.

34 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Selective 2'-hydroxyl acylation analyzed by primer extension and mutational profiling (SHAPE-MaP) for direct, versatile and accurate RNA structure analysis.

Authors: Matthew J Smola; Greggory M Rice; Steven Busan; Nathan A Siegfried; Kevin M Weeks
Journal: Nat Protoc Date: 2015-10-01 Impact factor: 13.491

3. Genome-wide measurement of RNA folding energies.

Authors: Yue Wan; Kun Qu; Zhengqing Ouyang; Michael Kertesz; Jun Li; Robert Tibshirani; Debora L Makino; Robert C Nutter; Eran Segal; Howard Y Chang
Journal: Mol Cell Date: 2012-09-13 Impact factor: 17.970

4. SHAPE-Seq 2.0: systematic optimization and extension of high-throughput chemical probing of RNA secondary structure with next generation sequencing.

Authors: David Loughrey; Kyle E Watters; Alexander H Settle; Julius B Lucks
Journal: Nucleic Acids Res Date: 2014-10-10 Impact factor: 16.971

5. SeqFold: genome-scale reconstruction of RNA secondary structure integrating high-throughput sequencing data.

Authors: Zhengqing Ouyang; Michael P Snyder; Howard Y Chang
Journal: Genome Res Date: 2012-10-11 Impact factor: 9.043

6. Saccharomyces Genome Database: the genomics resource of budding yeast.

Authors: J Michael Cherry; Eurie L Hong; Craig Amundsen; Rama Balakrishnan; Gail Binkley; Esther T Chan; Karen R Christie; Maria C Costanzo; Selina S Dwight; Stacia R Engel; Dianna G Fisk; Jodi E Hirschman; Benjamin C Hitz; Kalpana Karra; Cynthia J Krieger; Stuart R Miyasato; Rob S Nash; Julie Park; Marek S Skrzypek; Matt Simison; Shuai Weng; Edith D Wong
Journal: Nucleic Acids Res Date: 2011-11-21 Impact factor: 16.971

7. Genome-wide probing of RNA structure reveals active unfolding of mRNA structures in vivo.

Authors: Silvi Rouskin; Meghan Zubradt; Stefan Washietl; Manolis Kellis; Jonathan S Weissman
Journal: Nature Date: 2013-12-15 Impact factor: 49.962

8. RNA motif discovery by SHAPE and mutational profiling (SHAPE-MaP).

Authors: Nathan A Siegfried; Steven Busan; Greggory M Rice; Julie A E Nelson; Kevin M Weeks
Journal: Nat Methods Date: 2014-07-13 Impact factor: 28.547

9. Landscape and variation of RNA secondary structure across the human transcriptome.

Authors: Yue Wan; Kun Qu; Qiangfeng Cliff Zhang; Ryan A Flynn; Ohad Manor; Zhengqing Ouyang; Jiajing Zhang; Robert C Spitale; Michael P Snyder; Eran Segal; Howard Y Chang
Journal: Nature Date: 2014-01-30 Impact factor: 49.962

10. RNApdbee--a webserver to derive secondary structures from pdb files of knotted and unknotted RNAs.

Authors: Maciej Antczak; Tomasz Zok; Mariusz Popenda; Piotr Lukasiak; Ryszard W Adamiak; Jacek Blazewicz; Marta Szachniuk
Journal: Nucleic Acids Res Date: 2014-04-25 Impact factor: 16.971
