Literature DB >> 26495937

Bidirectional Direct Sequencing of Noncanonical RNA by Two-Dimensional Analysis of Mass Chromatograms.

Anders Björkbom^1,2,3, Victor S Lelyveld^1,2, Shenglong Zhang¹, Weicheng Zhang^1,4, Chun Pong Tam^1,4, J Craig Blain¹, Jack W Szostak^1,4,2.

Abstract

Mass spectrometry (MS) is a powerful technique for characterizing noncanonical nucleobases and other chemical modifications in small RNAs, yielding rich chemical information that is complementary to high-throughput indirect sequencing. However, mass spectra are often prohibitively complex when fragment ions are analyzed following either solution phase hydrolysis or gas phase fragmentation. For all but the simplest cases, ions arising from multiple fragmentation events, alternative fragmentation pathways, and diverse salt adducts frequently obscure desired single-cut fragment ions. Here we show that it is possible to take advantage of predictable regularities in liquid chromatographic (LC) separation of optimized RNA digests to greatly simplify the interpretation of complex MS data. A two-dimensional analysis of extracted compound chromatograms permits straightforward and robust de novo sequencing, using a novel Monte Carlo algorithm that automatically generates bidirectional paired-end reads, pinpointing the position of modified nucleotides in a sequence. We demonstrate that these advances permit routine LC-MS sequencing of RNAs containing noncanonical nucleotides, and we furthermore examine the applicability of this approach to the study of oligonucleotides containing artificial modifications as well as those commonly observed in post-transcriptionally modified RNAs.

Entities: Chemical Disease Species

Mesh：

Substances：
RNA

Year: 2015 PMID： 26495937 PMCID： PMC7547889 DOI： 10.1021/jacs.5b09438

Source DB: PubMed Journal: J Am Chem Soc ISSN： 0002-7863 Impact factor: 15.419

Introduction

Cellular RNA is highly modified as a result of diverse post-transcriptional mechanisms. To date, greater than 100 chemically distinct RNA modifications have been observed, with some tRNA molecules displaying noncanonical nucleotides at over 20% of positions.[1] These modifications play critical structural and informational roles in rRNA, tRNA, snRNA, and mRNA, and may be important in other regulatory RNA classes and epigenetic processes that are now being elucidated.[2] Concomitantly, synthetic modified RNAs have been employed to modulate RNA interference,[3] to enhance RNA-protein cross-linking,[4] and for therapeutic applications.[5] Alternative nucleic acid chemistries have also been extensively investigated in considering plausible prebiotic materials that may have played informational and functional roles in the origins of cellular life.[6] Information about chemical modifications is generally lost during indirect sequencing that relies on base complementarity and enzymatic polymerization. Because indirect RNA sequencing typically begins with reverse transcription using an RNA-dependent DNA polymerase, in cases where modifications affect base pairing, the resultant sequence may contain base call ambiguities.[7] In cases where modifications do not significantly affect complementarity, no modification information is transmitted to the cDNA intermediate. However, clever indirect sequencing techniques that identify a small number of modifications have been reported, including bisulfite sequencing,[8] inosine erasing,[9] and methylation-specific immunoprecipitation,[10] but each method is specially tailored to probe for the presence of a specific modified nucleotide. Direct sequencing by MS is complementary to high throughput indirect sequencing, permitting RNA modifications to be routinely detected and identified (for reviews, see the literature[11,12]). When coupled with chemical[13,14] and enzymatic[15,16] RNA degradation techniques, matrix-assisted laser desorption ionization (MALDI)[17] and electrospray ionization (ESI) MS[15] as well as tandem MS[18,19] have been widely applied for characterization and sequence confirmation of modified oligoribonucleotides. Exonucleolytic digestion has been used to prepare laddering fragments of an RNA followed by mass analysis by time-of-flight (TOF).[20] Acid hydrolysis-based generation of mass ladders has also been used to sequence RNA and to identify modifications.[13] Acid hydrolysis can be performed directly in the MALDI matrix, circumventing the need for desalting after mass ladder generation.[21] Sequence confirmation for modified L-RNA has also been demonstrated using liquid chromatography (LC) coupled ESI-MS, making use of an end-labeling technique to generate single-ended reads free of internal fragments.[22] Nonetheless, no generalized procedure for fragmentation and de novo bidirectional sequencing of modified RNA has been presented. MS offers rich chemical information in the analysis of RNA modifications with notable caveats. Complications in detecting compound signal above noise, including matrix materials and low-mass contaminants, as well as the low ionization efficiency of small oligonucleotides contribute to low sensitivity and resolution in the relevant low mass range (300–1000 Da, 1–3 nucleotides) using ESI-MS.[13,23] There also exist significant methodological inadequacies in the preparation of high quality degradative sequence ladders.[23] Ideally, for convenient and unambiguous RNA sequencing, the conditions for sequence ladder formation by fragmentation of an isolated RNA starting material should be optimized such that (a) the RNA of interest is specifically fragmented at a single cleavage site on each nucleotide, (b) the resulting family of fragments have good sequence coverage that is unaffected by sequence content or context, (c) internal fragments resulting from more than one scission event per molecule are minimized, and (d) the degradation conditions are mild enough to minimize modification loss. Under these conditions, base calling can be performed by measuring the mass differences between observed hydrolytic fragments, and searching for this difference in a database generated from known nucleotide masses.[13] By comparison, state-of-the-art methods for the direct gas-phase fragmentation of intact RNA samples into mass ladders, such as in-source decay MALDI,[24] fail to give uniform bond-specific fragmentation patterns across the RNA phosphate backbone. Despite this, impressive direct sequencing of oligonucleotides by gas-phase fragmentation has been reported,[25,26] but recent demonstrations have made use of specialized instrumentation and the resulting spectra remain complicated by product ions arising from multiple fragmentation pathways. Here we present a generic procedure that permits RNA oligonucleotides to be sequenced directly, without extraneous labeling[22,27,28] or reverse transcription[29,30] procedures. This method consists of an optimized technique for partial RNA degradation with formic acid, high performance liquid chromatography (HPLC) separation of hydrolytic fragments, online ESI-QTOF mass analysis, and a computational algorithm for read generation from survey mass spectra that generates paired end reads. Together these methods give a simple and robust approach to the sequencing of modified RNA. We further show that in the case of isomeric modifications, such as methylation, the workflow is compatible with tandem MS for disambiguation. Central to the success of the methodology are the formic acid degradation protocol and our ability to correlate fragment masses with chromatographic retention times (RTs), facilitating bidirectional sequencing. With continued improvements in MS sensitivity, we expect this direct sequencing strategy to find utility in pinpointing the chemical identity and position of nucleotide modifications in cellular RNA.

Results

Acid Hydrolysis Generates LC–MS Sequencing Ladders Ideal for Mass Analysis

Direct RNA sequencing relies on generating degradative products from a purified starting material (Figure a).[31] Hydrolytic RNA fragments produced by single scission events are directly sequenceable by observing mass differences between neutral compound masses. While exonucleolytic digestion generates degradative fragments containing the authentic 5′ or 3′ ends of the starting material, endonucleolytic digestion can rapidly hydrolyze single-cut fragments to generate confounding internal fragments. Identification of single-cut fragments can be achieved by end labeling,[32] but enzymatic digestion can be inhibited by RNA modifications[33] resulting in nonuniform formation rates and sequencing gaps.

Figure 1

Workflow for direct sequencing of modified RNA. (a) An isolated RNA starting material is partially digested in a manner that predominantly generates single-cut fragments. Compounds are identified and extracted from raw two-dimensional mass vs retention time data. In ion pairing reverse phase HPLC (IP-RP-HPLC), RNA fragment retention time is dominated by backbone phosphate content, generating an elution peak ladder that correlates with mass. Data quality can be improved by clustering or filtering coeluting mass adducts. An algorithm that traverses compound points can then generate bidirectional alignable reads from random walks along proximal mass differences. Alternatively, a manual analysis of this data set can verify the algorithm’s results or disambiguate impurities in the starting material. (b) A sequenced synthetic RNA (R2, AUAGCCCAGUm5CAGUm5CUACGC) containing two 5-methylcytosine (mC) nucleobases. Solid lines indicate walks automatically traversed by the sequencing algorithm between compounds indicated in black-outlined circles and squares. Letters indicate the base calls made using mass differences observed for each step, with the upper and lower sets arising from the sequencing ladders containing the original 5′ end (5′ ladder, circle) and 3′ end (3′ ladder, square), respectively, of the starting material.

Alternatively, acidic conditions favor RNA hydrolysis (Scheme ).[34] Each phosphodiester hydrolysis event produces a 5′ fragment with terminal 3′(2′)-monophosphate isomers and a 3′ fragment with a 5′-hydroxyl. The reaction proceeds by nucleophilic attack of the ribose 2′-hydroxyl on the vicinal 3′-phosphodiester, resulting in a pentacoordinate transition state that can, in part, resolve by cleavage of the 5′-ester of the subsequent nucleotide, releasing a newly generated 5′-hydroxyl and yielding a cyclic 2′,3′-phosphate intermediate. Water addition to this cyclic species then gives a fragment terminating in a ribonucleotide 3′(2′)-monophosphate with a forward rate that is substantially faster than the equivalent hydroxide mediated reaction.[34] In this work, we denote the single-cut sequence ladder having the original 5′ end as the “5′ ladder” and the sequence ladder containing the original 3′ end as the “3′ ladder” (Figure a).

Scheme 1

General Reaction Scheme for Acid-Mediated Hydrolysis of Ribonucleotide Phosphodiesters

We designed a series of RNA oligonucleotides (R1–R8 in Table S1) as models to optimize the method. Limited acid hydrolysis of the RNA phosphodiester backbone produced two ladders with newly generated 3′ and 5′ ends. To separate these fragments by HPLC, we employed a neutral pH mobile phase containing 1,1,1,3,3,3-hexafluoro-2-propanol (HFIP) and triethylamine (TEA) that has been demonstrated to give good separation of small RNAs by length.[35] Since retention in a methanol gradient was dominated by phosphate number rather than base composition, this convenient feature allowed for relatively simple disambiguation of the ladder identity of observed fragments. Sequence ladders of the base-modified RNA R2 having two 5-methylcytosines in positions 11 and 15 were separated and are shown as a plot of extracted ion chromatogram “compounds” (Figure b). The ladders eluted along a sigmoidal mass vs RT curve, with clear separation between 5′ and 3′ ladders (the ladder of fragments containing the original 5′ and 3′ ends, respectively), with the former showing an increase in RT. Since fragments containing the original 5′ end retain a terminal 3′(2′)-monophosphate following acid hydrolysis, the 5′ ladder shows increased RT relative to fragments in the 3′ ladder of the same sequence length. The inset in Figure S2 shows terminal loss fragments starting from an RNA with cytidine at both termini. Fragments resulting from single cleavage event showed distinct RT despite having the same base composition, differing only by a single terminal 3′(2′)-phosphate. This RT shift would be more pronounced for a species carrying additional 5′ charged groups, such as a 5′-phosphate, triphosphate, or capping group found in native cellular RNAs, further separating the fragments with the authentic 5′ end from all others. Because acid hydrolysis can rapidly generate internal fragments by multiple scission events from a starting molecule, we optimized the reaction to generate predominantly single-cut fragments (Figure ). Formic acid (FA, pKa = 3.77) is a mild and volatile organic acid used extensively in MS because it has a low boiling point (100.8 °C) and can therefore easily be removed by lyophilization.[36] The 20-nt RNA R1 degraded with a first-order rate constant of 3.1 h–1 at 50 μM in 50% (v/v) formic acid at 40 °C. The extent of hydrolysis was controlled by varying the treatment time, which was terminated by flash-freezing and lyophilization. The abundance of single-cut ladder fragments initially increased followed by a decrease due to overdigestion, while the intensity of the full length material monotonically decreased and was nearly undetectable at 120 min (Figure a,c). We detected 38 of the 39 expected single-cut fragments of this 20-nt RNA (Figure e). The 3′ terminal nucleoside ionized too weakly for assignment with stringent thresholding; generally, ionization efficiency diminished with decreasing length. During the first 5 min the single-cut fragments had higher intensities than the internal fragments, but by 15 min the prevalence of internal fragments began to outstrip some single-cuts (Figure b,d). Given these considerations, we chose 5 min as the optimal treatment time for a 20-nt RNA. In principle, this treatment time should be inversely adjusted for RNAs of different length.

Figure 2

Optimizing formic acid digestion to yield single-cut RNA fragments. (a) Heat map of single-cut fragment intensities in the 5′ ladder of the 20-nt RNA R1 (AUAGCCCAGUCAGUCUACGC) generated by digestion in 50% (v/v) formic acid at 40 °C. The intensities were normalized to the highest observed intensity for each fragment over a 120 min formic acid treatment. The point spacing for the time axis is shown in panel b, and the measured fragment masses set the point spacing on the mass axis. (b) Absolute intensities for selected ladder fragments arising from either one hydrolysis event (magenta) or more than one hydrolysis event (cyan) for R1. The desired single-cut fragments arise from hydrolysis of one phosphodiester bond at any position in the starting material, while undesired internal fragments arise from additional hydrolysis. (c) Normalized intensities for selected ladder fragments in the 5′ ladder over time. (d) Normalized intensities for all ladder fragments in the 5′ ladder for a 5 min (circle), 15 min (triangle) and 30 min (square) formic acid treatments. (e) Number of ladder fragments for both sequence ladders of R1 arising from one hydrolysis event (magenta) or more than one hydrolysis event (cyan) as a function of increasing formic acid treatment time. The dashed line represents the theoretical maximum number of desired fragments (39 fragments), of which we assigned 38 with stringent thresholding. The internal fragments were calculated as the number of possible fragments with unique mass. Experiments were run in triplicates, and error bars represent s.e.m.

This hydrolysis condition is less acidic and at a lower temperature than a reported protocol suggesting the use of 0.75% trifluoroacetic acid (pKa = 0.23) for use in MALDI[13] and yet achieves significantly better bidirectional overlapping sequence coverage for the same sample treatment time. We did not detect significant acid-mediated reactions of canonical or noncanonical nucleotides used here, suggesting that this procedure is suitable for routine use. However, unhindered primary alcohols may be formylated under these conditions,[37] a product we observed on an artificially 3′ Cy3 dye labeled RNA oligonucleotide (R4), which contains a 3′ terminal propanol group as an artifact of solid phase synthesis (Figure S3, see Table S1 for structure). The ribose 5′-hydroxyl formylation product was not typically observed in significant abundance, but a +CO mass addition could nevertheless be considered a neutral adduct during analysis.

Computational Algorithm to Read Modified RNA Sequences

RNA sequencing reads were generated de novo from compounds identified by their mass, RT, and abundance. The algorithm proceeds by generating compounds by molecular feature extraction, clustering related mass adducts appearing at similar RT, and filtering the final set of compounds by number of observed charge states and other factors (Figure a and Figure S1). Subreads were generated by performing biased random walks across compound points using a series of step rules and a probabilistic criterion to generate a set of candidate sequencing trajectories (see Materials and Methods). A walk begins from every compound, and each walk terminates when no legal step can be taken that would result in a base call. A final sequence is determined by alignment[38] of subread fragments and an intensity-weighted scoring metric. Workflow for direct sequencing of modified RNA. (a) An isolated RNA starting material is partially digested in a manner that predominantly generates single-cut fragments. Compounds are identified and extracted from raw two-dimensional mass vs retention time data. In ion pairing reverse phase HPLC (IP-RP-HPLC), RNA fragment retention time is dominated by backbone phosphate content, generating an elution peak ladder that correlates with mass. Data quality can be improved by clustering or filtering coeluting mass adducts. An algorithm that traverses compound points can then generate bidirectional alignable reads from random walks along proximal mass differences. Alternatively, a manual analysis of this data set can verify the algorithm’s results or disambiguate impurities in the starting material. (b) A sequenced synthetic RNA (R2, AUAGCCCAGUm5CAGUm5CUACGC) containing two 5-methylcytosine (mC) nucleobases. Solid lines indicate walks automatically traversed by the sequencing algorithm between compounds indicated in black-outlined circles and squares. Letters indicate the base calls made using mass differences observed for each step, with the upper and lower sets arising from the sequencing ladders containing the original 5′ end (5′ ladder, circle) and 3′ end (3′ ladder, square), respectively, of the starting material. Optimizing formic acid digestion to yield single-cut RNA fragments. (a) Heat map of single-cut fragment intensities in the 5′ ladder of the 20-nt RNA R1 (AUAGCCCAGUCAGUCUACGC) generated by digestion in 50% (v/v) formic acid at 40 °C. The intensities were normalized to the highest observed intensity for each fragment over a 120 min formic acid treatment. The point spacing for the time axis is shown in panel b, and the measured fragment masses set the point spacing on the mass axis. (b) Absolute intensities for selected ladder fragments arising from either one hydrolysis event (magenta) or more than one hydrolysis event (cyan) for R1. The desired single-cut fragments arise from hydrolysis of one phosphodiester bond at any position in the starting material, while undesired internal fragments arise from additional hydrolysis. (c) Normalized intensities for selected ladder fragments in the 5′ ladder over time. (d) Normalized intensities for all ladder fragments in the 5′ ladder for a 5 min (circle), 15 min (triangle) and 30 min (square) formic acid treatments. (e) Number of ladder fragments for both sequence ladders of R1 arising from one hydrolysis event (magenta) or more than one hydrolysis event (cyan) as a function of increasing formic acid treatment time. The dashed line represents the theoretical maximum number of desired fragments (39 fragments), of which we assigned 38 with stringent thresholding. The internal fragments were calculated as the number of possible fragments with unique mass. Experiments were run in triplicates, and error bars represent s.e.m. We used hierarchical clustering of mass adducts to augment compound intensity. Co-eluting neutral and charge-carrying adducts were recursively clustered, such that their integrated intensities are combined with that of the main peak. This increased the intensity of ladder fragment compounds, better differentiated singly cut fragments from coeluting internal fragments of the same length, and reduced the data complexity in the region critical for generating sequencing reads. Base calling relied on the observed mass difference between adjacent fragments along a sequencing trajectory, where mass differences were converted to base calls by a database search of nucleotide and modified nucleotide masses. A series of contiguous base calls in a trajectory generated a subread that was aligned with other subreads to produce a consensus sequence from each ladder. Chromatographic separation of sequence ladders simplified identification of reads in the same orientation. Subreads belonging to each ladder could be automatically clustered by bifurcating the set of walks with a locally regressed midline through the observed compound data set. This step made use of locally weighted scatter plot smoothing (LOWESS) to generate a reasonable clustering border, and walks were grouped together based on their position relative to this border. Internal fragments could have similar masses and RTs as desired single-cut fragments, leading to erroneous base calls. However, the algorithm biases each step toward a nearby compound with high local peak intensity and also scores steps based on that intensity. Since internal fragments were in lower abundance than single-cut fragments as a result of our sample preparation, internal fragment ladders generated subreads that were less frequent, shorter, and lower scoring than bona fide reads. Critically, filtering on subread length further minimized reads containing steps off of the main ladder, since these usually terminate a sequencing trajectory prematurely. Direct LC–MS sequencing of modified RNA. In panels (a–d), circles denote compounds on the 5′ ladder (containing the original 5′ end) and squares denote those on the 3′ ladder (containing the original 3′ end). (a) Sequencing of an RNA that contains only canonical bases (R1: AUAGCCCAGUCAGUCUACGC). (b) Sequencing of a modified 20-nt RNA (R8: As2UCs2UCCCAGUCAGUCUACGC). (c) Sequencing of a 5′-biotinylated 15-nt RNA (R6: Biotin-GCGUAUACUGUCUCG). (d) Sequencing of a 20-nt RNA with the bulky hydrophobic 3′ modification Cy3 (R4: AUAGCCCAGUCAGUCUACGC-Cy3). The algorithm was used to sequence a variety of modified and unmodified RNAs (Figure b, 3, 5a, S3, and S5). For the 20-nt RNA R2 (Figure b), the two methylated residues were identified in the expected positions in both sequence ladders with mass differences of 319.0564 ± 0.001 Da on each ladder, corresponding to methylated cytidine monophosphate. There are four isomeric RNA bases that would yield a mass difference of 319.0569 Da upon loss: 3-methylcytidine (m3C), 5-methylcytidine (m5C), N4-methylcytidine (m4C) and 2′-O-methylcytidine (Cm).[1] The hydrolysis mechanism can be used to exclude 2′-hydroxyl methylation.[13] From MS1 data, we therefore obtained the position and exact mass of the modification, and we deduced that the modified nucleotides are one of the three documented cytosine methylations.

Figure 3

Direct LC–MS sequencing of modified RNA. In panels (a–d), circles denote compounds on the 5′ ladder (containing the original 5′ end) and squares denote those on the 3′ ladder (containing the original 3′ end). (a) Sequencing of an RNA that contains only canonical bases (R1: AUAGCCCAGUCAGUCUACGC). (b) Sequencing of a modified 20-nt RNA (R8: As2UCs2UCCCAGUCAGUCUACGC). (c) Sequencing of a 5′-biotinylated 15-nt RNA (R6: Biotin-GCGUAUACUGUCUCG). (d) Sequencing of a 20-nt RNA with the bulky hydrophobic 3′ modification Cy3 (R4: AUAGCCCAGUCAGUCUACGC-Cy3).

Figure 5

Probing sequencing gaps and nucleotide isomers by MS/MS. (a) Automated sequencing of a modified 20-nt RNA with a 2′-O-methylation in position C11 (R5: AUAGCCCAGUCmAGUCUACGC). 2′-O-modifications inhibit acid hydrolysis of the phosphate backbone leading to weak or missing ladder fragments. The mass of the missing fragment in the lower 3′ ladder corresponded to A + C + methyl (gray text). (b) Structure of the missing dimer determined by MS/MS analysis, with characteristic fragment ions as labeled. (c) Fragmentation spectrum following extended acid hydrolysis (80 °C, 75% (v/v) formic acid, 2 h) to increase the abundance of the dimer. MS/MS data was collected for the modified dimer and fragment ions were used to confirm that the methylation is on the ribose 2′ position of cytidine and the sequence is CmA. Assignable fragments labels are indicated on the dimer structure in panel (b).

The algorithm was able to reconstruct bidirectional reads in the presence of significant compound noise. For instance, for the 20-nt RNA R8 (Figure b), two thiolated uridines (corresponding to 2- or 4-thiouridine) were identified as a mass differences of 322.001 ± 0.0031 Da. In this case, 273 compounds were initially identified by molecular feature extraction with a signal-to-noise ratio (SNR) > 20, of which 189 compounds remain following adduct clustering, and 180 remain after limiting the overall mass range. The random walk algorithm in this case generated over 400 redundant sequencing trajectories, resulting in an alignable bidirectional read of the 20-mer with 20 out of 20 base calls in one orientation (upper 5′ ladder) and 18 out of 20 in the reverse orientation (lower 3′ ladder). The two remaining bases on the 3′ ladder have low ionization efficiency and retention with these instrument parameters because they exhibit one or zero phosphates for the 3′ terminal dimer and monomer, respectively, and were often not observed above the specified SNR. The overall procedure gives excellent reproducibility, with equivalent sequencing results in multiple independent replicates (Figure S5). The robustness of the algorithm permits high fidelity results from across a wide range of digest times, from 3–15 min (Figure ). In an extreme case, two of three independent samples overdigested for 30 min result in error-free final reads. In these cases, the algorithm began from more than 500 compounds, such that the desired reads were largely obscured by internal fragments. Reads arising from both internal fragment ladders and the bona fide single-cut ladder contributed to the final alignments.

Figure 4

Sequencing error rate as a function of RNA digest time at 40 °C in 50% (v/v) formic acid for independent triplicate digests of the 20-nt RNA R1. Error rate was defined as (No. of incorrect base calls)/(sequence length) × 100%.

While ion pairing retention is predominantly dictated by phosphate number,[39] base composition can also affect RT. This second-order effect was beneficial, since it permitted separation of RNAs having partial sequence heterogeneity (Figure S6). In Figure S6a, two RNAs having similar sequences were sequenced simultaneously. R1 and R3 differ in the first five bases, and four ladders were therefore observed. For fragments shorter than 16-nt, the 3′ ladders for R1 and R3 have identical sequence and therefore converged. Figure S6b shows that mobility differences arising from only a single base difference (16mer in 3′ ladders) gave rise to a 20 s separation. This resulted in unique RT for all fragments in the 5′ ladders. The chromatographic separation between R1 and R3 simplified analysis, such that ladders could be identified based on inspection of the retention. Both RNAs were correctly sequenced by a manual analysis, demonstrating that at least in some cases multiple RNAs could be sequenced simultaneously and that small differences in composition could be adequate to separate sequence ladders arising from different RNAs. Sequencing error rate as a function of RNA digest time at 40 °C in 50% (v/v) formic acid for independent triplicate digests of the 20-nt RNA R1. Error rate was defined as (No. of incorrect base calls)/(sequence length) × 100%.

Effect of Hydrophobic Modifications on Ladder Separation

Bulky hydrophobic modifications should significantly increase fragment retention time. To examine extreme cases of bulky modifications, we examined the laddering behavior of biotinylated and dye-modified oligonucleotides. We directly sequenced a 5′-biotinylated RNA, R6 (Figure c). This modification is less polar than RNA and led to marked separation of the sequence ladders. Further, modification position had a large effect on ladder separation. A 3′ terminal hydrophobic modification inverted ladder elution order, such that the 3′ ladder exhibited longer RT than the 5′ ladder. This effect was clearly apparent in the case of a 3′ terminally Cy3 labeled RNA (R4, Figure d). Bulky internal hydrophobic modifications yielded similar effects, as observed for the RNA R7, incorporating a Cy3 group between G13 and U14 (Figure S4). Fragments that retained Cy3 were clearly separated from the rest of the laddering pattern.

Benefit of Bidirectional Reads and Scenarios Requiring MS/MS Confirmation

Our procedure does not rely on MS/MS confirmation of hydrolytic fragments, but in select cases it may be helpful to disambiguate sequencing anomalies and to probe the chemical structure of isobaric modified bases. Schemes to differentiate some nucleoside isomers by specific labeling reactions have been previously reported, such as identification of pseudouridine by cyanoethylation.[33] Enzymatic digestion of tRNA down to single nucleoside level followed by RP-HPLC also gives baseline separation of m3C, m5C and Cm.[40,41] However, selected ion fragmentation can provide additional modification information without significant changes to sample preparation. When an RNA was modified at a 2′-hydroxyl (R5, Figure a), acid hydrolysis led to gaps in the laddering corresponding to the position of the 2′-O-modification.[13] In the 5′ ladder, a surprising sequencing read through the 2′ modified position nevertheless occurs, most likely by stepping to a low intensity internal fragment in the absence of a dominant single-cut fragment at this position. This interpretation is supported by the fact that the read is not observed bidirectionally. The observed gap in the 3′ ladder at this position could be used to manually infer that the intervening sequence contains A + C + methylation. During characterization of a novel RNA, the unmodified sequence is likely to be known from indirect sequencing, allowing the likely modification mass to be deduced from observed ladder gaps. In this test case, the position of the missing fragment in the 3′ sequence ladder clearly gave the location of the modification, and the presence of a dimer-sized gap hints at a ribose 2′ modification. Confirmation was therefore obtained by MS/MS of the modified dimer fragment, by selecting the characteristic dimer ion (Figure b,c).[13] Here, we observed several assignable product ions consistent with dimer 5′-Cm-A-3′. The combination of LC–MS and MS/MS for short selected ions was thus a simple means for complete sequencing of a sparsely modified 2′-O-methylated RNA that was directly compatible with our workflow.

Discussion

A complete understanding of the informational and functional potential of nucleic acids requires sequencing methods that can directly detect chemical modifications and noncanonical bases. Here, we have presented a procedure for improved direct sequencing of RNA using a two-dimensional analysis of mass and chromatographic retention time that we expect to be useful for the analysis of synthetic oligonucleotides, artificial nucleic acids, and isolated biological samples. By optimizing sample preparation and making use of observed regularities in the resulting data sets, we have presented a number of typical cases in which our custom algorithm permits automated sequencing, as well as limiting cases in which manual analysis remains relatively straightforward. These latter cases point toward areas where future algorithmic improvements may be made. Critically, chromatographic fragment separation prior to mass determination greatly simplified assignment of related ladder fragments. In addition, fragment separation ameliorated ion suppression effects, simplified charge state deconvolution, and generally yielded uncomplicated spectra amenable to automated analysis methods. This was especially important for ladder fragments with high masses, since their charge state envelopes can overlap with that of the full length starting material. A 5 min formic acid treatment only digested approximately a quarter of the starting material for a 20-nt RNA, such that each ladder fragment is present in the injected sample at approximately 1% relative to the undigested starting material. Thus, a further benefit of LC separation is the retention-based identification and clustering of mass adducts, which decreases the complexity of the mass spectra and results in increased signal from the dominant ion of interest. Finally, because two-dimensional analysis also yielded visually identifiable laddering patterns, observable gaps in algorithmically generated sequencing trajectories, e.g., Figure a, would also reveal sites in small RNAs that contain novel unannotated modification types. Such discontinuities imply that, from among those nucleotides in the chosen reference set, no valid mass could be found that would yield a base call in that sequence region. As such, visual inspection of the sequencing trajectories may be directly useful for hypothesis generation and discovery of novel modifications. Probing sequencing gaps and nucleotide isomers by MS/MS. (a) Automated sequencing of a modified 20-nt RNA with a 2′-O-methylation in position C11 (R5: AUAGCCCAGUCmAGUCUACGC). 2′-O-modifications inhibit acid hydrolysis of the phosphate backbone leading to weak or missing ladder fragments. The mass of the missing fragment in the lower 3′ ladder corresponded to A + C + methyl (gray text). (b) Structure of the missing dimer determined by MS/MS analysis, with characteristic fragment ions as labeled. (c) Fragmentation spectrum following extended acid hydrolysis (80 °C, 75% (v/v) formic acid, 2 h) to increase the abundance of the dimer. MS/MS data was collected for the modified dimer and fragment ions were used to confirm that the methylation is on the ribose 2′ position of cytidine and the sequence is CmA. Assignable fragments labels are indicated on the dimer structure in panel (b). There are several notable limitations to our current method. The maximum length of starting material is practically limited by the accuracy and resolution of the mass analyzer. With technological advances in instrumentation, the sequenceable length will continue to increase. In general, charge state envelopes for long RNAs and their numerous metal ion adducts are broad and complicated, such that chromatographic refinements are also highly valuable. In our conditions, fragment chromatographic separation becomes decreasingly effective for fragments longer than ∼25–30-nt.[35] Adjustment of chromatographic conditions to improve separation of longer RNA fragments would reduce ambiguity by better separating the 5′ and 3′ fragment ladders. The increased retention of 5′ fragment ladders is valuable for automatic assignment of read orientation, but in special cases this property may no longer be assumed. In particular, if the starting material has a 5′-hydroxyl and a 3′-phosphate, both hydrolytic fragments will have the same approximate RT under ion pairing conditions for compositionally similar fragments. This technical problem can be mitigated by dephosphorylation (e.g., using a phosphatase). Additionally, if a significantly hydrophobic modification exists on the 3′ fragment, dramatic retention effects emerge in the resulting laddering (Figure d and Figure S3). These large effects can either improve or confound automated data analysis, but they generally improve 5′ vs 3′ ladder separation for part or all of the sequence. Although Cy3 and biotin represent artificial modifications that are particularly extreme in their hydrophobicity, some naturally found RNA modifications may have similar properties, such as the recently identified geranylated 5-methylaminomethyl-2-thiouridine.[42] Our procedure relies on a homogeneous starting material, although we demonstrate that some RNA mixtures of limited complexity can be manually analyzed. This requires that native RNA be isolated in significant yield for analysis. Our choice of mobile phase trades some sensitivity for fragment separation. With our instrumentation, 15 pmol of starting material was sequenceable, but other currently available instruments have the sensitivity to achieve full sequencing at the low picomole or subpicomole scale. These amounts are well within the range tractable to typical PAGE or chromatographic purifications of biological samples. It is notable that our method is insensitive to DNA contamination in the starting material, such that RNA isolation using (e.g., biotinylated) complementary 2′-O-methyl, 2′-fluoro, or deoxy-oligonucleotides would be compatible with the method. In principle, any linear informational polymer with known monomer composition could be sequenced bidirectionally by a generalization of the presented method and algorithm. In particular, adaptation of the method to novel backbone chemistries or dramatically different linear polymers requires only optimization of digest conditions and a database of relevant monomer masses.

Conclusion

The discovery of novel classes of RNA molecules playing diverse roles in cellular processes is leading to an increasing demand for techniques that interrogate nucleic acids at the chemical level. We suggest that the direct sequencing procedure presented here will aid in the mechanistic understanding of RNA species isolated from natural sources, particularly small noncoding RNAs including miRNAs, piRNAs, and numerous new RNA classes of recent interest. The method could also be used to determine the interaction site for chemically or photochemically cross-linked RNA-protein and RNA–RNA complexes after chemical or enzymatic degradation. Furthermore, we believe the method will permit analysis of extension products from nonenzymatic replication reactions using RNA or oligonucleotides containing nonstandard nucleobases or backbone chemistries.

Materials and Methods

LC–MS

Samples were separated and analyzed on an Agilent 1200 HPLC coupled to an Agilent 6520 accurate-mass Q-TOF (Agilent Technologies) equipped with a solvent degasser, autosampler, diode array detector and column oven. All separations were performed using IP-RP-HPLC according to ref (35) with aqueous mobile phase (A) as 200 mM HFIP with 1.25 mM TEA at pH 7.0 and organic mobile phase (B) as methanol across a 100 mm × 1 mm i.d. Xbridge C18 column with a particle size of 3.5 μm (Waters). For samples with hydrophobicity similar to unmodified RNA, mobile phase B was ramped from 2.5% to 20% over 30 min. For samples with large hydrophobic modifications, mobile phase B was increased from 5% to 15% over 0–20 min and from 15% to 60% over 20–41 min. The flow rate was 0.1 mL/min, and all separations were performed with the column temperature maintained at 50 °C. Injection volumes were 5–25 μL, and sample amounts were 15–400 pmol of RNA. RNAs were analyzed in negative ion mode from 239 m/z to 3200 m/z with a scan rate of 1 spectrum/s with the following MS settings: drying gas flow, 8 L/min; drying gas temperature, 325 °C; nebulizer pressure, 30 psig; capillary voltage, 3500 V; fragmentor, 200 V; and skimmer, 65 V. For LC–MS/MS the source parameters were the same as above and ions were fragmented using collision induced dissociation (CID) with N2 and collision potential of 30 V.

Automated De Novo Sequencing Algorithm

Candidate compounds identified by molecular feature extraction (see Supporting Information) were analyzed using a novel algorithm implemented in Python that generates paired end sequencing reads from a set of compounds defined by their mass, chromatographic RT, and abundance (Figure S1). Compounds were recursively clustered to agglomerate peak intensity from multiple adducts into a single parent compound. To do so, a list of compounds was searched for mass additions corresponding to frequent adducts and losses (e.g., +Na–H, +K–H, +Cl, −H2O, and +formyl) within a specified mass error window (calculated in ppm) and RT window (specified as a peak width multiplier). The intensities of candidate compound adducts were summed onto the species with maximum integrated compound volume, based on the observation that the dominantly observed ion for purified nucleic acids was the adduct-free species. Adduct clustering was performed recursively to a specified recursion depth (typically no more than 2–3 levels to avoid error amplification), such that multiple mass adducts were hierarchically clustered with the parent compound. The clustered compound set can be further filtered based on ion characteristics (number of observed charge states, peak quality, intensity, etc.), although strict filters were generally unnecessary. An optional high stringency weighting scheme can be employed, since the desired single-cut fragments are only those pairs whose mass sum to that of the candidate full length species (−H2O), such that all other fragments can be down-weighted. This optional feature was not required for the analyses reported here. Sequencing reads for a 20-nt RNA were commonly generated from a final set of ca. 200–300 compounds. For the analyses performed here, compounds were filtered only by requiring observation of two ions and that they have a mass greater than 224 Da. With a set of compounds, C, a constrained Monte Carlo random walk was repeatedly simulated across all C(m,t,v) points, where m is compound monoisotopic mass, t is compound RT, and v is integrated peak intensity. In the algorithm, a candidate sequencing read of length n is defined by a walk [C, C, ..., C] along steps i that proceeds across C point space using the following set of rules. A step from C → C can be taken if and only if m > m (i.e., in the direction of monotonically decreasing mass), t > tor t < t for all i (i.e., in the direction of monotonically increasing or decreasing RT for an entire walk), mass difference, Δm = m – m ∈ {Δmknown ± merr}, and C maximizes P = P(C|C), where the conditional probability P is defined for this purpose as P = (v/(max(v))) + G. Here, G is a Gaussian noise term with mean 1 and empirical variance (generally 0.1 for internal jumps and 0.5 for first jumps in a walk). As above, v is the volumetrically integrated and globally normalized peak intensity of compound C. The local normalization factor max(v) is determined by taking the maximum intensity of all nearest neighbors within a mass × RT window around C, the size of which can be variably defined by the set of known base mass differences, {Δmknown}, the maximum allowable mass error, merr (limited at 5–10 ppm for our instrument), and an expected maximum RT difference for loss of a single nucleotide (maximally around 3 min for native bases and minor modifications under the separation conditions used in this work). The intensity-weighted term of the probability P is valid primarily for the sample preparation conditions used here, as singly cut fragments are generally expected to be in highest abundance in any mass × RT region relative to multiply cut internal fragments. Since many trajectories can be generated from a data set, an intensity based scoring metric was used to judge the quality of each step in a trajectory, and therefore of each call in a read. Walk trajectories were generated on each degradative sequencing ladder, in both 3′–5′ and 5′–3′ orientations. Trajectories were clustered into two groups expected to be in the same read orientation, bifurcated by a locally weighted scatterplot smoothing (LOWESS) regression line generated from the filtered compound set. Trajectories were then converted to subsequence reads by a table lookup of Δm for each step, taking into account the specified maximum allowable merr. A preliminary sequence alignment within each cluster of trajectories was performed, and a consensus subread was determined for each by choosing the base at each aligned sequence position with maximal score. Preliminary alignment gave one consensus subread for each cluster, but orientation of these sequences is not yet known. Orientation can be unambiguously determined in most cases by checking whether the first mass difference from the full length mass corresponds to a nucleoside or nucleotide minus H2O. Since acid digestion results in fragments containing a 3′(2′)-monophosphate and a 5′-hydroxyl, loss of a nucleoside occurs at the 3′ end, provided that the starting material is not 3′(2′)-phosphorylated, which can be ensured by phosphatase treatment. Given this predicted orientation, the two sequences resulting from preliminary alignment could then be aligned against one another after placing them in the same orientation, which gave a final consensus read.

39 in total

1. Detection of pseudouridine and other modifications in tRNA by cyanoethylation and MALDI mass spectrometry.

Authors: Jonas Mengel-Jørgensen; Finn Kirpekar
Journal: Nucleic Acids Res Date: 2002-12-01 Impact factor: 16.971

Review 2. Mass spectrometry of RNA.

Authors: Benjamin Thomas; Alexandre V Akoulitchev
Journal: Trends Biochem Sci Date: 2006-02-17 Impact factor: 13.807

3. Detection of RNA modifications.

Authors: Stefanie Kellner; Jürgen Burhenne; Mark Helm
Journal: RNA Biol Date: 2010-03-08 Impact factor: 4.652

4. The use of nuclease P1 in sequence analysis of end group labeled RNA.

Authors: M Silberklang; A M Gillum; U L RajBhandary
Journal: Nucleic Acids Res Date: 1977-12 Impact factor: 16.971

5. Sequencing RNA by a combination of exonuclease digestion and uridine specific chemical cleavage using MALDI-TOF.

Authors: D A Tolson; N H Nicholson
Journal: Nucleic Acids Res Date: 1998-01-15 Impact factor: 16.971

6. Direct chemical method for sequencing RNA.

Authors: D A Peattie
Journal: Proc Natl Acad Sci U S A Date: 1979-04 Impact factor: 11.205

Review 7. RNA therapeutics: beyond RNA interference and antisense oligonucleotides.

Authors: Ryszard Kole; Adrian R Krainer; Sidney Altman
Journal: Nat Rev Drug Discov Date: 2012-01-20 Impact factor: 84.694

8. The RNA Modification Database, RNAMDB: 2011 update.

Authors: William A Cantara; Pamela F Crain; Jef Rozenski; James A McCloskey; Kimberly A Harris; Xiaonong Zhang; Franck A P Vendeix; Daniele Fabris; Paul F Agris
Journal: Nucleic Acids Res Date: 2010-11-10 Impact factor: 16.971

9. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.

Authors: Fabian Sievers; Andreas Wilm; David Dineen; Toby J Gibson; Kevin Karplus; Weizhong Li; Rodrigo Lopez; Hamish McWilliam; Michael Remmert; Johannes Söding; Julie D Thompson; Desmond G Higgins
Journal: Mol Syst Biol Date: 2011-10-11 Impact factor: 11.429

10. Discovery and biological characterization of geranylated RNA in bacteria.

Authors: Christoph E Dumelin; Yiyun Chen; Aaron M Leconte; Y Grace Chen; David R Liu
Journal: Nat Chem Biol Date: 2012-09-16 Impact factor: 15.040

9 in total

1. 2D-HELS MS Seq: A General LC-MS-Based Method for Direct and de novo Sequencing of RNA Mixtures with Different Nucleotide Modifications.

Authors: Ning Zhang; Shundi Shi; Barney Yoo; Xiaohong Yuan; Wenjia Li; Shenglong Zhang
Journal: J Vis Exp Date: 2020-07-10 Impact factor: 1.355