
The ends of a large RNA molecule are necessarily close.

Aron M Yoffe, Peter Prinsen, William M Gelbart, Avinoam Ben-Shaul.

Abstract

We show on general theoretical grounds that the two ends of single-stranded (ss) RNA molecules (consisting of roughly equal proportions of A, C, G and U) are necessarily close together, largely independent of their length and sequence. This is demonstrated to be a direct consequence of two generic properties of the equilibrium secondary structures, namely that the average proportion of bases in pairs is ∼60% and that the average duplex length is ∼4. Based on mfold and Vienna computations on large numbers of ssRNAs of various lengths (1000–10 000 nt) and sequences (both random and biological), we find that the 5′–3′ distance, defined as the sum of H-bond and covalent (ss) links separating the ends of the RNA chain, is small, averaging 15–20 for each set of viral sequences tested. For random sequences this distance is ∼12, consistent with the theory. We discuss the relevance of these results to evolved sequence complementarity and specific protein binding effects that are known to be important for keeping the two ends of viral and messenger RNAs in close proximity. Finally we speculate on how our conclusions imply indistinguishability in size and shape of equilibrated forms of linear and covalently circularized ssRNA molecules.

Year: 2010 PMID: 20810537 PMCID: PMC3017586 DOI: 10.1093/nar/gkq642

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

There are many situations in which it is biologically important for the two ends of a large RNA molecule to be close to each other. In animal viruses with single-stranded (ss) RNA genomes, for example, efficient replication of the genome has been shown to depend on its effective ‘circularization’. More explicitly, complementary sequences have been identified at or near the 5′- and 3′-ends that are responsible for forming ‘panhandles’ that keep the two ends close together. These panhandles are duplexes that are 21 bp in the case of yellow fever virus (1), and 15 bp in the case of influenza A (2), thereby according them unusual robustness. Another example where RNA genome circularization of this kind has been implicated in RNA replication is sindbis virus; here an 18 bp 5′–3′ panhandle has been shown to survive denaturing conditions sufficient to eliminate much of the remaining secondary structure, leaving the genome with a circular appearance in electron micrographs (3). In dengue, also (like yellow fever, influenza A and sindbis) a positive-sense RNA virus, minus-strand synthesis involves long-distance 5′–3′ base pairing that facilitates the transfer of the RNA-dependent RNA polymerase from its binding site at the 5′-end to the initiation site at the 3′-end (4). Similarly, circularization of HIV-1 has been shown to arise from base pairing between the 5′- and 3′-ends of the RNA genome (5); these interactions are found to occur as well in different HIV-1 subtypes with large sequence variation, suggesting they share an evolutionary basis. It has also long been known that effective circularization of messenger RNA molecules is important for efficient translation. The 5′- ‘capping’ and 3′-polyadenylation of mRNAs—through a variety of specific protein-binding events—result in the association of the two ends of the molecules and subsequent formation of translation initiation complexes (6). 
In eukaryotes, for example, the 3′-poly(A) ‘tail’ interacts with the poly(A)-binding protein, the 5′-G-cap binds a eukaryotic initiation factor, and these two bound proteins—with the full length of mRNA intervening—simultaneously bind a ‘bridging’ protein. This effective circularization of the molecule results in recruitment of the 40S ribosomal subunit (via binding of still another protein) and initiation of translation. Because circularization of mRNA is so important for its translation, mechanisms that co-localize the ends have evolved even in cases where the molecules are not capped or polyadenylated. Plant viruses, for example, often lack both of these special sequences and yet are translated efficiently (7,8). The effective circularization is enhanced by direct base pairing between sub-sequences in the untranslated regions (UTRs) at the 5′- and 3′-ends; the UTRs functionally replace the G-cap and poly(A) tail. Further, the RNAs of many positive-sense (mRNA) viruses have internal ribosome entry sites (IRESs) at their 5′-ends, i.e. subsequences that recruit ribosomes and initiate translation (9,10). In all of the above examples—involving both direct interaction between 5′- and 3′-ends or interaction mediated by binding proteins—particular, evolved, subsequences are involved in effective circularization. But in all of these scenarios, an even more fundamental requirement is that the two ends of the fluctuating molecule must spend enough time near each other in order for there to be a high probability for the special elements—RNA subsequences or binding proteins—to find one another. More explicitly, we will argue here that effective circularization of large RNA molecules is achieved through generic properties of secondary structure that are essentially independent of sequence. The specific evolved subsequences mentioned above are not needed so much for circularization as for facilitating the binding of particular proteins—e.g. 
RNA replicases and ribosome initiation factors, that are important for biological function of the circularized RNA.

Consider the analogous situation of double-stranded (ds) DNA with ‘sticky’ ends arising from complementary ss overhangs (generated, say, by a restriction enzyme). Here the probability of the two ends being covalently bound by a ligase is directly determined by, and ultimately limited by, the likelihood that they are close enough to each other to bind, i.e. that the double helix can twist and bend enough for its two ends to get close together (11). This classic problem is informed by the well-known statistical mechanical result giving the likelihood of the ends of a linear, semiflexible polymer being within a monomer distance of one another. For sufficiently long molecules this probability is of order $(L/\ell_p)^{-3/2}$, where $L$ and $\ell_p$ are the contour and persistence lengths, respectively, of the linear polymer; the contour length is the number of monomers times the average inter-monomer distance, and the persistence length is the distance along the chain contour beyond which the polymer can bend almost freely (12). Thus, the circularization probability of long DNA is small because $L/\ell_p$ is large, i.e. the molecule is long compared to its persistence length (50 nm, for DNA): maximization of configurational entropy requires that the ends be far apart. The small probability of finding them close, decreasing as $L^{-3/2}$, reflects directly the fact that the root-mean-square distance between the ends of the molecule is increasing as $L^{1/2}$. To understand the basis for effective circularization of ssRNA, then, it is natural to ask: is there, in analogy with dsDNA, a generic result for the probability of finding the two ends of an RNA molecule close to one another, and how different is it from that for a linear polymer?
In this article we argue that there is indeed a universal distribution of end-to-end distances in large RNA molecules, and furthermore that it is essentially independent of overall sequence and length. We show in particular that the distance between ends is necessarily small, because of generic features of the secondary structure, notably that the percentage ($f$) of paired nucleotides (nt) is ∼60% and that the average duplex length ($k$) is ∼4.

Using an early variant of the RNA folding algorithm developed by Zuker et al. (13,14), Fontana et al. (15) have calculated various characteristics of the minimum free energy (MFE) structure corresponding to several different types of short (20–100 nt) sequences. Averaging over many sequences of the same length (number of nucleotides, $N$) and base composition, they found that $f$ and $k$ approach a constant value with increasing $N$. They also calculated a property (the number of unpaired bases in ‘joints’ and ‘free ends’) that is closely related to our definition of the 5′–3′ distance (see next section), finding that for the short chains analyzed this number increases, yet with a gradually decreasing slope, as $N$ increases. The constancy of $f$ and $k$ has been confirmed for a wide range of biological (viral and yeast) ssRNA sequences (16) by application of the mfold and Vienna codes for predicting thermally accessible secondary structures.

For certain models of polynucleotide chains, the $N$-independence of $f$ and $k$ has been proven analytically, using a variety of powerful theoretical tools. Hofacker et al. (17), applying an elegant graph-theoretic approach, derived exact results for these properties (see their Table 3) and various other secondary structure attributes of RNA-like heteropolymers. Their results apply to an idealized ensemble where all possible secondary structures have equal statistical weight, resulting in low values of $f$ and $k$. More recently, Clote et al. (18), using the Nussinov–Jacobson (‘maximum base pairing’) model (19), have shown that, for an ssRNA chain with Watson–Crick pairing rules, $f$ approaches a constant value slightly exceeding 90% for large $N$ (>1000). Earlier, de Gennes had noted (20) that, for a random sequence of two complementary nucleotides, the distance between chain ends remains finite even as $N$ approaches infinity. Based on this notion he also concluded that ‘ … many properties of a large, open, strand are not very different from those of a cyclic strand of equal molecular length’ (20). We elaborate on this idea in the next section.

Our goal in the present work is to emphasize the generality of the proximity of the 5′- and 3′-ends of large RNA molecules of arbitrary length and sequence. Based on the general findings noted above for large ssRNA chains, we derive a simple expression for the 5′–3′ distance that can be evaluated numerically for sequences of given $f$ and $k$. We also calculate this distance using the RNAsubopt (21,22) and mfold (23,24) folding algorithms. A further consequence of our analyses is that the secondary (and hence tertiary) structures of linear and covalently-circularized RNA molecules are practically identical. These conclusions are tested against several systematic calculations of secondary structures for specific linear and circular sequences, both random and viral.

METHODS

Figure 1A displays the MFE secondary structure of a rather short (200 nt) random-sequence ssRNA molecule, composed of equal numbers of A, C, G and U, as predicted by the mfold algorithm (23,24). The duplexes are represented in the usual way by straight ‘ladders’ and the loops by circles of different sizes. The same secondary structure is visualized slightly less schematically in Figure 1B, with more realistic scaling of duplex dimensions, using the jViz.Rna drawing program (25). This latter representation illustrates that the dangling ss segments in the ‘exterior loop’—the one including the 5′- and 3′-ends—are independent flexible chains. In Figure 1C the secondary structure is mapped into a tree graph, where each edge (bond) represents a duplex and the vertices represent the loops (15,17,26); the interior loops are denoted by solid circles, and the exterior loop by an open circle. The term ‘interior loop’ is conventionally defined as the chain of bases, both paired and unpaired, comprising a closed loop, excluding its closing (‘downstream’) base pair. In the following we slightly depart from this definition and include the closing base pair as part of the (hence closed) loop. Our definition of the exterior loop, which lacks a closing base pair, is identical to the conventional one, namely, it includes all bases (paired and unpaired) along the shortest connected (covalently or H-bonded) path from the 5′- to the 3′-end.

Figure 1.

Three different representations of the mfold-predicted minimum free energy secondary structure of a random 200 nt ssRNA of uniform composition (25% A, C, G, U). (A) Conventional schematic, drawn with mfold, showing base-paired regions (duplexes) and single-stranded loops. (B) jViz.Rna drawing (16), emphasizing the flexibility of single-stranded loops and scaled dimensions of duplexes. (C) Graph-theoretic mapping of this secondary structure, reducing duplexes to edges (bonds) and loops to vertices (filled circles); the single ‘exterior’ loop is depicted by an open circle.

5′–3′ Distance

As a simple intuitive measure of the 5′–3′ distance (in a given secondary structure of a given sequence) we use the total number of nucleotide links comprising the exterior loop, i.e.

$$D = n_{ss} + n_{ds}. \qquad (1)$$

Here $n_{ss}$ is the number of covalent (phosphodiester) bonds (hereafter also referred to as ss links) in the exterior loop and $n_{ds}$ is the number of base-paired (H-bonded, ds) links in the exterior loop or, equivalently, the number of duplexes emanating from the exterior loop. As it is the total number of (ss and ds) links in the nucleotide chain constituting the exterior loop, we shall refer to $D$ as the ‘effective contour length’ of this loop. Expressing $D$ in the form $D = n_{ext} - 1$, where $n_{ext}$ is the total number of nucleotides in the exterior loop, and noting that $2n_{ds}$ is the total number of paired bases in the exterior loop, it follows from Equation (1) that $n_{ss} - n_{ds} + 1$ is the number of unpaired bases in this loop. Figure 2 illustrates such an exterior loop in detail; see also the exterior loop of the structure in Figure 1.

It should be emphasized that the average physical distance between the 5′- and 3′-ends depends not only on $D$ but also on the specific sequence of the loop, as well as the number of duplexes branching from the loop. In fact the lengths of the covalent and H-bonded links are different (the latter are about three times larger). If all links were of equal length $b$, and their joints were fully flexible, then the physical 5′–3′ distance would be roughly $bD^{1/2}$, where we have neglected excluded volume effects because of the shortness of the exterior loop (12). It follows that small, $N$-independent $D$-values imply small, $N$-independent physical distances between the two chain ends.
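The link-counting definition of the 5′–3′ distance can be illustrated with a short Python sketch. It assumes that the secondary structure is supplied in dot-bracket notation (the text format written by the Vienna RNA Package) and contains no pseudoknots. The function names `pair_table` and `exterior_loop_links` are illustrative only and are not part of any folding package.

```python
def pair_table(dot_bracket):
    """Return a list giving the pairing partner of each base (-1 if unpaired)."""
    partner = [-1] * len(dot_bracket)
    stack = []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = stack.pop()
            partner[i], partner[j] = j, i
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return partner


def exterior_loop_links(dot_bracket):
    """Count covalent (ss) and H-bonded (ds) links on the exterior loop.

    The walk starts at the 5' end and follows the shortest connected path
    to the 3' end: a base pair is crossed in one step (one ds link),
    otherwise we step along the backbone (one ss link).
    """
    partner = pair_table(dot_bracket)
    n_ss = n_ds = 0
    i = 0
    last = len(dot_bracket) - 1
    while i < last:
        if partner[i] > i:      # opening base of an exterior-loop duplex
            n_ds += 1
            i = partner[i]      # jump across the base pair
        else:                   # unpaired base, or closing base of a pair
            n_ss += 1
            i += 1              # step along the backbone
    return n_ss, n_ds


n_ss, n_ds = exterior_loop_links("..((..))..")
distance = n_ss + n_ds  # total number of links on the exterior loop
```

For the toy structure `..((..))..` the walk crosses four covalent links and one H-bonded link, so the distance is 5 links, one less than the six nucleotides on the exterior loop.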

Figure 2.

Detailed view of an exterior loop consisting of covalent (ss) links and H-bonded (ds) links of nucleotides. The effective contour length of the loop is the total number of these links.

Four simple observations will guide our calculation of the 5′–3′ distance. They concern the near-identity of linear and circular structures, the pairing fraction, the average duplex length and the tree-graph representation of secondary structure.

The MFE secondary structures of a given linear ssRNA molecule and that of the circular RNA obtained by linking the 5′- and 3′-ends of the linear chain are very similar, and their energies practically identical. This is because the presence or absence of a covalent (phosphodiester) bond between the terminal nucleotides does not significantly alter overall base pairing. Its small influence on the configurational free energy of the molecule enters only through the entropy difference between the open exterior loop in the linear RNA and the corresponding closed (interior) loop in the circular analog. Actually, for any secondary structure of the linear ssRNA, not only the one of minimum free energy, the corresponding circular structure has essentially the same energetic and structural characteristics. Conversely, any secondary structure of a linear RNA can be regarded as derived from ‘cutting’ a specific covalent bond in one of the interior loops of the corresponding circular RNA. We thus expect that secondary structure characteristics of long RNA molecules, such as the pairing fraction or average duplex length, are practically the same for the linear and circularized ‘isomers’. These conclusions have been confirmed by numerical analyses of a large number of linear and circular RNA sequences of different lengths and compositions, as reported below and in Supplementary Figure S1 and Supplementary Table S1.

As noted in the Introduction, for long chains composed of comparable proportions of A, C, G and U (25 ± 5%), we find that $f \approx 60\%$ for randomly-permuted sequences and for most viral RNAs (Tables 1 and 2).

Two simple and important results can easily be proved from the tree-graph representation of the secondary structure (Figure 1C). First, the number of vertices, $V$, and the number of bonds, $B$, of a circular RNA are related by the equality $V = B + 1$. This relation is also valid for linear RNAs provided the exterior loop is also represented by a vertex (possibly differently labeled, as in Fig. 1C). Second, on average (over all loops in any given structure), each loop (vertex) is connected to $\langle z \rangle = 2B/V = 2(1 - 1/V)$ duplexes (edges). For long ($N \gg 1$) sequences we also find $V \gg 1$ (see below), in which case we can safely set $\langle z \rangle = 2$, which (unless otherwise stated) will be the value used in our calculations. Note that the averaging here is over all loops in a given structure. The same holds, of course, after averaging over any number of structures and/or sequences. Note also that for every loop $i$ we always have $z_i \geq 1$, with $z_i = 1$ corresponding to a ‘hairpin’ loop, $z_i = 2$ to a ‘bubble’ or ‘bulge,’ and $z_i \geq 3$ to a ‘multi loop’.
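The pairing fraction and average duplex length referred to in this section can be computed directly from a predicted structure. The following Python sketch is one way to do so for a dot-bracket string without pseudoknots. It assumes, for illustration, that a duplex is a maximal run of directly stacked base pairs (so any bulge or interior loop starts a new duplex); folding packages may delimit duplexes differently. The names `pair_table` and `pairing_stats` are hypothetical.

```python
def pair_table(dot_bracket):
    """Return a list giving the pairing partner of each base (-1 if unpaired)."""
    partner = [-1] * len(dot_bracket)
    stack = []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = stack.pop()
            partner[i], partner[j] = j, i
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return partner


def pairing_stats(dot_bracket):
    """Return (f, k, B): pairing fraction, mean duplex length, duplex count."""
    partner = pair_table(dot_bracket)
    n = len(dot_bracket)
    paired_bases = sum(1 for p in partner if p >= 0)
    duplexes = 0
    for i, j in enumerate(partner):
        if j > i:  # visit each base pair once, at its opening base
            # The pair (i, j) continues a duplex only if (i-1, j+1) is a pair.
            stacked = i > 0 and j + 1 < n and partner[i - 1] == j + 1
            if not stacked:
                duplexes += 1
    f = paired_bases / n if n else 0.0
    k = (paired_bases / 2) / duplexes if duplexes else 0.0
    return f, k, duplexes


f, k, n_duplexes = pairing_stats("((.(..).))")
```

In this toy structure three base pairs form two duplexes, so the fraction of paired bases is 0.6 and the mean duplex length is 1.5 bp; the duplex count also equals fN/(2k) = 0.6 × 10 / 3 = 2.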

Table 1.

Composition dependence of the average percentage of bases paired (f), the average duplex length (k) and the average 5′–3′ distance (D), for different sets of random and yeast-derived sequences of length 3000 nt; each set consists of 500 sequences

Type of ssRNA	Folding program	G (%)^a	C (%)^a	A (%)^a	U (%)^a	f (%)	k (bp)	D, links	D, from Equation (2)
Random, viral-like	RNAsubopt	24	22	26	28	62 ± 1	4.0 ± 0.1	12 ± 4	11.6
Random, uniform	RNAsubopt	25	25	25	25	61 ± 1	3.9 ± 0.1	12 ± 5	12.6
Yeast-derived^b	RNAsubopt	19	19	31	31	58 ± 2	4.1 ± 0.1	14 ± 5	11.9
Random, viral-like	mfold	24	22	26	28	61 ± 1	4.5 ± 0.1	14 ± 7	12.8

Values following the ± symbols are standard deviations.

^a The randomly-permuted ssRNAs of each type are of identical composition; for the yeast ssRNAs, the mean composition is listed.

^b These are ssRNA transcripts of successive 3000 bp sections of yeast (S. cerevisiae) chromosomes XI and XII.

Table 2.

Values of f, k and D for viral ssRNAs, determined with RNAsubopt

Viral taxon	No. of seq.^a	Host	N (nt)	f (%)	k (bp)	D, links
Bromoviridae RNA3	8	Plant	2210	63 ± 1	4.2 ± 0.1	19 ± 6
Bromoviridae RNA2	8	Plant	2891	63 ± 2	4.3 ± 0.1	18 ± 4
Bromoviridae RNA1	8	Plant	3265	64 ± 2	4.3 ± 0.1	15 ± 3
Leviviridae	9	Bacterium	3780	68 ± 2	4.3 ± 0.1	15 ± 9
Sobemovirus	9	Plant	4199	66 ± 2	4.2 ± 0.2	17 ± 4
Luteovirus	17	Plant	5725	62 ± 1	4.2 ± 0.1	16 ± 7
Tymovirus	9	Plant	6300	45 ± 4	3.9 ± 0.1	26 ± 5
Tobamovirus	22	Plant	6425	64 ± 1	4.2 ± 0.1	19 ± 5
Astroviridae	6	Animal	6719	63 ± 1	4.3 ± 0.1	16 ± 8
Caliciviridae	18	Animal	7713	62 ± 1	4.1 ± 0.1	20 ± 19

Values following the ± symbols are standard deviations.

^a Number of sequences analyzed.

For long chains, we also know that the average length of (i.e. number of base pairs in) a duplex, $k$, is independent of $N$ and rather insensitive to composition (for compositions involving 25 ± 5% of the four bases). For nearly all the sets of sequences examined in this study (randomly-permuted, viral and yeast-derived) $k$ is between 4 and 5 (Tables 1 and 2; Supplementary Table S1).

As is well known, every secondary structure can be represented by a tree graph (26), as illustrated in Figure 1C.

Among the numerous possible secondary structures of long RNA sequences, there are often thousands whose free energies are just marginally higher ($\sim k_BT$ or less) than that of the MFE configuration, and under equilibrium conditions all these structures are nearly equally likely. Consequently, any property of the molecule that depends on its secondary structures should be averaged over their full thermal (Boltzmann) distribution.

Suppose that, using RNAsubopt or a similar program, we have stochastically sampled the thermal ensemble of structures corresponding to a certain circular ssRNA sequence of given length and composition. As argued above, all the linear ssRNA molecules derived by cutting any covalent (ss) bond in any interior loop of any member of the above ensemble will fold into ensembles of structures that are practically identical both to each other, and to the ensemble of the original circular molecule. The only difference is the appearance of an exterior loop, which now contains the 5′- and 3′-ends. For every given circular structure containing $V$ interior loops, this cutting procedure yields $M = \sum_{i=1}^{V} m_i$ linear ssRNA sequences, where $M$ is the total number of ss (covalent) bonds in all loops of the given structure, $m_i$ denoting the number of covalent bonds in loop $i$. Noting that the total number of nucleotides in the closed loop $i$, namely $u_i + 2z_i$, is equal to the total number of bonds in this loop ($m_i + z_i$), we find $m_i = u_i + z_i$, with $u_i$ and $2z_i$ denoting the number of unpaired and H-bonded nucleotides in loop $i$, respectively, and $z_i$ the number of duplexes emerging from this loop. This yields

$$M = \sum_{i} u_i + \sum_{i} z_i = (1 - f)N + 2B.$$

We have used the fact that the first sum is the total number of unpaired nucleotides, $(1 - f)N$, and the fact that because every duplex is connected to two loops, the second sum is twice the total number ($B$) of duplexes in the structure. But $B$ can be expressed in the form $B = fN/(2k)$, so that $M = N[1 - f(1 - 1/k)]$. Here, and in all subsequent analytical expressions involving $f$, its numerical value will be understood to be the fraction of bases in pairs, rather than the percentage. As before, $k$ denotes the average duplex length in the particular sequence considered. For $f = 0.6$ and $k = 4$ we find $M = 0.55N$.

In the next section we present numerical calculations of the average 5′–3′ distance for two types of ssRNA molecules, biological (yeast-derived and viral) and randomly-permuted sequences. The random sequences were included both for direct comparison to the biological sequences, and for general theoretical interest. In each case, a Boltzmann-weighted average $D$-value is determined for the thermal ensemble of structures associated with each sequence. We then report the mean of these ensemble-average $D$-values for each set of sequences.

For the random sequences a simple theoretical prediction of $D$ (showing good agreement with the numerical calculation) can be derived based on two reasonable approximations, as argued in Appendix 1. We show there that, for any given secondary structure of a very long ($N \gg 1$) ssRNA molecule, the 5′–3′ distance is determined by the average number of ss covalent bonds per interior loop in the structure considered. Expressed in terms of the pairing fraction, $f$, and duplex length, $k$, of this structure, this result is Equation (2). For both the MFE structure and the canonical ensemble averages of secondary structures of random (but also viral) sequences containing roughly equal proportions of the four bases it is found that $f \approx 0.6$ and $k \approx 4$, and hence $D \approx 12$. See also Table 1.
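The bond-counting argument of this section reduces to simple arithmetic, sketched below in Python. With pairing fraction f (as a fraction, not a percentage) and average duplex length k, a chain of N nucleotides has (1 − f)N unpaired bases and fN/(2k) duplexes, so the number of ss bonds in all loops is N[1 − f(1 − 1/k)]. The last quantity returned, the mean number of ss bonds per loop, additionally assumes that the number of loops is approximately equal to the number of duplexes, which holds only for long chains. The function name `loop_statistics` is illustrative, and the sketch does not reproduce Equation (2).

```python
def loop_statistics(f, k):
    """Per-nucleotide bond counts for pairing fraction f and duplex length k.

    f is the fraction of bases in pairs (0 < f <= 1); k is the mean number
    of base pairs per duplex (k >= 1).
    """
    if not 0.0 < f <= 1.0:
        raise ValueError("f must be a fraction in (0, 1]")
    if k < 1.0:
        raise ValueError("k must be at least 1 base pair")
    duplexes_per_nt = f / (2.0 * k)               # B / N
    ss_bonds_per_nt = 1.0 - f * (1.0 - 1.0 / k)   # M / N
    # Long-chain assumption: number of loops V = B + 1 is close to B.
    mean_ss_bonds_per_loop = ss_bonds_per_nt / duplexes_per_nt
    return {
        "duplexes_per_nt": duplexes_per_nt,
        "ss_bonds_per_nt": ss_bonds_per_nt,
        "mean_ss_bonds_per_loop": mean_ss_bonds_per_loop,
    }


stats = loop_statistics(0.6, 4.0)
```

For f = 0.6 and k = 4 this gives 0.55 ss bonds per nucleotide and 0.075 duplexes per nucleotide, i.e. roughly 7.3 ss bonds per loop under the long-chain assumption.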

Numerical computations

RNA sequences

Randomly-permuted ssRNA sequences were generated with a Fisher–Yates shuffle driven by a Mersenne Twister random number generator (27) implemented in C++ (by R. Wagner, University of Michigan, available at: www-personal.umich.edu/∼wagnerr/MersenneTwister.html). Viral ssRNA sequences were obtained from the National Center for Biotechnology Information Genome Database (www.ncbi.nlm.nih.gov). Yeast (Saccharomyces cerevisiae) genomic sequences were obtained from the Saccharomyces Genome Database (www.yeastgenome.org).
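As an illustration of how fixed-composition random sequences of this kind can be produced, the Python sketch below builds a pool of bases with the desired counts and permutes it with an explicit Fisher–Yates shuffle. It is not the C++ code used in the study: it relies on Python's standard `random.Random` generator, which is itself based on the Mersenne Twister, and the helper names `fisher_yates_shuffle` and `random_rna` are hypothetical.

```python
import random


def fisher_yates_shuffle(items, rng):
    """Return a uniformly random permutation of items (Fisher-Yates)."""
    shuffled = list(items)
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randrange(i + 1)  # uniform index in 0..i inclusive
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


def random_rna(base_counts, seed=None):
    """Return a random sequence containing exactly the given base counts."""
    rng = random.Random(seed)  # Mersenne Twister generator
    pool = [base for base, count in base_counts.items() for _ in range(count)]
    return "".join(fisher_yates_shuffle(pool, rng))


# 'Viral-like' composition (24% G, 22% C, 26% A, 28% U) for a 3000 nt chain.
viral_like = {"G": 720, "C": 660, "A": 780, "U": 840}
sequence = random_rna(viral_like, seed=1)
```

Because the pool is permuted rather than sampled base by base, every generated sequence has exactly the requested composition, which matches the description of the randomly-permuted sets as being of identical composition.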

Folding programs

Secondary structure predictions were made with two RNA folding programs, RNAsubopt, a program in the Vienna RNA Package, Version 1.7 (21,22), and mfold, Version 3.1 (23,24). These programs employ detailed empirically-based energy models to estimate the free energies of the non-pseudoknotted secondary structures that are formed by a specified ssRNA sequence.

With RNAsubopt, it is possible to sample stochastically from the ensemble of secondary structures, with a sampling probability in proportion to each structure’s Boltzmann weight. Thus, sampling a sufficient number of structures (we use 1000), and averaging the $D$-values for this set, gives a close approximation to the ensemble-average predicted value of the end-to-end distance for that sequence. In earlier work (16) we demonstrated that the average properties of subsets of 1000 structures are not significantly different from those of the complete ensemble of structures. More generally, for any property $X$, its RNAsubopt-predicted ensemble-average value is calculated as $\langle X \rangle = \frac{1}{n} \sum_{i=1}^{n} X_i$ with $n = 1000$, where $X_i$ is its value in the $i$th member of the stochastically-generated subset of the Boltzmann ensemble of secondary structures.

In mfold, by contrast, an algorithm is used to generate a structurally diverse representation of the ensemble, rather than a thermally-representative average. We configured mfold to generate the 1000 lowest-energy structures from such a set, measured $D$ for each, and averaged them in proportion to their Boltzmann weights, to give an mfold-averaged $D$-value. For any property $X$, its mfold-predicted average value is $\langle X \rangle = \sum_i X_i e^{-\Delta G_i/k_BT} \big/ \sum_i e^{-\Delta G_i/k_BT}$, with $\Delta G_i$ the free energy of the $i$th secondary structure relative to the MFE for that sequence.
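The two averaging schemes described for the folding programs can be sketched in a few lines of Python. The first function is the plain mean appropriate for structures that were already sampled in proportion to their Boltzmann weights; the second weights each structure explicitly by its Boltzmann factor. The sketch assumes that free energies are given in kcal/mol relative to the MFE and that the folding temperature is 37 °C; both function names are illustrative.

```python
import math

GAS_CONSTANT = 1.987204e-3  # kcal / (mol K)


def sample_mean(values):
    """Mean of a property over Boltzmann-sampled structures."""
    if not values:
        raise ValueError("no structures supplied")
    return sum(values) / len(values)


def boltzmann_average(values, rel_free_energies, temperature_k=310.15):
    """Boltzmann-weighted mean of a property over a list of structures.

    rel_free_energies holds each structure's free energy relative to the
    minimum free energy structure, in kcal/mol (assumed units).
    """
    if len(values) != len(rel_free_energies) or not values:
        raise ValueError("need one free energy per property value")
    thermal_energy = GAS_CONSTANT * temperature_k
    weights = [math.exp(-dg / thermal_energy) for dg in rel_free_energies]
    weighted_sum = sum(w * x for w, x in zip(weights, values))
    return weighted_sum / sum(weights)


# Two hypothetical structures with 5'-3' distances of 10 and 20 links.
equal_energy = boltzmann_average([10.0, 20.0], [0.0, 0.0])
one_dominant = boltzmann_average([10.0, 20.0], [0.0, 50.0])
```

When the two structures have equal free energy the weighted average is the plain mean; when one lies far above the MFE it contributes essentially nothing, so the average collapses onto the value for the low-energy structure.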

RESULTS

While there can be significant inter-taxon variation, the average composition of the viral RNAs in this study is ∼24% G, 22% C, 26% A and 28% U (16). With this ‘viral-like’ composition, we generated 2000 random sequences of lengths 50, 100, 200 and 400; 1000 of lengths 800 and 1500; 500 of lengths 2000, 2500, 3000 and 4000; 300 of lengths 5000, 6000 and 7000; and 1000 of length 8000. These sequences were folded with RNAsubopt. Figure 3 shows the mean $D$ and its standard deviation for each length of RNA, and a regression line fitted to sequences of length 400 and greater. Except for the very short sequences, $D$ is ∼12, independent of sequence length; in addition, it is relatively insensitive to small changes in composition. That this $D$-value is identical to the estimate obtained above, through the theoretical calculation, is coincidental, because the latter is based on the somewhat approximate expression given in Equation (2) (the approximations are explained in Appendix 1). But it is nevertheless very striking, and highly significant, that the simple theory predicts a $D$-value that is of the correct magnitude and that is independent of length and sequence.

Figure 3.

Mean ensemble-averaged 5′–3′ distances, $D$, from Equation (1), for random and viral sequences. Standard deviations are shown with vertical bars. The small black points represent the 10 groups of viral sequences listed in Table 2. The large gray points represent the 14 different lengths of randomly-permuted RNAs (50–8000 nt), of viral-like composition, described in the text. The line is a least-squares fit to the values for random sequences with $N \geq 400$. The asymptotic value of $D$ for the random sequences is very close to the theoretically predicted one, $D \approx 12$ [see Equation (2)].

Table 1 shows the results for 500 3000-nt ssRNAs of viral-like and uniform composition, as well as 500 ssRNAs that are the transcripts of consecutive 3000 bp sections on yeast (S. cerevisiae) chromosomes XI and XII. In these sets, the values of $D$, $f$ and $k$ (averaged over the 500 sequences) were 12–14, ∼60% and ∼4, respectively. The last column in the table lists the values of $D$ calculated according to Equation (2), and these results are seen to agree closely with those from the detailed numerical calculations (especially for the random sequences, as expected).

The viral taxa analyzed are listed in Table 2. All are non-enveloped ssRNA viruses and, except for the rod-shaped Tobamoviruses, have icosahedral capsids. The Leviviridae infect bacteria, the Astroviridae and Caliciviridae are animal viruses, and the remainder infect plants. The Bromoviridae are, in addition, tripartite: the genome consists of three ssRNAs, divided among three separate capsids. The number of sequences analyzed in each case corresponds to the number of species considered. From Figure 3 it can be seen that the values and standard deviations of $D$ for the viral RNAs are higher, but overlap those of the random sequences for all taxa except the Tymoviruses. The latter can be understood from the fact that small $D$-values are an inherent consequence of base pairing; all non-pathological secondary structures with a sufficiently high percentage of bases in pairs, $f$, will have a low $D$. The Tymoviruses show a relatively larger $D$ (although still small relative to sequence length) because they have a significantly smaller $f$.

We note that current RNA folding programs have been shown to be limited in their ability to correctly predict individual base pairs in long ssRNA sequences (28). Consistent with this, RNAsubopt and mfold (which use slightly different energy models to generate their ensembles of secondary structures, and different algorithms to sample from these ensembles), when given long sequences to fold, output structures that often show significant differences in the details of base pairing, as well as overall appearance. However, our simple theoretical model predicts that $D$ depends only on the values of $f$ and $k$, which we have previously found to be robust with respect to the details of the folding program used (16). Consequently, $D$ should likewise be robust to the details of the folding program, and thus insensitive to low-level inaccuracies in specific predictions of base pairing. To test this, we compared predictions of $D$ made using mfold and RNAsubopt. As expected, we found that the values do not differ significantly between the two folding programs, and can thus be considered broadly robust to the specific characteristics of the energy model used (Table 1).

There is currently no published experimental work that directly measures the 5′–3′ distance of large ($10^3$–$10^4$ nt) ssRNAs in their native state (i.e. not complexed with proteins). However, based on a combination of experimental and computational approaches, Filomatori et al. (4) have proposed a model for the secondary structure of the exterior loop of native dengue ssRNA. Their proposed loop has a $D$-value of 25, which is of the same magnitude as both the theoretical predictions in Table 1, and the numerical predictions in Table 2.

DISCUSSION

We have made two predictions in the current work, both of which can be tested experimentally. First, we have predicted with general theoretical arguments, and demonstrated with numerical computations involving the equilibrated secondary structures of a large number of different lengths and sequences, that the distance between ends of an ssRNA (or ssDNA) should be ∼10–15 nt links. This corresponds to a 3D physical distance of a few nm, which is far smaller than the contour lengths of large ssRNA molecules. As mentioned earlier, a crude estimate of the 3D distance between ends may be obtained in terms of the root-mean-square (RMS) end-to-end distance ($bD^{1/2}$) associated with a flexible linear polymer defined by the string of covalent and H-bonded links shown in Figure 2. With an average link size, $b$, of ∼3/4 nm, and a $D$ of 12, one obtains an RMS end-to-end distance of ∼3 nm. This is approximately an order of magnitude less than the 37 nm average distance between nucleotides (radius of gyration) that has been measured by small-angle X-ray scattering for a 6400 nt viral ssRNA (29). Our estimate of 3 nm could be confirmed by fluorescence resonance energy transfer (FRET) measurements, or still more directly by cryo-EM imaging of large ssRNA molecules whose ends have been labeled by small gold particles (for example, 1 nm particles conjugated to oligonucleotides that are complementary to the 5′- and 3′-ends). Second, we have predicted that all the linearized ssRNAs obtained by making a single cut in a long circular ssRNA molecule should have secondary (and hence tertiary) structures that are essentially identical to that of the parent circular form. Accordingly, they should have the same size and shape.
And because they necessarily have the same charge, they should show virtually indistinguishable band positions in native gels, even though the linear and circular forms can be easily distinguished in denaturing gels where the secondary structure needed to effectively circularize the linear molecule has been destroyed. Similarly, under native conditions, small-angle X-ray scattering experiments, cryo-EM, and measurements of diffusion coefficients/hydrodynamic radii should show no difference between the circular and linearized molecules. The only caveat here, as well as for the measurements of 5′–3′ distance described earlier, is that the secondary structures of the molecules be equilibrated, since this is explicitly assumed in the theoretical arguments leading to all of these predictions [for a critical discussion of the equilibration/renaturation (and the lack thereof) of ssRNA, see Uhlenbeck (30)].
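The ∼3 nm figure quoted in the first prediction can be reproduced with a short calculation. The sketch below assumes ideal (freely jointed) chain scaling, in which the RMS end-to-end distance equals the link size times the square root of the number of links; this scaling is consistent with the numbers quoted in the text but is stated here as an assumption, and the function name `rms_end_to_end_nm` is hypothetical.

```python
import math


def rms_end_to_end_nm(n_links: float, link_nm: float) -> float:
    """RMS end-to-end distance of an ideal chain: link size * sqrt(links).

    Assumes freely jointed links with no excluded volume or stiffness.
    """
    return link_nm * math.sqrt(n_links)


# Values quoted in the text: 12 links of about 3/4 nm each.
estimate = rms_end_to_end_nm(n_links=12, link_nm=0.75)
print(f"{estimate:.1f} nm")  # 2.6 nm, i.e. roughly 3 nm
```

Because the estimate grows only as the square root of the number of links, even the larger D-values of 15–25 reported for viral sequences would give distances of about 3–4 nm under the same assumption, still far below the 37 nm radius of gyration cited above.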

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

US National Science Foundation (grant number CHE07-14411 to W.M.G.); the Israel Science Foundation (grant number 695/06 to A.B.-S.); the US–Israel Bi-National Science Foundation (grant number 2006-401 to A.B.-S.); The Netherlands Organization for Scientific Research, Rubicon grant (to P.P.); and the University of California, Los Angeles, a Dissertation Year Fellowship (to A.M.Y.). Funding for open access charge: Research grant of A.B.-S. (grant number ISF 695/06). Conflict of interest statement. None declared.

23 in total

1. Small angle X-ray scattering studies on local structure of tobacco mosaic virus RNA in solution.

Authors: Y Muroga; Y Sano; H Inoue; K Suzuki; T Miyata; T Hiyoshi; K Yokota; Y Watanabe; X Liu; S Ichikawa; H Tagawa; Y Hiragi
Journal: Biophys Chem Date: 2000-01-24 Impact factor: 2.352

2. Mfold web server for nucleic acid folding and hybridization prediction.

Authors: Michael Zuker
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

3. Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization.

Authors: David H Mathews
Journal: RNA Date: 2004-08 Impact factor: 4.942

4. Predicting the sizes of large RNA molecules.

Authors: Aron M Yoffe; Peter Prinsen; Ajaykumar Gopal; Charles M Knobler; William M Gelbart; Avinoam Ben-Shaul
Journal: Proc Natl Acad Sci U S A Date: 2008-10-09 Impact factor: 11.205

5. Genomic RNAs of influenza viruses are held in a circular conformation in virions and in infected cells by a terminal panhandle.

Authors: M T Hsu; J D Parvin; S Gupta; M Krystal; P Palese
Journal: Proc Natl Acad Sci U S A Date: 1987-11 Impact factor: 11.205

6. Statistics of branching and hairpin helices for the dAT copolymer.

Authors: P G de Gennes
Journal: Biopolymers Date: 1968 Impact factor: 2.505

7. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information.

Authors: M Zuker; P Stiegler
Journal: Nucleic Acids Res Date: 1981-01-10 Impact factor: 16.971

8. Fast algorithm for predicting the secondary structure of single-stranded RNA.

Authors: R Nussinov; A B Jacobson
Journal: Proc Natl Acad Sci U S A Date: 1980-11 Impact factor: 11.205

9. Biophysical studies on circle formation by Sindbis virus 49 S RNA.

Authors: T K Frey; D L Gard; J H Strauss
Journal: J Mol Biol Date: 1979-07-25 Impact factor: 5.469

10. 5'-3' RNA-RNA interaction facilitates cap- and poly(A) tail-independent translation of tomato bushy stunt virus mrna: a potential common mechanism for tombusviridae.

Authors: Marc R Fabian; K Andrew White
Journal: J Biol Chem Date: 2004-05-03 Impact factor: 5.157
