Large-scale analysis of structural, sequence and thermodynamic characteristics of A-to-I RNA editing sites in human Alu repeats.

Yoav Kleinberger, Eli Eisenberg.

Abstract

BACKGROUND: Alu repeats in the human transcriptome undergo massive adenosine to inosine RNA editing. This process is selective, as editing efficiency varies greatly among different adenosines. Several studies have identified weak sequence motifs characterizing the editing sites, but these alone do not account for the large diversity observed.
RESULTS: Here we build a dataset of 29,971 editing sites and use it to characterize editing preferences. We focus on structural aspects, studying the double-stranded RNA structure of the Alu repeats, and show the editing frequency of a given site to depend strongly on the micro-structure it resides in. Surprisingly, we find that interior loops, and especially the nucleotides at their edges, are more likely to be edited than helices. In addition, the sequence motifs characterizing editing sites vary with the micro-structure. Finally, we show that thermodynamic stability of the site is important for its editing.
CONCLUSIONS: Analysis of a large dataset of editing events reveals more information on sequence and structural motifs characterizing the A-to-I editing process.

Year:  2010        PMID: 20667096      PMCID: PMC3091650          DOI: 10.1186/1471-2164-11-453

Source DB:  PubMed          Journal:  BMC Genomics        ISSN: 1471-2164            Impact factor:   3.969


Background

RNA editing is a post-transcriptional modification of mRNA [1-4], which may result in the synthesis of proteins that are not directly encoded in the genome. There are two major types of RNA editing in mammals, both of which occur via deamination of a base: either cytidine (which is turned into uridine) or adenosine (which is turned into inosine). Inosine is read by the ribosome (and by sequencers) as guanosine, and thus A → I modifications at the mRNA level translate into A → G changes at the genetic-code level. In this work we focus exclusively on A-to-I RNA editing, which is catalyzed by enzymes of the ADAR (Adenosine Deaminases that Act on RNA) family. ADARs are double-stranded RNA (dsRNA) binding proteins, and thus dsRNA is a prerequisite for A-to-I editing [1,2]. RNA editing is a fine-tuning mechanism, capable of changing only a few nucleotides, and both edited and unedited variants of the same transcript may be present in the cell.
A-to-I editing is known to be vital in vertebrates, and important for normal life in invertebrates. In Drosophila, knocking out ADAR activity causes the flies to exhibit defects in locomotion and mating and to suffer tremors [5]. ADAR-knockout C. elegans worms exhibit chemotaxis defects [6]. In mice, knocking out ADAR1 causes embryonic death and defects in erythropoiesis [7,8]. ADAR2 -/- mice die shortly after birth and are increasingly seizure-prone after postnatal day 12 [9]; this lethal phenotype is accounted for by a single editing site, which results in a single amino acid substitution in the gluR-B gene. In addition, altered A → I editing has been associated with several pathological conditions [10], mainly neuro-psychiatric conditions such as amyotrophic lateral sclerosis (ALS) [11], epilepsy [9,12], major depressive disorder [13-15], and glioblastoma multiforme [16]. Reduced A-to-I editing levels have been linked to cancer in various tissues, most strongly to brain tumors.
A correlation between the reduction of ADAR3 and tumor aggressiveness was observed, and overexpression of ADAR1 and ADAR2 resulted in a decreased proliferation rate of glioblastoma multiforme cell lines [17].
Isolation of inosine-containing transcripts from C. elegans and human brain showed that most A-to-I editing occurs in non-coding regions [18]. Genome-wide bioinformatic searches for A-to-I editing sites have enabled the identification of abundant A-to-I editing in the transcriptomes of several vertebrates [19-24]. It was found that editing occurs mainly within repetitive elements. These repetitive elements are likely to base-pair with a neighboring similar element and form the dsRNA structure which is the target of the ADAR enzymes. In particular, virtually all A-to-I editing events in humans occur within Alu repeats. The Alus are a particular set of primate-specific retrotransposons, approximately 280 nucleotides in length. They are the most abundant of all transposable elements in primates, making up more than 10% of the human genome, with some 1.1 million copies. Recent studies [21,23] have demonstrated that the frequency of A-to-I editing in humans is much higher than in mouse, rat, chicken and fly. This is due to the abundance and low diversity of the Alu elements as compared to similar elements in other genomes [24]: since Alu is so common in the human genome, there is a high probability that an Alu and a counterpart, oppositely oriented Alu exist nearby and are transcribed together. When the RNA transcript folds, these two Alus form a helix, thus becoming a target for the dsRNA-binding ADARs.
The physiological significance of A-to-I editing within non-coding repetitive elements is still elusive.
Several possible mechanisms have been suggested through which editing of a non-coding repetitive element might affect the fate of a transcript: editing may result in the insertion or elimination of a splice site, and may theoretically lead to the alteration of transcriptional start and stop codons [25]. Hyperedited inosine-containing RNAs might be cleaved at specific sites [26-29]. In addition, inosine-containing mRNAs were shown to be retained in the nucleus, suggesting an additional regulatory role for A-to-I editing [30,31]. However, the validity and scope of this last mechanism have been debated recently [32,33]. Finally, while the molecular significance is as yet unclear, editing within Alu repeats was shown to be altered in cancerous tissues [17].
A-to-I editing is characterized by a puzzling specificity and selectivity in the adenosines which are edited. In some substrates, e.g. the AMPA receptor gluR-B subunit in mice [34] and the E1 sites within an Alu repeat in the NARF gene [25], RNA editing is extremely efficient, editing 100% of transcripts at a specific adenosine. In others, such as most of the sites in Alu repeats, a seemingly random editing pattern is observed, where many adenosines are targeted with varying editing efficiency. However, careful analysis reveals that editing in Alu repeats is also highly reproducible: the variability among healthy individuals in the editing level at a given site within a specific Alu repeat is much lower than the site-to-site differences.
Sequence preferences for ADARs have been previously documented. C and T are overrepresented at the nucleotide 5' to the editing site, while G is underrepresented. At the nucleotide 3' to the site, G is significantly overrepresented [19,35-39]. These motifs are too weak, however, to fully characterize A-to-I editing. Therefore, the question still stands: what controls the editing level at each given site?
ADARs bind to the RNA via double-stranded RNA binding motifs.
Thus, dsRNA is a necessity for A-to-I editing. Indeed, it has been shown for the highly selective R/G editing site within the hairpin of the glutamate receptor subunit mRNAs that the identities of bases in the helical region are evolutionarily conserved, while the bases in the non-helical part of the hairpin covary so as to maintain their non-helical structure [40]. This distinctive feature demonstrates the importance of secondary structure to RNA editing. The internal structure of the dsRNA is expected to control the editing efficiency [41]. For example, it has been shown experimentally that internal loops may effectively be equivalent to helix termini in terms of editing efficiency [42]. Thus, internal loops along dsRNA, if large enough, may act as delimiters separating a large dsRNA into many small helices. Since ADARs deaminate fewer A's in shorter helices, the existence of such loops (along with the sequence preferences of the ADARs) might be a means to increase the specificity of editing. It is thus plausible that more features of the secondary structure of an RNA molecule play an important role in determining the specificity of adenosine deamination of an ADAR substrate.
In this paper we characterize A-to-I editing sites in terms of their secondary-structure, sequence, and thermodynamic properties. We describe the building of a database of MFOLD [43] foldings used to query these properties, and then present and discuss the results of those queries.

Results and Discussion

Structural Analysis

We first look at the editing frequency for each substructure type (see Table 1 and fig. 1). We compare a "test set" of A-to-I editing sites, which we denote by E1, with a control set of sites not known to be edited, denoted by E0. The E1 and E0 sets are defined precisely in the Methods section. Interestingly, while the existence of a helix is well known to be a prerequisite for editing, the overall frequency of E1 is actually more than twofold lower in helices (0.044) than in interior loops (0.091). As the overwhelming majority of E1 sites reside in helices and interior loops, we focus henceforth on these two substructures only. For clarity, we emphasize that by "interior loop" we mean only the unpaired nucleotides that form the loop's constituent strands.
Table 1

Editing frequency and average structure size for the various substructures.

substructure   count     E1 freq.   average size (E0)   average size (E1)
bulge          14941     0.031      5.74 ± 0.11         3.44 ± 0.11
hairpin        25318     0.032      8.98 ± 0.09         7.82 ± 0.37
helix          395255    0.044      12.47 ± 0.03        14.08 ± 0.16
interior       107112    0.091      5.61 ± 0.03         4.00 ± 0.05
junction       74788     0.024      23.56 ± 0.10        17.20 ± 0.49
strand         2763      0.020      13.38 ± 0.45        9.20 ± 1.74

size is the number of nucleotides in the structure for bulge, hairpin, interior, junction and strand, and the length of the helix for helices. The table lists average sizes with 95% confidence intervals.

"strand" here means an RNA strand which is not part of a loop (stand-alone strand).

Figure 1

Secondary structure substructures. Bold lines indicate those nucleotides formally included in a given substructure.
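
As an illustration of how the per-substructure frequencies of Table 1 can be tabulated, the following is a minimal sketch on toy data. The record layout, the function name `editing_frequency`, and the use of a normal-approximation 95% confidence interval are assumptions made here for illustration; the text does not state how its confidence intervals were computed.

```python
import math
from collections import Counter


def editing_frequency(sites):
    """Return {substructure: (E1 frequency, 95% CI half-width)}.

    `sites` is an iterable of (substructure, is_edited) records, a
    hypothetical layout chosen for this sketch.
    """
    totals, edited = Counter(), Counter()
    for substructure, is_edited in sites:
        totals[substructure] += 1
        if is_edited:
            edited[substructure] += 1
    result = {}
    for substructure, n in totals.items():
        freq = edited[substructure] / n
        # Normal approximation to the binomial (an assumption of this sketch).
        half_width = 1.96 * math.sqrt(freq * (1.0 - freq) / n)
        result[substructure] = (freq, half_width)
    return result


# Toy data only: 4 of 100 helix sites and 9 of 100 interior-loop sites edited.
toy_sites = (
    [("helix", True)] * 4 + [("helix", False)] * 96
    + [("interior", True)] * 9 + [("interior", False)] * 91
)
freqs = editing_frequency(toy_sites)
for name, (freq, half_width) in sorted(freqs.items()):
    print(f"{name}: {freq:.3f} +/- {half_width:.3f}")
```

The toy counts are chosen only so that the interior-loop frequency is about twice the helix frequency, mirroring the qualitative pattern reported above.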

Table 1 also suggests a length dependence. The editing prevalence as a function of length is given in figs. 2 and 3. (Henceforth, error bars represent 95% confidence intervals; some graphs of integer-valued variables have non-integer entries due to data binning.) Clearly, longer helices are more likely to be edited, while longer strands of interior loops are less likely to be edited. In addition, the length of the opposite strand (the one in which the editing site does not reside) also affects the editing frequency in an interior loop: as shown in fig. 4, symmetric loops are more likely to be edited.
Figure 2

Figure 3

Figure 4

Editing frequency is presented for sites within interior-loop strands of lengths 1 (circles), 2 (squares), 3 (diamonds) and 4 (triangles), as a function of the asymmetry of the loop. Asymmetry is defined as the difference between the length of the strand opposing the editing site and the length of the edited strand. Frequencies are normalized by the average editing frequency for sites having the same strand length, regardless of opposite strand length.

Furthermore, we study the effect of the location of the specific nucleotide within its substructure. We define cePos as the distance of the site (in nucleotides) from the closest edge of the substructure it is in (cePos = 0 means the very edge of a substructure). Figs. 5 and 6 present the frequency of E1 sites as a function of cePos. For helices, one observes a general trend of enhanced editing as a site lies deeper in the helix. For interior loops, however, there is a dramatic depletion of E1 for cePos > 0. In fact, 91% of edited sites in interior loops lie at the very edge of the loop, i.e. cePos = 0. Most of these are a single mismatch within an almost perfect helix (i.e., the opposite strand length is also one nucleotide). Such mismatches were already implicated as preferred targets of ADARs, as previous in vitro data as well as bioinformatic work indicate that A-C mismatches are more favorable substrates than A-T pairs [19,44]. However, it is worth noting that our analysis shows this trend to persist even for longer interior loops: interior-loop strands of length up to five nucleotides are more likely to be edited than the average site in a helix (see fig. 3).
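
The two positional variables used in this section, cePos and loop asymmetry, can be sketched as below. This is only an illustration of the definitions given in the text; the inclusive coordinate convention and the function names `ce_pos` and `loop_asymmetry` are assumptions of this sketch.

```python
def ce_pos(site, start, end):
    """Distance (in nucleotides) of `site` from the closest edge of a
    substructure spanning positions start..end inclusive.

    The inclusive coordinate convention is an assumption of this sketch.
    """
    if not start <= site <= end:
        raise ValueError("site lies outside the substructure")
    return min(site - start, end - site)


def loop_asymmetry(edited_strand_len, opposite_strand_len):
    """Length of the strand opposing the editing site minus the length of
    the strand that contains the site (0 for a symmetric loop)."""
    return opposite_strand_len - edited_strand_len


# A loop strand occupying positions 10..14: both ends have cePos = 0.
print([ce_pos(p, 10, 14) for p in range(10, 15)])
# A single mismatch inside a helix: one nucleotide on each strand.
print(loop_asymmetry(1, 1))
```

With this convention a single A-C mismatch is a loop strand of length 1 with cePos = 0 and asymmetry 0.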
Figure 5

Frequency of E1 sites as a function of cePos, for helices.

Figure 6

Frequency of E1 sites as a function of cePos, for interior loops.

For the cePos = 0 sites in interior loops, there is a significant (p < 2.2e-16) effect of the direction of the nearest neighboring helix: the A-to-I editing frequency is 0.068 for sites with a helix only on the upstream side, 0.094 for sites with a helix only on the downstream side, and 0.13 for sites with helices on both sides. The above results hold when controlling for the total length of the substructure: we compared E1 and E0 sites for helices of a given length, and for loops of a given size. The resulting trend was the same: for E1 sites in helices cePos is larger than for E0 sites, whereas in interior loops the relation is reversed. Other location variables tested, such as the position relative to the middle of the substructure or relative to the 5' end, did not show noticeable effects.

Sequence Analysis

We start with the nucleotide opposite the editing site. For helices, it is clear what this means: the "opposite" nucleotide of a site is the nucleotide that pairs with that site (and is therefore always T). We extend this idea, however, to sites at the edges of interior loops (i.e., having cePos = 0): for a site at the most 5' (3') nucleotide of the loop strand, the opposite nucleotide is the most 3' (5') nucleotide of the other strand in the loop. If the site is the only one on its strand, and the opposite strand has more than one nucleotide, the opposite nucleotide is undefined. We shall refer to the opposite nucleotide as opNuc for short. There is a very strong enrichment for sites with C at the opposite position: we looked at the frequency of E1 for sites with a given opNuc, and obtained a frequency of 18.5% for C, whereas for A the frequency was 5.1% and for G, 3.7%. This is consistent with (but more pronounced than) the data presented in [19-22,37,44].
Next we look at the statistics of the nucleotides upstream and downstream of the A-to-I editing sites. In order to avoid biases due to the underlying nucleotide statistics in Alu repeats, we do not look at the raw distribution of nucleotides but rather at the enrichment factor, i.e. by how much the editing frequency is increased (compared to the average within the respective substructure) when the neighboring site is a specific nucleotide. The enrichment factors are presented in figs. 7, 8 and 9 for the two immediate neighbor nucleotides separately, as well as for the joint variable composed of both upstream and downstream neighbors. Overall, the profiles found are similar to those seen in previous large-scale studies of editing [19-22,24,45]: T is most preferred upstream and is not preferred downstream, while G is most preferred downstream and least preferred upstream (in both helices and loops).
However, we do find a significant (p < 1.1e-16 for all comparisons) difference between the profiles for helices and loops. For example, the preference for an upstream T is stronger in helices, whereas the preference for a downstream G is stronger in interior loops.
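
The enrichment factor described above (editing frequency given a particular neighboring nucleotide, divided by the average editing frequency within the same substructure type) can be sketched as follows. The record layout, the function name `enrichment_factors`, and the toy counts are illustrative assumptions, not the authors' pipeline.

```python
from collections import defaultdict


def enrichment_factors(sites):
    """Editing frequency for each neighbor nucleotide divided by the
    overall editing frequency of the supplied sites.

    `sites` is an iterable of (neighbor_nucleotide, is_edited) records taken
    from a single substructure type (a hypothetical layout for this sketch).
    """
    sites = list(sites)
    overall = sum(1 for _, is_edited in sites if is_edited) / len(sites)
    counts = defaultdict(lambda: [0, 0])  # nucleotide -> [edited, total]
    for nucleotide, is_edited in sites:
        counts[nucleotide][1] += 1
        if is_edited:
            counts[nucleotide][0] += 1
    return {
        nucleotide: (n_edited / n_total) / overall
        for nucleotide, (n_edited, n_total) in counts.items()
    }


# Toy data: upstream T is edited in 20 of 100 sites, upstream G in 5 of 100.
toy = (
    [("T", True)] * 20 + [("T", False)] * 80
    + [("G", True)] * 5 + [("G", False)] * 95
)
factors = enrichment_factors(toy)
print({nucleotide: round(value, 2) for nucleotide, value in factors.items()})
```

Normalizing by the within-substructure average is what keeps the factor insensitive to the background nucleotide composition of Alu repeats, which is the motivation stated in the text.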
Figure 7

Enrichment factors for upstream nucleotide in helices and interior loops.

Figure 8

Enrichment factors for downstream nucleotide in helices and interior loops.

Figure 9

Enrichment factors for joint upstream, downstream nucleotides in helices and interior loops.

We also calculated the enrichment factors for the joint variable composed of the site's upstream neighbor, downstream neighbor, and opNuc. The results are displayed in Table 2.
Table 2

Frequency of E1 and enrichment factors for the joint distribution of upstream neighbor, downstream neighbor and opNuc.

up1,dn1:opNuc   E1 freq.   enrichment factor   # of sites
G,T:A           0.005      0.054               407
G,T:G           0.006      0.066               501
G,C:A           0.006      0.069               160
G,C:G           0.007      0.081               273
G,G:G           0.007      0.081               545
A,C:G           0.014      0.156               213
G,A:G           0.016      0.174               190
C,C:G           0.019      0.207               319
C,T:G           0.019      0.213               362
G,C:C           0.021      0.226               1268
C,T:A           0.022      0.240               412
G,G:A           0.023      0.256               1078
C,A:A           0.024      0.265               334
A,A:A           0.026      0.288               422
G,A:C           0.026      0.292               981
A,A:G           0.030      0.335               461
A,T:G           0.033      0.367               210
T,T:A           0.033      0.368               449
G,T:C           0.034      0.370               1016
T,T:G           0.035      0.383               202
C,C:A           0.037      0.408               784
T,C:G           0.041      0.454               170
G,A:A           0.042      0.468               118
C,A:G           0.044      0.481               528
C,G:G           0.046      0.506               567
A,T:A           0.055      0.608               236
A,C:A           0.059      0.651               339
G,G:C           0.067      0.737               1482
C,G:A           0.068      0.753               966
T,C:A           0.072      0.792               209
A,A:C           0.094      1.035               1226
T,A:G           0.096      1.063               197
T,A:A           0.104      1.149               96
T,G:G           0.110      1.209               228
C,T:C           0.129      1.421               1203
A,G:G           0.144      1.587               278
T,G:A           0.148      1.629               474
C,C:C           0.163      1.801               2321
A,G:A           0.166      1.825               278
C,A:C           0.174      1.923               1515
A,C:C           0.193      2.132               1314
A,T:C           0.248      2.740               1610
C,G:C           0.257      2.839               2797
A,G:C           0.268      2.952               1364
T,T:C           0.269      2.967               788
T,C:C           0.271      2.994               818
T,A:C           0.284      3.138               689
T,G:C           0.374      4.123               1589
In addition, we searched for enrichment in the extended neighborhood of the editing sites, looking at 30 neighboring nucleotides on each side of the site (upN refers to the nucleotide N sites upstream of the editing site, and dnN refers to the nucleotide N sites downstream of it). Almost all neighbors show a significantly different nucleotide distribution around edited sites; see Tables 3 (helices) and 4 (interior loops). The most significant differences (largest χ² scores) are observed for neighbors up1, up2, up7 and dn18 in helices and up1, dn1, up2 and up3 in interior loops. We note that while almost all 60 neighbors tested show a statistically significant difference, it is hard to tell whether these differences are due to ADAR preferences or rather stem from editing hot spots within the Alu.
We also present the enrichment factors for seven positions surrounding the editing sites which were reported to show preferences for specific nucleotides when surrounding ADAR2 editing sites [41]. As seen in Table 5, the patterns observed here for Alu editing are somewhat different: for example, locations dn10 and dn13 seem to favor G, in contrast to the opposite trend reported in [41] for ADAR2 sites. The differences might be due to the much larger sample we study here. Additionally, it is possible that editing sites in the coding region, most of which have a functional role, have different characteristics from those in Alu repeats. However, these differences could also result from differences between the profiles of ADAR1 and ADAR2. While the sample of editing sites studied in [41] is biased towards ADAR2 targets, the sample studied here, coming from a wide range of tissues, represents a different mix of the two enzymes, with a larger weight of ADAR1.
Moreover, the different splice variants of the ADARs possibly have varying editing efficiencies and site preferences. The mix of these variants occurring in our in vivo sample could also lead to slight variations in the preferences observed, as compared to the results of in vitro studies.
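
The χ² scores quoted for each neighboring position compare the nucleotide distribution around E1 sites with that around E0 sites. A minimal sketch of such a comparison is given below, assuming a Pearson chi-square statistic on a 2 × 4 table of nucleotide counts; the text does not spell out the exact test, and both the function name `chi_square` and the counts are purely illustrative.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list
    of rows of observed counts (all row and column totals must be > 0)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts of A, C, G, T at one neighboring position.
edited_counts = [30, 30, 10, 30]        # around E1 sites
unedited_counts = [250, 300, 300, 150]  # around E0 sites
print(round(chi_square([edited_counts, unedited_counts]), 2))
```

For a 2 × 4 table the statistic has three degrees of freedom, so with samples as large as those used here even small differences in nucleotide frequencies produce very large scores.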
Table 3

Comparison of nucleotide distribution for sites in the vicinity of E1 and E0 sites in helices.

Neighbor   χ²        A (E1)   C (E1)   G (E1)   T (E1)   A (E0)   C (E0)   G (E0)   T (E0)
-30        160.26    0.2762   0.2359   0.2812   0.2067   0.2378   0.2511   0.3092   0.2019
-29        194.86    0.2602   0.3016   0.2323   0.2059   0.2252   0.3003   0.2737   0.2009
-28        149.27    0.1804   0.2532   0.2360   0.3304   0.1831   0.2848   0.2414   0.2907
-27        51.33     0.1941   0.2644   0.2579   0.2836   0.2148   0.2563   0.2603   0.2686
-26        127.84    0.2433   0.2696   0.3097   0.1775   0.2475   0.2602   0.2831   0.2093
-25        146.16    0.2481   0.3034   0.2765   0.1720   0.2677   0.2643   0.2989   0.1691
-24        559.56    0.2381   0.2516   0.2442   0.2661   0.2908   0.2206   0.2798   0.2088
-23        116.79    0.2108   0.2729   0.3005   0.2158   0.2385   0.2510   0.2818   0.2288
-22        274.04    0.2386   0.2671   0.3280   0.1663   0.2588   0.2788   0.2719   0.1905
-21        361.42    0.1842   0.2367   0.3055   0.2736   0.2084   0.2608   0.2425   0.2884
-20        488.07    0.1840   0.2995   0.2788   0.2376   0.2232   0.2733   0.2232   0.2803
-19        447.66    0.2161   0.3301   0.2585   0.1954   0.2474   0.2614   0.2984   0.1928
-18        41.33     0.2064   0.2542   0.2599   0.2795   0.2186   0.2478   0.2716   0.2620
-17        133.29    0.1968   0.3006   0.2446   0.2580   0.2301   0.2720   0.2476   0.2503
-16        487.44    0.1912   0.2881   0.3059   0.2148   0.2526   0.2696   0.2497   0.2281
-15        54.11     0.1885   0.2952   0.2438   0.2725   0.2058   0.2809   0.2307   0.2826
-14        272.67    0.2501   0.2100   0.2827   0.2572   0.2447   0.2641   0.2500   0.2412
-13        147.24    0.2836   0.2594   0.2857   0.1713   0.2783   0.2888   0.2499   0.1830
-12        520.86    0.2269   0.2540   0.2710   0.2482   0.2255   0.3163   0.2074   0.2508
-11        451.16    0.1812   0.2835   0.2583   0.2770   0.2520   0.2614   0.2426   0.2440
-10        42.99     0.2679   0.2891   0.1903   0.2526   0.2681   0.2685   0.2033   0.2601
-9         114.34    0.2574   0.2715   0.2630   0.2081   0.2693   0.2449   0.2512   0.2345
-8         422.79    0.2130   0.2643   0.2539   0.2687   0.2685   0.2106   0.2619   0.2590
-7         1058.19   0.1511   0.3001   0.2043   0.3445   0.2331   0.2328   0.2429   0.2911
-6         174.05    0.2393   0.2353   0.3030   0.2224   0.2713   0.2333   0.2623   0.2332
-5         471.67    0.2435   0.2610   0.3035   0.1919   0.2890   0.2057   0.2803   0.2249
-4         262.05    0.2098   0.2465   0.2582   0.2856   0.2513   0.2657   0.2285   0.2544
-3         66.91     0.2704   0.2366   0.2404   0.2526   0.2841   0.2230   0.2582   0.2347
-2         1440.18   0.1643   0.2998   0.2271   0.3089   0.2895   0.2510   0.2245   0.2350
-1         5907.22   0.2886   0.3223   0.0918   0.2973   0.2576   0.3229   0.2921   0.1274
1          815.29    0.1975   0.2231   0.4573   0.1221   0.2637   0.2059   0.3682   0.1622
2          514.01    0.2274   0.3116   0.2008   0.2601   0.2918   0.2480   0.2102   0.2500
3          581.33    0.2055   0.3484   0.1837   0.2623   0.2872   0.2998   0.1810   0.2320
4          790.83    0.2177   0.2703   0.2854   0.2266   0.2691   0.2637   0.2032   0.2640
5          432.69    0.2395   0.2605   0.2970   0.2029   0.3035   0.2132   0.3007   0.1826
6          270.68    0.2612   0.1544   0.3715   0.2129   0.2879   0.1804   0.3148   0.2169
7          404.67    0.2103   0.3120   0.2573   0.2203   0.2508   0.2480   0.2791   0.2222
8          375.77    0.2373   0.2568   0.3026   0.2034   0.2902   0.2094   0.2820   0.2184
9          91.53     0.2882   0.2383   0.2245   0.2490   0.2887   0.2560   0.2354   0.2199
10         211.51    0.2553   0.2543   0.2786   0.2118   0.2897   0.2268   0.2496   0.2338
11         697.41    0.1906   0.3458   0.1663   0.2973   0.2691   0.3119   0.1806   0.2384
12         380.76    0.2003   0.2729   0.1865   0.3403   0.2549   0.2725   0.1893   0.2834
13         289.45    0.2619   0.2568   0.2916   0.1897   0.3117   0.2589   0.2434   0.1860
14         855.86    0.2283   0.2950   0.3352   0.1415   0.2767   0.2746   0.2526   0.1961
15         377.73    0.1799   0.2594   0.3243   0.2364   0.2367   0.2333   0.2833   0.2467
16         402.31    0.2394   0.2383   0.3174   0.2049   0.2802   0.2125   0.2660   0.2413
17         155.61    0.2600   0.2783   0.2547   0.2069   0.2540   0.2456   0.2593   0.2412
18         1056.40   0.1928   0.2101   0.3949   0.2022   0.2533   0.2331   0.2827   0.2308
19         951.47    0.2007   0.2938   0.3211   0.1843   0.2826   0.2260   0.2729   0.2184
20         243.36    0.2246   0.3339   0.2596   0.1820   0.2643   0.2864   0.2542   0.1952
21         571.53    0.1781   0.2857   0.2551   0.2811   0.2571   0.2505   0.2461   0.2463
22         725.43    0.2652   0.2713   0.3065   0.1570   0.2978   0.2464   0.2368   0.2190
23         28.63     0.2621   0.2540   0.3017   0.1822   0.2754   0.2568   0.2844   0.1833
24         499.38    0.2649   0.2733   0.2983   0.1636   0.3206   0.2269   0.2580   0.1946
25         141.56    0.2707   0.2457   0.2939   0.1897   0.3110   0.2284   0.2702   0.1905
26         543.54    0.3520   0.2268   0.2674   0.1538   0.2898   0.2239   0.2670   0.2193
27         220.85    0.2635   0.2715   0.2789   0.1861   0.2514   0.2782   0.2433   0.2271
28         568.72    0.1849   0.2029   0.2959   0.3163   0.2206   0.2465   0.2269   0.3060
29         199.02    0.2230   0.2470   0.2528   0.2772   0.2646   0.2192   0.2322   0.2841
30         97.03     0.2585   0.2078   0.3096   0.2242   0.2872   0.2094   0.2806   0.2229

The neighbor index gives the location of the nucleotide relative to the edited (or unedited) A, where negative values correspond to upstream nucleotides. There were 17,187 E1 sites and about 378,000 E0 sites here (not all E0 sites have all neighbors defined).

Table 4

Comparison of nucleotide distribution for sites in the vicinity of E1 and E0 sites in interior loops.

Neighbor   χ²        A (E1)   C (E1)   G (E1)   T (E1)   A (E0)   C (E0)   G (E0)   T (E0)
-30        97.88     0.2944   0.1880   0.2920   0.2255   0.2695   0.2264   0.2696   0.2345
-29        77.15     0.2535   0.2249   0.2540   0.2675   0.2320   0.2637   0.2525   0.2519
-28        91.61     0.1960   0.2207   0.2417   0.3417   0.2068   0.2516   0.2425   0.2992
-27        2.75      0.2247   0.2613   0.2596   0.2545   0.2246   0.2568   0.2670   0.2516
-26        4.52      0.2350   0.2720   0.2500   0.2430   0.2417   0.2750   0.2418   0.2415
-25        134.59    0.2420   0.2486   0.2239   0.2856   0.2580   0.2475   0.2576   0.2370
-24        133.46    0.2085   0.2218   0.3272   0.2425   0.2463   0.2384   0.2799   0.2353
-23        7.30      0.2390   0.2635   0.2617   0.2358   0.2432   0.2569   0.2544   0.2454
-22        52.35     0.2372   0.2504   0.2937   0.2187   0.2609   0.2629   0.2663   0.2099
-21        60.90     0.2339   0.2456   0.2587   0.2619   0.2468   0.2699   0.2291   0.2542
-20        59.59     0.2253   0.2372   0.2805   0.2570   0.2464   0.2351   0.2479   0.2706
-19        24.77     0.2506   0.2594   0.2494   0.2406   0.2527   0.2386   0.2653   0.2434
-18        58.31     0.2043   0.2669   0.2704   0.2584   0.2373   0.2470   0.2614   0.2543
-17        29.16     0.2112   0.2410   0.2724   0.2755   0.2197   0.2603   0.2561   0.2639
-16        135.38    0.2403   0.2471   0.3236   0.1890   0.2603   0.2502   0.2714   0.2180
-15        105.98    0.1849   0.2427   0.2664   0.3059   0.2228   0.2514   0.2560   0.2699
-14        54.33     0.2606   0.2152   0.3019   0.2222   0.2496   0.2447   0.2785   0.2272
-13        205.78    0.2890   0.1877   0.3317   0.1916   0.2795   0.2400   0.2769   0.2037
-12        228.23    0.1868   0.2349   0.2764   0.3019   0.2270   0.2715   0.2537   0.2478
-11        236.29    0.1555   0.2882   0.3067   0.2496   0.2183   0.2645   0.2662   0.2510
-10        89.14     0.2060   0.2653   0.2695   0.2593   0.2489   0.2497   0.2589   0.2425
-9         292.42    0.1939   0.3076   0.2877   0.2108   0.2594   0.2536   0.2579   0.2291
-8         30.94     0.2677   0.1964   0.3064   0.2295   0.2821   0.2113   0.2911   0.2155
-7         215.28    0.1802   0.2530   0.3269   0.2398   0.2382   0.2373   0.2774   0.2471
-6         101.36    0.2271   0.2597   0.3138   0.1995   0.2670   0.2347   0.2882   0.2100
-5         42.95     0.2734   0.2597   0.3004   0.1665   0.2709   0.2515   0.2845   0.1931
-4         233.23    0.2580   0.1873   0.2482   0.3066   0.2522   0.2413   0.2586   0.2480
-3         348.65    0.1739   0.3051   0.3311   0.1899   0.2357   0.2597   0.2793   0.2253
-2         801.12    0.1319   0.3376   0.2576   0.2729   0.2476   0.2556   0.2679   0.2289
-1         2126.16   0.2342   0.4720   0.0434   0.2504   0.2537   0.4022   0.2072   0.1368
1          1058.77   0.1864   0.2267   0.4276   0.1594   0.2681   0.2577   0.2739   0.2003
2          286.10    0.1901   0.3040   0.3206   0.1854   0.2432   0.2816   0.2590   0.2162
3          18.12     0.2495   0.2689   0.2165   0.2652   0.2667   0.2721   0.2063   0.2549
4          305.15    0.1705   0.2783   0.2719   0.2793   0.2413   0.2835   0.2240   0.2512
5          59.46     0.2358   0.2597   0.2849   0.2195   0.2703   0.2530   0.2620   0.2148
6          204.46    0.2500   0.2992   0.2671   0.1836   0.2644   0.2359   0.2846   0.2151
7          131.41    0.1930   0.3275   0.2314   0.2482   0.2255   0.2768   0.2469   0.2509
8          94.86     0.2310   0.2490   0.2996   0.2205   0.2724   0.2398   0.2667   0.2211
9          58.62     0.2569   0.2803   0.2656   0.1972   0.2750   0.2583   0.2473   0.2193
10         345.51    0.2084   0.3295   0.2893   0.1728   0.2616   0.2731   0.2471   0.2182
11         154.89    0.2131   0.2905   0.2166   0.2799   0.2549   0.3114   0.1850   0.2487
12         316.51    0.1599   0.3021   0.2146   0.3233   0.2358   0.2858   0.2061   0.2722
13         46.17     0.2451   0.2712   0.2496   0.2341   0.2759   0.2653   0.2314   0.2274
14         236.16    0.2052   0.2642   0.3278   0.2028   0.2637   0.2641   0.2672   0.2050
15         70.60     0.1946   0.2947   0.2711   0.2395   0.2284   0.2697   0.2589   0.2430
16         20.48     0.2917   0.2489   0.2549   0.2045   0.2835   0.2402   0.2521   0.2242
17         247.21    0.2104   0.2879   0.2879   0.2138   0.2648   0.2442   0.2498   0.2412
18         129.05    0.2228   0.2623   0.3005   0.2144   0.2697   0.2405   0.2672   0.2225
19         65.19     0.2967   0.2902   0.2103   0.2029   0.2839   0.2624   0.2254   0.2283
20         34.26     0.2335   0.2668   0.2505   0.2491   0.2532   0.2653   0.2278   0.2537
21         231.49    0.1794   0.2896   0.2905   0.2406   0.2380   0.2601   0.2474   0.2545
22         60.61     0.2726   0.2676   0.2488   0.2110   0.2783   0.2413   0.2402   0.2402
23         65.14     0.2495   0.3083   0.2519   0.1903   0.2684   0.2726   0.2515   0.2074
24         37.99     0.3123   0.2029   0.2840   0.2008   0.3094   0.2200   0.2591   0.2115
25         110.08    0.2531   0.2422   0.3048   0.1999   0.2977   0.2215   0.2736   0.2072
26         215.24    0.2745   0.2894   0.2753   0.1608   0.2691   0.2531   0.2552   0.2226
27         17.11     0.2703   0.2640   0.2459   0.2198   0.2760   0.2536   0.2357   0.2347
28         51.46     0.2334   0.2456   0.2681   0.2528   0.2520   0.2525   0.2367   0.2588
29         89.77     0.2545   0.2778   0.2366   0.2311   0.2690   0.2449   0.2223   0.2637
30         152.90    0.2346   0.2363   0.3191   0.2100   0.2785   0.2206   0.2728   0.2281

The neighbor index gives the location of the nucleotide relative to the edited (or unedited) A, where negative values correspond to upstream nucleotides. There were 9,711 E1 sites and about 97,000 E0 sites here (not all E0 sites have all neighbors defined).

Table 5

Nucleotide enrichment for several locations neighboring editing sites

neighbor   helix    interior   pattern reported in [41]
up18:A     0.9468   0.8720     C = T > G
up18:C     1.0250   1.0727
up18:G     0.9590   1.0313
up18:T     1.0635   1.0149
up9:A      0.9579   0.7649     G > A = T
up9:C      1.1032   1.1904
up9:G      1.0449   1.1040
up9:T      0.8916   0.9270
up1:A      1.1147   0.9294     T = A > G
up1:C      0.9981   1.1554
up1:G      0.3239   0.2256
up1:T      2.2055   1.7022
dn7:A      0.8456   0.8678     T > C > G = A
dn7:C      1.2455   1.1647
dn7:G      0.9262   0.9433
dn7:T      0.9932   0.9908
dn9:A      1.0000   0.9407     U > A > G
dn9:C      0.9352   1.0777
dn9:G      0.9574   1.0675
dn9:T      1.1277   0.9083
dn10:A     0.8874   0.8125     A = T > G = C
dn10:C     1.1173   1.1853
dn10:G     1.1126   1.1540
dn10:T     0.9114   0.8079
dn13:A     0.8481   0.8988     T = A > G
dn13:C     0.9946   1.0219
dn13:G     1.1912   1.0726
dn13:T     1.0215   1.0281
dn15:A     0.7704   0.8653     G > T > A = C
dn15:C     1.1095   1.0859
dn15:G     1.1407   1.0447
dn15:T     0.9629   0.9884

Nucleotide distributions at certain locations around editing sites, reported to exhibit nucleotide biases [41]. For each of the sites, we present the probability to have a given nucleotide N when the 0 location adenosine is edited, divided by the probability of that nucleotide regardless of whether the 0 adenosine is edited: P(N|0 adenosine is edited)/P(N). Values > 1 indicate enrichment of N for edited sites, and < 1 indicate depletion. upX = X nucleotides upstream of site 0, dnX = X nucleotides downstream. The probabilities are given for editing sites in helices and interior loops separately, but are very similar for both. For comparison, we present the patterns reported in [41].
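
As a rough consistency check of the ratio P(N|0 adenosine is edited)/P(N), the sketch below recomputes two helix entries of Table 5 from the up1 (-1) row of Table 3. It assumes that the eight frequencies in that row are A, C, G, T for E1 sites followed by A, C, G, T for E0 sites, and that P(N) is the count-weighted mixture of the two sets, using the approximate totals quoted in the note to Table 3 (17,187 E1 sites and about 378,000 E0 sites). The function name `enrichment_ratio` is illustrative.

```python
def enrichment_ratio(p_edited, p_unedited, n_edited, n_unedited):
    """P(N | edited) / P(N), with P(N) taken as the count-weighted mixture
    of the edited (E1) and unedited (E0) nucleotide frequencies."""
    p_overall = (n_edited * p_edited + n_unedited * p_unedited) / (
        n_edited + n_unedited
    )
    return p_edited / p_overall


# Approximate helix totals from the note to Table 3 (the E0 total is rounded).
N_E1, N_E0 = 17187, 378000

# up1 row of Table 3, under the assumed column order A, C, G, T (E1 then E0).
ratio_t = enrichment_ratio(0.2973, 0.1274, N_E1, N_E0)
ratio_g = enrichment_ratio(0.0918, 0.2921, N_E1, N_E0)
print(f"up1:T {ratio_t:.3f}  up1:G {ratio_g:.3f}")
```

Under these assumptions the ratios come out close to 2.21 for T and 0.32 for G, in line with the helix values 2.2055 and 0.3239 listed for up1 in Table 5; small deviations are expected because the E0 total is only given approximately.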

Thermodynamic Calculations

Finally, we study the effect of thermodynamic stability on editing efficiency. For each genomic neighborhood, we look at the thermodynamic average over all the low free-energy structures. The laws of statistical mechanics give us the probability that the RNA is in a specific secondary structure n,

$$P_n = \frac{e^{-G_n / k_B T}}{Z},$$

where T is the temperature in kelvin, $k_B$ is Boltzmann's constant and Z is defined by the sum

$$Z = \sum_n e^{-G_n / k_B T},$$

where the label n runs over all available foldings, and $G_n$ are the respective free energies. In practice, we only use those folds generated by MFOLD, which are expected to be all folds relevant at human-body temperature. We may now, for example, calculate the probability of some particular site i to be in a helix,

$$P_i(\mathrm{helix}) = \sum_n P_n \, \chi_n^{\mathrm{helix}}(i),$$

where $\chi_n^{\mathrm{helix}}(i)$ is the indicator function, defined by

$$\chi_n^{\mathrm{helix}}(i) = \begin{cases} 1 & \text{if site } i \text{ is in a helix in structure } n, \\ 0 & \text{otherwise.} \end{cases}$$

Similarly, one may calculate the probabilities for all other substructures. Let S denote the set of possible substructures. We define a site's structural entropy to be

$$\sigma_i = -\sum_{s \in S} f_i(s) \ln f_i(s),$$

where $f_i(s)$ is the frequency of site i being in substructure of type s. This entropy is a measure of the thermodynamic volatility of the site's substructure: if a site is always in the same substructure (e.g. the site is always in a helix), it will have zero structural entropy. If, however, the site's substructure fluctuates, for example between a helical structure and a loop structure, it will have higher structural entropy. The structural entropy of a site with equal probability to be in two different substructures is ln(2) ≈ 0.69. The highest possible structural entropy is that of a site which spends equal time in each of the possible substructures, namely ln|S|. Figs. 10 and 11 show the frequency of E1 as a function of the structural entropy, for sites whose lowest free-energy structure is helix or interior-loop separately. Interestingly, A-to-I editing is enriched for sites with low structural entropy, i.e. sites having a well-defined low-energy micro-structure. A wobbling state, fluctuating between two or more possible structures, is less well edited. This holds regardless of the ground-state structure, but the effect is stronger for interior loops: sites with a well-defined interior-loop structure are twice as likely to be edited as sites whose ground-state structure is also an interior loop but which have even a 1% probability of being in other structures.
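
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code. It assumes that each predicted fold comes with a free energy in kcal/mol, so the molar gas constant is used in place of Boltzmann's constant, and that each fold assigns the site a substructure label. The names `boltzmann_weights`, `substructure_probabilities` and `structural_entropy` are hypothetical.

```python
import math

R_KCAL = 0.0019872   # gas constant in kcal/(mol*K); assumption: energies are per mole
T_BODY = 310.15      # human-body temperature in kelvin

def boltzmann_weights(free_energies, temperature=T_BODY):
    """Probability of each fold, P_n = exp(-G_n / RT) / Z."""
    g_min = min(free_energies)  # shift energies for numerical stability
    weights = [math.exp(-(g - g_min) / (R_KCAL * temperature)) for g in free_energies]
    z = sum(weights)
    return [w / z for w in weights]

def substructure_probabilities(free_energies, labels, temperature=T_BODY):
    """Sum fold probabilities by the substructure the site occupies in each fold."""
    probs = {}
    for p, label in zip(boltzmann_weights(free_energies, temperature), labels):
        probs[label] = probs.get(label, 0.0) + p
    return probs

def structural_entropy(probs):
    """Entropy -sum_s f(s) ln f(s) over substructure frequencies."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)

# Toy example: two folds of equal free energy that place the site in
# different substructures, versus two folds that agree on the substructure.
fluctuating = substructure_probabilities([-50.0, -50.0], ["helix", "interior_loop"])
stable = substructure_probabilities([-50.0, -48.0], ["helix", "helix"])
print(structural_entropy(fluctuating))
print(structural_entropy(stable))
```

The first case reproduces the ln(2) value quoted in the text for a site split evenly between two substructures; the second gives zero entropy because the site never leaves the helix.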
Figure 10

Figure 11

Analysis of a large dataset of secondary structures of putatively edited Alu repeats reveals that structural motifs are indeed important in determining the A-to-I editing efficiency of a given site. Most notably, we highlight the strong preference for editing of adenosines within short symmetrical internal loops. Moreover, the microstructure also has a modest but noticeable effect on the cis-preferences of the ADARs. Long perfect dsRNA duplexes are often considered to be the best target for editing by ADARs. Here we find that sites adjacent to the edge of helices (cePos = 0) are even more efficiently edited. Averaged over our whole database, adenosines deep within (cePos > 10) long (> 30 bp) perfect helices are indeed edited more efficiently than the average adenosine in a helix - we find 1625 such sites, with editing frequency 8.2%, compared to only 4.4% for the average helix site. However, this is still lower than the average frequency for interior loops, 9.1%. Moreover, focusing on single A-C mismatches within a helix (i.e. cePos = 0 sites having neighboring helices on both sides and C as the opposite nucleotide) raises the frequency to 19%. Finally, choosing also the optimal neighbors, i.e. T upstream and G downstream, raises the editing frequency as observed in our database to 37%. We stress again that these frequencies should not be regarded as the true editing efficiency, but rather as a relative measure. Yet, one is able to conclude that the best way to engineer an efficient editing site is not to put it in a long perfect duplex, but rather in a single mismatch within a duplex. Interestingly, the 100% edited E1 site in the NARF gene [25] fits nicely with these engineering rules - it is a cePos = 0 site in a symmetric loop, with C opposite to it and T and G in the upstream and downstream sites, respectively. However, the strand length there is 3 and not the optimal 1.
An editing site that fully adheres to the above "rules" is the amber/W one of the hepatitis delta virus antigenome (genotype I) [46]. This site is critical for the virus to assemble viral particles and to be infectious [47]. Given the high adaptivity of viruses, it is not surprising to find that this site fits with all of the above: it is located in a single A-C mismatch within a helix (cePos = 0 and loop length = 2), and has T and G as its immediate neighbors. However, the Q/R site in GluRB does not fit our observations. It lies within a rather long (17 bp) helix, with cePos = 5, with C (rather than the optimal T) upstream and G downstream [48]. Yet, this site is also 100% edited. Apparently, there is still much more to learn about the characteristics of editing by ADARs, beyond the information presented in the present study. One possible explanation is that this site is known to be edited only by ADAR2 [49]. The two editing enzymes ADAR1 and ADAR2 are known to have overlapping, but distinct, preferences [36-38,50,51]. However, our approach does not allow us to distinguish between them. It was recently shown that editing of mouse B1 and B2 SINEs is mediated by both enzymes [39]. Some sites within these repeats are ADAR1 specific, some are ADAR2 specific and some are edited by both. It is not yet clear which enzyme is the main one in terms of Alu editing in human. Since our database is based on mRNA sequences from a wide range of tissues, it is possible that it characterizes mainly the profile of the widely expressed ADAR1 rather than that of ADAR2, which is expressed mainly in the brain. It is thus likely that some of the preferences identified in this work characterize ADAR1 and are therefore not present in the GluRB ADAR2-specific site.
The discrepancies between nucleotide distributions around the editing sites reported above and those reported by [41] for ADAR2 sites might also attest to differences between the ADAR2 profile and the one characterizing our dataset, which is probably a mix of the two enzymes with a larger weight of ADAR1. In an attempt to estimate the differences between the two enzymes, we compared 4657 editing sites supported by 13805 brain mRNAs, where both ADAR1 and ADAR2 are present, and 1684 sites residing in 10186 non-brain mRNAs, presumably edited mostly by ADAR1 (tissue origin was determined by UCSC annotation [52]). The patterns observed were similar but not identical. For example, 1376 of the 2966 brain sites residing in a helix (46.4%) had a G in the dn1 position, compared to 452 out of 1076 (42.0%) non-brain sites in a helix (p-value 0.013). However, the differences were not statistically significant after Bonferroni correction for multiple testing. Thus, a larger and better dataset (fully characterized in terms of the tissues studied) is required in order to study the small tissue differences between the preferences of the two ADAR enzymes.
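
The brain versus non-brain comparison above can be illustrated with a short sketch. The text does not state which statistical test produced the quoted p-value, so the pooled two-proportion z-test used here is an assumption; it happens to give a value close to the quoted 0.013. The number of tests entering the Bonferroni correction is also not given, so `n_tests=10` below is purely hypothetical.

```python
import math

def two_proportion_p_value(k1, n1, k2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (normal approximation)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # For a standard normal variable, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """True if p_value passes a Bonferroni-corrected threshold alpha / n_tests."""
    return p_value < alpha / n_tests

# Counts quoted in the text: G at dn1 for helix sites, brain vs. non-brain.
p = two_proportion_p_value(1376, 2966, 452, 1076)
print(round(p, 3))
print(bonferroni_significant(p, n_tests=10))  # n_tests is a hypothetical value
```

With any realistic number of simultaneously tested positions and nucleotides, the corrected threshold falls below this p-value, in line with the statement that the difference does not survive the correction.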

Conclusions

Using a dataset of 29,971 editing sites within Alu repeats, we analyzed the editing preferences. We found that the micro-structure a site resides in affects its editing frequency. In addition, the sequence motifs characterizing editing sites vary with the micro-structure. We have also shown that structural entropy and thermodynamic stability play a role in determining editing efficiency. We find that a nucleotide fluctuating between several possible structures is less likely to be edited than the weighted average of the editing probabilities for each possible structure alone would suggest. This provides a hint as to the rate of the A-to-I editing process compared to the relaxation time scales controlling the transition between the possible folds. Taken together, the results presented here could be of help in revealing the complex nature of the ADARs editing profile.

Methods

We construct a list of putative editing sites within Alu repeats following the method presented in [23,24]. That is, we use mismatches in the relatively clean RNA sequences, rather than the much larger but noisier EST data. We use UCSC alignments of human RNA sequences to the genome (http://genome.ucsc.edu) [52] and record all mismatches in these alignments. Then, known SNPs are removed, and the list is intersected with Alu locations, to obtain a set of mismatches within Alu repeats. Since A-to-I editing sites in Alu repeats tend to occur in clusters, we take only clusters of three or more consecutive identical mismatches. While this process is not inherently biased towards any specific type of mismatch, 98% of the mismatches found are A-to-G, suggesting that although these sites are typically supported by a single mismatch, they do represent true A-to-I editing sites with a low level of false positives. We then construct the predicted secondary structures using the following procedure: (a) For each site in our list, its Alu was located in the genome. Then, the nearest antisense Alu was located, and the genomic neighborhood that includes all nucleotides in and between the two Alus was taken, along with 200 extra nucleotides on each side. Of the inter-Alu distances, 61% are less than 1000 nucleotides, a further 22% lie between 1000 and 2000, 9% are between 2000 and 3000, and the remaining 8% stretch from 3000 to 6800 nucleotides. (b) Neighborhoods having > 400 bp overlap on the same strand were merged into a single neighborhood encompassing both. This step resulted in 3,276 neighborhoods, containing 29,971 putative editing sites. RNA segments corresponding to these neighborhoods were folded using MFOLD, resulting in predicted secondary structures. The accuracy of RNA secondary structure prediction by current dynamic programming algorithms (such as the MFOLD software) is moderate, up to 73% [53].
Yet, while false structure predictions would inevitably introduce noise to our analyzed data, the large sample size should allow for detecting a signal. Moreover, one should bear in mind that the RNA structures we consider - long and almost perfect dsRNAs - are relatively easy to analyze, and thus the accuracy of the folding algorithms is expected to be much higher than the above quoted rate. (c) We parsed the output of MFOLD into a relational database containing all the information about the secondary structures in which the various sites reside (the basic substructures are given in fig. 1). (d) All adenosines in the genomic neighborhoods were classified: we find 29,971 putative editing sites within Alu repeats, denoted by E1, and 590,206 adenosines within Alu repeats that were not detected as editing sites, denoted by E0. The adenosines which are not in Alu repeats do not enter our analyses. In the following, we use the E0 sites as a control set. It should be stressed that the sensitivity of the bioinformatic algorithms for detecting editing sites is rather low, mainly due to the low coverage, the low editing efficiency of most sites and the tissue origin of the available sequences. For example, the observed editing efficiency averaged over all Alus, which is 0.048, is probably lower than the actual value. Therefore, the set of E0 sites should not be thought of as sites that are never edited, but rather as a background, maybe slightly depleted in editing sites. On the other hand, the set of E1 contains, with high precision, only edited sites [23].
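
Step (b) of the procedure above, merging neighborhoods that overlap by more than 400 bp on the same strand, can be sketched as a simple interval merge. This is an illustration under stated assumptions and not the authors' pipeline: the `Neighborhood` record, the half-open coordinate convention and the choice to chain merges (a merged neighborhood can absorb a further overlapping one) are all hypothetical.

```python
from typing import List, NamedTuple

class Neighborhood(NamedTuple):
    chrom: str
    strand: str
    start: int  # inclusive
    end: int    # exclusive

def merge_neighborhoods(neighborhoods: List[Neighborhood],
                        min_overlap: int = 400) -> List[Neighborhood]:
    """Merge same-strand neighborhoods whose overlap exceeds min_overlap bp."""
    merged: List[Neighborhood] = []
    # Tuples sort by chrom, then strand, then start, then end.
    for nb in sorted(neighborhoods):
        if merged:
            last = merged[-1]
            same_strand = (last.chrom, last.strand) == (nb.chrom, nb.strand)
            overlap = min(last.end, nb.end) - max(last.start, nb.start)
            if same_strand and overlap > min_overlap:
                # Replace the previous entry by one neighborhood encompassing both.
                merged[-1] = last._replace(end=max(last.end, nb.end))
                continue
        merged.append(nb)
    return merged

# Toy coordinates, for illustration only.
toy = [
    Neighborhood("chr1", "+", 0, 1000),
    Neighborhood("chr1", "+", 500, 1800),   # overlaps the first by 500 bp: merged
    Neighborhood("chr1", "+", 1700, 2500),  # overlaps by only 100 bp: kept separate
    Neighborhood("chr1", "-", 0, 1000),     # opposite strand: kept separate
]
result = merge_neighborhoods(toy)
for nb in result:
    print(nb)
```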

Authors' contributions

YK carried out the RNA folding computations, prepared the foldings database and performed the statistical analysis. EE designed the study and prepared the editing sites database. Both authors read and approved the final manuscript.
  52 in total

1.  RNA editing of the human serotonin 5-HT2C receptor. alterations in suicide and implications for serotonergic pharmacotherapy.

Authors:  C M Niswender; K Herrick-Davis; G E Dilley; H Y Meltzer; J C Overholser; C A Stockmeier; R B Emeson; E Sanders-Bush
Journal:  Neuropsychopharmacology       Date:  2001-05       Impact factor: 7.853

2.  A phylogenetic analysis reveals an unusual sequence conservation within introns involved in RNA editing.

Authors:  P J Aruscavage; B L Bass
Journal:  RNA       Date:  2000-02       Impact factor: 4.942

Review 3.  Functions and mechanisms of RNA editing.

Authors:  J M Gott; R B Emeson
Journal:  Annu Rev Genet       Date:  2000       Impact factor: 16.830

4.  Point mutation in an AMPA receptor gene rescues lethality in mice deficient in the RNA-editing enzyme ADAR2.

Authors:  M Higuchi; S Maas; F N Single; J Hartner; A Rozov; N Burnashev; D Feldmeyer; R Sprengel; P H Seeburg
Journal:  Nature       Date:  2000-07-06       Impact factor: 49.962

5.  Requirement of the RNA editing deaminase ADAR1 gene for embryonic erythropoiesis.

Authors:  Q Wang; J Khillan; P Gadue; K Nishikura
Journal:  Science       Date:  2000-12-01       Impact factor: 47.728

6.  Widespread cleavage of A-to-I hyperediting substrates.

Authors:  Sivan Osenberg; Dan Dominissini; Gideon Rechavi; Eli Eisenberg
Journal:  RNA       Date:  2009-07-21       Impact factor: 4.942

7.  Double-stranded RNA adenosine deaminases ADAR1 and ADAR2 have overlapping specificities.

Authors:  K A Lehmann; B L Bass
Journal:  Biochemistry       Date:  2000-10-24       Impact factor: 3.162

8.  Substrate recognition by ADAR1 and ADAR2.

Authors:  S K Wong; S Sato; D W Lazinski
Journal:  RNA       Date:  2001-06       Impact factor: 4.942

9.  A-to-I pre-mRNA editing in Drosophila is primarily involved in adult nervous system function and integrity.

Authors:  M J Palladino; L P Keegan; M A O'Connell; R A Reenan
Journal:  Cell       Date:  2000-08-18       Impact factor: 41.582

10.  Purification and characterization of a human RNA adenosine deaminase for glutamate receptor B pre-mRNA editing.

Authors:  J H Yang; P Sklar; R Axel; T Maniatis
Journal:  Proc Natl Acad Sci U S A       Date:  1997-04-29       Impact factor: 11.205

