Pentatricopeptide repeat (PPR) proteins with an E domain have been identified as specific factors for C to U RNA editing in plant organelles. These PPR proteins bind to a unique sequence motif 5' of their target editing sites. Recently, involvement of a combinatorial amino acid code in the P (normal length) and S type (short) PPR domains in sequence specific RNA binding was reported. PPR proteins involved in RNA editing, however, contain not only P and S motifs but also their long variants L (long) and L2 (long2) and the S2 (short2) motifs. We now find that inclusion of these motifs improves the prediction of RNA editing target sites. Previously overlooked RNA editing target sites are suggested from the PPR motif structures of known E-class PPR proteins and are experimentally verified. RNA editing target sites are assigned for the novel PPR protein MEF32 (mitochondrial editing factor 32) and are confirmed in the cDNA.
The family of pentatricopeptide repeat (PPR) proteins, which is vastly expanded in plants, provides diverse RNA maturation functions mostly to the two organelles mitochondria and plastids [1], [2]. RNAs synthesized in these organelles from their resident genomes are processed by intron excision, 5′- and 3′-terminal processing, endonucleolytic fragmentation and RNA editing. Different classes of PPR proteins are involved in all of these steps. About 200 of the 450 PPR proteins in flowering plants belong to a subfamily characterized by related C-terminal extensions (E-domains; [3], [4]). All PPR proteins identified to be involved in RNA editing belong to this class [5]. So far only one exception has been documented where a protein with only PPR repeats but no extension influences editing at several sites [6]. A number of PPR RNA editing proteins have been identified through analysis of mutants with phenotypic defects in organellar functions. Other analogous mutants do not show physiological phenotypes and require a direct comprehensive analysis of all editing sites. Since this is very labour and cost intensive, particularly for the more than 400 sites in mitochondria, a tool to predict target sites from the sequence of a given candidate PPR protein will be very useful.
Previously, the strong sequence specific interactions between the non-extended PPR proteins and their target RNA have been used to determine features in the PPR repeats which correlate with the contacted nucleotide identities in the RNA [7]. Several of these non-extended PPR proteins have specific functions in internal and exonucleolytic RNA processing: they bind tightly to the RNA at specific sites and thereby protect the covered termini and/or guide RNA processing enzymes.
In initial correlations, the three amino acid positions 1′, 4 and 6 (also labelled as ii, 1 and 4 in [8]) in the repeat elements of these PPR proteins were found to be occupied by amino acids whose identity shows some accord with the nucleotide moiety opposite the respective repeat unit [8], [9]. Amino acid position 1′ (or ii) is the first amino acid of the C-terminally adjacent repeat. These correlations were recently further refined [7], [9] and experimental evidence confirmed the conclusions [10]. In experimental assays, amino acids at these positions were altered in several PPR repeats and the manipulated PPR protein was found to indeed attach selectively to the predicted novel RNA sequence motif [10].
These analyses covered two types of the PPR motifs, the ‘regular’ ones with 35 amino acids (called P type [4]) and some of the shorter ones with 31–32 amino acids (S type). The longer repeats (L type) with 35–40 amino acids were not included. These L repeats may be important in the PPR proteins involved in RNA editing, since this subgroup uniquely consists of alternating P-L-S elements. In large PPR proteins with more than ten repeat elements not all of them may actually contact the RNA; one or more may function as spacers to allow for the 3D alignment of RNA and PPR repeats. Such gaps could compensate for different spatial lengths of the nucleotide chain and the PPR repeats in the proteins [8]–[10]. However, repeats that are looped out cannot presently be distinguished from those contacting the RNA. Furthermore, the PPR elements attaching to RNA and those not binding to a nucleotide may vary between different target sites of a given PPR protein [11]. In this contribution we include the L motifs in the alignment and find that this improves the prediction of the RNA target sites for a given PPR protein. Experimental analysis of respective mutants confirms the accuracy of the prediction for several known PPR proteins and allows assignment of the RNA editing target sites for the novel factor MEF32 (mitochondrial editing factor). We believe that this refined code improves the potential to generate specific RNA binding proteins for any sequence, analogous to the possibilities the TAL proteins offer to DNA manipulation [12].
Methods
Computational Analysis of RNA Editing Factors
The L, L2 and S2 type motifs characteristic for RNA editing PPR proteins were not included in previous analyses. To investigate their potential contact with selected nucleotide identities, we analysed amino acid positions 6 and 1′ in all classes of repeat units in 41 PPR RNA editing factors and aligned them with the respective nucleotides in the upstream sequences of their target RNA editing sites (Figure 1 and Figure S1). Amino acids 6 and 1′ correspond to the sixth amino acid of the considered PPR motif and the first amino acid of the next C-terminal PPR motif which is accordingly termed 1′ (or 33), respectively (Figure 1 and Figure S1). To position the RNA, the fourth nucleotide upstream of each editing site (nucleotide –4) was aligned to the S2 motif. The S2 element is located directly N-terminal of the E motif. The PPR elements N-terminally following the S2 motif were aligned consecutively with the subsequent upstream (5′) nucleotides. In three separate considerations, amino acids at either position 6 or 1′ or the combination of amino acids at both positions were recorded with respect to the corresponding nucleotides.
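The alignment rule described above (S2 opposite nucleotide –4, each further N-terminal motif opposite the next 5′ nucleotide) can be illustrated with a minimal Python sketch. This is not the implementation used for the analysis; the `Motif` record, the function name `align_motifs`, the example protein and the example upstream sequence are all hypothetical and serve only to make the indexing explicit.

```python
from typing import List, NamedTuple, Tuple


class Motif(NamedTuple):
    kind: str   # "P", "L", "S", "L2" or "S2"
    aa6: str    # amino acid at position 6 of this motif
    aa1p: str   # amino acid at position 1' (first residue of the next motif)


def align_motifs(motifs: List[Motif], upstream: str) -> List[Tuple[Motif, int, str]]:
    """Pair each motif with one upstream nucleotide.

    `motifs` is ordered from N- to C-terminus and must end with the S2 motif.
    `upstream` is the RNA 5' of the edited C, written 5'->3', so that
    upstream[-1] is nucleotide -1 and upstream[-4] is nucleotide -4.
    """
    if not motifs or motifs[-1].kind != "S2":
        raise ValueError("the last motif must be the S2 motif")
    pairs = []
    for offset, motif in enumerate(reversed(motifs)):
        position = -4 - offset            # S2 -> -4, L2 -> -5, and so on
        if -position > len(upstream):     # no upstream sequence left
            break
        pairs.append((motif, position, upstream[position]))
    return pairs[::-1]                    # back to N- to C-terminal order


# Hypothetical five-motif protein and hypothetical upstream sequence.
protein = [
    Motif("P", "N", "D"),
    Motif("L", "V", "S"),
    Motif("S", "T", "N"),
    Motif("L2", "I", "A"),
    Motif("S2", "A", "D"),
]
for motif, position, nucleotide in align_motifs(protein, "AUGCUAGGCAUU"):
    print(motif.kind, position, nucleotide)
```

Each returned triple carries the amino acids at positions 6 and 1′ together with the facing nucleotide, so the three tallies mentioned in the text (position 6 alone, position 1′ alone, or both combined) can be derived from the same list.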
Figure 1
Structure model of RNA editing PPR proteins and their alignment to the RNA editing target sequence.
The RNA editing PPR proteins are extended at their C-termini by E and often also by DYW domains. Different from P-type PPR proteins, the RNA editing PPR proteins contain alternating P-L-S type elements. The positions of the amino acid identities at positions 6 and 1′ are not given in the structurally correct position. These two amino acid positions have here been correlated to nucleotide identities (Figure S1). Dashed lines indicate their presumed connection to target nucleotide identities. Position 1′ is the first amino acid of the respective C-terminally adjacent repeat. For element S2 this position corresponds to amino acid 33 of this repeat while the E domain begins by convention only after amino acid 36. To illustrate this unclear assignment we placed position 1′ for the S2 element between the S2 and E domains. Question marks indicate the connections to the L, S2 and L2 domains investigated here for correlations with the opposite nucleotides. The nucleotide sequence is arbitrary and is spelled out solely to indicate the specific order of nucleotides here.
For the computational analysis, a combinatorial search was performed by considering all possible alignments of the PPR domains and the corresponding nucleotides and recording the frequency of each nucleotide for each amino acid at positions 6 or 1′ (data not shown) as well as in the combination of 6 and 1′ (Figure S2A). The combinatorial search for generating Figures S2A–C and S3 and for computing the nucleotide counts was performed under the technical computation environment MATLAB (www.mathworks.com) with a customized interface to the PPR and editing sites database under MS Excel.
To analyse the statistical probability, these data are compared to the overall number of nucleotides in the coding regions in all mRNAs in mitochondria and to those open reading frames in plastids where RNA editing has been observed (Figure S2B and Figure S2C). To avoid the effect of a biased nucleotide ratio in organellar transcripts, the computed ratios were further adjusted by correcting for the overall probability of a given nucleotide identity. For instance, the ratio of A in the listing (Figure S2B) is given by R(A) = N(A)/N(A+C+G+U) for each corresponding amino acid at the respective position, where N(·) denotes the respective nucleotide frequency. The probability of nucleotide A to occur at a certain position in organellar transcripts is given by NP(A) = Ntotal(A)/Ntotal(A+C+G+U). Adjusted nucleotide ratios AN(A) are deduced according to AN(A) = R(A)/NP(A) (Figure S2B and Figure S2C, and employed in Figures 1–4). Total nucleotide numbers are 12035 for A, 7634 for C, 8633 for G and 13724 for U. The resulting nucleotide probabilities (NP) are 0.286 for A, 0.182 for C, 0.205 for G and 0.327 for U. Furthermore, these adjusted nucleotide ratios are recalculated into percentages and then used as position-dependent scoring matrices in the program FIMO (link given below) for RNA editing site prediction (Fig. S2C).
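The normalization above can be followed step by step in a short Python sketch. The total nucleotide counts are those stated in the text; the per-amino-acid counts in the example are invented for illustration, and the conversion to percentages (rescaling the four adjusted ratios so that they sum to 100) is an assumption about how the recalculation was done, not a documented detail.

```python
from typing import Dict

# Total nucleotide counts as stated in the text.
N_TOTAL = {"A": 12035, "C": 7634, "G": 8633, "U": 13724}


def nucleotide_probabilities(totals: Dict[str, int]) -> Dict[str, float]:
    """NP(x) = Ntotal(x) / Ntotal(A+C+G+U)."""
    grand_total = sum(totals.values())
    return {nt: n / grand_total for nt, n in totals.items()}


def adjusted_ratios(counts: Dict[str, int], background: Dict[str, float]) -> Dict[str, float]:
    """AN(x) = R(x) / NP(x), with R(x) = N(x) / N(A+C+G+U)."""
    observed_total = sum(counts.values())
    if observed_total == 0:
        raise ValueError("no observations for this amino acid")
    return {
        nt: (counts.get(nt, 0) / observed_total) / background[nt]
        for nt in background
    }


def as_percentages(adjusted: Dict[str, float]) -> Dict[str, float]:
    """Rescale adjusted ratios so that they sum to 100 (assumed convention)."""
    scale = sum(adjusted.values())
    return {nt: 100.0 * value / scale for nt, value in adjusted.items()}


background = nucleotide_probabilities(N_TOTAL)
# Rounded to three decimals these are 0.286, 0.182, 0.205 and 0.327.
print({nt: round(p, 3) for nt, p in background.items()})

# Hypothetical counts of nucleotides found opposite one amino acid identity.
example_counts = {"A": 2, "C": 1, "G": 12, "U": 3}
print(as_percentages(adjusted_ratios(example_counts, background)))
```

Because AN(x) divides an observed frequency by a background frequency, a value above 1 marks a nucleotide that is over-represented opposite the amino acid in question and a value below 1 marks one that is under-represented.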
Figure 4
The L2 motifs in RNA editing PPR proteins correlate with nucleotide identities.
(A) Correlations between amino acid and respective nucleotide identities in the target RNAs reveal preferential combinations with amino acid position 6 in the sequence logos. (B) Amino acid identities leucine and isoleucine at position 6 correlate with C, U or A (respectively G) in descending frequency, whereas threonine at this position is prevalent opposite nucleotide A. As in the L repeats, amino acid V occurs with any nucleotide. Preferences at position 1′ are not apparent in the sample size available.
The P-values for the comparison of actual nucleotide ratios with total nucleotide ratios were calculated using G-tests. If P<0.1 for at least one of the four nucleotides, we treated the data as sufficiently significant for further analyses (Figure S2B and Figure S2C). The G-tests were calculated with an MS Excel spreadsheet downloaded from the web site of “Handbook of Biological Statistics” (http://udel.edu/~mcdonald/statgtestgof.html). Amino acid combinations or single amino acid identities at positions 6 and 1′ which occurred in fewer than three factors or in fewer than eight incidences were not included in the further analysis.
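The selection criteria can be sketched in Python as follows. The original calculation was done in a spreadsheet, and the text does not say how the test was partitioned; the sketch assumes one goodness-of-fit G-test per nucleotide (that nucleotide against the other three pooled, one degree of freedom), which is one reading of "P<0.1 for at least one of the four nucleotides". Function names and the example counts are hypothetical.

```python
import math
from typing import Dict, Tuple


def g_test_one_nucleotide(observed: int, total: int, expected_fraction: float) -> Tuple[float, float]:
    """Goodness-of-fit G-test for one nucleotide versus all others (1 d.f.).

    G = 2 * sum(O * ln(O / E)); categories with O == 0 contribute nothing.
    """
    observed_counts = [observed, total - observed]
    expected_counts = [total * expected_fraction, total * (1.0 - expected_fraction)]
    g = 2.0 * sum(
        o * math.log(o / e)
        for o, e in zip(observed_counts, expected_counts)
        if o > 0
    )
    g = max(g, 0.0)                          # guard against rounding below zero
    p_value = math.erfc(math.sqrt(g / 2.0))  # chi-square survival function, 1 d.f.
    return g, p_value


def passes_filters(
    counts: Dict[str, int],
    background: Dict[str, float],
    n_factors: int,
    alpha: float = 0.1,
    min_factors: int = 3,
    min_incidences: int = 8,
) -> bool:
    """Apply the thresholds from the text: at least three factors, at least
    eight incidences, and P < alpha for at least one nucleotide."""
    total = sum(counts.values())
    if n_factors < min_factors or total < min_incidences:
        return False
    return any(
        g_test_one_nucleotide(counts.get(nt, 0), total, background[nt])[1] < alpha
        for nt in background
    )


# Background probabilities as given in the text; counts are hypothetical.
background = {"A": 0.286, "C": 0.182, "G": 0.205, "U": 0.327}
print(passes_filters({"A": 2, "C": 1, "G": 12, "U": 3}, background, n_factors=4))
```

With a different partitioning, for example a single four-category test with three degrees of freedom, the P-values would differ, so the sketch should be read as an illustration of the filter logic rather than a reproduction of the published numbers.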
RNA Editing Site Prediction in the RNA
As previously surmised [8]–[10], we assumed that each PPR domain in a given PPR protein binds to one nucleotide and that the binding intensity of a PPR domain to a nucleotide correlates with the adjusted nucleotide ratio obtained by computational analysis. Therefore we directly employed the adjusted nucleotide ratio as putative binding intensity of each PPR domain and redefined this as “binding value”.
To predict the target RNA editing sites of a PPR protein, all PPR motifs in the PPR protein were aligned to the respective 5′ nucleotide sequences of all known RNA editing sites with the fourth nucleotide upstream of each editing site (nucleotide –4) assigned to the S2 motif. The binding values of each PPR motif domain were calculated with the FIMO program in the MEME suite (http://meme.nbcr.net/meme/fimo-intro.html) using the 30 nucleotides upstream of each of 34 chloroplast and 430 mitochondrial RNA editing sites in the respective coding regions (Figure S7). For mitochondrial transcriptomes, the sequences of all mitochondrial proteins with known functions and the ribosomal RNA sequences, and for chloroplast transcriptomes, all chloroplast proteins and ribosomal RNA sequences were retrieved from FLAGdb++ (http://urgv.evry.inra.fr/projects/FLAGdb/HTML/index.shtml) and used as reference sequence in the FIMO program. Predicted target sites were ranked by p-value.
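The scoring idea behind this prediction step can be illustrated with a simplified Python stand-in. It is not FIMO and does not compute p-values: it sums the log2 of the binding value (the adjusted nucleotide ratio, which is an observed-over-background ratio) for each motif and ranks sites by that sum. The `floor` parameter that avoids log(0), the site names, the binding values and the sequences are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

BindingValues = Dict[str, float]  # adjusted nucleotide ratio per nucleotide, one motif


def site_score(motif_values: List[BindingValues], upstream: str, floor: float = 0.01) -> float:
    """Sum of log2 binding values with the last motif (S2) placed at nucleotide -4.

    `motif_values` is ordered from N- to C-terminus; `upstream` is written
    5'->3' and ends at nucleotide -1 relative to the edited C.
    """
    score = 0.0
    for offset, values in enumerate(reversed(motif_values)):
        index = -4 - offset
        if -index > len(upstream):   # no upstream sequence left
            break
        score += math.log2(max(values.get(upstream[index], 0.0), floor))
    return score


def rank_sites(motif_values: List[BindingValues], sites: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (site name, score) pairs, best score first."""
    scored = [(name, site_score(motif_values, seq)) for name, seq in sites.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Three hypothetical motifs (N- to C-terminal); the last one plays the role of S2.
motif_values = [
    {"A": 0.5, "C": 0.5, "G": 0.5, "U": 2.0},
    {"A": 2.5, "C": 0.2, "G": 0.3, "U": 0.8},
    {"A": 0.4, "C": 0.3, "G": 3.0, "U": 0.5},
]
# Hypothetical upstream sequences of two candidate editing sites.
sites = {"site_1": "CCUAGAUU", "site_2": "CCACCAUU"}
for name, score in rank_sites(motif_values, sites):
    print(name, round(score, 3))
```

Ranking by a summed log-ratio score gives the same kind of ordering that a position-dependent scoring matrix produces, but the absolute numbers are not comparable to the p-values reported by FIMO.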
RNA Editing Analysis of the MEF11 and MEF32 Mutants
The top 20 ranked RNA editing sites predicted by the FIMO program for the two E-class PPR proteins MEF11 and MEF32 were analysed experimentally. Total cellular RNA was purified from the respective T-DNA (MEF32 At4g14170: SALK_039629) or EMS mutants (mef11-1) [14] with the RNeasy kit (Qiagen, Hilden, Germany). Reverse transcription reactions were performed with a set of 30 primers developed for mitochondrial transcripts in Arabidopsis thaliana [15]. PCR reactions were performed with the respective gene-specific primer sets. PCR products were purified with alkaline phosphatase and ExoI and were commercially sequenced (Macrogen, Seoul, Korea or LGC, Berlin, Germany).
Results
Previously, amino acid positions 6 and 1′, respectively, in each PPR motif were noted to show correlations between amino acid identity and the presumably contacted nucleotide identity. These could function as discriminators to convey RNA sequence specificity depending on the order of the repeat elements in the respective PPR protein. Indeed, experimental evidence confirmed the influence of these amino acid positions [10]. Since the L, L2 and S2 type motifs characteristic for RNA editing PPR proteins could not be included, we here set out to determine their influence.
The L, L2 and S2 Motifs Selectively Contact RNA Nucleotides
The recent computational analysis of the RNA sequence recognition code in PPR proteins by Kobayashi et al. [9] used 4614 PPR motifs from Arabidopsis PPR proteins. The subsequent improved analysis by Barkan et al. [10] is based upon the alignment of the P and S type repeats in three P-class PPR proteins and 37 E-class RNA editing factors. In that analysis the L, S2 and L2 type motifs were not considered as RNA binding elements but as spacers between P and S motifs. This assignment is supported by the observation that these L, S2 and L2 elements display amino acid compositions very different from the P and S type repeats. According to the suggestion of Rivals et al. [16], S2 is the C-terminal repeat at the border of the PPR tract and the N-terminus of the E domain (Figure 1). The L2 element is located just upstream, N-terminal to the S2 motif. The L elements are spaced between the P and S type repeats, usually in the order P-L-S [16].
We reasoned that the L, S2 and L2 elements may also have to be considered as RNA contacts, since they may be required for RNA sequence specificity of short PPR proteins. Several RNA editing factors have only a very limited number of PPR motifs, e.g. MEF20 has only eight such repeats [17], MEF8 and MEF8S have only five including the L, L2 and S2 elements (Figure 1) [13]. If in e.g. MEF20 the L, L2 and S2 domains are not involved in nucleotide recognition, only four PPR repeats are left to specify target RNA editing sites. These could contact only four nucleotides in the one-on-one mode, which would not be sufficient to target a specific RNA editing site in plant mitochondria. In fact, these four nucleotides and the C at the editing position occur 31 times in the mitochondrial transcriptome.
Therefore at least these and possibly other L, L2 and S2 motifs should be considered to contribute to the sequence specific binding of the PPR proteins involved in RNA editing in plant organelles. We thus probed these repeats to discern whether they also show a nucleotide selective code. To identify potentially discriminatory amino acids in the L, L2 and S2 repeats, we performed a computational analysis of correlations between amino acid and corresponding nucleotide identities. As informational base we used 42 RNA editing PPR proteins. These included two rice proteins with eight editing target sites, six Physcomitrella proteins with nine editing target sites and 34 Arabidopsis proteins with 57 editing target sites, a total of 74 RNA editing target sites. We focussed attention on the amino acid identities at the sites equivalent to the discriminatory positions in the P repeats [7]–[10].
The Amino Acid at the 1′ Position of S2 Motifs Correlates with the Target Nucleotide Identity
The site-specific PPR RNA editing factors and the RNA target sequences show optimal correlations when the PPR domains are aligned 5′ to the editing sites starting from nucleotide position –4 relative to the edited C in the upstream direction [8]–[10]. The S2 motifs are accordingly positioned at the –4 nucleotides (Figure 1). To test for S2 nucleotide-amino acid correlations we probed 42 S2 domains against 74 nucleotide identities in the –4 position. Figure 2A shows in the sequence logos the frequencies of individual amino acid identities at position 33 (i.e. 1′) opposite either A, C, G or U nucleotides. Figure 2B depicts the reverse analysis and shows how often an A, C, G or U nucleotide is found opposite amino acid threonine (T), aspartic acid (D) or asparagine (N) at position 33 (i.e. 1′). These nucleotide coincidences were calculated after adjusting for the respective A, C, G and U contents of the mitochondrial and plastid coding sequences (see Methods). For example, opposite nucleotide G most often amino acid D is found in amino acid position 33 (Figure 2A). If amino acid N is in this position, usually nucleotide A is present in the RNA (Figure 2B and Figure S2).
Figure 2
Amino acids in RNA editing PPR protein S2 motifs correlate with target nucleotides.
(A) Sequence logos were constructed for each of the four nucleotides facing the respective S2 domains in the predicted PPR-RNA interaction at position –4 relative to the edited C (Figure 1). Coincidences between nucleotide and amino acid identities are seen for position 1′ (also labelled as amino acid 33). No coinciding amino acid preference is seen with the C nucleotide. (B) The amino acid identity at position 1′ shows the most prominent correlations between D (aspartic acid) and nucleotide G, N (asparagine) and A, and T (threonine) and U. In the bar diagram, percentages of nucleotide identities coinciding with the respective amino acid are indicated. Nucleotide percentages are normalized for the overall A, C, G and U content as detailed in the Methods section. Sequence logos were derived with the web-based software at weblogo.berkeley.edu [25].
This 33rd amino acid position corresponds to the determining position 1′ of the degenerate P motifs in the N-terminal region of the extension (E) domain. This finding supports the observed weak similarity of the E domain with the structure of the PPR repeats and the interpretation that the E domain is a degenerate PPR motif.
The amino acid at position 6 is more variable in S2 than in P and S motifs and therefore requires a larger sample number than is presently available.
In L and L2 Motifs Amino Acid Identities at Positions 6 or 1′ Correlate with Specific Nucleotides
Sequence logos constructed from 153 L motifs (without the L2 repeats) aligned with 258 target nucleotides show that at position 6 amino acids valine (V) and also alanine (A) are present opposite all four nucleotide identities (Figure 3A). This non-discriminating coincidence may reflect a frequent function of the L domain as spacer or placeholder that was suggested previously [9], [10].
Figure 3
Amino acids at position 6 in RNA editing PPR protein L motifs correlate with nucleotide identities.
(A) Sequence logos opposite each of the four nucleotides show the amino acid identities in L domains of predicted PPR-RNA interactions at position 6. Amino acid V (valine) is prominent at all nucleotide identities and thus possibly represents non-discriminatory spacer elements. (B) Correlations between amino acid identities at position 6 are most prominent for amino acid P and to a lower extent also for L, I, T and M with nucleotide U and amino acid T (threonine) with A or G. Position 1′ shows no discernible correlation when amino acids I, L, P, T or M are present at position 6. When these amino acid identities are excluded (ex.), a weak correlation can be seen with amino acid N to nucleotide identity A or U.
Of the other position 6 amino acid identities, I, L, P, T and M correlate with a slight bias with different nucleotide identities. In the reverse analysis, most prominent are the absence of the G nucleotide opposite amino acid L and the positive correlation between amino acid P and nucleotide U (Figure 3B). In position 1′, a positive correlation is detected between amino acid N and nucleotides U or A. However, this is only seen when none of the five biased amino acids I, L, P, T or M is present at position 6. That the presence of a 1′ amino acid to nucleotide correlation depends on the position 6 amino acid identity may suggest that position 6 can overrule the influence of the amino acid identity at position 1′, at least in these instances. Here position 1′ refers to the amino acid identity at the first position of the next PPR element, which is most often an S-type PPR.
Overall, the nucleotide - amino acid correlations in the L domains are much weaker than those in the P and S motifs. The potential significance of other amino acids at these positions remains unclear or undetectable due to the limited number of samples.
The L2 motif is the C-terminal L motif, located directly N-terminal to the S2 motif (Figure 1). The N-terminal region of the L2 motif, facing the ‘regular’ PPR tract, shows high similarity to the N-terminal region of ‘normal’ L motifs, but the C-terminus is rather different. This difference and the smaller sample number of only one L2 element per PPR protein complicate the correlative analysis. Sequence logos constructed from 42 L2 motifs aligned with 72 target nucleotides suggest that amino acids T, I, V and L at position 6 correlate with a selective nucleotide bias similar to the ‘normal’ L domains (Figure 4A). The reverse analysis, which asks for each amino acid identity at position 6 which nucleotide lies opposite, shows biases for isoleucine, which is negatively correlated with nucleotide A, and for threonine, which is conspicuously absent opposite U but is positively correlated with A (Figure 4B and Figure S2).
In P and S Motifs Single Amino Acid Identities at Positions 6 or 1′ are Correlated with Nucleotide Preferences
The previously observed correlation between specific amino acid combinations at positions 6 and 1′ and nucleotide preferences [8]–[10] is confirmed by our computational analysis (Figure S1, Figure S2 and Figure S4). The major correlated amino acid pairs, T6N1′, T6D1′, N6D1′, N6N1′, N6S1′, N6T1′, S6N1′ and S6D1′, generally match the binding intensity between respectively modified repeats and corresponding RNA sequences demonstrated in vitro by Barkan et al. [10]. These typical amino acid combinations prevail in 85% of all P and S motifs.
To determine whether there are additional correlations beyond these combinations, we focused the next analyses on either of these two amino acid positions individually. Accordingly we scanned nucleotide - amino acid correlations for either position while excluding the prevalent amino acids at the respective other position 6 or 1′. This approach detects amino acid - nucleotide preferences in addition to the prevalent ones (Figure 5). For example, no C and G nucleotides are found opposite amino acid N at position 1′ in S motifs. While much less prominent than the combination of amino acid identities at positions 6 and 1′, these individual positions may prefer nucleotides singly or connected with rare amino acid identities at the respective other position 6 or 1′. To identify such additional rare combinations, yet larger sample numbers are required.
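The single-position scan described in this paragraph amounts to a tally of nucleotides opposite one amino acid position, after repeats carrying a prevalent amino acid at the other position have been set aside. The following minimal Python sketch illustrates the idea. It assumes that every aligned repeat is available as a record of motif type, amino acid at position 6, amino acid at position 1′ and the opposite nucleotide; the record layout, the function name and the toy data are illustrative assumptions and do not reproduce the MATLAB procedure used for the actual analysis.

```python
from collections import Counter, defaultdict

def single_position_counts(records, motif, position, excluded_partner):
    """Tally nucleotides opposite each amino acid at one position ("6" or "1'"),
    skipping repeats whose other position carries an excluded amino acid."""
    counts = defaultdict(Counter)
    for motif_type, aa6, aa1, nucleotide in records:
        if motif_type != motif:
            continue
        # focus = position being analysed, partner = the respective other position
        focus, partner = (aa6, aa1) if position == "6" else (aa1, aa6)
        if partner in excluded_partner:
            continue
        counts[focus][nucleotide] += 1
    return counts

# Hypothetical toy records: (motif type, aa at 6, aa at 1', nucleotide)
records = [
    ("S", "T", "N", "A"),
    ("S", "V", "N", "U"),
    ("S", "A", "N", "A"),
    ("S", "N", "D", "U"),
    ("P", "V", "N", "C"),
]

# Position 1' in S motifs, excluding repeats with T, N or S at position 6
table = single_position_counts(records, "S", "1'", {"T", "N", "S"})
print(dict(table["N"]))
```

Whether a resulting distribution is non-random would still have to be judged by a statistical test on sufficiently large counts; the sketch only produces the raw tallies.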
Figure 5
Positions 6 or 1′ in P and S motifs in RNA editing PPRs correlate with specific nucleotides.
Depicted are individual connections of positions 6 or 1′ in those instances where the most prominent combinatory amino acid identity correlations between positions 6 and 1′ are excluded as indicated (ex.). In these instances single amino acid positions correlate with distinct nucleotide preferences in S and P elements, respectively. For the S elements non-random distributions are found at positions 6 and 1′, for the P elements only at position 1′. The most prominent combinatory amino acid – nucleotide identity correlations which are excluded here have been identified previously [8]–[10].
Prediction of RNA Editing Target Sites
The computational analysis of the P, S, L, S2 and L2 motifs correlates nucleotide preferences with different combinations of amino acids at positions 6 and 1′ in all PPR elements. To evaluate the functional relevance of the correlations, we employed them to predict target nucleotide sequences from the respective amino acid identities in several PPR proteins known or suspected to be involved in RNA editing. The RNA sequences of editing sites were aligned starting from nucleotide –4 with the S2 element of the editing PPR protein. The number of the further upstream nucleotides is determined by the number of PPR motifs as in the previous studies [8]–[10]. We evaluated 430 RNA editing sites in Arabidopsis mitochondrial coding regions and 34 sites in chloroplasts. Predicted binding intensities of each PPR-nucleotide pair were calculated as relative values, were compared between different RNA editing sites and used for the ranking as detailed in the methods section.
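The alignment rule stated above (S2 element opposite nucleotide –4, each further motif one nucleotide further upstream) can be written down compactly. The sketch below is a hypothetical helper, not the code used in this study; the function name, its arguments and the example sequence are assumptions made for illustration.

```python
def align_motifs_to_site(upstream, n_motifs, start=4):
    """Pair PPR motifs with nucleotides upstream of an edited C.

    upstream: RNA sequence written 5'->3' and ending at position -1, i.e. the
              nucleotide immediately 5' of the edited C (the C is not included).
    n_motifs: number of PPR motifs, counted from the C-terminal S2 motif.
    start:    distance from the edited C of the nucleotide facing the S2 motif.
    Returns a list of (position, nucleotide) pairs, C-terminal motif first.
    """
    last = start + n_motifs - 1  # most upstream position that is needed
    if len(upstream) < last:
        raise ValueError("upstream sequence too short for this protein")
    return [(-pos, upstream[-pos]) for pos in range(start, last + 1)]

# Hypothetical 12-nucleotide upstream sequence and a toy protein with 3 motifs
pairs = align_motifs_to_site("AUGGCUACGUAC", 3)
print(pairs)  # [(-4, 'G'), (-5, 'C'), (-6, 'A')]
```

With `n_motifs=14` the helper covers positions −4 to −17, which is the range given for MEF11 in Figure 6A; with `n_motifs=10` it covers −4 to −13 as given for MEF32.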
Predicted nucleotide binding intensities (binding values) for each PPR repeat in the editing protein MEF11 are shown in the upper part of Figure 6A. For example, the binding values of the P motif at position 15 (that is, opposite nucleotide −15 relative to the edited nucleotide) are calculated as A = 0.73, C = 0.29, G = 0.41 and U = 0, respectively. To predict the target RNA editing sites for each PPR protein, the binding values of each of the individual PPR motifs are used as position-dependent scoring matrices for the FIMO program. This program converts log-odds scores for each of the RNA sequences into p-values, assuming a zero-order background model as detailed on the FIMO webpage (http://meme.nbcr.net/meme/fimo-intro.html). To predict the target sites of MEF11, the obtained p-values for all RNA editing sites are used for ranking the sites (Figure 6A, bottom part).
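The principle of position-dependent scoring can be illustrated with a deliberately simplified sketch. It does not reproduce FIMO, which converts log-odds scores into p-values; here the binding values facing each nucleotide are simply summed and the sites are sorted by that sum. Only the first matrix column uses the values quoted above for the P motif at position 15 of MEF11; the other columns, the site names and the sequences are hypothetical. Matrix columns and site nucleotides are assumed to be listed in the same order.

```python
def score_site(matrix, nucleotides):
    """Sum the binding values of each motif for the nucleotide it faces."""
    return sum(column[nt] for column, nt in zip(matrix, nucleotides))

def rank_sites(matrix, sites):
    """Return site names sorted from highest to lowest summed binding value."""
    scores = {name: score_site(matrix, nts) for name, nts in sites.items()}
    return sorted(scores, key=scores.get, reverse=True)

matrix = [
    {"A": 0.73, "C": 0.29, "G": 0.41, "U": 0.0},   # values quoted in the text
    {"A": 0.10, "C": 0.60, "G": 0.05, "U": 0.80},  # hypothetical column
    {"A": 0.20, "C": 0.20, "G": 0.90, "U": 0.10},  # hypothetical column
]

# Hypothetical candidate sites, one nucleotide per matrix column
sites = {"site_1": "AUG", "site_2": "UCA", "site_3": "GGG"}

ranking = rank_sites(matrix, sites)
print(ranking)  # ['site_1', 'site_3', 'site_2']
```

A plain sum was chosen only to keep the example short; a p-value based ranking as described in the text additionally accounts for how likely a given score is under the background nucleotide model.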
Figure 6
Prediction of nucleotide target sequences for MEF11 and the novel RNA editing PPR protein MEF32.
(A) For MEF11 (At4g14850), the RNA editing sites ranked at positions 1, 2 and 3 out of 430 sites have been previously identified as target sites (shaded light blue) [17]. When we analysed all 20 top ranked editing sites in a MEF11 mutant, sites ranked 10, 14 and 17 turned out to be also targets of MEF11 (shaded yellow). (B) Target sequence predictions for the previously unassigned mitochondrial RNA editing factor encoded by At4g14170 are shown for the top ranked twenty sites. When we investigated these in a T-DNA mutant of the gene At4g14170, the three top ranked sites were identified as bona fide targets; at these nucleotides editing is absent in the mutant. This locus has been accordingly renamed to indicate that it codes for the novel RNA editing protein MEF32. The respective top parts in panels A and B show the PPR motifs considered (shaded light green; including the L2 and S2 elements) and their alignment to nucleotide positions which are counted 3′ to 5′ from the edited C (from right to left, −4 to −17 and −4 to −13, respectively). Amino acid identities at positions 6 and 1′ are given (shaded blue) and the respective scores are shown. In the box below, the locations of the top twenty ranked sites are indicated and the assigned specificity factor is given for experimentally confirmed targets, here MEF11 or MEF32. For each repeat the score at each site is given (shaded ochre with the color intensity reflecting the score) and the p-value of the FIMO program as shown in the far right column is used for the ranking.
The accordingly top ranked RNA target motifs predicted for mitochondrial editing factor 11 (MEF11) include the editing sites where MEF11 is known to be involved at ranking positions 1, 2 and 3 [17]. The other top 20 ranked sites for MEF11 were analysed in the respective gene disrupted mutant plant. Of these predicted targets, matR-1730 at rank 10, ccmFc-378 (ccb452-378) at rank 14 and ccmC-568 (ccb256-568) at rank 17 were identified as previously overlooked targets of MEF11 (Figure S5). These sites had not been included in the original screen of mitochondrial editing sites in the MEF11 mutant.
In the next experimental test, we predicted the target RNA editing sites of the E-class PPR protein encoded by At4g14170, for which the target sites were unknown. To evaluate the predictive power of the program, we analysed a T-DNA insertion line of this gene for RNA editing defects at the top twenty predicted target sites (Figure 6B and Figure S5). In the respective cDNA of the T-DNA line of At4g14170, the three top ranked predicted target sites, nad1-571, ccmB-569 (ccb206-569) and cox2-27, are indeed unedited (Figure 6B and Figure S5), confirming the functional validity of the improved PPR prediction approach. The thus newly identified E-class PPR protein encoded by At4g14170 has now been renamed mitochondrial RNA editing factor MEF32.
Inclusion of the L, L2 and S2 Motifs Improves Prediction of RNA Editing Target Sites
To evaluate if the inclusion of the L, L2 and S2 domains improves assignment and ranking of predicted target sites over the previous method of using only P and S elements, we compared the prediction of target sequences by the two approaches. Figure 7 shows the rankings of the predicted RNA editing target sites of the presently published RNA editing factors as listed in Figure S3. Both approaches rank the known target sites for each RNA editing factor in the upper 50% in plastids (Figure 7A) as well as in mitochondria (Figure 7B). However, the ranking of many sites improves considerably when the L, L2 and S2 motifs are included. For example, one of the two experimentally identified target sites of CLB19, rpoA-200, is predicted at position three of 34 plastid editing sites when only the P and S codes are used (Figure 7A). Inclusion of the L, L2 and S2 elements raises this site to rank two. Overall, in plastids, the predictions of nine target sites improve by inclusion of the L, L2 and S2 motifs, eleven target sites are equally well predicted without these motifs, and three predictions are better with only the P and S elements.
Figure 7
Inclusion of L, L2 and S2 repeats generally improves the prediction accuracy of RNA editing targets.
Although the bona fide target sites are listed in the top ranks even without including the L, L2 and S2 repeats, their consideration mostly improves the prediction accuracy if only slightly. This suggests that these repeats also connect to target RNA sequences. Shown here are only the data for Arabidopsis. For Physcomitrella mitochondria, prediction ranks target sites always at the top, but then there are only very few editing sites in this moss (Figure S3). (A) Prediction of the target sites for the known chloroplast editing factors finds the identified targets within the top ranks out of the 34 RNA editing sites in chloroplasts of Arabidopsis. Prediction from only the P and S repeats (□) is usually sufficient, but inclusion of the L, L2 and S2 elements (•) often improves the ranking. (B) Analogous improvements of the predictions are seen within the 430 editing sites considered for mitochondrial PPR proteins. In a few instances the predicted PPR-RNA interaction drops in rank when the L, L2 and S2 elements are included (e.g. the targets of MEF18 and MEF19; further details are given in Figure S3). The nad6-95 target site of MEF8 (asterisk) cannot be ranked since the p-value is >1 in the FIMO program evaluation.
In the mitochondrial editing factors, overall target site prediction with the L, L2 and S2 elements improves in 19 instances, nine target sites are equally ranked without these motifs, and nine predictions are better with only the P and S elements (Figure 7B). Improved ranking is most striking with the short PPR proteins such as MEF8 which contain only few P and S elements. For another short protein, MEF20, the P and S repeats alone predict five candidate target sites with equal binding values, while consideration of the L, L2 and S2 motifs yields a clear ranking and moves the actual MEF20 target sites to the top (Figure S3). The ranking improvement increases when the entire transcriptomes of chloroplasts or mitochondria are used as reference sequences (Figure S7). Inclusion of the L, L2 and S2 motifs in the prediction thus generally improves the specificity towards the target RNA editing sites over the consideration of only the P and S motifs.
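The counts of improved, equal and worse predictions reported in this section follow from a site-by-site comparison of the two rankings. A minimal sketch of such a tally is given here. The function and variable names are assumptions, and of the example values only the ranks for rpoA-200 (three with P and S only, two with all motifs) are taken from the text; the other two sites are hypothetical placeholders.

```python
from collections import Counter

def compare_rankings(ranks_ps, ranks_all):
    """Classify each target site by how its rank changes when the L, L2 and S2
    motifs are added to the P and S motifs (a lower rank number is better)."""
    tally = Counter()
    for site, rank_ps in ranks_ps.items():
        rank_all = ranks_all[site]
        if rank_all < rank_ps:
            tally["improved"] += 1
        elif rank_all == rank_ps:
            tally["equal"] += 1
        else:
            tally["worse"] += 1
    return tally

# Rank of each known target site with P and S motifs only ...
ranks_ps = {"rpoA-200": 3, "site_x": 5, "site_y": 1}
# ... and with the L, L2 and S2 motifs included (site_x, site_y hypothetical)
ranks_all = {"rpoA-200": 2, "site_x": 5, "site_y": 4}

tally = compare_rankings(ranks_ps, ranks_all)
print(dict(tally))  # {'improved': 1, 'equal': 1, 'worse': 1}
```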
Discussion
Amino Acid Identities in L Motifs Correlate with Nucleotide Preferences in the RNA
The detailed analysis shows that the structures of many L and L2 motifs correlate with the corresponding nucleotides in the target RNA. However, the nucleotide bias in these L type motifs is less pronounced than that of the P and S elements and is limited to few specific amino acid identities at positions 6 and/or 1′ in about half of the L and L2 elements. About 45% of the L and L2 motifs do not show any nucleotide preference. These L and L2 motifs may actually function as spacers between P and S motifs as proposed by Barkan et al. [10]. In our analyses we did not include non-edited cytidines which have to be discriminated against by the PPR RNA editing factors. The low but detectable increase in the specificity of the RNA editing factors by inclusion of the L motifs may be important if not necessary to distinguish RNA editing sites from other not-to-be-edited cytidines in organellar transcripts.
The importance of the L motif for the function of the RNA editing PPR proteins is supported by the effect of the MEF3 SNP mutation in ecotype Ler in comparison to the Col accession. The only two unique amino acid exchanges between the two ecotypes are located in one of the L domains in MEF3. These differences must be responsible for the lower level of RNA editing at the atp4-89 target site in ecotype Ler (50%) in comparison to the 100% in Col [18]. One of these two amino acid alterations occurs at position 1′. The substitution of amino acid D by N from Col to Ler changes the nucleotide preference from ‘neutral’ to ‘G negative’ in our prediction. This shift may lead to a PPR-nucleotide mismatch and may thus decrease the overall binding intensity to the target RNA sequence and consequently also result in the observed lower RNA editing efficiency of this site in Ler plants.
Comparison of the Nucleotide Recognition Patterns of P, S and L Elements
The amino acids at positions 6 and 1′ in P and S and here also in L motifs are correlated with nucleotide identities and have consequently been proposed to be potential nucleotide binding amino acids [8]–[10]. In our analysis, the P and S motifs slightly differ in their bias of amino acid combinations and corresponding nucleotide identities. Most prominently, the amino acid combination NN at positions 6 and 1′ clearly shows a preference for C and U nucleotides in the P motifs but not in the S type repeats (Figure S2 and Figure S4).
Different from both P and S motifs are the amino acid – nucleotide correlations in the L elements. Among the four amino acid moieties at position 6 in the L domains for which we find nucleotide preferences, threonine in P and S domains has been correlated with A and G nucleotides. Other prevalent amino acids at position 6, notably L, I, M and P, so far have not been reported to be able to act as nucleotide binding amino acids; however, any amino acid in a peptide chain is potentially able to attach to any nucleotide [19]. Alternatively, these four amino acids may be merely tolerated by the contacted nucleotide. The lower nucleotide to amino acid correlation bias observed in L motifs in comparison to that of the P and S motifs may result from such a non-biased weak nucleotide affinity.
Evaluation of the Prediction Accuracy
The inclusion of the L, L2 and S2 repeats generally improves the correlation between amino acid identities in the PPR repeats and the target sequences in the RNA. This improvement is seen in the better accuracy in the prediction of these target sites from the PPR structures (Figure 7 and Figure S3). The high success rate of predicting correct target sites within the top twenty is by no means perfect, but may be better than it seems. Further high-scoring sites may be genuine targets even though they are still edited in mutant lines of the respective RNA editing PPR protein. A target site may be hidden in a mutant when another PPR protein compensates for the missing factor and sustains RNA editing. This has been documented for the MEF8 and MEF8S PPR proteins [13].
Some experimentally identified RNA editing target sites are ranked rather low in the prediction. Several possible explanations can be considered why, for example, predictions from MEF14 or MEF1 do not match the target sites very well. Firstly, amino acid identities at positions other than 6 and 1′ may influence the binding preference. Some of the EMS induced mutations and of the SNPs in ecotypes of Arabidopsis with lower RNA editing (e.g. MEF1 in ecotype Ler) occur at other positions within PPR motifs. These variant amino acids may influence the binding specificity and alter the RNA sequence recognized.
Secondly, the E and/or DYW domains may influence the target RNA sequence pattern. More than 90% of the −1 nucleotides at editing sites are C or U [20]; this nucleotide is located near the E and/or DYW domains of the specific PPR protein. Furthermore, the E domains display PPR like features and likely evolved from another PPR repeat.
Thirdly, the distance between the RNA editing site and the binding site of the cognate PPR protein may vary. As in the previous analyses [8]–[10], we aligned the PPR elements from the −4 position of the respective RNA editing site. This appears to be correct in most instances, but there are exceptions. For example, two successive editing sites are recognized by the PPR protein SLO2 [21], which suggests that the distance between the PPR protein and the RNA editing site can be flexible in some interactions. The recently identified MEF25 [22], which is necessary for one of the two successive RNA editing sites, matches the target sequence much better with an alignment from nucleotide –5, which improves ranking of the target site from position 73 to position 3 (Figure S3). Nevertheless, most of the target predictions are optimal for the –4 alignment when we probed alternative shifted alignments. For several PPR proteins the distance to the edited nucleotide is very rigid. For example, the CRR28 and MEF11 factors cannot edit the C-nucleotide immediately upstream of the bona fide editing site. MEF20 cannot alter the C subsequent to its respective target nucleotide and CRR22 precisely targets the central of three consecutive C-nucleotides [11], [14], [23].
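Probing alternative shifted alignments, as mentioned for the –4 versus –5 comparison, can be sketched as scoring the same protein at several start offsets and keeping the best one. The sketch below uses a plain sum of binding values as a stand-in for the p-value based ranking; the function names, the two-column matrix and the sequence are hypothetical and serve only to show the mechanics.

```python
def score_at_offset(matrix, upstream, offset):
    """Summed binding value when the C-terminal motif faces nucleotide -offset.

    matrix:   per-motif binding values, C-terminal motif first.
    upstream: RNA written 5'->3' and ending at position -1 (edited C excluded).
    """
    if offset < 1 or offset + len(matrix) - 1 > len(upstream):
        raise ValueError("alignment runs past the upstream sequence")
    return sum(col[upstream[-(offset + i)]] for i, col in enumerate(matrix))

def best_offset(matrix, upstream, offsets=(3, 4, 5)):
    """Return (offset, score) of the highest-scoring alignment among offsets."""
    scored = [(off, score_at_offset(matrix, upstream, off)) for off in offsets]
    return max(scored, key=lambda pair: pair[1])

# Hypothetical binding values for a toy protein with two motifs
matrix = [
    {"A": 0.9, "C": 0.1, "G": 0.0, "U": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.8, "U": 0.2},
]
upstream = "GAUUUU"  # hypothetical; position -5 is A and position -6 is G

offset, score = best_offset(matrix, upstream)
print(offset)  # 5
```

In this toy case the alignment starting at −5 scores highest, analogous to the situation described for MEF25; for most factors the text reports that −4 remains optimal.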
Fourthly, unique or very rare amino acid combinations at positions 6 and 1′ or at other positions may specify a nucleotide preference. Such correlations can only be identified in much larger numbers of samples than presently available.
Do RNA Editing Factors with Many PPR Repeats Allow Gaps in the Contact to RNA?
Previous analyses of coincidences between amino acids in P-type PPR proteins and RNA nucleotide identities had allowed gaps of one or two non-binding repeats opposite the respective nucleotides [8]–[10]. Experimental evidence suggests that P-class PPR proteins with large numbers of PPR elements can bind to their target sequence with several mismatched nucleotide - PPR element gaps or loop-outs. Since so far there is no experimental evidence that E-class PPR proteins permit analogous gaps, we did not allow for loop-outs of nucleotides in the RNA or of repeat elements in the PPR proteins. In the resulting alignments the rare scores lower than 0.05 potentially derive from analogous gaps between protein and RNA (Figure S3). This will have to be investigated experimentally.
Contrary to P-type PPR proteins, binding of the C-terminal repeats should be important for the E and DYW-class PPR proteins, since the RNA editing site is always positioned at the C-terminal end of the respective PPR protein. One of the shortest specific RNA editing factors, MEF20, possesses only eight PPR domains which by default have to be sufficient for specific targeting. By extrapolation, the eight C-terminal PPR domains may be sufficient for specific targeting in at least some other RNA editing factors as well. To investigate this possibility, we compared the ranking of target editing sites between predictions derived from consideration of all PPR motifs and predictions which included only the eight PPR motifs at the respective C-termini (Figure S6). Half of the PPR editing factors still ranked in the top 10%, including all those with more than 20 PPR repeats. Interestingly, MEF14 and SLO2 actually improve ranking of their target sites when considering only the eight C-terminal PPR motifs, suggesting that the rest of the PPR motifs, located towards the N-terminus, may not contribute to the specific targeting.
While this report was under review, an analysis was published which suggests that a third amino acid position may be involved in determining the nucleotide identity bound by a given repeat element in RNA editing PPR proteins [24]. In our analysis we did not observe such an additional discriminating position, which may be due to the differing approaches and selections.
Conclusions
The correlative analysis of the L, L2 and S2 type repeats shows an analogous albeit weaker connection to the respective nucleotide identities than the P and S elements and suggests that these repeats often also contact the RNA. Inclusion of the L, L2 and S2 type repeat correlations generally improves the prediction accuracy of finding target RNA sequences for a given E-class RNA editing PPR protein. This will increase the efficiency to assign RNA editing target sites to novel E-class PPR proteins.
The improved correlation will furthermore enhance the chances of manipulating these RNA editing PPR proteins. For example, it should be easier to complement (or abolish) only one of multiple RNA editing targets by respective specific alterations in the PPR protein. This will give access to analyse the effect of an individual RNA editing event even when the complete loss of the editing factor is lethal.
Finally, this information will allow to create RNA editing factors which can edit any cytidine in any transcript specifically and thus facilitate the generation of ‘RNA mutants’ in mitochondria. Such constructs will circumvent the difficulties encountered in mitochondrial transformation. These manipulations are not restricted to plants or to mitochondria, but PPR proteins can be generated for any RNA target in any organism. Especially the here analysed type of E-class PPR proteins will be very interesting since during editing they contact the RNA and then dissociate again. Pure P-type proteins often bind tightly to their RNA target and cannot be removed.
Evaluation of coincidences between amino acids within PPR elements and corresponding nucleotides. Part of the entire set of alignments is shown for the PPR editing protein MEF1. Each amino acid in each motif is analysed for co-occurrences between amino acid and corresponding nucleotide identities. Amino acids at positions 1′ and 6 show the strongest correlations as depicted in Figure 1. For the length of motif S2 convention assigns 36 nucleotides which positions amino acid 1′ into S2 (indicated by an arrow and shading), different from all other elements where amino acid 1′ is usually assigned to the first amino acid position of the C-terminally adjacent element.
(A) Printout of an example from the MATLAB output alignment. In this sample, amino acids at position 6 (top row), paired with each of the four nucleotides (second row), are correlated with the amino acids present at position 1′ (left column). The numbers of appearances in the 41 PPR RNA editing proteins are given in the figure. This and further data sets are evaluated and compiled in Figure S2B.
(B) The data set used for assigning co-occurrences between nucleotides in target sequences and amino acid identities at positions 6 and/or 1′ in the indicated motifs of RNA editing PPR proteins. For example, for the S2 motif, amino acid D is found at position 1′ four times with an A nucleotide, two times with C, 22 times with G and seven times with U. The respective probabilities (P-values) calculated by G-test for each nucleotide are given color coded in the right columns; the color code is shown at the bottom right.
(C) The adjusted data set used for assigning co-occurrences at positions 6 and/or 1′ in the indicated motifs of RNA editing PPR proteins. The raw numbers of nucleotide-amino acid co-occurrences from Figure S2B were adjusted for the G+C content of mitochondrial sequences. These adjusted values, recalculated as total ratios, were used in the further analyses.
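The G-test mentioned for panel (B) can be illustrated with the S2 example quoted there (D at position 1′: 4 A, 2 C, 22 G, 7 U). The following Python sketch is not the MATLAB routine used for the figure. It assumes a goodness-of-fit test of the four counts against a uniform nucleotide background, and it tests the four nucleotides jointly rather than reporting one P-value per nucleotide as the figure does. The function name `g_test` and the `background` argument are hypothetical.

```python
import math

def g_test(counts, background=None):
    """Goodness-of-fit G-test for four nucleotide counts.

    counts: dict nucleotide -> observed count.
    background: dict nucleotide -> expected frequency (uniform if None).
    Returns (G statistic, P-value); the closed-form P-value used here
    is valid only for 3 degrees of freedom, i.e. four categories.
    """
    if len(counts) != 4:
        raise ValueError("this sketch expects exactly four nucleotide counts")
    total = sum(counts.values())
    if background is None:
        background = {nt: 1.0 / len(counts) for nt in counts}
    g = 0.0
    for nt, observed in counts.items():
        expected = total * background[nt]
        if observed > 0:  # a zero count contributes nothing to the sum
            g += observed * math.log(observed / expected)
    g = max(2.0 * g, 0.0)  # guard against tiny negative rounding errors
    # Survival function of the chi-square distribution with 3 degrees of freedom.
    p = math.erfc(math.sqrt(g / 2.0)) + math.sqrt(2.0 * g / math.pi) * math.exp(-g / 2.0)
    return g, p

# Counts quoted in the legend: amino acid D at position 1' of the S2 motif.
s2_d = {"A": 4, "C": 2, "G": 22, "U": 7}
g_stat, p_value = g_test(s2_d)
print(f"G = {g_stat:.2f}, P = {p_value:.2e}")
```

With a uniform background, the strong preference for G yields a large G statistic and a small P-value. Passing organellar nucleotide frequencies as `background` would be one way to account for G+C content, although the adjustment actually applied for panel (C) is not specified in detail here.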
The data set derived with the prediction tool shows the nucleotide target sequences assigned to RNA editing PPR proteins based upon their amino acid identities at positions 6 and 1′. In the upper left, the gene names and their identifier numbers are given, and species, type of PPR protein (E or DYW), the organellar locations and the references as listed in References S1 are indicated. Below, the respective target sites are identified by gene name and the nucleotide position affected. For each of the PPR repeats, which are displayed from right to left in the C- to N-terminal direction, the p-values and the ranking of the respective target sites aligned from nucleotide –4 of the respective target sequence are given with or without considering the L, L2 and S2 elements as indicated. For MEF8 the predicted target sites are found several times in the mitochondrial set of editing sites; this and the low ranking are due to the small number of PPR elements in this protein. In the upper part, data for Arabidopsis thaliana (At) are shown with the plastid factors shaded green; in the lower part, predictions for the PPR proteins for editing in mitochondria of Physcomitrella patens (Pp) are listed, shaded light green. At the bottom, predictions for Oryza sativa (Os) are given shaded yellow with the genuine target sequence put into the Arabidopsis editing site set (with At). Red cells in the target sequence indicate the nucleotide encoded as C in the genome and changed to U by RNA editing. Five of the seven targets of OGR1 are not found in Arabidopsis. Prediction for the rice editing site ccmC-458 is reasonable with the genomic sequence (unedited; rank 15), but deteriorates to rank 71 after another site in the upstream sequence is edited, i.e. converted to U in the target motif. Not included in the calculations were MEF32, identified here, the new target sites of MEF11, and the PPR proteins OTP71, OTP72 and MEF25, which became available after we had initiated the assignments.
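The alignment and ranking described in this legend can be sketched in Python as follows. This is not the authors' prediction tool: the function names, the `CODE` table and its probabilities are hypothetical placeholders (the paper derives its values from the adjusted co-occurrence data set), and a log-likelihood score stands in for the p-values of the table. The sketch assumes one repeat per nucleotide, with the C-terminal repeat aligned to nucleotide –4 relative to the edited C and each preceding repeat one nucleotide further upstream.

```python
import math

# Hypothetical placeholder probabilities P(nucleotide | amino acids at 6 and 1').
# These are illustrative values, not the values derived in the paper.
CODE = {
    ("N", "D"): {"A": 0.1, "C": 0.2, "G": 0.1, "U": 0.6},
    ("N", "N"): {"A": 0.1, "C": 0.5, "G": 0.1, "U": 0.3},
    ("T", "N"): {"A": 0.6, "C": 0.1, "G": 0.2, "U": 0.1},
    ("T", "D"): {"A": 0.2, "C": 0.1, "G": 0.6, "U": 0.1},
}
UNIFORM = {nt: 0.25 for nt in "ACGU"}  # used for combinations without an entry

def score_site(repeats, upstream):
    """Log-likelihood score of one candidate site.

    repeats: list of (aa6, aa1') tuples in N- to C-terminal order.
    upstream: RNA sequence 5' of the edited C, ending at nucleotide -1.
    The last repeat is aligned to nucleotide -4, i.e. upstream[-4].
    """
    score = 0.0
    for offset, repeat in enumerate(reversed(repeats)):
        index = len(upstream) - 4 - offset  # walk upstream, C- to N-terminal
        if index < 0:
            break  # sequence is shorter than the repeat tract
        probs = CODE.get(repeat, UNIFORM)
        # Compare with a uniform background so uninformative repeats add 0.
        score += math.log(probs[upstream[index]] / 0.25)
    return score

def rank_sites(repeats, sites):
    """Return site names ordered from best to worst score."""
    return sorted(sites, key=lambda name: score_site(repeats, sites[name]), reverse=True)

# Toy example: three repeats and two made-up candidate sites.
protein = [("T", "N"), ("N", "D"), ("T", "D")]
candidates = {
    "site_match": "CCAUGAAA",     # A, U, G at nucleotides -6, -5, -4
    "site_mismatch": "CCCCCAAA",  # C at nucleotides -6, -5, -4
}
print(rank_sites(protein, candidates))
```

Ranking a true target among all known editing sites, as in the table, would correspond to calling `rank_sites` with the full site set and reading off the position of the bona fide target in the returned list.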
Correlation between amino acid combinations at positions 6 and 1′ in P and S motifs and nucleotide identities. The amino acid combinations are given in the order 6 and 1′. Displayed are the percentages of coincidences between a given amino acid combination and the nucleotide identity. Amino acid combinations are shown from left to right ordered by their number of occurrences. These data are compiled from data as shown in figure S2C. For the P-elements the combinations N6D1′, N6N1′, T6N1′, T6D1′, N6T1′, N6S1′, S6N1′ and S6D1′ show the strongest correlations.
Target sites predicted for MEF11 and MEF32 are analysed in respective mutant plants. The top panels show a comparison of the cDNA sequences at the new target sites predicted for MEF11 between Col-0 wild type plants and the knock-out mutant mef11-1. While in the wild type plants the genomically encoded C is changed to T in the cDNA, the C remains unedited in mutant mef11-1. Site ccmFc-378 (ccb452-378) is a silent nucleotide exchange and is edited to only about 40% in Col-0 wild type plants. The lower panels show a comparison between the cDNA sequences at the target sites predicted for the previously unassigned PPR RNA editing factor MEF32 between Col-0 wild type plants and the knock-out mutant mef32. While in the wild type plants the genomically encoded C is changed to T in the cDNA, the C remains unedited in the mutant. The arrow points to the C peak not present in the wild type plants.
The eight C-terminal PPR elements including the L, L2 and S2 repeats are often sufficient to predict RNA editing targets. (A) Prediction of the target sites for the known chloroplast editing factors of Arabidopsis identifies bona fide targets within the top ranks from only the eight PPR elements at the C-terminus of the respective protein. Actually the ranking within the 34 RNA editing sites in chloroplasts is often better with only these eight PPR elements than when all PPR elements are included. (B) Analogous comparative analysis of the predictions within 430 editing sites considered for mitochondrial PPR proteins usually shows less faithful ranking with only the eight PPR elements at the C-terminus of the respective protein. Only in a few instances such as MEF14 and some sites of SLO2 the predicted PPR-RNA interaction is increased in rank in comparison to the prediction from all PPR elements.
Inclusion of L, L2 and S2 repeats generally improves the prediction accuracy of RNA editing targets in the respective entire transcriptome. In vivo, the PPR protein factor has to identify its bona fide RNA editing target sites among all C nucleotides present in the respective organelle. When the prediction accuracy is screened against all C nucleotides with or without the L, L2 and S2 repeats, their inclusion generally improves the ranking considerably. Shown here are data for selected RNA editing PPR proteins for plastids and mitochondria (MEFs) from Arabidopsis. Screening was done against all C nucleotides in all transcripts in plastids (17,886) and in the mitochondrial transcripts with known functions respectively, both as annotated in the Flagdb. Changes in the ranking predictions are seen for example with the novel mitochondrial PPR protein MEF32, for which rankings change from positions 5 to 3, from 61 to 18 and from 3 to 1 upon inclusion of the L, L2 and S2 repeats. For MEF11, the predicted PPR-RNA interactions change rank from positions 92 to 4, from 295 to 331, from 100 to 27, from 209 to 184 and from 342 to 21 when the L, L2 and S2 elements are included. Some target sites (asterisks) are not ranked in the top 1000.
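The rank changes quoted in this legend can be tallied with a few lines of Python. The sketch below only restates the numbers given for MEF32 and MEF11; the variable and function names are illustrative, and a lower rank is taken to mean a better prediction.

```python
# (rank without, rank with) the L, L2 and S2 repeats, as quoted in the legend.
RANK_CHANGES = {
    "MEF32": [(5, 3), (61, 18), (3, 1)],
    "MEF11": [(92, 4), (295, 331), (100, 27), (209, 184), (342, 21)],
}

def summarize(changes):
    """Count target sites whose rank improves (gets smaller) or worsens."""
    improved = sum(1 for before, after in changes if after < before)
    worsened = sum(1 for before, after in changes if after > before)
    return improved, worsened

for factor, changes in RANK_CHANGES.items():
    improved, worsened = summarize(changes)
    print(f"{factor}: {improved} of {len(changes)} sites improve, {worsened} worsen")
```

For these two factors, seven of the eight quoted sites move to a better rank and one MEF11 site (295 to 331) moves to a worse one, in line with the statement that inclusion generally, though not always, improves the ranking.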