Anna A Shalybkova1, Darya S Mikhailova2, Ivan V Kulakovskiy3, Liliia I Fakhranurova4, Eugene F Baulin5. 1. Lomonosov Moscow State University. 2. Moscow Institute of Physics and Technology. 3. Engelhardt Institute of Molecular Biology, Russian Academy of Sciences; Vavilov Institute of General Genetics, Russian Academy of Sciences; Institute of Protein Research, Russian Academy of Sciences. 4. Institute of Theoretical and Experimental Biophysics, Russian Academy of Sciences; Shemiakin and Ovchinnikov Institute of Bioorganic Chemistry of the Russian Academy of Sciences; 5. Institute of Mathematical Problems of Biology RAS; Moscow Institute of Physics and Technology baulin@lpm.org.ru.
Abstract
Non-coding RNAs play a crucial role in various cellular processes in living organisms, and RNA functions heavily depend on molecule structures composed of stems, loops, and various tertiary motifs. Among those, the most frequent are A-minor interactions, which are often involved in the formation of more complex motifs such as kink-turns and pseudoknots. We present a novel classification of A-minors in terms of RNA secondary structure where each nucleotide of an A-minor is attributed to the stem or loop, and each pair of nucleotides is attributed to their relative position within the secondary structure. By analyzing classes of A-minors in known RNA structures, we found that the largest classes are mostly homogeneous and preferably localize with known A-minor co-motifs, e.g. tetraloop-tetraloop receptor and coaxial stacking. Detailed analysis of local A-minors within internal loops revealed a novel recurrent RNA tertiary motif, the across-bulged motif. Interestingly, the motif resembles the previously known GAAA/11nt motif but with the local adenines performing the role of the GAAA-tetraloop. By using machine learning, we show that particular classes of local A-minors can be predicted from sequence and secondary structure. The proposed classification is the first step toward automatic annotation of not only A-minors and their co-motifs but various types of RNA tertiary motifs as well. Published by Cold Spring Harbor Laboratory Press for the RNA Society.
The ubiquity and importance of noncoding RNAs in living organisms are now widely accepted (Costa 2010). Their functions include gene expression regulation (Hollands et al. 2012; Breaker 2018), RNA modification (Kiss 2001), intron splicing (Dvinge et al. 2019), and transposon control (Aravin et al. 2007). The spatial structure of noncoding RNAs significantly determines their functions (Montange and Batey 2008). It is known that the RNA structure has modular organization. It is composed of secondary structure elements (stems and loops) and their tertiary interactions forming the so-called RNA tertiary motifs, broadly defined as the structural “building blocks” that are often recurrent and hold their configuration in different structural environments (Leontis et al. 2006). In contrast to RNA secondary structure elements, the definition of a tertiary motif is not strict and covers diverse modules, for example, coaxial stacking (eight or more nucleotides forming two stacked stems [Walter et al. 1994]) or dinucleotide platforms (base pairs between consecutive nucleotides [Mládek et al. 2012]).
The A-minor interaction is one of the most abundant types of RNA tertiary interactions, with the total count comparable to that of noncanonical base pairs (Nissen et al. 2001). An A-minor involves the insertion of the sugar edges of adenines into the minor groove of Watson–Crick helices, preferentially at C–G base pairs, where the hydrogen bonds are formed between the adenine and the base pair (Nissen et al. 2001). Adenine can be replaced with other bases, but the two most common and most stable types of A-minors (Šponer et al. 2007) are highly specific for adenine bases (types I and II, see Nissen et al. 2001). A-minors have been found in various types of noncoding RNAs including ribosomal RNA, ribozymes, and riboswitches (Xin et al. 2008). 23S and 5S ribosomal RNA contain almost 200 A-minors (Nissen et al. 2001). Particularly, codon-anticodon helices are recognized by ribosomes through intermolecular A-minors, which led to the conclusion that “the ribosome is a ribozyme” (Lescoute and Westhof 2006a). A recent study (Torabi et al. 2021) reported a novel subclass of WC/H A-minors engaging Watson–Crick and/or Hoogsteen edges of an adenine instead of its sugar edge.
A-minors tend to form clusters. In Nissen et al. (2001), the authors describe a motif called A-patch that is formed by a stack of adenines involved in A-minors. In Hamdani and Firdaus-Raih (2019), such A-patches of two stacked adenines and two consecutive base pairs have been introduced as sextuples, a novel RNA tertiary motif composed of six bases interconnected with hydrogen bonds. Usually, “A-minor interaction” or just “A-minor” is used to refer to an individual nucleotide triple (one adenine and one base pair), and the term “A-minor motif” is used to describe a cluster of two or more A-minor interactions (Hendrix et al. 2005). However, this principle is not conventional, and the suffixes “interaction” and “motif” are often used interchangeably (see, e.g., Nissen et al. 2001; Sheth et al. 2013). Hereinafter, we will use the terms “A-minor interaction” and “A-minor” to refer to an individual nucleotide triple, and “A-minor motif” to refer to a maximum local group of stacked A-minors, that is, either to an isolated A-minor or to a cluster of two or more A-minors (see Materials and Methods for strict definitions).
A-minors play a crucial role in RNA structure stabilization and molecular recognition, often serving as components of more complex motifs. A-minors stabilize coaxial stacking in multiple junctions (Lescoute and Westhof 2006b; Geary et al. 2011) and are involved in ribose-zipper motifs (Tamura and Holbrook 2002). According to Xin et al. (2008), these two types of motifs are the most common co-motifs for A-minor interactions. The kink-turn motif, a kink in the phosphodiester backbone that causes a sharp turn in the RNA helix, is generally stabilized by at least one A-minor interaction (Klein et al. 2001; Rázga et al. 2005). The GNRA tetraloop-receptor interaction, probably the most studied RNA tertiary motif so far, also uses A-minors (Geary et al. 2008; Wu et al. 2012; Fiore and Nesbitt 2013). Another recurrent RNA motif, the UAA/GAN internal loop forming an interstrand adenine stack, binds long-range RNA regions via two or more A-minor interactions (Lee et al. 2006). ABAB-pseudoknots (also known as H-knots) are often stabilized by the triple helix that is usually formed by adenines of the 3′-closest loop and the minor groove of the 5′-closest stem (Aalberts and Hodas 2005). Thus, the A-minor is among the most important tertiary motifs of noncoding RNAs.
Still, the RNA secondary structure context of A-minors has been understudied. The only example is the analysis of ribosomal RNAs (Xin et al. 2008), where the authors showed that 67% of the adenines involved in A-minor interactions are located in single-stranded regions forming tertiary motifs in hairpins, internal, or junction loops.
In this work, we present a novel classification of A-minors in terms of RNA secondary structure. Each nucleotide of an A-minor was attributed to the corresponding stem or loop, and each pair of nucleotides was attributed to their relative position (from the same stem or loop, from an adjacent stem and loop, or from distant elements). The results of the classification of A-minors from known RNA structures suggest that the structural context of an A-minor mostly defines its co-motifs.
RESULTS
Analysis of A-minors from the representative set of RNA structures shows that 64% of them are clustered
To perform a comprehensive analysis of A-minors, we assembled a data set of 2431 A-minors annotated in the representative set of RNA spatial structures (Leontis and Zirbel 2012) from the Protein Data Bank (PDB [Berman et al. 2002]) (see “Annotation of A-minors in known RNA structures” in Materials and Methods).
The data set included 2431 A-minor interactions forming 1504 A-minor motifs, including 626 A-minor clusters. About 12% of all the A-minors were intermolecular, that is, formed by two or three different RNA chains. Intermolecular A-minors were found to be isolated more often than the intramolecular ones (52% and 34%, respectively, see Fig. 1). A-minors of type II were isolated in only 21% of cases, whereas types I and X were isolated in a significantly larger number of cases, 37% and 41%, respectively. Clustered A-minors made up 64% of the total (1553 out of 2431), and only 878 A-minors (36%) were isolated. However, it can be seen (Fig. 1) that A-minor clusters themselves made up only 42% of all the A-minor motifs.
FIGURE 1.
Distribution of clustered A-minor interactions by geometric and molecular types. Shares of clustered A-minor motifs are shown in the bottom of the figure. The absolute numbers of the interactions and motifs are shown in gray. Shares of clustered interactions and clusters are shown in orange. The x-axis is in log-scale. An A-minor motif is counted as inter/intramolecular if it contains at least one inter/intramolecular A-minor interaction; therefore the sum of intramolecular and intermolecular A-minor motifs is higher than the total number of motifs.
As shown in Figure 2, the majority of all A-minor motifs (878 out of 1504) are isolated A-minor interactions. The next biggest class contains 389 A-minor clusters of size (2, 2), that is, clusters made of two adenines and two base pairs. A small number of clusters having more base pairs than adenines (19 cases) represents uncertain cases with some adenines located between two consecutive base pairs. The opposite case of more adenines than base pairs is quite common and is represented by 155 motifs, and the largest motifs include up to 5 base pairs and seven adenines.
FIGURE 2.
A-minor motifs rarely involve more than three adenines. Circle areas are proportional to the number of A-minor motifs.
Thus, the majority of A-minors form clusters, but there are many isolated A-minors as well, especially among the intermolecular interactions.
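The size notation used above, (number of adenines, number of base pairs), can be illustrated with a minimal Python sketch. The `AMinor` tuple, its field names, and the residue identifiers in the example are hypothetical and serve only as an illustration; they are not the data structures used in this study. A motif is assumed to be given as a list of nucleotide triples.

```python
from typing import List, NamedTuple, Tuple


class AMinor(NamedTuple):
    """One A-minor interaction: an adenine plus the two bases of a base pair.

    Field names and residue identifiers are illustrative assumptions.
    """
    adenine: str
    left: str
    right: str


def motif_size(motif: List[AMinor]) -> Tuple[int, int]:
    """Return (number of distinct adenines, number of distinct base pairs)."""
    adenines = {m.adenine for m in motif}
    base_pairs = {(m.left, m.right) for m in motif}
    return len(adenines), len(base_pairs)


def is_isolated(motif: List[AMinor]) -> bool:
    """A motif consisting of a single interaction is an isolated A-minor."""
    return len(motif) == 1


# Made-up example: two adenines contacting two different base pairs.
cluster = [
    AMinor("A10", "C3", "G20"),
    AMinor("A11", "G2", "C21"),
]
print(motif_size(cluster))   # size of the cluster as (adenines, base pairs)
print(is_isolated(cluster))  # a two-interaction motif is a cluster, not isolated
```

Counting distinct adenines and distinct base pairs separately is what lets a motif have unequal numbers of each, as in the clusters with more adenines than base pairs described above.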
Largest structural classes of A-minors are in a good match with their known co-motifs
To identify the structural environment of A-minors, we introduced a new classification of A-minors in terms of their secondary structure context (see “Classification of A-minors” in Materials and Methods) and applied it to 2431 A-minors of our primary data set.
All the 2431 A-minors belong to 99 different structural classes according to the proposed classification. Seventy-three classes are represented by at least two A-minors and only 10 classes are represented by at least 50 cases. A total of 2110 A-minors (87%) involve a base pair belonging to some stem and either an adjacent adenine (35%, classes of form L-S-S-LC-LC-SM, i.e., the base pair belongs to a stem and the adenine belongs to an adjacent loop L of any type) or a distant adenine (52%, classes of form L-S-S-LR-LR-SM, i.e., the base pair belongs to a stem and the adenine belongs to a distant loop L of any type). The distribution of the classes is very diverse without a clearly dominant class: The largest class HC-S-S-LR-LR-SM makes up only 22.5%, and the 10 most frequent classes make up 73.8% of the total number of A-minor interactions.
The 10 most frequent classes were analyzed with respect to their tendency to be clustered (see Fig. 3). The share of clustered A-minors varies significantly among the classes, ranging from 42.6% for adjacent bulged adenines (class BC-S-S-LC-LC-SM) to 92.6% for distant adenines from pseudoknotted internal loops (class IP-S-S-LR-LR-SM). We also considered A-minor motifs that contain A-minors of the most frequent classes. The share of clusters among them was also notably different, ranging from about 35% to almost 84%.
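To make the class labels easier to read, the following Python sketch splits a label such as HC-S-S-LR-LR-SM into six fields. The field order assumed here (element of the adenine, elements of the two base-pair nucleotides, then the adenine/left, adenine/right, and left/right relations) is our reading of the examples in the text and should be checked against “Classification of A-minors” in Materials and Methods; the function and field names are hypothetical.

```python
from typing import NamedTuple

# Relation codes as they are used in the text: same element, local
# (adjacent elements), and long-range (distant elements).
RELATIONS = {"SM": "same element", "LC": "local", "LR": "long-range"}


class AMinorClass(NamedTuple):
    adenine_element: str   # e.g. "HC", "IC", "JC", "BC", "HP", "IP"
    left_element: str      # element of the 5'-side base of the pair, e.g. "S"
    right_element: str     # element of the 3'-side base of the pair
    adenine_left: str      # relation between the adenine and the left base
    adenine_right: str     # relation between the adenine and the right base
    left_right: str        # relation between the two bases of the pair


def parse_class(label: str) -> AMinorClass:
    """Split a six-field class label; the field order is an assumption."""
    fields = label.split("-")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {label!r}")
    for relation in fields[3:]:
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation code {relation!r} in {label!r}")
    return AMinorClass(*fields)


def is_local(cls: AMinorClass) -> bool:
    """Adenine is adjacent to both nucleotides of the base pair."""
    return cls.adenine_left == "LC" and cls.adenine_right == "LC"


def is_long_range(cls: AMinorClass) -> bool:
    """Adenine is distant from both nucleotides of the base pair."""
    return cls.adenine_left == "LR" and cls.adenine_right == "LR"


largest = parse_class("HC-S-S-LR-LR-SM")
print(largest.adenine_element, is_long_range(largest))
```

Under this reading, the two dominant families of classes correspond to `is_local` (classes of form L-S-S-LC-LC-SM) and `is_long_range` (classes of form L-S-S-LR-LR-SM).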
FIGURE 3.
The share of clustered A-minors significantly varies among the 10 most frequent structural classes. The absolute numbers of the interactions and motifs are shown in gray. Shares of clustered interactions and clusters are shown in orange. X-axis is in log-scale. (*) Counting all motifs containing at least one interaction of the given class. Therefore the sum of the motif numbers is higher than the total number of motifs.
Out of the 10 most frequent A-minor classes, only five of the largest classes frequently occur in nonribosomal noncoding RNAs. The classes are in good concordance with known A-minor co-motifs (see Table 1 and the next three paragraphs for details).
TABLE 1.
Most common A-minor motifs
In particular, 68% of A-minors of the class HC-S-S-LR-LR-SM stabilize GAAA-11nt, GNRA-like/minor-groove, and other tetraloop-receptor motifs. Fifty-two percent of A-minors of this class involve an adenine within a GNRA-sequence.
Intramolecular IC-S-S-LR-LR-SM A-minor interactions have been found within UAA/GAN and UAA/GAN-like internal loops with cross-strand stacked adenines. The class also includes intermolecular A-minors between SSU RNA and a codon-anticodon helix with adenines being looped out and not forming cross-strand stacks.
IC-S-S-LC-LC-SM A-minors are divided almost equally into two subgroups: kink-turn stabilization and the formation of a new motif, the across-bulged motif, that mimics tetraloop-receptor interaction (discussed below). A-minors of classes JC-S-S-LC-LC-SM and HP-S-S-LC-LC-SM are in perfect correspondence with coaxial stacking stabilization and pseudoknot stabilization, respectively.
Thus, the most common structural classes according to the proposed classification reflect well the widely known co-motifs of A-minors. We suggest that the classification can be successfully used for automatic annotations as it significantly narrows down the possible A-minor co-motifs.
The reference type I A-minors confirm general composition of A-minor classes
To verify the results, we replicated the analysis using a smaller set of 103 canonical A-minors of type I present both in our primary data set and in the CaRNAval database (Reinharz et al. 2018) (see “Reference set of A-minors” in Materials and Methods).
Out of 103 A-minors, 67 (65%) are L-S-S-LR-LR-SM (long-range A-minors) and 35 (34%) are L-S-S-LC-LC-SM (local A-minors), with the single exception of a loop-loop HC-JC-JC-LR-LR-SM A-minor. Of note, according to the CaRNAval database, as many as 15 canonical A-minors belong to the loop-loop class, but 14 of them actually include a base pair from a stem that had been discarded during the pseudoknot removal stage of the CaRNAval pipeline.
The overall share of clustered A-minors in the set of 103 type I A-minors is 74% (76 A-minors). The three largest structural classes are HC-S-S-LR-LR-SM (34 A-minors, 65% of them are clustered), IC-S-S-LR-LR-SM (21 A-minors, all of them are clustered), and JC-S-S-LC-LC-SM (10 A-minors, 70% of them are clustered). The three classes together comprise 63% of the subset.
All in all, the reference set of 103 type I A-minors confirmed the main observations made on our primary data set of 2431 A-minors.
Analysis of IC-S-S-LC-LC-SM A-minors reveals a new recurrent motif
Analysis of IC-S-S-LC-LC-SM A-minor interactions (147 A-minors forming 100 A-minor motifs) revealed that along with 66 A-minors (56 motifs) involved in kink-turn stabilization, there is a subgroup of 68 A-minors (34 motifs) involved in a previously undescribed but recurrent motif (Supplemental Table S1). We named it the “across-bulged” motif as the thread opposite to the A-minor adenines contains bulged-out bases. All 34 such cases were found to be homologs of seven unique motifs, five of which belong to ribosomal RNAs, one is found in a riboswitch, and another one in a single-guide RNA (Table 2). The structure of the ribosome from Spinacia oleracea (PDB entry 6ERI) was chosen to illustrate the ribosomal motifs as it was the only structure containing instances of all five unique ribosomal motifs.
TABLE 2.
Representatives of across-bulged motifs. Bulged bases and bases involved in A-minor interactions are shown in capital letters
We consider a classical internal loop (IC) to be a canonical across-bulged motif when (i) one of its threads contains at least one adenine that forms an A-minor with the preceding stem; (ii) the flanking residues of the adenine's thread form a base triple either with the 5′-end or the 3′-end base of the opposite thread; (iii) at least one of the opposite thread's bases is bulged out.
The number of adenines involved in A-minors of across-bulged motifs is one or two among different organisms. The bulged bases take part in the cross-strand base stacking or form A-minors, G-minors, and other N-minors, base-phosphate interactions (Zirbel et al. 2009), or RNA–protein interactions. Usually, the across-bulged motifs also include a base-triple (see Fig. 4A), but in motifs I and II one base of the base triple is missing leaving it with a base pair instead (Table 2).
FIGURE 4.
3D structure and RNA secondary structure of two across-bulged motifs and a GAAA-11nt motif. (A) Motif VI, a representative case from 6ERI PDB-entry (see Table 2). (B) Motif VIII, a representative case from 6ERI PDB-entry (see Table 2). (C) GAAA-11nt motif from 2R8S PDB-entry, chain R, involving A-minor A152|C223–G250. Base pairs of stems and adenines of 11 nt-loop are shown in gray. Base-triples are shown in yellow. A-minor adenines are shown in blue. Bulged bases are shown in purple.
We assumed that there could be across-bulged motifs without A-minors and inspected internal loops of similar sizes (4–2 and 4–3) within the 6ERI PDB entry. Indeed, there was a motif with two pyrimidines instead of adenines (motif VIII, see Table 2 and Fig. 4B).
The spatial structure of the across-bulged motif was found to resemble the structure of the well-known GAAA-11nt motif (see Fig. 4C). In the case of the GAAA-11nt motif, the A-minors are formed with a GAAA-tetraloop, but within the across-bulged motif, the A-minors are formed with the local adenines of the internal loop.
A-patch is the primary architecture of A-minor clusters
We examined all 389 A-minor clusters of two adenines and two base pairs (i.e., of size [2, 2]; see “A-minor interaction and A-minor motif definitions” in Materials and Methods). The majority (94%, 366 cases) of such clusters are of A-patch architecture (Nissen et al. 2001), that is, are formed by a stack of adenines involved in A-minors with stacked base pairs. In 304 cases (83%), an A-patch includes a stack of consecutive adenines (see Fig. 5A), and in 62 cases (17%), it is formed by a cross-strand adenine stack (see Fig. 5B).
FIGURE 5.
Different A-patch architectures of size (2,2). (A) A-patch formed by a stack of consecutive adenines, SSU rRNA (PDB ID: 6QZP, chain S2, A-minors: A996|C674-[A2M]1031, A997|G673–C1032) (B) A-patch formed by a cross-strand adenine stack, LSU rRNA (PDB ID: 5TBW, chain 1, A-minors: A2696|C2630–G2648, A2758|U2629–A2649).
Both architectures include the same top three structural classes (HC-S-S-LR-LR-SM, IC-S-S-LR-LR-SM, and JC-S-S-LC-LC-SM), but in different proportions: 32%–15%, 15%–29%, and 12%–18%, respectively. A total of 271 out of 304 A-patches with consecutive adenine stack include base pairs consecutive within a stem, 22 cases include nonconsecutive stacked base pairs, and in 11 cases adenines of an A-patch are not consecutive but 1 nt apart from each other in sequence. Thirty-eight out of 62 A-patches with a cross-strand adenine stack are formed by adenines from the same loop, and 24 cases include adenines from two distant RNA secondary structure elements.
Out of 177 A-minor clusters of larger sizes, 23 (13%) are formed by a single stretch of consecutive stacked adenines, 103 (58%) include cross-strand stacking of adenines, and the remaining 51 clusters (29%) are not of A-patch architecture.
The results show that the A-patch is the primary architecture of A-minor clusters. It is also worth noting that a minor but noticeable part of A-patches contains a cross-strand adenine stack that we believe allows the A-minor cluster to achieve greater stability.
Particular classes of local A-minor motifs can be predicted with machine learning
We applied the proposed classification of A-minors to assess if A-minor motifs can be predicted from the RNA sequence and secondary structure.
We formulated the problem of computational prediction of A-minors as a binary classification problem with a (stem, A-stretch) pair of a stem and a stretch of unpaired adenines (i.e., free of stem-forming base pairs) as an object of classification. The pairs that form A-minor interactions have been named A-stems and treated as positives. We trained a random forest classifier and, considering positive:negative label imbalance of 1:10 to 1:200 depending on the considered A-minor classes, used the area under the precision-recall curve (AUPRC) as the primary quality estimate (see “Machine learning framework” in Materials and Methods).
The cross-validation on the entire data set of (stem, A-stretch) pairs showed very low overall AUPRC (below 0.1). However, we identified two types of local A-stems that can be predicted with notably higher quality (Fig. 6). First, there were HP-LC A-stems, that is, A-minor interactions between a pseudoknotted hairpin and an adjacent stem, which demonstrated the mean AUPRC of 0.73 (st. dev. 0.17) in fourfold cross-validation. Second, there were IP-LC A-stems, that is, A-minor interactions between a pseudoknotted internal loop and an adjacent stem, which demonstrated the mean AUPRC score of 0.43 (st. dev. 0.17). Other types of (stem, A-stretch) pairs did not achieve AUPRC scores higher than 0.2.
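To illustrate why AUPRC values should be read against the label imbalance, the sketch below computes average precision, one common estimator of the area under the precision-recall curve, in plain Python. It is not the evaluation code of this study: the function names and the toy labels and scores are hypothetical, and the estimator actually used may differ. The class sizes in `positive_rate` are those reported for the HP-LC and IP-LC data sets (Fig. 6). A classifier that assigns random scores is expected to reach an AUPRC close to the positive rate, which gives a rough baseline for the reported scores.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at the rank of every positive example.

    Examples are ranked by decreasing score; tied scores keep input order,
    a simplification that is acceptable for a sketch.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / n_pos


def positive_rate(n_pos: int, n_neg: int) -> float:
    """Share of positives, the approximate AUPRC of a random classifier."""
    return n_pos / (n_pos + n_neg)


# Toy example (made-up scores): positives are ranked first and third.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.1]
print(average_precision(toy_labels, toy_scores))

# Baselines from the class sizes given in the Figure 6 caption.
print(round(100 * positive_rate(21, 169), 2))  # HP-LC pairs
print(round(100 * positive_rate(11, 116), 2))  # IP-LC pairs
```

With positive rates of roughly 9% to 11%, the mean AUPRC values of 0.73 and 0.43 reported above lie well above what random scoring would be expected to give.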
FIGURE 6.
Precision-recall curves representing the fourfold cross-validation (cv) results on (A) the data set of HP-LC (stem, A-stretch) pairs (169 negatives and 21 positives, 11.05% positive rate, 0.7 AUPRC), and (B) the data set of IP-LC (stem, A-stretch) pairs (116 negatives and 11 positives, 8.66% positive rate, 0.35 AUPRC). The line of precision-recall break-even points is shown as blue dots.
Although HP-LC and IP-LC (stem, A-stretch) pairs constitute a limited data set (21 HP-LC A-stems and 12 IP-LC A-stems with positive:negative label imbalance of 1:10; see Fig. 6 caption), the results were stable in terms of cross-validation and were obtained with only five features, which were independently selected on HP-LC data when predicting IP-LC A-stems and vice versa. The five features used to predict HP-LC A-stems were the numbers of (i) nucleotides, (ii) adenines, and (iii) cytosines between the wings of the stem, (iv) the number of nucleotides between the adenines and the left wing of the stem, and (v) the Boolean value reflecting whether the thread that follows the left wing belongs to a pseudoknotted hairpin. The five features used to predict IP-LC A-stems were (i) the Boolean value reflecting whether the wing preceding the adenines is the right wing of the stem, (ii) the Boolean value reflecting whether the wing that follows the adenines is the wing that follows the right wing of the stem, (iii) the number of right wings between the adenines and the right wing of the stem, (iv) the Boolean value reflecting whether the thread that precedes the adenines is the thread that precedes the right wing of the stem, and (v) the Boolean value reflecting whether the adenines’ thread is the thread that follows the right wing of the stem (see also Supplemental Table S5).
Thus, we conclude that the proposed classification can be successfully used to predict particular types of local A-minor motifs.
DISCUSSION
In this work, we proposed strict definitions of A-minor motifs and A-minor clusters and used them on top of the DSSR annotation of A-minor interactions to identify the motifs in experimentally determined RNA spatial structures of the representative set with 3.0 Å resolution cutoff. More than 60% of A-minors were found to be located in A-minor clusters. Nearly 90% of the clusters were found to adopt the well-known A-patch architectures.
We proposed the novel classification of tertiary motifs in terms of RNA secondary structure and applied it to analyze A-minors in known RNA structures. We found that the most frequent classes of A-minors correspond well to their most known co-motifs, such as tetraloop-receptor motifs and coaxial stacking. Such correspondence could be used to improve the automatic annotation of A-minor co-motifs using structural context. It should also be noted that the proposed classification is not limited to A-minors and can be applied to a wide range of RNA tertiary motifs.
Detailed annotation of IC-S-S-LC-LC-SM A-minors revealed a novel recurrent motif of RNA tertiary structure, the across-bulged motif, that contains bulged bases in the strand of the internal loop opposite to the A-minor adenines. The spatial structure of the across-bulged motif resembles the structure of the well-known GAAA-11nt motif.
In a number of instances of the across-bulged motif, the bulged base also interacts with the minor groove of another base pair, forming G-minor or A-minor interactions. We also found a case of the across-bulged motif with two pyrimidines instead of two adenines that form U-minor and C-minor interactions. Thus, although A-minor motifs prefer adenines, other bases can form analogous motifs. In the analysis of A-minor clusters, we found A-patches of size (2, 2) that actually included another base stacked between the two adenines (see Fig. 7). These findings suggest the need for annotation of other N-minors along with A-minors by the commonly used annotation software like the DSSR program used in this study. The pipeline in Reinharz et al. (2018) is a good example of such annotation of A-minors not restricted to adenine bases. The formation of N-minors should also be considered in evolutionary analyses dealing with point mutations.
FIGURE 7.
AUA A-patch from LSU rRNA, PDB ID 5TBW, chain 1. A-minors: A3106|C2893–G2908 (in orange), A3129|A2892–U2909 (in gray). (A) Top view. (B) Side view. Uracil U3105 stacked between two adenine bases is shown in purple.
With a machine learning framework, we showed that HP-LC A-stems can be predicted in silico from the RNA sequence and secondary structure with an acceptable quality (0.73 AUPRC). Other than HP-LC A-stems, only IP-LC A-stems allowed reaching AUPRC over 0.2. We consider the following explanation. In the case of HC-LC, IC-LC, and other local (stem, A-stretch) pairs with an A-stretch belonging to a classical loop, the information of relative features is very limited and does not describe the environment outside the loop and its adjacent stems. In the case of HC-LR, IC-LR, and other long-range (stem, A-stretch) pairs, there is an opposite effect: The relative features cover a large amount of sequence and secondary structure volume which is irrelevant for a given pair. However, in the case of local (stem, A-stretch) pairs with an A-stretch from a pseudoknotted loop, the features describing relative distances reflect the relevant local context that allows distinguishing an A-stem from a noninteracting (stem, A-stretch) pair.
Conclusion
In this work, we proposed a novel classification to describe A-minors in terms of RNA secondary structure. The classification was applied to A-minors annotated in the known RNA 3D-structures. The data set consisted of more than 2400 interactions forming more than 1500 motifs. The majority of A-minors formed clusters of the typical size of two to three interactions. The analysis of the largest annotated classes showed that they are highly homogeneous and in good agreement with the known co-motifs of A-minors. We also showed that the local A-minors from internal loops not only stabilize kink-turn motifs but also form a novel recurrent RNA tertiary motif, the across-bulged motif. The across-bulged motif mimics the well-known GAAA-11nt motif but with local adenines forming A-minors. Using a machine learning framework, we showed that the particular local classes of A-minors with adenines from pseudoknotted loops can be predicted using the sequence and secondary structure information. Thus, we show that the proposed classification can be successfully used both to automatically annotate co-motifs of A-minor motifs by their structural context and to predict A-minors of particular local classes.
MATERIALS AND METHODS
A-minor interaction and A-minor motif definitions
Our definition of the A-minor interaction follows the one from the DSSR program (Lu et al. 2015), which is widely used for RNA motif annotation.
The A-minor interaction (A-minor) is the nucleotide triple of an adenine and a cWW base pair (cis-Watson–Crick/Watson–Crick base pair according to the Leontis–Westhof classification [Leontis and Westhof 2001]), where the adenine faces the minor groove of the base pair and forms H-bonds. According to the DSSR manual, in canonical A-minors (types I and II), the adenine has its minor groove edge facing the minor groove of a base pair, and the O2′ atom of adenine is involved in H-bonds with the pair; in the miscellaneous type X (eXtended), the adenine uses its Watson–Crick edge or major-groove edge to interact with the minor groove of a base pair, without resorting to the O2′ atom (see page 38 at http://docs.x3dna.org/dssr-manual.pdf). Furthermore, in type I A-minors both the O2′ and the N3 of the adenine are inside the minor groove of the base pair (i.e., lie in between O2′ atoms of the base pair); and in the type II version, the O2′ of the adenine is outside the near strand O2′ atom whereas the N3 of the adenine is inside (Nissen et al. 2001).
The adenine of an A-minor is referred to as “A,” the nucleotides of the base pair are referred to as “L” (if located closer to the 5′-end of the RNA chain) and “R” (if located closer to the 3′-end of the RNA chain). An A-minor interaction is called intramolecular if all three participating nucleotides belong to the same RNA chain and intermolecular otherwise. If “L” and “R” nucleotides belong to different RNA chains, their assignment order is determined by the lexical order of their RNA chain identifiers.
For each entry from the Protein Data Bank (PDB [Berman et al. 2002]), we constructed an undirected graph $G = (V, E)$, where $V = \{v_i = (A_i, L_i, R_i)\}$ is the set of A-minor interactions annotated with DSSR, and $E = \{e_{ij} = (v_i, v_j)\}$ is the set of edges between them. $(v_i, v_j) \in E$ if there are either $N_i$ and $N_j$ that are the same nucleotide or $N_i$ and $N_j$ that are stacked, where $N_i \in \{A_i, L_i, R_i\}$ and $N_j \in \{A_j, L_j, R_j\}$. A connected component within the graph $G$ is called the A-minor motif. The A-minor motif is called the A-minor cluster if it involves at least two different adenines, $A_i$ and $A_j$, or two different base pairs, $(L_i, R_i)$ and $(L_j, R_j)$. The size of an A-minor motif is defined by a pair of numbers: the number of adenines and the number of base pairs. For example, an A-minor motif of size (3, 2) involves three adenines and two base pairs. We call the A-minor interaction clustered if it belongs to an A-minor cluster.
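The graph construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (their annotation relies on DSSR and the urslib library); the function names, the nucleotide identifiers, and the representation of stacking as a set of unordered pairs are assumptions made only for this example.

```python
from itertools import combinations

def connected(v, w, stacked):
    """Edge rule: two A-minors share a nucleotide or contain two stacked nucleotides."""
    if set(v) & set(w):
        return True
    return any(frozenset((n, m)) in stacked for n in v for m in w)

def aminor_motifs(aminors, stacked):
    """Return A-minor motifs, i.e., connected components of (A, L, R) triples."""
    adjacency = {v: set() for v in aminors}
    for v, w in combinations(aminors, 2):
        if connected(v, w, stacked):
            adjacency[v].add(w)
            adjacency[w].add(v)
    seen, motifs = set(), []
    for start in aminors:
        if start in seen:
            continue
        seen.add(start)
        component, stack = [], [start]
        while stack:  # depth-first traversal of one connected component
            v = stack.pop()
            component.append(v)
            for w in adjacency[v] - seen:
                seen.add(w)
                stack.append(w)
        motifs.append(component)
    return motifs

def motif_size(motif):
    """Size of a motif: (number of distinct adenines, number of distinct base pairs)."""
    adenines = {a for a, _, _ in motif}
    base_pairs = {(l, r) for _, l, r in motif}
    return len(adenines), len(base_pairs)

def is_cluster(motif):
    """A motif is a cluster if it has at least two adenines or at least two base pairs."""
    n_adenines, n_base_pairs = motif_size(motif)
    return n_adenines >= 2 or n_base_pairs >= 2

# Hypothetical nucleotide identifiers, for illustration only.
aminors = [
    ("A10", "G2", "C20"),
    ("A11", "G2", "C20"),   # shares the base pair with the first A-minor
    ("A30", "C3", "G19"),   # linked to the others through G2/C3 stacking
    ("A50", "U40", "A45"),  # isolated A-minor
]
stacked = {frozenset(("G2", "C3"))}

motifs = aminor_motifs(aminors, stacked)
for motif in motifs:
    print(motif_size(motif), "cluster" if is_cluster(motif) else "single A-minor")
```

In this toy input the first three interactions form one motif of size (3, 2), which is a cluster, and the last interaction forms a motif of size (1, 1).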
Classification of A-minors
To describe the RNA secondary structure, we used the generalization of the nearest neighbor model (NNM [Mathews et al. 1999]) proposed in Baulin et al. (2016). This approach allows a uniform description of arbitrary RNA secondary structures including pseudoknotted structures of any complexity, thus ensuring that each nucleotide is associated with at least one secondary structure element. The following additional definitions are required for the A-minor classification (the complete set of strict definitions is provided in Supplemental Text S1 and at http://urs.lpm.org.ru/struct.py?where=3#def).
A stem is a sequence of at least two consecutive Watson–Crick or Wobble base pairs. The two strands of a stem are called its left wing and right wing. A loop is a set of threads (regions that are free of stem-forming base pairs) confined by a stem. Each loop is assigned one of the following common types: hairpin (H), bulge (B), internal loop (I), or multiple junction (J). In addition, each loop is classified with regard to pseudoknots: a loop is called pseudoknotted (P) if it is involved in a pseudoknot, isolated (I) if it is adjacent to a pseudoknot, and classical (C) otherwise (see Fig. 8). Pseudoknotted loops may contain both threads and stem wings.
FIGURE 8.
Pseudoknot-related classes of internal loops. (A) An internal loop in the absence of pseudoknots is called Classical. (B) An internal loop adjacent to a pseudoknot is called Isolated. (C) An internal loop involved in a pseudoknot is called Pseudoknotted. The graph has been prepared using R-chie (Lai et al. 2012) and forna (Kerpedjiev et al. 2015).
Each nucleotide is assigned either a stem (S) or a set of loops (a concatenation of the loops’ letter pairs of pattern [HIBJ][ICP], in alphabetical order).
All A-minors were classified with respect to the RNA secondary structure elements involving their nucleotides. Each nucleotide of an A-minor was attributed to the corresponding stem or loop(s), and each pair of nucleotides was attributed to their relative position: within the same stem or the same loop (same element, SM), from an adjacent stem and loop (local, LC), or from distant elements (long-range, LR). Thus, each A-minor belongs to an A-B-C-AB-AC-BC structural class, where A-B-C are the secondary structure elements of the A, L, and R nucleotides, respectively, and AB-AC-BC are the relations of the AL, AR, and LR nucleotide pairs.
An example of an A-minor from a lysine riboswitch (PDB code 3D0U, chain A) is presented in Figure 9. Here the adenine A124 belongs to a classical hairpin (HC) adjacent to stem 8. A20 of the noncanonical base pair (A20, G66) belongs to the classical internal loop (IC) confined by stem 2. G66 of the base pair belongs to the same loop and also belongs to a pseudoknotted hairpin (HP) of stem 4 and is therefore assigned HPIC. As A20 and G66 share a loop, the nucleotide pair A20–G66 is annotated with the relative position SM (from the same element). Pairs A124–A20 and A124–G66 are annotated with LR as the nucleotides are distant from each other within the secondary structure. Overall, the A-minor is classified as having type HC-IC-HPIC-LR-LR-SM.
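How a class label is assembled from the six annotations can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch assumes that the loop codes of each nucleotide and the three pairwise relations have already been derived from the secondary structure; it reproduces only the labeling step, using the lysine riboswitch example described in the text.

```python
import re

LOOP_CODE = re.compile(r"[HIBJ][ICP]")  # loop type followed by pseudoknot class
RELATIONS = {"SM", "LC", "LR"}          # same element, local, long-range

def element_label(loop_codes=()):
    """Return 'S' for a stem nucleotide, else its loop codes joined alphabetically.

    Assumption of this sketch: an empty collection of loop codes means
    that the nucleotide belongs to a stem.
    """
    codes = sorted(set(loop_codes))
    if not codes:
        return "S"
    for code in codes:
        if not LOOP_CODE.fullmatch(code):
            raise ValueError(f"unexpected loop code: {code!r}")
    return "".join(codes)

def aminor_class(a, l, r, al, ar, lr):
    """Join the element labels of A, L, R and the relations AL, AR, LR."""
    for relation in (al, ar, lr):
        if relation not in RELATIONS:
            raise ValueError(f"unexpected relation: {relation!r}")
    return "-".join((a, l, r, al, ar, lr))

# A124 | A20-G66 from PDB entry 3D0U, chain A, as described in the text.
label = aminor_class(
    element_label(["HC"]),        # A124: classical hairpin
    element_label(["IC"]),        # A20: classical internal loop
    element_label(["IC", "HP"]),  # G66: same loop plus a pseudoknotted hairpin
    "LR", "LR", "SM",
)
print(label)  # HC-IC-HPIC-LR-LR-SM
```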
FIGURE 9.
The secondary structure of the lysine riboswitch from PDB entry 3D0U. The structure is visualized with VARNA (Darty et al. 2009). Loops are assigned with their types, classes, and confining stems. A-minor A124–A20–G66 of type X is emphasized on the structure and presented separately with the 3D structure of its nucleotides in red, green, and blue. Each nucleotide of the A-minor is annotated with the element of RNA secondary structure. Each nucleotide pair is annotated with their relative positions within the secondary structure.
Of note, the idea of the proposed classification can be used for other RNA motifs and interactions. For example, noncanonical base pairs can be assigned an A-B-AB class. For motifs with a larger number of bases, a more coarse-grained system can be applied, for instance, by switching the focus from nucleotides to threads and wings.
Annotation of A-minors in known RNA structures
A total of 1074 RNA-containing PDB entries from the representative set of RNA structures (version 3.76 with the 3.0 Å resolution cutoff [Leontis and Zirbel 2012]) were selected for the analysis. To annotate A-minor interactions, the DSSR program (version v1.8.5-2018nov29 [Lu et al. 2015]) was used. A-minor motifs were annotated using the python library from the URSDB (https://github.com/febos/urslib). The resulting data set included 2431 A-minors composing 1504 A-minor motifs (see Supplemental Tables S2, S3).
Each A-minor interaction was annotated with the features related to its geometric parameters, involved H-bonds, the local context of RNA secondary structure including annotations of RNA tetraloop sequences (Klosterman et al. 2004), and the size of the corresponding A-minor motif (see Supplemental Table S2 for the detailed description of all features). Edges within A-minor clusters were annotated with the features of the involved A-minor interactions and base stacking interactions between them along with the edge description in the form of $N_i d N_j$ relationships, where $N_i$ is A, L, or R of the corresponding A-minor $v_i$ and $d$ is one of “e” (equality), “n” (consecution), “s” (stacking), or “ns” (stacking and consecution). For example, the description “AsA_LeL_ReR” depicts an edge between two A-minors made of the same base pair and nonconsecutive stacked adenines (see Supplemental Table S3 for the detailed description of all features).
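As an illustration of the edge description format, the following Python sketch decodes a string such as “AsA_LeL_ReR” into readable relationships. The function name is hypothetical, and the assumed layout (tokens separated by underscores, each made of a nucleotide role, a relation code, and a second nucleotide role) is inferred from the single example given in the text rather than from a format specification.

```python
RELATION_NAMES = {
    "e": "equality",
    "n": "consecution",
    "s": "stacking",
    "ns": "stacking and consecution",
}
ROLES = {"A", "L", "R"}

def parse_edge_description(description):
    """Split an edge description into (role_i, relation, role_j) triples."""
    parsed = []
    for token in description.split("_"):
        if len(token) < 3:
            raise ValueError(f"token too short: {token!r}")
        first, code, second = token[0], token[1:-1], token[-1]
        if first not in ROLES or second not in ROLES or code not in RELATION_NAMES:
            raise ValueError(f"unexpected token: {token!r}")
        parsed.append((first, RELATION_NAMES[code], second))
    return parsed

edge = parse_edge_description("AsA_LeL_ReR")
for first, relation, second in edge:
    print(f"{first} of the first A-minor / {second} of the second A-minor: {relation}")
```

For the example from the text, the adenines are related by stacking while the L and R nucleotides are identical, which matches the description of two A-minors that share a base pair and have stacked adenines.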
Reference set of A-minors
To validate the DSSR annotation, we compared the primary data set of 2431 A-minors with the A-minors provided in the CaRNAval database (Reinharz et al. 2018). Since the CaRNAval database contains only motifs that include at least two base pairs between two different secondary structure elements, only A-minors of type I are present among three-nucleotide motifs (RIN#2, 194 occurrences, see http://carnaval.lri.fr/all_HEADERS/info_cluster_2.html). These 194 motifs occur in 37 RNA chains, 21 of which are also included in our primary data set. These RNA chains include 103 A-minors from the CaRNAval database, and all of them are also present in our data set (see Supplemental Table S2). In total, for those 21 RNA chains our data set includes 240, 119, and 210 intramolecular A-minors of type I, type II, and type X, respectively.
Such a significant difference between the data sets (more than twofold in the case of type I A-minors) arises from the following issues. First, the CaRNAval pipeline is more rigorous with respect to base pairs, that is, it requires the adenine to form two strict sugar-edge/sugar-edge base pairs (Leontis and Westhof 2001) for an A-minor to be annotated, whereas the A-minor definition used in DSSR does not require the adenine to form base pairs. Indeed, out of the 2431 A-minors of our full data set, DSSR annotated no base pairs formed by the adenine in 683 cases, only one base pair in 1609 cases (including base pairs of intermediate Leontis–Westhof types), and two base pairs in 139 cases. Thus, the DSSR definition of an A-minor cannot be reduced to a pure base triple, that is, three bases interconnected by a set of base pairs, and following the DSSR definition avoids discarding numerous intermediate cases. Second, unlike CaRNAval, the DSSR annotation includes intermolecular A-minors and A-minors containing modified nucleotides and nucleotides with missing atoms. Third, the CaRNAval pipeline does not consider the A-minors formed within a secondary structure element (X-X-X-SM-SM-SM classes according to the proposed classification, where X is any stem or any set of loops).
Of note, unlike DSSR, the CaRNAval pipeline does not restrict an A-minor to be formed by an adenine base, but the only such annotated example of a type I A-minor formed by a guanine base does not belong to the considered set of 21 RNA chains.
Machine learning framework
We formulate the task of computational prediction of A-minors as a binary classification problem. To choose an object of classification we examined the annotated A-minors from the known RNA structures. First, DSSR annotates a considerable number of intermediate cases of A-minors, where, for example, an adenine is located evenly between two consecutive base pairs such that it is unclear with which particular base pair the adenine forms the A-minor interaction. Second, more than 60% of all annotated A-minors belong to A-minor clusters and nearly 90% of all A-minors include a base pair that belongs to some stem. Considering these facts, to avoid fitting the model to the technical features of the annotation rather than to principal features of A-minor interactions, we used a more coarse-grained approach and chose a (stem, A-stretch) pair of a stem and a stretch of unpaired adenines (i.e., adenines free of stem-forming base pairs) as the target object of the classification. Thus we considered all possible (stem, A-stretch) pairs and trained a model to predict if a particular (stem, A-stretch) pair forms an A-stem, that is, forms at least one A-minor interaction (see Fig. 10).
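The labeling of (stem, A-stretch) pairs can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes that an A-stretch is a maximal run of consecutive adenines free of stem-forming base pairs, that a pair counts as an A-stem when an annotated A-minor joins an adenine of the stretch with a base pair of the stem, and that nucleotides are addressed by 0-based sequence indices. The toy sequence and the single A-minor are hypothetical.

```python
def a_stretches(sequence, in_stem):
    """Find maximal runs of adenines that are free of stem-forming base pairs."""
    stretches, current = [], []
    for index, (base, paired) in enumerate(zip(sequence, in_stem)):
        if base == "A" and not paired:
            current.append(index)
        elif current:
            stretches.append(current)
            current = []
    if current:
        stretches.append(current)
    return stretches

def is_a_stem(stem_pairs, stretch, aminors):
    """Positive class: at least one A-minor links the stretch to the stem."""
    pairs = {frozenset(pair) for pair in stem_pairs}
    adenines = set(stretch)
    return any(a in adenines and frozenset((l, r)) in pairs for a, l, r in aminors)

# Toy hairpin: positions 0-1 pair with positions 6-5, the rest is unpaired.
sequence = "GGAAACCAAU"
stem_pairs = [(0, 6), (1, 5)]
paired_positions = {i for pair in stem_pairs for i in pair}
in_stem = [i in paired_positions for i in range(len(sequence))]

stretches = a_stretches(sequence, in_stem)
aminors = [(7, 1, 5)]  # hypothetical A-minor: adenine 7 with base pair (1, 5)

for stretch in stretches:
    label = "A-stem" if is_a_stem(stem_pairs, stretch, aminors) else "noninteracting"
    print(stretch, label)
```

Enumerating all stems against all A-stretches in this way yields the candidate objects of the binary classification, with the large majority falling into the negative class.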
FIGURE 10.
Definition of an A-stem classification problem. If a (stem, A-stretch) pair involves A-minors it belongs to the positive class and to the negative class otherwise. The emphasized stem consists of two base pairs, and the emphasized stretch of unpaired adenines consists of two bases.
The representative set of structures consisted of 130 RNA chains containing A-minor interactions. To reduce redundancy, we manually excluded homologous RNA chains from different organisms. The resulting 44 RNA chains included exactly one structure of each type of RNA molecule present in the representative set. The resulting set of (stem, A-stretch) pairs included 347 A-stems and 183298 noninteracting (stem, A-stretch) pairs (0.19% positive rate; the complete data are available at https://github.com/febos/urs_aminors).
The features for classification were based on RNA sequence and secondary structure information. Each (stem, A-stretch) pair was annotated with 288 features describing local contexts of the stem and the A-stretch (local features, e.g., the base pair content of the stem, the stem's length, types of neighboring loops, lengths of neighboring wings and threads, the length of the A-stretch, base types of A-stretch neighboring residues, etc.), and distances between the stem's wings and between each wing and the A-stretch in terms of sequence and secondary structure (relative features, e.g., number of guanines in sequence between the objects, number of right wings, number of bulges, whether the A-stretch is located between the stem's wings, whether the thread adjacent to the stem's left wing is the A-stretch thread, etc.). A detailed description of all features is provided in Supplemental Table S5 and at https://github.com/febos/urs_aminors.
The (stem, A-stretch) pairs were classified in terms of RNA secondary structure in a similar manner to A-minors. Thus, a (stem, A-stretch) pair that could form A-minor interactions of type IC-S-S-LC-LC-SM was attributed to the IC-LC type.
Next, we applied the RandomForest algorithm (Liaw and Wiener 2002) implemented in the scikit-learn Python package (Pedregosa et al. 2011). The experiments were carried out using fourfold cross-validation (Rodriguez et al. 2009).
Each model was trained on the five best features selected using an automatic feature selection tool of scikit-learn (SelectFromModel). The parameter class_weight = “balanced” was used to account for the positive:negative label imbalance. All the other hyper-parameters were left at their default values. To ensure that there is no information leakage or overfitting through the huge pool of features, the best results for the HP-LC and IP-LC classes were validated by performing feature selection using (stem, A-stretch) pairs of a different class: the results for the HP-LC pairs were obtained using the five best features selected for the IP-LC pairs and vice versa. The source code is available at https://github.com/febos/urs_aminors/blob/master/ML_Astems.ipynb. To assess the quality of the binary classification results, we used the area under the precision-recall curve (AUPRC).
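The general shape of such a pipeline is sketched below with scikit-learn; the authors' actual code is in the notebook linked above. Several choices here are assumptions of the sketch rather than statements about the original experiments: the data are synthetic (generated with make_classification), the folds are stratified and shuffled, feature selection is refitted inside each training fold, the random seeds are fixed, and AUPRC is estimated with average_precision_score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

def cross_validated_auprc(X, y, n_features=5, n_splits=4, seed=0):
    """Mean AUPRC of a random forest trained on the top features of each fold."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in folds.split(X, y):
        # Keep exactly n_features features, ranked by forest importances.
        selector = SelectFromModel(
            RandomForestClassifier(class_weight="balanced", random_state=seed),
            threshold=-np.inf,
            max_features=n_features,
        )
        classifier = RandomForestClassifier(class_weight="balanced", random_state=seed)
        model = make_pipeline(selector, classifier)
        model.fit(X[train_idx], y[train_idx])
        positive_proba = model.predict_proba(X[test_idx])[:, 1]
        scores.append(average_precision_score(y[test_idx], positive_proba))
    return float(np.mean(scores))

# Synthetic, imbalanced stand-in for the (stem, A-stretch) feature table.
X, y = make_classification(
    n_samples=2000, n_features=30, n_informative=5,
    weights=[0.95], random_state=0,
)
auprc = cross_validated_auprc(X, y)
print(f"mean AUPRC over 4 folds: {auprc:.2f}")
```

With a strongly imbalanced target, the AUPRC of a random classifier is close to the positive rate, so scores should be read against that baseline rather than against 0.5.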
SUPPLEMENTAL MATERIAL
Supplemental material is available for this article.