Structural characterization of naturally occurring RNA single mismatches.

Amber R Davis, Charles C Kirkpatrick, Brent M Znosko.

Abstract

RNA is known to be involved in several cellular processes; however, it is only active when it is folded into its correct 3D conformation. The folding, bending and twisting of an RNA molecule depend upon the multitude of canonical and non-canonical secondary structure motifs. These motifs contribute to the structural complexity of RNA and also serve integral biological functions, such as acting as recognition and binding sites for other biomolecules or small ligands. One of the most prevalent types of RNA secondary structure motif is the single mismatch, which occurs when two canonical pairs are separated by a single non-canonical pair. To determine sequence-structure relationships and to identify structural patterns, we have systematically located, annotated and compared all available occurrences of the 30 most frequently occurring single mismatch-nearest neighbor sequence combinations found in experimentally determined 3D structures of RNA-containing molecules deposited into the Protein Data Bank. Hydrogen bonding, stacking and interaction of nucleotide edges for the mismatched and nearest neighbor base pairs are described and compared, allowing for the identification of several structural patterns. Such a database and comparison will allow researchers to gain insight into the structural features of unstudied sequences and to quickly look up studied sequences.

Year: 2010 PMID: 20876693 PMCID: PMC3035445 DOI: 10.1093/nar/gkq793

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

RNA is known to perform a variety of biological functions and to be involved in several cellular processes; however, it is only active when in its correct 3D conformation. The structural complexity and wide repertoire of structural components of RNA allow this biomolecule to effectively carry out a multitude of key functions. RNA consists of canonical double helical regions, along with non-canonical regions, such as internal loops, bulges, hairpins and multi-branch loops, which have implications for folding and stability of the correct tertiary and quaternary structures. Often, these motifs are important for a variety of biological functions, such as serving as binding sites for proteins (1–10), metals (11–13), small molecules (14–19), or other nucleic acids (20). The scaffold of RNA tertiary structure is a result of the secondary structural components, which introduce kinks and turns in the RNA structure while providing available hydrogen bond donor and/or acceptor sites allowing for intermolecular interactions. Therefore, an understanding of the 3D conformation of RNA secondary structure motifs will give insight into RNA function. An understanding of the structural propensities of common RNA secondary structure motifs should improve the prediction of RNA structure, function and recognition (21). Much work has been done to improve the prediction of RNA secondary structure from sequence (22–31), and methods are being developed to predict RNA tertiary structure (32–39). While the methods of NMR, crystallography and cryo-electron microscopy provide definitive tertiary structure information, they are not capable of keeping pace with the discovery of new and interesting RNA sequences. However, these tools have revealed a wide range of base pairing geometries commonly found in RNA (40,41). These different geometries have been shown to contribute to the complexity of RNA tertiary structure (42,43).
Therefore, an understanding of these base–base conformations may allow for further understanding and accuracy in the prediction of RNA secondary and tertiary structure. One possible approach to begin developing a method to predict tertiary structure of RNA is to identify structural patterns for a given motif by structurally characterizing each occurrence of that motif in available 3D structures. Such structures have been deposited into the Protein Data Bank (PDB) (44–48), a world-wide archive of structural data of biomolecules, which includes all RNA structures solved by NMR, crystallography and cryo-electron microscopy. Currently, there are over 1600 structures containing RNA in the PDB (44–48) (accessed on 12 August 2009). The structural characterization and comparison of all structures containing a particular secondary structure motif is not a trivial task; however, several laboratories have made significant contributions to analyzing RNA motifs found in the structures deposited in the PDB (44–48). The Fox laboratory has developed an internet-based, interactive database of non-canonical base pairs found in known RNA structures (NCIR). It contains over 2000 non-canonical base pairs with descriptions of the associated structural properties, such as sequence context, sugar pucker and glycosidic bond orientation (49,50). The Olson laboratory has also developed a user friendly internet-based database [the RNA base-pair structure (BPS) database] of canonical and non-canonical base pairs found in determined RNA structures. It contains over 91 000 bp and approximately 4000 higher-order base interactions. The database provides representative figures of the observed spatial patterns and the annotation of the structural and chemical features for each base pair (51). The Gutell laboratory has contributed a significant amount of data by investigating the occurrence and diversity of various motifs (52–54). 
The laboratories of Leontis and Westhof have provided a standardized method for the naming and classification of the various orientations of RNA base pairs to allow for unambiguous communication (55–62). The Brenner and Holbrook laboratories have developed the Structural Classification of RNA (SCOR) database, which provides details about the 3D structure, function, tertiary interactions and phylogenetic relationships of RNA secondary structure motifs (63–65). The Major laboratory has developed computational tools that are compliant with the RNA ontology (66) and are incorporated into the computer program MC-Annotate, which is capable of interpreting and labeling RNA base pairs and base stacking interactions of a given 3D structure (67–69). The Major laboratory has also developed the computer program MC-Search, which determines the locations of user-defined structural motifs in RNA (69–71). These efforts have advanced the understanding of the structural details of RNA and have provided tools to analyze RNA tertiary structure. However, with the exception of the recent structural characterization of hairpin triloops (69), no effort has been put forth to systematically locate, annotate and compare occurrences of a particular RNA secondary structure motif. This work is focused on systematically locating, annotating and comparing the most frequently occurring RNA single mismatches in nature. Single mismatches are known to be the most frequently occurring secondary structure motif in ribosomal RNA (72) and often serve integral structural and/or functional roles (73–83). Using the computer search algorithm MC-Search, single mismatches have been located in the deposited structures found in the PDB. The structural characteristics of each occurrence were then objectively annotated using MC-Annotate.
The resulting data for each located and annotated single mismatch were exported into Microsoft Excel to allow for the extraction of the most frequently occurring single mismatch-nearest neighbor sequence combinations (84). Hydrogen bonding, stacking and interaction of nucleotide edges for the mismatched and nearest neighbor base pairs are described and compared, allowing for the identification of several structural patterns. Such a database and comparison will allow researchers to gain insight into the structural features of unstudied sequences and to quickly look up studied sequences. It is important to distinguish this work from previous databases, such as the NCIR and BPS databases. Both the NCIR and BPS databases contain structure information about non-canonical pairs in all secondary structure motifs. This work focuses exclusively on non-canonical pairs in single mismatches, allowing for the identification of structural patterns specific to isolated non-canonical pairs.

MATERIALS AND METHODS

Creation of a 3D RNA structure database

To create a database of previously solved RNA 3D structures, the PDB was searched for molecules containing RNA using the Molecule/Chain Type (since changed to Macromolecule Type) query in the Advanced Search menu on the PDB website (44–48), with the results restricted to molecules containing RNA. All query results were selected and downloaded as uncompressed, .pdb-formatted files. This search was conducted on 12 August 2009 and, therefore, includes all RNA-containing structures deposited into the PDB up to this date. The search was not limited by experimental method or resolution, but the resulting data are limited by the quality of the data deposited into the PDB.

Single mismatch database

The programs MC-Search (69–71) and MC-Annotate (67–69) were used to create the single mismatch database, and it is important to note that they were not modified from the versions provided by the authors. MC-Search (version 0.5) (69–71) was used to locate all single mismatches in the 3D structure database. To search 3D structures for a secondary structure motif, MC-Search requires an input descriptor (Figure 1). In simple terms, the input descriptor defines the size and type of the secondary structure motifs of interest. Defining a single mismatch involves 6 nt: the 2 nt in the mismatch and the 2 nt in each of the 2 bp on either side of the mismatch. The type of interaction between the 2 nt in each pair was defined in the input descriptor, thereby limiting the nearest neighbor pairs to canonical pairs and the mismatch pair to a non-canonical pair. The pairing relations for the MC-Search input descriptor are defined by Roman (85–87) and Arabic (88,89) numerals, which indicate the presence of two or three hydrogen bonds and bifurcated or single hydrogen bonds, respectively. For example, Roman numeral XX (85–87) represents an A-U base pair with two hydrogen bonds (from A-N6 to U-NH3 and A-NH6 to U-O4) between the Watson–Crick face of each base with a cis-glycosidic bond orientation. Other Roman numerals represent other pairs in a similar fashion (85–87). Arabic numeral 51 (88,89) represents an A-U base pair with one hydrogen bond (A-NH6 to U-O4) between the Watson–Crick face of each base with a trans-glycosidic bond orientation. Other Arabic numerals represent other pairs in a similar fashion (88,89).

Figure 1.

Single mismatch graph (top) and MC-Search input descriptor (bottom). The nucleotides are numbered A1 to A3 and B1 to B3 in the 5′ to 3′ direction. The ‘A’ and ‘B’ letter designations specify opposing RNA strands. The letter ‘N’ represents any nucleotide. The input descriptor identifies the canonical nearest neighbors by limiting the allowed pairing interactions to the canonical pairs defined by the Roman (85–87) and Arabic (88,89) numerals. Not all possible numerals for A–U, U–A, G–C, C–G, G–U and U–G pairs are shown here due to space limitations. The input descriptor identifies the mismatched nucleotides by allowing an interaction defined by no hydrogen bonds, while also prohibiting the canonical pairing interactions defined by the Roman and Arabic numerals.

For the nearest neighbor pair, any pair described by the Roman or Arabic numeral naming system of base pairs was allowed, thereby allowing most conformations of G-C, C-G, A-U, U-A, G-U and U-G pairs. Conversely, the mismatch nucleotides were defined as any pair not described by the Roman and Arabic numeral naming system of base pairs, thereby disallowing the pairs previously listed. Once the input descriptor contained this information, MC-Search was able to locate all of the single mismatches in the three-dimensional RNA structural database. For each single mismatch located in this manner, the nucleotides involved in the single mismatch-nearest neighbor sequence combination were ‘clipped’ (i.e. all nucleotides not involved in the single mismatch or nearest neighbor were removed) and saved as a .pdb file to allow for quick annotation and a simple 3D graphic to be produced. Once the results from the MC-Search and MC-Annotate scripts were tabulated, the results were searched for false positives. A false positive results, for example, when MC-Annotate does not annotate a G-C pair with a Roman or Arabic numeral. As a result, this G-C pair is considered a single mismatch. All G-C, C-G, A-U, U-A, G-U and U-G pairs identified by the scripts as single mismatches were considered false positives and were removed from the database of true single mismatches.
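
The false-positive screen described above can be illustrated with a short Python sketch. It is not the authors' script: the record layout (`hit_id`, `mismatch_5p`, `mismatch_3p`), the function names and the example hits are all hypothetical and are used only to show the filtering logic.

```python
# Canonical pairs listed in the text (Watson-Crick pairs plus G-U wobble pairs).
CANONICAL_PAIRS = {
    ("G", "C"), ("C", "G"),
    ("A", "U"), ("U", "A"),
    ("G", "U"), ("U", "G"),
}


def is_false_positive(hit):
    """Return True when the reported 'mismatch' is really a canonical pair."""
    return (hit["mismatch_5p"], hit["mismatch_3p"]) in CANONICAL_PAIRS


def remove_false_positives(hits):
    """Keep only hits whose central pair is a true non-canonical pair."""
    return [hit for hit in hits if not is_false_positive(hit)]


# Hypothetical tabulated hits (invented for illustration only).
example_hits = [
    {"hit_id": "hit-1", "mismatch_5p": "A", "mismatch_3p": "G"},
    {"hit_id": "hit-2", "mismatch_5p": "G", "mismatch_3p": "C"},  # unannotated G-C
    {"hit_id": "hit-3", "mismatch_5p": "U", "mismatch_3p": "U"},
]

true_mismatches = remove_false_positives(example_hits)
print([hit["hit_id"] for hit in true_mismatches])
```

Filtering on sequence identity alone mirrors the rule stated in the text: any central pair whose two nucleotides could form a canonical pair is discarded, regardless of how it was annotated.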

Single mismatch annotation

The located single mismatches were structurally characterized by the program MC-Annotate (version 1.6.2) (67–69), which analyzes the atomic coordinates to determine the nucleotide interactions and classifies the type of base pairing. MC-Annotate utilizes four characterization parameters which include: (i) residue conformation, (ii) adjacent stackings, (iii) non-adjacent stackings and (iv) base-pairs. The residue conformation defines the sugar pucker as endo or exo and the glycosidic bond orientation as syn or anti. The adjacent and non-adjacent stackings define the relative orientation of each base, which are identified by MC-Annotate utilizing the method proposed by Gabb et al. (90). The nomenclature used to describe these orientations was proposed by Major and Thibault (91), which includes four base-stacking types: upward, downward, outward and inward. The nomenclature incorporated to illustrate the base pairing annotations is based on the Leontis and Westhof (56,57) classification scheme, which describes the interacting edges [i.e. the Watson–Crick (W), Hoogsteen (H) and Sugar (S) edges] of the two bases. This scheme has been further defined and described previously by Lemieux and Major (68). The resulting data for each located and annotated single mismatch were exported into Microsoft Excel.
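
As an illustration of the edge-based notation used throughout this work, the following Python sketch stores one base-pair annotation and renders a compact label. The class name, field names and label format are assumptions made for this example; they do not reproduce the MC-Annotate output format.

```python
from dataclasses import dataclass

# Interacting edges in the Leontis and Westhof scheme, as named in the text.
EDGES = {"W": "Watson-Crick", "H": "Hoogsteen", "S": "Sugar"}


@dataclass(frozen=True)
class PairAnnotation:
    """One annotated base pair (hypothetical container, for illustration)."""
    nt5: str            # nucleotide on the 5' side, e.g. "A"
    edge5: str          # interacting edge of that nucleotide: W, H or S
    nt3: str            # nucleotide on the 3' side, e.g. "G"
    edge3: str          # interacting edge of that nucleotide: W, H or S
    orientation: str    # glycosidic bond orientation: "cis" or "trans"
    hbond_pattern: str  # hydrogen bonding pattern label, e.g. "XI"

    def __post_init__(self):
        if self.edge5 not in EDGES or self.edge3 not in EDGES:
            raise ValueError("edge must be one of W, H or S")
        if self.orientation not in ("cis", "trans"):
            raise ValueError("orientation must be 'cis' or 'trans'")

    def label(self):
        """Render a compact text label for the pair."""
        return (f"5'({self.nt5}){self.edge5}/3'({self.nt3}){self.edge3} "
                f"{self.orientation} {self.hbond_pattern}")


pair = PairAnnotation("A", "H", "G", "S", "trans", "XI")
print(pair.label())
```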

Analysis of data and identification of structural patterns

Because of the large amount of data generated from the search and annotation (4899 single mismatches identified), the analysis of the data and the identification of structural patterns focused on the 30 most frequently occurring single mismatches in nature (84). To allow for the extraction of the most frequently occurring single mismatch-nearest neighbor sequence combinations (84) and the identification of structural patterns, the Leontis and Westhof (56,57) naming scheme was used when determining general structural trends and patterns, because annotation is subject to interpretation and to small geometrical variations (32) that could arise from experimental conditions. It is important to note that some single mismatches have been excluded from the following analysis. To prevent over-counting and to simplify the analysis, ensembles of structures determined by NMR were excluded. PDB structures consisting of a single averaged NMR structure, however, were included. Several clipped PDB files were not included in the analysis for various reasons (e.g. 13 single mismatch-containing PDB files were not in the correct .pdb format, which prevented nucleotide annotation by MC-Annotate). These PDB files are denoted in Supplementary Table S1. Lastly, it is important to note that the structural trends and patterns may be skewed by repetitive representation of a molecule in the PDB. For example, the crystal structure of the large ribosomal subunit of Haloarcula marismortui has been solved unbound (PDB I.D. 1ffk) and bound (PDB I.D. 1n8r) to antibiotics.
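
The text does not state how NMR ensembles were distinguished from single averaged structures. One plausible approach, sketched below in Python as an assumption rather than the authors' procedure, is to count the MODEL records in a .pdb file, since multi-model entries mark each model with a MODEL/ENDMDL record pair. The function names and the toy file contents are hypothetical.

```python
def count_models(pdb_text):
    """Count MODEL records (record name in columns 1-6) in PDB-format text."""
    return sum(
        1 for line in pdb_text.splitlines()
        if line[:6].strip() == "MODEL"
    )


def is_multi_model(pdb_text):
    """Return True when the file holds more than one model (an ensemble)."""
    return count_models(pdb_text) > 1


# Toy file contents, abbreviated for illustration (not real coordinates).
ensemble_text = (
    "MODEL        1\n"
    "ATOM      1  P     G A   1       0.000   0.000   0.000\n"
    "ENDMDL\n"
    "MODEL        2\n"
    "ATOM      1  P     G A   1       0.100   0.000   0.000\n"
    "ENDMDL\n"
)
single_text = (
    "ATOM      1  P     G A   1       0.000   0.000   0.000\n"
    "END\n"
)

print(is_multi_model(ensemble_text), is_multi_model(single_text))
```

A file with zero or one MODEL record is treated as a single structure, which matches the inclusion of single averaged NMR structures described above.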

RESULTS

3D RNA structure database

The PDB (44–48) search returned 1666 RNA-containing structures which were then used to create the 3D RNA structure database. A complete listing of the obtained structures can be found in the Supplementary Data (Supplementary Table S1).

Single mismatch structural database

Incorporation of a single mismatch-specific input descriptor into the MC-Search (69–71) program followed by a search of the structures contained in the 3D RNA structure database returned a large dataset of 4899 single mismatches, each of which was structurally characterized using MC-Annotate. Of the 30 most frequently occurring single mismatches in a secondary structure database (84), 21 were located in the 3D structure database (Table 1 and Supplementary Table S2) and are the focus of the rest of this study. The nine frequently occurring single mismatch-nearest neighbor sequences (84) not found in the structural database were: , , , , , , , and , with frequencies of 94, 62, 54, 43, 38, 38, 34, 34 and 34, respectively (84). For the remaining single mismatch-nearest neighbor combinations found in the top 30 (84), the number of times each was found in the structural database varied widely (Table 1). Single mismatches were found in a wide repertoire of RNAs, including ribosomal RNAs (free and bound to antibiotics and proteins), riboswitches, tRNAs and viral RNAs.

Table 1.

Summary of the structural orientation and interaction of the 30 frequently occurring single mismatches

(a) Not all possible orientations and hydrogen bonding patterns are shown for each single mismatch-nearest neighbor combination. Only those representing at least 5% of total occurrences are included.

(b) For each sequence, the top strand is written 5′–3′, and the bottom strand is written 3′–5′. Duplexes are written in alphabetical order by the loop nucleotide (A over G, not G over A). If the loop nucleotides are identical, then duplexes are written in alphabetical order by the nearest neighbors (CUG over GUU, not GUU over CUG).

(c) Frequency of occurrence in the database (84).

(d) Number of times each single mismatch-nearest neighbor sequence combination was located in the three-dimensional RNA structure database compiled from structures deposited into the PDB.

(e) Number of occurrences in each subclass, which is determined among each sequence combination, considering four parameters: interacting edges for the single mismatch nucleotides and the nearest neighbor base pairs and hydrogen bond patterns for the single mismatch nucleotides and the nearest neighbor base pairs.

(f) Annotated orientations and hydrogen bonding patterns of the single mismatch and 5′- and 3′-nearest neighbor nucleotides, which are described in the ‘Materials and Methods’ section.

Due to the immense amount of data collected, a table summarizing the common structural characteristics for each single mismatch-nearest neighbor sequence combination in the top 30 (84) is provided in Table 1 and Supplementary Table S2. To determine structural classes, or specimens (69), among each sequence combination, four parameters were considered: interacting edges for both the single mismatch nucleotides and the nearest neighbor base pairs and hydrogen bond patterns for both the single mismatch nucleotides and the nearest neighbor base pairs. Interactions involving a mismatched nucleotide and a nearest neighbor nucleotide were only considered when occurring in >5% of the total population for each single mismatch-nearest neighbor sequence combination.
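
The grouping into structural classes can be sketched in Python as follows. This is a minimal illustration, not the spreadsheet workflow the authors used: the field names (`mm_edges`, `mm_hbond`, `nn_edges`, `nn_hbond`) are hypothetical, the example records are invented, and the threshold is applied as "at least 5%", following the Table 1 footnote.

```python
from collections import Counter


def subclass_key(record):
    """Build the four-parameter key that defines a structural class."""
    return (
        record["mm_edges"],   # interacting edges of the mismatch nucleotides
        record["mm_hbond"],   # hydrogen bond pattern of the mismatch
        record["nn_edges"],   # interacting edges of the nearest neighbor pairs
        record["nn_hbond"],   # hydrogen bond patterns of the nearest neighbors
    )


def frequent_subclasses(records, min_fraction=0.05):
    """Count occurrences per class; keep classes at or above min_fraction."""
    counts = Counter(subclass_key(r) for r in records)
    total = sum(counts.values())
    return {key: n for key, n in counts.items() if n / total >= min_fraction}


def make(mm_edges, mm_hbond, n):
    """Generate n identical invented records for the demonstration."""
    return [
        {"mm_edges": mm_edges, "mm_hbond": mm_hbond,
         "nn_edges": "W/W;W/W", "nn_hbond": "XIX;XIX"}
        for _ in range(n)
    ]


# 100 invented occurrences of one sequence combination.
records = make("H/S", "XI", 90) + make("W/W", "VIII", 8) + make("none", "none", 2)
for key, n in frequent_subclasses(records).items():
    print(n, key)
```

Classes are compared only within one sequence combination, so the function would be called once per combination on that combination's occurrences.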

DISCUSSION

A·G single mismatches

A·G single mismatches are the most frequently occurring single mismatch type found in the secondary structure database (84) when categorized by only the mismatched nucleotides. There are 10 A·G mismatch-nearest neighbor sequence combinations found in the 30 most frequently occurring single mismatches (84), and nine are represented in the RNA single mismatch structural database (Table 1 and Supplementary Table S2), with a total of 1462 occurrences. These nine can be divided into three groups based upon the geometric configuration of the mismatch nucleotides. The first group consists of the most common geometric orientation of the mismatched nucleotides, 5′(A)H/3′(G)S pairing, antiparallel, trans glycosidic bond conformation, with 83% of the total occurrences found with these characteristics (Figure 2). , , , and are the five sequence combinations with these geometric features, and, interestingly, they each contain a U-A or A-U base pair on the 5′ side of the A·G mismatch. Considering these five single mismatch-nearest neighbor sequence combinations, the most common base-pair orientation and hydrogen bonding pattern of the 5′ and 3′ nearest neighbors are 5′(U)W/3′(A)H pairing, antiparallel, trans XXIV and 5′W/3′W pairing, antiparallel, cis XIX, respectively. Although the orientation of the 5′ nearest neighbor is reversed for (A–U instead of U–A), the A–U pair still exhibits a 5′(U)W/3′(A)H pair. It is interesting to note that the 5′ A–U or U–A nearest neighbor does not have the expected 5′W/3′W pairing. Perhaps this is due to the structural perturbation resulting from the accommodation of the A·G mismatch, a purine–purine mismatch. The helical geometry may be disrupted to accommodate this type of non-canonical base pair. However, it is unclear why the 3′ nearest neighbor is not similarly disrupted.

Figure 2.

Representation of an A·G mismatch in the 5′(A)H/3′(G)S pairing, antiparallel, trans orientation with XI hydrogen bonding pattern (PDB ID 1C04), which is the most common orientation and interaction determined for the most frequently occurring A·G mismatch-nearest neighbor combinations (84) that were also represented in the PDB.

The second group of A·G mismatches consists of mismatch nucleotides with 5′(A)W/3′(G)W pairing, antiparallel, cis orientation forming two hydrogen bonds in the VIII pattern. and are the two sequence combinations with these geometric features. They have similar nearest neighbors, with 5′Y/3′G (where Y is a pyrimidine) and 5′C/3′G on the 5′ and 3′ side of the A·G single mismatch, respectively. The 5′ and 3′ nearest neighbors are both characterized as 5′W/3′W pairing, antiparallel, cis XIX. The third group of A·G mismatches consists of mismatch nucleotides which are annotated not to form any interactions with each other. and are the two sequence combinations with these geometric features. No interactions are found between the A·G mismatch nucleotides in because the A is flipped out from the center of the helix and is interacting with the surrounding solvent. The nucleotides of the base pairing nearest neighbors for were most commonly annotated to both be in the 5′W/3′W pairing, antiparallel, cis orientation forming three hydrogen bonds in the XIX pattern (one of the four examples was annotated to form only one hydrogen bond in the 130 base-pairing pattern). Although contains similar nearest neighbor sequence combinations and geometries as (discussed above in the second group), the geometry of the single mismatch is different. also is annotated not to have any interactions between the mismatched nucleotides; however, the geometries of the 5′ and 3′ nearest neighbors are the same as those in the first group discussed above, 5′(U)W/3′(A)H pairing, antiparallel, trans XXIV and 5′(U)W/3′(G)W pairing, antiparallel, cis XIX, respectively.
Inter- and intra-strand interactions involving a mismatched nucleotide and a nearest neighbor nucleotide were found to occur prevalently in eight of the nine A·G mismatch-nearest neighbor sequence combinations (data not shown). The sequence without these types of interactions is , and it is unclear why this A·G mismatch does not engage in these types of interactions. Characterizing the single mismatch-nearest neighbor sequences as , all eight involved an inter-strand interaction between nucleotides A and E. The sequence combinations of and also formed an intra-strand interaction between nucleotides B and C through the O2P/Bh (i.e. one of the free oxygen atoms at the phosphorus between nucleotides B and C is the hydrogen bond acceptor which forms a bifurcated hydrogen bond with the two amino hydrogen atoms found on the Hoogsteen edge of the C) adjacent pairing with upward stacking. It is interesting to note that these two sequences differ only by the orientation of their 5′ nearest neighbor. The sequences and formed an intra-strand interaction between nucleotides F and E, and has an additional intra-strand interaction between nucleotides E and D. These types of interactions may contribute to single mismatch stability, and it is therefore important to understand them and to study their effects further. An interesting structural and thermodynamic comparison is found for the two mismatch-nearest neighbor sequence combinations of and , which differ only by the identity of the 5′ nearest neighbor, U-A versus U-G, respectively; however, they have experimental free energy values of −0.6 and 1.2 kcal/mol (84).
There are 356 examples of found in the structural database, and the 5′ nearest neighbor, A·G mismatch and the 3′ nearest neighbor nucleotides are annotated to have the following characteristics in 90% of these occurrences: 5′(U)W/3′(A)H pairing antiparallel trans XXIV (two hydrogen bonds), 5′(A)H/3′(G)S pairing antiparallel trans XI (two hydrogen bonds) and 5′(C)W/3′(G)W pairing antiparallel cis XIX (three hydrogen bonds), respectively. Additionally, this mismatch-nearest neighbor sequence generally forms intra- and inter-strand interactions, which are described above. There are 79 examples of found in the structural database, and the 5′ nearest neighbor, A·G mismatch and the 3′ nearest neighbor nucleotides are annotated to have the following characteristics in 67% of these occurrences: 5′(U)W/3′(G)W pairing antiparallel cis one_hbond (one hydrogen bond), 5′(A)W/3′(G)W pairing antiparallel cis VII (two hydrogen bonds), and 5′(C)W/3′(G)W pairing antiparallel cis XIX (three hydrogen bonds), respectively. It is important to note another 29% of the occurrences of have similar structural characteristics and only differ by the hydrogen bonding pattern of the 5′ nearest neighbor, which is annotated to be XXVIII (two hydrogen bonds). However, this mismatch-nearest neighbor sequence is not annotated to engage in intra- and inter-strand interactions. Comparing the structural and interaction differences between these two mismatch-nearest neighbor sequences to the difference in free energy contribution of the respective single mismatches to duplex stability, it is unclear what the major contributing factor is that is resulting in such a large difference in thermodynamic stability. However, the additional stability of may partially be a result of the additional intra- and inter-strand hydrogen bonding.

U·U single mismatches

There are seven U·U RNA single mismatch-nearest neighbor combinations found in the top 30 naturally occurring single mismatches (84), and four of these combinations, which include , , and , are represented in the RNA single mismatch structural database with a total of 403 occurrences (Table 1). Comparing these sequence combinations, the most common orientation of mismatch and nearest neighbor nucleotides for each are similar. Most commonly, the U·U mismatch nucleotides adopt the 5′W/3′W pairing, antiparallel, cis conformation in 344 (85%) of the occurrences. When the U·U mismatches are found in this orientation, XVI and one_hbond (note this hydrogen bonding pattern has not been defined by an Arabic numeral in the literature) are the two hydrogen bonding patterns observed for 257 (75%) (Figure 3) and 87 (25%) of these occurrences, respectively. Also, when only considering this U·U conformation, 343 (∼100%) and 302 (88%) of the 5′ and 3′ nearest neighbor base pairs, respectively, are interacting in the 5′Ww/3′Ww pairing, antiparallel, cis XIX orientation. Interestingly, the 5′ nearest neighbors vary in sequence identity, including G-C, C-G and A-U, but they are all observed with the same type of orientation and interaction. The 3′ nearest neighbors also vary in sequence identity, including G-C, C-G and G-U; however, the 3′ nearest neighbor of the sequence combination is observed to always have the same orientation but with the two different hydrogen bonding patterns of XIX (forming three hydrogen bonds) and XXVIII (forming two hydrogen bonds) for 40 (44%) and 50 (56%) of the occurrences, respectively.
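
The percentages quoted for the U·U mismatches follow directly from the reported counts. The short Python check below recomputes them; the helper name `percent` is introduced here for illustration, and all counts are taken from the paragraph above.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


total_uu = 403   # all U·U occurrences in the structural database
cis_ww = 344     # occurrences with 5'W/3'W pairing, antiparallel, cis

print(percent(cis_ww, total_uu))   # 85: share of the cis W/W conformation
print(percent(257, cis_ww))        # 75: XVI hydrogen bonding pattern
print(percent(87, cis_ww))         # 25: one_hbond pattern
print(percent(343, cis_ww))        # 100: 5' nearest neighbors in cis XIX
print(percent(302, cis_ww))        # 88: 3' nearest neighbors in cis XIX
print(percent(40, 40 + 50))        # 44: 3' neighbor with XIX pattern
print(percent(50, 40 + 50))        # 56: 3' neighbor with XXVIII pattern

# The two hydrogen bonding patterns account for every cis W/W occurrence.
assert 257 + 87 == cis_ww
```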

Figure 3.

Representation of a U·U mismatch in the 5′(U)W/3′(U)W pairing, antiparallel, cis orientation with XVI hydrogen bonding pattern (PDB ID 1FJG), which is the most common orientation and interaction determined for the most frequently occurring U·U mismatch-nearest neighbor combinations (84) that were also represented in the PDB.

It is interesting to note that for these four U·U single mismatch-nearest neighbor sequence combinations, , , and , there is at least one occurrence found for each where the U·U mismatch nucleotides are found to have no interaction with each other and are observed to be flipped out from the center of the helix or to be positioned in such a way that hydrogen bonding is not possible through the 5′W/3′W pairing type (data not shown). Furthermore, U·U mismatch nucleotides involved in the and sequence combinations are annotated to have no interaction for 16 and 50% of the total hits of each, respectively. This may suggest that U·U mismatches are dynamic and interact with the surrounding environment under certain conditions, such as what is observed for the sequence combination, which is annotated and observed to be in a hydrogen bonded (one or two bonds formed), stacked conformation (Figure 4a) and a non-hydrogen bonded, unstacked conformation, where one of the U nucleotides involved in the single mismatch is flipped out from the center of the helix and is interacting with surrounding solvent (Figure 4b) in 84 and 16% of the occurrences, respectively. However, it is further interesting to note that both of these geometric orientations were annotated to have the same 5′W/3′W nearest neighbors; therefore, it appears that the difference in spatial arrangement of the mismatched nucleotides does not affect that of the adjacent base pairs.
This loop sequence was thermodynamically measured to contribute favorably to duplex stability (92), which may result from the ability of one of the loop nucleotides to rotate between two positions without distorting the geometrical orientation of the nearest neighbors.

Figure 4.

Representation of in the hydrogen bonded, stacked orientation (PDB ID 1O9M) (a) and in the non-hydrogen bonded, unstacked orientation (PDB ID 1O9M) (b).

A·C single mismatches

Six of the eight A·C RNA single mismatch-nearest neighbor sequence combinations of the 30 most frequently occurring single mismatches in nature (84) are found in the RNA single mismatch structural database compiled here (Table 1). Considering these six combinations, a total of 89 A·C RNA single mismatch occurrences are found in the database; however, accounts for 73 (82%) of these hits, with all other combinations accounting for only 4% each, on average. The mismatched nucleotides of are most commonly observed in the 5′(A)H/3′(C)W pairing, antiparallel, trans orientation with the XXV (forming two hydrogen bonds) (Figure 5) or one_hbond hydrogen bonding pattern (each occurring ∼50% of the time). When A·C mismatches are found with this type of orientation and these interactions, the 5′ and 3′ nearest neighbors are always found in the 5′(A)Hh/3′(U)Ws pairing, antiparallel, trans XXIV and 5′Ww/3′Ww pairing, antiparallel, cis XIX orientation and interaction, respectively. Similar to A·G single mismatches, the 5′ nearest neighbor does not have the expected 5′W/3′W pairing. Contrary to A·G mismatches, A·C mismatches are not expected to disrupt the neighboring base pairs because this type of mismatch is composed of one purine and one pyrimidine base; therefore, it is similar in size to a canonical pair. This mismatch-nearest neighbor sequence combination was also found to engage in intra- and inter-strand interactions similar to what is observed for A·G mismatches. If the mismatch-nearest neighbor sequence is simply characterized as above, then inter- and intra-strand interactions are observed to form between nucleotides A and E and nucleotides B and C, respectively.
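The shares quoted above follow from simple arithmetic on the hit counts. The following is a minimal sketch in Python, assuming only the totals reported in the text (89 A·C occurrences, 73 of them from the dominant combination, five remaining combinations). The function name `share_percent` and the constant names are illustrative and are not part of the original study.

```python
def share_percent(count: int, total: int) -> float:
    """Return `count` expressed as a percentage of `total`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


TOTAL_AC_HITS = 89       # all A·C occurrences in the database (from the text)
DOMINANT_HITS = 73       # hits for the most common combination (from the text)
OTHER_COMBINATIONS = 5   # remaining combinations represented in the database

# Share of the dominant combination among all A·C hits.
dominant_share = share_percent(DOMINANT_HITS, TOTAL_AC_HITS)

# Hits left over for the other five combinations, and their mean share.
remaining_hits = TOTAL_AC_HITS - DOMINANT_HITS
average_other_share = share_percent(remaining_hits, TOTAL_AC_HITS) / OTHER_COMBINATIONS

print(f"dominant combination: {dominant_share:.0f}%")
print(f"other combinations, on average: {average_other_share:.1f}% each")
```

Under these assumptions the dominant combination comes to about 82% of the hits, and the remaining 16 hits average roughly 3.6% per combination, which is consistent with the rounded figure of 4% given in the text.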

Figure 5.

Representation of an A·C mismatch in the 5′(A)H/3′(C)W pairing, antiparallel, trans orientation with XXV hydrogen bonding pattern (PDB ID 1FJG), which is the most common orientation and interaction determined for the A·C mismatch-nearest neighbor combination of . This mismatch-nearest neighbor sequence combination is found in the 30 most frequently occurring single mismatches (84) and accounts for 82% of the total A·C mismatches found in this study.

The remaining five A·C mismatch-nearest neighbor sequence combinations include , , , and . These five can be divided into three groups based upon the geometric configuration of the mismatch nucleotides. The first group consists of the sequences and and the mismatched nucleotides are annotated with 5′(A)Wh/3′(C)Ww pairing, antiparallel, cis 75 (one hydrogen bond) geometric features. The second group consists of the sequences and and the mismatched nucleotides are annotated to have no interaction. Interestingly, the first and second groups exhibit the same 5′ and 3′ nearest neighbor orientations and interactions. These nearest neighbors are both annotated to be in the 5′Ww/3′Ww pairing, antiparallel, cis orientation, forming the canonical three hydrogen bonds in the XIX pattern. All four of these sequence combinations have G–C or C–G nearest neighbor base pairs at both the 5′ and 3′ side of the mismatch. Based upon the similarities in the type and orientation of the adjacent base pairs in these two groups, it is unclear why the A·C mismatched nucleotides adopt different conformations. The third group consists only of the sequence combination, and the mismatched nucleotides are annotated to be in the 5′(A)Ww/3′(C)Hw pairing, antiparallel, cis, one_hbond orientation. The 5′ nearest neighbor of this mismatch-nearest neighbor sequence exhibits the same geometric orientation and hydrogen bonding pattern as the first and second group of A·C mismatches. However, the 3′ nearest neighbor is unique in identity and orientation when compared to these groups. The U-A base pair at this position is annotated to be in either the 5′(A)W/3′(C)Bh or 5′(A)W/3′(C)W pairing, antiparallel, trans orientation with the 46 (one hydrogen bond) hydrogen bonding pattern.

C·U single mismatches

C·U RNA single mismatches are the fourth most frequently occurring mismatch type, with three C·U mismatch-nearest neighbor sequences found in the 30 most frequently occurring single mismatches (84). Only one of these combinations is represented in the RNA single mismatch structure database presented here. There are 76 occurrences of , and the C·U mismatch nucleotides are either in the 5′(C)W/3′(U)W pairing, antiparallel, cis one_hbond conformation (Figure 6) or the nucleotides are annotated to have no interaction. However, it is important to note the C·U mismatches annotated to have no interaction are also observed in the 5′(C)W/3′(U)W orientation. The 5′ and 3′ nearest neighbor base pairs are both in the 5′Ww/3′Ww pairing, antiparallel, cis XIX orientation.

Figure 6.

Representation of a C·U mismatch in the 5′(C)W/3′(U)W pairing, antiparallel, cis orientation with one_hbond hydrogen bonding pattern (PDB ID 1FJG), which is the most common orientation and interaction determined for the most frequently occurring C·U mismatch-nearest neighbor combinations (84) that were also represented in the PDB.

A·A single mismatches

A·A RNA single mismatches are the fifth most frequently occurring mismatch type (84). Additionally, there is only one A·A mismatch-nearest neighbor sequence combination, , found in the top 30, and it is not represented in the RNA 3D structure database. Therefore, this work does not contain structural information for this type of mismatch, but we are currently working to locate and annotate other A·A mismatch-nearest neighbor sequence combinations.

G·G single mismatches

G·G RNA single mismatches are the sixth most frequently occurring type of mismatch in nature (84). There is only one example of this mismatch type in the top 30 single mismatches, , and it is represented in the database presented here with 24 occurrences. The G·G mismatch nucleotides are either annotated to have no interaction (Figure 7) or in the 5′H/3′Bs pairing, antiparallel, trans conformation. When the nucleotides are interacting, the two hydrogen bond patterns annotated are 34 (bifurcated hydrogen bond) or 112 (one hydrogen bond). However, the G·G mismatches annotated to have no interaction are also observed in the 5′H/3′S orientation. Interestingly, regardless of the orientation and interaction of the mismatched nucleotides, the 5′ and 3′ nearest neighbor base pairs are always found in the 5′Hh/3′Ws pairing, antiparallel, trans XXIV and 5′Ww/3′Ww pairing, antiparallel, cis XIX conformations, respectively. Once again, it is interesting to note the 5′ nearest neighbor does not form the canonical 5′W/3′W pairing type.

Figure 7.

Representation of a G·G mismatch annotated as having no interaction (PDB ID 2QAM), which is the most common orientation and interaction determined for the most frequently occurring G·G mismatch-nearest neighbor combination, (84) that was also represented in the PDB.

C·C single mismatches

C·C RNA single mismatches are the least frequently occurring mismatch type, and there are no C·C mismatch-nearest neighbor combinations found in the 30 most frequently occurring single mismatches (84). Therefore, this work does not contain structural information for this type of mismatch, but we are currently working to locate and annotate C·C mismatch-nearest neighbor sequence combinations.

Nearest neighbor comparison

There are four examples in the top 30 of the nearest neighbor combination , where X is any nucleotide, and all are represented here, which include , , and . It is important to note all three possible types of mismatches are present in this group: R·Y, R·R and Y·Y, when A and G are categorized as purines (R) and C and U are categorized as pyrimidines (Y). R·Y mismatches are similar in size to a canonical base pair since they are composed of one purine and one pyrimidine; therefore, R·Y single mismatches are not likely to disrupt the duplex backbone. R·R and Y·Y single mismatches are likely to disrupt the duplex backbone by causing the backbone to bulge out or in, respectively, to accommodate the mismatched nucleotides. Nevertheless, regardless of the mismatch type for these four sequence combinations, the 5′ and 3′ nearest neighbors are both in the 5′W/3′W pairing, antiparallel, cis XIX conformation in ∼99% of the occurrences.

There are three examples in the top 30 of the nearest neighbor combination , but only two are represented in the RNA structural database, and . It is important to note the difference of mismatch type, R·Y versus R·R, for the reasons stated above with regard to the size of the nucleotides comprising the mismatched base pair and the hypothesized effect on the backbone. Interestingly, the 5′ and 3′ nearest neighbors are most commonly found in the 5′H/3′W pairing, antiparallel, trans XXIV and 5′Ww/3′Ww pairing, antiparallel, cis XIX conformations, respectively.

There are three examples in the top 30 of the nearest neighbor combination , which are all represented in the structural database and include , and . Similar to the previous nearest neighbor sequence combinations, both the 5′ and 3′ nearest neighbors are found in the 5′Ww/3′Ww pairing, antiparallel, cis XIX conformation, in ∼100% of the occurrences. It is interesting to note the 3′ nearest neighbor for , , and is C-G, and the orientation and interaction of this base pair are found to be the same for each, regardless of the identities of the 5′ nearest neighbor base pair and the mismatch nucleotides.

There are three examples in the top 30 of the nearest neighbor combination , but only two are found in the structural database, and . The 5′ nearest neighbor conformation is different for each sequence combination. However, the 3′ nearest neighbor is identical in 98% of the total occurrences and is found to be 5′Ww/3′Ww pairing, antiparallel, cis XIX, which is the same orientation and hydrogen bond pattern found in the above nearest neighbor comparisons.

In conclusion, the PDB is a rich source of structural information, and this work has undertaken the task of systematically locating, annotating and comparing the most frequently occurring RNA single mismatches in nature. The 2046 single mismatches presented here (Table 1 and Supplementary Table S2) account for only 42% of the total number of single mismatches found in the available PDB structures. Therefore, this study only begins to investigate the available data, and we are currently looking at and comparing the remaining single mismatches to identify more structural patterns.
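The purine/pyrimidine classification used in the nearest neighbor comparison can be illustrated with a short sketch. The Python below assumes only what the text states (A and G are purines, R; C and U are pyrimidines, Y) and records the hypothesized backbone effect for each class. The names `base_class`, `mismatch_type` and `HYPOTHESIZED_BACKBONE_EFFECT` are illustrative and do not come from the original work or any annotation tool.

```python
PURINES = {"A", "G"}       # categorized as R in the text
PYRIMIDINES = {"C", "U"}   # categorized as Y in the text

# Hypothesized effect on the duplex backbone, as described in the text.
HYPOTHESIZED_BACKBONE_EFFECT = {
    "R·Y": "similar in size to a canonical pair; disruption unlikely",
    "R·R": "backbone likely to bulge out",
    "Y·Y": "backbone likely to bulge in",
}


def base_class(base: str) -> str:
    """Return 'R' for a purine or 'Y' for a pyrimidine (RNA bases only)."""
    b = base.upper()
    if b in PURINES:
        return "R"
    if b in PYRIMIDINES:
        return "Y"
    raise ValueError(f"not an RNA base: {base!r}")


def mismatch_type(base_5p: str, base_3p: str) -> str:
    """Classify a pair of bases as R·Y, R·R or Y·Y.

    The order of the two bases does not matter. This function does not
    check whether the pair is actually a mismatch (for example, A with U
    would also be reported as R·Y).
    """
    classes = "".join(sorted(base_class(base_5p) + base_class(base_3p)))
    return {"RR": "R·R", "RY": "R·Y", "YY": "Y·Y"}[classes]


for pair in [("A", "C"), ("G", "G"), ("U", "U")]:
    kind = mismatch_type(*pair)
    print(pair, kind, "-", HYPOTHESIZED_BACKBONE_EFFECT[kind])
```

Sorting the two class letters before the lookup makes the result independent of which strand each base sits on, which matches the way the text treats R·Y as a single category.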

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Institute of General Medical Sciences (R15GM085699 to B.M.Z.). Monsanto Scholars Graduate Fellowship and the Saint Louis University Graduate School Dissertation Fellowship to A.R.D. Funding for open access charge: National Institutes of Health.

Conflict of interest statement. None declared.
