
Adenine·cytosine substitutions are an alternative pathway of compensatory mutation in angiosperm ITS2.

Xinwan Zhang¹, Yong Cao¹, Wei Zhang^1,2, Mark P Simmons².

Abstract

Compensatory mutations are crucial for functional RNA because they maintain RNA configuration and thus function. Compensatory mutation has traditionally been considered to be a two-step substitution through the GU-base-pair intermediate. We tested for an alternative AC-mediated compensatory mutation (ACCM). We investigated ACCMs by using a comprehensive sampling of ribosomal internal transcribed spacer 2 (ITS2) from 3934 angiosperm species in 80 genera and 55 families. We predicted ITS2 consensus secondary structures by using LocARNA for structure-based alignment and partitioning paired and unpaired regions. We examined and compared the substitution rates and frequencies among base pairs by using RNA-specific models. Base-pair states of ACCMs were mapped onto the inferred phylogenetic trees to infer their evolution. All types of compensatory mutations involving the AC intermediate were observed, but the most frequent substitutions were with AU or GC pairs, which are part of the AU-AC-GC pathway. Compared with the GU intermediate, AC had a lower frequency and higher mutability. Within the AU-AC-GC pathway, the AU-AC substitution rate was much slower than the AC-GC substitution rate. No consistently higher overall rate was identified for either pathway among all 80 sampled lineages, though the rate of compensatory mutations through the AC intermediate averaged about half that through the GU intermediate. These results demonstrate an alternative compensatory-mutation pathway between AU and GC that helps address the controversial inference of simultaneous double substitutions.

Keywords: base-pair substitution; compensatory mutations; ribosomal ITS2; secondary structure

Year: 2019 PMID: 31748405 PMCID: PMC6961544 DOI: 10.1261/rna.072660.119

Source DB: PubMed Journal: RNA ISSN: 1355-8382 Impact factor: 4.942

INTRODUCTION

The functions of RNA molecules are based on their secondary and tertiary structures, which are formed by intramolecular base pairs (Côté et al. 2002; Pendrak and Roberts 2011; Woolford and Baserga 2013). Any mutation occurring on one side of a base pair could cause RNA conformational changes and thus is expected to be deleterious. Compensatory mutation or compensatory base change (CBC), wherein a substitution on one side of a base pair is compensated by a substitution on the other side of that base pair, is expected to restore base-pairing (Kimura 1985; Chen et al. 1999; Ivankov et al. 2014). In the context of a fitness landscape, the intermediate variant is inferred to represent a deep valley between adaptive peaks (Kimura 1985; Kirby et al. 1995; Meer et al. 2010). Identifying the intermediate base pair prior to the compensatory mutation is necessary to understand RNA evolution and adaptation (Meer et al. 2010; Kusumi et al. 2016). The GU/UG base pair has a special status among the 12 possible noncanonical combinations (Masquida and Westhof 2000; Varani and McClain 2000; Mokdad et al. 2006; Ananth et al. 2013). It is generally recognized that the frequency and stability of the GU/UG pair are less than Watson–Crick (WC) base pairs, but still higher than all other noncanonical base pairs (Masquida and Westhof 2000; Savill et al. 2001; Ananth et al. 2013). Therefore RNA-substitution models always include the GU/UG pair to account for this compensatory mutation process (e.g., AU → GU → GC; Rousset et al. 1991; Tillier and Collins 1995; Savill et al. 2001). In contrast, the AC/CA pair, which is the other purine–pyrimidine intermediate, has rarely been assessed (Meer et al. 2010; Marin Rodrigues et al. 2017). 
Crystal-structure analyses indicate that the two characteristics that make UG pairs relatively stable, hydrogen bonds and translation of the pyrimidine and purine toward the major and minor grooves, respectively, are both shared with the C·A+ base pair (Jang et al. 1998; Pan et al. 1998; Brovarets and Hovorun 2016). In some RNA functional sites, C·A+ pairs can be used in place of U·G pairs because of their structural similarity (Gabriel et al. 1996; Pan and Sundaralingam 1999; Masquida and Westhof 2000). This type of C·A+ pair (the other being the tautomeric C·A pair; Wang et al. 2011), which is formed by protonation and is stable in a weakly acidic environment, has been repeatedly observed by X-ray crystallography (Hunter et al. 1986; Jang et al. 1998; Garg and Heinemann 2018) and nuclear-magnetic resonance (NMR; Boulard et al. 1992; Durant and Davis 1999; Huppler et al. 2002) in chemically synthesized oligoribonucleotides as well as in cleavage sites of an in vivo leadzyme, the Bacillus stearothermophilus DNA polymerase I large fragment, and human interleukin (IL)-6 mRNA. The C·A pair can also form a cognate-base-pair conformation in the polymerase active site (Wang et al. 2011; Brovarets and Hovorun 2016). These detailed biochemical findings for AC/CA structures contrast with our relative ignorance of their evolutionary relevance.
Detection of unstable base pairs remains problematic in practice. X-ray crystallography and NMR spectroscopy remain indispensable for determining the base-pair composition of crystal structures (e.g., Wang et al. 2011; Brovarets and Hovorun 2016; Garg and Heinemann 2018). However, these approaches have limitations when applied to the unstable CA/AC pair because it is both infrequent and transient, and hence seldom observed. Furthermore, the formation of hydrogen bonds between noncanonical base pairs, either through protonation or tautomerization, is dependent upon the crystal environment, which makes it difficult to visualize in the crystal structure (Wang et al. 2011; Kimsey et al. 2015). Finally, determining structures larger than tRNA remains difficult. Recently, a more sophisticated NMR-relaxation-dispersion method has been developed to facilitate the visualization of transient Watson–Crick-like mismatches in DNA and RNA, but it is restricted to GT and GU base pairs (Kimsey et al. 2015).
As an alternative to biochemical-detection methods, nucleotide-substitution models infer the substitution process and are not limited to canonical-base-pair states. By assuming that the substitution process is constant within a given lineage (Yang 1994; Jow et al. 2002), equilibrium frequency and rate parameters can be defined, including substitutions involving transient AC/CA base pairs. Here, we applied the 16-state RNA model, which accounts for all base-pair combinations (Allen and Whelan 2014), to the ribosomal RNA region internal transcribed spacer 2 (ITS2), which is a widely used phylogenetic marker in plants (Álvarez and Wendel 2003; Chen et al. 2010; Qin et al. 2017; Li et al. 2019), in order to determine whether AC/CA plays a role similar to that of GU/UG in RNA compensatory mutations, as shown in Figure 1.
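The state space of a 16-state base-pair model, and the two single-substitution intermediates that lie between a pair of Watson–Crick states, can be enumerated directly. The following Python sketch is illustrative only; the names `STATES`, `classify`, and `one_step_intermediates` are hypothetical and are not part of PHASE or any other package used in this study.

```python
from itertools import product

BASES = "ACGU"
WATSON_CRICK = {"AU", "UA", "GC", "CG"}
GU_WOBBLE = {"GU", "UG"}
AC_MISMATCH = {"AC", "CA"}

# All ordered (left base, right base) combinations: the 16 base-pair states.
STATES = ["".join(p) for p in product(BASES, repeat=2)]


def classify(pair):
    """Assign a base-pair state to one of four descriptive classes."""
    if pair in WATSON_CRICK:
        return "watson_crick"
    if pair in GU_WOBBLE:
        return "gu_wobble"
    if pair in AC_MISMATCH:
        return "ac_mismatch"
    return "other_mismatch"


def one_step_intermediates(start, end):
    """Return the two states reachable from `start` by one substitution
    that are also one substitution away from `end`."""
    if start[0] == end[0] or start[1] == end[1]:
        raise ValueError("start and end must differ at both positions")
    # Change the left base first, or change the right base first.
    return [end[0] + start[1], start[0] + end[1]]


noncanonical = [s for s in STATES if s not in WATSON_CRICK]
print(len(STATES), "states,", len(noncanonical), "noncanonical")
print("AU -> GC via", one_step_intermediates("AU", "GC"))
print("CG -> UA via", one_step_intermediates("CG", "UA"))
```

For any change between AU and GC (or between UA and CG), the only two single-substitution intermediates are a GU-type pair and an AC-type pair, which is why these two states are the candidates compared in this study.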

FIGURE 1.

A schematic representation of substitution processes involving compensatory mutations. The base-pair substitutions in black are existing pathways proposed by Tillier and Collins (1998); the base-pair substitutions in red are proposed in this study. The thickness of the solid arrows and their shading indicate the substitution rates inferred from our data sets. The dashed arrows indicate the simultaneous double-substitution pathway that was not supported by our best-fit 16D/E/F/J model for 70 of the 80 lineages. Letter(s) on the arrows represent substitution rates between base pairs.

RESULTS

Characteristics of AC-base-pair substitutions from the best-fit double-substitution rate matrices

The best-fit single-substitution models for 70 of our 80 study lineages were RNA16D (67 lineages), RNA16E (1 lineage), RNA16F (1 lineage), and RNA16J (1 lineage), none of which allows for simultaneous substitutions of both nucleotides in a base pair (Supplemental Table S1). However, the 16D/E/F single-substitution models are subject to an artifact wherein the total probability of change between AU (UA) and GC (CG) is identical for both the GU (UG) and AC (CA) intermediates (i.e., αn*βn is always equal to μn*γn; Supplemental Fig. S1). Because this artifact of the 16D/E/F models constrains the very rates that we are interested in, we instead used double (simultaneous) substitution models to estimate rates (Supplemental Table S2). The best-fit double-substitution models for these 80 lineages were 16C (61 lineages) and 16A (19 lineages). These models assign each base pair an equilibrium frequency and a substitution rate to each of the other base pairs. By summing across the substitution rates for each base pair we obtained the net rate of change from one base pair to the others (“mutability”). There are eight possible CBC pathways, which can be divided into four groups based on the same WC base-pair change, albeit through different intermediates (e.g., CG can change into UA through CA or UG; Table 1; Fig. 1). Our averaged results for the 80 lineages indicate that substitution rates from these intermediates to WC pairs differ greatly but are generally proportional within each pathway. For example, the rate of AC → GC (γ1, 1.8800; Table 1) is very different from its counterpart of CA → CG (γ2, 2.5713). But they are both approximately twice the rates of GU → GC (β1, 0.9064) and UG → CG (β2, 1.2627), respectively, in each of these two CBC pathways. Given that other rates and frequencies were similar among the four groups (Table 1; Fig. 2; Supplemental Fig. S2), we chose to focus on one group (AU → GC) to demonstrate these common characteristics.

TABLE 1.

Comparison of average substitution elements between intermediate base pairs across the 80 best-fit double-substitution rate matrices among all eight possible compensatory substitutions

FIGURE 2.

Comparisons of substitution rates and base-pair frequencies between AC (CA) and GU (UG) intermediate base pairs in CBCs. All results are shown based on the best-fit double-substitution rate matrix. (A) Frequency-mutability scatter plot of AC and GU base pairs and their chemical structures. (B) Box-plots for single-substitution rates from AC to six other base pairs. (C) Comparisons of rate heterogeneity between AC- and GU-mediated CBCs. The two substitution-rate ratios for AC-mediated CBCs are generally higher than those for GU-mediated CBCs. (D) Comparisons of the total probabilities of change from AU (UA) to GC (CG) based on multiplying the pairs of single-substitution rates. α1*β1 and µ1*γ1 represent the probability of GU- and AC-mediated CBCs, respectively (Fig. 1).

Taken across all 80 lineages with the best-fit double-substitution models, the average AC frequency was only 14% that of the average GU frequency (0.0111 AC vs. 0.0786 GU; Table 1; Supplemental Table S2). However, the AC mutability was higher than the GU mutability for 63/80 lineages, averaging 196% that of GU (3.2122 AC vs. 1.6389 GU; Table 1; Supplemental Table S2). The lower frequency but higher mutability of AC is largely distinct from the higher frequency and lower mutability of GU (Fig. 2A). Taken together, AC → GC and AC → AU account for 90.0% of the total rate of the six possible one-site base-pair changes from AC (1.8800 GC, 1.0103 AU, 0.0771 CC, 0.0817 AG, 0.0860 UC, 0.0770 AA; Fig. 2B). The high mutability of AC (Fig. 2A) and this substitution bias (Fig. 2B) indicate an additional pathway of compensatory mutation involving an AC intermediate despite the low frequency of AC base pairs (Figs. 1, 2A).
We compared the compensatory substitution rates of AU → GC between the two alternative compensatory-mutation pathways (Fig. 1) and found that the average mutation rate from AU to GU (α1; 0.2828, Table 1; Supplemental Table S2) was about four times (403% of) that for AU to AC (μ1; 0.0701). But the average mutation rate from GU to GC (β1; 0.9064) was 52% lower than that for AC to GC (γ1; 1.8800, Table 1; Supplemental Table S2). For each compensatory mutation, the substitution rate between AU and the intermediate base pair (AC or GU) was always lower than that between the intermediate base pair and GC (AC or GU → GC; μ1<γ1, α1<β1; Table 1; Fig. 1; Supplemental Table S2). We found that in AC-mediated CBCs, γ was always higher while μ was always lower than their GU-mediated counterparts β and α, respectively. Thus γ1/μ1 > β1/α1 in 75/80 lineages (15.3-fold on average; Fig. 2C; the four exceptions [Asarum, Fraxinus, Rhodiola and Viburnum], for which μ1 = 0, are not shown in Fig. 2C; Supplemental Table S2). Again, these substitution rates are consistent with our inference that AC-mediated CBCs are a distinct process from GU-mediated CBCs.
The total probabilities of change from AU to GC based on the double-substitution models are shown in Table 1, Supplemental Table S2, and Figure 2D. Only three lineages had equal probabilities for α*β and μ*γ. Of the remaining 77 lineages, 25 had higher probabilities of AC-mediated CBCs and 52 had higher probabilities of GU-mediated CBCs, indicating no consistently higher rate for either pathway. Taken across all eight CBC pathways, we observed that the total probabilities of CBC change via AC and CA intermediate base pairs were about half of those through GU and UG intermediates (0.1174 μ*γ vs. 0.2264 α*β; Table 1).
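The arithmetic behind the mutability and rate comparisons above can be retraced from the averaged values quoted in the text. This is a minimal sketch with hypothetical variable names; it uses only the across-lineage averages, so ratios computed here need not match figures that were averaged lineage by lineage (such as the 15.3-fold value).

```python
# Average single-substitution rates out of the AC base pair (values from the text).
ac_exit_rates = {
    "GC": 1.8800, "AU": 1.0103, "CC": 0.0771,
    "AG": 0.0817, "UC": 0.0860, "AA": 0.0770,
}

# Mutability: the summed rate of change from one base pair to the others.
ac_mutability = sum(ac_exit_rates.values())

# Share of that total taken by the two steps of the AU-AC-GC pathway.
compensatory_share = (ac_exit_rates["GC"] + ac_exit_rates["AU"]) / ac_mutability

# Averaged rates for the AU -> GC group (Table 1 values quoted in the text).
alpha1, beta1 = 0.2828, 0.9064   # AU -> GU, GU -> GC
mu1, gamma1 = 0.0701, 1.8800     # AU -> AC, AC -> GC

first_step_ratio = alpha1 / mu1      # GU-mediated first step relative to AC-mediated
second_step_ratio = beta1 / gamma1   # GU-mediated second step relative to AC-mediated
heterogeneity_ac = gamma1 / mu1      # second step vs. first step, AC pathway
heterogeneity_gu = beta1 / alpha1    # second step vs. first step, GU pathway

print(f"AC mutability: {ac_mutability:.4f}")
print(f"AU/GC share of AC exits: {compensatory_share:.1%}")
print(f"alpha1/mu1: {first_step_ratio:.2f}, beta1/gamma1: {second_step_ratio:.2f}")
print(f"gamma1/mu1: {heterogeneity_ac:.1f}, beta1/alpha1: {heterogeneity_gu:.1f}")
```

The summed exit rates agree with the reported AC mutability to within rounding, and the two step-wise ratios correspond to the "about four times" and "52% lower" comparisons in the text.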

Examination of CBCs among lineages

We recorded the number of base pairs involved in compensatory mutations to help assess the power of our model-based rate estimates. Compensatory base changes (CBCs) were inferred at 353 base pairs among 65 of the 80 sampled lineages (Supplemental Table S3). This count is based on those base-pair sites that have both AU (UA) and GC (CG) as well as either UA (AU) or CG (GC) present. From one to 30 CBCs were inferred for each of these 65 lineages. For example, 30 base-pair positions with CBCs were inferred among the 405 sampled species of Allium, whereas a single base-pair position with a CBC was inferred among the nine species of Circaea sampled (Supplemental Table S3). Among these 353 base-pair positions, 26 also included the AC (for positions with AU and GC) or CA (for positions with UA and CG) intermediate state but not GU or UG, whereas 189 positions included the GU or UG intermediate state but not AC or CA. These counts correspond with our model-parameter estimates, both of which indicate that the AC/CA intermediate is less frequent than the GU/UG intermediate. Of the remaining 138 positions, 101 included both AC or CA and GU or UG. We counted the occurrence of hemi-CBCs (change at only one of the two base-pair positions) for the remaining 15 non-CBC lineages (Supplemental Table S4). We inferred that 60 AC/CA-mediated hemi-CBCs occurred based on those base-pair sites that have at least one of the following four base-pair combinations: AC–AU, AC–GC, CA–UA, and CA–CG. Likewise, we inferred 138 GU/UG-mediated hemi-CBCs. These counts of hemi-CBCs are consistent with our CBC results and indicate that these 15 non-CBC lineages can also provide useful information in estimating CBC rates. On average fewer species were sampled in non-CBC lineages, and taxon undersampling may account for the lack of inferred CBCs in these 15 lineages.
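The position counts above amount to classifying each stem position by the set of base-pair states observed among the sampled sequences. The sketch below shows one way to do this, assuming a position counts as a CBC position when both stable pairs of a group (AU and GC, or UA and CG) are present; the function name, the category labels, and the toy data are hypothetical, and a position containing both groups is assigned to the first match only.

```python
from collections import Counter

# Stable Watson-Crick pair set -> (AC-type intermediate, GU-type intermediate).
CBC_GROUPS = {
    frozenset({"AU", "GC"}): ("AC", "GU"),
    frozenset({"UA", "CG"}): ("CA", "UG"),
}


def classify_position(observed_states):
    """Classify one stem position from the base-pair states seen at it."""
    observed = set(observed_states)
    for stable_pairs, (ac_like, gu_like) in CBC_GROUPS.items():
        if stable_pairs <= observed:  # both stable pairs are present
            has_ac = ac_like in observed
            has_gu = gu_like in observed
            if has_ac and has_gu:
                return "cbc_both_intermediates"
            if has_ac:
                return "cbc_ac_only"
            if has_gu:
                return "cbc_gu_only"
            return "cbc_no_intermediate"
    return "no_cbc"


# Toy stem positions (illustrative data, not taken from the alignments).
toy_positions = [
    {"AU", "AC", "GC"},
    {"AU", "GU", "GC"},
    {"UA", "CA", "UG", "CG"},
    {"AU", "GC"},
    {"AU", "GU"},
]
tally = Counter(classify_position(states) for states in toy_positions)
print(tally)

# Consistency check on the reported counts: 353 CBC positions in total,
# 26 with only AC/CA and 189 with only GU/UG leave 138 positions.
remaining = 353 - 26 - 189
print(remaining)
```

With the reported counts, 101 of the 138 remaining positions include both intermediates, so 37 positions show neither intermediate.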

Comparison of CBCs and model-rate estimates between LocARNA and 4SALE alignments

The consensus secondary structures generated separately by LocARNA and 4SALE delimited similar stem and CBC positions for our three test genera: a total of 40 CBC positions among 206 stem positions by LocARNA versus 34 among 187 positions by 4SALE (Supplemental Table S5). Among these CBC positions, 39 included an intermediate base pair in the LocARNA alignments and 31 in the 4SALE alignments. When attention is restricted to those positions for which only one intermediate base pair is included, a total of 16 GU/UG-mediated CBCs are inferred in both sets of alignments, whereas the sum of AC/CA-mediated CBCs dropped from six to four, including zeroes for Nepenthes and Potentilla. We further compared AC frequency and substitution rates from AC to the six other single-substitution base pairs between the LocARNA and 4SALE alignments (Supplemental Table S6). The AC frequency increased in Hypericum but decreased in Nepenthes and Potentilla. In five of six cases, the AC → GC and AC → AU substitution rates were lower for the 4SALE alignments, but in every case these were the two highest of the six substitution rates from AC, both of which are part of the GC–AC–AU pathway. Taken across all eight possible compensatory-substitution pathways in these three genera, the average AC/CA frequency is identical in both alignments (0.0110; Supplemental Table S7), whereas mutability is lower for the 4SALE alignments (1.6012 for LocARNA vs. 1.3130 for 4SALE). The average overall rate of the AC/CA intermediate also decreased from the LocARNA alignment to the 4SALE alignment (0.0434 vs. 0.0398; Supplemental Table S7). But the average overall rates relative to the GU/UG intermediate in each of these two alignments are similar (0.18; 0.0434/0.2467 vs. 0.0398/0.2267). Taken together, our results based on the LocARNA and 4SALE alignments are similar, and both support the alternative AC pathway.
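The comparison of the two alignment methods rests on two simple ratios, which the following sketch recomputes from the values quoted above. The dictionary layout and names are hypothetical.

```python
# Values quoted in the text for the three test genera.
alignments = {
    "LocARNA": {"cbc": 40, "stem": 206, "ac_rate": 0.0434, "gu_rate": 0.2467},
    "4SALE": {"cbc": 34, "stem": 187, "ac_rate": 0.0398, "gu_rate": 0.2267},
}

# Overall AC/CA-mediated rate relative to the GU/UG-mediated rate.
relative_rate = {
    name: v["ac_rate"] / v["gu_rate"] for name, v in alignments.items()
}
# Proportion of stem positions at which a CBC was inferred.
cbc_fraction = {
    name: v["cbc"] / v["stem"] for name, v in alignments.items()
}

for name in alignments:
    print(f"{name}: AC/GU relative rate {relative_rate[name]:.2f}, "
          f"CBC fraction of stem positions {cbc_fraction[name]:.3f}")
```

Both relative rates round to 0.18, which is the basis for treating the two alignments as giving similar support for the AC pathway.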

Optimization of AC/CA-mediated CBCs

Both model-parameter estimates and our counts of base-pair positions for which CBCs are inferred indicate the occurrence of AC-mediated CBCs. To complement these results, we also optimized potential AC/CA-mediated CBCs on the inferred gene trees. Despite the instability of the AC/CA base pairs, our dense taxon and specimen sampling enabled us to directly identify four unambiguous AC/CA-mediated CBCs among the 26 base-pair sites including AU–AC–GC or UA–CA–CG (Supplemental Table S3). Each of these four CBC sites was observed in a different lineage. In three of the four cases (Fig. 3A-C), specimens with the AC/CA intermediates were resolved as a paraphyletic group separating taxa with the stable base pairs, and Fitch optimization of the AC/CA intermediate was unambiguous. In the fourth case (Fig. 3D), the clade of specimens with base pair CA is sister to the clade of specimens with the base pair CG, and Fitch optimization is ambiguous but consistent with the UA-CA-CG pathway. These four cases are unlikely to be sequencing artifacts because the AC/CA intermediate was observed in numerous specimens and at least three species. We identified all four possible AC/CA-mediated CBCs (Fig. 1). In Aralia we inferred a GC → AC → AU CBC at position 145 (stem IV; Fig. 3A). In Astilbe we inferred a reverse AU → AC → GC CBC at position 17 (stem I; Fig. 3B). In Meconopsis we inferred a CG → CA → UA CBC at position 80 (stem II, Fig. 3C). In Celastrus, we inferred a reverse UA → CA → CG CBC at position 98 (stem III; Fig. 3D). All four of these cases of AC/CA-mediated CBCs support the CBC species concept (Wolf et al. 2013), which was proposed based on GU-mediated CBCs, because none of the species were inferred to retain both stable base pairs of a given CBC.
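The contrast between unambiguous and ambiguous optimization can be illustrated on toy trees. The sketch below applies unordered (Fitch) parsimony, in which every state change costs one step, but evaluates it by exhaustive enumeration rather than by Fitch's two-pass algorithm, which keeps the code short at the cost of scaling only to very small trees. The trees, function names, and state alphabets are hypothetical and do not reproduce the gene trees in Figure 3.

```python
from itertools import product


def index_tree(tree):
    """Flatten a nested-tuple tree into parent-child edges.
    Leaves are base-pair strings; internal nodes are numbered in
    post-order, so the root receives the highest number."""
    edges = []
    counter = 0

    def visit(node):
        nonlocal counter
        if isinstance(node, str):
            return ("leaf", node)
        children = [visit(child) for child in node]
        node_id = counter
        counter += 1
        for child in children:
            edges.append((node_id, child))
        return ("node", node_id)

    visit(tree)
    return edges, counter


def optimal_states(tree, alphabet):
    """Return the minimum number of changes and, for each internal node,
    the states it takes in at least one most-parsimonious reconstruction."""
    edges, n_internal = index_tree(tree)
    best_cost, best = None, []
    for assignment in product(alphabet, repeat=n_internal):
        cost = 0
        for parent_id, (kind, value) in edges:
            child_state = value if kind == "leaf" else assignment[value]
            cost += assignment[parent_id] != child_state
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, [assignment]
        elif cost == best_cost:
            best.append(assignment)
    per_node = [sorted({a[i] for a in best}) for i in range(n_internal)]
    return best_cost, per_node


# Intermediate-state tips form a grade between the two stable states.
grade_tree = ("GC", ("GC", ("AC", ("AC", ("AU", "AU")))))
# Intermediate-state clade is sister to a stable-state clade.
sister_tree = ("UA", ("UA", (("CA", "CA"), ("CG", "CG"))))

cost_grade, nodes_grade = optimal_states(grade_tree, ["GC", "AC", "AU"])
cost_sister, nodes_sister = optimal_states(sister_tree, ["UA", "CA", "CG"])
print(cost_grade, nodes_grade)
print(cost_sister, nodes_sister)
```

In the grade-shaped toy tree every internal node has a single optimal state, giving a GC → AC → AU series. In the sister-clade toy tree the node joining the CA and CG clades can take any of the three states at equal cost, so the reconstruction is compatible with, but does not require, a UA → CA → CG series.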

FIGURE 3.

Optimization of AC/CA-mediated CBCs on ITS2 gene trees for four lineages. AC-mediated CBCs are inferred in (A) lineage Aralia and (B) lineage Astilbe; CA-mediated CBCs are inferred in (C) lineage Meconopsis and (D) lineage Celastrus. Base-pair states involved in CBCs are indicated using different colors for the applicable branches and species. Species names in bold font indicate the outgroup, which was used to root each tree. Branches with ≥0.5 posterior probabilities for Bayesian inference are highlighted in bold. Numbers following a species name indicate GenBank accession numbers.

DISCUSSION

An accurate description of RNA evolution is important for both systematic and functional biology. In this study, we inferred the transient noncanonical base pairs by using model-parameter estimates and investigated the base-pair substitution process in RNA stems. The GU base pair has traditionally been regarded as the intermediate of the two-step compensatory mutation from AU to GC (Rousset et al. 1991). Our results demonstrate that AC is another intermediate, thereby expanding the model of compensatory mutations (Tillier and Collins 1995, 1998). It has traditionally been assumed that since the noncanonical base pairs are rare, they probably provide little phylogenetic signal (Tillier and Collins 1998; Higgs 2000; Savill et al. 2001). Our results demonstrate that compensatory mutations occur on RNA stems mainly through GU or AC intermediates, thereby indicating their importance in RNA evolution (Fig. 1). Rousset et al. (1991) found that substitutions between AU and GC always occurred through a GU intermediate in the rapidly evolving D1 and D2 domains of Drosophila large subunit (LSU) rRNA, which was the basis for most of the GU-mediated-compensatory-substitution models (Schöniger and Von Haeseler 1994; Muse 1995; Rzhetsky 1995; Tillier and Collins 1995, 1998; Higgs 1998). However, other authors using slowly evolving SSU rRNA sampled from the rRNA database (http://bioinformatics.psb.ugent.be/webtools/rRNA/; Van de Peer et al. 1998) found that the rate of simultaneous substitutions at both positions is greater than the rate of single substitutions through the GU intermediate (Tillier and Collins 1998; Savill et al. 2001). These authors inferred that some compensatory substitutions between WC base pairs have not necessarily gone through the GU intermediate. This inference is consistent with our findings of an alternative AC-mediated CBC, for which the overall rate is roughly half that of GU-mediated CBCs (0.1174 μ*γ, 0.2264 α*β; Table 1). 
Simultaneous substitutions were also inferred, albeit at a still lower rate (Fig. 1; 0.088 rds; data not shown). Given the high mutability of AC, AC (CA) intermediates are transient and frequently unobserved. Hence we propose that AC-mediated CBCs are frequently misinterpreted as simultaneous double substitutions, artificially increasing the inferred rate of double substitutions and therefore the fit of double-substitution models (Tillier and Collins 1998). This proposal helps reconcile the controversial issue of interpreting double substitutions in rate matrices (Savill et al. 2001).
Because a base change on one side of a pair often decreases or increases the stability of the stem, which is deleterious or advantageous to RNA function (Kimura 1985; Chen and Stephan 2003; Meer et al. 2010), a concern is how selective constraint acts on the different base changes of a compensatory mutation. We found that the relatively stable GU intermediate had a higher frequency and that its substitution rates differed only slightly during the CBC process (Fig. 1). In contrast, the unstable AC intermediate is consistently rare and its substitution rates varied substantially during the two-step process, indicating that AC is under different selective pressures than GU. These observations are consistent with Kimura's (1985) hypothesis that deleterious intermediates can be maintained at low frequency but cannot reach fixation until another complementary mutation occurs. In the context of GU and AC, we infer that the GU intermediate can persist longer than the AC intermediate until a compensatory mutation occurs, and that both of these alternatives frequently occur in angiosperm ITS2. The generality of our ITS2-based results and inferences should be tested in other RNA regions.
The function of RNA molecules depends on their secondary and tertiary structures, which are determined by their base-pair interactions. A better understanding of the base-pair-change process is thus important for elucidating the mechanisms of RNA functional evolution. Both our model-parameter estimates and base-pair counts among diverse flowering-plant lineages indicate that the AC (CA) base pair is an alternative intermediate to the conventional GU (UG) base pair. These two intermediate pathways have different substitution rates that are consistent with their different levels of stability. Although the occurrence of AC-mediated compensatory mutations is roughly half that of GU-mediated compensatory mutations, incorporation of the AC intermediate helps address the controversial inference of simultaneous double substitutions.

MATERIALS AND METHODS

Taxon sampling and sequence acquisition

We sampled lineages for which ITS2 sequences are available from closely related species as well as multiple specimens per species in most cases in order to sample low-frequency base pairs. We focused on empirical studies that presented plant DNA barcodes because they generally have sufficient inter- and intraspecific sampling to achieve effective species identification (Hebert et al. 2004; www.ibol.org). For the study lineages, all sequences with the annotation “internal transcribed spacer 2” or “internal transcribed spacer” were retrieved from GenBank between September 2018 and February 2019. Incomplete ITS2 sequences were excluded. We sampled most of the major flowering-plant lineages (Chase et al. 2016). We assessed species coverage and validity of species names according to the Plant List (http://www.theplantlist.org/). A total of 80 lineages, each consisting of a single genus, subgenus, or species group, were sampled. Taken across all 80 lineages, 3945 species and 13,741 sequences were sampled. These lineages are distributed among 55 families and 35 orders (Supplemental Fig. S3; Supplemental Table S8; Chase et al. 2016).

Alignment partition and phylogenetic analyses using RNA-specific models

ITS2 boundaries were identified by using GenBank annotations or hidden Markov models implemented in the ITS2 Ribosomal RNA Database (http://its2.bioapps.biozentrum.uni-wuerzburg.de/; Ankenbrand et al. 2015). We aligned the raw sequences using the MAFFT auto-select strategy (Katoh and Standley 2013) and then imported the sequence alignments into LocARNA 1.9.2.1 (Supplemental Fig. S4A,B; Will et al. 2007, 2012). LocARNA was used for consensus-secondary-structure annotation for each of the 80 lineages independently of each other. The consensus secondary structure generated by LocARNA was viewed and refined by using PseudoViewer3 (Byun and Han 2009) according to the conserved “four-helix model” of ITS2 in Eukaryota (Coleman 2003; Schultz et al. 2005). We partitioned the sequence alignment into unpaired and paired regions according to their consensus secondary structure (Supplemental Fig. S5), and then selected among conventional DNA models and RNA-specific models with a Perl script (model_selection.pl) from PHASE package 3.0 (Allen and Whelan 2014). This Perl script selects among two DNA models (HKY85 and REV), seven RNA 7-state models, and nine RNA 16-state models. We used Allen and Whelan's (2014) likelihood-correction method to facilitate comparison between the 4-, 7-, and 16-state models. To test whether our LocARNA alignments are robust, we also applied an alternative alignment method that uses a 12 × 12 scoring matrix to three genera for which the greatest potential number of AC-mediated CBCs (as measured by the sixth and tenth columns of Supplemental Table S3) were inferred based on the LocARNA alignments (Hypericum, Nepenthes, and Potentilla; the 405 sampled species of Allium were excluded because we were unable to obtain a consensus secondary structure from 4SALE). 
With 4SALE, sequence-structure data are coded for each of the four nucleotides with three structural states (paired left, paired right or unpaired), wherein the scoring function for substitutions and gap costs are estimated specifically for ITS2 (Wolf et al. 2014). We performed this alignment according to Schultz and Wolf's (2009) tutorial as follows. First, the Vienna format of each ITS2 secondary structure was predicted by homology modeling from the online ITS2 Database (Koetschan et al. 2012; Ankenbrand et al. 2015). Second, sequences with homologous structures were synchronously aligned by using 4SALE 1.7 (Seibel et al. 2006, 2008). Third, the graphical form of the consensus secondary structure was generated after manual refinement using the 4SALE editor. Fourth, the 50% majority consensus secondary structure was transformed manually into Vienna format for subsequent analyses as described above for the LocARNA alignment. Gene-tree analyses for each of the 80 separate lineages were performed using the PHASE 3.0 package (Jow et al. 2002; Hudelot et al. 2003; Allen and Whelan 2014) with the best-fit mixed model (HKY85 or REV for unpaired regions and a 16-state RNA-base-pair model for paired regions; Allen and Whelan 2014). Bayesian MCMC phylogenetic inference was performed using the mcmcPHASE program (Allen and Whelan 2014). Two independent runs that each consisted of four MCMC chains were run for one million generations each, sampling every 100 generations, with a burn-in of 3000 (30%) trees. The mcmcsummarize program in the PHASE package was then used to calculate the majority-rule-consensus topology and posterior probabilities. These trees were viewed by using FigTree 1.5.4 (http://tree.bio.ed.ac.uk/software/figtree/) and edited by using the interactive tree of life (iTOL: https://itol.embl.de/). When applicable, the inferred trees were rooted using the same outgroups that the original authors applied in their previous DNA barcoding or phylogenetic studies. 
For the remaining 15 lineages, we added two or three outgroups from closely related lineages. For each data set, base-pair frequencies and substitution-rate parameter values were generated using the mcmcsummarize program from the PHASE package. Statistical analyses were then performed to summarize these results using SPSS 22.0 (SPSS).
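Partitioning an alignment into paired and unpaired regions starts from a consensus structure in Vienna (dot-bracket) notation. The sketch below shows the general idea for a single sequence, assuming a pseudoknot-free structure written with `(`, `)`, and `.` only; the function names and the example sequence are hypothetical, and the code does not reproduce the input formats of PHASE, LocARNA, or 4SALE.

```python
def paired_columns(dot_bracket):
    """Return sorted (left, right) index pairs from a dot-bracket string."""
    stack, pairs = [], []
    for index, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            pairs.append((stack.pop(), index))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return sorted(pairs)


def partition(sequence, dot_bracket):
    """Split one aligned sequence into base-pair states and unpaired bases."""
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")
    pairs = paired_columns(dot_bracket)
    paired_index = {i for pair in pairs for i in pair}
    base_pairs = [sequence[i] + sequence[j] for i, j in pairs]
    unpaired = "".join(
        base for k, base in enumerate(sequence) if k not in paired_index
    )
    return base_pairs, unpaired


# Toy example: a 15-column stem-loop (illustrative, not an ITS2 sequence).
structure = "((..((...))..))"
sequence = "AGUUAGAAAUCCCCU"
base_pairs, unpaired = partition(sequence, structure)
print(base_pairs)
print(unpaired)
```

Treating each base pair as a single two-letter state is what allows a 16-state model to be applied to the paired partition while a conventional four-state model is applied to the unpaired bases.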

Inferring the compensatory substitution process

To infer CBCs, both positions of a base pair should be considered at the same time. To facilitate this inference, we transformed all paired bases into alternative symbols as follows. Base-pair information was included in the sequence-structure alignment generated from LocARNA (Supplemental Fig. S4B; Will et al. 2007, 2012). Each base pair was treated as a unit and transformed into a 28-symbol coding matrix using RNAstat (Supplemental Fig. S4C1; Subbotin et al. 2007). To supplement our model-based results, we optimized potential AC-mediated compensatory mutations on the inferred trees by using Fitch (parsimony; Fitch 1971) optimization. We did so for aligned positions that contained one of the following four base-pair combinations: F–D–M (representing AU–AC–GC), P–H–J, M–D–F, or J–H–P in the 28-symbol matrix (Supplemental Fig. S4C). Supplementary data, including ITS2 sequence matrices of both secondary-structure guided and base-pair transformed alignments, gene trees from both the best-fit models (16C/D/E/F/J) and the best-fit double-substitution models (16A/C), optimization of AC/CA-mediated CBCs on nine gene trees, Microsoft Excel files containing the raw and summarized data of rate matrices from both the best-fit models and the best-fit double-substitution models, and supplemental data for an alternative alignment method that uses a 12 × 12 scoring matrix for three genera are posted at: https://figshare.com/articles/Supplemental_data_for_Adenine_cytosine_substitutions_are_an_alternative_pathway_of_compensatory_mutation_in_angiosperm_ITS2_/8285447.

SUPPLEMENTAL MATERIAL

Supplemental material is available for this article.

55 in total

Review 1. RNA secondary structure: physical and computational aspects.

Authors: P G Higgs
Journal: Q Rev Biophys Date: 2000-08 Impact factor: 5.318

2. ITS2 is a double-edged tool for eukaryote evolutionary comparisons.

Authors: Annette W Coleman
Journal: Trends Genet Date: 2003-07 Impact factor: 11.639

3. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution.

Authors: H Jow; C Hudelot; M Rattray; P G Higgs
Journal: Mol Biol Evol Date: 2002-09 Impact factor: 16.240

4. Evolution of compensatory substitutions through G.U intermediate state in Drosophila rRNA.

Authors: F Rousset; M Pélandakis; M Solignac
Journal: Proc Natl Acad Sci U S A Date: 1991-11-15 Impact factor: 11.205

5. A stochastic model for the evolution of autocorrelated DNA sequences.

Authors: M Schöniger; A von Haeseler
Journal: Mol Phylogenet Evol Date: 1994-09 Impact factor: 4.286

6. Estimating substitution rates in ribosomal RNA genes.

Authors: A Rzhetsky
Journal: Genetics Date: 1995-10 Impact factor: 4.562

7. Estimating the pattern of nucleotide substitution.

Authors: Z Yang
Journal: J Mol Evol Date: 1994-07 Impact factor: 2.395

8. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering.

Authors: Sebastian Will; Kristin Reiche; Ivo L Hofacker; Peter F Stadler; Rolf Backofen
Journal: PLoS Comput Biol Date: 2007-02-22 Impact factor: 4.475

9. Assessing the state of substitution models describing noncoding RNA evolution.

Authors: James E Allen; Simon Whelan
Journal: Genome Biol Evol Date: 2014-01 Impact factor: 3.416

10. A novel form of RNA double helix based on G·U and C·A⁺ wobble base pairing.

Authors: Ankur Garg; Udo Heinemann
Journal: RNA Date: 2017-11-09 Impact factor: 4.942
