Literature DB >> 30912655

Mapping a Systematic Ribozyme Fitness Landscape Reveals a Frustrated Evolutionary Network for Self-Aminoacylating RNA.

Abe D Pressman^1,2, Ziwei Liu^3,4, Evan Janzen^1,5, Celia Blanco¹, Ulrich F Müller⁶, Gerald F Joyce⁷, Robert Pascal⁴, Irene A Chen^1,5.

Abstract

Molecular evolution can be conceptualized as a walk over a "fitness landscape", or the function of fitness (e.g., catalytic activity) over the space of all possible sequences. Understanding evolution requires knowing the structure of the fitness landscape and identifying the viable evolutionary pathways through the landscape. However, the fitness landscape for any catalytic biomolecule is largely unknown. The evolution of catalytic RNA is of special interest because RNA is believed to have been foundational to early life. In particular, an essential activity leading to the genetic code would be the reaction of ribozymes with activated amino acids, such as 5(4 H)-oxazolones, to form aminoacyl-RNA. Here we combine in vitro selection with a massively parallel kinetic assay to map a fitness landscape for self-aminoacylating RNA, with nearly complete coverage of sequence space in a central 21-nucleotide region. The method (SCAPE: sequencing to measure catalytic activity paired with in vitro evolution) shows that the landscape contains three major ribozyme families (landscape peaks). An analysis of evolutionary pathways shows that, while local optimization within a ribozyme family would be possible, optimization of activity over the entire landscape would be frustrated by large valleys of low activity. The sequence motifs associated with each peak represent different solutions to the problem of catalysis, so the inability to traverse the landscape globally corresponds to an inability to restructure the ribozyme without losing activity. The frustrated nature of the evolutionary network suggests that chance emergence of a ribozyme motif would be more important than optimization by natural selection.

Entities: Chemical Disease Species

Mesh：

Substances：

Year: 2019 PMID： 30912655 PMCID： PMC6548421 DOI： 10.1021/jacs.8b13298

Source DB: PubMed Journal: J Am Chem Soc ISSN： 0002-7863 Impact factor: 15.419

Introduction

Molecular evolution is largely governed by the function of fitness in the space of all possible sequences, known as the “fitness landscape”.[1,2] Evolution corresponds to a biased random walk on this landscape, in which mutation enables exploration of neighboring points in sequence space, and natural (or artificial) selection favors hill-climbing toward higher fitness. Therefore, knowledge of the fitness landscape is necessary for a systematic, quantitative understanding of molecular evolution.[3−5] For example, a deep question is whether the landscape allows selection to optimize biochemical activity. If the topography of the fitness landscape is relatively smooth, optimization by selection can occur readily through hill-climbing. However, if the landscape is riddled with low-fitness valleys between local fitness optima, then many potential evolutionary pathways through sequence space will be inaccessible, inhibiting global optimization of activity. A comprehensive map of the fitness landscape would enable understanding of such fundamental issues. Fitness landscapes of ribozymes are of special interest because RNA may have been the first evolving molecule during an “RNA World” at the time of the origin of life.[6−12] In addition, ribozymes have been proposed as the genetic and catalytic basis for a minimal synthetic cell.[13] On the practical side, ribozymes can be relatively short in length (L),[14] so it is possible to interrogate the entirety of sequence space in a laboratory setting (e.g., for L = 21, 421 ≈ 4 × 1012 possible sequences). Recent studies have emphasized the importance of comprehensive coverage vs sparse sampling of sequence space for understanding evolutionary pathways. For example, sparse sampling (e.g., based on known genotypes) can miss viable evolutionary pathways and create a biased view of the fitness landscape.[15,16] Exhaustive data could also aid computational efforts to explore larger sequence spaces.[17,18] Therefore, mapping the comprehensive fitness landscape for ribozymes is an important goal. We previously developed a method for mapping the comprehensive fitness landscape of an RNA aptamer by in vitro selection,[19] with abundance used as a proxy for fitness. However, binding is qualitatively different from catalysis,[20] which involves a reaction pathway, often including covalent modification of the ribozyme, in addition to binding of the substrate and stabilization of the transition state. Furthermore, work by us and others has established methods for measuring affinity constants, ribozyme reaction rates, and RNA processing and thermodynamic stability by high-throughput sequencing, raising the prospect of mapping the landscape in terms of affinity or activity.[21−29] Although prior studies measuring chemical activity were applied to small populations or sparse samples of sequence space, these studies, combined with the ability to map a comprehensive fitness landscape, point toward the possibility of mapping the comprehensive chemical activity landscape for ribozymes. In the current work, we use this combined approach, termed SCAPE (sequencing to measure catalytic activity paired with in vitro evolution), to map a comprehensive ribozyme activity landscape. We focus on an activity that would be foundational to protein translation, perhaps the most impressive invention of the RNA World. Despite its importance, the emergence of protein translation is poorly understood. A key activity is the covalent attachment of specific amino acids to specific tRNAs, which establishes the biophysical information content of the “second genetic code”.[30] In modern biology, this attachment is catalyzed by aminoacyl-tRNA synthetases, but self-aminoacylating ribozymes could have been the original basis of the tRNA/synthetase system. Ribozymes that react with aminoacyl adenylates or other activated substrates have been discovered,[31−35] illustrating the ability of ribozymes to catalyze formation of aminoacyl-RNAs, although the substrates studied previously are prebiotically implausible or highly unstable. In contrast, N-carboxyanhydrides (NCAs) and the related 5(4H)-oxazolones can be produced from amino acids (or peptides) by multiple prebiotically plausible reaction pathways (e.g., with carbonyl sulfide,[36] cyanate,[37] or cyanamide[38] as activating agents). These compounds react with amino acids to form peptides,[39] and therefore have been proposed as a prebiotic form of chemically activated amino acids. At high concentration, NCAs and 5(4H)-oxazolones react with phosphate esters, including nucleotides, to form aminoacyl-RNA mixed anhydrides in low yield,[40−44] suggesting this reaction as a candidate for ribozyme catalysis. Use of 5(4H)-oxazolones avoids uncontrolled polymerization in comparison to NCAs, making oxazolones a practical and prebiotically relevant substrate for in vitro selection. Thus, we apply SCAPE to map the catalytic activity landscape for ribozymes that self-aminoacylate using a prebiotically plausible form of chemical activation, and we analyze the evolutionary and mechanistic implications of the empirically determined ribozyme landscape.

Results

The SCAPE strategy begins with a population of molecules containing a randomized central region of 21 nt flanked by two constant regions used for PCR amplification (total length = 71 nt). In a first step, this library is subjected to in vitro selection for aminoacylation activity to isolate the ribozymes. In a second step to assay the ribozymes’ activities, a pool of the selected molecules that includes many (∼104 to 105) different active sequences is allowed to react with various concentrations of substrate, and the products are isolated and sequenced on the Illumina platform. The sequencing output is used to quantify reaction products[28] and thereby measure the catalytic rates of potentially hundreds of thousands of sequences in parallel. We refer to this second step as kinetic sequencing (k-Seq).

Selection of Aminoacylation Ribozymes

Beginning with a pool of random-sequence RNAs (central random region length L = 21) with high coverage of sequence space (∼70–99.99% coverage; Supporting Text S1), six rounds of in vitro selection for aminoacylation activity were conducted (Figure A). In each round, the RNA pool was reacted with a biotinylated tyrosine analog, biotinyl-Tyr(Me)-oxazolone (BYO). RNAs that react with BYO become covalently attached to the biotin tag, allowing their isolation by binding to streptavidin beads. These RNAs are reverse-transcribed and amplified by PCR, providing templates for the next round of selection and amplification. The progress of the selection was followed by high-throughput sequencing, which yielded 2 × 106 to 1 × 107 sequence reads per round of selection. Two replicates of the selection were performed (RS1 and RS2). Analysis was conducted using RS1, with data from RS2 used to confirm reproducibility of the selection.

Figure 1

In vitro selection for aminoacylation ribozymes. (A) Selection began with DNA templates containing a transcription promoter (gray) and a central region of 21 random-sequence residues (red or blue) flanked by constant regions (black). These templates were transcribed into RNA and incubated with BYO. Aminoacylated RNAs (red) were isolated using streptavidin beads and amplified by RT-PCR for the next round of selection. (B) Pool composition over Rounds 4–6 after clustering. The top 20 families are indicated in non-neutral colors; gray corresponds to unclustered sequences; white corresponds to families with rank by abundance >20. Multiple families from submotif 1A (purple), 1B (dark blue), 1C (cyan), Motif 2 (green), and Motif 3 (yellow) are shown. Inset: Abundance of the top 20 families in Rounds 4–6 (same color scheme, except that the dotted black line corresponds to families of rank >20). (C) SeqLogo representations of the motifs. For each round, sequences were first clustered into families using a maximum edit distance of 3 mutations (substitutions, insertions, or deletions) from the center sequence, which was defined as the sequence of highest abundance in the family. Sequence families could be identified starting in Round 4 (Figure B, Supporting Figure S1A). The 20 ribozyme families of highest center abundance identified in the RS1, Round 5 pool were compared manually to identify conserved sequence motifs. The top 20 families comprised 80% of sequence reads by Round 6 and were consistent in RS1 and RS2 (Supporting Figure S1B). These 20 families could be characterized by one of three distinct motifs, numbered as Motif 1, 2, and 3. Motif 1 contained the shortest conserved region (Figure C, Supporting Figure S1C) and the greatest number of unique sequences contained Motif 1. This motif could be further categorized into three submotifs (1A, 1B, 1C) based on differences in the conserved region, with 14 of the top 20 families containing Motifs 1A or 1B. Motif 2 characterized fewer unique sequences than Motif 1, but more than Motifs 1A, 1B, or 1C. Motif 2 also characterized Family 2.1, the most abundant family in the pool. Of Motifs 1−3, Motif 3 was found in the smallest fraction of the pool and characterized the fewest unique sequences.

Kinetic Sequencing (k-Seq)

We determined the rate constants of the selected ribozymes by a massively parallel assay (kinetic sequencing, or k-Seq; Figure A). In a gel-based assay to measure the rate constant of aminoacylation, a single RNA sequence was mixed with BYO and product formation was monitored by gel shift of the RNA in the presence of streptavidin. In the k-Seq assay, we reacted a heterogeneous pool obtained from in vitro selection, which contained many different RNA sequences, with BYO and isolated the aminoacylated RNAs using streptavidin beads. These RNAs were analyzed by high-throughput sequencing (HTS), yielding the relative abundance of each sequence in the products, which were converted to absolute concentrations by comparison to a standard of known concentration in the product pool. Rate constants ( for sequence s) and maximum amplitude of reaction (A) in both assays were obtained from the dependence of product formation on the concentration of BYO. k-Seq estimates for activity could be obtained for 8.9 × 106 sequences, but the majority of sequences were present at low abundance and correspond to low activity (Supporting Figure S2). ∼105 unique sequences, out of ∼421 possibilities, were found to have activity >10-fold above the noncatalytic background rate (i.e., catalytic enhancement r > 10, where r = kA/k0A0, and and A are the rate constant and amplitude of reaction of the noncatalyzed reaction, measured in the randomized RNA pool).

Figure 2

Emergence of ribozymes and kinetic characteristics. (A) In k-Seq, an RNA pool enriched for active ribozymes is reacted at multiple BYO concentrations, in triplicate. Captured RNA is then reverse-transcribed and sequenced. Activity curves are constructed for sequences detected in the enriched pool. (B) Aminoacylation at various [BYO] for ribozyme S-2.1-a observed by both gel shift and k-Seq Data for all other measured ribozymes are shown in Supporting Figure S3. Error bars correspond to standard deviation among triplicates. (C) Correlation between catalytic enhancement of ten ribozymes, measured by gel shift assay and k-Seq Error bars correspond to standard deviation among triplicates (k-Seq) or 2–3 replicates (gel assay) (R2 = 0.87; Supporting Table S1). Dotted orange line indicates line of unity. To determine how well k-Seq results corresponded to results of the standard assay, we chose ten sequences that are close to the consensus sequences of the high- or medium-activity families (with all five motifs and submotifs represented) and measured aminoacylation activity by the gel-shift assay.[45] Rate constants determined from k-Seq matched well with gel-shift measurements (Figure B,C, Supporting Table S1). All k-Seq and gel-shift measurements were performed in triplicate and the standard error was similar between k-Seq and gel-shift measurements (Supporting Figure S1A). Measurement error during k-Seq decreased as sequence read abundance increased, as expected for stochastic noise. For most sequences with count >10, and nearly all sequences with count >100, the noise of k-Seq measurements appeared to be within a factor of 2 (Supporting Figure S4). High-activity sequences (e.g., the center of Family 2.1, with r = 1010 and = 779 ± 21 min–1 M–1) exhibit saturating kinetics from k-Seq, providing both the rate constant () and the maximum amplitude of reaction (A). However, the reaction for lower activity sequences (approximately < 20 min–1 M–1) appears linear under the conditions tested, so that and A are difficult to estimate separately using these data; instead the combined parameter kA can be estimated (Supporting Figure S5A).

Aminoacylation Site and True Catalytic Enhancement

The most highly abundant sequences from each major motif were chosen (S-1A.1-a, S-1B.1-a, S-2.1-a, S-3.1-a; see Methods for sequence nomenclature) for characterization of the reactive site. Identification of the reactive site was performed in two steps. First, reverse transcription is known to be sensitive to 2′ adducts, such that stalled products can be used to identify the sites of 2′ acylation.[46,47] The putative ribozymes were ligated to a 3′ adapter to test for stalling of reverse transcription along the entire length of the ribozyme. Stalling resulted in a truncated product whose length, determined by gel electrophoresis, suggested a likely site of aminoacylation (Figure A; Supporting Figure S10). Second, the nucleophilic importance of the 2′-OH at the candidate site was verified by testing the activity of a synthetic RNA modified at this position by 2′-O-methylation. In each case, a control synthetic RNA that was instead modified at an adjacent position was also tested. Blocking of the candidate site (but not the control sites) by O-methylation is expected to abolish the reaction. For all sequences tested, the results were consistent with aminoacylation at a specific internal 2′-OH position within the 3′ constant region of the sequence (Figure B; Supporting Figure S6). While the reactive site was conserved for sequences from the same major motif (e.g., S-1A.1-a and S-1B.1-a, both from Motif 1), the site differed among sequences from the three major motifs, indicating that ribozymes with different motifs utilize different detailed reaction mechanisms.

Figure 3

Aminoacylation site and landscape ruggedness. The likely site of BYO modification on ribozyme S-1A.1-a was identified by stalling of reverse transcription, resulting in a truncated product (A). The site, G65, was verified by loss of activity upon 2′-O-methylation, assayed by streptavidin gel shift after BYO reaction (B). 2′-O-Methylation of an adjacent site (C64) did not show loss of activity. (C) Average correlation of fitness effects γ as a function of edit distance d, shown for the sequence families around the five most abundant centers: 2.1 (magenta), 1A.1 (red), 1B.1 (orange), 1B.2 (green), 1A.2 (blue). Note that the catalytic enhancement r calculated here underestimates the true catalytic enhancement at the modified site. The potential nucleophilic sites include 70 internal 2′-OH groups, the vicinal diol at the 3′ end, and the 5′-triphosphate. Thus, the uncatalyzed reaction rate at a particular site is at least 73-fold lower than k0A0, which we measured for the entire RNA. In addition, previous work on oxazolone modification of small RNA oligonucleotide models indicates that the vicinal diol and terminal phosphates (2′, 3′, or 5′) are strongly preferred as nucleophiles, with no detectable reactivity at internal 2′-OH sites.[43,44] In contrast, we found that all ribozymes tested, representing each motif (1A, 1B, 2, 3), were modified at an internal 2′-OH. Therefore, the true catalytic enhancement provided by these ribozymes at a specific internal 2′-OH is likely to be at least 700-fold greater (Supporting Text S2) than the r as reported here.

Frequency Distribution of Catalytic Activity

The log-normal shape of the frequency distribution of catalytic activity kA, is consistent with prior findings.[24,48] Because the rate constant scales exponentially with the activation energy, it was of interest to determine the distribution of alone. For the highest activity family (2.1), many ribozymes could be characterized by and A separately. was observed to fit a log-normal distribution, indicating that activation energies are normally distributed for a ribozyme family (Supporting Figure S5B,C). The distribution of A, which represents the maximum extent of reaction and may indicate the fraction of RNA that is well-folded, also fit well to a log-normal distribution, suggesting that folding energies may also be normally distributed for a ribozyme family. For the regime in which and A could be determined separately, these parameters are not well-correlated with each other (Supporting Figure S5A), suggesting no relationship between the catalytic rate and fraction folded.

Ruggedness of Chemical Activity Peaks

To understand how the overall character of the ribozyme fitness landscape compares with well-known theoretical models, we characterized the ruggedness of the ribozyme peaks. Generally, the fitness of close relatives is highly correlated to each other, but the fitness of more distant relatives is less correlated. A simple measure of ruggedness is the fitness correlation γ for a ribozyme family, which is the average correlation of activity effects of single mutations between sequences at evolutionary distance d of each other[49] (d is the Levenshtein edit distance, i.e., the number of substitutions, insertions or deletions between two related sequences). γ = 1 indicates a perfectly smooth landscape and γ = 0 indicates a highly rugged, completely uncorrelated landscape. γ1 was approximately 0.3–0.4 for all families analyzed, i.e., the typical effect of a particular mutation is 30–40% correlated across all single mutant backgrounds (Figure C, Supporting Figure S7), indicating substantial ruggedness of the fitness peaks. Interestingly, as the neighborhood size increased up to d = 4, γ dropped only slightly (Figure C), indicating that activity remained similarly correlated at longer evolutionary distances within the peaks. The relative constancy of γ over a range of d indicates an underlying smoothness that is felt throughout the peak.

Evolutionary Pathways between Ribozyme Motifs

A series of single mutations defines an evolutionary pathway between two sequences. Although there are very many conceivable pathways, many of these include intermediate sequences of low fitness. Under selection, such fitness valleys represent dead ends that effectively block evolution. An open question is whether viable evolutionary pathways exist between different sequences that catalyze the same reaction. Using the chemical activity data from k-Seq, we searched for viable evolutionary pathways between center sequences of the major ribozyme families (Figure , Supporting Table S2).

Figure 4

Evolutionary pathways for aminoacylation ribozymes. (A) Catalytic enhancement along a best pathway discovered from the center of Family 1B.1 (pink, S-1B.1-a), to 1A.1 (purple, S-1A.1-a), to 2.1 (cyan, S-2.1-a), to 2.2 (gray, S-2.2-a). Capital letters denote sequence positions changing at each step; underscore indicates a deletion. A large drop in activity is required for several mutations between Motif 1 and Motif 2. Error bars are standard deviation from triplicate measurements (only top bar is shown). Also see Supporting Figure S11. Asterisk (*) indicates a sequence that was found in only one replicate (RS1). (B) Evolutionary network displaying the 10 best pathways discovered between the centers of six key families (1A.1, 1B.1, 1C.1, 2.1, 2.2, and 3.1) representing each motif and submotif and the two most active centers from Motif 2. Each node is an individual sequence with activity measured by k-Seq indicated by color (see legend; red indicates activity at or below the baseline rate). The lines indicate mutational distance between sequences (solid black line = 1 mutation). Dotted lines indicate sequences at baseline activity (see legend). The majority (67%) of the edits along these pathways are substitutions; the remainder are indels. A broad network of pathways was found among Families 1A.1, 1B.1, and 1C.1, with a <10-fold catalytic rate decrement at the lowest point of the best pathways. Thus, the families of Motif 1 form a “plateau” in the chemical activity landscape, corresponding to the small size of Motif 1. Similarly, viable pathways exist between the top two families of Motif 2. Although Motif 2 encompasses a smaller region of sequence space compared to Motif 1 due to a larger conserved region, Motif 2 contains the global optimum of the landscape. Viable pathways were not found between families of Motif 3, likely due to the small number of unique sequences in this motif. Within Motifs 1 and 2, the number of viable pathways was relatively small, suggesting that evolution within a motif would be fairly reproducible. However, evolutionary pathways between motifs appeared strikingly different. The only pathways that could be constructed between different motifs contain fitness losses down to baseline activity, with multiple mutational steps occurring at near baseline activity. The closest apposition of motifs was a pathway between Family 3.1 and Family 1A.1, which involves 5 consecutive intermediates expected to have baseline activity (i.e., r ∼ 103-fold less than r). The global optimum (Family 2.1) is especially isolated, with >10 mutations at baseline activity required along any pathway toward a different motif. These pathways would not be viable under selection, indicating that optimization of activity over the global fitness landscape would be frustrated.

Discussion

In the SCAPE method, a ribozyme fitness landscape can be mapped in two steps. First, the vast majority of inactive sequences are removed from the pool through in vitro selection. Second, the catalytic activities of the remaining sequences are directly assayed by kinetic sequencing (k-Seq). In this case, k-Seq yielded estimates for ∼105 unique sequences (a number that in general depends on pool diversity, activity distribution, and sequencing depth). Using SCAPE, we mapped the first comprehensive fitness landscape for catalytic activity, subject to the following caveats. First, in order to survive the selection, sequences must be both catalytically active and replicable (by transcription and RT-PCR). Because RT stalls at the aminoacylated site, ribozymes that aminoacylate within the randomized region are presumably disfavored. Consistent with this, all of the ribozymes tested here react within the 3′ constant region, as modification does not preclude primer binding. Sequences may also have been lost during selection for other reasons (e.g., transcription or RT-PCR bias, and genetic drift in early rounds). While such occasional losses might affect the details of evolutionary pathways, they would likely not affect the overall findings given the extensive fitness valleys found. Alternatively, if the starting library were relatively small (∼106 sequences), k-Seq alone (without selection) could be used to build a comprehensive map of the library; advances in sequencing technologies may push this bound further. Second, fitness is measured in the specific environment applied, in this case for aminoacylation activity under the chemical conditions of the selection. How the environment would affect the fitness landscape, and how aminoacylation activity might relate to the replicative fitness of an RNA World organism (a variety of relationships are possible[50−54]), are difficult to address at present. We discovered ribozymes that self-aminoacylate using a 5(4H)-oxazolone, a key step toward the genetic code. The best ribozyme found here has a rate constant comparable to that of ribozymes obtained using a biologically derived aminoacyl adenylate,[31,32] indicating that these reactions could proceed efficiently even with only prebiotic substrates. Interestingly, all ribozyme families discovered here react at an internal 2′-OH of the RNA. These sites stand in contrast to the modification of modern tRNAs at the vicinal diol (3′ terminus), which is also found to be more reactive in model oligonucleotides.[43,44] It is possible that an internal reaction site facilitates establishment of multiple contacts with BYO, and the rate acceleration caused by these structural features outweighs the intrinsic reactivity of the vicinal diol. Similarly, it is unknown whether the identity of the 3′ terminal sequence (CUG in this study, compared to CCA in tRNAs) may contribute to this finding. This difference raises the interesting question of whether ribozymes such as those discovered in this model system could be on the pathway toward the modern implementation of the genetic code; whether they have the evolutionary capacity to adopt a mechanism more similar to the aminoacyl-tRNA synthetase system is currently unknown. Analysis of individual ribozyme families indicates that the overall topography of each peak can be described as a combination of two components: a “smooth” component (∼40%) in which mutations have additive effects on catalytic activity, and a “rough” component (∼60%) that represents deviations from additivity (i.e., epistasis). This combination resembles the so-called “Rough Mt. Fuji” model, which consists of a perfectly smooth peak overlaid by uncorrelated ruggedness[4,56−59] (Supporting Figure S7; also see discussion below). In addition to discovering novel ribozymes, a primary motivation for SCAPE analysis is to learn about molecular evolution by exhaustively determining the viable evolutionary pathways and networks through sequence space. We found that, while some viable pathways exist locally around an optimum, most conceivable pathways toward the global fitness optimum (Family 2.1) are blocked by extensive fitness valleys. The likely reason is that the three major motifs differ substantially in structure, as indicated by their different aminoacylation sites. It appears that the ribozyme structure cannot be changed without essentially destroying the structure of one ribozyme and building another, requiring extensive mutations at negligible activity. Such evolutionary walks would be essentially impossible while under selection for catalytic activity, frustrating optimization over the network. This landscape can be compared to other landscapes and evolutionary pathways that have been described for functional RNA. Extensive work on in silico folding of RNA sequences has predicted the existence of large neutral networks for secondary structure, in which evolutionary walks over long distances could maintain a given structure.[60−62] Such neutral networks would permit facile exploration of sequence space through evolution. In addition, multiple examples of ribozymes evolving to perform different functions are known.[63−66] In contrast, the previously described landscape for RNAs selected to bind GTP (based on sequence abundance rather than activity measurement; see Supporting Figures S3F and S8) showed that the landscape consisted of several evolutionarily isolated peaks.[19] Thus, it appears that, although preservation of secondary structure could occur over a neutral network, the additional tertiary structural requirements of a functional RNA leads to a qualitative change in the nature of the evolutionary network. Such a change is analogous to the phase transition-like behavior of percolation through a network;[67] as the frequency of active nodes decreases, the network suddenly switches from highly connected, as in the case of neutral networks of RNA secondary structure, to essentially impermeable, as observed for evolutionary networks of functional RNAs. An important caveat is that the landscape reported here was mapped under constant selection for a single catalytic activity and cannot be directly compared to evolutionary pathways leading to new functions; changing environments[68] or selection pressures may significantly alter this picture. The phenomenon of frustration arises when competing interactions prevent overall optimization of a system, resulting in a large number of local maxima. A classic illustration of frustration is the antiferromagnetic spin glass, in which energy would be minimized by antiparallel placement of neighboring electronic spins. In certain configurations (e.g., a triangular lattice), no placement of spins can satisfy all desired constraints, leading to rugged energy landscapes.[69] The analogy to frustrated spin glasses has been explored theoretically to understand fitness landscapes.[3,59,70−72] For example, the NK model of fitness landscapes, in which there are N sites and the fitness contribution of each site is influenced by K other sites, is equivalent to a form of spin glass, with the ruggedness of the landscape tuned by the epistatic parameter K.[74,75] At an extreme, when K = N – 1, the fitness of similar genotypes is completely uncorrelated. This regime is similar to the “House-of-Cards” landscape generated when fitness values are randomly assigned, giving a maximally rugged, degenerate landscape with a very large number of local optima.[3,76,77] The House-of-Cards landscape is also equivalent to the random energy model approximation for a disordered spin glass.[78] These theoretical models of maximally rugged energy or fitness landscapes are characterized by frustration: no configuration (i.e., sequence) can simultaneously satisfy all desirable interactions. The ruggedness of the empirically determined ribozyme fitness landscape reported here can be described by the Rough Mt. Fuji model, which is a combination of a smooth “Mt. Fuji” landscape and the random House-of-Cards landscape,[57,79] with the weighting of this combination reflected by the fitness correlation γ.[49] Our analysis of γ showed that the rugged House-of-Cards component dominated the shape of the major peaks of the landscape, with ∼60% of the variance of fitness being attributable to random contributions. (Note that this ruggedness is not fully represented in Figure , which only illustrates the best evolutionary pathways found between peaks.) Thus, while the fitness landscape observed here is not entirely uncorrelated, we suggest that the major House-of-Cards character found implies a substantial level of frustration, consistent with the lack of viable evolutionary pathways among the major optima. Frustration in biological systems has also been invoked to understand the folding energy landscape of proteins,[81−83] where individual local molecular arrangements that minimize energy may be mutually incompatible, resulting in rugged energy landscapes and misfolded states. Other systems that may exhibit some frustration include gene expression networks,[84] morphological innovation,[85] and even the evolution of biological complexity.[86] Our results show that the experimentally determined ribozyme activity landscape exhibits a degree of frustration, as individually beneficial mutations are often mutually incompatible, leading to ruggedness on the fitness landscape.[72,87] Walks on such energy or evolutionary landscapes are characterized by sensitivity to initial conditions, difficult optimization, and multiple possible outcomes. It should be noted that mechanisms that favor greater genetic diversity, such as recombination, gene duplication, or epistasis among genes, could enable crossing of fitness valleys.[88,89] Recent work suggests that recombination, in particular, can occur spontaneously in pools of RNA.[90,91] The quantitative effect of such mechanisms on traversal of the fitness landscape is unknown at present. Nevertheless, in the absence of such mechanisms, the emergence of a globally optimal sequence is likely to result from chance events rather than natural selection.

Methods

Synthesis of Biotinyl-Tyr(Me)-Oxazolone (BYO)

General Synthetic Procedures

Reagents and solvents were obtained from Fluka, Sigma-Aldrich or Bachem, and were used without further purification. NMR spectra in either CDCl3, DMSO-d6 or D2O solution were recorded on a Bruker DPX 300 spectrometer (300 MHz) or on a Bruker Avance 400 spectrometer (400 MHz); chemical shifts δH are reported in ppm with reference to the solvent resonance (CDCl3: δH = 7.26 ppm; DMSO: δH = 2.50 ppm; H2O: δH = 4.79 ppm); coupling constants J are reported in Hz. UHPLC analyses were carried out on a Thermo Scientific Dionex UltiMate 3000 Standard system including an autosampler unit, a thermostated column compartment and a photodiode array detector, using UV absorbance detection at λ = 273 nm. HPLC/ESI-MS analyses were carried out on a Waters UPLC Acquity H-Class system including a photodiode array detector (acquisition in the 200–400 nm range), coupled to a Waters Synapt G2-S mass spectrometer, with capillary and cone voltage of 30 kV and 30 V, respectively, source and desolvation temperature of 140 and 450 °C, respectively. ESI+ and ESI– refer to electrospray ionization in positive and negative mode, respectively. HRMS spectra were recorded on the same spectrometer, using the same source settings as above.

Preparation of N-tert-Butoxycarbonyl-O-methyl-tyrosine methyl ester (Boc-Tyr(Me)-OMe)

Synthesis of Boc-Tyr(Me)-OMe was carried out according to a published procedure.[92,93] A solution of Boc-Tyr-OH (7.0 mmol, 2.0 g; Bachem) in dimethylformamide (DMF, 20 mL) was cooled using an ice bath and treated with freshly ground KOH (7.7 mmol, 0.43 g). A cooled solution of CH3I (7.7 mmol, 0.49 mL) in DMF (5 mL) was added dropwise over 1 min. The mixture was stirred at room temperature for 30 min, then cooled using an ice bath, and additional KOH (7.7 mmol, 0.43 g) and a cooled solution of CH3I (7.7 mmol, 0.49 mL) in DMF (5 mL) were added dropwise over 1 min. The mixture was stirred for 3 h at room temperature, poured onto ice (40 g), and extracted with ethyl acetate (3 × 20 mL). The organic layers were washed with water (3 × 13 mL), brine (2 × 13 mL), and dried over Na2SO4. The solvent was removed under reduced pressure to afford a colorless oily residue. Then the oil was purified by preparative silica gel chromatography (mobile phase: ethyl acetate–hexane, 3:7 v/v) (yield: 1.4 g, 63.6%). 1H NMR (400 MHz, DMSO-d6) δ 7.14 (d, J = 8.5 Hz, 2H), 6.84 (d, J = 8.6 Hz, 2H), 4.11 (ddd, J = 10.0, 8.1, 5.3 Hz, 1H), 3.72 (s, 3H), 3.61 (s, 3H), 2.99–2.72 (m, 2H), 1.33 (s, 9H).

Preparation of Biotinylated O-Methyl-Tyrosine (Biotin-Tyr(Me)-OH)

This compound was prepared in three steps. In the first stage, Boc-Tyr(Me)-OMe (1.0 g) was treated by trifluoroacetic acid (TFA)/water solution (9:1 v/v, 2 mL) for 30 min. TFA was removed by evaporation in vacuo, the residue was poured into diethyl ether, the TFA salt of H-Tyr(Me)-OMe was collected by filtration as a white precipitate (yield: 0.88 g, 84%). 1H NMR (400 MHz, CDCl3) δ 7.14 (d, J = 8.1 Hz, 2H), 6.88 (d, J = 7.9 Hz, 2H), 4.23 (t, J = 6.4 Hz, 1H), 3.80 (s, 3H), 3.78 (s, 3H), 3.24 (qd, J = 14.6, 6.2 Hz, 2H). In the second stage, biotin (345 mg, 1.41 mmol), was activated with 1-ethyl-3-(3-(dimethylamino)propyl)carbodiimide (EDC, 293 mg, 1.55 mmol) and hydroxybenzotriazole monohydrate (HOBt, 242 mg, 1.55 mmol) in a mixture of CH2Cl2 (7 mL) and DMF (7 mL). The mixture was stirred for 5 h. Then the TFA salt of H-Tyr(Me)-OMe (500 mg, 1.55 mmol) was added with N-ethyl-N,N-diisopropylamine (DIEA, 531 μL, 3 mmol), and the mixture stirred overnight. DMF was removed under reduced pressure, and the residue redissolved in ethyl acetate (100 mL), washed with water (30 mL), 1 M KHSO4 (10 mL), NaHCO3 (saturated solution, 10 mL), and brine (10 mL), consecutively. The solution was dried over anhydrous Na2SO4 and concentrated under reduced pressure. Residual DMF was removed by dissolving the residue in ethyl acetate (20 mL) and precipitation with hexane (5 mL). The solid was recovered by filtration, washed with hexane, and dried in vacuo (300 mg, 46%). The methyl ester biotinyl-Tyr(Me)-OMe (270 mg, 0.62 mmol) was then dissolved in iPrOH:H2O (7:3 v/v) (minimum volume), treated with 1 N NaOH (0.93 mL). The mixture was stirred at room temperature overnight. The solvent was removed under reduced pressure, and the product was precipitated upon addition of water and acidification with 1 M HCl. The free acid biotinyl-Tyr(Me)-OH was recovered by filtration as a white solid, washed with water, and then dried under reduced pressure (yield: 226 mg, 86%). 1H NMR (300 MHz, DMSO-d6) δ 8.05 (d, J = 8.1 Hz, 1H), 7.13 (d, J = 8.5 Hz, 2H), 6.83 (d, J = 8.5 Hz, 2H), 6.37 (d, J = 9.9 Hz, 2H), 4.45–4.24 (m, 2H), 4.19–4.06 (m, 1H), 3.71 (s, 3H), 3.09–2.90 (m, 2H), 2.80 (ddd, J = 20.9, 13.2, 7.3 Hz, 2H), 2.05 (t, J = 7.1 Hz, 2H), 1.69–1.08 (m, 6H). HRMS (ESI+) m/z calcd for C20H28N3O5S [M + H]+ 422.1750, found 422.1747.

Preparation of Biotinyl-Tyr(Me)-Oxazolone (BYO)

In a typical experiment, biotinyl-Tyr(Me)-OH (42 mg, 0.1 mmol) was mixed with CH2Cl2 (3 mL) and then EDC (21.9 mg, 0.12 mmol) was added. After stirring by magnetic stirrer for 1 h, all the starting material was dissolved. Additional CH2Cl2 (3 mL) was added, then the mixture was washed by H2O (5 mL) twice and saturated brine (5 mL) once. The organic layer was dried by anhydrous Na2SO4 and concentrated under reduced pressure. The residue was dried in vacuum in the presence of P2O5 for 1 h. The product was stored under −20 °C, or kept in a solution of CH3CN under −20 °C. HRMS (ESI+) m/z calcd for C20H26N3O4S [M + H]+ 404.1644, found 404.1644. See Supporting Figure S9 for NMR data.

Ribozyme Selection and Sequencing

Chemical synthesis (IDT, PAGE purification) was used to obtain a library of DNA molecules having the sequence 5′-GATAATACGACTCACTATAGGGAATGGATCCACATCTACGAATTC-N21-TTCACTGCAGACTTGACGAAGCTG-3′, where N21 denotes 21 consecutive random positions and nucleotides upstream of the transcription start site are underlined. Two replicates of the selection were performed (RS1 and RS2), beginning with 9.1 (coverage ≈ 1.3-fold) and 145 pmol (coverage ≈ 20-fold) of DNA for RS1 and RS2, respectively. RNA was transcribed using HiScribe T7 polymerase (New England Biolabs) and purified by denaturing polyacrylamide gel electrophoresis (PAGE). In the first round of selection, 3.4 × 1014 or 1.9 × 1015 RNA sequences (RS1 and RS2, respectively) were incubated with 50 μM BYO in the aminoacylation selection buffer (100 mM HEPES (pH 6.95), 100 mM NaCl, 100 mM KCl, 5 mM MgCl2, 5 mM CaCl2) for 90 min, at an RNA concentration of 1.4–3.2 μM. The reaction was stopped by removing unreacted substrate using Bio-Spin P-30 Tris desalting columns (Bio-Rad). Streptavidin MagneSphere paramagnetic beads (Promega) were used to isolate reacted sequences at a volume ratio of 1:4, which were then eluted with a 5 min incubation at 65 °C in a solution containing 95% formamide and 10 mM EDTA. Sequences were prepared for the next round of selection by reverse transcription and PCR (RT-PCR), with primers complementary to the fixed sequence shown above. Five additional rounds of selection were performed using the same procedure, with ∼400 pmol (∼2 × 1014 molecules; 2 μM) of RNA in each round. DNA samples from each round were barcoded and pooled for sequencing by Illumina NextSeq 500 (Biological Nanostructures Laboratory, California NanoSystems Institute at UCSB).

Identification of Ribozyme Families and Motifs

Clustering was performed on the Galaxy platform[94] for sequences in Rounds 4–6. Multiple families containing the same motif were designated as 1A.1, 1A.2, etc., or 2.1, 2.2, etc. Center sequences were used to assign each family to a motif. SeqLogo plots[95] representing motifs were generated from all sequences identified among every family grouped into that motif. Sequences are named according to the convention: S-Motif.family rank-sequence rank, where rank is determined by relative abundance in Round 6. For example, S-1B.1-a is the top-ranked sequence from the top-ranked family of Motif 1B. Two μg of RNA from Round 5 (RS1) were incubated with BYO substrate at various concentrations (2, 10, 50, and 250 μM), under buffer conditions and reaction time otherwise identical to those during selection. Streptavidin beads were added at a volume ratio of 1:1, and bound RNA was eluted as described above. To enable absolute quantitation of the products, 4, 12, 17, and 42 fmol, respectively, of a control RNA sequence were spiked into the RNA eluted from each concentration point. The spike-in control sequence was transcribed by T7 RNA polymerase from a DNA oligonucleotide (IDT) having the sequence 5′-GATAATACGACTCACTATAGGGAATGGATCCACATCTACGAATTCAAAAACAAAAACAAAAACAAATTCACTGCAGACTTGACGAAGCTG-3′ (promoter underlined). The k-Seq reactions were performed in triplicate, barcoded, and sequenced as described above. Every unique sequence detected in Round 5 was tracked across all 12 k-Seq samples. The absolute concentration of each sequence was calculated as (n/nspike)[spike], where ns and nspike are the number of reads found for sequence s or the spike-in sequence, respectively, and [spike] is the known concentration of the spike-in sequence in the sample. Concentrations were averaged across triplicates and fit to the first-order rate equation ), where F is the measured fraction of sequence s reacted, A is the maximum reacted fraction, t is the incubation time of 90 min, and is the effective rate constant of the reaction catalyzed by sequence s (see Supporting Text S3). To obtain an estimate of error, each set of 12 observations was randomly grouped into three series of four concentrations, and A were fit individually for each set, and the standard deviation among the three series was calculated. For sequences of low activity, the parameter A could not be accurately estimated over the concentrations tested, leading to a fitting artifact with A = 1 and underestimation of ks (Supporting Figure S5A). However, while A and k are poorly estimated individually, the combined chemical activity parameter ksAs is estimated more accurately. Thus, ksAs was used to compare catalytic activity across the broad range of observed activity. The ratio of ksAs to k0A0 (the uncatalyzed activity, see below) is defined as the catalytic enhancement of sequence s (r).

Determination of Aminoacylation Rate by Gel Shift Assay

Ten sequences were chosen from among the top 20 peaks for experimental testing (Supporting Table S1). The corresponding DNA oligonucleotides were obtained from IDT (HPLC-purified) and RNA was transcribed using T7 RNA polymerase. In addition, a control sample of random pool sequences was used to determine baseline uncatalyzed activity (k0A0, measured as a combined parameter). RNA was labeled using a 5′ EndTag Labeling Kit (Vector Laboratories) with Alexa 488 (Fisher), and purified by phenol-chloroform extraction. Labeled RNA sequences were then incubated (RNA concentration of 100 nM) with BYO for 90 min under conditions described as above for k-Seq. Following desalting, samples were incubated with 2 μM streptavidin for 15 min in 10 mM Tris (pH 7.0), then analyzed by native PAGE. Gels were scanned and fluorescence was quantified with ImageQuant software on an Amersham Typhoon 5 Biomolecular Imager. Bands corresponding to the streptavidin complex and the free RNA band were quantified to calculate the fraction of each sequence that had undergone aminoacylation. Values determined by k-Seq were compared to gel shift percentages (Supporting Figure S3A) to determine the average fraction loss l during streptavidin bead pull-down. This value of l was used as a correction factor when calculating catalytic enhancements using k-Seq data, as k0A0 was measured by gel-shift assay (also see Supporting Table S1).

Identification of the Reactive Nucleotide

Ribozyme aminoacylation reactions were performed in selection buffer containing 1 μM RNA and 500 μM BYO and incubated with gentle agitation for 90 min. RNA was concentrated using Amicon Ultracel-3 filters (EMD Millipore) and an adapter oligo having the sequence 5′-AACCTGCTGTCATCGTCGTCCCTATAGTGAGC-3′ was adenylated using a 5′ adenylation kit (NEB) and ligated to the 3′ end using T4 RNA Ligase 2, truncated KQ (New England BioLabs) (see exception noted below). The ligated products were gel purified and reverse transcribed using a 5′ Rhodamine Green-X-tagged reverse primer complementary to a region of the adapter sequence (5′-CTCACTATAGGGACGACGATGACAGCAGG-3′) and SuperScript III Reverse Transcriptase (Thermo Fisher), with a 10 min extension at 55 °C. Reverse transcripts were run on a 12% denaturing sequencing gel, scanned on an Amersham Typhoon 5 Biomolecular Imager. The likely site of truncation was identified by gel position from the primer (bands at single nucleotide resolution could be visualized at high contrast; see Supporting Figure S10). To verify specific 2′-OH positions, RNA sequences containing 2′-O-methyl modifications were obtained from IDT and tested for aminoacylation activity by streptavidin gel shift (described above). Sequences from families 1A.1 and 1B.1 were ligated to an alternative adapter oligo (5′-AAAACGGGCTTCGGTCCGGTTC-3′), as ligation to the original adapter oligo (listed above) was noted to interfere with folding of these sequences. The corresponding RT primer was 5′-GAACCGGACCGAAGCCCG-3′.

Degradation Rate of BYO

RNA sequence S-1A.1-a (Supporting Table S1) was added to 250 μM BYO that was preincubated with reaction buffer for 5–180 min. The initial rate of the reaction was compared to the reaction kinetics for this sequence determined without preincubation of BYO (see above). The effective concentration of BYO at the start of reaction was calculated assuming a first-order reaction (i.e., effective [BYO] = (250 μM × initial rate)/(kA), where kA is the activity of the ribozyme without preincubation of BYO), giving a half-life for BYO of 36.5 min. Reaction rates for ribozymes were adjusted accordingly to account for lower effective substrate concentrations (Supporting Text S3). We compute the d-dependent epistatic correlation γ.[49] This parameter measures the correlation between the effect of a certain mutation at locus j in sequence s, Δ(s), and the effect of the same mutation in a d mutant background (i.e., all sequence pairs separated by a distance d), , averaged over all possible sequences s, mutations j, and d mutant backgrounds :The effect of a mutation at locus j is measured as the change in (log-scale) activity of sequence s, as . For each pair of sequences having a certain mutation at a certain locus j, we identify all possible d mutant backgrounds by finding every other pair of sequences in the pool (Round 5, RS1) that (i) are at an edit distance d + 1 from each other, and (ii) differ by the same mutation at the same locus j. Data values from k-Seq with kA < baseline activity were assigned the baseline value.

Evolutionary Pathways between Ribozymes

Evolutionary pathways between two sequences (s1 and s2) were determined by applying a modified A* search algorithm.[96] The activity landscape determined by k-Seq can be considered a graph of nodes (sequences) connected by edges with weights equal to the edit distance between two nodes. In general, the A* algorithm is an established search method to find the lowest cost path from a starting node to the target node. Sequences present in Round 5 with absolute sequencing count ≥2 were used as the set of possible node sequences. Beginning at the starting sequence s1, paths were chosen for extension according to minimization of the cost function f(s) = g(s) + h(s), where g(s) is the mutational distance traveled along the path from s1 to sequence s and h(s) is the edit distance from s to the target sequence s2. If more than one sequence gave the minimum f(s) value, these sequences were prioritized according to the following criteria in order of decreasing importance: (1) shortest average step distance, and (2) highest minimum sequence abundance along the pathway. The best five pathways from s1 to s2 were built with the shortest possible maximum step distance (i.e., edit distance for one step). Five additional pathways were built by allowing a step distance up to the previous maximum +1. To decrease runtime in practice, the algorithm kept an updated queue of sequences within a maximum edit distance of the last node(s). Initially, this maximum edit distance was set at 1, but if no pathway was found, this edit distance was increased by 1 until the desired number of pathways was found. To prevent unreasonably long search times, the search was terminated if f(s) > 42 (i.e., each nucleotide being mutated more than twice on average). The algorithm was implemented in Python. Pathway searches were conducted between family centers of the top two families of motif 2, among the top families of motifs 1A, 1B, and 1C, and among the top families of each of the three motifs (Supporting Table S2). Cytoscape[97] was used to generate a landscape overview from pathway data.

82 in total

1. Analysis of a local fitness landscape with a model of the rough Mt. Fuji-type landscape: application to prolyl endopeptidase and thermolysin.

Authors: T Aita; H Uchiyama; T Inaoka; M Nakajima; T Kokubo; Y Husimi
Journal: Biopolymers Date: 2000-07 Impact factor: 2.505

2. Ribozyme-catalyzed tRNA aminoacylation.

Authors: N Lee; Y Bessho; K Wei; J W Szostak; H Suga
Journal: Nat Struct Biol Date: 2000-01

3. Synthesizing life.

Authors: J W Szostak; D P Bartel; P L Luisi
Journal: Nature Date: 2001-01-18 Impact factor: 49.962

4. Adaptive walks by the fittest among finite random mutants on a Mt. Fuji-type fitness landscape. II. Effect of small non-additivity.

Authors: T Aita; Y Husimi
Journal: J Math Biol Date: 2000-09 Impact factor: 2.259

5. One sequence, two ribozymes: implications for the emergence of new ribozyme folds.

Authors: E A Schultes; D P Bartel
Journal: Science Date: 2000-07-21 Impact factor: 47.728

6. Stochastic gene expression as a many-body problem.

Authors: Masaki Sasai; Peter G Wolynes
Journal: Proc Natl Acad Sci U S A Date: 2003-02-26 Impact factor: 11.205

7. Cytoscape: a software environment for integrated models of biomolecular interaction networks.

Authors: Paul Shannon; Andrew Markiel; Owen Ozier; Nitin S Baliga; Jonathan T Wang; Daniel Ramage; Nada Amin; Benno Schwikowski; Trey Ideker
Journal: Genome Res Date: 2003-11 Impact factor: 9.043

8. WebLogo: a sequence logo generator.

Authors: Gavin E Crooks; Gary Hon; John-Marc Chandonia; Steven E Brenner
Journal: Genome Res Date: 2004-06 Impact factor: 9.043

9. Specific, rapid synthesis of Phe-RNA by RNA.

Authors: M Illangasekare; M Yarus
Journal: Proc Natl Acad Sci U S A Date: 1999-05-11 Impact factor: 11.205

10. The lipid world.

Authors: D Segré; D Ben-Eli; D W Deamer; D Lancet
Journal: Orig Life Evol Biosph Date: 2001 Feb-Apr Impact factor: 1.950

25 in total

1. Fitness variation across subtle environmental perturbations reveals local modularity and global pleiotropy of adaptation.

Authors: Grant Kinsler; Kerry Geiler-Samerotte; Dmitri A Petrov
Journal: Elife Date: 2020-12-02 Impact factor: 8.140