
Circular code motifs in the ribosome: a missing link in the evolution of translation?

Gopal Dila¹, Raymond Ripp¹, Claudine Mayer¹,²,³, Olivier Poch¹, Christian J Michel¹, Julie D Thompson¹.

Abstract

The origin of the genetic code remains enigmatic five decades after it was elucidated, although there is growing evidence that the code coevolved progressively with the ribosome. A number of primordial codes were proposed as ancestors of the modern genetic code, including comma-free codes such as the RRY, RNY, or GNC codes (R = G or A, Y = C or T, N = any nucleotide), and the X circular code, an error-correcting code that also allows identification and maintenance of the reading frame. It was demonstrated previously that motifs of the X circular code are significantly enriched in the protein-coding genes of most organisms, from bacteria to eukaryotes. Here, we show that imprints of this code also exist in the ribosomal RNA (rRNA). In a large-scale study involving 133 organisms representative of the three domains of life, we identified 32 universal X motifs that are conserved in the rRNA of >90% of the organisms. Intriguingly, most of the universal X motifs are located in rRNA regions involved in important ribosome functions, notably in the peptidyl transferase center and the decoding center that form the original "proto-ribosome." Building on the existing accretion models for ribosome evolution, we propose that error-correcting circular codes represented an important step in the emergence of the modern genetic code. Thus, circular codes would have allowed the simultaneous coding of amino acids and synchronization of the reading frame in primitive translation systems, prior to the emergence of more sophisticated start codon recognition and translation initiation mechanisms.

Keywords: circular code; genetic code; origin of life; ribosome evolution; translation

Substances: RNA, Ribosomal

Year: 2019 PMID: 31506380 PMCID: PMC6859856 DOI: 10.1261/rna.072074.119

Source DB: PubMed Journal: RNA ISSN: 1355-8382 Impact factor: 4.942

INTRODUCTION

Unraveling the emergence and evolution of the genetic code remains an elusive challenge (Koonin and Novozhilov 2017). It has been estimated that the events shaping the genetic code took place 3.7–4.1 billion years ago (Nutman et al. 2016) and led to the formation of the Last Universal Common Ancestor (LUCA) as a primordial ancestor of all life on Earth today. Since LUCA, the same standard genetic code has been used to translate nucleotides into amino acids in (quasi-) all organisms. The universality of the code is a hindrance with regard to studying its formation, because no organisms exist containing a primitive or intermediate genetic code for comparison. Nevertheless, different scenarios have been proposed that attempt to explain how the genetic code could have emerged from the primordial soup. Until recently, the textbook scenario has been an initial RNA world, in which RNA polymers acted both as a carrier of genetic information and as a catalyst for translation (Gilbert 1986). However, there is growing evidence supporting an early peptide–RNA world (e.g., Bowman et al. 2015; Carter 2015; Van der Gulik and Speijer 2015; Kunnev and Gospodinov 2018; Chatterjee and Yadav 2019), in which the first RNA polymers coexisted and interacted with short peptides. Irrespective of these scenarios, a key question is how the modern standard genetic code came into being. The contemporary genetic code represents a nearly universal assignment of 64 triplets of nucleotides (codons) to 20 amino acids. Many alternative hypotheses for the origins of this assignment have been put forward (for reviews, see Grosjean and Westhof 2016; Koonin 2017). For example, the stereochemical hypothesis (Woese et al. 1966) postulates that the code developed from interactions between codons, anticodons, and amino acids. 
The coevolution theory posits that the code coevolved with amino acid biosynthesis pathways, whereas the error minimization theory assumes that the adverse effect of point mutations and translation errors was the principal factor of the code's evolution. These theories are not mutually exclusive, and they may all have contributed to create the contemporary code. Initial amino acids may have been defined by stereochemical affinities, but extension of such initial assignments via coevolution and adaptation was probably essential to complete the modern coding table (Chatterjee and Yadav 2019). All these theories are compatible with the idea that the universal genetic code gradually evolved from a simpler primordial form that encoded fewer amino acids, first postulated by Crick et al. (1957). Crick's original proposal that the genetic code was a comma-free code explained how a sequence of trinucleotides could code for 20 amino acids, and at the same time how the correct reading frame could be retrieved and maintained. The main idea of comma-free codes is that coding trinucleotides are found only in one frame, known as the reading frame—that is, trinucleotides in the reading frame make sense, whereas trinucleotides in the shifted frames 1 and 2 make nonsense. In coding theory, such a comma-free code is also known as a self-synchronizing code, because no external synchronization is required. It was later proved that the modern genetic code could not be a comma-free code (Nirenberg and Matthaei 1961), when it was discovered that TTT, a trinucleotide that cannot belong to a comma-free code, codes for phenylalanine. Although the standard genetic code used by nearly all modern organisms is not a comma-free code, other comma-free codes have been proposed that may have represented primeval codes, notably the RRY code (R = G or A, Y = C or T) with eight trinucleotides and four amino acids (Crick et al. 
1976), the RNY code (N = any nucleotide) with 16 trinucleotides and eight amino acids (Eigen and Schuster 1978; Shepherd 1981), or the GNC code with four trinucleotides and four amino acids (Ikehara 2002). A weaker version of comma-free codes, the so-called circular codes, has also been proposed (Arquès and Michel 1996). Circular codes are less restrictive than comma-free codes, as a frameshift of 1 or 2 nt in a sequence entirely consisting of trinucleotides from a circular code will not be detected immediately but after the reading of a certain number of nucleotides (for reviews, see Michel 2008; Fimmel and Strüngmann 2018). Circular codes possess the circular property—that is, any word written on a circle (the last letter becoming the first in the circle) has a unique decomposition into trinucleotides of the circular code (Fig. 1A). A circular code naturally excludes the homopolymer trinucleotides {AAA, CCC, GGG, TTT}. It also excludes trinucleotides related by circular permutation (e.g., AAC and ACA), because the concatenation of AAC with itself …AACAAC…, for example, can be decomposed in two ways: …AAC, AAC… or …A, ACA, AC… (Michel 2008). By excluding the homopolymer trinucleotides and dividing the 60 remaining trinucleotides into three disjoint classes, a circular code of trinucleotides has at most 20 trinucleotides (called a maximal circular code). There exist 12,964,440 maximal circular codes, although it has been shown that there is no maximal circular code that can code 20 or 19 amino acids and only 10 can code for 18 amino acids (Michel and Pirillo 2013). Remarkably, one of the maximal circular codes, called the X circular code (Fig. 1B), was found to be overrepresented in the reading frame of protein-coding genes from eukaryotes and prokaryotes (Arquès and Michel 1996; Michel 2017). Other circular codes, and notably variations of the common X circular code, are hypothesized to exist in different organisms (Frey and Michel 2003, 2006; Ahmed et al. 
2010; Michel 2015, 2017).
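The exclusion rules described above (no homopolymer trinucleotides, and at most one trinucleotide per circular-permutation class) can be illustrated with a short Python sketch. The 20 trinucleotides of X are transcribed here from the published definition (Arquès and Michel 1996) and should be treated as an assumption, because this section does not list them; the function name `passes_necessary_conditions` is hypothetical. The two checks are necessary conditions only and do not by themselves prove circularity.

```python
from itertools import product

# X circular code as reported by Arquès and Michel (1996); transcribed here
# as an assumption, since the text above does not list the trinucleotides.
X_CODE = frozenset(
    "AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
    "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split()
)


def rotations(tri):
    """Return the set of circular permutations of a trinucleotide."""
    return {tri[i:] + tri[:i] for i in range(3)}


def passes_necessary_conditions(code):
    """Check two necessary (not sufficient) conditions for circularity."""
    for tri in code:
        if len(set(tri)) == 1:
            return False  # homopolymer such as AAA
        if len(rotations(tri) & set(code)) != 1:
            return False  # two members related by circular permutation
    return True


# The 60 non-homopolymer trinucleotides fall into permutation classes of size 3.
all_tri = {"".join(p) for p in product("ACGT", repeat=3)}
classes = {frozenset(rotations(t)) for t in all_tri if len(set(t)) > 1}

print(len(classes))                                  # number of classes
print(len(X_CODE), passes_necessary_conditions(X_CODE))
print(passes_necessary_conditions({"AAC", "ACA"}))   # AAC/ACA example from the text
```

Because there are 20 permutation classes and a circular code can take at most one trinucleotide from each, the bound of 20 trinucleotides for a maximal circular code follows directly.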

FIGURE 1.

Properties of the X circular code. (A) The definition of circularity implies that any word of the X code written on a circle has a unique decomposition. (B) The X circular code is maximal (with 20 trinucleotides) and codes for 12 amino acids. (C) The X code is composed of 10 trinucleotides and their complementary trinucleotides. (D) The permutations of the X code associated with the shifted frames 1 and 2, named X1 and X2, respectively, are circular codes (C3) and in addition are complementary to each other: a word in the shifted frame 1 of the strand 5′–3′ is complementary to the word in the shifted frame 2 of the strand 3′–5′, and vice versa. Note that X1 and X2 are shown in only one strand for simplicity, although they exist in both strands. (E) According to the definition of a comma-free code, all words in the reading frame (frame 0) are valid (shown in blue), whereas all out-of-frame words are invalid (gray). For the X circular code, valid words may be present in frames 1 or 2, up to a length of at most 13 nt.

The X circular code has additional symmetry properties; in particular, it is self-complementary, meaning that if a trinucleotide belongs to X, then its complementary trinucleotide also belongs to X (Fig. 1C). Moreover, the +1 and +2/−1 circular permutations of X, denoted X1 and X2, respectively, are also maximal circular codes and are complementary to each other (Fig. 1D). Like comma-free codes, circular codes also have the property of synchronizability (i.e., the correct reading frame can be retrieved by using an appropriate window of nucleotides). In any sequence generated by a trinucleotide comma-free code, the reading frame can be determined in a window length of at most 3 nt, whereas for the X circular code, at most 13 consecutive nucleotides are enough to always retrieve the reading frame (Fig. 1E). 
In other words, any sequence “motif” containing four consecutive X trinucleotides is sufficient to determine the correct reading frame. The hypothesis of circular codes, and in particular the X circular code, is supported by evidence from several statistical analyses of modern genomes. For example, a large-scale study of 138 eukaryotic genomes (El Soufi and Michel 2016) showed that X motifs (in the case of protein-coding genes, an X motif was defined as a run of at least four trinucleotides from the X circular code) are found preferentially in protein-coding genes, which contain approximately eight times more X motifs than noncoding regions. More detailed studies of the complete gene sets of yeast and mammal genomes (Michel et al. 2017; Dila et al. 2019) confirmed the strong enrichment of X motifs in genes and further demonstrated a statistically significant enrichment in the reading frame compared to frames 1 and 2 (P-value < 10⁻¹⁰). In addition, it was shown that most of the mRNA sequences from these organisms (e.g., 98% of experimentally verified genes in Saccharomyces cerevisiae) contain X motifs. Intriguingly, conserved X motifs have also been found in many tRNA genes (Michel 2013) and near the decoding center of 16S/18S ribosomal RNA from bacteria, archaea, and eukaryotes (El Soufi and Michel 2015), which suggests their involvement in universal gene translation mechanisms. Here, we investigate whether the overrepresentation of X motifs in genes might reflect traces of an ancestral coding system based on circular codes, one that used a smaller number of trinucleotides than the modern genetic code but that had the specific capacity to identify or maintain the reading frame. If the X circular code represents a predecessor of the genetic code, then we should be able to find imprints or traces of the code in the evolution of the translation machinery and, in particular, in the ribosome, a highly conserved ribonucleoprotein complex. 
Because the ribosome is universal in all extant organisms (Melnikov et al. 2012), it can be deduced that it was largely formed at the time of the LUCA, and its earliest origins likely lie in the prebiotic world. It is widely accepted that in the primordial soup, increased chemical complexity led to RNA or RNA-like oligomers. Interactions between these RNA conformations and prebiotic amino acids or short oligopeptides could have stabilized the structures and provided catalytic functions (Szathmáry 1999; Plankensteiner et al. 2005; Van der Gulik and Speijer 2015). Several mechanisms establishing correspondence between anticodons/codons and their cognate amino acids have been suggested, possibly representing a “proto-translation machine” (Yarus et al. 2009; Ma 2010; Noller 2012; Carter 2016). Thus, an early ribosome may have consisted of rRNAs stabilized by a few small peptides containing glycine, alanine, aspartic acid, and/or valine, essential for the structure of the nucleoprotein particle (Fournier et al. 2010; Maier et al. 2013). According to this theory, RNA and protein-based molecules would then have evolved concurrently and interactively, giving rise to the first system capable of translating genetic information (Kunnev and Gospodinov 2018) and self-replicating (Banwell et al. 2018). Thus, the original translation machinery would have been RNA-based, and this RNA translation template would have evolved to form the tRNA, mRNA, and rRNA established at the time of the LUCA (Chatterjee and Yadav 2019; Root-Bernstein and Root-Bernstein 2019). Most likely the initial specificity of translation would have been very low. The question remains of how such a system could have evolved to a more specific mapping between the genetic sequence and the peptide sequence, either by direct rRNA/amino acid interactions or indirectly via tRNA, in order to produce longer peptides that could fold into the first functional proteins (Lupas and Alva 2017). 
The coevolution theory suggests the idea of a growing coding repertoire interacting with a simultaneously growing repertoire of biosynthetic products. Although it is impossible to recreate the entire path along which the very complex process of translation evolved, it is possible to propose, and provide supporting evidence for, certain theoretical solutions. To test our hypothesis that the X circular code represents an intermediate coding system between the primordial, nonspecific RNA–peptide interactions and the modern ribosome-based translation machinery (Fig. 2), we performed a large-scale study of extant rRNA sequences from 133 representative organisms covering the three domains of life, in order to identify X motifs that have been conserved since the LUCA. In a comprehensive analysis of ribosome structural data, we show that most of these universally conserved X motifs, denoted uX motifs, are located in important functional sites, including the decoding center and the peptidyl transferase center (PTC). Furthermore, these functional sites are widely accepted to be essential building blocks of the primeval “proto-ribosome” that was already present in the LUCA (Smith et al. 2008; Bokov and Steinberg 2009; Hsiao et al. 2009, 2013; Petrov et al. 2015; Agmon 2017, 2018). Building on the previously described accretion models of ribosome growth (Hsiao et al. 2009; Petrov et al. 2015), we propose that error-correcting circular codes represent an important step in the coevolution of the genetic code and the ribosome, in which a single code allowed the simultaneous coding of amino acids and synchronization of the reading frame. To our knowledge, this is the first study to propose an ancestral mechanism for reading frame maintenance, prior to the emergence of more sophisticated start codon recognition and translation initiation systems.
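The symmetry properties of the X code summarized in Figure 1 (self-complementarity, X1 and X2 being complementary to each other, and X coding for 12 amino acids) can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the trinucleotide list is transcribed from Arquès and Michel (1996) rather than from this section, the standard genetic code is built from the usual TCAG-ordered table, and the helper names `reverse_complement` and `shift` are hypothetical.

```python
# X circular code (assumed from Arquès and Michel 1996; not listed in this text).
X_CODE = frozenset(
    "AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
    "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split()
)

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(tri):
    """Complementary trinucleotide, read 5'-3' on the opposite strand."""
    return tri.translate(COMPLEMENT)[::-1]


def shift(code, k):
    """Circularly permute every trinucleotide by k positions (k = 1 or 2)."""
    return frozenset(t[k:] + t[:k] for t in code)


X1, X2 = shift(X_CODE, 1), shift(X_CODE, 2)

# Self-complementarity of X (Fig. 1C).
is_self_complementary = all(reverse_complement(t) in X_CODE for t in X_CODE)

# X1 and X2 are complementary to each other (Fig. 1D).
x1_complements_x2 = {reverse_complement(t) for t in X1} == set(X2)

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
amino_acids = {CODON_TABLE[t] for t in X_CODE}

print(is_self_complementary, x1_complements_x2)
print(len(X_CODE | X1 | X2))         # X, X1, X2 partition the non-homopolymers
print(len(amino_acids), sorted(amino_acids))
```

The three codes X, X1, and X2 together cover the 60 non-homopolymer trinucleotides, which is the partition into three disjoint classes mentioned earlier.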

FIGURE 2.

Hypothesis of circular codes as a missing link in the early evolution of the translation system. The prebiotic soup contained RNA oligomers and amino acids that interacted nonspecifically. They then coevolved to form an ancestral RNA-based “translation” system, with more specific mapping between trinucleotides and amino acids. The RNA template evolved to form the RNA building blocks of the modern ribosome.

RESULTS AND DISCUSSION

Universal X motifs in rRNA of extant organisms

Modern ribosomes are highly sophisticated molecular machines, consisting of two subunits that come together during the initiation of protein synthesis, remain together as individual amino acids are added to a growing peptide according to information encoded on the mRNA, and finally separate again in conjunction with the release of the finished protein. Each subunit is a large nucleoprotein complex. In bacteria and archaea, the large subunit (LSU) contains a 23S rRNA and a 5S rRNA, whereas the small subunit (SSU) contains the 16S rRNA. In eukaryotes, the LSU contains a 28S rRNA, a 5S rRNA, and a 5.8S rRNA, whereas the SSU contains the 18S rRNA. By comparing 3D ribosome structures from different organisms, a common core of rRNA was identified that is conserved over the entire phylogenetic tree, especially in terms of secondary/tertiary structures (Hsiao et al. 2009; Petrov et al. 2015; Opron and Burton 2018). To investigate the presence of X motifs, that is, motifs composed of trinucleotides from the circular code X, in this common core of rRNA, we identified universal X motifs (denoted uX motifs) in multiple sequence alignments of the LSU rRNAs (23S/28S and 5S) and SSU rRNAs (16S/18S) for 133 representative species covering all three domains of life (Supplemental Fig. S1). X motifs are defined as universal (denoted uX motifs) if they are present in at least 90% of the aligned sequences and have a length of at least 6 consecutive nucleotides. It is important to note that uX motifs are not necessarily conserved in terms of the nucleotide sequence. An example is the SSU trinucleotide 1505–1507, which is highly conserved in bacteria and archaea as GUA and conserved in eukaryotes as GUU, thus affecting the sequence conservation but not the universality of the X trinucleotide. In the SSU, 13 uX motifs were present in >90% of the sequences (Table 1; Fig. 3A), in the LSU 19 uX motifs were identified (Table 2; Fig. 3B), whereas no uX motifs were found in the 5S alignment. 
The uX motifs are labeled according to the accretion model of Petrov et al. (2015), with capital letters for LSU motifs and small letters for SSU motifs (see below). The mean sequence conservation across the full length of the SSU and LSU is 65% and 62%, respectively, whereas the uX motifs are 81% conserved. A more detailed comparison of nucleotide sequence conservation and the universality of uX motifs is provided in Supplemental Figure S2 and Supplemental Table S1. Within the uX motifs, no significant correlation (Pearson correlation coefficient, P < 10⁻⁴; Spearman correlation coefficient, P = 0.006; Kendall coefficient, P = 0.007) was observed between the X universality and the sequence conservation (Supplemental Tables S2 and S3). In fact, >28% of the rRNA alignments covered by uX motifs are not conserved in terms of the sequence (Supplemental Table S1). Taken together, these results suggest that in certain regions of the ribosome, the X circular code property exists in addition to sequence-level constraints.
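The definition of a uX motif given above (X trinucleotides present in at least 90% of the aligned sequences, over at least 6 consecutive nucleotides) can be sketched in Python as follows. This is a simplified illustration, not the procedure used in the study: the X code is transcribed from Arquès and Michel (1996), all three frames are scanned, universality is computed per alignment column, and the function names (`find_x_motifs`, `covered_columns`, `universal_columns`) and the toy sequences are hypothetical. The toy alignment mirrors the GUA/GUU example, where the nucleotide sequence differs but the X property is retained.

```python
# X circular code (assumed from Arquès and Michel 1996; not listed in this text).
X_CODE = frozenset(
    "AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
    "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split()
)


def find_x_motifs(seq, min_trinucleotides=2):
    """Return sorted (start, end, frame) for maximal runs of X trinucleotides.

    Positions are 0-based with an exclusive end; U is read as T.
    """
    seq = seq.upper().replace("U", "T")
    motifs = []
    for frame in range(3):
        run_start = None
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in X_CODE:
                if run_start is None:
                    run_start = pos
            else:
                if run_start is not None and (pos - run_start) // 3 >= min_trinucleotides:
                    motifs.append((run_start, pos, frame))
                run_start = None
            pos += 3
        # Close a run that reaches the end of the sequence.
        if run_start is not None and (pos - run_start) // 3 >= min_trinucleotides:
            motifs.append((run_start, pos, frame))
    return sorted(motifs)


def covered_columns(aligned_seq, min_trinucleotides=2):
    """Alignment columns covered by an X motif in one gapped sequence."""
    cols = [i for i, c in enumerate(aligned_seq) if c != "-"]
    ungapped = aligned_seq.replace("-", "")
    covered = set()
    for start, end, _frame in find_x_motifs(ungapped, min_trinucleotides):
        covered.update(cols[start:end])
    return covered


def universal_columns(alignment, threshold=0.9, min_trinucleotides=2):
    """Columns covered by X motifs in at least `threshold` of the sequences."""
    counts = {}
    for seq in alignment:
        for col in covered_columns(seq, min_trinucleotides):
            counts[col] = counts.get(col, 0) + 1
    return sorted(c for c, n in counts.items() if n / len(alignment) >= threshold)


toy_alignment = ["GUAGUUCCC", "GUUGUACCC"]   # hypothetical sequences
print(find_x_motifs(toy_alignment[0]))
print(universal_columns(toy_alignment))
```

In the toy alignment the two sequences differ at two positions, yet the first six columns are covered by X trinucleotides in both, so they are reported as universal.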

TABLE 1.

Location of the 13 uX motifs in the SSU rRNA alignment (prokaryotic 16S and eukaryotic 18S), according to structural domains and helices (Escherichia coli numbering)

FIGURE 3.

(A) Location of the 13 uX motifs in the SSU rRNA alignments (prokaryotic 16S and eukaryotic 18S). The abscissa gives the nucleotide position referenced according to the E. coli 16S rRNA and the ordinate indicates the level of sequence conservation observed in the uX motifs. (B) Location of the 19 uX motifs in the LSU rRNA alignments (prokaryotic 23S and eukaryotic 25/28S). The abscissa gives the nucleotide position referenced according to the Escherichia coli 23S rRNA and the ordinate indicates the level of sequence conservation observed in the uX motifs. Colored boxes indicate rRNA domains (positions in Table 3): for the SSU, light blue for domain 5′, olive for the central domain, pink for 3′M, and green for 3′m domains and for the LSU, magenta for domain I, blue for domain II, violet for domain III, white for domain 0, yellow for domain IV, pink for domain V, and green for domain VI.

TABLE 2.

Location of the 19 uX motifs in the LSU rRNA alignment (prokaryotic 23S and eukaryotic 25/28S), according to structural domains and helices (E. coli numbering)

TABLE 3.

Coverage of rRNA structural domains by uX motifs, in the LSU and SSU

The overall coverage of nucleotides in uX motifs in the SSU and LSU rRNAs is similar (7.8% and 6.0%, respectively); however, coverage is not homogeneous across the different structural domains of both subunits (Table 3; Fig. 4). It is interesting to note that the SSU 3′m domain, containing the central pseudoknot (CPK) and the decoding center, has the highest coverage with 19% of nucleotides in uX motifs. The SSU 3′M domain corresponding to the “head” region and the LSU V domain containing the PTC are also enriched with ∼12% coverage, in contrast to the SSU central domain and the LSU 0, I, III, and VI domains, which have only ∼3% coverage. To evaluate the significance of the observed coverage, we compared the results obtained for the uX motifs with those obtained for universal random motifs (uR motifs) generated by random sampling of 100 different codes R with properties similar to the X code, except for the circularity property (defined in detail in Materials and Methods). The distributions of the number and total length of uR motifs (Fig. 5) thus provide an estimate of the expected values for the uX motifs. As shown in Supplemental Figure S3, the observed numbers of uX motifs in the SSU (13) and in the LSU (19) are significantly higher than expected (mean values for uR motifs are 10 and 13, respectively). We also determined how many of the uR motifs display the same level of occurrence and coverage as the uX motifs (Fig. 5). None of the R codes yielded more universal motifs than the 32 observed uX motifs, whereas 2% of the R codes yielded the same number of motifs. 
Three percent of the R codes had a longer total length than the uX motifs. These findings reveal an overrepresentation of uX motifs in the LSU (23S/28S) and SSU rRNAs (16S/18S) conserved in the three domains of life.
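The comparison with random codes described above amounts to asking what fraction of the R codes reach or exceed the value observed for X. A minimal Python sketch of that calculation is given below; the function name `empirical_fractions` is hypothetical, and the list of random-code counts is a made-up placeholder used only to show the arithmetic, not data from the study.

```python
def empirical_fractions(random_values, observed):
    """Fractions of random codes with a value above, and equal to, the observed one."""
    n = len(random_values)
    greater = sum(v > observed for v in random_values) / n
    equal = sum(v == observed for v in random_values) / n
    return greater, equal


# Observed number of uX motifs: 13 in the SSU plus 19 in the LSU.
observed_ux = 13 + 19

# Hypothetical placeholder counts for five random codes (not from the paper).
toy_counts = [20, 23, 25, 27, 32]

greater, equal = empirical_fractions(toy_counts, observed_ux)
print(observed_ux, greater, equal)
```

With the 100 R codes of the study, the reported values correspond to `greater` = 0 and `equal` = 0.02 for the number of motifs.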

FIGURE 4.

Secondary structure schema of the LSU and SSU rRNA (E. coli), showing the location of the uX motifs (red boxes). The schema is colored according to the six phases of the accretion model (Petrov et al. 2015) of ribosome evolution (phase 1, blue; phase 2, cyan; phase 3, green; phase 4, sepia; phase 5, brown; phase 6, purple). uX motifs are labeled with capital letters for LSU motifs and small letters for SSU motifs, according to their order of accretion in the different phases. (PTC) Peptidyl transferase center, (CPK) central pseudoknot.

FIGURE 5.

Distribution of the number and total nucleotide lengths of the uR random motifs in the SSU (16/18S) and LSU (23/28S) rRNA multiple alignments. The corresponding values for the uX motifs are indicated by a vertical red line. (A) Two percent of the random codes have the same number of universal motifs compared to uX motifs (number = 32). (B) Three percent of the random codes have the same or larger total length of universal motifs compared to uX motifs (length = 296).

We then asked whether this overrepresentation might be linked to a compositional bias of the rRNA sequences. In terms of nucleotide composition, some bias is observed in the rRNA sequences (Supplemental Table S4), in which G is the most frequent (31.1%) and T is the least frequent (20.5%). However, the X circular code shows no bias, with equal frequencies of the four bases A, C, G, and T (Supplemental Table S4), and therefore the nucleotide bias cannot explain the observed enrichment. The nucleotide compositions of the 13 uX motifs in the SSU and the 19 uX motifs in the LSU are provided in Supplemental Tables S5 and S6. Concerning the trinucleotide composition of the rRNA sequences (Supplemental Table S4), no significant enrichment of X trinucleotides is observed, according to a Mann–Whitney U test (z-score = −0.51). 
We conclude that the enrichment concerns X trinucleotides located within motifs specifically. The trinucleotide compositions of the 13 uX motifs in the SSU and the 19 uX motifs in the LSU are provided in Supplemental Tables S7 and S8. Finally, we investigated whether the observed enrichment of uX motifs might be associated with the fact that rRNA sequences covary in order to preserve their 3D structure. To do this, we used an Infernal covariance model (CM), a probabilistic model that captures many important features of structured RNA sequence variation (Nawrocki and Eddy 2013). We constructed two CMs for each ribosomal subunit, one in which each position in the sequences is treated independently, and one in which base-paired positions are dependent on each other. However, no significant difference was observed between the two CMs (data not shown), and we conclude that the covariation constraints in the rRNA do not impose an enrichment of uX motifs.
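The statement that the X circular code itself shows no base-composition bias can be checked by counting the bases over its 20 trinucleotides. The sketch below assumes the trinucleotide list of Arquès and Michel (1996), which is not reproduced in this section.

```python
from collections import Counter

# X circular code (assumed from Arquès and Michel 1996; not listed in this text).
X_CODE = frozenset(
    "AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
    "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split()
)

# Count each base over the 60 nucleotide positions of the code.
counts = Counter("".join(sorted(X_CODE)))
total = sum(counts.values())
frequencies = {base: counts[base] / total for base in "ACGT"}

print(total, frequencies)
```

Each base accounts for one quarter of the positions, so a G-rich or T-poor rRNA composition does not by itself favor X trinucleotides.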

uX motifs map to functional centers of modern ribosomes

In this section, we investigate the location of the 32 uX motifs identified in modern ribosomes and how they relate to known functional regions. Although some variation exists, modern translation mechanisms are generally similar in archaeal, bacterial, and eukaryotic systems, and the main functions of the ribosome are conserved in the three domains of life (Opron and Burton 2018). The SSU binds messenger RNA (mRNA) and, together with the transfer RNA (tRNA), is responsible for translational fidelity by ensuring base-pairing between the codon and anticodon in the decoding center. The LSU binds the acceptor ends of the A-site and P-site tRNAs and catalyzes peptide bond formation at the PTC. As the nascent protein is synthesized, it passes through an exit tunnel that begins at the PTC and exits from the back of the LSU. Both subunits are actively involved in translocating the mRNA by one trinucleotide in each cycle, and conformational dynamics are crucial (Jenner et al. 2010; Belardinelli et al. 2016). Large-scale rearrangements include rotation of the SSU and LSU relative to one another (also known as ratcheting), swiveling of the SSU head in relation to the body, and stepwise translocation of the tRNAs together with the mRNA through the ribosome. We based our study on a representative 3D structure of the ribosome from the bacteria Thermus thermophilus, because it contains mRNA nucleotides and three deacylated tRNAs in the A, P, and E sites. Figure 6 shows the positions of the 19 uX motifs in the LSU rRNA (Fig. 6A) and the 13 uX motifs in the SSU rRNA (Fig. 6B) and Tables 4 and 5 summarize the interactions of uX motifs with different molecules, including mRNA, tRNA, and ribosomal proteins.

FIGURE 6.

uX motifs in the rRNA of T. thermophilus. (A) LSU rRNA (green ribbon) with mRNA (orange sticks) and surface representations of tRNAs in the A-site (cyan), P-site (light blue), and E-site (deep teal). Nucleotides of the uX motifs are shown as magenta spheres. The PTC is identified by a black circle and the exit tunnel by a black arrow. (B) SSU rRNA (pink ribbon) with tRNA colored as in A. Nucleotides of the uX motifs are shown as red spheres. (C) Nucleotides in uX motifs close to the PTC (<10 Å in white sticks, <30 Å in magenta sticks, <50 Å in olive sticks). The distances were measured from atom N4 of CYT 2573 (white sphere). All uX motifs are shown as magenta ribbons. (D) All rRNA nucleotides (green ribbons) within 20 Å of the exit tunnel (black arrow) as defined by Dao Duc et al. (2019): nucleotides in uX motifs are colored according to rRNA domains, magenta for domain I, blue for domain II, violet for domain III, orange for domain 0, yellow for domain IV, and pink for domain V (Table 3). tRNA are colored as in A. (E) SSU rRNA nucleotides in contact with mRNA (<5 Å): nucleotides in uX motifs are colored according to rRNA domains, light blue for domain 5′, olive for the central domain, pink for 3′M, and green for 3′m domains (Table 3); other nucleotides and amicoumacin A (UAM) are white. Magnesium ions and their coordinated water molecules are represented by white spheres.

TABLE 4.

Contacts (<5 Å) of the 13 uX motifs in the SSU rRNA alignment, with other uX motifs, mRNA, tRNA, or ribosomal proteins

TABLE 5.

Contacts of the 19 uX motifs in the LSU rRNA alignment, with other uX motifs, tRNA, or ribosomal proteins

In the LSU, the most conserved functional site is the PTC, where amino acids are polymerized onto the growing nascent chain. The majority of the uX motifs are clustered around the PTC (Fig. 
6C) with three motifs within a radius of 10 Å (B, D, F), six motifs within a radius of 30 Å (B, C, D, E, F, P), and 13 out of the 19 motifs within a radius of 50 Å (A, B, C, D, E, F, G, H, I, L, K, M, P). Thus, 105 (60%) of the 175 nt covered by uX motifs are found within 50 Å of the PTC. Several uX motifs are in direct contact with tRNA: nucleotides G2553, U2555 (motif F) and G2583, U2585 (motif D) are in contact with the A-site tRNA; U2585 (motif D) and U2506 (motif B) are in contact with the P-site tRNA; and G1850–A1853 (motif N) are in contact with the E-site tRNA. One motif (A) is found in helix H89, which is known to be involved in the accommodation of the A-site tRNA in the PTC (Jenner et al. 2010). Another important structure in the LSU is the polypeptide exit tunnel that extends from the PTC to the surface of the ribosome. The tunnel shape is more conserved in the upper part close to the PTC, whereas in the lower part, it is substantially narrower in eukaryotes than in bacteria (Dao Duc et al. 2019). Figure 6D shows the eight uX motifs that are close to the exit tunnel: B, D, E, F, H, G, L, S. Finally, two uX motifs are found in regions involved in interactions with GTPase proteins during translation initiation and elongation: motif Q is in the GTP-Associated Center (GAC) and motif O is in the sarcin-ricin loop. The four remaining uX motifs (I, J, M, R) in the LSU are not associated with known functions to our knowledge. In the SSU, seven of the 13 uX motifs (a, b, c, d, e, h, i) are in contact with the mRNA (at a distance of <5 Å) (Fig. 6E). Remarkably, only three of the 25 rRNA nucleotides in contact with the mRNA are not found in uX motifs. The uX motifs also include many of the rRNA contacts with tRNAs, such as the A-site conserved nucleotides A1492–A1493 (motif b) and G530 (motif h); the P-site G926 (motif d), A790 (motif e), U1498 (motif b), and C1400 (motif a); and the E-site C795 (motif e) (Khade and Joseph 2010). 
An important feature of the SSU is the dynamic swiveling of the SSU head (3′M domain) relative to the body (5′ domain) during translation elongation. The movement originates from flexing at two hinge points—one in the middle of helix h28 at G926 and one in the linker between h34 and h35. Both of these hinges are found in uX motifs (d and l, respectively). Rotation of the SSU head has also been linked to the opening and closing of a 13 Å constriction or “gate” between the head and body domains between the P and E sites, presenting a steric block to the movement of the P-site tRNA. The gate involves G1338 (motif j), situated in the stable ridge that sterically separates the P and E sites, and A790 (motif e) located on the opposite side of the constriction (Achenbach and Nierhaus 2015). The C1397 (motif a) and A1503 (motif c) have also been considered to be “ratchet pawls” that intercalate with mRNA bases during reverse rotation of the head (Achenbach and Nierhaus 2015). Three uX motifs (f, g, and k) in the SSU are not associated with known functions to our knowledge. Many of the uX motifs identified in this study are also in contact with ribosomal proteins (11 out of 13 uX motifs in the SSU and 16 out of 19 uX motifs in the LSU). Among the 102 known ribosomal protein families, 34 (15 in the SSU, 19 in the LSU) are represented in all three domains of life (Supplemental Table S9; Smith et al. 2008). Many of these universal proteins have been shown to be crucial for ribosome assembly, the formation of intersubunit bridges, and interactions with the tRNAs or the polypeptide exit channel (Lecompte et al. 2002). Interestingly, nearly all the proteins in contact with uX motifs are universal ribosomal proteins (in T. thermophilus, all 10 proteins in contact with the SSU uX motifs are universal, and 10 out of 14 proteins in contact with the LSU uX motifs are universal).
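The contacts reported in this section are defined by a distance cutoff (<5 Å between atoms of a uX motif and atoms of mRNA, tRNA, or ribosomal proteins). A minimal Python sketch of such a cutoff test is shown below. The function name `contacts`, the atom labels, and the coordinates are all hypothetical placeholders; they do not come from the T. thermophilus structure, and the sketch is not the tool used in the study.

```python
from math import dist


def contacts(atoms_a, atoms_b, cutoff=5.0):
    """Return (label_a, label_b) pairs whose atoms lie closer than `cutoff` Å."""
    return [
        (label_a, label_b)
        for label_a, xyz_a in atoms_a
        for label_b, xyz_b in atoms_b
        if dist(xyz_a, xyz_b) < cutoff
    ]


# Hypothetical atoms with made-up coordinates in Å (for illustration only).
rrna_atoms = [("rRNA_atom_1", (0.0, 0.0, 0.0)), ("rRNA_atom_2", (10.0, 0.0, 0.0))]
mrna_atoms = [("mRNA_atom_1", (3.0, 0.0, 0.0))]

print(contacts(rrna_atoms, mrna_atoms))
```

Only the first rRNA atom lies within 5 Å of the mRNA atom in this toy case; the second is 7 Å away and is not reported.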

uX motifs were present in the primordial proto-ribosome

It is generally assumed that the large and small subunits of the ribosome initially existed independently, although there is some debate as to whether the LSU or the SSU emerged first (Kunnev and Gospodinov 2018; Opron and Burton 2018). Based on comparative structural analyses, proto-LSU (Smith et al. 2008; Bokov and Steinberg 2009; Hsiao et al. 2009, 2013; Petrov et al. 2015; Agmon 2017) and proto-SSU (Petrov et al. 2015; Agmon 2018) models have been proposed (Fig. 7).

FIGURE 7.

Proto-LSU and proto-SSU, with nucleotides and numbering from the contemporary E. coli 23S and 16S rRNA. uX motifs are highlighted in red and labeled according to the accretion model of Petrov et al. (2015), with 5′–3′ direction indicated by red arrows. The dimeric proto-LSU (Agmon 2017) can be divided into A- and P-monomers corresponding to the modern A-tRNA and P-tRNA sites. Sequence complementarity of nucleotides building the conserved PTC walls in bacterial ribosomes is indicated by gray arrows in the PTC loop (connecting X trinucleotides shown in bold). The minimal proto-SSU model proposed by Agmon (2018) is shown in brown, and the additional core segments identified by Petrov et al. (2015) are shown in yellow. (PTC) Peptidyl transferase center, (CPK) central pseudoknot.

The proto-LSU corresponds to the PTC, a symmetrical region deep within the large rRNA, where new amino acids are incorporated into the growing peptide chain (Agmon 2009). This region has generally been modeled using the contemporary E. coli sequence to represent the ancestral system (Fig. 7). It consists of ∼120 nt, forming a pocket-like structure that could have accommodated two random amino acids, and would have provided positional catalysis, producing short peptides with random composition. We mapped the uX motifs to the 2D model and found a total of 40 nt (30%) in uX motifs. The motifs are almost exclusively located in the A-monomer corresponding to the modern A-tRNA site, with 35 (58%) of the 60 A-monomer nucleotides in uX motifs. In addition to the universal regions, many of the nucleotides that constitute the two halves of the PTC cavity are composed of X trinucleotides and these trinucleotides have been shown to have a high level of complementarity in different ancient bacteria (Agmon 2017), reflecting the self-complementary property of the X circular code. 
This complementarity has been suggested to indicate a simple and efficient mode of replication (i.e., the proto-LSU may have been a self-replicating ribozyme) (Agmon 2017). The ancestor of the SSU is more controversial, but it may have worked simply as a location to bind RNAs in an open structure configuration (de Farias et al. 2019). The proposed models correspond to the contemporary CPK in the decoding center (Noller 2012). However, in contrast to what is observed in the LSU, there is no single self-folding segment in the modern 16S RNA that encompasses the majority of the decoding site rRNA. A number of disjoint short segments of total length of ∼150 nt have been considered ancestral (Petrov et al. 2015; Agmon 2018). Of these, 40 nt (27%) are found in uX motifs, notably including the future A-site (A1492–A1493) and P-site (C1402–C1403, U1498–A1499) tRNA binding sites. It is worth noting that the combined models of the proto-ribosome, incorporating the active sites of both ribosomal subunits, cover <6% of the modern prokaryotic rRNA, yet they integrate 80 (27%) of the 296 rRNA nucleotides found in uX motifs.

Accretion of uX motifs in the transition from the proto-ribosome to the modern ribosome

Given the complexity of the modern ribosome, it is unlikely that it appeared spontaneously (Hsiao et al. 2009; Petrov et al. 2015; Opron and Burton 2018). According to the RNA–peptide world theory, RNA and protein-based molecules would have evolved concurrently and interactively, giving rise to the first system capable of translating genetic information (Kunnev and Gospodinov 2018) and self-replicating (Banwell et al. 2018). For example, Petrov et al. (2015) suggested that the proto-ribosome evolved to the modern rRNA core by recursive accumulation of ancestral expansion segments (AESs) and proposed an accretion model of rRNA evolution divided into six major phases representing successive steps in the complexification of the ribosome. Figure 4 shows the location of the uX motifs with respect to this accretion model, in which the uX motifs are labeled (a–m for the SSU uX motifs and A–S for the LSU uX motifs) according to their presumed ancestry. We can differentiate two subsets of the uX motifs: those already present in the proto-ribosome described above (phases 1 and 2 of the accretion model) and those gained in the subsequent phases of ribosome evolution (phases 3–6). Thus, four motifs (B–E) of the 19 uX motifs were already present in the proto-LSU, two additional motifs (A,F) are located close to the slightly extended ancestral region defined by Petrov et al. (2015), and four motifs (a–d) of the 13 uX motifs were present in the proto-SSU. In phase 3, uX motifs G–L are incorporated near the extended exit tunnel and motifs K,M in the LSU–SSU interface. In phase 4, motif e is included in the SSU–LSU interactions, and motifs f,g in the A-site and P-site tRNA binding pockets, respectively. In phase 5, motif O is incorporated near the binding sites for elongation factors G and Tu, and motifs P,Q in the L11 stalk. In the SSU, motifs i,j are included in the P-site tRNA pocket and motif h in the CPK. 
In phase 6, the remaining motifs R–S and k–m are introduced in AESs that serve mainly as binding sites for the globular domains of ribosomal proteins. The universal ribosomal proteins mentioned above have also been incorporated into this accretion model (Kovacs et al. 2017), based on the assumption that the age of a given segment of protein is the same as that of the rRNA with which it interacts. In phases 1 and 2, it is assumed that only short random peptides are present in the proto-ribosome system. In phases 3 and 4, uX motifs (A–M, a–g) interact with seven of the 19 universal proteins in the LSU (Table 5) and seven of the 15 universal proteins in the SSU (Table 4). Many of these proteins are known to interact with the PTC (L2, L3, L4, L14) or have contacts to the tRNA binding site and/or the mRNA (S7, S9, S11, S12) mainly via their nonglobular extensions (Smith et al. 2008). In phase 5, uX motifs (O–Q, h–j) contact globular domain proteins, including L6, L13, L36, and S3. In phase 6, most of the newly incorporated proteins are on the surface of the ribosome, and the uX motifs (R–S, k–m) contact only a few of them: L23, S2, and S17.

Model of coevolution of genetic code and translation system

Based on our analyses of uX motifs in the proto-ribosome and the accretion model of ribosome evolution, we suggest that comma-free codes and circular codes represented ancestors of the modern genetic code and were used to map the first trinucleotides to amino acids. We thus propose a model for the coevolution of the genetic code and the translation system in four stages, shown in Figure 8 and discussed in the following paragraphs.

FIGURE 8.

Proposed model of genetic code evolution associating codes, translation systems, and peptide products at different stages from the primordial translation building blocks to the ancestor of the modern ribosome present in the Last Universal Common Ancestor (LUCA).

Recent evidence suggests that RNA and peptides coevolved from the beginning or at least that the proto-ribosome building blocks gained the ability to bind amino acids or small peptides very early (Lupas and Alva 2017; Kunnev and Gospodinov 2018). The first peptides were most probably of abiotic origin, most likely including glycine and alanine, and binding would have been nonspecific. However, natural selection would soon have favored forms encoded and synthesized by nucleic acids. We propose that the first encoding system was based on a comma-free code, such as {GGC, GCC}, which would have allowed encoding of the amino acids and the reading frame within a single code. At this time, the LSU and SSU would have evolved separately, with the proto-LSU having a PTC function and the proto-SSU binding proto-mRNA. Assembly of the two subunits with the intermediate tRNA would have given rise to the first ribosomes capable of coding longer and more specific peptides. From this time, the ribosome and genetic code would have coevolved (Vitas and Dobovišek 2018). With the addition of new amino acids, comma-free codes were no longer viable and the genetic code would have evolved toward the circular codes, possibly with a smaller number of amino acids initially. For example, we have shown previously (Michel et al. 2017) that an X′ circular code exists with 10 trinucleotides capable of coding eight of the 10 hypothesized “early amino acids” (Koonin 2017). Only two universally conserved motifs from this X′ circular code can be observed in the modern ribosome, at positions 1396–1404 in the SSU and 2500–2511 in the LSU. 
It is interesting to note that these two “primitive” circular code motifs correspond to the X motifs a and B (in the SSU and LSU, respectively), which are predicted to be the earliest X motifs in the ribosome according to the accretion model. The peptides synthesized by the early ribosomes may have functioned as primordial ribosome cofactors, possibly to increase rRNA stability (Lupas and Alva 2017). At the early/intermediate stages, in addition to their function of amino acid assignment, circular codes would have allowed reading frame detection and/or maintenance before the emergence of complex start codon recognition systems, allowing the first simple proteins to be encoded. The X circular code may thus have been the first error detection/correction system, preventing the mRNA from being read in the wrong frame. Finally, no circular code can include more than 20 trinucleotides, so the circular code property was not sufficient when more amino acids were needed. The standard genetic code requires a specific start codon that initiates translation, and sophisticated ratchet mechanisms for maintaining the reading frame during translation elongation. Intriguingly, uX motifs are found in modern ribosomes in many of the ratchet pawls, as well as in the PTC and the decoding center.

Conclusion

The genetic code is too complex to have emerged spontaneously and it is hypothesized that the coding process started with a set of primitive amino acids and that others were added until the total of 20 was reached (Chatterjee and Yadav 2019). Most studies of the origin and evolution of the genetic code have focused on the mapping between codons and amino acids (e.g., Ikehara 2002; Hartman and Smith 2014; Koonin 2017), and the origin of reading frame maintenance has not been addressed before. Here, we have investigated the hypothesis that the contemporary genetic code arose from simpler comma-free codes via circular codes. In addition to encoding the amino acids, comma-free codes and circular codes present the important synchronization property that would have allowed detection and maintenance of the reading frame in primordial and less sophisticated translation systems. Should our hypothesis be true, the contemporary translation system may still contain vestiges of such codes. To test this, we used the X circular code as it has the most “universal” occurrence in genes and also strong mathematical properties—in particular, it is self-complementary and C3. We compared rRNA sequences from the three domains of life and identified 32 motifs from the X circular code that are universal, even though they occur in sequences that are not conserved in terms of nucleotides. The enrichment of the rRNA in uX motifs is statistically significant, and most of the motifs are clustered around important functional sites, including the PTC and the exit tunnel in the LSU and the decoding center and ratchet mechanisms in the SSU. We propose that they represent the observable remnants of a primordial code used during the emergence of the RNA or RNA–peptide world. The emergence of the translation system is a chicken-and-egg problem: The ribosome is needed to code proteins, but the ribosome needs proteins to function. 
It has been suggested that an RNA molecule with a peptidyl transferase activity existed before the full sequential three-base decoding (Polacek and Mankin 2005). This early noncoded proto-ribosome could have catalyzed the association of arbitrary amino acids, producing short peptides of random sequences. Here, we showed that the models of both proto-LSU and proto-SSU are enriched in uX motifs, with 30% of the nucleotides found in uX motifs. Concerning the LSU, we observed more uX motifs in the A-monomer than in the P-monomer, based on the E. coli sequence that was used in the model (Fig. 7). This may reflect an inherent asymmetry of the proto-LSU, or it may be due to a stronger conservation of the A-site in evolution. In the RNA–peptide world scenario, the RNA polymers of the proto-ribosome served as templates to directly bind amino acids or short peptides. Cognate RNA triplets could have then evolved to act as anticodons in tRNAs and codons in mRNAs (Yarus 2017). It has been observed previously that the early prebiotic amino acids are coded by G/C-rich codons, whereas engagement of new amino acids required more of A and U to be included in the codons (Polyansky et al. 2013). We propose here that the comma-free code {GGC, GCC} was used initially to code Ala and Gly, and that this code quickly expanded to an ancestral circular code, such as the X′ circular code containing 10 codons with a composition of 66% G/C and 33% A/T, and coding for eight out of the 10 identified early amino acids (Koonin 2017). The increase of the amino acid repertoire and the transition from the production of random peptides to the coding of specific protein sequences require not only more sophisticated mechanisms for codon recognition but also a means of identifying the reading frame. Circular codes represent an efficient means to synchronize the reading frame within a short window, before the evolution of a start codon and the modern translation initiation system. 
In support of this hypothesis, here we have identified uX motifs in the early rRNA. X motifs have also been discovered in modern mRNA sequences (Michel et al. 2017; Dila et al. 2019), as well as in many tRNA (Michel 2013). It is therefore tempting to suggest that base-pairing between the X motifs of the mRNA and those of the tRNA and the rRNA would have given rise to the first coded ribosome apparatus. Traces of such interactions remain in the 3D structures of modern ribosomes, in which we have shown that most of the uX motifs in the rRNA are in contact with the mRNA or the A, P, and E-site tRNAs. Universally conserved X circular code motifs are present at each evolutionary stage up to the common core of the modern ribosome and are coherent with the proposed model for coevolution of the genetic code and the translation system. The question of whether the X motifs retain a function in modern translation systems, possibly by participating in reading frame retrieval, can only be answered by experimental studies.

MATERIALS AND METHODS

Ribosomal RNA multiple sequence alignments

Multiple sequence alignments for LSU rRNAs (23S/28S and 5S) and SSU rRNAs (16S/18S) were obtained from the Center for Ribosomal Origins and Evolution's RiboVision web server at http://apollo.chemistry.gatech.edu/RibosomeGallery/Read Me/alignments/index.html. The alignments contain complete sequences for rRNAs from 133 distinct species, representing a broad but sparse sampling of the phylogenetic tree of life, including all three domains of life. The sequences for the 30 eukaryotes, 67 bacteria, and 36 archaea were originally extracted from the SILVA database at https://www.arb-silva.de. A list of the organisms present in the alignments is provided in Supplemental Table S10.

Identification of universal X motifs (uX motifs) in rRNA alignments

The trinucleotide set X is a maximal C3 self-complementary circular code (Arquès and Michel 1996). A circular code is a set of words over an alphabet such that any sequence written on a circle has a unique decomposition (factorization) into words of the circular code. Any motif from the circular code X, called X motif, has the ability to retrieve the reading frame of the sequence. Formal and classical definitions related to circular codes that are not explicitly necessary to understand the results obtained in this work are not recalled here. They are available in Arquès and Michel (1996); Michel (2008); Fimmel et al. (2016); Michel et al. (2017); Fimmel and Strüngmann (2018); and Dila et al. (2019). As in Michel et al. (2017), an X motif is defined as a consecutive sequence of trinucleotides from the X circular code. For each rRNA sequence in the above alignments, the X motifs were localized using a program developed in the Java language (El Soufi and Michel 2016). The program takes optional parameters that define the minimum length l (in nucleotides) of the X motifs searched. As in previous work, we used l ≥ 8 nt (i.e., at least two trinucleotides, and either prefixes or suffixes of trinucleotides), which implies that the reading frame can be retrieved with a probability of 99.6% (Michel 2012). For each position in each of the LSU rRNA (23S/28S and 5S) and SSU rRNA (16S/18S) alignments, we then calculated the “universality” of the X motifs, defined as the number of sequences having an X motif at that position. A universal X motif (denoted uX motif) was defined as a region in the alignment with two constraints: at least six consecutive positions and ≥90% X universality (i.e., positions covered by X motifs in ≥90% of the sequences in the alignment). It is important to note that, in the case of the rRNA, because the notion of “reading frame” is not relevant, we searched for X motifs starting at any position in the sequences. 
Thus, the trinucleotides of the X motifs in the different organisms are not necessarily in the same “frames.” For example, one of the uX motifs in the SSU covers the sequences AG,GTA,ACC in E. coli and A,GGT,TTC,G in Homo sapiens.
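The search described above can be illustrated with a minimal Python sketch. It is not the Java program of El Soufi and Michel (2016). The function names, the rule used to extend a motif by a partial trinucleotide (up to 2 nt on each side, accepted when it is a suffix or prefix of some X trinucleotide), and the treatment of alignment gaps (gap columns are never counted as covered) are all assumptions made for illustration.

```python
# The 20 trinucleotides of the X circular code (Arquès and Michel 1996).
X_CODE = frozenset("""AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC
                      GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC""".split())
# Partial trinucleotides allowed at motif ends (assumed extension rule).
PREFIXES = {c[:k] for c in X_CODE for k in (1, 2)}
SUFFIXES = {c[-k:] for c in X_CODE for k in (1, 2)}


def find_x_motifs(seq, min_len=8):
    """Return sorted (start, end) half-open intervals of X motifs of >= min_len nt."""
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    motifs = set()
    for frame in range(3):  # no reading frame in rRNA: try all three
        pos = frame
        while pos + 3 <= n:
            if seq[pos:pos + 3] not in X_CODE:
                pos += 3
                continue
            start = pos
            while pos + 3 <= n and seq[pos:pos + 3] in X_CODE:
                pos += 3
            end = pos
            for k in (2, 1):  # extend left by a suffix of an X trinucleotide
                if start - k >= 0 and seq[start - k:start] in SUFFIXES:
                    start -= k
                    break
            for k in (2, 1):  # extend right by a prefix of an X trinucleotide
                if end + k <= n and seq[end:end + k] in PREFIXES:
                    end += k
                    break
            if end - start >= min_len:
                motifs.add((start, end))
    return sorted(motifs)


def universal_regions(alignment, min_cols=6, min_fraction=0.9, gap="-"):
    """Return (start, end) column intervals covered by X motifs in >= min_fraction of rows."""
    n_cols = len(alignment[0])
    counts = [0] * n_cols
    for row in alignment:
        cols = [i for i, ch in enumerate(row) if ch != gap]
        ungapped = "".join(row[i] for i in cols)
        covered = set()
        for start, end in find_x_motifs(ungapped):
            covered.update(cols[start:end])  # map back to alignment columns
        for c in covered:
            counts[c] += 1
    regions, run_start = [], None
    for i in range(n_cols + 1):
        ok = i < n_cols and counts[i] / len(alignment) >= min_fraction
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            if i - run_start >= min_cols:
                regions.append((run_start, i))
            run_start = None
    return regions
```

Applied to the two SSU examples quoted above, `find_x_motifs("AGGTAACC")` and `find_x_motifs("AGGTTTCG")` each return a single 8-nt motif, although the underlying trinucleotides (GTA,ACC versus GGT,TTC) lie in different frames.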

Identification of universal random motifs (uR motifs) in rRNA alignments

To evaluate the statistical significance of both the occurrence number and the nucleotide length of the uX motifs identified in the rRNA alignments, we generated 100 “random” codes. The random codes represent a purposive sampling of extreme cases and were designed to have similar properties to the X circular code except its circularity, as described in Michel et al. (2017). Thus, a random code R has 20 trinucleotides; the total number of each nucleotide A, C, G, and T in R is 15; and R has no stop codons and no periodic trinucleotides {AAA, CCC, GGG, TTT}. Motifs from each of the 100 random codes were identified in each rRNA alignment, and their universality was calculated as for X motifs. Thus, we defined a universal R motif (denoted uR motif) as a region in the alignment with at least six consecutive positions and ≥90% R universality. To estimate the expected enrichment of uX motifs, we calculated the ±0.99 confidence levels for the mean values of the uR motifs. We then used a one-sided Student's t-test to evaluate whether the observed number and length of uX motifs were significantly higher than expected for random uR motifs.
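The constraints on a random code R can be sketched in Python as follows. This is an illustrative rejection sampler, not the procedure of Michel et al. (2017). The function names and the `reject_circular` option are hypothetical, and discarding circular candidates is one assumed reading of "except its circularity". Circularity is tested with the graph criterion of Fimmel et al. (2016): a trinucleotide code is circular if and only if the graph with edges b1 → b2b3 and b1b2 → b3 for each trinucleotide b1b2b3 is acyclic.

```python
import random
from itertools import product

X_CODE = frozenset("""AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC
                      GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC""".split())
STOPS = {"TAA", "TAG", "TGA"}
PERIODIC = {"AAA", "CCC", "GGG", "TTT"}
# The 57 trinucleotides that are neither stop codons nor periodic.
CANDIDATES = sorted(
    "".join(p) for p in product("ACGT", repeat=3)
    if "".join(p) not in STOPS | PERIODIC
)


def is_circular(code):
    """True if the associated graph (b1 -> b2b3, b1b2 -> b3) has no cycle."""
    graph = {}
    for b1, b2, b3 in code:
        graph.setdefault(b1, set()).add(b2 + b3)
        graph.setdefault(b1 + b2, set()).add(b3)
    state = {}  # 1 = on the current DFS path, 2 = fully explored

    def has_cycle(node):
        state[node] = 1
        for nxt in graph.get(node, ()):
            if state.get(nxt) == 1 or (nxt not in state and has_cycle(nxt)):
                return True
        state[node] = 2
        return False

    return not any(node not in state and has_cycle(node) for node in list(graph))


def random_code(rng, size=20, per_base=15, reject_circular=True):
    """Draw codes until one has exactly per_base of each nucleotide (rejection sampling)."""
    while True:
        code = rng.sample(CANDIDATES, size)
        letters = "".join(code)
        if all(letters.count(b) == per_base for b in "ACGT"):
            if not (reject_circular and is_circular(code)):
                return frozenset(code)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the sketch reproducible
    codes = [random_code(rng) for _ in range(100)]
    print(len(codes), "random codes generated")
```

Each code returned by `random_code` can then be passed through the same motif search and universality calculation as X to obtain the uR motifs.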

Secondary structures

The secondary structures of LSU and SSU rRNAs for E. coli were downloaded from http://apollo.chemistry.gatech.edu/RibosomeGallery/. Mapping of information on to secondary structures was performed with RiboVision (apollo.chemistry.gatech.edu/RiboVision) (Bernier et al. 2014). Positions of the expansion segments for LSU and SSU rRNAs and phases in the accretion model were obtained from Petrov et al. (2015).

Three-dimensional structures

Coordinates of the high-resolution crystal structure of the T. thermophilus ribosome were obtained from the PDB database (https://www.rcsb.org/). The PDB entry 4W2F was chosen because it contains mRNA nucleotides, an antibiotic (amicoumacin A) and three deacylated tRNAs in the A, P, and E sites. Numbering of the T. thermophilus SSU rRNA is the same as for E. coli. For the LSU rRNA, E. coli numbering is used. Visualization and analysis of the three-dimensional structures, as well as image preparation, were performed with PyMOL (The PyMOL Molecular Graphics System, Version 1.2r3pre, Schrödinger, LLC).
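The distance criteria used in the Results (for example, rRNA nucleotides within 5 Å of the mRNA) were evaluated in PyMOL. As a purely illustrative sketch of the same criterion, the Python function below flags residues that have at least one atom closer than a cutoff to any atom of a partner molecule. The function name, the input layout, and the toy coordinates are hypothetical; real coordinates would have to be parsed from PDB entry 4W2F.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def residues_in_contact(rrna_atoms, partner_atoms, cutoff=5.0):
    """Return sorted residue numbers with any atom closer than cutoff (Å) to a partner atom.

    rrna_atoms:    iterable of (residue_number, (x, y, z)) tuples
    partner_atoms: iterable of (x, y, z) tuples (e.g., mRNA or tRNA atoms)
    """
    partner = list(partner_atoms)
    hits = set()
    for resnum, xyz in rrna_atoms:
        if resnum not in hits and any(dist(xyz, p) < cutoff for p in partner):
            hits.add(resnum)
    return sorted(hits)


if __name__ == "__main__":
    # Toy coordinates, invented for the example.
    rrna = [(1492, (0.0, 0.0, 0.0)), (1493, (3.0, 0.0, 0.0)), (530, (20.0, 0.0, 0.0))]
    mrna = [(0.0, 4.0, 0.0)]
    print(residues_in_contact(rrna, mrna))
```

The pairwise loop is quadratic in the number of atoms; for a full ribosome structure a spatial index such as a k-d tree would be the more practical choice.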

SUPPLEMENTAL MATERIAL

Supplemental material is available for this article.

65 in total

1. A speculation on the origin of protein synthesis.

Authors: F H Crick; S Brenner; A Klug; G Pieczenik
Journal: Orig Life Date: 1976-12

2. Circular code motifs in transfer RNAs.

Authors: Christian J Michel
Journal: Comput Biol Chem Date: 2013-03-15 Impact factor: 2.877

3. The maximal C(3) self-complementary trinucleotide circular code X in genes of bacteria, eukaryotes, plasmids and viruses.

Authors: Christian J Michel
Journal: J Theor Biol Date: 2015-04-29 Impact factor: 2.691

Review 4. RNA-amino acid binding: a stereochemical era for the genetic code.

Authors: Michael Yarus; Jeremy Joseph Widmann; Rob Knight
Journal: J Mol Evol Date: 2009-10-01 Impact factor: 2.395

Review 5. The ribosome challenge to the RNA world.

Authors: Jessica C Bowman; Nicholas V Hud; Loren Dean Williams
Journal: J Mol Evol Date: 2015-03-05 Impact factor: 2.395

6. On the fundamental nature and evolution of the genetic code.

Authors: C R Woese; D H Dugre; S A Dugre; M Kondo; W C Saxinger
Journal: Cold Spring Harb Symp Quant Biol Date: 1966

7. Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justification.

Authors: J C Shepherd
Journal: Proc Natl Acad Sci U S A Date: 1981-03 Impact factor: 11.205

8. A permuted set of a trinucleotide circular code coding the 20 amino acids in variant nuclear codes.

Authors: Christian J Michel; Giuseppe Pirillo
Journal: J Theor Biol Date: 2012-12-01 Impact factor: 2.691

9. The evolution of the ribosome and the genetic code.

Authors: Hyman Hartman; Temple F Smith
Journal: Life (Basel) Date: 2014-05-20

10. Possible Emergence of Sequence Specific RNA Aminoacylation via Peptide Intermediary to Initiate Darwinian Evolution and Code Through Origin of Life.

Authors: Dimiter Kunnev; Anastas Gospodinov
Journal: Life (Basel) Date: 2018-10-02
