Kuba Nowak1, Paweł Błażej2, Małgorzata Wnetrzak2, Dorota Mackiewicz2, Paweł Mackiewicz2. 1. Faculty of Mathematics and Computer Science, University of Wrocław, ul. F. Joliot-Curie 15, 50-383 Wrocław, Poland. 2. Department of Bioinformatics and Genomics, Faculty of Biotechnology, University of Wrocław, ul F. Joliot-Curie 14a, 50-383 Wrocław, Poland.
Abstract
Reprogramming of the standard genetic code to include non-canonical amino acids (ncAAs) opens new prospects for medicine, industry, and biotechnology. Several methods of code engineering allow new genetic information to be stored in DNA sequences and proteins with new properties to be produced. Here, we provided a theoretical background for optimal genetic code expansion, which may find application in the experimental design of the genetic code. We assumed that the expanded genetic code includes both canonical and non-canonical information stored in the 64 classical codons. Moreover, the new coding system is required to be robust to point mutations and to minimize the possibility of reversion from the new to the old information. To find such codes, we applied graph theory to analyze the properties of optimal codon sets. We presented a formal procedure for finding the optimal codes with various numbers of vacant codons that could be assigned to new amino acids. Finally, we discussed the optimal number of newly incorporated ncAAs and the optimal size of the codon groups that can be assigned to ncAAs.
The standard genetic code (SGC) is a set of rules according to which 64 codons are assigned to 20 canonical amino acids and a stop signal. Thanks to that, genetic information can be stored in DNA and transmitted into the protein world. The SGC is clearly redundant, because 18 amino acids are encoded by more than one codon, i.e. by 2, 3, 4, or 6 codons. Codons encoding the same amino acid are called synonymous and are organized in groups called blocks or boxes.

The redundancy is a consequence of the necessity of coding 21 items with non-overlapping words of constant length built from four nucleotides. Codons, i.e. words composed of three nucleotides, are enough to encode all these 21 elements. Shorter words, e.g. with a length of two nucleotides, would encode only up to 16 items. According to the adaptation hypothesis, the redundancy of the SGC could have evolved to minimize the adverse effects of mutations or translational errors on coded proteins (Woese 1965; Sonneborn 1965; Epstein 1966; Goldberg and Wittes 1966; Haig and Hurst 1991; Freeland and Hurst 1998; Freeland ; Gilis ). This property makes the coded information more resistant to changes. Mutations between codons encoding the same amino acid are neutral in terms of the coded amino acid. Therefore, the SGC appears to be a good buffer against mutations. The redundancy may also result from the necessity to fill as many codons as possible with sense information, because unassigned codons could pause or break protein synthesis, which would result in shorter products without function or with impaired activity. Moreover, the redundancy enables selected amino acids to be coded by more than one codon, which may increase the frequency of these amino acids in coded proteins.
Encoding a given amino acid by several codons that are used with different frequencies enables regulation of the efficiency and speed of translation, which can be important for correct protein folding (Orešič and Shalloway 1998; Xia 1998; Kanaya ; Akashi 2003; Rocha 2004; Zhou ; Plotkin and Kudla 2011; D’Onofrio and Abel 2014).

Despite these additional roles, the presence of surplus codons suggests that the redundancy could be reduced and exploited to expand the genetic code. The extra codons could then be used to introduce new genetic information into the canonical coding system. The inclusion of non-canonical amino acids (ncAAs) in the code can allow the production of new artificial proteins with novel functions and properties. This approach is very promising for synthetic biology and may find many applications in medicine, industry and biotechnology.

There are several approaches to the expansion of the SGC (Chin 2014). The first one is stop-codon suppression (Noren ; Chin 2017; Italia ; Young and Schultz 2018). In this method, stop codons, especially those that are very rarely used, e.g. UAG, are applied to encode new ncAAs. This technique needs a modified aminoacyl-tRNA synthetase that charges a tRNA molecule with the ncAA. However, this approach has several drawbacks. For example, the SGC can be expanded by at most two new amino acids, because one of the three stop codons must be left to function as a translation termination signal (Ozer ). What is more, the newly added ncAAs could compete with translation release factors, which may affect the speed and efficiency of protein synthesis.

The second method is related to programmed frameshift suppression. In this approach, four-base codons, called quadruplets, are used to encode new ncAAs (Hohsaka ; Anderson ; Neumann ). Generally, these quadruplets are composed of rarely used classical codons with an additional base. They are decoded by modified tRNAs containing complementary four-base anticodons.
It should be noted that the competition between tRNAs reading classical codons and the respective quadruplets can decrease the efficiency of the whole procedure.

The third method postulates the expansion of the SGC by using selected synonymous codons, whose corresponding tRNAs are pre-charged with ncAAs (Iwane ). In this approach, up to 1, 2, 3, and 5 codons from synonymous codon blocks of 2, 3, 4, and 6 codons, respectively, can be used to encode ncAAs, leaving at least one codon for the canonical amino acid. This method can significantly increase the number of ncAAs by using many codon boxes. However, changes in the usage of synonymous codons can disturb translation and protein folding, because codon usage is associated with the speed of protein synthesis (Plotkin and Kudla 2011).

Another approach is based on the addition of one pair of unnatural nucleotides to the canonical four bases (Ishikawa ; Ohtsuki ; Yang ; Kimoto ; Malyshev ; Dien ; Hamashima ). It is thereby possible to generate up to new codons to which ncAAs can be assigned. Since the new genetic information does not involve the canonical codons, it does not interfere with the natural system. For example, the new unnatural codons may not compete with tRNAs charged with canonical amino acids. However, this method must deal with some molecular problems: the pairing efficiency of unnatural bases and their recognition by polymerases during DNA replication. Hopefully, these technical problems can be solved with the development of molecular biology and biological chemistry, so it is interesting to consider the expansion of the SGC from a theoretical point of view, which may be used in experimental solutions.

The theoretical approach to the expansion of the SGC has been recently proposed by Błażej . The authors analyzed how to expand the SGC up to 216 codons generated by a six-letter nucleotide alphabet, which includes one pair of new bases besides the four canonical ones.
The model of the code assumed the gradual addition of codons to minimize the consequences of point mutations.

In this paper, we investigated other theoretical aspects of the SGC expansion using the 64 canonical codons and the code redundancy. We focused on finding the rules of the code expansion via an optimal partition of codon boxes into two parts coding canonical and new information. In the first step, we found the minimal set of codons that encodes the complete canonical information, whereas the vacant codons can be used to encode non-canonical items. At the same time, this code was required to be the most robust to point mutations that could change the information between the canonical and non-canonical parts. We considered codes with various numbers of codons in the canonical set and studied the robustness of the codes against the loss of encoded information, also taking into account the physicochemical properties of amino acids.
Methods
Representation of the genetic code as a graph
We described the properties of the SGC using the methodology of graph theory, which studies graphs, i.e. mathematical structures consisting of objects that are related to each other in some way. In this approach, the objects are represented by vertices (nodes), which are connected by edges (links). This representation is suitable for describing the relationships between all 64 codons of the SGC in terms of point mutations. In this case, vertices are codons, whereas edges are all possible single point mutations that may occur between codons in protein coding sequences. Under this assumption, each codon has nine connections with others, which results from three possible point mutations in each of the three codon positions (Figure 1). For example, codon UUU can mutate into AUU, CUU, and GUU due to mutations in the first codon position, into UAU, UCU, and UGU due to mutations in the second codon position, and into UUA, UUC, and UUG due to mutations in the third codon position. This code representation was successfully used in many problems related to the optimality of the SGC in terms of point mutations (Aloqalaa ; Błażej ; Aloqalaa ). Moreover, some rules of the optimal genetic code expansion using an additional pair of unnatural bases were investigated by Błażej .
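The graph described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' supplementary code; the names `BASES`, `CODONS`, `neighbors`, `GRAPH` and `EDGES` are introduced here for illustration.

```python
from itertools import product

BASES = "UCAG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]  # all 64 codons


def neighbors(codon):
    """Return the codons reachable from `codon` by one base substitution."""
    result = []
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                result.append(codon[:pos] + base + codon[pos + 1:])
    return result


# Adjacency list of the codon graph and its set of undirected edges.
GRAPH = {codon: neighbors(codon) for codon in CODONS}
EDGES = {frozenset((u, v)) for u in CODONS for v in GRAPH[u]}

print(sorted(GRAPH["UUU"]))  # the nine single-substitution neighbors of UUU
print(len(EDGES))            # 64 * 9 / 2 undirected edges
```

Every vertex in this sketch has degree nine, so the graph has 64 · 9 / 2 = 288 undirected edges.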
Figure 1
The representation of the standard genetic code as a graph, in which vertices are represented by 64 codons, whereas edges are all possible single mutations that may occur between these codons. For clarity, the connections between the codons were presented also separately for mutations in three codon positions. In fact, each codon can mutate into nine others with one substitution. The codons can be clustered into four groups differing in one codon position. The codons connected by edges representing mutations in the first codon positions were shown as concentric circles.
In order to describe the SGC as a graph in a more formal way, let us assume that G(V, E) is a graph in which V is the set of vertices representing all 64 codons, whereas E is the set of edges between these vertices. We say that two codons $u, v \in V$ are connected by an edge $e(u, v) \in E$ if and only if the codon u differs from the codon v in exactly one position. In other words, these codons can mutate one into another with one base substitution. Thus, this graph is undirected, because its edges are bidirectional. The graph is also regular, because each vertex has the same number of neighbors, i.e. the same degree. In this case, it is nine. Moreover, the graph presented in Figure 1 is unweighted, because its edges do not have different numerical values, i.e. weights, assigned.

Codons can be clustered in different groups encoding 20 amino acids and the translation termination signal, as well as ncAAs in the case of expanded versions. Thus, following graph theory, we can say that every possible genetic code induces a partition of the codon set V into disjoint non-empty subsets S, i.e. codon groups encoding at least 21 items. This can be written in the formal way as:

$$\mathcal{S} = \{S_1, S_2, \ldots, S_l\}, \quad S_i \neq \emptyset, \quad S_i \cap S_j = \emptyset \ \text{for } i \neq j, \quad \bigcup_{i=1}^{l} S_i = V, \quad l \geq 21.$$
Measures of genetic code robustness to point mutations
A good measure of the amount of information lost due to mutations of codons is related to the set conductance and its modifications (Aloqalaa ; Błażej ; Aloqalaa ; Błażej ). They are defined below.

For a given graph G, let S be a subset of vertices V. The set conductance of S is defined as:

$$\phi(S) = \frac{E(S, \bar{S})}{\mathrm{vol}(S)},$$

where $E(S, \bar{S})$ is the number of edges of graph G crossing from subset S to its complement $\bar{S}$ and vol(S) is the sum of the degrees of the vertices belonging to subset S.

The set conductance has an interesting interpretation. Let us assume that S is a codon group encoding a selected amino acid. Then, $\phi(S)$ is the ratio of non-synonymous substitutions to all possible point mutations for the given codon group S. Therefore, this measure informs us about the robustness of this codon group to single mutations that can change the amino acid coded by this group into amino acids coded by other codon groups $\bar{S}$. For example, the groups of six codons encoding arginine and serine have $\phi(S)$ equal to 36/54 and 40/54, respectively. This observation immediately raises a question about the minimum value of the set conductance for a codon group with a given number of codons, i.e. its size, denoted here as k. It is particularly interesting in the context of the optimal encoding of genetic information by a codon block. In order to study this property, we used another measure called the k-size conductance.

The k-size conductance of the graph G, for the number of codons in a group $k \geq 1$, is defined as:

$$\phi_k(G) = \min_{S \subseteq V,\ |S| = k} \phi(S),$$

where $|S|$ is the number of codons in subset S and $\phi(S)$ is the set conductance.

In other words, it is the minimum possible value of the set conductance for a group consisting of k codons. It is helpful in finding the codon groups that minimize the consequences of changing genetic information due to point mutations. For example, in the case of the codon groups encoding arginine and serine, only the former has the minimal set conductance for its size, i.e. $\phi(S) = \phi_6(G) = 36/54$, whereas the value for the latter is greater.
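The arginine and serine values quoted above can be checked with a short sketch. It is an illustration under simple assumptions (the SGC written as a 64-letter string in the usual UCAG table order, with "*" for stop codons); the helper names `set_conductance` and `block` are hypothetical and not taken from the paper.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in the usual UCAG table order; "*" marks stop codons.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
SGC = dict(zip(CODONS, AMINO))


def neighbors(codon):
    """Codons that differ from `codon` in exactly one position."""
    return [codon[:p] + b + codon[p + 1:]
            for p in range(3) for b in BASES if b != codon[p]]


def set_conductance(group):
    """Return (crossing edges, volume) for a codon group."""
    group = set(group)
    crossing = sum(n not in group for c in group for n in neighbors(c))
    volume = 9 * len(group)  # every codon has degree nine
    return crossing, volume


def block(amino_acid):
    """All codons assigned to one amino acid (one-letter code) in the SGC."""
    return [c for c in CODONS if SGC[c] == amino_acid]


for name, letter in (("Arg", "R"), ("Ser", "S")):
    crossing, volume = set_conductance(block(letter))
    print(f"{name}: {crossing}/{volume}")
```

The ratio is returned as an unreduced pair so that it can be compared directly with the fractions 36/54 and 40/54 given in the text.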
To evaluate the general quality of a genetic code, we introduced a third characteristic, called the average conductance of a set collection.

Let $\mathcal{S}_l = \{S_1, S_2, \ldots, S_l\}$ be a set collection that fulfills the following property:

$$S_i \subseteq V, \quad S_i \neq \emptyset, \quad S_i \cap S_j = \emptyset \ \text{for } i \neq j.$$

The average conductance of $\mathcal{S}_l$ is then defined as:

$$\Phi(\mathcal{S}_l) = \frac{1}{l} \sum_{i=1}^{l} \phi(S_i),$$

where l is the number of subsets S in a code partition and $\phi(S_i)$ is the set conductance of an individual subset.

This measure is the average robustness of a code to the consequences of point mutations that can occur between the codon blocks S of this code. $\Phi(\mathcal{S}_l)$ is a generalization of the average code conductance calculated for the SGC (Aloqalaa , 2020). In fact, if the number of codon groups l = 21 and $S_1, \ldots, S_{21}$ are codon blocks arranged as in the SGC, then $\Phi(\mathcal{S}_{21})$ is the average conductance for the SGC.
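As an illustration of the average conductance, the following sketch evaluates it for the 21 codon blocks of the SGC with exact fractions. It assumes the same string encoding of the SGC as a 64-letter table in UCAG order; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

BASES = "UCAG"
# Standard genetic code in the usual UCAG table order; "*" marks stop codons.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
SGC = dict(zip(CODONS, AMINO))
NEIGHBORS = {c: {c[:p] + b + c[p + 1:]
                 for p in range(3) for b in BASES if b != c[p]}
             for c in CODONS}


def set_conductance(group):
    """Crossing edges divided by the volume (9 per codon) of the group."""
    group = set(group)
    crossing = sum(len(NEIGHBORS[c] - group) for c in group)
    return Fraction(crossing, 9 * len(group))


def average_conductance(blocks):
    """Arithmetic mean of the set conductances of the given codon blocks."""
    return sum(set_conductance(b) for b in blocks) / len(blocks)


# The 21 blocks of the SGC: 20 amino acids plus the stop signal.
sgc_blocks = {}
for codon, item in SGC.items():
    sgc_blocks.setdefault(item, set()).add(codon)

phi_sgc = average_conductance(sgc_blocks.values())
print(len(sgc_blocks), phi_sgc, float(phi_sgc))
```

Using `Fraction` avoids rounding when block conductances such as 16/18 and 23/27 are averaged.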
Finding the optimal codon groups in terms of changing genetic information
The characteristics presented above proved useful in studying the structural properties of genetic codes. However, they require a fast and effective method for determining the optimal conductance for groups with a specific number of codons k. Fortunately, the graph G including 64 codons possesses many interesting properties that are helpful in solving the optimality problem. First of all, this graph can be represented as a Cartesian graph product, i.e. the set of all ordered combinations, of three cliques:

$$G = K_4 \times K_4 \times K_4,$$

where $K_4$ is a 4-vertex clique, i.e. the set of four vertices corresponding to the nucleotides {A, C, G, U}, such that every two distinct vertices are adjacent. This property allows us to characterize the set of codons reaching the minimal set conductance among all possible subsets with a given codon number k. The following proposition presented in Aloqalaa , 2020) is a natural consequence of the Theorem 1 given by Bezrukov and Elsässer (2003).

Proposition 1. Let us consider a linear order of the set of vertices of the 4-clique $K_4$, and let $C_k$ be the set of the first k vertices of G in the lexicographic order. Then we get:

$$E(C_k, \bar{C_k}) = \min_{S \subseteq V,\ |S| = k} E(S, \bar{S}),$$

for any $1 \leq k \leq 64$. Therefore, the following equation holds for any $1 \leq k \leq 64$:

$$\phi(C_k) = \phi_k(G).$$

As a result, each sequence of codons of graph G sorted according to a given lexicographic order reaches the minimum of the set conductance over all possible sets of k codons. We used the notation $C_k$ in the whole paper to denote a general set of the first k codons in the lexicographic order. Briefly, sorting codons according to the lexicographic order enabled us to find the codon groups that minimize changes in the coded genetic information, e.g. substitutions between codons encoding different information. It should be noted that there are exactly 144 different lexicographic codon orders that can be used to build a genetic code as a graph G. This number results from all possible linear orders of the four nucleotides and all possible orders of the three codon positions: $4! \cdot 3! = 144$.
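The lexicographic construction can be illustrated with a small sketch that enumerates the 144 orders and checks, by brute force for very small k only, that every lexicographic prefix attains the minimal number of crossing edges. The convention of listing codon positions from most to least significant is an assumption of this sketch, and the function names are hypothetical.

```python
from itertools import combinations, permutations, product

BASES = "UCAG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
NEIGHBORS = {c: {c[:p] + b + c[p + 1:]
                 for p in range(3) for b in BASES if b != c[p]}
             for c in CODONS}


def crossing_edges(group):
    """Number of edges leaving the codon group."""
    group = set(group)
    return sum(len(NEIGHBORS[c] - group) for c in group)


def lexicographic_orders():
    """Yield all 4! * 3! = 144 lexicographic codon orders."""
    for bases in permutations(BASES):             # linear order of nucleotides
        rank = {b: i for i, b in enumerate(bases)}
        for positions in permutations(range(3)):  # most significant first
            yield sorted(CODONS,
                         key=lambda c: [rank[c[p]] for p in positions])


ORDERS = list(lexicographic_orders())


def brute_force_minimum(k):
    """Minimal number of crossing edges over all k-codon subsets (small k)."""
    return min(crossing_edges(s) for s in combinations(CODONS, k))


for k in (1, 2, 3):
    best = brute_force_minimum(k)
    print(k, best, all(crossing_edges(o[:k]) == best for o in ORDERS))

# Conductance of the first six codons of one order.
print(len(ORDERS), crossing_edges(ORDERS[0][:6]) / (9 * 6))
```

Exhaustive search is feasible only for small k, since the number of k-codon subsets grows very quickly; this is exactly why the prefix characterization is useful for larger k.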
Inclusion of physicochemical properties of amino acids
The model of the genetic code described above assumed equal consequences of substitutions between all amino acids. This general assumption allowed us to find analytically the optimal codon groups minimizing amino acid changes due to point mutations in codons. This would not be possible after the inclusion of amino acid properties, although the model would then contain important information. Nevertheless, we also calculated the costs of amino acid substitutions, taking into account the physicochemical properties of these amino acids, for the optimal codes found according to the lexicographic approach described in the previous section. These calculations were done for the codon groups that encoded canonical amino acids but not for those encoding newly incorporated ncAAs, because we do not know their properties a priori.

We included eight amino acid indices describing their physicochemical properties: BLAM930101, BIOV880101, MAXF760101, TSAJ990101, NAKH920108, CEDJ970104, LIFS790101, and MIYS990104. They represent diverse features, such as: electric properties (isoelectric point and polarity), hydrophobicity, alpha-helix and turn propensities, general physicochemical properties, residue propensity (molecular weight, average accessible surface area, and mutability), composition, beta-strand propensity, and intrinsic propensities (hydration potential, refractivity, optical activity, and flexibility). The indices were selected as representatives out of more than 500 amino acid indices present in the AAindex database (Kawashima ) using a consensus fuzzy clustering method (Saha ).

Based on these indices, we calculated the F function for the canonical part of a genetic code, which is defined as:

$$F = \sum_{n=1}^{8} \sum_{(i,j) \in D} \left( a_n(i) - a_n(j) \right)^2,$$

where D is the set of all possible pairs of codons i and j from the canonical part of the genetic code that differ by a single point mutation, whereas $a_n(i)$ and $a_n(j)$ are the values of amino acid index n for the amino acids encoded by these codons, respectively. Simply speaking, this function is the sum of squared differences in eight amino acid properties. In the case of mutations involving stop translation codons, we assumed the maximum possible squared difference over all possible pairs of amino acids for the given amino acid index. The values of the corresponding amino acid indices were standardized by dividing by the maximum squared difference of the given index. The final F values were additionally normalized by the total number of codons k belonging to the canonical subset of the considered code.
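A minimal sketch of this cost calculation is given below. It rests on several assumptions that are not taken from the paper: the Kyte-Doolittle hydropathy scale is used as a stand-in for the eight AAindex entries, codon pairs are counted once (unordered), substitutions between two stop codons are treated as cost-free, and the name `normalized_cost` is hypothetical.

```python
from itertools import combinations, product

BASES = "UCAG"
# Standard genetic code in the usual UCAG table order; "*" marks stop codons.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
SGC = dict(zip(CODONS, AMINO))

# Stand-in index (Kyte-Doolittle hydropathy), NOT one of the eight indices
# used in the paper; it only illustrates the shape of the calculation.
HYDROPATHY = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
              "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
              "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
              "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}


def normalized_cost(canonical, indices):
    """Sum of standardized squared differences over single-mutation codon
    pairs inside `canonical`, divided by the number of canonical codons."""
    canonical = sorted(set(canonical))
    pairs = [(u, v) for u, v in combinations(canonical, 2)
             if sum(a != b for a, b in zip(u, v)) == 1]
    total = 0.0
    for index in indices:
        values = list(index.values())
        max_sq = (max(values) - min(values)) ** 2  # maximum squared difference
        for u, v in pairs:
            a, b = SGC[u], SGC[v]
            if a == b:
                continue            # synonymous change (assumed cost-free)
            if "*" in (a, b):
                total += 1.0        # stop involved: maximum standardized cost
            else:
                total += (index[a] - index[b]) ** 2 / max_sq
    return total / len(canonical)


print(normalized_cost(CODONS, [HYDROPATHY]))
```

Passing a list of several index dictionaries sums their standardized contributions, which mirrors the sum over n in the definition of F.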
Data availability
The computations were conducted using the Python 3.9.1 programming language. All source code and raw data relevant to our investigations are included in the supplementary material at figshare: https://figshare.com/s/10.25386/genetics.14079452.
Results and Discussion
We began our investigation by finding the smallest set of codons that encodes all 20 amino acids and the stop signal, preserves the canonical codon assignments, and is simultaneously optimal in terms of changes of genetic information between the set of canonical codons and the vacant codons, which can encode ncAAs. We discussed different scenarios of reprogramming the SGC assuming various numbers of vacant codons. We used the average conductance as a measure of the quality of given genetic code structures, i.e. codon blocks. We also discussed structural features of the canonical codes in terms of their robustness against changes in encoded amino acids.

In the construction of the codes, we assumed their robustness to changes causing the loss of genetic information due to mutations. This assumption follows the adaptation hypothesis, which claims that the SGC evolved to minimize harmful consequences of mutations or mistranslations of coded proteins (Woese 1965; Sonneborn 1965; Epstein 1966; Goldberg and Wittes 1966; Haig and Hurst 1991; Freeland and Hurst 1998; Freeland ; Gilis ). Although this code did not turn out to be perfectly optimized in this respect (Błażej ; Massey 2008; Novozhilov ; Santos ; Santos and Monteagudo 2017; Wnętrzak ; Błażej , 2019b; Wnętrzak ), it shows a general tendency to error minimization on the global scale. This property is better exhibited by its alternative versions (Błażej , 2019a), which occurred later in evolution. Therefore, the analysis of the genetic code expansion in this context seems to be a natural consequence of its evolution.
The smallest set of codons encoding canonical information
It is well known that the SGC is redundant, which means that a smaller number of codons is enough to encode all 20 canonical amino acids and one stop translation signal. Therefore, the set encoding the canonical information can be reduced to a smaller number of codons, allowing new genetic information to be encoded by the set of vacant codons. It seems reasonable to postulate some conditions that must be met to obtain minimalistic genetic codes encoding the canonical genetic information, which can be a potential starting point for further analysis of genetic code expansion. We assumed that this codon set must be optimal in terms of the set conductance $\phi$, which means that, for a given number k of codons in the set, the number of connections between the canonical information and the set of vacant codons is as small as possible. In other words, the number of mutations between the canonical and the non-canonical codes should be minimal. This assumption has a sensible biological meaning because it reduces the possibility of unwanted changes between the new and the old genetic information.

Following Proposition 1, the first k codons in a lexicographic order, $C_k$, constitute the set with the minimum set conductance over all possible sets with k codons. In consequence, this codon set is the most resistant to losing information. An example of such a set is shown in Table 1 for eight codons. This property poses a question about the minimum number of codons k such that there exists a set $C_k$ composed of codons that encode 20 amino acids and the stop translation signal. In order to deal with this problem, we denote as $\mathcal{C}_k$ a partition of the set $C_k$ of k lexicographically ordered codons that encode 21 canonical items creating a code, i.e. $\mathcal{C}_k$ is a set collection of codons encoding canonical information:

$$\mathcal{C}_k = \{S_1, S_2, \ldots, S_{21}\}, \quad \bigcup_{i=1}^{21} S_i = C_k,$$

where each $S_i$ is a non-empty set of codons encoding one of the 21 items according to the SGC rules.
Table 1
The example of the codon set C for k = 8, which is a sequence of the first eight codons taken in a selected lexicographic order
Codon | Amino acid
AAA | Lys
AAC | Asn
AAG | Lys
AAU | Asn
ACA | Thr
ACC | Thr
ACG | Thr
ACU | Thr
According to Proposition 1, this set is characterized by the minimal set conductance over all sets with the size of k = 8. The codons have assigned encoded amino acids as in the standard genetic code.
We tested all possible sets $C_k$ encoding 21 canonical items induced by all 144 codon orders and the SGC assignments. Using this method, we found that k = 28 is the minimal number of codons in the set $C_k$ that induces a partition $\mathcal{C}_k$ and encodes 20 amino acids and the stop signal. In fact, there are two lexicographic orders that produce such a code. The first is induced by a linear order between nucleotides and an order between codon positions 1 < 2 < 3. The second is generated by a linear order between nucleotides and an order between codon positions 1 < 2 < 3.

Tables 2 and 3 show the 64 codons arranged as in the classical SGC table to present the structure of the optimal C28 codon set. The codons belonging to the canonical part of this code are labeled with their canonical assignments, whereas the vacant codons are left unlabeled. In the first, third and fourth column of the tables, two codons in each block of four codons differing in the third codon position encode a classical amino acid or the stop signal, whereas in the second column, only one codon in the block encodes an amino acid. Interestingly, only codons ending with G and U are involved in the coding of the canonical information. These two codes differ only in the assignment of four amino acids in the second column of the tables. In the first code, these amino acids are coded by codons ending with G, whereas in the second code, the codons end with U. This codon selection minimizes the number of mutations that change the canonical information into the non-canonical information coded by the vacant codons.
At the same time, all 20 canonical amino acids and at least one stop translation signal are included in the code.
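The search for the smallest canonical prefix can be sketched as follows. This is an illustrative reimplementation rather than the authors' code, and it assumes that one nucleotide order is shared by all three codon positions and that positions are listed from most to least significant; `smallest_complete_prefix` is a hypothetical helper.

```python
from itertools import permutations, product

BASES = "UCAG"
# Standard genetic code in the usual UCAG table order; "*" marks stop codons.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
SGC = dict(zip(CODONS, AMINO))


def smallest_complete_prefix(order):
    """Smallest k such that the first k codons encode all 21 items."""
    seen = set()
    for k, codon in enumerate(order, start=1):
        seen.add(SGC[codon])
        if len(seen) == 21:
            return k


results = []
for bases in permutations(BASES):               # 4! nucleotide orders
    rank = {b: i for i, b in enumerate(bases)}
    for positions in permutations(range(3)):    # 3! position orders
        order = sorted(CODONS, key=lambda c: [rank[c[p]] for p in positions])
        k = smallest_complete_prefix(order)
        results.append((k, "".join(bases), positions, order[:k]))

k_min = min(r[0] for r in results)
best = [r for r in results if r[0] == k_min]
print(k_min, len(best))
for k, bases, positions, prefix in best:
    print(bases, positions, sorted(prefix))
```

Under these conventions the search reproduces the minimum k = 28 and finds two orders that attain it, and the printed prefixes can be compared with the labeled codons in Tables 2 and 3.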
Table 2
The smallest set of 28 codons encoding canonical information and minimizing changes of information between the canonical (labeled according to canonical assignments) and non-canonical (unassigned codons) partition of the code
UUU Phe | UCU | UAU Tyr | UGU Cys
UUC | UCC | UAC | UGC
UUA | UCA | UAA | UGA
UUG Leu | UCG Ser | UAG Stop | UGG Trp
CUU Leu | CCU | CAU His | CGU Arg
CUC | CCC | CAC | CGC
CUA | CCA | CAA | CGA
CUG Leu | CCG Pro | CAG Gln | CGG Arg
AUU Ile | ACU | AAU Asn | AGU Ser
AUC | ACC | AAC | AGC
AUA | ACA | AAA | AGA
AUG Met | ACG Thr | AAG Lys | AGG Arg
GUU Val | GCU | GAU Asp | GGU Gly
GUC | GCC | GAC | GGC
GUA | GCA | GAA | GGA
GUG Val | GCG Ala | GAG Glu | GGG Gly
These codons were chosen according to a lexicographic order induced by the linear order of nucleotides and the order of codon positions 1 < 2 < 3.
Table 3
The smallest set of 28 codons encoding canonical information and minimizing changes of information between the canonical (labeled according to canonical assignments) and non-canonical (unassigned codons) partition of the code
UUU Phe | UCU Ser | UAU Tyr | UGU Cys
UUC | UCC | UAC | UGC
UUA | UCA | UAA | UGA
UUG Leu | UCG | UAG Stop | UGG Trp
CUU Leu | CCU Pro | CAU His | CGU Arg
CUC | CCC | CAC | CGC
CUA | CCA | CAA | CGA
CUG Leu | CCG | CAG Gln | CGG Arg
AUU Ile | ACU Thr | AAU Asn | AGU Ser
AUC | ACC | AAC | AGC
AUA | ACA | AAA | AGA
AUG Met | ACG | AAG Lys | AGG Arg
GUU Val | GCU Ala | GAU Asp | GGU Gly
GUC | GCC | GAC | GGC
GUA | GCA | GAA | GGA
GUG Val | GCG | GAG Glu | GGG Gly
These codons were chosen according to a lexicographic order induced by the linear order of nucleotides and the order of codon positions 1 < 2 < 3.
Properties of the codes in terms of robustness to point mutations
We compared the quality of the obtained codes in terms of the average conductance $\Phi(\mathcal{C}_k)$ to find out to what extent these codes minimize the consequences of point mutations between the codon blocks in the collection $\mathcal{C}_k$ encoding the canonical information. We considered codes with an increasing number of these codons at the expense of the codons for non-canonical information. We also present $\Phi(\mathcal{C}_k)$ generated for the two lexicographic orders for which we found the smallest set of codons C28 encoding the canonical information.

Figure 2 presents the relationship between $\Phi(\mathcal{C}_k)$ and the codon number k calculated for the two lexicographic orders for which we found the smallest coding set C28 (the blue and orange lines). The lower bound of $\Phi(\mathcal{C}_k)$ calculated over all 144 orders is also shown for comparison (the green line). This line corresponds to the codes whose structure allows for the best possible minimization of substitutions between the coded canonical items. As we can observe, in all considered cases the average conductance decreases with the number of codons k involved in the code. This trend is related to the increasing redundancy of the code for the same number of coded items. The maximum is reached at k = 28 and is equal , whereas the minimum equals to for all set collections at k = 64. It should be noted that $\Phi(\mathcal{C}_{64})$ corresponds to the average code conductance calculated for the SGC and was discussed in Aloqalaa , 2020). However, the presented relationships are not strictly monotonic, because local changes in the course of this trend occur. What is more, the two lexicographic orders that generate the smallest codon sets C28 generally do not induce the optimal collections of sets for k > 28 in terms of $\Phi(\mathcal{C}_k)$. In other words, it is not possible to generate, for each k > 28, a set collection $\mathcal{C}_k$ using the lexicographic orders shown in Tables 2 and 3 that would be minimal in terms of $\Phi(\mathcal{C}_k)$.
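The lower bound over all orders can be sketched as below. The sketch assumes that set conductance is computed in the full 64-codon graph, so that edges from a canonical block to vacant codons count as crossing edges, and that only prefixes encoding all 21 items are considered; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product

BASES = "UCAG"
# Standard genetic code in the usual UCAG table order; "*" marks stop codons.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
SGC = dict(zip(CODONS, AMINO))
NEIGHBORS = {c: {c[:p] + b + c[p + 1:]
                 for p in range(3) for b in BASES if b != c[p]}
             for c in CODONS}

ORDERS = []
for bases in permutations(BASES):
    rank = {b: i for i, b in enumerate(bases)}
    for positions in permutations(range(3)):  # most significant first
        ORDERS.append(sorted(CODONS,
                             key=lambda c: [rank[c[p]] for p in positions]))


def set_conductance(group):
    """Crossing edges (including edges to vacant codons) over the volume."""
    crossing = sum(len(NEIGHBORS[c] - group) for c in group)
    return Fraction(crossing, 9 * len(group))


def canonical_blocks(prefix):
    """Group the canonical codons by the item they encode in the SGC."""
    blocks = {}
    for codon in prefix:
        blocks.setdefault(SGC[codon], set()).add(codon)
    return list(blocks.values())


def lower_bound(k):
    """Smallest average conductance over all orders whose first k codons
    encode all 21 items; None if no order qualifies."""
    values = []
    for order in ORDERS:
        blocks = canonical_blocks(order[:k])
        if len(blocks) == 21:
            values.append(sum(map(set_conductance, blocks)) / 21)
    return min(values) if values else None


for k in (28, 40, 55, 64):
    print(k, float(lower_bound(k)))
```

At k = 64 every order yields the full SGC partition, so all orders give the same value there.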
Figure 2
The relationship between the average conductance and the number of codons in the code calculated for two lexicographic orders, for which we found the smallest set coding canonical information C28 (blue and orange lines). The lower bound calculated over 144 orders is shown for comparison (green line).
We also analyzed these codes in terms of the consequences of amino acid substitutions considering their physicochemical properties. Figure 3 presents the relationship between the smallest possible average costs of amino acid replacements and the number of codons encoding these amino acids in the codes that minimize changes between the canonical and non-canonical information. The costs were normalized by the number of codons for the canonical information in the corresponding code. Interestingly, the maximum of this normalized cost is reached by the code with all 64 codons, i.e. the SGC, and the minimum by the code including 55 codons for the canonical information (Table 4). This code has nine released codons, which can be used to code ncAAs. Interestingly, these codons have U in the first codon position, and two of them encode the stop translation signal in the SGC. Moreover, these codons encode two amino acids that are very rarely used, i.e. cysteine and tyrosine, as well as amino acids that are abundant in proteins and can be coded by six codons, i.e. leucine and serine. Therefore, reprogramming of these codons seems sensible.
Figure 3
The relationship between the smallest possible average costs of amino acid substitutions regarding their physicochemical properties based on F function and the number of codons in the canonical codes that minimize changes between the canonical and non-canonical information. The costs were normalized by the number of codons for the canonical information in the corresponding code.
Table 4
The set of 55 codons (labeled according to canonical assignments) encoding canonical information and minimizing changes of information between the canonical and non-canonical (unassigned codons) partition of the code
UUU         UCU         UAU         UGU
UUC Phe     UCC Ser     UAC Tyr     UGC Cys
UUA         UCA         UAA         UGA
UUG         UCG Ser     UAG Stop    UGG Trp
CUU Leu     CCU Pro     CAU His     CGU Arg
CUC Leu     CCC Pro     CAC His     CGC Arg
CUA Leu     CCA Pro     CAA Gln     CGA Arg
CUG Leu     CCG Pro     CAG Gln     CGG Arg
AUU Ile     ACU Thr     AAU Asn     AGU Ser
AUC Ile     ACC Thr     AAC Asn     AGC Ser
AUA Ile     ACA Thr     AAA Lys     AGA Arg
AUG Met     ACG Thr     AAG Lys     AGG Arg
GUU Val     GCU Ala     GAU Asp     GGU Gly
GUC Val     GCC Ala     GAC Asp     GGC Gly
GUA Val     GCA Ala     GAA Glu     GGA Gly
GUG Val     GCG Ala     GAG Glu     GGG Gly
These codons were chosen according to a lexicographic order induced by the linear order of nucleotides and the order of codon positions 1 < 3 < 2. This code also shows the smallest possible average cost of amino acid replacements, considering their physicochemical properties, normalized by the codon number.
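The selection rule described in the note to Table 4 can be sketched in a few lines of Python. The position order 1 < 3 < 2 is read here as "position 1 is the most significant, then position 3, then position 2". The nucleotide order C < G < A < U is an assumption inferred from the unlabeled codons in Table 4, not a value stated in this section. Under these assumptions, the first 55 codons of the order leave exactly the nine unlabeled codons of Table 4 vacant.

```python
from itertools import product

def lexicographic_order(nucleotide_order="CGAU", position_order=(1, 3, 2)):
    """Return all 64 codons sorted lexicographically.

    nucleotide_order: assumed linear order of nucleotides (smallest first).
    position_order:   codon positions from most to least significant.
    """
    rank = {b: i for i, b in enumerate(nucleotide_order)}
    codons = ["".join(c) for c in product("UCAG", repeat=3)]
    return sorted(codons, key=lambda c: tuple(rank[c[p - 1]] for p in position_order))

order = lexicographic_order()
canonical = order[:55]   # prefix of the order: codons kept for canonical information
vacant = order[55:]      # remaining codons, free for ncAAs
print(vacant)
```

Combining the 4! = 24 linear orders of nucleotides with the 3! = 6 orders of codon positions gives 144 lexicographic orders, which is consistent with the number of orders mentioned in the text.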
Table 5 presents how many times the individual codons were selected as vacant in the 37 codes that minimized changes of information between the canonical and non-canonical partitions and showed the smallest possible average costs of amino acid replacements considering their physicochemical properties. Interestingly, codons encoding the stop translation signal in the SGC were released most often. However, the UAG codon, which is often used in experimental approaches due to its low usage in protein coding sequences (Noren et al.; Chin 2017; Italia et al.), is not among them. Our algorithm preferred the other stop codons because it applies a different criterion, i.e. the minimization of changes between the canonical and non-canonical information. The next most frequently selected codons in the non-canonical partition are those with A in the second and third codon positions, encoding lysine, glutamic acid and glutamine. On the other hand, codons with G in the third codon position were released very rarely or not at all. Two of these codons are the only ones that encode methionine and tryptophan.
Table 5
The number of times when the individual codons were selected as vacant in the codes minimizing changes of information between the canonical and non-canonical partition and showing the smallest possible average costs of amino acid replacements considering their physicochemical properties
UUU Phe   8     UCU Ser   8     UAU Tyr   9     UGU Cys   7
UUC Phe   7     UCC Ser  11     UAC Tyr  13     UGC Cys   8
UUA Leu  27     UCA Ser  25     UAA Stop 36     UGA Stop 31
UUG Leu   1     UCG Ser   1     UAG Stop  0     UGG Trp   0
CUU Leu   5     CCU Pro   5     CAU His   5     CGU Arg   4
CUC Leu   7     CCC Pro  13     CAC His  13     CGC Arg  10
CUA Leu  22     CCA Pro  23     CAA Gln  29     CGA Arg  25
CUG Leu   0     CCG Pro   3     CAG Gln   0     CGG Arg   0
AUU Ile   5     ACU Thr   5     AAU Asn   7     AGU Ser   4
AUC Ile   8     ACC Thr  12     AAC Asn  14     AGC Ser  10
AUA Ile  24     ACA Thr  24     AAA Lys  31     AGA Arg  27
AUG Met   0     ACG Thr   2     AAG Lys   0     AGG Arg   0
GUU Val   5     GCU Ala   4     GAU Asp   5     GGU Gly   4
GUC Val   7     GCC Ala  10     GAC Asp  13     GGC Gly   9
GUA Val  22     GCA Ala  22     GAA Glu  30     GGA Gly  22
GUG Val   0     GCG Ala   1     GAG Glu   0     GGG Gly   0
The results were obtained from the set of 37 codes in which 28–63 codons encoded the canonical information.
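The tendencies read from Table 5 can be checked by summing the counts by nucleotide at a chosen codon position. The following Python sketch copies the counts from Table 5; the helper name totals_by_position is hypothetical and serves only this illustration.

```python
from collections import Counter

# Counts copied from Table 5: codon -> number of codes in which it was vacant.
TABLE5 = """
UUU 8  UCU 8  UAU 9  UGU 7
UUC 7  UCC 11 UAC 13 UGC 8
UUA 27 UCA 25 UAA 36 UGA 31
UUG 1  UCG 1  UAG 0  UGG 0
CUU 5  CCU 5  CAU 5  CGU 4
CUC 7  CCC 13 CAC 13 CGC 10
CUA 22 CCA 23 CAA 29 CGA 25
CUG 0  CCG 3  CAG 0  CGG 0
AUU 5  ACU 5  AAU 7  AGU 4
AUC 8  ACC 12 AAC 14 AGC 10
AUA 24 ACA 24 AAA 31 AGA 27
AUG 0  ACG 2  AAG 0  AGG 0
GUU 5  GCU 4  GAU 5  GGU 4
GUC 7  GCC 10 GAC 13 GGC 9
GUA 22 GCA 22 GAA 30 GGA 22
GUG 0  GCG 1  GAG 0  GGG 0
"""
tokens = TABLE5.split()
vacant_counts = {tokens[i]: int(tokens[i + 1]) for i in range(0, len(tokens), 2)}

def totals_by_position(counts, position):
    """Sum the vacancy counts by the nucleotide found at `position` (1, 2 or 3)."""
    totals = Counter()
    for codon, n in counts.items():
        totals[codon[position - 1]] += n
    return dict(totals)

third = totals_by_position(vacant_counts, 3)
print(third)
```

With the Table 5 values, codons ending with A were selected as vacant 420 times in total and codons ending with G only 8 times, which agrees with the preference described in the text.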
Properties of expanded genetic code
The reduction in the number of codons encoding the canonical amino acids and the stop translation signal, as shown in the section The smallest set of codons encoding canonical information, implies that the remaining codons can be used to encode ncAAs. Mathematically speaking, the set C of k codons encoding the canonical information induces its own complement, i.e. the set of vacant codons for ncAAs. Thus, the new genetic information would be encoded by codon blocks, which constitute a partition of this complement. In consequence, we introduced a set collection of n codon blocks for the new genetic information, where each block is a non-empty set of codons that encodes the same genetic information, e.g. a specific ncAA. The new set collection together with the set collection encoding the canonical information constitutes an expanded genetic code, which encodes exactly n new ncAAs and 21 items of canonical genetic information.
Please note that, according to the definition of the canonical codon set, the number of connections between the canonical and non-canonical parts of the expanded code is as small as possible, which may result in a low probability of reversion between the new and old information. This is very useful from an experimental point of view, when we want to keep the information about the canonical amino acids and the stop translation signal and simultaneously not lose the new information encoded in the vacant codons.
Similarly to the average conductance of the canonical code, it is theoretically possible to calculate this measure for the set collection encoding ncAAs. Finally, the average conductance of the whole expanded genetic code can be derived to assess its optimality in terms of point mutations. However, these measures can be obtained only when the assignments of individual ncAAs to the vacant codons are known, because for a fixed number k of codons in the canonical set and a fixed number n of ncAAs coded in the non-canonical set there are many possible set collections, which differ in their average conductance values. Therefore, we decided to find a lower bound on the average conductance of the non-canonical set collection for fixed k and n.
It can be done using the representation (1) of the non-canonical set collection as well as the definition (2) of the k-size conductance and the simple observation that the set conductance of every codon block is at least as large as the k-size conductance for the size of this block. It means that the average conductance of the non-canonical part of the code is greater than or equal to the average of the respective k-size conductances of codon blocks that encode the n ncAAs and have the optimal structure in terms of the set conductance.
Therefore, for every k and n, there exists a lower bound on the average conductance of the set collection imposed on its codon set. What is more, these optimal collections are composed of the best codon blocks in terms of the k-size conductance, i.e. the minimum possible value of the set conductance for a group consisting of k codons. This feature gives us a general overview of the optimal structures of the genetic code expansions including the selected number n of ncAAs.
Following the property (3), we found all possible lower bounds for every k and n. Figure 4 presents their graphical representations. As we can see, the lower bound on the average conductance increases with the number n of coded ncAAs.
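The lower bound discussed above can be written out explicitly. The notation in the following LaTeX sketch is an assumption made for illustration, because the symbols used for these quantities are not given in this section; only the structure of the inequality follows the text.

```latex
% Assumed notation (not taken from the source):
%   \mathcal{N} = \{N_1, \dots, N_n\}  blocks of vacant codons, one per ncAA
%   \phi(S)                            set conductance of a codon set S
%   \phi_m(G)                          m-size conductance: minimum of \phi(S) over all S with |S| = m
%   \overline{\Phi}(\mathcal{N})       average conductance of the collection \mathcal{N}
\begin{align*}
\phi(N_i) &\ge \phi_{|N_i|}(G) \quad \text{for every } i = 1, \dots, n, \\
\overline{\Phi}(\mathcal{N}) = \frac{1}{n} \sum_{i=1}^{n} \phi(N_i)
  &\ge \frac{1}{n} \sum_{i=1}^{n} \phi_{|N_i|}(G),
  \qquad \text{where } \sum_{i=1}^{n} |N_i| = 64 - k .
\end{align*}
```

The right-hand side depends only on the block sizes, so the bound for given k and n is obtained by minimizing it over all ways of splitting the 64 - k vacant codons into n non-empty blocks.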
Figure 4
The lower bound of the average conductance calculated for the set collection encoding ncAAs in relation to the number n of coded ncAAs. The relationship is presented for all possible partitions of the set of vacant codons, which encode ncAAs, with k being the number of codons in the canonical set.
This relationship shows an interesting course. For example, for k = 28 (Figure 5), the curve of the lower bound increases with n but slows down for n close to 9 and then rises sharply again for n > 9. This results from the fact that the optimal partitions contain no set with fewer than four codons for n ≤ 9, whereas such sets must appear for n > 9. Since the k-size conductance for groups of k = 1, 2, 3 codons is larger than for more numerous groups, a set collection containing a codon group of size lower than four generally has a higher average conductance than collections composed of codon blocks of size greater than or equal to four. This fact could explain the presence of the changing point at n = 9. This phenomenon is also observed for other values of k at the respective changing points (Figures 4 and 5).
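The claim about very small codon groups can be checked exhaustively. The Python sketch below assumes that the set conductance is the number of point-mutation edges leaving a block divided by 9 times the block size, and that the k-size conductance is the minimum of this value over all k-codon sets. The function names are introduced here for illustration, and the exhaustive search is practical only for small k.

```python
from fractions import Fraction
from itertools import combinations, product

BASES = "UCAG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]

def adjacent(a, b):
    """True when codons a and b differ in exactly one position."""
    return sum(x != y for x, y in zip(a, b)) == 1

def set_conductance(block):
    """Assumed definition: edges leaving the block / (9 * block size)."""
    inner = sum(adjacent(a, b) for a, b in combinations(block, 2))
    # Every codon has 9 neighbors; each inner edge uses up two of these slots.
    return Fraction(9 * len(block) - 2 * inner, 9 * len(block))

def k_size_conductance(k):
    """Exhaustive minimum over all k-codon sets; feasible only for small k."""
    return min(set_conductance(s) for s in combinations(CODONS, k))

for k in (1, 2, 3):
    print(k, k_size_conductance(k))
```

Under this assumed definition, the values for k = 1, 2, 3 are 1, 8/9 and 7/9, all larger than the 2/3 obtained for a four-codon block whose codons differ only in one position.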
Figure 5
The minimum of the average set conductance (blue line) in relation to the number n of coded ncAAs. The minimum was found over all possible partitions of the set containing 28 codons for the canonical information and 36 vacant codons for n ncAAs. The red dashed line shows the minimum of the average set conductance obtained for n = 9. As we can see, n = 9 is the point at which the rate of increase of the curve changes.
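A curve of this kind can be approximated with a short dynamic program over block sizes. The Python sketch below is not the authors' procedure. It assumes that the set conductance is the number of edges leaving a block divided by 9 times its size, and that a prefix of a lexicographic codon order attains the minimum conductance for its size. All function names are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product

BASES = "UCAG"
ORDER = ["".join(c) for c in product(BASES, repeat=3)]  # one lexicographic order

@lru_cache(maxsize=None)
def prefix_conductance(m):
    """Set conductance of the first m codons of ORDER.

    Assumption: a lexicographic prefix attains the m-size conductance.
    """
    block = set(ORDER[:m])
    leaving = sum(c[:i] + b + c[i + 1:] not in block
                  for c in block for i in range(3) for b in BASES if b != c[i])
    return Fraction(leaving, 9 * m)

@lru_cache(maxsize=None)
def min_total(codons, blocks):
    """Smallest sum of prefix conductances over `blocks` sizes adding up to `codons`."""
    if blocks == 1:
        return prefix_conductance(codons)
    # The first block takes s codons and leaves at least one codon per remaining block.
    return min(prefix_conductance(s) + min_total(codons - s, blocks - 1)
               for s in range(1, codons - blocks + 2))

def lower_bound(vacant, n):
    """Lower bound on the average conductance of n blocks covering `vacant` codons."""
    return min_total(vacant, n) / n

for n in (1, 9, 10, 36):
    print(n, float(lower_bound(36, n)))
```

For 36 vacant codons and n = 9, the sketch returns 2/3, which corresponds to nine blocks of four codons each; for n = 36 every block is a single codon and the bound equals 1.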
The balance of expanded genetic code
Using equation (2), we can compare the structural differences between the canonical and non-canonical parts of the expanded code. In order to do so, we introduced a balance measure defined as the ratio of the average conductance of the non-canonical code partition to the average conductance of the canonical code partition.
A balance below one indicates that the non-canonical partition possesses better structural properties in terms of the average conductance, i.e. minimization of non-synonymous substitutions, than the canonical partition, whereas a balance above one means that the canonical genetic information is better optimized in this respect. From our point of view, a balance around one is the most interesting because it suggests a similar robustness of codon blocks to point mutations in both parts of the expanded genetic code. Therefore, the balance measure appears to be useful in studying the properties of codon groups belonging to the canonical and non-canonical partitions. Thanks to that, we can compare the quality of the coding system for the new and old information.
We tested the balance under the assumption that the non-canonical set collection attains the lower bound of the average conductance. Figure 6 presents the balance values calculated for various numbers of codons in the non-canonical set in relation to the number n of coded ncAAs. It is visible that the expanded genetic code is extremely unbalanced for small n, with balance values well below one, which indicates that the non-canonical partition has in general a codon block structure that minimizes non-synonymous substitutions better than the canonical partition. In all considered cases, the balance increases with the number of newly incorporated ncAAs. However, it is possible to find balanced codes, for which the balance is around one.
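The balance measure can be stated compactly as follows. The symbols in this LaTeX sketch are assumed for illustration, since the notation is not given in this section; the ratio itself follows the definition in the caption of Figure 6.

```latex
% Assumed notation (not taken from the source):
%   \mathcal{C}  collection of codon blocks encoding the canonical information
%   \mathcal{N}  collection of codon blocks encoding the ncAAs
%   \overline{\Phi}(\cdot)  average conductance of a collection
\[
B(\mathcal{N}, \mathcal{C}) = \frac{\overline{\Phi}(\mathcal{N})}{\overline{\Phi}(\mathcal{C})},
\qquad
\begin{cases}
B < 1 & \text{the non-canonical partition is more robust to point mutations,} \\
B \approx 1 & \text{both partitions are similarly robust,} \\
B > 1 & \text{the canonical partition is more robust.}
\end{cases}
\]
```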
Figure 6
The balance, i.e. the ratio of the average conductance of the non-canonical to the canonical code, in relation to the number n of coded ncAAs. The relationship is presented for all possible partitions of the set of vacant codons, which encode ncAAs, with k being the number of codons in the canonical set. It was assumed that the average conductance of the non-canonical set collection attains its lower bound.
After comparison of codes with the balance in the range of 0.991–1.010, we noticed a biased distribution of codons in the canonical and non-canonical partitions. Codons with G in the third codon position dominated in the canonical partition. Each of them accounted for more than 2.3% of the codons belonging to this set. In total, these codons constituted 40%. In the case of a uniform distribution of all codons, we should expect a 1.56% usage for each codon and 25% for a group having one nucleotide type in one codon position. On the other hand, in the non-canonical partition, codons ending with A were the most frequent, each accounting for more than 3.2% and 60% in total. Interestingly, such codons were also very often selected for the non-canonical partition in the expanded codes showing the smallest possible average costs of amino acid replacements in terms of their physicochemical properties (Table 5). The biased usage of the codon groups in the canonical and non-canonical sets is significantly different from the uniform distribution (proportion test).
What is more, the number of newly included ncAAs required to obtain a balanced code is in some cases quite large. For example, in the case of k = 28, possible balanced genetic codes are obtained for the number of ncAAs n = 28, 29, and 30. This result in fact shows the huge redundancy level of the SGC.
Concluding remarks
The redundancy of the SGC suggests that this coding system can be expanded. In the literature, we can find several approaches to this problem. These findings encouraged us to study the issue of the optimal expansion of the SGC from a theoretical perspective. In this paper, we proposed a method of genetic code expansion using graph theory. Following this methodology, we described the smallest set of codons that still encodes 21 canonical items (20 amino acids and one stop translation signal) and is characterized by the minimal set conductance for its size. This property provides the smallest number of connections between codons in this minimalistic canonical code and the set of vacant codons, which can be assigned to new genetic information. Thanks to that, such a code is characterized by a minimized possibility of reversions between the two parts of the expanded code, the canonical and the non-canonical one. What is more, we investigated the optimal structure of many expanded codes with various numbers of codons released for encoding potential ncAAs. Among these codes, we found those that minimized the average costs of amino acid replacements considering their physicochemical properties. In addition, the introduced balance measure, i.e. the ratio of the average conductance of the non-canonical to the canonical code, allows us to find expanded genetic codes whose canonical and non-canonical sets show a similar robustness to point mutations. Using these approaches, we identified the codons that can be reprogrammed to encode new ncAAs.
It should be noted that the results presented here are based on some theoretical assumptions, which were necessary to conduct the analytical calculations and reasoning as well as to draw general conclusions about the expansion of the SGC. First of all, we proposed a universal approach, which does not take into account the different probabilities of nucleotide mutations and the codon usage.
These features are highly diverse and specific, not only between various species but even within the same genome. Therefore, it is not possible to construct a general model of the genetic code expansion that includes the huge diversity of mutations and codon frequencies. Secondly, we did not consider the number and types of tRNAs that can be used to decode the respective codons unambiguously. Nevertheless, it seems reasonable to investigate the problem of SGC expansion starting from general foundations. Using these assumptions, we found several interesting limitations on the number of codons required to encode the canonical information and also on the codon blocks that would encode new information. Our approach can be considered a null model and a starting point for other, more complex models, most probably heuristic and genome-specific, which would include different mutation rates between nucleotides and codon usage.
Funding
This work was supported by the National Science Centre, Poland (Narodowe Centrum Nauki, Polska) under Grant number 2017/27/N/NZ2/00403.