Dmitry N Ivankov.
Abstract
In the course of evolution, genes traverse nucleotide sequence space, which translates to a trajectory of changes in protein sequence space. The correspondence between regions of the nucleotide and protein sequence spaces is understood in general but not in detail. One unexplored question is how many sequences a protein can reach with a given number of nucleotide substitutions in its gene sequence. Here I propose an algorithm that calculates the volume of protein sequence space accessible to a given protein sequence as a function of the number of nucleotide substitutions made in the protein-coding sequence. The algorithm uses dynamic programming and completes all calculations within a couple of seconds on a desktop computer. I apply the algorithm to green fluorescent protein (GFP) and obtain a number of sequences up to four times higher than previously estimated. However, given the astronomically large size of protein sequence space, the previous estimate can be considered acceptable as an order-of-magnitude estimate. The proposed algorithm has practical applications in the study of evolutionary trajectories in sequence space.
Year: 2017 PMID: 28800638 PMCID: PMC5553642 DOI: 10.1371/journal.pone.0182525
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Consideration of the serine-coding UCG codon.
(a) The standard genetic code table with codons colored by distance from the UCG codon under consideration: the UCG codon itself is black; codons at a distance of one, two, and three nucleotide substitutions are blue, green, and red, respectively. (b) The amino acids that can be obtained from the serine UCG codon by zero (black), one (blue), two (green), and three (red) nucleotide substitutions. All amino acid variants are listed on the left; on the right, only the variants that contribute to the increment of the protein sequence space are listed, that is, those not already reachable with fewer substitutions. (c) Graph representation of the number of possible amino acid variants when mutating the UCG codon. Black, blue, green, and red arrows correspond to zero, one, two, and three nucleotide substitutions, multiplying the previously available number of amino acid variants (here one, left circle) by one, five, ten, and four variants, respectively.
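The per-codon counts in panel (c) can be reproduced with a short sketch. This is an illustration rather than the paper's own code: the function name `new_amino_acids_by_distance` is hypothetical, the standard genetic code is hard-coded, and stop codons are assumed not to count as protein variants.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons enumerated in UCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(bases): aa
    for bases, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def new_amino_acids_by_distance(codon):
    """Group amino acids by the minimum number of nucleotide
    substitutions needed to reach them from `codon`."""
    min_dist = {}
    for other, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # assumption: stop codons are not protein variants
        d = sum(a != b for a, b in zip(codon, other))  # Hamming distance
        if d < min_dist.get(aa, 4):
            min_dist[aa] = d
    groups = [set(), set(), set(), set()]  # index = number of substitutions
    for aa, d in min_dist.items():
        groups[d].add(aa)
    return groups


groups = new_amino_acids_by_distance("UCG")
print([len(g) for g in groups])  # [1, 5, 10, 4]
```

Each amino acid is assigned to the smallest distance at which it first appears, so the four groups partition the 20 amino acids and match the multipliers one, five, ten, and four in the caption.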
Fig 2. Illustration of the dynamic programming procedure.
The example nucleotide sequence AUG UCG, coding for the amino acid sequence Met-Ser, is considered. The arrow colors have the same meaning as in Fig 1c.
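A minimal sketch of such a dynamic programming pass for the AUG UCG example follows. It is an illustration under stated assumptions, not the paper's implementation: the names `codon_increments` and `reachable_counts` are hypothetical, stop codons are excluded, and the start codon is treated like any other codon. The sketch returns, for each k, the number of protein sequences whose closest encoding is exactly k substitutions away; this excerpt does not state whether the paper tabulates exactly-k or at-most-k counts.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons enumerated in UCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(bases): aa
    for bases, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_increments(codon):
    """Return [n0, n1, n2, n3]: how many amino acids first become
    reachable from `codon` at 0, 1, 2, and 3 substitutions."""
    min_dist = {}
    for other, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # assumption: stop codons are not protein variants
        d = sum(a != b for a, b in zip(codon, other))
        min_dist[aa] = min(d, min_dist.get(aa, 3))
    counts = [0, 0, 0, 0]
    for d in min_dist.values():
        counts[d] += 1
    return counts


def reachable_counts(coding_sequence, max_subs):
    """ways[k] = number of protein sequences whose closest encoding
    differs from `coding_sequence` by exactly k substitutions."""
    codons = [coding_sequence[i:i + 3] for i in range(0, len(coding_sequence), 3)]
    ways = [1] + [0] * max_subs  # before any codon: one (empty) prefix, 0 substitutions
    for codon in codons:
        per_codon = codon_increments(codon)
        nxt = [0] * (max_subs + 1)
        for used, count in enumerate(ways):
            for d, n in enumerate(per_codon):
                if count and used + d <= max_subs:
                    nxt[used + d] += count * n  # extend every prefix by n residues
        ways = nxt
    return ways


print(reachable_counts("AUGUCG", 6))  # [1, 11, 49, 113, 134, 76, 16]
```

The seven counts sum to 400 = 20 × 20, every possible dipeptide, which is a useful sanity check. A running sum over the list gives the number of sequences reachable with at most k substitutions.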
Comparison of the approximate [5] and exact (this paper) numbers of possible amino acid sequences of GFP.
| Number of nucleotide mutations from the wildtype | Number of possible amino acid sequences (approximate [5]) | Number of possible amino acid sequences (this paper) | Ratio |
|---|---|---|---|
| 1 | 1233 | 1424 | 1.2 |
| 2 | 759528 | 1011954 | 1.3 |
| 3 | 3.1 × 10^8 | 4.8 × 10^8 | 1.5 |
| 4 | 9.6 × 10^10 | 1.7 × 10^11 | 1.8 |
| 5 | 2.4 × 10^13 | 4.8 × 10^13 | 2.0 |
| 6 | 4.8 × 10^15 | 1.1 × 10^16 | 2.3 |
| 7 | 8.5 × 10^17 | 2.3 × 10^18 | 2.7 |
| 8 | 1.3 × 10^20 | 4.0 × 10^20 | 3.1 |
| 9 | 1.8 × 10^22 | 6.2 × 10^22 | 3.4 |
| 10 | 2.2 × 10^24 | 8.7 × 10^24 | 4.0 |