Nicolas Carels, Miguel Ponce de Leon.
Abstract
Purine bias, which is usually referred to as an "ancestral codon", is known to result in short-range correlations between nucleotides in coding sequences, and it is common in all species. We demonstrate that RWY is a more appropriate pattern than the classical RNY, and purine bias (Rrr) is the product of a network of nucleotide compensations induced by functional constraints on the physicochemical properties of proteins. Through deductions from universal correlation properties, we also demonstrate that amino acids from Miller's spark discharge experiment are compatible with functional primeval proteins at the dawn of living cell radiation on Earth. These amino acids match the hydropathy and secondary structures of modern proteins.
Keywords: ancestral codon; genomics; protein features; purine bias; short-range correlations
Year: 2015 PMID: 25922573 PMCID: PMC4401237 DOI: 10.4137/BBI.S24021
Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322
Figure 1. Distribution of CDSs according to G (average of G1, G2, G3), G1, and G2 in C. reinhardtii (A) and P. falciparum (B). Bold lines are for G, thin lines are for G1, and dashed lines are for G2.
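The quantities G1, G2, G3 and their average G used in the figures are nucleotide frequencies at each codon position of a CDS. A minimal sketch of how such values could be computed is given below. It assumes that frequencies are expressed as percentages of the number of codons and that the CDS is already in frame; the function names are hypothetical and the toy sequence is not taken from the paper's datasets.

```python
from collections import Counter


def positional_composition(cds):
    """Percent frequency of A, C, G, T at codon positions 1, 2, and 3.

    Returns a dict with keys such as 'G1', 'T2', 'C3'. Bases left over
    after the last complete codon are ignored. Ambiguous bases (e.g. N)
    count toward the denominator but not toward any nucleotide.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence is shorter than one codon")
    comp = {}
    for pos in range(3):
        counts = Counter(seq[pos:usable:3])  # every third base starting at pos
        for nt in "ACGT":
            comp[f"{nt}{pos + 1}"] = 100.0 * counts[nt] / n_codons
    return comp


def mean_over_positions(comp, nt):
    """Average of nt1, nt2, nt3, e.g. G = mean(G1, G2, G3)."""
    return sum(comp[f"{nt}{p}"] for p in (1, 2, 3)) / 3


def gc2(comp):
    """GC content at the second codon position (G2 + C2)."""
    return comp["G2"] + comp["C2"]


# Toy sequence (ATG GCT GAA GGT), for illustration only.
comp = positional_composition("ATGGCTGAAGGT")
print(comp["G1"], comp["G2"], gc2(comp), mean_over_positions(comp, "G"))
```

Applying `positional_composition` to every CDS of a genome would give one row of twelve values per CDS, which is the kind of input a distribution plot or a correlation analysis needs.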
Figure 2. Correlations between G2 and G1 (panels A, C, E) and between G2 and T2 (panels B, D, F) in H. sapiens (Hs, n = 10,892, panels A, B), P. falciparum (Pf, n = 6,844, panels C, D), and C. reinhardtii (Cr, n = 15,727, panels E, F). r stands for the correlation coefficient and P for the statistical significance. Each r coefficient is associated with a P-value <0.001. Gray dots are for UFM-certified CDSs, and black dots are for CDSs homologous to proteins from PDB. (A) r = 0.12. (B) r = −0.43, y = −0.86x + 41.64. (C) rUFM = 0.43, rpdb = 0.41, y = 0.31x + 3.39. (D) rUFM = 0.09, rpdb = 0.19. (E) rUFM = 0.38, rpdb = −0.40, y = 0.61x − 3.05. (F) rUFM = −0.48, rpdb = −0.53, y = −0.99x + 44.71.
Correlations (r) between nucleotide composition in the three positions of codons in H. sapiens (Hs, n = 10,892), O. sativa (Os, n = 8,643), and P. falciparum (n = 6,844) plus C. reinhardtii (n = 15,727) (Pf + Cr, n = 22,571). The gray boxes are for r ≥ +0.55 or r ≤ −0.55.
| SP | | A1 | A2 | A3 | C1 | C2 | C3 | G1 | G2 | G3 | T1 | T2 | T3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hs | A1 | 1 | | | | | | | | | | | |
| Hs | A2 | +0.57** | 1 | | | | | | | | | | |
| Hs | A3 | +0.58** | +0.44** | 1 | | | | | | | | | |
| Hs | C1 | −0.72** | −0.46** | −0.58** | 1 | | | | | | | | |
| Hs | C2 | −0.40** | −0.59** | −0.21** | +0.43** | 1 | | | | | | | |
| Hs | C3 | −0.52** | −0.48** | −0.92** | +0.55** | +0.28** | 1 | | | | | | |
| Hs | G1 | −0.43** | −0.12** | −0.20** | −0.05** | +0.12** | +0.12** | 1 | | | | | |
| Hs | G2 | −0.44** | −0.50** | −0.31** | +0.31** | +0.11** | +0.35** | +0.12** | 1 | | | | |
| Hs | G3 | −0.53** | −0.23** | −0.83** | +0.57** | +0.12** | +0.68** | +0.34** | +0.17** | 1 | | | |
| Hs | T1 | +0.19** | +0.03** | +0.28** | −0.37** | −0.20** | −0.21** | −0.56** | +0.04** | −0.50** | 1 | | |
| Hs | T2 | +0.14** | −0.13** | −0.03* | −0.16** | −0.41** | −0.02* | −0.07** | −0.43** | 0.00 | +0.13** | 1 | |
| Hs | T3 | +0.52** | +0.34** | +0.84** | −0.59** | −0.22** | −0.88** | −0.24** | −0.25** | −0.88** | +0.42** | +0.06** | 1 |
| Os | A1 | 1 | | | | | | | | | | | |
| Os | A2 | +0.56** | 1 | | | | | | | | | | |
| Os | A3 | +0.50** | +0.41** | 1 | | | | | | | | | |
| Os | C1 | −0.53** | −0.26** | −0.30** | 1 | | | | | | | | |
| Os | C2 | −0.49** | −0.61** | −0.35** | +0.28** | 1 | | | | | | | |
| Os | C3 | −0.48** | −0.44** | −0.91** | +0.30** | +0.36** | 1 | | | | | | |
| Os | G1 | −0.61** | −0.35** | −0.46** | −0.09** | +0.30** | +0.41** | 1 | | | | | |
| Os | G2 | −0.34** | −0.47** | −0.27** | +0.15** | +0.02** | +0.29** | +0.24** | 1 | | | | |
| Os | G3 | −0.43** | −0.28** | −0.79** | +0.30** | +0.31** | +0.60** | +0.44** | +0.21** | 1 | | | |
| Os | T1 | +0.12** | +0.04** | +0.35** | −0.27** | −0.06** | −0.28** | −0.53** | −0.04** | −0.41** | 1 | | |
| Os | T2 | +0.23** | −0.02* | +0.18** | −0.14** | −0.43** | −0.16** | −0.15** | −0.44** | −0.24** | +0.08** | 1 | |
| Os | T3 | +0.50** | +0.42** | +0.88** | −0.34** | −0.38** | −0.91** | −0.44** | −0.28** | −0.82** | +0.36** | +0.22** | 1 |
| Pf + Cr | A1 | 1 | | | | | | | | | | | |
| Pf + Cr | A2 | +0.89** | 1 | | | | | | | | | | |
| Pf + Cr | A3 | +0.84** | +0.81** | 1 | | | | | | | | | |
| Pf + Cr | C1 | −0.89** | −0.80** | −0.84** | 1 | | | | | | | | |
| Pf + Cr | C2 | −0.82** | −0.86** | −0.71** | +0.76** | 1 | | | | | | | |
| Pf + Cr | C3 | −0.77** | −0.79** | −0.94** | +0.78** | +0.69** | 1 | | | | | | |
| Pf + Cr | G1 | −0.89** | −0.78** | −0.74** | +0.67** | +0.78** | +0.67** | 1 | | | | | |
| Pf + Cr | G2 | −0.81** | −0.83** | −0.71** | +0.70** | +0.64** | +0.69** | +0.76** | 1 | | | | |
| Pf + Cr | G3 | −0.88** | −0.80** | −0.93** | +0.87** | +0.72** | +0.80** | +0.80** | +0.70** | 1 | | | |
| Pf + Cr | T1 | +0.74** | +0.66** | +0.75** | −0.76** | −0.73** | −0.68** | −0.84** | −0.65** | −0.80** | 1 | | |
| Pf + Cr | T2 | +0.46** | +0.32** | +0.33** | −0.41** | −0.62** | −0.30** | −0.57** | −0.53** | −0.35** | +0.62** | 1 | |
| Pf + Cr | T3 | +0.88** | +0.83** | +0.93** | −0.87** | −0.74** | −0.92** | −0.79** | −0.73** | −0.94** | +0.79** | +0.35** | 1 |
Notes: () for a value of r that is not statistically significant at α = 0.05. (*) for a value of r that is statistically significant at probability level α < 0.05. (**) for a value of r that is statistically significant at probability level α < 0.01.
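Each entry of the correlation table is a correlation coefficient between two position-specific nucleotide frequencies taken across all CDSs of a species. The sketch below shows one way to fill such a lower-triangular matrix, assuming the Pearson coefficient is meant (the excerpt only says "correlation coefficient"). The function names and the four-value toy columns are hypothetical and do not reproduce the paper's data; significance levels are not computed.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def lower_triangle(columns):
    """Map (row_name, column_name) -> r for every pair below the diagonal.

    `columns` maps a variable name (e.g. 'G2') to its per-CDS values;
    insertion order of the dict fixes the row and column order.
    """
    names = list(columns)
    return {
        (names[i], names[j]): pearson_r(columns[names[i]], columns[names[j]])
        for i in range(len(names))
        for j in range(i)
    }


# Made-up per-CDS percentages for four CDSs, for illustration only.
toy = {
    "G1": [30.0, 33.0, 31.0, 36.0],
    "G2": [20.0, 25.0, 30.0, 35.0],
    "T2": [30.0, 26.0, 24.0, 20.0],
}
for pair, r in lower_triangle(toy).items():
    print(pair, round(r, 2))
```

With real data, each list would hold one value per CDS (for example 10,892 values for H. sapiens), and the twelve variables A1 to T3 would give the 66 off-diagonal entries of one block of the table.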
Figure 3. GC2 in P. falciparum (n = 6,844), H. sapiens (n = 10,892), and C. reinhardtii (n = 15,727). The thin line is for P. falciparum with an average GC2 of 25.51% (σ = 6.84), the dotted line is for H. sapiens with an average GC2 of 42.54% (σ = 6.61), and the bold line is for C. reinhardtii with an average GC2 of 53.70% (σ = 8.31).
Figure 4. Correlations between A2 and A1 (panels A, D, G), A2 and T2 (panels B, E, H), and A2 and C3 (panels C, F, I) in H. sapiens (Hs, n = 10,892, panels A, B, C), P. falciparum (Pf, n = 6,844, panels D, E, F), and C. reinhardtii (Cr, n = 15,727, panels G, H, I). r stands for the correlation coefficient and P for the statistical significance. Each r coefficient is associated with a P-value <0.001. Gray dots are for UFM-certified CDSs, and black dots are for CDSs homologous to proteins from PDB. (A) r = 0.57, y = 1.16x + 0.70. (B) r = −0.13. (C) r = −0.48, y = −0.35x + 42.39. (D) rUFM = 0.49, rpdb = 0.49, y = 1.2x − 4.6. (E) rUFM = 0.43, rpdb = −0.57, y = −32.48x + 926.73. (F) rUFM = −0.05, rpdb = 0.25. (G) rUFM = 0.48, rpdb = 0.63, y = 1.4x − 2.5. (H) rUFM = 0.18, rpdb = 0.28. (I) rUFM = 0.07, rpdb = 0.19.
Codon usage of ancestral codons RWr in relation to amino acid (aa) availability in primeval terrestrial conditions, aa hydropathy, and secondary structure of modern proteins. In adequate proportion, all the aa of this table may satisfy the ancestral codon RWr; more specifically, 1) the white background indicates codons that do not match the ancestral codon RWr, 2) the light gray background indicates codons that imperfectly match the ancestral codon RWr, and 3) the dark gray background indicates the aa that exactly match the ancestral codon RWr. Black rectangles mark values larger than 2.
| AA | MILLER | CARB. LAT. | HYDROP. | CODON | SPLIT | DEGENER. | Ept | Hpt | Apt |
|---|---|---|---|---|---|---|---|---|---|
| Gly | 440.0 | 0 | −0.4 | GG(A\|C\|G\|T) | | Quartet | 1.03 | 1.05 | 5.44 |
| Ala | 790.0 | 1 | 1.8 | GC(A\|C\|G\|T) | | Quartet | 1.35 | 3.77 | 3.24 |
| Ser | 5.0 | 1 | −0.8 | TC(A\|C\|G\|T) | AG(C\|T) | Sextet | 0.99 | 1.39 | 3.43 |
| Thr | 0.8 | 2 | −0.7 | AC(A\|C\|G\|T) | | Quartet | 1.32 | 1.31 | 2.73 |
| Asp | 34 | 2 | −3.5 | GA(C\|T) | | Duet | 0.68 | 1.48 | 3.69 |
| Val | 19.5 | 3 | 4.2 | GT(A\|C\|G\|T) | | Quartet | 2.97 | 2.04 | 2.22 |
| Glu | 7.7 | 3 | −3.5 | GA(A\|G) | | Duet | 0.99 | 2.98 | 3.00 |
| Ile | 4.8 | 4 | 4.5 | AT(A\|C\|T) | | Triplet | 2.19 | 1.96 | 1.72 |
| Leu | 11.3 | 4 | 3.8 | CT(A\|C\|G\|T) | TT(A\|G) | Sextet | 2.22 | 3.83 | 3.15 |
| Pro | 1.5 | 6 | −1.6 | CC(A\|C\|G\|T) | | Quartet | 0.42 | 0.64 | 3.62 |
Notes:
MILLER: amino acid concentration in Miller's experiment (ref. 41).
CARB. LAT.: carbon number in the lateral chain (side chain) of the aa.
HYDROP.: hydropathy; see Figure 4 of D'Onofrio et al. (ref. 39).
Last three columns (Ept, Hpt, Apt): amino acid distribution in proteins, as in Table 2.
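The patterns RNY, RWY, and RWr are written with IUPAC nucleotide symbols: R is a purine (A or G), Y a pyrimidine (C or T), W a weak base (A or T), and N any base. The following sketch tests whether a codon matches such a pattern exactly. It treats lowercase pattern letters like uppercase ones, which is an assumption, since the excerpt does not define the lowercase notation or how "imperfect" matches are graded. The function names are hypothetical.

```python
# Subset of the IUPAC nucleotide codes needed for the patterns in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "W": "AT",    # weak
    "S": "CG",    # strong
    "N": "ACGT",  # any base
}


def matches(codon, pattern):
    """True if every base of `codon` is allowed by the IUPAC `pattern`."""
    codon = codon.upper()
    pattern = pattern.upper()  # assumption: case carries no meaning here
    if len(codon) != 3 or len(pattern) != 3:
        raise ValueError("codon and pattern must both have length 3")
    return all(base in IUPAC[sym] for base, sym in zip(codon, pattern))


def pattern_fraction(cds, pattern):
    """Fraction of complete codons in `cds` that match `pattern`."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        raise ValueError("sequence is shorter than one codon")
    return sum(matches(c, pattern) for c in codons) / len(codons)


# GAT (Asp) fits both RNY and RWY; GCT (Ala) fits RNY only.
print(matches("GAT", "RWY"), matches("GCT", "RWY"), matches("GCT", "RNY"))
print(pattern_fraction("GATGCTGAA", "RWY"))
```

For the codon families in the table, this check shows for instance that GA(A|G) for Glu satisfies RWR, whereas GC(A|C|G|T) for Ala fails at the second position because C is not a weak base.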
Distribution of amino acids (aa) in secondary structures of proteins, i.e., β-sheet (E), α-helix (H), and aperiodic (A). The dataset of nonredundant proteins is from Ponce de Leon et al. (ref. 6).
| AA | Ept | Hpt | Apt | Sum | Ess | Hss | Ass | Pav |
|---|---|---|---|---|---|---|---|---|
| ***Ala*** | 1.35 | 3.77 | 3.24 | 8.36 | 6.44 | 12.04 | 6.79 | 8.36 |
| Cys | 0.33 | 0.31 | 0.54 | 1.18 | 1.58 | 1.00 | 1.13 | 1.18 |
| ***Asp*** | 0.68 | 1.48 | 3.69 | 5.85 | 3.24 | 4.73 | 7.74 | 5.85 |
| ***Glu*** | 0.99 | 2.98 | 3.00 | 6.97 | 4.72 | 9.50 | 6.29 | 6.97 |
| Phe | 1.20 | 1.23 | 1.55 | 3.99 | 5.73 | 3.93 | 3.26 | 3.99 |
| ***Gly*** | 1.03 | 1.05 | 5.44 | 7.52 | 4.94 | 3.34 | 11.41 | 7.52 |
| His | 0.47 | 0.62 | 1.19 | 2.28 | 2.26 | 1.97 | 2.50 | 2.28 |
| ***Ile*** | 2.19 | 1.96 | 1.72 | 5.86 | 10.45 | 6.24 | 3.60 | 5.86 |
| Lys | 0.94 | 2.10 | 2.82 | 5.86 | 4.48 | 6.71 | 5.92 | 5.86 |
| ***Leu*** | 2.22 | 3.83 | 3.15 | 9.20 | 10.60 | 12.21 | 6.61 | 9.20 |
| Met | 0.48 | 0.85 | 0.96 | 2.29 | 2.27 | 2.71 | 2.01 | 2.29 |
| Asn | 0.53 | 0.93 | 2.67 | 4.13 | 2.51 | 2.98 | 5.60 | 4.13 |
| ***Pro*** | 0.42 | 0.64 | 3.62 | 4.67 | 1.98 | 2.03 | 7.58 | 4.67 |
| Gln | 0.55 | 1.42 | 1.59 | 3.55 | 2.60 | 4.52 | 3.33 | 3.55 |
| Arg | 0.95 | 1.96 | 2.28 | 5.19 | 4.55 | 6.25 | 4.77 | 5.19 |
| ***Ser*** | 0.99 | 1.39 | 3.43 | 5.81 | 4.74 | 4.42 | 7.18 | 5.81 |
| ***Thr*** | 1.32 | 1.31 | 2.73 | 5.36 | 6.31 | 4.17 | 5.73 | 5.36 |
| ***Val*** | 2.97 | 2.04 | 2.22 | 7.23 | 14.18 | 6.51 | 4.65 | 7.23 |
| Trp | 0.34 | 0.44 | 0.51 | 1.29 | 1.64 | 1.39 | 1.06 | 1.29 |
| Tyr | 1.00 | 1.05 | 1.36 | 3.41 | 4.78 | 3.36 | 2.85 | 3.41 |
| Sum | 20.94 | 31.35 | 47.71 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
Notes:
In the columns with “pt” as subscript (Ept, Hpt, Apt), the frequencies are given as percentages of the total number of aa (n = 3,025,111) in the protein samples analyzed (n = 10,731). The dataset of nonredundant proteins is from Ponce de Leon et al. (ref. 6).
The sum is over the columns Ept, Hpt, Apt and gives the average frequency of each amino acid per protein.
In the columns with “ss” as subscript (Ess, Hss, Ass), the frequencies are given as percentages of the number of aa in the corresponding secondary structure.
Pav is the average of columns Ess, Hss, Ass weighted by the average share of each secondary structure in proteins (20.94, 31.35, and 47.71, respectively); its agreement with the sum shows the consistency of the calculation.
Bold-italic amino acids indicate the amino acids from the Miller’s experiment (1992). The numbers on dark gray background are for values larger than 3 for Ept, Hpt, Apt and larger than 10 for Ess, Hss, Ass. The numbers on light gray background are for values in the range 2–3 for Ept, Hpt, Apt and in the range 5–10 for Ess, Hss, Ass. The numbers on white background are for values lower than 2 for Ept, Hpt, Apt and lower than 5 for Ess, Hss, Ass.
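The weighted average Pav described in the notes can be checked directly from the table. The sketch below recomputes it for two rows, using the weights 20.94, 31.35, and 47.71 from the Sum row. The function name is hypothetical, and because the published values are rounded to two decimals, the recomputed values are only expected to agree to within about 0.01.

```python
# Average share (%) of each secondary structure, from the Sum row of the table.
WEIGHTS = {"E": 20.94, "H": 31.35, "A": 47.71}


def weighted_pav(ess, hss, ass, weights=WEIGHTS):
    """Average of Ess, Hss, Ass weighted by the secondary-structure shares."""
    total = sum(weights.values())
    return (weights["E"] * ess + weights["H"] * hss + weights["A"] * ass) / total


# Ess, Hss, Ass values copied from the Cys and Phe rows of the table.
cys = weighted_pav(1.58, 1.00, 1.13)
phe = weighted_pav(5.73, 3.93, 3.26)
print(round(cys, 2), round(phe, 2))
```

For these two rows the recomputed values round to the tabulated Pav (and to the Sum column), which is the consistency check the note refers to.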