Alexander Goncearenco, Bin-Guang Ma, Igor N Berezovsky.
Abstract
DNA, RNA and proteins are major biological macromolecules that coevolve and adapt to environments as components of one highly interconnected system. We explore here sequence/structure determinants of mechanisms of adaptation of these molecules, links between them, and results of their mutual evolution. We complemented statistical analysis of genomic and proteomic sequences with folding simulations of RNA molecules, unraveling causal relations between compositional and sequence biases reflecting molecular adaptation on DNA, RNA and protein levels. We found many compositional peculiarities related to environmental adaptation and the life style. Specifically, thermal adaptation of protein-coding sequences in Archaea is characterized by a stronger codon bias than in Bacteria. Guanine and cytosine load in the third codon position is important for supporting the aerobic life style, and it is highly pronounced in Bacteria. The third codon position also provides a tradeoff between arginine and lysine, which are favorable for thermal adaptation and aerobicity, respectively. Dinucleotide composition provides stability of nucleic acids via strong base-stacking in ApG dinucleotides. In relation to coevolution of nucleic acids and proteins, thermostability-related demands on the amino acid composition affect the nucleotide content in the second codon position in Archaea.
Year: 2013 PMID: 24371267 PMCID: PMC3950714 DOI: 10.1093/nar/gkt1336
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Compositional and sequence signals in coding and ncDNA
| ARCHAEA | BACTERIA | ||
|---|---|---|---|
| Characteristic | Value | Characteristic | Value |
| ANat/NCB | ANat/NCB | ||
| (A+T)Nat/NCB | (A+T)Nat/NCB | ||
| (A+G)Nat/NCB | (A+G)Nat/NCB | ||
| (A+C)Nat/NCB | (A+C)Nat/NCB | ||
| TNat | TNat | ||
| TNCB | TNCB | ||
| GNat/NCB | (T+G)Nat | ||
| (T+G)Nat | (T+G)NCB | ||
| (T+G)NCB | |||
| (G+C)Nat/NCB | |||
| ANat | ANat | ||
| ANCB | ANCB | ||
| GNat | (A+G)Nat | ||
| GNCB | (A+G)NCB | ||
| (A+G)Nat | (A+G)Nat/NCB | ||
| (A+G)NCB | |||
| ncDNA | |||
| ApA/TpT | 1.15** | ApA/TpT | 1.26** |
| CpC/GpG | 1.24** | GpC | 1.25** |
| GpT/ApC | 0.82** | ||
| cDNA | |||
| ApA | 1.25** | ||
| ApANCB | 1.06** | ||
| TpT | 1.20** | ||
| TpTNCB | 1.13** | ||
| GpC | 1.24** | ||
| GpCNCB | 1.07** | ||
| tRNA | |||
| ApA | 1.32** | ApG | 1.28** |
| CpC | 1.26** | TpC | 1.35** |
| TpC | 1.26** | ||
| ApC | 0.52** | ||
| TpG | 0.70** | ||
| rRNA | |||
| ApA | 1.25** | ApA | 1.15** |
| CpC | 1.27** | CpC | 1.19** |
| ncDNA | |||
| CpT | GpG | ||
| ApG | |||
| cDNA | |||
| CpTNat | GpGNat | ||
| CpTNCB | GpGNCB | ||
| CpTShfld | GpGShfld | ||
| ApGNat | |||
| ApGNCB | |||
| ApGShfld | |||
| tRNA | |||
| ApA | ApA | ||
| TpA | |||
| TpT | |||
| GpG | |||
| CpC | |||
| rRNA | |||
| ApA | ApA | ||
| TpA | CpC | ||
| GpG | |||
| CpC | |||
| Freq(ApG)Nat/NCB | 1.35n/a | Freq(CpG)Nat/NCB | 1.27n/a |
| Freq(CpG)Nat/NCB | 0.62n/a | ||
| ApG | 1.02 | TpT | 1.45** |
| ApGNCB | 0.79** | TpTNCB | 1.49** |
| CpG | 0.76** | TpA | 0.69** |
| CpGNCB | 1.17** | TpANCB | 0.65** |
| GpT | 0.67** | ||
| GpTNCB | 0.67** | ||
| ApG | |||
| ApGNCB | |||
| GpC | |||
| GpCNCB | |||
| CpC | |||
| CpCNCB | |||
| ApGNat/NCB | |||
| GpCNat/NCB | |||
| CpCNat/NCB | |||
| RpR | RpR | ||
| RpRNCB | RpRNCB | ||
| YpY | YpY | ||
| YpYNCB | YpYNCB | ||
| RpRNat/NCB | RpRNat/NCB | ||
| YpYNat/NCB | YpYNat/NCB | ||
| TpA | ApA | 1.50** | |
| TpANCB | ApANCB | 1.05** | |
| CpT | GpC | 1.33** | |
| CpTNCB | GpCNCB | 0.98 | |
| ApG | ApCNat/NCB | ||
| ApGNCB | GpANat/NCB | ||
| RpR | RpR | ||
| RpRNCB | RpRNCB | ||
| YpY | YpY | ||
| YpYNCB | YpYNCB | ||
| RpRNat/NCB | |||
| YpYNat/NCB | |||
| CpT | |||
| CpTShfld | |||
| ApG | |||
| ApGShfld | |||
| RpR | RpR | ||
| RpRShfld | RpRShfld | ||
| YpY | YpY | ||
| YpYShfld | YpYShfld | ||
Pearson correlation coefficient is denoted by R; DCT is calculated as the ratio of dinucleotide frequency to the product of frequencies of the corresponding independent nucleotides. Nucleotides with purine (A or G) and pyrimidine (T or C) bases are denoted with R and Y, respectively. The lower index distinguishes the values observed for natural sequences (Nat), sequences with eliminated codon bias (NCB), values observed after shuffling of amino acid sequences (Shfld). If the lower index is omitted, the value is given for the natural sequences. The P-values for correlations and for the dinucleotide contrast t-tests (H0: DCT is 1.0) are shown in superscripts as significance levels: +P-value < 0.05, * < 0.01, ** < 0.0001. Supplementary Files S5 and S7 list all P-values for position-specific correlations, SD for compositions and the P-values for t-tests. Supplementary Files S8 and S9 show the P-values for position-independent contrast t-tests and correlations.
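The dinucleotide contrast (DCT) defined in the note above can be illustrated with a short Python sketch. It is only a minimal reading of that one-sentence definition, not the authors' pipeline: the function names `dinucleotide_contrast` and `to_purine_pyrimidine` are hypothetical, and the choice to count overlapping dinucleotides along a single strand is an assumption.

```python
from collections import Counter

# Assumption: map bases to generalized purine (R) / pyrimidine (Y) symbols,
# following the R/Y convention in the table note. U is included for RNA.
_RY_TABLE = str.maketrans("AGCTU", "RRYYY")


def to_purine_pyrimidine(seq):
    """Rewrite a nucleotide sequence in the R/Y alphabet (hypothetical helper)."""
    return seq.upper().translate(_RY_TABLE)


def dinucleotide_contrast(seq, dinuc):
    """Observed dinucleotide frequency divided by the product of the
    frequencies of its two nucleotides (DCT as described in the note).

    Assumption: overlapping dinucleotides are counted on one strand only.
    """
    seq = seq.upper()
    if len(seq) < 2 or len(dinuc) != 2:
        raise ValueError("need a sequence of length >= 2 and a 2-letter dinucleotide")
    mono = Counter(seq)
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    f_first = mono[dinuc[0]] / len(seq)
    f_second = mono[dinuc[1]] / len(seq)
    if f_first == 0 or f_second == 0:
        return float("nan")  # contrast is undefined if a nucleotide is absent
    f_dinuc = pairs[dinuc] / (len(seq) - 1)
    return f_dinuc / (f_first * f_second)


if __name__ == "__main__":
    toy = "ATGAGGAGCTAGAGA"  # made-up sequence, for illustration only
    print("ApG contrast:", dinucleotide_contrast(toy, "AG"))
    print("RpR contrast:", dinucleotide_contrast(to_purine_pyrimidine(toy), "RR"))
```

A value above 1.0 means the dinucleotide occurs more often than expected from the nucleotide composition alone, and a value below 1.0 means it is under-represented, which is how the starred values in the table are read against the null hypothesis DCT = 1.0.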
Nucleotide compositions and their OGT correlations in DNA and RNAs
| Domain of life | Nucleotide | cDNANat | cDNANCB | ncDNA | tRNA ( | rRNA ( |
|---|---|---|---|---|---|---|
| Archaea | A | 28.45 | 27.94** | 17.13** (–0.76**) | 23.68 (–0.85**) | |
| T | 23.89 | 24.44 | 18.17** (–0.89**) | 19.24 (–0.88**) | ||
| G | 26.00 | 26.12 | 19.30** | |||
| C | 21.66 | 21.51** | 19.31** | |||
| Bacteria | A | 24.05 | 26.42** | 26.57 | 19.66 (–0.44**) | 26.09 (–0.52**) |
| T | 22.68** | 23.70** | 26.61 | 21.56 (–0.51**) | 20.68 (–0.77**) | |
| G | 27.50** | 26.66** | 23.41 | |||
| C | 25.77 | 23.23** | 23.41 | 22.13 (0.60**) |
The numbers represent the average frequencies of nucleotides in the corresponding parts of genomes, while the numbers in parentheses are correlation coefficients (R) of nucleotide frequencies with OGT. The most important biases and correlations are shown in bold font. The P-values for correlations (H0: correlation coefficient R = 0) and for the nucleotide composition t-tests (H0: mean frequency is 0.25) are shown in superscripts as significance levels: *P-value < 0.01, ** < 0.0001. Supplementary Files S8 and S9 list all correlations and composition tests. cDNANat, natural nucleotide composition in coding DNA; cDNANCB, nucleotide composition in coding sequences with eliminated codon bias.
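As a rough illustration of the quantities in parentheses, the sketch below computes a nucleotide frequency per sequence and its Pearson correlation coefficient R with OGT. The function names and the toy sequences and temperatures are hypothetical and carry no real data; the significance tests mentioned in the note are not reproduced here.

```python
import math


def nucleotide_frequency(seq, nucleotide):
    """Fraction of positions in `seq` occupied by `nucleotide`."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return seq.count(nucleotide.upper()) / len(seq)


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical (sequence, OGT in degrees C) pairs, invented for illustration.
    toy_genomes = [
        ("ATATATGCAT", 30.0),
        ("ATGCGCATGC", 55.0),
        ("GCGCGCATGC", 80.0),
    ]
    freqs = [nucleotide_frequency(seq, "A") for seq, _ in toy_genomes]
    ogts = [ogt for _, ogt in toy_genomes]
    print("R(A, OGT) =", pearson_r(freqs, ogts))
```

A negative R, as reported for A and T in tRNA and rRNA, means that the nucleotide becomes rarer as the optimal growth temperature rises.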
OGT correlations in r- and t-RNA observed in folding simulations
| Domain of life | RNA type | ||
|---|---|---|---|
| Archaea | rRNA | 0.89 (<10^–22) | –0.93 (1.1 × 10^–20) |
| tRNA | 0.84 (1.84 × 10^–13) | –0.71 (3.34 × 10^–8) | |
| Bacteria | rRNA | 0.73 (1.38 × 10^–14) | –0.66 (1 × 10^–11) |
| tRNA | 0.53 (3.1 × 10^–7) | –0.51 (9.1 × 10^–7) |
R((G + C), OGT), correlation coefficient between the (G + C) content and the OGT; R(
Generalized nucleotides and dinucleotides in different codon positions favorable for thermostability
| Nucleotides correlated with OGT | ||||
|---|---|---|---|---|
| Domain of life | Codon position | 1 | 2 | 3 |
| Archaea | Nucleotide | A | T,G | Non-[A,G] |
| Origin of the bias | Codon bias | T- amino acid | Against codon bias | |
| G-codon bias | ||||
| Bacteria | Nucleotide | Weak A | Weak T | Non-[A,G] |
| Origin of the bias | Codon bias | Amino acid | Against codon bias | |
Part 1. Thermophilic-prone nucleotide biases: Columns 1, 2, 3 contain information on favorable nucleotides and origin of the bias in codon positions 1, 2, 3. Part 2. Thermophilic-prone dinucleotide biases: Columns 1-2, 2-3, 3-1 contain information on favorable nucleotides and origin of the bias in codon positions 1-2, 2-3, 3-1.
Compositional and dicodon signals of mRNA adaptation purified by the dicodon shuffling
| Characteristic | Archaea | Bacteria |
|---|---|---|
| The most and least (after the comma) frequent pairs in stems of mRNA | Mesophiles | |
| Phase I: C2•3G, G2•3U | Phase I: G3•2U, U3•2G | |
| Phase II: C3•1G, G1•3U | Phase II: A1•3U, G1•3U | |
| Phase III: A3•3U, U3•3G | Phase III: G3•3C, U3•3G | |
| Thermophiles | ||
| Phase I: G3•2C, G2•3U | Phase I: C2•3G, G2•3U | |
| Phase II: C2•2G, G1•3U | Phase II: C3•1G, G1•3U | |
| Phase III: A3•3U, U3•3G | Phase III: U3•3A, U3•3G | |
| Correlations with OGT of | ||
| Segment energy, < | ||
| Energy per base pair, < | ||
| The most significant correlations with OGT | Phase I: U3•2G, | (U2•1G)Nat, PIII, |
| Phase I: G2•3U, | (G1•2U)Nat, PIII, | |
| Phase I: All pairs, | (U2•1G)dShfld, PIII, | |
| (G1•2U)dShfld, PIII, | ||
Phases I, II and III correspond to positioning of triplets where, respectively, first, second and third nucleotides are complementary. All signals (except OGT correlations in Bacteria) are normalized by corresponding values for control sequences after the dicodon shuffling. See explanations of abbreviations in the Materials and methods section.
Purine loading in loop and stem regions of folded mRNA and its OGT correlation
| Feature | Loop | Stem | L-v-S p-value | ||
|---|---|---|---|---|---|
| Mean contents | OGT correlation | Mean content | OGT correlation | ||
| A + G | 0.560 | 0.59** | 0.500 | –0.26 | <2.2 |
| R/Y | 1.299 | 0.61** | 1.002 | –0.26 | <2.2 |
| ApG | 0.061 | 0.79** | 0.051 | 0.50** | 0.0002 |
| GGR (glycine) | 0.027 | 0.62** | 0.045 | 0.42** | 1.2 |
| GGY (glycine) | 0.017 | –0.05 | 0.079 | –0.23* | 1.0 |
| AGR (arginine) | 0.035 | 0.72** | 0.022 | 0.56** | 2.4 |
| CGR (arginine) | 0.018 | –0.22 | 0.040 | –0.24 | 7.1 |
| CGY (arginine) | 0.016 | –0.25* | 0.037 | –0.26 | 6.2 |
| GAR (glutamate) | 0.060 | 0.71** | 0.036 | 0.59** | <2.2 |
| AAR (lysine) | 0.086 | 0.22 | 0.017 | 0.11 | <2.2 |
| GAY (aspartate) | 0.037 | –0.17 | 0.040 | –0.37** | 0.0002 |
Feature, analyzed nucleotide, dinucleotide or amino acid; Loop and Stem, mean content and OGT correlation of the feature in the respective region; L-v-S, P-value of a Wilcoxon test comparing the feature's content in loop regions with its content in stem regions. Correlation coefficients with OGT are shown. *P-value < 0.01, **P-value < 0.0001.
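The loop and stem contents in the table can be pictured with the following sketch, which splits a folded sequence into unpaired and paired positions and computes the purine fraction (A + G) and the R/Y ratio for each region. It assumes the fold is available as a dot-bracket string, which is a common output format of RNA folding tools but is not stated in the source; all function names are hypothetical, and the Wilcoxon comparison is left out.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CTU")


def split_loop_stem(seq, structure):
    """Split a sequence into loop (unpaired, '.') and stem (paired, '(' or ')')
    nucleotides. Assumption: `structure` is a dot-bracket string of equal length."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq = seq.upper()
    loop = "".join(b for b, s in zip(seq, structure) if s == ".")
    stem = "".join(b for b, s in zip(seq, structure) if s in "()")
    return loop, stem


def purine_fraction(region):
    """(A + G) content of a region, ignoring ambiguous symbols such as N."""
    counted = [b for b in region if b in PURINES or b in PYRIMIDINES]
    if not counted:
        return float("nan")
    return sum(b in PURINES for b in counted) / len(counted)


def ry_ratio(region):
    """Purine-to-pyrimidine ratio (R/Y); infinite if the region has no pyrimidines."""
    purines = sum(b in PURINES for b in region)
    pyrimidines = sum(b in PYRIMIDINES for b in region)
    return purines / pyrimidines if pyrimidines else float("inf")


if __name__ == "__main__":
    # Made-up hairpin, for illustration only.
    loop, stem = split_loop_stem("GGGAAACCC", "(((...)))")
    print("loop A+G:", purine_fraction(loop), "stem A+G:", purine_fraction(stem))
    print("stem R/Y:", ry_ratio(stem))
```

In a perfectly Watson-Crick paired stem every purine faces a pyrimidine, so the stem R/Y ratio is expected to stay near 1, while loops are free to accumulate purines. This matches the pattern of the A + G and R/Y rows above.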
Signals of thermophilic adaptation in protein sequences of Archaea and Bacteria
| Correlation with OGT | Archaea | Bacteria |
|---|---|---|
| ILVW Y DKR ( | IPV Y EKR ( | |
| The most abundant residues in >70% (>60%) of all | VIWL Y ER | VPI Y ER(K) |
| Individual amino acids | +: L, W | +: E |
| –: T, Q, D | ||
| Types of amino acids | +: h | +: h, c |
| –: p | –: p | |
| Dipeptides | +: hp, ph | +: cc |
| –: cp, pc | –: pc |
+, increase of the amino acid (or amino acid type) fraction with OGT; –, decrease of the amino acid (or amino acid type) fraction with OGT. Capital letters are names of amino acids; h, p, c are hydrophobic, polar, charged types of amino acids. Correlation coefficients between the Z-scored thermostability predictors and OGT are given in parentheses.