Nicholas J Hudson, Quan Gu, Shivashankar H Nagaraj, Yong-Sheng Ding, Brian P Dalrymple, Antonio Reverter.
Abstract
Codon bias in the genome of an organism influences its phenome by changing the speed and efficiency of mRNA translation and hence protein abundance. We hypothesized that differences in codon bias, either between-species differences in orthologous genes, or within-species differences between genes, may play an evolutionary role. To explore this hypothesis, we compared the genome-wide codon bias in six species that occupy vital positions in the Eukaryotic Tree of Life. We acquired the entire protein coding sequences for these organisms, computed the codon bias for all genes in each organism and explored the output for relationships between codon bias and protein function, both within- and between-lineages. We discovered five notable coordinated patterns, with extreme codon bias most pronounced in traits considered highly characteristic of a given lineage. Firstly, the Homo sapiens genome had stronger codon bias for DNA-binding transcription factors than the Saccharomyces cerevisiae genome, whereas the opposite was true for ribosomal proteins--perhaps underscoring transcriptional regulation in the origin of complexity. Secondly, both mammalian species examined possessed extreme codon bias in genes relating to hair--a tissue unique to mammals. Thirdly, Arabidopsis thaliana showed extreme codon bias in genes implicated in cell wall formation and chloroplast function--which are unique to plants. Fourthly, Gallus gallus possessed strong codon bias in a subset of genes encoding mitochondrial proteins--perhaps reflecting the enhanced bioenergetic efficiency in birds that co-evolved with flight. And lastly, the G. gallus genome had extreme codon bias for the Ciliary Neurotrophic Factor--which may help to explain their spontaneous recovery from deafness. We propose that extreme codon bias in groups of genes that encode functionally related proteins has a pathway-level energetic explanation.
Year: 2011 PMID: 21966531 PMCID: PMC3179510 DOI: 10.1371/journal.pone.0025457
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Differential Entropy, measured using either Shannon or Approximate Entropy (ApEn), as a function of simulated coding sequences of varying lengths.
| Length, bp | Range (max – min): Shannon | Range (max – min): ApEn | 1% Significance Threshold: Shannon | 1% Significance Threshold: ApEn |
| 300 | 0.047 | 0.115 | 0.818 | 3.746 |
| 900 | 0.018 | 0.039 | 0.340 | 1.155 |
| 1,500 | 0.012 | 0.029 | 0.231 | 0.759 |
| 3,000 | 0.006 | 0.014 | 0.145 | 0.395 |
| 4,500 | 0.005 | 0.010 | 0.097 | 0.292 |
| 9,000 | 0.004 | 0.006 | 0.059 | 0.166 |
| Equation for Best Fit | | | 65.95 | 618.40 |
In all cases, the average Differential Entropy was within three decimal digits from zero.
Percentage error rate threshold corresponding to empirical P-value<0.01
Prediction equations (both with R² > 99%) to identify coding sequences with statistically significant Differential Entropy (P-value < 0.01).
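The functional form of the prediction equations does not survive in this copy of the table; only the leading coefficients remain. The sketch below assumes a power law, threshold = a · length^b, and fits it to the tabulated 1% thresholds by ordinary least squares in log-log space. The power-law form and the helper names `fit_power_law` and `predict_threshold` are assumptions made for illustration, not the authors' stated method.

```python
import math

# Lengths (bp) and 1% significance thresholds copied from the table above.
lengths = [300, 900, 1500, 3000, 4500, 9000]
shannon_thr = [0.818, 0.340, 0.231, 0.145, 0.097, 0.059]
apen_thr = [3.746, 1.155, 0.759, 0.395, 0.292, 0.166]


def fit_power_law(xs, ys):
    """Fit ln(y) = ln(a) + b * ln(x) by least squares; return (a, b, r_squared).

    ASSUMPTION: a power law is used here only because the thresholds fall
    roughly on a straight line in log-log space.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mx) ** 2 for x in lx)
    syy = sum((y - my) ** 2 for y in ly)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    r_squared = sxy ** 2 / (sxx * syy)
    return a, b, r_squared


def predict_threshold(length_bp, a, b):
    """Predicted 1% threshold for a CDS of the given length under the fitted law."""
    return a * length_bp ** b


a_sh, b_sh, r2_sh = fit_power_law(lengths, shannon_thr)
a_ap, b_ap, r2_ap = fit_power_law(lengths, apen_thr)
```

Under this assumption, a coding sequence of arbitrary length would be flagged as significant when the magnitude of its Differential Entropy exceeds `predict_threshold(length, a, b)` for the chosen entropy measure.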
Average Shannon and Approximate entropy (ApEn) of real and random coding sequences (CDS), and percentage of CDS where the entropy of the real sequence is less than expected by chance across the six species.
| | Yeast | C. elegans | A. thaliana | Chicken | Chimp | Human |
| Number of CDS | 6,413 | 27,974 | 32,936 | 18,536 | 30,973 | 56,323 |
| Entropy of real CDS | ||||||
| Shannon | 1.954 | 1.970 | 1.976 | 1.971 | 1.968 | 1.965 |
| ApEn | 1.300 | 1.294 | 1.303 | 1.290 | 1.287 | 1.276 |
| Entropy of random CDS | ||||||
| Shannon | 1.979 | 1.983 | 1.986 | 1.986 | 1.986 | 1.984 |
| ApEn | 1.331 | 1.336 | 1.337 | 1.339 | 1.340 | 1.330 |
| % Observed<Random | ||||||
| Shannon | 93.44 | 78.22 | 84.64 | 79.97 | 82.10 | 82.28 |
| ApEn | 97.47 | 99.12 | 98.61 | 99.28 | 99.37 | 98.75 |
| % with Significant (P<0.01) Differential Entropy | ||||||
| Shannon | 82.55 | 65.28 | 62.01 | 60.86 | 65.18 | 63.43 |
| ApEn | 61.02 | 77.66 | 68.54 | 83.67 | 85.03 | 79.82 |
For every CDS, we generated 20 random sequences that encode the identical amino acid sequence to compute average entropies.
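The randomization described in this footnote can be sketched as follows. Two details are assumptions, since this excerpt does not state them: Shannon entropy is taken over single-nucleotide frequencies in bits, and each random sequence is built by drawing uniformly among the synonymous codons of the standard genetic code. All function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    SYNONYMS[aa].append(codon)


def split_codons(cds):
    """Split an uppercase ACGT coding sequence into complete codons."""
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def translate(cds):
    return "".join(CODON_TO_AA[c] for c in split_codons(cds))


def shannon_entropy(seq):
    """Shannon entropy in bits over nucleotide frequencies (ASSUMED definition)."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())


def random_synonymous(cds, rng):
    """Random CDS encoding the same protein; uniform codon choice is an ASSUMPTION."""
    return "".join(rng.choice(SYNONYMS[CODON_TO_AA[c]]) for c in split_codons(cds))


def differential_entropy(cds, n_random=20, seed=0):
    """Observed entropy minus the mean entropy of n_random synonymous sequences."""
    rng = random.Random(seed)
    expected = sum(
        shannon_entropy(random_synonymous(cds, rng)) for _ in range(n_random)
    ) / n_random
    return shannon_entropy(cds) - expected
```

A negative return value corresponds to a sequence that is more regular than expected for its amino acid sequence, matching the sign convention of the Differential Entropy reported in the tables.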
Figure 1. Differential Entropy: Regularity in coding sequences expressed as the difference between the observed and the randomly expected entropy.
Negative values indicate sequences more regular than expected for a given amino acid sequence. The horizontal red line is positioned at zero on the y axis. All sequences below this line possess codon bias.
The 5 most extreme functional enrichments for each species on a within-lineage basis.
| Species | Biological process | Adjusted P-value |
| Yeast | Translation | 4.26E-90 |
| | Regulation of Translation | 4.70E-51 |
| | Posttranscriptional regulation of gene expression | 6.86E-48 |
| | Ribosome assembly | 4.74E-14 |
| | rRNA processing | 2.32E-13 |
| C. elegans | Nucleosome organization and assembly | 1.55E-17 |
| | Protein-DNA complex organization and assembly | 1.11E-16 |
| | Body morphogenesis | 1.77E-13 |
| | Translation | 5.54E-13 |
| | Chromatin organization | 2.07E-8 |
| A. thaliana | Structural constituent of cell wall | 7.75E-14 |
| | Translational elongation | 6.62E-10 |
| | Plant-type cell wall organization | 1.37E-7 |
| | Structural constituent of ribosome | 7.63E-7 |
| | Chloroplast ribulose bisphosphate carboxylase complex | 1.46E-4 |
| Chicken | Regulation of multicellular organismal process | 1.95E-7 |
| | Sex determination | 2.31E-6 |
| | Regulation of transcription, DNA-dependent | 4.67E-6 |
| | Regulation of cell differentiation | 5.28E-6 |
| | Regulation of developmental process | 7.68E-6 |
| Chimpanzee | Keratinization | 2.67E-18 |
| | Feeding behavior | 5.93E-7 |
| | Epidermal cell differentiation | 2.56E-5 |
| | Regulation of transcription, DNA-dependent | 6.59E-5 |
| | Pigment accumulation in tissues | 5.47E-4 |
| Human | Sequence-specific DNA binding activity | 3.68E-11 |
| | Hormone activity | 4.24E-8 |
| | Regulation of transcription, DNA-dependent | 8.99E-7 |
| | RNA polymerase II transcription factor activity | 5.41E-6 |
| | Epidermal cell differentiation | 6.99E-5 |
Adjusted P-values for the hypergeometric test obtained using the GOrilla tool (Eden et al., 2009), http://cbl-gorilla.cs.technion.ac.il/.
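For readers unfamiliar with the test, the following is a minimal sketch of a plain upper-tail hypergeometric enrichment P-value. It does not reproduce GOrilla's own statistic or its multiple-testing adjustment, and the function name and the gene counts in the usage lines are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, drawn, hits):
    """P(X >= hits) when `drawn` genes are sampled without replacement from
    `population` genes, of which `annotated` carry the term of interest."""
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(hits, upper + 1)
    )
    return favourable / total


# Hypothetical counts for illustration only: 6,000 genes, 200 annotated with a
# term, and 40 of the 300 most extreme genes carrying that term.
p_value = hypergeom_upper_tail(6000, 200, 300, 40)
```

A small P-value indicates that the term occurs among the most extreme genes more often than random sampling would predict; raw values from a test of this kind would still need correction for the number of terms tested.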
Figure 2. Differential Entropy in sequences from 609 orthologous proteins in humans and yeast (A).
Highlighted are the ribosomal proteins (N = 23; blue), the transcription factors (N = 47; red), RFX1 (green) and RPS3 (pink). Differential Entropy in sequences from 7,902 orthologous proteins in humans and chicken (B). Highlighted are mitochondrial proteins (N = 14; red), G-protein receptors (N = 14; blue) and CNTF (Table 5). Differential Entropy in sequences from 14,182 orthologous proteins in humans and chimps (C). Highlighted are the keratin associated proteins (N = 46; blue). The diagonal red lines are 45 degree bisectors that have been placed to show the point at which there is no difference in bias between species. The perpendicular distance from the diagonal represents the extent of the difference in bias.
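The perpendicular distance mentioned in the caption follows from elementary geometry: a point (x, y) lies at distance |y − x| / √2 from the line y = x. The sketch below keeps the sign so that the direction of the difference is preserved. Which species sits on which axis is not stated in this excerpt, so the argument names and the example values are assumptions.

```python
import math


def signed_distance_from_diagonal(x, y):
    """Signed perpendicular distance of the point (x, y) from the line y = x.

    Positive when y > x, negative when y < x, zero on the diagonal.
    """
    return (y - x) / math.sqrt(2)


# Hypothetical Differential Entropy pairs (species on x axis, species on y axis).
orthologs = {"geneA": (-0.05, -0.01), "geneB": (-0.02, -0.02), "geneC": (-0.01, -0.06)}

# Rank orthologs by how far they sit from the no-difference diagonal.
ranked = sorted(
    orthologs,
    key=lambda g: abs(signed_distance_from_diagonal(*orthologs[g])),
    reverse=True,
)
```

Ranking by absolute distance picks out the orthologs whose bias differs most between the two species, regardless of direction.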
Codon usage in the protein CNTF in chicken and humans.
| AA | Syn. | Chicken N | Chicken PC | Chicken Prop. | Human N | Human PC | Human Prop. |
| Phe | 2 | 2 | TTC | 0.667 | 4 | TTC | 0.571 |
| Leu | 6 | 25 | CTG | 0.807 | 8 | CTG | 0.307 |
| Ile | 3 | 3 | ATC | 1.000 | 5 | ATC | 0.417 |
| Trp | 1 | 2 | TGG | 1.000 | 4 | TGG | 1.000 |
| Val | 4 | 7 | GTG | 0.636 | 4 | GTG | 0.500 |
| Ser | 6 | 6 | AGC | 0.462 | 4 | TCT | 0.308 |
| Pro | 4 | 4 | CCC | 0.400 | 3 | CCA | 0.429 |
| Thr | 4 | 4 | ACC | 0.500 | 6 | ACC | 0.500 |
| Ala | 4 | 12 | GCC | 0.462 | 7 | GCT | 0.467 |
| Tyr | 2 | 3 | TAC | 1.000 | 3 | TAT | 0.600 |
| Cys | 2 | 1 | TGC | 1.000 | 1 | TGT | 1.000 |
| His | 2 | 4 | CAC | 1.000 | 9 | CAT | 0.900 |
| Gln | 2 | 10 | CAG | 1.000 | 8 | CAG | 0.667 |
| Asn | 2 | 1 | AAC | 1.000 | 6 | AAC | 0.750 |
| Lys | 2 | 3 | AAG | 1.000 | 8 | AAG | 0.889 |
| Arg | 6 | 9 | CGG | 0.474 | 5 | CGT | 0.417 |
| Asp | 2 | 9 | GAC | 0.819 | 6 | GAC | 0.600 |
| Glu | 2 | 17 | GAG | 1.000 | 10 | GAG | 0.714 |
| Gly | 4 | 9 | GGC | 0.692 | 4 | GGG | 0.400 |
| Met | 1 | 4 | ATG | 1.000 | 5 | ATG | 1.000 |
For each amino acid (AA) the number of synonymous (Syn.) codons is given. For each protein sequence, three values are given: the number (N) of occurrences of each AA, the preferred codon (PC) and the proportion (Prop.) in which the PC is used.
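The three quantities defined in this footnote can be computed from a coding sequence roughly as follows. This is a sketch under the assumption of the standard genetic code and an uppercase ACGT input; the function name is hypothetical, amino acids are keyed by one-letter code rather than the three-letter codes used in the table, and ties between equally frequent codons are broken by first occurrence.

```python
from collections import Counter, defaultdict

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def preferred_codons(cds):
    """Return {amino_acid: (N, preferred_codon, proportion)} for one CDS.

    N is the number of occurrences of the amino acid, the preferred codon is
    its most frequent codon, and the proportion is that codon's share of N.
    Stop codons are skipped.
    """
    usage = defaultdict(Counter)
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        aa = CODON_TO_AA[codon]
        if aa != "*":
            usage[aa][codon] += 1
    summary = {}
    for aa, counts in usage.items():
        n = sum(counts.values())
        codon, top = counts.most_common(1)[0]
        summary[aa] = (n, codon, top / n)
    return summary


# Toy sequence: four leucines (three CTG, one TTA), one methionine, one stop.
example = preferred_codons("CTGCTGCTGTTAATGTAA")
```

A proportion of 1.000 for an amino acid with several synonymous codons means that a single codon is used at every occurrence.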
Codon usage in transcription factor RFX1 and ribosomal protein RPS3 in humans and yeast.
| AA | Syn. | RFX1 Human N | RFX1 Human PC | RFX1 Human Prop. | RFX1 Yeast N | RFX1 Yeast PC | RFX1 Yeast Prop. | RPS3 Human N | RPS3 Human PC | RPS3 Human Prop. | RPS3 Yeast N | RPS3 Yeast PC | RPS3 Yeast Prop. |
| Phe | 2 | 20 | TTC | 0.900 | 39 | TTC | 0.513 | 7 | TTT | 0.714 | 8 | TTC | 0.875 |
| Leu | 6 | 86 | CTG | 0.640 | 81 | TTA | 0.370 | 21 | CTG | 0.524 | 19 | TTG | 0.632 |
| Ile | 3 | 22 | ATC | 0.909 | 57 | ATT | 0.386 | 15 | ATC | 0.533 | 13 | ATC | 0.539 |
| Trp | 1 | 9 | TGG | 1.000 | 4 | TGG | 1.000 | 1 | TGG | 1.000 | 0 | TGG | 0 |
| Val | 4 | 81 | GTG | 0.630 | 33 | GTT | 0.333 | 25 | GTG | 0.560 | 25 | GTC | 0.520 |
| Ser | 6 | 91 | AGC | 0.451 | 116 | TCA | 0.259 | 10 | TCT | 0.300 | 9 | TCT | 0.556 |
| Pro | 4 | 89 | CCC | 0.494 | 62 | CCA | 0.323 | 17 | CCC | 0.412 | 11 | CCA | 1.000 |
| Thr | 4 | 64 | ACC | 0.563 | 43 | ACA | 0.395 | 13 | ACT | 0.385 | 13 | ACT | 0.616 |
| Ala | 4 | 96 | GCC | 0.573 | 31 | GCA | 0.387 | 18 | GCT | 0.444 | 28 | GCT | 0.964 |
| Tyr | 2 | 33 | TAC | 0.818 | 21 | TAC | 0.524 | 6 | TAC | 0.667 | 7 | TAC | 0.858 |
| Cys | 2 | 6 | TGC | 0.667 | 13 | TGT | 0.538 | 3 | TGC | 0.667 | 1 | TGT | 1.000 |
| His | 2 | 18 | CAC | 0.889 | 15 | CAT | 0.667 | 3 | CAC | 1.000 | 2 | CAC | 1.000 |
| Gln | 2 | 105 | CAG | 0.905 | 33 | CAA | 0.697 | 8 | CAG | 0.875 | 7 | CAA | 1.000 |
| Asn | 2 | 20 | AAC | 0.950 | 75 | AAT | 0.587 | 3 | AAT | 0.667 | 5 | AAC | 1.000 |
| Lys | 2 | 29 | AAG | 0.828 | 58 | AAA | 0.672 | 20 | AAG | 0.700 | 18 | AAG | 0.667 |
| Arg | 6 | 36 | CGG | 0.417 | 24 | AGA | 0.500 | 18 | CGG | 0.333 | 20 | AGA | 0.900 |
| Asp | 2 | 25 | GAC | 0.920 | 29 | GAT | 0.621 | 8 | GAC | 0.500 | 9 | GAC | 0.778 |
| Glu | 2 | 52 | GAG | 0.885 | 39 | GAA | 0.741 | 18 | GAG | 0.611 | 22 | GAA | 1.000 |
| Gly | 4 | 79 | GGC | 0.684 | 22 | GGC | 0.364 | 23 | GGC | 0.391 | 17 | GGT | 1.000 |
| Met | 1 | 18 | ATG | 1.000 | 16 | ATG | 1.000 | 16 | ATG | 1.000 | 6 | ATG | 1.000 |
For each amino acid (AA) the number of synonymous (Syn.) codons is given. For each protein sequence, three values are given: the number (N) of occurrences of each AA, the preferred codon (PC) and the proportion (Prop.) in which the PC is used.