Abstract
Codon usage can provide insights into the nature of the genes in a genome. Genes that are "native" to a genome (have not been recently acquired by horizontal transfer) range in codon usage from a low-bias "typical" usage to a more biased "high-expression" usage characteristic of genes encoding abundant proteins. Genes that differ from these native codon usages are candidates for foreign genes that have been recently acquired by horizontal gene transfer. In this study, we present a method for characterizing the codon usages of native genes--both typical and highly expressed--within a genome. Each gene is evaluated relative to a half line (or axis) in a 59D space of codon usage. The axis begins at the modal codon usage, the usage that matches the largest number of genes in the genome, and it passes through a point representing the codon usage of a set of genes with expression-related bias. A gene whose codon usage matches (does not significantly differ from) a point on this axis is a candidate native gene, and the location of its projection onto the axis provides a general estimate of its expression level. A gene that differs significantly from all points on the axis is a candidate foreign gene. This automated approach offers significant improvements over existing methods. We illustrate this by analyzing the genomes of Pseudomonas aeruginosa PAO1 and Bacillus anthracis A0248, which can be difficult to analyze with commonly used methods due to their biased base compositions. Finally, we use this approach to measure the proportion of candidate foreign genes in 923 bacterial and archaeal genomes. The organisms with the most homogeneous genomes (containing the fewest candidate foreign genes) are mostly endosymbionts and parasites, though with exceptions that include Pelagibacter ubique and Beutenbergia cavernae. 
The organisms with the most heterogeneous genomes (containing the most candidate foreign genes) include members of the genera Bacteroides, Corynebacterium, Desulfotalea, Neisseria, Xylella, and Thermobaculum.
Year: 2010 PMID: 20679093 PMCID: PMC3002238 DOI: 10.1093/molbev/msq185
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
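The geometric step described in the abstract, locating a gene relative to the half line that starts at the modal codon usage and passes through the high-expression codon usage, can be illustrated with a minimal sketch. The function name `project_onto_axis` and the scaling of x (0 at the mode, 1 at the high-expression point) are assumptions made for illustration. The Euclidean distance returned here is only a stand-in for the significance test mentioned in the abstract, which is not reproduced.

```python
from math import sqrt


def project_onto_axis(gene, mode, high_expr):
    """Project a codon usage vector onto the half line mode -> high_expr.

    All three arguments are equal-length sequences of codon frequencies
    (59 values in the setting of the abstract; any length works here).
    Returns (x, distance):
      x        -- position of the orthogonal projection along the axis,
                  assumed scaling: 0 at the mode, 1 at high_expr
      distance -- Euclidean distance from the gene to the nearest point
                  on the half line (a stand-in, not the paper's test)
    """
    direction = [h - m for h, m in zip(high_expr, mode)]
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        raise ValueError("mode and high_expr must be different points")

    offset = [g - m for g, m in zip(gene, mode)]
    x = sum(o * d for o, d in zip(offset, direction)) / norm_sq

    # A half line has no points before its origin, so the nearest point
    # for a gene with negative x is the mode itself.
    t = max(x, 0.0)
    nearest = [m + t * d for m, d in zip(mode, direction)]
    distance = sqrt(sum((g - n) ** 2 for g, n in zip(gene, nearest)))
    return x, distance
```

In a two-dimensional toy case with `mode = [0, 0]` and `high_expr = [2, 0]`, a gene at `[1, 1]` projects to the midpoint of the segment, while a gene at `[-1, 0]` has a negative x and is measured against the mode.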
Iterative Development of a High-Expression Gene Set in E. coli K-12.
| Initial high-expression codon usage estimate | Number of genes | Matching genes | Iteration 1 | Iteration 2 | Iteration 3 |
| Ribosomal protein genes (mode) | 55 | 2,402 | 2,533 | 2,571 | 2,588 |
| Ribosomal protein genes (average) | 55 | 2,476 | 2,552 | 2,582 | 2,598 |
| CAI genes (average) | 27 | 2,512 | 2,563 | 2,586 | 2,600 |
| aa-transfer RNA synthetase genes (average) | 22 | 2,565 | 2,593 | 2,600 | 2,597 |
| | 1 | 2,436 | 2,597 | 2,600 | 2,598 |
| Comparison of matching gene sets | | | | | |
| Genes in any set (union) | | 2,655 | 2,658 | 2,639 | 2,622 |
| Genes in at least three of five sets | | 2,479 | 2,562 | 2,584 | 2,599 |
| Genes in all sets (intersection) | | 2,306 | 2,477 | 2,533 | 2,563 |
| Comparison of 415 matching genes with highest x values | | | | | |
| Genes in any set (union) | | 658 | 499 | 469 | 444 |
| Genes in at least three of five sets | | 394 | 416 | 414 | 417 |
| Genes in all sets (intersection) | | 225 | 327 | 362 | 381 |
Genes matching the axis that intersects the mode of the genome and the original high-expression codon usage from the first column. Genes that do not match the mode and have negative x values are excluded.
Genes used to define optimal codons in CAI analysis (Sharp and Li 1986).
Generated by combining the native genes in each column.
Top 10% of the genes in the genome (415 genes) with highest x values for each column.
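The set comparisons reported in the table (union, membership in at least three of five sets, intersection, and the top 10% of genes by x value) can be expressed compactly. The following is a minimal sketch; the function names `compare_gene_sets` and `top_fraction` and the gene identifiers used with them are hypothetical, and tie handling among equal x values is an assumption.

```python
from collections import Counter


def compare_gene_sets(gene_sets, min_sets=3):
    """Return (union, majority, intersection) for a list of gene-ID sets.

    majority holds the genes present in at least `min_sets` of the sets.
    """
    sets = [set(s) for s in gene_sets]
    if not sets:
        raise ValueError("at least one gene set is required")
    # Count in how many sets each gene occurs.
    counts = Counter(gene for s in sets for gene in s)
    union = set(counts)
    majority = {g for g, n in counts.items() if n >= min_sets}
    intersection = {g for g, n in counts.items() if n == len(sets)}
    return union, majority, intersection


def top_fraction(x_values, fraction=0.10):
    """Return the IDs of the genes with the highest x values.

    x_values maps gene ID -> x value. Ties at the cutoff are broken by
    input order (an assumption, not taken from the source).
    """
    k = round(len(x_values) * fraction)
    ranked = sorted(x_values, key=x_values.get, reverse=True)
    return set(ranked[:k])
```

Applying `top_fraction` to each of the five columns and passing the resulting sets to `compare_gene_sets` would give the three rows of the lower part of the table for that column.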
FFCA plot of E. coli K-12. Each plot point shows the location of a gene in the first two axes of the analysis. Genes are colored according to their axis position (x value) based upon the colors of the visible spectrum, with red genes indicating the highest expression-related codon usage bias and violet genes indicating the least. Genes that differ significantly from all points on the native codon usage axis (likely to be foreign) are colored gray and are drawn behind the colored genes. Each gene’s position along the first axis of the plot also corresponds with its G + C content (from left to right: high G + C to low G + C) (see also Médigue et al. 1991).
FFCA plot of P. aeruginosa PAO1. (A) Genes that are orthologous to those in E. coli K-12 are colored based upon E. coli axis position (x value) from figure 1. The nonorthologous genes are colored gray. (B) Genes are colored according to P. aeruginosa axis position (x value) based on the colors of the visible spectrum, with red genes having the highest expression-related codon usage bias and violet genes having the least. Genes that differ significantly from all points on the native codon usage axis (likely to be foreign) are colored gray. In both panels, gray genes are drawn behind the colored genes. Genes in the right portion of the first axis have low G + C contents (see also Grocock and Sharp 2002).
FFCA plot of B. anthracis A0248. (A) Genes that are orthologous to those in E. coli K-12 are colored based upon E. coli axis position (x value) from figure 1. (B) Genes are colored according to B. anthracis axis position (x value) based on the colors of the visible spectrum, with red genes having the highest expression-related codon usage bias and violet genes having the least. Genes that differ significantly from all points on the native codon usage axis (likely to be foreign) are colored gray. In both panels, gray genes are drawn behind the colored genes. Each gene’s position along the first axis of the plot also roughly corresponds with its G + C content (from left to right: low G + C to high G + C).
Percentage of Candidate Foreign Genes in Each Genome for the Ten Bacterial and Archaeal Species With the Most and Least Homogeneous Genomes.
| Organism | CDS | % G+C | % Foreign |
| Most homogeneous | | | |
| | 639 | 23.6 | 3.8 |
| | 838 | 29.4 | 7.8 |
| | 933 | 28.5 | 7.8 |
| | 855 | 30.1 | 8.3 |
| | 584 | 28.9 | 8.6 |
| | 591 | 27.3 | 8.8 |
| | 543 | 31.2 | 10.1 |
| | 1,355 | 29.8 | 10.3 |
| | 4,278 | 73.2 | 10.5 |
| | 1,292 | 32.9 | 11.0 |
| Least homogeneous | | | |
| | 3,900 | 43.2 | 65.0 |
| | 4,832 | 43.9 | 64.5 |
| | 4,233 | 44.1 | 64.1 |
| | 2,343 | 54.1 | 61.6 |
| | 3,240 | 47.5 | 59.1 |
| | 2,243 | 52.9 | 58.9 |
| | 2,994 | 54.8 | 57.5 |
| | 2,917 | 53.8 | 54.9 |
| | 2,023 | 54.0 | 53.8 |
| | 3,048 | 53.8 | 52.4 |
When multiple strains of a species are available, only the most extreme is included.
Number of coding sequences in the genome (all replicons combined).
In coding sequences.
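The ranking behind this table can be sketched as follows, assuming each genome is summarized by its number of coding sequences and its number of candidate foreign genes. The function names, the tuple layout, and all records in the usage example are hypothetical. Following the first footnote, only the most extreme strain of each species is kept for each list.

```python
def percent_foreign(n_foreign, n_cds):
    """Percentage of coding sequences flagged as candidate foreign genes."""
    return 100.0 * n_foreign / n_cds


def rank_species(genomes, top_n=10):
    """Rank species by percentage of candidate foreign genes.

    genomes: iterable of (species, strain, n_cds, n_foreign) tuples.
    Returns (most_homogeneous, least_homogeneous), each a list of
    (species, strain, percent) with one strain per species: the lowest
    percentage for the first list, the highest for the second.
    """
    lowest, highest = {}, {}
    for species, strain, n_cds, n_foreign in genomes:
        pct = percent_foreign(n_foreign, n_cds)
        if species not in lowest or pct < lowest[species][1]:
            lowest[species] = (strain, pct)
        if species not in highest or pct > highest[species][1]:
            highest[species] = (strain, pct)

    most = sorted(
        ((sp, st, p) for sp, (st, p) in lowest.items()),
        key=lambda row: row[2],
    )[:top_n]
    least = sorted(
        ((sp, st, p) for sp, (st, p) in highest.items()),
        key=lambda row: row[2],
        reverse=True,
    )[:top_n]
    return most, least


# Hypothetical records, for illustration only (not data from the study).
example = [
    ("species_a", "strain_1", 1000, 50),
    ("species_a", "strain_2", 1000, 100),
    ("species_b", "strain_1", 2000, 1200),
    ("species_c", "strain_1", 500, 100),
]
most_homogeneous, least_homogeneous = rank_species(example, top_n=2)
```

Keeping separate per-species minima and maxima means a species with several sequenced strains can contribute a different strain to each list.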