
Codon choice in genes depends on flanking sequence information--implications for theoretical reverse translation.

Karthikeyan Sivaraman¹, Aswinsainarain Seshasayee, Patrick M Tarwater, Alexander M Cole.

Abstract

Algorithms for theoretical reverse translation have direct applications in degenerate PCR. The conventional practice is to create several degenerate primers, each of which variably encodes the peptide region of interest. In the current work, for each codon we have analyzed the flanking residues in proteins and determined their influence on codon choice. From this, we created a method for theoretical reverse translation that includes information from flanking residues of the protein in question. Our method, named the neighbor correlation method (NCM), and its enhancement, the consensus-NCM (c-NCM), performed significantly better than the conventional codon-usage statistic method (CSM). Using the methods NCM and c-NCM, we were able to increase the average sequence identity from 77% up to 81%. Furthermore, we revealed a significant increase in coverage, at 80% identity, from < 20% (CSM) to > 75% (c-NCM). The algorithms, their applications and implications are discussed herein.

Substances:
Bacterial Proteins
Codon

Year: 2008 PMID: 18203741 PMCID: PMC2241905 DOI: 10.1093/nar/gkm1181

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Word usage and codon usage in bacterial genomes have been extensively documented, both in coding (1) and non-coding regions (2). These reports show that word usage in genomes is non-random and serves as a biological signature of the organism in question. One such signature is codon usage in open reading frames (ORFs), which is reflected in measures such as the codon adaptation index (CAI) (3). Though CAI provides a convenient measure of codon bias, several reports show that codon usage is not a property of isolated codons and that in several cases the bases immediately upstream or downstream affect translation (4). Such neighboring base effects are well studied in stop codon read-through experiments, where the flanking base or codon has been shown to affect the accuracy and magnitude of read-through (5). Apart from single bases, the effect of flanking codons has also been well studied in the literature. Gutman and Hatfield (6) show that there is a strong first-order Markovian relationship between codons in a gene and that this relation is seen even after translation, in proteins. Boycheva and colleagues extended this study to reveal that translation efficiency is strongly dependent on the dicodon pair that encodes a given amino acid pair (7). They suggest that relative orientations of tRNA in the ribosome may cause the observed differences in translation efficiency and that certain dicodon pairs are subsequently selected evolutionarily. Moura and coworkers use a more recent and larger dataset for an analysis of dicodon usage patterns in both prokaryotes and eukaryotes. Their results suggest that the geometric constraints imposed by the translation machinery are driving forces in the evolution of gene sequences in bacteria (8). Collectively, these results suggest the existence of strong first-order Markovian relationships between codons in a gene.
We hypothesized that the information content of such correlations is carried over to the proteins, at least in part, when the gene is translated. This information manifests itself as a lack of randomness in the choice of codons and it is apparent when one attempts to theoretically reverse translate a protein sequence. Reverse translation has been discussed earlier as an abstract logical flow of information from proteins to DNA (9). In this work, we consider the pragmatic problem of theoretical reverse translation itself, rather than that of information flow from proteins to DNA. Theoretical reverse translation of protein sequences has potential applications in primer design for degenerate PCR and in the design of synthetic genes (10). In degenerate PCR, several primers are designed, each representing a variant DNA sequence encoding the peptide region of interest. One of the best methods designed for degenerate PCR can, in the best case scenarios, still utilize up to 128 primers on one end (5′- or 3′-end) and one or more at the other end (11). Though no specific software is available for reverse translation, the conventional procedure is to substitute codons for residues based on the overall genomic codon usage probabilities, which requires that different primers be designed for each ambiguous codon in the region of interest. In practice, it is common for almost all possibilities to be covered, increasing the number of required primers exponentially. Thus, improvements in reverse translation will help reduce the ambiguity in degenerate PCR. Improvements in reverse translation can be brought about by studying the rules of codon usage in the genome, which is feasible due to the availability of whole genome sequences. In this study, we created a framework for reverse translation of bacterial gene sequences and termed it the neighbor correlation method (NCM), due to its use of neighboring (flanking) sequence information to predict codon usage.
We provide evidence for the dependency of codon choice on the flanking amino acid residues and used this dependency to reverse-translate protein sequences from two model genomes. We confirmed that NCM was a substantial improvement over the conventional method (codon-usage statistic method—CSM). Furthermore, we introduced a modification to both CSM and NCM [consensus CSM (c-CSM) and consensus NCM (c-NCM)] to improve significantly the sensitivity of reverse translations by both CSM and NCM, and show that these observed differences in performance are statistically significant. Finally, using the protein sequences of Salmonella typhi CT18 and the probability matrix from Escherichia coli K12, we show that it is possible to reverse translate sequences from organisms for which a reverse translation matrix is not available, by using a matrix from a related organism.

MATERIALS AND METHODS

All sequences were obtained from the NCBI database. For the analyses, the genome and predicted ORF sequences of E. coli K12 (12), B. subtilis (13), S. typhi CT18 (14), Acidobacteria bacterium (NC_008095), Aquifex aeolicus (15), Bacteroides thetaiotaomicron (16), Bordetella pertussis (17), Campylobacter jejuni (18), Caulobacter crescentus (19), Chlamydia trachomatis (20), Clostridium acetobutylicum (21), Dehalococcoides ethenogenes (22), Deinococcus radiodurans (23), Fusobacterium nucleatum (24), Lactobacillus acidophilus (25), Mesorhizobium loti (26), Methanococcus jannaschii (27), Methanopyrus kandleri (28), Mycobacterium bovis (29), Mycobacterium tuberculosis (30), Mycoplasma genitalium (31), Myxococcus xanthus (32), Nanoarchaeum equitans (33), Prochlorococcus marinus (34), Pseudomonas aeruginosa (35), Rickettsia prowazekii (36), Sulfolobus solfataricus (37), Synechococcus elongatus (38), Thermoplasma acidophilum (39), Ureaplasma urealyticum (40) and Magnetococcus sp. (NC_008576) were used. We used needle, an implementation of the Needleman–Wunsch algorithm available in the EMBOSS package (41), for all sequence identity analyses. The algorithms discussed were implemented in Perl (script provided as Supplementary Data) on a Linux platform.

Analysis for non-random codon usage dependency on flanking amino acid residues

For codons of interest, a random occurrence model was constructed based on codon usage and amino acid frequencies in a given genome. We used 10 000 such random sets to calculate the z-scores for each residue–codon–residue combination. From the z-scores, P-values were calculated and were corrected for multiple testing, for both codon occurrence and amino acid occurrence biases, using the Bonferroni correction. To identify those combinations that have a skewed occurrence, we used a stringent threshold of P < 0.0001.
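The scoring step can be sketched as follows. This is a minimal illustration and not the authors' Perl script: the function names, the normal approximation used to turn a z-score into a P-value, and the number of tests passed to the Bonferroni step are all assumptions made for the example.

```python
import math
import statistics


def z_score(observed, null_counts):
    """z-score of an observed count against counts from random sets."""
    mean = statistics.fmean(null_counts)
    sd = statistics.stdev(null_counts)
    return (observed - mean) / sd


def two_sided_p(z):
    """Two-sided P-value under a normal approximation (an assumption)."""
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni(p, n_tests):
    """Bonferroni-adjusted P-value, capped at 1."""
    return min(1.0, p * n_tests)


# Toy data: five null counts instead of 10 000 random sets.
null_counts = [98, 102, 100, 101, 99]
z = z_score(120, null_counts)            # about 12.65
# n_tests is hypothetical here: 6 arginine codons x 400 flanking pairs.
p_adj = bonferroni(two_sided_p(z), n_tests=6 * 400)
is_skewed = p_adj < 0.0001
```

In a real run, `null_counts` would hold the count of one residue–codon–residue combination in each of the 10 000 random sets.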

Creation of the probability matrix for CSM

Codon usage in the genome of interest was calculated using the CUSP program in the EMBOSS package (41), and a codon usage probability table was created based on that information. For each amino acid the segmented probability interval spans from 0.0 to 1.0, where each consecutive non-overlapping segment corresponds to the probability of a unique codon (Figure 1A). This probability interval matrix had 64 individual data points under 21 categories (20 amino acids + stop codons).
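A segmented probability interval of this kind can be sketched as a list of cumulative upper bounds. The code below is an illustration under stated assumptions: the function names are invented for the example, and the glycine counts are hypothetical apart from the 40.5% share of GGC reported later in the text.

```python
from bisect import bisect_left
from itertools import accumulate


def probability_intervals(codon_counts):
    """Return codons and the cumulative upper bound of each codon's segment."""
    total = sum(codon_counts.values())
    codons = sorted(codon_counts)
    bounds = list(accumulate(codon_counts[c] / total for c in codons))
    bounds[-1] = 1.0  # guard against floating-point drift
    return codons, bounds


def pick_codon(codons, bounds, r):
    """Return the codon whose segment contains r, with 0 <= r <= 1."""
    return codons[bisect_left(bounds, r)]


# Hypothetical glycine counts per 1000 glycine residues.
glycine = {"GGA": 105, "GGC": 405, "GGG": 150, "GGT": 340}
codons, bounds = probability_intervals(glycine)
# Segments: GGA (0, 0.105], GGC (0.105, 0.51], GGG (0.51, 0.66], GGT (0.66, 1]
```

With these segments, `pick_codon(codons, bounds, 0.3)` falls in the GGC segment.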

Figure 1.

Illustration of reverse translation methods. (A) shows reverse translation of a protein sequence based on codon usage and (B) shows the reverse translation using NCM. GS represents ORF (gene) sequences from the genome of interest. The first part shows the creation of probability intervals for both panels. For NCM, Bayesian probabilities of codon usage were calculated given the flanking residues. Note that the codon usage profiles for alanine are distinct between the two methods. The second part depicts the reverse translation process, which is similar to both methods. A random number ‘r’ was generated and the codon corresponding to the probability interval (the horizontal line spanning 0.0–1.0 in both panels) within which r fell was used for creation of the ORF. This codon was then used for reverse translation.

Creation of the probability matrix for NCM

For each tripeptide A1–A2–A3 in the genome of interest, we calculated the usage probabilities of codons for A2 flanked by A1 and A3. Based on these probabilities, we created a probability interval matrix for all combinations of A1–C*–A3, where C* is the codon that encodes A2. The probability interval matrix thus created had 24 400 individual data points under 8000 categories (20 × 20 × 20 amino acid combinations). Creation of such a probability interval for the tripeptide S–A–S is illustrated in Figure 1B.
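The counting behind such a matrix can be sketched as below. This is an illustrative reading of the description, not the published script: the function names are hypothetical, the standard genetic code is assumed, and input ORFs are assumed to be in-frame, upper-case and free of ambiguous bases.

```python
from collections import Counter, defaultdict

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def ncm_counts(orfs):
    """Count codons used for A2, keyed by the tripeptide (A1, A2, A3)."""
    counts = defaultdict(Counter)
    for orf in orfs:
        codons = [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]
        residues = [CODON_TABLE[c] for c in codons]
        for i in range(1, len(codons) - 1):
            context = (residues[i - 1], residues[i], residues[i + 1])
            if "*" not in context:  # skip tripeptides that include a stop
                counts[context][codons[i]] += 1
    return counts


def ncm_probabilities(counts):
    """Turn per-tripeptide codon counts into usage probabilities."""
    return {
        context: {codon: n / sum(c.values()) for codon, n in c.items()}
        for context, c in counts.items()
    }


# Three toy ORFs that all encode the tripeptide S-A-S.
toy_orfs = ["TCTGCGTCT", "AGCGCGTCA", "TCTGCTAGC"]
probs = ncm_probabilities(ncm_counts(toy_orfs))
```

For the toy input, alanine flanked by two serines is encoded by GCG in two of three cases and by GCT in the third.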

Reverse translation

In reverse translation using CSM, a random number r was generated where 0 ≤ r ≤ 1, for each amino acid in the query protein sequence. The codon corresponding to the probability interval within which r falls was chosen for reverse translation. In NCM, overlapping tripeptides were used instead of single codons, and the codon was predicted for the second residue. However, when reverse translating with NCM, the first residue and stop codons were assigned based on probability alone. This procedure is also illustrated in Figure 1. c-NCM was created as an enhancement to NCM, in which reverse translation was performed n times using NCM for each protein sequence. The final DNA sequence was obtained by creating a consensus sequence from the n sequences created.
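The procedure can be sketched as follows. Several details are assumptions made for the example and are not stated in the text: the last residue and any tripeptide absent from the NCM table fall back to the CSM table, and the consensus is taken as a per-codon majority vote (the text does not say whether the consensus was built per base or per codon). All names and the toy tables are hypothetical.

```python
import random
from collections import Counter


def sample(prob_table, rng):
    """Draw a codon from {codon: probability} with a random r in [0, 1)."""
    r = rng.random()
    cumulative = 0.0
    for codon, p in sorted(prob_table.items()):
        cumulative += p
        if r < cumulative:
            return codon
    return codon  # rounding fall-through: return the last codon


def reverse_translate_ncm(protein, csm, ncm, rng):
    """One NCM pass; terminal residues and unseen contexts use the CSM table."""
    last = len(protein) - 1
    codons = []
    for i, residue in enumerate(protein):
        context = (protein[i - 1], residue, protein[i + 1]) if 0 < i < last else None
        codons.append(sample(ncm.get(context, csm[residue]), rng))
    return "".join(codons)


def consensus(sequences):
    """Per-codon majority vote; ties go to the codon seen first."""
    columns = zip(*([s[i:i + 3] for i in range(0, len(s), 3)] for s in sequences))
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)


def c_ncm(protein, csm, ncm, n=50, seed=0):
    """Run NCM n times and return the consensus DNA sequence."""
    rng = random.Random(seed)
    return consensus([reverse_translate_ncm(protein, csm, ncm, rng) for _ in range(n)])


# Toy tables, not real genome statistics.
csm = {"S": {"TCT": 1.0}, "A": {"GCG": 0.5, "GCT": 0.5}}
ncm = {("S", "A", "S"): {"GCG": 1.0}}
dna = c_ncm("SAS", csm, ncm, n=5)
```

Passing an empty `ncm` dictionary reduces the same code to the CSM and c-CSM variants, since every residue then falls back to the single-codon table.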

Statistical analyses of differences between various methods

In order to statistically test the difference in performance of the different methods, we used (i) either the Kolmogorov–Smirnov (KS) or the Mann–Whitney (MW) test for comparing distributions of nucleotide sequence identity and (ii) an F-test followed by FDR to identify the sequence identity range that is over-represented in one method over another. These tests were used to compare (i) c-NCM and NCM, (ii) NCM and CSM and (iii) c-NCM and CSM. In the case of the KS and MW tests, we used the sequence identity data. For the F-test and subsequent FDR analysis, we used the number of sequences scoring within a given sequence identity interval (for example, 300 sequences scored between 80% and 85%). All tests were run in R (http://www.r-project.org). The complete statistical analysis and data are provided in Supplementary Data.
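For readers unfamiliar with the KS comparison, the sketch below computes only the two-sample test statistics from empirical distribution functions. It does not compute P-values and is not a substitute for the R analysis described above; the function names and the toy identity values are assumptions.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical cumulative distribution function of a sample."""
    data = sorted(sample)
    n = len(data)
    return lambda x: bisect_right(data, x) / n


def ks_statistics(x, y):
    """Return (D_plus, D_minus, D) for two samples.

    D_plus is the largest amount by which the ECDF of x exceeds that of y,
    which is large when the values in x tend to be smaller than those in y.
    """
    fx, fy = ecdf(x), ecdf(y)
    points = sorted(set(x) | set(y))
    d_plus = max(fx(p) - fy(p) for p in points)
    d_minus = max(fy(p) - fx(p) for p in points)
    return d_plus, d_minus, max(d_plus, d_minus)


# Toy percentage identities, not data from the study.
identities_csm = [76, 77, 78, 79]
identities_ncm = [78, 79, 80, 81]
d_plus, d_minus, d = ks_statistics(identities_csm, identities_ncm)
```

Here the CSM values sit below the NCM values, so the whole difference appears in `d_plus` and `d_minus` is zero.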

Statistical analyses of iteration threshold for c-NCM

The c-NCM was performed on a random set of 1000 sequences in the E. coli K12 genome. Various iterations were used, ranging from 5 to 100 in five steps. Resultant sequences were compared with reference gene sequences using needle and percentage identity calculated. The distribution of scores from 50 iterations was compared to (i) that of NCM for these 1000 sequences and (ii) the distribution of scores from 100 iterations. For the comparison, we used KS test with alternative hypothesis = greater. There was no significant difference between the scores of iterations 50 and 100 (P = 0.2406). However, there was a significant difference between NCM and the 50-iteration c-NCM (same test as above, P < 2.2 × 10⁻¹⁶), and hence we used 50 iterations as the threshold for c-NCM predictions. A similar approach was used to test the performance of c-CSM. The results of c-CSM were then compared with those of c-NCM.

RESULTS AND DISCUSSIONS

Reverse translation of protein sequences is necessary for the design of degenerate primers. In most cases, reverse translation uses the codon usage statistics of the complete genome or a representative set of genes for the organism of interest. While dictated by overall genomic preference, this method rests on the assumption that usage of a codon in a gene is essentially random. Until this study, there has been no comprehensive analysis of the statistics of reverse translation using the classical method. In this work, we show that the choice of codons for reverse translation can be refined further by taking into account the residues flanking the residue of interest in a protein. Based on this observation, we have devised a method called the NCM that uses the correlation between codon usage and flanking residues in proteins. As a case study, we have analyzed the efficiency of reverse translation using NCM performed on the set of predicted ORFs of E. coli K12 and B. subtilis.

Correlation between codon choice and the flanking amino acid residues in the E. coli K12 genome

We analyzed the codon usage in the genomes of both E. coli K12 and B. subtilis and observed that the codon usage was not random but was to some extent dependent on the flanking codons. This dependency on flanking codons was reflected as a dependency on the flanking residues in proteins. For example, the codon GGC (Gly) encodes for 40.5% of all glycine residues present in E. coli (Supplementary Data). In the NCM, there are 400 possible theoretical combinations for any given codon. If the distribution of GGC were to be random, each of the combinations would span 0.25% (random probability = 0.0025) of the probability space. However, we observed that GGC is often flanked by branched chain aliphatic amino acids and hydrophobic amino acids. The 12 combinations (3% of total possible combinations) shown in Table 1 contribute almost 12% of total GGC usage in the genome, yielding a usage that is as much as four times the expected random usage.

Table 1.

Table showing strong distribution of the codon ‘GGC’ flanked by hydrophobic amino-acids (ILV)

Residue 1	codon	Residue 2	Occurrence (Occ)	p(R1-GGC-R2) (pX) = Occ/Total	pX/pRand
A	GGC	G	326	0.008 134	3.253
A	GGC	V	399	0.009 956	3.982
G	GGC	G	362	0.009 033	3.613
G	GGC	V	343	0.008 559	3.423
I	GGC	A	352	0.008 783	3.513
I	GGC	G	371	0.009 257	3.703
I	GGC	V	303	0.007 561	3.024
L	GGC	A	364	0.009 083	3.633
L	GGC	G	473	0.011 803	4.721
L	GGC	V	477	0.011 902	4.761
S	GGC	G	324	0.008 085	3.234
V	GGC	G	326	0.008 134	3.253

Occurrence in table denotes overall genomic occurrence of the combination. The pX denotes the occurrence probability of the combination X (occurrence/total occurrences of the codon GGC). The pRand denotes the random occurrence probability of the combination X (pRand = 1/400 = 0.0025). The pX/pRand denotes the ratio between observed and expected probabilities. These 12 combinations (out of 400) represent almost 12% of the total occurrences of GGC in the genome (expected = 3%) representing a skew in codon usage dependent on flanking residues.
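The last column of Table 1 follows directly from the definitions in the legend. The short sketch below recomputes it for two rows; the variable and function names are illustrative only, and the pX values are copied from the table.

```python
# Random expectation: one of 20 x 20 possible flanking-residue pairs.
P_RAND = 1 / 400


def enrichment(p_x, p_rand=P_RAND):
    """Ratio of observed to expected probability (pX/pRand)."""
    return p_x / p_rand


# pX values taken from Table 1, keyed by (Residue 1, Residue 2).
p_x = {("L", "V"): 0.011902, ("A", "G"): 0.008134}
ratios = {pair: enrichment(p) for pair, p in p_x.items()}

# Share of GGC usage expected for any 12 of the 400 combinations.
expected_share = 12 * P_RAND
```

The recomputed ratios agree with the pX/pRand column to within the rounding of the tabulated pX values, and `expected_share` reproduces the 3% baseline quoted in the legend.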

Though the analysis of GGX shows that codon usage is non-random, the data discussed is specific for E. coli. Furthermore, glycine is encoded by only four codons and does not exhibit maximum degeneracy. In order both to test these observations in multiple genomes and to use a more degenerately encoded amino acid, we have analyzed the codon usage for the amino acid arginine in 30 genomes. Arginine is encoded by six codons and is amongst the most degenerately encoded amino acids along with leucine and serine. In our analysis, for each of the six codons (C), we generated 10 000 random distributions with flanking amino acid residues (R1–C–R2). Using these random distributions, z-scores and P-values for each observed combination were calculated. The calculated P-values were adjusted for both codon representation bias (for a given codon) and amino acid representation biases (across all codons for a given flanking pair) using the Bonferroni correction. The resultant values were screened using a stringent threshold of P < 0.0001. We observed that even after stringent corrections there were several combinations that had a non-random distribution. The results of these tests are given in Supplementary Data. These tests prove that codon usage varies with a change in flanking amino acid residues.
We therefore hypothesized that a method exploiting the flanking residue information will be more sensitive in detecting signals that are lost in the conventional method (CSM) for reverse translation.

Comparison and analysis of reverse translations using CSM and NCM

In order to compare the performance of CSM and NCM, we reverse translated all the proteins in two genomes, E. coli K12 and B. subtilis, using both methods. Identity of the reverse translated proteins with the reference (original) ORF was used to quantify sensitivity of the methods. First, the distribution of percentage identities of nucleotide sequences reverse translated via NCM is significantly greater than that for CSM (P < 2.2 × 10⁻¹⁶; one-tailed KS test). A second assessment of performance using ratios of sequence identities (%ID_NCM/%ID_CSM) revealed that there was a small yet statistically significant increase in average sequence identities (P < 2.2 × 10⁻¹⁶; one-tailed MW test, null-hypothesis: ratio = 1). The average increase in sequence identity for all the sequences was ∼1%. We then grouped all sequence identities into bins of width 5% and tested which bins were significantly enriched in NCM over CSM. This revealed that NCM reverse translates a significantly larger number of protein sequences to nucleotides with high identities of >80–85% (P-value: 4.5 × 10⁻¹⁵, Fisher's test and FDR correction; Figure 2A). At this sequence identity range, there are more than twice as many DNA sequences predicted by NCM (239 sequences) as are predicted by CSM (103 sequences).
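The binning step can be sketched as below. The bin convention is an assumption inferred from the ">80–85%" notation in the text, namely half-open bins that exclude the lower bound and include the upper one; the function names and the toy identities are illustrative.

```python
import math
from collections import Counter


def bin_identities(identities, width=5):
    """Count sequences per bin (lower, lower + width], keyed by lower bound."""
    counts = Counter()
    for value in identities:
        lower = math.ceil(value / width) * width - width
        counts[lower] += 1
    return counts


def bin_label(lower, width=5):
    """Label a bin in the style used in the text, for example '>80-85%'."""
    return f">{lower}-{lower + width}%"


# Toy percentage identities, not data from the study.
counts = bin_identities([80.0, 80.1, 84.9, 85.0, 77.3])
```

With this convention an identity of exactly 80% falls in the >75–80% bin, while 85% still counts toward >80–85%. The per-bin counts for two methods would then feed the enrichment test described above.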

Figure 2.

Percentage identity distribution of reverse translated ORF sequences in E. coli K12 and B. subtilis is shown in panels A and B, respectively. This study compares the identities of reverse translations by CSM, NCM and c-NCM, revealing the distribution of percentage identities of reverse translated ORFs to the native ORFs. Two genomes, E. coli K12 and B. subtilis, are represented herein. Note that the NCM predictions are both qualitatively and quantitatively better and are also more numerous beyond 77% identity in both cases. This graph depicts the current limits of theoretical reverse-translation at ∼85% for all the methods. The improvement of c-NCM over CSM and NCM, especially in regions of higher sequence identity, is clearly visible and significant (F-test with FDR correction; P < 2.3 × 10⁻⁴⁴).

With B. subtilis the overall numbers were lower, although we observed a similar trend. For NCM, the total number of sequences that had more than 75% identity was 3670 and the same for CSM was 3347 (Figure 2B). This represented an increase >10% in NCM predictions over CSM predictions. A 1% increase in the median sequence identity was seen for the B. subtilis data set as was in E. coli. As we increased the threshold to 80%, we observed 200 proteins that were reverse translated by NCM yet only 114 by CSM. Moreover, the average identity of sequences reverse translated by either method was lower in B. subtilis than in E. coli K12. On the whole, these results suggest more random choice of codons in B. subtilis than in E. coli K12. These collective results underscore an important and fundamental distinction between the two groups of bacteria tested: increased randomness in the gram positive genome (B. subtilis) may be an indicator of its earlier evolutionary origin as compared to the gram negative (E. coli) genome (42).
In order to identify the CAI range within which NCM was effective, we compared the distribution of CAI values of genes whose ID_NCM/ID_CSM ratio was >1.01 with those whose ID_NCM/ID_CSM ratio was <0.99 (KS test, alternative hypothesis = CAI distribution of NCM is lesser than that of CSM; P < 1.0 × 10⁻⁶). These results show that NCM performs significantly better than CSM in regions of low CAI.

Comparison of CSM and NCM on various phyla in the bacterial kingdom

In the previous section, we discussed and compared the results of reverse translation using CSM and NCM in two divergent bacterial species. Despite the phylogenetic distance between the two species, both were eubacteria with moderate GC content. In order to show that the differences are real, we tested and compared the methods on 28 different bacterial genomes, each representing a major clade in the bacterial kingdom as listed in KEGG (43). This list included one each of various groups like Archaebacteria, alpha-, beta-, gamma- and delta-proteobacteria, firmicutes, mollicutes, actinomyces, halo- and acido-bacteria, green sulfur and non-green sulfur bacteria, and cyanobacteria. The exhaustive list of organisms and a comparison of CSM and NCM in these genomes is tabulated in Table 2, and Supplementary Data lists the minima, maxima, median, first and third quartiles for these methods for all the genomes. From Table 2, it is evident that NCM outperforms CSM not only in genomes with moderate GC content but also in all major bacterial clades.

Table 2.

Table comparing performance of CSM, NCM, c-CSM and c-NCM in 30 different clades of bacterial kingdom

Clade	Organism	Genome ID	CSM–cNCM	NCM–cNCM	cCSM–cNCM
Hyperthermophiles	Aquifex_aeolicus	NC_000117	2.2xE-16	2.2xE-16	0.2643
Bacteroides	Bacteroides thetaiotaomicron	NC_000908	2.2xE-16	2.2xE-16	2.2xE-16
Beta-proteobacteria	Bordetella pertussis	NC_000909	2.2xE-16	2.2xE-16	2.2xE-16
Delta-proteobacteria	Myxococcus xanthus	NC_000918	2.2xE-16	2.2xE-16	2.2xE-16
Epsilon-proteobacteria	Campylobacter jejuni	NC_000919	2.2xE-16	2.2xE-16	2.2xE-16
Alpha-proteobacteria	Caulobacter crescentus	NC_000962	2.2xE-16	2.2xE-16	2.2xE-16
Chlamydia	Chlamydia trachomatis	NC_000963	2.2xE-16	2.2xE-16	2.2xE-16^a
Clostridia	Clostridium acetobutylicum	NC_001263	2.2xE-16	2.2xE-16	2.2xE-16
Green-nonsulfur	Dehalococcoides ethenogenes	NC_002162	2.2xE-16	2.2xE-16	5.168xE-08
Deinococcus	Deinococcus radiodurans	NC_002163	2.2xE-16	2.2xE-16	2.2xE-16
Fusobacteria	Fusobacterium nucleatum	NC_002516	2.2xE-16	2.2xE-16	1.567xE-08
Lactobacillales	Lactobacillus acidophilus	NC_002578	2.2xE-16	2.2xE-16	2.2xE-16
Alpha-rhizobacteria	Mesorhizobium loti	NC_002678	2.2xE-16	2.2xE-16	2.2xE-16
Euryarchaeota	Methanococcus jannaschii	NC_002696	2.2xE-16	2.2xE-16	0.2876
Euryarchaeota	Methanopyrus kandleri	NC_002754	2.2xE-16	2.2xE-16	2.936xE-10
Actinobacteria	Mycobacterium bovis	NC_002929	2.2xE-16	2.2xE-16	2.2xE-16
Actinobacteria	Mycobacterium tuberculosis	NC_002936	2.2xE-16	2.2xE-16	2.2xE-16
Mollicutes	Mycoplasma genitalium	NC_002945	2.2xE-16	2.2xE-16	2.2xE-16
Nanoarchaeota	Nanoarchaeum equitans	NC_003030	2.2xE-16	2.2xE-16	2.2xE-16^a
Cyanobacteria	Prochlorococcus marinus	NC_003454	2.2xE-16	2.2xE-16	2.2xE-16^a
Gamma-proteobacteria	Pseudomonas aeruginosa	NC_003551	2.2xE-16	2.2xE-16	0.01897
Alpha/Rickettsiae	Rickettsia prowazekii	NC_004663	2.2xE-16	2.2xE-16	2.2xE-16
Crenarchaeota	Sulfolobus solfataricus	NC_005072	2.2xE-16	2.2xE-16	0.761
Cyanobacteria	Synechococcus elongatus	NC_005213	2.2xE-16	2.2xE-16	0.0011
Euryarchaeota	Thermoplasma acidophilum	NC_006576	2.2xE-16	2.2xE-16	2.2xE-16
Spirochete	Treponema pallidum	NC_006814	2.2xE-16	2.2xE-16	2.2xE-16^a
Mollicutes	Ureaplasma urealyticum	NC_008009	2.2xE-16	2.2xE-16	0.0455
Acidobacteria	Acidobacteria bacterium	NC_008095	2.2xE-16	2.2xE-16	5.471xE-05
Magnetococcus	Magnetococcus MC-1	NC_008576	2.2xE-16	2.2xE-16	2.2xE-16

^a In these cases, c-CSM performed significantly better than c-NCM.


Improving the performance of NCM: the c-NCM

While NCM offered a better method to reverse-translate protein sequences, the overall improvement over CSM was apparent only at a higher sequence identity cut-off and for only a small fraction of the sequences. In order to improve the sensitivity of NCM, we developed a technique known as c-NCM, where the same protein was reverse-translated n times and a consensus was derived from the resultant sequence set. Our tests with a random set of 1000 sequences derived from the E. coli K12 genome (Figure 3) demonstrated a drastic improvement from 1 cycle to 20 cycles. After 25 cycles, there was only a small improvement in prediction efficiency, which became insignificant beyond 50 cycles as compared to 100 cycles (KS test, alternative hypothesis = two-tailed: P-value = 0.2406). Moreover, our tests with a sample set of 100 sequences show that there is no significant improvement in sequence identity between 100 and 1000 cycles (data not shown). Hence we chose to use 50 cycles for subsequent c-NCM-based studies. The results of c-NCM are summarized along with those for CSM and NCM, in Figure 2A and 2B for E. coli K12 and B. subtilis, respectively. The average identity of reverse translated sequences increased by 4% with c-NCM when compared to the results from NCM. In summary, c-NCM reverse translated >75% of sequences with 80% identity or more while the percentage of sequences scoring the same with NCM was <20% in both genomes. This difference is highly significant (P = 0 for >85–90% ID and 2.3 × 10⁻⁴⁴ for >80–85% ID, Fisher's test and FDR correction). These results revealed that c-NCM is an effective method for reverse translation of protein sequences based on genomic usage matrices, and also indicate that the performance of c-NCM was significantly better than both NCM and CSM. As was the case for CSM and NCM, we tested c-NCM on all the 30 genomes (Table 2). It could be seen that the performance of c-NCM was significantly different from that of both NCM and CSM for all phyla.

Figure 3.

Standardization of iteration values for c-NCM. This figure illustrates the improvement in sensitivity as the number of iterations is increased in NCM. We performed c-NCM-based reverse translations for 1000 randomly chosen proteins using various iterations (5–100, in five steps) and compared the results with (A) predictions from NCM and (B) predictions from 100 iterations of c-NCM. It can be seen that the largest difference is between iteration values of 1 (NCM) and 50 (KS test, alternative = greater; P = 2.2 × 10⁻¹⁶). However, there is a small increase of sensitivity as the iterations are increased. The sensitivity difference was tested to 100 cycles and since there was no significant difference between 50 and 100 cycles (KS test, alternative = greater; P = 0.2406), we chose the threshold for c-NCM at 50 cycles.

Apart from testing c-NCM on different genomes, we were also interested in analyzing the effects of the consensus enhancement on the CSM method. Differences from the normal trend, if any, would allow us to discern genomes that have increased or decreased randomness in their codon usage. On the same set of 30 genomes, we performed c-CSM (50 cycles) and compared the results with those of c-NCM using a Wilcoxon Rank Sum test. The results in Table 2 show that in 70% (21 of 30) of the tested genomes, c-NCM had a better performance than c-CSM. In five cases, the difference between the two methods was insignificant. These genomes were Aquifex aeolicus (hyperthermophile), Bordetella pertussis (beta-proteobacteria), Methanopyrus kandleri (euryarchaeota), Prochlorococcus marinus (cyanobacteria) and Acidobacteria bacterium (acidobacteria). In four other cases, c-CSM performed significantly better than c-NCM: they were Rickettsia prowazekii (alpha-proteobacteria/Rickettsiae), Clostridium acetobutylicum (Clostridia), Fusobacterium nucleatum (Fusobacteria), and Lactobacillus acidophilus (Lactobacillales). These results, at least for P. marinus and M. kandleri, show that in archaeal and cyanobacterial genomes very little of tricodon usage information is carried over to the protein level.

Application of reverse translation to an external genome: Salmonella typhi CT18

In the previous sections, we demonstrated that the improved method (c-NCM) performed significantly better than CSM and NCM. We hypothesized that NCM matrices created from a genome can be used for reverse translating protein sequences from a related genome. The S. typhi CT18 genome is 67% identical to the E. coli K12 genome at the DNA level, and hence was a good model system to test our hypothesis. Results from these comparisons showed significant differences in prediction quality between CSM and NCM. Again, as was seen in 21 other genomes, the use of c-NCM improved prediction quality, with average identity beyond 80%. There was a very small difference in the average identities and the distribution between S. typhi CT18 (Figure 4) and E. coli K12 (Figure 2A). These observations confirmed that our method can be successfully applied to related genomes, suggesting increased fidelity in the design of degenerate primers for an organism whose gene sequence information is meager or non-existent. In such cases, the use of (c-)NCM matrices from a related organism is a viable alternative.

Figure 4.

Reverse translation of S. typhi proteins using E. coli K12 matrices. S. typhi CT18 proteins were reverse translated using codon usage and NCM matrices of E. coli K12 genome. Analyses of identities with reference ORFs show that predictions using c-NCM are both qualitatively and quantitatively better than those using CSM (KS test: alternative = greater; P < 2.2 × 10⁻¹⁶). These results prove the applicability of c-NCM in cases where genome sequence data and NCM matrices are not available for the organism of interest.

Throughout this work, we have concentrated on the applications of reverse translation in design of degenerate PCR. However, these studies also reveal the underlying logic of codon usage in genes in general, and such knowledge will be imperative in the design of synthetic genes to be used in artificial genetic systems and can also be used to adapt recombinant genes in a host specific manner.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

43 in total

1. Isolation of segments of homologous genes with only one conserved amino acid region via PCR.

Authors: M Laging; B Fartmann; W Kramer
Journal: Nucleic Acids Res Date: 2001-01-15 Impact factor: 16.971

2. EMBOSS: the European Molecular Biology Open Software Suite.

Authors: P Rice; I Longden; A Bleasby
Journal: Trends Genet Date: 2000-06 Impact factor: 11.639

3. The complete genome of the crenarchaeon Sulfolobus solfataricus P2.

Authors: Q She; R K Singh; F Confalonieri; Y Zivanovic; G Allard; M J Awayez; C C Chan-Weiher; I G Clausen; B A Curtis; A De Moors; G Erauso; C Fletcher; P M Gordon; I Heikamp-de Jong; A C Jeffries; C J Kozera; N Medina; X Peng; H P Thi-Ngoc; P Redder; M E Schenk; C Theriault; N Tolstrup; R L Charlebois; W F Doolittle; M Duguet; T Gaasterland; R A Garrett; M A Ragan; C W Sensen; J Van der Oost
Journal: Proc Natl Acad Sci U S A Date: 2001-06-26 Impact factor: 11.205

4. The genome sequence of the thermoacidophilic scavenger Thermoplasma acidophilum.

Authors: A Ruepp; W Graml; M L Santos-Martinez; K K Koretke; C Volker; H W Mewes; D Frishman; S Stocker; A N Lupas; W Baumeister
Journal: Nature Date: 2000-09-28 Impact factor: 49.962

5. Complete genome sequence of Caulobacter crescentus.

Authors: W C Nierman; T V Feldblyum; M T Laub; I T Paulsen; K E Nelson; J A Eisen; J F Heidelberg; M R Alley; N Ohta; J R Maddock; I Potocka; W C Nelson; A Newton; C Stephens; N D Phadke; B Ely; R T DeBoy; R J Dodson; A S Durkin; M L Gwinn; D H Haft; J F Kolonay; J Smit; M B Craven; H Khouri; J Shetty; K Berry; T Utterback; K Tran; A Wolf; J Vamathevan; M Ermolaeva; O White; S L Salzberg; J C Venter; L Shapiro; C M Fraser; J Eisen
Journal: Proc Natl Acad Sci U S A Date: 2001-03-20 Impact factor: 11.205

6. Genome sequence of the radioresistant bacterium Deinococcus radiodurans R1.

Authors: O White; J A Eisen; J F Heidelberg; E K Hickey; J D Peterson; R J Dodson; D H Haft; M L Gwinn; W C Nelson; D L Richardson; K S Moffat; H Qin; L Jiang; W Pamphile; M Crosby; M Shen; J J Vamathevan; P Lam; L McDonald; T Utterback; C Zalewski; K S Makarova; L Aravind; M J Daly; K W Minton; R D Fleischmann; K A Ketchum; K E Nelson; S Salzberg; H O Smith; J C Venter; C M Fraser
Journal: Science Date: 1999-11-19 Impact factor: 47.728

7. Complete genome sequence of Pseudomonas aeruginosa PAO1, an opportunistic pathogen.

Authors: C K Stover; X Q Pham; A L Erwin; S D Mizoguchi; P Warrener; M J Hickey; F S Brinkman; W O Hufnagle; D J Kowalik; M Lagrou; R L Garber; L Goltry; E Tolentino; S Westbrock-Wadman; Y Yuan; L L Brody; S N Coulter; K R Folger; A Kas; K Larbig; R Lim; K Smith; D Spencer; G K Wong; Z Wu; I T Paulsen; J Reizer; M H Saier; R E Hancock; S Lory; M V Olson
Journal: Nature Date: 2000-08-31 Impact factor: 49.962

8. The complete sequence of the mucosal pathogen Ureaplasma urealyticum.

Authors: J I Glass; E J Lefkowitz; J S Glass; C R Heiner; E Y Chen; G H Cassell
Journal: Nature Date: 2000-10-12 Impact factor: 49.962

9. Complete genome structure of the nitrogen-fixing symbiotic bacterium Mesorhizobium loti.

Authors: T Kaneko; Y Nakamura; S Sato; E Asamizu; T Kato; S Sasamoto; A Watanabe; K Idesawa; A Ishikawa; K Kawashima; T Kimura; Y Kishida; C Kiyokawa; M Kohara; M Matsumoto; A Matsuno; Y Mochizuki; S Nakayama; N Nakazaki; S Shimpo; M Sugimoto; C Takeuchi; M Yamada; S Tabata
Journal: DNA Res Date: 2000-12-31 Impact factor: 4.458

10. Large scale comparative codon-pair context analysis unveils general rules that fine-tune evolution of mRNA primary structure.

Authors: Gabriela Moura; Miguel Pinheiro; Joel Arrais; Ana Cristina Gomes; Laura Carreto; Adelaide Freitas; José L Oliveira; Manuel A S Santos
Journal: PLoS One Date: 2007-09-05 Impact factor: 3.240
