Algorithms for theoretical reverse translation have direct applications in degenerate PCR. The conventional practice is to create several degenerate primers, each of which encodes a variant of the peptide region of interest. In the current work, for each codon we analyzed the flanking residues in proteins and determined their influence on codon choice. From this, we created a method for theoretical reverse translation that includes information from the flanking residues of the protein in question. Our method, named the neighbor correlation method (NCM), and its enhancement, the consensus-NCM (c-NCM), performed significantly better than the conventional codon-usage statistic method (CSM). Using NCM and c-NCM, we were able to increase the average sequence identity from 77% up to 81%. Furthermore, we observed a significant increase in coverage, at 80% identity, from < 20% (CSM) to > 75% (c-NCM). The algorithms, their applications and implications are discussed herein.
Word usage and codon usage in bacterial genomes have been extensively documented, in both coding (1) and non-coding regions (2). These reports show that word usage in genomes is non-random and serves as a biological signature of the organism in question. One such signature is codon usage in open reading frames (ORFs), which is reflected in measures such as the codon adaptation index (CAI) (3). Though CAI provides a convenient measure of codon bias, several reports show that codon usage is not a property of isolated codons, and in several cases the bases immediately upstream or downstream affect translation (4). Such neighboring base effects are well studied in stop codon read-through experiments, where the flanking base or codon has been shown to affect the accuracy and magnitude of read-through (5). Apart from single bases, the effect of flanking codons has also been well studied in the literature. Gutman and Hatfield (6) showed that there is a strong first-order Markovian relationship between codons in a gene and that this relation is seen even after translation, in proteins. Boycheva and colleagues extended this study to reveal that translation efficiency is strongly dependent on the dicodon pair that encodes a given amino acid pair (7). They suggest that the relative orientations of tRNAs in the ribosome may cause the observed differences in translation efficiency and that certain dicodon pairs are consequently selected during evolution. Moura and coworkers used a more recent and larger dataset to analyze dicodon usage patterns in both prokaryotes and eukaryotes. Their results suggest that the geometric constraints imposed by the translation machinery are driving forces in the evolution of gene sequences in bacteria (8). Collectively, these results suggest the existence of strong first-order Markovian relationships between codons in a gene.
We hypothesized that the information content of such correlations is carried over to the proteins, at least in part, when the gene is translated. This information manifests itself as a lack of randomness in the choice of codons, and it is apparent when one attempts to theoretically reverse translate a protein sequence.
Reverse translation has been discussed earlier as an abstract logical flow of information from proteins to DNA (9). In this work, we consider the pragmatic problem of theoretical reverse translation itself, rather than that of information flow from proteins to DNA. Theoretical reverse translation of protein sequences has potential applications in primer design for degenerate PCR and in the design of synthetic genes (10). In degenerate PCR, several primers are designed, each representing a variant DNA sequence encoding the peptide region of interest. One of the best methods designed for degenerate PCR can, even in the best-case scenario, still use up to 128 primers at one end (5′- or 3′-end) and one or more at the other end (11). Though no specific software is available for reverse translation, the conventional procedure is to substitute codons for residues based on the overall genomic codon usage probabilities, which requires that different primers be designed for each ambiguous codon in the region of interest of the gene. In practice, it is common for almost all possibilities to be covered, increasing the number of required primers exponentially. Thus, improvements in reverse translation will help reduce the ambiguity in degenerate PCR.
Improvements in reverse translation can be brought about by studying the rules of codon usage in the genome, which is feasible owing to the availability of whole genome sequences. In this study, we created a framework for reverse translation of bacterial gene sequences and termed it the neighbor correlation method (NCM), because it uses neighboring (flanking) sequence information to predict codon usage.
We provide evidence for the dependency of codon choice on the flanking amino acid residues and used this dependency to reverse-translate protein sequences from two model genomes. We confirmed that NCM was a substantial improvement over the conventional method (codon-usage statistic method—CSM). Furthermore, we introduced a modification to both CSM and NCM [consensus CSM (c-CSM) and consensus NCM (c-NCM)] to improve significantly the sensitivity of reverse translations by both CSM and NCM, and show that these observed differences in performance are statistically significant. Finally, using the protein sequences of Salmonella typhi CT18 and the probability matrix from Escherichia coli K12, we show that it is possible to reverse translate sequences from organisms for which a reverse translation matrix is not available, by using a matrix from a related organism.
MATERIALS AND METHODS
All sequences were obtained from the NCBI database. For the analyses, we used the genome and predicted ORF sequences of E. coli K12 (12), B. subtilis (13), S. typhi CT18 (14), Acidobacteria bacterium (NC_008095), Aquifex aeolicus (15), Bacteroides thetaiotaomicron (16), Bordetella pertussis (17), Campylobacter jejuni (18), Caulobacter crescentus (19), Chlamydia trachomatis (20), Clostridium acetobutylicum (21), Dehalococcoides ethenogenes (22), Deinococcus radiodurans (23), Fusobacterium nucleatum (24), Lactobacillus acidophilus (25), Mesorhizobium loti (26), Methanococcus jannaschii (27), Methanopyrus kandleri (28), Mycobacterium bovis (29), Mycobacterium tuberculosis (30), Mycoplasma genitalium (31), Myxococcus xanthus (32), Nanoarchaeum equitans (33), Prochlorococcus marinus (34), Pseudomonas aeruginosa (35), Rickettsia prowazekii (36), Sulfolobus solfataricus (37), Synechococcus elongatus (38), Thermoplasma acidophilum (39), Ureaplasma urealyticum (40) and Magnetococcus sp. (NC_008576). We used needle, an implementation of the Needleman–Wunsch algorithm available in the EMBOSS package (41), for all sequence identity analyses. The algorithms discussed were implemented in PERL (script provided as Supplementary Data) on a Linux platform.
Analysis for non-random codon usage dependency on flanking amino acid residues
For codons of interest, a random occurrence model was constructed based on codon usage and amino acid frequencies in a given genome. We used 10 000 such random sets to calculate the z-scores for each residue–codon–residue combination. From the z-scores, P-values were calculated and corrected for multiple testing, accounting for both codon occurrence and amino acid occurrence biases, using the Bonferroni correction. To identify those combinations that have a skewed occurrence, we used a stringent threshold of P < 0.0001.
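The z-score and correction steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not say how z-scores were converted to P-values, so a two-sided normal approximation is assumed here, and the function names are hypothetical rather than taken from the authors' PERL script.

```python
import math
import statistics

def z_score(observed, null_counts):
    """z-score of an observed count against counts from the random sets."""
    mean = statistics.fmean(null_counts)
    sd = statistics.pstdev(null_counts)
    if sd == 0:
        raise ValueError("null distribution has zero variance")
    return (observed - mean) / sd

def two_sided_p(z):
    """Two-sided P-value under a normal approximation (an assumption)."""
    return math.erfc(abs(z) / math.sqrt(2))

def bonferroni(p, n_tests):
    """Bonferroni-adjusted P-value, capped at 1."""
    return min(1.0, p * n_tests)

# Toy usage: one observed count versus a handful of simulated counts.
z = z_score(12, [8, 10, 12])
print(z, bonferroni(two_sided_p(z), n_tests=400))
```

A combination would then be flagged as skewed when its adjusted P-value falls below the 0.0001 threshold.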
Creation of the probability matrix for CSM
Codon usage in the genome of interest was calculated using the CUSP program in the EMBOSS package (41), and a codon usage probability table was created based on that information. For each amino acid, the segmented probability interval spans from 0.0 to 1.0, where each consecutive non-overlapping segment corresponds to the probability of a unique codon (Figure 1A). This probability interval matrix had 64 individual data points under 21 categories (20 amino acids + stop codons).
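As a minimal sketch of how such a segmented interval might be built from raw codon counts, consider the following Python example. The data layout (a list of `(upper_bound, codon)` pairs per amino acid), the function name and the toy counts are assumptions made for illustration; they are not the CUSP output format or genome data.

```python
from collections import defaultdict

def build_intervals(codon_counts, codon_to_aa):
    """Map each amino acid to (upper_bound, codon) segments covering 0.0-1.0."""
    by_aa = defaultdict(list)
    for codon, count in sorted(codon_counts.items()):
        by_aa[codon_to_aa[codon]].append((codon, count))
    intervals = {}
    for aa, items in by_aa.items():
        total = sum(count for _, count in items)
        upper, segments = 0.0, []
        for codon, count in items:
            upper += count / total
            segments.append((upper, codon))
        # Guard against floating-point drift so the last segment ends at 1.0.
        segments[-1] = (1.0, segments[-1][1])
        intervals[aa] = segments
    return intervals

# Toy counts for glycine only (illustrative numbers, not genome data).
toy_counts = {"GGA": 10, "GGC": 40, "GGG": 10, "GGT": 40}
toy_map = {codon: "G" for codon in toy_counts}
print(build_intervals(toy_counts, toy_map)["G"])
```

Storing only the upper bound of each segment is sufficient because the segments are consecutive and non-overlapping.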
Figure 1.
Illustration of reverse translation methods. (A) shows reverse translation of a protein sequence based on codon usage and (B) shows the reverse translation using NCM. GS represents ORF (gene) sequences from the genome of interest. The first part shows the creation of probability intervals for both panels. For NCM, Bayesian probabilities of codon usage were calculated given the flanking residues. Note that the codon usage profiles for alanine are distinct between the two methods. The second part depicts the reverse translation process, which is similar to both methods. A random number ‘r’ was generated and the codon corresponding to the probability interval (the horizontal line spanning 0.0–1.0 in both panels) within which r fell was used for creation of the ORF. This codon was then used for reverse translation.
Creation of the probability matrix for NCM
For each tripeptide A1–A2–A3 in the genome of interest, we calculated the usage probabilities of codons for A2 flanked by A1 and A3. Based on these probabilities, we created a probability interval matrix for all combinations of A1–C*–A3, where C* is the codon that encodes A2. The probability interval matrix thus created had 24 400 individual data points under 8000 categories (20 × 20 × 20 amino acid combinations). Creation of such a probability interval for the tripeptide S–A–S is illustrated in Figure 1B.
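The tripeptide counting step can be illustrated with a short Python sketch. It assumes in-frame ORF sequences and a codon-to-residue lookup table; the function names, the partial toy table and the toy genes are hypothetical and serve only to show the idea.

```python
from collections import Counter, defaultdict

def ncm_counts(genes, codon_to_aa):
    """Count codons of the middle residue for every (A1, A2, A3) context."""
    counts = defaultdict(Counter)
    for dna in genes:
        codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
        residues = [codon_to_aa.get(c) for c in codons]
        for i in range(1, len(codons) - 1):
            a1, a2, a3 = residues[i - 1], residues[i], residues[i + 1]
            if None in (a1, a2, a3):
                continue  # skip codons missing from the (possibly partial) table
            counts[(a1, a2, a3)][codons[i]] += 1
    return counts

def ncm_probabilities(counts):
    """Convert context counts into conditional codon probabilities."""
    return {ctx: {codon: n / sum(c.values()) for codon, n in c.items()}
            for ctx, c in counts.items()}

# Toy data: three tiny "genes" that all encode S-A-S (not genome data).
toy_table = {"TCT": "S", "TCC": "S", "GCG": "A", "GCC": "A"}
toy_genes = ["TCTGCGTCC", "TCCGCGTCT", "TCTGCCTCT"]
print(ncm_probabilities(ncm_counts(toy_genes, toy_table)))
```

The resulting conditional probabilities can be laid out on a 0.0–1.0 interval per context in the same way as for the single-residue CSM table.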
Reverse translation
In reverse translation using CSM, a random number r was generated where 0 ≤ r ≤ 1, for each amino acid in the query protein sequence. The codon corresponding to the probability interval within which r falls was chosen for reverse translation. In NCM, overlapping tripeptides were used instead of single codons, and the codon was predicted for the second residue. However, when reverse translating with NCM, the first residue and stop codons were assigned based on probability alone. This procedure is also illustrated in Figure 1. c-NCM was created as an enhancement to NCM, in which reverse translation was performed n times using NCM for each protein sequence. The final DNA sequence was obtained by creating a consensus sequence from the n sequences created.
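The sampling and consensus steps described above can be sketched in Python as follows. This is a minimal illustration, not the authors' PERL script: the function names, the `(upper_bound, codon)` interval layout, the fallback to CSM intervals for terminal residues or unseen tripeptide contexts, and the per-nucleotide majority vote used for the consensus are all assumptions made for the example.

```python
import bisect
import random
from collections import Counter

def pick_codon(segments, r):
    """Return the codon whose interval contains r; segments are (upper_bound, codon)."""
    bounds = [upper for upper, _ in segments]
    index = min(bisect.bisect_left(bounds, r), len(segments) - 1)
    return segments[index][1]

def reverse_translate_ncm(protein, csm, ncm, rng):
    """One NCM pass; residues lacking two neighbours, or an unseen context, use CSM."""
    codons = []
    for i, aa in enumerate(protein):
        context = tuple(protein[i - 1:i + 2]) if 0 < i < len(protein) - 1 else None
        segments = ncm.get(context, csm[aa])
        codons.append(pick_codon(segments, rng.random()))
    return "".join(codons)

def consensus(sequences):
    """Most common base per column; ties resolve to the base seen first."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sequences))

def c_ncm(protein, csm, ncm, n=50, seed=0):
    """Consensus of n independent NCM reverse translations."""
    rng = random.Random(seed)
    return consensus([reverse_translate_ncm(protein, csm, ncm, rng) for _ in range(n)])

# Toy interval tables (illustrative probabilities, not genome data).
toy_csm = {"S": [(0.5, "TCT"), (1.0, "TCC")], "A": [(0.3, "GCG"), (1.0, "GCC")]}
toy_ncm = {("S", "A", "S"): [(0.9, "GCG"), (1.0, "GCC")]}
print(c_ncm("SAS", toy_csm, toy_ncm))
```

Passing an empty dictionary as `ncm` reduces every lookup to the single-residue intervals, which corresponds to CSM (or c-CSM when n > 1).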
Statistical analyses of differences between various methods
In order to statistically test the difference in performance of the different methods, we used (i) either the Kolmogorov–Smirnov (KS) or the Mann–Whitney (MW) test for comparing distributions of nucleotide sequence identity and (ii) the F-test followed by FDR to identify the sequence identity range that is over-represented in one method over another. These tests were used to compare (i) c-NCM and NCM, (ii) NCM and CSM and (iii) c-NCM and CSM. For the KS and MW tests, we used the sequence identity data. For the F-test and subsequent FDR analysis, we used the number of sequences scoring within a given sequence identity interval (for example, 300 sequences scored between 80% and 85%). All tests were run in R (http://www.r-project.org). The complete statistical analysis and data are provided in Supplementary Data.
Statistical analyses of iteration threshold for c-NCM
The c-NCM was performed on a random set of 1000 sequences in the E. coli K12 genome. Various iterations were used, ranging from 5 to 100 in five steps. Resultant sequences were compared with reference gene sequences using needle, and the percentage identity was calculated. The distribution of scores from 50 iterations was compared to (i) that of NCM for these 1000 sequences and (ii) the distribution of scores from 100 iterations. For the comparison, we used the KS test with alternative hypothesis = greater. There was no significant difference between the scores of iterations 50 and 100 (P = 0.2406). However, there was a significant difference between NCM and the 50-iteration c-NCM (same test as above, P < 2.2 × 10−16), and hence we used 50 iterations as the threshold for c-NCM predictions. A similar approach was used to test the performance of c-CSM. The results of c-CSM were then compared with those of c-NCM.
RESULTS AND DISCUSSIONS
Reverse translation of protein sequences is necessary for the design of degenerate primers. In most cases, reverse translation uses the codon usage statistics of the complete genome or a representative set of genes for the organism of interest. While dictated by overall genomic preference, this method rests on the assumption that usage of a codon in a gene is essentially random. Until this study, there has been no comprehensive analysis on the statistics of reverse-translation using the classical method. In this work, we show that the choice of codons for reverse translation can be refined further by taking into account the residues flanking the residue of interest in a protein. Based on this observation, we have devised a method called the NCM that uses the correlation between codon usage and flanking residues in proteins. As a case study, we have analyzed the efficiency of reverse-translation using NCM performed on the set of predicted ORF of E. coli K12 and B. subtilis.
Correlation between codon choice and the flanking amino acid residues in the E. coli K12 genome
We analyzed the codon usage in the genomes of both E. coli K12 and B. subtilis and observed that the codon usage was not random but was to some extent dependent on the flanking codons. This dependency on flanking codons was reflected as a dependency on the flanking residues in proteins. For example, the codon GGC (Gly) encodes for 40.5% of all glycine residues present in E. coli (Supplementary Data). In the NCM, there are 400 possible theoretical combinations for any given codon. If the distribution of GGC were to be random, each of the combinations would span 0.25% (random probability = 0.0025) of the probability space. However, we observed that GGC is often flanked by branched chain aliphatic amino acids and hydrophobic amino acids. The 12 combinations (3% of total possible combinations) shown in Table 1 contribute almost 12% of total GGC usage in the genome, yielding a usage that is as much as four times the expected random usage.
Table 1.
Table showing strong distribution of the codon ‘GGC’ flanked by hydrophobic amino-acids (ILV)
Residue 1   Codon   Residue 2   Occurrence (Occ)   pX = p(R1-GGC-R2) = Occ/Total   pX/pRand
A           GGC     G           326                0.008134                        3.253
A           GGC     V           399                0.009956                        3.982
G           GGC     G           362                0.009033                        3.613
G           GGC     V           343                0.008559                        3.423
I           GGC     A           352                0.008783                        3.513
I           GGC     G           371                0.009257                        3.703
I           GGC     V           303                0.007561                        3.024
L           GGC     A           364                0.009083                        3.633
L           GGC     G           473                0.011803                        4.721
L           GGC     V           477                0.011902                        4.761
S           GGC     G           324                0.008085                        3.234
V           GGC     G           326                0.008134                        3.253
Occurrence in table denotes overall genomic occurrence of the combination. The pX denotes the occurrence probability of the combination X (occurrence/total occurrences of the codon GGC). The pRand denotes the random occurrence probability of the combination X (pRand = 1/400 = 0.0025). The pX/pRand denotes the ratio between observed and expected probabilities. These 12 combinations (out of 400) represent almost 12% of the total occurrences of GGC in the genome (expected = 3%), representing a skew in codon usage dependent on flanking residues.
Though the analysis of GGX shows that codon usage is non-random, the data discussed are specific to E. coli. Furthermore, glycine is encoded by only four codons and does not exhibit maximum degeneracy. To test these observations in multiple genomes and with a more degenerately encoded amino acid, we analyzed the codon usage for arginine in 30 genomes. Arginine is encoded by six codons and is, along with leucine and serine, among the most degenerately encoded amino acids. In our analysis, for each of the six codons (C), we generated 10 000 random distributions with flanking amino acid residues (R1–C–R2). Using these random distributions, z-scores and P-values for each observed combination were calculated. The calculated P-values were adjusted for both codon representation bias (for a given codon) and amino acid representation biases (across all codons for a given flanking pair) using the Bonferroni correction. The resultant values were screened using a stringent threshold of P < 0.0001. We observed that even after stringent corrections there were several combinations that had a non-random distribution. The results of these tests are given in Supplementary Data.
These tests prove that codon usage varies with a change in flanking amino acid residues. We therefore hypothesized that a method exploiting the flanking residue information will be more sensitive in detecting signals that are lost in the conventional method (CSM) for reverse translation.
Comparison and analysis of reverse translations using CSM and NCM
In order to compare the performance of CSM and NCM, we reverse translated all the proteins in two genomes, E. coli K12 and B. subtilis, using both methods. Identity of the reverse-translated sequences with the reference (original) ORFs was used to quantify the sensitivity of the methods. First, the distribution of percentage identities of nucleotide sequences reverse translated via NCM was significantly greater than that for CSM (P < 2.2 × 10−16; one-tailed KS test). A second assessment of performance using ratios of sequence identities (%ID_NCM/%ID_CSM) revealed a small yet statistically significant increase in average sequence identity (P < 2.2 × 10−16; one-tailed MW test, null hypothesis: ratio = 1). The average increase in sequence identity for all the sequences was ∼1%. We then grouped all sequence identities into bins of width 5% and tested which bins were significantly enriched in NCM over CSM. This revealed that NCM reverse translates a significantly larger number of protein sequences to nucleotide sequences with high identities of >80–85% (P-value: 4.5 × 10−15, Fisher's test and FDR correction; Figure 2A). At this sequence identity range, more than twice as many DNA sequences are predicted by NCM (239 sequences) as by CSM (103 sequences).
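The binning step that precedes the enrichment test can be illustrated with a few lines of Python. The bin convention used here (half-open intervals such as >80–85%, keyed by their lower edge), the function name and the toy identity values are assumptions for the example; the enrichment test itself is not reproduced.

```python
import math
from collections import Counter

def bin_identities(identities, width=5.0):
    """Count identities per bin (lower, lower + width], keyed by the lower edge."""
    counts = Counter()
    for value in identities:
        lower = math.ceil(value / width) * width - width
        counts[lower] += 1
    return counts

# Toy identity values in percent (illustrative, not results from the study).
ncm_ids = [81.2, 83.0, 79.5, 84.9]
csm_ids = [78.0, 80.0, 81.0, 76.5]
ncm_bins, csm_bins = bin_identities(ncm_ids), bin_identities(csm_ids)
for lower in sorted(set(ncm_bins) | set(csm_bins)):
    print(f">{lower:g}-{lower + 5:g}%", ncm_bins[lower], csm_bins[lower])
```

The per-bin counts for the two methods are the quantities that would then be passed to the enrichment test and FDR correction.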
Figure 2.
Percentage identity distributions of reverse translated ORF sequences in E. coli K12 and B. subtilis are shown in Panels A and B, respectively. This study compares the identities of reverse translations by CSM, NCM and c-NCM, revealing the distribution of percentage identities of reverse translated ORFs to the native ORFs. Two genomes, E. coli K12 and B. subtilis, are represented herein. Note that the NCM predictions are both qualitatively and quantitatively better and are also more numerous beyond 77% identity in both cases. This graph depicts the current limits of theoretical reverse-translation at ∼85% for all the methods. The improvement of c-NCM over CSM and NCM, especially in regions of higher sequence identity, is clearly visible and significant (F-test with FDR correction; P < 2.3 × 10−44).
With B. subtilis the overall numbers were lower, although we observed a similar trend. For NCM, the total number of sequences with more than 75% identity was 3670, whereas for CSM it was 3347 (Figure 2B). This represented an increase of ∼10% in NCM predictions over CSM predictions. A 1% increase in the median sequence identity was seen for the B. subtilis data set, as in E. coli. When we increased the threshold to 80%, 200 proteins were reverse translated by NCM but only 114 by CSM. Moreover, the average identity of sequences reverse translated by either method was lower in B. subtilis than in E. coli K12. On the whole, these results suggest a more random choice of codons in B. subtilis than in E. coli K12. These collective results underscore an important and fundamental distinction between the two groups of bacteria tested: increased randomness in the gram positive genome (B. subtilis) may be an indicator of its earlier evolutionary origin as compared to the gram negative (E. coli) genome (42).
In order to identify the CAI range within which NCM was effective, we compared the distribution of CAI values of genes whose ID_NCM/ID_CSM ratio was >1.01 with those whose ID_NCM/ID_CSM ratio was <0.99 (KS test, alternative hypothesis = CAI distribution of NCM is lesser than that of CSM; P < 1.0 × 10−6). These results show that NCM performs significantly better than CSM in regions of low CAI.
Comparison of CSM and NCM on various phyla in the bacterial kingdom
In the previous section, we discussed and compared the results of reverse translation using CSM and NCM in two divergent bacterial species. Despite the phylogenetic distance between the two species, both were eubacteria with moderate GC content. In order to show that the differences are real, we tested and compared the methods on 28 different bacterial genomes, each representing a major clade in the bacterial kingdom as listed in KEGG (43). This list included one each of various groups such as Archaebacteria, alpha-, beta-, gamma- and delta-proteobacteria, firmicutes, mollicutes, actinomyces, halo- and acido-bacteria, green sulfur and non-green sulfur bacteria, and cyanobacteria. The exhaustive list of organisms and a comparison of CSM and NCM in these genomes are given in Table 2, and Supplementary Data lists the minima, maxima, median, first and third quartiles for these methods for all the genomes. From Table 2, it is evident that NCM outperforms CSM not only in genomes with moderate GC content but also in all major bacterial clades.
Table 2.
Table comparing performance of CSM, NCM, c-CSM and c-NCM in 30 different clades of bacterial kingdom
Clade                    Organism                       Genome ID   CSM–cNCM   NCM–cNCM   cCSM–cNCM
Hyperthermophiles        Aquifex aeolicus               NC_000117   2.2E-16    2.2E-16    0.2643
Bacteroides              Bacteroides thetaiotaomicron   NC_000908   2.2E-16    2.2E-16    2.2E-16
Beta-proteobacteria      Bordetella pertussis           NC_000909   2.2E-16    2.2E-16    2.2E-16
Delta-proteobacteria     Myxococcus xanthus             NC_000918   2.2E-16    2.2E-16    2.2E-16
Epsilon-proteobacteria   Campylobacter jejuni           NC_000919   2.2E-16    2.2E-16    2.2E-16
Alpha-proteobacteria     Caulobacter crescentus         NC_000962   2.2E-16    2.2E-16    2.2E-16
Chlamydia                Chlamydia trachomatis          NC_000963   2.2E-16    2.2E-16    2.2E-16 (a)
Clostridia               Clostridium acetobutylicum     NC_001263   2.2E-16    2.2E-16    2.2E-16
Green-nonsulfur          Dehalococcoides ethenogenes    NC_002162   2.2E-16    2.2E-16    5.168E-08
Deinococcus              Deinococcus radiodurans        NC_002163   2.2E-16    2.2E-16    2.2E-16
Fusobacteria             Fusobacterium nucleatum        NC_002516   2.2E-16    2.2E-16    1.567E-08
Lactobacillales          Lactobacillus acidophilus      NC_002578   2.2E-16    2.2E-16    2.2E-16
Alpha-rhizobacteria      Mesorhizobium loti             NC_002678   2.2E-16    2.2E-16    2.2E-16
Euryarchaeota            Methanococcus jannaschii       NC_002696   2.2E-16    2.2E-16    0.2876
Euryarchaeota            Methanopyrus kandleri          NC_002754   2.2E-16    2.2E-16    2.936E-10
Actinobacteria           Mycobacterium bovis            NC_002929   2.2E-16    2.2E-16    2.2E-16
Actinobacteria           Mycobacterium tuberculosis     NC_002936   2.2E-16    2.2E-16    2.2E-16
Mollicutes               Mycoplasma genitalium          NC_002945   2.2E-16    2.2E-16    2.2E-16
Nanoarchaeota            Nanoarchaeum equitans          NC_003030   2.2E-16    2.2E-16    2.2E-16 (a)
Cyanobacteria            Prochlorococcus marinus        NC_003454   2.2E-16    2.2E-16    2.2E-16 (a)
Gamma-proteobacteria     Pseudomonas aeruginosa         NC_003551   2.2E-16    2.2E-16    0.01897
Alpha/Rickettsiae        Rickettsia prowazekii          NC_004663   2.2E-16    2.2E-16    2.2E-16
Crenarchaeota            Sulfolobus solfataricus        NC_005072   2.2E-16    2.2E-16    0.761
Cyanobacteria            Synechococcus elongatus        NC_005213   2.2E-16    2.2E-16    0.0011
Euryarchaeota            Thermoplasma acidophilum       NC_006576   2.2E-16    2.2E-16    2.2E-16
Spirochete               Treponema pallidum             NC_006814   2.2E-16    2.2E-16    2.2E-16 (a)
Mollicutes               Ureaplasma urealyticum         NC_008009   2.2E-16    2.2E-16    0.0455
Acidobacteria            Acidobacteria bacterium        NC_008095   2.2E-16    2.2E-16    5.471E-05
Magnetococcus            Magnetococcus MC-1             NC_008576   2.2E-16    2.2E-16    2.2E-16
(a) In these cases, c-CSM performed significantly better than c-NCM.
Improving the performance of NCM: the c-NCM
While NCM offered a better method to reverse-translate protein sequences, the overall improvement over CSM was apparent only at a higher sequence identity cut-off and for only a small fraction of the sequences. In order to improve the sensitivity of NCM, we developed a technique termed c-NCM, in which the same protein was reverse-translated n times and a consensus was derived from the resultant sequence set. Our tests with a random set of 1000 sequences derived from the E. coli K12 genome (Figure 3) demonstrated a drastic improvement from 1 cycle to 20 cycles. After 25 cycles, there was only a small improvement in prediction efficiency, which became insignificant beyond 50 cycles as compared to 100 cycles (KS test, alternative hypothesis = two-tailed: P-value = 0.2406). Moreover, our tests with a sample set of 100 sequences showed that there is no significant improvement in sequence identity between 100 and 1000 cycles (data not shown). Hence we chose to use 50 cycles for subsequent c-NCM-based studies. The results of c-NCM are summarized along with those for CSM and NCM in Figures 2A and 2B for E. coli K12 and B. subtilis, respectively. The average identity of reverse translated sequences increased by 4% with c-NCM when compared to the results from NCM. In summary, c-NCM reverse translated >75% of sequences with 80% identity or more, while the percentage of sequences scoring the same with NCM was <20% in both genomes. This difference is highly significant (P = 0 for >85–90% ID and 2.3 × 10−44 for >80–85% ID, Fisher's test and FDR correction). These results revealed that c-NCM is an effective method for reverse translation of protein sequences based on genomic usage matrices, and also indicate that the performance of c-NCM was significantly better than that of both NCM and CSM. As was the case for CSM and NCM, we tested c-NCM on all the 30 genomes (Table 2). The performance of c-NCM differed significantly from that of both NCM and CSM for all phyla.
Figure 3.
Standardization of iteration values for c-NCM. This figure illustrates the improvement in sensitivity as the number of iterations is increased in NCM. We performed c-NCM-based reverse translations for 1000 randomly chosen proteins using various iterations (5–100, in five steps) and compared the results with (A) predictions from NCM and (B) predictions from 100 iterations of c-NCM. It can be seen that the largest difference is between iteration values of 1 (NCM) and 50 (KS test, alternative = greater; P = 2.2 × 10−16). However, there is a small increase of sensitivity as the iterations are increased. The sensitivity difference was tested to 100 cycles and since there was no significant difference between 50 and 100 cycles (KS test, alternative = greater; P = 0.2406), we chose the threshold for c-NCM at 50 cycles.
Apart from testing c-NCM on different genomes, we were also interested in analyzing the effects of consensus-based improvement on the CSM method. Differences from the normal trend, if any, would allow us to discern genomes that have increased or decreased randomness in their codon usage. On the same set of 30 genomes, we performed c-CSM (50 cycles) and compared the results with those of c-NCM using a Wilcoxon rank sum test. The results in Table 2 show that in 70% (21 of 30) of the tested genomes, c-NCM performed better than c-CSM. In five cases, the difference between the two methods was insignificant: Aquifex aeolicus (hyperthermophile), Bordetella pertussis (beta-proteobacteria), Methanopyrus kandleri (euryarchaeota), Prochlorococcus marinus (cyanobacteria) and Acidobacteria bacterium (acidobacteria). In the four remaining cases, c-CSM performed significantly better than c-NCM: Rickettsia prowazekii (alpha-proteobacteria/Rickettsiae), Clostridium acetobutylicum (Clostridia), Fusobacterium nucleatum (Fusobacteria) and Lactobacillus acidophilus (Lactobacillales). These results, at least for P. marinus and M. kandleri, suggest that in archaeal and cyanobacterial genomes very little tricodon usage information is carried over to the protein level.
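The consensus step applied to CSM (c-CSM) can be illustrated with a minimal Python sketch. It assumes that each cycle draws codons at random in proportion to their usage and that the consensus keeps the most frequent codon at each position; the text does not spell out either detail, so both are assumptions. The `CODON_USAGE` values and all function names are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

# Hypothetical codon-usage frequencies (toy values, not taken from the paper).
CODON_USAGE = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.74, "AAG": 0.26},
    "F": {"TTT": 0.58, "TTC": 0.42},
}


def sample_reverse_translation(protein, usage, rng):
    """One stochastic cycle: draw each codon with probability equal to its usage."""
    codons = []
    for aa in protein:
        options = list(usage[aa].keys())
        weights = list(usage[aa].values())
        codons.append(rng.choices(options, weights=weights, k=1)[0])
    return codons


def consensus_reverse_translation(protein, usage, cycles=50, seed=0):
    """Repeat the stochastic cycle and keep the most frequent codon per position.

    The majority vote is an assumed consensus rule, not the published one.
    """
    rng = random.Random(seed)
    runs = [sample_reverse_translation(protein, usage, rng) for _ in range(cycles)]
    consensus = []
    for column in zip(*runs):  # all codons proposed for one residue position
        codon, _count = Counter(column).most_common(1)[0]
        consensus.append(codon)
    return "".join(consensus)


predicted = consensus_reverse_translation("MKF", CODON_USAGE, cycles=50)
print(predicted)
```

Under these assumptions the consensus drifts towards the most frequent codon as the number of cycles grows, which is in line with the small gain reported between 50 and 100 cycles.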
Application of reverse translation to an external genome: Salmonella typhi CT18
In the previous sections, we demonstrated that the improved method (c-NCM) performed significantly better than CSM and NCM. We hypothesized that NCM matrices created from one genome can be used for reverse translating protein sequences from a related genome. The S. typhi CT18 genome is 67% identical to the E. coli K12 genome at the DNA level, and hence was a good model system to test our hypothesis. Results from these comparisons showed significant differences in prediction quality between CSM and NCM. Again, as was seen in 21 other genomes, the use of c-NCM improved prediction quality, with average identity beyond 80%. There was only a very small difference in the average identities and their distribution between S. typhi CT18 (Figure 4) and E. coli K12 (Figure 2A). These observations confirmed that our method can be successfully applied to related genomes, suggesting increased fidelity in the design of degenerate primers for an organism whose gene sequence information is meager or non-existent. In such cases, the use of (c-)NCM matrices from a related organism is a viable alternative.
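The cross-genome evaluation described above can be sketched in Python for the simpler CSM baseline: a codon-usage table is built from ORFs of a donor genome, a protein from a related genome is reverse translated with the most frequent donor codon for each residue, and the prediction is scored by base-level identity with the reference ORF. The sequences below are toy data, the function names are hypothetical, and taking the single most frequent codon is an assumed reading of CSM; the NCM matrices themselves are not reproduced here.

```python
from collections import Counter, defaultdict

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_usage_table(orfs):
    """Count codons per amino acid across in-frame ORFs of the donor genome."""
    counts = defaultdict(Counter)
    for orf in orfs:
        for i in range(0, len(orf) - len(orf) % 3, 3):
            codon = orf[i:i + 3]
            counts[CODON_TO_AA[codon]][codon] += 1
    return counts


def reverse_translate(protein, usage):
    """CSM-style guess: the most frequent donor codon for each residue.

    Raises IndexError if a residue never occurs in the donor ORFs.
    """
    return "".join(usage[aa].most_common(1)[0][0] for aa in protein)


def percent_identity(predicted, reference):
    """Percentage of matching bases between two equal-length sequences."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have equal length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * matches / len(reference)


# Toy data: one donor ORF and one target protein with its reference ORF.
donor_orfs = ["ATGAAACTGAAACTGAAGTTA"]
target_protein = "MKL"
target_orf = "ATGAAGCTG"

usage = codon_usage_table(donor_orfs)
predicted = reverse_translate(target_protein, usage)
identity = percent_identity(predicted, target_orf)  # 8 of 9 bases match
print(predicted, round(identity, 1))
```

Averaging this identity over all proteins of the target genome gives the kind of summary statistic that the comparison between S. typhi CT18 and E. coli K12 relies on.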
Figure 4.
Reverse translation of S. typhi proteins using E. coli K12 matrices. S. typhi CT18 proteins were reverse translated using the codon usage and NCM matrices of the E. coli K12 genome. Analyses of identities with reference ORFs show that predictions using c-NCM are both qualitatively and quantitatively better than those using CSM (KS test: alternative = greater; P < 2.2 × 10⁻¹⁶). These results support the applicability of c-NCM in cases where genome sequence data and NCM matrices are not available for the organism of interest.
Throughout this work, we have concentrated on the applications of reverse translation in the design of degenerate PCR primers. However, these studies also reveal the underlying logic of codon usage in genes in general. Such knowledge will be essential for designing synthetic genes for use in artificial genetic systems, and it can also be used to adapt recombinant genes in a host-specific manner.