Predicting gene expression level from relative codon usage bias: an application to Escherichia coli genome.

Uttam Roymondal, Shibsankar Das, Satyabrata Sahoo.

Abstract

We present an expression measure of a gene, devised to predict the level of gene expression from relative codon bias (RCB). There are a number of measures currently in use that quantify codon usage in genes. Based on the hypothesis that gene expressivity and codon composition are strongly correlated, RCB has been defined to provide an intuitively meaningful measure of the extent of codon preference in a gene. We outline a simple approach to assess the strength of RCB (RCBS) in genes as a guide to their likely expression levels and illustrate this with an analysis of the Escherichia coli (E. coli) genome. Our efforts to quantitatively predict gene expression levels in E. coli met with a high level of success. Surprisingly, we observe a strong correlation between RCBS and protein length, indicating natural selection in favour of shorter genes being expressed at a higher level. The agreement of our results with high protein abundances, microarray data and radioactive data demonstrates that the genomic expression profile available in our method can be applied in a meaningful way to the study of cell physiology and also for more detailed studies of particular genes of interest.

Year:  2009        PMID: 19131380      PMCID: PMC2646356          DOI: 10.1093/dnares/dsn029

Source DB:  PubMed          Journal:  DNA Res        ISSN: 1340-2838            Impact factor:   4.458


Introduction

Regulation of gene expression plays a central role in defining cell fate and controlling organ formation. Genomic function can be understood at the nucleotide level, but the complexity and diversity of genomic function, which lead to an emergent picture of the genome as an interacting system with many degrees of freedom, bring experimental and theoretical challenges to the quantitative measurement of the biological state, many of which are statistical in nature. Genes encode proteins, and proteins perform functions in the cell. Hence a gene takes part in biological function only if it is expressed, i.e. the protein produced from it is present in the cell. Gene regulation takes place during transcription, the process by which the cell reads the information contained in a gene and copies it to the messenger RNA, which is subsequently used to make a functional protein. This is a most fundamental level of biological process, which involves the interaction of DNA and proteins. Its regulation takes place through the binding of proteins to DNA at specific loci in the vicinity of the gene to be regulated. The transcription of one gene may be enhanced or reduced by the expression of the gene itself. The process is complex and not yet understood completely. Genes with high expression levels include those required for an organism's viability, and the ability to identify these genes is crucial for drug development. The high cost and technical expertise required for expression experiments are an obstacle to many investigators who are interested in pursuing such studies. Although a variety of software tools and technologies have been developed for gene expression studies, a universal standard making these studies more suitable for comparative analysis and for inter-operability with other information sources is yet to emerge. Large-scale, high-throughput experimental methods require material and information processing systems to match.
The analysis of high-throughput gene expression data is in an early stage of development. The need to develop advanced technology for whole-genome expression studies is thus becoming increasingly recognized. Predicting the expression level of genes through computational methods is appealing because it circumvents expensive and difficult experiments. In recent years there have been an increasing number of reports[1-23,43,44] on predicted highly expressed genes in several micro-organisms, which provide a wealth of information about gene expression. It is suggested that the essential genes primarily include the ensembles of highly expressed genes that encode proteins [transcription/translational factors (TF), ribosomal proteins (RP), proteases and chaperones (CH), degradation, cellular localization, biosynthesis, metabolism, photosynthesis, respiration and glycolysis, etc.] vital for cell physiology. Perhaps the essential functions of these gene products correspond to a biased amino acid composition that might minimize the substantial biosynthesis energy costs, indicating the high biological significance of these genes. Besides other mechanisms, it is also suggested that codon bias can influence gene expression by optimization of the translational rate, and thus highly expressed genes can be characterized on the basis of biased codon usage compared with average genes. In several previous studies,[3,7-13,17] a number of different patterns of codon usage have been hypothesized and many indices have been proposed to measure the degree of codon bias.
Among these, the codon adaptation index (CAI) has been widely applied to the prediction of highly expressed genes in various organisms.[3,15,16,24-27] CAI was proposed as a measure of codon usage in a gene relative to that in a reference set of genes.[3] Previous studies suggest that CAI correlates better with the expression level of a gene than other codon usage indices, such as the effective number of codons,[7] codon bias index,[8] the frequency of optimal codons,[9] intrinsic codon bias index,[10] maximum likelihood codon bias,[11] synonymous codon bias orderliness,[12] and the measure independent of length and composition (MILC).[13] The parameters underlying the CAI model rely on the codon composition of only a limited set of highly expressed genes and are based on the fairly simple assumption that genes of certain functional classes are highly expressed. To define the parameters in the CAI model, Sharp and Li[3] considered the codon frequency of only 24 highly expressed genes, of which 50% were RP genes and the rest mostly metabolic enzymes. A related method, the codon usage model, is based on similar principles, but its parameters are based on a somewhat broader set of highly expressed genes. In applying this model, Karlin and coworkers[17-23] have shown that it is reasonable to assume that RP, CH and TF genes are highly expressed. Gene expressivity is strongly correlated with protein abundance. A number of studies have also revealed that codon composition in highly expressed genes is influenced by tRNA abundance.[1-6] Generally, highly expressed genes, producing abundant proteins, use a subset of optimal codons which are recognized by the most abundant tRNA species.
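To make the idea of "codon usage relative to a reference set" concrete, the following is a minimal sketch of a CAI-style calculation, assuming the usual formulation in which each codon receives a weight equal to its count in the reference genes divided by the count of the most used synonymous codon, and a gene's score is the geometric mean of those weights. The function names, the toy reference sequence and the `zero_count` value used for unobserved codons are illustrative assumptions, not the parameters of the 24-gene reference set of Sharp and Li.[3]

```python
import math
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def split_codons(seq):
    """Split a coding sequence into codons, dropping an incomplete trailing codon."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes, zero_count=0.5):
    """Weight of each codon relative to the most used synonymous codon.

    zero_count replaces a count of zero so that the weight stays positive;
    the value 0.5 is an assumption of this sketch.
    """
    counts = Counter(c for gene in reference_genes for c in split_codons(gene))
    weights = {}
    # Met, Trp and stop codons carry no synonymous choice and are skipped.
    for aa in set(AMINO) - set("*MW"):
        family = [c for c, a in CODON_TO_AA.items() if a == aa]
        adjusted = {c: counts[c] if counts[c] > 0 else zero_count for c in family}
        best = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / best
    return weights


def cai(gene, weights):
    """Geometric mean of the codon weights over the scorable codons of a gene."""
    logs = [math.log(weights[c]) for c in split_codons(gene) if c in weights]
    if not logs:
        raise ValueError("gene has no scorable codons")
    return math.exp(sum(logs) / len(logs))
```

With a hypothetical reference such as `["GCTGCTGCTGCC"]`, the alanine codon GCT receives weight 1 and GCC weight 1/3, so a gene using only GCT scores 1 while a gene mixing the two scores lower. Because the weights depend entirely on the chosen reference genes, the score inherits whatever assumptions went into that choice.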
It is well established that highly expressed genes show strongly biased usage of synonymous codons in favour of preferred codons, which are thought to be translated most efficiently by the most abundant tRNAs, and that lowly expressed genes have less biased codon usage patterns.[1,2] These observations strongly suggest that natural selection has shaped the codon usage pattern to accommodate optimal gene expression levels for most situations of an organism's habitat, energy sources and life cycle. Codon usage varies considerably within and between organisms. The effect of natural selection on codon usage reflects the level of gene expression. However, the resulting bias in codon usage has two main components. One is the correlation with tRNA availability and the other is the non-random choice between pyrimidines for the third base. A critical analysis of codon usage in a gene shows that mutational bias also plays a role in codon selection. Several studies have analysed the relationship between the GC-content of isochores and the expression patterns of the genes they contain.[28] The G + C composition resulting from mutational bias has been hypothesized to determine the major trends in codon usage of high or low G + C organisms. Within a genome, codon bias tends to be much stronger in highly expressed genes than in genes expressed at lower levels, suggesting that there might be some selective advantage to concentrating essential genes in GC-rich domains of the genome. Surprisingly, studies addressing this important issue have given conflicting results.[29-33] Several papers reported very weak correlations, either negative or positive, between GC-content and gene expression. The discrepancy among the studies might be due to the methods used to measure the expression parameter of the data sets analysed or to differences in the way correlations were computed.
In fact, the characterization of regulatory elements underlying gene expression is largely an unsolved problem. The hypothesis that codon usage modulates gene expression has been accepted in general. Many researchers in this field have formulated their own measures, which has led to a large number of available methods[3,7-12,17] for gene expressivity analysis. Unfortunately, these methods are not universally applicable, as they exhibit strong artefacts of their formulation with varying sequence length, overall codon bias or codon bias discrepancy. Our aim is to develop a measure that is free from such artefacts, and we attempt here to verify the usefulness of such a measure by employing it to predict gene expressivity in Escherichia coli (E. coli).

Materials and methods

The genome sequence for E. coli K-12 MG1655 is obtained from GenBank accession no. NC_000913. All ORFs (open reading frames) listed as coding for proteins (confirmed and hypothetical) are considered in this study. Our approach to estimating gene expression level is based on the codon usage difference of a gene with respect to the biased nucleotide composition at the three codon sites. Let $f(x,y,z)$ be the normalized codon frequency for the codon triplet (x,y,z) of a gene. Then the relative codon bias (RCB) of a codon triplet (x,y,z) in a gene is defined as
$$d_{xyz} = \frac{f(x,y,z) - f_1(x)\,f_2(y)\,f_3(z)}{f_1(x)\,f_2(y)\,f_3(z)} \qquad (1)$$
where $f_1(x)$ is the normalized frequency of x at the first codon position, $f_2(y)$ is the normalized frequency of y at the second codon position, and $f_3(z)$ is the normalized frequency of z at the third codon position of the gene. The frequencies $f_1$, $f_2$, $f_3$ are derived from the set of codons of the gene, and the frequencies are normalized over the gene length in codons, in an attempt to compensate for the expected increase of RCB with the total number of codons. We quantify the degree of codon bias of a gene in such a way that comparisons can be made both within and between genomes. As defined above, $d_{xyz}$ contains somewhat more quantitative information than other measures, since it considers codon usage as well as base compositional bias. Then the expression measure of a gene is
$$\mathrm{RCBS} = \left(\prod_{i=1}^{L}\left(1 + d_{xyz}^{\,i}\right)\right)^{1/L} - 1 \qquad (2)$$
where $d_{xyz}^{\,i}$ is the codon usage difference of the ith codon of the gene and L is the number of codons in the gene. RCB is thus the difference between the observed frequency of a codon and its expected frequency, divided by the expected frequency, where the expectation is taken under the hypothesis of random codon usage with the base composition at the three sites biased as in the sequence under study. RCBS is the overall score of a gene, indicating the influence of the RCB of each codon in the gene. Our analysis is based on the hypothesis that RCB reflects the level of gene expression. The expression measure of a gene in this approach is denoted by RCBS.
An RCBS value close to 0 indicates a lack of bias for the codons, and the score is thus useful for comparing different sets of genes.
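The definitions above can be sketched in a few lines of code. This is a minimal illustration rather than the authors' implementation: it takes RCB to be (observed − expected)/expected, with the expected codon frequency being the product of the three position-specific base frequencies of the gene, and takes RCBS to be the geometric mean of (1 + RCB) over the L codons of the gene, minus 1. The function names, the dropping of an incomplete trailing codon and the inclusion of every codon (start and stop codons are not filtered) are assumptions of the sketch.

```python
import math
from collections import Counter


def split_codons(seq):
    """Split a coding sequence into codons, dropping an incomplete trailing codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def rcb(seq):
    """Return {codon: relative codon bias} for every codon observed in the gene."""
    codons = split_codons(seq)
    n = len(codons)
    codon_counts = Counter(codons)
    # Base counts at codon positions 1, 2 and 3 of this gene.
    site_counts = [Counter(c[k] for c in codons) for k in range(3)]
    bias = {}
    for codon, count in codon_counts.items():
        observed = count / n
        expected = 1.0
        for k in range(3):
            expected *= site_counts[k][codon[k]] / n
        bias[codon] = (observed - expected) / expected
    return bias


def rcbs(seq):
    """Geometric mean of (1 + RCB) over all codons of the gene, minus 1."""
    codons = split_codons(seq)
    if not codons:
        raise ValueError("sequence has no complete codon")
    bias = rcb(seq)
    # 1 + RCB equals observed/expected, which is positive for observed codons,
    # so the logarithm is always defined.
    mean_log = sum(math.log1p(bias[c]) for c in codons) / len(codons)
    return math.exp(mean_log) - 1.0
```

Two small cases show the behaviour. For a gene made of one repeated codon, such as `"ATGATGATG"`, observed and expected frequencies coincide and the score is 0. For `"AAATTT"`, each codon has observed frequency 0.5 and expected frequency 0.125, so RCB is 3 for both codons and `rcbs` returns 3 up to floating-point rounding.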

Results

Our data set includes 4174 complete protein coding sequences from E. coli. Expression profiles of the genes are determined by calculating the score of RCB (RCBS value) for each gene, and their distribution is shown in Fig. 1. The majority of genes (63%) have RCBS values lying between 0.2 and 0.4, and the mean and median values are 0.3870 and 0.3295, respectively. Only ∼18% of genes have RCBS values >0.5. The analysis of RCBS values among different gene classes shows that the gene classes (RP, CH, TF) that serve as representatives of highly expressed genes have RCBS > 0.5 in most cases. This suggests that their significantly stronger codon bias is also a result of selection for translational efficiency. This finding is consistent with others,[3,17,18] as most previous expression measures have taken these classes as representative standards for highly expressed genes in their calculation. There is also experimental evidence in support of RP, CH and TF as standards for highly expressed genes, as it is observed that many RPs, augmented by abundant TF and CH proteins, are needed to assure properly translated, modified and folded protein products, which expedite and regulate cellular activities in most prokaryotic genomes. Our data support the proposition that each genome has evolved a codon usage pattern accommodating gene expression level, and that an RCBS value >0.5 indicates favourable codon usage. We therefore chose this index as an effective expression measure, on the basis that it correlates highly with expression levels; the predicted expression level based on RCBS (RCBS > 0.5) suggests that almost 18% of genes in the E. coli genome qualify as highly expressed genes.
In our study, the genes are segregated into different functional categories such as metabolism, information transfer, regulation, transport, cell process, cell structure, location of gene products, extra-chromosomal, DNA sites and cryptic genes in accordance with Munich Information Center for Protein Sequence (MIPS) classification. Functional analysis shows that highly expressed genes involved in the location of gene products are the largest functional class followed by genes involved in information transfer, metabolism, cell structure, cell process, extra-chromosomal, regulation and transport function, respectively. A total of 750 genes are identified as highly expressed genes in E. coli with 163 genes involved in energy metabolism, 75 genes involved in translation, 34 genes in transcription, and 29 in CH and folding (Supplementary Table SI). In addition, the functional class of amino acid biosynthesis, nucleotide biosynthesis, fatty acid biosynthesis and other cofactor and small molecule, etc includes 67 highly expressed genes. Besides, there are several (∼185) genes encoding predicted proteins and 15 other genes of unknown function, which are thought to be highly expressed genes in our approach. We observe that 24 genes encoding predicted proteins and 12 genes encoding proteins of unknown function are highly expressed genes with RCBS > 1.0. The highly expressed genes of E. coli with RCBS > 1.0 are reported in Supplementary Table SII (hypothetical protein or predicted protein genes are not listed). Of these, 11 encode proteins that function in energy metabolism, 18 are RP genes, 11 encode TF and the remaining encode proteins that function in different cell process.
Figure 1

Distribution of RCBS for all coding genes in the genome of E. coli.

In order to compare our results, we have also calculated CAI values for the same genes. Fig. 2 shows the relationship between RCBS and CAI values. Here, the CAI scores have been calculated according to the original publication of Sharp and Li,[3] using weights that stem from 24 highly expressed genes. It can be clearly seen that for genes with high CAI values (>0.5) there is a clear positive correlation between the two measures (r = 0.4614), but for genes with CAI values well below 0.3 the correlation is worse (r = −0.0572). The novel method of quantitatively predicting gene expressivity is then compared with the other widely accepted measure of Karlin and Mrázek.[17] In Fig. 3, we plot RCBS values against E(g) of Karlin et al.[18] The correlation is surprisingly good, with r = 0.6706, P < 0.001. We further analyse the relationship between the length of the coding regions and the expression level of genes. In Fig. 4 we plot RCBS as a function of gene length. We observe that shorter genes assume higher values of RCBS, while longer genes tend to have lower RCBS. There is a strong correlation between RCBS and gene length (r² = 0.65878 and χ² = 0.0149). This effect is not due to systematic bias of gene size. To investigate the effect of protein length on gene expression as measured by RCBS, the data are split into three groups: short (L < 150), intermediate (150 < L < 300) and long (L > 300). Several observations can be made. Genes are sorted according to their expression level. It should be noted that genes of the same expression level may have wide variation in length, and also that genes of the same length may have a wide range of RCBS. We observe that the estimate of expression level, as derived from RCBS, ranges from a low value to a high value for each of the three length groups.
It is evident from our data that RCBS ranges from 0.245 to 3.416 for L < 150, whereas it ranges from 0.123 to 0.907 for 150 < L < 300 and from 0.079 to 1.328 for L > 300. It is noted that the selective pressure on codon usage appears to be lower in genes encoding long rather than short proteins. Our studies, although less extensive, suggest that selection on codon usage as well as sequence composition is primarily responsible for RCBS. For a simple explanation, we select a set of E. coli sequences of equal length and randomize the above sequences 500 times, keeping their (i) codon usage; and (ii) sequence composition conserved. RCBS calculated for those sequences are found to vary in a wide range. We repeat the experiment on different sets of genes with varying length. The results are summarized in Supplementary Tables SIIA and SIIB. Supplementary Table SIIA describes the results of 14 arbitrary nucleotide sequences of different length, each randomized 500 times. In Supplementary Table SIIB, we present the results of the same experiment on a few selected genes of different length. We observe that the smaller sequences have a greater probability of resulting in high value of RCBS (>0.5), but there is nothing to prevent longer sequences from having high RCBS. Although the values for shorter sequences are more variable due to sampling effect, the intrinsic effect of gene length on RCBS reduces with the increase in length. A thorough exploration of theoretical values of RCBS suggests that RCBS can be an effective measure of gene expression, as its value depends on codon usage pattern along with DNA compositional bias of a gene.
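The composition-preserving part of such a randomization can be sketched as follows. This is an illustration under stated assumptions, not the authors' protocol: the input sequences are generated at random instead of being taken from E. coli, only nucleotide shuffling (which keeps sequence composition) is implemented, and RCBS is taken to be the geometric mean of observed over expected codon frequency, minus 1. The names `shuffled_rcbs` and `random_sequence` are hypothetical.

```python
import math
import random
from collections import Counter


def rcbs(seq):
    """RCBS taken as the geometric mean of observed/expected codon frequency, minus 1."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    n = len(codons)
    if n == 0:
        raise ValueError("sequence has no complete codon")
    codon_counts = Counter(codons)
    site_counts = [Counter(c[k] for c in codons) for k in range(3)]
    log_sum = 0.0
    for codon, count in codon_counts.items():
        expected = 1.0
        for k in range(3):
            expected *= site_counts[k][codon[k]] / n
        # Each of the `count` occurrences contributes log(observed / expected).
        log_sum += count * math.log((count / n) / expected)
    return math.exp(log_sum / n) - 1.0


def shuffled_rcbs(seq, n_shuffles=500, seed=0):
    """RCBS of n_shuffles nucleotide shuffles of seq (base composition is conserved)."""
    rng = random.Random(seed)
    bases = list(seq)
    values = []
    for _ in range(n_shuffles):
        rng.shuffle(bases)
        values.append(rcbs("".join(bases)))
    return values


def random_sequence(n_codons, seed=0):
    """Hypothetical stand-in for a real gene: uniform random bases, n_codons long."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=3 * n_codons))


if __name__ == "__main__":
    for n_codons in (50, 150, 500):
        values = shuffled_rcbs(random_sequence(n_codons), n_shuffles=100)
        print(n_codons, min(values), sum(values) / len(values), max(values))
```

Running the loop for several lengths gives the range and mean of RCBS per length group, which is the kind of summary the supplementary tables report. Real genes would replace `random_sequence`, and the shuffle count of 500 matches the number of randomizations described in the text.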
Figure 2

RCBS plotted against CAI for E. coli genes.

Figure 3

RCBS plotted against E(g)[18] for E. coli genes.

Figure 4

RCBS plotted against the length of 4174 genes from the E. coli genome.

In order to test RCBS as an expression level predictor, we chose to compare our results with experiments. We collected data sets (listed in Supplementary Tables SIII and SIV) consisting of mRNA or protein abundance data obtained by different methods, mostly cDNA microarrays[27,34,35] or 2D gel electrophoresis,[36-39] so that abundances of many E. coli proteins are available for comparison with the predicted levels of expression. In Fig. 5, we compare the predicted levels of expression in E. coli with 2D gel patterns[34] and with the expression measure E(g) of Karlin et al.[18] The RCBS values in Fig. 5 agree with the measured levels better than the findings of Karlin et al.[18] The correlation between expression level (as relative molecular abundance) and RCBS value is found to be 0.4533, whereas that with the E(g) value is 0.2618. Among the 20 most abundant proteins, 17 were identified as highly expressed genes, the three exceptions being metE, folA and ilvE. These results are in good agreement with those predicted by E(g). Among the 20 least abundant proteins, only three mismatch with our predicted results, whereas there are seven mismatches with the results of Karlin et al.[18] Although pck, nusB, valS, argS, rplL, thrS and leuS are less abundant according to 2D gel patterns, the high E(g) values of Karlin et al.[18] designate these genes as highly expressed, whereas our data support only nusB, valS and rplL as highly expressed genes. Of the remaining 55 proteins, 22 were identified as highly expressed genes. This agreement with molecular abundance data supports our predicted results better than others.
In a step forward we compare RCBS and the concentrations of various proteins in E. coli along with their CAI values[24] (Supplementary Table SIV). Concentration is expressed as the number of protein molecules per cell. Concentration being used as a measure of gene expression, we find that our result is surprisingly good. The RCBS values along with the CAI values are plotted against the logarithm of concentration in Fig. 6. The predicted gene expression level using RCBS value is found to correlate well with the protein concentration data[24] (r = 0.708211). The correlation is better than the quantitative measure of CAI (r = 0.615546). It suggests that a quantitative estimate of the expression level by RCBS values performs better than other indices of expression measure. Thus, regardless of the state of cell growth, one can measure the relative expression level for each gene under various growth conditions, different genetic states or over a time course during environmental change.
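A comparison of this kind reduces to a correlation between a per-gene score and the logarithm of a measured abundance. The sketch below assumes the reported r values are Pearson coefficients (the text does not name the coefficient) and uses base-10 logarithms; both function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def correlation_with_log_abundance(scores, molecules_per_cell):
    """Correlate an expression score with log10 of protein molecules per cell."""
    log_abundance = [math.log10(m) for m in molecules_per_cell]
    return pearson_r(scores, log_abundance)
```

Calling `correlation_with_log_abundance` once with RCBS values and once with CAI values for the same genes would give the two coefficients being compared. Taking the logarithm first means the coefficient measures how well the score tracks orders of magnitude of abundance, not absolute molecule counts.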
Figure 5

RCBS (+) and E(g) (*) plotted against relative molecular abundance of 96 genes from E. coli genome.[18] RMB denotes relative molecular abundance. X-axis is taken in logarithmic scale.

Figure 6

CAI (+) and RCBS (*) plotted against protein concentration of 45 genes from the E. coli genome.[24] X-axis is taken in logarithmic scale.

In Fig. 7 we plot radioactive data and microarray data against RCBS (Supplementary Table SV) for 117 genes identified by heat shock treatment.[35] Among these, 26 genes show a high (RCBS > 0.5), 84 genes a moderate (0.2 < RCBS < 0.5) and only seven genes a low (RCBS < 0.2) level of expression. Although the quality of experimental data seems to be a very important factor, we observe a good correlation between RCBS and the microarray and radioactive data (r_micro = 0.2415, r_radio = 0.2098).
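The three expression tiers used here can be written as a small classifier. This is a sketch: the cut-offs 0.5 and 0.2 come from the text, but how values exactly equal to a cut-off are assigned is not stated, so placing 0.5 in "moderate" and 0.2 in "low" is an assumption, as are the function names.

```python
from collections import Counter

HIGH_CUTOFF = 0.5  # RCBS above this is treated as highly expressed
LOW_CUTOFF = 0.2   # RCBS at or below this is treated as lowly expressed (assumed boundary)


def expression_class(rcbs_value):
    """Map an RCBS value to 'high', 'moderate' or 'low'."""
    if rcbs_value > HIGH_CUTOFF:
        return "high"
    if rcbs_value > LOW_CUTOFF:
        return "moderate"
    return "low"


def tally_classes(rcbs_by_gene):
    """Count genes per expression class from a {gene: RCBS} mapping."""
    return Counter(expression_class(value) for value in rcbs_by_gene.values())
```

For example, the RCBS of 1.77046 listed for rpmH in Table 1 falls in the "high" class, and the genome-wide median of 0.3295 falls in "moderate". Applying `tally_classes` to the RCBS values of a gene set gives per-class counts of the kind quoted for the heat shock genes.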
Figure 7

Radioactive data (+) and microarray data (*)[35] plotted against RCBS for E. coli genes. Y-axis is taken in logarithmic scale.

In another analysis we compared our expression measure (RCBS) with the genomic expression profiles of E. coli growing on rich (Luria broth plus glucose) and on minimal (glucose) culture medium (Supplementary Tables SVA and SVB).[34] Of the 76 genes expressed at significantly higher levels on Luria broth plus glucose medium, 54 genes show a high expression level in our expression measure, whereas only 12 of the 107 genes expressed on minimal glucose medium have a high level of expression. We observe that the correlation coefficient of the minimal culture data with RCBS (r = 0.3011) is good, but that for the Luria broth glucose data is very much worse. The agreement of predicted and actual protein expression levels varied greatly between all examined combinations of prediction method and data set. The discrepancy is thought to lie in the quality of experimental data. A preliminary analysis of the quality of experimental data shows that these kinds of experiments are inherently noisy and of low reproducibility. The reproducibility of microarray data can be evaluated through the computation of correlation coefficients within and among the data sets from different microarray experiments. Two data sets from different sources were chosen for analysis in this study. In the first, the data set was obtained from ExpressDB and the comparison made between expression levels in E. coli grown to either mid-log phase (LP) or stationary phase (SP). In the second, the data set was obtained from the ASAP database, where E. coli is cultured in lysogeny broth (LB).
It can clearly be seen that the pairwise correlation coefficients among the gene expression levels from different experiments (r_LP-SP = 0.52, r_LB-LP = 0.017, r_LB-SP = −0.039)[34] vary broadly, indicating the very noisy nature of microarray experiments and their lack of accuracy. The quality of experimental data seems to be a very important factor in this kind of analysis. Large variances may reduce the significance of statistical tests and might hide interesting trends in complex data. Microarray data tend to suffer from noise introduced at each step of different experimental protocols, while protein abundance data and mRNA expression levels do not agree well in all cases. The other probable reason for incoherent results is that prediction of gene expression from genomic data, based solely on codon usage, is oversimplified. Other factors, such as promoter strength and gene copy number, should also be taken into account. We now discuss our results in more detail for different functional classes of genes. The highly expressed genes are classified into different functional categories, e.g. RPs, CH and degradation proteins, transcription and TF, energy metabolism, electron transport, recombination and repair, outer membrane proteins, aminoacyl tRNA synthetases, etc. (The distribution of highly expressed genes of different functional classes in the genome of E. coli is displayed in Supplementary Table SI.) All but one RP, the major CH/degradation proteins and the translation/transcription processing factors attain high expression levels. Supplementary Table SII presents the 52 genes with the highest predicted expression levels in E. coli. The gene for the trp operon leader peptide, trpL, involved in amino acid (tryptophan) biosynthesis, attains the highest RCBS value, 3.42, among all E. coli genes.

RP genes

RPs are very important in cell biology as they provide a range of activities required for all steps of protein biosynthesis. Following the analysis based on the definition of RCBS and Equations (1) and (2), we observe that virtually all RP genes qualify as highly expressed genes. The genes encoding RPs, which are expected to be expressed at high levels during rapid cell growth, were identified with RCBS values >0.5 (Table 1). All but one RP gene in E. coli are predicted to be expressed at significantly higher levels; the only exception is rimK, encoding the RP S6 modification protein, which is thought to contribute to ribosome maturation and modification. The RCBS values for highly expressed RP genes range from 0.50 to 1.77. However, not all RP genes in E. coli reach the top expression level: seventeen out of 56 are among the 86 most highly expressed genes. The highest expression level occurs for L34, with an RCBS value of 1.77. The RPs are the major component, together with the ancillary proteins, involved in protein synthesis. The genes coding for RPs, protein synthesis factors and RNA polymerase subunits are all intermingled and organized into a small number of operons. We observe that the genes for some major translational or transcription processing factors, including tufA, tufB, fusA, fkpA, slyD, rpoB and rpoC, which are within or near the large RP operon, are predicted as highly expressed genes. Although RPs play an exclusive role in determining ribosome structure, several are multifunctional. The 50S ribosomal subunit proteins L1, L4 and L20 (encoded by rplA, rplD and rplT, respectively) and the 30S ribosomal subunit protein S8 (rpsH) have a regulatory role. The S1 gene, a giant RP gene (labelled as rpsA), is essential to E. coli and putatively contributes to the initiation of protein synthesis. S9 (rpsI) participates in certain repair activities, and S16 (rpsP) acts as an endonuclease.
Table 1

RCBS of the highly expressed genes of different functional class in the E. coli genome

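| Functional class | Gene | RCBS | Gene | RCBS | Gene | RCBS | Gene | RCBS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ribosomal | rplN | 0.50496 | rpsJ | 0.74635 | rplS | 0.87367 | rpmA | 1.08922 |
|  | rpsD | 0.56061 | rplX | 0.75111 | sra | 0.88011 | rpmC | 1.09439 |
|  | rpsS | 0.60728 | rpsF | 0.75859 | rplI | 0.90076 | rplO | 1.16165 |
|  | rpsM | 0.61255 | rplD | 0.76302 | rpmB | 0.90877 | rpsI | 1.24694 |
|  | rpsG | 0.62318 | rplM | 0.79227 | rpsN | 0.91121 | rpmG | 1.2494 |
|  | rplF | 0.62913 | rplC | 0.79299 | rplP | 0.92341 | rpsT | 1.24983 |
|  | rplE | 0.67119 | rplQ | 0.80176 | rpsP | 0.92858 | rplL | 1.3063 |
|  | rpsH | 0.67126 | rpsB | 0.80995 | rplY | 0.9446 | rplT | 1.3222 |
|  | rpsK | 0.67627 | rpsA | 0.81499 | rpsL | 0.95959 | rpsO | 1.32324 |
|  | rpsE | 0.7021 | rplJ | 0.82165 | rplW | 1.00068 | rpmJ | 1.49921 |
|  | rplB | 0.71682 | rpsC | 0.84223 | rpmD | 1.00368 | rpsU | 1.60846 |
|  | rplV | 0.7302 | rplK | 0.84341 | rpsQ | 1.03424 | rpmI | 1.66876 |
|  | rplR | 0.7344 | rplA | 0.84538 | rpmF | 1.04844 | rpmH | 1.77046 |
|  | rplU | 0.73917 | rpmE | 0.85618 | rpsR | 1.05606 |  |  |
| Translational | Efp | 0.70878 | raiA | 0.50131 | rrfE | 1.03184 | ssrS | 0.70761 |
|  | Ffs | 1.31636 | rrfA | 1.11799 | rrfF | 1.02752 | tsf | 0.85208 |
|  | Frr | 0.77909 | rrfB | 1.03184 | rrfG | 1.11995 | tufA | 0.94012 |
|  | fusA | 0.72335 | rrfC | 1.11995 | rrfH | 1.11995 | tufB | 0.86312 |
|  | infA | 0.7532 | rrfD | 1.11995 | rrlA | 1.06128 | yeiP | 0.52763 |
| Transcriptional | alpA | 0.64494 | glnB | 0.81972 | pspA | 0.71495 | rpoZ | 0.874 |
|  | chaB | 0.91144 | greA | 0.61192 | pspB | 0.77923 | sfsB | 0.66054 |
|  | Crl | 0.68275 | greB | 0.52545 | relB | 0.68232 | slmA | 0.53879 |
|  | cspA | 1.2802 | Hha | 0.88747 | relE | 0.54866 | soxR | 0.59593 |
|  | cspC | 1.12974 | Hns | 0.73934 | rof | 0.65143 | soxS | 0.60395 |
|  | cspE | 0.87402 | metJ | 0.5234 | rpoB | 0.53467 | suhB | 0.53095 |
|  | deaD | 0.62977 | nusB | 0.66651 | rpoC | 0.66692 | tdcR | 0.60661 |
|  | flgM | 0.58028 | nusG | 0.62894 | rpoD | 0.53475 | trpR | 0.6079 |
|  | flhC | 0.504 | osmE | 0.55743 | rpoH | 0.51287 |  |  |
| CH and folding | ccmD | 0.81384 | groL | 0.90549 | hybG | 0.62208 | secB | 0.66081 |
|  | dksA | 0.5747 | groS | 0.82021 | iscA | 0.66931 | skp | 0.85476 |
|  | dnaK | 0.65259 | hscB | 0.62877 | iscX | 0.73575 | slyD | 0.60592 |
|  | dsbA | 0.59085 | hslO | 0.51531 | lolA | 0.51362 | stpA | 0.74434 |
|  | fklB | 0.63123 | hslU | 0.49623 | narJ | 0.50787 | tig | 0.79986 |
|  | fkpA | 0.55943 | htpG | 0.5791 | ppiB | 0.65291 |  |  |
|  | fkpB | 0.51531 | hyaE | 0.56129 | ppiC | 0.70111 |  |  |
|  | fliT | 0.51569 | hybF | 0.51315 | rmf | 0.96923 |  |  |
| Outer membrane | csgA | 0.73214 | ompC | 1.03758 | slyB | 0.59077 | yqiG | 0.69853 |
|  | mipA | 0.52949 | ompF | 0.63223 | tsx | 0.58718 |  |  |
|  | nmpC | 0.51413 | ompX | 0.90683 | yddL | 0.57797 |  |  |
|  | ompA | 0.79079 | pagP | 0.50225 | yqhH | 0.53974 |  |  |
| Post-translational | rimI | 0.50362 | Def | 0.50521 | napD | 0.65324 | npr | 0.66442 |
| DNA repair/replication/recombination | cspD | 0.49781 | Hole | 0.70777 | ihfB | 0.58392 | rusA | 0.53058 |
|  | dinI | 0.66454 | hupA | 0.97108 | priC | 0.58088 | ssb | 0.71106 |
|  | dinJ | 0.57421 | hupB | 0.74465 | rdgC | 0.51482 | xseB | 0.865 |
|  | fis | 0.93575 | ihfA | 0.55962 | recA | 0.60858 | yebG | 0.59001 |
| RNA modification | rluB | 0.55764 | Pnp | 0.59733 | deaD | 0.62977 | rbfA | 0.72106 |
| DNA degradation | rusA | 0.53058 | xseB | 0.865 |  |  |  |  |
| Degradation of proteins/peptides/glycopeptides | hflC | 0.4998 | degP | 0.51382 | yhbO | 0.53736 | yajG | 0.55166 |
| Degradation of small molecules | Pta | 0.58128 | frwB | 0.57401 | tnaC | 1.33277 |  |  |
| Nucleoprotein and basic protein | Hfq | 0.51407 | Hns | 0.73934 | skp | 0.85476 | tpr | 1.29474 |
|  | dps | 0.55438 | stpA | 0.74434 | fis | 0.93575 |  |  |
|  | ihfB | 0.58392 | hupB | 0.74465 | hupA | 0.97108 |  |  |
| Aminoacyl tRNA synthetase | aspS | 0.52912 | lysS | 0.54138 | pheM | 2.38353 | valS | 0.52017 |
|  | ygjH | 0.5786 |  |  |  |  |  |  |
| Energy metabolism |  |  |  |  |  |  |  |  |
|   Glycolysis | eno | 0.99727 | gapA | 0.87498 | pfkA | 0.67783 | pykF | 0.62056 |
|  | fbaA | 0.7547 | gpmA | 0.65413 | pgk | 0.76595 | tpiA | 0.80293 |
|   TCA cycle | mdh | 0.55763 | sucB | 0.51856 | sucC | 0.50409 | sucD | 0.62233 |
|   Pentose phosphate pathway | talB | 0.58526 | tktA | 0.63261 |  |  |  |  |
|   ATP synthase | atpA | 0.64784 | atpC | 0.51365 | atpD | 0.64873 | atpE | 1.08527 |
|  | atpF | 0.60762 |  |  |  |  |  |  |
|   Pyruvate dehydrogenase | aceE | 0.57263 | aceF | 0.55269 | lpd | 0.56421 |  |  |
|   Aerobic respiration | cyoC | 0.53164 | hyaE | 0.56129 | nuoA | 0.54378 | nuoK | 0.61103 |
|  | cyoD | 0.61485 | nirD | 0.70885 | nuoI | 0.59343 |  |  |
|   Anaerobic respiration | frdC | 0.73468 | hybG | 0.62208 | menB | 0.60086 | pflB | 0.75126 |
|  | frdD | 0.72395 | hydN | 0.69364 | narH | 0.52986 | ubiC | 0.52458 |
|  | glpE | 0.54693 | hypA | 0.67865 | narJ | 0.50787 |  |  |
|  | hybF | 0.51315 | hypC | 0.56922 | yfiD | 0.87609 |  |  |
|   Electron transport | ackA | 0.61336 | Fdx | 0.61409 | fldA | 0.60624 | cybC | 0.56769 |
|   Flagellum biogenesis | flgB | 0.54626 | fliJ | 0.67522 | fliS | 0.52105 | fliT | 0.51569 |
|  | fliE | 0.66739 | fliQ | 0.5854 |  |  |  |  |
|   Transport of small molecules | nupC | 0.50273 | potC | 0.51092 | tsx | 0.58718 |  |  |
|   Salvage of nucleosides and nucleotides | Apt | 0.73291 | deoC | 0.63634 | upp | 0.51826 | hpt | 0.69492 |
|  | deoB | 0.55136 | deoD | 0.57449 | gpt | 0.56649 |  |  |
|   Central intermediary metabolism | citD | 0.59133 | folX | 0.51347 | gloA | 0.76667 | ulaD | 0.52297 |
|  | citE | 0.51485 | Mutt | 0.63455 | aspA | 0.52318 | gcvH | 0.72458 |
|  | fixX | 0.60213 |  |  |  |  |  |  |
|   Carbohydrate metabolism | eda | 0.62187 | gntK | 0.50361 | ulaB | 0.51605 | uxaC | 0.57269 |
|  | gatB | 0.53522 | Lpd | 0.56421 | ulaD | 0.52297 | uxuA | 0.59595 |
|  | paaB | 0.60215 |  |  |  |  |  |  |
|   Phosphorus metabolism | pstA | 0.51705 | pstS | 0.5871 | ppa | 0.6365 | psiF | 0.66563 |
|  | phnG | 0.5443 |  |  |  |  |  |  |
|   Nitrogen metabolism | cynS | 0.53274 | glnK | 0.65458 |  |  |  |  |
|   Sulphur metabolism | cysP | 0.51334 |  |  |  |  |  |  |
|   Amines metabolism | eutS | 0.57934 |  |  |  |  |  |  |
|   Amino acid biosynthesis | artM | 0.51962 | glnH | 0.54244 | ilvG | 1.32851 | metJ | 0.5234 |
|  | dapD | 0.51627 | glnP | 0.596 | ilvL | 1.51982 | pheL | 2.8411 |
|  | fliY | 0.51995 | glyA | 0.57258 | ilvM | 0.84298 | sdaC | 0.62785 |
|  | glnA | 0.5114 | hisL | 1.99822 | ivbL | 1.76046 | thrL | 1.7054 |
|  | glnB | 0.81972 | ilvC | 0.54397 | leuL | 1.93311 | trpL | 3.41556 |
|  | trpR | 0.60479 |  |  |  |  |  |  |
|   Fatty acid biosynthesis | accA | 0.57451 | dgkA | 0.55757 | fabI | 0.54893 | ymcE | 0.60055 |
|  | acpS | 0.55661 | fabA | 0.67664 | fabZ | 0.58465 |  |  |
|   Nucleotide biosynthesis | adk | 0.76156 | Ndk | 0.79214 | purC | 0.5899 | pyrL | 1.1651 |
|  | guaB | 0.58481 | purA | 0.53711 |  |  |  |  |
|   Cofactor and small molecule biosynthesis | gapA | 0.87498 | mioC | 0.50538 | moaE | 0.58446 | ubiC | 0.52458 |
|  | glyA | 0.57258 | moaC | 0.50171 | ribE | 0.59736 |  |  |
|  | menB | 0.60086 | moaD | 0.61154 | This | 0.78241 |  |  |
| Macromolecule biosynthesis | accB | 0.55326 | dgkA | 0.55757 | grxC | 0.79395 | mipA | 0.52949 |
|  | acpP | 0.82199 | fimA | 0.57714 | hipB | 0.62205 | nrdH | 0.66531 |
|  | ccmD | 0.81384 | glgS | 0.89234 | iscR | 0.50455 | pagP | 0.50225 |
|  | cybC | 0.56769 | grxA | 0.55662 | Lpp | 1.632 | trxA | 0.75124 |
|  | yfgJ | 0.72071 |  |  |  |  |  |  |
|   Inner membrane | ccmD | 0.81384 | metI | 0.53708 | yccF | 0.58505 | yidH | 0.53297 |
|  | cyoC | 0.53164 | mscL | 0.57954 | ydgC | 0.55456 | yiiR | 0.51556 |
|  | cyoD | 0.61485 | narH | 0.52986 | yeaL | 0.50064 | yijD | 0.50746 |
|  | dgkA | 0.55757 | nuoA | 0.54378 | yeaQ | 0.71217 | yjeO | 0.54162 |
|  | frdC | 0.73468 | nuoK | 0.61103 | ygdD | 0.62392 | yjeT | 0.68009 |
|  | frdD | 0.72395 | nupC | 0.50273 | yhdT | 0.74646 | yncH | 0.7111 |
|  | glnP | 0.596 | Pal | 0.86696 | yhhL | 0.62656 | ynfA | 0.60738 |
|  | lpp | 1.632 | yaaH | 0.7921 | yiaB | 0.65847 |  |  |
|  | mdtJ | 0.61263 | ybaN | 0.55105 | yiaW | 0.64364 |  |  |
|   Transport | yjdM | 0.76533 | glnH | 0.54244 | ptsH | 0.93025 | csgF | 0.54377 |
|  | yjgA | 0.5484 | glnP | 0.596 | potC | 0.51092 | secG | 0.75473 |
|  | fliY | 0.51995 | mscL | 0.57954 | pmrD | 0.5388 | mokC | 0.62148 |
|  | cyoC | 0.53164 | sugE | 0.51943 | yrbC | 0.54592 | yajC | 0.69682 |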
metI0.53708mdtI0.74374frwB0.57401tatA0.72924
metQ0.56475mdtJ0.61263fryB0.70188tatE0.71983
feoA0.76102chbA0.55214yedE0.50339cysP0.51334
gatB0.53522chbB0.65397ygaH0.5262npr0.66442
gspI0.54627nuoI0.59343yqaE1.13838sdaC0.62785
crr0.6849nupC0.50273marB0.61754
 RegulatorchpS0.57732csrC0.51672hipB0.62205yfeC0.5528
cpxP0.50596dsrA1.78721Spf1.34529yiaG0.51628
csgA0.73214dsrB0.75282sufE0.58559yifE0.54534
csrA0.83793feoC0.86637yddM0.5642yrbA0.62229
RCBS of the highly expressed genes of different functional classes in the E. coli genome

Genes for transcription/translation processing factors

There are ∼100 genes encoding enzymes, factors and structural components that make up the translational apparatus. Of these ∼100 genes, 75 are identified as highly expressed genes with RCBS values >0.5. Thus the majority of genes involved in translation are predicted to have a high expression level. Of these 75 highly expressed translational genes, 55 encode RPs. Highly expressed genes for transcription/translation processing factors are reported in Table 1 and can be compared with the data available.[18] There are ∼260 known genes that encode factors involved in translation and ribosome modification, including the initiation and elongation factors, 34 of which are indicated to be at a higher expression level. As with RPs, the genes coding for elongation factors (efp, yeiP, fusA, tsf, tufA, tufB), ribosome recycling factor (frr) and translation initiation factor (infA) register as highly expressed genes which play important roles in translation. The expression level of infB, the fused protein chain initiation factor, is moderately high (RCBS = 0.49017). The regulation of infB, which lies downstream of and is co-transcribed with the moderately expressed TF gene nusA (RCBS = 0.46579), is complex and is thought to result from autoregulation by nusA of the extent of readthrough at upstream terminators. The expression level of infB is higher than that of nusA. The elongation factor efp has been shown to be essential in E. coli for protein synthesis and viability. The expression levels of the other elongation factors (fusA, tsf, tufA, tufB) are progressively higher. Interestingly, the regulation of tufB is partially dependent upon the fis gene, which encodes a global DNA-binding transcriptional regulator, and fis has a significantly higher expression level (RCBS = 0.93575). Small RNA molecules are very important in cell biology and can regulate translation. 
It is found that the genes coding for the 5S RNAs (rrfA, rrfB, rrfC, rrfD, rrfE, rrfF, rrfG, rrfH) and 23S RNA (rrlA) have distinctive RCBS values >1.0. Gene expression is controlled by a regulator that interacts with a specific sequence of a target RNA. Ffs, coding for the 4.5S sRNA component of the signal recognition particle, works with the ffh protein (RCBS = 0.3524) and is involved in co-translational protein translocation into and possibly through membranes. SsrS, coding for 6S sRNA, inhibits RNA polymerase promoter binding. It acts as a template for RNA-directed pRNA synthesis by RNAP and mimics an open promoter. RaiA codes for a cold shock protein associated with the 30S ribosomal subunit. Ffs, ssrS and raiA, which are involved in the translational process, are predicted to be highly expressed genes in our approach. Moreover, we identify four other genes which are involved in the post-translational process and are expressed at a higher level. These are rimI, coding for the acetylase for 30S ribosomal subunit protein S18, def, coding for peptide deformylase, hypC, coding for a protein required for maturation of hydrogenases 1 and 3, napD, coding for the assembly protein for periplasmic nitrate reductase, and npr, coding for the phosphohistidinoprotein-hexose phosphotransferase component of the N-regulated phosphotransferase system (PTS). Transcription is the first stage in gene expression and the principal step at which it is controlled. The gene for the major cold shock protein (cspA) attains a significantly high expression level (RCBS = 1.28). The gene cspA is a regulator needed for adaptation to atypical conditions and responds to temperature stimulus. CspC, coding for another stress protein and a member of the cspA family, is also a highly expressed gene. Among other genes involved in the transcription process, RNA polymerase plays a vital role. RNA synthesis is catalysed by the enzyme RNA polymerase. Transcription starts when RNA polymerase binds to the promoter. 
Among the DNA-directed RNA polymerase subunits in E. coli, rpoB, rpoC, rpoD, rpoH and rpoZ qualify as highly expressed. RNA polymerase must be able to handle situations in which transcription is blocked, e.g. when DNA is damaged. In the case of E. coli RNA polymerase, the proteins greA and greB, which are predicted to have a high expression level, release the polymerase from elongation arrest. Rho, the transcription termination factor, attains a moderate expression level (RCBS = 0.4749). Termination and anti-termination are closely connected and involve proteins that interact with RNA polymerase. Anti-termination is used as a control mechanism and controls the ability of the enzyme to read past a terminator into genes lying beyond. The nus loci code for proteins that form part of the transcription apparatus. The nusA, nusB and nusG functions are concerned solely with the termination of transcription. Transcription anti-termination protein (nusB) and transcription termination factor (nusG) have high expression levels. NusB is required for rho-dependent terminators, whereas nusG may be concerned with the general assembly of all the nus factors into a complex with RNA polymerase. NusA, required for intrinsic terminators, has a moderate expression level (RCBS = 0.4658).
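The RCBS > 0.5 criterion used throughout this section amounts to a simple partition of scored genes. As a minimal sketch, the snippet below applies that cut-off to RCBS values quoted in the text above; the function name `split_by_threshold` and the variable names are illustrative and do not come from the paper.

```python
# Cut-off stated in the text: RCBS > 0.5 marks a gene as highly expressed.
HIGH_EXPRESSION_THRESHOLD = 0.5

# RCBS values quoted in this section for individual genes.
rcbs_values = {
    "fis": 0.93575,
    "cspA": 1.28,
    "infB": 0.49017,
    "rho": 0.4749,
    "nusA": 0.46579,
    "ffh": 0.3524,
}


def split_by_threshold(scores, threshold=HIGH_EXPRESSION_THRESHOLD):
    """Return (high, other): gene names ranked by descending RCBS.

    `high` holds genes with RCBS strictly above the threshold,
    `other` holds the remaining genes.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    high = [gene for gene in ranked if scores[gene] > threshold]
    other = [gene for gene in ranked if scores[gene] <= threshold]
    return high, other


high, other = split_by_threshold(rcbs_values)
print("highly expressed:", high)
print("at or below threshold:", other)
```

With these inputs only fis and cspA pass the cut-off, while infB, rho and nusA fall just below it, matching their description as moderately expressed.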

CH/degradation protein genes

CH/degradation proteins are vital in cell physiology. CHs are proteins that assist the non-covalent folding/unfolding and assembly/disassembly of other macromolecular structures. One major function of CHs is to prevent both newly synthesized polypeptide chains and assembled subunits from aggregating into non-functional structures. Many CHs are heat shock proteins, that is, proteins expressed in response to elevated temperatures or other cellular stresses. The reason for this behaviour is that protein folding is severely affected by heat and, therefore, some CHs act to repair the potential damage caused by misfolding. Other CHs are involved in folding newly made proteins as they are extruded from the ribosome. Although most newly synthesized proteins can fold in the absence of CHs, a minority strictly requires them. DnaK (Hsp70), perhaps the best characterized CH in E. coli, is identified as a highly expressed gene. The Hsp70 proteins are aided by Hsp40 proteins (DnaJ in E. coli), which increase the ATP (adenosine triphosphate) consumption rate and activity of the Hsp70s. However, dnaJ has a low expression level (RCBS = 0.3988). It has been noted that increased expression of Hsp70 proteins in the cell results in a decreased tendency towards apoptosis. Although a precise mechanistic understanding has yet to be determined, it is known that Hsp70s have a high-affinity state for unfolded proteins when bound to ADP (adenosine diphosphate), and a low-affinity state when bound to ATP. It is thought that many Hsp70s crowd around an unfolded substrate, stabilizing it and preventing aggregation until the unfolded molecule folds properly, at which time the Hsp70s lose affinity for the molecule and diffuse away. Other highly expressed heat shock proteins are groS, groL, hslO (Hsp33) and htpG (Hsp90). GroS and groL encode the small and large subunits of GroESL, respectively. This is the best characterized heat shock protein complex in E. coli, and both genes are identified as highly expressed. HtpG in E. coli is the least well-understood CH. Hsp90, a molecular CH, might be essential for activating many signalling proteins in the eukaryotic cell and is necessary for viability in eukaryotes. Since it is predicted to be a highly expressed gene, it is possibly necessary for prokaryotes as well. Protein degradation plays an important role in the cell cycle, in signal transduction and in maintaining the integrity of the properly folded state of a protein. Out of 100 genes involved in macromolecular degradation, only six qualify as highly expressed genes. In Table 1, the predicted expression levels of highly expressed degradation genes are reported. Among these, xseB (exonuclease VII small subunit) and rusA (DLP12 prophage, endonuclease RUS) encode enzymes which regulate the degradation of DNA. These are also involved in DNA repair activity. Pnp and csrA are the only two genes involved in RNA degradation that qualify as highly expressed. Pnp, polynucleotide phosphorylase/polyadenylase, is fundamental in RNA processing. Polyadenylation plays an important role in initiating degradation of some RNAs. Triple mutations that remove Pnp have a strong effect on stability. Poly(A) polymerase may create a poly(A) tail that acts as a binding site for the nucleases. DegP, serine endoprotease (protease Do), is an enzyme involved in protein and peptide degradation and is predicted to be required for global protein degradation. It responds to temperature stimulus. YajG, a predicted lipoprotein, and yhbO, a predicted intracellular protease, are thought to be involved in degradation of proteins and polysaccharides.

Aminoacyl tRNA synthetases and modification genes

There are 37 genes encoding the tRNA synthetases and other enzymes involved in tRNA modification. Results are reported in Table 1. Compared with the 19 PHX genes predicted by Karlin et al.,[18] only three genes register as highly expressed genes in our expression measure. These are aspartyl tRNA synthetase (aspS), lysine tRNA synthetase (lysS) and valyl tRNA synthetase (valS). The gene encoding glycine tRNA synthetase (glyS) is also marginally predicted to be a highly expressed gene, with RCBS = 0.4974. Among the other tRNA synthetase genes, pheS, glyQ, glnS, leuS, serS, proS, tyrS, gltX and metG have moderate expression levels. PheM, the phenylalanyl tRNA synthetase operon leader peptide, registers a high RCB score with RCBS = 2.1835.

Outer membrane protein

There are ∼13 highly expressed genes encoding outer membrane proteins, as predicted by our expression measure. The expression levels of these genes are displayed in Table 1. These include outer membrane proteins (ompA, ompC, ompF, ompX), outer membrane lipoprotein (slyB), truncated outer membrane porin (nmpC), palmitoyl transferase for Lipid A (pagP), scaffolding protein for the murein synthesizing machinery (mipA) and tsx. Moreover, yqiG, a predicted outer membrane usher protein, yqhH, a predicted outer membrane lipoprotein, and yddL, a predicted outer membrane protein, are predicted to be highly expressed genes in our analysis.

Inner membrane protein

Among the genes encoding inner membrane proteins, murein lipoprotein (lpp) has the highest expression level (RCBS = 1.6320). Apart from the conserved inner membrane proteins, 34 inner membrane protein genes are listed in Table 1 as highly expressed genes. There are ∼83 conserved inner membrane proteins in the E. coli genome. Of those, 17 are predicted to be highly expressed genes (Supplementary Table SVII).

Amino acid biosynthesis

Overall, 20 of the 255 amino acid biosynthesis genes are expressed at a higher level. The genes artM, an arginine transporter subunit, fliY, a cystine transporter subunit, and glnH and glnP, the glutamine transporter subunits, are predicted to be expressed at higher levels. The glnA gene, which encodes glutamine synthetase, and glnB, which encodes the regulatory protein for glutamine synthetase, are expressed at higher levels. Interestingly, hisL, the his operon leader peptide; ilvL, the ilvG operon leader peptide; ivbL, the ilvB operon leader peptide; leuL, the leu operon leader peptide; pheL, the pheA gene leader peptide; thrL, the thr operon leader peptide; and trpL, the trp operon leader peptide, are expressed at higher levels. The monocistronic gene ilvC, which is depressed exclusively by valine, has a high expression score. The dapD product, 2,3,4,5-tetrahydropyridine-2-carboxylate N-succinyltransferase, an enzyme of the lysine biosynthesis process via diaminopimelate, has a high expression level.

Nucleotide biosynthesis

According to the MIPS classification, ∼31 genes encode enzymes for nucleotide biosynthesis. In our study, we observe that five genes, namely purA, purC, adk, ndk and guaB, which encode enzymes involved in purine ribonucleotide biosynthesis, and pyrL, the pyrBI operon leader peptide for pyrimidine ribonucleotide biosynthesis, are highly expressed genes. PyrL has a significantly high expression level, with RCBS = 1.16.

Genes for energy metabolism and metabolism of carbon compounds

Of the 392 genes involved in the metabolism of carbon compounds, 39 genes have a significantly high expression level. Of those, 27 are involved in carbohydrate metabolism, 10 are involved in amino acid metabolism, and two are involved in amines metabolism. Lpd is involved in both carbohydrate and amino acid metabolism. The remaining one is involved in the metabolism of other carbon compounds. No genes involved in fatty acid metabolism attain a high expression level, but seven of the 27 genes involved in fatty acid biosynthesis have a significantly high expression level. The data presented here indicate that accA, which encodes one component of acetyl coenzyme A (acetyl-CoA) carboxylase, is a highly expressed gene. In addition, ymcE, which is a cold shock protein, and acpS also attain a high expression level. Although little is known about the fab genes beyond the FadR activation of fabA, we predict that some of the fab genes (fabA, fabI, fabZ) have a significant expression level. This is consistent with the genomic expression profiling obtained from the DNA microarray analysis of Tao et al.[34]

Energy metabolism genes

The genes involved in energy metabolism are primarily divided into four groups: glycolysis, pyruvate dehydrogenase, the pentose phosphate pathway and the TCA cycle. Of the 1530 genes that are involved in energy metabolism, 163 are predicted to be highly expressed genes in our approach. The two basic metabolic pathways, glycolysis and the TCA cycle, involve eight and four highly expressed genes, respectively, whereas the genes in glycolysis and pyruvate metabolism are predominantly highly expressed genes. These include the genes eno, fbaA, gapA, gpmA, pfkA, pykF, tpiA and pgk. Unlike in Karlin et al., the genes for the proteins involved in the initial steps of glycolysis (pgi, coding for glucosephosphate isomerase) and in the initial steps of the TCA cycle (gltA, citrate synthase) are not highly expressed genes in our observation. Besides those of the TCA cycle, pyruvate dehydrogenase and glycolysis, the E. coli genome has several highly expressed genes of anaerobic and aerobic respiration. Among the NADH dehydrogenase nuo complex, nuoA, nuoI and nuoK are highly expressed genes. Genes encoding the α, β and ε subunits of the F1 sector and the b and c subunits of the F0 sector of membrane-bound ATP synthase are predicted to be highly expressed genes. With respect to electron transport, flavodoxin 1 (fldA) and cytochrome o ubiquinol oxidase subunit III (cyoC) are highly expressed genes, with RCBS values of 0.6062 and 0.5316, respectively. In addition, cytochrome c biogenesis protein (ccmD) and cytochrome o ubiquinol oxidase subunit IV (cyoD) also register a high expression level in our approach. In marked contrast to Karlin et al., we find that E. coli has six highly expressed flagellar genes: flgB, fliE, fliJ, fliQ, fliS and fliT. The flagellum secretion apparatus may be viewed as part of the CH family essential for bacterial viability. Assembly of a flagellum requires the export of protein subunits to the outer surface of the cell. 
Recent evidence indicates that flagellum regulon can also influence bacterium–host interactions independent of motility.

Fatty acid biosynthesis

Fatty acid metabolism is crucial because it not only provides the various fatty acids and phospholipids necessary for cell growth, but also serves as a source of precursors for the biosynthesis of secondary metabolites. The highly expressed genes involved in fatty acid biosynthesis include genes encoding beta-hydroxydecanoyl thioester dehydrase (fabA), NADH-dependent enoyl-[acyl-carrier-protein] reductase (fabI), (3R)-hydroxymyristoyl acyl carrier protein dehydratase (fabZ) and holo-[acyl-carrier-protein] synthase 1 (acpS), as well as accA and the cold shock gene ymcE. In addition, 3-oxoacyl-[acyl-carrier-protein] synthase I (fabB) has a moderately high RCBS value (RCBS = 0.4954).

Central intermediary metabolism

Several highly expressed genes in this functional class are also involved in carbohydrate metabolism. Besides other genes in this class which are also involved in nitrogen metabolism, phosphorus metabolism, amino acid metabolism, etc., our analysis identified as highly expressed the key genes involved in central intermediary metabolism, encoding aspartate ammonia-lyase (aspA), citrate lyase (citD, citE), glycine cleavage complex lipoylprotein (gcvH), Ni-dependent glyoxalase I (gloA), 3-keto-L-gulonate 6-phosphate decarboxylase (ulaD), D-erythro-7,8-dihydroneopterin triphosphate 2′-epimerase and dihydroneopterin aldolase (folX) and nucleoside triphosphate pyrophosphohydrolase (mutT). FixX, a 4Fe-4S ferredoxin-type protein, is also registered as a highly expressed gene predicted to be involved in central intermediary metabolism.

Genomic repair proteins

An event that introduces a deviation from the usual double-helical structure of DNA is a threat to the genetic constitution of the cell. The repair system is thus very important for the survival of the cell. It can recognize a range of distortions in DNA as signals for action, and the cell is likely to have several systems able to deal with DNA damage. Table 1 reports the highly expressed repair proteins in the E. coli genome. Other repair proteins have low to moderate expression levels. Of the 51 genes involved in DNA repair, only six reach a high expression level. The principal pathway for recombination repair in E. coli is identified by the rec genes. recA, predicted to be a highly expressed gene in our approach, is not only involved in recombination–repair activities, but also has another quite distinct function. It can be activated by many treatments that damage DNA or inhibit replication in E. coli. This causes it to trigger a complex series of phenotypic changes called the SOS response, which involves the expression of many genes whose products include repair functions. The other highly expressed repair genes in E. coli are xseB, dinI, yebG, dinJ and rusA. DinI, DNA damage-inducible protein I, and dinJ, the predicted antitoxin of the YafQ–DinJ toxin–antitoxin system, act on damaged DNA and are involved in its repair. YebG, a conserved protein regulated by LexA, functions in DNA repair.

Regulatory protein

About 440 genes in E. coli encode regulatory proteins. Among these regulatory proteins, 62 genes are predicted to be highly expressed genes. Several of the genes in this class also function in translation, transcription, DNA repair, replication/recombination, cell process, etc. The predicted expression levels of several other highly expressed genes of specific regulatory proteins are listed in Table 1.

Biosynthesis of vitamins, cofactors and small molecules

Vitamin biosynthesis proteins have largely low expression levels. Only ribE, riboflavin synthetase, is highly expressed. This is in contrast to the result of Karlin et al.[18] Pathways for the synthesis of vitamins, of which only small amounts are generally needed to achieve adequate function, record low RCBS values ranging from 0.1801 to 0.5974. Some of the enzymes that utilize the vitamins as cofactors are highly expressed; e.g. accB, the BCCP subunit of acetyl-CoA carboxylase of E. coli, is registered as a highly expressed gene in our approach, with RCBS = 0.5533. The expression levels of the 10 highly expressed genes involved in the biosynthesis of cofactors and small molecules are listed in Table 1.

Biosynthesis of other macromolecules

Among the genes encoding proteins for macromolecular biosynthesis, lpp attains a significantly high RCBS value (RCBS = 1.6320). In addition, other highly expressed genes involved in macromolecular biosynthesis are the major type 1 subunit fimbrin (fimA), a DNA-binding transcriptional repressor (iscR) and truncated cytochrome b562 (cybC). GlgS, a predicted glycogen synthesis protein, and yfgJ, another predicted protein thought to be involved in macromolecular biosynthesis, also attain the score of high expression level. Of the 39 cryptic genes in E. coli analysed in our model, only three register as highly expressed genes. Those are csgA, a cryptic curlin major subunit which is involved in glycoprotein biosynthesis, mokC, a regulatory protein of hokC, and gspI, a putative transport protein. The expression levels of these genes are 0.73, 0.62 and 0.55, respectively. Among the genes induced under starvation conditions, only dps, an Fe-binding and storage protein which provides DNA protection during starvation (RCBS = 0.5544), and rpoH, the RNA polymerase sigma 32 (sigma H) factor (RCBS = 0.5129), are predicted as highly expressed genes, in agreement with Karlin et al.[18] Other starvation protein genes [otsA (RCBS = 0.2349), otsB (RCBS = 0.2700), rpoE (RCBS = 0.2781), rpoN (RCBS = 0.2486), rpoS (RCBS = 0.4093), katE (RCBS = 0.2359), surA (RCBS = 0.3936), bolA (RCBS = 0.4342)] have low to moderate expression levels. The survival protein surA, which is registered as PHX with E(g) = 1.10, does not qualify as a highly expressed gene in our approach. In addition, we observe that a number of genes encoding prophages are recorded as highly expressed genes in our analysis. A phage DNA molecule is often integrated into the DNA molecule of a bacterium, forming a prophage. A list of highly expressed genes encoding different prophages in E. coli is displayed in Table 2.
Table 2

Predicted expression levels of highly expressed prophage genes

| Gene | Description | RCBS |
| --- | --- | --- |
| yeeT | CP4-44 prophage; predicted protein | 0.76113 |
| alpA | CP4-57 prophage; DNA-binding transcriptional activator | 0.64494 |
| ypjK | CP4-57 prophage; predicted inner membrane protein | 0.7551 |
| yfjU | CP4-57 prophage; predicted inner membrane protein | 1.07646 |
| yfjM | CP4-57 prophage; predicted protein | 0.56069 |
| yafW | CP4-6 prophage; antitoxin of the YkfI–YafW toxin–antitoxin system | 0.54248 |
| tfaS | CPS-53 (KpLE1) prophage; conserved protein | 0.60714 |
| yfdT | CPS-53 (KpLE1) prophage; predicted protein | 0.54524 |
| yfdS | CPS-53 (KpLE1) prophage; predicted protein | 0.59437 |
| yffM | CPZ-55 prophage; predicted protein | 0.72955 |
| ninE | DLP12 prophage; conserved protein | 0.61069 |
| rusA | DLP12 prophage; endonuclease RUS | 0.53058 |
| emrE | DLP12 prophage; multidrug resistance protein | 0.65874 |
| borD | DLP12 prophage; predicted lipoprotein | 0.50128 |
| rzoD | DLP12 prophage; predicted lipoprotein | 0.98537 |
| essD | DLP12 prophage; predicted phage lysis protein | 0.77232 |
| ybcO | DLP12 prophage; predicted protein | 0.56517 |
| ybcW | DLP12 prophage; predicted protein | 0.67154 |
| ylcG | DLP12 prophage; predicted protein | 1.05554 |
| yciH | e14 prophage; 5-methylcytosine-specific restriction endonuclease B | 0.67815 |
| yciX | e14 prophage; predicted DNA-binding transcriptional regulator | 0.79718 |
| yciO | e14 prophage; predicted inner membrane protein | 0.50282 |
| rluB | e14 prophage; predicted integrase | 0.55764 |
| ymiA | e14 prophage; predicted protein | 1.3517 |
| ylcH | hypothetical protein, DLP12 prophage | 1.56134 |
| insM | KpLE2 phage-like element; iron-dicitrate transporter subunit | 0.6455 |
| insA | KpLE2 phage-like element; IS1 repressor protein InsA | 0.52239 |
| yqiG | KpLE2 phage-like element; IS2 insertion element repressor InsA | 0.69853 |
| yjhD | KpLE2 phage-like element; IS30 transposase | 0.6955 |
| relB | Qin prophage; bifunctional antitoxin of the RelE–RelB toxin–antitoxin system/transcriptional repressor | 0.68232 |
| dicB | Qin prophage; cell division inhibition protein | 0.66801 |
| cspB | Qin prophage; cold shock protein | 0.52261 |
| cspF | Qin prophage; cold shock protein | 0.5891 |
| cspI | Qin prophage; cold shock protein | 0.80085 |
| dicC | Qin prophage; DNA-binding transcriptional regulator for DicB | 0.69275 |
| ydfK | Qin prophage; predicted DNA-binding transcriptional regulator | 0.50987 |
| ynfN | Qin prophage; predicted protein | 0.69704 |
| gnsB | Qin prophage; predicted protein | 0.82038 |
| ydfD | Qin prophage; predicted protein | 0.83742 |
| ydfA | Qin prophage; predicted protein | 0.95351 |
| ydfB | Qin prophage; predicted protein | 1.34218 |
| essQ | Qin prophage; predicted S lysis protein | 0.62869 |
| hokD | Qin prophage; small toxic polypeptide | 0.75743 |
| relE | Qin prophage; toxin of the RelE–RelB toxin–antitoxin system | 0.54866 |
Apart from these classified genes, a fraction of poorly characterized genes, which are generally annotated on the basis of strong sequence similarity, is also found among the predicted highly expressed genes. Many of these genes encode predicted proteins and some are poorly characterized hypothetical genes. (A list of highly expressed genes which are thought to encode predicted proteins is given in Supplementary Table SVII.) Our analysis thus provides strong support for significant roles of these genes, which may be highly relevant for E. coli. The large data set analysed here shows a clear connection between relative codon usage difference and gene expression level. Codon frequencies are found to vary between genes in the same genome and between genomes. Thus the overall nucleotide composition of the genome, which influences the codon usage pattern, introduces selective forces acting on highly expressed genes to improve the efficiency of translation. This is also evident from the observation that shorter coding sequences have greater RCBS values, i.e. shorter genes have a high expression level,[4,5,40,41] and this is consistent with the fact that the cost of producing a protein is proportional to its length. Interestingly, we observe that, besides highly expressed protein coding genes, all tRNA genes (listed in Table 3) are also registered with very high RCBS values. This observation suggests that the usage of preferred codons in these and highly expressed genes is positively correlated and that the highly expressed genes use a preferred set of optimal codons in accordance with their respective tRNA levels. Moreover, this result might find another important application in tRNA genes. Besides measuring the expression level of a gene, the RCBS score can be used to remove false positives in tRNA finding algorithms. 
Moreover, several genes of unknown function with predicted high expression levels may be attractive candidates for experimental characterization because we assume that they have important functions in those organisms. Table 4 lists such gene families of unknown function. This kind of analysis is valuable in helping to identify promising candidate genes on which to focus further experimental characterization.
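The suggestion that RCBS could help remove false positives from tRNA-finding output can be pictured as a simple post-filter. The sketch below is only an illustration under stated assumptions: the cut-off of 1.0 is our own choice, motivated by the fact that every tRNA gene listed in Table 3 scores above 1.0, and the entry `candidate_A` with its score is invented to stand in for a spurious prediction. Neither comes from the paper.

```python
# Assumed cut-off (not from the paper): all tRNA genes in Table 3 have
# RCBS > 1.0, so candidates at or below this value are flagged for review.
TRNA_RCBS_CUTOFF = 1.0


def screen_trna_candidates(candidates, cutoff=TRNA_RCBS_CUTOFF):
    """Split predicted tRNA genes into (kept, flagged) dicts by RCBS."""
    kept = {name: score for name, score in candidates.items() if score > cutoff}
    flagged = {name: score for name, score in candidates.items() if score <= cutoff}
    return kept, flagged


candidates = {
    "argW": 1.99759,      # value from Table 3
    "tyrU": 1.00445,      # value from Table 3
    "candidate_A": 0.42,  # hypothetical spurious hit; score invented for illustration
}

kept, flagged = screen_trna_candidates(candidates)
print("kept:", sorted(kept))
print("flagged:", sorted(flagged))
```

Whether a fixed cut-off is appropriate for other genomes is not addressed in this section, so any real use would need the threshold to be calibrated on known tRNA genes first.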
Table 3

Predicted expression levels of tRNA genes

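| Gene | RCBS | Gene | RCBS | Gene | RCBS | Gene | RCBS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| alaX | 1.35584 | glnW | 1.96033 | leuP | 1.06805 | serT | 1.15723 |
| alaW | 1.35584 | glnU | 1.96033 | leuX | 1.18771 | serU | 1.32755 |
| alaV | 1.5556 | gltW | 1.85009 | leuU | 1.23093 | serW | 1.45877 |
| alaU | 1.5556 | gltU | 1.85009 | leuZ | 1.3515 | serX | 1.45877 |
| alaT | 1.5556 | gltT | 1.85009 | lysT | 1.91913 | thrW | 1.175 |
| argU | 1.40468 | gltV | 1.85009 | lysW | 1.91913 | thrV | 1.27061 |
| argX | 1.67244 | glyW | 1.32551 | lysY | 1.91913 | thrT | 1.27325 |
| argQ | 1.76167 | glyV | 1.32551 | lysZ | 1.91913 | thrU | 1.7256 |
| argZ | 1.76167 | glyX | 1.32551 | lysQ | 1.91913 | trpT | 1.62046 |
| argY | 1.76167 | glyY | 1.32551 | lysV | 1.91913 | tyrU | 1.00445 |
| argV | 1.76167 | glyT | 1.33638 | metY | 1.22225 | tyrV | 1.0433 |
| argW | 1.99759 | glyU | 1.47125 | metZ | 1.32682 | tyrT | 1.0433 |
| asnT | 1.87865 | hisR | 1.21868 | metW | 1.32682 | valW | 1.37166 |
| asnW | 1.87865 | ileX | 1.41462 | metV | 1.32682 | valT | 1.37566 |
| asnU | 1.87865 | ileV | 1.42883 | metU | 1.36722 | valZ | 1.37566 |
| asnV | 1.87865 | ileU | 1.42883 | metT | 1.36722 | valU | 1.37566 |
| aspU | 1.38539 | ileT | 1.42883 | pheV | 1.38483 | valX | 1.37566 |
| aspV | 1.38539 | ileY | 1.45397 | pheU | 1.38483 | valY | 1.37566 |
| aspT | 1.38539 | leuW | 1.02415 | proL | 1.26942 | valV | 1.6125 |
| cysT | 1.35851 | leuT | 1.03107 | proM | 1.38923 | selC | 1.28639 |
| glnX | 1.65127 | leuV | 1.03107 | proK | 1.44416 | | |
| glnV | 1.65127 | leuQ | 1.03107 | serV | 1.14888 | | |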
Table 4

Predicted expression levels of highly expressed hypothetical protein genes

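| Gene | RCBS | Gene | RCBS | Gene | RCBS |
| --- | --- | --- | --- | --- | --- |
| ytcA | 0.51055 | ylcI | 0.77343 | ybhU | 1.09738 |
| ybfK | 0.51884 | yojO | 0.84734 | ynhF | 1.15141 |
| ymjA | 0.58644 | ygdT | 0.85155 | ydgU | 1.48121 |
| yrhD | 0.63276 | ypaB | 0.92206 | ypfM | 1.86114 |
| ydbJ | 0.63348 | yccB | 1.07903 | ylcH | 1.56134 |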

Discussion

Our analysis supports the view that each genome has evolved codon usage patterns indicating gene expression levels. The three protein families (RPs, major translation/transcription processing factors, and CH/degradation proteins), which are fundamental at many stages of the life style in promoting growth and stability, have been identified as highly expressed genes. Although the concept of predicting gene expression from codon usage was proposed a decade ago, only recently have these methods been successfully applied to the identification of highly expressed genes in various bacteria and eukaryotic organisms. However, any such codon usage-based prediction of gene expression relies on a prior definition of a reference set consisting of highly expressed genes. For instance, CAI listed a set of 27 highly expressed genes for E. coli, which includes genes encoding 17 RPs, four elongation factors, four outer membrane proteins, recA and dnaK. For yeast, a set of 24 highly expressed genes has been taken as a reference set. These include 16 genes encoding RPs, one for an elongation factor, two enolase genes, two GA-3-PDH genes, ADH 1, PCK and pyruvate kinase.[3] Karlin and coworkers[17-23] included transcription/translation-related factors and CHs in the reference set, in addition to the RP genes. The MILC-based expression level predictor MELP[13] is based on a reference set consisting of all genes coding for RPs longer than 100 codons. Although the composition of the reference set is based on the functional assignment of the genes, there is no specific algorithm to construct a reference set for individual species. The outcome is highly dependent on the genome examined. In some instances, the use of alternative reference sets gives very poor results. In principle it is not possible to regulate protein expression level by the judicious use of certain codons. 
It is worth emphasizing that individual genes tend to favour characteristic codon distributions and that there is a strong connection between protein expressivity and the degree of codon bias. We therefore argue that codon assignment and codon preferences should both be taken into account in a single measure that captures the functional feedback between the constraints of gene expression and the microstructure of genomes. To better understand the potential expression levels of genes, we developed a methodology that relates codon usage, as well as large-scale DNA compositional biases among gene classes, to the expression potential of individual genes. The CAI[3] and codon usage models[13,17] were originally based on somewhat qualitative assumptions about the expression levels of relatively few genes. This is our motivation for using a quantitative measure (RCBS) to recalculate genome-wide expression data. The new approach begins with the assumption, based on the argument just presented, that the general codon usage features observed in highly expressed genes differ greatly from those of randomly generated sequences with the same sequence composition. Our proposition is that, for highly expressed genes, the geometric average of the normalized codon frequencies (f) in a nucleotide sequence exceeds the geometric average of f(x) × f(y) × f(z) by more than 0.5 times the latter. The proposed threshold value (0.5) of RCBS was investigated for the E. coli genome, the yeast genome and archaeal genomes. The data (available on request) provide evidence of the potential strength of our expression measure over the others. Most of the housekeeping genes fall in the category of highly expressed genes. The study also identifies a number of functionally unknown genes as highly expressed on the basis of their codon profile. Thus, our approach appears to be a better alternative to the existing expression models. 
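The threshold criterion described above can be illustrated with a minimal sketch. It assumes that RCBS is the geometric mean, over all codons of a gene, of the ratio f(x,y,z) / (f1(x) × f2(y) × f3(z)), minus 1, where f(x,y,z) is the normalized frequency of codon xyz in the gene and fn(m) is the normalized frequency of base m at codon position n. The function names `rcbs` and `is_predicted_highly_expressed` are hypothetical, and the handling of stop codons and incomplete trailing codons is our assumption rather than something stated in the text.

```python
from collections import Counter
from math import exp, log

# Threshold quoted in the text for calling a gene highly expressed.
HIGH_EXPRESSION_THRESHOLD = 0.5


def rcbs(seq: str) -> float:
    """Relative codon bias strength of one in-frame coding sequence.

    Assumption: every complete codon is counted as given (stop codons are
    not filtered out) and an incomplete trailing codon is ignored.
    """
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    n = len(codons)
    if n == 0:
        raise ValueError("sequence has no complete codon")

    # Normalized codon frequencies f(x, y, z) within this gene.
    codon_counts = Counter(codons)
    # Base counts at codon positions 1, 2 and 3 within this gene.
    position_counts = [Counter(c[p] for c in codons) for p in range(3)]

    # Accumulate logs so that the geometric mean is numerically stable.
    log_sum = 0.0
    for codon in codons:
        observed = codon_counts[codon] / n
        expected = 1.0
        for p in range(3):
            expected *= position_counts[p][codon[p]] / n
        log_sum += log(observed / expected)

    return exp(log_sum / n) - 1.0


def is_predicted_highly_expressed(seq: str) -> bool:
    """Apply the 0.5 threshold discussed in the text."""
    return rcbs(seq) > HIGH_EXPRESSION_THRESHOLD
```

For the toy sequence `AAACCC`, each codon has frequency 0.5 and each expected product is 0.125, so the ratio is 4 for both codons and `rcbs` returns approximately 3. Such short toy inputs only check the arithmetic; they say nothing about real genes, for which the whole coding sequence would be supplied.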
Surprisingly, we have found a strong negative correlation between relative codon usage bias and protein length, in contrast to other reports.[24,42] Although our primary motivation in developing this novel method was to compensate for possible artefacts due to sequence length variability, we have observed that highly expressed genes (identified by RCBS) show a negative correlation with gene length, which points to a biological relevance. This is suggested to be due to more effective translational selection acting to reduce the size of abundant proteins, so as to minimize transcriptional and translational energy costs. Although longer sequences appear to be better optimized in terms of having codons for more abundant tRNAs, which increases the probability of proper and timely translation, it is easier for a ribosome to translate a short RNA sequence, whereas fidelity decreases over longer translations. Therefore it is likely that there is natural selection for shorter genes to be expressed at a higher level.[41] To summarize, we have introduced a novel method, based on the codon usage difference with regard to random base composition at the three codon sites, to estimate the level of expression of a gene. In this article, predicted highly expressed genes are characterized for the E. coli genome only, but the method applies equally to other microbes, results for which will be reported in a separate communication. By comparing its performance with other commonly used measures of gene expression, we have established that RCBS is a generally applicable method that is robust to species-specific effects and introduces little noise into measurements. 
It is remarkable that the present model usually performs as well as the codon usage model of Karlin et al.[18] and sometimes leads to a better correlation with expression data than several other measures based on CAI.[3] The quality of the expression levels predicted by our approach can be appreciated by comparing them with the protein abundance data and microarray data. Thus, our method is effectively complementary to the experimental procedures of 2D gel electrophoresis and DNA microarray analysis in assessing gene expression levels. In contrast to other existing measures, our model describes the global enrichment of a codon in highly expressed genes with no restrictions on the composition of the other codons. Of course, codon-based expression indicators yield a static value, whereas gene expression is a dynamic process with very different expression levels under different conditions. In our view, the codon usage pattern of a genome evolves as a result of the interplay between mutational and selective forces, and a proper account of the adaptive response to codon assignment can lead to a practical solution for estimating gene expression.

Supplementary data

Supplementary data are available online at www.dnaresearch.oxfordjournals.org.

Funding

Financial support by the University Grants Commission, India, sanction No. F.PSW-060/05-06 (ERO), is gratefully acknowledged.
References (10 of 43 shown)

1.  Nature and structure of human genes that generate retropseudogenes.

Authors:  I Gonçalves; L Duret; D Mouchiroud
Journal:  Genome Res       Date:  2000-05       Impact factor: 9.043

2.  Predicted highly expressed genes of diverse prokaryotic genomes.

Authors:  S Karlin; J Mrázek
Journal:  J Bacteriol       Date:  2000-09       Impact factor: 3.490

3.  Relationship of codon bias to mRNA concentration and protein length in Saccharomyces cerevisiae.

Authors:  A Coghlan; K H Wolfe
Journal:  Yeast       Date:  2000-09-15       Impact factor: 3.239

4.  The 'effective number of codons' used in a gene.

Authors:  F Wright
Journal:  Gene       Date:  1990-03-01       Impact factor: 3.688

5.  Distinguishing features of delta-proteobacterial genomes.

Authors:  Samuel Karlin; Luciano Brocchieri; Jan Mrázek; Dale Kaiser
Journal:  Proc Natl Acad Sci U S A       Date:  2006-07-14       Impact factor: 11.205

6.  Synonymous codon bias is related to gene length in Escherichia coli: selection for translational accuracy?

Authors:  A Eyre-Walker
Journal:  Mol Biol Evol       Date:  1996-07       Impact factor: 16.240

7.  Synonymous codon usage in Bacillus subtilis reflects both translational selection and mutational biases.

Authors:  D C Shields; P M Sharp
Journal:  Nucleic Acids Res       Date:  1987-10-12       Impact factor: 16.971

8.  Codon use and the rate of divergence of land plant chloroplast genes.

Authors:  B R Morton
Journal:  Mol Biol Evol       Date:  1994-03       Impact factor: 16.240

9.  Codon usage and gene expression.

Authors:  L Holm
Journal:  Nucleic Acids Res       Date:  1986-04-11       Impact factor: 16.971

10.  Codon usage in Kluyveromyces lactis and in yeast cytochrome c-encoding genes.

Authors:  M A Freire-Picos; M I González-Siso; E Rodríguez-Belmonte; A M Rodríguez-Torres; E Ramil; M E Cerdán
Journal:  Gene       Date:  1994-02-11       Impact factor: 3.688

