
Codon usage pattern and predicted gene expression in Arabidopsis thaliana.

Satyabrata Sahoo¹, Shib Sankar Das², Ria Rakshit³.

Abstract

Research on predicting highly expressed genes in plant genome sequences has been going on for decades. The codon usage pattern of genes in the Arabidopsis thaliana genome is a classical topic for plant biologists because of its significance for understanding molecular plant biology. Here we have used a gene expression profiling methodology based on the score of modified relative codon bias (MRCBS) to elucidate the expression pattern of genes in Arabidopsis thaliana. MRCBS relies exclusively on sequence features for identifying highly expressed genes. In this study, a critical analysis of predicted highly expressed (PHE) genes in Arabidopsis thaliana has been performed using MRCBS as a numerical estimator of gene expression level. Consistent with previous results, our study indicates that codon composition plays an important role in the regulation of gene expression. We found a systematic strong correlation between MRCBS and CAI (codon adaptation index) or other expression measures. Additionally, MRCBS correlates well with experimental gene expression data. Our study highlights the relationship between gene expression and compositional signature in relation to codon usage bias and sets the ground for further investigation of the evolution of protein-coding genes in the plant genome.


Keywords: Arabidopsis thaliana; CAI; Codon usage bias; GC content; Gene expression; PHE genes

Abbreviations: CAI, Codon adaptation index; CP, Chloroplast (Pltd); GEO, Gene Expression Omnibus; MADS, Minichromosome maintenance1, Agamous, Deficiens and Serum response factor; MBP, Megabase pair; MRCBS, Score of modified relative codon bias; MT, Mitochondrion; PHE, Predicted Highly Expressed; RCA, Relative Codon Adaptation; RCB, Relative codon bias; RCBS, Relative Codon Bias Strength; RMA, Relative Molecular Abundance; RP, Ribosomal protein; SAGE, Serial Analysis of Gene Expression; TAIR, The Arabidopsis Information Resource

Year: 2019 PMID: 32550546 PMCID: PMC7286098 DOI: 10.1016/j.gene.2019.100012

Source DB: PubMed Journal: Gene X ISSN: 2590-1583

Introduction

Arabidopsis thaliana has proven to be a model experimental organism for developing plant biology at the molecular level. Any useful insight into the expression of functional proteins of Arabidopsis thaliana will contribute to plant research as well as to modern biotechnology. It is well known that the synthesis of every protein molecule is directed by the arrangement of genetic codes in a genomic DNA sequence. The genetic code uses sixty-one codons to encode 20 amino acids and three codons to terminate translation during protein synthesis. The degeneracy of the genetic code means that many alternative nucleotide sequences can encode the same protein. The codon usage pattern varies significantly between organisms, and also between genes that are expressed at different levels in the same organism. A number of hypotheses prevail regarding the factors that influence the codon usage pattern. Attempts have been made to explain the codon distributions in protein-coding genes as well as the differences in usage among synonymous codons in each organism (Sharp et al., 1988; Brandis and Hughes, 2016; Sharp and Li, 1987; Ikemura, 1981; Hockenberry et al., 2014; Lee et al., 2010). It is well discussed in the literature that organisms may be subject to codon biases of different origins. In fact, it is rather difficult to identify the dominant codon bias of a genome. Some researchers have speculated that codon bias that tends to reduce the diversity of isoacceptor tRNAs may reduce the metabolic load (Gustafsson and Govindarajan, 2004; Akashi, 1994; Ikemura, 1985). Other analyses have revealed further factors, such as nucleotide compositional constraint, codon-anticodon interaction and amino acid conservation, which may also influence the codon usage pattern of a genome. 
Whatever the molecular basis for codon bias, it is evident that codon bias can have a significant impact on the expression of functional proteins. Translational selection pressure or protein secondary structure may have a profound effect on codon bias. It is generally thought that a balance between mutation and natural selection on translational efficiency yields a correlation between codon bias and rate of gene expression, such that highly expressed genes often have stronger relative codon bias (RCB) than genes expressed at lower levels (Kurland, 1991; Hiraoka et al., 2009). The objective of this work is to identify and analyze PHE genes and the codon usage pattern in Arabidopsis thaliana. Our analyses of E. coli, yeast, Synechocystis and archaeal genomes support the hypothesis that each genome has evolved a codon usage pattern promoting its gene expression level (Roymondal et al., 2009; Das et al., 2009; Das et al., 2012; Sahoo and Das, 2014a; Das et al., 2017). With the advent of modern technologies, several high-throughput experiments are widely used to identify highly expressed genes. The most commonly used technique for studying large-scale gene expression is the cDNA microarray. In addition, techniques such as 2D gel electrophoresis, mass spectrometry, chromatin immunoprecipitation, DNA chip technology and Serial Analysis of Gene Expression (SAGE) have been developed for the purpose. All these experiments require a wide range of conditions to be matched and a massive investment of time and resources. To overcome these obstacles to identifying highly expressed genes in the vast majority of organisms, we must look beyond direct experimental methods. We therefore focused our study on developing a computational methodology that can be used to study the large-scale gene expression profile of an organism. 
Based on the hypothesis that highly expressed genes are often characterized by strong compositional bias in terms of codon usage (Ikemura, 1981; Ikemura, 1985; Kurland, 1991; Sahoo and Das, 2014b; Karlin and Mrazek, 2000; Karlin et al., 2005; Carbone et al., 2003; Supek and Vlahovicek, 2005; Supek and Vlahovicek, 2010), a number of measures such as the Codon Adaptation Index (CAI) (Sharp and Li, 1987), Relative Codon Adaptation (RCA) (Fox and Erill, 2010) and Relative Codon Bias Strength (RCBS) (Roymondal et al., 2009; Das et al., 2009) have been developed to provide numerical indices that predict the expression level of genes. There are no universal standards that make these results suitable for comparative analysis. Moreover, most of these commonly used computational approaches depend on knowledge of the codon bias of a reference set of highly expressed genes. MRCBS, in contrast, has been devised as an alternative model that predicts gene expression level from codon composition in such a way that the score can be calculated without a previously selected reference set of highly expressed genes. In fact, MRCBS captures highly expressed genes better than several other commonly used measures (Das et al., 2012; Sahoo and Das, 2014a; Das et al., 2017; Sahoo and Das, 2014b). Here, we investigated the gene expression profile and the variation in synonymous codon usage pattern of the Arabidopsis thaliana genome. Arabidopsis thaliana is a small flowering plant with a relatively short life cycle and is the first plant to have its genome completely sequenced (The Arabidopsis Genome Initiative, 2000). Since 1943, Arabidopsis thaliana has been widely used as experimental biological material in plant research laboratories around the world. Its small genome, approximately 135 MBP on 5 chromosomes, makes it a useful model for plant sciences. 
Extensive work has been done by plant biologists to assign functions to its 2500 genes and the 3500 proteins they encode. The latest information on Arabidopsis research is available from The Arabidopsis Information Resource (TAIR). The small genome size and the availability of the complete DNA sequence of Arabidopsis thaliana have attracted the attention of a wide range of scientists, including evolutionary biologists and biotechnology companies. The rapid life cycle, unusual properties of inheritance and the vast information about its genealogy suggest that this organism is a useful tool for plant biologists. Finally, its important role in the study of plant-pathogen interaction makes it very attractive to biotechnology companies for industrial and research uses. Thus, the gene expression profile of Arabidopsis thaliana is expected to make important contributions to plant sciences.

Materials and methods

The whole-genome sequence of Arabidopsis thaliana, along with its gene annotations, was taken from NCBI GenBank. All gene sequences under study, including those annotated as hypothetical, were extracted from GenBank accession numbers NC_003070.9 (Chromosome 1), NC_003071.7 (Chromosome 2), NC_003074.8 (Chromosome 3), NC_003075.7 (Chromosome 4), NC_003076.8 (Chromosome 5), NC_001284.2 (Mitochondrion, MT) and NC_000932.1 (Chloroplast, Pltd). In the present communication, we report the codon usage pattern and gene expression in the Arabidopsis thaliana genome. For this purpose, a variety of computational tools, namely CAI, relative codon adaptation (RCA), GC3 and MRCBS, have been used.

The codon adaptation index, CAI, is given by (Sharp and Li, 1987)

$$\mathrm{CAI} = \left( \prod_{i=1}^{N} w_i \right)^{1/N}$$

where $N$ is the number of codons in the gene and the relative adaptiveness $w_i$ is defined as

$$w_i = \frac{f_i}{f_{\max}^{aa}}$$

Here $f_i$ is the frequency of the $i$-th codon, and $f_{\max}^{aa}$ is the maximum frequency of the codon most often used for encoding amino acid $aa$ in a set of highly expressed genes of the particular genome. The CAI score ranges from 0 to 1; the higher the CAI value, the more likely the gene is to be highly expressed.

The relative codon adaptation (RCA) for an entire gene is computed as (Fox and Erill, 2010)

$$\mathrm{RCA} = \left( \prod_{i=1}^{L} \mathrm{RCA}_{xyz}(i) \right)^{1/L}$$

where $L$ is the length of a gene and $\mathrm{RCA}_{xyz}(i)$ is defined by

$$\mathrm{RCA}_{xyz} = \frac{f(xyz)}{f_1(x)\, f_2(y)\, f_3(z)}$$

Here $f(xyz)$ is the observed relative frequency of a codon $xyz$ in any reference gene set, and $f_i(m)$ is the observed relative frequency of base $m$ at codon position $i$ in the same reference set.

GC3 measures the frequency of G or C at the third position of synonymous codons and can be used as an index of codon bias. It is measured by

$$\mathrm{GC3} = \frac{\sum_{xyz \in NNS} f(xyz)}{\sum_{xyz \in NNN} f(xyz)}$$

where N = any base, S = G or C, and $f(xyz)$ is the observed frequency of codon $xyz$.

The score of modified relative codon bias, MRCBS, measures the expression level of a gene and is defined as (Das et al., 2012; Sahoo and Das, 2014a; Das et al., 2017; Sahoo and Das, 2014b)

$$\mathrm{MRCBS} = \left( \prod_{i=1}^{N} \frac{\mathrm{RCBS}_{xyz}(i)}{\mathrm{RCBS}_{\max}^{aa}(i)} \right)^{1/N}$$

where

$$\mathrm{RCBS}_{xyz} = \frac{f(xyz)}{f_1(x)\, f_2(y)\, f_3(z)}$$

Here $f(xyz)$ is the normalized codon frequency of a codon $xyz$ and $f_n(m)$ is the normalized frequency of base $m$ at codon position $n$ in a gene. $\mathrm{RCBS}_{\max}^{aa}$ is the maximum value of RCBS among the codons encoding the same amino acid $aa$ in the same reference set, and $N$ is the codon length of the query sequence. The score of the modified relative codon bias ranges from 0 to 1.

The numerical value computed by this method may be used to rank a set of genes with respect to codon bias towards gene expression. A threshold score of the modified relative codon bias identifies the highly expressed genes. Because codon assignments and codon usage patterns evolve as adaptive responses of genomes, the threshold score for identifying highly expressed genes varies from genome to genome; the methodology used to calculate the threshold score is described in (Sahoo and Das, 2014a). In this work, the different expression level predictors have been computed by comparing the codon usage bias of each gene with the profile of universally functional genes. The predicted highly expressed (PHE) genes are then characterized on the basis of the strength of the codon usage bias derived from the algorithms described in the literature, and a gene is identified as a PHE gene provided its MRCBS exceeds the threshold value. Pearson correlation coefficients (r) between the different codon usage bias indices have been computed for a systematic analysis of the gene expression profile of the genome under study. The impact score of a codon (xyz) in a gene sequence is defined by MRCBS(xyz) and is used to describe the codon usage profile of the genome under study. If $\bar{x}$ and $\mu$ denote the sample mean and population mean of the impact score for a particular codon, respectively, and $\sigma$ the population standard deviation, then the z score of the test statistic is given by

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}}$$

where $N$ is the total number of codons. The impact codons are then identified from the impact score of a codon, based on the level of significance of the z score of the test statistic.
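The definitions in this section can be illustrated with a minimal Python sketch. It makes several assumptions that are not stated in the text: per-codon RCBS is taken as f(xyz) / (f1(x) f2(y) f3(z)), MRCBS as the geometric mean of each codon's RCBS divided by the largest RCBS among codons for the same amino acid, and, for simplicity, the "reference set" is the query gene itself. The function names and the toy sequence are hypothetical, and the authors' actual implementation may differ.

```python
from collections import Counter
from math import exp, log, sqrt

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ("*" marks a stop).
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def sense_codons(seq):
    """Split a coding sequence into in-frame sense codons (stops/ambiguous dropped)."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return [c for c in codons if CODON_TABLE.get(c, "*") != "*"]


def rcbs_table(codons):
    """Assumed per-codon RCBS: f(xyz) / (f1(x) * f2(y) * f3(z))."""
    n = len(codons)
    codon_f = {c: k / n for c, k in Counter(codons).items()}
    base_f = [{b: k / n for b, k in Counter(c[p] for c in codons).items()}
              for p in range(3)]
    return {c: f / (base_f[0][c[0]] * base_f[1][c[1]] * base_f[2][c[2]])
            for c, f in codon_f.items()}


def mrcbs(seq):
    """Geometric mean of RCBS_xyz / RCBS_max(aa) over the codons of one gene."""
    codons = sense_codons(seq)
    rcbs = rcbs_table(codons)
    rcbs_max = {}
    for codon, value in rcbs.items():
        aa = CODON_TABLE[codon]
        rcbs_max[aa] = max(rcbs_max.get(aa, 0.0), value)
    # Sum logs instead of multiplying ratios to avoid underflow on long genes.
    log_sum = sum(log(rcbs[c] / rcbs_max[CODON_TABLE[c]]) for c in codons)
    return exp(log_sum / len(codons))


def gc3(seq):
    """Fraction of sense codons whose third base is G or C."""
    codons = sense_codons(seq)
    return sum(c[2] in "GC" for c in codons) / len(codons)


def z_score(sample_mean, pop_mean, pop_sd, n):
    """z statistic for a sample mean against a known population mean and SD."""
    return (sample_mean - pop_mean) / (pop_sd / sqrt(n))


toy_gene = "ATGGCTGCAGCTTTTTTC"  # hypothetical 6-codon sequence, not real data
print(round(mrcbs(toy_gene), 4), round(gc3(toy_gene), 4))
```

On a sequence this short the scores are not biologically meaningful; the sketch only shows how the ratio to the per-amino-acid maximum keeps the score between 0 and 1.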

Results and discussion

In the present study, we have analyzed the gene expression profile of the Arabidopsis genome and predicted highly expressed (PHE) genes with respect to MRCBS. We have measured the expression pattern and codon usage bias of all protein-coding genes in the genome under study. Our study includes 12,645 protein-coding sequences of chromosome 1, 7596 protein-coding sequences of chromosome 2, 9474 protein-coding sequences of chromosome 3, 7426 protein-coding sequences of chromosome 4, 10,993 protein-coding sequences of chromosome 5, 117 protein-coding sequences of the mitochondrion (MT) and 85 protein-coding sequences of the chloroplast (Pltd, CP). Some basic information on the Arabidopsis genome is given in Table 1. The expression level of all protein-coding genes was calculated by MRCBS and compared with other codon usage models such as CAI and RCA. The threshold score for identifying highly expressed genes in Arabidopsis thaliana has been calculated to be 0.77. The GC content of the genome under study is 44.26%. The overall GC3 score is 0.4215. Many researchers have argued that GC content or GC3 may be viewed as the primary influence on the codon usage pattern and thus on the expression profile. Table 2 displays the statistics of PHE genes and the top 20 PHE genes of the Arabidopsis thaliana genome along with their functions and scores calculated in our approach (MRCBS).

Table 1

Some basic information of the Arabidopsis thaliana genome.

Genome	Number of genes	Average length	GC content (%)	GC3	Number of PHE genes	PHE genes (%)
Chromosome 1	12,645	1326	44	0.42	381	3.0
Chromosome 2	7596	1232	44	0.42	300	3.9
Chromosome 3	9474	1283	44	0.42	326	3.4
Chromosome 4	7425	1320	44	0.42	225	3.0
Chromosome 5	10,993	1304	44	0.42	368	3.3
Chloroplast genome	85	929	37.5	0.27	0	0
Mitochondrial genome	117	586	44.6	0.43	0	0
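
The PHE percentages in Table 1 follow directly from the gene counts. A short sketch of that arithmetic, using only the nuclear-chromosome rows of the table (the variable and function names are illustrative):

```python
# (number of genes, number of PHE genes) per nuclear chromosome, copied from Table 1
table1 = {
    "Chromosome 1": (12645, 381),
    "Chromosome 2": (7596, 300),
    "Chromosome 3": (9474, 326),
    "Chromosome 4": (7425, 225),
    "Chromosome 5": (10993, 368),
}


def phe_percent(n_genes, n_phe):
    """Share of genes classified as PHE, in percent."""
    return 100.0 * n_phe / n_genes


per_chromosome = {name: round(phe_percent(*counts), 1)
                  for name, counts in table1.items()}

# Pooled percentage over the five nuclear chromosomes.
total_genes = sum(n for n, _ in table1.values())
total_phe = sum(p for _, p in table1.values())
overall = round(phe_percent(total_genes, total_phe), 1)

for name, pct in per_chromosome.items():
    print(f"{name}: {pct}%")
print(f"Nuclear chromosomes pooled: {overall}%")
```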

Table 2

Characteristics of PHE genes and top 20 genes with the highest predicted expression levels for Arabidopsis thaliana genome.

Average length	Average GC content	Average GC3 content	% of PHE RP genes	% of PHE hypothetical genes	Top 20 genes: locus tag/gene name	Top 20 genes: function	Top 20 genes: MRCBS
658	0.461	0.475	17.70%	8.63%	AT5G03710	Replication factor C large subunit	0.942377
					AT3G56020	Ribosomal protein L41 family	0.902928
					AT5G03850	Nucleic acid-binding, OB-fold-like protein	0.885142
					RPS28	Ribosomal protein S28	0.884064
					AT3G46430	ATP synthase	0.877127
					AT3G08520	Ribosomal protein L41 family	0.872734
					AT2G04621	Transmembrane protein	0.869109
					AT5G56670	Ribosomal protein S30 family protein	0.868022
					AT3G10090	Nucleic acid-binding, OB-fold-like protein	0.866286
					RPL23AA	Ribosomal protein L23AA	0.86058
					AT2G19730	Ribosomal L28e protein family	0.860542
					RS27A	Ribosomal protein S27	0.860165
					AT4G27090	Ribosomal protein L14	0.856987
					AT2G14285	Small nuclear ribonucleoprotein family protein	0.856773
					AT3G11120	Ribosomal protein L41 family	0.855905
					AT5G16130	Ribosomal protein S7e family protein	0.854895
					AT2G31490	Neuronal acetylcholine receptor subunit alpha-5	0.854269
					CAM3	Calmodulin 3	0.852098
					RPS15	Cytosolic ribosomal protein S15	0.848976
					CAM2	Calmodulin 2	0.847033

The codon usage profile of the Arabidopsis genome has been described in terms of the average impact score of 27,046 complete protein-coding sequences of the genome [Fig. 1]. Although most amino acids can be specified by more than one codon, only a subset of the potential codons is used in highly expressed genes [Table 3]. There are no impact codons coding for His, Thr and Val in the Arabidopsis genomes studied here. The impact codons in Arabidopsis are mostly used in coding Phe (ttt, ttc), Leu (ttg, ctt, ctc), Ile (atc), Met (atg), Tyr (tac), Gln (caa, cag), Asn (aac), Lys (aaa, aag), Asp (gat), Glu (gaa, gag), Ser (tct, tcc, tca, agc), Pro (cct, cca), Ala (gct), Cys (tgc), Trp (tgg), Arg (aga) and Gly (ggt, gga). Importantly, these codons do not reflect any simple compositional bias. Not all of the preferred (impact) codons are GC rich, and GC/GC3 may not be an accurate representation of the trend in codon usage. The selection of preferred codons that optimize the translational rate possibly depends on codon-anticodon interaction kinetics.
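
The remark that the impact codons are not uniformly GC rich can be checked by counting how many of the codons listed above end in G or C. The sketch below transcribes the list from the text; the dictionary layout and variable names are illustrative only.

```python
# Impact codons as listed in the text, grouped by amino acid.
impact_codons = {
    "Phe": ["ttt", "ttc"], "Leu": ["ttg", "ctt", "ctc"], "Ile": ["atc"],
    "Met": ["atg"], "Tyr": ["tac"], "Gln": ["caa", "cag"], "Asn": ["aac"],
    "Lys": ["aaa", "aag"], "Asp": ["gat"], "Glu": ["gaa", "gag"],
    "Ser": ["tct", "tcc", "tca", "agc"], "Pro": ["cct", "cca"],
    "Ala": ["gct"], "Cys": ["tgc"], "Trp": ["tgg"], "Arg": ["aga"],
    "Gly": ["ggt", "gga"],
}

all_codons = [codon for codons in impact_codons.values() for codon in codons]

# A codon contributes to GC3 when its third base is g or c.
gc_ending = [codon for codon in all_codons if codon[2] in "gc"]

total = len(all_codons)
share = len(gc_ending) / total
print(f"{len(gc_ending)} of {total} impact codons end in G or C ({share:.0%})")
```

With the list as given, the G/C-ending and A/T-ending impact codons are evenly split, which is in line with the statement that GC3 alone does not capture the codon usage trend.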

Fig. 1

Average impact score of codons in Arabidopsis thaliana genome.

Table 3

Codon/Amino Acid Usage of the Arabidopsis thaliana CP/MT genome and nuclear genome.

Amino acid	Codon	CP genome	MT genome	Nuclear genome	PHE genes
Ala	GCA	0.924057	0.956196	0.977693	0.965759
	GCC	1.068317	1.015433	0.69599	0.821385
	GCG	0.633739	0.6198	0.527703	0.334181
	GCU	1.278889	1.181231	1.175584	1.84292
Cys	UGC	0.477558	0.85503	1.120411	1.100364
Cys	UGU	0.654264	0.881925	0.975416	0.88164
Asp	GAC	0.620287	0.891631	0.884973	0.732988
Asp	GAU	1.027884	1.099495	1.123944	0.928023
Glu	GAA	1.501542	1.667856	1.379294	1.363214
Glu	GAG	0.907668	1.278562	1.397898	1.38124
Phe	UUC	1.53997	1.704901	1.857261	2.556277
Phe	UUU	1.254081	1.45126	1.225468	1.079788
Gly	GGA	1.704801	1.621551	1.7502	2.544636
	GGC	1.214503	0.944487	0.844881	0.556763
	GGG	1.827965	1.327694	0.804863	0.489334
	GGU	1.158149	1.105812	1.163195	1.453484
His	CAC	0.609372	0.64853	0.762579	0.823344
His	CAU	0.740304	0.914712	0.73468	0.544987
Ile	AUA	0.792638	0.786369	0.620441	0.243809
	AUC	1.223305	1.097218	1.121274	1.320139
	AUU	1.132562	0.783437	0.792729	0.782475
Lys	AAA	1.387184	1.427459	1.386644	1.296746
Lys	AAG	0.793639	1.451157	1.58078	2.442647
Leu	CUA	0.674913	0.877658	0.74541	0.464587
	CUC	0.947252	1.11581	1.490388	1.778466
	CUG	0.633064	0.892686	0.803556	0.490864
	CUU	0.894811	1.108499	1.383461	1.59222
	UUA	1.459008	1.022769	0.899226	0.514989
	UUG	1.459008	1.218262	1.677031	1.828657
Asn	AAC	0.904617	0.881605	1.164078	1.109241
Asn	AAU	1.042164	0.929833	0.754519	0.393298
Pro	CCA	0.921901	1.153069	1.487962	2.096139
	CCC	1.468882	1.083116	0.622105	0.51766
	CCG	1.036982	0.794335	0.836171	0.537951
	CCU	1.069133	1.229223	1.306557	1.772502
Gln	CAA	1.734326	1.508288	1.356156	1.385078
Gln	CAG	0.843424	1.037337	1.114674	1.24047
Arg	AGA	0.808032	1.175478	1.511002	1.794382
	AGG	0.560481	1.134779	0.929007	1.144426
	CGA	1.283031	1.098178	0.785128	0.515815
	CGC	0.929904	0.773274	0.593302	0.483748
	CGG	1.120378	1.005459	0.622907	0.173957
	CGU	1.135756	0.742584	0.820779	1.376508
Ser	AGC	0.554621	1.050798	1.191272	0.949226
	AGU	0.828491	0.854586	0.846464	0.537035
	UCA	0.89995	1.209875	1.627653	1.527831
	UCC	2.178256	1.441785	1.260763	1.401957
	UCG	0.817047	0.915688	0.908629	0.641353
	UCU	1.07113	1.40707	1.726912	2.176242
Thr	ACA	0.793609	0.828891	0.960773	0.883517
	ACC	1.172183	0.875213	0.770601	0.86331
	ACG	0.501757	0.553283	0.513637	0.230112
	ACU	0.979165	0.831844	0.799601	1.013725
Val	GUA	0.764515	0.719545	0.468802	0.320551
	GUC	0.694481	0.676856	0.734463	0.895895
	GUG	0.607432	0.705351	0.880408	0.890438
	GUU	0.657571	0.659398	0.933754	1.208662
Tyr	UAC	0.820827	0.849145	1.097255	1.46001
Tyr	UAU	1.283358	1.066362	0.725359	0.473723
Met	AUG	1.806166	1.39968	1.446542	1.755233
Trp	UGG	2.457201	1.521081	1.542432	1.564577
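
As a reading aid for Table 3, the following sketch picks, for three amino acids, the codon with the highest tabulated value in the chloroplast (CP) column and in the PHE-gene column. The numbers are copied from the table; treating the largest value as the "preferred" codon is an assumption made here for illustration, and the helper name `top_codon` is hypothetical.

```python
CP, PHE = 0, 1  # positions of the two columns kept in each tuple below

# (CP genome, PHE genes) values for three amino acids, copied from Table 3
usage = {
    "Ala": {"GCA": (0.924057, 0.965759), "GCC": (1.068317, 0.821385),
            "GCG": (0.633739, 0.334181), "GCU": (1.278889, 1.84292)},
    "Gly": {"GGA": (1.704801, 2.544636), "GGC": (1.214503, 0.556763),
            "GGG": (1.827965, 0.489334), "GGU": (1.158149, 1.453484)},
    "Pro": {"CCA": (0.921901, 2.096139), "CCC": (1.468882, 0.51766),
            "CCG": (1.036982, 0.537951), "CCU": (1.069133, 1.772502)},
}


def top_codon(amino_acid, column):
    """Codon with the largest tabulated value for one amino acid and column."""
    codons = usage[amino_acid]
    return max(codons, key=lambda codon: codons[codon][column])


for aa in usage:
    print(aa, "CP:", top_codon(aa, CP), "PHE:", top_codon(aa, PHE))
```

For Gly and Pro the top codon differs between the chloroplast column and the PHE column, while for Ala it is the same codon in both.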

The large data set analyzed here revealed a strong bias towards usage of a different set of preferred codons in genes with high cytoplasmic mRNA levels. In contrast, genes with low mRNA levels showed very little synonymous codon usage bias. Usage bias has been proposed to result from translational selection, since using a codon that is translated via an abundant tRNA species is hypothesized to boost translational efficiency. Codon frequencies are found to vary between genes in the same genome. The standard version of the genetic code includes 61 sense codons and three stop codons. Although almost all organisms have made the same codon assignments for each amino acid, the preferred use of individual codons varies greatly among genes. The overall nucleotide composition of the genome, which influences the codon usage pattern, introduces selective forces acting on highly expressed genes to improve the efficiency of translation. It is now widely accepted that synonymous codon preferences in a unicellular organism are affected by the cellular amount of isoacceptor tRNA species. However, we observe that not all tRNA genes corresponding to impact codons have been detected by tRNAscan-SE. Many tRNAs can translate more than one codon, although with variable ability, and it is suggested that impact codons favor translational efficiency. Since highly expressed genes use a preferred set of optimal codons in accordance with their respective tRNA levels, this observation might find another important application in tRNA-finding algorithms. Expression profiles of the genes are determined by calculating MRCBS for each gene, and their distributions are shown in Fig. 2. The majority of genes (90%) have MRCBS values lying between 0.65 and 0.75, and the mean and median values are 0.3870 and 0.3295, respectively. 
Only 3.3% of genes have MRCBS values >0.77. The percentage of PHE genes varies between 3% and 4% across the Arabidopsis thaliana chromosomes, whereas no highly expressed genes are predicted in the CP/MT genomes.

Fig. 2

Distribution of MRCBS of all protein-coding genes in Arabidopsis thaliana genome.

The overall variation of the GC and GC3 content of the genes is depicted in Suppl. Fig. 1 and Suppl. Fig. 2, respectively. The majority of genes have a GC3 score between 0.3 and 0.6, and 88.5% of genes have a GC content between 0.4 and 0.5. We observed that the percentage of PHE genes varies from chromosome to chromosome and is independent of the GC content or GC3 score of these genes. In fact, we failed to find any correlation between gene expression and GC content or GC3 score. It is well established that highly expressed genes display more biased codon usage than lowly expressed genes [Table 3]. We observed that PHE genes of Arabidopsis thaliana mostly include ribosomal protein (RP) genes, translation initiation factors, translation elongation factors, MADS box transcription factors, membrane traffic proteins, transmembrane proteins, chaperones, heat shock proteins, histones, ubiquitin, nucleic acid binding proteins and many stress and energy metabolism genes. However, not all RP genes of Arabidopsis thaliana belong to the PHE gene class. Table 2 reports the statistics of PHE genes. The percentage of PHE genes in Arabidopsis thaliana is 3.3%, of which only 17.7% are RP genes. It is remarkable that 99.21% of RP genes in the yeast genome and almost all RP genes in the E. coli genome fall in the PHE class. An average of 65.56% of RP genes in archaeal genomes are PHE. Out of 561 RP genes, 255 are PHE. Thus only a small fraction of the RP genes of Arabidopsis thaliana has a high predicted expression level, in contrast to E. coli, yeast and archaea. The top 20 genes with the highest predicted expression levels for the Arabidopsis thaliana genome are displayed in Table 2. Our analysis predicted 1063 highly expressed genes in Arabidopsis thaliana. 
A list of well-characterized PHE genes has been displayed in Suppl. Table 1. It is worth noticing that these genes are separated into different functional categories. Table 4 displays a set of well-characterized PHE genes segregated into different functional categories.

Suppl. Fig. 1

Distribution of GC content of all protein-coding genes in Arabidopsis thaliana genome.

Suppl. Fig. 2

Distribution of GC3 content of all protein-coding genes in Arabidopsis thaliana genome.

Table 4

A list of potential PHE genes segregated into different functional categories.

Transcription factor	AT4G10480	Elongation	AT1G56070		AT3G07860
	AT3G12390		AT4G20360		ATG8C
	AT5G09920		AT3G12915		AT3G45180
	AT4G35900		AT1G07930		AT5G57860
	AT2G17770	Translation initiation factor/elongation factor	AT1G30230		AT3G58230
	AT1G54830		AT2G18110	Dehydrogenase	AT1G53240
	AT5G53980		AT5G19510		AT1G04410
	AT1G56170		AT5G12110		AT5G43330
MADS box transcription factor	AT1G69120		AT2G46280		AT2G02050
	AT1G31140		AT5G35680		AT1G12900
	AT1G50780		AT2G04520		AT3G04120
	AT1G71692		AT4G20980		AT3G26650
Chromatin/chromatin binding protein	AT3G03590		AT1G26630		AT1G13440
	AT1G01160		AT5G05470	DNA/RNA binding protein	AT4G01060
	AT1G75060		AT1G69410		AT5G08420
Histone	AT4G40040	mRNA processing/splicing	AT3G62840		AT5G47210
	AT5G59870		AT5G44500		AT4G17520
	AT5G12910		AT4G20440		AT4G16830
	AT5G10390		AT4G30220		AT3G57150
Tubulin	TUA2		AT2G14285	Membrane traffic protein	AT4G23630
	TUA3		AT3G11500		AT1G73030
	TUA4		AT2G03870		AT2G34250
	TUA5		AT2G23930		AT2G38360
	TUB2	Methyltransferase	AT4G34050		AT1G62880
	TUB3		AT4G13930		AT1G48440
	TUB4		AT5G66550	Transfer/carrier protein/transporter	AT3G10640
	TUB1		AT3G03780		AT2G19830
	TUB5		AT5G17920		AT3G15352
	TUB7	Ligase	AT5G10880		AT3G57900
	TUB9		AT1G55570		AT2G36830
	KIS		AT1G55560		AT3G16240
	TUA6		AT3G13400	Actin/Actin related protein	ACT2
Calcium binding protein	CRT1a		AT3G13390		ACT7
	CRT1b		AT1G66200		ACT8
	AT5G39670		AT5G35630		AT3G09860
	AT2G41090		AT3G17820		ACT11
	AT1G76640	Calmodulin	CAM1	Amino acid transporter	AT2G45960
G protein coupled receptor/modulator	AT5G42090		CAM2		AT3G61430
	AT5G18520		CAM3		AT4G00430
	AT2G30060		CAM5		AT1G01620
	AT3G07880		CAM6	ATP Synthase	AT4G23710
Transmembrane Protein	AT2G04621		CML42		AT3G01390
	AT2G01870		CML11		AT2G33040
	AT2G13965	Acyltransferase	AT5G11670	Carbohydrate kinase	AT3G59480
	AT5G19875	Basic helix-loop-helix transcription factor	AT4G10480		AT1G50390
	AT5G03120	Basic helix-loop-helix transcription factor	AT3G12390		AT1G79550
	AT2G29180	Basic leucine zipper transcription factor	AT4G35900
	AT3G18800		AT2G17770	Extracellular matrix structural protein	AT4G08410
	AT2G25297	Homeodomain transcription factor	AT5G53980		AT3G54580
	AT5G07165				AT5G06640
	AT2G22080	Cysteine protease	AT3G04840		AT2G24980
	AT5G16250	Cysteine protease	AT4G34670		AT1G23720
	AT5G04790	Dehydratase	AT3G46440		AT5G06630
	AT1G74458	Dehydratase	AT3G51160		AT3G28550
	AT3G28190	Aminoacyl-tRNA synthetase	AT1G55803		AT3G54590
	AT2G31090	Antibacterial response protein	AT5G50840		AT1G21310
	AT1G17090	ABC transporter	AT5G60790		AT1G76930
	AT3G14452	Ubiquitin/ubiquitin like	UBQ11	Chaperone/heat shock protein	AT1G27330
	AT2G05310		UBQ13		AT4G02450
	AT3G28193		UBQ4		AT5G12020
	AT1G65720		UBQ5		HSC70–1
	AT4G21500		UBQ6		HSP17.6A
	AT5G09225		UEV1D-4		HSP21
	AT1G16916		UBQ1		HSP70
	AT5G03460		UBQ14		Hsp70–2
	AT1G49310		AT5G18310		ERD2
	AT3G42075		AT3G61113		AT3G09440
	AT3G18915		AT5G32440		BIP2
	AT2G41905		NKS1		BIP1
	AT1G67235		UBC11		Hsp81.4
	AT5G61340		UBL5		HSP81–2
	AT1G06515		APG8A		HSP81–3
	AT5G19860		ATG8B		HSP90.1

PHE genes belong to various functional classes and are variably represented in the genome. These include carbohydrate kinase, dehydratase, dehydrogenase, ATP synthase, acyltransferase, methyltransferase, amino acid transporter, actin/actin-related protein, calcium-binding protein, calmodulin, cysteine protease, chromatin/chromatin-binding protein, DNA-directed DNA/RNA polymerase, enzyme modulator, extracellular matrix structural protein, ligase, non-motor actin/microtubule-binding protein, non-receptor serine/threonine protein kinase, oxidase, oxidoreductase, nucleotidyltransferase, reductase, peroxidase, phosphatase, peroxidase/phosphatase inhibitor and transfer/carrier protein. In addition, we have identified a number of PHE genes that play important roles in signal transduction, amino acid transport and metabolism, secondary metabolite biosynthesis and catabolism, cell membrane biogenesis, inorganic ion transport and metabolism, coenzyme transport and metabolism, carbohydrate transport and metabolism, intercellular trafficking, and energy production and conversion. These include vacuolar protein, vacuolar ATP synthase, vacuolar calcium-binding protein, vacuolar ATPase, vesicle coat protein, seed storage albumin, arabinogalactan protein, cytochrome complex, cytochrome c oxidase/electron carrier and members of the cytochrome family, the DEFL family and the dehydrin family. A number of PHE genes encoding plasma membrane intrinsic protein, plant defensin, photosystem II, phytochrome-associated protein, phytosulfokine and plant viral response protein also have significant roles in the plant. Among other PHE genes, copper chaperone, copper iron-binding protein, copper transport protein, zinc-binding ribosomal family protein and ferredoxin-like superfamily protein have important functions in this organism. 
However, a fraction of poorly characterized hypothetical genes was also found among the PHE genes. Table 2 displays the general statistics of hypothetical or poorly characterized PHE genes in the Arabidopsis genome. Genes of unknown function with high predicted expression levels may be attractive candidates for experimental characterization. The characteristic codon distribution of these genes indicates that they may have important functions in this organism. A variety of PHE genes encoding proteins of unknown function may provide targets for identification of additional key features of Arabidopsis thaliana. The temporal and spatial organization of these genes for chromosome replication, genome segregation and cell division processes is less well characterized in the Arabidopsis genome. A detailed analysis of these putative/hypothetical PHE genes would generate a more comprehensive picture of the replication and division machineries, and of the regulatory features of the cell cycle.

Correlations among different codon bias indices

In this study, we compared the performance of several commonly used computational tools for predicting gene expression level. The expression profiles of the Arabidopsis genome were analyzed in terms of CAI, RCA and MRCBS. The CAI scores have been calculated by taking all RP genes (>80 aa) as the PHE genes that serve as the reference set. RCA frequencies are computed using the same reference set as in the calculation of CAI. The results indicate a good correlation between RCA and CAI (r = 0.673761), while the correlation of RCA with MRCBS is noticeably higher (r = 0.787772) [Fig. 3]. MRCBS, the novel method of quantitatively predicting gene expressivity, is then compared with CAI, and the correlation between them is found to be surprisingly good (r = 0.900204) [Fig. 4]. These correlation coefficients can be used to express the strength of the existing prediction methods. It can be seen that MRCBS consistently yields better correlation than the other measures. We also observe no clear correlation of CAI or MRCBS with GC3 (rCAI = −0.05726, rMRCBS = 0.101083) or with GC (rCAI = −0.15775, rMRCBS = 0.041383). So GC content and GC3 may not be an accurate representation of the trend in codon usage bias. Similarly, no correlation between the length of the gene and MRCBS or CAI was observed in our study.
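
For readers who want to reproduce this kind of comparison, the Pearson coefficient between two per-gene score vectors can be computed as in the sketch below. The function is a plain textbook implementation, not the authors' code, and the two short score lists are hypothetical values invented for illustration, not data from this study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-gene scores for five genes (illustration only).
toy_mrcbs = [0.31, 0.42, 0.55, 0.78, 0.90]
toy_cai = [0.40, 0.45, 0.60, 0.70, 0.85]

print(f"r(MRCBS, CAI) on toy data = {pearson_r(toy_mrcbs, toy_cai):.3f}")
```

In practice the two lists would hold the MRCBS and CAI (or RCA, GC, GC3) values of the same genes in the same order.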

Fig. 3

RCA plotted against MRCBS for each protein-coding gene in the Arabidopsis thaliana genome.

Fig. 4

CAI plotted against MRCBS for each protein-coding gene in the Arabidopsis thaliana genome.


Correlation of protein and mRNA expression levels with MRCBS

Here we compare our results with experimental datasets. The value of a codon-based expression indicator can be appreciated by comparing it with experimental gene expression data. Of course, a codon-based expression indicator yields a static value, whereas gene expression is a dynamic process with very different expression levels under different conditions. The expression data used in this study stem from Gene Expression Omnibus (GEO) datasets. In the GEO dataset (GEO accession: GSM2473182), protein expression levels were quantified by RMA (Relative Molecular Abundance) signal intensity. For the entire group of selected genes (20,900 genes) for which a complete data set could be generated along with the codon-based expression indicators, the Pearson correlation coefficient between CAI and MRCBS is 0.901964. The pair-wise correlation coefficients between protein expression level and MRCBS, CAI, RCA and GC are 0.268321, 0.253094, 0.283545 and 0.206581, respectively. The correlation is worse with GC3 (0.049775). For genes with high RMA signal intensity (>7.59), the pair-wise correlation coefficients are better (0.386227, 0.337139, 0.303723, 0.251336 and 0.290886) [Suppl. Fig. 3, Suppl. Fig. 4, Suppl. Fig. 5, Suppl. Fig. 6, Suppl. Fig. 7].
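The comparison between the full gene set and the high-intensity subset can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names and the toy values are assumptions, and only the 7.59 cutoff is taken from the text.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def pearson_above(index, intensity, threshold):
    """Pearson r restricted to genes whose signal intensity exceeds threshold."""
    kept = [(i, s) for i, s in zip(index, intensity) if s > threshold]
    if len(kept) < 2:
        raise ValueError("fewer than two genes above the threshold")
    xs, ys = zip(*kept)
    return pearson_r(xs, ys)


# Toy values (hypothetical): one codon-bias score and one signal intensity
# per gene, in the same gene order.
index = [1.0, 2.0, 3.0, 4.0]
intensity = [8.0, 2.0, 10.0, 11.0]

r_all = pearson_r(index, intensity)             # all genes
r_high = pearson_above(index, intensity, 7.59)  # high-intensity genes only
```

Restricting the calculation to highly expressed genes changes both the sample size and the range of the data, so the two coefficients are best reported side by side, as in the text.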

Suppl. Fig. 3

RMA signal intensity plotted against MRCBS for 20,900 genes of Arabidopsis thaliana available in GEO dataset (GEO accession: GSM2473182).

Suppl. Fig. 4

RMA signal intensity plotted against CAI for 20,900 genes of Arabidopsis thaliana available in GEO dataset (GEO accession: GSM2473182).

Suppl. Fig. 5

RMA signal intensity plotted against RCA for 20,900 genes of Arabidopsis thaliana available in GEO dataset (GEO accession: GSM2473182).

Suppl. Fig. 6

RMA signal intensity plotted against GC for 20,900 genes of Arabidopsis thaliana available in GEO dataset (GEO accession: GSM2473182).

Suppl. Fig. 7

RMA signal intensity plotted against GC3 for 20,900 genes of Arabidopsis thaliana available in GEO dataset (GEO accession: GSM2473182).

In another analysis we compared our results with the radioactive data (González-Pérez et al., 2011). We collected 1797 Arabidopsis genes that have orthologs in yeast and humans and for which mRNA half-life data are available (Calderwood et al., 2016). For this dataset, the gene expression level predicted by MRCBS correlates well with RMA signal intensity (r = 0.50923) [Fig. 5]. The correlation is better than that obtained with CAI (r = 0.470608), RCA (r = 0.442278), GC3 (r = 0.405765) and GC (r = 0.362806) [Suppl. Fig. 8, Suppl. Fig. 9, Suppl. Fig. 10, Suppl. Fig. 11]. This suggests that a quantitative estimate of the expression level by MRCBS performs better than the other expression measures. MRCBS was then compared with the mRNA half-life data. The correlation coefficient of the mRNA half-life data with MRCBS (r = 0.3504) is good [Fig. 6], but weaker than that with RMA signal intensity. Although the pair-wise correlation coefficient between the gene expression levels from the two experimental datasets (r = 0.525273) is good, the agreement between predicted and actual expression level quantified by mRNA half-life data varied greatly across the examined combinations of prediction method and data set (rCAI = 0.31067, rGC3 = 0.310397, rGC = 0.281694 and rRCA = 0.279249) [Suppl. Fig. 12, Suppl. Fig. 13, Suppl. Fig. 14, Suppl. Fig. 15].
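The GC and GC3 baselines against which the codon indices are compared can be computed directly from a coding sequence. The sketch below assumes the simplest definitions: GC is the fraction of G and C among all unambiguous bases, and GC3 is the same fraction at the third position of every complete codon, reading from the first base. Some studies exclude Met, Trp and stop codons from GC3; that variant is not shown, and the function names are hypothetical.

```python
def gc_content(cds):
    """Fraction of G or C among the unambiguous bases of a sequence."""
    seq = cds.upper()
    counted = [b for b in seq if b in "ACGT"]
    if not counted:
        return float("nan")
    return sum(b in "GC" for b in counted) / len(counted)


def gc3_content(cds):
    """Fraction of G or C at third codon positions (reading frame 0)."""
    # seq[2::3] picks bases 3, 6, 9, ... so a trailing partial codon
    # contributes nothing.
    third = [b for b in cds.upper()[2::3] if b in "ACGT"]
    if not third:
        return float("nan")
    return sum(b in "GC" for b in third) / len(third)


# Toy sequences (hypothetical).
gc_example = gc_content("ATGGCC")    # 4 of 6 bases are G or C
gc3_example = gc3_content("ATGGCC")  # third positions are G and C
```

Because GC3 depends on only one base per codon, it is a coarser summary of codon choice than CAI, RCA or MRCBS, which use the full codon.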

Fig. 5

RMA signal intensity plotted against MRCBS for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Suppl. Fig. 8

RMA signal intensity plotted against CAI for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Suppl. Fig. 9

RMA signal intensity plotted against RCA for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Suppl. Fig. 10

RMA signal intensity plotted against GC3 for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Suppl. Fig. 11

RMA signal intensity plotted against GC for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Fig. 6

mRNA half-life data plotted against MRCBS for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Suppl. Fig. 12

mRNA half-life data plotted against CAI for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Suppl. Fig. 13

mRNA half-life data plotted against GC3 for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Suppl. Fig. 14

mRNA half-life data plotted against GC for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Suppl. Fig. 15

mRNA half-life data plotted against RCA for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

To assess the value of MRCBS for predicting protein expression levels in Arabidopsis thaliana, we plotted the two experimental data sets against MRCBS as well as RCA and CAI. The distribution patterns of both sets of expression data with respect to these expression indicators are highly similar. Comparing the performance of MRCBS, CAI and RCA as numerical indices of gene expression level in terms of the Pearson correlation coefficient with the expression data, we observed that MRCBS generally performs better than CAI and RCA.

Conclusion

Our study demonstrates that MRCBS may be a useful tool for predicting highly expressed genes. Our method rests on the hypothesis that codon usage pattern is largely responsible for the regulation of gene expression, which can occur during transcription or at the level of protein translation. Although the concept of predicting gene expression level from the codon usage pattern was proposed a decade ago, only recently have these methods been successfully applied to the identification of highly expressed genes in various bacterial and eukaryotic genomes. The improved reliability of MRCBS for estimating expression levels in the Arabidopsis genome makes this index a superior choice for undertaking and benchmarking predictions of gene expression. In this study, various approaches to estimating gene expression level based on codon usage were applied to the Arabidopsis genome with the objective of testing the present alternative method of studying whole-genome gene expression. Our results demonstrate significant heterogeneity in codon usage among genes in the Arabidopsis genome. Furthermore, the gene expression level predicted by CAI was found to correlate well with MRCBS. In addition, since the expression levels measured by current DNA microarray and proteomics technologies represent the accumulated results of expression and degradation, the results from this computational approach could be used as reference data for calibrating and better interpreting experimental data. For example, a low level of expression in proteomic or microarray data for a gene with a high PHE index might suggest that degradation is involved in regulating the expression level of that gene. Although most of the PHE genes are essential genes responsible for the habitat, energy sources and lifestyle of an organism, the study also identified a number of functionally unknown genes as PHE genes based on their codon usage profile. Further investigation of these genes by an integrated computational and experimental approach will enhance our knowledge of metabolism. Given that a large volume of experimental data is available on this plant, such a novel method may be helpful in extracting meaningful information for understanding the details of functional genomics.

The following are the supplementary data related to this article. Distribution of GC content of all protein-coding genes in the Arabidopsis thaliana genome. Distribution of GC3 content of all protein-coding genes in the Arabidopsis thaliana genome.

Suppl. Table 1

A list of some well characterized PHE genes in Arabidopsis thaliana.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declaration of interests

We, the authors, declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

23 in total

1. Predicted highly expressed genes of diverse prokaryotic genomes.

Authors: S Karlin; J Mrázek
Journal: J Bacteriol Date: 2000-09 Impact factor: 3.490

Review 2. Codon bias and heterologous protein expression.

Authors: Claes Gustafsson; Sridhar Govindarajan; Jeremy Minshull
Journal: Trends Biotechnol Date: 2004-07 Impact factor: 19.536

3. Predicted highly expressed genes in archaeal genomes.

Authors: Samuel Karlin; Jan Mrázek; Jiong Ma; Luciano Brocchieri
Journal: Proc Natl Acad Sci U S A Date: 2005-05-09 Impact factor: 11.205

Review 4. Codon bias and gene expression.

Authors: C G Kurland
Journal: FEBS Lett Date: 1991-07-22 Impact factor: 4.124

5. Early transcriptional defense responses in Arabidopsis cell suspension culture under high-light conditions.

Authors: Sergio González-Pérez; Jorge Gutiérrez; Francisco García-García; Daniel Osuna; Joaquín Dopazo; Óscar Lorenzo; José L Revuelta; Juan B Arellano
Journal: Plant Physiol Date: 2011-04-29 Impact factor: 8.340

6. Transcript Abundance Explains mRNA Mobility Data in Arabidopsis thaliana.

Authors: Alexander Calderwood; Stanislav Kopriva; Richard J Morris
Journal: Plant Cell Date: 2016-03-07 Impact factor: 11.277

7. Relative codon adaptation: a generic codon bias index for prediction of gene expression.

Authors: Jesse M Fox; Ivan Erill
Journal: DNA Res Date: 2010-05-07 Impact factor: 4.458

Review 8. Codon usage patterns in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens; a review of the considerable within-species diversity.

Authors: P M Sharp; E Cowe; D G Higgins; D C Shields; K H Wolfe; F Wright
Journal: Nucleic Acids Res Date: 1988-09-12 Impact factor: 16.971

9. Comparative Analysis of Predicted Gene Expression among Crenarchaeal Genomes.

Authors: Shibsankar Das; Brajadulal Chottopadhyay; Satyabrata Sahoo
Journal: Genomics Inform Date: 2017-03-29

10. Predicting gene expression level from relative codon usage bias: an application to Escherichia coli genome.

Authors: Uttam Roymondal; Shibsankar Das; Satyabrata Sahoo
Journal: DNA Res Date: 2009-01-08 Impact factor: 4.458
