Codon usage pattern and predicted gene expression in Arabidopsis thaliana.

Satyabrata Sahoo, Shib Sankar Das, Ria Rakshit.

Abstract

Research on predicting highly expressed genes in plant genome sequences has been going on for decades. The codon usage pattern of genes in the Arabidopsis thaliana genome is a classical topic for plant biologists because of its significance for understanding molecular plant biology. Here we use a gene expression profiling methodology based on the score of modified relative codon bias (MRCBS) to elucidate the expression pattern of genes in Arabidopsis thaliana. MRCBS relies exclusively on sequence features to identify highly expressed genes. In this study, a critical analysis of predicted highly expressed (PHE) genes in Arabidopsis thaliana has been performed using MRCBS as a numerical estimator of gene expression level. Consistent with previous results, our study indicates that codon composition plays an important role in the regulation of gene expression. We found a systematically strong correlation between MRCBS and CAI (codon adaptation index) and other expression measures. Additionally, MRCBS correlates well with experimental gene expression data. Our study highlights the relationship between gene expression and compositional signature in relation to codon usage bias and sets the ground for further investigation of the evolution of protein-coding genes in the plant genome.
© 2019 The Authors.

Keywords:  Arabidopsis thaliana; CAI; Codon usage bias; GC content; Gene expression; PHE genes

Abbreviations:  CAI, Codon adaptation index; CP, Chloroplast (Pltd); GEO, Gene Expression Omnibus; MADS, Minichromosome maintenance1, Agamous, Deficiens and Serum response factor; MBP, Megabase pair; MRCBS, Score of modified relative codon bias; MT, Mitochondrion; PHE, Predicted Highly Expressed; RCA, Relative Codon Adaptation; RCB, Relative codon bias; RCBS, Relative Codon Bias Strength; RMA, Relative Molecular Abundance; RP, Ribosomal protein; SAGE, Serial Analysis of Gene Expression; TAIR, The Arabidopsis Information Resource

Year:  2019        PMID: 32550546      PMCID: PMC7286098          DOI: 10.1016/j.gene.2019.100012

Source DB:  PubMed          Journal:  Gene X        ISSN: 2590-1583


Introduction

Arabidopsis thaliana has proven to be a model experimental organism for essentially developing plant biology at the molecular level. Undoubtedly, any useful insight in understanding the expression of functional proteins of Arabidopsis thaliana will contribute to the development of plant research as well as in the field of modern biotechnology. It is well known that the synthesis of every protein molecule is directed by the arrangement of genetic codes in a genomic DNA sequence. The genetic code uses sixty-one codons to encode 20 amino acids and three codons to terminate translation in the process of protein synthesis. The degeneracy of the genetic code suggests that there must be many alternative nucleotide sequences to encode the same protein. The codon usage pattern varies significantly between different organisms, and also between genes which are expressed at different levels in the same organism. A number of hypotheses prevail regarding the factors which influence the codon usage pattern. Attempts have been made to explain the codon distributions in the protein-coding genes as well as the changes in codon usages among different synonymous codons in each organism (Sharp et al., 1988; Brandis and Hughes, 2016; Sharp and Li, 1987; Ikemura, 1981; Hockenberry et al., 2014; Lee et al., 2010). It is well discussed in the literature that organisms might be subjected to codon biases of different origins. In fact, it is rather difficult to decide the most common dominant codon bias of a genome. Some researchers have speculated that codon bias that tends to reduce the diversity of isoacceptor tRNAs may reduce the metabolic load (Gustafsson and Govindarajan, 2004; Akashi, 1994; Ikemura, 1985). Many other analyses have also revealed that there are many other factors like nucleotide compositional constraint, codon anticodon interaction, amino acid conservation etc. which may also influence the codon usage pattern of a genome. 
Whatever the molecular basis for codon bias may be, it is evident that codon bias can have a significant impact on the expression of functional proteins. Translational selection pressure or protein secondary structure may have a profound effect on codon bias. It is generally thought that a balance between mutation and natural selection on translational efficiency yields a correlation between codon bias and the rate of gene expression, such that highly expressed genes often have stronger relative codon bias (RCB) than genes expressed at lower levels (Kurland, 1991; Hiraoka et al., 2009). The objective of this work is to identify and analyze PHE genes and the codon usage pattern in Arabidopsis thaliana. Our analyses of E. coli, yeast, Synechocystis and archaeal genomes support the hypothesis that each genome has evolved a codon usage pattern promoting its gene expression level (Roymondal et al., 2009; Das et al., 2009; Das et al., 2012; Sahoo and Das, 2014a; Das et al., 2017). With the advent of modern technologies, several high-throughput experiments are widely used to identify highly expressed genes. The most commonly used technique for studying large-scale gene expression is the cDNA microarray. In addition, other techniques such as 2D gel electrophoresis, mass spectrometry, chromatin immunoprecipitation, DNA chip technology and Serial Analysis of Gene Expression (SAGE) have been developed for the purpose. All these experiments require a wide range of conditions to be matched and a massive investment of time and resources. To overcome these obstacles to identifying highly expressed genes in the vast majority of organisms, we must look beyond direct experimental methods. We therefore focused our study on developing a computational methodology that can be used to study the large-scale gene expression profile of an organism. 
Based on the hypothesis that highly expressed genes are often characterized by strong compositional bias in terms of codon usage (Ikemura, 1981; Ikemura, 1985; Kurland, 1991; Sahoo and Das, 2014b; Karlin and Mrazek, 2000; Karlin et al., 2005; Carbone et al., 2003; Supek and Vlahovicek, 2005; Supek and Vlahovicek, 2010), a number of software tools such as the Codon Adaptation Index (CAI) (Sharp and Li, 1987), Relative Codon Adaptation (RCA) (Fox and Erill, 2010) and Relative Codon Bias Strength (RCBS) (Roymondal et al., 2009; Das et al., 2009) have been developed to provide numerical indices that predict the expression level of genes. There are no universal standards that make these results suitable for comparative analysis. Moreover, most of these commonly used computational approaches depend on knowledge of the codon bias of a reference set of highly expressed genes. MRCBS, in contrast, has been devised as an alternative model that predicts gene expression level from codon composition in such a way that the score can be calculated without a previously selected reference set of highly expressed genes. In fact, MRCBS captures highly expressed genes better than several other commonly used measures (Das et al., 2012; Sahoo and Das, 2014a; Das et al., 2017; Sahoo and Das, 2014b). Here, we investigated the gene expression profile and the variation in synonymous codon usage pattern of the Arabidopsis thaliana genome. Arabidopsis thaliana is a small flowering plant with a relatively short life cycle and is the first plant to have its genome completely sequenced (The Arabidopsis Genome Initiative, 2000). Since 1943, Arabidopsis thaliana has been widely used as experimental biological material in plant research laboratories around the world. The small size of its genome, approximately 135 MBP on 5 chromosomes, makes it a useful model for plant sciences. 
Extensive work has been done by plant biologists to assign functions to its 2500 genes and the 3500 proteins they encode. The latest information on Arabidopsis research is available from The Arabidopsis Information Resource (TAIR). The small genome size and the availability of the complete DNA sequence of Arabidopsis thaliana have attracted the attention of a wide range of scientists, including evolutionary biologists and biotechnology companies. The rapid life cycle, unusual properties of inheritance and the vast information about its genealogy suggest that this organism may be used as a useful tool by plant biologists. Finally, its important role in the study of plant-pathogen interaction makes it very attractive to biotechnology companies for industrial and research uses. Thus, the gene expression profile of Arabidopsis thaliana is expected to make important contributions to plant sciences.

Materials and methods

The whole genome sequence of Arabidopsis thaliana, along with its gene annotations, was taken from NCBI GenBank. All gene sequences under study, including those annotated as hypothetical, were extracted from GenBank accession nos. NC_003070.9 (Chromosome 1), NC_003071.7 (Chromosome 2), NC_003074.8 (Chromosome 3), NC_003075.7 (Chromosome 4), NC_003076.8 (Chromosome 5), NC_001284.2 (Mitochondrion, MT) and NC_000932.1 (Chloroplast, Pltd). In the present communication, we report the codon usage pattern and gene expression in the Arabidopsis thaliana genome. For this purpose, a variety of computational tools, namely CAI, relative codon adaptation (RCA), GC3 and MRCBS, have been used in this study.

The codon adaptation index, CAI, is given by (Sharp and Li, 1987)

$$\mathrm{CAI} = \left( \prod_{i=1}^{N} w_i \right)^{1/N}$$

where $N$ is the number of codons in the gene and the relative adaptiveness $w_i$ is defined as

$$w_i = \frac{f_i}{f_{\max}(aa)}$$

Here $f_i$ is the frequency of the $i$-th codon, and $f_{\max}(aa)$ is the maximum frequency of the codon most often used for encoding amino acid $aa$ in a set of highly expressed genes of the particular genome. The score measured by CAI ranges from 0 to 1; the higher the CAI value, the more likely the gene is to be highly expressed.

The relative codon adaptation (RCA) is computed as (Fox and Erill, 2010)

$$\mathrm{RCA} = \left( \prod_{i=1}^{L} \mathrm{RCA}_{xyz}(i) \right)^{1/L}$$

where $L$ is the length of a gene (in codons) and $\mathrm{RCA}_{xyz}(i)$, the value for the codon $xyz$ at position $i$ of the gene, is defined by

$$\mathrm{RCA}_{xyz} = \frac{f(xyz)}{f_1(x)\, f_2(y)\, f_3(z)}$$

Here $f(xyz)$ is the observed relative frequency of codon $xyz$ in any reference gene set, and $f_n(m)$ is the observed relative frequency of base $m$ at codon position $n$ in the same reference set.

GC3 measures the frequency of G or C at the third position of synonymous codons and can be used as an index of codon bias. It is measured by

$$\mathrm{GC3} = \frac{\sum_{xyz \in NNS} f_{xyz}}{\sum_{xyz \in NNN} f_{xyz}}$$

where, in the codon patterns, N = any base, S = G or C, and $f_{xyz}$ is the observed frequency of codon $xyz$.

The score of modified relative codon bias, MRCBS, measures the expression level of a gene and is defined as (Das et al., 2012; Sahoo and Das, 2014a; Das et al., 2017; Sahoo and Das, 2014b)

$$\mathrm{MRCBS} = \left( \prod_{i=1}^{N} \frac{\mathrm{RCBS}_{xyz}(i)}{\mathrm{RCBS}_{\max}(aa)} \right)^{1/N}$$

where

$$\mathrm{RCBS}_{xyz} = \frac{f(xyz)}{f_1(x)\, f_2(y)\, f_3(z)}$$

Here $f(xyz)$ is the normalized codon frequency of a codon $xyz$ and $f_n(m)$ is the normalized frequency of base $m$ at codon position $n$ in a gene. $\mathrm{RCBS}_{\max}(aa)$ is the maximum value of RCBS of the codons encoding the same amino acid $aa$ in the same reference set, and $N$ is the codon length of the query sequence. The score of the modified relative codon bias ranges from 0 to 1.

The numerical value computed by this method may be used to rank the set of genes with respect to codon bias towards gene expression. It is suggested that a threshold score of the modified relative codon bias identifies the highly expressed genes. However, because codon assignments and codon usage patterns evolve as an adaptive response of genomes, the threshold score for identifying highly expressed genes varies from genome to genome; the methodology used to calculate the threshold score is described in Sahoo and Das (2014a). In this work, the different expression level predictors have been computed by comparing the codon usage bias of each gene with the profile of universally functional genes. The predicted highly expressed (PHE) genes are then characterized on the basis of the strength of the codon usage bias derived from the algorithms described in the literature, and a gene is identified as a PHE gene provided its MRCBS exceeds the threshold value. Pearson correlation coefficients (r) between different codon usage bias indices have been computed for a systematic analysis of the gene expression profile of the genome under study.

The impact score of a codon $xyz$ in a gene sequence is defined by $\mathrm{MRCBS}(xyz)$ and is used to describe the codon usage profile of the genome under study. If $\bar{x}$ and $\mu$ denote the sample mean and population mean of the impact score for a particular codon, respectively, and $\sigma$ the population standard deviation, then the z score of the test statistic is given by

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}}$$

where $N$ is the total number of codons. The impact codons are then identified from the impact score of a codon based on the level of significance of the z score of the test statistic.
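As a rough illustration of the sequence-only indices described above, the following Python sketch computes GC3 and an MRCBS-style score for a single coding sequence. It assumes that MRCBS is the geometric mean, over the codons of the gene, of RCBS divided by the largest RCBS among synonymous codons, and it takes that maximum over the codons observed in the gene itself; both choices, along with the function names `split_codons`, `gc3` and `mrcbs`, are assumptions made for this example and may differ from the authors' implementation. For simplicity, GC3 is computed here over all sense codons.

```python
import math
from collections import Counter

# Standard genetic code (NCBI translation table 1), built in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))


def split_codons(cds):
    """Split a coding sequence into sense codons.

    Stop codons, codons with ambiguous bases and an incomplete tail are dropped.
    """
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return [c for c in codons if CODON_TABLE.get(c, "*") != "*"]


def gc3(cds):
    """Fraction of sense codons with G or C at the third position."""
    codons = split_codons(cds)
    return sum(c[2] in "GC" for c in codons) / len(codons)


def mrcbs(cds):
    """MRCBS-style score of one gene (hypothetical reading of the definition)."""
    codons = split_codons(cds)
    n = len(codons)
    codon_freq = {c: k / n for c, k in Counter(codons).items()}
    # Base counts at codon positions 1, 2 and 3 within this gene.
    pos_counts = [Counter(c[p] for c in codons) for p in range(3)]

    def rcbs(codon):
        expected = 1.0
        for p in range(3):
            expected *= pos_counts[p][codon[p]] / n
        return codon_freq[codon] / expected

    # Assumption: RCBS_max(aa) is taken over synonymous codons seen in this gene.
    rcbs_max = {}
    for codon in codon_freq:
        aa = CODON_TABLE[codon]
        rcbs_max[aa] = max(rcbs_max.get(aa, 0.0), rcbs(codon))

    # Geometric mean computed in log space to avoid underflow on long genes.
    log_sum = sum(math.log(rcbs(c) / rcbs_max[CODON_TABLE[c]]) for c in codons)
    return math.exp(log_sum / n)


print(gc3("ATGGCTGCTAAA"))  # 0.25
```

Because every ratio in the product is at most 1, the score stays in the interval (0, 1], and a gene that uses a single codon for each amino acid scores exactly 1 under these assumptions.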

Results and discussion

In the present study, we have analyzed the gene expression profile of the Arabidopsis genome and identified predicted highly expressed (PHE) genes with respect to MRCBS. We have measured the expression pattern and codon usage bias of all protein-coding genes in the genome under study. Our study includes 12,645 protein-coding sequences of chromosome 1, 7596 protein-coding sequences of chromosome 2, 9474 protein-coding sequences of chromosome 3, 7426 protein-coding sequences of chromosome 4, 10,993 protein-coding sequences of chromosome 5, 117 protein-coding sequences of the mitochondrion (MT) and 85 protein-coding sequences of the chloroplast (Pltd, CP). Some basic information on the Arabidopsis genome is given in Table 1. The expression level of all protein-coding genes was calculated by MRCBS and compared with other codon usage models such as CAI and RCA. The threshold score for identifying highly expressed genes in Arabidopsis thaliana has been calculated to be 0.77. The GC content of the genome under study is 44.26%. The overall GC3 score is 0.4215. Many researchers have argued that GC content or GC3 may be viewed as the primary influence on the codon usage pattern and thus on the expression profile. Table 2 displays the statistics of PHE genes and the top 20 PHE genes of the Arabidopsis thaliana genome along with their functions and scores calculated with our approach (MRCBS).
Table 1

Some basic information of the Arabidopsis thaliana genome.

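| Genome | Number of genes | Average length | GC content (%) | GC3 | Number of PHE genes | PHE gene % |
|---|---|---|---|---|---|---|
| Chromosome 1 | 12,645 | 1326 | 0.44 | 0.42 | 381 | 3.0% |
| Chromosome 2 | 7596 | 1232 | 0.44 | 0.42 | 300 | 3.9% |
| Chromosome 3 | 9474 | 1283 | 0.44 | 0.42 | 326 | 3.4% |
| Chromosome 4 | 7425 | 1320 | 0.44 | 0.42 | 225 | 3.0% |
| Chromosome 5 | 10,993 | 1304 | 0.44 | 0.42 | 368 | 3.3% |
| Chloroplast genome | 85 | 929 | 37.5 | 0.27 | 0 | 0 |
| Mitochondrial genome | 117 | 586 | 44.6 | 0.43 | 0 | 0 |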
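The PHE percentages in Table 1 follow directly from the gene counts. The short Python sketch below recomputes them from the tabulated values for the five nuclear chromosomes and also pools the chromosomes; the names `TABLE_1` and `phe_percent` are introduced here purely for illustration.

```python
# (number of genes, number of PHE genes) per nuclear chromosome, copied from Table 1
TABLE_1 = {
    "Chromosome 1": (12645, 381),
    "Chromosome 2": (7596, 300),
    "Chromosome 3": (9474, 326),
    "Chromosome 4": (7425, 225),
    "Chromosome 5": (10993, 368),
}


def phe_percent(n_genes, n_phe):
    """Percentage of genes classified as predicted highly expressed."""
    return 100.0 * n_phe / n_genes


# Per-chromosome percentages, rounded to one decimal as in the table.
per_chromosome = {name: round(phe_percent(g, p), 1) for name, (g, p) in TABLE_1.items()}

# Pooled percentage over all five nuclear chromosomes.
total_genes = sum(g for g, _ in TABLE_1.values())
total_phe = sum(p for _, p in TABLE_1.values())
pooled = round(phe_percent(total_genes, total_phe), 1)

for name, pct in per_chromosome.items():
    print(f"{name}: {pct}%")
print(f"Nuclear chromosomes pooled: {pooled}%")
```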
Table 2

Characteristics of PHE genes and top 20 genes with the highest predicted expression levels for Arabidopsis thaliana genome.

| Average length | Average GC content | Average GC3 content | % of PHE RP genes | % of PHE hypothetical genes |
|---|---|---|---|---|
| 658 | 0.461 | 0.475 | 17.70% | 8.63% |

Top 20 genes:

| Locus tag/gene name | Function | MRCBS |
|---|---|---|
| AT5G03710 | Replication factor C large subunit | 0.942377 |
| AT3G56020 | Ribosomal protein L41 family | 0.902928 |
| AT5G03850 | Nucleic acid-binding, OB-fold-like protein | 0.885142 |
| RPS28 | Ribosomal protein S28 | 0.884064 |
| AT3G46430 | ATP synthase | 0.877127 |
| AT3G08520 | Ribosomal protein L41 family | 0.872734 |
| AT2G04621 | Transmembrane protein | 0.869109 |
| AT5G56670 | Ribosomal protein S30 family protein | 0.868022 |
| AT3G10090 | Nucleic acid-binding, OB-fold-like protein | 0.866286 |
| RPL23AA | Ribosomal protein L23AA | 0.86058 |
| AT2G19730 | Ribosomal L28e protein family | 0.860542 |
| RS27A | Ribosomal protein S27 | 0.860165 |
| AT4G27090 | Ribosomal protein L14 | 0.856987 |
| AT2G14285 | Small nuclear ribonucleoprotein family protein | 0.856773 |
| AT3G11120 | Ribosomal protein L41 family | 0.855905 |
| AT5G16130 | Ribosomal protein S7e family protein | 0.854895 |
| AT2G31490 | Neuronal acetylcholine receptor subunit alpha-5 | 0.854269 |
| CAM3 | Calmodulin 3 | 0.852098 |
| RPS15 | Cytosolic ribosomal protein S15 | 0.848976 |
| CAM2 | Calmodulin 2 | 0.847033 |
The codon usage profile of the Arabidopsis genome has been described in terms of the average impact score over 27,046 complete protein-coding sequences of the genome [Fig. 1]. Although most amino acids can be specified by more than one codon, only a subset of the potential codons is used in highly expressed genes [Table 3]. There are no impact codons coding for His, Thr and Val in the Arabidopsis genomes studied here. The impact codons in Arabidopsis are found to be mostly used in coding Phe (ttt,ttc), Leu (ttg,ctt,ctc), Ile (atc), Met (atg), Tyr (tac), Gln (caa,cag), Asn (aac), Lys (aaa,aag), Asp (gat), Glu (gaa,gag), Ser (tct,tcc,tca,agc), Pro (cct,cca), Ala (gct), Cys (tgc), Trp (tgg), Arg (aga), Gly (ggt,gga). Importantly, these codons do not reflect any simple compositional bias. Not all of the preferred (impact) codons are GC rich, so GC content and GC3 may not accurately represent the trend in codon usage. The selection of preferred codons that optimize the translational rate possibly depends on the kinetics of the codon-anticodon interaction.
Fig. 1

Average impact score of codons in Arabidopsis thaliana genome.
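The significance test used to pick out impact codons (a z score comparing a sample mean with a population mean, as described in Materials and methods) can be sketched as follows. This is a minimal illustration assuming a standard one-sample z test with a normal reference distribution; the function names and the numbers in the usage line are hypothetical and are not taken from the study.

```python
import math


def z_score(sample_mean, population_mean, population_sd, n):
    """z = (sample mean - population mean) / (population sd / sqrt(n))."""
    return (sample_mean - population_mean) / (population_sd / math.sqrt(n))


def two_sided_p(z):
    """Two-sided p-value of a z score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical impact scores: sample mean 1.2, population mean 1.0,
# population standard deviation 0.5, n = 100 codons.
z = z_score(1.2, 1.0, 0.5, 100)
print(z, two_sided_p(z))
```

A codon would then be flagged as an impact codon when the p-value falls below the chosen significance level; the level itself is not stated in the text and would have to be supplied.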

Table 3

Codon/Amino Acid Usage of the Arabidopsis thaliana CP/MT genome and nuclear genome.

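Codon usage values are given for each genome.

| Amino acid | Codon | CP genome | MT genome | Nuclear genome | PHE genes |
|---|---|---|---|---|---|
| Ala | GCA | 0.924057 | 0.956196 | 0.977693 | 0.965759 |
| | GCC | 1.068317 | 1.015433 | 0.69599 | 0.821385 |
| | GCG | 0.633739 | 0.6198 | 0.527703 | 0.334181 |
| | GCU | 1.278889 | 1.181231 | 1.175584 | 1.84292 |
| Cys | UGC | 0.477558 | 0.85503 | 1.120411 | 1.100364 |
| | UGU | 0.654264 | 0.881925 | 0.975416 | 0.88164 |
| Asp | GAC | 0.620287 | 0.891631 | 0.884973 | 0.732988 |
| | GAU | 1.027884 | 1.099495 | 1.123944 | 0.928023 |
| Glu | GAA | 1.501542 | 1.667856 | 1.379294 | 1.363214 |
| | GAG | 0.907668 | 1.278562 | 1.397898 | 1.38124 |
| Phe | UUC | 1.53997 | 1.704901 | 1.857261 | 2.556277 |
| | UUU | 1.254081 | 1.45126 | 1.225468 | 1.079788 |
| Gly | GGA | 1.704801 | 1.621551 | 1.7502 | 2.544636 |
| | GGC | 1.214503 | 0.944487 | 0.844881 | 0.556763 |
| | GGG | 1.827965 | 1.327694 | 0.804863 | 0.489334 |
| | GGU | 1.158149 | 1.105812 | 1.163195 | 1.453484 |
| His | CAC | 0.609372 | 0.64853 | 0.762579 | 0.823344 |
| | CAU | 0.740304 | 0.914712 | 0.73468 | 0.544987 |
| Ile | AUA | 0.792638 | 0.786369 | 0.620441 | 0.243809 |
| | AUC | 1.223305 | 1.097218 | 1.121274 | 1.320139 |
| | AUU | 1.132562 | 0.783437 | 0.792729 | 0.782475 |
| Lys | AAA | 1.387184 | 1.427459 | 1.386644 | 1.296746 |
| | AAG | 0.793639 | 1.451157 | 1.58078 | 2.442647 |
| Leu | CUA | 0.674913 | 0.877658 | 0.74541 | 0.464587 |
| | CUC | 0.947252 | 1.11581 | 1.490388 | 1.778466 |
| | CUG | 0.633064 | 0.892686 | 0.803556 | 0.490864 |
| | CUU | 0.894811 | 1.108499 | 1.383461 | 1.59222 |
| | UUA | 1.459008 | 1.022769 | 0.899226 | 0.514989 |
| | UUG | 1.459008 | 1.218262 | 1.677031 | 1.828657 |
| Asn | AAC | 0.904617 | 0.881605 | 1.164078 | 1.109241 |
| | AAU | 1.042164 | 0.929833 | 0.754519 | 0.393298 |
| Pro | CCA | 0.921901 | 1.153069 | 1.487962 | 2.096139 |
| | CCC | 1.468882 | 1.083116 | 0.622105 | 0.51766 |
| | CCG | 1.036982 | 0.794335 | 0.836171 | 0.537951 |
| | CCU | 1.069133 | 1.229223 | 1.306557 | 1.772502 |
| Gln | CAA | 1.734326 | 1.508288 | 1.356156 | 1.385078 |
| | CAG | 0.843424 | 1.037337 | 1.114674 | 1.24047 |
| Arg | AGA | 0.808032 | 1.175478 | 1.511002 | 1.794382 |
| | AGG | 0.560481 | 1.134779 | 0.929007 | 1.144426 |
| | CGA | 1.283031 | 1.098178 | 0.785128 | 0.515815 |
| | CGC | 0.929904 | 0.773274 | 0.593302 | 0.483748 |
| | CGG | 1.120378 | 1.005459 | 0.622907 | 0.173957 |
| | CGU | 1.135756 | 0.742584 | 0.820779 | 1.376508 |
| Ser | AGC | 0.554621 | 1.050798 | 1.191272 | 0.949226 |
| | AGU | 0.828491 | 0.854586 | 0.846464 | 0.537035 |
| | UCA | 0.89995 | 1.209875 | 1.627653 | 1.527831 |
| | UCC | 2.178256 | 1.441785 | 1.260763 | 1.401957 |
| | UCG | 0.817047 | 0.915688 | 0.908629 | 0.641353 |
| | UCU | 1.07113 | 1.40707 | 1.726912 | 2.176242 |
| Thr | ACA | 0.793609 | 0.828891 | 0.960773 | 0.883517 |
| | ACC | 1.172183 | 0.875213 | 0.770601 | 0.86331 |
| | ACG | 0.501757 | 0.553283 | 0.513637 | 0.230112 |
| | ACU | 0.979165 | 0.831844 | 0.799601 | 1.013725 |
| Val | GUA | 0.764515 | 0.719545 | 0.468802 | 0.320551 |
| | GUC | 0.694481 | 0.676856 | 0.734463 | 0.895895 |
| | GUG | 0.607432 | 0.705351 | 0.880408 | 0.890438 |
| | GUU | 0.657571 | 0.659398 | 0.933754 | 1.208662 |
| Tyr | UAC | 0.820827 | 0.849145 | 1.097255 | 1.46001 |
| | UAU | 1.283358 | 1.066362 | 0.725359 | 0.473723 |
| Met | AUG | 1.806166 | 1.39968 | 1.446542 | 1.755233 |
| Trp | UGG | 2.457201 | 1.521081 | 1.542432 | 1.564577 |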
The large data set analyzed here revealed a strong bias towards the use of a distinct set of preferred codons in genes with high cytoplasmic mRNA levels. In contrast, genes with low mRNA levels showed very little synonymous codon usage bias. Usage bias has been proposed to result from translational selection, since using a codon that is translated by an abundant tRNA species is hypothesized to boost translational efficiency. Codon frequencies are found to vary between genes in the same genome. The standard version of the genetic code includes 61 sense codons and three stop codons. Although almost all organisms make the same codon assignments for each amino acid, the preferred use of individual codons varies greatly among genes. The overall nucleotide composition of the genome, which influences the codon usage pattern, introduces selective forces acting on highly expressed genes to improve the efficiency of translation. It is now widely accepted that synonymous codon preferences in unicellular organisms are affected by the cellular amounts of isoacceptor tRNA species. However, we observe that not all tRNA genes corresponding to impact codons are detected by tRNAscan-SE. Many tRNAs can translate more than one codon, with variable ability, and it is suggested that impact codons favor translational efficiency. Since highly expressed genes use a preferred set of optimal codons in accordance with their respective tRNA levels, this observation might find another important application in tRNA-finding algorithms. Expression profiles of the genes are determined by calculating MRCBS for each gene, and their distribution is shown in Fig. 2. The majority of genes (90%) have MRCBS values lying between 0.65 and 0.75, and the mean and median values are 0.3870 and 0.3295, respectively. 
Fig. 2

Distribution of MRCBS of all protein-coding genes in Arabidopsis thaliana genome.

Only 3.3% of genes have MRCBS values >0.77. The percentage of PHE genes varies between 3% and 4% across the Arabidopsis thaliana chromosomes, whereas no highly expressed genes are predicted in the CP/MT genomes. The overall variation of the GC and GC3 content of the genes is depicted in Suppl. Fig. 1 and Suppl. Fig. 2, respectively. The majority of genes have a GC3 score lying between 0.3 and 0.6, and 88.5% of genes have a GC content lying between 0.4 and 0.5. We observed that the percentage of PHE genes varies from chromosome to chromosome and is independent of the GC content or GC3 score of these genes. In fact, we failed to find any correlation between gene expression and GC content or GC3 score. It is well established that highly expressed genes display more biased codon usage than lowly expressed genes [Table 3]. We observed that the PHE genes of Arabidopsis thaliana mostly include ribosomal protein (RP) genes, translation initiation factors, translation elongation factors, MADS box transcription factors, membrane traffic proteins, transmembrane proteins, chaperones, heat shock proteins, histones, ubiquitin, nucleic acid binding proteins and many stress and energy metabolism genes. However, not all RP genes of Arabidopsis thaliana belong to the PHE gene class. Table 2 reports the statistics of PHE genes. The percentage of PHE genes in Arabidopsis thaliana is 3.3%, of which only 17.7% are RP genes. It is remarkable that 99.21% of RP genes in the yeast genome and almost all RP genes in the E. coli genome fall in the PHE class of genes. An average of 65.56% of RP genes in archaeal genomes are PHE. In Arabidopsis thaliana, 255 out of 561 RP genes are PHE. Thus only a small fraction of the RP genes of Arabidopsis thaliana has a high predicted expression level, in contrast to E. coli, yeast and Archaea. The top 20 genes with the highest predicted expression levels for the Arabidopsis thaliana genome are displayed in Table 2. Our analysis predicted 1063 highly expressed genes in Arabidopsis thaliana. 
A list of well-characterized PHE genes has been displayed in Suppl. Table 1. It is worth noticing that these genes are separated into different functional categories. Table 4 displays a set of well-characterized PHE genes segregated into different functional categories.
Suppl. Fig. 1

Distribution of GC content of all protein-coding genes in Arabidopsis thaliana genome.

Suppl. Fig. 2

Distribution of GC3 content of all protein-coding genes in Arabidopsis thaliana genome.

Table 4

A list of potential PHE genes segregated into different functional categories.

| Functional category | Genes |
|---|---|
| Transcription factor | AT4G10480, AT3G12390, AT5G09920, AT4G35900, AT2G17770, AT1G54830, AT5G53980, AT1G56170 |
| MADS box transcription factor | AT1G69120, AT1G31140, AT1G50780, AT1G71692 |
| Chromatin/chromatin binding protein | AT3G03590, AT1G01160, AT1G75060 |
| Histone | AT4G40040, AT5G59870, AT5G12910, AT5G10390 |
| Tubulin | TUA2, TUA3, TUA4, TUA5, TUB2, TUB3, TUB4, TUB1, TUB5, TUB7, TUB9, KIS, TUA6 |
| Calcium binding protein | CRT1a, CRT1b, AT5G39670, AT2G41090, AT1G76640 |
| G protein coupled receptor/modulator | AT5G42090, AT5G18520, AT2G30060, AT3G07880 |
| Transmembrane protein | AT2G04621, AT2G01870, AT2G13965, AT5G19875, AT5G03120, AT2G29180, AT3G18800, AT2G25297, AT5G07165, AT2G22080, AT5G16250, AT5G04790, AT1G74458, AT3G28190, AT2G31090, AT1G17090, AT3G14452, AT2G05310, AT3G28193, AT1G65720, AT4G21500, AT5G09225, AT1G16916, AT5G03460, AT1G49310, AT3G42075, AT3G18915, AT2G41905, AT1G67235, AT5G61340, AT1G06515, AT5G19860 |
| Elongation | AT1G56070, AT4G20360, AT3G12915, AT1G07930 |
| Translation initiation factor/elongation factor | AT1G30230, AT2G18110, AT5G19510, AT5G12110, AT2G46280, AT5G35680, AT2G04520, AT4G20980, AT1G26630, AT5G05470, AT1G69410 |
| mRNA processing/splicing | AT3G62840, AT5G44500, AT4G20440, AT4G30220, AT2G14285, AT3G11500, AT2G03870, AT2G23930 |
| Methyltransferase | AT4G34050, AT4G13930, AT5G66550, AT3G03780, AT5G17920 |
| Ligase | AT5G10880, AT1G55570, AT1G55560, AT3G13400, AT3G13390, AT1G66200, AT5G35630, AT3G17820 |
| Calmodulin | CAM1, CAM2, CAM3, CAM5, CAM6, CML42, CML11 |
| Acyltransferase | AT5G11670 |
| Basic helix-loop-helix transcription factor | AT4G10480, AT3G12390 |
| Basic leucine zipper transcription factor | AT4G35900, AT2G17770 |
| Homeodomain transcription factor | AT5G53980 |
| Cysteine protease | AT3G04840, AT4G34670 |
| Dehydratase | AT3G46440, AT3G51160 |
| Aminoacyl-tRNA synthetase | AT1G55803 |
| Antibacterial response protein | AT5G50840 |
| ABC transporter | AT5G60790 |
| Ubiquitin/ubiquitin like | UBQ11, UBQ13, UBQ4, UBQ5, UBQ6, UEV1D-4, UBQ1, UBQ14, AT5G18310, AT3G61113, AT5G32440, NKS1, UBC11, UBL5, APG8A, ATG8B, AT3G07860, ATG8C, AT3G45180, AT5G57860, AT3G58230 |
| Dehydrogenase | AT1G53240, AT1G04410, AT5G43330, AT2G02050, AT1G12900, AT3G04120, AT3G26650, AT1G13440 |
| DNA/RNA binding protein | AT4G01060, AT5G08420, AT5G47210, AT4G17520, AT4G16830, AT3G57150 |
| Membrane traffic protein | AT4G23630, AT1G73030, AT2G34250, AT2G38360, AT1G62880, AT1G48440 |
| Transfer/carrier protein/transporter | AT3G10640, AT2G19830, AT3G15352, AT3G57900, AT2G36830, AT3G16240 |
| Actin/actin related protein | ACT2, ACT7, ACT8, AT3G09860, ACT11 |
| Amino acid transporter | AT2G45960, AT3G61430, AT4G00430, AT1G01620 |
| ATP synthase | AT4G23710, AT3G01390, AT2G33040 |
| Carbohydrate kinase | AT3G59480, AT1G50390, AT1G79550 |
| Extracellular matrix structural protein | AT4G08410, AT3G54580, AT5G06640, AT2G24980, AT1G23720, AT5G06630, AT3G28550, AT3G54590, AT1G21310, AT1G76930 |
| Chaperone/heat shock protein | AT1G27330, AT4G02450, AT5G12020, HSC70-1, HSP17.6A, HSP21, HSP70, Hsp70-2, ERD2, AT3G09440, BIP2, BIP1, Hsp81.4, HSP81-2, HSP81-3, HSP90.1 |
It has been observed that PHE genes belong to various functional classes that are variably represented in the genome. These include carbohydrate kinase, dehydratase, dehydrogenase, ATP synthase, acyltransferase, methyltransferase, amino acid transporter, actin/actin-related protein, calcium-binding protein, calmodulin, cysteine protease, chromatin/chromatin-binding protein, DNA-directed DNA/RNA polymerase, enzyme modulator, extracellular matrix structural protein, ligase, non-motor actin/microtubule-binding protein, non-receptor serine/threonine protein kinase, oxidase, oxidoreductase, nucleotidyltransferase, reductase, peroxidase, phosphatase, peroxidase/phosphatase inhibitor and transfer/carrier protein. In addition, we have identified a number of PHE genes which play important roles in signal transduction mechanisms, amino acid transport and metabolism, secondary metabolite biosynthesis and catabolism, cell membrane biogenesis, inorganic ion transport and metabolism, coenzyme transport and metabolism, carbohydrate transport and metabolism, intercellular trafficking, and energy production and conversion. These include vacuolar protein, vacuolar ATP synthase, vacuolar calcium-binding protein, vacuolar ATPase, vesicle coat protein, seed storage albumin, arabinogalactan protein, cytochrome complex, cytochrome c oxidase/electron carrier and members of the cytochrome family, DEFL family and dehydrin family. A number of PHE genes encoding plasma membrane intrinsic protein, plant defensin, photosystem II, phytochrome-associated protein, phytosulfokine and plant viral response protein also have significant roles in the plant. Among other PHE genes, those encoding copper chaperone, copper iron-binding protein, a copper transport protein, zinc-binding ribosomal family protein and ferredoxin-like superfamily protein have important functions in this organism. 
However, a fraction of poorly characterized hypothetical genes was also found among the PHE genes. Table 2 displays the general statistics of hypothetical or poorly characterized PHE genes in Arabidopsis genome. Genes of unknown function with high predicted expression levels may be attractive candidates for experimental characterizations. The characteristic codon distribution of these genes indicates that they may have important functions in these organisms. A variety of PHE genes encoding proteins of unknown function may provide targets for identification of additional key features of Arabidopsis thaliana. The temporal and spatial organization of these genes for chromosome replication, genome segregation and cell division processes are less characterized in Arabidopsis genome. A detailed analysis of these putative/hypothetical PHE genes would generate a more comprehensive picture of the replication and division machineries, and of the regulatory features of the cell cycle.

Correlations among different codon bias indices

In this study, we compared the performance of several commonly used computational tools for predicting gene expression level. The expression profiles of the Arabidopsis genome were analyzed in terms of CAI, RCA and MRCBS. The CAI scores were calculated by taking all RP genes (>80 aa) as the PHE genes that form the reference set. RCA frequencies were computed using the same reference set as used in the calculation of CAI. The results indicate that there is a good correlation between RCA and CAI (r = 0.673761), while the correlation of RCA with MRCBS is higher (r = 0.787772) [Fig. 3]. MRCBS was then compared with CAI, and the correlation between them was found to be surprisingly good (r = 0.900204) [Fig. 4]. These correlation coefficients can be used to express the strength of the existing prediction methods. It can be seen that MRCBS consistently yields better correlations than the other indices. We also observe that there is no clear correlation of CAI or MRCBS with GC3 (rCAI = −0.05726, rMRCBS = 0.101083) or GC (rCAI = −0.15775, rMRCBS = 0.041383). Thus, GC content and GC3 may not accurately represent the trend in codon usage bias. Similarly, no correlation between the length of the gene and MRCBS or CAI was observed in our study.
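To make the reference-set dependence of CAI concrete, the following Python sketch derives relative adaptiveness values from a reference set of coding sequences and then scores a query gene as the geometric mean of those values (Sharp and Li, 1987). The helper names, the toy sequences, and the decision to skip query codons that never occur in the reference set are assumptions of this example, not details taken from the study, where the reference set consists of RP genes longer than 80 aa.

```python
import math
from collections import Counter, defaultdict

# Standard genetic code (NCBI translation table 1), built in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))


def sense_codons(cds):
    """Split a coding sequence into sense codons (stops and ambiguous codons dropped)."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return [c for c in codons if CODON_TABLE.get(c, "*") != "*"]


def relative_adaptiveness(reference_cds):
    """w for each codon: its count divided by the count of the most used synonym."""
    counts = Counter()
    for cds in reference_cds:
        counts.update(sense_codons(cds))
    best = defaultdict(int)
    for codon, k in counts.items():
        aa = CODON_TABLE[codon]
        best[aa] = max(best[aa], k)
    return {codon: k / best[CODON_TABLE[codon]] for codon, k in counts.items()}


def cai(cds, weights):
    """Geometric mean of w over the codons of the query gene.

    Assumption: codons absent from the reference set are skipped, and
    single-codon amino acids (Met, Trp) contribute w = 1.
    """
    logs = [math.log(weights[c]) for c in sense_codons(cds) if c in weights]
    return math.exp(sum(logs) / len(logs))


# Toy reference set (hypothetical): GCT is used three times, GCC once.
weights = relative_adaptiveness(["GCTGCTGCTGCC"])
print(cai("GCTGCC", weights))
```

In the toy example GCT receives w = 1 and GCC receives w = 1/3, so a gene made of one of each scores the square root of 1/3. A reference-free index such as MRCBS avoids the need to choose the set passed to `relative_adaptiveness`.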
Fig. 3

RCA plotted against MRCBS for each protein-coding gene in Arabidopsis thaliana genome.

Fig. 4

CAI plotted against MRCBS for each protein-coding gene in Arabidopsis thaliana genome.
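The pairwise comparisons between indices reported in this section rest on the Pearson correlation coefficient. A minimal, dependency-free Python sketch is given below; the function name `pearson_r` and the score lists are hypothetical and serve only to show the calculation on per-gene index values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-gene scores for two indices (not values from the study).
mrcbs_scores = [0.31, 0.42, 0.55, 0.78, 0.86]
cai_scores = [0.52, 0.58, 0.66, 0.81, 0.88]
print(round(pearson_r(mrcbs_scores, cai_scores), 3))
```

The same function applies unchanged to any pair of per-gene vectors, for example an index against GC3 or against measured signal intensity, provided both lists are ordered by the same genes.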


Correlation of protein and mRNA expression levels with MRCBS

In this study we chose to compare our results with experimental datasets. The value of a codon-based expression indicator can best be appreciated by comparing it with experimental gene expression data. Of course, a codon-based expression indicator yields a static value, whereas gene expression is a dynamic process with very different expression levels under different conditions. The expression data used in this study stem from Gene Expression Omnibus (GEO) datasets. In the GEO dataset (GEO accession: GSM2473182), protein expression levels were quantified by RMA (Relative Molecular Abundance) signal intensity. For the entire group of selected genes (20,900 genes) for which the complete data set can be generated along with the codon-based expression indicators, the Pearson correlation coefficient between CAI and MRCBS is 0.901964. The pair-wise correlation coefficients between protein expression level and MRCBS, CAI, RCA and GC are 0.268321, 0.253094, 0.283545 and 0.206581, respectively. The correlation is worse with GC3 (0.049775). For genes with high RMA signal intensity (>7.59), the pair-wise correlation coefficients are better (0.386227, 0.337139, 0.303723, 0.251336 and 0.290886) [Suppl. Fig. 3, Suppl. Fig. 4, Suppl. Fig. 5, Suppl. Fig. 6, Suppl. Fig. 7].
Suppl. Fig. 3

RMA signal intensity plotted against MRCBS for 20,900 genes of Arabidopsis thaliana available in GEO dataset (GEO accession: GSM2473182).

Suppl. Fig. 4

RMA signal intensity plotted against CAI for 20,900 genes of Arabidopsis thaliana available in GEO dataset (GEO accession: GSM2473182).

Suppl. Fig. 5

RMA signal intensity plotted against RCA for 20,900 genes of Arabidopsis thaliana available in GEO dataset (GEO accession: GSM2473182).

Suppl. Fig. 6

RMA signal intensity plotted against GC for 20,900 genes of Arabidopsis thaliana available in GEO dataset (GEO accession: GSM2473182).

Suppl. Fig. 7

RMA signal intensity plotted against GC3 for 20,900 genes of Arabidopsis thaliana available in GEO dataset (GEO accession: GSM2473182).
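The comparison restricted to genes with high RMA signal intensity can be sketched as follows: keep only the genes whose signal exceeds a threshold, then compute the Pearson correlation on that subset. This is a minimal Python illustration with hypothetical function names and made-up toy values, not the analysis pipeline of the study. Only the threshold of 7.59 is taken from the text.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_above_threshold(index_scores, signal, threshold):
    """Correlate an expression index with signal intensity for high-signal genes only.

    Returns the coefficient and the number of genes that passed the filter.
    """
    kept = [(s, e) for s, e in zip(index_scores, signal) if e > threshold]
    scores = [s for s, _ in kept]
    values = [e for _, e in kept]
    return pearson_r(scores, values), len(kept)


# Toy values (invented for illustration): one low-signal gene is filtered out.
toy_scores = [0.5, 0.1, 0.2, 0.3, 0.4]
toy_signal = [1.0, 8.0, 9.0, 10.0, 11.0]
toy_r, toy_n = correlation_above_threshold(toy_scores, toy_signal, threshold=7.59)
```

Returning the subset size alongside the coefficient makes it easier to judge how many genes a reported correlation is based on.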

In another analysis we compared our results with the radioactive data (González-Pérez et al., 2011). We collected 1797 Arabidopsis genes that have orthologs in yeast and humans and for which mRNA half-life data are available (Calderwood et al., 2016). For this dataset, the gene expression level predicted by MRCBS correlates well with RMA signal intensity (r = 0.50923) [Fig. 5]. The correlation is better than that obtained with CAI (r = 0.470608), RCA (r = 0.442278), GC3 (r = 0.405765) and GC (r = 0.362806) [Suppl. Fig. 8, Suppl. Fig. 9, Suppl. Fig. 10, Suppl. Fig. 11]. This suggests that MRCBS gives a better quantitative estimate of the expression level than the other expression indices. MRCBS was then compared with mRNA half-life data. The correlation coefficient of mRNA half-life data with MRCBS (r = 0.3504) is good [Fig. 6], but lower than that obtained with RMA signal intensity. Although the pair-wise correlation coefficient between the gene expression levels from the two experimental datasets (r = 0.525273) is good, the agreement between the predicted expression level and the actual expression level quantified by mRNA half-life data varied greatly across the examined combinations of prediction method and data set (rCAI = 0.31067, rGC3 = 0.310397, rGC = 0.281694 and rRCA = 0.279249) [Suppl. Fig. 12, Suppl. Fig. 13, Suppl. Fig. 14, Suppl. Fig. 15].
Fig. 5

RMA signal intensity plotted against MRCBS for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Suppl. Fig. 8

RMA signal intensity plotted against CAI for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Suppl. Fig. 9

RMA signal intensity plotted against RCA for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Suppl. Fig. 10

RMA signal intensity plotted against GC3 for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Suppl. Fig. 11

RMA signal intensity plotted against GC for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Fig. 6

mRNA half-life data plotted against MRCBS for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Suppl. Fig. 12

mRNA half-life data plotted against CAI for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Suppl. Fig. 13

mRNA half-life data plotted against GC3 for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Suppl. Fig. 14

mRNA half-life data plotted against GC for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

Suppl. Fig. 15

mRNA half-life data plotted against RCA for 1797 identified genes in Arabidopsis thaliana (González-Pérez et al., 2011; Calderwood et al., 2016).

To assess the value of MRCBS for predicting protein expression levels in Arabidopsis thaliana, we plotted the two experimental sets of data against MRCBS as well as RCA and CAI. The distribution patterns of both sets of expression data with respect to these expression indicators are highly similar. Comparing MRCBS, CAI and RCA as numerical indices of gene expression level in terms of their Pearson correlation coefficients with the expression data, we observed that MRCBS generally performs better than CAI and RCA.

Conclusion

Our study demonstrates that MRCBS may be a useful tool for predicting highly expressed genes. Our method rests on the hypothesis that codon usage pattern is largely responsible for the regulation of gene expression, which can occur during transcription or at the level of protein translation. Although the concept of predicting gene expression level from the codon usage pattern was proposed a decade ago, only recently have these methods been successfully applied to the identification of highly expressed genes in various bacterial and eukaryotic genomes. The improved reliability of MRCBS for estimating expression levels in the Arabidopsis genome thus makes this index a superior choice for undertaking and benchmarking predictions of gene expression. In this study, various approaches to estimating gene expression level based on codon usage were applied to the Arabidopsis genome with the objective of testing the present alternative method of studying whole-genome gene expression. Our results demonstrate significant heterogeneity in codon usage among genes in the Arabidopsis genome. Furthermore, the gene expression level predicted by the quantitative measure CAI was found to correlate well with MRCBS. In addition, since the expression levels measured by current DNA microarray and proteomics technologies represent the accumulated results of expression and degradation, the results from this computational approach could be used as reference data for calibrating and better interpreting experimental data. For example, a low level of expression in proteomic or microarray data for a gene with a high PHE index might suggest the involvement of degradation in regulating the expression level of that gene. Although most of the PHE genes are essential genes responsible for the habitat, energy sources and life style of an organism, the study also identified a number of functionally unknown genes as PHE genes based on their codon usage profile. Further investigation of these genes by an integrated computational and experimental approach will enhance our knowledge of metabolism. Given that a large volume of experimental data is available on this plant, such a novel method may be helpful in extracting meaningful information for understanding the details of functional genomics.

The following are the supplementary data related to this article. Distribution of GC content of all protein-coding genes in Arabidopsis thaliana genome. Distribution of GC3 content of all protein-coding genes in Arabidopsis thaliana genome.

Suppl. Table 1

A list of some well characterized PHE genes in Arabidopsis thaliana.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declaration of interests

We, the authors, declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

1.  Predicted highly expressed genes of diverse prokaryotic genomes.

Authors:  S Karlin; J Mrázek
Journal:  J Bacteriol       Date:  2000-09       Impact factor: 3.490

Review 2.  Codon bias and heterologous protein expression.

Authors:  Claes Gustafsson; Sridhar Govindarajan; Jeremy Minshull
Journal:  Trends Biotechnol       Date:  2004-07       Impact factor: 19.536

3.  Predicted highly expressed genes in archaeal genomes.

Authors:  Samuel Karlin; Jan Mrázek; Jiong Ma; Luciano Brocchieri
Journal:  Proc Natl Acad Sci U S A       Date:  2005-05-09       Impact factor: 11.205

Review 4.  Codon bias and gene expression.

Authors:  C G Kurland
Journal:  FEBS Lett       Date:  1991-07-22       Impact factor: 4.124

5.  Early transcriptional defense responses in Arabidopsis cell suspension culture under high-light conditions.

Authors:  Sergio González-Pérez; Jorge Gutiérrez; Francisco García-García; Daniel Osuna; Joaquín Dopazo; Óscar Lorenzo; José L Revuelta; Juan B Arellano
Journal:  Plant Physiol       Date:  2011-04-29       Impact factor: 8.340

6.  Transcript Abundance Explains mRNA Mobility Data in Arabidopsis thaliana.

Authors:  Alexander Calderwood; Stanislav Kopriva; Richard J Morris
Journal:  Plant Cell       Date:  2016-03-07       Impact factor: 11.277

7.  Relative codon adaptation: a generic codon bias index for prediction of gene expression.

Authors:  Jesse M Fox; Ivan Erill
Journal:  DNA Res       Date:  2010-05-07       Impact factor: 4.458

Review 8.  Codon usage patterns in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens; a review of the considerable within-species diversity.

Authors:  P M Sharp; E Cowe; D G Higgins; D C Shields; K H Wolfe; F Wright
Journal:  Nucleic Acids Res       Date:  1988-09-12       Impact factor: 16.971

9.  Comparative Analysis of Predicted Gene Expression among Crenarchaeal Genomes.

Authors:  Shibsankar Das; Brajadulal Chottopadhyay; Satyabrata Sahoo
Journal:  Genomics Inform       Date:  2017-03-29

10.  Predicting gene expression level from relative codon usage bias: an application to Escherichia coli genome.

Authors:  Uttam Roymondal; Shibsankar Das; Satyabrata Sahoo
Journal:  DNA Res       Date:  2009-01-08       Impact factor: 4.458
