Genome-wide analysis of codon usage in sesame (Sesamum indicum L.).

Abstract

Sesamum indicum is an ancient oil crop grown in tropical and subtropical areas of the world. We analyzed 23,538 coding sequences (CDS) of S. indicum to understand the factors shaping codon usage in this important oil crop. We identified eleven highly preferred codons in S. indicum that have AT-endings. The slope of the neutrality plot was less than one, while the effective number of codons (ENC) plot showed genes distributed above and below the standard curve. There is a significant relationship between protein length and relative synonymous codon usage (RSCU) at the primary axis, while there is only a weak correlation between protein length and Nc values. Correspondence analysis conducted on RSCU values differentiated CDS based on their GC content and their characteristic features and showed a discrete distribution. Moreover, by determining codon usage, we found that the majority of the lignan biosynthesis related genes showed a weaker codon usage bias. These results provide insights into codon evolution in sesame.

Keywords: Codon usage bias; Natural selection; Optimal codon; Sesamum indicum

Year: 2021 PMID: 35106386 PMCID: PMC8789531 DOI: 10.1016/j.heliyon.2021.e08687

Source DB: PubMed Journal: Heliyon ISSN: 2405-8440

Introduction

Triplet codons are the basic coding units of mRNA. Evidence has shown that synonymous codons are not used randomly during translation [1, 2]. The groups of synonymous codons that encode a specific amino acid are well conserved across most species, although a few small exceptions have been reported [3, 4]. Codon usage bias (CUB) refers to the difference between the observed frequencies of synonymous codons in protein-coding genes and the frequencies expected if synonymous codons were selected randomly. CUB has been observed in all forms of life, and several factors may account for it [5]. It is a complex and important phenomenon that may have significant relevance to the understanding of genome evolution through identification of the different selective forces [6, 7, 8, 9, 10]. CUB is also critical in shaping cellular function and gene expression through effects that range from RNA processing to protein translation and protein folding [11, 12, 13, 14, 15]. Of the factors suggested to affect codon usage bias, mutational bias and selection for translation efficiency are considered the two primary evolutionary forces, with relative contributions that vary among species [16, 17, 18, 19, 20, 21, 22]. Mutation generates codon diversity, while the reduction of codon diversity is achieved mainly through natural selection. Further factors suggested to affect CUB include compositional constraints of genes [23, 24, 25, 26], translational selection [27, 28, 29, 30], gene expression level [31, 32, 33, 34, 35, 36, 37, 38], gene length [39, 40, 41, 42], function of the gene [43], the rate of recombination [27, 44], secondary structure of the protein [45, 46, 47], protein amino acid composition [48, 49, 50], the evolutionary age of the genes [51], the length of the intron [52], tRNA abundance [5, 53, 54, 55], and environmental stress [56].
Codon usage indices help in tabulating and investigating codon usage, and they reduce codon usage data to a useful summary [7, 9, 27, 32, 36, 52, 57, 58, 59]. Advances in genome sequencing facilitate the analysis of codon usage in many organisms, helping us to understand the evolutionary forces that affect genes [5, 60]. The codon usage bias of several organisms has been analyzed so far; however, the unavailability of high-throughput sequence data for Sesamum indicum has hampered codon usage bias studies in this species. S. indicum is an ancient oil crop grown in tropical and subtropical areas [61]. Sesame seeds contain valuable oil with anti-oxidative lignans and possess health-promoting properties [62, 63]. Here we report an analysis of codon usage in S. indicum and determine key factors that shape codon choice in its protein-coding genes.

Materials and methods

Coding sequence data

The coding sequence (CDS) datasets of S. indicum were retrieved from the genome sequence database Sinbase (http://ocri-genomics.org/Sinbase) [64] and from the National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov/). First, sequences shorter than 300 bp were removed to minimize sampling errors. Next, coding sequences lacking a start and/or a stop codon, as well as those containing undetermined nucleotides (N), were filtered out. Duplicated sequences were then discarded. The remaining 23,538 coding sequences of S. indicum were used for codon usage analysis.
RSCU, an index of normalized codon usage [27, 43], was calculated for each gene. It is the ratio of the observed usage of a given codon to the usage expected if all synonymous codons for that amino acid were used equally. Codon usage is said to be unbiased if the RSCU value is equal to 1, while a value greater than 1 indicates that a particular codon is favored. A codon with an RSCU value greater than 1.6 is considered overrepresented, and one with an RSCU value less than 0.6 is considered underrepresented. The effective number of codons (Nc) was used to quantify the absolute codon usage bias of a gene independent of gene length and the number of encoded amino acids [65]. The values of Nc range from 20 (when only one codon is used per amino acid) to 61 (when all codons are used with equal probability). The GC contents of the first, second and third codon positions (GC1, GC2 and GC3) were then calculated. GC12 is the average of GC1 and GC2 and was used for the neutrality plots (GC12 vs GC3) [66]. GC3s is the frequency of G + C at synonymous third codon positions, and it was used to better elucidate codon usage variation and compositional constraints. Moreover, A3s, G3s, C3s, and T3s values were obtained and used to quantify the usage of each specific base at synonymous third codon positions.
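The RSCU definition above can be illustrated with a short sketch. This is not the CodonW implementation used in the study; the function names `count_codons` and `rscu_from_counts` are hypothetical, and the standard genetic code is assumed. For each amino acid, RSCU is the count of a codon multiplied by the number of synonymous codons and divided by the total count of that amino acid's codons.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codons(cds):
    """Count in-frame codons of one CDS (RNA 'U' is read as 'T')."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def rscu_from_counts(counts):
    """Return {codon: RSCU} for every amino acid that occurs in `counts`."""
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    values = {}
    for aa, codons in families.items():
        if aa == "*" or len(codons) == 1:
            continue  # stop codons, Met and Trp carry no synonymous choice
        total = sum(counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid absent from this sequence
        for c in codons:
            values[c] = counts.get(c, 0) * len(codons) / total
    return values


# Phe counts from Table 1 (UUU and UUC written in the DNA alphabet).
phe = rscu_from_counts({"TTT": 218627, "TTC": 187400})
print(round(phe["TTT"], 2), round(phe["TTC"], 2))  # 1.08 0.92
```

The Phe example reproduces the values listed in Table 1, which suggests the table follows this definition.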
The codon adaptation index (CAI) was used to estimate the extent of bias toward codons that were known to be preferred in highly expressed genes. A CAI value is between 0 and 1.0, and a higher value means a likely stronger codon usage bias and a potential higher expression level [32].
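As a rough illustration of how a CAI value is obtained, the sketch below computes the geometric mean of relative adaptiveness values w, where w for a codon is its count in a reference set divided by the count of the most used synonymous codon. The reference counts, the function names, and the 0.5 pseudo-count for unobserved codons are assumptions made for this example; the study does not state which reference set or zero-handling rule was used.

```python
import math
from itertools import product

# Standard genetic code grouped into synonymous families (stops excluded).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
FAMILIES = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO):
    if aa != "*":
        FAMILIES.setdefault(aa, []).append(codon)


def relative_adaptiveness(ref_counts, floor=0.5):
    """w = codon count / count of the most used synonym in the reference set.

    `floor` is an assumed pseudo-count that keeps w above zero for codons
    never seen in the reference set.
    """
    w = {}
    for codons in FAMILIES.values():
        if len(codons) == 1:
            continue  # Met and Trp are excluded from CAI
        top = max(ref_counts.get(c, 0) for c in codons)
        if top == 0:
            continue  # amino acid absent from the reference set
        for c in codons:
            w[c] = max(ref_counts.get(c, 0), floor) / top
    return w


def cai(cds, w):
    """Geometric mean of w over the scorable codons of one CDS."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    logs = [math.log(w[cds[i:i + 3]])
            for i in range(0, usable, 3) if cds[i:i + 3] in w]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference counts for Ala codons only.
w = relative_adaptiveness({"GCT": 10, "GCC": 5})
print(round(cai("GCTGCC", w), 4))  # 0.7071
```

A gene that uses only the top codon of each family scores 1.0, and the score falls as less favored codons appear.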

Correspondence analysis (CA)

Correspondence analysis (CA) has been widely used to explore the major trends in codon usage variation among genes [67]. Correspondence analysis was performed on RSCU data to overcome the effect of biases in amino acid composition. The analysis begins with a codon usage matrix that has dimensions X (number of genes) by Y (codon usage values). This method plots genes according to their synonymous codon usage in a 59-dimensional space (excluding Met, Trp and stop codons) and then identifies the major trends within the dataset as those axes through this multidimensional space that account for the largest fractions of variation among the genes [57, 67, 68]. In addition, it reduces excess noise and complex data structure to visual outputs [1, 69].

NC-plot

In the Nc-plot, Nc values are plotted against GC3s values to analyze the influence of base composition on codon usage in a genome [70]. The expected relationship between Nc and GC3s under mutation pressure alone, in the absence of selection, is shown by a standard curve. Observed Nc values will lie on or around this curve if codon choice is mainly due to G + C mutation bias. However, if the values fall considerably below the expected curve, this signifies the presence of other factors, such as selection effects. In addition, to estimate the difference between the observed and the expected Nc values for all sesame CDS, the frequency distribution of (Nexp − Nobs)/Nexp was plotted, where Nexp and Nobs are the expected and observed Nc values.
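The standard curve in such plots is commonly drawn from Wright's approximation, Nc = 2 + s + 29 / (s² + (1 − s)²), where s is GC3s expressed as a fraction. The sketch below assumes that formula (the study does not print it) and uses hypothetical function names to compute the expected value, the ratio described above, and a simple binning of ratios for a frequency distribution.

```python
import math
from collections import Counter


def expected_nc(gc3s):
    """Expected Nc under GC3s composition alone (Wright's approximation)."""
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)


def nc_ratio(nc_obs, gc3s):
    """(Nexp - Nobs) / Nexp for one gene."""
    nc_exp = expected_nc(gc3s)
    return (nc_exp - nc_obs) / nc_exp


def bin_ratios(ratios, width=0.05):
    """Count ratios per bin; bin k covers [k * width, (k + 1) * width)."""
    return Counter(math.floor(r / width) for r in ratios)


print(expected_nc(0.5))     # 60.5
print(nc_ratio(60.5, 0.5))  # 0.0
```

A gene lying on the curve has a ratio of zero, and positive ratios correspond to genes below the curve, that is, genes with stronger bias than GC3s alone would predict.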

PR2-bias plot

Parity rule 2 (PR2) is a rule of DNA composition, and it is used to indicate the impact of both mutation and selection pressure on codon usage bias [29]. The analysis plots the AT-bias [A3/(A3+T3)] against the GC-bias [G3/(G3+C3)] at the third codon position of the four-codon amino acids (i.e. Ala, Arg, Gly, Leu, Pro, Ser, Thr, and Val). When mutation and selection pressures act equally on the two DNA strands, the fractional content of the four bases follows A = T and G = C (where A + T + G + C = 1) [58].
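A minimal sketch of the PR2 coordinates for one gene follows. It assumes that only the four-fold degenerate codon families of the eight amino acids listed above are counted (identified here by their first two bases) and that the standard genetic code applies; the function name `pr2_bias` is hypothetical.

```python
from collections import Counter

# First two bases of the four-fold degenerate families:
# Ala GCN, Arg CGN, Gly GGN, Leu CTN, Pro CCN, Ser TCN, Thr ACN, Val GTN.
FOURFOLD_PREFIXES = {"GC", "CG", "GG", "CT", "CC", "TC", "AC", "GT"}


def pr2_bias(cds):
    """Return (AT-bias, GC-bias) = (A3/(A3+T3), G3/(G3+C3)) for one CDS.

    A component is None when the corresponding bases never occur at the
    third position of a four-fold degenerate codon.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    third = Counter(cds[i + 2] for i in range(0, usable, 3)
                    if cds[i:i + 2] in FOURFOLD_PREFIXES)
    at_total = third["A"] + third["T"]
    gc_total = third["G"] + third["C"]
    at_bias = third["A"] / at_total if at_total else None
    gc_bias = third["G"] / gc_total if gc_total else None
    return at_bias, gc_bias


# Five Ala codons ending in T, A, G, C, T.
print(pr2_bias("GCTGCAGCGGCCGCT"))  # (0.3333333333333333, 0.5)
```

A gene at (0.5, 0.5) uses A and T, and G and C, in equal proportion; the distance and direction from that center indicate the extent and sign of the bias.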

Neutrality plot

The neutrality plot (GC12 vs GC3) is used to estimate and characterize the relationship between GC12 and GC3 and thereby to determine codon usage patterns and biases [71]. Mutation bias is assumed to be the main force shaping codon usage bias if the correlation between GC12 and GC3 is statistically significant and the slope of the regression line is close to 1. Conversely, a slope close to 0 indicates the absence of directional mutation pressure [66].
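The quantities behind a neutrality plot can be sketched as follows: the GC content at each codon position of a CDS, the GC12 average, and an ordinary least-squares regression of GC12 on GC3 across genes. The function names are hypothetical and the data points in the example are made up for illustration; the study itself used CodonW and GraphPad Prism for these steps.

```python
def gc_by_position(cds):
    """Return (GC1, GC2, GC3) as fractions for one in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    if usable == 0:
        raise ValueError("sequence shorter than one codon")
    result = []
    for pos in range(3):
        bases = cds[pos:usable:3]  # every third base starting at `pos`
        result.append(sum(b in "GC" for b in bases) / len(bases))
    return tuple(result)


def neutrality_regression(points):
    """Least-squares fit GC12 = slope * GC3 + intercept.

    `points` is a list of (gc3, gc12) pairs, one per gene.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("GC3 is constant; slope is undefined")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


gc1, gc2, gc3 = gc_by_position("ATGGCC")
gc12 = (gc1 + gc2) / 2
print(gc12, gc3)  # 0.5 1.0

# Made-up points lying on a line with slope 0.1.
slope, intercept = neutrality_regression([(0.3, 0.43), (0.5, 0.45), (0.7, 0.47)])
print(round(slope, 3), round(intercept, 3))  # 0.1 0.4
```

In this kind of analysis the slope is often read as the relative contribution of mutation pressure, with one minus the slope attributed to selection and other constraints.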

Statistical analysis

To calculate the different indices of codon usage bias in Sesamum indicum, CodonW 1.4.4 [72] (http://codonw.sourceforge.net/) and the EMBOSS CUSP and CHIPS online programs [73] were used. GRAVY, Aromo, RSCU and ENC values were also calculated, and correlation analyses based on Pearson's correlation coefficient (at a significance level of P < 0.05 or P < 0.01) were performed using the statistical software SPSS 19.0 (IBM, Chicago, IL, USA). Graphs were generated with GraphPad Prism 8.0 (GraphPad Software Inc., La Jolla, CA, USA).
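For readers who want to reproduce the correlation coefficients outside SPSS, a plain Pearson product-moment coefficient can be computed as in the sketch below. The function name `pearson_r` and the toy data are hypothetical, and the sketch returns only r; it does not compute the P values reported in the tables.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Toy data: a perfectly decreasing relationship.
print(round(pearson_r([1, 2, 3], [3, 2, 1]), 6))  # -1.0
```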

Results

GC content variation, base composition and codon usage bias in S. indicum genome

Gene regulation and gene function in different organisms are influenced by base composition, that is, the proportion of guanine and cytosine bases present in the DNA molecule. GC content is a highly variable trait, and variation is observed at both synonymous and non-synonymous sites. The synonymous codon usage pattern, as well as the preference for GC- and/or AT-ended codons, was analyzed using the relative synonymous codon usage (RSCU) of each codon. An RSCU value greater than one indicates that the codon is used more frequently than expected, whereas a value less than one indicates that it is used less frequently in the CDS. The overall RSCU values of S. indicum genes showed that 27 of the 59 codons have an RSCU value greater than 1 and are thus the most frequently used. Of these 27 codons, 22 are AT-ending (Table 1). T-ending codons (14) were favored over A-ending codons (8), followed by five G-ending codons. This result shows that S. indicum genes are biased towards AT-ending codons and that the coding sequences appear to be AT rich, suggesting that compositional constraints are the most important factor in shaping the codon usage patterns of S. indicum.

Table 1

Codon usage of Sesamum indicum.

Amino acid	Codon	N	RSCU	Amino acid	Codon	N	RSCU
Phe	UUU	218,627	1.08	Ser	UCU	210,054	1.43
Phe	UUC	187,400	0.92		UCC	134,481	0.92
Leu	UUA	108,212	0.68		UCA	185,541	1.27
	UUG	234,522	1.48		UCG	82,163	0.56
	CUU	215,505	1.36	Pro	CCU	174,638	1.38
	CUC	142,937	0.90		CCC	89,309	0.70
	CUA	90,524	0.57		CCA	165,595	1.31
	CUG	160,661	1.01		CCG	77,842	0.61
Ile	AUU	232,353	1.36	Thr	ACU	164,709	1.37
	AUC	148,610	0.87		ACC	103,609	0.86
	AUA	133,024	0.78		ACA	146,978	1.22
Met	AUG	238,739	1.00		ACG	64,844	0.54
Val	GUU	237,229	1.47	Ala	GCU	264,788	1.51
	GUC	121,122	0.75		GCC	143,209	0.82
	GUA	96,779	0.60		GCA	212,485	1.21
	GUG	190,067	1.18		GCG	79,502	0.45
Tyr	UAU	153,170	1.11	Ter	UGA	10,538	1.34
Tyr	UAC	121,840	0.89		UAA	7,089	0.90
Cys	UGU	90,992	1.01		UAG	5,910	0.75
Cys	UGC	89,732	0.99	Trp	UGG	126,740	1.00
His	CAU	143,870	1.22	Arg	CGU	66,194	0.73
His	CAC	91,674	0.78		CGC	50,368	0.56
Gln	CAA	181,110	1.00		CGA	59,332	0.66
Gln	CAG	179,813	1.00		CGG	59,579	0.66
Asn	AAU	254,311	1.19		AGA	164,366	1.82
Asn	AAC	174,393	0.81		AGG	142,731	1.58
Lys	AAA	270,541	0.93	Gly	GGU	179,791	1.09
Lys	AAG	313,367	1.07		GGC	131,385	0.80
Asp	GAU	352,544	1.33		GGA	204,465	1.24
Asp	GAC	176,211	0.67		GGG	142,337	0.87
Glu	GAA	331,043	1.03	Ser	AGU	143,087	0.98
Glu	GAG	312,300	0.97		AGC	124,387	0.85

Note: Codons with RSCU >1.30 are shown in bold, codons with RSCU <0.80 are shown in the underlined text.

Based on the synonymous codon usage, five codons were identified as high-frequency codons (Table 1) among the 59 codons that remain after excluding the three stop codons together with methionine and tryptophan. UUG (Leu), GUU (Val), AGA and AGG (Arg) and GCU (Ala) were among the most highly preferred codons, while codons such as UCG (Ser), ACG (Thr) and GCG (Ala) were particularly avoided in the S. indicum CDS. Moreover, the GC content of the 23,538 genes ranged from 33.3 % to 68.9 %, with an average of 46.88 %, indicating that S. indicum coding sequences have a high AT content. To further understand the nucleotide distribution, all protein-coding genes were concatenated into one sequence comprising 9,911,268 codons. We examined the overall GC content and the GC content at the first (GC1), second (GC2), and third (GC3) codon positions and plotted their distributions (Figure 1A). A unimodal distribution was apparent for GC1, GC2 and GC3, and only GC1 was higher than the overall GC content. This analysis showed that GC content differs markedly among the three codon positions. Overall, GC content was highest at the first codon position, followed by the third and the second codon positions (GC1 > GC3 > GC2). The average values for GC1, GC2 and GC3 were 51.95 %, 41.86 % and 46.85 %, respectively (Supplementary Table 1). As expected, the peak containing high GC content genes was most pronounced in the third position. As shown in Figure 1A, GC3 is the most important factor in S. indicum codon usage bias among GC, GC1, GC2, and GC3.

Figure 1

Analysis of codon usage in Sesamum indicum. (A) The distribution of GC contents at the three codon positions in S. indicum genes. (B) Distribution of the effective number of codons (Nc) against GC3s of S. indicum genes. Individual genes are indicated by dots, while the standard curve represents the expected Nc under random codon usage. (C) Frequency distribution of the effective number of codons (Nc) ratio. (D) Parity Rule 2 (PR2) bias plot analysis. Genes are plotted based on their GC bias [G3/(G3+C3)] and AT bias [A3/(A3+T3)] at the third codon position. (E) Neutrality plot analysis (GC12 vs GC3s) of S. indicum. GC12 is the average guanine and cytosine content at the first (GC1) and second (GC2) positions of the codons, whereas GC3 is the guanine and cytosine content at the third codon position.

We also examined the relationship between nucleotide content and codon usage in S. indicum using the effective number of codons (Nc)-plot analysis. As shown in Figure 1B, the majority of the genes were aggregated close to the expected curve, indicating that nucleotide composition bias at the third codon position caused the observed codon bias in S. indicum genes. However, several genes with low Nc values that lie far from this curve were also observed, suggesting that these genes might be subject to additional codon usage bias factors, such as selection, besides mutational bias. To gain better insight, we calculated the ratio (Nexp − Nobs)/Nexp between the expected and observed Nc values of the S. indicum genes (Figure 1C). Gene frequency was highest for ratios within 0.0–0.05, and the ratio ranged between −0.05 and 0.25 for most genes, which indicates that the observed Nc values are not identical to those expected from their GC3s. This result provided more evidence for the existence of other factors that affect S. indicum codon usage bias.
Furthermore, the association between purines (A and G) and pyrimidines (C and T) at the third codon position was examined using a Parity Rule 2 (PR2) bias plot in order to further elucidate the impact of selection and mutation pressure on the CUB of S. indicum genes. If codon usage bias arises mainly from mutation bias, G and C (and likewise A and T) should be used proportionally at the third codon position. On the other hand, if natural selection dominates, G and C (A and T) would not necessarily be used proportionally. In this analysis, the majority of the genes were distributed in the third and fourth quadrants of the PR2-plot (Figure 1D), implying that there is an imbalance between A and T and between G and C at the third codon position of the S. indicum genes. To further investigate the driving forces and to compare the magnitude of mutational pressure with the effect of natural selection in S. indicum genes, neutrality plot analysis was conducted (Figure 1E). The neutrality plot (GC12 vs GC3s) revealed that the S. indicum genome had a wide range of GC3 distributions and that there is a non-linear relationship between GC3 and codon usage bias in S. indicum CDS. In addition, the slope of the regression line indicated that mutation pressure accounted for only 7 % while selection constraints accounted for 93 % of the observed effects, indicating that GC12 content was affected by mutation pressure and natural selection at a ratio of 0.07/0.93 ≈ 0.075. In general, these findings suggest that natural selection might have played a major role, and mutation pressure a minor role, in the evolution of codon usage bias of the S. indicum genes.

Impact of nucleotide composition in affecting codon bias

To investigate synonymous codon usage variation among S. indicum genes, correspondence analysis (COA) was performed on the RSCU values of each codon. In this analysis, the first principal axis described 15.57 % of the variation, while the second axis accounted for only 5.26 % of the total variation (Figure 2A). This result confirmed that the first axis reflects a significant trend that explains differences in codon usage among the S. indicum genes. As shown in Figure 2B, S. indicum genes showed a broad distribution along the horizontal axis, and detailed analysis indicated that they fall into three distinct groups. Genes with a GC content of less than 45 % were located at the right side of axis 1, while genes with GC ≥ 60 % were located mainly at the left side of axis 1. Genes with a GC content of 45 %–60 % were distributed mainly at the center of the plot. Interestingly, genes located at the left side of axis 1 tend to be those with stronger codon usage bias, as assessed by their Nc values, while those located at the other extreme of axis 1 are expected to be expressed at low levels. Thus, these results showed that correspondence analysis differentiates genes according to their codon usage and also reveals the effects of nucleotide composition on codon usage.

Figure 2

Effects of nucleotide composition on codon usage bias. (A) The relative first 20 factors from correspondence analysis according to their amino acid proportions. The line represents the cumulative total of the inertia explained by the first 20 axes. (B) Correspondence analysis of RSCU values of genes. Red, blue and green dots indicate genes with guanine and cytosine contents of <45 %, ≥45 % and <60 %, and ≥60 %, respectively. (C) Correspondence analysis on RSCU values of codons. Blue, yellow, red and green dots indicate other genes, ribosomal genes, genes with a GRAVY value >0.3 and genes with an Aromo value ≥0.15, respectively.

In addition, aromatic genes (Aromo ≥ 0.15) and hydrophobic genes (GRAVY > 0.3), as well as ribosomal and other genes, were selected to distinguish the pattern of codon usage among the S. indicum genes. The distribution of these genes was marked along axis 1 and axis 2 based on the correspondence analysis (Figure 2C). Most of the ribosomal genes were gathered at the right side of axis 1, while other genes were located mainly on the left side of axis 1. Both aromatic and hydrophobic genes were concentrated mainly in the upper half region of axis 1. Altogether, these results showed that the codon usage bias of these genes might be influenced by base composition under mutation. The effect of natural selection on the codon usage bias of these genes is also apparent to some extent, since the genes showed a discrete distribution. Nucleotide composition is one of the major factors through which mutational pressure acts on codon usage bias. Therefore, correlation analysis was performed to ascertain the association between the two principal axes (axis 1 and axis 2) and nucleotide composition (Table 2).
A significant positive correlation (P < 0.0001) was observed between axis 1 and A, T, C, G, A3, C3, and G3 while a significant negative correlation (P < 0.0001) was observed between axis 1 and GC %, GC1 %, GC2 %, and GC3 %. However, axis 1 had no significant correlation with T3. Similarly, axis 2 showed a significant positive correlation with A, T, C, G, A3, T3, C3, and GC2 % while it showed a significant negative correlation with G3, GC %, GC1 % and GC3 %. Besides, a negative and positive correlation was obtained between Nc and axis 1 as well as axis 2, respectively suggesting that mutational pressure which arises from base composition at least partly might influence the codon usage bias of S. indicum genes.

Table 2

Correlation analysis of axis 1 and axis 2 with overall nucleotide composition and effective number of codons.

	A	T	C	G	A3	T3	C3	G3	GC%	GC1%	GC2%	GC3%	N_c
Axis 1	0.3630	0.3373	0.1798	0.2733	0.4106	−0.00758	0.4038	0.2047	−0.7859	−0.1717	−0.3019	−0.8551	−0.08383
P	0.000	0.000	0.000	0.000	0.000	0.2451	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Axis 2	0.04344	0.1770	0.1351	0.04594	0.08181	0.1854	0.1329	−0.01512	−0.2078	−0.2643	0.07242	−0.2071	0.05478
P	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.0204	0.000	0.000	0.000	0.000	0.000

A, T, C, G = frequency of each individual base; A3, T3, C3 and G3 = frequency of each individual base A, T, C and G at the third position of codons; GC = total guanine-cytosine content of the entire gene; GC1 = GC content in the first position of codons; GC2 = GC content in the second position of codons; GC3 = GC content in the third position of codons; N_c = effective number of codons. Significant difference at P < 0.0001. Bold value indicates that there is no significant correlation at P < 0.0001.

Moreover, correlation analysis was performed between axes 1 and 2 and indices such as protein length, Aromo, and GRAVY values, which are important indicators of natural selection (Table 3). As shown in Table 3, there was a significant positive correlation between axis 1 and protein length, while there was a significant negative correlation between axis 1 and both Aromo and GRAVY. Similarly, axis 2 had a significant positive correlation with both Aromo and GRAVY values, while it showed no significant correlation with protein length. These results suggest that codon usage bias in S. indicum genes might also have emerged from the effect of natural selection.

Table 3

Correlation analysis of axis 1 and axis 2 with Protein length, Aromo and GRAVY values.

	Protein length	Aromo	GRAVY
Axis 1	0.3239	−0.03343	−0.06806
P	0.000	0.000	0.000
Axis 2	0.01076	0.05092	0.1123
P	0.0989	0.000	0.000

Aromo = frequency of aromatic amino acids; GRAVY = general average hydrophobicity. Significant difference at P < 0.0001. Bold value indicates that there is no significant correlation at P < 0.0001.

Effects of gene length on codon usage bias

We found a significant positive correlation between protein length and the RSCU primary axis (axis 1) (r = 0.324, P < 0.001); however, there was only a weak, non-significant linear correlation between protein length and Nc (r = 0.0093, P = 0.1543). It is worth noting that two different patterns were observed in the variation among proteins: short protein products had the greatest variation in Nc values, and protein length varied considerably among proteins with higher Nc values (Figure 3). Most of the data points were distributed between Nc values of 48 and 58. This result suggests that protein length contributes to S. indicum codon usage bias, which may arise from selection constraints.

Figure 3

Plot of protein length versus Nc values.

Effect of hydrophobicity and aromaticity of encoded proteins on codon usage bias

Correlation analysis among Aromo, GRAVY and Nc was carried out in order to investigate the potential effect of the physical and chemical properties of the encoded proteins on S. indicum codon usage bias. As shown in Table 4, the GRAVY and Aromo values of each protein were positively correlated with Nc. This result indicates that both hydrophobicity and aromaticity influence codon usage bias in S. indicum, although the correlations are weak (r = 0.07 and r = 0.11, respectively).

Table 4

Correlation analysis of Nc with GRAVY and Aromo values.

	GRAVY	Aromo
Nc	0.07161	0.1132
P	0.000	0.000

GRAVY = general average hydrophobicity; Aromo = frequency of aromatic amino acids; Nc = effective number of codons. Significant difference at P < 0.0001.

Lignan biosynthesis related genes of S. indicum and their codon usage bias

We took the CDS of all seven known lignan biosynthesis related genes and analyzed their codon usage patterns. GC12/GC3 ratios and Nc values were used to characterize the codon usage pattern of these genes, and the values obtained were compared to the average values of all S. indicum coding sequences. The results showed that, of the seven lignan biosynthesis related genes, CYP81Q1, CYP81Q3 and UGT71A9 had Nc values above the average, while CYP92B14 and UGT94AG1 had GC12/GC3 ratios higher than the average, indicating a weaker codon usage bias compared to the average (Figure 4A). Furthermore, CYP92B14, UGT94D1 and UGT94AG1 had lower Nc values, while CYP81Q3, UGT71A9, UGT94D1 and UGT94AA2 had GC12/GC3 ratios well below the average. Moreover, correspondence analysis showed that the AT-ending codons are mainly located at the right side of axis 1, while the GC-ending codons concentrate mainly at the left side of axis 1 (Figure 4B), indicating that base composition under mutation bias might correlate with the codon bias.

Figure 4

Codon usage analysis in S. indicum lignan biosynthesis related genes. (A) Comparisons of GC12/GC3 ratio and Nc values among different lignan biosynthesis related genes. The average Nc value is 53.54, while the average GC12/GC3 value is 1.04. (B) Correspondence analysis of the synonymous codon usage of S. indicum lignan biosynthesis related genes. Codons with different ending bases are marked in the figure, where the red, green, blue, and yellow colors refer to codons ending with A, T, C and G, respectively.

Discussion

The overall codon usage pattern among the 23,538 CDS of Sesamum indicum was analyzed in this study. RSCU values are usually used as a measure of codon usage bias, and RSCU bias differs among species and genes [74]. The possible causes of RSCU bias have been investigated in the genomes of numerous organisms, for example Zea mays, Arabidopsis thaliana, cotton, and others [10, 59, 75]. In the current study, we found that A/T-ending codons were more frequent and preferred compared to G/C-ending codons in the S. indicum genome. Of the 59 synonymous codons, 27 had an RSCU value greater than 1, and of these 22 were AT-ending while only five were GC-ending. This shows that the genes had little or no bias towards GC-ending codons in S. indicum. To ascertain the potential effect of compositional constraints on codon usage, the nucleotide compositions of the S. indicum CDS were first determined. GC2, GC3 and the average GC content across the three positions were less than 50 % (only GC1, at 51.95 %, exceeded it), indicating that the S. indicum genome tends to use A/T bases and AT-ending codons more frequently. Moreover, the S. indicum coding sequences were found to be AT rich, and T is preferred over A. If mutation occurs neutrally at the third codon position, synonymous codon choice would be random, and G and C on the one hand and A and T on the other would be used proportionally among the degenerate codon groups in a gene or genome [35]. The current study showed unbalanced usage of AT and GC at the third codon position of S. indicum genes. This in turn supports the influence of mutation pressure.
Similar observations were reported previously in comparative analyses of codon usage in the Poaceae family [76], the Asteraceae family [9], Solanum species [77], Euphorbiaceae species [78], and Paeonia lactiflora [79]. It is worth noting that the GC value of S. indicum (46.88 %) was higher than those reported for dicots (42–44 %) [58]. In this study, the variation among the codon usage patterns of different S. indicum genes is also reflected in the Nc values, which range from 61 down to 28.3, the latter corresponding to an extremely biased codon usage. Plotting Nc against GC3s has been suggested as a means of characterizing codon usage variation among genes [65]. In line with this, if codon usage is affected only by GC3 content, the Nc values lie on the expected Nc curve, which indicates mutational pressure [80]. Although the points for most genes follow this trend in this study, some genes lie below the expected curve, which indicates the dominant role played by selection pressure in these genes. Furthermore, the PR2 plot showed that the S. indicum genes did not use GC and AT equally, and the neutrality plot showed that the slope (0.077) of the regression line of GC12 on GC3 was close to zero. This unequal usage at the degenerate codon positions, together with the nearly horizontal line of the neutrality plot, gives evidence that natural selection has played a major role in shaping codon usage bias in Sesamum indicum. Similarly, in Oryza sativa, Zea mays, Arabidopsis thaliana, Triticum aestivum [81], Silene latifolia [82] and Epichloë festucae [83], natural selection plays a more important role in forming codon usage bias. Generally, codon usage bias in different organisms is determined by a balance between mutation pressure and natural selection [77, 84, 85].
Codon usage variation among genes might be one of the factors driving CUB in S. indicum. To examine this, we performed correspondence analysis of RSCU values to ascertain the major trends in codon usage variation. The high degree of correlation between the positions of genes along axis 1 and GC3s, and with C3s more than with G3s, suggests that there is one major source of synonymous codon usage variation among these S. indicum genes, namely GC3s content. However, the first axis explained only part of the variation in codon usage among these genes (15.57 %), while axis 2 accounted for 5.26 %. It is therefore possible to conclude that, besides mutational bias and natural selection, several other factors may be at least partially responsible for codon usage bias and its variation in S. indicum genes. Correspondence analysis also revealed a strong positive correlation between protein length and axis 1; however, protein length and Nc showed only a weak linear correlation. Since the cost of translating a protein is directly proportional to its length, there might be greater pressure to select the most accurate codons in longer genes than in shorter genes in order to avoid missense errors. In eukaryotic organisms, it has been argued that selection may act to decrease the length of highly expressed genes [28]. Longer proteins displayed weaker codon usage bias than shorter proteins; however, shorter proteins did not particularly affect codon usage. This indicates that the codon usage pattern in S. indicum might be shaped by other selective forces.
Previous studies have shown that other factors driven by natural selection can cause codon usage bias, such as the grand average of hydropathicity (GRAVY) and the aromaticity of amino acid usage (AROMO) [86]. The significant positive correlations of the GRAVY and AROMO values support the finding that both properties of the proteins encoded by the coding sequences could be associated with the codon usage bias of S. indicum genes. Previous studies have found significant positive and negative correlations between GRAVY and AROMO values and codon usage bias in a variety of organisms, such as Pisum species [22], Ginkgo biloba [77], Oncidium Gower Ramsey [87], and Epichloë festucae [9]. Finally, in the current study, we investigated codon usage bias in the seven known lignan biosynthesis related genes and observed substantial differences in codon usage among them.
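For readers unfamiliar with the two protein indices, the following Python sketch shows how GRAVY and aromaticity are typically computed. It assumes the Kyte-Doolittle hydropathy scale for GRAVY and counts Phe, Trp and Tyr as aromatic residues; the function names are hypothetical, and the software actually used in the study may differ in detail.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = set("FWY")  # Phe, Trp, Tyr


def _standard_residues(protein):
    """Keep only the 20 standard residues (drops stop symbols and ambiguity codes)."""
    residues = [aa for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("no standard amino acids in sequence")
    return residues


def gravy(protein):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    residues = _standard_residues(protein)
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


def aromaticity(protein):
    """Fraction of residues that are aromatic."""
    residues = _standard_residues(protein)
    return sum(aa in AROMATIC for aa in residues) / len(residues)


# Toy peptide (hypothetical): negative GRAVY values indicate a hydrophilic protein.
print(gravy("MKTFWY"), aromaticity("MKTFWY"))
```

Per-gene values from functions like these could then be correlated with the axis 1 coordinates or with Nc to test the association described above.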

Conclusion

In summary, the current study revealed the pattern of codon usage bias in the Sesamum indicum genome in association with the different factors responsible for the bias. Based on the current investigation, codon usage bias in S. indicum appears to result from the combined effect of several factors, such as nucleotide composition, mutation bias, natural selection, protein length (at least partially), aromaticity and hydropathicity.

Declarations

Author contribution statement

Mebeaselassie Andargie; Zhu Congyi: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Wrote the paper.

Funding statement

Mebeaselassie Andargie was supported through a Georg Förster Research Fellowship.

Data availability statement

Data included in article/supplementary material/referenced in article.

Declaration of interests statement

The authors declare no conflict of interest.

Additional information

No additional information is available for this paper.

Supplementary Table 1

Means and standard deviations of several index numbers from 23,538 genes in Sesamum indicum.

Class	Genes	Codons	GC all (%)	GC1 (%)	GC2 (%)	GC3 (%)	GC3s (%)	T3s (%)	C3s (%)	A3s (%)	G3s (%)	Gravy	Aromo	ENC	CAI
Total	23,538	9,911,268	46.89 ± 4.39	51.95 ± 6.22	41.86 ± 4.68	46.85 ± 4.19	44.86 ± 9.37	37.47 ± 7.59	27.11 ± 7.87	30.49 ± 6.68	30.21 ± 6.81	−0.31 ± 0.37	0.08 ± 0.03	53.54 ± 4.53	0.20 ± 0.03

GC all = total guanine-cytosine content of the entire gene; GC1, GC2 and GC3 = GC content at the first, second and third codon positions, respectively; GC3s = proportion of GC nucleotides at the third (variable) coding position of synonymous codons; T3s, C3s, A3s and G3s = frequency of T, C, A and G, respectively, at the third position of codons; Gravy = grand average of hydropathicity; Aromo = frequency of aromatic amino acids; ENC = effective number of codons; CAI = codon adaptation index.

74 in total

1. Mutational and selective pressures on codon and amino acid usage in Buchnera, endosymbiotic bacteria of aphids.

Authors: Claude Rispe; François Delmotte; Roeland C H J van Ham; Andres Moya
Journal: Genome Res Date: 2003-12-12 Impact factor: 9.043

2. Nearly neutrality and the evolution of codon usage bias in eukaryotic genomes.

Authors: Sankar Subramanian
Journal: Genetics Date: 2008-04 Impact factor: 4.562

3. Directional mutation pressure and transfer RNA in choice of the third nucleotide of synonymous two-codon sets.

Authors: S Osawa; T Ohama; F Yamao; A Muto; T H Jukes; H Ozeki; K Umesono
Journal: Proc Natl Acad Sci U S A Date: 1988-02 Impact factor: 11.205

4. Codon distribution in vertebrate genes may be used to predict gene length.

Authors: W Bains
Journal: J Mol Biol Date: 1987-10-05 Impact factor: 5.469

5. Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 Escherichia coli chromosome-encoded genes.

Authors: J R Lobry; C Gautier
Journal: Nucleic Acids Res Date: 1994-08-11 Impact factor: 16.971