Shir Bahiri-Elitzur1, Tamir Tuller1,2.
Abstract
Codon usage bias (CUB) refers to the phenomenon that synonymous codons are used at different frequencies in most genes and organisms. The general assumption is that codon biases reflect a balance between mutational biases and natural selection. Today we understand that codon content is related to, and can affect, all gene expression steps. Starting from the 1980s, codon-based indices have been used to answer different questions in all biomedical fields, including systems biology, agriculture, medicine, and biotechnology. In general, codon usage bias indices weigh each codon or a small set of codons to estimate the fit of a certain coding sequence to a certain phenomenon (e.g., bias in codons, adaptation to the tRNA pool, frequencies of certain codons, transcription elongation speed, etc.) and are usually easy to implement. Today there are dozens of such indices; thus, this paper aims to review and compare the different codon usage bias indices, their applications, and advantages. In addition, we perform an analysis demonstrating that most indices tend to correlate even though they aim to capture different aspects. Due to the centrality of codon usage bias to different gene expression steps, it is important to keep developing new indices that can capture additional aspects that are not modeled by the current indices.
Keywords: Codon usage bias; Gene expression; Transcript evolution
Year: 2021 PMID: 34025951 PMCID: PMC8122159 DOI: 10.1016/j.csbj.2021.04.042
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Codon usage of all 64 codons in the different analyzed organisms, organelles, and viruses. The codon usage was calculated as the ratio between the number of appearances of a codon and the total number of appearances of all its synonymous codons. The color bar represents the frequency of each codon (more details in the case study section). As can be seen, there are large differences in codon usage among the analyzed genomes. Equal frequencies can also be seen for codons of amino acids that are encoded by a single codon.
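The ratio described in the Fig. 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `synonymous_frequencies` is hypothetical, the standard genetic code is assumed, and the three stop codons are treated as one synonymous family so that all 64 codons receive a value.

```python
from collections import Counter, defaultdict

# Standard genetic code; codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def synonymous_frequencies(cds):
    """Map each observed codon to its count divided by the total count of
    all codons encoding the same amino acid (or stop signal)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)  # skip ambiguous codons
    family_totals = defaultdict(int)
    for codon, n in counts.items():
        family_totals[CODON_TABLE[codon]] += n
    return {c: n / family_totals[CODON_TABLE[c]] for c, n in counts.items()}


# Toy sequence: Met, Ala, Ala, Ala, Trp.
freqs = synonymous_frequencies("ATGGCTGCTGCCTGG")
print(freqs)
```

In the toy sequence, alanine is encoded twice by GCT and once by GCC, so the two codons receive frequencies of 2/3 and 1/3, while ATG and TGG, which have no synonyms, receive 1.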
Fig. 2. Different types of CUB indices examined in the paper. Indices that are based on the non-uniform usage of synonymous codons. Indices based on codon frequency in a reference set of genes. To deal with alternative splicing when using such indices, the longest isoform of the gene is usually considered. Indices that are based on the adaptation to the tRNA levels, and their supply. Indices that consider complex patterns of codons that affect translation, transcription, and mRNA stability. Indices that are based on a direct experimental procedure such as ribosome profiling.
Codon bias indices that are based on measures of the non-uniform usage of synonymous codons.
| Index type: The non-uniform usage of synonymous codons | ||||||
|---|---|---|---|---|---|---|
| ENC (effective number of codons) | Similar to the computation of effective population size in population genetics. Considers amino acid degeneracy level and calculates the total number of different codons used in a sequence. | Investigate codon usage patterns across genes and in different organisms. ENC plots provide a visual display of CUB variation for a set of genes. It can easily be adapted to study genes that do not use the 'universal' genetic code. | Range: 20–61. A value of 20 indicates that only one codon is used for each AA, while a value of 61 indicates that all synonymous codons are used with equal probability. | |||
| RSCU (relative synonymous codon use) | Based on the ratio between the observed number of codons and the number of times the codon would be observed if the synonymous codon usage was completely random. | Acts as a codon weight for many indices requiring the codon count to be normalized into codon frequency and to remove the dependence on gene length. RSCU can be used to find optimal codons and understand evolutionary processes. | For average synonymous codon usage, the RSCU is 1. For codon usage more infrequent than the average codon usage, the RSCU is less than one, and for more frequent usage than the average for the amino acid, the RSCU is greater than 1. | |||
| CPB (codon preference bias) | Based on the multinomial and Poisson distributions. It can be applied to a relatively short piece of sequence. Not used often, the method is quite theoretical. | Scan DNA sequences and measure the strength of codon preference. It can be used to detect bona fide coding sequences. | A higher value indicates more bias towards optimal codons. | |||
| The ‘scaled’ χ² | Calculates the deviation from equal usage of a codon within the synonymous group, divided by the total number of codons in the gene, using the chi-squared (χ²) statistic. | Measures the bias in silent DNA divergence codon usage. | Ranges from 0 to 1. A higher value indicates a stronger bias. | |||
| P (codon preference) | Measures the likelihood of a particular set of codons to a predetermined preferred usage. P is computed for all three reading frames. | It is useful for locating genes in sequenced DNA, predicting the relative level of their expression and for detecting DNA sequencing errors resulting in the insertion or deletion of bases within a coding sequence. | A higher value indicates a more frequent use of preferred codons. | E.coli, | ||
| RCBS (relative codon bias strength) | Measures the difference of the observed frequency of a codon from the expected frequency under the hypothesis that the frequency of codons is only affected by the frequencies of single nucleotides. | Predicting gene expression levels from RSCU. Useful for comparing different sets of genes. | A value close to 0 indicates a lack of bias for the codons. | |||
| ICDI (intrinsic codon deviation index) | Uses the RSCU and the degeneracy of amino acid in the sequence. It gives equal weight to all amino acids included. | Estimate codon bias of genes from species in which optimal codons are not known. It can help predict gene functionality. | Values between 0 (equal usage of all codons) and 1 (one codon per amino acid) | |||
| Dmean (mean dissimilarity-based index) | Quantifies the level of diversity in synonymous codon usage among all genes (or a subset of genes) within a genome. The index is based on the average Pearson correlation between the normalized codon frequency vectors of all pairs of genes. | It is used to measure the diversity level of codon usage among genes. This index can be applied to other compositional features such as amino acid usage and dinucleotide relative abundance as a genomic signature. This index can improve the understanding of compositional diversity among genes. | Lower average correlation values indicate low diversity level in the use of codons. | 268 bacterial genomes such as | – | |
| Ew (weighted sum of relative entropy) | Measures the degree of deviation away from equal codon usage using a weighted relative entropy of each amino acid. It is defined as the sum of each amino acid's relative entropy weighted by its relative frequency in the sequence. | Makes it possible to avoid some amino acid usage biases and to obtain quantitative information about the degree of overall synonymous codon usage bias of a gene. | Ranges from 0 (maximum bias) to 1 (no bias). | – | ||
| SCUO (Synonymous codon usage order) | Measures the deviation from uniform distribution based on the Shannon entropy. It uses the normalized difference between the maximum entropy and the observed entropy. | This index compares the synonymous codon usage across different organisms. | Ranges from 0 (no bias) to 1 (maximum bias), since the normalized difference is zero when the observed entropy equals the maximum entropy. | |||
| DCBS (directional codon bias score) | A correction to the RCBS that considers over- and underrepresented codons. | It can be used to predict gene expression levels from Relative Codon Usage Bias. It is useful for comparing different sets of genes. | A value close to 0 indicates a lack of bias for the codons. | – | ||
| MILC (Measure independent of length and composition) | Measures the different usage of codons based on a log-likelihood ratio of the expected and observed number of codons. | It is used for the prediction of expression levels by taking the ratio of the MILC of a gene to the MILC of a reference set of highly expressed proteins. | A higher value indicates stronger bias. | - | ||
| MCB (Maximum-likelihood codon bias) | Estimates the bias in codon usage using a weight for each amino acid, which is estimated by the likelihood of occurrence of each amino acid given its frequency and codon degeneracy. | Designed to account for background nucleotide composition and can also be adapted to correct for di-nucleotide biases. It can be used to estimate ancestral codon usage bias and genetic population analysis. | A higher value indicates a stronger bias. | – | ||
| CDC (Codon Deviation Coefficient) | Based on the cosine distance metric between the expected and the observed codon usage. The authors suggest estimating the significance of this index by using bootstrapping. | It is useful for determining comparative magnitudes and patterns of CUB for genes or genomes with diverse sequence compositions. | Ranges from 0 (no bias) to 1 (maximum bias) | Unicellular and multicellular genomes. | – | |
| MRI (Mutational Response Index) | Based on the difference between the scaled chi-square test and its corrected form. | Measures the extent of codon bias attributable to mutational pressure. | Positive MRI values indicate a response, whereas negative values for MRI imply a codon usage contrary to directional mutation pressure. | – | ||
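
Two of the indices in the table above, RSCU and ENC, are simple enough to sketch directly. The code below is an illustrative sketch under stated assumptions, not a reference implementation: it assumes the standard genetic code, uses the widely cited form of the ENC formula (2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, with F the codon homozygosity averaged over amino acids of equal degeneracy), and raises an error instead of applying the corrections that published tools use when a degeneracy class is missing. The function names are hypothetical.

```python
from collections import Counter, defaultdict

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Synonymous families of sense codons: amino acid -> list of codons.
FAMILIES = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES[aa].append(codon)


def count_codons(cds):
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def rscu(cds):
    """Observed codon count divided by the count expected if all synonymous
    codons of that amino acid were used equally."""
    counts = count_codons(cds)
    values = {}
    for family in FAMILIES.values():
        total = sum(counts[c] for c in family)
        if total:
            for c in family:
                values[c] = counts[c] * len(family) / total
    return values


def enc(cds):
    """Effective number of codons, capped at 61 (assumed formula, see lead-in)."""
    counts = count_codons(cds)
    homozygosity = defaultdict(list)  # degeneracy -> F values of its amino acids
    for family in FAMILIES.values():
        n = sum(counts[c] for c in family)
        if len(family) == 1 or n < 2:
            continue  # Met/Trp, or too few observations to estimate F
        sum_p2 = sum((counts[c] / n) ** 2 for c in family)
        homozygosity[len(family)].append((n * sum_p2 - 1) / (n - 1))
    total = 2.0  # Met and Trp contribute one codon each
    for degeneracy, n_amino_acids in ((2, 9), (3, 1), (4, 5), (6, 3)):
        f_values = homozygosity.get(degeneracy)
        if not f_values:
            raise ValueError(f"no usable amino acid with degeneracy {degeneracy}")
        mean_f = sum(f_values) / len(f_values)
        if mean_f <= 0:
            raise ValueError("non-positive homozygosity; sequence too short")
        total += n_amino_acids / mean_f
    return min(total, 61.0)


print(rscu("GCTGCTGCCGCT")["GCT"])
# Extreme case: every amino acid is always encoded by the same codon.
one_codon_per_aa = "".join(family[0] * 2 for family in FAMILIES.values())
print(enc(one_codon_per_aa))
```

The extreme case at the end matches the interpretation column for ENC: when a single codon is used for every amino acid, each homozygosity equals 1 and the index reaches its lower bound of 20.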
Codon bias indices based on codon frequency in a reference set of genes summary.
| Index type: Codon frequency in a reference set of genes | ||||||
|---|---|---|---|---|---|---|
| CAI (Codon adaptation index) | Assesses the relative merits of each codon. A gene score is calculated from the frequency of use of all codons in that gene based on a reference set of genes. The index assesses the extent to which selection has been effective in molding the pattern of codon usage. | Predict protein expression levels. | A higher score is given to genes that tend to include codons that are more frequent in highly expressed genes (range from 0 to 1). | |||
| FOP (Frequency of optimal codons) | The frequency of optimal codons of a gene is the ratio of the number of optimal codons to the total number of codons. The optimal codons can be defined according to nucleotide chemistry, codon usage bias, or tRNA availability. | Predicts protein production levels and shows the optimization level of synonymous codon choice. | The FOP values range from 0 (optimal codons never appear) to 1 (optimal codons always appear). | |||
| ΔFop (difference in frequencies of optimal codons between constrained and nonconstrained codons) | Measures the difference in frequencies of optimal codons used in a gene at codons related to AA that are evolutionarily conserved vs. codons related to AA that are not evolutionarily conserved. It is based on the assumption that higher codon bias exists in sites related to conserved AA. | It is a useful index to test directly the hypothesis of selection for translational accuracy and selection for the fidelity of protein synthesis. | A significant positive value indicates that optimal codon use is higher at constrained codons than at nonconstrained ones. | – | ||
| CBI (Codon bias index) | Measures the usage of optimal codons using the ratio between the number of optimal codons in a gene and the total number of codons in a gene. It uses the expected usage as a scaling factor. | Reflects the presence of components with high CUB in a particular gene. It can describe foreign gene expression in a host. | Values range between −1 and 1. A value of 1 means only preferred codons are used, zero means random choice and less than zero implies greater use of nonpreferred codons. | |||
| B (Codon usage bias) | Based on the frequency weighted sum of distances of the relative codon usage frequencies between two sets of genes. | It is used to infer the expression level by comparing the fraction of the distance of the query set with respect to all genes over the distance to a reference set, or a linear combination of reference sets. | The possible values range from 0 to 2, rarely exceeding 0.5. Higher value indicates codon usage differences between the two sets of genes. | – | ||
| CEC (Codon-enrichment correlation) | The linear correlation coefficient of the codon enrichment between the ORF and a reference set of genes. The enrichment of each codon is defined as the ratio of its frequency among the named ORFs to its expected frequency in random sequences. | Together with expression data, CEC can be used to identify spurious open reading frames and can be used to detect incorrectly assigned ORFs that are not coding for a protein. | If a sequence is not detected experimentally and the CEC is lower than the cutoff value, then the ORF is designated as spurious. | – | ||
| rCAI (relative codon adaptation index) | This index is similar to CAI, but defines each codon weight by normalizing it with the codon usage in the +1 and +2 reading frames. | It is used to discriminate between highly biased and unbiased genomic regions. It can detect codon bias patterns in overlapping genes and capture local codon bias patterns. | rCAI ranges from 0 to an upper limit, which corresponds to an imaginary gene consisting only of the maximum-weighted codons. | – | ||
| GCB | Iteratively defines the reference set used for weighting the codon frequencies. | It can be used to identify hypothetical genes. | A higher GCB value is assigned to genes that tend to include codons that are more frequent in the reference set. | - | ||
| ITE (index of translation elongation) | It is similar to CAI, but for each codon, a weight is calculated based on its frequency among the NNR and NNY codon subfamilies in the reference set. The reason for separating synonymous codons into R- and Y-ending codon subfamilies is that different tRNAs typically translate them and they are subjected to different mutation bias. | It can predict protein expression levels. | A higher score is given to genes that tend to include codons that are more frequent in highly expressed genes (range from 0 to 1). | Unicellular and multicellular organisms | - | |
| TissueCoCoPUT (Tissue specific Codon and Codon-Pair Usage Tables) | Measures the observed/expected codon, codon pair, and dinucleotides frequency and calculates the difference in different tissues. | It can be used for determining translation kinetics and efficiency across tissues. | Higher values indicate a higher compatibility to a certain tissue. | |||
| RCA (Relative codon adaptation) | This index uses a given reference set to compute observed and expected codon frequency. It considers the underlying nucleotide distribution at the third position in the codon. | It can predict protein expression levels. | A higher score for genes that tend to include codons that are more frequent in highly expressed genes. | – | ||
| COUSIN (Codon usage similarity index) | This index compares the CUB of a query against those of a reference and normalizes the output over a null hypothesis of random codon usage. | It can be used to identify differential heterogeneity between and within genomic data sets. | A COUSIN score of 1 indicates that the CUB in the query is similar to the reference data set; a COUSIN score of 0 indicates that the CUB in the query is similar to the null hypothesis (i.e., equal usage of synonymous codons). | |||
| SCCI (Self Consistent Codon Index) | This index is based on the CAI, but the chosen reference set is the most biased set of genes. | It can be used to predict gene expression levels, to guide regulatory circuit reconstruction, and to compare species. | A higher score is given to genes that tend to include codons that are more frequent in the reference set of genes (range from 0 to 1). | – | ||
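
As an illustration of the reference-set indices summarized above, the sketch below computes CAI and FOP for a coding sequence. It is a simplified sketch under explicit assumptions: the standard genetic code is used; codons of Met, Trp, and stop are excluded from both scores; codons absent from the reference set receive a pseudocount of 0.5, a common convention that is an assumption here; and the set of optimal codons for FOP is supplied by the caller. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
FAMILIES = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES[aa].append(codon)


def split_codons(cds):
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def relative_adaptiveness(reference_cds_list, pseudocount=0.5):
    """Weight of each codon: its count in the reference set divided by the
    count of the most frequent synonymous codon in the reference set."""
    counts = Counter()
    for cds in reference_cds_list:
        counts.update(split_codons(cds))
    weights = {}
    for family in FAMILIES.values():
        if len(family) == 1:
            continue  # Met and Trp carry no synonymous choice
        adjusted = {c: counts[c] or pseudocount for c in family}  # assumed pseudocount
        top = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / top
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of the scored codons in the gene."""
    logs = [math.log(weights[c]) for c in split_codons(cds) if c in weights]
    if not logs:
        raise ValueError("sequence contains no scored codons")
    return math.exp(sum(logs) / len(logs))


def fop(cds, optimal_codons):
    """Fraction of scored codons that belong to the caller-supplied optimal set."""
    scored = [c for c in split_codons(cds)
              if len(FAMILIES.get(CODON_TABLE.get(c, "*"), [])) > 1]
    if not scored:
        raise ValueError("sequence contains no scored codons")
    return sum(c in optimal_codons for c in scored) / len(scored)


weights = relative_adaptiveness(["GCTGCTGCTGCC"])  # toy reference set
print(cai("GCTGCC", weights), fop("GCTGCCATG", {"GCT"}))
```

In the toy reference set GCT is three times as frequent as GCC, so their weights are 1 and 1/3, and a gene made of one GCT and one GCC scores the geometric mean of the two. A real analysis would use a reference set of highly expressed genes, as the CAI row describes.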
Codon bias indices based on adaptation to the tRNA levels and their supply summary.
| Index type: Adaptation to the tRNA levels and their supply | ||||||
|---|---|---|---|---|---|---|
| tAI (tRNA adaptation index) | Computes a weight for each codon based on the tRNA copy number and the properties of anticodon–codon interaction. | It can be used to measure translation efficiency. | Higher values for genes that tend to include codons that are more adapted to the tRNA pool (range from 0 to 1). | – | ||
| stAI (species-specific tRNA adaptation index) | Species-specific tAI. This index estimates the tAI codon–anticodon interaction weights without the need for gene expression measurements. | It can be used to measure translation efficiency. | Higher values for genes that tend to include codons that are more adapted to the tRNA pool (range from 0 to 1). | Unicellular and multicellular organisms. | ||
| P1 index | Measures the average number of tRNA–codon interactions necessary for a correct recognition for a step in the elongation cycle. It is based on protein synthesis dynamics. | It can measure the influence of tRNA availability on the gene translation. | Lower values are related to genes that are optimized for a small number of tRNA discriminations and are often highly expressed. | – | ||
| P2 index | This index is based on the fraction of pyrimidine-ending codons that have intermediate strength. | It can measure the bias for anticodon–codon interactions of intermediary strength. | A high value for highly expressed genes and low for lowly expressed genes. | – | ||
| nTE (normalized translational efficiency) | This index considers both the supply and the demand of a codon by computing for each codon a weight which is based on the tAI and the relative frequency of the relevant codon in all the mRNAs. | It can measure translation efficiency. | Codons are considered optimal if the relative availability of cognate tRNAs exceeds their relative usage resulting in a higher value. | – | ||
| compAI (Competition Adaptation Index) | Based on the competition between cognate and near-cognate tRNAs during translation. It is defined as the harmonic average of the relative adaptiveness of the gene codons. | It can provide information about the speed of protein synthesis. | Values between 0 and 1, where values close to 0 indicate higher competition, and therefore a low translation rate. | – | ||
| TPI (tRNA-pairing index) | A measure of synonymous codon ordering that compares the number of changes of tRNA in a coding sequence to the total number of expected changes given a random distribution of the existing codons. | It can be used to measure tRNA reusage of a gene. | A value of 1 means a completely ordered sequence of the codons by their decoding tRNA. A value of −1 means a completely unordered sequence. | – | ||
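
The tAI row above describes a codon weight built from tRNA copy numbers and anticodon–codon interaction properties. A minimal sketch of that idea follows, assuming the commonly cited form in which the raw weight of a codon is the sum over the tRNAs that read it of (1 − s) × copy number, weights are normalized by the largest raw weight, and the gene score is the geometric mean of its codon weights. The copy numbers, the pairing penalties `s`, and the function names below are hypothetical values chosen only for illustration.

```python
import math


def codon_weights(trna_copies, pairings):
    """trna_copies: anticodon -> tRNA gene copy number.
    pairings: codon -> list of (anticodon, s), where s in [0, 1] is the
    penalty on that codon-anticodon pairing (0 means perfect pairing)."""
    raw = {
        codon: sum((1 - s) * trna_copies.get(anticodon, 0) for anticodon, s in pairs)
        for codon, pairs in pairings.items()
    }
    top = max(raw.values())
    if top <= 0:
        raise ValueError("no codon is recognized by any tRNA")
    return {codon: value / top for codon, value in raw.items()}


def tai(cds, weights):
    """Geometric mean of the weights of the codons in the gene.
    Codons without a weight (e.g., stop codons) are skipped."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    scored = [weights[c] for c in codons if c in weights]
    if not scored:
        raise ValueError("sequence contains no scored codons")
    if min(scored) <= 0:
        # Published implementations substitute a small value here; this sketch does not.
        raise ValueError("a codon has zero weight")
    return math.exp(sum(math.log(w) for w in scored) / len(scored))


# Hypothetical toy data for the four alanine codons only.
toy_copies = {"AGC": 10, "TGC": 5}
toy_pairings = {
    "GCT": [("AGC", 0.0)],
    "GCC": [("AGC", 0.5)],  # wobble pairing, hypothetical penalty
    "GCA": [("TGC", 0.0)],
    "GCG": [("TGC", 0.4)],  # wobble pairing, hypothetical penalty
}
toy_weights = codon_weights(toy_copies, toy_pairings)
print(toy_weights, tai("GCTGCC", toy_weights))
```

Because the weights are normalized by the maximum over all codons rather than within each amino acid, a codon can score below 1 even when it is the best available choice for its amino acid.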
Codon bias indices based on complex patterns of codons summary.
| Index type: Complex patterns of codon usage | ||||||
|---|---|---|---|---|---|---|
| ChimeraARS | Based on the tendency of a coding region to include long substrings that appear in other coding sequences and are assumed to be regulatory regions. | It is used to predict gene expression and estimate the adaptability of an ORF to the intracellular gene expression machinery of a genome (host). | A higher score is related to higher adaptivity of the gene to the host genome. | |||
| ChimeraUGEM | Region and position specific ChimeraARS. | It is used to predict gene expression and estimate the adaptability of an ORF to the intracellular gene expression machinery of a genome (host). | A higher score is related to higher adaptivity of the gene to the host genome. | |||
| GC3 (GC content at the third position of synonymous codons) | Measures the GC content at the third position of synonymous codons. It tries to capture the fact that optimal codons in highly abundant proteins tend to have pyrimidines at the third position. | It can predict protein expression levels. | Values range from 0 to 1, a higher value correlates with highly expressed genes. | |||
| SCUMBLE (Synonymous codon usage bias maximum-likelihood estimation) | The model parameters are estimated by the maximum likelihood approach. This index captures nonlinearities between expression levels and codon usage. | It can identify different sources of bias in various genomes and estimate the degree of contribution of the other sources and their effects on a gene. | A higher value for a source indicates a stronger bias related to it. | – | ||
| SEMPPR (Stochastic evolutionary model of protein production rate) | Based on the stochastic evolutionary model, which assumes that selection to reduce the cost of nonsense errors drives the evolution of codon bias. This index generates a posterior probability distribution for the protein production rate of a given gene. | It can be used to predict protein production rate and expression levels. | Genes with low production rates will have a smaller difference in the energetic usage between the highest peak and lowest probability distribution valley than those with high production rates. | – | ||
| Codon volatility | Measures the proportion of the point-mutation neighbors of a codon that encodes different amino acids. It is based on the observation that codons differ with respect to the likelihood that a point mutation will cause a nonsynonymous mutation. | It can be used to estimate selective pressures. It can be used to scan an entire genome to find genes that show significantly more, or less, pressure for amino-acid substitutions than the genome as a whole. | A gene that contains many residues under pressure for amino-acid replacements will exhibit on average elevated volatility. | – | ||
| ENcp (effective number of codon-pairs) | Measures codon pair bias. It is defined analogously to ENC. | Can be used to investigate codon-pair usage. | Ranges from 20 (very biased) to 61 (not biased at all). | 1,275,531 individual species | ||
| CPS (codon pair score) | Measures codon pair bias. It is defined analogously to RSCU. | It can be used to determine the level of similarity in codon pair preferences between viruses and their host. | A positive and higher score indicates a preferred codon pair. | |||
| Frare (frequency of rare codons) | It is defined by calculating the frequency of occurrence of all codon pairs in coding sequences. | It can be used to measure codon usage and estimate the stability of exogenous genes. | A lower value for a gene indicates that the gene is essential. | – | ||
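
Two entries of the table above, GC3 and codon volatility, can be illustrated with short functions. This is a sketch under stated assumptions rather than any published implementation: the standard genetic code is used; GC3 is computed only over codons that have synonyms (Met, Trp, and stop codons are skipped); and point-mutation neighbors that are stop codons are left out of the volatility denominator, which is one common convention and an assumption here. Function names are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def split_codons(cds):
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def gc3(cds):
    """Fraction of synonymous codons (not Met, Trp, or stop) ending in G or C."""
    codons = [c for c in split_codons(cds)
              if CODON_TABLE.get(c) not in (None, "*", "M", "W")]
    if not codons:
        raise ValueError("sequence contains no synonymous codons")
    return sum(c[2] in "GC" for c in codons) / len(codons)


def volatility(codon):
    """Proportion of the sense point-mutation neighbors of a codon that
    encode a different amino acid."""
    aa = CODON_TABLE[codon]
    neighbors = [
        codon[:pos] + base + codon[pos + 1:]
        for pos in range(3)
        for base in BASES
        if base != codon[pos]
    ]
    sense = [n for n in neighbors if CODON_TABLE[n] != "*"]  # assumed convention
    return sum(CODON_TABLE[n] != aa for n in sense) / len(sense)


def gene_volatility(cds):
    """Mean volatility over the sense codons of a coding sequence."""
    codons = [c for c in split_codons(cds) if CODON_TABLE.get(c) not in (None, "*")]
    if not codons:
        raise ValueError("sequence contains no sense codons")
    return sum(volatility(c) for c in codons) / len(codons)


print(gc3("GCTGCCATGAAG"), volatility("GCT"), volatility("ATG"))
```

For GCT, the three third-position neighbors still encode alanine while the six first- and second-position neighbors do not, giving a volatility of 6/9. ATG has no synonymous neighbor, so every mutation changes the amino acid.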
Fig. 3. A. Ribosome profiling procedure. Translation of mRNAs by ribosomes is arrested, then exposed mRNA is digested. Protected mRNA footprints are then sequenced and mapped to the genome, creating for each gene its read count profile. B. NET-seq procedure. A culture is flash frozen and cryogenically lysed. Nascent RNA is co-purified via immunoprecipitation (IP) of the RNAPII elongation complex. Conversion of RNA into DNA results in a DNA library with the RNA as an insert between DNA sequencing linkers. The sequencing primer is positioned such that the 3′ end of the insert is sequenced. m7G refers to the 7-methylguanosine cap structure at the 5′ end of nascent transcripts.
Codon bias indices based on direct experimental measurements of translation and transcription elongation summary.
| Index type: Direct experimental measurements of translation and transcription elongation | ||||||
|---|---|---|---|---|---|---|
| MTDR (Mean typical decoding rate) | Measures the codon decoding time based on ribosome profiling data. | It can be used to predict translation elongation efficiency and predict changes in translation elongation efficiency. | A higher value indicates higher translation efficiency. | – | ||
| RRT (Ribosome Residence Time) | Measures the codon decoding time based on ribosome profiling data. | It can be used to predict translation elongation efficiency. | Ranges from 0 to 1. A higher value indicates a slower codon decoding speed. | – | ||
| MTTR (mean typical transcription elongation rate) | This index estimates the typical transcription elongation of the RNA polymerase using NET-seq data. | It can be used to estimate the average transcription elongation rate of a specific gene. | Higher values are related to selective pressure for higher elongation cycles in genes. | – | ||
Fig. 4. A. Spearman correlation of different CUB indices with PA in S. cerevisiae. The indices are clustered according to types. ENC (effective number of codons), Fop (frequency of optimal codons), CAI (codon adaptation index), CBI (codon bias index), CEC (Codon-enrichment correlation), tAI (tRNA adaptation index), nTE (normalized translational efficiency), Chimera ARS, CPS (codon pair score), MTDR (mean typical decoding rate). All of the correlations between the CUB measures and PA are significant and in the expected direction. B. Spearman correlation between the different CUB indices in S. cerevisiae. Indices of the same type typically correlate better. C. Spearman correlation of different CUB indices with PA in E. coli. The indices are clustered according to types. ENC (effective number of codons), Fop (frequency of optimal codons), CAI (codon adaptation index), CBI (codon bias index), tAI (tRNA adaptation index), nTE (normalized translational efficiency), Chimera ARS, CPS (codon pair score), MTDR (mean typical decoding rate). All of the correlations between the CUB measures and PA are significant and in the expected direction. D. Spearman correlation between the different CUB indices in E. coli. Indices of the same type typically correlate better. E. Spearman correlation of different CUB indices with PA in Human. The indices are clustered according to types. ENC (effective number of codons), Fop (frequency of optimal codons), CAI (codon adaptation index), CBI (codon bias index), tAI (tRNA adaptation index), nTE (normalized translational efficiency), Chimera ARS, CPS (codon pair score). All of the correlations between the CUB measures and PA are significant and in the expected direction. F. Spearman correlation between the different CUB indices in Human. Indices of the same type typically correlate better.
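The Spearman correlations reported in Fig. 4 are Pearson correlations computed on ranks. The sketch below shows that computation in plain Python so that it has no dependencies; it is not the authors' analysis code, and the per-gene values at the end are made-up numbers used only to exercise the functions.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation between two equally long lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up per-gene values: one index that rises with abundance, one that falls.
abundance = [10.0, 40.0, 90.0, 30.0]
rising_index = [0.2, 0.5, 0.9, 0.4]
falling_index = [58.0, 45.0, 30.0, 50.0]
print(spearman(rising_index, abundance), spearman(falling_index, abundance))
```

An index on which lower values mean stronger bias, such as ENC, is expected to correlate negatively with abundance, which is why the caption speaks of the expected direction rather than of positive correlations only.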
Fig. 5. A. Dot plot of the lowest correlating indices, FOP vs. CPS, in S. cerevisiae. B. Dot plot of the highest correlating indices, FOP vs. CBI, in S. cerevisiae. C. Dot plot of the lowest correlating indices, MTDR vs. CPS, in E. coli. D. Dot plot of the highest correlating indices, FOP vs. CBI, in E. coli. E. Dot plot of the lowest correlating indices, nTE vs. CAI, in Human. F. Dot plot of the highest correlating indices, FOP vs. CBI, in Human.