CodonO: codon usage bias analysis within and across genomes.

Michael C Angellotti, Shafquat B Bhuiyan, Guorong Chen, Xiu-Feng Wan

Abstract

Synonymous codon usage biases are associated with various biological factors, such as gene expression level, gene length, gene translation initiation signal, protein amino acid composition, protein structure, tRNA abundance, mutation frequency and patterns, and GC compositions. Quantification of codon usage bias helps in understanding the evolution of living organisms. A pipeline for codon usage bias analyses within and across genomes is therefore in demand. Here we present the CodonO webserver, a user-friendly tool for codon usage bias analyses across and within genomes in real time. The webserver is available at http://www.sysbiology.org/CodonO. CONTACT: wanhenry@yahoo.com.

Year: 2007 PMID: 17537810 PMCID: PMC1933134 DOI: 10.1093/nar/gkm392

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Within the standard genetic code, all amino acids except Met and Trp are encoded by more than one codon; these are called synonymous codons. DNA sequence data from diverse organisms clearly show that synonymous codons for any amino acid are not used with equal frequency, and these biases are a consequence of natural selection during evolution. Extensive studies have shown that synonymous codon usage biases are associated with various biological factors, such as gene expression level, gene length, gene translation initiation signal, protein amino acid composition, protein structure, tRNA abundance, mutation frequency and patterns, and GC compositions (1–11). Quantification of codon usage bias, especially at genomic scale, helps in understanding the evolution of living organisms. Many different approaches have been developed in the past few decades. These methods may be grouped into two categories: (i) methods based on the statistical distribution, such as the codon-usage preference bias measure (CPS) based on χ2 (12) and scaled χ2 analyses (13); (ii) methods using a group of gene sequences as reference, which can be ‘optimal codons’ [e.g. codon bias index (14)], a defined set of highly expressed genes [e.g. codon preference statistics (15) and codon adaptation index (16)], a defined gene class [e.g. Codon Bias (7)], or all genes in the entire genome [e.g. the Shannon Information Method (17)]. Most existing computational approaches are suitable only for comparing codon usage bias within a single genome. To overcome this limitation, we developed a new informatics method based on Shannon information theory, referred to as synonymous codon usage order (SCUO), which enables the measurement of synonymous codon usage bias within and across genomes (3,12). A review and comparison of SCUO and currently available methods is given in Wan et al. (18).
Several computational software packages or webservers, for instance CodonW (http://bioweb.pasteur.fr/seqanal/interfaces/codonw.html) and JCAT (19), have been developed to measure the Codon Adaptation Index (CAI) for genes. JCAT also integrates intrinsic terminators and enzyme digestion sites into its analyses. Codon usage analyses within and across genomes will facilitate the understanding of evolution and environmental adaptation of living organisms. GC compositions have been shown to drive codon and amino-acid usage and thus affect codon usage bias (20). It is therefore critical to study the correlation between GC compositions and codon usage bias. Previously, we developed an analytical model to quantify synonymous codon usage bias by GC compositions based on SCUO (11). However, it is still laborious to perform codon usage analyses within and across genomes, and to our knowledge no available tool is designed for these purposes. The CodonO webserver described here is a pipeline for codon usage bias analyses within and across genomic sequences as well as a tool for studying the correlation between codon usage bias and GC compositions, especially for microbial species. Unlike the standalone CodonO we developed earlier (10,11,18), the CodonO webserver has the following additional functions: (i) besides allowing users to compare their own submissions, it connects to a genomic database and performs analyses in real time; (ii) it can be used to study the correlation between SCUO and GC compositions; (iii) it performs statistical comparison of SCUO within and across genomes; (iv) besides SCUO values, it extracts and displays the codon usage frequency table as well as the gene attributes for each gene from the genomic database; and (v) it provides a user-friendly interface.

MATERIALS AND METHODS

Synonymous codon usage order measurement

The CodonO webserver employs the synonymous codon usage order (SCUO) measurement to calculate synonymous codon usage bias. The details of the SCUO concept and method have been described previously (10,11,18). Briefly, let $p_{ij}$ be the frequency of the j-th synonymous codon among all codons for the i-th amino acid in a sequence, and let $n_i$ be the number of synonymous codons for that amino acid. We calculate the entropy of the i-th amino acid in a sequence as

$$H_i = -\sum_{j=1}^{n_i} p_{ij} \log_2 p_{ij},$$

where 1 ⩽ i ⩽ 18 (Met and Trp are excluded because each is encoded by a single codon), j indexes the codons for the i-th amino acid, 1 ⩽ j ⩽ 6 for leucine, 1 ⩽ j ⩽ 2 for tyrosine, etc. If the synonymous codons for the i-th amino acid were used at random, one would expect a uniform distribution of them as representatives for the i-th amino acid. Thus, the maximum entropy for the i-th amino acid in each sequence is

$$H_i^{\max} = -\log_2 \frac{1}{n_i} = \log_2 n_i.$$

We can then calculate SCUO for the i-th amino acid in each sequence as

$$O_i = \frac{H_i^{\max} - H_i}{H_i^{\max}}.$$

The average SCUO for each sequence summarizes the SCUO from each amino acid:

$$\mathrm{SCUO} = \sum_{i=1}^{18} F_i O_i,$$

where $F_i$ is the fraction of the counted codons in the sequence that encode the i-th amino acid. The SCUO represents the synonymous codon usage bias for the entire sequence. Thus, 0 ⩽ SCUO ⩽ 1, and a larger SCUO denotes a higher codon usage bias in the sequence.
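The calculation described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the CodonO implementation: the function name `scuo`, the handling of ambiguous bases (skipped) and of trailing partial codons (ignored) are assumptions made here, and each amino acid is weighted by its share of the counted codons, which is our reading of the "average SCUO" for a sequence.

```python
import math
from collections import Counter, defaultdict

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Number of synonymous codons per amino acid; Met, Trp and stops are excluded,
# which leaves the 18 amino acids used by SCUO.
DEGENERACY = Counter(aa for aa in CODON_TABLE.values() if aa not in "MW*")


def scuo(cds):
    """Return the SCUO of one coding sequence (0 = no bias, 1 = maximal bias)."""
    cds = cds.upper().replace("U", "T")
    counts = defaultdict(Counter)  # amino acid -> codon -> occurrences
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[pos:pos + 3]
        aa = CODON_TABLE.get(codon)  # None for codons with ambiguous bases
        if aa in DEGENERACY:
            counts[aa][codon] += 1

    total = sum(sum(c.values()) for c in counts.values())
    if total == 0:
        return 0.0  # assumption: no informative codons means no measurable bias

    value = 0.0
    for aa, codon_counts in counts.items():
        n_aa = sum(codon_counts.values())
        # Observed entropy over the synonymous codons of this amino acid.
        h = -sum((x / n_aa) * math.log2(x / n_aa) for x in codon_counts.values())
        h_max = math.log2(DEGENERACY[aa])   # entropy under uniform codon usage
        o_i = (h_max - h) / h_max           # per-amino-acid SCUO
        value += (n_aa / total) * o_i       # weight by amino acid share
    return value
```

A sequence that always uses the same leucine codon, such as `scuo("CTGCTGCTGCTG")`, sits at the upper bound, while a sequence that uses all six leucine codons equally often sits at the lower bound.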

Statistical methods

The CodonO webserver can perform codon usage bias analyses within genomes using Tukey statistical analysis (21) and across genomes using the Wilcoxon Two Sample Test (22). Tukey statistical analysis is a simple and powerful method for identifying outliers in a population, whether or not the population is normally distributed. We adapted the percentile calculation from the JMP method (SAS, Inc., Cary, NC, USA). With the observations sorted in ascending order, the rank of the q-th percentile is

$$R = \frac{q}{100}(n + 1),$$

where n is the number of data points; IR is the integer part of R and FR is the fractional part of R. Then,

q-th percentile = IR-th observation + FR × [(IR + 1)-th observation − IR-th observation].

The Tukey outliers are genes with SCUO values less than Q1 − 1.5 IQR or greater than Q3 + 1.5 IQR, where Q1 and Q3 are the 25th and 75th percentiles of SCUO and IQR = Q3 − Q1 is the interquartile range. The Wilcoxon Two Sample Test (22) is used to test the null hypothesis that the distributions of SCUO from two groups of sequences (e.g. genomes) are the same. The test remains sensitive even when the values in the two groups are not normally distributed.
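As an illustration of the within-genome analysis, the following Python sketch computes percentiles with the rank R = q/100 × (n + 1) and flags Tukey outliers. The function names and the dictionary input (gene identifier mapped to SCUO) are hypothetical, and clamping the percentile to the smallest or largest observation when R falls outside 1..n is an assumption made here for small samples.

```python
import math


def jmp_percentile(values, q):
    """q-th percentile (0 < q < 100) using the rank R = q/100 * (n + 1)."""
    data = sorted(values)
    n = len(data)
    rank = q / 100 * (n + 1)
    ir = math.floor(rank)   # integer part IR
    fr = rank - ir          # fractional part FR
    if ir < 1:              # assumption: clamp to the first observation
        return data[0]
    if ir >= n:             # assumption: clamp to the last observation
        return data[-1]
    # Observations are 1-indexed in the formula, hence the -1 offsets.
    return data[ir - 1] + fr * (data[ir] - data[ir - 1])


def tukey_outliers(scuo_by_gene):
    """Return the genes whose SCUO lies outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR]."""
    values = list(scuo_by_gene.values())
    q1 = jmp_percentile(values, 25)
    q3 = jmp_percentile(values, 75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {gene: v for gene, v in scuo_by_gene.items() if v < low or v > high}
```

For example, given seven genes with SCUO values 0.10, 0.11, 0.12, 0.12, 0.13, 0.14 and 0.90, the quartiles are Q1 = 0.11 and Q3 = 0.14, so only the gene at 0.90 exceeds the upper fence.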

Features

As shown in Figure 1, the CodonO server is directly connected to the GenBank genomic database and updated daily. Users can select one or multiple genomes for analysis at the same time, or upload their own datasets. The underlying computations include synonymous codon usage order (SCUO) and GC composition measurements, and the latter includes GC, GC1s, GC2s and GC3s, where GC is the overall GC composition, GC1s is the GC composition at the first site of a codon, GC2s is the GC composition at the second site of a codon, and GC3s is the GC composition at the third site of a codon. The results are plotted in a two-dimensional graph in which users can visualize and compare them. The webserver can display the results for multiple genomes in the same plot, so that users can analyse the two-dimensional differences (GC/GC1s/GC2s/GC3s versus SCUO) between genes within and across genomes (Figure 2A) (11). Generally, a very low or very high GC composition is associated with a large codon usage bias. It has been shown that codon usage bias in some bacteria and archaea is affected by GC composition and environmental conditions (e.g. temperature) (23). Users can therefore perform these types of analyses according to their own preferences.
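The four GC measurements can be illustrated with a short Python sketch that follows the definitions given in this section (GC at each codon site, computed over all codons). The function name `gc_profile` and the choice to ignore a trailing partial codon are assumptions for illustration only.

```python
def gc_profile(cds):
    """Return GC, GC1s, GC2s and GC3s (fractions in [0, 1]) for one coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3   # ignore a trailing partial codon
    cds = cds[:usable]
    if not cds:
        raise ValueError("sequence must contain at least one complete codon")

    def gc_fraction(bases):
        # Share of G or C among the given bases.
        return sum(b in "GC" for b in bases) / len(bases)

    return {
        "GC": gc_fraction(cds),          # all sites
        "GC1s": gc_fraction(cds[0::3]),  # first codon site
        "GC2s": gc_fraction(cds[1::3]),  # second codon site
        "GC3s": gc_fraction(cds[2::3]),  # third codon site
    }
```

Pairing each gene's `gc_profile` values with its SCUO gives the points for a GC/GC1s/GC2s/GC3s versus SCUO plot of the kind described here.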

Figure 1.

Simplified CodonO webserver infrastructure.

Figure 2.

(A) Visualization of the correlation between synonymous codon usage bias and GC compositions; (B) Visualization of synonymous codon usage bias for each gene in a specific genome; (C) Statistical analysis of synonymous codon usage bias.

As mentioned in the ‘Statistical methods’ section, the webserver can identify the outliers for a genome or a group of sequences based on Tukey statistical analysis (21). Users can select an ‘outlier’ from the plot and find the associated information for each codon and the annotation of a specific gene (Figure 2B), in which the outliers are marked in a different color from the other members of the SCUO population. To compare across genomes, the CodonO webserver applies the Wilcoxon Two Sample Test (22) to test whether the SCUO populations of different genomes are the same. The P-values from the statistical comparison between genomes are listed in a table (Figure 2C), and a P-value less than 0.05 indicates a significant difference between the two SCUO populations compared.
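The across-genome comparison can be sketched as a two-sided Wilcoxon rank-sum test. The version below is a simplified illustration under stated assumptions, not the webserver's code: it uses the large-sample normal approximation, assigns average ranks to ties, and applies neither a tie correction nor a continuity correction. The function name `wilcoxon_rank_sum` is hypothetical.

```python
import math


def wilcoxon_rank_sum(x, y):
    """Two-sided Wilcoxon two-sample (rank-sum) test, normal approximation.

    Returns (z, p_value) for the null hypothesis that x and y come from
    the same distribution.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    # Give tied values their average rank and sum the ranks of the first sample.
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value
```

Called with the per-gene SCUO values of two genomes, a returned `p_value` below 0.05 would correspond to the significance threshold described above. For small samples an exact test would be more appropriate than this approximation.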

Implementation

The programs in this package are written in C/C++ or Java. The shell scripts are written in Korn shell in order to achieve high performance. GNUPlot is used for visualization. Cascading style sheets (CSS) are used for a consistent look across the pages, which also makes it possible to change the overall design just by replacing the CSS definition file. PHP, which is itself written in C, is used for server-side scripting. To achieve high performance for genome-scale computing, we apply a hash function or a binary tree, which gives the codon usage analyses a time complexity of O(n) or O(n log(n)), respectively. The webserver also includes special functions that address security and concurrency issues.

ACCESS

CodonO has been tested on Microsoft Internet Explorer, Netscape and Mozilla Firefox. Users need JavaScript enabled to obtain the full functionality of the CodonO server. The webserver is available at http://www.sysbiology.org/CodonO/ and runs in real time. Users can compare up to 16 genomes at the same time.

CONCLUSIONS

In summary, the CodonO webserver has three major computational features for codon usage bias analyses: (i) it calculates the codon usage bias for one or more genomes; (ii) it compares and visualizes the correlation between codon usage bias and GC compositions; (iii) it performs statistical analyses of codon usage bias within and across genomes. Thus, CodonO provides an efficient, user-friendly web service for codon usage bias analyses across and within genomes using SCUO in real time.

19 in total

1. The base composition of the genes is correlated with the secondary structures of the encoded proteins.

Authors: Giuseppe D'Onofrio; Tapash Chandra Ghosh; Giorgio Bernardi
Journal: Gene Date: 2002-10-30 Impact factor: 3.688

2. Synonymous codon usage is subject to selection in thermophilic bacteria.

Authors: David J Lynn; Gregory A C Singer; Donal A Hickey
Journal: Nucleic Acids Res Date: 2002-10-01 Impact factor: 16.971

3. Correlations between Shine-Dalgarno sequences and gene features such as predicted expression levels and operon structures.

Authors: Jiong Ma; Allan Campbell; Samuel Karlin
Journal: J Bacteriol Date: 2002-10 Impact factor: 3.490

4. Codon usage in bacteria: correlation with gene expressivity.

Authors: M Gouy; C Gautier
Journal: Nucleic Acids Res Date: 1982-11-25 Impact factor: 16.971

5. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system.

Authors: T Ikemura
Journal: J Mol Biol Date: 1981-09-25 Impact factor: 5.469

6. The relationship between synonymous codon usage and protein structure in Escherichia coli and Homo sapiens.

Authors: Wanjun Gu; Tong Zhou; Jianmin Ma; Xiao Sun; Zuhong Lu
Journal: Biosystems Date: 2004-02 Impact factor: 1.973

7. Codon selection in yeast.

Authors: J L Bennetzen; B D Hall
Journal: J Biol Chem Date: 1982-03-25 Impact factor: 5.157

8. JCat: a novel tool to adapt codon usage of a target gene to its potential expression host.

Authors: Andreas Grote; Karsten Hiller; Maurice Scheer; Richard Münch; Bernd Nörtemann; Dietmar C Hempel; Dieter Jahn
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

9. A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes.

Authors: R D Knight; S J Freeland; L F Landweber
Journal: Genome Biol Date: 2001-03-22 Impact factor: 13.583

10. Quantitative relationship between synonymous codon usage bias and GC composition across unicellular genomes.

Authors: Xiu-Feng Wan; Dong Xu; Andris Kleinhofs; Jizhong Zhou
Journal: BMC Evol Biol Date: 2004-06-28 Impact factor: 3.260
