The Codon Statistics Database: A Database of Codon Usage Bias.

Krishnamurthy Subramanian¹, Bryan Payne¹, Felix Feyertag¹, David Alvarez-Ponce¹.

Abstract

We present the Codon Statistics Database, an online database that contains codon usage statistics for all the species with reference or representative genomes in RefSeq (over 15,000). The user can search for any species and access two sets of tables. One set lists, for each codon, the frequency, the Relative Synonymous Codon Usage, and whether the codon is preferred. Another set of tables lists, for each gene, its GC content, Effective Number of Codons, Codon Adaptation Index, and frequency of optimal codons. Equivalent tables can be accessed for (1) all nuclear genes, (2) nuclear genes encoding ribosomal proteins, (3) mitochondrial genes, and (4) chloroplast genes (if available in the relevant assembly). The user can also search for any taxonomic group (e.g., "primates") and obtain a table comparing all the species in the group. The database is free to access without registration at http://codonstatsdb.unr.edu.

Entities: Chemical

Keywords: codon bias; codon usage; database; synonymous codons

Substances: Codon

Year: 2022; PMID: 35859338; PMCID: PMC9372565; DOI: 10.1093/molbev/msac157

Source DB: PubMed; Journal: Mol Biol Evol; ISSN: 0737-4038; Impact factor: 8.800

Introduction

Most amino acids are encoded by multiple synonymous codons. Despite encoding the same amino acid, some synonymous codons are used significantly more often than others, a phenomenon known as codon usage bias. Species differ significantly in their codon preferences: for instance, glutamic acid is preferentially encoded by GAG in humans, whereas it is preferentially encoded by GAA in Escherichia coli (Ikemura 1982; Sharp et al. 2010). In addition, genes within any given genome differ in their patterns of codon usage. In particular, gene expression levels correlate significantly with gene-specific metrics of codon usage such as the Effective Number of Codons (ENC; Wright 1990), the Codon Adaptation Index (CAI; Sharp and Li 1987), or the frequency of optimal codons (Fop; Ikemura 1985) (e.g., Gouy and Gautier 1982). Codon preferences can be affected by a number of factors, including the genome’s nucleotide composition (e.g., AT-rich genomes tend to use codons ending in A or T) and translational selection (codons that are translated by highly abundant tRNAs are translated faster and with fewer errors; e.g., Ikemura 1985; Hershberg and Petrov 2008). Understanding codon preferences across species and genes is important not only for understanding genome evolution but also for tasks such as heterologous expression, gene prediction, and phylogenetic inference (e.g., Gustafsson et al. 2004; Christianson 2005; Rota-Stabelli et al. 2013). In addition, the patterns of codon usage of viruses tend to be similar to those of their host species (e.g., Shackelton et al. 2006). Despite the relevance of maintaining species- and gene-specific codon usage information, existing databases have not been updated in a long time, focus on specific taxa, and/or do not provide gene-specific metrics (Nakamura et al. 2000; Hilterbrand et al. 2012; Athey et al. 2017).

Implementation

For each of the species with reference or representative genomes in the RefSeq database (release 207), we chose one full assembly (in order of preference: the one labeled as “reference,” the one with the highest assembly level, or the most recent one) and retrieved the corresponding coding sequences (CDSs) file. Using that file as input, a number of tables were pre-computed using an R pipeline. To avoid counting the same codons more than once, only one CDS per gene was used (if multiple were available, the longest one was chosen). For each species, we computed the total frequency of each codon and used this information to compute the Relative Synonymous Codon Usage (RSCU) of each codon. For each gene, we computed the GC content of the entire CDS (GC), the GC content at third codon positions (GC3), the ENC, and the RSCU of each codon. For species with over 1,000 genes, we also compared genes inferred to be highly expressed (bottom 10% of ENC values, i.e., the strongest codon bias) with genes inferred to be lowly expressed (top 10% of ENC values). Codons with significantly higher RSCU values in the highly expressed gene set (according to a Mann–Whitney U test) were considered preferred/optimal. We then computed the Fop of each gene. The highly expressed gene set was also used as the reference to compute the CAI of each gene. The web interface was created using Perl CGI.
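The codon-level and composition statistics described above can be illustrated with a minimal Python sketch. This is not the authors' R pipeline: it assumes the standard genetic code (mitochondrial and some other genomes use different codes), leaves stop codons out of the RSCU calculation, and uses the usual RSCU definition (observed codon count divided by the count expected if all synonymous codons of that amino acid were used equally). All function names are hypothetical.

```python
from collections import Counter, defaultdict

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def count_codons(cds):
    """Count in-frame codons of one CDS, skipping codons with ambiguous bases."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    triplets = (cds[i:i + 3] for i in range(0, usable, 3))
    return Counter(t for t in triplets if t in CODON_TABLE)


def rscu(counts):
    """RSCU per sense codon: count * (family size) / (family total)."""
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":  # stop codons are left out in this sketch
            families[aa].append(codon)
    values = {}
    for family in families.values():
        total = sum(counts.get(c, 0) for c in family)
        if total == 0:
            continue  # amino acid absent: RSCU is undefined, so skip it
        for c in family:
            values[c] = counts.get(c, 0) * len(family) / total
    return values


def gc_content(cds):
    """Fraction of G or C bases over the whole CDS."""
    cds = cds.upper()
    return sum(base in "GC" for base in cds) / len(cds)


def gc3_content(cds):
    """Fraction of G or C bases at third codon positions."""
    third = cds.upper()[2::3]  # indices 2, 5, 8, ... of complete codons
    return sum(base in "GC" for base in third) / len(third)


toy_cds = "ATGGAAGAAGAGTAA"  # made-up sequence: Met-Glu-Glu-Glu-stop
toy_rscu = rscu(count_codons(toy_cds))
print(round(toy_rscu["GAA"], 2), round(toy_rscu["GAG"], 2))
print(round(gc_content(toy_cds), 2), round(gc3_content(toy_cds), 2))
```

For the made-up toy sequence, GAA is used twice and GAG once, so their RSCU values are 1.33 and 0.67; GC is 0.33 and GC3 is 0.4. Species-level counts could be obtained by summing the per-gene counters, for example `sum((count_codons(c) for c in cds_list), Counter())`.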

The Codon Statistics Database

We present the Codon Statistics Database, an online database that contains codon usage information for all species with reference or representative genomes in RefSeq (over 15,000). The user can search for any species or taxonomic group by taxonomic ID (e.g., “9606”), scientific name (e.g., “Homo sapiens”), or common name (e.g., “human”), and select an option from a drop-down menu. If a species is selected, the user is directed to a table that lists, for each codon, the encoded amino acid, the total count in the genome, the RSCU, and whether the codon is preferred or unpreferred (fig. 1). The user can view and download equivalent tables for (1) all nuclear genes (default option), (2) nuclear genes encoding ribosomal proteins (this subset is included since such proteins are often highly expressed and thus subject to strong codon bias), (3) mitochondrial genes, and (4) chloroplast genes (if such gene sets are available in the relevant genome assembly). For viruses, only one option including all genes is available. Additionally, for each gene set, the user can download a tab-delimited file (.tsv) listing the following statistics for each gene: GC, GC3, ENC, CAI, and Fop.
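To make the two reference-based statistics in the per-gene file more concrete, the following Python sketch computes CAI and Fop for a single CDS. It is an illustration under stated assumptions, not the database's own code: CAI is taken as the geometric mean of relative adaptiveness values derived from a reference gene set (following the general definition of Sharp and Li 1987), Fop as the fraction of preferred codons among codons of amino acids that have synonymous alternatives, and the standard genetic code is assumed. The 0.5 pseudocount for codons absent from the reference set, the exclusion of Met, Trp, and stop codons, and all function names are choices of this sketch.

```python
import math
from collections import Counter, defaultdict

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))

# Synonymous families with at least two codons (drops Met, Trp, and stops).
_by_amino_acid = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    if _aa != "*":
        _by_amino_acid[_aa].append(_codon)
FAMILIES = {aa: cs for aa, cs in _by_amino_acid.items() if len(cs) > 1}
INFORMATIVE = {c for cs in FAMILIES.values() for c in cs}


def codons_of(cds):
    """Split a CDS into in-frame triplets, ignoring a trailing partial codon."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def relative_adaptiveness(reference_cdss, pseudocount=0.5):
    """Weight of each codon: its reference count over the family maximum."""
    counts = Counter(c for cds in reference_cdss for c in codons_of(cds))
    weights = {}
    for family in FAMILIES.values():
        # Assumption: unseen codons get a pseudocount so that log() is defined.
        adjusted = {c: counts[c] if counts[c] > 0 else pseudocount for c in family}
        top = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / top
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of the informative codons in the CDS."""
    logs = [math.log(weights[c]) for c in codons_of(cds) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


def fop(cds, preferred):
    """Fraction of informative codons that belong to the preferred set."""
    informative = [c for c in codons_of(cds) if c in INFORMATIVE]
    if not informative:
        return float("nan")
    return sum(c in preferred for c in informative) / len(informative)


# Made-up reference set in which GAA is used three times and GAG once.
toy_weights = relative_adaptiveness(["GAAGAAGAAGAG"])
print(round(cai("GAAGAG", toy_weights), 3))
print(fop("ATGGAAGAG", {"GAA"}))
```

In the made-up example, GAG has a weight of 1/3 relative to GAA, so a gene with one GAA and one GAG has a CAI of about 0.577; the Fop example returns 0.5 because one of the two informative codons is in the preferred set (ATG is ignored). In the database itself, the reference set is the group of genes inferred to be highly expressed, and the preferred codons are those identified by the statistical comparison described in the Implementation section.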

Fig. 1.

Species summary. Codon statistics corresponding to all human nuclear genes are shown.

If a taxonomic group with multiple species is selected (e.g., “7215,” “Drosophila,” or “fruit flies”), the user is presented with a table comparing all the species in the group (fig. 2). The user has the option to visualize either codon counts or RSCU values. Preferred codons in each species are marked with asterisks.

Fig. 2.

Taxonomic group summary. Codon preferences for species in the genus Drosophila are shown.

References (13 in total)

1. Codon usage tabulated from international DNA sequence databases: status for the year 2000.

Authors: Y Nakamura; T Gojobori; T Ikemura
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Codon bias and heterologous protein expression. (Review)

Authors: Claes Gustafsson; Sridhar Govindarajan; Jeremy Minshull
Journal: Trends Biotechnol Date: 2004-07 Impact factor: 19.536

3. Forces that influence the evolution of codon bias. (Review)

Authors: Paul M Sharp; Laura R Emery; Kai Zeng
Journal: Philos Trans R Soc Lond B Biol Sci Date: 2010-04-27 Impact factor: 6.237

4. The 'effective number of codons' used in a gene.

Authors: F Wright
Journal: Gene Date: 1990-03-01 Impact factor: 3.688

5. Selection on codon bias. (Review)

Authors: Ruth Hershberg; Dmitri A Petrov
Journal: Annu Rev Genet Date: 2008 Impact factor: 16.830

6. Codon usage patterns distort phylogenies from or of DNA sequences.

7. The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications.

8. Serine codon-usage bias in deep phylogenomics: pancrustacean relationships as a case study.

9. CBDB: the codon bias database.

10. Evolutionary basis of codon usage and nucleotide composition bias in vertebrate DNA viruses.