CompoDynamics: a comprehensive database for characterizing sequence composition dynamics.

Shuai Jiang^1,2, Qiang Du^1,2,3, Changrui Feng^1,2,3, Lina Ma^1,2,3, Zhang Zhang^1,2,3.

Abstract

Sequence compositions of nucleic acids and proteins have a significant impact on gene expression, RNA stability, translation efficiency, RNA/protein structure and molecular function, and are associated with genome evolution and adaptation across all kingdoms of life. Therefore, a dedicated resource of sequence compositions and associated features is crucial for a wide range of biological research. Here, we present CompoDynamics (https://ngdc.cncb.ac.cn/compodynamics/), a comprehensive database of sequence compositions of coding sequences (CDSs) and genomes for all kinds of species. Taking advantage of the exponential growth of RefSeq data, CompoDynamics presents a wealth of sequence compositions (nucleotide content, codon usage, amino acid usage) and derived features (coding potential, physicochemical property and phase separation) for 118 689 747 high-quality CDSs and 34 562 genomes across 24 995 species. Additionally, interactive analytical tools are provided to enable comparative analyses of sequence compositions and molecular features across different species and gene groups. Collectively, CompoDynamics has great potential to improve understanding of the underlying roles of sequence composition dynamics across genes and genomes, providing a fundamental resource in support of a broad spectrum of biological studies.

Year: 2022 PMID: 34718745 PMCID: PMC8728180 DOI: 10.1093/nar/gkab979

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Sequence compositions are intricately linked to genome evolution and adaptation across all kingdoms of life (1–3). For example, GC content is closely associated with mutational bias (4), DNA recombination (3) and repair (5), mRNA level (6,7) and gene age (8). Codon usage bias (CUB) has significant implications for gene expression (9), translational selection (10), protein structure (11), metabolic ecology (10) and environmental adaptation (12). In addition, sequence composition-derived features, such as coding potential, protein physicochemical property and liquid-liquid phase separation (LLPS), bear more directly on the molecular functions and biological roles of biomolecules (13–15). Taken together, sequence compositions and their derived features are essential for better understanding evolutionary processes and molecular mechanisms across all kingdoms of life. Over the past two decades, several valuable resources have been developed to characterize biomolecular sequence composition (16–21). Among them, representative resources are CUTG (16), CBCB (17), HIVE-CUTs (18) and CUBAP (21). Specifically, CUTG (16), established in 1998, is a widely used database compiling codon usage for 3 027 973 CDSs from 35 799 genomes (including 8 233 chloroplast genomes, 12 271 mitochondrion genomes and 439 plastid genomes). CBCB (17) is a specialized database housing CUB estimates for highly expressed genes in 300+ bacterial genomes. HIVE-CUTs (18) contains 855 412 codon usage tables for 689 420 species and their mitochondrial/plastid genomes derived from GenBank and RefSeq. The webserver CUBAP (21) computes GC content and codon usage for 17 634 human genes and facilitates analyses of CUB across human populations. In addition, several other resources specialize in the integration of protein physicochemical properties (22,23) and phase separation properties (24–26).
Despite the considerable efforts behind existing resources, they have two major limitations. First, none of them covers the full range of sequence compositions and derived features. Second, they do not offer tools for comparing molecular compositions between gene groupings in terms of protein families and GO terms, with friendly online functionalities for customized data analysis and interactive visualization. In particular, with the rapid advancement of high-throughput sequencing technologies, an ever-increasing number of high-quality genomes covering a broad diversity of species have been sequenced and well annotated. Therefore, there is an urgent need to build a database incorporating sequence compositions and features based on high-quality genomes and genes. To fill this gap, we present CompoDynamics (https://ngdc.cncb.ac.cn/compodynamics/), a comprehensive database for characterizing sequence compositions and molecular features across a wide range of species. Based on a large number of high-quality CDSs and genomes derived from RefSeq, CompoDynamics generates a full range of sequence compositions including nucleotide content, codon usage and amino acid usage, as well as derived features of coding potential, protein physicochemical property and phase separation. In addition, CompoDynamics is equipped with a set of online tools for composition analysis, comparison and visualization. Collectively, CompoDynamics has great potential to become a fundamental resource for better understanding the biological significance of sequence composition dynamics.

MATERIALS AND METHODS

Data collection

Over 127 million coding sequences (*_cds_from_genomic.fna.gz) covering viruses, archaea, bacteria, fungi, protozoa, plants, invertebrates and vertebrates were retrieved from NCBI RefSeq (27) (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/; last access on 2020/5/27). For each species, complete assemblies were selected where available; otherwise, only reference/representative genomes were kept. To enable gene comparisons between different families or functions, protein family/domain annotations were integrated from Pfam (version 34.0) (28), and GO (gene ontology) annotations (29,30) were incorporated using a series of R packages (org.Hs.eg.db, org.Mm.eg.db, org.Dr.eg.db, org.Ss.eg.db, org.Gg.eg.db, org.Mmu.eg.db, org.At.tair.db, org.Dm.eg.db, org.Ag.eg.db, org.Sc.sgd.db, org.EcK12.eg.db). To enable more convenient search, species lineages and aliases were integrated from NCBI Taxonomy (31) (https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2020_05_01.zip).

Data pre-processing

CDS quality was evaluated using in-house scripts that take different genetic codes into account (https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi/). Sequences whose lengths are not a multiple of three were removed. Most CDSs were annotated as canonical, whereas the remaining ones were labelled accordingly if they (i) lack a start codon, (ii) lack a stop codon or (iii) contain an in-frame stop codon. The information including gene locus tag, protein ID, gene name, protein description and genomic location was parsed from CDS fasta files.
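
The quality labels described above can be sketched as follows. This is only an illustration, not the in-house scripts themselves: the function name `classify_cds`, the label strings, the restriction to the standard genetic code (NCBI table 1) and the use of ATG as the only start codon are all assumptions made for the example.

```python
# Assumption: standard genetic code (NCBI table 1) only; the real pipeline
# is described as handling species-specific genetic codes.
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: alternative start codons are ignored in this sketch.
START_CODONS = {"ATG"}


def classify_cds(seq):
    """Return None if the CDS would be removed, else a list of quality labels."""
    seq = seq.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        return None  # length is not a multiple of three: sequence is removed
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    labels = []
    if codons[0] not in START_CODONS:
        labels.append("no_start_codon")
    if codons[-1] not in STOP_CODONS:
        labels.append("no_stop_codon")
    # Any stop codon before the final codon is an in-frame stop.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        labels.append("in_frame_stop_codon")
    return labels or ["canonical"]
```

A CDS can carry more than one label in this sketch, for example when it lacks both a start and a stop codon.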

Calculation of sequence compositions and features

A full range of sequence compositions and their derived features were computed for each CDS. Among them, gene length, nucleotide content, codon usage, amino acid usage, CDC (Codon Deviation Coefficient) and RSCU (Relative Synonymous Codon Usage) were calculated using CAT (32); ENC (Effective Number of Codons) values were computed by CodonW (http://codonw.sourceforge.net/); coding potential was assessed with LGC (33) and CPC2 (34); phase separation features were calculated using ESpritz (35), PLAAC (36) and Pi-Pi (37); and protein physicochemical properties were computed by in-house scripts. Species-specific genetic codes were considered during the above calculations. For each genome, sequence compositions and features were estimated from the results of all related CDSs by adopting one of three strategies: (i) weighted averaging based on CDS length, as for GC content, (ii) direct averaging across all CDSs, as for the averaged CDC and ENC, and (iii) calculation based on genome-scale values, as for RSCU. For instance, GC and GC1/2/3 (positional GC content at the three codon positions) contents were averaged across all CDSs weighted by CDS length (which is equivalent to computing them on the concatenation of all CDSs in a genome) and their distributions were represented as boxplots. Mean values of CUB, viz., averaged CDC and ENC, were calculated and their distributions were represented as boxplots. RSCU values were calculated based on the total frequencies of each codon within a genome. Detailed descriptions for all sequence compositions and molecular features are available at https://ngdc.cncb.ac.cn/compodynamics/help/.
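
To make strategies (i) and (iii) concrete, the sketch below shows a length-weighted GC average, which coincides with the GC content of the concatenated CDSs, and a genome-scale RSCU computed from pooled codon counts. It is an illustration under stated assumptions rather than the CAT implementation: the function names are hypothetical, only the standard genetic code is used, and codons containing ambiguous bases are silently ignored.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AAS)}


def gc_fraction(seq):
    """Fraction of G and C bases in one sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def weighted_gc(cds_list):
    """Strategy (i): average per-CDS GC content weighted by CDS length."""
    total_length = sum(len(s) for s in cds_list)
    return sum(gc_fraction(s) * len(s) for s in cds_list) / total_length


def genome_rscu(cds_list):
    """Strategy (iii): RSCU from codon counts pooled over all CDSs."""
    counts = Counter()
    for seq in cds_list:
        seq = seq.upper()
        counts.update(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    rscu = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent: RSCU is undefined for its codons
        for c in codons:
            # Observed count divided by the count expected under equal usage.
            rscu[c] = counts[c] * len(codons) / total
    return rscu
```

An RSCU of 1 means a codon is used as often as expected under equal usage of its synonymous family; values above 1 indicate preferred codons.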

Implementation

CompoDynamics was built with Spring Boot (http://spring.io/), a mature, convention-over-configuration Model-View-Controller (MVC) framework, deployed in a CentOS Linux 7.9 environment. In the back end, CompoDynamics data was stored in MySQL (https://www.mysql.com/), a free and popular relational database management system. The database was run through Apache ShardingSphere (https://shardingsphere.apache.org/), an open-source ecosystem consisting of a set of distributed database solutions. Web pages were constructed using HTML5 and rendered using Thymeleaf (https://www.thymeleaf.org/). Front-end interfaces were developed using Bootstrap (https://getbootstrap.com/) with jQuery (https://jquery.com/) to provide responsive and user-friendly web pages. Furthermore, Highcharts (https://www.highcharts.com.cn/), ECharts (https://echarts.apache.org/) and DataTables (https://datatables.net/) were used for interactive charting and data visualization.

DATABASE CONTENTS AND FEATURES

CompoDynamics is a comprehensive database of sequence compositions and features (nucleotide content, codon usage, amino acid usage, coding potential, protein physicochemical property and phase separation) for 118 689 747 high-quality CDSs and 34 562 genomes, covering 1 692 647 genes and 24 995 species. Moreover, CompoDynamics provides interactive and user-friendly tools to perform comparative analysis of composition features across different species and gene groupings in terms of protein families and GO terms, enabling users to investigate composition dynamics across genes and genomes. In CompoDynamics, all these contents are organized in terms of sequences (gene and genome), compositions, features and tools (Figure 1).

Figure 1.

Database contents and organization. The present version of CompoDynamics provides six groups of sequence compositions (nucleotide content, codon usage and amino acid usage) and features (coding potential, protein physicochemical property and phase separation) for 118 689 747 CDSs and 34 562 genomes derived from RefSeq. These contents can be easily browsed, visualized, retrieved and analyzed at both genome and gene levels.

Nucleotide contents in genomes

Taking advantage of the large quantity of genomic data, CompoDynamics presents a whole picture of nucleotide contents for a wide range of species and CDSs, enabling systematic investigations of nucleotide composition dynamics in a cross-species manner. Specifically, detailed compositions of the four individual nucleotides (A, T, G, C), nucleotide contents (GC, AG, GT, AT, AC, CT) and their positional contents at the 1st/2nd/3rd codon positions are presented for each genome and CDS in CompoDynamics. Based on 34 562 high-quality genome sequences, we observe that Chargaff's second parity rule (A = T and C = G) applies in most of the species categories (R² > 0.99) (Supplementary Figure S1A, B), and positional GC contents (GC1, GC2, GC3) are positively correlated with overall GC content, which is consistent with previous findings (38,39) (Supplementary Figure S1C). In addition, it is observed that GC1 is always higher than GC2, and that GC3 has the most variability, consistent with our previous observations (40). Taken together, CompoDynamics characterizes nucleotide compositions across a variety of species, facilitating whole-genome comparative analysis of nucleotide variation.
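
As a minimal sketch of what the positional contents mean, the function below computes GC1, GC2 and GC3 together with overall GC for a single CDS. The function name `positional_gc` and the decision to drop trailing bases that do not complete a codon are assumptions made for the example; this is not the CAT tool used to populate the database.

```python
def positional_gc(seq):
    """Return overall GC and GC at codon positions 1, 2 and 3 for one CDS."""
    seq = seq.upper()
    # Assumption: ignore trailing bases that do not form a complete codon.
    usable = len(seq) - len(seq) % 3
    if usable == 0:
        raise ValueError("sequence is shorter than one codon")
    result = {}
    for pos in range(3):
        bases = seq[pos:usable:3]  # every third base, starting at this codon position
        result[f"GC{pos + 1}"] = sum(b in "GC" for b in bases) / len(bases)
    result["GC"] = sum(b in "GC" for b in seq[:usable]) / usable
    return result
```

For the toy sequence `ATGGCCAAA` (codons ATG, GCC, AAA), the third position carries G, C and A, so GC3 is 2/3 while GC1 and GC2 are both 1/3.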

Codon/amino acid usage in genomes

Considering that CUB reflects a complex interplay between mutation and selection, CompoDynamics houses several estimates of codon usage and CUB across a broad spectrum of organisms. CUB is represented by two popular measures, viz., ENC (41) and CDC (32). The former evaluates how far observed codon usage departs from uniform codon usage (ranging from 20 for maximum bias to 61 for no bias), whereas the latter considers background compositions and thus reflects the strength of selection on synonymous codon usage (ranging from 0 for no bias to 1 for maximum bias). At the genome level, CUB estimates for each genome are averaged over all CDSs (see details in Materials and Methods). CompoDynamics presents codon usage tables and CUB estimates for all collected genomes and genes and offers visualization functionalities to investigate their distributions in bar plots and box plots. Based on all collected genomes in CompoDynamics, we investigate the dynamics of codon usage, CUB, GC content and genome size across clades (Figure 2). Consistent with extensive reports (42,43), prokaryotes differ greatly in GC content and CUB (Figure 2A and B). Notably, prokaryotes with extremely high (such as the majority of Actinobacteria) or low (such as Tenericutes and Rickettsiales) GC contents tend to show lower ENC and CDC estimates (Figure 2A), indicating stronger CUB on their genomes presumably caused by mutational pressure. Likewise, Burkholderiales and Xanthomonadales seem to present lower ENC estimates and relatively higher CDC estimates (Figure 2B), suggesting stronger CUB on their genomes more likely due to selection. Unlike prokaryotes, eukaryotes always exhibit moderate ENC/CDC with a narrow range of GC content, especially in more complex organisms (Figure 2B). Intriguingly, extreme GC contents and biased codon usage are consistently observed mostly in less complex organisms, such as protists, fungi and green algae (Figure 2B).
More complex organisms, such as Mammalia (mammals), Aves and Actinopteri, show high uniformity in codon and amino acid usage (Figure 2C), which coincides with their narrow ranges of GC/ENC/CDC estimates. Not surprisingly, amino acid usage shows better uniformity than codon usage (Figure 2C).
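
The 20-to-61 range of ENC quoted above follows from how the measure is built. The sketch below implements Wright's original formulation for the standard genetic code, Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, where each F is the mean codon homozygosity of the amino acids with that degeneracy. It is an illustration only: the database values are reported as coming from CodonW, whose treatment of rare or missing amino acids may differ, and the function name `enc` and the choice to raise an error when a degeneracy class cannot be estimated are assumptions of this example.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AAS)}


def enc(seq):
    """Effective number of codons (Wright's formula, standard code only)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    f_by_degeneracy = {2: [], 3: [], 4: [], 6: []}
    for codons in families.values():
        k = len(codons)
        n = sum(counts[c] for c in codons)
        if k == 1 or n < 2:
            continue  # Met/Trp have no synonyms; n < 2 gives no estimate
        squared = sum((counts[c] / n) ** 2 for c in codons)
        f_by_degeneracy[k].append((n * squared - 1) / (n - 1))
    mean_f = {}
    for k, values in f_by_degeneracy.items():
        if not values or sum(values) <= 0:
            # Assumption: fail instead of imputing a missing degeneracy class.
            raise ValueError(f"cannot estimate homozygosity for {k}-fold class")
        mean_f[k] = sum(values) / len(values)
    nc = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(61.0, nc)  # values above 61 are conventionally capped
```

When every amino acid uses a single codon, each homozygosity equals 1 and the formula gives 2 + 9 + 1 + 5 + 3 = 20; when all synonymous codons are used equally, it reaches the cap of 61.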

Figure 2.

Codon usage dynamics across prokaryote and eukaryote genomes. (A) CUB distributions in prokaryote and eukaryote genomes. CUB (represented by ENC and CDC), GC content of genomic coding region, genome size, CDS number and metabolism type are visualized by different color palettes. For prokaryotes, organisms with CDS count ≥100 are displayed in the cladogram. Several clades are highlighted to exemplify different kinds of strong CUBs. Burkhold*: Burkholderiales; Rick*: Rickettsiales. (B) Relationship between ENC and CDC for different GC values in prokaryotes and eukaryotes. (C) Codon usage and amino acid usage across six species categories in eukaryotes.

Sequence compositions and features in genes

Different genes, depending on their families or functions, are under different selective pressures and may exhibit different compositional patterns. Accordingly, CompoDynamics provides a wealth of CDS-level compositions and molecular features for 118 689 747 CDSs across 24 995 species and supports online analysis for composition comparison between gene groupings with different families or GO terms (Figure 3). Taking Saccharomyces cerevisiae S288C as an example, five groups of yeast genes, according to GO terms, namely, ‘cytosolic large ribosomal subunit’, ‘cytosolic small ribosomal subunit’, ‘transmembrane transport’, ‘cell wall’ and ‘retrotransposon nucleocapsid’, are selected for composition comparison (Figure 3A). Online tools in CompoDynamics enable the investigation of nucleotide content variation in these five groups (Figure 3B). Among them, genes with the GO terms ‘cytosolic large ribosomal subunit’ and ‘cytosolic small ribosomal subunit’ exhibit more biased codon usage (represented by low ENC and high CDC) (Figure 3C) as well as more biased amino acid usage (Figure 3D) and are positively charged (Figure 3E), consistent with a previous report (44). In addition, genes with the terms ‘cell wall’ and ‘transmembrane transport’ are observed to be mostly neutrally charged and highly hydrophobic, which coincides well with their specialized functions (Figure 3E and F). Furthermore, by estimating a series of indices for liquid-liquid phase separation potential, we observe that ‘retrotransposon nucleocapsid’ associated genes harbour detectable disordered regions and show relatively strong signals for phase separation potential (Figure 3G). In short, CompoDynamics features online functionalities for composition comparison and visualization, paving the way for in-depth investigation of different genes.
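
The charge and hydrophobicity comparisons above rest on simple per-protein summaries of amino acid composition. The sketch below shows one plausible way to compute them; it is not the in-house code behind the database. The hydrophobicity values are the Kyte-Doolittle hydropathy scale, and both the choice of that scale and the residue classes (K, R and H counted as positive, D and E as negative, everything else as neutral) are assumptions of this example. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy scale (assumed here; the database may use another).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
POSITIVE = set("KRH")  # assumption: histidine is counted as positively charged
NEGATIVE = set("DE")


def _standard_residues(protein):
    """Keep only the 20 standard residues (drops X, * and similar symbols)."""
    return [aa for aa in protein.upper() if aa in KYTE_DOOLITTLE]


def gravy(protein):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    residues = _standard_residues(protein)
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


def charge_fractions(protein):
    """Fractions of positively, negatively and neutrally charged residues."""
    residues = _standard_residues(protein)
    n = len(residues)
    positive = sum(aa in POSITIVE for aa in residues) / n
    negative = sum(aa in NEGATIVE for aa in residues) / n
    return {"positive": positive, "negative": negative,
            "neutral": 1 - positive - negative}
```

Applied to every protein in two gene groups, such summaries give the distributions that a tool like GOComparator would display side by side.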

Figure 3.

Sequence composition and feature comparisons between genes with different GO terms. (A) Five groupings of yeast genes, according to GO terms, namely, ‘cytosolic large ribosomal subunit’, ‘cytosolic small ribosomal subunit’, ‘transmembrane transport’, ‘cell wall’ and ‘retrotransposon nucleocapsid’, are selected for comparison with the online tool GOComparator in CompoDynamics. The comparison results are illustrated for (B) nucleotide content, (C) codon usage bias, (D) amino acid usage, (E) positively/neutrally charged amino acids, (F) hydrophobicity and (G) intrinsically disordered regions.

Data organization and presentation

CompoDynamics is organized to enable easy and efficient browsing, searching, comparison and statistical summary of sequence compositions and derived features (Figure 1). On the homepage, CompoDynamics offers a fast and case-insensitive search functionality, allowing users to search simply by specifying any term (including species, taxonomy, assembly, gene and protein) and by setting advanced filtering options. Once a search is submitted, relevant results are displayed on a web page where genome/CDS-level compositions and features are provided. Clicking any individual genome/CDS leads to the corresponding page, which contains detailed results of compositions and features in terms of basic information, nucleotide content, codon usage, amino acid usage, coding potential, physicochemical property and phase separation of each genome/CDS. Additionally, CompoDynamics offers browse functionalities to help users easily retrieve compositions and features for any genome/CDS of interest. Importantly, CompoDynamics provides several interactive online tools for analyzing molecular compositions and features across different species (SpeciesComparator), different gene families (FamilyComparator) and different GO terms (GOComparator). By selecting genomes or gene groups of interest, users can perform comparisons in terms of nucleotide composition, codon usage, amino acid usage, coding potential, physicochemical properties and phase separation. In addition, CompoAnalyzer is designed to accept user-input sequences for custom analyses of compositions and features. Moreover, a series of data statistics is presented to give users an overview of compositions and features, Chargaff's rules and the relationships between GC content and other compositions/features across various species categories.

DISCUSSION AND FUTURE DEVELOPMENTS

Based on high-quality data, CompoDynamics presents a wealth of sequence compositions and molecular features for 118 689 747 CDSs and 34 562 genomes covering 1 692 647 genes and 24 995 species. Consequently, CompoDynamics yields a whole picture of sequence compositions and features across all kingdoms of life, helping users investigate variation dynamics at the genome level and providing important insights into molecular sequence evolution. Also, CompoDynamics is equipped with several useful tools for online analysis and visualization, thus enabling users to perform comparative analysis of sequence compositions and features across various species and gene groups. Future developments of CompoDynamics include periodic updates (at least once a year) of high-quality genomes and CDSs from NCBI RefSeq, GENCODE and Genome Warehouse (45) in NGDC (46). Moreover, untranslated genomic regions, such as the untranslated regions (UTRs), mRNA introns and noncoding RNAs, will be added to characterize composition dynamics on a larger scale. Furthermore, CompoDynamics will incorporate more curated metadata and molecular features (e.g. di-nucleotides, optimal codons, etc.) and enhance online tools for better analysis and visualization. Collectively, CompoDynamics is of great utility in helping users better understand sequence composition dynamics of genes and genomes, and thus serves as a fundamental resource in support of global biological research.

DATA AVAILABILITY

CompoDynamics is a user-friendly database for characterizing sequence composition dynamics and can be accessed directly at https://ngdc.cncb.ac.cn/compodynamics/.

46 in total

1. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

2. PLAAC: a web and command-line application to identify proteins with prion-like amino acid composition.

Authors: Alex K Lancaster; Andrew Nutter-Upham; Susan Lindquist; Oliver D King
Journal: Bioinformatics Date: 2014-05-13 Impact factor: 6.937

3. Pi-Pi contacts are an overlooked protein feature relevant to phase separation.

Authors: Robert McCoy Vernon; Paul Andrew Chong; Brian Tsang; Tae Hun Kim; Alaji Bah; Patrick Farber; Hong Lin; Julie Deborah Forman-Kay
Journal: Elife Date: 2018-02-09 Impact factor: 8.140

4. Conserved codon composition of ribosomal protein coding genes in Escherichia coli, Mycobacterium tuberculosis and Saccharomyces cerevisiae: lessons from supervised machine learning in functional genomics.

Authors: Kui Lin; Yuyu Kuang; Jeremiah S Joseph; Prasanna R Kolatkar
Journal: Nucleic Acids Res Date: 2002-06-01 Impact factor: 16.971

5. Modeling compositional dynamics based on GC and purine contents of protein-coding sequences.

Authors: Zhang Zhang; Jun Yu
Journal: Biol Direct Date: 2010-11-08 Impact factor: 4.540

6. Whole-genome mutational biases in bacteria.

Authors: Peter A Lind; Dan I Andersson
Journal: Proc Natl Acad Sci U S A Date: 2008-11-10 Impact factor: 11.205

7. CBDB: the codon bias database.

Authors: Adam Hilterbrand; Joseph Saelens; Catherine Putonti
Journal: BMC Bioinformatics Date: 2012-04-26 Impact factor: 3.169

8. On the molecular mechanism of GC content variation among eubacterial genomes.

Authors: Hao Wu; Zhang Zhang; Songnian Hu; Jun Yu
Journal: Biol Direct Date: 2012-01-10 Impact factor: 4.540

9. The Gene Ontology resource: enriching a GOld mine.

Authors:
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971

10. Genome Warehouse: A Public Repository Housing Genome-scale Data.

Authors: Meili Chen; Yingke Ma; Song Wu; Xinchang Zheng; Hongen Kang; Jian Sang; Xingjian Xu; Lili Hao; Zhaohua Li; Zheng Gong; Jingfa Xiao; Zhang Zhang; Wenming Zhao; Yiming Bao
Journal: Genomics Proteomics Bioinformatics Date: 2021-06-24 Impact factor: 6.409
