COXPRESdb: a database of comparative gene coexpression networks of eleven species for mammals.

Takeshi Obayashi¹, Yasunobu Okamura, Satoshi Ito, Shu Tadaka, Ikuko N Motoike, Kengo Kinoshita.

Abstract

Coexpressed gene databases are valuable resources for identifying new gene functions or functional modules in metabolic pathways and signaling pathways. Although coexpressed gene databases are a fundamental platform in the field of plant biology, their use in animal studies is relatively limited. The COXPRESdb (http://coxpresdb.jp) provides coexpression relationships for multiple animal species, as comparisons of coexpressed gene lists can enhance the reliability of gene coexpression determinations. Here, we report the updates of the database, mainly focusing on the following two points. First, we updated our coexpression data by including recent microarray data for the previous seven species (human, mouse, rat, chicken, fly, zebrafish and nematode) and adding four new species (monkey, dog, budding yeast and fission yeast), along with a new human microarray platform. A reliability scoring function was also implemented, based on coexpression conservation to filter out coexpression with low reliability. Second, the network drawing function was updated, to implement automatic cluster analyses with enrichment analyses in Gene Ontology and in cis elements, along with interactive network analyses with Cytoscape Web. With these updates, COXPRESdb will become a more powerful tool for analyses of functional and regulatory networks of genes in a variety of animal species.

Year: 2012 PMID: 23203868 PMCID: PMC3531062 DOI: 10.1093/nar/gks1014

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The construction of a gene network is a fundamental step toward understanding global cellular processes. In addition, recent genome-wide association studies using high-throughput sequencing technology have revealed many uncharacterized genotypes associated with particular phenotypes (1,2). Networks of mRNAs or proteins are useful for investigating the molecular mechanisms that connect genotype and phenotype. Several databases, such as IntAct (3) and STRING (4), have focused on constructing protein-protein interaction networks. For mRNA network analysis, networks are constructed from similarities of gene expression profiles (gene coexpression) across a vast amount of microarray data. Databases for gene coexpression have achieved great success in the field of plant biology (5–8). Their use in mammalian fields, however, is still limited, with some exceptional reports (9,10), although several coexpression databases, such as Genevestigator (11), STARNET2 (12), SNPxGE2 (2) and our own COXPRESdb, have been developed.

To promote the use of coexpression analyses in animals, we have been developing a gene coexpression database named COXPRESdb (coexpression database). We have focused especially on the reliability of coexpression data, by providing comparisons of coexpression among different species along with a network view of the relationships between coexpressed genes (13,14). Although the gene network view can provide an overview of the system of interest, constructing a large-scale gene network is not easy because such a network tends to be too complicated to fully comprehend. Several approaches have been developed to visualize large-scale gene networks and make them easier to understand, by controlling the cluster size (15) or combining biological-property–based clustering (16). Another weak point of coexpressed gene network analysis lies in the quality of the coexpression data. The quality of the coexpression data for animals is generally worse than that for Arabidopsis in an assessment using Gene Ontology (GO) annotation (17), probably due to the increased complexity of animal systems (18).

To enhance the performance of gene coexpression analyses, we updated two aspects of COXPRESdb. First, we increased the number of samples for each species and the number of species from 7 to 11, and added an alternative microarray platform for human, as summarized in Table 1. In addition, a reliability scoring system was implemented, based on the similarity of coexpression patterns among the species. Second, the network drawing tool was improved. The new tool automatically divides a large complex network into smaller compact clusters. Each compact cluster is then characterized by GO and cis element enrichment analyses. In addition, users can select the Cytoscape Web system (19) to interactively modify the network alignment and to work as a bridge to stand-alone Cytoscape (20) for more complex analyses. Furthermore, all of the coexpression data are now available in SPARQL for the semantic web communities, using the Virtuoso Universal Server at [http://coxpresdb.jp/sparql], which will promote the building of mashup applications with various omics data sets.

Table 1.

Summary of the update of the coexpression data from versions 4.1 to 5.0

Species	Abbreviation	Microarray platform (Affymetrix product ID)	Number of genes	Number of microarrays (ver. 5.0)	Number of microarrays (ver. 4.1)
Homo sapiens	Hsa	HG-U133_Plus_2	19 803	73 083 (c4.0)	4401 (c3.1)
Homo sapiens	Hsa2	HuGene-1_0-st-v1	19 788	6865 (c1.0)
Mus musculus	Mmu	Mouse430_2	20 403	31 479 (c3.0)	2226 (c2.1)
Rattus norvegicus	Rno	Rat230_2	13 751	27 481 (c3.0)	3526 (c2.0)
Gallus gallus	Gga	Chicken	13 757	1024 (c2.0)	352 (c1.0)
Danio rerio	Dre	Zebrafish	10 112	1126 (c2.0)	590 (c1.0)
Drosophila melanogaster	Dme	Drosophila_2	12 626	3336 (c2.0)	1102 (c1.0)
Caenorhabditis elegans	Cel	Celegans	17 256	1034 (c2.0)	514 (c1.0)
Macaca mulatta	Mcc	Rhesus	15 779	675 (c1.0)
Canis lupus	Cfa	Canine_2	16 211	377 (c1.0)
Saccharomyces cerevisiae	Sce	Yeast_2	4461	2693 (c1.0)
Schizosaccharomyces pombe	Spo	Yeast_2	4881	111 (c1.0)

“c” is prefixed to each coexpression version (e.g. c4.0) to avoid confusion with the version of COXPRESdb as a whole (e.g. ver. 5.0).

QUALITY ASSESSMENT OF COEXPRESSION DATA

New coexpression data

The calculation procedure for the coexpression data is the same as in our previous report (18). Briefly, GeneChip raw data were obtained from ArrayExpress (21) and normalized by the RMA method (22) for each compressed file, on the assumption that each compressed file corresponds to one experimental set. The weighted Pearson's correlation coefficient of the expression profiles was then calculated for every pair of genes in each species. Finally, the correlation coefficient was converted to the mutual rank (MR) (18). In the network, a node corresponds to a gene, and edges are drawn from each gene to its three most strongly coexpressed genes. The evolutionary relationships were determined by using HomoloGene (23), and edges between homologous gene pairs, if any, were considered as common edges among the species. To assess the difference between the previous and new versions, we counted the number of common edges (Nc) for all pairs of the seven species in each version. These numbers provide a quick measure of the quality of the coexpression data, because coexpression that is reproduced on independent microarray platforms is less likely to be an experimental artifact. As a result, all pairs of species, except for the human–nematode pair, showed an increase in Nc (Figure 1). The average increase rate of Nc was 1.5, and large increases of Nc were observed for the human–mouse, mouse–rat and mouse–chicken pairs, which may correspond to the large increase in the number of mouse samples. In addition to renewing the data for the previous seven species, we added four new species (monkey, dog and two yeast species), as well as human coexpression data from another microarray platform. The Nc values against the human data are summarized in Table 2.
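The ranking step can be illustrated with a minimal sketch. It assumes that the mutual rank of a gene pair is the geometric mean of the two directional correlation ranks, and it uses the plain (unweighted) Pearson correlation from NumPy in place of the weighted coefficient described above. The function names `mutual_rank` and `top_edges` are hypothetical and are not part of COXPRESdb.

```python
import numpy as np

def mutual_rank(expr):
    """Return a genes x genes mutual rank (MR) matrix.

    expr: 2-D array, one row per gene and one column per sample.
    Assumption: MR(a, b) = sqrt(rank of b in a's list * rank of a in b's list).
    """
    corr = np.corrcoef(expr)              # unweighted Pearson (a simplification)
    np.fill_diagonal(corr, -np.inf)       # a gene is never its own partner
    order = np.argsort(-corr, axis=1)     # partners sorted by decreasing correlation
    rank = np.empty_like(order)
    rows = np.arange(corr.shape[0])[:, None]
    rank[rows, order] = np.arange(1, corr.shape[1] + 1)  # rank 1 = strongest
    mr = np.sqrt(rank * rank.T)           # geometric mean of the two directional ranks
    np.fill_diagonal(mr, np.inf)          # exclude self pairs from any later ranking
    return mr

def top_edges(mr, n_top=3):
    """Undirected edges from each gene to its n_top partners with the smallest MR."""
    edges = set()
    for a in range(mr.shape[0]):
        for b in np.argsort(mr[a])[:n_top]:
            edges.add((min(a, int(b)), max(a, int(b))))
    return edges

# Toy data: genes 0 and 1 rise together, genes 2 and 3 fall together.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 11.0],
    [5.0, 3.0, 4.0, 1.0, 2.0],
    [10.0, 6.0, 9.0, 2.0, 3.0],
])
mr = mutual_rank(expr)
edges = top_edges(mr, n_top=1)
```

Because MR is symmetric, an edge drawn from gene a to gene b has the same weight as the edge from b to a, which is why the sketch stores edges as unordered pairs.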

Figure 1.

Table 2.

Change in the number of edges in a human platform that are also observed in other species

Species	ver. 5.0	ver. 4.1
Mus musculus	1397	757
Canis lupus	896
Rattus norvegicus	803	720
Macaca mulatta	545
Gallus gallus	358	211
Danio rerio	172	156
Drosophila melanogaster	84	49
Caenorhabditis elegans	38	39
Saccharomyces cerevisiae	35
Schizosaccharomyces pombe	13

The total numbers of edges in human are 59 409 (ver. 5.0) and 59 331 (ver. 4.1).

Distribution of the number of common coexpression edges (Nc) between species. Large increases in common coexpression edges are observed in the (a) human–mouse, (b) mouse–rat and (c) mouse–chicken pairs, suggesting significant improvement of the mouse coexpression data. The increase rate of the number of common edges is 1.5 on average.

SIMILARITY OF COEXPRESSION PATTERNS AMONG SPECIES

The coexpressed gene list in COXPRESdb provides a comparative view among orthologous genes in other species (14). This comparative view shows the evolutionary conservation of the coexpression pattern of the guide gene, which can be a measure of the reliability of the coexpression data (24,25). Figure 2 shows the coexpressed gene list for the human CHEK1 gene. The alternative human platform (Hsa2) and mouse (Mmu) show coexpression degrees similar to those of human (Hsa), reflecting the high quality of the coexpression data for these species, which are based on a large amount of microarray data. The conservation degrees with monkey (Mcc), rat (Rno), dog (Cfa) and zebrafish (Dre) are also good. The low coexpression conservation with fly (Dme), nematode (Cel) and the two yeast species (Sce, Spo) seems to be derived from the greater species distance to human and/or the relatively poor coexpression data based on the small amount of microarray data (Table 1). In particular, the chicken (Gga) coexpression data are different from the human data. This may be due to a defective probe for this gene, because when we checked the coexpressed gene list for this gene in chicken, almost no orthologous genes showed coexpression conservation.

Figure 2.

An example of a coexpressed gene list in COXPRESdb. The human CHEK1 gene is used as an example of a guide gene, and the coexpressed genes are shown along with their MR values (a smaller MR value indicates stronger coexpression). The 11 columns on the right indicate the coexpression degrees of the ortholog pairs in other species (or another human platform). Coexpressions with MR >200 are considered weak and are shown in a faded color. A blank cell means that coexpression data are not available for the gene in the corresponding species (or platform). The reliability is calculated based on the coexpression conservation and is represented with stars (triple star: highly reliable; no star: no conservation support). This list is available at [http://coxpresdb.jp/cgi-bin/coex_list.cgi?gene=1111&sp=Hsa].

As seen in this example, the conservation of coexpression can ensure the quality of the guide gene (14), but users should check all of the coexpressed genes in each species to determine the reliability of each orthologous gene. To solve this problem, we introduced a similarity measure, COXSIM, which is the weighted concordance rate of the coexpressed gene lists. In this measure, n(i, g, sp1, sp2) is the number of common genes (orthologous genes in the case of a comparison between different species) found in the top i coexpressed gene lists from a guide gene g in species sp1 and in species sp2. We set k to 100, meaning that we check the gene correspondence of the top 100 coexpressed genes, which is a reasonable limit for designing biological experiments (7). Here, defective probes will show noisy expression patterns, which cause unreliable coexpression that does not show any correspondence with other coexpression data. In other words, the maximal value of COXSIM (coexpression similarity) between the coexpressed gene list from an unreliable gene and those from its orthologous genes should be low. Based on this idea, maxCOXSIM is introduced as the reliability score of a guide gene.

The significance of the maxCOXSIM value is assessed from the null distribution for 10 species comparisons, each containing 10 000 genes. Note that this assumption is a rather severe evaluation and thus this P-value is underestimated for most guide genes because both the larger number of species in the comparison and the smaller number of genes in a genome will cause higher maxCOXSIM values by chance. We show this significance degree by stars on the gene list in COXPRESdb, where single, double and triple stars correspond to P-values <1E-4, 1E-12 and 1E-20, respectively. Genes with lower reliability can be filtered out by the Row and Column filters (Figure 2). The numbers and ratios of genes at each significance level are shown in Figure 3.
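A minimal sketch may make the reliability score more concrete. The exact COXSIM formula does not appear in the text here, so the code assumes one plausible reading of "weighted concordance rate": the average over i = 1..k of n(i, g, sp1, sp2) / i. The function names `coxsim`, `max_coxsim` and `stars` are hypothetical, and the second list is assumed to be already mapped to orthologous identifiers of the first species. Only the star thresholds are taken directly from the text.

```python
def coxsim(list1, list2, k=100):
    """Weighted concordance of two ranked coexpressed-gene lists.

    Assumed form (not confirmed by the source): (1 / k) * sum_{i=1..k} n(i) / i,
    where n(i) is the number of genes shared by the two top-i lists.
    """
    total = 0.0
    for i in range(1, k + 1):
        n_i = len(set(list1[:i]) & set(list2[:i]))  # n(i, g, sp1, sp2)
        total += n_i / i
    return total / k

def max_coxsim(guide_list, ortholog_lists, k=100):
    """Largest COXSIM between a guide gene's list and its orthologs' lists.

    ortholog_lists: dict mapping a species code (e.g. "Mmu") to a ranked list.
    """
    return max((coxsim(guide_list, other, k) for other in ortholog_lists.values()),
               default=0.0)

def stars(p_value):
    """Map a P-value to the star rating, using the thresholds stated in the text."""
    for n_stars, threshold in ((3, 1e-20), (2, 1e-12), (1, 1e-4)):
        if p_value < threshold:
            return n_stars
    return 0

# Toy example with k = 2: one ortholog list is reversed, the other is identical.
score = max_coxsim(["a", "b"], {"Mmu": ["b", "a"], "Rno": ["a", "b"]}, k=2)
```

Under this assumed form, agreement near the top of the lists contributes to every term of the sum, so strongly coexpressed genes carry more weight than genes close to the cutoff k.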

Figure 3.

Number of genes for each reliability level. Reliability levels are represented as stars, where no star is the lowest and a triple star is the highest reliability. Numbers in the bars indicate the percentage of each reliability level in each species, where the numbers with no star include genes without any orthologous genes in other species.

ENHANCEMENT OF THE NETWORK ANALYSIS TOOL

The coexpressed gene network is especially useful for analyzing the large numbers of genes generated by transcriptome or proteome analyses, because the network representation can draw all of the pairwise gene relationships for the query genes at once. NetworkDrawer in COXPRESdb is the tool that draws the gene network for user-specified query genes, by searching for coexpression along with protein–protein interactions among the genes or gene products (Figure 4). In this example, three groups of genes can be identified by visual inspection. To characterize these groups, the new NetworkDrawer provides two new network analysis flows, in addition to the marks for KEGG annotation (27) in the previous version of COXPRESdb.

Figure 4.

Two network analysis flows in NetworkDrawer. For a set of user-defined genes, NetworkDrawer draws the gene network. Larger nodes are the query genes and smaller gray nodes are additional nodes with one or more edges to at least one query node. Solid lines and red dotted lines indicate gene coexpression and protein–protein interactions from the HPRD (26) and IntAct (3) databases, respectively. The orange solid lines indicate conserved coexpression observed in at least one species in COXPRESdb. The new NetworkDrawer can be used for the two network analysis flows. The first flow is composed of automatic cluster detection (A) and enrichment analyses of cis elements and GO annotations (B) with detailed cis element information (C). The second flow uses the Cytoscape Web system (D), which enables the user to interactively modify the network alignment. The user can output this network as an image, save it and then load it on this web system, or continue the analysis and visualization on stand-alone Cytoscape.

The first analysis flow is composed of automatic cluster detection and characterization (Figure 4A–C). The cluster detection step has two parameters, a clique detection parameter and a clique merge parameter. Both are set to 0.5 by default but can be changed through the text box on the web page; a smaller clique parameter and a larger merge parameter produce larger sub-graphs. The clustering algorithm was newly developed for both a rapid response and the detection of clique-like sub-graphs, by iteratively merging the node with a higher PageRank value (28). The details of the clustering algorithm will be described elsewhere. After the clustering, users can easily select a cluster by using the radio button in the cluster summary table, to mark the nodes in the selected cluster with balloon icons (the orange balloons in Figure 4A). The results of the enrichment analyses for each cluster are available from the links in the table (Figure 4B).

In addition to the GO enrichment analysis, we also provide a cis element motif enrichment analysis. Gene coexpression is mainly driven by cis elements in the promoter regions, especially the proximal promoter region (29). In Arabidopsis, large-scale cis element discovery was performed based on gene coexpression (30). Therefore, we performed enrichment analyses by a hypergeometric test for heptamer motifs in the proximal promoter regions (−200 to +100) around transcription start sites downloaded from DBTSS (31). The enriched heptamers are compared with the reported cis elements in JASPAR (32) (Figure 4C). To further characterize the heptamers, the enriched GO annotations of the genes having the heptamer motif are calculated (Figure 4C).

The second flow of the gene network analysis is the use of the Cytoscape Web system (19) (Figure 4D). This system enables users to interactively modify the network alignment, export the network as an image (SVG, PNG or PDF format) and save it in the XGMML format. The XGMML file can be uploaded to the same Cytoscape Web system and can also be used in stand-alone Cytoscape (20) for advanced analyses. This system is also available for gene networks in the locus page and the GO network page in COXPRESdb.
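The hypergeometric test used for the heptamer enrichment can be sketched as follows. This is a generic one-sided test written with the Python standard library, not the COXPRESdb implementation. The function names, the toy promoter sequences and the choice to count a motif at most once per promoter are all assumptions made for illustration.

```python
from math import comb

def count_genes_with_motif(promoters, motif):
    """Number of promoter sequences containing the motif at least once."""
    return sum(1 for seq in promoters.values() if motif in seq)

def hypergeom_enrichment_p(n_total, n_with_motif, n_cluster, n_cluster_with_motif):
    """One-sided hypergeometric P-value.

    Probability of seeing at least n_cluster_with_motif motif-bearing genes
    when n_cluster genes are drawn without replacement from n_total genes,
    of which n_with_motif carry the motif.
    """
    upper = min(n_with_motif, n_cluster)
    tail = sum(
        comb(n_with_motif, x) * comb(n_total - n_with_motif, n_cluster - x)
        for x in range(n_cluster_with_motif, upper + 1)
    )
    return tail / comb(n_total, n_cluster)

# Hypothetical proximal promoter sequences keyed by gene name.
promoters = {
    "g1": "TTGACGTCAAT",
    "g2": "CCTGACGTCAG",
    "g3": "AAAAAAAAAAA",
    "g4": "GGGGCCCCGGG",
}
cluster = ["g1", "g2"]          # genes in one detected cluster
motif = "TGACGTC"               # an example heptamer

n_with = count_genes_with_motif(promoters, motif)
n_cluster_with = count_genes_with_motif({g: promoters[g] for g in cluster}, motif)
p_value = hypergeom_enrichment_p(len(promoters), n_with, len(cluster), n_cluster_with)
```

In this toy case both motif-bearing genes fall in the two-gene cluster, so the P-value is 1/6. With realistic gene counts the same function applies unchanged, because `math.comb` works with arbitrarily large integers.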

FUNDING

CREST research project of the Japan Science and Technology Corporation [11102558 to T.O.]; Grants-in-Aid for Innovative Area ‘HD Physiology’ [22136005], for Scientific Research [24570176] and for Publication of Scientific Research Results [228063 to K.K.]. Funding for open access charge: Grants-in-Aid for Innovative Area ‘HD Physiology’ [22136005]. Conflict of interest statement. None declared.

31 in total

Review 1. Approaches for extracting practical information from gene co-expression networks in plant biology.

Authors: Koh Aoki; Yoshiyuki Ogata; Daisuke Shibata
Journal: Plant Cell Physiol Date: 2007-01-23 Impact factor: 4.927

2. ArrayExpress update--an archive of microarray and high-throughput sequencing-based functional genomics experiments.

Authors: Helen Parkinson; Ugis Sarkans; Nikolay Kolesnikov; Niran Abeygunawardena; Tony Burdett; Miroslaw Dylag; Ibrahim Emam; Anna Farne; Emma Hastings; Ele Holloway; Natalja Kurbatova; Margus Lukk; James Malone; Roby Mani; Ekaterina Pilicheva; Gabriella Rustici; Anjan Sharma; Eleanor Williams; Tomasz Adamusiak; Marco Brandizi; Nataliya Sklyar; Alvis Brazma
Journal: Nucleic Acids Res Date: 2010-11-10 Impact factor: 16.971

3. Cytoscape 2.8: new features for data integration and network visualization.

Authors: Michael E Smoot; Keiichiro Ono; Johannes Ruscheinski; Peng-Liang Wang; Trey Ideker
Journal: Bioinformatics Date: 2010-12-12 Impact factor: 6.937

4. COXPRESdb: a database to compare gene coexpression in seven model animals.

Authors: Takeshi Obayashi; Kengo Kinoshita
Journal: Nucleic Acids Res Date: 2010-11-16 Impact factor: 16.971

5. DBTSS: DataBase of Transcriptional Start Sites progress report in 2012.

Authors: Riu Yamashita; Sumio Sugano; Yutaka Suzuki; Kenta Nakai
Journal: Nucleic Acids Res Date: 2011-11-15 Impact factor: 16.971

6. Correlating overrepresented upstream motifs to gene expression: a computational approach to regulatory element discovery in eukaryotes.

Authors: Michele Caselle; Ferdinando Di Cunto; Paolo Provero
Journal: BMC Bioinformatics Date: 2002-02-14 Impact factor: 3.169

7. Rank of correlation coefficient as a comparable measure for biological significance of gene coexpression.

Authors: Takeshi Obayashi; Kengo Kinoshita
Journal: DNA Res Date: 2009-09-18 Impact factor: 4.458

8. STARNET 2: a web-based tool for accelerating discovery of gene regulatory networks using microarray co-expression data.

Authors: Daniel Jupiter; Hailin Chen; Vincent VanBuren
Journal: BMC Bioinformatics Date: 2009-10-14 Impact factor: 3.169

9. Genevestigator v3: a reference expression database for the meta-analysis of transcriptomes.

Authors: Tomas Hruz; Oliver Laule; Gabor Szabo; Frans Wessendorp; Stefan Bleuler; Lukas Oertle; Peter Widmayer; Wilhelm Gruissem; Philip Zimmermann
Journal: Adv Bioinformatics Date: 2008-07-08

10. COXPRESdb: a database of coexpressed gene networks in mammals.

Authors: Takeshi Obayashi; Shinpei Hayashi; Masayuki Shibaoka; Motoshi Saeki; Hiroyuki Ohta; Kengo Kinoshita
Journal: Nucleic Acids Res Date: 2007-10-11 Impact factor: 16.971
