Literature DB >> 32759329

Application of Transcriptional Gene Modules to Analysis of Caenorhabditis elegans' Gene Expression Data.

Michael Cary^1,2, Katie Podshivalova³, Cynthia Kenyon^4,3.

Abstract

Identification of co-expressed sets of genes (gene modules) is used widely for grouping functionally related genes during transcriptomic data analysis. An organism-wide atlas of high-quality gene modules would provide a powerful tool for unbiased detection of biological signals from gene expression data. Here, using a method based on independent component analysis we call DEXICA, we have defined and optimized 209 modules that broadly represent transcriptional wiring of the key experimental organism C. elegans These modules represent responses to changes in the environment (e.g., starvation, exposure to xenobiotics), genes regulated by transcriptions factors (e.g., ATFS-1, DAF-16), genes specific to tissues (e.g., neurons, muscle), genes that change during development, and other complex transcriptional responses to genetic, environmental and temporal perturbations. Interrogation of these modules reveals processes that are activated in long-lived mutants in cases where traditional analyses of differentially expressed genes fail to do so. Additionally, we show that modules can inform the strength of the association between a gene and an annotation (e.g., GO term). Analysis of "module-weighted annotations" improves on several aspects of traditional annotation-enrichment tests and can aid in functional interpretation of poorly annotated genes. We provide an online interactive resource with tutorials at http://genemodules.org/, in which users can find detailed information on each module, check genes for module-weighted annotations, and use both of these to analyze their own gene expression data (generated using any platform) or gene sets of interest.

Entities: CellLine Chemical Disease Gene Species

Keywords: aging; atfs-1; functional annotation of genes; gene co-expression; gene expression data analysis; hif-1; independent component analysis; microarray; mitochondrial unfolded protein response; respiration

Mesh：

Substances：
Transcription Factors

Year: 2020 PMID： 32759329 PMCID： PMC7534440 DOI： 10.1534/g3.120.401270

Source DB: PubMed Journal: G3 (Bethesda) ISSN： 2160-1836 Impact factor: 3.154

Nearly half of the predicted protein-coding genes in Caenorhabditis elegans lack a functional annotation based on direct experimental evidence (Lee ). As a result, querying gene annotation databases, such as the Gene Ontology (GO) or curated pathways can fail to detect biologically meaningful signals in gene expression data (Ashburner ; Fabregat ; Kanehisa ). An alternative approach to understanding gene function is to use information about gene expression. Gene-expression data can be used to define groups of genes that show similar patterns of expression, or co-variation, across many different conditions, which arise due to changes in genes' transcriptional activity and/or their post-transcriptional regulation. These groups are called transcriptional gene modules, with each module potentially representing a discrete biological phenomenon. Gene modules are routinely constructed when clustering algorithms are applied to gene-expression data and have been used successfully to identify gene regulatory mechanisms in a variety of contexts, from the yeast cell cycle (Wu ) and sporulation (Wang ) to human cancer cells (Teschendorff ) to cognitive decline in patients with Alzheimer disease (Mostafavi ). Furthermore, large compendia of data sampling diverse perturbations have been used to define fundamental gene-expression programs of entire organisms (Hughes et al. 2000, Engreitz et al. 2010, Zhou and Altman 2001). In C. elegans, the most recent effort to generate high-quality fundamental transcriptional modules is now almost two decades old (Kim ). Co-expressed genes were grouped together into 43 groups (or “mountains”) based on their correlation across 553 microarrays. However, the compendium did not contain all genes (e.g., over 30% of the microarrays only contained ∼11,000 of C. elegans’ 20,470 protein-coding genes) and each gene was assigned exclusively to one group, although it is well-established that genes can participate in multiple processes(Oeckinghaus ; Gerstein ). The number of perturbations for which gene expression data are available has also increased substantially since 2001. Here, we define 209 transcriptional gene modules in C. elegans using a heterogeneous compendium of 1386 microarrays and a method we call DEXICA, for Deep EXtraction Independent Component Analysis. DEXICA builds on prior implementations (Lee and Batzoglou 2003; Gong ; Engreitz ) of independent component analysis (ICA) for gene module extraction by maximizing the biological information content of the modules. It does so by varying data pre-processing methods and the number of extracted ICA components (modules) until the number of significantly enriched biological annotations in the module set is maximized. DEXICA also uses an artificial neural network to partition each independent component to determine which genes should be included in each module. We show that the 209 DEXICA C. elegans modules capture gene expression patterns that correspond to biological processes; for example, responses to heat stress, xenobiotics and pathogenic bacteria, and to several individual tissues. Furthermore, data analysis in the module space correctly reveals biological processes that are missed by analyses of differentially expressed genes. We provide a user-friendly web interface, with tutorials, in which users can test which of the 209 gene modules are active in their datasets and find detailed information about each module that helps determine which biological process(es) they represent. Finally, we explore whether gene modules can be used to improve existing gene annotations. We reason that an annotation shared among co-expressed genes is more likely to be relevant to their function than one that is not, and that annotations of co-expressed genes can provisionally be “transferred” onto their poorly annotated companions. We calculate what we call “module-weighted gene annotations” by weighting the association between a gene and an annotation by the degree to which the annotation appears predictive of the gene’s module membership. We show that matrix-based analysis of module-weighted annotations is more sensitive and specific than common annotation enrichment tests. We provide a framework for using module-weighted annotations to detect significant GO terms and promoter oligonucleotides directly from expression data, and to identify novel GO terms conferred onto genes based on their module membership. The first part of this paper, which describes the construction and biological validation of the C. elegans DEXICA modules, is directed to computational biologists, and the second to C. elegans biologists who may or may not have experience in computational biology. Readers simply wishing to apply this analytic method to specific gene lists should feel free to skip directly to the later sections but may also enjoy reading about the performance of the modules in our various validation test cases.

Materials and Methods

Compendium construction

To build the compendium of 1386 C. elegans Affymetrix arrays, we first downloaded all CEL files with the appropriate platform ID (GPL200) from the GEO database available on March 1, 2014, excluding those for which the organism was not C. elegans and the sample type was not RNA. We excluded arrays from experiments for which fewer than 8 hybridizations were performed in order to mitigate the effect that under-sampled conditions might have on predicted modules. We then performed a quality control step using the quality assessment functions provided in the simpleAffy (v2.40.0) R package (http://bioinformatics.picr.man.ac.uk/simpleaffy/), discarding arrays that did not meet the quality thresholds recommended in the simpleAffy documentation. We generated expression values for probesets separately for each experiment (determined by GEO series IDs) using the RMA preprocessing procedure provided in the affy (v1.40.0) R package (Gautier ), then used the bias (v0.0.5) R package (Eklund and Szallasi 2008) to remove intensity-dependent biases in expression levels. We then concatenated the expression matrices for each experiment into a single matrix. Next, we either performed between-experiment quantile normalization on the entire matrix using the limma (v3.18.13) R package (Smyth 2005), or omitted this step, depending on the preprocessing method to be tested. Finally, we scaled and centered the arrays, such that the mean of each column was zero and the standard deviation was 1.

Conducting ICA

To conduct ICA of the gene expression matrix, we used the fastICA (v1.2-0) R package (http://CRAN.R-project.org/package=fastICA) with default parameters except for the “method” parameter, which we set to “C” to increase computational speed, and the “row.norm” parameter, which we set to ”TRUE” in order to balance the total compendium variance between genes with subtle changes in expression values and those with large changes in expression values.

Partitioning of independent components

To convert independent components to discrete sets of genes, we employed two methods. In the first, for each component, we assigned all genes with a weight >= 3 to the positive (“a”) hemi-module, and all genes with a weight <= -3 to the negative (“b”) hemi-module. In the second, we created an artificial neural network using the neuralnet (v1.32) R package (http://CRAN.R-project.org/package=neuralnet) to predict positive and negative partitioning thresholds for each independent component, based on the component’s skewness and kurtosis (see Supplemental Methods), then assigned genes whose weights exceeded these thresholds to the corresponding hemi-modules.

Obtaining gene annotations and microarray data

For module optimization, we obtained GO term and REACTOME pathway annotations for C. elegans genes using biomaRt (v2.18.0) R package (Durinck ) and ensembl mart for data retrieval. To obtain genes’ tissue annotations, we downloaded all available data from the GFP Worm database (http://gfpweb.aecom.yu.edu/) (Hunt-Newbury ), which contains annotated expression patterns of promoter::GFP fusion constructs; in total, this dataset provided annotations for 1821 genes across 89 tissue types (n.b., we considered the same tissue in different development stages to be distinct tissue types). For Module Annotation Pages (Tool 3 on http://genemodules.org/), known tissues (anatomies) of gene expression were downloaded from the “expression patterns” database from Wormbase (WS275). This dataset included 5376 genes and 2438 anatomical terms. GO term enrichment was calculated using topGO(v.2.34.0) and celegans.db (v 3.2.3) R packages. To obtain fold changes for mutants, we used data previously published by our group in which () mutants were compared to wild type controls (Cristina ). To obtain fold changes for mutants, we used the maanova (v1.33.2) R package (http://research.jax.org/faculty/churchill) and data previously published by Shen, et al. (Shen ), to calculate the induced gene fold changes upon mutation of . All other microarray data were obtained directly from the authors of the original publications or from the GEO database (see Supplemental Methods).

Optimizing gene module prediction

To optimize gene module prediction, we performed ICA with a variety of different data preprocessing options [e.g., the choice of preprocessing algorithm (RMA, GCRMA, PLIER, MAS 5.0), background, perfect match, bias correction, and normalization methods], and with a varied number of extracted components from 5 to 500 by increments of 5. For each parameter combination, we repeated ICA 5 times. We tested the biological validity of the independent components generated by each ICA run by determining the number of annotations that were enriched in at least one hemi-module. We chose not to optimize based on the number of modules with at least one significant annotation, as tests using simulated data showed that this approach could lead to signals being split into multiple, less accurate representations (data not shown). We first calculated a p-value for the enrichment of genes associated with each annotation term in each hemi-module using the hypergeometric test. We then applied the Simes method (Simes 1986) for multiple hypothesis testing (alpha = 0.05) to the set of p-values for each annotation term. The Simes method is similar to the Benjamini-Hochberg method(Benjamini and Hochberg 1995) for controlling the false discovery rate, but differs in a way that makes it more appropriate here: it aims to answer the question, “Given a set of p-values, what is the likelihood that at least one null hypothesis is false?”, while the Benjamini-Hochberg method asks, “What fraction of rejected null hypotheses are actually true (i.e., falsely discovered)?” Failure of the Simes test indicates that at least one null hypothesis is false at the specified alpha level. To verify the accuracy of our module quality statistics, we repeated all tests using module definition matrices in which gene IDs had been randomly shuffled.

Quantification of module activity in gene expression data

To project a data vector, x, such as a set of gene-expression fold changes, onto a set of gene modules, we used the scalar projection method, in which a mixing vector, a, is calculated from the dot product of the data vector and the unit vectors comprising the module definitions, , as shown in equation 2:The resulting mixing vector, a, provides an indication of the weight of each module definition vector in the projected data, x. Note that because x is a vector of gene fold changes, this analysis can be applied to data generated using any RNA profiling platform (e.g. RNAseq, microarray). Projection of a data matrix, , which generates a mixing matrix, , was carried out using the same procedure. To calculate signed variance explained (SVE), we calculated the relative variance explained (VE) for each module from a as follows, where n is the total number of modules:We then multiplied these values, which are strictly positive, by -1 in each case where a <0 to obtain SVE.

Generation of the Enrichment matrix, E

First, ANN-based partitioning of the module definition matrix, , was used to generate matrix . contains two gene sets per independent component (which we refer to as hemi-modules), for a total of 418. Using a matrix of known Boolean associations between genes and annotations, , we calculated a hypergeometric probability for each annotation in each hemi-module. We used the frequency of genes bearing a particular annotation in the hemi-module, the frequency of such genes in the compendium, the number of genes in the hemi-module, and the number of genes not in the hemi-module as the q, m, k, and n input parameters, respectively, to the phyper() function of the stats (v3.0.3) R package (http://www.R-project.org/). We used these p-values to populate a matrix, , with a row for each hemi-module and a column for each annotation. For under-represented annotations, we entered the log(p-value) in the matrix, and for over-represented annotations we entered the –log(p-value).

Generation of module-weighted annotations

Our calculation of module-weighted annotations takes advantage of the fact that, in modules generated by ICA, prior to partitioning, each gene has a weight in each module. Given a score, or weight, for each annotation in each module, this allows genes to be associated with annotations via a matrix product calculation. In the matrix (see above) highly positive values correspond to strongly enriched annotations in a hemi-module, highly negative values correspond to strongly depleted annotations. We transform the gene module matrix, , into an unpartitioned hemi-module matrix, , by concatenating it with a negative copy of itself column-wise:The product of this matrix with the matrix produces matrix , which relates genes to annotations:As a final step, we normalize the values in this matrix row-wise; i.e., separately for each annotation, by subtracting the mean and dividing by the standard deviation.

Analysis of GO terms based on gene weights

To test whether the ranking of GO terms based on gene weights (Table 1) was statistically significant, we constructed a list comprising all semantic words used in all GO terms, excluding words shorter than three letters and uninformative words (e.g., “the” and “for”.) We then tested each word for bias toward appearing near the top or bottom of the ranked GO term list. In agreement with our initial observations, the most significantly top-biased words pertained to macromolecular complexes, such as “nucleosome”, “cilium”, and “ribosomal”, and the most significant bottom-biased words pertained to cell signaling, such as “signal”, “kinase”, and “receptor”. Many of the GO terms containing cell signaling words were generic in nature, e.g., protein kinase regulator activity, thus, our results may partially be explained by a lack of co-regulation among constituents of different signaling pathways. However, some specific cell signaling terms, e.g., Notch signaling pathway also appeared near the bottom of the ranked GO term list, suggesting that the genes annotated with such terms are either not strongly co-regulated at the gene expression level or that the biological conditions represented by the compendium did not perturb their expression enough to form modules with our method.

Table 1

GO categories with strongest and weakest gene weights. The table shows the top 25 and the bottom 5 GO categories, ranked in decreasing order of the weight of their most strongly associated gene prior to normalization. Similar GO categories are grouped together, and the genes with the highest weights for each group are also shown. rpl- (ribosomal protein, large subunit) and rps- (ribosomal protein, small subunit) genes encode ribosomal proteins, and his- (histone) genes encode nucleosome components

GO Term Group	GO ID	Rank	GO Term	Top Genes
Ribosome	GO:0005840	1	ribosome	rpl-43, rps-15, rps-14, rpl-22, rpl-32
	GO:0003735	2	structural constituent of ribosome
	GO:0030529	3	ribonucleoprotein complex
	GO:0006412	4	translation
	GO:0043228	10	non-membrane-bounded organelle
	GO:0043232	11	intracellular non-membrane- bounded organelle
	GO:0032991	20	macromolecular complex
Heme binding	GO:0020037	5	heme binding	R05D8.9, E02C12.6, T16G1.6, cyp-35A1, C33C12.8
	GO:0046906	6	tetrapyrrole binding
	GO:0005506	7	iron ion binding
	GO:0004497	8	monooxygenase activity
	GO:0009055	9	electron carrier activity
	GO:0016705	15	oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen
Nucleosome	GO:0006334	12	nucleosome assembly	his-4, his-8, his-9, his-45, his-62
	GO:0034728	13	nucleosome organization
	GO:0031497	14	chromatin assembly
	GO:0000786	16	nucleosome
	GO:0032993	17	protein-DNA complex
	GO:0006333	18	chromatin assembly or disassembly
	GO:0065004	21	protein-DNA complex assembly
	GO:0071824	22	protein-DNA complex subunit organization
	GO:0006323	23	DNA packaging
Lipid glycosylation	GO:0030259	19	lipid glycosylation	C33C12.8, cyp-35A5, cyp-35A1, dhs-23, clec-210
Cuticle	GO:0042302	24	structural constituent of cuticle	col-2, col-84, col-158, col-44, R07E5.4
Oxidoreductase activity	GO:0016491	25	oxidoreductase activity	B0272.4, Y75B8A.4, acox-1, F58A6.1, dhs-18
…	…	…	…	…
Presynaptic membrane	GO:0042734	1647	presynaptic membrane	K02E11.7, C45G9.6, clec-233, ttr-10, cup-4
Signal transduction	GO:0046579	1648	positive regulation of Ras protein signal transduction	R04B5.6, cyp-33D3, C34D10.1, stdh-2, ¹⁹FF10.1
Signal transduction	GO:0051057	1649	positive regulation of small GTPase mediated signal transduction	R04B5.6, cyp-33D3, C34D10.1, stdh-2, ¹⁹FF10.1
Calcium-dependent kinase	GO:0009931	1650	calcium-dependent protein serine/threonine kinase activity	tag-83, T21H8.5, cca-1, ZC101.1, ztf-16
Calcium-dependent kinase	GO:0010857	1651	calcium-dependent protein kinase activity	tag-83, T21H8.5, cca-1, ZC101.1, ztf-16

Analysis of expression data with module-weighted annotations

To test whether a set of gene fold-changes was significantly enriched for specific annotations, given a module-weighted annotation matrix, , we calculated the dot product of the data vector, x, comprising the set of gene fold-changes, and the matrix:The resulting vector, a, provides an indication of the degree to which genes with strong weights for each annotation also have strong fold-changes. To generate p-values from these, we permuted the fold-change vector, , 1000 times to create a background distribution for each annotation, which we then used to determine z-scores.

Data Availability

All data supporting the findings of this study are available within the paper and its supplemental materials. Dynamic access to the data, with tutorials, is also provided through a Shiny App graphical user interface at http://genemodules.org/. Specifically, 6 separate functionalities are provided:Supplemental materials include Supplemental Discussion and Supplemental Methods, Supplemental References, eight Supplemental Figures and four Supplemental Files. File S1, “Partitioned S matrix”, contains probe sets and genes that belong to each of the 209 DEXICA modules. File S2, “Module annotation summaries”, provides a summary of information about each module - top enriched GO terms and top activating perturbation. File S3, “Finding experiments that activate similar modules” contains a comparison of module activity in all possible pairs of perturbations (contrasts) in the microarray compendium to facilitate identification of perturbations that activate similar modules. File S4, “Module-weighted GO terms”, contains module-weighted GO annotations of each gene. Supplemental material available at figshare: https://doi.org/10.25387/g3.12061656. Tool 1: Determine which of the 209 modules defined here are active in the user’s gene expression data (by inputting a list of gene expression fold changes) Tool 2: Using the partitioned 209 modules as gene sets, test whether a list of user-specified genes is enriched in any particular module Tool 3: Detailed description of each of the 209 C. elegans gene modules Tool 4: Visualize how genes assigned to a module change under a variety of conditions (conditions can be chosen from the 716 perturbations derived from our microarray compendium or provided by the user) Tool 5: Get module-weighted GO terms associated with a user-specified gene Tool 6: Analyze gene expression data using module-weighted GO terms Code described in this study is available as an R package: DEXICA R package https://github.com/MPCary/DEXICA. Data to reproduce extraction of gene modules can be downloaded from https://github.com/MPCary/DEXDATA.Celegans. To install: install.packages(“devtools”) library(devtools) install_github(“MPCary/DEXICA”, build_vignettes = TRUE) install github(“MPCary/DEXDATA.Celegans”).

Results

Development of DEXICA and extraction of optimized gene modules

A large body of gene-expression data is publicly available(Barrett ; Rustici ) and has enabled computational prediction of gene modules(Kim ; Segal ; Ihmels ; Michoel ; Engreitz ). We refer to our method for going from a raw compendium of gene expression data to an optimized set of gene modules and a list of genes that belong to each module as DEXICA, for Deep EXtraction Independent Component Analysis (described below). While several methods exist for defining gene modules, independent component analysis (ICA) generally outperforms clustering-based approaches and principle component analysis in extracting biologically relevant signals from large datasets15,22-24. Advantages of ICA include its ability to deal well with high dimensional data and to generate modules that can share genes. Furthermore, ICA does not assume that latent signals in the data follow a Gaussian distribution, an important property for gene module prediction, as gene regulation signals appear to be primarily super-Gaussian(Purdom and Holmes 2005). For these reasons, we chose to use ICA for module construction. Briefly, ICA is a blind source separation method that attempts to “unmix” a signal comprising additive subcomponents by separating it into statistically-independent source signals(Hyvärinen and Oja 2000). In the common notation, an n x m data matrix, , is decomposed into two new matrices – an n x k source matrix, , and a k x m mixing matrix, , where k is the number of independent components:In the context of gene-expression analysis, is a matrix of m measurements (e.g., microarrays) of n genes, and k independent components are interpreted as gene modules. indicates the weight of each module in each microarray and indicates the relative level of inclusion of each gene in each module(Kong ) (Figure 1a).

Figure 1

Schematic of module prediction using DEXICA and derivation of module-weighted annotations. (a) A matrix of gene expression data, X, is decomposed using independent component analysis (ICA) into a gene module definition matrix, S, and a matrix containing the weight of each module in each microarray, A. Rows of the S matrix indicate the relative degree of inclusion of a given gene in each module. (b) Annotation enrichment within modules is calculated using matrices B and S. Known associations between genes and annotations are captured in a Boolean matrix B, where 1 indicates an association between a gene and an annotation and 0 indicates a lack of an association. S is partitioned into S, where 0 indicates a gene’s exclusion from a module, while -1 indicates a gene’s assignment to the “a” hemi-module and 1 to the “b” hemi-module (a hemi-module comprises genes that have extreme weights and the same sign in a given column of the S matrix). Enrichment of genes associated with each annotation in each hemi-module is calculated using hypergeometric tests and log(p-values) (for under-represented annotations) or -log(p-values) (for over-represented annotations) are recorded in a new matrix, E. (c) To generate module-weighted annotations, the S matrix is first transformed into a matrix, H, which indicates the weight of each gene in each hemi-module. H contains twice as many columns as S because there are two hemi-modules per module. The dot product between H and E results in the R matrix, where a row indicates the weighted association between a gene and each annotation. The color intensity in R indicates annotation weights that would result given the indicated expression values in X and the Boolean annotations in B. Color saturation in all matrices except B (Boolean) and S (trivalued) indicate relative numeric values, with least saturated colors (i.e., white) indicating highly negative values, and most saturated colors indicating highly positive values. Using simulated data, we found that module-prediction accuracy was highest when the number of extracted components matched the true number of modules (Figure S1). Therefore, we sought to optimize module prediction by evaluating results based upon their biological information content, such as enrichment of Gene Ontology (GO) terms(Ashburner ), REACTOME pathways (Fabregat ), and tissue-specific expression(Hunt-Newbury ). We applied ICA to a diverse compendium of 1386 C. elegans Affymetrix arrays obtained from the Gene Expression Omnibus (GEO) database(Barrett ). We then compared results obtained from a wide variety of preprocessing methodologies and the total number of components extracted. Omitting between-experiment quantile normalization from the preprocessing procedure produced modules that were more annotation rich than those produced by a published implementation of ICA-based module extraction (Engreitz ) (Figure 2a-c). Furthermore, annotation content of the modules was maximized when the number of gene modules (i.e., independent components) ranged from 191 to 226.

Figure 2

Optimization of gene modules. To determine the optimal preprocessing method and the optimal number of components (gene modules) to extract from a gene expression compendium of C. elegans microarray data, we calculated the number of Gene Ontology terms (a), C. elegans tissues (b), and REACTOME pathways (c) that were significantly enriched in at least one gene module. Black points show results from a compendium produced using a preprocessing procedure used by Engreitz et al.(Engreitz ); red points show results for the best alternative preprocessing method that we tested. Black dashed lines indicate the point on the x-axis of each graph at which loess regression curves showed the greatest difference between red points and results from randomized controls (gray points). (d) The number of modules produced by DEXICA, by a different ICA-based method used by Engreitz et al.(Engreitz ), and by C. elegans gene expression topomap generated by Kim et al.(Kim ). The number and the percentage of total predicted modules that have significant enrichment for various annotations are shown. Error bars indicate s.d. between repeat runs of DEXICA or Engreitz et al. method. (e) Distribution of the number of probe sets partitioned to each of the 209 DEXICA modules. (f) Distribution of the number of different DEXICA modules to which a given probe set is partitioned. (g) Each of the 418 DEXICA hemi-modules was tested for enrichment of genes with different structural properties – presence a long 3′-UTR, transcription as part of an operon, and presence of multiple annotated splice variants. To enable functional evaluation of the modules, it is useful to summarize them as discrete gene sets. To this end, we partitioned each column of the matrix into three sets of genes: one set consisting of genes excluded from the module, and two other sets consisting of genes implied to be regulated in opposite directions. We refer to these latter two sets as “hemi-modules” “a” and “b”, one set consisting of genes with highly positive weights and the other consisting of genes with highly negative weights (signs assigned based on skewness) in the independent component. Thus, module genes regulated in one direction (up or down) are part of one hemi-module and genes regulated in the opposite direction (down or up) are part of the other hemi-module. The “a” hemi-module is derived from the more highly skewed side of the independent component, and is usually larger (i.e. contains more genes). While others have used a fixed-threshold approach to component partitioning (Lee and Batzoglou 2003; Chiappetta ; Engreitz ); for example, defining genes with weights exceeding +/− 3 standard deviations from the component mean to be “in” each hemi-module, we found that different individual modules showed maximum annotation enrichments at different thresholds, suggesting that a ‘one-size-fits-all’ approach to partitioning was sub-optimal. An alternative approach to partitioning that we attempted (described in Frigyesi ) failed to converge in many cases (data not shown). Therefore, to increase partitioning accuracy, we trained a function to predict partitioning thresholds from the shape of component distributions. Because plots of training data revealed a complex solution surface, we decided to use an artificial neural network (ANN) to predict partitioning thresholds for each component from the skewness and kurtosis of its distribution. The output of the ANN, i.e., predicted partitioning thresholds, is shown in Figure S2a. Although using a fixed-threshold approach to module partitioning produced similar results qualitatively (Figures S2b-d), it resulted in fewer significant annotations across the range of parameters tested than did ANN-based partitioning (P <2.2E-16, Figure S2e). Because the mean optimum number of extracted components (dashed vertical lines in Figure 2a-c and Figures S2b-d) was similar for both threshold and ANN partitioning (209, and 209.33, respectively), we chose 209 as the final number of components to extract from the compendium. Gene modules are expected to represent sets of genes that are co-regulated at the level of mRNA expression or stability. Therefore, DEXICA modules should be enriched for DNA and RNA regulatory sequences. To test this, we generated a list of potential regulatory oligonucleotide sequences (called ‘words’) by applying the Mobydick algorithm(Bussemaker ) to the set of all predicted C. elegans promoter regions and, separately, to the set of all predicted C. elegans 3′-UTRs. We then calculated the statistical significance of the over- or under-representation of genes bearing each word in each gene module. Across multiple runs with 209 components, the mean number of gene modules containing significant promoter words and 3′-UTR words was 106.3 and 40.6, respectively, significantly greater than results produced by other module prediction methods we tested (P <2.2E-16, Figure 2d). Because the ICA algorithm employed by DEXICA, fastICA, converges to a final solution from a random starting point(Hyvärinen and Oja 2000), small differences typically exist in the output of different runs; these differences can be seen in the vertical spread of data points in Figures 2a-c, and in the error bars in Figure 2d. While others have reconciled such differences through a clustering approach applied to the output of numerous runs of the algorithm (so called “iterated ICA”)(Frigyesi ; Engreitz ), when applied to the C. elegans Affymetrix compendium, many of the final components generated by this method were highly correlated to one another, indicating non-independence and potential redundancy among the components (data not shown). We therefore sought to choose a single, high quality, fastICA run output to use as predicted gene modules. Because we considered word enrichment the most unbiased measure of module quality (as it relies only on DNA sequence data), we chose as our final module set (File S1) the run from a set of 100 with the best combined rank of significant promoter and 3′-UTR words (it ranked first in promoter words and third in 3′-UTR words, Figure S3). The final median module size is 319 probe sets (Figure 2e) with half of all probe sets belonging to at least 3 modules (Figure 2f). Only 4% of all genes (729 genes, 1189 probe sets) do not belong to any module. These genes show significant enrichment for germline or embryonic expression. Co-expressed genes often share common structural elements. For example, genes within operons are switched on together during recovery from growth-arrested states in C. elegans (Zaslaver ) and 3′-UTR length is associated with proliferation in cancer cells (Mayr and Bartel 2009). To further explore the information content of DEXICA-extracted modules, we tested each hemi-module for over- and under-enrichment of genes with long 3′UTRs, for genes appearing in operons and for genes with multiple splice forms. Of the 418 hemi-modules, 65 contained a significant bias toward long 3′-UTR genes and 58 contained a bias toward short 3′-UTR genes (q < 0.1, threshold chosen based on randomized control trials, see below; Figure 2g). Twenty-one hemi-modules were significantly enriched and 205 hemi-modules were significantly depleted for operon genes, and 81 hemi-modules were enriched and 80 hemi-modules were depleted for genes with multiple splice variants (Figure 2g). Control tests performed on the same module set but with randomly scrambled gene IDs produced no significant modules below q = 0.1 for any of the gene properties we tested (data not shown). Therefore, DEXICA successfully groups genes with common structural features into modules, consistent with the known relationship between gene structure and expression. Together, these results show that the 209 C. elegans gene modules extracted from a large microarray compendium are enriched for GO terms, tissue and pathway annotations as well as for potential regulatory DNA sequences and gene structural properties. DEXICA is available as an R package (https://github.com/MPCary/DEXICA). It provides tools to optimize ICA module extraction and partitioning based on annotation enrichment and can be applied to any gene expression compendium.

C. elegans DEXICA modules are biologically informative

We observed enrichment of functional gene annotations and oligonucleotide sequences in DEXICA-extracted modules with even the least optimal parameter settings (see Figure 2). To further test the biological significance of the final 209 modules and to begin annotating them, we constructed an alternate microarray compendium comprising not 1386 individual arrays, as in the original compendium, but rather the gene fold changes arising from contrasting experimental and control samples in the same experiment. Projecting the resulting 716-column matrix (one for each contrast) into the space defined by our gene modules allowed us to see which experimental perturbations activate or inhibit which modules. We compared the most enriched GO terms in each module to the most strongly activating or inhibiting experimental perturbations and observed that in several cases these were in obvious agreement (File S2 and Tool 3 on http://genemodules.org/). For example, the strongest activation of m73 (module 73) is achieved by an embryonic development time course experiment. Consistent with this, the top GO terms enriched within genes that comprise m73 describe cell division and the cell cycle. Module 153 is strongly activated by a comparison of L3 lethargus (a larval period prior to molting) to active L3 larvae and, accordingly, this module is enriched for GO terms that describe cuticle remodeling. Similarly, m200 is activated in response to the pathogen P. aeruginosa and the top GO terms enriched within m200 genes include “defense response to Gram-negative bacteria” and “innate immune response”. DEXICA modules can also capture gene expression patterns that define specific tissues. For example, m23 is strongly activated by GFP-based enrichment of neurons (GSE8004). The top GO term associated with m23 is “neuropeptide signaling pathway” and genes in this module have highly significant associations with neuronal anatomy terms. Similarly, m144 is strongly activated by genes specific to embryonic muscle cells (GSE8462) and is enriched for GO terms such as “striated muscle dense body” and anatomies such as “body wall musculature”. Together, these results indicate that DEXICA modules represent biologically meaningful sets of genes. It is very important to note that a lack of an obvious agreement between the nature of a perturbation that strongly activates a module and the top GO terms enriched in that module does not necessarily indicate that a module is biologically meaningless. On the contrary, this apparent disagreement can be a powerful tool for understanding complex transcriptional responses to perturbations. For example, starvation of L1 larvae induces a developmental arrest and has widespread effects on metabolism and energy allocation (Baugh 2013). Module 51 is strongly activated by L1 starvation and is enriched for GO terms that describe ribosome biogenesis (e.g., “ribosome biogenesis”, “nucleolus”), autophagy (e.g., “autophagosome maturation”) and response to starvation (e.g., “cellular response to nitrogen starvation”). Thus, this module represents several coordinated processes that become activated in response to larval starvation. Obvious agreement between a perturbation and GO terms would also be absent if a process had not previously been associated with a particular condition, making modules useful for hypothesis generation. For example, a wild C. elegans isolate, JU1580, shows strong activity of m55 relative to the laboratory C. elegans strain, N2, and genes within m55 are enriched for the “innate immune response” GO term. This leads to a hypothesis that the immune systems of JU1580 and N2 have different levels of basal activity.

Application of DEXICA modules to improve gene annotations

Because gene annotations can be incomplete and biased, knowledge about which genes tend to be co-expressed in an organism can help inform which annotations are more likely to be biologically relevant and help infer annotations of orphan genes based on annotations of other module members (see Supplemental information for additional discussion). To this end, we devised a calculation of scores, which we call “module-weighted annotations”, based both on how much a given annotation is enriched in each module (matrix , Figure 1b-c) and the degree to which a given gene belongs to that same module (matrix , Figure 1c).

Module-weighted GO terms:

To test the utility of module-weighted annotations, we first examined whether they generally recapitulate traditional (Boolean) annotations. If so, then genes traditionally associated with each annotation should have larger weights for those annotations (when normalization is omitted from their calculation) than do other genes (two-sample KS test, alpha level = 0.05). For this analysis we used GO terms as annotations and restricted the set to terms with at least 15 annotated genes in order to ensure robust signals; this set comprised 1651 GO terms. In 98.6% (1628/1651) of cases, genes associated with each term had significantly larger module-weighted annotations than did other genes, indicating that module-weighted GO annotations do recapitulate Boolean annotations. We ranked each GO term by the weight of its most strongly associated gene (Table 1). GO terms with the most highly weighted genes were “ribosome” (CC) and “structural constituent of ribosome” (MF). Consistent with this finding, genes involved in ribosome biogenesis are known to be transcriptionally co-regulated(Wade ). On the other hand, GO terms associated with signal transduction and kinase activity showed the lowest gene weights. These results are statistically significant (see Supplemental Methods methods) and show that some types of gene groupings (e.g., genes encoding kinases) often used in the analysis of gene expression data may not actually represent functions that are coordinately regulated at the transcript level. Therefore, we suggest that researchers use caution when interpreting a result that certain annotations, like “kinases”, appear to be enriched in a transcriptomics experiment. A sensitive test will tend to give similar results despite small amounts of random noise being added to the input data. To test the sensitivity of a module-weighted annotation-based enrichment test, we obtained gene fold changes for the 5 most recent (at the time of query) C. elegans Affymetrix experiments deposited to the GEO database. These experiments were not included in the data used to construct the modules. We then added varying amounts of Gaussian noise to the fold changes and for each level of noise, we calculated enrichment z-scores for each GO term using three different methods: the Kolmogorov-Smirnov (KS) test, the t-test (n.b., one-sample and two-sample t-tests produced nearly identical results, data not shown), and scalar projection of the gene fold changes onto a module-weighted GO term matrix. We then compared these results to those obtained by each method when no noise was added (Figure 3a). Dissimilarity to the initial results (zero added noise) increased rapidly with added noise for both the KS test and t-test, while the projection-based results were similar (ρ > 0.75) at even the highest noise levels tested (5 standard deviations), suggesting that the projection-based test is more sensitive than both the KS and t-tests.

Figure 3

Performance of module-weighted annotation-based tests. (a) To assess sensitivity, Gaussian noise was added to gene expression data from 5 recent C. elegans Affymetrix experiments not included in the compendium used to train the modules. [The standard deviation of the noise distribution (μ = 0) was varied from 0.5x to 5x the standard deviation of the gene fold change distribution.] At each noise level, z-scores for each GO term (based on 100 random permutations of the gene IDs in the input data) were calculated using three different methods: KS – Kolmogorov-Smirnov test; t-test – two-sample t-test (one-sample test gave highly similar results, data not shown); projection – projection of the gene fold changes into the space defined by the module-weighted annotation matrix, R. When conducting KS and t-test, fold changes of genes assigned to each GO term were compared to fold changes of genes not assigned to that term. Spearman correlation coefficients were calculated between Z-scores at each noise level and Z-scores without any noise added. (b) To assess specificity, 100 highly dissimilar pairs of experiments (Spearman correlation, ρ, of gene fold changes near 0) were selected from a set of 188,805 pairs generated using published microarray data. For each contrast belonging to a pair, we determined Z-scores for GO annotations using the three methods and calculated the rank correlation (rho) of the absolute value of these Z-scores between pair members. The center of the box represents the median value and whiskers extend to the most extreme data point that is not further than 1.5 times the IQR from the box. (c) Gene fold changes in and mutants were projected into promoter word space using module-weighted promoter words (see Methods). The 10 points furthest from the origin are highlighted with colored circles in the figure (orange = positive SVE for , blue = negative SVE for ), and their words corresponding to these points are shown to the right of the figure. Four of the six words highlighted in orange contain full or partial matches to the canonical binding site, (A/G)CGTG (underlined). A specific test will tend to show dissimilar results when given dissimilar inputs. To test the specificity of module-weighted annotation analysis, we selected the 100 most dissimilar experiments (correlation of gene fold changes near zero) from a set of 188,805 comparisons. We compared the results of projecting the gene fold changes from each experiment onto the module-weighted annotation matrix to the results obtained from the KS and t-tests and found that the experiments showed the weakest similarity in the significance levels of annotations when using the projection method (P = 5.2e-13, Figure 3b). These results show that annotation enrichment analysis using module-weighted annotations may provide more reliable biological insights than gene set enrichment analyses that rely on the KS (GSEA(Subramanian )) or t-tests (PAGE(Kim and Volsky 2005), GAGE(Luo )). Module-weighted GO terms for each gene (matrix , Figure 1c) are provided in File S4 and their significance in a query gene expression dataset can be tested using Tool 6 on http://genemodules.org/.

Module-weighted promoter words:

While any Boolean gene annotation may be converted into a module-weighted annotation, module-based weights seem particularly well suited for describing regulatory sequences, such as putative transcription factor and microRNA binding sites. To this end, we generated weighted annotations for each of the 5230 words in the promoter word dictionary we constructed using the Mobydick algorithm (described above and in Methods). To validate predicted regulatory word weights, we searched the literature, the JASPAR database of transcription factor binding profiles (Mathelier ), and the GEO database for C. elegans transcription factors with both an experimentally characterized DNA-binding profile and, separately, a microarray experiment that measured gene expression in a loss-of-function mutant of the transcription factor gene. This search yielded 6 transcription factors: , , , , , and . We then projected the loss-of-function microarray experiment data (positively and negatively changing genes separately) onto the module-weighted word matrix, (Figure 1c), and calculated z-scores for each word. Finally, we compared the top-scoring words to the DNA-binding profiles of the respective transcription factors. If the predicted promoter-word weights are accurate, then words that resemble the binding profile should score highly in this analysis. For and , the most significantly enriched words in the positively and negatively changing genes, respectively, matched the canonical binding sites (Table 2). A word matching the binding site scored 6th overall among the up-regulated genes, and for , four of the top 20 words for the up-regulated genes contained GAACT or AACTT, which partially matched the reverse compliment of a reported binding half-site, AGTTCA(Shostak ). In the data set, several words matching the so-called “ associated element” (DAE)(Zhang ) scored highly. However, none of the four words matching the canonical binding site, T(G/A)TTTAC were among the words comprising our Mobydick promoter-word dictionary, precluding these from being represented in the analysis. The canonical binding site for the final transcription factor, , is GAAC, but like the canonical binding site, neither this word nor its reverse compliment was present in the promoter word dictionary, precluding it from representation. Taken together, these results suggest that module weighted regulatory sequences can be used to determine important regulatory sites in a gene expression experiment, and further validate our method for calculating weighted annotations.

Table 2

Top scoring promoter words for transcription factor perturbation experiments. The table shows the top three most significant words (those with the most extreme z-scores), calculated using module-weighted promoter words, for each transcription factor loss-of-function microarray experiment. If a full or partial match (4 or more bases) to the canonical binding site of the factor does not occur within the top 3 ranking words, the most significant such match is also shown. Bold letters indicate matching positions to canonical binding sites

Factor	Canonical site	Top words	Projection set	Z score	Z score rank
DAF-16	T(G/A)TTTAC, CTTATCA*	TTCTTATCA	Down-regulated	−3.95	1
		GGAAG	Down-regulated	−3.69	2
		TTTTCTG	Down-regulated	3.47	3
HIF-1	ACGTG	ACGTGAAC	Up-regulated	4.39	1
		CGTGAAC	Up-regulated	4.38	2
		ACGTG	Up-regulated	3.71	3
NHR-23	AGGTCA	AGGTCA	Down-regulated	−5.29	1
		TGACCTA	Down-regulated	−4.72	2
		CCTCCCCC	Down-regulated	−4.39	3
HLH-30	TCACGTGA(C/T)	CTTACTATT	Up-regulated	−4.3	1
		CGTAATCC	Up-regulated	4.14	2
		CTTTTTTCT	Down-regulated	4.07	3
		CACGTG	Up-regulated	−3.09	20
LIN-14	GAAC	CCTACCTACCTA	Down-regulated	4.35	1
		GCGCGTCAAATA	Up-regulated	−3.95	2
		GCCGCGCACCCC	Down-regulated	−3.78	3
		GGTTCTGG	Down-regulated	−2.53	109
DAF-12	AGTTCA	CCCCAC	Down-regulated	−4.41	1
		GCTC	Up-regulated	4.28	2
		CCCCGCC	Down-regulated	−4	3
		AACTTTT	Up-regulated	3.51	11

Finally, we tested whether module-weighted words could reveal activity of the transcription factor HIF-1, previously shown to be required for longevity of the mitochondrial respiration mutants (Lee ). We calculated promoter word z-scores for the microarray data set and compared the results to those for . As predicted, we observed a very strong anti-correlation between the promoter word z-scores for mutants and those for mutants (Figure 3c, R = -0.581, P < 2.2E-16) and four of the six most active words matched the canonical HIF-1 binding site [(A/G)CGTG; underlined in Figure 3c]. These results further support the utility of module-weighted promoter oligonucleotides for identifying biologically relevant regulatory sequences directly from gene expression data.

Informative descriptions and annotations of DEXICA modules

Gene co-expression modules are useful for interpreting gene expression data because they often represent transcriptional signatures of biological processes (Segal ). Therefore, detailed information about modules that are active in a gene-expression experiment would be of primary interest to an investigator. We showed above that C. elegans gene modules constructed using DEXICA are biologically informative. As a comprehensive resource, we created a Module Annotation Page for each of the 209 modules (see Data availability, Tool 3). Each page shows the module’s significantly enriched GO terms (excerpted in File S2), tissues of known expression, enriched promoter and 3'-UTR oligonucleotide words, transcription-factor binding sites predicted using HOMER (Heinz et al. 2010) and component genes (each linked to a WormBase description). The component genes can be interrogated further for enrichment of additional gene annotations or regulatory motifs [e.g. using RSAT (Nguyen et al. 2018)] to gain a deeper understanding of a module's function. A limitation of using GO terms and other pre-existing annotations to interpret gene modules is that some cellular activities are poorly annotated (see UPRmt example in section 1 below). For this reason, an additional strategy to understanding what a given module represents is to examine its activity under a variety of conditions. To facilitate this, in the Module Annotation Pages we have also provided a ranked list of perturbations (from the 716 we generated) that activate each module significantly (also see File S2). While it is unlikely that a typical gene module can be described completely using a single semantic annotation due to the complexity of gene regulatory networks, we suggest processes that each module likely captures based on our comparison of module activity under several conditions and our examination of enriched GO terms (File S2). In a typical workflow, once the active modules in a gene expression experiment are identified by a researcher, we recommend that the researcher consult Module Annotation Pages and File S2 to infer the biological process(es) described by the modules of interest.

Guide to using DEXICA modules

Gene modules can uncover processes that are poorly annotated and therefore missed by conventional analyses:

Mutations in components of the electron transport chain reduce respiration, slow development and reproduction and increase lifespan in C. elegans (Dancy ). The mitochondrial unfolded protein response (UPRmt) is induced in response to a stoichiometric imbalance between nuclear and mitochondrial proteins within mitochondria, and this process is known to be active in long-lived respiration mutants (Baker ; Durieux ). However, analysis of significantly differentially expressed genes in the respiration mutant, either using conventional GO term- or KEGG pathway- overrepresentation tests or using a more recent tool designed to query more comprehensive annotations of C.elegans genes, WormCat(Holdorf ), failed to identify UPRmt or, for that matter, any process related to the mitochondria (Figure 4a). We asked whether application of DEXICA modules could successfully identify a UPRmt gene expression signature in mutants.

Figure 4

Analysis of module activity in respiration mutants reveals activity of the key mitochondrial unfolded protein response (UPRmt) transcription factor, ATFS-1. (a) Conventional annotation analyses of genes that are differentially expressed in () respiration mutants. GO term-overrepresentation was determined using GOrilla(Eden ) and KEGG pathway-overrepresentation was determined using DAVID(Huang ). WormCat(Holdorf ) is an online analysis tool that queries the GO database but works independently of GO and addresses some of its limitations. Significant gene expression changes were determined using the limma R package and FDR-adjusted p-value of <0.05. (b) Knockdown of a protein quality-control protease is the top non-respiration microarray experiment in the compendium that induces activity of the module m47. Module activity is expressed as signed variance explained (SVE). (c) Comparison of () mutants and animals that express a constitutively active form of , a key transcription factor required for the UPRmt. The gene expression scatter plot shows all detected genes (22,625; black) overlaid with genes that belong to m47 (367; pink). r, Pearson correlation coefficients. (d) Normalized changes in expression of genes that belong to m47 (367 genes) in response to inhibition of respiration ( mutation, column 1), activation of UPRmt by disrupting mitochondrial protein quality control (column 2), constitutive activation of under normal conditions (column 3), and activation of UPRmt in the absence of (column 4). (e) Genes up-regulated by ATFS-1 during UPRmt, as determined in Nargund et al. 2012 (Nargund ), are strongly enriched in m47. Enrichment was calculated using hypergeometric statistics. As a first step of gene expression analysis using DEXICA modules, we determine a fold change of every detected gene between mutants and wild-type animals. This calculation is done without imposing any cut offs on the data (e.g., p-values or effect sizes). We then project the resulting vector onto the module definition matrix (, see Figure 1), which can be done by using Tool 1 of the online user interface (http://genemodules.org/). Three modules - m47, m66 and m169 - displayed the highest activity (Figure 4b). Module activity is represented as Signed Variance Explained (SVE), where + or - indicates direction of activity (useful when comparing two or more experiments), and the absolute value of SVE represents the relative proportion of variance in a gene expression experiment explained by a particular module. The sum of absolute values of SVE for all modules is 1. By consulting Module Annotation Pages (see section “Annotation of DEXICA modules” and Tool 3 on http://genemodules.org/), we determine that the same three modules are strongly active in other respiratory-chain mutants, but also that the most strongly activating non-respiration perturbation of m47 is a mutation in , a mitochondrial protein quality-control protease, disruption of which is known to induce UPRmt (Pellegrino ) (Figure 4b – a plot like this can be generated by calculating module activity in each experiment using Tool 1 and plotting the resulting SVE values against each other). This finding suggests that m47 may encompass UPRmt genes. If true, then m47 should be active in animals with constitutive activity of the transcriptional UPRmt regulator, ATFS-1. To test this prediction, we calculated module activity in gain-of-function mutants (this perturbation was not part of the original compendium, GEO Accession number GSE73669). Indeed, induction of the UPRmt transcriptional response in otherwise healthy animals using this mutation strongly and specifically induced activity of m47 (Figure 4c). Furthermore, genes that belong to m47 showed concordant expression in the respiration mutant, in response to UPRmt induction by disruption of , and in response to constitutive activity of ATFS-1 in normal animals (Figure 4c, d). As expected, mutation of prevented these changes in animals with an induced UPRmt ( RNAi) (column 4 Figure 4d). Moreover, two chaperone genes known to be induced during UPRmt ( and ) are part of m47. Taken together, these data strongly suggest that m47 represents genes induced by ATFS-1 during the UPRmt. This is an example of how identifying active gene modules in a mutant of interest and finding other conditions that activate the same modules can reveal biologically informative gene signatures. Analysis of module activity, but not GO term or KEGG pathway enrichment analyses, was able to identify correctly a key biological process using gene expression data from mutants. Why did GO term enrichment analysis fail to recover the “mitochondrial unfolded protein response” term (GO:0034514)? To our surprise, there are only eight C. elegans genes annotated with this GO term (, , , , , , and ). Among these, only and are induced during UPRmt, whereas the others are genes needed for activation of UPRmt. In the KEGG pathway database, UPRmt is not included as a discrete entry, but is part of the “Longevity regulating pathway – worm” (map04212) entry, and similar to GO, primarily includes inducers rather than mediators of the UPRmt. These examples illustrate how incomplete annotations and/or overly general groupings (e.g., containing both activators and mediators) can result in a failure of standard annotation enrichment methods to detect biological signals in gene expression data. Nargund et al. have defined a set of ATFS-1-targeted genes based on up-regulation of these genes with or without and in the absence or presence of mitochondrial stress(Nargund ). While module analysis was able to reveal UPRmt without reliance on any pre-existing gene annotations or pre-defined gene sets, we wondered whether GSEA(Subramanian ) analysis of gene expression data using this gene set would have been able to detect ATFS-1 activity. The Nargund et al. ATFS-1 gene set showed enrichment within genes, but this enrichment was not statistically significant (Figure S4). In contrast, when we tested enrichment of the ATFS-1 gene set within modules (enabled by Tool 2 on http://genemodules.org/), we found a highly significant enrichment within m47 (Figure 4e). These results show that given a set of functionally related genes, testing enrichment of that set within modules can be more informative than testing enrichment within a ranked list of gene changes, likely because modules comprise groups of genes that are functionally related.

Gene modules can uncover gene signatures despite a lack of significant differential expression:

Reduction-of-function mutations extend lifespan in a manner dependent on the hypoxia inducible factor 1 (HIF-1) (Lee ), but, unexpectedly, we found that the overlap between the significant genes obtained by comparing or mutants to wild type was not statistically significant (Χ test p-value = 0.17; Figure S5a). The other highly active modules in mutants besides m47 are m66 and m169. We wondered if these modules represent genes regulated by HIF-1. Indeed, when we compared gene module activity in and mutants, we found a strong anti-correlation (Pearson correlation between SVE = -0.730, P = 4.7E-36; Figure S5b) driven by m66 and m169. The negative correlation is consistent with the finding that the life extension observed when activity is reduced requires activity of HIF-1. The idea that HIF-1 directly regulates transcription of genes in m66 and m169 is further supported by the fact that the canonical HIF-1 binding site [(A/G)CGTG] is the top oligonucleotide sequence enriched in the promoters of the genes comprising these modules (q-values 8.13e-105 and 1.47e-28, respectively). The similarity between module activity in and datasets, despite a lack of similarity among their most differentially expressed genes, suggests that the role of HIF-1 in regulating the lifespan of mutants may be to instigate small but coordinated expression changes in many genes, most of which fail significance tests for differential expression in one or both datasets. These results demonstrate that gene modules are sensitive toward identification of transcriptional signatures and therefore can be useful for analysis of datasets in which few gene changes reach statistical significance.

Gene modules as a hypothesis generation tool:

As shown above, the correlation between the module activities of and was substantial (r = -0.730), and therefore served as a strong validation of the module approach to uncovering functionally significant processes. To determine how often two sets of module activities generated from different experiments could be expected to show this degree of similarity, we determined module activity correlations for all possible pairs of the 716 experimental contrasts, excluding pairs in which both contrasts originated from the same experiment (i.e., from the same GEO series). This produced 188,805 contrast pairs, 13,376 (7.08%) of which showed a statistically significant correlation (Holm corrected p-value <0.05). As expected, the highest-ranking pairs were the same experiment performed by different labs or at different times. The strength of the and correlation would have fallen within the top 1%, had those experiments been part of the contrast set. Interestingly, had we observed the similarity between the and projections without any prior genetic data, we might have hypothesized a role for in mutants before such a role was discovered from genetic screening (Lee ). Therefore, to give the reader a chance to find other unexpected correlations between perturbations (i.e., gene expression changes that activate similar modules but that are generated by different experiments), we include the full set of contrast comparisons in File S3. We hope these comparisons will prove useful for hypothesis generation.

Guide to using module-weighted annotations

To aid in functional interpretation of gene expression data despite sparse annotations of many C. elegans genes, we have developed a process of transferring gene annotations based on the degree of gene co-expression to create “module-weighted gene annotations” (see description above). This approach provisionally transfers annotations, such as GO terms, that are enriched in a module to all gene members of that module. This can be useful in at least two ways: for studying individual genes and for identifying functional categories within gene expression data. For studying individual genes: if a query gene has a strong module-weighted association to a GO term with which it is not traditionally associated, it means that this gene is co-expressed with other genes that are traditionally associated with the GO term. For example, by applying Tool 5 (http://genemodules.org/) to the small ribosomal subunit S16 (), we find that most of the module-weighed GO terms associated with this gene have something to do with the ribosome or translation, indicating that is tightly co-expressed with other genes involved in ribosome biogenesis and not much else. On the other hand, while , a superoxide dismutase, does have significant weights for GO terms with which it is traditionally associated (e.g., oxidoreductase activity, superoxide metabolic process and response to superoxide), “catalase activity” ranks more highly for . This indicates that is co-expressed with genes annotated as catalases. While it is unlikely that has a novel catalase activity, it is more likely that expression of , a superoxide dismutase, is coordinated with expression of catalases, since hydrogen peroxide produced by is further degraded by catalases to avoid damage to the cell. Some of the other top-ranking module-weighted GO terms for describe dioxygenase activity. Similarly, this may indicate that dioxygenases generate superoxide radicals, and therefore increased expression of these enzymes is typically correlated with increased expression of a superoxide dismutase. Therefore, analysis of module-weighted annotations of individual genes can help form hypotheses about novel gene functions and transcriptional co-regulation of distinct biological processes. The second application of module-weighted annotations is similar to a common analysis scheme whereby over-represented annotations (e.g., GO terms) are identified within a list of significantly differentially expressed genes. The key difference is that traditional annotations have been extended using information about gene co-expression (i.e., gene modules). Furthermore, this approach leverages all expression data, even genes that do not change in a significant way. Instead of calculating enrichment p-values, Z-scores are calculated based on the strength of the module-weighted association between each gene and each annotation and the magnitude of that gene’s fold-change in the list of gene fold-changes. This analysis scheme is more selective and sensitive toward identifying functional categories of genes that change in response to a treatment than traditional over-representation analyses (see section “Application of DEXICA modules to improve gene annotations” above). Tool 6 on http://genemodules.org/ accepts a list of gene fold-changes and returns a table of module-weighted GO terms ranked by their predicted activity.

Discussion

We have captured much of the transcriptional wiring of C. elegans by extracting gene co-expression modules from a large compendium of gene expression data. These 209 modules represent RNA-level signatures of diverse biological processes that can occur in C. elegans and, along with the tools we provide, can be used as a resource for analyzing gene expression data, or in fact to interrogate gene lists of any sort. Because genes are grouped into modules based purely on how they behave in experimental assays, and because modules encompass both previously-annotated and unannotated genes alike, modules can reveal signals within transcriptomic data that otherwise would be missed due to incomplete knowledge of gene function, subtle gene expression changes or noisy data. Experimentally, gene-expression modules can be used to deconvolve complex phenomena into subsets of co-regulated genes, genes that likely act together to mediate a specific process. The modules can be annotated extensively, as we have done, and these annotations can be applied provisionally to all genes in the module. Regulatory factors associated with specific annotations (like 5′ or 3′ oligonucleotide words) can be implicated as well, and functional links can be revealed between dissimilar conditions that activate the same modules (e.g. see File S3). ICA has been applied to the prediction of gene modules before, but we could find no examples in the literature of optimizing the process using biological metrics in the manner that we describe. Combined with the improved ability to partition independent components provided by our artificial neural network approach, we expect that DEXICA will be useful for constructing gene modules for other organisms. The DEXICA software and our C. elegans modules and data are freely available as R packages online (see Data availability). While we have taken steps to maximize module prediction accuracy for the microarray compendium we assembled, many additional gene modules may exist in C. elegans that were not perturbed sufficiently in the samples comprising the compendium to be detected. These gene modules would remain hidden. As new areas of research are explored and new experiments are published, however, new gene modules will be discovered. For example, most of the gene expression data in our compendium was collected using whole animals. Data collected from isolated cells or tissues could help produce modules that are active in a relatively small number of cells. The DEXICA package can be used to create new and improved gene modules using compendia with expanded information content. The impetus for constructing gene modules in C. elegans was to create gene groupings that do not rely on existing annotations, because annotations of many genes are missing or incomplete. We then wondered whether numeric scores based on gene co-expression could actually improve the existing gene annotations, leading us to develop the concept of module-weighted annotations. By weighting an association between each gene and each annotation by the degree to which that annotation appears to predict gene modularity, annotations that are shared by co-expressed genes are “boosted” and those that are not are diminished. It is important to note, however, that relevance scores between genes and annotations could be calculated using other metrics of gene behavior as well. For example, a gene might be weakly associated with a term in the context of gene expression but strongly associated with it in the context of physical protein interactions. We have found that analysis of module-weighted GO terms is less sensitive to noise than are typical statistical over-representation tests. Furthermore, because module-weighted annotations incorporate information about gene expression, they effectively model the gene-gene correlation structure of the system. This is useful because typical over-representation tests do not perform well with gene sets that have a high level of gene-gene correlation [i.e., annotations assigned to genes that are strongly co-expressed are more likely to be significant (Goeman and Buhlmann 2007; Gatti ; Tamayo )]. GSEA, for example, deals with gene-gene correlation issue using permutation. We show that module-weighted GO terms produce significantly fewer false positives than do over-representation tests (see Figure 3b). Therefore, we think that module-weighted annotations are a promising new way to address the gene-gene correlation problem and are working to develop it further for annotation enrichment analysis. Module-weighted annotations may be especially useful for promoter/3′-UTR word analysis. Projection of gene expression changes onto the module-weighted word matrix allows identification of potential regulatory sequences directly from a list of gene fold changes, bypassing many steps required by traditional regulatory sequence detection (sequence retrieval, repeat masking, over-representation analysis, and background correction). More importantly, this analysis is more likely to yield true positive results because it makes use of a large amount of historical data (via the modules) to filter out sequences with no apparent regulatory function. We used module-weighted word analysis to analyze data from six transcription factor perturbation microarray experiments. Four of the six produced high scoring promoter words that closely matched the known DNA binding sites of the corresponding transcription factors. For the two that did not, and , words that exactly matched the factors’ canonical binding sites were not present in the promoter word dictionary generated by the Mobydick algorithm, and results for these factors could be poor for this reason. Interestingly, the second highest scoring word for the genes down-regulated upon perturbation was GGAAG, and this word occurs twice more as a substring among the top 20 scoring words. This sequence is a partial match to an alternative binding site reported in hookworm (Gao ), G(A/G)(C/G)A(A/T)G, suggesting that this site may be functional in C. elegans as well. Finally, because annotations shared among a significant fraction of the genes in a module are “transferred” to all genes in the module, module-weighted annotations can be used provisionally to infer novel functions for genes, which is especially useful for studying poorly annotated genes.

59 in total

1. affy--analysis of Affymetrix GeneChip data at the probe level.

Authors: Laurent Gautier; Leslie Cope; Benjamin M Bolstad; Rafael A Irizarry
Journal: Bioinformatics Date: 2004-02-12 Impact factor: 6.937

2. Error distribution for gene expression data.

Authors: Elizabeth Purdom; Susan P Holmes
Journal: Stat Appl Genet Mol Biol Date: 2005-07-12

3. Identification of C. elegans DAF-12-binding sites, response elements, and target genes.

Authors: Yuriy Shostak; Marc R Van Gilst; Adam Antebi; Keith R Yamamoto
Journal: Genes Dev Date: 2004-10-15 Impact factor: 11.361

Review 4. A review of independent component analysis application to microarray gene expression data.

Authors: Wei Kong; Charles R Vanderburg; Hiromi Gunshin; Jack T Rogers; Xudong Huang
Journal: Biotechniques Date: 2008-11 Impact factor: 1.993

5. NCBI GEO: archive for functional genomics data sets--10 years on.

Authors: Tanya Barrett; Dennis B Troup; Stephen E Wilhite; Pierre Ledoux; Carlos Evangelista; Irene F Kim; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Rolf N Muertter; Michelle Holko; Oluwabukunmi Ayanbule; Andrey Yefanov; Alexandra Soboleva
Journal: Nucleic Acids Res Date: 2010-11-21 Impact factor: 16.971

6. Identification of hookworm DAF-16/FOXO response elements and direct gene targets.

Authors: Xin Gao; Zhengyuan Wang; John Martin; Sahar Abubucker; Xu Zhang; Makedonka Mitreva; John M Hawdon
Journal: PLoS One Date: 2010-08-19 Impact factor: 3.240

7. PAGE: parametric analysis of gene set enrichment.

Authors: Seon-Young Kim; David J Volsky
Journal: BMC Bioinformatics Date: 2005-06-08 Impact factor: 3.169

8. High-throughput in vivo analysis of gene expression in Caenorhabditis elegans.

Authors: Rebecca Hunt-Newbury; Ryan Viveiros; Robert Johnsen; Allan Mah; Dina Anastas; Lily Fang; Erin Halfnight; David Lee; John Lin; Adam Lorch; Sheldon McKay; H Mark Okada; Jie Pan; Ana K Schulz; Domena Tu; Kim Wong; Z Zhao; Andrey Alexeyenko; Thomas Burglin; Eric Sonnhammer; Ralf Schnabel; Steven J Jones; Marco A Marra; David L Baillie; Donald G Moerman
Journal: PLoS Biol Date: 2007-09 Impact factor: 8.029