| Literature DB >> 32759329 |
Michael Cary1,2, Katie Podshivalova3, Cynthia Kenyon4,3.
Abstract
Identification of co-expressed sets of genes (gene modules) is used widely for grouping functionally related genes during transcriptomic data analysis. An organism-wide atlas of high-quality gene modules would provide a powerful tool for unbiased detection of biological signals from gene expression data. Here, using a method based on independent component analysis we call DEXICA, we have defined and optimized 209 modules that broadly represent transcriptional wiring of the key experimental organism C. elegans These modules represent responses to changes in the environment (e.g., starvation, exposure to xenobiotics), genes regulated by transcriptions factors (e.g., ATFS-1, DAF-16), genes specific to tissues (e.g., neurons, muscle), genes that change during development, and other complex transcriptional responses to genetic, environmental and temporal perturbations. Interrogation of these modules reveals processes that are activated in long-lived mutants in cases where traditional analyses of differentially expressed genes fail to do so. Additionally, we show that modules can inform the strength of the association between a gene and an annotation (e.g., GO term). Analysis of "module-weighted annotations" improves on several aspects of traditional annotation-enrichment tests and can aid in functional interpretation of poorly annotated genes. We provide an online interactive resource with tutorials at http://genemodules.org/, in which users can find detailed information on each module, check genes for module-weighted annotations, and use both of these to analyze their own gene expression data (generated using any platform) or gene sets of interest.Entities:
Keywords: aging; atfs-1; functional annotation of genes; gene co-expression; gene expression data analysis; hif-1; independent component analysis; microarray; mitochondrial unfolded protein response; respiration
Mesh:
Substances:
Year: 2020 PMID: 32759329 PMCID: PMC7534440 DOI: 10.1534/g3.120.401270
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
GO categories with strongest and weakest gene weights. The table shows the top 25 and the bottom 5 GO categories, ranked in decreasing order of the weight of their most strongly associated gene prior to normalization. Similar GO categories are grouped together, and the genes with the highest weights for each group are also shown. rpl- (ribosomal protein, large subunit) and rps- (ribosomal protein, small subunit) genes encode ribosomal proteins, and his- (histone) genes encode nucleosome components
| GO Term Group | GO ID | Rank | GO Term | Top Genes |
|---|---|---|---|---|
| Ribosome | GO:0005840 | 1 | ribosome | |
| GO:0003735 | 2 | structural constituent of ribosome | ||
| GO:0030529 | 3 | ribonucleoprotein complex | ||
| GO:0006412 | 4 | translation | ||
| GO:0043228 | 10 | non-membrane-bounded organelle | ||
| GO:0043232 | 11 | intracellular non-membrane- bounded organelle | ||
| GO:0032991 | 20 | macromolecular complex | ||
| Heme binding | GO:0020037 | 5 | heme binding | R05D8.9, E02C12.6, T16G1.6, |
| GO:0046906 | 6 | tetrapyrrole binding | ||
| GO:0005506 | 7 | iron ion binding | ||
| GO:0004497 | 8 | monooxygenase activity | ||
| GO:0009055 | 9 | electron carrier activity | ||
| GO:0016705 | 15 | oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen | ||
| Nucleosome | GO:0006334 | 12 | nucleosome assembly | |
| GO:0034728 | 13 | nucleosome organization | ||
| GO:0031497 | 14 | chromatin assembly | ||
| GO:0000786 | 16 | nucleosome | ||
| GO:0032993 | 17 | protein-DNA complex | ||
| GO:0006333 | 18 | chromatin assembly or disassembly | ||
| GO:0065004 | 21 | protein-DNA complex assembly | ||
| GO:0071824 | 22 | protein-DNA complex subunit organization | ||
| GO:0006323 | 23 | DNA packaging | ||
| Lipid glycosylation | GO:0030259 | 19 | lipid glycosylation | C33C12.8, |
| Cuticle | GO:0042302 | 24 | structural constituent of cuticle | |
| Oxidoreductase activity | GO:0016491 | 25 | oxidoreductase activity | B0272.4, Y75B8A.4, |
| … | … | … | … | … |
| Presynaptic membrane | GO:0042734 | 1647 | presynaptic membrane | K02E11.7, C45G9.6, |
| Signal transduction | GO:0046579 | 1648 | positive regulation of Ras protein signal transduction | R04B5.6, |
| GO:0051057 | 1649 | positive regulation of small GTPase mediated signal transduction | ||
| Calcium-dependent kinase | GO:0009931 | 1650 | calcium-dependent protein serine/threonine kinase activity | |
| GO:0010857 | 1651 | calcium-dependent protein kinase activity |
Figure 1Schematic of module prediction using DEXICA and derivation of module-weighted annotations. (a) A matrix of gene expression data, X, is decomposed using independent component analysis (ICA) into a gene module definition matrix, S, and a matrix containing the weight of each module in each microarray, A. Rows of the S matrix indicate the relative degree of inclusion of a given gene in each module. (b) Annotation enrichment within modules is calculated using matrices B and S. Known associations between genes and annotations are captured in a Boolean matrix B, where 1 indicates an association between a gene and an annotation and 0 indicates a lack of an association. S is partitioned into S, where 0 indicates a gene’s exclusion from a module, while -1 indicates a gene’s assignment to the “a” hemi-module and 1 to the “b” hemi-module (a hemi-module comprises genes that have extreme weights and the same sign in a given column of the S matrix). Enrichment of genes associated with each annotation in each hemi-module is calculated using hypergeometric tests and log(p-values) (for under-represented annotations) or -log(p-values) (for over-represented annotations) are recorded in a new matrix, E. (c) To generate module-weighted annotations, the S matrix is first transformed into a matrix, H, which indicates the weight of each gene in each hemi-module. H contains twice as many columns as S because there are two hemi-modules per module. The dot product between H and E results in the R matrix, where a row indicates the weighted association between a gene and each annotation. The color intensity in R indicates annotation weights that would result given the indicated expression values in X and the Boolean annotations in B. Color saturation in all matrices except B (Boolean) and S (trivalued) indicate relative numeric values, with least saturated colors (i.e., white) indicating highly negative values, and most saturated colors indicating highly positive values.
Figure 2Optimization of gene modules. To determine the optimal preprocessing method and the optimal number of components (gene modules) to extract from a gene expression compendium of C. elegans microarray data, we calculated the number of Gene Ontology terms (a), C. elegans tissues (b), and REACTOME pathways (c) that were significantly enriched in at least one gene module. Black points show results from a compendium produced using a preprocessing procedure used by Engreitz et al.(Engreitz ); red points show results for the best alternative preprocessing method that we tested. Black dashed lines indicate the point on the x-axis of each graph at which loess regression curves showed the greatest difference between red points and results from randomized controls (gray points). (d) The number of modules produced by DEXICA, by a different ICA-based method used by Engreitz et al.(Engreitz ), and by C. elegans gene expression topomap generated by Kim et al.(Kim ). The number and the percentage of total predicted modules that have significant enrichment for various annotations are shown. Error bars indicate s.d. between repeat runs of DEXICA or Engreitz et al. method. (e) Distribution of the number of probe sets partitioned to each of the 209 DEXICA modules. (f) Distribution of the number of different DEXICA modules to which a given probe set is partitioned. (g) Each of the 418 DEXICA hemi-modules was tested for enrichment of genes with different structural properties – presence a long 3′-UTR, transcription as part of an operon, and presence of multiple annotated splice variants.
Figure 3Performance of module-weighted annotation-based tests. (a) To assess sensitivity, Gaussian noise was added to gene expression data from 5 recent C. elegans Affymetrix experiments not included in the compendium used to train the modules. [The standard deviation of the noise distribution (μ = 0) was varied from 0.5x to 5x the standard deviation of the gene fold change distribution.] At each noise level, z-scores for each GO term (based on 100 random permutations of the gene IDs in the input data) were calculated using three different methods: KS – Kolmogorov-Smirnov test; t-test – two-sample t-test (one-sample test gave highly similar results, data not shown); projection – projection of the gene fold changes into the space defined by the module-weighted annotation matrix, R. When conducting KS and t-test, fold changes of genes assigned to each GO term were compared to fold changes of genes not assigned to that term. Spearman correlation coefficients were calculated between Z-scores at each noise level and Z-scores without any noise added. (b) To assess specificity, 100 highly dissimilar pairs of experiments (Spearman correlation, ρ, of gene fold changes near 0) were selected from a set of 188,805 pairs generated using published microarray data. For each contrast belonging to a pair, we determined Z-scores for GO annotations using the three methods and calculated the rank correlation (rho) of the absolute value of these Z-scores between pair members. The center of the box represents the median value and whiskers extend to the most extreme data point that is not further than 1.5 times the IQR from the box. (c) Gene fold changes in and mutants were projected into promoter word space using module-weighted promoter words (see Methods). The 10 points furthest from the origin are highlighted with colored circles in the figure (orange = positive SVE for , blue = negative SVE for ), and their words corresponding to these points are shown to the right of the figure. Four of the six words highlighted in orange contain full or partial matches to the canonical binding site, (A/G)CGTG (underlined).
Top scoring promoter words for transcription factor perturbation experiments. The table shows the top three most significant words (those with the most extreme z-scores), calculated using module-weighted promoter words, for each transcription factor loss-of-function microarray experiment. If a full or partial match (4 or more bases) to the canonical binding site of the factor does not occur within the top 3 ranking words, the most significant such match is also shown. Bold letters indicate matching positions to canonical binding sites
| Factor | Canonical site | Top words | Projection set | Z score | Z score rank |
|---|---|---|---|---|---|
| DAF-16 | T(G/A)TTTAC, CTTATCA* | TT | Down-regulated | −3.95 | 1 |
| GGAAG | Down-regulated | −3.69 | 2 | ||
| TTTTCTG | Down-regulated | 3.47 | 3 | ||
| HIF-1 | ACGTG | Up-regulated | 4.39 | 1 | |
| Up-regulated | 4.38 | 2 | |||
| Up-regulated | 3.71 | 3 | |||
| NHR-23 | AGGTCA | Down-regulated | −5.29 | 1 | |
| Down-regulated | −4.72 | 2 | |||
| CCTCCCCC | Down-regulated | −4.39 | 3 | ||
| HLH-30 | TCACGTGA(C/T) | CTTACTATT | Up-regulated | −4.3 | 1 |
| CGTAATCC | Up-regulated | 4.14 | 2 | ||
| CTTTTTTCT | Down-regulated | 4.07 | 3 | ||
| Up-regulated | −3.09 | 20 | |||
| LIN-14 | GAAC | CCTACCTACCTA | Down-regulated | 4.35 | 1 |
| GCGCGTCAAATA | Up-regulated | −3.95 | 2 | ||
| GCCGCGCACCCC | Down-regulated | −3.78 | 3 | ||
| G | Down-regulated | −2.53 | 109 | ||
| DAF-12 | AGTTCA | CCCCAC | Down-regulated | −4.41 | 1 |
| GCTC | Up-regulated | 4.28 | 2 | ||
| CCCCGCC | Down-regulated | −4 | 3 | ||
| Up-regulated | 3.51 | 11 |
Figure 4Analysis of module activity in respiration mutants reveals activity of the key mitochondrial unfolded protein response (UPRmt) transcription factor, ATFS-1. (a) Conventional annotation analyses of genes that are differentially expressed in () respiration mutants. GO term-overrepresentation was determined using GOrilla(Eden ) and KEGG pathway-overrepresentation was determined using DAVID(Huang ). WormCat(Holdorf ) is an online analysis tool that queries the GO database but works independently of GO and addresses some of its limitations. Significant gene expression changes were determined using the limma R package and FDR-adjusted p-value of <0.05. (b) Knockdown of a protein quality-control protease is the top non-respiration microarray experiment in the compendium that induces activity of the module m47. Module activity is expressed as signed variance explained (SVE). (c) Comparison of () mutants and animals that express a constitutively active form of , a key transcription factor required for the UPRmt. The gene expression scatter plot shows all detected genes (22,625; black) overlaid with genes that belong to m47 (367; pink). r, Pearson correlation coefficients. (d) Normalized changes in expression of genes that belong to m47 (367 genes) in response to inhibition of respiration ( mutation, column 1), activation of UPRmt by disrupting mitochondrial protein quality control (column 2), constitutive activation of under normal conditions (column 3), and activation of UPRmt in the absence of (column 4). (e) Genes up-regulated by ATFS-1 during UPRmt, as determined in Nargund et al. 2012 (Nargund ), are strongly enriched in m47. Enrichment was calculated using hypergeometric statistics.