Literature DB >> 19091051

Identification of functional modules based on transcriptional regulation structure.

Etienne Birmelé1, Mohamed Elati, Céline Rouveirol, Christophe Ambroise.   

Abstract

BACKGROUND: Identifying gene functional modules is an important step towards elucidating gene functions at a global scale. Clustering algorithms mostly rely on co-expression of genes, that is group together genes having similar expression profiles.
RESULTS: We propose to cluster genes by co-regulation rather than by co-expression. We therefore present an inference algorithm for detecting co-regulated groups from gene expression data and introduce a method to cluster genes given that inferred regulatory structure. Finally, we propose to validate the clustering through a score based on the GO enrichment of the obtained groups of genes.
CONCLUSION: We evaluate the methods on the stress response of S. Cerevisiae data and obtain better scores than clustering obtained directly from gene expression.

Entities:  

Year:  2008        PMID: 19091051      PMCID: PMC2654972          DOI: 10.1186/1753-6561-2-s4-s4

Source DB:  PubMed          Journal:  BMC Proc        ISSN: 1753-6561


Background

An important step in analyzing gene functions is to cluster genes according to their expression patterns. Such clusters can then be analyzed in several ways, for example by assigning unannotated genes to the majority function of each cluster's genes (see [1] for a review). However, this approach has several limitations. On the one hand, genes of similar expression patterns may not necessarily have the same or similar functions; on the other hand, genes with related functions may not show close correlation in their expression patterns. For example, a transcription factor can activate some genes and repress others in the same pathway. The principal assumption of this paper is that unsupervised clustering of genes on the basis of similar regulators (activators/inhibitors) should assemble functional co-regulated groups of genes. To compute a similarity measure between genes as a function of inferred regulators of these genes, we use the output of a data mining algorithm called LICORN [2], that infers cooperative regulation relations from expression data only. The resulting similarity matrix between genes is considered as the adjacency matrix of a weighted graph. Clustering is then performed to find functional modules of genes in the network. To objectively evaluate clustering, we use Gene Ontology to determine if the obtained clusters can be associated with terms of the Biological Process ontology. The strength of such an association is given by a p-value from an Hypergeometric test. We compare different clusterings by calculating a score based on the p-values which becomes greater when the significant p-values are smaller and more numerous. In section Methods, we introduce our model of gene regulation and briefly describe a data mining algorithm for inferring large-scale cooperative gene regulation. We then propose a similarity measure for the genes based on the inferred regulator sets and define the score of a clustering. Finally, in section Results and discussion, we evaluate our system on a yeast data set.

Methods

Cooperative regulation networks

Let us denote by the set of genes with a known or putative regulation activity and as the set of genes without such an activity. The input of the mining method is a discretised expression matrix for genes of . Each expression value can take the value -1 (under-expressed), 0 (normal), or 1 (over-expressed). A gene regulatory network (GRN) associated with a target gene g is a pair (A, I), where is a co-activator set, and is a co-inhibitor set. The set of GRNs for all target genes can also be seen as a bipartite graph where the top layer contains regulators, the bottom layer contains target genes, and edges code for a regulatory interaction between regulators and target genes, each edge being labelled with a regulatory mode (i.e., activator or inhibitor). The regulation relations we are interested in are combinatorial: each target gene has a number of activators and/or inhibitors. Activators on one side and inhibitors on the other side are aggregated in our model through an extended logical AND, i.e., a regulator set S (activator or repressor) is over-expressed (resp. under-expressed) if and only if all the regulators in S are over-expressed (resp. under-expressed). Finally, we describe in Figure 1 a discrete function called Regulatory Program RP, which, given the combined states of activators A and inhibitors I of g in a sample s computes (A, I), the estimated state of g in s. The main features of our regulation model are therefore the explicit representation of activation and repression relationships for a given target gene, and the representation of co-operative transcriptional regulation.
Figure 1

The regulatory program. Definition of the regulatory program RP, which can be interpreted as follows: i) If GRN contains co-activators only, (A, I) corresponds to the aggregated status of these co-activators. ii) If GRN contains co-inhibitors only, (A, I) is the inverse of the aggregated status of these co-inhibitors. iii) Otherwise, (A, I) depends on a combination of the statuses of co-activators and co-inhibitors, as described by the matrix on the right. For example, (A, I) = 1 when the co-activators are over-expressed and the co-inhibitors are not.

The regulatory program. Definition of the regulatory program RP, which can be interpreted as follows: i) If GRN contains co-activators only, (A, I) corresponds to the aggregated status of these co-activators. ii) If GRN contains co-inhibitors only, (A, I) is the inverse of the aggregated status of these co-inhibitors. iii) Otherwise, (A, I) depends on a combination of the statuses of co-activators and co-inhibitors, as described by the matrix on the right. For example, (A, I) = 1 when the co-activators are over-expressed and the co-inhibitors are not.

Learning algorithm

We have recently proposed [2] an original, scalable technique called LICORN for deriving co-operative regulations, in which many co-regulators act together to activate or repress a target gene. LICORN uses an original heuristic approach to accelerate the search for an appropriate structure for the regulation network. It first computes frequent co-regulator sets, i.e., regulator sets that frequently occur together as over (1) or under (-1)-expressed in the discretised expression matrix. This is done by using an extension of the Apriori algorithm [3] to handle both 1 and -1 supports (The x-support of a co-regulator C in the three-valued expression matrix is the set of samples that include all the regulators of C with the state x). From this representation, a limited subset of candidate co-regulator sets is then associated with each gene. The learning algorithm looks for each gene for regulator sets which have a high "overlap" with the target gene. Intuitively, the overlap constraint checks the size of the intersection between supports of the target gene and a given candidate co-regulator set. A candidate activator set for a target gene g is frequently over-expressed when g is over-expressed or frequently under-expressed when g is under-expressed. On the opposite, a candidate repressor set for a target gene g is frequently over-expressed when g is under-expressed and vice-versa. This search can be efficiently performed because of the property of anti-monotonicity of the overlap constraint with respect to set inclusion. Then, once a limited number of candidate activator and inhibitor sets have been obtained, exhaustive search for the best gene regulatory network can be performed. Finally, a permutation-based procedure is used for selecting statistically significant regulation relations. We have shown in [2] that the co-operative regulation patterns inferred by LICORN cannot be identified by clustering or pairwise methods, and are only partly revealed by constrained Bayesian or decision tree-based techniques, such as those used in previous studies [4,5].

Identification of functional co-regulation modules

Partial overlap of the regulator sets for a set of target genes can be used as an alternative measurement of the distance between genes.

Computation of the co-regulation matrix

We design the co-regulation matrix by using a similarity measure defined as follows: let λ ∈ [0, 1] and (g1, g2) be two genes. The similarity between g1 and g2 is defined by where |A| and |I| are respectively the number of activators and inhibitors of both g1 and g2, |AI| is the number of regulators which activate one gene and inhibit the other and |TF| is the number of transcription factors regulating at least one of the genes. This similarity considers two genes as being far appart (ϕ(g1, g2) = 0) if they do not share any regulators. Two genes are considered most similar if their set of activators and inhibitors are exactly the same (ϕ(g1, g2) = 1). In intermediate situations, λ represents the weight given to common regulators which have opposite effects.

Clustering

To cluster genes from the similarity matrix, we use the MCL algorithm [6,7]. That algorithm, based on the fluxes in a graph, is well suited to weighted graphs and does not require any prior knowledge about the number of clusters. Moreover, it does not require any initial conditions and is therefore reproducible. The inflation parameter of the algorithm is fixed to 1.8, as suggested in [8].

Mapping to GO-terms

To assess the functional significance of obtained clusters, and suggest putative functions for genes with unknown functions, we calculate the enrichment of gene ontology (GO) [9]. To determine the over-represented GO terms in each cluster, we apply the R package GOstats [10] with a p-value cut-off of 5% and the biological process ontology. For each cluster C, we obtain a set Tof GO terms over-represented in C with a rate of 5% and a set of associated p-values {p, t ∈ T}. We define the score of the clustering by where Cis the set of genes of C associated with the GO term t. Parameters cand callow us to avoid clusters that are too small, which don't have a biological meaning, as well as too big ones, which don't have any functional unity.

Results and discussion

As a proof of concept, we used gene expression data sets for S. Cerevisiae. The Gasch data set [11] measures the response of yeast to 173 stress conditions for 6152 genes. We used a set of 237 known and putative transcription factors. We applied LICORN and retained only those GRNs (gene regulatory networks) identified as significant with a 5% FDR level (see [2] for details). 2041 GRNs (of 5703 GRNs) were identified as significant. The structural organization of the learned GRNs has been shown to be consistent with recent advances [12] concerning the characterization of topological transcriptional network features in yeast and provide the first evidence of the relevance of inferred GRNs. In order to choose the parameter λ for the similarity matrix, we computed the matrices and the associated clusterings for several values of λ and compared their scores with parameters c= 5 and c= 200. Figure 2 shows that the best one is obtained for λ = 0.1.
Figure 2

Score and number of p-values for λ varying from 0 to 1.

Score and number of p-values for λ varying from 0 to 1. For λ = 0.1, the clustering gives 30 clusters among which one is too big to be considered (407 genes) and 3 have less than 5 genes. Table 1 gives the best GO term association for the 10 best of them when ranked according to their best p-value. The biological evaluation of these clusters is ongoing.
Table 1

GO-enrichment of the clusters obtained for S. Cerevisiae.

Cluster IdGO BP Idp-valueCluster sizeBiological process
600226131.28e – 2380ribonucleoprotein complex biogenesis and assembly

1500061193.26e – 1318oxidative phosphorylation

900422543.15e – 1155ribosome biogenesis and assembly

400060814.89e – 07142aldehyde metabolic process

200007465.89e – 07155conjugation

700070019.48e – 0768chromosome organization and biogenesis (sensu Eukaryota)

1300069744.10e – 0530response to DNA damage stimulus

2700086522.02e – 046amino acid biosynthetic process

1000469073.50e – 0452intracellular transport

800197547.84e – 0466one-carbon compound catabolic process

Table of the ten best clusters among the 59 ones obtained for λ = 0.3 ranked by their best association with a GO term. For each of them, the best associated GO term and the corresponding p-value are given, as well as the size of the cluster and the biological process associated to the GO term.

GO-enrichment of the clusters obtained for S. Cerevisiae. Table of the ten best clusters among the 59 ones obtained for λ = 0.3 ranked by their best association with a GO term. For each of them, the best associated GO term and the corresponding p-value are given, as well as the size of the cluster and the biological process associated to the GO term. The cluster number 15 that appears on the second line of Table 1 is in fact associated to 32 GO terms with a p-value lower than 1e – 07, most of those terms being related to phosphorylation or triphosphate metabolic process. Moreover, five genes of that cluster belong to the 167 genes having no associated GO term, namely the genes YLR296W, YDR215C, YBL044W, YIR040C and YPR027C. All of them appear in the Entrez gene database but without known functions. We have finally validated our method by comparing clustering performances based on other similarity matrices. We therefore have computed from the original expression data matrices of euclidian distance, partial correlation [13] and mutual information [14]. To compare clustering results with the same number of clusters, we used the hierarchical clustering method AGNES [15] to cluster the genes in 20, 30, 40, and 50 groups. Figure 3 shows the scores for those three methods as well as for ours with λ = 0.1. It clearly shows that inferring the regulatory network from LICORN preprocessing improves the score of the clustering and provide more biologically relevant clusters.
Figure 3

Comparison of the clustering based on LICORN with existing methods. Figure of the scores obtained for hierarchical clustering into 20, 30, 40 and 50 clusters. The red circles are the scores obtained for the similarity matrix given by LICORN and λ = 0.1. The similarity measures which are compared to are euclidian distance, partial correlation and mutual information.

Comparison of the clustering based on LICORN with existing methods. Figure of the scores obtained for hierarchical clustering into 20, 30, 40 and 50 clusters. The red circles are the scores obtained for the similarity matrix given by LICORN and λ = 0.1. The similarity measures which are compared to are euclidian distance, partial correlation and mutual information.

Conclusion

The problem of discovering functional modules from expression data is both biologically important and computationally challenging. From a biological perspective, identifying members of functional modules is the first step toward understanding the regulatory network of the cell. We provide here an alternative way for constructing gene modules: genes are clustered in the same module if they share many regulators, as they have been inferred by LICORN from gene expression data. We expect this way of clustering will discover modules that are out the scope of classical co-expression clustering techniques. From a computational perspective, one of the key challenges is dealing with over-fitting as the number of data samples is so small.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

ME and CR designed the inference and participated in experimentations and drafting the manuscript. EB and CA proposed the clustering and evaluation methods.
  11 in total

1.  An efficient algorithm for large-scale detection of protein families.

Authors:  A J Enright; S Van Dongen; C A Ouzounis
Journal:  Nucleic Acids Res       Date:  2002-04-01       Impact factor: 16.971

2.  Minreg: inferring an active regulator set.

Authors:  Dana Pe'er; Aviv Regev; Amos Tanay
Journal:  Bioinformatics       Date:  2002       Impact factor: 6.937

3.  Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data.

Authors:  Eran Segal; Michael Shapira; Aviv Regev; Dana Pe'er; David Botstein; Daphne Koller; Nir Friedman
Journal:  Nat Genet       Date:  2003-06       Impact factor: 38.330

4.  Using GOstats to test gene lists for GO term association.

Authors:  S Falcon; R Gentleman
Journal:  Bioinformatics       Date:  2006-11-10       Impact factor: 6.937

5.  Information-theoretic inference of large transcriptional regulatory networks.

Authors:  Patrick E Meyer; Kevin Kontos; Frederic Lafitte; Gianluca Bontempi
Journal:  EURASIP J Bioinform Syst Biol       Date:  2007

6.  LICORN: learning cooperative regulation networks from gene expression data.

Authors:  Mohamed Elati; Pierre Neuvial; Monique Bolotin-Fukuhara; Emmanuel Barillot; François Radvanyi; Céline Rouveirol
Journal:  Bioinformatics       Date:  2007-08-24       Impact factor: 6.937

7.  Genomic expression programs in the response of yeast cells to environmental changes.

Authors:  A P Gasch; P T Spellman; C M Kao; O Carmel-Harel; M B Eisen; G Storz; D Botstein; P O Brown
Journal:  Mol Biol Cell       Date:  2000-12       Impact factor: 4.138

8.  SGD: Saccharomyces Genome Database.

Authors:  J M Cherry; C Adler; C Ball; S A Chervitz; S S Dwight; E T Hester; Y Jia; G Juvik; T Roe; M Schroeder; S Weng; D Botstein
Journal:  Nucleic Acids Res       Date:  1998-01-01       Impact factor: 16.971

9.  Evaluation of clustering algorithms for protein-protein interaction networks.

Authors:  Sylvain Brohée; Jacques van Helden
Journal:  BMC Bioinformatics       Date:  2006-11-06       Impact factor: 3.169

Review 10.  Microarray data analysis: from hypotheses to conclusions using gene expression data.

Authors:  Nicola J Armstrong; Mark A van de Wiel
Journal:  Cell Oncol       Date:  2004       Impact factor: 6.730

View more
  3 in total

1.  Mining functional gene modules linked with rheumatoid arthritis using a SNP-SNP network.

Authors:  Lin Hua; Hui Lin; Dongguo Li; Lin Li; Zhicheng Liu
Journal:  Genomics Proteomics Bioinformatics       Date:  2012-02       Impact factor: 7.691

2.  Latent network-based representations for large-scale gene expression data analysis.

Authors:  Wajdi Dhifli; Julia Puig; Aurélien Dispot; Mohamed Elati
Journal:  BMC Bioinformatics       Date:  2019-02-04       Impact factor: 3.169

3.  Machine learning in systems biology.

Authors:  Florence d'Alché-Buc; Louis Wehenkel
Journal:  BMC Proc       Date:  2008-12-17
  3 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.