PathCluster: a framework for gene set-based hierarchical clustering.

Tae-Min Kim, Seon-Hee Yim, Yong-Bok Jeong, Yu-Chae Jung, Yeun-Jun Chung.

Abstract

MOTIVATION: Gene clustering and gene set-based functional analysis are widely used for the analysis of expression profiles. The development of a comprehensive method jointly combining the two methods would allow for greater biological insights.
RESULTS: We developed a software package, PathCluster, for gene set-based clustering via an agglomerative hierarchical clustering algorithm. The distances between predefined gene sets are illustrated in a dendrogram in which the relationships between gene sets can be visually assessed. Valuable biological insights can be obtained according to the type of gene sets, e.g. coordinated action of molecular functions (functional gene sets) and putative motif synergy (promoter gene sets) in a biological process. The combined use of gene sets further enables the interrogation of different biological themes and their putative relationships, such as function-versus-regulatory motif or drug-versus-function. PathCluster can also be used for knowledge-based sample partitioning or class categorization for clinical purposes. With extended applicability, PathCluster will facilitate the gleaning of meaningful biological insights and testable hypotheses in the context of given expression profiles.
AVAILABILITY: PathCluster executable files can be freely downloaded at http://www.systemsbiology.co.kr/PathCluster/.

Year: 2008 PMID: 18628289 PMCID: PMC2519159 DOI: 10.1093/bioinformatics/btn357

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 BACKGROUND

The objective of gene clustering is to group genes that have similar expression patterns or that are expressed in a coordinated manner (Eisen et al., 1998). Subsequent functional enrichment analysis uses biological knowledge to provide clues as to which molecular functions or annotation categories are associated with individual gene clusters. Despite its potential utility, the treatment of gene clusters as exclusive units raises a number of practical concerns in subsequent functional analysis. For example, the list of candidate functionalities grows as the number of clusters increases, which makes it difficult to compare the results between clusters or to establish appropriate significance thresholds under multiple testing adjustments. Also, the performance of enrichment analysis depends heavily on the prior clustering result, which varies considerably with the clustering method and parameter settings. More importantly, the potential relationships between gene sets or clusters are difficult to identify in conventional settings. The integration of a priori knowledge of gene set information in clustering may be an appropriate solution to these problems (Rapaport et al., 2007); however, there are currently no user-friendly tools that implement this alternative approach. Thus, we developed a software package, PathCluster, which utilizes an agglomerative hierarchical clustering algorithm for gene set-based clustering. For a given expression profile, a distance matrix is constructed between gene sets and illustrated as a dendrogram. The relationships between gene sets can be visually assessed in the results, thereby facilitating the construction of an association map between diverse annotation categories. The related algorithms are implemented in a freely available software package.
Major functionalities of PathCluster are summarized as follows:
(i) Gene set-based hierarchical clustering and visualization of the results with a user-friendly graphical interface.
(ii) Identification of potential relationships between gene sets, such as putative interactions between molecular functions or synergism between regulatory motif sequences.
(iii) Discovery of previously unknown links between different annotation categories in terms of gene sets, such as function-versus-regulatory motif or drug-versus-function.
(iv) Function-based class categorization of disease samples.

2 HIERARCHICAL CLUSTERING OF GENE SETS

Two strategies can be employed to determine the expression similarities or distances between gene sets. First, individual gene sets can be scored by the mean expression of their member genes or by enrichment scores derived from the non-parametric (GSEA) or parametric (PAGE) versions of gene set enrichment algorithms (Cheadle et al., 2007). The matrix of gene set scores with respect to the samples can then be used to calculate the gene set distances and to perform hierarchical clustering. Alternatively, the distance between two gene sets can be calculated directly as the mean correlation level of all possible gene pairs, each of which represents one possible gene-to-gene match between the corresponding gene sets. When dealing with large gene sets, or when the genes shared by two gene sets are of particular interest (especially in the case of promoter gene sets), the mean correlation can also be calculated only for the gene pairs among the genes that overlap between the gene sets. Detailed descriptions of the metrics utilized and examples are available in the online manual at the PathCluster homepage (http://www.systemsbiology.co.kr/PathCluster/Manual.pdf).
PathCluster provides default gene sets covering four kinds of gene annotation categories: molecular functions, association with regulatory motifs corresponding to transcription factors or miRNAs, and drug treatment-related expression changes.
In addition, gene sets from public databases such as MSigDB or user-defined custom query sets can be readily included in the gene set reference, in order to ensure the versatility of the method.
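
The two distance strategies described above can be sketched in a few lines of Python. This is a minimal illustration and not the PathCluster implementation: the function names, the toy expression values, the use of 1 minus the Pearson correlation as the distance and the exclusion of self-pairs for shared genes are all assumptions made for the example. The metrics actually used by PathCluster are described in its online manual.

```python
from itertools import combinations, product
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (norm_x * norm_y)


def score_profile(expr, gene_set):
    """Strategy 1 score: mean expression of the member genes in each sample."""
    members = [expr[g] for g in gene_set if g in expr]
    return [sum(column) / len(column) for column in zip(*members)]


def score_distance(expr, set_a, set_b):
    """Strategy 1: distance between the per-sample score profiles.

    Assumption: distance = 1 - Pearson correlation of the two profiles.
    """
    return 1.0 - pearson(score_profile(expr, set_a), score_profile(expr, set_b))


def pairwise_distance(expr, set_a, set_b, overlap_only=False):
    """Strategy 2: 1 - mean correlation over gene pairs.

    By default every gene-to-gene match between the two sets is used
    (a gene is never paired with itself; this is an assumption).
    With overlap_only=True, only pairs among the shared genes are used.
    """
    if overlap_only:
        shared = sorted(set(set_a) & set(set_b))
        pairs = list(combinations(shared, 2))
    else:
        pairs = [(g, h) for g, h in product(set_a, set_b) if g != h]
    # Ignore genes that are absent from the expression profile.
    pairs = [(g, h) for g, h in pairs if g in expr and h in expr]
    if not pairs:
        raise ValueError("no gene pairs available for these gene sets")
    mean_corr = sum(pearson(expr[g], expr[h]) for g, h in pairs) / len(pairs)
    return 1.0 - mean_corr


# Toy expression profile (hypothetical values): gene -> expression in 4 samples.
expr = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [4, 3, 2, 1],
    "g4": [1, 3, 2, 4],
}
print(score_distance(expr, ["g1", "g2"], ["g3"]))
print(pairwise_distance(expr, ["g1", "g2", "g4"], ["g2", "g4", "g3"], overlap_only=True))
```

Either function yields one value per pair of gene sets. Collecting these values for all pairs gives the distance matrix on which agglomerative hierarchical clustering is run.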

3 BIOLOGICAL APPLICATION

3.1 Associated molecular functions or regulatory motif sequences in a biological process

Using functional gene sets, PathCluster can identify putative associations between molecular functions, thereby providing clues to the coordinated action of specific functions in a given expression profile. Similarly, in the case of promoter gene sets, PathCluster can identify putative motif synergy between cis-regulatory motifs or the corresponding transcription factors, delineating the regulatory crosstalk in a transcriptional regulatory network. Moreover, combining gene sets from different annotation categories can reveal previously unknown links. In erythropoiesis-related expression profiles, a number of functionalities related to immunity and the major histocompatibility complex are observed in one cluster (Fig. 1A). Within this cluster, signal-related functionalities (Ras protein signal transduction and MAPKKK cascade) as well as sequence motifs corresponding to the transcription factors GATA-1 and c-Rel (a component of NF-κB) were also observed, which is indicative of their potential interactions during erythropoiesis. This strategy can also be applied to other combinations of gene sets to reveal novel links between different biological themes, such as function versus drug and function versus miRNA.

Fig. 1.

Screenshots of PathCluster. (A) An example of analysis using publicly available expression profiles representing human erythroid differentiation (Keller et al., 2006). The dendrogram shows a clustering of immune-related functional annotations as well as signal-related functionalities and relevant sequence motifs. (B) The function-based classification of human lung cancer samples (Bhattacharjee et al., 2001). Four histological subtypes of lung cancer samples (normal, NL; adenocarcinoma, AD; squamous cell carcinoma, SQ; small cell carcinoma, SMCL) are distinguished at the gene set-based expression level.

3.2 Function-based sample classification

Knowledge-driven or function-based class categorization has recently emerged as a highly challenging subject. This strategy has already been employed to identify functional relationships in a large cancer-derived expression compendium and to elucidate drug-signature relationships for clinical benefit (Wong et al., 2008). With a user-friendly interface and an extended reference of gene sets, PathCluster provides a platform for the classification or molecular diagnosis of clinical samples, and also allows the interrogation of diverse biological knowledge in terms of gene sets. Figure 1B shows that function-based classification can successfully distinguish the three lung cancer subtypes and normal tissue. In this cluster, eight cancer-related functions are specifically up-regulated in small cell lung cancer and squamous cell carcinoma of the lung.
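
As a rough illustration of function-based sample classification, the sketch below describes each sample by its gene set scores and then groups the samples by agglomerative hierarchical clustering. It is not the PathCluster code: the mean-expression score, the Euclidean distance, average linkage, the cut at a fixed number of clusters, the gene set names and the toy values are all assumptions chosen to keep the example short.

```python
from math import sqrt


def sample_profiles(expr, gene_sets):
    """Return one vector per sample: the mean expression of each gene set.

    expr maps gene -> list of expression values (one per sample);
    gene_sets maps gene set name -> list of member genes present in expr.
    """
    n_samples = len(next(iter(expr.values())))
    profiles = []
    for s in range(n_samples):
        profiles.append([
            sum(expr[g][s] for g in genes) / len(genes)
            for genes in gene_sets.values()
        ])
    return profiles


def euclidean(u, v):
    """Euclidean distance between two score vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(items, dist, n_clusters):
    """Agglomerative clustering: merge the two closest clusters until
    n_clusters remain. Returns clusters as lists of item indices."""
    clusters = [[i] for i in range(len(items))]

    def linkage(c1, c2):
        # Average distance over all cross-cluster item pairs.
        total = sum(dist(items[i], items[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        a, b = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


# Toy data (hypothetical): four samples, two made-up gene sets.
expr = {
    "g1": [9, 8, 1, 2],
    "g2": [8, 9, 2, 1],
    "g3": [1, 2, 9, 8],
    "g4": [2, 1, 8, 9],
}
gene_sets = {"proliferation": ["g1", "g2"], "immune": ["g3", "g4"]}

profiles = sample_profiles(expr, gene_sets)
print(average_linkage(profiles, euclidean, n_clusters=2))
```

The same merging procedure applies when the items are gene sets rather than samples. Recording the order and height of the merges, instead of stopping at a fixed number of clusters, gives the information needed to draw a dendrogram.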

REFERENCES

1. Transcriptional regulatory network analysis of developing human erythroid progenitors reveals patterns of coregulation and potential transcriptional regulators.

Authors: M A Keller; S Addya; R Vadigepalli; B Banini; K Delgrosso; H Huang; S Surrey
Journal: Physiol Genomics Date: 2006-08-29 Impact factor: 3.107

2. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses.

Authors: A Bhattacharjee; W G Richards; J Staunton; C Li; S Monti; P Vasa; C Ladd; J Beheshti; R Bueno; M Gillette; M Loda; G Weber; E J Mark; E S Lander; W Wong; B E Johnson; T R Golub; D J Sugarbaker; M Meyerson
Journal: Proc Natl Acad Sci U S A Date: 2001-11-13 Impact factor: 11.205

3. Cluster analysis and display of genome-wide expression patterns.

Authors: M B Eisen; P T Spellman; P O Brown; D Botstein
Journal: Proc Natl Acad Sci U S A Date: 1998-12-08 Impact factor: 11.205

4. Revealing targeted therapy for human cancer by gene module maps.

Authors: David J Wong; Dimitry S A Nuyten; Aviv Regev; Meihong Lin; Adam S Adler; Eran Segal; Marc J van de Vijver; Howard Y Chang
Journal: Cancer Res Date: 2008-01-15 Impact factor: 12.701

5. Classification of microarray data using gene networks.

Authors: Franck Rapaport; Andrei Zinovyev; Marie Dutreix; Emmanuel Barillot; Jean-Philippe Vert
Journal: BMC Bioinformatics Date: 2007-02-01 Impact factor: 3.169

6. GSMA: Gene Set Matrix Analysis, An Automated Method for Rapid Hypothesis Testing of Gene Expression Data.

Authors: Chris Cheadle; Tonya Watkins; Jinshui Fan; Marc A Williams; Steven Georas; John Hall; Antony Rosen; Kathleen C Barnes
Journal: Bioinform Biol Insights Date: 2009-11-24
