
Association rule based similarity measures for the clustering of gene expression data.

Abstract

In life threatening diseases, such as cancer, where the effective diagnosis includes annotation, early detection, distinction, and prediction, data mining and statistical approaches offer the promise for precise, accurate, and functionally robust analysis of gene expression data. The computational extraction of derived patterns from microarray gene expression is a non-trivial task that involves sophisticated algorithm design and analysis for specific domain discovery. In this paper, we have proposed a formal approach for feature extraction by first applying feature selection heuristics based on the statistical impurity measures, the Gini Index, Max Minority, and the Twoing Rule and obtaining the top 100-400 genes. We then analyze the associative dependencies between the genes and assign weights to the genes based on their degree of participation in the rules. Consequently, we present a weighted Jaccard and vector cosine similarity measure to compute the similarity between the discovered rules. Finally, we group the rules by applying hierarchical clustering. To demonstrate the usability and efficiency of the concept of our technique, we applied it to three publicly available, multiclass cancer gene expression datasets and performed a biomedical literature search to support the effectiveness of our results.


Keywords: Microarray gene expression; association rules; clustering; similarity measure

Year: 2010 PMID: 21603179 PMCID: PMC3096052 DOI: 10.2174/1874431101004010063

Source DB: PubMed Journal: Open Med Inform J ISSN: 1874-4311

INTRODUCTION

Cancer is the second leading cause of death in the United States. According to a report by the American Cancer Society, 23.1% of the deaths in 2006 were caused by cancer. Early detection followed by planning and treatment can significantly reduce the suffering from this disease and lower the healthcare burden. Molecular diagnosis of cancer has the potential to provide personalized healthcare delivery through efficient and accurate computational means with high degrees of specificity and sensitivity. Analysis of microarray gene expression data for cancer classification can yield information about the cellular mechanisms and regulatory functions of genes, the functions of genes and proteins, and the structures of gene networks and pathways, and can relate the risk of being affected by cancer to gene expression profiles [1, 2]. Microarray technology has made it possible to monitor the expression of thousands of genes simultaneously across different tissue types and treatments, or to track changes in the expression profile over a period of time. Consequently, this technology has allowed researchers to obtain a “global” view of the cell for the first time [3]. However, the exponentially growing microarray data sets present an overarching challenge for computational scientists seeking to contribute to an understanding of biologically significant cellular mechanisms. With its thousands of uncharacterized variables, microarray data analysis presents one of the most daunting challenges facing bioinformatics. The computational complexity of analyzing microarray data is further increased because a large number of genes can correspond to different time sequences or tissue types, giving a dimensionality that is several orders of magnitude larger than the number of evaluated samples. Further, genes function in a complex, interactive manner, and, hence, the challenge is to narrow down the selection of gene markers by discriminating them from the “house-keeping” genes.
The challenge posed in this area is to identify the biologically significant sets of correlated, co-regulated genes that share similar patterns and functions. The ultimate goal is to rely on the derived knowledge and utilize it for the drug discovery process, including biomarker identification and tracking. Data mining offers the promise of precise, accurate learning and discovery mechanisms in such complex data. One approach to narrowing the search for a gene marker is to select a set of features (discriminatory genes) based on a statistical or machine learning measure that can distinguish between types of samples according to their gene expression values. Among the data mining methodologies, unsupervised classification (clustering) has emerged as one of the major methods for understanding biological processes, providing insight into the activity of genes that vary during these processes and their effect on disease states and cellular environments [4, 5]. Clustering is performed on the genes or samples to identify clusters of genes that have similar expression patterns or clusters of samples that have similar expression profiles, which can assist in providing insight into therapeutic and pathogenic studies [2, 5].

Related Research

Clustering algorithms such as k-means [6] and hierarchical clustering [7] group genes or conditions in clusters that exhibit “functionally similar” behavior. Tamayo et al. developed GENECLUSTER [8], which uses self-organizing maps to organize genes biologically. For a review of cluster algorithms for gene expression, we refer to [9]. The class of clustering algorithms establishes clusters of correlated genes under certain conditions, limiting their ability by not providing information about embodied associative isomorphic relationships between genes and gene products. The previous results published in [10-16] have shown the association rule discovery (ARD) based approach can overcome this limitation by extracting associations among the subsets of genes and providing insight into how genes collectively react under certain conditions. However, the number of associations generated can be large for a gene expression dataset that contains thousands of genes. Further, applying an ARD algorithm also yields redundant rules, only a few of which actually represent important biological relationships. Hence, the challenge is to explore and sift through the thousands of rules to find those that are “meaningful”. The pruning and grouping of the rules can eliminate the redundancy, while providing insight into some important biological associations between the genes. The pruning and grouping of association rules have been studied in the past. Numerous and irrelevant rules have been generated by traditional approaches of association rule mining, many of which are redundant, which further complicates their interpretation [17, 18]. Han et al. [19] stated that the challenge of mining association rules is not based on the rules discovered under certain constraints but on the discovery of a compact and high quality set of rules. Toivonen et al. [20] introduced the concept of association rule cover for the pruning of association rules. 
A cover is defined as a “subset of the discovered associations that can cover the database”. The number of rules in a cover can be small, and hence a greedy algorithmic approach is proposed to find a good cover and prune the remaining rules. Liu et al. [21] employ the standard χ2 test to prune irrelevant rules and use the concept of direction setting rules to summarize the patterns. Srikant et al. [22] and Ng et al. [23] used the constraints provided by the user to limit the number of rules that were generated. In other literature, different measures have been proposed to assess the interestingness of a rule. The rule template method from [22, 23] selects only those rules that match the template. Finally, Liu et al. [24] proposed a method based on statistics and probability to obtain a condensed set of rules by removing redundant rules. Lent et al. [25] first proposed the clustering of association rules by developing a geometric-based algorithm that clusters association rules in two-dimensional space. However, this approach is restricted to rules that have two fixed attributes in their antecedents. Berrado et al. [26] introduced SCAR (Supervised Clustering with Association Rules), an algorithm for clustering high dimensional categorical data. SCAR uses association rules to identify the similarity between objects and then groups them into clusters. Zaki et al. [27] proposed an itemset clustering technique that clusters frequent itemsets to approximate maximal frequent itemsets. Quan et al. [28] proposed a technique for mining conceptual association rules using Formal Concept Analysis (FCA). Since FCA suffers computationally when used with huge datasets, a distance based similarity metric is used, and data clustering is performed.
In this paper, we propose a formal approach for feature extraction by first applying feature selection heuristics based on the statistical impurity measures, the Gini Index, Max Minority [29], and the Twoing Rule [30] to obtain the top 100-400 genes. We then analyze associative dependencies between the genes and assign weights to the genes based on their degree of participation in the rules. Consequently, we present a weighted similarity measure based on the Jaccard [31] and vector cosine [32] measures to compute the similarity between the association rules. Finally, we group the rules by applying hierarchical clustering. To demonstrate the usability and efficiency of the concept of our technique, we apply it to three publicly available, cancer gene expression datasets and perform a biomedical literature search to support the efficiency of our method. The rest of the paper is organized as follows. In Section II, we describe the methodology to find frequent patterns, assign weights, and cluster them by using two similarity measures. In Section III, we present the results of our experiments by weighted Jaccard and Cosine similarity measures and outline the featured genes by performing an online biomedical literature search. In Section IV, we present the conclusions of our work.

METHODS

Here, we present a novel computational framework for feature extraction and rule grouping based on weighted similarity measures. The overall methodology is illustrated in Fig. (). The framework consists of the following major computational steps: (1) data preprocessing (standardization and normalization), (2) feature selection (three statistical measures for gene ranking and selection), (3) association rule mining on the selected features to obtain weights for the frequently occurring genes, (4) redundant association rule pruning, (5) association rule clustering based on weighted variants of the Jaccard and cosine similarity measures, and (6) an online biomedical literature search to report the functions/mechanisms of the featured genes.

Datasets

The experiments are carried out on three well-known gene expression datasets, and their characteristics are described in Table 1.

ALL:

The ALL dataset [33] covers six subtypes of acute lymphoblastic leukemia: BCR (15), E2A (27), Hyperdip (64), MLL (20), T (43), and TEL (79); the number of samples in each class is shown in parentheses. The dataset is available at http://www.stjuderesearch.org/data/ALL1/.

MLL:

The MLL-leukemia dataset consists of three classes, ALL (24), AML (28), and MLL (20), and can be downloaded at http://research.dfci.harvard.edu/korsmeyer/MLL.htm. The dataset was first studied in Armstrong et al. [34].

SRBCT:

The SRBCT dataset [35] is a dataset of small, round blue cell tumors found in children and can be downloaded at http://chibi.ubc.ca/tmm/raw-data.html. The dataset consists of 83 samples, which are divided into four classes: EWS (29), BL (11), NB (18) and RMS (25).

The datasets are preprocessed using standardization and normalization. Normalization is performed using the z-score method, which transforms each feature to have mean 0 and standard deviation 1. This process also standardizes the data.
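The z-score step can be illustrated with a minimal sketch. It assumes a samples-by-genes matrix and the population standard deviation; the function name `zscore_columns` and the toy values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def zscore_columns(matrix):
    """Scale every column (gene) of a samples-by-genes matrix to mean 0, std 1."""
    scaled_columns = []
    for column in zip(*matrix):
        mu = mean(column)
        sigma = pstdev(column)  # population standard deviation (an assumption)
        if sigma == 0:
            # A constant gene carries no information; map it to zeros.
            scaled_columns.append([0.0] * len(column))
        else:
            scaled_columns.append([(value - mu) / sigma for value in column])
    # Transpose back to samples-by-genes.
    return [list(row) for row in zip(*scaled_columns)]


# Toy data: three samples, two genes (the second gene is constant).
expression = [[2.0, 10.0], [4.0, 10.0], [6.0, 10.0]]
scaled = zscore_columns(expression)
```

Constant genes are mapped to zeros here to avoid a division by zero; whether the original pipeline drops or keeps such genes is not stated.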

Feature Selection and Scoring using Associative Pattern Mining

The number of features is large compared to the small number of samples in the gene expression datasets. The program Rankgene [36] ranks the features in the dataset. The measures included in Rankgene have been widely used in machine learning and statistical learning theory. We use the statistical impurity-based measures Gini Index (GI), Max Minority (MM), and the Twoing Rule (TR) to extract the relevant features. These measures quantify the effectiveness of a feature by evaluating how well the class can be predicted when the full range of expression of a given gene is divided into two intervals, up-regulation and down-regulation. The prediction is based on the presence of all the samples belonging to a particular interval in the same class. We select the Top-100, Top-200, and Top-400 ranked genes from each of the three statistical measures, which form our reduced feature datasets. In our approach, we apply the three statistical measures, and the variances for a single subset of genes are expected to reflect three statistical properties. If a particular gene is highly ranked, then the other genes that are correlated with this gene are also likely to have high ranks [37]. We take advantage of this correlation among the highly ranked genes by performing ARD to find frequently occurring sets of genes. ARD was first introduced in [38] and has the following definition. Let I be the set of items and D be the set of transactions. Each transaction T in D contains a set of items such that T ⊆ I. Association rules follow the form X ⇒ Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. X is called the antecedent (left hand side or LHS), and Y is called the consequent (right hand side or RHS) of the rule. The meaning of the rule X ⇒ Y is that data instances that contain X are likely to contain Y as well. To assess the interestingness of the rules, various measures of significance and interest can be applied, including support and confidence.
The support of the rule is the percentage of transactions that contain both X and Y. The confidence of the rule is the conditional probability of Y given X, P(Y | X). The purpose of association rule mining is to find all the rules that exceed the user-specified thresholds of support and confidence. ARD is performed on each of these reduced feature datasets separately to find frequently occurring sets of genes. The frequently occurring genes establish patterns between them of the form Gene_i ⇒ Gene_j, which implies that when Gene_i occurs, it is likely that Gene_j also occurs. The Frequent-1, Frequent-2, and Frequent-3 patterns are discovered for all the sub-datasets. The scores for the frequently occurring genes are obtained in the following manner. Let F_k be the set of frequent itemsets that contain k items occurring together; in our case, k ∈ {1, 2, 3}. Let G = {G_1, G_2, ..., G_n} be the featured genes that form the frequently occurring itemsets, and let s_{i,k} be the support score associated with gene G_i in F_k, which reflects the number of times the genes occur together in all the samples. Hence, every gene G_i that participates in F_k has an associated support s_{i,k}. The weight W_i for each gene G_i in F_1, F_2, and F_3 is calculated, consistent with the F1%, F2%*2, and F3%*3 columns of Table 2, as W_i = Σ_{k=1}^{3} k · s_{i,k}, where k is the itemset size, depending upon whether the gene belongs to the F_1, F_2, or F_3 itemsets. A detailed description of this method is available in our previous work [39]. Table 2 is a representative example that shows the scores calculated for the set of nine genes forming the reduced feature set using the top 100 ranked genes selected based on the Gini Index for the ALL dataset. Although some of the rules discovered in this process represent important biological relationships between the genes, other rules often contain redundant information, which is difficult for the decision maker to analyze manually. In the pruning phase, we remove the redundant rules using the following concept.
If Gene1 ⇒ Gene2, Gene3 is a frequently occurring rule, then the set of rules, Gene1 ⇒ Gene2 ; Gene1 ⇒ Gene3; Gene1,Gene2 ⇒ Gene3; Gene1,Gene3 ⇒ Gene2 will be frequently occurring, an observation which can be derived from the original rule, and hence it would be redundant to analyze them.
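The support, confidence, and pruning ideas above can be sketched as follows. This is a brute-force illustration under stated assumptions, not the authors' implementation: transactions are sets of gene labels, rule sizes are capped at three items, and the names `mine_rules`, `is_derivable`, and `prune_redundant`, the toy transactions, and the derivability test are all hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def mine_rules(transactions, min_support, min_confidence, max_size=3):
    """Enumerate rules LHS => RHS that meet both thresholds (brute force)."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = support(itemset, transactions)
            if s == 0 or s < min_support:
                continue
            for lhs_size in range(1, size):
                for lhs_items in combinations(combo, lhs_size):
                    lhs = frozenset(lhs_items)
                    confidence = s / support(lhs, transactions)
                    if confidence >= min_confidence:
                        rules.append((lhs, itemset - lhs))
    return rules


def is_derivable(rule, other):
    """True if `rule` follows from `other`, e.g. G1 => G2 from G1 => G2, G3."""
    lhs, rhs = rule
    other_lhs, other_rhs = other
    return (rule != other
            and other_lhs <= lhs
            and (lhs | rhs) <= (other_lhs | other_rhs))


def prune_redundant(rules):
    """Keep only rules that cannot be derived from another rule in the list."""
    return [r for r in rules if not any(is_derivable(r, o) for o in rules)]


# Toy transactions: each set lists the genes "present" in one sample.
transactions = [
    {"G1", "G2", "G3"}, {"G1", "G2", "G3"}, {"G1", "G2", "G3"},
    {"G2", "G3"}, {"G3"},
]
all_rules = mine_rules(transactions, min_support=0.6, min_confidence=0.9)
kept_rules = prune_redundant(all_rules)
```

On this toy input, G1 ⇒ G2, G1 ⇒ G3, G1,G2 ⇒ G3, and G1,G3 ⇒ G2 are dropped because they follow from G1 ⇒ G2, G3, while G2 ⇒ G3 is kept because no other rule implies it.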

Weighted Similarity Measures

The pruning of the rules removes the redundant associations. However, we are still left with a number of rules, some of which have important biological relationships but are difficult to analyze because of their density. Since genes exhibit complex relationships, it is important to identify the gene correlations that contribute to an understanding of biologically significant cellular mechanisms. Thus, we propose two weighted similarity methods based on the Jaccard and cosine measures to organize and summarize these gene correlations on the basis of “similarity”, which will provide a consistent and precise view of the gene correlations. Previous studies have used the Jaccard coefficient (ratio of the set intersection to the set union) as a similarity metric that describes the degree of overlap (similarity) between the subsets of genes. However, in defining this metric, they do not capture the predictive power of the correlated genes to classify the samples. Further, they fail to apply any heuristics to distinguish between two subsets of gene correlations that have the same cardinality-based similarity measure but entirely different sets of genes present in them. Our proposed method overcomes these limitations by i) using gene ranking measures when discovering gene correlations and ii) assigning weights to the gene subsets based on the cardinality of common genes between them. For the cosine measure, the two rules can be arranged into binary-valued vectors. The measure will yield a value of 0 or 1 depending upon whether the common gene(s) between the two rules are present on the RHS or LHS of the rule. This binary value poses a very stringent criterion for similarity, especially in the case of association rules, where the common gene between the rules can be on either side and the rules will still be similar to some extent.
Our proposed weighted cosine measure relaxes the constraint of the cosine similarity measure by computing the dot product with the weights obtained for the genes in Section 2.2. Let be the two frequently occurring association rules such that . Let be the set of genes where i = x or y and j = L or R, respectively then the Jaccard similarity between the two rules can be defined as, and the cosine similarity between the two rules can be defined as, where the rules R can be defined as a vector of genes. is the dot product of the weights of the genes in the two rules, and is the length of the vector.
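The limitation of the unweighted Jaccard coefficient described above can be seen in a small sketch. It uses only the textbook definition (intersection size over union size) on plain gene sets; the gene labels are hypothetical placeholders, and the rule-level formula used in the paper is not reproduced here.

```python
def jaccard(genes_a, genes_b):
    """Textbook Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    genes_a, genes_b = set(genes_a), set(genes_b)
    union = genes_a | genes_b
    return len(genes_a & genes_b) / len(union) if union else 0.0


# Two pairs of gene sets that share no genes across pairs (hypothetical labels).
pair_1 = ({"g1", "g2", "g3"}, {"g3", "g4", "g5"})
pair_2 = ({"g6", "g7", "g8"}, {"g8", "g9", "g10"})

sim_1 = jaccard(*pair_1)
sim_2 = jaccard(*pair_2)
```

Both pairs overlap in one gene out of five, so the coefficient is identical even though the genes involved are entirely different. This is the cardinality-only behaviour that gene weights are meant to break.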

Example 1: Elucidation of Jaccard similarity calculations for the two rules.

Consider the two rules R and R such that, with corresponding weights derived in Section 2.2 that are given in Table . Based on Jaccard similarity we have, Hence, Next, consider two more rules, which have entirely different sets of genes from (1) and (2) with corresponding weights derived in Section 2.2 that are given in Table . Based on Jaccard similarity, we have and Hence Sim (R, R) =0.2. The above examples show that Jaccard similarity measure does not apply any heuristics between two sets of rules if they have the same cardinality-based on the genes present on the either side of the rules but entirely different sets of genes present in them.

Example 2: Elucidation of Cosine similarity calculations for the two rules.

Consider the two rules R(4) R(5) as in Example 1, Based on cosine similarity we have, Similarly, considering the two rules R(6) and R(7), as in Example 1, we have, Sim (R, R) =1. This equation shows that the cosine measure will yield a value of 0 or 1 based on the set of common genes present at different sides of the two rules or on the same side. Let be the corresponding weights of the attributes in the association rules. We now present the definitions of Weighted Jaccard and Weighted cosine similarity measures as follows.

Definition 1: Weighted Jaccard Similarity Measure.

The weighted Jaccard similarity between the two association rule profiles P and P is defined as, where c, c is the number of common set of genes on the LHS or RHS of the rules.

Example 3: Elucidation of weighted Jaccard similarity calculations for the two rules.

Consider the rule pair R(4), R(5) with the corresponding weights in Table 3. Therefore, Sim(R, R) =1.58. Now, consider R(6), R(7) as in Example 1, with the corresponding weights in Table 4. Therefore, This example shows that, unlike the Jaccard similarity measure, our proposed weighted Jaccard similarity measure gives a different value for rules that have entirely different sets of genes present in them, thus providing a more efficient way to cluster them.

Definition 2: Weighted Cosine Similarity Measure.

The weighted cosine similarity between the two association rule profiles Px and P is defined as,

Example 4: Elucidation of Weighted Cosine Similarity Calculations for the Two Rules.

Consider the two rule pairs R(4), R(5) 3 with corresponding and in Table . Therefore, Now consider, R(6) R(7) as in Example 1 with corresponding and in Table . Therefore, This example shows that unlike the cosine measure which gives a binary value for the similarity, our proposed weighted cosine measure relaxes the criteria and produces a continuous measure which is then utilized for clustering similar rules.
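How gene weights turn a binary comparison into a continuous one can be sketched as follows. This is an assumption-laden illustration, not the paper's Definition 2: each rule is represented by the set of all genes it contains, each gene contributes its weight as a vector component, and the standard cosine of the two weight vectors is taken. The weights are copied from Table 3; the two rules, and the name `weighted_cosine`, are hypothetical.

```python
import math

# Gene weights copied from Table 3.
WEIGHTS = {
    "38319_at": 1.0,
    "38051_at": 0.137,
    "38147_at": 0.111,
    "32794_g_at": 0.007,
    "32649_at": 0.014,
}


def weighted_cosine(genes_x, genes_y, weights):
    """Cosine of two weight vectors; a gene absent from a rule has component 0."""
    dot = sum(weights[g] ** 2 for g in genes_x & genes_y)
    norm_x = math.sqrt(sum(weights[g] ** 2 for g in genes_x))
    norm_y = math.sqrt(sum(weights[g] ** 2 for g in genes_y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical rules, flattened to the genes they contain (LHS and RHS merged).
rule_x = {"38319_at", "38051_at", "38147_at"}
rule_y = {"38319_at", "32794_g_at", "32649_at"}
similarity = weighted_cosine(rule_x, rule_y, WEIGHTS)
```

Because the shared gene carries most of the weight, the two hypothetical rules come out as highly but not perfectly similar, which is the graded behaviour the weighted measure is meant to provide.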

RESULTS

We perform several experiments with the top 100-400 ranked genes to evaluate the efficacy of the selected features by using our novel association rule based weighting scheme along with the weighted Jaccard and cosine similarity measures to group the rules. The gene ranking algorithm is run on the Queen Bee LONI supercomputer, and all the other experiments are carried out on a 3.2 GHz Intel® Pentium® 4 processor with 1 GB RAM.

Feature Selection and Weighted Scoring Using Association Rule Mining

In this experiment, three feature selection methods, TR, GI, and MM, are used to rank the genes. The top 100, 200, and 400 ranked genes are selected for further analysis. ARD is performed on the 12 sub-datasets of top ranked genes for the ALL, MLL, and SRBCT datasets separately to find frequently occurring sets of genes. The support and confidence thresholds are set to 60% and 90%, respectively, for all sub-datasets in order to generate rules. Our experiment shows that a number of rules have a common LHS but different RHS. We limit our selection to those rules that have unique genes present on the LHS of the rule to identify a set of non-overlapping genes. Hence, a smaller number of genes qualify for the set support and confidence thresholds in all sub-datasets. We further apply the concept of SAR to eliminate the redundant rules. The scoring method described in Section 2.3 is used to obtain the weights for each gene in all sub-datasets. The weights are then normalized to the range [0, 1]. An example showing the weights obtained for the GI measure for the top 100 genes is shown in Table 2. Similarly, this scoring method is applied to all sets of top ranked genes obtained using the feature selection methods.
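The ranking-and-selection step can be illustrated for the Gini Index alone. The sketch below scores each gene by the lowest weighted Gini impurity over all two-interval splits of its expression range and keeps the top-k genes. It is a simplified stand-in, and Rankgene's exact procedure may differ; the function names and the toy expression values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split_gini(values, labels):
    """Lowest weighted impurity over all splits into a low and a high interval."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = gini(labels)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal expression values
        low = [label for _, label in pairs[:i]]
        high = [label for _, label in pairs[i:]]
        score = (len(low) * gini(low) + len(high) * gini(high)) / n
        best = min(best, score)
    return best


def rank_genes(expression, labels, top_k):
    """Return the top_k genes with the lowest split impurity."""
    scores = {gene: best_split_gini(vals, labels)
              for gene, vals in expression.items()}
    return sorted(scores, key=scores.get)[:top_k]


# Toy data: four samples, two genes, two classes.
labels = ["ALL", "ALL", "AML", "AML"]
expression = {
    "geneA": [0.1, 0.2, 0.9, 1.0],  # separates the classes perfectly
    "geneB": [0.5, 0.1, 0.4, 0.2],  # mixes the classes
}
top_genes = rank_genes(expression, labels, top_k=1)
```

Max Minority and the Twoing Rule would plug into the same loop by replacing the impurity function, which is why a single split search can serve all three measures.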

Clustering Rules Based on the Weighted Jaccard and the Vector Cosine Measure

We computed the clusters of association rules by employing both the weighted Jaccard and weighted cosine measures. Hierarchical clustering (average-linkage) is used to cluster the rules. The algorithm starts with each single rule as an individual group and, at each stage, merges the most similar pair of rule groups. The process completes when only one group is left containing all of the association rules. Our algorithm can handle single-, complete-, and average-linkage, i.e., taking the maximum, minimum, or average similarity of all pair-wise similarities between two groups of rules. Each cluster has a group of genes obtained by the weighted Jaccard and cosine similarity measures. Fig. () shows that the two proposed measures perform very similarly in obtaining gene clusters in all three datasets for the top 100-400 genes. In the clusters generated, a significant number of genes are common to both measures. In Fig. (), the genes obtained from both the measures are plotted against their weights.
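The agglomerative procedure described above can be sketched in a few lines. The version below works directly on a precomputed rule-by-rule similarity matrix and uses average linkage. Stopping at a requested number of groups, rather than merging down to a single group, is an assumption made for illustration, as are the function name and the toy similarity values.

```python
def average_linkage(similarity, n_clusters):
    """Merge the most similar pair of groups until n_clusters groups remain."""
    clusters = [[i] for i in range(len(similarity))]
    while len(clusters) > n_clusters:
        best_pair, best_value = None, float("-inf")
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average of all pairwise similarities between the two groups.
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                value = sum(similarity[i][j] for i, j in pairs) / len(pairs)
                if value > best_value:
                    best_pair, best_value = (a, b), value
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy symmetric similarity matrix for four rules (hypothetical values).
similarity = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
groups = average_linkage(similarity, n_clusters=2)
```

Replacing the average with `max` or `min` over the pairwise similarities gives single or complete linkage, respectively.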

Performing Biomedical Literature Search to Validate the Results

After obtaining the relevant set of genes, the web-based biomedical literature tool DAVID (Database for Annotation, Visualization, and Integrated Discovery) [40] is utilized to study the functional annotation of the discovered genes. DAVID has “over 40 annotation categories, including GO terms, protein-protein interactions, protein functional domains, disease associations, bio-pathways, sequence general features, expressions, literatures, etc.” [41]. The advantage of a web-based biomedical search is that it provides authentic information about the selected genes being studied without involving human expertise for verification. Tables 5-7 list the unique genes obtained after feature selection and clustering. Table shows the genes participating in the pathway. The clusters obtained for each of the top 100, 200, and 400 genes in a dataset had a number of overlapping genes. The column “Featured Genes” reports the genes obtained by our approach that were also obtained from the top 96 ranked genes identified based on artificial neural networks by [35]. Three of these genes (FGFR4, IGF2, and MYL4) are reported to be highly expressed in rhabdomyosarcoma (RMS). In the SRBCT dataset, IGF2 has been reported to be indispensable for the formation of medulloblastoma and RMS [45]. PTPN13 was identified by [44] as a natural target gene for the EWS-FLI1 fusion protein. In [35], two-dimensional hierarchical clustering is performed using the Pearson correlation coefficient and an unweighted pair group method using arithmetic averages. The genes reported in [35] and identified by our approach are shown in Table . A flow cytometry analysis is performed using monoclonal antibodies specific for a number of antigens including CD2 to determine lineage derivation [46]. In the comparative expression of kinase genes in pre-B-lineage acute lymphocytic leukemia, a comparison of NEG and E2A/PBX1 identifies nine kinases; the tyrosine kinase gene (LCK) is more highly expressed in the E2A/PBX1 samples [46].
Three of the genes (CCL2, CXCL2, and CFD) identified by our approach are reported by [47]. They are involved in the following GO categories: cell surface receptor linked signal transduction, response to wounding, chemotaxis, response to stress, inflammatory response, immune response, and extracellular region. A few genes reported by [9] showed the highest differential expression within the first 24 hours of cocultivation, including CXCL2, which acts like a cytokine. TNFAIP6, also identified by [11], is involved in intracellular signaling (integral to plasma membrane, receptor activity, signal transducer activity, cell surface receptor-linked signal transduction, cell motility, G-protein-coupled receptor protein signaling pathway, cell-cell signaling, development, organogenesis, morphogenesis, and extracellular region). TCFL5 is one of the nine selected genes reported by [48] as being the most biologically relevant and being able to independently differentiate between TEL/AML1 positive and TEL/AML1 negative patients. The lymphoid-specific gene MME, which is highly expressed in ALL samples and underexpressed in MLL samples, has a function in early B-cell development [34]. Annotations from the GENECODIS [49, 50] software are used to associate the genes with known Kyoto Encyclopedia of Genes and Genomes (KEGG) [51] pathways. Table shows the genes listed in the identified pathway for the ALL dataset and their description. Table shows the three statistical measures and top 100-400 genes identified in the pathways.

CONCLUSIONS

This paper introduces a novel approach based on ARD for feature extraction and grouping of rules based on weighted similarity measures. The Jaccard and cosine similarity measures have limitations in clustering of similar rules and will not be effective if applied as is. Experiments conducted on the multiclass cancer datasets along with the biomedical literature datasets show the effectiveness of our technique. We expect that this method can be effectively extended to the large data sets produced in modern microarray experiments. Due to the efficiency and scalability of our proposed technique, it is well suited to the domains of medical image analysis for feature extraction and clustering of similar feature based rules.

Table 1

Description of the Datasets

Dataset	No. of Genes	No. of Samples	No. of Classes
ALL	12,625	248	6
MLL	12,582	72	3
SRBCT	2,276	83	4

Table 2

Scores Calculated for the Set of Nine Genes Forming the Reduced Feature Set Using the Top 100 Ranked Genes Selected Based on Gini Index for the ALL Dataset

GENE_ID	F1%	F2%*2	F3%*3	Scores	Normalized Scores
38319_at	12.43%	75.13%	226.13%	3.136977	1
37780_at	12.80%	70.95%	73.87%	1.57621	0.486235
38147_at	11.20%	19.51%		0.30707	0.068466
38051_at	12.86%	17.26%		0.301177	0.066526
36277_at	9.85%	17.15%		0.269951	0.056248
32724_at	10.58%			0.105846	0.002228
35665_at	10.34%			0.103385	0.001418
35974_at	10.03%			0.100308	0.000405
2059_s_at	9.91%			0.099077	0
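The Scores and Normalized Scores columns of Table 2 are consistent with summing the three weighted support columns and then applying min-max scaling. A minimal sketch that reproduces them from the rounded percentages shown in the table follows. The dictionary layout and variable names are assumptions, and empty cells are treated as zero.

```python
# (F1%, F2%*2, F3%*3) per gene, copied from Table 2; empty cells become 0.
TABLE_2 = {
    "38319_at": (12.43, 75.13, 226.13),
    "37780_at": (12.80, 70.95, 73.87),
    "38147_at": (11.20, 19.51, 0.0),
    "38051_at": (12.86, 17.26, 0.0),
    "36277_at": (9.85, 17.15, 0.0),
    "32724_at": (10.58, 0.0, 0.0),
    "35665_at": (10.34, 0.0, 0.0),
    "35974_at": (10.03, 0.0, 0.0),
    "2059_s_at": (9.91, 0.0, 0.0),
}

# Score = sum of the weighted support columns, converted from percent.
scores = {gene: sum(cols) / 100.0 for gene, cols in TABLE_2.items()}

# Min-max scaling to [0, 1], matching the "Normalized Scores" column.
low, high = min(scores.values()), max(scores.values())
normalized = {gene: (s - low) / (high - low) for gene, s in scores.items()}
```

Small differences from the printed values in the fourth decimal place are expected because the percentages in the table are rounded.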

Table 3

Weights for Rules R(4) and R(5) in (4) and (5)

Gene-ID	Weight (W)
38319_at	1
38051_at	0.137
38147_at	0.111
32794_g_at	0.007
32649_at	0.014

Table 4

Weights for Rules R(6) and R(7) in (6) and (7)

Gene-ID	Weight (W)
2059_s_at	0.030
38051_at	0.137
33238_at	0.016
32649_at	0.014
32794_g_at	0.007

Table 5

List of Genes Obtained in SRBCT Dataset

Index	Image Id	Gene Symbol	Description	Featured Genes	Jaccard Similarity	Cosine Similarity
1	138672		ESTs		✓	✓
2	244618	FNDC	ESTs	a, b	✓	✓
3	245330	IGF2	Human Krueppel-related zinc finger protein (H-plk) mRNA, complete cds	a, b	✓	✓
4	296448	IGF2	Insulin-like growth factor 2 (somatomedin A)	a, b, c, d, e	✓	✓
5	298062	TNNT2	Plasticity related-protein	a	✓	✓
6	461425	MYL4	Microsomal Glutathione S-Transferase 3	a	✓	✓
7	784224	FGFR4	Fibroblast growth factor receptor 4	a,b	✓	✓
8	789091	RNPEP	Arginyl Aminopeptidase (Aminopeptidase B)		✓	✓
9	839736	CRYAB	Crystallin, alpha B	a,b	✓	✓
10	866702	PTPN13	protein tyrosine phosphatase, non-receptor type 13 (APO-1/CD95 (Fas)-associated phosphatase)	a,b,d	✓	✓
11	882506	PA3341	Probable transcriptional regulator		✓	✓

a Khan et al. 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks [35].

b Xuan et al. 2007. Gene Selection for Multiclass Prediction by Weighted Fisher Criterion [42].

c El-Badry et al. 1990. Insulin-like growth factor II acts as an autocrine growth and motility factor in human rhabdomyosarcoma tumors [43].

d Baer et al. 2004. Profiling and Functional Annotation of mRNA Gene Expression in Pediatric Rhabdomyosarcoma and Ewing's Sarcoma [44].

e Wang et al. 2007. Accurate Cancer Classification Using Expressions of Very Few Genes [45].

Table 6

List of Genes Obtained in ALL Dataset

Index	Gene Symbol	Description	Marker Genes	Jaccard Similarity	Cosine Similarity
1	AQP3	Aquaporin 3 (Gill Blood Group)		✓	✓
2	CD1B	CD1B Antigen		✓	✓
3	CD1E	CD1E Antigen, E Polypeptide		✓	✓
4	CD2	CD2 Antigen (P50), Sheep Blood Cell Receptor	b	✓	✓
5	CD3D	CD3D Antigen, Delta Polypeptide (TIT3 Complex)	a	✓	✓
6	CD3D	CD3D Antigen, Delta Polypeptide (TIT3 Complex)	a	✓	✓
7	CHI3L2	Chitinase 3-LIKE 2	a	✓	✓
8	EPHB6	EPH Receptor B6		✓	✓
9	FXYD2	FXYD Domain containing ion Transport Regulator 2		✓	✓
10	ITM2A	Integral Membrane Protein 2A	a	✓	✓
11	LAT	Linker for Activation of T Cells	a	✓	✓
12	LCK	Lymphocyte-Specific Protein Tyrosine Kinase	a,b	✓	✓
13	MAL	MAL, T-Cell Differentiation Protein	a	✓	✓
14	SEPW1	Selenoprotein W, 1	a	✓	✓
15	SH2D1A	SH2 Domain Protein 1A, Duncan's Disease (Lymphoproliferative Syndrome)	a	✓	✓
16	TCF7	Transcription Factor 7 (T-Cell Specific, HMG-Box)	a	✓	✓
17	TRA@	T Cell Receptor Alpha Locus	a	✓	✓
18	TRBC1	T Cell Receptor Beta Constant 1		✓	✓
19	TRBV19	T Cell Receptor Beta Variable 19		✓	✓
20	TRBV21-1	T Cell Receptor Beta Variable 21-1		✓	✓
21	TRBV3-1	T Cell Receptor Beta Variable 3-1		✓	✓
22	TRBV5-4	T Cell Receptor Beta Variable 5-4		✓	✓
23	TRD@	T Cell Receptor Delta Locus		✓	✓
24	TRIB2	Tribbles Homolog 2 (Drosophila)		✓	✓
25	VAT1	Vesicle Amine Transport Protein 1 Homolog (T Californica)		✓	✓
26	KIAA0802	Unknown	a,b	✓

a. Yeoh et al. 2002. Classification, subtype discovery, and prediction of outcome in pediatric lymphoblastic leukemia by gene expression profiling [52].

b. Chiaretti et al. 2005. Gene Expression Profiles of B-lineage Adult Acute Lymphocytic Leukemia Reveal Genetic Patterns that Identify Lineage Derivation and Distinct Mechanisms of Transformation [46].

Table 7

List of Genes Obtained in MLL Dataset

Index	Probe ID	Gene Symbol	Description	Marker Genes	Jaccard Similarity	Cosine Similarity
1	37954_AT	ANXA8L2	Annexin A8		✓	✓
2	34375_AT	CD163	CD163 Antigen		✓	✓
3	34375_AT	CCL2	Chemokine (C-C Motif) Ligand 2	a	✓	✓
4	875_G_AT	CCL2	Chemokine (C-C Motif) Ligand 2		✓	✓
5	37187_AT	CXCL2	Chemokine (C-X-C Motif) Ligand 2	a,b	✓	✓
6	36780_AT	CLU	Clusterin		✓	✓
7	40282_S_AT	CFD	Complement Factor D (Adipsin)	a	✓	✓
8	1914_AT	CCNA1	Cyclin A1		✓	✓
9	39660_AT	DEFB1	Defensin, Beta 1		✓	✓
10	864_AT	MNX1	Homeobox HB9		✓	✓
11	37043_AT	ID3	Inhibitor of DNA Binding 3, Dominant Negative Helix-Loop-Helix Protein		✓	✓
12	1389_AT	MME	Membrane Metallo-Endopeptidase	f	✓	✓
13	38604_AT	NPY	Neuropeptide Y		✓	✓
14	36151_AT	PLD3	Phospholipase D Family, Member 3		✓	✓
15	39208_I_AT	PPBP	Pro-Platelet Basic Protein (Chemokine (C-X-C Motif) Ligand 7)		✓	✓
16	39209_R_AT	PPBP	Pro-Platelet Basic Protein (Chemokine (C-X-C Motif) Ligand 7)		✓	✓
17	37185_AT	SERPINB2	Serpin Peptidase Inhibitor, Clade B (Ovalbumin), Member 2		✓	✓
18	1325_AT	SMAD1	SMAD, Mothers Against DPP Homolog 1 (Drosophila)		✓	✓
19	37280_AT	SMAD1	SMAD, Mothers Against DPP Homolog 1 (Drosophila)		✓	✓
20	41097_AT	TERF2	Telomeric Repeat Binding Factor 2	f	✓	✓
21	32872_AT	TCF4	Transcription Factor 4		✓	✓
22	35614_AT	TCFL5	Transcription Factor-LIKE 5 (Basic Helix-Loop-Helix)	d,e	✓	✓
23	1372_AT	TNFAIP6	Tumor Necrosis Factor, Alpha-Induced Protein 6	c	✓	✓

a. Bloushtain-Qimron et al. 2008. Cell type-specific DNA methylation patterns in the human breast [47].

b. Wagner et al. 2005. Hematopoietic Progenitor Cells and Cellular Microenvironment: Behavioral and Molecular Changes upon Interaction [53].

c. S. Hanash and C. Creighton 2003. Making sense of microarray data to classify cancer [11].

d. Gandemer et al. 2007. Five distinct biological processes and 14 differentially expressed genes characterize TEL/AML1-positive leukemia [53].

e. Severin et al. 2009. FANTOM4 EdgeExpressDB: an integrated database of promoters, genes, microRNAs, expression dynamics and regulatory interactions [54].

f. Armstrong et al. 2002. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia [34].

Table 8

Genes Listed in the Pathways

Pathway	Genes Involved	Description
A	40688_at, 36277_at, 38319_at, 33238_at	(KEGG) 04660: T cell receptor signaling pathway, (KEGG) 05340: Primary immunodeficiency
B	36277_at, 38319_at, 33238_at	(KEGG) 04660: T cell receptor signaling pathway
C	40688_at, 38147_at, 33238_at	(KEGG) 04640: Hematopoietic cell lineage
D	40688_at, 38319_at, 33238_at	(KEGG) 04660: T cell receptor signaling pathway
E	36277_at, 40738_at, 38319_at	(KEGG) 04650: Natural killer cell mediated cytotoxicity
F	34927_at, 37861_at, 38319_at	(KEGG) 04640: Hematopoietic cell lineage

27 in total

1. Clustering gene expression patterns.

Authors: A Ben-Dor; R Shamir; Z Yakhini
Journal: J Comput Biol Date: 1999 Fall-Winter Impact factor: 1.479

2. Mining gene expression databases for association rules.

Authors: Chad Creighton; Samir Hanash
Journal: Bioinformatics Date: 2003-01 Impact factor: 6.937

3. Analyzing microarray data using quantitative association rules.

Authors: Elisabeth Georgii; Lothar Richter; Ulrich Rückert; Stefan Kramer
Journal: Bioinformatics Date: 2005-09-01 Impact factor: 6.937

4. Hematopoietic progenitor cells and cellular microenvironment: behavioral and molecular changes upon interaction.

Authors: Wolfgang Wagner; Rainer Saffrich; Ute Wirkner; Volker Eckstein; Jonathon Blake; Alexandra Ansorge; Christian Schwager; Frederik Wein; Katrin Miesala; Wilhelm Ansorge; Anthony D Ho
Journal: Stem Cells Date: 2005-06-13 Impact factor: 6.277

5. Insulin-like growth factor II acts as an autocrine growth and motility factor in human rhabdomyosarcoma tumors.

Authors: O M El-Badry; C Minniti; E C Kohn; P J Houghton; W H Daughaday; L J Helman
Journal: Cell Growth Differ Date: 1990-07

6. Cluster analysis and display of genome-wide expression patterns.

Authors: M B Eisen; P T Spellman; P O Brown; D Botstein
Journal: Proc Natl Acad Sci U S A Date: 1998-12-08 Impact factor: 11.205

7. From genomics to chemical genomics: new developments in KEGG.

Authors: Minoru Kanehisa; Susumu Goto; Masahiro Hattori; Kiyoko F Aoki-Kinoshita; Masumi Itoh; Shuichi Kawashima; Toshiaki Katayama; Michihiro Araki; Mika Hirakawa
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

8. Five distinct biological processes and 14 differentially expressed genes characterize TEL/AML1-positive leukemia.

Authors: Virginie Gandemer; Anne-Gaëlle Rio; Marie de Tayrac; Vonnick Sibut; Stéphanie Mottier; Béatrice Ly Sunnaram; Catherine Henry; Annabelle Monnier; Christian Berthou; Edouard Le Gall; André Le Treut; Claudine Schmitt; Jean-Yves Le Gall; Jean Mosser; Marie-Dominique Galibert
Journal: BMC Genomics Date: 2007-10-23 Impact factor: 3.969

9. FANTOM4 EdgeExpressDB: an integrated database of promoters, genes, microRNAs, expression dynamics and regulatory interactions.

Authors: Jessica Severin; Andrew M Waterhouse; Hideya Kawaji; Timo Lassmann; Erik van Nimwegen; Piotr J Balwierz; Michiel Jl de Hoon; David A Hume; Piero Carninci; Yoshihide Hayashizaki; Harukazu Suzuki; Carsten O Daub; Alistair Rr Forrest
Journal: Genome Biol Date: 2009-04-19 Impact factor: 13.583

10. GeneCodis: interpreting gene lists through enrichment analysis and integration of diverse biological information.

Authors: Ruben Nogales-Cadenas; Pedro Carmona-Saez; Miguel Vazquez; Cesar Vicente; Xiaoyuan Yang; Francisco Tirado; Jose María Carazo; Alberto Pascual-Montano
Journal: Nucleic Acids Res Date: 2009-05-22 Impact factor: 16.971
