Prerna Sethi, Sathya Alagiriswamy.
Abstract
In life-threatening diseases such as cancer, where effective diagnosis includes annotation, early detection, distinction, and prediction, data mining and statistical approaches promise precise, accurate, and functionally robust analysis of gene expression data. The computational extraction of derived patterns from microarray gene expression data is a non-trivial task that involves sophisticated algorithm design and analysis for specific domain discovery. In this paper, we propose a formal approach for feature extraction. We first apply feature selection heuristics based on three statistical impurity measures, the Gini Index, Max Minority, and the Twoing Rule, and obtain the top 100-400 genes. We then analyze the associative dependencies between the genes and assign weights to the genes based on their degree of participation in the rules. Next, we present weighted Jaccard and vector cosine similarity measures to compute the similarity between the discovered rules. Finally, we group the rules by applying hierarchical clustering. To demonstrate the usability and efficiency of our technique, we applied it to three publicly available, multiclass cancer gene expression datasets and performed a biomedical literature search to support the effectiveness of our results.
Keywords: Microarray gene expression; association rules; clustering; similarity measure.
Year: 2010 PMID: 21603179 PMCID: PMC3096052 DOI: 10.2174/1874431101004010063
Source DB: PubMed Journal: Open Med Inform J ISSN: 1874-4311
Description of the Datasets
| Dataset | No. of Genes | No. of Samples | No. of Classes |
|---|---|---|---|
| ALL | 12,625 | 248 | 6 |
| MLL | 12,582 | 72 | 3 |
| SRBCT | 2,276 | 83 | 4 |
Scores Calculated for the Set of Nine Genes Forming the Reduced Feature Set Using the Top 100 Ranked Genes Selected Based on Gini Index for the ALL Dataset
| GENE_ID | F1% | F2%*2 | F3%*3 | Scores | Normalized Scores |
|---|---|---|---|---|---|
| 38319_at | 12.43% | 75.13% | 226.13% | 3.136977 | 1 |
| 37780_at | 12.80% | 70.95% | 73.87% | 1.57621 | 0.486235 |
| 38147_at | 11.20% | 19.51% | | 0.30707 | 0.068466 |
| 38051_at | 12.86% | 17.26% | | 0.301177 | 0.066526 |
| 36277_at | 9.85% | 17.15% | | 0.269951 | 0.056248 |
| 32724_at | 10.58% | | | 0.105846 | 0.002228 |
| 35665_at | 10.34% | | | 0.103385 | 0.001418 |
| 35974_at | 10.03% | | | 0.100308 | 0.000405 |
| 2059_s_at | 9.91% | | | 0.099077 | 0 |
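The Scores column is numerically consistent with summing the three weighted percentage columns (read as fractions), and the Normalized Scores column is consistent with min-max scaling over the nine genes. The Python sketch below reproduces the table under that reading. It is an inference from the numbers, not a formula stated in this record; the function names are illustrative, and blank cells are assumed to contribute zero.

```python
# Weighted percentage columns (F1%, F2%*2, F3%*3) from the table, as fractions.
# Assumption: a blank cell contributes 0.0 to the score.
rows = {
    "38319_at": (0.1243, 0.7513, 2.2613),
    "37780_at": (0.1280, 0.7095, 0.7387),
    "38147_at": (0.1120, 0.1951, 0.0),
    "38051_at": (0.1286, 0.1726, 0.0),
    "36277_at": (0.0985, 0.1715, 0.0),
    "32724_at": (0.1058, 0.0, 0.0),
    "35665_at": (0.1034, 0.0, 0.0),
    "35974_at": (0.1003, 0.0, 0.0),
    "2059_s_at": (0.0991, 0.0, 0.0),
}


def raw_scores(table):
    """Sum the already-weighted columns for each gene."""
    return {gene: sum(cols) for gene, cols in table.items()}


def min_max_normalize(scores):
    """Rescale scores linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    if span == 0:
        return {gene: 0.0 for gene in scores}
    return {gene: (s - lo) / span for gene, s in scores.items()}


scores = raw_scores(rows)
normalized = min_max_normalize(scores)

for gene in rows:
    print(f"{gene:10s} score={scores[gene]:.4f} normalized={normalized[gene]:.4f}")
```

Because the inputs are the rounded percentages printed in the table, the computed values agree with the published Scores and Normalized Scores only to about three decimal places.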
Weights for Rules R(4) and R(5) in (4) and (5)
| Gene-ID | Weight (W) |
|---|---|
| 38319_at | 1 |
| 38051_at | 0.137 |
| 38147_at | 0.111 |
| 32794_g_at | 0.007 |
| 32649_at | 0.014 |
Weights for Rules R(6) and R(7) in (6) and (7)
| Gene-ID | Weight (W) |
|---|---|
| 2059_s_at | 0.030 |
| 38051_at | 0.137 |
| 33238_at | 0.016 |
| 32649_at | 0.014 |
| 32794_g_at | 0.007 |
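The abstract describes comparing discovered rules with weighted Jaccard and vector cosine similarity. As a rough illustration, the sketch below treats each weights table as one weighted gene set and compares the two. The rules R(4) to R(7) are not reproduced in this record, and the paper's exact formulas are not shown here, so the min/max form of weighted Jaccard and the zero-filled cosine used below are assumptions, as are all function and variable names.

```python
import math

# Gene weights copied from the two tables above.
weights_r4_r5 = {
    "38319_at": 1.0,
    "38051_at": 0.137,
    "38147_at": 0.111,
    "32794_g_at": 0.007,
    "32649_at": 0.014,
}
weights_r6_r7 = {
    "2059_s_at": 0.030,
    "38051_at": 0.137,
    "33238_at": 0.016,
    "32649_at": 0.014,
    "32794_g_at": 0.007,
}


def weighted_jaccard(a, b):
    """Sum of per-gene minima over sum of per-gene maxima (assumed definition)."""
    genes = sorted(set(a) | set(b))
    num = sum(min(a.get(g, 0.0), b.get(g, 0.0)) for g in genes)
    den = sum(max(a.get(g, 0.0), b.get(g, 0.0)) for g in genes)
    return num / den if den else 0.0


def cosine_similarity(a, b):
    """Cosine of the angle between weight vectors; absent genes count as 0."""
    genes = sorted(set(a) | set(b))
    dot = sum(a.get(g, 0.0) * b.get(g, 0.0) for g in genes)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


jaccard = weighted_jaccard(weights_r4_r5, weights_r6_r7)
cosine = cosine_similarity(weights_r4_r5, weights_r6_r7)
print(f"weighted Jaccard = {jaccard:.4f}, cosine = {cosine:.4f}")
```

Both measures come out low for this pair because the heavily weighted gene 38319_at appears in only one of the two sets, even though three of the lower-weight genes are shared.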
List of Genes Obtained in SRBCT Dataset
| Index | Image Id | Gene Symbol | Description | Featured Genes | Jaccard Similarity | Cosine Similarity |
|---|---|---|---|---|---|---|
| 1 | 138672 | ESTs | ✓ | ✓ | ||
| 2 | 244618 | ✓ | ✓ | |||
| 3 | 245330 | ✓ | ✓ | |||
| 4 | 296448 | ✓ | ✓ | |||
| 5 | 298062 | ✓ | ✓ | |||
| 6 | 461425 | ✓ | ✓ | |||
| 7 | 784224 | ✓ | ✓ | |||
| 8 | 789091 | RNPEP | Arginyl Aminopeptidase (Aminopeptidase B) | ✓ | ✓ | |
| 9 | 839736 | ✓ | ✓ | |||
| 10 | 866702 | ✓ | ✓ | |||
| 11 | 882506 | PA3341 | Probable transcriptional regulator | ✓ | ✓ | |
Khan et al. 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks [35].
Xuan et al. 2007. Gene Selection for Multiclass Prediction by Weighted Fisher Criterion [42].
El-Badry et al. 1990. Insulin-like growth factor II acts as an autocrine growth and motility factor in human rhabdomyosarcoma tumors [43].
Baer et al. 2004. Profiling and Functional Annotation of mRNA Gene Expression in Pediatric Rhabdomyosarcoma and Ewing's Sarcoma [44].
Wang et al. 2007. Accurate Cancer Classification Using Expressions of Very Few Genes [45].
List of Genes Obtained in ALL Dataset
| Index | Gene Symbol | Description | Marker Genes | Jaccard Similarity | Cosine Similarity |
|---|---|---|---|---|---|
| 1 | AQP3 | Aquaporin 3 (Gill Blood Group) | ✓ | ✓ | |
| 2 | CD1B | CD1B Antigen | ✓ | ✓ | |
| 3 | CD1E | CD1E Antigen, E Polypeptide | ✓ | ✓ | |
| 4 | CD2 | CD2 Antigen (P50), Sheep Blood Cell Receptor | ✓ | ✓ | |
| 5 | ✓ | ✓ | |||
| 6 | ✓ | ✓ | |||
| 7 | ✓ | ✓ | |||
| 8 | EPHB6 | EPH Receptor B6 | ✓ | ✓ | |
| 9 | FXYD2 | FXYD Domain containing ion Transport Regulator 2 | ✓ | ✓ | |
| 10 | ✓ | ✓ | |||
| 11 | ✓ | ✓ | |||
| 12 | ✓ | ✓ | |||
| 13 | ✓ | ✓ | |||
| 14 | ✓ | ✓ | |||
| 14 | ✓ | ✓ | |||
| 15 | ✓ | ✓ | |||
| 16 | ✓ | ✓ | |||
| 17 | TRA@ | T Cell Receptor Alpha Locus | ✓ | ✓ | |
| 18 | TRBC1 | T Cell Receptor Beta Constant 1 | ✓ | ✓ | |
| 19 | TRBV19 | T Cell Receptor Beta Variable 19 | ✓ | ✓ | |
| 20 | TRBV21-1 | T Cell Receptor Beta Variable 21-1 | ✓ | ✓ | |
| 21 | TRBV3-1 | T Cell Receptor Beta Variable 3-1 | ✓ | ✓ | |
| 22 | TRBV5-4 | T Cell Receptor Beta Variable 5-4 | ✓ | ✓ | |
| 23 | TRD@ | T Cell Receptor Delta Locus | ✓ | ✓ | |
| 24 | TRIB2 | Tribbles Homolog 2 (Drosophila) | ✓ | ✓ | |
| 25 | VAT1 | Vesicle Amine Transport Protein 1 Homolog (T. californica) | ✓ | ✓ | |
| 26 | Unknown | ✓ | |||
Yeoh et al. 2002. Classification, subtype discovery, and prediction of outcome in pediatric lymphoblastic leukemia by gene expression profiling [52].
Chiaretti et al. 2005. Gene Expression Profiles of B-lineage Adult Acute Lymphocytic Leukemia Reveal Genetic Patterns that Identify Lineage Derivation and Distinct Mechanisms of Transformation [46].
List of Genes Obtained in MLL Dataset
| Index | Probe Id | Gene Symbol | Description | Marker Genes | Jaccard Similarity | Cosine Similarity |
|---|---|---|---|---|---|---|
| 1 | 37954_AT | ANXA8L2 | Annexin A8 | ✓ | ✓ | |
| 2 | 34375_AT | CD163 | CD163 Antigen | ✓ | ✓ | |
| 3 | 34375_AT | Chemokine (C-C Motif) Ligand 2 | ✓ | ✓ | ||
| 4 | 875_G_AT | CCL2 | Chemokine (C-C Motif) Ligand 2 | ✓ | ✓ | |
| 5 | 37187_AT | ✓ | ✓ | |||
| 6 | 36780_AT | CLU | Clusterin | ✓ | ✓ | |
| 7 | 40282_S_AT | ✓ | ✓ | |||
| 8 | 1914_AT | CCNA1 | Cyclin A1 | ✓ | ✓ | |
| 9 | 39660_AT | DEFB1 | Defensin, Beta 1 | ✓ | ✓ | |
| 10 | 864_AT | MNX1 | Homeobox HB9 | ✓ | ✓ | |
| 11 | 37043_AT | ID3 | Inhibitor of DNA Binding 3, Dominant Negative Helix-Loop-Helix Protein | ✓ | ✓ | |
| 12 | 1389_AT | ✓ | ✓ | |||
| 13 | 38604_AT | NPY | Neuropeptide Y | ✓ | ✓ | |
| 14 | 36151_AT | PLD3 | Phospholipase D Family, Member 3 | ✓ | ✓ | |
| 15 | 39208_I_AT | PPBP | Pro-Platelet Basic Protein (Chemokine (C-X-C Motif) Ligand 7) | ✓ | ✓ | |
| 16 | 39209_R_AT | PPBP | Pro-Platelet Basic Protein (Chemokine (C-X-C Motif) Ligand 7) | ✓ | ✓ | |
| 17 | 37185_AT | SERPINB2 | Serpin Peptidase Inhibitor, Clade B (Ovalbumin), Member 2 | ✓ | ✓ | |
| 18 | 1325_AT | SMAD1 | SMAD, Mothers Against DPP Homolog 1 (Drosophila) | ✓ | ✓ | |
| 19 | 37280_AT | SMAD1 | SMAD, Mothers Against DPP Homolog 1 (Drosophila) | ✓ | ✓ | |
| 20 | 41097_AT | ✓ | ✓ | |||
| 21 | 32872_AT | TCF4 | Transcription Factor 4 | ✓ | ✓ | |
| 22 | 35614_AT | ✓ | ✓ | |||
| 23 | 1372_AT | ✓ | ✓ | |||
Bloushtain-Qimron et al. 2008. Cell type-specific DNA methylation patterns in the human breast [47].
Wagner et al. 2005. Hematopoietic Progenitor Cells and Cellular Microenvironment: Behavioral and Molecular Changes upon Interaction [53].
S. Hanash and C. Creighton 2003. Making sense of microarray data to classify cancer [11].
Gandemer et al. 2007. Five distinct biological processes and 14 differentially expressed genes characterize TEL/AML1-positive leukemia [53].
Severin et al. 2009. FANTOM4 EdgeExpressDB: an integrated database of promoters, genes, microRNAs, expression dynamics and regulatory interactions [54].
Armstrong et al. 2002. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia [34].
Genes Listed in the Pathways
| Pathway | Genes Involved | Description |
|---|---|---|
| A | 40688_at, 36277_at, 38319_at, 33238_at | (KEGG) 04660: T cell receptor signaling pathway, (KEGG) 05340: Primary immunodeficiency |
| B | 36277_at, 38319_at, 33238_at | (KEGG) 04660: T cell receptor signaling pathway |
| C | 40688_at, 38147_at, 33238_at | (KEGG) 04640: Hematopoietic cell lineage |
| D | 40688_at, 38319_at, 33238_at | (KEGG) 04660: T cell receptor signaling pathway |
| E | 36277_at, 40738_at, 38319_at | (KEGG) 04650: Natural killer cell mediated cytotoxicity |
| F | 34927_at, 37861_at, 38319_at | (KEGG) 04640: Hematopoietic cell lineage |