Saurav Mallik, Zhongming Zhao.
Abstract
For transcriptomic analysis, a large amount of microarray-based genomic data is available, especially from cancer research. A typical analysis measures, for each transcript or gene, the difference between a cancer sample group and a matched control group. Association rule mining discovers interesting item sets through a rule-based methodology and is therefore well suited to finding cause-effect relationships between transcripts. In this work, we introduce two new rule-based similarity measures, the weighted rank-based Jaccard and Cosine measures, and then propose a novel computational framework that detects condensed gene co-expression modules (ConGEMs) through an association rule-based learning system and the weighted similarity scores. In practice, the list of evolved condensed markers, which includes both singular and complex markers, depends on the corresponding condensed gene sets in either the antecedent or the consequent of the rules of the resultant modules. In our evaluation, these markers could be supported by literature evidence, KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway and Gene Ontology annotations. Specifically, we first identified differentially expressed genes using an empirical Bayes test. A recently developed algorithm, RANWAR, was then used to determine the association rules from these genes. Next, we computed the integrated similarity scores of these rule-based similarity measures between each rule pair, and the resultant scores were used for clustering to identify the co-expressed rule modules. We applied our method to a gene expression dataset for lung squamous cell carcinoma and a genome methylation dataset for uterine cervical carcinogenesis. Our proposed module discovery method produced better results than the traditional gene-module discovery measures. In summary, our proposed rule-based method is useful for exploring biomarker modules from transcriptomic data.
Keywords: Limma; association rule mining; dynamic tree cut method; gene co-expression modules; gene expression markers; lung squamous cell carcinoma
Year: 2017 PMID: 29283433 PMCID: PMC5793160 DOI: 10.3390/genes9010007
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. An example of post-discretization: sub-figure (a) shows the initial matrix of differentially expressed genes; (b) shows the matrix after discretization; (c) shows the matrix after post-discretization; (d) illustrates the application of association rule mining and the identification of top rules, where “+” refers to up-regulation (also denoted by “1”) and “−” refers to down-regulation (also denoted by “0”). Samples are labeled as either diseased/treated or normal.
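The discretization step named in the caption can be illustrated with a minimal Python sketch. It is not the authors' procedure: the threshold (each gene's own row mean) and the function name `discretize` are assumptions made only for illustration, and the post-discretization step is not covered.

```python
import numpy as np

def discretize(expr, genes):
    """Binarize a genes x samples matrix and emit one item list per sample.

    Assumption (not from the source): a value above the gene's own row
    mean counts as up-regulated (1, "+"); anything else counts as
    down-regulated (0, "-").
    """
    expr = np.asarray(expr, dtype=float)
    binary = (expr > expr.mean(axis=1, keepdims=True)).astype(int)
    # One "transaction" per sample: each gene name tagged with its direction.
    transactions = [
        [f"{g}{'+' if b else '-'}" for g, b in zip(genes, column)]
        for column in binary.T
    ]
    return binary, transactions

# Toy values, invented for illustration only.
genes = ["KRT5", "CGN"]
expr = [[1.0, 5.0, 6.0],
        [9.0, 2.0, 1.0]]
binary, transactions = discretize(expr, genes)
```

Each per-sample item list has the shape that an association rule miner expects as input.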
Figure 2. Flowchart of the proposed method, where “w.r.t.” denotes “with respect to”.
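The rule-pair similarity step of the flowchart can be sketched in Python under simplifying assumptions. The sketch uses the plain (unweighted) Jaccard and cosine set similarities, not the weighted rank-based variants proposed in the paper, whose formulas are not given here. Treating a rule as the set of all its items and averaging the two scores with equal weight are also assumptions.

```python
import math
from itertools import combinations

def jaccard(a, b):
    """Plain Jaccard similarity between two item sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(a, b):
    """Cosine similarity between two item sets viewed as binary vectors."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def integrated_similarity(rules):
    """Symmetric matrix of averaged Jaccard and cosine scores per rule pair.

    Assumption: equal-weight averaging stands in for the paper's
    integrated score.
    """
    n = len(rules)
    sim = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        s = 0.5 * (jaccard(rules[i], rules[j]) + cosine(rules[i], rules[j]))
        sim[i][j] = sim[j][i] = s
    return sim

# Hypothetical rules, each given as the set of its antecedent and consequent items.
rules = [
    {"DSC3-", "KRT5-", "TP63-"},
    {"KRT5-", "NTRK2-", "TP63-"},
    {"CGN+", "DST-"},
]
sim = integrated_similarity(rules)
```

A distance matrix derived as `1 - sim` could then be passed to a hierarchical clustering routine to group rules into modules.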
Top ten condensed markers for the lung squamous cell carcinoma (LUSC) dataset.
| Rank | Condensed Marker | Module Label | Availability of Biological Evidence | Status of Condensed Marker |
|---|---|---|---|---|
| 1 | DST- | purple (consequent) | Available | |
| 2 | TP63- | blue, brown (consequent) | Available | |
| 3 | BNC1- | pink (consequent) | Available | |
| 4 | CLCA2- | yellow (consequent) | Available | |
| 5 | GJB5- | dark red (consequent) | Available | |
| 6 | {DSC3-, KRT5-} | dark turquoise (antecedent) | Available for both | |
| 7 | {CGN+, DSC3-} | salmon (antecedent) | Available for both | |
| 8 | {KRT5-, NTRK2-} | blue (antecedent) | Available for both | |
| 9 | {CGN+, KRT5-} | light green (antecedent) | Available for both | |
| 10 | {DSC3-, TMEM40-, NTRK2-} | yellow (antecedent) | Available for DSC3 and NTRK2, not found for TMEM40 | |
Biological validations of individual genes belonging to the condensed markers in Table 1 for the LUSC dataset.
| Individual Gene | Literature Evidence | KEGG Pathway and GO-Terms | |
|---|---|---|---|
| DST | 9.26 | [ | |
| TP63 | 1.27 | [ | |
| BNC1 | 2.82 | [ | |
| CLCA2 | 1.28 | [ | |
| GJB5 | 1.94 | [ | |
| CGN | 1.96 | [ | |
| DSC3 | 3.08 | [ | |
| KRT5 | 6.50 | [ | |
| NTRK2 | 1.47 | [ | |
| TMEM40 | 1.29 | - |
“GO” denotes Gene-Ontology.
Comparison of proposed rule-based gene-module detection method and other existing geneset-based gene-module detection methods for the LUSC dataset.
| Validity Index | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Inf | Inf | Inf | Inf | - | - | - | - | - | - | | |
| 7.82 | 3.94 | 6.55 | 2.89 | - | - | - | - | - | - | | |
| 6.80 | 4.24 | 3.41 | 3.39 | 3.02 | 9.41 | 1.66 | 6.53 | 1.65 | 6.53 | | |
| 1.52 | 8.74 | 1.30 | 8.64 | 8.86 | 5.18 | 1.47 | 5.38 | 1.47 | 5.38 | | |
| | 1.26 | 8.24 | 1.24 | 1.00 | - | - | - | - | - | - | |
| 9.83 | 3.99 | 6.42 | 2.90 | 7.782 | 3.18 | 2.40 | 3.30 | 2.40 | 3.30 | | |
| 1.04 | 7.85 | 1.28 | 6.83 | 7.64 | 4.83 | 1.24 | 4.82 | 1.24 | 4.82 | | |
| | - | - | - | - | - | - | - | - | - | - | |
| - | - | - | - | - | - | - | - | - | - | - |
⇑ signifies that a higher value of the corresponding validity index is better in determining the gene modules, while ⇓ denotes the reverse of the above statement. For each validity index, an entry denoted with bold font indicates that the corresponding method is the best performer in terms of the corresponding index (row-wise). wTOM[pcc]: weighted TOM using Pearson’s correlation coefficient; wTOM[sc]: weighted TOM using Spearman’s correlation; GTOM0[pcc]: generalized TOM of degree 0 using Pearson’s correlation coefficient; GTOM0[sc]: generalized TOM of degree 0 using Spearman’s correlation; avgDI: average Dunn index; avgSW: average Silhouette width; avgSC: average scaled connectivity; avgCC: average cluster coefficient; avgMAR: average maximum adjacency ratio.
Comparison of proposed rule-based gene-module detection method and other existing geneset-based gene-module detection methods for the cervical carcinogenesis dataset.
| Validity Index | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Inf | Inf | Inf | Inf | Inf | - | Inf | - | Inf | - | | |
| 1.85 | 3.13 | 2.12 | 9.16 | 1.61 | - | - | - | | | | |
| 7.41 | 9.99 | 6.39 | 9.77 | 5.49 | 9.90 | 9.90 | 1.26 | 9.90 | 1.27 | | |
| 9.64 | 2.76 | 9.55 | 2.58 | 1.81 | 1.98 | 1.99 | | | | | |
| 3.25 | 2.53 | 9.52 | 2.75 | - | - | - | - | - | - | | |
| 1.87 | 9.64 | 2.11 | 9.49 | 1.59 | 1.01 | 1.12 | 1.26 | | | | |
| 1.26 | 1.04 | 1.21 | 2.31 | 1.01 | 8.54 | 1.01 | 8.80 | 1.01 | 8.84 | | |
| - | - | - | - | - | - | - | - | - | - | | |
| - | - | - | - | - | - | - | - | - | - | - |
⇑ signifies that a higher value of the corresponding validity index is better in determining the gene modules, while ⇓ denotes the reverse of the above statement. For each validity index, an entry denoted with bold font indicates that the corresponding method is the best performer in terms of the corresponding index (row-wise).
Comparison of proposed rule-based gene-module detection method and other existing geneset-based gene-module detection methods in the first simulation study for the LUSC dataset.
| Validity Index | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Inf | Inf | Inf | Inf | Inf | Inf | Inf | Inf | Inf | Inf | | |
| 5.91 | 2.97 | - | 5.50 | - | 5.54 | 2.80 | 8.66 | 4.51 | 6.11 | | |
| 6.74 | 9.01 | 9.10 | 9.10 | 9.10 | 6.92 | 8.83 | 6.83 | 9.06 | 6.81 | | |
| 2.51 | 6.55 | 5.29 | 5.29 | 5.29 | 6.70 | 5.14 | 8.71 | 6.16 | | 7.20 | |
| 2.98 | 5.31 | 5.31 | 5.31 | - | - | - | - | - | - | | |
| 2.19 | 6.50 | 5.27 | 5.27 | 5.27 | 5.09 | 1.25 | 7.95 | 2.34 | 3.38 | | |
| | 7.41 | 5.47 | 5.47 | 5.47 | 1.01 | 9.04 | 1.04 | 1.04 | 8.89 | 8.89 | |
| | - | - | - | - | - | - | - | - | - | - | |
| | - | - | - | - | - | - | - | - | - | - |
⇑ signifies that a higher value of the corresponding validity index is better in determining the gene modules, while ⇓ denotes the reverse of the above statement. For each validity index, an entry denoted with bold font indicates that the corresponding method is the best performer in terms of the corresponding index (row-wise).
Comparison of proposed rule-based gene-module detection method and other existing geneset-based gene-module detection methods in the second simulation study for the LUSC dataset.
| Validity Index | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Inf | Inf | Inf | - | - | - | - | - | - | - | | |
| 7.11 | 5.51 | 1.00 | - | - | - | - | - | - | - | | |
| 6.83 | 5.50 | 4.34 | 4.34 | 4.34 | 1.51 | 2.37 | 1.18 | 2.65 | 1.14 | | |
| 1.96 | 1.09 | 1.09 | 1.09 | 1.42 | 7.85 | 1.98 | 9.23 | 2.39 | 9.87 | | |
| | 1.71 | 1.05 | 1.05 | 1.05 | - | - | - | - | - | - | |
| | 1.42 | 5.99 | 5.99 | 5.99 | 2.23 | 7.90 | 5.02 | 1.04 | 6.23 | 1.09 | |
| 1.03 | 1.21 | 8.84 | 8.84 | 8.84 | 1.34 | 7.68 | 1.68 | 8.13 | 8.79 | | |
| | - | - | - | - | - | - | - | - | - | - | |
| - | - | - | - | - | - | - | - | - | - |
⇑ signifies that a higher value of the corresponding validity index is better in determining the gene modules, while ⇓ denotes the reverse of the above statement. For each validity index, an entry denoted with bold font indicates that the corresponding method is the best performer in terms of the corresponding index (row-wise).
Comparison of proposed rule-based gene-module detection method and other existing geneset-based gene-module detection methods in the first simulation study for the cervical carcinogenesis dataset.
| Validity Index | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Inf | Inf | Inf | Inf | Inf | - | Inf | - | Inf | - | | |
| 9.33 | 3.83 | 1.89 | 9.07 | 1.55 | | - | - | - | | | |
| 7.45 | 9.91 | 6.20 | 9.80 | 9.80 | 1.30 | 9.80 | 1.30 | | | | |
| 2.59 | 9.60 | 2.63 | 1.20 | 1.20 | 1.20 | 1.25 | | 1.25 | | | |
| 3.26 | 2.39 | - | - | - | - | - | - | - | - | | |
| 1.89 | 1.85 | 1.49 | 1.49 | 1.49 | 1.59 | 1.59 | | | | | |
| 6.47 | 9.32 | | 1.12 | 1.12 | 2.04 | 1.12 | 2.04 | 1.11 | 2.04 | 1.11 | |
| - | - | - | - | - | - | - | - | - | - | | |
| | - | - | - | - | - | - | - | - | - | - |
⇑ signifies that a higher value of the corresponding validity index is better in determining the gene modules, while ⇓ denotes the reverse of the above statement. For each validity index, an entry denoted with bold font indicates that the corresponding method is the best performer in terms of the corresponding index (row-wise).
Comparison of the proposed rule-based gene-module detection method and other existing geneset-based gene-module detection methods in the second simulation study for the cervical carcinogenesis dataset.
| Validity Index | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Inf | Inf | Inf | Inf | Inf | - | Inf | - | Inf | - | | |
| 9.88 | 3.67 | 2.78 | 9.15 | 2.19 | - | | - | - | | | |
| 7.37 | 9.92 | 7.62 | 9.80 | 9.80 | 9.80 | 9.80 | 1.47 | 9.80 | 1.48 | | |
| 2.58 | 9.71 | 2.43 | 1.24 | 1.36 | 1.39 | | | | | | |
| 3.26 | 3.17 | - | - | - | - | - | - | - | - | | |
| 1.88 | 2.77 | 9.60 | 9.60 | 9.60 | 1.64 | 9.60 | 2.00 | 9.60 | 2.04 | | |
| 6.72 | 7.9 | 2.04 | 2.04 | 2.04 | 1.17 | 2.04 | 1.21 | 2.04 | 1.22 | | |
| - | - | - | - | - | - | - | - | - | - | | |
| | - | - | - | - | - | - |
⇑ signifies that a higher value of the corresponding validity index is better in determining the gene modules, while ⇓ denotes the reverse of the above statement. For each validity index, an entry denoted with bold font indicates that the corresponding method is the best performer in terms of the corresponding index (row-wise).
Summary of the comparative performance of our proposed rule-module discovery method (in rows) against the traditional gene-module discovery methods using several existing similarity measures (in columns), for the original and simulated LUSC datasets (denoted as “LUSC”, “LUSC sm1”, and “LUSC sm2”, respectively) and the original and simulated cervical datasets (referred to as “Cervical”, “Cervical sm1”, and “Cervical sm2”, respectively).
| Dataset | Method | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| LUSC | 5-1-3 | 7-1-1 | 6-1-2 | 7-1-1 | 7-1-1 | 7-1-1 | 6-1-2 | 7-1-1 | 6-1-2 | 7-1-1 | |
| Cervical | 4-1-4 | 5-1-3 | 4-1-4 | 6-1-2 | 5-1-3 | 7-1-1 | 5-1-3 | 7-1-1 | 5-1-3 | 7-1-1 | |
| LUSC sm1 | 5-0-4 | 7-0-2 | 6-0-3 | 7-0-2 | 5-0-4 | 5-0-4 | 5-0-4 | 6-0-3 | 6-0-3 | 5-0-4 | |
| LUSC sm2 | 6-0-3 | 8-0-1 | 7-0-2 | 8-0-1 | 7-0-2 | 8-0-1 | 7-0-2 | 8-0-1 | 7-0-2 | 8-0-1 | |
| Cervical sm1 | 5-0-4 | 5-0-4 | 6-0-3 | 6-0-3 | 6-0-3 | 8-0-1 | 6-0-3 | 7-0-2 | 6-0-3 | 7-0-2 | |
| Cervical sm2 | 5-0-4 | 6-0-3 | 6-0-3 | 6-0-3 | 6-0-3 | 7-0-2 | 6-0-3 | 7-0-2 | 6-0-3 | 7-0-2 |
The entry at row X under column Y represents the win-draw-loss of X compared to Y.
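A win-draw-loss entry of this kind can be tallied as in the following minimal Python sketch. The function name `win_draw_loss`, the index names, and the numeric values are hypothetical; the sketch only assumes that each validity index is marked as either "higher is better" or "lower is better", as the ⇑/⇓ notation in the comparison tables indicates.

```python
def win_draw_loss(proposed, baseline, higher_is_better):
    """Tally wins, draws, and losses of `proposed` against `baseline`.

    All three arguments are dicts keyed by validity-index name;
    `higher_is_better[name]` is True when a larger value is preferred.
    """
    wins = draws = losses = 0
    for name, better_high in higher_is_better.items():
        p, b = proposed[name], baseline[name]
        if p == b:
            draws += 1
        elif (p > b) == better_high:
            wins += 1
        else:
            losses += 1
    return f"{wins}-{draws}-{losses}"

# Hypothetical index names and values, for illustration only.
direction = {"index_a": True, "index_b": False, "index_c": True}
proposed = {"index_a": 0.8, "index_b": 0.2, "index_c": 1.0}
baseline = {"index_a": 0.5, "index_b": 0.1, "index_c": 1.0}
summary = win_draw_loss(proposed, baseline, direction)
```

With nine validity indices per comparison, the three counts in each table entry sum to nine.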