Keisuke Nagata, Takashi Washio, Yoshinobu Kawahara, Akira Unami.
Abstract
Recent technologies in biology, such as DNA microarrays and next-generation sequencing, have given researchers large volumes of data representing genome-wide biological responses, yet it is not necessarily easy to derive knowledge from them that is both accurate and understandable. In this study, we applied the Classification Based on Association (CBA) algorithm, a class association rule mining technique, to the TG-GATEs database, which stores both toxicogenomic and toxicological data for more than 150 compounds in rat and human. We compared the classifiers generated by CBA and by linear discriminant analysis (LDA) and showed that CBA is superior to LDA in both predictive performance (accuracy: 83% for CBA vs. 75% for LDA; sensitivity: 82% vs. 72%; specificity: 85% vs. 75%) and interpretability.
Keywords: CBA; Class association rule mining; Microarray; Toxicogenomics
Year: 2014 PMID: 28962323 PMCID: PMC5598536 DOI: 10.1016/j.toxrep.2014.10.014
Source DB: PubMed Journal: Toxicol Rep ISSN: 2214-7500
Exploration of various CBA settings.
| minsup (%) | minconf (%) | Average accuracy (%) | Total time (s) |
|---|---|---|---|
| (A) When minsup was fixed at 10% | |||
| 10 | 50 | 77 | 0.61 |
| 10 | 80 | 76 | 0.59 |
| 10 | 90 | 79 | 0.58 |
| 10 | 100 | 77 | 0.58 |
| (B) When minconf was fixed at 90% | |||
| 20 | 90 | 0 | 0.42 |
| 15 | 90 | 9 | 0.42 |
| 10 | 90 | 79 | 0.58 |
| 8 | 90 | 83 | 22.37 |
| 7 | 90 | Insufficient memory | |
Accuracy of CBA classifiers for increased relative liver weight was evaluated in 10-fold cross validations under various combinations of minsup and minconf.
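The roles of minsup and minconf in the table above can be illustrated with a minimal sketch of how support and confidence are computed for one class association rule. It assumes the usual CBA convention: support is the fraction of all records that match both the antecedent and the class, and confidence is that count divided by the number of records matching the antecedent. The toy records, the item names such as `geneA:Inc`, and the helpers `rule_stats` and `passes_thresholds` are hypothetical and are not taken from the TG-GATEs data or the authors' implementation.

```python
from typing import FrozenSet, List, Tuple

# A record is (set of non-class items, class label). All values are toy data.
Record = Tuple[FrozenSet[str], str]


def rule_stats(records: List[Record],
               antecedent: FrozenSet[str],
               consequent: str) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> consequent."""
    # Records whose items contain every item of the antecedent.
    n_antecedent = sum(1 for items, _ in records if antecedent <= items)
    # Records that match the antecedent and also carry the consequent class.
    n_rule = sum(1 for items, label in records
                 if antecedent <= items and label == consequent)
    support = n_rule / len(records)
    confidence = n_rule / n_antecedent if n_antecedent else 0.0
    return support, confidence


def passes_thresholds(support: float, confidence: float,
                      minsup: float, minconf: float) -> bool:
    """A rule is kept only if it meets both minimum thresholds."""
    return support >= minsup and confidence >= minconf


records: List[Record] = [
    (frozenset({"geneA:Inc", "geneB:Inc"}), "RLW:Inc"),
    (frozenset({"geneA:Inc"}), "RLW:Inc"),
    (frozenset({"geneA:Inc", "geneC:Dec"}), "RLW:NI"),
    (frozenset({"geneB:Inc"}), "RLW:NI"),
    (frozenset({"geneC:Dec"}), "RLW:NI"),
]

support, confidence = rule_stats(records, frozenset({"geneA:Inc"}), "RLW:Inc")
kept_loose = passes_thresholds(support, confidence, minsup=0.10, minconf=0.50)
kept_strict = passes_thresholds(support, confidence, minsup=0.10, minconf=0.90)
```

In this toy case the rule matches 2 of 5 records (support 0.4) and is correct for 2 of the 3 records matching its antecedent (confidence about 0.67), so it survives minconf = 50% but not minconf = 90%. Lowering minsup admits more candidate rules, which is consistent with the longer run times and the memory failure reported at the lowest minsup values.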
Comparison of predictive performances (values are averages over 10-fold cross validation).
| Method | Target direction | Total | TP | FN | FP | TN | Hold | Accuracy (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| CBA | Inc | 14.9 | 4.4 | 1.1 | 1.4 | 8 | – | 83 | 82 | 85 |
| LDA | Inc | 14.9 | 2.7 | 1 | 2.8 | 8.4 | – | 75 | 72 | 75 |
| CBA-DR | Inc | 14.9 | 4.4 | 0 | 1.4 | 0.8 | 8.3 | 79 | 100 | 29 |
| CBA | Dec | 14.9 | 0.2 | 0.7 | 1.4 | 12.6 | – | 86 | 22 | 90 |
| LDA | Dec | 14.9 | 0.2 | 3.3 | 0.7 | 10.7 | – | 73 | 6 | 95 |
| CBA-DR | Dec | 14.9 | 0 | 0.7 | 0 | 12.6 | 1.6 | 95 | 0 | 100 |
Predictive performance of classifiers was compared among CBA, LDA, CBA-DR with 10-fold cross validation.
Target direction: whether the classifier was built to predict increased (Inc) or decreased (Dec) relative liver weight. Total: average number of records in the test set of each trial in a cross validation. TP: average number of true positive records in a test set. FN: average number of false negative records in a test set. FP: average number of false positive records in a test set. TN: average number of true negative records in a test set. Hold: average number of records in a test set that did not match any rule except the default rule (only for CBA-DR).
Note that accuracy, sensitivity and specificity for the CBA-DR method were calculated excluding ‘hold’ samples. Totals are not integers here, as the number of records in the original dataset was 149 and thus cannot be divided by 10, the number of trials in the cross validation.
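As a rough check of the definitions used in this table, the sketch below recomputes accuracy, sensitivity and specificity from the averaged confusion counts, leaving 'hold' records out of the denominator for CBA-DR as described in the note above. The function names `metrics` and `fold_sizes` are hypothetical, and the even split into folds is an assumption about how 149 records might be partitioned, not a description of the authors' procedure. Because the published percentages are averages over folds, recomputing them from averaged counts can differ by a point or two for some rows.

```python
import math
from typing import Dict, List


def metrics(tp: float, fn: float, fp: float, tn: float) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity from confusion counts.

    'Hold' records are simply not passed in, so they are excluded
    from every denominator.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else math.nan,
        "specificity": tn / (tn + fp) if (tn + fp) else math.nan,
    }


def fold_sizes(n_records: int, k: int) -> List[int]:
    """Sizes of k folds that are as equal as possible (assumed split)."""
    base, extra = divmod(n_records, k)
    return [base + 1] * extra + [base] * (k - extra)


# Averaged counts taken from the table rows for decreased liver weight.
cba_dec = metrics(tp=0.2, fn=0.7, fp=1.4, tn=12.6)
cba_dr_dec = metrics(tp=0.0, fn=0.7, fp=0.0, tn=12.6)  # 1.6 'hold' records excluded

sizes = fold_sizes(149, 10)
mean_test_size = sum(sizes) / len(sizes)
```

With these inputs, the CBA row for decreased relative liver weight rounds to 86%, 22% and 90%, and the CBA-DR row to 95%, 0% and 100%, in line with the table. Splitting 149 records into 10 folds gives nine folds of 15 and one of 14, hence the average test-set size of 14.9.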
Fig. 1. Comparison of the classifier form between CBA and LDA. The forms of the generated classifiers were compared between CBA and LDA when all the records were used as a training set for increased relative liver weight. [CBA] The classifier consists of a set of rules, represented as “antecedent → consequence, support, confidence”, one rule per line, in order of confidence. An antecedent is a set of non-class items, each item represented as (gene_id, Inc or Dec). A consequence is a class label that is used as the classification result if the corresponding antecedent is satisfied, shown here as (RLW, Inc or NI). The final rule, with the antecedent (NULL), is the default rule, which is satisfied by any record and is applied if none of the preceding rules is met. [LDA] The classifier is shown as a discriminative function, fd. fc(gene_id) is the fold change of the gene specified by gene_id. If fd is positive, the classifier predicts RLW as Inc; otherwise, it predicts RLW as NI. gene_id: represented here as an Affymetrix probe ID. RLW: relative liver weight. Inc: increased. Dec: decreased. NI: not increased.
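The two classifier forms described in this legend can be sketched as follows. This is a minimal illustration under stated assumptions, not the classifiers from the paper: the probe names (`probe_1` and so on), the rules, the weights and the intercept are all invented, and writing fd as an intercept plus a weighted sum of fold changes is an assumption about the general shape of an LDA discriminant function.

```python
from typing import Dict, FrozenSet, List, Tuple

Item = Tuple[str, str]              # (gene_id, "Inc" or "Dec")
Rule = Tuple[FrozenSet[Item], str]  # (antecedent, class label)


def classify_cba(rules: List[Rule], default_label: str,
                 record: FrozenSet[Item]) -> str:
    """Apply an ordered rule list: the first satisfied rule decides.

    `rules` is assumed to be already sorted (for example by confidence).
    The default rule applies when no earlier antecedent is satisfied.
    """
    for antecedent, label in rules:
        if antecedent <= record:  # every item of the antecedent is present
            return label
    return default_label


def classify_lda(weights: Dict[str, float], intercept: float,
                 fold_changes: Dict[str, float]) -> str:
    """Predict 'Inc' when the linear discriminant fd is positive, else 'NI'."""
    fd = intercept + sum(w * fold_changes[g] for g, w in weights.items())
    return "Inc" if fd > 0 else "NI"


# Hypothetical rule list and LDA coefficients, for illustration only.
rules: List[Rule] = [
    (frozenset({("probe_1", "Inc"), ("probe_2", "Inc")}), "Inc"),
    (frozenset({("probe_3", "Dec")}), "NI"),
]
weights = {"probe_1": 1.5, "probe_2": -0.5}

first_match = classify_cba(
    rules, "NI",
    frozenset({("probe_1", "Inc"), ("probe_2", "Inc"), ("probe_3", "Dec")}),
)
fallback = classify_cba(rules, "NI", frozenset({("probe_4", "Inc")}))
lda_positive = classify_lda(weights, -1.0, {"probe_1": 1.0, "probe_2": 0.5})
lda_negative = classify_lda(weights, -1.0, {"probe_1": 0.2, "probe_2": 1.0})
```

The contrast in interpretability is visible even in this toy form: a CBA prediction can be traced to one readable rule, whereas an LDA prediction depends on the sign of a sum over all weighted fold changes.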
Canonical pathway analysis of CBA classifier.
| Pathway name | −log p | Molecules (total) | Molecules (Inc) | Molecules (Dec) | Corresponding genes |
|---|---|---|---|---|---|
| Xenobiotic metabolism signaling | 8.96 | 219 | 8 | 0 | Gsta3, Aldh1a1, Ugt2b1, Nqo1, RGD1559459, Cyp2b2, Ces2c, Sult2a2 |
| LPS/IL-1 mediated inhibition of RXR function | 5.07 | 178 | 4 | 1 | Abcg8, Gsta3 |
| PXR/RXR activation | 3.95 | 58 | 3 | 0 | Aldh1a1, Cyp2b2, Sult2a2 |
| Aryl hydrocarbon receptor signaling | 2.94 | 127 | 3 | 0 | Gsta3, Aldh1a1, Nqo1 |
| Nicotine Degradation III | 2.77 | 37 | 2 | 0 | Ugt2b1, Cyp2b2 |
| Melatonin Degradation I | 2.75 | 38 | 2 | 0 | Ugt2b1, Cyp2b2 |
| Serotonin degradation | 2.67 | 42 | 2 | 0 | Aldh1a1, Ugt2b1 |
| Superpathway of melatonin degradation | 2.67 | 42 | 2 | 0 | Ugt2b1, Cyp2b2 |
| NRF2-mediated oxidative stress response | 2.66 | 159 | 3 | 0 | Gsta3, Akr7a3, Nqo1 |
| Nicotine Degradation II | 2.65 | 43 | 2 | 0 | Ugt2b1, Cyp2b2 |
| Histidine Degradation III | 2 | 6 | 0 | 1 | Hal |
The canonical pathway analysis was conducted with the Ingenuity IPA software for the genes included in the CBA classifier when all the records were used as a training set for increased relative liver weight. Note that, for brevity, only the top 10 pathways in order of −log p are shown here.
−log p: −log of p, where p is a value representing statistical significance in the analysis. A smaller p value (and thus a larger −log p value) means that the involvement of the pathway is more statistically significant. Molecules: the total number of molecules in each pathway and the numbers of increased (upregulated) and decreased (downregulated) molecules are shown. Corresponding genes: the rat genes corresponding to the increased or decreased molecules included in the pathway are shown.
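The relationship between p and −log p described above can be made concrete with a small conversion sketch. The base of the logarithm is not stated in the text; base 10 is assumed here and exposed as a parameter, and the helper name `p_from_neg_log` is hypothetical.

```python
def p_from_neg_log(neg_log_p: float, base: float = 10.0) -> float:
    """Recover p from a -log p value, assuming the given logarithm base."""
    return base ** (-neg_log_p)


# Values from the table: the top-ranked pathway and the last listed one.
p_xenobiotic = p_from_neg_log(8.96)  # Xenobiotic metabolism signaling
p_histidine = p_from_neg_log(2.0)    # Histidine Degradation III
```

Under the base-10 assumption, a −log p of 2 corresponds to p = 0.01, while 8.96 corresponds to a p value on the order of 10⁻⁹, which is why pathways are ranked by descending −log p.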
Fig. 2. Canonical pathway illustrations of the CBA classifier. [A] An excerpt around the NRF2 molecule from the illustration of the Xenobiotic Metabolism Signaling pathway, exported from IPA. [B] Overlap among the canonical pathways detected as significant, which were divided into three clusters, exported from IPA. Each node corresponds to a canonical pathway detected as significant. Each link represents molecules shared between two pathways. The color depth of a node corresponds to its −log p value (the deeper the color, the larger the −log p value). The line width of a link corresponds to the number of molecules shared between the two pathways (no line means no shared molecules). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Details and category of the genes in our CBA classifier.
| Affymetrix probe ID | Gene symbol | Change direction | Gene name or detail |
|---|---|---|---|
| 1368121_at | Akr7a3 | Inc | Aldo-keto reductase family 7, member A3 (aflatoxin aldehyde reductase) |
| 1381852_at | RGD1559459 | Inc | Similar to Expressed sequence AI788959 (Ugt2b34, |
| 1387022_at | Aldh1a1 | Inc | Aldehyde dehydrogenase 1 family, member A1 |
| 1368905_at | Ces2c | Inc | Carboxylesterase 2C |
| 1371076_at | Cyp2b2 | Inc | Cytochrome P450, family 2, subfamily b, polypeptide 2 |
| 1371089_at | Gsta3 | Inc | Glutathione S-transferase alpha 3 |
| 1387599_a_at | Nqo1 | Inc | NAD(P)H dehydrogenase, quinone 1 |
| 1370698_at | Ugt2b1 | Inc | UDP glucuronosyltransferase 2 family, polypeptide B1 |
| 1387006_at | Sult2a2 | Inc | Sulfotransferase family 2A, dehydroepiandrosterone (DHEA)-preferring, member 2 |
| 1371942_at | Gstt3 | Inc | Glutathione S-transferase, theta 3 |
| 1370067_at | Me1 | Inc | Malic enzyme 1, NADP(+)-dependent, cytosolic |
| 1387307_at | Hal | Dec | Histidine ammonia-lyase |
| 1387783_a_at | Acaa1b | Inc | Acetyl-Coenzyme A acyltransferase 1B |
| 1370828_at | Zdhhc2 | Inc | Zinc finger, DHHC-type containing 2 |
| 1375845_at | Aig1 | Inc | Androgen-induced 1 |
| 1371143_at | Serpina7 | Inc | Serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 7 |
| 1390145_at | Dmxl2 | Dec | Dmx-like 2 |
| 1384225_at | (NA) | Dec | (NA) |
| 1369440_at | Abcg8 | Dec | ATP-binding cassette, subfamily G (WHITE), member 8 |
| 1377599_at | Lpin1 | Inc | Lipin 1 |
| 1373814_at | R3hdm2 | Dec | R3H domain containing 2 |
| 1389253_at | Vnn1 | Inc | Vanin 1 |
Affymetrix probe IDs, gene symbols and gene names for each gene in our CBA classifier are summarized. The genes are divided into four categories: drug metabolism, gluconeogenesis, histidine degradation, and other.
Change direction: the direction of change (Inc or Dec) in the classifier. NA: not available. No further information is available for the gene with Affymetrix probe ID 1384225_at.
Fig. 3. Our CBA classifier with categorized gene symbols. The CBA classifier, the same one as in Fig. 1, is shown again, with the genes converted from Affymetrix probe IDs to gene symbols and colored according to their category. Red: drug metabolism-related. Blue: gluconeogenesis-related. Green: histidine degradation-related. Black: other. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
| Metric | Definition |
|---|---|
| Sensitivity | True positive/(true positive + false negative) |
| Specificity | True negative/(true negative + false positive) |
| Accuracy | (True positive + true negative)/total |