Abstract
Associative classification mining (ACM) can provide predictive models that are both highly accurate and interpretable. However, traditional ACM ignores differences in significance among the features used for mining. Although weighted associative classification mining (WACM) addresses this issue by assigning different weights to features, most implementations can be used only when pre-assigned weights are available. In this paper, we propose a link-based approach that automatically derives weight information from a dataset by treating the dataset as a bipartite model. By combining this link-based feature weighting method with a traditional ACM method, classification based on associations (CBA), we develop a Link-based Associative Classifier (LAC). We then demonstrate the application of LAC to biomedical datasets for discovering associations between chemical compounds and bioactivities or diseases. The results indicate that the novel link-based weighting method is comparable to the support vector machine (SVM) and RELIEF methods and is capable of capturing significant features. Additionally, LAC is shown to produce models with high accuracy and to discover interesting associations that may otherwise remain unrevealed by traditional ACM.
Year: 2012 PMID: 23227228 PMCID: PMC3515483 DOI: 10.1371/journal.pone.0051018
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The bipartite model of a dataset.
(The bipartite model is also a heterogeneous system. Blue nodes represent active compounds and red nodes represent inactive compounds; both link to the green nodes, which represent features/attributes.)
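One way to read the bipartite model is as a graph in which each compound links to the features it contains, so that a link-analysis algorithm can score the feature nodes. The sketch below uses a HITS-style hub/authority iteration as a stand-in. This section does not say which link-based model LAC uses, so the choice of HITS, the function name `hits_feature_scores`, and the iteration count are all assumptions. The toy input is the visible MDL keys of the six-compound example in this document (elided keys are ignored), class labels are not used, and the scores are not expected to reproduce the published weights.

```python
import math

# Visible MDL keys of the six example compounds; the source elides the
# remaining keys with "…", so only keys 81-85 are modelled here.
compounds = {
    "C1": {81, 82, 83, 84},
    "C2": {82, 84},
    "C3": {81, 84},
    "C4": {81, 82, 84, 85},
    "C5": {81, 82, 83, 84, 85},
    "C6": {82, 83, 85},
}


def _normalize(scores):
    """Scale a score dict to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: v / norm for k, v in scores.items()}


def hits_feature_scores(compounds, iterations=50):
    """HITS-style iteration (an assumed stand-in, not the paper's method).

    Compounds act as hubs and features as authorities; the returned dict
    maps each feature to its authority score.
    """
    features = sorted(set().union(*compounds.values()))
    hub = {c: 1.0 for c in compounds}
    auth = {f: 1.0 for f in features}
    for _ in range(iterations):
        # A feature is scored by the hubs (compounds) that contain it.
        auth = _normalize({
            f: sum(hub[c] for c, keys in compounds.items() if f in keys)
            for f in features
        })
        # A compound is scored by the features it contains.
        hub = _normalize({
            c: sum(auth[f] for f in keys) for c, keys in compounds.items()
        })
    return auth


scores = hits_feature_scores(compounds)
for feature, score in sorted(scores.items()):
    print(feature, round(score, 3))
```

On this toy graph, a feature shared by more and better-connected compounds receives a higher score, which is the general idea behind deriving weights from link structure rather than assigning them by hand.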
A compound dataset encoded by MDL public keys.
| CID | MDL fingerprint |
| C1 | {…81,82,83,84…} |
| C2 | {…82,84…} |
| C3 | {…81,84…} |
| C4 | {…81,82,84,85…} |
| C5 | {…81,82,83,84,85…} |
| C6 | {…82,83,85…} |
MDL public keys and their weights.
| Feature | Weight |
| 81 | 0.8 |
| 82 | 1 |
| 83 | 0.8 |
| 84 | 1.6 |
| 85 | 1 |
Supports and types of itemsets (frequent or not).
| Itemset | Classical support | Classical frequent | Weighted support | Weighted frequent | Adjusted weighted support | Adjusted weighted frequent |
| 81 | 0.67 | Y | 0.53 | Y | 0.75 | Y |
| 83 | 0.50 | Y | 0.4 | Y | 0.66 | Y |
| 81 83 | 0.33 | Y | | | | |
| 83 84 | 0.33 | Y | | | | |
| 81 84 | 0.67 | Y | 0.8 | Y | 0.75 | Y |
| 81 83 84 | 0.33 | Y | 0.35 | Y | 0.44 | Y |
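The classical and weighted supports listed above can be checked against the six-compound example. The sketch below is a minimal illustration, assuming that weighted support is the classical support multiplied by the mean weight of the items in the itemset. That formula is inferred from the listed values and is not stated in this section. The adjusted weighted support is not modelled because its definition is not given here. Only the visible keys 81-85 are used, and the function names are hypothetical.

```python
# Visible keys of compounds C1-C6 (elided keys in the source are ignored).
transactions = [
    {81, 82, 83, 84},
    {82, 84},
    {81, 84},
    {81, 82, 84, 85},
    {81, 82, 83, 84, 85},
    {82, 83, 85},
]

# Feature weights as listed in the weight table.
weights = {81: 0.8, 82: 1.0, 83: 0.8, 84: 1.6, 85: 1.0}


def classical_support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def weighted_support(itemset, transactions, weights):
    """Assumed definition: mean item weight times classical support."""
    itemset = set(itemset)
    mean_weight = sum(weights[i] for i in itemset) / len(itemset)
    return mean_weight * classical_support(itemset, transactions)


for itemset in ({81}, {83}, {81, 84}, {81, 83, 84}):
    print(
        sorted(itemset),
        round(classical_support(itemset, transactions), 2),
        round(weighted_support(itemset, transactions, weights), 2),
    )
```

Under this assumption the itemset {81, 83, 84} comes out at about 0.356, while the table lists 0.35; the small gap is consistent with the table multiplying the already rounded support 0.33 by the mean weight.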
Figure 2. Link-based weighting.
Figure 3. Weighted associative classification.
Figure 4. Results of different weighting methods.
Correlation analyses of the weighting results.
| Method | Statistic | Frequency | SVM | RELIEF | LAC |
| Frequency | Pearson correlation | 1 | .776 | .791 | .947 |
| Frequency | Sig. (2-tailed) | | .000 | .000 | .000 |
| SVM | Pearson correlation | .776 | 1 | .949 | .759 |
| SVM | Sig. (2-tailed) | .000 | | .000 | .000 |
| RELIEF | Pearson correlation | .791 | .949 | 1 | .712 |
| RELIEF | Sig. (2-tailed) | .000 | .000 | | .000 |
| LAC | Pearson correlation | .947 | .759 | .712 | 1 |
| LAC | Sig. (2-tailed) | .000 | .000 | .000 | |
Correlation is significant at the 0.01 level (2-tailed).
The rankings of chemical features from frequency and LAC.
| Bit | Frequency | LAC | Bit | Frequency | LAC | Bit | Frequency | LAC | Bit | Frequency | LAC |
| 1 | 1 | 1 | 43 | 24 | 19 | | | | 126 | 110 | 101 |
| 2 | 1 | 1 | | | | | | | 127 | 152 | 152 |
| | | | | | | | | | 128 | 100 | 80 |
| 4 | 1 | 1 | | | | | | | 129 | 94 | 77 |
| 5 | 1 | 1 | 47 | 44 | 44 | 89 | 96 | 93 | | | |
| 6 | 1 | 1 | 48 | 40 | 39 | 90 | 73 | 67 | 131 | 118 | 111 |
| 7 | 1 | 1 | | | | 91 | 66 | 61 | 132 | 111 | 91 |
| 8 | 12 | 12 | 50 | 51 | 48 | | | | | | |
| 9 | 1 | 1 | 51 | 32 | 26 | | | | 134 | 102 | 98 |
| 10 | 1 | 1 | | | | | | | | | |
| 11 | 13 | 13 | 53 | 52 | 50 | 95 | 88 | 88 | 136 | 117 | 112 |
| 12 | 1 | 1 | | | | | | | 137 | 137 | 137 |
| | | | 55 | 35 | 31 | 97 | 99 | 99 | 138 | 139 | 129 |
| 14 | 8 | 8 | | | | | | | 139 | 123 | 115 |
| 15 | 5 | 3 | | | | 99 | 98 | 94 | | | |
| | | | 58 | 37 | 35 | 100 | 82 | 82 | 141 | 156 | 156 |
| 17 | 7 | 7 | | | | | | | | | |
| 18 | 2 | 2 | 60 | 39 | 34 | | | | 143 | 124 | 122 |
| | | | 61 | 41 | 36 | | | | | | |
| 20 | 6 | 4 | | | | 104 | 80 | 70 | | | |
| | | | | | | | | | 146 | 135 | 128 |
| | | | 64 | 34 | 33 | 106 | 75 | 73 | 147 | 112 | 92 |
| 23 | 14 | 14 | | | | 107 | 81 | 81 | | | |
| | | | 66 | 54 | 49 | 108 | 70 | 58 | 149 | 120 | 110 |
| | | | 67 | 49 | 46 | 109 | 103 | 87 | 150 | 126 | 123 |
| | | | | | | | | | | | |
| 27 | 10 | 9 | | | | 111 | 91 | 84 | 152 | 138 | 132 |
| | | | | | | 112 | 87 | 72 | 153 | 131 | 120 |
| 29 | 19 | 15 | | | | 113 | 105 | 104 | 154 | 140 | 126 |
| 30 | 11 | 11 | | | | 114 | 67 | 60 | 155 | 132 | 119 |
| | | | 73 | 45 | 40 | 115 | 83 | 64 | | | |
| 32 | 29 | 29 | 74 | 60 | 52 | 116 | 86 | 74 | 157 | 144 | 139 |
| | | | | | | 117 | 108 | 103 | 158 | 146 | 146 |
| 34 | 31 | 20 | 76 | 61 | 57 | 118 | 71 | 63 | 159 | 149 | 144 |
| 35 | 1 | 1 | | | | | | | 160 | 145 | 142 |
| | | | 78 | 42 | 42 | 120 | 113 | 105 | 161 | 150 | 150 |
| 37 | 26 | 24 | | | | 121 | 116 | 116 | | | |
| | | | | | | | | | | | |
| 39 | 21 | 21 | | | | 123 | 107 | 85 | 164 | 154 | 151 |
| 40 | 22 | 22 | 82 | 62 | 59 | | | | 165 | 155 | 155 |
| 41 | 17 | 17 | | | | | | | 166 | 1 | 1 |
| | | | | | | | | | | | |
means the ranking in the frequency is higher than that in LAC otherwise bold, and the rest means the same.
The modeling results.
| Model# | RELIEF | SVM | Frequency | CBA | LAC | Bio fingerprint | MDL_Bio fingerprint |
| 1 | 89.71% | 89.71% | 91.70% | 93.39% | 92.93% | 100.00% | 99.69% |
| 2 | 89.09% | 89.40% | 90.63% | 91.40% | 91.40% | 100.00% | 100.00% |
| 3 | 88.63% | 88.63% | 89.71% | 90.32% | 91.71% | 99.33% | 100.00% |
| 4 | 87.86% | 88.79% | 88.79% | 88.17% | 91.71% | 100.00% | 100.00% |
| 5 | 90.02% | 90.02% | 90.17% | 90.48% | 90.78% | 100.00% | 99.06% |
| 6 | 86.64% | 86.94% | 88.02% | 88.48% | 90.32% | 100.00% | 100.00% |
| 7 | 91.09% | 91.40% | 91.86% | 90.63% | 92.78% | 100.00% | 99.69% |
| 8 | 88.63% | 88.79% | 88.79% | 89.55% | 90.63% | 100.00% | 100.00% |
| 9 | 89.25% | 89.40% | 90.48% | 91.86% | 91.55% | 100.00% | 100.00% |
| 10 | 89.55% | 89.55% | 90.94% | 92.01% | 91.86% | 100.00% | 99.06% |
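To compare the five weighting-based columns at a glance, the per-model accuracies above can be averaged. The sketch below simply copies the ten listed rows for RELIEF, SVM, Frequency, CBA and LAC and computes an unweighted mean per method. The unweighted mean and the variable names are illustrative choices, not a summary statistic reported in this section.

```python
from statistics import mean

# Accuracies (%) for models 1-10, copied from the modeling results table.
accuracy = {
    "RELIEF": [89.71, 89.09, 88.63, 87.86, 90.02,
               86.64, 91.09, 88.63, 89.25, 89.55],
    "SVM": [89.71, 89.40, 88.63, 88.79, 90.02,
            86.94, 91.40, 88.79, 89.40, 89.55],
    "Frequency": [91.70, 90.63, 89.71, 88.79, 90.17,
                  88.02, 91.86, 88.79, 90.48, 90.94],
    "CBA": [93.39, 91.40, 90.32, 88.17, 90.48,
            88.48, 90.63, 89.55, 91.86, 92.01],
    "LAC": [92.93, 91.40, 91.71, 91.71, 90.78,
            90.32, 92.78, 90.63, 91.55, 91.86],
}

# Unweighted mean accuracy per method over the ten models.
mean_accuracy = {method: mean(values) for method, values in accuracy.items()}

# Methods ordered from highest to lowest mean accuracy.
ranking = sorted(mean_accuracy, key=mean_accuracy.get, reverse=True)

for method in ranking:
    print(f"{method}: {mean_accuracy[method]:.2f}%")
```

On these ten rows LAC has the highest mean and RELIEF the lowest, although CBA is ahead of LAC on individual models such as 1, 9 and 10.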
Top 20 rules from frequency and LAC classifier.
| Number | Frequency | LAC |
| 1 | 157,140,93 → positive | |
| 2 | 139,124,104 → positive | |
| 3 | 157,155,93 → positive | |
| 4 | 157,93 → positive | |
| 5 | 157,140,123 → positive | |
| 6 | 163,140,93 → positive | |
| 7 | 118 → positive | |
| 8 | 155,140,93 → positive | |
| 9 | 157,155,123 → positive | |
| 10 | 157,123 → positive | |
| 11 | 144,124,104 → positive | |
| 12 | 155,140,123 → positive | |
| 13 | 157,155,124 → positive | |
| 14 | 140,101 → positive | |
| 15 | 161,139,104 → positive | |
| 16 | 157,126,124 → positive | |
| 17 | 124,104 → positive | |
| 18 | 139,126,124 → positive | |
| 19 | 129,123 → positive | |
| 20 | 144,139,124 → positive | |
is exclusively in the frequency approach, bold only in LAC and others are common ones.
Selected Top 5 active rules using bio fingerprint.
| Number | Rules | Support | Confidence |
| 1 | MCF7 inactive, HL60(TB) inactive → inactive | 29.1% | 95.8% |
| 2 | MCF7 inactive, MOLT-4 inactive → inactive | 29.7% | 95.8% |
| 3 | MCF7 inactive, CCRF inactive → inactive | 28.7% | 95.4% |
| 4 | MCF7 inactive, K-562 inactive → inactive | 30.7% | 95.4% |
| 5 | MCF7 inactive, RPMI-8226 inactive → inactive | 31.9% | 95.2% |
| … | … | … | … |
Figure 5. The connections between chemical features and cell lines.
(A red dot means a connection to active; a green solid to inactive; light gray means features associated with each other. Purple: Non-small cell lung; Red: Renal; Pink: Breast cancer; Green: Ovarian; Light blue: Melanoma.)
Top 5 rules using the combined fingerprint.
| Number | Rules | Support | Confidence |
| 1 | MCF7 active, bit 29 → active | 2.0% | 98.2% |
| 2 | SK-MEL-2 active, bit 29 → active | 1.8% | 98.11% |
| 3 | UACC-62 active, bit 33 → active | 2.0% | 97.7% |
| 4 | NCI-H226 active, bit 33 → active | 1.7% | 97.3% |
| 5 | HCC-2998 active, bit 33 → active | 1.6% | 97.2% |
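The Support and Confidence columns in the rule tables follow the usual class association rule definitions: support is the fraction of all records that match both the antecedent and the class label, and confidence is the fraction of antecedent-matching records that carry that label. The sketch below illustrates the arithmetic on a handful of made-up records. The records and the function name `rule_metrics` are hypothetical and are not taken from the study's data.

```python
# Hypothetical records: (set of items describing a compound, class label).
records = [
    ({"MCF7 active", "bit 29"}, "active"),
    ({"MCF7 active", "bit 29"}, "active"),
    ({"MCF7 active"}, "inactive"),
    ({"bit 29"}, "inactive"),
    ({"MCF7 inactive"}, "inactive"),
]


def rule_metrics(antecedent, label, records):
    """Return (support, confidence) of the rule antecedent -> label."""
    antecedent = set(antecedent)
    # Labels of all records whose items include the whole antecedent.
    covered = [lab for items, lab in records if antecedent <= items]
    correct = sum(lab == label for lab in covered)
    support = correct / len(records)
    confidence = correct / len(covered) if covered else 0.0
    return support, confidence


support, confidence = rule_metrics({"MCF7 active", "bit 29"}, "active", records)
print(f"support={support:.1%}, confidence={confidence:.1%}")
```

A rule can therefore have low support yet high confidence, as in the combined-fingerprint rules above: it applies to few compounds but is almost always right when it does apply.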