Literature DB >> 23227228

Discovering associations in biomedical datasets by link-based associative classifier (LAC).

Pulan Yu1, David J Wild.   

Abstract

Associative classification mining (ACM) can be used to provide predictive models with high accuracy as well as interpretability. However, traditional ACM ignores the difference of significances among the features used for mining. Although weighted associative classification mining (WACM) addresses this issue by assigning different weights to features, most implementations can only be utilized when pre-assigned weights are available. In this paper, we propose a link-based approach to automatically derive weight information from a dataset using link-based models which treat the dataset as a bipartite model. By combining this link-based feature weighting method with a traditional ACM method-classification based on associations (CBA), a Link-based Associative Classifier (LAC) is developed. We then demonstrate the application of LAC to biomedical datasets for association discovery between chemical compounds and bioactivities or diseases. The results indicate that the novel link-based weighting method is comparable to support vector machine (SVM) and RELIEF method, and is capable of capturing significant features. Additionally, LAC is shown to produce models with high accuracies and discover interesting associations which may otherwise remain unrevealed by traditional ACM.

Entities:  

Mesh:

Year:  2012        PMID: 23227228      PMCID: PMC3515483          DOI: 10.1371/journal.pone.0051018

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Chemical and biological data contain information about various characteristics of compounds, genes, proteins, pathways and diseases. Thus a wide spectrum of data mining methods is used to identify relationships in these large and multidimensional datasets and to generate predictive models with high accuracy and interpretability. Recently, associative classification mining (ACM) has been widely used for this purpose [1]–[4]. ACM is a data mining framework utilizing association rule mining (ARM) technique to construct classification systems, also known as associative classifiers. An associative classifier consists of a set of classification association rules (CARs) [5] which have the form of X→Y whose right-hand-side Y is restricted to the classification class attribute. X→Y can be simply interpreted as if X then Y. ARM is introduced by Agrawal et al [6] to discover CARs which satisfy the user specified constraints denoted respectively by minimum support (minsup) and minimum confidence (minconf) threshold. Given a dataset with each row representing a compound, each column (called as item, feature or attribute) is a test result of this compound on a tumor cell line and all compounds are labeled as active or inactive class, a possible classification association rule can be {MCF7 inactive, HL60 (TB) inactive → inactive} with support = 0.6 and confidence = 0.8. This particular rule states that when a compound is inactive to both MCF7 cell line and HL60 (TB) cell line, it tends to be inactive. The support, which is the probability of a compound being inactive to both MCF7 and HL60 (TB) and being classified as inactive together, is 0.6; the confidence, which is the probability of a compound to be inactive given inactive to both MCF7 and HL60 (TB), is 0.8. In ACM, the relationship between attributes and class is based on the analysis of their co-occurrences within the database so it can reveal interesting correlations or associations among them. For this reason, it has been applied to the biomedical domain especially to address gene expression relations [7]–[11], protein-protein interactions [12], protein-DNA interactions [13], and genotype and phenotype mapping [14] inter alia. Traditional ACM does not consider feature weight, and therefore all features are treated identically, namely, with equal weight. However, in reality, the importance of feature/item is different. For instance, {beef → beer} with support = 0.01 and confidence = 0.8 may be more important than {chips → beer} with support = 0.03 and confidence = 0.85 even though the former holds a lower support and confidence. Items/features in the first rule have more profit per unit sale so they are more valuable. Wang et al [15]–[17] proposed a framework called weighted association rule mining (WARM) to address the importance of individual attributes. The main idea is that a numerical attribute can be assigned to every attribute to represent its significance. For example, {Hypertension = yes, age>50→ Heart_Disease} with {Hypertension = yes, 0.8}, {age>50, 0.3} is a rule mined by WARM. The importance of hypertension and age >50 to heart disease is different and denoted by value 0.8 and 0.3 respectively. The major difference between ARM and WARM is how the support is computed. Several frameworks are developed to incorporate weight information for support calculation [15]–[22]. Studies have been carried out on WARM by using pre-assigned weights. Nonetheless, most datasets do not contain those pre-assigned weight information.

The bipartite model of a dataset.

(The bipartite model is also a heterogeneous system. Blue represents active compounds and red for inactive compounds with both contributing to the green node-feature/attribute.). In machine learning, feature selection and feature weighting are broadly used to deal with the significance of features and derive weight information automatically from a dataset itself. Feature selection is a technique of selecting a subset of relevant features by removing low significant features; feature weighting is a technique of approximating the optimal degree of influence of individual features. Feature weighting preserves all features by assigning smaller weight to relatively insignificant features and has the advantage of taking into account of all features as well as not requiring searching an appropriate cut-threshold [23]. In some circumstances, it might be the only option when eliminating features with a low contribution to classification is inappropriate. Especially, to understand the overall relationship between genes and a disease, a small subset of genes although having good prediction ability may not have sufficient discriminating power [24]. Like feature selection, feature weighting approaches fall into two categories: 1) filter methods which are performed in a pre-processing step before modeling; 2) wrapper methods which are iterative and generally use the same learning algorithm as modeling. In wrapper methods, the evaluation result of relevancy is used for feature weighting. Usually, wrapper methods perform better than filter methods while filter methods are faster and cheaper. Sun et al. [25] proposed a link-based filter feature weighting approach. The weights are derived from the dataset itself by extending Kleinberg’s HITS (Hyper Induced Topic Selection) model [26] and algorithm on bipartite graphs. HITS and PageRank are two major link-based ranking algorithms. PageRank is developed by Brin and Page [27] and has been commercially successfully used in the search engine Google. HITS ranks webpages by analyzing the in-links and out-links. Webpages pointed to by many other pages are defined as “authority” while webpages linked to many other pages are called “hub”. HITS emphasizes the notion of “mutual reinforcement” between the “authority” and “hub”. Its intuitive interpretation is that a good “authority” is pointed to by a lot of good “hubs” and a good “hub” points to many good “authorities”. PageRank uses a very similar idea that a “good” webpage should be linked or link to other “good” webpages. Unlike the “mutual reinforcement” approach, it focuses on hyperlink weight normalization and web surfing based on random walk models. Both approaches have pros and cons. The computation of PageRank is stable and its behavior is well-defined due to the probabilistic interpretation. Furthermore, PageRank can be used on large page collections because even though the larger communities will affect the final ranking, they will not overwhelm the small ones. In contrast, HITS is not stable and cannot be applied to large page collections since only the largest web community will influence the final ranking. However, it can capture the relationships among the webpages with more details [28]. Hence, an algorithm capable of integrating both HITS and PageRank may improve Sun’s weighting method. The general PageRank cannot be applied to bipartite graphs as it produces different rankings for webpages with the same in-links [29], as a result, a better ranking scheme is needed for ranking in bipartite graphs while integrating PageRank and HITS [30]. The SALAS (stochastic approach for link structure analysis) [31]–[33] combines the random surf model of PageRank with hub/authority principle of HITS. It generates a bipartite undirected graph H based on the web graph G. One subset of H contains all the nodes with positive in-degree (the potential “authorities”) and the other subset consists of all the nodes with positive out-degree (the potential “hubs”). A travel is completed by a two-step random walk. For example, from the “hub” to the “authority” and from the “authority” back to the “hub”. As in the PageRank, each individual walk is a Markov process with a well-defined transition probability matrix [31]. Nevertheless, besides SALAS does not really implement the “mutual reinforcement” of HITS because the scores of both authority and hub are not related by the hub to authority and authority to hub reinforcement operations, its score propagation differs from HITS (a similarity-mediated score propagation). Moreover, its random walk model does not directly simulate the behavior of the surfer in PageRank either. For SALAS, a surfer can jump from webpage p to p even though there is no hyperlink between them, and there is no link-interrupt jumps. Based on a similar approach as SALAS, Ding et al proposed a unified framework integrating HITS and PageRank [34]. indicates that a database can be represented by a bipartite graph equally [25]. In the graph, left is the table layout representation and can be represented by the bipartite graph on the right. Compounds and features linked to each other can be viewed as webpages. As a consequence, the link-based algorithms used to rank the webpage such as HITS or PageRank can be utilized to rank compounds or features. The algorithms say that if a webpage has many important links to it, the links from it to other webpages become important too. For our case, this means a highly weighted compound should contain many highly weighted features and a highly weighted feature should exist in many highly weighted compounds. Accordingly, the ranking score can be used for feature weighting. Although Ding’s unified framework can be used to derive the ranking score automatically, it cannot distinguish the contributions of different types of connections. For chemical dataset mining, each chemical feature may connect to both active and inactive compounds; for biological dataset mining, each gene may connect to a disease either as suppressor or activator. Chemical features existing frequently in active compounds or genes major associated with suppressors are more interested in. In , when we consider the contribution of compounds to the weight of a node/attribute 78, we want to distinguish the contribution of compound 5469540 from the contribution of compound 840827 and 5911714. Ding’s unified framework treats the contribution of the nodes equally as a homogenous system [34]; Chen et al developed a framework calculating the weight for either homogenous or heterogeneous systems [35]. In Chen’s model, connections can have different impacts on a node.
Figure 1

The bipartite model of a dataset.

(The bipartite model is also a heterogeneous system. Blue represents active compounds and red for inactive compounds with both contributing to the green node-feature/attribute.).

In this paper, we describe a link-based unified weighting framework which combines the mutual reinforcement of HITS with hyperlink weighting normalization of PageRank based on Ding and Chen’s frameworks, resulting in highly efficient link-based weighted associative classifier mining from biomedical datasets without pre-assigned weight information. Our main contributions are: 1) development of a novel link-based weighting scheme for mining biomedical datasets; 2) implementation of a novel link-based associative classifier by combining the feature weighting method, weighted association rule mining (WARM) and the CBA algorithm [5]; 3) application of this method to two important biomedical datasets. In the following sections, the dataset, link-based feature weighting, WARM and algorithm of LAC will be discussed, followed by the application of LAC to two datasets. In the end, we present our conclusions and future work.

Materials and Methods

1. Data Set

LAC is applied to two datasets: a. Ames mutagenicity dataset [36], b. NCI-60 tumor cell line dataset [37]. In Ames dataset, there are 6,512 compounds provided in SMILES format and is benchmarked by SVM, Random Forests, k-Nearest Neighbors, and Gaussian Processes. The authors used 5-fold cross validation to evaluate the generated models. The area under this ROC-Curve (AUC) is utilized to assess the performance which ranges from 0.79 to 0.86. The GI50 data of NCI-60, which is the concentration of the anti-cancer drug that inhibits the growth of cancer cells by 50%, is used and processed as following. First, among the 60 tumor cell lines, IGR-OV1, MDA-MB-468 and MDA-N are removed due to too many missing values. Then, compounds having missing values are also discarded. In the final dataset, 5,937 compounds with 57 bioassay results in total are included. For the Ames dataset, if a compound is positive, it is carcinogenic; for the NCI-60, the compound is “active” only if its GI 50 is greater than 5.

2. MDL Public Keys

MDL public key set also called MACCS key set is a 166-bit string with each bit encoding a predefined chemical structure feature. MDL public keys are extensively used in biomedical research due to their relatively high performance and the one-to-one map between the structural feature and fingerprint [37], [38]. The fingerprint is computed by using the CDK [39] software package and reformatted for LAC. Correlation is significant at the 0.01 level (2-tailed).

3. Bio Fingerprint

Bioassay readouts have been used as features (“biospectra” or “bio fingerprint”) for data mining in several studies and produced high quality models [40], [41]. These bioactivity profiles link the potential targets with the chemical compounds and provide insights into the relationships among diseases, compounds and bioactivities. In this study, results of related bioassay analyses are used as features for the classification of chemical compounds. Each GI50 value is transformed into “active” (GI50 is greater or equal than 5) or “inactive” (GI50 is less than 5). The T-47D is used as a label class and the results from other cell lines are used as features. means the ranking in the frequency is higher than that in LAC otherwise bold, and the rest means the same. For each of the 6,512 compounds in Ames data, we attempt to predict whether it is carcinogenic or not based on the MDL public keys. For the 5,937 compounds in NCI 60, we first use Bio fingerprint to predict whether they are agonist or antagonist to T-47D cell line. Then, for those 3,199 compounds in the NCI-60 dataset having 2D structures available in the downloaded structure file, a hybrid fingerprint is generated by combing MDL public keys and Bio fingerprint to build models. Let L =  (Lij) be the adjacency matrix of the web graph G =  (V,E), where V is the set of webpages and E is the set of links between them. Lij = 1 if page i links to page j and Lij = 0 otherwise. LT will be the transpose of L. If the graph is directed, the in-degree matrix Din and out-degree matrix Dout are also defined. Given vectors d = (b1, b2, …, bn)T where bj is the in-degrees of page j() and d = (o1,o2, …, on)T where oj is the out-degrees of page j ). Din is a diagonal matrix denoted as Din = diag(d) and Dout = diag(d). is exclusively in the frequency approach, bold only in LAC and others are common ones.

4. HITS

In HITS, vectors x = (x1,x2,…,xn)T and y = (y1,y2,…ym)T represent the scores of authority and hub respectively. HITS defines recursive equations as following: Where k1 and y(0) = e, e is a vector of all 1s and x denotes k-th iteration. tells that authoritative pages are those linked by good hub pages, and means good hubs are pages that link to authoritative pages. It can be rewritten as:

5. PageRank

In PageRank, given x = (x1,x2,…,xn)T, xi is the PageRank of page i; the recursive PageRank equation is defined in matrix notation as:where P =  (Pij) is a stochastic matrix (the sum of every column equals to 1) with Pij = . PT can be expressed as: If considering the link-tracking jump and link-interrupt jump, the full transition probability can be written as:where is the damp factor from 0 to 1. As the way processed in SALAS, if the web graphs are transformed into bipartite graphs, the above x will be the authority score and the hub score y can be defined as:

The connections between chemical features and cell lines.

(Red dot means a connection to active; green solid to inactive; light gray means features associated to each other. Purple: Non-small cell lung; Red: Renal; Pink: Breast cancer; Green; Ovarian and Light blue; Melanoma.). Comparing the equations between HITS and PageRank ( & versus 5 & 8), it is possible that a unified framework can be derived to combine advantages from both HITS and PageRank.

6. Unified Framework

If we define the LTL in and PT in as operation Aop (authority) and LLT in equation 4 and P in as operation Hop (hub). The critical component of the framework is to define the new Aop and Hop. Ding’s implementations of Aop and Hop [34] are used here since it generalizes the features of HITS and PageRank and combines them together. Chen’s model [35] divided the web pages into homogenous and heterogeneous systems so the scores of authority and hub contain the reinforcement of links from both systems. Different weights can be assigned to homogenous or heterogeneous systems to adjust the importance of their links in the final ranking. Similarly, in our case, the nodes, such as compounds, are classified as active/inactive or positive/negative thus the dataset is converted to a heterogeneous system. The relatively higher weight values can be assigned to the active/positive compounds to promote their importance in the final feature weighting. Our link-based framework can be written as follows. a represents the “active” system and b is the “inactive” system. is a class factor ranging from 0 to 1 (In the case that Aop or Hop involves or , or (1-β) will be replaced by their square roots). It has impact on the accuracy and size of classifiers along with rules in the classifiers. Generally, in order to assign higher weight values to active/positive compounds, can be any value greater than 0.5. In our study, is set to 0.9. Based on the comparison of implementations in [34], the following definitions of Aop and Hop are used.

7. Associative Classification Mining

Let F = { f1, f2, …, fn} be a set of n distinct features and C be a list of classes { c1, c2,., cm}. D is a transaction/dataset over F and C. Each transaction/compound ti contains a set of items f1, f2, …fk F and cj C. The set of items here is also called itemset. A classification association rule (CAR) is an implication of the form X Y or X Y where X F and Y C. The support of the rule is the probability of transactions having both X and Y among all the presented cases. An itemset is frequent only if its support satisfies a minimum support θ. Additionally, the confidence of this rule is defined as the support of X and Y divided by the support of X which is the conditional probability Y is true under the circumstance of X. The process of discovering, pruning, ranking and selecting of CARs and applying them to classification is called associative classification.

8. Weighted Associative Classification Mining

For the weighted associative classification (WAC) [15]–[17], each feature fi is associated with a weight wi W = { w1, w2, …, wn}. A pair (fi, wi) is called a weighted item. Each transaction/compound is a set of weighted items plus the class type. The straightforward definition of itemset weight is:W(is) is the weight of itemset and is is the itemset. The weighted support of itemset WS(is) is:T is total transactions and S is all the transactions containing the itemset. In the classical associative classification, the difference of significance of items is not taken into account. It is assumed that if the itemset is frequent, then all of its subsets should be frequent as well. This principle is called downward closure property (DCP). Given the compounds C1–C6, their features and the weight of the features ( & ), if itemset {81, 83, 84} is frequent, then all its subsets {81}, {83}, {84}, {81, 83}, {81, 84} and {83, 84} must all be frequent. However, in WAC, provided the convenient definition (equation 15 & 16), the DCP does not hold. An itemset may be frequent even though some of its subsets are not frequent which can be illustrated in the following example ( = 0.3). As shown in , the support of {83, 84} and {81, 83} are both 0.27 so they are not frequent.
Table 1

A compound dataset encoded by MDL public keys.

CIDMDL Finger print
C1{…81,82,83,84…}
C2{…82,84…}
C3{…81,84…}
C4{…81,82,84,85…}
C5{…81,82,83,84,85…}
C6{…82,83,85…}
Table 2

MDL public keys and their weight.

FeatureWeight
810.8
821
830.8
841.6
851
Table 3

Supports and types of itemsets (frequent or not).

ItemsetClassicalWeightedAdjusted Weighted
SupportFrequentSupportFrequentSupportFrequent
810.67Y0.53Y0.75Y
830.50Y0.4Y0.66Y
81 830.33Y 0.27 N 0.44 Y
83 840.33Y 0.27 N 0.44 Y
81 840.67Y0.8Y0.75Y
81 83 840.33Y0.35Y0.44Y
Several frameworks are proposed to maintain the DCP property [15]–[22], [25]. Before introducing the framework, we define the transaction weight as: t is the transaction. We then define the adjusted weighted support as:The S and T are the same as above. This definition will ensure that if then since any transaction containing Y will have X. By using the AWS, the DCP will not be violated. The discovered association rules are ranked, evaluated and pruned by using CBA approach [5]. The algorithm of PageRank based associative classification is given in & .
Figure 2

Link-based weighting.

Figure 3

Weighted associative classification.

All the computations are carried out on a PC Q6600 2.4GHz with 6G memory running on the Windows 7 64bit operating system. The classifier is implemented in C#. To explore all possible rules, the mining is performed by using the following settings: MinSup (20%) and MinConf (70%) for AMES dataset; MinSup (1%) and MinConf (0%) for NCI-60 dataset. In all experiments, the maximum length of the rules is set to 4 and the maximum number of candidate frequent itemsets is 200,000. In the AMES data set, the SVM and RELIEF weighting method are applied for comparison. SVM and RELIEF are computed using Rapidminer 5.1 [42].

9. Model Assessment and Evaluation

The classification performance is assessed using 10-fold “Cross Validation” (CV) because this approach not only provides reliable assessment of classifiers but the result can be generalized well to new data. The accuracy of the classification can be determined by evaluation methods such as error-rate, recall-precision, any label and label-weight etc. The error-rate used here is computed by the ratio of number of successful cases over total case number in the test data set. This method has been widely adopted in CBA [5], CPAR [42] and CMAR [4] assessment.

Results and Discussion

1. Comparison of Feature Weight and Rank

The comparison is performed on AMES dataset. For AMES dataset mining, the identification of features which are good for “positive” compounds are considered more preferable. So the “positive” here is treated as “active”. The weight generated by LAC is compared to that generated by frequency of the bits, SVM and RELIEF. shows that results of RELIEF and SVM are very similar. To confirm this, a correlation analysis is performed by SPSS 19 [43]. shows at the 0.01 level (2-tailed), SVM and RELIEF, LAC and frequency are highly correlated as the coefficient is 0.949 and 0.958 respectively. The coefficients of SVM, RELIEF and LAC with frequency are greater than 0.75 indicating that all are correlated with frequency. Among them, LAC has the strongest correlation (0.947) with frequency. This is mainly caused by bit 3, 8, 11, 36 and 166. For bit 3, 8 and 11, since their frequencies are not 0, both LAC and frequency assign small weight values while for SVM and RELIEF the weight values are set to 0. On the contrary, the weight values of 36 and 166 are set to 0 for LAC and frequency but are not set to 0 in SVM and RELIEF. The correlation of LAC and frequency can be explained by the principle of link-based weighting–mutual reinforcement. As expected, the rank and weight of features in the LAC and frequency are different. In , all features are ordered by ascending weight. 69 features (bold) are promoted and 61 features (*) are demoted while the rest remains unchanged in LAC. Generally, higher frequency will lead to higher “authority” resulting bigger weight ( ). For example, bit 135 has high weight in both frequency and LAC; bit 127 and 141 are much bigger in LAC (red data label) than in frequency (black data label) since most of their connections are “active” compounds (58.6% and 56.6% respectively). is the rank of the features in each scheme respectively. The bigger the number, the higher the rank is and the more important the feature is. Some features (bold) have a relatively lower rank in frequency; they may get higher ranks due to the promotion from connecting to compounds having higher “rank” values. Likewise, features (*) connected to many “bad” compounds may be degraded. The promotion or demotion depends on the number and type of its connections.
Figure 4

Results of different weighting methods.

Table 4

Correlation analyses of the weighting results.

FrequencySVMRELIEFLAC
Frequency Pearson Correlation1.776** .791** .947**
Sig. (2-tailed).000.000.000
SVM Pearson Correlation.776** 1.949** .759**
Sig. (2-tailed).000.000.000
RELIEF Pearson Correlation.791** .949** 1.712**
Sig. (2-tailed).000.000.000
LAC Pearson Correlation.947** .759** .712** 1
Sig. (2-tailed).000.000.000

Correlation is significant at the 0.01 level (2-tailed).

Table 5

The rankings of chemical features from frequency and LAC.

BitFrequencyLACBitFrequencyLACBitFrequencyLACBitFrequencyLAC
11143* 2419 85 69 78 126110101
211 44 4 6 86 68 71 127152152
3 3 5 45 27 30 87 63 66 128* 10080
411 46 33 41 88 65 68 129* 9477
51147444489* 9693 130 129 130
61148* 403990* 7367131* 118111
711 49 85 109 91* 6661132* 11191
8121250* 5148 92 77 83 133 134 141
91151* 3226 93 93 96 134* 10298
1011 52 56 75 94 121 131 135 130 140
11131353* 5250958888136* 117112
1211 54 58 62 96 114 117 137137137
13 16 18 55* 3531979999138* 139129
1488 56 76 108 98 106 107 139* 123115
15* 53 57 79 89 99* 9894 140 147 148
16 47 53 58* 37351008282141156156
1777 59 36 38 101 23 25 142 133 135
182260* 3934 102 119 127 143* 124122
19 38 43 61* 4136 103 72 79 144 128 134
20* 64 62 53 55 104* 8070 145 143 145
21 15 16 63 92 114 105 141 143 146* 135128
22 48 54 64* 3433106* 7573147* 11292
231414 65 97 106 1078181 148 136 138
24 95 113 66* 5449108* 7058149* 120110
25 20 28 67* 4946109* 10387150* 126123
26 25 27 68 59 69 110 89 90 151 122 124
27* 109 69 50 51 111* 9184152* 138132
28 18 23 70 104 118 112* 8772153* 131120
29* 1915 71 109 121 113* 105104154* 140126
301111 72 90 95 114* 6760155* 132119
31 9 10 73* 4540115* 8364 156 148 149
32292974* 6052116* 8674157* 144139
33 30 32 75 57 65 117* 108103158146146
34312076* 6157118* 7163159* 149144
3511 77 64 76 119 115 125 160* 145142
36 46 47 784242120* 113105161150150
372624 79 74 86 121116116 162 151 153
38 43 45 80 84 100 122 125 133 163 153 154
392121 81 55 56 123* 10785164* 154151
40222282* 6259 124 127 136 165155155
411717 83 101 102 125 142 147 16611
42 28 37 84 78 97

means the ranking in the frequency is higher than that in LAC otherwise bold, and the rest means the same.

2. Comparison of Accuracy of Classification

The average accuracies of frequency, LAC, RELIEF, SVM and CBA are 90.11%, 91.57%, 89.05%, 89.26% and 90.63% respectively ( ). The major purpose of WACM is to find more rules containing interesting items, in other word, items with higher significance, while trying to achieve high accuracy at the same time. Most of current comparisons of performance between WARM and traditional ARM are focused on time and space scalability, such as number of frequent items, number of interesting rules, execution time and memory usage [18]–[20], [43]–[45]. The results showed that the difference between WARM and ARM are minor. The comparison of WACM and traditional ACM is scant due to the lack of easily accessible weighted association classifiers. Soni et al [46] compared their WACM results with those generated by traditional ACM methods–CBA [5], CMAR [4] and CPAR [47] on three biomedical datasets, and their results showed that WACM offered the highest average accuracy. In our study, among all four weighted schemes and CBA, LAC has the highest accuracy.
Table 6

The modeling results.

Model#RELIEFSVMFrequencyCBALACBio fingerprintMDL_Bio fingerprint
189.71%89.71%91.70%93.39%92.93%100.00%99.69%
289.09%89.40%90.63%91.40%91.40%100.00%100.00%
388.63%88.63%89.71%90.32%91.71%99.33%100.00%
487.86%88.79%88.79%88.17%91.71%100.00%100.00%
590.02%90.02%90.17%90.48%90.78%100.00%99.06%
686.64%86.94%88.02%88.48%90.32%100.00%100.00%
791.09%91.40%91.86%90.63%92.78%100.00%99.69%
888.63%88.79%88.79%89.55%90.63%100.00%100.00%
989.25%89.40%90.48%91.86%91.55%100.00%100.00%
1089.55%89.55%90.94%92.01%91.86%100.00%99.06%
Average 89.05% 89.26% 90.11% 90.63% 91.57% 99.93% 99.75%

3. Comparison of Classifiers

There are 10 models generated for each weighting scheme and we are interested in the comparison between the classifiers of CBA and LAC. Model 1 is used as an example and there are 30 rules in the classifier of frequency and 132 in that of LAC. Among them, 14 rules are exclusively in the frequency classifier, 116 only in LAC classifier and 16 rules are shared by both. shows that among the top 20 rules, 11 rules are shared by both classifiers, 9 rules (*) are only in the classifier of frequency and none of the top 20 rules (bold) are included in the classifier of frequency. All rules are ordered based on the CBA definition. During the classification, the match of the new compounds starts from the first and will stop immediately as long as there is a hit. As a result, although those 11 rules are in both classifiers, they may have different impacts on the final result of classification.
Table 7

Top 20 rules from frequency and LAC classifier.

NumberFrequencyLAC
1157,140,93 ->positive 155,140,62 ->positive
2139,124,104 ->positive 140,62 ->positive
3157,155,93 ->positive 132,69 ->positive
4157,93 ->positive 140,118,69 ->positive
5157,140,123 ->positive* 155,62 ->positive
6163,140,93 ->positive* 157,140,69 ->positive
7118 ->positive 157,62 ->positive
8155,140,93 ->positive 158,140,69 ->positive
9157,155,123 ->positive* 62 ->positive
10157,123 ->positive 155,118,69 ->positive
11144,124,104 ->positive 158,157,69 ->positive
12155,140,123 ->positive* 157,118,69 ->positive
13157,155,124 ->positive* 140,69 ->positive
14140,101 ->positive 132,121,70 ->positive
15161,139,104 ->positive 157,132,70 ->positive
16157,126,124 ->positive* 132,70 ->positive
17124,104 ->positive* 140,129,70 ->positive
18139,126,124 ->positive* 157,129,70 ->positive
19129,123 ->positive* 161,157,23 ->positive
20144,139,124 ->positive 157,126,23 ->positive

is exclusively in the frequency approach, bold only in LAC and others are common ones.

4. Rule Interpretation

Our recently submitted paper [48] showed that the rules generated by associative classification based on chemical fingerprints and properties can be interpreted by chemical knowledge and shed a light on the molecule design. In this study, we focus on the analysis of association rules generated by LAC using the bio fingerprint (NCI-60 dataset). The analysis for those generated by frequency can be done in the same manner. The accuracy of both frequency and LAC are 99.93% ( ) and the average size of the classifier is around 350 rules. For all ten models, the top 5 rules are the same but with different order, support and confidence. The intuitive explanation of Rule 1 in is that if compound is inactive to MCF7 and HL60 (TB) then it will be inactive to T47D at the same time. The adjusted weighted support of this rule is 29.1% and weighted confidence is 95.9%. Among the 5,937 compounds, 1730 compounds are covered by this rule. All these cell lines in the top 5 rules fall into two categories: a) breast cancer and b) Leukemia. On one hand, it means that there are many compounds which are inactive neither to breast cancer cell lines nor to Leukemia cell lines; on the other hand, it suggests that there might be some associations between these two types of cancers. [49], [50] clustered the cell lines based on their gene expression data, their results also indicated that the cell lines in these two categories were clustered into one or their clusters were very close to each other. The association of MCF7 and T47D is not surprising as they belong to the same category–breast cancer. The rules here may also provide a potential direction of the drug resistance of breast cancer and leukemia. [50]–[52] discovered a novel ABC transporter, breast cancer resistance protein (BCRP). This transporter was termed breast cancer resistance protein (BCRP) because of its identification in MCF-7 human breast carcinoma cells. The drug-sensitive cells become drug-resistant cells after transfection or overexpression of BCRP. They also found that relatively high expression of BCRP mRNA were observed in around 30% acute myeloid leukemia (AML) cases and suggested a novel mechanism of drug resistance in leukemia.
Table 8

Selected Top 5 active rules using bio fingerprint.

NumberRulesSupportConfidence
1MCF7 inactive, HL60(TB) inactive → inactive29.1%95.8%
2MCF7 inactive, MOLT-4 inactive →inactive29.7%95.8%
3MCF7 inactive,CCRF inactive →inactive28.7%95.4%
4MCF7 inactive, K-562 inactive →inactive30.7%95.4%
5MCF7 inactive, RPMI-8226 inactive →inactive31.9%95.2%
A hybrid feature set integrating the chemical fingerprint and bio fingerprint is generated by combining the MDL public keys and the bio fingerprint. Since we are only interested in the compounds which are active against tumor cell lines, the “inactive” value of the bioassay is treated as a feature of “not existed” in the compound. This also helps to treat the chemical fingerprint and the bio fingerprint equally. The average accuracy of the classification is 99.7% ( ). For rules in the final classifier, for example, (A, B → Active), it will be converted to (A associate Active) and (B associate Active). All the rules are transferred and plotted by Cytoscape 2.8.2 [53]. To make it clearer, nodes with degree less than 10 are removed. shows that generally compounds actively against MDA-MB-231/ATCC, TK-10, OVCAR-4, UACC-257, HOP-92, EKVX, NCI-H226 will also active to T-47D. Chemical features: bit 46(Br), 51 (CSO), 58 (QSQ), 65 (CN), 127 and 111 (NACH2A) are related to active or inactive depending on what other features it coexists with. There are other features which mainly related to inactive.
Figure 5

The connections between chemical features and cell lines.

(Red dot means a connection to active; green solid to inactive; light gray means features associated to each other. Purple: Non-small cell lung; Red: Renal; Pink: Breast cancer; Green; Ovarian and Light blue; Melanoma.).

The top 2 rules in the classifier indicate that compounds containing phosphorus and active to MCF7 or SK-MEL-2 will be active to T-47D too ( ). 22 out of 23 compounds match both rule 1 and 2. Among them, the once abandoned drug NSC 280594 (triciribine) attracts much attention and undergoes phase I trial due to its potential possibility of against a common cancer-causing protein [53]–[55]. These rules reveal that phosphorus might be an important chemical structure for anti-cancer drugs.
Table 9

Top 5 rules using the combined fingerprint.

NumberRulesSupportConfidence
1MCF7 active, bit 29 → active2.0%98.2%
2SK-MEL-2 active, bit 29 →active1.8%98.11%
3UACC-62 active, bit 33 → active2.0%97.7%
4NCI-H226 active, bit 33 → active1.7%97.3%
5HCC-2998 active, bit 33 → active1.6%97.2%

Conclusions

In this paper, we describe a novel link-based feature weighting framework for datasets without pre-assigned weight information. This algorithm employs a unified framework which integrates the advantage of HITS and PageRank–the mutual reinforcement and normalized weights–to derive useful weights. It utilizes connectivity and connection type information. Combined with a weighted support scheme, it offers an effective way to find the useful associations by taking into account both the significance of occurrence and the quality of features. The latter is included by connections to the transactions. Based on this new weight scheme, a CBA based classifier, LAC, is developed. The classifier is applied to two cases: the chemical fingerprint featured dataset and the bio-fingerprint featured dataset. Our experimental results show that although the weighting differs from the traditional RELIEF and SVM, it is able to capture the important features and afford good results. Especially for some sparse dataset, some significant features can be discovered by this link-based analysis which will be ignored by other methods. The link-based classifier discovers interesting associations of bioactivities with chemical features and potential relationships among diseases, for instance, relationship between phosphorus and bioactivity against T47D and potential relationship between breast cancer and leukemia. Our next step will apply this method to large semantic data sets to mine information from the RDF resources such as ChEMBL [56] and KEGG [57].
  23 in total

1.  Mining gene expression databases for association rules.

Authors:  Chad Creighton; Samir Hanash
Journal:  Bioinformatics       Date:  2003-01       Impact factor: 6.937

2.  Derivation and validation of toxicophores for mutagenicity prediction.

Authors:  Jeroen Kazius; Ross McGuire; Roberta Bursi
Journal:  J Med Chem       Date:  2005-01-13       Impact factor: 7.446

3.  Development and validation of a novel protein-ligand fingerprint to mine chemogenomic space: application to G protein-coupled receptors and their ligands.

Authors:  Nathanael Weill; Didier Rognan
Journal:  J Chem Inf Model       Date:  2009-04       Impact factor: 4.956

4.  Expression and activity of breast cancer resistance protein (BCRP) in de novo and relapsed acute myeloid leukemia.

Authors:  Dorina M van der Kolk; Edo Vellenga; George L Scheffer; Michael Müller; Susan E Bates; Rik J Scheper; Elisabeth G E de Vries
Journal:  Blood       Date:  2002-05-15       Impact factor: 22.113

5.  Preclinical testing of the Akt inhibitor triciribine in T-cell acute lymphoblastic leukemia.

Authors:  Camilla Evangelisti; Francesca Ricci; Pierluigi Tazzari; Francesca Chiarini; Michela Battistelli; Elisabetta Falcieri; Andrea Ognibene; Pasqualepaolo Pagliaro; Lucio Cocco; James A McCubrey; Alberto M Martelli
Journal:  J Cell Physiol       Date:  2011-03       Impact factor: 6.384

6.  Discovering protein-DNA binding sequence patterns using association rule mining.

Authors:  Kwong-Sak Leung; Ka-Chun Wong; Tak-Ming Chan; Man-Hon Wong; Kin-Hong Lee; Chi-Kong Lau; Stephen K W Tsui
Journal:  Nucleic Acids Res       Date:  2010-06-06       Impact factor: 16.971

7.  Data mining the NCI cancer cell line compound GI(50) values: identifying quinone subtypes effective against melanoma and leukemia cell classes.

Authors:  Kenneth A Marx; Philip O'Neil; Patrick Hoffman; M L Ujwal
Journal:  J Chem Inf Comput Sci       Date:  2003 Sep-Oct

8.  Identifying compound-target associations by combining bioactivity profile similarity search and public databases mining.

Authors:  Tiejun Cheng; Qingliang Li; Yanli Wang; Stephen H Bryant
Journal:  J Chem Inf Model       Date:  2011-08-18       Impact factor: 4.956

9.  Fast rule-based bioactivity prediction using associative classification mining.

Authors:  Pulan Yu; David J Wild
Journal:  J Cheminform       Date:  2012-11-23       Impact factor: 5.514

10.  Prediction of protein-protein interaction types using association rule based classification.

Authors:  Sung Hee Park; José A Reyes; David R Gilbert; Ji Woong Kim; Sangsoo Kim
Journal:  BMC Bioinformatics       Date:  2009-01-28       Impact factor: 3.169

View more
  1 in total

1.  Activation of the constitutive androstane receptor inhibits gluconeogenesis without affecting lipogenesis or fatty acid synthesis in human hepatocytes.

Authors:  Caitlin Lynch; Yongmei Pan; Linhao Li; Scott Heyward; Timothy Moeller; Peter W Swaan; Hongbing Wang
Journal:  Toxicol Appl Pharmacol       Date:  2014-05-27       Impact factor: 4.219

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.