Abstract
Relating chemical features to bioactivities is critical in molecular design and is used extensively in lead discovery and optimization. A variety of techniques from statistics, data mining and machine learning have been applied to this task. In this study, we use a collection of methods called associative classification mining (ACM), which are popular in the data mining community but have so far not been widely applied in cheminformatics. Specifically, classification based on predictive association rules (CPAR), classification based on multiple association rules (CMAR) and classification based on association rules (CBA) are applied to three datasets using various descriptor sets. Experimental evaluations on anti-tuberculosis (antiTB), mutagenicity and hERG (human Ether-a-go-go-Related Gene) blocker datasets show that these three methods are computationally scalable and appropriate for high-speed mining. They also provide accuracy and efficiency comparable to the commonly used Bayesian and support vector machine (SVM) methods, and they produce highly interpretable models.
Year: 2012 PMID: 23176548 PMCID: PMC3515428 DOI: 10.1186/1758-2946-4-29
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Figure 1. ACM framework.
A sample dataset with fingerprints as features
| C1 | 1 | 0 | 0 | 1 | 0 | Active | ||
| C2 | 0 | 0 | 1 | 0 | 0 | 1 | Inactive | |
| C3 | 1 | 0 | 1 | 0 | 0 | 1 | Active | |
| C4 | 0 | 1 | 1 | 0 | 0 | Active | ||
| … | … | … | … | … | … | … | … | … |
| Cn | 0 | 0 | 1 | 0 | 1 | 1 | 1 | Inactive |
The characteristics of the datasets used in this paper
| Dataset | hERG | antiTB | Mutagenicity |
| Source | PKKB | Prathipati et al. | Jeroen et al. |
| #Compounds | 806 | 3,779 | 4,337 |
| Diversity | 0.90 | 0.90 | 0.93 |
| Class | blocker/non-blocker | active/inactive | mutagen/non-mutagen |
Note: The diversity of each dataset is the average pairwise distance between its molecules, calculated from ECFP_6 fingerprints using Pipeline Pilot. The distance between two molecules is defined as (1 − similarity) based on the specified fingerprint.
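The diversity definition in the note can be sketched as follows. This is a minimal illustration, assuming fingerprints are represented as Python sets of on-bit indices and that the similarity is the Tanimoto coefficient (the note does not name the similarity measure). The toy fingerprints are hypothetical and are not ECFP_6 output from Pipeline Pilot.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)

def diversity(fingerprints):
    """Average (1 - similarity) over all unordered pairs of molecules."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two molecules")
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

# Hypothetical fingerprints for three made-up molecules.
toy = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
print(round(diversity(toy), 3))
```

A value close to 1 means that most pairs of molecules share few fingerprint bits, which is how the 0.90 to 0.93 values in the table can be read.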
Property descriptors used in the modeling
| ALogP, Molecular_Solubility, Molecular_SurfaceArea, Molecular_PolarSurfaceArea, Molecular_FractionalPolarSurfaceArea, Molecular_SASA, Molecular_PolarSASA, Molecular_FractionalPolarSASA, Molecular_SAVol, ChemAxon_LogP, ChemAxon_Polarizability, ChemAxon_Refractivity, ChemAxon_TPSA, FormalCharge |
| Num_Atoms, Num_Bonds, Num_Hydrogens, Num_NegativeAtoms, Num_RingBonds, Num_RotatableBonds, Num_BridgeBonds, Num_Rings, Num_RingAssemblies, Num_Chains, Num_ChainAssemblies, Molecular_Weight, Num_H_Acceptors, Num_H_Donors, ChemAxon_HBA, ChemAxon_HBD |
Note: All property descriptors are computed using Pipeline Pilot. The names and meanings of the property descriptors can be found in the Pipeline Pilot help documents. In most cases, the meaning can be inferred from the name itself. For example, ADMET_BBB_LEVEL is the ranking of LogBB values from the Accelrys blood–brain barrier penetration model: 0 is very high, 1 is high, 2 is medium, 3 is low and 4 is undefined, meaning that the molecule lies outside the confidence area of the regression model used to calculate LogBB.
Summary of the ACM methods used
| CBA | Classification based on association rules |
| CPAR | Classification based on predictive association rules |
| CMAR | Classification based on multiple association rules |
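All three methods build a classifier from rules of the form "feature set → class", scored by support and confidence. The sketch below illustrates that shared idea only. It enumerates candidate rules by brute force, which is an assumption made for brevity and not the rule-generation procedure of CBA, CMAR or CPAR. The function names, thresholds and toy compounds are hypothetical.

```python
from itertools import combinations

def mine_class_rules(transactions, labels, min_support=0.4,
                     min_confidence=0.6, max_len=2):
    """Enumerate rules 'itemset -> class' by brute force.

    support    = rows containing the itemset AND the class, over all rows
    confidence = that same count over the rows containing the itemset
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for k in range(1, max_len + 1):
        for itemset in combinations(items, k):
            wanted = set(itemset)
            covered = [lab for t, lab in zip(transactions, labels) if wanted <= t]
            if not covered:
                continue
            for cls in sorted(set(covered)):
                hits = covered.count(cls)
                support = hits / n
                confidence = hits / len(covered)
                if support >= min_support and confidence >= min_confidence:
                    rules.append((itemset, cls, support, confidence))
    # Simplified ranking: higher confidence, then higher support, then shorter rule.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0])))
    return rules

def classify(rules, transaction, default):
    """Predict with the highest-ranked rule whose antecedent matches."""
    for itemset, cls, _, _ in rules:
        if set(itemset) <= transaction:
            return cls
    return default

# Hypothetical compounds, each described by the fingerprint bits that are set.
transactions = [{"b1", "b4"}, {"b3", "b6"}, {"b1", "b3", "b6"},
                {"b2", "b3"}, {"b3", "b5", "b6"}]
labels = ["Active", "Inactive", "Active", "Active", "Inactive"]

rules = mine_class_rules(transactions, labels)
for itemset, cls, support, confidence in rules:
    print(itemset, "->", cls, f"support={support:.2f}", f"confidence={confidence:.2f}")
```

Using only the single best matching rule mirrors the simplest prediction strategy. Combining several matching rules is an alternative that this sketch does not show.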
Figure 2. Discretization results of the antiTB datasets.
F-score of all the data sets using different descriptors or fingerprints
| 61.82±2.96 | 63.00±2.97 | 70.78±3.51 | 69.84±2.14 | |||
| 69.50±1.84 | 63.27±3.10 | 66.09±5.66 | 63.49±2.51 | |||
| 71.08±1.72 | 68.93±2.23 | 63.62±2.35 | 67.25±1.52 | |||
| 74.26±2.87 | 69.07±3.49 | 77.37±5.25 | 75.48±5.11 | |||
| 70.04±3.99 | 68.82±6.13 | 66.75±2.74 | 74.57±5.44 | |||
| 72.67±3.80 | 66.41±3.66 | 75.77±4.16 | 71.91±5.38 | |||
| 62.62±6.73 | 70.08±9.64 | 72.75±12.26 | 69.20±6.84 | |||
| 75.73±13.35 | 72.78±10.39 | 79.65±6.37 | 80.73±8.18 | |||
| 60.13±9.98 | 73.18±11.89 | 77.72±9.70 | 74.77±8.28 | |||
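The cells in the table above have the form mean±deviation. The sketch below shows how such a cell could be produced. It assumes the balanced F-score (F1), per-fold scores from cross-validation, the sample standard deviation, and values reported in percent. None of these choices is stated in this excerpt, and the labels and fold scores are hypothetical.

```python
from statistics import mean, stdev

def f_score(y_true, y_pred, positive):
    """Balanced F-score: harmonic mean of precision and recall for one class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def summarize(fold_scores):
    """Format per-fold scores as 'mean±sd' in percent."""
    pct = [100 * s for s in fold_scores]
    return f"{mean(pct):.2f}±{stdev(pct):.2f}"

# Hypothetical predictions for one fold.
y_true = ["active", "active", "active", "inactive", "inactive"]
y_pred = ["active", "active", "inactive", "active", "inactive"]
print(f_score(y_true, y_pred, positive="active"))

# Hypothetical F-scores from three folds.
print(summarize([0.60, 0.70, 0.80]))
```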
Accuracy of Y-randomization on antiTB_MDL
| 44.25±19.38 | 43.08±3.97 | 44.03±4.68 | |
| 40.35±19.54 | 49.04±2.92 | 51.00±2.81 | |
| 39.27±11.29 | 48.98±3.94 | 45.77±3.61 | |
| 57.83±8.73 | 50.66±2.37 | 48.24±3.84 | |
| 57.85±6.11 | 52.62±3.11 | 51.05±5.45 |
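Y-randomization retrains a model on shuffled class labels and checks whether its accuracy falls toward chance level. If it does, the accuracy obtained with the real labels is unlikely to be a chance artifact. The sketch below is a generic version of that procedure. The single-bit learner is a hypothetical stand-in for the ACM, Bayesian and SVM models, and the data, number of rounds and seed are made up.

```python
import random

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def train_single_bit(X, y):
    """Stand-in learner: pick the one bit (and class mapping) that best fits y."""
    classes = sorted(set(y))
    best = None
    for j in range(len(X[0])):
        for on_cls in classes:
            for off_cls in classes:
                pred = [on_cls if x[j] else off_cls for x in X]
                acc = accuracy(y, pred)
                if best is None or acc > best[0]:
                    best = (acc, j, on_cls, off_cls)
    _, j, on_cls, off_cls = best
    return lambda x: on_cls if x[j] else off_cls

def y_randomization(train_fn, X_train, y_train, X_test, y_test,
                    n_rounds=20, seed=0):
    """Retrain on shuffled training labels; score on the untouched test labels."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(y_train)
        rng.shuffle(shuffled)
        model = train_fn(X_train, shuffled)
        scores.append(accuracy(y_test, [model(x) for x in X_test]))
    return scores

# Hypothetical data in which bit 0 fully determines the class.
X_train = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 1]]
y_train = ["active", "active", "inactive", "inactive", "active", "inactive"]
X_test = [[1, 0, 0], [0, 1, 0], [1, 1, 1], [0, 1, 1]]
y_test = ["active", "inactive", "active", "inactive"]

real_model = train_single_bit(X_train, y_train)
real_acc = accuracy(y_test, [real_model(x) for x in X_test])
random_scores = y_randomization(train_single_bit, X_train, y_train, X_test, y_test)
print("real labels:", real_acc)
print("shuffled labels, mean:", sum(random_scores) / len(random_scores))
```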
Figure 3. Rank of the levels of the properties for the antiTB dataset.
Selected association rules for the antiTB dataset
| No. | Rule | Support | Confidence |
| MDL | | | |
| 1 | [#7]~[#6]~[#8] AND *!@[#8]!@* → class = active | 23.10% | 75.14% |
| 2 | NOT [#7]~*~*~[#8] AND NOT [#7]!:*:* → class = inactive | 21.38% | 75.50% |
| 3 | [#7]~[#6]~[#8] AND *~*(~*)(~*)~* → class = active | 18.95% | 81.98% |
| 4 | [#7]~*~[CH2]~* AND [#8]~[#6]~[#8] → class = active | 18.37% | 76.80% |
| Property | | | |
| 5 | ALogP[0.985 - 4.446] AND Num_RingBonds[>19] AND ADMET_CYP2D6[=0] → class = active | 9.55% | 74.64% |
| 6 | Num_Hydrogens[18 - 50] AND Molecular_Solubility[-12.036 - -7.198] AND Molecular_SASA[690.864 - 1058.920] → class = inactive | 9.03% | 78.31% |
| 7 | Molecular_FractionalPolarSASA[0.140 - 0.312] AND Molecular_Solubility[-12.036 - -7.198] AND ChemAxon_HBD[>3] → class = inactive | 9.00% | 91.84% |
| 8 | Num_Bonds[<30] AND ChemAxon_TPSA[<46.170] AND Molecular_FractionalPolarSASA[<0.140] → class = active | 9.00% | 81.57% |
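The property rules in the table are conjunctions of interval conditions on discretized descriptors. The sketch below shows one way to check rule 8 against a molecule. The dictionary layout and function name are hypothetical, the descriptor values are made up, and treating every bound as exclusive is an assumption because the table does not state how interval boundaries are handled.

```python
import math

# Rule 8 from the table, written as (low, high) bounds per descriptor.
# Exclusive bounds are an assumption, not something the table specifies.
RULE_8 = {
    "antecedent": {
        "Num_Bonds": (-math.inf, 30),
        "ChemAxon_TPSA": (-math.inf, 46.170),
        "Molecular_FractionalPolarSASA": (-math.inf, 0.140),
    },
    "consequent": "active",
}

def rule_fires(rule, descriptors):
    """True if every descriptor in the antecedent lies inside its interval."""
    for name, (low, high) in rule["antecedent"].items():
        value = descriptors.get(name)
        if value is None or not (low < value < high):
            return False
    return True

# Hypothetical descriptor values for two made-up molecules.
mol_a = {"Num_Bonds": 24, "ChemAxon_TPSA": 37.3,
         "Molecular_FractionalPolarSASA": 0.09}
mol_b = {"Num_Bonds": 41, "ChemAxon_TPSA": 37.3,
         "Molecular_FractionalPolarSASA": 0.09}

for name, mol in (("mol_a", mol_a), ("mol_b", mol_b)):
    if rule_fires(RULE_8, mol):
        print(name, "-> class =", RULE_8["consequent"])
    else:
        print(name, "-> rule does not apply")
```

Because each condition names a descriptor and a numeric range, a matching rule can be read directly as a statement about the molecule, which is the interpretability the abstract refers to.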