Hamse Y Mussa, David Marcus, John B O Mitchell, Robert C Glen.
Abstract
BACKGROUND: In a recent paper, Mussa, Mitchell and Glen (MMG) demonstrated mathematically that the "Laplacian Corrected Modified Naïve Bayes" (LCMNB) algorithm can be viewed as a variant of the so-called Standard Naïve Bayes (SNB) scheme in which the role played by the absence of compound features in assigning a compound to its appropriate class is ignored. MMG also proffered guidelines regarding the conditions under which this omission may hold. Using three data sets, the present paper examines the validity of these guidelines in practice. The paper also extends MMG's work and introduces a new version of the SNB classifier: "Tapered Naïve Bayes" (TNB). TNB neither discards the role of the absence of a feature out of hand nor considers it fully. Hence, TNB encapsulates both SNB and LCMNB.
Keywords: Classification; Features; Naïve Bayes; Tapering
Year: 2015 PMID: 26075027 PMCID: PMC4464057 DOI: 10.1186/s13321-015-0075-5
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
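The distinction the abstract draws between counting and ignoring absent features can be illustrated with a minimal sketch. It assumes binary fingerprint features and a Bernoulli-style naïve Bayes score; the function names, the `smoothing` parameter and the `absence_weight` parameter are hypothetical and are not taken from the paper. With `absence_weight=1.0` the score counts absent features in full (an SNB-like score), with `absence_weight=0.0` it ignores them (only the absence-ignoring idea, not the published LCMNB estimator), and intermediate values are one conceivable way to taper between the two. The paper's actual TNB definition is not given in this record and may differ.

```python
import math

def fit_bernoulli(X, y, cls, smoothing=1.0):
    """Estimate log P(cls) and P(feature_j = 1 | cls) from binary rows.

    Laplace smoothing keeps every probability strictly between 0 and 1.
    """
    rows = [x for x, label in zip(X, y) if label == cls]
    n = len(rows)
    log_prior = math.log(n / len(X))
    probs = [(sum(r[j] for r in rows) + smoothing) / (n + 2 * smoothing)
             for j in range(len(X[0]))]
    return log_prior, probs

def log_score(x, log_prior, probs, absence_weight=1.0):
    """Class log-score; absence_weight scales the absent-feature terms.

    absence_weight is an illustrative assumption, not the paper's parameter.
    """
    score = log_prior
    for xj, p in zip(x, probs):
        if xj:
            score += math.log(p)                          # feature present
        else:
            score += absence_weight * math.log(1.0 - p)   # feature absent
    return score

# Toy data: four compounds, three binary features, two classes.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y = ["a", "a", "b", "b"]
log_prior, probs = fit_bernoulli(X, y, "a")
query = [1, 0, 0]
full = log_score(query, log_prior, probs, absence_weight=1.0)
tapered = log_score(query, log_prior, probs, absence_weight=0.5)
presence_only = log_score(query, log_prior, probs, absence_weight=0.0)
```

Because each absent-feature term is the logarithm of a probability below 1, lowering the weight can only raise the score of a compound with absent features. The weight therefore matters most when many features are absent, which is the situation MMG's guidelines address.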
Enzyme data set: comprising 4,658 ligands to classify according to the enzyme they inhibit
| Activity class | Target ID | No. of active compounds |
|---|---|---|
| Vascular endothelial growth factor 2 | 10980 | 268 |
| Carbonic anhydrase II | 15 | 262 |
| 11-beta-hydroxysteroid dehydrogenase 1 | 11489 | 233 |
| Carbonic anhydrase I | 10193 | 228 |
| Beta-secretase 1 | 12252 | 212 |
| Dipeptidyl peptidase IV | 11140 | 208 |
| Epidermal growth factor erbB1 | 9 | 202 |
| MAP kinase p38 alpha | 10188 | 197 |
| Carbonic anhydrase IX | 12952 | 184 |
| Cyclooxygenase-2 | 194 | 180 |
| Acetylcholinesterase | 93 | 158 |
| Coagulation factor X | 126 | 156 |
| Histone deacetylase 1 | 12697 | 156 |
| Monoamine oxidase B | 104 | 148 |
| Thrombin | 11 | 147 |
| Renin | 11225 | 139 |
| Epoxide hydratase | 11727 | 134 |
| Matrix metalloproteinase 13 | 11024 | 125 |
| Cathepsin K | 10495 | 116 |
| Cathepsin S | 11534 | 112 |
| Matrix metalloproteinase-2 | 13001 | 112 |
| Protein-tyrosine phosphatase 1B | 13061 | 110 |
| Serine–threonine-protein kinase AKT | 12666 | 109 |
| Butyrylcholinesterase | 10532 | 104 |
| Cytochrome P450 19A1 | 65 | 102 |
| Receptor protein-tyrosine kinase erbB2 | 188 | 99 |
| Tyrosine-protein kinase SRC | 10434 | 98 |
| Hepatocyte growth factor receptor | 11451 | 94 |
| Matrix metalloproteinase-1 | 13000 | 91 |
| Glycogen synthase kinase-3 beta | 10197 | 89 |
| Carbonic anhydrase XII | 12209 | 85 |
Columns 1, 2 and 3 denote the protein, the protein identifier (ID) in our dataset and the number of ligands reported for each protein, respectively.
Membrane-receptor data set: comprising 5,031 ligands to classify according to the biological activity they induce on these membrane receptors
| Activity class | Target ID | No. of active compounds |
|---|---|---|
| Adenosine A2a receptor | 252 | 424 |
| Adenosine A3 receptor | 280 | 356 |
| Adenosine A1 receptor | 114 | 322 |
| Cannabinoid CB2 receptor | 259 | 319 |
| Histamine H3 receptor | 10280 | 314 |
| Cannabinoid CB1 receptor | 87 | 304 |
| Dopamine D2 receptor | 72 | 281 |
| Mu opioid receptor | 129 | 269 |
| Kappa opioid receptor | 137 | 244 |
| Delta opioid receptor | 136 | 223 |
| Melanocortin receptor 4 | 10142 | 220 |
| Serotonin 1a (5-HT1a) receptor | 51 | 215 |
| Dopamine D3 receptor | 130 | 213 |
| Melanin-concentrating hormone receptor 1 | 19905 | 206 |
| Serotonin 6 (5-HT6) receptor | 10627 | 173 |
| Serotonin 2a (5-HT2a) receptor | 107 | 155 |
| C-C chemokine receptor type 2 | 11575 | 150 |
| Adenosine A2b receptor | 278 | 136 |
| G protein-coupled receptor 44 | 20174 | 117 |
| Serotonin 2c (5-HT2c) receptor | 108 | 114 |
| Histamine H4 receptor | 11290 | 96 |
| C-C chemokine receptor type 5 | 10580 | 91 |
| Nociceptin receptor | 138 | 89 |
Columns 1, 2 and 3 denote the protein, the protein identifier (ID) in our dataset and the number of ligands reported for each protein, respectively.
Mixed class data set: comprising 1,149 ligands to classify according to the biological activity they induce on these transporters, a transcription factor and an ion channel
| Activity class | Target ID | No. of active compounds |
|---|---|---|
| Serotonin transporter | 121 | 222 |
| Norepinephrine transporter | 100 | 146 |
| Dopamine transporter | 155 | 136 |
| Sodium/glucose cotransporter 2 | 20092 | 102 |
| hERG | 165 | 448 |
| Peroxisome proliferator-activated receptor gamma | 133 | 95 |
Columns 1, 2 and 3 denote the protein, the protein identifier (ID) in our dataset and the number of ligands reported for each protein, respectively.
Figure 1. Plots showing the MCC values of the classification performances of the SNB (red line) and LCMNB (blue line) classifiers versus the number of features employed for the three datasets: (a) enzyme data set; (b) membrane-receptor data set; and (c) mixed class data set.
Figure 2. Plots showing the MCC values of the classification performances of TNB classifiers (per value) versus the number of features employed for the three datasets: (a) enzyme data set; (b) membrane-receptor data set; and (c) mixed class data set.
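For readers unfamiliar with the metric in these figures and the tables that follow, the Matthews correlation coefficient (MCC) can be computed from a binary confusion matrix as sketched below. The function name is hypothetical. Returning 0.0 when the denominator vanishes is a common convention assumed here, not something stated in the source. The sketch also assumes that a per-target MCC treats one target's ligands as positives and all others as negatives.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined MCC (zero denominator) is reported as 0.0.
    return numerator / denominator if denominator else 0.0

perfect = mcc(tp=50, tn=50, fp=0, fn=0)
chance = mcc(tp=25, tn=25, fp=25, fn=25)
inverted = mcc(tp=0, tn=0, fp=50, fn=50)
```

Unlike plain accuracy, MCC uses all four confusion-matrix cells, which makes it informative for imbalanced problems such as picking out one target's ligands from several thousand compounds.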
Enzyme data set: column 1 denotes the target identifier
| Target ID | TNB MCC | LCMNB MCC |
|---|---|---|
| 10980 | 0.903 | 0.869 |
| 15 | 0.424 | 0.375 |
| 11489 | 0.970 | 0.963 |
| 10193 | 0.345 | 0.286 |
| 12252 | 0.980 | 0.980 |
| 11140 | 0.980 | |
| 9 | 0.831 | 0.806 |
| 10188 | 0.960 | 0.949 |
| 12952 | 0.539 | 0.429 |
| 194 | 0.986 | 0.980 |
| 93 | 0.749 | 0.746 |
| 126 | 0.973 | 0.960 |
| 12697 | 0.987 | |
| 104 | 0.939 | |
| 11 | 0.973 | 0.963 |
| 11225 | 0.985 | 0.968 |
| 11727 | 0.928 | |
| 11024 | 0.711 | 0.682 |
| 10495 | 0.871 | 0.870 |
| 11534 | 0.833 | 0.819 |
| 13001 | 0.499 | 0.473 |
| 13061 | 0.981 | 0.964 |
| 12666 | 0.981 | 0.967 |
| 10532 | 0.724 | 0.714 |
| 65 | 0.922 | 0.912 |
| 188 | 0.701 | 0.693 |
| 10434 | 0.927 | 0.813 |
| 11451 | 0.936 | 0.867 |
| 13000 | 0.784 | 0.751 |
| 10197 | 0.919 | 0.887 |
| 12209 | 0.0267 | |
Columns 2 and 3 represent the MCC values obtained by TNB () and LCMNB for each of the 31 targets. L is the number of features employed.
Membrane-receptor data set: column 1 denotes the target identifier
| Target ID | TNB MCC | LCMNB MCC |
|---|---|---|
| 252 | 0.857 | 0.853 |
| 280 | 0.862 | 0.852 |
| 114 | 0.633 | 0.579 |
| 259 | 0.823 | 0.781 |
| 10280 | 0.962 | 0.959 |
| 87 | 0.803 | 0.781 |
| 72 | 0.639 | 0.582 |
| 129 | 0.455 | 0.430 |
| 137 | 0.525 | 0.523 |
| 136 | 0.644 | |
| 10142 | 0.986 | 0.981 |
| 51 | 0.794 | 0.778 |
| 130 | 0.785 | 0.777 |
| 19905 | 0.971 | 0.970 |
| 10627 | 0.958 | |
| 107 | 0.767 | 0.759 |
| 11575 | 0.947 | 0.929 |
| 278 | 0.749 | 0.717 |
| 20174 | 0.953 | 0.844 |
| 108 | 0.790 | 0.780 |
| 11290 | 0.919 | 0.904 |
| 10580 | 0.941 | 0.759 |
| 138 | 0.950 | 0.840 |
Columns 2 and 3 represent the MCC values obtained by TNB () and LCMNB for each of the 23 targets. L is the number of features utilised.
Mixed class data set: column 1 denotes the target identifier
| Target ID | TNB MCC | LCMNB MCC |
|---|---|---|
| 121 | 0.863 | 0.838 |
| 100 | 0.827 | 0.817 |
| 155 | 0.866 | |
| 20092 | 1.000 | 0.995 |
| 165 | 0.975 | 0.926 |
| 133 | 0.994 | 0.830 |
Columns 2 and 3 represent the MCC values obtained by TNB () and LCMNB for each of the 6 targets. L is the number of features utilised.