Eddie Y T Ma, Christopher J F Cameron, Stefan C Kremer.
Abstract
This paper demonstrates how a Neural Grammar Network learns to classify and score molecules for a variety of tasks in chemistry and toxicology. In addition to a more detailed analysis of datasets previously studied, we introduce three new datasets (BBB, FXa, and toxicology) to show the generality of the approach. A new experimental methodology is developed and applied to both the new and the previously studied datasets. This methodology is rigorous and statistically grounded, and culminates in a Wilcoxon significance test that demonstrates the effectiveness of the system. We further include a complete generalization of the specific technique to arbitrary grammars and datasets, using a mathematical abstraction that allows researchers in different domains to apply the method to their own work.
Year: 2010 PMID: 21034429 PMCID: PMC2966291 DOI: 10.1186/1471-2105-11-S8-S4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Artist’s conception of isopentanol molecule
Subset of SMILES grammar rules (derived from [10])
| smiles | ← | chain | 1 |
| chain | ← | atom | 2 |
| chain | ← | atom chain | 3 |
| chain | ← | atom Nbranch chain | 4 |
| atom | ← | C | 5 |
| atom | ← | O | 6 |
| Nbranch | ← | branch | 7 |
| branch | ← | ( chain_rparen | 8 |
| chain_rparen | ← | chain rparen | 9 |
| rparen | ← | ) | 10 |
Derivation of the SMILES string CC(C)CCO (Isopentanol) from the root smiles “symbol”
| smiles | ← | chain | Rule 1 |
| | ← | atom chain | Rule 3 |
| | ← | C chain | Rule 5 |
| | ← | C atom Nbranch chain | Rule 4 |
| | ← | C C Nbranch chain | Rule 5 |
| | ← | C C branch chain | Rule 7 |
| | ← | C C ( chain_rparen chain | Rule 8 |
| | ← | C C ( chain rparen chain | Rule 9 |
| | ← | C C ( atom rparen chain | Rule 2 |
| | ← | C C ( C rparen chain | Rule 5 |
| | ← | C C ( C ) chain | Rule 10 |
| | ← | C C ( C ) atom chain | Rule 3 |
| | ← | C C ( C ) C chain | Rule 5 |
| | ← | C C ( C ) C atom chain | Rule 3 |
| | ← | C C ( C ) C C chain | Rule 5 |
| | ← | C C ( C ) C C atom | Rule 2 |
| | ← | C C ( C ) C C O | Rule 6 |
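The derivation above can be replayed mechanically. The following is a minimal sketch, assuming a leftmost derivation (each rule rewrites the leftmost nonterminal, which matches the order of the steps in the table). The names `RULES`, `derive`, and `ISOPENTANOL_RULES` are illustrative and do not come from the paper.

```python
# Grammar subset from the rules table: rule number -> (left-hand symbol, right-hand symbols).
RULES = {
    1: ("smiles", ["chain"]),
    2: ("chain", ["atom"]),
    3: ("chain", ["atom", "chain"]),
    4: ("chain", ["atom", "Nbranch", "chain"]),
    5: ("atom", ["C"]),
    6: ("atom", ["O"]),
    7: ("Nbranch", ["branch"]),
    8: ("branch", ["(", "chain_rparen"]),
    9: ("chain_rparen", ["chain", "rparen"]),
    10: ("rparen", [")"]),
}
NONTERMINALS = {lhs for lhs, _ in RULES.values()}

# Rule numbers in the order they appear in the derivation table.
ISOPENTANOL_RULES = [1, 3, 5, 4, 5, 7, 8, 9, 2, 5, 10, 3, 5, 3, 5, 2, 6]


def derive(rule_sequence, start="smiles"):
    """Apply each rule to the leftmost nonterminal (assumed derivation order)."""
    form = [start]
    for number in rule_sequence:
        lhs, rhs = RULES[number]
        # Locate the leftmost symbol that can still be rewritten.
        index = next((i for i, s in enumerate(form) if s in NONTERMINALS), None)
        if index is None or form[index] != lhs:
            raise ValueError(f"Rule {number} does not apply to {' '.join(form)}")
        form[index:index + 1] = rhs
    return form


smiles_string = "".join(derive(ISOPENTANOL_RULES))
print(smiles_string)
```

Recording the rule sequence rather than only the final string is what makes the parse tree recoverable: each applied rule corresponds to one internal node of the tree.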
Figure 2. Parse tree for isopentanol
Figure 3. Weight layers for each rule given in Table 1
Figure 4. Neural grammar network for isopentanol
A summary of the final experimental parameters used in both classification experimental designs, in regression Leave-5%-Out internal cross validation, and in designed regression test sets.
| Parameter | Classification | Regression 5%-CV | Regression Designed |
|---|---|---|---|
| Training Constant | 0.60 | 0.30 | 0.33 |
| Momentum Coefficient | 0.90 | 0.10 | 0.66 |
| Convergence Threshold | 0.05 | 0.03 | 0.04 |
| Initial Random Weight Values | [−1.6, −1.0] ∪ [1.0, 1.6] | [−1.6, −1.0] ∪ [1.0, 1.6] | [−1.2, −0.4] ∪ [0.4, 1.2] |
| Maximum Number of Epochs | 5000 | 7500 | 7500 |
| SMILES-NGN Hidden Nodes | 8 | 8 | 8 |
| InChI-NGN Hidden Nodes | 8 | 8 | 8 |
| Output Scale | (0.2, 0.8) | [0.2, 0.8] | [0.2, 0.8] |
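Two rows of the parameter table lend themselves to a short illustration: the initial weights drawn from a union of two intervals that excludes values near zero, and the output scale. The sketch below rests on assumptions the table does not confirm: weights are drawn uniformly with either sign equally likely, and the output scale is a linear min-max mapping of the target activity. The function names are hypothetical.

```python
import random


def sample_initial_weight(rng, low=1.0, high=1.6):
    """Draw one weight from [-high, -low] or [low, high].

    Assumption: uniform magnitude and a fair coin for the sign.
    """
    magnitude = rng.uniform(low, high)
    sign = rng.choice((-1.0, 1.0))
    return sign * magnitude


def scale_target(value, data_min, data_max, out_min=0.2, out_max=0.8):
    """Map an activity value linearly into [out_min, out_max] (assumed mapping)."""
    fraction = (value - data_min) / (data_max - data_min)
    return out_min + fraction * (out_max - out_min)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
weights = [sample_initial_weight(rng) for _ in range(1000)]

# Illustrative range only: an activity of 6.0 on a scale running from 2.1 to 9.9.
scaled = scale_target(6.0, 2.1, 9.9)
```

Keeping targets inside [0.2, 0.8] avoids asking a sigmoid-style output unit to reach its saturated extremes, which is one common reason for such a scale.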
A summary of the datasets used in classification experiments. † The range and threshold information was not provided in [2] and [3] for the BBB and FXa datasets, respectively.
| Dataset | Dataset Full Name | Size | | | Range | Threshold | Units | Reference |
|---|---|---|---|---|---|---|---|---|
| AR | Androgen Receptor | 202 | 146 | 56 | [<−3.56, 2.27] | −3.56 | logRBA | [ |
| BBB | Blood Brain Barrier | 415 | 276 | 139 | † | † | pK | [ |
| BZR | Benzodiazepine Receptor | 405 | 230 | 175 | [<4.2, 9.5] | 7.0 | pIC50 | [ |
| Cox2 | Cyclooxygenase 2 | 467 | 273 | 194 | [<4.0, 9.0] | 6.5 | pIC50 | [ |
| DHFR | Dihydrofolate Reductase | 756 | 302 | 454 | [<3.0, 10.5] | 6.0 | pIC50 | [ |
| ER | Estrogen Receptor | 232 | 131 | 101 | [<−4.50, 2.60] | −4.50 | logRBA | [ |
| FXa | Factor Xa | 435 | 279 | 156 | † | † | pK | [ |
A summary of the performance on the BZR dataset with results as reported by [1] compared to the InChI-NGN in this work. Missing MCC values in the table reflect missing information in the primary literature.
| Design | Method | | SE(%) | SP(%) | MCC |
|---|---|---|---|---|---|
| 40% Test Set | SIMCA | 72 | 68 | 76 | - |
| | RP | 69 | 64 | 74 | - |
| | SFGA | 75.5 | 70 | 81 | - |
| | InChI-NGN | 63.2 | 62.1 | 64.6 | 0.265 |
| Leave-20%-Out | SIMCA | 71.5±11.0 | 73±10 | 70±12 | - |
| | RP | 65.5±12 | 68±12 | 65±12 | - |
| | SFGA | 68.5±12 | 69±11 | 68±13 | - |
| | InChI-NGN | 69.9±1.98 | 73.4±1.87 | 65.3±2.29 | 0.387±0.041 |
A summary of the performance on the Cox2 dataset with results as reported by [1] compared to the InChI-NGN in this work. Missing MCC values in the table reflect missing information in the primary literature.
| Design | Method | | SE(%) | SP(%) | MCC |
|---|---|---|---|---|---|
| 40% Test Set | SIMCA | 71 | 75 | 67 | - |
| | RP | 71 | 79 | 63 | - |
| | SFGA | 73.5 | 75 | 72 | - |
| | InChI-NGN | 65.1 | 62.5 | 68.4 | 0.307 |
| Leave-20%-Out | SIMCA | 78±9 | 79±9 | 77±9 | - |
| | RP | 69.5±12 | 72±12 | 67±12 | - |
| | SFGA | 74±9.5 | 76±9 | 72±10 | - |
| | InChI-NGN | 72.2±1.36 | 74.4±1.01 | 68.7±2.45 | 0.421±0.275 |
A summary of the performance on the DHFR dataset with results as reported by [1] compared to the InChI-NGN in this work. Missing MCC values in the table reflect missing information in the primary literature.
| Design | Method | | SE(%) | SP(%) | MCC |
|---|---|---|---|---|---|
| 40% Test Set | SIMCA | 75.5 | 74 | 71 | - |
| | RP | 65 | 57 | 73 | - |
| | SFGA | 68.5 | 71 | 66 | - |
| | InChI-NGN | 73.2 | 73.1 | 100.0 | 0.029 |
| Leave-20%-Out | SIMCA | 63.5±9.5 | 57±10 | 70±9 | - |
| | RP | 61±12 | 57±12 | 65±12 | - |
| | SFGA | 64.5±10.5 | 65±11.0 | 64±10.0 | - |
| | InChI-NGN | 74.8±1.63 | 70.3±2.44 | 77.5±1.72 | 0.471±0.035 |
A summary of predictive scores for the BBB dataset as presented in the work by [2] followed by the performance of the InChI-NGN. Missing ranges for all values other than the SVM work and our own are due to missing information in the primary literature.
| Method | | SE(%) | SP(%) | MCC |
|---|---|---|---|---|
| LR | 57.1 | 63.6 | 42.8 | 0.063 |
| LDA | 46.8 | 40.0 | 58.4 | −0.067 |
| C4.5 DT | 73.8 | 83.7 | 54.9 | 0.398 |
| k-NN | 70.8 | 77.0 | 58.0 | 0.348 |
| PNN | 70.3 | 76.2 | 57.8 | 0.357 |
| SVM | 71.0±4.53 | 89.9±3.16 | 64.3±13.07 | 0.524±0.117 |
| LR RFE | 71.0 | 83.9 | 46.4 | 0.321 |
| LDA RFE | 71.2 | 78.2 | 58.3 | 0.360 |
| C4.5 DT RFE | 74.3 | 80.3 | 62.8 | 0.433 |
| k-NN RFE | 77.1 | 85.5 | 61.4 | 0.477 |
| PNN RFE | 76.1 | 84.3 | 62.1 | 0.481 |
| SVM RFE | 83.7±3.90 | 88.6±7.01 | 75.0±12.83 | 0.645±0.080 |
| InChI-NGN | 72.0±2.33 | 77.6±1.72 | 59.0±3.91 | 0.355±0.052 |
A comparison of the work done by [3] and [4] against the InChI-NGN for the FXa dataset. Missing values are due to lacking information in the primary literature.
| Design | Method | | SE(%) | SP(%) | MCC |
|---|---|---|---|---|---|
| ⅔ Train, ⅓ Test | A-MIF | 88 | - | - | - |
| | MIF-MIF | 84 | - | - | - |
| | MK1 | 94.5 | 98.7 | 89.5 | - |
| | MK2 | 95.2 | 98.9 | 87.7 | - |
| | InChI-NGN | 83.8 | 84.3 | 82.9 | 0.657 |
| Leave-20%-Out | InChI-NGN | 86.4±2.50 | 88.5±0.87 | 82.7±0.05 | 0.705±0.052 |
Summary of the Decision Forest performance [5] against the NGN performance on ER.
| Design | Method | | SE(%) | SP(%) | MCC |
|---|---|---|---|---|---|
| Leave-10%-Out | Decision Forest | 81.9 | - | - | - |
| Leave-20%-Out | SMILES-NGN | 69.3±2.28 | 71.7±2.39 | 66.0±2.74 | 0.373±0.047 |
| | InChI-NGN | 66.1±1.70 | 69.5±2.65 | 61.6±1.33 | 0.309±0.040 |
Summary of the NGN performance on AR.
| Design | Method | | SE(%) | SP(%) | MCC |
|---|---|---|---|---|---|
| Leave-20%-Out | SMILES-NGN | 70.3±0.91 | 71.8±0.46 | 37.5±11.4 | 0.052±0.033 |
| | InChI-NGN | 76.3±3.20 | 80.7±2.86 | 60.6±7.02 | 0.380±0.098 |
A summary of the real-valued ranges of activity for datasets used in regression experiments.
| Dataset | Dataset Full Name | Size | Range | Units | Reference |
|---|---|---|---|---|---|
| ACE | Angiotensin Converting Enzyme | 114 | [2.1, 9.9] | pIC50 | [ |
| AChE | Acetylcholinesterase | 111 | [4.3, 9.5] | pIC50 | [ |
| AR | Androgen Receptor | 146 | [−3.56, 2.27] | logRBA | [ |
| BZR | Benzodiazepine Receptor | 163 | [5.5, 8.9] | pIC50 | [ |
| Cox2 | Cyclooxygenase-2 | 282 | [4.0, 9.0] | pIC50 | [ |
| DHFR | Dihydrofolate Reductase | 397 | [3.3, 9.8] | pIC50 | [ |
| ER | Estrogen Receptor | 131 | [−4.50, 2.60] | logRBA | [ |
| GPB | Glycogen Phosphorylase B | 66 | [1.3, 6.8] | pK | [ |
| Therm | Thermolysin | 76 | [0.5, 10.2] | pK | [ |
| Thr | Thrombin | 88 | [4.4, 8.5] | pK | [ |
The scores for converging Leave-5%-Out Cross Validation regression experiments.
| Grammar | Dataset | Dataset Full Name | Score |
|---|---|---|---|
| SMILES | ACE | Angiotensin Converting Enzyme | 0.386±29.20 |
| | AR | Androgen Receptor | 0.382±18.29 |
| | ER | Estrogen Receptor | 0.349±21.82 |
| | GPB | Glycogen Phosphorylase B | 0.253±43.18 |
| | AChE | Acetylcholinesterase | 0.193±35.26 |
| | Therm | Thermolysin | −0.288±210.30 |
| InChI | ER | Estrogen Receptor | 0.476±11.16 |
| | ACE | Angiotensin Converting Enzyme | 0.383±15.09 |
| | Cox2 | Cyclooxygenase-2 | 0.279±8.97 |
| | Therm | Thermolysin | 0.247±25.96 |
| | AR | Androgen Receptor | 0.119±79.50 |
| | Thr | Thrombin | 0.088±99.79 |
| | AChE | Acetylcholinesterase | 0.061±39.70 |
| | GPB | Glycogen Phosphorylase B | −0.180±104.87 |
A summary of scores for the NGN compared to other methods on datasets described for regression in this work.
| Dataset | SMILES-NGN | InChI-NGN | CoMFA | EVA | HQSAR | 2.5D | MK1 | MK2 |
|---|---|---|---|---|---|---|---|---|
| GPB | 0.79±0.23 | 0.48±2.90 | 0.42 | 0.49 | 0.58 | 0.04 | — | — |
| ACE | 0.74±0.46 | 0.78±0.31 | 0.49 | 0.36 | 0.30 | 0.51 | 0.58 | 0.55 |
| AChE | 0.68±0.80 | 0.60±0.78 | 0.47 | 0.28 | 0.37 | 0.16 | 0.50 | 0.48 |
| Cox2 | 0.56±1.33 | 0.37±4.28 | 0.29 | 0.17 | 0.27 | 0.27 | — | — |
| Therm | 0.47±2.72 | 0.52±1.59 | 0.54 | 0.36 | 0.53 | 0.07 | — | — |
| BZR | −0.29±17.30 | 0.11±8.74 | 0.00 | 0.16 | 0.17 | 0.20 | 0.34 | 0.36 |
| Thr | — | 0.70±0.72 | 0.63 | 0.11 | −0.25 | 0.28 | — | — |
| DHFR | — | 0.66±0.94 | 0.59 | 0.57 | 0.63 | 0.49 | 0.64 | 0.65 |
Description of the neural networks’ parameters used for the experiment
| RMSE Threshold | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| ANN | 3 | Same as feature vector | 2 | 1 | 0.9 | 0.3 | (−0.3,0.3) | 50,000 | 0.02 | 0.05 |
| NGN | Dynamic | Dependent on input token | 12 | 1 | 0.3 | 0.3 | (−1.6,−1.0) or (1.0,1.6) | 10,000 | 0.03 | 0.05 |
*activation value of 1.0
*different numbers of hidden nodes were tested for best performance; 12 hidden nodes were used with the NGN for the toxicity dataset
Figure 5. Comparison of group method determined epsilon values.
Figure 6. Comparison of group method determined standard deviation values.
Figure 7. Comparison of random method determined epsilon values.
Figure 8. Comparison of random method determined standard deviation values.