Elena S Salmina, Norbert Haider, Igor V Tetko.
Abstract
The article describes a classification system termed "extended functional groups" (EFG). EFG extends the set of functional groups previously used by the CheckMol software and additionally covers heterocyclic compound classes and periodic table groups. The functional groups are defined as SMARTS patterns and are available as part of the ToxAlerts tool (http://ochem.eu/alerts) of the On-line CHEmical database and Modeling environment (OCHEM) platform. The article describes the motivation and the main ideas behind this extension and demonstrates that EFG can be used efficiently to develop and interpret structure-activity relationship models.
Keywords: chemical functional groups; chemoinformatics analysis; data interpretation; heterocyclic compounds; machine learning
Year: 2015 PMID: 26703557 PMCID: PMC6273096 DOI: 10.3390/molecules21010001
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Table 1. Comparison of the performance of regression models developed using functional groups with previously published ones.
| Property | Best descriptors a | N | Method | RMSE (original models) | R2 (original models) | RMSE (CheckMol-FG) | R2 (CheckMol-FG) | RMSE (new EFG descriptors) | R2 (new EFG descriptors) |
|---|---|---|---|---|---|---|---|---|---|
| Environmental toxicity against | Estate | 644 | ASNN | 0.44 ± 0.02 | 0.83 ± 0.02 | 0.8 ± 0.03 | 0.44 ± 0.04 | 0.48 ± 0.03 | 0.8 ± 0.02 |
| logP for Pt(II/IV) complexes [ | Fragmentor (out of 12) | 233 | ASNN | 0.43 ± 0.03 | 0.92 ± 0.02 | 1.42 ± 0.07 | 0.16 ± 0.05 | 0.45 ± 0.03 | 0.91 ± 0.02 |
| HIV inhibition [ | Dragon (out of 10) | 286 | ASNN | 0.48 ± 0.03 | 0.87 ± 0.02 | 0.68 ± 0.03 | 0.75 ± 0.03 | 0.55 ± 0.03 | 0.83 ± 0.02 |
| Melting point [ | CDK (out of 10) | 47427 | ASNN | 39.1 ± 0.2 | 0.76 ± 0.01 | 50.6 ± 0.2 | 0.59 ± 0.01 | 45.1 ± 0.2 | 0.67 ± 0.01 |
| Melting point [ | Fragmentor (out of 12) | 275133 | LibSVM | 35.4 ± 0.1 | 0.69 ± 0.01 | 47.5 ± 0.1 | 0.46 ± 0.01 | 40.5 ± 0.1 | 0.61 ± 0.01 |
| Lowest Effect Level (LEL) toxicity prediction challenge [ | Adriana (out of 10) | 483 | ASNN | 0.93 ± 0.03 | 0.22 ± 0.04 | 0.98 ± 0.05 | 0.16 ± 0.04 | 0.97 ± 0.05 | 0.17 ± 0.04 |
| Solubility in water [ | Estate | 1311 | ASNN | 0.62 ± 0.2 | 0.91 ± 0.01 | 1.25 ± 0.04 | 0.65 ± 0.02 | 0.66 ± 0.02 | 0.90 ± 0.01 |
| Pyrolysis point [ | Estate | 13769 | LibSVM | 35.6 ± 0.2 | 0.55 ± 0.01 | 42.1 ± 0.3 | 0.38 ± 0.01 | 38.7 ± 0.2 | 0.47 ± 0.01 |
| PTP1B inhibition [ | Dragon | 2237 | ASNN | 0.77 ± 0.02 | 0.71 ± 0.02 | 0.96 ± 0.02 | 0.55 ± 0.02 | 0.81 ± 0.02 | 0.68 ± 0.02 |
| Estrogen Receptor binding [ | ALOGPS + Estate | 1677 | ASNN | 0.062 ± 0.004 | 0.58 ± 0.06 | 0.084 ± 0.006 | 0.33 ± 0.04 | 0.079 ± 0.006 | 0.34 ± 0.04 |
a The descriptors that provided the best model performance for the respective training set are indicated, together with the number of descriptor sets analyzed; the abbreviations are explained in the respective articles. RMSE is the root mean squared error; R2 is the square of the Pearson linear correlation coefficient; ASNN is the Associative Neural Network [22]; LibSVM is a Support Vector Machine library [23].
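The two regression metrics named in the footnote can be computed as in the minimal sketch below. The function names `rmse` and `pearson_r2` are hypothetical and are not taken from OCHEM. Because R2 is defined here as the squared Pearson correlation, it can differ from the coefficient of determination reported by some other tools.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson_r2(y_true, y_pred):
    """Square of the Pearson linear correlation coefficient."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    # Undefined when either series is constant; raises ZeroDivisionError.
    return (cov * cov) / (var_t * var_p)


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0]   # illustrative values only
    predicted = [1.1, 1.9, 3.2, 3.8]  # illustrative values only
    print(rmse(observed, predicted), pearson_r2(observed, predicted))
```

A perfectly correlated but systematically biased prediction would still give R2 = 1 under this definition, which is why the tables report RMSE alongside it.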
Table 2. Comparison of the performance of classification models developed using functional groups with previously published ones.
| Property | Best descriptors a | N | Method | BA (original models) | MCC (original models) | BA (CheckMol-FG descriptors) | MCC (CheckMol-FG descriptors) | BA (EFG descriptors) | MCC (EFG descriptors) |
|---|---|---|---|---|---|---|---|---|---|
| AMES test [ | Estate | 4361 | ASNN | 77.5% ± 0.6% | 0.55 ± 0.01 | 74.4% ± 0.6% | 0.49 ± 0.01 | 76.2% ± 0.6% | 0.53 ± 0.01 |
| Ready biodegradability [ | CDK (out of 7) | 1884 | ASNN | 86.7% ± 0.8% | 0.72 ± 0.02 | 75% ± 1% | 0.49 ± 0.02 | 83.2% ± 0.9% | 0.65 ± 0.02 |
| Solubility in DMSO [ | ALOGPS + Estate (out of 9) | 50620 | ASNN | 73.8% ± 0.4% | 0.24 ± 0.01 | 57.1% ± 0.4% | 0.07 ± 0.01 | 71.5% ± 0.5% | 0.22 ± 0.01 |
| CYP450 inhibition [ | Dragon (out of 10) | 3737 | J48 | 82.1% ± 0.6% | 0.64 ± 0.01 | 78.9% ± 0.7% | 0.59 ± 0.01 | 79.5% ± 0.7% | 0.59 ± 0.01 |
| Pyrolysis/Melting point classification [ | Estate (out of 10) | 241699 | LibSVM | 78.2% ± 0.2% | 0.33 ± 0.01 | 53.1% ± 0.2% | 0.04 ± 0.01 | 74.2% ± 0.2% | 0.27 ± 0.01 |
| Androgen receptor binding [ | Dragon | 744 | ASNN | 77% ± 2% | 0.54 ± 0.03 | 70% ± 2% | 0.41 ± 0.04 | 77% ± 2% | 0.54 ± 0.03 |
| Transthyretin receptor binding [ | Dragon | 162 | ASNN | 89% ± 3.0% | 0.79 ± 0.05 | 83% ± 3% | 0.67 ± 0.06 | 86% ± 3% | 0.72 ± 0.06 |
| Estrogen Receptor binding [ | Dragon (out of 11) | 1677 | ASNN | 74% ± 2.0% | 0.39 ± 0.03 | 62% ± 2% | 0.33 ± 0.04 | 72% ± 2% | 0.34 ± 0.03 |
| Azeotropes classification [ | Adriana | 465 | RF | 78% ± 2% | 0.55 ± 0.04 | 77% ± 2% | 0.55 ± 0.04 | 75% ± 2% | 0.49 ± 0.04 |
| ATAD5 genotoxicity [ | Dragon | 9363 | ASNN | 78% ± 1% | 0.28 ± 0.01 | 58% ± 1% | 0.07 ± 0.01 | 74% ± 1% | 0.22 ± 0.01 |
a See the footnotes of Table 1. BA is Balanced Accuracy; MCC is the Matthews correlation coefficient; RF is Random Forest [30]; J48 is the Weka [31] implementation of decision trees.
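For reference, the two classification metrics in the footnote can be derived from a binary confusion matrix as in the sketch below. The function names and the example counts are hypothetical; returning 0.0 for MCC when a row or column of the confusion matrix is empty is a common convention assumed here, not something stated in the article.

```python
import math


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (TP rate) and specificity (TN rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def matthews_corrcoef(tp, fn, tn, fp):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: MCC is set to 0 when it is undefined.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Illustrative counts only, not data from the article.
    print(balanced_accuracy(tp=40, fn=10, tn=30, fp=20))
    print(matthews_corrcoef(tp=40, fn=10, tn=30, fp=20))
```

Both metrics are less sensitive to class imbalance than plain accuracy, which matters for the strongly imbalanced sets in the table, where BA can be well above 50% while MCC stays low.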
Figure 1. Examples of overrepresented functional groups detected using (a) the initial set of CheckMol-FG functional groups [6,10] and (b) the extended EFG set. p-values are calculated using the hypergeometric distribution. Only the six most significant groups are shown for each set of descriptors. The full list of significant groups is available at http://ochem.eu/article/27542.
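A hypergeometric over-representation p-value of the kind mentioned in the Figure 1 caption can be sketched as follows. The caption does not specify the exact test, so the one-sided upper tail used here is an assumption, and the function name `enrichment_pvalue` and its parameters are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_total, n_with_group, n_selected, n_selected_with_group):
    """One-sided hypergeometric p-value P(X >= observed).

    n_total: all molecules in the data set
    n_with_group: molecules in the data set that contain the functional group
    n_selected: molecules in the subset of interest (e.g., the active class)
    n_selected_with_group: subset molecules that contain the functional group
    """
    upper = min(n_with_group, n_selected)
    # Sum the probability mass for every count at least as large as observed.
    favourable = sum(
        comb(n_with_group, i) * comb(n_total - n_with_group, n_selected - i)
        for i in range(n_selected_with_group, upper + 1)
    )
    return favourable / comb(n_total, n_selected)


if __name__ == "__main__":
    # Illustrative counts only, not data from the article.
    print(enrichment_pvalue(1000, 50, 100, 20))
```

A small p-value indicates that the group occurs in the selected subset more often than expected if the subset had been drawn at random from the full set.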
Figure 2. Examples of high-specificity (HS) and low-specificity (LS) functional group patterns for recognizing aromatic 5-membered heterocycles with one heteroatom. “A” denotes any “aromatic” atom except carbon.