Angelica Mazzolari, Giulio Vistoli, Bernard Testa, Alessandro Pedretti.
Abstract
This study aims to develop linear classifiers that predict the capacity of a given substrate to yield reactive metabolites. While most predictive models reported so far are based on the occurrence of known structural alerts (e.g., the presence of toxophoric groups), the present study focuses on generating predictive models that involve linear combinations of physicochemical and stereo-electronic descriptors. These models are developed with a novel classification approach based on enrichment factor optimization (EFO), as implemented in the VEGA suite of programs. The study took advantage of metabolic data collected by manually curated analysis of the primary literature published in the years 2004–2009. The learning set included 977 substrates, among which 138 compounds yielded reactive first-generation metabolites, plus 212 substrates generating reactive metabolites in all generations (i.e., metabolic steps). The results emphasize the possibility of developing satisfactory predictive models, especially when focusing on first-generation reactive metabolites. An extensive comparison of the classifier approach presented here with a set of well-known algorithms implemented in Weka 3.8 revealed that the proposed EFO method compares with the best available approaches and offers two relevant benefits: it involves a limited number of descriptors, and it provides a score-based probability, thus allowing a critical evaluation of the obtained results. Final analyses on non-cheminformatics UCI datasets emphasize the general applicability of the EFO approach, which performs well on both balanced and unbalanced datasets.
Keywords: enrichment factor; machine learning; reactive metabolite; toxicity prediction; unbalanced datasets
Year: 2018 PMID: 30428514 PMCID: PMC6278469 DOI: 10.3390/molecules23112955
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
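The enrichment factor that gives EFO its name is commonly defined in virtual screening as the hit rate among the top-ranked fraction of a scored list divided by the hit rate of the whole set. The Python sketch below assumes that standard definition. The function name `enrichment_factor`, its parameters, and the toy data are illustrative assumptions and are not taken from the VEGA implementation.

```python
from typing import Sequence


def enrichment_factor(
    scores: Sequence[float],
    labels: Sequence[int],
    top_fraction: float,
) -> float:
    """Hit rate in the top-ranked slice divided by the overall hit rate.

    labels are 1 for a positive (e.g. a substrate yielding a reactive
    metabolite) and 0 otherwise; higher scores are ranked first.
    """
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and equally long")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    total_hits = sum(labels)
    if total_hits == 0:
        raise ValueError("the dataset contains no positives")

    # Rank compounds by decreasing score and keep the top slice.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    hits_top = sum(label for _, label in ranked[:n_top])

    return (hits_top / n_top) / (total_hits / len(ranked))


# Toy data (hypothetical): 10 compounds, 2 positives, both ranked first.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ef_top20 = enrichment_factor(toy_scores, toy_labels, 0.2)
```

Because the hit rate of any slice cannot exceed 1, the enrichment factor is bounded by the inverse of the overall hit rate. For the learning set described in the abstract (138 first-generation positives among 977 substrates), that upper bound is 977/138, or about 7.1.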
Model performances obtained in the calibration study.
| Cluster Size | Sampling Cycles | Variables | EF Cut-Off Top 5% | Ionization State | Mean Top 1% | Mean Top 10% | Best Top 10% |
|---|---|---|---|---|---|---|---|
| 10 | 12 | 3 | 2.0 | N | 82.2 | 25.95 | 32 |
| 20 | 12 | 3 | 2.0 | N | 68.9 | 28.2 | 34 |
| 40 | 12 | 3 | 2.0 | N | 59.4 | 32.7 | 42 |
| 60 | 12 | 3 | 2.0 | N | 58.3 | 34.5 | 45 |
| 80 | 12 | 3 | 2.0 | N | 55.7 | 36.1 | 46 |
| 100 | 12 | 3 | 2.0 (12) | N | 47.8 | 38.2 | 46 |
| 100 | 6 | 3 | 2.0 | N | 48.8 | 38.2 | 47 |
| 100 | 24 | 3 | 2.0 | N | 47.2 | 37.9 | 46 |
| 100 | 12 | 3 | 2.5 (18) | N | 44.4 | 38.1 | 45 |
| 100 | 12 | 3 | 1.5 (6) | N | 48.8 | 36.4 | 45 |
| 100 | 12 | 3 | 1.0 (2) | N | 49.4 | 32.5 | 43 |
| 100 | 12 | 3 | 0.0 (0) | N | 51.1 | 32.7 | 43 |
| 100 | 12 | 1 | 2.0 | N | 37.4 | 25.1 | 30 |
| 100 | 12 | 2 | 2.0 | N | 49.4 | 32.2 | 40 |
| 100 | 12 | 4 | 2.0 | N | 49.9 | 39.7 | 45 |
| 100 | 12 | 5 | 2.0 | N | 51.1 | 41.6 | 47 |
| 100 | 12 | 6 | 2.0 | N | 55.5 | 43.6 | 48 |
| 100 | 12 | 8 | 2.0 | N | 56.7 | 44.3 | 48 |
| 100 | 12 | 10 | 2.0 | N | 57.2 | 45.5 | 48 |
| 100 | 12 | 2 | 2.0 | I | 44.4 | 34.3 | 42 |
| 100 | 12 | 3 | 2.0 | I | 47.2 | 39.4 | 46 |
| 100 | 12 | 4 | 2.0 | I | 49.4 | 40.6 | 47 |
| 100 | 12 | 5 | 2.0 | I | 50.2 | 42.2 | 47 |
| 100 | 12 | 6 | 2.0 | I | 54.4 | 44.0 | 48 |
Best developed models and relative statistics.
| Mod. | Gen./React. | State | Cluster Size | Equation | Statistics |
|---|---|---|---|---|---|
| 1 | First | N | 70 | 1.00 HBT + 2.47 Lipole − 0.0001 Electronic_Energy + 0.13 Dipole + | Precision = 0.42 |
| 2 | First | I | 70 | 1.00 Rotors − 1.55 HBA + 5.09 Lipole − 0.0018 Electronic_Energy + | Precision = 0.35 |
| 3 | All | N | 140 | 1.00 HBA + 1.09 Lipole − 0.0089 Heat_Formation + 0.070 Filled_Levels + | Precision = 0.42 |
| 4 | All | I | 140 | 1.00 Lipole − 0.033 PSA − 0.0059 ASA − 0.0004 Electronic_Energy + | Precision = 0.46 |
| 5 | Csp2/Csp | N | 30 | −1.00 Angles + 19.13 Rotors − 0.43 HBA + 15.47 HBT − 9.89 Impropers + | Precision = 0.67 |
| 6 | Quinone ox | N | 20 | 1.00 Angles + 1.07 Rotors + 68.34 Radius_Gyration − 8.38 HBA + | Precision = 0.63 |
| 7 | NH/NOH | N | 20 | −1.00 HBD + 0.041 Impropers − 0.15 Dipole − 0.0007 E_HOMO + | Precision = 0.63 |
| 8 | Csp2/Csp | I | 30 | −1.00 Rotors − 10.12 HBA + 1.47 HBD + | Precision = 0.61 |
| 9 | Quinone ox | I | 20 | 1.00 Angles + 1.53 Rotors + | Precision = 0.67 |
| 10 | NH/NOH | I | 20 | −1.00 HBD + 0.0083 Impropers + | Precision = 0.70 |
| 11 | Heart | N/A | 75 | −1.00 Pain + | Precision = 0.86 |
N and I stand for substrates simulated in their neutral and ionized forms, respectively.
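To make the table concrete, the sketch below shows how a linear classifier of this kind could be evaluated and how the reported precision statistic is computed. It uses only the two terms of Mod. 10 that are visible in the table; the rest of the equation is truncated there and is deliberately left out, so the scores are illustrative only. The descriptor values, the class labels, and the threshold are hypothetical assumptions.

```python
from typing import Mapping, Sequence

# Visible terms of Mod. 10 only. The equation is truncated in the table,
# so any remaining terms are NOT reproduced here.
MOD10_VISIBLE_TERMS = {"HBD": -1.00, "Impropers": 0.0083}


def linear_score(descriptors: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of descriptor values (a linear combination)."""
    return sum(weight * descriptors[name] for name, weight in weights.items())


def precision(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """True positives divided by all predicted positives."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    if tp + fp == 0:
        raise ValueError("no positive predictions, precision is undefined")
    return tp / (tp + fp)


# Hypothetical compounds: (descriptor values, known class).
compounds = [
    ({"HBD": 0, "Impropers": 12}, True),
    ({"HBD": 1, "Impropers": 30}, False),
    ({"HBD": 3, "Impropers": 8}, False),
    ({"HBD": 0, "Impropers": 4}, False),
]
THRESHOLD = 0.0  # hypothetical decision threshold, not from the source

scores = [linear_score(desc, MOD10_VISIBLE_TERMS) for desc, _ in compounds]
predicted = [score > THRESHOLD for score in scores]
actual = [is_reactive for _, is_reactive in compounds]
toy_precision = precision(predicted, actual)
```

Keeping the weights in a plain mapping makes it easy to swap in another model from the table once its full equation is available.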
Comparison of the predictive power (as encoded by MCC value) of Mod. 1 with the corresponding models obtained by using 29 different algorithms as implemented in Weka software.
| Algorithm | MCC | Algorithm | MCC |
|---|---|---|---|
| IterativeClass | 0.24 | | |
| BayesNet | 0.12 | RandomSubspace | 0.16 |
| FLDA | 0.25 | DecisionTable | 0.13 |
| LDA | 0.17 | JRip | 0.14 |
| Logistic | 0.11 | PART | 0.19 |
| DecisionStump | 0.13 | | |
| Kstar | 0.27 | LMT | 0.17 |
| LWL | 0.13 | | |
| AdaBoostM1 | 0.13 | RandomTree | 0.16 |
| Bagging | 0.21 | REPTree | 0.14 |
| Regression | 0.27 | LogitBoost | 0.18 |
| FilteredClass | 0.16 | | |
| A1DE | 0.14 | | |
| CHIRP | 0.23 | ExtraTree | 0.25 |
The methods affording an MCC value ≥ 0.30 are indicated in bold; for these best-performing approaches, the number of true positives obtained in the test set is reported in parentheses. Note that the approaches providing models with MCC < 0.1 are not reported for simplicity.
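For reference, the Matthews correlation coefficient (MCC) used in this comparison is computed from the four cells of a binary confusion matrix. The sketch below implements the standard formula. The function name, the convention of returning 0 when the denominator vanishes, and the confusion-matrix counts are illustrative assumptions and are not results from the study.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (a common convention,
    assumed here), since the formula is undefined in that case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correctly classified instances."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical unbalanced test set: 14 positives, 86 negatives.
# A trivial classifier that predicts "negative" for everything:
trivial = dict(tp=0, tn=86, fp=0, fn=14)
# A classifier that recovers half of the positives:
useful = dict(tp=7, tn=76, fp=10, fn=7)

print(accuracy(**trivial), mcc(**trivial))
print(accuracy(**useful), mcc(**useful))
```

On an unbalanced set such as this one, the trivial classifier reaches a higher accuracy than the useful one but an MCC of zero, which is why MCC is a reasonable yardstick when positives are rare.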
Comparison of the accuracy obtained here with the values published in ref. 28 (i.e., C4.5, NB and k-NN) for the two UCI datasets used.
| Dataset | Attributes | Instances | C4.5 | NB | k-NN | EFO (0.0) | EFO (1.0) | EFO (1.5) |
|---|---|---|---|---|---|---|---|---|
| Sonar | 60 | 208 | 0.68 | 0.71 | 0.84 | - | 0.76 | 0.69 |
| Heart | 13 | 270 | 0.74 | 0.86 | 0.59 | 0.87 | 0.73 | 0.73 |
Figure 1. Main logical units into which the proposed classification algorithm can be subdivided. The yellow box indicates the input; the green box comprises the initial variable filtering; the blue boxes define the main tasks performed by the algorithm; the red box displays the obtained results. The brown boxes include the computational approaches by which each generated classifier is optimized by maximizing the corresponding quality function.