Assima Rakhimbekova, Timur I Madzhidov, Ramil I Nugmanov, Timur R Gimadiev, Igor I Baskin, Alexandre Varnek.
Abstract
Defining a model's applicability domain (AD) is an active research topic in chemoinformatics. Although many AD definitions have been described in the literature for models predicting properties of molecules (Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) models), none has been reported to date for models of chemical reactions (Quantitative Reaction-Property Relationships (QRPR)). A chemical reaction is a much more complex object than an individual molecule, and its yield and its thermodynamic and kinetic characteristics depend not only on the structures of reactants and products but also on experimental conditions. The performance of QRPR models largely depends on how the chemical transformation is encoded. In this study, various AD definition methods extensively used in QSAR/QSPR studies of individual molecules, as well as several novel approaches suggested in this work for reactions, were benchmarked on several reaction datasets. Their ability to exclude wrong reaction types, increase coverage, improve model performance and detect Y-outliers was tested. As a result, several "best" AD definitions for QRPR models predicting reaction characteristics were identified and tested on a previously published external dataset with a clear AD definition problem.
Keywords: QSAR/QSPR; Quantitative Reaction–Property Relationship; applicability domain; chemical reactions; chemoinformatics; machine learning; reaction mining
Year: 2020 PMID: 32756326 PMCID: PMC7432167 DOI: 10.3390/ijms21155542
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. A chemical reaction (top left), its Condensed Graph of Reaction (CGR) (top centre), and the reaction centre with R = 1 (top right). Coloured lines on the CGR represent dynamic bonds. Indices “0 >> −” and “0 >> +” on the CGR mean that atomic charges are lowered from 0 to −1 and increased from 0 to +1, respectively. Indices “1 >> 0” and “1 >> 1” on the reaction centre denote changes in the number of neighbours, and letters denote hybridization changes (“s”: sp3; “a”: the atom is in an aromatic ring). The signature of the reaction centre with R = 1 is shown.
Figure 2. The procedure for selecting the hyperparameters of Quantitative Reaction–Property Relationships (QRPR) and the AD definition models.
Applicability domain (AD) definition methods.
| Universal AD Definition Approaches: Without Hyperparameters | Universal AD Definition Approaches: With Hyperparameters | Machine Learning (ML)-Dependent AD Definition Approaches: With Hyperparameters |
|---|---|---|
| BB | RTC_cv | GPR-AD |
| FC | 2CC_cv | RFR_VAR |
| Z-1NN | 1-SVM | |
| Leverage | Z-1NN_cv | |
| | Leverage_cv | |
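
As an illustration of the simplest kind of universal, hyperparameter-free AD definition, the sketch below implements a bounding-box check. It assumes that "BB" in the table denotes the bounding-box approach and that each reaction is already encoded as a numeric descriptor vector; the function names are hypothetical and are not taken from the study.

```python
def fit_bounding_box(train_X):
    """Return the (min, max) range of every descriptor over the training set."""
    columns = list(zip(*train_X))  # transpose rows into descriptor columns
    return [(min(col), max(col)) for col in columns]


def in_bounding_box(box, x):
    """True if every descriptor of x lies inside the training range."""
    return all(lo <= value <= hi for (lo, hi), value in zip(box, x))


# Toy descriptor vectors (assumed values, for illustration only).
train_X = [[0.0, 1.0], [2.0, 5.0], [1.0, 3.0]]
box = fit_bounding_box(train_X)
print(in_bounding_box(box, [1.5, 2.0]))  # inside both ranges
print(in_bounding_box(box, [3.0, 2.0]))  # first descriptor exceeds the training maximum
```

A check of this kind needs no tuning, which is why such methods sit in the "without hyperparameters" column, but it ignores empty regions inside the box.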
Coefficient of determination (R2) and RMSE of predictions estimated using the nested 5-fold cross-validation (without considering the AD of models).
| Regression Method | Metric | DA | SN2 | Tautomerization | E2 |
|---|---|---|---|---|---|
| RFR | R2 | 0.854 | 0.804 | 0.682 | 0.708 |
| | RMSE | 0.734 | 0.516 | 0.914 | 0.799 |
| GPR | R2 | 0.807 | 0.763 | 0.492 | 0.648 |
| | RMSE | 0.845 | 0.568 | 1.156 | 0.876 |
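
The nested cross-validation behind these figures can be sketched as follows: an outer loop holds out each fold for evaluation, while an inner loop run only on the remaining data selects the hyperparameter. This is a minimal pure-Python sketch under stated assumptions. The toy nearest-neighbour regressor, the one-dimensional feature and all function names are placeholders, not the RFR or GPR models used in the study.

```python
import random
from statistics import mean


def k_folds(indices, k, seed=0):
    """Shuffle the indices and split them into k nearly equal folds."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train, x, k):
    """Toy regressor: mean target of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return mean(y for _, y in nearest)


def rmse(pairs):
    """Root-mean-square error over (observed, predicted) pairs."""
    return (sum((y - p) ** 2 for y, p in pairs) / len(pairs)) ** 0.5


def cv_rmse(data, indices, k_neighbors, n_folds, seed):
    """Cross-validated RMSE of the toy model restricted to data[indices]."""
    pairs = []
    for fold in k_folds(indices, n_folds, seed):
        held = set(fold)
        train = [data[i] for i in indices if i not in held]
        pairs += [(data[i][1], knn_predict(train, data[i][0], k_neighbors))
                  for i in fold]
    return rmse(pairs)


def nested_cv(data, grid, n_outer=5, n_inner=5, seed=0):
    """Outer loop estimates performance; inner loop tunes the hyperparameter."""
    all_idx = range(len(data))
    pairs, chosen = [], []
    for fold in k_folds(all_idx, n_outer, seed):
        held = set(fold)
        train_idx = [i for i in all_idx if i not in held]
        # Hyperparameter selection sees only the outer training part.
        best = min(grid, key=lambda k: cv_rmse(data, train_idx, k, n_inner, seed + 1))
        chosen.append(best)
        train = [data[i] for i in train_idx]
        pairs += [(data[i][1], knn_predict(train, data[i][0], best)) for i in fold]
    return rmse(pairs), chosen


# Synthetic (feature, target) pairs, for illustration only.
data = [(float(x), 2.0 * x + 1.0) for x in range(50)]
score, chosen = nested_cv(data, grid=[1, 3, 5])
print(score, chosen)
```

Because the held-out outer fold never influences the hyperparameter choice, the resulting error estimate is not biased by the tuning step.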
Figure 3. Fragment Control considers tautomerization reactions (presented at the top) as belonging to the AD defined for the E2 reactions (bottom) because all fragments of the CGRs of tautomerization reactions (top right) are present in the CGRs of the reactions contained in the E2 dataset. CGRs are shown to the right of the corresponding reaction.
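
The behaviour described in this caption follows directly from how a fragment-based check works: a reaction is accepted whenever every fragment of its CGR has been seen in training, regardless of which reaction type it belongs to. The sketch below illustrates this with a set-inclusion test. The fragment labels and function names are hypothetical placeholders, not the descriptors used in the study.

```python
def fit_fragment_control(train_fragments):
    """Collect every fragment that occurs in at least one training reaction."""
    known = set()
    for fragments in train_fragments:
        known.update(fragments)
    return known


def in_fragment_control(known, fragments):
    """True if the query reaction contains no fragment unseen in training."""
    return set(fragments) <= known


# Hypothetical fragment labels standing in for CGR fragments.
training = [["C-C", "C=C", "C-H"], ["C-C", "C-Br", "C-H"]]
known = fit_fragment_control(training)

# A reaction of another type is still accepted if all its fragments are known.
print(in_fragment_control(known, ["C=C", "C-H"]))
# A single unseen fragment places the reaction outside the AD.
print(in_fragment_control(known, ["C-C", "C-O"]))
```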
Values of four AD performance metrics for four datasets for the composite AD definitions, assessed using nested cross-validation with hyperparameters tuned in its inner loop.
| № | AD Definition Method | Coverage (SN2) | Coverage (DA) | Coverage (E2) | Coverage (Tautomerization) | OIR (SN2) | OIR (DA) | OIR (E2) | OIR (Tautomerization) | ∆R2_AD (SN2) | ∆R2_AD (DA) | ∆R2_AD (E2) | ∆R2_AD (Tautomerization) | OD (SN2) | OD (DA) | OD (E2) | OD (Tautomerization) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ML-Dependent AD Definition Methods | |||||||||||||||||
| 1 | RFR_VAR*/OIR | 0.98 | 0.94 | 0.93 | 0.95 | 0.66 | 1.38 | 0.71 | 1.63 | 0.01 | 0.05 | 0.04 | −0.05 | 0.56 | 0.73 | 0.68 | 0.73 |
| 2 | GPR-AD*/OIR | 0.97 | 0.94 | 0.92 | 0.96 | 0.70 | 0.67 | 0.41 | 1.73 | 0.03 | 0.05 | 0.05 | 0.01 | 0.63 | 0.73 | 0.75 | 0.72 |
| 3 | RFR_VAR*/OD | 0.79 | 0.83 | 0.81 | 0.90 | 0.51 | 1.22 | 0.57 | 1.74 | 0.07 | 0.10 | 0.10 | −0.11 | 0.80 | 0.86 | 0.76 | 0.91 |
| 4 | GPR-AD*/OD | 0.89 | 0.78 | 0.81 | 0.93 | 0.78 | 0.38 | 0.07 | 1.60 | 0.09 | 0.08 | 0.08 | −0.03 | 0.80 | 0.79 | 0.84 | 0.82 |
| Universal AD definition methods with hyperparameters | |||||||||||||||||
| 5 | 2CC*/OIR | 0.98 | 0.94 | 0.92 | 0.94 | 0.67 | 1.62 | 0.46 | 1.83 | 0.02 | 0.06 | 0.02 | −0.10 | 0.59 | 0.77 | 0.63 | 0.76 |
| 6 | Lev_cv*/OIR | 0.98 | 0.94 | 0.75 | 0.96 | 0.61 | 1.39 | 0.15 | 1.59 | 0.01 | 0.05 | 0.02 | −0.05 | 0.55 | 0.71 | 0.59 | 0.69 |
| 7 | Z-1NN_cv*/OIR | 0.98 | 0.94 | 0.93 | 0.95 | 0.60 | 1.39 | 0.50 | 1.53 | 0.01 | 0.05 | 0.03 | −0.09 | 0.55 | 0.71 | 0.63 | 0.69 |
| 8 | 1-SVM*/OIR | 0.98 | 0.94 | 0.86 | 0.86 | 0.57 | 1.34 | 0.35 | 0.79 | 0.01 | 0.05 | 0.03 | −0.13 | 0.56 | 0.71 | 0.67 | 0.68 |
| 9 | 2CC*/OD | 0.84 | 0.82 | 0.80 | 0.89 | 0.59 | 1.12 | 0.44 | 1.64 | 0.07 | 0.09 | 0.09 | −0.12 | 0.82 | 0.83 | 0.74 | 0.87 |
| 10 | Lev_cv*/OD | 0.83 | 0.72 | 0.85 | 0.82 | 0.28 | 0.59 | 0.42 | 1.13 | 0.03 | 0.07 | 0.07 | −0.15 | 0.61 | 0.73 | 0.74 | 0.83 |
| 11 | Z-1NN_cv*/OD | 0.79 | 0.73 | 0.81 | 0.74 | 0.35 | 0.69 | 0.40 | 0.86 | 0.05 | 0.08 | 0.08 | −0.08 | 0.70 | 0.75 | 0.70 | 0.83 |
| 12 | 1-SVM*/OD | 0.49 | 0.29 | 0.68 | 0.66 | 0.22 | 0.37 | 0.28 | 0.42 | 0.07 | 0.07 | 0.07 | −0.21 | 0.69 | 0.62 | 0.72 | 0.67 |
| Universal AD definition methods without hyperparameters | |||||||||||||||||
| 13 | RTC1 | 0.98 | 0.94 | 0.93 | 0.96 | 0.61 | 1.39 | 0.51 | 1.59 | 0.01 | 0.05 | 0.03 | −0.05 | 0.55 | 0.71 | 0.63 | 0.69 |
| 14 | BB* | 0.98 | 0.94 | 0.93 | 0.94 | 0.58 | 1.34 | 0.50 | 1.30 | 0.01 | 0.05 | 0.03 | −0.05 | 0.56 | 0.71 | 0.63 | 0.68 |
| 15 | Leverage* | 0.95 | 0.93 | 0.91 | 0.92 | 0.45 | 1.29 | 0.55 | 1.19 | 0.02 | 0.05 | 0.04 | −0.19 | 0.60 | 0.72 | 0.67 | 0.71 |
| 16 | Z-1NN* | 0.95 | 0.91 | 0.90 | 0.92 | 0.44 | 1.09 | 0.64 | 1.16 | 0.02 | 0.05 | 0.06 | −0.12 | 0.60 | 0.74 | 0.71 | 0.71 |
| “Zero models” | |||||||||||||||||
| 17 | OZ | 1.00 | 1.00 | 1.00 | 1.00 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.50 | 0.50 | 0.50 | 0.50 |
| 18 | PZ | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0 | 0 | 0 | −0.81 | −0.85 | −0.71 | −0.70 | 0.50 | 0.50 | 0.50 | 0.50 |
| 19 | “Perfect AD model” | 0.98 | 0.98 | 0.99 | 0.98 | 1.75 | 3.75 | 2.71 | 4.08 | 0.06 | 0.08 | 0.05 | 0.06 | 1.00 | 1.00 | 1.00 | 1.00 |
Ranking of different AD definition methods for each data set.
| Rank | SN2 | Tautomerization | E2 | DA |
|---|---|---|---|---|
| 1 | “Perfect AD model” | “Perfect AD model” | “Perfect AD model” | “Perfect AD model” |
| 2 | GPR-AD*/OIR | RFR_VAR*/OIR | Z-1NN* | 2CC*/OIR |
| 3 | 2CC*/OD | GPR-AD*/OIR | RFR_VAR*/OIR | RFR_VAR*/OIR |
| 4 | GPR-AD*/OD | GPR-AD*/OD | GPR-AD*/OIR | RFR_VAR*/OD |
| 5 | RFR_VAR*/OIR | 2CC*/OIR | RFR_VAR*/OD | Lev_cv*/OIR |
Coefficient of determination (R2) and RMSE of prediction for the external test set for the best-ranked composite ADs (only reactions within the AD are considered).
| № | AD Method | R2 | RMSE | Coverage (%) |
|---|---|---|---|---|
| 1 | GPR-AD*/OIR | 0.96 | 0.17 | 17 |
| 2 | RFR_VAR*/OIR | 0.66 | 0.64 | 74 |
| 3 | 2CC*/OIR | 0.66 | 0.64 | 74 |
| 4 | GPR-AD*/OD | 0.96 | 0.17 | 17 |
| 5 | RFR_VAR*/OD | 0.80 | 0.39 | 34 |
| 6 | Z-1NN_cv*/OIR | 0.66 | 0.64 | 74 |
| 7 | 2CC*/OD | 0.67 | 0.53 | 25 |
| 8 | RTC1 | 0.66 | 0.64 | 74 |
| 9 | Z-1NN_cv*/OD | 0.94 | 0.22 | 17 |
| 10 | BB* | 0.66 | 0.64 | 73 |
| | Without AD | 0.60 | 0.84 | 100 |
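
To make the trade-off in this table concrete, the sketch below shows one way to compute R2, RMSE and coverage when only reactions inside the AD are scored. It assumes coverage is the percentage of test reactions flagged as inside the AD and uses the standard definition R2 = 1 − SS_res/SS_tot. The function names and toy values are illustrative and are not the study's data.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (None if undefined)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return None if ss_tot == 0 else 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root-mean-square error of the predictions."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def evaluate_within_ad(y_true, y_pred, in_ad):
    """Score only the objects flagged as inside the AD and report coverage in %."""
    kept = [(t, p) for t, p, flag in zip(y_true, y_pred, in_ad) if flag]
    coverage = 100.0 * len(kept) / len(y_true)
    if not kept:
        return {"coverage": coverage, "r2": None, "rmse": None}
    kept_true = [t for t, _ in kept]
    kept_pred = [p for _, p in kept]
    return {
        "coverage": coverage,
        "r2": r2_score(kept_true, kept_pred),
        "rmse": rmse(kept_true, kept_pred),
    }


# Toy values: the last prediction is poor and is flagged as outside the AD.
y_true = [1.0, 2.0, 3.0, 10.0]
y_pred = [1.1, 1.9, 3.2, 0.0]
print(evaluate_within_ad(y_true, y_pred, [True, True, True, False]))
print(evaluate_within_ad(y_true, y_pred, [True, True, True, True]))
```

A stricter AD usually improves R2 and RMSE on the retained reactions at the cost of lower coverage, which is the pattern visible when comparing rows with 17% and 74% coverage.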
Figure 4. Validation of the model on the external test set: predicted vs. experimental logk. Blue datapoints are reactions within the AD (X-inliers); red datapoints are outside the AD (X-outliers). Reactions with different reaction centers are shown with different point shapes: reactions with a phenylsulphonate leaving group, for which rate constants were predicted badly by the ML methods, are shown as triangles; reactions with a chlorine leaving group as circles; those with a bromine leaving group as crosses; and those with an iodine leaving group as squares.