Karolina Kwapien, Eva Nittinger, Jiazhen He, Christian Margreitter, Alexey Voronov, Christian Tyrchan.
Abstract
Matched molecular pairs (MMPs) are a commonly applied concept in drug design. They are used in many computational tools for structure-activity relationship analysis, biological activity prediction, and optimization of physicochemical properties. However, it has not yet been shown rigorously that MMPs, that is, pairs of molecules differing in only one substituent, can be predicted with higher accuracy and precision than any other pair of chemical compounds. It is expected that any model should be able to predict such a defined change with high accuracy and reasonable precision. In this study, we examine the predictability of four classical properties relevant for drug design, ranging from simple physicochemical parameters (log D and solubility) to more complex cell-based ones (permeability and clearance), using different data sets and machine learning algorithms. Our study confirms that additive data are the easiest to predict, which highlights the importance of recognizing nonadditivity events and the difficulty of predicting properties in the case of scaffold hopping. Although deep learning is well suited to modeling nonlinear events, these methods do not seem to be an exception to this observation. They generally perform better than classical machine learning methods, but the challenge remains open for the field.
Year: 2022 PMID: 35936431 PMCID: PMC9352238 DOI: 10.1021/acsomega.2c02738
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
Number of (Nof) Compounds (cpds) after the Different Curation Steps
| property | data all | w/o outlier | Nof multimeasures | Nof stereoduplicates | Nof cpds in Set 1 |
|---|---|---|---|---|---|
| log D | 215,418 | 214,320 | 18,429 | 6510 | 207,306 |
| solubility | 226,955 | 226,189 | 21,444 | 5527 | 219,987 |
| permeability | 18,076 | 18,051 | 2282 | 646 | 17,257 |
| clearance | 179,637 | 179,495 | 24,493 | 5408 | 172,947 |
Multimeasures: compounds measured ≥2 times.
Number of (Nof) Compounds (cpds) in Each Data Set
| property | data | Nof cpds | training | test |
|---|---|---|---|---|
| log D | Set 1 (all data) | 207,306 | 165,844 | 41,462 |
| | Set 2 (all MMPs) | 187,162 | 149,729 | 37,433 |
| | Set 3 (MMPs A) | 47,380 | 37,904 | 9476 |
| | Set 4 (MMPs N) | 24,775 | 19,820 | 4955 |
| solubility | Set 1 (all data) | 219,987 | 175,989 | 43,998 |
| | Set 2 (all MMPs) | 196,451 | 157,160 | 39,291 |
| | Set 3 (MMPs A) | 45,976 | 36,780 | 9196 |
| | Set 4 (MMPs N) | 27,650 | 22,120 | 5530 |
| permeability | Set 1 (all data) | 17,257 | 13,805 | 3452 |
| | Set 2 (all MMPs) | 14,612 | 11,689 | 2923 |
| | Set 3 (MMPs A) | 4443 | 3554 | 889 |
| | Set 4 (MMPs N) | 909 | 727 | 182 |
| clearance | Set 1 (all data) | 172,947 | 138,357 | 34,590 |
| | Set 2 (all MMPs) | 155,043 | 124,034 | 31,009 |
| | Set 3 (MMPs A) | 33,755 | 27,004 | 6751 |
| | Set 4 (MMPs N) | 21,471 | 17,176 | 4295 |
A, additive data; N, nonadditive data.
Experimental Uncertainty (in Log Units) and Expected R²max Estimated for Each Property
| property | εw mean for multimeasures | εw mean for stereoduplicates | expected R²max |
|---|---|---|---|
| log D | 0.10 | 0.07 | 0.993 |
| solubility | 0.26 | 0.15 | 0.935 |
| permeability | 0.22 | 0.10 | 0.936 |
| clearance | 0.12 | 0.15 | 0.947 |
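The record does not state which expression was used to obtain the expected maximum R². A commonly used estimate, assumed here purely for illustration, relates this ceiling to the experimental uncertainty and the spread of the measured values. The symbols \sigma_E (experimental uncertainty) and \sigma_D (standard deviation of the measured property across the data set) are introduced for this sketch and do not come from the source:

```latex
R^2_{\max} = 1 - \left( \frac{\sigma_E}{\sigma_D} \right)^2
```

Under this assumption, a smaller experimental uncertainty relative to the spread of the data gives a ceiling closer to 1, which is in line with log D showing both the lowest uncertainty and the highest expected maximum R² in the table.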
Number of (Nof) Compounds (cpds) for Each Property after Nonadditivity Analysis (NAA)
| property | Nof cpds | Nof cycles | cpds with significant NA |
|---|---|---|---|
| log D | 207,306 | 191,605 | 25,318 (12.21%) |
| solubility | 219,987 | 184,116 | 28,072 (12.76%) |
| permeability | 17,257 | 13,977 | 916 (5.31%) |
| clearance | 172,947 | 121,941 | 21,750 (12.58%) |
Significance threshold determined by two times the experimental uncertainty.
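The significance rule above can be illustrated with a minimal Python sketch. It assumes the usual double-transformation-cycle picture of nonadditivity, in which the same structural change is applied to two different parent compounds; the function names, the cycle labeling, and the example values are hypothetical and are not taken from the source. Only the factor of two and the log D uncertainty of 0.10 come from the tables.

```python
def nonadditivity(p1, p2, p3, p4):
    """Nonadditivity of a four-compound cycle (hypothetical labeling).

    The same transformation takes compound 1 to 2 and compound 4 to 3,
    so perfectly additive data give (p2 - p1) == (p3 - p4) and a result of 0.
    All property values are in log units.
    """
    return (p2 - p1) - (p3 - p4)


def is_significant(na, uncertainty, factor=2.0):
    """Flag a cycle whose |nonadditivity| exceeds factor * experimental uncertainty."""
    return abs(na) > factor * uncertainty


# Assumed example values (log units); 0.10 is the log D uncertainty from the table.
LOGD_UNCERTAINTY = 0.10

additive_cycle = nonadditivity(1.0, 1.5, 2.6, 2.0)     # effects differ by about 0.1
nonadditive_cycle = nonadditivity(1.0, 1.5, 3.0, 2.0)  # effects differ by 0.5

print(is_significant(additive_cycle, LOGD_UNCERTAINTY))
print(is_significant(nonadditive_cycle, LOGD_UNCERTAINTY))
```

In this sketch the first cycle stays below the threshold of 0.20 log units and would be treated as additive, while the second exceeds it and would be flagged as nonadditive.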
R² (for Test Sets) for All Algorithms, Data Sets, and Properties Discussed in This Work
| property | data | PLS | RF | SVR | XGBoost | DNN-S_20 | DNN-S_50 | DNN-M_20 |
|---|---|---|---|---|---|---|---|---|
| log D | Set 1 (all data) | 0.52 | 0.63 | 0.65 | 0.76 | 0.91 | 0.91 | 0.90 |
| | Set 2 (all MMPs) | 0.52 | 0.64 | 0.66 | 0.76 | 0.91 | 0.91 | 0.90 |
| | Set 3 (MMPs A) | 0.55 | 0.67 | 0.58 | 0.77 | 0.95 | 0.95 | 0.95 |
| | Set 4 (MMPs N) | 0.53 | 0.60 | 0.74 | 0.69 | 0.84 | 0.84 | 0.82 |
| solubility | Set 1 (all data) | 0.36 | 0.46 | 0.46 | 0.56 | 0.67 | 0.67 | 0.68 |
| | Set 2 (all MMPs) | 0.36 | 0.48 | 0.47 | 0.57 | 0.68 | 0.68 | 0.68 |
| | Set 3 (MMPs A) | 0.43 | 0.61 | 0.46 | 0.68 | 0.78 | 0.79 | 0.80 |
| | Set 4 (MMPs N) | 0.23 | 0.28 | 0.32 | 0.32 | 0.41 | 0.42 | 0.43 |
| permeability | Set 1 (all data) | 0.46 | 0.56 | 0.63 | 0.57 | 0.65 | 0.68 | 0.71 |
| | Set 2 (all MMPs) | 0.48 | 0.59 | 0.66 | 0.62 | 0.69 | 0.70 | 0.75 |
| | Set 3 (MMPs A) | 0.64 | 0.71 | 0.83 | 0.68 | 0.82 | 0.84 | 0.85 |
| | Set 4 (MMPs N) | 0.11 | 0.21 | 0.18 | 0.20 | 0.24 | 0.18 | 0.41 |
| clearance | Set 1 (all data) | 0.27 | 0.40 | 0.38 | 0.48 | 0.57 | 0.57 | 0.61 |
| | Set 2 (all MMPs) | 0.28 | 0.42 | 0.39 | 0.50 | 0.58 | 0.59 | 0.62 |
| | Set 3 (MMPs A) | 0.37 | 0.52 | 0.37 | 0.54 | 0.71 | 0.72 | 0.75 |
| | Set 4 (MMPs N) | 0.21 | 0.32 | 0.37 | 0.33 | 0.34 | 0.35 | 0.37 |
RMSE (for Test Sets) for All Algorithms, Data Sets, and Properties Discussed in This Work
| property | data | PLS | RF | SVR | XGBoost | DNN-S_20 | DNN-S_50 | DNN-M_20 |
|---|---|---|---|---|---|---|---|---|
| log D | Set 1 (all data) | 0.86 | 0.75 | 0.72 | 0.61 | 0.37 | 0.37 | 0.39 |
| | Set 2 (all MMPs) | 0.84 | 0.73 | 0.71 | 0.59 | 0.36 | 0.36 | 0.38 |
| | Set 3 (MMPs A) | 0.72 | 0.62 | 0.70 | 0.52 | 0.24 | 0.23 | 0.24 |
| | Set 4 (MMPs N) | 0.86 | 0.79 | 0.64 | 0.70 | 0.51 | 0.51 | 0.54 |
| solubility | Set 1 (all data) | 0.83 | 0.76 | 0.77 | 0.69 | 0.60 | 0.60 | 0.58 |
| | Set 2 (all MMPs) | 0.82 | 0.74 | 0.75 | 0.67 | 0.58 | 0.58 | 0.58 |
| | Set 3 (MMPs A) | 0.71 | 0.59 | 0.70 | 0.54 | 0.45 | 0.44 | 0.42 |
| | Set 4 (MMPs N) | 0.90 | 0.87 | 0.85 | 0.85 | 0.79 | 0.78 | 0.77 |
| permeability | Set 1 (all data) | 0.63 | 0.57 | 0.53 | 0.57 | 0.51 | 0.49 | 0.46 |
| | Set 2 (all MMPs) | 0.60 | 0.54 | 0.49 | 0.52 | 0.47 | 0.46 | 0.42 |
| | Set 3 (MMPs A) | 0.44 | 0.40 | 0.30 | 0.42 | 0.31 | 0.30 | 0.28 |
| | Set 4 (MMPs N) | 0.79 | 0.74 | 0.76 | 0.75 | 0.73 | 0.76 | 0.64 |
| clearance | Set 1 (all data) | 0.45 | 0.41 | 0.42 | 0.38 | 0.35 | 0.35 | 0.33 |
| | Set 2 (all MMPs) | 0.44 | 0.40 | 0.41 | 0.37 | 0.34 | 0.34 | 0.32 |
| | Set 3 (MMPs A) | 0.39 | 0.34 | 0.39 | 0.33 | 0.27 | 0.26 | 0.25 |
| | Set 4 (MMPs N) | 0.47 | 0.44 | 0.42 | 0.43 | 0.43 | 0.43 | 0.42 |
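For readers who want to reproduce this kind of test-set evaluation, the two metrics can be sketched in plain Python as follows. This assumes the coefficient-of-determination form of R² (one minus the ratio of residual to total sum of squares); the record does not say whether the authors used this form or the squared correlation coefficient. The function names and the toy values are hypothetical.

```python
import math


def rmse(measured, predicted):
    """Root-mean-square error between measured and predicted values."""
    n = len(measured)
    squared_errors = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    return math.sqrt(squared_errors / n)


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Toy values in log units, invented for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

print(f"RMSE = {rmse(measured, predicted):.3f}")
print(f"R2   = {r_squared(measured, predicted):.3f}")
```

Because RMSE depends on the scale of the residuals while R² also depends on the spread of the measured values, the two metrics can rank data sets differently, which is why both are useful to report.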
Figure 1. R² and RMSE for log D and clearance: Set 3 (only additive data) vs Set 4 (only nonadditive data). Comparison of different models and endpoints. R²max (dashed line) is the upper limit for R² derived from experimental uncertainty (Table ). Full performance details can be found in the Supporting Information (SI Figures S6 and S7).
Figure 2. R² against RMSE for test (a) Set 3 (only additive data) and (b) Set 4 (only nonadditive data). Comparison of different models and endpoints.
Figure 3. Predicted versus measured values for solubility. Comparison between RF (left column) and DNN-S_20 (right column) for Set 3 (only additive data, blue) and Set 4 (only nonadditive data, orange). The values are in log units.