Katsuhisa Matsumoto1, Tomoyuki Miyao1,2, Kimito Funatsu2,3.
Abstract
In ligand-based drug design, quantitative structure-activity relationship (QSAR) models play an important role in activity prediction. One of the major end points of QSAR models is the half-maximal inhibitory concentration (IC50). Experimental IC50 data from various research groups have accumulated in publicly accessible databases, providing an opportunity to use such data in predictive QSAR models. In this study, we focused on ranking-oriented QSAR models because the relative potency of compounds within the same assay is reliable information that does not depend on any mechanistic assumptions. We conducted rigorous validation using the ChEMBL database and previously reported data sets. Ranking support vector machine (ranking-SVM) models trained on compounds from similar assays performed as well as support vector regression (SVR) with the Tanimoto kernel trained on compounds from all the assays. These results suggest effective ways of integrating data: for ranking-SVM, compounds should be integrated only from assays that are similar in terms of their compounds, whereas for SVR with the Tanimoto kernel, all compounds from different assays can be incorporated.
Year: 2021 PMID: 34056351 PMCID: PMC8154010 DOI: 10.1021/acsomega.1c00463
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
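The abstract's central idea, that only potency orderings within the same assay are treated as reliable, can be illustrated with a minimal sketch of how pairwise preferences for a ranking model might be assembled. This is not the authors' implementation: the function name `within_assay_pairs`, the record layout, and the toy IC50 values are all hypothetical, and ties are simply skipped as an assumption.

```python
from itertools import combinations


def within_assay_pairs(records):
    """Build (more_potent, less_potent) compound pairs, never crossing assays.

    records: iterable of (compound_id, assay_id, ic50) tuples, where a lower
    IC50 means higher potency. Tied IC50 values carry no ordering information
    and are skipped (an assumption of this sketch).
    """
    by_assay = {}
    for compound_id, assay_id, ic50 in records:
        by_assay.setdefault(assay_id, []).append((compound_id, ic50))

    pairs = []
    for members in by_assay.values():
        # Compare compounds only with others measured in the same assay.
        for (cpd_a, ic50_a), (cpd_b, ic50_b) in combinations(members, 2):
            if ic50_a == ic50_b:
                continue
            if ic50_a < ic50_b:
                pairs.append((cpd_a, cpd_b))
            else:
                pairs.append((cpd_b, cpd_a))
    return pairs


# Hypothetical toy data: two assays with IC50 values in arbitrary units.
records = [
    ("c1", "assay_A", 10.0),
    ("c2", "assay_A", 250.0),
    ("c3", "assay_A", 250.0),
    ("c4", "assay_B", 5.0),
    ("c5", "assay_B", 1.0),
]
pairs = within_assay_pairs(records)
print(pairs)
```

Because pairs are never formed across assays, differences in absolute IC50 scale between assays do not enter the training signal; only the integrated compounds' within-assay orderings do.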
Biological Targets and Assay Names for Integration
| target name | assay ChEMBL IDs (number of compounds) |
|---|---|
| epidermal growth factor receptor erbB1 | 853149 (34), 944276 (177), 3807186 (40), 3854863 (75) |
| estrogen receptor alpha | 829540 (103), 831155 (69), 852960 (32), 860989 (40), 865582 (52), 868001 (34) |
| glucocorticoid receptor | 869701 (32), 890119 (32), 899205 (34), 1670638 (31), 1909150 (47), 1913330 (31), 3374293 (32) |
| arachidonate 5-lipoxygenase | 939889 (42), 2034135 (35), 2167302 (32), 3420628 (42) |
| cytochrome P450 19A1 | 915480 (71), 1011777 (85), 1067556 (40), 1105634 (53) |
| monoamine oxidase A | 964041 (40), 1246374 (36), 1251643 (56), 2060936 (51), 3223013 (65), 3396571 (32) |
| cannabinoid CB1 receptor | 827163 (55), 832272 (36), 887423 (31), 1003147 (41), 1039369 (54), 1065338 (31) |
| acetylcholinesterase | 859798 (31), 895407 (33), 1249049 (38), 3772142 (32) |
| monoamine oxidase B | 964042 (53), 1071596 (46), 1246375 (42), 1251644 (60), 1769015 (31), 1833115 (35), 2060937 (50), 2154471 (49), 3095874 (35), 3223014 (62), 3772179 (37) |
| serotonin transporter | 863612 (31), 984526 (41), 995198 (39), 1044081 (33), 1647124 (31), 1909109 (91) |
| cyclooxygenase-2 | 932255 (34), 1106771 (34), 2340595 (39), 3076632 (39) |
| peroxisome proliferator-activated receptor gamma | 880426 (31), 902597 (32), 983195 (33), 2342574 (35) |
| hERG | 766814 (66), 1027667 (71), 1676103 (176), 1909190 (36) |
| estrogen receptor beta | 829091 (107), 831137 (70), 852961 (36), 860987 (49), 865583 (61), 868002 (38) |
| vascular endothelial growth factor receptor 2 | 829592 (37), 864514 (63), 864971 (34), 866567 (32), 3226104 (34), 3778238 (36) |
| dipeptidyl peptidase IV | 893277 (33), 967900 (35), 1686243 (40), 2149434 (49), 3241352 (33) |
| P-glycoprotein 1 | 974470 (37), 1014292 (43), 1948105 (34), 2061056 (31), 3584050 (53) |
| hepatocyte growth factor receptor | 1826482 (39), 1826790 (447), 2447158 (32), 3420280 (36), 3583343 (33) |
| epoxide hydratase | 907007 (31), 912696 (47), 1634184 (42), 1936453 (72), 3768766 (41) |
| histone deacetylase 1 | 908805 (37), 946979 (39), 1008834 (34), 1119893 (33), 2050420 (42), 3379478 (34), 3388838 (38), 3803863 (31), 3828935 (39) |
| sodium/glucose cotransporter 2 | 1069022 (47), 1117743 (41), 1680964 (46), 1781790 (33), 1788054 (45), 1831435 (34) |
| ATP-binding cassette subfamily G member 2 | 974473 (39), 1936798 (35), 2423205 (32), 3096293 (31) |
| prostaglandin E synthase | 1050553 (37), 1686171 (35), 2340533 (57), 3873577 (33) |
| cyclooxygenase 1 | 763302 (31), 768746 (21), 769638 (20), 772766 (66), 901401 (19), 916301 (25), 932254 (33), 1106770 (39), 1816060 (25) |
Figure 1. Validation protocols for assay integration in QSAR modeling.
Figure 2. Depiction of assay distance for data integration. As an exemplary case, the chemical space of ErbB1 inhibitors from several assays is visualized in (A) using t-distributed stochastic neighbor embedding (t-SNE). Extended connectivity fingerprints with a diameter of 4 (ECFP4) were used as the molecular representation. The corresponding distance matrix is shown in (B).
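The caption does not spell out how distances are computed from the ECFP4 fingerprints. A common choice for such fingerprints is the Tanimoto coefficient, which also underlies the Tanimoto kernel named in the abstract. The sketch below assumes fingerprints are represented as sets of on-bit indices and uses 1 minus the Tanimoto similarity as a compound-level distance. The bit sets are hypothetical toy values, and how compound-level values are aggregated into an assay-level distance is not covered here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen in this sketch for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def distance_matrix(fingerprints):
    """Pairwise (1 - Tanimoto) distances as a list of lists."""
    n = len(fingerprints)
    return [
        [1.0 - tanimoto(fingerprints[i], fingerprints[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical on-bit sets standing in for ECFP4 fingerprints.
fps = [{1, 2, 3, 4}, {3, 4, 5, 6}, {1, 2, 3, 4}]
dist = distance_matrix(fps)
for row in dist:
    print(["%.3f" % d for d in row])
```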
Figure 3. CB integration for QSAR model training. For test compounds in the red-shaded circle, integrated compounds from various assays are specified on the t-SNE surface. The yellow-shaded circles represent distance thresholds within which compounds are integrated into training of QSAR models of potency rankings.
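The threshold-based selection described in this caption can be sketched as a simple filter: a candidate compound from another assay is kept if it lies within a given distance of at least one reference compound. This is only an illustration under stated assumptions. The function `select_within_threshold`, the use of Euclidean distance on 2D toy coordinates, and the threshold value are all hypothetical, and the distance measure and threshold semantics actually used in the study are not given in this record.

```python
import math


def select_within_threshold(reference, candidates, distance, threshold):
    """Keep candidates lying within `threshold` of at least one reference item."""
    selected = []
    for cand in candidates:
        if any(distance(cand, ref) <= threshold for ref in reference):
            selected.append(cand)
    return selected


# Hypothetical 2D coordinates standing in for compounds in a descriptor space.
reference = [(0.0, 0.0), (1.0, 0.0)]
candidates = [(0.2, 0.1), (3.0, 3.0), (1.0, 0.4), (-2.0, 0.0)]

selected = select_within_threshold(reference, candidates, math.dist, 0.5)
print(selected)
```

Passing the distance as a callable keeps the filter independent of the molecular representation, so a fingerprint-based distance could be substituted without changing the selection logic.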
Figure 4. Comparison of predictive rankings of QSAR models. The violin plots represent the predictive rankings of 20 QSAR modeling methods for the 23 targets. Ten trials with different initial data splitting were conducted for each assay on each target. The modeling methods were ranked using the Kendall rank correlation coefficients.
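The ranking step in this caption can be sketched as follows: compute a Kendall rank correlation between measured and predicted potencies for each method, then order the methods by that coefficient. The sketch uses the plain tau-a form (concordant minus discordant pairs over all pairs). Which Kendall variant and tie handling the study used is not stated here, so that choice, the method names, and the toy values are assumptions.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    score = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        if prod > 0:
            score += 1  # pair ordered the same way in x and y
        elif prod < 0:
            score -= 1  # pair ordered in opposite ways
    return score / (n * (n - 1) / 2)


def rank_methods(tau_by_method):
    """Rank 1 goes to the highest tau; tied methods get consecutive ranks."""
    ordered = sorted(tau_by_method, key=tau_by_method.get, reverse=True)
    return {name: pos for pos, name in enumerate(ordered, start=1)}


# Hypothetical measured potencies and two sets of predicted scores.
measured = [7.2, 6.5, 8.1, 5.9]
predictions = {
    "method_A": [0.9, 0.4, 1.5, 0.1],
    "method_B": [0.2, 0.8, 0.9, 0.1],
}
taus = {name: kendall_tau(measured, scores) for name, scores in predictions.items()}
ranks = rank_methods(taus)
print(taus, ranks)
```

Repeating this for every trial and assay and then averaging each method's rank would give summary statistics of the kind reported per method.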
Statistics: Rankings of Compared Methods
| method | mean of all rankings (std.) | mean of median rankings (std.) |
|---|---|---|
| ranking-SVM (nonintegrated) | 8.84 (5.92) | 8.50 (5.74) |
| ranking-SVM/CB/thres0.0 | 11.69 (5.82) | 12.54 (5.76) |
| ranking-SVM/CB/thres0.15 | 11.26 (6.01) | 12.09 (5.89) |
| ranking-SVM/CB/thres0.3 | 9.63 (5.85) | 9.66 (5.77) |
| ranking-SVM/AB/thres7.0 | 11.10 (5.86) | 11.82 (6.10) |
| ranking-SVM/AB/thres5.0 | 8.97 (5.59) | 8.60 (5.55) |
| ranking-SVM/AB/thres3.0 | 8.70 (5.72) | 8.46 (5.30) |
| PLS (nonintegrated) | 10.28 (6.30) | 11.12 (6.08) |
| PLS/NSC/CB/thres0.0 | 11.21 (5.39) | 12.24 (5.31) |
| PLS/NSC/CB/thres0.15 | 11.14 (5.78) | 11.87 (5.42) |
| PLS/NSC/CB/thres0.3 | 10.24 (6.06) | 11.00 (5.56) |
| PLS/scaling | 11.28 (5.73) | 12.15 (5.95) |
| PLS/categorized by one-hot | 11.15 (5.30) | 11.58 (5.05) |
| PLS/potency information | 11.28 (5.38) | 12.35 (5.37) |
| SVR (nonintegrated) | 10.18 (6.41) | 10.93 (6.20) |
| SVR/NSC/CB/thres0.0 | 8.29 (5.12) | 8.49 (5.39) |
| SVR/NSC/CB/thres0.15 | 8.37 (5.09) | 8.18 (4.75) |
| SVR/NSC/CB/thres0.3 | 8.73 (5.63) | 8.65 (5.20) |
| SVR/scaling | 8.26 (5.06) | 7.94 (4.91) |
| SVR/categorized by one-hot | 8.24 (5.03) | 8.26 (4.99) |
Figure 5. Performance comparison with/without data integration. Kendall rank correlation coefficients for estrogen receptor alpha as a target by (a) SVM-rank-based methods, (b) PLS-based (numerical prediction) methods including the scaling method, and (c) SVR-based (numerical prediction) methods including the scaling method.
Figure 6. Exemplary cases of improved and deteriorated performance resulting from data integration. Predictive rankings for hERG (A) and epidermal growth factor receptor ErbB1 (B) reported as violin plots.
Figure 7. Chemical space of inhibitors of the two targets. Visualization of the chemical space of hERG (a) and ErbB1 (b) inhibitors from several assays using t-SNE. ECFP4 was used as the molecular representation.
Figure 8. Rank-model performance selected using LOOCV. Kendall rank correlation coefficients of test assays for hERG. Three modeling methods were tested for each test assay: nonintegrated, ranking-SVM by AB data integration (threshold: 3.0), and the best model selected by LOOCV.
Figure 9. Comparison of the predictive rankings of QSAR models for COX 1 inhibitors. Method-wise predictive rankings for COX 1 are reported as violin plots.