Natalia Czub, Adam Pacławski, Jakub Szlęk, Aleksander Mendyk.
Abstract
The drug discovery and development process requires substantial time, financial, and workforce resources. Any reduction in these burdens would benefit all stakeholders in the healthcare domain, including patients, governments, and companies. One of the critical stages in drug discovery is the selection of molecular structures with a strong affinity for a particular molecular target. A possible solution is the development of predictive models and their application in the screening process, but due to the complexity of the problem, simple statistical models might not be sufficient for practical application. The manuscript presents a best-in-class predictive model for serotonin 1A (5-HT1A) receptor affinity and its validation according to the Organisation for Economic Co-operation and Development (OECD) guidelines for regulatory purposes. The model was developed on a database of close to 9500 molecules using an automatic machine learning (AutoML) tool. Model selection was based on the Akaike information criterion (AIC) value and a 10-fold cross-validation routine, and good predictive ability was later confirmed on an additional external validation dataset of over 700 molecules. Moreover, the multi-start technique was applied to test whether the automatic model development procedure yields reliable results.
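The abstract's model selection relies on the Akaike information criterion, which trades goodness of fit against the number of estimated parameters. A stdlib-only sketch of the common least-squares form, AIC = n·ln(RSS/n) + 2k (the residuals and parameter counts below are illustrative, not values from the paper):

```python
import math

def aic_least_squares(residuals, n_params):
    """Akaike Information Criterion for a least-squares fit:
    AIC = n * ln(RSS / n) + 2k, where RSS is the residual sum of
    squares and k the number of estimated parameters."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * n_params

# Lower AIC is preferred: the 40-parameter fit is penalized
# despite its slightly smaller residuals.
simple = aic_least_squares([0.5, -0.4, 0.3, -0.6], n_params=3)
complex_ = aic_least_squares([0.45, -0.38, 0.28, -0.55], n_params=40)
```

This is why, in the evaluation table below, the model with the lowest AIC is not necessarily the one with the lowest cross-validation error.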
Keywords: 5-HT1A receptor; AutoML; OECD principles; QSAR model; curated database
Year: 2022 PMID: 35890310 PMCID: PMC9319483 DOI: 10.3390/pharmaceutics14071415
Source DB: PubMed Journal: Pharmaceutics ISSN: 1999-4923 Impact factor: 6.525
Figure 1. Histogram of pKi values in the training and test sets.
Figure 2. Distribution of the values of Lipinski's rule features in the training and test sets. (A) polar surface area; (B) MW, molecular weight; (C) nRot, number of rotatable bonds; (D) SLogP, logP value; (E) nHBDon, number of H-bond donors; (F) nHBAcc, number of H-bond acceptors.
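The descriptors in Figure 2 map onto Lipinski's rule of five, a standard drug-likeness filter. A minimal sketch of the rule check (the function name and example descriptor values are illustrative, not from the paper):

```python
def lipinski_violations(mw, slogp, n_hbd, n_hba):
    """Count violations of Lipinski's rule of five:
    MW <= 500 Da, logP <= 5, <= 5 H-bond donors, <= 10 H-bond acceptors."""
    rules = [mw <= 500, slogp <= 5, n_hbd <= 5, n_hba <= 10]
    return sum(not ok for ok in rules)

# Toy descriptor values in the drug-like range: no violations
print(lipinski_violations(mw=176.2, slogp=0.2, n_hbd=3, n_hba=2))  # → 0
```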
Results of the AutoML model evaluation for the training (10-fold cross-validation) and test sets. RMSE—root mean square error; R2—coefficient of determination; AIC—Akaike Information Criterion.
| Inputs Number | RMSE (10-CV) | R2 (10-CV) | RMSE (External Testing) | R2 (External Testing) | | AIC |
|---|---|---|---|---|---|---|
| 216 | 0.5437 | 0.7443 | 0.6806 | 0.6021 | 0.4362 | 1642.8 |
| 123 | 0.5523 | 0.7361 | 0.6830 | 0.5992 | 0.5185 | 1605.5 |
| 38 | 0.5782 | 0.7108 | 0.7282 | 0.5445 | 0.5196 | 1673.5 |
| 24 | 0.5926 | 0.6962 | 0.7276 | 0.5452 | 0.5298 | 1716.4 |
| 23 | 0.5941 | 0.6946 | 0.7597 | 0.5042 | 0.4882 | 1751.8 |
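The RMSE and R2 metrics in the table above are standard regression measures and can be computed for any vector of predictions. A stdlib-only sketch (the observed/predicted pKi values below are hypothetical):

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed vs. predicted pKi values
y_obs, y_hat = [7.1, 8.3, 6.5, 9.0], [7.0, 8.5, 6.4, 8.8]
```

Note the typical pattern visible in the table: test-set RMSE is somewhat higher (and R2 lower) than the 10-CV estimates.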
The structure of the stacked ensemble, with the coefficients of the GLM with Elastic Net as the “metalearner” model (39 inputs).
| Name | Coefficients |
|---|---|
| Intercept | −1.0046 |
| GBM_grid__1_AutoML_20210902_051708_model_54 | 0.6633 |
| GBM_grid__1_AutoML_20210902_051708_model_20 | 0.1599 |
| DeepLearning_grid__3_AutoML_20210902_051708_model_3 | 0.1089 |
| XGBoost_grid__1_AutoML_20210902_051708_model_120 | 0.0848 |
| DeepLearning_grid__3_AutoML_20210902_051708_model_8 | 0.0295 |
| DeepLearning_grid__2_AutoML_20210902_051708_model_2 | 0.0263 |
| DeepLearning_grid__3_AutoML_20210902_051708_model_2 | 0.0187 |
| DeepLearning_grid__3_AutoML_20210902_051708_model_5 | 0.0058 |
| XGBoost_grid__1_AutoML_20210902_051708_model_52 | 0.0066 |
| DeepLearning_grid__2_AutoML_20210902_051708_model_3 | 0.0074 |
| XGBoost_grid__1_AutoML_20210902_051708_model_95 | 0.0058 |
| XGBoost_grid__1_AutoML_20210902_051708_model_131 | 0.0058 |
| XGBoost_grid__1_AutoML_20210902_051708_model_90 | 0.0037 |
| XGBoost_grid__1_AutoML_20210902_051708_model_113 | 0.0026 |
| DeepLearning_grid__2_AutoML_20210902_051708_model_8 | 0.0025 |
| XGBoost_grid__1_AutoML_20210902_051708_model_30 | 0.0026 |
| GBM_grid__1_AutoML_20210902_044957_model_20 | 0.0005 |
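A GLM metalearner combines the base learners' predictions linearly: the ensemble output is the intercept plus the sum of each coefficient times the corresponding base-model prediction. A sketch using the top two coefficients from the table above (the base-model predictions are hypothetical; a real prediction would sum over all listed models):

```python
def metalearner_predict(intercept, coef_by_model, preds_by_model):
    """Stacked-ensemble prediction with a linear (GLM) metalearner:
    intercept + sum of coefficient * base-model prediction."""
    return intercept + sum(coef_by_model[name] * preds_by_model[name]
                           for name in coef_by_model)

# Coefficients taken from the table (top two rows); predictions hypothetical.
coefs = {"GBM_grid__1_AutoML_20210902_051708_model_54": 0.6633,
         "GBM_grid__1_AutoML_20210902_051708_model_20": 0.1599}
preds = {"GBM_grid__1_AutoML_20210902_051708_model_54": 8.2,
         "GBM_grid__1_AutoML_20210902_051708_model_20": 7.9}
pki = metalearner_predict(-1.0046, coefs, preds)
```

The Elastic Net penalty is what drives most base-model coefficients toward zero, so a few strong learners (here the two GBMs) dominate the ensemble.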
Figure 3. Scheme of the OECD principles.
Results of one-way ANOVA tests from the multi-start method (30 repetitions).
| Measure | Curated Database | GLASS Database |
|---|---|---|
| F value | 0.0002 | 0.0085 |
| p value | 1.0000 | 1.0000 |
| Statistically significantly different predictions | False | False |
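The multi-start check asks whether 30 repeated AutoML runs produce significantly different predictions; a one-way ANOVA F statistic is the between-group mean square divided by the within-group mean square. A stdlib-only sketch (the two prediction sets are hypothetical):

```python
def anova_f(groups):
    """One-way ANOVA F statistic: between-group mean square
    divided by within-group mean square."""
    k = len(groups)                      # number of groups (repetitions)
    n = sum(len(g) for g in groups)      # total observations
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Two near-identical prediction sets (hypothetical pKi values) give a tiny F,
# i.e. no statistically significant difference between repeated runs,
# consistent with the table above.
f = anova_f([[7.10, 7.25, 6.98], [7.11, 7.24, 6.99]])
```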
Mean absolute SHAP values (av|SHAP|) for the 10 most important variables.
| Variable | av\|SHAP\| | Description |
|---|---|---|
| SMR VSA3 | 0.088 | MOE MR VSA Descriptor 3 |
| GATS3p | 0.088 | Geary autocorrelation of lag 3 weighted by polarizability |
| PEOE VSA2 | 0.083 | MOE Charge VSA Descriptor 2 |
| SaaaC | 0.083 | Sum of aaaC |
| AATSC3se | 0.082 | Averaged and centered Moreau–Broto autocorrelation of lag 3 weighted by Sanderson EN |
| nBondsS | 0.078 | Number of single bonds in non-kekulized structure |
| AATS6dv | 0.073 | Averaged Moreau–Broto autocorrelation of lag 6 weighted by valence electrons |
| GATS6p | 0.071 | Geary coefficient of lag 6 weighted by polarizability |
| PEOE VSA9 | 0.071 | MOE Charge VSA Descriptor 9 |
| IC2 | 0.069 | 2-ordered neighborhood information content |
Figure 4. Summary of the SHAP analysis for the 10 input variables with the highest overall impact on model predictions. SMR VSA3—MOE MR VSA Descriptor 3; GATS3p—Geary autocorrelation of lag 3 weighted by polarizability; PEOE VSA2—MOE Charge VSA Descriptor 2; SaaaC—sum of aaaC; AATSC3se—averaged and centered Moreau–Broto autocorrelation of lag 3 weighted by Sanderson EN; nBondsS—number of single bonds in non-kekulized structure; AATS6dv—averaged Moreau–Broto autocorrelation of lag 6 weighted by valence electrons; GATS6p—Geary coefficient of lag 6 weighted by polarizability; PEOE VSA9—MOE Charge VSA Descriptor 9; IC2—2-ordered neighborhood information content.
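The av|SHAP| importance ranking above is simply the mean of the absolute per-molecule SHAP values for each descriptor. A minimal sketch (the SHAP matrix below is hypothetical; in practice it would come from a SHAP explainer run over the dataset):

```python
def mean_abs_shap(shap_matrix):
    """Mean absolute SHAP value per feature.
    shap_matrix: rows = samples (molecules), columns = features."""
    n = len(shap_matrix)
    n_feat = len(shap_matrix[0])
    return [sum(abs(row[j]) for row in shap_matrix) / n for j in range(n_feat)]

# Hypothetical per-molecule SHAP values for two descriptors
shap_vals = [[0.10, -0.05],
             [-0.06, 0.01]]
importances = mean_abs_shap(shap_vals)
```

Taking absolute values before averaging matters: a feature whose contributions are large but of mixed sign would otherwise appear unimportant.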
Figure 5. Calculated SHAP values in relation to MOE MR VSA Descriptor 3 (SMR VSA3).
Figure 6. Calculated SHAP values in relation to the Geary autocorrelation of lag 3 weighted by polarizability (GATS3p).
Figure 7. Two-dimensional partial dependence plot visualizing the interaction and the effect on the pKi value predicted by the model for the two most important variables based on the SHAP analysis: SMR VSA3 and GATS3p.