Aljoša Smajić, Melanie Grandits, Gerhard F Ecker.
Abstract
Machine learning (ML) models require an extensive, user-driven selection of molecular descriptors in order to learn from chemical structures and predict actives and inactives with high reliability. In addition, privacy concerns often restrict access to sufficient data, leading to models that cover only a narrow chemical space. Therefore, we propose a framework of re-trainable models that can be transferred from one local instance to another and that require a less extensive descriptor selection. The models are shared via a Jupyter Notebook, which allows a broader chemical space to be evaluated and incorporated while most of the tunable parameters remain pre-defined. This enables the models to be updated in a decentralized, facile, and fast manner. Herein, the method was evaluated with six transporter datasets (BCRP, BSEP, OATP1B1, OATP1B3, MRP3, P-gp), which revealed the general applicability of this approach.
Keywords: Classification models; Decentralization; Jupyter Notebook; Re-training; Transporter proteins
Year: 2022 PMID: 35964049 PMCID: PMC9375336 DOI: 10.1186/s13321-022-00635-2
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 8.489
Overview of the six transporter datasets which are provided on the LiverTox workspace
| Endpoint | LiverTox training: actives | LiverTox training: inactives | LiverTox test: actives | LiverTox test: inactives |
|---|---|---|---|---|
| Breast cancer resistance protein (BCRP) | 432 | 542 | 109 | 86 |
| Bile salt export pump (BSEP) | 114 | 410 | 43 | 116 |
| Organic anion transporting polypeptide 1B1 (OATP1B1) | 178 | 1472 | 64 | 137 |
| Organic anion transporting polypeptide 1B3 (OATP1B3) | 116 | 1547 | 40 | 169 |
| Multidrug resistance associated protein (MRP3) | 32 | 52 | – | – |
| P-glycoprotein (Pgp) | 612 | 549 | 86 | 48 |
Overview of the six transporter datasets that were used for the training of the models
| Endpoint | Training: actives | Training: inactives | Test: actives | Test: inactives |
|---|---|---|---|---|
| Breast cancer resistance protein (BCRP) | 904 | 786 | 149 | 38 |
| Bile salt export pump (BSEP) | 221 | 1100 | 3 | 7 |
| Organic anion transporting polypeptide 1B1 (OATP1B1) | 292 | 1675 | 18 | 3 |
| Organic anion transporting polypeptide 1B3 (OATP1B3) | 168 | 1818 | 13 | 4 |
| Multidrug resistance associated protein (MRP3) | 74 | 569 | 0 | 3 |
| P-glycoprotein (Pgp) | 1281 | 953 | 136 | 236 |
The training set comprises data from LiverTox plus data extracted from ChEMBL 26 and 27; the test set contains data extracted from ChEMBL 28 and PubChem.
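The training counts listed above differ widely in their ratio of actives to inactives. As a minimal sketch, the fraction of actives per endpoint can be computed directly from those counts; the dictionary layout and the helper name `active_fraction` are illustrative assumptions, not part of the published notebook.

```python
# Training-set counts (actives, inactives) copied from the table above.
training_counts = {
    "BCRP": (904, 786),
    "BSEP": (221, 1100),
    "OATP1B1": (292, 1675),
    "OATP1B3": (168, 1818),
    "MRP3": (74, 569),
    "P-gp": (1281, 953),
}


def active_fraction(actives, inactives):
    """Share of active compounds in a dataset (hypothetical helper)."""
    return actives / (actives + inactives)


fractions = {
    name: active_fraction(actives, inactives)
    for name, (actives, inactives) in training_counts.items()
}

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%} actives")
```

Endpoints with a small active fraction are the ones where plain accuracy is least informative, which is one reason to read it alongside balanced accuracy and MCC.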
Fig. 1 Schematic overview of the descriptor analysis carried out for both ABC and SLC transporters
Fig. 2 Graphical illustration of the workflow for model generation
Statistical metrics for all four models of each dataset
| Metric | LR (train) | LR (test) | SVM (train) | SVM (test) | RF (train) | RF (test) | k-NN (train) | k-NN (test) |
|---|---|---|---|---|---|---|---|---|
| **BCRP** | | | | | | | | |
| Accuracy | 0.73 | 0.74 | 0.76 | 0.67 | 0.80 | 0.70 | 0.76 | 0.70 |
| Sensitivity | 0.75 | 0.81 | 0.75 | 0.69 | 0.79 | 0.71 | 0.79 | 0.74 |
| Specificity | 0.69 | 0.50 | 0.77 | 0.61 | 0.83 | 0.63 | 0.73 | 0.53 |
| Balanced accuracy | 0.72 | 0.65 | 0.76 | 0.65 | 0.80 | 0.67 | 0.76 | 0.63 |
| F1-score | 0.74 | 0.83 | 0.77 | 0.77 | 0.80 | 0.79 | 0.78 | 0.79 |
| AUC | 0.80 | 0.65 | 0.83 | 0.65 | 0.88 | 0.67 | 0.80 | 0.63 |
| Precision | 0.74 | 0.86 | 0.79 | 0.87 | 0.85 | 0.88 | 0.77 | 0.86 |
| MCC | 0.46 | 0.28 | 0.53 | 0.25 | 0.61 | 0.28 | 0.52 | 0.23 |
| **BSEP** | | | | | | | | |
| Accuracy | 0.84 | – | 0.72 | – | 0.78 | – | 0.83 | – |
| Sensitivity | 0.22 | – | 0.84 | – | 0.79 | – | 0.52 | – |
| Specificity | 0.96 | – | 0.69 | – | 0.77 | – | 0.89 | – |
| Balanced accuracy | 0.59 | – | 0.77 | – | 0.77 | – | 0.71 | – |
| F1-score | 0.30 | – | 0.54 | – | 0.57 | – | 0.53 | – |
| AUC | 0.73 | – | 0.85 | – | 0.87 | – | 0.79 | – |
| Precision | 0.59 | – | 0.42 | – | 0.50 | – | 0.60 | – |
| MCC | 0.28 | – | 0.44 | – | 0.49 | – | 0.45 | – |
| **OATP1B1** | | | | | | | | |
| Accuracy | 0.86 | 0.38 | 0.80 | 0.76 | 0.85 | 0.71 | 0.87 | 0.71 |
| Sensitivity | 0.20 | 0.33 | 0.74 | 0.83 | 0.63 | 0.72 | 0.35 | 0.67 |
| Specificity | 0.97 | 0.67 | 0.81 | 0.33 | 0.89 | 0.67 | 0.96 | 1 |
| Balanced accuracy | 0.59 | 0.50 | 0.77 | 0.58 | 0.74 | 0.69 | 0.65 | 0.83 |
| F1-score | 0.27 | 0.48 | 0.52 | 0.86 | 0.55 | 0.81 | 0.43 | 0.80 |
| AUC | 0.77 | 0.50 | 0.83 | 0.58 | 0.84 | 0.69 | 0.81 | 0.83 |
| Precision | 0.47 | 0.86 | 0.40 | 0.88 | 0.49 | 0.93 | 0.58 | 1 |
| MCC | 0.24 | – | 0.44 | 0.15 | 0.47 | 0.29 | 0.34 | 0.47 |
| **OATP1B3** | | | | | | | | |
| Accuracy | 0.91 | 0.35 | 0.84 | 0.71 | 0.86 | 0.59 | 0.92 | 0.65 |
| Sensitivity | 0.14 | 0.23 | 0.81 | 0.77 | 0.77 | 0.69 | 0.36 | 0.62 |
| Specificity | 0.98 | 0.75 | 0.84 | 0.50 | 0.87 | 0.25 | 0.97 | 0.75 |
| Balanced accuracy | 0.56 | 0.49 | 0.83 | 0.64 | 0.82 | 0.47 | 0.67 | 0.68 |
| F1-score | 0.20 | 0.35 | 0.46 | 0.80 | 0.48 | 0.72 | 0.41 | 0.73 |
| AUC | 0.79 | 0.49 | 0.88 | 0.64 | 0.89 | 0.47 | 0.80 | 0.68 |
| Precision | 0.45 | 0.75 | 0.32 | 0.83 | 0.35 | 0.75 | 0.50 | 0.89 |
| MCC | 0.20 | −0.02 | 0.44 | 0.25 | 0.46 | −0.05 | 0.38 | 0.31 |
| **MRP3** | | | | | | | | |
| Accuracy | 0.88 | – | 0.60 | – | 0.59 | – | 0.78 | – |
| Sensitivity | 0 | – | 0.77 | – | 0.68 | – | 0.20 | – |
| Specificity | 0.99 | – | 0.58 | – | 0.59 | – | 0.86 | – |
| Balanced accuracy | 0.5 | – | 0.67 | – | 0.62 | – | 0.53 | – |
| F1-score | 0 | – | 0.43 | – | 0.37 | – | 0.21 | – |
| AUC | 0.44 | – | 0.67 | – | 0.63 | – | 0.57 | – |
| Precision | 0.1 | – | 0.35 | – | 0.32 | – | 0.34 | – |
| MCC | 0 | – | 0.30 | – | 0.20 | – | 0.12 | – |
| **P-gp** | | | | | | | | |
| Accuracy | 0.74 | 0.65 | 0.72 | 0.68 | 0.76 | 0.68 | 0.71 | 0.64 |
| Sensitivity | 0.81 | 0.92 | 0.72 | 0.81 | 0.81 | 0.92 | 0.76 | 0.88 |
| Specificity | 0.64 | 0.28 | 0.71 | 0.50 | 0.70 | 0.35 | 0.64 | 0.32 |
| Balanced accuracy | 0.73 | 0.60 | 0.71 | 0.65 | 0.76 | 0.64 | 0.70 | 0.60 |
| F1-score | 0.78 | 0.75 | 0.73 | 0.74 | 0.79 | 0.77 | 0.75 | 0.74 |
| AUC | 0.80 | 0.60 | 0.77 | 0.65 | 0.80 | 0.64 | 0.76 | 0.60 |
| Precision | 0.77 | 0.64 | 0.78 | 0.69 | 0.80 | 0.66 | 0.75 | 0.64 |
| MCC | 0.46 | 0.27 | 0.44 | 0.33 | 0.53 | 0.34 | 0.43 | 0.25 |
Train: tenfold cross-validation
Test: external dataset
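For readers who want to recompute the threshold-based metrics in the table above, the following is a minimal sketch using the standard confusion-matrix definitions. The function name and the example counts are hypothetical. The source does not state which averaging scheme was used for the F1-score or precision, so values computed this way may differ from the published ones. AUC is omitted because it requires predicted scores rather than counts.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts (hypothetical helper).

    Ratios with an empty denominator are reported as 0.0.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)   # true positive rate (recall on actives)
    specificity = ratio(tn, tn + fp)   # true negative rate (recall on inactives)
    precision = ratio(tp, tp + fp)
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "precision": precision,
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = classification_metrics(tp=40, fp=20, tn=30, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

Balanced accuracy and MCC both account for the inactive class explicitly, so they are less flattering than accuracy when one class dominates.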
Fig. 3 Comparison of the performances (balanced accuracy) of random forest models using 197, 170 and 70 descriptors
Applicability domain estimation for all six transporter protein test sets
| Endpoint | LOF result: compounds | LOF result: compounds |
|---|---|---|
| Breast cancer resistance protein (BCRP) | 156 | 31 |
| Bile salt export pump (BSEP) | 9 | 1 |
| Organic anion transporting polypeptide 1B1 (OATP1B1) | 15 | 6 |
| Organic anion transporting polypeptide 1B3 (OATP1B3) | 14 | 3 |
| Multidrug resistance associated protein (MRP3) | 3 | 0 |
| P-glycoprotein (Pgp) | 201 | 35 |
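The table above reports a local outlier factor (LOF) result per test set. As a rough illustration of how a LOF-based applicability-domain check can work, the sketch below scores a query compound against a training set in plain Python. The neighbour count `k`, the two-dimensional toy descriptors, and all function names are assumptions; the source does not give the settings or implementation that were used. Library implementations such as scikit-learn's `LocalOutlierFactor` also exist. The sketch assumes the training points are distinct, so that no reachability sum is zero.

```python
from math import dist


def _nearest(point, data, k, skip=None):
    """Return the k nearest (distance, index) pairs of `point` within `data`."""
    pairs = sorted(
        (dist(point, other), j) for j, other in enumerate(data) if j != skip
    )
    return pairs[:k]


def _local_reachability_density(neighbours, k_distance):
    """Inverse of the mean reachability distance to the given neighbours."""
    reach = [max(k_distance[j], d) for d, j in neighbours]
    return len(reach) / sum(reach)


def fit_lof(train, k):
    """Precompute the k-distance and density of every training point."""
    neighbours = [_nearest(p, train, k, skip=i) for i, p in enumerate(train)]
    k_distance = [nb[-1][0] for nb in neighbours]
    density = [_local_reachability_density(nb, k_distance) for nb in neighbours]
    return k_distance, density


def lof_score(query, train, k, k_distance, density):
    """LOF of a new point relative to the training set (about 1 means inlier)."""
    neighbours = _nearest(query, train, k)
    query_density = _local_reachability_density(neighbours, k_distance)
    return sum(density[j] for _, j in neighbours) / (len(neighbours) * query_density)


# Toy training set: a 5 x 5 grid of hypothetical two-descriptor compounds.
train = [(float(x), float(y)) for x in range(5) for y in range(5)]
k_distance, density = fit_lof(train, k=3)

inside = lof_score((2.2, 2.1), train, 3, k_distance, density)
outside = lof_score((20.0, 20.0), train, 3, k_distance, density)
print(f"inside: {inside:.2f}, outside: {outside:.2f}")
```

A score close to 1 means the query sits in a region as dense as its neighbours, while a score well above 1 marks it as an outlier relative to the training data. The cut-off that separates the two groups is a user choice.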