Isabella Mendolia, Salvatore Contino, Giada De Simone, Ugo Perricone, Roberto Pirrone.
Abstract
In recent years, the debate on applying Deep Learning to Virtual Screening has focused on the use of neural embeddings, as opposed to classical descriptors, to encode both structural and physical properties of ligands and/or targets. Interest in embeddings, together with the increasing use of Graph Neural Networks, has aimed at moving beyond molecular fingerprints, which are short-range embeddings of atomic neighborhoods. Here, we present EMBER, a novel molecular embedding made of seven molecular fingerprints arranged as different "spectra" that describe the same molecule, and we demonstrate its effectiveness using a deep convolutional architecture that assesses ligands' bioactivity on a data set containing twenty protein kinases with binding sites similar to that of CDK1. The data set itself is presented, and the architecture is explained in detail along with its training procedure. We report experimental results and an explainability analysis to assess the contribution of each fingerprint to different targets.
Keywords: deep learning; drug design; embedding; virtual screening
Year: 2022 PMID: 35216273 PMCID: PMC8877815 DOI: 10.3390/ijms23042156
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Accuracy metrics for all the targets. Best/worst values for each column are in bold/italic.
| Target | Acc. | Loss | Sensitivity | MCC | AUC | F1-Score |
|---|---|---|---|---|---|---|
| ACK | 0.9957 | 0.0226 | 0.5000 | 0.6742 | 0.9834 | 0.6463 |
| ALK | 0.9930 | 0.0402 | 0.6575 | 0.7913 | 0.9904 | 0.7804 |
| CDK1 | 0.9910 | 0.0314 | 0.4537 | 0.6397 | 0.9850 | 0.6059 |
| CDK2 | 0.9859 | 0.0431 | 0.5281 | 0.6338 | 0.9845 | 0.6287 |
| CDK6 | 0.9966 | 0.0210 | 0.5865 | 0.7523 | 0.9895 | 0.7305 |
| INSR | 0.9893 | 0.0329 | 0.3779 | 0.5830 | 0.9858 | 0.5342 |
| ITK | 0.9945 | 0.0232 | 0.5886 | 0.7302 | 0.9905 | 0.7154 |
| JAK2 | 0.9898 | 0.0472 | | | | |
| JNK3 | | | 0.5905 | 0.7610 | 0.9901 | 0.7381 |
| MELK | 0.9957 | 0.0229 | 0.7081 | 0.8270 | 0.9897 | 0.8188 |
| CHK1 | 0.9895 | 0.0512 | 0.6385 | 0.7650 | 0.9846 | 0.7565 |
| CK2A1 | 0.9942 | 0.0253 | 0.5166 | 0.6944 | 0.9857 | 0.6667 |
| CLK2 | 0.9936 | 0.0259 | | | 0.9771 | |
| DYRK1A | 0.9916 | 0.0321 | 0.4080 | 0.5987 | 0.9776 | 0.5591 |
| EGFR | 0.9845 | | 0.7536 | 0.8331 | 0.9874 | 0.8357 |
| ERK2 | 0.9881 | 0.0563 | 0.7295 | 0.8292 | 0.9886 | 0.8272 |
| GSK3 | | 0.0554 | 0.5827 | 0.6892 | | 0.6856 |
| IRAK4 | 0.9936 | 0.0287 | 0.7611 | 0.8611 | 0.9938 | 0.8571 |
| MAP2K1 | 0.9931 | 0.0319 | 0.5497 | 0.7184 | 0.9795 | 0.6954 |
| PDK1 | 0.9945 | 0.0271 | 0.6310 | 0.7757 | 0.9875 | 0.7613 |
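
Assuming the columns above follow the usual confusion-matrix definitions, the sketch below shows how accuracy can stay near 0.99 on a heavily imbalanced test set while sensitivity remains around 0.5, which is the pattern visible for most targets. The counts are hypothetical and are not taken from the paper, and the function name `classification_metrics` is illustrative only.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, F1 and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Sensitivity (recall): fraction of true actives that are recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    # Matthews correlation coefficient; defined as 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "f1": f1, "mcc": mcc}


# Hypothetical imbalanced case: 100 actives among 10,000 compounds.
example = classification_metrics(tp=50, fp=10, tn=9890, fn=50)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

With these hypothetical counts, accuracy is 0.994 although only half of the actives are found, which is why MCC and F1 are more informative than accuracy for this kind of data.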
True Positives versus Positives ratio and Enrichment Factors computed on the entire test set.
| Protein | TP/P 1% * | TP/P 2% * | TP/P 5% * | TP/P 10% * | EF 1% | EF 2% | EF 5% | EF 10% |
|---|---|---|---|---|---|---|---|---|
| ACK | 72/106 | 84/106 | 95/106 | 101/106 | 68 | 40 | 18 | 10 |
| ALK | 131/254 | 202/254 | 229/254 | 247/254 | 52 | 40 | 18 | 10 |
| CDK1 | 111/205 | 150/205 | 189/205 | 196/205 | 54 | 37 | 18 | 10 |
| CDK2 | 118/303 | 194/303 | 264/303 | 289/303 | 39 | 32 | 17 | 10 |
| CDK6 | 79/104 | 90/104 | 98/104 | 101/104 | 76 | 43 | 19 | 10 |
| INSR | 110/217 | 145/217 | 195/217 | 206/217 | 51 | 33 | 18 | 9 |
| ITK | 107/158 | 125/158 | 148/158 | 155/158 | 68 | 40 | 19 | 10 |
| JAK2 | 134/832 | 268/832 | 669/832 | 804/832 | 16 | 16 | 16 | 10 |
| JNK3 | 81/105 | 88/105 | 95/105 | 102/105 | 77 | 42 | 18 | 10 |
| MELK | 130/185 | 157/185 | 178/185 | 181/185 | 70 | 42 | 19 | 10 |
| CHK1 | 134/343 | 233/343 | 300/343 | 324/343 | 39 | 34 | 17 | 9 |
| CK2A1 | 100/151 | 117/151 | 141/151 | 146/151 | 66 | 39 | 19 | 10 |
| CLK2 | 59/102 | 73/102 | 87/102 | 96/102 | 58 | 36 | 17 | 9 |
| DYRK1A | 97/174 | 126/174 | 152/174 | 162/174 | 56 | 36 | 17 | 9 |
| EGFR | 134/702 | 268/702 | 586/702 | 664/702 | 19 | 19 | 17 | 9 |
| ERK2 | 133/525 | 267/525 | 471/525 | 505/525 | 25 | 25 | 18 | 10 |
| GSK3 | 132/393 | 226/393 | 327/393 | 353/393 | 34 | 29 | 17 | 9 |
| IRAK4 | 134/339 | 263/339 | 320/339 | 333/339 | 40 | 39 | 19 | 10 |
| MAP2K1 | 118/191 | 142/191 | 167/191 | 178/191 | 62 | 37 | 17 | 9 |
| PDK1 | 123/187 | 149/187 | 170/187 | 181/187 | 66 | 40 | 18 | 10 |
* Percentages are relative to the evaluated test set (13,400 compounds), i.e., 1% = 134 molecules.
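
The enrichment factors in the table can be checked against the TP/P columns. The sketch below assumes the standard definition, EF = (hit rate in the top x%) / (hit rate in the whole test set), with a test set of 13,400 compounds as stated in the footnote. The function name `enrichment_factor` is illustrative, and the source does not state which implementation the authors used.

```python
TEST_SET_SIZE = 13400  # from the table footnote


def enrichment_factor(tp_top, total_actives, fraction, n_total=TEST_SET_SIZE):
    """EF at a given fraction: hit rate among the top-ranked compounds
    divided by the hit rate expected from random selection."""
    n_selected = round(fraction * n_total)
    hit_rate_top = tp_top / n_selected
    hit_rate_random = total_actives / n_total
    return hit_rate_top / hit_rate_random


# Values taken from the table rows for ACK and JAK2.
ef_ack_1 = enrichment_factor(72, 106, 0.01)
ef_ack_2 = enrichment_factor(84, 106, 0.02)
ef_jak2_1 = enrichment_factor(134, 832, 0.01)

print(round(ef_ack_1), round(ef_ack_2), round(ef_jak2_1))
```

Under this definition the three checked entries round to 68, 40, and 16, matching the table. For JAK2, all 134 compounds in the top 1% are actives, so its EF 1% is already at the maximum attainable value for that target (13,400 / 832, about 16).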
The top five test set molecules prioritized by our classifier as the most active on the CDK1 target.
| Molecule ChEMBLID | Chemical Structure | |
|---|---|---|
| CHEMBL192216 | | 2 nM |
| CHEMBL3644025 | | 82 nM |
| CHEMBL445125 | | 500 nM |
| CHEMBL2403087 | | 183 nM |
| CHEMBL2403084 | | 148 nM |
Figure 1. Explainability results using SHAP; (a) average SHAP values for each fingerprint computed on the entire test set separately for each target; (b) example of single target explainability analysis for CDK1: SHAP values are reported for each fingerprint, and each row has been grouped into 64 bins to enhance readability.
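
The 64-bin grouping mentioned in the caption can be illustrated with a minimal sketch. It assumes that each fingerprint row is split into contiguous, near-equal bins and that the values in each bin are averaged. The caption does not say whether the authors averaged or summed, and the row length of 1024 used below is hypothetical.

```python
def bin_values(values, n_bins=64):
    """Group a row of per-bit values into n_bins contiguous bins
    and return the mean of each bin."""
    n = len(values)
    if n < n_bins:
        raise ValueError("row is shorter than the number of bins")
    # Integer bin edges; with n >= n_bins every bin holds at least one value.
    edges = [i * n // n_bins for i in range(n_bins + 1)]
    return [sum(values[a:b]) / (b - a) for a, b in zip(edges, edges[1:])]


# Hypothetical row of 1024 per-bit attribution values.
row = [float(i) for i in range(1024)]
binned = bin_values(row)
print(len(binned), binned[0], binned[-1])
```

Averaging keeps the binned values on the same scale as the per-bit values; summing instead would show the total contribution of each bin.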
Summary of all targets, with the numbers of active and inactive compounds obtained after preprocessing.
| Target | PDB ID | Ligand Code * | Actives | Inactives |
|---|---|---|---|---|
| ACK | 5ZXB | 9KO | 746 | 159,775 |
| ALK | 6E0R | HKJ | 1665 | 227,247 |
| CDK1 | 6GU2 | F9Z | 1241 | 124,473 |
| CDK2 | 6INL | AJR | 1924 | 225,087 |
| CDK6 | 5L2S | 6ZV | 646 | 256,561 |
| INSR | 5E1S | 5JA | 1423 | 195,990 |
| ITK | 4RFM | 3P6 | 1001 | 135,007 |
| JAK2 | 6M9H | J9D | 5526 | 577,409 |
| JNK3 | 2B1P | AIZ | 658 | 95,252 |
| MELK | 6GVX | TAK | 1215 | 246,662 |
| CHK1 | 6FC8 | D4Q | 2175 | 21,763 |
| CK2A1 | 6JWA | 5ID | 1053 | 10,534 |
| CLK2 | 6FYL | 3NG | 671 | 6800 |
| DYRK1A | 4YLK | 4E2 | 1126 | 11,274 |
| EGFR | 5GNK | 80U | 4757 | 47,541 |
| ERK2 | 6OPH | 6QB | 3525 | 35,237 |
| GSK3B | 5F94 | 3UO | 2578 | 25,768 |
| IRAK4 | 6EG9 | OLI | 2131 | 21,282 |
| MAP2K1 | 4AN9 | ACP; 2P7 | 1254 | 12,508 |
| PDK1 | 3NAX | MP7 | 1117 | 11,166 |
* Ligands with the highest affinity.
Figure 2. The proposed architecture. (a) Network layout. (b) Model summary.