Timothy E H Allen, Andrew J Wedlake, Elena Gelžinytė, Charles Gong, Jonathan M Goodman, Steve Gutsell, Paul J Russell.
Abstract
Deep learning neural networks, constructed for the prediction of chemical binding at 79 pharmacologically important human biological targets, show extremely high performance on test data (accuracy 92.2 ± 4.2%, MCC 0.814 ± 0.093 and ROC-AUC 0.96 ± 0.04). A new molecular similarity measure, Neural Network Activation Similarity, has been developed, based on signal propagation through the network. This is complementary to standard Tanimoto similarity, and the combined use increases confidence in the computer's prediction of activity for new chemicals by providing a greater understanding of the underlying justification. The in silico prediction of these human molecular initiating events is central to the future of chemical safety risk assessment and improves the efficiency of safety decision making. This journal is © The Royal Society of Chemistry.
Year: 2020 PMID: 34123016 PMCID: PMC8159362 DOI: 10.1039/d0sc01637c
Source DB: PubMed Journal: Chem Sci ISSN: 2041-6520 Impact factor: 9.825
Pharmacological targets analyzed in this work. Data were extracted from ChEMBL version 23 and ToxCast. The total test set was 144 109 actives and 141 796 inactives for a total of 285 905 compounds
| Target | Target gene | Actives | Inactives | Total |
|---|---|---|---|---|
| Acetylcholinesterase | AChE | 2611 | 1964 | 4575 |
| Adenosine A2a receptor | ADORA2A | 3943 | 2082 | 6025 |
| Alpha-2a adrenergic receptor | ADRA2A | 842 | 1013 | 1855 |
| Androgen receptor | AR | 2637 | 7283 | 9920 |
| Beta-1 adrenergic receptor | ADRB1 | 1260 | 1080 | 2340 |
| Beta-2 adrenergic receptor | ADRB2 | 1943 | 2012 | 3955 |
| Delta opioid receptor | OPRD1 | 3006 | 1219 | 4225 |
| Dopamine D1 receptor | DRD1 | 1350 | 1990 | 3340 |
| Dopamine D2 receptor | DRD2 | 5694 | 1136 | 6830 |
| Dopamine transporter | SLC6A3 | 2509 | 1916 | 4425 |
| Endothelin receptor ET-A | EDNRA | 1285 | 1150 | 2435 |
| Glucocorticoid receptor | NR3C1 | 3018 | 6972 | 9990 |
| hERG | KCNH2 | 4895 | 3245 | 8140 |
| Histamine H1 receptor | HRH1 | 1275 | 1105 | 2380 |
| Mu opioid receptor | OPRM1 | 3610 | 2305 | 5915 |
| Muscarinic acetylcholine receptor M1 | CHRM1 | 2014 | 1241 | 3255 |
| Muscarinic acetylcholine receptor M2 | CHRM2 | 1633 | 2032 | 3665 |
| Muscarinic acetylcholine receptor M3 | CHRM3 | 1537 | 1113 | 2650 |
| Norepinephrine transporter | SLC6A2 | 2910 | 1940 | 4850 |
| Serotonin 2a (5-HT2a) receptor | HTR2A | 3757 | 1033 | 4790 |
| Serotonin 3a (5-HT3a) receptor | HTR3A | 451 | 1054 | 1505 |
| Serotonin transporter | SLC6A4 | 4041 | 1134 | 5175 |
| Tyrosine-protein kinase LCK | LCK | 1732 | 523 | 2255 |
| Vasopressin V1a receptor | AVPR1A | 619 | 1056 | 1675 |
| Type-1 angiotensin II receptor | AGTR1 | 806 | 1179 | 1985 |
| RAC-alpha serine/threonine-protein kinase | AKT1 | 2765 | 1220 | 3985 |
| Beta-secretase 1 | BACE1 | 6016 | 2604 | 8620 |
| Cholinesterase | BCHE | 1400 | 2145 | 3545 |
| Caspase-1 | CASP1 | 1369 | 3196 | 4565 |
| Caspase-3 | CASP3 | 1177 | 1828 | 3005 |
| Caspase-8 | CASP8 | 330 | 1130 | 1460 |
| Muscarinic acetylcholine receptor M5 | CHRM5 | 679 | 1081 | 1760 |
| Inhibitor of nuclear factor kappa-B kinase subunit alpha | CHUK | 316 | 1069 | 1385 |
| Macrophage colony-stimulating factor 1 receptor | CSF1R | 1336 | 1049 | 2385 |
| Casein kinase I isoform delta | CSNK1D | 708 | 1027 | 1735 |
| Endothelin B receptor | EDNRB | 809 | 1236 | 2045 |
| Neutrophil elastase | ELANE | 2134 | 1371 | 3505 |
| Ephrin type-A receptor 2 | EPHA2 | 528 | 1102 | 1630 |
| Fibroblast growth factor receptor 1 | FGFR1 | 2163 | 1207 | 3370 |
| Peptidyl-prolyl | FKBP1A | 354 | 1006 | 1360 |
| Vascular endothelial growth factor receptor 1 | FLT1 | 1088 | 2077 | 3165 |
| Vascular endothelial growth factor receptor 3 | FLT4 | 674 | 1081 | 1755 |
| Tyrosine-protein kinase FYN | FYN | 420 | 1075 | 1495 |
| Glycogen synthase kinase-3 beta | GSK3B | 2549 | 1256 | 3805 |
| Histone deacetylase 3 | HDAC3 | 1051 | 1139 | 2190 |
| Insulin-like growth factor 1 receptor | IGF1R | 2483 | 1132 | 3615 |
| Insulin receptor | INSR | 887 | 1093 | 1980 |
| Vascular endothelial growth factor receptor 2 | KDR | 7816 | 1579 | 9395 |
| Leukotriene B4 receptor 1 | LTB4R | 350 | 1030 | 1380 |
| Tyrosine-protein kinase Lyn | LYN | 454 | 1046 | 1500 |
| Mitogen-activated protein kinase 1 | MAPK1 | 6209 | 11 076 | 17 285 |
| Mitogen-activated protein kinase 9 | MAPK9 | 1227 | 1088 | 2315 |
| MAP kinase-activated protein kinase 2 | MAPKAPK2 | 829 | 1156 | 1985 |
| Hepatocyte growth factor receptor | MET | 2871 | 1144 | 4015 |
| Matrix metalloproteinase-13 | MMP13 | 2388 | 1112 | 3500 |
| Matrix metalloproteinase-2 | MMP2 | 2938 | 1677 | 4615 |
| Matrix metalloproteinase-3 | MMP3 | 1759 | 1036 | 2795 |
| Matrix metalloproteinase-9 | MMP9 | 2582 | 1848 | 4430 |
| Serine/threonine-protein kinase NEK2 | NEK2 | 298 | 1057 | 1355 |
| P2Y purinoceptor 1 | P2RY1 | 560 | 1100 | 1660 |
| Serine/threonine-protein kinase PAK 4 | PAK4 | 380 | 1100 | 1480 |
| Phosphodiesterase 4A | PDE4A | 653 | 1017 | 1670 |
| Phosphodiesterase 5A | PDE5A | 1551 | 1174 | 2725 |
| Phosphatidylinositol 4,5-bisphosphate 3-kinase catalytic subunit alpha | PIK3CA | 4724 | 2086 | 6810 |
| Peroxisome proliferator-activated receptor gamma | PPARG | 4362 | 7283 | 11 645 |
| Protein tyrosine phosphatase non-receptor type 1 | PTPN1 | 1471 | 2179 | 3650 |
| Protein tyrosine phosphatase non-receptor type 11 | PTPN11 | 354 | 1211 | 1565 |
| Protein tyrosine phosphatase non-receptor type 2 | PTPN2 | 339 | 1206 | 1545 |
| RAF proto-oncogene serine/threonine-protein kinase | RAF1 | 1351 | 1084 | 2435 |
| Retinoic acid receptor alpha | RARA | 356 | 3249 | 3605 |
| Retinoic acid receptor beta | RARB | 298 | 3347 | 3645 |
| Rho-associated coiled-coil-containing protein kinase I | ROCK1 | 1293 | 1117 | 2410 |
| Ribosomal protein S6 kinase alpha-5 | RPS6KA5 | 224 | 1036 | 1260 |
| NAD-dependent protein deacetylase sirtuin-2 | SIRT2 | 361 | 1284 | 1645 |
| NAD-dependent protein deacetylase sirtuin-3 | SIRT3 | 151 | 1074 | 1225 |
| Proto-oncogene tyrosine-protein kinase Src | SRC | 2704 | 1531 | 4235 |
| Substance-K receptor | TACR2 | 876 | 1914 | 2790 |
| Thromboxane A2 receptor | TBXA2R | 978 | 1922 | 2900 |
| Tyrosine-protein kinase receptor TEK | TEK | 788 | 1132 | 1920 |
Summary of results for various DNN architectures for several targets in initial investigations. Best performing networks on the test data are highlighted in red. Full results can be found in the ESI (Tables S5–S9). The first column represents the NN architecture, showing the number of neurons in each hidden layer.
| Architecture | Training SE | Training SP | Training ACC | Training MCC | Training ROC-AUC | Validation SE | Validation SP | Validation ACC | Validation MCC | Validation ROC-AUC | Test SE | Test SP | Test ACC | Test MCC | Test ROC-AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | | | | | | | | | |
| [10] | 88.7 | 83.9 | 86.6 | 0.726 | 0.93 | 84.9 | 80.7 | 83.1 | 0.655 | 0.90 | 84.2 | 78.9 | 81.9 | 0.631 | 0.89 |
| [100] | 90.7 | 88.4 | 89.7 | 0.791 | 0.96 | 87.4 | 83.2 | 85.6 | 0.706 | 0.92 | 86.2 | 80.7 | 83.8 | 0.670 | 0.90 |
| [1000] | 88.0 | 83.7 | 86.2 | 0.718 | 0.93 | 85.5 | 78.0 | 82.3 | 0.637 | 0.89 | 84.4 | 78.8 | 82.0 | 0.632 | 0.88 |
| [10,10] | 90.7 | 89.7 | 90.3 | 0.802 | 0.96 | 86.1 | 82.9 | 84.7 | 0.688 | 0.92 | 84.3 | 82.4 | 83.5 | 0.664 | 0.90 |
| [100,100] | 91.5 | 91.3 | 91.4 | 0.826 | 0.97 | 87.1 | 85.2 | 86.3 | 0.721 | 0.92 | 85.0 | 84.2 | 84.7 | 0.689 | 0.91 |
| [1000,1000] | 95.2 | 96.6 | 95.8 | 0.915 | 0.99 | 88.0 | 86.7 | 87.4 | 0.744 | 0.93 | 84.7 | 84.0 | 84.4 | 0.684 | 0.92 |
| | | | | | | | | | | | | | | | |
| [10] | 97.6 | 89.9 | 95.0 | 0.888 | 0.98 | 97.2 | 90.2 | 94.7 | 0.884 | 0.98 | 97.2 | 88.5 | 94.2 | 0.871 | 0.97 |
| [100] | 97.8 | 92.9 | 96.1 | 0.913 | 0.99 | 96.9 | 90.9 | 94.8 | 0.886 | 0.98 | 97.2 | 90.2 | 94.8 | 0.884 | 0.98 |
| [1000] | 97.5 | 90.7 | 95.2 | 0.893 | 0.98 | 97.2 | 89.5 | 94.6 | 0.879 | 0.98 | 97.0 | 89.1 | 94.3 | 0.872 | 0.97 |
| [10,10] | 97.8 | 92.7 | 96.0 | 0.911 | 0.99 | 97.6 | 90.6 | 95.3 | 0.893 | 0.98 | 97.0 | 90.0 | 94.6 | 0.880 | 0.98 |
| [100,100] | 98.1 | 93.7 | 96.6 | 0.924 | 0.99 | 96.8 | 90.8 | 94.8 | 0.883 | 0.98 | 96.9 | 90.5 | 94.7 | 0.881 | 0.98 |
| [1000,1000] | 99.0 | 77.8 | 91.7 | 0.817 | 1.00 | 97.3 | 92.4 | 95.6 | 0.903 | 0.98 | 96.7 | 91.2 | 94.8 | 0.884 | 0.98 |
| | | | | | | | | | | | | | | | |
| [10] | 58.0 | 99.3 | 88.3 | 0.691 | 0.88 | 59.1 | 98.9 | 88.3 | 0.691 | 0.87 | 55.8 | 99.0 | 87.5 | 0.667 | 0.86 |
| [100] | 69.1 | 98.7 | 90.9 | 0.759 | 0.91 | 64.4 | 98.1 | 89.1 | 0.711 | 0.87 | 64.5 | 98.3 | 89.3 | 0.715 | 0.86 |
| [1000] | 65.0 | 98.6 | 89.7 | 0.727 | 0.89 | 61.6 | 98.2 | 88.5 | 0.693 | 0.86 | 61.5 | 98.3 | 88.6 | 0.695 | 0.86 |
| [10,10] | 67.1 | 99.0 | 90.5 | 0.750 | 0.90 | 62.7 | 98.5 | 89.0 | 0.708 | 0.86 | 61.6 | 98.6 | 88.8 | 0.701 | 0.87 |
| [100,100] | 76.1 | 99.4 | 93.2 | 0.823 | 0.95 | 69.2 | 97.8 | 90.2 | 0.740 | 0.87 | 68.0 | 98.1 | 90.1 | 0.737 | 0.87 |
| [1000,1000] | 73.3 | 99.4 | 92.5 | 0.804 | 0.94 | 65.8 | 97.9 | 89.3 | 0.717 | 0.87 | 64.4 | 98.2 | 89.2 | 0.713 | 0.87 |
| | | | | | | | | | | | | | | | |
| [10] | 93.5 | 53.5 | 77.5 | 0.529 | 0.87 | 91.6 | 48.2 | 74.3 | 0.454 | 0.82 | 92.0 | 46.1 | 73.7 | 0.441 | 0.81 |
| [100] | 94.1 | 49.9 | 76.4 | 0.508 | 0.86 | 92.2 | 45.8 | 74.2 | 0.443 | 0.81 | 92.9 | 44.1 | 73.4 | 0.438 | 0.80 |
| [1000] | 89.7 | 64.3 | 79.7 | 0.568 | 0.87 | 84.6 | 59.5 | 72.7 | 0.458 | 0.82 | 87.0 | 55.0 | 74.2 | 0.450 | 0.81 |
| [10,10] | 94.1 | 85.0 | 90.5 | 0.800 | 0.97 | 86.1 | 67.0 | 78.4 | 0.545 | 0.86 | 86.3 | 63.8 | 77.3 | 0.519 | 0.85 |
| [100,100] | 96.2 | 90.5 | 93.9 | 0.873 | 0.98 | 84.9 | 69.8 | 78.8 | 0.555 | 0.86 | 85.1 | 65.5 | 77.3 | 0.519 | 0.84 |
| [1000,1000] | 95.0 | 87.5 | 92.0 | 0.833 | 0.98 | 84.2 | 66.8 | 77.2 | 0.520 | 0.86 | 83.4 | 65.5 | 76.2 | 0.498 | 0.84 |
| | | | | | | | | | | | | | | | |
| [10] | 99.2 | 72.4 | 93.4 | 0.799 | 0.98 | 99.1 | 66.1 | 91.7 | 0.752 | 0.97 | 99.0 | 67.6 | 92.1 | 0.760 | 0.97 |
| [100] | 99.0 | 89.2 | 96.9 | 0.906 | 0.99 | 98.4 | 83.8 | 95.1 | 0.856 | 0.98 | 98.6 | 83.1 | 95.2 | 0.857 | 0.98 |
| [1000] | 99.2 | 77.2 | 94.4 | 0.831 | 0.98 | 98.8 | 73.8 | 93.3 | 0.797 | 0.97 | 99.1 | 73.9 | 93.5 | 0.805 | 0.97 |
| [10,10] | 99.0 | 89.7 | 97.0 | 0.909 | 0.99 | 98.9 | 82.1 | 95.1 | 0.857 | 0.98 | 98.7 | 83.1 | 95.3 | 0.858 | 0.98 |
| [100,100] | 99.4 | 95.8 | 98.6 | 0.959 | 1.00 | 98.2 | 86.1 | 95.6 | 0.867 | 0.98 | 98.6 | 86.8 | 96.0 | 0.882 | 0.99 |
| [1000,1000] | 99.4 | 98.2 | 99.1 | 0.975 | 1.00 | 98.1 | 91.1 | 96.5 | 0.897 | 0.99 | 98.4 | 90.5 | 96.6 | 0.901 | 0.99 |
SE = sensitivity, SP = specificity, ACC = accuracy, MCC = Matthews correlation coefficient, ROC-AUC = area under receiver operating characteristic curve.
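The abbreviations in this footnote follow the usual confusion-matrix definitions. As a minimal sketch (not the authors' code; the function names and the example counts are illustrative assumptions), they could be computed like this:

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """SE, SP and ACC as percentages, plus MCC, from confusion-matrix counts."""
    se = 100.0 * tp / (tp + fn) if (tp + fn) else float("nan")   # sensitivity
    sp = 100.0 * tn / (tn + fp) if (tn + fp) else float("nan")   # specificity
    acc = 100.0 * (tp + tn) / (tp + fn + tn + fp)                # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0          # Matthews correlation
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}

def roc_auc(scores, labels):
    """ROC-AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties count as half a correct pair. This is the O(n^2) pairwise form,
    chosen for clarity rather than speed.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts, not taken from the tables above.
example = binary_metrics(tp=90, fn=10, tn=80, fp=20)
example_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Unlike ACC, MCC uses all four cells of the confusion matrix, which makes it the more informative statistic for targets with unbalanced active and inactive counts.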
Average model performance and standard deviation (SD) for the best performing DNN models at each target. Full results can be found in the ESI (Table S10).
| | Training SE | Training SP | Training ACC | Training MCC | Training ROC-AUC | Validation SE | Validation SP | Validation ACC | Validation MCC | Validation ROC-AUC | Test SE | Test SP | Test ACC | Test MCC | Test ROC-AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AVERAGE | 92.1 | 96.5 | 95.8 | 0.901 | 0.99 | 86.9 | 93.2 | 92.5 | 0.822 | 0.96 | 86.2 | 92.9 | 92.2 | 0.814 | 0.96 |
| SD | 8.8 | 4.2 | 3.1 | 0.069 | 0.02 | 11.7 | 5.9 | 4.1 | 0.091 | 0.04 | 12.1 | 6.5 | 4.2 | 0.093 | 0.04 |
SE = sensitivity, SP = specificity, ACC = accuracy, MCC = Matthews correlation coefficient, ROC-AUC = area under receiver operating characteristic curve.
Fig. 1 Test MCC vs. total number of compounds for all biological targets. Error bars shown are standard deviations across the five-fold clustered cross-validation.
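In a clustered cross-validation, whole clusters of similar compounds are kept together in one fold, so that close analogues do not appear in both training and test data. The sketch below illustrates one way to build such folds. It assumes cluster labels are already available; the clustering method itself is not described in this record, and the greedy balancing rule and all names here are illustrative assumptions rather than the authors' procedure.

```python
from collections import defaultdict

def clustered_folds(cluster_ids, n_folds=5):
    """Split compound indices into folds without splitting any cluster.

    cluster_ids[i] is the (hypothetical) cluster label of compound i.
    Clusters are placed largest first into whichever fold is currently
    smallest, which keeps fold sizes roughly balanced.
    """
    clusters = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        clusters[cid].append(idx)

    folds = [[] for _ in range(n_folds)]
    for members in sorted(clusters.values(), key=len, reverse=True):
        smallest = min(folds, key=len)
        smallest.extend(members)
    return folds

# Ten compounds in six made-up clusters.
labels = [0, 0, 0, 1, 1, 2, 2, 3, 4, 5]
folds = clustered_folds(labels, n_folds=5)
```

Each fold then serves once as the held-out set, and the spread of the per-fold statistic gives error bars of the kind shown in the figure.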
Fig. 2 Positive probability curve showing compounds tested at ADORA2A. Positive probability is the probability that a compound is active at ADORA2A, calculated by a trained DNN using the softmax function. Percentages in each 10% section indicate the percentage of compounds in that section that are experimental positives.
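The two quantities in this caption can be sketched as follows: a two-class softmax that turns network outputs into a positive probability, and a tally of experimental positives within each 10% probability section. This is a minimal illustration under assumptions: the function names, the logit inputs and the example values are hypothetical and are not taken from the paper.

```python
import math

def positive_probability(logit_inactive, logit_active):
    """Two-class softmax: probability assigned to the 'active' class."""
    m = max(logit_inactive, logit_active)      # subtract the max for numerical stability
    e_inactive = math.exp(logit_inactive - m)
    e_active = math.exp(logit_active - m)
    return e_active / (e_inactive + e_active)

def percent_positive_per_bin(probs, labels, n_bins=10):
    """Percentage of experimental positives (label 1) in each probability bin.

    Bins are [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]; empty bins give None.
    """
    counts = [0] * n_bins
    positives = [0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        counts[b] += 1
        positives[b] += y
    return [100.0 * pos / c if c else None for pos, c in zip(positives, counts)]

# Made-up probabilities and experimental labels.
probs = [0.05, 0.15, 0.95, 0.97, 0.92]
labels = [0, 0, 1, 1, 0]
per_bin = percent_positive_per_bin(probs, labels)
```

If the model is well calibrated, the percentage of experimental positives should rise from the lowest section to the highest.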
Average model performance and standard deviation (SD) for the best performing DNN models at each target for the validation data sets when adjusted activity thresholds of 0.1 and 0.9 were applied. Full results can be found in the ESI (Tables S11 and S12).
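| | SE (threshold 0.1) | SP (threshold 0.1) | ACC (threshold 0.1) | MCC (threshold 0.1) | SE (threshold 0.9) | SP (threshold 0.9) | ACC (threshold 0.9) | MCC (threshold 0.9) |
|---|---|---|---|---|---|---|---|---|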
| AVERAGE | 97.0 | 59.9 | 79.9 | 0.619 | 53.5 | 98.8 | 81.3 | 0.610 |
| SD | 4.2 | 21.3 | 10.2 | 0.137 | 22.8 | 1.6 | 6.6 | 0.125 |
SE = sensitivity, SP = specificity, ACC = accuracy, MCC = Matthews correlation coefficient.
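The trade-off in this table (high sensitivity at a threshold of 0.1, high specificity at 0.9) follows from how a cutoff on the positive probability assigns classes. The sketch below assumes that a compound is called active when its positive probability is at or above the threshold; that rule, the function name and the example data are assumptions made for illustration.

```python
def se_sp_at_threshold(probs, labels, threshold):
    """Sensitivity and specificity (percent) when 'active' means p >= threshold."""
    tp = fn = tn = fp = 0
    for p, y in zip(probs, labels):
        predicted_active = p >= threshold
        if y == 1 and predicted_active:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_active:
            fp += 1
        else:
            tn += 1
    se = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    return se, sp

# Made-up positive probabilities: three experimental actives, three inactives.
probs = [0.95, 0.60, 0.20, 0.05, 0.40, 0.92]
labels = [1, 1, 1, 0, 0, 0]
se_low, sp_low = se_sp_at_threshold(probs, labels, 0.1)    # permissive cutoff
se_high, sp_high = se_sp_at_threshold(probs, labels, 0.9)  # strict cutoff
```

A low threshold suits screening, where missing an active is costly; a high threshold suits cases where a positive call needs to be trusted.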
Average model performance and standard deviation (SD) for the structural alert (SA), random forest (RF) and deep neural network (DNN) models at each target on a consistent training/test set split. Full comparisons can be found in the ESI (Table S13).
| Model | Statistic | Training SE | Training SP | Training ACC | Training MCC | Test SE | Test SP | Test ACC | Test MCC |
|---|---|---|---|---|---|---|---|---|---|
| SA | Average | 91.0 | 95.8 | 95.0 | 0.882 | 84.1 | 93.5 | 91.1 | 0.790 |
| SA | SD | 7.4 | 3.5 | 2.3 | 0.050 | 11.6 | 4.6 | 4.2 | 0.096 |
| RF | Average | 94.9 | 94.7 | 96.4 | 0.915 | 89.0 | 90.4 | 92.2 | 0.815 |
| RF | SD | 9.9 | 5.7 | 3.1 | 0.072 | 11.6 | 8.1 | 4.0 | 0.091 |
| DNN | Average | 92.3 | 96.8 | 95.9 | 0.904 | 87.9 | 93.6 | 92.8 | 0.832 |
| DNN | SD | 8.8 | 3.0 | 3.1 | 0.066 | 10.4 | 5.9 | 4.0 | 0.089 |
SE = sensitivity, SP = specificity, ACC = accuracy, MCC = Matthews correlation coefficient.
Fig. 3 Histograms showing the distribution of test set model performance across the three modelling approaches, structural alerts (SAs), random forests (RFs) and deep neural networks (DNNs).
Fig. 5 Amiodarone, a typical KCNH2 binder, and its five most similar neighbours as measured by NNAS (A–E), Tanimoto similarity (F–J) and RFS (K–O). Starred network activation similarity values have been rounded to 1.000 but do not represent exact matches.
Average statistical performance for models with test sets generated using chemical clustering and generated randomly. Clustered statistics are taken from Table 3 and random statistics from Table 5. The difference shown is the change in performance when moving from a random to a clustered test set.
| | ACC | MCC | ROC-AUC |
|---|---|---|---|
| Clustered test set | 92.2 | 0.814 | 0.96 |
| Random test set | 92.8 | 0.832 | 0.96 |
| Difference | −0.6 | −0.018 | 0 |
ACC = accuracy, MCC = Matthews correlation coefficient, ROC-AUC = area under receiver operating characteristic curve.
Fig. 4 A graph showing the relationship between Tanimoto similarity and NNAS values for the three typical binders andarine, amiodarone and DASB.
Fig. 6 DASB, a typical SLC6A4 binder, and its five most similar neighbours as measured by NNAS (A–E), Tanimoto similarity (F–J) and RFS (K–O).
Fig. 7 Andarine, a typical AR binder, and its five most similar neighbours as measured by NNAS (A–E), Tanimoto similarity (F–J) and RFS (K–O). Starred network activation similarity values have been rounded to 1.000 but do not represent exact matches.
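The Tanimoto similarity used in Figs. 4 to 7 is the standard ratio of shared to total "on" bits between two molecular fingerprints. The sketch below shows that calculation and a simple nearest-neighbour ranking. It covers Tanimoto similarity only, since the NNAS and RFS formulas are not given in this record. The fingerprints are made-up bit sets and the function names are illustrative assumptions.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    # Two empty fingerprints are treated as having similarity 0.0 (a convention chosen here).
    return len(a & b) / union if union else 0.0

def nearest_neighbours(query_fp, library, k=5):
    """Return the k library entries most similar to the query, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Hypothetical fingerprints, not real compounds.
query = {1, 2, 3, 4}
library = {"A": {1, 2, 3, 5}, "B": {1, 2}, "C": {7, 8}}
top_two = nearest_neighbours(query, library, k=2)
```

Here the ranking depends only on structural overlap, which is why a network-derived measure can surface different neighbours for the same query compound.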