Neural network activation similarity: a new measure to assist decision making in chemical toxicology.

Timothy E H Allen1,2, Andrew J Wedlake2, Elena Gelžinytė2, Charles Gong2, Jonathan M Goodman2, Steve Gutsell3, Paul J Russell3.   

Abstract

Deep learning neural networks, constructed for the prediction of chemical binding at 79 pharmacologically important human biological targets, show extremely high performance on test data (accuracy 92.2 ± 4.2%, MCC 0.814 ± 0.093 and ROC-AUC 0.96 ± 0.04). A new molecular similarity measure, Neural Network Activation Similarity, has been developed, based on signal propagation through the network. This is complementary to standard Tanimoto similarity, and the combined use increases confidence in the computer's prediction of activity for new chemicals by providing a greater understanding of the underlying justification. The in silico prediction of these human molecular initiating events is central to the future of chemical safety risk assessment and improves the efficiency of safety decision making. This journal is © The Royal Society of Chemistry.

Year:  2020        PMID: 34123016      PMCID: PMC8159362          DOI: 10.1039/d0sc01637c

Source DB:  PubMed          Journal:  Chem Sci        ISSN: 2041-6520            Impact factor:   9.825


Introduction

Machine learning algorithms are mathematical models able to learn from data without explicit programming from a human expert. The algorithms have gained much attention as high-quality predictors and classifiers. Classification tasks in toxicology are often explored using a variety of machine learning algorithms. Some examples of this include a support vector machine for predicting liver injury,[1] genotoxicity prediction using random forests (RFs),[2] carcinogenicity predicted using nearest neighbour calculations,[3] and using a naive Bayes classifier,[4] and ensemble methods, combining several classifiers into a single decision-making model for hepatotoxicity.[5]

Deep learning or deep neural networks (DNNs) are a machine learning approach that has been gaining attention. These algorithms are extremely powerful but require a large amount of data, and high-powered computers for training.[6] The power of DNNs has been illustrated in drug discovery, where the Merck Molecular Activity Challenge in 2012 was won by an approach using neural networks to make molecular activity predictions.[7] In toxicology these networks can be used to aid in risk assessment and safety science in predictive toxicology.[8] An example of this approach won the Toxicity in the 21st Century (Tox21) prediction challenge in 2015,[9,10] and deep learning has also been applied to predict drug-induced liver injury[11] and cardiac toxicity.[12] A number of studies have shown that DNNs outperform other machine learning algorithms on identical prediction tasks[7,9,10,13] including direct comparisons to RFs in regression[14] and classification.[15]

In toxicology, computational methods need to be transparent to be accepted by toxicologists, risk assessors and regulators.[16] Several attempts have been made to do this in the past,[17] including by assigning importance values to features in the test data by removing features of the test set and observing changes in DNN output,[18] or by calculating the gradient of DNN output with respect to features in the test set.[19,20] Making the machine learning methodology more akin to a read-across, in which experimental data on one chemical is used in the evaluation of a similar chemical, is a good strategy to increase confidence in the prediction.[21] We aim to extend these methodologies, in a way appropriate to toxicity prediction, using chemical inputs.

Human molecular initiating events (MIEs) make good targets for prediction using DNNs. MIEs are initial chemical–biological interactions which start adverse outcome pathways (AOPs).[22-24] In the past, a wide variety of computational methods have been used to predict MIEs. Some of these methods rely on the use of chemical substructures as alerts or to define chemical categories for the prediction of molecular activity.[25-30] Some compare chemical similarity and reactivity[31] or use decision trees to classify reactivity.[32] Some use more complex calculations including quantum chemistry to identify reactivity barriers.[33] There are also a wide variety of mathematical quantitative structure–activity relationships (QSARs) appropriate for this task.[16,34] While machine learning algorithms such as DNNs,[13] shallow NNs and decision trees,[35] and convolutional neural networks[36] have been used in ligand–receptor binding binary classification tasks in the past, this is the first time DNNs have been applied to predicting MIEs.

A wide variety of diverse and important human targets were chosen for DNN classifier construction, including the well-known Bowes targets[37] and an extended list identified in our previous work[38] including targets published by Sipes et al.[39] Many of the previously published papers on machine learning only provide models for a single biological target or endpoint.[1-5,11,12,15] This limits their usefulness and coverage of human toxicology, which we aim to overcome by constructing models for a large number of MIEs. Generating models for this extended list of targets provides a wider screen of potential molecular toxicity. By constructing binary classification DNNs for these targets we can establish their importance in the prediction of MIEs, investigate their working and better understand their predictions, and both compare them to other methods of prediction and consider how these approaches can work together to improve their predictive power and confidence.

Methods

Data set

Data for 79 pharmacologically important biological targets were extracted from the publicly available databases ChEMBL[40] and ToxCast[41] (Table 1). These targets are a subset of those used in our previous work[38] including targets published by Bowes et al.[37] and Sipes et al.[39] These targets were chosen as they provide valuable toxicological information for risk assessment. ChEMBL and ToxCast were combined to provide a relatively balanced dataset with more than 1000 chemicals per target for model construction and evaluation, an amount that was found to be required for DNN training. In total 144 109 active and 141 796 inactive unique compound–target relationships were obtained, for a total of 285 905 and a positive data percentage of 50.4%. Across the 79 targets this equates to an average of approximately 3619 data points per target. Imbalanced datasets cause difficulties for machine learning algorithms,[4,12,13,15] and developing a balanced dataset is a key advantage when constructing models.

Table 1. Pharmacological targets analyzed in this work. Data were extracted from ChEMBL version 23 and ToxCast. The total test set was 144 109 actives and 141 796 inactives for a total of 285 905 compounds

| Target | Target gene | Actives | Inactives | Total |
|---|---|---|---|---|
| Acetylcholinesterase | AChE | 2611 | 1964 | 4575 |
| Adenosine A2a receptor | ADORA2A | 3943 | 2082 | 6025 |
| Alpha-2a adrenergic receptor | ADRA2A | 842 | 1013 | 1855 |
| Androgen receptor | AR | 2637 | 7283 | 9920 |
| Beta-1 adrenergic receptor | ADRB1 | 1260 | 1080 | 2340 |
| Beta-2 adrenergic receptor | ADRB2 | 1943 | 2012 | 3955 |
| Delta opioid receptor | OPRD1 | 3006 | 1219 | 4225 |
| Dopamine D1 receptor | DRD1 | 1350 | 1990 | 3340 |
| Dopamine D2 receptor | DRD2 | 5694 | 1136 | 6830 |
| Dopamine transporter | SLC6A3 | 2509 | 1916 | 4425 |
| Endothelin receptor ET-A | EDNRA | 1285 | 1150 | 2435 |
| Glucocorticoid receptor | NR3C1 | 3018 | 6972 | 9990 |
| hERG | KCNH2 | 4895 | 3245 | 8140 |
| Histamine H1 receptor | HRH1 | 1275 | 1105 | 2380 |
| Mu opioid receptor | OPRM1 | 3610 | 2305 | 5915 |
| Muscarinic acetylcholine receptor M1 | CHRM1 | 2014 | 1241 | 3255 |
| Muscarinic acetylcholine receptor M2 | CHRM2 | 1633 | 2032 | 3665 |
| Muscarinic acetylcholine receptor M3 | CHRM3 | 1537 | 1113 | 2650 |
| Norepinephrine transporter | SLC6A2 | 2910 | 1940 | 4850 |
| Serotonin 2a (5-HT2a) receptor | HTR2A | 3757 | 1033 | 4790 |
| Serotonin 3a (5-HT3a) receptor | HTR3A | 451 | 1054 | 1505 |
| Serotonin transporter | SLC6A4 | 4041 | 1134 | 5175 |
| Tyrosine-protein kinase LCK | LCK | 1732 | 523 | 2255 |
| Vasopressin V1a receptor | AVPR1A | 619 | 1056 | 1675 |
| Type-1 angiotensin II receptor | AGTR1 | 806 | 1179 | 1985 |
| RAC-alpha serine/threonine-protein kinase | AKT1 | 2765 | 1220 | 3985 |
| Beta-secretase 1 | BACE1 | 6016 | 2604 | 8620 |
| Cholinesterase | BCHE | 1400 | 2145 | 3545 |
| Caspase-1 | CASP1 | 1369 | 3196 | 4565 |
| Caspase-3 | CASP3 | 1177 | 1828 | 3005 |
| Caspase-8 | CASP8 | 330 | 1130 | 1460 |
| Muscarinic acetylcholine receptor M5 | CHRM5 | 679 | 1081 | 1760 |
| Inhibitor of nuclear factor kappa-B kinase subunit alpha | CHUK | 316 | 1069 | 1385 |
| Macrophage colony-stimulating factor 1 receptor | CSF1R | 1336 | 1049 | 2385 |
| Casein kinase I isoform delta | CSNK1D | 708 | 1027 | 1735 |
| Endothelin B receptor | EDNRB | 809 | 1236 | 2045 |
| Neutrophil elastase | ELANE | 2134 | 1371 | 3505 |
| Ephrin type-A receptor 2 | EPHA2 | 528 | 1102 | 1630 |
| Fibroblast growth factor receptor 1 | FGFR1 | 2163 | 1207 | 3370 |
| Peptidyl-prolyl cis-trans isomerase | FKBP1A | 354 | 1006 | 1360 |
| Vascular endothelial growth factor receptor 1 | FLT1 | 1088 | 2077 | 3165 |
| Vascular endothelial growth factor receptor 3 | FLT4 | 674 | 1081 | 1755 |
| Tyrosine-protein kinase FYN | FYN | 420 | 1075 | 1495 |
| Glycogen synthase kinase-3 beta | GSK3B | 2549 | 1256 | 3805 |
| Histone deacetylase 3 | HDAC3 | 1051 | 1139 | 2190 |
| Insulin-like growth factor 1 receptor | IGF1R | 2483 | 1132 | 3615 |
| Insulin receptor | INSR | 887 | 1093 | 1980 |
| Vascular endothelial growth factor receptor 2 | KDR | 7816 | 1579 | 9395 |
| Leukotriene B4 receptor 1 | LTB4R | 350 | 1030 | 1380 |
| Tyrosine-protein kinase Lyn | LYN | 454 | 1046 | 1500 |
| Mitogen-activated protein kinase 1 | MAPK1 | 6209 | 11 076 | 17 285 |
| Mitogen-activated protein kinase 9 | MAPK9 | 1227 | 1088 | 2315 |
| MAP kinase-activated protein kinase 2 | MAPKAPK2 | 829 | 1156 | 1985 |
| Hepatocyte growth factor receptor | MET | 2871 | 1144 | 4015 |
| Matrix metalloproteinase-13 | MMP13 | 2388 | 1112 | 3500 |
| Matrix metalloproteinase-2 | MMP2 | 2938 | 1677 | 4615 |
| Matrix metalloproteinase-3 | MMP3 | 1759 | 1036 | 2795 |
| Matrix metalloproteinase-9 | MMP9 | 2582 | 1848 | 4430 |
| Serine/threonine-protein kinase NEK2 | NEK2 | 298 | 1057 | 1355 |
| P2Y purinoceptor 1 | P2RY1 | 560 | 1100 | 1660 |
| Serine/threonine-protein kinase PAK 4 | PAK4 | 380 | 1100 | 1480 |
| Phosphodiesterase 4A | PDE4A | 653 | 1017 | 1670 |
| Phosphodiesterase 5A | PDE5A | 1551 | 1174 | 2725 |
| Phosphatidylinositol 4,5-bisphosphate 3-kinase catalytic subunit alpha | PIK3CA | 4724 | 2086 | 6810 |
| Peroxisome proliferator-activated receptor gamma | PPARG | 4362 | 7283 | 11 645 |
| Protein tyrosine phosphatase non-receptor type 1 | PTPN1 | 1471 | 2179 | 3650 |
| Protein tyrosine phosphatase non-receptor type 11 | PTPN11 | 354 | 1211 | 1565 |
| Protein tyrosine phosphatase non-receptor type 2 | PTPN2 | 339 | 1206 | 1545 |
| RAF proto-oncogene serine/threonine-protein kinase | RAF1 | 1351 | 1084 | 2435 |
| Retinoic acid receptor alpha | RARA | 356 | 3249 | 3605 |
| Retinoic acid receptor beta | RARB | 298 | 3347 | 3645 |
| Rho-associated coiled-coil-containing protein kinase I | ROCK1 | 1293 | 1117 | 2410 |
| Ribosomal protein S6 kinase alpha-5 | RPS6KA5 | 224 | 1036 | 1260 |
| NAD-dependent protein deacetylase sirtuin-2 | SIRT2 | 361 | 1284 | 1645 |
| NAD-dependent protein deacetylase sirtuin-3 | SIRT3 | 151 | 1074 | 1225 |
| Proto-oncogene tyrosine-protein kinase Src | SRC | 2704 | 1531 | 4235 |
| Substance-K receptor | TACR2 | 876 | 1914 | 2790 |
| Thromboxane A2 receptor | TBXA2R | 978 | 1922 | 2900 |
| Tyrosine-protein kinase receptor TEK | TEK | 788 | 1132 | 1920 |

ChEMBL (https://www.ebi.ac.uk/chembl/, version 23, data collected April 2018),[40] contains more than a million annotated compounds comprising over twelve million bioactivities covering in excess of 10 000 targets, all abstracted from the primary scientific literature.[42] Compounds with a confidence score of 8 or 9 and with reported activities (Ki/Kd/IC50/EC50) less than, or equal to, 10 μM against human protein targets were treated as binders and those with activity greater than 10 μM treated as non-binders. These cutoffs were chosen to provide chemicals with a pharmacologically relevant activity at a specific, well-defined, human target. A cut-off of 10 μM ensures that the compounds have a good degree of biological activity and represents a trade-off between activity and dataset size. A confidence score of 8 represents the assignment of homologous single proteins, and 9 direct single protein interactions.[40] ToxCast is a high throughput screening library of over nine thousand compounds tested across a thousand assays (https://www.epa.gov/chemical-research/toxicity-forecasting, data collected April 2018).[41] Data were extracted using ToxCast's in-built binary activity assignments[43] and combined with the ChEMBL data. Duplicate data points were removed. This was performed on the molecular structure of each chemical based on its atomic connectivity once salt counterions have been stripped, resulting in different tautomers and enantiomers being treated as different data points. Where chemicals have contrasting experimental datapoints from ChEMBL and ToxCast, the ChEMBL data value was used. All experimental positive compounds are binders irrespective of agonistic and antagonistic activity.
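The labelling and merging rules described above can be summarised in a short sketch. It is illustrative only and is not the authors' code: the function names, the `(structure_key, target)` dictionary key (standing in for a connectivity-based identifier after salt stripping) and the use of micromolar floats are assumptions made here, and ToxCast labels are taken as already binary.

```python
from typing import Dict, Optional, Tuple

ACTIVITY_CUTOFF_UM = 10.0  # Ki/Kd/IC50/EC50 <= 10 uM is treated as a binder
MIN_CONFIDENCE = 8         # ChEMBL confidence scores of 8 or 9 are retained

Key = Tuple[str, str]      # hypothetical (structure_key, target) pair


def label_chembl(activity_um: float, confidence: int) -> Optional[int]:
    """Return 1 (binder), 0 (non-binder) or None if the record is excluded."""
    if confidence < MIN_CONFIDENCE:
        return None
    return 1 if activity_um <= ACTIVITY_CUTOFF_UM else 0


def merge_sources(chembl: Dict[Key, int], toxcast: Dict[Key, int]) -> Dict[Key, int]:
    """Combine both sources; duplicates collapse and ChEMBL wins on conflict."""
    merged = dict(toxcast)
    merged.update(chembl)  # ChEMBL labels overwrite contrasting ToxCast labels
    return merged
```

Keying on the structure and target together is what makes duplicate removal and the ChEMBL-over-ToxCast precedence a single dictionary update in this sketch.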

Molecular representation

Chemical fingerprints were generated using RDKit (version 2019.09) for Python.[44] To obtain a good balance between model performance and computational complexity, the type, radius and length of fingerprints must be chosen appropriately. Extended Connectivity Fingerprints (ECFPs) at radius 4 and 6 and several lengths, and MACCS keys were investigated using data for five biological targets and several network architectures. The results of this study are shown in the ESI (Tables S1 and S2†), and the best models were produced using ECFP4 fingerprints at length 10 000.

Cross-validation strategy

The statistical performance of these networks as molecular activity predictors was evaluated using clustered five-fold cross-validation.[13] This should help to alleviate bias in the ChEMBL and ToxCast data where in some cases several molecules are from a structural series, with only small structural differences between them. These molecules can then be easy to predict in the test set if others are placed in the training set, overestimating model performance. Chemical clustering was performed based on chemical fingerprints of the type used as DNN input (ECFP4, length 10 000) using a maximum distance between any 2 clusters of 0.3. This produces a large number of clusters based on the input data. Recombination was then performed to form five clusters of equal size. In each case, one cluster was withheld as a test set and a model trained and validated on the remaining four clusters, with the training and validation sets shuffled and split randomly into 75% training and 25% validation sets.
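A minimal sketch of the fold construction follows. The text does not state how clusters were recombined into five equal groups, so the greedy largest-first assignment used here is an assumption, as are the function names and the fixed random seed; the clustering step itself (distance threshold 0.3 on ECFP4 fingerprints) is taken as already done.

```python
import random
from typing import List, Sequence, Tuple


def recombine_clusters(clusters: Sequence[Sequence[str]], n_folds: int = 5) -> List[List[str]]:
    """Merge whole clusters into n_folds groups of near-equal size.

    Assumed strategy: place the largest remaining cluster in the smallest fold.
    """
    folds: List[List[str]] = [[] for _ in range(n_folds)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds


def split_fold(folds: Sequence[Sequence[str]], test_index: int,
               train_fraction: float = 0.75, seed: int = 0
               ) -> Tuple[List[str], List[str], List[str]]:
    """Hold out one fold as the test set; shuffle the rest into train/validation."""
    test = list(folds[test_index])
    rest = [c for i, fold in enumerate(folds) if i != test_index for c in fold]
    random.Random(seed).shuffle(rest)
    cut = int(round(train_fraction * len(rest)))
    return rest[:cut], rest[cut:], test
```

Because clusters are never split, close structural analogues stay on the same side of the train/test boundary, which is the point of the clustered design.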

DNN architecture

Binary classification DNNs were constructed and trained using TensorFlow in Python 3. For five initial biological targets (AChE, ADORA2A, AR, KCNH2 and SLC6A4) the number of hidden layers was varied as either one or two, and the number of neurons in each hidden layer was also varied (10, 100 or 1000) to establish the best architectures for further biological targets. ReLU (rectified linear unit) activation functions were used to provide non-linearity based on an initial investigation into model performance comparing Sigmoid, ReLU and combinations of both. The results of this study can be found in the ESI (Tables S3 and S4†). Chemical features were input as discussed above and a binary prediction of biological activity at a target was provided as an output. In these initial cases (Table 2), networks with two hidden layers of either 100 or 1000 neurons were found to perform best, judged in order of priority on having (i) the highest test set MCC, (ii) the highest test set ROC-AUC, (iii) the highest validation set MCC. For the remaining biological targets, the networks were trained and compared to establish the best models. These models were then compared to structural alert (SA) and RF models using identical training and test sets to establish which predictors worked best. Further details on the neural networks and validation statistics are given in the ESI.†
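The selection criteria rely on sensitivity, specificity, accuracy and the Matthews correlation coefficient (MCC). As a reminder of how these follow from a confusion matrix, here is a small sketch using the standard definitions; the function name and the choice to return 0.0 when a denominator is zero are assumptions of this example.

```python
import math
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute SE, SP, ACC and MCC from confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # Convention assumed here: undefined ratios are reported as 0.0.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SE": ratio(tp, tp + fn),              # sensitivity
        "SP": ratio(tn, tn + fp),              # specificity
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is a sensible first criterion for targets whose active/inactive ratio is far from 1:1.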

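Table 2. Summary of results for various DNN architectures for several targets in initial investigations. Best performing networks on the test data are highlighted in red. Full results can be found in the ESI (Tables S5–S9). The architecture column shows the number of neurons in each hidden layer

| Target | Architecture | Training SE | Training SP | Training ACC | Training MCC | Training ROC-AUC | Validation SE | Validation SP | Validation ACC | Validation MCC | Validation ROC-AUC | Test SE | Test SP | Test ACC | Test MCC | Test ROC-AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AChE | [10] | 88.7 | 83.9 | 86.6 | 0.726 | 0.93 | 84.9 | 80.7 | 83.1 | 0.655 | 0.90 | 84.2 | 78.9 | 81.9 | 0.631 | 0.89 |
| AChE | [100] | 90.7 | 88.4 | 89.7 | 0.791 | 0.96 | 87.4 | 83.2 | 85.6 | 0.706 | 0.92 | 86.2 | 80.7 | 83.8 | 0.670 | 0.90 |
| AChE | [1000] | 88.0 | 83.7 | 86.2 | 0.718 | 0.93 | 85.5 | 78.0 | 82.3 | 0.637 | 0.89 | 84.4 | 78.8 | 82.0 | 0.632 | 0.88 |
| AChE | [10,10] | 90.7 | 89.7 | 90.3 | 0.802 | 0.96 | 86.1 | 82.9 | 84.7 | 0.688 | 0.92 | 84.3 | 82.4 | 83.5 | 0.664 | 0.90 |
| AChE | [100,100] | 91.5 | 91.3 | 91.4 | 0.826 | 0.97 | 87.1 | 85.2 | 86.3 | 0.721 | 0.92 | 85.0 | 84.2 | 84.7 | 0.689 | 0.91 |
| AChE | [1000,1000] | 95.2 | 96.6 | 95.8 | 0.915 | 0.99 | 88.0 | 86.7 | 87.4 | 0.744 | 0.93 | 84.7 | 84.0 | 84.4 | 0.684 | 0.92 |
| ADORA2A | [10] | 97.6 | 89.9 | 95.0 | 0.888 | 0.98 | 97.2 | 90.2 | 94.7 | 0.884 | 0.98 | 97.2 | 88.5 | 94.2 | 0.871 | 0.97 |
| ADORA2A | [100] | 97.8 | 92.9 | 96.1 | 0.913 | 0.99 | 96.9 | 90.9 | 94.8 | 0.886 | 0.98 | 97.2 | 90.2 | 94.8 | 0.884 | 0.98 |
| ADORA2A | [1000] | 97.5 | 90.7 | 95.2 | 0.893 | 0.98 | 97.2 | 89.5 | 94.6 | 0.879 | 0.98 | 97.0 | 89.1 | 94.3 | 0.872 | 0.97 |
| ADORA2A | [10,10] | 97.8 | 92.7 | 96.0 | 0.911 | 0.99 | 97.6 | 90.6 | 95.3 | 0.893 | 0.98 | 97.0 | 90.0 | 94.6 | 0.880 | 0.98 |
| ADORA2A | [100,100] | 98.1 | 93.7 | 96.6 | 0.924 | 0.99 | 96.8 | 90.8 | 94.8 | 0.883 | 0.98 | 96.9 | 90.5 | 94.7 | 0.881 | 0.98 |
| ADORA2A | [1000,1000] | 99.0 | 77.8 | 91.7 | 0.817 | 1.00 | 97.3 | 92.4 | 95.6 | 0.903 | 0.98 | 96.7 | 91.2 | 94.8 | 0.884 | 0.98 |
| AR | [10] | 58.0 | 99.3 | 88.3 | 0.691 | 0.88 | 59.1 | 98.9 | 88.3 | 0.691 | 0.87 | 55.8 | 99.0 | 87.5 | 0.667 | 0.86 |
| AR | [100] | 69.1 | 98.7 | 90.9 | 0.759 | 0.91 | 64.4 | 98.1 | 89.1 | 0.711 | 0.87 | 64.5 | 98.3 | 89.3 | 0.715 | 0.86 |
| AR | [1000] | 65.0 | 98.6 | 89.7 | 0.727 | 0.89 | 61.6 | 98.2 | 88.5 | 0.693 | 0.86 | 61.5 | 98.3 | 88.6 | 0.695 | 0.86 |
| AR | [10,10] | 67.1 | 99.0 | 90.5 | 0.750 | 0.90 | 62.7 | 98.5 | 89.0 | 0.708 | 0.86 | 61.6 | 98.6 | 88.8 | 0.701 | 0.87 |
| AR | [100,100] | 76.1 | 99.4 | 93.2 | 0.823 | 0.95 | 69.2 | 97.8 | 90.2 | 0.740 | 0.87 | 68.0 | 98.1 | 90.1 | 0.737 | 0.87 |
| AR | [1000,1000] | 73.3 | 99.4 | 92.5 | 0.804 | 0.94 | 65.8 | 97.9 | 89.3 | 0.717 | 0.87 | 64.4 | 98.2 | 89.2 | 0.713 | 0.87 |
| hERG (KCNH2) | [10] | 93.5 | 53.5 | 77.5 | 0.529 | 0.87 | 91.6 | 48.2 | 74.3 | 0.454 | 0.82 | 92.0 | 46.1 | 73.7 | 0.441 | 0.81 |
| hERG (KCNH2) | [100] | 94.1 | 49.9 | 76.4 | 0.508 | 0.86 | 92.2 | 45.8 | 74.2 | 0.443 | 0.81 | 92.9 | 44.1 | 73.4 | 0.438 | 0.80 |
| hERG (KCNH2) | [1000] | 89.7 | 64.3 | 79.7 | 0.568 | 0.87 | 84.6 | 59.5 | 72.7 | 0.458 | 0.82 | 87.0 | 55.0 | 74.2 | 0.450 | 0.81 |
| hERG (KCNH2) | [10,10] | 94.1 | 85.0 | 90.5 | 0.800 | 0.97 | 86.1 | 67.0 | 78.4 | 0.545 | 0.86 | 86.3 | 63.8 | 77.3 | 0.519 | 0.85 |
| hERG (KCNH2) | [100,100] | 96.2 | 90.5 | 93.9 | 0.873 | 0.98 | 84.9 | 69.8 | 78.8 | 0.555 | 0.86 | 85.1 | 65.5 | 77.3 | 0.519 | 0.84 |
| hERG (KCNH2) | [1000,1000] | 95.0 | 87.5 | 92.0 | 0.833 | 0.98 | 84.2 | 66.8 | 77.2 | 0.520 | 0.86 | 83.4 | 65.5 | 76.2 | 0.498 | 0.84 |
| SERT (SLC6A4) | [10] | 99.2 | 72.4 | 93.4 | 0.799 | 0.98 | 99.1 | 66.1 | 91.7 | 0.752 | 0.97 | 99.0 | 67.6 | 92.1 | 0.760 | 0.97 |
| SERT (SLC6A4) | [100] | 99.0 | 89.2 | 96.9 | 0.906 | 0.99 | 98.4 | 83.8 | 95.1 | 0.856 | 0.98 | 98.6 | 83.1 | 95.2 | 0.857 | 0.98 |
| SERT (SLC6A4) | [1000] | 99.2 | 77.2 | 94.4 | 0.831 | 0.98 | 98.8 | 73.8 | 93.3 | 0.797 | 0.97 | 99.1 | 73.9 | 93.5 | 0.805 | 0.97 |
| SERT (SLC6A4) | [10,10] | 99.0 | 89.7 | 97.0 | 0.909 | 0.99 | 98.9 | 82.1 | 95.1 | 0.857 | 0.98 | 98.7 | 83.1 | 95.3 | 0.858 | 0.98 |
| SERT (SLC6A4) | [100,100] | 99.4 | 95.8 | 98.6 | 0.959 | 1.00 | 98.2 | 86.1 | 95.6 | 0.867 | 0.98 | 98.6 | 86.8 | 96.0 | 0.882 | 0.99 |
| SERT (SLC6A4) | [1000,1000] | 99.4 | 98.2 | 99.1 | 0.975 | 1.00 | 98.1 | 91.1 | 96.5 | 0.897 | 0.99 | 98.4 | 90.5 | 96.6 | 0.901 | 0.99 |

SE = sensitivity, SP = specificity, ACC = accuracy, MCC = Matthews correlation coefficient, ROC-AUC = area under receiver operating characteristic curve.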

Neural network activation similarity

We calculate the neural network activation similarity (NNAS) from the response of each node of a trained neural network to the fingerprint of a molecule. For a trained DNN, an input fingerprint, x(0), induces a network activation vector, a, made up of the activations of the nodes in each layer considered; K(L) is the total number of nodes in layer L, and different layers can be combined. The similarity between two compounds' network activation vectors, a1 and a2, is measured by Euclidean similarity, which is derived from the Euclidean distance DE between them, where 0 ≤ NNAS ≤ 1 and n is the total number of nodes considered in the calculation. The NNAS provides a measure of the similarity of molecules which is different to traditional Tanimoto similarity. As an additional comparison point, RF similarity (RFS) was calculated using the Euclidean distance between normalized vectors consisting of the 50 most important physicochemical descriptors identified in RF model construction.[38] This gives an appropriate comparison point based on trained machine learning models, while Tanimoto similarity gives a comparison based on chemical similarity and the DNN model inputs.

All datasets and code used in this project are provided via GitHub (https://github.com/teha2/chemical_toxicology). These, together with the generated models, are available in the University of Cambridge repository (https://doi.org/10.17863/CAM.50429).
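To make the idea concrete, the sketch below collects hidden-layer activations for two inputs and converts their Euclidean distance into a similarity between 0 and 1. It is a hedged illustration rather than the published formula: the mapping `1 / (1 + D_E / sqrt(n))` is an assumption chosen only because it is bounded by 0 and 1 and uses the node count n, and the tiny hand-written network, `hidden_activations` and `activation_similarity` are hypothetical names. The authors' own implementation is in the GitHub repository cited above.

```python
import math
from typing import List, Sequence, Tuple

# One hidden layer = (weight rows, biases); each weight row feeds one node.
Layer = Tuple[Sequence[Sequence[float]], Sequence[float]]


def hidden_activations(x: Sequence[float], layers: Sequence[Layer]) -> List[float]:
    """Concatenate the ReLU activations of every hidden layer for input x."""
    activations: List[float] = []
    current = list(x)
    for weights, biases in layers:
        current = [
            max(0.0, sum(w * v for w, v in zip(row, current)) + b)
            for row, b in zip(weights, biases)
        ]
        activations.extend(current)
    return activations


def euclidean_distance(a1: Sequence[float], a2: Sequence[float]) -> float:
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a1, a2)))


def activation_similarity(a1: Sequence[float], a2: Sequence[float]) -> float:
    """ASSUMED mapping from distance to similarity, not the published equation."""
    n = len(a1)  # total number of nodes considered
    return 1.0 / (1.0 + euclidean_distance(a1, a2) / math.sqrt(n))
```

Identical activation vectors give a similarity of exactly 1 under this mapping, and the value falls towards 0 as the activation patterns diverge.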

Results and discussion

Statistical performance for the DNNs constructed is included in the ESI.† A summary of the results is shown in Table 3. On average the models show high levels of predictivity, with test accuracy of 92.2 ± 4.2%, test MCC of 0.814 ± 0.093 and test ROC-AUC 0.96 ± 0.04. This is a high level of performance considering several machine learning algorithms struggle to achieve accuracy values above 90% in binary classification tasks,[1-5] although the difficulty of the task must be considered when comparing model performance. The models also do not show excessive levels of overfitting, as can be a problem with DNNs. This can be assessed by considering the differences between model performance on the training set and the validation/test sets. In this case, the differences are 3.3%, 0.079 and 0.03 for the validation set and 3.6%, 0.087 and 0.03 for the test set for the accuracy, MCC and ROC-AUC respectively, which can be considered modest.

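Table 3. Average model performance and standard deviation (SD) for the best performing DNN models at each target. Full results can be found in the ESI (Table S10)

| Statistic | Training SE | Training SP | Training ACC | Training MCC | Training ROC-AUC | Validation SE | Validation SP | Validation ACC | Validation MCC | Validation ROC-AUC | Test SE | Test SP | Test ACC | Test MCC | Test ROC-AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Average | 92.1 | 96.5 | 95.8 | 0.901 | 0.99 | 86.9 | 93.2 | 92.5 | 0.822 | 0.96 | 86.2 | 92.9 | 92.2 | 0.814 | 0.96 |
| SD | 8.8 | 4.2 | 3.1 | 0.069 | 0.02 | 11.7 | 5.9 | 4.1 | 0.091 | 0.04 | 12.1 | 6.5 | 4.2 | 0.093 | 0.04 |

SE = sensitivity, SP = specificity, ACC = accuracy, MCC = Matthews correlation coefficient, ROC-AUC = area under receiver operating characteristic curve.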

Individual model performance was also considered in relation to the dataset size (Fig. 1). Test MCC does not appear to increase as dataset size increases, as one might expect, but the standard deviation across the five-fold clustered cross-validation, shown by the error bars, does appear to decrease. This suggests that larger datasets will provide more consistent models, even if their performance is similar to that of models built on around 2000–4000 data points. Notable in Fig. 1 are the labelled data points which appear to show low model MCC for quite large datasets. These are the targets KCNH2 and MAPK1, which were identified in our previous publication as challenging classifications in this dataset.[38]
Fig. 1

Test MCC vs. total number of compounds for all biological targets. Error bars shown are standard deviations across the five-fold clustered cross-validation.

Positive probability values were also calculated for AR binders using the Softmax function on an optimal trained DNN (Fig. 2). This functionality allows a prediction made by the DNN to be accompanied by a percentage indicating how confident the method is that the chemical is an experimental positive. This provides an estimation of the quality of a given binary prediction, with higher probabilities indicating higher confidence in a prediction. For example, the confidence in a positive prediction with positive probability greater than 0.9 at the AR increases greatly, with almost 99.5% of validation set compounds in this area being experimental positives in Fig. 2. This helps to increase confidence in the method for a safety science decision. Positive probability predictions around 0.5 can be considered untrustworthy and followed up with further calculations or experimental testing, and the threshold for model positive prediction assignment can also be adjusted depending on the model's purpose. Table 4 shows a summary of two such example cases, where the threshold during model recall is changed to 0.1 and 0.9 to provide better predictivity of active or inactive chemicals respectively. The 0.1 threshold may be of more use in a screening process when you are considering which chemicals to advance during product development where no pharmacological effects are desired, as any negative predictions made are more likely to be correct. If you are prioritizing chemicals for experimental testing and want to increase the likelihood of finding an active at a particular MIE, the 0.9 threshold may be more useful for the inverse reason. More extensive results in this study are included in the ESI.†
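The threshold adjustment can be sketched as follows. This assumes a network with two output values (inactive and active) passed through a softmax, and a decision rule of "positive if the probability is at or above the threshold"; both the two-output layout and the `>=` convention are assumptions of this example, as are the function names.

```python
import math


def positive_probability(logit_negative: float, logit_positive: float) -> float:
    """Softmax probability of the positive class from two output values."""
    m = max(logit_negative, logit_positive)  # subtract the max for numerical stability
    e_neg = math.exp(logit_negative - m)
    e_pos = math.exp(logit_positive - m)
    return e_pos / (e_neg + e_pos)


def classify(p_positive: float, threshold: float = 0.5) -> int:
    """Return 1 (predicted active) if p_positive reaches the threshold, else 0."""
    return 1 if p_positive >= threshold else 0
```

With a threshold of 0.1 almost any sign of activity is flagged, so the remaining negatives are more trustworthy; with 0.9 only high-confidence actives are flagged, which matches the sensitivity/specificity trade-off reported in Table 4.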
Fig. 2

Positive probability curve showing compounds tested at the ADORA2A. Positive probability is the probability a compound is active at the ADORA2A calculated by a trained DNN using the Softmax function. Percentages in each 10% section indicate the percentage of compounds in that section which are experimental positives.

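Table 4. Average model performance and standard deviation (SD) for the best performing DNN models at each target for the validation data sets when adjusted activity thresholds of 0.1 and 0.9 were applied. Full results can be found in the ESI (Tables S11 and S12)

| Statistic | Threshold 0.1 SE | Threshold 0.1 SP | Threshold 0.1 ACC | Threshold 0.1 MCC | Threshold 0.9 SE | Threshold 0.9 SP | Threshold 0.9 ACC | Threshold 0.9 MCC |
|---|---|---|---|---|---|---|---|---|
| Average | 97.0 | 59.9 | 79.9 | 0.619 | 53.5 | 98.8 | 81.3 | 0.610 |
| SD | 4.2 | 21.3 | 10.2 | 0.137 | 22.8 | 1.6 | 6.6 | 0.125 |

SE = sensitivity, SP = specificity, ACC = accuracy, MCC = Matthews correlation coefficient.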

The best performing DNN for each biological target was compared to models previously generated using SAs and RFs.[38] In our previous study, the SAs were constructed using structures obtained from a maximal common substructure algorithm, with Bayesian statistics used to iteratively select the best alerts. The RF models were based on 200 physicochemical descriptors calculated in RDKit and modelled in sklearn. To ensure the fairest possible comparison, the DNNs were retrained using the same training and test data as the other two methods. Average results across the three methods are compared in Table 5, with the distributions of these comparisons shown in Fig. 3. Further comparisons are available in Table S13 and Fig. S1 and S2 in the ESI.† Overall, the DNNs show a statistically significant increase in MCC over the SAs and RFs in a P-value test (α = 0.05). On average the accuracy of models increases by 1.7 percentage points compared to SAs and 0.67 percentage points compared to RFs. The MCC increases by 0.042 and 0.018 respectively. This represents a notable improvement, as the proportion of inaccurate predictions decreases from 8.9% in the SA model to 7.2% in the DNN model, a relative decrease of 19%. For the RF models, the relative decrease is 9%, from 7.8% incorrect predictions to 7.2%. Fig. 3 shows that the distributions of model accuracies and MCCs overlap between the methods, with the DNN predictions being the highest performing overall. Despite this overlap the DNNs do well in direct model comparisons, with 69 DNN models showing higher test accuracies and 71 higher test MCC values compared to SAs, and 59 higher test accuracies and 62 higher test MCC values compared to RFs. Of the 632 comparisons in model performance shown in Table S13,† the DNNs perform better 459 times (73%).
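The relative reduction in incorrect predictions quoted above is simple arithmetic on the average test accuracies in Table 5. A minimal sketch (the function name is an assumption of this example):

```python
def relative_error_reduction(acc_baseline: float, acc_new: float) -> float:
    """Fractional drop in error rate when accuracy (in %) rises from baseline to new."""
    err_baseline = 100.0 - acc_baseline
    err_new = 100.0 - acc_new
    return (err_baseline - err_new) / err_baseline


# SA -> DNN: error falls from 8.9% to 7.2%, i.e. about 19% fewer errors.
sa_to_dnn = relative_error_reduction(91.1, 92.8)

# RF -> DNN using the rounded table averages (92.2 -> 92.8).
rf_to_dnn = relative_error_reduction(92.2, 92.8)
```

With the rounded averages the RF comparison comes out slightly below the 9% quoted in the text; the quoted figure presumably reflects the unrounded accuracy difference of 0.67 percentage points.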

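Table 5. Average model performance and standard deviation (SD) for the structural alert (SA), random forest (RF) and deep neural network (DNN) models at each target on a consistent training/test set split. Full comparisons can be found in the ESI (Table S13)

| Model | Statistic | Training SE | Training SP | Training ACC | Training MCC | Test SE | Test SP | Test ACC | Test MCC |
|---|---|---|---|---|---|---|---|---|---|
| SA | Average | 91.0 | 95.8 | 95.0 | 0.882 | 84.1 | 93.5 | 91.1 | 0.790 |
| SA | SD | 7.4 | 3.5 | 2.3 | 0.050 | 11.6 | 4.6 | 4.2 | 0.096 |
| RF | Average | 94.9 | 94.7 | 96.4 | 0.915 | 89.0 | 90.4 | 92.2 | 0.815 |
| RF | SD | 9.9 | 5.7 | 3.1 | 0.072 | 11.6 | 8.1 | 4.0 | 0.091 |
| DNN | Average | 92.3 | 96.8 | 95.9 | 0.904 | 87.9 | 93.6 | 92.8 | 0.832 |
| DNN | SD | 8.8 | 3.0 | 3.1 | 0.066 | 10.4 | 5.9 | 4.0 | 0.089 |

SE = sensitivity, SP = specificity, ACC = accuracy, MCC = Matthews correlation coefficient.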

Fig. 3

Histograms showing the distribution of test set model performance across the three modelling approaches, structural alerts (SAs), random forests (RFs) and deep neural networks (DNNs).

Fig. 5

Amiodarone, a typical KCNH2 binder, and its five most similar neighbours as measured by NNAS (A–E), Tanimoto similarity (F–J) and RFS (K–O). Starred network activation similarity values have been rounded to 1.000 but do not represent exact matches.

The average test set results for these models can also be considered as a comparison between cross-validating using a clustered test set and a randomly assigned test set. These differences are shown in Table 6 and show a small decrease in statistical performance when clustering is used, as is to be expected. The first two decimal places of the ROC-AUC values do not change.

Table 6. Average statistical performance for models with test sets generated using chemical clustering and generated randomly. Clustered statistics are taken from Table 3 and random statistics generated from Table 5. The difference shown is the change in performance when moving from random to clustering

| Test set | ACC | MCC | ROC-AUC |
|---|---|---|---|
| Clustered test set | 92.2 | 0.814 | 0.96 |
| Random test set | 92.8 | 0.832 | 0.96 |
| Difference | −0.6 | −0.018 | 0 |

ACC = accuracy, MCC = Matthews correlation coefficient, ROC-AUC = area under receiver operating characteristic curve.

Finally, NNAS calculations were made, considering how signals propagate through the hidden layers of a DNN when different chemical fingerprints are introduced. This approach is analogous to the use of feature-space distance or latent-space distance, which has been used to quantitatively assess DNN uncertainty.[45] These neural network activation similarities between chemicals are considered as potential guidance for read-across in toxicity risk assessment. NNAS calculations were carried out on the highest performing trained DNNs that we compared to the SA and RF models, using all nodes in all hidden layers of those networks. Typical binders from the literature that were not in the model training sets were identified and used for this task. Andarine, a typical AR binder,[46] amiodarone, a typical KCNH2 binder,[47] and 3-amino-4-(2-dimethylaminomethylphenylsulfanyl)-benzonitrile (DASB), a typical SLC6A4 binder,[48] had their activities predicted by the networks, and the training set compounds with the highest NNAS, Tanimoto similarity[49] and RFS were identified. In all three cases, the DNN and the RF correctly predicted the activity of the typical binder. Generally, NNAS values are higher than Tanimoto or RF similarities and have a narrower range. Both Tanimoto and RF similarities typically range from zero to a maximum between 0.2 and 0.6, whereas network similarities typically range between 0.8 and 1. In most cases, the compounds ranked most similar by one metric are different from those ranked most similar by the others. This is reflected throughout the dataset: if the compounds are ranked by their network similarity and by their Tanimoto similarity, the RMSE between these ranks is always found to be greater than 1000 places. The relationship between Tanimoto and NNAS values is shown for all three case studies in Fig. 4. These plots show no strong correlation between the two similarity values. This lack of clear correlation indicates that the networks are not simply memorizing the fingerprint bit strings; they are learning which bits and combinations of bits are important for chemical activity.
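The rank comparison mentioned above can be sketched as follows: each training compound is ranked once by NNAS and once by Tanimoto similarity to the query, and the root-mean-square difference between the two ranks is reported. The function names, the rank-1-is-most-similar convention and the handling of ties (input order) are assumptions of this example.

```python
import math
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank compounds by descending similarity; rank 1 is the most similar."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def rank_rmse(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Root-mean-square difference in rank between two similarity measures.

    Both dictionaries are assumed to cover the same set of compounds.
    """
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    squared = [(rank_a[name] - rank_b[name]) ** 2 for name in rank_a]
    return math.sqrt(sum(squared) / len(squared))
```

A value of 0 would mean both measures order the training set identically, so an RMSE above 1000 places indicates that the two measures pick out largely different neighbours.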
Fig. 4

A graph showing the relationship between Tanimoto similarity and NNAS values for the three typical binders andarine, amiodarone and DASB.

Fig. 5 shows amiodarone, an experimentally active KCNH2 binder, and its five most similar chemicals as measured by NNAS (A–E), Tanimoto similarity (F–J) and RFS (K–O). In this case, the Tanimoto similar compounds all show relatively low similarity (<0.3) and only F appears similar enough to consider making a read-across. The NNAS calculation may be more useful in this case, identifying more experimentally active compounds (4 vs. 3) and more compounds with basic nitrogen atoms attached by linkers to aromatic rings (4 vs. 2), a structural feature associated with biological activity at the KCNH2 target.[50] The RFS calculation identified 5 experimental positive compounds which all contain the basic nitrogen attached to aromatic ring motif, but they all appear to come from a single chemical series – perhaps biasing this result. This example suggests the DNN classifier is learning the right kind of features in this difficult classification task. Fig. 6 shows DASB, an SLC6A4 active chemical, and its similar compounds using the same lettering system. All identified chemicals are active in this case. The Tanimoto and RFS most similar chemicals show a high level of structural similarity to DASB, making them appear as good read-across candidates. They do miss a chemical feature picked up by the NNAS chemicals in a pi-bonding group in the lower left-hand corner, which is present in A. D also shows a similar feature attached to an aromatic ring. Features such as these can be critical to biological activity, and, interestingly, the most network similar compounds pick up these features over the arguably more similar features in F–O.
Fig. 6

DASB, a typical SLC6A4 binder, and its five most similar neighbours as measured by NNAS (A–E), Tanimoto similarity (F–J) and RFS (K–O).

Finally, Fig. 7 shows Andarine, an experimentally active chemical at the AR, and its most similar compounds. The top five in both the NNAS and Tanimoto lists are all experimentally active, while only two of the most similar RFS compounds are. No chemical appears in the top five for more than one similarity measure, showing that the measures capture different types of similarity. Chemicals F, G and H are highly similar to the right-hand side of Andarine, while I and J appear to have more overall similarity with Andarine's shape. The RFS compounds show flexibly attached aromatic rings, trifluoromethyl groups and nitro groups. NNAS compounds show trifluoromethyl groups in A, B and D and steroidal structures in C and D. Chemical B, in particular, is an interesting case as it shows relatively high Tanimoto similarity (0.364) while appearing quite different from Andarine. In this case, the most Tanimoto-similar chemicals are probably the most useful for identifying a read-across relationship.
Fig. 7

Andarine, a typical AR binder, and its five most similar neighbours as measured by NNAS (A–E), Tanimoto similarity (F–J) and RFS (K–O). Starred network activation similarity values have been rounded to 1.000 but do not represent exact matches.
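
The observation that no chemical appears in more than one top-five list can be checked mechanically once each measure has scored the same library against the reference. The sketch below shows one way this could be done; the function names, chemical identifiers and similarity scores are hypothetical placeholders, not values from this work.

```python
from itertools import combinations


def top_k(scores, k=5):
    """Return the k chemical IDs with the highest similarity to the reference."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [chem_id for chem_id, _ in ranked[:k]]


def shared_neighbours(rankings):
    """Map each pair of measures to the set of IDs that both place in their top-k lists."""
    return {
        (m1, m2): set(rankings[m1]) & set(rankings[m2])
        for m1, m2 in combinations(sorted(rankings), 2)
    }


# Hypothetical similarity scores of six library chemicals to one reference chemical.
scores_by_measure = {
    "NNAS":     {"c1": 0.99, "c2": 0.97, "c3": 0.40, "c4": 0.35, "c5": 0.20, "c6": 0.10},
    "Tanimoto": {"c1": 0.15, "c2": 0.12, "c3": 0.61, "c4": 0.58, "c5": 0.22, "c6": 0.18},
    "RFS":      {"c1": 0.30, "c2": 0.25, "c3": 0.28, "c4": 0.20, "c5": 0.85, "c6": 0.80},
}

# Top two are used here only because the toy library is small.
rankings = {name: top_k(scores, k=2) for name, scores in scores_by_measure.items()}
for pair, shared in shared_neighbours(rankings).items():
    print(pair, "share", sorted(shared) if shared else "no chemicals")
```

An empty intersection for every pair of measures corresponds to the situation described for Fig. 7, where each measure proposes a different set of read-across candidates.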

Across the three case studies, the concordance between reference chemical experimental activity and analogue activities was calculated. When considering the five most similar analogues, NNAS, Tanimoto similarity and RFS show similar concordances (93%, 87% and 80%, respectively). When more of the most similar chemicals are considered, NNAS shows an increasingly favourable concordance relative to the other two measures (98%, 72% and 60% for the twenty, and 98%, 65% and 76% for the fifty most similar chemicals). The RFS particularly struggled in the Andarine task, identifying only 17 experimental actives in its top 50 most similar compounds, but should otherwise be considered a potentially useful metric when RF classifiers are used. While the most Tanimoto-similar chemicals can be useful for read-across in a more traditional sense, when more cases are considered NNAS is far more useful and provides insight into the otherwise “black box” NN predictions. All three similarity measures provide potentially useful molecular candidates for read-across, and it is useful that they each provide different molecules, giving more options for toxicologists using NNs and RFs in decision making. All three similarity measures should therefore be used together to gain confidence in in silico predictions in predictive safety assessment.
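
The top-five concordances quoted above can be reproduced from the counts of experimentally active analogues reported for Figs. 5–7. The sketch below assumes that concordance is the pooled percentage of analogues sharing the (active) label of their reference chemical; that definition is an assumption made for this illustration, chosen because it matches the reported values, and the variable and function names are hypothetical.

```python
# Experimentally active analogues among the five most similar compounds, as reported
# in the text for each reference chemical (all three reference chemicals are active).
active_in_top5 = {
    "NNAS":     {"amiodarone": 4, "DASB": 5, "andarine": 5},
    "Tanimoto": {"amiodarone": 3, "DASB": 5, "andarine": 5},
    "RFS":      {"amiodarone": 5, "DASB": 5, "andarine": 2},
}


def concordance(active_counts, k):
    """Pooled percentage of top-k analogues whose activity matches the active reference."""
    total_analogues = k * len(active_counts)
    return 100.0 * sum(active_counts.values()) / total_analogues


top5_concordance = {
    measure: round(concordance(counts, k=5))
    for measure, counts in active_in_top5.items()
}
print(top5_concordance)
```

Under this assumption the pooled fractions are 14/15, 13/15 and 12/15, which round to the 93%, 87% and 80% given in the text.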

Conclusions

In this work, we have constructed DNNs to predict binary activity at human MIEs and developed a new similarity measure, NNAS, which increases confidence in the predictions. Key advantages of this work include: the development of a large number of models covering many human MIEs, allowing for predictions across a wide expanse of human toxicology; the use of a balanced dataset with an almost equal number of active and inactive chemicals; high-quality predictions with a high level of statistical performance, which outperforms SAs and RF on identical prediction tasks; and the introduction of NNAS for use in toxicity safety evaluation to increase confidence in the predictions made. The developed networks use chemical fingerprint inputs to learn and correctly classify molecules as binders or non-binders. The classifiers show high performance, with an average ROC-AUC value of 0.96 ± 0.04 in clustered five-fold cross-validation. The DNNs can also be used to provide positive probability values associated with each prediction, and these values can provide additional information and confidence values that a risk assessor can use in a safety evaluation, including the ability to adjust the threshold for activity depending on the prediction task. These DNNs have been compared against SA and RF models and show a statistically significant (α = 0.05) improvement in MCC. We introduce a new measure of similarity, NNAS, which can be used for read-across in a safety evaluation decision. These values provide information on how the DNN evaluates the test compound, providing information that Tanimoto similarity alone does not and considerably improving our ability to predict adverse outcomes using computational methods. The NNAS improves our understanding of these highly predictive DNNs, and so can contribute to the safety evaluation of new chemicals.
These powerful machine learning approaches can be used alongside other computational methods, such as QSARs, SAs or expert systems, to provide additional confidence when a consensus is reached among predictions. Greater impact for these in silico methods will help to reduce the reliance on animal experiments, in line with 3Rs objectives.[51,52] These methods can also be considered in other predictive tasks using DNNs, particularly those using chemical inputs. Neural networks have the potential to assist in a number of chemical-based tasks, including biological activity prediction, choice of chemical reactants or solvents, and spectra interpretation. As such, understanding gained through methods such as this can assist in the discovery of new chemical reactions, the streamlining of chemical synthesis, and a reduction in the cost associated with experimental discovery.
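
To illustrate how an adjustable activity threshold on the predicted positive probabilities interacts with the MCC, the sketch below classifies a handful of compounds at two thresholds and computes the MCC from the resulting confusion counts. The probabilities, labels, threshold values and function names are hypothetical and are not taken from the models in this work; only the MCC formula itself is standard.

```python
from math import sqrt


def confusion_counts(probabilities, labels, threshold=0.5):
    """Count TP, FP, TN and FN when a probability >= threshold is called active."""
    tp = fp = tn = fn = 0
    for prob, is_active in zip(probabilities, labels):
        predicted_active = prob >= threshold
        if predicted_active and is_active:
            tp += 1
        elif predicted_active and not is_active:
            fp += 1
        elif not predicted_active and is_active:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Hypothetical positive probabilities and experimental labels (1 = active, 0 = inactive).
probabilities = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

# A lower threshold flags more compounds as active, trading false negatives for false positives.
for threshold in (0.5, 0.35):
    counts = confusion_counts(probabilities, labels, threshold)
    print(f"threshold={threshold}: TP, FP, TN, FN = {counts}, MCC = {mcc(*counts):.3f}")
```

In a safety evaluation, a risk assessor might prefer a lower threshold when missing an active compound is more costly than a false alarm; the appropriate value depends on the prediction task.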

Author contributions

The manuscript was written through the contributions of all authors. All authors have given approval to the final version of the manuscript.

Funding sources

The authors acknowledge the financial support of Unilever.

Data statement

According to the University of Cambridge data management policy, all the data used in this paper are available either in the paper or in the SI. A copy of the data is also available in the University of Cambridge repository at: https://doi.org/10.17863/CAM.50429.

Conflicts of interest

There are no conflicts to declare.

Abbreviations

Accuracy; adverse outcome pathways; 3-amino-4-(2-dimethylaminomethylphenylsulfanyl)-benzonitrile (DASB); deep neural network (DNN); extended connectivity fingerprint; false negative; false positive; Matthews correlation coefficient (MCC); molecular initiating event (MIE); neural network activation similarity (NNAS); quantitative structure–activity relationship (QSAR); rectified linear unit; random forest (RF); random forest similarity (RFS); area under receiver operating curve (ROC-AUC); structural alert (SA); sensitivity; serotonin transporter; specificity; true negative; true positive.