Akshai P Sreenivasan, Philip J Harrison, Wesley Schaal, Damian J Matuszewski, Kim Kultima, Ola Spjuth.
Abstract
Comparing chemical structures to infer protein targets and functions is a common approach, but basing comparisons on chemical similarity alone can be misleading. Here we present a methodology for predicting target protein clusters using deep neural networks. The model is trained on clusters of compounds based on similarities calculated from combined compound-protein and protein-protein interaction data using a network topology approach. We compare several deep learning architectures, including both convolutional and recurrent neural networks. The best-performing method, the recurrent neural network architecture MolPMoFiT, achieved an F1 score approaching 0.9 on a held-out test set of 8907 compounds. In addition, an in-depth analysis of eleven well-studied chemical compounds with known functions showed that predictions were justifiable for all but one of the chemicals. Four of the compounds, similar in molecular structure but dissimilar in function, revealed advantages of our method over using chemical similarity alone.
Keywords: Deep learning; Drug discovery; Machine learning; Network topology; Neural networks
Year: 2022 PMID: 35841114 PMCID: PMC9284831 DOI: 10.1186/s13321-022-00622-7
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 8.489
Clustering results of the data using different distance thresholds. The results are shown for both the entire dataset and for clusters with support above 100
| Distance | | 0.001 | 0.005 | 0.01 | 0.05 | 0.1 |
|---|---|---|---|---|---|---|
| Entire dataset | Chemicals | 130259 | 130259 | 130259 | 130259 | 130259 |
| Entire dataset | Clusters | 14057 | 13095 | 12162 | 8293 | 5680 |
| Support greater than 100 | Chemicals | 86443 | 89216 | 93010 | 105856 | 112739 |
| Support greater than 100 | Clusters | 249 | 241 | 231 | 153 | 112 |
Fig. 1. Histogram showing clusters based on their support (using a distance threshold of 0.005). The bars show the number of clusters within the range of support. The yellow and green lines represent the discarded and selected chemicals (support greater than 100), respectively
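The support filter behind the table and histogram above can be illustrated with a short sketch. It assumes each chemical already carries a cluster label from the hierarchical clustering step; the function name `filter_by_support`, the dictionary layout, and the toy labels are hypothetical and not taken from the paper. Only the final ratio uses counts reported in the table.

```python
from collections import Counter


def filter_by_support(labels, min_support=100):
    """Keep chemicals whose cluster has more than `min_support` members.

    `labels` maps a chemical identifier to its cluster label
    (a hypothetical layout chosen for this sketch).
    Returns the selected chemicals and the support of each retained cluster.
    """
    support = Counter(labels.values())
    kept_clusters = {c: n for c, n in support.items() if n > min_support}
    selected = {chem: c for chem, c in labels.items() if c in kept_clusters}
    return selected, kept_clusters


# Toy data (not from the paper): cluster A has 150 members,
# cluster B has exactly 100, and cluster C has 3.
toy_labels = {f"chem{i}": "A" for i in range(150)}
toy_labels.update({f"chem{i}": "B" for i in range(150, 250)})
toy_labels.update({f"chem{i}": "C" for i in range(250, 253)})

selected, kept = filter_by_support(toy_labels)
print(len(selected), sorted(kept))

# Share of chemicals retained at distance threshold 0.005,
# computed from the counts reported in the table (89216 of 130259).
retained_fraction = 89216 / 130259
print(round(retained_fraction, 3))
```

Because the criterion is "support greater than 100", a cluster with exactly 100 members is discarded in this sketch. At the 0.005 threshold, the table's counts imply that roughly two thirds of the chemicals fall in retained clusters.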
Fig. 2. Data pre-processing for CNN and RNN architectures. A An example of SMILES string conversion to a matrix of dimension 42 × (length of SMILES string). This matrix is padded along the y-axis up to a defined maximum length. B (a) Atomwise and (b) SMILES-PE tokenization for the compound aspirin
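To make the pre-processing in Fig. 2 more concrete, here is a minimal sketch of atomwise tokenization and of turning the tokens into a padded feature matrix. Several things are assumptions rather than the authors' implementation: the regular expression is a simplified atomwise pattern, the matrix uses a plain one-hot encoding over the tokens seen in the input (not the 42 features mentioned in the caption), and the names `atomwise_tokenize`, `one_hot_matrix`, and `max_len=30` are illustrative.

```python
import re

# Simplified atomwise pattern (an assumption): bracket atoms, two-letter
# halogens, two-digit ring closures, single-letter atoms, ring digits,
# and bond or branch symbols.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#\-+\\/:.~@*$]")


def atomwise_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    tokens = TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


def one_hot_matrix(tokens, vocab, max_len):
    """Return a len(vocab) x max_len matrix, zero-padded past the last token."""
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len")
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = [[0] * max_len for _ in vocab]
    for pos, tok in enumerate(tokens):
        matrix[index[tok]][pos] = 1
    return matrix


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
tokens = atomwise_tokenize(aspirin)
vocab = sorted(set(tokens))
matrix = one_hot_matrix(tokens, vocab, max_len=30)

print(tokens)
print(len(matrix), "x", len(matrix[0]))
```

Each column of the matrix describes one token position, and columns beyond the end of the SMILES string stay all-zero, which is the padding step the caption describes.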
Fig. 3. Architecture of the CNN used. The given network takes the feature matrix as input and has three convolutional layers followed by a fully connected layer prior to the final classification layer
Fig. 4. Two different RNN architectures used: A (a) Seq2seq architecture with both perceiver and interpreter networks. The perceiver network is pre-trained using unlabelled data, to learn patterns and structures present in the data. (b) Fine-tuning of the seq2seq network, where the perceiver network is connected to a fully connected layer for classification. B Pre-training and fine-tuning of the MolPMoFiT model. Weights from the pre-trained embedding matrix and three layers of LSTM are transferred and fine-tuned to perform the classification task at hand
Fig. 5. Workflow used in this study. The data were obtained from the STITCH and STRING databases and processed using Quantmap, followed by hierarchical clustering using several distance thresholds. For each distance threshold, a subset of 20 clusters was used to evaluate different deep learning architectures. Further, a dataset of interest was selected for training, and functional assignment of clusters was carried out. The final trained model was later evaluated using well-known chemicals
Fig. 6. Comparison between different architectures. F1 score means and standard deviations (for ten cross-validation folds) of the deep learning models compared on the five clustering distance thresholds
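The summary statistic plotted in Fig. 6, a mean and standard deviation of F1 scores across cross-validation folds, can be sketched as follows. The averaging scheme is an assumption: the sketch uses a macro-averaged F1 over classes, which this record does not confirm. The two toy folds are invented purely for illustration and are not results from the study, which used ten folds.

```python
from statistics import mean, stdev


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging, an assumption)."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        per_class.append(f1_from_counts(tp, fp, fn))
    return mean(per_class)


# Toy folds (hypothetical labels, not data from the paper).
folds = [
    ([0, 0, 1, 1], [0, 0, 1, 1]),  # perfect predictions
    ([0, 0, 1, 1], [0, 1, 1, 1]),  # one class-0 sample mislabelled as class 1
]
fold_scores = [macro_f1(y_true, y_pred) for y_true, y_pred in folds]

f1_mean = mean(fold_scores)
f1_std = stdev(fold_scores)  # sample standard deviation across folds
print(f"F1 = {f1_mean:.3f} +/- {f1_std:.3f}")
```

With many unevenly sized clusters, the choice between macro and micro averaging can change the reported score noticeably, so the averaging scheme is worth checking in the full paper before comparing numbers.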
Fig. 7. Predicted distances between the eleven chemicals selected for in-depth analysis
Comparison of the functional groups predicted for the eleven evaluation chemicals against their DrugBank annotations. Functions annotated in italics or bold are exclusive for the compound in the same font type
| Compound | Qcutoff | Lcutoff | DrugBank annotation |
|---|---|---|---|
| Morphine | Kappa-type opioid receptor | Kappa-type, Mu-type and Delta-type opioid receptor, Cytochrome P450 2D6, Proenkephalin-B | Kappa-type, Mu-type and Delta-type opioid receptor, Lymphocyte antigen 96 |
| Nalorphine | Not found | | |
| | Solute carrier family 15 member 1 and 2 | Solute carrier family 15 member 1 and 2, Angiotensin-converting enzyme, Protein polybromo-1, Band 3 anion transport protein | Solute carrier |
| | Beta-2 adrenergic receptor | Beta-1, 2, and 3 adrenergic receptor, Extracellular calcium-sensing receptor, ER membrane protein complex subunit 6 | Alpha-1A, 1B, 1D, 2A, 2B and |
| Estrogen | Carbonic anhydrase 9, 2, 1, 12, 14, 7, 5A, 5B and 13 | Carbonic anhydrase 9, 2, 1, 12, 14, 7, 5A, 5B and 13 | Estrogen receptor alpha and beta, Nuclear receptor subfamily 1 group I member 2, Neuronal acetylcholine receptor subunit alpha-4, G-protein coupled estrogen receptor 1, ATP synthase subunit a, Beclin-1 |
| | 5-hydroxytryptamine receptor 2A; G-protein coupled receptor for 5-hydroxytryptamine (serotonin) | 5-hydroxytryptamine receptor 2A, 2C, Potassium voltage-gated channel subfamily H member 2, D(2) dopamine receptor, Sodium-dependent serotonin transporter | Sodium-dependent noradrenaline, serotonin and |
| Chlorpromazine | Dopamine receptor D4 | D4, D(2), D(3) dopamine receptor, 5-hydroxytryptamine receptor 1A, 5-hydroxytryptamine receptor 2A | D(1), D1, D2, D3, D4, and D5 dopamine receptor, 5-hydroxytryptamine receptor 1A, 2A, 2C, 2, 6, and 7, Alpha-1A and 1B adrenergic receptor, Histamine H1 and H4 receptor, Potassium voltage-gated channel subfamily H member 2, Alpha-1 and 2 adrenergic receptors, M1 and M3 muscarinic acetylcholine receptor, Sphingomyelin phosphodiesterase, Calmodulin, Alpha1-acid glycoprotein |
| Promethazine | Cholinesterase, Testis, prostate and placenta expressed | Cholinesterase, Acetylcholinesterase, Acetylcholinesterase collagenic tail peptide, Amyloid-beta A4 protein, Potassium voltage-gated channel subfamily H member 1 | Histamine H1 receptor, Histamine H2 receptor, Dopamine D2 receptor, Muscarinic acetylcholine receptor M1, M2, M3, M4 and M5, Alpha adrenergic receptor, Calmodulin, P2 Purinoceptors, Voltage-gated sodium channel alpha subunit, Voltage-gated Potassium Channels |