Kristy Carpenter, Alexander Pilozzi, Xudong Huang.
Abstract
Virtual drug screening can benefit research teams by narrowing a large pool of candidate compounds down to those worth further study. A variety of virtual screening methods have been developed, typically with machine learning classifiers at the center of their design. In the present study, we created a virtual screener for protein kinase inhibitors. Experimental compound-target interaction data were obtained from the IDG-DREAM Drug-Kinase Binding Prediction Challenge. These data were converted and fed as inputs into two multi-input recurrent neural networks (RNNs). The first network used data encoded in one-hot representation, while the second incorporated embedding layers. The models were developed in Python and were designed to output the IC50 of the target compounds. Model performance was assessed primarily through the Q2 values produced by runs with differing sample sizes and epoch counts; recorded loss values were also reported and graphed. The performance of the models was limited, though multiple changes are proposed that could improve a multi-input recurrent neural network-based screening tool.
Keywords: artificial intelligence (AI); deep learning (DL); machine learning (ML); recurrent neural network (RNN); virtual drug screening
Year: 2020 PMID: 32722290 PMCID: PMC7435591 DOI: 10.3390/molecules25153372
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Q2 values for various models with log IC50 output.
| Samples | Epochs | Input representation | Q2 values from cross-validation |
|---|---|---|---|
| 3000 | 500 | One-hot | [−3.8677] |
| 3000 | 500 | Embed | [−0.3743, −0.3382, −0.3323, −0.3917, −0.2402] |
| 10000 | 200 | One-hot | [0.0510, 0.0005, 0, 0.0671] |
| 10000 | 200 | Embed | [0.0391, 0, 0.0364, −0.0689, −0.0512] |
Mean Q2 and mean Q2F3 values for models with log IC50 output.
| Input representation | Mean Q2 | Mean Q2F3 |
|---|---|---|
| One-hot | 0.0601 | 0.0714 |
| Embed | 0.0537 | 0.0467 |
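The Q2 and Q2F3 columns above are external-validation analogues of R2. This record does not state the exact formulas the authors used, so the following is only a minimal sketch under the commonly used definitions: Q2 (the F1 form) compares the prediction error on held-out data with the spread of the held-out responses around the training-set mean, while Q2F3 normalizes both sums by their sample counts. The function and argument names are illustrative and do not come from the original code.

```python
def q2_f1(y_test, y_pred, y_train_mean):
    """Q2 (F1 form): 1 - PRESS / SS, where SS is the sum of squared
    deviations of the test responses from the training-set mean."""
    press = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))
    ss = sum((y - y_train_mean) ** 2 for y in y_test)
    return 1.0 - press / ss


def q2_f3(y_test, y_pred, y_train):
    """Q2F3: 1 - (PRESS / n_test) / (TSS_train / n_train)."""
    mean_tr = sum(y_train) / len(y_train)
    press = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))
    tss_tr = sum((y - mean_tr) ** 2 for y in y_train)
    return 1.0 - (press / len(y_test)) / (tss_tr / len(y_train))


# Toy log IC50 values, invented purely for illustration.
y_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_test = [2.0, 4.0]
mean_tr = sum(y_train) / len(y_train)

# Predicting the training mean for every test compound gives Q2 = 0.
print(q2_f1(y_test, [3.0, 3.0], mean_tr))
# Predictions worse than the training mean give a negative Q2.
print(q2_f1(y_test, [5.0, 1.0], mean_tr))
print(q2_f3(y_test, [3.0, 3.0], y_train))
```

Under these definitions, a value near zero means the model does little better than always predicting the training-set mean, and a negative value means it does worse.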
Figure 1. Loss plot for model with 3000 samples, 500 epochs, one-hot encoding of IUPAC international chemical identifiers (InChIs).
Figure 2. Loss plot for model with 10,000 samples, 200 epochs, embedding of InChIs.
Figure 3. Loss plot for baseline model with 3000 samples, 500 epochs.
Biochemical features for amino acids in protein targets.
| AAindex ID | Description |
|---|---|
| KLEP840101 | Net charge |
| KYTJ820101 | Kyte-Doolittle hydrophobicity |
| FASG760101 | Molecular weight |
| FAUJ880103 | Normalized van der Waals volume |
| GRAR740102 | Polarity |
| CHAM820101 | Polarizability |
| JANJ780101 | Average accessible surface area |
| PRAM900102 | Relative frequency in alpha-helix |
| PRAM900103 | Relative frequency in beta-sheet |
| PRAM900104 | Relative frequency in reverse-turn |
| NADH010104 | Hydropathy scale (20% accessibility) |
| NADH010107 | Hydropathy scale (50% accessibility) |
| RADA880103 | Transfer free energy from vapor (vap) to cyclohexane (chx) |
| RICJ880113 | Relative preference value at C2 |
| RICJ880112 | Relative preference value at C3 |
| KHAG800101 | Kerr-constant increments |
| PRAM820103 | Correlation coefficient in regression analysis |
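One way to turn a protein target into numeric input is to look up each residue in the AAindex scales listed above, which yields one feature vector per amino acid. The sketch below illustrates that idea with a single scale, Kyte-Doolittle hydrophobicity (KYTJ820101). How the authors combined, scaled, or padded the 17 features is not stated in this record, so the function name `encode_sequence` and the zero default for unknown residues are assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values (AAindex KYTJ820101) for the 20 standard residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_sequence(seq, scales, unknown=0.0):
    """Return one feature vector per residue, with one entry per scale.

    `scales` is a list of dicts mapping one-letter residue codes to values.
    Residues missing from a scale get `unknown` (an assumption of this sketch).
    """
    return [[scale.get(res, unknown) for scale in scales] for res in seq.upper()]


# A short, made-up peptide; a real kinase sequence would be far longer.
features = encode_sequence("MKVLA", [KYTE_DOOLITTLE])
print(features)
```

Passing more dictionaries in `scales` widens each per-residue vector, so the full table would give 17 values per amino acid.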
Figure 4. The multi-input recurrent neural network (RNN) architecture with one-hot encoded inputs.
Figure 5. The multi-input RNN architecture with string inputs that undergo embedding.
Figure 6. The baseline neural network (NN) architecture with string inputs that undergo embedding.
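The two input representations named in Figures 4 and 5 differ in what the network receives for each character of an InChI string: a one-hot vector, or an integer index that an embedding layer later maps to a learned dense vector. The sketch below shows both encodings in plain Python. It assumes character-level tokenization, a fixed maximum length, and index 0 reserved for padding; none of these details is confirmed by this record, and all function names are hypothetical.

```python
def build_vocab(strings):
    """Map every character seen in `strings` to an index starting at 1.

    Index 0 is reserved for padding (and, in this sketch, unseen characters).
    """
    chars = sorted(set("".join(strings)))
    return {ch: i + 1 for i, ch in enumerate(chars)}


def to_indices(s, vocab, max_len):
    """Integer encoding: the form an embedding layer would take as input."""
    idx = [vocab.get(ch, 0) for ch in s[:max_len]]
    return idx + [0] * (max_len - len(idx))  # right-pad to a fixed length


def to_one_hot(s, vocab, max_len):
    """One-hot encoding: a max_len x (len(vocab) + 1) matrix of 0s and 1s."""
    size = len(vocab) + 1
    rows = []
    for i in to_indices(s, vocab, max_len):
        row = [0] * size
        row[i] = 1
        rows.append(row)
    return rows


# InChIs for water and methane, used only as small examples.
inchis = ["InChI=1S/H2O/h1H2", "InChI=1S/CH4/h1H4"]
vocab = build_vocab(inchis)
indices = to_indices(inchis[0], vocab, max_len=24)
one_hot = to_one_hot(inchis[0], vocab, max_len=24)
print(len(indices), len(one_hot), len(one_hot[0]))
```

The one-hot matrix grows with the vocabulary size, whereas the integer form stays one value per character and leaves the vector width to the embedding layer.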