Hirotomo Moriwaki¹, Shin Saito¹, Tomoya Matsumoto¹, Takayuki Serizawa², Ryo Kunimoto².
Abstract
In drug discovery, the prediction of activity and of absorption, distribution, metabolism, excretion, and toxicity parameters is one of the most important approaches for deciding which compound to synthesize next. In recent years, prediction methods based on deep learning as well as non-deep learning approaches have been established, and a number of applications to drug discovery have been reported by various companies and organizations. In this research, we performed activity prediction using deep learning and non-deep learning methods on in-house assay data for several hundred kinases and compared and discussed the prediction results. We found that the prediction accuracy of the single-task graph neural network (GNN) model was generally lower than that of the non-deep learning model (LightGBM), but the multitask GNN model, which combined data from other kinases, comprehensively outperformed LightGBM. In addition, the extrapolative validity of the multitask model was verified by using it for prediction on known kinase ligands. We observed an overlap between characteristic protein-ligand interaction sites and the atoms that are important for prediction. By building appropriate models based on the conditions of the data set and analyzing the feature importance of the prediction results, a ligand-based prediction method may be used not only for activity prediction but also for drug design.
Year: 2022 PMID: 35694454 PMCID: PMC9178758 DOI: 10.1021/acsomega.2c00664
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
Figure 1. Illustration of the GNN model architecture. Dimensions of the output are shown to the left of the arrow. Italics indicate hyperparameter names listed in the GNN hyperparameter table below.
Atomic Features
| category | dimensions | values |
|---|---|---|
| element | 44 | C, N, O, S, F, Si, P, Cl, Br, Mg, Na, Ca, Fe, As, Al, I, B, V, K, Tl, Yb, Sb, Sn, Ag, Pd, Co, Se, Ti, Zn, H, Li, Ge, Cu, Au, Ni, Cd, In, Mn, Zr, Cr, Pt, Hg, Pb, other |
| degree | 11 | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 |
| valence | 8 | 0, 1, 2, 3, 4, 5, 6, other |
| formal charge | 1 | |
| number of radical electrons | 1 | |
| hybridization | 6 | SP, SP2, SP3, SP3D, SP3D2, other |
| aromaticity | 1 | |
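The feature table above can be sketched as a plain one-hot encoder; the dimensions sum to 44 + 11 + 8 + 1 + 1 + 6 + 1 = 72 per atom. The function names below are illustrative, not the authors' code (in practice these properties would come from an RDKit `Atom` object):

```python
# Hypothetical sketch of the atomic featurization in the table above:
# each categorical property is one-hot encoded (with a trailing "other"
# bucket where the table lists one), and scalar properties contribute
# one dimension each. All names here are illustrative.
ELEMENTS = ["C", "N", "O", "S", "F", "Si", "P", "Cl", "Br", "Mg", "Na", "Ca",
            "Fe", "As", "Al", "I", "B", "V", "K", "Tl", "Yb", "Sb", "Sn", "Ag",
            "Pd", "Co", "Se", "Ti", "Zn", "H", "Li", "Ge", "Cu", "Au", "Ni",
            "Cd", "In", "Mn", "Zr", "Cr", "Pt", "Hg", "Pb", "other"]   # 44
DEGREES = list(range(11))                                              # 11
VALENCES = [0, 1, 2, 3, 4, 5, 6, "other"]                              # 8
HYBRIDIZATIONS = ["SP", "SP2", "SP3", "SP3D", "SP3D2", "other"]        # 6

def one_hot(value, choices):
    """One-hot encode `value`; unknown values map to the last slot."""
    vec = [0.0] * len(choices)
    idx = choices.index(value) if value in choices else len(choices) - 1
    vec[idx] = 1.0
    return vec

def atom_features(element, degree, valence, formal_charge,
                  num_radical_electrons, hybridization, is_aromatic):
    return (one_hot(element, ELEMENTS)
            + one_hot(degree, DEGREES)
            + one_hot(valence, VALENCES)
            + [float(formal_charge)]            # 1 dimension
            + [float(num_radical_electrons)]    # 1 dimension
            + one_hot(hybridization, HYBRIDIZATIONS)
            + [1.0 if is_aromatic else 0.0])    # 1 dimension

# An aromatic carbon: 44 + 11 + 8 + 1 + 1 + 6 + 1 = 72 dimensions in total.
vec = atom_features("C", 3, 4, 0, 0, "SP2", True)
print(len(vec))  # 72
```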
Range of Hyperparameters Explored in the GNN
| parameter name | values |
|---|---|
| graph layer | GCNConv, GraphConv, SAGEConv, GATConv, ARMAConv, SGConv |
| number of graph layers | 1–5 |
| number of graph layer units | 16–1024 |
| number of linear layers | 1–3 |
| number of linear layer units | 16–1024 |
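A graph layer of the kind listed in the table (e.g., GCNConv) propagates atom features over the molecular graph via a normalized adjacency matrix. A minimal numpy sketch of one such layer with illustrative dimensions (the actual models use graph-library layers, not this code):

```python
import numpy as np

# Minimal numpy sketch of one GCN-style graph layer (illustrative only):
#   H' = D^{-1/2} (A + I) D^{-1/2} H W
def gcn_layer(adj, features, weight):
    a_hat = adj + np.eye(adj.shape[0])        # add self-loops
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # symmetric degree normalization
    return d_inv_sqrt @ a_hat @ d_inv_sqrt @ features @ weight

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0],                    # a 3-atom toy graph
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
h = rng.standard_normal((3, 72))              # 72-dim atomic features
w = rng.standard_normal((72, 16))             # 16 graph-layer units
out = gcn_layer(adj, h, w)
print(out.shape)  # (3, 16)
```

Stacking 1-5 such layers with 16-1024 units, followed by 1-3 linear layers, corresponds to the ranges in the table.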
Range of Hyperparameters Explored in the Random Forest Classification
| name | distribution | range |
|---|---|---|
| number of estimators | int uniform | 2–500 |
| max_depth | uniform | 1–64 |
| min_samples_split | uniform | 2–128 |
| max_features | uniform | 0.1–1 |
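Ranges like those in the table can be explored with a simple random search; a hedged sketch using scikit-learn parameter names (`n_estimators`, `max_depth`, `min_samples_split`, `max_features`) on toy data, not the authors' pipeline:

```python
import random
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data standing in for the kinase assay matrices.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

random.seed(0)
best_score, best_params = -1.0, None
for _ in range(5):  # a handful of trials, purely for illustration
    params = {
        "n_estimators": random.randint(2, 500),
        "max_depth": random.randint(1, 64),
        "min_samples_split": random.randint(2, 128),
        "max_features": random.uniform(0.1, 1.0),
    }
    score = cross_val_score(RandomForestClassifier(random_state=0, **params),
                            X, y, cv=3, scoring="roc_auc").mean()
    if score > best_score:
        best_score, best_params = score, params
print(best_score)
```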
Range of Hyperparameters Explored in LightGBM
| name | distribution | range |
|---|---|---|
| lambda_l1, lambda_l2 | log uniform | 10⁻⁸–10¹ |
| num_leaves | int uniform | 2–512 |
| feature_fraction | uniform | 0.4–1 |
| bagging_fraction | uniform | 0.4–1 |
| bagging_freq | int uniform | 1–7 |
| min_child_samples | int uniform | 5–100 |
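The `lambda_l1`/`lambda_l2` row is a log-uniform range from 10⁻⁸ to 10¹, i.e. sampled uniformly in log space; a small illustrative helper (not LightGBM's or any tuner's own sampler):

```python
import math
import random

# Sample a value uniformly in log10 space between `low` and `high`,
# as needed for regularization strengths spanning many orders of magnitude.
def sample_log_uniform(low, high, rng):
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

rng = random.Random(0)
samples = [sample_log_uniform(1e-8, 1e1, rng) for _ in range(1000)]
assert all(1e-8 <= s <= 1e1 for s in samples)
```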
Figure 2. Distribution of similar compounds in the data set. Tanimoto coefficients computed with the Morgan2 fingerprint were used as the similarity score, and the distributions were estimated by kernel density estimation (KDE). (a) Distribution of the k-th most similar compounds within the in-house training data set; blue, orange, green, red, and purple lines represent the 1st, 10th, 30th, 50th, and 100th most similar compounds, respectively. (b) Distribution of the 10th most similar compounds between the training set and the other data sets; blue, orange, and green lines represent the training, validation, and test sets, respectively. (c) Distribution of the k-th most similar compounds between the in-house data set and the PDB data set; blue, orange, green, red, and purple lines represent the 1st, 10th, 30th, 50th, and 100th most similar compounds, respectively.
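The Tanimoto coefficient used in this similarity analysis is the ratio of shared to total on-bits of two binary fingerprints; a minimal sketch with toy bit sets standing in for real Morgan2 fingerprints:

```python
# Tanimoto coefficient between binary fingerprints, represented here as
# sets of on-bit indices (toy bits, not real Morgan2 fingerprints).
def tanimoto(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def kth_most_similar(query, pool, k):
    """Similarity of the k-th most similar fingerprint in `pool` (1-indexed)."""
    sims = sorted((tanimoto(query, fp) for fp in pool), reverse=True)
    return sims[k - 1]

fp_query = {1, 2, 3, 4}
pool = [{1, 2, 3, 4}, {1, 2, 5, 6}, {7, 8}]
print(tanimoto(fp_query, pool[1]))          # 2 shared / 6 union ≈ 0.333
print(kth_most_similar(fp_query, pool, 1))  # 1.0 (identical fingerprint)
```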
Prediction ROC-AUC Score of Each Modelᵃ
| model | single-task validation | single-task test | multitask validation | multitask test | transfer learning validation | transfer learning test |
|---|---|---|---|---|---|---|
| GCNConv | 0.9510 ± 0.0004 | 0.8985 ± 0.0040 | 0.9357 ± 0.0027 | 0.9266 ± 0.0049 | 0.9418 ± 0.0026 | 0.9304 ± 0.0051 |
| GraphConv | 0.9534 ± 0.0008 | 0.9010 ± 0.0014 | 0.9379 ± 0.0025 | 0.9277 ± 0.0059 | 0.9436 ± 0.0024 | 0.9320 ± 0.0062 |
| SAGEConv | 0.9516 ± 0.0012 | 0.8999 ± 0.0035 | 0.9361 ± 0.0012 | 0.9282 ± 0.0029 | 0.9412 ± 0.0014 | 0.9319 ± 0.0031 |
| GATConv | 0.9514 ± 0.0007 | 0.8976 ± 0.0040 | 0.9364 ± 0.0028 | 0.9237 ± 0.0041 | 0.9418 ± 0.0013 | 0.9278 ± 0.0022 |
| ARMAConv | 0.9514 ± 0.0008 | 0.9004 ± 0.0051 | 0.9389 ± 0.0022 | 0.9295 ± 0.0041 | 0.9440 ± 0.0016 | 0.9327 ± 0.0036 |
| SGConv | 0.9510 ± 0.0002 | 0.8997 ± 0.0033 | 0.9352 ± 0.0027 | 0.9251 ± 0.0043 | 0.9416 ± 0.0019 | 0.9294 ± 0.0039 |
| LightGBM | 0.9521 ± 0.0007 | 0.9190 ± 0.0054 | ||||
| RF | 0.9445 ± 0.0009 | 0.9082 ± 0.0040 | ||||
| SVC | 0.8891 ± 0.0034 | 0.8565 ± 0.0006 | ||||
| kNN | 0.9292 ± 0.0013 | 0.8991 ± 0.0064 | ||||
ᵃThe table provides ROC-AUC statistics for the prediction results. For each method, scores on the validation set and the test set are reported.
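Per-cell values of the form mean ± deviation can be reproduced from per-run predicted scores with scikit-learn's `roc_auc_score`; a toy sketch (labels and scores here are illustrative, not the paper's data):

```python
from statistics import mean, stdev
from sklearn.metrics import roc_auc_score

# Two toy "runs": (true labels, predicted scores) per run.
runs = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 0, 1, 1], [0.2, 0.3, 0.6, 0.9]),
]
aucs = [roc_auc_score(y_true, y_score) for y_true, y_score in runs]
print(f"{mean(aucs):.4f} ± {stdev(aucs):.4f}")  # → 0.8750 ± 0.1768
```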
Targets for Which Learning Failed with the Single-Task Model or LightGBMᵃ
| target | multitask ROC-AUC score | N | positive | negative |
|---|---|---|---|---|
| CAMK1B | 0.8333 | 52 | 48 | 4 |
| CDK10 | 0.9375 | 59 | 53 | 6 |
| NEK10 | | 151 | 137 | 14 |
| NIK | 0.8333 | 151 | 135 | 16 |
| WNK1 | | 628 | 625 | 3 |
| WNK3 | | 627 | 620 | 7 |
ᵃThe table shows the multitask model's prediction ROC-AUC scores for targets for which a single-task GNN or LightGBM model could not be built. In addition, the total number of compounds (N) and the numbers of positive and negative compounds are reported.
Figure 3. Comparison of the per-target prediction ROC-AUC between models. The color scale of each point indicates the size of the training set for that target. (a) Single-task model on the x-axis and multitask model on the y-axis. (b) LightGBM on the x-axis and multitask model on the y-axis.
Range of Hyperparameters Explored in the SVC
| name | distribution | range |
|---|---|---|
| kernel | | RBF |
| C | log uniform | 2⁻¹–2⁶ |
| γ | log uniform | 2⁻⁸–2⁰ |
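The SVC ranges are expressed in powers of 2 (C in [2⁻¹, 2⁶], γ in [2⁻⁸, 2⁰]); an illustrative candidate grid over those exponents:

```python
# Candidate grids for the SVC search space, stepping through the
# power-of-2 exponent ranges from the table (illustrative only).
c_grid = [2.0 ** e for e in range(-1, 7)]      # 0.5 ... 64
gamma_grid = [2.0 ** e for e in range(-8, 1)]  # 1/256 ... 1
print(c_grid[0], c_grid[-1])          # 0.5 64.0
print(gamma_grid[0], gamma_grid[-1])  # 0.00390625 1.0
```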
Figure 4. Comparison of the prediction ROC-AUC of the multitask and transfer learning models for each target. The x-axis shows the respective kinase, and the y-axis shows the difference in prediction ROC-AUC between the transfer learning and multitask models.
Figure 5. Visualization results of the FGFR2-small molecule complex. The importance of the atoms in the small molecule is indicated by the color scale. Yellow ribbons indicate nucleotide-binding regions (UniProt ID: P21802).
Figure 6. Visualization results of the PAK4-small molecule complex. The importance of the atoms in the small molecule is indicated by the color scale. Yellow ribbons indicate nucleotide-binding regions (UniProt ID: O96013). (a) Molecules in which atoms around the hinge binder became highly important. (b) Molecules in which atoms interacting with GLU329, ASP458, and ASP444 became highly important. (c) Molecules in which atoms that are less associated with interactions became highly important.
Range of Hyperparameters Explored in the k-Nearest Neighbor Classification
| name | distribution | range |
|---|---|---|
| number of neighbors | uniform | 1–50 |
| weights | | {uniform, distance} |
| p | | {1, 2} |
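The kNN search space above maps directly onto scikit-learn's `KNeighborsClassifier` (`n_neighbors`, `weights`, `p`); a toy grid evaluation, not the authors' setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy data; the real models are fit on kinase assay fingerprints.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Evaluate a few points of the table's search space and keep the best.
best = max(
    (cross_val_score(KNeighborsClassifier(n_neighbors=k, weights=w, p=p),
                     X, y, cv=3, scoring="roc_auc").mean(), k, w, p)
    for k in (1, 10, 25, 50)
    for w in ("uniform", "distance")
    for p in (1, 2)
)
print(best)  # (best ROC-AUC, n_neighbors, weights, p)
```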