Hirotomo Moriwaki¹, Shin Saito¹, Tomoya Matsumoto¹, Takayuki Serizawa², Ryo Kunimoto².
Abstract
In drug discovery, the prediction of activity and of absorption, distribution, metabolism, excretion, and toxicity parameters is one of the most important approaches for deciding which compound to synthesize next. In recent years, prediction methods based on deep learning as well as non-deep learning approaches have been established, and a number of applications to drug discovery have been reported by various companies and organizations. In this research, we performed activity prediction using deep learning and non-deep learning methods on in-house assay data for several hundred kinases and compared and discussed the prediction results. We found that the prediction accuracy of the single-task graph neural network (GNN) model was generally lower than that of the non-deep learning model (LightGBM), but the multitask GNN model, which combined data from other kinases, comprehensively outperformed LightGBM. In addition, the extrapolative validity of the multitask model was verified by using it for prediction on known kinase ligands. We observed an overlap between characteristic protein-ligand interaction sites and the atoms that are important for prediction. By building appropriate models based on the conditions of the data set and analyzing the feature importance of the prediction results, a ligand-based prediction method may be used not only for activity prediction but also for drug design.
Year: 2022 PMID: 35694454 PMCID: PMC9178758 DOI: 10.1021/acsomega.2c00664
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
Figure 1. Illustration of the GNN model architecture. Dimensions of the output are shown to the left of the arrow. Italics indicate hyperparameter names listed in the GNN hyperparameter table below.
Atomic Features
| category | dimensions | values |
|---|---|---|
| element | 44 | C, N, O, S, F, Si, P, Cl, Br, Mg, Na, Ca, Fe, As, Al, I, B, V, K, Tl, Yb, Sb, Sn, Ag, Pd, Co, Se, Ti, Zn, H, Li, Ge, Cu, Au, Ni, Cd, In, Mn, Zr, Cr, Pt, Hg, Pb, other |
| degree | 11 | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 |
| valence | 8 | 0, 1, 2, 3, 4, 5, 6, other |
| formal charge | 1 | |
| number of radical electrons | 1 | |
| hybridization | 6 | SP, SP2, SP3, SP3D, SP3D2, other |
| aromaticity | 1 | |
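The feature table above can be sketched as a plain one-hot encoder; the dimensions sum to 44 + 11 + 8 + 1 + 1 + 6 + 1 = 72 per atom. The function names below are illustrative, not the authors' code (in practice these properties would come from an RDKit `Atom` object):

```python
# Hypothetical sketch of the atomic featurization in the table above:
# each categorical property is one-hot encoded (with a trailing "other"
# bucket where the table lists one), and scalar properties contribute
# one dimension each. All names here are illustrative.
ELEMENTS = ["C", "N", "O", "S", "F", "Si", "P", "Cl", "Br", "Mg", "Na", "Ca",
            "Fe", "As", "Al", "I", "B", "V", "K", "Tl", "Yb", "Sb", "Sn", "Ag",
            "Pd", "Co", "Se", "Ti", "Zn", "H", "Li", "Ge", "Cu", "Au", "Ni",
            "Cd", "In", "Mn", "Zr", "Cr", "Pt", "Hg", "Pb", "other"]   # 44
DEGREES = list(range(11))                                              # 11
VALENCES = [0, 1, 2, 3, 4, 5, 6, "other"]                              # 8
HYBRIDIZATIONS = ["SP", "SP2", "SP3", "SP3D", "SP3D2", "other"]        # 6

def one_hot(value, choices):
    """One-hot encode `value`; unknown values map to the last slot."""
    vec = [0.0] * len(choices)
    idx = choices.index(value) if value in choices else len(choices) - 1
    vec[idx] = 1.0
    return vec

def atom_features(element, degree, valence, formal_charge,
                  num_radical_electrons, hybridization, is_aromatic):
    return (one_hot(element, ELEMENTS)
            + one_hot(degree, DEGREES)
            + one_hot(valence, VALENCES)
            + [float(formal_charge)]            # 1 dimension
            + [float(num_radical_electrons)]    # 1 dimension
            + one_hot(hybridization, HYBRIDIZATIONS)
            + [1.0 if is_aromatic else 0.0])    # 1 dimension

# An aromatic carbon: 44 + 11 + 8 + 1 + 1 + 6 + 1 = 72 dimensions in total.
vec = atom_features("C", 3, 4, 0, 0, "SP2", True)
print(len(vec))  # 72
```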
Range of Hyperparameters Explored in the GNN
| parameter name | values |
|---|---|
| graph layer | GCNConv, GraphConv, SAGEConv, GATConv, ARMAConv, SGConv |
| number of graph layers | 1–5 |
| number of graph layer units | 16–1024 |
| number of linear layers | 1–3 |
| number of linear layer units | 16–1024 |
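A graph layer of the kind listed in the table (e.g., GCNConv) propagates atom features over the molecular graph via a normalized adjacency matrix. A minimal numpy sketch of one such layer with illustrative dimensions (the actual models use graph-library layers, not this code):

```python
import numpy as np

# Minimal numpy sketch of one GCN-style graph layer (illustrative only):
#   H' = D^{-1/2} (A + I) D^{-1/2} H W
def gcn_layer(adj, features, weight):
    a_hat = adj + np.eye(adj.shape[0])        # add self-loops
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # symmetric degree normalization
    return d_inv_sqrt @ a_hat @ d_inv_sqrt @ features @ weight

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0],                    # a 3-atom toy graph
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
h = rng.standard_normal((3, 72))              # 72-dim atomic features
w = rng.standard_normal((72, 16))             # 16 graph-layer units
out = gcn_layer(adj, h, w)
print(out.shape)  # (3, 16)
```

Stacking 1-5 such layers with 16-1024 units, followed by 1-3 linear layers, corresponds to the ranges in the table.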
Range of Hyperparameters Explored in the Random Forest Classification
| name | distribution | range |
|---|---|---|
| number of estimators | int uniform | 2–500 |
| max_depth | uniform | 1–64 |
| min_samples_split | uniform | 2–128 |
| max_features | uniform | 0.1–1 |
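Ranges like those in the table can be explored with a simple random search; a hedged sketch using scikit-learn parameter names (`n_estimators`, `max_depth`, `min_samples_split`, `max_features`) on toy data, not the authors' pipeline:

```python
import random
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data standing in for the kinase assay matrices.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

random.seed(0)
best_score, best_params = -1.0, None
for _ in range(5):  # a handful of trials, purely for illustration
    params = {
        "n_estimators": random.randint(2, 500),
        "max_depth": random.randint(1, 64),
        "min_samples_split": random.randint(2, 128),
        "max_features": random.uniform(0.1, 1.0),
    }
    score = cross_val_score(RandomForestClassifier(random_state=0, **params),
                            X, y, cv=3, scoring="roc_auc").mean()
    if score > best_score:
        best_score, best_params = score, params
print(best_score)
```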
Range of Hyperparameters Explored in LightGBM
| name | distribution | range |
|---|---|---|
| lambda_l1, lambda_l2 | log uniform | 10⁻⁸–10¹ |
| num_leaves | int uniform | 2–512 |
| feature_fraction | uniform | 0.4–1 |
| bagging_fraction | uniform | 0.4–1 |
| bagging_freq | int uniform | 1–7 |
| min_child_samples | int uniform | 5–100 |
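The `lambda_l1`/`lambda_l2` row is a log-uniform range from 10⁻⁸ to 10¹, i.e. sampled uniformly in log space; a small illustrative helper (not LightGBM's or any tuner's own sampler):

```python
import math
import random

# Sample a value uniformly in log10 space between `low` and `high`,
# as needed for regularization strengths spanning many orders of magnitude.
def sample_log_uniform(low, high, rng):
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

rng = random.Random(0)
samples = [sample_log_uniform(1e-8, 1e1, rng) for _ in range(1000)]
assert all(1e-8 <= s <= 1e1 for s in samples)
```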
Figure 2. Distribution of similar compounds in the data set. Tanimoto coefficients computed with the Morgan2 fingerprint were used as the similarity score, and the distributions were estimated by kernel density estimation (KDE). (a) Distribution of the k-th most similar compounds within the in-house training data set; blue, orange, green, red, and purple lines represent the 1st, 10th, 30th, 50th, and 100th most similar compounds, respectively. (b) Distribution of the 10th most similar compounds between the training set and the other data sets; blue, orange, and green lines represent the training, validation, and test sets, respectively. (c) Distribution of the k-th most similar compounds between the in-house data set and the PDB data set; blue, orange, green, red, and purple lines represent the 1st, 10th, 30th, 50th, and 100th most similar compounds, respectively.
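The Tanimoto coefficient used in this similarity analysis is the ratio of shared to total on-bits of two binary fingerprints; a minimal sketch with toy bit sets standing in for real Morgan2 fingerprints:

```python
# Tanimoto coefficient between binary fingerprints, represented here as
# sets of on-bit indices (toy bits, not real Morgan2 fingerprints).
def tanimoto(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def kth_most_similar(query, pool, k):
    """Similarity of the k-th most similar fingerprint in `pool` (1-indexed)."""
    sims = sorted((tanimoto(query, fp) for fp in pool), reverse=True)
    return sims[k - 1]

fp_query = {1, 2, 3, 4}
pool = [{1, 2, 3, 4}, {1, 2, 5, 6}, {7, 8}]
print(tanimoto(fp_query, pool[1]))          # 2 shared / 6 union ≈ 0.333
print(kth_most_similar(fp_query, pool, 1))  # 1.0 (identical fingerprint)
```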
Prediction ROC-AUC Score of Each Modelᵃ
| model | single-task validation | single-task test | multitask validation | multitask test | transfer learning validation | transfer learning test |
|---|---|---|---|---|---|---|
| GCNConv | 0.9510 ± 0.0004 | 0.8985 ± 0.0040 | 0.9357 ± 0.0027 | 0.9266 ± 0.0049 | 0.9418 ± 0.0026 | 0.9304 ± 0.0051 |
| GraphConv | 0.9534 ± 0.0008 | 0.9010 ± 0.0014 | 0.9379 ± 0.0025 | 0.9277 ± 0.0059 | 0.9436 ± 0.0024 | 0.9320 ± 0.0062 |
| SAGEConv | 0.9516 ± 0.0012 | 0.8999 ± 0.0035 | 0.9361 ± 0.0012 | 0.9282 ± 0.0029 | 0.9412 ± 0.0014 | 0.9319 ± 0.0031 |
| GATConv | 0.9514 ± 0.0007 | 0.8976 ± 0.0040 | 0.9364 ± 0.0028 | 0.9237 ± 0.0041 | 0.9418 ± 0.0013 | 0.9278 ± 0.0022 |
| ARMAConv | 0.9514 ± 0.0008 | 0.9004 ± 0.0051 | 0.9389 ± 0.0022 | 0.9295 ± 0.0041 | 0.9440 ± 0.0016 | 0.9327 ± 0.0036 |
| SGConv | 0.9510 ± 0.0002 | 0.8997 ± 0.0033 | 0.9352 ± 0.0027 | 0.9251 ± 0.0043 | 0.9416 ± 0.0019 | 0.9294 ± 0.0039 |
| LightGBM | 0.9521 ± 0.0007 | 0.9190 ± 0.0054 | ||||
| RF | 0.9445 ± 0.0009 | 0.9082 ± 0.0040 | ||||
| SVC | 0.8891 ± 0.0034 | 0.8565 ± 0.0006 | ||||
| kNN | 0.9292 ± 0.0013 | 0.8991 ± 0.0064 | ||||
ᵃThe table provides ROC-AUC statistics for the prediction results. For each method, scores on the validation set and the test set are reported.
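Per-cell values of the form mean ± deviation can be reproduced from per-run predicted scores with scikit-learn's `roc_auc_score`; a toy sketch (labels and scores here are illustrative, not the paper's data):

```python
from statistics import mean, stdev
from sklearn.metrics import roc_auc_score

# Two toy "runs": (true labels, predicted scores) per run.
runs = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 0, 1, 1], [0.2, 0.3, 0.6, 0.9]),
]
aucs = [roc_auc_score(y_true, y_score) for y_true, y_score in runs]
print(f"{mean(aucs):.4f} ± {stdev(aucs):.4f}")  # → 0.8750 ± 0.1768
```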
Targets for Which Learning Failed with the Single-Task Model or LightGBMᵃ
| target | multitask ROC-AUC score | N | positive | negative |
|---|---|---|---|---|
| CAMK1B | 0.8333 | 52 | 48 | 4 |
| CDK10 | 0.9375 | 59 | 53 | 6 |
| NEK10 | | 151 | 137 | 14 |
| NIK | 0.8333 | 151 | 135 | 16 |
| WNK1 | | 628 | 625 | 3 |
| WNK3 | | 627 | 620 | 7 |
ᵃThe table shows the multitask model's prediction ROC-AUC scores for targets for which a single-task GNN or LightGBM model could not be built. In addition, the total number of compounds (N) and the numbers of positive and negative compounds are reported.
Figure 3. Comparison of the per-target prediction ROC-AUC between models. The color scale of each point indicates the size of the training set for that target. (a) Single-task model on the x-axis and multitask model on the y-axis. (b) LightGBM on the x-axis and multitask model on the y-axis.
Range of Hyperparameters Explored in the SVC
| name | distribution | range |
|---|---|---|
| kernel | | RBF |
| C | log uniform | 2⁻¹–2⁶ |
| γ | log uniform | 2⁻⁸–2⁰ |
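The SVC ranges are expressed in powers of 2 (C in [2⁻¹, 2⁶], γ in [2⁻⁸, 2⁰]); an illustrative candidate grid over those exponents:

```python
# Candidate grids for the SVC search space, stepping through the
# power-of-2 exponent ranges from the table (illustrative only).
c_grid = [2.0 ** e for e in range(-1, 7)]      # 0.5 ... 64
gamma_grid = [2.0 ** e for e in range(-8, 1)]  # 1/256 ... 1
print(c_grid[0], c_grid[-1])          # 0.5 64.0
print(gamma_grid[0], gamma_grid[-1])  # 0.00390625 1.0
```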
Figure 4. Comparison of the prediction ROC-AUC of the multitask and transfer learning models for each target. The x-axis shows the respective kinase, and the y-axis shows the difference in prediction ROC-AUC between the transfer learning and multitask models.
Figure 5. Visualization results of the FGFR2-small molecule complex. The importance of the atoms in the small molecule is indicated by the color scale. Yellow ribbons indicate nucleotide-binding regions (UniProt ID: P21802).
Figure 6. Visualization results of the PAK4-small molecule complex. The importance of the atoms in the small molecule is indicated by the color scale. Yellow ribbons indicate nucleotide-binding regions (UniProt ID: O96013). (a) Molecules in which atoms around the hinge binder became highly important. (b) Molecules in which atoms interacting with GLU329, ASP458, and ASP444 became highly important. (c) Molecules in which atoms that are less associated with interactions became highly important.
Range of Hyperparameters Explored in the k-Nearest Neighbor Classification
| name | distribution | range |
|---|---|---|
| number of neighbors | uniform | 1–50 |
| weights | | {uniform, distance} |
| p | | {1, 2} |
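The kNN search space above maps directly onto scikit-learn's `KNeighborsClassifier` (`n_neighbors`, `weights`, `p`); a toy grid evaluation, not the authors' setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy data; the real models are fit on kinase assay fingerprints.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Evaluate a few points of the table's search space and keep the best.
best = max(
    (cross_val_score(KNeighborsClassifier(n_neighbors=k, weights=w, p=p),
                     X, y, cv=3, scoring="roc_auc").mean(), k, w, p)
    for k in (1, 10, 25, 50)
    for w in ("uniform", "distance")
    for p in (1, 2)
)
print(best)  # (best ROC-AUC, n_neighbors, weights, p)
```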