Srikanth Kuthuru, Adam T Szafran, Fabio Stossi, Michael A Mancini, Arvind Rao.
Abstract
In recent years, protein kinases have become some of the most significant drug targets in cancer patients. Kinases are known to regulate the activity of many human proteins, and consequently their inhibition has been used to control cancer proliferation. A significant challenge in drug discovery is the rapid and efficient identification of new small molecules. In this study, we propose a novel in silico drug discovery approach to identify kinase targets that impinge on nuclear receptor signaling with data generated using high-content analysis (HCA). A high-throughput imaging dataset was generated from an siRNA human kinome screen on engineered cells that allow direct visualization of effects on estrogen receptor-α or a chimeric progesterone receptor B binding to specific DNA. Two types of kinase descriptors are extracted from these imaging data: first, a population-median-based descriptor and, second, a bag-of-words (BoW) descriptor that can capture heterogeneity information in the imaging data. Using these descriptors, we provide prediction results of drug-kinase-target interactions based on single-task learning, multi-task learning, and collaborative filtering methods. The best performing model in target-based drug discovery gives an area under the receiver operating characteristic curve (AUC) of 0.86, whereas the best model in ligand-based discovery gives an AUC of 0.79. These promising results suggest that imaging-based information can be used as an additional source of information to existing virtual screening methods, thereby making the drug discovery process more time- and cost-efficient.
Keywords: drug discovery; high-throughput imaging; machine learning
Year: 2019 PMID: 31217689 PMCID: PMC6563400 DOI: 10.1177/1176935119856595
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Figure 1. The machine learning setups for various drug discovery tasks: (A) lead compound identification: the task is to predict drug interactions with new kinase targets to find lead compounds that show significant bioactivity when tested experimentally; (B) drug activity prediction: the task is to predict the target interactions of new drugs to find drug-like molecules; (C) drug repurposing: given partially known bioactivity data, the task is to predict unknown interactions. This is useful for finding new therapeutic uses for already established drugs.
Figure 2. Bag-of-words. This is a simulated example used for illustration purposes only. The blue points are highly heterogeneous and therefore have a flatter histogram profile. The red points are local (very homogeneous) and therefore have a narrow histogram profile.
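The two kinase descriptors named in the abstract can be illustrated with a minimal sketch. It assumes that each kinase knockdown yields a list of per-cell feature vectors and that a codebook of representative "words" is already available (for example from clustering all cells); the function names, the toy codebook, and the toy cell values are hypothetical and not taken from the study.

```python
from statistics import median

def median_descriptor(cells):
    """Population-median descriptor: per-feature median over all cells."""
    n_features = len(cells[0])
    return [median(cell[f] for cell in cells) for f in range(n_features)]

def bow_descriptor(cells, codebook):
    """Bag-of-words descriptor: normalized histogram of nearest-codeword counts."""
    counts = [0] * len(codebook)
    for cell in cells:
        # Squared Euclidean distance from this cell to every codeword.
        dists = [sum((a - b) ** 2 for a, b in zip(cell, word)) for word in codebook]
        counts[dists.index(min(dists))] += 1
    return [c / len(cells) for c in counts]

# Hypothetical codebook and two simulated cell populations (assumptions).
codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
homogeneous = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (1.0, 1.0)]
heterogeneous = [(0.1, 0.0), (1.0, 0.9), (2.1, 2.0), (1.9, 2.2)]

print(median_descriptor(homogeneous))
print(bow_descriptor(homogeneous, codebook))    # mass concentrated in one bin
print(bow_descriptor(heterogeneous, codebook))  # mass spread across bins
```

In this toy case the homogeneous population falls entirely into one histogram bin, while the heterogeneous population spreads over all three, which is the narrow versus flat profile contrast that the figure describes.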
Figure 3. (A) Single-task learning setup. Independent models are trained for predicting bioactivity for each kinase. For example, Task-1 represents the task of predicting the bioactivity of all compounds with a particular protein kinase (say AAK1). (B) Multi-task learning. A shared model is used to extract an initial set of features from all the drugs. These features are later used as inputs to smaller models which can make predictions on all the tasks.
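The multi-task layout in panel (B) can be sketched as one shared layer followed by one small output head per task. The sketch below only runs an untrained forward pass; the layer sizes, the ReLU and sigmoid choices, the random initialization, and the class name `MultiTaskNet` are all assumptions made for illustration, since the caption does not specify the architecture.

```python
import math
import random

def dense(x, weights, biases):
    """Affine layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

class MultiTaskNet:
    """Shared feature extractor plus one single-output head per task."""

    def __init__(self, n_in, n_hidden, n_tasks, seed=0):
        rng = random.Random(seed)
        self.shared = init_layer(rng, n_hidden, n_in)
        self.heads = [init_layer(rng, 1, n_hidden) for _ in range(n_tasks)]

    def predict(self, x):
        # Shared features are computed once and reused by every task head.
        features = relu(dense(x, *self.shared))
        return [sigmoid(dense(features, *head)[0]) for head in self.heads]

net = MultiTaskNet(n_in=4, n_hidden=3, n_tasks=5)
probabilities = net.predict([1.0, 0.0, 1.0, 0.0])
print(probabilities)  # one interaction score per task (untrained, so arbitrary)
```

A single-task setup as in panel (A) would instead train a separate, independent model for each task, with no shared layer.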
Single-task learning—compound profile prediction AUC results of protein kinases using either population-median-based descriptor or bag-of-words-based descriptor.
| Model | AUC, median feature (95% confidence interval) | AUC, bag-of-words feature (95% confidence interval) |
|---|---|---|
| KNN ( | 0.68 (0.65-0.69) | 0.68 (0.65-0.72) |
| Logistic regression | 0.8 (0.79-0.83) | 0.84 (0.81-0.87) |
| Linear SVM | 0.83 (0.79-0.86) | 0.82 (0.8-0.86) |
| Random forest | 0.83 (0.81-0.85) | 0.83 (0.81-0.85) |
| 2-layered neural network | 0.82 (0.8-0.84) | 0.84 (0.79-0.86) |
| Multi-task neural network | 0.85 (0.84-0.86) | 0.86 (0.84-0.87) |
Abbreviations: AUC, area under the receiver operating characteristic curve; KNN, k-nearest neighbor; SVM, support vector machine.
Even a simple linear SVM performs well on the test set (AUC of 0.82 to 0.83). Average AUCs and confidence intervals are calculated over 100 independent trials.
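For readers who want to reproduce this kind of summary, the following is a minimal sketch of how an AUC and its average over repeated trials might be computed. The text does not say how the confidence intervals were obtained, so the nearest-rank percentile interval used here is an assumption, as are the function names and the toy labels and scores.

```python
import math

def roc_auc(labels, scores):
    """AUC as the probability that a random positive is scored above a
    random negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def summarize(values, coverage=0.95):
    """Mean plus a nearest-rank percentile interval over independent trials
    (an assumed interval method, not necessarily the one used in the study)."""
    ordered = sorted(values)
    n = len(ordered)
    tail = (1.0 - coverage) / 2.0
    lo_idx = max(0, math.ceil(tail * n) - 1)
    hi_idx = min(n - 1, math.ceil((1.0 - tail) * n) - 1)
    return sum(ordered) / n, ordered[lo_idx], ordered[hi_idx]

# Toy example: two interacting pairs (label 1) and three non-interacting pairs.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(roc_auc(labels, scores))

# Hypothetical per-trial AUCs standing in for the 100 trials in the text.
trial_aucs = [0.80, 0.82, 0.84, 0.86, 0.88]
print(summarize(trial_aucs))
```

The pairwise definition is quadratic in the number of samples, which is adequate for a sketch; a rank-based implementation would be preferable for large test sets.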
Figure 4. Receiver operating characteristic (ROC) curves of a multi-task neural network for predicting drug-target interactions with BoW-based kinase features as inputs. Each ROC curve corresponds to 1 experiment. The average AUC of all the ROC curves is 0.86, with a 95% confidence interval of 0.84-0.87. AUC indicates area under the receiver operating characteristic curve; ROC, receiver operating characteristic.
Single-task learning—compound profile prediction AUPR results of protein kinases using either population-median-based descriptor or bag-of-words-based descriptor.
| Model | AUPR, median feature (95% confidence interval) | AUPR, bag-of-words feature (95% confidence interval) |
|---|---|---|
| KNN ( | 0.16 (0.13-0.18) | 0.15 (0.13-0.17) |
| Logistic regression | 0.25 (0.22-0.3) | 0.33 (0.29-0.37) |
| Linear SVM | 0.3 (0.28-0.36) | 0.29 (0.23-0.33) |
| Random forest | 0.26 (0.24-0.3) | 0.26 (0.24-0.27) |
| 2-layered neural network | 0.3 (0.27-0.33) | 0.32 (0.29-0.36) |
| Multi-task neural network | 0.3 (0.27-0.33) | 0.32 (0.29-0.34) |
Abbreviations: AUPR, area under the precision-recall curve; KNN, k-nearest neighbor; SVM, support vector machine.
Average AUPR scores and their confidence intervals are calculated over 100 independent trials.
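As a point of reference for the AUPR values, the sketch below computes average precision, a common step-wise estimate of the area under the precision-recall curve. Whether the study used this estimator or an interpolated one is not stated, so this is an assumption; the function name and toy data are hypothetical.

```python
def average_precision(labels, scores):
    """Average precision: mean of the precision values at the rank of each
    positive, with samples ranked by descending score. Tied scores are
    ordered arbitrarily in this sketch."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos

labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(average_precision(labels, scores))
```

Unlike AUC, whose chance level is 0.5, the chance level of AUPR is roughly the fraction of positive samples, so AUC and AUPR values are not on the same scale and should not be compared directly.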
Drug activity prediction results using ECFPs as input features.
| Model | AUC (with 95% confidence interval) | AUPR (with 95% confidence interval) |
|---|---|---|
| KNN ( | 0.68 (0.63-0.7) | 0.2 (0.14-0.24) |
| Logistic regression | 0.8 (0.77-0.81) | 0.25 (0.22-0.29) |
| Linear SVM | 0.77 (0.76-0.79) | 0.21 (0.19-0.26) |
| Random forest | 0.8 (0.78-0.83) | 0.22 (0.17-0.28) |
| 2-layered neural network | 0.72 (0.7-0.74) | 0.12 (0.1-0.14) |
| Multi-task neural network | 0.8 (0.76-0.83) | 0.22 (0.19-0.26) |
Abbreviations: AUC, area under the receiver operating characteristic curve; AUPR, area under the precision-recall curve; ECFPs, extended-connectivity fingerprints; KNN, k-nearest neighbor; SVM, support vector machine.
Multi-task network and random forest methods are marginally better than the other methods.
| Algorithm 1. Alternating optimization for matrix factorization. |
|---|
Here . Note that this loss function does not contain squared terms. We have used this loss because it is faster to optimize and has the same properties as the loss function in equation (1); prevIterLoss = loss function value in previous iteration; IterLoss = loss function value in present iteration.
argmin: Minimization is done using the cvxopt function in the CVXPY toolbox. This function uses the stochastic gradient descent (SGD) method to find the global minimum of any convex function. cvxopt outputs the minimizer and the corresponding loss function value. If the loss function value no longer decreases over successive iterations, the algorithm is considered to have converged to an optimal solution. The entire implementation on our dataset is provided on GitHub.[39]
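The alternating scheme can be illustrated with a minimal, self-contained sketch. It deliberately swaps in a different loss from the one described above: a squared error on observed entries with an L2 penalty, for which each alternating step has a closed-form solution, so no CVXPY call is needed. The function names, the rank, the penalty weight `lam`, the tolerance, and the toy interaction matrix are all assumptions; the stopping test mirrors the comparison of prevIterLoss with IterLoss.

```python
import random

def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def update_factors(R, observed, fixed, lam):
    """Exactly minimize the loss over one factor matrix with `fixed` held constant."""
    k = len(fixed[0])
    updated = []
    for values, seen in zip(R, observed):
        a = [[lam if p == q else 0.0 for q in range(k)] for p in range(k)]
        b = [0.0] * k
        for value, is_seen, f in zip(values, seen, fixed):
            if is_seen:  # only observed entries contribute
                for p in range(k):
                    b[p] += value * f[p]
                    for q in range(k):
                        a[p][q] += f[p] * f[q]
        updated.append(solve(a, b))
    return updated

def loss(R, observed, U, V, lam):
    """Squared error on observed entries plus an L2 penalty on both factors."""
    err = sum((R[i][j] - sum(u * v for u, v in zip(U[i], V[j]))) ** 2
              for i in range(len(R)) for j in range(len(R[0])) if observed[i][j])
    return err + lam * sum(x * x for row in U + V for x in row)

def transpose(m):
    return [list(col) for col in zip(*m)]

def factorize(R, observed, rank=2, lam=0.1, tol=1e-6, max_iter=100, seed=0):
    rng = random.Random(seed)
    U = [[rng.uniform(-1, 1) for _ in range(rank)] for _ in R]
    V = [[rng.uniform(-1, 1) for _ in range(rank)] for _ in R[0]]
    history = [loss(R, observed, U, V, lam)]
    for _ in range(max_iter):
        U = update_factors(R, observed, V, lam)                        # fix V, solve for U
        V = update_factors(transpose(R), transpose(observed), U, lam)  # fix U, solve for V
        history.append(loss(R, observed, U, V, lam))
        if history[-2] - history[-1] < tol:  # prevIterLoss - IterLoss
            break
    return U, V, history

# Toy drug-by-kinase interaction matrix; False marks unknown entries (assumed data).
R = [[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 1]]
observed = [[True, True, True, False], [True, True, False, True], [True, True, True, True]]
U, V, history = factorize(R, observed)
print(history[0], history[-1])
```

Because each half-step solves its subproblem exactly, the recorded loss cannot increase from one iteration to the next, which is what makes a simple decrease-based stopping rule workable. Predictions for unknown entries are the inner products of the corresponding rows of `U` and `V`.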
Figure 5. Performance of the collaborative filtering (using LRMF) and kinase-feature-based methods (logistic regression model) at varying levels of training data sparsity. AUC indicates area under the receiver operating characteristic curve; LRMF, low-rank matrix factorization.