
Deep Neural Network-Assisted Drug Recommendation Systems for Identifying Potential Drug-Target Interactions.

Yogesh Kalakoti¹, Shashank Yadav¹, Durai Sundar¹,².

Abstract

In silico methods to identify novel drug-target interactions (DTIs) have gained significant importance over conventional techniques owing to their labor-intensive and low-throughput nature. Here, we present a machine learning-based multiclass classification workflow that segregates interactions between active, inactive, and intermediate drug-target pairs. Drug molecules, protein sequences, and molecular descriptors were transformed into machine-interpretable embeddings to extract critical features from standard datasets. Tools such as CHEMBL web resource, iFeature, and an in-house developed deep neural network-assisted drug recommendation (dNNDR)-featx were employed for data retrieval and processing. The models were trained with large-scale DTI datasets, which reported an improvement in performance over baseline methods. External validation results showed that models based on att-biLSTM and gCNN could help predict novel DTIs. When tested with a completely different dataset, the proposed models significantly outperformed competing methods. The validity of novel interactions predicted by dNNDR was backed by experimental and computational evidence in the literature. The proposed methodology could elucidate critical features that govern the relationship between a drug and its target.

Entities: Chemical

Year: 2022 PMID: 35449922 PMCID: PMC9016825 DOI: 10.1021/acsomega.2c00424

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

Identifying novel drug–target interactions (DTIs) is considered a stagnant and labor-intensive process. A conventional drug discovery and development workflow can extend to about 14 years and drain almost a billion USD in capital.[1,2] Lead identification, optimization, screening, and characterization are a few of the many steps in an assay-based drug discovery workflow. With the advent of big data and computational advances, in silico methods have found utility in predicting novel DTIs, ultimately aiding the process of drug discovery.[3,4] While traditional workflows fare better than in silico alternatives in terms of reliability and robustness, analysis and characterization of vast volumes of data are not possible due to their inherent throughput limitations. Computer-aided DTI estimation methods roughly fall into two classes—biophysical models and statistical methods. Biophysical methods such as molecular dynamics try to replicate a biological arrangement in silico under a set of physical constraints and infer DTIs at a molecular level.[5,6] Such methods are limited by the availability of molecular structures and computational restraints.[7] Alternatively, statistical approaches such as support vector machines, kernel learning, and supervised bipartite graphs extract information from interaction data and make logical inferences.[8] Until recently, using these solutions, most DTI studies were limited to the datasets wherein drug–target pairs with no known affinity were included, thus neglecting any measure of binding affinity altogether.[8−12] With the recent advances in high-throughput assays, chemoproteomic approaches such as thermal profiling have allowed the quantification of compound potency on a broader scale.[13] DTI is generally quantified using measures such as dissociation constant (Kd), inhibition constant (Ki), or half-maximal inhibitory concentration (IC50). 
With the availability of large, curated DTI datasets such as DrugBank in the public domain, numerous computer-aided methods have been developed that complement the process of drug discovery.[14−17] Deep learning methods find their application in almost any research field that generates one or another form of data. It has found its relevance in computer vision, natural language processing (NLP), genomics, and drug discovery.[18,19] The most significant advantage of deep learning architectures is that they can model nonlinear relationships in data and generate a better representation that ultimately aids the learning process.[20,21] In sequence form, drug and protein data have been used with convolutional neural networks (CNNs) to extract local residue patterns and predict binding affinities.[22] However, the functioning/characteristics of a protein or drug depend on the order of its elements rather than the elements themselves. To include this vital concept into a DTI prediction system, it was proposed to harness the potential of state-of-the-art machine learning (ML) algorithms such as bi-directional long short-term memory (biLSTM) and graph CNN (gCNN). These algorithms efficiently incorporate sequential information to make sense of its ordering and can be trained to predict the type of interaction between a given drug–target pair. In addition to the sequential information, small molecules have an inherent graphical structure, which is lost when represented and processed in one-hot encodings of text-based formats such as SMILES. Recent studies suggested that retaining the underlying graphical structure of a drug molecule could allow the model to represent them in an efficient manner.[23,24] The combination of gCNN and biLSTM blocks was introduced to process the structure of chemical compounds from simplified molecular input line entry system (SMILES) and protein sequences, respectively. 
Moreover, molecular descriptors for proteins and drugs were also used to train the complete network. The generated representations and molecular descriptors were then fed into a fully connected feed-forward neural network to make sense of a multiclass classification problem.

Materials and Methods

Formulation of the Problem

For a list of probable drug–target pairs, the aim was to segregate the samples into (i) Class-I: Active, (ii) Class-II: Intermediate, or (iii) Class-III: Inactive. The proposed methodology followed a three-step process: (i) processing and labeling DTI data into predefined classes, (ii) developing and training a multiclass classification model for given drug–target pairs, and (iii) inference and validation using an external dataset to infer real-world performance. A multiclass classification approach was chosen to efficiently understand DTIs rather than the more conventional binary classification because (i) most of the binary classification tasks tend to label nontested drug–target combinations as a negative data point and (ii) even in the case where we have the activity profile of the drug–target pair in terms of IC50, Kd, or Ki, a single activity threshold is not uniformly followed in the literature. Furthermore, the conventional binary classification task has some inherent drawbacks and inadequacies. The most evident is the need for a predefined binarization threshold, often arbitrarily decided. To mitigate the issues mentioned earlier, binding affinities were segregated into three categories based on the magnitude of their value. The choice of activity thresholds was central to the overall objective of the proposed method. Current literature was extensively explored to formalize static thresholds that are a good indicator of the activity or inactivity of a drug–target pair. An IC50 value of <0.1 μM was a good indicator of an active DTI.[25] Similarly, DTIs with an IC50 value of >30 μM can be considered inactive pairs. The remaining DTI data points were grouped under the intermediate category. 
We have previously explored this segregation methodology in a similar study.[26] Following this criterion, a total of 7057 active (Class-I) interactions, 24,752 intermediate (Class-II) interactions, and 28,748 inactive (Class-III) interactions were fed into deep neural network-assisted drug recommendation (dNNDR) models. The overall input statistics are summarized in Figure 1B.
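The labeling rule described above can be sketched as follows. This is a minimal illustration, assuming IC50 values are already expressed in μM; the function name `label_interaction` is hypothetical and not taken from the dNNDR code base.

```python
def label_interaction(ic50_um: float) -> str:
    """Assign a DTI class from an IC50 value given in micromolar."""
    if ic50_um < 0.1:
        return "active"        # Class-I: IC50 < 0.1 uM
    if ic50_um > 30.0:
        return "inactive"      # Class-III: IC50 > 30 uM
    return "intermediate"      # Class-II: everything in between


# A few illustrative (made-up) IC50 values, in uM.
examples = [0.05, 0.1, 2.5, 30.0, 45.0]
labels = [label_interaction(v) for v in examples]
print(labels)
```

Values that fall exactly on a threshold (0.1 μM or 30 μM) land in the intermediate class here, because the text defines the active and inactive classes with strict inequalities.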

Figure 1

General overview of the overall methodology. (A) Primary DTI data was collected, processed, screened, and encoded. The statistic of the screened dataset is summarized in (B). (C) An attention-biLSTM network was constructed for the protein sequences and (D) a gCNN derivative was employed for drug SMILES. (E) A fully connected feed-forward neural network taking inputs from attention-biLSTM and gCNN and (F) molecular descriptors were trained for a (G) multiclass classification problem in a fivefold cross-validated setup.

Chemical Datasets

The primary evaluation of dNNDR models was done on drugs from the Davis kinase dataset and the KIBA dataset.[27,28] These datasets have previously been used in similar studies and serve as benchmarks for DTI prediction tasks.[29,30] Drug molecules for the KIBA drugs were retrieved as SMILES strings using in-house programs in tandem with ChEMBL web services.[31] The SMILES string for each drug was encoded as an undirected graph to feed into a compatible ML framework. Moreover, we calculated 111 drug descriptors for the retrieved molecules using RDKit, an open-source cheminformatics framework available at http://www.rdkit.org. Applying these operations to all the drugs provided two feature matrices describing sequence information and chemical characteristics. Similarly, a data retrieval pipeline was built to extract protein sequences for the retrieved drug–target accession identifiers using UniProt’s web framework; the sequences were stored as FASTA files.[32] The amino acid sequence of each target was encoded as a one-hot vector to represent it in a machine-interpretable form. In addition to the sequence information, feature matrices for all the protein targets were computed using iFeature. This Python package simplifies the computation of sequence-level characteristics such as amino acid composition (AAC) and composition/transition/distribution (CTD), among others, for any given protein.[33] To further simplify feature extraction, a graphical user interface was developed. We termed it dNNDR-featx; it can be used to extract multiple types of protein and drug descriptors without any command-line tool. The general interface and basic functionalities of dNNDR-featx are depicted in Figure S1. It is an open-source utility, freely available at https://github.com/TeamSundar/dNNDR-featx.
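To make the simplest of the sequence-level descriptors concrete, the sketch below computes the amino acid composition (AAC), that is, the fraction of each of the 20 standard amino acids in a sequence. It is a standalone illustration and not iFeature's code; the function name and the example sequence are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence: str) -> dict:
    """Return the fraction of each standard amino acid in the sequence."""
    # Non-standard symbols (e.g., X) are ignored in this sketch.
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    total = len(residues)
    return {aa: residues.count(aa) / total for aa in AMINO_ACIDS}


aac = amino_acid_composition("MKTAYIAKQR")  # made-up 10-residue peptide
print(aac["A"], aac["K"])
```

The result is a fixed-length vector of 20 values regardless of sequence length, which is what makes descriptors of this kind easy to concatenate with learned representations.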

Pharmacological Data

The KIBA dataset aggregates the bioactivity of drug–target pairs as a custom-unified metric, combining the values from IC50, Kd, and Ki measures.[28] As the activity thresholds were not well defined in the KIBA dataset, IC50 values for the interacting pairs were extracted directly from ChEMBL. Although a large pool (∼0.2 M) of DTIs was retrieved from ChEMBL, a large fraction of it was filtered out due to nonstandard/missing activity values and incomplete information. After all the preprocessing steps, 30,474 compounds, 981 targets, and 61,624 interactions were retained. All the datasets used are summarized in Table 1.

Table 1

General Summary of All the Datasets Used in the Study

dataset	proteins	compounds	interactions
KIBAᵃ (IC50)	961	30,474	61,624
Davis Metzᵇ	237	18	4255
Davis Anastasiadisᵇ	154	24	2575

ᵃ Proteins for drugs listed in the KIBA dataset were extracted manually from CHEMBL.

ᵇ Used as an external validation dataset.

Model Architecture

All the models were implemented in Python; Scikit-learn, TensorFlow-Keras, and Spektral were used to implement the ML algorithms. Metrics such as accuracy, auROC, auPR, and macro-averaged F1-score were employed to quantify the performance of the proposed and baseline models. A modular ML architecture was designed for the problem at hand. As protein sequences and drug SMILES differ fundamentally in how they carry information, different ML methods were employed to handle them. The first method employed a biLSTM architecture coupled with attention (att-biLSTM) for sequential data such as SMILES and protein sequences.[34] In cases where the molecular structure of a drug was used in the network, gCNN was employed owing to its ability to represent molecules efficiently. These methods were integrated into three different model architectures, termed dNNDRa (att-biLSTM), dNNDRb (att-biLSTM + descriptors), and dNNDRc (biLSTM + gCNN). All the models are described in Table 2, while graphical representations of the model architectures are compiled in Figures S2–S6.

Table 2

Key Characteristics of All the Models Tested in the Study

method	architecture highlights	data
sequence-basedᵃ	1-D convolution	SMILES and protein sequences
CTD-basedᵃ	1-D convolution	SMILES, protein sequences, and CTD features
dNNDRa	Att-biLSTM	SMILES and protein sequences
dNNDRb	Att-biLSTM	SMILES, protein sequences, and molecular descriptors
dNNDRc	biLSTM and gCNN	SMILES and protein sequences

ᵃ Baseline methods.

Input Representation

To apply any mathematical operation to the sequences, they must be converted into a machine-compatible format. The sequences were therefore represented using integer encoding. Forty-four unique categories of characters from SMILES and 21 unique characters from protein sequences were each assigned a unique integer (e.g., “C”:1, “N”:2, “O”:3). For instance, “CN=C=O” was encoded as [C N = C = O] = [1 2 35 1 35 3]. Protein sequences were encoded in the same way using 21 unique characters. The sequence length limits were decided based on exploratory data analysis (EDA). Based on the mean sequence lengths calculated by EDA, maximum lengths of 150 and 1000 were chosen for SMILES and protein sequences, respectively. The same can be visualized from the distribution of sequence lengths for all the included datasets (Figure 2C). All the sequences were trimmed or padded to match these limits. For the models that included descriptors, the additional features were concatenated with the existing ones. Details of the input dimensions are summarized in Table 3.
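The integer encoding with trimming and padding can be sketched as below. The vocabulary is deliberately partial and contains only the four symbols whose codes appear in the example in the text; padding with zeros on the right and trimming from the right are assumptions, since the text does not state which side was used. The function name `encode_sequence` is hypothetical.

```python
# Partial vocabulary taken from the worked example in the text
# ("C":1, "N":2, "O":3, "=":35); the full study used 44 SMILES categories.
SMILES_VOCAB = {"C": 1, "N": 2, "O": 3, "=": 35}
MAX_SMILES_LEN = 150  # maximum SMILES length reported in the text


def encode_sequence(seq, vocab, max_len, pad_value=0):
    """Integer-encode a sequence, then trim or right-pad it to max_len."""
    codes = [vocab[ch] for ch in seq[:max_len]]   # trim (assumed: from the right)
    return codes + [pad_value] * (max_len - len(codes))  # pad (assumed: zeros)


encoded = encode_sequence("CN=C=O", SMILES_VOCAB, MAX_SMILES_LEN)
print(len(encoded))   # 150
print(encoded[:8])    # [1, 2, 35, 1, 35, 3, 0, 0]
```

The same function would apply to protein sequences with a 21-symbol vocabulary and a maximum length of 1000.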

Figure 2

Overall statistics of the datasets used in the study. (A,B) A general overview of multiple interactions in the datasets is depicted for drugs and targets. (C) Estimation of the sequence thresholds for drug SMILES and protein sequences. The length of SMILE strings and protein sequences does not exceed a value of about 150 and 2000 for any dataset. (D) Number of data points and an estimate of their activities for each of the three interaction groups.

Table 3

Summary of All the Models under Study and the Final Dimensions of the Data after Processing

method	drug feature dimension	protein feature dimension
sequence-based	150 × 44	1000 × 21
CTD-based	150 × 44	(1000 × 21) + 273
dNNDRa	150 × 44	1000 × 21
dNNDRb	(150 × 44) + (150 × 111)	1000 × 21
dNNDRc	(123 × 123) + (123 × 17) + (123 × 3)	1000 × 21

biLSTM with Attention and 1-D Convolution

LSTM and biLSTM are two of the best and most-used modifications of recurrent neural networks (RNNs) in the field of NLP.[35−38] They try to learn long-term dependencies between sequence data such that the model can pass on critical information to the terminal layers of the network. By doing so, they have been proven to be a stable and powerful way of modeling long-term dependencies, as in the case of long amino acid sequences. On the other hand, attention helps the network focus on the given input and extract the most important information iteratively. biLSTM was preferred over vanilla LSTM because (i) it learns faster than conventional LSTM and (ii) it has a better contextual understanding.[35] dNNDR models had an embedding layer to start with, followed by biLSTM and attention layers.

Embedding Layer

To draw out the semantic information of the amino acid/SMILES sequences, each sequence was first represented as a sequence of embeddings:

$$s = [\vec{e}_1, \vec{e}_2, \ldots, \vec{e}_n]$$

where $\vec{e}_i$ denotes the embedding vector of the ith amino acid/atom with dimension d, n is the sequence length, and s represents the whole sequence as a global vector.[39]

biLSTM Layer

biLSTM is a gradual extension of the traditional LSTM network used to obtain high-level features with sequential information. For the model to have a sense of the sequence order, both forward and backward LSTM outputs were employed. Therefore, the final output was calculated using the element-wise sum of both forward and backward outputs for each character. A detailed description of the working of the LSTM network is out of the scope of this study and can be found elsewhere.[34]

Attention Layer

It is widely accepted that regions of functional and structural importance exist in any protein sequence.[40] By its nature, the attention mechanism can pinpoint locations of relevance in a long protein/SMILES sequence so that the succeeding layers receive the most critical information. In the attention mechanism, Y is the output vector coming from the biLSTM layer (see the biLSTM Layer section), Rave is the output from the mean pooling layer, α is the attention vector, and Ratt is the attention-weighted sequence representation. After all these operations, Ratt is fed to the final layers of the network for further training.
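As a rough illustration of how these quantities relate, the sketch below implements one common attention-pooling formulation. It is an assumption, not the authors' exact equations: each biLSTM output vector is scored by its dot product with the mean-pooled vector, the scores are normalized with a softmax to give α, and the outputs are summed with those weights to give Ratt. All names and the toy input are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(Y):
    """Y: list of T output vectors (each of length d). Returns (alpha, r_att)."""
    T, d = len(Y), len(Y[0])
    # Mean pooling over positions gives a global summary vector (R_ave).
    r_ave = [sum(y[j] for y in Y) / T for j in range(d)]
    # Assumed scoring function: dot product of each position with R_ave.
    scores = [sum(y[j] * r_ave[j] for j in range(d)) for y in Y]
    alpha = softmax(scores)  # attention weights, one per position
    # Attention-weighted sum of the outputs (R_att).
    r_att = [sum(alpha[t] * Y[t][j] for t in range(T)) for j in range(d)]
    return alpha, r_att


# Toy "biLSTM output": 3 positions, 2 features each.
Y = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1]]
alpha, r_att = attention_pool(Y)
print(alpha, r_att)
```

In this toy input the second position dominates the mean, so it receives the largest weight; in a trained network the scoring would normally involve learned parameters.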

gCNN Branch for Drugs

gCNN is an extension of conventional CNNs that can learn a graphical representation.[41] gCNN has attracted considerable attention, especially in drug discovery, due to the inherent compatibility of molecules to be represented as undirected graphs.[42−44] Molecular structures can be efficiently modeled as a graph, and various studies have demonstrated the effectiveness of such representations.[45] Given a molecule with its spatial coordinates, three matrices were created for its graphical representation: an edge matrix, a node matrix, and an adjacency matrix. More specifically, a subtle variation of gCNN called graph edge conditioned CNN (gECCNN) was employed for all practical purposes in this study. These three matrices served as inputs to the gECCNN network, followed by a series of edge convolutions. The complete gECCNN model was built with the help of Spektral, which is an open-source library for graph deep learning.[46] The complete network architecture is available in Supporting Information File 1.
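The three graph inputs can be illustrated on a small molecule. The sketch below builds a node matrix, an adjacency matrix, and an edge matrix for methyl isocyanate (CN=C=O) from a hand-written atom and bond list. The feature choices (one-hot element type for nodes, one-hot bond order for edges) are assumptions for illustration and do not reproduce the 17 node features or 3 edge features used in the study; in practice a cheminformatics toolkit would derive atoms and bonds from the SMILES string.

```python
ATOM_TYPES = ["C", "N", "O"]   # hypothetical node feature vocabulary
BOND_ORDERS = [1, 2, 3]        # single, double, triple


def molecule_to_matrices(atoms, bonds):
    """atoms: list of element symbols; bonds: list of (i, j, order) tuples."""
    n = len(atoms)
    # Node matrix: one row per atom, one-hot over element type.
    node = [[1 if a == t else 0 for t in ATOM_TYPES] for a in atoms]
    # Adjacency matrix: symmetric because the graph is undirected.
    adj = [[0] * n for _ in range(n)]
    # Edge matrix: one row per bond, one-hot over bond order.
    edge = []
    for i, j, order in bonds:
        adj[i][j] = adj[j][i] = 1
        edge.append([1 if order == b else 0 for b in BOND_ORDERS])
    return node, adj, edge


# Methyl isocyanate, CN=C=O: C-N single bond, N=C and C=O double bonds.
atoms = ["C", "N", "C", "O"]
bonds = [(0, 1, 1), (1, 2, 2), (2, 3, 2)]
node, adj, edge = molecule_to_matrices(atoms, bonds)
print(adj)
```

Padding every molecule to a fixed number of atoms, as the fixed input dimensions reported for dNNDRc suggest, would be a further step on top of this.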

Loss and Model Optimization

The information propagating from the att-biLSTM or gECCNN branches was passed through two fully connected layers of different sizes. The neuron sizes for the dense layers were optimized for performance using a manual grid search. Training time depended on the type of model being trained but remained under a day for all the models except dNNDRc. As this is a multiclass classification problem, categorical cross-entropy was used as the loss function. It was defined as a sum of losses over the class labels coming out of a softmax function and is mathematically given by

$$L_i = -\sum_{c=1}^{N} y_{i,c} \log(p_{i,c})$$

Here, N is the number of classes (three in this case), i denotes the data point, c indexes the class labels, y is the binary target indicator (0 or 1), and p is the model prediction.
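A minimal sketch of the loss for a single data point follows, assuming raw network outputs (logits) are first passed through a softmax. The small `eps` clamp is an added numerical safeguard and the example logits are made up.

```python
import math


def softmax(logits):
    """Convert raw scores into class probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_true, probs, eps=1e-12):
    """y_true: one-hot target list; probs: predicted class probabilities."""
    # Only the true class contributes, because y is 0 for every other class.
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, probs))


probs = softmax([2.0, 1.0, 0.1])          # three classes, as in the study
loss = categorical_cross_entropy([1, 0, 0], probs)
print(probs, loss)
```

For a one-hot target the sum collapses to the negative log-probability assigned to the true class, so a uniform prediction over three classes gives a loss of log 3.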

Evaluation Metrics

The auROC, the auPR, accuracy, and macro-averaged F1-score were used to evaluate the performances for comparative analysis. As identifying false-negatives and false-positives is vital for a drug–target estimation system, the F1-score was included as an evaluation metric. It is the harmonic mean of precision and recall,

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

computed for each class and then averaged over the three classes. auROC and auPR are less sensitive to class imbalance and serve as better indicators of performance than accuracy, which tends to become unreliable when the training data are imbalanced. Moreover, to balance the bias and variance trade-off, fivefold cross-validation was employed. It works by randomly splitting the dataset into five equal parts, training on four and testing on one, iteratively over five training rounds.
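The macro-averaged F1-score and the fivefold split can be sketched in plain Python as below. This is an illustrative reimplementation, not the Scikit-learn code the study relied on; the function names, the fixed seed, and the toy labels are assumptions.

```python
import random


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1-scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


def five_fold_indices(n_samples, seed=0):
    """Yield (train, test) index lists for five shuffled, disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::5] for k in range(5)]
    for k in range(5):
        train = [i for fold in folds[:k] + folds[k + 1:] for i in fold]
        yield train, folds[k]


# Toy labels: 0 = active, 1 = intermediate, 2 = inactive.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, classes=[0, 1, 2])
splits = list(five_fold_indices(10))
print(round(score, 4), [len(test) for _, test in splits])
```

Because every class contributes equally to the macro average, a model that neglects the smallest class (the active interactions here) is penalized even if its overall accuracy is high.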

Baseline Models for Comparison

Sequence-based methods aim to generalize the relationship between drug–target pairs based on their sequence characteristics. These methods generally rely on a similarity-based learning approach wherein similarity matrices of drugs and targets are constructed to infer novel interactions. The kernel regression, bipartite local method, and pairwise kernel method are some of those where these matrices are employed.[8,47] However, for heuristic approaches such as deep learning, most of the architectures of sequence-based models have a similar backbone. Similarly, molecular descriptor-based methods generally use CTD to represent a protein and its properties.[48,49] On similar lines, two architectures that represented sequence and descriptor-based approaches such as FRnet-DTI, DeepConv-DTI, and DeepDTA were used.[22,50,51] Of note, none of these methods were multiclass classification problems, and hence, a direct comparison was not possible. However, an indirect comparison through similar model architectures could be a valid approach for such cases.

Results

Three model architectures are proposed in this study and compared against baseline methods. The seed architecture on which all three models were built is described in Figure 1. The proposed models differ in the choice of data combinations and the way in which they were processed. For the first model, dNNDRa, we used biLSTM with the attention layer on SMILES and protein sequences along with 1-D convolution. For the second model, dNNDRb, 111 types of molecular descriptors were included along with the sequences. The architecture was similar to dNNDRa with the addition of a parallel descriptor input. Similarly, the third model, dNNDRc, used gCNN to handle drug data and biLSTM layers for protein data. The shapes of the inputs and outputs varied accordingly for each variation and are summarized in Table 3. The results suggested that the inclusion of biLSTM, attention, and gCNN in the architecture improved the performance over models that used only 1-D convolution for sequence features and similar transformations for numerical features. Table 4 reports the average value of all the performance metrics for all the models under consideration. Further, a benchmarking experiment was performed with the qm9 dataset, wherein LSTMs and gNNs were trained on a set of ∼100 k small molecules to predict eight quantum chemical properties.[52] The comparative analysis (Table S1) clearly indicated that gNNs outperformed LSTMs in terms of mean absolute error and coefficient of determination (R2). This provides significant evidence in support of utilizing the underlying graphical structure of small molecules, in addition to sequence-based processing methods, for building such predictive models. All the proposed methods outperformed the baseline models, with higher accuracy, auROC, auPR, and F1-score (Figure 3). It must be emphasized that the auROC of all the dNNDR variants matched or exceeded that of the baseline models for all three types of interactions (Figure 3A–C). 
Therefore, the proposed model accurately predicted the interactions and avoided giving false-negative and false-positive inferences to an extent.

Table 4

Performance of All Models under Considerationᵃ

method	auROC (Class I)	auROC (Class II)	auROC (Class III)	auPR	validation accuracy
seq based	0.86 (0.003)	0.90 (0.006)	0.85 (0.006)	0.83 (0.007)	0.74 (0.008)
CTD	0.87 (0.005)	0.90 (0.004)	0.86 (0.003)	0.83 (0.003)	0.75 (0.005)
dNNDRa	0.87 (0.007)	0.91 (0.005)	0.86 (0.004)	0.84 (0.005)	0.76 (0.005)
dNNDRb	0.88 (0.005)	0.93 (0.005)	0.87 (0.007)	0.86 (0.007)	0.77 (0.008)
dNNDRc	0.90 (0.002)	0.94 (0.001)	0.89 (0.002)	0.88 (0.003)	0.79 (0.004)

ᵃ All the models were fivefold cross-validated with the standard deviation mentioned alongside every result. The best-performing models among the cross-validated ones are marked in bold. Note: dNNDRa: att-biLSTM, dNNDRb: att-biLSTM + descriptors, dNNDRc: biLSTM + gCNN.

Figure 3

Comparison of performances among all the methods under observation. ROC curves for the three classes, namely, (A) active, (B) intermediate, and (C) inactive, indicate the effectiveness of dNNDR over baseline methods in predicting the type of interaction for a given drug–target pair. (D) Validation accuracy, (E) precision-recall curves, and (F) macro-averaged F1-score also follow a similar trend and reinforce the utility of the proposed methods.

External Validation Reinforces the Robustness of the Approach

Though the methodology demonstrated a significant improvement in performance over other methods (Figure 3), for any ML solution the ultimate criterion of effectiveness is generalizability. Therefore, two entirely different datasets were prepared for external validation of all the methods under consideration. For a fair comparison with benchmark methods, the model weights were updated to be compatible with a binary classification task. It is emphasized that only datapoints belonging to the active and inactive classes of the original training data were employed for updating the model weights. This was necessary because the model architecture was optimized to differentiate active and inactive datapoints from the intermediate class, and assigning intermediate datapoints to either region (active/inactive) would not be ideal. Minor modifications were made to the final layer of the model architecture to achieve this. Validation datasets from KIBA were processed using the same procedure as before to segregate datapoints into three classes, and inference was made (again using only the datapoints from the active and inactive classes) with the updated binary classification models. Outputs from all the baseline and existing methods were binarized if required (Table 5).

Table 5

External Validation Results Reinforcing the Effectiveness of the Proposed Modelsᵃ

method	Davis Anastasiadis accuracy	Davis Anastasiadis precision	Davis Anastasiadis F1-score	Davis Metz accuracy	Davis Metz precision	Davis Metz F1-score
sequence-based	0.69	0.60	0.64	0.66	0.56	0.60
CTD	0.19	0.03	0.06	0.22	0.05	0.08
dNNDRa	0.47	0.67	0.51	0.53	0.51	0.52
dNNDRb	0.76	0.58	0.66	0.70	0.59	0.58
DeepDTA	0.70	0.56	0.68	0.64	0.55	0.57
DeepConv-DTI	0.72	0.61	0.64	0.68	0.56	0.61

ᵃ The best-performing method has been marked in bold. dNNDRa: Att-biLSTM, dNNDRb: Att-biLSTM + descriptors, dNNDRc: biLSTM + gCNN.

Predicting Unknown DTIs

After the external validation of dNNDR models, prediction runs were performed for a set of proteins and drugs from a gold-standard dataset.[53] The goal was to predict all potential interactions that were not present in the dataset. dNNDRb was used for making the predictions. Salicylic acid, acetaminophen, mesalamine, and sodium salicylate were among the most commonly occurring compounds in the interactions predicted by the model. The majority of compounds were anti-inflammatory agents, COX inhibitors, analgesics, or neuropsychiatric agents. Similarly, proteins associated with purine metabolism, the cGMP-PKG signaling pathway, steroid hormone biosynthesis, linoleic acid metabolism, and chemical carcinogenesis were abundant in the novel interactions. It must be emphasized that most of the compounds designated as an interacting partner showed the properties of a drug molecule, such as aromaticity and active centers (Table 6). A complete list of predicted interactions is available on the GitHub repository.

Table 6

Novel DTIs Predicted by dNNDRbᵃ

ᵃ A subset of the novel predictions from drugs and proteins in the gold-standard dataset is reported here.

Discussion

In this study, a data-driven approach was attempted to overcome the demerits of framing DTI prediction as a binary classification task by introducing an intermediate region between active and inactive interactions. This makes for a more realistic and practical solution without compromising on performance. As with any sequential dataset, the order of the individual elements in the data is as crucial as the elements themselves. Therefore, much emphasis was placed on capturing this ordering by using attention, gCNN, and biLSTM networks. As described in the earlier section, dNNDR models consistently outperformed related methods in most performance metrics. While the other methods tended to deteriorate when tested on an entirely new dataset (external validation), dNNDR models, dNNDRb in particular, remained effective (Table 5). Furthermore, a graphical interface was designed to complement feature extraction for any DTI prediction task. The dNNDR-featx program accepts plain text files containing drug–target pairs and extracts the molecular descriptors selected by the user. A general overview of the interface is shown in Figure S1. Although only experimental screening can validate the binding energies of putative DTIs, these results indicated that the proposed models could mature into promising methods for the identification of novel DTIs. Ultimately, the proposed integrative approach recommended a set of DTIs that could be experimentally validated as promising leads for novel cancer therapies. Notably, experimental and computational evidence in the existing literature supported dNNDR’s DTI predictions (Supporting Information file 1). 
Acetaminophen has been shown to bind with CYP1A1 (cytochrome P450 family 1 subfamily A member 1), which was predicted by the proposed model.[54] CYP1A1 metabolizes acetaminophen to N-acetyl-p-benzoquinone imine (NAPQI), along with 3-hydroxy acetaminophen.[54] Salicylic acid, predicted to bind to mammalian carbonic anhydrases, has been reported to inhibit carbonic anhydrase I.[55] Aspirin (acetyl derivative of salicylic acid) is computationally predicted to interact with phosphodiesterase 7B (PDE7B) in another report.[56] Evidence of salicylamide inhibiting monoamine oxidase activity in rat liver and brain fractions has also been reported.[57] Such experimental studies that validate some of the predictions from dNNDR’s methodology reinforced confidence in the predictive ability of dNNDR. Although experimentation is the ultimate validation tool for binding characteristics of model recommendations and for any possible clinical applications, these results indicated that the proposed models could mature into promising methods for the identification of novel DTIs. It was observed that although dNNDR models performed relatively well, there is much scope for improvement. In the case of dNNDRc, while it performed significantly better than most of the other methods, the size of the model exponentially increased with the inclusion of high-dimensional representations in the form of a graph matrix, a node matrix, and an adjacency matrix (Table and Figure ). This led to a massive increase in training time and limited the ability to optimize the model efficiently. Hence, it must be noted that although dNNDRc showed great promise, it was excluded from the final inference.

Conclusions

Using the ordered information present in SMILES and protein sequences was central to the idea of dNNDR. To solve a multiclass classification problem, att-biLSTM and gCNN were employed to learn representations from raw sequence data and molecular descriptors. These methods were compared extensively with baseline methods on various measures of performance. The results obtained in this study reinforced the idea of using representations that try to capture the underlying order in sequential data. Including att-biLSTM and gCNN served this purpose, and as a result, a significant improvement in performance over the baseline methods was observed. Moreover, dNNDR’s effectiveness was evident in the external validation setup, where the models consistently outperformed the baseline models by a healthy margin. In the future, dNNDRc can be made more efficient in terms of hardware utilization, and further work can focus on the interpretability of the proposed models. It is also proposed to integrate the proposed ML models into the dNNDR-featx interface in the near future to make it a standalone drug recommendation program. Analyzing the details of what the model is learning can be of great utility in improving the methodology further. Furthermore, the idea of using structural information of proteins for DTI prediction remains of immense utility.[58] Therefore, the idea is to represent the spatial information provided by the protein 3D structure to improve the proposed methodology further. Although only experimental screening can validate the binding energies of putative DTIs, these results indicated that the proposed models could mature into promising methods for the identification of novel DTIs.

51 in total

1. Carbonic anhydrase inhibitors: inhibition of mammalian isoforms I-XIV with a series of substituted phenols including paracetamol and salicylic acid.

Authors: Alessio Innocenti; Daniela Vullo; Andrea Scozzafava; Claudiu T Supuran
Journal: Bioorg Med Chem Date: 2008-06-13 Impact factor: 3.641

Review 2. Deep learning.

Authors: Yann LeCun; Yoshua Bengio; Geoffrey Hinton
Journal: Nature Date: 2015-05-28 Impact factor: 49.962

3. Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis.

Authors: Jing Tang; Agnieszka Szwajda; Sushil Shakyawar; Tao Xu; Petteri Hintsanen; Krister Wennerberg; Tero Aittokallio
Journal: J Chem Inf Model Date: 2014-02-21 Impact factor: 4.956

Review 4. Deep learning in bioinformatics.

Authors: Seonwoo Min; Byunghan Lee; Sungroh Yoon
Journal: Brief Bioinform Date: 2017-09-01 Impact factor: 11.622

Review 5. Deep Learning in Drug Discovery.

Authors: Erik Gawehn; Jan A Hiss; Gisbert Schneider
Journal: Mol Inform Date: 2015-12-30 Impact factor: 3.353

10. nifPred: Proteome-Wide Identification and Categorization of Nitrogen-Fixation Proteins of Diaztrophs Based on Composition-Transition-Distribution Features Using Support Vector Machine.

Authors: Prabina K Meher; Tanmaya K Sahu; Jyotilipsa Mohanty; Shachi Gahoi; Supriya Purru; Monendra Grover; Atmakuri R Rao
Journal: Front Microbiol Date: 2018-05-29 Impact factor: 5.640