Yogesh Kalakoti1, Shashank Yadav1, Durai Sundar1,2. 1. DAILAB, Department of Biochemical Engineering & Biotechnology, Indian Institute of Technology (IIT) Delhi, New Delhi 110016, India. 2. School of Artificial Intelligence, Indian Institute of Technology (IIT) Delhi, New Delhi 110016, India.
Abstract
The identification of novel drug-target interactions is a labor-intensive and low-throughput process. In silico alternatives have proved to be of immense importance in assisting the drug discovery process. Here, we present TransDTI, a multiclass classification and regression workflow employing transformer-based language models to segregate interactions between drug-target pairs as active, inactive, and intermediate. The models were trained with large-scale drug-target interaction (DTI) data sets, which reported an improvement in performance in terms of the area under receiver operating characteristic (auROC), the area under precision recall (auPR), Matthew's correlation coefficient (MCC), and R2 over baseline methods. The results showed that models based on transformer-based language models effectively predict novel drug-target interactions from sequence data. The proposed models significantly outperformed existing methods like DeepConvDTI, DeepDTA, and DeepDTI on a test data set. Further, the validity of novel interactions predicted by TransDTI was found to be backed by molecular docking and simulation analysis, where the model prediction had similar or better interaction potential for MAP2k and transforming growth factor-β (TGFβ) and their known inhibitors. Proposed approaches can have a significant impact on the development of personalized therapy and clinical decision making.
The identification of novel drug-target interactions is a labor-intensive and low-throughput process. In silico alternatives have proved to be of immense importance in assisting the drug discovery process. Here, we present TransDTI, a multiclass classification and regression workflow employing transformer-based language models to segregate interactions between drug-target pairs as active, inactive, and intermediate. The models were trained with large-scale drug-target interaction (DTI) data sets, which reported an improvement in performance in terms of the area under receiver operating characteristic (auROC), the area under precision recall (auPR), Matthew's correlation coefficient (MCC), and R2 over baseline methods. The results showed that models based on transformer-based language models effectively predict novel drug-target interactions from sequence data. The proposed models significantly outperformed existing methods like DeepConvDTI, DeepDTA, and DeepDTI on a test data set. Further, the validity of novel interactions predicted by TransDTI was found to be backed by molecular docking and simulation analysis, where the model prediction had similar or better interaction potential for MAP2k and transforming growth factor-β (TGFβ) and their known inhibitors. Proposed approaches can have a significant impact on the development of personalized therapy and clinical decision making.
Identification
of novel drug–target interactions (DTIs)
is generally a stagnant, labor-intensive, and precarious process.
A conventional drug discovery and development pipeline can burn through
a billion USD, and more importantly, around 14 years.[1,2] Assay-based protocols in a drug discovery workflow follow several
steps, including lead identification, optimization, screening, and
characterization, eventually escalating the financial and temporal
burden. Alternatively, computational methods have gathered pace for
their utility in predicting novel drug–target interactions
and aiding the process of drug discovery.[3,4] Although
traditional methods beat in silico alternatives in terms of reliability
and robustness, experimental characterization of every possible drug–target
is not practical owing to its low-throughput nature.Traditional
DTI prediction workflows can be categorized into three
classes: (i) ligand-based approaches, (ii) docking-based approaches,
and (iii) chemogenomic approaches.[5−7] In DTI prediction, computational
approaches are divided into three major groups. Molecular similarity
serves as a deciding criterion for ligand-based approaches.[8] However, due to insufficient data regarding various
targets, this approach can be erroneous. Similarly, docking-based
approaches rely on molecular structures and sophisticated algorithms/software
to simulate interactions between the drug–target pair under
consideration. The biggest bottleneck of such an approach is the nonavailability
of quality three-dimensional (3D) protein structures.[9] Experimental techniques for solving a protein’s
crystal structure are time-taking and labor-intensive processes. For
instance, solving the 3D structure for targets like G protein-coupled
receptors (GPCRs) is still challenging.[10] Therefore, docking-based approaches can only cover a fraction of
the entire DTI spectrum. Alternatively, chemogenomic approaches try
to evade the drawbacks of the aforementioned methods by concurrently
employing the information of drug and target to establish their association.Advances in sequencing technologies have enabled the collection
of vast amounts of biological data.[11] Data
at such a scale have presented a golden opportunity for developing
powerful sequence-based approaches for modeling the protein structure
and functions, eventually aiding DTI prediction. Similar to grammatical
rules responsible for the working of natural languages, biological
sequences hold semantic and syntactical information that govern their
functioning, mechanism of action, and utility in the central dogma
of molecular biology. Extracting such semantic information from biological
sequences can help understand the biochemical properties of novel
sequences and estimate interactions among entities without explicitly
formalizing their biophysical or biochemical mechanisms.[12−14] There have been multiple attempts at formalizing an efficient workflow
capable of extracting efficient representations of molecular data.
SPVec, DTI-CDF, and DTI-MLCD are some of the many attempts in this
direction.[15−17] Self-supervised deep language modeling has recently
been employed for biological sequences. More specifically, MolTrans
and TransformerCPI are a couple of transformer-based methods developed
for DTI estimation and have demonstrated improved performance in comparison
to conventional methods.[18,19] In particular, such
encoder-based models have shown great potential for modeling protein
sequences. Models like Bidirectional Encoder Representations from
Transformers (BERT) rely on efficient network architectures that are
specifically designed for sequence data and are pretrained with massive
data sets.[20,21] For instance, the attention-based
transformer is one such model that has depicted immense effectiveness
in a range of benchmark data sets.[22,23]To incorporate
advances in transformer-based protein and drug embedding
methods, we have developed TransDTI, a modular framework that incorporates
molecular embeddings from various language models and estimates DTI
for a given drug–target pair. To evaluate the effectiveness
of TransDTI, we have compared 10-fold cross-validated results with
existing methods like DeepConvDTI, DeepDTA, and DeepDTI.[24−26] Moreover, the predicted DTIs were analyzed through a molecular dynamics
(MD) workflow to establish the effectiveness of the model estimates.
Results
With comparative analysis against baseline
and current methods,
a model architecture with 10 models is proposed in this study. The
seed architecture over which all models were built is described in Figure S1. The proposed models accurately predicted
the interactions in test sets and demonstrated a high level of effectiveness,
as quantified by multiple performance metrics. The general schematic
of the entire workflow is compiled as Figure .
Figure 1
General overview of the overall methodology.
(A) Primary DTI data
were collected, processed, and screened. (B) Embeddings were generated
for protein and drug sequences using multiple language model-based
methods. (C) Extracted embeddings were used to train a fully connected
feedforward deep neural network for classification and regression
task. (D) Molecular docking and dynamics workflow for validating the
activity of predicted drug–target interactions.
General overview of the overall methodology.
(A) Primary DTI data
were collected, processed, and screened. (B) Embeddings were generated
for protein and drug sequences using multiple language model-based
methods. (C) Extracted embeddings were used to train a fully connected
feedforward deep neural network for classification and regression
task. (D) Molecular docking and dynamics workflow for validating the
activity of predicted drug–target interactions.
Performances of Methods under Evaluation and
Selected Hyperparameters
Following a standardized protocol,
various aspects of the deep learning model were tuned: the learning
rate, number of hidden layers, dropout characteristics, and activation
functions. As described earlier, classification and regression models
were trained separately that were evaluated on their respective metrics.
A modular platform was built so that different types of embeddings
could be tested on the same architectural parameters. The entire model
architecture is depicted in Figure S1.
The effect of using different families of sequence embedders, namely,
ESM, ProtBert, and AlphaFold, was examined. On average, ESM family
models gave a Matthew’s correlation coefficient (MCC) of ∼0.71
and an R2 of ∼0.77. On similar lines, models
built on ProtBert and AlphaFold showed slightly better MCC and R2 of ∼0.72 and ∼0.76, respectively. These
values significantly outperformed sequence and CTD-based methods.
Similar trends were observed in the case of other metrics like the
area under precision recall (auPR) and the area under receiver operating
characteristic (auROC).
Comparison of Performance
with Other Molecular
Descriptors and Existing Models
Independent test data sets
were employed to compare the performance of the proposed method with
other molecular descriptor (sequence and CTD)-based workflows. The
results showed that the protein embedding families of models consistently
outperformed the conventional approaches for all evaluation metrics
(Figure A,B). Further,
a model built using CTD features performed better than the one trained
using only one-hot sequence encodings. It may be interpreted that
CTD features have a higher level of predictive capacity when compared
with one-hot encoded sequence features (Table ).
Figure 2
Comparison of performances among all of the
methods under observation.
(A) MCC and (B) R2 statistics of all of the methods
under consideration indicate the effectiveness of TransDTI over baseline
methods in predicting the type of interaction for a given drug–target
pair. (C) Comparison with existing methods also validates the effectiveness
of the proposed method. It should be noted that the AlphaFold model
was employed for the comparisons. (D) auROC and auPR statistics for
the three classes are depicted for the proposed models.
Table 1
Tenfold Cross-Validated Performance
Metrics for the Various Models under Considerationab
auROC
auPR
I
II
III
I
II
III
MCC
R2
sequence
0.8152 (0.0135)
0.8413
(0.0080)
0.8155 (0.0075)
0.6351 (0.0215)
0.8569 (0.0267)
0.7803 (0.0129)
0.6587 (0.0097)
0.7085 (0.0127)
CTD
0.8461 (0.0156)
0.8894 (0.0102)
0.8376 (0.0088)
0.6558 (0.0298)
0.8733 (0.0115)
0.8021 (0.0108)
0.6865
(0.0120)
0.7259 (0.0136)
ESM1
0.9305 (0.0037)
0.9593 (0.0023)
0.9210 (0.0028)
0.7066 (0.0107)
0.9563
(0.0029)
0.8798 (0.0055)
0.7136 (0.0043)
0.7653 (0.0058)
ESM1v
0.9300 (0.0035)
0.9594 (0.0021)
0.9211
(0.0024)
0.7061 (0.0128)
0.9568 (0.0023)
0.8811 (0.0058)
0.7140 (0.0065)
0.7722 (0.0079)
ESM1b
0.9305
(0.0035)
0.9598 (0.0021)
0.9212 (0.0026)
0.7074 (0.0120)
0.9574 (0.0024)
0.8806 (0.0059)
0.7152 (0.0062)
0.7704
(0.0062)
ProtBert
0.9304 (0.0036)
0.9603 (0.0023)
0.9218 (0.0024)
0.7052 (0.0120)
0.9581 (0.0027)
0.8823
(0.0054)
0.7151 (0.0054)
0.7657 (0.0068)
ProtBert-BFD
0.9291 (0.0030)
0.9599 (0.0023)
0.9207 (0.0032)
0.7023 (0.0113)
0.9578 (0.0025)
0.8797
(0.0066)
0.7144 (0.0074)
0.7663 (0.0058)
ProtAlbert
0.9305 (0.0039)
0.9596 (0.0022)
0.9212 (0.0027)
0.7047 (0.0124)
0.9572 (0.0027)
0.8812
(0.0052)
0.7161 (0.0070)
0.7700 (0.0083)
ProtT5XL
0.9305 (0.0031)
0.9603 (0.0017)
0.9219 (0.0020)
0.7062
(0.0132)
0.9580 (0.0023)
0.8809 (0.0055)
0.7175 (0.0057)
0.7713 (0.0042)
ProtT5XLBFD
0.9298 (0.0045)
0.9601 (0.0017)
0.9214 (0.0025)
0.7043 (0.0139)
0.9572 (0.0021)
0.8807
(0.0046)
0.7159 (0.0061)
0.7694 (0.0064)
ProtXLNet
0.9299 (0.0038)
0.9599 (0.0019)
0.9207 (0.0026)
0.7031
(0.0138)
0.9577 (0.0021)
0.8809 (0.0070)
0.7131 (0.0044)
0.7668 (0.0070)
AlphaFold
0.9378 (0.0036)
0.9668 (0.0021)
0.9289 (0.0025)
0.7096 (0.0108)
0.9593 (0.0023)
0.8914 (0.0054)
0.7141 (0.0069)
0.7787 (0.0072)
Standard deviations for the 10-fold
CV runs are in the brackets and the best performing model is marked
in bold.
ESM1, ESM1v, and
ESM1b belong to
the ESM family;[27] ProtBert, ProtBert-BFD,
ProtAlbert, Prot-T5-XL, Prot-T5-XL-BFD, and ProtXLNet belong to the
ProtBert family;[28] AlphaFold.[29]
Comparison of performances among all of the
methods under observation.
(A) MCC and (B) R2 statistics of all of the methods
under consideration indicate the effectiveness of TransDTI over baseline
methods in predicting the type of interaction for a given drug–target
pair. (C) Comparison with existing methods also validates the effectiveness
of the proposed method. It should be noted that the AlphaFold model
was employed for the comparisons. (D) auROC and auPR statistics for
the three classes are depicted for the proposed models.Standard deviations for the 10-fold
CV runs are in the brackets and the best performing model is marked
in bold.ESM1, ESM1v, and
ESM1b belong to
the ESM family;[27] ProtBert, ProtBert-BFD,
ProtAlbert, Prot-T5-XL, Prot-T5-XL-BFD, and ProtXLNet belong to the
ProtBert family;[28] AlphaFold.[29]In
addition to comparison with conventional descriptor-based models,
the performance of TransDTI was also compared against a few recently
developed DTI prediction models, namely, DeepDTI, DeepDTA, and DeepConvDTI.[24−26] While DeepDTI employs deep belief networks (DBNs) to process amino
acid, dipeptide, and tripeptide compositions, and drug topological
fingerprints estimate DTIs, DeepDTA and DeepConvDTI employ protein
sequences and drug SMILES to estimate affinity between the given drug–target
pair.Working codes were extracted for all three methods from
their GitHub
repositories. As the existing models were being evaluated in a binary
classification setup, the model presented in this study was trained
for a binary classification task using only class I (active) and class
III (inactive) data points. Minor changes on the final layers were
induced on the original codes to train the classification tasks for
a fair comparison. The results of comparative analysis in terms of
accuracy, MCC, and F1 score for the binary classification task are
compiled in Figure C. The proposed model in this study (TransDTI) performs better than
related models in all three comparison metrics. This can be attributed
to the greater ability of language models to encode residue-level
information in their embeddings.
Models
Trained with Random Embeddings Depicted
the Importance of Choosing Appropriate Evaluation Metrics
In addition to using embeddings from various methods, models were
also trained on randomly generated data. As evident from Figure A–I, the models
performed very poorly for both the tasks (classification Figure A–F and regression Figure G–I) when
analyzed using metrics like R2, auPR, and MCC. However,
other metrics such as class-wise accuracy and auROC deemed these random
(partial or complete) models effective by giving them a higher score.
Therefore, the choice of performance metrics is essential in determining
the actual effectiveness of ML models. The performance of the proposed
models (TransDTI) in terms of MCC and R2 reinforces
the inference that embeddings generated with language models have
predictive ability and can efficiently encode the semantic and structural
information hidden in biological sequences.
Figure 3
Performance of prediction
models on partial random data. (A–D)
Prediction losses, MAE, and R2 trends for models
trained on random protein and drug embeddings. (E, F) Model performance
in terms of ROC and PR curves for randomized data. R2 scores for regression models trained on (G, H) partial and (I)
complete random data.
Performance of prediction
models on partial random data. (A–D)
Prediction losses, MAE, and R2 trends for models
trained on random protein and drug embeddings. (E, F) Model performance
in terms of ROC and PR curves for randomized data. R2 scores for regression models trained on (G, H) partial and (I)
complete random data.
External
Validation on Gold-Standard Data
Sets Reinforces the Robustness of TransDTI
TransDTI demonstrated
high predictive ability with KIBA data sets used for training and
testing. However, model generalizability is considered to be the ultimate
criterion of an effective ML workflow. Although performance was calculated
using the holdout test data sets, the effectiveness of the proposed
models was externally validated on gold-standard external data sets
from DTI-MLCD.[15] In addition to two baseline
methods, three descriptor methods and two transformer-based methods
(MolTrans and TransformerCPI) were included for this analysis.[18,19] As evident from Table , TransDTI performed exceedingly well, further reinforcing its effectiveness.
Out of the four protein classes in the external validation data set,
TransDTI outperformed all other methods under consideration in three
classes (GPCR, enzymes, and nuclear receptors). In the case of ion
channels, the performance of TransDTI was not comparable with other
methods, as its architecture has been optimized for a three-class
classification setup and might be marginally less effective even after
being retrofitted into a binary classifier for the purpose of comparison.
Table 2
External Validation on Gold-Standard
Data Sets Showed the Effectiveness of TransDTI against Baseline and
Competing Methodsac
validation set
metric
CTD
Seq
DeepDTA
DeepConvDTI
DeepDTI
MolTrans
TransforerCPI
TransDTIb
GPCR
accuracy
0.61
0.57
0.73
0.68
0.74
0.72
0.69
0.77
MCCd
0.43
0.41
0.58
0.67
0.61
0.59
0.56
0.69
F1 score
0.56
0.53
0.63
0.68
0.59
0.62
0.59
0.60
sensitivity
0.57
0.52
0.67
0.67
0.54
0.55
0.54
0.58
specificity
0.69
0.64
0.78
0.72
0.83
0.84
0.80
0.85
enzyme
accuracy
0.58
0.53
0.62
0.57
0.71
0.69
0.67
0.75
MCCd
0.45
0.38
0.47
0.41
0.53
0.55
0.57
0.59
F1 score
0.53
0.51
0.59
0.56
0.68
0.65
0.64
0.70
sensitivity
0.56
0.50
0.53
0.53
0.67
0.61
0.59
0.68
specificity
0.60
0.57
0.69
0.64
0.64
0.77
0.77
0.81
ion channel
accuracy
0.62
0.56
0.67
0.59
0.64
0.65
0.62
0.63
MCCd
0.39
0.37
0.49
0.38
0.46
0.45
0.49
0.48
F1 score
0.51
0.48
0.62
0.56
0.60
0.56
0.55
0.59
sensitivity
0.47
0.44
0.60
0.52
0.57
0.51
0.50
0.57
specificity
0.72
0.67
0.72
0.67
0.69
0.76
0.72
0.70
nuclear receptors
accuracy
0.54
0.57
0.58
0.61
0.65
0.71
0.64
0.74
MCCd
0.34
0.31
0.35
0.43
0.47
0.52
0.48
0.55
F1 score
0.51
0.52
0.53
0.56
0.62
0.58
0.50
0.68
sensitivity
0.41
0.47
0.50
0.50
0.61
0.52
0.45
0.63
specificity
0.65
0.66
0.67
0.72
0.69
0.83
0.77
0.84
All models are validated for a binary
classification task.
Trained
on AF embeddings for GPCR
and ProtT5XL embeddings for the enzyme, ion channel, and nuclear receptors.
Best performing method is marked
in bold for every metric and validation set.
Scaled 0–1 for consistency
with other metrics.
All models are validated for a binary
classification task.Trained
on AF embeddings for GPCR
and ProtT5XL embeddings for the enzyme, ion channel, and nuclear receptors.Best performing method is marked
in bold for every metric and validation set.Scaled 0–1 for consistency
with other metrics.
Molecules Predicted by the Models Are at Par
with Known Inhibitors in Terms of Interacting Potential
The
interacting potential of predicted compounds was compared against
already known protein inhibitors to gauge the effectiveness of the
proposed models. The top 20 predictions from the 10 models for the
two proteins under consideration are shown in Figure . It can be observed that the docking scores
for most of the predictions were at par (comparable) with the known
inhibitor. For mitogen-activated protein kinase (MAP2k), the docking
score of the known inhibitor (extracted from 5Z1D) was −4.483
kcal/mol. Dock scores for the model predictions were −6.691,
−9.324, and −7.332 kcal/mol for ESM1, ProtBert, and
AlphaFold, respectively. Similarly, model predictions for (transforming
growth factor-β (TGFβ)) were at par with the known inhibitor
(Figure ). Random
molecules from the Schrodinger decoy molecule set (functionality of
Desmond) were also docked at the active site. As expected, the molecules
did not exhibit comparable docking results. Moreover, predictions
from DeepDTI, DeepDTA, and DeepConvDTI were also docked with MAP2k
and TGFβ. The effectiveness of the competing methods was underlined
in the docking results, as a significant number of predicted molecules
were at par with the known inhibitor. However, as evident from Figure , TransDTI gave better
predictions, both in terms of quality and quantity. Further, to test
for statistically significant differences in the docking scores among
competing models, the paired sample t-test was performed.
The results are compiled in Tables S1 and S2 for TGFβ and MAP2k docking scores, respectively, which agree
with previously observed trends (Figure ).
Figure 4
Docking statistics for known and predicted molecules
with MAP2k
and TGFβ. Docking scores of the predicted molecules were at
par with those of the known inhibitors for (A) MAP2k and (B) TGFβ.
(C) The mean docking scores for all of the predictions under consideration
are summarized.
Docking statistics for known and predicted molecules
with MAP2k
and TGFβ. Docking scores of the predicted molecules were at
par with those of the known inhibitors for (A) MAP2k and (B) TGFβ.
(C) The mean docking scores for all of the predictions under consideration
are summarized.
Molecular
Dynamics Validates the Effectiveness
of Predicted Molecules
Docked protein–ligand complexes
for MAP2k and TGFβ were simulated for 100 ns. Figure S2 shows the comparative analysis of sustained interactions
between known inhibitor–protein and predicted compounds–protein
complexes. For MAP2k, while the complex with the known inhibitor had
around four interactions on average, complexes with predicted compounds
had six, seven, and four interactions for ESM1, ProtBert, and AlphaFold,
respectively. Moreover, no significant fluctuations were observed
in the carbon backbone throughout the simulations (Figure S3). Similar results were obtained for TGFβ and
its predicted molecules. Table compiles the polar and nonpolar interacting residues for
all of the simulations. The average poses for MAP2k and TGFβ
are depicted in Figure , while all polar and nonpolar interactions have been depicted in Figure . It is evident from
the docking poses and their interacting residues that predicted molecules
carry a similar signature as the known inhibitor. For instance, in
the case of MAP2k, it can be observed that Asp277 serves as a consistent
binding partner for all of the predicted drug molecules. This finds
its relevance in identifying novel drugs for a given target by incorporating
similarity characteristics of already known inhibitors indirectly.
Moreover, sustained interactions (Figure ) during the course of 100 ns simulations
for both the proteins under consideration provide mechanistic confidence
in the prediction given by TransDTI.
Table 3
Residues of MAP2k and TGFβ Interacting
with the Known and Predicted Ligands in Their Best-Docked Poses
types
of interactions and residues involved (premolecular dynamic simulations)
protein
ligand
molecular
docking (kcal/mol)
H-bonds
hydrophobic, polar,
and π–π stacking
MAP2k
CID135398122a
–4.483
Asp277
Met142
CHEMBL287306
–6.691
Asp277
Lys165
CHEMBL361297
–9.324
Asp277, Met215, Lys165
Gly148, Ser276, Ser263,
Leu266, Met142
CHEMBL1242568
–7.332
Asp277, Met215
TGFβ
DB02010a
–9.345
His283, Asn287,
Asp281
Val211, Asn287, Lys337
CHEMBL1852770
–7.417
His283, Asp351, Cys350
Glu245, Phe352, Lys232, Ala230
CHEMBL1762790
–10.876
Asp351, Glu245, Thr280
Lys213, Leu260, Leu340
CHEMBL1241768
–9.751
His283, Asp281, Thr280
Ala230, Phe282, Lue260
Previously reported inhibitors in
the literature.[30,31]
Figure 5
Interaction poses for averaged structures
for all of the simulated
drug–target complexes for (A–D) MAP2k and (E–H)
TGFβ. It can be observed that the interacting partners for the
predicted compounds are similar to the ones for the known inhibitors.
Also, interactions and their interacting residues are annotated.
Figure 6
H-bond statistics throughout the simulations for (A–D)
MAP2k
and (E–H) TGFβ. The interactions were consistent throughout
the simulations, showing the effectiveness of the prediction models.
Moreover, the predicted compounds from ESM, ProtBert, and AlphaFold
showed similar interacting potential as the known inhibitor.
Interaction poses for averaged structures
for all of the simulated
drug–target complexes for (A–D) MAP2k and (E–H)
TGFβ. It can be observed that the interacting partners for the
predicted compounds are similar to the ones for the known inhibitors.
Also, interactions and their interacting residues are annotated.H-bond statistics throughout the simulations for (A–D)
MAP2k
and (E–H) TGFβ. The interactions were consistent throughout
the simulations, showing the effectiveness of the prediction models.
Moreover, the predicted compounds from ESM, ProtBert, and AlphaFold
showed similar interacting potential as the known inhibitor.Previously reported inhibitors in
the literature.[30,31]
Lower-Dimensional Visualization of Protein
Embeddings
Comparative and validation results suggested that
TransDTI embeddings could capture critical aspects of protein and
drug sequences that are predictive of DTIs. Therefore, protein embeddings
were examined using t-distributed stochastic neighbor embedding (tSNE)
to generate lower-dimensional representations. Protein ANalysis THrough
Evolutionary Relationships (PANTHER) was employed for ontology analysis
for identifying protein classes for a subset of data points.[32] Although the transformer models were not explicitly
tuned for identifying protein subclasses, some level of classification
could be observed (Figure S4).
Discussion
This study demonstrates the utility of transformer-based
language
models for identifying novel DTIs by employing embeddings from ESM,
ProtBert, and AlphaFold families of models. In contrast to the conventional
binary classification setup for predicting DTIs, we opted for a multiclass
(three) classification approach by categorizing drug activity in an
intermediate range (0.01 μM < IC50 < 30 μM).
This was done to mitigate the ambiguity of defining a strict threshold
between active and inactive drug–target pairs in a conventional
setup and provide a realistic and practical solution. Transformer
models extract semantic information in NLP tasks by jointly conditioning
on both left and right contexts in all layers.[21] This is particularly an essential feature in context to
biological sequences, which are multidirectional in nature. The inclusion
of robust sequence embeddings allowed the proposed models to score
well with various performance metrics (Figures S5 and S6).As described in earlier sections, the proposed
methods consistently
excelled at multiple validation checks and outperformed other descriptors
and earlier models in the literature. The proposed models also avoided
overfitting, which is evident from the training curves (Figure S7). Furthermore, molecular docking and
dynamics analysis revealed molecular insights into the effectiveness
of the predicted drugs in comparison to already known inhibitors. Figure S2 depicts the representative structure
from eight simulation runs for the two proteins (MAP2k and TGFβ)
and their model predictions under consideration.Ultimately,
TransDTI’s integrative approach recommended
a set of promising drug–target interactions that could be experimentally
validated as promising leads for novel cancer therapies. Although
validation of the binding energies of putative drug–target
interactions can only be verified by experimental screening methods,
these results indicated that the proposed models could mature into
promising methods for identifying novel drug–target interactions.
It should be noted that the models were developed using heterogeneous
classes of proteins that included transporters, transcription regulators,
and junction proteins, among others (Figure S8). It can be inferred that the heterogeneous nature of the training
data contributed to the effectiveness of the models.It was
observed that although TransDTI performed relatively well,
there is still scope for improvement. For instance, most protein and
drug embeddings except AlphaFold are entirely dependent on protein
or SMILES sequences. However, structural features ideally have critical
information that could be employed for developing more generalized
models. Slightly better performance of the AlphaFold-based model in
the regression task could be attributed to the fact that structural
templates are incorporated in its pipeline.[29,33] In contrast to the BFD database that was used in the original AlphaFold
workflow for generating MSAs, UniRef100 was employed in the interest
of computational runtimes and hardware bottlenecks.Although
DTI predictions were evaluated by external validation
and comparative analysis with existing methods and conventional descriptors,
as an alternative validation strategy, the proposed models were tested
to predict potential drug candidates for mitogen-activated protein
kinase (MAP2k) and transforming growth factor-β (TGFβ),
both of which serve as essential therapeutic targets in anticancer
therapy. The docked complexes were simulated in an explicit water
environment to evaluate the dynamics and sustainability of interactions
for an extended duration, which cannot be reliably ascertained by
just molecular docking. For instance, despite good docking scores,
TGFβ-CHEMBL1241678 was found to be unstable in a 100 ns MD simulation.
Conclusions
This study developed an end-to-end scalable
framework that could
understand the intricate relationships among drug–target pairs
and make inferences for interactions among given drugs and targets
using transformer-based language models. Extracting the ordered information
present in data sets like SMILES and protein sequences was central
to the idea of TransDTI. The transformer-based language models were
employed to learn representations from raw sequence data and molecular
descriptors and to solve a multiclass classification and regression
problem. These methods were compared extensively with baseline and
existing methods on various measures of performance.The results
obtained in this study reinforced the idea of using
representations that try to capture the underlying order in sequential
data. Including language models served the same purpose, and as a
result, it was observed that there was a significant improvement in
the performance compared to the baseline methods. Moreover, TransDTI’s
effectiveness was evident in the external validation setup, where
they consistently outperformed the baseline models with a healthy
margin.Analyzing the details of what the model is learning
could be of
great utility in improving the methodology further. Furthermore, the
idea of using structural information of proteins for DTI prediction
remains to be of immense utility, especially in the context of the
recently developed AlphaFold model.[34] Therefore,
spatial information provided by the protein 3D structure would be
incorporated in the future to enhance its effectiveness further.
Methods
This section describes the various elements
of the entire DTI prediction
workflow. The entire workflow was divided into three steps:Extracting protein
and drug embeddings
using multiple language models.The second phase trained a fully connected
feedforward deep neural network on the extracted embeddings for predicting
interaction scores as well as the interaction state.The final phase was aimed toward validating
the model prediction using molecular docking and dynamics analysis.A general schematic describing the various
sections of the proposed
workflow is shown in Figure .
Data Sets and Study Design
TransDTI
has been designed to be compatible with regression and classification
tasks with minimal architectural differences. While the regression
task was straightforward, the conventional binary classification setup
to predict DTIs had some inherent limitations discussed in the following
sections. A large-scale kinase inhibitor bioactivity (KIBA) data set
for model building and evaluation was adopted.[35] The KIBA data set quantifies bioactivities of kinase inhibitors
in terms of a single scoring measure derived from individual metrics
like IC50, Kd, and Ki.[36][36] As the activity thresholds were not well defined in the
KIBA data set, IC50 values for the interacting pairs were
extracted directly from ChEMBL. Although a large pool (∼0.2
M) of drug–target interactions was retrieved from ChEMBL, a
healthy chunk of it was filtered out due to nonstandard/missing activity
values and incomplete information. Of 30 474 compounds, 961
targets and 61 624 interactions were finally screened after
all of the preprocessing steps. The summary statistics of all of the
data sets employed is compiled in Table .
Table 4
Summary of All of
the Data Sets Used
in the Studya
proteins
compounds
interactions
training
KIBA (IC50)b
961
30 474
61 624
validationc (DTI-MLCD)
enzyme
1411
1777
7371
GPCR
156
1680
5383
nuclear receptors
33
541
886
ion channel
204
210
1476
KIBA data set was employed for training
and internal validation, while the gold-standard data set from DTI-MLCD
was used for external validation.
Proteins for drugs listed in the
KIBA data set were extracted manually from ChEMBL.
Used as an external validation data
set.
KIBA data set was employed for training
and internal validation, while the gold-standard data set from DTI-MLCD
was used for external validation.Proteins for drugs listed in the
KIBA data set were extracted manually from ChEMBL.Used as an external validation data
set.
Formulation
of the Problem
The proposed
methodology followed a two-step process: (i) Training a multiclass
classification problem on given drug–target pairs. (ii) Inferring
the interactions among unknown drug–target and validating using
an external data set to infer real-world performance.A multiclass
classification approach was chosen for an efficient understanding
of drug–target interactions rather than the more conventional
binary classification. The reason for this is that (i) most of the
binary classification tasks tend to label nontested drug–target
combinations as a negative data point and (ii) even in the case where
one has information regarding the activity profile of the drug–target
pair (in terms of IC50, Kd or Ki), a single activity threshold was not uniformly
followed in the literature. Further, the conventional binary classification
task had some inherent drawbacks and inadequacies. The most evident
is the need for a predefined binarization threshold, which is often
arbitrarily decided.To mitigate the issues mentioned above,
binding affinities were
segregated into three categories based on the magnitude of their value.
The entire batch of retrieved data points was categorized into three
classes, namely, (i) class I: active, (ii) class II: intermediate,
and (iii) class III: inactive, owing to the disadvantages of a binary
formulation as described earlier. For instance, an IC50 value of <0.1 μM was considered a good indicator of an
active drug–target interaction, and such drug–target
pairs were marked as positive data points.[37] As indicated earlier, due to no distinct separation among strong
and weak binding affinities, a threshold of >30 μM was set
for
an interaction to be categorized as an noninteracting (negative) example.
Everything in between was labeled as intermediate interactions. Following
this criterion, a total of 7057 active (class I), 24 752 intermediate
(class II), and 28 748 (class III) inactive interactions were
fed into the classification and regression models.
Protein and Drug Embeddings
Drug
embeddings were generated using SMILES transformer, a pretrained language
model-based utility that learns molecular fingerprints through an
unsupervised sequence-to-sequence language model employing extensive
SMILES data.[38]Multiple transformer
protein language models were employed for generating protein embeddings
in addition to conventional sequences and chemical descriptor-based
features. The protein embedding could be categorized under three families,
ESM, ProtBert, and AlphaFold. While the ESM family of embeddings had
models like ESM1, ESM1v, and ESM1b, the ProtBert family included ProtBert,
ProtBert-BFD, Prot-T5-XL, Prot-T5-XL-BFD, ProtAlbert, and ProtXLNet.[27,28] As most of the mentioned language models do not explicitly provide
the functionality to extract data embeddings, custom scripts were
written on top of the publicly provided code for the task. Embeddings
from AlphaFold were extracted by reverse engineering the open-source
code on their GitHub repository and ColabFold.[29,33] Similarly, embeddings for other methods were extracted using the
resources provided in GitHub. Details regarding the model architecture
and parameters for each model are compiled in Table .
Table 5
Summary of All of
the Protein Embedding
Models Employed for Developing the Proposed Methodology
shorthand
layers
params
data set
embedding dim
ESM1
34
670M
UR50/S2018_03
1280
ESM1b
33
650M
UR50/S2018_03
1280
ESM1v
33
650M
UR90/S2020_03
1280
ProtBert
30
420M
UniRef100
1024
ProtBert-BFD
12
224M
UniRef100
4096
ProtAlbert
30
420M
BFD100
1024
Prot-T5-XL
24
3B
UniRef100
1024
Prot-T5-XL-BFD
24
3B
BFD100
1024
ProtXLNet
30
409M
UniRef100
1024
AlphaFold
48a
92M
UniRef100
384
Evoformer blocks.
Evoformer blocks.
Model
Architecture
Once the embeddings
for protein and drugs were extracted, a fully connected feedforward
deep neural network was developed and optimized for classification
and regression tasks separately. While the underlying architecture
remained the same for the two tasks, the final layer of the network
was modified for their activations and shapes.All of the hidden
layers had an exponential linear unit (elu) as an activation function.
Loss
and Model Optimization
For
the multiclass classification problem, categorical cross-entropy was
used as the loss function. It was defined as a sum of losses for each
class labels coming out of a SoftMax function and is mathematically
given byHere, N is the number of
classes (three in this case), i is the data point, y is the
binary target indicator [0,1], and p is the model
prediction. In the case of the regression task, the mean squared error
(MSE) was employed as the loss function.As for the regression
setup, the mean squared error (MSE) was employed as the loss function,
which is the average of the squared differences between the actual
and predicted values. It can be mathematically represented as followswhere n is the number of
samples, Y is the vector
of predicted values, and Ŷ is the vector of actual values.
Performance
Evaluation Metrics
Tenfold
cross-validation (CV) was employed during model training to ensure
the goodness of fit and minimize overfitting. Multiple evaluation
metrics were utilized for the classification and regression task including
coefficient of determination (r2), Matthew’s
correlation coefficient (MCC), auROC, and auPR.auROC, auPR,
accuracy, and macro-averaged F1 score were used to evaluate all of
the methods under comparison. F1 scores became necessary for the cases
wherein false positives and false negatives were crucial, as in the
case here. It was computed as followsWhile accuracy does not consider
the effect of imbalance in the
data set, MCC and auPR are independent of imbalances and serve as
better indicators of performance for the given use case. A completely
random classifier would have a value of 0.5 and a perfect classifier
would have a score of 1 for both the metrics. Also, both metrics ranged
from 0 to 1 except MCC, which ranged from −1 to 1. In addition
to the above metrics, sensitivity and specificity were also reported
for the external validation analysis. Moreover, to efficiently balance
the well-known bias and variance tradeoff, 10-fold cross-validation
was employed.
Preparation of the Structures
of the Protein
and Ligands for Molecular Docking
The crystal structures
of MAP2k and TGFβ are available in the Protein Data Bank (PDB)
with IDs 5Z1D and 5E8X respectively,
along with their known inhibitors.[30,31] These structures
were optimized and prepared for docking studies using the protein
preparation wizard of the Schrodinger suite.[39,40] Preparation of the structures included removing water molecules,
adding polar hydrogen atoms, filling of missing amino acid side chains,
and minimization of the structure using the OPLS3e force field.[39,41]
Molecular Docking of the Predicted Ligands
with MAP2k and TGFβ
The key catalytic residues of MAP2k
and TGFβ involved in the proteolytic activity are reported in
the literature.[30,31] A 20 Å grid for docking
with MAP2k and TGFβ was generated, taking its catalytic residues
as the centroid. The Glide module of Schrodinger was used for the
extra precision (XP) flexible docking of the ligands at the generated
grid sites.[42,43] For MD validation, molecular
interactions of the protein and its known inhibitor were compared
against predictions from the proposed models. One model from each
family (ESM1, ProtBert, and AlphaFold) was selected for comparison.
Explicit Water Molecular Dynamics (MD) Simulation
and Its Analysis
A standardized MD protocol has been used
for the analysis of docked protein–ligand complexes.[44] The docked molecules were subjected to MD simulations
to investigate the stability of binding and interactions between the
proteins and ligands. All MD simulations were performed in the Schrodinger
suite using the Desmond tool and the Maestro platform.[41,45] First, the docked complexes were solvated with the TIP3P water model
to build the system. It was followed by neutralizing the orthorhombic
periodic boundary system by adding counterions (Na+/Cl−). The
energy of the prepared systems was minimized by running 100 ps low-temperature
(10 K) Brownian motion MD simulations in the NVT ensemble to remove
steric clashes and move the system away from an unfavorable high-energy
conformation. The systems were further relaxed using the default parameters
of the “relax system before simulation” option of Desmond.
The equilibrated systems were then subjected to 100 ns simulations
in the NPT ensemble at a temperature of 300 K maintained using a Nose–Hoover
chain thermostat, a constant pressure of 1 atm maintained using a
Martyna–Tobias–Klein barostat, a time step of 2 fs,
and a recording interval of 20 ps.The generated MD simulation
trajectories were visualized and analyzed using the system event analysis
and simulation interaction diagram tools. The root-mean-square deviation
(RMSD) of the simulated structures with reference to initial docked
structures throughout the simulation run was analyzed to account for
the stability of the interactions. The root-mean-square fluctuation
(RMSF) was also calculated to investigate the average fluctuation
in the amino acid residues of apoproteins and their complexes with
ligands under consideration.As hydrogen bonds are crucial in
determining the specificity and
affinity of a drug toward its receptor, the average number of the
hydrogen bonds formed during the simulation time for each protein–ligand
complex was then calculated. Furthermore, to investigate the significant
interactive residues of the proteins, which were contacting the ligands
during the MD simulation, the polar and nonpolar interactions, as
well as the occupancy time of these interactions, were calculated.
Implementation
All of the models were developed
on the TensorFlow-Keras platform
and trained with Nvidia RTX2080 GPU (8 GB vRAM) and Tesla V100 (32GB
vRAM). The model parameters were optimized using grid search, and
saved model weights can be found on the GitHub repository.
Authors: Richard A Friesner; Robert B Murphy; Matthew P Repasky; Leah L Frye; Jeremy R Greenwood; Thomas A Halgren; Paul C Sanschagrin; Daniel T Mainz Journal: J Med Chem Date: 2006-10-19 Impact factor: 7.446
Authors: Andrew W Senior; Richard Evans; John Jumper; James Kirkpatrick; Laurent Sifre; Tim Green; Chongli Qin; Augustin Žídek; Alexander W R Nelson; Alex Bridgland; Hugo Penedones; Stig Petersen; Karen Simonyan; Steve Crossan; Pushmeet Kohli; David T Jones; David Silver; Koray Kavukcuoglu; Demis Hassabis Journal: Nature Date: 2020-01-15 Impact factor: 49.962
Authors: Regina S Salvat; Andrew S Parker; Yoonjoo Choi; Chris Bailey-Kellogg; Karl E Griswold Journal: PLoS Comput Biol Date: 2015-01-08 Impact factor: 4.475