Aida Tayebi1, Niloofar Yousefi1, Mehdi Yazdani-Jahromi1, Elayaraja Kolanthai2, Craig J Neal2, Sudipta Seal2,3, Ozlem Ozmen Garibay1.
Abstract
Drug-target interaction (DTI) prediction through in vitro methods is expensive and time-consuming, whereas computational methods can save time and money while making drug discovery more efficient. Most computational methods frame DTI prediction as a binary classification task. One important challenge is that negative interactions far outnumber positive interactions in all DTI-related datasets, which leads to the class imbalance problem. As a result, the trained classifier is biased towards the majority (negative) class, even though the minority class (interacting pairs) is the one of interest. This class imbalance problem is not widely taken into account in DTI prediction studies, and the few previous studies that apply balancing in DTI do not focus on the imbalance issue itself. Additionally, they do not benefit from deep learning models or experimental validation. In this study, we propose a computational framework, together with experimental validation, that predicts drug-target interactions using an ensemble of deep learning models to address the class imbalance problem in the DTI domain. The objective of this paper is to mitigate bias in DTI prediction by isolating the impact of balancing while holding all other parameters constant. Our analysis shows that the proposed model outperforms unbalanced models with the same architecture trained on BindingDB, both computationally and experimentally. These findings demonstrate the significance of balancing, which reduces the bias towards the negative class and leads to better performance. It is important to note that relying solely on AUROC and AUPRC metrics, without experimentally validating the computational results, is not credible, particularly when the testing set remains unbalanced.
Keywords: ACE2 receptor; SARS-CoV-2; deep learning; drug-target interaction; ensemble learning; machine learning; spike protein
Year: 2022 PMID: 35566330 PMCID: PMC9100109 DOI: 10.3390/molecules27092980
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.927
Dataset description and statistical information.
| Dataset | Unique Drugs | Unique Targets | Total Pairs | Positive Pairs | Negative Pairs | Imbalance Ratio |
|---|---|---|---|---|---|---|
| BindingDB Dataset | 679,118 | 5,941 | 1,369,057 | 492,970 | 876,087 | 1.78 |
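The imbalance ratio in the table is consistent with the number of negative pairs divided by the number of positive pairs. The short check below is only an arithmetic sketch over the reported counts; the variable names are illustrative and not taken from the paper.

```python
# Counts as reported in the dataset table above.
positive_pairs = 492_970
negative_pairs = 876_087

# The two classes should add up to the reported total number of pairs.
total_pairs = positive_pairs + negative_pairs

# Assumed definition: imbalance ratio = negatives per positive.
imbalance_ratio = negative_pairs / positive_pairs

print(total_pairs)                # 1369057
print(round(imbalance_ratio, 2))  # 1.78
```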
Detailed IC50 values, SMILES string lengths, and protein (FASTA) sequence lengths.
| Dataset | Bioactivity | IC50 Value (Max) | IC50 Value (Min) | IC50 Value (Avg) | SMILES Sequence Length (Max) | SMILES Sequence Length (Min) | SMILES Sequence Length (Avg) | FASTA Sequence Length (Max) | FASTA Sequence Length (Min) | FASTA Sequence Length (Avg) |
|---|---|---|---|---|---|---|---|---|---|---|
| BindingDB Dataset | IC50 | 1 × 10⁷ | 0 | 3.79 × 10⁴ | 1.94 × 10³ | 2.0 × 10⁰ | 5.85 × 10¹ | 7.18 × 10³ | 9.0 × 10⁰ | 7.07 × 10² |
Figure 1. Framework of the proposed method to predict DTI. The in silico component starts with preprocessing and data construction. Each constructed training set includes the entire positive set and a bagged sample of the negative set. A base learner is trained on each set, and the base learners are then aggregated to form the deep ensemble-balanced learning model. The output is the predicted probability of interaction for a drug-target pair whose interaction is unknown. The architecture of each individual learner is shown as part of this framework: the drug and protein representations are first extracted by neural networks from the corresponding SMILES and amino acid sequences, and these encodings are then concatenated and fed into the final neural network, where the model is trained. The in vitro component is the validation part of the framework, where the computational results of the proposed model are compared with DTI measured experimentally in the laboratory.
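The data construction and aggregation steps described in the caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `balanced_bags` and `ensemble_predict` are hypothetical, and three details are assumptions not stated in the source (each bag draws as many negatives as there are positives, negatives are drawn without replacement, and base-learner outputs are aggregated by simple averaging).

```python
import numpy as np

def balanced_bags(pos_idx, neg_idx, n_learners, seed=0):
    """Build one training index set per base learner.

    Each bag keeps every positive pair and adds a random negative sample
    of the same size (assumption: a 1:1 ratio, sampled without replacement).
    """
    rng = np.random.default_rng(seed)
    pos_idx = np.asarray(pos_idx)
    neg_idx = np.asarray(neg_idx)
    bags = []
    for _ in range(n_learners):
        sampled_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
        bags.append(np.concatenate([pos_idx, sampled_neg]))
    return bags

def ensemble_predict(probabilities):
    """Average per-learner probabilities (shape: n_learners x n_pairs).

    Averaging is an assumed aggregation rule for this sketch.
    """
    return np.mean(np.asarray(probabilities, dtype=float), axis=0)

# Toy data: 4 positive pairs (indices 0-3) and 7 negative pairs (indices 4-10).
pos = np.arange(4)
neg = np.arange(4, 11)
bags = balanced_bags(pos, neg, n_learners=3)

# Toy outputs of two base learners for two drug-target pairs.
avg = ensemble_predict([[0.2, 0.8], [0.4, 0.6]])
```

Each bag here contains eight indices, four positive and four negative, so every base learner sees a balanced training set while different learners see different parts of the negative class.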
Details of the neural networks.
| Layer | ErG | ESPF | PSC | Final FC Network |
|---|---|---|---|---|
| 1st | Linear(315, 1024) | Linear(2586, 1024) | Linear(8420, 1024) | Linear(512, 1024) |
| 2nd | Linear(1024, 256) | Linear(1024, 256) | Linear(1024, 256) | Linear(1024, 1024) |
| 3rd | Linear(256, 64) | Linear(256, 64) | Linear(256, 64) | Linear(1024, 512) |
| 4th | Linear(64, 256) | Linear(64, 256) | Linear(64, 256) | Linear(512, 1) |
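The layer sizes in this table can be assembled into a single base learner as in the PyTorch sketch below. Only the `Linear` dimensions come from the table; the ReLU activations, the sigmoid output, the choice of the ErG encoder for the drug branch, and the names `mlp` and `BaseLearner` are assumptions made for illustration.

```python
import torch
from torch import nn

def mlp(dims, final_activation=True):
    """Stack Linear layers with ReLU in between (activation choice is assumed)."""
    layers = []
    n_layers = len(dims) - 1
    for i, (d_in, d_out) in enumerate(zip(dims[:-1], dims[1:])):
        layers.append(nn.Linear(d_in, d_out))
        if final_activation or i < n_layers - 1:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

class BaseLearner(nn.Module):
    """One drug encoder, one protein encoder, and the final FC network."""

    def __init__(self, drug_dim=315, protein_dim=8420):
        super().__init__()
        # Encoder widths from the table: input -> 1024 -> 256 -> 64 -> 256.
        self.drug_encoder = mlp([drug_dim, 1024, 256, 64, 256])
        self.protein_encoder = mlp([protein_dim, 1024, 256, 64, 256])
        # Two 256-dimensional encodings are concatenated into 512 features.
        self.classifier = mlp([512, 1024, 1024, 512, 1], final_activation=False)

    def forward(self, drug, protein):
        z = torch.cat([self.drug_encoder(drug), self.protein_encoder(protein)], dim=1)
        # Sigmoid output is an assumption; it maps the score to a probability.
        return torch.sigmoid(self.classifier(z)).squeeze(1)

# Usage with random feature vectors for a batch of two drug-target pairs.
model = BaseLearner()
probs = model(torch.rand(2, 315), torch.rand(2, 8420))
```

Passing `drug_dim=2586` would give the corresponding learner for the ESPF drug fingerprint. The 512-dimensional input of the final network matches the concatenation of the two 256-dimensional encoder outputs.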
Summary of the structural features used for proteins and the fingerprint features used for drugs.
| Name | Description | Size | Feature Group |
|---|---|---|---|
| PSC | Amino acid composition up to 3-mers | 8420 | Target |
| ErG | 2D pharmacophore descriptions for scaffold hopping | 315 | Drug |
| ESPF | Explainable Substructure Partition Fingerprint | 2586 | Drug |
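The PSC size of 8420 is consistent with counting all 1-mers, 2-mers, and 3-mers over the 20 standard amino acids (20 + 400 + 8000). The sketch below computes such a composition vector. It is an illustration under assumptions: the name `kmer_composition`, the feature ordering, and the normalization by the number of windows are not taken from the source and may differ from the encoder actually used.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

# Enumerate all 1-, 2-, and 3-mers: 20 + 400 + 8000 = 8420 features.
KMERS = ["".join(p) for k in (1, 2, 3) for p in product(AMINO_ACIDS, repeat=k)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def kmer_composition(sequence):
    """Return the frequency of every 1-, 2-, and 3-mer in a protein sequence.

    Frequencies are normalized per k by the number of sliding windows
    (an assumed normalization); k-mers with non-standard residues are skipped.
    """
    features = [0.0] * len(KMERS)
    for k in (1, 2, 3):
        n_windows = len(sequence) - k + 1
        if n_windows <= 0:
            continue
        for i in range(n_windows):
            idx = KMER_INDEX.get(sequence[i:i + k])
            if idx is not None:
                features[idx] += 1.0 / n_windows
    return features

vector = kmer_composition("MKV")
print(len(vector))  # 8420
```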
Comparison of the computational results yielded for two unbalanced models and our proposed method on the BindingDB dataset.
| Model | AUROC | AUPRC | F1-Score | Recall (TPR) |
|---|---|---|---|---|
| Unbalanced Model 1 | 0.924 | 0.879 | 0.809 | 0.797 |
| Unbalanced Model 2 | 0.926 | 0.876 | 0.796 | 0.809 |
| Proposed Model | 0.952 | 0.920 | 0.838 | 0.903 |
Comparison of the two unbalanced models and our proposed method on in-lab experimental data.
| Compound | Lab Results | Unbalanced Model 1 | Unbalanced Model 2 | Proposed Model |
|---|---|---|---|---|
| darunavir | P | N | N | P |
| 2-keto-3-deoxynononic | P | N | N | N |
| Cytidine-5monophospho-N-acetylneuraminic | P | N | P | P |
| N-Glycolylneuraminic | P | N | P | P |
| N-acetyl-neuraminic | P | N | P | P |
| N-Acetyllactosamine | P | N | N | P |
| 3 | N | N | N | N |
| Recall (TPR) | | 0 | 0.5 | 0.833 |
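The recall values in the last row can be reproduced from the P/N labels in the table, taking the lab results as ground truth. The snippet below is a small check of that arithmetic; the labels are transcribed in table order and the function name `recall` is illustrative.

```python
# Lab results and model predictions, transcribed row by row from the table.
lab = ["P", "P", "P", "P", "P", "P", "N"]
predictions = {
    "Unbalanced Model 1": ["N", "N", "N", "N", "N", "N", "N"],
    "Unbalanced Model 2": ["N", "N", "P", "P", "P", "N", "N"],
    "Proposed Model":     ["P", "N", "P", "P", "P", "P", "N"],
}

def recall(y_true, y_pred):
    """Recall (true positive rate) = TP / (TP + FN), with 'P' as the positive label."""
    tp = sum(t == "P" and p == "P" for t, p in zip(y_true, y_pred))
    fn = sum(t == "P" and p == "N" for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

for name, pred in predictions.items():
    print(f"{name}: {recall(lab, pred):.3f}")
# Unbalanced Model 1: 0.000
# Unbalanced Model 2: 0.500
# Proposed Model: 0.833
```

With six experimentally positive compounds, the proposed model recovers five of them, which gives the reported recall of 0.833.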
Figure 2. Precision-recall curve yielded for two unbalanced models and our proposed method on the BindingDB dataset.
Figure 3. Receiver operating characteristic curve yielded for two unbalanced models and our proposed method on the BindingDB dataset.