Gan Wang1, Xudong Zhang1, Zheng Pan2, Alfonso Rodríguez Patón3, Shuang Wang1, Tao Song1,3, Yuanqiang Gu4.
Abstract
Predicting drug-target interactions (DTIs) is a crucial step in drug discovery and repositioning, and the field has seen tremendous progress in recent years. Despite these efforts, existing representation learning and feature generation approaches for both drugs and proteins remain complicated and high-dimensional. In addition, current methods struggle to extract locally important residues from sequence information while staying focused on global structure. At the same time, massive datasets are not always easily accessible, which makes learning from small datasets an urgent need. We therefore propose an end-to-end learning model that encodes drugs and proteins with the SUPD and SUDD methods, which not only omits the complicated feature extraction process but also greatly reduces the dimension of the embedding matrix. We also use a multi-view strategy with a transformer to extract locally important residues of proteins for better representation learning. Finally, we evaluate our model on the BindingDB dataset against several state-of-the-art models using a comprehensive set of metrics. On 100% of BindingDB, our AUC, AUPR, ACC, and F1-score reach 90.9%, 89.8%, 84.2%, and 84.3%, respectively, exceeding the average values of the other models by 2.2%, 2.3%, 2.6%, and 2.6%. Our model also generally surpasses their performance on the 30% and 50% BindingDB datasets.
Keywords: DTI prediction; deep learning; embedding dictionary; multi-view strategy; transformer
Year: 2022 PMID: 35625572 PMCID: PMC9138327 DOI: 10.3390/biom12050644
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
Figure 1. Two flowcharts comparing traditional drug development and drug repositioning.
BindingDB dataset.
| Name | Positive Samples | Negative Samples | Total Samples | Number of Drugs | Number of Proteins |
|---|---|---|---|---|---|
| BindingDB (100%) | 6571 | 6571 | 13,142 | 7137 | 1253 |
BindingDB datasets of different proportions.
| Percent | Train/Valid/Test | Ratio of Positive and Negative Samples in Train/Valid/Test |
|---|---|---|
| 100% | 9200/1970/1972 | 1:1/1:1/1:1 |
| 50% | 4600/1970/1972 | 1:1/1:1/1:1 |
| 30% | 2770/1970/1972 | 1:1/1:1/1:1 |
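The table above keeps the validation and test sets fixed and shrinks only the training set while preserving the 1:1 class ratio. The source does not describe how the smaller training sets were drawn, so the following is only a minimal sketch of one plausible approach: random, class-balanced subsampling. The function name `subsample_balanced`, the tuple layout, and the toy data are all assumptions made for illustration.

```python
import random


def subsample_balanced(train, fraction, seed=0):
    """Return a class-balanced random subset holding about `fraction` of `train`.

    `train` is a list of (drug, protein, label) tuples with label 0 or 1.
    This layout is an assumption for illustration, not the authors' format.
    """
    rng = random.Random(seed)
    pos = [s for s in train if s[2] == 1]
    neg = [s for s in train if s[2] == 0]
    # Draw the same number of positives and negatives to keep the 1:1 ratio.
    k = int(round(min(len(pos), len(neg)) * fraction))
    subset = rng.sample(pos, k) + rng.sample(neg, k)
    rng.shuffle(subset)
    return subset


# Toy stand-in for the 9200-sample training set (placeholder names, not real data).
train = [(f"drug{i}", f"prot{i}", i % 2) for i in range(9200)]
half = subsample_balanced(train, 0.5)
print(len(half), sum(label for _, _, label in half))
```

With this sketch the 50% subset has 4600 samples, matching the table. The 30% row reports 2770 samples, slightly more than a strict 30% of 9200, so the authors' actual procedure may differ in detail.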
Figure 2. Data sample distribution of our customized BindingDB dataset.
Figure 3. Overall architecture of Multi-TransDTI.
Figure 4. The transformer architecture in our model.
Comprehensive performance of different models on 100% BindingDB.
| Methods | AUC | AUPR | ACC | F1-Score | Threshold |
|---|---|---|---|---|---|
| DNN | 0.875 | 0.852 | 0.805 | 0.812 | 0.351 |
| ModelCPI | 0.880 | 0.892 | 0.805 | 0.799 | 0.654 |
| Moltrans | 0.881 | 0.855 | 0.811 | 0.819 | 0.514 |
| DeepConv | 0.901 | 0.878 | 0.834 | 0.834 | 0.552 |
| Multi-TransDTI | 0.909 | 0.898 | 0.842 | 0.843 | 0.604 |
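The Threshold column suggests that ACC and F1-score are computed after binarizing each model's predicted scores at a model-specific cutoff, whereas AUC is threshold-free. The source does not state how the thresholds were chosen, so the sketch below only illustrates how such metrics are typically computed from scores and labels. The helper names and the toy scores are assumptions; only the cutoff 0.604 is taken from the table.

```python
def acc_f1_at_threshold(scores, labels, threshold):
    """Binarize scores at `threshold` and return (accuracy, F1-score)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    acc = (tp + tn) / len(labels)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return acc, f1


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This O(n^2) form is meant for small examples only.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy predictions (made up for illustration, not from the paper).
scores = [0.9, 0.7, 0.65, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
acc, f1 = acc_f1_at_threshold(scores, labels, threshold=0.604)
print(round(acc, 3), round(f1, 3), round(auc(scores, labels), 3))
```

Because ACC and F1-score depend on the cutoff while AUC and AUPR do not, models with different thresholds in the table are compared most directly through the threshold-free columns.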
Comprehensive performance of different models on 50% BindingDB.
| Methods | AUC | AUPR | ACC | F1-Score | Threshold |
|---|---|---|---|---|---|
| DNN | 0.853 | 0.836 | 0.789 | 0.794 | 0.521 |
| ModelCPI | 0.872 | 0.875 | 0.804 | 0.790 | 0.496 |
| Moltrans | 0.869 | 0.841 | 0.804 | 0.796 | 0.349 |
| DeepConv | 0.880 | 0.865 | 0.810 | 0.825 | 0.316 |
| Multi-TransDTI | 0.891 | 0.884 | 0.820 | 0.829 | 0.397 |
Comprehensive performance of different models on 30% BindingDB.
| Methods | AUC | AUPR | ACC | F1-Score | Threshold |
|---|---|---|---|---|---|
| DNN | 0.834 | 0.803 | 0.763 | 0.762 | 0.489 |
| ModelCPI | 0.860 | 0.860 | 0.784 | 0.787 | 0.387 |
| Moltrans | 0.849 | 0.818 | 0.767 | 0.783 | 0.364 |
| DeepConv | 0.868 | 0.840 | 0.793 | 0.800 | 0.355 |
| Multi-TransDTI | 0.871 | 0.860 | 0.799 | 0.802 | 0.553 |
Figure 5. Model comparisons of AUC and AUPR on the 30% and 50% BindingDB datasets (Our = Multi-TransDTI).
Figure 6. Our model achieves the best AUC and AUPR on the 100% BindingDB dataset (Our = Multi-TransDTI).
Figure 7. Comparisons of different models on ACC (Our = Multi-TransDTI).
Figure 8. Comparisons of different models on F1-score (Our = Multi-TransDTI).
Ablation experiments on 100% BindingDB dataset.
| Channels | AUC | AUPR | F1-Score | ACC |
|---|---|---|---|---|
| Protein_CNN | 0.905 | 0.893 | 0.836 | 0.836 |
| Protein_transformer | 0.893 | 0.878 | 0.838 | 0.830 |
| Drug_CNN | 0.896 | 0.888 | 0.836 | 0.829 |
| Drug_fingerprints | 0.905 | 0.894 | 0.837 | 0.833 |
| ALL | 0.909 | 0.898 | 0.842 | 0.843 |
Protein maximum length coverage.
| Maximum Embedding Length of Protein | Coverage on Training Set | Coverage on Validation Set | Coverage on Test Set | Coverage on All Sets |
|---|---|---|---|---|
| 600 | 85.8% | 84.9% | 85.0% | 85.5% |
| 700 | 92.5% | 92.8% | 91.9% | 92.4% |
| 800 | 96.2% | 96.4% | 96.1% | 96.2% |
Drug maximum length coverage.
| Maximum Embedding Length of Drug | Coverage on Training Set | Coverage on Validation Set | Coverage on Test Set | Coverage on All Sets |
|---|---|---|---|---|
| 80 | 87.3% | 88.5% | 88.8% | 87.7% |
| 90 | 91.7% | 92.8% | 92.1% | 91.9% |
| 100 | 93.0% | 93.8% | 92.8% | 93.1% |
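The two coverage tables can be read as the share of sequences that fit within a given maximum embedding length without truncation. That reading is an assumption, as the source does not define coverage explicitly. Under it, a minimal sketch of the coverage calculation and of the fixed-length padding or truncation step looks like this; the function names, the padding id 0, and the toy sequences are all hypothetical.

```python
def coverage(sequences, max_len):
    """Fraction of sequences whose length does not exceed `max_len`."""
    return sum(1 for s in sequences if len(s) <= max_len) / len(sequences)


def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Cut `token_ids` to `max_len`, or right-pad with `pad_id` up to `max_len`."""
    return token_ids[:max_len] + [pad_id] * (max_len - len(token_ids))


# Toy protein sequences (made up for illustration).
proteins = ["MKT" * 100, "MA" * 350, "G" * 900, "ACDE" * 150]
for limit in (600, 700, 800):
    print(limit, coverage(proteins, limit))

print(pad_or_truncate([5, 3, 8], 5))
print(pad_or_truncate([5, 3, 8, 2, 9, 1], 4))
```

A larger maximum length covers more sequences but also enlarges every padded input, which is the trade-off the tables quantify.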
Hyperparameters of our model.
| Hyperparameter | Range | Selected Value |
|---|---|---|
| Learning rate | [0.01,0.001,0.0001,0.0002] | 0.0001 |
| Decay rate | [0.01,0.001,0.0001] | 0.0001 |
| Activation function | [Sigmoid, ReLU, ELU] | ReLU, Sigmoid |
| Dropout rate | [0,0.1,0.2,0.3,0.4,0.5] | 0.2 |
| Epoch | 0–60 | 50 |
| Batch size | [8,16,32,64,128] | 16,32 |
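For readers who want to reuse these settings, the table can be written down as a plain configuration. The sketch below is only one possible encoding: the dictionary keys are hypothetical names, and where the table lists two selected values (activation function, batch size) both are kept, since the source does not say where each one applies.

```python
# Candidate values copied from the "Range" column of the table above.
search_space = {
    "learning_rate": [0.01, 0.001, 0.0001, 0.0002],
    "decay_rate": [0.01, 0.001, 0.0001],
    "activation": ["Sigmoid", "ReLU", "ELU"],
    "dropout": [0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "epochs": list(range(0, 61)),  # the table gives the range 0-60
    "batch_size": [8, 16, 32, 64, 128],
}

# Values copied from the "Selected Value" column; lists hold every value given.
selected = {
    "learning_rate": [0.0001],
    "decay_rate": [0.0001],
    "activation": ["ReLU", "Sigmoid"],
    "dropout": [0.2],
    "epochs": [50],
    "batch_size": [16, 32],
}


def check_selected(selected, search_space):
    """Map each hyperparameter name to True if all its selected values are in range."""
    return {
        name: all(value in search_space[name] for value in values)
        for name, values in selected.items()
    }


print(check_selected(selected, search_space))
```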