Cheng Wang, Wenyan Wang, Kun Lu, Jun Zhang, Peng Chen, Bing Wang.
Abstract
Drug-target interaction (DTI) prediction plays an important role in drug development. Experimental methods for identifying DTIs are time-consuming, expensive and challenging. To address these problems, machine learning-based methods have been introduced, but they are limited by the need for effective feature extraction and reliable negative sampling. In this work, electrotopological state (E-state) fingerprints are tested as features for drugs, and amphiphilic pseudo amino acid composition (APAAC) as features for target proteins. E-state fingerprints encode both the electronic and the topological features of a molecule with the same metric. APAAC is an extension of amino acid composition (AAC) that uses the hydrophilic and hydrophobic properties of residues to capture sequence-order information. Using the combination of these feature pairs, a prediction model is built with support vector machines. To enhance the effectiveness of the features, a distance-based negative sampling method is proposed to obtain reliable negative samples. The area under the receiver operating characteristic curve (AUC) is above 98.5% for all three datasets in this work. Comparison with state-of-the-art methods demonstrates the effectiveness and efficiency of the proposed method, which will be helpful for further drug development.
Keywords: APAAC; E-state fingerprints; distance-based negative sampling; drug-target interactions; support vector machines
Year: 2020 PMID: 32784497 PMCID: PMC7570185 DOI: 10.3390/ijms21165694
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
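The abstract names distance-based negative sampling but this record does not spell out the procedure. The following is a minimal sketch under stated assumptions: the "positive center" mentioned in the Figure 3 caption is taken to be the centroid of the positive feature vectors, distance is Euclidean, and the unlabeled pairs farthest from that center are kept as negatives. The function names and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_based_negatives(positives, unlabeled, n_negatives):
    """Return indices of the unlabeled pairs farthest from the positive center.

    Assumption: pairs far from the centroid of known interactions are
    less likely to be undiscovered positives.
    """
    center = centroid(positives)
    ranked = sorted(
        range(len(unlabeled)),
        key=lambda i: euclidean(unlabeled[i], center),
        reverse=True,  # farthest first
    )
    return ranked[:n_negatives]


# Toy 2-D feature vectors (hypothetical); real drug-target pair vectors
# would be the concatenated drug and protein descriptors.
positives = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
unlabeled = [[1.0, 1.5], [5.0, 5.0], [1.2, 0.8], [-4.0, 1.0], [9.0, 1.0]]

selected = distance_based_negatives(positives, unlabeled, n_negatives=3)
print(selected)  # -> [4, 1, 3]
```

The selected negatives and the known positives would then form the training set for the classifier. Random sampling, the baseline labelled "Ran-" in the tables, would instead draw the same number of indices uniformly from the unlabeled pairs.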
Results of the proposed method. Prec., Rec., Acc., F1., MCC and AUC denote precision, recall, accuracy, F1-score, Matthews correlation coefficient and area under the ROC curve, respectively.
| Metric | Enzyme | GPCR | Ion Channel | Nuclear Receptor |
|---|---|---|---|---|
| Prec. (%) | 100.00 ± 0.00 # | 100.0 ± 0.00 | 100.0 ± 0.00 | 100.0 ± 0.00 |
| Rec. (%) | 97.85 ± 0.01 | 94.38 ± 0.28 | 95.46 ± 0.03 | 91.50 ± 0.68 |
| Acc. (%) | 98.92 ± 0.01 | 97.19 ± 0.14 | 97.73 ± 0.02 | 95.75 ± 0.34 |
| F1. (%) | 98.91 ± 0.01 | 97.11 ± 0.15 | 97.68 ± 0.02 | 95.56 ± 0.37 |
| MCC (%) | 97.87 ± 0.01 | 94.53 ± 0.27 | 95.56 ± 0.03 | 91.83 ± 0.63 |
| AUC (%) | 99.58 ± 0.02 | 98.66 ± 0.09 | 98.57 ± 0.07 | 98.51 ± 0.30 |
# Values are reported as mean ± standard deviation.
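For reference, the threshold-based metrics in the table can be computed from confusion-matrix counts with their standard definitions. The sketch below uses hypothetical counts, not the paper's data, and assumes no denominator is zero.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts: 100 positives, 100 negatives, no false positives.
metrics = classification_metrics(tp=90, fp=0, tn=100, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With no false positives and equally sized classes, accuracy reduces to (1 + recall) / 2. The reported values fit that pattern (for example, 91.50% recall and 95.75% accuracy for Nuclear Receptor), although the class balance of the evaluation sets is not stated in this record.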
Figure 1. Distribution of results on the benchmark datasets. Prec., Rec., Acc., F1. and MCC are abbreviations of precision, recall, accuracy, F1-score and Matthews correlation coefficient, respectively. AUC indicates the area under the ROC curve.
Figure 2. Fluctuation of AUC values on the four sub-datasets over 100 runs.
Comparison with state-of-the-art methods on the gold standard datasets, reported as AUC. Ran-proposed and Dis-proposed denote the proposed method with random and distance-based sampling of negative DTIs, respectively.
| Category | Method | Enzyme | GPCR | Ion Channel | Nuclear Receptor | Dimension of Features |
|---|---|---|---|---|---|---|
| Similarity-based | KBMF2K | 0.832 | 0.857 | 0.799 | 0.824 | - |
| | NetCBP | 0.825 | 0.823 | 0.803 | 0.839 | - |
| | Bigram | 0.948 | 0.872 | 0.889 | 0.869 | - |
| | PUDT | 0.884 | 0.878 | 0.831 | 0.885 | - |
| Feature vector-based | Cao et al. | 0.948 | 0.890 | 0.872 | 0.878 | 343 |
| | Wang et al. | 0.943 | 0.874 | 0.911 | 0.818 | 1281 |
| | MFDR | 0.969 | 0.904 | 0.933 | 0.886 | 1448/2330 |
| | FRnet-DTI | 0.976 | 0.948 | 0.951 | 0.924 | 4096 |
| | Ran-proposed | 0.973 | 0.926 | 0.967 | 0.928 | 159 |
| | Dis-proposed | 0.996 | 0.987 | 0.986 | 0.985 | 159 |
Comparison with DeepDTI and Hu et al. Ran-proposed and Dis-proposed denote the proposed method with random and distance-based sampling of negative DTIs, respectively. TPR, TNR, Acc. and AUC denote true positive rate, true negative rate, accuracy and area under the ROC curve.
| Methods | TPR (%) | TNR (%) | Acc. (%) | AUC (%) |
|---|---|---|---|---|
| DeepDTI | 82.27 ± 0.65 # | 89.53 ± 1.30 | 85.88 ± 0.49 | 91.58 ± 0.59 |
| Hu et al. (random sampling) | 91.94 ± 0.91 | 91.14 ± 1.96 | 88.14 ± 0.75 | 95.27 ± 0.43 |
| Hu et al. (distance-based sampling) | 97.09 ± 0.67 | 96.86 ± 1.29 | 96.04 ± 0.32 | 99.47 ± 0.21 |
| Ran-proposed | 81.67 ± 2.33 | 81.71 ± 2.51 | 81.69 ± 1.72 | 89.05 ± 1.30 |
| Dis-proposed | 99.80 ± 0.30 | 99.97 ± 0.06 | 99.89 ± 0.14 | 99.98 ± 0.04 |
# Values are reported as mean ± standard deviation.
Results on the independent dataset from ChEMBL. Ran-ChEMBL and Dis-ChEMBL denote the experiments with random and distance-based sampling of negative DTIs, respectively.
| Methods | Prec. (%) | Rec. (%) | Acc. (%) | F1. (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|---|
| Ran-ChEMBL | 72.48 ± 4.39 # | 90.14 ± 1.22 | 77.68 ± 3.77 | 80.23 ± 2.75 | 57.34 ± 6.68 | 92.05 ± 1.35 |
| Dis-ChEMBL | 99.86 ± 0.24 | 98.99 ± 0.02 | 99.41 ± 0.13 | 99.42 ± 0.12 | 98.86 ± 0.25 | 99.83 ± 0.02 |
# Values are reported as mean ± standard deviation.
Figure 3. Comparison of two methods for calculating the positive center. With-PCA and Without-PCA denote the calculation with and without PCA, respectively. (a) Distribution of the results of the two methods. (b) Trend of the two methods over 10 repetitions.
Figure 4. ROC curves of the two negative sampling methods. Ran-negative and Dis-negative denote random and distance-based negative sampling, respectively.
Figure 5. Robustness of the two negative sampling methods. Ran-negative and Dis-negative denote random and distance-based negative sampling, respectively, each repeated 100 times.
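The AUC values reported throughout summarise ROC curves such as those in Figure 4. As a reminder of what the number means, AUC equals the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as one half. The sketch below computes it directly from that definition; the scores are hypothetical and the function name is illustrative.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5. This O(n*m) form is meant for illustration,
    not for large datasets.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores for three positive and three negative pairs.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2]

auc = auc_from_scores(pos, neg)
print(f"AUC = {auc:.3f}")  # 8 of 9 pairs are ordered correctly -> 0.889
```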
Statistics of gold standard datasets.
| Statistic | Enzyme | GPCR | Ion Channel | Nuclear Receptor |
|---|---|---|---|---|
| Drugs | 445 | 223 | 210 | 54 |
| Targets | 664 | 95 | 204 | 26 |
| Positive Interactions | 2926 | 635 | 1476 | 90 |
| Total DT-pairs | 295,480 | 21,180 | 42,840 | 1,404 |
| Proportion of positives | 0.99% | 3.00% | 3.45% | 6.41% |
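The last row of the table follows from the two rows above it: the proportion of positives is the number of known interactions divided by the total number of drug-target pairs. The short check below reproduces it from the tabulated counts; the dictionary layout and function name are illustrative only.

```python
# Counts copied from the statistics table above.
gold_standard = {
    "Enzyme": {"positives": 2926, "total_pairs": 295480},
    "GPCR": {"positives": 635, "total_pairs": 21180},
    "Ion Channel": {"positives": 1476, "total_pairs": 42840},
    "Nuclear Receptor": {"positives": 90, "total_pairs": 1404},
}


def positive_proportion(positives, total_pairs):
    """Fraction of all drug-target pairs that are known interactions."""
    return positives / total_pairs


proportions = {
    name: f"{positive_proportion(**counts):.2%}"
    for name, counts in gold_standard.items()
}
for name, share in proportions.items():
    print(f"{name}: {share}")
```

Known interactions make up only about 1% to 6% of all pairs, so the unlabeled pool is far larger than the positive set. That imbalance is why the choice of negative samples matters.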
Figure 6. Flowchart of the proposed method.