Heba El-Behery, Abdel-Fattah Attia, Nawal El-Fishawy, Hanaa Torkey.
Abstract
BACKGROUND: Recently, drug repositioning has received considerable attention because of its advantages for the pharmaceutical industry in drug development. Artificial intelligence techniques have greatly enhanced drug repositioning by discovering therapeutic drug profiles, side effects, and new target proteins. However, as the number of drugs increases, their targets and the enormous number of interactions produce imbalanced data that may not be suitable as direct input to a prediction model.
Keywords: Data balancing; Drug–target interaction; Machine learning; Support vector machine
Year: 2022 PMID: 35941686 PMCID: PMC9361677 DOI: 10.1186/s13036-022-00296-7
Source DB: PubMed Journal: J Biol Eng ISSN: 1754-1611 Impact factor: 6.248
Summary and comparison of DTI prediction methods for identifying interactions, relative to our presented framework
| Paper | Drug feature and protein feature | Method for negative samples | Description | Method |
|---|---|---|---|---|
| DTI-SNNFRA [ (2021) | Protein: amino acid, pseudo-amino acid, and CTD | First, the similarity between the drugs and the proteins is computed; then shared nearest neighbors and k-medoids clustering are applied. | The similarity between the drugs and the proteins is computed first, followed by shared nearest neighbors and k-medoids clustering; the RUSBoost classifier is used in the prediction stage. | 1. Shared nearest neighbors 2. RUSBoost classifier |
| DeepCon [ (2019) | | Depends on the similarity between the drugs and the proteins; the distance between each drug and protein is then computed. | First, the distance is computed from the similarity of drug and target features to predict negative samples and achieve class balance; second, a DBN is applied in the prediction stage. | 1. Similarity of drug and target features 2. Deep belief network (DBN) |
| Idti-MLKdr [ (2021) | | The molecular similarity of drug and target features is evaluated with the Tanimoto coefficient (TC). The Cluster-Based Molecular Similarity algorithm then calculates and selects the top-ranked drugs and targets. | The Tanimoto coefficient (TC) measures the similarity between drugs and between proteins. A cluster algorithm is then applied, followed by multikernel learning (MKL). | 1. Cluster algorithm 2. Multikernel learning (MKL) |
| PreDTIs [ (2021) | | An SVM classifier is used; the Euclidean distance is then calculated between the predicted and real feature values. | An SVM classifier is used; the Euclidean distance between the real and predicted values is then calculated, and LightGBM is used for prediction. | 1. Euclidean distance 2. LightGBM classifier |
| [ | | Negative samples are randomly selected, equal in number to the positive samples. | Negative samples are randomly selected, equal in number to the positive samples. Rotation forest is applied for prediction. | 1. Rotation forest |
| [ | | The negative sample set consists of the same number of randomly selected pairs of unrelated drugs and proteins. | Negative samples are randomly selected. Random Forest is applied for prediction. | 1. Random Forest classifier |
| [ | fingerprints (MFs). | The negative dataset can be randomly selected from the DTS. | Negative samples are randomly selected. A deep belief network is applied for prediction. | 1. Deep belief network (DBN) |
| [ | | The Euclidean distance from every unlabeled sample to the positive center is calculated and sorted; the farther a sample is, the more likely it is to be negative. | The Euclidean distance from every unlabeled sample to the positive center is calculated. Support vector machines (SVM) are applied for prediction. | 1. Euclidean distance 2. Support vector machines (SVM) |
Fig. 1 The proposed framework model: A) the overall prediction framework; 1) the feature extraction and preprocessing stage for the DTI dataset; 2) the negative-sample prediction stage; and 3) the stage that applies the prediction algorithms
Fig. 2 Pseudocode for predicting negative samples using a one-class SVM classifier
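
The pseudocode of Fig. 2 is not reproduced in this record, so the following is only a minimal sketch of one plausible reading of the idea: fit a one-class SVM on the known positive drug–target pairs, then treat unlabeled pairs that the model rejects as candidate negatives. It assumes scikit-learn's `OneClassSVM`; the function name, the `nu` value, the scaling step, and the synthetic data are all hypothetical and not taken from the paper.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def candidate_negative_indices(positive_X, unlabeled_X, nu=0.05):
    """Fit a one-class SVM on known positives; return indices of unlabeled rows it rejects."""
    scaler = StandardScaler().fit(positive_X)
    # nu is an upper bound on the fraction of training points treated as outliers.
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu)
    model.fit(scaler.transform(positive_X))
    # predict() returns +1 for points that resemble the positives and -1 for outliers.
    labels = model.predict(scaler.transform(unlabeled_X))
    return np.flatnonzero(labels == -1)


# Synthetic stand-in data (hypothetical): each row is a drug-target pair feature vector.
rng = np.random.default_rng(0)
positive_X = rng.normal(0.0, 1.0, size=(200, 8))
unlabeled_X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 8)),  # resembles the positives
    rng.normal(6.0, 1.0, size=(50, 8)),  # far from the positives
])
negative_idx = candidate_negative_indices(positive_X, unlabeled_X)
print(f"{negative_idx.size} of {len(unlabeled_X)} unlabeled pairs flagged as candidate negatives")
```

The flagged pairs could then be paired with the positives to form a balanced training set for the downstream classifiers.
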
DrugBank dataset statistics
| Drug | Protein | Positive interaction |
|---|---|---|
| 11150 | 5260 | 19866 |
Four feature sets of the drug–target interaction
| Feature set | Drug feature | Protein feature | Number of features |
|---|---|---|---|
| Feature set [1] | Morgan fingerprint | Amino acid composition | 1044 |
| Feature set [2] | Morgan fingerprint | Dipeptide composition | 1424 |
| Feature set [3] | Constitution | Amino acid composition | 50 |
| Feature set [4] | Constitution | Dipeptide composition | 430 |
| All feature sets | Morgan fingerprint + Constitution | Amino acid composition + Dipeptide composition | 1474 |
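
The feature counts in the table are consistent with concatenating one drug vector and one protein vector per pair. The sketch below checks that arithmetic. The per-descriptor lengths are assumptions inferred from the reported totals, not values stated in the table: a 1024-bit Morgan fingerprint, 30 constitutional descriptors, 20 amino acid composition values (one per standard amino acid), and 400 dipeptide composition values (20 × 20).

```python
# Assumed per-descriptor lengths (inferred from the table totals, not stated in the source).
DRUG_DIMS = {"Morgan fingerprint": 1024, "Constitution": 30}
PROTEIN_DIMS = {"Amino acid composition": 20, "Dipeptide composition": 20 * 20}

# Feature sets as listed in the table: (drug descriptors, protein descriptors, reported total).
FEATURE_SETS = {
    "Feature set [1]": (["Morgan fingerprint"], ["Amino acid composition"], 1044),
    "Feature set [2]": (["Morgan fingerprint"], ["Dipeptide composition"], 1424),
    "Feature set [3]": (["Constitution"], ["Amino acid composition"], 50),
    "Feature set [4]": (["Constitution"], ["Dipeptide composition"], 430),
    "All feature sets": (
        ["Morgan fingerprint", "Constitution"],
        ["Amino acid composition", "Dipeptide composition"],
        1474,
    ),
}


def concatenated_length(drug_parts, protein_parts):
    """Length of the vector obtained by concatenating the listed descriptors."""
    return sum(DRUG_DIMS[d] for d in drug_parts) + sum(PROTEIN_DIMS[p] for p in protein_parts)


computed = {name: concatenated_length(d, p) for name, (d, p, _) in FEATURE_SETS.items()}
for name, (_, _, reported) in FEATURE_SETS.items():
    print(f"{name}: computed={computed[name]}, reported={reported}")
```
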
Evaluation results of negative sample prediction using one-class SVM
| Method | Precision | Recall | F-score | Accuracy |
|---|---|---|---|---|
| One-class SVM | 1 | 0.989 | 0.995 | 0.989 |
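
Precision, recall, F-score, and accuracy all derive from confusion-matrix counts. A minimal sketch of the standard formulas follows; the counts in the example are hypothetical and chosen only for illustration, since this record reports the scores but not the underlying counts.

```python
def classification_scores(tp, fp, fn, tn):
    """Return precision, recall, F-score, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "accuracy": accuracy,
    }


# Hypothetical counts, for illustration only.
scores = classification_scores(tp=90, fp=5, fn=10, tn=95)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```
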
Evaluation results for the drug–target interaction feature sets using machine learning and ensemble algorithms, reported as precision, recall, F-score, and accuracy
| Feature set | Prediction algorithms | Precision | Recall | F-score | Accuracy |
|---|---|---|---|---|---|
| Feature set [1] | SVM | 0.995 | 0.995 | 0.995 | 0.996 |
| | RF | 0.9996 | 0.9996 | 0.9996 | 0.9997 |
| | AB | | | | |
| | XG | 0.9994 | 0.9995 | 0.9995 | 0.9996 |
| | Light | 0.9997 | 0.9997 | 0.9997 | 0.9998 |
| Feature set [2] | SVM | 0.9992 | 0.9992 | 0.9992 | 0.9991 |
| | RF | 0.9996 | 0.9996 | 0.9996 | 0.9996 |
| | AB | | | | |
| | XG | 0.9995 | 0.9995 | 0.9995 | 0.9996 |
| | Light | 0.9996 | 0.9996 | 0.9996 | 0.9997 |
| Feature set [3] | SVM | 0.992 | 0.992 | 0.992 | 0.992 |
| | RF | | | | |
| | AB | | | | |
| | XG | 0.999 | 0.999 | 0.999 | 0.9988 |
| | Light | 0.9989 | 0.9989 | 0.9989 | 0.9987 |
| Feature set [4] | SVM | 0.951 | 0.948 | 0.948 | 0.942 |
| | RF | 0.999 | 0.999 | 0.999 | |
| | AB | | | | |
| | XG | 0.999 | 0.999 | 0.999 | 0.9987 |
| | Light | 0.9988 | 0.9988 | 0.9988 | 0.998 |
| All feature sets | SVM | 0.993 | 0.993 | 0.993 | 0.994 |
| | RF | 0.9992 | 0.9992 | 0.9992 | |
| | XG | 0.998 | 0.998 | 0.998 | 0.998 |
| | Light | 0.9991 | 0.9991 | 0.9991 | 0.999 |
Area under the curve (AUC), mean square error, and Matthews correlation coefficient (MCC) achieved by the different techniques
| Feature set | Prediction algorithms | AUC | Mean square error | MCC |
|---|---|---|---|---|
| Feature set [1] | SVM | 0.9954 | 0.0047 | 0.99 |
| | RF | 0.9996 | 0.00038 | 0.9993 |
| | AB | | | |
| | XG | 0.9995 | 0.0005 | 0.9991 |
| | Light | 0.9997 | 0.0003 | |
| Feature set [2] | SVM | 0.981 | 0.0008 | 0.998 |
| | RF | 0.9996 | 0.00035 | 0.9993 |
| | AB | | | |
| | XG | 0.9995 | 0.0004 | 0.9994 |
| | Light | 0.9996 | 0.0004 | 0.9991 |
| Feature set [3] | SVM | 0.976 | 0.0082 | 0.984 |
| | RF | | | |
| | AB | | | |
| | XG | 0.999 | 0.0009 | 0.9982 |
| | Light | 0.9989 | 0.001 | 0.9979 |
| Feature set [4] | SVM | 0.949 | 0.051 | 0.8997 |
| | RF | 0.999 | 0.0009 | |
| | AB | | | |
| | XG | 0.999 | 0.0009 | |
| | Light | 0.9988 | 0.001 | 0.997 |
| All feature sets | SVM | 0.993 | 0.007 | 0.986 |
| | RF | 0.9992 | 0.0008 | 0.999 |
| | XG | 0.998 | 0.0018 | 0.996 |
| | Light | 0.9991 | 0.00085 | 0.998 |
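
The three quantities in this table can be computed from true labels, hard predictions, and classifier scores. The sketch below uses the standard definitions with a small hypothetical example; the labels, scores, and the 0.5 threshold are illustrative assumptions, not data from the paper. Note that for hard 0/1 predictions the mean square error equals the misclassification rate, that is, 1 minus accuracy.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0


def mean_square_error(y_true, y_pred):
    """Mean of squared differences between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rank_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # assumed decision threshold of 0.5

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

mcc_value = mcc(tp, fp, fn, tn)
mse_value = mean_square_error(y_true, y_pred)
auc_value = rank_auc(y_true, y_score)
print(f"AUC={auc_value}, MSE={mse_value}, MCC={mcc_value}")
```
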
Fig. 3 ROC curves and AUC values for the learning techniques. The AdaBoost method achieved the highest AUC (0.9998) for feature sets [1] and [2]
Fig. 4 ROC curves and AUC values for the learning techniques. AdaBoost and Random Forest achieved the highest AUC (0.9993) for feature set [3]; for feature set [4], AdaBoost achieved the highest AUC (0.9992)
Fig. 5 ROC curves and AUC values for the learning techniques. The AdaBoost method achieved the highest AUC (0.9993) for all feature sets combined
Fig. 6 Results of applying the feature importance stage before the classifier. The XGBoost method obtained the highest score for feature set [2] with the Random Forest classifier, whereas the genetic method obtained the highest score for feature set [1] with the AdaBoost classifier
Fig. 7 Results of applying the feature analysis stage with random undersampling and SMOTE oversampling on feature set [3]. Random Forest and AdaBoost obtained the highest performance in all feature analyses
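
For readers unfamiliar with the two resampling strategies named in Fig. 7, the sketch below shows the general idea of each in plain Python. It is not the authors' implementation: the function names, the neighbour count `k`, and the synthetic two-dimensional data are hypothetical. Random undersampling discards majority samples until the classes match; SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def random_undersample(majority, minority, rng):
    """Keep a random subset of the majority class with the same size as the minority class."""
    return rng.sample(majority, len(minority)), list(minority)


def smote(minority, n_new, k, rng):
    """Create n_new synthetic samples; requires at least two minority samples."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself).
        neighbours = sorted(
            (x for x in minority if x is not base),
            key=lambda x: math.dist(base, x),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical imbalanced data: 50 majority and 10 minority points in two dimensions.
rng = random.Random(0)
majority = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
minority = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(10)]

kept_majority, kept_minority = random_undersample(majority, minority, rng)
synthetic = smote(minority, n_new=len(majority) - len(minority), k=3, rng=rng)
oversampled_minority = minority + synthetic
print(len(kept_majority), len(kept_minority), len(oversampled_minority))
```

Undersampling shrinks the training set to twice the minority size, while SMOTE grows the minority class to the majority size without duplicating existing rows.
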
Fig. 8 Comparison between related works and the proposed work (feature set [2])