Yu-Fang Zhang1, Xiangeng Wang1, Aman Chandra Kaushik1,2, Yanyi Chu1, Xiaoqi Shan1, Ming-Zhu Zhao3, Qin Xu1, Dong-Qing Wei1,4.
Abstract
Drug discovery is an academic and commercial process of global importance. Accurate identification of drug-target interactions (DTIs) can significantly facilitate the drug discovery process. Compared with costly, labor-intensive and time-consuming experimental methods, machine learning (ML) plays an increasingly important role in the effective, efficient and high-throughput identification of DTIs. However, upstream feature extraction methods require tremendous human resources and expert insight, which limits the application of ML approaches. Inspired by unsupervised representation learning methods such as Word2vec, we propose SPVec, a novel way to automatically represent raw data such as SMILES strings and protein sequences as continuous, information-rich and lower-dimensional vectors, avoiding the sparseness and bit collisions of cumbersome manually extracted features. Visualization of SPVec illustrated that similar compounds or proteins occupy similar regions of the vector space, which indicates that SPVec not only encodes compound substructures or protein sequences efficiently, but also implicitly reveals some important biophysical and biochemical patterns. Compared with manually designed features such as MACCS fingerprints and amino acid composition (AAC), SPVec showed better performance with several state-of-the-art machine learning classifiers such as Gradient Boosting Decision Tree (GBDT), Random Forest (RF) and Deep Neural Network (DNN) on BindingDB. The performance and robustness of SPVec were also confirmed on independent test sets obtained from the DrugBank database. In addition, based on the whole DrugBank dataset, we predicted the likelihood of interaction for all unlabeled drug-target pairs; two of the top five predicted novel DTIs were supported by external evidence. These results indicate that SPVec can provide an effective and efficient way to discover reliable DTIs, which would be beneficial for drug reprofiling.
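The abstract treats raw sequences as text for a Word2vec-style model. A minimal sketch of the preprocessing step that such a model needs is shown here, assuming the 3-gram scheme used by the original ProtVec method (the abstract itself does not state the tokenization, and SMILES strings may be split into substructure "words" differently). The function name `kmer_sentences` and the toy peptide are hypothetical; training the embedding itself is not shown.

```python
from collections import Counter


def kmer_sentences(sequence, k=3):
    """Split a sequence into k shifted 'sentences' of non-overlapping k-mers.

    Assumption: k=3 follows the original ProtVec scheme; the abstract above
    does not specify the tokenization used by SPVec.
    """
    sentences = []
    for offset in range(k):
        # Step through the sequence in strides of k, starting at this offset.
        words = [sequence[i:i + k]
                 for i in range(offset, len(sequence) - k + 1, k)]
        if words:
            sentences.append(words)
    return sentences


# Toy peptide invented for illustration only; not taken from any dataset.
toy_protein = "MKTAYIAKQR"
corpus = kmer_sentences(toy_protein, k=3)

# Vocabulary counts of the kind a Word2vec-style trainer would build first.
vocabulary = Counter(word for sentence in corpus for word in sentence)
```

Each inner list in `corpus` plays the role of a sentence, and each 3-mer plays the role of a word, so any standard Word2vec implementation could consume it directly.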
Keywords: Word2vec; drug-target interaction; feature embedding; machine learning; representation learning
Year: 2020 PMID: 31998687 PMCID: PMC6967417 DOI: 10.3389/fchem.2019.00895
Source DB: PubMed Journal: Front Chem ISSN: 2296-2646 Impact factor: 5.221
Figure 1. Flowchart of the whole pipeline for DTI prediction in this article (left) in comparison with the traditional pipeline (right), with the feature representation steps boxed in dashed lines.
Number of entries in the five datasets obtained from DrugBank.
| Drug | 6,068 | 6,068 | 537 | 6,068 | 537 |
| Target | 3,839 | 3,839 | 3,839 | 160 | 160 |
| Interactions | 15,434 | 3,348 | 1,735 | 264 | 37 |
Figure 2. Biochemical implications from SMILES2Vec features. (A) Visualization of the SMILES2Vec vector space of drugs in DrugBank using t-SNE. (B) The 10 drugs most similar to Acetophenazine (DrugBank ID: DB01063) according to their SMILES2Vec vectors. Red values show their cosine distances to Acetophenazine. The smaller the value, the more similar the chemical structures.
Figure 3. Normalized distributions of biochemical and biophysical properties in a 2D space projected by t-SNE from the 100-dimensional ProtVec protein space. In these plots, each point represents a protein, and the color indicates the value of each property.
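The similarity ranking described in panel (B) can be sketched as a nearest-neighbor search under cosine distance. The example below is illustrative only: the three-dimensional vectors and the identifiers `drug_a` to `drug_d` are invented, whereas the vectors in the figure are SMILES2Vec embeddings of DrugBank compounds.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def most_similar(query_id, embeddings, top_n=10):
    """Rank every other entry by cosine distance to the query, closest first."""
    query = embeddings[query_id]
    scored = [(cosine_distance(query, vec), other_id)
              for other_id, vec in embeddings.items()
              if other_id != query_id]
    return sorted(scored)[:top_n]


# Hypothetical toy embeddings; real SMILES2Vec vectors are 100-dimensional.
toy_embeddings = {
    "drug_a": [1.0, 0.0, 0.0],
    "drug_b": [0.9, 0.1, 0.0],
    "drug_c": [0.0, 1.0, 0.0],
    "drug_d": [-1.0, 0.0, 0.0],
}
neighbours = most_similar("drug_a", toy_embeddings, top_n=2)
```

Here `neighbours` holds `(distance, id)` pairs, so the smallest distance comes first, matching the caption's reading that smaller values mean more similar structures.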
Classification performance of four feature combinations using three classifiers on BindingDB under 10 × 5-fold cross-validation, with the highest scores highlighted in bold.
| SPVec (SMILES2Vec-ProtVec) | GBDT | 0.9923 | 0.9695 | |||
| RF | 0.9675 | 0.9540 | 0.9672 | |||
| DNN | 0.9617 | 0.9332 | 0.9287 | 0.9248 | 0.9197 | |
| SMILES2Vec-AAC | GBDT | |||||
| RF | 0.8770 | 0.7974 | 0.8657 | 0.7050 | 0.7772 | |
| DNN | 0.8708 | 0.8124 | 0.7993 | 0.7879 | 0.7126 | |
| MACCS-ProtVec | GBDT | |||||
| RF | 0.9302 | 0.8542 | 0.8712 | 0.8322 | 0.8512 | |
| DNN | 0.9136 | 0.8034 | 0.8025 | 0.8097 | 0.8074 | |
| MACCS-AAC | GBDT | |||||
| RF | 0.8360 | 0.7468 | 0.8366 | 0.6150 | 0.7089 | |
| DNN | 0.8451 | 0.7832 | 0.7884 | 0.7726 | 0.7724 |
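The "10 × 5-fold cross-validation" protocol named in the table caption is read here as 5-fold cross-validation repeated 10 times with a fresh shuffle each time. The sketch below generates the index splits for that protocol using only the standard library. It is an assumption that plain (non-stratified) folds are acceptable for illustration; the source does not say whether the folds were stratified, and `repeated_kfold` is a hypothetical helper name.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) tuples.

    Assumption: simple shuffled folds without stratification by class label.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices into n_splits folds of near-equal size.
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for held_out in range(n_splits):
            test_idx = folds[held_out]
            train_idx = [i for f, fold in enumerate(folds)
                         if f != held_out for i in fold]
            yield repeat, held_out, train_idx, test_idx


# 23 is an arbitrary toy sample count, not a figure from the paper.
splits = list(repeated_kfold(n_samples=23))
```

Averaging a metric over all 50 train/test pairs gives one score per classifier and feature combination, which is the kind of value the table reports.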
AUCs of SPVec and other models on DTI predictions using DrugBank.
| Drug structure information | 2,216 | AAC, DC | 11,943 | DNN | 0.81 | You et al., |
| Constitutional, topological and molecular descriptors, 2D autocorrelations, topological charge indices, eigenvalue-based indices | 1,664 | AAC; DC | 1,080 | RF | 0.8950 | Yu et al., |
| Constitutional, topological and geometrical descriptors | 193 | AAC; DC | 1,260 | DT | 0.760 | Ezzat et al., |
| PubChem fingerprints indicating presence or absence of 881 known chemical substructures | 881 | Fingerprints of 876 different protein domains that are obtained from the Pfam database | 876 | EnsemDT | 0.882 | Ezzat et al., |
| RF | 0.855 | |||||
| SMILES2Vec | 100 | ProtVec | 100 | GBDT | 0.9467 | This work |
| RF | 0.9469 | |||||
| DNN | 0.8637 |
DC, dipeptide composition;
TC, tripeptide composition;
These models were trained on different versions of DrugBank, so their AUCs are given for reference only.
Classification performance of three classifiers on datasets obtained from DrugBank, with the highest scores highlighted in bold.
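The manually designed protein descriptors abbreviated in the footnotes are simple frequency vectors. As a minimal sketch, amino acid composition (AAC) gives the fraction of each of the 20 standard residues, and dipeptide composition (DC) gives the fraction of each of the 400 ordered residue pairs; tripeptide composition follows the same pattern with triples. The function names and the toy peptide are hypothetical, and the code assumes sequences that contain only the 20 standard one-letter codes.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of residue frequencies."""
    length = len(sequence)
    return [sequence.count(aa) / length for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of adjacent-pair frequencies."""
    # Overlapping pairs: positions (0,1), (1,2), ..., (n-2, n-1).
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    total = len(pairs)
    return [pairs.count(a + b) / total
            for a, b in product(AMINO_ACIDS, repeat=2)]


# Toy peptide invented for illustration only.
toy_protein = "MKTAYIAKQR"
aac = amino_acid_composition(toy_protein)
dc = dipeptide_composition(toy_protein)
```

Both vectors have a fixed length regardless of protein size, but most DC entries are zero for short sequences, which illustrates the sparseness that the abstract contrasts with learned embeddings.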
| Dataset_1 | GBDT | 0.9506 | 0.9367 | |||
| RF | 0.9234 | 0.9378 | 0.9337 | |||
| DNN | 0.8952 | 0.8732 | 0.8345 | 0.8437 | 0.8654 | |
| Dataset_2 | GBDT | 0.8628 | |||
| RF | 0.8930 | 0.8645 | 0.8467 | 0.8555 |
| DNN | 0.8201 | 0.8026 | 0.8138 | 0.8199 | 0.8144 | |
| Dataset_3 | GBDT | ||||||
| RF | 0.7448 | 0.7299 | 0.7198 | 0.7243 | 0.7230 |
| DNN | 0.6999 | 0.6922 | 0.6825 | 0.6798 | 0.6832 | |
| Dataset_4 | GBDT | ||||||
| RF | 0.7235 | 0.7034 | 0.7108 | 0.7078 | 0.71 |
| DNN | 0.7173 | 0.6899 | 0.6884 | 0.6896 | 0.6866 | |
| Dataset_5 | GBDT | ||||||
| RF | 0.5689 | 0.5605 | 0.5398 | 0.5321 | 0.5411 |
| DNN | 0.6267 | 0.6098 | 0.607 | 0.6122 | 0.6114 | |
Figure 4. ROC curves of the different models on the test sets obtained from DrugBank.
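The area under a ROC curve (AUC), the metric reported for these models, equals the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as one half. The sketch below computes it directly from that definition. The labels and scores are invented toy values, not model outputs from the paper, and the quadratic-time loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Compute ROC AUC as the fraction of correctly ordered positive/negative pairs."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical toy predictions: 1 = interacting pair, 0 = non-interacting pair.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = roc_auc(toy_labels, toy_scores)
```

In the toy data, eight of the nine positive/negative pairs are ordered correctly, so `auc` is 8/9. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.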
Top five novel DTIs predicted by SPVec-GBDT.
| DB11805 | P07947 | Saracatinib | The tyrosine-protein kinase Yes | Patel et al., |
| DB09282 | P42262 | Molsidomine | Glutamate receptor 2 | None |
| DB05524 | Q99640 | Pelitinib | Membrane-associated tyrosine and threonine-specific cdc2-inhibitory kinase | |
| DB03017 | Q16620 | Lauric acid | BDNF/NT-3 growth factors receptor | None |
| DB13165 | P11362 | Ripasudil | Fibroblast growth factor receptor 1 | None |