Narumi Watanabe, Yuuto Ohnuki, Yasubumi Sakakibara
Abstract
MOTIVATION: Virtual screening, which computationally predicts the presence or absence of protein-compound interactions, has attracted attention as a large-scale, low-cost, and fast method of searching for seed compounds. Existing machine learning methods for predicting protein-compound interactions fall largely into two groups: those based on molecular structure data and those based on network data. The former use information on the proteins and compounds themselves, such as amino acid sequences and chemical structures; the latter rely on interaction network data, such as protein-protein interactions and compound-compound interactions. However, there have been few attempts to combine the two types of data, that is, molecular information and interaction networks.
Keywords: Deep learning; Heterogeneous interaction network; Integration; Protein–compound interaction
Year: 2021 PMID: 33933121 PMCID: PMC8088618 DOI: 10.1186/s13321-021-00513-3
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 8.489
Fig. 1 Deep learning architecture that integrates molecular structure data and interactome data to predict protein–compound interactions. It integrates graph- and sequence-based representations for the target protein and compound. The amino acid sequence of the protein input was encoded as one-hot vectors with a height of 20 dimensions. The ECFP representation of the compound input was embedded into a 1024-dimensional vector. Feature vectors were also extracted from the protein–protein and compound–compound interaction networks using node2vec, a feature representation learning method for graphs. These feature vectors were combined into a protein vector and a compound vector. The interaction was predicted in the output unit.
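The one-hot encoding of the protein input described in the Fig. 1 caption can be illustrated with a minimal sketch. The residue ordering in `AMINO_ACIDS`, the function name `one_hot_encode`, and the all-zero column for non-standard residues are assumptions made for illustration; the caption does not specify them.

```python
# Assumed ordering of the 20 standard amino acids (not specified in the caption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a 20 x len(sequence) matrix (list of rows) of 0/1 values.

    Row i corresponds to AMINO_ACIDS[i]; column j to position j of the sequence.
    Residues outside the 20-letter alphabet (e.g. 'X') yield an all-zero
    column, which is an assumption of this sketch.
    """
    matrix = [[0] * len(sequence) for _ in AMINO_ACIDS]
    for j, residue in enumerate(sequence.upper()):
        i = AA_INDEX.get(residue)
        if i is not None:
            matrix[i][j] = 1
    return matrix


encoded = one_hot_encode("MKTX")
print(len(encoded), len(encoded[0]))  # height (20) and width (sequence length)
```

Each column then has at most one non-zero entry, so the height of the matrix is fixed at 20 while the width follows the sequence length.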
Fig. 2 The output layer architecture. The integrated model predicts protein–compound interactions by embedding the protein and compound data from different modalities into a common latent space. The feature vectors for the proteins and compounds are mapped onto the same latent space by applying a fully connected layer to each. Their similarity in the latent space is then computed as an element-wise product followed by a fully connected layer.
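The output unit described in the Fig. 2 caption can be sketched as a small forward pass in plain Python. This is only an illustration under stated assumptions: the dimensions, the random weights, the helper names (`linear`, `random_layer`, `predict_interaction`), the absence of activations between layers, and the final sigmoid are all hypothetical and not taken from the paper.

```python
import math
import random


def linear(x, weights, bias):
    """Fully connected layer y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def random_layer(n_out, n_in, rng):
    """Hypothetical initialisation: small uniform weights and zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict_interaction(protein_vec, compound_vec, params):
    # Map each modality into the common latent space with its own layer.
    p = linear(protein_vec, *params["protein"])
    c = linear(compound_vec, *params["compound"])
    # Element-wise product of the two latent vectors.
    joint = [pi * ci for pi, ci in zip(p, c)]
    # Final fully connected layer to one logit; the sigmoid is an assumption.
    logit = linear(joint, *params["output"])[0]
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
LATENT = 8  # assumed latent dimension
params = {
    "protein": random_layer(LATENT, 12, rng),   # 12-dim toy protein vector
    "compound": random_layer(LATENT, 10, rng),  # 10-dim toy compound vector
    "output": random_layer(1, LATENT, rng),
}
score = predict_interaction([1.0] * 12, [0.5] * 10, params)
print(score)
```

Because the two inputs are projected separately before the product, the protein and compound vectors may have different lengths, as they would when built from different modalities.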
Performance comparison of three proposed models with existing methods on the baseline dataset
| Method | AUROC | AUPRC | F-measure | Accuracy |
|---|---|---|---|---|
| Integrated model (molecular + network) | 0.972 ± 0.004 | 0.954 ± 0.005 | 0.900 ± 0.006 | 0.933 ± 0.004 |
| Single-modality model (molecular) | 0.956 ± 0.004* | 0.927 ± 0.006* | 0.868 ± 0.009* | 0.911 ± 0.006* |
| Single-modality model (network) | 0.947 ± 0.008* | 0.920 ± 0.010* | 0.853 ± 0.015* | 0.904 ± 0.009* |
| Graph CNN-based method [ | 0.917 ± 0.006* | 0.850 ± 0.006* | 0.794 ± 0.014* | 0.864 ± 0.008* |
| NeoDTI [ | 0.956 ± 0.005* | 0.905 ± 0.016* | 0.872 ± 0.006* | 0.917 ± 0.004* |
| SVM | 0.805 ± 0.009* | 0.651 ± 0.012* | 0.743 ± 0.012* | 0.837 ± 0.006* |
| Random forest | 0.873 ± 0.009* | 0.767 ± 0.015* | 0.837 ± 0.012* | 0.895 ± 0.007* |
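For readers unfamiliar with the metrics reported in these tables, the sketch below computes accuracy, F-measure, and AUROC on toy data. It assumes that F-measure means the F1 score and that predictions are thresholded at 0.5; neither is stated in this record, and the labels and scores are invented purely for illustration.

```python
def accuracy_and_f_measure(y_true, y_pred):
    """Accuracy and F1 score from binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f_measure


def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

acc, f1 = accuracy_and_f_measure(y_true, y_pred)
auc = auroc(y_true, scores)
print(acc, f1, auc)
```

The pairwise form of AUROC is quadratic in the number of examples, which is fine for a toy case; a rank-based computation would be preferable on full datasets.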
Performance comparison on the unseen compound-test dataset
| Method | AUROC | AUPRC | F-measure | Accuracy |
|---|---|---|---|---|
| Integrated model (molecular + network) | 0.890 ± 0.039 | 0.842 ± 0.050 | 0.727 ± 0.085 | 0.843 ± 0.038 |
| Single-modality model (molecular) | 0.869 ± 0.027 | 0.786 ± 0.023* | 0.657 ± 0.053 | 0.802 ± 0.017 |
| Single-modality model (network) | 0.831 ± 0.053 | 0.759 ± 0.055* | 0.661 ± 0.073* | 0.809 ± 0.030* |
| Graph CNN-based method [ | 0.804 ± 0.037* | 0.679 ± 0.031* | 0.637 ± 0.027 | 0.773 ± 0.009* |
| NeoDTI [ | 0.823 ± 0.067 | 0.773 ± 0.064* | 0.621 ± 0.062* | 0.805 ± 0.024* |
| SVM | 0.765 ± 0.020* | 0.603 ± 0.029* | 0.689 ± 0.029 | 0.810 ± 0.016 |
| Random forest | 0.770 ± 0.023* | 0.635 ± 0.026* | 0.697 ± 0.036 | 0.828 ± 0.014 |
Performance comparison on the hard dataset
| Method | AUROC | AUPRC | F-measure | Accuracy |
|---|---|---|---|---|
| Integrated model (molecular + network) | 0.882 ± 0.035 | 0.834 ± 0.041 | 0.714 ± 0.064 | 0.836 ± 0.030 |
| Single-modality model (molecular) | 0.851 ± 0.023 | 0.770 ± 0.023* | 0.662 ± 0.038* | 0.806 ± 0.020* |
| Single-modality model (network) | 0.780 ± 0.051* | 0.706 ± 0.040* | 0.601 ± 0.057* | 0.784 ± 0.023* |
| Graph CNN-based method [ | 0.707 ± 0.038* | 0.563 ± 0.083* | 0.427 ± 0.132* | 0.719 ± 0.043* |
| NeoDTI [ | 0.790 ± 0.039* | 0.715 ± 0.046* | 0.297 ± 0.084* | 0.719 ± 0.018* |
| SVM | 0.652 ± 0.019* | 0.500 ± 0.023* | 0.481 ± 0.044* | 0.755 ± 0.012* |
| Random forest | 0.605 ± 0.033* | 0.452 ± 0.046* | 0.364 ± 0.075* | 0.728 ± 0.026* |
Fig. 3 (Left) Relationship between the amino acid sequence similarity and the similarity in the protein–protein interaction network. (Right) Relationship between the chemical-structure similarity and the similarity in the compound–compound interaction network. The amino acid sequence similarity was calculated using DIAMOND, and the chemical structure similarity was calculated as the Jaccard coefficient of the ECFPs of the two compounds. The correlation coefficients are 0.127 and 0.0346, respectively.
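The Jaccard coefficient used for chemical-structure similarity in Fig. 3 can be sketched as follows, together with a correlation coefficient. Fingerprints are represented here as sets of "on" bit indices, and the correlation is assumed to be Pearson's; the caption does not say which coefficient was used, and the bit sets are invented for illustration.

```python
import math


def jaccard(bits_a, bits_b):
    """Jaccard coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(a & b) / len(union)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical on-bit indices of two 1024-bit fingerprints.
similarity = jaccard({1, 5, 9}, {5, 9, 20})
print(similarity)  # 2 shared bits out of 4 distinct bits
```

A low correlation between two similarity measures, such as the values reported in the caption, indicates that the network view and the structure view carry largely different information.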
Fig. 4 (Left) Part of the protein–protein interaction network around HTR6 and ADRA2A. (Right) Part of the compound–compound interaction network around Mesulergine and Pergolide.