Sangsoo Lim1, Yijingxiu Lu2, Chang Yun Cho3, Inyoung Sung3, Jungwoo Kim2, Youngkuk Kim2, Sungjoon Park2, Sun Kim1,2,4,3.
Abstract
There has recently been rapid progress in computational methods for determining the protein targets of small-molecule drugs, a task termed compound-protein interaction (CPI) prediction. In this review, we comprehensively cover topics related to the computational prediction of CPI. Data for CPI have been accumulated and curated, improving significantly in both quantity and quality. Computational methods have become more powerful than ever at analyzing such complex data. Thus, recent improvements in the quality of CPI prediction are due to both sophisticated computational techniques and higher-quality information in the databases. The goal of this article is to review topics related to CPI, from data, formats, and representations to computational models, so that researchers can take full advantage of these resources to develop novel prediction methods. Chemical compound and protein data from various resources are discussed in terms of data formats and encoding schemes. For the CPI methods, we group prediction methods into five categories, from traditional machine learning techniques to state-of-the-art deep learning techniques. In closing, we discuss emerging machine learning topics to help both experimental and computational scientists leverage current knowledge and strategies to develop more powerful and accurate CPI prediction methods.
Keywords: Chemical descriptors; Compound-protein interaction; Data representation; Deep learning; Interpretable learning; Machine learning; Pharmacophore discovery; Protein descriptors
Year: 2021 PMID: 33841755 PMCID: PMC8008185 DOI: 10.1016/j.csbj.2021.03.004
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1 Overview of how chemical compound and protein data are processed to perform CPI prediction tasks. Data encoding depends on the data type and on how the data can be prepared as input to ML models. Data processed by network-based methods are not shown here and are discussed in detail in Section 4.2.
Fig. 2 Formats and encoding schemes of chemical compounds. a) String-based methods translate compounds into context-aware strings: SMILES, SMARTS, or SELFIES. SMARTS, based on SMILES, focuses on localized substructures. SELFIES produces a string that is guaranteed to correspond to a valid molecular structure. b) A chemical fingerprint encodes a compound as a binary bit vector by comparing the compound with a pre-defined set of substructures. c) A graph representation of a chemical compound transforms the input molecule into a set of matrices: an adjacency matrix and a node feature matrix.
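The fingerprint and graph encodings described in panels b) and c) can be illustrated with a minimal sketch. It assumes a hand-written toy molecule (ethanol) rather than a parsed SMILES string, and the element vocabulary `ELEMENTS` and the four substructure keys in `KEYS` are hypothetical choices made only for illustration; real fingerprint schemes use much larger pre-defined key sets and proper substructure matching.

```python
# Toy molecule: ethanol (SMILES "CCO"), heavy atoms only.
# Atoms and bonds are written by hand here, not parsed from the SMILES string.
atoms = ["C", "C", "O"]
bonds = [(0, 1, 1), (1, 2, 1)]  # (atom index, atom index, bond order)

ELEMENTS = ["C", "N", "O"]  # hypothetical element vocabulary for node features


def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 adjacency matrix of the molecular graph."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j, _order in bonds:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


def node_features(atoms):
    """One-hot element type for every atom (one row per atom)."""
    return [[1 if atom == element else 0 for element in ELEMENTS] for atom in atoms]


def has_bond(atoms, bonds, a, b, order):
    """True if any bond of the given order joins elements a and b."""
    return any({atoms[i], atoms[j]} == {a, b} and o == order for i, j, o in bonds)


# Hypothetical substructure keys; each key sets one bit of the fingerprint.
KEYS = [
    ("C-C single bond", lambda at, bo: has_bond(at, bo, "C", "C", 1)),
    ("C-O single bond", lambda at, bo: has_bond(at, bo, "C", "O", 1)),
    ("C=O double bond", lambda at, bo: has_bond(at, bo, "C", "O", 2)),
    ("contains N", lambda at, bo: "N" in at),
]


def fingerprint(atoms, bonds):
    """Binary bit vector: bit k is 1 when substructure key k is present."""
    return [1 if test(atoms, bonds) else 0 for _name, test in KEYS]


adj = adjacency_matrix(len(atoms), bonds)
feats = node_features(atoms)
bits = fingerprint(atoms, bonds)
print(adj)
print(feats)
print(bits)
```

In practice a cheminformatics toolkit would parse the SMILES string and compute these encodings; the sketch only shows the shape of the resulting data.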
Fig. 3 Formats and corresponding encoding schemes of proteins. a) String: represents a protein as an amino acid sequence. b) Evolutionary information: encodes a protein using its evolutionary information. c) Graph: encodes protein structure either by considering sequentially connected residues (middle) or by calculating the spatial distance between residues (right).
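The two graph encodings in panel c) can be sketched as follows. The four-residue sequence, the C-alpha coordinates, and the 6.0 Å distance cutoff are all assumptions chosen for illustration; published methods choose their own cutoffs and atom selections.

```python
import math

sequence = "MKTA"  # toy sequence, for illustration only
# Hypothetical C-alpha coordinates in angstroms, one per residue.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 3.8, 0.0)]


def sequential_adjacency(n_residues):
    """Connect each residue only to its neighbors along the chain."""
    adj = [[0] * n_residues for _ in range(n_residues)]
    for i in range(n_residues - 1):
        adj[i][i + 1] = 1
        adj[i + 1][i] = 1
    return adj


def contact_map(coords, cutoff):
    """Connect residues whose spatial distance is at most `cutoff`."""
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0
         for j in range(n)]
        for i in range(n)
    ]


seq_graph = sequential_adjacency(len(sequence))
spatial_graph = contact_map(coords, cutoff=6.0)  # assumed cutoff
print(seq_graph)
print(spatial_graph)
```

The spatial graph links residues 0 and 3 even though they are not adjacent in the sequence, which is the kind of contact the sequential graph cannot express.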
List of databases used in CPI prediction. The databases are organized in three separate categories: chemistry-centric, protein-centric and integrated databases. (a) Chemistry-centric databases mostly focus on integrating the information from chemical experiments. They comprise SMILES, InChI key, or other accession data and their interacting/targeting proteins with corresponding affinities. (b) Protein databases provide sequence information in general. They rarely contain information linked with chemical compounds. (c) Other databases include integrated information in addition to compounds or proteins, such as association with genes, diseases, or phenotypes.
| Database | Coverage: compounds | Coverage: proteins | Coverage: interactions | ML methods using DB: T | F | G | S | P | D | Reference | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Chemistry-centric databases | ||||||||||||
| PubChem | 111 m | 99 k | 273 m | – | ||||||||
| ChEMBL | 1,961,462 | 13,382 | 16,066,124 | |||||||||
| DUD-E | 22,886 | 102 | 22.8 k* | |||||||||
| DrugBank | 13,791 | 5,696 | 27,954 | |||||||||
| STITCH | 0.5 m | 9.6 m | 1.6b | – | ||||||||
| TTD | 2,251 | 3,473*** | 43,875 | – | – | – | – | |||||
| PharmGKB | 708 | – | – | |||||||||
| Matador | 801 | 2,901 | 15,843 | |||||||||
| DrugCentral | 2,529 | 2,003 | 17,390 | – | – | – | – | |||||
| SuperTarget | 195,770 | 6,219 | 332,828 | – | – | |||||||
| Metz | 3,858 | 172 | 258,094 | – | – | – | – | – | ||||
| MUV | 93 k | 17 | – | – | – | – | ||||||
| ZINC | 750 m** | 2,864 (for eukaryotes) | 638,174 | – | – | – | – | – | ||||
| Protein-centric databases | ||||||||||||
| UniProt | – | 20,385 | – | – | – | – | ||||||
| Protein Data Bank | – | 170,597 | – | – | – | |||||||
| PDBbind | 11,762 | 3,566 | 17,679* | – | – | – | – | |||||
| Pfam | – | 18,259 | – | – | – | – | – | |||||
| BRENDA | 46 | 8083** | 500 k | – | – | – | ||||||
| Integrated databases | ||||||||||||
| KEGG | 18,749*** | 31,224,482**** | – | – | ||||||||
| BindingDB | 910,479 | 8,161 | 2.1 m | – | – | |||||||
| Davis | 72 | 442 | 30 k | – | ||||||||
| KIBA | 229 | 211 | 118 k | |||||||||
| IUPHAR/BPS | 10,053 | 2,943 | 48,902 | – | – | – | ||||||
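Chemistry-centric databases: *: positives; **: 213 m with 3D information available; ***: raw file downloaded on Nov 11th, 2020.
Protein-centric databases: *: protein-ligand complexes; **: EC numbers (online).
Integrated databases: ***: drugs: 11.3 k; ****: human proteins: 19.7 k.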
Fig. 4 Overview of the process of CPI prediction. After the original chemical or protein data are obtained from databases, various encoding techniques are used to convert the data into model-readable vector formats (left). The encoded data are then passed to a model, or a combination of several models, that learns patterns in the data. We categorize these models into two main groups (machine learning and deep learning) that together contain five subgroups (middle). After the models are trained, different metrics are chosen to evaluate them. Note that CPI is predicted either as a regression task, by predicting the affinity value, or as a classification task, by predicting the interaction label (right).
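The difference between the two evaluation styles can be shown with a small sketch. The choice of mean squared error for the regression setting and ROC AUC for the classification setting is an assumption made for illustration (the caption does not name specific metrics), and the example values are made up.

```python
def mean_squared_error(y_true, y_pred):
    """Regression-style evaluation: average squared error of predicted affinities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(labels, scores):
    """Classification-style evaluation: probability that a random positive pair
    is scored above a random negative pair (ties count as one half)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up affinity values and predictions (regression setting).
mse = mean_squared_error([5.0, 6.0, 7.0], [5.5, 6.0, 6.0])

# Made-up interaction labels and predicted scores (classification setting).
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])

print(round(mse, 4), auc)
```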
Fig. 5 Methods for interpreting deep learning models, divided into three types: inherently interpretable methods, saliency maps, and attention mechanisms. (a) An inherently interpretable method is one whose components can be used to understand how the model makes decisions. For example, a hierarchical division of molecular structures can classify compounds according to which of the existing substructures determine the given class labels. (b) A saliency map is widely used to reveal the part of an input that contributes most to activating a specific layer of the network. (c) An attention mechanism is mainly applied to neural networks and reveals which parts of the input representation the model focuses on when making predictions.
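The idea behind a saliency map in panel (b) can be sketched with a toy model. Here saliency is taken as the absolute gradient of the model output with respect to each input feature, estimated by central finite differences; the model, its `WEIGHTS`, and the input vector are hypothetical, and real implementations would typically use automatic differentiation on a trained network.

```python
import math

WEIGHTS = [0.5, -2.0, 0.0, 1.0]  # hypothetical weights of a toy model


def model(x):
    """Toy scoring function: tanh of a weighted sum of the input features."""
    return math.tanh(sum(w * xi for w, xi in zip(WEIGHTS, x)))


def saliency(f, x, eps=1e-5):
    """Absolute central finite-difference gradient of f at x, per input feature."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grads.append(abs((f(up) - f(down)) / (2 * eps)))
    return grads


x = [1.0, 0.2, 3.0, -0.5]  # made-up input representation
sal = saliency(model, x)
most_salient = max(range(len(sal)), key=lambda i: sal[i])
print(sal, most_salient)
```

The feature with the largest-magnitude weight receives the highest saliency, and the feature with zero weight receives none, regardless of how large its input value is.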
Tree-based methods for CPI prediction.
| Yu et al. | 2012 | C: - | DRAGON |
| P: - | PROFEAT WEBSERVER | ||
| A method that integrates the chemical, genomic, and pharmacological information to predict CPI. | |||
| Zhang et al. | 2017 | C: - | *- |
| P: AA properties | AAindex1 | ||
| An ensemble of REPTree classifiers by random projection to identify drug-target interactions. | |||
| Li et al. | 2019 | C: - | MACCS |
| P: AA seq | AAC and * | ||
| A method that applies Bayesian Additive Regression Trees on uniform proteochemical space to predict protein–ligand interactions. | |||
| Shi et al. | 2019 | C: FP2 | Pubchem (binary vector) |
| P: AA seq | PSSM matrix | ||
| A method that uses LASSO to remove redundant information from protein PsePSSM and molecular FP2 description and makes prediction with Random Forest. | |||
| Mahmud et al. | 2020 | C: SMILES | MSF |
| P: AA seq | PseAAC and ** | ||
| A computational model that uses balancing techniques and applies feature eliminator to extract features for CPI prediction. | |||
| Zeng et al. | 2020 | C: - | interaction and *** |
| P: - | interaction and *** | ||
| A network-based computational framework that learns low-dimensional vector representation of features and predicts CPI with cascade deep forest. | |||
C: Compound, P: Protein, *: Physicochemical features, Property groups.
MSF: Molecular Substructure Fingerprint, **: PSSM-Bigram and SPIDER2.
***: association and similarity matrices.
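Several entries in this table encode proteins with amino acid composition (AAC), the fraction of each of the 20 standard amino acids in a sequence. The sketch below computes AAC and concatenates it with a compound bit vector to form one fixed-length input row, as a tree-based model would need. The concatenation step and the helper name `pair_features` are assumptions for illustration and are not taken from any of the listed methods.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def amino_acid_composition(seq):
    """Fraction of each standard amino acid in the sequence (length-20 vector)."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in seq.upper():
        if residue in counts:  # non-standard symbols are ignored
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def pair_features(fingerprint_bits, seq):
    """Hypothetical helper: one feature row for a compound-protein pair."""
    return list(fingerprint_bits) + amino_acid_composition(seq)


aac = amino_acid_composition("AACD")
row = pair_features([1, 0, 1], "AACD")  # made-up 3-bit fingerprint
print(aac[:3], len(row))
```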
Network- and Kernel-based machine learning methods for CPI prediction.
| Yamanishi et al. | 2011 | C: FP | Pubchem * |
| P: domains | binary coding scheme | ||
| A method that utilizes sparse CCA to extract chemical substructures and protein domains. | |||
| Cheng et al. | 2012 | C: - | - |
| P: - | – | ||
| A network-based inference method to create compound-protein network and predict new CPI. | |||
| Tabei et al. | 2012 | C: FP | Pubchem * |
| P: domains | binary coding scheme | ||
| A classifier-based approach to identify chemogenomic features that are involved in compound-protein interaction networks. | |||
| Yu et al. | 2012 | C: - | DRAGON |
| P: - | PROFEAT WEBSERVER | ||
| A method that integrates the chemical, genomic, and pharmacological information to predict CPI. | |||
| Zu et al. | 2015 | C: FP | Pubchem * |
| P: domains | binary coding scheme | ||
| A statistical model to evaluate substructure-domain interactions globally and infer interactions. | |||
| Hu et al. | 2016 | C: FP | Pubchem * |
| P: AA seq | binary coding scheme | ||
| A hybrid model based on stacked sparse autoencoder and SVM. | |||
| You et al. | 2019 | C: structural info | OCHEM |
| P: AA seq | AAC ** | ||
| A LASSO-DNN model for compound and protein feature extraction and CPI prediction. | |||
| Mahmud et al. | 2020 | C: SMILES | MSF |
| P: AA seq | PseAAC *** | ||
| A computational model that uses balancing techniques and applies feature eliminator to extract features for CPI prediction. | |||
C: Compound, P: Protein, MSF: Molecular Substructure Fingerprint, *: binary vector.
**: DC, TC, adjacency matrix, ***: PSSM-Bigram and SPIDER2.
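To give a feel for how a network-based method can score unseen compound-protein pairs from known interactions alone, here is a minimal sketch of two-step resource spreading on a bipartite drug-target graph. This is a generic formulation written for illustration; it is not claimed to reproduce the exact procedure of any method in the table, and the function name `nbi_scores` and the toy interaction set are assumptions.

```python
def nbi_scores(interactions, n_drugs, n_targets):
    """Two-step resource spreading on a bipartite drug-target graph.

    interactions: set of (drug index, target index) pairs known to interact.
    Returns scores[d][t], the resource that ends up on target t for drug d.
    """
    a = [[1 if (d, t) in interactions else 0 for t in range(n_targets)]
         for d in range(n_drugs)]
    drug_deg = [sum(row) for row in a]
    target_deg = [sum(a[d][t] for d in range(n_drugs)) for t in range(n_targets)]

    scores = []
    for d0 in range(n_drugs):
        # Step 1: each known target of d0 holds one unit of resource and
        # splits it equally among all drugs linked to that target.
        drug_res = [0.0] * n_drugs
        for t in range(n_targets):
            if a[d0][t]:
                for d in range(n_drugs):
                    if a[d][t]:
                        drug_res[d] += 1.0 / target_deg[t]
        # Step 2: each drug splits what it received equally among its targets.
        target_res = [0.0] * n_targets
        for d in range(n_drugs):
            for t in range(n_targets):
                if a[d][t]:
                    target_res[t] += drug_res[d] / drug_deg[d]
        scores.append(target_res)
    return scores


# Toy network: drug 0 binds target 0; drug 1 binds targets 0 and 1; drug 2 binds target 2.
known = {(0, 0), (1, 0), (1, 1), (2, 2)}
scores = nbi_scores(known, n_drugs=3, n_targets=3)
print(scores[0])
```

For drug 0, the unobserved target 1 receives a nonzero score because drug 1 shares target 0 with it, while target 2, which is reachable only through an unrelated drug, receives none.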
RNN and CNN methods for DTI prediction.
| Wallach et al. | 2015 | co-complex structure | * with 1 Å spacing |
| The first structure-based CNN model to predict CPI. | |||
| Ragoza et al. | 2017 | co-complex structure | * with 0.5 Å resolution |
| A CNN based model to predict protein–ligand interaction with 3D depiction of co-complex structure. | |||
| Gao et al. | 2018 | C: SMILES | chemical structure graph |
| P: AA seq, GO term | lookup embedding | ||
| An end-to-end deep neural network embedded with a two-way attention mechanism for identifying compound-protein interactions. | |||
| Öztürk et al. | 2018 | C: SMILES | label encoding ** |
| P: AA seq | label encoding ** | ||
| An end-to-end CNN-based CPI prediction model that eliminates the need for feature engineering. | |||
| Feng et al. | 2018 | C: SMILES | MGC, ECFP |
| P: AA seq | PSC descriptor | ||
| A feature-engineering free deep learning-based model for CPI prediction. | |||
| Karimi et al. | 2019 | C: SMILES | seq2seq ** |
| P: AA seq | seq2seq (SPS) | ||
| A semi-supervised unified RNN-CNN model for jointly learning protein/compound representations and predicting affinity. | |||
| Karimi et al. | 2019 | C: SMILES | chemical structure graph |
| P: AA seq | k-mers (SSPro/ACCPro) | ||
| An intrinsically explainable neural network architecture for predicting compound-protein interactions. | |||
| Lee et al. | 2019 | C: SMILES | Morgan/Circular Fingerprint |
| P: AA seq | lookup embedding | ||
| A CNN-based model for detecting local residue patterns and predicting CPI. | |||
| Nguyen et al. | 2019 | C: SMILES | chemical structure graph |
| P: AA seq | label encoding *** | ||
| A deep learning-based network for capturing compound structural information and predicting binding affinity. | |||
| Öztürk et al. | 2019 | C: SMILES | label encoding * |
| P: AA seq, motifs, domains | label encoding | ||
| A deep-learning based prediction model that employs chemical and biological textual sequence information to predict binding affinity. | |||
| Shin et al. | 2019 | C: SMILES | word embedding |
| P: AA seq | label encoding ** | ||
| A self-attention-based molecular transformer for CPI prediction. | |||
| Tsubaki et al. | 2019 | C: SMILES | chemical structure graph |
| P: AA seq | overlapping 3-gram AA vector | ||
| A deep learning-based CPI prediction model that captures interaction sites between compound and protein with a neural attention mechanism. | |||
| Huang et al. | 2020 | C: SMILES | one-hot encoding **** |
| P: AA seq | one-hot encoding **** | ||
| An end-to-end biologically inspired transformer based framework for CPI modeling. | |||
| Li et al. | 2020 | C: SMILES | chemical structure graph |
| P: AA seq | BLOSUM62 matrix | ||
| A multi-objective neural network to predict non-covalent interaction and binding affinity. | |||
| Peng et al. | 2020 | C: SMILES | word embedding |
| P: - | – | ||
| An end-to-end deep learning-based framework to learn molecular representation and predict toxicity. | |||
| Rifaioglu et al. | 2020 | C: SMILES | label encoding |
| P: AA seq | Physicochemical features | ||
| An end-to-end system of parallel convolutional neural networks that obtains 1D representations from protein sequences and compound SMILES strings. | |||
| Rifaioglu et al. | 2020 | C: SMILES | 2D compound image |
| P: - | – | ||
| A CPI prediction system that takes compound 2D image as input. | |||
| Wang et al. | 2020 | C: SMILES | MSF |
| P: AA seq | PSSM matrix | ||
| A DeepLSTM-based method for representing and compressing features for compound-protein pairs interaction prediction. | |||
| Zhang et al. | 2020 | C: SMILES | SMILES2Vec **** |
| P: AA seq | encoded with ProtVec | ||
| A word2vec-inspired feature representation method for CPI prediction. | |||
| Zheng et al. | 2020 | C: SMILES | token embedding |
| P: AA seq | pairwise distance matrix | ||
| A Visual Question Answering (VQA)-inspired interpretable deep learning model for compound-protein interaction prediction. | |||
C: Compound, P: Protein, MGC: Molecular Graph Convolution, MSF: Molecular Substructure Fingerprint.
*: fixed-size grid, **: fixed-size vector, SPS: SSPro/ACCPro.
***: SMILES, Max Common Substructure, ****: substructure representation.
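Two of the string encodings that recur in this table, label encoding into a fixed-size vector and overlapping 3-grams of an amino acid sequence, can be sketched briefly. The vocabulary, the maximum length, and the character-level treatment of SMILES are simplifying assumptions; real tokenizers handle multi-character symbols such as `Cl` or `Br`, and each cited method defines its own vocabulary and padding length.

```python
def build_vocab(symbols):
    """Assign integer IDs starting at 1; 0 is reserved for padding."""
    return {symbol: index + 1 for index, symbol in enumerate(symbols)}


def label_encode(text, vocab, max_len):
    """Map each character to its ID, then truncate or zero-pad to max_len.

    Raises KeyError for characters missing from the vocabulary.
    """
    ids = [vocab[ch] for ch in text[:max_len]]
    return ids + [0] * (max_len - len(ids))


def overlapping_ngrams(seq, n=3):
    """All overlapping n-grams of a sequence, shifted by one position each."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


smiles_vocab = build_vocab("CO()=N")  # hypothetical character vocabulary
encoded = label_encode("CC(=O)O", smiles_vocab, max_len=10)  # acetic acid
trigrams = overlapping_ngrams("MKTAY")  # toy protein sequence
print(encoded)
print(trigrams)
```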
Graph based methods for CPI prediction.
| Gao et al. | 2018 | C: SMILES | CSG |
| P: AA seq, GO term | lookup embedding | ||
| An end-to-end deep neural network embedded with a two-way attention mechanism for identifying compound-protein interactions. | |||
| Karimi et al. | 2019 | C: SMILES | CSG |
| P: AA seq | k-mers (SPS) | ||
| An intrinsically explainable neural network architecture for predicting compound-protein interactions. | |||
| Lim et al. | 2019 | co-complex structure | adjacency matrix (graph embedding) |
| A GNN-based model that predicts CPI with a 3D structure-embedded graph representation of the protein–ligand complex. | |||
| Shin et al. | 2019 | C: SMILES | word embedding |
| P: AA seq | label encoding | ||
| A self-attention-based molecular transformer for CPI prediction. | |||
| Torng and Altman | 2019 | C: SMILES | CSG |
| P: PDB file | FEATURE * | ||
| A GNN-based method to learn fixed-size representations of protein pockets and chemical structural graph synchronously and predict CPI. | |||
| Tsubaki et al. | 2019 | C: SMILES | CSG |
| P: AA seq | overlapping 3-gram AA vector | ||
| A deep learning-based CPI prediction model that captures interaction sites between compound and protein with a neural attention mechanism. | |||
| Li et al. | 2020 | C: SMILES | CSG |
| P: AA seq | BLOSUM62 matrix | ||
| A multi-objective neural network to predict non-covalent interactions and binding affinities. | |||
| Karlov et al. | 2020 | co-complex structure | 3D grid representation map |
| An MPNN framework for learning protein–ligand complex features and predicting binding affinity. | |||
C: Compound, P: Protein, CSG: Chemical Structure Graph, SPS: SSPro/ACCPro.
*: graph of key residues.
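The graph-based methods above share a common core: node features are updated by aggregating information from neighboring nodes, and the node vectors are then pooled into a fixed-size representation. The sketch below shows one such step with plain sum aggregation on a toy chemical structure graph. It deliberately omits the learned weight matrices and nonlinearities that real GNN or MPNN layers apply, and the two-element feature vocabulary is an assumption.

```python
def message_passing_step(adj, features):
    """New vector of each node = its own vector + sum of its neighbors' vectors."""
    n = len(adj)
    dim = len(features[0])
    updated = []
    for i in range(n):
        vec = list(features[i])
        for j in range(n):
            if adj[i][j]:
                for k in range(dim):
                    vec[k] += features[j][k]
        updated.append(vec)
    return updated


def readout(features):
    """Sum over nodes to get a fixed-size, graph-level vector."""
    return [sum(column) for column in zip(*features)]


# Toy graph: ethanol (C-C-O), with one-hot node features over [C, O].
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1, 0], [1, 0], [0, 1]]

hidden = message_passing_step(adj, features)
graph_vector = readout(hidden)
print(hidden)
print(graph_vector)
```

After one step, the middle carbon already "sees" the oxygen, and the readout has the same length no matter how many atoms the molecule contains.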
Emerging methods for CPI prediction.
| Hu et al. | 2016 | C: FP | PubChem FP |
| P: AA seq | binary coding scheme | ||
| A hybrid model based on stacked sparse autoencoder and SVM. | |||
| Tian et al. | 2016 | C: FP | PubChem FP |
| P: domains | binary coding scheme | ||
| A DNN model to extract features from chemical substructures and protein domains and predict CPI. | |||
| Karimi et al. | 2019 | C: SMILES | seq2seq * |
| P: AA seq | seq2seq (SPS) | ||
| A semi-supervised unified RNN-CNN model for jointly learning protein/compound representations and predicting affinities. | |||
| Lee et al. | 2019 | C: SMILES | Morgan/Circular Fingerprint |
| P: AA seq | lookup embedding | ||
| A CNN-based model for detecting local residue patterns and predicting CPI. | |||
| Zhao et al. | 2019 | C: SMILES | text embedding |
| P: AA seq | text embedding | ||
| A semi-supervised GAN-based model to learn representations from the raw sequence data of proteins and compounds and predict affinity. | |||
| Agyemang et al. | 2020 | C: SMILES | various descriptor schemes |
| P: AA seq | various descriptor schemes | ||
| A multi-view self-attention-based architecture for learning the representation of compounds and targets from different unimodal descriptor schemes. | |||
| Zeng et al. | 2020 | C: - | interaction and ** |
| P: - | interaction and ** | ||
| A network-based computational framework that learns low-dimensional vector representation of features and predicts CPI with cascade deep forest. | |||
| Zeng et al. | 2020 | C: - | probabilistic co-occurrence matrix |
| P: - | probabilistic co-occurrence matrix | ||
| A network-based deep learning methodology for CPI prediction that embeds various types of chemical, genomic, phenotypic, and cellular networks. | |||
C: Compound, P: Protein, *: fixed-size vector, SPS: SSPro/ACCPro.
**: association and similarity matrices.
Applications of the attention mechanism in CPI methods.
| a) molecular string, protein string | |
|---|---|
| Karimi et al. | Included different attention mechanisms in the unified RNN-CNN models to quantify the contribution of compound and protein. |
| Shin et al. | Proposed a Molecule Transformer that models molecular SMILES strings into better representation vectors with self-attention mechanism. |
| Tsubaki et al. | Used a neural attention mechanism to weight the hidden vectors of protein subsequences with respect to the molecular vector. |
| b) molecular graph, protein string | |
| Gao et al. | Used a two-way attention mechanism to estimate how a compound-protein pair interacts. |
| Agyemang et al. | Used a multi-head self-attention mechanism to learn the most significant segments (a segment is an atom in a molecule or a residue in a target) that may be vital to protein–ligand recognition. |
| c) molecular graph, protein graph | |
| Lim et al. | Devised a distance-aware graph attention mechanism to find the significant nodes and differentiate the contribution of each interaction to binding affinity. |
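The common idea behind the attention mechanisms in this table, weighting parts of one input by their relevance to another, can be sketched with generic scaled dot-product attention. This is an assumed, simplified form: each cited method uses its own variant, and the compound vector and residue vectors below are made-up numbers rather than learned representations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Scaled dot-product attention of one query over a list of key vectors.

    Returns the attention weights and the weighted sum of the key vectors.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    context = [sum(w * key[j] for w, key in zip(weights, keys)) for j in range(d)]
    return weights, context


compound_vec = [1.0, 0.0]  # made-up compound representation (query)
residue_vecs = [[0.1, 0.9], [2.0, 0.1], [0.5, 0.5]]  # made-up residue vectors (keys)

weights, context = attend(compound_vec, residue_vecs)
top_residue = max(range(len(weights)), key=lambda i: weights[i])
print(weights, top_residue)
```

The weights sum to one and can be read as a per-residue importance profile, which is what makes attention useful for interpretation as well as prediction.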