Haiping Zhang1, Konda Mani Saravanan1, Jinzhi Lin1, Linbu Liao2, Justin Tze-Yang Ng3, Jiaxiu Zhou4, Yanjie Wei1.
Abstract
Accurate identification of ligand-binding pockets in a protein is important for structure-based drug design. In recent years, several deep learning models have been developed that learn important physicochemical and spatial information to predict ligand-binding pockets in a protein. However, ranking the native ligand-binding pockets from a pool of predicted pockets remains difficult for computational molecular biologists using a single web-based tool. We therefore expect that training on a data set closer to real applications and providing ligand information will yield a model that identifies pockets more accurately. In this article, we propose a new deep learning method called DeepBindPoc for identifying and ranking ligand-binding pockets in proteins. The model is built using information about the binding pocket and its associated ligand. We use the mol2vec tool to represent both the given ligand and the pocket as vectors, which serve as input to a densely fully connected layer model. During training, important features for pocket-ligand binding are automatically extracted and high-level information is preserved appropriately. DeepBindPoc demonstrated a strong complementary advantage for the detection of native-like pockets when combined with popular traditional methods such as fpocket and P2Rank. The proposed method is extensively tested and validated with standard procedures on multiple datasets, including a dataset of G-protein-coupled receptors. The systematic testing and validation of our method suggest that DeepBindPoc is a valuable tool for ranking near-native pockets in theoretically modeled proteins whose active site has not been determined experimentally but whose ligand is known. The DeepBindPoc model described in this article is available on GitHub (https://github.com/haiping1010/DeepBindPoc), and the web server is available at http://cbblab.siat.ac.cn/DeepBindPoc/index.php.
Keywords: Deep neural network; Densely fully connected neural network; Ligand pocket identification; Mol2vec; Protein–ligand interactions
Year: 2020 PMID: 32292649 PMCID: PMC7144620 DOI: 10.7717/peerj.8864
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. Classification of datasets into different groups.
ET_A, ET_B and ET_C stand for extra test A, extra test B and extra test C, respectively.
Figure 2. The workflow of the DeepBindPoc model.
DeepBindPoc performance on Training A, Validation A, Test A, Training B, Validation B and Test B.
The normalization strategy is based on the training data set. DeepBindPoc is trained on Training A. The details of each data set are described in the Materials and methods section. Pos_size and Neg_size denote the sizes of the positive and negative data sets.
| Data set | AUC | Accuracy | TPR | Precision | MCC | Pos_size | Neg_size |
|---|---|---|---|---|---|---|---|
| Training A | 1.00 | 0.98 | 0.97 | 0.98 | 0.95 | 6,000 × 3 | 18,000 |
| Validation A | 0.98 | 0.93 | 0.90 | 0.96 | 0.86 | 1,000 | 1,000 |
| Test A | 0.98 | 0.95 | 0.89 | 0.71 | 0.77 | 677 | 5,822 |
| Training B | 1.00 | 0.98 | 0.98 | 0.98 | 0.96 | 6,000 × 3 | 18,000 |
| Validation B | 0.99 | 0.97 | 0.99 | 0.96 | 0.94 | 1,000 | 1,000 |
| Test B | 1.00 | 0.97 | 0.98 | 0.97 | 0.94 | 7,491 | 5,822 |
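The Accuracy, TPR, Precision and MCC columns above can be related to confusion-matrix counts. The following is a minimal sketch using the standard textbook definitions; it is not the authors' evaluation script, and the counts in the example are hypothetical values chosen only to illustrate a strongly imbalanced test set.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, TPR, precision and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    tpr = tp / (tp + fn) if (tp + fn) else 0.0          # true positive rate (recall)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews correlation coefficient
    return {"accuracy": accuracy, "tpr": tpr, "precision": precision, "mcc": mcc}

# Hypothetical counts (not taken from the paper): 12 positives, 238 negatives.
metrics = classification_metrics(tp=10, fp=48, tn=190, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

When negatives far outnumber positives, even a modest false positive rate produces many false positives, so precision and MCC can be low while accuracy and TPR stay high.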
DeepBindPoc performance on extra test set A and the independent GPCR data set, which are close to real applications.
Pos_size and Neg_size denote the sizes of the positive and negative data sets.
| Data set | Normalization strategy | AUC | Accuracy | TPR | Precision | MCC | Pos_size | Neg_size |
|---|---|---|---|---|---|---|---|---|
| Extra test A | 1# | 0.90 | 0.80 | 0.83 | 0.18 | 0.32 | 12 | 238 |
| Extra test A | 2# | 0.93 | 0.88 | 0.75 | 0.24 | 0.38 | 12 | 238 |
| GPCR set | 1# | 0.96 | 0.85 | 0.95 | 0.16 | 0.36 | 98 | 3,050 |
| GPCR set | 2# | 0.97 | 0.91 | 0.93 | 0.26 | 0.46 | 98 | 3,050 |
Note:
1#, normalization based on the data set itself; 2#, normalization based on the training set.
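The difference between the two normalization strategies can be sketched as follows. This assumes per-feature z-score standardization, which the text here does not confirm; the function name `zscore` and the feature values are hypothetical and serve only to show where the statistics come from in each strategy.

```python
from statistics import fmean, pstdev

def zscore(values, mean, std):
    """Standardize values with the supplied mean and standard deviation."""
    std = std or 1.0  # guard against a constant feature
    return [(v - mean) / std for v in values]

# Hypothetical values of a single feature (not from the paper).
train = [0.0, 2.0, 4.0, 6.0, 8.0]
test = [3.0, 5.0, 7.0]

# Strategy 1#: statistics computed from the evaluated data set itself.
test_self = zscore(test, fmean(test), pstdev(test))

# Strategy 2#: statistics computed from the training set and reused as-is.
test_train = zscore(test, fmean(train), pstdev(train))
```

Under strategy 2# the evaluated data are placed on the same scale the model saw during training, whereas strategy 1# recenters each data set on its own mean.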
Comparison of DeepBindPoc and fpocket based on whether each successfully identifies a near-native pocket within the top 5, top 3 and top 1 predictions, respectively.
| PDB ID | DeepBindPoc: in top 5 | DeepBindPoc: in top 3 | DeepBindPoc: in top 1 | fpocket: in top 5 | fpocket: in top 3 | fpocket: in top 1 |
|---|---|---|---|---|---|---|
| 6NQ0 | × | × | × | × | × | × |
| 6J4H | ✓ | ✓ | × | × | × | × |
| 6J0O | ✓ | ✓ | ✓ | ✓ | ✓ | × |
| 6IEZ | ✓ | ✓ | ✓ | × | × | × |
| 6I2A | ✓ | ✓ | ✓ | × | × | × |
| 6GGG | ✓ | ✓ | × | ✓ | ✓ | ✓ |
| 6K04 | ✓ | ✓ | ✓ | ✓ | × | × |
| 6GEV | ✓ | ✓ | × | ✓ | ✓ | ✓ |
| 6E3T | ✓ | × | × | × | × | × |
| 6PSJ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 6SJM | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 5OVE | ✓ | ✓ | × | × | × | × |
| Summary | 11 | 10 | 6 | 6 | 5 | 4 |
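The top-k success counts in the Summary row can be computed along the following lines. This is a minimal sketch rather than the authors' code; the function names, pocket names and scores are hypothetical, and which pockets count as near-native is assumed to be known in advance.

```python
def near_native_in_top_k(scored_pockets, near_native, k):
    """Return True if any near-native pocket is among the k highest-scoring pockets.

    scored_pockets: dict mapping pocket name -> predicted score
    near_native:    set of pocket names considered near-native
    """
    ranked = sorted(scored_pockets, key=scored_pockets.get, reverse=True)
    return any(name in near_native for name in ranked[:k])

def success_counts(cases, ks=(5, 3, 1)):
    """Count, for each k, the proteins with a near-native pocket in the top k."""
    return {k: sum(near_native_in_top_k(s, n, k) for s, n in cases) for k in ks}

# Hypothetical cases (not from the paper).
cases = [
    ({"poc1": 0.91, "poc2": 0.97, "poc3": 0.40}, {"poc1"}),  # near-native ranked second
    ({"poc1": 0.20, "poc2": 0.10}, set()),                    # no near-native pocket generated
]
counts = success_counts(cases)
```

Because Python's sort is stable, pockets with equal scores keep their input order, so tied predictions need an explicit tie-breaking rule if the ranking is to be reproducible.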
The top three DeepBindPoc prediction values for the five cases (extra test B) in which fpocket failed to generate a near-native pocket; the corresponding fpocket prediction values are also given for comparison.
Interestingly, in each case the native pocket is ranked top 1. The normalization strategy is based on the training set. Each pocket is either the native pocket or a pocket generated by fpocket. Pos and neg in the table denote positive and negative data. The number after “poc” in the table represents the rank of the prediction.
| Pocket | Prediction | Label |
|---|---|---|
| 6MT8_native_poc | 1.00 | pos |
| 6MT8_poc1 | 1.00 | neg |
| 6MT8_poc9 | 0.98 | neg |
| 6J8V_native_poc | 1.00 | pos |
| 6J8V_poc31 | 1.00 | neg |
| 6J8V_poc18 | 1.00 | neg |
| 6J5W_native_poc | 1.00 | pos |
| 6J5W_poc74 | 1.00 | neg |
| 6J5W_poc42 | 1.00 | neg |
| 6I53_native_poc | 0.99 | pos |
| 6I53_poc3 | 0.98 | neg |
| 6I53_poc1 | 0.85 | neg |
| 6F83_native_poc | 1.00 | pos |
| 6F83_poc4 | 0.09 | neg |
| 6F83_poc3 | 0.05 | neg |
The performance of P2Rank on the 17 proteins of extra test sets A and B.
When at least one near-native pocket is among the predictions, it often ranks at the top (in this test, all rank top 1). However, in some cases no near-native pocket is among the predicted pocket decoys.
| Protein | Number of predicted pockets | Near-native pocket |
|---|---|---|
| 5OVE | 9 | None |
| 6J4H | 31 | 6J4H_poc1 |
| 6E3T | 21 | 6E3T_poc1 |
| 6K04 | 1 | 6K04_poc1 |
| 6F83 | 3 | None |
| 6GEV | 5 | 6GEV_poc1 |
| 6PSJ | 5 | 6PSJ_poc1 |
| 6GGG | 9 | 6GGG_poc1 |
| 6I2A | 7 | None |
| 6I53 | 6 | None |
| 6IEZ | 5 | 6IEZ_poc1 |
| 6J0O | 10 | 6J0O_poc1 |
| 6J5W | 45 | None |
| 6SJM | 3 | 6SJM_poc1 |
| 6J8V | 27 | None |
| 6MT8 | 7 | 6MT8_poc1 |
| 6NQ0 | 26 | None |
Figure 3. The three cases in which DeepBindPoc performs better than both P2Rank and fpocket.
The three cases, with Protein Data Bank identifiers 6I2A, 6F83 and 6J5W, are shown as (A), (B) and (C), respectively. The figure was rendered with VMD.
Figure 4. Cases in which two ligands correspond to two pockets.
(A) Two pockets of the protein with PDB ID: 6QTN. (B) Two pockets of the protein with PDB ID: 5ZG2. The figure was rendered with UCSF Chimera.