Jianhua Jia, Zi Liu, Xuan Xiao, Bingxiang Liu, Kuo-Chen Chou.
Abstract
Knowledge of protein-protein interactions and their binding sites is indispensable for an in-depth understanding of the networks in living cells. With the avalanche of protein sequences generated in the postgenomic age, it is critical to develop computational methods that can identify protein-protein binding sites (PPBSs) in a timely fashion from sequence information alone, because the information obtained in this way can be used for both biomedical research and drug development. To address this challenge, we have proposed a new predictor, called iPPBS-Opt, in which we have used: (1) the K-Nearest Neighbors Cleaning (KNNC) and Inserting Hypothetical Training Samples (IHTS) treatments to optimize the training dataset; (2) the ensemble voting approach to select the most relevant features; and (3) the stationary wavelet transform to formulate the statistical samples. Cross-validation tests targeting experimentally confirmed results have demonstrated that the new predictor is very promising, implying that the aforementioned practices are indeed effective. In particular, using wavelets to express protein/peptide sequences might be the key to grasping the essence of the problem, which is fully consistent with the findings that many important biological functions of proteins can be elucidated through their low-frequency internal motions. For the convenience of most experimental scientists, we have provided a step-by-step guide on how to use the predictor's web server (http://www.jci-bioinfo.cn/iPPBS-Opt) to get the desired results without the need to go through the complicated mathematical equations involved.
Keywords: IHTS; KNNC; Optimize training dataset; PseAAC; physicochemical property; protein-protein binding sites; stationary wavelet transform; target cross-validation
Year: 2016 PMID: 26797600 PMCID: PMC6274413 DOI: 10.3390/molecules21010095
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Maximum accessible surface area (ASA) of different amino acids a.
| AA | A | B | C | D | E | F | G | H | I | K | L | M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MaxASA | 106 | 160 | 135 | 163 | 194 | 197 | 84 | 184 | 169 | 205 | 164 | 188 |
| AA | N | P | Q | R | S | T | V | W | X | Y | Z | |
| MaxASA | 157 | 136 | 198 | 248 | 130 | 142 | 142 | 227 | 180 | 222 | 196 | |
a Amino acids are represented by their one-letter codes. Here, B stands for D or N; Z for E or Q, and X for an undetermined amino acid.
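A table of maximum ASA values like the one above is commonly used to normalize an observed accessible surface area into a relative value. The following is a minimal sketch of that normalization, not necessarily the exact procedure used for iPPBS-Opt. The function names and the 0.25 cutoff are hypothetical and chosen only for illustration; the observed ASA is assumed to be in the same units as the table.

```python
# Maximum ASA values copied from the table above (one-letter codes).
MAX_ASA = {
    "A": 106, "B": 160, "C": 135, "D": 163, "E": 194, "F": 197,
    "G": 84, "H": 184, "I": 169, "K": 205, "L": 164, "M": 188,
    "N": 157, "P": 136, "Q": 198, "R": 248, "S": 130, "T": 142,
    "V": 142, "W": 227, "X": 180, "Y": 222, "Z": 196,
}


def relative_asa(residue: str, observed_asa: float) -> float:
    """Return the observed ASA as a fraction of the residue's maximum ASA."""
    return observed_asa / MAX_ASA[residue.upper()]


def is_surface(residue: str, observed_asa: float, threshold: float = 0.25) -> bool:
    """Flag a residue as exposed when its relative ASA reaches the cutoff.

    The default threshold is a hypothetical value, not taken from the source.
    """
    return relative_asa(residue, observed_asa) >= threshold
```

For example, a glycine with an observed ASA of 42 has a relative ASA of 42/84 = 0.5 and would be flagged as exposed under the illustrative cutoff.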
Figure 1. A schematic drawing to show how to use the extended chain of Equation (7) to define the working segments of Equation (6) for sites that lie too close to the N- or C-terminus for a full segment to fit within the protein, where the left dummy segment stands for the mirror image of the N-terminal residues and the right dummy segment for that of the C-terminal residues.
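The mirror extension described in the caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `mirror_window` and the parameter `half_width` are hypothetical, positions are 0-based, and the reflection is assumed to repeat the terminal residue (so the chain `ABC...` is extended on the left as `...CBA|ABC...`).

```python
def mirror_window(sequence: str, center: int, half_width: int) -> str:
    """Return the segment of length 2*half_width + 1 centered at `center`.

    Positions that fall outside the chain are filled with the mirror image
    of the nearest terminus (assumed convention: the terminal residue itself
    is repeated in the reflection).
    """
    n = len(sequence)
    if not 0 <= center < n:
        raise IndexError("center must be a valid 0-based sequence position")
    if half_width < 0 or half_width > n:
        raise ValueError("half_width must be between 0 and the chain length")
    residues = []
    for j in range(center - half_width, center + half_width + 1):
        if j < 0:
            j = -j - 1            # reflect about the N-terminus
        elif j >= n:
            j = 2 * n - 1 - j     # reflect about the C-terminus
        residues.append(sequence[j])
    return "".join(residues)
```

With the toy chain `"ABCDEFG"` and `half_width=2`, the first residue yields `"BAABC"`, the last yields `"EFGGF"`, and an interior residue such as position 3 yields the unmodified segment `"BCDEF"`.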
The original values of the seven physicochemical properties for each amino acid.
| Amino Acid Code | H1 | H2 | V | P1 | P2 | SASA | NCI |
|---|---|---|---|---|---|---|---|
| A | 0.62 | −0.5 | 27.5 | 8.1 | 0.046 | 1.181 | 0.007187 |
| C | 0.29 | −1 | 44.6 | 5.5 | 0.128 | 1.461 | −0.03661 |
| D | −0.9 | 3 | 40 | 13 | 0.105 | 1.587 | −0.02382 |
| E | −0.74 | 3 | 62 | 12.3 | 0.151 | 1.862 | 0.006802 |
| F | 1.19 | −2.5 | 115.5 | 5.2 | 0.29 | 2.228 | 0.037552 |
| G | 0.48 | 0 | 0 | 9 | 0 | 0.881 | 0.179052 |
| H | −0.4 | −0.5 | 79 | 10.4 | 0.23 | 2.025 | −0.01069 |
| I | 1.38 | −1.8 | 93.5 | 5.2 | 0.186 | 1.81 | 0.021631 |
| K | −1.5 | 3 | 100 | 11.3 | 0.219 | 2.258 | 0.017708 |
| L | 1.06 | −1.8 | 93.5 | 4.9 | 0.186 | 1.931 | 0.051672 |
| M | 0.64 | −1.3 | 94.1 | 5.7 | 0.221 | 2.034 | 0.002683 |
| N | −0.78 | 2 | 58.7 | 11.6 | 0.134 | 1.655 | 0.005392 |
| P | 0.12 | 0 | 41.9 | 8 | 0.131 | 1.468 | 0.239531 |
| Q | −0.85 | 0.2 | 80.7 | 10.5 | 0.18 | 1.932 | 0.049211 |
| R | −2.53 | 3 | 105 | 10.5 | 0.291 | 2.56 | 0.043587 |
| S | −0.18 | 0.3 | 29.3 | 9.2 | 0.062 | 1.298 | 0.004627 |
| T | −0.05 | −0.4 | 51.3 | 8.6 | 0.108 | 1.525 | 0.003352 |
| V | 1.08 | −1.5 | 71.5 | 5.9 | 0.14 | 1.645 | 0.057004 |
| W | 0.81 | −3.4 | 145.5 | 5.4 | 0.409 | 2.663 | 0.037977 |
| Y | 0.26 | −2.3 | 117.3 | 6.2 | 0.298 | 2.368 | 0.023599 |
a H1, hydrophobicity; H2, hydrophilicity; V, volume of side chains; P1, polarity; P2, polarizability; SASA, solvent accessible surface area; NCI, net charge index of side chains.
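A property table of this kind lets a peptide segment be rewritten as a numerical series, one value per residue, which is the form a wavelet transform operates on. The sketch below encodes a segment with the H1 (hydrophobicity) column only; the values are copied from the table, while the names `H1` and `encode_segment` are hypothetical and any further normalization of the values is left out.

```python
# H1 (hydrophobicity) column copied from the table above.
H1 = {
    "A": 0.62, "C": 0.29, "D": -0.9, "E": -0.74, "F": 1.19,
    "G": 0.48, "H": -0.4, "I": 1.38, "K": -1.5, "L": 1.06,
    "M": 0.64, "N": -0.78, "P": 0.12, "Q": -0.85, "R": -2.53,
    "S": -0.18, "T": -0.05, "V": 1.08, "W": 0.81, "Y": 0.26,
}


def encode_segment(segment: str, scale: dict = H1) -> list:
    """Map each residue of a segment to its value on the given property scale."""
    values = []
    for residue in segment.upper():
        if residue not in scale:
            raise ValueError(f"no property value for residue {residue!r}")
        values.append(scale[residue])
    return values
```

Repeating the call with each of the seven columns would give seven series per segment, one for each physicochemical property.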
Figure 2. A schematic drawing to illustrate the procedure of multi-level SWT (stationary wavelet transform). See Equations (10)–(12) as well as the relevant text for further explanation.
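The defining feature of a stationary wavelet transform is that no downsampling takes place, so the approximation and detail coefficients at every level have the same length as the input. The sketch below illustrates this with a Haar filter and periodic boundary handling. Both choices are assumptions made for brevity: this section does not state which wavelet family or boundary rule the predictor uses, and the 1/2 normalization here differs from the orthonormal 1/√2 convention used by many libraries. The function name `haar_swt` is hypothetical.

```python
def haar_swt(signal, levels):
    """Multi-level undecimated (stationary) Haar transform.

    Returns (approximation, details), where details[k] holds the detail
    coefficients of level k + 1. Every output has len(signal) entries.
    """
    n = len(signal)
    if n == 0 or levels < 1:
        raise ValueError("need a non-empty signal and at least one level")
    approx = [float(x) for x in signal]
    details = []
    for level in range(levels):
        step = 2 ** level  # filter taps spread further apart at each level
        shifted = [approx[(i + step) % n] for i in range(n)]  # periodic wrap
        detail = [(a - b) / 2 for a, b in zip(approx, shifted)]
        approx = [(a + b) / 2 for a, b in zip(approx, shifted)]
        details.append(detail)
    return approx, details
```

With this normalization the input can be recovered by adding the final approximation to all detail series element by element, which is a convenient sanity check on the decomposition.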
Figure 3. A flowchart to illustrate the ensemble classifier of Equation (17) that exploits all the different groups of features, where D(1) means the decision made by the first individual classifier, D(2) means the decision made by the second, and so forth. See the text as well as Equations (11) and (16) for further explanation.
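Combining the individual decisions D(1), D(2), ... by voting can be sketched as a simple majority rule. This is a generic illustration and may differ from the exact fusion rule of Equation (17); the function name `majority_vote` and the tie-breaking rule (smallest label wins, so a tie between 0 and 1 resolves to 0) are hypothetical.

```python
from collections import Counter


def majority_vote(decisions):
    """Return the label chosen by the most individual classifiers.

    Ties are resolved in favor of the smallest label, an arbitrary
    convention adopted here only to keep the result deterministic.
    """
    if not decisions:
        raise ValueError("at least one decision is required")
    counts = Counter(decisions)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)
```

For instance, if three individual classifiers output 1, 0, and 1 for a residue (1 meaning binding site), the ensemble decision is 1.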
Figure 4. A plot of Acc vs. K for (a) the surface-residue benchmark dataset (cf. Equation (4)); and (b) the all-residue benchmark dataset (cf. Equation (5)). Each panel shows the value of K at which the overall accuracy reaches its peak.
Comparison of the iPPBS-Opt with the other existing methods via the 10-fold cross-validation on the surface-residue benchmark dataset (Equation (4)) and the all-residue benchmark dataset (Equation (5)).
| Benchmark Dataset | Method | Acc (%) | MCC | Sn (%) | Sp (%) | AUC |
|---|---|---|---|---|---|---|
| Surface-residue | Deng a | N/A | 0.3456 | 76.77 | 63.16 | 0.7976 |
| Surface-residue | Chen b | 75.09 | 0.4248 | 43.81 | 92.12 | 0.8004 |
| Surface-residue | iPPBS-PseAAC c | | | 58.26 | 94.14 | |
| All-residue | Deng a | N/A | 0.3763 | 76.33 | 78.61 | 0.8465 |
| All-residue | Chen b | 73.77 | 0.3286 | 24.95 | 96.52 | 0.8001 |
| All-residue | iPPBS-PseAAC c | | | 39.14 | 96.66 | |
a Results reported by Deng et al. [10]; b Results reported by Chen et al. [11]; c Results obtained on the same testing dataset by the current predictor iPPBS-Opt, with its parameter K set separately for the surface-residue benchmark dataset (cf. Equation (4)) and for the all-residue benchmark dataset (cf. Equation (5)). See also Figure 4 for the details.
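The four threshold-dependent metrics in the table (Acc, MCC, Sn, Sp) all follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` is hypothetical, the results are returned as fractions rather than percentages, and both classes are assumed to be present in the test set.

```python
import math


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute Sn, Sp, Acc, and MCC from confusion-matrix counts.

    Assumes at least one positive and one negative sample, so the
    denominators of Sn and Sp are non-zero.
    """
    sn = tp / (tp + fn)                      # sensitivity (recall on binding sites)
    sp = tn / (tn + fp)                      # specificity (recall on non-binding sites)
    acc = (tp + tn) / (tp + tn + fp + fn)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0.0 when undefined
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}
```

A predictor with low Sn and high Sp, as in several rows of the table, finds few of the true binding sites while rarely mislabeling non-binding residues, which is why MCC and AUC are reported alongside accuracy for such imbalanced data.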
Figure 5. The ROC (Receiver Operating Characteristic) curves for the 10-fold cross-validation by iPPBS-Opt, Deng et al.’s method [10], and Chen et al.’s method [11] on (a) the surface-residue benchmark dataset; and (b) the all-residue benchmark dataset. As shown in the figure, the area under the ROC curve for iPPBS-Opt is larger than those of its counterparts, indicating an improvement of the new predictor over the existing ones.
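The area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes AUC directly from that definition. It is a generic illustration, not the plotting procedure behind the figure; the function name `auc_from_scores` is hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both classes must contain at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a larger area indicates a better ranking of binding over non-binding residues.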
Figure 6. A semi-screenshot of the top page of the web server iPPBS-Opt at http://www.jci-bioinfo.cn/iPPBS-Opt.