Jinjian Jiang, Nian Wang, Peng Chen, Chunhou Zheng, Bing Wang.
Abstract
Hotspot residues are important determinants of protein-protein interactions and typically perform specific functions in biological processes. They are commonly identified by alanine scanning mutagenesis experiments, which are costly and time consuming. To address this issue, computational methods have been developed. Most of them are structure based, i.e., they use information from solved protein structures. However, the number of solved protein structures is far smaller than the number of known sequences. Moreover, almost all existing predictors identify hotspots from the interfaces of protein complexes, and seldom from whole protein sequences. Therefore, methods that determine hotspots from whole protein sequences using sequence information alone are urgently needed. To predict hotspots from whole protein sequences, we propose an ensemble system with random projections using statistical physicochemical properties of amino acids. First, an encoding scheme involving sequence profiles of residues and physicochemical properties from the AAindex1 dataset is developed. Then, the random projection technique is adopted to project the encoded instances into a reduced space. Next, several of the better-performing random projections are selected by training an IBk classifier on the training dataset, and these are applied to the test dataset, yielding an ensemble of random projection classifiers. Experimental results show that, although the performance of our method is not yet good enough for real applications, it is very promising for the determination of hotspot residues from whole sequences.
Keywords: IBk; ensemble system; hot spots; random projection
Year: 2017 PMID: 28718782 PMCID: PMC5536031 DOI: 10.3390/ijms18071543
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Prediction performance of individual classifiers with a reduced dimension of 5 on the Binding Interface Database 0 (BID0) test dataset, trained on the Alanine Scanning Energetics Database 0 (ASEdb0) dataset. The top 50 individual classifiers are listed for a simple comparison. Here “Sen”, “Prec”, “F1” and “MCC” denote sensitivity, precision, F-measure, and Matthews correlation coefficient, respectively.
| No. | Training Sen | Training MCC | Training Prec | Training F1 | Test Sen | Test MCC | Test Prec | Test F1 |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.259 | 0.110 | 0.069 | 0.109 | 0.558 | 0.332 | 0.220 | 0.315 |
| 2 | 0.069 | 0.125 | 0.250 | 0.108 | 0.558 | 0.357 | 0.250 | 0.345 |
| 3 | 0.138 | 0.080 | 0.070 | 0.093 | 0.212 | 0.141 | 0.122 | 0.155 |
| 4 | 0.069 | 0.085 | 0.129 | 0.090 | 0.500 | 0.274 | 0.173 | 0.257 |
| 5 | 0.121 | 0.075 | 0.071 | 0.089 | 0.308 | 0.194 | 0.150 | 0.201 |
| 6 | 0.069 | 0.083 | 0.125 | 0.089 | 0.096 | 0.040 | 0.044 | 0.060 |
| 7 | 0.069 | 0.076 | 0.108 | 0.084 | 0.269 | 0.136 | 0.096 | 0.141 |
| 8 | 0.069 | 0.076 | 0.108 | 0.084 | 0.269 | 0.129 | 0.090 | 0.135 |
| 9 | 0.138 | 0.071 | 0.061 | 0.084 | 0.558 | 0.364 | 0.259 | 0.354 |
| 10 | 0.138 | 0.069 | 0.058 | 0.082 | 0.346 | 0.226 | 0.173 | 0.231 |
| 11 | 0.069 | 0.071 | 0.098 | 0.081 | 0.135 | 0.038 | 0.037 | 0.058 |
| 12 | 0.086 | 0.066 | 0.075 | 0.080 | 0.615 | 0.337 | 0.205 | 0.308 |
| 13 | 0.052 | 0.080 | 0.150 | 0.077 | 0.577 | 0.317 | 0.196 | 0.293 |
| 14 | 0.052 | 0.076 | 0.136 | 0.075 | 0.404 | 0.227 | 0.153 | 0.222 |
| 15 | 0.069 | 0.064 | 0.083 | 0.075 | 0.135 | 0.082 | 0.080 | 0.100 |
| 16 | 0.052 | 0.074 | 0.130 | 0.074 | 0.577 | 0.323 | 0.203 | 0.300 |
| 17 | 0.052 | 0.074 | 0.130 | 0.074 | 0.596 | 0.279 | 0.153 | 0.243 |
| 18 | 0.069 | 0.062 | 0.080 | 0.074 | 0.404 | 0.225 | 0.151 | 0.220 |
| 19 | 0.069 | 0.062 | 0.080 | 0.074 | 0.308 | 0.152 | 0.102 | 0.153 |
| 20 | 0.052 | 0.072 | 0.125 | 0.073 | 0.115 | 0.030 | 0.033 | 0.052 |
| 21 | 0.121 | 0.058 | 0.052 | 0.073 | 0.192 | 0.135 | 0.123 | 0.150 |
| 22 | 0.052 | 0.067 | 0.111 | 0.071 | 0.288 | 0.150 | 0.105 | 0.154 |
| 23 | 0.190 | 0.064 | 0.044 | 0.071 | 0.577 | 0.281 | 0.159 | 0.249 |
| 24 | 0.069 | 0.056 | 0.070 | 0.070 | 0.269 | 0.145 | 0.105 | 0.151 |
| 25 | 0.086 | 0.054 | 0.057 | 0.069 | 0.423 | 0.171 | 0.095 | 0.155 |
| 26 | 0.086 | 0.053 | 0.057 | 0.068 | 0.212 | 0.079 | 0.057 | 0.090 |
| 27 | 0.086 | 0.051 | 0.054 | 0.066 | 0.365 | 0.218 | 0.156 | 0.218 |
| 28 | 0.052 | 0.058 | 0.091 | 0.066 | 0.250 | 0.091 | 0.060 | 0.097 |
| 29 | 0.052 | 0.057 | 0.088 | 0.065 | 0.481 | 0.237 | 0.141 | 0.218 |
| 30 | 0.034 | 0.095 | 0.286 | 0.062 | 0.519 | 0.241 | 0.136 | 0.215 |
| 31 | 0.034 | 0.095 | 0.286 | 0.062 | 0.346 | 0.204 | 0.146 | 0.206 |
| 32 | 0.052 | 0.050 | 0.073 | 0.061 | 0.173 | 0.095 | 0.081 | 0.110 |
| 33 | 0.138 | 0.048 | 0.039 | 0.061 | 0.442 | 0.271 | 0.190 | 0.266 |
| 34 | 0.052 | 0.049 | 0.071 | 0.060 | 0.231 | 0.115 | 0.085 | 0.124 |
| 35 | 0.224 | 0.055 | 0.035 | 0.060 | 0.346 | 0.186 | 0.127 | 0.186 |
| 36 | 0.034 | 0.078 | 0.200 | 0.059 | 0.250 | 0.161 | 0.131 | 0.172 |
| 37 | 0.207 | 0.052 | 0.034 | 0.059 | 0.519 | 0.273 | 0.167 | 0.252 |
| 38 | 0.034 | 0.074 | 0.182 | 0.058 | 0.365 | 0.238 | 0.181 | 0.242 |
| 39 | 0.034 | 0.064 | 0.143 | 0.056 | 0.192 | 0.083 | 0.064 | 0.096 |
| 40 | 0.052 | 0.044 | 0.061 | 0.056 | 0.231 | 0.146 | 0.120 | 0.158 |
| 41 | 0.052 | 0.042 | 0.059 | 0.055 | 0.135 | 0.070 | 0.065 | 0.088 |
| 42 | 0.103 | 0.038 | 0.036 | 0.054 | 0.327 | 0.145 | 0.091 | 0.143 |
| 43 | 0.103 | 0.037 | 0.036 | 0.053 | 0.192 | 0.111 | 0.093 | 0.125 |
| 44 | 0.034 | 0.049 | 0.095 | 0.051 | 0.077 | 0.013 | 0.025 | 0.037 |
| 45 | 0.069 | 0.035 | 0.040 | 0.051 | 0.154 | 0.054 | 0.046 | 0.071 |
| 46 | 0.121 | 0.034 | 0.031 | 0.050 | 0.423 | 0.231 | 0.151 | 0.222 |
| 47 | 0.224 | 0.041 | 0.028 | 0.050 | 0.288 | 0.172 | 0.129 | 0.179 |
| 48 | 0.241 | 0.037 | 0.026 | 0.046 | 0.308 | 0.152 | 0.102 | 0.153 |
| 49 | 0.052 | 0.030 | 0.040 | 0.045 | 0.442 | 0.210 | 0.125 | 0.195 |
| 50 | 0.155 | 0.031 | 0.026 | 0.045 | 0.462 | 0.252 | 0.162 | 0.240 |
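The four measures reported in these tables can all be derived from confusion-matrix counts. The following is a minimal sketch using the standard definitions; the function names and the example counts are illustrative assumptions, not taken from the paper.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return (Sen, Prec, F1, MCC) from confusion-matrix counts."""
    sen = tp / (tp + fn) if (tp + fn) else 0.0
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * sen * prec / (sen + prec) if (sen + prec) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sen, prec, f1, mcc

def f1_from(sen, prec):
    """F1 as the harmonic mean of sensitivity and precision."""
    return 2 * sen * prec / (sen + prec) if (sen + prec) else 0.0

# Hypothetical counts, chosen only for illustration.
sen, prec, f1, mcc = binary_metrics(tp=8, fp=2, tn=88, fn=2)

# Consistency check against a reported ensemble row
# (Sen 0.846, Prec 0.440, F1 0.579 on BID0).
print(round(f1_from(0.846, 0.440), 3))  # 0.579
```

Because hotspots make up less than 2% of the residues in the whole-sequence datasets, F1 and MCC are more informative here than plain accuracy would be.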
Prediction performance of the ensemble of the top N classifiers with reduced instance dimension of 5 on the two datasets.
| Test Set | No. Classifiers (N) | Sen | MCC | Prec | F1 |
|---|---|---|---|---|---|
| ASEdb0 | 2 | 0.224 | 0.322 | 0.481 | 0.306 |
| | 3 | 0.793 | 0.428 | 0.245 | 0.374 |
| | 5 | 0.897 | 0.383 | 0.177 | 0.295 |
| | 10 | 1.000 | 0.299 | 0.103 | 0.186 |
| | 15 | 1.000 | 0.219 | 0.062 | 0.116 |
| | 25 | 1.000 | 0.149 | 0.036 | 0.070 |
| | 50 | 1.000 | 0.081 | 0.021 | 0.041 |
| BID0 | 2 | 0.385 | 0.260 | 0.200 | 0.263 |
| | 3 | 0.846 | 0.601 | 0.440 | 0.579 |
| | 5 | 1.000 | 0.461 | 0.226 | 0.369 |
| | 10 | 1.000 | 0.283 | 0.096 | 0.175 |
| | 15 | 1.000 | 0.222 | 0.066 | 0.124 |
| | 25 | 1.000 | 0.145 | 0.038 | 0.074 |
| | 50 | 1.000 | 0.078 | 0.024 | 0.046 |
Prediction performance of the ensemble of the top 3 classifiers with different reduced instance dimensions on the BID0 test dataset.
| No. Dimension | Sen | MCC | Prec | F1 |
|---|---|---|---|---|
| 1 | 0.328 | 0.475 | 0.704 | 0.447 |
| 2 | 0.328 | 0.352 | 0.396 | 0.358 |
| 5 | 0.846 | 0.601 | 0.440 | 0.579 |
| 10 | 0.846 | 0.499 | 0.310 | 0.454 |
| 20 | 0.481 | 0.240 | 0.144 | 0.221 |
| 50 | 0.500 | 0.274 | 0.173 | 0.257 |
| 100 | 0.538 | 0.252 | 0.141 | 0.224 |
Figure 1. Prediction performance for different sliding window sizes in instance encoding on the BID0 dataset, with training on the ASEdb0 dataset. The “I”-shaped bar for each window denotes the error of the prediction performance in F1.
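To make the role of the sliding window concrete, here is a minimal sketch of window-based residue encoding. It is an assumption-laden illustration: it uses a single physicochemical scale (Kyte-Doolittle hydropathy) and zero padding at the sequence ends, whereas the paper's scheme combines sequence profiles with properties from AAindex1. The names `window_encode` and `KD_HYDROPATHY` are hypothetical.

```python
# Kyte-Doolittle hydropathy values, used here as one example property scale.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def window_encode(sequence, window=5, scale=KD_HYDROPATHY, pad=0.0):
    """Encode each residue by the property values of its centered window.

    Positions beyond the sequence ends, and residues missing from the
    scale, are filled with `pad` (an assumption of this sketch).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    values = [scale.get(aa, pad) for aa in sequence]
    padded = [pad] * half + values + [pad] * half
    # One feature vector of length `window` per residue.
    return [padded[i:i + window] for i in range(len(sequence))]

features = window_encode("AVK", window=3)
print(features)  # [[0.0, 1.8, 4.2], [1.8, 4.2, -3.9], [4.2, -3.9, 0.0]]
```

A larger window gives each residue more sequence context but also a longer feature vector, which is the trade-off the figure explores.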
Performance comparison of the three methods on the BID0 dataset by training on the ASEdb0 dataset.
| Method | Type | Sen | MCC | Prec | F1 |
|---|---|---|---|---|---|
| Our Method | Random Projection | 0.846 | 0.601 | 0.440 | 0.579 |
| ISIS | Neural Networks | 0.191 | 0.030 | 0.026 | 0.046 |
| Random Predictor | | 0.983 | 0.000 | 0.018 | 0.035 |
Figure 2. The performance of our method for testing on the BID0 dataset by training on the ASEdb0 dataset. The left graph illustrates the ROC (receiver operating characteristic) curve, and the right one shows the four measure curves with respect to sensitivity.
Figure 3. Case study for the complex of protein PDB:1DDM. Subgraphs (a,b) show the predictions of our method and the ISIS method, respectively, where chain B of protein 1DDM is colored in wheat. Subgraph (c) illustrates the cartoon structure of the protein complex, where chain B of protein 1DDM is colored in green. Here, red residues are hotspots that are predicted correctly; green residues are non-hotspots that are predicted to be hotspots; and yellow residues are real hotspots that are predicted to be non-hotspots. All other residues are correctly predicted as non-hotspots.
The details of the hotspot datasets.
| Dataset | Hotspots | Non-Hotspots | Total Residues | Ratio |
|---|---|---|---|---|
| BID0 | 54 | 2895 | 2949 | 1.831% |
| ASEdb0 | 58 | 3957 | 4015 | 1.445% |
| BID | 54 | 58 | 112 | 48.214% |
| ASEdb | 58 | 91 | 149 | 38.926% |
Ratio: the number of hotspots divided by the total number of residues in the dataset.
Figure 4. The flowchart of the ensemble system for hotspot prediction. Here, means the k-th random projection. IBk implements the k-nearest neighbors (KNN) algorithm. The black arrows denote the flow of the training subset, while the blue ones denote that of the test subset.
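The flow described in the abstract (project, score each projection with a nearest-neighbor classifier on the training set, keep the best few, combine them) can be sketched as follows. This is a pure-Python illustration under stated assumptions, not the authors' implementation: it uses Gaussian random matrices, a plain 1-nearest-neighbor rule in place of Weka's IBk, leave-one-out F1 on the training set as the selection score, and a majority vote to combine classifiers. All function names and parameters are hypothetical.

```python
import math
import random

def random_matrix(n_features, n_components, rng):
    """Gaussian random projection matrix (n_components x n_features)."""
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_features)]
            for _ in range(n_components)]

def project(matrix, x):
    """Map one instance into the reduced space."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

def nearest_label(points, labels, query, skip=None):
    """1-NN label by squared Euclidean distance, optionally skipping one index."""
    best_label, best_dist = None, math.inf
    for i, (p, label) in enumerate(zip(points, labels)):
        if i == skip:
            continue
        dist = sum((a - b) ** 2 for a, b in zip(p, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def train_ensemble(X, y, n_components=5, n_candidates=30, top_n=3, seed=0):
    """Keep the top_n projections ranked by leave-one-out F1 (an assumption)."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_candidates):
        R = random_matrix(len(X[0]), n_components, rng)
        Z = [project(R, x) for x in X]
        preds = [nearest_label(Z, y, z, skip=i) for i, z in enumerate(Z)]
        scored.append((f1_score(y, preds), R, Z))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(R, Z) for _, R, Z in scored[:top_n]]

def predict(ensemble, y_train, x, min_votes=None):
    """Label x as a hotspot (1) if at least min_votes members say so."""
    if min_votes is None:
        min_votes = len(ensemble) // 2 + 1  # majority vote (an assumption)
    votes = sum(nearest_label(Z, y_train, project(R, x)) for R, Z in ensemble)
    return 1 if votes >= min_votes else 0

# Toy data: two well-separated clusters in 8 dimensions (not real residues).
rng = random.Random(1)
X = ([[rng.uniform(-0.1, 0.1) for _ in range(8)] for _ in range(6)]
     + [[10 + rng.uniform(-0.1, 0.1) for _ in range(8)] for _ in range(6)])
y = [0] * 6 + [1] * 6
ensemble = train_ensemble(X, y)
print(predict(ensemble, y, [10.0] * 8), predict(ensemble, y, [0.0] * 8))
```

The `min_votes` threshold is exposed as a parameter because the combination rule is not stated in this section; lowering it trades precision for sensitivity.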