Jan-Oliver Janda, Markus Busch, Fabian Kück, Mikhail Porfenenko, Rainer Merkl.
Abstract
BACKGROUND: One aim of the in silico characterization of proteins is to identify all residue-positions that are crucial for function or structure. Several sequence-based algorithms exist that predict functionally important sites. However, on the basis of sequence information alone, many functionally and structurally important sites are hard to distinguish, and consequently a large number of incorrectly predicted functional sites must be expected. We therefore set out to design a new classifier that differentiates between functionally and structurally important sites and to assess its performance on representative datasets.
Year: 2012 PMID: 22480135 PMCID: PMC3391178 DOI: 10.1186/1471-2105-13-55
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Classification performance of SVMs and FRpred on functionally and structurally important residue-positions
| Method | CAT_sites | LIG_sites | STRUC_sites |
|---|---|---|---|
| 2C-SVM | 0.324 | 0.213 | 0.782 |
| CLIPS-1D | 0.337 | 0.117 | 0.666 |
| FRpred, score ≥ 8 | 0.231 | 0.219 | 41% |
| FRpred, score = 9 | 0.250 | 0.197 | 22% |
The line "2C-SVM" gives MCC-values resulting from a classification of catalytic sites (CAT_sites) with SVM, of ligand-binding sites (LIG_sites) with SVM, and of structurally important sites (STRUC_sites) with SVM. The line "CLIPS-1D" shows the performance of the MC-SVM. For FRpred, performance resulting from the analysis of HSSP-MSAs is given. For CAT_sites and LIG_sites, MCC-values are listed resulting from FRcons-cat or FRcons-lig scores of at least 8 or 9, respectively. For STRUC_sites, the same percentage of false positives resulted from FRcons-cat and FRcons-lig predictions.
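The MCC-values in this table follow the standard definition of the Matthews correlation coefficient for a binary classification. A minimal sketch of that definition is given here; the function name and the example counts are hypothetical and are not taken from the paper. Returning `None` when the denominator vanishes is an assumption of this sketch, chosen to mirror cases in which an MCC-value cannot be determined.

```python
import math


def mcc_from_counts(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    tp, fp, tn, fn: numbers of true positives, false positives,
    true negatives, and false negatives (hypothetical inputs).
    Returns None if any marginal sum is zero, i.e. the MCC is undefined.
    """
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    if denominator == 0:
        return None  # assumption: report "undefined" instead of 0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for illustration only.
perfect = mcc_from_counts(tp=10, fp=0, tn=10, fn=0)
random_like = mcc_from_counts(tp=5, fp=5, tn=5, fn=5)
undefined = mcc_from_counts(tp=0, fp=0, tn=10, fn=0)
```

An MCC of 1 indicates perfect agreement, 0 indicates a prediction no better than chance, and -1 indicates complete disagreement.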
abund(k, CLASS)-values for amino acid residues
| Residue | CAT_sites | LIG_sites | STRUC_sites |
|---|---|---|---|
| A | -2.0424 | -0.3537 | -0.1210 |
| C | 1.3255 | 0.7376 | 1.2398 |
| D | 1.1178 | 0.0426 | -0.0498 |
| E | 0.6536 | -0.3856 | -0.6615 |
| F | -0.7708 | -0.0081 | 0.5057 |
| G | -0.7533 | 0.4195 | 0.7020 |
| H | 1.8883 | 0.8279 | -0.3044 |
| I | -2.8164 | -0.3026 | -0.6449 |
| K | 0.6051 | -0.3615 | -1.0215 |
| L | -2.4503 | -0.5416 | 0.2116 |
| M | -1.4026 | 0.1374 | -0.4882 |
| N | -0.1972 | 0.3566 | -0.2254 |
| P | -5.0000 | -0.4542 | 0.3643 |
| Q | -0.7243 | -0.1841 | -0.5615 |
| R | 0.6834 | 0.3879 | -0.2593 |
| S | 0.0027 | -0.0125 | -0.7006 |
| T | -0.5435 | 0.2314 | -0.3363 |
| V | -2.9568 | -0.4130 | -0.3294 |
| W | 0.1927 | 0.5548 | 1.2811 |
| Y | 0.3265 | 0.4572 | 0.7058 |
The score-values were deduced from residues belonging to the respective classes. See formula (6) for a definition of the scores.
Figure 1. Classification performance of CLIPS-1D in predicting functionally and structurally important residue-positions. Based on the maximal class-probability p, all members of the classes CAT_sites, LIG_sites, STRUC_sites, and NOANN_sites were categorized. NOANN_sites are all residue-positions not selected as STRUC_sites in the NON_ENZ dataset, i.e. positions without assigned function. Note that the absolute numbers of residue-positions are plotted on a logarithmic scale.
Residue-specific MCC-values
| Residue | CAT_sites | LIG_sites | STRUC_sites |
|---|---|---|---|
| A | -0.002 | 0.164 | 0.774 |
| C | 0.404 | 0.162 | 0.676 |
| D | 0.302 | 0.016 | 0.315 |
| E | 0.345 | 0.052 | 0.348 |
| F | 0.058 | 0.041 | 0.771 |
| G | 0.024 | 0.262 | 0.591 |
| H | 0.424 | -0.063 | 0.086 |
| I | -0.001 | 0.135 | 0.701 |
| K | 0.452 | 0.031 | 0.337 |
| L | -0.001 | 0.056 | 0.815 |
| M | -0.002 | 0.127 | 0.666 |
| N | 0.071 | 0.139 | 0.561 |
| P | - | 0.139 | 0.683 |
| Q | 0.098 | 0.111 | 0.678 |
| R | 0.287 | 0.040 | 0.319 |
| S | 0.307 | 0.156 | 0.595 |
| T | 0.055 | 0.174 | 0.682 |
| V | - | 0.119 | 0.761 |
| W | -0.008 | 0.007 | 0.689 |
| Y | 0.097 | 0.046 | 0.741 |
The MCC-values were determined in a class- and residue-specific manner. Due to missing cases, MCC-values could not be determined for Pro and Val residues at CAT_sites.
Performance of CLIPS-1D for different p-values
| Cut-off | Sensitivity (CAT) | Sensitivity (LIG) | Sensitivity (STRUC) | Specificity (CAT) | Specificity (LIG) | Specificity (STRUC) | Precision (CAT) | Precision (LIG) | Precision (STRUC) |
|---|---|---|---|---|---|---|---|---|---|
| 0.010 | 0.170 | 0.030 | 0.225 | 0.996 | 0.991 | 0.991 | 0.316 | 0.176 | 0.827 |
| 0.025 | 0.276 | 0.077 | 0.445 | 0.992 | 0.977 | 0.977 | 0.270 | 0.178 | 0.789 |
| 0.050 | 0.401 | 0.137 | 0.582 | 0.987 | 0.954 | 0.961 | 0.246 | 0.165 | 0.742 |
The three performance measures were determined (see Methods) by selecting as positive cases all residue-positions with a p-value not greater than the given cut-off. Labels: "CAT" CAT_sites, "LIG" LIG_sites, "STRUC" STRUC_sites.
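The selection rule described above (a residue-position counts as a positive prediction if its p-value is not greater than the cut-off) can be sketched as follows. This is an illustration using the standard definitions of sensitivity, specificity, and precision; the function name and the toy data are hypothetical, and the paper's Methods section remains the authoritative definition.

```python
def performance_at_cutoff(p_values, is_member, cutoff):
    """Sensitivity, specificity, and precision for one class.

    p_values:  p-value of each residue-position for the class
    is_member: True if the position truly belongs to the class
    cutoff:    positions with p-value <= cutoff are predicted positive
    A measure is None if its denominator is zero.
    """
    tp = fp = tn = fn = 0
    for p, truth in zip(p_values, is_member):
        predicted = p <= cutoff  # "not greater than the given cut-off"
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,
    }


# Toy data, invented for illustration; not taken from the paper.
toy_p = [0.005, 0.020, 0.040, 0.200, 0.008, 0.600]
toy_truth = [True, True, False, False, False, True]
result = performance_at_cutoff(toy_p, toy_truth, cutoff=0.025)
```

Lowering the cut-off makes the selection stricter, which is consistent with the trend in the table: sensitivity drops while specificity and precision tend to rise.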
CLIPS-1D predictions for residue-positions in sIGPS (PDB-ID 1A53)
| Residue | Position | p(CAT_sites) | p(LIG_sites) | p(STRUC_sites) | p(NOANN_sites) | p-value | Classification (CS, LBS, STRUC) | | |
|---|---|---|---|---|---|---|---|---|---|
| I | 49 | 0.001 | 0.154 | 0.824 | 0.022 | 0.003 | SC | ||
| E | 51 | 0.806 | 0.075 | 0.114 | 0.005 | 0.020 | CAT | ||
| K | 53 | 0.835 | 0.065 | 0.088 | 0.012 | 0.004 | CAT | ||
| K | 55 | 0.051 | 0.544 | 0.197 | 0.208 | 0.011 | SC | ||
| S | 56 | 0.017 | 0.170 | 0.801 | 0.012 | 0.004 | SC | ||
| L | 60 | 0.002 | 0.128 | 0.829 | 0.041 | 0.019 | IA | ||
| A | 77 | 0.006 | 0.172 | 0.810 | 0.011 | 0.018 | FC | ||
| I | 82 | 0.002 | 0.259 | 0.667 | 0.073 | 0.011 | SR | ||
| T | 84 | 0.002 | 0.111 | 0.881 | 0.007 | 0.003 | N | ||
| L | 108 | 0.006 | 0.106 | 0.863 | 0.024 | 0.012 | SR | ||
| K | 110 | 0.866 | 0.078 | 0.046 | 0.011 | 0.002 | CAT | ||
| F | 112 | 0.146 | 0.053 | 0.788 | 0.014 | 0.020 | STRUC | FC | |
| Q | 118 | 0.007 | 0.114 | 0.872 | 0.008 | 0.002 | FC | ||
| A | 122 | 0.001 | 0.066 | 0.882 | 0.051 | 0.010 | FC | ||
| A | 127 | 0.024 | 0.193 | 0.776 | 0.008 | 0.022 | N | ||
| L | 131 | 0.001 | 0.071 | 0.920 | 0.008 | 0.006 | STRUC | SR | |
| L | 132 | 0.004 | 0.164 | 0.794 | 0.038 | 0.023 | SR,FC | ||
| I | 133 | 0.005 | 0.169 | 0.790 | 0.036 | 0.005 | FC | ||
| L | 137 | 0.007 | 0.151 | 0.813 | 0.029 | 0.020 | SC,FC | ||
| L | 157 | 0.001 | 0.105 | 0.886 | 0.008 | 0.010 | SC,FC | ||
| E | 159 | 0.899 | 0.048 | 0.050 | 0.003 | 0.005 | CAT | ||
| D | 165 | 0.189 | 0.071 | 0.699 | 0.040 | 0.007 | N | ||
| I | 179 | 0.001 | 0.819 | 0.068 | 0.112 | 0.021 | SCE | ||
| N | 180 | 0.098 | 0.770 | 0.116 | 0.016 | 0.016 | LIG | ||
| S | 181 | 0.011 | 0.774 | 0.134 | 0.081 | 0.019 | SCE | ||
| L | 184 | 0.009 | 0.157 | 0.818 | 0.016 | 0.020 | IA | ||
| L | 197 | 0.003 | 0.130 | 0.818 | 0.049 | 0.020 | N | ||
| E | 210 | 0.866 | 0.059 | 0.068 | 0.007 | 0.008 | CAT | ||
| S | 211 | 0.738 | 0.168 | 0.087 | 0.007 | 0.005 | CAT | ||
| L | 231 | 0.003 | 0.224 | 0.762 | 0.011 | 0.025 | STRUC | SC | |
| I | 232 | 0.006 | 0.835 | 0.059 | 0.099 | 0.017 | LIG | ||
The first two columns give the residue and its position in sIGPS. The following four columns list the probabilities for the residue's membership with CAT_sites, LIG_sites, STRUC_sites, or NOANN_sites. The column labeled "p-value" lists the p-value for the class with max(p). The columns "CS" and "LBS" indicate the classification of known catalytic and ligand-binding sites. The last column lists the annotation deduced for residues predicted as STRUC_sites. Meaning of labels: "CAT", "LIG", "STRUC", residues predicted as CAT_sites, LIG_sites, or STRUC_sites, respectively. "SC" element of a stabilization center pair in sIGPS, "SCE" ditto in eIGPS, "SR" stabilization residue in sIGPS; see [36]. "FC" element of the folding core; see [37]. "IA" interaction with substrate; see [38]. "N" no function assigned.
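Assigning a residue-position to the class with the largest probability, max(p), can be sketched as below. The function name is hypothetical, and the order of the four probabilities (CAT_sites, LIG_sites, STRUC_sites, NOANN_sites) is assumed from the caption. The three example rows are copied from the table; the p-value column is not recomputed here.

```python
CLASSES = ("CAT_sites", "LIG_sites", "STRUC_sites", "NOANN_sites")


def classify_by_max_probability(probabilities):
    """Return (class label, probability) for the most probable class.

    probabilities: four class-probabilities in the order given by
    CLASSES (column order assumed from the table caption).
    """
    if len(probabilities) != len(CLASSES):
        raise ValueError("expected one probability per class")
    best = max(range(len(CLASSES)), key=lambda i: probabilities[i])
    return CLASSES[best], probabilities[best]


# Rows taken from the sIGPS table: (residue, position, probabilities).
rows = [
    ("E", 51, (0.806, 0.075, 0.114, 0.005)),
    ("I", 49, (0.001, 0.154, 0.824, 0.022)),
    ("I", 232, (0.006, 0.835, 0.059, 0.099)),
]
predictions = {
    (residue, position): classify_by_max_probability(probs)[0]
    for residue, position, probs in rows
}
```

For these rows the sketch yields CAT_sites for E51, STRUC_sites for I49, and LIG_sites for I232.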
Figure 2. Localization of predicted STRUC_sites. Based on PDB-ID 1A53, the surface of the whole protein (grey) and of residues predicted as STRUC_sites (orange) is shown. The substrate indole-3-glycerol phosphate is plotted in dark blue. The picture was generated by means of PyMOL [39].