Sungsam Gong, Tom L Blundell.
Abstract
Substitutions of individual amino acids in proteins may be under very different evolutionary restraints depending on their structural and functional roles. The Environment Specific Substitution Table (ESST) describes the pattern of substitutions in terms of amino acid location within elements of secondary structure, solvent accessibility, and the existence of hydrogen bonds between side chains and neighbouring amino acid residues. Clearly amino acids that have very different local environments in their functional state compared to those in the protein analysed will give rise to inconsistencies in the calculation of amino acid substitution tables. Here, we describe how the calculation of ESSTs can be improved by discarding the functional residues from the calculation of substitution tables. Four categories of functions are examined in this study: protein-protein interactions, protein-nucleic acid interactions, protein-ligand interactions, and catalytic activity of enzymes. Their contributions to residue conservation are measured and investigated. We test our new ESSTs using the program CRESCENDO, designed to predict functional residues by exploiting knowledge of amino acid substitutions, and compare the benchmark results with proteins whose functions have been defined experimentally. The new methodology increases the Z-score by 98% at the active site residues and finds 16% more active sites compared with the old ESST. We also find that discarding amino acids responsible for protein-protein interactions helps in the prediction of those residues although they are not as conserved as the residues of active sites. Our methodology can make the substitution tables better reflect and describe the substitution patterns of amino acids that are under structural restraints only.
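The masking idea in the abstract can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the function name `count_substitutions`, the input layout, and the symmetric pair counting are all assumptions made for the example, and the environment labels are arbitrary strings standing in for the secondary structure, accessibility, and hydrogen-bond classes.

```python
from collections import Counter
from itertools import combinations


def count_substitutions(alignment, environments, masked):
    """Count environment-specific substitutions, skipping masked residues.

    alignment    -- list of equal-length aligned sequences ('-' is a gap)
    environments -- per sequence, a list of environment labels per column
    masked       -- set of (sequence_index, column) pairs flagged as
                    functional residues (hypothetical input format)
    """
    counts = Counter()
    n_cols = len(alignment[0])
    for col in range(n_cols):
        for i, j in combinations(range(len(alignment)), 2):
            a, b = alignment[i][col], alignment[j][col]
            if a == "-" or b == "-":
                continue  # gaps carry no substitution information
            if (i, col) in masked or (j, col) in masked:
                continue  # discard pairs that involve a functional residue
            # Count both directions, keyed by the environment of the
            # residue being replaced (a simplification for this sketch).
            counts[(environments[i][col], a, b)] += 1
            counts[(environments[j][col], b, a)] += 1
    return counts


alignment = ["ACD", "ACE"]
environments = [["h", "h", "e"], ["h", "h", "e"]]
unmasked = count_substitutions(alignment, environments, masked=set())
masked = count_substitutions(alignment, environments, masked={(0, 1)})
```

In this toy alignment, flagging the cysteine at column 1 of the first sequence removes that column's pair from the counts, while the other columns are unaffected.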
Year: 2008 PMID: 18833291 PMCID: PMC2527532 DOI: 10.1371/journal.pcbi.1000179
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Table 1. Four Categories of Functional Residues Considered in this Study.
| Functional Category | Database | Feature Identifier | Description | Masking A | Masking B | Masking C | Masking D | URL |
|---|---|---|---|---|---|---|---|---|
| Protein–protein interaction | InterPare | N/A | Database of domain–domain interaction interface | √ | | √ | | |
| Catalytic activity | CSA | N/A | Database documenting enzyme active sites and catalytic residues in enzymes of 3D structure | √ | √ | | √ | |
| Catalytic activity | UniProt | ACT_SITE | Amino acid(s) involved in the activity of an enzyme | √ | √ | | √ | |
| Protein–nucleic acid interaction | BIPA | N/A | Database of protein–nucleic acid interactions | √ | √ | √ | | N/A |
| Protein–nucleic acid interaction | UniProt | DNA_BIND | Extent of a DNA-binding region | √ | √ | √ | | |
| Protein–ligand interaction | UniProt | BINDING | Binding site for any chemical group (co-enzyme, prosthetic group, etc.) | √ | √ | √ | | |
| Protein–ligand interaction | UniProt | CA_BIND | Extent of a calcium-binding region | √ | √ | √ | | |
| Protein–ligand interaction | UniProt | NP_BIND | Extent of a nucleotide phosphate-binding region | √ | √ | √ | | |
| Protein–ligand interaction | UniProt | METAL | Binding site for a metal ion | √ | √ | √ | | |
The versions of CSA [16] and UniProt [17] were 2.2.7 and 12.2, respectively. InterPare [18] was based on SCOP [19] version 1.71. The “Feature Identifier” is only for UniProt annotations. (A: all masking, B: no protein–protein interaction, C: no active sites, D: active-site only.)
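Table 2. 17 ESSTs and the Number of Functional Residues Masked from the Alignments.
| Alignment Source | Number of Families | Number of Structures | Number of Residues | Matrix Type | Masking Type | Masking Residues | %Mask |
|---|---|---|---|---|---|---|---|
| HOMSTRAD | 177 | 706 | 146,437 | OLD | X | 0 | 0.00 |
| HOMSTRAD | 177 | 706 | 146,437 | OLD | J | 2,048 | 1.40 |
| HOMSTRAD | 177 | 706 | 146,437 | OLD | B | 4,601 | 3.14 |
| HOMSTRAD | 177 | 706 | 146,437 | OLD | R | 4,601 | 3.14 |
| SCOP | 221 | 902 | 235,588 | ENZ | X | 0 | 0.00 |
| SCOP | 221 | 902 | 235,588 | ENZ | A | 37,808 | 16.05 |
| SCOP | 221 | 902 | 235,588 | ENZ | B | 6,195 | 2.63 |
| SCOP | 221 | 902 | 235,588 | ENZ | C | 36,265 | 15.39 |
| SCOP | 221 | 902 | 235,588 | ENZ | D | 1,615 | 0.69 |
| SCOP | 221 | 902 | 235,588 | ENZ | R | 37,808 | 16.05 |
| SCOP | 566 | 2,556 | 384,618 | NOENZ | X | 0 | 0.00 |
| SCOP | 1,187 | 5,833 | 1,096,027 | ALL | X | 0 | 0.00 |
| SCOP | 1,187 | 5,833 | 1,096,027 | ALL | A | 198,411 | 18.10 |
| SCOP | 1,187 | 5,833 | 1,096,027 | ALL | B | 21,830 | 1.99 |
| SCOP | 1,187 | 5,833 | 1,096,027 | ALL | C | 191,377 | 17.46 |
| SCOP | 1,187 | 5,833 | 1,096,027 | ALL | D | 1,840 | 0.17 |
| SCOP | 1,187 | 5,833 | 1,096,027 | ALL | R | 198,411 | 18.10 |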
New ESSTs were based on the structure alignments of SCOP families [19]. ENZ comprises 221 enzyme-specific SCOP families, each of which contains at least one ACT_SITE annotation of UniProt [17] or a hand-curated CSA entry [16]. NOENZ is the complement of ENZ; it does not contain even the predicted entries of CSA. ALL is the final alignment source obtained from the filtering process (see Materials and Methods). The masking sources of A, B, C, and D are listed in Table 1. X denotes non-masking and R denotes random-masking. R is set as a control to assess the significance of removing functional residues from the substitution models. The ESST of Shi et al. (OLD-J) [11] is based on 177 HOMSTRAD families, which consist of 706 structures. It masks 2,048 residues that are involved in (1) interaction with heteroatoms and (2) domain–domain interaction. OLD-X and OLD-R are the non-masking and random-masking models of J.
Residue: number of all residues.
Masking Residues: number of masked residues.
%Mask = (number of masked residues / number of all residues) × 100.
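As a quick arithmetic check of the %Mask definition, the following minimal sketch recomputes the percentage from the residue counts tabulated for a few ESSTs. The function name `percent_mask` is introduced here for illustration only.

```python
def percent_mask(masked_residues, all_residues):
    """Percentage of alignment residues that were masked."""
    return 100.0 * masked_residues / all_residues


# Residue counts taken from the table above.
old_j = round(percent_mask(2048, 146437), 2)
enz_a = round(percent_mask(37808, 235588), 2)
all_a = round(percent_mask(198411, 1096027), 2)
print(old_j, enz_a, all_a)
```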
Figure 1. Probabilities of Residue Conservation for 21 Amino Acids.
The probability of residue conservation (PCONS) was averaged over the diagonal of the substitution tables. (A) PCONS of three matrix types (ENZ, NOENZ, and ALL) are compared with OLD. Non-masking models (X) were used for the three matrix types and OLD to show the effect of the alignment source. (ENZ: 221 enzyme-specific SCOP families, NOENZ: non-enzymes, ALL: all the alignments, OLD: non-masking ESST of Shi et al. [11]. See Table 2 for details.) (B) Five masking tables and one non-masking table are compared with the ESST of Shi et al. [11]. Masking and non-masking tables are from the 221 enzyme-specific alignments (ENZ). Masking sources of A, B, C, and D are listed in Table 1. (R: random-masking, X: non-masking.)
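One way to read "averaged over the diagonal" is sketched below, under the assumption that each substitution table is stored as a row-normalised probability matrix in which entry `table[a][b]` is the probability that residue `a` is replaced by `b`. The storage layout, the function name, and the toy two-letter alphabet are all hypothetical; the source does not state how its tables are normalised.

```python
def mean_conservation(table):
    """Average of the diagonal entries table[a][a] of a substitution table."""
    return sum(table[a][a] for a in table) / len(table)


# Toy two-residue table; a real ESST would have 21 residue types.
toy_table = {
    "A": {"A": 0.8, "C": 0.2},
    "C": {"A": 0.4, "C": 0.6},
}
pcons = mean_conservation(toy_table)
```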
Table 3. Rank Correlation.
| | PCONS | Z-Score | SENS | DIST | %Mask |
|---|---|---|---|---|---|
| PCONS | 1 | −0.85 | −0.93 | −0.38 | −0.30 |
| Z-Score | | 1 | 0.95 | 0.54 | 0.45 |
| SENS | | | 1 | 0.48 | 0.45 |
| DIST | | | | 1 | 0.29 |
| %Mask | | | | | 1 |
Spearman's rank correlations were calculated between the variables PCONS, Z-score, SENS, DIST, and %Mask. See Materials and Methods for the definition of Spearman's rank correlation. %Mask is from Table 2. Z-Score and SENS are from Table 5. DIST is from the first row of Table S2. PCONS is from the bottom line of Table S1. PCONS: probability of residue conservation, Z-score: average Z-score of 602 active sites, SENS: sensitivity, DIST: distance between two ESSTs, %Mask: percentage of discarded functional residues.
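The paper's own definition of Spearman's rank correlation is in its Materials and Methods, which is not reproduced here. The sketch below uses the standard formulation (Pearson correlation of the ranks, with tied values sharing their average rank) and may differ in detail from what the authors used; the function names are illustrative.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Applied to two columns of per-ESST values (for example average Z-score and SENS), `spearman` returns a value between −1 and 1; a perfectly monotonic increasing relationship gives 1.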
Performance of 17 ESSTs on Detecting Active Sites.
| Matrix Type | Masking Type | TP | FP | FN | TN | SENS | SPEC | COV | F-Measure |
|---|---|---|---|---|---|---|---|---|---|
| OLD | X | 168 | 4832 | 432 | 75976 | 0.28 | 0.9401 | 0.0336 | 0.060 |
| OLD | R | 168 | 4830 | 432 | 75978 | 0.28 | 0.9401 | 0.0336 | 0.060 |
| OLD | J | 189 | 4877 | 411 | 75931 | 0.315 | 0.9395 | 0.0373 | 0.067 |
| OLD | B | 219 | 4888 | 381 | 75920 | 0.365 | 0.9394 | 0.0429 | 0.077 |
| ENZ | Xt | 221 | 4942 | 379 | 75866 | 0.3683 | 0.9387 | 0.0428 | 0.077 |
| ENZ | Rt | 225 | 4968 | 375 | 75840 | 0.375 | 0.9384 | 0.0433 | 0.078 |
| ENZ | Ct | 240 | 4870 | 360 | 75938 | 0.4 | 0.9396 | 0.047 | 0.084 |
| ENZ | Dt | 248 | 4977 | 352 | 75831 | 0.4133 | 0.9383 | 0.0475 | 0.085 |
| ENZ | At | 264 | 4805 | 336 | 76003 | 0.44 | 0.9404 | 0.0521 | 0.093 |
| ENZ | Bt | 270 | 4984 | 330 | 75824 | 0.45 | 0.9382 | 0.0514 | 0.092 |
| NOENZ | X | 273 | 5234 | 327 | 75574 | 0.455 | 0.9351 | 0.0496 | 0.089 |
| ALL | Xt | 249 | 5283 | 351 | 75525 | 0.415 | 0.9345 | 0.045 | 0.081 |
| ALL | Dt | 259 | 5285 | 341 | 75523 | 0.4317 | 0.9345 | 0.0467 | 0.084 |
| ALL | Rt | 262 | 5246 | 338 | 75562 | 0.4367 | 0.935 | 0.0476 | 0.086 |
| ALL | At | 273 | 5150 | 327 | 75658 | 0.455 | 0.9362 | 0.0503 | 0.091 |
| ALL | Ct | 277 | 5136 | 323 | 75672 | 0.4617 | 0.9363 | 0.0512 | 0.092 |
| ALL | Bt | 282 | 5187 | 318 | 75621 | 0.47 | 0.9357 | 0.0516 | 0.093 |
Out of 81,410 residues in the test-sets, 602 residues are annotated as “ACT_SITE” by UniProt [17] or CSA [16]. For those active sites, CRESCENDO [8] could either correctly predict (TP) or fail to predict (FN) (see text). Two active sites of ‘d7odca1’ (A chain of PDB 7odc), a SCOP domain in the test-sets, were discarded because of an internal error; hence, 600 active sites are counted as either TP or FN. The number of predicted residues is the same as the sum of TP and FP for each ESST type. Note that only residues from the first cluster of predicted residues (rank 1) were considered in this analysis. TP: True Positive, FP: False Positive, FN: False Negative, TN: True Negative, SENS: Sensitivity, SPEC: Specificity, COV: Coverage.
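The exact formulas for SENS, SPEC, COV, and F-measure are given in the main text of the paper, which is not included here. The sketch below assumes the conventional definitions, with coverage taken as TP / (TP + FP) and the F-measure as the harmonic mean of SENS and COV. Under that assumption, the ALL-Bt row of the active-site table is reproduced to about three decimal places; the function name is illustrative.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Conventional metrics from confusion counts (assumed definitions)."""
    sens = tp / (tp + fn)  # fraction of annotated sites that were predicted
    spec = tn / (tn + fp)  # fraction of non-sites that were not predicted
    cov = tp / (tp + fp)   # fraction of predictions that are annotated sites
    f_measure = 2 * sens * cov / (sens + cov)
    return {"SENS": sens, "SPEC": spec, "COV": cov, "F": f_measure}


# Counts for the ALL matrix type with masking type Bt, from the table above.
all_bt = confusion_metrics(tp=282, fp=5187, fn=318, tn=75621)
for name, value in all_bt.items():
    print(f"{name}: {value:.4f}")
```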
Z-Score of CRESCENDO for Functional Residues.
| Matrix Type | Masking Type | Average Z-Score (All) | Average Z-Score (Predicted) | Average Z-Score (Active Site) | Average Z-Score (PPI) | Average Z-Score (PNI) | Average Z-Score (PLI) | Ratio | P-Value |
|---|---|---|---|---|---|---|---|---|---|
| OLD | X | 0.00063 | 1.396 | 0.480 | 0.0250 | 0.055 | 0.449 | 0.78 | 0.081 |
| OLD | R | 0.00067 | 1.402 | 0.483 | 0.0249 | 0.052 | 0.450 | 0.79 | 0.080 |
| OLD | J | 0.00062 | 1.410 | 0.612 | 0.0284 | 0.055 | 0.461 | 1.00 | 0.079 |
| OLD | B | 0.00065 | 1.420 | 0.734 | 0.0274 | 0.059 | 0.490 | 1.20 | 0.078 |
| ENZ | Xt | 0.00060 | 1.387 | 0.635 | 0.0042 | 0.024 | 0.426 | 1.04 | 0.083 |
| ENZ | Rt | 0.00060 | 1.387 | 0.652 | 0.0067 | 0.025 | 0.431 | 1.06 | 0.083 |
| ENZ | Ct | 0.00063 | 1.413 | 0.734 | 0.0100 | 0.025 | 0.427 | 1.20 | 0.079 |
| ENZ | Dt | 0.00062 | 1.399 | 0.772 | 0.0078 | 0.051 | 0.428 | 1.26 | 0.081 |
| ENZ | At | 0.00063 | 1.423 | 0.858 | 0.0143 | 0.056 | 0.433 | 1.40 | 0.077 |
| ENZ | Bt | 0.00064 | 1.411 | 0.870 | 0.0086 | 0.068 | 0.447 | 1.42 | 0.079 |
| NOENZ | X | 0.00063 | 1.420 | 0.835 | 0.0046 | 0.099 | 0.508 | 1.36 | 0.078 |
| ALL | Xt | 0.00063 | 1.414 | 0.696 | 0.0085 | 0.068 | 0.489 | 1.14 | 0.079 |
| ALL | Rt | 0.00064 | 1.415 | 0.771 | 0.0065 | 0.075 | 0.501 | 1.26 | 0.079 |
| ALL | Dt | 0.00066 | 1.412 | 0.798 | 0.0055 | 0.078 | 0.495 | 1.30 | 0.079 |
| ALL | At | 0.00064 | 1.433 | 0.860 | 0.0159 | 0.069 | 0.495 | 1.41 | 0.076 |
| ALL | Ct | 0.00067 | 1.436 | 0.893 | 0.0155 | 0.077 | 0.515 | 1.46 | 0.076 |
| ALL | Bt | 0.00068 | 1.435 | 0.936 | 0.0073 | 0.086 | 0.518 | 1.53 | 0.076 |
The average Z-scores are shown for four categories of functional residues in the test-sets: catalytic activity, protein–protein interactions, protein–nucleic acid interactions, and protein–ligand interactions. The test-sets consist of 73 SCOP families, which is one third of the SCOP families in ENZ (see Table 2).
All: total number of residues from the test-sets (81,410).
Predicted: residues predicted by CRESCENDO.
Active Site: active-site residues (602).
PPI: protein–protein interaction sites (11,917).
PNI: protein–nucleic acid interaction sites (194).
PLI: protein–ligand interaction sites (1,348).
Ratio: ratio of the Z-score at the active-site residues to that of OLD-J.
P-Value: P-value (right-tail) of the predicted residues.
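The Ratio and P-Value columns can be illustrated with a short sketch. The ratio follows directly from the note above (active-site Z-score divided by the OLD-J value of 0.612). For the P-value, the source only says "right-tail"; treating the average Z-score of the predicted residues as a standard normal deviate is an assumption made here, although it gives values close to the tabulated ones. Both function names are illustrative.

```python
import math

OLD_J_ACTIVE_SITE_Z = 0.612  # active-site Z-score of OLD-J, from the table


def ratio_to_old_j(z_active_site):
    """Active-site Z-score relative to the OLD-J table."""
    return z_active_site / OLD_J_ACTIVE_SITE_Z


def right_tail_p(z):
    """Upper-tail probability of a standard normal deviate (assumed model)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


ratio_all_bt = round(ratio_to_old_j(0.936), 2)  # ALL-Bt row
ratio_old_x = round(ratio_to_old_j(0.480), 2)   # OLD-X row
p_old_x = right_tail_p(1.396)                   # OLD-X predicted Z-score
```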
Figure 2. Performance of 17 ESSTs on Detecting Active Site Residues.
Z-score (blue) and sensitivity (red) are plotted against 17 ESSTs. Z-score is averaged over 602 active-site residues in the test-sets (see text). Z-score and sensitivity (SENS) are highly correlated (0.95 in Spearman's rank correlation, Table 3). If any SCOP families in the test-sets are included in the 17 ESSTs, they are removed from the ESSTs to avoid any bias. Those benchmarking ESSTs are marked by ‘t’ (e.g., At, Bt, Ct, and Dt) to distinguish them from the originals. Z-score and SENS of non-masking (X) and random-masking (R) tables are always lower than those of masking models (At, Bt, Ct, and Dt) within the same matrix type (OLD, ENZ, ALL). All the masking tables outperform the ESST of Shi et al. (J) [11].
Figure 3. Predicting Four Categories of Functional Residues by CRESCENDO.
Four case studies of predicting functional residues are shown: (A) active sites, (B) PPI (protein–protein interaction), (C) PNI (protein–nucleic acid interaction), (D) PLI (protein–ligand interaction). SCOP domains d1evua4 [23], d1i7kb_ [24], d1k8wa5 [33], and d1ed9a_ [34] were used for A, B, C, and D, respectively. True positives (TP) are coloured in pink, false negatives (FN, missing residues) in orange, and false positives (FP) in green. TP and FN are shown as sticks (bold-frame). (A) Cysteine protease. CRESCENDO predicted 27 residues as functional residues. All three catalytic residues (CYS-314, HIS-373, and ASP-396) were correctly identified. The ALL-B type ESST (see Table 2) was used in this figure. FP (green) are clustered around the three real active sites (pink). (B) Ubiquitin conjugating (UBC) enzyme. 12 residues were predicted by CRESCENDO using the ALL-A ESST. Five (coloured in pink) were correctly identified among 14 residues annotated as PPI residues. The interacting partner (A chain of 1i7k) is placed at the bottom and coloured in gray. The solvent accessible surface areas (SASA) for the five TP are as follows: ARG-34 (35.64), PRO-90 (4.12), SER-123 (4.74), ALA-124 (0.55), LEU-125 (72.39). SASA for the 9 FN are as follows: PRO-30 (77.26), VAL-31 (24.02), SER-87 (110.40), GLY-88 (16.05), TYR-89 (0.01), TYR-91 (58.29), GLU-120 (108.68), LYS-121 (113.96), TRP-122 (7.20). The SASA is from InterPare [18]. (C) Pseudouridine synthase. BIPA (S. Lee, unpublished) annotates 43 residues as PNI. 14 residues were TP (coloured in pink) among 20 residues predicted by CRESCENDO. ALL-D was used as ESST. DNA is coloured in blue. (D) Alkaline phosphatase. UniProt annotates 9 residues as metal-binding (METAL), which were all correctly identified by CRESCENDO among 30 predicted residues. ALL-B was used as ESST. ZN (zinc) and MG (magnesium) are coloured in cyan and blue, respectively.
Performance of ESSTs on Protein–Protein Interaction Residues.
| Matrix Type | Masking Type | TP | FP | FN | TN | SENS | SPEC | COV | F-Measure |
|---|---|---|---|---|---|---|---|---|---|
| OLD | B | 931 | 4176 | 10986 | 65317 | 0.0781 | 0.8560 | 0.1823 | 0.1094 |
| OLD | R | 934 | 4064 | 10983 | 65429 | 0.0784 | 0.8563 | 0.1869 | 0.1104 |
| OLD | X | 939 | 4061 | 10978 | 65432 | 0.0788 | 0.8563 | 0.1878 | 0.1110 |
| OLD | J | 939 | 4127 | 10978 | 65366 | 0.0788 | 0.8562 | 0.1854 | 0.1106 |
| ENZ | At | 906 | 4163 | 11011 | 65330 | 0.0760 | 0.8558 | 0.1787 | 0.1067 |
| ENZ | Ct | 908 | 4202 | 11009 | 65291 | 0.0762 | 0.8557 | 0.1777 | 0.1067 |
| ENZ | Xt | 921 | 4242 | 10996 | 65251 | 0.0773 | 0.8558 | 0.1784 | 0.1078 |
| ENZ | Rt | 925 | 4268 | 10992 | 65225 | 0.0776 | 0.8558 | 0.1781 | 0.1081 |
| ENZ | Dt | 960 | 4265 | 10957 | 65228 | 0.0806 | 0.8562 | 0.1837 | 0.1120 |
| ENZ | Bt | 973 | 4281 | 10944 | 65212 | 0.0816 | 0.8563 | 0.1852 | 0.1133 |
| NOENZ | X | 893 | 4614 | 11024 | 64879 | 0.0749 | 0.8548 | 0.1622 | 0.1025 |
| ALL | Xt | 930 | 4602 | 10987 | 64891 | 0.0780 | 0.8552 | 0.1681 | 0.1066 |
| ALL | Bt | 953 | 4516 | 10964 | 64977 | 0.0800 | 0.8556 | 0.1743 | 0.1096 |
| ALL | Dt | 963 | 4581 | 10954 | 64912 | 0.0808 | 0.8556 | 0.1737 | 0.1103 |
| ALL | Rt | 980 | 4528 | 10937 | 64965 | 0.0822 | 0.8559 | 0.1779 | 0.1125 |
| ALL | Ct | 1000 | 4245 | 10917 | 65248 | 0.0839 | 0.8567 | 0.1907 | 0.1165 |
| ALL | At | 1003 | 4420 | 10914 | 65073 | 0.0842 | 0.8564 | 0.1850 | 0.1157 |
Out of 81,410 residues in the test-sets, 11,917 residues are annotated by InterPare [18]. The definitions of TP, FP, FN, TN, SENS, SPEC, COV, and F-measure are the same as in Table 5. Only residues from the first cluster of predicted residues were considered in this analysis. TP: True Positive, FP: False Positive, FN: False Negative, TN: True Negative, SENS: Sensitivity, SPEC: Specificity, COV: Coverage.