Elisa Cilia, Andrea Passerini.
Abstract
BACKGROUND: Prediction of catalytic residues is a major step in characterizing the function of enzymes. In its simplest formulation, the problem can be cast as a binary classification task at the residue level: predicting whether each residue is directly involved in the catalytic process. The task is hard even when structural information is available, because of the wide range of roles a functional residue can play and the large imbalance between the number of catalytic and non-catalytic residues.
Year: 2010 PMID: 20199672 PMCID: PMC2844391 DOI: 10.1186/1471-2105-11-115
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Sequence-based features.
| Features | Description |
|---|---|
| 1D1 | Target amino acid name |
| 1D2 | Target amino acid type |
| 1D3 | Conservation profiles |
Representation: features extracted from the protein sequence.
Structure-based features.
| Features | Description |
|---|---|
| 3D1 | Physical and chemical properties (amino acid attributes) |
| 3D2 | Amino acid composition |
| 3D3 | Charge/Neutrality |
| 3D4 | Water molecule quantity |
| 3D5 | Atomic density |
| 3D6 | Flexibility (B-factor) |
| 3D7 | Disulphide bond |
| 3D8 | Heterogens |
| 3D9 | Cofactor binding |
Representation: scalar features extracted from the residue structural neighborhood.
Legend of abbreviations.
| Abbreviation | Description |
|---|---|
| | the attributes extracted from the protein sequence among the 24 in [ |
| | the whole set of 24 attributes proposed in [ |
| | the optimal set of 7 attributes selected among the 24 in [ |
| | the attributes from 1 |
Legend of abbreviations for the different sets of attributes tried in the experiments (see Methods).
Feature evaluation.
| 1. | 22 ± 11 | 30 ± 11 | 1.3 ± 0.7 | 24 ± 7 | 24 ± 8 | 0.9172 | 0.2777 | |
| 2. | 26 ± 8 | 29 ± 12 | 0.9 ± 0.3 | 27 ± 9 | 26 ± 9 | 0.9311 | 0.3129 | |
| 3. | 27 ± 10 | 30 ± 10 | 1.0 ± 0.4 | 27 ± 8 | 27 ± 8 | 0.9370 | 0.3204 | |
| 4. | 22 ± 11 | 37 ± 11 | 1.8 ± 1.3 | 26 ± 10 | 27 ± 10 | 0.9490 | 0.3532 | |
| 5. | 26 ± 10 | 37 ± 14 | 1.2 ± 0.5 | 30 ± 9 | 30 ± 10 | 0.9529 | 0.3605 | |
| 6. | 26 ± 6 | 44 ± 10 | 1.4 ± 0.3 | 32 ± 7 | 33 ± 7 | 0.9556 | 0.3659 | |
| 7. | 28 ± 9 | 46 ± 10 | 1.4 ± 0.5 | 34 ± 8 | 34 ± 8 | 0.9635 | 0.3723 | |
| 8. | 33 ± 14 | 48 ± 8 | 1.4 ± 0.7 | 37 ± 7 | 38 ± 6 | 0.9633 | 0.4125 | |
Summary of the results of the cross-validation on different selected attributes (linear kernel, regularization parameter c = 1).
Statistical analysis.
| 23 | 24 | 25 | 24 | 27 | 24 | |
| 21 | 24 | 24 | 23 | 30 | 23 | |
| 26 ◦ ∙ | 28 ◦ ∙ | 27 ◦ ∙ | 28 ◦ ∙ | 34 ◦ | 27 ◦ ∙ |
Statistical comparisons of our best set of sequence-based features (SVM_P5_1D1-3), the set of sequence- and structure-based features employed in Petrova and Wu [9] (SVM_P24), and their combination with our additional set of structural neighborhood features (SVM_P24_1D1-3,3D1-6), excluding those coming from ligand information. Cross-validated F1 measures (%) and results of a paired Wilcoxon test (α = 0.05) on the statistical significance of the performance differences are reported for all benchmark datasets employed in this study. A white circle (◦) indicates a statistically significant improvement of the classifier in the row over the sequence-based classifier (SVM_P5_1D1-3), while a black bullet (∙) indicates a statistically significant improvement over the Petrova and Wu features (SVM_P24).
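The paired Wilcoxon test mentioned in this caption can be illustrated with a small sketch. The record does not say which implementation the authors used, so what follows is only one plausible reading: a two-sided signed-rank test with the normal approximation, no tie correction of the variance, and hypothetical per-fold F1 values (integer percentages) invented purely for illustration.

```python
import math

def wilcoxon_signed_rank(a, b):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    Returns (W, p), where W is the smaller of the positive and negative
    rank sums. Zero differences are dropped; tied absolute differences
    receive average ranks. The variance is not corrected for ties.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # extend j over the run of tied absolute differences
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, min(p, 1.0)

# Hypothetical per-fold F1 scores (%) for two classifiers on the same folds.
f1_combined = [30, 32, 35, 31, 36, 33, 34, 37, 38, 29]
f1_sequence = [26, 27, 29, 28, 30, 31, 27, 33, 32, 28]
w, p = wilcoxon_signed_rank(f1_combined, f1_sequence)
significant = p < 0.05  # alpha = 0.05, as in the caption
```

With only a handful of folds the normal approximation is rough; an exact test (for example the one offered by a statistics library) would be preferable in practice.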
Figure 1. Feature weight vector. Vector of the feature weights of a classifier trained on the PW dataset.
Comparison with state-of-the-art sequence-based approach [16].
| CRpred | R | 54.0 | 48.2 | 52.1 | 58.3 | 53.7 | 50.1 |
| | P | 14.9 | 17.0 | 17.0 | 18.6 | 17.5 | 14.7 |
| Equal P | R | 67.4 | 64.6 | 66.2 | 61.3 | 69.7 | 54.8 |
| Equal R | P | 21.0 | 24.1 | 23.9 | 20.5 | 22.5 | 15.5 |
Comparison with the CRpred [16] sequence-based approach on six benchmark datasets. For each dataset we report recall obtained by our predictor at a precision equal to that of the competing method and precision at equal recall. Results are obtained without including ligand information.
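The "recall at equal precision" and "precision at equal recall" figures described in this caption can be sketched as follows. The exact procedure is not given in this record, so the code assumes one reasonable interpretation: sweep the decision threshold over the ranked predictions and report the best recall among operating points whose precision is at least the competitor's (and symmetrically for precision). Function names and the toy data are hypothetical.

```python
def pr_points(scores, labels):
    """(precision, recall) at each distinct score threshold, strictest first.

    labels are 1 for catalytic and 0 for non-catalytic residues; at least
    one positive label is assumed.
    """
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    total_pos = sum(labels)
    points, tp, fp = [], 0, 0
    for i, (s, y) in enumerate(pairs):
        tp += y
        fp += 1 - y
        # emit a point only at the end of a run of tied scores
        if i + 1 == len(pairs) or pairs[i + 1][0] != s:
            points.append((tp / (tp + fp), tp / total_pos))
    return points

def recall_at_precision(scores, labels, target_precision):
    """Best recall among operating points with precision >= target."""
    ok = [r for p, r in pr_points(scores, labels) if p >= target_precision]
    return max(ok) if ok else 0.0

def precision_at_recall(scores, labels, target_recall):
    """Best precision among operating points with recall >= target."""
    ok = [p for p, r in pr_points(scores, labels) if r >= target_recall]
    return max(ok) if ok else 0.0

# Toy ranking: six residues, three of them catalytic.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 1]
r_eq_p = recall_at_precision(scores, labels, 0.6)
p_eq_r = precision_at_recall(scores, labels, 0.6)
```

Reading a table row such as "Equal P | R" then amounts to calling `recall_at_precision` with the competing method's precision as the target.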
Figure 2. ROC and Recall/Precision curves. ROC and Recall/Precision curves of the predictions on two low homology benchmark datasets.
Comparison with the structure-based approaches by Chea et al. [13] and Youn et al. [10] on their benchmark datasets.
| Competing methods | R | 29.3 | 51.1 | 53.9 | 57.0 |
| | P | 16.5 | 17.1 | 16.9 | 18.5 |
| Equal P | R | 63.4 | 64.2 | 67.3 | 61.7 |
| Equal R | P | 30.9 | 22.1 | 22.5 | 20.9 |
For each dataset we report the recall obtained by our predictor at a precision equal to that of the competing method and the precision at equal recall. Results are obtained without including ligand information.
Comparison with the structure-based approach by Tang et al. [14] on the PW dataset.
| Performance % | |||||||
|---|---|---|---|---|---|---|---|
| 192 | 73 | 3.8 | 312 | 36 | 0.9313 | 0.3556 | |
| 28 | 46 | 1.4 | 34 | 34 | 0.9635 | 0.3723 | |
1 Subsampling of negative examples with a ratio of 1:6 with respect to positives.
2 Directly computed.
Results include both performance measures at fixed decision threshold and average areas under ROC and RP curves. Results are obtained without including ligand information.
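Footnote 1 refers to subsampling negative examples at a 1:6 ratio with respect to positives. A minimal sketch of such a step is shown below, assuming uniform random sampling without replacement; the record does not describe how the sampling was actually done, and the function name, seed handling, and toy data are illustrative assumptions.

```python
import random

def subsample_negatives(examples, labels, ratio=6, seed=0):
    """Keep all positives and at most `ratio` negatives per positive.

    labels: 1 for positive (catalytic), anything else for negative.
    Sampling is uniform without replacement (an assumption, not the
    documented procedure). Original ordering of kept items is preserved.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y != 1]
    keep_neg = rng.sample(neg, min(len(neg), ratio * len(pos)))
    keep = sorted(pos + keep_neg)
    return [examples[i] for i in keep], [labels[i] for i in keep]

# Toy data: 2 catalytic residues among 52.
examples = list(range(52))
labels = [1, 1] + [0] * 50
sub_x, sub_y = subsample_negatives(examples, labels)
# sub_y now holds 2 positives and 12 negatives
```

Subsampling of this kind is typically applied to training data only, so that evaluation still reflects the natural class imbalance.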
Figure 3. ROC curves for comparison with Tong et al., 2009. ROC curves superimposed with those reported in [20] on the PW dataset.
Comparison with the best results reported for the POOL structure-based method [20] on their benchmark dataset of 160 proteins.
| Performance % | ||||
|---|---|---|---|---|
| 1. | POOL(T)POOL(G)POOL(C)/allprotein ( | 19.07 | 64.68 | 0.925 |
| 2. | 19.07 | 78.10 | 0.948 | |
| 3. | 26.61 | 64.68 | 0.948 | |
Performance measures include: the average per-protein precision at equal recall, the average recall at equal precision, and the average area under the ROC curve (AUCROC). Results are obtained without including ligand information.
Figure 4. Centroids. A residue 3D representation: point SC is the side-chain centroid, which we used as the residue representative point.
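The side-chain centroid used as the residue representative point can be sketched as below. This is a hedged illustration: it assumes an unweighted mean of side-chain heavy-atom coordinates, identifies backbone atoms by standard PDB atom names, and falls back to the C-alpha atom for residues without a side chain such as glycine. None of these details is confirmed by this record.

```python
# Standard PDB names for backbone atoms (OXT is the C-terminal oxygen).
BACKBONE = {"N", "CA", "C", "O", "OXT"}

def side_chain_centroid(atoms):
    """Unweighted centroid of a residue's side-chain atoms.

    atoms: list of (atom_name, x, y, z) tuples for a single residue.
    Falls back to the CA position when there are no side-chain atoms
    (e.g. glycine); this fallback is an assumption of this sketch.
    """
    side = [(x, y, z) for name, x, y, z in atoms if name not in BACKBONE]
    if not side:
        side = [(x, y, z) for name, x, y, z in atoms if name == "CA"]
    if not side:
        raise ValueError("residue has neither side-chain atoms nor a CA atom")
    n = len(side)
    return tuple(sum(c[k] for c in side) / n for k in range(3))

# Toy serine-like residue with made-up coordinates.
serine = [
    ("N", 0.0, 0.0, 0.0), ("CA", 1.0, 1.0, 1.0),
    ("C", 2.0, 0.0, 0.0), ("O", 3.0, 0.0, 0.0),
    ("CB", 1.0, 2.0, 3.0), ("OG", 3.0, 4.0, 5.0),
]
sc = side_chain_centroid(serine)  # mean of the CB and OG coordinates
```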
Figure 5. Active site of L-arginine:glycine amidinotransferase. The L-arginine:glycine amidinotransferase (1JDW) and its highlighted catalytic pocket.
Figure 6. Structural neighborhood. A residue structural neighborhood.
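One way to picture a residue structural neighborhood is as the set of residues whose representative points fall within a sphere centred on the target residue. The sketch below assumes exactly that; the spherical shape, the `radius` parameter, and the residue identifiers are illustrative assumptions, since this record does not state how the neighborhood is delimited.

```python
import math

def structural_neighborhood(centroids, target, radius):
    """Residues whose representative point lies within `radius` of the target.

    centroids: dict mapping a residue id to its (x, y, z) representative
    point, e.g. a side-chain centroid. The target itself is excluded.
    `radius` is in the same units as the coordinates (angstroms for PDB).
    """
    center = centroids[target]
    return sorted(
        rid for rid, point in centroids.items()
        if rid != target and math.dist(center, point) <= radius
    )

# Toy coordinates; residue ids are hypothetical.
centroids = {
    "GLU988": (0.0, 0.0, 0.0),
    "HIS862": (3.0, 4.0, 0.0),    # 5 units away
    "TYR907": (0.0, 0.0, 7.5),    # 7.5 units away
    "LYS903": (10.0, 10.0, 10.0), # far away
}
near = structural_neighborhood(centroids, "GLU988", radius=8.0)
```

Scalar features such as amino acid composition or atomic density could then be computed over the residues returned for each target.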
Figure 7. The 3D feature vector. An example of feature vector extracted from the three-dimensional neighborhood of the (catalytic) residue GLU 988 in the poly(ADP-ribose) polymerase (1A26).
Figure 8. Frequency of the heterogens in the dataset. Histogram of the frequencies of heterogen molecules in the PW dataset (79 enzymes). Only the heterogens appearing in more than one protein structure are reported.
Figure 9. Heterogen distances from catalytic residues. Histograms of the distances of the most frequent heterogens from catalytic residues in the PW dataset.
Heterogen analysis.
| Class | Heterogens |
|---|---|
| | FE2, MN, CU1, MG, ZN3, ZN, HEM, HEG, HEC, SRM, MPD, MRD, FOK, PLP, P5P, PHS, OWQ, NO3, FS4, SF4, PVL, PYR, SEG, DHZ, FMT, HAD, CIT, ACN, PAC, ACT, 2PE, CNA, U5P, IKT, PGC, PGH, IMU, F6P, IMP, EEB, GLP, FBP, UD1, FCN, AZA, CRB, DHS, BME, ATP, ADP, GSH, FAD, FMN, SAM, AMP, NAD, GDP, GTP, GMP, MHF, NDP, NAG, NRI |
| | K, NA, NI, FE, CA, CL, SAC, FCY, PCA, MES, MAN |
| | PO4, PI, IPS, POP, SO4, SUL, GOL |
Classification of the heterogens into three groups.
Figure 10. Shapes. Two examples of triangular shapes extracted from the HYS303 three-dimensional neighborhood.