Natalia V Petrova, Cathy H Wu.
Abstract
BACKGROUND: The number of protein sequences deriving from genome sequencing projects is outpacing our knowledge about the function of these proteins. With the gap between experimentally characterized and uncharacterized proteins continuing to widen, it is necessary to develop new computational methods and tools for functional prediction. Knowledge of catalytic sites provides valuable insight into protein function. Although many computational methods have been developed to predict catalytic residues and active sites, their accuracy remains low, with a significant number of false positives. In this paper, we present a novel method for the prediction of catalytic sites, using a carefully selected, supervised machine learning algorithm coupled with an optimal discriminative set of protein sequence conservation and structural properties.
Year: 2006 PMID: 16790052 PMCID: PMC1534064 DOI: 10.1186/1471-2105-7-312
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The performance of 26 machine learning algorithms for the prediction of catalytic residues as measured by the Matthews correlation coefficient (MCC) in 10-fold cross-validation analysis.
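For reference, the Matthews correlation coefficient named in this caption is conventionally defined from the four confusion-matrix counts. The symbols TP, TN, FP and FN (true/false positives/negatives) are the standard ones and are an addition here, not notation taken from this record.

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
                    {\sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)}}
```

The value ranges from -1 to +1, with 0 corresponding to chance-level prediction, which makes it more informative than raw accuracy when catalytic residues are a small minority class.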
Figure 2. The predictive accuracy of the SMO algorithm based on individual residue properties in comparison with 24 combined attributes.
Performance of the SMO classifier when each individual residue property is removed in turn from the optimal 7-attribute set, in 10-fold cross-validation analysis
| 1. conservation_score | 0.526 | 0.536 | 0.483 | 0.515 | 76.17 | 76.52 | 73.97 | 75.55 |
| 2. AA_identity | 0.668 | 0.679 | 0.660 | 0.669 | 83.40 | 83.95 | 82.97 | 83.44 |
| 3. nearest_cleft_distance | 0.708 | 0.746 | 0.707 | 0.720 | 85.35 | 87.28 | 85.32 | 85.98 |
| 4. distance_to_3_largest_clefts | 0.724 | 0.757 | 0.726 | 0.736 | 86.13 | 87.87 | 86.30 | 86.77 |
| 5. HB_main_chain_protein | 0.725 | 0.746 | 0.738 | 0.736 | 86.13 | 87.28 | 86.89 | 86.77 |
| 6. nearest_cleft_rank | 0.740 | 0.746 | 0.730 | 0.739 | 86.91 | 87.28 | 86.50 | 86.90 |
| 7. nearest_cleft_SA_area | 0.736 | 0.746 | 0.738 | 0.740 | 86.72 | 87.28 | 86.89 | 86.96 |
| all attributes (24) | ||||||||
| selected attributes (7) | ||||||||
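One way to read the table above is as a leave-one-attribute-out ablation. The sketch below is illustrative only: the header row is missing from this record, so treating the first four numeric cells of each row as a score (higher is better) obtained with the named attribute removed is an assumption. The code checks that the fourth cell is the mean of the first three, then ranks attributes by how low the score falls without them.

```python
# Numeric cells copied from the first four value columns of the table above.
# ASSUMPTION: these are scores (higher is better) measured with the named
# attribute removed; the column headers are missing from this record.
rows = {
    "conservation_score": (0.526, 0.536, 0.483, 0.515),
    "AA_identity": (0.668, 0.679, 0.660, 0.669),
    "nearest_cleft_distance": (0.708, 0.746, 0.707, 0.720),
    "distance_to_3_largest_clefts": (0.724, 0.757, 0.726, 0.736),
    "HB_main_chain_protein": (0.725, 0.746, 0.738, 0.736),
    "nearest_cleft_rank": (0.740, 0.746, 0.730, 0.739),
    "nearest_cleft_SA_area": (0.736, 0.746, 0.738, 0.740),
}

# The fourth value should equal the mean of the first three, up to rounding.
for name, (a, b, c, reported) in rows.items():
    assert abs((a + b + c) / 3 - reported) < 1e-3, name

# A lower remaining score after removal suggests a more informative attribute.
ranking = sorted(rows, key=lambda name: rows[name][3])
for name in ranking:
    print(f"{rows[name][3]:.3f}  {name}")
```

Under this reading, removing conservation_score costs the most and removing nearest_cleft_SA_area the least, which matches the order in which the rows are listed.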
Figure 3. The learning curve of the SMO algorithm with the 7-attribute set in 10-fold cross-validation analysis using (A) a balanced dataset or (B) the whole benchmarking dataset as a test set.
The properties and performance of two test datasets: a balanced dataset and the whole benchmarking dataset
| 254 | 254 | ||||
| 1:1 | 1:1 | 1:1 | 1:1 | ||
| 254 | 254 | ||||
| 1:1 | 1:1 | 1:92 | 1:92 | ||
| 86.38 | 87.42 | 86.68 | 86.96 | ||
| 0.88 | 0.89 | 0.90 | 0.90 | ||
| 0.15 | 0.14 | 0.13 | 0.13 | ||
| 0.73 | 0.75 | 0.23 | 0.23 | ||
* Number of catalytic residues in each fold in 10-fold cross-validation analysis
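The table above shows one score dropping from about 0.73 to 0.23 when the class ratio changes from 1:1 to 1:92, while the other values barely move. The row labels are missing from this record, so the following is a sketch under a stated assumption: the 0.88/0.90 row is read as a true-positive rate, the 0.15/0.13 row as a false-positive rate, and the 0.73/0.23 row as the MCC. The helper names are hypothetical.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mcc_from_rates(tpr, fpr, positives, negatives):
    """MCC implied by a true-positive rate, a false-positive rate and class sizes."""
    tp = tpr * positives
    fn = positives - tp
    fp = fpr * negatives
    tn = negatives - fp
    return mcc(tp, tn, fp, fn)

# ASSUMPTION: rates taken from the first and third value columns of the table.
# MCC depends only on the class ratio, so the classes are scaled to 1:1 and 1:92.
balanced = mcc_from_rates(0.88, 0.15, positives=1, negatives=1)
imbalanced = mcc_from_rates(0.90, 0.13, positives=1, negatives=92)
print(f"{balanced:.2f} {imbalanced:.2f}")  # prints "0.73 0.23"
```

Under this reading the computed values agree with the tabulated ones. With 92 non-catalytic residues per catalytic residue, even a modest false-positive rate produces many more false positives than true positives, which lowers the MCC although the per-class rates are essentially unchanged.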
Figure 4. Method overview.
The initial set of 24 residue properties
| 1. | AA_identity | PDB database [41, |
| 2. | AA_type | [23] |
| 3. | entropy | 9-Component Dirichlet Mixture algorithm [30] |
| 4. | relative_entropy | |
| 5. | conservation_score | Scorecons server [32, |
| 6. | B_factor | PDB database [41, |
| 7. | SAS_all_atoms_ABS | Naccess program [34] |
| 8. | SAS_all_atoms_REL | |
| 9. | SAS_total_side_ABS | |
| 10. | SAS_total_side_REL | |
| 11. | SAS_main_chain_ABS | |
| 12. | SAS_main_chain_REL | |
| 13. | SAS_non_polar_ABS | |
| 14. | SAS_non_polar_REL | |
| 15. | SAS_all_polar_ABS | |
| 16. | SAS_all_polar_REL | |
| 17. | nearest_cleft_rank | CASTp server [36, |
| 18. | nearest_cleft_SA_volume | PDB database [41, |
| 19. | nearest_cleft_SA_area | |
| 20. | nearest_cleft_distance | |
| 21. | distance_to_3_largest_clefts | |
| 22. | HB_main_chain_protein | MolMol Program [37] |
| 23. | HB_side_chain_protein | |
| 24. | 2D_structure | DSSP program [38] |