Kshitiz Gupta, Vivek Sehgal, Andre Levchenko.
Abstract
BACKGROUND: Predicting protein function from structure, and structure from function, is only a partially solved problem that has largely been addressed within biophysics and biochemistry. This underscores the need for computational and bioinformatics approaches. A large body of organized, latent knowledge on protein classification exists in independently created protein classification databases. By creating probabilistic maps between the classes of structural classification databases (e.g. SCOP) and the classes of functional classification databases (e.g. PROSITE), the structure and function of proteins can be probabilistically related.
Year: 2008 PMID: 18834528 PMCID: PMC2573881 DOI: 10.1186/1472-6807-8-40
Source DB: PubMed Journal: BMC Struct Biol ISSN: 1472-6807
Figure 1. SVM training of SCOP and PROSITE. The 5751 protein instances common to both PROSITE and SCOP were used to train the respective SVMs. The 30 most populated classes in PROSITE and the 37 most populated classes in SCOP were used. A randomly chosen half of the 5751 proteins was used to train the SVM classifier for PROSITE, and the other half to train the classifier for SCOP. Blocks, elastic 2-D profile, molecular mass, size, percentage of helices, and β chain were used as orthogonal features for one-vs-rest SVM training for each class.
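The random half split and the one-vs-rest labelling described in the caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names (`split_in_half`, `one_vs_rest_labels`), the protein IDs, and the class assignments are all hypothetical, and the SVM training step itself is left out.

```python
import random

def split_in_half(protein_ids, seed=0):
    """Shuffle the IDs and return two disjoint halves."""
    ids = list(protein_ids)
    random.Random(seed).shuffle(ids)
    mid = len(ids) // 2
    return ids[:mid], ids[mid:]

def one_vs_rest_labels(class_of, target_class):
    """Label each protein +1 if it belongs to target_class, otherwise -1."""
    return {pid: (1 if cls == target_class else -1)
            for pid, cls in class_of.items()}

# Toy data: hypothetical protein IDs and class names, for illustration only.
toy_classes = {"p1": "Globins", "p2": "Trypsin", "p3": "Globins", "p4": "Kinase"}

# One half would train the PROSITE classifier, the other the SCOP classifier.
prosite_half, scop_half = split_in_half(toy_classes)

# One binary labelling per class; each would feed a separate binary SVM.
globin_labels = one_vs_rest_labels(toy_classes, "Globins")
```

One such binary labelling is built per class, so 30 binary problems would be set up for PROSITE and 37 for SCOP.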
Comparison of F-measure using blocks and k-length subsequences as features
| Class Name | k-length subsequences (k = 4) | Blocks |
| Small proteins | 0.78 | 0.73 |
| Globin like | 0.71 | 0.684 |
| NAD P binding Rossmann fold domains | 0.54 | 0.537 |
Comparison of F-measure using overlapping k-length subsequences (k = 4) and blocks as features. With blocks as features, the F-measure was only slightly lower for all classes, while the feature set was many times smaller.
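To make the k-length feature concrete, the sketch below counts every overlapping 4-residue subsequence of a sequence. The function name `kmer_counts` and the example sequence are hypothetical, and the source does not say how the authors encoded these counts for the SVM.

```python
from collections import Counter

def kmer_counts(sequence, k=4):
    """Count every overlapping length-k subsequence of a protein sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Hypothetical 9-residue sequence, for illustration only.
counts = kmer_counts("MKVLAMKVL")

# Number of distinct 4-mers over the 20 standard amino acids.
possible_4mers = 20 ** 4
```

With 20 standard amino acids there are 160,000 possible 4-mers, which gives a sense of why a block-based feature set can be many times smaller.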
Performance evaluation for some functional classes
| Class Name | FP Rate | Precision | Recall | F-measure |
| Immunoglobulins and MHC protein | 0.002 | 0.765 | 0.619 | |
| Ig-like domain profile | 0.005 | 0.547 | 0.343 | |
| Cytochrome c family, heme-binding site signature | 0 | 0.988 | 0.908 | |
| Globins family | 0.002 | 0.792 | 0.623 | |
| Pyridine nucleotide-disulphide oxidoreductases c-I | 0.001 | 0.704 | 0.559 | |
| Serine protease trypsin family | 0.001 | 0.918 | 0.789 | |
| ATP dependent helicases signatures | 0.002 | 0.52 | 0.419 | |
| Phospholipase A2 active site signature | 0 | 1 | 0.862 | |
| Nuclear hormones receptors DNA-binding | 0.003 | 0.667 | 0.561 | |
Performance evaluation for classifying functional classes using structural classes as features. The high F-measure values (Equation 1) indicate that a protein's function depends strongly on its structure. The F-measure was calculated as the harmonic mean of recall and precision. Class name refers to a domain in the PROSITE database. Highly significant F-measure values are shown in bold.
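The harmonic mean mentioned in the caption can be computed directly from a row's precision and recall. The sketch below uses the Phospholipase A2 row as input; the function name `f_measure` is an assumption, and the result is a recomputation for illustration, not a figure reported in the source.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall taken from the Phospholipase A2 row of the table.
f_pla2 = f_measure(1.0, 0.862)
```

Because the harmonic mean is dominated by the smaller of the two values, a class with high precision but low recall still receives a modest F-measure.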
Performance evaluation for some structural classes
| Class-Id | FP Rate | Precision | Recall | F-measure |
| | 0.008 | 0.25 | 0.259 | 0.254 |
| Immunoglobulin | 0.018 | 0.296 | 0.282 | 0.289 |
| | 0.005 | 0.649 | 0.61 | |
| Small proteins | 0.063 | 0.245 | 0.216 | 0.213 |
| Peptides | 0.017 | 0.242 | 0.18 | 0.207 |
Performance evaluation for classifying structural classes using functional classes as features. The low F-measure values indicate that, although structure depends on the function of the protein, the present features are incapable of distinguishing the classes completely. The F-measure was calculated as the harmonic mean of recall and precision. The table lists the classes that showed significant F-measure values. Class name refers to a superfamily in the SCOP database. Highly significant F-measure values are shown in bold.
Figure 2. Cross-training flow. Datasets are generated by cross-training, in which taxonomy A (or B) takes the classes of taxonomy B (or A) as features. In effect, the classifier for PROSITE is trained using the classes of SCOP as features, and vice versa. SVM classifiers were created for both PROSITE and SCOP (Figure 1). The classes of PROSITE were used as features for SCOP and the protein feature vector was updated. Similarly, the classes of SCOP were used as features for the PROSITE classifier and the protein feature vector was updated. Cross-training was iterated until accuracy stopped improving.
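The loop in this caption can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the source uses SVMs, whereas `majority_fit_predict` below is a hypothetical lookup-table stand-in; accuracy is measured on the training rows themselves, whereas a real evaluation would use held-out proteins; and all names and toy labels are invented.

```python
from collections import Counter, defaultdict

def majority_fit_predict(rows, labels):
    """Stand-in learner (assumption, not an SVM): predict the most common
    label observed for each distinct feature row."""
    votes = defaultdict(Counter)
    for row, label in zip(rows, labels):
        votes[tuple(row)][label] += 1
    return [votes[tuple(row)].most_common(1)[0][0] for row in rows]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def cross_train(rows, labels_a, labels_b, fit_predict, max_rounds=10):
    """Alternately feed each taxonomy's predicted classes to the other as an
    extra feature, stopping when combined accuracy no longer improves."""
    pred_a = fit_predict(rows, labels_a)
    pred_b = fit_predict(rows, labels_b)
    best = accuracy(pred_a, labels_a) + accuracy(pred_b, labels_b)
    rounds = 0
    for _ in range(max_rounds):
        # Append the other taxonomy's predicted class to each feature row.
        rows_a = [list(r) + [b] for r, b in zip(rows, pred_b)]
        rows_b = [list(r) + [a] for r, a in zip(rows, pred_a)]
        new_a = fit_predict(rows_a, labels_a)
        new_b = fit_predict(rows_b, labels_b)
        total = accuracy(new_a, labels_a) + accuracy(new_b, labels_b)
        if total <= best:
            break  # no further gain in accuracy
        pred_a, pred_b, best = new_a, new_b, total
        rounds += 1
    return pred_a, pred_b, rounds

# Toy data: one hypothetical feature per protein and invented class labels.
toy_rows = [[0], [0], [1], [1]]
toy_prosite = ["globin", "globin", "trypsin", "trypsin"]
toy_scop = ["s1", "s1", "s2", "s1"]
pred_prosite, pred_scop, n_rounds = cross_train(
    toy_rows, toy_prosite, toy_scop, majority_fit_predict)
```

The stand-in learner is deliberately trivial, so on this toy data the extra feature adds no information and the loop stops without accepting a round. A gain would be expected only with a learner, such as a linear SVM, that cannot already extract the other taxonomy's class from the base features.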
Prediction rules between classes in SCOP and PROSITE obtained by cross-training
| All |
| All |
| All |
| All |
| All |
| All |
| All |
| All |
| All |
| All |
| Membrane & cell surface proteins and peptides → C-type lectin domain signature & profile |
| All |
| All |
| Small proteins → Serine proteases trypsin family signatures & profile |
| Small proteins → Ig-like domain profile |
| Globins family profile → All |
| Serine proteases trypsin family signatures & profile → Small proteins |
| Protein kinases signatures & profile → Small proteins |
| Legume lectins signatures → All |
| Cytochrome c family heme-binding site signature → All |
Selected prediction rules found as a by-product of cross-training SCOP and PROSITE. The top panel shows rules that probabilistically predict the PROSITE domain or signature given the SCOP superfamily, while the bottom panel shows rules that predict the SCOP superfamily given the PROSITE domain. A dot ('.') links a parent class to a child class in the SCOP hierarchy. The bold value following each rule is the probabilistic weighted score found using cross-training.
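As a rough illustration of what a probabilistic rule between two classifications looks like, the sketch below estimates P(target class | source class) from co-occurrence counts. This is a simpler stand-in, not the weighted score the authors derive from cross-training. The function name `conditional_rules`, the `min_prob` threshold, and the toy pairs are all hypothetical.

```python
from collections import Counter

def conditional_rules(pairs, min_prob=0.5):
    """Estimate P(target | source) from (source, target) class pairs and
    keep the rules whose estimate reaches min_prob."""
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    rules = {}
    for (src, tgt), n in pair_counts.items():
        prob = n / source_counts[src]
        if prob >= min_prob:
            rules[(src, tgt)] = prob
    return rules

# Toy (SCOP class, PROSITE class) pairs; the counts are invented.
toy_pairs = [
    ("Small proteins", "Ig-like domain profile"),
    ("Small proteins", "Ig-like domain profile"),
    ("Small proteins", "Serine proteases trypsin family"),
    ("Globin-like", "Globins family profile"),
]
rules = conditional_rules(toy_pairs)
```

Swapping the order of each pair gives rules in the opposite direction, mirroring the two panels of the table.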