Mridul K Kalita, Umesh K Nandal, Ansuman Pattnaik, Anandhan Sivalingam, Gowthaman Ramasamy, Manish Kumar, Gajendra P S Raghava, Dinesh Gupta.
Abstract
Functional annotation of protein sequences with low similarity to well-characterized protein sequences is a major challenge of computational biology in the post-genomic era. The cyclin protein family is one such important family: it consists of sequences with low sequence similarity, which makes discovering novel cyclins and establishing orthologous relationships among cyclins a difficult task. The currently identified cyclin motifs and cyclin-associated domains do not represent all of the identified and characterized cyclin sequences. We describe a Support Vector Machine (SVM) based classifier, CyclinPred, which can predict cyclin sequences with high efficiency. The SVM classifier was trained with features of selected cyclin and non-cyclin protein sequences. The training features include amino acid composition, dipeptide composition, secondary structure composition and PSI-BLAST generated Position Specific Scoring Matrix (PSSM) profiles. Results obtained from leave-one-out cross-validation (jackknife test), self-consistency and holdout tests show that the SVM classifier trained with PSSM profile features was more accurate than classifiers based on any of the other features alone or on hybrids of these features. A cyclin prediction server, CyclinPred, has been set up based on the SVM model trained with PSSM profiles. The prediction results indicate that the method may be used as a cyclin prediction tool, complementing conventional cyclin prediction methods.
Year: 2008 PMID: 18596929 PMCID: PMC2435623 DOI: 10.1371/journal.pone.0002605
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Amino acid composition.
Frequency of each amino acid in the cyclin and non-cyclin protein sequences of the training dataset. The plot reveals differential amino acid propensities between cyclin and non-cyclin sequences, especially for Leu, Gly, Ala, Val, Lys and Asn.
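The amino acid composition (AAC, 20 dimensions) and dipeptide composition (DPC, 400 dimensions) features named in the abstract can be illustrated with a minimal sketch. The function names, the percentage scaling and the handling of non-standard residue symbols are assumptions made for illustration; the record does not specify how the authors computed these vectors.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Return 20 percentages, one per standard amino acid."""
    # Assumption: symbols outside the 20 standard residues are ignored.
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard residues")
    n = len(residues)
    return [100.0 * residues.count(a) / n for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 percentages, one per ordered pair of adjacent residues."""
    # Simplification: non-standard symbols are dropped before pairing.
    cleaned = "".join(r for r in seq.upper() if r in AMINO_ACIDS)
    total = len(cleaned) - 1  # number of overlapping dipeptides
    if total < 1:
        raise ValueError("need at least two standard residues")
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(total):
        counts[cleaned[i:i + 2]] += 1
    return [100.0 * counts[p] / total for p in pairs]
```

Concatenating the two vectors gives a 420-dimensional vector, which matches the dimension listed for the AAC+DPC hybrid in the performance table.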
Performance of SVM classifiers for various combinations of protein sequence features, kernels, parameters and validation methods.
| Model | Feature | Dm | Validation | ACC (%) | SN (%) | SP (%) | MCC | F1 | C | γ |
| Module1 | AAC | 20 | a | | 83.82 | 83.33 | 0.671 | 0.712 | 1.5 | 84 |
| Module2 | DPC | 400 | a | | 85.29 | 84.72 | 0.699 | 0.734 | 30 | 12 |
| Module3 | SSC | 60 | a | | 76.47 | 84.72 | 0.614 | 0.658 | 7 | 10 |
| Module4 | PSSM | 400 | a | | 92.64 | 91.66 | 0.842 | 0.851 | 32.5 | 1 |
| Module4 | PSSM | 400 | b1 | | 97.05 | 98.61 | 0.957 | 0.889 | 47.5 | 0.5 |
| Module4 | PSSM | 400 | b2 | | 100 | 100 | 1 | 1 | 0.5 | 100 |
| Module4 | PSSM | 400 | c1 | | 88.23 | 88.88 | 0.771 | 0.787 | 4 | 9.6 |
| Module4 | PSSM | 400 | c2 | | 91.17 | 97.22 | 0.886 | 0.872 | 19 | 0.5 |
| Hybrid-1 | AAC+DPC | 420 | a | | 82.35 | 80.55 | 0.628 | 0.682 | 880 | 0.1 |
| Hybrid-2 | AAC+SSC | 80 | a | | 89.70 | 81.94 | 0.717 | 0.753 | 10.8 | 10 |
| Hybrid-3 | DPC+SSC | 460 | a | | 79.41 | 83.33 | 0.628 | 0.675 | 10.8 | 10 |
| Hybrid-4 | PSSM+AAC | 420 | a | | 92.64 | 90.27 | 0.828 | 0.84 | 230 | 0.1 |
| Hybrid-5 | PSSM+DPC | 800 | a | | 92.64 | 90.27 | 0.828 | 0.84 | 30 | 1 |
| Hybrid-6 | PSSM+SSC | 460 | a | | 92.64 | 91.66 | 0.842 | 0.851 | 19.2 | 1.5 |
| Hybrid-6 | PSSM+SSC | 460 | b1 | | 97.05 | 100 | 0.971 | 0.904 | 19.2 | 1.5 |
| Hybrid-6 | PSSM+SSC | 460 | b2 | | 100 | 100 | 1 | 1 | 0.5 | 50 |
| Hybrid-6 | PSSM+SSC | 460 | c1 | | 88.23 | 88.88 | 0.771 | 0.787 | 0.5 | 30 |
| Hybrid-6 | PSSM+SSC | 460 | c2 | | 82.35 | 86.11 | 0.685 | 0.753 | 3 | 7 |
| Hybrid-7 | PSSM+DPC+SSC | 860 | a | | 91.17 | 88.88 | 0.80 | 0.815 | 10 | 2 |
| Hybrid-8 | PSSM+DPC+AAC | 820 | a | | 92.64 | 90.27 | 0.828 | 0.84 | 200 | 0.1 |
| Hybrid-8 | PSSM+DPC+AAC | 820 | b1 | | 97.10 | 94.46 | 0.914 | 0.847 | 200 | 0.1 |
| Hybrid-8 | PSSM+DPC+AAC | 820 | b2 | | 100 | 100 | 1 | 1 | 0.5 | 100 |
| Hybrid-8 | PSSM+DPC+AAC | 820 | c1 | | 73.52 | 97.23 | 0.731 | 0.862 | 0.5 | 19 |
| Hybrid-8 | PSSM+DPC+AAC | 820 | c2 | | 97.05 | 86.11 | 0.834 | 0.774 | 0.3 | 4 |
Dm: dimension; a: jackknife (leave-one-out) cross-validation; b1: self-consistency test (mode 1); b2: self-consistency test (mode 2); c1, c2: holdout tests; ACC: accuracy; SN: sensitivity; SP: specificity; MCC: Matthews correlation coefficient; F1: F1 statistic; C: trade-off value; γ: gamma factor (a parameter of the RBF kernel).
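The threshold-dependent measures in the legend follow from the four cells of a confusion matrix. The sketch below uses the standard textbook definitions of sensitivity, specificity, accuracy and MCC. The function name and the example counts are hypothetical: the counts (63 true positives, 5 false negatives, 66 true negatives, 6 false positives) are not given in this record and were only chosen to be consistent with the percentages reported for the PSSM jackknife row. F1 is left out because the legend does not define how the F1 statistic was computed.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard confusion-matrix metrics for a two-class predictor."""
    sn = 100.0 * tp / (tp + fn)                     # sensitivity (%)
    sp = 100.0 * tn / (tn + fp)                     # specificity (%)
    acc = 100.0 * (tp + tn) / (tp + fn + tn + fp)   # accuracy (%)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts, inferred for illustration only.
example = binary_metrics(tp=63, fn=5, tn=66, fp=6)
```

With these assumed counts the sensitivity, specificity and MCC come out near 92.65%, 91.67% and 0.843, close to the 92.64, 91.66 and 0.842 listed for Module4 under validation a.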
Figure 2. Threshold-independent performance of SVMs.
(A) ROC plot of SVMs based on different protein sequence features, depicting the relative trade-off between true positives and false positives. The diagonal line (line of no discrimination) represents a completely random guess. The closer a point lies to the upper left corner of the ROC space, the better the prediction, since that corner represents 100% sensitivity (all true positives found) and 100% specificity (no false positives). The PSSM-based SVM models (standalone as well as hybrids) perform similarly, with AUC above 95%. (B) Area under the curve (AUC) obtained from the ROC plot. All SVM models use the RBF kernel unless otherwise mentioned.
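As an illustration of the threshold-independent measure in panel B, the area under the ROC curve can be computed directly from classifier scores as the fraction of (positive, negative) pairs ranked correctly, counting ties as half. This is a generic sketch of that rank-based definition, not the procedure used to produce the figure; the function name and the toy scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # pair ranked correctly
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy decision values (hypothetical): 1 = cyclin, 0 = non-cyclin.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

A value of 0.5 corresponds to the diagonal line of no discrimination, and 1.0 to a perfect ranking in the upper left corner.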
Estimation of quality for best SVM model for each feature or combinations of features (hybrid models) as compared to random prediction (S).
| Model | Feature | Correct (TP+TN) | S (%) |
| Module1 | AAC | 117 | 67.12 |
| Module2 | DPC | 119 | 69.98 |
| Module3 | SSC | 113 | 61.31 |
| Module4 | PSSM | 129 | 84.27 |
| Hybrid-1 | AAC+DPC | 114 | 62.85 |
| Hybrid-2 | AAC+SSC | 120 | 71.47 |
| Hybrid-3 | DPC+SSC | 114 | 62.79 |
| Hybrid-4 | PSSM+AAC | 128 | 82.85 |
| Hybrid-5 | PSSM+DPC | 128 | 82.85 |
| Hybrid-6 | PSSM+SSC | 129 | 84.27 |
| Hybrid-7 | PSSM+DPC+SSC | 126 | 80.00 |
| Hybrid-8 | PSSM+DPC+AAC | 128 | 82.85 |
TP: true positive, TN: true negative.
S: normalized percentage of correct predictions relative to random prediction.
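This record does not state the formula for S. One common definition, assumed here, is S = (C − R) / (T − R) × 100, where C is the number of correct predictions, T the total number of sequences and R the number of correct predictions expected by chance given the observed and predicted class totals. The sketch below implements that assumed definition; the function name and the confusion-matrix counts are hypothetical and were inferred only to be consistent with the AAC and PSSM rows.

```python
def better_than_random(tp, fn, tn, fp):
    """Normalized percentage of correct predictions above chance (assumed definition)."""
    total = tp + fn + tn + fp
    correct = tp + tn
    # Chance agreement: for each class, predicted total x observed total / T.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total
    return 100.0 * (correct - expected) / (total - expected)


# Hypothetical counts consistent with 117 (AAC) and 129 (PSSM) correct predictions.
s_aac = better_than_random(tp=57, fn=11, tn=60, fp=12)
s_pssm = better_than_random(tp=63, fn=5, tn=66, fp=6)
```

Under these assumptions the function returns about 67.13 and 84.28, close to the tabulated 67.12 and 84.27, which suggests but does not confirm that this is the definition behind the S column.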
Figure 3. Component loadings for the first two Principal Components (PC).
(A, C, D) Superimposed plots of the component loadings of the features used (AAC, DPC and PSSM) and the training dataset from PCA, showing the variability in feature usage and the degree to which the original variables contribute to the PCs. The plots show the correlations between amino acids through their loading scores, as well as their relative abundance in cyclins and non-cyclins, for each PC analyzed. Green, red and blue spots represent cyclins, non-cyclins and component loadings of the features used, respectively. (B) PC weight plot of each of the 20 amino acids for the first three PCs of the AAC model. The plot shows the discriminative contribution of each amino acid to specific PCs through its loading score.
Figure 4. CyclinPred server.
(A) Snapshot of the CyclinPred server. (B) Sample prediction result.