Manish Kumar, Manoj Bhasin, Navjot K Natt, G P S Raghava.
Abstract
This paper describes a method for predicting a supersecondary structural motif, the beta-hairpin, in a protein sequence. The method was trained and tested on a set of 5102 hairpins and 5131 non-hairpins, obtained from a non-redundant dataset of 2880 proteins using the DSSP and PROMOTIF programs. Two machine-learning techniques, an artificial neural network (ANN) and a support vector machine (SVM), were used to predict beta-hairpins. An accuracy of 65.5% was achieved using the ANN when the amino acid sequence was used as the input. The accuracy improved from 65.5 to 69.1% when evolutionary information (PSI-BLAST profile), observed secondary structure and surface accessibility were used as the inputs. The accuracy further improved from 69.1 to 79.2% when the SVM was used for classification instead of the ANN. The performance of the methods was assessed in a test case, where predicted secondary structure and surface accessibility were used instead of the observed structure. The highest accuracy achieved by the SVM-based method in the test case was 77.9%. A maximum accuracy of 71.1%, with a Matthews correlation coefficient of 0.41, was obtained in the test case on a dataset previously used by X. Cruz, E. G. Hutchinson, A. Shephard and J. M. Thornton (2002) Proc. Natl Acad. Sci. USA, 99, 11157-11162. The performance of the method was also evaluated on proteins used in the '6th community-wide experiment on the critical assessment of techniques for protein structure prediction (CASP6)'. Based on the algorithm described, a web server, BhairPred (http://www.imtech.res.in/raghava/bhairpred/), has been developed, which can be used to predict beta-hairpins in a protein using the SVM approach.
Year: 2005 PMID: 15988830 PMCID: PMC1160264 DOI: 10.1093/nar/gki588
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Prediction results with the 2880 protein dataset using the ANN
| Approach | Coverage (%), hairpin | Coverage (%), non-hairpin | Probability (%), hairpin | Probability (%), non-hairpin | Accuracy (%) | MCC |
|---|---|---|---|---|---|---|
| AA | 58.4 | 78.5 | 68.1 | 63.7 | 65.5 | 0.31 |
| MSA | 66.7 | 67.3 | 66.7 | 53.4 | 67.0 | 0.34 |
| AA + ACC_O | 60.9 | 72.0 | 68.5 | 65.0 | 66.4 | 0.33 |
| AA + SS_O | 67.9 | 74.4 | 72.8 | 70.1 | 71.2 | 0.43 |
| Seq–Str network (SS_O) | 58.5 | 80.7 | 75.1 | 66.2 | 69.6 | 0.40 |
| Seq–Str network (SS_P) | 56.5 | 77.5 | 71.5 | 64.2 | 67.1 | 0.37 |
AA: amino acid sequence; MSA: multiple sequence alignment; ACC_O: observed accessibility (DSSP); SS_O: secondary structure observed (DSSP); seq: sequence; str: structure; SS_P: secondary structure predicted (PSIPRED).
The performance of our SVM based modules on 2880 proteins using 5-fold cross-validation
| Approach | Coverage (%), hairpin | Coverage (%), non-hairpin | Probability (%), hairpin | Probability (%), non-hairpin | Accuracy (%) | MCC |
|---|---|---|---|---|---|---|
| AA | 63.7 | 72.4 | 69.7 | 66.7 | 68.1 | 0.36 |
| MSA (1) | 77.3 | 72.4 | 73.7 | 76.3 | 74.9 | 0.49 |
| AA + ACC_O (2) | 69.1 | 70.6 | 70 | 69.7 | 69.9 | 0.39 |
| AA + SS_O (3) | 67.9 | 80.4 | 77.5 | 71.6 | 74.2 | 0.49 |
| Hybrid (1 + 2 + 3) | 82.6 | 75.7 | 77.2 | 81.4 | 79.2 | 0.59 |
| AA + ACC_P (4) | 64.1 | 71.9 | 69.5 | 66.9 | 68.0 | 0.36 |
| AA + SS_P (5) | 68.9 | 72.3 | 71.2 | 70.0 | 70.6 | 0.41 |
| Hybrid (1 + 4 + 5) | 76.2 | 79.6 | 78.8 | 77.1 | 77.9 | 0.56 |
AA: amino acid sequence; MSA: multiple sequence alignment; ACC_O: observed accessibility (DSSP); SS_O: secondary structure observed (DSSP); SS_P: secondary structure predicted (PSIPRED); ACC_P: predicted accessibility.
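Once the class sizes are fixed, the coverage, probability, accuracy and MCC columns constrain one another. The sketch below is a consistency check on the table, not the authors' evaluation code. It assumes that coverage is the per-class recall, that probability is the per-class precision, that the first value of each pair refers to hairpins, and that the 5102 hairpins and 5131 non-hairpins quoted in the abstract are the evaluation counts. The function name `metrics_from_coverage` is hypothetical.

```python
import math

N_HAIRPIN = 5102      # hairpins in the dataset (from the abstract)
N_NON_HAIRPIN = 5131  # non-hairpins in the dataset (from the abstract)


def metrics_from_coverage(cov_hairpin, cov_non_hairpin,
                          n_pos=N_HAIRPIN, n_neg=N_NON_HAIRPIN):
    """Derive probability (%), accuracy (%) and MCC from per-class coverage (%).

    Assumption (not quoted from the paper): coverage is the recall of each
    class and probability is the precision of each class.
    """
    tp = cov_hairpin / 100 * n_pos       # hairpins predicted as hairpins
    fn = n_pos - tp                      # hairpins predicted as non-hairpins
    tn = cov_non_hairpin / 100 * n_neg   # non-hairpins predicted as non-hairpins
    fp = n_neg - tn                      # non-hairpins predicted as hairpins

    prob_hairpin = 100 * tp / (tp + fp)
    prob_non_hairpin = 100 * tn / (tn + fn)
    accuracy = 100 * (tp + tn) / (n_pos + n_neg)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return prob_hairpin, prob_non_hairpin, accuracy, mcc


# SVM "AA" row: coverage 63.7 (hairpin) and 72.4 (non-hairpin)
prob_h, prob_n, acc, mcc = metrics_from_coverage(63.7, 72.4)
print(f"probability: {prob_h:.1f} / {prob_n:.1f}, accuracy: {acc:.1f}, MCC: {mcc:.2f}")
```

Under these assumptions the derived values for the AA row agree with the tabulated 69.7, 66.7, 68.1 and 0.36 to within rounding, which supports reading each pair of values as hairpin first and non-hairpin second.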
Performance of the consensus and combined approaches on the 2880 protein dataset
| Consensus sensitivity (%) | Consensus specificity (%) | Threshold | Combined sensitivity (%) | Combined specificity (%) |
|---|---|---|---|---|
| 76.1 | 79.6 | 0.1 | 98.9 | 10.6 |
| 75.1 | 79.8 | 0.2 | 94.8 | 27.5 |
| 72.2 | 80.7 | 0.3 | 89.3 | 43.4 |
| 67.6 | 82.2 | 0.4 | 84.9 | 55.7 |
| 61.5 | 84.6 | 0.5 | 81.6 | 64.7 |
| 53.1 | 87.4 | 0.6 | 79 | 71.9 |
| 41.3 | 91.2 | 0.7 | 77.4 | 76.4 |
| 25.7 | 95.7 | 0.8 | 76.4 | 78.9 |
| 5.8 | 99.8 | 0.9 | 76.2 | 79.5 |
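One way to read this table is to ask which threshold best balances sensitivity against specificity for each approach. The sketch below uses Youden's J statistic (sensitivity + specificity - 100) for that purpose. J is a swapped-in summary chosen here for illustration and is not a criterion taken from the paper; the helper name `best_threshold` is hypothetical. The rows are copied from the table above.

```python
# (threshold, consensus sens, consensus spec, combined sens, combined spec), all in %
ROWS = [
    (0.1, 76.1, 79.6, 98.9, 10.6),
    (0.2, 75.1, 79.8, 94.8, 27.5),
    (0.3, 72.2, 80.7, 89.3, 43.4),
    (0.4, 67.6, 82.2, 84.9, 55.7),
    (0.5, 61.5, 84.6, 81.6, 64.7),
    (0.6, 53.1, 87.4, 79.0, 71.9),
    (0.7, 41.3, 91.2, 77.4, 76.4),
    (0.8, 25.7, 95.7, 76.4, 78.9),
    (0.9, 5.8, 99.8, 76.2, 79.5),
]


def best_threshold(rows, sens_idx, spec_idx):
    """Return (threshold, J) for the row with the largest Youden's J.

    J = sensitivity + specificity - 100, with both values in percent.
    This is an illustrative criterion, not one used in the source.
    """
    best = max(rows, key=lambda r: r[sens_idx] + r[spec_idx])
    return best[0], best[sens_idx] + best[spec_idx] - 100


consensus = best_threshold(ROWS, 1, 2)
combined = best_threshold(ROWS, 3, 4)
print("consensus:", consensus)
print("combined:", combined)
```

By this criterion the consensus approach is best balanced at the lowest threshold (0.1) and the combined approach at the highest (0.9), with J close to 55.7 in both cases.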
Performance of BhairPred in predicting ECE patterns as hairpins or non-hairpins
| CASP6 categories | No. of proteins | No. of ECE patterns | No. of discarded ECE patterns | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|
| ALL | 63 | 201 | 21 | 47 | 85 | 17 | 31 |
| NF | 4 | 7 | 0 | 0 | 4 | 2 | 1 |
| FR(A) | 6 | 12 | 1 | 4 | 4 | 1 | 2 |
| FR(H) | 10 | 23 | 1 | 6 | 6 | 4 | 6 |
| CM | 20 | 55 | 8 | 11 | 24 | 5 | 7 |
ALL: number of target proteins in CASP6; NF: new fold; FR(A): fold recognition (analogous); FR(H): fold recognition (homologous); CM: comparative modeling; TP: true positives; TN: true negatives; FP: false positives; FN: false negatives.
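The TP, TN, FP and FN counts above are enough to derive summary metrics for each CASP6 category. The sketch below applies the standard confusion-matrix definitions of sensitivity, specificity, accuracy and MCC to the tabulated counts. These derived values are not reported in the table itself, and the function name `confusion_metrics` is hypothetical. The sketch also checks that the four counts add up to the number of ECE patterns minus the discarded ones.

```python
import math

# category: (ECE patterns, discarded, TP, TN, FP, FN), copied from the table
CASP6 = {
    "ALL":   (201, 21, 47, 85, 17, 31),
    "NF":    (7,   0,  0,  4,  2,  1),
    "FR(A)": (12,  1,  4,  4,  1,  2),
    "FR(H)": (23,  1,  6,  6,  4,  6),
    "CM":    (55,  8,  11, 24, 5,  7),
}


def confusion_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # MCC is undefined when a marginal total is zero; report 0.0 then
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


results = {}
for category, (patterns, discarded, tp, tn, fp, fn) in CASP6.items():
    # Every retained ECE pattern should fall into exactly one of the four cells
    assert patterns - discarded == tp + tn + fp + fn, category
    results[category] = confusion_metrics(tp, tn, fp, fn)
    m = results[category]
    print(f"{category:6s} acc={m['accuracy']:.2f} sens={m['sensitivity']:.2f} "
          f"spec={m['specificity']:.2f} mcc={m['mcc']:.2f}")
```

For the ALL row (180 retained patterns) this gives an accuracy of about 73% and an MCC of about 0.45. The per-category values rest on very few patterns, for example 7 for NF, so they should be read with caution.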
Performance of BhairPred on hairpins assigned by PROMOTIF in CASP6 proteins
| CASP6 categories | No. of proteins | No. of hairpins (PROMOTIF) | Exact matching ECE | Non-exact matching ECE | Non-exact at all |
|---|---|---|---|---|---|
| ALL | 63 | 159 | 27 (22) | 51 (25) | 61 |
| NF | 4 | 9 | 0 | 1 (1) | 7 |
| FR(A) | 6 | 9 | 2 (2) | 4 (2) | 3 |
| FR(H) | 10 | 20 | 4 (3) | 8 (3) | 6 |
| CM | 20 | 46 | 5 (4) | 13 (7) | 20 |
Numbers in parentheses are hairpins correctly predicted by BhairPred.
ALL: number of target proteins in CASP6; NF: new fold; FR(A): fold recognition (analogous); FR(H): fold recognition (homologous); CM: comparative modeling.