Abstract
In this study a systematic attempt has been made to integrate various approaches in order to predict allergenic proteins with high accuracy. The dataset used for testing and training consists of 578 allergens and 700 non-allergens obtained from A. K. Bjorklund, D. Soeria-Atmadja, A. Zorzet, U. Hammerling and M. G. Gustafsson (2005) Bioinformatics, 21, 39-50. First, we developed methods based on support vector machines (SVM) using amino acid and dipeptide composition and achieved accuracies of 85.02% and 84.00%, respectively. Second, a motif-based method was developed using MEME/MAST software that achieved a sensitivity of 93.94% with 33.34% specificity. Third, a database of known IgE epitopes was searched, and this predicted allergenic proteins with 17.47% sensitivity at a specificity of 98.14%. Fourth, we predicted allergenic proteins by performing a BLAST search against allergen representative peptides (ARPs). Finally, hybrid approaches were developed that combine two or more of these approaches. The performance of all these algorithms was evaluated on an independent dataset of 323 allergens and on 101 725 non-allergens obtained from Swiss-Prot. A web server, AlgPred, has been developed for predicting allergenic proteins and for mapping IgE epitopes on allergenic proteins (http://www.imtech.res.in/raghava/algpred/).
Year: 2006 PMID: 16844994 PMCID: PMC1538830 DOI: 10.1093/nar/gkl343
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Performance of SVM-based method using amino acid composition
| Threshold | Sensitivity | Specificity | Accuracy | PPV | NPV | MCC |
|---|---|---|---|---|---|---|
| 1.0 | 0.3374 | 0.9829 | 0.6918 | 0.9417 | 0.6442 | 0.4336 |
| 0.8 | 0.4243 | 0.9700 | 0.7239 | 0.9208 | 0.6729 | 0.4843 |
| 0.6 | 0.5200 | 0.9586 | 0.7608 | 0.9116 | 0.7093 | 0.5456 |
| 0.4 | 0.5878 | 0.9386 | 0.7804 | 0.8871 | 0.7357 | 0.5732 |
| 0.2 | 0.6539 | 0.9243 | 0.8024 | 0.8765 | 0.7657 | 0.6099 |
| 0.0 | 0.7357 | 0.8929 | 0.8220 | 0.8494 | 0.8054 | 0.6422 |
| −0.2 | 0.8383 | 0.8543 | 0.8471 | 0.8253 | 0.8667 | 0.6930 |
| −0.4 | | | | | | |
| −0.6 | 0.9061 | 0.7614 | 0.8267 | 0.7573 | 0.9096 | 0.6680 |
| −0.8 | 0.9391 | 0.6957 | 0.8055 | 0.7171 | 0.9347 | 0.6441 |
| −1.0 | 0.9583 | 0.6100 | 0.7671 | 0.6687 | 0.9489 | 0.5933 |
The boldface indicates the best result
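The columns of this table are the standard confusion-matrix measures. As a minimal sketch of how they relate, the function below computes all six from counts of true/false positives and negatives. The name `binary_metrics` and the counts in the usage line are hypothetical and are not taken from the study.

```python
import math

def _ratio(num, den):
    """Return num/den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def binary_metrics(tp, fn, tn, fp):
    """Standard measures for a two-class predictor (allergen = positive class)."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),            # allergens correctly predicted
        "specificity": _ratio(tn, tn + fp),            # non-allergens correctly predicted
        "accuracy": _ratio(tp + tn, tp + fn + tn + fp),
        "ppv": _ratio(tp, tp + fp),                    # positive predictive value
        "npv": _ratio(tn, tn + fn),                    # negative predictive value
        "mcc": _ratio(tp * tn - fp * fn, mcc_den),     # Matthews correlation coefficient
    }

# Hypothetical counts, for illustration only.
m = binary_metrics(tp=80, fn=20, tn=90, fp=10)
```

Lowering the SVM score threshold moves predictions from the negative to the positive class, which is why sensitivity and NPV rise down the table while specificity and PPV fall.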
Figure 1. The ROC plot of the SVM-based methods using amino acid (SVMcomp) and dipeptide (SVMdipep) composition.
The probability of correct prediction of allergens and non-allergens is shown in terms of PPV and NPV, respectively, for SVM-based methods using residue and dipeptide composition
| SVM score (range) | PPV (amino acid composition) | NPV (amino acid composition) | PPV (dipeptide composition) | NPV (dipeptide composition) |
|---|---|---|---|---|
| 1.0–0.8 | 85.64 | 67.96 | 100.00 | 59.74 |
| 0.8–0.6 | 87.05 | 71.53 | 82.97 | 62.40 |
| 0.6–0.4 | 81.83 | 74.03 | 86.55 | 66.47 |
| 0.4–0.2 | 74.81 | 76.94 | 85.88 | 72.01 |
| 0.2–0.0 | 70.05 | 80.74 | 74.14 | 79.04 |
| 0.0 to −0.2 | 64.55 | 86.61 | 63.10 | 85.56 |
| −0.2 to −0.4 | 47.13 | 89.71 | 39.40 | 89.34 |
| −0.4 to −0.6 | 18.21 | 71.24 | 27.66 | 92.40 |
| −0.6 to −0.8 | 22.82 | 92.94 | 13.26 | 74.19 |
| −0.8 to −1.0 | 15.19 | 94.18 | 8.69 | 75.22 |
Search for 178 IgE epitopes in a protein dataset consisting of 578 allergens and 700 non-allergens
| Approach | PID (cut-off) | Total hits: allergens | Total hits: non-allergens |
|---|---|---|---|
| PID100 | 100 | 56 (9.69%) | 2 (0.28%) |
| PID81 | >80 | 77 (13.32%) | 8 (1.11%) |
| PID80 | ≥80 | 102 (17.65%) | 74 (10.57%) |
| PID876 | >80 (epitopes have ≤9 amino acids); ≥70 (epitopes have >9 and ≤15 amino acids); >60 (epitopes have >15 amino acids) | 91 (15.74%) | 11 (1.57%) |
| PID865 | >80 (epitopes have ≤9 amino acids); ≥60 (epitopes have >9 and ≤15 amino acids); >50 (epitopes have >15 amino acids) | 101 (17.47%) | 13 (1.85%) |
Different PID cut-offs were used to search for epitopes in proteins; for PID876 and PID865 the cut-off also depends on the number of amino acids in the IgE epitope.
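The length-dependent cut-offs of the PID865 scheme can be sketched as follows. The source does not say how percent identity (PID) is computed, so this example assumes an ungapped sliding-window comparison of the epitope against every same-length segment of the protein. The function names and the sequences in the usage lines are hypothetical.

```python
def percent_identity(epitope: str, window: str) -> float:
    """Percentage of positions at which epitope and window share the same residue."""
    matches = sum(a == b for a, b in zip(epitope, window))
    return 100.0 * matches / len(epitope)

def best_pid(epitope: str, protein: str) -> float:
    """Highest ungapped PID of the epitope over all same-length windows (assumed method)."""
    n = len(epitope)
    if n == 0 or len(protein) < n:
        return 0.0
    return max(percent_identity(epitope, protein[i:i + n])
               for i in range(len(protein) - n + 1))

def passes_pid865(epitope: str, protein: str) -> bool:
    """Apply the PID865 cut-offs from the table, chosen by epitope length."""
    pid = best_pid(epitope, protein)
    n = len(epitope)
    if n <= 9:
        return pid > 80
    if n <= 15:
        return pid >= 60
    return pid > 50

# Hypothetical sequences, for illustration only.
short_hit = passes_pid865("ACDEFGHIK", "MMACDEFGHIAMM")        # 8 of 9 residues match
short_miss = passes_pid865("ACDEFGHIK", "MMACDEFGHWWMM")       # 7 of 9 residues match
medium_hit = passes_pid865("ACDEFGHIKLMN", "WWACDEFGHIWWWWWW") # 8 of 12 residues match
```

Relaxing the cut-off for longer epitopes lets partially conserved long epitopes count as hits, which is in line with the higher allergen hit counts for PID876 and PID865 than for PID81.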
MEME/MAST results of allergen and non-allergen motifs
| E-value | Total hits: allergens | Total hits: non-allergens |
|---|---|---|
| 10⁻³ | 38 (6.57%) | 20 (2.86%) |
| 10⁻¹ | 86 (14.88%) | 62 (8.86%) |
| 1 | 142 (24.57%) | 113 (16.14%) |
| 10 | 246 (42.56%) | 240 (34.29%) |
| 20 | 309 (53.46%) | 288 (41.14%) |
| 50 | 427 (73.88%) | 389 (55.57%) |
| 100 | 543 (93.94%) | 468 (66.86%) |
Allergen hits are out of a total of 578 allergens, and non-allergen hits are out of a total of 700 non-allergens.
Search results for 1364 proteins (664 allergens and 700 non-allergens) searched against the ARPs database using BLAST
| E-value | Total hits: allergens | Total hits: non-allergens |
|---|---|---|
| 1 | 626 (94.28%) | 338 (48.25%) |
| 10⁻¹ | 586 (88.22%) | 48 (6.86%) |
| 10⁻² | 562 (84.64%) | 23 (3.29%) |
| 10⁻³ | | |
| 10⁻⁴ | 527 (79.37%) | 8 (1.14%) |
| 10⁻⁶ | 465 (70.03%) | 5 (0.71%) |
| 10⁻⁹ | 350 (52.71%) | 2 (0.28%) |
The boldface indicates the best result
Performance of the hybrid approach, which combines the SVM-based approach using amino acid composition with the IgE epitope-based approach (PID865)
| Threshold | Sensitivity | Specificity | Accuracy | PPV | NPV | MCC |
|---|---|---|---|---|---|---|
| 1.0 | 0.4452 | 0.9814 | 0.7396 | 0.9517 | 0.6836 | 0.5211 |
| 0.8 | 0.4922 | 0.9700 | 0.7545 | 0.9309 | 0.7000 | 0.5405 |
| 0.6 | 0.5652 | 0.9586 | 0.7812 | 0.9181 | 0.7293 | 0.5829 |
| 0.4 | 0.6191 | 0.9386 | 0.7945 | 0.8922 | 0.7509 | 0.5995 |
| 0.2 | 0.6713 | 0.9243 | 0.8102 | 0.8793 | 0.7749 | 0.6248 |
| 0.0 | 0.7443 | 0.8929 | 0.8259 | 0.8509 | 0.8106 | 0.6499 |
| −0.2 | 0.8417 | 0.8543 | 0.8486 | 0.8259 | 0.8692 | 0.6963 |
| −0.4 | 0.8887 | 0.8186 | 0.8502 | 0.8009 | 0.9009 | 0.7053 |
| −0.6 | 0.9061 | 0.7614 | 0.8267 | 0.7573 | 0.9096 | 0.6680 |
| −0.8 | 0.9391 | 0.6957 | 0.8055 | 0.7171 | 0.9347 | 0.6441 |
| −1.0 | 0.9583 | 0.6100 | 0.7671 | 0.6687 | 0.9489 | 0.5933 |
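The source does not state how the SVM score and the epitope search are combined. One reading that fits the table, where sensitivity rises at high thresholds while specificity barely changes, is an OR rule: a protein is called an allergen if it either exceeds the SVM threshold or contains a matching IgE epitope. The sketch below assumes that rule; `hybrid_predict`, its arguments and the example values are hypothetical.

```python
def hybrid_predict(svm_score: float, epitope_hit: bool, threshold: float = 0.0) -> bool:
    """Assumed OR rule: allergen if an IgE epitope matched or the SVM score exceeds the threshold."""
    return epitope_hit or svm_score > threshold

# Hypothetical proteins as (SVM score, epitope hit found?) pairs.
examples = [(0.9, False), (-0.7, True), (-0.7, False)]
calls = [hybrid_predict(score, hit, threshold=0.0) for score, hit in examples]
```

Under this rule the second protein is rescued by its epitope hit despite a low SVM score, while the third is still predicted to be a non-allergen.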
Performance of different methods on 101725 non-allergens obtained from Swiss-Prot and on 323 allergens (independent dataset not used in training or testing of methods).
| Prediction method | Falsely predicted allergens (of 101 725 Swiss-Prot non-allergens) | Specificity (predicted non-allergens) (%) | Correctly predicted allergens (sensitivity) (of 323 independent allergens) |
|---|---|---|---|
| SVMc | 44684 | 56.07 | 272 (84.21%) |
| SVMd | 39590 | 61.09 | 274 (84.83%) |
| MAST (ev100) | 13545 | 86.68 | 58 (17.95%) |
| MAST (ev 0.1) | 3480 | 96.58 | 40 (12.38%) |
| BLAST (ARP) | 2060 | 97.97 | 215 (66.56%) |
| IgE epitope | 1777 | 98.25 | 35 (10.84%) |
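The percentages in this table follow from the raw counts and the two dataset sizes. As a short worked check, the sketch below recomputes the SVMc row; the function names are illustrative, and the numbers are the ones given in the table.

```python
N_NON_ALLERGENS = 101_725  # Swiss-Prot non-allergens
N_ALLERGENS = 323          # independent allergen set

def specificity_pct(false_positives: int, n_negatives: int = N_NON_ALLERGENS) -> float:
    """Percentage of non-allergens that were not predicted to be allergens."""
    return 100.0 * (1 - false_positives / n_negatives)

def sensitivity_pct(true_positives: int, n_positives: int = N_ALLERGENS) -> float:
    """Percentage of allergens that were correctly predicted."""
    return 100.0 * true_positives / n_positives

# SVMc row: 44 684 falsely predicted allergens, 272 correctly predicted allergens.
svmc_spec = round(specificity_pct(44_684), 2)  # 56.07
svmc_sens = round(sensitivity_pct(272), 2)     # 84.21
```

Because the non-allergen set is several hundred times larger than the allergen set, even the 97.97% specificity of BLAST (ARP) still corresponds to 2060 false positives.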
Figure 2. (a) Snapshot of the home page of the AlgPred server. (b) Snapshot of the input page of the AlgPred server. (c) Snapshot of the output results.