Kirsten Roomp, Iris Antes, Thomas Lengauer.
Abstract
BACKGROUND: Experimental screening of large sets of peptides with respect to their MHC binding capabilities is still very demanding due to the large number of possible peptide sequences and the extensive polymorphism of the MHC proteins. Therefore, there is significant interest in the development of computational methods for predicting the binding capability of peptides to MHC molecules, as a first step towards selecting peptides for actual screening.
Year: 2010 PMID: 20163709 PMCID: PMC2836306 DOI: 10.1186/1471-2105-11-90
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Prediction accuracies for the full dataset
| # | Allele | Binders | Non-Binders | Total | DynaPredPOS | NetMHC | SVMHC | YKW |
|---|---|---|---|---|---|---|---|---|
| 1 | A*0101 | 163 | 1316 | 1479 | 0.93 | 0.98 | 0.95 | 0.94 |
| 2 | A*0201 | 1544 | 1929 | 3473 | 0.93 | 0.96 | 0.92 | 0.91 |
| 3 | A*0202 | 723 | 697 | 1420 | 0.88 | 0.93 | 0.85 | 0.85 |
| 4 | A*0203 | 732 | 685 | 1417 | 0.88 | 0.95 | 0.86 | 0.84 |
| 5 | A*0206 | 633 | 782 | 1415 | 0.88 | 0.95 | 0.87 | 0.86 |
| 6 | A*0301 | 637 | 1618 | 2255 | 0.89 | 0.96 | 0.88 | 0.80 |
| 7 | A*1101 | 816 | 1279 | 2095 | 0.92 | 0.96 | 0.90 | 0.91 |
| 8 | A*2402 | 202 | 464 | 666 | 0.80 | 0.85 | 0.78 | 0.81 |
| 9 | A*2601 | 69 | 885 | 954 | 0.84 | 0.93 | 0.83 | 0.84 |
| 10 | A*3101 | 510 | 1480 | 1990 | 0.89 | 0.95 | 0.89 | 0.88 |
| 11 | A*3301 | 203 | 994 | 1197 | 0.88 | 0.96 | 0.89 | 0.88 |
| 12 | A*6801 | 578 | 620 | 1198 | 0.85 | 0.92 | 0.82 | 0.81 |
| 13 | A*6802 | 439 | 980 | 1419 | 0.84 | 0.93 | 0.85 | 0.83 |
| 14 | B*0702 | 238 | 1110 | 1348 | 0.94 | 0.98 | 0.94 | 0.93 |
| 15 | B*0801 | 23 | 687 | 710 | 0.82 | 0.99 | 0.78 | 0.79 |
| 16 | B*1501 | 182 | 836 | 1018 | 0.86 | 0.97 | 0.89 | 0.90 |
| 17 | B*2705 | 81 | 917 | 998 | 0.93 | 0.97 | 0.90 | 0.94 |
| 18 | B*3501 | 273 | 578 | 851 | 0.83 | 0.93 | 0.84 | 0.85 |
| 19 | B*4001 | 94 | 1112 | 1206 | 0.90 | 0.97 | 0.93 | 0.91 |
| 20 | B*4402 | 76 | 136 | 212 | 0.77 | 0.84 | 0.75 | 0.76 |
| 21 | B*4403 | 71 | 142 | 213 | 0.68 | 0.81 | 0.65 | 0.70 |
| 22 | B*5101 | 108 | 249 | 357 | 0.82 | 0.93 | 0.80 | 0.81 |
| 23 | B*5301 | 127 | 228 | 355 | 0.86 | 0.95 | 0.85 | 0.87 |
| 24 | B*5801 | 78 | 893 | 971 | 0.90 | 0.99 | 0.93 | 0.93 |
Alleles included in the study with the number of binders and non-binders available; all available binders and non-binders were included in the analysis irrespective of whether quantitative laboratory test data was available or not (Dataset F). Only unique peptide sequences were included in the counts; all peptides with more than one entry for a particular allele in IEDB were counted once only. The overall performance of the four prediction models on different alleles is shown; AUC = area under the curve (ROC analysis). The average AUC for each method is included at the bottom of each column.
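The AUC values in the table can be read as the probability that a randomly chosen binder receives a higher prediction score than a randomly chosen non-binder. The following is a minimal Python sketch of that calculation, not the authors' evaluation code; the function name `roc_auc` and the convention that label 1 marks a binder and 0 a non-binder are assumptions made for illustration.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Assumed convention (not from the source): label 1 = binder,
    label 0 = non-binder, and a higher score means "more likely a binder".
    """
    binders = [s for s, y in zip(scores, labels) if y == 1]
    non_binders = [s for s, y in zip(scores, labels) if y == 0]
    if not binders or not non_binders:
        raise ValueError("need at least one binder and one non-binder")

    wins = 0.0
    for b in binders:
        for n in non_binders:
            if b > n:
                wins += 1.0   # binder correctly ranked above non-binder
            elif b == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(binders) * len(non_binders))
```

An AUC of 1.0 corresponds to a perfect ranking and 0.5 to a random one, which is why values near 0.5 indicate little discriminative power. The pairwise loop is quadratic in the number of peptides; for datasets of the size listed here, a rank-based implementation would be faster.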
Prediction accuracies for the dataset containing only weak binders and non-binders
| # | Allele | Binders | Non-Binders | Total | DynaPredPOS | NetMHC | SVMHC | YKW |
|---|---|---|---|---|---|---|---|---|
| 1 | A*0201 | 616 | 135 | 751 | 0.67 | 0.77 | 0.65 | 0.66 |
| 2 | A*0202 | 286 | 87 | 373 | 0.53 | 0.70 | 0.41 | 0.54 |
| 3 | A*0203 | 261 | 126 | 387 | 0.58 | 0.79 | 0.58 | 0.60 |
| 4 | A*0206 | 264 | 74 | 338 | 0.56 | 0.73 | 0.57 | 0.57 |
| 5 | A*0301 | 335 | 106 | 441 | 0.58 | 0.70 | 0.60 | 0.59 |
| 6 | A*1101 | 374 | 91 | 465 | 0.56 | 0.69 | 0.62 | 0.63 |
| 7 | A*3101 | 278 | 103 | 381 | 0.42 | 0.69 | 0.48 | 0.56 |
| 8 | A*3301 | 129 | 72 | 201 | 0.62 | 0.52 | 0.39 | 0.63 |
| 9 | A*6801 | 273 | 96 | 369 | 0.54 | 0.68 | 0.44 | 0.57 |
| 10 | A*6802 | 227 | 123 | 350 | 0.52 | 0.74 | 0.50 | 0.50 |
| 11 | B*1501 | 169 | 33 | 202 | 0.59 | 0.79 | 0.49 | 0.59 |
Results from Dataset I, in which only weak binders (50 nM to 500 nM binding affinity) and non-binders (500 nM to 1000 nM binding affinity) were included. Alleles from Dataset F with fewer than 200 binders and non-binders in total in Dataset I were excluded from the analysis. The average AUC for each method is included at the bottom of each column.
Prediction accuracies for the dataset containing only strong binders and clear non-binders
| # | Allele | Binders | Non-Binders | Total | DynaPredPOS | NetMHC | SVMHC | YKW |
|---|---|---|---|---|---|---|---|---|
| 1 | A*0101 | 34 | 284 | 318 | 0.96 | 1.00 | 0.96 | 0.97 |
| 2 | A*0201 | 549 | 503 | 1052 | 0.97 | 0.99 | 0.97 | 0.95 |
| 3 | A*0202 | 290 | 267 | 557 | 0.98 | 1.00 | 0.97 | 0.97 |
| 4 | A*0203 | 273 | 255 | 528 | 0.97 | 0.99 | 0.96 | 0.96 |
| 5 | A*0206 | 216 | 371 | 587 | 0.97 | 0.99 | 0.96 | 0.95 |
| 6 | A*0301 | 123 | 332 | 455 | 0.93 | 1.00 | 0.94 | 0.95 |
| 7 | A*1101 | 228 | 269 | 497 | 0.95 | 1.00 | 0.94 | 0.97 |
| 8 | A*2402 | 69 | 272 | 341 | 0.88 | 0.84 | 0.84 | 0.85 |
| 9 | A*2601 | 15 | 256 | 271 | 0.97 | 1.00 | 0.94 | 0.97 |
| 10 | A*3101 | 114 | 349 | 463 | 0.96 | 1.00 | 0.96 | 0.97 |
| 11 | A*3301 | 36 | 620 | 656 | 0.95 | 1.00 | 0.94 | 0.93 |
| 12 | A*6801 | 155 | 235 | 390 | 0.95 | 1.00 | 0.94 | 0.94 |
| 13 | A*6802 | 95 | 440 | 535 | 0.96 | 1.00 | 0.97 | 0.94 |
| 14 | B*0702 | 45 | 161 | 206 | 0.95 | 1.00 | 0.96 | 0.93 |
Results from Dataset S, in which only strong binders (less than 10 nM binding affinity) and very clear non-binders (greater than 10,000 nM binding affinity) were included. Alleles from Dataset F with fewer than 200 binders and non-binders in total in Dataset S were excluded from the analysis. The average AUC for each method is included at the bottom of each column.
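The affinity cut-offs that define Dataset S and Dataset I, together with the 200-peptide minimum per allele, can be sketched as a simple filter. This is an illustrative Python sketch, not the authors' pipeline. The record layout `(allele, peptide, affinity_nm)` and all function names are hypothetical, and assigning a peptide at exactly 500 nM to the non-binder class is an assumption, since the captions give 500 nM as the boundary of both ranges.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical record layout: (allele, peptide, binding affinity in nM).
Record = Tuple[str, str, float]


def label_strong(affinity_nm: float) -> Optional[int]:
    """Dataset S rule: 1 = strong binder, 0 = clear non-binder, None = excluded."""
    if affinity_nm < 10:
        return 1
    if affinity_nm > 10_000:
        return 0
    return None


def label_weak(affinity_nm: float) -> Optional[int]:
    """Dataset I rule: 1 = weak binder, 0 = non-binder, None = excluded.

    Assumption: exactly 500 nM is treated as a non-binder.
    """
    if 50 <= affinity_nm < 500:
        return 1
    if 500 <= affinity_nm <= 1000:
        return 0
    return None


def build_dataset(
    records: Iterable[Record],
    labeler: Callable[[float], Optional[int]],
    min_total: int = 200,
) -> Dict[str, List[Tuple[str, int]]]:
    """Label peptides per allele and drop alleles with fewer than min_total peptides."""
    by_allele: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for allele, peptide, affinity_nm in records:
        label = labeler(affinity_nm)
        if label is not None:
            by_allele[allele].append((peptide, label))
    return {a: rows for a, rows in by_allele.items() if len(rows) >= min_total}
```

Calling `build_dataset(records, label_strong)` would yield a Dataset S style subset and `build_dataset(records, label_weak)` a Dataset I style subset. Peptides whose affinity falls between the two ranges are simply left out, which is what makes Dataset S an easier and Dataset I a harder classification task.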
Figure 1. Overall performance evaluation. ROC plot for the overall performance evaluation of SVMHC, YKW, and DynaPredPOS, with models trained and tested on Dataset S for allele A*0201. NetMHC was only available online and therefore could not be trained; its results come from testing on Dataset S.
Prediction accuracies for an independent dataset
| # | Allele | Binders | Non-Binders | Total | DynaPredPOS | NetMHC | SVMHC | YKW |
|---|---|---|---|---|---|---|---|---|
| 1 | A*0201 | 33 | 143 | 176 | 0.92 | 0.94 | 0.82 | 0.93 |
| 2 | A*0301 | 11 | 165 | 176 | 0.77 | 0.92 | 0.70 | 0.85 |
| 3 | A*1101 | 17 | 159 | 176 | 0.84 | 0.89 | 0.72 | 0.83 |
| 4 | A*2402 | 37 | 139 | 176 | 0.90 | 0.78 | 0.90 | 0.64 |
| 5 | B*0702 | 9 | 167 | 176 | 0.86 | 0.98 | 0.72 | 0.68 |
| 6 | B*0801 | 10 | 166 | 176 | 0.97 | 0.92 | 0.97 | 0.61 |
| 7 | B*1501 | 14 | 162 | 176 | 0.71 | 0.92 | 0.71 | 0.80 |
A set of 176 novel peptides, generated and tested by Lin et al. [15], was used to test the prediction accuracy of the four methods in this study. The average AUC for each method is included at the bottom of each column.
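The per-method averages referred to in the caption can be recomputed from the seven rows listed above. The Python sketch below does this with the AUC values copied from the table; the names `auc_by_method` and `mean_auc` are illustrative. Because the inputs are rounded to two decimals, the results may differ slightly from averages calculated on unrounded values.

```python
from statistics import mean
from typing import Dict, List

# AUC values copied row by row from the independent-dataset table
# (alleles A*0201, A*0301, A*1101, A*2402, B*0702, B*0801, B*1501).
auc_by_method: Dict[str, List[float]] = {
    "DynaPredPOS": [0.92, 0.77, 0.84, 0.90, 0.86, 0.97, 0.71],
    "NetMHC": [0.94, 0.92, 0.89, 0.78, 0.98, 0.92, 0.92],
    "SVMHC": [0.82, 0.70, 0.72, 0.90, 0.72, 0.97, 0.71],
    "YKW": [0.93, 0.85, 0.83, 0.64, 0.68, 0.61, 0.80],
}


def mean_auc(table: Dict[str, List[float]]) -> Dict[str, float]:
    """Unweighted mean AUC per method across alleles."""
    return {method: mean(values) for method, values in table.items()}


averages = mean_auc(auc_by_method)
for method, value in averages.items():
    print(f"{method}: {value:.2f}")
```

The mean is unweighted, so every allele contributes equally regardless of how many binders it has in the test set.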
Figure 2. Robustness analysis. The reproducibility of the results of the prediction methods and their dependence on the size of the available dataset were examined in selected alleles. Box plots of randomly selected balanced sets of binders and non-binders from Dataset F for the alleles A*0201, A*3101, and B*0702 are shown. The smallest dataset for each allele consisted of 50 binders and 50 non-binders. The size of the largest dataset for each allele depends on the total number of binders or non-binders available for that particular allele. NetMHC was not included in this analysis as the predictor is only available online and could therefore not be trained by the authors.
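The sampling scheme behind such a robustness analysis can be sketched as repeated draws of balanced subsets of increasing size. The Python sketch below is an illustration under stated assumptions, not the authors' procedure: the caption gives only the smallest size (50 binders and 50 non-binders) and says the largest depends on availability, so the step size, the number of repeats, and the fixed random seed are all hypothetical parameters.

```python
import random
from typing import List, Sequence, Tuple


def balanced_subsets(
    binders: Sequence[str],
    non_binders: Sequence[str],
    start: int = 50,    # smallest set size per class, as stated in the caption
    step: int = 50,     # assumption: size increment between subsets
    repeats: int = 10,  # assumption: random draws per size (one box per size)
    seed: int = 0,      # assumption: fixed seed for reproducibility
) -> List[Tuple[int, List[str], List[str]]]:
    """Draw random subsets with equal numbers of binders and non-binders."""
    rng = random.Random(seed)
    # The largest balanced set is limited by the scarcer class.
    max_n = min(len(binders), len(non_binders))
    subsets = []
    for n in range(start, max_n + 1, step):
        for _ in range(repeats):
            # Sampling without replacement within each class.
            subsets.append(
                (n, rng.sample(list(binders), n), rng.sample(list(non_binders), n))
            )
    return subsets
```

Training and evaluating a predictor on each subset and plotting the resulting AUCs per size would give one box per subset size, so the spread of each box reflects how reproducible the method is at that amount of training data.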
Figure 3. Performance comparison on Dataset F. Performance (AUC) of the three prediction models trained on Dataset F of A*0201 and tested on Dataset F of all alleles. NetMHC was not included in this analysis as the predictor is only available online and therefore could not be trained by the authors.