Stephen J Goodswen, Paul J Kennedy, John T Ellis.
Abstract
Given thousands of proteins constituting a eukaryotic pathogen, the principal objective for a high-throughput in silico vaccine discovery pipeline is to select those proteins worthy of laboratory validation. Accurate prediction of T-cell epitopes on protein antigens is one crucial piece of evidence that would aid in this selection. Prediction of peptides recognised by T-cell receptors has to date proved insufficiently accurate. The in silico approach is consequently reliant on an indirect method, which involves the prediction of peptides binding to major histocompatibility complex (MHC) molecules. There is nevertheless no guarantee that predicted peptide-MHC complexes will be presented by antigen-presenting cells and/or recognised by cognate T-cell receptors. The aim of this study was to determine if predicted peptide-MHC binding scores could provide contributing evidence to establish a protein's potential as a vaccine. Using T-cell MHC class I binding prediction tools provided by the Immune Epitope Database and Analysis Resource, peptide binding affinity to 76 common MHC I alleles was predicted for 160 Toxoplasma gondii proteins: 75, taken from published studies, represented proteins known or expected to induce T-cell immune responses, and 85 were considered less likely vaccine candidates. The results show there is no universal set of rules that can be applied directly to binding scores to distinguish a vaccine from a non-vaccine candidate. We present, however, two proposed strategies exploiting binding scores that provide supporting evidence that a protein is likely to induce a T-cell immune response: one using random forest (a machine learning algorithm), with 72% sensitivity and 82.4% specificity, and the other using amino acid conservation scores, with 74.6% sensitivity and 70.5% specificity, when applied to the 160 benchmark proteins.
More importantly, the binding score strategies are valuable evidence contributors to the overall in silico vaccine discovery pool of evidence.
Year: 2014 PMID: 25545691 PMCID: PMC4278717 DOI: 10.1371/journal.pone.0115745
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Example of online output from IEDB peptide-MHC class I binding predictor.
The binding predictor conceptually slides a window of a user-defined length (eight to eleven amino acid residues) one residue at a time from the start of the protein sequence. An affinity score is predicted for the ability of each fixed-length subsequence (as defined by each position of the sliding window) to bind to a user-specified MHC I allele. Fig. 1 shows the output when a sequence (e.g. MARHAIFFALCVLGL…) is input into the program to predict whether it contains peptides of length 9 that bind to the MHC allele HLA-A*11:01. The IC50 (nM) affinity scores for subsequence ‘MARHAIFFA’ at positions 1 to 9 are highlighted.
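The sliding-window step described in this caption can be sketched in a few lines of Python. This is only an illustration of how the candidate peptides are enumerated; it does not call IEDB or predict any affinity, and the function name `sliding_peptides` is a hypothetical helper, not part of any IEDB tool.

```python
from typing import Iterator, Tuple


def sliding_peptides(sequence: str, length: int) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, peptide) for every window, using 1-based inclusive positions."""
    if length < 1:
        raise ValueError("length must be positive")
    # The window moves one residue at a time from the start of the sequence.
    for i in range(len(sequence) - length + 1):
        yield i + 1, i + length, sequence[i:i + length]


# Only the visible prefix of the caption's example sequence (the source truncates it).
seq = "MARHAIFFALCVLGL"
nine_mers = list(sliding_peptides(seq, 9))

# All windows for the four peptide lengths MHC I typically binds.
all_windows = [w for k in (8, 9, 10, 11) for w in sliding_peptides(seq, k)]
```

Each tuple in `nine_mers` corresponds to one row of predictor output for a single allele; a real run would attach a predicted IC50 score to every window.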
Descriptive statistics for predicted high-affinity peptides against 76 common human MHCs.
| Description | Benchmark Proteins | | | |
| Number of proteins tested | 160 | 124 | 760 | 19378 |
| IC50 scores per protein | H = 4.5, L = 0.2, A = 1.45 | H = 3.1, L = 0.2, A = 1.2 | H = 7.7, L = 0.12, A = 1.3 | H = 49.3, L = 0.03, A = 2.6 |
| Number of peptides on a protein | Max = 1583, Min = 21, A = 292 | Max = 2071, Min = 54, A = 350 | Max = 2528, Min = 4, A = 354 | Max = 3768, Min = 5, A = 227 |
| Number of allele-peptide length combinations used per protein out of 304 combinations | Max = 283, Min = 21, A = 78 | Max = 148, Min = 66, A = 81 | Max = 330, Min = 21, A = 86 | Max = 137, Min = 6, A = 66 |
| Frequency of prediction method used | SMM = 124, ANN = 35, NetMHCpan = 1 | SMM = 95, ANN = 27 | SMM = 582, ANN = 173, NetMHCpan = 5 | SMM = 14802, ANN = 2407, NetMHCpan = 2058 |
| Maximum number of | 28 (HLA-C*03:03 length 11) | 20 (HLA-C*14:02 length 8) | 98 (HLA-C*03:03 length 10) | 3515 (HLA-A*68:01 length 9) |
| Maximum number of | 1526 (HLA-C*03:03 length 10) | 1556 (HLA-C*14:02 length 8) | 7542 (HLA-C*03:03 length 10) | 406529 (HLA-B*58:01 length 10) |
Abbreviations: P. falciparum = Plasmodium falciparum, C. elegans = Caenorhabditis elegans, T. gondii = Toxoplasma gondii, H = highest, L = lowest, A = average, Max = maximum, Min = minimum, SMM = stabilized matrix method, ANN = artificial neural network.
Benchmark Proteins are proteins from published studies with known or expected T-cell responses (source species: T. gondii).
Figure 2. Example of rule-based approach applied to highest affinity peptide on each test protein.
Proteins are listed in ascending order of their lowest IC50 (nM) binding affinity score. A threshold value (e.g. 1.5) is applied to the score to split the list into two classes: below the threshold is ‘YES’ for vaccine candidacy and above is ‘NO’. The rule-based classification is compared with the expected classification to determine performance accuracy. The threshold value is derived by trial and error with the aim of classifying the greatest number of true positives and true negatives.
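As a rough illustration of this rule, the sketch below applies a threshold to each protein's lowest IC50 score and compares the result with the expected classification. The protein names and scores are invented for the example, and treating a score exactly equal to the threshold as ‘NO’ is an assumption, since the caption only covers values below and above it.

```python
from typing import Dict, Tuple


def classify_by_threshold(lowest_ic50: Dict[str, float], threshold: float) -> Dict[str, bool]:
    """True ('YES') when a protein's lowest IC50 score is below the threshold."""
    # Assumption: a score equal to the threshold is classified as 'NO'.
    return {protein: score < threshold for protein, score in lowest_ic50.items()}


def sensitivity_specificity(predicted: Dict[str, bool], expected: Dict[str, bool]) -> Tuple[float, float]:
    """Return (SN %, SP %) using SN = TP/(TP+FN) and SP = TN/(TN+FP)."""
    tp = sum(predicted[p] and expected[p] for p in expected)
    fn = sum((not predicted[p]) and expected[p] for p in expected)
    tn = sum((not predicted[p]) and (not expected[p]) for p in expected)
    fp = sum(predicted[p] and (not expected[p]) for p in expected)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one expected positive and one expected negative")
    return 100.0 * tp / (tp + fn), 100.0 * tn / (tn + fp)


# Hypothetical data: lowest IC50 score per protein and the expected classification.
scores = {"protA": 0.4, "protB": 1.2, "protC": 1.9, "protD": 3.0}
expected = {"protA": True, "protB": False, "protC": True, "protD": False}

predicted = classify_by_threshold(scores, threshold=1.5)
sn, sp = sensitivity_specificity(predicted, expected)
```

With these invented values one expected candidate falls on each side of the threshold, and the same holds for the non-candidates, so both measures come out at 50%.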
Sensitivity and specificity for rule-based tests applied to high-affinity peptide-MHC binding scores for vaccine classification.
| Rule # | Statistical property for rule-based test | Threshold | Benchmark Proteins SN | Benchmark Proteins SP | SN | SP | SN | SP |
| 1 | Lowest IC50 score per protein | 1.5 | 42.7 | 45.8 | 69.8 | 32.0 | 60.9 | 38.3 |
| 2 | Number of high-affinity peptides per protein | 200 | 64.0 | 61.2 | 42.4 | 80.0 | 28.4 | 68.9 |
| 3 | Number of different MHC alleles per protein binding to high-affinity peptides | 74 | 56.0 | 63.5 | 43.8 | 79.2 | 25.5 | 73.5 |
| 4 | Maximum number of high-affinity peptides per protein binding to a particular MHC allele-peptide length combination | 10 | 66.7 | 58.2 | 47.9 | 69.4 | 37.3 | 65.3 |
| 5 | Total binding score per protein | 32289 | 61.3 | 61.2 | 58.9 | 72.0 | 36.8 | 59.1 |
| 6 | Groups: one with proteins containing peptides binding to promiscuous MHCs; one with proteins containing peptides NOT binding to promiscuous MHCs | Not applicable | 47.2 | 44.5 | 45.2 | 48.3 | 47.2 | 45.3 |
Abbreviations: P. falciparum = Plasmodium falciparum, C. elegans = Caenorhabditis elegans, T. gondii = Toxoplasma gondii, SN = sensitivity (%) = true positives/(true positives+false negatives), SP = specificity (%) = true negatives/(true negatives+false positives).
Proteins ordered on statistical property and test thresholds applied to perform a binary classification.
The threshold was derived by trial and error on the benchmark proteins, using the mean as a seed threshold, to achieve the greatest number of true positives and true negatives. The same universal rule (i.e. threshold) is applied to the P. falciparum and C. elegans data.
Benchmark Proteins are proteins from published studies with known or expected T-cell responses (source species: T. gondii).
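The footnotes describe the threshold search only as trial and error seeded with the mean. One way to mimic that, shown here purely as a sketch, is to start from the mean and then scan the observed values, keeping whichever threshold classifies the most proteins correctly. The exhaustive scan is a stand-in for the authors' manual procedure, and the `positive_below` flag is a hypothetical parameter for rules where a low value (rather than a high one) indicates a candidate.

```python
from statistics import mean
from typing import Sequence, Tuple


def count_correct(values: Sequence[float], labels: Sequence[bool],
                  threshold: float, positive_below: bool = True) -> int:
    """Number of true positives plus true negatives at the given threshold."""
    correct = 0
    for value, is_candidate in zip(values, labels):
        predicted = value < threshold if positive_below else value > threshold
        correct += predicted == is_candidate
    return correct


def best_threshold(values: Sequence[float], labels: Sequence[bool],
                   positive_below: bool = True) -> Tuple[float, int]:
    """Try the mean first (the seed), then every observed value; keep the best."""
    candidates = [mean(values)] + sorted(set(values))
    # max() returns the first best candidate, so the seed wins any tie.
    best = max(candidates, key=lambda t: count_correct(values, labels, t, positive_below))
    return best, count_correct(values, labels, best, positive_below)


# Hypothetical statistic per protein and expected vaccine candidacy.
values = [0.4, 1.2, 1.9, 3.0]
labels = [True, True, False, False]
threshold, n_correct = best_threshold(values, labels)
```

A threshold tuned this way is fitted to the benchmark proteins, which is why the tables report its performance separately on the other datasets.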
Figure 3. Example file format of training dataset used in machine learning.
Each line holds one protein and consists of the total binding affinity score for each MHC allele-peptide length combination. MHC I typically binds peptides eight to eleven amino acid residues in length, so 76 common MHC I alleles × 4 peptide lengths = 304 combinations. A binding affinity score is an IEDB IC50 (nM) score <5000. Each score is weighted by the length of the protein. The scores represent input variables or predictors. The last column is a 1 or 0 that indicates an expected ‘YES’ or ‘NO’ vaccine candidacy and represents the target variable. This expectation is based on the subcellular location annotation associated with the protein in UniProtKB (secreted or membrane-associated = 1, internal location = 0).
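A minimal sketch of how one such training line might be assembled is given below. Two points are assumptions rather than details from the caption: the length weighting is implemented as division of each total by the protein length, and the allele names and column ordering are placeholders.

```python
from itertools import product
from typing import Iterable, List, Sequence, Tuple

PEPTIDE_LENGTHS = (8, 9, 10, 11)
IC50_CUTOFF_NM = 5000.0  # only scores below this count as binding (from the caption)


def feature_columns(alleles: Sequence[str]) -> List[Tuple[str, int]]:
    """One column per allele-peptide length combination."""
    return list(product(alleles, PEPTIDE_LENGTHS))


def training_row(predictions: Iterable[Tuple[str, int, float]],
                 columns: Sequence[Tuple[str, int]],
                 protein_length: int, label: int) -> list:
    """Sum binding scores per combination, weight by protein length, append the target."""
    totals = {column: 0.0 for column in columns}
    for allele, peptide_length, ic50 in predictions:
        if ic50 < IC50_CUTOFF_NM and (allele, peptide_length) in totals:
            totals[(allele, peptide_length)] += ic50
    # Assumption: "weighted by the length of the protein" means dividing by it.
    row = [totals[column] / protein_length for column in columns]
    row.append(label)  # 1 = secreted or membrane-associated, 0 = internal location
    return row


# 76 placeholder allele names give the 304 input variables.
full_columns = feature_columns([f"allele_{i}" for i in range(76)])

# Tiny hypothetical example with two alleles and a 100-residue protein.
small_columns = feature_columns(["A", "B"])
preds = [("A", 9, 100.0), ("A", 9, 300.0), ("A", 9, 6000.0), ("B", 8, 50.0)]
row = training_row(preds, small_columns, protein_length=100, label=1)
```

In the small example the 6000 nM prediction is ignored because it is not below the cutoff, so the ("A", 9) column holds (100 + 300) / 100.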
Sensitivity and specificity for random forest tests applied to peptide-MHC binding scores for vaccine classification of Benchmark dataset.
| Training dataset | Cross-validation SN | Cross-validation SP | Cross-validation HE | Benchmark SN | Benchmark SP | Benchmark OE |
| | 38.9 | 61.7 | 49.7 | 58.2 | 49.7 | 54.2 |
| | 55.4 | 48.3 | 51.6 | 52.4 | 56.4 | 54.3 |
| Apicomplexans (R) | 54.0 | 38.2 | 42.9 | 74.0 | 39.0 | 42.5 |
Abbreviations: (R) = target variable e.g. 1 or 0 in training data randomly changed for each protein, HE = hold-out dataset error (%) i.e. error when predicting 30% of training data, OE = overall error (%) i.e. percentage of incorrect predictions, SN = sensitivity (%) = true positives/(true positives+false negatives), SP = specificity (%) = true negatives/(true negatives+false positives).
Cross-validation involved a random sample of 70% from training dataset to build predictive model and remaining 30% used for testing. This was repeated 10 times and predictions averaged (predictions for the same input data fluctuate unless a random seed is set initially).
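The repeated 70/30 procedure in this footnote can be sketched as follows. To keep the example dependency-free, the learner is a trivial majority-class stand-in rather than a random forest, and the hold-out errors (not the individual predictions) are averaged. All function names and the toy data are hypothetical. Passing a seed makes the splits reproducible, which is the fluctuation the footnote warns about.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Row = Tuple[Sequence[float], int]      # (input variables, target 0 or 1)
Model = Callable[[Sequence[float]], int]


def repeated_holdout(rows: Sequence[Row], train: Callable[[List[Row]], Model],
                     repeats: int = 10, train_fraction: float = 0.7,
                     seed: Optional[int] = None) -> float:
    """Mean hold-out error (%) over `repeats` random train/test splits."""
    rng = random.Random(seed)  # without a seed, results differ between runs
    errors = []
    for _ in range(repeats):
        shuffled = list(rows)
        rng.shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        train_rows, test_rows = shuffled[:cut], shuffled[cut:]
        model = train(train_rows)
        wrong = sum(model(features) != target for features, target in test_rows)
        errors.append(100.0 * wrong / len(test_rows))
    return sum(errors) / len(errors)


def train_majority(train_rows: List[Row]) -> Model:
    """Stand-in learner (NOT a random forest): always predicts the commonest label."""
    ones = sum(target for _, target in train_rows)
    majority = 1 if 2 * ones >= len(train_rows) else 0
    return lambda features: majority


# Hypothetical toy dataset: ten proteins, one input variable each.
rows = [((float(i),), i % 2) for i in range(10)]
error_a = repeated_holdout(rows, train_majority, seed=42)
error_b = repeated_holdout(rows, train_majority, seed=42)
```

Swapping `train_majority` for a function that fits a real random forest would leave the splitting and averaging logic unchanged.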
Benchmark: proteins from published studies with known or expected T-cell responses (source species: T. gondii); 100% of the training data was used to build the predictive model.
Note: Number of input variables used to build predictive model = 304 (i.e. number of allele-peptide length combinations derived from 76 common alleles).
Figure 4. Plot of conservation scores computed for binding peptides along a protein (UniProtKB ID: P13664).
Each circle represents the amino acid conservation score computed at one position of a sliding window. The window is of length 9 and slides one residue at a time. The colour of the circle represents binding affinities against 76 common MHC alleles computed at each window. A window (i.e. a peptide) can theoretically bind to all 76 alleles and colours are therefore plotted in a set order: no, low, intermediate, and high affinity. For example, a dark blue circle for low affinity indicates there are no intermediate or high affinity peptides at the window; however, a green circle for high affinity provides no indication of other affinities at the same window. Mean conservation = 0.7805; median conservation = 0.7946. For protein P13664 (Major surface antigen p30), 54.6% of high, 56% of intermediate, and 55.9% of low binders have conservation scores below the mean. The study shows that vaccine candidates are significantly more likely to have either a greater number of less conserved peptides or a lower total conservation score than non-vaccine candidates.
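Because the colours are drawn in the order no, low, intermediate, high, the colour that remains visible at a window is the strongest affinity category found across the alleles. The sketch below reproduces that logic and the "percentage below the mean" figure. The 50 nM and 500 nM cut-offs are an assumption based on commonly quoted IEDB guidance (only the 5000 nM limit appears in this document), the conservation values are invented, and the sketch does not compute conservation scores themselves.

```python
from typing import Iterable, Sequence

# Ordered weakest to strongest, matching the plotting order in the caption.
CATEGORIES = ("no", "low", "intermediate", "high")


def affinity_category(ic50_nm: float) -> str:
    """Bin an IC50 (nM) score; the 50 and 500 cut-offs are assumed, not from the source."""
    if ic50_nm < 50:
        return "high"
    if ic50_nm < 500:
        return "intermediate"
    if ic50_nm < 5000:
        return "low"
    return "no"


def visible_category(ic50_scores: Iterable[float]) -> str:
    """Category plotted last, and therefore visible, for one window across alleles."""
    ranks = [CATEGORIES.index(affinity_category(score)) for score in ic50_scores]
    return CATEGORIES[max(ranks)] if ranks else "no"


def percent_below_mean(conservation: Sequence[float], mean_conservation: float) -> float:
    """Percentage of peptides whose conservation score is below the given mean."""
    below = sum(score < mean_conservation for score in conservation)
    return 100.0 * below / len(conservation)


# Hypothetical window scored against three alleles, and four invented conservation scores.
shown = visible_category([12000.0, 800.0, 40.0])
share = percent_below_mean([0.70, 0.75, 0.80, 0.90], mean_conservation=0.7805)
```

This also makes the caption's caveat concrete: a window reported as "high" may still include weaker binders for other alleles, whereas "low" guarantees that nothing stronger was predicted.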
Figure 5. Performance comparison between high-throughput subcellular location predictors and peptide-MHC binding strategies.
The column chart shows the sensitivity (SN) and specificity (SP) performance measures for high-throughput programs in classifying 160 benchmark proteins as either membrane-associated or secreted. Predictors for membrane = TMHMM, Phobius TM, and WoLF PSORT; predictors for secreted = Phobius SP, SignalP, TargetP, and WoLF PSORT. Threshold criteria applied to each program’s specific output to achieve binary classification: TMHMM – membrane if tmhmm_ExpAA >18$$; Phobius TM – membrane if number of transmembrane domains >0; Phobius SP – secreted if value = ‘Y’; SignalP – secreted if SignalP_D >0.5; TargetP – secreted if value >0.5; WoLF PSORT – membrane if score >16$$ and annotation = ‘membrane’, or secreted if score >16$$ and annotation = ‘secreted’ (where $$ is a value recommended by the creator of the program). Machine learning = strategy using random forest algorithm with peptide-MHC binding scores. Conservation = strategy using amino acid conservation of predicted binding peptides. Performance measures for peptide-MHC strategies are derived from classification of benchmark proteins as either vaccine or non-vaccine candidates.
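The threshold criteria listed in this caption amount to a small set of boolean rules, sketched below for one protein. The keys `tmhmm_ExpAA` and `SignalP_D` follow the names used in the caption; every other key is a hypothetical label chosen for the example, and the code assumes the programs' outputs have already been parsed into a dictionary.

```python
from typing import Any, Dict


def classify_location(outputs: Dict[str, Any]) -> Dict[str, bool]:
    """Apply each program's threshold criterion to its (already parsed) output."""
    wolf_confident = outputs["wolf_score"] > 16
    return {
        "TMHMM membrane": outputs["tmhmm_ExpAA"] > 18,
        "Phobius TM membrane": outputs["phobius_tm_count"] > 0,
        "Phobius SP secreted": outputs["phobius_sp"] == "Y",
        "SignalP secreted": outputs["SignalP_D"] > 0.5,
        "TargetP secreted": outputs["targetp_value"] > 0.5,
        "WoLF PSORT membrane": wolf_confident and outputs["wolf_annotation"] == "membrane",
        "WoLF PSORT secreted": wolf_confident and outputs["wolf_annotation"] == "secreted",
    }


# Hypothetical parsed outputs for a single protein.
example = {
    "tmhmm_ExpAA": 22.5,
    "phobius_tm_count": 1,
    "phobius_sp": "N",
    "SignalP_D": 0.3,
    "targetp_value": 0.2,
    "wolf_score": 20,
    "wolf_annotation": "membrane",
}
calls = classify_location(example)
```

Each boolean can then be compared with the protein's expected location to obtain the per-program sensitivity and specificity shown in the chart.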