Ashley I Heinson, Yawwani Gunawardana, Bastiaan Moesker, Carmen C Denman Hume, Elena Vataga, Yper Hall, Elena Stylianou, Helen McShane, Ann Williams, Mahesan Niranjan, Christopher H Woelk.
Abstract
Reverse vaccinology (RV) is a bioinformatics approach that can predict antigens with protective potential from the protein coding genomes of bacterial pathogens for subunit vaccine design. RV has become firmly established following the development of the BEXSERO® vaccine against Neisseria meningitidis serogroup B. RV studies have begun to incorporate machine learning (ML) techniques to distinguish bacterial protective antigens (BPAs) from non-BPAs. This research contributes significantly to the RV field by using permutation analysis to demonstrate that a signal for protective antigens can be curated from published data. Furthermore, the effects of the following on an ML approach to RV were also assessed: nested cross-validation, balancing selection of non-BPAs for subcellular localization, increasing the training data, and incorporating greater numbers of protein annotation tools for feature generation. These enhancements yielded a support vector machine (SVM) classifier that could discriminate BPAs (n = 200) from non-BPAs (n = 200) with an area under the curve (AUC) of 0.787. In addition, hierarchical clustering of BPAs revealed that intracellular BPAs clustered separately from extracellular BPAs. However, no immediate benefit was derived when training SVM classifiers on data sets exclusively containing intra- or extracellular BPAs. In conclusion, this work demonstrates that ML classifiers have great utility in RV approaches and will lead to new subunit vaccines in the future.
Keywords: bacterial pathogen; bacterial protective antigen; machine learning; reverse vaccinology; support vector machine
Year: 2017 PMID: 28157153 PMCID: PMC5343848 DOI: 10.3390/ijms18020312
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. (A) Plot of the difference in area under the curve (AUC) between the support vector machine (SVM) classifier BPAD200+N+B+AF and randomly permuted data with increasing feature numbers. SVM classifiers were trained to discriminate bacterial protective antigens (BPAs) from non-BPAs in BPAD200+N+B+AF, and receiver operating characteristic (ROC) curves were generated from a nested leave-tenth-out cross-validation (LTOCV) approach for different numbers of features selected by greedy backward feature elimination. Five iterations were performed to assess the random breakage of ties during greedy backward feature elimination, and AUC was averaged across iterations for each feature set. This analysis was then repeated for five data sets in which the BPA and non-BPA labels were randomly permuted, and the average AUC was calculated across the permuted data sets for each feature set; (B) ROC curves for the average of the five iterations of the 10-feature SVM classifier derived from BPAD200+N+B+AF (black solid line) and from each of the five randomly permuted data sets (dotted grey lines).
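The permutation analysis described in this caption can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names are hypothetical, and `loo_centroid_scores` is a toy leave-one-out nearest-centroid scorer standing in for the cross-validated SVM. The point it shows is that the whole scoring procedure is rerun on shuffled labels, and the resulting AUC is compared with the AUC obtained on the true labels.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))

def loo_centroid_scores(values, labels):
    """Toy stand-in for the cross-validated classifier (assumption, not the SVM):
    leave-one-out nearest-centroid scoring on a single feature."""
    scores = []
    for i, x in enumerate(values):
        pos = [v for j, v in enumerate(values) if j != i and labels[j] == 1]
        neg = [v for j, v in enumerate(values) if j != i and labels[j] == 0]
        mu_pos = sum(pos) / len(pos)
        mu_neg = sum(neg) / len(neg)
        # Higher score means the held-out sample sits closer to the positives.
        scores.append(abs(x - mu_neg) - abs(x - mu_pos))
    return scores

def permutation_baseline(values, labels, score_fn, n_permutations=5, seed=0):
    """Mean AUC after rerunning score_fn on randomly shuffled labels."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_permutations):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # class sizes are preserved, only assignment changes
        aucs.append(auc(score_fn(values, shuffled), shuffled))
    return sum(aucs) / len(aucs)

# Toy data: one feature that separates the two classes cleanly.
values = [2.0 + 0.1 * i for i in range(10)] + [0.1 * i for i in range(10)]
labels = [1] * 10 + [0] * 10
true_auc = auc(loo_centroid_scores(values, labels), labels)
baseline = permutation_baseline(values, labels, loo_centroid_scores)
print(f"true-label AUC {true_auc:.3f}, permuted-label mean AUC {baseline:.3f}")
```

A true-label AUC that sits well above the permuted-label average is the kind of gap that panel (A) plots against the number of features.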
Figure 2. ROC curves were generated from SVM classifiers utilizing 10 features selected by greedy backward feature elimination in an LTOCV approach. Averages were plotted across five iterations of SVM classifiers implemented to randomly break ties resulting from the greedy backward feature elimination procedure. The benchmark to assess these modifications was a non-nested, non-balanced training data set of 136 BPAs and 136 non-BPAs annotated with 122 features from 19 protein annotation tools (BPAD136) [20]. Subsequent modifications were added in a stepwise fashion and included: a nested cross-validation approach (BPAD136+N), balanced selection of non-BPAs for predicted subcellular localization (BPAD136+N+B), increased size of training data (BPAD200+N+B), and additional features (525 total) derived from an increased number of protein annotation tools (BPAD200+N+B+AF).
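The difference between the non-nested benchmark and the nested approach can be sketched as follows, assuming that "nested" here means feature selection is repeated inside each training fold rather than once on the full data set. The record does not spell this out, so treat it as an assumption. The selector and classifier below are deliberately simple, hypothetical stand-ins: a class-mean gap ranking replaces greedy backward feature elimination, and a nearest-centroid rule replaces the SVM.

```python
import random

def tenth_folds(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and deal them into n_folds held-out sets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def select_by_mean_gap(rows, labels, k):
    """Toy selector (stand-in for greedy backward elimination):
    keep the k features with the largest gap between class means."""
    def gap(f):
        pos = [r[f] for r, y in zip(rows, labels) if y == 1]
        neg = [r[f] for r, y in zip(rows, labels) if y == 0]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return sorted(range(len(rows[0])), key=gap, reverse=True)[:k]

def centroid_predict(train_rows, train_labels, test_row, feats):
    """Toy classifier (stand-in for the SVM): nearest class centroid."""
    def centroid(cls):
        members = [r for r, y in zip(train_rows, train_labels) if y == cls]
        return [sum(r[f] for r in members) / len(members) for f in feats]
    def sq_dist(center):
        return sum((test_row[f] - c) ** 2 for f, c in zip(feats, center))
    return 1 if sq_dist(centroid(1)) < sq_dist(centroid(0)) else 0

def nested_cv_accuracy(rows, labels, k=1, n_folds=10, seed=0):
    """Leave-tenth-out accuracy with feature selection redone in every fold."""
    correct = 0
    for held_out in tenth_folds(len(rows), n_folds, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        # Selection sees only the training part, so held-out samples
        # cannot leak into the choice of features.
        feats = select_by_mean_gap(train_rows, train_labels, k)
        for i in held_out:
            pred = centroid_predict(train_rows, train_labels, rows[i], feats)
            correct += int(pred == labels[i])
    return correct / len(rows)

# Toy data: feature 0 is informative, feature 1 is constant.
rows = [[5.0 + 0.1 * i, 1.0] for i in range(10)] + [[0.1 * i, 1.0] for i in range(10)]
labels = [1] * 10 + [0] * 10
print(nested_cv_accuracy(rows, labels))
```

In a non-nested variant, `select_by_mean_gap` would be called once on all rows before the loop, which lets the held-out samples influence feature choice and tends to inflate the estimated performance.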
Figure 3. Pie charts showing subcellular localization as predicted by PSORTb [3] for the numbers of BPAs and non-BPAs in the following subsets of the BPAD136 dataset: (A) positive training data (i.e., 136 BPAs); (B) negative training data (i.e., 136 non-BPAs); and (C) negative training data balanced for subcellular localization (i.e., 136 non-BPAs).
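One way to obtain a negative set balanced for subcellular localization, as in panel (C), is stratified sampling: draw as many non-BPAs from each predicted compartment as there are BPAs in it. The sketch below assumes that reading; the function name, the input layout, and the localization labels in the usage lines are illustrative and not taken from the study.

```python
import random
from collections import Counter, defaultdict

def sample_balanced_negatives(bpa_localizations, candidates, seed=0):
    """Pick non-BPAs so their localization counts match those of the BPAs.

    bpa_localizations: one predicted localization per BPA.
    candidates: (protein_id, localization) pairs for candidate non-BPAs.
    """
    rng = random.Random(seed)
    wanted = Counter(bpa_localizations)
    pool = defaultdict(list)
    for protein_id, loc in candidates:
        pool[loc].append(protein_id)
    chosen = []
    for loc, n in wanted.items():
        if len(pool[loc]) < n:
            raise ValueError(f"only {len(pool[loc])} candidates for {loc!r}, need {n}")
        # Sample without replacement inside each compartment.
        chosen.extend((pid, loc) for pid in rng.sample(pool[loc], n))
    return chosen

# Illustrative usage with made-up identifiers.
bpas = ["Extracellular"] * 3 + ["Cytoplasmic"] * 2
cands = [(f"e{i}", "Extracellular") for i in range(5)]
cands += [(f"c{i}", "Cytoplasmic") for i in range(10)]
picked = sample_balanced_negatives(bpas, cands)
print(Counter(loc for _, loc in picked))
```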
The top 10 annotation features selected by greedy backward feature elimination for discrimination of BPAs from non-BPAs by the SVM classifier trained on the BPAD200+N+B+AF data set.
| Rank | Feature | Name of Bioinformatics Tool | Protein Annotation Tool Type | Correlated with BPA or Non-BPA |
|---|---|---|---|---|
| 1 | LipoP_Signal_Avr_Length | LipoP | Lipoprotein | BPA |
| 2 | YinOYang-T-Count | YinOYang | Glycosylation | BPA |
| 3 | NetPhosK-S-Count | NetPhosK | Phosphorylation | BPA |
| 4 | LipoP_SPI_Avr_Length | LipoP | Lipoprotein | BPA |
| 5 | | | T-Cell Epitope predictor (MHC Class II) | BPA |
| 6 | TargetP-SecretFlag | TargetP | Subcellular Compartmentalisation—In Eukaryotic Cells | BPA |
| 7 | YinOYang-Average-Difference1_Length | YinOYang | Glycosylation | Non-BPA |
| 8 | | | T-Cell Epitope predictor | BPA |
| 9 | | | MHC Peptide Binding | Non-BPA |
| 10 | PropFurin-Count_Score | ProP | Cleavage Sites—In Eukaryotic Cells | BPA |
Features in bold represent those derived from protein annotation tools that were added in this study compared to our previous approach [20]. For a full list of bioinformatics tools utilized in this study and the annotation features derived from them please see Table S1.
Figure 4. Hierarchical clustering of 142 BPAs from BPAD200+N+B+AF using all 525 annotation features. Distances between BPAs were calculated using the Euclidean metric, and the BPAs were then clustered using the Ward algorithm. White labels at the branch tips refer to BPAs with subcellular localization predicted by PSORTb [3] as intracellular (i.e., cytoplasm or cytoplasmic membrane) and black labels to extracellular BPAs (i.e., extracellular, periplasmic, outer membrane, cell wall).
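Ward clustering on Euclidean distances can be sketched in a few lines. At each step it merges the two clusters whose union increases the within-cluster sum of squares the least, which for clusters A and B with centroids c_A and c_B equals |A||B| / (|A| + |B|) times the squared Euclidean distance between c_A and c_B. The naive implementation below is only an illustration with a hypothetical function name. It stops at a requested number of clusters instead of building the full dendrogram, and for real data a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` would be the usual choice.

```python
def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering with Ward's merge criterion.

    points: list of equal-length numeric feature vectors.
    Returns a list of clusters, each a sorted list of point indices.
    """
    # Each cluster is stored as (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
                # Increase in within-cluster sum of squares if a and b merge.
                cost = len(ma) * len(mb) / (len(ma) + len(mb)) * sq
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        merged = ma + mb
        # Size-weighted mean of the two centroids.
        centroid = [(len(ma) * x + len(mb) * y) / len(merged) for x, y in zip(ca, cb)]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((merged, centroid))
    return [sorted(members) for members, _ in clusters]

# Two well-separated toy groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(sorted(ward_clusters(points, 2)))  # → [[0, 1, 2], [3, 4, 5]]
```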
Figure 5. (A) ROC curves obtained from SVM classifiers trained to distinguish BPAs from non-BPAs in the following data sets: iBPAD51 (dotted line), eBPAD91 (solid grey line) and BPAD200+N+B+AF (black line). Curves were drawn by averaging results from five iterations of SVM classifiers consisting of 10 features selected by greedy backward feature elimination assessed in an LTOCV approach; (B) Plot showing the average percentage accuracy (five iterations) of 10-feature SVM classifiers trained on different-sized subsets of BPAD200+N+B+AF for comparison to SVM classifiers derived from iBPAD51 and eBPAD91.
The top 10 annotation features selected by greedy backward feature elimination utilized by SVM classifiers trained on (A) eBPAD91 and (B) iBPAD51.

(A) eBPAD91

| Rank | Feature | Protein Annotation Tool Type | Rank in Other Classifier Type |
|---|---|---|---|
| 1 | Pad-value | Adhesin | 42 |
| 2 | DictOGlyc_Ser_Average_Threshold_Length | Glycosylation | 189 |
| 3 | LipoP_SPI_AvrScore | Lipoprotein | NF |
| 4 | Netsurfp_RSA_Exposed_AverageDiff | Surface accessibility and secondary structure | NF |
| 5 | PoloPhosphorylation_CorAvg | Phosphorylation | NF |
| 6 | Net_Chop_CorCount | Predicts cleavage sites | NF |
| 7 | DictOGlyc-No_Score_Sites_Length | Glycosylation | NF |
| 8 | GPS_SUMO_Sumoylation_Average_Score | Small ubiquitin like modifiers (SUMOs) binding site prediction | 9 |
| 9 | ProtParam-PercIsoleucine | General Annotation | 144 |
| 10 | ProtParam-PercGlutamicAcid | General Annotation | NF |

(B) iBPAD51

| Rank | Feature | Protein Annotation Tool Type | Rank in Other Classifier Type |
|---|---|---|---|
| 1 | Bepipred-Count_Length | B-Cell Epitope | 149 |
| 2 | CCD_av_diff | Calpain Cleavage | NF |
| 3 | YinOYang-T-Average-Difference1_Length | Glycosylation | NF |
| 4 | ProtParam-GRAVY | General Annotation | 35 |
| 5 | NetOGlyc-T-Max-I | Glycosylation | 196 |
| 6 | YinOYang-T-Average_Length | Glycosylation | NF |
| 7 | ProtParam-PercAlanine | General Annotation | 97 |
| 8 | NetPhosK-Y-MaxScore | Phosphorylation | NF |
| 9 | GPS_SUMO_Sumoylation_Average_Score | Small ubiquitin like modifiers (SUMOs) binding site prediction | 8 |
| 10 | MBAAgl7_CorAvg | T-Cell Epitope predictor | NF |

Protein annotation tools listed in bold represent those not present in the other classifier type. NF: not found in the top 200 features of the other classifier type that were submitted to the greedy backward feature elimination algorithm following non-specific F-score filtering.
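The F-score filtering mentioned in this note can be illustrated under an explicit assumption: the record does not give the formula, so the sketch below uses the F-score commonly paired with SVM feature selection, which divides the squared deviations of the two class means from the overall mean by the sum of the two within-class sample variances. The function names are hypothetical, and `top_n` plays the role of the 200-feature cut-off.

```python
from statistics import mean, variance

def f_score(pos, neg):
    """F-score of one feature (assumed definition, see lead-in).

    pos, neg: feature values for the positive and negative class,
    each with at least two values so a sample variance exists.
    """
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1)
    if within == 0:
        # No spread within either class: perfectly separated or constant.
        return float("inf") if between > 0 else 0.0
    return between / within

def rank_features(rows, labels, top_n):
    """Indices of the top_n features by F-score, highest first."""
    scored = []
    for f in range(len(rows[0])):
        pos = [r[f] for r, y in zip(rows, labels) if y == 1]
        neg = [r[f] for r, y in zip(rows, labels) if y == 0]
        scored.append((f_score(pos, neg), f))
    scored.sort(key=lambda t: (-t[0], t[1]))  # ties broken by feature index
    return [f for _, f in scored[:top_n]]

# Toy data: feature 1 separates the classes, feature 0 does not.
rows = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
labels = [1, 1, 1, 0, 0, 0]
print(rank_features(rows, labels, top_n=1))  # → [1]
```

Because each feature is scored on its own, a filter like this is cheap enough to run before a wrapper method such as greedy backward elimination, but it cannot detect features that are only informative in combination.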