Stephen J Goodswen, Paul J Kennedy, John T Ellis.
Abstract
BACKGROUND: An in silico vaccine discovery pipeline for eukaryotic pathogens typically consists of several computational tools that predict protein characteristics. The aim of the in silico approach to discovering subunit vaccines is to use predicted characteristics to identify proteins that are worthy of laboratory investigation. A major challenge is that these predictions inherently contain hidden inaccuracies and contradictions. This study focuses on how to reduce the number of false candidates using machine learning algorithms rather than relying on expensive laboratory validation. Proteins from Toxoplasma gondii, Plasmodium sp., and Caenorhabditis elegans were used as training and test datasets.
Year: 2013 PMID: 24180526 PMCID: PMC3826511 DOI: 10.1186/1471-2105-14-315
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Datasets used for training and testing machine learning models
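| Dataset name | Membrane-associated | Secreted | Neither membrane-associated nor secreted | Organism | |
|---|---|---|---|---|---|
| T. gondii | 8 | 13 | 18 | | |
| Plasmodium | 47 | 26 | 51 | | Includes |
| C. elegans | 324 | 56 | 380 | | |
| Combined species | 379 | 95 | 449 | Combination of organisms | Includes |
| Benchmark | 70c | | 70 | Combination of two organisms | |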
aThis is the name used to refer to the dataset throughout the paper.
bProteins (except for the benchmark dataset) were initially grouped in accordance with the subcellular location descriptor in UniProtKB, then fine-tuned in accordance with cross-validation testing, epitope presence, and reference to other UniProtKB annotations and Gene Ontology. Benchmark proteins were taken from published studies (70 experimentally shown to induce immune responses).
cCombination of proteins from membrane-associated, secreted, and unknown subcellular locations.
Note: Membrane-associated and secreted proteins have an expected classification of ‘YES’ for vaccine candidacy; proteins that are neither membrane-associated nor secreted have an expected classification of ‘NO’. An attempt was made to represent YES and NO classifications equally in the training datasets.
High-throughput standalone programs used in this study to predict protein characteristics
| Program | Version | Predicted characteristic | Predictive accuracy |
|---|---|---|---|
| WoLF PSORT | 0.2 | Protein localisation | 80.0% [ |
| SignalP | 4.0 | Secretory signal peptides | 93.0%b[ |
| TargetP | 1.1 | Secretory signal peptides | 90.0% [ |
| TMHMM | 2.0 | Transmembrane domains | 97.0% [ |
| Phobius | _ | Transmembrane domains and signal peptides | 94.1% [ |
| Peptide-MHC I Bindingc | | Peptide binding to MHC class I | 95.7%d[ |
| Peptide-MHC II Bindingc | | Peptide binding to MHC class II | 76.0%d[ |
aPredictive accuracies taken from publications by the creators of the programs. The prediction accuracy varies for different target pathogens.
bSignalP version 3.0.
cPrediction Tools from The Immune Epitope Database and Analysis Resource (IEDB) [http://www.iedb.org].
dArea under curve value (AUC). Program uses different methods. For MHC I best method = artificial neural network (ANN) [14] and MHC II best method = Consensus [15].
Figure 1. A schematic of a typical vaccine discovery pipeline output. A typical in silico pipeline output is a collection of different protein characteristics that are predicted by bioinformatics programs. The schematic depicts a collection of some of the scores (potential evidence) associated with these predicted characteristics. A collection of scores for one protein is referred to as an evidence profile in the study. Each column represents a potential input variable or predictor for machine learning algorithms. The last column is a ‘YES’ or ‘NO’ as to whether the protein is expected to be a vaccine candidate (a requirement for machine learning training data) and represents the target variable i.e. the variable to be predicted for new profiles.
Figure 2. An extract of evidence profiles. Specific values from high-throughput standalone prediction programs are extracted and compiled to generate evidence profiles. Each row contains the collection of evidence for one protein (i.e. an evidence profile). Each column contains the score for a protein characteristic predicted by a specific program (i.e. an input variable or predictor). See the ‘Contents of evidence profiles’ subsection for a description of the columns.
Figure 3. Example of a test applied to a predicted protein characteristic for the purpose of binary classification. In this example, proteins are listed in descending order based on the number of transmembrane (TM) domains per protein predicted by the program Phobius (input value = Phobius_TM). A threshold value of 0 is applied to the score (i.e. number of TM domains) to segregate the list into two classifications. Above the threshold is ‘YES’ for vaccine candidacy and below or equal is ‘NO’. The classification is compared with the expected classification to determine sensitivity and specificity performance measures.
Sensitivity and specificity performance measures of binary classification for individual input variables taken from datasets
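| Input variable | Type | Data | T. gondii SN | T. gondii SP | Plasmodium SN | Plasmodium SP | C. elegans SN | C. elegans SP | Benchmark SN | Benchmark SP |
|---|---|---|---|---|---|---|---|---|---|---|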
| Phobius_TM | TM | D | 0.57 | 0.89 | 0.85 | 0.90 | 0.91 | 0.97 | 0.74 | 0.93 |
| Phobius_SP | SP | T | 0.52 | 0.89 | 0.39 | 1.00 | 0.25 | 0.99 | 0.49 | 0.96 |
| SignalP | SP | C | 0.52 | 1.00 | 0.39 | 1.00 | 0.25 | 1.00 | 0.39 | 1.00 |
| TargetP_SP | SP | C | 0.67 | 1.00 | 0.77 | 1.00 | 0.34 | 1.00 | 0.56 | 1.00 |
| TargetP_loc | SP | T | 0.67 | 0.94 | 0.76 | 1.00 | 0.27 | 1.00 | 0.56 | 1.00 |
| TMHMM_AA | TM | C | 0.62 | 0.89 | 0.66 | 0.98 | 0.91 | 1.00 | 0.80 | 1.00 |
| TMHMM_First60 | SP | C | 0.43 | 0.93 | 0.26 | 1.00 | 0.37 | 1.00 | 0.49 | 0.97 |
| TMHMM_TM | TM | D | 0.57 | 0.89 | 0.65 | 1.00 | 0.90 | 1.00 | 0.77 | 1.00 |
| WoLF_PSORT | Sub | C | 0.76 | 0.94 | 0.42 | 1.00 | 0.77 | 0.98 | 0.60 | 0.97 |
| WoLF_PSORT_annotation | Sub | T | 1.00 | 0.56 | 0.92 | 0.74 | 1.00 | 0.72 | | |
| MHCI | B | C | 0.76 | 0.56 | 0.78 | 0.84 | 0.77 | 0.69 | 0.74 | 0.84 |
| MHCII | B | C | 0.86 | 0.39 | 0.80 | 0.74 | 0.90 | 0.52 | 0.54 | 0.84 |
Abbreviations: SN = sensitivity; SP = specificity; T. gondii = Toxoplasma gondii; Plasmodium = species in the genus Plasmodium including falciparum, yoelii yoelii, and berghei; C. elegans = Caenorhabditis elegans; Benchmark = dataset comprising evidence for T. gondii and Neospora caninum proteins from published studies.
aInput variable = predicted protein characteristic (i.e. a column from evidence profile).
bType = prediction type: transmembrane domains (TM), secretory signal peptide (SP), sub-cellular location (Sub), peptide-MHC binding (B).
cData = data type: discrete (D), continuous (C), text (T).
The values underlined denote the best performing input variable for classifying the published proteins.
Test criteria on input variable for binary classification:
Phobius_TM: YES if number of transmembrane domains > 0 else NO.
Phobius_SP: YES if = ‘Y’ else NO.
SignalP: YES if > 0.5 else NO.
TargetP_SP: YES if > 0.5 else NO.
TargetP_loc: YES if = ‘S’ else NO.
TMHMM_AA: YES if > 18$$ else NO.
TMHMM_First60: YES if > 10$$ else NO.
TMHMM_TM: YES if number of transmembrane domains > 0 else NO.
WoLF_PSORT: YES if > 16$$ else NO.
WoLF_PSORT_annotation: YES if = ‘membrane’ or ‘secreted’ else NO.
MHCI: YES if > 0.5 else NO.
MHCII: YES if > 0.5 else NO.
$$A value recommended by the creator of the program.
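The test criteria above can be sketched in Python as a table of predicates, together with the sensitivity and specificity calculation described for Figure 3. This is a minimal illustration, not the authors' implementation: the function name `sensitivity_specificity`, the toy values, and the reading of the TMHMM_AA cutoff as 18 are assumptions.

```python
from typing import Callable, Dict, Sequence, Tuple

# Test criteria transcribed from the list above: a value maps to YES when
# its predicate is true. The TMHMM_AA cutoff is read as 18 (assumption).
CRITERIA: Dict[str, Callable[[object], bool]] = {
    "Phobius_TM": lambda v: int(v) > 0,
    "Phobius_SP": lambda v: v == "Y",
    "SignalP": lambda v: float(v) > 0.5,
    "TargetP_SP": lambda v: float(v) > 0.5,
    "TargetP_loc": lambda v: v == "S",
    "TMHMM_AA": lambda v: float(v) > 18,
    "TMHMM_First60": lambda v: float(v) > 10,
    "TMHMM_TM": lambda v: int(v) > 0,
    "WoLF_PSORT": lambda v: float(v) > 16,
    "WoLF_PSORT_annotation": lambda v: str(v).lower() in ("membrane", "secreted"),
    "MHCI": lambda v: float(v) > 0.5,
    "MHCII": lambda v: float(v) > 0.5,
}


def sensitivity_specificity(
    variable: str, values: Sequence[object], expected: Sequence[str]
) -> Tuple[float, float]:
    """Classify each value with the variable's test and compare with the
    expected YES/NO labels. Returns (sensitivity, specificity)."""
    test = CRITERIA[variable]
    tp = fn = tn = fp = 0
    for value, truth in zip(values, expected):
        predicted_yes = test(value)
        if truth == "YES":
            tp += predicted_yes
            fn += not predicted_yes
        else:
            tn += not predicted_yes
            fp += predicted_yes
    sn = tp / (tp + fn) if (tp + fn) else float("nan")
    sp = tn / (tn + fp) if (tn + fp) else float("nan")
    return sn, sp


# Hypothetical toy data: predicted TM-domain counts and expected labels.
toy_values = [3, 0, 1, 0, 0, 2]
toy_expected = ["YES", "YES", "YES", "NO", "NO", "NO"]
sn, sp = sensitivity_specificity("Phobius_TM", toy_values, toy_expected)
print(f"SN={sn:.2f} SP={sp:.2f}")
```

Keeping one predicate per input variable makes it straightforward to repeat the same comparison for every column of an evidence profile.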
Figure 4. A graph of proteins from the combined training dataset using only two input variables to illustrate a rule-based approach for binary classification. Abbreviations: TMHMM_AA = number of amino acid residues in transmembrane helices (a transmembrane domain is expected to be greater than 18), WoLF PSORT = nearest neighbour score (16 = 50%). Triangles and circles indicate expected vaccine candidacy of proteins. The aim of the rule-based approach is to find the optimum threshold values that segregate the majority of triangles from the majority of circles. The best rule for binary classification is ‘NO if TMHMM_AA < 12 and WoLF PSORT < 15 (shaded area on the graph) else YES’. Two examples of where the YES and NO classification rules are broken are shown on the graph. When this best rule was applied to the benchmark dataset, the sensitivity and specificity were 0.43 and 0.97, respectively.
Sensitivity and specificity of classifications on applying rule to benchmark dataset
| Rule | SN | SP |
|---|---|---|
| NO if TMHMM_AA < 12 and WoLF PSORT < 15 else YES | 0.43 | 0.97 |
| NO if TMHMM_TM = 0 and WoLF PSORT < 15 else YES | 0.41 | 0.97 |
| NO if Phobius_TM = 0 and WoLF PSORT < 15 else YES | 0.41 | 0.90 |
| NO if TMHMM_TM = 0 and MHCI < 0.5 else YES | 0.63 | 0.84 |
| NO if Phobius_TM = 0 and MHCII < 0.5 else YES | 0.46 | 0.80 |
| NO if TMHMM_AA < 18 and TargetP_SP <= 0.55 else YES | 0.39 | 1.00 |
| NO if TMHMM_TM = 0 and TargetP_SP < 0.55 else YES | 0.31 | 1.00 |
| NO if Phobius_TM = 0 and TargetP_SP < 0.45 else YES | 0.34 | 0.93 |
| NO if TMHMM_TM = 0 and SignalP < 3.8 else YES | 0.24 | 1.00 |
| NO if TMHMM_AA < 10 and SignalP <= 0.38 else YES | 0.26 | 1.00 |
| NO if TMHMM_AA < 12 and Phobius_SP = ‘N’ else YES | 0.31 | 0.96 |
| NO if TMHMM_TM = 0 and Phobius_SP = ‘N’ else YES | 0.29 | 0.96 |
| NO if TMHMM_AA < 18 and TargetP_SP <= 0.55 and MHCI < 0.5 else YES | 0.31 | 0.84 |
| NO if Phobius_TM = 0 and SignalP < 0.45 else YES | 0.21 | 0.93 |
| NO if Phobius_TM = 0 and Phobius_SP = ‘N’ else YES | 0.24 | 0.89 |
| NO if TMHMM_AA < 18 and TargetP_SP <= 0.55 and WoLF_PSORT_annotation = NOT_secreted_or_membrane else YES | 0.37 | 0.73 |
| NO if TMHMM_AA < 18 and TargetP_SP <= 0.55 and MHCII < 0.5 else YES | 0.24 | 0.84 |
Abbreviations: SN = sensitivity; SP = specificity.
Note: In benchmark dataset, number of YES classifications = 70; number of NO classifications = 70; total number = 140.
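A two-variable rule of the form ‘NO if TMHMM_AA < a and WoLF PSORT < b else YES’ can be sketched as follows. The source does not say how the optimum thresholds were found, so the small grid search here is only one plausible approach; the function names, the scoring by SN + SP, and the toy rows are all hypothetical.

```python
from typing import Iterable, List, Tuple

Row = Tuple[float, float, str]  # (TMHMM_AA, WoLF_PSORT score, expected label)


def classify_by_rule(tmhmm_aa: float, wolf_psort: float,
                     aa_cutoff: float = 12.0, psort_cutoff: float = 15.0) -> str:
    """NO if both scores fall below their cutoffs, else YES."""
    return "NO" if (tmhmm_aa < aa_cutoff and wolf_psort < psort_cutoff) else "YES"


def evaluate(rows: List[Row], aa_cutoff: float, psort_cutoff: float) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of the rule on labelled rows."""
    tp = fn = tn = fp = 0
    for aa, psort, truth in rows:
        predicted = classify_by_rule(aa, psort, aa_cutoff, psort_cutoff)
        if truth == "YES":
            tp += predicted == "YES"
            fn += predicted == "NO"
        else:
            tn += predicted == "NO"
            fp += predicted == "YES"
    sn = tp / (tp + fn) if (tp + fn) else float("nan")
    sp = tn / (tn + fp) if (tn + fp) else float("nan")
    return sn, sp


def best_thresholds(rows: List[Row], aa_grid: Iterable[float],
                    psort_grid: Iterable[float]) -> Tuple[float, float, float, float]:
    """Assumed search strategy: pick the cutoff pair maximising SN + SP."""
    best = None
    psort_values = list(psort_grid)
    for a in aa_grid:
        for p in psort_values:
            sn, sp = evaluate(rows, a, p)
            if best is None or sn + sp > best[2] + best[3]:
                best = (a, p, sn, sp)
    return best


# Hypothetical toy rows, not taken from the study's datasets.
toy_rows: List[Row] = [
    (22.0, 5.0, "YES"), (0.1, 29.0, "YES"), (7.3, 21.5, "YES"), (3.0, 10.0, "YES"),
    (0.0, 4.0, "NO"), (5.0, 12.0, "NO"), (15.4, 5.0, "NO"), (0.5, 14.0, "NO"),
]
rule_sn, rule_sp = evaluate(toy_rows, 12.0, 15.0)
best_aa, best_psort, best_sn, best_sp = best_thresholds(
    toy_rows, [10.0, 12.0, 18.0], [11.0, 15.0, 16.0])
print(rule_sn, rule_sp, best_aa, best_psort)
```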
Sensitivity and specificity performance measures of binary classification on different test datasets when using machine learning algorithms with different training datasets
| Test dataset (rows) / training dataset (columns) | T. gondii SN | T. gondii SP | Plasmodium SN | Plasmodium SP | C. elegans SN | C. elegans SP | Combined species SN | Combined species SP | Benchmark SN | Benchmark SP |
|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | | | | |
| T. gondii | 1.00 | 0.81 | 0.95 | 0.89 | 1.00 | 0.83 | 1.00 | 0.83 | 1.00 | 0.83 |
| Plasmodium | 0.84 | 0.90 | 1.00 | 1.00 | 0.85 | 0.96 | 1.00 | 0.92 | 1.00 | 0.98 |
| C. elegans | 0.87 | 0.93 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 0.98 |
| Combined species | 0.87 | 0.92 | 1.00 | 0.99 | 0.98 | 0.99 | 1.00 | 0.98 | 1.00 | 0.97 |
| Benchmark | 0.86 | 0.91 | 0.96 | 0.96 | 0.97 | 0.91 | 1.00 | 1.00 | | |
| | | | | | | | | | | |
| T. gondii | 0.51 | 0.06 | 0.96 | 0.88 | 1.00 | 0.83 | 1.00 | 0.91 | 1.00 | 0.83 |
| Plasmodium | 0.82 | 0.99 | 0.98 | 0.96 | 0.95 | 0.96 | 1.00 | 1.00 | 1.00 | 0.98 |
| C. elegans | 0.87 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 |
| Combined species | 0.87 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 | 0.99 | 1.00 | 0.98 |
| Benchmark | 0.85 | 0.99 | 0.97 | 0.98 | 0.97 | 0.96 | 0.98 | 0.97 | | |
| | | | | | | | | | | |
| T. gondii | 0.97 | 0.90 | 1.00 | 0.83 | 1.00 | 0.89 | 1.00 | 1.00 | 1.00 | 0.83 |
| Plasmodium | 0.87 | 1.00 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 |
| C. elegans | 0.83 | 1.00 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Combined species | 0.84 | 1.00 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 |
| Benchmark | 0.82 | 1.00 | 0.99 | 0.99 | 0.97 | 0.99 | 0.99 | 0.99 | | |
| | | | | | | | | | | |
| T. gondii | 0.80 | 0.83 | 1.00 | 0.83 | 0.95 | 0.83 | 1.00 | 0.83 | 0.90 | 0.78 |
| Plasmodium | 0.77 | 0.96 | 0.95 | 0.84 | 0.88 | 0.96 | 0.99 | 0.94 | 0.81 | 0.96 |
| C. elegans | 0.88 | 0.99 | 0.99 | 0.95 | 0.96 | 0.98 | 0.99 | 0.99 | 0.95 | 0.98 |
| Combined species | 0.87 | 0.98 | 0.99 | 0.94 | 0.97 | 0.98 | 0.96 | 0.97 | 0.92 | 0.97 |
| Benchmark | 0.93 | 0.96 | 1.00 | 0.90 | 0.96 | 0.96 | 0.98 | 0.96 | | |
| | | | | | | | | | | |
| T. gondii | 1.00 | 0.91 | 1.00 | 0.78 | 1.00 | 0.83 | 1.00 | 0.83 | 1.00 | 0.83 |
| Plasmodium | 0.97 | 0.98 | 0.98 | 0.99 | 1.00 | 0.92 | 1.00 | 0.96 | 1.00 | 0.98 |
| C. elegans | 0.87 | 1.00 | 0.92 | 0.95 | 1.00 | 0.98 | 0.97 | 0.98 | 1.00 | 0.99 |
| Combined species | 0.89 | 0.99 | 0.93 | 0.95 | 1.00 | 0.97 | 0.98 | 0.97 | 1.00 | 0.98 |
| Benchmark | 0.81 | 1.00 | 0.97 | 0.94 | 1.00 | 0.93 | 1.00 | 1.00 | | |
| | | | | | | | | | | |
| T. gondii | 0.98 | 0.90 | 0.99 | 0.83 | 1.00 | 0.84 | 1.00 | 0.91 | 0.99 | 0.83 |
| Plasmodium | 0.88 | 0.92 | 0.99 | 0.89 | 0.99 | 0.97 | 0.97 | 0.98 | 0.93 | 0.97 |
| C. elegans | 0.83 | 0.99 | 0.92 | 0.98 | 0.99 | 0.99 | 1.00 | 1.00 | 0.98 | 0.97 |
| Combined species | 0.91 | 0.96 | 0.93 | 0.98 | 0.99 | 0.98 | 0.99 | 0.98 | 0.97 | 0.97 |
| Benchmark | 0.78 | 0.97 | 0.97 | 0.97 | 0.99 | 0.95 | 1.00 | 0.95 | | |
| | | | | | | | | | | |
| T. gondii | 0.83 | 0.92 | 0.89 | 1.00 | 0.89 | 0.89 | 1.00 | 0.89 | 1.00 | 0.83 |
| Plasmodium | 0.88 | 0.97 | 0.98 | 0.98 | 0.96 | 0.98 | 1.00 | 0.98 | 1.00 | 0.98 |
| C. elegans | 0.83 | 0.89 | 0.98 | 0.99 | 0.94 | 0.99 | 0.99 | 1.00 | 0.91 | 0.99 |
| Combined species | 0.84 | 0.91 | 0.98 | 0.98 | 0.99 | 0.99 | 0.92 | 0.99 | 0.93 | 0.98 |
| Benchmark | 0.74 | 0.99 | 0.96 | 0.96 | 0.94 | 0.99 | 0.83 | 0.92 | | |
Abbreviations: SN = sensitivity; SP = specificity; T. gondii = Toxoplasma gondii; Plasmodium = species in the genus Plasmodium including falciparum, yoelii yoelii, and berghei; C. elegans = Caenorhabditis elegans; Combined species = combination of T. gondii, Plasmodium, and C. elegans datasets; Benchmark = dataset comprising evidence for T. gondii and Neospora caninum proteins from published studies.
Results from the same input data fluctuate. The algorithm-specific R functions were executed 100 times and the prediction outcomes (false positives and negatives, true positives and negatives) were averaged to calculate SN and SP.
Obtained from multiple cross-validations i.e. the algorithm-specific R functions randomly used 70% of the training dataset to build a model and the remaining 30% was used in the binary classification test. The cross-validation was executed 100 times and the prediction outcomes were averaged to calculate SN and SP.
The values underlined denote the best performing training dataset for classifying the benchmark proteins.
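The repeated 70%/30% validation described in the notes above can be sketched as follows. The study used algorithm-specific R functions; this Python version substitutes a deliberately simple one-variable threshold classifier as a stand-in, and the names `repeated_holdout`, `fit_midpoint`, and `predict_threshold`, as well as the toy scores, are assumptions. Pooling the outcome counts over all runs gives the same SN and SP as averaging the counts first, because the number of runs cancels in the ratio.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def fit_midpoint(xs: Sequence[float], ys: Sequence[str]) -> float:
    """Stand-in model: threshold halfway between the class means."""
    yes = [x for x, y in zip(xs, ys) if y == "YES"]
    no = [x for x, y in zip(xs, ys) if y == "NO"]
    if not yes or not no:  # degenerate split; fall back to a fixed cutoff
        return 0.0
    return (mean(yes) + mean(no)) / 2


def predict_threshold(model: float, x: float) -> str:
    return "YES" if x > model else "NO"


def repeated_holdout(xs: Sequence[float], ys: Sequence[str],
                     fit: Callable, predict: Callable,
                     runs: int = 100, train_fraction: float = 0.7,
                     seed: int = 0) -> Tuple[float, float]:
    """Build a model on a random 70% and test on the remaining 30%,
    repeated `runs` times; SN and SP come from the pooled outcomes."""
    rng = random.Random(seed)
    indices: List[int] = list(range(len(xs)))
    tp = fn = tn = fp = 0
    for _ in range(runs):
        rng.shuffle(indices)
        cut = int(round(train_fraction * len(indices)))
        train, test = indices[:cut], indices[cut:]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            predicted = predict(model, xs[i])
            if ys[i] == "YES":
                tp += predicted == "YES"
                fn += predicted == "NO"
            else:
                tn += predicted == "NO"
                fp += predicted == "YES"
    sn = tp / (tp + fn) if (tp + fn) else float("nan")
    sp = tn / (tn + fp) if (tn + fp) else float("nan")
    return sn, sp


# Hypothetical, perfectly separable toy scores (10 YES, 10 NO).
toy_x = [float(v) for v in range(20, 30)] + [float(v) for v in range(0, 10)]
toy_y = ["YES"] * 10 + ["NO"] * 10
cv_sn, cv_sp = repeated_holdout(toy_x, toy_y, fit_midpoint, predict_threshold)
print(cv_sn, cv_sp)
```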
Figure 5. Overview of a proposed classification system using a pool of machine learning algorithms to determine the suitability of proteins for vaccine candidacy. Protein sequences for a target species are input into seven prediction programs. These programs provide evidence as to whether the proteins associated with the sequences are either membrane-associated or secreted, and contain epitopes. Evidence for each protein is collated to create an evidence profile. A collection of evidence profiles is used as input to a pool of six independent machine learning algorithms for classification. Final classification is based on voting and a majority rule decision.
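The voting step in Figure 5 can be sketched as a simple majority count. A footnote to the misclassified-proteins table states that a YES classification is adopted for tied votes (e.g. Q27298), and the sketch follows that rule. The function name `majority_vote` is hypothetical, and the three YES votes in the example are inferred from the algorithms not listed as misclassifying Q27298.

```python
from collections import Counter
from typing import Mapping


def majority_vote(votes: Mapping[str, str]) -> str:
    """Return the most frequent classification among the algorithms'
    votes; a tie is resolved as YES."""
    for algorithm, vote in votes.items():
        if vote not in ("YES", "NO"):
            raise ValueError(f"unexpected vote {vote!r} from {algorithm}")
    counts = Counter(votes.values())
    return "YES" if counts["YES"] >= counts["NO"] else "NO"


# Q27298: AB, RF and SVM predicted NO; the remaining three votes are
# inferred as YES, giving a 3-3 tie.
q27298_votes = {"AB": "NO", "RF": "NO", "SVM": "NO",
                "NB": "YES", "kNN": "YES", "NN": "YES"}
final = majority_vote(q27298_votes)
print(final)
```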
Misclassified proteins from the benchmark dataset by machine learning algorithms
| Algorithm | False positives | False negatives |
|---|---|---|
| Adaptive boosting | | Q27298 |
| k-Nearest neighbour | B6K9N1, B9Q0C2 | B0LUH4, P84343, Q9U483 |
| Naive Bayes Classifier | B9PK71 | |
| Neural Networks | | |
| Random Forest | | Q27298, B9PRX5 |
| Support Vector Machines | | Q27298, B9QH60, B9PRX5 |
Protein identifiers e.g. Q27298 are UniProt IDs. Refer to Additional file 1 for a description of the protein and its relevance as a vaccine candidate.
Description of proteins from the benchmark dataset that were misclassified by at least one machine learning algorithm
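| UniProt ID | Protein | Subcellular location | Expected classification | Final classification | Misclassified by | Evidence profile |
|---|---|---|---|---|---|---|
| Q27298 | SAG1 protein (P30 | Membrane | YES | YES | AB RF SVM | Q27298,0,Y,0.297,0.141,M,2,7.30,0.56,0,21.5,Secreted,0.255,0.205,YES |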
| B0LUH4 | Microneme protein 13 | Unknown | YES | YES | kNN | B0LUH4,0,Y,0.888,0.907,S,1,0.11,0.11,0,29.0,Secreted,0.270,0.355,YES |
| P84343 | Peptidyl-prolyl cis-trans isomerase | Unknown | YES | YES | kNN | P84343,0,Y,0.817,0.963,S,1,1.11,1.11,0,29.0,Secreted,0.465,0.536,YES |
| Q9U483 | Microneme protein Nc-P38 | Unknown | YES | YES | kNN | Q9U483,0,Y,0.427,0.587,S,4,0.23,0.23,0,30.0,Secreted,0.355,0.1736,YES |
| B9PRX5 | Proteasome subunit alpha type | Unknown | YES | YES | RF SVM | B9PRX5,0,Y,0.250,0.254,M,2,16.81,7.23,0,22.0,Secreted,0.648,0.515,YES |
| B9QH60 | Acetyl-CoA carboxylase, putative | Unknown | YES | YES | SVM | B9QH60,1,N,0.322,0.019,M,1,22.02,0.00,1,5.0,Secreted,0.846,0.437,YES |
| B6K9N1 | Cytochrome P450 (putative) | Unknown | NO | NO | kNN | B6K9N1,1,N,0.131,0.041,U,2,15.35,0.03,0,5.0,Membrane,0.197,0.480,NO |
| B9Q0C2 | Anamorsin homolog | Cytoplasm | NO | NO | kNN | B9Q0C2,0,Y,0.245,0.108,U,4,0.54,0.00,0,20.0,Secreted,0.382,0.210,NO |
| B9PK71 | DNA-directed RNA polymerase subunit | Nucleus | NO | NO | NB | B9PK71,0,N,0.188,0.223,U,4,0.00,0.00,0,22.0,Secreted,0.368,0.380,NO |
aFinal classification takes into account predictions from each algorithm and the most frequent classification type is used i.e. a majority rule approach. A YES classification is adopted for tied votes e.g. Q27298.
bAlgorithms are executed multiple times on the same input data. An in-house Perl script summarises the multiple runs and indicates the number of times (as a percentage) that the predicted classification of a protein differs from the expected classification. Proteins are regarded as misclassified if this percentage is 100%.
cColumn headers: 1 = ID, 2 = Phobius_TM, 3 = Phobius_SP, 4 = SignalP, 5 = TargetP_SP, 6 = TargetP_loc, 7 = TargetP_RC, 8 = TMHMM_AA, 9 = TMHMM_First60, 10 = TMHMM_TM, 11 = WoLF_PSORT, 12 = WoLF_PSORT_annotation, 13 = MHCI, 14 = MHCII, 15 = Expected classification.
Abbreviations: AB = Adaptive boosting, RF = random forest, SVM = support vector machines, NB = Naive Bayes, kNN = k-Nearest neighbour, NN = neural network.
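The comma-separated evidence profiles in the table above can be read into named fields using the column headers given in the footnote. The sketch below is illustrative only: the name `parse_profile`, the key `Expected` for column 15, and the choice of which columns to treat as integers or floats are assumptions.

```python
from typing import Dict, Union

# Column order as listed in the column-headers footnote above.
COLUMNS = [
    "ID", "Phobius_TM", "Phobius_SP", "SignalP", "TargetP_SP", "TargetP_loc",
    "TargetP_RC", "TMHMM_AA", "TMHMM_First60", "TMHMM_TM", "WoLF_PSORT",
    "WoLF_PSORT_annotation", "MHCI", "MHCII", "Expected",
]
# Assumed data types for the numeric columns; the rest stay as text.
INT_COLUMNS = {"Phobius_TM", "TargetP_RC", "TMHMM_TM"}
FLOAT_COLUMNS = {"SignalP", "TargetP_SP", "TMHMM_AA", "TMHMM_First60",
                 "WoLF_PSORT", "MHCI", "MHCII"}

Value = Union[str, int, float]


def parse_profile(line: str) -> Dict[str, Value]:
    """Split one evidence profile into a dict keyed by column name."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    profile: Dict[str, Value] = {}
    for name, raw in zip(COLUMNS, fields):
        if name in INT_COLUMNS:
            profile[name] = int(raw)
        elif name in FLOAT_COLUMNS:
            profile[name] = float(raw)
        else:
            profile[name] = raw
    return profile


# Profile for Q27298, copied from the table above.
sag1 = parse_profile(
    "Q27298,0,Y,0.297,0.141,M,2,7.30,0.56,0,21.5,Secreted,0.255,0.205,YES")
print(sag1["ID"], sag1["TMHMM_AA"], sag1["WoLF_PSORT"], sag1["Expected"])
```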