Adonis D'Mello, Christian P Ahearn, Timothy F Murphy, Hervé Tettelin.
Abstract
BACKGROUND: Reverse vaccinology accelerates the discovery of potential vaccine candidates (PVCs) prior to experimental validation. Current programs typically use one bacterial proteome to identify PVCs, either through a filtering architecture built on feature prediction programs or through a machine learning approach. Filtering approaches may eliminate potential antigens because of limitations in the accuracy of the prediction tools used. Machine learning approaches depend heavily on the selection of training datasets with experimentally validated antigens (positive control) and non-protective antigens (negative control). The use of one or a few bacterial proteomes does not assess PVC conservation among strains, an important feature of vaccine antigens.
Keywords: Antigen scoring; Bacterial; Core genome; Orthology; Pan-genome; Reverse vaccinology; Vaccines
Year: 2019 PMID: 31842745 PMCID: PMC6916091 DOI: 10.1186/s12864-019-6195-y
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
List of all the programs run in ReVac and their predicted features, with the scoring scheme for each program's output. Additional scoring descriptions based on outputs from multiple programs are listed at the bottom.
| Module (Reference) | Gene property | Evidence | Output | Scoring weight (points) | Example Protein ( | Example Feature | Example Weight | Example Cumulative Score |
|---|---|---|---|---|---|---|---|---|
| PSORTb* [ | Surface exposure^ | Sub-cellular localization | Surface localization prediction | + 1 if surface exposed | 9.52|OuterMembrane | Positive surface exposure | 1 | 1 |
| −1 if cytoplasmic | ||||||||
| LipoP [ | Surface exposure^ | Lipoprotein motif | Presence or absence of a motif | 1 or 0 | SpI|18.809 | Positive for lipoprotein motif | 1 | 2 |
| TMHMM [ | Surface exposure^ | Transmembrane spans | Number of helices | If surface exposed < 2: + 0.5 | 1 | Presence of 1 TMH | 0.5 | 2.5 |
| 2: 0 | ||||||||
| 3: −0.2 | ||||||||
| ≥4: −2 | ||||||||
| If cytoplasmic | ||||||||
| −2 | ||||||||
| SignalP [ | Surface exposure^ | Signal peptide | Signal peptide | + 1 for presence | MNKTSTQLGLLAVSVSLIMASLPAHA | Signal peptide present | 1 | 3.5 |
| SPAAN [ | Surface exposure^ | Adhesin protein | Adhesin protein score | + 0.5 if above cutoff score (default 0.75) | 0.907057 | Predicted Adhesin | 0.5 | 4 |
| Surface HMMs [ | Surface exposure Function | HMM for motif or function | HMM title and score | 0.5 | None | No HMM alignment | 0 | 4 |
| Antigenic [ | Antigenicity | Antigenic epitopes | Peptides, scores, protein coverage | 0.5 | QLGLLAVSVSLIMASLPAHAVYLDR|1.193|10(169)|41.73 | Predicted antigenic region. | 0.5 | 4.9173 |
| + 0–1 proportional to coverage | 41.73% of the protein is antigenic | 0.4173 | ||||||
| Bcell Pred [ | Antigenicity | B cell epitopes, 6 prediction methods combined | Number of peptides, protein coverage | + 0–1 proportional to coverage | 6(59)|14|14.57 | Predicted B-cell Epitopes | 0.1457 | 5.09809 |
| + 0–1 proportional to total number of peptides of a given length per protein | 14.57% predicted in 14 peptides of 7AA | 14/(405–7 + 1) = 0.03509 | ||||||
| MHC class I [ | Antigenicity | MHC-I epitopes | Number of peptides, protein coverage | + 0–1 proportional to coverage if 80–90% | 6(378)|124|73|93.33 | Predicted MHC binding | | 6.41039 |
| + 0–1 proportional to total number of peptides of a given length per protein | 93.33% predicted in 124 peptides of 9AA | 124/(405–9 + 1) = 0.3123 | ||||||
| + 1 if coverage is ≥ 90% | 1 | |||||||
| NetCTLpan [ | Antigenicity | MHC-I epitopes | Number of peptides, protein coverage | + 0–1 proportional to coverage if 80–90% | 12(334)|70|12|82.47 | Predicted MHC binding | 0.8247 | 7.41139 |
| + 0–1 proportional to total number of peptides of a given length per protein | 82.47% predicted in 70 peptides | 70/(405–9 + 1) = 0.1763 | ||||||
| + 1 if coverage is ≥ 90% | ||||||||
| Immunogenicity (MHC-I) [ | Antigenicity | MHC-I epitopes immunogenicity | Number of peptides, protein coverage | + 0–1 proportional to coverage | 7(76)|14|36|18.77 | Predicted immunogenic region | 0.1877 | 8.63435 |
| + 0–1 proportional to total number of peptides of a given length per protein | 14 peptides of 9AA | 14/(405–9 + 1) = 0.035264 | ||||||
| + 1 if coverage is ≥ 10% | 1 | |||||||
| MHC class II [ | Antigenicity | MHC-II epitopes | Number of peptides, protein coverage | + 0–1 proportional to coverage if 80–90% | 2(404)|315|61|99.75 | Predicted MHC-II binding | | 10.43995 |
| + 0–1 proportional to total number of peptides of a given length per protein | 99.75% predicted in 315 peptides of 15AA | 315/(405–15 + 1) = 0.8056 | ||||||
| + 1 if coverage is ≥ 90% | 1 | |||||||
| BLAT (IEDBb database*) [ | Antigenicity | Similarity to curated epitopes from IEDB | Protein coverage | + 0–1 proportional to coverage | None | No hits to epitope database | 0 | 10.43995 |
| + 1 if coverage is > 70% | ||||||||
| Autoimmunity [ | Autoimmunity | Similarity to human proteins | Protein coverage | + 1 if no autoimmunity | None | No hits to Human | 1 | 11.43995 |
| −2 × (0 to 1) proportional to coverage | ||||||||
| −2 if coverage is > 20% | ||||||||
| Autoimmunity Commensals [ | Autoimmunity | Similarity to user-defined commensal organisms’ proteins | Protein coverage | + 1 if no autoimmunity | 3(39)|9.63 | 9.63% similarity to commensal | 0.0963 × (−2) = −0.1926 | 11.24735 |
| −2 × (0 to 1) proportional to coverage | (Negative feature) | |||||||
| −2 if coverage is > 20% | ||||||||
| SSRd Finder [ | Variability of expression | Phase variation | Number of simple sequence repeats | + 1 if no SSR | None | No DNA SSR found | 1 | 12.24735 |
| −0.5 for each SSR | ||||||||
| −0.25 for each SSR in the promoter | ||||||||
| −0.5 for each SSR with frameshift potential | ||||||||
| −0.01 times the length of the SSR. | ||||||||
| SSRd Finder Protein [ | Variability of expression | Potential conformational shifts | Number of protein tandem repeats | −0.2 for each protein repeat, max penalty of 1. | None | No protein SSR found | 0 | 12.24735 |
| IslandPath [ | Potential for horizontal gene transfer | Genomic Islands | Presence in a GI | −1 for each protein in a GI | None | Not present in a GI | 0.5 | 12.74735 |
| + 0.5 for absence | ||||||||
| Jaccard Clusters [ | Conservation | Orthologous clusters | Presence in an orthologous cluster | + 1 for each protein in a COG in ≥ 90% of genomes in at least one method | j_ortholog_cluster_3254|63 | Present in > 90% of the genomes | ||
| −0.25 for each protein in a COG in < 90% of genomes | ||||||||
| PanOCT [ | Conservation | Orthologous clusters | Presence in an orthologous cluster | + 1 for each protein in a COG in ≥ 90% of genomes in at least one method | PanOCT_cluster_108|63 | Present in > 90% of the genomes | ||
| −0.25 for each protein in a COG in < 90% of genomes | 1 | 13.74735 | ||||||
| OrthoMCL [ | Conservation | Orthologous clusters | Presence in an orthologous cluster | + 1 for each protein in a COG in ≥ 90% of genomes in at least one method | orthomcl_cluster1407|63 | Present in > 90% of the genomes | ||
| −0.25 for each protein in a COG in < 90% of genomes | ||||||||
| LS-BSR [ | Conservation | Orthologous clusters | Presence in an orthologous cluster | + 1 for each protein in a COG in ≥ 90% of genomes in at least one method | 63 | Present in > 90% of the genomes | ||
| −0.25 for each protein in a COG in < 90% of genomes | ||||||||
| Attributorc | Function | Annotation & GO Terms | Annotation & GO Terms | + 1 for each GO term in our surface exposed GO db | hypothetical_protein_domain_protein | No conclusive GO terms predicted | 0 | 13.74735 |
| −1 for each GO term in our non-surface exposed GO db | ||||||||
| + 1 if presence of surface exposure keywords if predicted periplasmic | ||||||||
| (a) HMM: Hidden Markov Model. This component includes a collection of HMMs selected from the Pfam database for motifs associated with surface proteins. | ||||||||
| (b) IEDB: Immune Epitope Database and Analysis Resource | ||||||||
| (c) In-house Perl/Python script | ||||||||
| (d) SSRs: simple sequence repeats | ||||||||
| *If any three of PSORTB, LipoP, SignalP and IEDB Database matches are all positive, weight is incremented by 2. | True | 2 | 15.74735 | |||||
| ^If all surface exposure tools fail a conclusive prediction, weight is decremented by 2 | False | 0 | 15.74735 | |||||
| †Each protein is given an additional 0.1 for > 90% presence in each of the clustering algorithms, Jaccard Clusters, PanOCT, OrthoMCL and LS-BSR, and penalized 0.5 for < 90% presence or absence of a cluster for each tool. | True | 0.4 | ReVac Score = 16.14735 | |||||
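
The epitope rows of the scoring table combine two terms: a coverage term and a peptide-count term normalized by the number of possible windows of a given peptide length, n / (L − k + 1). The following is a minimal Python sketch of that arithmetic, using the worked values shown in the table for the 405-residue example protein. The function names, the `floor_pct` and `bonus_pct` parameters, and the choice to award 0 points below the 80% floor are assumptions made for illustration; they are not taken from the ReVac source code.

```python
from typing import Optional


def peptide_density(n_peptides: int, protein_length: int, peptide_length: int) -> float:
    """Predicted peptides divided by the number of possible windows, capped at 1."""
    windows = protein_length - peptide_length + 1
    if windows <= 0:
        return 0.0
    return min(n_peptides / windows, 1.0)


def coverage_points(coverage_pct: float,
                    floor_pct: float = 0.0,
                    bonus_pct: Optional[float] = None) -> float:
    """Points earned from the percentage of the protein covered by predictions.

    Coverage at or above bonus_pct earns a flat 1 point; otherwise coverage is
    converted to a 0-1 fraction. Returning 0 below floor_pct is an assumption:
    the table does not state what happens under the floor.
    """
    if bonus_pct is not None and coverage_pct >= bonus_pct:
        return 1.0
    if coverage_pct < floor_pct:
        return 0.0
    return coverage_pct / 100.0


PROTEIN_LENGTH = 405  # length of the example protein used in the table

# Values below are copied from the "Example" columns of the table.
bcell = coverage_points(14.57) + peptide_density(14, PROTEIN_LENGTH, 7)
mhc1 = coverage_points(93.33, 80, 90) + peptide_density(124, PROTEIN_LENGTH, 9)
netctlpan = coverage_points(82.47, 80, 90) + peptide_density(70, PROTEIN_LENGTH, 9)
mhc2 = coverage_points(99.75, 80, 90) + peptide_density(315, PROTEIN_LENGTH, 15)

for name, points in [("Bcell Pred", bcell), ("MHC class I", mhc1),
                     ("NetCTLpan", netctlpan), ("MHC class II", mhc2)]:
    print(f"{name}: {points:.4f}")
```

Under these assumptions, each module's contribution equals the step between consecutive entries in the Example Cumulative Score column, which is a convenient check when adapting the weights.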
Fig. 1 Schematic of the ReVac workflow, its components and underlying features. Blue arrows indicate the components where control datasets were used to develop the scoring algorithm. Red arrows indicate a user’s input query dataset, which runs through all components and the scoring algorithm, to output a list of prioritized candidates for the supplied species. Scoring based on core genes or orthology components is indicated by the black arrow
Fig. 2 A density plot showing the scores for all sequences run through ReVac, and the cutoff for our M. catarrhalis and NTHi datasets
Examples of control proteins used to develop the scoring scheme, and a summary of the outputs from each of ReVac’s components
| General Information | |||||||
| No. | ReVac Score | Score Breakdown | Organism | Gram Stain | Type | ||
| 1 | 14.853 | 15.253 − 0.400 | | – | Antigen | ||
| 2 | 13.709 | 13.709 − 0.000 | Non-typable | – | Antigen | ||
| 3 | 9.049 | 9.049 − 0.000 | | – | Antigen | ||
| 4 | 8.192 | 8.192 − 0.000 | | + | Antigen | ||
| 5 | 6.791 | 6.791 − 0.000 | | + | Antigen | ||
| 6 | 6.32 | 6.520 − 0.200 | | – | Antigen | ||
| 7 | 5.768 | 7.768 − 2.000 | | + | Non Antigen | ||
| 8 | 2.475 | 5.542 − 3.066 | | + | Non Antigen | ||
| Surface Exposure Predictions | |||||||
| No. | PSORTB Localization | LipoProtein | Transmembrane Helices | Signal Peptide | SPAAN adhesin ratio | HMM mapping to surface exposed database | Annotation/GO Terms |
| 1 | OuterMembrane | SignalPeptidase I | None | MNMSLSRIVKAAPLRRTTLAMALGALGAAPAAHA | None | Positive | outer membrane autotransporter barrel|GO:0009405,GO:0015474,GO:0045203,GO:0046819 |
| 2 | OuterMembrane | SignalPeptidase II | None | MNKFVKSLLVAGSVAALAACSSSNNDA | None | Positive | peptidoglycan-associated lipoprotein|GO:0009279 |
| 3 | None | SignalPeptidase II | None | MQFSKSIPLFFLFSIPFLA | None | Positive | Bacterial extracellular solute-binding protein |
| 4 | Cellwall | SignalPeptidase I | 1 | None | 0.782535 | Positive | hypothetical protein |
| 5 | Extracellular | Intracellular | None | None | None | None | Thiol-activated cytolysin family protein|GO:0015485,GO:0009405 |
| 6 | Periplasmic | SignalPeptidase II | None | MFKRSVIAMACIFALSACG | None | None | Transferrin binding family protein|GO:0016020 |
| 7 | None | Intracellular | None | None | None | None | Capsular polysaccharide synthesis family protein |
| 8 | None | Intracellular | None | None | None | None | shikimate dehydrogenase ec::1.1.1.25|GO:0004764,GO:0009423 |
| Antigenicity Predictions (a) | |||||||
| No. | Antigenicity | B cell epitopes | MHC I binding | MHC II binding | MHC binding + Antigen Processing | Immunogenicity within MHC complex | Alignment to curated epitopes |
| 1 | 45.05% | 15.16% | 94.07% | 100.00% | 61.10% | 13.08% | 26.92% |
| 2 | 30.72% | 5.23% | 96.73% | 94.12% | 79.08% | 17.65% | 99.35% |
| 3 | 44.02% | 16.03% | 94.57% | 100.00% | 69.57% | 23.91% | None |
| 4 | 50.40% | 38.97% | 83.10% | 90.46% | 43.34% | 1.79% | None |
| 5 | 43.74% | 13.80% | 95.33% | 98.94% | 73.04% | 12.10% | 22.08% |
| 6 | 30.33% | 15.16% | 81.56% | 86.68% | 46.93% | 1.84% | None |
| 7 | 48.94% | 4.61% | 96.81% | 100.00% | 81.91% | 30.85% | None |
| 8 | 34.32% | 7.75% | 96.31% | 98.89% | 77.49% | 16.61% | None |
| Adverse Features | |||||||
| No. | Autoimmunity with humans | Repeat regions genes & copy number | Repeat regions proteins & copy number | ||||
| 1 | None | None | |APAGGAVPGG 2||PQP 3| | ||||
| 2 | None | None | None | ||||
| 3 | None | None | None | ||||
| 4 | None | None | None | ||||
| 5 | None | None | None | ||||
| 6 | None | None | |ARFRRS 2| | ||||
| 7 | None | None | None | ||||
| 8 | 3.32% | None | None | ||||
(a) Percentages are relative to the length of the amino acid sequence
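
The Score Breakdown column for the control proteins appears to list positive points followed by penalty points, with the ReVac Score as their difference. The sketch below assumes that reading. It recomputes each net score from the tabulated values and checks that the control antigens outrank the non-antigens. The tuple layout and function names are illustrative only.

```python
# (row number, reported ReVac score, positive points, penalty points, type),
# copied from the general information panel of the control protein table.
CONTROLS = [
    (1, 14.853, 15.253, 0.400, "Antigen"),
    (2, 13.709, 13.709, 0.000, "Antigen"),
    (3, 9.049, 9.049, 0.000, "Antigen"),
    (4, 8.192, 8.192, 0.000, "Antigen"),
    (5, 6.791, 6.791, 0.000, "Antigen"),
    (6, 6.32, 6.520, 0.200, "Antigen"),
    (7, 5.768, 7.768, 2.000, "Non Antigen"),
    (8, 2.475, 5.542, 3.066, "Non Antigen"),
]


def net_score(positive: float, penalty: float) -> float:
    """Net score under the assumed reading: positive points minus penalties."""
    return positive - penalty


def max_rounding_gap(rows) -> float:
    """Largest absolute difference between recomputed and reported scores."""
    return max(abs(net_score(pos, pen) - reported)
               for _, reported, pos, pen, _ in rows)


def antigens_outrank_non_antigens(rows) -> bool:
    """True if every antigen has a higher reported score than every non-antigen."""
    antigen_scores = [score for _, score, _, _, kind in rows if kind == "Antigen"]
    other_scores = [score for _, score, _, _, kind in rows if kind != "Antigen"]
    return min(antigen_scores) > max(other_scores)


print(f"largest gap: {max_rounding_gap(CONTROLS):.3f}")
print(f"antigens rank above non-antigens: {antigens_outrank_non_antigens(CONTROLS)}")
```

The only row that does not recompute exactly is row 8 (5.542 − 3.066 = 2.476 against a reported 2.475), which is consistent with the tabulated components having been rounded to three decimals.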
Top candidates selected from ReVac’s output for NTHi and M. catarrhalis (M.W./pI represent molecular weights and isoelectric points)
| Example locus tag | Amino acid length | M.W./pI | Annotation | Gene |
|---|---|---|---|---|
| NTHi | ||||
| 84P48H1_01193 | 562 | 62.24/9.43 | Hemopexin transporter | hxuB |
| 84P8H1_00650 | 274 | 30.48/8.97 | Lipoprotein E precursor | Hel (P4) |
| ADC73_RS07905 | 405 | 44.01/9.47 | Hypothetical protein (porin family) | None |
| E9Y_00353 | 537 | 60.40/8.95 | Protein of unknown function (DUF560) | None |
| M137P16B1_1805 | 344 | 38.37/8.95 | Gram-negative porin protein | None |
| AO373_1452a | 331 | 36.16/9.9 | Ferric iron ABC transporter iron-binding protein | None |
All the above candidates were surface exposed, predicted antigenic, conserved core proteins with low autoimmunity and no repeat regions. Further information about these candidates is available in Additional files 2 and 3, respectively. (a) Present in M. canis
Fig. 3 a An estimation of ReVac’s CPU time focused on its rate-limiting steps using batches of 1000 and 5000 proteins. Multiple runs (one for each time point on the figure) were submitted in succession on a single host, using increasing numbers of dedicated cores, each running the same batch of the respective 1000 (solid line) or 5000 proteins (dashed line). The total numbers of proteins analyzed using 1 and 48 cores are provided as labels for comparison to (b). b Real-life CPU time estimates derived from the entire ReVac workflow running on 150–300 compute clusters through Ergatis, each utilizing a single host in most cases
Fig. 4 a Whole genome tree of the 69 M. catarrhalis genomes used in ReVac. The four clades are color-coded: blue indicates a sero-resistant clade, green a sero-sensitive clade, orange older isolates of M. catarrhalis dating to 1932, and red misannotated M. canis genomes from NCBI. b Whole genome tree of the 128 M. catarrhalis genomes currently available on NCBI, which maintains the same topology as (a). c A protein alignment tree of one of ReVac’s top candidates, which separates sero-sensitive and sero-resistant clades, but is absent in the other two clades (also present in the respective clades of (b)). d A protein alignment tree of the candidate iron transporter that replicates the whole genome tree topology
Fig. 5 Whole genome tree of the 270 NTHi genomes used in ReVac