Alejandro D Ricci, Mauricio Brunner, Diego Ramoa, Santiago J Carmona, Morten Nielsen, Fernán Agüero.
Abstract
The availability of highly parallelized immunoassays has renewed interest in the discovery of serology biomarkers for infectious diseases. Protein and peptide microarrays now provide a rapid, high-throughput platform for immunological testing and validation of potential antigens and B-cell epitopes. However, there is still a need for tools to prioritize and select relevant probes when designing these arrays. In this work we describe a computational method called APRANK (Antigenic Protein and Peptide Ranker), which integrates multiple molecular features to prioritize potentially antigenic proteins and peptides in a given pathogen proteome. These features include subcellular localization, presence of repetitive motifs, natively disordered regions, secondary structure, transmembrane spans and predicted interaction with the immune system. We trained and tested this method with a number of bacteria and protozoa causing human diseases: Borrelia burgdorferi (Lyme disease), Brucella melitensis (Brucellosis), Coxiella burnetii (Q fever), Escherichia coli (Gastroenteritis), Francisella tularensis (Tularemia), Leishmania braziliensis (Leishmaniasis), Leptospira interrogans (Leptospirosis), Mycobacterium leprae (Leprosy), Mycobacterium tuberculosis (Tuberculosis), Plasmodium falciparum (Malaria), Porphyromonas gingivalis (Periodontal disease), Staphylococcus aureus (Bacteremia), Streptococcus pyogenes (Group A Streptococcal infections), Toxoplasma gondii (Toxoplasmosis) and Trypanosoma cruzi (Chagas Disease). We evaluated this integrative method using non-parametric ROC curves and performed an unbiased validation using Onchocerca volvulus as an independent data set. We found that APRANK is successful in predicting antigenicity for all pathogen species tested, facilitating the production of antigen-enriched protein subsets. We make APRANK available to facilitate the identification of novel diagnostic antigens in infectious diseases.
Keywords: antigenicity; antigens; human pathogens; linear epitopes; prediction
Year: 2021 PMID: 34335615 PMCID: PMC8320365 DOI: 10.3389/fimmu.2021.702552
Source DB: PubMed Journal: Front Immunol ISSN: 1664-3224 Impact factor: 7.561
List of pathogen species used in this paper.
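| Pathogen Species | Disease | Group | Taxonomy (Phylum) |
|---|---|---|---|
| Borrelia burgdorferi | Lyme disease | Gram Negative Bacteria | Spirochaetia |
| Brucella melitensis | Brucellosis | Gram Negative Bacteria | Alpha-proteobacteria |
| Coxiella burnetii | Q fever | Gram Negative Bacteria | Gamma-proteobacteria |
| Escherichia coli | Gastroenteritis | Gram Negative Bacteria | Gamma-proteobacteria |
| Francisella tularensis | Tularemia | Gram Negative Bacteria | Gamma-proteobacteria |
| Leptospira interrogans | Leptospirosis | Gram Negative Bacteria | Spirochaetia |
| Porphyromonas gingivalis | Periodontal disease | Gram Negative Bacteria | Bacteroidetes |
| Mycobacterium leprae | Leprosy | Gram Positive Bacteria | Actinobacteria |
| Mycobacterium tuberculosis | Tuberculosis | Gram Positive Bacteria | Actinobacteria |
| Staphylococcus aureus | Bacteremia | Gram Positive Bacteria | Firmicutes |
| Streptococcus pyogenes | GAS infections | Gram Positive Bacteria | Firmicutes |
| Leishmania braziliensis | Leishmaniasis | Eukaryotic Protozoa | Euglenozoa |
| Plasmodium falciparum | Malaria | Eukaryotic Protozoa | Apicomplexa |
| Toxoplasma gondii | Toxoplasmosis | Eukaryotic Protozoa | Apicomplexa |
| Trypanosoma cruzi | Chagas Disease | Eukaryotic Protozoa | Euglenozoa |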
Predictors used to analyze different features of proteins and peptides.
| Focus | Feature | Predictor | Basis |
|---|---|---|---|
| Stimulation of an immune response | B-cell epitopes | BepiPred 1.0 | Antigenicity by HMM |
| Stimulation of an immune response | Binding to MHC Class II molecules | NetMHCIIpan 2.0 | ANN trained with peptide and MHC Class II sequence information |
| Peculiarities in the protein sequence | Glycosylation sites | NetOglyc 3.1d | ANN trained with mucin type GalNAc O-glycosylation sites in mammalian proteins |
| Peculiarities in the protein sequence | GPI-anchored proteins | PredGPI 1.4.3 | Discrimination of the anchoring signal by SVM and prediction of the most probable omega-site by HMM |
| Peculiarities in the protein sequence | Signal peptide cleavage sites | SignalP 4.0 | Prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several ANN |
| Peculiarities in the protein sequence | Tandem repeats | Xstream 1.71 | SE algorithm to explicitly locate exact and degenerate tandem repeats (TRs) of all periods in protein sequences |
| Three dimensional structure | Disorder | Iupred 1.0 | Amino acids favorable interactions potential |
| Three dimensional structure | Parallel coiled coil fold | Paircoil2 | Uses pairwise residue probabilities with the Pair coil algorithm and an updated coiled coil database |
| Three dimensional structure | Secondary Structure | NetSurfp 1.0 | ANN trained with sequence profiles and predicted secondary structure |
| Three dimensional structure | Surface access | NetSurfp 1.0 | ANN trained to predict the relative surface exposure of the individual amino acid residues |
| Three dimensional structure | Transmembrane helices in proteins | TMHMM 2.0c | Membrane protein topology prediction method based on a HMM |
| Molecular properties | Isoelectric point | Pepstats (EMBOSS 6.6.0.0) | Amino acids pK values |
| Molecular properties | Molecular Weight | Pepstats (EMBOSS 6.6.0.0) | Amino acids weights |
| Similarities within itself and with the host | Sequence similarity (pathogen/host) | CrossReactivity | Shared kmers between pathogen and host proteins |
| Similarities within itself and with the host | Sequence similarity (pathogen proteins) | SelfSimilarity | Shared kmers between pathogen proteins |
CrossReactivity and SelfSimilarity are custom Perl scripts. ANN, Artificial Neural Network; HMM, Hidden Markov Model; SE, Seed Extension; SVM, Support Vector Machine.
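The "shared kmers" basis of the two custom predictors can be illustrated with a minimal Python sketch. This is not the authors' Perl code: the function names, the default `k = 6`, and the choice to report a fraction of distinct k-mers are all assumptions made for illustration.

```python
def kmers(sequence, k):
    """Return the set of all overlapping substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_fraction(protein, references, k=6):
    """Fraction of the protein's distinct k-mers found in any reference sequence.

    `references` would be host proteins for a cross-reactivity style score,
    or the other pathogen proteins for a self-similarity style score.
    The value k=6 is an illustrative assumption, not a documented setting.
    """
    own = kmers(protein, k)
    if not own:
        return 0.0  # sequence shorter than k has no k-mers to compare
    reference_kmers = set()
    for ref in references:
        reference_kmers |= kmers(ref, k)
    return len(own & reference_kmers) / len(own)
```

For example, `shared_kmer_fraction("ABCDE", ["XXBCDXX"], k=3)` compares the three 3-mers of the query against the reference and finds one in common.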
Number of antigenic proteins and peptides for each species.
| Species | Group | Proteins: total | Proteins: antigenic (original) | Proteins: antigenic (after BLAST) | Peptides: total | Peptides: antigenic (original) | Peptides: antigenic (after kmer expansion) |
|---|---|---|---|---|---|---|---|
| B. burgdorferi | Gram - | 1,390 | 137 | 152 | 386,683 | 117 | 863 |
| B. melitensis | Gram - | 3,178 | 13 | 13 | – | – | – |
| C. burnetii | Gram - | 1,853 | 102 | 104 | – | – | – |
| E. coli | Gram - | 4,778 | 7 | 7 | 1,428,744 | 9 | 158 |
| F. tularensis | Gram - | 1,556 | 27 | 27 | – | – | – |
| L. interrogans | Gram - | 3,683 | 10 | 10 | 1,113,309 | 19 | 342 |
| P. gingivalis | Gram - | 1,881 | 10 | 11 | 626,536 | 165 | 1,181 |
| M. leprae | Gram + | 1,605 | 7 | 8 | 515,942 | 76 | 633 |
| M. tuberculosis | Gram + | 3,940 | 81 | 89 | 1,268,272 | 416 | 4,369 |
| S. aureus | Gram + | 2,607 | 16 | 16 | 758,970 | 55 | 575 |
| S. pyogenes | Gram + | 1,690 | 13 | 13 | 491,619 | 263 | 985 |
| L. braziliensis | Eukaryote | 8,084 | 8 | 12 | 4,964,396 | 14 | 182 |
| P. falciparum | Eukaryote | 5,337 | 106 | 131 | 4,009,580 | 562 | 9,120 |
| T. gondii | Eukaryote | 8,322 | 15 | 16 | 6,535,220 | 94 | 457 |
| T. cruzi | Eukaryote | 21,170 | 242 | 2,480 | 10,408,841 | 4,025 | 7,317 |
This table shows the number of antigenic proteins and sequences extracted from the literature and the final number after processing. For proteins, BLAST was used to also tag as antigenic other proteins of the same species that were similar to the antigenic ones. For peptides, a custom mapping method named ‘kmer expansion’ was used to tag peptides as antigenic based on the antigenic sequences reported in the literature (see Methods). We did not have peptide-level information for three of the species.
Figure 1. Schematic flowchart used to obtain APRANK’s species-specific models. With the aim of testing and tuning our method, training and prioritization were performed for both proteins and peptides using data from a single proteome of interest. This process was repeated for all of our 15 species.
Figure 2. Schematic flowchart used to obtain APRANK’s generic models. With the aim of creating a set of models that could make predictions for a wide range of species, training and prioritization were performed for both proteins and peptides using combined data from all of our 15 species. When testing the generic models, leave-one-out models were used, where 14 species were used to train the models and the 15th species to test them. This process was repeated for all of our 15 species.
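The leave-one-out scheme described for the generic models can be sketched as a simple split over species. The function name and the dictionary layout (species name mapped to a list of feature rows) are assumptions for illustration; the actual model fitting is left out.

```python
def leave_one_species_out(data_by_species):
    """Yield (held_out, training_rows, test_rows), holding out each species in turn.

    `data_by_species` is a hypothetical mapping from a species name to the
    list of rows (proteins or peptides with their features) for that species.
    """
    for held_out in data_by_species:
        # Pool the rows of every species except the one being held out.
        training_rows = [
            row
            for species, rows in data_by_species.items()
            if species != held_out
            for row in rows
        ]
        yield held_out, training_rows, data_by_species[held_out]
```

With 15 species this produces 15 splits, each training on the pooled rows of 14 species and testing on the remaining one.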
Figure 3. Performance of APRANK training using balanced or unbalanced data. Performance of APRANK’s species-specific models for B. burgdorferi and P. gingivalis. ROC curves for each iteration of training and testing are shown in light gray, and the average curves are shown in green (dashed lines).
Prediction results for the specific models.
| Species | Proteins: BTR | Proteins: mean AUC (unbalanced training data) | Proteins: mean AUC (balanced training data) | Peptides: BTR | Peptides: mean AUC (unbalanced training data) | Peptides: mean AUC (balanced training data) |
|---|---|---|---|---|---|---|
| B. burgdorferi | Yes | 0.809 ± 0.014 | 0.799 ± 0.017 | Yes | 0.767 ± 0.021 | 0.773 ± 0.020 |
| B. melitensis | Yes | 0.710 ± 0.037 | 0.700 ± 0.033 | – | – | – |
| C. burnetii | Yes | 0.611 ± 0.011 | 0.620 ± 0.010 | – | – | – |
| E. coli | | 0.511 ± 0.034 | 0.515 ± 0.039 | Yes | 0.584 ± 0.056 | 0.633 ± 0.047 |
| F. tularensis | Yes | 0.783 ± 0.018 | | – | – | – |
| L. interrogans | Yes | 0.827 ± 0.033 | 0.867 ± 0.023 | Yes | 0.559 ± 0.015 | 0.565 ± 0.011 |
| P. gingivalis | Yes | 0.785 ± 0.031 | | Yes | 0.690 ± 0.019 | 0.698 ± 0.020 |
| M. leprae | Yes | 0.633 ± 0.018 | 0.652 ± 0.018 | Yes | 0.557 ± 0.029 | 0.585 ± 0.023 |
| M. tuberculosis | Yes | 0.635 ± 0.010 | 0.647 ± 0.011 | | 0.508 ± 0.010 | 0.502 ± 0.010 |
| S. aureus | Yes | 0.765 ± 0.032 | 0.772 ± 0.023 | | 0.438 ± 0.054 | 0.420 ± 0.057 |
| S. pyogenes | Yes | 0.884 ± 0.039 | | Yes | 0.832 ± 0.021 | 0.844 ± 0.019 |
| L. braziliensis | Yes | | 0.673 ± 0.020 | Yes | 0.778 ± 0.029 | |
| P. falciparum | Yes | 0.821 ± 0.009 | 0.826 ± 0.007 | Yes | 0.758 ± 0.016 | |
| T. gondii | Yes | 0.656 ± 0.032 | | Yes | | 0.584 ± 0.020 |
| T. cruzi | Yes | 0.803 ± 0.029 | | Yes | 0.838 ± 0.019 | 0.854 ± 0.016 |
The prediction was considered to be successful if it was significantly Better Than a Random set of scores (BTR). Each specific model was calculated 50 times using different, but overlapping, subsets of data as training and test sets. In bold we show the model with the significantly higher AUC when comparing training with unbalanced or balanced data (Student’s t-test, *p < 0.05, **p < 0.01, ***p < 0.001).
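The AUC values in this table summarize how well the scores separate antigens from non-antigens. The sketch below computes a non-parametric AUC by pairwise comparison and, as a point of reference, the AUCs obtained from random scores. How the BTR significance test was actually carried out is not stated in this section, so the random baseline here is only an assumption about what "a random set of scores" could look like; the function names are hypothetical.

```python
import random


def roc_auc(scores, labels):
    """AUC as the probability that a random antigen outscores a random non-antigen.

    Ties count as half a win. Labels are 1 (antigen) or 0 (non-antigen).
    This quadratic loop is for illustration; proteome-scale data would need
    a rank-based implementation.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one antigen and one non-antigen")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def random_baseline_aucs(labels, repeats=50, seed=0):
    """AUCs of uniformly random scores against the same labels (assumed baseline)."""
    rng = random.Random(seed)
    return [
        roc_auc([rng.random() for _ in labels], labels)
        for _ in range(repeats)
    ]
```

A model whose AUC sits well above the spread of `random_baseline_aucs` for the same labels is doing better than chance; the formal test used in the paper may differ.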
Prediction results for the leave-one-out generic models.
| Species | Proteins: BTR | Proteins: LOO model | Peptides: BTR | Peptides: LOO model | Peptides: LOO model + protein scores | Peptides: combined score relative AUC gain |
|---|---|---|---|---|---|---|
| B. burgdorferi | Yes | 0.786 | Yes | 0.768 | 0.950 | |
| B. melitensis | Yes | 0.774 | – | – | – | – |
| C. burnetii | Yes | 0.620 | – | – | – | – |
| E. coli | Yes | 0.754 | Yes | 0.742 | 0.780 | |
| F. tularensis | Yes | 0.698 | – | – | – | – |
| L. interrogans | Yes | 0.947 | Yes | 0.679 | 0.948 | |
| P. gingivalis | Yes | 0.854 | Yes | 0.665 | 0.871 | |
| M. leprae | Yes | 0.758 | Yes | 0.692 | 0.731 | |
| M. tuberculosis | Yes | 0.702 | Yes | 0.586 | 0.711 | |
| S. aureus | Yes | 0.737 | Yes | 0.752 | 0.790 | |
| S. pyogenes | Yes | 0.983 | Yes | 0.838 | 0.970 | |
| L. braziliensis | Yes | 0.709 | Yes | 0.946 | 0.878 | |
| P. falciparum | Yes | 0.807 | Yes | 0.748 | 0.835 | |
| T. gondii | Yes | 0.837 | Yes | 0.583 | 0.720 | |
| T. cruzi | Yes | 0.867 | Yes | 0.843 | 0.857 | 1.58% |
The prediction was considered successful if it was significantly Better Than a Random set of scores (BTR). For peptides, we show both the performance of the model alone, and the performance obtained by combining the protein and peptide scores. In bold we show any difference greater than 5% between the peptide score and the combined score for a given species. LOO Model, Leave-One-Out Model.
Figure 4. Density analysis for the antigenicity scores of T. cruzi. Plots were obtained by analyzing the proteome of T. cruzi with the leave-one-out generic models, and then distinguishing between antigens and non-antigens. The figure shows the enrichment score obtained by keeping only the proteins and peptides with a score greater than 0.6, as well as the number of antigens and non-antigens that would be inside or outside that subset.
Comparison between APRANK and the predictor with highest solo AUC (BepiPred 1.0).
| Species | Proteins: BepiPred score AUC | Proteins: APRANK score AUC | Proteins: APRANK relative AUC gain | Peptides: BepiPred score AUC | Peptides: APRANK score AUC | Peptides: APRANK relative AUC gain |
|---|---|---|---|---|---|---|
| B. burgdorferi | 0.729 | | | 0.796 | 0.768 | -3.46% |
| B. melitensis | 0.710 | | | – | – | – |
| C. burnetii | 0.558 | | | – | – | – |
| E. coli | 0.587 | | | 0.662 | | |
| F. tularensis | 0.570 | | | – | – | – |
| L. interrogans | 0.839 | | | 0.676 | 0.679 | 0.42% |
| P. gingivalis | 0.852 | 0.854 | 0.25% | 0.674 | 0.665 | -1.36% |
| M. leprae | | 0.758 | | 0.689 | 0.692 | 0.51% |
| M. tuberculosis | 0.666 | | | 0.561 | 0.586 | 4.58% |
| S. aureus | 0.723 | 0.737 | 1.86% | 0.767 | 0.752 | -1.93% |
| S. pyogenes | 0.970 | 0.983 | 1.33% | 0.800 | 0.838 | 4.73% |
| L. braziliensis | 0.549 | | | 0.905 | 0.946 | 4.48% |
| P. falciparum | 0.793 | 0.807 | 1.84% | 0.642 | | |
| T. gondii | 0.579 | | | 0.584 | 0.583 | -0.21% |
| T. cruzi | 0.814 | | | 0.819 | 0.843 | 3.03% |
The relative AUC gain shows the increase or decrease of the AUC obtained by our method relative to the one obtained by BepiPred. Differences greater than 5% are shown in bold.
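The relative AUC gain can be read as a percentage change with respect to the BepiPred AUC. The sketch below is an interpretation of the caption, not a formula given in the source, and the function name is hypothetical.

```python
def relative_auc_gain(baseline_auc, method_auc):
    """Percentage change of method_auc relative to baseline_auc.

    Positive values mean the method improves on the baseline,
    negative values mean it performs worse.
    """
    if baseline_auc <= 0:
        raise ValueError("baseline AUC must be positive")
    return 100.0 * (method_auc - baseline_auc) / baseline_auc
```

Using the rounded protein AUCs for S. pyogenes, `relative_auc_gain(0.970, 0.983)` gives about 1.34, close to the 1.33% in the table; small differences of this kind are expected when starting from AUCs rounded to three decimals.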
Performance of APRANK on Onchocerca volvulus.
| | Total | Score | #MIP | Antigenic | AUC | Antigens with score > 0.6 | Enrichment score for > 0.6 |
|---|---|---|---|---|---|---|---|
| Proteins | 12,994 | Protein score | 1 | 886 | 0.677 | 150 | 2.28 |
| Proteins | 12,994 | Protein score | 2 | 177 | 0.713 | 38 | 2.89 |
| Proteins | 12,994 | Protein score | 3 | 28 | 0.828 | 11 | 5.29 |
| Peptides | 4,872,082 | Peptide score | 1 | 1,097 → 14,122 | 0.800 | 6,108 | 3.33 |
| Peptides | 4,872,082 | Peptide score | 2 | 397 → 4,498 | 0.798 | 1,995 | 3.42 |
| Peptides | 4,872,082 | Peptide score | 3 | 104 → 1,182 | 0.836 | 598 | 3.90 |
| Peptides | 4,872,082 | Combined score | 1 | 1,097 → 14,122 | 0.750 | 3,376 | 3.10 |
| Peptides | 4,872,082 | Combined score | 2 | 397 → 4,498 | 0.774 | 1,342 | 3.88 |
| Peptides | 4,872,082 | Combined score | 3 | 104 → 1,182 | 0.871 | 512 | 5.63 |
Proteins and peptides were tagged as antigenic based on the number of Minimum Immunoreactive Peptides (#MIP). For proteins, we considered as antigenic those with at least #MIP immunoreactive peptides. For peptides, we considered as antigenic any immunoreactive peptide found inside proteins with at least #MIP immunoreactive peptides. We show the number of antigenic peptides before and after spreading the antigenicity from the original immunoreactive peptides to their neighboring peptides (before → after). The rule to define an ‘immunoreactive peptide’ was extracted from Lagatie et al., 2017 (see Methods). The enrichment score represents the proportion of antigens in the selected subset relative to the proportion of antigens in the whole proteome.
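The enrichment score defined above (proportion of antigens in the selected subset divided by the proportion in the whole proteome) can be sketched as follows. The function name and argument layout are assumptions, and the strict "greater than 0.6" cut-off follows the wording of the figure captions.

```python
def enrichment_score(scores, is_antigen, threshold=0.6):
    """Ratio of the antigen proportion above the threshold to the overall proportion.

    `scores` are antigenicity scores and `is_antigen` the matching 0/1 flags.
    A value of 1.0 means the subset is no richer in antigens than the whole set.
    """
    selected = [flag for score, flag in zip(scores, is_antigen) if score > threshold]
    if not selected:
        raise ValueError("no items score above the threshold")
    if not any(is_antigen):
        raise ValueError("no antigens in the whole set")
    subset_rate = sum(selected) / len(selected)
    overall_rate = sum(is_antigen) / len(is_antigen)
    return subset_rate / overall_rate
```

For instance, if 2 of the 3 items scoring above 0.6 are antigens while antigens make up 2 of 6 items overall, the enrichment score is 2.0.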
Figure 5. Density analysis for the antigenicity scores of Onchocerca volvulus. Plots were obtained by analyzing the proteome of O. volvulus with the final generic models, and then distinguishing between antigens and non-antigens. The figure shows the enrichment score obtained by keeping only the proteins and peptides with a score greater than 0.6, as well as the number of antigens and non-antigens that would be inside or outside that subset. The plots correspond to the case where a protein was tagged as antigenic if it had at least 3 ‘immunoreactive’ peptides (see Results).
Figure 6. Validation of APRANK against antigens with known seroprevalence. Detailed information on the seroprevalence of Plasmodium falciparum proteins in cases of human malaria was obtained from (50) (n = 38). Proteins were clustered in different seroprevalence groups and matched against APRANK antigenicity scores (see Results).