Santiago J Carmona, Paula A Sartor, María S Leguizamón, Oscar E Campetella, Fernán Agüero.
Abstract
The availability of complete pathogen genomes has renewed interest in the development of diagnostics for infectious diseases. Synthetic peptide microarrays provide a rapid, high-throughput platform for immunological testing of potential B-cell epitopes. However, their current capacity prevents the experimental screening of complete "peptidomes". Therefore, computational approaches for prediction and/or prioritization of diagnostically relevant peptides are required. In this work we describe a computational method to assess a defined set of molecular properties for each potential diagnostic target in a reference genome. Properties such as sub-cellular localization or expression level were evaluated for the whole protein. At a higher resolution (short peptides), we assessed a set of local properties, such as repetitive motifs, disorder (structured vs. natively unstructured regions), trans-membrane spans, genetic polymorphisms (conserved vs. divergent regions), predicted B-cell epitopes, and sequence similarity against human proteins and other potential cross-reacting species (e.g. other pathogens endemic in overlapping geographical locations). A scoring function based on these different features was developed and used to rank all peptides from a large eukaryotic pathogen proteome. We applied this method to the identification of candidate diagnostic peptides in the protozoan Trypanosoma cruzi, the causative agent of Chagas disease. We measured the performance of the method by analyzing the enrichment of validated antigens at the high-scoring top of the ranking. Based on this measure, our integrative method outperformed alternative prioritizations based on individual properties (such as B-cell epitope predictors alone). Using this method we ranked [Formula: see text]10 million 12-mer overlapping peptides derived from the complete T. cruzi proteome.
Experimental screening of 190 high-scoring peptides allowed the identification of 37 novel epitopes with diagnostic potential, while none of the low-scoring peptides showed significant reactivity. Many of the metrics employed depend on standard bioinformatic tools and data, so the method can be easily extended to other pathogen genomes.
Year: 2012 PMID: 23272069 PMCID: PMC3522711 DOI: 10.1371/journal.pone.0050748
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Features, Attributes and Tools.
| Feature | Basis | Method | Weight (Score) |
| Cellular Surface Localization Index (CSLI) | Potentially secreted/surface protein | SignalP, DGPI | Positive, Large (5) |
| Protein Expression Index (PEI) | Timing and abundance of expression | Proteomic data, Codon Usage Bias, Gene copy number | Positive, Large (5) |
| Predicted B-cell epitopes | Antigenicity | Bepipred | Positive, Medium (3) |
| Internal Amino Acid Repeats | Immunogenicity | Trust | Positive, Medium (3) |
| Extracellular domain of integral membrane protein | Surface Localization | TMHMM | Positive, Low (1) |
| Trans-membrane domain | Low accessibility | TMHMM | Negative, Large (−5) |
| Natively unstructured region | Selection of linear epitopes | IUPred | Positive, Medium (3) |
| High local sequence similarity against host proteins | Low immunogenicity | FASTA | Negative, Large (−5) |
| High local sequence similarity against related pathogens | Misleading diagnosis | FASTA | Negative, Large (−5) |
| Protein has an additional domain not present in other orthologs | Potential immunogenic domain | BLASTP/Perl | Positive, Large (5) |
| Potentially glycosylated regions | Avoid post-translational modifications | NetOGlyc | Negative, Low (−1) |
| Regions with very low sequence complexity | Low specificity | SEG | Negative, Large (−5) |
| Region upstream of Signal peptide cleavage residue | Absent in mature protein | SignalP | Negative, Large (−5) |
| Region downstream of GPI-anchor addition residue | Absent in mature protein | DGPI | Negative, Large (−5) |
| Intra-species genetic diversity (polymorphic residues) | Non-conserved peptides | TcSNP Database | Negative, Large (−5) |
| Cysteine in peptides | Synthetic peptides are sensitive to oxidation and cyclization | Custom Perl Script | Negative, Large (−5) |
The table lists the features evaluated by our computational pipeline, the basis for their selection and the method used to calculate them. The numerical weight (score) listed for each feature modulates the contribution of that attribute to the final peptide scores.
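As a minimal sketch of how such weights could be combined, the Python snippet below sums the table weights of the features annotated on a peptide. The purely additive form, the short dictionary keys and the function name `peptide_score` are illustrative assumptions; the exact scoring function is not given in this section.

```python
# Weights copied from the table above; the short keys are hypothetical labels.
FEATURE_WEIGHTS = {
    "surface_localization": 5,        # CSLI
    "expression": 5,                  # PEI
    "bcell_epitope": 3,               # Bepipred
    "internal_repeat": 3,
    "extracellular_domain": 1,
    "transmembrane": -5,
    "disordered": 3,                  # natively unstructured region
    "similar_to_host": -5,
    "similar_to_related_pathogen": -5,
    "extra_domain": 5,                # domain absent in orthologs
    "glycosylated": -1,
    "low_complexity": -5,
    "signal_peptide_region": -5,
    "post_gpi_anchor_region": -5,
    "polymorphic": -5,
    "has_cysteine": -5,
}


def peptide_score(features):
    """Sum the weights of the features present on a peptide (assumed additive)."""
    return sum(FEATURE_WEIGHTS[name] for name in set(features))


example = peptide_score(["bcell_epitope", "disordered", "has_cysteine"])
print(example)  # 3 + 3 - 5 = 1
```

In this sketch each feature counts once per peptide, so a strong negative attribute such as a cysteine can cancel out two medium positive ones.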
Figure 1. Visualization of peptide-score profiles generated by the method.
A) The 60S ribosomal protein L19 (locus identifier TcCLB.509149.40), and B) a putative lectin (locus TcCLB.506239.30). These plots display peptide scores and features along protein sequences. Mapped features in these examples are those listed in Table 1: antigenicity (Bepipred), protein disorder, internal repeats, signal peptide, signal peptide cleavage site, non-synonymous polymorphisms, high conservation vs. human, high conservation vs. Leishmania spp., low sequence complexity, glycosylated threonines, cysteines, and presence of a domain absent in orthologous proteins (NC DOMAIN). Vertical boxes represent overlapping 12-residue peptides, and their height and level of green are proportional to the peptide score. Peptide scores vary around their base protein scores (i.e. 4.7 and 5.5), which account for subcellular localization and expression.
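The overlapping 12-residue peptides can be pictured as a sliding window over each protein sequence. The sketch below is illustrative only: the function name `tile_peptides`, the default step of one residue and the toy sequence are assumptions, not details taken from the method.

```python
def tile_peptides(sequence, length=12, step=1):
    """Return (start, peptide) pairs; start is 1-based, as in protein coordinates."""
    return [
        (i + 1, sequence[i:i + length])
        for i in range(0, len(sequence) - length + 1, step)
    ]


toy = "MKTAYIAKQRQISFVKSH"  # made-up 18-residue sequence, not a T. cruzi protein
peptides = tile_peptides(toy)
print(len(peptides))  # 18 - 12 + 1 = 7 windows
print(peptides[0])    # (1, 'MKTAYIAKQRQI')
```

A sequence shorter than the window yields an empty list, so short proteins simply contribute no peptides.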
Figure 2. Assessing enrichment of known antigens.
The figure shows enrichment plots obtained under different prioritization scenarios. In all plots, the x axis contains the prioritized proteome (top-ranking proteins at the origin); the y axis displays the fraction of known validated antigens recovered in the top x proteins; the blue dashed line displays a hypothetical enrichment plot with an AUC = 0.5 (expected by chance), while the black solid line represents the actual enrichment obtained in each prioritization. From the top left, the prioritization strategies are compared in order of decreasing AUC: 1) our composite method; 2–9) prioritizations using a single criterion in each case: 2) Codon Usage bias (CAI), 3) Internal repeats, 4) Proteomic evidence of expression, 5) natively unstructured regions, 6) antigenicity (Bepipred), 7) surface localization (GPI), 8) O-Glycosylation, 9) antigenicity (EMBOSS antigenic). The p-values are based on a random permutation test (n = 10,000).
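To make the enrichment measure concrete, the following sketch computes the area under an enrichment curve for a ranked list and estimates a permutation p-value by shuffling the ranking. The rectangle-rule AUC, the add-one correction and all function names are assumptions made for illustration; they are not taken from the published pipeline.

```python
import random


def enrichment_auc(ranked_ids, known):
    """Area under the curve of recovered fraction vs. fraction of ranking inspected."""
    found, area = 0, 0.0
    for pid in ranked_ids:
        if pid in known:
            found += 1
        area += found / len(known)   # curve height at this rank
    return area / len(ranked_ids)    # each rank has width 1 / len(ranked_ids)


def permutation_p_value(ranked_ids, known, n=10_000, seed=0):
    """Fraction of random orderings whose AUC is at least the observed one."""
    rng = random.Random(seed)
    observed = enrichment_auc(ranked_ids, known)
    shuffled = list(ranked_ids)
    hits = 0
    for _ in range(n):
        rng.shuffle(shuffled)
        if enrichment_auc(shuffled, known) >= observed:
            hits += 1
    return (hits + 1) / (n + 1)      # add-one correction avoids p = 0


ranking = [f"p{i}" for i in range(1, 11)]   # hypothetical ranked protein IDs
antigens = {"p1", "p2"}                     # hypothetical validated antigens
print(enrichment_auc(ranking, antigens))    # both antigens at the top of the ranking
print(permutation_p_value(ranking, antigens, n=2000))
```

A ranking that places the known antigens near the top yields an AUC close to 1, whereas a random ordering is expected to give a value near 0.5.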
Figure 3. Experimental distribution of peptide intensity ratios (log2 fold change) vs. statistical significance of the signal (negative log scale of q-value) after multiple testing adjustment.
FDR = False Discovery Rate. The q-value is the FDR analog of the p-value. Panel A corresponds to measurements obtained from a peptide chip assayed with sera from a pool of Chagas-positive samples, while panel B corresponds to a chip assayed with sera from healthy donors. Points in the upper-right corner of the plot are marked as reactive peptides.
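One common way to obtain FDR-adjusted values is the Benjamini-Hochberg procedure. The sketch below uses it as a stand-in, since this section does not state which adjustment was applied, and the fold-change and q-value cut-offs in `reactive_flags` are hypothetical.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


def reactive_flags(log2_fc, p_values, min_log2_fc=1.0, max_q=0.05):
    """Flag peptides with a high fold change and a low q-value (hypothetical cut-offs)."""
    q = bh_q_values(p_values)
    return [fc >= min_log2_fc and qv <= max_q for fc, qv in zip(log2_fc, q)]


p = [0.001, 0.02, 0.04, 0.5]           # made-up p-values
fc = [2.0, 1.5, 3.0, 0.1]              # made-up log2 fold changes
print([round(v, 3) for v in bh_q_values(p)])  # [0.004, 0.04, 0.053, 0.5]
print(reactive_flags(fc, p))                  # [True, True, False, False]
```

The third peptide shows why both axes matter: its fold change is the largest, yet its adjusted value exceeds the cut-off, so it is not flagged.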
Summary of peptide reactivities.
| Peptide Class | Assayed | AND Chagas (+) | AND Healthy (−) | AND Leishmaniasis (−) |
| Curated | 36 | 16 (44.4%) | 13 (36.1%) | 13 (36.1%) |
| New | 190 | 52 (27.4%) | 37 | 32 |
The table summarizes the results from the screening of pools of positive (Chagas), negative (healthy donors) and related (Leishmaniasis) sera. From left to right the columns show the results of cumulative additional criteria (boolean AND): 1) Assayed, 2) Assayed AND Positive for Chagas Disease sera, etc.
derived from 85 distinct proteins.
derived from 27 distinct proteins.
derived from 23 distinct proteins.
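The cumulative AND logic of the columns can be sketched as successive filters, with each percentage expressed relative to the number of assayed peptides (for example, 16 of 36 curated peptides is 44.4%). The record layout and names below are hypothetical.

```python
def cumulative_counts(peptides, criteria):
    """Count peptides passing each criterion AND all the ones before it."""
    remaining = list(peptides)
    counts = []
    for name, keep in criteria:
        remaining = [p for p in remaining if keep(p)]
        counts.append((name, len(remaining)))
    return counts


# Hypothetical records: one dict per assayed peptide, with boolean pool results.
peptides = [
    {"chagas": True,  "healthy": False, "leish": False},
    {"chagas": True,  "healthy": False, "leish": True},
    {"chagas": True,  "healthy": True,  "leish": False},
    {"chagas": False, "healthy": False, "leish": False},
]
criteria = [
    ("Assayed", lambda p: True),
    ("AND Chagas (+)", lambda p: p["chagas"]),
    ("AND Healthy (-)", lambda p: not p["healthy"]),
    ("AND Leishmaniasis (-)", lambda p: not p["leish"]),
]
for name, n in cumulative_counts(peptides, criteria):
    print(f"{name}: {n} ({100 * n / len(peptides):.1f}%)")
```

Because each filter is applied to the survivors of the previous one, the counts can only stay equal or decrease from left to right, as they do in the table.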
Complete list of reactive peptides.
| ID | Gene Name | Description | Pos. | Sequence | Score | Tc+ (N = 5) | Healthy+ (N = 5) | Leish+ (N = 2) |
| n42 | TcCLB.508175.329 | 60S ribosomal protein L19, putative | 335 | | 10.57 | 80% | 0% | 0% |
| n67 | TcCLB.509149.40 | 60S ribosomal protein L19, putative | 275 | | 12.84 | 80% | 0% | 0% |
| n126 | TcCLB.504159.10 | hypothetical protein, conserved | 443 | | 6.91 | 80% | 0% | 0% |
| n86 | TcCLB.508831.150 | hypothetical protein, conserved | 47 | | 7.30 | 60% | 0% | 50% |
| n90 | TcCLB.506239.30 | lectin, putative | 409 | | 13.54 | 60% | 0% | 0% |
| n96 | TcCLB.511671.50 | hypothetical protein, conserved | 47 | | 7.10 | 60% | 0% | 0% |
| n25 | TcCLB.510305.70 | hypothetical protein, conserved | 457 | | 8.91 | 60% | 20% | 50% |
| n1 | TcCLB.511633.79 | microtubule-associated protein, putative | 239 | | 11.48 | 40% | 0% | 0% |
| n26 | TcCLB.510305.70 | hypothetical protein, conserved | 463 | | 8.62 | 40% | 0% | 0% |
| n38 | TcCLB.508175.329 | 60S ribosomal protein L19, putative | 233 | | 10.68 | 40% | 0% | 0% |
| n40 | TcCLB.508175.329 | 60S ribosomal protein L19, putative | 273 | | 8.94 | 40% | 0% | 0% |
| n85 | TcCLB.508831.150 | hypothetical protein, conserved | 41 | | 8.15 | 40% | 0% | 0% |
| n87 | TcCLB.508831.150 | hypothetical protein, conserved | 53 | | 7.73 | 40% | 0% | 0% |
| n88 | TcCLB.506239.30 | lectin, putative | 363 | | 12.99 | 40% | 0% | 0% |
| n41 | TcCLB.508175.329 | 60S ribosomal protein L19, putative | 323 | | 9.91 | 40% | 20% | 0% |
| n154 | TcCLB.506441.20 | hypothetical protein, conserved | 677 | | 9.56 | 40% | 20% | 50% |
| n190 | TcCLB.508719.70 | hypothetical protein, conserved | 390 | | 6.76 | 40% | 20% | 50% |
| n24 | TcCLB.510305.70 | hypothetical protein, conserved | 451 | | 8.72 | 20% | 0% | 50% |
| n28 | TcCLB.506177.20 | lectin, putative | 347 | | 12.75 | 20% | 0% | 0% |
| n31 | TcCLB.506177.20 | lectin, putative | 393 | | 12.94 | 20% | 0% | 0% |
| n44 | TcCLB.508385.10 | hypothetical protein, conserved | 1313 | | 8.00 | 20% | 0% | 0% |
| n51 | TcCLB.506791.30 | hypothetical protein, conserved | 1775 | | 7.34 | 20% | 0% | 0% |
| n56 | TcCLB.510217.10 | hypothetical protein | 95 | | 9.23 | 20% | 0% | 0% |
| n63 | TcCLB.506559.559 | antigenic protein, putative | 2209 | | 10.35 | 20% | 0% | 0% |
| n74 | TcCLB.506959.90 | hypothetical protein, conserved | 123 | | 9.46 | 20% | 0% | 0% |
| n77 | TcCLB.508677.60 | hypothetical protein | 99 | | 11.68 | 20% | 0% | 0% |
| n89 | TcCLB.506239.30 | lectin, putative | 407 | | 13.49 | 20% | 0% | 0% |
| n97 | TcCLB.511671.50 | hypothetical protein, conserved | 53 | | 7.34 | 20% | 0% | 0% |
| n112 | TcCLB.508595.20 | UDP-Gal-dependent glycosyltransferase | 41 | | 11.76 | 20% | 0% | 50% |
| n115 | TcCLB.506147.190 | hypothetical protein, conserved | 253 | | 9.82 | 20% | 0% | 0% |
| n122 | TcCLB.510565.11 | tyrosine aminotransferase, putative | 27 | | 7.98 | 20% | 0% | 0% |
| n124 | TcCLB.510733.50 | hypothetical protein, conserved | 99 | | 10.02 | 20% | 0% | 50% |
| n129 | TcCLB.510877.40 | hypothetical protein, conserved | 173 | | 12.13 | 20% | 0% | 0% |
| n135 | TcCLB.511861.120 | hypothetical protein | 97 | | 9.85 | 20% | 0% | 0% |
| n136 | TcCLB.511861.120 | hypothetical protein | 101 | | 10.26 | 20% | 0% | 0% |
| n147 | TcCLB.504625.70 | kinetoplast DNA-associated protein, putative | 443 | | 9.56 | 20% | 0% | 0% |
| n152 | TcCLB.506441.20 | hypothetical protein, conserved | 665 | | 9.87 | 20% | 0% | 50% |
| n161 | TcCLB.503975.100 | hypothetical protein, conserved | 343 | | 7.02 | 20% | 0% | 0% |
| n165 | TcCLB.507603.260 | cathepsin L-like, putative | 353 | | 11.65 | 20% | 0% | 0% |
| n184 | TcCLB.463155.20 | retrotransposon hot spot (RHS) protein | 511 | | 7.56 | 20% | 0% | 0% |
| n186 | TcCLB.511815.170 | hypothetical protein, conserved | 50 | | 10.09 | 20% | 0% | 0% |
| n176 | TcCLB.511233.20 | 60S ribosomal protein L34, putative | 111 | | 10.75 | 20% | 20% | 0% |
| n183 | TcCLB.511345.10 | retrotransposon hot spot (RHS) protein | 539 | | 4.52 | 20% | 20% | 50% |
Peptides that were positive in at least one T. cruzi assay and in at most one healthy (control) individual are listed, showing the corresponding locus identifier, protein description, amino acid start position, sequence, prioritization score and the percentage of assayed samples in which the peptide was positive for T. cruzi-infected, healthy control and Leishmania-infected subjects (e.g. peptide n42 reacted in 4 of 5, or 80%, of the T. cruzi samples). Letters n and c in the peptide ID indicate “novel” (highly ranked) and “curated” peptides, respectively.
Bibliographic references for validated antigens can be found in Table-S2, except for antigens marked with *.