D Soeria-Atmadja, T Lundell, M G Gustafsson, U Hammerling.
Abstract
The placing of novel or new-in-the-context proteins on the market, appearing in genetically modified foods, certain bio-pharmaceuticals and some household products, leads to human exposure to proteins that may elicit allergic responses. Accurate methods to detect allergens are therefore necessary to ensure consumer/patient safety. We demonstrate that it is possible to reach a new level of accuracy in computational detection of allergenic proteins by presenting a novel detector, Detection based on Filtered Length-adjusted Allergen Peptides (DFLAP). The DFLAP algorithm extracts variable-length allergen sequence fragments and employs modern machine learning techniques in the form of a support vector machine. In particular, this new detector shows hitherto unmatched specificity when challenged with the Swiss-Prot repository, without appreciable loss of sensitivity. DFLAP is also the first reported detector that successfully discriminates between allergens and non-allergens occurring in protein families known to hold both categories. Allergenicity assessment for specific protein sequences of interest using DFLAP is possible via ulfh@slv.se.
Year: 2006 PMID: 16977698 PMCID: PMC1540723 DOI: 10.1093/nar/gkl467
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Outline of design and function of the DFLAP algorithm. (I) The allergen amino acid sequences are segmented into overlapping peptides, which are subsequently compared with all sequences in the non-allergen set. The rationale is that peptides with high similarity to any non-allergen sequence are likely to be structurally/functionally unrelated to allergenicity. Conversely, peptides lacking appreciable similarity with non-allergen sequences are potentially important to allergic reactions, broadly defined. These peptides (after concatenation of directly overlapping peptides) are stored in a special catalogue designated the Filtered Length-adjusted Allergen Peptide (FLAP) set. Thus, the non-allergen amino acid sequence set can be regarded as a filter through which only peptides dissimilar to non-allergens are allowed to pass. (II) Feature vectors, based on the alignment scores of training amino acid sequences (both allergen and non-allergen) against the FLAP set, are thereafter used to train a supervised learning algorithm. This process trains the algorithm to determine, in a quantitative manner, the level of similarity to the FLAP set that is required for a protein to be assigned as an allergen. (III) The trained system allows for interrogation by any query amino acid sequence with respect to allergen potential, essentially as described in step II. If sufficient similarity to the FLAPs is found, the query sequence is assigned as an allergen. In this step, the trained detector quantifies what ‘sufficient similarity’ means.
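Steps I and II of the outline can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_score` is a hypothetical stand-in for a real alignment score, the function names and toy sequences are invented for the example, and the rule that a peptide is discarded when its best score against any non-allergen reaches the threshold is an assumed reading of the filtering step. The arguments `l_min`, `threshold` and `n` loosely mirror the lmin, FLAP threshold and n settings quoted with the tables.

```python
def toy_score(peptide, sequence):
    """Hypothetical stand-in for an alignment score: the largest number of
    identical positions over all ungapped placements of peptide in sequence."""
    best = 0
    for start in range(len(sequence) - len(peptide) + 1):
        window = sequence[start:start + len(peptide)]
        best = max(best, sum(a == b for a, b in zip(peptide, window)))
    return best


def extract_flaps(allergen, non_allergens, l_min, threshold):
    """Step I (assumed form): slide a window of length l_min over one allergen,
    keep windows scoring below threshold against every non-allergen, then
    concatenate windows that overlap."""
    kept = []  # half-open (start, end) spans of peptides that pass the filter
    for start in range(len(allergen) - l_min + 1):
        peptide = allergen[start:start + l_min]
        if all(toy_score(peptide, seq) < threshold for seq in non_allergens):
            kept.append((start, start + l_min))
    merged = []
    for start, end in kept:
        if merged and start < merged[-1][1]:  # overlaps the previous span
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return [allergen[s:e] for s, e in merged]


def feature_vector(sequence, flaps, n):
    """Step II (assumed form): the n highest scores of a sequence against the
    FLAP set, padded with zeros if fewer than n FLAPs exist."""
    scores = sorted((toy_score(f, sequence) for f in flaps), reverse=True)
    return (scores + [0] * n)[:n]


# Toy data, invented for illustration only.
flaps = extract_flaps("AAAAWWWWCCCC", ["AAAAGGGG", "GGGGCCCC"], l_min=4, threshold=3)
features = feature_vector("GGWWWWGG", flaps, n=2)
print(flaps, features)
```

In a full pipeline, feature vectors computed this way for labelled allergen and non-allergen training sequences would be passed to a supervised learner (a support vector machine in the paper), and step III would compute the same features for a query sequence.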
Outline of tests conducted
| Test type | Detection system | Sequences used for generation of FLAPs (Allergens/non-allergens)* | Training sequences (Allergens/non-allergens)* | Test sequences (Allergens/non-allergens)* |
|---|---|---|---|---|
| Parameter selection (3-fold CV) | DFLAP | 333a/52081b | 333a/666c | 167d/339 (334c+5e) |
| Assessment of sensitivity (holdout) | FAO/WHO | — | 500a | 262d (168, 141, 116, 99)f |
| | ILSI/IFBC | — | 500a | 262d (168, 141, 116, 99)f |
| | DASARP | — | 500a | 262d (168, 141, 116, 99)f |
| | DFLAP** | 500a/52081b | 500a/1000c | 262d (168, 141, 116, 99)f |
| Assessment of intra-family discrimination (holdout) | FAO/WHO | — | 697a | 65d/193g |
| | ILSI/IFBC | — | 697a | 65d/193g |
| | DASARP | — | 697a | 65d/193g |
| | DFLAP** | 697a/52081b | 697a/1394c | 65d/193g |
| Assessment of specificity (holdout) | FAO/WHO | — | 762 | 164970h |
| | ILSI/IFBC | — | 762 | 164970h |
| | DASARP | — | 762 | 164970h |
| | DFLAP** | 762/52081b | 762/1524 | 164970h |
*All datasets are publicly available on .
**The parameter setting was lmin = 22, FLAP threshold = 48, n = 4 and C = 100.
aSubsets of the total number (762) of allergens used to train each test method in the different evaluation procedures. In the case of DFLAP, these subsets were initially also used to generate FLAPs.
bNon-allergen filter set used in the Computerized Peptide Filtration and Aggregation (CPFA).
cSubsets of the total number (1524) of the sequences referred to as Swiss-Prot non-allergens in Materials and Methods, used for SVM training (and testing in the parameter selection procedure) in the DFLAP method.
dSubsets of the total number (762) of allergens used to test each test method in the different evaluation procedures.
eFive tropomyosins used to measure specificity in the evaluation of DFLAP parameters.
fThe four numbers correspond to different levels of maximal sequence identity between training and test set (95, 90, 85 and 80%, respectively).
gPresumed non-allergens from tropomyosins, profilins and parvalbumins.
hSwiss-Prot, release 45.3.
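Footnote f describes test sets restricted by the maximal sequence identity allowed between training and test sequences. One way such a restriction could be applied is sketched below. The sketch rests on assumptions: `toy_identity` is a hypothetical position-wise measure used only for illustration (a real analysis would derive identity from pairwise alignments), the "not exceeding the level" rule is an assumed reading, and the sequences are invented.

```python
def toy_identity(a, b):
    """Hypothetical identity measure: percent of matching positions relative
    to the longer sequence, computed without any alignment."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


def restrict_test_set(test_seqs, train_seqs, max_identity):
    """Keep test sequences whose highest identity to any training sequence
    does not exceed max_identity (in percent)."""
    return [
        t for t in test_seqs
        if max((toy_identity(t, s) for s in train_seqs), default=0.0) <= max_identity
    ]


# Toy data, invented for illustration only.
train = ["ACDEFGHIKL"]
test = ["ACDEFGHIKL", "ACDEFGHIKA", "ACDEFGHAAA", "AAAAAAAAAA"]
for level in (95, 90, 85, 80):
    print(level, len(restrict_test_set(test, train, level)))
```

Lowering the identity level can only shrink the retained test set, which is consistent with the decreasing counts (168, 141, 116, 99) listed in the table.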
Occurrence frequencies among the different parameters for the best detectors found according to a 3-fold CV
| Parameter | a (%) | b (%) | c (%) |
|---|---|---|---|
| Cost parameter | | | |
| | 0 | 0 | 24 |
| | 0 | 0 | 22 |
| | 25 | 10 | 20 |
| | 41 | 46 | 18 |
| | 34 | 44 | 16 |
| Peptide length | | | |
| | 0 | 0 | 0 |
| | 0 | 11 | 0 |
| | 0 | 4 | 2 |
| | 4 | 13 | 17 |
| | 3 | 6 | 17 |
| | 21 | 18 | 21 |
| | 31 | 25 | 21 |
| | 41 | 24 | 22 |
| Retention level (filtration degree) | | | |
| 75% retention | 1 | 0 | 26 |
| 65% retention | 15 | 13 | 27 |
| 55% retention | 36 | 25 | 26 |
| 45% retention* | 48 | 63 | 21 |
| Number of matches | | | |
| | 19 | 21 | 20 |
| | 23 | 20 | 20 |
| | 19 | 25 | 20 |
| | 20 | 16 | 20 |
| | 20 | 18 | 20 |
*Preferred parameter setting in the finally selected detector.
(a) Summary of parameter settings returning the 80 highest detection rates, while at the same time showing tropomyosin false alarm estimates below 10%; (b) Summary of parameter settings producing the 80 highest detection levels, regardless of associated tropomyosin false alarm estimates; (c) Summary of parameter settings providing tropomyosin false alarm estimates below 10%, regardless of the associated detection performances.
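The three summaries (a), (b) and (c) can be read as tallies over a list of evaluated detectors. The sketch below shows one way such occurrence frequencies could be computed. It is an illustration under assumptions: the record fields `detection` and `false_alarm`, the function name and the toy values are hypothetical and not taken from the paper.

```python
from collections import Counter


def occurrence_frequencies(results, parameter, top_k, max_false_alarm=None):
    """Percent occurrence of each value of `parameter` among the top_k
    detectors by detection rate, optionally after a false-alarm cut-off."""
    pool = results
    if max_false_alarm is not None:
        pool = [r for r in pool if r["false_alarm"] < max_false_alarm]
    best = sorted(pool, key=lambda r: r["detection"], reverse=True)[:top_k]
    if not best:
        return {}
    counts = Counter(r[parameter] for r in best)
    return {value: 100.0 * count / len(best) for value, count in counts.items()}


# Toy results, invented for illustration only.
results = [
    {"C": 100, "detection": 0.90, "false_alarm": 0.05},
    {"C": 100, "detection": 0.88, "false_alarm": 0.20},
    {"C": 10, "detection": 0.85, "false_alarm": 0.08},
    {"C": 10, "detection": 0.80, "false_alarm": 0.02},
    {"C": 1, "detection": 0.60, "false_alarm": 0.01},
]
like_a = occurrence_frequencies(results, "C", top_k=2, max_false_alarm=0.10)
like_b = occurrence_frequencies(results, "C", top_k=2)
like_c = occurrence_frequencies(results, "C", top_k=len(results), max_false_alarm=0.10)
print(like_a, like_b, like_c)
```

With `top_k=80` and a cut-off of 0.10 the first call would correspond to summary (a), the second to summary (b), and the third, which keeps every detector under the cut-off, to summary (c).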
Figure 2. Length distribution of the final FLAP set based on 762 allergens (minimal peptide length, lmin = 22 and FLAP threshold = 48). Of all FLAPs, 50% are of length 28 or shorter (left of the dotted line) and 80% are of length 41 or shorter (left of the dashed line).
Figure 3. BC intervals (95%) of the unknown detection performance of the four tested methods using different levels of maximal sequence identity between training and test set. Clearly, the detection performance of the FAO/WHO method is much better than that of the other three but, as shown in Figure 4, the corresponding false alarm rates make this approach useless. The DFLAP parameter setting was lmin = 22, FLAP threshold = 48, n = 4 and C = 100.
Estimated fractions of allergens in the Swiss-Prot database
| Method | Swiss-Prot (164970 samples) (%) |
|---|---|
| FAO/WHO | 75.4 |
| ILSI/IFBC | 6.2 |
| DASARP | 3.1 |
| DFLAP* | 1.5 |
*The parameter setting was lmin = 22, FLAP threshold = 48, n = 4 and C = 100.
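To make the percentages concrete, they can be converted into approximate numbers of Swiss-Prot sequences flagged as allergens by each method. The short calculation below uses only the figures in the table. Because the percentages are rounded to one decimal, the resulting counts are approximations, and the variable and function names are invented for the example.

```python
SWISS_PROT_SIZE = 164970  # number of Swiss-Prot sequences stated in the table

# Percent of sequences flagged as allergens, copied from the table.
flagged_percent = {"FAO/WHO": 75.4, "ILSI/IFBC": 6.2, "DASARP": 3.1, "DFLAP": 1.5}


def approximate_counts(percentages, total):
    """Convert rounded percentages into approximate sequence counts."""
    return {method: round(total * pct / 100) for method, pct in percentages.items()}


counts = approximate_counts(flagged_percent, SWISS_PROT_SIZE)
for method, count in counts.items():
    print(f"{method}: about {count} sequences")
```

On these figures, DFLAP flags roughly 2,500 sequences, whereas FAO/WHO flags more than 120,000.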
Figure 4. BC intervals (95%) of the false alarms for the four tested methods using (presumed) non-allergens belonging to three different protein families. Clearly, DFLAP is the only method that is able to discriminate successfully between allergens and non-allergens within the same protein family. The DFLAP parameter setting was lmin = 22, FLAP threshold = 48, n = 4 and C = 100.