
Allergen false-detection using official bioinformatic algorithms.

Abstract

Bioinformatic amino acid sequence searches are used, in part, to assess the potential allergenic risk of newly expressed proteins in genetically engineered crops. Previous work has demonstrated that the searches required by government regulatory agencies falsely implicate many proteins from rarely allergenic crops as an allergenic risk. However, many proteins are found in crops at concentrations that may be insufficient to cause allergy. Here we used a recently developed set of high-abundance non-allergenic proteins to determine the false-positive rates for several algorithms required by regulatory bodies, and also for an alternative 1:1 FASTA approach previously found to be equally sensitive to the official sliding-window method, but far more selective. The current investigation confirms these earlier findings while addressing dietary exposure.


Keywords: Allergen; bioinformatics; cross-reactivity; exposure; selectivity

Substances:
Allergens
Plant Proteins

Year: 2020 PMID: 31906791 PMCID: PMC7289518 DOI: 10.1080/21645698.2019.1709021

Source DB: PubMed Journal: GM Crops Food ISSN: 2164-5698 Impact factor: 3.074

Introduction

Newly expressed proteins in genetically engineered (GE) foods are evaluated for allergenic risk using multiple lines of evidence in a weight-of-evidence risk assessment. The most important factors to consider in this risk assessment include the allergenic status of the organism from which the transgene originates, the concentration of the protein in food, and the structural similarity of the protein to known allergens.[1] Regulatory agencies that assess the safety of GE crops also consider the heat and digestive stability of the expressed proteins, but these factors have been shown to be poorly associated with allergenic risk.[1-4]
The structural similarity of a novel food protein to known allergens is typically assessed by comparing amino acid sequences. Previous work has shown that the official bioinformatic algorithms required by regulatory agencies for comparing the amino acid sequence of a newly expressed protein with that of known allergens falsely implicate many non-allergens as being an allergenic risk.[5-7] The most commonly used official bioinformatic method divides the newly expressed protein into overlapping 80-amino-acid contiguous sequences and then looks for >35% identity among aligning sequences within known allergen sequences (sliding-window approach).[8,9] Another standard approach looks for exact 8-amino-acid contiguous matches between the novel food protein and known allergens. Scientists have largely dismissed this latter method as not useful, although most regulatory agencies still require such searches to be completed.[10,11] Short amino-acid identity matches have been shown to identify many false-positive sequences while not identifying any novel cross-reactive allergen pairs.[10]
More recently, the European Food Safety Authority (EFSA) issued guidance on assessing proteins for non-IgE-mediated celiac-disease risk using short amino acid motifs and partial matches with 9-mer peptides known to cause celiac disease.[12] Predictably, these latter bioinformatic searches find a large number of random false-positive sequences derived from plant and animal proteins not associated with celiac disease.[13]
We and others have previously published bioinformatic algorithms that are equally sensitive for detecting allergenic risk but substantially more selective in eliminating proteins with negligible risk.[5-7] These latter methods use conventional software (e.g. FASTA) for estimating amino acid sequence similarity rather than identity, and categorize risk using thresholds based on statistical measures of similarity (e.g. E-values). E-value calculations were initially developed to detect evolutionary relationships between sequences and organisms but have been found useful in detecting similar protein functions and structures, the latter of which might indicate cross-reactive binding to the IgE antibodies that are typically associated with allergy.[14]
In these published investigations, false-positive rates were typically estimated as the percentage of the full suite of proteins in one or more rarely allergenic food crops that the various bioinformatic algorithms detect as representing an allergenic risk. One weakness of using a large set of protein sequences from a non-allergenic food crop to assess false-positive rates is that actual dietary exposure to many of the proteins may be limited due to low concentrations in food. While relative comparisons among bioinformatic methods are still valid, the absolute false-positive rates might be skewed upward because real allergens may be expressed in non-allergenic crops at levels below those at which allergy is induced.
Recently, a list of abundant food proteins with low allergenic potential (hereafter referred to as non-allergens) was published along with the methods used to determine their abundance and status as non-allergens.[15] This list can now be used to better assess the false-positive rates for different bioinformatic algorithms designed to selectively detect allergenic risk. Here we used this list of abundant non-allergenic food proteins to assess the false-positive rates for the official criteria of >35% identity over an 80-mer sliding-window and an 8-mer exact-match, and a previously reported 1:1 FASTA similarity approach.[7,8] Furthermore, we evaluate the selectivity of the recently implemented EFSA celiac peptide motif searches using these high-abundance non-allergens.
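As a rough illustration of the two official criteria described above, the following Python sketch enumerates the overlapping 80-amino-acid windows of a query protein and lists the contiguous 8-mers it shares with an allergen. The function names and toy sequences are hypothetical and are not taken from any database or regulatory tool. The >35% identity step is deliberately omitted, because in practice it depends on aligning each window against the allergen database with FASTA, and returning a short protein as a single window is an assumption of this sketch only.

```python
def sliding_windows(seq, size=80):
    """Return every overlapping window of `size` residues (step of 1).

    Assumption of this sketch: a protein no longer than `size`
    is returned as a single window.
    """
    if len(seq) <= size:
        return [seq]
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def shared_kmers(query, allergen, k=8):
    """Return the set of contiguous k-mers present in both sequences."""
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return query_kmers & allergen_kmers


# Toy sequences (hypothetical, for illustration only).
query = "MKTAYIAKQRQISFVKSHFSRQ"
allergen = "GGGAYIAKQRQIPPP"

windows = sliding_windows("A" * 100)   # a 100-residue protein
matches = shared_kmers(query, allergen)
```

A 100-residue protein yields 100 - 80 + 1 = 21 windows, each of which would be searched separately against the allergen database. This repeated searching is one reason a single short region of similarity can implicate an entire protein.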

Methods and Materials

The 178 UniRef90 Cluster IDs listed in Table 4 of Krutz et al.[15] were used to search the UniProt database to obtain an amino acid sequence for each protein. Of these IDs, 169 returned current entries, and of those, 125 indicated the same source organism as listed in Table 4 of Krutz et al. The amino acid sequences for these 125 high-abundance non-allergens were compared with the allergen sequences in the COMPARE database version 2019 (http://db.comparedatabase.org/) using the standard search for >35% identity across 80-amino-acid windows and the previously described 1:1 FASTA approach (with an E-value threshold of 1E-9 using FASTA version 35).[6] The percentage of non-allergens showing above-threshold identity or similarity, respectively, was used to estimate the false-positive rate for each bioinformatic algorithm. In addition, 8-amino-acid exact matches between the non-allergens and allergens were determined. Finally, the number of sequences detected by the EFSA Q/E-X1-P-X2 celiac motif search (Q = glutamine; E = glutamic acid; X1 = L [leucine], Q, F [phenylalanine], S [serine], or E; P = proline; X2 = Y [tyrosine], F, A [alanine], V [valine], or Q) and by the partial-match identity search (9-mer match allowing 3 mismatches with HLA-DQ8 restricted epitopes) was determined. The COMPARE database is used by the major registrants of genetically engineered crops when implementing the sliding-window, contiguous eight amino acid, and celiac peptide searches required by various regulatory bodies and thus represents current practice.
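The two celiac-related searches can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the regular expression encodes the motif exactly as defined above, the function names are hypothetical, the epitope passed to `partial_matches` is a placeholder rather than a real HLA-DQ8 restricted epitope, and the EFSA exclusion criteria are not modeled.

```python
import re

# Q/E-X1-P-X2 motif: X1 in {L, Q, F, S, E}, X2 in {Y, F, A, V, Q}.
# The lookahead lets overlapping occurrences be reported as well.
MOTIF = re.compile(r"(?=([QE][LQFSE]P[YFAVQ]))")


def motif_hits(seq):
    """Return (0-based position, matched 4-mer) for each motif occurrence."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(seq)]


def partial_matches(seq, epitope, max_mismatches=3):
    """Return windows of len(epitope) that differ from the epitope
    at no more than `max_mismatches` positions (ungapped comparison)."""
    n = len(epitope)
    hits = []
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        mismatches = sum(a != b for a, b in zip(window, epitope))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


# Toy inputs (hypothetical, for illustration only).
hits = motif_hits("QLPQLPY")
near = partial_matches("AAAAAAKKK", "AAAAAAAAA")
```

Because the motif fixes only proline and allows several residues at each of the other three positions, many unrelated proteins can be expected to contain it by chance, which is consistent with the poor selectivity reported for this search.[13]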

Results and Discussion

Of the 125 high-abundance non-allergenic food-crop proteins evaluated, 11 were implicated as an allergenic risk by the standard sliding-window bioinformatic approach and 1 was implicated by the 1:1 FASTA approach (Table 1). The 8-amino-acid search produced 3 hits and the EFSA celiac-peptide motif searches produced 13 hits (none of which could be excluded based on the presence of a proline duplex or on positively charged amino acids appearing in all 9-mer restricted-epitope matches at key positions, as outlined by EFSA guidance). There were no 9-mer matches allowing 3 mismatches with the HLA-DQ8 restricted epitopes.

Table 1.

Bioinformatic matches (number of allergens or motifs) between non-allergens and allergens using different algorithms.

UniProt Entry	Bioinformatic Match (hits)				UniProt Entry	Bioinformatic Match (hits)				UniProt Entry	Bioinformatic Match (hits)
Maize	Sliding Window	1:1 FASTA	8-mer Match	EFSA Celiac	Spinach	Sliding Window	1:1 FASTA	8-mer Match	EFSA Celiac	Potato	Sliding Window	1:1 FASTA	8-mer Match	EFSA Celiac
P28794					P80082					J7ENS8
P06673					P10871					O24378
P81009					A0A0K9QE98					Q43652
P46517	1				P12301					Q9M3H3
B6T8E4					P06003					P19595
B6SGF3					P00833					O24379
B6TTP4					P04160					P04045
Q41881	1			2	P60128					M1D7J7
B6UH99					P12355					M1AYK4
P80639					A0A0K9QP00				1	Q93X17
B6UH67					P12359					M0ZNV9
B4FFZ9					P00455					Q00782
B4FFK9			2		P12353					M1BPE5
E9JVD4					Q41385					Q9AWA5
P29518					P11402				1	Q3HRY7				1
B6SL97					P13788					P37829
Q01526					O20252				1	Q9M4G5	25
B4FUH2					P17353					Q9M4G4
B4FPL1					P22418					C6F3B7	2
B6UGJ4					A1XIR6					P33191	7		1	5
P55240					Q8RU73					P37830
Q84J79					P09559					K7WJT8
K7UNW7					P05435					Q948Z8
B6T7B2					P06508					P23509
P93804					Q02254
B4F7S2					Q02060
Rice					Tomato					Wheat
Q6Z782					P14903					P33432		2
A2XMB2					P38416					Q08000
P37833					Q08655					P20158
P46520					Q9SWF5					Q03968	2
Q07661					P10967				1	W5BUF4
P55142					P47921					P12299
P0C5A4	1				Q6QLX4			5		P02276				1
Q6AVA8					Q40128					P30523
Q69UI2					Q08451					P12783
Q01L47	2				Q43497
Q94JF2					P93207
Q10LP5					P05116
Q9AUV8				1	P46301
Q5ZEL0					Q5NE21	1
C7J0T2				1	K4B3I4
Q6ZHP6				1	Q9ZR41
P30298					O24024
Q8H8B0	2			1	Q42876
Q8H920					P38546				1
A3AHG5	5				Q6QLU0

Previous work using 50,090 protein sequences from maize found the sliding-window bioinformatic approach to falsely implicate 19.9% of putative non-allergens as allergens, while the 1:1 FASTA approach falsely implicated 7.5% of proteins.[16] Using the 125 high-abundance non-allergens from food crops,[15] false-positive rates were found to be 8.8% for the sliding-window approach and 0.8% for the 1:1 FASTA approach. The 8-amino-acid exact-match criterion falsely implicated 2.4% of the 125 high-abundance food-crop proteins, and the celiac-peptide-motif search incorrectly found 10.4% of the 125 non-allergenic proteins to represent a celiac-disease risk (only 1 of which originated from a crop known to cause celiac symptoms, wheat, but with no reports of this peptide causing celiac disease). Together, the sliding-window, 8-mer, and celiac-peptide-motif searches, which are required by some global government regulatory bodies, found 24 of the 125 non-allergenic proteins to present an allergenic risk (19.2%). Clearly, identifying nearly 1 in 5 putative high-abundance non-allergens as an allergenic risk demonstrates that these bioinformatic algorithms are not fit for purpose, as they greatly overestimate risk and impede the use of safe proteins to develop improved crops.
This is especially evident since this investigation, in combination with previous investigations, has found alternative bioinformatic algorithms, including the 1:1 FASTA approach, to be equally sensitive to the sliding-window search for detecting true allergens but with dramatically better selectivity for not falsely detecting low-risk protein sequences.[5-7] Similarly, previous investigations have suggested more selective methods for identifying peptides with potential risk of causing celiac disease.[13] Multifactor bioinformatic criteria have also been suggested with much improved selectivity for detecting known allergens and represent an additional avenue for evaluating the allergenic risk of novel food proteins.[17] The current results evaluating the selectivity of bioinformatic searches using high-abundance non-allergenic food proteins support past investigations using a comprehensive list of proteins from crops with a low allergenic potential. Together, these findings give realistic estimates of relative false-positive rates and clearly support the superiority of alternative bioinformatic approaches using modern bioinformatic tools (e.g. 1:1 FASTA method).
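The false-positive rates reported in this section follow directly from the number of flagged proteins divided by the 125 non-allergens evaluated. The short Python sketch below reproduces that arithmetic from the counts stated in the text; the variable and function names are illustrative only.

```python
TOTAL = 125  # high-abundance non-allergens evaluated

# Number of proteins flagged by each search, as reported in the text.
flagged = {
    "sliding window": 11,
    "1:1 FASTA": 1,
    "8-mer exact match": 3,
    "EFSA celiac motif": 13,
    "official searches combined": 24,
}


def false_positive_rate(n_flagged, total=TOTAL):
    """Percentage of non-allergens flagged, rounded to one decimal place."""
    return round(100 * n_flagged / total, 1)


rates = {name: false_positive_rate(n) for name, n in flagged.items()}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

The combined count is smaller than the sum of the three official searches (11 + 3 + 13 = 27) because some proteins are flagged by more than one search and are counted once.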

16 in total

1. The value of short amino acid sequence matches for prediction of protein allergenicity.

Authors: Andre Silvanovich; Margaret A Nemeth; Ping Song; Rod Herman; Laura Tagliani; Gary A Bannon
Journal: Toxicol Sci Date: 2005-12-07 Impact factor: 4.849

2. Further evaluation of the utility of "sliding window" FASTA in predicting cross-reactivity with allergenic proteins.

Authors: Robert F Cressman; Gregory Ladics
Journal: Regul Toxicol Pharmacol Date: 2008-12-11 Impact factor: 3.271

3. Comparative assessment of multiple criteria for the in silico prediction of cross-reactivity of proteins to known allergens.

Authors: Henry P Mirsky; Robert F Cressman; Gregory S Ladics
Journal: Regul Toxicol Pharmacol Date: 2013-08-09 Impact factor: 3.271

4. Validation of bioinformatic approaches for predicting allergen cross reactivity.

Authors: Rod A Herman; Ping Song
Journal: Food Chem Toxicol Date: 2019-07-03 Impact factor: 6.023

5. Q-X1-P-X2 motif search for potential celiac disease risk has poor selectivity.

Authors: Ping Song; Nancy Podevin; Henry Mirsky; Jennifer Anderson; Bryan Delaney; Carey Mathesius; Laura Rowe; Rod A Herman
Journal: Regul Toxicol Pharmacol Date: 2018-09-26 Impact factor: 3.271

6. Allergenic sensitization versus elicitation risk criteria for novel food proteins.

Authors: Rod A Herman; Gregory S Ladics
Journal: Regul Toxicol Pharmacol Date: 2018-02-23 Impact factor: 3.271

7. Guidance on allergenicity assessment of genetically modified plants.

Authors: Hanspeter Naegeli; Andrew Nicholas Birch; Josep Casacuberta; Adinda De Schrijver; Mikolaj Antoni Gralak; Philippe Guerche; Huw Jones; Barbara Manachini; Antoine Messéan; Elsa Ebbesen Nielsen; Fabien Nogué; Christophe Robaglia; Nils Rostoks; Jeremy Sweet; Christoph Tebbe; Francesco Visioli; Jean-Michel Wal; Philippe Eigenmann; Michelle Epstein; Karin Hoffmann-Sommergruber; Frits Koning; Martinus Lovik; Clare Mills; Francisco Javier Moreno; Henk van Loveren; Regina Selb; Antonio Fernandez Dumont
Journal: EFSA J Date: 2017-06-22

8. An introduction to sequence similarity ("homology") searching.

Authors: William R Pearson
Journal: Curr Protoc Bioinformatics Date: 2013-06

9. Food Allergens: Is There a Correlation between Stability to Digestion and Allergenicity?

Authors: Katrine Lindholm Bøgh; Charlotte Bernhard Madsen
Journal: Crit Rev Food Sci Nutr Date: 2016-07-03 Impact factor: 11.176

10. Value of eight-amino-acid matches in predicting the allergenicity status of proteins: an empirical bioinformatic investigation.

Authors: Rod A Herman; Ping Song; Arvind Thirumalaiswamysekhar
Journal: Clin Mol Allergy Date: 2009-10-29
