Allergen false-detection using official bioinformatic algorithms.

Rod A Herman, Ping Song.

Abstract

Bioinformatic amino acid sequence searches are used, in part, to assess the potential allergenic risk of newly expressed proteins in genetically engineered crops. Previous work has demonstrated that the searches required by government regulatory agencies falsely implicate many proteins from rarely allergenic crops as an allergenic risk. However, many proteins are found in crops at concentrations that may be insufficient to cause allergy. Here we used a recently developed set of high-abundance non-allergenic proteins to determine the false-positive rates for several algorithms required by regulatory bodies, and also for an alternative 1:1 FASTA approach previously found to be equally sensitive to the official sliding-window method, but far more selective. The current investigation confirms these earlier findings while addressing dietary exposure.

Keywords:  Allergen; bioinformatics; cross-reactivity; exposure; selectivity

Year:  2020        PMID: 31906791      PMCID: PMC7289518          DOI: 10.1080/21645698.2019.1709021

Source DB:  PubMed          Journal:  GM Crops Food        ISSN: 2164-5698            Impact factor:   3.074


Introduction

Newly expressed proteins in genetically engineered (GE) foods are evaluated for allergenic risk. Multiple lines of evidence are used in a weight-of-evidence risk assessment. The most important factors to consider in this risk assessment include the allergenic status of the organism from which the transgene originates, the concentration of the protein in food, and the structural similarity of the protein to known allergens.[1] Regulatory agencies that assess the safety of GE crops also consider the heat and digestive stability of the expressed proteins, but these factors have been shown to be poorly associated with allergenic risk.[1-4] The structural similarity of a novel food protein to known allergens is typically assessed by comparing amino acid sequences. Previous work has shown that the official bioinformatic algorithms required by regulatory agencies for comparing the amino acid sequence of a newly expressed protein with those of known allergens falsely implicate many non-allergens as being an allergenic risk.[5-7] The most commonly used official bioinformatic method divides the newly expressed protein into overlapping 80-amino-acid contiguous sequences and then looks for >35% identity among aligning sequences within known allergen sequences (sliding-window approach).[8,9] Another standard approach looks for exact contiguous 8-amino-acid matches between the novel food protein and known allergens. Scientists have largely dismissed this latter method as not useful, although most regulatory agencies still require such searches to be completed.[10,11] Short amino-acid identity matches have been shown to identify many false-positive sequences while not identifying any novel cross-reactive allergen pairs.[10] More recently, the European Food Safety Authority (EFSA) issued guidance on assessing proteins for non-IgE-mediated celiac-disease risk using short amino acid motifs and partial matches with 9-mer peptides known to cause celiac disease.[12]
Predictably, these latter bioinformatic searches find a large number of random false-positive sequences derived from plant and animal proteins not associated with celiac disease.[13] We and others have previously published on bioinformatic algorithms that are equally sensitive for detecting allergenic risk but substantially more selective in eliminating proteins with negligible risk.[5-7] These latter methods use conventional software (e.g. FASTA) for estimating amino acid sequence similarity rather than identity, and categorize risk using thresholds based on statistical measures of similarity (e.g. E-values). E-value calculations were initially developed to detect evolutionary relationships between sequences and organisms but have been found useful in detecting similar protein functions and structures, the latter of which might indicate cross-reactive binding to the IgE antibodies that are typically associated with allergy.[14] In these published investigations, false-positive rates were typically estimated by determining the percentage of the full suite of proteins in one or more rarely allergenic food crops that are detected by various bioinformatic algorithms as representing an allergenic risk. One weakness of using a large set of protein sequences from a non-allergenic food crop to assess false-positive rates is that actual dietary exposure to many of the proteins may be limited due to low concentrations in food. While relative comparisons among bioinformatic methods are still valid, the absolute false-positive rates might be skewed upward because real allergens may be expressed in non-allergenic crops at levels below those needed to induce allergy.
Recently, a list of abundant food proteins with low allergenic potential (hereafter referred to as non-allergens) was published along with the methods used to determine their abundance and status as non-allergens.[15] This list can now be used to better assess the false-positive rates for different bioinformatic algorithms designed to selectively detect allergenic risk. Here we used this list of abundant non-allergenic food proteins to assess the false-positive rates for the official criteria (>35% identity over an 80-mer sliding window, and an 8-mer exact match) and for a previously reported 1:1 FASTA similarity approach.[7,8] Furthermore, we evaluated the selectivity of the recently implemented EFSA celiac peptide motif searches using these high-abundance non-allergens.
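The two official criteria named above can be illustrated with a minimal Python sketch. It only enumerates the overlapping 80-residue windows and finds exact 8-residue matches; the alignment step that scores each window against an allergen for >35% identity (performed with FASTA in practice) is not reproduced. The function names, the toy sequences, and the handling of proteins shorter than one window are assumptions made for illustration and are not taken from the published method.

```python
from typing import Iterator, Set


def sliding_windows(sequence: str, size: int = 80) -> Iterator[str]:
    """Yield every overlapping contiguous window of `size` residues.

    Assumption: a protein shorter than `size` is yielded whole as a single
    window; the source text does not say how such proteins are handled.
    """
    if len(sequence) <= size:
        yield sequence
        return
    for start in range(len(sequence) - size + 1):
        yield sequence[start:start + size]


def kmers(sequence: str, k: int = 8) -> Set[str]:
    """Return the set of all contiguous k-residue substrings."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_exact_matches(query: str, allergen: str, k: int = 8) -> Set[str]:
    """Return the k-mers that occur verbatim in both sequences."""
    return kmers(query, k) & kmers(allergen, k)


if __name__ == "__main__":
    # Made-up toy sequences, not real proteins or allergens.
    toy_query = "MKTAYIAKQRQISFVKSHFSRQ"
    toy_allergen = "GGGAKQRQISFVGGG"
    print(sorted(shared_exact_matches(toy_query, toy_allergen)))
    print(sum(1 for _ in sliding_windows("A" * 100)))
```

Using sets for the 8-mer comparison keeps the search linear in sequence length, which matters when every query protein is compared against a whole allergen database.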

Methods and Materials

The 178 UniRef90 Cluster IDs listed in Table 4 of Krutz et al.[15] were used to search the UniProt database to obtain an amino acid sequence for each protein. Of these sequences, 169 returned current entries, and of those, 125 indicated the same source organism as listed in Table 4 of Krutz et al. The amino acid sequences for these 125 high-abundance non-allergens were compared with the allergen sequences in the COMPARE database version 2019 (http://db.comparedatabase.org/) using the standard search for >35% identity across 80-amino-acid windows and the previously described 1:1 FASTA approach (with an E-value threshold of 1E-9 using FASTA version 35).[6] The percentage of non-allergens showing above-threshold identity or similarity, respectively, was used to estimate the false-positive rate for each bioinformatic algorithm. In addition, 8-amino-acid exact matches between the non-allergens and allergens were determined. Finally, the number of sequences detected by the EFSA celiac-disease Q/E-X1-P-X2 motif search (Q = glutamine; E = glutamic acid; X1 = L [leucine], Q, F [phenylalanine], S [serine], or E; P = proline; X2 = Y [tyrosine], F, A [alanine], V [valine], or Q) and by the partial-match identity search (9-mer match allowing 3 mismatches with HLA-DQ8-restricted epitopes) was determined. The COMPARE database is used by the major registrants of genetically engineered crops when implementing the sliding-window, contiguous eight-amino-acid, and celiac peptide searches required by various regulatory bodies and thus represents current practice.
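The celiac searches described above can be sketched in Python as a regular-expression motif scan plus a mismatch-tolerant 9-mer comparison. This is an illustrative sketch under stated assumptions: the function names and toy sequences are hypothetical, "allowing 3 mismatches" is read as at most three mismatched positions, and the EFSA exclusion rules (for example, those based on proline duplexes or charged residues at key positions) are not implemented.

```python
import re
from typing import List, Tuple

# Q/E-X1-P-X2 with X1 in {L, Q, F, S, E} and X2 in {Y, F, A, V, Q}.
# The zero-width lookahead lets overlapping occurrences be reported.
MOTIF = re.compile(r"(?=([QE][LQFSE]P[YFAVQ]))")


def motif_hits(sequence: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched 4-mer) for every motif occurrence."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(sequence)]


def partial_matches(sequence: str, epitope: str, max_mismatches: int = 3) -> List[int]:
    """Return start positions where a window of len(epitope) residues
    differs from the epitope at no more than `max_mismatches` positions."""
    width = len(epitope)
    starts = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(a != b for a, b in zip(window, epitope))
        if mismatches <= max_mismatches:
            starts.append(start)
    return starts


if __name__ == "__main__":
    # Made-up toy strings, not real proteins or HLA-DQ8 epitopes.
    print(motif_hits("AAQLPYAAEQPQ"))
    print(partial_matches("AAAAAAAAA", "AAAAAABBB"))
```

Because the motif is only four residues long with five permitted residues at two of its positions, chance occurrences in unrelated proteins are expected, which is consistent with the poor selectivity reported for this search.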

Results and Discussion

Of the 125 high-abundance non-allergenic food-crop proteins evaluated, 11 were implicated as an allergenic risk by the standard sliding-window bioinformatic approach and 1 was implicated by the 1:1 FASTA approach (Table 1). The 8-amino-acid search produced 3 hits and the EFSA celiac-peptide motif searches produced 13 hits (none of which could be excluded based on the presence of a proline duplex or based on positively-charged amino acids appearing in all 9-mer restricted-epitope matches at key positions as outlined by EFSA guidance). There were no 9-mer matches allowing 3 mismatches with the HLA-DQ8 restricted epitopes.
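Table 1. Bioinformatic matches (number of allergens or motifs) between non-allergens and allergens using different algorithms. A dash (-) indicates no match.

UniProt Entry  Sliding Window  1:1 FASTA  8-mer Match  EFSA Celiac

Maize
P28794         -               -          -            -
P06673         -               -          -            -
P81009         -               -          -            -
P46517         1               -          -            -
B6T8E4         -               -          -            -
B6SGF3         -               -          -            -
B6TTP4         -               -          -            -
Q41881         1               -          -            2
B6UH99         -               -          -            -
P80639         -               -          -            -
B6UH67         -               -          -            -
B4FFZ9         -               -          -            -
B4FFK9         -               -          2            -
E9JVD4         -               -          -            -
P29518         -               -          -            -
B6SL97         -               -          -            -
Q01526         -               -          -            -
B4FUH2         -               -          -            -
B4FPL1         -               -          -            -
B6UGJ4         -               -          -            -
P55240         -               -          -            -
Q84J79         -               -          -            -
K7UNW7         -               -          -            -
B6T7B2         -               -          -            -
P93804         -               -          -            -
B4F7S2         -               -          -            -

Spinach
P80082         -               -          -            -
P10871         -               -          -            -
A0A0K9QE98     -               -          -            -
P12301         -               -          -            -
P06003         -               -          -            -
P00833         -               -          -            -
P04160         -               -          -            -
P60128         -               -          -            -
P12355         -               -          -            -
A0A0K9QP00     -               -          -            1
P12359         -               -          -            -
P00455         -               -          -            -
P12353         -               -          -            -
Q41385         -               -          -            -
P11402         -               -          -            1
P13788         -               -          -            -
O20252         -               -          -            1
P17353         -               -          -            -
P22418         -               -          -            -
A1XIR6         -               -          -            -
Q8RU73         -               -          -            -
P09559         -               -          -            -
P05435         -               -          -            -
P06508         -               -          -            -
Q02254         -               -          -            -
Q02060         -               -          -            -

Potato
J7ENS8         -               -          -            -
O24378         -               -          -            -
Q43652         -               -          -            -
Q9M3H3         -               -          -            -
P19595         -               -          -            -
O24379         -               -          -            -
P04045         -               -          -            -
M1D7J7         -               -          -            -
M1AYK4         -               -          -            -
Q93X17         -               -          -            -
M0ZNV9         -               -          -            -
Q00782         -               -          -            -
M1BPE5         -               -          -            -
Q9AWA5         -               -          -            -
Q3HRY7         -               -          -            1
P37829         -               -          -            -
Q9M4G5         25              -          -            -
Q9M4G4         -               -          -            -
C6F3B7         2               -          -            -
P33191         7               -          1            5
P37830         -               -          -            -
K7WJT8         -               -          -            -
Q948Z8         -               -          -            -
P23509         -               -          -            -

Rice
Q6Z782         -               -          -            -
A2XMB2         -               -          -            -
P37833         -               -          -            -
P46520         -               -          -            -
Q07661         -               -          -            -
P55142         -               -          -            -
P0C5A4         1               -          -            -
Q6AVA8         -               -          -            -
Q69UI2         -               -          -            -
Q01L47         2               -          -            -
Q94JF2         -               -          -            -
Q10LP5         -               -          -            -
Q9AUV8         -               -          -            1
Q5ZEL0         -               -          -            -
C7J0T2         -               -          -            1
Q6ZHP6         -               -          -            1
P30298         -               -          -            -
Q8H8B0         2               -          -            1
Q8H920         -               -          -            -
A3AHG5         5               -          -            -

Tomato
P14903         -               -          -            -
P38416         -               -          -            -
Q08655         -               -          -            -
Q9SWF5         -               -          -            -
P10967         -               -          -            1
P47921         -               -          -            -
Q6QLX4         -               -          5            -
Q40128         -               -          -            -
Q08451         -               -          -            -
Q43497         -               -          -            -
P93207         -               -          -            -
P05116         -               -          -            -
P46301         -               -          -            -
Q5NE21         1               -          -            -
K4B3I4         -               -          -            -
Q9ZR41         -               -          -            -
O24024         -               -          -            -
Q42876         -               -          -            -
P38546         -               -          -            1
Q6QLU0         -               -          -            -

Wheat
P33432         -               2          -            -
Q08000         -               -          -            -
P20158         -               -          -            -
Q03968         2               -          -            -
W5BUF4         -               -          -            -
P12299         -               -          -            -
P02276         -               -          -            1
P30523         -               -          -            -
P12783         -               -          -            -
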
Previous work using 50,090 protein sequences from maize found the sliding-window bioinformatic approach to falsely implicate 19.9% of putative non-allergens as allergens, while the 1:1 FASTA approach falsely implicated 7.5% of proteins.[16] Using the 125 high-abundance non-allergens from food crops,[15] false-positive rates were found to be 8.8% for the sliding-window approach and 0.8% for the 1:1 FASTA approach. The 8-amino-acid exact-match criterion falsely implicated 2.4% of the 125 high-abundance food-crop proteins, and the celiac-peptide-motif search incorrectly found 10.4% of the 125 non-allergenic proteins to represent a celiac-disease risk (only 1 of which originated from a crop known to cause celiac symptoms, wheat, but with no reports of this peptide causing celiac disease). Together, the sliding-window, 8-mer, and celiac-peptide-motif searches, which are required by some global government regulatory bodies, found 24 of the 125 non-allergenic proteins to present an allergenic risk (19.2%). Clearly, identifying nearly 1 in 5 putative high-abundance non-allergens as an allergenic risk demonstrates that these bioinformatic algorithms are not fit for purpose, as they greatly overestimate risk and impede the use of safe proteins to develop improved crops.
This is especially evident since this investigation, in combination with previous investigations, has found alternative bioinformatic algorithms, including the 1:1 FASTA approach, to be as sensitive as the sliding-window search for detecting true allergens but dramatically more selective in not falsely detecting low-risk protein sequences.[5-7] Similarly, previous investigations have suggested more selective methods for identifying peptides with potential risk of causing celiac disease.[13] Multifactor bioinformatic criteria with much improved selectivity for detecting known allergens have also been suggested and represent an additional avenue for evaluating the allergenic risk of novel food proteins.[17] The current results evaluating the selectivity of bioinformatic searches using high-abundance non-allergenic food proteins support past investigations that used a comprehensive list of proteins from crops with a low allergenic potential. Together, these findings give realistic estimates of relative false-positive rates and clearly support the superiority of alternative bioinformatic approaches using modern bioinformatic tools (e.g. the 1:1 FASTA method).
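The false-positive rates quoted in this section follow directly from the hit counts and the 125 proteins evaluated. The short Python sketch below reproduces that arithmetic; the hit counts are those stated in the text, while the variable and function names are illustrative only.

```python
TOTAL_NON_ALLERGENS = 125  # high-abundance non-allergens evaluated

# Number of non-allergens flagged by each search, as reported in the text.
hits = {
    "sliding window": 11,
    "1:1 FASTA": 1,
    "8-mer exact match": 3,
    "EFSA celiac motif": 13,
    "sliding window, 8-mer, or celiac motif": 24,
}


def false_positive_rate(n_hits: int, total: int = TOTAL_NON_ALLERGENS) -> float:
    """Percentage of non-allergens flagged, rounded to one decimal place."""
    return round(100 * n_hits / total, 1)


rates = {name: false_positive_rate(n) for name, n in hits.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

The combined figure is smaller than the sum of the three individual counts because some proteins are flagged by more than one search.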
  16 in total

1.  The value of short amino acid sequence matches for prediction of protein allergenicity.

Authors:  Andre Silvanovich; Margaret A Nemeth; Ping Song; Rod Herman; Laura Tagliani; Gary A Bannon
Journal:  Toxicol Sci       Date:  2005-12-07       Impact factor: 4.849

2.  Further evaluation of the utility of "sliding window" FASTA in predicting cross-reactivity with allergenic proteins.

Authors:  Robert F Cressman; Gregory Ladics
Journal:  Regul Toxicol Pharmacol       Date:  2008-12-11       Impact factor: 3.271

3.  Comparative assessment of multiple criteria for the in silico prediction of cross-reactivity of proteins to known allergens.

Authors:  Henry P Mirsky; Robert F Cressman; Gregory S Ladics
Journal:  Regul Toxicol Pharmacol       Date:  2013-08-09       Impact factor: 3.271

4.  Validation of bioinformatic approaches for predicting allergen cross reactivity.

Authors:  Rod A Herman; Ping Song
Journal:  Food Chem Toxicol       Date:  2019-07-03       Impact factor: 6.023

5.  Q-X1-P-X2 motif search for potential celiac disease risk has poor selectivity.

Authors:  Ping Song; Nancy Podevin; Henry Mirsky; Jennifer Anderson; Bryan Delaney; Carey Mathesius; Laura Rowe; Rod A Herman
Journal:  Regul Toxicol Pharmacol       Date:  2018-09-26       Impact factor: 3.271

6.  Allergenic sensitization versus elicitation risk criteria for novel food proteins.

Authors:  Rod A Herman; Gregory S Ladics
Journal:  Regul Toxicol Pharmacol       Date:  2018-02-23       Impact factor: 3.271

7.  Guidance on allergenicity assessment of genetically modified plants.

Authors:  Hanspeter Naegeli; Andrew Nicholas Birch; Josep Casacuberta; Adinda De Schrijver; Mikolaj Antoni Gralak; Philippe Guerche; Huw Jones; Barbara Manachini; Antoine Messéan; Elsa Ebbesen Nielsen; Fabien Nogué; Christophe Robaglia; Nils Rostoks; Jeremy Sweet; Christoph Tebbe; Francesco Visioli; Jean-Michel Wal; Philippe Eigenmann; Michelle Epstein; Karin Hoffmann-Sommergruber; Frits Koning; Martinus Lovik; Clare Mills; Francisco Javier Moreno; Henk van Loveren; Regina Selb; Antonio Fernandez Dumont
Journal:  EFSA J       Date:  2017-06-22

Review 8.  An introduction to sequence similarity ("homology") searching.

Authors:  William R Pearson
Journal:  Curr Protoc Bioinformatics       Date:  2013-06

Review 9.  Food Allergens: Is There a Correlation between Stability to Digestion and Allergenicity?

Authors:  Katrine Lindholm Bøgh; Charlotte Bernhard Madsen
Journal:  Crit Rev Food Sci Nutr       Date:  2016-07-03       Impact factor: 11.176

10.  Value of eight-amino-acid matches in predicting the allergenicity status of proteins: an empirical bioinformatic investigation.

Authors:  Rod A Herman; Ping Song; Arvind Thirumalaiswamysekhar
Journal:  Clin Mol Allergy       Date:  2009-10-29