Ha X Dang1, Christopher B Lawrence2. 1. Virginia Bioinformatics Institute and Department of Biological Sciences, Virginia Tech, Blacksburg, VA 24061, USA. 2. Virginia Bioinformatics Institute and Department of Biological Sciences, Virginia Tech, Blacksburg, VA 24061, USA.
Abstract
MOTIVATION: Accurately identifying and eliminating allergens from biotechnology-derived products are important for human health. From a biomedical research perspective, it is also important to identify allergens in sequenced genomes. Many allergen prediction tools have been developed in recent years. Although these tools have achieved certain levels of specificity, when applied to large-scale allergen discovery (e.g. at a whole-genome scale), they still yield many false positives and thus low precision (even at low recall) because the data are extremely skewed (allergens are rare). Moreover, the most accurate tools are relatively slow because they use protein sequence alignment to build feature vectors for allergen classifiers. Additionally, the current allergen prediction tools are publicly available only as web servers, which do not support large batch submissions. These weaknesses make large-scale allergen discovery ineffective and inefficient in the public domain.
RESULTS: We developed Allerdictor, a fast and accurate sequence-based allergen prediction tool that models protein sequences as text documents and uses a support vector machine for text classification to predict allergens. Test results on multiple highly skewed datasets demonstrated that Allerdictor predicted allergens with high precision at high recall and at high speed. For example, Allerdictor took only ∼6 min on a single-core PC to scan the whole Swiss-Prot database of ∼540 000 sequences and identified <1% of them as allergens.
AVAILABILITY AND IMPLEMENTATION: Allerdictor is implemented in Python and available as standalone and web server versions at http://allerdictor.vbi.vt.edu
CONTACT: lawrence@vbi.vt.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
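The idea of treating a protein sequence as a text document can be illustrated with a small sketch. The abstract does not say how sequences are split into words, so the overlapping k-letter tokenization, the value k=3, and the function names `kmer_words` and `bag_of_words` below are all assumptions made for illustration, not Allerdictor's actual implementation.

```python
from collections import Counter


def kmer_words(sequence, k=3):
    """Split a protein sequence into overlapping k-letter 'words'.

    Hypothetical tokenizer: the word length k is an assumption,
    not a value taken from the abstract.
    """
    seq = sequence.upper()
    # A sequence shorter than k yields no words.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bag_of_words(sequence, k=3):
    """Count each word, giving a sparse 'document' vector for the sequence."""
    return Counter(kmer_words(sequence, k))


# Toy sequence, made up for illustration only.
toy_sequence = "MKVLAAGMKV"
toy_words = kmer_words(toy_sequence)
toy_counts = bag_of_words(toy_sequence)
```

Word counts of this kind are the sort of sparse feature vector that a text classifier such as a linear support vector machine could take as input. Because no sequence alignment is needed to build them, feature extraction stays cheap even for a database-scale scan.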