Pallavi M Shanthappa, Rakshitha Kumar.
Abstract
Background: An allergic reaction is an overreaction of the immune system to a previously encountered, typically benign molecule, frequently a protein. Allergic reactions can cause rashes, itching, swelling of the mucous membranes, asthma, coughing, and other unusual symptoms. A wide range of principles and methods have been applied in bioinformatics to predict allergens. Sequence-similarity methods based on the FAO/WHO criteria have a very low positive predictive value, which makes it difficult to predict possible allergens.

Method: This work proposes a deep learning model, LSTM (Long Short-Term Memory), to overcome the limitations of traditional approaches and of lower-performing machine learning models in predicting the allergenicity of dietary proteins. A total of 2,427 allergens and 2,427 non-allergens from a variety of sources, including the Central Science Laboratory and the NCBI, were used. The data were split 80:20 for training and testing. All techniques were implemented in Python. Five E-descriptors were used to describe the protein sequences of allergens and non-allergens: E1 (hydrophilic character of peptides), E2 (length), E3 (propensity to form helices), E4 (abundance and dispersion), and E5 (propensity of beta strands). The ACC (auto- and cross-covariance) transformation was then applied to these descriptors to convert the variable-length protein sequences into vectors of uniform length. A total of eight machine learning techniques were considered.
Keywords: ACC transformation; ADA boost; Allergen prediction; Bagging classifier; Classifier; Extra tree classifier; Gaussian naive bayes; LSTM model; Linear discriminant analysis; Quadratic discriminant analysis
Year: 2022 PMID: 36131892 PMCID: PMC9484702 DOI: 10.5599/admet.1335
Source DB: PubMed Journal: ADMET DMPK ISSN: 1848-7718
Figure 1. Sample dataset in FASTA format of a protein sequence
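The FASTA layout referred to in Figure 1 stores each protein as a header line beginning with `>` followed by one or more lines of residues. A minimal sketch of reading such a file into memory is shown here; the function name `parse_fasta` and the two example records are hypothetical and are not taken from the study's dataset or code.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            records[header] += line
    return records


# Made-up records for illustration only.
example = """>hypothetical_protein_1
MKTAYIAKQR
QISFVKSHFS
>hypothetical_protein_2
GAVLIMC
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```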
Analysis of the classification results of different classifiers based on training and testing data.
| Method | Training data | Testing data |
|---|---|---|
| Gaussian Naive Bayes | 63.8 | 59.7 |
| Radius Neighbours Classifier | 50.5 | 47.6 |
| ADA Boost | 84.6 | 78.1 |
| Linear Discriminant Analysis | 78.6 | 74.9 |
| Quadratic Discriminant Analysis | 88.7 | 81.6 |
| Bagging Classifier | 88.4 | 86.3 |
| Extra Tree Classifier | 93.4 | 89.8 |
| LSTM (Long Short-Term Memory) | 94.1 | 91.5 |
Figure 2. Output of ACC transformation
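To illustrate how an ACC transformation can turn sequences of different lengths into vectors of the same length, the sketch below computes auto- and cross-covariance terms over per-residue descriptors. It assumes the common formulation in which each term is the lag-shifted product of two descriptors averaged over the sequence; the maximum lag, the function name `acc_transform`, and the descriptor values in `TOY_DESCRIPTORS` are all assumptions for illustration (the values are placeholders, not the published E-descriptors), and the record does not state which settings the authors used.

```python
from typing import Dict, List, Sequence


def acc_transform(sequence: str,
                  descriptors: Dict[str, Sequence[float]],
                  max_lag: int = 5) -> List[float]:
    """Auto- and cross-covariance (ACC) transform of one protein sequence.

    Returns d * d * max_lag values, where d is the number of descriptors
    per residue, regardless of the sequence length.
    """
    n = len(sequence)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    # One row of descriptor values (e.g. E1..E5) per residue.
    rows = [descriptors[aa] for aa in sequence]
    d = len(rows[0])
    features: List[float] = []
    for lag in range(1, max_lag + 1):
        for j in range(d):
            for k in range(d):
                # j == k is an auto-covariance term, j != k a cross-covariance term.
                total = sum(rows[i][j] * rows[i + lag][k] for i in range(n - lag))
                features.append(total / (n - lag))
    return features


# Placeholder values for illustration only; NOT the published E-descriptors.
TOY_DESCRIPTORS = {
    "A": (0.1, -0.2, 0.3, 0.0, 0.5),
    "G": (-0.4, 0.2, 0.1, 0.3, -0.1),
    "K": (0.6, 0.1, -0.3, 0.2, 0.0),
    "L": (-0.2, 0.4, 0.2, -0.5, 0.3),
}

short_vec = acc_transform("AGKLAGKL", TOY_DESCRIPTORS, max_lag=2)
long_vec = acc_transform("AGKLLKGAAGKLKLGA", TOY_DESCRIPTORS, max_lag=2)
print(len(short_vec), len(long_vec))  # same length for both sequences
```

With five descriptors and a maximum lag of 2, both sequences map to 5 × 5 × 2 = 50 values, which is what allows fixed-input classifiers to be trained on proteins of different lengths.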
Analysis of the effectiveness of different classifiers
| Method | Accuracy (%) | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Gaussian Naive Bayes | 64.14 | 0.74 | 0.46 | 0.56 |
| Radius Neighbours Classifier | 49.2 | 0.49 | 1.00 | 0.66 |
| ADA Boost | 76.9 | 0.79 | 0.75 | 0.77 |
| Linear Discriminant Analysis | 76.13 | 0.80 | 0.71 | 0.75 |
| Quadratic Discriminant Analysis | 84.2 | 0.93 | 0.74 | 0.83 |
| Bagging Classifier | 85.8 | 0.89 | 0.86 | 0.88 |
| Extra Tree Classifier | 90 | 0.95 | 0.85 | 0.90 |
| LSTM (Long Short-Term Memory) | 91.5 | 0.91 | 0.91 | 0.91 |
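The F1-score column can be cross-checked against the precision and recall columns, since F1 is the harmonic mean of the two. The sketch below recomputes it from the rounded values in the table above; the helper name `f1_from` and the 0.011 tolerance are choices made here for illustration, and small differences are expected because the published precision and recall are themselves rounded to two decimals.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "Gaussian Naive Bayes": (0.74, 0.46, 0.56),
    "Radius Neighbours Classifier": (0.49, 1.00, 0.66),
    "ADA Boost": (0.79, 0.75, 0.77),
    "Linear Discriminant Analysis": (0.80, 0.71, 0.75),
    "Quadratic Discriminant Analysis": (0.93, 0.74, 0.83),
    "Bagging Classifier": (0.89, 0.86, 0.88),
    "Extra Tree Classifier": (0.95, 0.85, 0.90),
    "LSTM (Long Short-Term Memory)": (0.91, 0.91, 0.91),
}

# Absolute gap between recomputed and reported F1 for each method.
gaps = {
    name: abs(f1_from(p, r) - f1)
    for name, (p, r, f1) in reported.items()
}
for name, gap in gaps.items():
    print(f"{name}: recomputed F1 differs from reported by {gap:.3f}")
```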
Assessment of web servers for allergenicity prediction
| Server | Accuracy |
|---|---|
| Allerhunter | 0.871 |
| AlgPred (SVM_single_aa) | 0.775 |
| AlgPred (SVM_dipeptide) | 0.796 |
| AlgPred(ARP) | 0.842 |
| APPEL | 0.783 |
| ProAp(motif) | 0.505 |
| ProAp(SVM) | 0.843 |
| AllerTop v.1 | 0.828 |
| AllergenFP | 0.879 |
| AllerTop v.2 | 0.887 |
| LSTM model | 0.915 |
Figure 3. The LSTM model’s ROC curve
Figure 4. Interface of ProAll-D for protein allergen detection.
Figure 5. Working of ProAll-D.
Figure 6. Data-set section