| Literature DB >> 33918556 |
Liyang Wang, Dantong Niu, Xinjie Zhao, Xiaoya Wang, Mengzhen Hao, Huilian Che.
Abstract
Traditional food allergen identification relies mainly on in vivo and in vitro experiments, which are often time-consuming and costly. Artificial intelligence (AI)-driven rapid food allergen identification methods overcome these drawbacks and are becoming an efficient auxiliary tool. To address the limited accuracy of traditional machine learning models in predicting the allergenicity of food proteins, this work introduced a deep learning model, the transformer with self-attention mechanism, together with ensemble learning models (represented by Light Gradient Boosting Machine (LightGBM) and eXtreme Gradient Boosting (XGBoost)). To highlight the superiority of the proposed method, the study also selected several commonly used machine learning models as baseline classifiers. The results of 5-fold cross-validation showed that the deep model achieved the highest area under the receiver operating characteristic curve (AUC, 0.9578), outperforming both the ensemble learning and baseline algorithms. However, the deep model needs to be pre-trained, and its training time is the longest. A comparison of the characteristics of the transformer model and the boosting models shows that each model has its own advantages, which provides novel clues and inspiration for the rapid prediction of food allergens in the future.
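The evaluation protocol described above (5-fold cross-validation scored by AUC) can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the synthetic data and the stand-in `LogisticRegression` classifier are assumptions for demonstration only.

```python
# Sketch of stratified 5-fold cross-validation with AUC as the metric,
# using scikit-learn. Data and classifier are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

# Toy binary-labeled feature matrix standing in for protein features.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

aucs = []
for train_idx, test_idx in StratifiedKFold(
    n_splits=5, shuffle=True, random_state=0
).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]  # probability of "allergen"
    aucs.append(roc_auc_score(y[test_idx], scores))

print(f"AUC: {np.mean(aucs):.4f} \u00b1 {np.std(aucs):.4f}")
```

Each fold holds out one fifth of the data; the mean and standard deviation over the five folds give the "value ± deviation" numbers reported throughout the record.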
Keywords: allergenicity prediction; comparative analysis; deep learning; ensemble learning; food allergens
Year: 2021 PMID: 33918556 PMCID: PMC8069377 DOI: 10.3390/foods10040809
Source DB: PubMed Journal: Foods ISSN: 2304-8158
Figure 1. The workflow of this study. Notes: NB: Naive Bayes; K-NN: K-nearest neighbor; SVM: Support Vector Machines.
Figure 2. Transformer structure.
Figure 3. Histogram algorithm flow.
Figure 4. Comparison of leaf-wise and level-wise growth strategies.
Figure 5. ROC curve and corresponding AUC value of the Bidirectional Encoder Representations from Transformers (BERT) model. Notes: ROC: receiver operating characteristic curve; AUC: area under the receiver operating characteristic curve.
Optimization results for the key parameters of the ensemble learning models. Notes: RF: Random Forest.
| Model | Key Parameters Names and Corresponding Values |
|---|---|
| LightGBM | n_estimators = 400, learning_rate = 0.1, max_depth = 5, num_leaves = 32 |
| XGBoost | learning_rate = 0.0001, n_estimators = 1000, max_depth = 5, subsample = 0.8, seed = 27 |
| RF | n_estimators = 60, max_depth = 13, min_samples_split = 120, min_samples_leaf = 20, max_features = 7 |
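The RF row of the table maps directly onto constructor arguments; a sketch assuming scikit-learn's `RandomForestClassifier` (the LightGBM and XGBoost rows map analogously onto the constructors of the `lightgbm` and `xgboost` packages). The toy feature matrix is an illustrative assumption.

```python
# Wiring the table's RF hyperparameters into scikit-learn's
# RandomForestClassifier. The data below is a toy stand-in.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=60,        # number of trees in the forest
    max_depth=13,           # maximum depth of each tree
    min_samples_split=120,  # minimum samples required to split a node
    min_samples_leaf=20,    # minimum samples required at a leaf
    max_features=7,         # features considered per split (data needs >= 7 features)
    random_state=0,
)

# Toy protein-feature matrix: 200 samples x 10 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary "allergen" label
rf.fit(X, y)
print(rf.predict(X[:5]))
```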
Performance of the ensemble learning models in the task of predicting food allergens.
| Model | Acc | Recall | Prec | F1 |
|---|---|---|---|---|
| LightGBM | 0.8686 ± 0.0132 | 0.8793 ± 0.0250 | 0.8571 ± 0.0388 | 0.8684 ± 0.0098 |
| XGBoost | 0.8186 ± 0.0248 | 0.7778 ± 0.0316 | 0.8426 ± 0.0491 | 0.7981 ± 0.0235 |
| RF | 0.7797 ± 0.0370 | 0.7586 ± 0.0449 | 0.7857 ± 0.0544 | 0.7720 ± 0.0182 |
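The "±" entries above are the mean and standard deviation of each metric over the five cross-validation folds. A small formatting helper (a hypothetical utility matching the tables' layout, with made-up per-fold values for illustration):

```python
import numpy as np

def fold_summary(values):
    """Format per-fold metric values as 'mean ± std', as in the tables."""
    arr = np.asarray(values, dtype=float)
    return f"{arr.mean():.4f} \u00b1 {arr.std():.4f}"

# Hypothetical per-fold accuracies, for illustration only.
print(fold_summary([0.86, 0.88, 0.87, 0.85, 0.88]))  # -> 0.8680 ± 0.0117
```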
Figure 6. ROC curves and corresponding AUC values of the ensemble models. (A) ROC curve and AUC value of LightGBM; (B) ROC curve and AUC value of XGBoost; (C) ROC curve and AUC value of RF.
Optimization results for the key parameters of the previous machine learning models.
| Model | Key Parameters Names and Corresponding Values |
|---|---|
| SVM | C = 1.0, kernel = ‘rbf’, gamma = 0.01 |
| K-NN | n_neighbors = 5, n_jobs = 1 |
| NB | alpha = 0.9 |
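The baseline rows likewise map onto constructor arguments, assuming the scikit-learn implementations. Note that `alpha = 0.9` implies a smoothed Naive Bayes variant; `MultinomialNB` is an assumption here, since the record only says "NB". The toy count-feature data is illustrative.

```python
# Sketch: instantiating the baseline classifiers with the table's
# key parameters, assuming scikit-learn implementations.
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

svm = SVC(C=1.0, kernel="rbf", gamma=0.01)
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=1)
nb = MultinomialNB(alpha=0.9)  # alpha: Laplace/Lidstone smoothing

# Toy non-negative count features (MultinomialNB requires non-negative input).
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 6)).astype(float)
y = (X.sum(axis=1) > 12).astype(int)

for name, clf in [("SVM", svm), ("K-NN", knn), ("NB", nb)]:
    clf.fit(X, y)
    print(name, clf.score(X, y))
```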
Performance of previous machine learning models in the task of predicting food allergens.
| Model | Acc | Recall | Prec | F1 |
|---|---|---|---|---|
| SVM | 0.7418 ± 0.0420 | 0.7032 ± 0.0443 | 0.7591 ± 0.0524 | 0.7303 ± 0.0389 |
| K-NN | 0.7722 ± 0.0234 | 0.7436 ± 0.0385 | 0.7838 ± 0.0410 | 0.7630 ± 0.0132 |
| NB | 0.7203 ± 0.0375 | 0.6293 ± 0.0466 | 0.7604 ± 0.0517 | 0.6891 ± 0.0221 |
Figure 7. ROC curves and corresponding AUC values of the previous models. (A) ROC curve and AUC value of SVM; (B) ROC curve and AUC value of K-NN; (C) ROC curve and AUC value of NB.
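Each ROC curve in Figures 5-7 reduces to a single AUC number, which equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (the Mann-Whitney U statistic). A minimal pure-Python computation, independent of any plotting library; the example labels and scores are made up for illustration:

```python
def auc_score(labels, scores):
    """AUC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs where the positive sample scores
    higher, with ties counted as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 8 of the 9 positive/negative pairs are ranked correctly.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(auc_score(labels, scores))  # -> 0.888... (8/9)
```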
Comparison of different types of models in this work.
| Model Type | Model Name | Prediction Accuracy (%) | Time Consumed | Computing Equipment |
|---|---|---|---|---|
| Deep Learning | BERT | 93.10 | About 19,500 s | NVIDIA® Tesla T4 GPU, accelerated by CUDA |
| Ensemble Learning | LightGBM | 86.86 | About 100 s | Intel Core i7-6700HQ CPU, 3.5 GHz, 4 GB memory |
| Ensemble Learning | XGBoost | 81.86 | About 125 s | Intel Core i7-6700HQ CPU, 3.5 GHz, 4 GB memory |
| Ensemble Learning | RF | 77.97 | About 90 s | Intel Core i7-6700HQ CPU, 3.5 GHz, 4 GB memory |
| Previous Machine Learning | SVM | 74.18 | About 75 s | Intel Core i7-6700HQ CPU, 3.5 GHz, 4 GB memory |
| Previous Machine Learning | K-NN | 77.22 | About 70 s | Intel Core i7-6700HQ CPU, 3.5 GHz, 4 GB memory |
| Previous Machine Learning | NB | 72.03 | About 60 s | Intel Core i7-6700HQ CPU, 3.5 GHz, 4 GB memory |