Leonieke M van den Bulk, Yamine Bouzembrak, Anand Gavai, Ningjing Liu, Lukas J van den Heuvel, Hans J P Marvin.
Abstract
Systematic reviews are used to collect relevant literature to answer a research question in a way that is clear, thorough, unbiased and reproducible. They are a standard method in the domain of food safety for obtaining an overview of state-of-the-art research on food safety topics of interest. A disadvantage of systematic reviews, however, is that the process is time-consuming and requires expert domain knowledge. The work reported here aims to reduce the time an expert needs to screen all potentially relevant articles by applying machine learning techniques to classify the articles automatically as either relevant or not relevant. Eight different machine learning algorithms and ensembles of all combinations of these algorithms were tested on two different systematic reviews on food safety (i.e. chemical hazards in cereals and leafy greens). The best performance was obtained by an ensemble of naive Bayes and a support vector machine, resulting in an average decrease of 32.8% in the number of articles the expert has to read and an average decrease in irrelevant articles of 57.8% while keeping 95% of the relevant articles. It was concluded that automatic classification of the literature in a systematic literature review can support experts in their task and save valuable time without compromising the quality of the review.
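
The trade-off described in the abstract (fewer articles to read while retaining most of the relevant ones) can be illustrated with a small sketch. It assumes each article has already received a relevance score between 0 and 1 from some classifier. The function names, the toy scores and the rule for choosing the threshold are hypothetical and are not taken from the paper, which reports results for a fixed threshold.

```python
def pick_threshold(scores, labels, target_recall=0.95):
    """Return the largest threshold that still keeps `target_recall`
    of the relevant articles (label 1) at or above it."""
    n_relevant = sum(labels)
    if n_relevant == 0:
        raise ValueError("need at least one relevant article")
    for threshold in sorted(set(scores), reverse=True):
        kept = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        if kept / n_relevant >= target_recall:
            return threshold


def screening_summary(scores, labels, threshold):
    """Summarise what the expert still has to read at a given threshold."""
    to_read = [y for s, y in zip(scores, labels) if s >= threshold]
    n_relevant = sum(labels)
    n_irrelevant = len(labels) - n_relevant
    return {
        "workload_reduction": 1 - len(to_read) / len(labels),
        "relevant_kept": sum(to_read) / n_relevant,
        "irrelevant_removed": 1 - (len(to_read) - sum(to_read)) / n_irrelevant,
    }


# Made-up scores for ten articles: 1 = relevant, 0 = not relevant.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.92, 0.81, 0.40, 0.30, 0.55, 0.28, 0.20, 0.15, 0.10, 0.05]

threshold = pick_threshold(scores, labels, target_recall=0.95)
summary = screening_summary(scores, labels, threshold)
print(threshold, summary)
```

In this toy case the expert reads half of the articles and still sees every relevant one. Real score distributions overlap far more, so the achievable reduction depends on the data.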
Keywords: Artificial intelligence; Classification models; Document screening; Food safety hazards; Literature reviews; Text mining
Year: 2021 PMID: 35024621 PMCID: PMC8728304 DOI: 10.1016/j.crfs.2021.12.010
Source DB: PubMed Journal: Curr Res Food Sci ISSN: 2665-9271
The data augmentation parameters for each of the algorithms in the two data cases: cereals and leafy greens.
| Algorithm | SMOTE (cereals) | SO (cereals) | SMOTE (leafy greens) | SO (leafy greens) |
|---|---|---|---|---|
| AdaBoost | False | False | True | True |
| BERT | False | False | True | False |
| Gradient boosting | False | False | True | True |
| Logistic Regression | False | False | True | False |
| LSTM | False | False | False | False |
| Naive Bayes | False | False | True | True |
| Random forest | False | False | True | False |
| Support vector machine | False | False | False | False |
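
SMOTE (Synthetic Minority Over-sampling Technique) balances a training set by creating synthetic minority-class samples on the line segments between existing minority samples and their nearest neighbours. The following is a minimal pure-Python sketch of that idea, not the implementation used in the study. The function name, the value of `k` and the toy feature vectors are assumptions for illustration only.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create `n_synthetic` points by interpolating between minority
    samples and their k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D feature vectors standing in for vectorised abstracts (made up).
relevant = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9], [0.3, 1.0]]
new_points = smote(relevant, n_synthetic=6, k=2)
print(len(new_points))
```

Because every synthetic point is an interpolation between two real minority samples, it always stays inside the region spanned by the minority class.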
Performance of the trained models on the test and future set from the systematic review on cereals. Performance is shown in terms of precision, recall and F1 score for the relevant and not relevant class. An average across the two classes is also shown. The best values per column and set are boldfaced.
| Algorithm | Set | Precision (relevant) | Recall (relevant) | F1 (relevant) | Precision (not relevant) | Recall (not relevant) | F1 (not relevant) | Precision (average) | Recall (average) | F1 (average) |
|---|---|---|---|---|---|---|---|---|---|---|
| AB | Test set | 75.0% | 76.4% | 75.7% | 84.0% | 82.9% | 83.4% | 79.5% | 79.6% | 79.6% |
| AB | Future set | 79.4% | 70.4% | 74.6% | 75.0% | 82.9% | 78.7% | 77.2% | 76.7% | 76.7% |
| BERT | Test set | 80.0% | 87.3% | 83.5% | 90.9% | 85.4% | 88.1% | 85.5% | 86.3% | 85.8% |
| BERT | Future set | | 71.8% | 80.3% | 78.0% | | 85.0% | 84.5% | 82.6% | 82.7% |
| GB | Test set | 81.2% | 70.9% | 75.7% | 82.0% | 89.0% | 85.4% | 81.6% | 80.0% | 80.6% |
| GB | Future set | 85.5% | 66.2% | 74.6% | 73.9% | 89.5% | 81.0% | 79.7% | 77.8% | 77.8% |
| LR | Test set | | 85.5% | | 90.1% | | | | | |
| LR | Future set | 90.0% | 76.1% | 82.4% | 80.5% | 92.1% | 85.9% | 85.2% | 84.1% | 84.2% |
| LSTM | Test set | 80.4% | 67.3% | 73.3% | 80.2% | 89.0% | 84.4% | 80.3% | 78.1% | 78.8% |
| LSTM | Future set | 90.6% | 67.6% | 77.4% | 75.5% | 93.4% | 83.5% | 83.0% | 80.5% | 80.5% |
| NB | Test set | 76.9% | | 83.3% | | 81.7% | 87.0% | 85.0% | 86.3% | 85.2% |
| NB | Future set | 85.7% | 84.5% | 85.1% | 85.7% | 86.8% | 86.3% | 85.7% | 85.7% | 85.7% |
| RF | Test set | 75.4% | 78.2% | 76.8% | 85.0% | 82.9% | 84.0% | 80.2% | 80.6% | 80.4% |
| RF | Future set | 87.3% | 77.5% | 82.1% | 81.0% | 89.5% | 85.0% | 84.1% | 83.5% | 83.5% |
| SVM | Test set | 81.4% | 87.3% | 84.2% | 91.0% | 86.6% | 88.8% | 86.2% | 86.9% | 86.5% |
| SVM | Future set | 91.0% | | | | 92.1% | | | | |
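
The per-class and averaged columns in this table follow from the usual definitions of precision, recall and F1. The sketch below shows how the three blocks of columns relate to one confusion matrix. The counts are illustrative values chosen to be consistent with the AB test-set row. They are an assumption, since the source does not report confusion counts, and the function name is hypothetical.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 (as percentages) for one class."""
    precision = 100 * tp / (tp + fp)
    recall = 100 * tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative confusion counts (assumed, not reported in the source).
tp, fp, fn, tn = 42, 14, 13, 68

relevant = class_metrics(tp, fp, fn)
# For the "not relevant" class the roles swap: tn are its true positives,
# fn its false positives and fp its false negatives.
not_relevant = class_metrics(tn, fn, fp)
# The "Average" columns are the unweighted mean of the two classes.
average = tuple((r + n) / 2 for r, n in zip(relevant, not_relevant))

for name, row in [("relevant", relevant),
                  ("not relevant", not_relevant),
                  ("average", average)]:
    print(name, " ".join(f"{value:.1f}%" for value in row))
```

With these counts the printed values agree with the AB test-set row to one decimal place, which suggests the "Average" columns are unweighted (macro) means of the two classes.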
Performance of the trained models on the test and future set from the systematic review on leafy greens. Performance is shown in terms of precision, recall and F1 score for the relevant and not relevant class. An average across the two classes is also shown. The best values per column and set are boldfaced.
| Algorithm | Set | Precision (relevant) | Recall (relevant) | F1 (relevant) | Precision (not relevant) | Recall (not relevant) | F1 (not relevant) | Precision (average) | Recall (average) | F1 (average) |
|---|---|---|---|---|---|---|---|---|---|---|
| AB | Test set | 80.0% | 57.1% | 66.7% | 83.3% | 93.8% | 88.2% | 81.7% | 75.4% | 77.5% |
| AB | Future set | 88.9% | 64.5% | 74.8% | 56.9% | 85.3% | 68.2% | 72.9% | 74.9% | 71.5% |
| BERT | Test set | 70.6% | | 77.4% | | 84.4% | 88.5% | 81.8% | | 83.0% |
| BERT | Future set | 85.9% | 88.7% | 87.3% | 78.1% | 73.5% | 75.8% | 82.0% | 81.1% | 81.5% |
| GB | Test set | 81.8% | 64.3% | 72.0% | 85.7% | 93.8% | 89.6% | 83.8% | 79.0% | 80.8% |
| GB | Future set | 85.4% | 56.5% | 68.0% | 50.9% | 82.4% | 62.9% | 68.1% | 69.4% | 65.4% |
| LR | Test set | 76.9% | 71.4% | 74.1% | 87.9% | 90.6% | 89.2% | 82.4% | 81.0% | 81.7% |
| LR | Future set | 87.8% | 69.4% | 77.5% | 59.6% | 82.4% | 69.1% | 73.7% | 75.9% | 73.3% |
| LSTM | Test set | 62.5% | 71.4% | 66.7% | 86.7% | 81.2% | 83.9% | 74.6% | 76.3% | 75.3% |
| LSTM | Future set | 83.9% | 83.9% | 83.9% | 70.6% | 70.6% | 70.6% | 77.2% | 77.2% | 77.2% |
| NB | Test set | 64.7% | 78.6% | 71.0% | 89.7% | 81.2% | 85.2% | 77.2% | 79.9% | 78.1% |
| NB | Future set | 84.5% | | | | 67.6% | | | | |
| RF | Test set | | 64.3% | 72.0% | 85.7% | | 89.6% | 83.8% | 79.0% | 80.8% |
| RF | Future set | 88.1% | 59.7% | 71.2% | 53.7% | 85.3% | 65.9% | 70.9% | 72.5% | 68.5% |
| SVM | Test set | 78.6% | 78.6% | | 90.6% | 90.6% | | | 84.6% | |
| SVM | Future set | | 72.6% | 80.4% | 63.0% | | 72.5% | 76.5% | 78.9% | 76.4% |
Performance of the top five best ensemble models on the test and future set from the systematic review on cereals. Performance is shown in terms of precision, recall and F1 score for the relevant and not relevant class. An average across the two classes is also shown.
| Ensemble top 5 | Set | Precision (relevant) | Recall (relevant) | F1 (relevant) | Precision (not relevant) | Recall (not relevant) | F1 (not relevant) | Precision (average) | Recall (average) | F1 (average) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1. NB, SVM | Test set | 82.0% | 90.9% | 86.2% | 93.4% | 86.6% | 89.9% | 87.7% | 87.7% | 88.0% |
| 2. AB, LR, NB, RF, SVM | Test set | 83.1% | 89.1% | 86.0% | 92.3% | 87.8% | 90.0% | 87.7% | 88.4% | 88.0% |
| 3. GB, NB, SVM | Test set | 84.2% | 87.3% | 85.7% | 91.2% | 89.0% | 90.1% | 87.7% | 88.1% | 87.9% |
| 4. GB, LR, NB, SVM | Test set | 84.2% | 87.3% | 85.7% | 91.2% | 89.0% | 90.1% | 87.7% | 88.1% | 87.9% |
| 5. AB, GB, NB, SVM | Test set | 84.2% | 87.3% | 85.7% | 91.2% | 89.0% | 90.1% | 87.7% | 88.1% | 87.9% |
| 1. NB, SVM | Future set | 91.3% | 88.7% | 90.0% | 89.7% | 92.1% | 90.9% | 90.5% | 90.4% | 90.5% |
| 2. AB, NB, SVM | Future set | 91.3% | 88.7% | 90.0% | 89.7% | 92.1% | 90.9% | 90.5% | 90.4% | 90.5% |
| 3. AB, SVM, NB | Future set | 91.0% | 85.9% | 88.4% | 87.5% | 92.1% | 89.7% | 89.3% | 89.0% | 89.1% |
| 4. RF, SVM, AB | Future set | 91.0% | 85.9% | 88.4% | 89.3% | 89.0% | 89.7% | 89.3% | 89.0% | 89.1% |
| 5. NB, RF, SVM | Future set | 91.0% | 85.9% | 88.4% | 87.5% | 92.1% | 89.7% | 89.3% | 89.0% | 89.1% |
Performance of the top five best ensemble models on the test and future set from the systematic review on leafy greens. Performance is shown in terms of precision, recall and F1 score for the relevant and not relevant class. An average across the two classes is also shown.
| Ensemble top 5 | Set | Precision (relevant) | Recall (relevant) | F1 (relevant) | Precision (not relevant) | Recall (not relevant) | F1 (not relevant) | Precision (average) | Recall (average) | F1 (average) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1. AB, SVM | Test set | 78.6% | 78.6% | 78.6% | 90.6% | 90.6% | 90.6% | 84.6% | 84.6% | 84.6% |
| 2. BERT, SVM | Test set | 78.6% | 78.6% | 78.6% | 90.6% | 90.6% | 90.6% | 84.6% | 84.6% | 84.6% |
| 3. NB, SVM | Test set | 78.6% | 78.6% | 78.6% | 90.6% | 90.6% | 90.6% | 84.6% | 84.6% | 84.6% |
| 4. AB, BERT, SVM | Test set | 78.6% | 78.6% | 78.6% | 90.6% | 90.6% | 90.6% | 84.6% | 84.6% | 84.6% |
| 5. AB, NB, SVM | Test set | 78.6% | 78.6% | 78.6% | 90.6% | 90.6% | 90.6% | 84.6% | 84.6% | 84.6% |
| 1. AB, NB | Future set | 84.3% | 95.2% | 89.4% | 88.5% | 67.6% | 76.7% | 86.4% | 81.4% | 83.0% |
| 2. LR, NB, SVM | Future set | 91.1% | 82.3% | 86.4% | 72.5% | 85.3% | 78.4% | 81.8% | 83.8% | 82.4% |
| 3. BERT, NB | Future set | 85.1% | 91.9% | 88.4% | 82.8% | 70.6% | 76.2% | 83.9% | 81.3% | 82.3% |
| 4. AB, BERT, NB | Future set | 85.1% | 91.9% | 88.4% | 82.8% | 70.6% | 76.2% | 83.9% | 81.3% | 82.3% |
| 5. NB, SVM | Future set | 89.7% | 83.9% | 86.7% | 73.7% | 82.4% | 77.8% | 81.7% | 83.1% | 82.2% |
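
Ranking ensembles of every combination of base models, as in the two tables above, can be sketched as follows. The source does not state how member predictions are combined, so this sketch assumes soft voting (the mean of the members' predicted probabilities) and ranks by F1 on the relevant class. The function names and the toy probabilities are hypothetical.

```python
from itertools import combinations


def f1_relevant(y_true, y_pred):
    """F1 score for the relevant class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def ensemble_scores(model_probs, members):
    """Average the members' predicted probabilities per article."""
    n_articles = len(model_probs[members[0]])
    return [
        sum(model_probs[m][i] for m in members) / len(members)
        for i in range(n_articles)
    ]


def rank_ensembles(model_probs, y_true, threshold=0.5):
    """Score every ensemble of two or more models, best first."""
    names = sorted(model_probs)
    results = []
    for size in range(2, len(names) + 1):
        for members in combinations(names, size):
            scores = ensemble_scores(model_probs, members)
            y_pred = [int(s >= threshold) for s in scores]
            results.append((f1_relevant(y_true, y_pred), members))
    return sorted(results, reverse=True)


# Made-up probabilities of "relevant" for six articles from three models.
y_true = [1, 1, 1, 0, 0, 0]
model_probs = {
    "LR": [0.6, 0.05, 0.6, 0.6, 0.1, 0.3],
    "NB": [0.9, 0.4, 0.8, 0.6, 0.2, 0.1],
    "SVM": [0.8, 0.9, 0.3, 0.2, 0.4, 0.1],
}
ranking = rank_ensembles(model_probs, y_true)
for score, members in ranking:
    print(f"{score:.3f}", ", ".join(members))
```

With eight base models this enumeration covers 247 ensembles of two or more members, which is still cheap to evaluate once each model's probabilities have been computed.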
Performance of the best ensemble model (NB and SVM) with a threshold of 0.25 on the test and future set from the systematic review on cereals and leafy greens. Performance is shown in terms of precision, recall and F1 score for the relevant and not relevant class. An average across the two classes and an average across the data sets is also shown.
| Case | Set | Precision (relevant) | Recall (relevant) | F1 (relevant) | Precision (not relevant) | Recall (not relevant) | F1 (not relevant) | Precision (average) | Recall (average) | F1 (average) |
|---|---|---|---|---|---|---|---|---|---|---|
| Cereals | Test set | 57.1% | 94.5% | 71.2% | 93.5% | 52.4% | 67.2% | 75.3% | 73.5% | 69.2% |
| Cereals | Future set | 71.4% | 98.6% | 82.8% | 98.0% | 63.2% | 76.8% | 84.7% | 80.9% | 79.8% |
| Leafy greens | Test set | 52.0% | 92.9% | 66.7% | 95.2% | 62.5% | 75.5% | 73.6% | 77.7% | 71.1% |
| Leafy greens | Future set | 79.5% | 100.0% | 88.6% | 100.0% | 52.9% | 69.2% | 89.7% | 76.5% | 78.9% |
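
The recall columns of this table connect to the screening figures quoted in the abstract: recall on the not relevant class is the share of irrelevant articles correctly screened out, and recall on the relevant class is the share of relevant articles retained. The short check below assumes the reported averages are unweighted means over the four case/set rows, which the source does not spell out.

```python
# Recall values (percent) copied from the threshold table above.
rows = [
    # (case, set, recall on relevant, recall on not relevant)
    ("Cereals", "Test set", 94.5, 52.4),
    ("Cereals", "Future set", 98.6, 63.2),
    ("Leafy greens", "Test set", 92.9, 62.5),
    ("Leafy greens", "Future set", 100.0, 52.9),
]

mean_relevant_kept = sum(row[2] for row in rows) / len(rows)
mean_irrelevant_removed = sum(row[3] for row in rows) / len(rows)

print(f"relevant articles kept:      {mean_relevant_kept:.2f}%")
print(f"irrelevant articles removed: {mean_irrelevant_removed:.2f}%")
```

Under that assumption the mean recall on the not relevant class is 57.75%, in line with the 57.8% decrease in irrelevant articles stated in the abstract, and the mean recall on the relevant class is 96.5%.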