Xue-Zhen Hong, Xian-Shu Fu, Zheng-Liang Wang, Li Zhang, Xiao-Ping Yu, Zi-Hong Ye.
Abstract
This work presents a reliable approach to tracing the geographical origin of tea despite the changes that different harvest years introduce. A total of 1447 tea samples collected from various areas in 2014 (660 samples) and 2015 (787 samples) were measured by Fourier transform near-infrared (FT-NIR) spectroscopy. Seven classifiers trained on the 2014 dataset all succeeded in tracing the origins of samples collected in 2014; however, all of them failed to predict the origins of the 2015 samples because of the shifted data distribution and class imbalance. Three outlier-detection-based undersampling approaches (one-class SVM (OC-SVM), isolation forest, and elliptic envelope) were then proposed; as a result, the highest macro average recall (MAR) on the 2015 dataset improved from 56.86% to 73.95% (with the SVM classifier). A model-updating approach was also applied, and the prediction MAR improved significantly as the updating rate increased. The best MAR (90.31%) was first reached by the OC-SVM-combined SVM classifier at a 50% updating rate.
Year: 2019 PMID: 30719371 PMCID: PMC6335731 DOI: 10.1155/2019/1537568
Source DB: PubMed Journal: J Anal Methods Chem ISSN: 2090-8873 Impact factor: 2.193
Description of tea samples.
| Year | Group ID | Type | Group size | Tea gardens per group | Raw variables per sample |
|---|---|---|---|---|---|
| 2014 | 2014IN | WY | 495 | 33 | 4148 |
| 2014 | 2014OUT | NWY | 165 | 11 | 4148 |
| 2015 | 2015IN | WY | 687 | 74 | 4148 |
| 2015 | 2015OUT | NWY | 100 | 10 | 4148 |

WY: authentic Wuyi rock-essence tea with a protected geographical indication (GI); NWY: non-Wuyi tea.
Figure 1. NIR responses (average values) to the four tea groups collected from different geographical origins in different years.
Comparative classification of tea samples based on seven different classifiers.
| Classifier | Training set (70% of 2014 dataset) | | | Testing set (30% of 2014 dataset) | | | 2015 dataset | | |
|---|---|---|---|---|---|---|---|---|---|
| | Recall 1 | Recall 2 | MAR | Recall 1 | Recall 2 | MAR | Recall 1 | Recall 2 | MAR |
| LDA | 0.9994 | 0.9983 | 0.99885 | 0.9826 | 0.8925 | 0.93755 | 0.6199 | 0.2347 | 0.4273 |
| SVM | 1.0000 | 1.0000 | 1.0000 | 0.9639 | 0.8291 | 0.8965 | 0.9074 | 0.2298 | 0.5686 |
| SGD | 0.9987 | 0.9804 | 0.98955 | 0.9740 | 0.7961 | 0.88505 | 0.6621 | 0.1638 | 0.41295 |
| Decision tree | 1.0000 | 1.0000 | 1.0000 | 0.9132 | 0.7046 | 0.8089 | 0.8461 | 0.2461 | 0.5461 |
| Random forest | 0.9996 | 0.9904 | 0.9950 | 0.9703 | 0.7046 | 0.83745 | 0.8723 | 0.2195 | 0.5459 |
| AdaBoost | 1.0000 | 1.0000 | 1.0000 | 0.9691 | 0.7963 | 0.8827 | 0.8370 | 0.2622 | 0.5496 |
| MLP | 1.0000 | 1.0000 | 1.0000 | 0.9670 | 0.8957 | 0.93135 | 0.5842 | 0.2576 | 0.4209 |

MAR: macro average recall (mean of Recall 1 and Recall 2); SGD: stochastic gradient descent; AdaBoost: adaptive boosting; MLP: multilayer perceptron.
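The MAR reported above is the unweighted mean of the two per-class recalls, so the small NWY group counts as much as the WY majority — which is why classifiers with near-perfect overall accuracy on 2014 can still score a low MAR on 2015. A minimal sketch with scikit-learn; the data are synthetic stand-ins for the FT-NIR spectra (the class weights only mimic the 495/165 WY/NWY split):

```python
# Macro average recall (MAR) on an imbalanced two-class problem.
# Synthetic data, not the paper's FT-NIR dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=660, n_features=50,
                           weights=[0.75, 0.25],  # imbalance like 495 WY vs 165 NWY
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)  # 70/30 split
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

per_class = recall_score(y_te, y_pred, average=None)  # [Recall 1, Recall 2]
mar = recall_score(y_te, y_pred, average="macro")     # MAR column of the table
print(per_class, mar)
```

By construction `mar` equals `per_class.mean()`, which is how the MAR values in the tables relate to the two recall columns.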
Improving the prediction of 2015 tea samples using three outlier-detection-based undersampling approaches.
| Classifier | One-class SVM (OC-SVM) | | | Isolation forest (IF) | | | Elliptic envelope (EE) | | |
|---|---|---|---|---|---|---|---|---|---|
| | Recall 1 | Recall 2 | MAR | Recall 1 | Recall 2 | MAR | Recall 1 | Recall 2 | MAR |
| LDA | 0.4993 | 0.3700 | 0.43465 | 0.4643 | 0.4100 | 0.43715 | 0.4498 | 0.3900 | 0.4199 |
| SVM | 0.7409 | 0.7300 | 0.73545 | 0.7191 | 0.7600 | 0.73955 | 0.7118 | 0.7500 | 0.7309 |
| SGD | 0.5997 | 0.5100 | 0.55485 | 0.5997 | 0.5100 | 0.55485 | 0.5662 | 0.5100 | 0.5381 |
| Decision tree | 0.6652 | 0.5700 | 0.6176 | 0.6958 | 0.5800 | 0.6379 | 0.6201 | 0.5400 | 0.58005 |
| Random forest | 0.7089 | 0.6300 | 0.66945 | 0.6812 | 0.6200 | 0.6506 | 0.6710 | 0.6100 | 0.6405 |
| AdaBoost | 0.6405 | 0.6800 | 0.66025 | 0.7322 | 0.5400 | 0.6361 | 0.6667 | 0.5600 | 0.61335 |
| MLP | 0.6856 | 0.5200 | 0.6028 | 0.8020 | 0.3800 | 0.5910 | 0.7205 | 0.3800 | 0.55025 |

MAR: macro average recall (mean of Recall 1 and Recall 2); SGD: stochastic gradient descent; MLP: multilayer perceptron.
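The record does not spell out the implementation, but outlier-detection-based undersampling is commonly done by fitting the detector to the majority class alone and discarding the samples it flags as outliers before the classifier is retrained on the reduced set. A sketch with the three scikit-learn detectors named in the table; the data are synthetic and the `nu`/`contamination` settings are illustrative assumptions, not the paper's parameters:

```python
# Outlier-detection-based undersampling of the majority class, then an SVM.
# Synthetic data; detector parameters are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import OneClassSVM, SVC
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=800, n_features=20,
                           weights=[0.85, 0.15], random_state=1)
maj = X[y == 0]  # majority class to be undersampled

detectors = {
    "OC-SVM": OneClassSVM(nu=0.3),
    "IF": IsolationForest(contamination=0.3, random_state=1),
    "EE": EllipticEnvelope(contamination=0.3, random_state=1),
}
for name, det in detectors.items():
    keep = det.fit(maj).predict(maj) == 1  # +1 = inlier, -1 = outlier
    X_bal = np.vstack([maj[keep], X[y == 1]])          # pruned majority + minority
    y_bal = np.hstack([np.zeros(keep.sum()), np.ones((y == 1).sum())])
    clf = SVC().fit(X_bal, y_bal)
    mar = recall_score(y, clf.predict(X), average="macro")
    print(f"{name}: kept {keep.sum()}/{maj.shape[0]} majority samples, MAR={mar:.3f}")
```

All three detectors share the same `fit`/`predict` interface (+1 inlier, -1 outlier), which is what makes swapping them in and out of the undersampling step straightforward.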
Figure 2. Prediction of tea samples from 2015 based on OC-SVM- and IF-combined SVM classifiers at different updating rates.
Comparison of prediction results for the 2015 dataset based on OC-SVM- and IF-combined SVM classifiers at different updating rates.
| Updating rate | One-class SVM (OC-SVM) | | | Isolation forest (IF) | | |
|---|---|---|---|---|---|---|
| | Recall 1 | Recall 2 | MAR | Recall 1 | Recall 2 | MAR |
| 10% | 0.8275 | 0.7324 | 0.77995 | 0.8414 | 0.7257 | 0.78355 |
| 20% | 0.8482 | 0.8386 | 0.8434 | 0.8617 | 0.7983 | 0.8300 |
| 30% | 0.8612 | 0.8749 | 0.86805 | 0.8717 | 0.8312 | 0.85145 |
| 40% | 0.8739 | 0.9077 | 0.8908 | 0.8767 | 0.8599 | 0.8683 |
| 50% | 0.8824 | 0.9238 | 0.9031 | 0.8802 | 0.8727 | 0.87645 |
| 60% | 0.8867 | 0.9379 | 0.9123 | 0.8860 | 0.8770 | 0.8815 |

MAR: macro average recall (mean of Recall 1 and Recall 2).
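The model-updating step amounts to folding a fraction of the new season's samples (the "updating rate") into the training pool, refitting, and evaluating on the remaining new-season samples. A sketch under those assumptions; the synthetic data only stand in for the 2014/2015 FT-NIR measurements, with a small distribution shift simulating the harvest-year drift:

```python
# Model updating: add a fraction of new-season samples to training, refit,
# evaluate on the rest of the new season. Synthetic data throughout.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import recall_score

X14, y14 = make_classification(n_samples=660, n_features=30, random_state=0)
# Different random_state and a feature shift stand in for the 2015 drift.
X15, y15 = make_classification(n_samples=787, n_features=30, shift=0.5,
                               random_state=1)

for rate in (0.1, 0.3, 0.5):
    # "Updating rate": this fraction of 2015 samples joins the training pool.
    X_up, X_rest, y_up, y_rest = train_test_split(
        X15, y15, train_size=rate, stratify=y15, random_state=0)
    clf = SVC().fit(np.vstack([X14, X_up]), np.hstack([y14, y_up]))
    mar = recall_score(y_rest, clf.predict(X_rest), average="macro")
    print(f"updating rate {rate:.0%}: MAR={mar:.3f}")
```

Note the evaluation set shrinks as the rate grows — the table's 50% and 60% rows are computed on the half or less of the 2015 samples not used for updating.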