Jixiong Zhang1, Hong Yan1, Yanmei Xiong1, Qianqian Li2, Shungeng Min1.
Abstract
Wavelength selection is a critical factor for pattern recognition of vibrational spectroscopic data. Not only does it alleviate the effect of dimensionality on an algorithm's generalization performance, but it also enhances the understanding and interpretability of multivariate classification models. In this study, a novel partial least squares discriminant analysis (PLSDA)-based wavelength selection algorithm, termed ensemble of bootstrapping space shrinkage (EBSS), has been devised for vibrational spectroscopic data analysis. In the algorithm, a set of subsets is generated from a data set using random sampling. For each individual subset, a feature space is determined by maximizing the expected 10-fold cross-validation accuracy with a weighted bootstrap sampling strategy. An ensemble strategy and a sequential forward selection method are then applied to the feature spaces to select characteristic variables. Experimental results obtained from the analysis of real vibrational spectroscopic data sets demonstrate that the ensemble wavelength selection algorithm can retain stable and informative variables for the final modeling and improve the predictive ability of multivariate classification models. This journal is © The Royal Society of Chemistry.
Year: 2019 PMID: 35548689 PMCID: PMC9087301 DOI: 10.1039/c8ra08754g
Source DB: PubMed Journal: RSC Adv ISSN: 2046-2069 Impact factor: 3.361
Fig. 1 The core idea of the EBSS algorithm.
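The last stage described in the abstract (an ensemble strategy followed by sequential forward selection over the per-subset feature spaces) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the vote-count threshold `min_votes`, the function names, and the toy `score` function are all hypothetical. In practice `score` would presumably return the 10-fold cross-validation accuracy of a PLSDA model built on the given variables.

```python
from collections import Counter
from typing import Callable, Iterable, List, Sequence


def ensemble_candidates(feature_spaces: Iterable[Sequence[int]],
                        min_votes: int) -> List[int]:
    """Keep variables that occur in at least `min_votes` feature spaces.

    Assumption: the ensemble step is a simple frequency vote; the abstract
    does not say how the feature spaces are actually combined.
    """
    counts = Counter(v for space in feature_spaces for v in set(space))
    return sorted(v for v, c in counts.items() if c >= min_votes)


def sequential_forward_selection(candidates: Sequence[int],
                                 score: Callable[[List[int]], float]) -> List[int]:
    """Greedily add the candidate that most improves `score`; stop when none does."""
    selected: List[int] = []
    remaining = list(candidates)
    best = float("-inf")
    while remaining:
        # Score every one-variable extension of the current subset.
        trials = [(score(selected + [v]), v) for v in remaining]
        # Highest score wins; ties go to the smaller index for determinism.
        top_score, top_v = max(trials, key=lambda t: (t[0], -t[1]))
        if top_score <= best:
            break  # no extension improves the score
        selected.append(top_v)
        remaining.remove(top_v)
        best = top_score
    return selected


# Toy usage with made-up feature spaces and a made-up score (not paper data):
# variables 5 and 9 are "informative", and each extra variable costs 0.1.
spaces = [[1, 5, 9], [5, 9, 12], [5, 7, 9], [2, 5]]
toy_score = lambda subset: len(set(subset) & {5, 9}) - 0.1 * len(subset)
pool = ensemble_candidates(spaces, min_votes=1)
print(sequential_forward_selection(pool, toy_score))  # [5, 9]
```

The forward selection stops as soon as no single added variable raises the score, which is one common stopping rule; the source does not state which rule EBSS uses.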
Parameters for the GA-PLS-DA
| Parameter | Value |
|---|---|
| Population size | 50 chromosomes |
| Maximum number of generations | 100 |
| Generation gap | 0.95 |
| Crossover rate | 0.75 |
| Mutation rate | 0.01 |
| Maximum number of variables selected in the chromosome | 50 |
| Fitness value | 10-fold cross-validation accuracy of PLS-DA |
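For anyone reproducing the GA-based baseline, the settings in the table above could be collected in a small configuration object. The sketch below is only a convenience wrapper: the class name, field names, and range checks are assumptions made for illustration, and only the default values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GAPLSDAConfig:
    """Hypothetical container for the GA settings listed in the table."""
    population_size: int = 50          # chromosomes
    max_generations: int = 100
    generation_gap: float = 0.95
    crossover_rate: float = 0.75
    mutation_rate: float = 0.01
    max_selected_variables: int = 50   # per chromosome
    cv_folds: int = 10                 # fitness: 10-fold CV accuracy of PLS-DA

    def __post_init__(self) -> None:
        # Sanity checks (an assumption of this sketch, not from the source).
        for name in ("generation_gap", "crossover_rate", "mutation_rate"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")
        for name in ("population_size", "max_generations",
                     "max_selected_variables", "cv_folds"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be a positive integer")


config = GAPLSDAConfig()
print(config)
```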
Characteristics of the data sets
| Data set | Scan | No. of training samples | No. of test samples | No. of features | No. of classes |
|---|---|---|---|---|---|
| Olive oils | FTIR | 82 | 38 | 570 | 4 |
| Red wines | FTIR | 30 | 14 | 842 | 4 |
| NIR tablets | NIR | 211 | 99 | 404 | 4 |
| Raman tablets | Raman | 82 | 38 | 3401 | 4 |
Validation set accuracy (a_ave ± a_std, %)^a
| Data set | Type | PLS-DA | BSS | GA-PLS-DA | s-PLS-DA | EBSS |
|---|---|---|---|---|---|---|
| Olive oil | FTIR | 93.2 ± 2.2 | 94.7 ± 2.6 | 93.6 ± 3.1 | 95.1 ± 3.1 | 96.6 ± 3.2 |
| Red wine | FTIR | 59.3 ± 14.3 | 60 ± 13.4 | 60.4 ± 9.4 | 66.8 ± 9.6 | 71.1 ± 10.2 |
| NIR tablet | NIR | 88.9 ± 2.5 | 87 ± 3.6 | 86.4 ± 3.4 | 88.3 ± 2.9 | 89.3 ± 3.2 |
| Raman tablet | Raman | 85.8 ± 5.7 | 81.4 ± 4.2 | 80.4 ± 4.7 | 78.8 ± 4.9 | 89.3 ± 5.1 |
^a a_ave ± a_std: average accuracy rate ± standard error over 20 repeats.
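Summaries of this form can be computed from the per-repeat accuracies as sketched below. The footnote says "standard error" while the symbol reads a_std, and the source does not give the formula, so this sketch returns both the sample standard deviation and the standard error of the mean. The function names and the example numbers are made up for illustration and are not data from the paper.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(values: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error of the mean)."""
    ave = mean(values)
    std = stdev(values)              # sample standard deviation (n - 1 denominator)
    sem = std / sqrt(len(values))    # standard error of the mean
    return ave, std, sem


def format_summary(values: Sequence[float], digits: int = 1) -> str:
    """Format as 'mean ± std', matching the layout used in the table."""
    ave, std, _ = summarize(values)
    return f"{ave:.{digits}f} ± {std:.{digits}f}"


# Made-up accuracies (%) from four hypothetical repeats.
accuracies = [90.0, 92.0, 94.0, 96.0]
print(format_summary(accuracies))  # 93.0 ± 2.6
```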
The number of selected variables (n_ave ± n_std)^a
| Data set | Type | PLS-DA | BSS | GA-PLS-DA | s-PLS-DA | EBSS |
|---|---|---|---|---|---|---|
| Olive oil | FTIR | 570 | 34 ± 33 | 29 ± 10 | 69 ± 22 | 8 |
| Red wine | FTIR | 842 | 43 ± 34 | 33 ± 15 | 52 ± 31 | 21 |
| NIR tablet | NIR | 404 | 46 ± 21 | 44 ± 8 | 59 ± 18 | 20 |
| Raman tablet | Raman | 3041 | 58 ± 22 | 60 ± 8 | 77 ± 19 | 40 |
^a n_ave ± n_std: average number of selected variables ± standard error over 20 repeats.
Fig. 2 Variables selected by the different methods for the olive oil data: BSS (a), GA-PLS-DA (b), s-PLS-DA (c) and EBSS (d).
Fig. 3 Effect of number of variables selected by EBSS on the accuracy for the olive oil data.
Fig. 4 Variables selected by the different methods for the red wine data set: BSS (a), GA-PLS-DA (b), s-PLS-DA (c) and EBSS (d).
Selected variables for the four different data sets using EBSS
| Data set | Wavenumber (cm⁻¹) |
|---|---|
| Olive oil | 966.8, 1003.4, 1123.1, 1125.0, 1126.9, 1194.1, 1628.6, 1665.3 |
| Red wine | 956.0, 1114.1, 1202.8, 1222.0, 1237.5, 1279.9, 1303.0, 1499.6, 1518.9, 1526.6, 2313.0, 2347.7, 2525.0, 2733.2, 2737.1, 2798.7, 3666.1, 4167.3, 4444.8, 4556.6, 4919.0 |
| Tablet (NIR) | 7429.2, 7436.9, 7444.6, 7691.5, 7976.9, 7992.4, 8023.2, 8030.9, 8061.8, 8069.5, 8100.4, 8154.4, 8169.8, 8200.6, 8347.2, 8941.2, 10 198.7, 10 214.1, 10 353.0 |
| Tablet (Raman) | 3575, 3514, 3345, 3192, 3048, 3047, 2826, 2816, 2666, 2279, 2058, 2056, 1957, 1955, 1954, 1858, 1840, 1839, 1838, 1703, 1701, 1699, 1556, 1477, 1356, 1395, 1196, 1194, 1193, 1191, 1190, 993, 989, 983, 982, 639, 632, 597, 540, 449 |
Fig. 5 Effect of number of variables selected by EBSS on the accuracy for the red wine data.
Fig. 6 Variables selected by the different methods for the NIR tablet data: BSS (a), GA-PLS-DA (b), s-PLS-DA (c) and EBSS (d).
Fig. 7 Variables selected by the different methods for the Raman tablet data: BSS (a), GA-PLS-DA (b), s-PLS-DA (c) and EBSS (d).
Fig. 8 Effect of selected variables on the accuracy for the tablet data sets: (a) NIR and (b) Raman.