Tiezhu Shi, Huizeng Liu, Yiyun Chen, Teng Fei, Junjie Wang, Guofeng Wu.
Abstract
This study investigated the performance of pre-processing, feature selection and machine-learning methods for the spectroscopic diagnosis of soil arsenic contamination. The spectral data were pre-processed using Savitzky-Golay smoothing, first and second derivatives, multiplicative scatter correction, standard normal variate, and mean centering. Principal component analysis (PCA) and the RELIEF algorithm were used to extract spectral features. Machine-learning methods, including random forests (RF), artificial neural network (ANN), and radial basis function- and linear function-based support vector machines (RBF- and LF-SVM), were employed to establish diagnosis models. Model accuracies were evaluated and compared using overall accuracies (OAs). The statistical significance of the difference between models was evaluated using McNemar's test (Z value). The results showed that the OAs varied with the combination of pre-processing, feature selection, and classification methods. Feature selection methods could improve the modeling efficiencies and diagnosis accuracies, and RELIEF often outperformed PCA. The optimal models established by RF (OA = 86%), ANN (OA = 89%), RBF-SVM (OA = 89%) and LF-SVM (OA = 87%) showed no statistically significant difference in diagnosis accuracy (Z < 1.96 at the 0.05 significance level). These results indicated that it was feasible to diagnose soil arsenic contamination using reflectance spectroscopy. An appropriate combination of multivariate methods was important for improving diagnosis accuracies.
Keywords: feature selection; heavy metal contamination; machine-learning; spectral pre-processing; visible and near-infrared reflectance spectroscopy
Year: 2017 PMID: 28471412 PMCID: PMC5469641 DOI: 10.3390/s17051036
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Study areas (a) and spatial distribution of soil samples in Yixing (b) and Zhongxiang (c).
Confusion matrix of observed and diagnosed soil samples for calculating overall accuracy.¹
| | Observed: contaminated (positive, value = 1) | Observed: uncontaminated (negative, value = 0) |
|---|---|---|
| Predicted: contaminated (positive, value = 1) | pp | pn |
| Predicted: uncontaminated (negative, value = 0) | np | nn |
¹ pp: number of correctly diagnosed contaminated soil samples; np: number of falsely diagnosed uncontaminated soil samples; pn: number of falsely diagnosed contaminated soil samples; nn: number of correctly diagnosed uncontaminated soil samples.
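Overall accuracy follows from the four confusion-matrix counts as the share of samples on the diagonal, OA = (pp + nn) / (pp + pn + np + nn). A minimal sketch, assuming that standard definition; the counts below are hypothetical and not taken from the study, and `np_` is only a renaming of np to keep the name free of the common NumPy alias:

```python
def overall_accuracy(pp: int, pn: int, np_: int, nn: int) -> float:
    """Fraction of samples whose diagnosed class matches the observed class."""
    total = pp + pn + np_ + nn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (pp + nn) / total


# Hypothetical counts for a 97-sample test set (illustrative only).
oa = overall_accuracy(pp=20, pn=5, np_=8, nn=64)
print(f"OA = {oa:.1%}")
```

Because OA uses only the diagonal and the total, it does not distinguish between the two kinds of misdiagnosis.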
Assessment of the statistical significance of the difference between two diagnosis models using McNemar’s test.¹
| | Diagnosis model 2: correct | Diagnosis model 2: incorrect |
|---|---|---|
| Diagnosis model 1: correct | f11 | f12 |
| Diagnosis model 1: incorrect | f21 | f22 |
¹ f12: the test soil samples that are correctly diagnosed by diagnosis model 1 but misdiagnosed by diagnosis model 2; f21: the test soil samples that are correctly diagnosed by diagnosis model 2 but misdiagnosed by diagnosis model 1.
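The record does not reproduce the Z formula itself. A common form of McNemar's statistic for comparing two classifiers on the same test samples, without continuity correction, is Z = (f12 − f21) / √(f12 + f21). The sketch below assumes that form; the function names and the toy labels are hypothetical.

```python
import math


def discordant_counts(truth, pred1, pred2):
    """Count samples that exactly one of the two models diagnoses correctly."""
    f12 = sum(1 for t, a, b in zip(truth, pred1, pred2) if a == t and b != t)
    f21 = sum(1 for t, a, b in zip(truth, pred1, pred2) if a != t and b == t)
    return f12, f21


def mcnemar_z(f12: int, f21: int) -> float:
    """McNemar Z without continuity correction (assumed form)."""
    if f12 + f21 == 0:
        return 0.0  # the models never disagree in correctness
    return (f12 - f21) / math.sqrt(f12 + f21)


# Toy labels (1 = contaminated, 0 = uncontaminated), illustrative only.
truth = [1, 1, 0, 0, 1, 0]
model1 = [1, 1, 0, 1, 0, 0]
model2 = [1, 0, 0, 0, 0, 1]
f12, f21 = discordant_counts(truth, model1, model2)
print(f12, f21, round(mcnemar_z(f12, f21), 3))
```

Samples that both models get right or both get wrong do not enter the statistic, so only the off-diagonal counts of the table matter.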
Statistical descriptions of the arsenic contents (mg·kg⁻¹) and the percentage of contaminated samples (Per %).¹
| Data set | No. | Minimum | Maximum | Mean | Std. | Per % |
|---|---|---|---|---|---|---|
| Total data set | 195 | 1.91 | 133.36 | 18.13 | 18.67 | 27 |
| Training data set | 98 | 1.91 | 106.10 | 12.70 | 16.81 | 26 |
| Test data set | 97 | 4.40 | 133.36 | 19.00 | 20.43 | 29 |
¹ No.: number of samples; Std.: standard deviation.
Figure 2. The reflectance spectra and the first three principal components (PC1, PC2 and PC3) for the contaminated and uncontaminated soil samples: (a) original reflectance spectra, (b) mean-centered spectra, (c) standard normal variate spectra, (d) multiplicative scatter correction spectra, (e) first derivative spectra, and (f) second derivative spectra.
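Three of the transforms named in this caption (mean centering, standard normal variate, and multiplicative scatter correction) can be sketched with the standard library alone. This is an illustrative sketch under common textbook definitions, not the authors' implementation: mean centering is taken per band across samples, SNV uses the sample standard deviation of each spectrum, and MSC regresses each spectrum on the mean spectrum. The toy spectra are hypothetical.

```python
from statistics import mean, stdev


def mean_center(spectra):
    """Subtract each band's mean (across samples) from every spectrum."""
    band_means = [mean(col) for col in zip(*spectra)]
    return [[v - m for v, m in zip(row, band_means)] for row in spectra]


def snv(spectra):
    """Standard normal variate: zero mean and unit std within each spectrum."""
    out = []
    for row in spectra:
        mu, sd = mean(row), stdev(row)
        out.append([(v - mu) / sd for v in row])
    return out


def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum."""
    ref = [mean(col) for col in zip(*spectra)]
    ref_mu = mean(ref)
    sxx = sum((r - ref_mu) ** 2 for r in ref)
    out = []
    for row in spectra:
        row_mu = mean(row)
        # Least-squares fit: row ≈ intercept + slope * ref
        slope = sum((r - ref_mu) * (v - row_mu) for r, v in zip(ref, row)) / sxx
        intercept = row_mu - slope * ref_mu
        out.append([(v - intercept) / slope for v in row])
    return out


# Hypothetical reflectance values: 3 samples x 4 bands.
spectra = [
    [0.10, 0.20, 0.30, 0.40],
    [0.20, 0.40, 0.60, 0.80],
    [0.15, 0.25, 0.35, 0.45],
]
print(snv(spectra)[0])
print(msc(spectra)[1])
```

In this toy set every spectrum is an exact linear function of the mean spectrum, so MSC maps all three onto the mean spectrum; real spectra would retain residual differences.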
Figure 3. RELIEF weights and the selected spectral features for original reflectance spectra (a), mean centering spectra (b), standard normal variate spectra (c), multiplicative scatter correction spectra (d), first derivative spectra (e), and second derivative spectra (f). The threshold of RELIEF weight was set to 0 (horizontal dashed lines).
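The selection rule in this caption (keep features whose RELIEF weight exceeds 0) can be illustrated with a basic two-class Relief variant. The record does not say which RELIEF variant or settings were used, so the sketch below is an assumption: it visits every sample once instead of sampling at random, uses range-scaled absolute differences with a Manhattan distance, and takes one nearest hit and one nearest miss per sample. The data are hypothetical.

```python
def relief_weights(X, y):
    """Basic two-class Relief: reward features that differ on the nearest
    miss and penalise features that differ on the nearest hit."""
    n, d = len(X), len(X[0])
    # Scale per-feature differences to [0, 1]; guard against constant features.
    spans = [(max(col) - min(col)) or 1.0 for col in zip(*X)]

    def diff(f, a, b):
        return abs(a[f] - b[f]) / spans[f]

    def dist(a, b):
        return sum(diff(f, a, b) for f in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(d):
            weights[f] += (diff(f, X[i], X[miss]) - diff(f, X[i], X[hit])) / n
    return weights


def select_features(weights, threshold=0.0):
    """Indices of features whose weight exceeds the threshold."""
    return [f for f, w in enumerate(weights) if w > threshold]


# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
X = [[0.10, 0.50], [0.20, 0.10], [0.15, 0.90],
     [0.90, 0.40], [0.80, 0.95], [0.85, 0.20]]
y = [0, 0, 0, 1, 1, 1]
w = relief_weights(X, y)
print(w, select_features(w))
```

The sketch requires at least two samples per class so that every sample has both a hit and a miss.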
The operation times, parameter settings, and overall accuracies of diagnosis models built with different pre-processing, feature selection and machine-learning methods.¹
| Machine-learning method | Pre-processing method | No feature selection: parameters | No feature selection: time (s) | No feature selection: OA (%) | PCA: parameters | PCA: time (s) | PCA: OA (%) | RELIEF: parameters | RELIEF: time (s) | RELIEF: OA (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| RF | none | 70, 5 | 0.32 | 80 | 5, 60, 7 | 0.22 | 82 | 8, 150, 5 | 0.04 | 85 |
| RF | MC | 270, 4 | 0.27 | 74 | 7, 160, 2 | 0.17 | 83 | 8, 130, 3 | 0.03 | 71 |
| RF | SNV | 290, 3 | 0.32 | 84 | 7, 20, 2 | 0.05 | 70 | 6, 60, 3 | 0.03 | 82 |
| RF | MSC | 150, 4 | 0.25 | 71 | 6, 30, 2 | 0.03 | 71 | 6, 30, 2 | 0.03 | 71 |
| RF | 1st | 50, 2 | 0.25 | 77 | 8, 80, 4 | 0.05 | 79 | 10, 30, 3 | 0.03 | 81 |
| RF | 2nd | 200, 2 | 0.28 | 85 | 6, 50, 4 | 0.05 | 71 | | | |
| ANN | none | 1 | 0.34 | 86 | 6, 9 | 0.05 | 71 | 8, 3 | 0.02 | 84 |
| ANN | MC | 2 | 0.48 | 76 | 8, 2 | 0.04 | 71 | 8, 10 | 0.05 | 76 |
| ANN | SNV | 1 | 0.27 | 81 | 6, 2 | 0.03 | 64 | 6, 6 | 0.03 | 86 |
| ANN | MSC | 1 | 0.28 | 29 | 8, 2 | 0.03 | 40 | 6, 3 | 0.02 | 52 |
| ANN | 1st | 3 | 0.67 | 87 | | | | 10, 1 | 0.03 | 81 |
| ANN | 2nd | 1 | 0.30 | 82 | 5, 2 | 0.03 | 62 | 12, 1 | 0.03 | 75 |
| RBF-SVM | none | 0.01, 1, 32 | 0.11 | 80 | 7, 0.04, 1, 32 | 0.05 | 85 | 8, 0.17, 1, 32 | 0.02 | 82 |
| RBF-SVM | MC | 0.01, 1, 32 | 0.14 | 70 | 7, 0.08, 1, 35 | 0.05 | 87 | 8, 0.38, 1, 31 | 0.03 | 76 |
| RBF-SVM | SNV | 0.01, 1, 36 | 0.09 | 81 | 9, 0.04, 1, 42 | 0.04 | 66 | 6, 0.28, 1, 36 | 0.03 | 80 |
| RBF-SVM | MSC | 0.01, 1, 37 | 0.08 | 71 | 5, 0.23, 1, 38 | 0.03 | 71 | 6, 0.31, 1, 37 | 0.02 | 71 |
| RBF-SVM | 1st | 0.01, 1, 46 | 0.06 | 79 | 8, 0.05, 1, 43 | 0.05 | 75 | 10, 0.09, 1, 33 | 0.33 | 82 |
| RBF-SVM | 2nd | 0.01, 1, 53 | 0.08 | 81 | 5, 0.07, 1, 41 | 0.03 | 71 | | | |
| LF-SVM | none | 1, 36 | 0.16 | 84 | 7, 1, 35 | 0.05 | 81 | 8, 1, 37 | 0.05 | 80 |
| LF-SVM | MC | 1, 36 | 0.12 | 85 | 7, 1, 35 | 0.05 | 85 | 8, 1, 35 | 0.03 | 79 |
| LF-SVM | SNV | 1, 33 | 0.11 | 86 | 5, 1, 27 | 0.06 | 56 | 6, 1, 39 | 0.06 | 72 |
| LF-SVM | MSC | 1, 34 | 0.11 | 29 | 5, 1, 39 | 0.06 | 29 | 6, 1, 39 | 0.04 | 73 |
| LF-SVM | 1st | 1, 26 | 0.09 | 80 | 8, 1, 27 | 0.05 | 80 | | | |
| LF-SVM | 2nd | 1, 36 | 0.10 | 76 | 4, 1, 30 | 0.06 | 63 | 12, 1, 26 | 0.05 | 81 |
¹ RF: random forests; ANN: artificial neural network; SVM: support vector machine; RBF: radial basis function; LF: linear function; MC: mean centering; SNV: standard normal variate; MSC: multiplicative scatter correction; 1st: first derivative; 2nd: second derivative; PCA: principal component analysis; time: operation time for calibration; OA: validated overall accuracy; nPC: number of principal components; nfeature: number of RELIEF-selected features; nt: number of trees; nv: number of variables; nlayer: number of layers; nsv: number of support vectors; C: regularization parameter; γ: kernel width. The results of selected models are emphasized in bold.
Figure 4. Mean decrease Gini values for RELIEF-selected spectral features.
Figure 5. Values of samples predicted by using: (a) second derivative spectra (second), RELIEF and random forests; (b) first derivative spectra (first), principal component analysis and artificial neural network; (c) second, RELIEF and radial basis function-based support vector machine (SVM); and (d) first, RELIEF and linear function-based SVM. Value 1 indicates contaminated, and value 0 indicates uncontaminated. The correctly diagnosed and misdiagnosed samples are displayed in the figures.
Z values of McNemar’s test between the optimal diagnosis models.¹
| | Second + RELIEF + RF | First + PCA + ANN | Second + RELIEF + RBF-SVM |
|---|---|---|---|
| First + PCA + ANN | 0.24 | | |
| Second + RELIEF + RBF-SVM | 0.90 | 0.00 | |
| First + RELIEF + LF-SVM | 0.30 | 0.26 | 0.41 |
¹ Second: second derivative spectra; First: first derivative spectra; PCA: principal component analysis; RF: random forests; ANN: artificial neural network; SVM: support vector machine; RBF: radial basis function; LF: linear function.
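The Z values in this table can be read against the 1.96 cut-off by converting each to a two-sided p-value. The sketch below assumes that Z is compared with a standard normal distribution (the usual large-sample treatment of McNemar's Z); the dictionary simply transcribes the table, and the helper name is hypothetical.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Pairwise Z values transcribed from the table.
z_values = {
    ("Second + RELIEF + RF", "First + PCA + ANN"): 0.24,
    ("Second + RELIEF + RF", "Second + RELIEF + RBF-SVM"): 0.90,
    ("Second + RELIEF + RF", "First + RELIEF + LF-SVM"): 0.30,
    ("First + PCA + ANN", "Second + RELIEF + RBF-SVM"): 0.00,
    ("First + PCA + ANN", "First + RELIEF + LF-SVM"): 0.26,
    ("Second + RELIEF + RBF-SVM", "First + RELIEF + LF-SVM"): 0.41,
}

for (model_a, model_b), z in z_values.items():
    p = two_sided_p(z)
    verdict = "significant" if abs(z) >= 1.96 else "not significant"
    print(f"{model_a} vs {model_b}: Z = {z:.2f}, p = {p:.3f} ({verdict})")
```

Under this assumption every pair falls below 1.96, which corresponds to p above 0.05 and matches the conclusion that the four optimal models do not differ significantly.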