Thi-Minh-Trang Huynh, Chuen-Fa Ni, Yu-Sheng Su, Vo-Chau-Ngan Nguyen, I-Hsien Lee, Chi-Ping Lin, Hoang-Hiep Nguyen.
Abstract
Monitoring ex-situ water parameters such as heavy metals requires time and laboratory work for water sampling and analysis, which can delay the response to ongoing pollution events. Previous studies have successfully applied fast modeling techniques, such as artificial intelligence algorithms, to predict heavy metals. However, neither the predictive ability of low-cost features nor explainability has been assessed in the modeling process. This study proposes a reliable and explainable framework for finding an effective model and feature set to predict heavy metals in groundwater. The integrated assessment framework has four steps: model selection uncertainty, feature selection uncertainty, predictive uncertainty, and model interpretability. The results show that Random Forest is the most suitable model and that quick-measure parameters can be used as predictors for arsenic (As), iron (Fe), and manganese (Mn). Although the model performance is promising, the predictions likely carry significant uncertainties. The findings also indicate that As is related to nutrients and spatial distribution, while Fe and Mn are affected by spatial distribution and salinity. Limitations and suggestions for improving prediction accuracy and interpretability are also discussed.
Keywords: Random Forest; explainable artificial intelligence (XAI); groundwater quality; heavy metals; prediction intervals
Year: 2022 PMID: 36231480 PMCID: PMC9566676 DOI: 10.3390/ijerph191912180
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 4.614
Figure 1. A conceptual research framework.
Figure 2. The locations of 453 observation wells in ten groundwater basins in Taiwan.
Statistical summary of physicochemical parameters.
| Variables | Description | Min a | Max b | Mean | STD c | Ske d | Kur e | Unit |
|---|---|---|---|---|---|---|---|---|
| Temp | Water temperature | 18.600 | 33.200 | 26.524 | 1.704 | −0.287 | 0.449 | °C |
| Depth | Depth to water | 0.000 | 39.627 | 4.944 | 4.569 | 2.777 | 10.014 | m |
| EC | Electrical conductivity | 2.000 | 65,800.000 | 1899.373 | 6205.603 | 6.419 | 43.874 | μS/cm at 25 °C |
| pH | pH | 4.100 | 9.300 | 6.739 | 0.552 | −0.922 | 1.799 | - |
| TH | Total hardness | 2.700 | 8390.000 | 416.436 | 735.115 | 6.207 | 43.697 | mg/L |
| TDS | Total dissolved solids | 4.100 | 52,300.000 | 1302.772 | 4502.321 | 6.689 | 48.265 | mg/L |
| Cl | Chloride | 0.500 | 27,800.000 | 454.055 | 2268.745 | 6.763 | 49.321 | mg/L |
| NH4 | Ammonia Nitrogen | 0.001 | 20.000 | 0.791 | 1.704 | 3.9 | 19.951 | mg/L |
| NO3 | Nitrate Nitrogen | 0.010 | 45.500 | 2.170 | 3.708 | 3.22 | 15.553 | mg/L |
| SO4 | Sulfate | 0.500 | 4260.000 | 139.139 | 312.284 | 6.386 | 46.222 | mg/L |
| TOC | Total organic carbon | 0.020 | 15.800 | 1.922 | 1.549 | 2.453 | 9.442 | mg/L |
| As | Arsenic | 0.000 | 0.146 | 0.006 | 0.013 | 4.031 | 20.404 | mg/L |
| Mn | Manganese | 0.005 | 11.300 | 0.519 | 0.854 | 4.152 | 26.062 | mg/L |
| Fe | Iron | 0.000 | 58.800 | 1.526 | 4.161 | 5.821 | 44.764 | mg/L |
a minimum; b maximum; c standard deviation; d skewness coefficient; e kurtosis coefficient.
Figure 3. Scatter plot showing the spatial distribution of derived clusters and outliers.
Figure 4. Scatter plot showing the temporal distribution of arsenic (a), iron (b), and manganese (c) in each cluster.
Figure 5. Spearman correlation matrix for all parameters.
Figure 6. Boxplots of model performance (RMSE) on 100 datasets: (a) As prediction; (b) Fe prediction; (c) Mn prediction.
Model performance on 100 testing data sets for different targets (mean R² score ± standard deviation).
| Models | As Prediction | Fe Prediction | Mn Prediction |
|---|---|---|---|
| GBR | 0.65 ± 0.02 | 0.59 ± 0.03 | 0.62 ± 0.02 |
| KNR | 0.73 ± 0.02 | 0.60 ± 0.03 | 0.65 ± 0.03 |
| LR | 0.18 ± 0.01 | 0.08 ± 0.01 | 0.18 ± 0.01 |
| MLP | 0.30 ± 0.06 | 0.47 ± 0.06 | 0.50 ± 0.08 |
| RFR | 0.79 ± 0.02 | 0.70 ± 0.03 | 0.76 ± 0.02 |
| SVR | −26.32 ± 1.88 | 0.02 ± 0.01 | 0.21 ± 0.02 |
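When reading the scores above, it may help to recall the usual definition of the coefficient of determination. The formula below is the standard one and is assumed here; this excerpt does not state which definition was used.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```

Here $y_i$ is an observed concentration, $\hat{y}_i$ the model prediction, and $\bar{y}$ the mean of the observations. Under this definition R² has no lower bound: a model whose squared errors exceed those of a constant prediction at $\bar{y}$ scores below zero, which would be consistent with the strongly negative SVR score for As.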
Random Forest Regressor hyperparameter optimization.
| Hyperparameters | Description | As Model | Fe Model | Mn Model |
|---|---|---|---|---|
| min_samples_leaf | The lowest number of observations in a terminal node | 4 | 4 | 4 |
| max_features | Number of variables for the best split | 10 | 6 | 8 |
| min_samples_split | The lowest number of observations needed to split an internal node | 6 | 4 | 8 |
| n_estimators | Number of trees in a forest | 1848 | 1727 | 1000 |
| max_depth | The maximum depth of the tree | 16 | 18 | 20 |
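The hyperparameter names in the table coincide with constructor arguments of scikit-learn's `RandomForestRegressor`, so the tuned settings can be sketched as follows, assuming that library was used (this excerpt does not say so explicitly). The `build_models` helper and the `random_state` argument are illustrative additions, not part of the source.

```python
from sklearn.ensemble import RandomForestRegressor

# Tuned values copied from the hyperparameter table above.
TUNED_PARAMS = {
    "As": dict(min_samples_leaf=4, max_features=10, min_samples_split=6,
               n_estimators=1848, max_depth=16),
    "Fe": dict(min_samples_leaf=4, max_features=6, min_samples_split=4,
               n_estimators=1727, max_depth=18),
    "Mn": dict(min_samples_leaf=4, max_features=8, min_samples_split=8,
               n_estimators=1000, max_depth=20),
}


def build_models(random_state=0):
    """Return one unfitted RandomForestRegressor per target metal.

    random_state is an assumption added for reproducibility; it is not
    reported in the table.
    """
    return {
        target: RandomForestRegressor(random_state=random_state, **params)
        for target, params in TUNED_PARAMS.items()
    }


models = build_models()
for target, model in models.items():
    print(target, model.get_params()["n_estimators"])
```

Each model would then be fitted with `model.fit(X_train, y_train)` on a feature matrix that has at least `max_features` columns.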
Figure 7. Validation curve of R² score versus max_features: (a) As model; (b) Fe model; (c) Mn model. Vertical red lines indicate the optimal max_features.
Figure 8. Learning curves of training versus 5-fold cross-validation: (a) As model; (b) Fe model; and (c) Mn model.
Figure 9. Permutation importance distributions on the training data. The vertical axis lists the feature names; the horizontal axis shows the reduction in prediction score when that feature is permuted: (a) As model; (b) Fe model; and (c) Mn model.
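Permutation importance, as plotted in Figure 9, measures how much the score drops when one feature's values are shuffled. The following is a minimal sketch using scikit-learn's `permutation_importance` on synthetic stand-in data; the data, the feature names `f0` to `f2`, and the small forest size are assumptions for illustration and do not reproduce the study's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the groundwater data set): three features,
# of which only the first drives the target.
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature n_repeats times and record the drop in the
# estimator's default score (R2 for regressors).
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for name, mean, std in zip(["f0", "f1", "f2"],
                           result.importances_mean,
                           result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

The per-repeat values in `result.importances` (one row per feature) are what a distribution plot of this kind would display.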
Feature grouping for models.
| Feature Set | As Model | Fe Model | Mn Model |
|---|---|---|---|
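| Full features | X, Y, pH, TH, EC, TDS, Cl, SO4, NO3, NH4, TOC, Temp, Depth | X, Y, pH, TH, EC, TDS, Cl, SO4, NO3, NH4, TOC, Temp, Depth | X, Y, pH, TH, EC, TDS, Cl, SO4, NO3, NH4, TOC, Temp, Depth |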
| Important features | NH4, X, Y, pH, NO3, SO4, Cl, Depth, TH, TDS | pH, NH4, Y, NO3, Depth, X | pH, Y, NO3, TH, NH4, SO4, X, EC |
| Low-cost features | NH4, X, Y, pH, NO3, EC, Cl, Depth, TH, TDS | pH, NH4, Y, EC, Depth, X | pH, Y, NO3, TH, NH4, Cl, X, EC |
Figure 10. Boxplots of model performance (R²) on 100 random testing datasets: (a) As model; (b) Fe model; (c) Mn model.
Results of the Wilcoxon signed-rank test on R² scores of 100 testing data sets: statistic (p-value).
| Paired Tests | As Model | Fe Model | Mn Model |
|---|---|---|---|
| Full features—Important features | 2065.000 (0.912) | 0.000 (0.000) | 0.000 (0.000) |
| Full features—Low-cost features | 1883.000 (0.140) | 6.000 (0.000) | 0.000 (0.000) |
| Important features—Low-cost features | 1396.500 (0.009) | 661.500 (0.000) | 1697.000 (0.277) |
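A paired comparison of this kind can be sketched with `scipy.stats.wilcoxon`, assuming each feature set is scored on the same 100 test splits. The scores below are randomly generated placeholders, not the study's results, and whether the authors used SciPy is not stated in this excerpt.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(42)

# Hypothetical R2 scores for two feature sets on the same 100 test splits.
r2_full = rng.normal(loc=0.70, scale=0.03, size=100)
r2_reduced = r2_full - rng.normal(loc=0.02, scale=0.01, size=100)

# Two-sided paired test on the per-split differences.
statistic, p_value = wilcoxon(r2_full, r2_reduced)
print(f"statistic={statistic:.3f}, p-value={p_value:.3f}")
```

In SciPy's two-sided test the statistic is the smaller of the positive and negative rank sums, so a value of 0 means every nonzero difference has the same sign.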
Coverage probability of 90% prediction intervals for different input feature sets: PICP in % (MPI).
| Feature Sets | As Model | Fe Model | Mn Model |
|---|---|---|---|
| Full features | 98.32 (0.0174) | 80.53 (6.1914) | 87.65 (1.4521) |
| Important features | 98.05 (0.0154) | 75.02 (4.8758) | 86.31 (1.2356) |
| Low-cost features | 98.07 (0.0152) | 77.05 (4.9346) | 86.41 (1.2340) |
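The two interval metrics in the table can be computed from observed values and interval bounds as sketched below. The sketch assumes PICP is the percentage of observations falling inside their interval and MPI is the mean interval width; these are common definitions, but the abbreviations are not expanded in this excerpt. The function names and toy numbers are illustrative only.

```python
def picp(y_true, lower, upper):
    """Percentage of observations that fall inside [lower, upper]."""
    inside = [lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper)]
    return 100.0 * sum(inside) / len(inside)


def mpi(lower, upper):
    """Mean width of the prediction intervals, in the target's units."""
    widths = [hi - lo for lo, hi in zip(lower, upper)]
    return sum(widths) / len(widths)


# Toy values for illustration only (not from the study).
y_obs = [0.10, 0.50, 1.20, 3.00]
lower = [0.00, 0.20, 1.30, 1.00]
upper = [0.30, 0.90, 2.00, 4.00]

print(f"PICP = {picp(y_obs, lower, upper):.2f}%")
print(f"MPI  = {mpi(lower, upper):.3f}")
```

For a nominal 90% interval, a PICP well above 90 suggests intervals that are wider than needed, while a PICP below 90 suggests intervals that miss more observations than intended.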
Figure 11. Visualization of observed data, median prediction line, and interval prediction by low-cost features: (a) As model, PICP = 97.97%, MPI = 0.015; (b) Fe model, PICP = 77.13%, MPI = 4.907; and (c) Mn model, PICP = 86.41%, MPI = 1.230.
Figure 12. Distribution of local feature contributions for each predicted sample: (a) As model; (b) Fe model; (c) Mn model.
Global feature contributions in each model.
| Features | As Model (Train) | As Model (Test) | Fe Model (Train) | Fe Model (Test) | Mn Model (Train) | Mn Model (Test) |
|---|---|---|---|---|---|---|
| Cl | 4.33 × 10⁻⁶ | −1.86 × 10⁻⁵ | - | - | 6.56 × 10⁻⁴ | 2.76 × 10⁻³ |
| Depth | 4.85 × 10⁻⁶ | 3.63 × 10⁻⁵ | 4.34 × 10⁻³ | −7.28 × 10⁻³ | - | - |
| EC | 4.49 × 10⁻⁶ | −1.71 × 10⁻⁶ | 3.19 × 10⁻³ | −3.52 × 10⁻² | 4.19 × 10⁻⁴ | −2.94 × 10⁻³ |
| NH4 | 6.13 × 10⁻⁶ | 9.45 × 10⁻⁵ | 2.56 × 10⁻³ | −6.71 × 10⁻⁴ | 3.62 × 10⁻⁴ | −1.85 × 10⁻³ |
| NO3 | 6.87 × 10⁻⁶ | 4.68 × 10⁻⁵ | - | - | 3.64 × 10⁻⁴ | 7.00 × 10⁻³ |
| pH | 6.86 × 10⁻⁸ | 8.14 × 10⁻⁵ | 1.22 × 10⁻³ | −5.52 × 10⁻³ | −2.52 × 10⁻⁴ | −4.82 × 10⁻³ |
| TDS | 1.49 × 10⁻⁶ | 2.25 × 10⁻⁵ | - | - | - | - |
| TH | 3.10 × 10⁻⁶ | 6.76 × 10⁻⁶ | - | - | 5.27 × 10⁻⁴ | 4.19 × 10⁻³ |
| X | 7.58 × 10⁻⁷ | 2.95 × 10⁻⁵ | 1.17 × 10⁻³ | −4.67 × 10⁻³ | 2.76 × 10⁻⁴ | −2.61 × 10⁻³ |
| Y | −2.42 × 10⁻⁶ | 7.36 × 10⁻⁵ | −2.55 × 10⁻⁴ | −9.63 × 10⁻³ | 7.97 × 10⁻⁵ | −5.02 × 10⁻³ |