Thulane Paepae1, Pitshou N Bokoro1, Kyandoghere Kyamakya2.
Abstract
Harmful cyanobacterial blooms (HCBs) are problematic for drinking water treatment, and some cyanobacterial strains produce toxins that significantly affect human health. To better control eutrophication and HCBs, catchment managers need to continuously keep track of nitrogen (N) and phosphorus (P) in the water bodies. However, high-frequency monitoring of these water quality indicators is not economical. In these cases, machine learning techniques may serve as viable alternatives since they can learn directly from the available surrogate data. In the present work, virtual sensors based on random forest (RF), extremely randomized trees (ET), extreme gradient boosting, k-nearest neighbors, light gradient boosting machine, and bagging regressor models were used to predict N and P in two catchments with contrasting land uses. The effects of data scaling and missing value imputation were also assessed, while Shapley additive explanations were used to rank feature importance. A specification book, sensitivity analysis, and best practices for developing virtual sensors are discussed. Results show that ET, the MinMax scaler, and a multivariate imputer were the best predictive model, scaler, and imputer, respectively. The highest predictive performance, reported in terms of R², was 97% in the rural catchment and 82% in the urban catchment.
Keywords: accuracy benchmark; baseline model; data scaling; machine learning; missing values handling; soft-sensor; specification book; surrogate parameters; water quality monitoring
Year: 2022 PMID: 36236438 PMCID: PMC9572788 DOI: 10.3390/s22197338
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. A working principle of the virtual sensing system.
Datasets used for virtual sensor development: parameters measured in each catchment, their chemical formulas, and descriptions where N/A denotes the absence of standard formulas and • indicates the catchment where the variable was measured.
| Predictors | Formula | The Cut | Enborne | Description |
|---|---|---|---|---|
| pH | N/A | • | • | A measure of water’s acidity or basicity. A changing stream pH indicates an increase in water pollution. |
| Flow rate | N/A | • | • | The volume of water flowing past a point per unit time. Streamflow and runoff drive the generation and delivery of various diffuse (non-point) pollutants; therefore, the knowledge of streamflow enables the determination of pollutant loads. |
| Turbidity | N/A | • | • | A measure of water’s relative clarity. It is an optical characteristic that indicates the presence of bacteria, pathogens, and other harmful contaminants. |
| Chlorophyll | C55H72O5N4Mg | • | • | A measure of how much algae is growing in a water body. It is usually used to classify the water body’s trophic condition. |
| Temperature | N/A | • | • | A property that expresses how cold or hot the water is. It influences various other variables and can alter water’s chemical and physical properties. |
| Conductivity | N/A | • | • | A measure of water’s ability to conduct electricity. Its increase may indicate that a discharge has decreased the water body’s relative health or condition. |
| Dissolved oxygen | O2 | • | • | A measure of how much non-compound oxygen is available in the water. It is a direct indicator of the water body’s ability to support aquatic life. |
| Nitrogen as nitrate | NO3 | | • | Nitrates are one form of nitrogen found in aquatic environments. Although nitrates are vital plant nutrients, their excess amounts can accelerate eutrophication. |
| Nitrogen as ammonium | NH4 | • | | Ammonium is another form of nitrogen found in water bodies. It has toxic effects on aquatic life at elevated concentrations. |
| Total phosphorus | P | • | | Total phosphorus is more stable and, therefore, a more reliable index of the phosphorus status in water bodies. Similar to nitrates, excess amounts of phosphorus lead to eutrophication and harmful algal growth. |
| Total reactive phosphorus | PO43− | • | • | Total reactive phosphorus (orthophosphate) is regarded as the best indicator of the nutrient status of water bodies. It has similar effects to nitrates and ammonium in excess amounts. |
Figure 2. A boxplot showing conductivity outliers in The Cut.
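
Boxplots conventionally flag points that lie more than 1.5 interquartile ranges beyond the quartiles (Tukey's fences). The sketch below applies that rule; the 1.5 multiplier is the usual default rather than a value stated here, and the conductivity readings are made up for illustration.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values lying outside the boxplot whiskers (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical conductivity readings; NOT taken from The Cut dataset.
conductivity = [610, 625, 630, 640, 645, 650, 655, 660, 670, 1450]
outliers = tukey_outliers(conductivity)
print(outliers)  # [1450]
```

Whether flagged points are removed, capped, or kept is a separate modelling decision that this sketch does not make.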
Transformation of variables in each catchment.
| Variable | Transformation (The Cut) | Transformation (River Enborne) |
|---|---|---|
| Flow rate (Flow) | Reciprocal | Logarithm |
| Chlorophyll (Chl) | Logarithm | Logarithm |
| Dissolved oxygen (DO) | Square root | Logarithm |
| Nitrogen (as NH4 in The Cut, NO3 in the River Enborne) | Cube root | Cube root |
| Turbidity (Turb) | Reciprocal | Reciprocal |
| Total Reactive Phosphorus (TRP) | None | Cube root |
| pH | None | Reciprocal |
| Conductivity (EC) | None | None |
| Temperature (Temp) | None | None |
| Total Phosphorus (TP) | Square root | |
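
One common way to choose among such transformations is to compare the skewness of a variable before and after each one. The selection criterion actually used for the table above is not stated here, so the sketch below is only an assumption about how such a choice could be made; the flow values are hypothetical.

```python
import math

def skewness(xs):
    """Population skewness: third central moment divided by std**3."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# The transformation families named in the table (inputs must be positive
# for the logarithm and the reciprocal).
TRANSFORMS = {
    "none": lambda x: x,
    "logarithm": math.log,
    "square root": math.sqrt,
    "cube root": lambda x: math.copysign(abs(x) ** (1.0 / 3.0), x),
    "reciprocal": lambda x: 1.0 / x,
}

# Hypothetical right-skewed flow readings; NOT the paper's data.
flow = [0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 1.0, 1.5, 3.0, 9.0]

skew_by_transform = {
    name: skewness([f(x) for x in flow]) for name, f in TRANSFORMS.items()
}
# Pick the transformation whose output is closest to symmetric.
best = min(skew_by_transform, key=lambda name: abs(skew_by_transform[name]))
for name, value in skew_by_transform.items():
    print(f"{name:12s} skewness = {value:+.2f}")
print("least skewed:", best)
```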
Proposed accuracy requirements for virtual sensor-based nutrient monitoring.
| Accuracy Metric | Target | Acceptable | Tolerable | Poor |
|---|---|---|---|---|
| R² | 95–100% | 90–94% | 85–89% | 80–84% |
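
The rating bands above can be turned into a small lookup. This is a minimal sketch: the table does not define a band below 80%, so the "Unrated" label is an assumption, as is assigning fractional values such as 94.5% to the lower band.

```python
def accuracy_rating(r2_percent):
    """Map an R² value (in percent) onto the rating bands of the table above."""
    bands = [(95.0, "Target"), (90.0, "Acceptable"), (85.0, "Tolerable"), (80.0, "Poor")]
    for lower_bound, label in bands:
        if r2_percent >= lower_bound:
            return label
    return "Unrated"  # assumption: no band is defined below 80%

# The two headline R² values reported in the abstract.
ratings = {name: accuracy_rating(value) for name, value in {"rural": 97, "urban": 82}.items()}
print(ratings)  # {'rural': 'Target', 'urban': 'Poor'}
```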
Figure 3. (a) Spot-checking nitrate predictive performance in the River Enborne; (b) spot-checking nitrate predictive performance in The Cut.
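
Spot-checking usually means scoring several candidate models with near-default settings under the same cross-validation split before tuning any of them. The sketch below illustrates the idea with scikit-learn on synthetic data; it omits XGB and LGBM to avoid extra dependencies, and the data, fold count, and hyperparameters are assumptions rather than the study's setup.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for surrogate sensor data (NOT the catchment datasets).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = (2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + 0.5 * X[:, 2] * X[:, 3]
     + rng.normal(0.0, 0.05, 300))

candidates = {
    "RF": RandomForestRegressor(n_estimators=50, random_state=0),
    "ET": ExtraTreesRegressor(n_estimators=50, random_state=0),
    "BR": BaggingRegressor(n_estimators=50, random_state=0),
    "kNN": KNeighborsRegressor(),
}

# Same folds for every model so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
spot_check = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in candidates.items()
}
for name, score in sorted(spot_check.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean R2 = {score:.3f}")
```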
Performance comparison of RF and ET models using RMSE ± standard deviation (std).
| Predictors in RF and ET Models | RF: [ | ET: Our Work | Improvement (%) |
|---|---|---|---|
| | RMSE ± Std | RMSE ± Std | |
| | | | |
| EC | 0.458 ± 0.286 | 0.062 ± 0.002 | 86% |
| EC, pH | 0.343 ± 0.229 | 0.050 ± 0.002 | 85% |
| EC, pH, Flow | 0.254 ± 0.186 | 0.027 ± 0.001 | 89% |
| EC, pH, Flow, Temp | 0.194 ± 0.138 | 0.017 ± 0.001 | 91% |
| | | | |
| EC | 0.061 ± 0.043 | 0.066 ± 0.002 | −8% |
| EC, Flow | 0.043 ± 0.035 | 0.051 ± 0.001 | −19% |
| EC, Flow, Temp | 0.030 ± 0.025 | 0.027 ± 0.001 | 10% |
| EC, Flow, Temp, Turb | 0.025 ± 0.021 | 0.020 ± 0.001 | 20% |
| | | | |
| Chl | 0.210 ± 0.144 | 0.130 ± 0.004 | 38% |
| Chl, Temp | 0.190 ± 0.123 | 0.135 ± 0.003 | 29% |
| Chl, Temp, Turb | 0.150 ± 0.104 | 0.091 ± 0.004 | 39% |
| Chl, Temp, Turb, pH | 0.128 ± 0.095 | 0.067 ± 0.004 | 48% |
| Chl, Temp, Turb, pH, EC | 0.107 ± 0.077 | 0.051 ± 0.003 | 52% |
| | | | |
| EC | 0.199 ± 0.115 | 0.196 ± 0.005 | 2% |
| EC, Turb | 0.180 ± 0.112 | 0.205 ± 0.007 | −14% |
| EC, Turb, Temp | 0.141 ± 0.099 | 0.142 ± 0.003 | −1% |
| EC, Turb, Temp, pH | 0.117 ± 0.088 | 0.107 ± 0.005 | 9% |
| EC, Turb, Temp, pH, Flow | 0.107 ± 0.078 | 0.088 ± 0.004 | 18% |
| | | | |
| EC | 0.192 ± 0.114 | 0.122 ± 0.003 | 36% |
| EC, Turb | 0.173 ± 0.116 | 0.128 ± 0.004 | 26% |
| EC, Turb, Temp | 0.153 ± 0.104 | 0.088 ± 0.002 | 42% |
| EC, Turb, Temp, pH | 0.119 ± 0.087 | 0.067 ± 0.003 | 44% |
| EC, Turb, Temp, pH, Flow | 0.110 ± 0.081 | 0.055 ± 0.002 | 50% |
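
The Improvement (%) column appears consistent with the relative RMSE reduction of ET over RF, (RMSE_RF − RMSE_ET) / RMSE_RF × 100. That formula is inferred from the tabulated values rather than stated in the text; the sketch below checks it against three rows of the table.

```python
def improvement_percent(rmse_rf, rmse_et):
    """Relative RMSE reduction of ET over RF, in percent (negative = ET is worse)."""
    return (rmse_rf - rmse_et) / rmse_rf * 100.0

# (predictors, RF RMSE, ET RMSE) copied from three rows of the table above.
rows = [
    ("EC, pH, Flow, Temp", 0.194, 0.017),
    ("EC, Flow", 0.043, 0.051),
    ("Chl, Temp, Turb, pH, EC", 0.107, 0.051),
]
improvements = [round(improvement_percent(rf, et)) for _, rf, et in rows]
print(improvements)  # [91, -19, 52]
```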
The effect of different scalers on NO3 prediction in the River Enborne.
| Scaling Method | RF RMSE | RF R² | XGB RMSE | XGB R² | LGBM RMSE | LGBM R² | kNN RMSE | kNN R² | ET RMSE | ET R² | BR RMSE | BR R² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No scaling | 0.0211 | 0.9544 | 0.0246 | 0.9382 | 0.0279 | 0.9200 | 0.0534 | 0.7085 | 0.0176 | 0.9682 | 0.0229 | 0.9465 |
| Robust scaler | 0.0211 | 0.9547 | 0.0241 | 0.9405 | 0.0280 | 0.9198 | 0.0257 | 0.9322 | 0.0176 | 0.9684 | 0.0226 | 0.9469 |
| MaxAbs scaler | 0.0211 | 0.9546 | 0.0246 | 0.9378 | 0.0279 | 0.9200 | 0.0292 | 0.9125 | 0.0176 | 0.9681 | 0.0224 | 0.9466 |
| MinMax scaler | 0.0211 | 0.9548 | 0.0241 | 0.9403 | 0.0279 | 0.9200 | 0.0210 | 0.9548 | 0.0176 | 0.9683 | 0.0226 | 0.9467 |
| Standard scaler | 0.0210 | 0.9546 | 0.0243 | 0.9397 | 0.0278 | 0.9207 | 0.0236 | 0.9427 | 0.0176 | 0.9683 | 0.0226 | 0.9459 |
| Power transformer | 0.0211 | 0.9545 | 0.0242 | 0.9398 | 0.0279 | 0.9200 | 0.0234 | 0.9440 | 0.0176 | 0.9684 | 0.0227 | 0.9494 |
| Quantile transformer | 0.0223 | 0.9487 | 0.0313 | 0.8999 | 0.0315 | 0.8984 | 0.0253 | 0.9344 | 0.0199 | 0.9596 | 0.0236 | 0.9429 |
The effect of different scalers on TRP prediction in the River Enborne.
| Scaling Method | RF RMSE | RF R² | XGB RMSE | XGB R² | LGBM RMSE | LGBM R² | kNN RMSE | kNN R² | ET RMSE | ET R² | BR RMSE | BR R² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No scaling | 0.0268 | 0.9386 | 0.0298 | 0.9240 | 0.0322 | 0.9112 | 0.0565 | 0.7268 | 0.0243 | 0.9498 | 0.0284 | 0.9318 |
| Robust scaler | 0.0269 | 0.9377 | 0.0299 | 0.9234 | 0.0323 | 0.9108 | 0.0311 | 0.9170 | 0.0243 | 0.9497 | 0.0285 | 0.9282 |
| MaxAbs scaler | 0.0268 | 0.9383 | 0.0297 | 0.9244 | 0.0322 | 0.9112 | 0.0331 | 0.9065 | 0.0243 | 0.9495 | 0.0283 | 0.9295 |
| MinMax scaler | 0.0269 | 0.9381 | 0.0299 | 0.9234 | 0.0324 | 0.9103 | 0.0270 | 0.9377 | 0.0242 | 0.9498 | 0.0287 | 0.9302 |
| Standard scaler | 0.0268 | 0.9379 | 0.0298 | 0.9242 | 0.0325 | 0.9099 | 0.0294 | 0.9260 | 0.0241 | 0.9498 | 0.0290 | 0.9299 |
| Power transformer | 0.0269 | 0.9385 | 0.0299 | 0.9237 | 0.0324 | 0.9102 | 0.0294 | 0.9258 | 0.0241 | 0.9490 | 0.0287 | 0.9300 |
| Quantile transformer | 0.0293 | 0.9269 | 0.0354 | 0.8926 | 0.0360 | 0.8892 | 0.0316 | 0.9145 | 0.0276 | 0.9343 | 0.0302 | 0.9222 |
The effect of different scalers on NH4 prediction in The Cut.
| Scaling Method | RF RMSE | RF R² | XGB RMSE | XGB R² | LGBM RMSE | LGBM R² | kNN RMSE | kNN R² | ET RMSE | ET R² | BR RMSE | BR R² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No scaling | 0.0494 | 0.8552 | 0.0554 | 0.8183 | 0.0560 | 0.8140 | 0.1157 | 0.2085 | 0.0436 | 0.8856 | 0.0538 | 0.8299 |
| Robust scaler | 0.0494 | 0.8540 | 0.0548 | 0.8218 | 0.0558 | 0.8157 | 0.0501 | 0.8508 | 0.0438 | 0.8852 | 0.0529 | 0.8275 |
| MaxAbs scaler | 0.0497 | 0.8538 | 0.0554 | 0.8183 | 0.0560 | 0.8140 | 0.0536 | 0.8287 | 0.0440 | 0.8850 | 0.0543 | 0.8309 |
| MinMax scaler | 0.0495 | 0.8520 | 0.0540 | 0.8275 | 0.0559 | 0.8151 | 0.0445 | 0.8821 | 0.0438 | 0.8859 | 0.0537 | 0.8298 |
| Standard scaler | 0.0496 | 0.8544 | 0.0550 | 0.8208 | 0.0560 | 0.8141 | 0.0463 | 0.8726 | 0.0440 | 0.8842 | 0.0524 | 0.8312 |
| Power transformer | 0.0567 | 0.8093 | 0.0577 | 0.8024 | 0.0590 | 0.7934 | 0.0549 | 0.8183 | 0.0514 | 0.8385 | 0.0599 | 0.7903 |
| Quantile transformer | 0.0631 | 0.7611 | 0.0732 | 0.6832 | 0.0767 | 0.6524 | 0.0562 | 0.8118 | 0.0627 | 0.7678 | 0.0642 | 0.7670 |
The effect of different scalers on TRP prediction in The Cut.
| Scaling Method | RF RMSE | RF R² | XGB RMSE | XGB R² | LGBM RMSE | LGBM R² | kNN RMSE | kNN R² | ET RMSE | ET R² | BR RMSE | BR R² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No scaling | 0.0954 | 0.8058 | 0.1040 | 0.7675 | 0.1079 | 0.7498 | 0.1723 | 0.3623 | 0.0904 | 0.8229 | 0.1022 | 0.7805 |
| Robust scaler | 0.0950 | 0.8059 | 0.1042 | 0.7667 | 0.1080 | 0.7496 | 0.1044 | 0.7660 | 0.0909 | 0.8227 | 0.1019 | 0.7784 |
| MaxAbs scaler | 0.0956 | 0.8047 | 0.1045 | 0.7651 | 0.1079 | 0.7498 | 0.1085 | 0.7471 | 0.0904 | 0.8229 | 0.1019 | 0.7782 |
| MinMax scaler | 0.0951 | 0.8041 | 0.1044 | 0.7657 | 0.1079 | 0.7499 | 0.0983 | 0.7926 | 0.0906 | 0.8221 | 0.1009 | 0.7795 |
| Standard scaler | 0.0952 | 0.8041 | 0.1037 | 0.7687 | 0.1082 | 0.7486 | 0.1000 | 0.7849 | 0.0909 | 0.8222 | 0.1014 | 0.7798 |
| Power transformer | 0.1017 | 0.7792 | 0.1054 | 0.7612 | 0.1080 | 0.7493 | 0.1069 | 0.7538 | 0.0974 | 0.7951 | 0.1090 | 0.7529 |
| Quantile transformer | 0.1006 | 0.7820 | 0.1110 | 0.7353 | 0.1133 | 0.7242 | 0.1038 | 0.7684 | 0.1005 | 0.7837 | 0.1051 | 0.7564 |
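
In the four scaler tables, kNN gains the most from scaling while the tree ensembles barely change. This is expected because kNN relies on distances, whereas tree splits depend only on the ordering of each feature. The sketch below reproduces that qualitative pattern with scikit-learn on synthetic data; the feature ranges, sample size, and model settings are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Two synthetic predictors on very different scales (NOT the catchment data).
rng = np.random.default_rng(0)
n = 400
small = rng.uniform(0.0, 1.0, n)      # feature on a 0-1 scale
large = rng.uniform(0.0, 1000.0, n)   # feature on a 0-1000 scale
X = np.column_stack([small, large])
y = 10.0 * small + 0.005 * large + rng.normal(0.0, 0.1, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def holdout_r2(model):
    """Fit on the training split and return R² on the held-out split."""
    return r2_score(y_test, model.fit(X_train, y_train).predict(X_test))

def make_et():
    return ExtraTreesRegressor(n_estimators=100, random_state=0)

scores = {
    "kNN, no scaling": holdout_r2(KNeighborsRegressor()),
    "kNN, MinMax": holdout_r2(make_pipeline(MinMaxScaler(), KNeighborsRegressor())),
    "ET, no scaling": holdout_r2(make_et()),
    "ET, MinMax": holdout_r2(make_pipeline(MinMaxScaler(), make_et())),
}
for name, score in scores.items():
    print(f"{name}: R2 = {score:.3f}")
```

Wrapping the scaler and the model in one pipeline keeps the scaler fitted on the training split only, which avoids leaking test-set statistics into the model.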
Performance comparison of the ET model using different imputation methods.
| Type | Method | River Enborne NO3 RMSE | River Enborne NO3 R² | River Enborne TRP RMSE | River Enborne TRP R² | The Cut NH4 RMSE | The Cut NH4 R² | The Cut TRP RMSE | The Cut TRP R² |
|---|---|---|---|---|---|---|---|---|---|
| Deletion | Listwise | 0.0176 | 0.9683 | 0.0242 | 0.9498 | 0.0438 | 0.8859 | 0.0906 | 0.8221 |
| Univariate | Mean | 0.0282 | 0.9018 | 0.0335 | 0.8624 | 0.0427 | 0.9053 | 0.0783 | 0.7794 |
| Univariate | Mode | 0.0285 | 0.9004 | 0.0461 | 0.8141 | 0.0426 | 0.9054 | 0.0781 | 0.7853 |
| Univariate | Median | 0.0282 | 0.9010 | 0.0337 | 0.8629 | 0.0424 | 0.9057 | 0.0780 | 0.7775 |
| Multivariate | Bayesian ridge | 0.0215 | 0.9463 | 0.0245 | 0.9370 | 0.0430 | 0.9033 | 0.0760 | 0.7957 |
| Multivariate | RF | 0.0176 | 0.9649 | 0.0208 | 0.9580 | 0.0429 | 0.9058 | 0.0798 | 0.7948 |
| Multivariate | BR | 0.0178 | 0.9635 | 0.0217 | 0.9542 | 0.0427 | 0.9051 | 0.0819 | 0.7893 |
| Multivariate | XGB | 0.0183 | 0.9687 | 0.0217 | 0.9548 | 0.0428 | 0.9040 | 0.0794 | 0.7904 |
| Multivariate | LGBM | 0.0181 | 0.9678 | 0.0212 | 0.9555 | 0.0426 | 0.9098 | 0.0763 | 0.8009 |
| Nearest neighbors | kNN | 0.0238 | 0.9328 | 0.0320 | 0.8884 | 0.0436 | 0.9017 | 0.0853 | 0.7949 |
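
To illustrate why a multivariate imputer can beat a univariate one when predictors are correlated, the sketch below masks part of a synthetic column and compares how well each imputer reconstructs it. Note the differences from the table above: it scores reconstruction error rather than downstream ET performance, uses scikit-learn's `IterativeImputer` with its default Bayesian ridge estimator, and runs on synthetic data with an assumed 20% missing rate.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

# Three correlated synthetic variables (NOT the catchment data).
rng = np.random.default_rng(0)
n = 300
a = rng.normal(0.0, 1.0, n)
b = 2.0 * a + rng.normal(0.0, 0.1, n)
c = -a + rng.normal(0.0, 0.1, n)
X_true = np.column_stack([a, b, c])

# Hide roughly 20% of the second column.
X_missing = X_true.copy()
mask = rng.random(n) < 0.2
X_missing[mask, 1] = np.nan

def imputation_rmse(imputer):
    """RMSE between imputed and true values on the masked entries only."""
    X_filled = imputer.fit_transform(X_missing)
    return float(np.sqrt(np.mean((X_filled[mask, 1] - X_true[mask, 1]) ** 2)))

rmse_by_imputer = {
    "mean": imputation_rmse(SimpleImputer(strategy="mean")),
    "median": imputation_rmse(SimpleImputer(strategy="median")),
    "iterative (Bayesian ridge)": imputation_rmse(IterativeImputer(random_state=0)),
}
for name, value in rmse_by_imputer.items():
    print(f"{name}: RMSE = {value:.3f}")
```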
Predictive performance of the ET model as a function of each predictor’s contribution.
| Predictors in the ET Model | RMSE | R² |
|---|---|---|
| | | |
| EC | 0.0617 | 0.6107 |
| EC, Temp | 0.0559 | 0.6818 |
| EC, Temp, pH | 0.0274 | 0.9223 |
| EC, Temp, pH, DO | 0.0205 | 0.9566 |
| EC, Temp, pH, DO, Turb | 0.0172 | 0.9695 |
| EC, Temp, pH, DO, Turb, Chl | 0.0177 | 0.9681 |
| | | |
| EC | 0.0666 | 0.5637 |
| EC, DO | 0.0608 | 0.6355 |
| EC, DO, Temp | 0.0343 | 0.8848 |
| EC, DO, Temp, Turb | 0.0257 | 0.9345 |
| EC, DO, Temp, Turb, pH | 0.0213 | 0.9559 |
| EC, DO, Temp, Turb, pH, Chl | 0.0212 | 0.9558 |
| | | |
| Temp | 0.1312 | 0.1620 |
| Temp, Chl | 0.1342 | 0.1220 |
| Temp, Chl, Turb | 0.0907 | 0.5986 |
| Temp, Chl, Turb, EC | 0.0655 | 0.7895 |
| Temp, Chl, Turb, EC, DO | 0.0526 | 0.8647 |
| Temp, Chl, Turb, EC, DO, pH | 0.0429 | 0.9101 |
| | | |
| EC | 0.1952 | 0.1820 |
| EC, Turb | 0.2037 | 0.1072 |
| EC, Turb, DO | 0.1554 | 0.4813 |
| EC, Turb, DO, Temp | 0.1101 | 0.7401 |
| EC, Turb, DO, Temp, Chl | 0.0999 | 0.7864 |
| EC, Turb, DO, Temp, Chl, pH | 0.0907 | 0.8219 |
| | | |
| EC | 0.1213 | 0.1697 |
| EC, DO | 0.1291 | 0.0593 |
| EC, DO, Turb | 0.0956 | 0.4853 |
| EC, DO, Turb, Temp | 0.0680 | 0.7382 |
| EC, DO, Turb, Temp, Chl | 0.0610 | 0.7880 |
| EC, DO, Turb, Temp, Chl, pH | 0.0556 | 0.8253 |
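
A sensitivity analysis of this kind adds predictors one at a time, in order of importance, and re-scores the model at each step. The sketch below follows that pattern with scikit-learn. As a loudly stated substitution, it ranks features by the ET model's impurity-based `feature_importances_` instead of SHAP, and it uses synthetic data whose column names are borrowed from the table purely as labels.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data; the names are labels only and carry no physical meaning here.
names = ["EC", "Temp", "pH", "DO", "Turb", "Chl"]
rng = np.random.default_rng(0)
X = rng.normal(size=(300, len(names)))
y = (3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + 0.5 * X[:, 3]
     + rng.normal(0.0, 0.3, 300))

def make_et():
    return ExtraTreesRegressor(n_estimators=100, random_state=0)

# Rank features from most to least important (stand-in for a SHAP ranking).
ranking = np.argsort(make_et().fit(X, y).feature_importances_)[::-1]

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = []
for k in range(1, len(names) + 1):
    cols = ranking[:k]  # the k most important predictors
    r2 = cross_val_score(make_et(), X[:, cols], y, cv=cv, scoring="r2").mean()
    results.append(([names[i] for i in cols], r2))

for predictors, r2 in results:
    print(f"{', '.join(predictors)}: R2 = {r2:.3f}")
```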
A methodological comparison of our work with the three most related studies, where N/S = not specified, ML = machine learning, kNN = k-nearest neighbor, RF = random forest, MLR = multiple linear regression, DT = decision tree, XGB = extreme gradient boosting, LGBM = light gradient boosting machine, BR = bagging regressor, MLP = multi-layer perceptron, SVM = support vector machine, ET = extremely randomized trees, GB = gradient boosting, SGD = stochastic gradient descent, and SHAP = Shapley additive explanations.
| Step | [ | [ | [ | Our Work | Remark |
|---|---|---|---|---|---|
| Data transformation | N/S | Cubic | N/S | Cubic | Even though water quality data is usually skewed, only [ |
| Data scaling | N/S | Standard scaler | N/S | Standard scaler | Only one study [ |
| Missing values handling | N/S | Listwise deletion | Median imputation | Listwise deletion | Among the various missing data handling methods, our work showed that multivariate imputation, which was not implemented in the three reference studies, results in the best-performing models. |
| ML models | RF, MLR | RF, MLR | RF | RF, DT, XGB, LGBM, BR, MLP, kNN, SVM, ET, GB, SGD, MLR, Ridge | Although the commonly used RF performed competitively, the extensive analysis in our work showed that ET performs better. |
| Input variable selection | N/S | Stepwise selection | Stepwise selection | SHAP | Contrary to the commonly used stepwise selection method, we applied SHAP in this work because it satisfies the interpretability requirements, which are essential for sensitive applications like public health. |
A comparative analysis (in terms of the accuracy values reached) of our work and the three most related studies. The predictive performance is not reported where there is a hyphen.
| Method | NO3 RMSE | NO3 NSE/R² | TRP RMSE | TRP NSE/R² | TP RMSE | TP NSE/R² |
|---|---|---|---|---|---|---|
| Random forest [ | 0.059 | 0.89 | | 0.903 | - | - |
| Random forest [ | 0.194 | - | 0.025 | - | 0.110 | - |
| Random forest [ | 0.120 | 0.89 | 2.000 | 0.320 | 11.00 | 0.740 |
| Extra trees [our work] | | | 0.021 | | | |
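
The table reports RMSE alongside NSE/R². The Nash-Sutcliffe efficiency (NSE) has the same form as the coefficient of determination when the latter is computed as one minus the ratio of residual to total sum of squares. The sketch below computes both metrics in plain Python on made-up values.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 - SSE / SST (1 is perfect, 0 matches the mean)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst

# Hypothetical observed and predicted concentrations; NOT from the paper.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(f"RMSE = {rmse(observed, predicted):.4f}")  # RMSE = 0.1581
print(f"NSE  = {nse(observed, predicted):.2f}")   # NSE  = 0.98
```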