Matthew D Stocker, Yakov A Pachepsky, Robert L Hill.
Abstract
The microbial quality of irrigation water is an important issue, as the use of contaminated waters has been linked to several foodborne outbreaks. To expedite microbial water quality determinations, many researchers estimate concentrations of the microbial contamination indicator Escherichia coli (E. coli) from the concentrations of physicochemical water quality parameters. However, these relationships are often non-linear and exhibit changes above or below certain threshold values. Machine learning (ML) algorithms have been shown to make accurate predictions in datasets with complex relationships. The purpose of this work was to evaluate several ML models for the prediction of E. coli in agricultural pond waters. Two ponds in Maryland were monitored from 2016 to 2018 during the irrigation season. E. coli concentrations along with 12 other water quality parameters were measured in water samples. The resulting datasets were used to predict E. coli using stochastic gradient boosting (SGB) machines, random forest (RF), support vector machines (SVM), and k-nearest neighbor (kNN) algorithms. The RF model provided the lowest RMSE value for predicted E. coli concentrations in both ponds in individual years and over consecutive years in almost all cases. For individual years, the RMSE of the predicted E. coli concentrations (log₁₀ CFU 100 ml⁻¹) ranged from 0.244 to 0.346 and 0.304 to 0.418 for Ponds 1 and 2, respectively. For the 3-year datasets, these values were 0.334 and 0.381 for Ponds 1 and 2, respectively. In most cases there was no significant difference (P > 0.05) between the RMSE of RF and other ML models when these RMSE were treated as statistics derived from 10-fold cross-validation performed with five repeats. Important E. coli predictors were turbidity, dissolved organic matter content, specific conductance, chlorophyll concentration, and temperature. Model predictive performance did not significantly differ when 5 predictors were used vs. 8 or 12, indicating that more tedious and costly measurements provide no substantial improvement in the predictive accuracy of the evaluated algorithms.
Keywords: E. coli; food safety; irrigation water; machine learning; microbial water quality
Year: 2022 PMID: 35088045 PMCID: PMC8787305 DOI: 10.3389/frai.2021.768650
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
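The evaluation scheme named in the abstract (10-fold cross-validation with five repeats, scored by RMSE) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the data are synthetic stand-ins for log10 concentrations, the hand-written nearest-neighbor predictor substitutes for the tuned models, and the helper names `rmse`, `knn_predict`, and `repeated_kfold_rmse` are hypothetical.

```python
import math
import random
import statistics

def rmse(observed, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def knn_predict(train_x, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))[:k]
    return sum(train_y[i] for i in nearest) / k

def repeated_kfold_rmse(x, y, folds=10, repeats=5, seed=0):
    """Return one RMSE per held-out fold (folds * repeats values in total)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        order = list(range(len(x)))
        rng.shuffle(order)  # a fresh random partition for every repeat
        for f in range(folds):
            test_idx = order[f::folds]  # every folds-th sample forms one fold
            test_set = set(test_idx)
            train_idx = [i for i in order if i not in test_set]
            train_x = [x[i] for i in train_idx]
            train_y = [y[i] for i in train_idx]
            preds = [knn_predict(train_x, train_y, x[i]) for i in test_idx]
            scores.append(rmse([y[i] for i in test_idx], preds))
    return scores

# Synthetic data (assumption): two predictors and a noisy linear response
# standing in for log10 E. coli concentrations.
data_rng = random.Random(42)
x = [(data_rng.uniform(0, 1), data_rng.uniform(0, 1)) for _ in range(120)]
y = [1.0 + 1.5 * a - 0.5 * b + data_rng.gauss(0, 0.2) for a, b in x]

scores = repeated_kfold_rmse(x, y)
mean_rmse = statistics.mean(scores)
sem = statistics.stdev(scores) / math.sqrt(len(scores))  # standard error of the mean
print(f"RMSE = {mean_rmse:.3f} ± {sem:.3f} (n = {len(scores)})")
```

The 50 per-fold RMSE values are summarized as a mean and a standard error of the mean, which matches the "average ± standard error" form used for reporting model errors.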
Average root-mean-squared errors (RMSE) of logarithms of E. coli concentrations predicted with four machine learning algorithms and multiple linear regression.

| Algorithm | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Pond 1 | | | | | | | | |
| SGB | 0.354 ± 0.015 | 0.343 ± 0.009 | 0.257 ± 0.012 | 0.348 ± 0.012 | 0.325 ± 0.008 | 0.336 ± 0.011 | | |
| kNN | 0.279 ± 0.016 | 0.276 ± 0.012 | 0.395 ± 0.015 | 0.366 ± 0.010 | 0.283 ± 0.016 | 0.385 ± 0.016 | 0.356 ± 0.011 | 0.361 ± 0.016 |
| MLR | 0.452 ± 0.033 | 0.287 ± 0.013 | 0.556 ± 0.016 | 0.504 ± 0.009 | 0.288 ± 0.014 | 0.518 ± 0.014 | 0.461 ± 0.008 | 0.447 ± 0.012 |
| RF | 0.255 ± 0.015 | | | | | | | |
| SVM | 0.269 ± 0.013 | 0.255 ± 0.012 | 0.384 ± 0.013 | 0.356 ± 0.009 | 0.260 ± 0.012 | 0.382 ± 0.014 | 0.344 ± 0.009 | 0.371 ± 0.014 |
| Pond 2 | | | | | | | | |
| SGB | 0.332 ± 0.011 | 0.422 ± 0.013 | 0.381 ± 0.007 | 0.402 ± 0.007 | 0.428 ± 0.015 | 0.375 ± 0.008 | 0.403 ± 0.007 | 0.314 ± 0.009 |
| kNN | 0.370 ± 0.015 | 0.416 ± 0.015 | 0.405 ± 0.008 | 0.423 ± 0.008 | 0.424 ± 0.012 | 0.401 ± 0.009 | 0.408 ± 0.009 | 0.396 ± 0.009 |
| MLR | 0.421 ± 0.016 | 0.463 ± 0.012 | 0.434 ± 0.008 | 0.506 ± 0.008 | 0.467 ± 0.012 | 0.418 ± 0.009 | 0.506 ± 0.006 | 0.391 ± 0.010 |
| RF | 0.306 ± 0.012 | | | | | | | |
| SVM | 0.424 ± 0.014 | 0.365 ± 0.008 | 0.404 ± 0.007 | 0.431 ± 0.013 | 0.378 ± 0.011 | 0.406 ± 0.009 | 0.340 ± 0.010 | |

The ± separates the average from the standard error of the mean. The smallest RMSE are shown in bold. Machine learning algorithms: SGB, stochastic gradient boosting machines; kNN, k-nearest neighbor; MLR, multiple linear regression; RF, random forest; SVM, support vector machines. Predictor sets: A, temperature (C), DO, pH, turbidity, and SPC; AB, all from A and PC, CHL, and fDOM; ABC, all from AB and .
Figure 1. Probabilities for the mean RMSE value to be the same in the RF and other ML applications based on the corrected t-statistic. Symbols in red show statistical significance.
Figure 2. Dependence of the root-mean-squared error (RMSE) on the predictor set size. Predictor sets have 5, 8, and 12 predictors for A, AB, and ABC, respectively. Displayed results are from the 2018 datasets.
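The Figure 1 caption refers to a corrected t-statistic for comparing RMSE values from repeated cross-validation. The record does not give the formula, so the sketch below assumes the widely used Nadeau-Bengio corrected resampled t-statistic, which inflates the variance term by the test-to-train size ratio because the folds share training data. The function name `corrected_t_statistic` and the RMSE values are hypothetical.

```python
import math
import random
import statistics

def corrected_t_statistic(scores_a, scores_b, test_fraction):
    """Corrected resampled t-statistic for paired cross-validation scores.

    test_fraction is n_test / n_train (1/9 for 10-fold cross-validation).
    Returns (t, degrees_of_freedom).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    var_diff = statistics.variance(diffs)  # sample variance (n - 1 denominator)
    # The uncorrected paired t-test would use var_diff / n here; the extra
    # test_fraction term accounts for overlap between training sets.
    t = mean_diff / math.sqrt((1.0 / n + test_fraction) * var_diff)
    return t, n - 1

# Hypothetical per-fold RMSE values for two models (10 folds x 5 repeats).
rng = random.Random(1)
rf_rmse = [0.33 + rng.gauss(0, 0.03) for _ in range(50)]
svm_rmse = [r + 0.01 + rng.gauss(0, 0.02) for r in rf_rmse]

t_stat, dof = corrected_t_statistic(rf_rmse, svm_rmse, test_fraction=1 / 9)
print(f"corrected t = {t_stat:.3f} with {dof} degrees of freedom")
```

The statistic is then compared with a Student t distribution with the returned degrees of freedom to obtain a probability of the kind plotted in Figure 1. The correction always shrinks |t| relative to the naive paired test, so fewer differences come out as significant.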
The top five most important variables as determined by the recursive feature selection algorithm in caret with random forests.

| Variable | Rank | Variable | Rank | Variable | Rank | Variable | Rank | Variable | Rank | Variable | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SPC | 1.3 | SPC | 2.0 | | 1 | C | 2.0 | SPC | 1.5 | | 1 |
| C | 1.7 | C | 2.5 | SPC | 2 | SPC | 2.7 | C | 2.0 | SPC | 2 |
| DO | 3.7 | | 3.5 | CHL | 3 | pH | 3.3 | pH | 2.5 | CHL | 3 |
| pH | 4.0 | CHL | 4.5 | TN | 4 | NTU | 3.3 | NTU | 5.0 | TN | 4 |
| NTU | 4.3 | NTU | 5.5 | TC | 5 | DO | 3.7 | | 5.5 | TC | 5 |

Variable set A was measured in 2016, 2017, and 2018; variable set AB was measured in 2017 and 2018; and variable set ABC was measured in 2018 only. A = C, pH, NTU, SPC. AB = A + CHL, PC, fDOM. ABC = AB + .
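The fractional values in the table are consistent with importance ranks averaged over the years in which a variable set was measured, although the record does not state this explicitly. Under that assumption, a minimal sketch of the aggregation looks like this; the per-year ranks and the function name `top_variables_by_mean_rank` are hypothetical and are not the study's data.

```python
from statistics import mean

def top_variables_by_mean_rank(ranks_by_year, top_n=5):
    """Average each variable's importance rank across years.

    Returns the top_n (variable, mean_rank) pairs, lowest mean rank first.
    Only variables ranked in every year are considered.
    """
    variables = set.intersection(*(set(r) for r in ranks_by_year.values()))
    mean_ranks = {v: mean(r[v] for r in ranks_by_year.values()) for v in variables}
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(mean_ranks.items(), key=lambda item: (item[1], item[0]))[:top_n]

# Hypothetical ranks (1 = most important) for a five-variable set.
ranks_by_year = {
    2016: {"SPC": 1, "C": 2, "pH": 3, "DO": 4, "NTU": 5},
    2017: {"C": 1, "SPC": 2, "DO": 3, "NTU": 4, "pH": 5},
    2018: {"SPC": 1, "C": 2, "DO": 3, "pH": 4, "NTU": 5},
}

ranking = top_variables_by_mean_rank(ranks_by_year)
for name, rank in ranking:
    print(f"{name}: {rank:.1f}")
```

A set measured in a single year would simply return integer ranks, which agrees with the whole-number entries in the table.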