Daniel L. Weller, Tanzy M. T. Love, Martin Wiedmann.
Abstract
Since E. coli is considered a fecal indicator in surface water, government water quality standards and industry guidance often rely on E. coli monitoring to identify when there is an increased risk of pathogen contamination of water used for produce production (e.g., for irrigation). However, studies have indicated that E. coli testing can present an economic burden to growers and that time lags between sampling and obtaining results may reduce the utility of these data. Models that predict E. coli levels in agricultural water may provide a mechanism for overcoming these obstacles. Thus, this proof-of-concept study uses previously published datasets to train, test, and compare E. coli predictive models using multiple algorithms and performance measures. Since the collection of different feature data carries specific costs for growers, predictive performance was compared for models built using different feature types (geospatial, water quality, stream trait, and/or weather features). Model performance was assessed against baseline regression models. Performance varied considerably, with root-mean-squared errors ranging between 0.37 and 1.03 and Kendall's tau between 0.07 and 0.55. Overall, models that included turbidity, rain, and temperature outperformed all other models regardless of the algorithm used. Turbidity and weather factors were also found to drive model accuracy even when other feature types were included in the model. These findings confirm previous conclusions that machine learning models may be useful for predicting when, where, and at what level E. coli (and associated hazards) are likely to be present in preharvest agricultural water sources. This study also identifies specific algorithm-predictor combinations that should be the foci of future efforts to develop deployable models (i.e., models that can be used to guide on-farm decision-making and risk mitigation). When deploying E. coli predictive models in the field, it is important to note that past research indicates an inconsistent relationship between E. coli levels and foodborne pathogen presence. Thus, models that predict E. coli levels in agricultural water may be useful for assessing fecal contamination status and ensuring compliance with regulations, but they should not be used to assess the risk that specific pathogens of concern (e.g., Salmonella, Listeria) are present.
Keywords: E. coli; food safety; machine learning; predictive model; water quality
Year: 2021 PMID: 34056577 PMCID: PMC8160515 DOI: 10.3389/frai.2021.628441
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
FIGURE 1. Sampling sites for each of the streams represented in the training and test data used here. Point size is proportional to (A) mean E. coli concentration (log10 MPN/100 ml) and (B) mean turbidity (log10 NTU). Municipal boundaries (yellow) and major lakes (blue) are included as references. The map depicts the Finger Lakes, Western, and Southern Tier regions of New York State, United States.
TABLE 1. List of algorithms used in the study reported here. This table was adapted from Kuhn and Johnson (2016) and Weller et al. (2020a) to (i) reflect the algorithms used here, and (ii) report information relevant to continuous (as opposed to categorical) data.

| Category | Algorithm | R Package |
|---|---|---|
| Tree-based learners | Conditional Inference Tree | party |
| Tree-based learners | Evolutionary Optimal Tree | evtree |
| Tree-based learners | Regression Tree | rpart |
| Ensemble learners | Conditional Forest | party |
| Ensemble learners | Extremely Randomized Trees | extraTrees |
| Ensemble learners | Node Harvest | nodeHarvest |
| Ensemble learners | Random Forest | randomForest |
| Ensemble learners | Regularized Random Forest | RRF |
| Ensemble learners | Extreme Gradient Boosting | xgboost |
| Instance-based learners | k-Nearest Neighbor | kknn |
| Instance-based learners | Weighted k-Nearest Neighbor | kknn |
| Instance-based learners | Multivariate Adaptive Regression Splines | earth |
| Instance-based learners | Neural Network | nnet |
| Regression | Log-Linear | stats |
| Regression | Partial Least Squares | pls |
| Regression | Principal Component | pls |
| Penalized regression | Elastic Net | glmnet |
| Penalized regression | Lasso | glmnet |
| Penalized regression | Ridge | glmnet |
| Rule-based algorithms | Cubist | Cubist |
| Rule-based algorithms | SVM | e1071 |

[The original table also reported, for each algorithm: whether centering and scaling are recommended; which feature issues it can handle (correlation, missingness, near-zero variance, noise); whether it performs automatic feature selection; and whether it is interpretable. These Y/N/• entries did not survive extraction and are not reproduced here.]
The information reported here is based on (i) Kuhn and Johnson (2016), (ii) the papers cited for each algorithm in the methods section, and (iii) the constraints listed for each R package in the table (based on the package versions available in January 2020).
Y means the algorithm meets the condition in the header; N means it does not. • means the algorithm is in between (e.g., random forest is not as interpretable as tree-based methods, but it is also not a 100% black-box method like support vector machines). A blank cell means there was limited information on this condition for the given algorithm.
Preferentially selects continuous factors, and categorical factors with many levels, as the splitting variable, resulting in variable selection bias (Strobl et al., 2007b; Strobl et al., 2008; Strobl et al., 2009).
Feature selection is recommended before model development.
Centering and scaling are required but are performed as part of model fitting in the R package.
FIGURE 2. RMSE, which measures a model's ability to predict absolute E. coli counts, vs. Kendall's tau, which measures the model's ability to predict relative E. coli concentration. The dashed line represents the RMSE of the featureless regression model; an RMSE to the right of this line indicates that the model performed no better than this baseline at predicting absolute E. coli counts. To facilitate readability, nested models are displayed in a separate facet from the full and log-linear models. Supplementary Figures S3 and S4 display the nested-model facet as a series of convex hulls to facilitate comparisons between models built using different feature types and algorithms, respectively. The top-performing models are in the top left of each facet.
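The two performance measures plotted in Figure 2 can be computed directly from observed and predicted log10 E. coli concentrations. The study's analyses were done in R; the following is a minimal Python sketch (function names are illustrative, not taken from the study) showing why the two measures capture different things: RMSE penalizes errors in absolute level, while Kendall's tau only looks at pairwise ordering.

```python
import math

def rmse(observed, predicted):
    """Root-mean-squared error: how well absolute counts are predicted."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def kendall_tau(observed, predicted):
    """Kendall's tau (the tau-a variant, assuming no ties): how well the
    *ranking* of samples is predicted. Counts concordant minus discordant
    pairs over all n(n-1)/2 pairs."""
    sign = lambda x: (x > 0) - (x < 0)
    n = len(observed)
    s = sum(
        sign(observed[i] - observed[j]) * sign(predicted[i] - predicted[j])
        for i in range(n) for j in range(i + 1, n)
    )
    return s / (n * (n - 1) / 2)
```

A model can have a poor RMSE (biased absolute predictions) yet a high tau (correct ordering of low- vs. high-contamination samples), which is why Figure 2 reports both axes.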
FIGURE 4. Density plots and split-quantile plots showing the performance of the top-ranked full, nested, and log-linear models. The density plot shows each model's ability to predict E. coli counts in the test data, while the split-quantile plot shows each model's ability to predict when E. coli levels are likely to be high or low (i.e., relative E. coli concentration). The split-quantile plot is generated by sorting the test data (i) from lowest to highest predicted E. coli concentration and (ii) from lowest to highest observed E. coli concentration. The test data are then divided into quintiles based on the percentiles of the predicted values (color-coding; see legend) and the observed values (the x-axis runs from the quintile with the lowest observed E. coli levels on the left to the highest on the right). In a good model, all samples predicted to have a low E. coli concentration (red) would fall in the far-left column, while samples predicted to have a high E. coli concentration (blue) would fall in the far-right column.
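The quintile assignment described above is straightforward to reproduce. As a sketch (pure Python; function names are illustrative, and the study's figures were produced in R), each sample is assigned to a quintile by its rank among the predicted values and again by its rank among the observed values; cross-tabulating the two label sets shows how often a model places samples in the correct quintile.

```python
from collections import Counter

def quintile_labels(values):
    """Assign each value to a quintile by rank: 0 = lowest 20%, 4 = highest 20%."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = min(rank * 5 // n, 4)
    return labels

def split_quantile_table(observed, predicted):
    """Counts of samples in each (observed quintile, predicted quintile) cell;
    a good model concentrates counts on the diagonal."""
    return Counter(zip(quintile_labels(observed), quintile_labels(predicted)))
```

For a perfect ranking, every count lands on the diagonal cells (0,0) through (4,4), matching the ideal described in the caption where red samples fill the far-left column and blue samples the far-right one.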
FIGURE 5. Permutation variable importance (PVI) of the 30 factors most strongly associated with predicting E. coli levels in the test and training data using the full boosted, k-nearest neighbor Cubist model. The black dot shows the median importance, while the line shows the 5% and 95% quantiles of the PVI values from the 150 permutations performed. Avg. Sol Rad = average solar radiation; Elev = elevation; FP = floodplain; SPDES = wastewater discharge site; Soil A = hydrologic soil type A.
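Permutation variable importance, as summarized in Figure 5, measures how much a model's error grows when one feature's values are shuffled, which breaks that feature's relationship with the outcome while leaving its marginal distribution intact. A minimal sketch follows (pure Python; `n_repeats=150` mirrors the 150 permutations reported above, and all other names are illustrative rather than from the study's R code):

```python
import random
import statistics

def permutation_importance(predict, X, y, feature_idx, error, n_repeats=150, seed=0):
    """Shuffle one feature column repeatedly and record how much the error
    metric increases relative to the unpermuted baseline."""
    rng = random.Random(seed)
    baseline = error(y, predict(X))
    column = [row[feature_idx] for row in X]
    increases = []
    for _ in range(n_repeats):
        shuffled = column[:]
        rng.shuffle(shuffled)
        # Rebuild the data with only this one column permuted.
        X_perm = [row[:feature_idx] + [v] + row[feature_idx + 1:]
                  for row, v in zip(X, shuffled)]
        increases.append(error(y, predict(X_perm)) - baseline)
    return statistics.median(increases), increases
```

Figure 5 summarizes each feature's list of error increases by its median (black dot) and its 5% and 95% quantiles (line); a feature the model never uses yields increases of zero.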
FIGURE 6. Accumulated local effects plots showing the effects of the four factors with the highest PVI when predictions were made on the training (red) and test (blue) data using the full boosted, k-nearest neighbor Cubist model. All predictors were centered and scaled before training each model; as a result, the x-axis units are the number of standard deviations above or below the mean of the given factor (e.g., in the rainfall plot, 0 and 2 indicate the mean rainfall 0–1 days before sample collection [BSC] and 2 standard deviations above this mean, respectively).
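Accumulated local effects (ALE) plots like those in Figure 6 estimate a feature's effect by binning the data along that feature, averaging how the prediction changes as the feature moves from the lower to the upper edge of each bin (holding all other features at their observed values), and accumulating those average changes. A minimal first-order ALE sketch, with all names illustrative (the study used R, and production implementations also center the curve to mean zero, a step omitted here for brevity):

```python
def ale_curve(predict, X, j, n_bins=4):
    """First-order accumulated local effects for feature j.
    Bins the feature by empirical quantiles, averages the prediction change
    across each bin for the rows falling in it, then accumulates the averages."""
    xs = sorted(row[j] for row in X)
    # Quantile-style bin edges over the observed feature values.
    edges = [xs[round(k * (len(xs) - 1) / n_bins)] for k in range(n_bins + 1)]
    effects = []
    for k in range(n_bins):
        lo, hi = edges[k], edges[k + 1]
        # Rows in the half-open bin (lo, hi]; the first bin also keeps lo itself.
        rows = [r for r in X if r[j] <= hi and (r[j] > lo or k == 0)]
        diffs = [
            predict(r[:j] + [hi] + r[j + 1:]) - predict(r[:j] + [lo] + r[j + 1:])
            for r in rows
        ]
        effects.append(sum(diffs) / len(diffs) if diffs else 0.0)
    curve = [0.0]
    for e in effects:
        curve.append(curve[-1] + e)
    return edges, curve
```

Because only within-bin changes are averaged, ALE remains meaningful when predictors are correlated, which is one reason it is preferred over simple partial dependence for observational data like these.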