Daniel L Weller1,2, Tanzy M T Love2, Alexandra Belias1, Martin Wiedmann1.
Abstract
While the Food Safety Modernization Act established standards for the use of surface water in produce production, water quality is known to vary over space and time. Targeted approaches for identifying hazards in water that account for this variation may improve growers' ability to address pre-harvest food safety risks. Models that utilize publicly-available data (e.g., land-use, real-time weather) may be useful for developing these approaches. The objective of this study was to use pre-existing datasets collected in 2017 (N = 181 samples) and 2018 (N = 191 samples) to train and test models that predict the likelihood of detecting Salmonella and pathogenic E. coli markers (eaeA, stx) in agricultural water. Four types of features were used to train the models: microbial, physicochemical, spatial, and weather. "Full models" were built using all four feature types, while "nested models" were built using between one and three types. Twenty learners were used to develop separate full models for each pathogen. Separately, to assess the information gain associated with using different feature types, six learners were randomly selected and used to develop nine nested models each. Performance measures for each model were then calculated and compared against baseline models in which E. coli concentration was the sole covariate. In the methods, we outline the advantages and disadvantages of each learner. Overall, full models built using ensemble (e.g., Node Harvest) and "black-box" (e.g., SVMs) learners out-performed full models built using more interpretable learners (e.g., tree- and rule-based learners) for both outcomes. However, nested eaeA-stx models built using interpretable learners and microbial data performed almost as well as these full models. While none of the nested Salmonella models performed as well as the full models, nested models built using spatial data consistently out-performed models that excluded spatial data.
These findings demonstrate that machine learning approaches can be used to predict when and where pathogens are likely to be present in agricultural water. This study serves as a proof-of-concept that can be built upon once larger datasets become available and provides guidance on the learner-data combinations that should be the foci of future efforts (e.g., tree-based microbial models for pathogenic E. coli).
Keywords: E. coli; Salmonella; agricultural water; eaeA; machine learning; predictive model; stx
Year: 2020 PMID: 33791594 PMCID: PMC8009603 DOI: 10.3389/fsufs.2020.561517
Source DB: PubMed Journal: Front Sustain Food Syst ISSN: 2571-581X
TABLE 1 |Foodborne pathogen prevalence and E. coli levels in New York streams used for produce production.
| Year | No. of streams | No. of samples | Salmonella: culture-confirmed | Salmonella: PCR-screen positive | eaeA-stx codetection | Median MPN of E. coli/100 mL (range) |
|---|---|---|---|---|---|---|
| 2017 | 6 | 181 | 44% (80) | 94% (171) | 69% (125) | 160.4 (18.5–>2,419.6) |
| 2018 | 68 | 191 | 41% (79) | 99% (190) | 70% (133) | 211.4 (2.0–>2,419.6) |
| Total | 68 | 372 | 43% (159) | 97% (361) | 69% (258) | 193.5 (2.0–>2,419.6) |
Prevalence is reported as percent (no. of positive samples).
The outcome of the eaeA-stx models was codetection of both the eaeA and stx genes; in both years, all stx-positive samples were also eaeA-positive. As a result, the prevalence of samples positive for both genes was 69% in 2017 and 68% in 2018.
FIGURE 1 |Map of land cover in watersheds sampled in both study years (No. = 6), and watersheds sampled only in 2018 (No. = 62).
FIGURE 2 |Visualization of the inverse distance weighting approach used to calculate the percent of the watershed (A), floodplain (B), and riparian buffer (C) under different land uses. (D–F) Provide a close-up view of (A–C), respectively, for areas near the sampling site.
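The inverse distance weighting step visualized in Figure 2 can be sketched as follows. This is a minimal, illustrative Python sketch, not the paper's GIS implementation: the weight function (1/distance raised to a power) and the `idw_landuse_percent` helper are assumptions, and real analyses would operate on raster land-cover data.

```python
import math

def idw_landuse_percent(cells, site, target_class, power=1.0):
    """Percent of area under `target_class`, with each cell weighted by
    its inverse distance to the sampling site (hypothetical helper).

    cells: list of (x, y, landuse_class) tuples
    site:  (x, y) coordinates of the sampling location
    """
    num = 0.0  # summed weights of cells in the target class
    den = 0.0  # summed weights of all cells
    for x, y, cls in cells:
        d = math.hypot(x - site[0], y - site[1])
        w = 1.0 / max(d, 1e-9) ** power  # guard against a cell at the site itself
        den += w
        if cls == target_class:
            num += w
    return 100.0 * num / den if den else 0.0

# Toy example: forest cells close to the site dominate the weighted percent,
# even though cropland covers a third of the (unweighted) cells.
cells = [(1, 0, "forest"), (2, 0, "forest"), (10, 0, "cropland")]
pct_forest = idw_landuse_percent(cells, (0, 0), "forest")  # -> 93.75
```

The design intent mirrors the figure: land uses near the sampling site contribute more to the watershed, floodplain, and riparian-buffer percentages than distant land uses.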
TABLE 2 |List of learners used here, including advantages and disadvantages of each learner as implemented in the R package used here^a.
| Learners | Package | Centering and scaling needed | Handles correlated features | Handles missing data | Handles near-zero variance | Handles noisy features | Automatic feature selection | Interpretable |
|---|---|---|---|---|---|---|---|---|
| Bayesian Learners | | | | | | | | |
| Naive Bayes | e1071 | • | • | | | | | |
| Tree-Based Learners | | | | | | | | |
| Classification tree | rpart | • | • | | | | | |
| Conditional tree | party | • | | | | | | |
| Evolutionary optimal tree | evtree | • | | | | | | |
| Ensemble Learners | | | | | | | | |
| Conditional forest | party | • | • | • | | | | |
| Node harvest | nodeHarvest | • | • | | | | | |
| Random forest | randomForest | • | • | • | | | | |
| Regularized RF | RRF | • | | | | | | |
| Random ferns | rferns | • | • | | | | | |
| Random KNN | rknn | • | • | | | | | |
| Extreme gradient boosting | xgboost | • | • | | | | | |
| Instance-Based Learners | | | | | | | | |
| k-Nearest neighbor | kknn | • | • | | | | | |
| Weighted kKNN | kknn | • | • | | | | | |
| Penalized Regression | | | | | | | | |
| Elastic net | glmnet | | | | | | | |
| Lasso | glmnet | | | | | | | |
| Ridge | glmnet | | | | | | | |
| Rule-Based Learners | | | | | | | | |
| JRip | RWeka | | | | | | | |
| One rule | RWeka | | | | | | | |
| Partial decision lists | RWeka | | | | | | | |
| SVM | e1071 | • | | | | | | |
This table was adapted from Kuhn and Johnson (2016) to include all learners used here. The information reported here is based on the papers cited for each learner in the methods section and the constraints of the R packages used to implement the learners in this study (based on the versions available in January 2020). Y means the learner meets the condition in the header; N means it does not. • = the learner is in between (e.g., random forest is not as interpretable as tree-based methods but is not a 100% black-box method like support vector machines). A blank cell means there was limited information on this parameter for the given learner.
It is important to note that although tree-based methods are relatively robust to noise in the features, they are less robust than tree-based ensembles. Theoretically, ensemble methods are more robust to noise in the features than the constituent models used to build the ensemble (e.g., rFERNS should be more robust than Naïve Bayes, rKNN more robust than wKNN and kKNN, and forests more robust than trees).
Classification trees preferentially select continuous variables and categorical variables with many levels as the splitting variable, resulting in variable selection bias (Strobl et al., 2007, 2008, 2009). Conditional inference trees and conditional forests were developed to overcome this limitation (Strobl et al., 2007, 2008, 2009).
Predicts class labels but not probability of detecting a positive.
Feature selection recommended prior to model development.
FIGURE 3 |Log10 E. coli levels in training and test data samples that tested positive and negative for eaeA-stx and Salmonella. The colored lines represent the thresholds for agricultural water that were considered during development of the US Food Safety Modernization Act's Produce Safety Rule [126 MPN/100-mL (pink), 235 MPN/100-mL (blue), and 410 MPN/100-mL (green)].
FIGURE 4 |Mean rank (0 = worst; 65 = best) of each learner-data combination for each outcome. To facilitate readability, full and baseline models are depicted in a separate facet from the nested models, which were built using a subset of features. For baseline models, the letters refer to the organism the cutoff is based on (EC = E. coli, TC = total coliforms), and the number refers to the cutoff value (e.g., EC.126 is based on a cutoff of 126 MPN of E. coli/100-mL). Models that were able to accurately predict both Salmonella and eaeA-stx presence appear in the top right corner of each facet, while poor-performing models appear in the bottom left of each facet.
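A baseline model of the kind described above can be sketched in a few lines. This is an illustrative Python sketch, not the paper's implementation: it assumes a simple decision rule (predict pathogen-positive when the E. coli level meets or exceeds the cutoff, e.g., EC.126) and evaluates it with Cohen's kappa; the toy MPN values and `baseline_predict` helper are hypothetical.

```python
def baseline_predict(ecoli_mpn, cutoff=126):
    """EC-style baseline: predict pathogen-positive when the E. coli
    concentration (MPN/100 mL) meets or exceeds the cutoff (assumed rule)."""
    return [mpn >= cutoff for mpn in ecoli_mpn]

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa = (observed agreement - chance agreement) / (1 - chance)."""
    n = len(y_true)
    po = sum(t == p for t, p in zip(y_true, y_pred)) / n   # observed agreement
    p_pos = (sum(y_true) / n) * (sum(y_pred) / n)          # chance both positive
    p_neg = (1 - sum(y_true) / n) * (1 - sum(y_pred) / n)  # chance both negative
    pe = p_pos + p_neg
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

# Toy data: E. coli MPN values and observed pathogen detection
mpn = [50, 200, 300, 20, 500, 110]
observed = [False, True, True, False, True, True]
pred = baseline_predict(mpn, cutoff=126)  # [False, True, True, False, True, False]
kappa = cohens_kappa(observed, pred)      # 2/3: one positive sample missed
```

Comparing a learner's kappa against baselines like this one is what Figure 4's ranking summarizes across all learner-data combinations.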
FIGURE 5 |Plot showing kappa score and area under the curve for the full models.
FIGURE 6 |Kappa score and AUC for the nested models. Results are faceted by model outcome and learner: Mq, microbial; MqTurb, microbial data and turbidity; Pq, physicochemical water quality and air temperature collected on site; W, weather from publicly-available databases; S, spatial. With the exception of the Mq models, each nested model used data on site traits (e.g., stream bottom substrate). Top-performing models are in the top right corner of each facet.
FIGURE 7 |Plots showing the performance of the top-ranked full and nested Salmonella models. (A) Shows how well the model distinguishes samples that tested positive and negative for Salmonella; the x-axis is the probability of Salmonella detection generated by the model, and the y-axis is density. (B) Is the receiver operating characteristic (ROC) curve for the model; the x-axis is 1-Specificity and the y-axis is Sensitivity. (C) Shows how accurately the model classifies positive and negative samples. The split quantiles plot is generated by sorting the test data from lowest to highest probability of Salmonella detection based on the given model. The test data are then divided into quantiles (based on the percentile the probability falls into), and the proportion of samples in each quantile that were actually Salmonella-positive or -negative is plotted. A good model would identify all low-probability-percentile samples (red) as negative (N) and all high-probability-percentile samples (blue) as positive (P).
FIGURE 8 |Plots showing the performance of the top-ranked full and nested eaeA-stx models. (A) Shows how well the model distinguishes samples that tested positive and negative for eaeA-stx; the x-axis is the probability of eaeA-stx detection generated by the model, and the y-axis is density. (B) Is the receiver operating characteristic (ROC) curve for the model; the x-axis is 1-Specificity and the y-axis is Sensitivity. (C) Shows how accurately the model classifies positive and negative samples. The split quantiles plot is generated by sorting the test data from lowest to highest probability of eaeA-stx detection based on the given model. The test data are then divided into quantiles (based on the percentile the probability falls into), and the proportion of samples in each quantile that were actually eaeA-stx-positive or -negative is plotted. A good model would identify all low-probability-percentile samples (red) as negative (N) and all high-probability-percentile samples (blue) as positive (P).
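The split quantiles calculation described in the figure captions above can be sketched as follows. This is a minimal illustration, not the paper's plotting code: the `split_quantiles` helper, the number of bins, and the toy probabilities are assumptions.

```python
def split_quantiles(probs, labels, n_bins=4):
    """Sort samples by predicted probability of detection, split them into
    equal-sized quantile bins, and return the proportion of truly positive
    samples in each bin (lowest-probability bin first)."""
    ranked = sorted(zip(probs, labels))  # low -> high predicted probability
    size = len(ranked) // n_bins
    props = []
    for i in range(n_bins):
        # Last bin absorbs any remainder so every sample is counted once
        chunk = ranked[i * size:(i + 1) * size] if i < n_bins - 1 else ranked[i * size:]
        props.append(sum(lab for _, lab in chunk) / len(chunk))
    return props

# Toy test set: a well-calibrated model shows proportions rising across bins,
# i.e., low-probability bins are mostly negative and high-probability bins
# mostly positive.
probs  = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   0,   1,   0,   1,   1,   1]
props = split_quantiles(probs, labels, n_bins=4)  # [0.0, 0.5, 0.5, 1.0]
```

Plotting these per-bin proportions (positives in one color, negatives in the other) reproduces the structure of the split quantiles panels: a good model concentrates negatives at the left and positives at the right.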