Nanna Munck1, Patrick Murigu Kamau Njage1, Pimlapas Leekitcharoenphon1, Eva Litrup2, Tine Hald1.
Abstract
Prevention of the emergence and spread of foodborne diseases is an important prerequisite for the improvement of public health. Source attribution models link sporadic human cases of a specific illness to food sources and animal reservoirs. With next-generation sequencing technology, it is possible to develop novel source attribution models. We investigated the potential of machine learning to predict, based on whole-genome sequencing, the animal reservoir from which a bacterial strain isolated from a human salmonellosis case originated. Machine learning methods recognize patterns in large and complex data sets and use this knowledge to build models. The model learns patterns associated with genetic variations in bacteria isolated from the different animal reservoirs. We selected different machine learning algorithms to predict sources of human salmonellosis cases and trained the models with Danish Salmonella Typhimurium isolates sampled from broilers (n = 34), cattle (n = 2), ducks (n = 11), layers (n = 4), and pigs (n = 159). Using cgMLST as input features, the models yielded an average accuracy of 0.783 (95% CI: 0.77-0.80) in the source prediction for the random forest algorithm and 0.933 (95% CI: 0.92-0.94) for the logit boost algorithm. The logit boost algorithm was the most accurate (valid accuracy: 92%, CI: 0.8706-0.9579) and predicted the origin of 81% of the domestic sporadic human salmonellosis cases. The most important source was Danish-produced pigs (53%), followed by imported pigs (16%), imported broilers (6%), imported ducks (2%), Danish-produced layers (2%), and Danish-produced and imported cattle (<1%), while 18% were not predicted. Machine learning has potential for improving source attribution modeling based on sequence data. Results of such models can inform risk managers to identify and prioritize food safety interventions.
Keywords: Machine learning; source attribution; whole genome sequencing
Year: 2020 PMID: 32515055 PMCID: PMC7540586 DOI: 10.1111/risa.13510
Source DB: PubMed Journal: Risk Anal ISSN: 0272-4332 Impact factor: 4.000
Fig 1. Conceptual model of the machine learning method development.
Fig 2. Phylogenetic tree of the Danish data set. Branch length to outgroup ST36 reduced by a factor of 30, from 0.71421 to 0.023807. Isolates are annotated by source (inner ring). Whether the source of the human cases was predicted is indicated in the outer ring. Source: light red: domestically produced pigs, Pigs (DK); light pink: imported pigs, Pigs (import); yellow: imported ducks, Ducks (import); blue: domestically produced broilers, Broilers (DK); turquoise: domestically produced eggs, Layers (DK); light green: domestically produced cattle, Cattle (DK); dark green: imported cattle, Cattle (Import); and dark grey: Danish human salmonellosis cases, Humans (DK).
Number of Salmonella Typhimurium and its Monophasic Variants Included in the Danish Data Set
| Source (DK data set, 2013–2014) | 2013 | 2014 | Number of Isolates |
|---|---|---|---|
| Pigs (DK) | 84 | 41 | 125 |
| Pigs (import) | 20 | 14 | 34 |
| Broilers (DK) | 13 | 21 | 34 |
| Ducks (import) | 0 | 11 | 11 |
| Layers (DK) | 3 | 1 | 4 |
| Cattle (DK) | 1 | 0 | 1 |
| Cattle (import) | 0 | 1 | 1 |
| Total animal | 121 | 89 | 210 |
| Human | 29 | 112 | 141 |
Note: DK, Denmark.
Seventeen Loci Sorted by Maximum Importance across the Sources
| Loci | Broilers (DK) | Cattle (DK) | Cattle (Import) | Ducks (Import) | Layers (DK) | Pigs (DK) | Pigs (Import) |
|---|---|---|---|---|---|---|---|
| SALM01217 | 0.9677 | 0.9677 | 0.5161 | 0.5269 | 0.6819 | 1.0000 | 0.9677 |
| SALM02906 | 1.0000 | 1.0000 | 0.5000 | 0.5238 | 0.6802 | 1.0000 | 1.0000 |
| SALM01562 | 1.0000 | 1.0000 | 0.7796 | 0.7204 | 0.7204 | 1.0000 | 1.0000 |
| SALM01921 | 0.7796 | 0.7796 | 0.7796 | 0.7238 | 0.7204 | 1.0000 | 0.7796 |
| SALM01860 | 1.0000 | 0.8647 | 0.7796 | 0.7204 | 0.7204 | 1.0000 | 1.0000 |
| SALM02626 | 0.7796 | 0.7796 | 0.7796 | 0.7238 | 0.7204 | 1.0000 | 0.7796 |
| SALM02334 | 1.0000 | 0.7204 | 0.7204 | 0.7204 | 0.7204 | 1.0000 | 1.0000 |
| SALM00032 | 0.9409 | 0.9409 | 0.9409 | 0.7737 | 0.7881 | 1.0000 | 0.9409 |
| SALM01381 | 0.5000 | 0.9432 | 0.5000 | 0.5238 | 0.6802 | 0.5000 | 0.9432 |
| SALM01938 | 0.8172 | 0.8172 | 0.9367 | 0.8172 | 0.8172 | 0.8172 | 0.8172 |
| SALM00628 | 0.8172 | 0.8172 | 0.8172 | 0.8172 | 0.8172 | 0.8172 | 0.8172 |
| SALM00010 | 0.7204 | 0.7204 | 0.7204 | 0.7204 | 0.7204 | 0.7204 | 0.7204 |
| SALM02003 | 0.7204 | 0.7204 | 0.7204 | 0.7204 | 0.7204 | 0.7204 | 0.7204 |
| SALM01572 | 0.5000 | 0.5000 | 0.5000 | 0.5417 | 0.6977 | 0.5000 | 0.5000 |
| SALM02871 | 0.5269 | 0.5269 | 0.5269 | 0.5875 | 0.6689 | 0.5269 | 0.5269 |
| SALM01670 | 0.5269 | 0.5269 | 0.6019 | 0.5269 | 0.6602 | 0.5269 | 0.5269 |
| SALM00643 | 0.5753 | 0.5753 | 0.5753 | 0.5753 | 0.5753 | 0.5753 | 0.5753 |
Note: The numbers represent the importance of each feature (locus), based on how accurately it predicts each source. The values are areas under the ROC curve (AUC) derived from the source-specific sensitivity and specificity values (Table III).
Sensitivity, Specificity and Balanced Accuracy for the Prediction of Sources by the Logit Boost Machine Learning Model
| | Broilers (DK) | Cattle (DK) | Cattle (import) | Ducks (Import) | Layers (DK) | Pigs (DK) | Pigs (Import) |
|---|---|---|---|---|---|---|---|
| Sensitivity | 1 | NA | 1 | 0.8919 | 1 | 0.61538 | 0.73333 |
| Specificity | 0.9353 | 1 | 0.9688 | 1 | 1 | 1 | 1 |
| Balanced accuracy | 0.9676 | NA | 0.98 | 0.95 | 1 | 0.81 | 0.87 |
Note: Balanced accuracies defined by Brodersen et al. (2010) as the average accuracy obtained on either class. DK, Denmark.
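As a minimal sketch of the definition in the note above, balanced accuracy for one source can be taken as the mean of its sensitivity and specificity. The values below are copied from the table; the function name `balanced_accuracy` and the dictionary layout are illustrative assumptions, not part of the original analysis. Cattle (DK) is omitted because its sensitivity is reported as NA.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of the accuracies obtained on the positive and negative class."""
    return (sensitivity + specificity) / 2


# (sensitivity, specificity) per source, copied from the table above.
# Cattle (DK) is left out because its sensitivity is NA.
per_source = {
    "Broilers (DK)": (1.0, 0.9353),
    "Cattle (import)": (1.0, 0.9688),
    "Ducks (Import)": (0.8919, 1.0),
    "Layers (DK)": (1.0, 1.0),
    "Pigs (DK)": (0.61538, 1.0),
    "Pigs (Import)": (0.73333, 1.0),
}

balanced = {
    source: balanced_accuracy(sens, spec)
    for source, (sens, spec) in per_source.items()
}

for source, value in balanced.items():
    print(f"{source}: {value:.2f}")
```

Rounded to two decimals, these values agree with the balanced accuracy row of the table.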
Confusion Matrix of the Constructed Model
| % of Total Predicted | Broilers (DK) | Cattle (DK) | Cattle (Import) | Ducks (Import) | Layers (DK) | Pigs (DK) | Pigs (Import) |
|---|---|---|---|---|---|---|---|
| Broilers (DK) | 100 | 0 | 0 | 0 | 0 | 38 | 27 |
| Cattle (DK) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Cattle (import) | 0 | 0 | 100 | 0 | 0 | 0 | 0 |
| Ducks (import) | 0 | 0 | 0 | 100 | 0 | 0 | 0 |
| Layers (DK) | 0 | 0 | 0 | 0 | 100 | 0 | 0 |
| Pigs (DK) | 0 | 0 | 0 | 0 | 0 | 62 | 0 |
| Pigs (import) | 0 | 0 | 0 | 0 | 0 | 0 | 73 |
| % of isolates in testing data | | | | | | | |
| Total predicted | 87.5 | 0 | 100 | 89.2 | 100.0 | 31.7 | 38.5 |
| Not predicted | 12.5 | 100 | 0 | 10.8 | 0.0 | 68.3 | 61.5 |
Note: Rows: Predicted Source. Column: Observed Source. DK, Denmark.
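The column-wise percentages in a matrix of this layout (rows: predicted source, columns: observed source) can be derived from raw counts as sketched below. The counts are hypothetical, chosen only so that the resulting percentages resemble three columns of the table; the underlying counts are not given in this record. The diagonal share of each column corresponds to the sensitivity for that source.

```python
sources = ["Broilers (DK)", "Pigs (DK)", "Pigs (Import)"]

# Hypothetical counts (not from the study): rows = predicted, columns = observed.
counts = [
    [7, 5, 4],   # predicted Broilers (DK)
    [0, 8, 0],   # predicted Pigs (DK)
    [0, 0, 11],  # predicted Pigs (Import)
]

n = len(sources)

# Number of predicted isolates per observed source (column sums).
col_totals = [sum(counts[r][c] for r in range(n)) for c in range(n)]

# Each cell as a percentage of its column total.
percent = [
    [100 * counts[r][c] / col_totals[c] for c in range(n)]
    for r in range(n)
]

# Sensitivity per source: correctly predicted share of each column.
sensitivity = {sources[c]: counts[c][c] / col_totals[c] for c in range(n)}

for r in range(n):
    print(sources[r], [round(p) for p in percent[r]])
```

With these assumed counts, the Pigs (DK) column splits into 38% predicted as broilers and 62% predicted correctly, which mirrors the pattern in the table.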
Fig 3. Result of the predictive machine learning model. All 95 predicted human salmonellosis cases are lined up along the x‐axis and the source-specific probabilities for each of the human cases are stacked along the y‐axis. Human cases attributed to unknown sources are not shown.
Source Attribution Results Obtained by Applying the Machine Learning Model and the Bayesian Model to the Same Data Set
| | Machine Learning Model | Bayesian Model | |
|---|---|---|---|
| DK data | 17 loci | Serotype, MLVA profile, resistance profile | |
| Performance measure | Valid accuracy: 0.922 (CI 0.8706–0.9579); Kappa value: 0.9033 | Fit 0.9 (0.7–1.2) | |
| Prediction | Human cases attributed, n (%) | Human cases attributed, n (%) | 95% CI of n |
| Number of human cases predicted (%) | 95 (81) | 69 (49) | NA |
| Broilers (DK) | 7.5 (6.4) | 8 (5.7) | 0.2–24.7 |
| Cattle (DK) | 0.4 (0.3) | 4 (2.8) | 0.4–13.2 |
| Cattle (Import) | 0.3 (0.2) | 3 (2.1) | 0.3–7 |
| Ducks (Import) | 2.7 (2.3) | 2 (1.4) | 0.2–5.6 |
| Layers (DK) | 2 (1.7) | 2 (1.4) | 0.2–5.6 |
| Pigs (DK) | 62.9 (53.3) | 31 (22.0) | 13.6–53.5 |
| Pigs (Import) | 19.3 (16.4) | 19 (13.5) | 3.2–42.6 |
| Not predicted | 21 (17.8) | 31 (22.0) | 11–49.2 |
| Travel cases | 23* | 41 (29.1) | 36.7–45.4 |
Note: MLVA profile: Allelic number in loci STTR3, STTR10, and STTR9. Resistance profile: Phenotypic resistance towards ampicillin, chloramphenicol, sulphamethoxazole, tetracycline, trimethoprim, ciprofloxacin, gentamicin, nalidixic acid, and ceftiofur. *No percentage, as these cases were attributed directly to travel and thus not predicted. CI: 95% Confidence Interval; DK, Denmark.
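The percentages in the machine learning column can be approximately reproduced from the attributed case numbers, as in the sketch below. It assumes that the denominator is the 141 human isolates minus the 23 travel cases (118 domestic cases); that denominator is inferred from the tables and is not stated explicitly in this record. Small deviations from the published percentages can arise because the case numbers shown in the table are themselves rounded.

```python
TOTAL_HUMAN_CASES = 141  # human isolates in the Danish data set
TRAVEL_CASES = 23        # attributed directly to travel, not predicted

# Assumed denominator: non-travel (domestic) human cases.
domestic_cases = TOTAL_HUMAN_CASES - TRAVEL_CASES

# Attributed human cases per source, copied from the machine learning column.
attributed = {
    "Broilers (DK)": 7.5,
    "Ducks (Import)": 2.7,
    "Layers (DK)": 2.0,
    "Pigs (DK)": 62.9,
    "Pigs (Import)": 19.3,
    "Not predicted": 21.0,
}

# Share of domestic cases per source, in percent.
shares = {
    source: 100 * cases / domestic_cases
    for source, cases in attributed.items()
}

for source, share in shares.items():
    print(f"{source}: {share:.1f}%")
```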