Collins K Tanui, Edmund O Benefo, Shraddha Karanth, Abani K Pradhan.
Abstract
Despite its low morbidity, listeriosis has a high mortality rate due to the severity of its clinical manifestations. The source of human listeriosis is often unclear. In this study, we investigate the ability of machine learning to predict the food source from which clinical Listeria monocytogenes isolates originated. Four machine learning classification algorithms were trained on core genome multilocus sequence typing data of 1212 L. monocytogenes isolates from various food sources. The average accuracies of random forest, support vector machine with a radial kernel, stochastic gradient boosting, and logit boost were 0.72, 0.61, 0.70, and 0.73, respectively. Logit boost showed the best performance and was used in model testing on 154 L. monocytogenes clinical isolates. The model attributed 17.5% of human clinical cases to dairy, 32.5% to fruits, 14.3% to leafy greens, 9.7% to meat, 4.6% to poultry, and 18.8% to vegetables. The final model also identified genetic features that were predictive of specific sources. Thus, this combination of genomic data and machine learning-based models can greatly enhance our ability to track L. monocytogenes from different food sources.
Keywords: Listeria monocytogenes; food source attribution; machine learning; predictive modeling; whole-genome sequencing
Year: 2022 PMID: 35745545 PMCID: PMC9230378 DOI: 10.3390/pathogens11060691
Source DB: PubMed Journal: Pathogens ISSN: 2076-0817
Model performance over 10 iterations of the random forest, support vector machine (radial kernel), stochastic gradient boosting, and logit boost models.
| Models | Accuracy | 95% CI | Kappa |
|---|---|---|---|
| Logit boost | 0.732 a | 0.665–0.760 | 0.654 |
| Random forest | 0.722 a | 0.667–0.776 | 0.657 |
| Stochastic gradient boosting | 0.701 a | 0.645–0.745 | 0.633 |
| Support vector machine | 0.614 b | 0.569–0.671 | 0.530 |
Values under the Accuracy column with different superscripts are significantly different (p < 0.05).
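The Accuracy and Kappa columns above can be related with a short worked example. The following is a minimal sketch, assuming Kappa refers to Cohen's kappa computed from true and predicted source labels; the function name `accuracy_and_kappa` and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, Cohen's kappa) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(y_true)
    # Observed agreement: share of isolates whose predicted source is correct.
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: sum over sources of (true share) * (predicted share).
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_chance = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_chance == 1.0:
        # Only one source appears in both sequences; kappa is undefined.
        return p_observed, float("nan")
    kappa = (p_observed - p_chance) / (1.0 - p_chance)
    return p_observed, kappa


# Toy labels (hypothetical, for illustration only).
y_true = ["dairy", "dairy", "meat", "meat", "fruits", "fruits"]
y_pred = ["dairy", "meat", "meat", "meat", "fruits", "dairy"]
acc, kappa = accuracy_and_kappa(y_true, y_pred)
print(f"accuracy={acc:.3f} kappa={kappa:.3f}")
```

Kappa discounts the agreement expected by chance, which is why it sits below accuracy in every row of the table.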
Figure 1. Predicted sources of clinical L. monocytogenes isolates.
Twenty putative genes sorted by maximum importance across the food sources.
| Loci | Gene | Protein Name | Dairy | Fruits | Leafy Greens | Meat | Poultry | Seafood | Vegetables |
|---|---|---|---|---|---|---|---|---|---|
| lmo2702 | | Recombination protein RecR | 0.6653 | 0.5945 | 0.6925 | 0.8315 | 0.7212 | 0.6219 | 0.6653 |
| lmo2401 | | Hypothetical protein | 0.7017 | 0.663 | 0.6997 | 0.8231 | 0.7664 | 0.663 | 0.7017 |
| lmo2615 | | 30S ribosomal protein S5 | 0.6873 | 0.5786 | 0.708 | 0.8199 | 0.7465 | 0.6081 | 0.6873 |
| lmo2577 | | Hypothetical protein | 0.7066 | 0.6611 | 0.6809 | 0.808 | 0.7851 | 0.6611 | 0.7066 |
| lmo1501 | | Hypothetical protein | 0.6925 | 0.6014 | 0.6839 | 0.8022 | 0.716 | 0.6374 | 0.6925 |
| lmo1933 | | GTP cyclohydrolase 1 | 0.577 | 0.6111 | 0.599 | 0.8012 | 0.7435 | 0.627 | 0.6111 |
| lmo2215 | | Similar to ABC transporter (ATP-binding protein) | 0.692 | 0.6473 | 0.6633 | 0.7988 | 0.72 | 0.6473 | 0.692 |
| lmo0821 | | Hypothetical protein | 0.6641 | 0.6641 | 0.7076 | 0.7979 | 0.7461 | 0.6641 | 0.657 |
| lmo1715 | | Methyltransferase | 0.674 | 0.6314 | 0.6612 | 0.7963 | 0.7371 | 0.6314 | 0.674 |
| lmo2515 | | NarL family, response regulator DegU | 0.6923 | 0.6482 | 0.6759 | 0.7952 | 0.7781 | 0.6482 | 0.6923 |
| lmo0625 | | Putative lipase/acylhydrolase | 0.6548 | 0.6242 | 0.6813 | 0.7945 | 0.743 | 0.6242 | 0.6548 |
| lmo0544 | | PTS sorbitol transporter subunit IIC | 0.7125 | 0.6483 | 0.7073 | 0.7928 | 0.7713 | 0.6483 | 0.7125 |
| lmo2728 | | Transcriptional regulator, MerR family protein | 0.62 | 0.6322 | 0.6294 | 0.7909 | 0.6994 | 0.6041 | 0.6322 |
| lmo2348 | | Amino acid ABC transporter permease | 0.6776 | 0.6673 | 0.681 | 0.7901 | 0.7512 | 0.6673 | 0.6776 |
| lmo2422 | | Two-component response regulator | 0.6988 | 0.6498 | 0.6574 | 0.7883 | 0.7307 | 0.6498 | 0.6988 |
| lmo0623 | | Hypothetical protein | 0.6382 | 0.6382 | 0.6382 | 0.7877 | 0.7026 | 0.6382 | 0.6307 |
| lmo0635 | | Hypothetical protein | 0.6715 | 0.6715 | 0.7008 | 0.7872 | 0.744 | 0.6715 | 0.656 |
| lmo2658 | | Hypothetical protein | 0.5621 | 0.5409 | 0.5969 | 0.7859 | 0.6298 | 0.5644 | 0.5621 |
| lmo0611 | | Azoreductase | 0.626 | 0.6511 | 0.7853 | 0.7737 | 0.626 | 0.626 | 0.6511 |
| lmo1425 | | Hypothetical protein | 0.7079 | 0.651 | 0.6755 | 0.7852 | 0.7607 | 0.651 | 0.7079 |
Note: The numbers represent the importance of each feature (gene), based on how accurately that feature alone predicts each source. These values are the area under the receiver operating characteristic curve (AUC-ROC) determined from source-specific sensitivities and specificities (Supplementary Tables S1 and S2).
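To illustrate how a per-source AUC-ROC importance of this kind might be obtained, here is a minimal sketch that scores a single numeric feature against each source in a one-vs-rest fashion. The exact procedure used in the study is not described in this excerpt, so the one-vs-rest scheme, the choice to report the better of the two feature orientations, the function names, and the toy data are all assumptions.

```python
def auc_one_vs_rest(scores, labels, positive):
    """AUC for separating `positive` from all other labels with one feature."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    # Assumption: the feature's direction is arbitrary, so report the
    # better orientation (values therefore never fall below 0.5).
    return max(auc, 1.0 - auc)


def importance_by_source(scores, labels):
    """Map each source label to the one-vs-rest AUC of the feature."""
    return {src: auc_one_vs_rest(scores, labels, src) for src in sorted(set(labels))}


# Toy feature values and source labels (hypothetical, for illustration only).
scores = [1, 2, 3, 4, 5, 6]
labels = ["meat", "meat", "meat", "dairy", "dairy", "fruits"]
print(importance_by_source(scores, labels))
```

Under these assumptions, a value near 0.5 means the feature carries little information about that source, while values approaching 1 indicate a feature that separates the source from all others.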