Rachel A. Oldroyd, Michelle A. Morris, Mark Birkin.
Abstract
Consumer food environments have transformed dramatically in the last decade. Food outlet prevalence has increased, and people are eating food outside the home more than ever before. Despite these developments, national spending on food control has reduced. The National Audit Office reports that only 14% of local authorities are up to date with food business inspections, exposing consumers to unknown levels of risk. Given the scarcity of local authority resources, this paper presents a data-driven approach to predict compliance for newly opened businesses and those awaiting repeat inspections. This work capitalizes on the theory that food outlet compliance is a function of its geographic context, namely the characteristics of the neighborhood within which it sits. We explore the utility of three machine learning approaches to predict non-compliant food outlets in England and Wales using openly accessible socio-demographic, business type, and urbanness features at the output area level. We find that the synthetic minority oversampling technique alongside a random forest algorithm with a 1:1 sampling strategy provides the best predictive power. Our final model identifies 84% of non-compliant outlets in a test set of 92,595 (sensitivity = 0.843, specificity = 0.745, precision = 0.274). The originality of this work lies in its methodological approach, which combines machine learning with fine-grained neighborhood data to make robust predictions of compliance.
Keywords: food environments; food hygiene; food safety; machine learning
Year: 2021 PMID: 34886362 PMCID: PMC8656817 DOI: 10.3390/ijerph182312635
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 3.390
Figure 1. An overview of the analysis process. The full FHRS dataset is split into training and testing sets prior to SMOTE † and under-sampling. † Synthetic minority oversampling technique. * Cross-validation.
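Figure 1 applies resampling only after the train/test split, so synthetic observations never leak into the test set. The core SMOTE step interpolates between a minority-class sample and one of its k nearest minority-class neighbours. A minimal pure-Python sketch of that interpolation follows (the paper's actual implementation is not given; the function name and defaults are illustrative):

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation
    between a randomly chosen sample and one of its k nearest
    minority-class neighbours (classic SMOTE step)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x, excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

# Toy minority class (e.g. non-compliant outlets) in a 2-feature space
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=4)
```

In the paper, the training set is resampled in this fashion to five non-compliant:compliant ratios (1:1 through 1:2) before each classifier is fitted.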
Data sources and variables.
| Data Domain and Source | Geography | Variable | Categories/Levels |
|---|---|---|---|
| Food Hygiene Rating Scheme Scores | Reported for individual food outlets | FHRS score (ordinal) | 0 (Improvement necessary), 1, 2, 3, 4, 5 (Very good) |
| | | Business Type (categorical) | Restaurants, cafés, & canteens; other retailers; super- & hyper-markets; other catering; pubs, bars, & nightclubs; takeaways & sandwich shops; hotels, guesthouses, bed & breakfasts |
| | | Region (categorical) | East Midlands, West Midlands, East of England, London, North East, North West, South East, South West, Wales, Yorkshire |
| Socio-demographic 2011 census data | Reported at OA level | Age (% of persons) | 0–4; 5–14; 15–19; 20–24; 25–44; 45–64; 65+ |
| | | Ethnicity (% of persons) | Asian, Black, Mixed, Other, White |
| | | Unemployment (% of persons) | - |
| | | Overcrowding (% of households) | - |
| | | No car access (% of households) | - |
| | | Renting (% of households) | - |
| Rural Urban Classification | Reported at OA level | RUC (categorical) | Urban cities and towns; rural hamlets and isolated dwellings; rural town and fringe; rural village; and urban conurbation |
| Output Area Classification | Reported at OA level | OAC Supergroups (categorical) | (1) Rural residents; (2) cosmopolitans; (3) ethnicity central; (4) multicultural metropolitans; (5) urbanites; (6) suburbanites; (7) constrained city dwellers; (8) hard-pressed living |
(FHRS = Food Hygiene Rating Scheme Score, OA = Output Area, OAC = Output Area Classification, RUC = Rural Urban Classification).
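Several predictors above (business type, region, RUC, OAC supergroup) are categorical, and both the SVM and random forest models need them in numeric form. The paper does not state the exact encoding used; a standard one-hot scheme, sketched here with an illustrative function name and level ordering, would look like this:

```python
def one_hot(value, levels):
    """Map a categorical value to a binary indicator vector over its levels."""
    return [1 if value == level else 0 for level in levels]

# The ten regions from the data-sources table
regions = ["East Midlands", "West Midlands", "East of England", "London",
           "North East", "North West", "South East", "South West",
           "Wales", "Yorkshire"]

vec = one_hot("London", regions)  # exactly one indicator set to 1
```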
Descriptive statistics for numerical predictor variables. All variables are reported at output area level (SD = Standard Deviation).
| Variable | Level | Mean | SD | Min. | Max. |
|---|---|---|---|---|---|
| Ethnicity (%) | White | 84.06 | 19.64 | 0.00 | 100.00 |
| | Mixed | 2.42 | 2.31 | 0.00 | 26.61 |
| | Asian | 8.73 | 13.69 | 0.00 | 99.76 |
| | Black | 3.28 | 6.41 | 0.00 | 78.04 |
| | Other | 1.40 | 2.69 | 0.00 | 48.90 |
| Age (%) | 0–4 | 5.62 | 2.86 | 0.00 | 29.30 |
| | 5–14 | 9.10 | 4.49 | 0.00 | 52.20 |
| | 15–19 | 5.86 | 4.62 | 0.00 | 84.62 |
| | 20–24 | 9.13 | 8.35 | 0.00 | 85.12 |
| | 25–44 | 30.72 | 11.59 | 0.00 | 88.33 |
| | 45–64 | 23.74 | 7.96 | 0.00 | 69.19 |
| | ≥65 | 15.83 | 10.35 | 0.00 | 96.75 |
| Unemployment (% of individuals) | - | 7.35 | 5.53 | 0.00 | 55.68 |
| Overcrowding (% of households) | - | 2.57 | 3.76 | 0.00 | 38.00 |
| No Car Access (% of households) | - | 34.45 | 21.48 | 0.00 | 96.71 |
| Renting (% of households) | - | 47.53 | 24.77 | 0.00 | 100.00 |
Final model tuning parameters. For the linear and radial SVMs, the cost parameter is the optimal penalty for misclassifications. For the random forest models, mtry, the optimal number of predictor variables randomly selected at each split, is reported (SVM = Support Vector Machine, SMOTE = Synthetic Minority Oversampling Technique).
| Sampling Set/Ratio (Non-Comp:Comp) | Linear SVM Cost (SMOTE) | Linear SVM Cost (Under-Sampled) | Radial SVM Cost (SMOTE) | Radial SVM Cost (Under-Sampled) | Random Forest mtry (SMOTE) | Random Forest mtry (Under-Sampled) |
|---|---|---|---|---|---|---|
| Set 1 (1:1) | 1.895 | 0.842 | 32.00 | 32.00 | 5 | 3 |
| Set 2 (2:1) | 0.632 | 0.947 | 64.00 | 0.250 | 5 | 3 |
| Set 3 (3:2) | 0.316 | 0.105 | 64.00 | 2.000 | 6 | 3 |
| Set 4 (2:3) | 2.000 | 2.000 | 16.00 | 16.00 | 5 | 3 |
| Set 5 (1:2) | 1.368 | 2.000 | 16.00 | 0.250 | 5 | 3 |
Figure 2. ROC curves for the five top-performing algorithms: random forest models trained on synthetic minority oversampling technique data at five sampling ratios. The diagonal line represents a random classifier (AUC (Area Under Curve) = 0.5).
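The weighted metrics reported below apply a cost penalty of 30 when extracting the optimal probability threshold, which pulls the threshold down and trades precision for sensitivity. One plausible reading of that procedure, sketched here as an assumption rather than the authors' exact method, is to scan candidate thresholds and minimise 30 × (false negatives) + (false positives):

```python
def best_threshold(scores, labels, fn_cost=30.0, fp_cost=1.0):
    """Scan candidate thresholds and return the one minimising
    fn_cost * false_negatives + fp_cost * false_positives.
    labels: 1 = non-compliant (positive class), 0 = compliant."""
    best_t, best_cost = None, float("inf")
    for t in sorted(set(scores)):
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        cost = fn_cost * fn + fp_cost * fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

# Toy scores: the heavy penalty on missed positives keeps the
# threshold low enough to capture the weakly scored positive (0.4)
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   1,   0,   0,   0]
t = best_threshold(scores, labels)  # -> 0.4
```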
Weighted and unweighted performance metrics for random forest models using SMOTE datasets across the five sampling strategies. Weighted observations have a cost penalty (30) applied when extracting the optimal probability threshold. Precision is the proportion of outlets predicted non-compliant that are truly non-compliant.
| Metric | RF Set 1 (Unweighted) | RF Set 1 (Weighted) | RF Set 2 (Unweighted) | RF Set 2 (Weighted) | RF Set 3 (Unweighted) | RF Set 3 (Weighted) | RF Set 4 (Unweighted) | RF Set 4 (Weighted) | RF Set 5 (Unweighted) | RF Set 5 (Weighted) | RF Unsampled (Unweighted) | RF Unsampled (Weighted) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Probability Threshold | 0.603 | 0.481 | 0.729 | 0.645 | 0.657 | 0.515 | 0.595 | 0.459 | 0.473 | 0.367 | 0.067 | 0.021 |
| Area Under Curve | 0.870 | 0.870 | 0.859 | 0.859 | 0.864 | 0.864 | 0.868 | 0.868 | 0.873 | 0.873 | 0.796 | 0.796 |
| Sensitivity | 0.759 | 0.843 | 0.773 | 0.833 | 0.745 | 0.838 | 0.760 | 0.853 | 0.784 | 0.849 | 0.661 | 0.859 |
| Specificity | 0.858 | 0.745 | 0.836 | 0.741 | 0.859 | 0.737 | 0.850 | 0.724 | 0.836 | 0.737 | 0.797 | 0.481 |
| True Positives | 4624 | 5139 | 4712 | 5076 | 4540 | 5107 | 4630 | 5201 | 4781 | 5175 | 4029 | 5903 |
| False Positives | 12,264 | 21,676 | 14,572 | 22,383 | 12,180 | 22,752 | 12,976 | 23,872 | 14,210 | 22,773 | 17,571 | 77,591 |
| True Negatives | 74,235 | 64,823 | 71,924 | 64,116 | 74,319 | 63,747 | 73,523 | 62,627 | 72,289 | 63,726 | 68,928 | 8908 |
| False Negatives | 1472 | 957 | 1384 | 1020 | 1556 | 989 | 1466 | 895 | 1315 | 921 | 2067 | 193 |
| Kappa | 0.338 | 0.230 | 0.301 | 0.218 | 0.334 | 0.216 | 0.325 | 0.210 | 0.313 | 0.220 | 0.210 | 0.010 |
| Precision | 0.274 | 0.192 | 0.244 | 0.185 | 0.272 | 0.183 | 0.263 | 0.179 | 0.252 | 0.185 | 0.187 | 0.071 |
(SMOTE = Synthetic Minority Oversampling Technique, RF = Random Forest).
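Each derived metric in the table follows directly from the confusion-matrix counts. As a check, the unweighted Set 1 figures (TP = 4624, FP = 12,264, TN = 74,235, FN = 1472) reproduce the reported sensitivity, specificity, precision, and kappa:

```python
def metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, precision, and Cohen's kappa
    from binary confusion-matrix counts (positive = non-compliant)."""
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)   # share of non-compliant outlets found
    specificity = tn / (tn + fp)   # share of compliant outlets cleared
    precision = tp / (tp + fp)     # share of flagged outlets truly non-compliant
    # Cohen's kappa: observed agreement corrected for chance agreement
    po = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (po - pe) / (1 - pe)
    return sensitivity, specificity, precision, kappa

# Unweighted random forest, Set 1 (1:1 SMOTE), from the table above
sens, spec, prec, kappa = metrics(tp=4624, fp=12_264, tn=74_235, fn=1472)
```

These return approximately 0.759, 0.858, 0.274, and 0.338, matching the table, and the four counts sum to the 92,595 test-set outlets quoted in the abstract.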
Figure 3. Variable importance scores for SMOTE random forest Set 1, where red variables have higher predictive strength and blue variables have lower predictive strength.
Figure 4. Boxplots for numeric predictor variables reported across the two compliance classes.