Mélina Côté1,2, Mazid Abiodoun Osseni3,4, Didier Brassard1,2, Élise Carbonneau1,2, Julie Robitaille1,2, Marie-Claude Vohl1,2, Simone Lemieux1,2, François Laviolette1,3,4, Benoît Lamarche1,2.
Abstract
Machine learning (ML) algorithms may help better understand the complex interactions among factors that influence dietary choices and behaviors. The aim of this study was to explore whether ML algorithms are more accurate than traditional statistical models in predicting vegetable and fruit (VF) consumption. A large array of features (2,452 features from 525 variables) encompassing individual and environmental information related to dietary habits and food choices in a sample of 1,147 French-speaking adult men and women was used for the purpose of this study. Adequate VF consumption, which was defined as 5 servings/d or more, was measured by averaging data from three web-based 24 h recalls and used as the outcome to predict. Nine classification ML algorithms were compared to two traditional statistical predictive models, logistic regression and penalized regression (Lasso). The performance of the predictive ML algorithms was tested after the implementation of adjustments, including normalizing the data, as well as in a series of sensitivity analyses such as using VF consumption obtained from a web-based food frequency questionnaire (wFFQ) and applying a feature selection algorithm in an attempt to reduce overfitting. Logistic regression and Lasso predicted adequate VF consumption with an accuracy of 0.64 (95% confidence interval [CI]: 0.58-0.70) and 0.64 (95%CI: 0.60-0.68) respectively. Among the ML algorithms tested, the most accurate algorithms to predict adequate VF consumption were the support vector machine (SVM) with either a radial basis kernel or a sigmoid kernel, both with an accuracy of 0.65 (95%CI: 0.59-0.71). The least accurate ML algorithm was the SVM with a linear kernel with an accuracy of 0.55 (95%CI: 0.49-0.61). Using dietary intake data from the wFFQ and applying a feature selection algorithm had little to no impact on the performance of the algorithms. 
In summary, ML algorithms and traditional statistical models predicted adequate VF consumption with similar accuracies among adults. These results suggest that additional research is needed to explore further the true potential of ML in predicting dietary behaviours that are determined by complex interactions among several individual, social and environmental factors.
Keywords: artificial intelligence; dietary behaviour; machine learning; nutrition; prediction; statistical models
Year: 2022 PMID: 35252288 PMCID: PMC8891134 DOI: 10.3389/fnut.2022.740898
Source DB: PubMed Journal: Front Nutr ISSN: 2296-861X
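The outcome definition in the abstract (adequate VF consumption is 5 servings/d or more, averaged over three web-based 24 h recalls) can be sketched as follows. The function name, the constant name and the participant data are hypothetical and serve only as an illustration; the study's actual data processing is not described here.

```python
from statistics import mean

# Threshold stated in the abstract: 5 servings/day or more.
ADEQUATE_VF_SERVINGS = 5.0


def is_adequate_vf(recall_servings):
    """Return True when the mean VF servings across the recalls is >= 5 per day."""
    if not recall_servings:
        raise ValueError("at least one 24 h recall is required")
    return mean(recall_servings) >= ADEQUATE_VF_SERVINGS


# Hypothetical participants, each with three 24 h recalls (servings/day).
participants = {
    "p1": [4.0, 6.5, 5.0],  # mean is about 5.17, so adequate
    "p2": [3.0, 4.5, 5.5],  # mean is about 4.33, so not adequate
}

# Binary outcome used as the prediction target.
labels = {pid: is_adequate_vf(servings) for pid, servings in participants.items()}
```

Averaging before thresholding means that a single high-intake day does not by itself classify a participant as having adequate consumption.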
Model and algorithm description.
| Model or algorithm (reference) | Machine learning algorithm? | Description |
|---|---|---|
| Logistic regression (28) | Not typically | Model that calculates the probability of belonging to one of two classes (if outcome is binary) by computing the logit function of the combination of weighted input features. The weights are estimated using maximum-likelihood estimation. |
| Lasso (Least absolute shrinkage and selection operator) (29) | Not typically | Model that uses feature selection and shrinkage to reduce the number of features for classification purposes. The coefficients of features that are useless to the model are shrunk to zero. |
| Decision tree (30) | Yes | Algorithm with a flowchart-like structure that makes predictions by learning decision rules. Each node represents an input feature, each branch represents a decision rule and each leaf represents a prediction. The top of the tree represents the best predictor and input features are compared until a leaf node is reached. |
| Random forest (31) | Yes | Algorithm that generates a large ensemble of decision trees with bootstrapped samples of the data. The predicted class is then determined by averaging the estimated outcome variable of each decision tree. |
| Set-covering machine (32) | Yes | Algorithm that learns a conjunction or disjunction of rules to find a decision function with the smallest number of rules. |
| Support vector machine (33) | Yes | Algorithm that attempts to separate the data into two classes with a hyperplane. The decision boundary can be based on a linear, polynomial, radial basis or sigmoid kernel function and is determined using only the points closest to the hyperplane. |
| K-nearest neighbor (34) | Yes | Algorithm that assumes that close data points are similar. The class in which a new data point belongs is determined according to the shared characteristics of a pre-determined number of closest points. |
| Adaboost (35) | Yes | Algorithm that fits a classifier (e.g., a decision tree) to the dataset and then adjusts the weights of the incorrectly classified data points, forcing the algorithm to focus on the data that are more difficult to classify. |
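As an illustration of one row of the table above, the k-nearest neighbor description (the class of a new point is decided by a pre-determined number of closest points) can be sketched in plain Python. This is a minimal sketch, not the implementation used in the study: the Euclidean distance, the majority vote, the function name `knn_predict` and the toy data are all assumptions made for the example.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest training points."""
    if k < 1 or k > len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    # Count the class labels of the k closest points and return the most frequent one.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already-normalized features (e.g., two questionnaire scores).
X = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.7, 0.7)]
y = [0, 0, 1, 1, 1]  # 1 = adequate VF consumption, 0 = not adequate

prediction = knn_predict(X, y, query=(0.75, 0.8), k=3)
```

Because the vote depends on distances, features measured on different scales would dominate one another unless the data are normalized first, which is one reason normalization matters for this type of algorithm.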
Predictive metrics and corresponding equations.
| Metric | Equation |
|---|---|
| Accuracy | (True positives + True negatives) / Total sample |
| Precision (positive predictive value) | True positives / (False positives + True positives) |
| Recall (sensitivity) | True positives / (True positives + False negatives) |
| F1 score | 2 × (Precision × Recall) / (Precision + Recall) |
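The four equations above can be computed directly from the cells of a confusion matrix. The sketch below assumes a binary outcome; the function name and the counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example, not something stated in the source.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("the confusion matrix is empty")
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a test set of 200 participants.
metrics = classification_metrics(tp=60, fp=30, tn=70, fn=40)
```

With these counts, accuracy is 130/200 = 0.65, precision is 60/90, and recall is 60/100, so the F1 score falls between precision and recall, closer to the smaller of the two.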
Sociodemographic characteristics of the French-speaking adults from Quebec, Canada (N = 1,147).
| Characteristic | n (%) |
|---|---|
| Age group, y | |
| 18–34 | 432 (37.7) |
| 35–49 | 330 (28.8) |
| 50–65 | 385 (33.5) |
| Sex | |
| Female | 572 (50) |
| Male | 575 (50) |
| Education | |
| High school or less | 270 (23.5) |
| CEGEP | 332 (29.0) |
| University | 485 (42.3) |
| Missing information | 60 (5.2) |
| Household income, CAD $ | |
| <30,000 | 163 (14.2) |
| 30,000 to <60,000 | 281 (24.5) |
| 60,000 to <90,000 | 196 (17.1) |
| ≥90,000 | 348 (30.3) |
| Missing information | 159 (13.9) |
| Ethnicity | |
| Caucasian | 997 (86.9) |
| Arabic | 25 (2.2) |
| Hispanic | 19 (1.7) |
| Other | 32 (2.8) |
| Missing information | 74 (6.4) |
| Administrative region | |
| Capitale-Nationale/Chaudière-Appalaches | 416 (36.3) |
| Estrie | 121 (10.5) |
| Mauricie | 98 (8.5) |
| Montreal | 410 (35.8) |
| Saguenay-Lac-St-Jean | 102 (8.9) |
CEGEP is a preuniversity and technical college institution specific to the Quebec educational system.
Performance metrics of two traditional statistical models and nine machine learning algorithms to predict adequate vegetable and fruit (VF) consumption based on dietary intake data obtained from web-based 24 h recalls (R24W) among 1,147 French-speaking adults from Québec, Canada.
| Metric | LR | Lasso | DT | RF | SCM | SVM (linear) | SVM (polynomial) | SVM (radial basis) | SVM (sigmoid) | KNN | Adaboost |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy (training set) | 0.75 (0.73–0.77) | 0.76 (0.74–0.78) | 0.66 (0.58–0.74) | 0.94 (0.82–1.06) | 0.64 (0.60–0.68) | 1.00 (1.00–1.00) | 0.80 (0.64–0.96) | 0.86 (0.72–1.00) | 0.75 (0.73–0.77) | 0.73 (0.48–0.98) | 0.80 (0.68–0.92) |
| Accuracy (test set) | 0.64 (0.58–0.70) | 0.64 (0.60–0.68) | 0.62 (0.58–0.74) | 0.64 (0.56–0.72) | 0.62 (0.54–0.70) | 0.55 (0.49–0.61) | 0.64 (0.58–0.70) | 0.65 (0.59–0.71) | 0.65 (0.59–0.71) | 0.58 (0.50–0.66) | 0.60 (0.56–0.64) |
| Not labelled in source | 0.64 (0.58–0.70) | 0.68 (0.62–0.74) | 0.62 (0.54–0.70) | 0.63 (0.57–0.69) | 0.62 (0.56–0.68) | 0.55 (0.49–0.61) | 0.64 (0.58–0.70) | 0.65 (0.59–0.71) | 0.64 (0.58–0.70) | 0.57 (0.51–0.63) | 0.60 (0.56–0.64) |
| Positive predictive value | 0.65 (0.57–0.73) | 0.65 (0.57–0.73) | 0.63 (0.53–0.73) | 0.63 (0.53–0.73) | 0.63 (0.53–0.73) | 0.57 (0.49–0.65) | 0.64 (0.56–0.72) | 0.65 (0.57–0.73) | 0.64 (0.56–0.72) | 0.58 (0.50–0.66) | 0.61 (0.55–0.67) |
| Sensitivity | 0.68 (0.58–0.78) | 0.68 (0.60–0.76) | 0.66 (0.52–0.80) | 0.73 (0.65–0.81) | 0.67 (0.53–0.81) | 0.58 (0.50–0.66) | 0.72 (0.62–0.82) | 0.72 (0.64–0.80) | 0.74 (0.68–0.80) | 0.69 (0.57–0.81) | 0.61 (0.51–0.71) |
| F1 score | 0.66 (0.60–0.72) | 0.66 (0.60–0.72) | 0.64 (0.56–0.72) | 0.68 (0.62–0.74) | 0.65 (0.57–0.73) | 0.58 (0.52–0.64) | 0.67 (0.59–0.75) | 0.68 (0.62–0.74) | 0.69 (0.63–0.75) | 0.63 (0.55–0.71) | 0.61 (0.55–0.67) |
Fifteen bootstrap resamples were used to estimate the 95% CIs.
LR, logistic regression; DT, decision tree; RF, random forest; SCM, set-covering machine; SVM, support vector machine; KNN, k-nearest neighbors.
Positive predictive value, also referred to as precision.
Sensitivity, also referred to as recall.
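The bootstrap footnote above can be illustrated with a short sketch. The source does not state how the interval was derived from the resamples or what exactly was resampled; the symmetric intervals in the table (some extending above 1.00) are consistent with a normal approximation, so this example assumes mean ± 1.96 × standard deviation over resampled test-set predictions. The function name, the seed and the toy data are hypothetical.

```python
import random
from statistics import mean, stdev


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=15, seed=0):
    """Mean accuracy and a normal-approximation 95% CI over bootstrap resamples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    if n_resamples < 2:
        raise ValueError("at least two resamples are needed to estimate a spread")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(y_true)
    accuracies = []
    for _ in range(n_resamples):
        # Draw n indices with replacement and score the resampled predictions.
        idx = [rng.randrange(n) for _ in range(n)]
        accuracies.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    centre = mean(accuracies)
    half_width = 1.96 * stdev(accuracies)  # assumption: normal approximation
    return centre, centre - half_width, centre + half_width


# Hypothetical test set of 100 participants with 65 correct predictions.
y_true = [1, 0] * 50
y_pred = y_true[:65] + [1 - v for v in y_true[65:]]

estimate, lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
```

With only 15 resamples the standard deviation itself is estimated imprecisely, so intervals obtained this way can vary noticeably from one run to the next; a percentile interval over many more resamples would be a common alternative.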
Figure 1. Discriminant features retained in the logistic regression (LR) and Lasso models and in the decision tree (DT), random forest (RF), set-covering machine (SCM), support vector machine (SVM) with a linear kernel and Adaboost machine learning algorithms to predict adequate vegetable and fruit consumption. Features are colour-coded according to the questionnaire to which they belong; different shades within a given colour indicate that more than one feature of a questionnaire was retained; numbers indicate the rank of a given question from a given questionnaire retained in the model or algorithm. REBS, Regulation of Eating Behaviour Scale; SDL, Socioeconomic and demographic factors, eating and lifestyle habits; SSHEQ, Social support for healthy eating questionnaire; BIDR, Balanced inventory of desirable responding; FLQ, Food liking questionnaire; MED, Medical questionnaire; NKQ, Nutrition knowledge questionnaire; IES, Intuitive eating scale; SPSRQ, Sensitivity to punishment and sensitivity to reward questionnaire.
Performance metrics of two traditional models and nine machine learning algorithms to predict adequate vegetable and fruit (VF) consumption based on dietary intake data obtained from a web-based food frequency questionnaire (wFFQ) among 1,147 French-speaking adults from Québec, Canada.
| Metric | LR | Lasso | DT | RF | SCM | SVM (linear) | SVM (polynomial) | SVM (radial basis) | SVM (sigmoid) | KNN | Adaboost |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy (training set) | 0.76 (0.74–0.78) | 0.78 (0.76–0.80) | 0.69 | 0.91 (0.85–0.97) | 0.69 (0.67–0.71) | 1.00 (1.00–1.00) | 0.96 (0.78–1.00) | 0.90 (0.82–0.98) | 0.76 (0.52–1.00) | 0.92 (0.67–1.00) | 0.82 (0.72–0.92) |
| Accuracy (test set) | 0.70 (0.62–0.78) | 0.70 (0.62–0.78) | 0.67 | 0.68 (0.62–0.74) | 0.66 (0.60–0.72) | 0.63 (0.59–0.67) | 0.67 (0.61–0.73) | 0.69 (0.65–0.73) | 0.66 (0.60–0.72) | 0.67 (0.61–0.73) | 0.67 (0.61–0.73) |
| Not labelled in source | 0.59 (0.53–0.65) | 0.72 (0.64–0.80) | 0.52 (0.44–0.60) | 0.52 (0.50–0.54) | 0.51 (0.47–0.55) | 0.58 (0.54–0.62) | 0.57 (0.49–0.65) | 0.57 (0.51–0.63) | 0.52 (0.44–0.60) | 0.55 (0.49–0.61) | 0.61 (0.55–0.67) |
| Positive predictive value | 0.72 (0.64–0.80) | 0.72 (0.64–0.80) | 0.68 (0.60–0.76) | 0.68 (0.62–0.74) | 0.68 (0.60–0.76) | 0.72 (0.66–0.78) | 0.71 (0.65–0.77) | 0.71 (0.63–0.79) | 0.69 (0.61–0.77) | 0.70 (0.64–0.76) | 0.74 (0.68–0.80) |
| Sensitivity | 0.91 (0.87–0.95) | 0.90 (0.84–0.96) | 0.95 (0.75–1.00) | 0.99 (0.97–1.00) | 0.96 (0.84–1.00) | 0.74 (0.66–0.82) | 0.86 (0.74–0.98) | 0.90 (0.80–1.00) | 0.93 (0.71–1.00) | 0.90 (0.82–0.98) | 0.80 (0.74–0.86) |
| F1 score | 0.80 (0.74–0.86) | 0.80 (0.74–0.86) | 0.79 (0.73–0.85) | 0.81 (0.77–0.85) | 0.79 (0.75–0.83) | 0.73 (0.69–0.77) | 0.78 (0.72–0.84) | 0.79 (0.75–0.83) | 0.78 (0.72–0.84) | 0.78 (0.74–0.82) | 0.77 (0.73–0.81) |
Fifteen bootstrap resamples were used to estimate the 95% CIs.
LR, logistic regression; DT, decision tree; RF, random forest; SCM, set-covering machine; SVM, support vector machine; KNN, k-nearest neighbors.
Positive predictive value, also referred to as precision.
Sensitivity, also referred to as recall.
Figure 2. Comparing the accuracy of traditional statistical models and machine learning algorithms to predict adequate vegetable and fruit (VF) consumption when other dietary intake features are included in addition to the 2,452 features originally included. These are servings of grain products, milk and alternatives, meat and alternatives, as well as components of the Canadian Healthy Eating Index (C-HEI) other than the VF component and the C-HEI score itself. LR, logistic regression; DT, decision tree; RF, random forest; SCM, set-covering machine; SVM, support vector machine; KNN, k-nearest neighbor.