Ayan Chatterjee1, Martin W Gerdes1, Santiago G Martinez2.
Abstract
Social determinants such as the adverse influence of globalization, supermarket growth, fast unplanned urbanization, sedentary lifestyle, economy, and social position slowly give rise to behavioral risk factors in humans. Behavioral risk factors such as unhealthy habits, improper diet, and physical inactivity lead to physiological risks, and "obesity/overweight" is one of the consequences. "Obesity and overweight" is one of the major lifestyle diseases and leads to other health conditions, such as cardiovascular diseases (CVDs), chronic obstructive pulmonary disease (COPD), cancer, diabetes type II, hypertension, and depression. It is not restricted to any age group or socio-economic background. The World Health Organization (WHO) has anticipated that 30% of global deaths will be caused by lifestyle diseases by 2030, and that these deaths can be prevented with the appropriate identification of associated risk factors and behavioral intervention plans. Health behavior change should be given priority to avoid life-threatening damage. The primary purpose of this study is not to present a risk prediction model but to review various machine learning (ML) methods and their execution on sample health data, available in public repositories, related to lifestyle diseases such as obesity, CVDs, and diabetes type II. In this study, we targeted people, both male and female, aged over 20 and under 60, excluding pregnancy and genetic factors. This paper qualifies as a tutorial article on how to use different ML methods to identify potential risk factors of obesity/overweight.
Although institutions such as the "Centers for Disease Control and Prevention (CDC)" and the "National Institute for Clinical Excellence (NICE)" work, through their guidelines, to understand the causes and consequences of overweight/obesity, we aimed to use data science to assess the correlated risk factors of obesity/overweight by analyzing existing datasets available in "Kaggle" and the "University of California, Irvine (UCI) database", and to examine, with data-visualization techniques and regression analysis, how the potential risk factors change with body-energy imbalance. Analyzing existing obesity/overweight-related data with machine learning algorithms did not produce any new risk factors, but it helped us to understand: (a) how the identified risk factors are related to weight change and how to visualize that relationship; (b) what kind of data (potential monitorable risk factors) should be collected over time to develop our intended eCoach system for the promotion of a healthy lifestyle, with "obesity and overweight" as a future study case; (c) why we used the existing "Kaggle" and "UCI" datasets for our preliminary study; and (d) which classification and regression models perform better on the correspondingly limited volume of data according to the performance metrics.
Keywords: BMI; Prisma; Sklearn; calibration; classification; data visualization; deep learning; discrimination; eCoach; gradient descent; hypothesis test; lifestyle diseases; machine learning; model performance; monitoring; normal distribution; obesity; overweight; python; regression; sensor data
Year: 2020 PMID: 32403349 PMCID: PMC7248873 DOI: 10.3390/s20092734
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Epidemiological study design [16].
| Study Design | Type of Information Collected | Usage of the Information |
|---|---|---|
| Meta-analysis and systematic reviews | Summary of the evidence of predominance of obesity/overweight worldwide; summary of the evidence of physiological risks associated with obesity/overweight; summary of the evidence of risk factors associated with obesity/overweight; summary of the evidence of effectiveness of obesity/overweight prevention plans | Strategy and guideline planning |
| Qualitative and quantitative studies | Burden of obesity/overweight in society; correlation of risk factors with body energy imbalance; distribution of obesity prevalence among different age groups and socio-economic groups; identification of key risk factors, high-risk groups of people, and related datasets; identification of the artificial intelligence (AI) models used, with their accuracy for classification and regression | Policy, algorithm selection, data selection, controlled trial selection, feasibility study, goal setting, planning, resource allocation, priority setting, impact analysis, and evaluation |
Figure 1. PRISMA flowchart for the article selection process [16].
AI models and the risk factors related to obesity/overweight.
| Researcher | Model Use | Risk Factors |
|---|---|---|
| DeGregory et al. | Linear and logistic regression, artificial neural networks, deep learning, decision tree analysis, cluster analysis, principal component analysis (PCA), network science, and topological data analysis | Inactivity, improper diet |
| Singh et al. | Multivariate regression methods and multilayer perceptron (MLP) feed-forward neural network models | BMI |
| Bassam et al. | Logistic regression, k-nearest neighbor (KNN), support vector machine (SVM) | Age, sex, body mass index (BMI), pre-existing hypertension, family history of hypertension, and diabetes (type II) |
| Meghana et al. | Automatic machine learning (AutoML) | cardiovascular diseases (CVDs) |
| Seyla et al. | SVM | Activity, nutrition |
| Jindal et al. | Random Forest | Age, height, weight, BMI |
| Zheng et al. | improved decision tree (IDT), KNN, artificial neural network (ANN) | Inactivity, improper diet |
| Dunstan et al. | SVM, Random Forest (RF), Extreme Gradient Boosting (XGB) | Unhealthy diet |
| Golino et al. | Classification tree, logistic regression | Blood Pressure (BP), BMI, Waist Circumference (WC), Hip Circumference (HC), Waist–Hip Ratio (WHR) |
| Pleuss et al. | Machine learning (ML) and 3D image processing | BMI, WC, HC |
| Maharana et al. | convolutional neural network (CNN) | Environment, context |
| Pouladzadeh et al. | CNN | Nutrition |
Figure 2. The focused epidemiological study triangle [33].
Selected datasets for the statistical analysis and machine learning.
| Repository | Name | Source | Category |
|---|---|---|---|
| Kaggle | 500_Person_Gender_Height_Weight_Index | | Obesity |
| Kaggle | Insurance | | Obesity |
| Kaggle | Eating-health-module-dataset | US Bureau of Labor Statistics | Obesity |
| Kaggle/UCI | Pima-Indians-diabetes-database | UCI Machine Learning | Diabetes |
| Kaggle | Cardiovascular-disease-dataset | Ryerson University | CVDs |
Short description of the selected datasets.
| Dataset | Sample Size | Key Features |
|---|---|---|
| Person_Gender_Height_Weight_Index | 500 | Gender, height, weight |
| Insurance | 1338 | Age, sex, BMI, smoking, charge, location |
| Eating-health-module-dataset | 11212 | Sweet beverages, economic condition, fast food, sleeping, meat and milk consumption, drinking habit, exercise |
| Pima-Indians-diabetes-database | 768 | Blood glucose, blood pressure, insulin intake, and age |
| Cardiovascular-disease-dataset | 462 | Blood pressure, tobacco consumption, lipid profile, adiposity, family history, obesity, drinking habit, and age |
Python libraries for data processing [38].
| No. | Libraries | Purpose |
|---|---|---|
| 1 | Pandas | Data importing, structuring, and analysis |
| 2 | NumPy | Computing with multidimensional array object |
| 3 | Matplotlib | Python 2-D plotting |
| 4 | SciPy | Statistical analysis |
| 5 | Seaborn, plotly | Plotting of high-level statistical graphs |
| 6 | Scikit-learn (Sklearn) | Machine learning, preprocessing, cross-validation, and evaluating the model’s performance |
| 7 | Graphviz | Plotting of decision trees |
Statistical analysis methods on the selected datasets [32,39].
| No. | Methods | Purpose |
|---|---|---|
| 1 | Mean, standard deviation, skewness | Distribution test |
| 2 | t-test, z-test, F-test, Chi-square | Hypothesis test |
| 3 | Shapiro–Wilk, D’Agostino’s K^2, and Anderson–Darling test | Normality test |
| 4 | Covariance, correlation | Association test |
| 5 | Histogram, Swarm, Violin, Bee Swarm, Joint, Box, Scatter | Distribution plot |
| 6 | Quantile analysis | Outlier detection |
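
The distribution, normality, and outlier checks listed above can be sketched with NumPy and SciPy as follows. This is a minimal illustration on synthetic data, not the authors' code: the helper names `describe_distribution` and `iqr_outliers`, the 0.05 significance level, and the 1.5 × IQR rule for quantile-based outlier detection are all assumptions. The Anderson–Darling test is omitted to keep the sketch short.

```python
import numpy as np
from scipy import stats


def describe_distribution(values, alpha=0.05):
    """Summary statistics plus two normality tests (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    _, shapiro_p = stats.shapiro(x)    # Shapiro-Wilk test
    _, k2_p = stats.normaltest(x)      # D'Agostino's K^2 test
    return {
        "mean": float(np.mean(x)),
        "std": float(np.std(x, ddof=1)),
        "skewness": float(stats.skew(x)),
        "shapiro_p": float(shapiro_p),
        "dagostino_p": float(k2_p),
        # Normality is not rejected only if both p-values exceed alpha.
        "looks_normal": bool(shapiro_p > alpha and k2_p > alpha),
    }


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mask = (x < q1 - k * iqr) | (x > q3 + k * iqr)
    return x[mask]


# Synthetic data: body weights with one implausible entry, and a skewed sample.
rng = np.random.default_rng(0)
weights = np.append(rng.normal(70, 10, 200), 500.0)
skewed = rng.exponential(scale=2.0, size=200)

print(describe_distribution(skewed))
print(iqr_outliers(weights))
```
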
Hypothesis testing methods.
| Method | Description | Samples |
|---|---|---|
| t-test | Test whether the mean of a normally distributed variable differs from a specified value (µ0) | Sample size < 30 |
| z-test | Test whether the means of two samples are equal | Sample size > 30 |
| ANOVA or F-test | Test whether the means of multiple groups are equal, all in one test | More than 2 samples |
| Chi-Square Test | Check if observed patterns (O) of data fit some given distribution (E) or not. | Two categorical variables from a sample |
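
As a hedged illustration of the four tests in the table, the sketch below runs each of them with SciPy on made-up numbers. The BMI-like values, the group means, and the 2 × 2 contingency counts are hypothetical. SciPy has no built-in two-sample z-test, so `two_sample_z_test` is a small hand-written helper assumed here for illustration.

```python
import numpy as np
from scipy import stats


def two_sample_z_test(a, b):
    """Two-sided z-test for equal means of two large samples (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    standard_error = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    z = (a.mean() - b.mean()) / standard_error
    p = 2.0 * stats.norm.sf(abs(z))
    return float(z), float(p)


rng = np.random.default_rng(42)

# One-sample t-test: small sample (n < 30) against a specified mean of 25.0.
small_sample = np.array([24.0, 26.5, 25.1, 27.3, 23.8, 26.0, 25.5, 24.9])
t_stat, t_p = stats.ttest_1samp(small_sample, popmean=25.0)

# z-test: two large samples (n > 30) with clearly different means.
group_a = rng.normal(25.0, 3.0, 100)
group_b = rng.normal(31.0, 3.0, 100)
z_stat, z_p = two_sample_z_test(group_a, group_b)

# One-way ANOVA (F-test): more than two groups tested at once.
group_c = rng.normal(28.0, 3.0, 100)
f_stat, f_p = stats.f_oneway(group_a, group_b, group_c)

# Chi-square test of independence on two categorical variables.
# Rows and columns are hypothetical categories with made-up counts.
table = np.array([[90, 10], [40, 60]])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)

print(f"t-test p={t_p:.3f}, z-test p={z_p:.3g}, ANOVA p={f_p:.3g}, chi-square p={chi_p:.3g}")
```

A p-value below the chosen significance level (commonly 0.05) leads to rejecting the null hypothesis of equal means or of independence.
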
Interpretation of the absolute value of the correlation coefficient (r).
| \|r\| Value | Meaning |
|---|---|
| 0.0–0.2 | Very weak |
| 0.2–0.4 | Weak to moderate |
| 0.4–0.6 | Medium to substantial |
| 0.6–0.8 | Very strong |
| 0.8–1.0 | Extremely strong |
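
The bands above can be applied programmatically. The sketch below computes a Pearson correlation with NumPy and maps its absolute value to the labels in the table. The function name `correlation_strength` and the height and weight values are hypothetical. Because the table's bands share their endpoints, the sketch assumes that each boundary value belongs to the higher band.

```python
import numpy as np

# Upper bounds (exclusive) of each band, taken from the table above.
BANDS = [
    (0.2, "Very weak"),
    (0.4, "Weak to moderate"),
    (0.6, "Medium to substantial"),
    (0.8, "Very strong"),
]


def correlation_strength(r):
    """Map a correlation coefficient to its qualitative label (hypothetical helper)."""
    magnitude = abs(float(r))
    for upper, label in BANDS:
        if magnitude < upper:
            return label
    return "Extremely strong"


# Made-up height (cm) and weight (kg) values for illustration only.
height = np.array([150, 160, 170, 180, 190], dtype=float)
weight = np.array([52, 58, 69, 77, 90], dtype=float)

r = np.corrcoef(height, weight)[0, 1]  # Pearson correlation coefficient
label = correlation_strength(r)
print(f"r = {r:.3f}: {label}")
```
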
Machine learning models and their optimization methods [31,32,39].
| Type | Name | Optimization Method |
|---|---|---|
| Classification | SVM (kernel = linear or rbf) | Gradient descent |
| Classification | Naïve Bayes | Gradient descent |
| Classification | Decision Tree (entropy or gini) | Information Gain, Gini |
| Classification | Logistic | Gradient descent |
| Classification | KNN | Gradient descent |
| Classification | Random Forest (RF) | Ensemble |
| Calibration Classification | Calibrated Classifier (CV) | Probability (sigmoid, isotonic) |
| Regression | Linear Regression | Gradient descent |
| Regression | KNeighbors Regressor | Gradient descent |
| Regression | Support Vector Regressor | Gradient descent |
| Regression | Decision Tree Regressor | Gain, Gini |
| Regression | Random Forest Regressor | Ensemble |
| Regression | Bayesian Regressor | Gradient descent |
| Regularization | Lasso (L1), Ridge (L2) | Gradient descent |
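
To show how the classifiers in the table might be compared, the following sketch evaluates six of them with 5-fold cross-validation in scikit-learn. It runs on a synthetic dataset from `make_classification` rather than on the study's datasets, and the hyperparameters, the scaling step, and the accuracy metric are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a small tabular health dataset (assumption).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

# Scale-sensitive models are wrapped in a pipeline with standardization.
models = {
    "SVM (rbf)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Naive Bayes": GaussianNB(),
    "Decision Tree (gini)": DecisionTreeClassifier(criterion="gini", random_state=0),
    "Logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 5 folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:22s} {score:.3f}")
```

Placing the scaler inside the pipeline keeps the standardization parameters from being fitted on the held-out fold, which would otherwise leak information into the evaluation.
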
Machine learning model store [27].
| Method | Implementation |
|---|---|
| Pickle string | Import pickle library |
| Pickled model | Import the joblib library (exposed as sklearn.externals.joblib only in scikit-learn versions before 0.23) |
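
Both storage options in the table can be sketched as below: serializing a fitted model to an in-memory byte string with `pickle`, and writing it to disk with `joblib`. The model, the synthetic data, and the file name `model.joblib` are hypothetical. The sketch imports `joblib` directly, which is how it is available in current scikit-learn installations.

```python
import os
import pickle
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit a small model on synthetic data (assumption, for illustration only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Option 1: pickle "string" (a bytes object held in memory).
pickled_bytes = pickle.dumps(model)
restored_from_bytes = pickle.loads(pickled_bytes)

# Option 2: pickled model written to and read from a file with joblib.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)
    restored_from_file = joblib.load(path)

# Both restored models should reproduce the original predictions.
print(np.array_equal(model.predict(X), restored_from_bytes.predict(X)))
print(np.array_equal(model.predict(X), restored_from_file.predict(X)))
```

Pickle files can execute arbitrary code when loaded, so stored models should only be loaded from trusted sources.
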
Figure 3. Correlation heatmap and classification accuracy of ML models to classify “BMI” data.
Figure 4. Performance metric of “SVM” classification with 5-fold cross validation.
Figure 5. (a) Reliability curve to classify the “BMI” data with different ML classifiers. (b) Reliability curve to classify the “BMI” data with “Calibrated Decision Tree”.
Figure 6. Correlation heatmap and classification accuracy of ML models to classify “Insurance” data.
Figure 7. Relationship between “smoker” and “charges”.
Figure 8. Performance metric of “Decision Tree” classification with 5-fold cross validation.
Figure 9. Relationship between “age category” and “BMI”.
Figure 10. Relationship between “age category” and “charges”.
Figure 11. (a) Obese condition by smoking status; (b) distribution of obese patient group by smoking status.
Figure 12. (a) Reliability curve to classify “Insurance” data with different ML classifiers. (b) Reliability curve to classify “Insurance” data with the “Calibrated Decision Tree”.
Figure 13. Performance metric of the “Decision Tree” classification with a 5-fold cross validation.
Figure 14. (a) Reliability curve to classify the “Eating-health-module” data with different ML classifiers. (b) Reliability curve to classify the “Eating-health-module” data with the “Calibrated Decision Tree”.
Figure 15. (a) Relationship of the outcome (obesity) with blood glucose; (b) relationship of the outcome (obesity) with blood pressure; (c) relationship of the outcome (obesity) with age.
Figure 16. (a) Reliability curve to classify the “Diabetes” data with different ML classifiers. (b) Reliability curve to classify the “Diabetes” data with the “Calibrated LR”.
Figure 17. Correlation heatmap and classification accuracy of ML models to classify the “Cardiovascular-disease” data.
Figure 18. (a) Performance metric of the “SVM” classification with a 5-fold cross validation. (b) Performance metric of the “Logistic Regression” classification with a 5-fold cross validation.
Figure 19. (a) Reliability curve to classify the “Cardiovascular disease” data with different ML classifiers. (b) Reliability curve to classify the “Cardiovascular disease” data with the “Calibrated LR”.
The data processing synopsis with a discrimination analysis.
| Name of the Dataset | Data Processing Reason | Best ML Model with Performance Metrics | Identified Risk Factors |
|---|---|---|---|
| Person_Gender_Height_Weight_Index | To check correlation between BMI and weight change. Comparing the performance of multiclass classifiers. | SVM classifier | BMI |
| Insurance | To check the impact of identified health risk factors on weight change using regression and correlation. Comparing the performance of multiclass classifiers. Comparing the performance of regression algorithms. To check if BMI has any relation with age or not. | Decision tree (DTree) classifier | Age, sex, BMI, smoking habit, economic condition |
| Eating-health-module-dataset | To check the impact of the identified health risk factors on weight change using regression and correlation. Comparing the performance of multiclass classifiers. | DTree classifier | Sweet beverages, economic condition, fast food, sleeping, meat and milk consumption, drinking habit, exercise |
| Pima-Indians-diabetes-database | To check the impact of the identified health risk factors on weight change using regression and correlation. Comparing the performance of multiclass classifiers. To check the relationship between diabetes type II and obesity. | SVM, Naïve Bayes, Logistic Regression (LR) | Blood glucose, blood pressure, and age |
| Cardiovascular-disease-dataset | To check the impact of the identified health risk factors on weight change using regression and correlation. Comparing the performance of multiclass classifiers. To check the relationship between heart disease and obesity. | SVM and Logistic regression | Blood pressure, tobacco consumption, lipid profile, adiposity, family history, obesity, drinking habit, and age |
The data processing synopsis with the calibrated classification.
| Name of the Dataset | Best ML Model | Best Calibration Method | Uncalibrated Brier Score | Calibrated Brier Score |
|---|---|---|---|---|
| Person_Gender_Height_Weight_Index | Decision Tree | Isotonic | 0.000 | 0.000 |
| Insurance | Decision Tree | Isotonic | 0.000 | 0.000 |
| Eating-health-module-dataset | Decision Tree | Isotonic, Sigmoid | 0.000 | 0.000 |
| Pima-Indians-diabetes-database | Logistic Regression | Isotonic | 0.144 | 0.143 |
| Cardiovascular-disease-dataset | Logistic Regression | Isotonic | 0.198 | 0.187 |
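
The comparison of uncalibrated and calibrated Brier scores in the table can be reproduced in outline with scikit-learn's `CalibratedClassifierCV` and `brier_score_loss`. The sketch below uses synthetic data, a logistic regression base model, a 70/30 train/test split, and 5-fold internal calibration. These choices are assumptions, and the resulting scores will not match the values reported above.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption, for illustration only).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

brier = {}

# Uncalibrated model: Brier score of the raw predicted probabilities.
uncalibrated = LogisticRegression(max_iter=1000).fit(X_train, y_train)
brier["uncalibrated"] = brier_score_loss(
    y_test, uncalibrated.predict_proba(X_test)[:, 1]
)

# Calibrated models: isotonic and sigmoid (Platt) calibration, 5-fold internal CV.
for method in ("isotonic", "sigmoid"):
    calibrated = CalibratedClassifierCV(
        LogisticRegression(max_iter=1000), method=method, cv=5
    )
    calibrated.fit(X_train, y_train)
    brier[method] = brier_score_loss(
        y_test, calibrated.predict_proba(X_test)[:, 1]
    )

for name, score in brier.items():
    print(f"{name:13s} Brier score = {score:.3f}")
```

A lower Brier score indicates predicted probabilities closer to the observed outcomes; a score of 0 corresponds to perfect probabilistic predictions.
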