Kushan De Silva1, Siew Lim1, Aya Mousa1, Helena Teede1, Andrew Forbes2, Ryan T Demmer3,4, Daniel Jönsson5,6, Joanne Enticott1.
Abstract
OBJECTIVES: Using a nationally representative, cross-sectional cohort, we examined nutritional markers of undiagnosed type 2 diabetes in adults via machine learning.
Year: 2021 PMID: 33951067 PMCID: PMC8099133 DOI: 10.1371/journal.pone.0250832
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Nutritional and other markers of undiagnosed type 2 diabetes identified by the best-performing logistic model (AUC = 75.7%).
| GLM original (internal AUC = 75.7%; external AUC = 74.6%) | |
|---|---|
| Marker | OR (95% CI) |
| | |
| How healthy is the diet? | 0.85 (0.72, 0.99) |
| Number of meals from fast food or pizza place | 1.01 (1.00, 1.02) |
| | |
| Weight | 1.07 (1.01, 1.13) |
| Body mass index | 1.06 (1.03, 1.09) |
| Waist circumference | 1.06 (1.04, 1.09) |
| | |
| Total fat | 1.14 (1.02, 1.27) |
| Beta-cryptoxanthin | 0.99 (0.98, 1.00) |
| Folic acid | 0.83 (0.69, 0.99) |
| Food folate | 0.84 (0.74, 0.96) |
| Calcium | 0.97 (0.94, 1.00) |
| Caffeine | 0.998 (0.997, 1.000) |
| Vitamin B12 | 0.99 (0.98, 1.00) |
| Smoked at least 100 cigarettes in life? = yes (ref = no) | 1.09 (1.00, 1.19) |
| Age | 1.05 (1.03, 1.07) |
| Ethnicity = other (ref = White) | 1.04 (1.01, 1.08) |
| Total number of people in the household | 1.23 (1.01, 1.47) |
a: logistic regression model on original, un-resampled data
b: area under receiver operating characteristic curve on the internal validation data
c: area under receiver operating characteristic curve on the external validation data
d: self-reported
e: measured via two 24-hour dietary recalls.
AUC = area under receiver operating characteristic curve.
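The odds ratios above are per-unit effects, so they compound multiplicatively over larger changes in a marker. The following is a minimal sketch of that arithmetic, not the authors' code. It assumes a standard Wald interval on the log-odds scale, which this excerpt does not confirm, and the function names are hypothetical. The unit of each marker is whatever scale the model used, which the table does not state.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Convert a logistic coefficient and its standard error to an OR
    with an approximate 95% Wald interval (assumed method)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def odds_ratio_for_change(or_per_unit, delta):
    """Odds ratio implied by a change of `delta` units, given a per-unit OR."""
    return or_per_unit ** delta


# Waist circumference has OR = 1.06 per unit in the table above.
# A 10-unit difference therefore corresponds to 1.06 ** 10.
or_10_units = odds_ratio_for_change(1.06, 10)
print(round(or_10_units, 2))
```

This is why an OR that looks close to 1, such as 1.06, can still describe a substantial difference in odds across the observed range of a continuous marker.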
Nutritional and other markers of undiagnosed type 2 diabetes identified by best-performing ANN and RF models.
| RF ROSE | | ANN ROSE | | ANN SMOTE | |
|---|---|---|---|---|---|
| Internal AUC = 75.2% | | Internal AUC = 74.9% | | Internal AUC = 74.9% | |
| External AUC = 74.3% | | External AUC = 74.3% | | External AUC = 73.6% | |
| Marker | Importance | Marker | Importance | Marker | Importance |
| Waist circumference | 53.03 | Number of meals not home prepared | 0.5755 | Number of meals not home prepared | 0.5755 |
| | | Number of frozen meals/pizzas in past 30 days | 0.5620 | Number of frozen meals/pizzas in past 30 days | 0.5620 |
| Body mass index | 44.07 | | | | |
| Age when heaviest weight | 43.35 | Waist circumference | 0.7228 | Waist circumference | 0.7228 |
| Upper leg length | 38.42 | Body mass index | 0.6967 | Body mass index | 0.6967 |
| Arm circumference | 33.41 | Age when heaviest weight | 0.6806 | Age when heaviest weight | 0.6806 |
| Weight | 32.70 | Arm circumference | 0.6386 | Arm circumference | 0.6386 |
| Self-reported weight, 1 year ago | 32.32 | Weight | 0.6295 | Weight | 0.6295 |
| Current self-reported weight | 30.81 | Self-reported weight, 1 year ago | 0.6285 | Self-reported weight, 1 year ago | 0.6285 |
| Self-reported greatest weight | 30.31 | Current self-reported weight | 0.6272 | Current self-reported weight | 0.6272 |
| Standing height | 26.82 | Upper leg length | 0.6199 | Upper leg length | 0.6199 |
| | | Self-reported greatest weight | 0.6130 | Self-reported greatest weight | 0.6130 |
| Carbohydrate | 25.21 | How do you consider your weight? | 0.5948 | How do you consider your weight? | 0.5948 |
| Caffeine | 24.01 | Like to weigh more, less or same? | 0.5755 | Like to weigh more, less or same? | 0.5755 |
| | | Standing height | 0.5665 | Standing height | 0.5665 |
| Beta-carotene | 23.18 | | | | |
| Alpha-carotene | 23.09 | Carbohydrate | 0.5606 | Carbohydrate | 0.5606 |
| Energy | 22.96 | | | | |
| SFA 12:0 (Dodecanoic) | 22.88 | Self-rated general health | 0.6043 | Self-rated general health | 0.6043 |
| Copper | 22.82 | Vigorous recreational activities | 0.5776 | Vigorous recreational activities | 0.5776 |
| Vitamin E as alpha-tocopherol | 22.70 | Moderate recreational activities | 0.5662 | Moderate recreational activities | 0.5662 |
| Age | 49.96 | Age | 0.6843 | Age | 0.6843 |
| Income-poverty ratio | 23.51 | Education level | 0.6132 | Education level | 0.6132 |
a = random forest model on training data restructured by the ROSE sampling algorithm
b = artificial neural network model on training data restructured by the ROSE sampling algorithm
c = artificial neural network model on training data restructured by the SMOTE sampling algorithm
d = area under receiver operating characteristic curve on the internal validation data
e = area under receiver operating characteristic curve on the external validation data
f = by default, mean decrease in prediction accuracy after a variable is permuted
g = default method uses combinations of the absolute values of the weights.
ANN = artificial neural network; AUC = area under receiver operating characteristic curve; RF = random forest; ROSE = random oversampling examples; SFA = saturated fatty acid; SMOTE = synthetic minority oversampling technique.
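Footnote f describes random forest importance as the mean decrease in prediction accuracy after a variable is permuted. The idea can be sketched in plain Python as below. This is an illustration only: the toy data, the threshold classifier, and the function names are all hypothetical, and a random forest implementation typically computes the drop on out-of-bag samples per tree rather than on one fixed dataset as done here.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction equals the label."""
    correct = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return correct / len(labels)


def permutation_importance(predict, rows, labels, column, n_repeats=20, seed=0):
    """Mean drop in accuracy when the values of one column are shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in the chosen column.
        permuted = [
            row[:column] + (value,) + row[column + 1:]
            for row, value in zip(rows, shuffled)
        ]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats


# Toy data (made up): column 0 mimics a waist measurement, column 1 is noise.
rows = [(80 + i, i % 3) for i in range(40)]
labels = [int(waist > 100) for waist, _ in rows]


def predict(row):
    # Hypothetical classifier that uses only column 0.
    return int(row[0] > 100)


waist_importance = permutation_importance(predict, rows, labels, column=0)
noise_importance = permutation_importance(predict, rows, labels, column=1)
```

Shuffling the column the classifier relies on lowers accuracy, while shuffling the ignored column leaves it unchanged, so the first importance is positive and the second is zero.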
Creation of variables analogous to those in the American Diabetes Association (ADA) diabetes risk test using National Health and Nutrition Examination Survey (NHANES) data.
| Variable | Information used/modified from NHANES | Score |
|---|---|---|
| Age | Age in years at screening was categorised with the following cut-points to ascribe scores | |
| | <40 | 0 |
| | 40–49 | 1 |
| | 50–59 | 2 |
| | ≥60 | 3 |
| Gender | Self-reported gender | |
| | Female | 0 |
| | Male | 1 |
| Previous gestational diabetes (if female) | Self-reported history of gestational diabetes | |
| | No | 0 |
| | Yes | 1 |
| 1st degree relative with diabetes | NHANES questionnaire collects information on familial diabetes, but not on 1st degree relatives with diabetes per se, so the self-reported family history of diabetes was used as a proxy variable. | |
| | No | 0 |
| | Yes | 1 |
| Hypertension (self-reported history of hypertension, prescribed antihypertensive medication, and/or BP ≥140/90) | NHANES provides information on all 3 criteria: self-reported history of hypertension (“Ever told you had high blood pressure?”), prescribed antihypertensive medication (“Taking prescription for hypertension?”) and/or BP ≥140/90 (objectively measured and averaged over 3 or 4 measurements of SBP and DBP). | |
| | Absence of all 3 criteria | 0 |
| | Presence of history of self-reported hypertension or prescribed antihypertensive medication, or BP ≥140/90 | 1 |
| Physically active (self-reported) | Derived a binary variable by checking if any of the following activities were done on 5 or more days of a typical week: vigorous or moderate work, recreational work, walk or bicycle | |
| | Yes | 0 |
| | No | 1 |
| BMI, kg/m2 | Available in NHANES. Objectively measured. | |
| | <25 | 0 |
| | 25 to <30 | 1 |
| | 30 to <40 | 2 |
| | ≥40 | 3 |
* Per ADA guidelines, individuals with a cumulative score ≥5 should be formally screened for diabetes; this threshold was used as the cut-point for classifying individuals.
ADA = American Diabetes Association; BMI = body mass index; BP = blood pressure; NHANES = National Health and Nutrition Examination Survey.
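The scoring scheme in the table above can be sketched as a small function. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Participant` field names are hypothetical and are not NHANES variable codes, and each input is assumed to have been derived beforehand as described in the table.

```python
from dataclasses import dataclass

SCREENING_CUT_POINT = 5  # cumulative score >= 5, as stated in the table note


@dataclass
class Participant:
    # Hypothetical field names, chosen for readability.
    age_years: float
    is_male: bool
    gestational_diabetes: bool  # counted only for women
    family_history: bool        # proxy for a 1st degree relative with diabetes
    hypertension: bool          # any of the 3 criteria listed in the table
    physically_active: bool
    bmi: float


def age_score(age_years):
    if age_years < 40:
        return 0
    if age_years < 50:
        return 1
    if age_years < 60:
        return 2
    return 3


def bmi_score(bmi):
    if bmi < 25:
        return 0
    if bmi < 30:
        return 1
    if bmi < 40:
        return 2
    return 3


def ada_risk_score(p):
    """Sum the per-variable scores from the table."""
    score = age_score(p.age_years)
    score += 1 if p.is_male else 0
    score += 1 if (not p.is_male and p.gestational_diabetes) else 0
    score += 1 if p.family_history else 0
    score += 1 if p.hypertension else 0
    score += 0 if p.physically_active else 1
    score += bmi_score(p.bmi)
    return score


def screen_positive(p):
    return ada_risk_score(p) >= SCREENING_CUT_POINT


example = Participant(55, True, False, True, False, True, 31.0)
print(ada_risk_score(example), screen_positive(example))
```

The example participant scores 2 for age, 1 for gender, 1 for family history and 2 for BMI, giving a total of 6, which meets the cut-point.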
Performance comparison of the ADA diabetes risk test versus the best-performing model on NHANES data.
| Criterion | ADA diabetes risk test, internal validation dataset | ADA diabetes risk test, external validation dataset | Best-performing ML model, internal validation dataset | Best-performing ML model, external validation dataset |
|---|---|---|---|---|
| AUROC | 0.737028 | 0.7401352 | 0.7566544 | 0.7464869 |
| Sensitivity | 0.688716 | 0.7639015 | 0.6810036 | 0.7745098 |
| Specificity | 0.690244 | 0.6109271 | 0.7105263 | 0.6148893 |
| Accuracy | 0.690164 | 0.6319564 | 0.7088374 | 0.6222089 |
| PPV | 0.147892 | 0.1092312 | 0.1249178 | 0.0881368 |
| NPV | 0.912503 | 0.9219408 | 0.9734803 | 0.9826807 |
a = This was a logistic regression model on original, unbalanced training data without any resampling
b = A randomly partitioned sample from NHANES 2007–2012
c = from NHANES 2013–2016.
ADA = American Diabetes Association; AUROC = area under the receiver operating characteristic curve; ML = machine learning; PPV = positive predictive value; NPV = negative predictive value.
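All criteria in the table except AUROC follow from the four cells of a confusion matrix at a fixed cut-point. The sketch below shows those definitions. The counts are made up for illustration and are not taken from the study, and the function name is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Threshold-based metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),  # true positives among actual cases
        "specificity": tn / (tn + fp),  # true negatives among non-cases
        "accuracy": (tp + tn) / total,
        "ppv": tp / (tp + fp),          # actual cases among positive calls
        "npv": tn / (tn + fn),          # non-cases among negative calls
    }


# Illustrative counts only: 100 cases among 1000 people (10% prevalence).
metrics = classification_metrics(tp=70, fp=300, tn=600, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a condition that is uncommon in the sample, PPV stays low and NPV stays high even when sensitivity and specificity are moderate, which matches the general pattern in the table.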