Nutritional markers of undiagnosed type 2 diabetes in adults: Findings of a machine learning analysis with external validation and benchmarking.

Kushan De Silva¹, Siew Lim¹, Aya Mousa¹, Helena Teede¹, Andrew Forbes², Ryan T Demmer^3,4, Daniel Jönsson^5,6, Joanne Enticott¹.

Abstract

OBJECTIVES: Using a nationally-representative, cross-sectional cohort, we examined nutritional markers of undiagnosed type 2 diabetes in adults via machine learning.
METHODS: A total of 16429 men and non-pregnant women ≥ 20 years of age were analysed from five consecutive cycles of the National Health and Nutrition Examination Survey. Cohorts from years 2013-2016 (n = 6673) were used for external validation. Undiagnosed type 2 diabetes was determined by a negative response to the question "Have you ever been told by a doctor that you have diabetes?" and a positive glycaemic response to one or more of the three diagnostic tests (HbA1c > 6.4% or FPG > 125 mg/dl or 2-hr post-OGTT glucose ≥ 200 mg/dl). Following a comprehensive literature search, 114 potential nutritional markers were modelled with 13 behavioural and 12 socio-economic variables. We tested three machine learning algorithms on original and resampled training datasets built using three resampling methods. The resulting 12 predictive models were validated on internal and external validation cohorts. Magnitudes of associations were gauged through odds ratios in logistic models and variable importance in others. Models were benchmarked against the ADA diabetes risk test.
RESULTS: The prevalence of undiagnosed type 2 diabetes was 5.26%. Four best-performing models (AUROC range: 74.9%-75.7%) classified 39 markers of undiagnosed type 2 diabetes; 28 via one or more of the three best-performing non-linear/ensemble models and 11 uniquely by the logistic model. They comprised 14 nutrient-based, 12 anthropometry-based, 9 socio-behavioural, and 4 diet-associated markers. The AUROCs of all models were on a par with the ADA diabetes risk test in both internal and external validation cohorts (p > 0.05).
CONCLUSIONS: Models performed comparably to the chosen benchmark. Novel behavioural markers such as the number of meals not prepared from home were revealed. This approach may be useful in nutritional epidemiology to unravel new associations with type 2 diabetes.

Year: 2021 PMID: 33951067 PMCID: PMC8099133 DOI: 10.1371/journal.pone.0250832

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Diabetes is one of the most widespread non-communicable diseases in the world and is expected to affect 552 million people by 2030 [1]. Primary prevention of the most prevalent form of diabetes, type 2 diabetes [2], is driven by healthy lifestyle-focussed interventions and policies [3, 4]. However, different principles and policies may underpin prevention and management of other less prevalent phenotypes such as type 1 diabetes [5], latent autoimmune diabetes in adults [6], or rare monogenic diabetes [7].
Nutritional aspects including food habits, dietary constituents, and anthropometric measures offer value since these are relatively easily modifiable at an individual level [8] compared to socio-economic factors such as income, education, or occupation, the modification of which would often require higher policy-level and broader societal interventions [9]. However, there is a dearth of nutritional information for optimising type 2 diabetes prevention [10]. Further studies are needed to deepen our understanding of dietary factors associated with type 2 diabetes risk and the specific physiological and systemic pathways underlying those associations. Extraneous factors such as cooking practices and food contaminants, as well as individual metabolic heterogeneity such as variations in genetics, epigenetics, and microbiome, may further confound diet-type 2 diabetes associations, resulting in the contradictory findings that are not uncommon in the literature [11]. As such, studies which model these associations should strive to adjust for these factors to derive meaningful evidence [12]. With increasingly available multidimensional big data and machine learning (ML) techniques, such precision nutrition approaches are needed to understand the nutritional aetiopathogenesis of disease and to develop tailored programs [13].
Presently, ML is sparingly used in nutrition research [14], despite its promise and broadening applications in other areas of research including type 2 diabetes [15]. Current screening tools for type 2 diabetes hinge heavily on non-modifiable markers such as age and family history, with less emphasis on modifiable, behavioural aspects and little to no nutritional input. The American Diabetes Association (ADA) type 2 diabetes risk test comprises age, gender, history of gestational diabetes mellitus (in women), family history of diabetes, history of hypertension, body mass index (BMI) and physical activity [16]. Similarly, the Australian type 2 diabetes risk assessment tool (AUSDRISK) incorporates age, gender, ethnicity/country of birth, family history of diabetes, history of hyperglycaemia and history of hypertension as well as some lifestyle or anthropometric factors such as smoking, physical activity, and waist circumference, and a single dietary question (frequency of fruit/vegetable intake) [17]. The Finnish diabetes risk score (FINDRISC) is derived from age, gender, history of hypertension, history of hyperglycaemia, family history of diabetes, immediate relatives with history of diabetes, BMI, waist circumference, physical activity and the frequency of fruit/vegetable intake [18]. Available screening tools composed of a few known predictors have been found to underdiagnose early dysglycaemia [19]. A study which assessed four type 2 diabetes risk assessment tools based on these few predictors reported sub-standard performance and low external validity in new populations [20].
At present, evidence-based nutritional practices for primary prevention of type 2 diabetes in adults include lower consumption of dietary fat and energy as well as sufficient intake of dietary fibre (14 g fibre/1000 kcal) and whole-grain foods (equivalent to 50% of grain intake). These dietary practices should be combined with lifestyle interventions focussed on moderate weight loss (7% body weight) and regular exercise (150 minutes/week). Consumption of low glycaemic index (GI) foods enriched with fibre and nutrients is encouraged despite the lack of direct evidence that low GI food per se prevents the onset of type 2 diabetes. Alcohol use is not recommended for individuals at high risk of type 2 diabetes, regardless of the beneficial effects associated with its moderate use revealed by observational studies [21].
A thorough understanding of the role of nutrition and its complex interplay with other factors in the natural history of type 2 diabetes is key to developing personalised prevention programs as well as managing overt diabetes [10]. Therefore, there is a need to explore opportunities to expand on and improve the existing sparse models to achieve higher predictive ability by incorporating more granular information on modifiable predictors of type 2 diabetes such as nutritional aspects. From a translational perspective, cost-effective, scalable markers derived from self-reports may be preferred over costly nutritional biomarkers (e.g. blood concentrations) that are not collected or measured in resource-constrained contexts or that face implementation challenges [22]. Moreover, the validity of self-reported dietary assessment methods is well-documented [23, 24].
Classical statistics has developed mathematical models to explain inferential relationships between variables and outcomes such as type 2 diabetes. These models are sometimes used to predict events, although inferential statistics often underpin the algorithms [25, 26]. Inferential statistics is constrained in the task of predictive modelling for a number of reasons, including that it struggles to incorporate collinear factors and complex interactions. The pure prediction world is anti-parsimonious [27]; there will be a multitude of potential factors that, combined in complex non-linear ways, can produce more accurate predictions for particular events. Real prediction is done using ML algorithms that are capable of detecting complex patterns and handling collinear factors, and are designed with the primary aim of predicting future events [28]. Examining new factors as potential candidate predictors for type 2 diabetes or other clinical conditions using ML and large datasets constitutes an extensive knowledge discovery process. ML has also broadened our ability to detect patterns between predictors and outcomes that was not previously possible [27]. With the increasing availability of big data, the scope to investigate a multitude of other possible predictors is now a reality. Together, large datasets and new analytical approaches with ML have provided us with the opportunity to expand the knowledge base on other factors associated with type 2 diabetes. It is envisioned that identifying the best cohort of these predictors, many of which will have small effects, may eventually be used to build predictive tools with high predictive abilities and give clinicians and their patients the best certainty in risk prediction probabilities.
To date, no study has applied ML to explore nutritional markers of undiagnosed type 2 diabetes which could be used to improve its early diagnosis and understand its pathology beyond routinely-used risk factors. In this context, the present study used prediction models and ML, coupled with serial cross-sectional data from five consecutive cycles of the National Health and Nutrition Examination Survey (NHANES) (https://www.cdc.gov/nchs/nhanes/index.htm) over the years 2007–2016, with the aim of identifying nutritional markers that could predict undiagnosed type 2 diabetes together with routinely used non-modifiable, behavioural and socio-economic predictors. We also benchmarked the performance of these models against a national risk assessment method (i.e. ADA diabetes risk test) [16].
The rest of the manuscript is structured as follows: We first describe the database and study cohort, followed by an account of the operationalisation of the outcome variable. Thereafter, we detail the statistical analysis including data pre-processing, ML, and benchmarking steps. We then present results of univariate analyses followed by details of the best-performing ML models derived by each algorithm and the elucidated nutritional markers. The Results section concludes with the findings from the benchmarking and algorithmic performance comparison steps. Next, we discuss the strengths, limitations, novel aspects, and potential clinical implications of the study. Finally, conclusions of the study are presented.

Materials and methods

Data source and study sample

The NHANES is a series of biennial cross-sectional surveys conducted by the Centers for Disease Control and Prevention (CDC) [29]. This is a large database containing voluminous information from nationally-representative samples of non-institutionalised US civilians, which can be used for predictive analytic purposes. For this study, we pooled five consecutive cycles in order to maximise the number of adult participants with undiagnosed type 2 diabetes and to enable robust adjustment for potential confounders. Each survey cycle had been approved by the National Center for Health Statistics Institutional Ethics Review Board and all adult participants had provided written informed consent. Additionally, Monash University Human Research Ethics Committee approved this study (#24888). The approach to participant selection is presented in .

Flowchart depicting the analytic workflow adopted in the study.

a: adjusted by resampling methods including oversampling, under-sampling, random oversampling (ROSE) and synthetic minority oversampling technique (SMOTE).

The resulting sample (n = 16429) included men and non-pregnant women ≥ 20 years of age with nutritional, behavioural, socio-economic and non-modifiable demographic data collected using pre-defined and uniform methods, from five consecutive data collection cycles of the NHANES spanning years 2007–2016. Design and methods of NHANES are well-documented (https://www.cdc.gov/nchs/nhanes/index.htm). In brief, dietary information was collected via two 24-hour dietary recall interviews; the first was an in-person visit in specially-designed Mobile Examination Centres (MECs) and the second was by telephone 3–10 days later. All dietary data were collected using similar methods in each survey cycle, enabling accurate total nutrient intake estimations and comparisons. Other health information was gathered by home-based interviews and via clinical examination in MECs. Although NHANES also collected serum biomarker data in MECs, these were not included, as we aimed to incorporate only easily collected, cost-effectively scalable nutritional and other clinical information frequently associated with dysglycaemia.

Outcome variable

Undiagnosed type 2 diabetes among men and non-pregnant women ≥ 20 years of age was determined using all three diagnostic tests administered in NHANES: fasting plasma glucose [FPG], oral glucose tolerance test [OGTT], and haemoglobin A1c [HbA1c]. A participant was classified as having undiagnosed type 2 diabetes if they had a negative response to the question “Have you ever been told by a doctor that you have diabetes?” and a positive glycaemic response to one or more of the above diagnostic tests [HbA1c ≥ 48 mmol/mol (≥ 6.5%) or FPG ≥ 126 mg/dl or 2-hr post-OGTT glucose ≥ 200 mg/dl] as per ADA criteria [30]. All diagnosed diabetes cases, defined by a positive response to the question above and a positive glycaemic response [HbA1c ≥ 48 mmol/mol (≥ 6.5%) or FPG ≥ 126 mg/dl or 2-hr post-OGTT glucose ≥ 200 mg/dl], were removed. Since the aim was to elucidate markers of overt type 2 diabetes as opposed to normoglycaemia, individuals with prediabetes according to ADA criteria [HbA1c = 39–47 mmol/mol (5.7–6.4%) or FPG = 100–125 mg/dl or 2-hr post-OGTT glucose = 140–199 mg/dl] [31] were also removed. Normoglycaemia was defined as a negative response to the question above and a negative glycaemic response for all three diagnostic tests [HbA1c < 39 mmol/mol (< 5.7%) and FPG < 100 mg/dl and 2-hr post-OGTT glucose < 140 mg/dl].
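The classification rules above can be restated as a short decision function. The following Python sketch is illustrative only: the study itself was carried out in R, and the function name, argument names, and label strings are hypothetical. It assumes all three test results are available for a participant; how missing test results were handled is not described in this section. The "unclassified" branch covers a combination (a positive questionnaire response with normal test results) that the text does not define.

```python
def classify_participant(told_diabetes: bool,
                         hba1c_pct: float,
                         fpg_mgdl: float,
                         ogtt_2h_mgdl: float) -> str:
    """Assign a glycaemic category from the ADA thresholds quoted in the text.

    Assumption: all three test values are present (no missing data).
    """
    # Diabetes range: any one of the three tests at or above its threshold.
    diabetes_range = (hba1c_pct >= 6.5
                      or fpg_mgdl >= 126
                      or ogtt_2h_mgdl >= 200)
    if diabetes_range:
        # The questionnaire answer separates diagnosed from undiagnosed cases.
        return "diagnosed_diabetes" if told_diabetes else "undiagnosed_t2d"

    # Prediabetes range: no test in the diabetes range, at least one elevated.
    prediabetes_range = (hba1c_pct >= 5.7
                         or fpg_mgdl >= 100
                         or ogtt_2h_mgdl >= 140)
    if prediabetes_range:
        return "prediabetes"

    # All three tests negative and no prior diagnosis reported.
    if not told_diabetes:
        return "normoglycaemia"

    # Not covered by the definitions in the text (assumption of this sketch).
    return "unclassified"
```

Under these rules only the "undiagnosed_t2d" and "normoglycaemia" categories would remain in the analytic sample; the diagnosed and prediabetes categories correspond to the groups the text says were removed.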

Statistical analysis

The analytic workflow of this study was based on our previously published proof-of-concept study exploring predictors of prediabetes [32]. However, substantial modifications were made, including analysing nutritional variables (omitted in the previous study) and excluding serum biomarkers in order to consider only those predictors which are simple, scalable and based on self-reported or easily measurable parameters. Another advancement was that, to be consistent with the cross-sectional design of NHANES, only undiagnosed type 2 diabetes was modelled in the present analysis, whereas such a refinement was not applied to define the prediabetes cohort in the previous proof-of-concept study. We also used different benchmarking instruments in congruence with the two different conditions analysed in the respective studies, and we included much larger cohorts for training, testing, external validation and benchmarking. Finally, to identify all potential nutritional associations, we did not incorporate any statistical feature selection.

Data pre-processing

All analyses were performed using R statistical software [33]. Variables with ≥ 30% missing data were excluded, after which 139 variables that are potentially associated with undiagnosed type 2 diabetes (114 nutritional/dietary/food-intake associated; 13 other modifiable/health behaviour associated; 12 socio-economic/demographic) were included as independent variables, selected based on comprehensive literature surveys as summarised in . The rationale for inclusion of behavioural and socio-economic variables was to enable robust adjustment of the resulting multivariate models for these factors and to elucidate nutritional markers jointly with information that is routinely incorporated into type 2 diabetes screening. Statistical feature selection was omitted as we aimed to identify all potential predictors of undiagnosed type 2 diabetes from the repertoire of 139 variables. The multiple imputation by chained equations (MICE) package [34] was used with default functions for imputing missing values: predictive mean matching, polytomous regression, and binary logistic regression for numeric, multi-level (> 2 levels) categorical and dichotomous categorical variables, respectively. Summary measures and variable distributions in the original and complete datasets were compared to evaluate goodness of fit. The distribution of characteristics for individuals with undiagnosed type 2 diabetes and those with normoglycaemia within the entire cohort is outlined in . NHANES 2013–2016 data were set aside as an external validation sample to temporally validate the constructed models. We performed a random 50/50 split of the remaining NHANES 2007–2012 data to generate training samples (n = 4879) and internal validation samples (n = 4877).
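As a rough illustration of two of the pre-processing steps described above (the 30% missingness filter and the temporal plus random partitioning), a minimal Python sketch follows. It is not the authors' R code. The record layout (a list of dictionaries with `None` for missing values), the `cycle` field, the cycle labels, and the seed are all assumptions made for the example. Imputation with MICE is not reproduced here, and the simple shuffle-and-halve split is not expected to reproduce the reported sample sizes exactly.

```python
import random

# Hypothetical labels for the two cycles held out for external validation.
EXTERNAL_CYCLES = {"2013-2014", "2015-2016"}


def drop_sparse_variables(rows, max_missing=0.30):
    """Keep only variables whose share of missing (None) values is < 30%."""
    n = len(rows)
    keep = [
        name for name in rows[0]
        if sum(row[name] is None for row in rows) / n < max_missing
    ]
    return [{name: row[name] for name in keep} for row in rows]


def split_cohorts(rows, seed=0):
    """Return (training, internal validation, external validation) samples.

    External validation is temporal (the later cycles); the remaining
    records are shuffled and divided in half.
    """
    external = [r for r in rows if r["cycle"] in EXTERNAL_CYCLES]
    development = [r for r in rows if r["cycle"] not in EXTERNAL_CYCLES]

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(development)

    half = (len(development) + 1) // 2
    return development[:half], development[half:], external
```

Holding out whole survey cycles, rather than a random subset, means the external sample tests whether a model trained on earlier years still performs on later ones.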

Machine learning

We applied three ML algorithms: logistic regression (LR) (linear), artificial neural network (ANN) (non-linear), and random forests (RF) (ensemble). To address the effect of class imbalance, resampling algorithms including minority class oversampling, Random OverSampling Examples (ROSE) [35], and Synthetic Minority Oversampling TEchnique (SMOTE) [36] were incorporated and trained in conjunction with each ML algorithm. Thus, a total of four models were built for each ML algorithm, using: 1) original data, 2) oversampling, 3) ROSE, and 4) SMOTE. For ANN, parameter tuning and 5-fold cross-validation were conducted, whereas default R package parameters and 10-fold cross-validation were used for training the other two algorithms [37-39]. In detail, ANN settings were as follows: the tuning grid was composed of three weight decay parameters (0, 0.1, 0.01) and the size parameter was set from 1 to a maximum of 139, equal to the number of features. The bagging option was set to false and variable standardisation was performed via centering and scaling. The maximum number of iterations was 500. All other parameters were left at their default values. This resulted in 12 ML models which were built on training data and tested on internal and external validation cohorts (Figs ). Confusion matrix metrics such as sensitivity, specificity, and negative and positive predictive values as well as area under the receiver operating characteristic curve (AUROC) were used to assess the predictive performance of these models. Adjusted odds ratios (OR) indicated the relative impact of predictors in LR models, with confidence intervals (CI) used to measure variability and significance. Predictors from the other two algorithms were identified by variable importance values, as calculated by default R software functions (Figs ) [37-39].
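To make the model count and the evaluation metrics concrete, the sketch below enumerates the 3 × 4 combinations of algorithm and resampling strategy and computes the confusion matrix metrics and AUROC named above. It is a plain Python illustration and not the R functions used in the study; the function names are hypothetical, outcomes are assumed to be coded 1 (undiagnosed type 2 diabetes) and 0 (normoglycaemia), and the AUROC is computed with the pairwise-ranking definition rather than by tracing the ROC curve.

```python
from itertools import product
from math import nan

ALGORITHMS = ("LR", "ANN", "RF")
RESAMPLING = ("original", "oversampling", "ROSE", "SMOTE")

# Every algorithm paired with every training-data variant: 3 x 4 = 12 models.
MODEL_GRID = list(product(ALGORITHMS, RESAMPLING))


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else nan


def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and NPV for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


def auroc(y_true, scores):
    """Probability that a random case outscores a random non-case.

    Ties between a case and a non-case count as half a correct ranking.
    """
    cases = [s for t, s in zip(y_true, scores) if t == 1]
    controls = [s for t, s in zip(y_true, scores) if t == 0]
    correct = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return _ratio(correct, len(cases) * len(controls))
```

For example, `auroc([1, 1, 0, 0, 0], [0.9, 0.4, 0.5, 0.2, 0.1])` gives 5/6, because five of the six case/non-case pairs are ranked in the correct order. With a prevalence near 5%, positive predictive value in particular depends strongly on the class balance of the data being scored, which is one reason to report it alongside AUROC.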

Overlapped ROC curves demonstrating predictive performance of logistic regression models on internal validation data.