Multi-Task Learning to Identify Outcome-Specific Risk Factors that Distinguish Individual Micro and Macrovascular Complications of Type 2 Diabetes.

Era Kim¹,², David S Pieczkiewicz¹, M Regina Castro³, Pedro J Caraballo³, Gyorgy J Simon¹.

Abstract

Because deterioration in overall metabolic health underlies multiple complications of Type 2 Diabetes Mellitus, the risk factors for these complications overlap substantially, which makes the outcomes difficult to distinguish. We hypothesized that each risk factor had two roles: describing the extent of deterioration in overall metabolic health and signaling the particular complication the patient is progressing towards. We aimed to examine the feasibility of a methodology that separates these two roles, thereby improving the interpretation of predictions and helping prioritize which complication to target first. To separate these two roles, we built models for six complications using Multi-Task Learning (a machine learning technique for modeling multiple related outcomes by exploiting their commonality) in 80% of EHR data (N=9,793) from a university hospital and validated them in the remaining 20% of the data. Additionally, we externally validated the models in claims and EHR data from the OptumLabs™ Data Warehouse (N=72,720). Our methodology successfully separated the two roles, revealing distinguishing outcome-specific risk factors without compromising predictive performance. We believe that our methodology has great potential to generate more understandable, and thus more actionable, clinical information that supports a more accurate and timely prognosis for patients.

Year: 2018 PMID: 29888055 PMCID: PMC5961813

Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc

Introduction

Type 2 Diabetes Mellitus (T2DM) is an irreversible chronic disease. It is associated with the metabolic syndrome, a cluster of interrelated conditions that includes high blood pressure (BP), chronically elevated fasting plasma glucose (FPG), abdominal obesity, and lipid imbalance, including elevated triglycerides (TG) and low high-density lipoprotein (HDL)[1]. Because these conditions interact in complicated ways, even a minor adjustment to a single risk factor can dramatically influence the patient’s health status and clinical outcomes[2-4]. Hence, a comprehensive understanding of the effects of these risk factors on various complications is necessary for the successful long-term management of T2DM patients. Studies that identify risk factors for complications of T2DM abound[5-7]; however, they fail to paint an accurate picture of the patient’s health status and progression to the most likely next complication. In these studies, regardless of which complication they focus on, the risk factors tend to be largely the same (e.g., BP, FPG, lipids, and kidney function). The reason for this large overlap is that these risk factors capture the effect of deteriorating overall metabolic health that underlies all of these outcomes, rather than the effects that differentiate among the outcomes. This suggests that the risk factors have two roles: first, they describe the extent to which the patient’s overall metabolic health has deteriorated, and second, they signal the particular complication that the patient is progressing to. Because existing studies have focused on a single complication, or occasionally a few, and modeled them independently, they have not separated these two roles. Hence, it is difficult to know whether a risk factor is significant in progression to a particular complication or whether it merely describes the deterioration of overall metabolic health.
To understand the direction of progression, namely, which of the many possible complications the patient is most likely to develop next, separating these two roles is critical. The deterioration of underlying metabolic health is a commonality across all the complications. If we identify the commonality and remove it from the entirety of a risk factor’s effect, all that remains is the outcome-specific effect. Modeling this posed two challenges. First, to correctly capture the commonality, we needed to examine a wide range of complications using sufficient amounts of patient data, because the commonality is not identifiable when only a single complication is studied, and the distinction is lost. In this study, we used two independent datasets. As the primary dataset, we had EHR data (N=9,793) collected from the University of Minnesota Medical Center (UMMC) and used them for model training and internal validation. As the secondary dataset, we had claims and EHR data from the OptumLabs Data Warehouse (OLDW) (N=72,720)[8] and used them for external validation. Because these datasets contained years of medical history for a large number of patients, they offered sufficient amounts of patient data and allowed us to examine multiple complications simultaneously. Second, it was methodologically challenging to isolate the commonality from the entirety of a risk factor’s effect because the two are not distinguishable within a single model. Multi-Task Learning (MTL) is a technique for modeling multiple related outcomes by exploiting their commonality[9,10]. In our case, modeling progression to each individual complication is a modeling task, and these tasks are related because deterioration in overall metabolic health underlies them all. We used MTL to integrate these tasks and identify the commonality among them.
This approach is tantamount to applying MTL in reverse: rather than exploiting the commonality across the outcomes to improve predictive performance, we discard the commonality to reveal differential markers, that is, risk factors that are specific to each complication. Considering that accurate and timely prognosis for patients often remains unsatisfactory[11] and that limited evidence is available for clinical decision support, our methodology, which improves the interpretation of predictions and generates more understandable clinical information, will help prioritize the outcomes and develop optimal individualized T2DM management.

Materials and Methods

Primary Dataset for Training and Internal Validation

We used 10-year, de-identified EHR data (Jan 1, 2004-Dec 31, 2013) including inpatient, outpatient, and emergency department visits from the University of Minnesota Medical Center (UMMC), a major university hospital located in Minneapolis, MN. From the EHR data, we extracted patient demographics (age, gender), smoking status, vital signs (BP, pulse, and Body Mass Index [BMI]), lab results (HbA1c, lipid panel, Glomerular Filtration Rate [GFR]), three diagnoses comorbid to T2DM (dyslipidemia, hypertension, obesity), and six diagnoses of complications of interest: chronic kidney disease (CKD), acute renal failure (ARF), ischemic heart disease (IHD), congestive heart failure (CHF), peripheral vascular disease (PVD), and cerebrovascular disease (CVD). ARF is not usually associated with T2DM but involves organs or functions that are affected by T2DM. To demonstrate the validity of our proposed methodology, we intentionally included ARF as one of the outcomes. We used 80% and 20% of the UMMC data for model training and internal validation, respectively.

Study Design and Cohort Selection

We conducted a retrospective cohort study. In the UMMC data, we set the study baseline at Jan. 1, 2010, collected patients’ 6-year medical history before baseline to create baseline patient characteristics, and followed the patients from baseline to Dec. 31, 2013, determining whether or not they developed any complication of interest (Figure 1).

Figure 1.

Study design

Initially, we identified 22,946 adult T2DM patients based on ICD-9 codes. These patients were at least 18 years old at baseline, and they were generally diagnosed with T2DM within the 6-year period. When patients develop multiple complications, the effects of risk factors become conflated. Thus, we excluded 8,979 patients who had already developed any of the complications before baseline and 914 patients who developed multiple complications during the follow-up period. We did so because we wanted to start with simple data free of such conflated effects and to examine whether our methodology could feasibly separate the two roles. We also excluded 1,152 patients who had no HbA1c measurements at all, 1,611 patients who had no BP, pulse, or BMI measurements at all, 494 patients who had no lipid information at all, and 3 patients without any known smoking status, resulting in 9,793 patients.
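The cohort counts above can be checked with simple arithmetic. The following is a minimal sketch that assumes the exclusions were applied one after another to disjoint groups of patients (the text does not state this explicitly); the function and variable names are illustrative only.

```python
# Exclusion counts as reported in the text, applied in the order listed.
# Assumption: each count refers to patients still in the cohort at that step.
initial = 22_946
exclusions = [
    ("complication before baseline", 8_979),
    ("multiple complications during follow-up", 914),
    ("no HbA1c measurement", 1_152),
    ("no BP, pulse, or BMI measurement", 1_611),
    ("no lipid information", 494),
    ("unknown smoking status", 3),
]


def apply_exclusions(start, steps):
    """Return (reason, remaining cohort size) after each exclusion step."""
    remaining = start
    trail = []
    for reason, n in steps:
        remaining -= n
        trail.append((reason, remaining))
    return trail


trail = apply_exclusions(initial, exclusions)
final_cohort = trail[-1][1]
for reason, remaining in trail:
    print(f"{remaining:>6,}  after excluding: {reason}")
```

Under that assumption, the running total ends at the 9,793 patients reported for the UMMC cohort.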

Second Dataset for External Validation

For external validation, we used claims and EHR data from the OptumLabs Data Warehouse (OLDW), which includes de-identified claims data for privately insured and Medicare Advantage enrollees in a large, private, U.S. health plan, as well as de-identified EHR data from a nationwide network of provider groups. The database contains longitudinal health information on enrollees, representing a diverse mixture of ages, ethnicities and geographical regions across the United States. The health plan provides comprehensive full insurance coverage for physician, hospital, and prescription drug services. The EHR data sourced from provider groups reflects all payers, including uninsured patients[8]. We extracted 10-year data (Jan 1, 2006-Dec 31, 2015) from the OLDW and identified 72,720 T2DM patients using the same study design (Figure 1) and selection procedure.

Baseline Patient Characteristics in UMMC and OLDW Datasets

Table 1 shows baseline patient characteristics in the UMMC and OLDW datasets for the variables available in this study. These variables represent risk factors and are used henceforth. UMMC patients had similar HbA1c, lower SBP, and higher DBP compared to US adults with diabetes (whose HbA1c, SBP, and DBP are 7.2%, 131.5 mmHg, and 69.4 mmHg, respectively)[12], and they had signs of established CKD based on GFR[13]. Compared to UMMC patients, OLDW patients were older and had better HbA1c, better lipids, and better kidney function, but higher SBP and DBP.

Table 1.

Baseline Patient Characteristics in UMMC and OLDW Datasets

Variable	Description	UMMC (N=9,793)	OLDW (N=72,720)
male	Male	51	46
age	Age (years)	58±13	60±12
never_smoker	Non-smoker	56	45
a1c	HbA1C	7.2±1	7.0±1
ldl	LDL-cholesterol (mg/dL)	103±28	101±28
hdl	HDL-cholesterol (mg/dL)	44±12	46±12
trigl	Triglycerides (mg/dL)	172±90	169±117
tchol	Total-cholesterol (mg/dL)	181±34	179±34
gfr	Glomerular Filtration Rate (ml/min/1.73m2)	58±32	76±27
gfr_norm	Normal Glomerular Filtration Rate	22	7
bmi	Body Mass Index (kg/m2)	34±7	34±8
sbp	Systolic Blood Pressure (mmHg)	127±11	131±11
dbp	Diastolic Blood Pressure (mmHg)	75±7	77±7
pls	Pulse (bpm)	76±9	77±9
hyperlip	Hyperlipidemia	81	86
htn	Hypertension	71	81
obese	Obesity (BMI > 30)	70	67

Developing Multi-Task Learning Methodology

Under our hypothesis, each risk factor played two roles. The first role quantified the extent to which the patient’s metabolic health had deteriorated, and the second role signaled which complication the patient was most likely to develop next. The first role was common across all complications (common effect), and the second role was specific to each complication (outcome-specific effect). Formally, given a design matrix X that contained patients as rows and variables as columns, and t measuring time to event (complication or censoring), we simultaneously built the following six models, one for each complication c:

$$\lambda_c(t) = \lambda_{0c}(t)\,\exp(X\alpha + X\beta_c) \quad \text{subject to} \quad \|\alpha\|_1 \le C_1 \ \text{and} \ \|\beta_c\|_1 \le C_2 \qquad (1)$$

where $\lambda_c(t)$ was the patient’s hazard of developing complication c at time t, $\lambda_{0c}(t)$ was a complication-specific baseline hazard, $C_1$ and $C_2$ were user-defined thresholds, chosen via cross validation, and $\|\cdot\|_1$ denoted the L-1 norm (LASSO penalty). X contained all variables in Table 1 except the T2DM comorbidities (hyperlipidemia, hypertension, and obesity), since their defining factors (lab results and vital signs) were included. Conceptually, each model could be separated into two submodels as

$$\lambda_c(t) = \lambda_{0c}(t)\,\exp(X\alpha)\,\exp(X\beta_c)$$

where the first submodel (with coefficients α) was a Cox model[14] capturing the common effects, and the second submodel (with coefficients βc) was a Cox model capturing the outcome-specific effects for each complication c. We called the first submodel the General Progression Model and the second submodel the Differential Progression Model. Since these two models used the same set of variables, coefficients α and βc were generally not identifiable (for each complication c, the effects of α and βc were not distinguishable). While the first of the two LASSO constraints in Eq. (1) simply induced sparsity in the General Progression Model for the purpose of variable selection, the second LASSO constraint made the Differential Progression Models identifiable, as it shrank the βc coefficients towards 0, forcing the General Progression Model to explain as much of the variability as possible.
We iteratively updated the α and βc coefficients until they stabilized (squared differences of coefficients between the previous and current iterations were effectively zero). If the entirety of a variable’s effect was only general deterioration of metabolic health, its βc would be exactly 0. Conversely, if βc > 0, the variable increased the risk of complication c by βc beyond α (harmful); and if βc < 0, it decreased the risk of complication c by βc relative to α (protective). Therefore, non-zero βc coefficients identified differential markers; these were the risk factors that had effects beyond General Progression and enabled improved interpretation of progression to the most likely next complication.
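To make the iterative estimation more concrete, the sketch below fits a shared coefficient vector α and outcome-specific vectors βc for several Cox-type tasks, reading the model as a hazard proportional to exp(Xα + Xβc). It is an illustration only and not the authors' implementation: the text does not specify the optimizer, so the proximal-gradient updates, the Lagrangian penalties `lam_alpha` and `lam_beta` (standing in for the thresholds chosen by cross validation), the step size, the absence of tie handling, and the toy data are all assumptions.

```python
import math


def soft_threshold(v, t):
    """Proximal step for an L1 penalty: shrink v toward 0, exactly 0 if |v| <= t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def cox_gradient(X, time, event, coef):
    """Gradient of the mean negative Cox partial log-likelihood (no tie handling)."""
    n, p = len(X), len(coef)
    risk = [math.exp(sum(x[j] * coef[j] for j in range(p))) for x in X]
    grad = [0.0] * p
    for i in range(n):
        if not event[i]:
            continue  # censored patients contribute only through risk sets
        at_risk = [k for k in range(n) if time[k] >= time[i]]
        denom = sum(risk[k] for k in at_risk)
        for j in range(p):
            expected = sum(risk[k] * X[k][j] for k in at_risk) / denom
            grad[j] -= (X[i][j] - expected) / n
    return grad


def fit_mtl_cox(X, tasks, lam_alpha, lam_beta, step=0.1, tol=1e-10, max_iter=500):
    """Alternate updates of shared alpha and per-task beta until they stabilize."""
    p = len(X[0])
    alpha = [0.0] * p
    betas = [[0.0] * p for _ in tasks]
    for _ in range(max_iter):
        old = alpha + [b for beta in betas for b in beta]
        # Shared coefficients: use the gradients of all tasks together.
        grads = [cox_gradient(X, t, e, [a + b for a, b in zip(alpha, beta)])
                 for (t, e), beta in zip(tasks, betas)]
        alpha = [soft_threshold(a - step * sum(g[j] for g in grads), step * lam_alpha)
                 for j, a in enumerate(alpha)]
        # Outcome-specific coefficients: each task uses only its own gradient.
        for c, (t, e) in enumerate(tasks):
            g = cox_gradient(X, t, e, [a + b for a, b in zip(alpha, betas[c])])
            betas[c] = [soft_threshold(b - step * g[j], step * lam_beta)
                        for j, b in enumerate(betas[c])]
        new = alpha + [b for beta in betas for b in beta]
        # Stop when the squared change in coefficients is effectively zero.
        if sum((u - v) ** 2 for u, v in zip(old, new)) < tol:
            break
    return alpha, betas


# Toy data (hypothetical): 6 patients, 2 variables, 2 complications.
X = [[1.0, 0.0], [0.8, 1.0], [0.5, 0.0], [0.3, 1.0], [0.1, 0.0], [0.0, 1.0]]
tasks = [
    ([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [1, 1, 0, 1, 0, 0]),
    ([2.0, 1.0, 4.0, 3.0, 6.0, 5.0], [1, 1, 1, 0, 0, 0]),
]
alpha, betas = fit_mtl_cox(X, tasks, lam_alpha=0.01, lam_beta=0.05)
print("alpha:", alpha)
print("betas:", betas)
```

Penalizing βc more heavily than α in this sketch mirrors the idea in the text that the outcome-specific coefficients are shrunk towards 0 so that the shared model explains as much as possible; any βc entry that survives the shrinkage plays the role of a differential marker.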

Internal and External Validation

To determine the significance of the α and βc coefficients, we performed 1,000 permutation tests and calculated empirical p-values[15]. The key idea of the permutation test is that variables are independent of randomly permuted labels; thus, coefficients fitted to permuted labels are expected to show weaker associations than those fitted to the true labels. The p-value of a coefficient was then calculated as the proportion of permutation tests that resulted in a stronger association than the one observed. We internally evaluated the predictive performance of our models in 20% of the UMMC data and externally evaluated it in the OLDW data using the concordance index (C-index), which is typically used to assess the predictive performance of Cox models. In internal validation, we also drew 1,000 bootstrap samples, each with a sample size equal to 100% of the UMMC patients, to obtain 95% confidence intervals (95CIs). To demonstrate that our MTL-based methodology did not cause a loss of performance, we compared its predictive performance with that of a reference methodology that built six independent models (LASSO-penalized Cox regression), one for each of the six complications.
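The empirical p-value described above can be sketched as follows. This is a simplified illustration rather than the study's procedure: a difference in means stands in for the fitted coefficient, the data are made up, and the comparison counts permutations that are at least as strong as the observed statistic (rather than strictly stronger), which is an assumption made here so that a coefficient shrunk to exactly 0 does not appear significant.

```python
import random


def empirical_p_value(observed, permuted):
    """Share of permutation statistics at least as extreme as the observed one."""
    hits = sum(1 for s in permuted if abs(s) >= abs(observed))
    return hits / len(permuted)


def permutation_test(statistic, x, labels, n_perm=1000, seed=0):
    """Recompute the statistic under randomly permuted labels."""
    rng = random.Random(seed)
    observed = statistic(x, labels)
    shuffled = list(labels)
    permuted = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any link between x and the labels
        permuted.append(statistic(x, shuffled))
    return observed, empirical_p_value(observed, permuted)


def mean_difference(x, labels):
    """Stand-in association measure: mean of x in events minus non-events."""
    pos = [v for v, y in zip(x, labels) if y == 1]
    neg = [v for v, y in zip(x, labels) if y == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


# Hypothetical values of one variable and a binary outcome label.
x = [5.9, 6.1, 6.4, 6.8, 7.0, 7.9, 8.3, 8.8, 9.1, 9.6]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
observed, p = permutation_test(mean_difference, x, labels)
print(f"observed difference = {observed:.2f}, empirical p = {p:.3f}")
```

With 1,000 permutations the smallest non-zero p-value is 0.001, which is why results with no stronger permutation are reported as "< 0.001".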

Results

In this section, we present results from our proposed methodology, focusing on improved interpretation of risk factors and on predictive performance in comparison with the reference methodology.

Coefficients from Multi-Task Learning Methodology

Figure 2 presents the α and β coefficients from the General Progression and Differential Progression Models. The rows are the variables. The first column corresponds to the α coefficients from the General Progression Model, and the remaining columns correspond to the β coefficients from the Differential Progression Model for each complication c. The interpretation of the coefficients is analogous to that of regular Cox models: the exponential of a coefficient is the hazard ratio (HR) that the variable confers on the patient.

Figure 2.

Coefficients from Multi-Task Learning Methodology

For most variables (e.g., HbA1c, LDL), higher values are associated with higher risks: if α > 0, it indicates a harmful association, and if α < 0, it indicates a protective association with General Progression. In Differential Progression, if β > 0, the variable increases the risk of complication c by β from α, making the variable more important (the harmful effect becomes larger); and if β < 0, the variable decreases the risk of complication c by β from α, making the variable less important (the harmful effect becomes smaller). There are also variables (HDL, GFR, normal GFR, and never smoker) for which higher values are associated with lower risks; thus, the interpretation of their coefficients is the opposite. For example, if α of GFR > 0, it means that higher GFR is protective of General Progression; if β of GFR > 0, it indicates that higher GFR is more important in progression to complication c (the protective effect becomes larger). To help detect significant α and β coefficients, we visualized the associations between variables and complications (harmful, protective, more important, and less important) together with their p-values (Figure 2). Coefficients with a circle are statistically significant, and those without are not. Larger circles indicate smaller p-values (more significant). The exact p-values can be found in appendix Table A.1.

Table A.1.

P-values of Coefficients from Multi-Task Learning Methodology

Variable	General	CKD	ARF	IHD	PVD	CHF	CVD
a1c	0.008	0.049	0.340	0.027	0.005	0.058	0.327
ldl	0.329	0.117	0.289	0.034	0.265	0.259	0.256
hdl	0.058	0.281	0.343	0.123	0.145	< 0.001	0.283
trigl	0.014	0.297	0.307	0.059	0.079	< 0.001	0.278
tchol	0.021	0.057	0.25	0.039	0.023	0.059	0.234
gfr	< 0.001	< 0.001	0.265	< 0.001	< 0.001	0.251	0.255
gfr_norm	< 0.001	< 0.001	0.261	< 0.001	< 0.001	0.025	0.237
bmi	0.042	0.067	0.333	0.051	0.034	< 0.001	0.272
pls	0.188	0.290	0.320	0.105	0.076	< 0.001	0.294
sbp	0.01	0.076	0.31	0.002	0.003	< 0.001	0.272
dbp	0.006	0.261	0.307	0.029	< 0.001	0.289	0.266
never_smoker	0.002	0.089	0.017	0.098	< 0.001	< 0.001	0.317
age	< 0.001	0.289	0.311	0.045	< 0.001	< 0.001	< 0.001
male	0.107	0.094	0.321	0.019	< 0.001	0.008	0.312

As expected, most variables significantly predicted General Progression: HbA1c, triglycerides, total cholesterol, GFR, normal GFR, BMI, SBP, DBP, non-smoker, and age (Figure 2). Traditionally, higher DBP is known to be harmful. However, several recent studies showed that DBP was protective of cardiovascular disease, especially for older adults[16]. We also found that DBP was protective of General Progression. The General Progression Model was a latent model in the sense that it did not have an observable outcome; it described the extent of deterioration in overall metabolic health. The variables of General Progression were those that many studies found to be significantly associated with an increased risk of micro and macrovascular complications and all-cause mortality[17,18].

What is General Progression Model?

We defined General Progression mathematically as the effects that were common across all the complications and explained that General Progression captured deteriorating overall metabolic health. As an alternative, we interpreted α coefficients as the log HR of progression to any complication. To illustrate this, we built a LASSO-penalized Cox model that predicted the development of any complication (this model had an event if a patient developed any complication) and compared coefficients from this model (Figure 3) with α coefficients from General Progression Model (Figure 2). We found that they were similar with respect to effect size, sign, and significance, and this suggested that General Progression could be indicative of progression to any complication.
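The "any complication" outcome used for this comparison can be derived from the per-complication outcomes. The sketch below shows one way to do so, assuming onset times are expressed in years from baseline and that patients without a complication are censored at the end of follow-up; the function name, the dictionary layout, and the 4-year follow-up value are illustrative assumptions.

```python
def any_complication_outcome(onset_years, follow_up_years):
    """Collapse per-complication onset times into one (time, event) pair.

    onset_years maps a complication name to years from baseline at onset,
    or None if the complication was not observed during follow-up.
    """
    observed = [t for t in onset_years.values() if t is not None]
    if observed:
        return min(observed), 1   # event at the earliest onset
    return follow_up_years, 0     # censored at the end of follow-up


# Hypothetical patients: one develops IHD after 2.5 years, one develops nothing.
patient_a = {"CKD": None, "ARF": None, "IHD": 2.5, "PVD": None, "CHF": None, "CVD": None}
patient_b = dict.fromkeys(patient_a)  # every onset time is None
print(any_complication_outcome(patient_a, 4.0))
print(any_complication_outcome(patient_b, 4.0))
```

Because patients with multiple complications were excluded from the cohort, each patient contributes at most one onset time, so taking the minimum is only a safeguard.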

Figure 3.

Coefficients for Risk of Developing Any Complication

Interpretation of Coefficients from Multi-Task Learning Methodology

After achieving the overarching goal of our proposed methodology, which was to separate the common effects (α) and outcome-specific effects (β) of a risk factor, we examined whether the results and their interpretations made clinical sense. In particular, we wanted some of them to be consistent with known facts, because otherwise the utility of our methodology could be in doubt. To demonstrate this, let us consider the role of HbA1c in progression to CKD and IHD as an example, since the following is commonly accepted in practice: hyperglycemia is a key driver of microvascular complications (e.g., CKD), while dyslipidemia is a key driver of macrovascular complications (e.g., IHD)[19]. General Progression showed that a unit increase in HbA1c conferred an HR of 1.039 (exp(.0385)) on all complications uniformly (Figure 2, row 1, column 1). However, higher levels of HbA1c ultimately affect the different complications differently. As mentioned, it is well known that HbA1c is more predictive of CKD than of IHD. Indeed, Differential Progression for CKD showed that a unit increase in HbA1c conferred an additional log HR of .0407 on patients, increasing the HR of CKD from 1.039 to 1.0824 (exp(.0385+.0407)) (Figure 2, row 1, column 2). It is also known that HbA1c is not as important in IHD as in CKD. Differential Progression for IHD showed that patients with higher HbA1c tended to suffer other (microvascular) complications. The additional log HR of IHD that a unit increase in HbA1c conferred on patients was negative, which decreased the HR of IHD from 1.039 to 0.9842 (exp(.0385-.0544)) (Figure 2, row 1, column 4). This means that patients with higher HbA1c are more likely to progress to a complication than patients with lower HbA1c, and that complication is less likely to be IHD and more likely to be a microvascular complication such as CKD.
That is, General Progression described the patient’s tendency to progress to a complication, and the Differential Progression helped to target which complication the patient is more likely to develop next.
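The hazard ratios in the HbA1c example follow directly from exponentiating the sum of the common and outcome-specific coefficients. The short sketch below reproduces that arithmetic using the three coefficient values quoted in the text; the variable and function names are illustrative.

```python
import math

# Coefficients for a unit increase in HbA1c, as quoted in the text.
alpha_a1c = 0.0385      # common effect (General Progression), log HR
beta_a1c_ckd = 0.0407   # additional log HR for CKD (Differential Progression)
beta_a1c_ihd = -0.0544  # additional log HR for IHD (Differential Progression)


def hazard_ratio(*log_hrs):
    """Hazard ratio implied by a sum of log hazard ratios."""
    return math.exp(sum(log_hrs))


hr_general = hazard_ratio(alpha_a1c)
hr_ckd = hazard_ratio(alpha_a1c, beta_a1c_ckd)
hr_ihd = hazard_ratio(alpha_a1c, beta_a1c_ihd)

print(f"General: {hr_general:.3f}")
print(f"CKD:     {hr_ckd:.4f}")
print(f"IHD:     {hr_ihd:.4f}")
```

An HR above 1 for CKD and below 1 for IHD is what lets the same HbA1c increase point towards one complication and away from the other.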

Differential Markers of CKD, IHD, PVD and CHF

To make distinguishing patterns of differential markers easy to detect, we visualized each of CKD, IHD, PVD, and CHF as a series of spider plots[20] in Figure 4. In a spider plot, variables are arranged as axes extending radially from a central point, and each observation forms a closed polygon connecting points on all of the axes. The emphasis is on discerning the characteristic shapes of these polygons among observations, rather than on extracting specific values. The interpretation of our plots is as follows. Each plot corresponds to a complication. Ten variables for vital signs and lab results form the individual axes, arranged radially around a center point. The β coefficient of each variable is depicted by an anchor (node) on an axis. As higher values of HDL, GFR, and DBP are protective, the signs of their coefficients are reversed for visualization purposes only. The same color encoding was used to identify significantly more or less important differential markers. For each variable, distance from the center indicates an increased risk. In each plot, a navy line connecting the β coefficients represents Differential Progression, while a green line connecting zero on each axis conceptually represents General Progression, a reference for Differential Progression. By comparing these two lines on each axis, differential risk for complication c beyond or below General Progression is easily distinguished. As we focused on straightforward interpretation, we did not perform normalization; thus, the scales of the variables are not comparable with each other. What is important is whether the navy line (Differential Progression) is outside or inside the green line (General Progression) on each axis.

Figure 4.

Characteristic Shapes of Differential Markers for CKD, IHD, PVD, and CHF

IHD, PVD, and CHF are well-known concomitant macrovascular complications. They share similar pathophysiology and are believed to have similar risk factors[21]. Given these facts, distinguishing among them without a methodology like ours is more difficult. In Figure 4, spider plots show distinguishing patterns of differential markers among these very similar diseases. In progression to IHD, LDL was more important; SBP, and lower DBP were less important[16]. In progression to PVD, SBP and lower DBP were more important; BMI was less important. In progression to CHF, lipid abnormalities were less important; BMI, pulse, and SBP were more important (irregular or fast pulse is one of the symptoms of CHF).

Coefficients from Reference Methodology

Figure 5 shows coefficients from the reference models. The rows are the variables, and the columns are the complications. A coefficient > 0 indicates a harmful association, and a coefficient < 0 indicates a protective association with complication c. In the reference models, the two roles of a variable (α and β coefficients) were not identifiable; thus, only the entirety of a variable’s effect was estimated, and the outcome-specific effect was masked. This was the motivation of our study and the key difference from our proposed methodology. To help detect significant coefficients, we visualized the associations between variables and complications (harmful and protective) together with their p-values. The exact p-values can be found in appendix Table A.2.

Figure 5.

Coefficients from Reference Methodology

Table A.2.

P-values of Coefficients from Reference Methodology

Variable	CKD	ARF	IHD	PVD	CHF	CVD
a1c	0.033	< 0.001	0.132	0.023	0.164	0.007
ldl	0.052	< 0.001	0.043	0.101	0.289	0.045
hdl	0.088	0.014	0.101	0.333	< 0.001	0.024
trigl	0.333	0.010	0.023	0.025	0.003	0.162
tchol	0.269	< 0.001	0.029	0.045	0.239	0.271
gfr	< 0.001	< 0.001	0.006	< 0.001	0.021	0.019
gfr_norm	< 0.001	< 0.001	0.007	< 0.001	0.258	0.033
bmi	0.085	0.102	0.101	0.118	< 0.001	0.184
pls	0.348	< 0.001	0.121	0.081	< 0.001	0.088
sbp	0.055	0.005	0.024	< 0.001	< 0.001	0.007
dbp	0.154	0.004	0.117	< 0.001	0.138	0.159
never_smoker	0.344	< 0.001	0.016	< 0.001	< 0.001	0.128
age	< 0.001	< 0.001	0.021	< 0.001	< 0.001	< 0.001
male	0.347	< 0.001	0.059	< 0.001	0.002	0.041

Utility of Our Proposed Multi-Task-Learning Methodology

To demonstrate the clinical utility of our methodology in comparison with the reference methodology, let us take ARF as an example. Although diabetes is not a major cause of ARF, the reference models identified virtually all the variables as predictive of ARF (Figure 5). In contrast, the Differential Progression Model for ARF showed that progression to ARF was associated only with the underlying advanced metabolic deterioration (General Progression), and none of the variables was specific to ARF (Figure 2). Another example is CKD. The risk factors of CKD are well understood. The reference model for CKD identified HbA1c (barely), GFR, and age as significant risk factors, and they are indeed known risk factors. In fact, the reference models identified age as a risk factor for every complication; however, a patient is not more likely to develop CKD simply because of being older. By contrast, the General Progression Model and the Differential Progression Model for CKD suggested that older patients were more likely than younger patients to have deteriorated metabolic health, and that age played no role in progression to CKD beyond General Progression. Table 2 presents the predictive performance, measured by C-index, of our MTL-based models and the reference models. Generally, they achieved similar predictive performance. Minimal, albeit statistically significant, differences were observed only in complications with a small number of progressing patients.

Table 2.

Predictive Performance in C-Index (95CIs)

Dataset	Methodology	CKD	ARF	IHD	PVD	CHF	CVD
Internal (UMMC)	MTL	.74(.73-.79)	.58(.48-.82)	.57(.52-.58)	.75(.59-.80)	.83(.70-.91)	.75(.63-.78)
Internal (UMMC)	Reference	.74(.73-.79)	.62(.48-.79)	.57(.52-.58)	.75(.60-.81)	.84(.67-.91)	.78(.65-.80)
External (OLDW)	MTL	.71	.61	.53	.61	.73	.64
External (OLDW)	Reference	.71	.63	.53	.61	.74	.68
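For readers unfamiliar with the metric in Table 2, the following is a minimal sketch of a concordance index for censored data in the style of Harrell's C. It is not the evaluation code used in the study: the pairwise definition (a pair is comparable when the patient with the shorter time had an event; tied risk scores count one half) is the common textbook form and is assumed here, and the example values are made up.

```python
def concordance_index(time, event, risk):
    """Fraction of comparable pairs in which the earlier event has the higher risk."""
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored patient cannot be the earlier event of a pair
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical follow-up times (years), event indicators, and predicted risks.
time = [2, 4, 6, 8]
event = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.5, 0.1]
print(concordance_index(time, event, risk))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives a scale for reading the values in Table 2.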

For both methodologies, predictive performance was lower in external validation. The UMMC data consisted of a smaller number of patients from one healthcare system. Thus, they might be less representative of the T2DM population than the OLDW patients, or they might be a subpopulation of the OLDW patients. Also, patient characteristics differed fundamentally between the two datasets (Table 1). However, except for CKD, the external C-indices were still within the internal 95CIs.
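The statement about the confidence intervals can be checked against Table 2. The sketch below does this for the MTL rows, using the values as printed in the table; the dictionary names are illustrative.

```python
# Internal 95% CIs and external C-indices for the MTL models, copied from Table 2.
internal_ci = {
    "CKD": (0.73, 0.79), "ARF": (0.48, 0.82), "IHD": (0.52, 0.58),
    "PVD": (0.59, 0.80), "CHF": (0.70, 0.91), "CVD": (0.63, 0.78),
}
external_c = {"CKD": 0.71, "ARF": 0.61, "IHD": 0.53, "PVD": 0.61, "CHF": 0.73, "CVD": 0.64}

# Complications whose external C-index falls outside the internal interval.
outside = [name for name, c in external_c.items()
           if not (internal_ci[name][0] <= c <= internal_ci[name][1])]
print(outside)
```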

Discussion

Given that the effect of deteriorating overall metabolic health is common across all the complications, we hypothesized that each risk factor had two roles: describing the extent of deterioration in overall metabolic health and signaling the particular complication the patient is progressing towards. We have demonstrated that our proposed methodology separated these two roles of risk factors and revealed distinguishing patterns of differential markers. Also, we modeled multiple complications simultaneously by sharing their information, thereby generating a systematic and comprehensive interpretation of the different roles of risk factors among various complications. Our study has important strengths. First, we made predictions more understandable for clinicians by focusing on improved interpretability. Usually, high predictive accuracy is of key importance in prediction models. However, a lack of clarity in the interpretation of predictions limits their usefulness in practice. Second, we externally evaluated the predictive performance of our models, which was rarely done in other studies. Although the reference model performed as well as or slightly better than our proposed model, the difference was minimal. Thus, our proposed methodology did not meaningfully compromise predictive performance. Third, our methodology is of high utility because it can be applied to other clinical conditions in which comorbidities matter. Our study also has several limitations. When identifying cohorts, we excluded patients who developed multiple complications. Although this limits the generalizability of our work, our primary interest was to demonstrate the feasibility and utility of our proposed methodology. Additionally, we excluded patients with unknown vital signs, lab results, and/or smoking status. We tested the differences between the final study cohort and these excluded patients.
All variables except hdl, tchol, dbp, never_smoker, and age were significantly different, and the excluded patients were generally sicker. Thus, our study is subject to selection bias. However, our aim was not to estimate the effect of a risk factor, for which addressing selection bias using imputation methods is critical, but to separate the entirety of the effect into common and outcome-specific effects. Lastly, we used variables easily obtained from EHR data. Many studies have found that T2DM disproportionately affects patients depending on race, ethnicity, and/or socioeconomic status[22]. Although these are very important risk factors, we mainly focused on modifiable risk factors. When a model is built on large amounts of data, most variables become statistically significant; however, they may be clinically irrelevant. To impact individualized patient care, it is critical to develop enabling technologies that extract clinically useful information from large amounts of data. Our future work is to extend this work to larger cohorts and overcome these limitations. If we can obtain reasonable levels of generalizability, we believe that our methodology will have significant potential to help clinicians prioritize outcomes and make a more accurate prognosis for T2DM patients.

12 in total

1. Risk factors for renal dysfunction in type 2 diabetes: U.K. Prospective Diabetes Study 74.

Authors: Ravi Retnakaran; Carole A Cull; Kerensa I Thorne; Amanda I Adler; Rury R Holman
Journal: Diabetes Date: 2006-06 Impact factor: 9.461

2. Does the relation of blood pressure to coronary heart disease risk change with aging? The Framingham Heart Study.

Authors: S S Franklin; M G Larson; S A Khan; N D Wong; E P Leip; W B Kannel; D Levy
Journal: Circulation Date: 2001-03-06 Impact factor: 29.690

3. Insulin signalling and the regulation of glucose and lipid metabolism.

Authors: A R Saltiel; C R Kahn
Journal: Nature Date: 2001-12-13 Impact factor: 49.962

4. Cardiovascular morbidity and mortality associated with the metabolic syndrome.

Authors: B Isomaa; P Almgren; T Tuomi; B Forsén; K Lahti; M Nissén; M R Taskinen; L Groop
Journal: Diabetes Care Date: 2001-04 Impact factor: 19.112

5. Type 2 Diabetes Mellitus Trajectories and Associated Risks.

Authors: Wonsuk Oh; Era Kim; M Regina Castro; Pedro J Caraballo; Vipin Kumar; Michael S Steinbach; Gyorgy J Simon
Journal: Big Data Date: 2016-03-01 Impact factor: 2.128

6. Smoking, lipids, glucose intolerance, and blood pressure as risk factors for peripheral atherosclerosis compared with ischemic heart disease in the Edinburgh Artery Study.

Authors: F G Fowkes; E Housley; R A Riemersma; C C Macintyre; E H Cawood; R J Prescott; C V Ruckley
Journal: Am J Epidemiol Date: 1992-02-15 Impact factor: 4.897

7. Risk factors for coronary artery disease in non-insulin dependent diabetes mellitus: United Kingdom Prospective Diabetes Study (UKPDS: 23)

Authors: R C Turner; H Millns; H A Neil; I M Stratton; S E Manley; D R Matthews; R R Holman
Journal: BMJ Date: 1998-03-14

8. The incidence of congestive heart failure in type 2 diabetes: an update.

Authors: Gregory A Nichols; Christina M Gullion; Carol E Koro; Sara A Ephross; Jonathan B Brown
Journal: Diabetes Care Date: 2004-08 Impact factor: 19.112

9. Racial and ethnic disparities in diabetes complications in the northeastern United States: the role of socioeconomic status.

Authors: Chandra Y Osborn; Mary de Groot; Julie A Wagner
Journal: J Natl Med Assoc Date: 2013 Impact factor: 1.798

10. Harmonizing the metabolic syndrome: a joint interim statement of the International Diabetes Federation Task Force on Epidemiology and Prevention; National Heart, Lung, and Blood Institute; American Heart Association; World Heart Federation; International Atherosclerosis Society; and International Association for the Study of Obesity.

Authors: K G M M Alberti; Robert H Eckel; Scott M Grundy; Paul Z Zimmet; James I Cleeman; Karen A Donato; Jean-Charles Fruchart; W Philip T James; Catherine M Loria; Sidney C Smith
Journal: Circulation Date: 2009-10-05 Impact factor: 29.690
