Literature DB >> 25926855

An application of association rule mining to extract risk pattern for type 2 diabetes using tehran lipid and glucose study database.

Azra Ramezankhani¹, Omid Pournik², Jamal Shahrabi³, Fereidoun Azizi⁴, Farzad Hadaegh¹.

Abstract

BACKGROUND: Type 2 diabetes, common and serious global health concern, had an estimated worldwide prevalence of 366 million in 2011, which is expected to rise to 552 million people, by 2030, unless urgent action is taken.
OBJECTIVES: The aim of this study was to identify risk patterns for type 2 diabetes incidence using association rule mining (ARM). PATIENTS AND METHODS: A population of 6647 individuals without diabetes, aged ≥ 20 years at inclusion, was followed for 10-12 years, to analyze risk patterns for diabetes occurrence. Study variables included demographic and anthropometric characteristics, smoking status, medical and drug history and laboratory measures.
RESULTS: In the case of women, the results showed that impaired fasting glucose (IFG) and impaired glucose tolerance (IGT), in combination with body mass index (BMI) ≥ 30 kg/m(2), family history of diabetes, wrist circumference > 16.5 cm and waist to height ≥ 0.5 can increase the risk for developing diabetes. For men, a combination of IGT, IFG, length of stay in the city (> 40 years), central obesity, total cholesterol to high density lipoprotein ratio ≥ 5.3, low physical activity, chronic kidney disease and wrist circumference > 18.5 cm were identified as risk patterns for diabetes occurrence.
CONCLUSIONS: Our study showed that ARM is a useful approach in determining which combinations of variables or predictors occur together frequently, in people who will develop diabetes. The ARM focuses on joint exposure to different combinations of risk factors, and not the predictors alone.

Entities: CellLine Chemical Disease Gene Species

Keywords: Body Mass Index; Data Mining; Diabetes Mellitus, Type 2

Year: 2015 PMID： 25926855 PMCID： PMC4393501 DOI： 10.5812/ijem.25389

Source DB: PubMed Journal: Int J Endocrinol Metab ISSN： 1726-913X

1. Background

Type 2 diabetes, a common and serious global health concern, had an estimated worldwide diabetes prevalence of 366 million in 2011, which is expected to rise to about 552 million people by 2030, unless urgent action is taken (1, 2). Diabetes leads to significant medical complications, including retinopathy, nephropathy, neuropathy, stroke, and myocardial infarction (3). In type 2 diabetes, cardiovascular events are responsible for 80% of all deaths (4). Several possible risk factors have been known to contribute to the development of type 2 diabetes, such as ethnicity, obesity, unhealthy diet, inactivity, insulin resistance and family history of diabetes (5). Epidemiological studies have shown that lifestyle modifications can prevent or delay development of type 2 diabetes, making the early identification of populations at high risk a major healthcare necessity (6, 7). During the past 2 decades, epidemiologists and statisticians have attempted to develop simple, reliable, affordable, and widely implementable weighted models for the prediction of future type 2 diabetes risk (7). Although firm scientific evidences for the prevention of type 2 diabetes are available, the researchers continue to look for the causes of diabetes and ways to manage, prevent, or cure the disorder (8, 9). In current medical studies, the identification of risk factors for diabetes and designing a diagnostic or prediction model are generally based on multivariate statistical analysis utilizing logistic regression (6). Huge amounts of data generated by health care systems and epidemiological studies, on the other hand, contain hidden knowledge, which is impossible to uncover by using traditional methods; using data mining, therefore, is more adapted for medical studies (10-12). Data mining, a part of the Knowledge Discovering from Databases (KDD), is an interdisciplinary active research area that is used to extract high-level knowledge from low-level data, in the context of large data sets (13). Decision trees, clustering and neural network are the most common data mining methods that have been used for pattern extraction and to develop prediction models in the medical domain. However, association rules mining (ARM), one of the most popular methods in data mining, is rarely used (14-17). The ARM is a technique used to discover associations between variables (14). Several applications of ARM in the medical domain include discovering disease co-occurrences (18), identifying adverse effects of drugs (19), public health surveillance (20), detecting risk factors for heart disease and diabetes (21, 22) and determining relations among complications or the various diseases that accompany type 2 diabetes (23).

2. Objectives

In this study, we applied ARM for extracting risk patterns associated with type 2 diabetes occurrence, using data from the Tehran Lipid and Glucose Study (TLGS).

3. Patients and Methods

3.1. Study Population

Data was obtained from the TLGS, a prospective population based study performed in district 13 of Tehran, the capital of Iran. In a baseline survey performed from 1999 to 2001 (phase 1), 15000 residents, aged ≥ 3 years, were investigated and followed during three consecutive phases with 3-year intervals [2002 to 2005 (phase 2), 2005 to 2008 (phase 3), and the last, 2009-2012 (phase 4)] (24). Another 3500 people were added in the second phase and followed in the next two phases. In our study, subjects aged ≥ 20 years, from the first and second phases, were selected. We then excluded subjects with prevalent diabetes at baseline and, also, those with missing data on fasting and 2-hour post-prandial glucose, leaving 10310 non-diabetic subjects in the first and second phases, which were followed in the next phases. Overall, 3663 (35%) subjects missed their follow-ups. A total of 729 individuals developed type 2 diabetes by the end of a 12-year follow-up (phase 4), and 5918 subjects remained diabetes-free. The final data source, hence, included 6647 people (Figure 1). The incidence of type 2 diabetes was defined based on fasting plasma glucose (FPG) ≥ 126 mg/dL or 2 h post-challenge plasma glucose (2 h PCPG) ≥ 200 mg/dL, or taking anti-diabetic medication (25).

Figure 1.

Follow up Status of the Subjects in the Tehran Lipid and Glucose Study

* Gray box shows study population, which included 6647 diabetic and non-diabetic persons.

Follow up Status of the Subjects in the Tehran Lipid and Glucose Study

* Gray box shows study population, which included 6647 diabetic and non-diabetic persons.

3.2. General Description of Data

The data set included demographic characteristics, like age, gender, marital status, education, and information on smoking status, physical activity, and medical and drug history, which were collected during interviews, using a pretested questionnaire. Anthropometric measures, including weight, height, waist, hip and wrist circumference, were obtained according to a standard protocol (26). Systolic and diastolic blood pressures were measured twice, on the right arm, using a standardized mercury sphygmomanometer, and the mean of the two measurements was considered as the subject's blood pressure. All blood parameters, except for 2h plasma glucose, were based on fasting blood samples (after 12-14 hours overnight fasting). Blood samples obtained were centrifuged within 30-45 minutes of collection and kept cool until analysis at the TLGS research laboratory. Fasting and 2h plasma glucose were determined, based on an enzymatic colorimetric method, using oxidase kits (Pars Azmoon Inc., Tehran, Iran) with inter- and intra-assay coefficients of variation (CV), both < 2.2%. Triglycerides (TGs), total and high density lipoprotein (HDL) cholesterol were measured, using methods described elsewhere (27, 28).

3.3. Pre-Processing and Transformation of Data

In order to mine the dataset for association rules, all the variables should be transformed to the binary form. We had two types of variables for analysis, numeric and categorical. Numeric variables were categorized using the well-known medical cutoffs into intervals and then each interval was mapped to an item (binary variable). Age and total length of stay in the city were categorized into four groups (Table 1). Chronic kidney disease (CKD) was considered as an estimated glomerular filtration fate (GFR) below 60 mL/min/1.73 m2 (29). For categorization of lipid profiles (triglycerides, HDL, total cholesterol, cholesterol to HDL ratio, and triglyceride to HDL ratio) we used the cutoff value for prediction of diabetes in Iranian populations (30). Waist circumference was categorized using the cutoff value for Iranian men and women to predict incident cardiovascular diseases (31). Body mass index (BMI), waist-to-hip ratio (WHR), waist-to-height ratio (WHtR), blood pressure (BP), fasting plasma glucose (FPG) and 2 h plasma glucose were categorized using the World Health Organization (WHO) cutoffs for definition of metabolic syndrome, overweight and obesity (32). As there is no definite cutoff point for wrist circumferences, we categorized it in men and women using the equal frequency binning method. In this method, individuals were divided into three groups, so that each group contains the same number of individuals (Table 1). For other categorical variables, each category was mapped to an item. In summary, each row of data is a set of items and each item corresponds to the presence or absence of one category or numerical interval value of the variables.

Table 1.

Distribution of the Demographic Variables in the Study Population [a]

Variables	Cutoff Value	Values [b]
Age, y
Group 1	20-34	2525 (37.98)
Group 2	35-49	2444 (36.76)
Group 3	50-64	1310 (19.70)
Group 4	≥ 65	368 (5.53)
Total length of stay in the city, y
Group 1	< 20	609 (9.2)
Group 2	20-39	3884 (58.4)
Group 3	≥ 40	2131 (32.1)
Education
Group 1	≥ 13 years	990 (14.89)
Group 2	6-12 years	3877 (58.32)
Group 3	≤ 5 years	1780 (26.77)
Gender
Female	-	3762 (56.59)
Male	-	2885 (43.40)
Occupation
Employed	-	2718 (40.89)
Housekeeping (for females)	-	3041 (45.74)
Student	-	215 (3.23)
Unemployed	-	642 (9.65)
Other	-	31 (0.49)
Marital status
Divorced	-	55 (0.82)
Married	-	5512 (82.92)
Single (unmarried)	-	869 (13.07)
Widowed	-	210 (3.15)

a Data are presented as No. (%).

bThe data contains missing values when the cell percentages do not sum up to 100%.

a Data are presented as No. (%). bThe data contains missing values when the cell percentages do not sum up to 100%.

3.4. Analysis Method

3.4.1. Association Rules Mining

The ARM is a fundamental data mining technique that exhaustively looks for hidden patterns, making them ideal for the discovery of predictive rules from medical databases (33, 34). An association rule (AR) is a pair (X, Y) of sets of attributes, denoted by X → Y, where X is the antecedent and Y is the consequent of the rule X → Y. Basically, the rule states that if X happens, then Y does happen. In general, a set of items, such as X or Y, which are disjoint, is called an item set (33). Applied to a medical condition, association rules can identify subpopulations at particularly high risk of a given disease. They are interpretable, and suggest interactions between risk factors (35). Association rules are generated in two steps. First, a set of frequent item sets, or patterns, are generated. Second, these patterns can be used for generation of association rules (33, 36). Support and confidence are two measures of statistical significance and strength of a rule, respectively (21, 33). The support of a rule (X → Y) is defined as the percentage of records (rows) in the dataset that contain both X and Y (XUY). Confidence of a rule (X → Y) is the percentage of records in a dataset containing X that also contain Y (33). The correlation between X and Y is measured by the lift value, which is given as follows: The lift value is simply the ratio of the posterior and the prior confidence of an association rule. Consider the number of samples (records) in our database is “θ”, if " θ → diabetes" has a confidence of 10% and "X → Y" has a confidence of 70%, then the lift value (of the second rule) is 70/10 = 7. Obviously, if the posterior confidence equals the prior confidence, the value of this measure is 1. If the posterior confidence is greater than the prior confidence, the lift value exceeds 1 (the presence of the antecedent items raises the confidence), and if the posterior confidence is less than the prior confidence, the lift value is less than 1 (the presence of the antecedent items lowers the confidence). More formally, the lift of the rule (X → Y) is (Equation 1): Supp (θ) = the number of records in database (33, 37). Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts, while the lift value cannot be set by user (16). For the current study, we used the Apriori algorithm, the best-known and standard approach for discovering association rules in a cohort dataset (16). The algorithm includes two separate steps. In the first, minimum support is applied to identify all frequent item sets in a database. In the second step, these frequent item sets and the minimum confidence are used to generate rules (38). We considered support, confidence and lift as measures of interestingness and performance of the rules (39). As the goal of our study was to relate predictor variables to diabetes occurrence, we therefore, limited all predictor variables, to appear only in the antecedent (IF part), and diabetes occurrence (outcome variable) to appear only in the consequent (THEN part). To generate strong association rules, we started our analysis from initial support 2% (which encompasses about 20% of diabetic subjects), and confidence thresholds 75% (based on acceptable level of sensitivity for prediction models), for the generation of frequent item sets and rule induction. We also set the number of item sets to five, in the antecedents of rules; two types of rules were extracted for males and females, separately. We used SPSS modeler 14.2 (IBM, Armonk, NY, USA) to apply an association rule algorithm.

4. Results

A total of 6647 persons participated in the study, with more than 10 years follow up. The study included 42 input variables (Tables 1-5). The occurrence of diabetes during the follow-up period was 11% (427 cases) and 10% (302 cases) in females and males, respectively.

Table 5.

Distribution of the Medical Histories Variables in the Study Population [a]

Variables	Values [b]
Current cigarette smoking
Yes (daily / occasionally)	830 (12.48)
No	5817 (87.52)
Former cigarette smoking
Yes (daily / occasionally)	489 (7.35)
No	6158 (92.65)
Exposed to second hand smoke at home or at work
Yes	1690 (25.42)
No	4957 (74.57)
Physical activity levels
Low (doing exercise or labor less than three times a week)	4426 (66.58)
Normal (doing exercise or labor more than three times in a week)	2221 (33.42)
Use of diet or exercise for the management of hyperlipidemia
Yes	559 (8.40)
No	6088 (91.59)
Use of diet or exercise for the management of hypertension
Yes	319 (4.79)
No	6328 (95.20)
Use of antihypertensive drugs in the past month
Yes	324 (4.87)
No	6323 (95.12)
Use of lipid lowering drugs in the past month
Yes	150 (2.25)
No	6497 (97.74)
Use of diuretic drugs in the past month
Yes	112 (1.68)
No	6535 (98.32)
Use of thyroid drugs in the past month
Yes (ordered / unordered)	166 (2.49)
No	6481 (97.51)
Use of aspirin in the past month
Yes (ordered / unordered)	178 (2.67)
No	6469 (97.33)

a Data are presented as No. (%).

bThe data contains missing values when the cell percentages do not sum up to 100%.

a Data are presented as No. (%). bThe data contains missing values when the cell percentages do not sum up to 100%. a Abbreviations: CKD, chronic kidney disease. b Data are presented as No. (%). cThe data contains missing values when the cell percentages do not sum up to 100%. a Data are presented as No. (%). b The data contains missing values when the cell percentages do not sum up to 100%. a Data are presented as No. (%). bThe data contains missing values when the cell percentages do not sum up to 100%.

4.1. Results of Association Rules Mining Using the Apriori Algorithm

With initial support and confidence thresholds of 2% and 75%, respectively, we could not find any rule for men. Hence, we reduced the thresholds for support and confidence by 0.2 on each run. Finally, with a threshold value of 1.8% and 65% for support and confidence, respectively, the Apriori discovered a set of rules for men. Since such thresholds were acceptable by experts, we set the support and confidence to the mentioned values. A total of 480 rules were discovered for females. Due to the finding of a large number of rules in females, we had to report a number of the most interesting rules. As both support and confidence were important in our study, we could not select the rules based on only one of them. Therefore, considering a tradeoff between two measures, we reported rules above a threshold 2% and 75% for support and confidence, respectively. For several of the rules that were subsets of other rules, and with the same confidence and support, we reported the rule with more item sets in the IF Part. After that, we were left with a collection of seven association rules for females, shown in Table 6. For males, a total of four rules were found (Table 7).

Table 6.

Rules Extracted for Females Using the Apriori Algorithm [a]

Rule Number	Antecedent	Consequent	Support [b]	Confidence [c]	Lift [d]
1	IFG = yes, IGT = yes, BMI ≥ 30, waist to height ≥ 0.5	Type 2 DM	2.8	75.0	6.6
2	IFG = yes, IGT = yes, BMI ≥ 30, Marital status = Married, waist to height ≥ 0.5	Type 2 DM	2.4	75.0	6.6
3	IFG = yes, IGT = yes, BMI ≥ 30, HDL < 50, waist to height ≥ 0.5	Type 2 DM	2.3	75.6	6.7
4	IFG = yes, IGT = yes, BMI ≥ 30, wrist circumference ≥ 16.5, waist to height ≥ 0.5	Type 2 DM	2.2	76.5	6.7
5	IFG = yes, IGT = yes, , BMI ≥ 30, waist to hip ≥ 0.85, waist to height ≥ 0.5	Type 2 DM	2.2	76.5	6.7
6	IFG = yes, IGT = yes, Family history of diabetes = yes, waist to height ≥ 0.5	Type 2 DM	2.1	78.2	6.9
7	IFG = yes, IGT = yes , BMI ≥ 30, HDL< 50, Marital status = Married	Type 2 DM	2.1	75.6	6.7

a Abbreviations: IFG, Impaired Fasting Glucose; IGT, Impaired Glucose Tolerance; BMI, Body Mass Index; Type 2 DM, Type 2 Diabetes Mellitus.

b The percentage of records in the data for which the antecedents are true.

c The percentage of records in the data that for which both antecedents and consequent is true.

d The ratio between the rule’s confidence and the support of the item sets in consequent of a rule.

Table 7.

Rules extracted for males using the Apriori algorithm [a]

Rule Number	Antecedent	Consequent	Support [b]	Confidence [c]	Lift [d]
1	IGT= yes, IFG= yes, CHO to HDL ≥ 5.3, occupation status= employed, waist to hip ≥ 0.9	Type 2 DM	2.2	65.1	6.2
2	IGT= yes, IFG= yes, length of stay in the city ≥ 40, wrist circumference ≥ 18, waist to hip ≥ 0.9	Type 2 DM	1.9	66.1	6.3
3	IGT= yes, IFG= yes, and CKD= yes, Physical activity levels = low, waist to hip ≥ 0.9	Type 2 DM	1.8	65.4	6.2
4	IGT= yes, IFG= yes, wrist circumference ≥ 18, occupation status= employed , waist to height ≥ 0.5	Type 2 DM	1.8	69.2	6.6

a Abbreviations: IFG, Impaired Fasting Glucose; IGT, Impaired Glucose Tolerance; BMI, Body Mass Index; Type 2 DM, Type 2 Diabetes Mellitus.

b The percentage of records in the data for which the antecedents are true.

c The percentage of records in the data that for which both antecedents and consequent is true.

d The ratio between the rule’s confidence and the support of the item sets in consequent of a rule.

a Abbreviations: IFG, Impaired Fasting Glucose; IGT, Impaired Glucose Tolerance; BMI, Body Mass Index; Type 2 DM, Type 2 Diabetes Mellitus. b The percentage of records in the data for which the antecedents are true. c The percentage of records in the data that for which both antecedents and consequent is true. d The ratio between the rule’s confidence and the support of the item sets in consequent of a rule. a Abbreviations: IFG, Impaired Fasting Glucose; IGT, Impaired Glucose Tolerance; BMI, Body Mass Index; Type 2 DM, Type 2 Diabetes Mellitus. b The percentage of records in the data for which the antecedents are true. c The percentage of records in the data that for which both antecedents and consequent is true. d The ratio between the rule’s confidence and the support of the item sets in consequent of a rule.

5. Discussion

This study describes a rule extraction experiment on a cohort database, using ARM. We analyzed the association of baseline characteristics and diabetes occurrence in the study population. As we limited diabetes occurrence to appear only in the consequent part, all variables, which appear in the antecedent, are the predictor and observed pattern, or a combination of predictors, known as the risk pattern. Generated rules can be interpreted as the conditional probability of diabetes incidence within the special subpopulation. For example, in Table 6, rule 1 (IFG = yes, IGT = yes, BMI ≥ 30 kg/m2, waist to height ≥ 0.5) can be interpreted as: The probability of diabetes in women who are obese and have IFG, IGT and waist to height ratio ≥ 0.5 is 75%. Lift value of this rule can be interpreted as ratio of this probability (75%) to probability of diabetes in all the population of the database. Since the prior probability of diabetes in the female database is 11%, the lift value will be 6.6 (75% / 11.3%) for this rule. Association rule mining revealed several interesting patterns or relations between variables. The results showed that for women, IFG and IGT repeated in six rules (Table 6), whereas no rule was found containing only these two items in the antecedent part by the defined threshold. This shows that in women, several predisposing factors simultaneously affect the development of diabetes, besides IFG and IGT. Obesity in women is the most important risk factor, and combined with IFG and IGT, it can progress to type 2 diabetes, because it has appeared in six of the seven rules. Family history of diabetes and wrist circumference greater than 16.5 cm appeared in two rules (Table 6, Rules 6 and 4). No study has yet prospectively examined wrist circumference, as a predictor for diabetes occurrence, in an adult population. There is just one study conducted on TLGS data that has shown that wrist circumference is a significant predictor for diabetes, in both genders of adult population (40). The overall cutoff for wrist circumference in that study was found to be 15.7 cm in women. In our study, we categorized wrist circumference in three equal count groups. The related rule (Table 6, Rule 4) showed that wrist circumference >16.5 cm in a group with IFG, IGT, BMI ≥ 30 kg/m2 and waist to height ≥ 0.5 associate with diabetes occurrence. For men (Table 7), results show that IGT and IFG are also two frequent items, which are repeated in all the rules. Comparison of these results with the risk patterns in women showed that items, such as length of stay > 40 years in the city, central obesity, cholesterol to HDL ratio, low physical activity, CKD and wrist circumference > 18.5 cm are potential risk patterns for men. Previous studies in adults have shown the predictability of most of these factors in diabetes occurrence (7, 41). However, only one study in men has shown predictability of wrist circumference in diabetes occurrence (40). Unlike women, among men, high wrist circumference was observed in combinations with central obesity (Table 7). Generally, the rules discovered in the present study showed different risk patterns for males and females. Previous studies have identified several of these variables as risk factors for the development of type 2 diabetes (6). Our study uncovered novel data into diabetes research. Most of the discovered rules had lift values greater than 6, and confidence levels above 75%. These association rules are easily interpretable, and could be designed to provide support to the healthcare professional (35, 42). When choosing ARM for extracting rules, several points should be considered. For instance, changing support and confidence level will produce different rules or patterns. Therefore, obtaining the desired supports and confidence level for extracting interesting rules requires a close relationship between the medical researcher and data miners. Also, it should be considered that selecting cutoffs for categorizing variables is the most important step in data mining method, specially extracting rules by association rules mining (43). The strength of this study is that we used a prospective large data set for extraction of predictive rules. We also used well known cut points for categorizing of continuous variables, which can be interpreted easily by users. Also, we entered a large number of variables in the study, which would be impossible in the case of traditional methods. There are several limitations in the current study over not considering all possible predictors, such as sociological factors (44), or several nutritional factors (45). It should be mentioned that several risk factors could be changed, treated, or modified, during the study period. Therefore, as our next step in this context, we intend to identify the temporal or sequential pattern for predicting type 2 diabetes. The results of this study showed the usefulness of ARM for extracting risk patterns from a large prospective database. Our study showed that ARM is a useful approach in determining which combinations of variables or predictors occur together frequently, in people who will develop diabetes. Although most of the variables in the extracted patterns had been known in previous studies, as a predictors of diabetes, the ARM focuses on joint exposure to different combinations of risk factors, and not the predictors alone.

Table 2.

Distribution of the Anthropometric Variables in the Study Population [a]

Variables	Cutoff Value	Values [b]
Waist circumference, cm
Normal	< 90	3554 (53.47)
Abnormal	≥ 90	2914 (43.84)
Wrist circumference, cm
Women
Group 1	< 15.5	1058 (28.12)
Group 2	15.5-16.5	1357 (36.07)
Group 3	> 16.5	1252 (33.28)
Men
Group 1	< 17.2	934 (32.37)
Group 2	17.2-18	806 (27.94)
Group 3	> 18	1109 (38.44)
Waist to height ratio
Normal	< 0.5	785 (11.80)
Abnormal	≥ 0.5	5683 (85.49)
Waist to hip ratio
Men
Normal	< 0.9	476 (16.49)
Abnormal	≥ 0.9	2373 (82.25)
Women
Normal	< 0.85	2172 (57.73)
Abnormal	≥ 0.85	1447 (38.46)
Body Mass Index, kg/ m ²
Normal	< 25	2364 (35.56)
Overweight	25-30	2702 (40.64)
Obese	≥ 30	1424 (21.42)

a Data are presented as No. (%).

bThe data contains missing values when the cell percentages do not sum up to 100%.

Table 3.

Distribution of the Biochemical and Disease History Variables in the Study Population [a,b]

Variables	Cutoff Value	Values [c]
Fasting Plasma Glucose, mg/dL
Normal	< 100	946 (14.23)
Impaired Fasting Glucose	100-126	5701 (85.76)
Two hour postprandial plasma glucose, mg/dL
Normal	< 140	5848 (87.97)
Impaired Glucose Tolerance	140-200	799 (12.02)
Total cholesterol, mg/dL
Normal	< 200	3382 (50.88)
Hypercholesterolemia	≥ 200	3265 (49.11)
Triglyceride Levels, mg/dL
Normal	< 150	3690 (55.51)
Hypertriglyceridemia	≥150	2957 (44.48)
Cholesterol to High Density lipoprotein Ratio
Normal	< 5.3	3841 (57.78)
Abnormal	≥ 5.3	2801 (42.13)
High Density Lipoprotein, mg/dL
Men
Low	< 40	1893 (65.61)
Normal	≥ 40	990 (34.31)
Women
Low	< 50	2747 (73.01)
Normal	≥ 50	1012 (26.90)
Triglyceride to High Density Lipoprotein Ratio
Men
Normal	< 4.7	1647 (57.08)
Abnormal	≥ 4.7	1236 (42.84)
Women
Normal	< 3.7	2358 (62.67)
Abnormal	≥ 3.7	1401 (37.24)

Chronic Kidney Disease
CKD	GFR < 60	2262 (34.03)
Non CKD	GFR ≥ 60	4382 (65.92)
Systolic Blood Pressure, mm Hg
Normal	< 140	5946 (89.45)
Hypertension	≥ 140	613 (9.22)
Diastolic Blood Pressure, mm Hg
Normal	< 90	5852 (88.04)
Hypertension	≥ 90	707 (10.64)

a Abbreviations: CKD, chronic kidney disease.

b Data are presented as No. (%).

cThe data contains missing values when the cell percentages do not sum up to 100%.

Table 4.

Distribution of the Medical Histories Variables in the Study Population[a]

Variables	Values [b]
History of hospitalization until now
Yes	4566 (68.69)
No	2081 (31.30)
History of ischemic heart disease
Yes	173 (2.60)
No	6474 (97.40)
History of non-ischemic heart disease
Yes	295 (4.43)
No	6352 (95.57)
History of hypertension
Yes	725 (10.90)
No	5922 (89.10)
History of hyperlipidemia
Yes	1161 (17.46)
No	5486 (82.54)
Family history of cardiovascular disease in male relatives (father, brother, son) aged under 55
Yes	566 (8.51)
No	6081 (91.49)
Family history of cardiovascular disease in female relatives (mother, sister, daughter) aged under 65
Yes	523 (7.86)
No	6124 (92.14)
Family history of diabetes in first-degree relatives
Yes	1731 (26.04)
No	4916 (73.96)
Goiter Status
Grade 1and 2	1764 (26.53)
No goiter	4883 (73.47)
Thyroid nodules
Yes	393 (5.91)
No	6254 (94.09)

a Data are presented as No. (%).

b The data contains missing values when the cell percentages do not sum up to 100%.

26 in total

1. K/DOQI clinical practice guidelines for chronic kidney disease: evaluation, classification, and stratification.

Authors:
Journal: Am J Kidney Dis Date: 2002-02 Impact factor: 8.860

2. Mining a clinical data warehouse to discover disease-finding associations using co-occurrence statistics.

Authors: Hui Cao; Marianthi Markatou; Genevieve B Melton; Michael F Chiang; George Hripcsak
Journal: AMIA Annu Symp Proc Date: 2005

3. Survival association rule mining towards type 2 diabetes risk assessment.

Authors: Gyorgy J Simon; John Schrom; M Regina Castro; Peter W Li; Pedro J Caraballo
Journal: AMIA Annu Symp Proc Date: 2013-11-16

4. Diabetes mellitus, fasting blood glucose concentration, and risk of vascular disease: a collaborative meta-analysis of 102 prospective studies.

Authors: N Sarwar; P Gao; S R Kondapally Seshasai; R Gobin; S Kaptoge; E Di Angelantonio; E Ingelsson; D A Lawlor; E Selvin; M Stampfer; C D A Stehouwer; S Lewington; L Pennells; A Thompson; N Sattar; I R White; K K Ray; J Danesh
Journal: Lancet Date: 2010-06-26 Impact factor: 202.731

5. Wrist circumference as a novel predictor of diabetes and prediabetes: results of cross-sectional and 8.8-year follow-up studies.

Authors: Younes Jahangiri Noudeh; Farzad Hadaegh; Nasibeh Vatankhah; Amir Abbas Momenan; Navid Saadat; Davood Khalili; Fereidoun Azizi
Journal: J Clin Endocrinol Metab Date: 2013-01-22 Impact factor: 5.958

6. Applying decision tree for identification of a low risk population for type 2 diabetes. Tehran Lipid and Glucose Study.

Authors: Azra Ramezankhani; Omid Pournik; Jamal Shahrabi; Davood Khalili; Fereidoun Azizi; Farzad Hadaegh
Journal: Diabetes Res Clin Pract Date: 2014-07-18 Impact factor: 5.602

7. Global estimates of the prevalence of diabetes for 2010 and 2030.

Authors: J E Shaw; R A Sicree; P Z Zimmet
Journal: Diabetes Res Clin Pract Date: 2009-11-06 Impact factor: 5.602

8. Population-based incidence of Type 2 diabetes and its associated risk factors: results from a six-year cohort study in Iran.

Authors: Hadi Harati; Farzad Hadaegh; Navid Saadat; Fereidoun Azizi
Journal: BMC Public Health Date: 2009-06-16 Impact factor: 3.295

9. High prevalence of undiagnosed diabetes and abnormal glucose tolerance in the Iranian urban population: Tehran Lipid and Glucose Study.

Authors: Farzad Hadaegh; Mohammad Reza Bozorgmanesh; Asghar Ghasemi; Hadi Harati; Navid Saadat; Fereidoun Azizi
Journal: BMC Public Health Date: 2008-05-24 Impact factor: 3.295

10. Association of depression and anxiety with diabetes mellitus type 2 concerning some sociological factors.

Authors: Maryam Palizgir; Maryam Bakhtiari; Alireza Esteghamati
Journal: Iran Red Crescent Med J Date: 2013-08-05 Impact factor: 0.611

5 in total

1. A survey on data mining techniques used in medicine.

Authors: Saba Maleki Birjandi; Seyed Hossein Khasteh
Journal: J Diabetes Metab Disord Date: 2021-08-31

Review 2. Machine Learning and Data Mining Methods in Diabetes Research.

Authors: Ioannis Kavakiotis; Olga Tsave; Athanasios Salifoglou; Nicos Maglaveras; Ioannis Vlahavas; Ioanna Chouvarda
Journal: Comput Struct Biotechnol J Date: 2017-01-08 Impact factor: 7.271

Review 3. Markers of insulin resistance in Polycystic ovary syndrome women: An update.

Authors: Chantal Anifa Amisi
Journal: World J Diabetes Date: 2022-03-15

4. Evaluation of rational nonsteroidal anti-inflammatory drugs and gastro-protective agents use; association rule data mining using outpatient prescription patterns.

Authors: Oraluck Pattanaprateep; Mark McEvoy; John Attia; Ammarin Thakkinstian
Journal: BMC Med Inform Decis Mak Date: 2017-07-04 Impact factor: 2.796

5. Using association rule mining to jointly detect clinical features and differentially expressed genes related to chronic inflammatory diseases.

Authors: Rosana Veroneze; Sâmia Cruz Tfaile Corbi; Bárbara Roque da Silva; Cristiane de S Rocha; Cláudia V Maurer-Morelli; Silvana Regina Perez Orrico; Joni A Cirelli; Fernando J Von Zuben; Raquel Mantuaneli Scarel-Caminaga
Journal: PLoS One Date: 2020-10-02 Impact factor: 3.240

5 in total