Mehdi Birjandi, Seyyed Mohammad Taghi Ayatollahi, Saeedeh Pourahmad.
Abstract
Tree-structured modeling is a data mining technique that recursively partitions a dataset into relatively homogeneous subgroups in order to predict class membership more accurately. GUIDE, one of the classification tree induction algorithms, is a nonparametric method with suitable accuracy and low selection bias that is used to predict binary classes from many predictors. In such a tree, the accuracy of the predicted classes (terminal nodes) is of special clinical importance. We therefore fitted GUIDE classification trees under equal and unequal misclassification costs to predict nonalcoholic fatty liver disease (NAFLD) from 30 predictors. To evaluate the accuracy of the predicted classes, we used the bootstrap method to assess first the classification reliability, in which individuals are assigned to a unique class, and then the prediction probability reliability that supports it.
Year: 2016 PMID: 28053651 PMCID: PMC5174753 DOI: 10.1155/2016/3874086
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Figure 1. GUIDE classification tree with estimated prior probabilities (the proportion of patients in each class) and equal misclassification costs for predicting NAFLD. At each intermediate node, an observation goes to the left branch if and only if the condition is satisfied. Dark nodes represent the predicted class “with NAFLD” and white nodes represent the predicted class “without NAFLD.” Each terminal node is drawn as a box with three parts. The left part gives the number of individuals in the training dataset placed in this node, together with the percentage of them who have NAFLD. The middle part gives the node number and, below it, the percentage of the test dataset placed in this node. The right part gives the number and percentage of test individuals in this node who are actually at high risk of NAFLD.
Figure 2. GUIDE classification tree with estimated prior probabilities (the proportion of patients in each class) and unequal misclassification costs for predicting NAFLD. At each intermediate node, an observation goes to the left branch if and only if the condition is satisfied. Dark nodes represent the predicted class “with NAFLD” and white nodes represent the predicted class “without NAFLD.” Each terminal node is drawn as a box with three parts. The left part gives the number of individuals in the training dataset placed in this node, together with the percentage of them who have NAFLD. The middle part gives the node number and, below it, the percentage of the test dataset placed in this node. The right part gives the number and percentage of test individuals in this node who are actually at high risk of NAFLD.
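The effect of unequal misclassification costs on a node's predicted class can be illustrated with the standard minimum-expected-cost rule. The following is a minimal sketch, not GUIDE's internal procedure: the function names, the cost values, and the 40% node proportion are all hypothetical, since the costs used in the study are not stated here.

```python
def cost_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which predicting "with NAFLD" has the lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)


def predict_with_nafld(p_nafld: float, cost_fp: float = 1.0, cost_fn: float = 1.0) -> bool:
    # Expected cost of predicting "without NAFLD" is p * cost_fn (missed cases);
    # expected cost of predicting "with NAFLD" is (1 - p) * cost_fp (false alarms).
    return p_nafld * cost_fn > (1.0 - p_nafld) * cost_fp


# Hypothetical terminal node in which 40% of individuals have NAFLD.
p = 0.40

# Equal costs: the threshold is 0.5, so the node is labeled "without NAFLD".
print(cost_threshold(1.0, 1.0), predict_with_nafld(p, 1.0, 1.0))

# Hypothetical unequal costs (a missed case costs twice a false alarm):
# the threshold drops to 1/3 and the same node is labeled "with NAFLD".
print(cost_threshold(1.0, 2.0), predict_with_nafld(p, 1.0, 2.0))
```

Under this rule, raising the cost of a missed case lowers the probability threshold, so more nodes are labeled “with NAFLD”. That trades specificity for sensitivity.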
Cross tabulation of observed and predicted NAFLD for the training and test samples, with evaluation measures, for the classification tree with equal misclassification costs.
| Observed | Training: predicted Yes | Training: predicted No | Training: total | Test: predicted Yes | Test: predicted No | Test: total |
|---|---|---|---|---|---|---|
| Yes | 157 (59%) | 111 (41%) | 268 | 43 (48%) | 47 (52%) | 90 |
| No | 58 (7%) | 794 (93%) | 852 | 42 (11%) | 348 (89%) | 390 |
| Total | 215 | 905 | 1120 | 85 | 395 | 480 |
| Overall accuracy | 85% | | | 81% | | |
Cross tabulation of observed and predicted NAFLD for the training and test samples, for the classification tree with unequal misclassification costs.
| Observed | Training: predicted Yes | Training: predicted No | Training: total | Test: predicted Yes | Test: predicted No | Test: total |
|---|---|---|---|---|---|---|
| Yes | 197 (74%) | 71 (26%) | 268 | 66 (73%) | 24 (27%) | 90 |
| No | 144 (17%) | 708 (83%) | 852 | 94 (24%) | 296 (76%) | 390 |
| Total | 341 | 779 | 1120 | 160 | 320 | 480 |
| Overall accuracy | 81% | | | 75% | | |
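The percentages in the two cross tabulations above can be recomputed from the cell counts. The sketch below uses the counts as tabulated. The helper `summarize` and the dictionary layout are illustrative, not part of the original analysis. Sensitivity and specificity are the row percentages for observed “Yes” and observed “No”, and accuracy is the share of correctly classified individuals.

```python
def summarize(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Rounded accuracy, sensitivity, and specificity (in percent) from a 2x2 table."""
    total = tp + fn + fp + tn
    return {
        "accuracy": round(100 * (tp + tn) / total),
        "sensitivity": round(100 * tp / (tp + fn)),
        "specificity": round(100 * tn / (tn + fp)),
    }


# Cell order: (observed Yes & predicted Yes, observed Yes & predicted No,
#              observed No & predicted Yes,  observed No & predicted No)
tables = {
    ("equal costs", "training"): (157, 111, 58, 794),
    ("equal costs", "test"): (43, 47, 42, 348),
    ("unequal costs", "training"): (197, 71, 144, 708),
    ("unequal costs", "test"): (66, 24, 94, 296),
}

results = {key: summarize(*cells) for key, cells in tables.items()}
for key, measures in results.items():
    print(key, measures)
```

On the test sample, moving from equal to unequal costs raises sensitivity from 48% to 73%, while specificity falls from 89% to 76% and overall accuracy from 81% to 75%.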
Results from bootstrapping the classification tree with equal costs: classification reliability and prediction reliability of the terminal nodes.
| Node | Information | Classification reliability | Prediction reliability |
|---|---|---|---|
| 2 | 0.003 | 100 | 1.75 |
| 13 | 0.339 | | 90.88 |
| 15 | 0.334 | 100 | 89.64 |
| 24 | 0.057 | 100 | 23.3 |
| 29 | 0.296 | 100 | 85.06 |
| 50 | 0.057 | 100 | 37.28 |
| 51 | 0.3 | | |
| 57 | 0.125 | 100 | 81.2 |
| 112 | 0.194 | 100 | 65.34 |
| 113 | 0.353 | | 93.13 |
Unreliable nodes (those with classification reliability less than 95 percent or prediction reliability more than 95 percent) are in bold font.
Results from bootstrapping the classification tree with unequal costs: classification reliability and prediction reliability of the terminal nodes.
| Node | Information | Classification reliability | Prediction reliability |
|---|---|---|---|
| 2 | 0.006 | 100 | 3.17 |
| 7 | 0.353 | | 91.55 |
| 13 | 0.281 | 100 | 81.92 |
| 25 | 0.153 | 100 | 58.67 |
| 48 | 0.075 | 100 | 30.82 |
| 98 | 0.041 | 100 | 36.83 |
| 99 | 0.168 | 100 | 34.29 |
Unreliable nodes (those with classification reliability less than 95 percent or prediction reliability more than 95 percent) are in bold font.
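The general idea of using the bootstrap to check how stable a terminal node's class is can be sketched as follows. This is a simplified stand-in and not the study's procedure: the exact definitions of classification and prediction reliability are not spelled out here, and a full analysis would presumably refit the tree on each bootstrap sample. The function `majority_stability` and the node counts are hypothetical. The sketch only resamples the individuals within one node and counts how often the majority class is unchanged.

```python
import random


def majority_stability(n_with: int, n_without: int, n_boot: int = 1000, seed: int = 0) -> float:
    """Percent of bootstrap resamples of one node that keep its majority class."""
    rng = random.Random(seed)
    labels = [1] * n_with + [0] * n_without  # 1 = with NAFLD, 0 = without NAFLD
    original_majority_with = n_with > n_without
    kept = 0
    for _ in range(n_boot):
        # Draw a resample of the same size, with replacement.
        resample = rng.choices(labels, k=len(labels))
        n_with_star = sum(resample)
        majority_with_star = n_with_star > len(labels) - n_with_star
        if majority_with_star == original_majority_with:
            kept += 1
    return 100.0 * kept / n_boot


# Hypothetical node counts, not taken from the paper.
print(majority_stability(90, 10))  # clear majority: the class rarely, if ever, flips
print(majority_stability(52, 48))  # near-even split: the class flips in many resamples
```

A node with a near-even class split changes its majority class in many resamples, which is the kind of instability that a reliability threshold such as 95 percent is meant to flag.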
Demographic and clinical characteristics of participants by group (number (%) or mean ± SD).
| Risk factors | Abbreviation | Level | Without NAFLD | With NAFLD |
|---|---|---|---|---|
| Sex | SEX | Male | 361 (29.1%) | 110 (30.7%) |
| | | Female | 880 (70.9%) | 249 (69.3%) |
| Marital status | MS | Single | 447 (36%) | 27 (7.5%) |
| | | Married | 726 (58.5%) | 297 (83%) |
| | | Other | 68 (5.5%) | 35 (9.5%) |
| History of hepatitis B vaccine | HEP | Yes | 538 (43.4%) | 70 (19.3%) |
| | | No | 703 (56.6%) | 289 (80.7%) |
| History of blood transfusion | BT | Yes | 22 (1.8%) | 11 (3.1%) |
| | | No | 1219 (98.2%) | 348 (96.9%) |
| Thalassemia | THAL | Yes | 2 (0.2%) | 1 (0.3%) |
| | | No | 1239 (99.8%) | 358 (99.7%) |
| Hemophilia | HEMO | Yes | 3 (0.2%) | 0 (0.0%) |
| | | No | 1238 (99.8%) | 359 (100%) |
| Dialysis | DI | Yes | 3 (0.2%) | 1 (0.3%) |
| | | No | 1238 (99.8%) | 358 (99.7%) |
| Surgery | SU | Yes | 3 (0.2%) | 1 (0.3%) |
| | | No | 1238 (99.8%) | 358 (99.7%) |
| History of surgery | HS | Yes | 356 (28.7%) | 141 (39.4%) |
| | | No | 885 (71.3%) | 218 (60.4%) |
| History of dental surgery | DE | Yes | 1002 (80.7%) | 303 (84.6%) |
| | | No | 239 (19.3%) | 56 (15.4%) |
| History of phlebotomy | PH | Yes | 94 (7.6%) | 35 (9.8%) |
| | | No | 1147 (92.4%) | 324 (90.2%) |
| Tattoos | TA | Yes | 38 (3.1%) | 19 (5.3%) |
| | | No | 1203 (96.9%) | 340 (94.7%) |
| History of unsanitary ear piercing | UPE | Yes | 541 (43.6%) | 141 (39.4%) |
| | | No | 700 (56.4%) | 218 (60.6%) |
| Hookah | HOO | Yes | 83 (6.7%) | 28 (7.8%) |
| | | No | 1158 (93.3%) | 331 (92.2%) |
| Current smoking | SMOK | Yes | 39 (3.1%) | 19 (5.3%) |
| | | No | 1202 (96.9%) | 340 (94.7%) |
| History of drug use | HDU | Yes | 28 (2.3%) | 6 (1.7%) |
| | | No | 1213 (97.7%) | 353 (98.3%) |
| HBS Ag | HBSAG | Negative | 1215 (98.1%) | 353 (98.5%) |
| | | Positive | 26 (1.9%) | 6 (1.5%) |
| HBS Ab | HBSAB | Negative | 1079 (88.5%) | 307 (87.0%) |
| | | Positive | 162 (11.5%) | 52 (13.0%) |
| Body mass index | BMI | Underweight (UW) | 197 (15.9%) | 1 (0.3%) |
| | | Normal (N) | 633 (51%) | 62 (17.3%) |
| | | Overweight (OW) | 320 (25.8%) | 186 (51.7%) |
| | | Obese (OB) | 87 (7%) | 110 (30.7%) |
| Waist-hip ratio | WHR | | 0.83 ± 0.09 | 0.92 ± 0.09 |
| Systolic blood pressure | SBP | | 100.05 ± 26.1 | 108.42 ± 31.86 |
| Diastolic blood pressure | DBP | | 82.14 ± 20.01 | 93.37 ± 23.85 |
| High density lipoprotein | HDL | | 50.95 ± 11.5 | 48.9 ± 9.73 |
| Triglycerides | TG | | 120.3 ± 68.52 | 193.89 ± 113.5 |
| Alanine aminotransferase | ALT | | 15.56 ± 10.92 | 19.11 ± 12.5 |
| Cholesterol | CHO | | 184.94 ± 42.58 | 207.62 ± 41.79 |
| Aspartate aminotransferase | AST | | 24.84 ± 11.66 | 28.06 ± 17.84 |
| Glucose | GLU | | 96.68 ± 26.86 | 108.45 ± 39.56 |
| Albumin | AL | | 4.32 ± 0.37 | 4.23 ± 0.4 |
| Age | AGE | | 34.85 ± 17.45 | 45.9 ± 13.34 |