Literature DB >> 26120265

Decision tree methods: applications for classification and prediction.

Abstract

Decision tree methodology is a commonly used data mining method for establishing classification systems based on multiple covariates or for developing prediction algorithms for a target variable. This method classifies a population into branch-like segments that construct an inverted tree with a root node, internal nodes, and leaf nodes. The algorithm is non-parametric and can efficiently deal with large, complicated datasets without imposing a complicated parametric structure. When the sample size is large enough, study data can be divided into training and validation datasets. Using the training dataset to build a decision tree model and a validation dataset to decide on the appropriate tree size needed to achieve the optimal final model. This paper introduces frequently used algorithms used to develop decision trees (including CART, C4.5, CHAID, and QUEST) and describes the SPSS and SAS programs that can be used to visualize tree structure.

Entities: Disease Gene Species

Keywords: classification; data mining; decision tree; prediction

Year: 2015 PMID： 26120265 PMCID： PMC4466856 DOI： 10.11919/j.issn.1002-0829.215044

Source DB: PubMed Journal: Shanghai Arch Psychiatry ISSN： 1002-0829

Introduction

Data mining is used to extract useful information from large datasets and to display it in easy-to-interpret visualizations. First introduced in 1960’s, decision trees are one of the most effective methods for data mining; they have been widely used in several disciplines [1] because they are easy to be used, free of ambiguity, and robust even in the presence of missing values. Both discrete and continuous variables can be used either as target variables or independent variables. More recently, decision tree methodology has become popular in medical research. An example of the medical use of decision trees is in the diagnosis of a medical condition from the pattern of symptoms, in which the classes defined by the decision tree could either be different clinical subtypes or a condition, or patients with a condition who should receive different therapies. [2] Common usages of decision tree models include the following: • Variable selection. The number of variables that are routinely monitored in clinical settings has increased dramatically with the introduction of electronic data storage. Many of these variables are of marginal relevance and, thus, should probably not be included in data mining exercises. Like stepwise variable selection in regression analysis, decision tree methods can be used to select the most relevant input variables that should be used to form decision tree models, which can subsequently be used to formulate clinical hypotheses and inform subsequent research. • Assessing the relative importance of variables. Once a set of relevant variables is identified, researchers may want to know which variables play major roles. Generally, variable importance is computed based on the reduction of model accuracy (or in the purities of nodes in the tree) when the variable is removed. In most circumstances the more records a variable have an effect on, the greater the importance of the variable. • Handling of missing values. A common - but incorrect - method of handling missing data is to exclude cases with missing values; this is both inefficient and runs the risk of introducing bias in the analysis. Decision tree analysis can deal with missing data in two ways: it can either classify missing values as a separate category that can be analyzed with the other categories or use a built decision tree model which set the variable with lots of missing value as a target variable to make prediction and replace these missing ones with the predicted value. • Prediction. This is one of the most important usages of decision tree models. Using the tree model derived from historical data, it’s easy to predict the result for future records. • Data manipulation. Too many categories of one categorical variable or heavily skewed continuous data are common in medical research. In these circumstances, decision tree models can help in deciding how to best collapse categorical variables into a more manageable number of categories or how to subdivide heavily skewed variables into ranges.

Basic concepts

Figure 1 illustrates a simple decision tree model that includes a single binary target variable Y (0 or 1) and two continuous variables, x1 and x2, that range from 0 to 1. The main components of a decision tree model are nodes and branches and the most important steps in building a model are splitting, stopping, and pruning. Nodes. There are three types of nodes. (a) A root node, also called a decision node, represents a choice that will result in the subdivision of all records into two or more mutually exclusive subsets. (b) Internal nodes, also called chance nodes, represent one of the possible choices available at that point in the tree structure; the top edge of the node is connected to its parent node and the bottom edge is connected to its child nodes or leaf nodes. (c) Leaf nodes, also called end nodes, represent the final result of a combination of decisions or events. Branches. Branches represent chance outcomes or occurrences that emanate from root nodes and internal nodes. A decision tree model is formed using a hierarchy of branches. Each path from the root node through internal nodes to a leaf node represents a classification decision rule. These decision tree pathways can also be represented as ‘if-then’ rules. For example, “if condition 1 and condition 2 and condition … and condition k occur, then outcome j occurs. ” Splitting. Only input variables related to the target variable are used to split parent nodes into purer child nodes of the target variable. Both discrete input variables and continuous input variables (which are collapsed into two or more categories) can be used. When building the model one must first identify the most important input variables, and then split records at the root node and at subsequent internal nodes into two or more categories or ‘bins’ based on the status of these variables. Characteristics that are related to the degree of ‘purity’ of the resultant child nodes (i. e. , the proportion with the target condition) are used to choose between different potential input variables; these characteristics include entropy, Gini index, classification error, information gain, gain ratio, and twoing criteria. [3] This splitting procedure continues until pre-determined homogeneity or stopping criteria are met. In most cases, not all potential input variables will be used to build the decision tree model and in some cases a specific input variable may be used multiple times at different levels of the decision tree. Stopping. Complexity and robustness are competing characteristics of models that need to be simultaneously considered whenever building a statistical model. The more complex a model is, the less reliable it will be when used to predict future records. An extreme situation is to build a very complex decision tree model that spreads wide enough to make the records in each leaf node 100% pure (i. e. , all records have the target outcome). Such a decision tree would be overly fitted to the existing observations and have few records in each leaf, so it could not reliably predict future cases and, thus, would have poor generalizability (i. e. , lack robustness). To prevent this from happening, stopping rules must be applied when building a decision tree to prevent the model from becoming overly complex. Common parameters used in stopping rules include: (a) the minimum number of records in a leaf; (b) the minimum number of records in a node prior to splitting; and (c) the depth (i. e. , number of steps) of any leaf from the root node. Stopping parameters must be selected based on the goal of the analysis and the characteristics of the dataset being used. As a rule-of-thumb, Berry and Linoff [4] recommend avoiding overfitting and underfitting by setting the target proportion of records in a leaf node to be between 0. 25 and 1. 00% of the full training data set. Pruning. In some situations, stopping rules do not work well. An alternative way to build a decision tree model is to grow a large tree first, and then prune it to optimal size by removing nodes that provide less additional information. [5] A common method of selecting the best possible sub-tree from several candidates is to consider the proportion of records with error prediction (i. e. , the proportion in which the predicted occurrence of the target is incorrect). Other methods of selecting the best alternative is to use a validation dataset (i. e. , dividing the sample in two and testing the model developed on the training dataset on the validation dataset), or, for small samples, crossvalidation (i. e. , dividing the sample in 10 groups or ‘folds’, and testing the model developed from 9 folds on the 10th fold, repeated for all ten combinations, and averaging the rates or erroneous predictions). There are two types of pruning, pre-pruning (forward pruning) and post-pruning (backward pruning). Pre-pruning uses Chi-square tests [6] or multiple-comparison adjustment methods to prevent the generation of non-significant branches. Post-pruning is used after generating a full decision tree to remove branches in a manner that improves the accuracy of the overall classification when applied to the validation dataset. Decision trees can also be illustrated as segmented space, as shown in Figure 2. The sample space is subdivided into mutually exclusive (and collectively exhaustive) segments, where each segment corresponds to a leaf node (that is, the final outcome of the serial decision rules). Each record is allocated to a single segment (leaf node). Decision tree analysis aims to identify the best model for subdividing all records into different segments.

Available algorithms and software packages for building decision tree models

Several statistical algorithms for building decision trees are available, including CART (Classification and Regression Trees), [7] C4. 5, [8] CHAID (Chi-Squared Automatic Interaction Detection), [9] and QUEST (Quick, Unbiased, Efficient, Statistical Tree). [10] Table 1 provides a brief comparison of the four most widely used decision tree methods. [11,12] Comparison of different decision tree algorithms Decision trees based on these algorithms can be constructed using data mining software that is included in widely available statistical software packages. For example, there is one decision tree dialogue box in SAS Enterprise Miner [13] which incorporates all four algorithms; the dialogue box requires the user to specify several parameters of the desired model. The IBM SPSS Modeler [14] software package is more user-friendly; it includes four separate dialog boxes, one for each of four algorithms (it uses C5. 0, [15] an upgraded version of C4. 5). Based on the desired method of selecting input variables, the user goes to the dialog box for the corresponding algorithm (i. e. , using the following steps: Analyze menu ==> Classify ==> Tree==>select algorithm of choice). For example, the SPSS syntax associated with the CART algorithm dialog box [16] would be as follows: tree y [n] by x1 [s] x2 [c] x3 [o] /tree display=topdown nodes=statistics branchstatistics=yes nodedefs=yes scale=auto /depcategories usevalues=[valid] /print modelsummary classification risk /method type=crt maxsurrogates=auto prune=none /growthlimit maxdepth=auto minparentsize=100 minchildsize=50 /validation type=none output=bothsamples /crt impurity=gini minimprovement=0. 0001 /costs equal /priors fromdata adjust=no. NOTE: [n], [s], [c], [o] indicate the variables are nominal, scale, categorical and ordinal.

Example

We use the analysis of risk factors related to major depressive disorder (MDD) in a four-year cohort study [17] to illustrate the building of a decision tree model. The goal of the analysis was to identify the most important risk factors from a pool of 17 potential risk factors, including gender, age, smoking, hypertension, education, employment, life events, and so forth. The decision tree model generated from the dataset is shown in Figure 3. All individuals were divided into 28 subgroups from root node to leaf nodes through different branches. The risk of having depressive disorder varied from 0 to 38%. For example, only 2% of the non-smokers at baseline had MDD four years later, but 17. 2% of the male smokers, who had a score of 2 or 3 on the Goldberg depression scale and who did not have a fulltime job at baseline had MDD at the 4-year follow-up evaluation. By using this type of decision tree model, researchers can identify the combinations of factors that constitute the highest (or lowest) risk for a condition of interest.

Discussion

The decision tree method is a powerful statistical tool for classification, prediction, interpretation, and data manipulation that has several potential applications in medical research. Using decision tree models to describe research findings has the following advantages: • Simplifies complex relationships between input variables and target variables by dividing original input variables into significant subgroups. • Easy to understand and interpret. • Non-parametric approach without distributional assumptions. • Easy to handle missing values without needing to resort to imputation. • Easy to handle heavy skewed data without needing to resort to data transformation. • Robust to outliers. As with all analytic methods, there are also limitations of the decision tree method that users must be aware of. The main disadvantage is that it can be subject to overfitting and underfitting, particularly when using a small data set. This problem can limit the generalizability and robustness of the resultant models. Another potential problem is that strong correlation between different potential input variables may result in the selection of variables that improve the model statistics but are not causally related to the outcome of interest. Thus, one must be cautious when interpreting decision tree models and when using the results of these models to develop causal hypotheses. Lohand and Strobl [18,19] provided a comprehensive review of the statistical literature of classification tree methods that may be useful for readers who want to learn more about the statistical theories behind the decision tree method. There are several further applications of decision tree models that have not been considered in this brief overview. We have described decision tree models that use binary or continuous target variables; several authors have developed other decision tree methods to be employed when the endpoint is the prediction of survival. [20-27] Our discussion was limited to cases in which the selection of input variables was based on statistical properties, but in the real world selection of input variables may be based on the relative cost of collecting the variables or on the clinical meaningfulness of the variables; Jin and colleagues [28,29] introduced an alternative classification tree method that allows for the selection of input variables based on a combination of preference (e. g. , based on cost) and non-inferiority to the statistically optimal split. Another extension of the decision tree method is to develop a decision tree that identifies subgroups of patients who should have different diagnostic tests or treatment strategies to achieve optimal medical outcomes. [30]

Table 1.

Comparison of different decision tree algorithms

Methods	CART	C4. 5	CHAID	QUEST
Measure used to select input variable	Gini index; Twoing criteria	Entropy info-gain	Chi-square	Chi-square for categorical variables; J-way ANOVA for continuous/ordinal variables
Pruning	Pre-pruning using a single-pass algorithm	Pre-pruning using a single-pass algorithm	Pre-pruning using Chi-square test for independence	Post-pruning
Dependent variable	Categorical/ Continuous	Categorical/ Continuous	Categorica	Categorical
Input variables	Categorical/ Continuous	Categorical/ Continuous	Categorical/ Continuous	Categorical/ Continuous
Split at each node	Binary; Split on linear combinations	Multiple	Multiple	Binary; Split on linear combinations

11 in total

1. Residual-based tree-structured survival analysis.

Authors: Sündüz Keleş; Mark R Segal
Journal: Stat Med Date: 2002-01-30 Impact factor: 2.373

2. Classification algorithms for hip fracture prediction based on recursive partitioning methods.

Authors: Hua Jin; Ying Lu; Steven T Harris; Dennis M Black; Katie Stone; Marc C Hochberg; Harry K Genant
Journal: Med Decis Making Date: 2004 Jul-Aug Impact factor: 2.583

3. Alternative tree-structured survival analysis based on variance of survival time.

Authors: Hua Jin; Ying Lu; Kaite Stone; Dennis M Black
Journal: Med Decis Making Date: 2004 Nov-Dec Impact factor: 2.583

4. Relative risk trees for censored survival data.

Authors: M LeBlanc; J Crowley
Journal: Biometrics Date: 1992-06 Impact factor: 2.571

5. A procedure for determining whether a simple combination of diagnostic tests may be noninferior to the theoretical optimum combination.

Authors: Hua Jin; Ying Lu
Journal: Med Decis Making Date: 2008-06-12 Impact factor: 2.583

6. Features of tree-structured survival analysis.

Authors: M R Segal
Journal: Epidemiology Date: 1997-07 Impact factor: 4.822

7. Comments on Fifty Years of Classification and Regression Trees.

Authors: Chi Song; Heping Zhang
Journal: Int Stat Rev Date: 2014-12-01 Impact factor: 2.217

8. A comparison of estimated proportional hazards models and regression trees.

Authors: M R Segal; D A Bloch
Journal: Stat Med Date: 1989-05 Impact factor: 2.373

9. Tree-structured subgroup analysis of receiver operating characteristic curves for diagnostic tests.

Authors: Caixia Li; Claus-C Glüer; Richard Eastell; Dieter Felsenberg; David M Reid; Christian Roux; Ying Lu
Journal: Acad Radiol Date: 2012-12 Impact factor: 3.173

10. Modifiable risk factors predicting major depressive disorder at four year follow-up: a decision tree approach.

Authors: Philip J Batterham; Helen Christensen; Andrew J Mackinnon
Journal: BMC Psychiatry Date: 2009-11-22 Impact factor: 3.630

103 in total

1. Statistical models of fecal coliform levels in Pacific Northwest estuaries for improved shellfish harvest area closure decision making.

Authors: Amity G Zimmer-Faust; Cheryl A Brown; Alex Manderson
Journal: Mar Pollut Bull Date: 2018-10-22 Impact factor: 5.553

2. Gender-specific associations between polymorphisms in the Toll-like receptor (TLR) genes and lung function among workers in swine operations.

Authors: Zhiwei Gao; James A Dosman; Donna C Rennie; David A Schwartz; Ivana V Yang; Jeremy Beach; Ambikaipakan Senthilselvan
Journal: J Toxicol Environ Health A Date: 2018-11-12

3. PolTree: A Data Structure for Making Efficient Access Decisions in ABAC.

Authors: Ronit Nath; Saptarshi Das; Shamik Sural; Jaideep Vaidya; Vijay Atluri
Journal: Proc ACM Symp Access Control Model Technol Date: 2019-06

4. Analysis of Consciousness Level Using Galvanic Skin Response during Therapeutic Effect.

Authors: Çiğdem Gülüzar Altıntop; Fatma Latifoğlu; Aynur Karayol Akın; Ramis İleri; Mehmet Akif Yazar
Journal: J Med Syst Date: 2020-11-24 Impact factor: 4.460

5. Toward the Standardization of Mycological Examination of Sputum Samples in Cystic Fibrosis: Results from a French Multicenter Prospective Study.

Authors: Noémie Coron; Marc Pihet; Emilie Fréalle; Yolande Lemeille; Claudine Pinel; Hervé Pelloux; Gilles Gargala; Loic Favennec; Isabelle Accoceberry; Isabelle Durand-Joly; Frédéric Dalle; Frédéric Huet; Annlyse Fanton; Amale Boldron; Guy-André Loeuille; Philippe Domblides; Bérengère Coltey; Isabelle Pin; Catherine Llerena; Françoise Troussier; Christine Person; Christophe Marguet; Nathalie Wizla; Caroline Thumerelle; Dominique Turck; Stéphanie Bui; Michael Fayon; Alain Duhamel; Anne Prévotat; Benoit Wallaert; Sylvie Leroy; Jean-Philippe Bouchara; Laurence Delhaes
Journal: Mycopathologia Date: 2017-07-26 Impact factor: 2.574

6. Differentiating Post-Polio Syndrome from Myalgic Encephalomyelitis and Chronic Fatigue Syndrome.

Authors: Lauren Klebek; Madison Sunnquist; Leonard A Jason
Journal: Fatigue Date: 2019-11-06

7. Knockoff boosted tree for model-free variable selection.

Authors: Tao Jiang; Yuanyuan Li; Alison A Motsinger-Reif
Journal: Bioinformatics Date: 2021-05-17 Impact factor: 6.937

8. Preventing Hospital Readmissions: Healthcare Providers' Perspectives on "Impactibility" Beyond EHR 30-Day Readmission Risk Prediction.

Authors: Natalie Flaks-Manov; Einav Srulovici; Rina Yahalom; Henia Perry-Mezre; Ran Balicer; Efrat Shadmi
Journal: J Gen Intern Med Date: 2020-03-05 Impact factor: 5.128

Review 9. Machine learning and statistical methods for predicting mortality in heart failure.

Authors: Dineo Mpanya; Turgay Celik; Eric Klug; Hopewell Ntsinjana
Journal: Heart Fail Rev Date: 2020-11-09 Impact factor: 4.214

10. Effect of a Real-Time Risk Score on 30-day Readmission Reduction in Singapore.

Authors: Christine Xia Wu; Ernest Suresh; Francis Wei Loong Phng; Kai Pik Tai; Janthorn Pakdeethai; Jared Louis Andre D'Souza; Woan Shin Tan; Phillip Phan; Kelvin Sin Min Lew; Gamaliel Yu-Heng Tan; Gerald Seng Wee Chua; Chi Hong Hwang
Journal: Appl Clin Inform Date: 2021-05-19 Impact factor: 2.342