
Comprehensive decision tree models in bioinformatics.

Gregor Stiglic, Simon Kocbek, Igor Pernek, Peter Kokol.

Abstract

PURPOSE: Classification is an important and widely used machine learning technique in bioinformatics. Researchers and other end-users of machine learning software often prefer to work with comprehensible models where knowledge extraction and explanation of reasoning behind the classification model are possible.
METHODS: This paper presents an extension to an existing machine learning environment and a study on visual tuning of decision tree classifiers. The motivation for this research comes from the need to build effective and easily interpretable decision tree models by a so-called one-button data mining approach where no parameter tuning is needed. To avoid bias in classification, no classification performance measure is used during the tuning of the model, which is constrained exclusively by the dimensions of the produced decision tree.
RESULTS: The proposed visual tuning of decision trees was evaluated on 40 datasets containing classical machine learning problems and 31 datasets from the field of bioinformatics. Although we did not expect significant differences in classification performance, the results demonstrate a significant increase in accuracy for the less complex, visually tuned decision trees. In contrast to the classical machine learning benchmarking datasets, we observed higher accuracy gains on the bioinformatics datasets. Additionally, a user study was carried out to confirm the assumption that tree tuning times are significantly lower for the proposed method in comparison to manual tuning of the decision tree.
CONCLUSIONS: The empirical results demonstrate that by building simple models constrained by predefined visual boundaries, one not only achieves good comprehensibility but also very good classification performance that does not differ from that of the usually more complex models built using the default settings of the classical decision tree algorithm. In addition, our study demonstrates the suitability of visually tuned decision trees for datasets with binary class attributes and a high number of possibly redundant attributes, which are very common in bioinformatics.


Year:  2012        PMID: 22479449      PMCID: PMC3316502          DOI: 10.1371/journal.pone.0033812

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Decision trees are one of the most popular classification techniques in data mining [1]. One of the main reasons for this is their ability to present results in a simple tree format that is easy for experts to interpret, as they can see the structure of decisions in the classification process. The basic idea of the decision tree format is to construct a tree whose leaves are labeled with a particular value of the class attribute and whose inner nodes represent descriptive attributes. Given an inner node N, the children of N correspond to the different possible values of the associated descriptive attribute. Once a decision tree is built, determining the class value for a new instance is achieved by following a path from the root to a leaf according to the values of the descriptive attributes of the instance; the class value assigned is the one labeling that leaf. Following this process, one can easily extract classification rules that can readily be expressed in a form humans understand. In addition to their simplicity, building decision trees is often less time-consuming than other classification techniques [2], and decision tree rules can be used directly as statements in a database access language (e.g. SQL).

Decision trees can be built with several different approaches, of which the most popular are C4.5 [3] and CART [4]. Due to their popularity, decision trees have been applied in different research fields including bioinformatics [5], [6], medicine [7] and image classification [8]. In addition, several commercial products use decision trees for knowledge discovery, predictive analysis and other purposes. For instance, KnowledgeSeeker [9] offers business intelligence software for customer analytics and marketing analytics.

From the knowledge discovery perspective, the ability to track and evaluate every step in the decision-making process is one of the most important factors for trusting the decisions gained from data mining methods. Decision trees possess an important advantage here in comparison with competing classification methods - i.e., the symbolic representation of the extracted knowledge. Together with rule-based classifiers, they represent a group of classifiers that perform classification by a sequence of simple, easy-to-understand tests whose semantics are intuitively clear to domain experts [10]. Although current state-of-the-art classifiers (e.g. Support Vector Machines [11]) or ensembles of classifiers (e.g. Random Forest [12] or Rotation Forest [13]) significantly outperform classical decision tree models in terms of classification accuracy or other classification performance metrics, they are not suitable for the knowledge discovery process. When decision trees are used in knowledge discovery, domain experts should usually be included in the analysis process. Therefore, in most cases the final decision trees will be presented to domain experts for evaluation of the extracted knowledge - i.e., the rules that can be derived from a decision tree. In such cases the complexity of decision trees, usually measured as the number of nodes or the number of rules that can be extracted from a tree, is of high importance and can influence the experts' evaluation of the discovered knowledge [14]. Decision tree complexity has therefore been studied in terms of reducing complexity while maintaining or improving accuracy.
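Determining the class value for a new instance, as described above, amounts to a simple walk from the root to a leaf. A minimal sketch (illustrative names, not code from the paper), assuming nominal attributes encoded as integer value indices:

```java
// Minimal sketch of decision tree classification: follow the attribute
// tests from the root until a leaf is reached, then return its class label.
class Node {
    int attribute;      // index of the descriptive attribute tested at this node
    Node[] children;    // one child per possible value of that attribute
    String classLabel;  // class value labeling this node (used only at leaves)

    boolean isLeaf() { return children == null; }

    // instance[i] holds the value index of attribute i for the new instance
    String classify(int[] instance) {
        Node node = this;
        while (!node.isLeaf()) {
            node = node.children[instance[node.attribute]];
        }
        return node.classLabel;
    }
}
```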
Bohanec and Bratko [15] studied the difference between pruning a decision tree for better approximation of the target concept and pruning it to make it practical for communication and understanding by the user. Their study focused on developing algorithms for obtaining the smallest pruned decision trees that represent concepts within some chosen accuracy. Oates and Jensen [16] studied the influence of database size on decision tree complexity and demonstrated that tree size strongly depends on training set size. Therefore, many approaches based on removing training instances prior to tree construction [17], [18], [19] could result in smaller trees simply because of the training set reduction.

Different visual representations of decision trees, such as classical node-link diagrams [20], [21], treemaps [22], [23], concentric circles [24], [25], and many others, have been proposed in the past. A major consideration in the evaluation of decision trees is also how efficiently they use screen space to communicate the tree information [26]. Through the application of decision trees in different fields of research, and their use in open source and commercial software for machine learning and data mining, it has been demonstrated that end-users still prefer node-link diagrams, although their space coverage is not optimal. Huysmans et al. [27] observe that most current research focuses on improving the accuracy or precision of predictive models, while comparatively little research has been undertaken to increase their comprehensibility to the analyst or end-user. They empirically investigated the suitability of decision tables, (binary) decision trees, propositional rules, and oblique rules in environments where interpretability of models is of high importance. The results showed that users preferred decision tables, followed by decision trees, to the other compared knowledge representations, although the authors admitted that only inexperienced users were included in the study. A multi-criteria approach to the evaluation of decision trees that also includes the size of the built trees was proposed by Osei-Bryson [28]. It aims to make the data mining process simpler for data mining project teams, especially when they have to evaluate a significant number of decision trees. The proposed approach uses three measures to evaluate the appropriateness of decision trees: stability, simplicity and discriminatory power. Simplicity (or, equivalently, complexity) is further divided into the number of rules that can be extracted from the tree and the average length of the extracted rules.

Due to the popularity of decision trees and the need to build simple trees with as little effort as possible, this paper proposes a novel method called Visual Tuning of Decision Trees (VTDT). The method helps data analysts build effective decision tree representations while spending less time on setting and tuning the parameters of the decision tree induction algorithm than with classical methods. From the analyst's perspective it is very important that the produced representation of the decision tree allows effective communication with end-users (i.e. customers) or, in research applications, with domain experts. In addition, from our own experience and that of our colleagues, we know that, although we live in a digital age, many experts in different domains still prefer to have the final decision tree printed out on a sheet of paper.
The result of the VTDT method is a decision tree that can be printed on a single page or displayed on a computer screen without the need for scrolling or zooming. It is also important to handle decision trees that would be over-pruned when using the default parameters of the decision tree induction method. One could also call this type of decision tree induction "one-button decision trees", as there is no longer any need to tune the parameters and build multiple decision trees.

Methods

This paper proposes an automated tuning process for the widely used C4.5 decision tree developed by Quinlan [3]. More precisely, it focuses on the C4.5 implementation in the Weka machine learning framework [29], where it is referred to as J48.

2.1 Tuning the Parameters

There are multiple settings that can influence the size of the generated decision tree. Two types of pruning are available - i.e., subtree replacement and subtree raising. In subtree raising, a node may be moved upwards towards the root of the tree during pruning, replacing other nodes along the way. Subtree raising is in general computationally more complex than subtree replacement, where nodes in a decision tree are simply replaced by leaves. Another setting influencing the pruning process is the confidence factor, a threshold on the allowed inherent error in the data while pruning the decision tree; lowering this threshold applies more pruning and consequently generates more general models. To obtain simpler models whose leaves contain a higher number of samples, it is possible to set the minimal number of objects in a single leaf; this setting can also be used in tuning to achieve simpler and smaller decision trees. The final setting that can be used to tune the visual appearance of the tree is binary splits, which forces nodes to split into only two branches instead of multiple ones. The default J48 decision tree in Weka uses pruning based on subtree raising, a confidence factor of 0.25, a minimal number of objects of 2, and allows multiway splits.

To allow automated tuning in Weka, a package called Visually Tuned J48 (VTJ48, available at http://ri.fzv.uni-mb.si/vtj48/) was developed during this study. All parameters mentioned in the previous paragraph are tuned automatically in VTJ48 to allow the so-called "one-button data mining". However, it is possible to change the default values for the dimensions of the resulting window that represent the boundaries of the VTJ48 decision tree. The default maximal dimensions of the decision tree are set to 1280×800 pixels, corresponding to the Widescreen eXtended Graphics Array (WXGA) video standard. The aspect ratio of this resolution is 16:10 (1.60) and comes very close to the aspect ratio of A4 paper (approx. 1.41); the chosen dimensions can also be displayed on most computer monitors in use today. Although it would have been possible to use the original Weka source code to display decision trees, some adaptations of the original visualization methods had to be made to use the space for nodes and leaves more efficiently. In comparison to the classical Weka decision tree visualization, we changed the shape of internal nodes to allow more space on both sides of the nodes, and we reduced the height of the trees by reducing the vertical distance between nodes by 50%.

Tuning of the parameters in VTJ48 is done using an adapted binary search in which the pruning confidence factor is optimized until its highest acceptable value is found. The boundaries for confidence factor optimization are set at 0 and 0.5 (the starting value in VTJ48 and the maximal allowed setting in J48). In cases where initial confidence factor tuning cannot build an acceptable decision tree, binary splits are turned on; this step usually significantly reduces the horizontal dimensions of the tuned decision tree, and confidence factor tuning is then repeated. In rare cases where binary splits are not enough, VTJ48 increases the minimal number of objects in leaves. This parameter (m) is increased from 2 until the resulting tree fits within the predefined boundaries.
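For reference, the settings above correspond to standard J48 options in Weka's API. A minimal sketch reproducing the listed defaults (our illustration, not part of the VTJ48 package; the ARFF file name is a placeholder):

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DefaultJ48 {
    public static void main(String[] args) throws Exception {
        // Placeholder dataset path; any ARFF file with a nominal class works.
        Instances data = new DataSource("diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setSubtreeRaising(true);    // default pruning: subtree raising
        tree.setConfidenceFactor(0.25f); // default confidence factor
        tree.setMinNumObj(2);            // default minimal objects per leaf
        tree.setBinarySplits(false);     // multiway splits allowed by default
        tree.buildClassifier(data);

        System.out.println(tree);                              // textual tree
        System.out.println("Leaves: " + tree.measureNumLeaves());
    }
}
```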
Figure 1

Pseudocode of decision tree reduction in Visually Tuned J48 (VTJ48).

In rare cases the default settings of the VTJ48 algorithm will produce an extremely small tree consisting of just one splitting node, or even none. Therefore, in cases of decision trees with only one or two leaves, an unpruned decision tree is used instead. With the confidence factor set to 0.5, such a tree will usually grow beyond the predefined boundaries; this time a linear hill-climbing approach is used to increase the minimal number of objects in leaves, because there is no need to tune the confidence factor in an unpruned decision tree.
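The tuning procedure described in this section can be summarized in code. The sketch below follows the steps as described (an adapted binary search on the confidence factor, then binary splits, then raising the minimal number of objects); the fitsBoundaries predicate, the search precision and the cap on m are our assumptions, since they depend on the adapted visualizer that the VTJ48 package ships with.

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;

// Sketch of the VTJ48 tuning loop as described in section 2.1, not the
// released package. fitsBoundaries() stands for rendering the tree with
// the adapted visualizer and checking it against the 1280x800 window.
public class VisualTunerSketch {

    static J48 tune(Instances data) throws Exception {
        J48 tree = searchConfidence(data, false, 2);    // step 1: tune CF only
        if (tree == null)
            tree = searchConfidence(data, true, 2);     // step 2: binary splits
        for (int m = 3; tree == null && m < data.numInstances(); m++)
            tree = searchConfidence(data, true, m);     // step 3: raise m
        // Fallback from the text: if the result has only one or two leaves,
        // rebuild unpruned and hill-climb minNumObj instead (not shown).
        return tree;                                    // may be null in this sketch
    }

    // Adapted binary search for the highest confidence factor in [0, 0.5]
    // whose pruned tree still fits inside the predefined visual boundaries.
    static J48 searchConfidence(Instances data, boolean binary, int minObj)
            throws Exception {
        J48 top = build(data, binary, minObj, 0.5f);    // VTJ48's starting value
        if (fitsBoundaries(top)) return top;
        double lo = 0.0, hi = 0.5;
        J48 accepted = null;
        for (int step = 0; step < 8; step++) {          // assumed precision budget
            float cf = (float) ((lo + hi) / 2);
            J48 tree = build(data, binary, minObj, cf);
            if (fitsBoundaries(tree)) { accepted = tree; lo = cf; }
            else                      { hi = cf; }
        }
        return accepted;
    }

    static J48 build(Instances data, boolean binary, int minObj, float cf)
            throws Exception {
        J48 tree = new J48();
        tree.setBinarySplits(binary);
        tree.setMinNumObj(minObj);
        tree.setConfidenceFactor(cf);
        tree.buildClassifier(data);
        return tree;
    }

    static boolean fitsBoundaries(J48 tree) {
        // Assumption: measure the rendered node-link diagram against the
        // 1280x800 limit; depends on the adapted visualizer, omitted here.
        throw new UnsupportedOperationException("rendering-dependent");
    }
}
```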

2.2 Experimental Settings

By reducing the size and complexity of decision trees to fit a predefined screen resolution or paper size, one would expect significantly lower classification accuracy, especially for initially very large decision trees. We used several different datasets to test this assumption.

2.2.1. UCI Datasets

Forty datasets from the UCI repository [30], retrieved from the Weka website, were used to evaluate the classification performance of the VTDTs. Basic information for all datasets, including the attribute information that can influence the size of a decision tree, is presented in Table 1.
Table 1

Basic information on 40 datasets from UCI repository used in this study including information about number of instances, attributes, classes, length of longest attribute name (LAN) and length of the longest nominal attribute value (LAV).

Dataset | Samples | Attributes | Nominal | Numeric | Classes | LAN | LAV
anneal | 898 | 39 | 33 | 6 | 6 | 22 | 5
anneal.orig | 898 | 39 | 33 | 6 | 6 | 22 | 5
arrhythmia | 452 | 280 | 74 | 206 | 16 | 28 | 2
audiology | 226 | 70 | 70 | 0 | 24 | 23 | 32
autos | 205 | 26 | 11 | 15 | 7 | 17 | 13
balance-scale | 625 | 5 | 1 | 4 | 3 | 14 | 1
breast-cancer | 286 | 10 | 10 | 0 | 2 | 11 | 20
breast-w | 699 | 10 | 1 | 9 | 2 | 21 | 9
colic | 368 | 23 | 16 | 7 | 2 | 27 | 29
colic.orig | 368 | 28 | 21 | 7 | 2 | 27 | 7
credit-a | 690 | 16 | 10 | 6 | 2 | 5 | 2
credit-g | 1000 | 21 | 14 | 7 | 2 | 22 | 30
diabetes | 768 | 9 | 1 | 8 | 2 | 5 | 15
ecoli | 336 | 8 | 1 | 7 | 8 | 5 | 3
glass | 214 | 10 | 1 | 9 | 7 | 4 | 20
heart-c | 303 | 14 | 8 | 6 | 5 | 8 | 21
heart-h | 294 | 14 | 8 | 6 | 5 | 10 | 21
heart-statlog | 270 | 14 | 1 | 13 | 2 | 36 | 7
hepatitis | 155 | 20 | 14 | 6 | 2 | 15 | 6
hypothyroid | 3772 | 30 | 23 | 7 | 4 | 25 | 23
ionosphere | 351 | 35 | 1 | 34 | 2 | 5 | 1
iris | 150 | 5 | 1 | 4 | 3 | 11 | 15
kr-vs-kp | 3196 | 37 | 37 | 0 | 2 | 5 | 5
labor | 57 | 17 | 9 | 8 | 2 | 30 | 13
letter | 20000 | 17 | 1 | 16 | 26 | 5 | 1
lymph | 148 | 19 | 16 | 3 | 4 | 15 | 12
mushroom | 8124 | 23 | 23 | 0 | 2 | 24 | 1
optdigits | 5620 | 65 | 1 | 64 | 10 | 7 | 1
pendigits | 10992 | 17 | 1 | 16 | 10 | 7 | 1
primary-tumor | 339 | 18 | 18 | 0 | 22 | 15 | 17
segment | 2310 | 20 | 1 | 19 | 7 | 20 | 9
sick | 3772 | 30 | 23 | 7 | 2 | 25 | 8
sonar | 208 | 61 | 1 | 60 | 2 | 12 | 4
soybean | 683 | 36 | 36 | 0 | 19 | 15 | 27
splice | 3190 | 62 | 62 | 0 | 3 | 13 | 24
vehicle | 846 | 19 | 1 | 18 | 4 | 25 | 4
vote | 435 | 17 | 17 | 0 | 2 | 38 | 10
vowel | 990 | 14 | 4 | 10 | 11 | 14 | 6
waveform-5000 | 5000 | 41 | 1 | 40 | 3 | 5 | 1
zoo | 101 | 18 | 17 | 1 | 7 | 8 | 12

2.2.2. Protein Solubility Datasets

In addition to the datasets from the UCI repository, we tested our method on datasets from the field of bioinformatics. Protein solubility is an important protein property, since low protein solubility can lead to several diseases [31] or affect the isolation of proteins from complex mixtures [32]. Several attempts to classify and predict protein solubility have been made [33]–[36]. To assess our method, we used the eSol database (available at http://tp-esol.genes.nig.ac.jp/), which includes information about the protein solubility of the entire ensemble of E. coli proteins. The database contains 1,625 proteins, of which 782 are insoluble and 843 are soluble. We calculated 21 feature datasets for these proteins, as shown in Table 2. These numeric features have been shown to be influential for protein solubility prediction in previous works, where:
Table 2

Feature datasets used in protein solubility classification.

# | Name | Size
1 | MonomersNatural | 20
2 | DimersNatural | 13
3 | TrimersNatural | 24
4 | MonomersHydro | 5
5 | TrimersHydro | 12
6 | MonomersConfSimi | 7
7 | DimersConfSimi | 20
8 | TrimersConfSimi | 15
9 | MonomersBlosum | 8
10 | DimersBlosum | 25
11 | MonomersClustEm14 | 14
12 | DimersClustEm14 | 16
13 | TrimersClustEm14 | 22
14 | MonomersClustEm17 | 17
15 | DimersClustEm17 | 27
16 | TrimersClustEm17 | 42
17 | MonomersPhysChem | 7
18 | DimersPhysChem | 21
19 | Computed | 4
20 | eSol | 22
21 | All Features | 342
the feature datasets 1–18 contain mono-, di- and tri-mers using 7 different alphabets; the feature dataset 19 contains 4 sequence-computed features, i.e., molecular weight, sequence length, isoelectric point and GRAVY index; the feature dataset 20 contains the features used in [33]; and the feature dataset 21 combines all features from the previous datasets.
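As an illustration of the simplest of these feature sets, a sketch (ours, not the authors' feature extraction code) computing a MonomersNatural-style vector, i.e. the relative frequency of each of the 20 natural amino acids in a protein sequence:

```java
// Sketch: relative frequency of the 20 natural amino acids in a protein
// sequence, the kind of 20-dimensional vector behind "MonomersNatural".
public class MonomerFeatures {
    static final String AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY";

    static double[] monomerFrequencies(String sequence) {
        double[] freq = new double[AMINO_ACIDS.length()];
        for (char c : sequence.toUpperCase().toCharArray()) {
            int i = AMINO_ACIDS.indexOf(c);
            if (i >= 0) freq[i]++;   // skip non-standard residue codes
        }
        for (int i = 0; i < freq.length; i++) freq[i] /= sequence.length();
        return freq;
    }
}
```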

2.2.3. Gene Expression Datasets

Comprehensible classifiers can provide important insight in gene expression analysis studies. In this study we used 9 Gene Expression Machine Learning Repository (GEMLeR) datasets [37]. Altogether, 1,545 samples are divided into the following groups by tumor type: breast (344 samples), colon (286), kidney (260), ovary (198), lung (126), uterus (124), omentum (77), prostate (69) and endometrium (61). The GEMLeR datasets used in this study were created by selecting one of the 9 groups of samples in a so-called one-versus-all binary classification setting. An unsupervised highest-variance filter was chosen to avoid the so-called "selection bias" when reducing the number of attributes, eliminating the measurements with extremely low variance. Samples consisting of the original 54,681 expression measurements from the Human Genome U133 Plus 2.0 Array GeneChip were reduced to 10,935 (20%) gene expression measurements that represent the attributes of the 9 datasets.
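A sketch of such an unsupervised filter under the stated setting (keep the 20% of attributes with the highest variance); the matrix layout and names here are ours, not GEMLeR's preprocessing code:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of an unsupervised highest-variance filter: rank attributes by
// sample variance and keep the top fraction (20% in the GEMLeR setting).
public class VarianceFilter {
    static int[] topVarianceAttributes(double[][] x, double fraction) {
        int n = x.length, d = x[0].length;
        final double[] variance = new double[d];
        for (int j = 0; j < d; j++) {
            double mean = 0;
            for (double[] row : x) mean += row[j];
            mean /= n;
            double sq = 0;
            for (double[] row : x) sq += (row[j] - mean) * (row[j] - mean);
            variance[j] = sq / (n - 1);
        }
        Integer[] order = new Integer[d];
        for (int j = 0; j < d; j++) order[j] = j;
        Arrays.sort(order,
                Comparator.comparingDouble((Integer j) -> variance[j]).reversed());
        int keep = (int) Math.round(d * fraction);
        int[] kept = new int[keep];
        for (int j = 0; j < keep; j++) kept[j] = order[j];
        return kept;   // indices of attributes to retain
    }
}
```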

2.2.4. Performance Evaluation

Different measures were observed for the J48 decision tree using default settings and for the VTJ48 decision tree on all datasets. Basic size-related measures, such as the width and height of the decision tree in pixels, the number of leaves and the number of nodes, were calculated for each decision tree on each dataset. Additionally, classification accuracy (ACC) and area under the ROC curve (AUC) were calculated using 20 runs of 10-fold cross-validation on all datasets to observe differences in classification performance.
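This protocol maps directly onto Weka's Evaluation class. A sketch (the dataset path is a placeholder; note that Weka reports AUC as a fraction in [0, 1]):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch of 20 runs of 10-fold cross-validation, averaging ACC and
// class-weighted AUC over the runs.
public class RepeatedCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("dataset.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        final int runs = 20, folds = 10;
        double acc = 0, auc = 0;
        for (int run = 0; run < runs; run++) {
            Evaluation eval = new Evaluation(data);
            // a different Random seed per run reshuffles the folds
            eval.crossValidateModel(new J48(), data, folds, new Random(run));
            acc += eval.pctCorrect();
            auc += eval.weightedAreaUnderROC();
        }
        System.out.printf("ACC = %.2f%%, AUC = %.3f%n", acc / runs, auc / runs);
    }
}
```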

Results

To evaluate the proposed method we compared the classification performance and size of the classical C4.5 trees (J48) with the visually tuned C4.5 trees (VTJ48). Initially, the tests were performed on 40 datasets from the well-known machine learning repository. In addition, the tests were done on two types of bioinformatics datasets where decision trees can be applied - i.e., 21 protein solubility datasets and 9 gene expression analysis datasets.

As expected, in most cases the original J48 decision tree vastly exceeded the predefined display resolution of 1280×800 pixels (Table 3 and Table 4). In some extreme cases the width of the decision tree exceeded the predefined dimension by more than 10-fold (letter, audiology, soybean). Decision trees of this size with a high number of classes are inappropriate for extraction of rules and presentation to the end-user. Altogether, in the UCI datasets evaluation, there are 30 datasets where VTJ48 optimized a decision tree by reducing the number of leaves to fit into the predefined dimensions. In 8 cases VTJ48 produced decision trees with more leaves than the original J48 method. The increase in tree size occurred in cases where only one or two leaves were produced using the default settings of J48; pruning was then automatically turned off in VTJ48, resulting in more complex decision trees. In the case of the protein solubility datasets, there were 20 datasets where the complexity of the tree was reduced and only one case where it increased. Similar changes in tree complexity were observed in the gene expression problems, where complexity increased in only 2 out of 9 datasets. When observing the complexity of the built decision trees, one should also note that VTJ48 starts the tuning process of the confidence factor at 0.5, whereas J48 starts at 0.25, resulting in more complex VTJ48 decision trees that still fit into the predefined visual boundaries.
Table 3

Comparison of decision tree dimensions on 40 UCI datasets including the number of leaves.

Dataset | Leaves J48 | Leaves VTJ48 | Width J48 | Width VTJ48 | Height J48 | Height VTJ48
anneal | 37.69 | 12.98 | 2753.62 | 2555.43 | 670.11 | 677.68
anneal.ORIG | 46.37 | 11.10 | 3426.41 | 1362.30 | 868.05 | 546.30
arrhythmia | 40.59 | 10.20 | 1679.34 | 1589.04 | 1555.57 | 1462.47
audiology | 30.25 | 9.11 | 3799.18 | 3781.91 | 923.98 | 921.00
autos | 45.25 | 12.77 | 6527.37 | 4199.37 | 654.02 | 637.58
balance-scale | 41.24 | 25.86 | 1986.12 | 1222.98 | 821.96 | 747.91
breast-cancer | 9.60 | 4.04 | 1177.92 | 1518.04 | 348.63 | 354.23
breast-w | 12.08 | 14.23 | 781.99 | 967.01 | 637.84 | 698.75
colic | 6.07 | 8.76 | 546.44 | 1198.39 | 360.41 | 424.61
colic.ORIG | 1.00 | 6.83 | 1.00 | 480.83 | 1.00 | 372.96
credit-a | 21.40 | 12.01 | 1664.81 | 1098.50 | 669.91 | 619.21
credit-g | 89.05 | 7.07 | 13906.86 | 1077.07 | 877.89 | 335.60
diabetes | 21.87 | 11.97 | 1488.65 | 963.30 | 830.31 | 694.87
ecoli | 18.70 | 17.78 | 1039.47 | 1055.72 | 735.37 | 723.22
glass | 23.73 | 12.23 | 2293.10 | 1838.96 | 827.88 | 754.41
heart-c | 26.05 | 8.85 | 3273.48 | 1399.74 | 618.49 | 476.47
heart-h | 7.21 | 8.17 | 673.28 | 1042.54 | 408.37 | 464.92
heart-statlog | 17.85 | 13.41 | 1577.55 | 1309.66 | 633.84 | 605.13
hepatitis | 9.24 | 12.41 | 522.66 | 754.78 | 571.69 | 659.98
hypothyroid | 14.39 | 13.43 | 1101.64 | 1070.54 | 756.02 | 771.15
ionosphere | 13.85 | 11.59 | 1070.98 | 1019.02 | 775.47 | 734.02
iris | 4.69 | 4.76 | 227.87 | 231.10 | 428.43 | 432.21
kr-vs-kp | 28.98 | 13.16 | 1187.64 | 1104.25 | 1091.28 | 1076.77
labor | 4.00 | 5.20 | 329.00 | 464.17 | 333.06 | 380.56
letter | 1165.00 | 12.65 | 63285.55 | 63344.28 | 1916.69 | 1919.54
lymph | 17.43 | 10.12 | 1863.97 | 1252.03 | 580.12 | 462.35
mushroom | 24.93 | 24.93 | 1022.25 | 1022.25 | 527.00 | 527.00
optdigits | 205.46 | 16.09 | 11154.56 | 11195.65 | 1330.36 | 1334.04
pendigits | 188.13 | 16.04 | 10719.41 | 10784.98 | 1297.69 | 1296.05
primary-tumor | 43.18 | 14.62 | 3794.33 | 1797.86 | 891.43 | 789.16
segment | 41.12 | 11.09 | 3749.02 | 3748.84 | 1084.95 | 1085.78
sick | 27.59 | 14.22 | 1763.57 | 1087.54 | 815.68 | 710.67
sonar | 14.71 | 13.80 | 1107.13 | 1089.68 | 665.59 | 659.63
soybean | 61.28 | 11.04 | 6175.62 | 6180.02 | 913.67 | 920.67
splice | 173.83 | 20.78 | 7537.58 | 6176.44 | 759.48 | 731.51
vehicle | 69.22 | 16.27 | 5069.70 | 4183.99 | 1168.31 | 1065.60
vote | 5.81 | 6.22 | 390.94 | 432.98 | 508.98 | 513.86
vowel | 126.41 | 10.58 | 11046.43 | 11045.28 | 985.60 | 986.01
waveform-5000 | 295.66 | 16.82 | 16325.97 | 13756.92 | 1494.51 | 1386.66
zoo | 8.31 | 8.31 | 436.69 | 436.69 | 567.50 | 567.50
Table 4

Comparison of decision tree dimensions on the protein feature datasets including the number of leaves.

Dataset | Leaves J48 | Leaves VTJ48 | Width J48 | Width VTJ48 | Height J48 | Height VTJ48
MonomersNatural | 91.73 | 13.08 | 6965.05 | 4779.43 | 1296.91 | 1147.77
DimersNatural | 54.05 | 12.93 | 3880.20 | 1807.83 | 1226.37 | 882.51
TrimersNatural | 15.57 | 11.03 | 1403.25 | 1338.10 | 798.08 | 784.64
MonomersHydro | 7.25 | 7.05 | 576.10 | 600.18 | 547.75 | 553.87
TrimersHydro | 41.54 | 11.38 | 3035.91 | 1498.45 | 1068.53 | 816.91
MonomersConfSimi | 16.55 | 11.91 | 1225.22 | 999.07 | 765.66 | 701.04
DimersConfSimi | 85.58 | 13.02 | 6518.23 | 3880.32 | 1256.65 | 1045.51
TrimersConfSimi | 37.49 | 11.02 | 2607.05 | 1251.43 | 1112.98 | 807.38
MonomersBlosum | 29.72 | 13.61 | 2270.26 | 1399.86 | 909.40 | 781.34
DimersBlosum | 94.21 | 13.27 | 7139.47 | 4640.00 | 1297.53 | 1129.44
MonomersClustEm14 | 68.44 | 13.46 | 5272.88 | 3006.28 | 1169.90 | 984.28
DimersClustEm14 | 51.66 | 11.20 | 4115.81 | 1974.13 | 1202.74 | 921.91
TrimersClustEm14 | 35.04 | 10.53 | 2808.01 | 1310.34 | 1245.19 | 845.54
MonomersClustEm17 | 84.87 | 12.90 | 6687.42 | 3637.08 | 1182.30 | 956.30
DimersClustEm17 | 117.36 | 10.26 | 7419.49 | 3430.75 | 1609.62 | 1158.41
TrimersClustEm17 | 88.52 | 10.06 | 6730.81 | 3020.26 | 1912.71 | 1221.96
MonomersPhysChem | 32.92 | 13.67 | 2655.87 | 1674.10 | 919.93 | 831.75
DimersPhysChem | 80.22 | 10.81 | 5927.14 | 2879.09 | 1356.17 | 1047.01
Computed | 7.99 | 8.51 | 724.04 | 779.70 | 516.48 | 534.92
eSol | 89.79 | 14.13 | 7611.00 | 4331.36 | 1124.45 | 976.64
All Features | 111.09 | 13.14 | 6949.88 | 4988.28 | 1744.36 | 1448.88

3.1 Classification Performance on UCI Data

Accuracy and AUC (Table 5) were used for the evaluation of classification performance, although, due to the high number of multiclass datasets, it is debatable whether accuracy is the right measure of classification performance. As suggested in [38], the Wilcoxon signed-ranks test was used to assess the statistical significance of differences in the performance and complexity of the decision trees. Comparing accuracy using the win/draw/lose record, one can observe that J48 wins on 21 datasets, while VTJ48 managed to outperform J48 on 16 datasets. Statistical significance testing shows that J48 significantly outperforms VTJ48 in accuracy (p = 0.022), while there is no significant difference in AUC (p = 0.766). As already mentioned, one should be cautious when interpreting these results, since accuracy is not a well-suited performance measure for unbalanced multi-class datasets. We therefore repeated the test using only the 16 binary class datasets and found that no statistically significant differences were present (p = 0.320). Table 3 demonstrates a large difference in decision tree size (number of leaves) between the J48 and VTJ48 decision trees.
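Any implementation of the paired Wilcoxon test can reproduce this kind of comparison. A sketch using Apache Commons Math (our choice; the paper does not name the statistical software), pairing the per-dataset accuracies of the two methods:

```java
import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

// Sketch: paired Wilcoxon signed-rank test over per-dataset accuracies.
// The three example pairs are taken from Table 5; a real test uses all 40.
public class PairedSignificance {
    public static void main(String[] args) {
        double[] j48Acc   = {98.64, 92.34, 65.88};   // ... one entry per dataset
        double[] vtj48Acc = {98.93, 81.34, 70.63};

        WilcoxonSignedRankTest test = new WilcoxonSignedRankTest();
        // false = normal approximation (needed for larger sample sizes)
        double p = test.wilcoxonSignedRankTest(j48Acc, vtj48Acc, false);
        System.out.println("p = " + p);
    }
}
```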
Table 5

Comparison of classification performance (20 runs of 10-fold cross-validation) on 40 UCI datasets.

Dataset | ACC J48 | ACC VTJ48 | AUC J48 | AUC VTJ48 | ΔACC (J48 - VTJ48) | ΔAUC (J48 - VTJ48)
anneal | 98.64±0.2 | 98.93±0.2 | 99.36±0.3 | 98.85±0.3 | −0.28 | 0.51
anneal.orig | 92.34±0.5 | 81.34±1 | 97.47±0.4 | 83.6±2.6 | 11.00 | 13.87
arrhythmia | 65.88±1.1 | 70.63±1 | 73.58±1.4 | 79.01±1 | −4.75 | −5.43
audiology | 77.3±1.4 | 66.31±3.9 | 92.31±0.6 | 91.78±1 | 11.00 | 0.53
autos | 82.59±2.6 | 64.07±2.4 | 91.45±1.1 | 82.42±2.4 | 18.51 | 9.04
balance-scale | 77.9±0.9 | 77.35±0.7 | 82.36±0.8 | 83.93±1.1 | 0.55 | −1.57
breast-cancer | 74.25±0.8 | 74.48±0.9 | 58.76±1.8 | 59.69±1.5 | −0.23 | −0.93
breast-w | 94.64±0.4 | 94.69±0.4 | 95.21±1 | 95.44±0.6 | −0.06 | −0.23
colic | 85.15±0.4 | 85.03±0.7 | 80.79±0.9 | 81.12±1.2 | 0.12 | −0.33
colic.orig | 66.3±0 | 65.33±1.7 | 48.55±0 | 70.31±1.5 | 0.98 | −21.76
credit-a | 85.83±0.7 | 86.24±0.7 | 88.49±0.7 | 89.18±0.8 | −0.41 | −0.70
credit-g | 71.03±0.8 | 71.85±0.6 | 64.46±1.2 | 70.96±0.6 | −0.82 | −6.50
diabetes | 74.29±1.1 | 74.52±1.1 | 75.31±1.3 | 74.6±1.4 | −0.23 | 0.71
ecoli | 82.96±1.2 | 82.62±1.1 | 90.63±0.8 | 91.03±0.6 | 0.34 | −0.40
glass | 67.17±2.5 | 67.78±2.2 | 80.13±2 | 80.97±1.3 | −0.61 | −0.85
heart-c | 76.85±1.6 | 76.2±1.6 | 77.24±2.4 | 77.73±2 | 0.64 | −0.49
heart-h | 78.33±1.1 | 78.4±1.2 | 75.22±1.5 | 77.53±1.8 | −0.07 | −2.31
heart-statlog | 77.83±1.7 | 78.56±2.1 | 77.49±2.6 | 77.91±2.2 | −0.72 | −0.42
hepatitis | 79.77±1.9 | 79.84±1.7 | 67.57±4.6 | 70.54±4.7 | −0.06 | −2.97
hypothyroid | 99.53±0 | 99.55±0 | 99.27±0.2 | 99.28±0.2 | −0.02 | −0.01
ionosphere | 89.9±1.1 | 89.93±1 | 88.95±1.7 | 88.11±1.4 | −0.03 | 0.83
iris | 94.7±0.9 | 94.7±0.9 | 95.73±0.7 | 95.76±0.8 | 0.00 | −0.03
kr-vs-kp | 99.39±0.1 | 97.35±0.1 | 99.81±0 | 99.41±0.1 | 2.03 | 0.40
labor | 80.09±3.1 | 82.28±3.1 | 72.05±4.7 | 75.89±4.4 | −2.19 | −3.85
letter | 88.02±0.2 | 29.42±0.6 | 95.4±0.1 | 88.77±0.2 | 58.60 | 6.63
lymph | 77.03±1.5 | 76.45±1.9 | 79.39±1.9 | 78.73±3 | 0.57 | 0.66
mushroom | 100±0 | 100±0 | 100±0 | 100±0 | 0.00 | 0.00
optdigits | 90.51±0.2 | 73.94±0.4 | 95.39±0.1 | 93.69±0.1 | 16.57 | 1.70
pendigits | 96.53±0.1 | 80.13±0.6 | 98.44±0.1 | 96.89±0.1 | 16.40 | 1.56
primary-tumor | 42.68±1.5 | 41.83±1 | 71.95±0.8 | 71.52±1.1 | 0.86 | 0.43
segment | 96.93±0.2 | 92±0.3 | 98.66±0.1 | 98.34±0.1 | 4.92 | 0.32
sick | 98.73±0.1 | 98.38±0.1 | 95.51±0.7 | 92.05±1.1 | 0.35 | 3.46
sonar | 72.07±3.1 | 72.26±2.6 | 73.58±3.3 | 73.13±3.2 | −0.19 | 0.44
soybean | 91.96±0.8 | 61.46±0.9 | 98.11±0.3 | 94.87±0.2 | 30.50 | 3.23
splice | 94.13±0.2 | 94.45±0.2 | 96.67±0.1 | 97.92±0.1 | −0.32 | −1.25
vehicle | 72.21±1.2 | 71.64±1 | 85.38±0.7 | 89.31±0.4 | 0.57 | −3.93
vote | 96.41±0.4 | 96.38±0.4 | 96.97±0.4 | 97.03±0.4 | 0.03 | −0.06
vowel | 80.11±1.3 | 43.39±1.1 | 92.34±0.6 | 87.78±0.5 | 36.72 | 4.56
waveform-5000 | 75.36±0.6 | 74.11±0.4 | 82.82±0.5 | 88.72±0.2 | 1.25 | −5.90
zoo | 92.23±0.4 | 92.23±0.4 | 97.67±0.1 | 97.67±0.1 | 0.00 | 0.00
Win/tie/loss (J48 vs VTJ48): ACC (21/3/16), AUC (17/2/21)

3.2 Classification Performance on Bioinformatics Data

Table 4 shows the average decision tree dimensions for the protein datasets, including the average number of leaves. It can be seen that the size was reduced on the majority of the feature datasets; the only exceptions are the DimersClustEm14 and TrimersHydro datasets, on which the tree size increased. Table 6 shows accuracy and AUC for the evaluation of classification performance on the protein datasets. Since all of these datasets present a binary classification problem, accuracy and AUC are more appropriate measures here than in the case of the UCI datasets. Again, the Wilcoxon signed-ranks test was used to assess the statistical significance of differences in the performance and complexity of the decision trees. Observing the accuracy win/draw/lose record, one can see that J48 wins on 5 datasets, while VTJ48 managed to outperform J48 on 15 datasets. The results are similar for AUC, where J48 wins on 5 datasets while VTJ48 wins on 16.
Table 6

Comparison of classification performance (20 runs of 10-fold cross-validation) on the protein datasets.

Dataset | ACC J48 | ACC VTJ48 | AUC J48 | AUC VTJ48 | ΔACC (J48 - VTJ48) | ΔAUC (J48 - VTJ48)
MonomersNatural | 70.76±0.9 | 72.41±0.7 | 70.58±1.3 | 76.41±0.6 | −1.66 | −5.83
DimersNatural | 62.58±1 | 61.94±0.9 | 64.65±1 | 64.45±1 | 0.64 | 0.20
TrimersNatural | 55.44±0.3 | 55.33±0.4 | 53.91±0.6 | 53.97±0.7 | 0.10 | −0.06
MonomersHydro | 64.64±0.9 | 64.58±0.9 | 68.1±0.8 | 68.08±0.8 | 0.06 | 0.02
TrimersHydro | 62.79±0.8 | 63.25±0.6 | 64.43±1 | 64.68±0.7 | −0.46 | −0.25
MonomersConfSimi | 66.75±0.9 | 66.79±0.9 | 71.97±0.9 | 71.96±0.8 | −0.03 | 0.01
DimersConfSimi | 64.68±1 | 66.51±0.6 | 63.35±1.1 | 68.82±0.9 | −1.83 | −5.48
TrimersConfSimi | 63.25±0.8 | 63.69±0.7 | 65.77±1.1 | 66.28±0.8 | −0.44 | −0.51
MonomersBlosum | 66.46±0.7 | 66.62±0.7 | 69.33±0.8 | 69.79±0.7 | −0.16 | −0.46
DimersBlosum | 66.32±1 | 69.27±0.8 | 65.65±1.2 | 73.3±0.8 | −2.95 | −7.66
MonomersClustEm14 | 70.07±1 | 71.13±0.8 | 70.73±1 | 74.19±0.6 | −1.06 | −3.45
DimersClustEm14 | 66.87±0.8 | 67.52±1 | 69.33±1 | 71.23±0.9 | −0.64 | −1.90
TrimersClustEm14 | 73.74±0.7 | 76.31±0.7 | 73.62±1 | 80.43±0.8 | −2.57 | −6.81
MonomersClustEm17 | 72.69±0.8 | 74.22±0.5 | 72.44±1.2 | 77.12±0.6 | −1.53 | −4.68
DimersClustEm17 | 63.88±1 | 65.06±0.9 | 63.68±1 | 67.51±0.9 | −1.18 | −3.83
TrimersClustEm17 | 62.35±1 | 61.37±1.2 | 62.92±1.2 | 62.57±1.3 | 0.98 | 0.35
MonomersPhysChem | 71.64±0.9 | 71.64±0.6 | 75.07±0.8 | 75.19±0.7 | 0.00 | −0.12
DimersPhysChem | 68.93±0.8 | 71.29±0.8 | 68.78±1.3 | 73.44±0.9 | −2.36 | −4.66
Computed | 74.92±0.5 | 74.75±0.6 | 79.2±0.6 | 79.41±0.6 | 0.17 | −0.21
eSol | 61.16±0.8 | 61.47±0.8 | 63.67±0.9 | 63.6±0.9 | −0.31 | 0.07
All Features | 72.19±1 | 75.87±0.8 | 71.63±1.4 | 81.21±0.6 | −3.68 | −9.57
Win/tie/loss (J48 vs VTJ48): ACC (5/1/15), AUC (5/0/16)
Table 7 shows the average decision tree dimensions for the 9 GEMLeR datasets, including the average number of leaves. In comparison to the protein and UCI datasets, it is evident that the gene expression problems do not create very large trees; therefore the reduction in size when VTJ48 is used is not as large.
Table 7

Comparison of decision tree dimensions on the GEMLeR datasets including the number of leaves.

Dataset | Leaves J48 | Leaves VTJ48 | Width J48 | Width VTJ48 | Height J48 | Height VTJ48
OVA_Breast | 21.60 | 13.50 | 1673.00 | 1199.40 | 728.80 | 609.00
OVA_Colon | 16.70 | 12.30 | 1608.30 | 1430.00 | 609.30 | 571.90
OVA_Endometrium | 13.20 | 13.00 | 1129.50 | 1151.40 | 616.80 | 616.80
OVA_Kidney | 11.50 | 11.10 | 1169.50 | 1117.90 | 542.00 | 549.50
OVA_Lung | 12.00 | 13.20 | 1053.40 | 1069.70 | 616.60 | 661.20
OVA_Omentum | 17.70 | 12.70 | 1291.30 | 1326.10 | 802.80 | 802.80
OVA_Ovary | 25.50 | 13.90 | 2148.40 | 1842.00 | 773.20 | 743.40
OVA_Prostate | 2.00 | 3.60 | 191.00 | 249.40 | 224.00 | 345.60
OVA_Uterus | 23.60 | 15.30 | 1883.20 | 1563.80 | 758.50 | 721.50
Table 8 shows accuracy and AUC for the evaluation of classification performance on the GEMLeR datasets and also demonstrates that performance actually increases when simpler (i.e., smaller) decision tree models are used.
Table 8

Comparison of classification performance (20 runs of 10-fold cross-validation) on the GEMLeR datasets.

Dataset | ACC J48 | ACC VTJ48 | AUC J48 | AUC VTJ48 | ΔACC (J48 - VTJ48) | ΔAUC (J48 - VTJ48)
OVA_Breast | 93.53±0.4 | 94.63±0.4 | 89.94±0.8 | 90.02±1 | −1.10 | −0.07
OVA_Colon | 96.31±0.4 | 96.7±0.3 | 92.39±1.2 | 91.76±1.3 | −0.39 | 0.62
OVA_Endometrium | 95.15±0.4 | 95.08±0.5 | 63.57±6.5 | 64.11±5.4 | 0.06 | −0.53
OVA_Kidney | 96.38±0.3 | 96.31±0.3 | 93.03±0.8 | 93.25±0.7 | 0.06 | −0.22
OVA_Lung | 97.35±0.2 | 97.28±0.3 | 90.12±1.7 | 89.87±1.4 | 0.06 | 0.25
OVA_Omentum | 93.98±0.5 | 94.43±0.4 | 54.82±5.9 | 67.99±7.9 | −0.45 | −13.16
OVA_Ovary | 92.23±0.6 | 92.62±0.6 | 79.21±2.2 | 81.84±2.2 | −0.39 | −2.63
OVA_Prostate | 99.68±0.1 | 99.61±0.1 | 97.02±1 | 98.69±0.8 | 0.06 | −1.67
OVA_Uterus | 92.17±0.4 | 92.43±0.3 | 73.16±3.5 | 70.22±3.2 | −0.26 | 2.93
Win/tie/loss (J48 vs VTJ48): ACC (4/0/5), AUC (3/0/6)
Statistical significance testing was done on all 20 cross-validation run results for the 30 bioinformatics datasets together using the Wilcoxon signed-rank test. For both accuracy (p = 0.002) and AUC (p = 0.001), the VTJ48 trees significantly outperformed the J48 trees. Although we did not expect such significant differences in favor of VTJ48, it is obvious that VTJ48 is well suited for datasets with binary class attributes and a high number of possibly redundant attributes.

3.3 Examples of Large Decision Trees

In this section we demonstrate two examples, each with two decision trees built on a single dataset. The first tree in each example is the result of J48 using default settings, and the second tree is the result of VTJ48. The dataset in the first example is the letter dataset from the UCI repository. This dataset contains 26 class values, which represent the 26 capital letters of the English alphabet. The character images were based on 20 different fonts, and each letter within these 20 fonts was randomly distorted to produce 20,000 instances with 16 attributes. Fig. 2 shows the original tree and the visually tuned tree. One can notice the extremely complex original decision tree, which is the result of the high number of classes. Since the visually tuned tree does not cover all the possible classes, it cannot achieve competitive classification accuracy.
Figure 2

Comparison of the original J48 decision tree (upper image) and visually tuned version from VTJ48 (lower image) on the letter dataset.

The second example presents two decision trees built on the protein solubility dataset with all features (Figure 3). In this case both accuracy and AUC improved significantly when the size of the decision tree model was reduced. This is possible due to the binary class attribute, which still allows effective trees that are much smaller than the original pruned J48 trees.
Figure 3

Comparison of the original J48 decision tree and visually tuned version from VTJ48 on the All Features dataset.

In addition, to demonstrate the most significant rules from both decision trees in Figure 3, we extracted the top 5 rules according to their support in the training set (Table 9). The numbers at the end of the mono-, di- and tri-mer attribute names (e.g. the number 34 in MonomersClustEm17_34) distinguish attributes inside different alphabets. It can be observed that J48 produces more complex rules with a higher number of conditions, which use attributes from more different alphabets. On the other hand, the top 5 rules from the VTJ48 tree cover many more samples (70.6%) than the top 5 rules derived from J48 (36.6%). It is evident that low error rates on the training set do not guarantee good classification performance on the test set. We can once again conclude that, in most cases, at least in the protein solubility domain, more complex trees result in overfitting to the training samples.
Table 9

Top 5 rules with the highest support in All Features extracted from J48 and VTJ48 decision trees.

Rule | Conditions | Support | Error
J48
IF Length <= 233 AND MonomersClustEm17_34 > 0.136 AND TrimersConfSimi_40 <= 0.002 AND TrimersClustEm17_98 <= 0.005 AND DimersConfSimi_19 <= 0.069 THEN Soluble | 5 | 228 | 1.32
IF Length > 233 AND DimersClustEm17_102 <= 0.069 AND MonomersNatural_0 > 0.047 AND Ip > 5.181 AND TrimersClustEm17_96 <= 0.002 AND Length > 251 AND MonomersBlosum_14 > 0.074 AND TrimersNatural_19 <= 0 AND MonomersNatural_1 > 0.039 THEN Insoluble | 9 | 218 | 0.92
IF Length > 233 AND DimersClustEm17_102 <= 0.069 AND MonomersNatural_0 <= 0.047 AND DimersClustEm14_70 <= 0.002 AND TrimersClustEm17_90 <= 0 AND DimersBlosum_40 <= 0.015 AND MonomersClustEm14_20 > 0.132 AND TrimersClustEm17_85 <= 0.003 THEN Soluble | 8 | 53 | 5.66
IF Length > 233 AND DimersClustEm17_102 > 0.069 AND DimersClustEm17_95 <= 0.0121 AND DimersClustEm14_62 > 0.004 AND MonomersConfSimi_8 > 0.076 AND MonomersBlosum_14 > 0.076 AND DimersClustEm14_65 <= 0.001 AND DimersClustEm14_100 <= 0.002 AND TrimersNatural_6 <= 0 AND TrimersClustEm14_46 <= 0.003 AND DimersNatural_5 <= 0.009 AND DimersClustEm14_71 <= 0.004 THEN Soluble | 11 | 49 | 2.04
IF Length <= 233 AND MonomersClustEm17_34 <= 0.136 AND MonomersBlosum_16 <= 0.173 AND DimersConfSimi_14 > 0.020 AND DimersClustEm14_59 <= 0.040 AND MonomersClustEm17_34 <= 0.113 AND TrimersNatural_0 <= 0.002 AND MonomersNatural_2 > 0.066 AND TrimersClustEm17_80 <= 0.002 AND DimersBlosum_58 <= 0.009 AND DimersBlosum_38 <= 0.032 AND MonomersClustEm14_22 > 0.022 AND DimersPhysChem_118 > 0.002 AND TrimersClustEm14_65 <= 0.005 THEN Insoluble | 14 | 47 | 2.13
VTJ48
IF Length > 233 AND DimersClustEm17_102 <= 0.069 AND MonomersNatural_0 > 0.047 THEN Insoluble | 3 | 593 | 17.54
IF Length <= 233 AND MonomersClustEm17_34 > 0.136 THEN Soluble | 2 | 287 | 5.23
IF Length <= 233 AND MonomersClustEm17_34 <= 0.136 AND MonomersBlosum_16 <= 0.173 AND DimersClustEm14_59 <= 0.040 AND MonomersBlosum_14 > 0.086 AND MonomersHydro_0 > 0.324 THEN Insoluble | 6 | 100 | 30.00
IF Length <= 233 AND MonomersClustEm17_34 <= 0.136 AND MonomersBlosum_16 <= 0.173 AND DimersClustEm14_59 <= 0.040 AND MonomersBlosum_14 <= 0.086 THEN Soluble | 5 | 99 | 21.21
IF Length <= 233 AND MonomersClustEm17_34 <= 0.136 AND MonomersBlosum_16 <= 0.173 AND DimersClustEm14_59 <= 0.040 AND MonomersBlosum_14 > 0.086 AND MonomersHydro_0 <= 0.324 AND MonomersClustEm17_34 > 0.110 THEN Soluble | 7 | 68 | 20.59

3.4 User Study

To test the effectiveness of the VTJ48 method in terms of usability, a Weka package implementing the visually constrained tree building algorithm was developed. An experiment was set up to compare the duration of building decision trees using the J48 and VTJ48 Weka packages in Weka Explorer. Three datasets from the UCI repository (balance-scale, credit-g, and splice) were chosen based on their complexity, where the need to tune the tree models is more likely. Fourteen master's students, all enrolled in a Bioinformatics program, were recruited to take part in the experiment. After a brief introduction to the VTJ48 Weka package, the participants were given the datasets and asked to build a comprehensible decision tree from each dataset using both the J48 and VTJ48 methods. Additionally, the participants were instructed to optimize each decision tree to fit on a single computer screen to allow optimal comprehensibility. In the case of the J48 classifier, this meant tuning the binary splits, minimal number of objects, and pruning parameters; in the case of VTJ48, it simply meant setting the desired resolution parameters. The duration from the start of the tree building process to the point when the decision tree was displayed on a single screen was recorded for further analysis. Figure 4 clearly shows that tree building times were shorter with the VTJ48 method on all datasets.
Figure 4

Comparison of durations for different datasets.

In order to test the statistical significance of the obtained results, the Wilcoxon signed-rank test was chosen to compare the distributions of tree building times and accuracies for the different tree building methods. Each test was assessed at the 0.05 significance level. The medians of the tree building times for the J48 and VTJ48 methods were significantly different for two datasets (balance-scale: p = 0.002, credit-g: p = 0.020). Tree building times were not significantly different for the splice dataset (p = 0.396); however, the mean tree building time was still 33.14 seconds shorter for the VTJ48 method.

Discussion

This study focused on the evaluation of decision tree performance when useful and comprehensible decision trees are needed. The evaluation was done on 30 datasets from two areas of bioinformatics: protein solubility classification and gene expression analysis. More precisely, strict boundaries for the width and height of the built decision trees were set to produce more comprehensible trees. It is important to note that the VTDT approach only helps the end-user tune the decision tree building parameters and does not propose a novel decision tree building algorithm. Although this paper presents automated visual tuning of C4.5 decision trees, it would be possible to adapt the VTDT principles to any other decision tree building algorithm that requires tuning of parameters to achieve optimal results. By tuning the parameters without interfering with the internal decision tree building process, and constraining the tuning only by the dimensions of the decision tree, the bias of influencing the classification performance is avoided.

The results of our study confirmed that there is no statistically significant difference in predictive performance between the decision trees built using default values and the ones built using the proposed process of visual tuning. Moreover, when AUC is observed, the visually tuned models, which are usually also much simpler than the large default models, performed better on the majority of datasets. This is especially true for most of the protein and gene expression datasets, where the performance improvements were significant. However, it has to be noted that a larger sample of datasets would be needed to draw more reliable conclusions. Based on these results, one could conclude that simpler models usually produce at least comparable results, if not better. This has also been shown in many other studies related to Occam's razor [39], [40]. However, there are also studies that demonstrate the contrary - i.e., that growing the trees will improve the classification performance [41]. In the end, it all depends on how a simple model is defined. In the case of VTDTs, we can state that if the model is simple enough (i.e., fits into our predefined visual boundaries), it will produce results as good as or better than most of the more complex models.

Unfortunately, the proposed decision tree tuning suffers from higher time complexity in comparison to a classical decision tree that is built only once. However, as shown by the user study, it still saves a lot of time in comparison to manual tuning and fitting of the decision tree to the desired dimensions. In this paper we evaluated the visual tuning strategy only on C4.5 decision trees. From the research and practical usability points of view, it would be important to extend this study and consequently develop a Weka package that would allow simultaneous tuning of different decision tree models (e.g. CART [4]). Visual tuning could also be applied to comprehensible ensembles of classifiers or to variations of decision tree models (e.g., alternating decision trees [42]) that combine boosted decision stumps in a structure where visual constraints could be beneficial for the end-user in different areas of bioinformatics.
References

1. Rodríguez JJ, Kuncheva LI, Alonso CJ. Rotation forest: a new classifier ensemble method. IEEE Trans Pattern Anal Mach Intell. 2006.

2. Niwa T, Ying BW, Saito K, Jin W, Takada S, Ueda T, Taguchi H. Bimodal protein solubility distribution revealed by an aggregation analysis of the entire ensemble of Escherichia coli proteins. Proc Natl Acad Sci U S A. 2009.

3. Serrano-Aguilar P, Abreu R, Antón-Canalís L, Guerra-Artal C, Ramallo-Fariña Y, Gómez-Ulla F, Nadal J. Development and validation of a computer-aided diagnostic tool to screen for age-related macular degeneration by optical coherence tomography. Br J Ophthalmol. 2011.

4. Stiglic G, Kokol P. Stability of ranked gene lists in large microarray analysis studies. J Biomed Biotechnol. 2010.
