Building more accurate decision trees with the additive tree.

José Marcio Luna, Efstathios D. Gennatas, Lyle H. Ungar, Eric Eaton, Eric S. Diffenderfer, Shane T. Jensen, Charles B. Simone, Jerome H. Friedman, Timothy D. Solberg, Gilmer Valdes

Abstract

The expansion of machine learning to high-stakes application domains such as medicine, finance, and criminal justice, where making informed decisions requires clear understanding of the model, has increased the interest in interpretable machine learning. The widely used Classification and Regression Trees (CART) have played a major role in health sciences, due to their simple and intuitive explanation of predictions. Ensemble methods like gradient boosting can improve the accuracy of decision trees, but at the expense of the interpretability of the generated model. Additive models, such as those produced by gradient boosting, and full interaction models, such as CART, have been investigated largely in isolation. We show that these models exist along a spectrum, revealing previously unseen connections between these approaches. This paper introduces a rigorous formalization for the additive tree, an empirically validated learning technique for creating a single decision tree, and shows that this method can produce models equivalent to CART or gradient boosted stumps at the extremes by varying a single parameter. Although the additive tree is designed primarily to provide both the model interpretability and predictive performance needed for high-stakes applications like medicine, it also can produce decision trees represented by hybrid models between CART and boosted stumps that can outperform either of these approaches.
Copyright © 2019 the Author(s). Published by PNAS.

Keywords:  CART; additive tree; decision tree; gradient boosting; interpretable machine learning

Year:  2019        PMID: 31527280      PMCID: PMC6778203          DOI: 10.1073/pnas.1816748116

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   11.205


The increasing application of machine learning to high-stakes domains such as criminal justice (1, 2) and medicine (3–5) has led to a surge of interest in understanding the generated models. Mispredictions in these domains can incur serious risks in scenarios such as technical debt (6, 7), nondiscrimination (8), medical outcomes (9, 10), and, recently, the right to explanation (11), thus motivating the need for users to be able to examine why the model made a particular prediction. Despite recent efforts to formalize the concept of interpretability in machine learning, there is considerable disagreement on what such a concept means and how to measure it (12, 13). Lately, 2 broad categories of approaches for interpretability have been proposed (14), namely, post hoc simple explanations for potentially complex models (e.g., visual and textual explanations) (15, 16) and intuitively simple models (e.g., additive models, decision trees). This paper focuses on intuitively simple models, specifically decision trees, which are used widely in fields such as healthcare, yet have lower predictive power than more sophisticated models. Classification and Regression Tree (CART) analysis (17) is a well-established statistical learning technique that has been adopted by numerous fields for its model interpretability, scalability to large datasets, and connection to rule-based decision-making (18). Specifically, in fields like medicine (19–21), the aforementioned traits are considered a requirement for clinical decision support systems. CART builds a model by recursively partitioning the instance space, and labeling each partition with either a predicted category (in the case of classification) or real value (in the case of regression). Despite widespread use, CART models are often less accurate than other statistical learning models, such as kernel methods and ensemble techniques (22). 
Among the latter, boosting methods were developed as a means to train an ensemble that iteratively combines multiple weak learners (often CART models) into a high-performance predictive model, albeit with an increase in the number of nodes, thereby harming model interpretability. In particular, gradient boosting methods (23) iteratively optimize an ensemble’s prediction to increasingly match the labeled training data. In addition, some ad hoc approaches (24, 25) have been successful at improving the accuracy of decision trees, but at the expense of altering their topology, thereby reducing their explanatory capacity. Decision tree learning and gradient boosting have been connected primarily through CART models used as the weak learners in boosting. However, a rigorous analysis in ref. 26 proves that decision tree algorithms, specifically CART and C4.5 (27), are, in fact, boosting algorithms. Based on this approach, AdaTree, a tree-growing method based on AdaBoost (28), was proposed in ref. 29. A sequence of weak classifiers on each branch of the decision tree was trained recursively using AdaBoost, thereby rendering a decision tree in which each branch conforms to a strong classifier. The weak classifiers at each internal node were implemented either as linear classifiers composed with a sigmoid or as Haar-type filters with threshold functions at their output. Another approach is the Probabilistic Boosting-Tree (PBT) algorithm introduced in ref. 30, which also uses AdaBoost to build decision trees. PBT trains ensembles of strong classifiers on each branch for image classification. The strong classifiers are based on up to third-order moment histogram features extracted from Gabor and intensity filter responses. PBT also carries out a divide-and-conquer strategy in order to perform data augmentation to estimate the posterior probability of each class.
Recently, MediBoost, a tree-growing method based on the more general gradient boosting approach, was proposed in ref. 31. In contrast to AdaTree and PBT, MediBoost emphasizes interpretability, since it was conceived to support medical applications while exploiting the accuracy of boosting. It takes advantage of the shrinkage factor inherent to gradient boosting, and introduces an acceleration parameter that controls the observation weights during training to leverage predictive performance. Although the reported experimental results showed that MediBoost exhibits better predictive performance than CART, no theoretical analysis was provided. In this paper, we propose a rigorous mathematical approach for building single decision trees that are more accurate than traditional CART trees. In this context, we use the total number of decision nodes (NDN) as a quantification of interpretability, due to its direct association with the number of decision rules of the model and the depth of the tree. This paper addresses these shortcomings through the following contributions: 1) We introduce the additive tree (AddTree) as a mechanism for creating decision trees. Each path from the root node to a leaf represents the outcome of a gradient boosted ensemble for a partition of the instance space. 2) We theoretically prove that AddTree generates a continuum of single-tree models between CART and gradient boosted stumps (GBS), controlled via a single tunable parameter of the algorithm. In effect, AddTree bridges CART and gradient boosting, identifying previously unknown connections between additive and full interaction models. 3) We provide empirical results showing that this hybrid combination of CART and GBS outperforms CART in terms of accuracy and interpretability. Our experiments also provide further insight into the continuum of models revealed by AddTree.

Background on CART and Boosting

Assume we are given a training dataset D = {(x_i, y_i)}_{i=1}^N, where each d-dimensional data instance x_i ∈ X has a corresponding label y_i ∈ Y, drawn independently from an identical and unknown distribution P. In a binary classification setting, Y = {−1, 1}; in regression, Y ⊆ R. The goal is to learn a function F: X → Y that will perform well in predicting the label on new examples drawn from P. CART analysis recursively partitions X, with F assigning a single label in Y to each terminal node; in this manner, the model captures full interactions among the features. Different branches of the tree are trained with disjoint subsets of the data, as shown in Fig. 1.
Fig. 1.

A depiction of the continuum relating CART, GBS, and our AddTree. Each algorithm has been given the same 4 training instances (blue and red symbols); the symbol’s size depicts its weight when used to train the adjacent node.

In contrast, boosting iteratively trains an ensemble of weak learners h_t, such that the resulting model is a weighted sum of the weak learners’ predictions, F_T(x) = F_0 + Σ_{t=1}^T ρ_t h_t(x), with ρ_t ∈ R. Each boosted weak learner is trained with a different weighting of the entire dataset, unlike CART, repeatedly emphasizing mispredicted instances at every round (Fig. 1). GBS, gradient boosting with simple regression stumps, creates a purely additive model in which each new ensemble member reduces the residual of previous members (32). Interaction terms can be included in the ensemble by using more complex learners, such as deeper trees. Classifier ensembles with decision stumps as the weak learners, h_t, can be trivially rewritten (31) as a complete binary tree of depth T, where the decision made at each internal node at depth t − 1 is given by h_t, and the prediction at each leaf is given by F_T. Intuitively, each path through the tree represents the same ensemble, but one that tracks the unique combination of predictions made by each member.
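The stump-ensemble-to-tree equivalence described above can be sketched in a few lines (the stump parameters below are made up for illustration; the unrolling mirrors the construction in ref. 31):

```python
# Two decision stumps: (feature index, threshold, left value, right value).
# The values are hypothetical, standing in for fitted stump outputs.
stumps = [(0, 0.5, -1.0, 1.0), (1, 0.3, -0.5, 0.5)]
F0 = 0.0  # initial constant prediction

def ensemble_predict(x):
    """GBS view: the prediction is F0 plus the sum of all stump outputs."""
    return F0 + sum(vL if x[j] <= t else vR for j, t, vL, vR in stumps)

def tree_predict(x, depth=0, total=F0):
    """Tree view: the same ensemble unrolled as a complete binary tree of
    depth len(stumps). Every internal node at depth t - 1 tests stump t,
    and each root-to-leaf path accumulates one combination of stump outputs."""
    if depth == len(stumps):
        return total  # leaf: the accumulated ensemble prediction
    j, t, vL, vR = stumps[depth]
    branch = vL if x[j] <= t else vR
    return tree_predict(x, depth + 1, total + branch)

# The two views agree on every input.
for x in [(0.2, 0.1), (0.7, 0.9), (0.4, 0.8), (0.6, 0.2)]:
    assert ensemble_predict(x) == tree_predict(x)
```

The unrolled tree has 2^T leaves, but every root-to-leaf path carries the same T stumps, which is why the two representations are interchangeable.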

The Additive Tree

This interpretation of boosting lends itself to the creation of a tree-structured ensemble learner that bridges CART and GBS: the AddTree. At each node, a weak learner is found using gradient boosting on the entire dataset. Rather than using only the data in the current node to decide the next split, the algorithm allows the remaining data to also influence this split, albeit with a potentially different weight. Critically, this process results in an interpretable model that is simply a decision tree of the weak learners’ predictions, but with higher predictive power due to its growth via boosting and the resulting effect of regularizing the tree splits. This process is carried out recursively until the depth limit is reached. By varying the weighting scheme, this approach interpolates between CART trees at one extreme and boosted regression stumps at the other. Notably, the resulting tuning of the weighting scheme allows AddTree to improve in predictive performance over CART and GBS, while allowing direct interpretation. AddTree maintains a perfect binary tree of depth T, with 2^T − 1 internal nodes, each of which corresponds to a weak learner. The nth weak learner, denoted h_{n,l}, along the path from the root node to a leaf prediction node l induces 2 disjoint partitions of the feature space X, namely P_{n,l} and its complement P_{n,l}^c. Let (R_{1,l}, …, R_{T,l}) be the corresponding sequence of partitions along that path to l, where the nth term in the tuple, namely R_{n,l}, can be either P_{n,l} or P_{n,l}^c. We can then define the partition of X associated with the leaf node l as the intersection of all of the partitions along the path from the root node to the leaf node l; that is, R_l = ∩_{n=1}^T R_{n,l}. AddTree predicts a label for each x ∈ R_l via the ensemble consisting of all weak learners along the path to l, so that the resulting model is given by F_T(x) = F_0 + Σ_{n=1}^T ρ_{n,l} h_{n,l}(x). To focus each branch of the tree on corresponding instances, thereby constructing diverse ensembles, AddTree maintains a set of weights denoted w_{n,l} ∈ R^N over all training data.
Such weights are associated with training a weak learner at the leaf node l at depth n. We train the tree as follows. At each boosting step, we have a current estimate F_{n−1,l} of the function at the leaf l of a perfect binary tree of depth n − 1. We seek to improve this estimate by replacing each of the leaf prediction nodes with additional weak learners h_{n,l} with corresponding scaling factor ρ_{n,l}, growing the tree by one level. This yields a revised estimate F_{n,l}(x) = F_{n−1,l}(x) + ρ_{n,l} h_{n,l}(x) at each terminal node, equivalent to separate functions, one for each leaf’s corresponding ensemble. The goal is to minimize the loss over the data, Σ_l Σ_{i=1}^N w_{n,l}(x_i) L(y_i, F_{n−1,l}(x_i) + ρ_{n,l} h_{n,l}(x_i)), by choosing the appropriate values of ρ_{n,l} and h_{n,l} at each leaf. Our proposed approach breaks the connection between this global loss and the loss used for each weak learner by independently minimizing the inner summation for each leaf l; that is, (ρ_{n,l}, h_{n,l}) = argmin_{ρ,h} Σ_{i=1}^N w_{n,l}(x_i) L(y_i, F_{n−1,l}(x_i) + ρ h(x_i)). This subproblem can be solved efficiently via gradient boosting at each leaf in a level-wise manner through the tree. Next, we focus on deriving AddTree where the weak learners are binary regression stumps with least squares as the loss function L. Following ref. 23, we first estimate the negative unconstrained gradients at each data instance x_i, which are equivalent to the residuals [i.e., ỹ_i = y_i − F_{n−1,l}(x_i)]. Then, we can determine the optimal weak learner by solving h_{n,l} = argmin_h Σ_{i=1}^N w_{n,l}(x_i)(ỹ_i − h(x_i))². Gradient boosting first fits decision stumps to the residuals, then solves for the optimal scaling factor ρ_{n,l} = argmin_ρ Σ_{i=1}^N w_{n,l}(x_i)(y_i − F_{n−1,l}(x_i) − ρ h_{n,l}(x_i))², whose main function is to scale the weak learners obtained from pseudoresiduals to fit the original data, or, equivalently, solving for the simple regressor γ_{n,l} = ρ_{n,l} h_{n,l}. If all instance weights remained constant, this approach would build a perfect binary tree of depth T, where each path from the root to a leaf represents the same ensemble, and so would be exactly equivalent to gradient boosting of stumps. Instead, AddTree constructs diverse ensembles by focusing each branch of the tree on the instances that belong to its partition.
To accomplish this, the weights are updated separately for each of h_{n,l}’s 2 children: Instances in the corresponding partition have their weights multiplied by a factor of λ + 1, and instances outside the partition have their weights multiplied by a factor λ, denoted the “interpretability factor.” The update rule for the weight of x_i for the left child is w_{n,l}^(left)(x_i) ← w_{n,l}(x_i)(λ + 1[x_i ∈ P_{n,l}]), and symmetrically with P_{n,l}^c for the right child, where 1[·] is a binary indicator function that is 1 if its predicate is true, and 0 otherwise, and the initial weights are typically uniformly distributed. Therefore, we can also think of AddTree as growing a collection of interrelated ensembles, where each is tuned to yield optimal predictions for one partition of the instance space. The complete AddTree approach is detailed as Algorithm 1. Notice that a learning rate or shrinkage parameter ν is in place in order to apply regularization by shrinkage in step 8 of the algorithm. The learning rate balances the contributions of each node to the running function estimate of its ensemble. Larger learning rates allow each weak learner to contribute more to the function estimate, resulting in shorter (and consequently more interpretable) trees; smaller learning rates slow the growth of the running function estimate, yielding larger but potentially more accurate trees.
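The weight update just described can be sketched as follows (a minimal illustration; `child_weights` and the toy weights are ours, not the paper's code):

```python
import numpy as np

def child_weights(w, in_partition, lam):
    """AddTree weight update: instances inside the child's partition are
    scaled by (lam + 1); instances outside are scaled by lam."""
    return w * (lam + in_partition.astype(float))

w = np.ones(4)                                  # uniform initial weights
inside = np.array([True, True, False, False])   # membership in the partition

# lam = 0: instances outside the partition get zero weight, as in CART.
print(child_weights(w, inside, 0.0))            # [1. 1. 0. 0.]

# Large lam: normalized weights approach uniform, as in GBS.
w_big = child_weights(w, inside, 1e6)
print(w_big / w_big.sum())                      # all entries close to 0.25
```

The two extremes of the single parameter `lam` thus reproduce the two training regimes depicted in Fig. 1.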

Results

We provide the following results: 1) a theoretical analysis establishing the connections between CART, GBS, and AddTree; 2) empirical results demonstrating that AddTree yields a decision tree that outperforms CART and performs equivalently to GBS; and 3) an empirical analysis of the behavior of AddTree under various parameter settings.

Theoretical Analysis.

This section proves that AddTree is equivalent to CART when λ = 0 and converges to GBS as λ → ∞, thus establishing a continuum between CART and GBS. We include proof sketches for the 3 lemmas used to prove our main result in Theorem 1; full proofs are included in the supporting information.

Lemma 1. The weight of x_i at leaf l at depth n of the tree is given by w_{n,l}(x_i) = w_0(x_i) Π_{m=1}^n (λ + 1[x_i ∈ R_{m,l}]), where (R_{1,l}, …, R_{n,l}) is the sequence of partitions along the path from the root to l. Proof sketch: Shown by induction on n, based on the weight update rule.

Lemma 2. The optimal simple regressor γ_{n,l} that minimizes the squared error loss function at leaf l at depth n of the tree is a weighted average of the residuals in each of the 2 regions P_{n,l} and P_{n,l}^c. Proof sketch: For a given region at depth n of the tree, the simple regressor has the form γ_{n,l}(x) = c_1 1[x ∈ P_{n,l}] + c_2 1[x ∈ P_{n,l}^c], with constants c_1 and c_2. We take the derivative of the loss function in each of the 2 regions P_{n,l} and P_{n,l}^c, and solve for where the derivative is equal to zero, obtaining the weighted averages.

Lemma 3. Under the AddTree update rule, if the initial weights are defined with a suitable constant, then the normalizing sum of the weights at each node remains constant. Proof sketch: By induction on n, building upon Lemma 1; each factor in the product is constant, and so the sum is constant; therefore, the lemma holds under the given update rule.

Building upon these 3 lemmas, our main theoretical result is presented in the following theorem and explained in the subsequent 2 remarks.

Theorem 1. Given the AddTree optimal simple regressor γ_{n,l} that minimizes the squared error loss, the following limits hold for the parameter λ of the weight update rule: as λ → 0, the effective weights reduce to the initial weights w_0(x_i) of the training samples restricted to the disjoint regions defined by the path to the leaf; as λ → ∞, the effective weights again reduce to the initial weights w_0(x_i), but over the overlapping regions induced by the latest stump alone, where w_0(x_i) is the initial weight for the ith training sample. Proof sketch: The first limit follows from applying Lemma 1 to the regressor in Lemma 2 and taking the limit as λ → 0; the second follows analogously, taking the limit as λ → ∞.

Remark 1. The simple regressor obtained in the limit λ → 0 calculates a weighted average of the difference between the random output variables and the previous estimate of the function in the disjoint regions defined by the path to the leaf. This formally defines the behavior of the CART algorithm.
Remark 2. The simple regressor obtained in the limit λ → ∞ calculates a weighted average of the difference between the random output variables and the previous estimate of the function given by the piece-wise constant ensemble, where the average is taken over the overlapping region determined by the latest stump, namely P_{n,l} or P_{n,l}^c. This defines the behavior of the GBS algorithm. Based on Remarks 1 and 2, we can conclude that AddTree is equivalent to CART when λ = 0 and to GBS as λ → ∞. In addition to identifying connections between these 2 algorithms, AddTree provides the flexibility to train a model that lies between CART and GBS, with potentially improved performance over either, as we show empirically in the next section. Finally, notice that, although the introduction of the parameter λ might be reminiscent of decision trees trained with fuzzy logic (33), in the fuzzy-logic approach the continuity is between a simple constant model and CART, whereas AddTree produces a continuous mapping between CART and GBS.
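The two limits of Theorem 1 can be checked numerically on a toy node (the residual values are invented for illustration; `leaf_value` computes the weighted average of the residuals under the AddTree weight update):

```python
import numpy as np

# Residuals at a node whose candidate partition contains the first 2 instances.
resid = np.array([1.0, 3.0, 10.0, 20.0])
w0 = np.ones(4)                                  # uniform initial weights
inside = np.array([True, True, False, False])    # membership in the partition

def leaf_value(lam):
    """Weighted average of the residuals under the AddTree weight update."""
    w = w0 * (lam + inside)
    return np.average(resid, weights=w)

# lam -> 0: average over the partition only, i.e., the CART leaf value.
print(leaf_value(0.0))              # 2.0
# lam -> inf: average over all instances, i.e., the GBS behavior.
print(round(leaf_value(1e9), 3))    # 8.5
```

Intermediate values of `lam` blend the two averages, which is the continuum the theorem formalizes.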

Empirical Evaluation of Predictive Performance.

In this section, the predictive performance based on balanced accuracy (BACC) (34) of a set of optimized machine learning models, namely, AddTree, CART, GBS, Random Forest (RF), and Gradient Boosting Machines (GBM), was calculated on 83 classification tasks selected from the Penn Machine Learning Benchmark (PMLB) (35). The frequency with which each of the aforementioned models outperformed the BACC of all of the others is illustrated in Fig. 2, where ties are excluded for the sake of clarity. AddTree outperformed CART in 55 (66.3%) classification tasks with high statistical significance, while AddTree was outperformed by GBS in 46 (55.4%) tasks, but without statistical significance. In fact, AddTree exhibits the highest mean BACC (improvement rate 1.4%) and F1 score (improvement rate 2.9%) among the single-tree and stump-based models across the PMLB tasks, as shown in Table 1. Moreover, RF and GBM displayed superior performance to GBS and to the single trees, that is, CART and AddTree. These results are also consistent with the F1-score comparison in the supporting information.
Fig. 2.

Frequency that an algorithm (rows) has higher average BACC compared to each one of the remaining algorithms (columns) under study across 83 binary classification tasks. Each of the learning algorithms has been tuned to maximize BACC. Ties have been ruled out for the sake of clarity.

Table 1.

Performance indexes for all of the learning algorithms under comparison across 83 PMLB classification tasks

Performance measure        CART      AddTree   GBS       RF          GBM
Mean BACC (×10⁻³)          811.5     815.5     804.1     837.5       843.4
SD BACC (×10⁻³)            146.1     144.8     156.4     140.9       138.2
Mean F1 score (×10⁻³)      796.7     799.7     776.9     831.9       833.3
SD F1 score (×10⁻³)        177.0     174.8     193.4     173.4       170.0
Mean no. of nodes          55.4      50.7      4,692.0   412,458.4   316,728.8
SD no. of nodes            86.2      84.1      1,614.1   752,654.8   700,981.6
In Table 1, the average BACC and F1 score of AddTree exceed those of GBS, with an even smaller SD, despite the fact that GBS exceeds AddTree in the count of classification tasks with better BACC, as illustrated in Fig. 2. Furthermore, the comparison of the BACCs of AddTree and CART is illustrated in the bar chart in Fig. 3, where the 55 tasks in which AddTree outperforms CART correspond to the positive bars. A similar result is shown in the scatter plot in the supporting information. Also, notice the meaningful reduction in the NDN of AddTree with respect to GBS. As expected, the ensemble methods RF and GBM are more accurate than the single-tree approaches, but at the expense of larger NDNs.
Fig. 3.

Bar chart showing the difference between the BACC of AddTree and that of CART on each task. AddTree exhibits significantly better BACC than CART in 55 of the 83 PMLB classification tasks. The one outlier task in favor of CART is a synthetic parity problem, which is ill suited to soft regression-like methods such as AddTree and is better solved by the hard binary logic structure of CART (more details in the supporting information).


Number of Decision Nodes.

As shown in Table 1, the average NDN used to carry out the classification tasks was 55.4 (SD = 86.2) for CART trees and 50.7 (SD = 84.1) for AddTree trees, and this reduction was statistically significant. In contrast to the NDN values of AddTree and CART, the average number of stumps assembled by GBS to carry out the classification tasks was 4,692.0 (SD = 1,614.1), surpassed only by the large tree ensembles RF with 412,458.4 (SD = 752,654.8) and GBM with 316,728.8 (SD = 700,981.6). Unlike GBS, AddTree exploits model interactions, which may explain the equivalence in performance with a much smaller number of nodes.

Empirical Variance of AddTree.

Using training data from PMLB, the variance of AddTree as a function of the interpretability parameter λ is estimated and shown in Fig. 4. As observed in the plot, as λ increases, AddTree reduces variance with respect to CART (λ = 0) without compromising bias since, as λ → ∞, it converges to GBS, which is a consistent estimator. Therefore, the reduction of variance without compromising bias explains the superior performance of AddTree over CART.
Fig. 4.

Estimate of the variance of AddTree as a function of the interpretability parameter λ. Notice that AddTree improves the variance with respect to CART. The error bars represent 1 SE.


Discussion

The previous section provided a formal proof that AddTree establishes a previously unknown connection between CART and GBS. This connection motivates the hypothesis that the predictive performance of AddTree may surpass that of CART and GBS, while improving on the limited interpretability that GBS provides. Preliminary results regarding the performance of AddTree were previously presented in ref. 36, but using small datasets. The superior statistical power of PMLB allows a more rigorous assessment of the performance and interpretability of AddTree. Therefore, a set of experiments was carried out to validate this hypothesis. The results presented in Fig. 2 validate the ability of AddTree to surpass the predictive performance of CART over a wide variety of classification tasks. This is further corroborated by the distribution of the per-task difference between the BACC of AddTree and that of CART (shown in the supporting information), whose mean reflects a bias toward positive values. The superior performance of AddTree over CART has been shown to be statistically significant as well. Moreover, AddTree showed a significant reduction in the average NDN with respect to CART: the distribution of the difference between the NDNs of AddTree and CART has a mean value of −4.66. Therefore, we conclude that AddTree can generate better predictions than CART while building smaller trees. Despite the apparent performance advantage of GBS over AddTree, the results are not statistically significant. Furthermore, GBS had an average NDN of 4,692.0 (SD = 1,614.1) over the classification tasks, whereas AddTree had an average NDN of 50.7 (SD = 84.1); however, GBS can be indirectly interpreted through partial dependence plots (23). Notice that limiting the GBS ensemble size to match the maximum depth of CART and AddTree would have provided a better comparison between GBS and AddTree in terms of interpretability based on NDNs.
However, the number of stumps of GBS was adaptively chosen, and the hyperparameters were optimally tuned, to provide the best performance possible, and yet the predictive performance of GBS does not significantly improve over AddTree. Through these results, we conclude that AddTree provides an alternative paradigm for building decision trees with a predictive performance that surpasses that of CART and is as accurate as optimized GBS. AddTree’s improvement in predictive performance over CART is a result not only of its connection to boosting, but also of the effect of regularizing the tree splits through the passing of all of the training data to the descendant nodes and the variation of the weighting scheme. In fact, as illustrated in Fig. 4, AddTree reduces variance as λ increases without compromising bias since, as λ → ∞, it converges to GBS, a boosting-based algorithm that primarily reduces bias. The tree ensembles RF and GBM outperformed CART, AddTree, and GBS with statistical significance, at the expense of less-interpretable models, using the total NDN as a proxy for interpretability. However, even in scenarios where understanding the model is not of concern, the superior predictive performance of AddTree suggests possibilities for improving over the predictive performance of RF and GBM, namely, 1) the replacement of the stumps by deeper decision trees such as CART trees, 2) the implementation of bagged AddTree, and 3) the application of gradient boosting to AddTree. Furthermore, 4) formal extensions of AddTree to other losses and 5) the analysis of connections of AddTree to the recently introduced Generalized Random Forest (37) are in our future research agenda.

Materials and Methods

All experiments were carried out using the rtemis machine learning library (38) written in the R language (39). The statistical significance of all of the predictive performance measurements was evaluated using the Wilcoxon rank sum test (40).

Predictive Performance Evaluation and NDN.

The PMLB repository (35) contains a selection of curated datasets, accessible through GitHub, for evaluating supervised classification. PMLB includes datasets from commonly used machine learning benchmarks such as the University of California, Irvine machine learning repository (41), the metalearning benchmark (42), and the Knowledge Extraction based on Evolutionary Learning tool (43). It includes real and synthetic datasets in order to assess and compare the performance of different learning algorithms. Out of the 95 available PMLB datasets, 11 datasets with less than 100 instances were ruled out to avoid overfitting in the analysis, and 1 dataset with 48,842 instances was discarded because of the extended execution times that the performance analyses demanded in this particular dataset. The remaining 83 datasets have a mean of 1,713.0 instances (SD = 2,900.3) with 36.3 features (SD = 111.4). The distributions of performance comparisons, including the tests for statistically significant differential performances, were based on viewing the 83 datasets as a random sample of potential classification tasks. In this study, we evaluated the performance of AddTree against CART, GBS, RF, and GBM using BACC, a measurement commonly used to calculate performance in 2-class imbalanced domains (34). All experiments were performed using nested resampling. For each dataset, the hyperparameters of each algorithm were tuned using grid search to maximize BACC across 5 stratified subsamples (internal resampling). External resampling was performed using 25 stratified subsamples, fixed for each dataset to ensure a direct comparison across algorithms. In both internal and external resampling, the training set size was set to 75% of the available number of cases. 
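For reference, the BACC measure used throughout can be sketched as the unweighted mean of per-class recalls (our own minimal implementation, not the rtemis code):

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy: the unweighted mean of per-class recalls,
    so a majority-class guesser cannot score well on imbalanced data."""
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

y_true = np.array([0, 0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0, 0])  # always predicts the majority class

print(balanced_accuracy(y_true, y_pred))  # 0.5, although plain accuracy is 0.8
```

On a perfectly balanced task, BACC coincides with plain accuracy; the two diverge exactly when the class proportions are skewed.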
The following hyperparameters were tuned: 1) the interpretability factor λ and the learning rate ν for AddTree, 2) the complexity parameter used in cost-complexity pruning (44) for CART, 3) the learning rate and the maximum number of stumps in the ensemble for GBS, 4) the number of predictors randomly sampled as candidates at each split for RF, and 5) the maximum tree depth as well as the learning rate for GBM. Detailed definitions of the hyperparameters and the hyperparameter space are provided in the supporting information. The AddTree approach grows trees where a minimum of one observation per child node is required to make a node internal; otherwise, such a node becomes a leaf node (min.membership parameter). A maximum tree depth of 30 is also in place (maxdepth parameter); however, this tree depth was not reached in any of the experiments, mostly due to the stopping effect of min.membership. Furthermore, the number of nodes in AddTree is reduced by eliminating impossible and redundant paths (i.e., paths where subsequent child branches give the same class in both outputs); however, no other statistical pruning method (e.g., minimum-error or small-tree pruning) was carried out. In contrast, CART trees were grown and pruned using the rpart package (44). The CART tree growth is limited by a minimum number of 2 observations for a split to be attempted (minsplit parameter) and, as in AddTree, a maxdepth of 30 for fair comparison. In addition, any branch whose complexity measure is less than a value denoted prune.cp is trimmed; prune.cp is optimally tuned using grid search. CART also relies on a variation of the complexity parameter indicating that any split that does not decrease the overall lack of fit, based on the cross-validated classification error, by a factor of 0.0001 is not attempted. The GBS and GBM ensembles are grown and pruned using the gbm package (45).
Individual trees’ growth is limited by the minimum number of 1 observation in the terminal nodes (n.minobsinnode parameter), as well as the maxdepth parameter, which was also tuned using grid search. The numbers of GBS and GBM trees were tuned by early stopping based on cross-validated classification error. RF tree ensembles were grown using the ranger package (46). The trees were grown until the minimum number of 1 observation at the terminal nodes was achieved (nodesize parameter). Moreover, a default number of 1,000 ensembled trees is specified (ntree parameter) to try to ensure that every input row gets predicted at least a few times. The variance of AddTree was calculated as indicated in ref. 47, using the twonorm dataset of PMLB. We calculated the percentage of times a test prediction differed from the mode of all test-set predictions across the bootstraps in which the case was left out (not part of the training set); 500 bootstraps were drawn from 1,000 randomly subsampled cases. AddTree was trained on each bootstrap for a set of λ values.
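The bootstrap variance estimate described above can be sketched as follows (names and the toy prediction matrix are ours; `preds[b, i]` is the prediction of the model trained on bootstrap b for case i):

```python
import numpy as np

def oob_variance(preds, oob_mask):
    """Estimate per-case prediction variance as in the procedure above:
    for each case, take the predictions from bootstraps where the case was
    left out (out-of-bag) and compute the fraction differing from their mode."""
    n_boot, n_cases = preds.shape
    var = np.empty(n_cases)
    for i in range(n_cases):
        p = preds[oob_mask[:, i], i]      # OOB predictions for case i
        mode = np.bincount(p).argmax()    # most frequent predicted class
        var[i] = np.mean(p != mode)
    return var.mean()

# 3 bootstraps, 2 cases; every case treated as out-of-bag for illustration.
preds = np.array([[0, 0],
                  [0, 0],
                  [1, 0]])
oob = np.ones((3, 2), dtype=bool)
print(oob_variance(preds, oob))  # case 0 disagrees 1/3 of the time, case 1 never
```

A model whose OOB predictions always agree with their mode gets a variance of 0; repeating this for a grid of λ values produces curves like those in Fig. 4.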
Algorithm 1:

Inputs: Training data (X, y) = {(x_i, y_i)}_{i=1}^N; instance weights w_{n,l} ∈ R^N (default: w_{n,l,i} = 1/N); interpretability factor λ ∈ [0, +∞); node depth n (default: 1); max depth T; node domain R_{n,l} (default: X); prediction function F_{n−1,l}(x) (default: F_0(x) = ȳ); learning rate ν.
Outputs: The root node of a hierarchical ensemble
 1: If n > T, return a prediction node l_n that predicts the weighted average of y with weights w_{n,l}
 2: Create a new subtree root l_n to hold a weak learner
 3: Compute negative gradients: ỹ_i = −∂L(y_i, F_{n−1,l}(x_i)) / ∂F_{n−1,l}(x_i), i = 1, …, N
 4: Fit weak classifier h_{n,l}(x): X → Y by solving: h_{n,l} ← argmin_h Σ_{i=1}^N w_{n,l}(x_i)(ỹ_i − h(x_i))²
 5: Let {P_{n,l}, P_{n,l}^c} be the partitions induced by h_{n,l}
 6: ρ_{n,l} ← argmin_ρ Σ_{i=1}^N w_{n,l}(x_i)(y_i − F_{n−1,l}(x_i) − ρ h_{n,l}(x_i))²
 7: Define the simple regressor: γ_{n,l} ← ρ_{n,l} h_{n,l}
 8: Update the current function estimate: F_{n,l}(x) = F_{n−1,l}(x) + ν γ_{n,l}(x)
 9: Update the left and right subtree instance weights: w_{n,l}^(left)(x_i) ← w_{n,l}(x_i)(λ + 1[x_i ∈ P_{n,l}]); w_{n,l}^(right)(x_i) ← w_{n,l}(x_i)(λ + 1[x_i ∈ P_{n,l}^c])
 10: If R_{n,l} ∩ P_{n,l} ≠ ∅, compute the left subtree recursively: l_n.left ← AddTree(X, y, λ, w_{n,l}^(left), R_{n,l} ∩ P_{n,l}, n + 1, T, F_{n,l}, ν)
 11: If R_{n,l} ∩ P_{n,l}^c ≠ ∅, compute the right subtree recursively: l_n.right ← AddTree(X, y, λ, w_{n,l}^(right), R_{n,l} ∩ P_{n,l}^c, n + 1, T, F_{n,l}, ν)
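As an illustration, Algorithm 1 can be condensed into a short sketch for squared loss with stump weak learners (a simplified rendering, not the rtemis implementation; the dictionary-based nodes and the guards against empty partitions are our additions):

```python
import numpy as np

def fit_stump(X, target, w):
    """Weighted least-squares decision stump (step 4); returns
    (feature, threshold, left value, right value) or None if no valid split."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            if w[left].sum() == 0 or w[~left].sum() == 0:
                continue  # guard: skip splits with a zero-weight side
            cL = np.average(target[left], weights=w[left])
            cR = np.average(target[~left], weights=w[~left])
            err = np.sum(w * (target - np.where(left, cL, cR)) ** 2)
            if err < best_err:
                best, best_err = (j, t, cL, cR), err
    return best

def addtree(X, y, lam, w, region, n, T, F, nu):
    """One recursive call of Algorithm 1. `region` is a boolean mask playing
    the role of R_{n,l}; `F` holds the running estimate F_{n-1,l} on all cases."""
    if n > T or region.sum() < 2 or w.sum() == 0:
        val = np.average(y, weights=w) if w.sum() > 0 else y.mean()
        return {"leaf": float(val)}                      # step 1
    resid = y - F                                        # step 3 (squared loss)
    stump = fit_stump(X, resid, w)                       # step 4
    if stump is None:
        return {"leaf": float(np.average(y, weights=w))}
    j, t, cL, cR = stump
    left = X[:, j] <= t                                  # step 5: partitions
    h = np.where(left, cL, cR)
    denom = np.sum(w * h * h)
    rho = np.sum(w * resid * h) / denom if denom > 0 else 0.0  # step 6
    Fn = F + nu * rho * h                                # steps 7-8
    wL = w * (lam + left)                                # step 9
    wR = w * (lam + ~left)
    return {"feature": j, "threshold": t,                # steps 10-11
            "left": addtree(X, y, lam, wL, region & left, n + 1, T, Fn, nu),
            "right": addtree(X, y, lam, wR, region & ~left, n + 1, T, Fn, nu)}

def predict_one(node, x):
    """Follow the decision path from the root to a prediction node."""
    while "leaf" not in node:
        key = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[key]
    return node["leaf"]
```

With λ = 0 each branch effectively sees only its own partition's data, recovering CART-style behavior; larger λ lets out-of-partition instances regularize every split, approaching GBS.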

1.  Predicting radiation pneumonitis in locally advanced stage II-III non-small cell lung cancer using machine learning.

Authors:  José Marcio Luna; Hann-Hsiang Chao; Eric S Diffenderfer; Gilmer Valdes; Chidambaram Chinniah; Grace Ma; Keith A Cengel; Timothy D Solberg; Abigail T Berman; Charles B Simone
Journal:  Radiother Oncol       Date:  2019-01-23       Impact factor: 6.280

2.  Semantics derived automatically from language corpora contain human-like biases.

Authors:  Aylin Caliskan; Joanna J Bryson; Arvind Narayanan
Journal:  Science       Date:  2017-04-14       Impact factor: 47.728

Review 3.  Machine learning in laboratory medicine: waiting for the flood?

Authors:  Federico Cabitza; Giuseppe Banfi
Journal:  Clin Chem Lab Med       Date:  2018-03-28       Impact factor: 3.694

4.  MediBoost: a Patient Stratification Tool for Interpretable Decision Making in the Era of Precision Medicine.

Authors:  Gilmer Valdes; José Marcio Luna; Eric Eaton; Charles B Simone; Lyle H Ungar; Timothy D Solberg
Journal:  Sci Rep       Date:  2016-11-30       Impact factor: 4.379

5.  PMLB: a large benchmark suite for machine learning evaluation and comparison.

Authors:  Randal S Olson; William La Cava; Patryk Orzechowski; Ryan J Urbanowicz; Jason H Moore
Journal:  BioData Min       Date:  2017-12-11       Impact factor: 2.522


