Literature DB >> 35609192

Provable Boolean interaction recovery from tree ensembles obtained via random forests.

Merle Behr, Yu Wang, Xiao Li, Bin Yu.

Abstract

Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the “Locally Spiky Sparse” (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called “Depth-Weighted Prevalence” (DWP) for a set of signed features S±. Intuitively speaking, DWP(S±) measures how frequently features in S± appear together in an RF tree ensemble. We prove that, with high probability, DWP(S±) attains a universal upper bound that does not involve any model coefficients, if and only if S± corresponds to a union of Boolean interactions under the LSS model. Consequently, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model, even when some assumptions are violated.

Entities:  

Keywords:  consistency; decision trees; ensemble methods; interaction selection; interpretable machine learning

Mesh:

Year:  2022        PMID: 35609192      PMCID: PMC9295780          DOI: 10.1073/pnas.2118636119

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   12.779


Supervised machine learning (ML) algorithms have proven extremely powerful in a wide range of predictive tasks from genomics to cosmology to pharmacology. Understanding how a model makes predictions is of paramount value in science and business alike (1). For example, when a geneticist wants to understand a particular disease—e.g., breast cancer—a black-box algorithm predicting the risk of breast cancer from genotype features is useful, but it does not offer biological insight. That is, discovery of genes and gene interactions driving a particular disease not only provides understanding as a basic goal in science, but also opens doors for therapeutic treatments. It is a pressing task, in genomics and beyond, to interpret supervised ML models or algorithms and extract mechanistic information in addition to prediction. Among many supervised ML algorithms, tree ensembles, such as those from Random Forests (RFs) (2) and gradient-boosted decision trees (3), stand out, as they both enjoy state-of-the-art prediction performance in a variety of practical problems and lend themselves to relatively simple interpretations (4–8). To interpret a tree ensemble model, two questions are central: (i) feature importance: What features are important for the model’s prediction? (ii) interaction importance: What interactions among features are important for the model’s prediction? While many studies (refs. 4 and 6–8 and the references therein) focus on RF feature importance, there are relatively few results on the second question. In genetics, Wan et al. (9) and Yoshida and Koike (10) seek (higher-order) gene interactions (or epistasis) by extracting genetic variant interactions from paths of ensembles of fitted decision trees. Wan et al. (9) use MegaSNPHunter, based on boosted trees, and interpret every group of features that jointly appears on one of the decision paths as a candidate interaction.
Yoshida and Koike (10) propose to rank interactions of genetic variants based on how often they appear together on decision paths in an RF tree ensemble. Recently, iterative RFs (iRFs) (11) were proposed to seek predictive, stable, and high-order nonlinear or Boolean feature interactions. Even though iRF uses the idea that sets of interacting features often appear together on individual decision paths of a tree in an RF ensemble, as in Yoshida and Koike (10), it adds several other ideas. In particular, iRF incorporates a soft dimension-reduction step via iterative reweighting of features in terms of their Gini importance, in order to stabilize individual decision paths in the trees. Using the random intersection trees (RIT) (12) algorithm, iRF extracts stable interactions of arbitrary order in a computationally efficient way, even when the number of features is large. There is strong evidence that iRF extracts predictive, stable, and high-order Boolean interaction information from RF in genomics and other fields (11, 13, 14). While all the works mentioned above provide strong empirical evidence that interactions extracted from an ensemble of decision trees via RF or iRF are informative about underlying biological functional relationships, there are no theoretical results regarding interaction discovery using RF, iRF, or other tree-based methods. In this paper, as a first step toward understanding the interaction-discovery property of tree-based methods, we investigate a key idea in the previous works (9–11)—namely, that frequent joint appearance of features on decision paths in the RF tree ensemble suggests an interaction. One of the most common assumptions made in previous theoretical analyses of RF is a family of smoothness conditions on the underlying mean regression function, such as the Lipschitz smoothness condition (see, e.g., refs. 15–17).
However, many biological processes show thresholding or discontinuous interacting behavior among biomolecules (18, 19), which strongly violates the Lipschitz assumption. It is therefore necessary to introduce a model that can capture this thresholding behavior through a discontinuous mean regression function.

The Locally Spiky Sparse Model.

Motivated by this thresholding behavior of biomolecules and inspired by RF’s predictive performance successes in genomics data problems (20–22), we consider the locally spiky sparse (LSS) model:* an additive regression model where the mean regression function is assumed to be a linear combination of Boolean interaction functions. The linear coefficients, as well as the threshold coefficients of the Boolean functions, are called “model coefficients.” Via Boolean functions, the LSS model is able to capture discontinuous thresholding behavior in biology; hence, it can be more relevant for biologists than models with smoothness constraints. We believe the LSS model is suitable and useful as a benchmark model under which to evaluate theoretically (and computationally) interaction-discovery performance of tree-based ML algorithms, including RF.

Our Contributions.

Assume that independent and identically distributed (i.i.d.) data samples from the LSS model are given and an RF is fit to these data. For an RF tree ensemble, we first define “signed features.” For a set of signed features S± and a decision path in the ensemble, we then define a quantity called “depth-weighted prevalence” (DWP). Intuitively speaking, DWP(S±) measures how frequently the features in S± appear together in an RF tree ensemble. We show that DWP has a universal upper bound that depends only on the size of the set of signed features. Moreover, the upper bound is attained with high probability as the sample size increases if and only if the signed features represent a union of interactions in the LSS model. Based on DWP, we show that a simple algorithm—LSSFind, defined in Algorithm 1—can consistently recover interaction components in the LSS model, regardless of the model coefficients. Our theoretical results imply that the feature subsampling of RF is essential for recovering interactions from the RF tree ensemble. When too few features are sampled at each node, the tree ensemble is close to extremely randomized trees, and the DWP of any set of signed features is independent of the response, which means that it contains no information on the LSS model; when too many features are sampled, all the trees in the ensemble will be very similar to one another, which turns out to make it difficult to use tree structures to distinguish between interactions and noninteractions. More specifically, the ratio between the number of subsampled features m and the total number of features p should be a nonzero constant in order for our algorithm to learn higher-order interactions from tree paths.

Existing Theoretical Works on RF.

Existing theoretical studies of RF and its variants fall into two categories. The first focuses on estimating the regression function, under Lipschitz or related conditions on the underlying regression function, by averaging the decision trees in the RF tree ensemble. The second category studies feature importance measures as an RF output. In contrast, we provide a study of feature interaction selection consistency under an LSS model using DWP extracted from the RF tree ensemble. In particular, in the first category, Biau (15) considers “median forests” (23), originally introduced as a theoretical surrogate by Breiman (24), and obtains the L2 convergence rate under Lipschitz-continuous models. Scornet et al. (16) give the first consistency result for Breiman’s original RF, with subsampling instead of bootstrapping, in the low-dimensional setting when data are generated via an additive regression model with continuous components. Wager and Athey (17) consider a variant of RF, called honest RF, in the causal-inference setup and prove its pointwise consistency and asymptotic normality when the conditional mean function is Lipschitz continuous. Similarly, Mentch and Hooker (25) showed that, under some Lipschitz-type conditions, predictions from a moderately large number of trees approximate well those from an infinite ensemble of trees. Based on these asymptotic normality results, ref. 26 derived hypothesis tests for the null hypothesis that the regression function is additive. Thus, if one defines feature interaction as the deviation from a continuous additive regression function, then their results enable testing of a particular candidate. In contrast, in this work, we define feature interaction via the discontinuous Boolean functions in the LSS model, and we derive consistent interaction selection via the RF tree ensemble, as opposed to a test for an individual interaction, as in ref. 26. The second category focuses on theory regarding individual feature importance measures.
Results in this line of work do not rely on Lipschitz conditions. However, to the best of our knowledge, these works study statistical properties only of noisy features and do not provide results for signal features in finite samples. Louppe et al. (5) show that Mean Decrease in Impurity (MDI) feature importance for randomized trees has a closed-form formula with an infinite number of samples. Zhou and Hooker (6) use out-of-sample data to improve the MDI feature importance with unbiasedness guarantees. Li et al. (8) show that the MDI feature importance of noisy features is inversely proportional to the minimum leaf-node size and suggest a way to improve the MDI using out-of-bag samples. Loecher (7) gives a family of MDI feature importance measures based on out-of-bag samples that are unbiased for noisy features. Moreover, many studies focus on permutation-based feature importance measures—in particular, Shapley effects (27–33). Among these works, ref. 33 bears some conceptual similarity to the DWP approach considered in this paper, as the authors also consider the joint appearance of features on decision paths in the RF tree ensemble. However, instead of using this concept to extract feature interactions, as done in this work, they use it to define an importance-sampling scheme for estimating the Shapley effects. Also related to our work is ref. 34, which analyzes the extraction of rule sets from an RF tree ensemble. This is very similar to interaction selection as considered in this work, except that the extracted rules in ref. 34 also include specific estimated thresholds for the individual features. The theoretical analysis in ref. 34 focuses on the stability of the selected rules, without specifying a particular data-generating model. In contrast, this paper obtains model-selection-consistency results for LSSFind to estimate signed interactions of signal features under the LSS model.
The rest of the paper is organized as follows: Section 1 introduces the LSS model and Boolean interactions in more detail. Section 2 reviews the RF algorithm and formally defines DWP for a given set of signed features relative to an RF tree ensemble. Section 3 presents our main theoretical results for DWP and introduces LSSFind, a theoretically inspired algorithm to detect interactions from RF tree ensembles via DWP. Section 4 contains simulation results. We conclude with a discussion in Section 5.

LSS Model to Describe Boolean Interactions

In this section, we introduce the necessary notation and a precise mathematical definition of the LSS model. To this end, for an integer p, let [p] := {1, …, p}. For a finite set S ⊆ [p], let |S| denote its cardinality, i.e., the number of elements in S. For any event A, let 1{A} denote the indicator function of A. We assume a given dataset {(x_i, y_i), i = 1, …, n} of n samples, with x_i ∈ [0, 1]^p and y_i ∈ ℝ. We say that the data are generated from an LSS model when the following assumptions hold true. Assume (x_i, y_i) are i.i.d. samples from a distribution P(X, Y), such that for some fixed constants β_1, …, β_K and γ_j, the regression function takes the following form:

E[Y | X = x] = Σ_{k=1}^{K} β_k · 1{x_j ≤ γ_j for all j ∈ S_k},

where each inequality ≤ in the display above means either ≤ or ≥, potentially different for every k. The coefficients β_k are nonzero, and the thresholds γ_j lie strictly between 0 and 1. The sets S_1, …, S_K ⊆ [p] are sets of features called basic interactions. We associate a ≤-inequality with a negative sign (−1) and a ≥-inequality with a positive sign (+1), such that a signed feature can be written as a tuple (j, b) with b ∈ {−1, +1}. We call the corresponding signed sets S_k± = {(j, b_j) : j ∈ S_k} basic signed interactions. Note that for interactions with only one feature k, due to the sign ambiguity in the LSS model—i.e., β·1{x_k ≤ γ} = β − β·1{x_k ≥ γ}—both (k, −1) and (k, +1) are counted as an interaction. The LSS model aims to capture interactive thresholding behavior, which has been observed for various biological processes (18, 35–39). For example, in gene regulatory networks, often a few different expression patterns are possible. Switching between those patterns can be associated with individual components that interact via a threshold effect (36–38). Such a threshold behavior is also observed for other signal-transduction mechanisms in cells—e.g., protein kinase (35) and cell differentiation (18). Another example of a well-studied threshold effect is gene-expression regulation via small RNA (39).
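To make the model concrete, here is a minimal simulation sketch (our own illustration, not code from the paper), assuming a hypothetical LSS instance with two nonoverlapping basic interactions on features {1, 2} and {3, 4}, all thresholds γ = 0.5, ≤-inequalities throughout, and coefficients 0.4:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20_000, 10
X = rng.uniform(size=(n, p))  # C1: X uniform on [0, 1]^p

# Hypothetical LSS instance: basic interactions on features {1, 2} and
# {3, 4} (0-indexed columns 0,1 and 2,3), thresholds gamma = 0.5.
term1 = ((X[:, 0] <= 0.5) & (X[:, 1] <= 0.5)).astype(float)
term2 = ((X[:, 2] <= 0.5) & (X[:, 3] <= 0.5)).astype(float)
y = 0.4 * term1 + 0.4 * term2  # noiseless; |Y| <= 0.8, so C2 holds

# Each Boolean term is active on about a 2^(-|S_k|) = 1/4 fraction of samples.
print(term1.mean(), term2.mean())
```

The response is piecewise constant ("locally spiky"), with jumps exactly at the interaction thresholds, which is the behavior that violates Lipschitz-type smoothness assumptions.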
Although for most biological processes the precise functional mechanisms between different features and a response variable of interest are much more complicated than what the LSS model can capture, theoretical investigations of a particular learning algorithm, such as RF, are only feasible within a well-defined and relatively simple mathematical model, and they are useful for practice when such a model is empirically relevant. Given the empirically observed interactive threshold effects in many real biological systems, the LSS model provides a useful enrichment of the current state of theoretical studies of RF and related methods, since current theoretical models do not capture the often-observed interactive threshold behavior. In order to prove our main Theorem 2, we further impose the following constraints on the LSS model. C1 (Uniformity): X is uniformly distributed on [0, 1]^p. This uniformity assumption implies that the features are mutually independent. Because any decision tree remains invariant under any strictly monotone transform of an individual feature, the uniform distribution assumption on X can be relaxed to the assumption that the individual features X_j are independent with distributions that have a Lebesgue density. We note that such an independence assumption might be violated in real-world problems. For example, for genetic data with single-nucleotide polymorphisms or gene expression as features X, there will typically be strong correlation between features that are located close to one another on the chromosome. However, in many cases, it is feasible to restrict to a subset of features (e.g., those that are located sufficiently far apart on the genome) in order to obtain approximate independence. In Section 4, we also demonstrate in simulations that, for sufficiently weak feature correlation, one can still obtain accurate interaction selection with LSSFind. C2 (Bounded Response): Y is bounded—i.e., |Y| ≤ 1.
Note that although we assume |Y| ≤ 1, the constant 1 can be changed to any other constant, as we can scale Y by any positive number, and the conclusions in our main results will remain intact. This boundedness condition can be further relaxed so that the residual is independent of X and 1-subgaussian, if we impose a slightly stronger assumption on p and n than the conditions in C4. C3 (Nonoverlapping Basic Interactions): The basic interactions S_1, …, S_K do not overlap—i.e., S_k ∩ S_l = ∅ for k ≠ l. The nonoverlapping assumption, that different interactions are disjoint, might not always be justified in real-world problems. However, it is a crucial assumption for our theorem to hold. The general problem with overlapping interactions in the LSS model is that such models can be nonidentifiable, meaning that different representations can imply the same regression function. For example, the response Y = 1{X1 ≤ γ} + 1{X1 ≤ γ}·1{X2 ≤ γ} has, by the definition of signed interactions in the LSS model, the two basic signed interactions {(1, −1)} and {(1, −1), (2, −1)}. However, we can also write it as Y = 2·1{X1 ≤ γ} − 1{X1 ≤ γ}·1{X2 ≥ γ}, which has the two different basic signed interactions {(1, −1)} and {(1, −1), (2, +1)}. This means that a set of signed features that is an interaction in one of the representations is not an interaction in the other. Due to this identifiability problem, overlapping features can lead to both false positives and false negatives in terms of interaction recovery with RF. One may try to define interaction more broadly to avoid this identifiability problem. For the previous example, although the basic signed interactions are not unique, they always involve both X1 and X2. Whether the coefficients are allowed to have different signs also affects identifiability: the previous example is identifiable if we only allow positive coefficients. For domain problems where interactions are believed to be overlapping, one should investigate different identifiability conditions; but as this depends on the precise application, we leave it for future work.
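The nonidentifiability caused by overlap can be checked numerically. The sketch below (our own illustration, with threshold γ = 0.5 chosen arbitrarily) verifies that one representation with overlapping interactions {1} and {1, 2}, namely 1{X1 ≤ γ} + 1{X1 ≤ γ}·1{X2 ≤ γ}, coincides with an alternative representation, 2·1{X1 ≤ γ} − 1{X1 ≤ γ}·1{X2 > γ}, whose signed interactions differ:

```python
import numpy as np

rng = np.random.default_rng(1)
x1, x2 = rng.uniform(size=10_000), rng.uniform(size=10_000)
g = 0.5

a = (x1 <= g).astype(float)   # 1{X1 <= g}
b = (x2 <= g).astype(float)   # 1{X2 <= g}

rep1 = a + a * b              # signed interactions {(1,-1)} and {(1,-1),(2,-1)}
rep2 = 2 * a - a * (x2 > g)   # signed interactions {(1,-1)} and {(1,-1),(2,+1)}

# The two representations define the same regression function
# (they differ only on the measure-zero event {X2 = g}).
assert np.allclose(rep1, rep2)
```

Algebraically, 2·a − a·(1 − b) = a + a·b, which is why no amount of data can distinguish the two representations.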
Our work in this paper provides a pathway to investigate this in detail later. We demonstrate how overlapping features affect our results with a simulation study in Section 4. In Section 3, we show that a simple algorithm, LSSFind, that takes an RF tree ensemble as input can consistently recover basic interactions in the LSS model. Besides recovering the basic interactions S_1, …, S_K, LSSFind can also recover the sign of each feature in the LSS model, which indicates whether the corresponding threshold behavior is given by a ≤- or a ≥-inequality. Without loss of generality, in the rest of the paper, we assume that all inequalities in the LSS model are ≤-inequalities. We stress, however, that all our results also hold in the general case. Because we assume that all the features in basic interactions have minus signs, we denote S_k± = {(j, −1) : j ∈ S_k} as the basic signed interactions of the LSS model. As our theoretical results will show, the RF tree ensemble can recover not only the basic interactions S_k, but also the basic signed interactions S_k±. In other words, through DWP and under the LSS model, the RF tree ensemble can recover not only which features interact with each other in the LSS model, but also whether a particular feature in an interaction has to be larger or smaller than some threshold for this interaction to be active. Besides basic signed interactions, we also define a “union signed interaction” as a union of individual basic signed interactions, as made precise in the following definition. (Union Signed Interactions): In the LSS model with basic signed interactions S_1±, …, S_K±, a (nonempty) set of signed features S± is called a union signed interaction if S± = ∪_{k ∈ I} S_k±, possibly augmented as described below, for some (possibly empty) set of indices I ⊆ [K]. In other words, a union signed interaction is a union of one or more basic signed interactions. For a single-feature signed interaction, its sign-flipped counterpart can also be added to the union.
For example, for an LSS model with basic signed interactions S_1± = {(1, −1)} and S_2± = {(2, −1), (3, −1)}, there are two basic signed interactions and five union signed interactions—namely, {(1, −1)}, {(2, −1), (3, −1)}, {(1, −1), (2, −1), (3, −1)}, {(1, −1), (1, +1)}, and {(1, −1), (1, +1), (2, −1), (3, −1)}. The theoretical results that we present in Section 3 are asymptotic, in the sense that they assume the sample size n goes to infinity. Denote the number of signal features in the LSS model by s—i.e., s = |S_1 ∪ ⋯ ∪ S_K|. We assume s is uniformly bounded, regardless of n and p. However, the overall number of features p, and hence the number of noisy features p − s, can grow to infinity as n increases. Our theoretical results also assume C4 (Sparsity): log(p)/n → 0 as n → ∞. This means that, in contrast to many theoretical works (16, 17, 40), our results hold in a high-dimensional setting, as long as the overall number of signal features s is bounded. The limit log(p)/n → 0 is a common assumption for high-dimensional settings when analyzing consistency properties of Lasso (see, for instance, refs. 41–43).

DWP for an RF Tree Ensemble

In this section, we first review the RF algorithm and then define DWP for a given RF tree ensemble.

Review of RF.

RF is an ensemble of classification or regression trees, where each tree T defines a mapping from the feature space to the response. Trees are constructed on a bootstrapped or subsampled dataset of the original data. Note that the trees are conditionally independent of one another, given the data. Any node t in a tree T represents a hyper-rectangle R_t in the feature space. A split of the node t is a pair (j, τ), which divides the hyper-rectangle R_t into the two hyper-rectangles R_t ∩ {x_j ≤ τ} and R_t ∩ {x_j > τ}, corresponding to the left child t_L and right child t_R of node t, respectively. For a node t in a tree T, n_t denotes the number of samples falling into R_t. Each tree T is grown using a recursive procedure (the CART algorithm (2)), which proceeds in two steps for each node t. First, a subset M of features is chosen uniformly at random; the size of M is m. Then, the optimal split is determined by maximizing the impurity decrease

Δ(j, τ; t) = I(t) − (n_{t_L}/n_t)·I(t_L) − (n_{t_R}/n_t)·I(t_R),

where t_L (t_R) is the left (right) child of t, and the impurity measure used in this paper is

I(t) = (1/n_t) · Σ_{i: x_i ∈ R_t} (y_i − ȳ_t)²,

which is the variance of the responses y_i for all the samples in the region R_t (with ȳ_t their mean). Note that the analysis of this paper holds only for the variance impurity measure, but it may be possible to extend it to other impurity measures, which is left as future work. The procedure terminates at a node t if the two children would contain too few samples or if all responses in the node are identical. For any tree T and any leaf node, denote by P the path from the root to that leaf node. (Depth of a Path): Given a path P that connects the root node and a leaf node in a tree T, we define the depth of the path to be the number of nonroot nodes contained in the path. For any hyper-rectangle R, vol(R) denotes its volume. We make the following assumptions on an RF tree ensemble: A1 (Increasing Depth of a Tree in the RF Ensemble): The minimum depth of any path in any tree goes to infinity as n → ∞. A2 (Balanced Split in a Tree of the RF Ensemble):
Each split is balanced: for any node t, min(n_{t_L}, n_{t_R})/n_t is bounded below by a constant δ > 0. Note that, without loss of generality, we use the same δ here as in the LSS model; otherwise, we can always take δ to be the minimum of the two. A3 (m Is of Order p): m = c·p, where c ∈ (0, 1) is a constant. A4 (No Bootstrap or Subsampling of Samples): All the trees in the RF are grown on the whole dataset without bootstrapping or subsampling. A4 is a technical assumption that simplifies our notation and analysis. We assume that each tree is grown using all of the samples, which is quite different from the assumptions on subsampling in recent theoretical works on RF (e.g., refs. 15 and 17). The subsampling rate plays a crucial role in the analysis of the asymptotic distribution of the RF predictor (15, 17), where it is assumed that the subsampling rate converges to zero at a suitable rate. However, since we focus on the features selected at each node, and not on the asymptotic distribution of the predictor, we do not require such assumptions on the subsampling rate. A1 ensures that the length of any decision path in any tree tends to infinity. This assumption is reasonable, as tree depths in RF are usually of order log(n), which tends to infinity as n → ∞. A2 ensures that each node split is balanced; similar conditions are commonly used in other papers (17). A3 highlights the important role of the parameter m. Roughly speaking, m can be neither too small nor too big. When m is too small, there will be too many splits on irrelevant features, which makes the trees noisy. When m is too big, there will be too little variability in the tree ensemble. This intuition is made rigorous in the proof of Theorem 2.
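The CART split search described above (draw a feature subset, then maximize the variance impurity decrease over candidate thresholds) can be sketched as follows. Function names and the exhaustive threshold scan are our own simplifications of the standard algorithm:

```python
import numpy as np

def impurity(y):
    # Variance impurity I(t): mean squared deviation of the responses in a node.
    return y.var()

def best_split(X, y, feature_subset):
    """Return (feature, threshold, decrease) maximizing the impurity decrease
    Delta(j, tau; t) = I(t) - (n_L/n_t) I(t_L) - (n_R/n_t) I(t_R)."""
    best = (None, None, -np.inf)
    for j in feature_subset:
        # Candidate thresholds: observed values of feature j (drop the largest
        # so both children are nonempty).
        for tau in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= tau
            delta = (impurity(y)
                     - left.mean() * impurity(y[left])       # n_L/n_t * I(t_L)
                     - (~left).mean() * impurity(y[~left]))  # n_R/n_t * I(t_R)
            if delta > best[2]:
                best = (j, tau, delta)
    return best

# Tiny check: the responses depend only on whether feature 0 is <= 0.5,
# so the best split should be found on feature 0.
X = np.array([[0.1, 0.9], [0.2, 0.1], [0.8, 0.3], [0.9, 0.8]])
y = np.array([1.0, 1.0, 0.0, 0.0])
j, tau, delta = best_split(X, y, [0, 1])
```

In an actual RF, `feature_subset` would be a uniformly random subset of size m, redrawn at every node; here it is passed in explicitly to keep the sketch deterministic.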

DWP.

In this section, for a tree ensemble from RF, we formally introduce DWP. Given a decision tree T in an RF tree ensemble, we can randomly select a path P of T as follows: We start at the root node of T and then, at every node, randomly go left or right until we reach a leaf node. This is equivalent to selecting a path P of depth D with probability 2^(−D) from all the paths in the decision tree. As such, any path P in a decision tree T can be associated with a sequence of signed features ((j_1, b_1), …, (j_D, b_D)), where D is the depth of the path and, for any inner node t on the path, the sign b indicates whether the path at node t followed the ≤ direction (b = −1) or the > direction (b = +1) for the split on feature j. For a given RF tree ensemble depending on the data, the randomly selected path P of tree T, and any fixed constant ϵ > 0, we now define S_ϵ(P) to be the set of signed features on P for which the corresponding node in the RF had an impurity decrease of at least ϵ. We use S_ϵ as a shorthand for S_ϵ(P) when the path P from tree T and the data of interest are clear. Note that if a feature appears more than once on the path P, its sign in S_ϵ(P) is the sign from the first time the feature appears with impurity decrease above the threshold. Our main theorem will be stated in terms of the DWP of a signed feature set S± on the random path within S_ϵ(P). To formally define the DWP of S±, we first need to identify the sources of randomness underlying S_ϵ(P). There are three layers of randomness involved: (Data randomness): the randomness involved in the data generation; (Tree randomness): the randomness involved in growing an individual tree with parameter m, given the data; (Path randomness): the randomness involved in selecting a random path of depth d with probability 2^(−d), given a tree T from an RF tree ensemble with parameter m based on the data.
In the following definition of the DWP of signed feature sets, the probability is conditioned on the data and taken only over the randomness of the tree T and the randomness of selecting one of its paths P. (DWP): Conditioning on the data, for any signed feature set S±, we define the DWP of S± as the probability that S± appears on the random path within the set S_ϵ(P)—that is, DWP(S±; ϵ) is the conditional probability that S± ⊆ S_ϵ(P), given the data. We emphasize that the probability of selecting a path P in a tree T is 2^(−d), where d is the depth of the path P. While we only have a fixed sample size, which means that the data randomness is inevitable, the tree randomness and path randomness are generated by the algorithm and can thus be eliminated by sampling as many trees and paths as we like. Because the DWP is conditioned only on the data, for any given ϵ and set of signed features S±, it can be computed with arbitrary precision from an RF tree ensemble with sufficiently many trees (recall that, conditioned on the data, the different trees in an RF tree ensemble are generated independently).
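Because DWP involves only algorithmic randomness given the data, it can be estimated by sampling random root-to-leaf paths. Below is a minimal Monte Carlo sketch on two toy depth-2 trees mirroring Fig. 1; the tuple encoding of trees is our own assumption, and the ϵ-filter on impurity decrease is omitted for simplicity:

```python
import random

# Toy encoding (our assumption): an internal node is a tuple
# (split_feature, left_subtree, right_subtree); a leaf is None.
# These two trees mirror Fig. 1: one splits X1 at the root, the other X2.
trees = [(1, (2, None, None), None),
         (2, (1, None, None), None)]

def random_path(tree, rng):
    """Collect the signed features on a random root-to-leaf path, going left or
    right with probability 1/2 at each node, so that a path of depth D is
    selected with probability 2^(-D)."""
    feats, node = set(), tree
    while node is not None:
        j, left, right = node
        if rng.random() < 0.5:
            feats.add((j, -1))   # followed the <= direction
            node = left
        else:
            feats.add((j, +1))   # followed the > direction
            node = right
    return feats

def dwp(trees, s_signed, n_paths=20_000, seed=0):
    # Monte Carlo estimate of DWP(S+/-): the fraction of random paths (over a
    # uniformly chosen tree) whose signed-feature set contains S+/-.
    rng = random.Random(seed)
    return sum(s_signed <= random_path(rng.choice(trees), rng)
               for _ in range(n_paths)) / n_paths

# For the basic signed interaction {(1,-1),(2,-1)}, the estimate is
# close to the universal bound 2^(-2) = 1/4.
```

An estimate for a non-interaction set such as {(1, −1), (2, +1)} falls strictly below its bound 2^(−2), which is the gap the theory exploits.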

Main Results

In this section, we present our main theoretical results, which concern the DWP introduced in the previous section. Our results show that LSSFind (Algorithm 1), which is based on DWP at an appropriate level ϵ described in Theorem 3, consistently recovers signed interactions under an LSS model. Before we state our main results in full detail, we illustrate them with a simple example.

Algorithm 1: LSSFind
Input: Dataset, RF hyperparameter m, impurity threshold ϵ, prevalence threshold η, and maximum interaction size d_max.
Output: A collection of sets of signed features.
Train an RF on the dataset with parameter m; return all sets of signed features S± such that |S±| ≤ d_max and DWP(S±; ϵ) is within η of the universal upper bound 2^(−|S±|).

Illustrative Example: Assume that there are just two features X1 and X2, there is a single interaction S_1 = {1, 2}, and the regression function is given by E[Y | X = x] = β·1{x1 ≤ 0.5}·1{x2 ≤ 0.5} for some β > 0. The response surface is shown in Fig. 1, Upper. We consider the population case, where we have full access to the joint distribution P(X, Y)—that is, we have access to an unlimited amount of data. When we apply the RF algorithm as in Section 2, then, for each individual tree in the forest, the root node either splits on feature X1 or on feature X2. Since X1 and X2 are completely symmetric in the distribution P(X, Y), if the RF algorithm grows more and more trees, in the limit, half of them will split on X1 at the root node and half of them will split on X2 at the root node. For infinite data, this 50/50 split is produced by the CART algorithm, since the two splits have identical decreases in impurity. Furthermore, the split at any node will be at 0.5 for either of the two features, since the splits at x1 = 0.5 and x2 = 0.5 maximize the impurity decrease given infinite data. This is illustrated in Fig. 1, where Lower Left shows a tree that splits on feature X1 at the root node, and Lower Right shows a tree that splits on feature X2 at the root node.
As each tree in RF grows to purity, when the root node splits on feature X1, then, for the path of the tree that follows the x1 > 0.5 direction—that is, the (1, +1) direction—the tree will stop growing, as the respective response surface is already constant. However, for the path of the tree that follows the x1 ≤ 0.5 direction—that is, the (1, −1) direction—the tree will further split on the remaining feature X2, after which it stops because each node has reached purity. Thus, we conclude that the forest consists of exactly the two different trees shown in Fig. 1, and, in the limit where the number of trees grows to infinity, each of the two trees appears equally often.
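Putting the pieces together, an LSSFind-style filter over candidate signed sets can be sketched as follows. This is our own simplification: the DWP is estimated by path sampling on the two toy trees of Fig. 1, the impurity threshold ϵ is omitted, and the acceptance rule (estimated DWP within η of the bound 2^(−|S±|)) is our reading of Algorithm 1:

```python
from itertools import combinations, product
import random

# Toy encoding (our assumption): internal node = (feature, left, right),
# leaf = None. The two trees of the Fig. 1 example.
trees = [(1, (2, None, None), None),
         (2, (1, None, None), None)]

def sample_signed_set(tree, rng):
    # Signed features collected on one random root-to-leaf path.
    feats, node = set(), tree
    while node is not None:
        j, left, right = node
        if rng.random() < 0.5:
            feats.add((j, -1)); node = left
        else:
            feats.add((j, +1)); node = right
    return feats

def lss_find(trees, features, d_max, eta, n_paths=40_000, seed=1):
    """LSSFind-style filter: keep every signed set S with |S| <= d_max whose
    estimated DWP reaches the universal bound 2^(-|S|) up to slack eta."""
    rng = random.Random(seed)
    paths = [sample_signed_set(rng.choice(trees), rng) for _ in range(n_paths)]
    out = []
    for size in range(1, d_max + 1):
        for fs in combinations(features, size):
            for signs in product((-1, +1), repeat=size):
                s = set(zip(fs, signs))
                est = sum(s <= path for path in paths) / n_paths
                if est >= 2 ** (-size) - eta:
                    out.append(frozenset(s))
    return out
```

On this toy forest, only the basic signed interaction {(1, −1), (2, −1)} survives the filter: the singletons reach only 3/8 < 1/2 and the other size-2 sign patterns reach at most 1/8 < 1/4.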
Fig. 1.

Exemplary RF decision trees trained on data as in Eq. 9 to illustrate the results that will appear in Theorem 2. (Upper) Response surface of the regression function, as in Eq. 2, with x1 on the x axis and x2 on the y axis. (Lower Left) A decision tree that splits on feature X1 at the root node, with the respective regions and conditional response surfaces for the left and right child of the root node. (Lower Right) A decision tree that splits on feature X2 at the root node. The red-marked decision paths contain all signed features from the basic signed interaction S_1± = {(1, −1), (2, −1)} of the LSS model, as in Eq. 9. For both of the trees, if one starts at the root node and randomly goes left or right at every node, then the probability of the basic signed interaction appearing on the path is 2^(−2) = 1/4. In contrast, for any other set of signed features S±, the corresponding probability is strictly smaller than 2^(−|S±|). This provides a simple example of the more general result in Theorem 2.

For each node t in these trees, the impurity decrease is bounded away from zero. Thus, for any sufficiently small ϵ > 0, we can show that the DWP of the basic signed interaction S_1± = {(1, −1), (2, −1)} equals 1/4: each of the two trees appears half of the time, and, within each tree, the red-marked path carrying S_1± is selected with probability 1/4, so DWP(S_1±; ϵ) = (1/2)·(1/4) + (1/2)·(1/4) = 1/4. In the above example with infinite data, the tree depth does not go to infinity, which means it does not satisfy A1. A1 is needed only for the finite-sample case because, for finite samples, internal nodes in a tree can never reach purity, due to noise. In Fig. 1, the paths that contain the basic signed interaction are marked red. For all other sets of signed features S±, it is easy to check that DWP(S±; ϵ) is strictly smaller than 2^(−|S±|). For example, DWP({(1, −1)}; ϵ) = (1/2)·(1/2) + (1/2)·(1/4) = 3/8 < 1/2, and DWP({(1, −1), (2, +1)}; ϵ) = (1/2)·(1/4) + (1/2)·0 = 1/8 < 1/4. As we will formally state in the two theorems below, the same reasoning holds true asymptotically for any RF trained on data from the LSS model—namely, the DWP of a set of signed features S± is always upper-bounded by 2^(−|S±|), and this upper bound is attained if and only if S± is a union signed interaction. Recall that the DWP depends on the data.
It turns out that the general upper bound follows directly from the construction of DWP and holds for any data, that is, independently of the LSS model, as the following theorem shows.

Theorem 1 (General upper bound). For any impurity threshold ϵ and any set of signed features S±, the RF algorithm from the previous section satisfies DWP(S±) ≤ 2^{-|S±|}.

In addition, when the data are generated from an LSS model, asymptotically (as the sample size increases), the general upper bound is attained if and only if S± is a union signed interaction, as the following theorem shows.

Theorem 2. Assume that the data are generated from an LSS model with uniformity, bounded-response, nonoverlapping basic interactions, and sparsity constraints (see C1–C4). For any sufficiently small impurity threshold ϵ, let δ > 0 be a constant depending only on ϵ, the sparsity s (as in C4), and the bound C (as in C3). Given a set of signed features S±, for the RF algorithm from the previous section: (Interaction lower bound) When S± is a union signed interaction, DWP(S±) converges in probability to 2^{-|S±|}. (Noninteraction upper bound) When S± is not a union signed interaction, then, with probability approaching one, DWP(S±) stays below 2^{-|S±|} by a margin of at least δ.

The detailed proof of Theorem 2 is deferred to the supplementary material. It has two major parts: first, showing the assertion for the idealized population case and, second, extending the population case to the finite-sample case. In the first part, we define a population version of the set of signed features collected along a path. This population set contains only desirable features, that is, features of a path that correspond to a positive decrease in impurity if the RF gets to see the full distribution (not just a finite sample). Note that desirable/nondesirable features are different from signal/noisy features: the definition of desirable/nondesirable depends on the path in question. A noisy feature is always a nondesirable feature, but a signal feature can become nondesirable once it has already been split on along the path. The population set is an oracle, in the sense that its construction depends on the true underlying LSS model.
This is in contrast to the observed set, which can be computed for any given path of an RF tree. Given this definition, a sketch of the proof of the major assertions of Theorems 1 and 2 is as follows: When a set of signed features S± appears on a path, then every time one of its features appeared on the way from the root node to the leaf, the splitting direction implied by its sign must have been selected, which gives rise to the general upper bound of 2^{-|S±|} (Theorem 1). If S± is a union interaction, then (assuming all leaf nodes of the tree are pure) a correct splitting direction for each of its features already implies that S± appears on the path, so the bound is attained (first part of Theorem 2). If S± is not a union interaction, then there is always the possibility that, although every encountered feature of S± was split in the correct direction, some features in S± were simply never encountered; a correct splitting direction therefore does not imply that S± appears on the path, and the bound is not attained (second part of Theorem 2). In the second part of the proof, we show that the observed set and the oracle (population) set coincide with probability going to one as ϵ goes to zero and n goes to infinity. This would be easy if a tree grown on finite samples converged, in terms of splitting features and thresholds, to the tree grown on the population as the sample size tends to infinity. However, that is not true. The obstacle is that, when a node splits on a nondesirable feature, all thresholds yield the same impurity decrease in the population case, so the threshold selected from finite samples can deviate from the population threshold no matter how many samples are used. Thus, we need to carefully analyze desirable and nondesirable features separately, based on uniform convergence results. Theorems 1 and 2 demonstrate that recovery of interactions becomes exponentially more difficult as the size of an interaction increases.
An interaction of order s corresponds to a region of probability mass on the order of 2^{-s}, which means the sample size must be much larger than 2^s to have enough samples in that region. Also, the DWP at an appropriate level ϵ of a basic interaction of size s is 2^{-s}. To estimate it consistently, the number of independent paths should be much larger than 2^s. Thus, when one wants to recover an interaction of size s, both the number of samples and the number of trees must be much larger than 2^s. That shows the intrinsic difficulty of estimating high-order interactions. Using the conclusions in Theorem 2, one can show that LSSFind (Algorithm 1) consistently recovers all the basic interactions from the LSS model, as stated in Theorem 3.

Theorem 3. Let the output of LSSFind with maximal order s_max be the collection Ŝ. Under the same settings as in Theorem 2, as long as the threshold η satisfies a condition that involves only the sparsity s and the bound C (as in C3), then, with probability approaching one, Ŝ is a superset of the basic signed interactions of size at most s_max and a subset of the union signed interactions. In particular, if one removes from Ŝ every set that strictly contains another element of Ŝ, the result equals the set of basic signed interactions of size at most s_max.

Note that, to recover all the basic interactions, s_max needs to be at least the order of every basic signed interaction, but the latter is unknown, as we do not know the underlying LSS model. If a set of signed features is not a union signed interaction, then it follows from the second part of Theorem 2 that its DWP falls below the selection threshold with probability approaching one; thus, Ŝ is a subset of the union signed interactions. If a set is a basic signed interaction of size at most s_max, then it follows from the first part of Theorem 2 that its DWP exceeds the threshold with probability approaching one; thus, Ŝ is a superset of the basic signed interactions of size at most s_max. One important assumption in our theorems is the sparsity of signal features. If there are many "weak" signal features, it is very hard for RF to work well: at each node of a tree, only one feature is used for splitting.
That means the total number of features used along each path is limited by the depth of the tree, which is usually of order log n. For our assertions in Theorem 2, the hard threshold ϵ serves to select the signal features. Clearly, choosing an appropriate value of ϵ is hard in practice. The fitting procedure in iRF (11) (which uses joint prevalence on decision paths in RF to recover interactions, similar to what Theorem 2 suggests) filters noisy features not with a hard, but with a soft thresholding procedure: It grows several RFs iteratively and samples features at each node according to their feature importance from the previous iteration. In that way, one does not need to choose a single hard threshold, which leads to a much more practical algorithm. Unfortunately, such iterative soft thresholding makes theoretical analysis much harder, which is why we restrict to the hard threshold for the theoretical analysis in this work. One remarkable aspect of the result in Theorem 3 is that the range of η is independent of any model coefficients in the LSS model (that is, the linear β coefficients and the γ thresholds). For sufficiently small ϵ, it depends only on the number of signal features s and the bound C, and nothing else. In a sense, this shows that the tree ensemble of an RF contains the qualitative or discrete-set information of which features interact with each other, independently of the quantitative information about the numerical parameters or model coefficients of the LSS model. Another interesting aspect of the results from Theorem 3 is that they shed some light on the influence of m on the interaction recovery performance of RF. For the third assertion in Theorem 2, we actually prove a quantitative bound on the DWP gap of noninteractions. When m is too large, this gap can get very small, as particularly strong features (large initial impurity decrease) can mask weaker features.
As an extreme example, consider the situation where m = p, so the root node gets to see all the features. In that case, the single feature with the highest impurity decrease, say, X1, will always appear at the root node; hence, for S± = {X1+} or S± = {X1-}, one gets DWP(S±) = 1/2, independently of whether S± is an interaction or not. This shows that when m is too large, DWPs corresponding to false interactions can attain the universal upper bound 2^{-|S±|}, which leads to false positives in terms of interaction recovery. On the other hand, when m is too small, it can take a long time until a signal feature gets selected into the candidate feature set at a node. In particular, for a finite sample, the tree can reach purity, due to lack of samples, without having split on any of the signal features. Hence, the reasoning of Theorem 2, namely, that a correct split direction plus a pure path implies that a union interaction appears on the path, no longer holds. This can lead to union interactions having significantly smaller DWP than the universal upper bound, that is, false negatives in terms of interaction recovery.
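As a numerical sanity check of the sample-complexity discussion above (our own illustration, not from the paper): with thresholds at the median, the region on which an order-s interaction is active carries probability mass 2^{-s}, so the expected number of samples ever landing in it is n·2^{-s}.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200_000, 8
X = rng.uniform(size=(n, p))

fracs = {}
for s in (2, 4, 8):
    # Fraction of samples inside the order-s hyper-rectangle (0.5, 1]^s:
    # it shrinks like 2**-s, so recovery needs n >> 2**s samples.
    fracs[s] = float(np.all(X[:, :s] > 0.5, axis=1).mean())
```

Already at s = 8, only about 0.4% of the samples fall in the active region, which is why both n and the number of trees must grow exponentially in the interaction order.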

LSSFind and Simulation Results

In this section, motivated by our theoretical results in the previous section, we evaluate LSSFind empirically in terms of its ability to recover interactions. Simulated experiments are carried out to assess the ability of LSSFind to correctly recover interactions from the LSS model, even when some of the LSS model assumptions are violated. In LSSFind, one needs to search over all possible sets of signed features up to a maximal order s_max to obtain the final result, which is computationally very intensive. A more efficient alternative is to consider only sets of size at most s_max whose DWP exceeds a threshold; because many small sets are filtered out this way, the search space shrinks significantly. We use the FP-growth algorithm (44) to obtain those sets of signed features whose DWP is higher than the threshold. Note that the DWP is defined with respect to an infinite number of trees. To approximate it, we use 100 trees in the simulation. Since each tree contains thousands of paths, we have hundreds of thousands of paths from which to estimate the DWP.
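To make the candidate-filtering step concrete, here is a minimal brute-force version operating on leaf paths harvested from an ensemble. Each path is a set of signed features together with its leaf depth, and DWP is estimated as the 2^{-depth}-weighted count averaged over trees; in practice, FP-growth replaces the exhaustive loop. All names (`candidates`, `paths`) are ours.

```python
from itertools import combinations

def candidates(paths, n_trees, smax, thresh):
    """paths: list of (signed_feature_set, leaf_depth) pairs pooled over the
    ensemble. Returns all signed sets of size <= smax whose estimated DWP
    (the 2**-depth mass of leaves containing them, averaged over trees)
    clears `thresh`. Brute force stands in for FP-growth here."""
    items = sorted(set().union(*(s for s, _ in paths)))
    out = {}
    for k in range(1, smax + 1):
        for combo in combinations(items, k):
            S = frozenset(combo)
            est = sum(2.0 ** (-d) for s, d in paths if S <= s) / n_trees
            if est >= thresh:
                out[S] = est
    return out

# One depth-2 tree: root splits X0, both children split X1 (four leaves).
paths = [
    ({(0, -1), (1, -1)}, 2), ({(0, -1), (1, +1)}, 2),
    ({(0, +1), (1, -1)}, 2), ({(0, +1), (1, +1)}, 2),
]
found = candidates(paths, n_trees=1, smax=2, thresh=0.2)
# {X0+, X1+} attains the upper bound 2**-2 = 0.25 in this toy tree.
```

The threshold prunes most size-k sets before larger supersets are ever enumerated, which is exactly the saving FP-growth exploits at scale.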

Simulated Data from LSS Models.

In the following, we present simulation results, where we generate data from the LSS model for different numbers and orders of basic interactions and different signal-to-noise ratios (SNRs). We find that LSSFind recovers the true interactions from the LSS model with high probability whenever the overall number of basic interactions and their orders are small. More precisely, we consider p = 20 features and n = 1,000 samples, where each feature is generated from a uniform distribution on [0, 1], independently of one another. The number of basic interactions is denoted by K, and the order of each interaction by L. We use the same threshold τ for all features; the noise is Gaussian, and the response is the sum of the K Boolean interaction terms plus noise, as in Eq. 2. We consider different values for K, L, and the noise variance, namely, K = 1, 2; L = 2, 3, 4; and noise levels such that the SNR is 0.5, 1, 2, or 5. For a given K and L, the threshold τ is chosen such that about 50% of the samples fall into the union of hyper-rectangles. As the number of samples falling into this union (which can roughly be thought of as the label imbalance) has a high impact on the results, keeping it the same across simulation settings makes the outcomes more comparable. The results are averaged across 40 independent Monte Carlo runs. We grow RFs using the scikit-learn package with 100 trees and apply LSSFind with fixed choices of its threshold parameters and a maximal order slightly larger than L. If the maximal order were set exactly to L, the candidate condition would be too restrictive in challenging situations, such as when the LSS model is violated, and LSSFind could end up finding no interactions. Given the set of K true basic signed interactions from the respective LSS model and the output of LSSFind, we evaluate their proximity based on their Jaccard distance; note that any element of either collection is a set of signed features.
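The simulation design above can be sketched as follows. The helper name `make_lss_data` is ours, and the closed-form choice of τ (so that the union of disjoint hyper-rectangles covers roughly 50% of samples) is one way to implement the calibration described in the text.

```python
import numpy as np

def make_lss_data(n=1000, p=20, K=2, L=2, snr=1.0, beta=1.0, seed=0):
    """Data from an LSS model: p independent Uniform(0,1) features, K basic
    interactions of order L on disjoint feature blocks, a common threshold
    tau, and Gaussian noise scaled to the target SNR."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, p))
    # Solve P(union of rectangles) = 1 - (1 - (1 - tau)**L)**K = 0.5 for tau.
    tau = 1.0 - (1.0 - 0.5 ** (1.0 / K)) ** (1.0 / L)
    signal = np.zeros(n)
    for k in range(K):
        block = slice(k * L, (k + 1) * L)   # disjoint blocks (nonoverlap, C3)
        signal += beta * np.all(X[:, block] > tau, axis=1)
    sigma = np.sqrt(signal.var() / snr)     # SNR = Var(signal) / sigma**2
    y = signal + sigma * rng.normal(size=n)
    return X, y, tau
```

For K = 2 and L = 2, this gives τ ≈ 0.46, and about half of the samples activate at least one interaction, matching the 50% calibration in the text.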
This score gives no credit for partial recovery: If one true interaction is, say, {X1+, X2+, X3+}, there is no credit for output that contains only subsets of it, such as {X1+, X2+}, or the same features with different signs, such as {X1+, X2+, X3-}. While this score can be overly restrictive for practical problems, it is suitable for our simulation because we would like to evaluate whether LSSFind can consistently recover the interactions in the LSS model. The simulation results are shown in the accompanying figure. In general, the performance of LSSFind degrades sharply as the number of basic interactions and the order of interactions increase. For K = 1 and L = 2, 3, 4, LSSFind almost always recovers the correct basic signed interactions. For K = 2 and L = 2, it mostly recovers the correct basic signed interactions, except for small SNR. When K = 2 and L = 3, 4, LSSFind rarely recovers the basic signed interactions in this simulation setup, resulting in a score of almost zero. This is consistent with our results in Theorem 2, which indicate that the problem is much harder for more interactions and higher-order interactions. We also explored the high-dimensional case, scaling p and n jointly so that the sparsity condition holds and so that p = 20 corresponds to n = 1,000, matching our previous numerical setting for better comparison. Recall that Theorems 2 and 3 require a growth condition on the dimension, as stated in condition C4; similar conditions are commonly imposed when analyzing lasso problems, too (41–43). The score increases and approaches one as the dimension p increases. This is consistent with Theorem 3, which shows that LSSFind can recover the underlying interactions in the high-dimensional case.
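The strict score can be written down in a few lines; we implement it as the Jaccard similarity between the two collections of signed interactions, with exact matches only (the "Jaccard distance" in the text is one minus this). The function name is ours.

```python
def strict_score(found, truth):
    """Jaccard similarity between two collections of signed interactions.
    Each interaction is a set of (feature, sign) pairs; only exact matches
    count, so partial recovery earns zero credit."""
    F = {frozenset(s) for s in found}
    T = {frozenset(s) for s in truth}
    return len(F & T) / len(F | T) if F | T else 1.0

exact = strict_score([{(1, +1), (2, +1)}], [{(1, +1), (2, +1)}])
partial = strict_score([{(1, +1)}], [{(1, +1), (2, +1)}])  # subset: no credit
```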

Robustness to LSS Model Violations.

In the following, we present simulation results for LSSFind when the data are generated from a misspecified LSS model, that is, when some of the LSS model assumptions are violated. We find that the performance of LSSFind deteriorates when the LSS model is violated. We consider a misspecified LSS model with SNR = 5 and two order-2 interactions with p = 20 features and n = 1,000 samples, analogous to the setting above (second row, first column, third bar). We consider the following violations of the LSS model assumptions:

Overlapping interactions: Different basic interactions share features, violating the nonoverlap condition.

Correlated features: Features are correlated instead of independent, with the strength of dependence controlled by a correlation parameter.

Heavy-tail noise: The noise follows a Laplace or Cauchy distribution, which have heavier tails than (sub-)Gaussian distributions; the noise is normalized to match the target SNR.

Results of LSSFind are shown in the accompanying figure. For heavy-tail noise, we observe a gradual drop in performance. For the correlated-feature case, LSSFind performs reasonably when the correlation is close to zero, but its performance deteriorates when the correlation is high. Similarly, for the overlapping-interaction case, the performance worsens.
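Two of the three violations can be sketched as follows. These are our own parameterizations: the paper's exact correlation structure and overlap pattern are not reproduced here, so we use an equicorrelated Gaussian copula for dependent Uniform(0,1) features and standard Laplace/Cauchy draws for the heavy-tailed noise.

```python
import math
import numpy as np

# Standard normal CDF, vectorized via math.erf (avoids a SciPy dependency).
_phi = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def correlated_uniforms(n, p, rho, rng):
    """Uniform(0,1) marginals with dependence from an equicorrelated
    Gaussian copula (assumed structure; pairwise latent correlation rho)."""
    cov = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    Z = rng.multivariate_normal(np.zeros(p), cov, size=n)
    return _phi(Z)   # CDF transform makes each marginal Uniform(0,1)

def heavy_tail_noise(n, kind, rng):
    """Laplace or Cauchy noise, heavier-tailed than (sub-)Gaussian."""
    if kind == "laplace":
        return rng.laplace(size=n)
    if kind == "cauchy":
        return rng.standard_cauchy(size=n)
    raise ValueError(f"unknown noise kind: {kind}")

rng = np.random.default_rng(2)
U = correlated_uniforms(5000, 3, rho=0.7, rng=rng)
eps = heavy_tail_noise(5000, "laplace", rng)
```

The copula keeps the uniform marginals assumed by C1 while breaking independence, so only the correlation assumption is violated at a time.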

Empirical Comparison between LSSFind and iRF.

Our original motivation to study DWP in RF tree ensembles came from the strong positive empirical evidence for iRF (11, 13). There are three major reasons why the full iRF procedure is hard to analyze theoretically. First, the iterative reweighting in iRF is based on the MDI feature importance metric. Analyzing MDI for the RF algorithm is a challenging task on its own; in particular, the MDI of noisy features in deep trees is known to be biased (6–8), which may propagate through the iterations of iRF and make a theoretical analysis very challenging. Second, the iRF procedure selects interactions from the paths of the RF tree ensemble via the RIT algorithm (12). Thereby, individual paths are weighted according to the number of observations that fall into their respective leaf nodes. This means that the selected feature interactions of iRF cannot be derived from the RF tree ensemble directly, but depend on the data in a more complex way. Third, the outer stability layer of iRF, where interactions are evaluated based on their consistent appearance among several bootstrap replications of the procedure, adds an additional layer of complexity for theoretical analysis. In order to still analyze the major aspects of iRF theoretically, we proposed the related LSSFind algorithm. Instead of iterative reweighting via MDI, LSSFind introduces a single hard threshold on the impurity decrease at individual tree nodes. Moreover, instead of selecting interactions via RIT, LSSFind is based on DWP, which is derived from the tree ensemble directly, without an additional data-dependent sampling scheme. In other words, although a high DWP in LSSFind does not exactly correspond to the RIT interaction selection strategy employed in iRF, both build on similar high-level quantities, namely, sets of stable features that often appear together on decision paths in an RF tree ensemble.
Therefore, our theoretical results on DWP and LSSFind provide evidence that the general interaction discovery strategy of iRF is theoretically justified. In the following, we complement our theoretical findings about LSSFind with an empirical comparison between iRF and LSSFind. We consider the same simulation setting as in Section A, but replace the very strict performance measure by a weaker one. Specifically, given the set of K true basic signed interactions from the respective LSS model and the output of LSSFind and iRF, respectively, we now evaluate proximity based on the Jaccard distance between the sets of unsigned features that appear in any of the detected versus any of the true interactions. While the stricter metric is more appropriate for evaluating the finite-sample validity of Theorem 2, the relaxed version is arguably of more practical interest, because it gives partial credit for interactions that are almost, but not perfectly, recovered. If the relaxed score is high, the features in the discovered interactions overlap with the features in the true interactions, which greatly narrows down the interaction search space and saves tremendous effort in subsequent analysis of a practical problem. For the iRF algorithm, we used the signed iRF algorithm (siRF) from the Python iRF package, with default parameter settings and a threshold of 0.5 on iRF's stability score for interaction selection, as recommended in ref. 13. Simulation results are shown in the accompanying figure. When the LSS model is relatively simple, for example, when it has only a single signed interaction (K = 1) or only a single feature per signed interaction (L = 1), iRF and LSSFind perform comparably (first row, and second row, first column).
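The relaxed metric compares only the unions of unsigned features. A minimal implementation (again written as a similarity, with our own function name):

```python
def relaxed_score(found, truth):
    """Jaccard similarity between the sets of *unsigned* features appearing
    in any recovered vs. any true interaction; partial recovery now earns
    partial credit, and signs are ignored."""
    F = {f for s in found for (f, _sign) in s}
    T = {f for s in truth for (f, _sign) in s}
    return len(F & T) / len(F | T) if F | T else 1.0

half = relaxed_score([{(1, +1)}], [{(1, +1), (2, +1)}])  # one of two features
```

Under this score, recovering {X1+} against the truth {X1+, X2+} earns 0.5 instead of the strict metric's 0.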
However, when the LSS model gets more complex, with several additive interactions (K > 1), each having more than one signed feature (L > 1) (second row, second and third columns), iRF outperforms LSSFind in terms of the relaxed metric. In summary, we find that iRF outperforms LSSFind in situations where the underlying LSS model is more complex and a flexible performance metric is chosen. This appears consistent with the fact that the iRF algorithm has seen empirical success on specific domain data problems (11), whereas LSSFind was constructed specifically to reflect our result in Theorem 2.

Discussion

Relevant statistical theory starts with a model that is a good approximation to reality; thus, it is important to derive theoretical results under a model that is scientifically motivated. Our proposed LSS model class provides such a family, reflecting the biological phenomenon of biomolecules interacting through thresholding. Also, analyzing RF-based algorithms under such models, rather than the smoothness classes common in the literature, can give insights into their empirical adaptivity. Our results show theoretically that the DWP of a set of features in an RF tree ensemble recovers high-order interactions under the LSS model and reasonable conditions on the RF hyperparameters. Moreover, the universality of an interaction's DWP in LSS models gives insights into the general difference between quantitative (e.g., prediction accuracy) and qualitative (e.g., interaction recovery) information extraction. In scientific problems, the latter is often of higher interest. Thus, this work narrows the gap between theory and practice for Boolean interaction discovery and is of general interest to statistics, data science, ML, and scientific fields such as genomics. Our theoretical analysis also gives some insight into tuning a crucial RF hyperparameter, m: Given an interaction of fixed size, the noninteraction DWP upper bound in Theorem 2 depends only on C, and C is only constrained by (A3). Therefore, one can find an optimal m, as a function of s and p, that minimizes this upper bound. If one-third of all features are signal features, that is, s = p/3, this recovers the default choice in standard RF implementations for regression, namely, m = p/3. However, when s is much smaller than p/3, the optimal choice from our theoretical results corresponds to m > p/3, which suggests that, in the presence of many noisy features, m should be larger than the default choice p/3. Further investigations through data-inspired simulations and theoretical analyses are needed.
One might wonder whether the form of interaction defined by the LSS model constitutes a particularly difficult or a particularly easy form of feature interaction. In general, there appears to be no clear (mathematical) answer to this question, as one cannot define feature interaction unambiguously for a generic (possibly discontinuous) regression function. For example, it is easy to check that for any multivariate function f, one can find (possibly discontinuous) univariate functions g, h1, ..., hp such that f(x1, ..., xp) = g(h1(x1) + ... + hp(xp)). We stress that we considered the LSS model in this work not because it defines a particularly easy form of interaction, but because of its biological relevance, as the thresholding relationships captured by the LSS model are observed in various biological data. Although the LSS model is motivated by biological phenomena, some of the assumptions that we made in order to derive our theoretical results might be difficult to justify directly in real-world problems, in particular, the independence condition C1 and the nonoverlapping interaction-set condition C3. Note, however, that in many application settings, it is possible to overcome these limitations by appropriate data preprocessing, e.g., decorrelating features (recall the discussion after condition C1). Nevertheless, for future work, it will be interesting to extend our results to a general LSS model (with possibly overlapping interaction sets and correlated features), or even to interaction models beyond Boolean interactions, in order to further close the gap between theory and practice. Finally, it will also be of interest to compare LSSFind and iRF with methods that, more generally, employ an ML black-box model to extract interactions. For example, when individual features are independent, as we assume in C1, one can use Monte Carlo methods (45) to estimate higher-order Sobol indices for the fitted ML model.

1.  Quantitative kinetic analysis of the bacteriophage lambda genetic network.

Authors:  Oren Kobiler; Assaf Rokney; Nir Friedman; Donald L Court; Joel Stavans; Amos B Oppenheim
Journal:  Proc Natl Acad Sci U S A       Date:  2005-02-22       Impact factor: 11.205

Review 2.  Tripping the switch fantastic: how a protein kinase cascade can convert graded inputs into switch-like outputs.

Authors:  J E Ferrell
Journal:  Trends Biochem Sci       Date:  1996-12       Impact factor: 13.807

3.  Positional information and the spatial pattern of cellular differentiation.

Authors:  L Wolpert
Journal:  J Theor Biol       Date:  1969-10       Impact factor: 2.691

Review 4.  Small RNAs establish gene expression thresholds.

Authors:  Erel Levine; Terence Hwa
Journal:  Curr Opin Microbiol       Date:  2008-11-18       Impact factor: 7.934

5.  Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?

Authors:  Wouter G Touw; Jumamurat R Bayjanov; Lex Overmars; Lennart Backus; Jos Boekhorst; Michiel Wels; Sacha A F T van Hijum
Journal:  Brief Bioinform       Date:  2012-07-10       Impact factor: 11.622

6.  A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks.

Authors:  Ashley Cliff; Jonathon Romero; David Kainer; Angelica Walker; Anna Furches; Daniel Jacobson
Journal:  Genes (Basel)       Date:  2019-12-02       Impact factor: 4.096

7.  Veridical data science.

Authors:  Bin Yu; Karl Kumbier
Journal:  Proc Natl Acad Sci U S A       Date:  2020-02-13       Impact factor: 11.205

8.  Integrative annotation of chromatin elements from ENCODE data.

Authors:  Michael M Hoffman; Jason Ernst; Steven P Wilder; Anshul Kundaje; Robert S Harris; Max Libbrecht; Belinda Giardine; Paul M Ellenbogen; Jeffrey A Bilmes; Ewan Birney; Ross C Hardison; Ian Dunham; Manolis Kellis; William Stafford Noble
Journal:  Nucleic Acids Res       Date:  2012-12-05       Impact factor: 16.971

9.  MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features.

Authors:  Peng Jiang; Haonan Wu; Wenkai Wang; Wei Ma; Xiao Sun; Zuhong Lu
Journal:  Nucleic Acids Res       Date:  2007-06-06       Impact factor: 16.971

10.  Conditional variable importance for random forests.

Authors:  Carolin Strobl; Anne-Laure Boulesteix; Thomas Kneib; Thomas Augustin; Achim Zeileis
Journal:  BMC Bioinformatics       Date:  2008-07-11       Impact factor: 3.169

