Literature DB >> 34222331

BowSaw: Inferring Higher-Order Trait Interactions Associated With Complex Biological Phenotypes.

Demetrius DiMucci^1,2, Mark Kon^1,3, Daniel Segrè^1,2,4,5,6.

Abstract

Machine learning is helping the interpretation of biological complexity by enabling the inference and classification of cellular, organismal and ecological phenotypes based on large datasets, e.g., from genomic, transcriptomic and metagenomic analyses. A number of available algorithms can help search these datasets to uncover patterns associated with specific traits, including disease-related attributes. While, in many instances, treating an algorithm as a black box is sufficient, it is interesting to pursue an enhanced understanding of how system variables end up contributing to a specific output, as an avenue toward new mechanistic insight. Here we address this challenge through a suite of algorithms, named BowSaw, which takes advantage of the structure of a trained random forest algorithm to identify combinations of variables ("rules") frequently used for classification. We first apply BowSaw to a simulated dataset and show that the algorithm can accurately recover the sets of variables used to generate the phenotypes through complex Boolean rules, even under challenging noise levels. We next apply our method to data from the integrative Human Microbiome Project and find previously unreported high-order combinations of microbial taxa putatively associated with Crohn's disease. By leveraging the structure of trees within a random forest, BowSaw provides a new way of using decision trees to generate testable biological hypotheses.

Entities: Chemical Disease Gene Species

Keywords: Boolean rules; complex phenotypes; decision tree; epistasis; high-order interactions; microbiome; random forest

Year: 2021 PMID： 34222331 PMCID： PMC8245782 DOI： 10.3389/fmolb.2021.663532

Source DB: PubMed Journal: Front Mol Biosci ISSN： 2296-889X

Introduction

The production of large biological data sets with high-throughput techniques has increased the utilization of supervised machine learning algorithms (Goodswen et al., 2021; Reel et al., 2021), including support vector machines (Yang et al., 2021), neural networks (Rampelli et al., 2021) and random forests (Dicker et al., 2021), to produce predictions of complex phenotypes (e.g., healthy vs. disease) from measurable traits (Cesario et al., 2021; Hughes et al., 2021; Marcos-Zambrano et al., 2021). These algorithms use measurements of relevant traits such as gene variants, the presence/absence of microbial taxa, or metabolic consumption variables as predictors. Categorical prediction of phenotypes is typically the end goal of these applications. However, an additional benefit of these algorithms is the potential to extract explanatory classification rules. In this context, a rule is defined as a Boolean function of a set of traits, such that the value of the function is 1 (true) when the traits are associated with a given phenotype. Identifying the relationships between the traits involved in classification rules may yield key insights into the biological processes associated with important phenotypes (Furqan and Siyal, 2016; Visscher et al., 2017). This realization is creating demand for methods that assist in the interpretation of supervised machine learning methods (Azmi et al., 2019; Nguyen et al., 2019; Le et al., 2020), especially when the measured traits may be causal agents of disease states, such as genetic variants or microbial taxa (LaPierre et al., 2019). Identifying classification rules associated with a phenotype of interest is valuable because these rules are likely to carry information about the causal mechanisms that generate the phenotype. Algorithms that are particularly valuable in this respect are those involving decision trees, such as random forests, since decision trees are easily interpretable (Brodley and Friedl, 1997). Decision trees are rule-based classifiers, where rules arise from a series of “yes-no” questions that can efficiently divide the data into categorical groups. In a biological context, such rules may arise from sets of genes whose simultaneous modulation could affect a phenotype, or sets of microbial species whose co-occurrence may be associated with a disease state. While in several cases it seems like disease phenotypes are uniquely associated with a single specific pattern [e.g., retinoblastoma (Knudson, 1971)], there is increasing evidence for cases in which multiple distinct patterns can be associated with (and potentially causing) the same high-level phenotype (Emily et al., 2009; Leem et al., 2014). A particular example we will explore in this work is the multiplicity of distinct microbial presence/absence patterns which may be associated with Crohn’s disease (Proctor et al., 2019). Crohn’s disease has five clinically defined sub-types (Reading, 2014) but studies of the associated microbiome do not usually indicate which form of Crohn’s disease a donor has been diagnosed with. Each sub-type of the disease may be associated with different microbes, each requiring different treatment regimes. As discussed later, we hypothesize that the different rules associated with a given phenotype label may be related to these different subtypes, with potential therapeutic implications. The fact that there may be multiple etiologies that generate the same or similar phenotypes complicates the straightforward interpretation of parameter coefficients or variable importance scores (Louppe, 2014; Wright et al., 2016). Uncovering the multiple interactions between predictive variables as they relate to phenotypic labels remains a challenging statistical endeavor, but one that is of paramount importance. In an ideal situation, one could conduct a best subset search, evaluating all possible classification rules that can be defined using the data and identifying a set of rules that concisely explain the observed associations. This strategy is computationally intractable using a brute force approach: even a relatively small biological data set of 50 features with binary coding would require examining over 250 variable sets and many more specific rules (since the specific value of features, 0, 1, or ‘omitted’, is important). Identifying the associated rules that a random forest uses to classify a given sample (a specific row of the data matrix) offers the possibility to bypass the brute force approach and enables the development of mechanistic hypotheses for follow-up studies. This challenge, and an overview of the key strategy we propose, are illustrated in Figure 1. In Figure 1A we depict a toy model where measured variables (traits) have only two possible values (e.g., present/absent), the high-level phenotype (category) is binary (e.g., no disease/disease), and two distinct Boolean rules can both generate the phenotype. The goal in this case is to identify each of the rules that are associated with the phenotype. The multiple Boolean rules obtained in this manner can be thought of as a consensus decision tree that possesses the most informative branches of the forest with respect to a given class label. In this work, we will show how this can be achieved by in-depth analyses of any given random forest (RF) (Figure 1B).

FIGURE 1

(A) In a hypothetical dataset there are two phenotype labels–“Disease” and “No Disease” that we wish to discriminate based on input predictor variables. In this example, there are two distinct high-order patterns that both confer the same “Disease” phenotype. Our goal is to identify a potentially diverse set of patterns (or, in this simplified case, all patterns) that are associated with the “Disease” label. (B) Instead of exhaustively evaluating variable combinations we leverage the structure that emerges from an ensemble of decisions trees like those produced by a trained random forest. (C) For each sample with the observed phenotype “Disease” we first identify the vector containing its input values (i). Then follow the paths it takes downs each tree that attempts to predict its class and record the frequency of parent-child variable pairs (ii). Next, we rank parent-child variable pairs in descending order of frequency (iii). Finally, we use a great search to construct a sample-specific rule that is fully associated with the “Disease” phenotype (iv). (D) All sample-specific rules are evaluated in order to obtain a consensus set of rules that combined account for all samples with the “Disease” phenotype. The random forest algorithm intrinsically takes advantage of non-linear relationships between variables and is widely used in the life sciences (Boulesteix et al., 2012; Nguyen et al., 2013; Touw et al., 2013). RFs, when used to distinguish between disease states known to have multiple causes, often result in excellent classifiers (Duvallet et al., 2017; Franzosa et al., 2019). It has also been reported that RFs capture subtle statistical interactions between variables (Louppe, 2014). Unfortunately, an RF is not straightforwardly interpretable despite its hierarchical structure, and recovering those interactions is notoriously difficult (Wright et al., 2016) due in large part to the method’s reliance on ensembles of trees (Breiman, 2001). The difficulties in interpretation created by these properties has led many to refer to RF as a ‘black box’ model (Castelvecchi, 2016). Identifying the rules that a RF utilizes in classification tasks is an active area of research, and many strategies have been developed to address this problem. Effective strategies have focused on evaluating how individual variables influence the classification probabilities of specific samples (Palczewska et al., 2013; Welling et al., 2016), pruning existing decision rules found in the tree ensemble to produce compact models (Deng, 2019), computing conditional importance scores (Strobl et al., 2008), or iteratively enriching the most prevalent variable co-occurrences through regularization (Basu et al., 2018). These approaches offer valuable methods for the identification of statistical interactions between variables. However, we and others have observed that while these methods are capable of recovering a true causal rule in simulated data when exactly one such rule is present, the existence of multiple rules associated with one phenotype can confound interpretation efforts (Basu et al., 2018). Here we describe BowSaw, a new set of algorithms that utilizes variable interactions in a trained RF model in order to extract multiple candidate explanatory rules. With BowSaw, we set out to develop a post hoc method intended to aid in the discovery of these rules when the input variables are categorical in nature. The primary approach of BowSaw is to start by approximating a best combination of variables (i.e., a rule) that explain the forest’s predictions for individual samples of a given class in the data set and then to curate the collection of best combinations to obtain a concise set of combinations that collectively segregate a class of interest with high precision. For individual samples a rule is identified by systematically quantifying the co-occurrence of specific variable pairs across trees in the forest that attempt to predict the class of the sample (out-of-bag trees) and then using the frequency of these co-occurring variable pairs to guide the construction of a rule that precisely identifies the sample as its observed class. For the entire set of samples, we then curate the collection of all rules identified in this way, in order to produce a small set of rules that are broadly and precisely applicable to samples of the given class label. We first demonstrate that BowSaw can recover true rules (when they exist) by applying the algorithms to simulated data sets of varying complexity. We then apply BowSaw to a study on the role of the gut microbiome on Crohn’s disease (Proctor et al., 2019), and show that it can find a previously unreported combination of microbial taxa that is broadly and precisely associated with Crohn’s disease samples in the data set. In its current implementation, BowSaw can be applied to any dataset with categorical or discrete predictors with any number of class labels.

Methods

Overview of the Pipeline

Provided with a trained random forest and a training set, BowSaw goes through three steps in order to generate a candidate rule (variable-value combination) for each sample associated with the phenotype of interest. First, for a specific sample, the Count algorithm counts the frequency of unique ordered pairs of variables encountered along each of its out-of-bag trees in the forest (Figure 1C–step 2). Second, for that sample, the Construct algorithm takes the counts from the first step and generates a list of ordered pairs, ranked by their frequencies, then uses this list as a guide to construct a candidate decision rule (which could consist of two or more variables) that is associated with the observed phenotype at a user defined precision threshold (Figure 1C–steps 3–4). Finally, the Curate algorithm pools the candidate decision rules from each sample together and greedily selects a subset of rules that collectively account for all of the samples with the desired phenotype (Figure 1D). Optionally, the Sub-rule algorithm can be used to generate pruned versions of candidate rules prior to applying the Curate algorithm in order to obtain a more concise, albeit less specific, set of candidate rules. The Count and Construct algorithms generate the candidate rules for individual samples while the Curate and Sub-rule algorithms produce a combined set of rules that account for all samples with the chosen phenotype. In the following section, we provide a description of the inputs BowSaw takes and the algorithms that implement these steps along with pseudocode.

Inputs

BowSaw takes as inputs a dataset, , composed of observed vectors (together with their respective classes ) each of categorical variables. There are assumed to be possible class labels for each vector in which for the purposes of this discussion denote different phenotypes. A random forest is assumed to be trained on to distinguish the classes . Additionally, BowSaw takes as input the feature vector of a specific sample for which the goal is to identify a set of simplified rules associated with the phenotype .

Counting Stubs

Given an RF machine trained on dataset and a feature vector , the first sub-routine of our method (the count algorithm) proceeds as follows. It starts by identifying among the set of trees in , those sub-paths (sequences of successive variable indices) encountered by sample as it travels through , its set of out-of-bag trees. An out-of-bag tree is a tree for which was not included in the training set. For a specific path in the sequence of successive variable indices forms a vector ) (note that each is one of the variables ). Each stub (ordered pair of sequentially encountered variables ) in all out-of-bag elements along for i = 1, … r-1 is accounted for in a matrix , where the element records the number of stubs containing the ordered pair of variables and among all paths of . We restrict the counting to sequentially encountered variables because higher order interactions involving 3 or more sequential variables are much rarer and would require many more trees than is necessary to build an acceptable classifier.

Constructing a Candidate Rule

A rule for classifying to a test point will have the form “If then classify to class ”. Here is a designated subcollection of the variable indices , and is the sub-vector of current vector corresponding just to the indices . The vector will denote a pre-defined set of values to , with the above rule requirement effectively meaning that each appearing in the second vector must equal the corresponding in the first vector. Thus the condition requires a specific assignment of values to for , and the rule is that if a test vector satisfies this condition, we classify it to category k. The second sub-routine (the construct algorithm) builds a candidate rule , based (initially) on a fixed training point, say , in class . This is done by first placing all of the stubs with non-zero counts into a list sorted in descending order by their values in . We define the candidate rule (based on ) through the following steps. We initialize using the first stub in the list together with the two fixed values This is the initialized form of the rule , which requires that for any test vector, its values at the above indices and match the values of the above fixed training vector , so that , and . For brevity, denote the pair and the corresponding assigned values as Then the content of rule will be denoted succinctly as . Since ordering of the indices does not matter, (as long as the indices are identified), we will henceforth write . We then update rule as follows. We find all that satisfy the initial part of rule , i.e., i.e., all training points matching the two indices of training sample , and store them as a subcollection of the training set. We call the fraction of data points in that have phenotype , i.e., match the phenotype of the initial sample . When F is greater than or equal to a user defined threshold, the algorithm terminates and returns . If , we stop and return the current above rule . If , we continue by choosing the second stub in the above list , and augment the current rule by adding the condition (again written ) and maintaining the assignment of class (i.e., the same class as the currently fixed sample ). If the second stub happens to overlap with the initial stub , this added condition in the rule will clearly be consistent, being still based on the fixed sample . We augment the current index list to a list , adding to it the two new indices and , so that now } writing the augmented rule as Again defining to be the fraction of the data subset (matching the more restrictive new rule ) with phenotype , we stop the algorithm and use the current rule if , and otherwise augment rule by adding the indices to it, as above, yielding a larger set of indices and the augmented rule : , with a more restricted subset , and a new value for , now the fraction of in the class of the fixed .This process continues until the fraction , e.g., 100% of the samples in match the current set of indices, and also match the class of the current sample . Alternatively, the algorithm stops when all stubs in have been exhausted. In the examples that follow we have set threshold to 1. The rationale for this choice is that we allow overfit with intention of pruning the overfit rules in order to find more generalizable forms. We make this choice because from the perspective of discovery, we assume that it is more desirable to capture as much of a true underlying rule as possible and then prune back to a shorter one, than it is to extract a concise rule. In practice one might decide to tune the threshold, F, to approximate the overall precision of the model in order to identify less complex rules or tune it as a hyper-parameter in order to reduce the combinatorial search space.

Curating Candidate Rules

The count and construct algorithms are the heart of BowSaw. In our workflow, we apply these algorithms to each sample , where denotes the set of samples in dataset with phenotype . At this stage in the algorithm, we have associated a single candidate rule for each vector in . The union of these candidate rules over all samples in will form a list which we will denote as , which ranks each rule by the size of the set consisting of all samples consistent with rule . Since may include many redundant rules or rules that strictly extend each other, we have another sub-routine (the curate algorithm) to generate a concise set of candidate rules that collectively account for all samples in class . Briefly, we initialize a list , with the element representing the largest set of samples. At each stage, the next rule in is selected so as to be satisfied by the largest number of elements that do not satisfy any of the previous rules. This rule is then added to , with ties resolved randomly. This is then continued until the elements in satisfying at least one rule in are exhausted.

Constructing Sub-Rules

In any given dataset, rules are rarely perfectly associated with specific phenotypes. Given the current list of rules describing phenotype as above, we may consider a looser set of rules by creating a new list consisting of rules in together with sub-rules satisfying some user defined minimal complexity criterion and precision thresholds which serve to exclude low quality rules from the analysis. Given a rule is the conjunction of a set of conditions, by sub-rule we mean the conjunction of a subset of these conditions. The list can be treated precisely as the list was above, resulting in a new curated list obtained as earlier, yielding a new candidate rule set which has a reduced likelihood of overfitting the data. Thus, we will require a strategy for selecting a set of candidate sub-rules that account for all samples with desired observed phenotype class . Candidate sub-rules are shorter candidate rules (with less complexity, likely less precise, and more broadly applicable) derived from larger candidate rules by keeping one or more (generally ) variables. For each candidate rule in , and complexity level i, we include only sub-rules that meet the user-defined complexity criterion, designated as complexity level . We place each of the sub-rules derived from at complexity level into a new list . For each rule in its precision is calculated with respect to the class , and those rules with a precision below a given threshold are eliminated. Finally, this reduced list is subject to the above Curate algorithm again. Within the above aggregation algorithm, is determined for each rule in which is then pruned with the curate algorithm to produce . The algorithms described above are generalizable to multi-classification tasks but are currently limited to discretized or categorical representations of the feature space. Pseudocode for implementing each of the algorithms described above along with an implementation of the algorithms in R (R Core Team, 2020) can be found in the supplemental files and on github: https://github.com/segrelab/BowSaw.

Results

Application to Simulated Data

To test the capacity of BowSaw to recover multiple decision rules when the ground truth is known, we applied it to increasingly challenging simulated data sets. These data sets consist of binary vectors representing different samples. The phenotype associated with each sample is a function of the corresponding vector. The function consists of a set of multiple mutually distinct Boolean rules, such that if a rule is satisfied, it will cause the sample to have the phenotype with a certain probability (which we call here “penetrance” because of its resemblance to the genetics concept). The first dataset (IDEALIZED) we use is relatively simple and includes multiple equally prevalent rules. It is also generated under the assumption that there are no unmeasured confounders, i.e., that if a sample does have a phenotype, then it must be satisfying at least one of the above rules. We then apply BowSaw to a more challenging scenario (INTERMEDIATE) in which the phenotype-generating rules differ in their relative prevalence and the assumption of unmeasured confounders is violated. Finally, is a set of data sets with complex co-varying parameters (COMPLEX), we systematically varied the underlying parameters of the simulation and examined the relationship between summary statistics of the RF performance and the ability of BowSaw to generate candidate rules containing the true phenotype-generating rules. For the IDEALIZED scenario, we simulated a data set of 100 independent and identically distributed random binary variables and 2,000 samples. We randomly defined five rules, each requiring four randomly selected variables to have specific values (e.g., all variables equal to 1) in order to assign a hypothetical phenotype with likelihood between 0.8 and 0.9. Here we present the results of this scenario with a specified random seed, but other seeds and parameters can be explored using the scripts provided in the supplemental files. Using these parameters, 497 samples were assigned the phenotype and BowSaw produced a set of 135 unique candidate rules ranging in complexity from six to fourteen variables. From these rules, we produced all sub-rules involving anywhere between two and five variables, which resulted in unique 50,034 sub-rules. To reduce the number of sub-rules that the Curate algorithm would need to examine, we eliminated from consideration any rules that had a class precision below 80%. We selected an 80% threshold because in the cluster centered around 125 matching samples there is a small cloud of rules that are clearly segregating the phenotype more efficiently than the others (Figure 2A). We selected the most general remaining sub-rule to initialize our list of candidate rules. This produced a final list consisting of five candidate rules that accounted for all of the samples with the phenotype and were each one of the true phenotype generating rules (Figure 2A red points). These results demonstrate that in an ideal scenario with no measurement errors, BowSaw is indeed capable of recovering multiple true rules.

FIGURE 2

For both scenarios 2,000 samples were generated with 100 randomly generated binary features. (A) The generality of sub-rules (number of points that exactly satisfy the rule criteria) is plotted against their precision for the IDEALIZED scenario (Five rules that cause the phenotype and no noise). Each point represents a unique sub-rule. X-axis is the number of samples in the dataset that exactly match the pattern defined by the rule. Y-axis is the fraction of matching samples with the observed phenotype (i.e., precision of the rule). Each cluster of points corresponds to decreasing rule complexity from 5 variables per rule to 2 on the right-most cluster. These clusters appear because the values of each variable are produced by an identical binomial distribution. The dashed line is the precision threshold we chose in order to exclude low quality rules. Only candidate rules with precision above this threshold were considered for the curate algorithm. Red points are the causative sub-rules we defined. BowSaw correctly identified all five red points in this scenario. (B) Candidate sub-rules generated for the more challenging INTERMEDIATE scenario. We defined 5 causative rules of varying lengths in this scenario and allowed 2% of samples without a causative rule to be assigned the label. BowSaw completely recovered 4 of the causative rules (red points). The longest rule which involved 5 variables was not fully recovered by any candidate rule. Rules that were selected by the Curate algorithm because of their contribution to additional coverage but that did not contain a complete true rule are indicated by blue points. For the more challenging scenario (INTERMEDIATE), we generated the data set as before, except that this time we allowed the five underlying rules to vary in complexity from three to five variables. Varying the complexity of rules resulted in different prevalence among them, as rules that are more complicated are less likely to appear in the data. In this case, we had one rule of complexity five, two that required four variables, and two that used three variables. We also added background noise by randomly assigning the phenotype to 2% of samples that did not possess any of the rules, 655 samples were assigned the phenotype. BowSaw produced 176 unique candidate rules involving between six to thirteen variables. From this list we generated 68,938 sub-rules and chose a precision threshold of 75% because there are two clusters at ∼|| = 125 that begin to clearly separate in that range and the two outlier points at ∼|| = 250 do not combine to account for all of the phenotype (Figure 2B). Applying the Curate algorithm to the rules meeting this threshold selected 19 candidate sub-rules, the top four (when ranked by ||) of which were true rules (red points). The remaining 15 rules were noise rules (blue points). The rule of five variables was not recovered. These results show that BowSaw is able to recover strongly associated patterns (and in this case, causal patterns) even in the presence of noise, but low prevalence rules can be masked by more highly prevalent rules. We used the same data generation method to investigate BowSaw’s ability to produce candidate rules containing true rules when the underlying parameters change. We applied BowSaw to 20,000 simulated data sets where we randomly altered the number of features (50–1,000), sample size (200 or 2,000 samples), complexity of the rules (2–8 variables), number of rules (2–8), the likelihood of each rule assigning the phenotype (0.0005–1), and the background noise (1x10−5 to 0.1). For each simulation we extracted a single candidate rule per sample with the assigned phenotype and ranked them without generating sub-rules. To investigate how effectively BowSaw recovers true rules, for each simulation we calculated the fraction of true rules fully recovered, the probability of fully recovering at least one rule, the median rank of the first recovered rule when at least one is recovered, and the mean rule completeness of recovered rules. We investigated the relationship of these measurements to the to the ROC-AUC, PR-AUC, number of features, and sample size. These values were chosen because they are easily accessible to researchers during model building and could potentially be used to assess the likelihood of obtaining useful insights from applications of BowSaw. ROC-AUC, PR-AUC, and sample size are positively correlated with full recovery of true rules, mean completeness of recovered rules, and median rank. Number of features was negatively correlated with these values. These correlations are summarized in Table 1. The probability of recovering at least one true rule gradually decreases with increasing feature space, gradually increases with increasing sample size, and forms a sigmoidal curve with both ROC-AUC and PR-AUC. Plots depicting the relationship of the four metrics with the fraction of fully recovered rules, probability of recovering at least one rule, median rank of rules, and mean rule completeness can be found in Supplementary Figures S1–S4.

TABLE 1

Correlation of performance metrics and data dimensions with rule recovery.

	ROC-AUC	PR-AUC	N Features	Sample size
Fraction of rules recovered	0.672	0.585	-0.151	0.556
Mean partial recovery all rules	0.683	0.581	-0.251	0.657
Median rank of first recovered rule	0.268	0.195	-0.073	0.071

Correlation of performance metrics and data dimensions with rule recovery.

Application to Human Microbiome Data

Irregular distributions of microbial taxa within the gut are often associated with serious illnesses such as Crohn’s disease or ulcerative colitis (Carding et al., 2015; Levy et al., 2017). Human microbiome studies regularly use 16s rRNA amplicon sequencing methods and extensive reference databases to report on microbial taxa found in samples as operational taxon units (OTUs). RF classifiers are frequently built using counts of OTUs to accurately discriminate between disease and healthy patient samples (Ai et al., 2019; Vangay et al., 2019). Despite their demonstrated effectiveness as good classifiers of Crohn’s disease, studies that look to discover associations with disease status typically focus on individual OTUs, while specific microbial association rules found by RF are not discussed, as a result it is uncertain how heterogeneous study cohorts are. To investigate potential rule heterogeneity in a human microbiome cohort we downloaded processed files from the Human Microbiome Project for inflammatory bowel disease (IBD) (Proctor et al., 2019) which contain information on the taxonomic profiles of 982 OTUs in 178 patients–86 of which have been diagnosed with Crohn’s disease, 46 diagnosed with ulcerative colitis, and 46 diagnosed as non-IBD. We were specifically interested in finding rules that separate the Crohn’s disease samples from ulcerative colitis and non-IBD, so we framed the problem as a binary classification task with Crohn’s disease as the target phenotype. Since the current implementation of BowSaw is limited to finding rules when the variables have categorical values, we first converted the OTU counts of each taxon to a simple presence/absence scheme. This resulted in nearly equivalent RF performance relative to training RF with the original continuous OTU inputs: ROC AUC of 0.856 (binary) vs 0.872 (continuous) and PR AUC of 0.853 (binary) vs 0.86 (continuous) (Figures 3A,B). This is an important result because it allows us to think about associations just in terms of presence or absence of an OTU without sacrificing much in model performance. We next applied BowSaw to the Crohn’s disease samples and generated 86 unique classification rules. These rules ranged in complexity from 4 OTUs to 16 OTUs (median 9 OTUs) and applied to as few as 1 sample up to 36 samples (mean 6.3, +/-6.6, median: 4). The most broadly applicable rule involved 8 OTUs.

FIGURE 3

(A) Performance of the random forest classifier as measured by area under the receiver operator curve (ROC-AUC) is not strongly perturbed by simplifying OTU representation to a presence/absence scheme vs. the original continuous count. Dashed line indicates the performance of a perfectly random classifier. (B) The area under the curve of the precision recall curve is similarly not strongly affected by the new representation scheme. Dashed horizontal line is the random performance line. (C) Each point represents a unique candidate sub-rule. On the x-axis is the number of samples in the data matrix that are subject to that rule. The y-axis represents what fraction of matching samples were diagnosed as Crohn’s disease. (D) The taxon identities of the OTUs that make up the most generally applicable of the sub-rules where all matching samples have the Crohn’s disease label. We then applied the Sub-rule algorithm and visualized 56,902 resultant sub-rules ranging in complexity from 2 to 7 variables (Figure 3C). There were 1,941 sub-rules with precision = 1. We selected the most general of these rules (max| |) to be the top candidate for the curate algorithm and found that it considers the status of 5 OTUs and accounts for 38 of the 86 Crohn’s disease samples (Figure 3C), this rule was derived from the rule that considered the status of 8 OTUs and accounted for 36 Crohn’s disease samples. We set a precision threshold of 90% and ended up with 10 sub-rules involving an average of 4 OTUs (min = 2, max = 7), each derived from a unique parent rule (average OTUs = 9.6, min = 6, max = 16), that together account for all 86 Crohn’s disease samples and an additional 11 non-Crohn’s disease samples (4 non-IBD, 7 ulcerative colitis). The top five rules combine to account for 78 of 86 Crohn’s disease samples and include 10 non-Crohn’s disease samples (Table 2).

TABLE 2

Association rules identified by BowSaw that account for all Crohn’s disease samples.

Rule	CD samples	Non CD samples	New samples covered	Taxonomy	Presence
1	38	0	38	Bacteroides (genus)	y
				Lachnoclsotridium (genus)	y
				Tyzzerella (genus)	n
				Lachnospira (genus)	n
				Lachnospiricae UCG-001 (genus)	n
2	41	4	20	Dialister (genus)	y
				Christensenellacea R7 group (genus)	n
				Collinsella (genus)	n
				Ruminococcaceae (family)	n
				Finegoldia (genus)	n
				Ruminococcus (genus)	n
3	9	1	9	Ruminococcus (genus)	y
				Ruminococcaceae UCG-002 (genus)	n
				Lachnospirceae (family)	n
4	24	2	6	Streptococcus (genus)	y
				Tyzzerella (genus)	n
				Lachnospiraceae (family)	n
				Hafnia obesumbacterium	n
5	27	3	5	Lachnospiricae UCG-008 (genus)	y
				Ruminococcus 1 (genus)	n
				Eubacterium eligens group	n
6	5	0	2	Ruminococcus 1 (genus)	y
6	5	0	2	Dorea (genus)	n
7	7	0	2	Bacteroides (genus)	y
				Dialister (genus)	n
				Eubacterium rectale group	n
8	15	0	2	Lachnospiraceae NK4A136 group	y
				Eubacterium eligens group	y
				Tyzzerella (genus)	n
				Christensenellacea R7 group (genus)	n
				Lachnospira (genus)	n
9	3	0	1	Ruminococcus gnavus group	y
				Veillonella (genus)	n
				Bacteroides (genus)	n
				Finegoldia (genus)	n
10	10	1	1	Parabacteroides (genus)	y
				Eubacterium eligens group	y
				Ruminococcaceae Ucg-003 (genus)	n

Association rules identified by BowSaw that account for all Crohn’s disease samples. The top candidate rule is comprised of the presence of Bacteroides and Lachnoclostridium and the absence of three genera from the family Lachnospiraceae: Lachnospira, Tyzerrella, and Lachnospiracea UCG 001 (Figure 3D). Detection of Bacteroides was nearly ubiquitous within the cohort, it was found in 170 of 178 total samples, but only 3 of the samples in which it was missing are diagnosed as Crohn’s disease. For the remaining taxa we performed a t-test comparing the distribution of the taxa in Crohn’s disease vs. ulcerative colitis and vs. healthy samples. Lachnoclostridium was frequently found in Crohn’s disease (67/86) but not in ulcerative colitis (27/46, p = 0.02) and was detected at roughly the same rate in non-IBD samples (34/46, p = 0.616). Detection of Lachnospira was depleted in Crohn’s disease samples (20/86) relative to ulcerative colitis (20/46, p = 0.022) and to non-IBD samples (31/46, p = 9.9–7). Tyzzerella was also detected at a lower rate in Crohn’s disease (63/86) relative to ulcerative colitis (24/46, p = 0.019) and non-IBD (24/46, p = 0.019). Lachnospiracea UCG 001 was rarely detected in Crohn’s disease (4/86) which is a lower rate than it was detected in ulcerative colitis (9/46, p = 0.022) and in non-IBD samples (19/46, p = 1.45–5).

Application to Mushroom Data

To further demonstrate the generalizability of our approach to non-binarized datasets we identified the mushroom data set from the UCI machine learning repository (UCI Machine Learning Repository). This data set contains 8,123 observations of poisonous (3,915) and edible (4,208) mushrooms. There are 22 categorical features ranging from 2 to 12 categories. The two classes are perfectly separable, and the documentation accompanying the matrix describes a set of rules that separate all edible mushrooms from poisonous samples. This rule set provides a good baseline to compare the complexity of the final rule sets obtained with BowSaw to. We applied our approach to the original matrix of 22 features with multiple categories and to a binarized transformation where we give each category its own column (117 features). In both cases we used BowSaw to extract classification rules that account either for all edible mushrooms or for all poisonous mushrooms. Since the samples are fully separable we again set F = 1. This setting resulted in candidate rules ranging in complexity from 2 to 9 variables. We examined all sub-rules from complexity 1 up to complexity 9 and retained only those that were entirely associated with the target class (precision = 1) for curating a short list. In total we generated 4 different rule lists that fully separate edible from poisonous mushrooms and also differ from the data donor’s contributed list. Each list is composed of 7 rules. The rule lists obtained from each run are described in Supplementary Table S1 along with the contributed list

Discussion

Linear models for classification such as logistic regression are often the “go to” approach due to their ease of implementation and interpretation of coefficients. However, many biological datasets contain non-linear interactions between features. In these situations it is not uncommon for random forests to significantly outperform logistic regression. Interpretation of random forest models for classification is not straightforward and may be complicated when there are multiple rules (combinations of variables and their specific values) associated with a phenotype of interest. Our newly developed BowSaw approach, best applied when random forest is the appropriate classifier, is an algorithmic method for identifying the rules that a trained random forest model uses to make classifications when the values are categorical in nature. By taking advantage of the structure of trees found within a random forest, BowSaw produces a set of multiple decision rules that combine to account for each sample with a given observed phenotype. When the variables are the presumed causal agents, these rules represent plausible mechanistic relationships. Results on simulated data demonstrate that when there are multiple rules associated with a single phenotype label that BowSaw is capable of faithfully identifying them. Application to data from the human microbiome project offers further evidence that BowSaw provides an efficient way of generating plausible hypotheses for high throughput metagenomics studies. In particular we identified a rule that utilizes a presence/absence pattern of five microbial taxa (present: Bacteroides, Lachnoclostridium; absent: Lachnospira, Lachnospiracea, Tyzerrella) that accounts for nearly half of all Crohn’s disease samples in the cohort (38/86). This specific pattern of microbial colonization in the guts of Crohn’s disease patients is unreported, but each taxon’s respective enrichment or depletion status and association with disease status has been reported. If the cohort of patients in the human microbiome study are representative of all people afflicted by Crohn’s disease, then this rule represents a significantly large subset of those suffering. Inquiries into the relationship of the taxa included in this rule with disease status may yield important insights into the mechanisms of the disease and potential therapeutic strategies for this sub-population. Of the five associated taxa, we suspect that the absence of Lachnospira, Lachnospiracea UCG 001, and Tyzzerella are biologically meaningful. We have reason to believe so because it has been reported that the Lachnospiraceae family is generally suppressed in Crohn’s disease (Loh and Blaut, 2012; Geirnaert et al., 2017; Nagao-Kitamoto and Kamada, 2017). Lachnospira has been reported as depleted with respect to Crohn’s disease several times (Wright et al., 2017; Wang Y. et al., 2018). The depletion of Tyzzerella has been associated with chronic intestinal inflammation and supplementation suggested as a probiotic for Crohn’s disease (Berry et al., 2018; Chen et al., 2018). While the relationship of Lachnospiracea UCG 001 with Crohn’s disease is still unclear, its depletion has been reported in mice displaying symptoms of anhedonia and it was significantly enriched in anhedonia resilient mice (Yang et al., 2019). Partly because IBD is frequently accompanied by depression, anhedonia has been suggested as an important symptom in the diagnosis of IBD (Carpinelli et al., 2019). The associations of the individual OTUs defined by this rule are consistent with previously reported findings in the existing literature and describe a taxonomic profile that exclusively identifies a large sub-population of Crohn’s disease samples within this cohort. The presence of Bacteroides does not appear to be particularly useful and in this context is probably preserved because it causes a perfect association, although high levels of some species are implicated in the pathology of Crohn’s disease (Rabizadeh et al., 2007). Lachnoclostridium is differentially distributed across the three classes. Notably it is less frequently detected in ulcerative colitis relative to Crohn’s and non-IBD samples, which roughly resemble one another. Increased levels of this genus were detected in rats that showed relief of colitis symptoms after treatment with a proposed therapeutic agent (Wang K. et al., 2018). The current implementation of the algorithms is restricted to classification tasks with categorical predictor values. This is a challenge that can be addressed in future variants of this approach, in order to make it more generally applicable. Future work could also focus on extending these approaches to the interpretation of regression models or to consider the effect of counting stubs of higher-order interactions or co-occurring pairs on bookkeeping and rule extraction as opposed to strict parent-child relationships. We anticipate that the concept at the core of BowSaw and its different possible extensions could help uncover complex feature-phenotype maps for other types of biological datasets.

38 in total

1. Using biological networks to search for interacting loci in genome-wide association studies.

Authors: Mathieu Emily; Thomas Mailund; Jotun Hein; Leif Schauser; Mikkel Heide Schierup
Journal: Eur J Hum Genet Date: 2009-03-11 Impact factor: 4.246

2. Using Machine Learning To Predict Antimicrobial MICs and Associated Genomic Features for Nontyphoidal Salmonella.

Authors: Marcus Nguyen; S Wesley Long; Patrick F McDermott; Randall J Olsen; Robert Olson; Rick L Stevens; Gregory H Tyson; Shaohua Zhao; James J Davis
Journal: J Clin Microbiol Date: 2019-01-30 Impact factor: 5.948

3. Parasutterella, in association with irritable bowel syndrome and intestinal chronic inflammation.

Authors: Yan-Jie Chen; Hao Wu; Sheng-Di Wu; Nan Lu; Yi-Ting Wang; Hai-Ning Liu; Ling Dong; Tao-Tao Liu; Xi-Zhong Shen
Journal: J Gastroenterol Hepatol Date: 2018-06-03 Impact factor: 4.029

Review 4. High-content phenotypic and pathway profiling to advance drug discovery in diseases of unmet need.

Authors: Rebecca E Hughes; Richard J R Elliott; John C Dawson; Neil O Carragher
Journal: Cell Chem Biol Date: 2021-03-18 Impact factor: 8.116

5. Screening for Core Genes Related to Pathogenesis of Alzheimer's Disease.

Authors: Longxiu Yang; Yuan Qin; Chongdong Jian
Journal: Front Cell Dev Biol Date: 2021-04-22

6. Dysbiosis of the gut microbiota in disease.

Authors: Simon Carding; Kristin Verbeke; Daniel T Vipond; Bernard M Corfe; Lauren J Owen
Journal: Microb Ecol Health Dis Date: 2015-02-02

Review 7. Host-microbial Cross-talk in Inflammatory Bowel Disease.

Authors: Hiroko Nagao-Kitamoto; Nobuhiko Kamada
Journal: Immune Netw Date: 2017-02-23 Impact factor: 6.303

8. Gut microbiome structure and metabolic activity in inflammatory bowel disease.

Authors: Eric A Franzosa; Alexandra Sirota-Madi; Julian Avila-Pacheco; Nadine Fornelos; Henry J Haiser; Stefan Reinker; Tommi Vatanen; A Brantley Hall; Himel Mallick; Lauren J McIver; Jenny S Sauk; Robin G Wilson; Betsy W Stevens; Justin M Scott; Kerry Pierce; Amy A Deik; Kevin Bullock; Floris Imhann; Jeffrey A Porter; Alexandra Zhernakova; Jingyuan Fu; Rinse K Weersma; Cisca Wijmenga; Clary B Clish; Hera Vlamakis; Curtis Huttenhower; Ramnik J Xavier
Journal: Nat Microbiol Date: 2018-12-10 Impact factor: 17.745

Review 9. Using machine learning approaches for multi-omics data analysis: A review.

Authors: Parminder S Reel; Smarti Reel; Ewan Pearson; Emanuele Trucco; Emily Jefferson
Journal: Biotechnol Adv Date: 2021-03-29 Impact factor: 14.227

10. Key role of gut microbiota in anhedonia-like phenotype in rodents with neuropathic pain.

Authors: Chun Yang; Xi Fang; Gaofeng Zhan; Niannian Huang; Shan Li; Jiangjiang Bi; Riyue Jiang; Ling Yang; Liying Miao; Bin Zhu; Ailin Luo; Kenji Hashimoto
Journal: Transl Psychiatry Date: 2019-01-31 Impact factor: 6.222