Demetrius DiMucci (1,2), Mark Kon (1,3), Daniel Segrè (1,2,4,5,6).
Abstract
Machine learning is helping the interpretation of biological complexity by enabling the inference and classification of cellular, organismal and ecological phenotypes based on large datasets, e.g., from genomic, transcriptomic and metagenomic analyses. A number of available algorithms can help search these datasets to uncover patterns associated with specific traits, including disease-related attributes. While, in many instances, treating an algorithm as a black box is sufficient, it is interesting to pursue an enhanced understanding of how system variables end up contributing to a specific output, as an avenue toward new mechanistic insight. Here we address this challenge through a suite of algorithms, named BowSaw, which takes advantage of the structure of a trained random forest algorithm to identify combinations of variables ("rules") frequently used for classification. We first apply BowSaw to a simulated dataset and show that the algorithm can accurately recover the sets of variables used to generate the phenotypes through complex Boolean rules, even under challenging noise levels. We next apply our method to data from the integrative Human Microbiome Project and find previously unreported high-order combinations of microbial taxa putatively associated with Crohn's disease. By leveraging the structure of trees within a random forest, BowSaw provides a new way of using decision trees to generate testable biological hypotheses.
Keywords: Boolean rules; complex phenotypes; decision tree; epistasis; high-order interactions; microbiome; random forest
Year: 2021 PMID: 34222331 PMCID: PMC8245782 DOI: 10.3389/fmolb.2021.663532
Source DB: PubMed Journal: Front Mol Biosci ISSN: 2296-889X
FIGURE 1 (A) In a hypothetical dataset there are two phenotype labels, “Disease” and “No Disease”, that we wish to discriminate based on input predictor variables. In this example, there are two distinct high-order patterns that both confer the same “Disease” phenotype. Our goal is to identify a potentially diverse set of patterns (or, in this simplified case, all patterns) that are associated with the “Disease” label. (B) Instead of exhaustively evaluating variable combinations, we leverage the structure that emerges from an ensemble of decision trees like those produced by a trained random forest. (C) For each sample with the observed phenotype “Disease” we first identify the vector containing its input values (i). We then follow the paths it takes down each tree that attempts to predict its class and record the frequency of parent-child variable pairs (ii). Next, we rank parent-child variable pairs in descending order of frequency (iii). Finally, we use a greedy search to construct a sample-specific rule that is fully associated with the “Disease” phenotype (iv). (D) All sample-specific rules are evaluated in order to obtain a consensus set of rules that, combined, account for all samples with the “Disease” phenotype.
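Steps (ii) and (iii) of the caption above, walking a sample down each tree and tallying parent-child split-variable pairs, can be sketched with scikit-learn's tree internals. This is an illustrative reconstruction, not BowSaw's published implementation; the dataset and forest settings are arbitrary stand-ins.

```python
# Sketch of Figure 1C, steps (ii)-(iii): for one sample, follow its decision
# path through every tree in a trained random forest and count how often each
# (parent, child) pair of split variables occurs consecutively on the path.
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

sample = X[0].reshape(1, -1)
pair_counts = Counter()
for tree in forest.estimators_:
    t = tree.tree_
    node, path = 0, []
    while t.children_left[node] != -1:  # -1 marks a leaf node
        path.append(int(t.feature[node]))
        if sample[0, t.feature[node]] <= t.threshold[node]:
            node = t.children_left[node]
        else:
            node = t.children_right[node]
    # tally consecutive (parent, child) variable pairs along the path
    for parent, child in zip(path, path[1:]):
        pair_counts[(parent, child)] += 1

# step (iii): rank parent-child pairs in descending order of frequency
ranked = pair_counts.most_common()
```

The ranked pair list then seeds the greedy construction of a sample-specific rule in step (iv).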
FIGURE 2 For both scenarios, 2,000 samples were generated with 100 randomly generated binary features. (A) The generality of sub-rules (the number of samples that exactly satisfy the rule criteria) is plotted against their precision for the IDEALIZED scenario (five rules that cause the phenotype and no noise). Each point represents a unique sub-rule. The x-axis is the number of samples in the dataset that exactly match the pattern defined by the rule. The y-axis is the fraction of matching samples with the observed phenotype (i.e., the precision of the rule). Each cluster of points corresponds to decreasing rule complexity, from 5 variables per rule to 2 in the right-most cluster. These clusters appear because the values of each variable are produced by an identical binomial distribution. The dashed line is the precision threshold we chose in order to exclude low-quality rules. Only candidate rules with precision above this threshold were considered for the Curate algorithm. Red points are the causative sub-rules we defined. BowSaw correctly identified all five red points in this scenario. (B) Candidate sub-rules generated for the more challenging INTERMEDIATE scenario. We defined 5 causative rules of varying lengths in this scenario and allowed 2% of samples without a causative rule to be assigned the label. BowSaw completely recovered 4 of the causative rules (red points). The longest rule, which involved 5 variables, was not fully recovered by any candidate rule. Rules that were selected by the Curate algorithm because of their contribution to additional coverage, but that did not contain a complete true rule, are indicated by blue points.
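The generality and precision quantities plotted in the figure can be made concrete with a minimal simulation in the spirit of the IDEALIZED scenario. The planted rule, feature indices, and the binomial probability below are illustrative assumptions, not the paper's exact configuration (which uses five causative rules).

```python
# Minimal sketch of the IDEALIZED simulation: identically distributed binary
# features plus one planted Boolean rule that deterministically confers the
# phenotype. Rule contents are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 2000, 100
X = rng.binomial(1, 0.5, size=(n_samples, n_features))

# planted causative rule: features 0 and 1 present, feature 2 absent
rule = {0: 1, 1: 1, 2: 0}
labels = np.all([X[:, f] == v for f, v in rule.items()], axis=0).astype(int)

def generality_and_precision(X, labels, rule):
    """Generality = number of samples exactly matching the rule;
    precision = fraction of matching samples carrying the phenotype."""
    match = np.all([X[:, f] == v for f, v in rule.items()], axis=0)
    generality = int(match.sum())
    precision = float(labels[match].mean()) if generality else 0.0
    return generality, precision

g, p = generality_and_precision(X, labels, rule)
# in the noiseless scenario the causative rule has precision exactly 1.0
```

In the noiseless case the causative rule sits at precision 1.0, while shorter sub-rules trade precision for generality, producing the clusters in panel (A).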
Correlation of performance metrics and data dimensions with rule recovery.
| Recovery metric | ROC-AUC | PR-AUC | N features | Sample size |
|---|---|---|---|---|
| Fraction of rules recovered | 0.672 | 0.585 | -0.151 | 0.556 |
| Mean partial recovery, all rules | 0.683 | 0.581 | -0.251 | 0.657 |
| Median rank of first recovered rule | 0.268 | 0.195 | -0.073 | 0.071 |
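The table's entries are correlations computed across simulation runs. A small synthetic sketch of that computation is below; the generated numbers are stand-ins, not the paper's results, and Pearson correlation is assumed as one plausible choice.

```python
# Illustrative version of the table's computation: across many simulated
# runs, correlate classifier performance with the fraction of planted rules
# recovered. All values here are synthetic.
import numpy as np

rng = np.random.default_rng(1)
n_runs = 30
roc_auc = rng.uniform(0.6, 0.95, size=n_runs)          # per-run ROC-AUC
# hypothetical recovery that tracks classifier quality with some noise
frac_recovered = 0.8 * roc_auc + rng.normal(0, 0.05, size=n_runs)

# Pearson correlation between performance and rule recovery
r = float(np.corrcoef(roc_auc, frac_recovered)[0, 1])
```

A positive `r`, as in the first column of the table, indicates that better-performing forests tend to yield more fully recovered rules.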
FIGURE 3 (A) Performance of the random forest classifier, as measured by the area under the receiver operating characteristic curve (ROC-AUC), is not strongly perturbed by simplifying the OTU representation to a presence/absence scheme vs. the original continuous counts. The dashed line indicates the performance of a perfectly random classifier. (B) The area under the precision-recall curve (PR-AUC) is similarly not strongly affected by the new representation scheme. The dashed horizontal line is the random-performance line. (C) Each point represents a unique candidate sub-rule. The x-axis is the number of samples in the data matrix that match that rule. The y-axis is the fraction of matching samples diagnosed with Crohn’s disease. (D) The taxon identities of the OTUs that make up the most generally applicable of the sub-rules for which all matching samples have the Crohn’s disease label.
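The comparison in panels (A) and (B), training the same classifier on continuous counts versus a binarized presence/absence matrix and comparing ROC-AUC and PR-AUC, can be sketched as follows. The data here are synthetic stand-ins for the iHMP OTU table, and the label construction is purely illustrative.

```python
# Sketch of the Figure 3A-B comparison: fit a random forest on continuous
# OTU counts and on the binarized (count > 0) matrix, then compare AUCs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
counts = rng.poisson(2, size=(300, 50))  # synthetic OTU count matrix
# hypothetical noisy label tied to presence of the first OTU
y = (counts[:, 0] > 0).astype(int) ^ rng.binomial(1, 0.1, 300)

# identical random_state -> identical split for both representations
Xc_tr, Xc_te, y_tr, y_te = train_test_split(counts, y, random_state=0)
presence = (counts > 0).astype(int)      # presence/absence scheme
Xp_tr, Xp_te, _, _ = train_test_split(presence, y, random_state=0)

def auc_pair(Xtr, Xte):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    prob = clf.fit(Xtr, y_tr).predict_proba(Xte)[:, 1]
    return roc_auc_score(y_te, prob), average_precision_score(y_te, prob)

roc_c, pr_c = auc_pair(Xc_tr, Xc_te)     # continuous counts
roc_p, pr_p = auc_pair(Xp_tr, Xp_te)     # presence/absence
```

Comparing the two AUC pairs shows how much (or how little) discriminative signal is lost by discarding abundance information.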
Association rules identified by BowSaw that account for all Crohn’s disease samples.
| Rule | CD samples | Non-CD samples | New samples covered | Taxonomy | Presence |
|---|---|---|---|---|---|
| 1 | 38 | 0 | 38 |  | y |
|  |  |  |  |  | y |
|  |  |  |  |  | n |
|  |  |  |  |  | n |
|  |  |  |  |  | n |
| 2 | 41 | 4 | 20 |  | y |
|  |  |  |  |  | n |
|  |  |  |  |  | n |
|  |  |  |  |  | n |
|  |  |  |  |  | n |
|  |  |  |  |  | n |
| 3 | 9 | 1 | 9 |  | y |
|  |  |  |  |  | n |
|  |  |  |  |  | n |
| 4 | 24 | 2 | 6 |  | y |
|  |  |  |  |  | n |
|  |  |  |  |  | n |
|  |  |  |  |  | n |
| 5 | 27 | 3 | 5 |  | y |
|  |  |  |  |  | n |
|  |  |  |  |  | n |
| 6 | 5 | 0 | 2 |  | y |
|  |  |  |  |  | n |
| 7 | 7 | 0 | 2 |  | y |
|  |  |  |  |  | n |
|  |  |  |  |  | n |
| 8 | 15 | 0 | 2 |  | y |
|  |  |  |  |  | y |
|  |  |  |  |  | n |
|  |  |  |  |  | n |
|  |  |  |  |  | n |
| 9 | 3 | 0 | 1 |  | y |
|  |  |  |  |  | n |
|  |  |  |  |  | n |
|  |  |  |  |  | n |
| 10 | 10 | 1 | 1 |  | y |
|  |  |  |  |  | y |
|  |  |  |  |  | n |
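The "New samples covered" column above reflects a greedy curation: rules are chosen in descending order of how many not-yet-covered Crohn's disease samples they account for. A minimal sketch of such a greedy set-cover step is below; this is an assumed reconstruction of the Curate algorithm's behavior, and the paper's tie-breaking details may differ.

```python
# Hedged sketch of a Curate-style greedy set cover: repeatedly pick the rule
# covering the most still-uncovered CD samples until all are accounted for.
def curate(rules, cd_samples):
    """rules: dict mapping rule id -> set of CD sample ids it matches.
    Returns a list of (rule_id, new_samples_covered) in selection order."""
    uncovered = set(cd_samples)
    chosen = []
    while uncovered:
        best = max(rules, key=lambda r: len(rules[r] & uncovered))
        gain = len(rules[best] & uncovered)
        if gain == 0:
            break  # remaining samples are not covered by any rule
        chosen.append((best, gain))  # gain = "new samples covered"
        uncovered -= rules[best]
    return chosen

# toy example with hypothetical rule ids and sample ids
rules = {"r1": {1, 2, 3}, "r2": {3, 4}, "r3": {5}}
picked = curate(rules, cd_samples=[1, 2, 3, 4, 5])
# -> [("r1", 3), ("r2", 1), ("r3", 1)]
```

Each selected rule's `gain` plays the role of the table's "New samples covered" column, which is why later rules can match many CD samples yet add few new ones.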