Iterative random forests to discover predictive and stable high-order interactions.

Sumanta Basu^1,2,3, Karl Kumbier^4, James B Brown^5,4,6,7, Bin Yu^5,4,8.

Abstract

Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on random forests (RFs) and random intersection trees (RITs) and through extensive, biologically inspired simulations, we developed the iterative random forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as the RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, third-order interactions, e.g., between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF rediscovered a central role of H3K36me3 in chromatin-mediated splicing regulation and identified interesting fifth- and sixth-order interactions, indicative of multivalent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens additional avenues of inquiry into the molecular mechanisms underlying genome biology.

Keywords: genomics; high-order interaction; interpretable machine learning; random forests; stability

Year: 2018 PMID: 29351989 PMCID: PMC5828575 DOI: 10.1073/pnas.1711236115

Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205

High-throughput, genome-wide measurements of protein–DNA and protein–RNA interactions are driving new insights into the principles of functional regulation. For instance, databases generated by the Berkeley Drosophila Transcriptional Network Project (BDTNP) and the ENCODE consortium provide maps of transcription factor (TF) binding events and chromatin marks for substantial fractions of the regulatory factors active in the model organism Drosophila melanogaster and human-derived cell lines, respectively (1–6). A central challenge with these data lies in the fact that chromatin immunoprecipitation sequencing (ChIP-seq), the principal tool used to measure DNA–protein interactions, assays a single protein target at a time. In well-studied systems, regulatory factors such as TFs act in concert with other chromatin-associated and RNA-associated proteins, often through stereospecific interactions (5, 7); for a review see ref. 8. While several methods have been developed to identify interactions in large genomics datasets, for example refs. 9–11, these approaches either focus on pairwise relationships or require explicit enumeration of higher-order interactions, which becomes computationally infeasible for even moderate-sized datasets. In this paper, we present a computationally efficient tool for directly identifying high-order interactions in a supervised learning framework. We note that the interactions we identify do not necessarily correspond to biomolecular complexes or physical interactions. However, among the 20 pairwise TF interactions identified as stable, 80% have been previously reported as physical interactions. The empirical success of our approach, combined with its computational efficiency, stability, and interpretability, makes it uniquely positioned to guide inquiry into the high-order mechanisms underlying functional regulation.
Popular statistical and machine-learning methods for detecting interactions among features include decision trees and their ensembles: CART (12), random forests (RFs) (13), Node Harvest (14), Forest Garrote (15), and Rulefit3 (16), as well as methods more specific to gene–gene interactions with categorical features, such as logic regression (17), multifactor dimensionality reduction (18), and Bayesian epistasis mapping (19). With the exception of RFs, the above tree-based procedures grow shallow trees to prevent overfitting, excluding the possibility of detecting high-order interactions without affecting predictive accuracy. RFs are an attractive alternative, leveraging high-order interactions to obtain state-of-the-art prediction accuracy. However, interpreting interactions in the resulting tree ensemble remains a challenge. We take a step toward overcoming these issues by proposing a fast algorithm built on RFs that searches for stable, high-order interactions. Our method, the iterative random forest algorithm (iRF), sequentially grows feature-weighted RFs to perform soft dimension reduction of the feature space and stabilize decision paths. We decode the fitted RFs using a generalization of the random intersection trees algorithm (RIT) (20). This procedure identifies high-order feature combinations that are prevalent on the RF decision paths. In addition to the high predictive accuracy of RFs, the decision tree base learner captures the underlying biology of local, combinatorial interactions (21), an important feature for biological data, where a single molecule often performs many roles in various cellular contexts. Moreover, invariance of decision trees to monotone transformations (12) to a large extent mitigates normalization issues that are a major concern in the analysis of genomics data, where signal-to-noise ratios vary widely even between biological replicates (22, 23). 
Using empirical and numerical examples, we show that iRF is competitive with RF in terms of predictive accuracy and extracts both known and compelling candidate interactions in two motivating biological problems in epigenomics and transcriptomics. An open-source R implementation of iRF is available through CRAN (https://cran.r-project.org/web/packages/iRF/index.html).

Our Method: Iterative RFs

The iRF algorithm searches for high-order feature interactions in three steps. First, iterative feature reweighting adaptively regularizes RF fitting. Second, decision rules extracted from a feature-weighted RF map from continuous or categorical to binary features. This mapping allows us to identify prevalent interactions in the RF through a generalization of the RIT, a computationally efficient algorithm that searches for high-order interactions in binary data (20). Finally, a bagging step assesses the stability of recovered interactions with respect to the bootstrap perturbation of the data. We briefly review the feature-weighted RF and RIT before presenting iRF.

Preliminaries: Feature-Weighted RF and RIT.

To reduce the dimensionality of the feature space without removing marginally unimportant features that may participate in high-order interactions, we use a feature-weighted version of RF. Specifically, for a set of nonnegative weights $w = (w_1, \ldots, w_p)$, where $p$ is the number of features, let $\text{RF}(w)$ denote a feature-weighted RF constructed with $w$. In $\text{RF}(w)$, instead of taking a uniform random sample of features at each split, one chooses the $j$th feature with probability proportional to $w_j$. Weighted-tree ensembles have been proposed in ref. 24 under the name “enriched random forests” and used for feature selection in genomic data analysis. Note that with this notation, Breiman’s original RF amounts to $\text{RF}(1/p, \ldots, 1/p)$. iRF builds upon a generalization of the RIT, an algorithm that performs a randomized search for high-order interactions among binary features in a deterministic setting. More precisely, the RIT searches for co-occurring collections of $s$ binary features, or order-$s$ interactions, that appear with greater frequency in a given class. The algorithm recovers such interactions with high probability (relative to the randomness it introduces) at a substantially lower computational cost than $O(p^s)$, provided the interaction pattern is sufficiently prevalent in the data and individual features are sparse. We briefly present the basic RIT algorithm and refer readers to the original paper (20) for a complete description. Consider a binary classification problem with $n$ observations and $p$ binary features. Suppose we are given data in the form $(\mathcal{I}_i, Z_i)$, $i = 1, \ldots, n$. Here, each $Z_i \in \{0, 1\}$ is a binary label and $\mathcal{I}_i \subseteq \{1, \ldots, p\}$ is a feature-index subset indicating the indexes of “active” features associated with observation $i$. In the context of gene transcription, $\mathcal{I}_i$ can be thought of as a collection of TFs and histone modifications with abnormally high or low enrichments near the $i$th gene’s promoter region, and $Z_i$ can indicate whether gene $i$ is transcribed or not.
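With these notations, where $\mathcal{I}_i \subseteq \{1, \ldots, p\}$ is the active feature set and $Z_i \in \{0, 1\}$ the label of observation $i$, prevalence of an interaction $S \subseteq \{1, \ldots, p\}$ in the class $C \in \{0, 1\}$ is defined as

$$P_n(S \mid Z = C) := \frac{\sum_{i=1}^{n} 1(S \subseteq \mathcal{I}_i,\ Z_i = C)}{\sum_{i=1}^{n} 1(Z_i = C)},$$

where $P_n$ denotes the empirical probability distribution and $1(\cdot)$ the indicator function. For given thresholds $0 \le \theta_0 < \theta_1 \le 1$, the RIT performs a randomized search for interactions $S$ satisfying

$$P_n(S \mid Z = 1) \ge \theta_1, \qquad P_n(S \mid Z = 0) \le \theta_0. \tag{1}$$

For each class $C \in \{0, 1\}$ and a prespecified integer $D$, let $j_1, \ldots, j_D$ be randomly chosen indexes from the set of observations $\{i : Z_i = C\}$. To search for interactions $S$ satisfying condition 1, the RIT takes $D$-fold intersections $\mathcal{I}_{j_1} \cap \mathcal{I}_{j_2} \cap \cdots \cap \mathcal{I}_{j_D}$ from the randomly selected observations in class $C$. To reduce computational complexity, these intersections are performed in a tree-like fashion, where each nonleaf node has $n_{child}$ children. This process is repeated $M$ times for a given class $C$, resulting in a collection of survived interactions $\mathcal{S} = \bigcup_{m=1}^{M} \mathcal{S}_m$, where each $\mathcal{S}_m$ is the set of interactions that remains following the $D$-fold intersection process in tree $m = 1, \ldots, M$. The prevalences of interactions across different classes are subsequently compared using condition 1. The main intuition is that if an interaction $S$ is highly prevalent in a particular class, it will survive the $D$-fold intersection with high probability.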
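The prevalence definition and the tree-like intersection search can be sketched in a few lines of Python. This is a minimal illustration and not the reference implementation of ref. 20: the function names, the toy data, and the default values of `n_trees`, `depth`, and `n_child` are all assumptions chosen for readability.

```python
import random

def prevalence(interaction, data, cls):
    """Fraction of class-`cls` observations whose active set contains `interaction`."""
    in_class = [active for active, label in data if label == cls]
    if not in_class:
        return 0.0
    hits = sum(1 for active in in_class if interaction <= active)
    return hits / len(in_class)

def random_intersection_tree(active_sets, depth, n_child, rng):
    """Grow one intersection tree and return the non-empty sets left at its leaves.

    Every root-to-leaf path intersects `depth` randomly chosen observations.
    """
    frontier = [rng.choice(active_sets)]  # root: one random observation
    for _ in range(depth - 1):
        next_frontier = []
        for node in frontier:
            for _ in range(n_child):
                child = node & rng.choice(active_sets)
                if child:  # empty intersections are dropped
                    next_frontier.append(child)
        frontier = next_frontier
    return set(frontier)

def rit(data, cls, n_trees=50, depth=3, n_child=2, seed=0):
    """Union of the interactions that survive in `n_trees` intersection trees."""
    rng = random.Random(seed)
    active_sets = [frozenset(active) for active, label in data if label == cls]
    survived = set()
    if not active_sets:
        return survived
    for _ in range(n_trees):
        survived |= random_intersection_tree(active_sets, depth, n_child, rng)
    return survived

# Toy data (made up): features 0 and 1 co-occur in every class-1 observation.
toy_data = [
    ({0, 1, 2}, 1), ({0, 1, 3}, 1), ({0, 1, 4}, 1), ({0, 1, 5}, 1),
    ({2, 3}, 0), ({0, 4}, 0), ({1, 5}, 0),
]
survived = rit(toy_data, cls=1)
```

In this toy example the pair {0, 1} has prevalence 1 in class 1 and 0 in class 0, so it satisfies condition 1 for any admissible thresholds, and every surviving set contains it.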

iRFs.

The iRF algorithm places interaction discovery in a supervised learning framework to identify class-specific, active index sets required for the RIT. This framing allows us to recover high-order interactions that are associated with accurate prediction in feature-weighted RFs. We consider the binary classification setting with training data in the form $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, with continuous or categorical features $x = (x_1, \ldots, x_p)$, and a binary label $y \in \{0, 1\}$. Our goal is to find subsets $S \subseteq \{1, \ldots, p\}$ of features, or interactions, that are both highly prevalent within a class and that provide good differentiation between the two classes. To encourage generalizability of our results, we search for interactions in ensembles of decision trees fitted on bootstrap samples of $\mathcal{D}$. This allows us to identify interactions that are robust to small perturbations in the data. Before describing iRF, we present a generalized RIT that uses any RF, weighted or not, to generate active index sets from continuous or categorical features. Our generalized RIT is independent of the other iRF components in the sense that other approaches could be used to generate the input for the RIT. We remark on our particular choices in the supplementary material.

Generalized RIT (Through an RF).

For each tree $t = 1, \ldots, T$ in the output tree ensemble of an RF, we collect all leaf nodes and index them by $j_t = 1, \ldots, J(t)$. Each feature–response pair $(x_i, y_i)$ is represented with respect to a tree $t$ by $(\mathcal{I}_{i_t}, Z_{i_t})$, where $\mathcal{I}_{i_t}$ is the set of unique feature indexes falling on the path of the leaf node containing $(x_i, y_i)$ in the $t$th tree. Hence, each $(x_i, y_i)$ produces $T$ such index set and label pairs, corresponding to the $T$ trees. We aggregate these pairs across observations and trees as

$$\mathcal{R} = \{(\mathcal{I}_{i_t}, Z_{i_t}) : x_i \text{ falls in leaf node } j_t \text{ of tree } t\} \tag{2}$$

and apply RIT on this transformed dataset $\mathcal{R}$ to obtain a set of interactions. We now describe the three components of iRF. A depiction is shown in Fig. 1; the complete workflow and further remarks on the algorithm are presented in the supplementary material.
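The mapping from fitted trees to RIT input can be illustrated with a toy tree stored as nested dictionaries. This is a sketch under stated assumptions: the dictionary layout, the function names, and the use of the leaf's predicted label as the class label (as suggested by the Fig. 1 caption) are illustrative choices, not the data structures of the R package.

```python
def decision_path(tree, x):
    """Return (set of feature indexes used on the path, predicted leaf label) for x."""
    features = set()
    node = tree
    while "label" not in node:  # internal nodes carry a split, leaves carry a label
        j = node["feature"]
        features.add(j)
        node = node["left"] if x[j] <= node["threshold"] else node["right"]
    return frozenset(features), node["label"]

def to_rit_input(forest, X):
    """One (active index set, label) pair per observation and per tree."""
    return [decision_path(tree, x) for x in X for tree in forest]

# Hypothetical tree: split on feature 0, then on feature 2 in the right branch.
toy_tree = {
    "feature": 0, "threshold": 0.5,
    "left": {"label": 0},
    "right": {
        "feature": 2, "threshold": 1.0,
        "left": {"label": 0},
        "right": {"label": 1},
    },
}
X = [[0.9, 0.0, 2.0], [0.1, 0.0, 2.0]]
pairs = to_rit_input([toy_tree, toy_tree], X)
```

Each observation contributes one pair per tree, so a forest of $T$ trees turns $n$ observations into $nT$ binary-feature records, each listing the features that were actually used to route that observation to its leaf.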

Fig. 1.

iRF workflow. Iteratively reweighted RFs (blue boxes) are trained on full data $\mathcal{D}$ and pass Gini importance as weights to the next iteration. In iteration $K$ (red box), feature-weighted RFs are grown using $w^{(K)}$ on $B$ bootstrap samples of the full data, $\mathcal{D}_{(1)}, \ldots, \mathcal{D}_{(B)}$. Decision paths and predicted leaf node labels are passed to the RIT (green box), which computes prevalent interactions in the RF ensemble. Recovered interactions are scored for stability across (outer-layer) bootstrap samples.

1) Iteratively reweighted RF.

Given an iteration number $K$, iRF iteratively grows $K$ feature-weighted RFs $\text{RF}(w^{(k)})$, $k = 1, \ldots, K$, on the data $\mathcal{D}$. The first iteration of iRF ($k = 1$) starts with $w^{(1)} = (1/p, \ldots, 1/p)$ and stores the importance (mean decrease in Gini impurity) of the $p$ features as $v^{(1)} = (v_1^{(1)}, \ldots, v_p^{(1)})$. For iterations $k = 2, \ldots, K$, we set $w^{(k)} = v^{(k-1)}$ and grow a weighted RF with weights set equal to the RF feature importance from the previous iteration. Iterative approaches for fitting RFs have been previously proposed in ref. 25 and combined with hard thresholding to select features in microarray data.
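The reweighting loop and the weighted choice of split candidates can be sketched as follows. The RF fit itself is abstracted behind a user-supplied callable: `fit_importance` and the stand-in `toy_importance` are hypothetical, and a real run would replace them with a feature-weighted RF that returns Gini importances. Drawing candidates one at a time without replacement is one way to realize "probability proportional to weight" and is an assumption of this sketch.

```python
import random

def sample_candidates(weights, mtry, rng):
    """Draw up to `mtry` distinct feature indexes, each draw proportional to its weight."""
    remaining = {j: w for j, w in enumerate(weights) if w > 0}
    chosen = []
    while remaining and len(chosen) < mtry:
        idx = list(remaining)
        j = rng.choices(idx, weights=[remaining[i] for i in idx], k=1)[0]
        chosen.append(j)
        del remaining[j]  # sample without replacement
    return chosen

def iterate_weights(fit_importance, p, n_iter):
    """Return the weights used in each of the `n_iter` iterations.

    `fit_importance(w)` stands for: fit a feature-weighted RF with weights `w`
    and return the importance of each of the p features.
    """
    w = [1.0 / p] * p  # the first iteration uses uniform weights
    weights_used = []
    for _ in range(n_iter):
        weights_used.append(w)
        w = fit_importance(w)  # next weights = current importances
    return weights_used

def toy_importance(w, signal=(3.0, 1.0, 0.0)):
    """Stand-in for an RF fit: importance grows with current weight times true signal."""
    raw = [wi * si for wi, si in zip(w, signal)]
    total = sum(raw)
    return [r / total for r in raw]

weights_used = iterate_weights(toy_importance, p=3, n_iter=3)
candidates = sample_candidates(weights_used[1], mtry=2, rng=random.Random(0))
```

In the toy run the weight on the uninformative third feature drops to zero after the first iteration, so it is never offered as a split candidate afterwards, while informative features are offered more often. This is the soft dimension reduction that the iteration is meant to achieve.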

2) Generalized RIT (through $\text{RF}(w^{(K)})$).

We apply the generalized RIT to the last feature-weighted RF grown in iteration $K$. That is, decision rules generated in the process of fitting $\text{RF}(w^{(K)})$ provide the mapping from continuous or categorical to binary features required for the RIT. This process produces a collection of interactions $\mathcal{S}$.

3) Bagged stability scores.

In addition to bootstrap sampling in the weighted RF, we use an “outer layer” of bootstrapping to assess the stability of recovered interactions. We generate bootstrap samples of the data $\mathcal{D}_{(b)}$, $b = 1, \ldots, B$, fit $\text{RF}(w^{(K)})$ on each bootstrap sample $\mathcal{D}_{(b)}$, and use the generalized RIT to identify interactions $\mathcal{S}_{(b)}$ across each bootstrap sample. We define the stability score of an interaction $S \in \bigcup_{b=1}^{B} \mathcal{S}_{(b)}$ as

$$\text{sta}(S) = \frac{1}{B} \sum_{b=1}^{B} 1\{S \in \mathcal{S}_{(b)}\},$$

representing the proportion of times (out of $B$ bootstrap samples) an interaction appears as an output of the RIT. This averaging step is exactly the bagging idea of Breiman (26).
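The stability score is a simple proportion, which a short Python sketch makes concrete. The function name and the toy RIT outputs below are illustrative assumptions; the interactions are written as sets of feature indexes.

```python
def stability_scores(interactions_per_bootstrap):
    """Proportion of bootstrap samples in which each interaction was recovered."""
    n_boot = len(interactions_per_bootstrap)
    counts = {}
    for found in interactions_per_bootstrap:
        for s in set(found):  # count each interaction at most once per sample
            counts[s] = counts.get(s, 0) + 1
    return {s: c / n_boot for s, c in counts.items()}

# Made-up RIT outputs from four bootstrap samples.
pair = frozenset({0, 1})
triple = frozenset({0, 1, 2})
runs = [{pair, triple}, {pair}, {pair, triple}, {pair}]
scores = stability_scores(runs)
```

Here the pair is recovered in all four samples and the triple in two of them, giving stability scores of 1.0 and 0.5.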

iRF Tuning Parameters.

The iRF algorithm inherits tuning parameters from its two base algorithms, RF and RIT. The predictive performance of RF is known to be highly resistant to choice of parameters (13), so we use the default parameters in the R randomForest package. Specifically, we set the number of trees $\text{ntree} = 500$ and the number of variables sampled at each node $\text{mtry} = \lfloor \sqrt{p} \rfloor$ and grow trees to purity. For the RIT algorithm, we use the basic version or algorithm 1 of ref. 20 and grow $M$ intersection trees of depth $D$ with $n_{child}$ children per nonleaf node, a configuration that empirically leads to a good balance between computation time and quality of recovered interactions. We find that both prediction accuracy and interaction recovery of iRF are fairly robust to these parameter choices. In addition to the tuning parameters of RF and RIT, the iRF workflow introduces two additional tuning parameters: (i) number of bootstrap samples $B$ and (ii) number of iterations $K$. Larger values of $B$ provide a more precise description of the uncertainty associated with each interaction at the expense of increased computation cost. In our simulations and case studies, results are qualitatively similar across the values of $B$ we used. The number of iterations controls the degree of regularization on the fitted RF. We find that the quality of recovered interactions can improve dramatically for $K > 1$. In both case studies, we report interactions with $K$ selected by fivefold cross-validation.

Simulation Experiments

We developed and tested iRF through extensive simulation studies based on biologically inspired generative models using both synthetic and real data. In particular, we generated responses using Boolean rules intended to reflect the stereospecific nature of interactions among biomolecules (27). In total, we considered seven generative models built from AND, OR, and exclusive OR (XOR) rules, varying the number of observations and the number of features across settings. We introduced noise into our models both by randomly swapping the response labels of a fraction of observations and through RF-derived rules learned on held-out data. We find that the predictive performance of iRF ($K > 1$) is generally comparable with that of RF ($K = 1$). However, iRF recovers the full data-generating rule, up to an order-8 interaction in our simulations, as the most stable interaction in many settings where RF rarely recovers high-order interactions. The computational complexity of recovering these interactions is substantially lower than that of competing methods that search for interactions incrementally. Our experiments suggest that iterative reweighting encourages iRF to use a stable set of features on decision paths. Specifically, features that are identified as important in early iterations tend to be selected among the first several splits in later iterations. This allows iRF to generate partitions of the feature space where marginally unimportant, active features become conditionally important and thus more likely to be selected on decision paths. For a full description of simulations and results, see the supplementary material.
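A Boolean AND rule with label noise, in the spirit of the generative models described above, can be simulated as follows. This is only a sketch: the Gaussian features, the threshold at zero, the sample sizes, and the function name are assumptions made for illustration and are not the settings used in the simulation study.

```python
import random

def simulate_and_rule(n, p, active, threshold, flip_prob, seed=0):
    """Simulate n observations of p features with an AND rule over `active`.

    The response is 1 when every active feature exceeds `threshold`;
    each label is then swapped with probability `flip_prob`.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        label = int(all(x[j] > threshold for j in active))
        if rng.random() < flip_prob:
            label = 1 - label  # label noise
        X.append(x)
        y.append(label)
    return X, y

# Noise-free order-3 rule on features 0, 1, 2 out of 10 (illustrative sizes).
X_clean, y_clean = simulate_and_rule(200, 10, (0, 1, 2), 0.0, flip_prob=0.0)
# Same rule with 10% of labels expected to be swapped.
X_noisy, y_noisy = simulate_and_rule(200, 10, (0, 1, 2), 0.0, flip_prob=0.1)
```

Under such a rule each active feature is only weakly informative on its own, which is the regime where a method must recover the full feature combination rather than rank features marginally.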

Case Study I: Enhancer Elements in Drosophila

Development and function in multicellular organisms rely on precisely regulated spatiotemporal gene expression. Enhancers play a critical role in this process by coordinating combinatorial TF binding, whose integrated activity leads to patterned gene expression during embryogenesis (28). In the early Drosophila embryo, a small cohort of 40 TFs drives patterning (for a review see ref. 29), providing a well-studied, simplified model system in which to investigate the relationship between TF binding and enhancer activities. Extensive work has resulted in genome-wide, quantitative maps of DNA occupancy for 23 TFs (30) and 13 histone modifications (6), as well as labels of enhancer status for genomic sequences in blastoderm (stage 5) Drosophila embryos (1, 31). See the supplementary material for descriptions of data collection and preprocessing. To investigate the relationship between enhancers, TF binding, and chromatin state, we used iRF to predict enhancer status for each of the genomic sequences, which were split into training and test sets. We achieved an area under the precision-recall curve (AUC-PR) on the held-out test data of 0.5 for $K = 5$ (Fig. 2A). The corresponding Matthews correlation coefficient (MCC) and positive predictive value (PPV) are obtained by thresholding predicted probabilities to maximize MCC in the training data.

Fig. 2.

(A) Accuracy of iRF (AUC-PR) in predicting active elements from TF binding and histone modification data. (B) The 20 most stable interactions recovered by iRF after five iterations. Interactions that are a strict subset of another interaction with stability score $\geq 0.5$ have been removed for cleaner visualization. iRF recovers known interactions among Gt, Kr, and Hb and interacting roles of master regulator Zld. (C) Surface maps demonstrating the proportion of active enhancers by quantiles of Zld, Gt, and Kr binding (held-out test data). On the subset of data where Kr binding is lower than the median Kr level, the proportion of active enhancers does not change with Gt and Zld. On the subset of data with Kr binding above the median level, the structure of the response surface reflects an order-3 AND interaction: Increased levels of Zld, Gt, and Kr binding are indicative of enhancer status for a subset of observations. (D) Quantiles of Zld, Gt, and Kr binding grouped by enhancer status (balanced sample of held-out test data). The block of active elements highlighted in red represents the subset of observations for which the AND interaction is active.

Fig. 2B reports stability scores of recovered interactions for $K = 5$. We note that the data analyzed are whole embryo and interactions found by iRF do not necessarily represent physical complexes. However, for the well-studied case of pairwise TF interactions, 80% of our findings with stability score $> 0.5$ have been previously reported as physical interactions. For instance, interactions among gap proteins Giant (Gt), Krüppel (Kr), and Hunchback (Hb), some of the most well-characterized interactions in the early Drosophila embryo (32), are all highly stable. Physical evidence supporting high-order mechanisms is a frontier of experimental research and hence limited, but our excellent pairwise results give us hope that high-order interactions we identify as stable have a good chance of being confirmed by follow-up work.
iRF also identified several high-order interactions surrounding the early regulatory factor Zelda (Zld). Zld has been previously shown to play an essential role during the maternal–zygotic transition (33, 34), and there is evidence to suggest that Zld facilitates binding to regulatory elements (35). We find that Zld binding in isolation rarely drives enhancer activity, but in the presence of other TFs, particularly the anterior–posterior (AP) patterning factors Gt and Kr, it is highly likely to induce transcription. This generalizes the dependence of Bicoid-induced transcription on Zld binding to several of the AP factors (36) and is broadly consistent with the idea that Zld is a potentiating, rather than an activating, factor (35). More broadly, response surfaces associated with stable high-order interactions indicate AND-like rules (Fig. 2C). In other words, the proportion of active enhancers is substantially higher for sequences where all TFs are sufficiently bound, compared with sequences where only some of the TFs exhibit high levels of occupancy. Fig. 2 C and D demonstrate a putative third-order interaction among Zld, Gt, and Kr found by iRF. In Fig. 2C, Left, the Gt-Zld response surface is plotted using only sequences for which Kr occupancy is lower than the median Kr level, and the proportion of active enhancers is uniformly low (<10%). The response surface in Fig. 2C, Right is plotted using only sequences where Kr occupancy is higher than median Kr level and shows that the proportion of active elements is as high as 60% when both Zld and Gt are sufficiently bound. This points to an order-3 AND rule, where all three proteins are required for enhancer activation in a subset of sequences. In Fig. 2D, we show the subset of sequences that correspond to this AND rule (highlighted in red), using a superheat map (37), which juxtaposes two separately clustered heat maps corresponding to active and inactive elements.
Note that the response surfaces are drawn using held-out test data to illustrate the generalizability of interactions detected by iRF. While overlapping patterns of TF binding have been previously reported (30), to the best of our knowledge this is the first report of an AND-like response surface for enhancer activation. Third-order interactions have been studied in only a handful of enhancer elements, most notably eve stripe 2 (for a review see ref. 38), and our results indicate that they are broadly important for the establishment of early zygotic transcription and therefore body patterning.

Case Study II: Alternative Splicing in a Human-Derived Cell Line

In eukaryotes, alternative splicing of primary messenger RNA (mRNA) transcripts is a highly regulated process in which multiple distinct mRNAs are produced by the same gene. In the case of mRNAs, the result of this process is the diversification of the proteome and hence the library of functional molecules in cells. The activity of the spliceosome, the ribonucleoprotein responsible for most splicing in eukaryotic genomes, is driven by complex, cell-type–specific interactions with cohorts of RNA-binding proteins (RBP) (39, 40), suggesting that high-order interactions play an important role in the regulation of alternative splicing. However, our understanding of this system derives from decades of study in genetics, biochemistry, and structural biology. Learning interactions directly from genomics data has the potential to accelerate our pace of discovery in the study of co- and posttranscriptional gene regulation. Studies, initially in model organisms, have revealed that the chromatin mark H3K36me3, the DNA-binding protein CTCF, and a few other factors all play splice-enhancing roles (41–43). However, the extent to which chromatin state and DNA-binding factors interact to modulate cotranscriptional splicing remains unknown (44). To identify interactions that form the basis of chromatin-mediated splicing, we used iRF to predict thresholded splicing rates for exons [RNA-seq percent-spliced-in (PSI) values (https://github.com/guigolab/ipsa-nf)], split into training and test sets, from ChIP-seq assays measuring enrichment of chromatin marks and TF-binding events (253 ChIP assays on 107 unique TFs and 11 histone modifications). Preprocessing methods are described in the supplementary material. In this prediction problem, we achieved an AUC-PR on the held-out test data of 0.51 for $K = 2$ (Fig. 3A). The corresponding MCC and PPV on held-out test data are obtained by thresholding predicted probabilities to maximize MCC in the training data. Fig. 3C reports stability scores of recovered interactions for $K = 2$.
We find interactions involving H3K36me3, a number of interactions involving other chromatin marks, and posttranslationally modified states of RNA Pol II. In particular, we find that the impact of serine 2 phosphorylation of Pol II appears highly dependent on local chromatin state. Remarkably, iRF identified an order-6 interaction surrounding H3K36me3 and S2 phospho-Pol II (stability score 0.5, Fig. 3 B and C) along with two highly stable order-5 subsets of this interaction. A subset of highly spliced exons highlighted in red is enriched for all six of these elements, indicating a potential AND-type rule related to splicing events (Fig. 3B). This observation is consistent with, and offers a quantitative model for, the previously reported predominance of cotranscriptional splicing in this cell line (45). We note that the search space of order-6 interactions grows as $\binom{p}{6}$ in the number of features $p$ and that this interaction is discovered with an order-zero increase over the computational cost of finding important features using RF. Recovering such interactions without exponential speed penalties represents a substantial advantage over previous methods and positions our approach uniquely for the discovery of complex, nonlinear interactions.
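To give a sense of scale for exhaustive enumeration, the number of candidate interactions of a given order among $p$ features is the binomial coefficient, which a few lines of Python can evaluate. The value $p = 100$ below is a hypothetical feature count chosen for illustration and is not the dimension of the splicing dataset.

```python
import math

def n_interactions(p, order):
    """Number of distinct feature subsets of size `order` among p features."""
    return math.comb(p, order)

p_example = 100  # hypothetical number of features
pairwise = n_interactions(p_example, 2)
sixth_order = n_interactions(p_example, 6)
growth = sixth_order // pairwise  # how many times larger the order-6 space is
```

With 100 features there are 4,950 pairs but more than a billion order-6 subsets, which is why methods that enumerate candidates order by order become impractical well before order 6.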

Fig. 3.

(A) Accuracy of iRF (AUC-PR) in classifying included exons from excluded exons in held-out test data. iRF shows an increase in AUC-PR over RF. (B) An order-6 interaction recovered by iRF (stability score 0.5) displayed on a superheat map which juxtaposes two separately clustered heat maps of exons with high and low splicing rates. Coenrichment of all six plotted features reflects an AND-type rule indicative of high splicing rates for the exons highlighted in red (held-out test data). The subset of Pol II, S2 phospho-Pol II, H3K36me3, H3K79me2, and H4K20me1 was recovered as an order-5 interaction in all bootstrap samples (stability score 1.0). (C) The 20 most stable interactions recovered in the second iteration of iRF. Interactions that are a strict subset of another interaction with stability score $\geq 0.5$ have been removed for cleaner visualization.

Discussion

Systems governed by nonlinear interactions are ubiquitous in biology. We developed a predictive and stable method, iRF, for learning such feature interactions. iRF identified known and promising interactions in early zygotic enhancer activation in the Drosophila embryo and posits more high-order interactions in splicing regulation for a human-derived system. Validation and assessment of complex interactions in biological systems are necessary and challenging, but new wet-lab tools are becoming available for targeted genome and epigenome engineering. For instance, the CRISPR system has been adjusted for targeted manipulation of posttranslational modifications to histones (46). This may allow for tests to determine whether modifications to distinct residues at multivalent nucleosomes function in a nonadditive fashion in splicing regulation. Several of the histone marks that appear in the interactions we report, including H3K36me3 and H4K20me1, have been previously identified (47) as essential for establishing splicing patterns in the early embryo. Our findings point to direct interactions between these two distinct marks. This observation generates interesting questions: What proteins, if any, mediate these dependencies? What is the role of Phospho-S2 Pol II in the interaction? Proteomics on ChIP samples may help reveal the complete set of factors involved in these processes, and new assays such as Co-ChIP may enable the mapping of multiple histone marks at single-nucleosome resolution (48). We have offered evidence that iRF constitutes a useful tool for generating hypotheses from the study of high-throughput genomics data, but many challenges await. iRF currently handles data heterogeneity only implicitly, and the order of detectable interaction depends directly on the depth of the tree, which is on the order of $\log_2(n)$. We are currently investigating local importance measures to explicitly relate discovered interactions to specific observations.
This strategy has the potential to further localize feature selection and improve the interpretability of discovered rules. Additionally, iRF does not distinguish between interaction forms, for instance additive vs. nonadditive. We are exploring tests of rule structure to provide better insights into the precise form of rule–response relationships. To date, machine learning has been driven largely by the need for accurate prediction. Leveraging machine-learning algorithms for scientific insights into the mechanics that underlie natural and artificial systems will require an understanding of why prediction is possible. The stability principle, which asserts that statistical results should at a minimum be reproducible across reasonable data and model perturbations, has been advocated in ref. 49 as a second consideration to work toward understanding and interpretability in science. Iterative and data-adaptive regularization procedures such as iRF are based on prediction and stability and have the potential to be widely adaptable to diverse algorithmic and computational architectures, improving interpretability and informativeness by increasing the stability of learners.

