Literature DB >> 18826957

BNFinder: exact and efficient method for learning Bayesian networks.

Abstract

MOTIVATION: Bayesian methods are widely used in many different areas of research. Recently, it has become a very popular tool for biological network reconstruction, due to its ability to handle noisy data. Even though there are many software packages allowing for Bayesian network reconstruction, only few of them are freely available to researchers. Moreover, they usually require at least basic programming abilities, which restricts their potential user base. Our goal was to provide software which would be freely available, efficient and usable to non-programmers.
RESULTS: We present a BNFinder software, which allows for Bayesian network reconstruction from experimental data. It supports dynamic Bayesian networks and, if the variables are partially ordered, also static Bayesian networks. The main advantage of BNFinder is the use exact algorithm, which is at the same time very efficient (polynomial with respect to the number of observations).

Mesh：

Year: 2008 PMID： 18826957 PMCID： PMC2639006 DOI： 10.1093/bioinformatics/btn505

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Computational methods of Bayesian network inference are very popular in many different areas of bioinformatics and other fields of science. Examples include: regulatory network reconstruction (Dojer et al., 2006; Husmeier, 2003) where nodes represent genes and edges represent statistical dependencies which may indicate regulatory interactions; predicting gene expression from promoter sequence (Beer and Tavazoie, 2004; Segal et al., 2003) where edges lead from promoter features (motif occurrences, their positions, etc.) to expression patterns (affinity to overlapping expression clusters); neural signal transduction analysis (Smith et al., 2006) where network topology mimics the topology of connections between different parts of the brain and many others. Despite differences in the interpretation of network structure, the methodology of these studies is remarkably similar [see, Needham et al. (2007) for overview and further examples]. We aim to provide new software which could be used for different applications of Bayesian network reconstruction. Most programs learning Bayesian networks from data are based on heuristic search techniques of identifying good models. This is due to a number of discouraging complexity results (Chickering, 1996; Chickering et al., 2004; Meek, 2001) showing that, without restrictive assumptions, learning Bayesian networks from data is NP-hard with respect to the number of network vertices. On the other hand, the known exact algorithms learn the structure of optimal networks having up to 20–40 vertices (Ott et al., 2004). In an extensive comparison, Murphy (2007) lists over 50 software packages available for different applications of Bayesian networks. However, if one is searching for a free software able to infer the structure of static and dynamic Bayesian networks from data there are only two such applications: Banjo package (Smith et al., 2006): Bayesian ANalysis with Java Objects, Bayes Net Toolbox (Murphy, 2002) for Matlab with an extension for dynamic Bayesian networks inference using MCMC (Husmeier, 2003). Both of these software packages use heuristic search algorithms to find the best scoring network topology in a vast space of possible directed graphs, usually with some constraints on the maximal vertex in-degree.

2 METHODS

For a thorough treatment of the contents of the present section, we refer the reader to Supplementary Materials. The BNFinder program is based on a novel polynomial-time algorithm for learning an optimal Bayesian network structure (Dojer, 2006). The algorithm was designed to save reasonable speed and perfect quality of learning in a wide class of problems occurring in the computational molecular biology. It works under the assumption that there is no need to examine the acyclicity of the graph, which is satisfied in the following cases: When dealing with dynamic Bayesian networks, a dynamic Bayesian network describes stochastic evolution of a set of random variables over discretized time. Therefore, conditional distributions refer to random variables in neighboring time points and the graph is always acyclic. In case of static Bayesian networks, the set of possible network structures must be restricted. BNFinder lets the user divide the set of variables into an ordered set of disjoint subsets of variables, where edges can only lead from upstream to downstream subsets. If such ordering is not known beforehand, one can try to run BNFinder with different orderings and choose a network with the best overall score. BNFinder learns optimal networks with respect to two generally used scoring criteria: Bayesian–Dirichlet equivalence (BDe) and minimal description length (MDL). The (default) BDe score originates from Bayesian statistics and corresponds to the posterior probability of a network-given data. The MDL score originates from information theory and corresponds to the length of the data compressed with the compression model derived from the network structure. It also has a statistical interpretation as an approximation of the posterior probability. The algorithm works in polynomial time for both scores, but computations with the MDL are faster, especially for large datasets. However, we recommend using the BDe score due to its exactness in the statistical interpretation. Both MDL and BDe scores were originally designed for discrete variables. Continuous variables are handled with corresponding scores, derived under the assumption that conditional distributions belong to a family of Gaussian mixtures. BNFinder may learn either dynamic Bayesian networks (from time series data) or static ones (from independent experiment data). In the second case it is necessary to specify constraints on the network's structure forcing its acyclicity. A special treatment is required for experiments, in which the values of some variables were perturbed (e.g. knockout experiments). Since perturbations change the structure of interactions, learning procedures have to use data selectively. BNFinder handles perturbations in the way following Dojer et al. (2006), i.e. for scoring sets of parents of a variable v, it takes into account only the experiments where v was not perturbed. A prior distribution on the network structure may be specified through assigning weights to potential variable interactions in the way following Tamada et al. (2003). Moreover, the size of regulator sets of each variable may be bounded to a given number and the spaces of possible conditional probability distributions of selected variables may be restricted to noisy-and or noisy-or distributions. There are important biological applications of Bayesian networks, in which usually the amount of learning data is small relative to the network size (e.g. reconstruction of gene regulatory networks from microarray data). Typically in such cases suboptimal models explain the data nearly as well as the optimal (highest scoring) one. For this reason, Friedman and Koller (2003) propose to pay attention for network features frequently appearing in suboptimal networks. Following this idea, BNFinder splits a potential network structure into independently learned features, each one composed of a vertex and its parent set. For each vertex BNFinder returns as an output a user-specified number of suboptimal parents features with their relative posterior probabilities. Setting this parameter to 1 causes BNFinder to learn the optimal network structure composed of the highest scoring features. Otherwise returned features constitute a class of suboptimal networks. Output may be written in a few formats, supported in various graph and Bayesian network applications.

3 IMPLEMENTATION

The BNFinder software is implemented in the Python programming language so it can be installed and run on all popular operating systems. The only requirement is the availability of a recent version (>2.4) of the Python interpreter. Detailed installation instructions can be found on the Supplementary Web Page. Besides of the stand-alone version of BNFinder we have made a publicly available web server which allows for using BNFinder running on our servers on users' data. The server uses a very simple web form for input and sends the results to the e-mail address provided. To save the resources, we have limited the web version to handle at most 20 variables and 500 observations. In order to judge the performance of our software, we have compared it to the Banjo library (Smith et al., 2006). As a realistic dataset, we have chosen the dataset attached as an example to the Banjo package, consisting of 20 variables and 2000 observations, published by Smith et al. (2006). The authors search for a dynamic Bayesian network with an in-degree of all vertices not larger than 5. It should be noted, that the number of such networks is extremely large (((20× 19× 18× 17× 16)/(1× 2× 3× 4× 5))20∼6.4× 1084). Even though Banjo is able to analyze approximately 1 million networks per minute on a single CPU it would take it more than 1070 years to search through all possible networks. Thanks to the new algorithm (Dojer, 2006) our method is able to find the correct topology for the same dataset in a few hours on the same computer.

8 in total

1. Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks.

Authors: Dirk Husmeier
Journal: Bioinformatics Date: 2003-11-22 Impact factor: 6.937

2. Genome-wide discovery of transcriptional modules from DNA sequence and gene expression.

Authors: E Segal; R Yelensky; D Koller
Journal: Bioinformatics Date: 2003 Impact factor: 6.937

3. Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection.

Authors: Yoshinori Tamada; SunYong Kim; Hideo Bannai; Seiya Imoto; Kousuke Tashiro; Satoru Kuhara; Satoru Miyano
Journal: Bioinformatics Date: 2003-10 Impact factor: 6.937

4. Predicting gene expression from sequence.

Authors: Michael A Beer; Saeed Tavazoie
Journal: Cell Date: 2004-04-16 Impact factor: 41.582

5. Finding optimal models for small gene networks.

Authors: S Ott; S Imoto; S Miyano
Journal: Pac Symp Biocomput Date: 2004

6. Applying dynamic Bayesian networks to perturbed gene expression data.

Authors: Norbert Dojer; Anna Gambin; Andrzej Mizera; Bartek Wilczyński; Jerzy Tiuryn
Journal: BMC Bioinformatics Date: 2006-05-08 Impact factor: 3.169

7. Computational inference of neural information flow networks.

Authors: V Anne Smith; Jing Yu; Tom V Smulders; Alexander J Hartemink; Erich D Jarvis
Journal: PLoS Comput Biol Date: 2006-10-12 Impact factor: 4.475

Review 8. A primer on learning in Bayesian networks for computational biology.

Authors: Chris J Needham; James R Bradford; Andrew J Bulpitt; David R Westhead
Journal: PLoS Comput Biol Date: 2007-08 Impact factor: 4.475

8 in total

31 in total

1. Tissue-specific analysis of chromatin state identifies temporal signatures of enhancer activity during embryonic development.

Authors: Stefan Bonn; Robert P Zinzen; Charles Girardot; E Hilary Gustafson; Alexis Perez-Gonzalez; Nicolas Delhomme; Yad Ghavi-Helm; Bartek Wilczyński; Andrew Riddell; Eileen E M Furlong
Journal: Nat Genet Date: 2012-01-08 Impact factor: 38.330

2. Inferring gene regression networks with model trees.

Authors: Isabel A Nepomuceno-Chamorro; Jesus S Aguilar-Ruiz; Jose C Riquelme
Journal: BMC Bioinformatics Date: 2010-10-15 Impact factor: 3.169

3. Comparative analysis of cis-regulation following stroke and seizures in subspaces of conserved eigensystems.

Authors: Michal Dabrowski; Norbert Dojer; Malgorzata Zawadzka; Jakub Mieczkowski; Bozena Kaminska
Journal: BMC Syst Biol Date: 2010-06-17

4. Lignification in sugarcane: biochemical characterization, gene discovery, and expression analysis in two genotypes contrasting for lignin content.

Authors: Alexandra Bottcher; Igor Cesarino; Adriana Brombini dos Santos; Renato Vicentini; Juliana Lischka Sampaio Mayer; Ruben Vanholme; Kris Morreel; Geert Goeminne; Jullyana Cristina Magalhães Silva Moura; Paula Macedo Nobile; Sandra Maria Carmello-Guerreiro; Ivan Antonio dos Anjos; Silvana Creste; Wout Boerjan; Marcos Guimarães de Andrade Landell; Paulo Mazzafera
Journal: Plant Physiol Date: 2013-10-21 Impact factor: 8.340