
BNFinder2: Faster Bayesian network learning and Bayesian classification.

Norbert Dojer, Pawel Bednarz, Agnieszka Podsiadlo, Bartek Wilczynski

Abstract

SUMMARY: Bayesian Networks (BNs) are versatile probabilistic models applicable to many different biological phenomena. In biological applications the structure of the network is usually unknown and needs to be inferred from experimental data. BNFinder is a fast software implementation of an exact algorithm for finding the optimal structure of the network given a number of experimental observations. Its second version, presented in this article, represents a major improvement over the previous version. The improvements include (i) a parallelized learning algorithm leading to an order of magnitude speed-up in BN structure learning time; (ii) inclusion of an additional scoring function based on mutual information criteria; (iii) the possibility of choosing the resulting network specificity based on statistical criteria and (iv) a new module for classification by BNs, including a cross-validation scheme and classifier quality measurement with receiver operating characteristic (ROC) scores.
AVAILABILITY AND IMPLEMENTATION: BNFinder2 is implemented in Python and freely available under the GNU General Public License at the project website https://launchpad.net/bnfinder, together with a user's manual, an introductory tutorial and supplementary methods.


Year:  2013        PMID: 23818512      PMCID: PMC3722519          DOI: 10.1093/bioinformatics/btt323

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


Bayesian Networks (BNs) are robust and versatile probabilistic models applicable to many different phenomena (Needham et al., 2007). In biology, the applications range from gene regulatory networks (Dojer et al., 2006) and protein interactions (Jansen et al., 2003) to gene expression prediction (Beer and Tavazoie, 2004), relationships between chromatin-associated proteins (van Steensel et al., 2009) and chromatin state prediction (Bonn et al., 2012). In many cases one needs to infer the structure of the network to build a BN model. While this problem is NP-hard in general (Chickering, 1995), Dojer (2006) showed that when the acyclicity of the network is ensured, the optimal network can be found in polynomial time. BNFinder (Wilczyński and Dojer, 2009) is a flexible tool for learning network topology from experimental data. Originally developed for inferring gene regulatory networks from expression data (Dojer et al., 2006), it has since been successfully applied to linking expression data with sequence motif information (Dabrowski et al., 2010), identifying histone modifications connected to enhancer activity (Bonn et al., 2012) and predicting gene expression profiles of tissue-specific genes (Wilczynski et al., 2012). The last study also illustrates the use of BNFinder not as a standalone tool but as a software library: thanks to the availability of the source code and a documented API, BNs could be embedded in a larger probabilistic model using Expectation-Maximization for parameter optimization.

BNFinder can also be used for classification tasks (Fig. 1). In this case the network topology is constrained to a bipartite graph between feature and class variables, and the structure represents the conditional dependencies of classes on selected features. This classifier model is equivalent to the diagnostic BNs introduced by Kontkanen et al. The classification process consists of several steps, carried out with dedicated BNFinder2 modules. First, to train the classifier, the optimal network structure and the conditional probability distributions (CPDs) are learned with the basic bnf tool. Second, the bnc module makes predictions on new examples using the learned network and CPDs.
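The bipartite classifier model described above is structurally similar to a Gaussian naive Bayes classifier: a class variable depends on a chosen set of continuous features, each modelled per class by a fitted Gaussian (cf. the fitted Gaussian models in Fig. 1). The following is a minimal self-contained sketch of that idea, not BNFinder2's actual code; all names are illustrative:

```python
import math

def fit_gaussian(values):
    """Fit a 1-D Gaussian (mean, std) to a list of feature values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, max(math.sqrt(var), 1e-9)  # floor std to avoid division by zero

def log_gaussian(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

class BipartiteGaussianClassifier:
    """Toy bipartite-BN classifier: one class variable depending on a
    selected set of continuous features, each modelled per class as a Gaussian."""

    def fit(self, X, y, parents):
        # parents: indices of the feature columns the class variable depends on
        self.parents = parents
        self.params = {}   # (class_label, feature_index) -> (mean, std)
        self.prior = {}
        for c in set(y):
            rows = [x for x, lab in zip(X, y) if lab == c]
            self.prior[c] = len(rows) / len(X)
            for j in parents:
                self.params[(c, j)] = fit_gaussian([r[j] for r in rows])
        return self

    def predict_proba(self, x):
        # Posterior over class labels given the selected parent features.
        logp = {}
        for c in self.prior:
            lp = math.log(self.prior[c])
            for j in self.parents:
                mean, std = self.params[(c, j)]
                lp += log_gaussian(x[j], mean, std)
            logp[c] = lp
        m = max(logp.values())                       # stabilize before exponentiating
        exp = {c: math.exp(v - m) for c, v in logp.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}
```

Training then amounts to fitting one Gaussian per (class label, parent feature) pair, and prediction to comparing class posteriors, mirroring the bnf (learning) / bnc (prediction) split.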
Fig. 1.

An example of a classification problem with three features (A, B, C) and two class variables (X, Y). The true dependency structure is depicted as a graph (top left). Class variables are not predictable from any single feature, but from different pairs of features: classification of X is possible from features A and B, while classification of Y requires features A and C (scatter plots, top right; green and blue dots represent examples positive with respect to the X and Y variables, respectively). Continuous feature variables have different noise/signal ratios (gray histograms, top right), but all of them are accurately described by the fitted Gaussian model (orange and red lines). An example ROC curve for the classification of variable X is shown (bottom left)

Additionally, the bnf-cv tool facilitates using BNFinder2 in a cross-validation framework by automatically dividing the input dataset into training and testing sets. The performance can be measured either with numerical measures such as specificity or sensitivity, or by generating receiver operating characteristic (ROC) or precision-recall plots [using the ROCR package (Sing et al., 2005) or a pure Python implementation; example plots are shown in Fig. 1 and Supplementary Fig. S1]. All these tools, together with other BNFinder2 features such as handling mixed (both continuous and discrete) datasets, make it a complete package for easily generating classifiers for a broad range of biological datasets, as illustrated by an application to histone modification measurements (Bonn et al., 2012).

Although BNFinder always finds optimal networks with respect to a given score, the reliability of the learned networks may vary depending on the input data. Therefore, BNFinder attaches several statistics to the returned network features: the relative posterior probability for each set of parents of each variable, as well as the weighted frequency of occurrence of each edge in (sub-)optimal regulator sets.
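The building blocks behind such a cross-validation workflow (fold splitting, ROC points, trapezoidal AUC) can be sketched in pure Python. This is an independent toy implementation for illustration, not BNFinder2's internals:

```python
import random

def kfold_indices(n, k, seed=0):
    """Split range(n) into k roughly equal, shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def roc_points(scores, labels):
    """ROC curve points (FPR, TPR) from classifier scores and 0/1 labels."""
    pairs = sorted(zip(scores, labels), reverse=True)  # sweep threshold downwards
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for _, lab in pairs:
        if lab:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts

def auc(pts):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

For each fold, one would train on the remaining folds, score the held-out examples, and pool the scores into `roc_points`/`auc` to obtain a cross-validated ROC curve like the one in Fig. 1.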
BNFinder2 is equipped with an additional quality-control mechanism that allows the user to predetermine the specificity of the optimal network. Namely, the expected proportion of pairs of unrelated variables wrongly connected by an edge may be specified. Based on this proportion and the distribution of the scoring function, the prior distributions of network structures are adjusted to yield networks with the user-specified error rate (Supplementary Methods and Tutorial).

After the publication of the original BNFinder method, it was shown that the polynomial algorithm introduced by Dojer (2006) can also be applied to the Mutual Information Test (MIT), another BN scoring function based on mutual information (Vinh et al., 2011). The authors showed that the MIT score gives more accurate results than the Minimal Description Length (MDL) score, while taking less time than the Bayesian Dirichlet equivalence (BDe) score. As this compromise between accuracy and speed is desirable, we decided to adapt BNFinder to include the MIT score. This allows users to find networks with an optimal MIT score not only for dynamic BNs, as presented by Vinh et al. (2011), but also for static BNs with constrained topology. Our current implementation lets users freely choose among all three scoring functions (MDL, BDe and MIT) for both static and dynamic BNs. Additionally, we provide a generalized MIT score for continuous variables (Supplementary Methods and Tutorial).

While BNFinder uses an efficient algorithm for BN structure learning, the original implementation was limited to running on a single CPU due to the limitations of the Python interpreter. Since then, multicore CPUs have become the majority and multiprocessing support was introduced into the Python language. BNFinder2 takes advantage of these developments to use multiple CPU cores for faster computation. As the learning method used in BNFinder performs parent-set optimizations independently for each variable, it can be parallelized efficiently.
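The two ideas above, a mutual-information-based score and per-variable parallel parent-set search, can be sketched for discrete data as follows. This toy version scores a candidate parent set by the empirical mutual information between the target and the joint parent configuration, and searches each variable's parents concurrently. It is only an illustration: the real MIT score additionally subtracts a chi-square complexity penalty (omitted here), and BNFinder2 uses process-based rather than thread-based parallelism:

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        pj = c / n
        mi += pj * math.log(pj / ((px[x] / n) * (py[y] / n)))
    return mi

def best_parents(target, candidates, data, max_parents=2):
    """Exhaustively score small candidate parent sets of `target` by the
    mutual information between the target and the joint parent configuration."""
    ys = data[target]
    best = ((), 0.0)
    for k in range(1, max_parents + 1):
        for parents in combinations(candidates, k):
            joint = list(zip(*(data[p] for p in parents)))
            score = mutual_information(joint, ys)
            if score > best[1]:
                best = (parents, score)
    return best

def learn_all(data, max_parents=2, workers=2):
    """Each variable's parent-set search is independent of the others,
    so the searches can run concurrently (threads here for portability)."""
    variables = list(data)
    with ThreadPoolExecutor(max_workers=workers) as ex:
        futures = {v: ex.submit(best_parents, v,
                                [u for u in variables if u != v],
                                data, max_parents)
                   for v in variables}
        return {v: f.result() for v, f in futures.items()}
```

Because no information is shared between the per-variable searches, distributing them over workers gives the near-linear scaling reported for BNFinder2.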
Supplementary Figure S2 shows that with BNFinder2 one can achieve speed-ups scaling almost linearly with the number of cores available on different hardware platforms.

In summary, BNFinder2 represents a significant improvement over the original method in several respects. From the user perspective, it allows BNFinder2 to be used in a classification setting with automated cross-validation, accuracy scoring and ROC plotting. Methodologically, it provides a more comprehensive method for inferring networks with a predefined error rate and introduces the possibility of computing optimal networks under the MIT score, adapted to handle continuous as well as discrete variables. Last but not least, BNFinder2 can use parallelization on multicore machines to greatly improve the running times of BN learning, especially in the case of the BDe score.

Funding: Polish Ministry of Science and Higher Education grant [N N301 065236 to B.W. and N.D.] and Foundation for Polish Science within the Homing Plus programme, co-financed by the European Union—European Regional Development Fund [to A.P. and P.B.].

Conflict of Interest: none declared.
References (11 in total)

1.  A Bayesian networks approach for predicting protein-protein interactions from genomic data.

Authors:  Ronald Jansen; Haiyuan Yu; Dov Greenbaum; Yuval Kluger; Nevan J Krogan; Sambath Chung; Andrew Emili; Michael Snyder; Jack F Greenblatt; Mark Gerstein
Journal:  Science       Date:  2003-10-17       Impact factor: 47.728

2.  Predicting gene expression from sequence.

Authors:  Michael A Beer; Saeed Tavazoie
Journal:  Cell       Date:  2004-04-16       Impact factor: 41.582

3.  ROCR: visualizing classifier performance in R.

Authors:  Tobias Sing; Oliver Sander; Niko Beerenwinkel; Thomas Lengauer
Journal:  Bioinformatics       Date:  2005-08-11       Impact factor: 6.937

4.  GlobalMIT: learning globally optimal dynamic bayesian network with the mutual information test criterion.

Authors:  Nguyen Xuan Vinh; Madhu Chetty; Ross Coppel; Pramod P Wangikar
Journal:  Bioinformatics       Date:  2011-08-03       Impact factor: 6.937

5.  Tissue-specific analysis of chromatin state identifies temporal signatures of enhancer activity during embryonic development.

Authors:  Stefan Bonn; Robert P Zinzen; Charles Girardot; E Hilary Gustafson; Alexis Perez-Gonzalez; Nicolas Delhomme; Yad Ghavi-Helm; Bartek Wilczyński; Andrew Riddell; Eileen E M Furlong
Journal:  Nat Genet       Date:  2012-01-08       Impact factor: 38.330

6.  Bayesian network analysis of targeting interactions in chromatin.

Authors:  Bas van Steensel; Ulrich Braunschweig; Guillaume J Filion; Menzies Chen; Joke G van Bemmel; Trey Ideker
Journal:  Genome Res       Date:  2009-12-09       Impact factor: 9.043

7.  Comparative analysis of cis-regulation following stroke and seizures in subspaces of conserved eigensystems.

Authors:  Michal Dabrowski; Norbert Dojer; Malgorzata Zawadzka; Jakub Mieczkowski; Bozena Kaminska
Journal:  BMC Syst Biol       Date:  2010-06-17

8.  Predicting spatial and temporal gene expression using an integrative model of transcription factor occupancy and chromatin state.

Authors:  Bartek Wilczynski; Ya-Hsin Liu; Zhen Xuan Yeo; Eileen E M Furlong
Journal:  PLoS Comput Biol       Date:  2012-12-06       Impact factor: 4.475

9.  Applying dynamic Bayesian networks to perturbed gene expression data.

Authors:  Norbert Dojer; Anna Gambin; Andrzej Mizera; Bartek Wilczyński; Jerzy Tiuryn
Journal:  BMC Bioinformatics       Date:  2006-05-08       Impact factor: 3.169

10.  A primer on learning in Bayesian networks for computational biology. (Review)

Authors:  Chris J Needham; James R Bradford; Andrew J Bulpitt; David R Westhead
Journal:  PLoS Comput Biol       Date:  2007-08       Impact factor: 4.475

