Michael J McGeachie, Hsun-Hsien Chang, Scott T Weiss.
Abstract
Bayesian Networks (BN) have been a popular predictive modeling formalism in bioinformatics, but their application in modern genomics has been slowed by an inability to cleanly handle domains with mixed discrete and continuous variables. Existing free BN software packages either discretize continuous variables, which can lead to information loss, or do not include inference routines, which makes prediction with the BN impossible. We present CGBayesNets, a BN package focused on prediction of a clinical phenotype from mixed discrete and continuous variables, which fills these gaps. CGBayesNets implements Bayesian likelihood and inference algorithms for the conditional Gaussian Bayesian network (CGBN) formalism, one appropriate for predicting an outcome of interest from, e.g., multimodal genomic data. We provide four different network learning algorithms, each making a different tradeoff between computational cost and network likelihood. CGBayesNets provides a full suite of functions for model exploration and verification, including cross validation, bootstrapping, and AUC manipulation. We highlight several results obtained previously with CGBayesNets, including predictive models of wood properties from tree genomics, leukemia subtype classification from mixed genomic data, and robust prediction of intensive care unit mortality outcomes from metabolomic profiles. We also provide detailed example analysis on public metabolomic and gene expression datasets. CGBayesNets is implemented in MATLAB and available as MATLAB source code, under an Open Source license and anonymous download at http://www.cgbayesnets.com.
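The conditional Gaussian formalism mentioned in the abstract models each continuous node as a Gaussian whose mean is a linear function of its continuous parents, with a separate set of regression parameters for each configuration of its discrete parents. A minimal Python sketch of the per-node log-likelihood follows; the function name and the `params` layout are illustrative assumptions for exposition, not CGBayesNets' actual (MATLAB) interface:

```python
import numpy as np

def cg_node_loglik(x, discrete_config, cont_parents, params):
    """Log-density of a continuous node in a conditional Gaussian BN.

    For each configuration d of the node's discrete parents, the node is
    Gaussian with mean linear in its continuous parents u:
        x | d, u  ~  Normal(b0[d] + b[d] . u, sigma2[d])

    `params[d]` holds a (b0, b, sigma2) triple for discrete-parent
    configuration d. (Hypothetical data layout, for illustration only.)
    """
    b0, b, sigma2 = params[discrete_config]
    mu = b0 + np.dot(b, cont_parents)          # conditional mean
    return -0.5 * (np.log(2 * np.pi * sigma2)  # Gaussian log-density
                   + (x - mu) ** 2 / sigma2)
```

Summing such terms over all nodes and samples gives the joint log-likelihood that the package's network learning algorithms score candidate structures against.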
Year: 2014 PMID: 24922310 PMCID: PMC4055564 DOI: 10.1371/journal.pcbi.1003676
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Performance of various classifiers on a simple example dataset, discretization/data_gen2_train.csv and discretization/data_gen2_test.csv, trained and tested on separate data generated from a BN of 5 discrete and 15 continuous variables (n = 25), and also averages over 10 similar, randomly generated networks.
| Data | True network nodes | BNfinder 2.0 (K2) nodes | BNfinder 2.0 (K2) AUC | BNfinder 2.0 (reverse-K2) nodes | BNfinder 2.0 (reverse-K2) AUC | Weka 3.6.9 nodes | Weka 3.6.9 AUC | CGBayesNets nodes | CGBayesNets AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original | 14 | 5 | 50% | 3 | 98.7% | - | - | 8 | 99.3% |
| Discretized | 14 | 3 | 50% | 2 | 61.3% | 20 | 50.0% | 5 | 72.6% |
| Average Original | - | - | 68.7% | - | 82.4% | - | - | - | 99.2% |
| Average Discretized | - | - | 68.7% | - | 68.7% | - | 70.0% | - | 70.0% |
This illustrates the differences between CGBayesNets and two other software packages, BNfinder 2.0 and Weka 3.6.9. Training data are presented in two forms: the original form, and a discretized form in which continuous nodes are binned into 10 equal-width discrete bins. The "nodes" columns give the number of nodes in the Markov blanket of the phenotype node in the discovered network. "AUC" is the area under the receiver operating characteristic curve, evaluated by predicting the phenotype node given the values of the other 19 nodes in a separate test set. "BNfinder 2.0 (K2)" refers to BNfinder 2.0 supplied with node-parent constraints consistent with the topological ordering of nodes in the true network; "BNfinder 2.0 (reverse-K2)" refers to BNfinder 2.0 supplied with node-parent constraints consistent with the reverse of that ordering. No results are reported for Weka 3.6.9 on the original data, since Weka 3.6.9 does not handle Bayesian networks with continuous variables. Average AUCs over 10 similarly generated networks are shown for each method, before and after discretization.
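The two evaluation ingredients described above, equal-width discretization into 10 bins and AUC on a held-out test set, can be sketched in a few lines of Python. The function names and interfaces below are illustrative assumptions, not CGBayesNets' API; the AUC uses the standard rank-sum (Mann-Whitney U) identity:

```python
import numpy as np

def equal_width_bins(x, n_bins=10):
    """Bin a continuous column into n_bins equal-width intervals (0..n_bins-1),
    as done for the 'Discretized' rows of the table."""
    lo, hi = x.min(), x.max()
    edges = np.linspace(lo, hi, n_bins + 1)
    # digitize against the interior edges; clip keeps the maximum in the top bin
    return np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity, averaging
    ranks over tied scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    order = scores.argsort()
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):          # tie correction: average ranks
        tied = scores == s
        ranks[tied] = ranks[tied].mean()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Discretizing before learning is exactly the step that costs information here: an AUC of 99.3% on the original data drops to 72.6% once the continuous predictors have been forced into 10 bins.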