Jeya B Balasubramanian, Shyam Visweswaran, Gregory F Cooper, Vanathi Gopalakrishnan.
Abstract
Accurate disease classification and biomarker discovery remain challenging tasks in biomedicine. In this paper, we develop and test a practical approach to combining evidence from multiple models when making predictions using selective Bayesian model averaging of probabilistic rules. This method is implemented within a Bayesian Rule Learning system and compared to model selection when applied to twelve biomedical datasets using the area under the ROC curve measure of performance. Cross-validation results indicate that selective Bayesian model averaging statistically significantly outperforms model selection on average in these experiments, suggesting that combining predictions from multiple models may lead to more accurate quantification of classifier uncertainty. This approach would directly impact the generation of robust predictions on unseen test data, while also increasing knowledge for biomarker discovery and mechanisms that underlie disease.
Year: 2014 PMID: 25717394 PMCID: PMC4333697
Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc
Figure 1: An example of a BN structure learned from BRL. Panel (a) displays a constrained BN structure (S) with two predictive variables, ‘Gene1’ and ‘Gene2’, as parents of the target variable T. The two predictive variables are binary with values of ‘UP’ and ‘DOWN’. The target variable is binary having values ‘Case’ and ‘Control’. Panel (b) shows the parameters for the target node as a complete decision tree. The interior nodes of the tree are the predictive variables (represented by ellipses) and the leaf nodes (represented by rectangles) show the probability distribution over T. Panel (c) shows the rule set inferred from the decision tree by BRL. Each rule antecedent is a path from a leaf to the root node. The consequent is the probability distribution of T. The following parentheses show the number of ‘Case’ instances and the number of ‘Control’ instances that match the antecedent, respectively.
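The mapping from a complete decision tree to a rule set described in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the BRL implementation: the function name `rules_from_counts`, the pseudo-count `alpha`, and all counts are hypothetical, and the way BRL actually estimates the leaf probabilities is not stated in the caption.

```python
from itertools import product

def rules_from_counts(parents, values, counts, alpha=1.0):
    """Enumerate one rule per leaf of a complete decision tree.

    counts maps a tuple of parent values to (n_case, n_control).
    alpha is a hypothetical smoothing pseudo-count added to each class.
    """
    rules = []
    for combo in product(values, repeat=len(parents)):
        n_case, n_control = counts.get(combo, (0, 0))
        total = n_case + n_control + 2 * alpha
        p_case = (n_case + alpha) / total
        antecedent = " AND ".join(
            f"({var} = {val})" for var, val in zip(parents, combo)
        )
        rules.append((antecedent, p_case, n_case, n_control))
    return rules

# Made-up counts, for illustration only.
counts = {
    ("UP", "UP"): (8, 2),
    ("UP", "DOWN"): (5, 5),
    ("DOWN", "UP"): (1, 9),
    ("DOWN", "DOWN"): (0, 10),
}
rules = rules_from_counts(["Gene1", "Gene2"], ["UP", "DOWN"], counts)
for antecedent, p_case, n_case, n_control in rules:
    print(f"IF {antecedent} THEN P(T = Case) = {p_case:.2f} ({n_case}, {n_control})")
```

With two binary parents the tree has four leaves, so the sketch yields four rules, each followed by its (Case, Control) counts as in panel (c).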
Figure 2: Algorithm for model selection and model averaging in BRL.
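The difference between model selection and selective model averaging can be illustrated as follows. This is a generic sketch under stated assumptions, not the algorithm in the figure: each model is assumed to carry a log score proportional to its log posterior, `top_k` is a hypothetical cutoff for which models are retained, and the toy models and their scores are invented.

```python
import math

def select_model(models):
    """Model selection: keep only the single highest-scoring model."""
    return max(models, key=lambda m: m["log_score"])

def average_prediction(models, x, top_k=None):
    """Selective model averaging: combine the top_k models,
    each weighted by its normalized score."""
    ranked = sorted(models, key=lambda m: m["log_score"], reverse=True)
    chosen = ranked[:top_k] if top_k else ranked
    # Subtract the best log score before exponentiating for numerical stability.
    best = chosen[0]["log_score"]
    weights = [math.exp(m["log_score"] - best) for m in chosen]
    total = sum(weights)
    return sum(w * m["predict"](x) for w, m in zip(weights, chosen)) / total

# Hypothetical models; each predict() returns P(T = Case | x).
models = [
    {"log_score": -10.0, "predict": lambda x: 0.9},
    {"log_score": -10.0 + math.log(0.5), "predict": lambda x: 0.6},
    {"log_score": -30.0, "predict": lambda x: 0.1},
]

p_selected = select_model(models)["predict"](None)
p_averaged = average_prediction(models, None, top_k=2)
print(p_selected, round(p_averaged, 3))
```

In this toy case the selected model alone predicts 0.9, while averaging the two best models with weights 1 and 0.5 gives 0.8, a less extreme probability that reflects disagreement between the models.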
Table 1: The 12 biomedical datasets used for analysis. The first eleven are genomic and the twelfth one is proteomic. The data are identified with the ‘Dataset’ ID. The column ‘P/D’ describes the type of data as Prognostic (P) or Diagnostic (D). The ‘#V’ column is the number of predictor variables originally in the dataset. The ‘#V_PAIFE’ column shows the number of variables selected by PAIFE. The ‘Sample class distribution’ shows the number of samples in each class in the dataset. The ‘Reference’ points to the relevant literature for the dataset.
| Dataset | P/D | #V | #V_PAIFE | Sample class distribution | Reference |
|---|---|---|---|---|---|
| 1 | D | 6584 | 1972 | 40:21 | ( |
| 2 | D | 12582 | 2371 | 28:24:20 | ( |
| 3 | P | 5372 | 858 | 69:17 | ( |
| 4 | D | 7129 | 2288 | 47:25 | ( |
| 5 | D | 7464 | 1880 | 18:18 | ( |
| 6 | P | 7129 | 699 | 40:20 | ( |
| 7 | D | 2308 | 832 | 29:25:17:12 | ( |
| 8 | D | 7129 | 1927 | 58:19 | ( |
| 9 | D | 10510 | 6713 | 52:50 | ( |
| 10 | P | 24481 | 4251 | 44:34 | ( |
| 11 | D | 7039 | 1230 | 35:4 | ( |
| 12 | D | 70 | 15 | 139:66 | ( |
Table 2: Average AUCs obtained from BRL and SMA-BRL using 10 runs of 10-fold cross-validation for the 12 datasets described in Table 1. For each dataset, the result of the better performing algorithm is shown in bold. The last row shows the average over the 12 datasets and the standard error of the mean (SEM).
| Dataset | BRL | SMA-BRL |
|---|---|---|
| 1 | ||
| 2 | 95.12 | |
| 3 | 60.14 | |
| 4 | 91.88 | |
| 5 | 94.13 | |
| 6 | 57.19 | |
| 7 | 84.67 | |
| 8 | 81.58 | |
| 9 | 90.87 | |
| 10 | 86.12 | |
| 11 | 95.42 | |
| 12 | 80.96 | |
| Average ± SEM | 84.80 ± 3.90 | 86.20 ± 4.05 |
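The "Average ± SEM" row summarizes the per-dataset AUCs. A minimal sketch of that calculation is given below, assuming the SEM is the sample standard deviation divided by the square root of the number of datasets; the source does not state which standard deviation estimator was used, and the AUC values here are made up rather than taken from the table.

```python
import math
import statistics

def mean_and_sem(values):
    """Return the mean and the standard error of the mean.

    Assumes SEM = sample standard deviation / sqrt(n).
    """
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

# Made-up per-dataset AUCs (in percent), for illustration only.
example_aucs = [90.0, 80.0, 85.0, 95.0]
mean_auc, sem_auc = mean_and_sem(example_aucs)
print(f"{mean_auc:.2f} ± {sem_auc:.2f}")
```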
Figure 3: The ROC curves for BRL and SMA-BRL on one training fold of dataset 12 (Table 1).
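The area under an ROC curve such as the ones in this figure equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes the AUC directly from that definition; the function name `auc`, the scores, and the labels are hypothetical, and the source does not say how its AUCs were computed.

```python
def auc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores.

    labels use 1 for the positive class (e.g. Case) and 0 for the negative class.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative instance")
    # Count a win for each correctly ordered pair and half a win for each tie.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up predicted probabilities of 'Case' and true labels.
example_scores = [0.9, 0.8, 0.4, 0.3]
example_labels = [1, 0, 1, 0]
example_auc = auc(example_scores, example_labels)
print(example_auc)
```

The pairwise form is quadratic in the number of instances, which is acceptable for datasets of the sizes listed in Table 1; a rank-based formulation would be preferable for much larger samples.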