Minoo Aminian, David Couvin, Amina Shabbeer, Kane Hadley, Scott Vandenberg, Nalin Rastogi, Kristin P. Bennett.
Abstract
We develop a novel approach for incorporating expert rules into Bayesian networks for classification of Mycobacterium tuberculosis complex (MTBC) clades. The proposed knowledge-based Bayesian network (KBBN) treats sets of expert rules as prior distributions on the classes. Unlike prior knowledge-based support vector machine approaches which require rules expressed as polyhedral sets, KBBN directly incorporates the rules without any modification. KBBN uses data to refine rule-based classifiers when the rule set is incomplete or ambiguous. We develop a predictive KBBN model for 69 MTBC clades found in the SITVIT international collection. We validate the approach using two testbeds that model knowledge of the MTBC obtained from two different experts and large DNA fingerprint databases to predict MTBC genetic clades and sublineages. These models represent strains of MTBC using high-throughput biomarkers called spacer oligonucleotide types (spoligotypes), since these are routinely gathered from MTBC isolates of tuberculosis (TB) patients. Results show that incorporating rules into problems can drastically increase classification accuracy if data alone are insufficient. The SITVIT KBBN is publicly available for use on the World Wide Web.
Year: 2014 PMID: 24864238 PMCID: PMC4016944 DOI: 10.1155/2014/398484
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
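The abstract describes the KBBN as treating sets of expert rules as prior distributions on the classes, refined by data over the 43 spoligotype spacers. As a rough illustration only (not the paper's exact model), the sketch below combines a hypothetical rule-derived class prior with a naive Bayes likelihood over binary spacer features; all names, numbers, and parameters here are made up:

```python
import numpy as np

def rule_prior(fired_rules, n_classes, smoothing=1e-3):
    """Turn the classes whose rules fired into a (smoothed) prior distribution."""
    prior = np.full(n_classes, smoothing)
    for c in fired_rules:            # each fired rule votes for its class
        prior[c] += 1.0
    return prior / prior.sum()

def kbbn_posterior(x, theta, fired_rules):
    """Posterior over classes: rule-based prior times naive Bayes likelihood.

    theta[c, j] = P(spacer j = 1 | class c); x is a binary feature vector.
    """
    n_classes = theta.shape[0]
    log_lik = (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
    log_post = np.log(rule_prior(fired_rules, n_classes)) + log_lik
    log_post -= log_post.max()       # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Toy example with 3 classes and 4 spacers (not real spoligotype data).
theta = np.array([[0.9, 0.9, 0.1, 0.1],
                  [0.1, 0.9, 0.9, 0.1],
                  [0.1, 0.1, 0.9, 0.9]])
x = np.array([1, 1, 0, 0])
post = kbbn_posterior(x, theta, fired_rules=[0])
print(post.argmax())                 # class 0: rule vote and data agree
```

When the rule set is ambiguous (several rules fire) or incomplete (none fire), the smoothed prior keeps every class possible, so the spacer likelihood can still decide, which is the behavior the abstract attributes to KBBN.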
Figure 1. Example rules from SpolDB4. The rule column shows the characteristic patterns specified by the visual rules as underlined subsequences in the spoligotype patterns; each line corresponds to one rule. The underlined portions of the spoligotype must match exactly, while positions that are not underlined can take any value. All of these rules fire for the spoligotype 1101111111110111111100001111111100001111111, while three of the rules fire for 1111111111110011111100001111111100001111111.
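The matching semantics in Figure 1 (underlined positions must match exactly, all other positions are free) can be sketched as a simple wildcard check. The `'*'` encoding below is a hypothetical representation of a visual rule, not SpolDB4's actual format:

```python
# Hypothetical encoding of a SpolDB4-style visual rule: a 43-character
# pattern where '0'/'1' must match exactly and '*' matches any spacer.
def rule_fires(rule: str, spoligotype: str) -> bool:
    """Return True if every constrained position of the rule matches."""
    assert len(rule) == len(spoligotype) == 43
    return all(r == '*' or r == s for r, s in zip(rule, spoligotype))

# Illustrative rule constraining only spacers 21-24 (0-indexed 20-23) to '0000'.
rule = '*' * 20 + '0000' + '*' * 19
spol = '1101111111110111111100001111111100001111111'
print(rule_fires(rule, spol))   # True: positions 20-23 are all '0'
```

Because unconstrained positions match anything, several rules can fire for the same spoligotype, which is exactly the ambiguity the KBBN resolves with data.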
Figure 2. (a) The spoligotype conformal Bayesian network (CBN) uses a single rule, based on the number of repeats at the MIRU24 locus, as the first level of a hierarchical Bayesian network and the 43 spacers as features; the CBN predicts the major lineage with high accuracy. (b) The KBBN uses multiple rules, based on the presence of characteristic deletions, as the first level of a hierarchical Bayesian network. As with the CBN, it uses the 43 spoligotype spacers.
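The two-level hierarchy of Figure 2(a) (a hard MIRU24 rule at the first level, a spacer-based classifier at the second) might be sketched as follows; the repeat-count threshold, group names, and toy second-level classifiers are illustrative assumptions, not the paper's trained model:

```python
# Hypothetical sketch of a two-level hierarchy: a rule on the MIRU24
# repeat count picks a strain group, then a group-specific classifier
# over the 43 spacers is applied. Threshold and labels are assumptions.

def cbn_predict(miru24_repeats, spacers, group_classifiers):
    """First level: MIRU24 rule; second level: group-specific classifier."""
    group = "ancestral" if miru24_repeats >= 2 else "modern"
    return group, group_classifiers[group](spacers)

# Toy second-level classifiers keyed by group (illustrative only).
classifiers = {
    "ancestral": lambda s: "EAI" if s[0] == 1 else "AFRI",
    "modern":    lambda s: "Beijing" if s[0] == 0 else "T",
}
print(cbn_predict(1, [0] * 43, classifiers))   # ('modern', 'Beijing')
```

The design choice the figure describes is that the first level is a deterministic expert rule, so the learned part of the model only needs to separate classes within each group.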
SITVIT and CDC MTBC testbeds.
| Testbed | Dataset | Size | Number of classes | Max class size | Min class size | Number of rules |
|---|---|---|---|---|---|---|
| SITVIT | Train | 2714 | 69 | 390 | 1 | 69 |
| SITVIT | Test | 7949 | 69 | 1107 | 1 | 69 |
| SITVIT | CV | 2593 | 45 | 390 | 11 | 45 |
| CDC | CV | 1286 | 8 | 356 | 39 | 8 |
Training F-measure for KBBN trained on all 10633 SITVIT isolates.
| Clade | F-measure | Clade | F-measure | Clade | F-measure |
|---|---|---|---|---|---|
| AFRI | 0.800 | H | 0.736 | PINI | 0.750 |
| AFRI_1 | 0.944 | H1 | 0.924 | PINI1 | 1.000 |
| AFRI_2 | 0.908 | H2 | 0.875 | PINI2 | 0.667 |
| AFRI_3 | 0.966 | H3 | 0.915 | S | 0.976 |
| Beijing | 1.000 | H3-Ural-1 | 0.873 | T | 0.926 |
| BOV | 0.948 | H37Rv | 0.958 | T1-RUS2 | 0.778 |
| BOV_1 | 0.993 | H4-Ural-2 | 0.933 | T2 | 0.953 |
| BOV_2 | 1.000 | LAM | 0.947 | T2-Uganda | 0.991 |
| BOV_3 | 0.644 | LAM1 | 0.977 | T3 | 0.964 |
| BOV_4-Caprae | 0.891 | LAM11-ZWE | 0.954 | T3-ETH | 0.650 |
| Cameroon | 0.929 | LAM12-Madrid1 | 0.947 | T3-OSA | 0.626 |
| CANETTI | 1.000 | LAM2 | 0.991 | T4 | 0.988 |
| CAS | 0.937 | LAM3 | 0.988 | T4-CEU1 | 1.000 |
| CAS1-Delhi | 0.961 | LAM4 | 0.970 | T5 | 0.984 |
| CAS1-Kili | 0.973 | LAM5 | 0.978 | T5-Madrid2 | 1.000 |
| CAS2 | 0.921 | LAM6 | 0.856 | T5-RUS1 | 0.949 |
| EAI | 0.982 | LAM8 | 1.000 | T-Tuscany | 1.000 |
| EAI1-SOM | 0.986 | Manu_ancestor | 1.000 | Turkey | 0.928 |
| EAI2-Manila | 0.984 | Manu1 | 0.991 | X1 | 0.989 |
| EAI2-Nonthaburi | 1.000 | Manu2 | 1.000 | X2 | 0.963 |
| EAI3-IND | 0.963 | Manu3 | 1.000 | X3 | 0.995 |
| EAI4-VNM | 1.000 | Microti | 0.750 | ZERO | 0.800 |
| EAI6-BGD1 | 0.989 | | | | |
| EAI7-BGD2 | 1.000 | | | Average | |
| EAI8-MDG | 1.000 | | | | |
Out-of-sample test F-measures for KBBN. The model was trained on SITVIT-Train (2714 records) and tested on SITVIT-Test (7949 records). The overall average F-measure is 0.939.
| Clade | F-measure | Clade | F-measure | Clade | F-measure |
|---|---|---|---|---|---|
| AFRI | 0.889 | H | 0.942 | PINI | 0.667 |
| AFRI_1 | 0.975 | H1 | 0.977 | PINI1 | 0.923 |
| AFRI_2 | 0.926 | H2 | 0.625 | PINI2 | 0.522 |
| AFRI_3 | 1.000 | H3 | 0.944 | S | 0.956 |
| Beijing | 0.980 | H3-Ural-1 | 0.887 | T | 0.969 |
| BOV | 0.981 | H37Rv | 1.000 | T1-RUS2 | 0.956 |
| BOV_1 | 0.996 | H4-Ural-2 | 0.960 | T2 | 0.991 |
| BOV_2 | 1.000 | LAM | 0.949 | T2-Uganda | 1.000 |
| BOV_3 | 1.000 | LAM1 | 0.986 | T3 | 0.969 |
| BOV_4-Caprae | 0.914 | LAM11-ZWE | 0.976 | T3-ETH | 0.977 |
| Cameroon | 0.967 | LAM12-Madrid1 | 1.000 | T3-OSA | 0.978 |
| Canetti | 0.500 | LAM2 | 0.993 | T4 | 0.984 |
| CAS | 0.990 | LAM3 | 0.973 | T4-CEU1 | 1.000 |
| CAS1-Delhi | 0.990 | LAM4 | 0.967 | T5 | 1.000 |
| CAS1-Kili | 0.846 | LAM5 | 0.985 | T5-Madrid2 | 1.000 |
| CAS2 | 1.000 | LAM6 | 0.889 | T5-RUS1 | 0.883 |
| EAI | 0.989 | LAM8 | 0.970 | T-Tuscany | 0.889 |
| EAI1-SOM | 1.000 | Manu_ancestor | 1.000 | Turkey | 0.941 |
| EAI2-Manila | 1.000 | Manu1 | 0.995 | X1 | 0.963 |
| EAI2-Nonthaburi | 0.933 | Manu2 | 0.997 | X2 | 0.944 |
| EAI3-IND | 1.000 | Manu3 | 1.000 | X3 | 0.971 |
| EAI4-VNM | 1.000 | Microti | 0.667 | ZERO | 0.800 |
| EAI6-BGD1 | 1.000 | | | | |
| EAI7-BGD2 | 0.993 | | | Average | 0.939 |
| EAI8-MDG | 1.000 | | | | |
Average F-measure of KBBN, BN, Rules-only, and SVM (nonlinear and linear) on the two testbeds. While Rules-only alone performs poorly, KBBN achieves results that are significantly better than, or at least no worse than, BN and SVM on both domains. Results significantly different from KBBN at the 5% significance level are shown in bold.
| Dataset | KBBN | BN | Rules-only | SVM nonlinear | SVM linear |
|---|---|---|---|---|---|
| SITVIT-CV | 0.945 | | | | |
| CDC-Sublineage | 0.981 | | | 0.994 | 0.993 |
Figure 3. The effect of adding rules at different training-set sizes for the (a) SITVIT-CV and (b) CDC-Sublineage testbeds.
Figure 4. Effect of removing the rules for each class on the average F-measure for (a) SITVIT-CV and (b) CDC-Sublineage.
Posterior probability of each rule given class for CDC-Sublineage dataset. Blanks indicate 0.
| Rule | Haarlem | LAM | S | X | India | Manila | Vietnam |
|---|---|---|---|---|---|---|---|
| Haarlem | 0.707 | 0.015 | |||||
| LAM | 1.000 | ||||||
| S | 1.000 | 0.015 | 0.005 | 0.033 | |||
| X | 1.000 | 0.015 | 0.017 | ||||
| India | 0.970 | 0.022 | 0.017 | ||||
| Manila | 0.735 | ||||||
| Vietnam | 0.283 | ||||||
| No rule | 0.297 | 0.015 | 0.243 | 0.700 | |||
Figure 5. The heat map shows the posterior probability of each rule given the sublineage for the SITVIT dataset. A red square indicates a strong association between a rule and a sublineage, while a blue square indicates no relation. Here H includes the URAL-1 and URAL-2 sublineages, and LAM includes the Turkey and Cameroon sublineages.