Xia Jiang, Richard E Neapolitan, M Michael Barmada, Shyam Visweswaran.
Abstract
BACKGROUND: Gene-gene epistatic interactions likely play an important role in the genetic basis of many common diseases. Recently, machine-learning and data mining methods have been developed for learning epistatic relationships from data. A well-known combinatorial method that has been successfully applied to detecting epistasis is Multifactor Dimensionality Reduction (MDR). Jiang et al. created a combinatorial epistasis learning method called BNMBL to learn Bayesian network (BN) epistatic models. They compared BNMBL to MDR using simulated data sets. Each of these data sets was generated from a model that associates two SNPs with a disease and includes 18 unrelated SNPs. For each data set, BNMBL and MDR were used to score all 2-SNP models, and BNMBL learned significantly more correct models. In real data sets, we ordinarily do not know the number of SNPs that influence phenotype. BNMBL may not perform as well if we also score models containing more than two SNPs. Furthermore, a number of other BN scoring criteria have been developed, and they may detect epistatic interactions even better than BNMBL. Although BNs are a promising tool for learning epistatic relationships from data, we cannot confidently use them in this domain until we determine which scoring criteria work best, or even well, when we try to learn the correct model without knowing the number of SNPs in that model.
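The exhaustive search described above (scoring every 2-SNP model over 20 SNPs, which gives 190 candidate models) can be sketched as follows. This is only an illustration under stated assumptions: it uses the K2 score, a standard BN scoring criterion, as a stand-in rather than BNMBL itself, and the `simulate` function with its toy penetrance values is hypothetical and not taken from the paper.

```python
import itertools
import math
import random
from collections import Counter


def log_k2_score(genotypes, disease, snps):
    """Log K2 score of a binary disease node whose parents are the SNP columns in `snps`."""
    # Count disease outcomes within each joint genotype configuration of the parent SNPs.
    counts = Counter()
    for row, d in zip(genotypes, disease):
        counts[(tuple(row[s] for s in snps), d)] += 1
    configs = {cfg for cfg, _ in counts}
    score = 0.0
    for cfg in configs:
        n0, n1 = counts[(cfg, 0)], counts[(cfg, 1)]
        # K2 term for a binary node (r = 2): (r-1)! / (N_j + r - 1)! * N_j0! * N_j1!
        score += math.lgamma(2) - math.lgamma(n0 + n1 + 2)
        score += math.lgamma(n0 + 1) + math.lgamma(n1 + 1)
    return score


def simulate(n_samples, n_snps=20, seed=0):
    """Hypothetical toy data: SNPs 0 and 1 jointly drive disease, the rest are unrelated."""
    rng = random.Random(seed)
    genotypes, disease = [], []
    for _ in range(n_samples):
        row = [rng.choice((0, 1, 2)) for _ in range(n_snps)]
        # Assumed toy penetrance: risk depends on the combination of SNP 0 and SNP 1.
        risky = (row[0] + row[1]) % 2 == 1
        p = 0.8 if risky else 0.2
        genotypes.append(row)
        disease.append(1 if rng.random() < p else 0)
    return genotypes, disease


def best_two_snp_model(genotypes, disease):
    """Score every 2-SNP model and return the highest-scoring pair of SNP indices."""
    n_snps = len(genotypes[0])
    pairs = itertools.combinations(range(n_snps), 2)
    return max(pairs, key=lambda pair: log_k2_score(genotypes, disease, pair))
```

Calling `best_two_snp_model(*simulate(1000))` scores all 190 pairs. Extending the search to models with more than two SNPs only requires iterating over larger combinations, which is where the choice of scoring criterion starts to matter, because larger models fit the data at least as well and must be penalized for their complexity.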
Year: 2011 PMID: 21453508 PMCID: PMC3080825 DOI: 10.1186/1471-2105-12-89
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. An example BN. A BN that models lung disorders. This BN is intentionally simple to illustrate concepts; it is not intended to be clinically complete.
Figure 2. An example DAG. A DAG showing probabilistic relationships among SNPs and a disease D.
Figure 3. An example DDAG. A DDAG showing probabilistic relationships between SNPs and a disease D. A DDAG differs from the DAG in Figure 2 in that the relationships among the SNPs are not represented.
Accuracies of scoring criteria
| Scoring Criterion | 200 | 400 | 800 | 1600 | Total | |
|---|---|---|---|---|---|---|
| 1 | 4379 | 5426 | 6105 | 6614 | 22524 | |
| 2 | 4438 | 5421 | 6070 | 6590 | 22519 | |
| 3 | 4227 | 5389 | 6095 | 6625 | 22336 | |
| 4 | 4419 | 5349 | 5996 | 6546 | 22313 | |
| 5 | 3989 | 5286 | 6060 | 6602 | 21934 | |
| 6 | 4220 | 5165 | 5874 | 6442 | 21701 | |
| 7 | 4049 | 5111 | 5881 | 6463 | 21504 | |
| 8 | 3749 | 5156 | 5991 | 6562 | 21448 | |
| 9 | 4112 | 4954 | 5555 | 5982 | 20603 | |
| 10 | 3839 | 4814 | 5629 | 6277 | 20559 | |
| 11 | 3571 | 4791 | 5648 | 6297 | 20307 | |
| 12 | 3285 | 4779 | 5755 | 6415 | 20234 | |
| 13 | 3768 | 4914 | 5754 | 5780 | 20216 | |
| 14 | 2344 | 5225 | 6065 | 6553 | 20187 | |
| 15 | 3489 | 4580 | 5521 | 6215 | 19805 | |
| 16 | 2810 | 4393 | 5464 | 6150 | 18817 | |
| 17 | 2310 | 4052 | 5158 | 5895 | 17415 | |
| 18 | 1850 | 3475 | 5095 | 6116 | 16536 | |
| 19 | 2245 | 3529 | 4684 | 5673 | 16131 | |
| 20 | 1651 | 3297 | 4492 | 5329 | 14769 | |
| 21 | 3364 | 3153 | 2812 | 2520 | 11847 | |
| 22 | 2497 | 1967 | 1462 | 1126 | 7052 | |
| 23 | 26 | 476 | 1300 | 2046 | 3848 | |
The number of times out of 7000 data sets that each scoring criterion identified the correct model for sample sizes of 200, 400, 800, and 1600. The last column gives the total accuracy over all sample sizes. The scoring criteria are listed in descending order of total accuracy.
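As a small worked example, the counts for the top-ranked criterion can be converted to proportions. The counts are copied from the first row of the table; the variable names are illustrative only.

```python
DATASETS_PER_SIZE = 7000

# Correct-model counts for the top-ranked scoring criterion, keyed by sample size.
top_counts = {200: 4379, 400: 5426, 800: 6105, 1600: 6614}

# Accuracy at each sample size is the count divided by the 7000 data sets scored.
accuracy_by_size = {n: c / DATASETS_PER_SIZE for n, c in top_counts.items()}

# Total accuracy pools all four sample sizes (28,000 data sets).
total_correct = sum(top_counts.values())
total_accuracy = total_correct / (DATASETS_PER_SIZE * len(top_counts))

for n, acc in accuracy_by_size.items():
    print(f"n = {n}: {acc:.3f}")
print(f"total: {total_accuracy:.3f}")
```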
Statistical comparison of accuracies of scoring criteria
| Scoring Criterion | P-value | |
|---|---|---|
| 1 | NA | |
| 2 | 0.996 | |
| 3 | 0.076 | |
| 4 | 0.046 | |
| 5 | 4.086 × 10⁻⁸ | |
| 6 | 3.468 × 10⁻¹⁴ | |
| 7 | 1.200 × 10⁻²⁰ | |
P-values obtained by comparing the accuracy of the highest ranking scoring criterion (score) with the next six highest ranking scoring criteria using the McNemar chi-square test. Each p-value is obtained by comparing the accuracies for 28,000 data sets.
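For readers unfamiliar with the test, a minimal sketch of a McNemar chi-square computation follows. It works from the two discordant counts: the data sets on which only one of the two criteria found the correct model. The source does not report these counts or say whether a continuity correction was applied, so the function name, the `continuity` flag, and any input values are assumptions for illustration.

```python
import math


def mcnemar_chi_square(b, c, continuity=False):
    """McNemar chi-square test from the two discordant counts.

    b: data sets where only criterion A identified the correct model
    c: data sets where only criterion B identified the correct model
    Returns (statistic, p_value) using a chi-square distribution with 1 degree of freedom.
    """
    if b + c == 0:
        # No discordant pairs: the two criteria never disagree.
        return 0.0, 1.0
    diff = abs(b - c) - (1 if continuity else 0)
    stat = max(diff, 0) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

Because only discordant pairs enter the statistic, two criteria with nearly identical totals can still differ significantly if they succeed on different data sets, and vice versa.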
Recall for scoring criteria
| Scoring Criterion | 200 | 400 | 800 | 1600 | Total | |
|---|---|---|---|---|---|---|
| 1 | 5259 | 6043 | 6566 | 6890 | 24758 | |
| 2 | 5204 | 5969 | 6511 | 6849 | 24533 | |
| 3 | 5186 | 5960 | 6481 | 6830 | 24457 | |
| 4 | 5223 | 5941 | 6473 | 6813 | 24450 | |
| 5 | 5303 | 5962 | 6371 | 6747 | 24383 | |
| 6 | 5203 | 5902 | 6425 | 6794 | 24324 | |
| 7 | 5181 | 5866 | 6395 | 6768 | 24210 | |
| 8 | 5147 | 5816 | 6352 | 6754 | 24069 | |
| 9 | 5080 | 5767 | 6300 | 6725 | 23872 | |
| 10 | 5031 | 5733 | 6265 | 6704 | 23733 | |
| 11 | 4870 | 5710 | 6324 | 6748 | 23652 | |
| 12 | 4973 | 5681 | 6230 | 6681 | 23565 | |
| 13 | 4902 | 5622 | 6183 | 6647 | 23354 | |
| 14 | 4984 | 5529 | 6105 | 6575 | 23193 | |
| 15 | 4786 | 5531 | 6119 | 6605 | 23041 | |
| 16 | 4649 | 5416 | 6026 | 6547 | 22638 | |
| 17 | 4383 | 5219 | 5901 | 6453 | 21956 | |
| 18 | 4151 | 5159 | 5903 | 6473 | 21686 | |
| 19 | 3881 | 4969 | 5780 | 6412 | 21042 | |
| 20 | 3895 | 4901 | 5715 | 6329 | 20840 | |
| 21 | 3953 | 4862 | 5652 | 6285 | 20752 | |
| 22 | 3618 | 4696 | 5595 | 6251 | 20160 | |
| 23 | 2500 | 3712 | 4811 | 5737 | 17760 | |
The sum of the recall for each scoring criterion over 7000 data sets for sample sizes of 200, 400, 800, and 1600. The last column gives the total recall over all sample sizes. The scoring criteria are listed in descending order of total recall.
Accuracies of scoring criteria on most difficult models
| Scoring Criterion | 200 | 400 | 800 | 1600 | Total | |
|---|---|---|---|---|---|---|
| 1 | 14 | 48 | 167 | 352 | 581 | |
| 2 | 1 | 21 | 146 | 355 | 563 | |
| 3 | 13 | 46 | 155 | 318 | 532 | |
| 4 | 12 | 43 | 106 | 289 | 450 | |
| 5 | 11 | 37 | 91 | 274 | 413 | |
| 6 | 3 | 25 | 79 | 245 | 352 | |
| 7 | 7 | 25 | 65 | 215 | 312 | |
| 8 | 16 | 33 | 80 | 138 | 267 | |
| 9 | 5 | 20 | 48 | 186 | 259 | |
| 10 | 4 | 16 | 47 | 179 | 246 | |
| 11 | 2 | 7 | 23 | 140 | 172 | |
| 12 | 3 | 6 | 13 | 86 | 108 | |
| 13 | 0 | 1 | 4 | 72 | 77 | |
| 14 | 0 | 1 | 2 | 41 | 44 | |
The number of times out of 500 that each scoring criterion learned the correct model for the most difficult models (55-59) at sample sizes of 200, 400, 800, and 1600. The last column gives the total accuracy over all sample sizes. The scoring criteria are listed in descending order of total accuracy.
Statistical comparison of accuracies of scoring criteria on most difficult models
| Scoring Criterion | P-value | |
|---|---|---|
| 1 | NA | |
| 2 | 0.610 | |
| 3 | 0.147 | |
| 4 | 4.870 × 10⁻⁵ | |
| 5 | 1.080 × 10⁻⁷ | |
| 6 | 7.254 × 10⁻¹⁴ | |
P-values obtained by comparing the accuracy of the highest ranking scoring criterion (score) with the next five highest ranking scoring criteria using the McNemar chi-square test. Each p-value is obtained by comparing the accuracies for 2,000 data sets generated by the hardest-to-detect models.
Evaluation of scoring criteria concerning detection of GAB2 SNPs
| Rank | α = 3 | α = 12 | α = 21 | α = 54 | α = 162 | α = 1000 | K2 | MML1 | MDLn | Suz1 | Epi2 | MDR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4 | 4 | 4 | 4 G | 4 G | 4 | 4 G | 4 G | 4 G | 3 | 4 G | 4 |
| 2 | 4 | 4 | 4 | 4 G | 4 G | 4 | 4 G | 4 G | 4 G | 3 G | 4 G | 4 |
| 3 | 4 | 4 | 4 G | 4 G | 4 | 4 | 4 G | 4 G | 4 G | 3 G | 4 G | 4 |
| 4 | 4 | 4 | 4 | 4 G | 4 G | 4 | 4 | 4 | 4 G | 3 | 4 G | 4 |
| 5 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 3 | 4 G | 3 | 3 | 4 |
| 6 | 4 | 4 | 4 | 4 | 4 G | 4 | 4 | 4 | 4 | 3 G | 4 G | 4 G |
| 7 | 4 | 4 | 4 | 4 G | 4 | 4 | 4 | 4 | 4 | 3 G | 4 | 4 |
| 8 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 G | 3 G | 4 | 4 G |
| 9 | 4 | 4 | 4 | 4 G | 4 | 4 | 4 | 4 | 4 G | 3 G | 4 G | 4 |
| 10 | 4 | 4 | 4 | 4 G | 4 G | 4 | 4 | 3 G | 4 | 2 | 4 G | 4 G |
| 11 | 4 | 4 | 4 G | 4 G | 4 G | 4 G | 4 G | 4 | 4 | 3 | 4 | 4 |
| 12 | 4 | 4 G | 4 G | 4 | 4 G | 4 G | 4 G | 4 | 4 | 3 G | 4 | 4 G |
| 13 | 4 | 4 | 4 | 4 G | 4 G | 4 G | 4 | 4 G | 4 G | 3 G | 4 | 4 |
| 14 | 4 | 4 | 4 | 4 | 4 G | 4 | 4 G | 4 G | 4 | 3 | 3 G | 4 |
| 15 | 4 | 4 | 4 G | 4 G | 4 G | 4 | 4 G | 3 G | 4 | 3 G | 4 G | 4 G |
| 16 | 4 | 4 | 4 G | 4 G | 4 G | 4 G | 4 G | 4 | 4 | 3 G | 3 G | 4 |
| 17 | 4 | 4 | 4 | 4 G | 4 G | 4 | 4 G | 3 | 4 G | 3 | 4 | 4 G |
| 18 | 4 | 4 | 4 G | 4 G | 4 G | 4 G | 4 | 4 G | 4 G | 3 G | 4 | 4 |
| 19 | 4 | 4 | 4 G | 4 G | 4 G | 4 G | 4 | 4 G | 4 G | 3 G | 4 | 4 |
| 20 | 4 | 4 | 4 | 4 G | 4 G | 4 G | 4 | 4 G | 4 | 3 G | 4 G | 4 G |
| 21 | 4 | 4 | 4 | 4 G | 4 | 4 G | 4 | 4 G | 4 G | 3 G | 4 G | 4 G |
| 22 | 4 | 4 G | 4 | 4 G | 4 G | 4 G | 4 | 4 | 4 G | 3 | 4 G | 4 |
| 23 | 4 | 4 G | 4 | 4 G | 4 | 4 G | 4 G | 4 | 4 G | 3 G | 4 | 4 |
| 24 | 4 | 4 | 4 G | 4 G | 4 G | 4 | 4 | 4 | 4 G | 3 G | 4 G | 4 |
| 25 | 4 | 4 | 4 | 4 | 4 G | 4 | 4 | 4 | 4 G | 3 G | 3 | 4 |
| Total # G | 0 | 3 | 7 | 19 | 18 | 10 | 10 | 11 | 16 | 17 | 14 | 8 |
| # Diff G | 0 | 2 | 3 | 7 | 6 | 4 | 4 | 4 | 8 | 8 | 8 | 6 |
Information about the 25 highest scoring models for a variety of scoring criteria. The number in each cell is the number of SNPs in the model, and the letter G follows that number if a GAB2 SNP appears in the model. The second-to-last row shows the total number of models in the top 25 that contained a GAB2 SNP. The last row shows the total number of different GAB2 SNPs appearing in the top 25 models.