René Breuer1, Manuel Mattheisen2,3,4,5, Josef Frank1, Bertram Krumm6, Jens Treutlein1, Layla Kassem7, Jana Strohmaier1, Stefan Herms2,3, Thomas W Mühleisen2,3, Franziska Degenhardt2,3, Sven Cichon2,3,8,9, Markus M Nöthen2,3, George Karypis10, John Kelsoe11, Tiffany Greenwood11,12, Caroline Nievergelt11, Paul Shilling11, Tatyana Shekhtman11, Howard Edenberg13, David Craig14, Szabolcs Szelinger14, John Nurnberger15, Elliot Gershon16, Ney Alliey-Rodriguez16, Peter Zandi17, Fernando Goes18, Nicholas Schork14,19, Erin Smith20,21, Daniel Koller22, Peng Zhang23, Judith Badner16, Wade Berrettini24, Cinnamon Bloss25, William Byerley26, William Coryell27, Tatiana Foroud22, Yirin Guo28, Maria Hipolito29, Brendan Keating30,31, William Lawson32, Chunyu Liu33, Pamela Mahon18, Melvin McInnis34, Sarah Murray20,35, Evaristus Nwulia32, James Potash36, John Rice37, William Scheftner38, Sebastian Zöllner23, Francis J McMahon7, Marcella Rietschel1, Thomas G Schulze39,40,41,42.
Abstract
BACKGROUND: Disentangling the etiology of common, complex diseases is a major challenge in genetic research. For bipolar disorder (BD), several genome-wide association studies (GWAS) have been performed. As with other complex disorders, major breakthroughs in explaining the high heritability of BD through GWAS have remained elusive. To overcome this dilemma, genetic research into BD has embraced a variety of strategies, such as the formation of large consortia to increase sample size, and sequencing approaches. Here we advocate a complementary approach that makes use of already existing GWAS data: a novel data mining procedure to identify as yet undetected genotype-phenotype relationships. We adapted association rule mining, a data mining technique traditionally used in retail market research, to identify frequent and characteristic genotype patterns showing strong associations with phenotype clusters. We applied this strategy to three independent GWAS datasets from 2835 phenotypically characterized patients with BD. In a discovery step, 20,882 candidate association rules were extracted.
Keywords: Bipolar disorder; Data mining; Genotype–phenotype patterns; Rule discovery; Subphenotypes
Year: 2018 PMID: 30415424 PMCID: PMC6230336 DOI: 10.1186/s40345-018-0132-x
Source DB: PubMed Journal: Int J Bipolar Disord ISSN: 2194-7511
Fig. 1 Outline of the overall approach. A main goal of market research is to identify rules that predict customer habits based on market baskets. In the cartoon, a male customer between 20 and 25 without children, living in the city, favours junk food and beer, and when he goes shopping he will most likely buy brands. Adapting this idea to genetic research, we try to identify, from the plethora of genetic factors in the “market basket”, those that are characterized by specific phenotypic features (like specific phobia or restlessness). The cartoon contains graphical depictions by Benjamin Albiach Galan and Konstantinos Kokkinis
Fig. 2 Illustration of the implemented version of the association rule mining algorithm. The lattice shown on the left is traversed starting from the root {} to all leaves. Each genotype pattern (node in the tree) represents a subgroup of patients shown in the genotype matrix G. Additionally, using the phenotype information of the patients from matrix P, we can count genotype and phenotype occurrences in contingency tables. This is illustrated here for the genotype pattern g1g2gn, with ‘a’ counting all patients in whom genotype g1g2gn and phenotype pi are both present, ‘d’ counting patients in whom neither is present, and ‘b’ and ‘c’ counting patients who present genotype g1g2gn but not phenotype pi, and vice versa. Traversal continues as long as there are unprocessed genotype patterns that have not been pruned
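The traversal and counting described in the Fig. 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy patient data, the function names `contingency` and `mine_patterns`, and the use of a minimum-support threshold as the pruning criterion are all assumptions, since the excerpt does not state how patterns are pruned.

```python
def contingency(pattern_rows, phenotype_rows, n_patients):
    """Return (a, b, c, d) as defined in the Fig. 2 caption."""
    a = len(pattern_rows & phenotype_rows)  # genotype pattern and phenotype present
    b = len(pattern_rows) - a               # genotype pattern only
    c = len(phenotype_rows) - a             # phenotype only
    d = n_patients - a - b - c              # neither present
    return a, b, c, d


def mine_patterns(G, phenotype, min_support):
    """Depth-first walk over the genotype-pattern lattice, starting at {}.

    G is a list with one set of genotype items per patient (hypothetical
    encoding); phenotype is a list of 0/1 flags for one phenotype p_i.
    Pruning by min_support is an assumption made for this sketch.
    """
    n = len(G)
    items = sorted({g for row in G for g in row})
    carriers = {g: {i for i, row in enumerate(G) if g in row} for g in items}
    phenotype_rows = {i for i, flag in enumerate(phenotype) if flag}
    results = {}

    def extend(pattern, rows, start):
        for k in range(start, len(items)):
            new_rows = rows & carriers[items[k]]
            if len(new_rows) < min_support:
                continue  # prune: no longer pattern on this branch can have more carriers
            new_pattern = pattern + (items[k],)
            results[new_pattern] = contingency(new_rows, phenotype_rows, n)
            extend(new_pattern, new_rows, k + 1)

    extend((), set(range(n)), 0)
    return results


# Toy data: six patients, two genotype items, one phenotype flag per patient.
G = [{"g1", "g2"}, {"g1", "g2"}, {"g1"}, {"g2"}, set(), {"g1", "g2"}]
phenotype = [1, 1, 0, 0, 0, 0]
tables = mine_patterns(G, phenotype, min_support=2)
for pattern, counts in tables.items():
    print(pattern, counts)
```

Because the carriers of a longer pattern are always a subset of the carriers of any shorter pattern it extends, a branch whose subgroup is already too small can be skipped without losing any pattern that meets the threshold.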
Top 10 association rules, ranked by p-value, in the replication dataset (TGEN + BoMa)
| PID | GP | Gp | gP | gp | p_chisq | Odds ratio (95% CI) | Bonferroni-adjusted p | FDR-adjusted p |
|---|---|---|---|---|---|---|---|---|
| 12978 | 25 | 105 | 107 | 1598 | 3.576e−08 | 3.566 [2.169–5.681] | 0.00075 | 0.00075 |
| 6221 | 26 | 84 | 162 | 1563 | 1.780e−06 | 2.995 [1.841–4.730] | 0.03717 | 0.01859 |
| 12681 | 33 | 103 | 187 | 1512 | 4.648e−06 | 2.596 [1.682–3.917] | 0.09706 | 0.02771 |
| 12981 | 25 | 129 | 107 | 1574 | 5.720e−06 | 2.860 [1.751–4.520] | 0.11944 | 0.02771 |
| 6225 | 25 | 84 | 163 | 1563 | 6.635e−06 | 2.862 [1.747–4.545] | 0.13855 | 0.02771 |
| 6228 | 26 | 93 | 162 | 1554 | 1.585e−05 | 2.690 [1.661–4.225] | 0.33102 | 0.05517 |
| 4428 | 31 | 88 | 212 | 1504 | 2.021e−05 | 2.505 [1.600–3.830] | 0.42198 | 0.06028 |
| 6111 | 21 | 109 | 111 | 1594 | 4.096e−05 | 2.779 [1.636–4.529] | 0.85530 | 0.10654 |
| 6183 | 20 | 66 | 168 | 1581 | 4.592e−05 | 2.864 [1.652–4.765] | 0.95887 | 0.10654 |
| 6178 | 20 | 68 | 168 | 1579 | 7.577e−05 | 2.777 [1.604–4.611] | 1 | 0.15823 |
Listed are the rule identifier (PID), the counts per group of the contingency tables, the p-values based on the chi-squared test along with the odds ratios including confidence intervals (CI), and results from two multiple-testing correction methods (based on 20,882 tests). The coding of the groups is as follows: G if the genotype pattern is present, g if not; P if the phenotype pattern is present, p if not. FDR: false discovery rate
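As a rough check on how the columns relate, the sketch below recomputes the statistics for the first row (PID 12978) and the adjusted p-values for the ten listed rules. Several things here are assumptions rather than facts from the source: that p_chisq is a Pearson chi-squared test with one degree of freedom and no continuity correction, that the FDR column follows the Benjamini-Hochberg procedure, and that running Benjamini-Hochberg on only the ten smallest p-values is adequate (the full list of 20,882 p-values is not part of this excerpt). The function names are illustrative.

```python
import math

M_TESTS = 20882  # number of tests stated in the table note


def chi2_p_value(a, b, c, d):
    """Pearson chi-squared p-value for a 2x2 table (1 df, no continuity correction)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper tail probability equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def bonferroni(p_values, m):
    """Multiply each p-value by the number of tests, capped at 1."""
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(sorted_p_values, m):
    """BH adjustment for p-values sorted ascending, with ranks 1..len(list) out of m tests."""
    adjusted = []
    running_min = 1.0
    for rank in range(len(sorted_p_values), 0, -1):
        running_min = min(running_min, sorted_p_values[rank - 1] * m / rank)
        adjusted.append(running_min)
    return adjusted[::-1]


# First table row: GP, Gp, gP, gp
a, b, c, d = 25, 105, 107, 1598
p_row1 = chi2_p_value(a, b, c, d)
or_row1 = (a * d) / (b * c)  # cross-product odds ratio

# p_chisq column, top to bottom
p_values = [3.576e-08, 1.780e-06, 4.648e-06, 5.720e-06, 6.635e-06,
            1.585e-05, 2.021e-05, 4.096e-05, 4.592e-05, 7.577e-05]
bonf = bonferroni(p_values, M_TESTS)
fdr = benjamini_hochberg(p_values, M_TESTS)

print(f"row 1: p = {p_row1:.3e}, cross-product OR = {or_row1:.3f}")
for p, pb, pf in zip(p_values, bonf, fdr):
    print(f"{p:.3e}  Bonferroni = {pb:.5f}  FDR = {pf:.5f}")
```

Under these assumptions the recomputed p-value and adjusted p-values agree with the tabulated ones to within rounding. The cross-product odds ratio comes out at about 3.56 rather than the tabulated 3.566, which suggests a different estimator was used for the odds ratio and its confidence interval; the excerpt does not say which.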