Jing Li, James D Malley, Angeline S Andrew, Margaret R Karagas, Jason H Moore.
Abstract
BACKGROUND: Identifying gene-gene interactions is essential to understand disease susceptibility and to detect genetic architectures underlying complex diseases. Here, we aimed to develop a permutation-based methodology relying on a machine learning method, random forest (RF), to detect gene-gene interactions. Our approach, called permuted random forest (pRF), identifies the top interacting single nucleotide polymorphism (SNP) pairs by estimating how much the power of a random forest classification model is influenced by removing pairwise interactions.
Keywords: GWAS; Machine learning; Random forest; Scale invariant
Year: 2016 PMID: 27053949 PMCID: PMC4822295 DOI: 10.1186/s13040-016-0093-5
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Fig. 1 Overview of the permuted Random Forest (pRF). Panel a shows the original dataset with all the SNP information (0, 1 or 2) and class (case-control status). Each row represents a sample; three different colors in the SNP columns indicate different genotypes, and two colors in the class column indicate case-control status. Panel b shows the first permutation framework, which keeps the SNPs’ main effects: cases and controls are separated, and the two selected SNP columns are shuffled independently of each other within each class. Panel c shows the second permutation framework, which keeps both the SNPs’ interaction and main effects: cases and controls are separated, and the two selected SNPs are shuffled together, preserving their genotype combinations, within each class. RF is trained using the original dataset and tested using the datasets from the two permutation schemes. Error rates are calculated by averaging the classification errors across all samples. The process is repeated 10 times and the error rates are averaged over the 10 permutation results. The average classification error from the first permutation framework is named E1, and the average classification error from the second permutation framework is named E2. The whole process is repeated for all pairs of SNPs, and the differences in average error rates (Δ E = E1 - E2) are calculated and ranked to identify the top candidates
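The two permutation schemes in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`permute_within_class`, `error_rate`, `delta_e`) are hypothetical, and the trained random forest is stood in for by any `predict` callable that maps one sample (a list of genotypes) to a class label.

```python
import random

def permute_within_class(rows, labels, i, j, joint, rng):
    """Return a copy of rows with SNP columns i and j shuffled within each class.

    joint=False: the two columns are shuffled independently (scheme b),
                 which keeps each SNP's per-class genotype counts but breaks the pairing.
    joint=True:  the (i, j) genotype pairs are shuffled together (scheme c),
                 which keeps the genotype combinations within each class.
    """
    out = [list(r) for r in rows]
    for cls in set(labels):
        idx = [k for k, y in enumerate(labels) if y == cls]
        if joint:
            perm = idx[:]
            rng.shuffle(perm)
            pairs = [(rows[p][i], rows[p][j]) for p in perm]
        else:
            perm_i, perm_j = idx[:], idx[:]
            rng.shuffle(perm_i)
            rng.shuffle(perm_j)
            pairs = [(rows[a][i], rows[b][j]) for a, b in zip(perm_i, perm_j)]
        for k, (gi, gj) in zip(idx, pairs):
            out[k][i], out[k][j] = gi, gj
    return out

def error_rate(predict, rows, labels):
    """Fraction of samples that predict() misclassifies."""
    wrong = sum(1 for r, y in zip(rows, labels) if predict(r) != y)
    return wrong / len(labels)

def delta_e(predict, rows, labels, i, j, n_perm=10, seed=0):
    """Average E1 - E2 over n_perm permutations for the SNP pair (i, j)."""
    rng = random.Random(seed)
    e1 = sum(
        error_rate(predict, permute_within_class(rows, labels, i, j, False, rng), labels)
        for _ in range(n_perm)
    ) / n_perm
    e2 = sum(
        error_rate(predict, permute_within_class(rows, labels, i, j, True, rng), labels)
        for _ in range(n_perm)
    ) / n_perm
    return e1 - e2
```

In this sketch `predict` is assumed to come from a model fitted once on the unpermuted data, as the caption describes; only the test data are permuted.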
Success rates for identification of interaction pairs of SNPs from simulation studies
| Number of SNPs | Heritability | Highest EDM (sample size 2000) | Lowest EDM (sample size 2000) | Highest EDM (sample size 4000) | Lowest EDM (sample size 4000) |
|---|---|---|---|---|---|
| 5 | 0.001 | 52 % | 7 % | 70 % | 16 % |
| 5 | 0.005 | 99 % | 28 % | 100 % | 43 % |
| 5 | 0.01 | 100 % | 34 % | 100 % | 68 % |
| 5 | 0.05 | 100 % | 100 % | 100 % | 100 % |
| 5 | 0.1 | 100 % | 100 % | 100 % | 100 % |
| 5 | 0.2 | 100 % | 100 % | 100 % | 100 % |
| 5 | 0.3 | 100 % | 100 % | 100 % | 100 % |
| 5 | 0.4 | 100 % | 100 % | 100 % | 100 % |
| 10 | 0.001 | 8 % | 3 % | 29 % | 1 % |
| 10 | 0.005 | 80 % | 4 % | 99 % | 13 % |
| 10 | 0.01 | 100 % | 8 % | 100 % | 43 % |
| 10 | 0.05 | 100 % | 98 % | 100 % | 100 % |
| 10 | 0.1 | 100 % | 100 % | 100 % | 100 % |
| 10 | 0.2 | 100 % | 100 % | 100 % | 100 % |
| 10 | 0.3 | 100 % | 100 % | 100 % | 100 % |
| 10 | 0.4 | 100 % | 100 % | 100 % | 100 % |
| 15 | 0.001 | 6 % | 0 % | 8 % | 0 % |
| 15 | 0.005 | 51 % | 2 % | 77 % | 6 % |
| 15 | 0.01 | 93 % | 4 % | 100 % | 12 % |
| 15 | 0.05 | 100 % | 92 % | 100 % | 100 % |
| 15 | 0.1 | 100 % | 100 % | 100 % | 100 % |
| 15 | 0.2 | 100 % | 100 % | 100 % | 100 % |
| 15 | 0.3 | 100 % | 100 % | 100 % | 100 % |
| 15 | 0.4 | 100 % | 100 % | 100 % | 100 % |
| 20 | 0.001 | 0 % | 0 % | 2 % | 1 % |
| 20 | 0.005 | 10 % | 2 % | 39 % | 3 % |
| 20 | 0.01 | 49 % | 0 % | 93 % | 12 % |
| 20 | 0.05 | 100 % | 70 % | 100 % | 98 % |
| 20 | 0.1 | 100 % | 100 % | 100 % | 100 % |
| 20 | 0.2 | 100 % | 100 % | 100 % | 100 % |
| 20 | 0.3 | 100 % | 100 % | 100 % | 100 % |
| 20 | 0.4 | 100 % | 100 % | 100 % | 100 % |
| 25 | 0.001 | 1 % | 0 % | 1 % | 0 % |
| 25 | 0.005 | 2 % | 0 % | 10 % | 1 % |
| 25 | 0.01 | 15 % | 1 % | 44 % | 2 % |
| 25 | 0.05 | 99 % | 30 % | 100 % | 87 % |
| 25 | 0.1 | 100 % | 87 % | 100 % | 100 % |
| 25 | 0.2 | 100 % | 99 % | 100 % | 100 % |
| 25 | 0.3 | 100 % | 100 % | 100 % | 100 % |
| 25 | 0.4 | 100 % | 100 % | 100 % | 100 % |
Success rates for identification of interacting SNP pairs with different numbers of SNPs, heritabilities, and models with the highest or lowest EDMs, for sample sizes of 2000 or 4000 with balanced cases and controls. The success rates were calculated using 100 replicate datasets. Each percentage is the fraction of the 100 replicate datasets in which the interacting SNP pair, M0P0 and M1P1, was correctly detected, under each combination of number of SNPs (5, 10, 15, 20 or 25), heritability (0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3 or 0.4), extreme EDM (highest or lowest) and sample size (2000 or 4000)
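The success-rate calculation can be illustrated with a short Python sketch. It rests on an assumption the caption does not state: a replicate is counted as a correct detection when the true pair is ranked first. The function name `success_rate` and the noise SNP names `N0` and `N1` are hypothetical; `M0P0` and `M1P1` are the pair named in the caption.

```python
def success_rate(rankings, true_pair):
    """Fraction of replicates whose top-ranked SNP pair equals true_pair.

    rankings: one list per replicate dataset, each holding SNP pairs
              ordered from strongest to weakest candidate.
    Assumption: "detected" means the true pair is ranked first.
    """
    target = frozenset(true_pair)
    hits = sum(1 for ranked in rankings if ranked and frozenset(ranked[0]) == target)
    return hits / len(rankings)

# Toy input: four replicates, the true pair is ranked first in three of them.
toy_rankings = [
    [("M0P0", "M1P1"), ("M0P0", "N0")],
    [("M1P1", "M0P0"), ("N0", "N1")],
    [("N0", "N1"), ("M0P0", "M1P1")],
    [("M0P0", "M1P1"), ("M1P1", "N1")],
]
rate = success_rate(toy_rankings, ("M0P0", "M1P1"))
```

Pairs are compared as unordered sets, so `("M1P1", "M0P0")` counts as the same pair as `("M0P0", "M1P1")`.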
SNP interactions identified by permuted random forest (pRF)
| SNP pairs | E1 | E2 | Δ E | MDR ranking |
|---|---|---|---|---|
| XPD.751, XPD.312 | 41.00 % | 33.76 % | 7.23 % | 1 |
| XRCC3, XPD.751 | 41.87 % | 40.89 % | 0.99 % | 6 |
| APE1, XPD.312 | 40.99 % | 40.01 % | 0.97 % | 14 |
| XRCC3, XRCC1.399 | 34.76 % | 33.83 % | 0.93 % | 2 |
| XRCC3, APE1 | 35.27 % | 34.43 % | 0.84 % | 7 |
| XPD.312, XRCC1.194 | 40.14 % | 39.53 % | 0.61 % | 19 |
| XPD.751, XRCC1.194 | 40.84 % | 40.32 % | 0.53 % | 9 |
| XRCC3, XPD.312 | 42.22 % | 41.73 % | 0.49 % | 11 |
| XPD.751, XRCC1.399 | 41.74 % | 41.30 % | 0.44 % | 4 |
| XRCC3, XRCC1.194 | 35.64 % | 35.24 % | 0.39 % | 18 |
| XRCC1.399, XPD.312 | 40.35 % | 40.25 % | 0.10 % | 13 |
| APE1, XPD.751 | 41.14 % | 41.05 % | 0.09 % | 15 |
| XRCC1.399, XRCC1.194 | 33.63 % | 33.63 % | 0.00 % | 3 |
| APE1, XPC.PAT | 35.36 % | 35.55 % | −0.19 % | 17 |
| APE1, XRCC1.194 | 34.04 % | 34.22 % | −0.19 % | 21 |
| XRCC1.194, XPC.PAT | 33.38 % | 33.65 % | −0.27 % | 20 |
| XRCC3, XPC.PAT | 35.00 % | 35.37 % | −0.37 % | 10 |
| XRCC1.399, XPC.PAT | 33.01 % | 33.43 % | −0.42 % | 12 |
| XPD.312, XPC.PAT | 40.48 % | 40.95 % | −0.47 % | 5 |
| XPD.751, XPC.PAT | 40.25 % | 40.78 % | −0.53 % | 16 |
| APE1, XRCC1.399 | 33.86 % | 34.71 % | −0.84 % | 8 |
Classification error rates from datasets obtained by the two permutation schemes, E1 (with main effects only) and E2 (with main effects and interaction), are shown in the table. Error rate differences were calculated and listed as Δ E. SNP pairs were ranked by their error rate differences in percentage, which indicate the strength of the interactions. The SNP pairs column shows the names of the permuted SNPs. The MDR ranking column shows the rank that MDR assigned to the corresponding 2-way model, with rows listed in the order given by our approach
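As a worked example of the ranking step, the sketch below recomputes Δ E = E1 - E2 for four rows copied from the table and sorts them. The function name `rank_by_delta_e` is hypothetical. Because the table entries are already rounded to two decimals, a recomputed difference can differ from the listed Δ E in the last digit (for example 41.00 - 33.76 gives 7.24, while the table lists 7.23 %).

```python
# (E1, E2) in percent, copied from four rows of the table above.
errors = {
    ("XPD.751", "XPD.312"): (41.00, 33.76),
    ("XRCC3", "XRCC1.399"): (34.76, 33.83),
    ("XRCC1.399", "XRCC1.194"): (33.63, 33.63),
    ("APE1", "XRCC1.399"): (33.86, 34.71),
}

def rank_by_delta_e(errors):
    """Return (pair, delta_e) tuples sorted from largest to smallest E1 - E2."""
    deltas = {pair: e1 - e2 for pair, (e1, e2) in errors.items()}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

ranked = rank_by_delta_e(errors)
for pair, d in ranked:
    print(pair, f"{d:+.2f} %")
```

A positive Δ E means the classifier did worse once the pair's joint genotype structure was broken; a value near zero or negative gives no evidence of an interaction under this measure.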
Fig. 2 Statistical epistasis network (SEN) and permuted random forest networks (PRFN). a shows the largest connected component of the statistical epistasis network, which includes 39 SNPs. The largest connected component was divided into three clusters. Permuted Random Forest (pRF) was applied using the SNPs within each of the three clusters separately. b, c and d show the PRFNs built from each cluster. The width of each edge is proportional to the strength of the interaction, which is represented by the difference in error rates from our method. The cut-off for the SEN was an entropy value of 0.013. PRFNs were built using the same number of edges as in the corresponding SEN cluster
Fig. 3 Characterization of newly identified interacting SNP pairs using GIANT. Network filters were set to a minimum relationship confidence of 0.8 and a maximum of 5 genes. Interactions between the genes CCL5 and PARP4, MBD2 and GSTM3, and BCL6 and XPC were characterized using GIANT, and the results are shown in panels a, b and c: a shows the network of CCL5 and PARP4; b shows MBD2 and GSTM3; c shows BCL6 and XPC