Yingjie Guo, Chenxi Wu, Zhian Yuan, Yansu Wang, Zhen Liang, Yang Wang, Yi Zhang, Lei Xu.
Abstract
Among the many statistical methods for identifying gene-gene interactions in qualitative genome-wide association studies, gene-based methods are both statistically powerful and biologically interpretable. However, their detection power is limited by the assumptions they make about the association between traits and single nucleotide polymorphisms. In this article, we therefore propose GGInt-XGBoost, a gene-based method derived from XGBoost. Assuming that the log odds ratio of a disease trait is additive across a pair of genes when the genes do not interact, the difference in error between XGBoost models fitted with and without the additive constraint indicates a gene-gene interaction; a permutation-based statistical test then assesses this difference and provides a p-value for the significance of the interaction. Experimental results on both simulated and real data showed that our approach outperformed previous methods in detecting gene-gene interactions.
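The last step described in the abstract, turning an observed error gap into a p-value, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `permutation_p_value`, the add-one correction, and the toy numbers are assumptions, and the error gaps themselves (error of the additive-constrained model minus error of the unconstrained model, on the original and on permuted data) are taken as given inputs.

```python
from typing import Sequence


def permutation_p_value(observed_gap: float, null_gaps: Sequence[float]) -> float:
    """One-sided permutation p-value for an observed error gap.

    observed_gap: error(additive-constrained model) - error(unconstrained model)
                  computed on the original data.
    null_gaps:    the same statistic recomputed on permuted data sets.
    The add-one correction is an assumption of this sketch; it keeps the
    estimated p-value strictly above zero.
    """
    if not null_gaps:
        raise ValueError("need at least one permuted statistic")
    # Count permutations whose gap is at least as large as the observed one.
    at_least_as_extreme = sum(1 for gap in null_gaps if gap >= observed_gap)
    return (1 + at_least_as_extreme) / (1 + len(null_gaps))


# Toy numbers, purely illustrative (not results from the paper).
toy_null = [0.010, 0.004, 0.021, 0.007, 0.015, 0.002, 0.009, 0.012, 0.006]
p_value = permutation_p_value(0.020, toy_null)
print(p_value)
```

A large gap relative to the permuted gaps yields a small p-value, which is read as evidence that the additive constraint costs the model accuracy, that is, evidence of interaction.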
Keywords: XGBoost; additive model; gene-based testing; gene–gene interactions; genome-wide association studies
Year: 2021 PMID: 34977040 PMCID: PMC8716787 DOI: 10.3389/fcell.2021.801113
Source DB: PubMed Journal: Front Cell Dev Biol ISSN: 2296-634X
FIGURE 1. Illustration of trees with the additive constraint. With the constraint [[0,1],[2,3,4]], (A) a tree that violates the first constraint (0,1) and would therefore not be in the boosting tree system; (B) a tree that complies with both the first and second constraints.
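The idea behind such a constraint can be checked mechanically. The sketch below assumes two feature groups, [[0, 1], [2, 3, 4]], and treats a tree as a list of root-to-leaf paths, each path being the list of feature indices it splits on. It is a simplified check written for illustration, not XGBoost's internal logic (XGBoost itself accepts such groups through its `interaction_constraints` parameter); the helper names and the example trees are hypothetical, and features outside every group are not handled.

```python
from typing import Iterable, List, Sequence


def path_respects_constraints(path_features: Iterable[int],
                              constraints: Sequence[Sequence[int]]) -> bool:
    """True if every split feature on one root-to-leaf path lies in a single group."""
    used = set(path_features)
    return any(used <= set(group) for group in constraints)


def tree_respects_constraints(paths: List[List[int]],
                              constraints: Sequence[Sequence[int]]) -> bool:
    """True if all root-to-leaf paths of the tree respect the constraint groups."""
    return all(path_respects_constraints(path, constraints) for path in paths)


constraints = [[0, 1], [2, 3, 4]]  # assumed groups: one list of features per gene

# Hypothetical trees, each given as its root-to-leaf split-feature paths.
mixed_tree = [[0, 2], [0, 1]]     # first path combines feature 0 with feature 2
additive_tree = [[2, 3], [2, 4]]  # every path stays inside the second group

print(tree_respects_constraints(mixed_tree, constraints))     # False
print(tree_respects_constraints(additive_tree, constraints))  # True
```

Because no single tree may mix features from different groups, the boosted ensemble is a sum of per-group functions, which is what makes the constrained model additive across the two genes.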
FIGURE 2. Illustration of the GGInt-XGBoost workflow for gene-based gene–gene interaction detection.
FIGURE 3. Illustration of LD structures within genes GNPDA2 and FAIM2. The plots are generated by Haploview. The value in each square measures the LD strength of the corresponding pair of SNPs, where 0 indicates no LD and 1 indicates complete LD. GNPDA2 has a much stronger LD pattern than FAIM2, and the two genes are not correlated.
Type-I error for KCCU, GBIGM, AGGrEGATOr, and GGInt-XGBoost when varying the sample size.
| Method | Sample size | | | | |
|---|---|---|---|---|---|
| KCCU | 0.02 | 0.02 | 0.01 | 0.05 | 0.07 |
| GBIGM | 0.13 | 0.06 | 0.07 | 0.07 | 0.07 |
| AGGrEGATOr | 0.05 | 0.06 | 0.07 | 0.04 | 0.02 |
| GGInt-XGBoost | 0.03 | 0.06 | 0.07 | 0.04 | 0.06 |
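A type-I error rate of the kind tabulated above is commonly estimated as the fraction of null (no-interaction) simulation replicates in which a test rejects. The sketch below shows that calculation under stated assumptions: the function name `rejection_rate`, the significance level of 0.05 (the level used in the study is not given here), the strict `p < alpha` rule, and the toy p-values are all illustrative.

```python
from typing import Sequence


def rejection_rate(p_values: Sequence[float], alpha: float = 0.05) -> float:
    """Fraction of replicates whose p-value falls below alpha.

    On data simulated without interaction this estimates the type-I error;
    on data simulated with interaction the same quantity estimates power.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    rejections = sum(1 for p in p_values if p < alpha)
    return rejections / len(p_values)


# Toy p-values from four hypothetical null replicates (not data from the paper).
toy_null_p_values = [0.01, 0.20, 0.04, 0.50]
estimated_type_one_error = rejection_rate(toy_null_p_values)
print(estimated_type_one_error)
```

A well-calibrated test should give a rate close to the chosen `alpha`; values well above it indicate an inflated false-positive rate.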
FIGURE 4. Statistical power of simulation studies for KCCU (blue), GBIGM (yellow), AGGrEGATOr (green), and GGInt-XGBoost (red) under six disease models with
Calculated p-values for the 20 gene pairs using the four methods. p-values in bold font are significant. The "Ref" column indicates that the pair was found as a direct interaction in our literature search.
| Gene1 | Gene2 | Ref | GGInt-XGBoost | AGGrEGATOr | KCCU | GBIGM |
|---|---|---|---|---|---|---|
| HLA class II | TGF | | | 0.588 | | |
| HLA class II | LFA-1 | | | 0.591 | 0.195 | 0.373 |
| HLA class II | TEK | | | 0.213 | 0.521 | 0.226 |
| IL-8 | ANG-1 | | | 1.0 | 0.818 | 0.32 |
| MMP-3 | April | | | 0.164 | 0.161 | 0.063 |
| HLA-DQA1 | ANG-1 | | | 1.0 | 0.788 | 0.962 |
| CTLA4 | HLA class II | | | 0.663 | 0.292 | |
| MMP-3 | BLYs | | | 0.473 | | 0.5 |
| JUN | FOS | | | 0.441 | 0.692 | |
| GM-CSF | HLA class II | | | 0.391 | | |
| CD80 | April | | 0.549 | | 0.941 | 0.334 |
| CTSK | BLYS | | | | 0.356 | 0.056 |
| AP-1 | IL-6 | | 0.764 | | 0.098 | 0.287 |
| CD86 | CTSL | | 0.235 | | 0.519 | 0.252 |
| CXCL6 | FLT-1 | | 0.098 | | | 0.52 |
| CTLA4 | AP-1 | | 0.843 | | | 0.102 |
| FLT-1 | LFA-1 | | 0.117 | | 0.063 | |
| CCL3 | TRAP | | 0.098 | | 0.682 | |
| IL-18 | TGF | | 0.137 | | 0.149 | 0.22 |
| IL-1 | SDF-1 | | 0.647 | | 0.116 | 0.636 |
FIGURE 5. Plot showing pairs of SNPs that lie in two nodes of a regression tree connected by an edge. The color indicates the sumGain measure for each SNP pair. The red star marks the interacting SNP pair from IL-8 and Ang-I.
sumGain measure of the 10 most significant interacting SNP pairs from IL-8 and Ang-I. “Frequency” is the number of occurrences of the SNP pair in the trained model.
| Index | Parent_SNP | Child_SNP | sumGain | Frequency |
|---|---|---|---|---|
| 1 | G2_26 | G2_12 | 387.846816 | 87 |
| 2 | G2_44 | G2_46 | 237.672225 | 50 |
| 3 | G2_57 | G1_2 | 228.947974 | 75 |
| 4 | G2_51 | G2_33 | 218.003046 | 33 |
| 5 | G2_2 | G2_1 | 214.650794 | 68 |
| 6 | G2_51 | G2_36 | 213.414267 | 28 |
| 7 | G2_6 | G2_54 | 184.671309 | 81 |
| 8 | G2_51 | G2_49 | 178.749247 | 27 |
| 9 | G2_47 | G2_6 | 154.492033 | 31 |
| 10 | G2_6 | G2_9 | 140.040435 | 54 |
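A ranking of this shape can be produced by walking the parent-child edges of the trained trees and accumulating a gain value and a count per SNP pair. The sketch below is one plausible reading, not the authors' definition: it assumes each edge comes with a single gain value (which node's gain is used, or how gains are combined, is not specified here), and the `Edge` layout, the function name `rank_snp_pairs`, the SNP labels, and the gains are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical edge layout: (parent_snp, child_snp, gain attributed to the edge).
Edge = Tuple[str, str, float]


def rank_snp_pairs(edges: Iterable[Edge]) -> List[Tuple[str, str, float, int]]:
    """Aggregate edges into (parent, child, sumGain, frequency), highest sumGain first."""
    sum_gain: Dict[Tuple[str, str], float] = defaultdict(float)
    frequency: Dict[Tuple[str, str], int] = defaultdict(int)
    for parent, child, gain in edges:
        sum_gain[(parent, child)] += gain  # total gain over all occurrences
        frequency[(parent, child)] += 1    # number of occurrences in the model
    rows = [(p, c, sum_gain[(p, c)], frequency[(p, c)]) for (p, c) in sum_gain]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Toy edges with made-up labels and gains (not values from the paper).
toy_edges = [
    ("G2_3", "G2_7", 2.0),
    ("G1_1", "G2_3", 4.0),
    ("G2_3", "G2_7", 3.5),
]
for row in rank_snp_pairs(toy_edges):
    print(row)
```

Keeping the pair ordered as (parent, child) mirrors the Parent_SNP and Child_SNP columns; treating pairs as unordered would instead merge (a, b) with (b, a).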