Yu-Chuan Chang, June-Tai Wu, Ming-Yi Hong, Yi-An Tung, Ping-Han Hsieh, Sook Wah Yee, Kathleen M Giacomini, Yen-Jen Oyang, Chien-Yu Chen.
Abstract
BACKGROUND: Genome-wide association studies (GWAS) provide a powerful means to identify associations between genetic variants and phenotypes. However, GWAS techniques for detecting epistasis, the interactions between genetic variants associated with phenotypes, are still limited. We believe that developing an efficient and effective GWAS method to detect epistasis will be key to uncovering sophisticated pathogenesis, which is especially important for complex diseases such as Alzheimer's disease (AD).
Keywords: Epistasis; GWAS; Machine learning
Year: 2020 PMID: 32093643 PMCID: PMC7041299 DOI: 10.1186/s12859-020-3368-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The architecture of GenEpi
Fig. 2 Boxplots of the rank of the target epistasis for different algorithms. a Results for the three basic-model datasets with one epistasis consisting of a SNP pair. b Results for the complex-model dataset, which contained three epistatic interactions. ‘S1-S2’ denotes the epistasis between SNP 1 and SNP 2, and so on. The values on the boxplots are the medians of the rank of the target epistasis over the 100 simulation runs
The medians of the rank of the SNPs in the target epistasis for ReliefF
| | SNP 1 | SNP 2 | SNP 3 | SNP 4 | SNP 5 | SNP 6 |
|---|---|---|---|---|---|---|
| Basic Model | 1 | 2 | 1 | 2 | 1 | 2 |
| Complex Model | 7 | 8.5 | 9.5 | 11.5 | 11 | 19.5 |
Fig. 3 The boxplot of false positives in L1-regularized regression with and without stability selection
The score of different models in predicting control subjects or AD patients
| | Precision | Recall | F1 Score | Accuracy |
|---|---|---|---|---|
| Training | 0.9633 | 0.8537 | 0.9052 | 0.9396 |
| 2-fold CV | 0.7748 | 0.6992 | 0.7350 | 0.8297 |
| LOO CV | 0.8100 | 0.6585 | 0.7265 | 0.8324 |
F1 Score = 2 × (Precision × Recall) / (Precision + Recall); ‘Training’ stands for the process of a single-loop CV; ‘2-fold CV’ means that 2-fold CV was used in the external loop of double CV; ‘LOO CV’ means that LOO CV was used in the external loop of double CV
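The F1 definition in the footnote can be checked against the table with a few lines of Python. This is only a minimal sketch: the function name `f1_score` and the `rows` dictionary are illustrative choices, and the precision and recall values are copied from the table above. Because those inputs are already rounded to four decimals, the recomputed F1 may differ from the tabulated value in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "Training": (0.9633, 0.8537),
    "2-fold CV": (0.7748, 0.6992),
    "LOO CV": (0.8100, 0.6585),
}

for name, (precision, recall) in rows.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.4f}")
```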
The statistical significance of genetic features selected by GenEpi in predicting patients with AD
| Selected SNPs (RSID) | Weight | Odds Ratio | χ² test (p-value) | Genotype frequency | Gene |
|---|---|---|---|---|---|
| rs3130614_BB, rs41276317_AB | 3.16 | 19.23 | 1.42E-09 | 0.0742 | |
| rs12095538_BB, rs2774308_AB | 2.41 | 7.69 | 6.87E-07 | 0.0824 | |
| rs12926153_AB, rs12922908_AA | 1.18 | 4.83 | 6.89E-07 | 0.1511 | |
| rs9652600_AB, rs12922908_AA | 0.94 | 4.83 | 6.89E-07 | 0.1511 | |
| rs9344977_BB, rs56148686_AB | 1.94 | 4.32 | 1.14E-06 | 0.1813 | |
| rs429358_AA | −2.01 | 0.17 | 1.73E-06 | 0.5962 | |
| rs56233035_AB, rs3678_AB | 2.26 | 10.16 | 1.91E-06 | 0.0604 | |
| rs11675339_AA, rs2710687_AA | 2.32 | 3.94 | 3.55E-06 | 0.1923 | |
| rs12189429_BB, rs6881360_AA | 1.36 | 4.34 | 3.65E-06 | 0.467 | |
| rs12187423_BB, rs6881360_AA | 0.58 | 4.34 | 3.65E-06 | 0.467 | |
| rs10831829_BB, rs12366151_AA | 3.48 | 9.50 | 4.90E-06 | 0.0577 | |
| rs2052573_BB, rs34580133_AB | 1.80 | 4.08 | 5.00E-06 | 0.1648 | |
| rs2421701_AB, rs200512701_AB | 1.82 | 4.12 | 5.29E-06 | 0.1593 | |
| rs769449_AA | −1.19 | 0.16 | 8.42E-06 | 0.6648 | |
The sign ‘a’ between two gene symbols indicates cross-gene epistasis
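To make the columns of this table concrete, the sketch below shows how a genotype-pair feature such as `rs3130614_BB, rs41276317_AB` could be encoded as a binary indicator, and how an odds ratio and a χ² test could be derived from a 2×2 table of feature presence against disease status. The record does not state how GenEpi computes these statistics, so the standard textbook definitions are assumed here (Pearson χ² without continuity correction, one degree of freedom). All function names and all counts are hypothetical and are not taken from the study.

```python
import math


def pair_feature(genotypes: dict, snp_a: str, gt_a: str, snp_b: str, gt_b: str) -> int:
    """Return 1 if the subject carries both genotypes, else 0 (assumed encoding)."""
    return int(genotypes.get(snp_a) == gt_a and genotypes.get(snp_b) == gt_b)


def odds_ratio(case_with: int, case_without: int, ctrl_with: int, ctrl_without: int) -> float:
    """Odds of carrying the feature among cases divided by the odds among controls."""
    denominator = case_without * ctrl_with
    if denominator == 0:
        return float("inf")
    return (case_with * ctrl_without) / denominator


def chi2_test_2x2(a: int, b: int, c: int, d: int) -> tuple:
    """Pearson chi-square statistic and p-value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the survival function is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical subject: the genotype calls are invented for illustration.
subject = {"rs3130614": "BB", "rs41276317": "AB"}
print(pair_feature(subject, "rs3130614", "BB", "rs41276317", "AB"))

# Hypothetical counts: cases with/without the feature, controls with/without.
a, b, c, d = 20, 80, 5, 95
print(odds_ratio(a, b, c, d))
print(chi2_test_2x2(a, b, c, d))
```

An odds ratio above 1 means the feature is more common among cases, which matches the positive weights in the table; the two single-SNP rows with negative weights have odds ratios below 1.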
The comparison of different algorithms
| Algorithm | # Input SNP | Time Cost | Top 15 | Top 30 | Top 45 | Top 60 |
|---|---|---|---|---|---|---|
| GenEpi | 4,916,249 | 9.95 | 0.76 | 0.72 | 0.71 | 0.68 |
| BOOST | 4,916,249 | 2157.6 | 0.31 | 0.24 | 0.30 | 0.37 |
| ReliefF | 33,868 | 0.11 | 0.52 | 0.48 | 0.45 | 0.46 |
| FastEpistasis | 12,809,667 | 836.8 | 0.62 | 0.61 | 0.60 | 0.59 |
‘Time Cost’ is the time spent identifying epistasis, measured as single-CPU time in days. The values in the Top 15, Top 30, Top 45 and Top 60 columns are 2-fold CV scores, reported as F1 scores
Fig. 4 The ROC curves of different algorithms
Fig. 5 The heatmap of gene expression in different tissues for the 12 genes selected by GenEpi. The blue box highlights the sub-regions of the brain