Silke Szymczak, Emily Holzinger, Abhijit Dasgupta, James D Malley, Anne M Molloy, James L Mills, Lawrence C Brody, Dwight Stambolian, Joan E Bailey-Wilson.
Abstract
BACKGROUND: Machine learning methods and in particular random forests (RFs) are a promising alternative to standard single SNP analyses in genome-wide association studies (GWAS). RFs provide variable importance measures (VIMs) to rank SNPs according to their predictive power. However, in contrast to the established genome-wide significance threshold, no clear criteria exist to determine how many SNPs should be selected for downstream analyses.
Keywords: Genetic; Genome-wide association study; Machine learning; Random forest; SNP; Variable importance; Variable selection
Year: 2016 PMID: 26839594 PMCID: PMC4736152 DOI: 10.1186/s13040-016-0087-3
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Information about the nine causal SNPs under the alternative hypothesis in simulation study 1
| SNP | MAF | RR | no SNPs strong LD | no SNPs moderate LD |
|---|---|---|---|---|
| 11-103959987 | 0.474 | 1.3 | 0 | 1 |
| 22-28469630 | 0.488 | 1.3 | 4 | 8 |
| 17-9807099 | 0.496 | 1.3 | 0 | 0 |
| 1-240799543 | 0.312 | 1.5 | 0 | 0 |
| 7-45984820 | 0.312 | 1.5 | 0 | 2 |
| 5-130104076 | 0.323 | 1.5 | 12 | 18 |
| 14-67463012 | 0.062 | 2 | 2 | 31 |
| 18-34645639 | 0.062 | 2 | 0 | 1 |
| 3-2770509 | 0.064 | 2 | 0 | 1 |
Table shows SNP identifier in chromosome and position notation, minor allele frequency (MAF), relative risks (RR) and number of SNPs within a 1 Mb region that are in strong (r > 0.8) or moderate LD (0.3 < r ≤ 0.8)
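The LD cut-offs in the table note can be written as a small classifier. This is a minimal sketch: the function names and the `"none"` label for values at or below 0.3 are illustrative assumptions, and the input is taken to be whatever LD statistic the note calls r.

```python
def ld_category(r: float) -> str:
    """Classify an LD value with the cut-offs from the table note.

    strong:   r > 0.8
    moderate: 0.3 < r <= 0.8
    "none" is a hypothetical label for everything else.
    """
    if r > 0.8:
        return "strong"
    if r > 0.3:
        return "moderate"
    return "none"


def count_ld_partners(r_values):
    """Count strong and moderate LD partners of one causal SNP.

    r_values: LD values between the causal SNP and the other SNPs
    in its 1 Mb region (assumed to be precomputed).
    """
    categories = [ld_category(r) for r in r_values]
    return categories.count("strong"), categories.count("moderate")


# Toy values, not taken from the study.
strong, moderate = count_ld_partners([0.9, 0.5, 0.1, 0.85])
```

Note that the boundary value 0.8 falls into the moderate class, because the strong class uses a strict inequality.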
Fig. 1. Heatmaps of type I error of single SNPs in simulation study 1. Shown are results for logistic regression and r2VIM with several factors in the different scenarios (different sample sizes and mtry parameters in RF). Columns correspond to SNPs that are selected in at least one approach and are ordered by chromosomal position. Type I error is color-coded in gray with white and black denoting 0 and 1, respectively. In addition, LD information is shown at the top with SNPs in high (r > 0.8) and moderate LD (0.3 < r ≤ 0.8) colored in red and yellow
Fig. 2. Heatmaps of empirical power of single SNPs in simulation study 1. Shown are results for logistic regression and r2VIM with several factors in the different scenarios (different sample sizes and mtry parameters in RF). Only the nine causal SNPs (marked in red on top) and false-positive SNPs that are uncorrelated to each causal SNP are shown in columns and ordered by chromosomal position. Empirical power is color-coded in gray with white and black denoting 0 and 1, respectively
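The per-SNP values in these heatmaps lie between 0 and 1. A minimal sketch of how such a value could be computed, assuming it is the fraction of simulation replicates in which a SNP is selected (the function name and the toy matrix are illustrative, not from the study):

```python
import numpy as np


def selection_frequency(selected) -> np.ndarray:
    """Per-SNP fraction of replicates in which the SNP was selected.

    selected: boolean array of shape (replicates, SNPs).
    Under the null hypothesis this fraction estimates type I error,
    under the alternative it estimates empirical power (assumption).
    """
    selected = np.asarray(selected, dtype=bool)
    return selected.mean(axis=0)


# Toy example: 4 replicates, 3 SNPs.
sel = np.array([
    [True, False, True],
    [True, False, False],
    [True, True, False],
    [True, False, False],
])
freq = selection_frequency(sel)

# Number of SNPs selected in at least one replicate.
n_nonzero = int((freq > 0).sum())
```

Counting the entries with `freq > 0` corresponds to the kind of summary reported as the number of SNPs with empirical power > 0.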
Number of SNPs in simulation study 1 with empirical power > 0
| Method | n | mtry | Factor | Total | Causal | High LD | Mod LD | Low LD | FP |
|---|---|---|---|---|---|---|---|---|---|
| LR | 2000 | | | 40 | 8 | 15 | 11 | 4 | 0 |
| LR | 6000 | | | 98 | 9 | 16 | 15 | 15 | 0 |
| r2VIM | 2000 | 100000 | 1 | 38 | 7 | 13 | 10 | 2 | 4 |
| r2VIM | 2000 | 100000 | 3 | 28 | 7 | 12 | 6 | 2 | 0 |
| r2VIM | 2000 | 100000 | 5 | 24 | 7 | 10 | 5 | 2 | 0 |
| r2VIM | 2000 | 250000 | 1 | 40 | 8 | 12 | 9 | 2 | 8 |
| r2VIM | 2000 | 250000 | 3 | 25 | 7 | 9 | 5 | 2 | 1 |
| r2VIM | 2000 | 250000 | 5 | 23 | 7 | 9 | 5 | 2 | 0 |
| r2VIM | 6000 | 100000 | 1 | 51 | 9 | 16 | 10 | 5 | 3 |
| r2VIM | 6000 | 100000 | 3 | 41 | 9 | 16 | 6 | 4 | 0 |
| r2VIM | 6000 | 100000 | 5 | 37 | 9 | 16 | 6 | 4 | 0 |
| r2VIM | 6000 | 250000 | 1 | 63 | 9 | 16 | 12 | 6 | 13 |
| r2VIM | 6000 | 250000 | 3 | 42 | 9 | 16 | 7 | 4 | 1 |
| r2VIM | 6000 | 250000 | 5 | 37 | 9 | 16 | 5 | 4 | 0 |
Shown are results for logistic regression (LR) and r2VIM. Columns denote method, sample size (n), mtry parameter and factor for r2VIM, total number of SNPs, number of SNPs in strong (r > 0.8), moderate LD (0.5 < r ≤ 0.8) and low LD (0.3 < r ≤ 0.5) with any causal SNP as well as number of false-positive SNPs (FP)
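The role of the factor can be illustrated with a short sketch. It assumes that the relative importance in one RF run is each importance score divided by the absolute value of the most negative score in that run, and that a SNP is kept when its minimal relative importance across runs is at least the factor. The record itself only mentions a "factor" and a "minimal relative variable importance", so this definition, the function names, and the toy scores are all assumptions.

```python
import numpy as np


def relative_importance(vim) -> np.ndarray:
    """Scale one run's importance scores by |most negative score| (assumed definition)."""
    vim = np.asarray(vim, dtype=float)
    min_vim = vim.min()
    if min_vim >= 0:
        # Design choice of this sketch: without a negative score there is no scale.
        raise ValueError("no negative importance score to scale by")
    return vim / abs(min_vim)


def select_by_factor(vim_runs, factor):
    """Keep SNPs whose minimal relative importance over all runs is >= factor."""
    rel = np.vstack([relative_importance(run) for run in vim_runs])
    min_rel = rel.min(axis=0)
    return min_rel >= factor, min_rel


# Toy importance scores for 4 SNPs in 2 RF runs (not from the study).
vim_runs = [
    [10.0, 2.0, -1.0, 0.5],
    [6.0, 4.0, -2.0, -1.0],
]
selected_1, min_rel = select_by_factor(vim_runs, factor=1)
selected_3, _ = select_by_factor(vim_runs, factor=3)
selected_5, _ = select_by_factor(vim_runs, factor=5)
```

In this toy example, raising the factor from 1 to 3 to 5 shrinks the selected set from two SNPs to one to none, which mirrors the direction of the Total and FP columns in the table.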
Number of SNPs and clumps in simulation study 2 with empirical power > 0
| OR | n | Method | Factor | Total | Causal | Causal clumps: no | Causal clumps: no SNPs | FP clumps: no | FP clumps: no SNPs |
|---|---|---|---|---|---|---|---|---|---|
| 1.3 | 6000 | LR | | 765 | 9 | 9 | 741 | 2 | 24 |
| 1.3 | 6000 | r2VIM | 1 | 194 | 8 | 9 | 178 | 13 | 16 |
| 1.3 | 6000 | r2VIM | 3 | 110 | 8 | 9 | 106 | 3 | 4 |
| 1.3 | 6000 | r2VIM | 5 | 78 | 8 | 9 | 77 | 1 | 1 |
| 1.1 | 20000 | LR | | 106 | 5 | 5 | 105 | 1 | 1 |
| 1.1 | 20000 | r2VIM | 1 | 104 | 5 | 8 | 62 | 32 | 42 |
| 1.1 | 20000 | r2VIM | 3 | 43 | 4 | 7 | 30 | 8 | 13 |
| 1.1 | 20000 | r2VIM | 5 | 26 | 4 | 7 | 21 | 3 | 5 |
Shown are results for logistic regression (LR) and r2VIM. Columns denote odds ratio (OR), sample size (n), method, factor for r2VIM, total number of SNPs, number of causal SNPs, number of clumps based on causal SNPs, number of SNPs in clumps based on causal SNPs, number of clumps based on false-positive (FP) SNPs and number of SNPs in clumps based on false-positive (FP) SNPs
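A quick way to read the table is to check that, in every row, the total number of SNPs equals the SNPs in causal clumps plus the SNPs in FP clumps. The sketch below transcribes the table values and also derives the share of selected SNPs that sit in FP clumps; the variable names and the derived share are illustrative additions, not quantities reported in the record.

```python
# (OR, n, method, factor, total SNPs, SNPs in causal clumps, SNPs in FP clumps)
rows = [
    (1.3, 6000, "LR", None, 765, 741, 24),
    (1.3, 6000, "r2VIM", 1, 194, 178, 16),
    (1.3, 6000, "r2VIM", 3, 110, 106, 4),
    (1.3, 6000, "r2VIM", 5, 78, 77, 1),
    (1.1, 20000, "LR", None, 106, 105, 1),
    (1.1, 20000, "r2VIM", 1, 104, 62, 42),
    (1.1, 20000, "r2VIM", 3, 43, 30, 13),
    (1.1, 20000, "r2VIM", 5, 26, 21, 5),
]

# Every total should split into causal-clump and FP-clump SNPs.
consistent = all(total == causal + fp for *_, total, causal, fp in rows)

# Share of selected SNPs that fall into FP clumps, keyed by scenario.
fp_share = {
    (odds_ratio, method, factor): fp / total
    for odds_ratio, _, method, factor, total, _, fp in rows
}

for key, share in fp_share.items():
    print(key, round(share, 3))
```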
Fig. 3. Empirical power of causal SNPs in simulation study 2. Shown are results for logistic regression and r2VIM. Empirical power is shown for the scenario with OR = 1.3 (a) and with OR = 1.1 (b). Causal SNPs are ordered by LD pattern and MAF. Results for logistic regression are shown in blue, whereas red denotes power of r2VIM (stratified by factor value)
Fig. 4. Manhattan plots for TRINITY data set. a) P-values of logistic regression for each SNP. Dotted line denotes genome-wide significance level of 5 × 10⁻⁸. b) Minimal relative variable importance (VIM) based on RF analysis for each SNP
Fig. 5. Manhattan plots for AREDS data set. a) P-values of logistic regression for each SNP. b) Minimal relative variable importance (VIM) based on RF analysis for each SNP