Hongsheng Gui, Miaoxin Li, Pak C Sham, Stacey S Cherny.
Abstract
BACKGROUND: Though rooted in genomic expression studies, pathway analysis for genome-wide association studies (GWAS) has gained increasing popularity, since it has the potential to discover hidden disease pathogenic mechanisms by combining statistical methods with biological knowledge. Generally, recently proposed algorithms and programs can be categorized by type of input data, null hypothesis, or number of analysis stages. Because of the complexity of SNP, gene and pathway relationships, re-sampling strategies such as permutation are always utilized to derive an empirical distribution of the test statistic for evaluating the significance of candidate pathways. However, these algorithms need to be evaluated on real GWAS datasets and real biological pathway databases before they can be applied widely with confidence.
Year: 2011 PMID: 21981765 PMCID: PMC3199264 DOI: 10.1186/1756-0500-4-386
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Summary of selected Canonical pathways
| Pathway size by gene | | Pathway overlapping | | Gene mapping to pathway | |
|---|---|---|---|---|---|
| Range | Proportion | Range | Proportion (note 1) | Range | Proportion |
| ≥ 10, < 20 | 36% | 0 | 84% | Unique | 26.20% |
| ≥ 20, < 100 | 55% | > 0, < 0.01 | 2.10% | ≥ 2, < 10 | 58.50% |
| ≥ 100, < 200 | 7.70% | > 0.01, < 0.1 | 11.30% | ≥ 10, < 100 | 15.10% |
| ≥ 200, < 300 | 1.30% | > 0.1, < 0.5 | 2.50% | ≥ 100 | 0.20% |
| | | > 0.5 | 0.10% | | |
Note 1: all possible pairs among the 854 canonical pathways were considered; the ratio is defined as the number of shared genes divided by the number of distinct genes in a pair of pathways (union overlap).
A survey of 854 selected candidate pathways was conducted to investigate the characteristics of pathway size, pair-wise pathway overlap and gene allocation. Pathway size was measured by the number of genes contained in the pathway; pathway overlap was defined as the union overlap ratio for a pair of pathways; gene allocation was counted as the number of pathways each gene belongs to.
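The three survey quantities can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the toy pathway dictionary (including the gene symbols in it) are hypothetical, and the overlap ratio is taken to be shared genes divided by distinct genes, as stated in Note 1.

```python
from collections import Counter
from itertools import combinations


def pathway_sizes(pathways):
    """Pathway size: number of genes contained in each pathway."""
    return {name: len(genes) for name, genes in pathways.items()}


def union_overlap(genes_a, genes_b):
    """Shared genes divided by all distinct genes of the pair (union overlap)."""
    union = genes_a | genes_b
    if not union:
        return 0.0
    return len(genes_a & genes_b) / len(union)


def pairwise_overlaps(pathways):
    """Union overlap ratio for every unordered pair of pathways."""
    return {
        (a, b): union_overlap(pathways[a], pathways[b])
        for a, b in combinations(sorted(pathways), 2)
    }


def gene_allocation(pathways):
    """Number of pathways each gene belongs to."""
    counts = Counter()
    for genes in pathways.values():
        counts.update(genes)
    return counts


# Hypothetical toy input: pathway name -> set of gene symbols.
toy = {
    "P1": {"IL12B", "STAT4", "JAK2"},
    "P2": {"STAT4", "JAK2", "TYK2", "IL23R"},
    "P3": {"NOD2", "ATG16L1"},
}
sizes = pathway_sizes(toy)
overlaps = pairwise_overlaps(toy)
allocation = gene_allocation(toy)
```

Binning `sizes`, `overlaps` and `allocation` into the ranges of the table would give proportions of the kind reported there.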
Figure 1. QQ plot for SNP/gene p-values of a permuted CD dataset. SNPs were divided into inside-of-genes and outside-of-genes according to their physical coordinates on the hg18 genome. Gene p-values were calculated by KGG, using the GATES algorithm.
Type I error rate for seven algorithms
| Algorithms | Type I error | K-S test |
|---|---|---|
| GATES-Simes | 0.043 | < 2.2e-12 |
| GATES-Hyper | 0.016 | 5.4e-09 |
| Aligator | 0.032 | 2.6e-3 |
| GRASS | 0.018 | 5.1e-12 |
| GSEAforGWAS | 0.016 | < 2.2e-16 |
| PLINK-Ave | 0.055 | 3.1e-06 |
| PLINK-Max | 0.036 | 9.5e-05 |
Two indices (Type I error and K-S test) were used to check whether those algorithms produced more false positive results than expected by chance. Type I error was calculated as the proportion of pathways with nominal p-values < 0.05. The two-sided Kolmogorov-Smirnov test was used to investigate whether the p-values from each algorithm follow the theoretical (0, 1) uniform distribution.
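Both indices are simple to compute from a list of pathway p-values. The sketch below is an illustration under stated assumptions rather than the procedure used in the study: the function names are hypothetical, and the K-S p-value uses the standard asymptotic Kolmogorov series, which is an approximation (statistical packages may apply exact or corrected formulas and report slightly different values).

```python
import math


def type_one_error(p_values, alpha=0.05):
    """Proportion of pathways whose nominal p-value falls below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


def ks_uniform(p_values, tol=1e-12, max_terms=100000):
    """Two-sided K-S statistic and asymptotic p-value against Uniform(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    # Largest gap between the empirical CDF and the uniform CDF F(x) = x.
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
    # Asymptotic series: p = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 n d^2).
    total = 0.0
    for k in range(1, max_terms + 1):
        term = math.exp(-2.0 * k * k * n * d * d)
        total += term if k % 2 == 1 else -term
        if term < tol:
            break
    p = min(1.0, max(0.0, 2.0 * total))
    return d, p


# Hypothetical inputs: an evenly spread set and a set skewed towards zero.
uniform_like = [(i + 0.5) / 100 for i in range(100)]
skewed = [x * x for x in uniform_like]

d_uniform, p_uniform = ks_uniform(uniform_like)
d_skewed, p_skewed = ks_uniform(skewed)
```

A small K-S p-value, as in the table, indicates departure from uniformity in either direction, so it should be read together with the Type I error column.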
Figure 2. QQ-plots for GWAS pathway p-values of seven algorithms for permuted datasets. P-values for all 854 candidate pathways produced by each algorithm were plotted against their expected values from a (0, 1) uniform distribution.
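The coordinates of such a QQ plot can be computed as follows. This is a sketch under assumptions: the plotting position rank/(n + 1) and the -log10 scale are common conventions, not details stated in the caption, and `qq_points` is a hypothetical helper name. P-values of exactly zero would need special handling before taking logarithms.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs on the -log10 scale.

    Expected values are uniform(0, 1) quantiles rank / (n + 1),
    an assumed plotting position.
    """
    observed = sorted(p_values)
    n = len(observed)
    expected = [rank / (n + 1) for rank in range(1, n + 1)]
    return [(-math.log10(e), -math.log10(o)) for e, o in zip(expected, observed)]


# Toy input that sits exactly on the diagonal.
points = qq_points([0.5, 0.25, 0.75])
```

Points above the diagonal indicate p-values smaller than expected under the uniform null.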
Figure 3. QQ plot for SNP/gene p-values from the CD dataset. SNPs were divided into inside-of-genes and outside-of-genes according to their physical coordinates on the hg18 genome. Gene p-values were calculated by KGG, using the GATES algorithm.
Power indication from CD dataset
| Algorithm | No. of significant pathways (note 1) | No. of overlapping pathways (note 2) | Overlap p-value (note 3) | | |
|---|---|---|---|---|---|
| GATES-Simes | 4 | 3 | 1.20e-4 | 0.009 | 0.154 |
| GATES-Hyper | 0 | 0 | -- | 0.005 | 0.493 |
| Aligator | 4 | 2 | 0.006 | 0.595 | 1 |
| GRASS | 41 | 3 | 0.146 | 0.031 | 0.158 |
| GSEAforGWAS | 10 | 2 | 0.04 | 0.004 | 0.118 |
| PLINK-Ave | 40 | 8 | 1.77e-5 | 0.002 | 0.043 |
| PLINK-Max | 0 | 0 | -- | 0.006 | 0.155 |
Notes: 1, significance was defined as FDR for individual pathway smaller than 0.05; 2, no. of overlapping pathways between significant pathways in this study and previous known pathways for CD; 3, hyper-geometric test.
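The hyper-geometric test in note 3 can be illustrated with a short sketch. It assumes an upper-tail test, that is, the probability of observing at least the reported number of overlapping pathways when the significant pathways are drawn at random from all candidates; the source does not state the tail or the number of previously known CD pathways, so the function name and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, population, known, drawn):
    """P(X >= k) for a hyper-geometric variable.

    population: total candidate pathways
    known:      pathways previously linked to the disease
    drawn:      pathways called significant by an algorithm
    k:          observed overlap between the two sets
    """
    total = comb(population, drawn)
    upper = min(known, drawn)
    favourable = sum(
        comb(known, i) * comb(population - known, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Hypothetical small example: 10 pathways, 3 known, 4 significant, 1 overlap.
p_overlap = hypergeom_upper_tail(1, 10, 3, 4)
```

With zero significant pathways the overlap is necessarily zero and the test is uninformative, which is consistent with the "--" entries in the table.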
Running time summary
| Algorithm | Software | Input | Null Hypothesis | Computer configuration | Runtime |
|---|---|---|---|---|---|
| GATES-Simes | KGG (note 1) | Summary statistics | Self-contained | Intel Core 2 Quad CPU Q9400 2.67 GHz, | 30 mins (note 5) |
| GATES-Hyper | KGG | Summary statistics | Self-contained | As above | 30 mins (note 5) |
| Aligator | R-SNPath (note 2) | Summary statistics | Competitive | Intel XEON 2 six-core x5670 2.93 GHz, | 2 hours |
| GRASS | R-SNPath | Raw data | Self-contained | As above | 14 days |
| GSEAforGWAS | GenGen (note 3) | Raw data | Competitive | As above | 2 days |
| PLINK-Ave | PLINK (note 4) | Raw data | Self-contained | As above | 40 hours |
| PLINK-Max | PLINK | Raw data | Self-contained | As above | 40 hours |
Notes: 1, URL for KGG at http://bioinfo.hku.hk:13080/kggweb/home.htm; 2, URL for SNPath package at http://linchen.fhcrc.org/grass.html; 3, URL for GenGen program at http://www.openbioinformatics.org/gengen/; 4, URL for PLINK at http://pngu.mgh.harvard.edu/~purcell/plink/index.shtml; 5, excluding time spent building analysis genome (see KGG online manual).