Brooke L Fridley, Gregory D Jenkins, Joanna M Biernacka.
Abstract
Gene set methods aim to assess the overall evidence of association of a set of genes with a phenotype, such as disease or a quantitative trait. Multiple approaches for gene set analysis of expression data have been proposed. They can be divided into two types: competitive and self-contained. Self-contained methods can be used for genome-wide, candidate gene, or pathway studies, and have been reported to be more powerful than competitive methods. We therefore investigated ten self-contained methods that can be used for continuous, discrete and time-to-event phenotypes. To assess the power and type I error rate for the various previously proposed and novel approaches, an extensive simulation study was completed in which the scenarios varied according to: number of genes in a gene set, number of genes associated with the phenotype, effect sizes, correlation between expression of genes within a gene set, and the sample size. In addition to the simulated data, the various methods were applied to a pharmacogenomic study of the drug gemcitabine. Simulation results demonstrated that, overall, Fisher's method and the global model with random effects have the highest power for a wide range of scenarios, while the analyses based on the first principal component and the Kolmogorov-Smirnov test tended to have the lowest power. The methods investigated here are likely to play an important role in identifying pathways that contribute to complex traits.
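To make the p-value combination methods named in the abstract concrete, the following is a minimal sketch of Fisher's method and Stouffer's method using only the Python standard library. The function names are hypothetical and this is not the authors' implementation. The analytic null distributions used here assume independent p-values; because expression values within a gene set are typically correlated, a permutation-based null would likely be needed in practice, and this excerpt does not say how the authors handled that.

```python
import math
from statistics import NormalDist


def fisher_combined(pvalues):
    """Fisher's method: X = -2 * sum(ln p_i).

    Under independence, X follows a chi-square distribution with 2k degrees
    of freedom. All p-values must lie strictly between 0 and 1.
    Returns (statistic, combined p-value).
    """
    k = len(pvalues)
    half_x = -sum(math.log(p) for p in pvalues)  # this is X / 2
    # Closed-form chi-square survival function for even degrees of freedom 2k:
    # P(X >= x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_x / j
        total += term
    return 2.0 * half_x, math.exp(-half_x) * total


def stouffer_combined(pvalues):
    """Stouffer's method: Z = sum(Phi^{-1}(1 - p_i)) / sqrt(k).

    Under independence, Z is standard normal. All p-values must lie strictly
    between 0 and 1. Returns (statistic, one-sided combined p-value).
    """
    normal = NormalDist()
    k = len(pvalues)
    z = sum(normal.inv_cdf(1.0 - p) for p in pvalues) / math.sqrt(k)
    return z, 1.0 - normal.cdf(z)


# Illustrative, made-up per-gene p-values for a five-gene set.
example_pvalues = [0.01, 0.20, 0.03, 0.50, 0.04]
fisher_stat, fisher_p = fisher_combined(example_pvalues)
stouffer_stat, stouffer_p = stouffer_combined(example_pvalues)
```

Fisher's statistic is dominated by the smallest p-values, whereas Stouffer's statistic averaging on the z-scale rewards many moderate signals, which is one way to read the difference in power between the two methods across scenarios.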
Year: 2010 PMID: 20862301 PMCID: PMC2941449 DOI: 10.1371/journal.pone.0012693
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Mean type 1 error and mean power for all gene set methods, averaged across all null (mean type 1 error) and non-null (mean power) simulation scenarios, for sample sizes of 20, 100, and 500.
| Type of Method | Gene Set Method | Mean Type 1 Error, N = 20 | Mean Type 1 Error, N = 100 | Mean Type 1 Error, N = 500 | Mean Power, N = 20 | Mean Power, N = 100 | Mean Power, N = 500 |
|---|---|---|---|---|---|---|---|
| Based on combining individual SNP p-values | Kolmogorov-Smirnov (KS) | 0.047 | 0.048 | 0.053 | 0.533 | 0.751 | 0.831 |
| Based on combining individual SNP p-values | Fisher's Method (FM) | 0.048 | 0.050 | 0.052 | 0.608 | 0.894 | 0.981 |
| Based on combining individual SNP p-values | Stouffer's Method (SM) | 0.048 | 0.049 | 0.051 | 0.571 | 0.825 | 0.937 |
| Based on combining individual SNP p-values | Tail Strength (TS) | 0.048 | 0.049 | 0.052 | 0.573 | 0.807 | 0.876 |
| Based on combining individual SNP p-values | Modified Tail Strength (MTS) | 0.050 | 0.048 | 0.048 | 0.549 | 0.798 | 0.929 |
| Based on modeling the data | Global model using fixed effects (GMFE) | 0.053 | 0.045 | 0.051 | 0.639 | 0.907 | 0.985 |
| Based on modeling the data | Global model using random effects (GMRE) | 0.048 | 0.050 | 0.053 | 0.604 | 0.900 | 0.984 |
| Based on modeling the data | PCA using 1st principal component (PCA1) | 0.050 | 0.050 | 0.049 | 0.537 | 0.717 | 0.821 |
| Based on modeling the data | PCA using 1–5 principal components (PCA1.5) | 0.046 | 0.050 | 0.050 | 0.543 | 0.800 | 0.925 |
| Based on modeling the data | PCA using principal components that explain 80% (PCA80) | 0.049 | 0.047 | 0.052 | 0.489 | 0.861 | 0.975 |
*GMFE could not be applied in 27 out of 36, 18 out of 39, and 9 out of 39 simulation scenarios used to assess type 1 error at sample sizes of 20, 100, and 500, respectively. GMFE could not be applied in 594 out of 720, 396 out of 774, and 198 out of 774 power scenarios with sample sizes of 20, 100, and 500, respectively.
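As a rough illustration of how type 1 error and power values like those in the table can be estimated, the sketch below simulates a gene set repeatedly and records how often Fisher's method rejects at the 0.05 level. It is not the authors' simulation design. All names and parameter values are hypothetical, genes are simulated as independent standard normal variables (no within-set correlation), per-gene p-values come from the Fisher z approximation to the Pearson correlation test, and the combined p-value uses the chi-square null that holds only under independence.

```python
import math
import random
from statistics import NormalDist

_NORMAL = NormalDist()


def fisher_pvalue(pvalues):
    """Fisher's method p-value, assuming independent p-values in (0, 1]."""
    k = len(pvalues)
    half_x = -sum(math.log(p) for p in pvalues)
    # Chi-square survival function with 2k degrees of freedom (closed form).
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_x / j
        total += term
    return math.exp(-half_x) * total


def correlation_pvalue(x, y):
    """Two-sided p-value for a Pearson correlation (Fisher z approximation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    z = math.atanh(r) * math.sqrt(n - 3)
    p = 2.0 * (1.0 - _NORMAL.cdf(abs(z)))
    return max(p, 1e-300)  # guard against log(0) for extreme correlations


def rejection_rate(n_samples, n_genes, n_associated, beta,
                   n_sims=200, alpha=0.05, seed=0):
    """Fraction of simulated data sets in which the gene set test rejects.

    With n_associated = 0 this estimates type 1 error; otherwise it
    estimates power for the chosen effect size beta.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        genes = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)]
                 for _ in range(n_genes)]
        # Phenotype depends on the first n_associated genes plus unit noise.
        phenotype = [
            sum(beta * genes[g][i] for g in range(n_associated))
            + rng.gauss(0.0, 1.0)
            for i in range(n_samples)
        ]
        pvals = [correlation_pvalue(gene, phenotype) for gene in genes]
        if fisher_pvalue(pvals) <= alpha:
            rejections += 1
    return rejections / n_sims


type1_estimate = rejection_rate(n_samples=100, n_genes=10, n_associated=0, beta=0.0)
power_estimate = rejection_rate(n_samples=100, n_genes=10, n_associated=5, beta=0.3)
```

Averaging such rejection rates over many null scenarios or many non-null scenarios gives summary figures of the same kind as the mean type 1 error and mean power columns, although the numbers from this toy setup are not comparable to the published ones.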
Figure 1. Pairwise scatterplot of power for the various methods for scenarios with standard deviation (σ) of 6.0.
Figure 2. Plots of power for all methods.
Power is plotted as a function of (A) sample size, (B) the correlation between expression values within the gene set (ρ), (C) the proportion of probes associated with the phenotype, and (D) the calculated R², the proportion of variation in the quantitative phenotype explained by the gene expression values in the pathway. The average power values are based on all simulated non-null scenarios. Plot (B) excludes scenarios with between-probe correlation structure defined by the gemcitabine pathway, and only shows fixed-correlation scenarios (ρ = 0, 0.1, 0.3). Plots (B), (C), and (D) are based on a sample size of 100. Similar plots for sample sizes of 20 and 500 are shown in Figure S1. For plots (C) and (D), a kernel smoother was used to fit a curve to the data. Scenarios with all expression probes being associated with the trait were excluded from plot (C), as all the methods had very high power in this situation.
Figure 3. Power of Fisher's Method (FM) as a function of sample size, correlation of expression values between probes (ρ), and R² (proportion of variation in the quantitative phenotype explained by the gene expression values in the gene set).
Results from the analysis of the gemcitabine pathway, the glutathione pathway, and a null gene set using the various gene set methods.
| Type of Method | Gene Set Method | Glutathione pathway p-value | Gemcitabine pathway p-value | Null gene set p-value |
|---|---|---|---|---|
| Based on combining individual SNP p-values | Kolmogorov-Smirnov (KS) | 0.0178 | 0.250 | 0.447 |
| Based on combining individual SNP p-values | Fisher's Method (FM) | 0.0121 | 0.016 | 0.126 |
| Based on combining individual SNP p-values | Stouffer's Method (SM) | 0.0241 | 0.158 | 0.272 |
| Based on combining individual SNP p-values | Tail Strength (TS) | 0.0211 | 0.160 | 0.344 |
| Based on combining individual SNP p-values | Modified Tail Strength (MTS) | 0.0371 | 0.267 | 0.278 |
| Based on modeling the data | Global model using fixed effects (GMFE) | 2.27×10⁻⁵ | 0.032 | 0.004 |
| Based on modeling the data | Global model using random effects (GMRE) | 0.0137 | 0.012 | 0.780 |
| Based on modeling the data | PCA using 1st principal component (PCA1) | 0.3610 | 0.341 | 0.668 |
| Based on modeling the data | PCA using 1–5 principal components (PCA1.5) | 0.2920 | 0.002 | 0.518 |
| Based on modeling the data | PCA using principal components that explain 80% (PCA80) | 0.00396 | 0.022 | 0.050 |