H Robert Frost, Zhigang Li, Jason H Moore.
Abstract
BACKGROUND: Gene set testing is typically performed in a supervised context to quantify the association between groups of genes and a clinical phenotype. In many cases, however, a gene set-based interpretation of genomic data is desired in the absence of a phenotype variable. Although methods exist for unsupervised gene set testing, they predominantly compute enrichment relative to clusters of the genomic variables, with performance strongly dependent on the clustering algorithm and number of clusters.
Year: 2015 PMID: 25879888 PMCID: PMC4365810 DOI: 10.1186/s12859-015-0490-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Simulation results. Results for the simulation studies detailed in Section “Evaluation using simulated gene sets and simulated data”. For all plots, error bars represent ±1 SE for the mean value over all 1000 simulated datasets. a)-c) Results for the type I error simulation study based on MVN data generated with an identity population covariance matrix. This model is consistent with H0. d)-f) Results for the power simulation study based on MVN data generated according to a single-factor population covariance matrix. Under this model, an association exists between the first gene set and PC 1. g)-i) Results for the power simulation study based on MVN data generated according to a two-factor population covariance matrix. Under this model, an association exists between the first gene set and PCs 1 and 2. a), d) and g) Mean p-values computed using the PCGSE method for the first simulated gene set relative to the first 5 PCs. b), e) and h) Mean weights used by the SGSE method to combine the PCGSE-computed p-values for each gene set relative to the first 5 PCs. PC variance weights are shown as round points connected by a solid line. PC variance scaled by the lower-tailed p-value computed using the Tracy-Widom distribution for the PC variance is shown using square points connected by a dashed line. c), f) and i) Quantile-quantile plots of the p-values computed using the SGSE method, with either PC variance weights (Var.) or weights defined by the PC variance scaled by the lower-tailed Tracy-Widom p-value of the PC variance (TW*Var.), and using the benchmark method that applies a chi-squared test between cluster membership and gene set membership (Chisq).
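The weighting step described for panels b), e) and h) can be illustrated with a small sketch. The caption does not say which rule SGSE uses to combine the per-PC p-values, so the weighted Z-method (Stouffer's method with weights) is used here purely as an assumed stand-in. The function name `combine_weighted_z` and all numeric inputs are hypothetical, and the Tracy-Widom lower-tailed p-values are taken as given rather than computed.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def combine_weighted_z(p_values, weights):
    """Combine one-sided p-values with the weighted Z-method.

    ASSUMPTION: the weighted Z-method is an illustrative choice; the
    caption only says that weights are used to combine PCGSE p-values.
    """
    if not p_values or len(p_values) != len(weights):
        raise ValueError("p_values and weights must be non-empty and equal in length")
    eps = 1e-15  # keep p strictly inside (0, 1) so inv_cdf is defined
    z_scores = [
        _STD_NORMAL.inv_cdf(1.0 - min(max(p, eps), 1.0 - eps)) for p in p_values
    ]
    denominator = sum(w * w for w in weights) ** 0.5
    if denominator == 0.0:
        raise ValueError("at least one weight must be non-zero")
    combined_z = sum(w * z for w, z in zip(weights, z_scores)) / denominator
    return 1.0 - _STD_NORMAL.cdf(combined_z)


# Hypothetical inputs for one gene set relative to the first 3 PCs.
pcgse_p = [0.01, 0.40, 0.75]      # made-up PCGSE p-values
pc_variance = [4.0, 2.0, 1.0]     # made-up PC variances
tw_lower_p = [1.0, 0.9, 0.1]      # made-up Tracy-Widom lower-tailed p-values

# Weighting scheme 1: PC variance only ("Var.").
p_var = combine_weighted_z(pcgse_p, pc_variance)

# Weighting scheme 2: PC variance scaled by the Tracy-Widom p-value ("TW*Var.").
tw_weights = [v * t for v, t in zip(pc_variance, tw_lower_p)]
p_tw_var = combine_weighted_z(pcgse_p, tw_weights)

print(p_var, p_tw_var)
```

Under this assumed rule, a PC with a weight of zero drops out of the combined statistic entirely, which is the intended effect of down-weighting PCs whose variance is not distinguishable from noise.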
Figure 2. Leukemia gene expression results. Scatter plots showing the association between phenotype gene set enrichment p-values and unsupervised gene set enrichment p-values computed using the benchmark cluster-based method (plot a)) and SGSE (plots b) and c)) for the Armstrong et al. [44] leukemia gene expression data, AML/ALL phenotype, and MSigDB C2 v4.0 gene sets. Phenotype enrichment, unsupervised cluster-based enrichment and spectral gene set enrichment p-values were computed as outlined in Section “Evaluation using MSigDB C2 v4.0 gene sets and Armstrong et al. leukemia gene expression data”. Each plot displays the Spearman correlation coefficient between phenotype and unsupervised gene set enrichment p-values, together with the positive predictive value of unsupervised gene set enrichment for identifying gene sets that are significantly enriched relative to the phenotype at α = 0.1 (shown by dotted lines). The results from the two SGSE weighting methods outlined in Section “Combined significance of PCGSE p-values” are shown in plots b) and c): b) plots SGSE p-values generated using PC variance weighting, and c) plots SGSE p-values generated using weights defined by the PC variance scaled by the lower-tailed Tracy-Widom p-value for the variance.
Figure 3. DLBCL gene expression results. Scatter plots showing the association between phenotype gene set enrichment p-values and unsupervised gene set enrichment p-values computed using the benchmark cluster-based method (plot a)) and SGSE (plots b) and c)) for the Rosenwald et al. [45] DLBCL gene expression data, log survival time phenotype, and MSigDB C2 v4.0 gene sets. Phenotype enrichment, unsupervised cluster-based enrichment and spectral gene set enrichment p-values were computed as outlined in Section “Evaluation using Rosenwald et al. DLBCL gene expression data and MSigDB C2 v4.0 gene sets”. Each plot displays the Spearman correlation coefficient between phenotype and unsupervised gene set enrichment p-values, together with the positive predictive value of unsupervised gene set enrichment for identifying gene sets that are significantly enriched relative to the phenotype at α = 0.1 (shown by dotted lines). The results from the two SGSE weighting methods outlined in Section “Combined significance of PCGSE p-values” are shown in plots b) and c): b) plots SGSE p-values generated using PC variance weighting, and c) plots SGSE p-values generated using weights defined by the PC variance scaled by the lower-tailed Tracy-Widom p-value for the variance.
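The two summary statistics reported in the leukemia and DLBCL scatter plots can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the example p-values are hypothetical, ties are handled with average ranks, and "significant at α = 0.1" is assumed to mean p ≤ 0.1.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)


def positive_predictive_value(phenotype_p, unsupervised_p, alpha=0.1):
    """Fraction of gene sets called by the unsupervised method (p <= alpha)
    that are also significant relative to the phenotype (p <= alpha)."""
    called = [i for i, p in enumerate(unsupervised_p) if p <= alpha]
    if not called:
        return float("nan")  # no unsupervised calls, so PPV is undefined
    true_positives = sum(1 for i in called if phenotype_p[i] <= alpha)
    return true_positives / len(called)


# Hypothetical p-values for four gene sets.
phenotype_p = [0.01, 0.05, 0.5, 0.8]
unsupervised_p = [0.02, 0.3, 0.04, 0.9]

print(spearman(phenotype_p, unsupervised_p))
print(positive_predictive_value(phenotype_p, unsupervised_p))
```

In the plots, the dotted lines at α = 0.1 split each panel into quadrants; the positive predictive value corresponds to the share of points left of the unsupervised threshold that also fall below the phenotype threshold.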